Getting Started with Scrapy

2022-01-16 17:40:23

1. Installation

pip install Scrapy

2. Create a project

scrapy startproject myspider

Once the project has been created, its directory structure looks like this:

myspider/
    scrapy.cfg            # deploy configuration file

    myspider/             # project's Python module, you'll import your code from here
        __init__.py

        items.py          # project items definition file

        middlewares.py    # project middlewares file

        pipelines.py      # project pipelines file

        settings.py       # project settings file

        spiders/          # a directory where you'll later put your spiders
            __init__.py

3. Create a spider

cd myspider
scrapy genspider example example.com

This generates a spider file, example.py, under myspider/myspider/spiders/. Its contents look like this:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        pass

allowed_domains lists the domains the spider is allowed to crawl, and start_urls lists the URLs where the crawl begins.

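You can define your own parsing logic in the parse() method and use yield to pass each parsed item on to the item pipelines in pipelines.py. A pipeline only receives items once it has been enabled through the ITEM_PIPELINES setting in settings.py.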

4. Run the spider to crawl data

scrapy crawl example