Python Crawler Notes (7): Reading the Scrapy Documentation (1): Basic Usage of Scrapy


1. Create a new crawler project:

scrapy startproject tutorial

The generated project directory looks like this:

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py

2. Write a custom spider

A custom spider must subclass scrapy.Spider and be placed in the spiders directory:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            # Each request is downloaded, then its response is passed to self.parse
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # The URL ends with ".../page/<n>/", so index -2 is the page number
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
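The filename logic in parse can be checked without running a crawl. The sketch below is plain Python; the helper name page_filename is made up for illustration and is not part of Scrapy.

```python
def page_filename(url: str) -> str:
    """Build the output filename the same way the parse callback does."""
    # 'http://quotes.toscrape.com/page/1/'.split("/") ends with
    # [..., 'page', '1', ''], so index -2 is the page number
    # (the trailing slash produces the empty last element).
    page = url.split("/")[-2]
    return "quotes-%s.html" % page


print(page_filename("http://quotes.toscrape.com/page/1/"))
```

This depends on the URL ending with a slash. Without the trailing slash, index -2 would pick up 'page' instead of the page number.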

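A spider class used by Scrapy has a few special attributes and methods:

name: a unique identifier for the spider; two spiders in the same project must not share a name.
start_requests(): returns an iterable of Request objects (a list or a generator) from which the spider starts crawling.
parse(): handles the page downloaded for each request. The response argument is the reply that came back. parse is usually used to extract data into dicts, find new URLs, and create new requests from them.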

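3. Run the spider

Go to the project's top-level directory (the outer tutorial/ directory created at the start of this post) and enter this command on the command line: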

scrapy crawl quotes

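This command runs the spider whose name is quotes.

In the example above, start_requests yields scrapy.Request objects. Each time a reply is received, Scrapy builds a response object and calls the parse method to process it.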
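The request/callback cycle described above can be sketched as a toy model in plain Python. Everything here (ToyRequest, ToyResponse, ToySpider, toy_download, toy_crawl) is a hypothetical stand-in written for this note, not Scrapy's actual engine, and no network access takes place.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyRequest:
    url: str
    callback: Callable


@dataclass
class ToyResponse:
    url: str
    body: bytes


def toy_download(url: str) -> bytes:
    # Stand-in for the real HTTP download; returns fake page content.
    return ("<html>%s</html>" % url).encode("utf-8")


class ToySpider:
    name = "quotes"

    def start_requests(self):
        urls = [
            "http://quotes.toscrape.com/page/1/",
            "http://quotes.toscrape.com/page/2/",
        ]
        for url in urls:
            yield ToyRequest(url=url, callback=self.parse)

    def parse(self, response):
        # Return the page number so the caller can see which page was handled.
        return response.url.split("/")[-2]


def toy_crawl(spider):
    handled = []
    for request in spider.start_requests():
        # One response object is built per request, then handed to the callback.
        response = ToyResponse(url=request.url, body=toy_download(request.url))
        handled.append(request.callback(response))
    return handled


print(toy_crawl(ToySpider()))
```

The toy loop is sequential for simplicity. Scrapy itself schedules requests asynchronously, so callbacks are not guaranteed to run in the order the requests were yielded.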

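Instead of defining start_requests, you can list the URLs to crawl directly in a class attribute named start_urls. parse is then used as the default callback: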

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)

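4. Extract data

Run the following command to open the Scrapy shell, where you can try out selectors for extracting data (there is no need to run our custom spider):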

scrapy shell "http://quotes.toscrape.com/page/1/"

The output looks like this:

[ … Scrapy log here … ]
2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fa91d888c90>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7fa91d888c10>
[s]   spider     <DefaultSpider 'default' at 0x7fa91c8af990>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>>

The documentation demonstrates the following calls:

# Select the title element; returns a list of selectors
>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

# Extract the text of the title element; returns a list
>>> response.css('title::text').extract()
['Quotes to Scrape']

# Extract the text of the title element, returning only the first result
>>> response.css('title::text').extract_first()
'Quotes to Scrape'

# Select the div elements whose class is quote
>>> response.css("div.quote")

# Select the a element inside the li element whose class is next, and extract the value of its href attribute
>>> response.css('li.next a::attr(href)').extract_first()
'/page/2/'

# Use regular expressions
>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']

# Open the HTML page in a browser
>>> view(response)

# Use XPath
>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').extract_first()
'Quotes to Scrape'

XPath is the foundation of Scrapy selectors; CSS selectors are converted to XPath under the hood.
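The regular expressions from the shell session can also be tried with Python's standard re module, without Scrapy. This is a small sketch that assumes the patterns are meant to use the \w word-character class and that the title text is 'Quotes to Scrape', as the session shows.

```python
import re

title = "Quotes to Scrape"

# Match from "Quotes" to the end of the string.
print(re.findall(r"Quotes.*", title))

# Match "Q" followed by one or more word characters.
print(re.findall(r"Q\w+", title))

# Two capture groups: the word before " to " and the word after it.
print(re.findall(r"(\w+) to (\w+)", title))
```

One difference to keep in mind: re.findall returns one tuple per match when a pattern has several groups, whereas the selector's .re() method in the session above returns the captured groups as a flat list.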

The above extracts data from the shell. Data can also be extracted inside the parse method, where Scrapy uses the yield keyword:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

Running this spider produces output like the following:

2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}

The example above only returns the extracted data. parse can also return the next URL to crawl, again using the yield keyword:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

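Because next_page may be a relative path, urljoin is used to turn it into an absolute URL. The callback argument of scrapy.Request names the callback function: once the URL has been requested, that method is called with the response.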
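The relative-to-absolute step can be illustrated with the standard library's urllib.parse.urljoin. This is only a sketch of the resolution rule; the base URL stands in for response.url, and the snippet does not call Scrapy.

```python
from urllib.parse import urljoin

# Stand-in for response.url of the page currently being parsed.
base = "http://quotes.toscrape.com/page/1/"

# A root-relative href, like the one extracted from the "next" link.
print(urljoin(base, "/page/2/"))

# An href that is already absolute is returned unchanged.
print(urljoin(base, "http://quotes.toscrape.com/page/3/"))
```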

The code above has to convert the relative URL into an absolute one itself. The response.follow method handles this automatically:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

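If several URLs need to be followed, the URL-extraction code above can be changed to loop over the selectors and pass each one to response.follow: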

for href in response.css('li.next a::attr(href)'):
    yield response.follow(href, callback=self.parse)

For a elements, response.follow automatically uses the href attribute, so the code above can be simplified further:

for a in response.css('li.next a'):
    yield response.follow(a, callback=self.parse)

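response.follow handles a single URL per call; it cannot take a batch of URLs at once, which is why the examples above loop over the selectors (Scrapy 2.0 and later also provide response.follow_all for that case).

Scrapy filters out duplicate requests by default, so you do not need to worry about crawling the same URL twice. To pass arguments to a custom spider, add the -a option to the command: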

scrapy crawl quotes -o quotes-humor.json -a tag=humor

These arguments are passed to the spider's __init__ method and, by default, become attributes of the spider.
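How a -a argument ends up as an attribute can be sketched in plain Python. ToySpider below is a hypothetical stand-in, not scrapy.Spider; it only models the default behaviour of storing keyword arguments as attributes. The start_url helper and the tag/<name> URL layout are assumptions made for this illustration.

```python
class ToySpider:
    """Stand-in for scrapy.Spider; models only how -a arguments are stored."""

    name = "quotes"

    def __init__(self, **kwargs):
        # Each "-a key=value" pair arrives as a keyword argument (a string)
        # and is stored as an instance attribute.
        self.__dict__.update(kwargs)

    def start_url(self):
        url = "http://quotes.toscrape.com/"
        # tag is None when the spider was started without "-a tag=...".
        tag = getattr(self, "tag", None)
        if tag is not None:
            url = url + "tag/" + tag
        return url


print(ToySpider(tag="humor").start_url())
print(ToySpider().start_url())
```

Reading the attribute with getattr and a default keeps the spider usable both with and without the argument.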

5. Store the data

The simplest way is to use the following command:

scrapy crawl quotes -o quotes.json

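The scraped data is serialized as JSON and stored in the file. For historical reasons, Scrapy appends to an existing file rather than overwriting it, so running this command twice without deleting the file in between leaves you with a broken JSON file.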
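Why appending breaks the file can be reproduced with the standard json module alone. The sketch below writes the same JSON array twice in append mode and then tries to parse the result. For comparison it does the same with a one-object-per-line layout (JSON Lines), which stays readable after appending. The helper names and the sample item are made up for this illustration.

```python
import json
import os
import tempfile


def append_json_array(path, items):
    # Append a complete JSON array to the end of the file.
    with open(path, "a", encoding="utf-8") as f:
        json.dump(items, f)


def is_valid_json(path):
    with open(path, encoding="utf-8") as f:
        try:
            json.load(f)
        except json.JSONDecodeError:
            return False
    return True


def append_json_lines(path, items):
    # Append one JSON object per line.
    with open(path, "a", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item) + "\n")


def read_json_lines(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def demo():
    items = [{"author": "Thomas A. Edison", "tags": ["failure"]}]
    with tempfile.TemporaryDirectory() as tmp:
        json_path = os.path.join(tmp, "quotes.json")
        append_json_array(json_path, items)
        valid_after_first_run = is_valid_json(json_path)
        append_json_array(json_path, items)  # second "run" appends another array
        valid_after_second_run = is_valid_json(json_path)

        lines_path = os.path.join(tmp, "quotes.jl")
        append_json_lines(lines_path, items)
        append_json_lines(lines_path, items)
        rows = read_json_lines(lines_path)
    return valid_after_first_run, valid_after_second_run, len(rows)


print(demo())
```

After the second append the file holds two arrays back to back, which is not a single valid JSON document, while the line-based file simply gains more records.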

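For simple projects, the storage command above (the file name and file type can be changed) is usually enough. For more complex projects, consider using an Item Pipeline, defined in pipelines.py.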
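A minimal sketch of what such a pipeline might look like follows. The class name JsonLinesPipeline, the default file name items.jl, and the run_pipeline_demo helper are hypothetical; the method names open_spider, process_item and close_spider follow the Item Pipeline interface described in the Scrapy documentation. The class itself needs no Scrapy import, so the demo drives it by hand.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical pipeline: writes each item as one JSON object per line."""

    def __init__(self, path="items.jl"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called for every item the spider yields; the item is returned
        # so that later pipelines can process it as well.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()


def run_pipeline_demo():
    """Drive the pipeline by hand, roughly as a crawl would."""
    items = [{"author": "André Gide", "tags": ["life", "love"]}]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "items.jl")
        pipeline = JsonLinesPipeline(path)
        pipeline.open_spider(spider=None)
        for item in items:
            pipeline.process_item(item, spider=None)
        pipeline.close_spider(spider=None)
        with open(path, encoding="utf-8") as f:
            return [json.loads(line) for line in f]


print(run_pipeline_demo())
```

In a real project the class would live in pipelines.py and be enabled through the ITEM_PIPELINES setting in settings.py, for example with an entry such as "tutorial.pipelines.JsonLinesPipeline": 300, where the module path assumes the tutorial project from this post.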


Posted in: Python Crawlers

2022-10-25
