Python Web Scraping Study Notes (10, Part 3): Scraping Data from Different Pages with Scrapy


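Sometimes the data we want to scrape is not all on one page, so we cannot simply request a single URL and parse the result. Take the dytt site as an example: suppose we want the name of every movie in the "国内电影" (domestic movies) section, plus the image that appears after clicking through to each movie, which lives on a different web page. How do we put these two pieces of data into the same item object?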
1. Create the Scrapy project

In the PyCharm terminal, run the following commands in order:

scrapy startproject dytt_movie

cd dytt_movie\dytt_movie

scrapy genspider movie <site address>

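Note: the address after genspider does not need the http:// prefix or a trailing slash, and a URL that ends in .html must never have a '/' after it. Finally, check whether the site publishes a robots.txt policy or uses any anti-scraping measures.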
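As a small sketch of the note above, the standard library can strip the scheme and trailing slash from a full URL and also return the bare host name. The function name split_site_url and the www.example.com URL are illustrative assumptions, not part of the original project.

```python
from urllib.parse import urlparse


def split_site_url(url: str) -> tuple[str, str]:
    """Return (address without scheme or trailing slash, bare host name)."""
    parsed = urlparse(url)
    # netloc + path drops the scheme; rstrip removes any trailing slash.
    address = (parsed.netloc + parsed.path).rstrip("/")
    # hostname excludes any port and is lower-cased by urlparse.
    host = parsed.hostname or ""
    return address, host


# Placeholder URL, standing in for the real list page.
address, domain = split_site_url("https://www.example.com/html/gndy/china/index.html/")
print(address)
print(domain)
```

The first value is the form described in the note; the second is the form that suits allowed_domains.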

2. Define the data structure in items.py

We need two fields: the movie's name and the URL of its image.

    class DyttMovieItem(scrapy.Item):
        name = scrapy.Field()
        src = scrapy.Field()

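3. Locate and scrape the data

Inspecting the page source shows that each movie's link tag contains both the movie name we need and the URL of the other page where its image lives. We need to extract both.

Use XPath to locate this information. Note that the URL of the page holding the image is relative, so it has to be joined with the site's domain.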

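    def parse(self, response):
        # Each matched <a> holds a movie name and a link to its detail page.
        a_list = response.xpath('//div[@class="co_content8"]//td[2]//a[2]')

        for a in a_list:
            name = a.xpath('./text()').extract_first()
            # Placeholder prefix: fill in the site's root (scheme + domain).
            url = '<enter the site domain>' + a.xpath('./@href').extract_first()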
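Manual concatenation works when every href is site-relative. As a hedged alternative, the standard library's urljoin resolves an href against the URL of the page it came from and also copes with hrefs that are already absolute. Inside a Scrapy callback, response.urljoin(href) does the same against response.url. All URLs below are placeholders.

```python
from urllib.parse import urljoin

# Placeholder for the list page that the hrefs were found on.
page_url = "https://www.example.com/html/gndy/china/index.html"

hrefs = [
    "/html/gndy/dyzz/20221025/1.html",   # site-relative
    "list_4_2.html",                     # relative to the current directory
    "https://www.example.com/a.html",    # already absolute
]

absolute = [urljoin(page_url, href) for href in hrefs]
for url in absolute:
    print(url)
```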

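4. Request the second-page link

We already have the URL. Next we need to request it, so that we can extract the image address from that page.

This is where the yield keyword comes in,

together with scrapy.Request(), which sends a GET request.

We also write our own parse_second as the callback (a GET request is issued for every URL we obtain).

Be sure to widen allowed_domains to the site's bare domain. Otherwise the second page cannot be reached, because Scrapy filters the request as off-site.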
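To see why a too-narrow allowed_domains entry blocks the second page, here is a simplified model of an off-site check. It is a sketch under stated assumptions, not Scrapy's actual code: a URL counts as on-site when its host equals an allowed entry or is a subdomain of one. The name is_onsite and the example.com URLs are hypothetical.

```python
from urllib.parse import urlparse


def is_onsite(url: str, allowed_domains: list[str]) -> bool:
    """Toy off-site check: host must equal an entry or be its subdomain."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


# Placeholder detail-page URL.
detail_url = "https://www.example.com/html/gndy/dyzz/20221025/1.html"

# An entry that still carries the list page's path never equals a host name.
too_narrow = ["www.example.com/html/gndy/china/index.html"]
# The bare domain matches every page on the site.
bare_domain = ["www.example.com"]

print(is_onsite(detail_url, too_narrow))
print(is_onsite(detail_url, bare_domain))
```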

    def parse(self, response):
        a_list = response.xpath('//div[@class="co_content8"]//td[2]//a[2]')

        for a in a_list:
            name = a.xpath('./text()').extract_first()
            url = '<enter the site domain>' + a.xpath('./@href').extract_first()

            # Request the detail page; parse_second handles its response.
            yield scrapy.Request(url=url, callback=self.parse_second)

    def parse_second(self, response):
        # Temporary marker to confirm that the callback is reached.
        print('999999999999')

Next comes parsing the image address (the details are omitted here).

    def parse_second(self, response):
        src = response.xpath('//div[@id="Zoom"]//img/@src').extract_first()
        print(src)

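One problem remains: each image src has to be paired with its movie name so that the two are collected together. This is what the meta argument is for.

Set the meta argument on the request to pass along the values that are needed.

In parse_second, read the value back with response.meta['name'].

Create the movie object in the second parse function

and yield it to the pipeline.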

    def parse(self, response):
        a_list = response.xpath('//div[@class="co_content8"]//td[2]//a[2]')

        for a in a_list:
            name = a.xpath('./text()').extract_first()
            url = '<enter the site domain>' + a.xpath('./@href').extract_first()

            # meta carries the name over to the callback for the detail page.
            yield scrapy.Request(url=url, callback=self.parse_second, meta={'name': name})

    def parse_second(self, response):
        src = response.xpath('//div[@id="Zoom"]//img/@src').extract_first()
        name = response.meta['name']

        movie = DyttMovieItem(name=name, src=src)
        yield movie
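The hand-off between the two callbacks can be illustrated without Scrapy at all. The following is a toy model, not Scrapy's engine: FakeRequest, FakeResponse, crawl, and the in-memory "site" are all hypothetical stand-ins, used only to show how a yielded request, its callback, and its meta dict fit together.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical in-memory site: list page entries and detail page images.
LIST_PAGE = [
    ("Movie A", "https://www.example.com/a.html"),
    ("Movie B", "https://www.example.com/b.html"),
]
DETAIL_PAGES = {
    "https://www.example.com/a.html": "https://img.example.com/a.jpg",
    "https://www.example.com/b.html": "https://img.example.com/b.jpg",
}


@dataclass
class FakeRequest:
    url: str
    callback: Callable
    meta: dict = field(default_factory=dict)


@dataclass
class FakeResponse:
    url: str
    meta: dict


def parse(response):
    # First page: one request per movie, with the name stored in meta.
    for name, url in LIST_PAGE:
        yield FakeRequest(url=url, callback=parse_second, meta={"name": name})


def parse_second(response):
    # Second page: combine the src found here with the name from meta.
    src = DETAIL_PAGES[response.url]
    yield {"name": response.meta["name"], "src": src}


def crawl(start_url):
    """Run callbacks until no requests remain; return the collected items."""
    queue = list(parse(FakeResponse(url=start_url, meta={})))
    items = []
    while queue:
        obj = queue.pop(0)
        if isinstance(obj, FakeRequest):
            # "Download" the page, then hand the response to the callback.
            response = FakeResponse(url=obj.url, meta=obj.meta)
            queue.extend(obj.callback(response))
        else:
            items.append(obj)
    return items


items = crawl("https://www.example.com/list.html")
print(items)
```

Each item ends up with a name from the first page and a src from the second, which is exactly the pairing that meta makes possible.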

5. Save the data through the pipeline

First enable the pipeline, then save the data inside the pipeline.
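Enabling the pipeline is done in the project's settings.py through the ITEM_PIPELINES setting. A minimal fragment, assuming the default class path that startproject generates for a project named dytt_movie; the number is the pipeline's priority (customarily 0 to 1000, lower values run first).

```python
# settings.py (fragment)
# Assumes the pipeline class lives in dytt_movie/pipelines.py.
ITEM_PIPELINES = {
    "dytt_movie.pipelines.DyttMoviePipeline": 300,
}
```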

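    class DyttMoviePipeline:

        def open_spider(self, spider):
            # Open the output file once, when the spider starts.
            self.f = open('movie.json', 'w', encoding='utf-8')

        def process_item(self, item, spider):
            # Writes the item's string form, so the file is not strict JSON.
            self.f.write(str(item))
            return item

        def close_spider(self, spider):
            self.f.close()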
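If machine-readable output is wanted, one possible variant writes one JSON object per line instead of str(item). This is a sketch, not the author's pipeline: the class name JsonLinesMoviePipeline, the path parameter, and the sample item are assumptions. The usage lines below drive the three hooks by hand with a plain dict in place of a real item.

```python
import json
import os
import tempfile


class JsonLinesMoviePipeline:
    """Hypothetical variant: one JSON object per line (JSON Lines)."""

    def __init__(self, path="movie.jsonl"):
        self.path = path

    def open_spider(self, spider):
        self.f = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # dict(item) works for plain dicts and for scrapy.Item objects;
        # ensure_ascii=False keeps Chinese movie names readable.
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.f.close()


# Drive the hooks manually, writing into a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "movie.jsonl")
pipeline = JsonLinesMoviePipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item({"name": "示例电影", "src": "https://img.example.com/a.jpg"}, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
print(rows)
```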

6. Full code

movie.py

    import scrapy
    from dytt_movie.items import DyttMovieItem


    class MovieSpider(scrapy.Spider):
        name = 'movie'
        # Placeholders: fill in the site's domain and the list page URL.
        allowed_domains = ['<enter the site domain>']
        start_urls = ['<URL of the 国内电影 list page>']

        def parse(self, response):
            a_list = response.xpath('//div[@class="co_content8"]//td[2]//a[2]')

            for a in a_list:
                name = a.xpath('./text()').extract_first()
                url = '<enter the site domain>' + a.xpath('./@href').extract_first()

                # Request the detail page and carry the name along in meta.
                yield scrapy.Request(url=url, callback=self.parse_second, meta={'name': name})

        def parse_second(self, response):
            src = response.xpath('//div[@id="Zoom"]//img/@src').extract_first()
            name = response.meta['name']

            movie = DyttMovieItem(name=name, src=src)
            yield movie

items.py

    import scrapy


    class DyttMovieItem(scrapy.Item):
        # define the fields for your item here like:
        name = scrapy.Field()
        src = scrapy.Field()

pipelines.py

    class DyttMoviePipeline:

        def open_spider(self, spider):
            self.f = open('movie.json', 'w', encoding='utf-8')

        def process_item(self, item, spider):
            self.f.write(str(item))
            return item

        def close_spider(self, spider):
            self.f.close()

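Summary

The key part is step 4 above (requesting the second-page link and handing its response to a callback); the other steps are routine.

Fetching data across pages relies on yield to issue a request whose callback is the custom second parse function.

When two pages are involved, pass the data between them through meta.

The meta argument takes a dict, and the values can be read back with get.

Check your XPath expressions carefully: some tags, span for example, may fail to match in the response, so avoid relying on them where possible.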
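On the point about get: meta behaves like an ordinary dict, so the difference between get and square-bracket access is the usual one. A minimal illustration with a plain dict standing in for response.meta; the keys and values are made up.

```python
# Stand-in for response.meta in the second callback.
meta = {"name": "Example Movie"}

# get returns the value, or a default when the key is missing.
name = meta.get("name")
year = meta.get("year", "unknown")

# Square brackets raise KeyError for a missing key.
try:
    meta["year"]
    missing_key_raised = False
except KeyError:
    missing_key_raised = True

print(name, year, missing_key_raised)
```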


Posted in: Python Web Scraping

2022-10-25


