Scraping Web Page Images with the Scrapy Framework: A Detailed Guide


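Preface: this article uses the Scrapy framework to scrape images from a web page and store them persistently on disk. Scrapy's image storage depends on the Pillow library, so install Pillow first.

Installation: pip install Pillow

Target URL: https://sc.chinaz.com/tupian/huaxuetupian.html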

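Spider source code:

import time

import scrapy
from imgsPro.items import ImgsproItem


class ImgsSpider(scrapy.Spider):
    # Spider name; this is the name passed to `scrapy crawl` when running the spider
    name = 'imgs'
    # Domains the spider is allowed to request; the author suggests leaving this commented out
    # allowed_domains = ['sc.chinaz.com']
    # Start URL
    start_urls = ['https://sc.chinaz.com/tupian/huaxuetupian.html']

    # Default callback that Scrapy invokes for the start_urls responses
    def parse(self, response):
        # Use XPath to select every image entry on the listing page
        tree = response.xpath('//div[@id="container"]/div')
        # Detail-page links (href)
        img_page_url = tree.xpath('./div/a/@href').extract()
        # Image names (alt)
        imgalt = tree.xpath('./div/a/@alt').extract()
        # Send a request to each image's detail page
        for page, alt in zip(img_page_url, imgalt):
            # Scraped links are often incomplete (no scheme), so resolve them against the page URL
            page = response.urljoin(page)
            # Create the item (a dict-like object limited to the declared fields)
            item = ImgsproItem()
            # Store the scraped value in the item
            item['alt'] = alt
            # Throttle the crawl. Note that time.sleep blocks Scrapy's event loop while it waits;
            # the DOWNLOAD_DELAY setting is the non-blocking way to slow requests down.
            time.sleep(0.6)
            # Request the detail page and hand the item to the next callback through meta
            yield scrapy.Request(url=page, callback=self.imgs_parse, meta={'item': item})

    # Custom callback for the detail pages
    def imgs_parse(self, response):
        # Pick up the item created in parse (it has to be passed in because it was built there)
        item = response.meta['item']
        time.sleep(1.7)
        # Extract the image URL from the detail page (.extract_first() returns the first match or None)
        img_url = response.xpath('//div[@class="imga"]/a/@href').extract_first()
        if img_url is None:
            # Nothing matched on this page, so there is nothing to download
            return
        # Complete the URL and store it in the item for the pipeline
        item['img_url'] = response.urljoin(img_url)
        # Submit the image name and image URL to the pipeline for download and storage
        yield item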
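The spider has to complete the links it extracts, because the pages return them without a scheme. As a rough, standalone illustration of what that completion does, the sketch below uses the standard library's urljoin; Scrapy's response.urljoin performs the same kind of resolution against the response URL. The helper name complete_url and the sample link values are hypothetical and are not taken from the site.

```python
from urllib.parse import urljoin

# The listing page from the article, used as the base for resolution
BASE = 'https://sc.chinaz.com/tupian/huaxuetupian.html'


def complete_url(base, link):
    """Resolve a possibly incomplete link against the page it was found on."""
    return urljoin(base, link)


# Hypothetical link values, for illustration only
print(complete_url(BASE, '//sc.chinaz.com/tupian/example.htm'))   # scheme missing
print(complete_url(BASE, '/tupian/example.htm'))                  # host missing
print(complete_url(BASE, 'https://sc.chinaz.com/tupian/example.htm'))  # already complete
```

Unlike plain string concatenation with 'https:', this leaves links that are already absolute untouched.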

Item source code:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ImgsproItem(scrapy.Item):
    # Fields that the spider stores in the item:
    # alt is the image name, img_url is the image address
    alt = scrapy.Field()
    img_url = scrapy.Field()

Pipeline source code:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# Everything below has to be written by hand
import time

import scrapy  # required for scrapy.Request
# ImagesPipeline is Scrapy's built-in pipeline class for image data
from scrapy.pipelines.images import ImagesPipeline


# Custom image pipeline
class ImgsproPipeline(ImagesPipeline):
    # Runs once, when the class body is executed at import time
    print('Initializing the image pipeline')

    # Build the download request for each item's image
    def get_media_requests(self, item, info):
        time.sleep(1.26)
        # Request the image URL stored in the item (the item is dict-like)
        yield scrapy.Request(url=item['img_url'])

    # Decide the file name, relative to the IMAGES_STORE directory
    def file_path(self, request, response=None, info=None, *, item=None):
        # Use the last segment of the image URL as the file name
        imgName = request.url.split('/')[-1]
        print(f'Downloading: {imgName}')
        # The returned name is where the file is written under the configured directory
        return imgName

    def item_completed(self, results, item, info):
        # Pass the item on to the next pipeline
        return item

    # Custom __del__ so a message is printed at the end
    # (it runs when the pipeline object is garbage-collected)
    def __del__(self):
        print('All downloads finished!')
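The pipeline names each file after the last segment of its URL, even though the item also carries the image's alt text. As one possible variation, the sketch below builds a file name from the alt text plus the extension found in the URL. The helper name build_image_name, the '.jpg' fallback, and the 'image' fallback are assumptions made for illustration; they are not part of the original project.

```python
import os
import re
from urllib.parse import urlparse


def build_image_name(alt, img_url):
    """Combine an item's alt text with the extension taken from the image URL."""
    # Take the extension from the URL path only, ignoring any query string;
    # fall back to '.jpg' (an assumption) when the URL has no extension
    ext = os.path.splitext(urlparse(img_url).path)[1] or '.jpg'
    # Replace whitespace and characters that are unsafe in file names
    safe = re.sub(r'[\\/:*?"<>|\s]+', '_', alt).strip('_')
    # Fall back to a generic name (an assumption) when nothing usable remains
    return f"{safe or 'image'}{ext}"


# Hypothetical values, for illustration only
print(build_image_name('snow mountain', 'https://example.com/a/b/pic_01.jpg?x=1'))
```

In Scrapy versions that pass the item to file_path, such a helper could be called there as build_image_name(item['alt'], request.url) in place of splitting the URL.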

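Settings in the settings file:

[Note: the IMAGES_STORE = 'image storage path' setting has to be added by hand. The path is the storage directory only; it does not include the image file name.]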
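The article does not show the settings file itself, so the following is only a sketch of what the relevant fragment might look like. The dotted pipeline path assumes the class lives in the project's default pipelines.py inside the imgsPro package; the './imgs' directory, the priority 300, and the DOWNLOAD_DELAY value are placeholder assumptions.

```python
# settings.py (fragment)

# Enable the custom image pipeline; the number is its priority
# (lower runs first). The module path is an assumption.
ITEM_PIPELINES = {
    'imgsPro.pipelines.ImgsproPipeline': 300,
}

# Directory the downloaded images are written to (placeholder value);
# the file name itself comes from the pipeline's file_path method
IMAGES_STORE = './imgs'

# Optional and not in the original article: a non-blocking delay, in seconds,
# between requests, as an alternative to calling time.sleep in callbacks
DOWNLOAD_DELAY = 1
```

Without the ITEM_PIPELINES entry the pipeline class is never invoked, so no images are downloaded even if IMAGES_STORE is set.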


That covers everything needed to scrape images with Scrapy!


Posted in: Python Web Crawlers

2022-10-28
