Python Crawler, Part 11


Scrapy: practical experience

I. Downloading files with FilesPipeline

1. Settings
Both ITEM_PIPELINES and FILES_STORE are required; I had written FILES_STORE as FILE_STORE, which kept the pipeline from being enabled:

20200213 15:48:27 [scrapy.middleware] INFO: Enabled item pipelines: []
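For comparison, a minimal settings.py sketch that does enable the pipeline (the storage path is an assumption):

# settings.py
import os

ITEM_PIPELINES = {
    # Scrapy's built-in FilesPipeline; a custom subclass would be listed the same way
    'scrapy.pipelines.files.FilesPipeline': 1,
}

# the key must be FILES_STORE — with FILE_STORE the pipeline is silently skipped
# and the log shows "Enabled item pipelines: []"
FILES_STORE = os.path.join(os.path.dirname(__file__), 'downloads')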

2. Download results
With urlretrieve, the file appears on disk and grows while it is still downloading, and the target filename has to be passed as a parameter.
With FilesPipeline, the file only appears once the download has finished, and the filename is generated automatically: if the URL carries a file extension it is reused, otherwise you have to rename the file by hand.
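A minimal urlretrieve sketch for comparison (URL and filename are placeholders):

from urllib.request import urlretrieve

def show_progress(block_num, block_size, total_size):
    # called repeatedly while the download is in progress
    print(f'\r{block_num * block_size}/{total_size} bytes', end='')

# the target filename must be supplied explicitly
urlretrieve('http://example.com/video.mp4', 'video.mp4', reporthook=show_progress)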

3. Download settings
FilesPipeline downloads are constrained by several settings: the download timeout (DOWNLOAD_TIMEOUT), the retry count (retry), redirects (redirect), the maximum number of simultaneous downloads per domain, and the maximum per IP.

When a file download exceeds the timeout it is aborted. The relevant settings are sketched below.
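A sketch of those settings with illustrative values (the numbers are placeholders, not recommendations):

# settings.py — illustrative values
DOWNLOAD_TIMEOUT = 600                 # give up on a single download after 600 s
RETRY_ENABLED = True
RETRY_TIMES = 2                        # retries after the first attempt
REDIRECT_ENABLED = True                # 301/302 handling by the redirect middleware
MEDIA_ALLOW_REDIRECTS = True           # let FilesPipeline requests follow redirects
CONCURRENT_REQUESTS_PER_DOMAIN = 8     # max simultaneous downloads per domain
CONCURRENT_REQUESTS_PER_IP = 0         # a non-zero value overrides the per-domain limit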

4. Download process

5. Download errors

{'downloader/exception_count': 29,
 'downloader/exception_type_count/twisted.internet.error.TCPTimedOutError': 1,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 20,
 'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 8,
 'downloader/request_bytes': 89242,
 'downloader/request_count': 202,
 'downloader/request_method_count/GET': 202,
 'downloader/response_bytes': 3689994587,
 'downloader/response_count': 260,
 'downloader/response_status_count/200': 226,
 'downloader/response_status_count/302': 30,
 'downloader/response_status_count/404': 4,
 'elapsed_time_seconds': 6032.416843,
 'file_count': 42,
 'file_status_count/downloaded': 41,
 'file_status_count/uptodate': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 2, 13, 19, 25, 10, 570224),
 'item_scraped_count': 75,
 'log_count/DEBUG': 1469,
 'log_count/ERROR': 19,
 'log_count/INFO': 110,
 'log_count/WARNING': 33,
 'request_depth_max': 2,
 'response_received_count': 230,
 'retry/max_reached': 29,
 'scheduler/dequeued': 185,
 'scheduler/dequeued/memory': 185,
 'scheduler/enqueued': 185,
 'scheduler/enqueued/memory': 185,
 'spider_exceptions/AttributeError': 4,
 'spider_exceptions/TypeError': 15,
 'start_time': datetime.datetime(2020, 2, 13, 17, 44, 38, 153381)}
20200214 03:25:10 [scrapy.core.engine] INFO: Spider closed (finished)

94 URLs in total: the spider failed to parse 15 + 4 = 19 of them and produced 75 items. Of those 75, 42 files were downloaded, 29 hit exceptions and 4 returned 404, which accounts for all 75. The exceptions break down as 20 download timeouts (TimeoutError), 1 TCP connection timeout (TCPTimedOutError) and 8 rejected requests (ResponseNeverReceived).
[A 404 or any other exception means the file is not downloaded, and when the file is not downloaded the files field is empty.]
[3 of them were pages Selenium could not open and returned None for directly; presumably the Scrapy downloader still could not open them and got a 404 back.]

Reasons for parse failures:
TypeError:
In 12 cases, videoSrc is not an attribute of the video tag but of the source tag inside it; fix: switch to //video//@src.
In 3 cases, Selenium timed out loading the page and returned None directly; fix: try clicking body instead of the div (just moving the focus there is enough to trigger the JS, no click needed).
AttributeError:
In 4 cases, the date is not hard-coded in the HTML source; fix: if it is empty, use the current day as the date, and also change the date to YYYYMMDD format. Both parsing fixes are sketched below.
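A sketch of the two fixes inside the spider callback; the name parse_realurl comes from the traceback further down, while the date selector and item fields are hypothetical:

import datetime

def parse_realurl(self, response):
    # //video//@src also matches <source src="..."> nested inside <video>,
    # not only a src attribute on the <video> tag itself
    video_src = response.xpath('//video//@src').get()
    if video_src is None:
        self.logger.warning('no video src found on %s', response.url)
        return  # avoids the "must be str, not NoneType" TypeError

    # if the date is injected by JS and missing from the HTML source,
    # fall back to today's date in YYYYMMDD format
    date = response.xpath('//div[@class="date"]/text()').get()  # hypothetical selector
    if not date:
        date = datetime.date.today().strftime('%Y%m%d')

    yield {
        'title': response.xpath('//title/text()').get(),
        'date': date,
        'file_urls': [video_src],
    }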

File download failures:
Cause of the 404s:
Cause of the download timeouts:
Cause of the TCP timeouts:
Cause of the rejected requests:
For all of these, the only thing left to try is switching to random request headers; a middleware sketch follows.
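A sketch of a random request-header downloader middleware (the UA strings are placeholders; it would be enabled via DOWNLOADER_MIDDLEWARES):

import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',      # placeholder strings,
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) ...',   # use a real UA list in practice
]

class RandomHeadersMiddleware:
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
        return None  # None lets the request continue through the remaining middlewares

The requests issued by FilesPipeline are fetched by the same downloader, so they also pass through the downloader middlewares and pick up the rotated headers.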

Also, the file-renaming feature still needs to be added (see the FilesPipeline section further down).

20200214 01:56:08 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.abc1.com> (failed 1 times): User timeout caused connection failure: Getting http://www.abc.com took longer than 600.0 seconds..
20200214 01:56:08 [scrapy.pipelines.files] WARNING: File (unknownerror): Error downloading file from <GET http://www.abc1.com> referred in <None>: User timeout caused connection failure: Getting http://www.abc.com took longer than 600.0 seconds..

All 20 download timeouts were logged by [scrapy.downloadermiddlewares.retry]; note that this downloadermiddlewares refers to Scrapy's built-in downloader middleware, not the downloader middleware I configured myself.

20200214 01:45:44 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.abc2.com> (failed 1 times): TCP connection timed out: 10060: 由于连接方在一段时间后没有正确答复或连接的主机没有反应,连接尝试失败。.
20200214 01:45:44 [scrapy.pipelines.files] WARNING: File (unknownerror): Error downloading file from <GET http://www.abc2.com> referred in <None>: TCP connection timed out: 10060: 由于连接方在一段时间后没有正确答复或连接的主机没有反应,连接尝试失败。.

20200214 01:49:36 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.abc3.com> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a nonclean fashion: Connection lost.>]
20200214 01:49:36 [scrapy.pipelines.files] WARNING: File (unknownerror): Error downloading file from <GET http://www.abc3.com> referred in <None>: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a nonclean fashion: Connection lost.>]

Spider code errors:

20200214 01:46:54 [scrapy.core.scraper] ERROR: Spider error processing <GET https://shell.sososhell.com/embed/2e654ed621033add1a2f48e42a62f633> (referer: http://cl.3211i.xyz/htm_data/2002/22/3812228.html)
  File "F:…", line 49, in parse_realurl
    ...
    print('B' * 50+' '+videoSrc)
TypeError: must be str, not NoneType

20200214 02:42:20 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:30992/session/34e24976cab3906d1d75eb7c30cef921/url {"url": "https://…"}
20200214 02:42:40 [urllib3.connectionpool] DEBUG: http://127.0.0.1:30992 "POST /session/34e24976cab3906d1d75eb7c30cef921/url HTTP/1.1" 408 1155
20200214 02:42:40 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
Open timed out: https://...
20200214 02:42:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://...> 1 (referer: http://...)
20200214 02:42:54 [scrapy.core.scraper] ERROR: Spider error processing <GET https://ppse026.com/play/video.php?id=15> (referer: http://cl.3211i.xyz/htm_data/2002/22/3812218.html)
Traceback (most recent call last):
  File "F:…", line 49, in parse_realurl
    print('B' * 50+' '+videoSrc)
TypeError: must be str, not NoneType

# Selenium opened the page, but after the click the video timed out loading and None was returned; the Scrapy downloader then fetched the page itself (the page itself loads fine and returns 200, only the video times out after the click). Since the videoSrc attribute only exists once the JS has run, it is NoneType at that point. # Also noticed: even without clicking, videoSrc is present once the page has loaded; try clicking body instead?

There are also some pages, e.g. http://333.thumbfox.com/embed/7986/, where the src attribute is not on the video tag but on the source tag inside it; switch to //video//@src.
There are 12 such cases.
12 + 3 = 15 TypeErrors.

On most pages the date is hard-coded in the source; on 4 of them it is generated by JS, which is what caused the AttributeError.

Normal download flow:

20200214 02:52:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://abc_2.com> (referer: http://abc_1.com)
# (step omitted) parse1 runs: parses abc_3.com out of the abc_2.com response and yields Request(url=abc_3.com), which is handled by the Selenium downloader middleware
20200214 02:53:25 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:30992/session/34e24976cab3906d1d75eb7c30cef921/url {"url": "abc_3.com"}
20200214 02:53:26 [urllib3.connectionpool] DEBUG: http://127.0.0.1:30992 "POST /session/34e24976cab3906d1d75eb7c30cef921/url HTTP/1.1" 200 14
20200214 02:53:26 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
20200214 02:53:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://abc_3.com> 1 (referer: abc_2.com.html)
# (step omitted) parse2 runs: parses abc_4.com out of the abc_3.com response and puts it in the item's file_urls, which is handled by FilesPipeline
20200214 03:02:47 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://abc_5.com> from <GET https://abc_4.com>
20200214 03:18:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://abc_5.com> (referer: None)
20200214 03:18:04 [scrapy.pipelines.files] DEBUG: File (downloaded): Downloaded file from <GET https://abc_4.com> referred in <None>
20200214 03:18:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://abc_3.com>
{'date': '2020-02-14', 'file_urls': ['https://abc_4.com'], 'files': [{'checksum': 'd0f5f18d89666750d2c0b85980791930', 'path': 'full/0f6e48154b4b3e886c7d2d96f254c1de0fbc4084', 'url': 'https://abc_4.com'}], 'title': '…'}

Note:
The [scrapy.downloadermiddlewares.redirect] line does not always appear; in other words, the redirect is actually handled by Scrapy's built-in downloadermiddlewares.

From the flow above:
1. abc_1.com is the index page, abc_2.com the detail page, abc_3.com the external video page linked from the detail page, abc_4.com the videoSrc found on the video page (the real URL is http://abc_4.mp4), and abc_5.com the address videoSrc redirects to.
2. abc_1.com and abc_2.com are crawled by Scrapy normally, abc_3.com is fetched through Selenium, and abc_4.mp4 is downloaded by FilesPipeline.
3. [scrapy.core.engine] DEBUG: Crawled (200) means the request succeeded and a response came back.
[scrapy.core.scraper] DEBUG: Scraped means parsing of the response is finished (? just my guess).
4. scrapy.core.engine reporting the response (Crawled 200 https://abc_5.com), scrapy.pipelines.files reporting the download (Downloaded file from https://abc_4.com) and scrapy.core.scraper reporting the scrape (Scraped from) happen at essentially the same time.
5. Once the file has finished downloading, the corresponding item's information is printed, and by then the files field has been populated automatically.

Flow when a download fails:
404 flow:

20200214 01:45:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://abc_2.com> (referer: http://abc_1.com)
# (step omitted) parse1 runs: parses abc_3.com out of the abc_2.com response and yields Request(url=abc_3.com), handled by the Selenium downloader middleware
20200214 01:45:53 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:30992/session/34e24976cab3906d1d75eb7c30cef921/url {"url": "abc_3.com"}
20200214 01:45:53 [urllib3.connectionpool] DEBUG: http://127.0.0.1:30992 "POST /session/34e24976cab3906d1d75eb7c30cef921/url HTTP/1.1" 200 14
20200214 01:45:53 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
20200214 01:45:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://abc_3.com> 1 (referer: abc_2.com.html)
# (step omitted) parse2 runs: parses abc_4.com out of the abc_3.com response and puts it in the item's file_urls, handled by FilesPipeline
20200214 01:48:00 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://162.209.170.99/20200213/62.mp4> (referer: None)
20200214 01:48:00 [scrapy.pipelines.files] WARNING: File (code: 404): Error downloading file from <GET http://162.209.170.99/20200213/62.mp4> referred in <None>
20200214 01:48:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ppse026.com/play/video.php?id=62>
{'date': '2020-02-14', 'file_urls': ['https://abc_4.com'], 'files': [], 'title': '…'}

Request-failure flow:

20200214 01:48:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://abc_2.com> (referer: http://abc_1.com)
# (step omitted) parse1 runs: parses abc_3.com out of the abc_2.com response and yields Request(url=abc_3.com), handled by the Selenium downloader middleware
20200214 01:48:14 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:30992/session/34e24976cab3906d1d75eb7c30cef921/url {"url": "https://abc_3.com"}
20200214 01:48:17 [urllib3.connectionpool] DEBUG: http://127.0.0.1:30992 "POST /session/34e24976cab3906d1d75eb7c30cef921/url HTTP/1.1" 200 14
20200214 01:48:17 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
20200214 01:48:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://abc_3.com> 1 (referer: http://abc_2.com)
# (step omitted) parse2 runs: parses abc_4.com out of the abc_3.com response and puts it in the item's file_urls, handled by FilesPipeline
20200214 01:48:29 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://abc_5.com> from <GET https://abc_4.com>
20200214 01:49:36 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://abc_5.com> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a nonclean fashion: Connection lost.>]
20200214 01:49:36 [scrapy.pipelines.files] WARNING: File (unknownerror): Error downloading file from <GET https://abc_4.com> referred in <None>: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a nonclean fashion: Connection lost.>]
20200214 01:49:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://abc_3.com>
{'date': '2020-02-13', 'file_urls': ['https://abc_4.com'], 'files': [], 'title': '…'}

Custom FilesPipeline:
1. After the item is handed to Scrapy, Scrapy inspects its file_urls field, puts those URLs into the scheduler and issues the requests;
2. Note that the item's other fields are not carried along with those requests (see the binding trick below).

ImagesPipeline:
1. ImagesPipeline obtains image_urls from the item via the get_media_requests(self, item, info) method; from image_urls (a list of URLs) it builds a list of Requests and hands them back to Scrapy.
2. get_media_requests is called before the requests are sent, and file_path is called once a request has completed and the image is about to be stored locally. So if file_path needs information from the item, that information must first be bound to the Request object and then read back inside file_path.
[From my own experiments, binding it in process_item of the FilePipeline also works, which suggests process_item is called before file_path.]
3. file_path must return either an absolute path or a path relative to FILES_STORE; I recommend writing an absolute path into FILES_STORE (it can be built from __file__), then reading FILES_STORE inside file_path and joining with it, so that file_path directly returns an absolute path.
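A sketch of points 2 and 3 combined: a FilesPipeline subclass that binds item information to the Request in get_media_requests and reads it back in file_path to build the stored filename (the title-based naming scheme and default extension are assumptions; a real version should also sanitise the title for the filesystem):

import os
from scrapy import Request
from scrapy.pipelines.files import FilesPipeline

class NamedFilesPipeline(FilesPipeline):
    def get_media_requests(self, item, info):
        # called before the requests are sent: attach the item info we will
        # need later to each Request via meta
        for url in item.get('file_urls', []):
            yield Request(url, meta={'title': item.get('title', 'unnamed')})

    def file_path(self, request, response=None, info=None):
        # called when the downloaded file is about to be stored; the return
        # value is joined onto FILES_STORE (or used as-is if absolute)
        title = request.meta.get('title', 'unnamed')
        ext = os.path.splitext(request.url)[1] or '.mp4'  # assumed fallback extension
        return os.path.join('full', f'{title}{ext}')

It would be enabled by pointing ITEM_PIPELINES at this class instead of the built-in one.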
4. An open question: if file_path has a bug, does it fail silently and simply cause the download to fail? And do the requests issued by the pipeline go through the downloader middlewares?

5. Downloader middleware: what should process_request() return?
When I modified the request headers I did not write a return statement!!!!
(In fact process_request() may return None, a Response object, a Request object, or raise IgnoreRequest; falling off the end of the function returns None implicitly, which simply lets the request continue through the rest of the chain, so the missing return is harmless when only modifying headers.)

One scenario:
When yielding a dict, what if one of its fields is a list of variable length whose elements each come from a different URL? How do you implement that?
In Scrapy, once you yield a Request, the resulting response and its parsed output are handled automatically and sent on to the pipeline; there is no way to capture them as intermediate results. It feels like you cannot assemble the complete object in Python first and then write it to the database; you can only write first and then keep querying and updating, which seems bad for efficiency.
Looking at it from another angle, though, each element of the list can be treated as an entity of its own, with its own table/collection, and the query-then-write problem disappears; a sketch follows.
Scrapy's processing flow feels very much like an assembly line: it only moves forward, never back.
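A minimal sketch of that "separate entity" idea: instead of accumulating a list field on one item, each list element is yielded as its own item keyed by a parent id, so nothing has to be read back and merged later (all URLs, selectors and field names here are hypothetical):

import scrapy

class SplitSpider(scrapy.Spider):
    name = 'split_example'
    start_urls = ['http://example.com/detail/1']  # placeholder

    def parse(self, response):
        parent_id = response.url  # hypothetical key linking parent and children
        # the parent row, without the variable-length list
        yield {'kind': 'post', 'id': parent_id,
               'title': response.xpath('//title/text()').get()}
        # each would-be list element becomes a follow-up request carrying the key
        for url in response.xpath('//a[@class="part"]/@href').getall():
            yield response.follow(url, callback=self.parse_element,
                                  meta={'parent_id': parent_id})

    def parse_element(self, response):
        # one child item per element, written to its own table/collection by the pipeline
        yield {'kind': 'element', 'parent_id': response.meta['parent_id'],
               'value': response.xpath('//video//@src').get()}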

