Advanced Crawling Techniques

Setting a sleep interval

Here n is the desired pause between requests, in seconds.

import time

time.sleep(n)
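A fixed interval is easy for a server to recognize; a small variation (a sketch, not part of the original code) is to randomize the delay with the random module, which the course code below also imports:

import random
import time

# Pause a random amount between 1 and 3 seconds between requests
# (illustrative values, not taken from the course code).
time.sleep(random.uniform(1, 3))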

Setting a proxy

# Use two methods from urllib.request to set up a proxy
import urllib.request as urlrequest

proxy = urlrequest.ProxyHandler({'https': '47.91.78.201:3128'})
opener = urlrequest.build_opener(proxy)
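Once built, the opener can be used directly or installed globally; a minimal sketch of both options (example.com is only a placeholder URL):

import urllib.request as urlrequest

proxy = urlrequest.ProxyHandler({'https': '47.91.78.201:3128'})
opener = urlrequest.build_opener(proxy)

# Option 1: use the opener directly for a single request (placeholder URL).
html = opener.open('https://example.com').read()

# Option 2: install it globally so that plain urlopen()/urlretrieve()
# calls also go through the proxy, as the course code does later.
urlrequest.install_opener(opener)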

User-Agent

Websites can tell that you are crawling with Python, so when you send a request you need to disguise the header as a browser.
opener.addheaders = [('User-Agent', '…')]
Take the User-Agent string of the browser you want to impersonate and put it in place of the ellipsis above.
Some commonly used browser User-Agent strings:
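For illustration, two typical desktop User-Agent strings and how one of them is plugged in (these particular values are examples, not necessarily the ones from the course):

import urllib.request as urlrequest

# Example User-Agent strings (illustrative values):
CHROME_UA = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
             'AppleWebKit/537.36 (KHTML, like Gecko) '
             'Chrome/114.0.0.0 Safari/537.36')
FIREFOX_UA = 'Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0'

opener = urlrequest.build_opener()
opener.addheaders = [('User-Agent', CHROME_UA)]
urlrequest.install_opener(opener)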

Worked example

Example: crawling Place Pulse Google Street View images
Course reference code:
1. Preparation: import the packages, define the storage paths, and connect to the API

import urllib.request as urlrequest
import time
import random

IMG_PATH = './imgs/{}.jpg'
DATA_FILE = './data/votes.csv'
STORED_IMG_ID_FILE = './data/cached_img.txt'
STORED_IMG_IDS = set()
IMG_URL = 'https://maps.googleapis.com/maps/api/streetview?size=400x300&location={},{}'
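The script writes into ./imgs/ and ./data/; a small addition (assumed here, not in the original code) is to create those directories before downloading:

import os

# Create the output directories if they do not exist yet.
os.makedirs('./imgs', exist_ok=True)
os.makedirs('./data', exist_ok=True)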

2. Apply the crawling tricks: use a proxy server and a User-Agent

proxy = urlrequest.ProxyHandler({'https': '47.91.78.201:3128'})
opener = urlrequest.build_opener(proxy)
opener.addheaders = [('User-Agent',
                      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) '
                      'AppleWebKit/603.1.30 (KHTML, like Gecko) '
                      'Version/10.1 Safari/603.1.30')]
urlrequest.install_opener(opener)
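After install_opener(), every module-level urlopen()/urlretrieve() call goes through the proxy with the spoofed header, which is why step 4 can call urlretrieve() directly. A quick sanity check might look like this (a sketch that assumes the setup above has run; the free proxy in the example may well be unreachable):

import urllib.request as urlrequest

try:
    resp = urlrequest.urlopen('https://maps.googleapis.com', timeout=10)
    print('opener works, HTTP status:', resp.status)
except Exception as exc:
    print('request through the proxy failed:', exc)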

3. Read the cached image ids

with open(STORED_IMG_ID_FILE) as input_file:
    for line in input_file:
        STORED_IMG_IDS.add(line.strip())
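This step assumes cached_img.txt already exists; a tolerant variation (an assumption, not in the original code) that also works on a first run with no cache file:

import os

STORED_IMG_ID_FILE = './data/cached_img.txt'
STORED_IMG_IDS = set()

# Only read the cache if it is already there; otherwise start empty.
if os.path.exists(STORED_IMG_ID_FILE):
    with open(STORED_IMG_ID_FILE) as input_file:
        for line in input_file:
            STORED_IMG_IDS.add(line.strip())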

4. Crawl the Google Street View images according to the provided image-id file

with open(DATA_FILE) as input_file:
    skip_first_line = True
    for line in input_file:
        if skip_first_line:
            skip_first_line = False
            continue
        left_id, right_id, winner, left_lat, left_long, right_lat, right_long, category = line.split(',')
        if left_id not in STORED_IMG_IDS:
            print('saving img {}…'.format(left_id))
            urlrequest.urlretrieve(IMG_URL.format(left_lat, left_long), IMG_PATH.format(left_id))
            STORED_IMG_IDS.add(left_id)
            with open(STORED_IMG_ID_FILE, 'a') as output_file:
                output_file.write('{}\n'.format(left_id))
            time.sleep(1)  # wait some time, trying to avoid Google blocking the crawler
        if right_id not in STORED_IMG_IDS:
            print('saving img {}…'.format(right_id))
            urlrequest.urlretrieve(IMG_URL.format(right_lat, right_long), IMG_PATH.format(right_id))
            STORED_IMG_IDS.add(right_id)
            with open(STORED_IMG_ID_FILE, 'a') as output_file:
                output_file.write('{}\n'.format(right_id))
            time.sleep(1)  # wait some time, trying to avoid Google blocking the crawler
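One failed download (a dead proxy, a rate limit) currently aborts the whole run. A hedged variation, reusing the names defined in step 1, is to move the per-image work into a helper (save_image is a hypothetical name, not from the course code) that catches errors and randomizes the pause:

def save_image(img_id, lat, lng):
    """Hypothetical helper: download one Street View image and record its id."""
    if img_id in STORED_IMG_IDS:
        return
    try:
        urlrequest.urlretrieve(IMG_URL.format(lat, lng), IMG_PATH.format(img_id))
    except Exception as exc:
        print('failed to save img {}: {}'.format(img_id, exc))
        return
    STORED_IMG_IDS.add(img_id)
    with open(STORED_IMG_ID_FILE, 'a') as output_file:
        output_file.write('{}\n'.format(img_id))
    time.sleep(random.uniform(1, 3))  # randomized pause instead of a fixed second

# Inside the loop over votes.csv, the two if-blocks then reduce to:
# save_image(left_id, left_lat, left_long)
# save_image(right_id, right_lat, right_long)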

