Advanced Crawling Techniques

Setting a sleep interval

Here n is the desired pause between requests, in seconds.

import time

time.sleep(n)
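A fixed interval is easy for a server to recognize; a small variation (a sketch, not part of the original code) is to randomize the delay with the random module, which the course code below also imports:

import random
import time

# Pause a random amount between 1 and 3 seconds between requests
# (illustrative values, not taken from the course code).
time.sleep(random.uniform(1, 3))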

Setting a proxy

# Use two methods from urllib.request to set up a proxy
import urllib.request as urlrequest

proxy = urlrequest.ProxyHandler({'https': '47.91.78.201:3128'})
opener = urlrequest.build_opener(proxy)
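Once built, the opener can be used directly or installed globally; a minimal sketch of both options (example.com is only a placeholder URL):

import urllib.request as urlrequest

proxy = urlrequest.ProxyHandler({'https': '47.91.78.201:3128'})
opener = urlrequest.build_opener(proxy)

# Option 1: use the opener directly for a single request (placeholder URL).
html = opener.open('https://example.com').read()

# Option 2: install it globally so that plain urlopen()/urlretrieve()
# calls also go through the proxy, as the course code does later.
urlrequest.install_opener(opener)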

User-Agent

Websites can tell that you are crawling with Python, so when you send a request you need to disguise the header as a browser.
opener.addheaders = [('User-Agent', '…')]
Take the User-Agent string of the browser you want to impersonate and put it in place of the ellipsis above.
Some commonly used browser User-Agent strings:
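For illustration, two typical desktop User-Agent strings and how one of them is plugged in (these particular values are examples, not necessarily the ones from the course):

import urllib.request as urlrequest

# Example User-Agent strings (illustrative values):
CHROME_UA = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
             'AppleWebKit/537.36 (KHTML, like Gecko) '
             'Chrome/114.0.0.0 Safari/537.36')
FIREFOX_UA = 'Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0'

opener = urlrequest.build_opener()
opener.addheaders = [('User-Agent', CHROME_UA)]
urlrequest.install_opener(opener)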

Worked example

Example: crawling Place Pulse Google Street View images
Course reference code:
1. Preparation: import the packages, define the storage paths, and connect to the API

import urllib.request as urlrequest
import time
import random

IMG_PATH = './imgs/{}.jpg'
DATA_FILE = './data/votes.csv'
STORED_IMG_ID_FILE = './data/cached_img.txt'
STORED_IMG_IDS = set()
IMG_URL = 'https://maps.googleapis.com/maps/api/streetview?size=400x300&location={},{}'
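The script writes into ./imgs/ and ./data/; a small addition (assumed here, not in the original code) is to create those directories before downloading:

import os

# Create the output directories if they do not exist yet.
os.makedirs('./imgs', exist_ok=True)
os.makedirs('./data', exist_ok=True)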

2. Apply the crawling tricks: use a proxy server and a User-Agent

proxy = urlrequest.ProxyHandler({'https': '47.91.78.201:3128'})
opener = urlrequest.build_opener(proxy)
opener.addheaders = [('User-Agent',
                      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) '
                      'AppleWebKit/603.1.30 (KHTML, like Gecko) '
                      'Version/10.1 Safari/603.1.30')]
urlrequest.install_opener(opener)
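After install_opener(), every module-level urlopen()/urlretrieve() call goes through the proxy with the spoofed header, which is why step 4 can call urlretrieve() directly. A quick sanity check might look like this (a sketch that assumes the setup above has run; the free proxy in the example may well be unreachable):

import urllib.request as urlrequest

try:
    resp = urlrequest.urlopen('https://maps.googleapis.com', timeout=10)
    print('opener works, HTTP status:', resp.status)
except Exception as exc:
    print('request through the proxy failed:', exc)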

3. Read the cached image ids

with open(STORED_IMG_ID_FILE) as input_file:
    for line in input_file:
        STORED_IMG_IDS.add(line.strip())
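This step assumes cached_img.txt already exists; a tolerant variation (an assumption, not in the original code) that also works on a first run with no cache file:

import os

STORED_IMG_ID_FILE = './data/cached_img.txt'
STORED_IMG_IDS = set()

# Only read the cache if it is already there; otherwise start empty.
if os.path.exists(STORED_IMG_ID_FILE):
    with open(STORED_IMG_ID_FILE) as input_file:
        for line in input_file:
            STORED_IMG_IDS.add(line.strip())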

4. Crawl the Google Street View images according to the provided image-id file

with open(DATA_FILE) as input_file:
    skip_first_line = True
    for line in input_file:
        if skip_first_line:
            skip_first_line = False
            continue
        left_id, right_id, winner, left_lat, left_long, right_lat, right_long, category = line.split(',')
        if left_id not in STORED_IMG_IDS:
            print('saving img {}…'.format(left_id))
            urlrequest.urlretrieve(IMG_URL.format(left_lat, left_long), IMG_PATH.format(left_id))
            STORED_IMG_IDS.add(left_id)
            with open(STORED_IMG_ID_FILE, 'a') as output_file:
                output_file.write('{}\n'.format(left_id))
            time.sleep(1)  # wait some time, trying to avoid Google blocking the crawler
        if right_id not in STORED_IMG_IDS:
            print('saving img {}…'.format(right_id))
            urlrequest.urlretrieve(IMG_URL.format(right_lat, right_long), IMG_PATH.format(right_id))
            STORED_IMG_IDS.add(right_id)
            with open(STORED_IMG_ID_FILE, 'a') as output_file:
                output_file.write('{}\n'.format(right_id))
            time.sleep(1)  # wait some time, trying to avoid Google blocking the crawler
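One failed download (a dead proxy, a rate limit) currently aborts the whole run. A hedged variation, reusing the names defined in step 1, is to move the per-image work into a helper (save_image is a hypothetical name, not from the course code) that catches errors and randomizes the pause:

def save_image(img_id, lat, lng):
    """Hypothetical helper: download one Street View image and record its id."""
    if img_id in STORED_IMG_IDS:
        return
    try:
        urlrequest.urlretrieve(IMG_URL.format(lat, lng), IMG_PATH.format(img_id))
    except Exception as exc:
        print('failed to save img {}: {}'.format(img_id, exc))
        return
    STORED_IMG_IDS.add(img_id)
    with open(STORED_IMG_ID_FILE, 'a') as output_file:
        output_file.write('{}\n'.format(img_id))
    time.sleep(random.uniform(1, 3))  # randomized pause instead of a fixed second

# Inside the loop over votes.csv, the two if-blocks then reduce to:
# save_image(left_id, left_lat, left_long)
# save_image(right_id, right_lat, right_long)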

