A Detailed Guide to the requests Module for Python Web Scraping


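requests lets you simulate the requests a browser makes. Compared with urllib, which we used earlier, its API is far more convenient (under the hood it is a wrapper around urllib3).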

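Note: requests only downloads the page content; it does not execute any JavaScript. For pages that load data through JS, you have to analyze the target site yourself and issue the follow-up requests.
Official documentation: http://cn.python-requests.org/zh_CN/latest/
Installation: pip3 install requests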

The per-method helpers (requests.get, requests.post, and so on) are all built on top of this one method:

requests.request(method, url, **kwargs)

The most commonly used methods are post and get. In essence, post and get are just thin wrappers around request.

>>> r = requests.get('https://api.github.com/events')
# equivalent to requests.request(method='get', url='https://api.github.com/events')
>>> r = requests.post('http://httpbin.org/post', data={'key': 'value'})
# equivalent to requests.request(method='post', url='http://httpbin.org/post', data={'key': 'value'})
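The wrapping relationship can be pictured with a toy stand-in. The sketch below does not import requests at all: `request` here is a hypothetical stub that only records its arguments, while `get` and `post` forward to it in the same way as the library helpers of the same names.

```python
def request(method, url, **kwargs):
    """Stand-in for requests.request: record the call instead of sending it."""
    return {"method": method.upper(), "url": url, **kwargs}


def get(url, params=None, **kwargs):
    # GET helper: fixes the method and forwards everything else
    return request("get", url, params=params, **kwargs)


def post(url, data=None, json=None, **kwargs):
    # POST helper: same idea, with the body arguments named explicitly
    return request("post", url, data=data, json=json, **kwargs)


via_helper = post("http://httpbin.org/post", data={"key": "value"})
via_request = request("post", "http://httpbin.org/post", data={"key": "value"}, json=None)
print(via_helper == via_request)
```

Both calls end up with the same method, URL, and keyword arguments, which is why anything that works with a helper also works with request directly.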

def request(method, url, **kwargs):
    """Constructs and sends a :class:`Request <Request>`.

    :param method: method for the new :class:`Request` object.
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
    :param data: (optional) Dictionary or list of tuples ``[(key, value)]`` (will be form-encoded), bytes, or file-like object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
    :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
    :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': file-tuple}``) for multipart encoding upload.
        ``file-tuple`` can be a 2-tuple ``('filename', fileobj)``, 3-tuple ``('filename', fileobj, 'content_type')``
        or a 4-tuple ``('filename', fileobj, 'content_type', custom_headers)``, where ``'content-type'`` is a string
        defining the content type of the given file and ``custom_headers`` a dict-like object containing additional headers
        to add for the file.
    :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
    :param timeout: (optional) How many seconds to wait for the server to send data
        before giving up, as a float, or a :ref:`(connect timeout, read
        timeout) <timeouts>` tuple.
    :type timeout: float or tuple
    :param allow_redirects: (optional) Boolean. Enable/disable GET/OPTIONS/POST/PUT/PATCH/DELETE/HEAD redirection. Defaults to ``True``.
    :type allow_redirects: bool
    :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
    :param verify: (optional) Either a boolean, in which case it controls whether we verify
        the server's TLS certificate, or a string, in which case it must be a path
        to a CA bundle to use. Defaults to ``True``.
    :param stream: (optional) if ``False``, the response content will be immediately downloaded.
    :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response

    Usage::

      >>> import requests
      >>> req = requests.request('GET', 'http://httpbin.org/get')
      <Response [200]>
    """

    # By using the 'with' statement we are sure the session is closed, thus we
    # avoid leaving sockets open which can trigger a ResourceWarning in some
    # cases, and look like a memory leak in others.
    with sessions.Session() as session:
        return session.request(method=method, url=url, **kwargs)

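The rest of this article goes through the parameters in that source code one by one.

method and url specify the request method and the request URL: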

requests.request(method='get', url='http://127.0.0.1:8000/test/')
requests.request(method='post', url='http://127.0.0.1:8000/test/')

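requests offers three ways to attach parameters to a request: data, json, and params.
params is typically used with GET requests; data and json are typically used with POST requests.
params accepts:

– a dictionary
– a string
– bytes (ASCII characters only)

Dictionaries and strings are automatically encoded and appended to the URL, as shown here: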

import requests

wd = 'egon老师'
pn = 1

response = requests.get('https://www.baidu.com/s',
                        params={
                            'wd': wd,
                            'pn': pn
                        },
                        headers={
                            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
                        })
print(response.url)
# Output: https://www.baidu.com/s?wd=egon%E8%80%81%E5%B8%88&pn=1
# The URL has been percent-encoded automatically

The code above is equivalent to the following; the encoding that params performs is essentially urlencode.

import requests
from urllib.parse import urlencode

wd = 'egon老师'
encode_res = urlencode({'k': wd}, encoding='utf-8')
keyword = encode_res.split('=')[1]
print(keyword)
# Then splice the encoded keyword into the URL
url = 'https://www.baidu.com/s?wd=%s&pn=1' % keyword

response = requests.get(url,
                        headers={
                            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
                        })
print(response.url)
# Output: https://www.baidu.com/s?wd=egon%E8%80%81%E5%B8%88&pn=1
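To see what params produces without sending anything, one option is to build a `requests.Request` and call `prepare()` on it. The sketch below assumes the local test URL used elsewhere in this article and a made-up parameter set; it also shows that a list value is expanded into a repeated key.

```python
import requests

# Build the request locally; prepare() encodes the URL without any network traffic
prepared = requests.Request(
    "GET",
    "http://127.0.0.1:8000/test/",
    params={"k1": "v1", "k3": ["v3", "vv3"]},
).prepare()

print(prepared.url)
```

This is handy when debugging a scraper, since the final URL can be compared against the one the browser sends.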

One more caveat: when params is given as bytes, it must not contain any non-ASCII characters. The following call is therefore wrong:

import requests

# re = requests.request(method='get',
#                       url='http://127.0.0.1:8000/test/',
#                       params=bytes("k1=v1&k2=水电费&k3=v3&k3=vv3", encoding='utf8'))
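If a bytes query string is really needed, one possible workaround is to percent-encode the values first, so that the resulting bytes contain only ASCII. This is a sketch using the same keys and values as the failing call; nothing is sent, the request is only prepared locally.

```python
import requests
from urllib.parse import urlencode

# Percent-encode first, so the resulting bytes are pure ASCII
pairs = [("k1", "v1"), ("k2", "水电费"), ("k3", "v3"), ("k3", "vv3")]
query_bytes = urlencode(pairs, encoding="utf-8").encode("ascii")

prepared = requests.Request(
    "GET", "http://127.0.0.1:8000/test/", params=query_bytes
).prepare()
print(prepared.url)
```

In most cases passing a dictionary or a list of tuples to params is simpler and avoids the problem altogether.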

data accepts a dictionary, a string, bytes, or a file object. The difference between data and json lies in the request body: with data the body is form-encoded, like name=alex&age=18, whereas with json the body is a JSON string such as '{"k1": "v1", "k2": "水电费"}'.

requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 data={'k1': 'v1', 'k2': '水电费'})

requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 data="k1=v1; k2=v2; k3=v3; k3=v4")

requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 data="k1=v1;k2=v2;k3=v3;k3=v4",
                 headers={'Content-Type': 'application/x-www-form-urlencoded'})

# The file content is: k1=v1;k2=v2;k3=v3;k3=v4
requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 data=open('data_file.py', mode='r', encoding='utf-8'),
                 headers={'Content-Type': 'application/x-www-form-urlencoded'})

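json serializes the given data into a string with json.dumps(…),
then sends it to the server in the request body, with the header {'Content-Type': 'application/json'}.
Telltale sign: the body is sent as a raw payload rather than as form fields.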

requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 json={'k1': 'v1', 'k2': '水电费'})
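The difference between data and json is easiest to see by preparing both requests locally and comparing them. The sketch below sends nothing; the payload is an illustrative stand-in and the URL is the local test address used in this article.

```python
import json

import requests

url = "http://127.0.0.1:8000/test/"
payload = {"k1": "v1", "k2": "v2"}

# Same payload, two encodings: form fields versus a JSON document
form = requests.Request("POST", url, data=payload).prepare()
as_json = requests.Request("POST", url, json=payload).prepare()

print(form.headers["Content-Type"], form.body)
print(as_json.headers["Content-Type"], json.loads(as_json.body))
```

Which of the two a site expects can usually be read off the Content-Type of the request the browser sends.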

headers sends request headers to the server:

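# An explicit Content-Type header replaces the application/json default;
# the body itself is still the JSON string
requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 json={'k1': 'v1', 'k2': '水电费'},
                 headers={'Content-Type': 'application/x-www-form-urlencoded'})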
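For scraping, the header that matters most is usually User-Agent. As a rough illustration, the sketch below prepares two requests through a Session (nothing is sent) to compare the library default with a browser-like value; the User-Agent string is the one used earlier in this article.

```python
import requests

session = requests.Session()
url = "http://127.0.0.1:8000/test/"
browser_ua = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36",
}

# prepare_request merges the session's default headers with the per-request ones
default_req = session.prepare_request(requests.Request("GET", url))
custom_req = session.prepare_request(requests.Request("GET", url, headers=browser_ua))

print(default_req.headers["User-Agent"])
print(custom_req.headers["User-Agent"])
```

The default value identifies the client as python-requests, which some sites reject, so overriding it is a common first step.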

# Send cookies to the server
requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 data={'k1': 'v1', 'k2': 'v2'},
                 cookies={'cook1': 'value1'},
                 )

# A CookieJar works as well (the dict form is converted into one internally)
from http.cookiejar import CookieJar
from http.cookiejar import Cookie

obj = CookieJar()
obj.set_cookie(Cookie(version=0, name='c1', value='v1', port=None, domain='', path='/', secure=False, expires=None,
                      discard=True, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False,
                      port_specified=False, domain_specified=False, domain_initial_dot=False, path_specified=False)
               )
requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 data={'k1': 'v1', 'k2': 'v2'},
                 cookies=obj)
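What the cookies argument does on the wire can be checked by preparing the request locally. A minimal sketch, reusing the dictionary form from above and sending nothing:

```python
import requests

prepared = requests.Request(
    "POST",
    "http://127.0.0.1:8000/test/",
    data={"k1": "v1", "k2": "v2"},
    cookies={"cook1": "value1"},
).prepare()

# The dict ends up as an ordinary Cookie request header
print(prepared.headers["Cookie"])
```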

# Send a file
file_dict = {
    'f1': open('readme', 'rb')
}
requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 files=file_dict)

# Send a file with a custom file name
file_dict = {
    'f1': ('test.txt', open('readme', 'rb'))
}
requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 files=file_dict)

# Send a string as the file content, with a custom file name
file_dict = {
    'f1': ('test.txt', "hahsfaksfa9kasdjflaksdjf")
}
requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 files=file_dict)

# Send a string as the file content, with a custom file name, content type, and extra headers
file_dict = {
    'f1': ('test.txt', "hahsfaksfa9kasdjflaksdjf", 'application/text', {'k1': '0'})
}
requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 files=file_dict)
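To see how files is turned into a multipart body, the request can again be prepared without sending it. The sketch below reuses the 4-tuple form with in-memory string content, so no file on disk is needed.

```python
import requests

file_dict = {
    "f1": ("test.txt", "hahsfaksfa9kasdjflaksdjf", "application/text", {"k1": "0"}),
}
prepared = requests.Request(
    "POST", "http://127.0.0.1:8000/test/", files=file_dict
).prepare()

# requests generates the boundary and sets the multipart Content-Type itself
print(prepared.headers["Content-Type"])
print(prepared.body.decode("utf-8"))
```

Because the boundary is generated per request, it is usually best not to set Content-Type by hand when using files.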

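auth handles the browser's built-in authentication prompt.
Authentication setup: when you open such a site, a dialog pops up asking for a user name and password (it looks much like an alert box). Until you authenticate, the HTML cannot be fetched. Under the hood, though, the credentials are simply assembled into a request header and sent: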

r.headers['Authorization'] = _basic_auth_str(self.username, self.password)

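Most websites do not use this default scheme and implement their own instead. In that case we need to follow the site's scheme and write our own function along the lines of _basic_auth_str.
Once we have the encoded string, we add it to the request headers: r.headers['Authorization'] = func('…')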
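One way to plug such a function into requests is to subclass `requests.auth.AuthBase`. The class below is a hypothetical sketch: `HeaderAuth` and `build_value` are illustrative names, and the encoding shown is just the standard Basic rule standing in for whatever scheme the target site actually uses.

```python
import base64

import requests
from requests.auth import AuthBase


class HeaderAuth(AuthBase):
    """Hypothetical scheme: attach a computed Authorization value."""

    def __init__(self, username, password):
        self.username = username
        self.password = password

    def build_value(self):
        # Placeholder rule (same as HTTP Basic); replace with the site's own scheme
        raw = f"{self.username}:{self.password}".encode("latin1")
        return "Basic " + base64.b64encode(raw).decode("ascii")

    def __call__(self, r):
        # requests calls this hook with the prepared request before sending it
        r.headers["Authorization"] = self.build_value()
        return r


prepared = requests.Request(
    "GET", "http://127.0.0.1:8000/test/", auth=HeaderAuth("user", "pass")
).prepare()
print(prepared.headers["Authorization"])
```

An instance can then be passed as auth= to requests.get or set once on a Session.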
What HTTPBasicAuth actually does is send the server a request carrying an Authorization: … header.

# HTTPBasicAuth
import requests
from requests.auth import HTTPBasicAuth, HTTPDigestAuth

ret = requests.get('https://api.github.com/user', auth=HTTPBasicAuth('wupeiqi', 'sdfasdfasdf'))
print(ret.text)

Other ways to use auth:

# ret = requests.get('http://192.168.1.1',
#                    auth=HTTPBasicAuth('admin', 'admin'))
# ret.encoding = 'gbk'
# print(ret.text)

# ret = requests.get('http://httpbin.org/digest-auth/auth/user/pass', auth=HTTPDigestAuth('user', 'pass'))
# print(ret)

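timeout comes in two forms: float or tuple
timeout=0.1  # a single value applies to both the connect timeout and the read (receive data) timeout
timeout=(0.1, 0.2)  # 0.1 is the connect timeout, 0.2 is the read (receive data) timeout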

import requests

# A timeout this small will almost certainly raise a timeout exception
response = requests.get('https://www.baidu.com',
                        timeout=0.0001)
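In a scraper, a timeout is only useful if the resulting exception is handled. The helper below is a minimal sketch: `fetch` is a hypothetical name and the default timeout values are arbitrary; the exception classes are the ones requests raises for connect and read timeouts.

```python
import requests


def fetch(url, timeout=(3.05, 10)):
    """Return the response, or None if the request timed out."""
    try:
        return requests.get(url, timeout=timeout)
    except requests.exceptions.ConnectTimeout:
        print("could not connect within the connect timeout")
    except requests.exceptions.ReadTimeout:
        print("connected, but the server was too slow to send data")
    return None
```

Both exceptions derive from requests.exceptions.Timeout, so a single except clause for that class is enough when the distinction does not matter.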

# allow_redirects=False returns the redirect response itself instead of following it
ret = requests.get('http://127.0.0.1:8000/test/', allow_redirects=False)
print(ret.text)
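With redirects disabled, the interesting part of the response is usually its status code and Location header rather than its text. The helper below is a small sketch (`redirect_target` is a hypothetical name) that resolves the Location header against the request URL.

```python
from urllib.parse import urljoin

import requests


def redirect_target(response):
    """Return the absolute URL a response redirects to, or None."""
    if response.is_redirect:
        # Location may be relative, so resolve it against the request URL
        return urljoin(response.url, response.headers["Location"])
    return None
```

It would be called as redirect_target(requests.get(url, allow_redirects=False)). When redirects are followed instead, the intermediate responses are available in the final response's history list.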

Proxy settings

# Choose the proxy according to the protocol of the request
proxies = {
    "http": "61.172.249.96:80",
    "https": "http://61.185.219.126:3128",
}

# Choose the proxy according to the host that receives the request
proxies = {'http://10.20.1.128': 'http://10.10.1.10:5323'}

ret = requests.get("http://www.proxyIPIPGO.cn/Proxy", proxies=proxies)
print(ret.headers)
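When a proxies dictionary mixes per-protocol and per-host keys, the per-host entry is tried first. One way to check this locally is `select_proxy` from `requests.utils`, a helper the library uses internally when picking a proxy; the sketch below combines the two kinds of key shown above, with `example.com` as a placeholder host.

```python
from requests.utils import select_proxy

proxies = {
    "http": "http://61.172.249.96:80",
    "http://10.20.1.128": "http://10.10.1.10:5323",
}

# A key of the form scheme://host wins over the bare scheme key
print(select_proxy("http://10.20.1.128/path", proxies))
print(select_proxy("http://example.com/", proxies))
print(select_proxy("https://example.com/", proxies))
```

The last call finds no matching key, which means such a request would go out without a proxy from this dictionary.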

from requests.auth import HTTPProxyAuth

proxyDict = {
    'http': '77.75.105.165',
    'https': '77.75.105.165'
}
auth = HTTPProxyAuth('username', 'mypassword')

r = requests.get("http://www.google.com", proxies=proxyDict, auth=auth)
print(r.text)

# SOCKS proxies are supported too; install with: pip install requests[socks]
import requests

proxies = {
    'http': 'socks5://user:pass@host:port',
    'https': 'socks5://user:pass@host:port'
}
response = requests.get('https://www.12306.cn',
                        proxies=proxies)

print(response.status_code)

# With stream=True the body is not downloaded until it is accessed
ret = requests.get('http://127.0.0.1:8000/test/', stream=True)
print(ret.content)
ret.close()

# from contextlib import closing
# with closing(requests.get('http://httpbin.org/get', stream=True)) as r:
#     # Process the response here.
#     for i in r.iter_content():
#         print(i)
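The typical use of stream=True is downloading a large body in chunks instead of holding it all in memory. The sketch below is one possible shape for that; `save_response` and `download` are hypothetical helper names, and the chunk size and timeout are arbitrary choices.

```python
import requests


def save_response(response, fileobj, chunk_size=8192):
    """Copy a streamed response body into fileobj; return the bytes written."""
    written = 0
    for chunk in response.iter_content(chunk_size=chunk_size):
        fileobj.write(chunk)
        written += len(chunk)
    return written


def download(url, path):
    # stream=True defers the body; the with-block releases the connection afterwards
    with requests.get(url, stream=True, timeout=10) as response:
        response.raise_for_status()
        with open(path, "wb") as fileobj:
            return save_response(response, fileobj)
```

A call such as download('http://127.0.0.1:8000/test/', 'test.bin') would then write the body to disk one chunk at a time.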

import requests

session = requests.Session()

# 1. Visit any page first to obtain a cookie
i1 = session.get(url="http://dig.chouti.com/help/service")

# 2. Log in, carrying the cookie from step 1; the backend authorizes the gpsd value in that cookie
i2 = session.post(
    url="http://dig.chouti.com/login",
    data={
        'phone': "8615131255089",
        'password': "xxxxxx",
        'oneMonth': ""
    }
)

# 3. Vote on a link; the session sends the now-authorized cookie automatically
i3 = session.post(
    url="http://dig.chouti.com/link/vote?linksId=8589623",
)
print(i3.text)
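The reason the later requests are authorized is that a Session keeps its cookie jar between calls. The sketch below illustrates this offline: the cookie is set by hand to stand in for what a real Set-Cookie response header would store, its value is made up, and the request is only prepared, not sent.

```python
import requests

session = requests.Session()

# Stand-in for the cookie a real response would have stored in the session
session.cookies.set("gpsd", "example-value")

# Any later request prepared through the session carries the stored cookie
prepared = session.prepare_request(
    requests.Request("POST", "http://dig.chouti.com/link/vote?linksId=8589623")
)
print(prepared.headers["Cookie"])
```

With plain requests.get and requests.post calls, each request starts with an empty jar, so the cookie would have to be passed along manually.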

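import requests

response = requests.get('http://www.autohome.com/news')
# response.encoding = 'gbk'
# The Autohome site returns pages encoded as gb2312, while requests falls back to
# ISO-8859-1 when the response headers declare no charset; unless the encoding is
# set to gbk, the Chinese text comes out garbled
print(response.text)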
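The garbling itself can be reproduced without any network access, using only bytes and two decodings. The sample string below is an arbitrary stand-in for page content.

```python
# Bytes as a gb2312 server would send them
raw = "汽车之家".encode("gb2312")

garbled = raw.decode("iso-8859-1")   # decoding with the wrong charset
correct = raw.decode("gbk")          # gbk is a superset of gb2312

print(garbled)
print(correct)
```

When the right charset is not known in advance, setting response.encoding = response.apparent_encoding is a common fallback, since requests can guess the encoding from the body.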


Posted in: Python Web Scraping

2022-10-28


