Four Python Web Scraping Examples


- 1. Scraping a JD.com product page
- 2. Scraping an Amazon product page

It helps to review the basics of web crawlers first, then work through the examples below to learn the common scraping techniques.

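1. Scraping a JD.com product page

    import requests

    url = "https://item.jd.com/3112072.html"
    try:
        r = requests.get(url)
        # Raises requests.HTTPError for a 4xx/5xx status; a 200 passes
        # silently, meaning the content behind the link was retrieved.
        r.raise_for_status()
        # apparent_encoding is detected from the response body (GBK for
        # this page), so the text is decoded correctly.
        r.encoding = r.apparent_encoding
        print(r.text[:1000])
    except requests.RequestException:
        print("Scrape failed")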

Result:

    C:\Users\Admin\Anaconda3\python.exe "E:/2019/May 1/spider JD.py"
    <!DOCTYPE HTML>
    <html lang="zh-CN">
    <head>
    <!-- shouji -->
    <meta http-equiv="Content-Type" content="text/html; charset=gbk" />
    <title>【全棉时代棉柔巾】全棉时代居家洁面纯棉柔巾纯棉抽纸巾湿水可用洗脸巾擦脸巾20cm*20cm 6包/提【行情报价价格评测】-京东</title>
    <meta name="keywords" content="PurCotton棉柔巾,全棉时代棉柔巾,全棉时代棉柔巾报价,PurCotton棉柔巾报价"/>
    <meta name="description" content="【全棉时代棉柔巾】京东JD.COM提供全棉时代棉柔巾正品行货，并包括PurCotton棉柔巾网购指南，以及全棉时代棉柔巾图片、棉柔巾参数、棉柔巾评论、棉柔巾心得、棉柔巾技巧等信息，网购全棉时代棉柔巾上京东,放心又轻松" />
    <meta name="format-detection" content="telephone=no">
    <meta http-equiv="mobile-agent" content="format=xhtml; url=//item.m.jd.com/product/3112072.html">
    <meta http-equiv="mobile-agent" content="format=html5; url=//item.m.jd.com/product/3112072.html">
    <meta http-equiv="X-UA-Compatible" content="IE=Edge">
    <link rel="canonical" href="//item.jd.com/3112072.html"/>
    <link rel="dns-prefetch" href="//misc.360buyimg.com"/>
    <link rel="dns-prefetch" href="//static.360buyimg.com"/>
    <link rel="dns-prefetch" href="

    Process finished with exit code 0
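The `r.encoding = r.apparent_encoding` step matters because this page declares `charset=gbk`. The sketch below is only an illustration: it uses a made-up sample string, not real JD.com output, and assumes the response headers named no charset, in which case requests falls back to ISO-8859-1 for text responses. It shows how the same bytes read under the wrong and the right codec.

```python
# Sample text standing in for page content (an assumption, not real JD.com output).
sample_text = "京东商品"

# Bytes as a GBK-encoded page would carry them over the wire.
raw_bytes = sample_text.encode("gbk")

# Decoding with a codec that does not match the page yields mojibake.
decoded_wrong = raw_bytes.decode("iso-8859-1")

# Decoding with the codec the page actually uses recovers the text.
decoded_right = raw_bytes.decode("gbk")

print(repr(decoded_wrong))
print(decoded_right)
```

Setting `r.encoding` before reading `r.text` is what selects the codec for that decoding step.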

2. Scraping an Amazon product page

    import requests

    url = "https://www.amazon.cn/dp/B01M8L5Z3Y/ref=sr_1_1?ie=UTF8&qid=1551540666&sr=8-1&keywords=%E6%9E%81%E7%AE%80"
    try:
        r = requests.get(url)
        # Raises requests.HTTPError for a 4xx/5xx status.
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        print(r.text[:1000])
    except requests.RequestException:
        print("Scrape failed")

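Result:
The steps are the same as for the JD.com page, but no product information comes back, which suggests that Amazon is restricting access by our crawler.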

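Ways a site can restrict web crawlers:
Source checking: the server inspects the User-Agent field of the incoming HTTP request headers and responds only to browsers or friendly crawlers.
Published policy: the Robots protocol (robots.txt) tells all crawlers the site's crawling policy and asks them to comply.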
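A crawler can honor the Robots protocol by checking a URL against the site's robots.txt before requesting it. The sketch below uses the standard library's `urllib.robotparser`. The robots.txt content, the `my-crawler` agent name, the example.com URLs and the `is_allowed` helper are all hypothetical, chosen only for illustration; they are not Amazon's or JD.com's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "my-crawler", "https://example.com/item/1.html"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-crawler", "https://example.com/private/data.html"))
```

For a live site, `RobotFileParser.set_url()` followed by `read()` would download the real robots.txt instead of parsing a string.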

    import requests

    url = "https://www.amazon.cn/dp/B01M8L5Z3Y/ref=sr_1_1?ie=UTF8&qid=1551540666&sr=8-1&keywords=%E6%9E%81%E7%AE%80"
    r = requests.get(url)
    print(r.status_code)
    print(r.encoding)
    # The Response object keeps the request that produced it, so
    # r.request.headers shows exactly which headers we sent to Amazon.
    print(r.request.headers)

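The headers include the field 'User-Agent': 'python-requests/2.18.4', which means our crawler told Amazon's server that the request came from a program using Python's requests library. Amazon's source checking may well reject such requests.

So we can try changing the header so that the request looks like it comes from a browser, as follows: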

    import requests

    # "Mozilla/5.0" is a generic browser identification string.
    kv = {'User-Agent': 'Mozilla/5.0'}
    url = "https://www.amazon.cn/dp/B07G7K1Z98/ref=sr_1_3?ie=UTF8&qid=1551539393&sr=8-3&keywords=%E5%B0%8F%E9%B8%9F%E8%80%B3%E6%9C%BA"
    # Pass headers=kv so the modified User-Agent is actually sent.
    r = requests.get(url, headers=kv)
    print(r.status_code)
    print(r.request.headers)
    print(r.text[1000:2000])

Result:

    C:\Users\Admin\Anaconda3\python.exe "E:/2019/May 1/spider Amazon.py"
    200
    {'User-Agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

    ue.stub(ue,"event");ue.stub(ue,"onSushiUnload");ue.stub(ue,"onSushiFlush");

    var ue_url='/gp/product/B07G7K1Z98/uedata/unsticky/460-8525492-7331338/NoPageType/ntpoffrw', ue_sid='460-8525492-7331338', ue_mid='AAHKV2X7AFYLW', ue_sn='www.amazon.cn', ue_furl='fls-cn.

    Process finished with exit code 0

As the output shows, the crawler retrieves the page normally once the User-Agent header is changed.
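To see which User-Agent a request would carry without contacting any server, one option is to prepare the request through a `requests.Session` and inspect its headers. This is a minimal sketch; the example.com URL is a placeholder assumption, not a page from this article.

```python
import requests

# Placeholder URL (an assumption); nothing is sent over the network below.
url = "https://example.com/"

session = requests.Session()

# prepare_request() merges the session's default headers into the request,
# which is where the library's own User-Agent comes from.
default_req = session.prepare_request(requests.Request("GET", url))

# Supplying our own User-Agent overrides the default value.
custom_req = session.prepare_request(
    requests.Request("GET", url, headers={"User-Agent": "Mozilla/5.0"})
)

print(default_req.headers["User-Agent"])
print(custom_req.headers["User-Agent"])
```

The first value depends on the installed requests version (it was python-requests/2.18.4 in the run above); the second is whatever the headers dictionary supplies.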
After this experiment, the revised crawler looks like this:

    import requests

    url = "https://www.amazon.cn/dp/B07G7K1Z98/ref=sr_1_3?ie=UTF8&qid=1551539393&sr=8-3&keywords=%E5%B0%8F%E9%B8%9F%E8%80%B3%E6%9C%BA"
    # "Mozilla/5.0" is a generic browser identification string.
    kv = {'User-Agent': 'Mozilla/5.0'}
    try:
        r = requests.get(url, headers=kv)
        # Raises requests.HTTPError for a 4xx/5xx status.
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        print(r.text[:1000])
    except requests.RequestException:
        print("Scrape failed")


Posted in: Python crawlers

2022-10-28
