Python Web Scraping (A Simple Way to Fetch Page Content)

The page I use here is https://m.douban.com/group/729027/

What I want to scrape is every discussion listed on that page.

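Press F12 to inspect the current page.

Click the element picker in the developer tools (the circled button in the original screenshot), then click the discussion that starts with “婉卿……” on the page. The panel on the right jumps straight to the source for that line.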

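Right-click the highlighted source line and choose Copy, then Copy selector.

The copied value is: #group-topics > div:nth-child(2) > table > tbody > tr:nth-child(2) > td.title > a

You can think of this as the address of that discussion title inside the HTML.

Copy the selector for a few other discussions to find the pattern:

#group-topics > div:nth-child(2) > table > tbody > tr:nth-child(5) > td.title > a

In the last three segments, the only thing that changes is the index in tr:nth-child. Dropping that index leaves tr td.title a, which is the selector we will use to match every discussion title.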
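To see the difference between the copied selector and the shortened one, here is a minimal sketch. The HTML is a hypothetical sample that only imitates the structure implied by the copied selectors; it is not the real Douban page, and the topic names are made up.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML that mimics the structure implied by the copied selectors.
SAMPLE_HTML = """
<div id="group-topics">
  <div>Header</div>
  <div>
    <table>
      <tbody>
        <tr><th>Topic</th></tr>
        <tr><td class="title"><a href="#" title="Topic A">Topic A</a></td></tr>
        <tr><td class="title"><a href="#" title="Topic B">Topic B</a></td></tr>
        <tr><td class="title"><a href="#" title="Topic C">Topic C</a></td></tr>
      </tbody>
    </table>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The selector copied from the browser pins one specific row (the 2nd <tr>).
full_selector = (
    "#group-topics > div:nth-child(2) > table > tbody > "
    "tr:nth-child(2) > td.title > a"
)
single_titles = [a.get("title") for a in soup.select(full_selector)]

# The shortened selector has no row index, so it matches every row.
all_titles = [a.get("title") for a in soup.select("tr td.title a")]

print(single_titles)
print(all_titles)
```

The shortened selector also no longer depends on the exact nesting above the row, so it should keep working if the wrapper elements around the table differ slightly.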

from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import xlwt  # imported in the original; not used in the code shown here

url = input('Please enter the URL here:')
# Send a browser-like User-Agent so the request looks like a normal visit
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
ret = Request(url, headers=headers)
res = urlopen(ret)
aa = res.read().decode('utf-8')

soup = BeautifulSoup(aa, 'html.parser')
comment = soup.select('tr td.title a')
# Replace each <a> tag in the list with the value of its title attribute
for i in range(0, len(comment)):
    comment[i] = comment[i].get('title')

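The idea behind the code is simple: your computer imitates a browser visiting the page and receives the HTML source that the server returns.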
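As a small illustration of the "imitate a browser" part, the sketch below builds a request object and inspects it without sending anything over the network. The URL and the User-Agent string are placeholders I chose for the example, not values from the article.

```python
from urllib.request import Request

# Placeholder values: example.com and this User-Agent string are assumptions.
url = "https://example.com/"
headers = {"User-Agent": "Mozilla/5.0 (compatible; demo)"}

req = Request(url, headers=headers)

# With no request body, urllib issues a GET, like a browser loading a page.
method = req.get_method()

# urllib stores header names in capitalized form, hence "User-agent".
user_agent = req.get_header("User-agent")

print(method, req.full_url, user_agent)

# Sending it would look like this (left commented out to avoid network access):
# html = urlopen(req).read().decode("utf-8")
```

The User-Agent header is the part that makes the request look like it comes from a browser; some sites respond differently to the default urllib value.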

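BeautifulSoup is a Python package commonly used for scraping. It takes the selector we just worked out, filters the page's full HTML with it, and returns only the parts we want.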

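In the HTML source we looked at earlier, the text of the discussion sits in the title attribute of the tag, so get('title') returns it. After the loop, comment holds the discussion titles we were after.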
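One detail worth knowing about get('title') is how it behaves when the attribute is missing. The sketch below uses a made-up fragment with one link that has a title attribute and one that does not; the fallback to the visible text is my own suggestion, not something the original code does.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second link has no title attribute.
html = (
    '<table><tr><td class="title">'
    '<a href="/t/1" title="Full topic title">Full topic...</a>'
    '<a href="/t/2">No title</a>'
    '</td></tr></table>'
)
soup = BeautifulSoup(html, "html.parser")
first, second = soup.select("tr td.title a")

with_title = first.get("title")   # the attribute value
missing = second.get("title")     # None, because the attribute is absent

# Assumed fallback: use the visible link text when there is no title.
fallback = second.get("title") or second.get_text(strip=True)

print(with_title, missing, fallback)
```

Because get returns None instead of raising an error, a list built this way can contain None entries, so it may be worth checking for them before saving the results.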

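What if the value you want, such as the name “小悠哉”, is the text inside a tag rather than an attribute?

Use a selector in the same way to grab that piece of HTML, then read its .string.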
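Here is a minimal sketch of how .string behaves, using a made-up fragment (the author class and the second link are hypothetical). It also shows get_text(), which is not used in the original code but can help when a tag contains more than plain text.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one link with plain text, one with nested markup.
html = (
    '<table><tr>'
    '<td class="author"><a href="#">小悠哉</a></td>'
    '<td class="author"><a href="#"><b>VIP</b> user</a></td>'
    '</tr></table>'
)
soup = BeautifulSoup(html, "html.parser")
simple, nested = soup.select("td.author a")

# A tag with exactly one text child: .string returns that text.
simple_name = simple.string

# A tag with several children: .string is None.
nested_name = nested.string

# get_text() joins all the text inside the tag instead.
nested_text = nested.get_text()

print(simple_name, nested_name, nested_text)
```

So .string is fine for simple cells like the author name, but it can silently give None when the markup inside the tag is richer.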

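soup = BeautifulSoup(aa, 'html.parser')

# Discussion titles: stored in the title attribute of each link
comment = soup.select('tr td.title a')
for i in range(0, len(comment)):
    comment[i] = comment[i].get('title')

# Author names: the text inside the link in the second column
author = soup.select('td:nth-child(2) a')
for i in range(0, len(author)):
    author[i] = author[i].string

# Reply counts: the text inside the r-count cell
count = soup.select('tr td.r-count')
for i in range(0, len(count)):
    count[i] = count[i].string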
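The three lists above are built separately, so they only line up if every row has all three fields. As an alternative sketch, the code below walks the table row by row and then writes the result out. The sample HTML and its values are hypothetical, and I use the standard csv module here instead of xlwt, which the original imports but does not show in use.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical table that mimics the three columns selected above.
SAMPLE_HTML = """
<table>
  <tr><th>Topic</th><th>Author</th><th>Replies</th></tr>
  <tr>
    <td class="title"><a href="#" title="Topic A">Topic A</a></td>
    <td><a href="#">alice</a></td>
    <td class="r-count">12</td>
  </tr>
  <tr>
    <td class="title"><a href="#" title="Topic B">Topic B</a></td>
    <td><a href="#">bob</a></td>
    <td class="r-count"></td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

rows = []
for tr in soup.select("tr"):
    title_link = tr.select_one("td.title a")
    if title_link is None:
        continue  # skip rows without a title link, such as the header row
    author_link = tr.select_one("td:nth-child(2) a")
    count_cell = tr.select_one("td.r-count")
    rows.append({
        "title": title_link.get("title"),
        "author": author_link.string if author_link else None,
        "replies": count_cell.string if count_cell else None,
    })

# Write to an in-memory buffer; a real script would open a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "author", "replies"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

print(csv_text)
```

Keeping each row's fields together means a missing reply count shows up as an empty cell in that row rather than shifting every later value up by one.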

Posted in: Python Web Scraping

2022-10-24

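1. First, pick the site you want to scrape and the content you want from it

2. Inspect the page's HTML and find the source for the discussion list

3. Write the code in Python

4. Additional notes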
