Semi-Automated Web Scraper


Goal: scrape the content of a forum post along with its replies.
Steps:

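First, open the page and save its source. Right-click the page, choose [View Page Source], and copy all of the text from the resulting page into a plain-text file.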

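Once the file is saved, adjust the regular expressions for the values you want to extract, update the file path, and run the code below.
Note: the regular expressions must be written to fit the data you actually need to extract. Runs of whitespace can be matched with \s+.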
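To illustrate the note on \s+, here is a minimal sketch on a made-up HTML fragment. The `sample` string and its `class="content"` attribute are assumptions for illustration and are not taken from the real page. It uses a raw-string pattern with `re.S`, so that `.` also matches newlines, and lets `\s+` absorb the line break and indentation in front of the text.

```python
import re

# Hypothetical fragment imitating a post body; the real page markup may differ.
sample = '<div class="content">\n        Hello, world<br></div>'

# r'' keeps \s as a regex escape rather than a string escape.
# re.S lets . match newlines; \s+ consumes the whitespace after the tag.
pattern = r'class="content">\s+(.*?)<'
matches = re.findall(pattern, sample, re.S)
print(matches)
```

The non-greedy `(.*?)` stops at the first `<` it meets, which is why only the leading text is captured.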

    import re
    import csv

    # Load the page source saved in the previous step.
    with open('./data/半自动化爬虫-抗压背锅吧.txt', 'r', encoding='utf-8') as f:
        source = f.read()

    # Raw strings keep \s as a regex escape; re.S lets . match newlines.
    username_list = re.findall(r'username="(.*?)"', source, re.S)
    content_list = re.findall(r'class="d_post_content j_d_post_content " style="display:;">\s+(.*?)<', source, re.S)
    reply_time_list = re.findall(r'class="tail-info">(2022.*?)<', source, re.S)

    # Pair the i-th username, content, and reply time.
    # This assumes the three lists have the same length and order.
    result_list = []
    for i in range(len(username_list)):
        result = {'username': username_list[i],
                  'content': content_list[i],
                  'reply_time': reply_time_list[i]}
        result_list.append(result)

    # newline='' stops the csv module from writing blank rows on Windows.
    with open('半自动化爬虫-抗压背锅吧.csv', 'w', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['username', 'content', 'reply_time'])
        writer.writeheader()
        writer.writerows(result_list)

The final result is a table. Some posts consist of images, which cannot be extracted this way, so only the text content is captured.
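Why image posts come out blank can be seen on a small made-up example. The two `post_*` strings below are assumptions about what such markup might look like, not copies of the real page. The capture group `(.*?)<` stops at the first `<`, so a post that begins with an `<img>` tag yields an empty string.

```python
import re

pattern = r'class="d_post_content j_d_post_content " style="display:;">\s+(.*?)<'

# Hypothetical post bodies, written to imitate the saved page source.
post_text = 'class="d_post_content j_d_post_content " style="display:;">            nice post</div>'
post_image = 'class="d_post_content j_d_post_content " style="display:;">            <img src="a.jpg"></div>'

text_result = re.findall(pattern, post_text, re.S)
image_result = re.findall(pattern, post_image, re.S)

print(text_result)   # the text before the first tag
print(image_result)  # a single empty string: nothing precedes the <img> tag
```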
Revised code

    import re
    import csv

    with open('./data/半自动化爬虫-抗压背锅吧.txt', 'r', encoding='utf-8') as f:
        source = f.read()

    # Grab one large block of text per floor, holding everything that belongs to that floor.
    every_reply = re.findall(r'class="l_post l_post_bright j_l_post clearfix "(.*?)p_props_tail props_appraise_wrap', source, re.S)

    # From each block, extract that floor's username, content, and reply time.
    # Search `each`, not `source`, so the fields of one floor stay together.
    # [0] assumes every block contains a match; an IndexError means a pattern needs adjusting.
    result_list = []
    for each in every_reply:
        result = {}
        result['username'] = re.findall(r'username="(.*?)"', each, re.S)[0]
        result['content'] = re.findall(r'class="d_post_content j_d_post_content " style="display:;">\s+(.*?)<', each, re.S)[0]
        result['reply_time'] = re.findall(r'class="tail-info">(2022.*?)<', each, re.S)[0]
        result_list.append(result)

    with open('半自动化爬虫-抗压背锅吧1.csv', 'w', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['username', 'content', 'reply_time'])
        writer.writeheader()
        writer.writerows(result_list)
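One benefit of splitting the page into per-floor blocks first is that a missing field in one floor can no longer shift the data of the floors after it. The sketch below shows this on a hypothetical two-floor page. The `class="floor"` markup and the `first_match` helper are assumptions introduced for illustration; neither appears in the original code or the real page.

```python
import re

# Hypothetical two-floor page; the first floor has no timestamp starting with 2022.
source = (
    '<div class="floor" username="alice"><span class="tail-info">floor 1</span></div>'
    '<div class="floor" username="bob"><span class="tail-info">2022-10-01 12:00</span></div>'
)

def first_match(pattern, text):
    """Return the first regex match in text, or '' if there is none."""
    found = re.findall(pattern, text, re.S)
    return found[0] if found else ''

# Whole-page extraction: the lists differ in length, so index i no longer means floor i.
all_names = re.findall(r'username="(.*?)"', source, re.S)
all_times = re.findall(r'class="tail-info">(2022.*?)<', source, re.S)

# Per-floor extraction: split into blocks first, then search inside each block.
blocks = re.findall(r'<div class="floor"(.*?)</div>', source, re.S)
rows = [
    {'username': first_match(r'username="(.*?)"', block),
     'reply_time': first_match(r'class="tail-info">(2022.*?)<', block)}
    for block in blocks
]

print(all_names, all_times)
print(rows)
```

With whole-page matching, the only timestamp found would be paired with the first username by index, although it belongs to the second floor. With per-floor matching, the first floor simply gets an empty reply time.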


Posted in: Python Web Scraping

2022-10-25
