Scraping Several Different Web Pages with a Crawler


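'''
This task asks you to complete a simple crawler project covering page fetching, information extraction, and data saving.
Think the problem through carefully and apply your own reasoning as you work on it.
Note: the score for this task combines the order of submission with the correctness of the solution.
Every student receives a different set of pages, so do not copy; any submission found to be copied receives a score of 0.
'''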
from typing import Any, Tuple

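'''
Question 1: Use crawler techniques to fetch the pages at the following 5 URLs and extract the key information.
Extract the following 4 kinds of information from the fetched page source:
1. Article title
2. Body text (only the article's own text; do not extract unrelated text from elsewhere on the page)
3. Image links (if any)
4. Time and date (if any)
'''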
import requests
from bs4 import BeautifulSoup

# The URLs assigned to you: url = ['http://fashion.cosmopolitan.com.cn/2019/1020/287733.shtml', 'http://dress.pclady.com.cn/style/liuxing/1003/520703.html', 'http://www.smartshe.com/trends/20191009/56414.html', 'https://dress.yxlady.com/202004/1560779.shtml', 'http://www.yoka.com/fashion/roadshow/2019/0513/52923401100538.shtml']
url1 = 'http://fashion.cosmopolitan.com.cn/2019/1020/287733.shtml'
url2 = 'http://dress.pclady.com.cn/style/liuxing/1003/520703.html'
url3 = 'http://www.smartshe.com/trends/20191009/56414.html'
url4 = 'https://dress.yxlady.com/202004/1560779.shtml'
url5 = 'http://www.yoka.com/fashion/roadshow/2019/0513/52923401100538.shtml'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'}
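def get_url1(url, data=None):
    # Fetch the URL that was passed in (the original always fetched the global url1).
    resp = requests.get(url, headers=headers)
    resp.encoding = 'gbk'  # the page is GBK-encoded, so decode it as GBK
    soup = BeautifulSoup(resp.text, 'lxml')
    title = soup.title.string
    print('1: Article title:', title)
    # get_text() replaces .string, which is None for tags that contain nested tags.
    texts = [tag.get_text().strip() for tag in soup.find_all(class_='p2')]
    images = [img.get('src') for img in soup.find_all('img')]
    times = [tag.get_text().strip() for tag in soup.find_all(class_='time')]
    for text in texts:
        print('Body:', text)
    for image in images:
        print('Image:', image)
    for t in times:
        print('Time:', t)
    print('-' * 50)
    # Return the extracted fields so the caller can save them.
    return {'title': title, 'text': texts, 'images': images, 'time': times}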
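def get_url2(url, data=None):
    # Fetch the URL that was passed in (the original always fetched the global url2).
    resp = requests.get(url, headers=headers)
    resp.encoding = 'gbk'  # the page is GBK-encoded, so decode it as GBK
    soup = BeautifulSoup(resp.text, 'lxml')
    title = soup.title.string
    print('2: Article title:', title)
    texts = [tag.get_text().strip() for tag in soup.find_all(class_='artText')]
    images = [img.get('src') for img in soup.find_all('img')]
    # get_text() replaces .string, which is None for tags that contain nested tags.
    times = [tag.get_text().strip() for tag in soup.find_all(class_='time')]
    for text in texts:
        print('Body:', text)
    for image in images:
        print('Image:', image)
    for t in times:
        print('Time:', t)
    print('-' * 50)
    # Return the extracted fields so the caller can save them.
    return {'title': title, 'text': texts, 'images': images, 'time': times}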
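def get_url3(url, data=None):
    # Fetch the URL that was passed in (the original always fetched the global url3).
    resp = requests.get(url, headers=headers)
    resp.encoding = 'utf-8'  # the page is UTF-8-encoded, so decode it as UTF-8
    soup = BeautifulSoup(resp.text, 'lxml')
    title = soup.title.string
    print('3: Article title:', title)
    texts = [tag.get_text().strip() for tag in soup.find_all(class_='art-body')]
    images = [img.get('src') for img in soup.find_all('img')]
    # The time is the first <span> directly under the .art-auther element.
    times = [tag.get_text().strip() for tag in soup.select('.art-auther > span:nth-child(1)')]
    for text in texts:
        print('Body:', text)
    for image in images:
        print('Image:', image)
    for t in times:
        print('Time:', t)
    print('-' * 50)
    # Return the extracted fields so the caller can save them.
    return {'title': title, 'text': texts, 'images': images, 'time': times}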
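def get_url4(url, data=None):
    # Fetch the URL that was passed in (the original always fetched the global url4).
    resp = requests.get(url, headers=headers)
    resp.encoding = 'gbk'  # the page is GBK-encoded, so decode it as GBK
    soup = BeautifulSoup(resp.text, 'lxml')
    title = soup.title.string
    print('4: Article title:', title)
    # Only the <p> tags inside the article container count as body text.
    texts = [tag.get_text().strip() for tag in soup.select('.left1 > div.ArtCon > p')]
    images = [img.get('src') for img in soup.find_all('img')]
    times = [tag.get_text().strip() for tag in soup.select('#acxc > span')]
    for text in texts:
        print('Body:', text)
    for image in images:
        print('Image:', image)
    for t in times:
        print('Time:', t)
    print('-' * 50)
    # Return the extracted fields so the caller can save them.
    return {'title': title, 'text': texts, 'images': images, 'time': times}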
def get_url5(url, data=None):
    # Fetch the URL that was passed in (the original always fetched the global url5).
    resp = requests.get(url, headers=headers)
    resp.encoding = 'gbk'  # the page is GBK-encoded, so decode it as GBK
    soup = BeautifulSoup(resp.text, 'lxml')
    title = soup.title.string
    print('5: Article title:', title)
    texts = [tag.get_text().strip() for tag in soup.find_all(class_='textCon')]
    images = [img.get('src') for img in soup.find_all('img')]
    times = [tag.get_text().strip() for tag in soup.find_all(class_='time')]
    for text in texts:
        print('Body:', text)
    for image in images:
        print('Image:', image)
    for t in times:
        print('Time:', t)
    print('-' * 50)
    # Return the extracted fields so the caller can save them.
    return {'title': title, 'text': texts, 'images': images, 'time': times}

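import json
# Call each function and collect its result; the original never defined `data`.
data = [get_url1(url1), get_url2(url2), get_url3(url3), get_url4(url4), get_url5(url5)]
with open("record.json", 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)  # valid JSON rather than str(data)
print("Finished writing to file...")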
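
Each function above sets the page encoding by hand (GBK for four pages, UTF-8 for one). As a rough sketch of one alternative, the charset can be read from the page's own `<meta>` tag before decoding. The helper names `sniff_encoding` and `decode_page` and the sample page are made up for illustration; they are not part of the original script, and real pages may declare a charset that does not match their bytes.

```python
import re

# Matches charset=... inside a <meta> tag, e.g. <meta charset="gbk"> or
# <meta http-equiv="Content-Type" content="text/html; charset=gb2312">.
_CHARSET_RE = re.compile(rb'<meta[^>]+charset=["\']?\s*([\w-]+)', re.IGNORECASE)


def sniff_encoding(raw: bytes, default: str = 'utf-8') -> str:
    """Return the charset declared near the top of the page, or `default`."""
    match = _CHARSET_RE.search(raw[:2048])
    return match.group(1).decode('ascii').lower() if match else default


def decode_page(raw: bytes) -> str:
    """Decode raw page bytes with the declared charset.

    Bytes that cannot be decoded are replaced instead of raising an error.
    A charset name that Python does not know would still raise LookupError.
    """
    return raw.decode(sniff_encoding(raw), errors='replace')


# Hypothetical sample page, encoded as GBK like most of the pages above.
sample = '<html><head><meta charset="gbk"><title>时尚</title></head></html>'.encode('gbk')
print(sniff_encoding(sample))
print(decode_page(sample))
```

With `requests`, the raw bytes are available as `resp.content`, so `decode_page(resp.content)` could stand in for setting `resp.encoding` manually. `requests` also offers `resp.apparent_encoding`, which guesses the encoding from the bytes themselves.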

The way the functions are called at the end is still rough, so treat this code as a reference only.


Posted in: Python Crawlers

2022-10-25
