Python Web Crawler Study Notes (Targeted Crawling)

Installing a Python runtime on Windows

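Installing Python: a 3.x release is recommended, because Python 3 uses UTF-8 as its default source encoding. Download the installer from the official Python download page, run it, and follow the prompts. For an IDE, I am used to JetBrains PyCharm; the free Community edition is enough for small programs and for learning, and it is available from the PyCharm download page. Run that installer and follow the prompts as well. Third-party packages are installed with the pip command; taking Django as the example, the command is pip install django.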
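
As a minimal sketch of checking that the pip installs used in these notes succeeded, the snippet below looks up each library by its import name and prints the pip command for any that is missing. The helper name missing_packages and the package table are illustrative assumptions, not part of the original notes; note that beautifulsoup4 is imported as bs4.

```python
import importlib.util

# Import name -> pip package name (assumed list, matching the libraries in these notes)
PACKAGES = {"requests": "requests", "bs4": "beautifulsoup4", "lxml": "lxml"}


def missing_packages(packages):
    """Return pip names of packages whose import name cannot be found."""
    return [pip_name for import_name, pip_name in packages.items()
            if importlib.util.find_spec(import_name) is None]


for name in missing_packages(PACKAGES):
    # Run the printed command in a Windows command prompt to install the package
    print("pip install " + name)
```

If nothing is printed, all three libraries are already importable.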

Preparing for web crawling

Installing the requests library

    # After the import, these four lines display the Baidu home page:
    import requests

    r = requests.get("https://www.baidu.com", timeout=30)
    print(r.status_code)
    r.encoding = r.apparent_encoding
    print(r.text)
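
The line r.encoding = r.apparent_encoding matters because, when a server sends a text response without declaring a charset, requests falls back to ISO-8859-1, and a Chinese page then prints as garbage. The sketch below reproduces that effect with plain bytes and no network access; the sample string is an assumption chosen for illustration.

```python
# Bytes as a server would send them for a UTF-8 encoded Chinese page title
raw = "百度一下，你就知道".encode("utf-8")

# Decoding with the wrong charset does not fail, it silently yields mojibake
wrong = raw.decode("iso-8859-1")

# Decoding with the charset the bytes were actually written in restores the text
right = raw.decode("utf-8")

print(repr(wrong))
print(right)
```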

Overview of the requests library

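Requests is an HTTP library written in Python, built on urllib3 and released under the Apache 2.0 license. It is more convenient than the standard library's urllib, saves a great deal of work, and fully covers HTTP testing needs. Its design follows the idioms of PEP 20 (The Zen of Python), which makes it more Pythonic than urllib. Just as importantly, it supports Python 3.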

Installing the beautifulsoup4 library

Overview of the beautifulsoup4 library

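Beautiful Soup is an HTML/XML parser written in Python. It copes well with malformed markup and builds a parse tree, and it provides simple, commonly used operations for navigating, searching, and modifying that tree, which can save a lot of programming time. Parsing with html.parser is relatively slow; the lxml parser is recommended instead, or you can try Scrapy.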
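
Since lxml is a separate install, a script can prefer it when it is present and fall back to the built-in html.parser otherwise. This is a small sketch of that choice; the function name pick_parser is hypothetical.

```python
import importlib.util


def pick_parser():
    """Return "lxml" if the lxml package is installed, else "html.parser"."""
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    return "html.parser"


print(pick_parser())
```

The returned name can be passed as the second argument to BeautifulSoup, for example BeautifulSoup(html, pick_parser()).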

Installing the lxml library

Overview of the lxml library

Targeted crawler examples:

Scraping the top ten of the Chinese university ranking

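    import requests
    from bs4 import BeautifulSoup
    import bs4


    def getHtml(url):
        """Fetch a page and return its text, or "" on any request error."""
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
            return r.text
        except requests.RequestException:
            return ""


    def fillUnivList(ulist, html):
        """Append [rank, name, score] for each row of the ranking table."""
        soup = BeautifulSoup(html, "html.parser")
        tbody = soup.find('tbody')
        if tbody is None:  # empty or unexpected page: nothing to parse
            return
        for tr in tbody.children:
            # Skip the whitespace strings that sit between <tr> tags
            if isinstance(tr, bs4.element.Tag):
                tds = tr('td')
                ulist.append([tds[0].string, tds[1].string, tds[3].string])


    def printUnivList(ulist, num):
        # chr(12288) is the full-width space, used to pad the Chinese names
        tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
        print(tplt.format("排名", "学校名称", "分数", chr(12288)))
        for u in ulist[:num]:  # slicing avoids an IndexError on short lists
            print(tplt.format(u[0], u[1], u[2], chr(12288)))


    def main():
        uinfo = []
        url = "http://www.zuihaodaxue.com/zuihaodaxuepaiming2016.html"
        html = getHtml(url)
        fillUnivList(uinfo, html)
        printUnivList(uinfo, 10)


    main()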
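
The template in printUnivList pads the school name with chr(12288), the full-width (ideographic) space, because an ordinary ASCII space is narrower than a Chinese character and would leave the columns ragged. A minimal illustration of that nested format field, using a sample name chosen here for demonstration:

```python
FULL_WIDTH_SPACE = chr(12288)  # U+3000, as wide as one Chinese character

# {1} supplies the fill character for field {0}; ^10 centers within 10 characters
cell = "{0:{1}^10}".format("清华大学", FULL_WIDTH_SPACE)

print(repr(cell))
print(len(cell))
```

The four-character name ends up with three full-width spaces on each side, so every cell in the column occupies the same visual width.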

Python code for scraping weather news from the web

I have already used this in a real project, and it works well.

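    # A practical crawler: scrape weather information from the web to feed the
    # company's business software, replacing the outdated approach used before
    import requests
    from bs4 import BeautifulSoup
    import bs4


    def getHtml(url):
        """Fetch a page and return its text, or "" on any request error."""
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
            return r.text
        except requests.RequestException:
            return ""


    def fillWeatherList(ulist, html):
        """Append [headline, absolute URL] for each link in the focus-news block."""
        soup = BeautifulSoup(html, "html.parser")
        focusnews = soup.find('div', 'focusnews')
        if focusnews is None:  # empty or unexpected page: nothing to parse
            return
        for a in focusnews.descendants:
            # Only <a> tags that actually carry an href are of interest
            if isinstance(a, bs4.element.Tag) and a.name == 'a' and a.get('href'):
                ulist.append([a.string, 'http://shanxi.weather.com.cn/' + a['href']])


    def printWeatherList(ulist):
        # chr(12288) is the full-width space, used to pad the Chinese text
        tplt = "{0:{2}^50}\t{1:{2}^50}"
        print(tplt.format("天气新闻", "网址", chr(12288)))
        for u in ulist:
            print(tplt.format(u[0], u[1], chr(12288)))


    def main():
        uinfo = []
        url = "http://shanxi.weather.com.cn/"
        html = getHtml(url)
        fillWeatherList(uinfo, html)
        printWeatherList(uinfo)


    main()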
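
The crawler above builds each link by prepending the site root to the href. That works as long as every href is a relative path; as a hedged alternative, urllib.parse.urljoin from the standard library also copes with hrefs that start with a slash or are already absolute. The sample hrefs below are made up for illustration.

```python
from urllib.parse import urljoin

BASE = "http://shanxi.weather.com.cn/"

# Hypothetical href values of the kinds a page might contain
hrefs = ["news/1.shtml", "/news/2.shtml", "http://www.weather.com.cn/news/3.shtml"]

# urljoin resolves each href against the base the way a browser would
links = [urljoin(BASE, href) for href in hrefs]
for link in links:
    print(link)
```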

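My sincere thanks to instructor 嵩天 (Song Tian) on the MOOC platform, who has shared many comprehensive and practical courses. They are worth trying: the open course "Python Web Crawling and Information Extraction" (Python网络爬虫与信息提取).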

Posted in: Python crawlers

2022-10-24
