Python Web Scraping with BeautifulSoup4

Web Scraping: The BeautifulSoup4 Parser

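BeautifulSoup makes parsing HTML fairly simple. Its API is user-friendly, and it supports CSS selectors, the HTML parser in the Python standard library, and lxml's HTML and XML parsers.

Compared with regular expressions, it is easier to use.

Example:

First, import the bs4 library: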

#!/usr/bin/python3
# -*- coding:utf-8 -*- 
 
from bs4 import BeautifulSoup
 
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
 
# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")
 
# Pretty-print the contents of the soup object
print(soup.prettify())

Output

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
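
The introduction mentions CSS selectors and the standard-library parser, so here is a minimal sketch of both. It assumes a shortened version of the sample document and uses "html.parser" instead of lxml so that no extra package is needed; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened sample document (an assumption for this sketch)
html = """
<p class="story">
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""

# "html.parser" is the parser bundled with Python
soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector and returns a list of matching tags
links = soup.select("p.story a.sister")
hrefs = [a["href"] for a in links]
print(hrefs)

# select_one() returns only the first match, or None if nothing matches
first = soup.select_one("#link3")
print(first.get_text())
```

Switching parsers only changes the second argument to BeautifulSoup; the selector calls stay the same.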

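The Four Kinds of Objects

BeautifulSoup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds: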

(1) Tag

(2) NavigableString

(3) BeautifulSoup

(4) Comment

1. Tag

Put simply, a Tag corresponds to a single tag in the HTML, for example:

<head><title>The Dormouse's story</title></head>
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

Each of the HTML tags above (title, head, a, p, and so on), together with the content it encloses, is a Tag. Let's try using BeautifulSoup to get these tags:

#!/usr/bin/python3
# -*- coding:utf-8 -*-
 
from bs4 import BeautifulSoup
 
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
 
# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")
 
# Print the title tag
print(soup.title)
 
# Print the head tag
print(soup.head)
 
# Print the first a tag
print(soup.a)
 
# Print the first p tag
print(soup.p)
 
# Print the type of soup.p
print(type(soup.p))

Output

<title>The Dormouse's story</title>
<head><title>The Dormouse's story</title></head>
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<class 'bs4.element.Tag'>

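We can easily get these tags by writing soup followed by the tag name; the resulting objects are of type bs4.element.Tag. Note, however, that this only finds the first matching tag in the whole document. Querying all matching tags is covered later.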
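
As a brief preview of the "first match only" behavior, the sketch below contrasts attribute-style access with find_all(), which returns every match. It assumes a simplified fragment of the sample document (the Elsie link holds plain text here rather than a comment) and the standard-library parser.

```python
from bs4 import BeautifulSoup

# Simplified fragment of the sample document (an assumption for this sketch)
html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# soup.a returns only the first <a> tag in the document
first_id = soup.a["id"]

# find_all("a") returns a list of every <a> tag
all_ids = [a["id"] for a in soup.find_all("a")]

print(first_id)
print(all_ids)
```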

A Tag has two important attributes: name and attrs.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
 
from bs4 import BeautifulSoup
 
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
 
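# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")
 
# The soup object is special: its name is [document]
print(soup.name)
 
# For any other tag, name is the tag's own name
print(soup.head.name)
 
# Print all attributes of the p tag; the result is a dict
print(soup.p.attrs)
 
# Print the class attribute of the p tag
print(soup.p['class'])
# get() takes the attribute name; for an existing attribute
# it gives the same result as the line above
print(soup.p.get('class'))
 
print(soup.p)
 
# Modify an attribute
soup.p['class'] = "newClass"
print(soup.p)
 
# Delete an attribute
del soup.p['class']
print(soup.p)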

Output

[document]
head
{'class': ['title'], 'name': 'dromouse'}
['title']
['title']
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>
<p name="dromouse"><b>The Dormouse's story</b></p>
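
The output above shows class as a list while name is a plain string. The sketch below illustrates that difference and how subscript access and get() behave for an attribute that is missing. It assumes a one-line document with a second, made-up class value ("bold") and the standard-library parser.

```python
from bs4 import BeautifulSoup

# One-line document; the extra "bold" class is made up for illustration
soup = BeautifulSoup('<p class="title bold" name="dromouse">text</p>', "html.parser")
p = soup.p

# class can hold several values in HTML, so bs4 returns a list
classes = p["class"]

# An ordinary attribute comes back as a plain string
name = p["name"]

# get() returns None, or the supplied default, when the attribute is missing
missing = p.get("id")
fallback = p.get("id", "no-id")

# Subscript access raises KeyError for a missing attribute
try:
    p["id"]
    raised = False
except KeyError:
    raised = True

print(classes, name, missing, fallback, raised)
```

When an attribute may be absent, get() avoids the need for a try/except around the lookup.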

2. NavigableString

Now that we can get a tag, how do we get the text inside it? Simply use .string, for example:

#!/usr/bin/python3
# -*- coding:utf-8 -*-
 
from bs4 import BeautifulSoup
 
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
 
# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")
 
# Print the text of the first p tag
print(soup.p.string)
 
# Print the type of soup.p.string
print(type(soup.p.string))

Output

The Dormouse's story
<class 'bs4.element.NavigableString'>
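
One caveat worth illustrating: .string works above because the first p tag has a single child. When a tag has several children, .string is None, and get_text() can be used to join all the text instead. The sketch assumes a condensed version of the sample document and the standard-library parser.

```python
from bs4 import BeautifulSoup

# Condensed version of the sample document (an assumption for this sketch)
html = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story">Once <a>Lacie</a> and <a>Tillie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")
title_p, story_p = soup.find_all("p")

# One child (<b>) that itself holds one string: .string reaches through to it
single = title_p.string

# Several children (text and <a> tags): .string is None
multi = story_p.string

# get_text() concatenates every piece of text inside the tag
text = story_p.get_text()

print(single)
print(multi)
print(text)
```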

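3. BeautifulSoup

A BeautifulSoup object represents the content of a whole document. Most of the time it can be treated as a special kind of Tag: we can read its name (and check the type of that name) as well as its attributes.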

#!/usr/bin/python3
# -*- coding:utf-8 -*-
 
from bs4 import BeautifulSoup
 
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
 
# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")
 
# Type of soup.name (a str)
print(type(soup.name))
 
# Name
print(soup.name)
 
# Attributes (empty for the document object)
print(soup.attrs)

Output

<class 'str'>
[document]
{}
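
To make the "special Tag" remark concrete, the sketch below checks the class of the soup object itself rather than the type of its name. It assumes a minimal document and the standard-library parser.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Minimal document (an assumption for this sketch)
soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)

# The object's own class is BeautifulSoup ...
soup_type = type(soup).__name__

# ... and BeautifulSoup is a subclass of Tag
is_tag = isinstance(soup, Tag)

# So the usual Tag methods, such as find(), work on it directly
title_text = soup.find("title").string

print(soup_type, is_tag, title_text)
```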

4. Comment

A Comment object is a special type of NavigableString. When it is printed, the comment delimiters are not included.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
 
from bs4 import BeautifulSoup
 
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
 
# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")
 
print(soup.a)
 
print(soup.a.string)
 
print(type(soup.a.string))

Output

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
 Elsie
<class 'bs4.element.Comment'>

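The content of the a tag is actually a comment, but when we output it with .string the comment delimiters have already been stripped. The printed text alone therefore does not reveal that it came from a comment; only the type does.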
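
Because a comment prints like ordinary text, one way to keep comments out of scraped results is to check the type before using the string. This is a minimal sketch, assuming a two-link fragment of the sample document and the standard-library parser; the list name visible is illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Two-link fragment of the sample document (an assumption for this sketch)
html = '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(html, "html.parser")

visible = []
for a in soup.find_all("a"):
    s = a.string
    # Skip strings that are really HTML comments
    if isinstance(s, Comment):
        continue
    visible.append(str(s))

print(visible)
```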

Posted in: Python Web Scraping

2019-10-20
