A No-Nonsense Guide to Extracting Data from HTML/XML Web Pages with lxml

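Python's lxml module is an easy-to-use, high-performance parser for HTML and XML. Once a page is parsed with it, a crawler can easily pull out the data it wants. lxml is built on the C libraries libxml2 and libxslt, which is why it is so fast.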

The workflow for extracting web page data with lxml

Extracting data from a web page with lxml takes two steps:

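Step one: use lxml to parse the page (or XML) into a DOM tree. There are three ways to do this: etree.fromstring(), etree.HTML(), and lxml.html. They are broadly similar but differ in a few details, which we cover below.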

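Step two: walk the DOM tree with XPath, locate the nodes that hold the data you want, and extract it. This step requires some fluency with XPath. There are a lot of XPath rules, but don't worry, I'll summarize the common patterns.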
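The two steps can be sketched in a few lines. This is only a minimal illustration: the `page` string and its links are made up for the example, and lxml.html.fromstring() is just one of the parsing options discussed later.

```python
import lxml.html

# Hypothetical page content, made up for this sketch.
page = '<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>'

# Step one: parse the markup into a tree and get its root element.
root = lxml.html.fromstring(page)

# Step two: use XPath to locate nodes and pull out text and attributes.
link_texts = root.xpath('//a/text()')
link_hrefs = root.xpath('//a/@href')

print(link_texts)
print(link_hrefs)
```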

Building the DOM tree

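As mentioned above, there are three ways to parse a page into a DOM tree. If you have trouble making choices, you may be wondering which one to pick. No rush, let's look at them one by one. I'll use the following HTML snippet as the running example: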

<div class="1">
    <p class="p_1 item">item_1</p>
    <p class="p_2 item">item_2</p>
</div>
<div class="2">
    <p id="p3"><a href="/go-p3">item_3</a></p>
</div>

Using the etree.fromstring() function

First, look at the function's docstring:

In [3]: etree.fromstring?
Signature:      etree.fromstring(text, parser=None, *, base_url=None)
Call signature: etree.fromstring(*args, **kwargs)
Type:           cython_function_or_method
String form:    <cyfunction fromstring at 0x7fe538822df0>
Docstring:
fromstring(text, parser=None, base_url=None)

Parses an XML document or fragment from a string.  Returns the
root node (or the result returned by a parser target).

To override the default parser with a different parser you can pass it to
the ``parser`` keyword argument.

The ``base_url`` keyword argument allows to set the original base URL of
the document to support relative Paths when looking up external entities
(DTD, XInclude, ...).

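This function parses the input string into a DOM tree and returns the root node. Does it place any requirements on the input string text? As the docstring says, it parses XML, so the input has to be well-formed XML, not just HTML that a browser would accept. Let's see what that means with an example: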

In [19]: html = ''' 
...: <div class="1"> 
...:     <p class="p_1 item">item_1</p> 
...:     <p class="p_2 item">item_2</p> 
...: </div> 
...: <div class="2"> 
...:     <p id="p3"><a href="/go-p3">item_3</a></p> 
...: </div> 
...: '''

In [20]: etree.fromstring(html)
Traceback (most recent call last):

    File "/home/veelion/.virtualenvs/py3.6/lib/python3.6/site-packages/IPython/core/interactiveshell.py", 
    line 3267, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)

    File "<ipython-input-20-aea2e2c2317e>", line 1, in <module>
etree.fromstring(html)

    File "src/lxml/etree.pyx", line 3213, in lxml.etree.fromstring

    File "src/lxml/parser.pxi", line 1877, in lxml.etree._parseMemoryDocument

    File "src/lxml/parser.pxi", line 1758, in lxml.etree._parseDoc

    File "src/lxml/parser.pxi", line 1068, in lxml.etree._BaseParser._parseUnicodeDoc

    File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc

    File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult

    File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError

    File "<string>", line 6
    XMLSyntaxError: Extra content at the end of the document, line 6, column 1

It raised an error! The cause is that our html consists of two sibling <div> tags with no single root node. What if we wrap it in one more outer <div>?

In [22]: etree.fromstring('<div>' + html + '</div>')
Out[22]: <Element div at 0x7fe53aa978c8>

That works. It returns the root node, an Element object whose tag is div.

To sum up, etree.fromstring() requires a single outermost node; otherwise it fails. The same function is what you use to build a DOM tree from XML.
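Here is a small script version of that behavior, as a sketch. The `fragment` string and the wrapper tag name `root` are made up for the example; the exception class etree.XMLSyntaxError is the one shown in the traceback above.

```python
from lxml import etree

# Two sibling top-level elements: not a well-formed XML document.
fragment = '<p>one</p><p>two</p>'

try:
    etree.fromstring(fragment)
    parsed_ok = True
except etree.XMLSyntaxError:
    # Raised because there is extra content after the first element.
    parsed_ok = False

# Wrapping the fragment in a single outer element makes it parseable.
wrapped = etree.fromstring('<root>' + fragment + '</root>')
texts = [p.text for p in wrapped]

print(parsed_ok, wrapped.tag, texts)
```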

Using the etree.HTML() function

This function is aimed specifically at HTML. Here is its docstring:

In [23]: etree.HTML?
Signature:      etree.HTML(text, parser=None, *, base_url=None)
Call signature: etree.HTML(*args, **kwargs)
Type:           cython_function_or_method
String form:    <cyfunction HTML at 0x7fe538822c80>
Docstring:     
HTML(text, parser=None, base_url=None)

Parses an HTML document from a string constant.  Returns the root
node (or the result returned by a parser target).  This function
can be used to embed "HTML literals" in Python code.

To override the parser with a different ``HTMLParser`` you can pass it to
the ``parser`` keyword argument.

The ``base_url`` keyword argument allows to set the original base URL of
the document to support relative Paths when looking up external entities
(DTD, XInclude, ...).

The parameters are exactly the same as those of etree.fromstring(). Let's try it:

In [24]: etree.HTML(html)
Out[24]: <Element html at 0x7fe53ab03748>

Input with two sibling top-level nodes is no problem. But wait, the tag of the returned root Element is html? Let's turn it back into HTML with etree.tostring() and take a look:

In [26]: print(etree.tostring(etree.HTML(html)).decode())
<html><body><div class="1">
    <p class="p_1 item">item_1</p>
    <p class="p_2 item">item_2</p>
</div>
<div class="2">
    <p id="p3"><a href="/go-p3">item_3</a></p>
</div>
</body></html>

In [27]: print(html)

<div class="1">
    <p class="p_1 item">item_1</p>
    <p class="p_2 item">item_2</p>
</div>
<div class="2">
    <p id="p3"><a href="/go-p3">item_3</a></p>
</div>

In other words, etree.HTML() completes an HTML fragment by wrapping it in <html> and <body> tags.
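One practical consequence is that the original top-level nodes now sit under /html/body. The sketch below uses a shortened, made-up version of the sample snippet to show this.

```python
from lxml import etree

# Shortened, made-up variant of the sample fragment (two sibling divs).
fragment = '<div class="1"><p>item_1</p></div><div class="2"><p>item_3</p></div>'

root = etree.HTML(fragment)

# The parser supplies the missing <html> and <body> wrappers.
root_tag = root.tag
child_tags = [child.tag for child in root]

# The original top-level divs are now children of <body>.
div_classes = [div.get('class') for div in root.xpath('/html/body/div')]

print(root_tag, child_tags, div_classes)
```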

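Using the lxml.html module

lxml.html is a submodule of lxml. It is a wrapper around etree that is better suited to parsing HTML pages. It offers several functions for building a DOM tree: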
lxml.html.document_fromstring()
lxml.html.fragment_fromstring()
lxml.html.fragments_fromstring()
lxml.html.fromstring()

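You can look up their docstrings in ipython, so I won't list them here. For parsing web pages, the last one, fromstring(), is usually all we need. Given our sample HTML, fromstring() also adds a parent div node above the two sibling top-level nodes.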
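To get a feel for how the four functions differ, the sketch below runs each of them on the same made-up two-div fragment. The create_parent=True argument to fragment_fromstring() is used here because that function expects a single element unless it is told to create a parent.

```python
import lxml.html as lh

# Made-up fragment with two sibling top-level divs.
fragment = '<div class="1"><p>a</p></div><div class="2"><p>b</p></div>'

# fromstring(): returns one element; the two divs get a div parent.
root = lh.fromstring(fragment)

# document_fromstring(): always returns a full document rooted at <html>.
doc = lh.document_fromstring(fragment)

# fragments_fromstring(): returns a list of the top-level elements.
parts = lh.fragments_fromstring(fragment)

# fragment_fromstring(): single element; create_parent=True wraps in a div.
wrapped = lh.fragment_fromstring(fragment, create_parent=True)

print(root.tag, len(root))
print(doc.tag)
print([el.get('class') for el in parts])
print(wrapped.tag, len(wrapped))
```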

Now that all three approaches have been covered, I expect you've already made your choice: it has to be lxml.html.

Because it is tailored to HTML, it also comes with some methods of its own:

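For example, suppose we want all the text inside a node, including the text of its descendants. Nodes produced by plain etree have no single method for this: each node's text attribute holds only that node's own leading text, not its children's, so you have to collect the descendants' text yourself (for example with itertext()). Nodes from lxml.html have a text_content() method that returns all the text contained in a node.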
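A short sketch of the difference, using a made-up <p> that mixes its own text with a child <a>:

```python
import lxml.html as lh
from lxml import etree

# Made-up snippet: text before, inside, and after a child element.
snippet = '<p id="p3">see <a href="/go-p3">item_3</a> here</p>'

# lxml.html: text_content() gathers the text of the node and its descendants.
p_html = lh.fromstring(snippet)
own_text = p_html.text               # only the text before the first child
all_text = p_html.text_content()     # everything inside <p>

# Plain etree: join the pieces yielded by itertext() yourself.
p_etree = etree.HTML(snippet).xpath('//p')[0]
joined_text = ''.join(p_etree.itertext())

print(repr(own_text), repr(all_text), repr(joined_text))
```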

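Another example: many pages write links as relative paths rather than full URLs, such as <a href="/index.html">, so after extracting a link we still have to join it into a full URL by hand. For this, lxml.html provides make_links_absolute(). It is a method of the lxml.html Element object; the etree Element object does not have it.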
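A minimal sketch of make_links_absolute(). The base URL https://example.com/news/ is a placeholder chosen for the example; in a crawler it would be the URL the page was downloaded from.

```python
import lxml.html as lh

snippet = '<p id="p3"><a href="/go-p3">item_3</a></p>'
p = lh.fromstring(snippet)

# Placeholder base URL, assumed for this example only.
base_url = 'https://example.com/news/'

# Rewrites every link in the tree in place, resolving it against base_url.
p.make_links_absolute(base_url)

hrefs = p.xpath('//a/@href')
print(hrefs)
```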

Extracting data with XPath

We'll again use the HTML snippet below to see how to locate nodes and extract data.

<div class="1">
    <p class="p_1 item">item_1</p>
    <p class="p_2 item">item_2</p>
</div>
<div class="2">
    <p id="p3"><a href="/go-p3">item_3</a></p>
</div>

First, import the lxml.html module and build the DOM tree:

In [50]: import lxml.html as lh

In [51]: doc = lh.fromstring(html)

(1) Locating a node by tag attribute
For example, to get the <div class="2"> node:

In [52]: doc.xpath('//div[@class="2"]')
Out[52]: [<Element div at 0x7fe53a492ea8>]

In [53]: print(lh.tostring(doc.xpath('//div[@class="2"]')[0]).decode())
<div class="2">
    <p id="p3"><a href="/go-p3">item_3</a></p>
</div>

(2) The contains syntax
Two <p> tags in the html have a class containing item. To extract both <p> tags:

In [54]: doc.xpath('//p[contains(@class, "item")]')
Out[54]: [<Element p at 0x7fe53a6a3ea8>, <Element p at 0x7fe53a6a3048>]

## Get the text of the <p> tags:
In [55]: doc.xpath('//p[contains(@class, "item")]/text()')
Out[55]: ['item_1', 'item_2']
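One thing to keep in mind is that contains() is a plain substring test, so it also matches class names that merely include the word. The sketch below adds a made-up third <p> with class "subitem" to show this, together with a commonly used XPath 1.0 idiom (concat plus normalize-space) for matching a whole class token. Treat it as one possible approach rather than the only one.

```python
import lxml.html as lh

# Sample paragraphs plus one made-up extra whose class only contains "item"
# as a substring.
snippet = '''
<div>
    <p class="p_1 item">item_1</p>
    <p class="p_2 item">item_2</p>
    <p class="subitem">other</p>
</div>
'''
doc = lh.fromstring(snippet)

# Substring match: also picks up class="subitem".
loose = doc.xpath('//p[contains(@class, "item")]/text()')

# Whole-token match: pad the class list with spaces and look for " item ".
exact = doc.xpath(
    '//p[contains(concat(" ", normalize-space(@class), " "), " item ")]/text()'
)

print(loose)
print(exact)
```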

(3) The starts-with syntax
Same goal as in (2). The class of both <p> tags starts with p_, so:

In [60]: doc.xpath('//p[starts-with(@class, "p_")]')
Out[60]: [<Element p at 0x7fe53a6a3ea8>, <Element p at 0x7fe53a6a3048>]

## Get the text of the <p> tags:
In [61]: doc.xpath('//p[starts-with(@class, "p_")]/text()')
Out[61]: ['item_1', 'item_2']

(4) Getting the value of an attribute
For example, to extract every link in the page:

In [63]: doc.xpath('//@href')
Out[63]: ['/go-p3']
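In practice you often want each link together with its anchor text. A minimal sketch on the same sample HTML: select the <a> elements first, then read the attribute and text from each one. The variable names are made up for the example.

```python
import lxml.html as lh

html = '''
<div class="1">
    <p class="p_1 item">item_1</p>
    <p class="p_2 item">item_2</p>
</div>
<div class="2">
    <p id="p3"><a href="/go-p3">item_3</a></p>
</div>
'''
doc = lh.fromstring(html)

# Select the elements, then read attribute and text from each one.
links = [(a.get('href'), a.text_content()) for a in doc.xpath('//a[@href]')]

# The same @attribute syntax works for any attribute, not just href.
p_ids = doc.xpath('//p/@id')

print(links)
print(p_ids)
```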

Posted in: Python crawlers

2019-10-05
