Python lxml Module Report
lxml
is a Python library for processing XML and HTML documents. It provides a powerful and efficient way to parse and interact with XML and HTML content, leveraging the libxml2
and libxslt
libraries. This report covers installation, basic usage, and provides examples to illustrate how to use lxml
.
Introduction
lxml
is a comprehensive library for XML and HTML processing. It provides facilities for parsing, navigating, and modifying XML and HTML documents. lxml
is known for its speed and ease of use, and it supports XPath and XSLT.
Installation
To use lxml
, you need to install the module via pip. The library also requires libxml2
and libxslt
, which are typically installed automatically if you are using precompiled binaries.
Install lxml
with pip:
Basic Usage
The lxml
library supports both XML and HTML parsing. The core classes you will use are lxml.etree.ElementTree
for XML and lxml.html
for HTML.
Example Code
Here’s how to use lxml
to parse and interact with XML and HTML documents:
from lxml import etree, html
# Example XML content
xml_content = """
<root>
<child name="child1">Content1</child>
<child name="child2">Content2</child>
</root>
"""
# Parse XML
xml_tree = etree.fromstring(xml_content)
print(etree.tostring(xml_tree, pretty_print=True).decode())
# Example HTML content
html_content = """
<html>
<head><title>My Page</title></head>
<body>
<p class="text">Hello World</p>
</body>
</html>
"""
# Parse HTML
html_tree = html.fromstring(html_content)
print(etree.tostring(html_tree, pretty_print=True).decode())
Examples
Parsing XML
Parsing XML with lxml
is straightforward. You can load XML from a string, file, or URL.
Example: Parsing from a String
from lxml import etree
xml_string = """
<root>
<element key="value">Text</element>
</root>
"""
root = etree.fromstring(xml_string)
print(etree.tostring(root, pretty_print=True).decode())
Example: Parsing from a File
from lxml import etree
tree = etree.parse('example.xml')
root = tree.getroot()
print(etree.tostring(root, pretty_print=True).decode())
Parsing HTML
lxml
can also parse HTML content. It handles both well-formed and malformed HTML.
Example: Parsing HTML from a String
from lxml import html
html_string = """
<html>
<head><title>Page Title</title></head>
<body>
<p class="description">This is a paragraph.</p>
</body>
</html>
"""
tree = html.fromstring(html_string)
print(etree.tostring(tree, pretty_print=True).decode())
Example: Parsing HTML from a URL
from lxml import html
import requests
response = requests.get('http://example.com')
tree = html.fromstring(response.content)
print(etree.tostring(tree, pretty_print=True).decode())
XPath Queries
XPath is a powerful way to query XML and HTML documents. lxml
supports XPath expressions for searching and navigating the document.
Example: Using XPath to Find Elements
from lxml import etree
xml_string = """
<root>
<item id="1">First</item>
<item id="2">Second</item>
</root>
"""
root = etree.fromstring(xml_string)
items = root.xpath('//item')
for item in items:
print(item.text)
Example: Using XPath to Extract Attributes
from lxml import etree
html_string = """
<html>
<body>
<a href="http://example.com" id="link1">Example</a>
</body>
</html>
"""
tree = etree.HTML(html_string)
href = tree.xpath('//a/@href')[0]
print(href)
Modifying Documents
lxml
allows you to modify XML and HTML documents. You can add, change, or remove elements.
Example: Adding Elements
from lxml import etree
xml_string = """
<root>
<element>Original</element>
</root>
"""
root = etree.fromstring(xml_string)
new_element = etree.Element('new_element')
new_element.text = 'Added Text'
root.append(new_element)
print(etree.tostring(root, pretty_print=True).decode())
Example: Modifying Attributes
from lxml import etree
html_string = """
<html>
<body>
<p class="text">Old Text</p>
</body>
</html>
"""
tree = etree.HTML(html_string)
p = tree.xpath('//p')[0]
p.text = 'New Text'
p.attrib['class'] = 'new-class'
print(etree.tostring(tree, pretty_print=True).decode())
Example: Removing Elements
from lxml import etree
xml_string = """
<root>
<element>Remove Me</element>
</root>
"""
root = etree.fromstring(xml_string)
element_to_remove = root.find('.//element')
root.remove(element_to_remove)
print(etree.tostring(root, pretty_print=True).decode())
Handling Common Issues
Handling Malformed HTML
lxml
can handle malformed HTML but using html.parser
might be better for extremely broken HTML. If you face issues with broken HTML, validate and clean the HTML where possible.
Memory Usage
For large XML or HTML documents, be mindful of memory usage. lxml
processes documents in memory, so consider using incremental parsing for very large files.
Conclusion
lxml
is a robust library for processing XML and HTML documents in Python. It provides efficient parsing, querying with XPath, and document modification capabilities. With its comprehensive set of features, lxml
is ideal for a wide range of tasks involving structured data in XML or HTML formats.
For more detailed information and advanced usage, refer to the lxml documentation.