Python lxml Module Report

lxml is a Python library for processing XML and HTML documents. It provides a powerful and efficient way to parse and interact with XML and HTML content, leveraging the libxml2 and libxslt libraries. This report covers installation, basic usage, and provides examples to illustrate how to use lxml.

Introduction

lxml is a comprehensive library for XML and HTML processing. It provides facilities for parsing, navigating, and modifying XML and HTML documents. lxml is known for its speed and ease of use, and it supports XPath and XSLT.

Installation

To use lxml, you need to install the module via pip. The library also requires libxml2 and libxslt, which are typically installed automatically if you are using precompiled binaries.

Install lxml with pip:

pip install lxml

Basic Usage

The lxml library supports both XML and HTML parsing. The core classes you will use are lxml.etree.ElementTree for XML and lxml.html for HTML.

Example Code

Here’s how to use lxml to parse and interact with XML and HTML documents:

from lxml import etree, html

# Example XML content
xml_content = """
<root>
    <child name="child1">Content1</child>
    <child name="child2">Content2</child>
</root>
"""

# Parse XML
xml_tree = etree.fromstring(xml_content)
print(etree.tostring(xml_tree, pretty_print=True).decode())

# Example HTML content
html_content = """
<html>
<head><title>My Page</title></head>
<body>
<p class="text">Hello World</p>
</body>
</html>
"""

# Parse HTML
html_tree = html.fromstring(html_content)
print(etree.tostring(html_tree, pretty_print=True).decode())

Examples

Parsing XML

Parsing XML with lxml is straightforward. You can load XML from a string, file, or URL.

Example: Parsing from a String

from lxml import etree

xml_string = """
<root>
    <element key="value">Text</element>
</root>
"""

root = etree.fromstring(xml_string)
print(etree.tostring(root, pretty_print=True).decode())

Example: Parsing from a File

from lxml import etree

tree = etree.parse('example.xml')
root = tree.getroot()
print(etree.tostring(root, pretty_print=True).decode())

Parsing HTML

lxml can also parse HTML content. It handles both well-formed and malformed HTML.

Example: Parsing HTML from a String

from lxml import html

html_string = """
<html>
<head><title>Page Title</title></head>
<body>
<p class="description">This is a paragraph.</p>
</body>
</html>
"""

tree = html.fromstring(html_string)
print(etree.tostring(tree, pretty_print=True).decode())

Example: Parsing HTML from a URL

from lxml import html
import requests

response = requests.get('http://example.com')
tree = html.fromstring(response.content)
print(etree.tostring(tree, pretty_print=True).decode())

XPath Queries

XPath is a powerful way to query XML and HTML documents. lxml supports XPath expressions for searching and navigating the document.

Example: Using XPath to Find Elements

from lxml import etree

xml_string = """
<root>
    <item id="1">First</item>
    <item id="2">Second</item>
</root>
"""

root = etree.fromstring(xml_string)
items = root.xpath('//item')
for item in items:
    print(item.text)

Example: Using XPath to Extract Attributes

from lxml import etree

html_string = """
<html>
<body>
<a href="http://example.com" id="link1">Example</a>
</body>
</html>
"""

tree = etree.HTML(html_string)
href = tree.xpath('//a/@href')[0]
print(href)

Modifying Documents

lxml allows you to modify XML and HTML documents. You can add, change, or remove elements.

Example: Adding Elements

from lxml import etree

xml_string = """
<root>
    <element>Original</element>
</root>
"""

root = etree.fromstring(xml_string)
new_element = etree.Element('new_element')
new_element.text = 'Added Text'
root.append(new_element)

print(etree.tostring(root, pretty_print=True).decode())

Example: Modifying Attributes

from lxml import etree

html_string = """
<html>
<body>
<p class="text">Old Text</p>
</body>
</html>
"""

tree = etree.HTML(html_string)
p = tree.xpath('//p')[0]
p.text = 'New Text'
p.attrib['class'] = 'new-class'

print(etree.tostring(tree, pretty_print=True).decode())

Example: Removing Elements

from lxml import etree

xml_string = """
<root>
    <element>Remove Me</element>
</root>
"""

root = etree.fromstring(xml_string)
element_to_remove = root.find('.//element')
root.remove(element_to_remove)

print(etree.tostring(root, pretty_print=True).decode())

Handling Common Issues

Handling Malformed HTML

lxml can handle malformed HTML but using html.parser might be better for extremely broken HTML. If you face issues with broken HTML, validate and clean the HTML where possible.

Memory Usage

For large XML or HTML documents, be mindful of memory usage. lxml processes documents in memory, so consider using incremental parsing for very large files.

Conclusion

lxml is a robust library for processing XML and HTML documents in Python. It provides efficient parsing, querying with XPath, and document modification capabilities. With its comprehensive set of features, lxml is ideal for a wide range of tasks involving structured data in XML or HTML formats.

For more detailed information and advanced usage, refer to the lxml documentation.