8 changes: 6 additions & 2 deletions README.rst
@@ -110,10 +110,14 @@ functionality:
   walking) under CPython (but *not* PyPy where it is known to cause
   segfaults);

-- ``genshi`` has a treewalker (but not builder); and
+- ``genshi`` has a treewalker (but not builder);

 - ``chardet`` can be used as a fallback when character encoding cannot
-  be determined.
+  be determined; and
+
+- ``beautifulsoup4`` can use html5lib as a parser backend for
+  HTML5-compliant parsing. Simply pass ``'html5lib'`` as the parser
+  name when creating a BeautifulSoup object.


Bugs
173 changes: 173 additions & 0 deletions doc/beautifulsoup.rst
@@ -0,0 +1,173 @@
BeautifulSoup Integration
=========================

html5lib can be used as a parser backend for `BeautifulSoup 4 <https://www.crummy.com/software/BeautifulSoup/>`_,
providing HTML5-compliant parsing for your BeautifulSoup projects.

Using html5lib with BeautifulSoup
----------------------------------

To use html5lib as your BeautifulSoup parser, simply pass ``'html5lib'`` as the parser name:

.. code-block:: python

    from bs4 import BeautifulSoup

    markup = '<p>Hello <span>World</span>!</p>'
    soup = BeautifulSoup(markup, 'html5lib')

    print(soup.prettify())

This will output:

.. code-block:: html

    <html>
     <head>
     </head>
     <body>
      <p>
       Hello
       <span>
        World
       </span>
       !
      </p>
     </body>
    </html>

Key Differences from Other Parsers
-----------------------------------

When using html5lib with BeautifulSoup, there are some important differences compared to other parsers:

Document Structure
~~~~~~~~~~~~~~~~~~

html5lib always creates a complete HTML5 document structure, even when parsing fragments:

.. code-block:: python

    from bs4 import BeautifulSoup

    markup = '<p>Fragment</p>'

    # With html5lib - adds full document structure
    soup_html5lib = BeautifulSoup(markup, 'html5lib')
    print(soup_html5lib.html is not None)  # True
    print(soup_html5lib.body is not None)  # True

    # With html.parser - keeps it as a fragment
    soup_htmlparser = BeautifulSoup(markup, 'html.parser')
    print(soup_htmlparser.html is None)  # True
    print(soup_htmlparser.body is None)  # True
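
If only the original fragment is wanted back, one option is to serialize the children of the generated ``<body>`` element. This is a minimal sketch that assumes the fragment consists purely of body content; the variable names are illustrative:

```python
from bs4 import BeautifulSoup

markup = '<p>Fragment</p>'
soup = BeautifulSoup(markup, 'html5lib')

# html5lib wrapped the fragment in <html>, <head> and <body>;
# decode_contents() serializes only what sits inside <body>.
fragment = soup.body.decode_contents()
print(fragment)  # <p>Fragment</p>
```

Markup that belongs in ``<head>`` (for example a ``<title>``) would be moved there by the parser, so this approach does not recover such fragments.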

Error Handling
~~~~~~~~~~~~~~

html5lib follows the HTML5 specification's error handling rules, which means it will:

- Automatically close unclosed tags
- Fix misnested tags
- Handle invalid markup gracefully

.. code-block:: python

    from bs4 import BeautifulSoup

    # Malformed HTML with missing closing tags
    markup = '<p>Paragraph 1<p>Paragraph 2'
    soup = BeautifulSoup(markup, 'html5lib')

    # html5lib properly closes and structures the paragraphs
    paragraphs = soup.find_all('p')
    print(len(paragraphs))  # 2
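
Misnested tags are repaired in the same way a browser would repair them. The sketch below uses a made-up snippet in which ``</b>`` arrives while ``<i>`` is still open, and prints the repaired body content:

```python
from bs4 import BeautifulSoup

# </b> arrives while <i> is still open: the tags are misnested.
markup = '<b><i>bold-italic</b>italic</i>'
soup = BeautifulSoup(markup, 'html5lib')

# The parser closes <i> together with <b>, then reopens <i> for the
# text that follows, so the formatting of each text run is preserved.
repaired = soup.body.decode_contents()
print(repaired)  # <b><i>bold-italic</i></b><i>italic</i>
```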

Encoding Detection
~~~~~~~~~~~~~~~~~~

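html5lib detects the character encoding of byte input following the HTML specification (byte order mark, ``<meta charset>`` prescan, and optionally ``chardet`` as a fallback). Markup passed as a Unicode string, as in this example, needs no detection and is parsed as is: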

.. code-block:: python

    from bs4 import BeautifulSoup

    markup = '<p>Héllo Wörld</p>'
    soup = BeautifulSoup(markup, 'html5lib')

    print('Héllo' in soup.get_text())  # True
    print('Wörld' in soup.get_text())  # True
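
Encoding detection only comes into play when the markup is passed as ``bytes``. A minimal sketch, assuming UTF-8 input that declares its encoding in a ``<meta>`` tag (the byte string is constructed here purely for illustration):

```python
from bs4 import BeautifulSoup

# Bytes input: the parser has to work out the encoding itself.
# The <meta charset> declaration is what it picks up here.
raw = '<meta charset="utf-8"><p>Héllo Wörld</p>'.encode('utf-8')
soup = BeautifulSoup(raw, 'html5lib')

text = soup.get_text()
print('Héllo Wörld' in text)  # True

# Beautiful Soup records the encoding that was used for decoding.
print(soup.original_encoding)
```

Without a declaration or byte order mark, the result depends on whether ``chardet`` is installed, so declaring the encoding (or decoding the bytes yourself before parsing) is the more predictable route.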

When to Use html5lib
--------------------

Consider using html5lib with BeautifulSoup when you need:

- **HTML5 compliance**: You want parsing that matches how modern web browsers handle HTML
- **Robust error handling**: You're dealing with malformed or broken HTML
- **Consistent behavior**: You need parsing that follows the HTML5 specification exactly
- **Encoding detection**: You're working with documents in various character encodings

Performance Considerations
--------------------------

html5lib prioritizes correctness and compliance over speed. If you're parsing large amounts of HTML and performance is critical, you might want to consider other parsers like lxml. However, if correctness and compliance with HTML5 standards are more important than raw speed, html5lib is an excellent choice.

Installation
------------

To use html5lib with BeautifulSoup, you need to install both packages:

.. code-block:: bash

    pip install beautifulsoup4 html5lib

Limitations
-----------

When using html5lib with BeautifulSoup, note these limitations:

- ``SoupStrainer`` (the ``parse_only`` argument) is not supported; the entire document is always parsed
- Some BeautifulSoup features that depend on custom element types may not work
- html5lib is generally slower than other parsers
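
The ``SoupStrainer`` limitation can be observed directly. The following sketch passes the same strainer to both parsers; the markup is made up, and Beautiful Soup is expected to emit a warning for the html5lib case, which is captured and printed rather than assumed:

```python
import warnings

from bs4 import BeautifulSoup, SoupStrainer

markup = '<p>Intro</p><a href="/page1">link</a>'
only_links = SoupStrainer('a')

# html.parser honours the strainer: only <a> elements are kept.
strained = BeautifulSoup(markup, 'html.parser', parse_only=only_links)
print(strained.find('p') is None)  # True

# html5lib ignores the strainer and parses the whole document.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    soup = BeautifulSoup(markup, 'html5lib', parse_only=only_links)

print(soup.find('p') is not None)  # True
for warning in caught:
    print(warning.message)
```

If only part of a document is needed, filter after parsing (for example with ``find_all``) when using html5lib.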

Example: Complete Workflow
---------------------------

Here's a complete example showing how to use html5lib with BeautifulSoup:

.. code-block:: python

    from bs4 import BeautifulSoup

    # Read HTML from a file or string
    html_content = '''
    <html>
    <head><title>Example Page</title></head>
    <body>
    <h1>Welcome</h1>
    <p>This is a <a href="/page1">link</a></p>
    <p>Another <a href="/page2">link</a></p>
    </body>
    </html>
    '''

    # Parse with html5lib for HTML5-compliant parsing
    soup = BeautifulSoup(html_content, 'html5lib')

    # Navigate the parse tree
    title = soup.find('title')
    print('Page title: {}'.format(title.get_text()))

    # Find all links
    links = soup.find_all('a')
    for link in links:
        href = link.get('href')
        text = link.get_text()
        print('{}: {}'.format(text, href))

See Also
--------

- `BeautifulSoup Documentation <https://www.crummy.com/software/BeautifulSoup/bs4/doc/>`_
- `HTML5 Specification <https://html.spec.whatwg.org/>`_
1 change: 1 addition & 0 deletions doc/index.rst
@@ -8,6 +8,7 @@ Overview
:maxdepth: 2

movingparts
beautifulsoup
modules
changes
License <license>
22 changes: 22 additions & 0 deletions examples/README.md
@@ -0,0 +1,22 @@
# html5lib Examples

This directory contains example scripts demonstrating various uses of html5lib.

## BeautifulSoup Integration

**File:** `beautifulsoup_example.py`

This example demonstrates how to use html5lib as a parser backend for BeautifulSoup. It compares the behavior of html5lib with Python's built-in html.parser and shows the advantages of using html5lib for HTML5-compliant parsing.

To run:
```bash
python beautifulsoup_example.py
```

Requirements:
- beautifulsoup4
- html5lib

## About html5lib

html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as implemented by all major web browsers. This makes it particularly useful when you need parsing behavior that matches what browsers do, rather than just parsing valid HTML.
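
html5lib can also be used on its own, without BeautifulSoup. The sketch below parses an unclosed paragraph with `html5lib.parse`, which by default returns an `xml.etree.ElementTree` element; `namespaceHTMLElements=False` is passed only to keep the tag names short in this illustration:

```python
import html5lib

# namespaceHTMLElements=False keeps tag names plain ('p' rather than
# '{http://www.w3.org/1999/xhtml}p'), which shortens ElementTree lookups.
doc = html5lib.parse('<p>Hello', namespaceHTMLElements=False)

print(doc.tag)                       # html
print([child.tag for child in doc])  # ['head', 'body']
print(doc.find('body/p').text)       # Hello
```

As in a browser, the missing `<html>`, `<head>` and `<body>` elements are created and the unclosed `<p>` is closed at the end of the input.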
83 changes: 83 additions & 0 deletions examples/beautifulsoup_example.py
@@ -0,0 +1,83 @@
#!/usr/bin/env python
"""
Example demonstrating html5lib integration with BeautifulSoup.

This example shows how to use html5lib as a parser backend for BeautifulSoup,
providing HTML5-compliant parsing with robust error handling for malformed HTML.
"""

from __future__ import print_function

try:
    from bs4 import BeautifulSoup
except ImportError:
    print("Error: BeautifulSoup4 is required to run this example.")
    print("Install it with: pip install beautifulsoup4 html5lib")
    raise SystemExit(1)

import html5lib


def main():
    print("=" * 60)
    print("html5lib with BeautifulSoup - Example")
    print("=" * 60)

    # Test markup with potential parsing challenges
    markup = '''
    <html>
    <body>
    <p>Hello <span>World</span>!</p>
    <p>Unclosed paragraph
    <div>Nested content</div>
    </body>
    </html>
    '''

    print("\n1. Parsing with html5lib via BeautifulSoup:")
    print("-" * 60)
    soup_html5lib = BeautifulSoup(markup, 'html5lib')
    print("Parser used: html5lib")
    print("Found <p> tags:", len(soup_html5lib.find_all('p')))
    print("Found <span> tags:", len(soup_html5lib.find_all('span')))
    print("Found <div> tags:", len(soup_html5lib.find_all('div')))

    print("\n2. Parsing with html.parser via BeautifulSoup:")
    print("-" * 60)
    soup_htmlparser = BeautifulSoup(markup, 'html.parser')
    print("Parser used: html.parser")
    print("Found <p> tags:", len(soup_htmlparser.find_all('p')))
    print("Found <span> tags:", len(soup_htmlparser.find_all('span')))
    print("Found <div> tags:", len(soup_htmlparser.find_all('div')))

    print("\n3. Direct html5lib parsing:")
    print("-" * 60)
    doc = html5lib.parse(markup)
    print("Parser: html5lib (direct)")
    print("Document type:", type(doc))

    print("\n4. Comparing results:")
    print("-" * 60)

    # Test with malformed HTML
    malformed = '<p>First<p>Second<p>Third'

    soup_html5lib = BeautifulSoup(malformed, 'html5lib')
    soup_htmlparser = BeautifulSoup(malformed, 'html.parser')

    print("Malformed HTML: {}".format(malformed))
    print("html5lib found {} paragraphs".format(len(soup_html5lib.find_all('p'))))
    print("html.parser found {} paragraphs".format(len(soup_htmlparser.find_all('p'))))

    print("\n" + "=" * 60)
    print("CONCLUSION:")
    print("=" * 60)
    print("html5lib works correctly as a BeautifulSoup parser backend.")
    print("It provides HTML5-compliant parsing with robust error handling.")
    print("The choice between 'html5lib' and 'html.parser' depends on your")
    print("specific needs for compliance vs. performance.")
    print("=" * 60)


if __name__ == '__main__':
    main()