jsoup SAX + DOM StreamParser
I’ve added a cookbook example for jsoup’s StreamParser that demonstrates how to use this hybrid Java SAX + DOM parser. It is based on examples from the original PR.
Problem
You need to parse an HTML or XML document that is too large to fit entirely into memory, or you want to process elements progressively as they are encountered. A typical use case is extracting specific elements from a large document, or handling streamed HTML from a network source efficiently.
Traditional Java SAX parsers offer efficient streaming parsing for XML and HTML, but they lack an ergonomic way to traverse or manipulate elements like a DOM parser. Meanwhile, standard DOM parsers, such as Jsoup.parse(), require loading the entire document into memory, which may be inefficient for large files.
Solution
Use the StreamParser, which allows you to parse an HTML or XML document in an event-driven, hybrid DOM + SAX style. Elements are emitted as they are completed, enabling efficient memory use and incremental processing. This hybrid approach allows you to process elements as they arrive, including their children and ancestors, while still leveraging jsoup’s intuitive API.
This makes StreamParser a viable alternative to traditional SAX parsers while providing a more ergonomic and familiar API. And jsoup’s robust handling of malformed HTML and XML ensures that real-world documents can be processed effectively.
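As a rough illustration of the approach described above, here is a minimal sketch that pulls matching elements out of a document one at a time. It assumes a jsoup version that includes StreamParser (1.18.1 or later, to my knowledge). The class name StreamParserSketch, the helper collectText, and the sample markup are hypothetical, chosen only for this example; a real program would pass a file or network Reader rather than a StringReader.

```java
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
import org.jsoup.parser.StreamParser;

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of jsoup.
public class StreamParserSketch {

    // Collects the text of each element matching cssQuery, handling one element at a time.
    static List<String> collectText(String html, String cssQuery) throws IOException {
        List<String> texts = new ArrayList<>();
        // StreamParser is Closeable; try-with-resources releases the underlying reader.
        try (StreamParser streamer = new StreamParser(Parser.htmlParser())
                .parse(new StringReader(html), "")) {
            Element el;
            // selectNext() keeps parsing until the next matching element is complete,
            // and returns null once the input is exhausted.
            while ((el = streamer.selectNext(cssQuery)) != null) {
                texts.add(el.text());
                // Detach the processed element so the in-memory DOM does not keep growing.
                el.remove();
            }
        }
        return texts;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample input for illustration.
        String html = "<ul><li>One</li><li>Two</li><li>Three</li></ul>";
        System.out.println(collectText(html, "li"));
    }
}
```

Removing each element after use is what keeps memory flat on large inputs. If you need the element's ancestors or earlier siblings later, skip the remove() call for those nodes. For XML input, Parser.xmlParser() can be passed to the constructor instead of Parser.htmlParser().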