Shards of Development: Jericho

Oct 1, 2012

Jericho HTML Parser

[Originally posted on G+ on February 6th 2012]

Yesterday evening I was looking for an HTML parser that would be tolerant of malformed documents. I first looked at Jericho HTML http://jericho.htmlparser.net (which I had read about a while back), and while searching for a comparison between Jericho and CyberNeko HTML Parser http://sourceforge.net/projects/nekohtml/ (which is mentioned on the Jericho home page), I stumbled upon JSoup http://jsoup.org/

I immediately fell in love with JSoup because:

it's open source under the MIT license
it's DOM-oriented and claims to support HTML5
it's a very small library with no dependencies (CyberNeko, by contrast, depends on Apache Xerces)
it claims Android compatibility
it offers a decent API for performing HTTP requests (design looks similar to Apache HTTP Client)
it is fully documented and comes with several code examples

I ran a quick test, fetching and parsing a couple of HTML documents produced by the page flow of a dynamic website, and everything worked fine.
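For reference, a quick test of this kind could look roughly like the sketch below. It is not the code I actually ran: the class name JsoupSketch, the helper methods titleOf and firstLinkUrl, the sample markup and the example.com URL are all made up for illustration. It assumes the jsoup jar is on the classpath and only uses Jsoup.parse, Jsoup.connect, Document.title and Element.absUrl.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSketch {

    // Parses a (possibly malformed) HTML string and returns its <title> text.
    static String titleOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    // Returns the absolute URL of the first link in the document,
    // resolved against baseUri, or an empty string if there is no link.
    static String firstLinkUrl(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a[href]").first();
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) throws Exception {
        if (args.length > 0) {
            // Fetch a live page over HTTP; the URL comes from the command line.
            Document doc = Jsoup.connect(args[0])
                    .userAgent("Mozilla/5.0")
                    .timeout(10000) // milliseconds
                    .get();
            System.out.println(doc.title());
            return;
        }

        // Deliberately malformed sample: no <html>/<body>, unclosed <p> and <b>.
        String html = "<title>Test</title><p>Hello <b>world<p><a href='/next'>next</a>";
        System.out.println(titleOf(html));
        System.out.println(firstLinkUrl(html, "http://example.com/"));
    }
}
```

Run without arguments it only parses the in-memory sample, so it works offline; pass a URL as the first argument to try the HTTP side.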

I don't know how it behaves on more complex pages, but the positive comments I have read on Stack Overflow and other sites make me optimistic.
