Oct 1, 2012

Jericho HTML Parser

[Originally posted on G+ on February 6th 2012]

Yesterday evening I was looking for an HTML parser that tolerates malformed documents. I first looked at Jericho HTML Parser http://jericho.htmlparser.net (I had read about it a while back). While searching for a comparison between Jericho and the CyberNeko HTML Parser http://sourceforge.net/projects/nekohtml/ (which is mentioned on the Jericho home page), I stumbled upon JSoup http://jsoup.org/

I immediately fell in love with JSoup because:

  1. it's open source under the MIT license
  2. it's DOM-oriented and claims to support HTML5
  3. it's a very small library with no dependencies (CyberNeko, by contrast, depends on Apache Xerces)
  4. it claims Android compatibility
  5. it offers a decent API for performing HTTP requests (the design looks similar to Apache HttpClient)
  6. it is fully documented and comes with several code examples

I ran a quick test, fetching and parsing a couple of HTML documents produced by navigating a dynamic website, and everything worked fine.
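A quick test of this kind might look roughly like the sketch below. It is not the code I actually ran: the class name `JsoupSketch`, the helper names, the user agent string, and the sample markup are made up for illustration, and it assumes the jsoup jar is on the classpath. It uses `Jsoup.connect(...).get()` for the HTTP part and `Jsoup.parse(...)` plus a CSS selector for the DOM part.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSketch {

    // Fetches a page over HTTP and returns the parsed DOM.
    // The user agent and timeout values are arbitrary choices for this sketch.
    static Document fetch(String url) throws IOException {
        return Jsoup.connect(url)
                .userAgent("jsoup-quick-test")
                .timeout(10_000) // milliseconds
                .get();
    }

    // Parses an HTML string (possibly malformed) and collects the href
    // attribute of every anchor that has one, in document order.
    static List<String> linkTargets(String html) {
        Document doc = Jsoup.parse(html);
        List<String> hrefs = new ArrayList<>();
        for (Element a : doc.select("a[href]")) {
            hrefs.add(a.attr("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Deliberately sloppy markup: unclosed <p> tags, an unclosed <a>,
        // and no <html> or <body> wrapper.
        String malformed =
                "<p>First <a href=\"/one\">link</a><p>Second <a href='/two'>link";
        System.out.println(linkTargets(malformed));
    }
}
```

To try the HTTP side, call `fetch` with the URL of the page to test and pass the resulting `Document` to `select` in the same way.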

I don't know how it behaves on more complex pages, but the positive comments I have read on Stack Overflow and other sites make me optimistic.

