[Originally posted on G+ on February 6th 2012]
Yesterday evening I was looking for an HTML parser that would be tolerant of malformed documents. I first looked at Jericho HTML http://jericho.htmlparser.net (I read about it a while back), and while looking for a comparison between Jericho and CyberNeko HTML Parser http://sourceforge.net/projects/nekohtml/ (which is mentioned on the Jericho home page), I stumbled upon JSoup http://jsoup.org/
I immediately fell in love with JSoup because:
- it's open source under the MIT license
- it's DOM oriented and claims to support HTML5
- it's a very small library with no dependencies (CyberNeko, by contrast, depends on Apache Xerces)
- it claims Android compatibility
- it offers a decent API for performing HTTP requests (the design looks similar to Apache HttpClient)
- it is fully documented and comes with several code examples
I made a quick test fetching and parsing a couple of HTML documents resulting from the flow of a dynamic website and everything worked fine.
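For the curious, here is a minimal sketch of what such a quick test can look like. It is not the code I actually ran: the class name `JsoupSketch`, the helper `firstLinkText` and the sample markup are made up for illustration, and it assumes the JSoup jar is on the classpath. It parses a deliberately broken fragment from a string, so it needs no network access.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSketch {

    // Hypothetical helper: parses (possibly malformed) HTML and returns
    // the text of the first link that has an href, or null if there is none.
    static String firstLinkText(String html) {
        Document doc = Jsoup.parse(html);
        Element link = doc.select("a[href]").first();
        return link == null ? null : link.text();
    }

    public static void main(String[] args) {
        // Sample input with unclosed <p> and <a> tags and no <html>/<body> wrapper.
        String html = "<p>Hello <a href='http://jsoup.org/'>jsoup";
        System.out.println(firstLinkText(html));
    }
}
```

To run the same check against a live page instead of a string, `Jsoup.connect(url).get()` also returns a `Document`, so the rest of the helper would stay the same.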
I don't know how it behaves on more complex pages, but the positive comments I read on Stack Overflow and other sites make me optimistic.