Java Open Source Projects Directory

...dedicated to Java open source projects

HTML Parsers

jericho-html-parser

Jericho HTML Parser is a Java library that allows analysis and manipulation of parts of an HTML document, including server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions.
The library distinguishes itself from other HTML parsers with the following major features:

  • The presence of badly formatted HTML does not interfere with the parsing of the rest of the document, which makes the library ideal for use with "real-world" HTML that chokes other parsers.
  • ASP, JSP, PSP, PHP and Mason server tags are explicitly recognised by the parser. This means that normal HTML is still parsed properly even if it contains server tags, which is common, for example, when element attributes are set dynamically.
  • It is neither an event-based nor a tree-based parser; instead it uses a combination of simple text search, efficient tag recognition and a tag position cache. The text of the whole source document is first loaded into memory, and then only the relevant segments are searched for the characters relevant to each search operation.
  • Compared to a tree-based parser such as DOM, the memory and resource requirements can be far lower if only small sections of the document need to be parsed or modified. Incorrect or badly formatted HTML can easily be ignored, whereas a tree-based parser must identify every node in the document from top to bottom.
  • Compared to an event based parser such as SAX, the interface is on a much higher level and more intuitive, and a tree representation of the document element hierarchy is easily created if required.
  • The begin and end positions in the source document of all parsed segments are accessible, allowing modification of only selected segments of the document without having to reconstruct the entire document from a tree.
  • The row and column number of each position in the source document are easily accessible.
  • Provides a simple but comprehensive interface for the analysis and manipulation of HTML form controls, including the extraction and population of initial values, and conversion to read-only or data display modes. Analysis of the form controls also allows data received from the form to be stored and presented in an appropriate manner.
  • Custom tag types can be easily defined and registered for recognition by the parser.
  • Built-in functionality to extract all text from HTML markup, suitable for feeding into a text search engine such as Apache Lucene.
  • Built-in functionality to render HTML markup with simple text formatting.
  • Built-in functionality to format HTML source code that indents elements according to their depth in the document element hierarchy.
  • Built-in functionality to compact HTML source code by removing all unnecessary white space.
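
The position-based features in the list above (begin and end offsets for every segment, row and column lookup, modifying only selected segments) can be illustrated with a small standalone sketch. This is not the Jericho API: the class SegmentEditor and its methods are hypothetical names invented for this example, and the code only shows the general idea of editing a document by character offsets while copying everything else verbatim.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration of offset-based editing; NOT part of Jericho. */
final class SegmentEditor {

    /** Replacement for the half-open character range [begin, end) of the source. */
    private static final class Edit {
        final int begin;
        final int end;
        final String replacement;

        Edit(int begin, int end, String replacement) {
            this.begin = begin;
            this.end = end;
            this.replacement = replacement;
        }
    }

    private final String source;
    private final List<Edit> edits = new ArrayList<>();

    SegmentEditor(String source) {
        this.source = source;
    }

    /** Registers a replacement for the source characters in [begin, end). */
    void replace(int begin, int end, String replacement) {
        if (begin < 0 || end < begin || end > source.length()) {
            throw new IndexOutOfBoundsException("bad range " + begin + ".." + end);
        }
        edits.add(new Edit(begin, end, replacement));
    }

    /** Builds the output: untouched text is copied verbatim, invalid markup included. */
    String result() {
        List<Edit> sorted = new ArrayList<>(edits);
        sorted.sort(Comparator.comparingInt((Edit e) -> e.begin));
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (Edit e : sorted) {
            if (e.begin < pos) {
                throw new IllegalStateException("overlapping edits");
            }
            out.append(source, pos, e.begin).append(e.replacement);
            pos = e.end;
        }
        return out.append(source, pos, source.length()).toString();
    }

    /** Returns {row, column}, both 1-based, of a character position in the text. */
    static int[] rowColumn(String text, int position) {
        int row = 1;
        int column = 1;
        for (int i = 0; i < position; i++) {
            if (text.charAt(i) == '\n') {
                row++;
                column = 1;
            } else {
                column++;
            }
        }
        return new int[] {row, column};
    }
}
```

Because the edit is applied by offset, the mismatched </B> and the unquoted attribute in an input such as <p class=x>Hi <b>there</B></p> would pass through unchanged; only the replaced range differs in the output.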
 

java-mozilla-html-parser

MozillaParser is a Java HTML parser based on Mozilla's HTML parser. It acts as a bridge from Java classes to Mozilla's classes and outputs a Java Document object from raw (and dirty) HTML input.

 

tagsoup

TagSoup is a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML.
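
Since the parser is driven through the standard SAX interface, an ordinary ContentHandler is all an application needs to write. The sketch below collects the href values of a elements. To stay runnable without extra dependencies it uses the JDK's built-in SAX parser, which only accepts well-formed input; the assumption is that with TagSoup on the classpath its XMLReader implementation (org.ccil.cowan.tagsoup.Parser) would be passed to collect instead. LinkCollector, collect and jdkReader are names made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical example handler: gathers link targets from SAX events. */
final class LinkCollector extends DefaultHandler {
    final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("a".equalsIgnoreCase(localName)) {
            // Look the attribute up by (namespace URI, local name).
            String href = atts.getValue("", "href");
            if (href != null) {
                links.add(href);
            }
        }
    }

    /** Runs any SAX XMLReader over the markup and returns the collected links. */
    static List<String> collect(XMLReader reader, String markup) {
        LinkCollector handler = new LinkCollector();
        reader.setContentHandler(handler);
        try {
            reader.parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return handler.links;
    }

    /** Stand-in reader from the JDK; an HTML-tolerant reader would replace it. */
    static XMLReader jdkReader() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // so that localName is populated
            return factory.newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException("no SAX parser available", e);
        }
    }
}
```

The handler itself does not depend on which reader produces the events, which is the point of exposing HTML through SAX.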

 

hotsax

HotSAX is a fast, small footprint, non-validating SAX2 parser for HTML/XML/XHTML. It can be used in simple web agents, page scrapers, and spiders. It is similar to the Apache Xerces parser, except that it can generate SAX events for badly formatted HTML as well.

 

htmlcleaner

HtmlCleaner is an open-source HTML parser written in Java. HTML found on the Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring order to the tags, attributes and ordinary text. For a given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows rules similar to those most web browsers use to create the document object model. However, the user may provide a custom tag and rule set for tag filtering and balancing.

 

nekohtml

NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces. The parser can scan HTML files and "fix up" many common mistakes that human (and computer) authors make in writing HTML documents. NekoHTML adds missing parent elements; automatically closes elements with optional end tags; and can handle mismatched inline element tags.
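
To make the idea of tag balancing concrete, here is a deliberately small toy balancer. It is not NekoHTML's algorithm or API; ToyTagBalancer is a hypothetical class that keeps a stack of open elements, closes whatever is still open when an outer end tag arrives or the input ends, and drops end tags that match nothing. It does not add missing parent elements, and it treats only a short, assumed list of elements as void.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy stack-based tag balancer for illustration; NOT how NekoHTML works internally. */
final class ToyTagBalancer {
    // group(1) is "/" for end tags, group(2) is the element name.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    // Small assumed subset of elements that never take an end tag.
    private static final Set<String> VOID =
            new HashSet<>(Arrays.asList("br", "hr", "img", "input", "link", "meta"));

    static String balance(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(html);
        int pos = 0;
        while (m.find()) {
            out.append(html, pos, m.start()); // copy text between tags verbatim
            pos = m.end();
            String name = m.group(2).toLowerCase(Locale.ROOT);
            boolean closing = !m.group(1).isEmpty();
            if (VOID.contains(name)) {
                if (!closing) {
                    out.append(m.group()); // void element: emit, never push
                }
            } else if (!closing) {
                open.push(name);
                out.append(m.group());
            } else if (open.contains(name)) {
                // Close every element opened after the matching start tag, then the match.
                String top;
                do {
                    top = open.pop();
                    out.append("</").append(top).append('>');
                } while (!top.equals(name));
            }
            // Otherwise: a stray end tag with no matching start tag is dropped.
        }
        out.append(html, pos, html.length());
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>'); // close leftovers
        }
        return out.toString();
    }
}
```

A real balancer also has to know which elements may nest inside which, which is where most of the complexity lies.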

 

jtidy

JTidy is a Java port of HTML Tidy, an HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document being processed, which effectively lets you use JTidy as a DOM parser for real-world HTML.
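
What a DOM interface buys the application is that ordinary org.w3c.dom code can walk the document. The sketch below reads the page title using only basic DOM methods. It builds its Document with the JDK's own parser from well-formed input so that it runs without dependencies; that a Document obtained from JTidy could be handed to the same title method is an assumption, not something verified here. DomTitle and its methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical helper: reads the title from any org.w3c.dom.Document. */
final class DomTitle {

    /** Returns the trimmed text of the first title element, or "" if none. */
    static String title(Document doc) {
        NodeList titles = doc.getElementsByTagName("title");
        return titles.getLength() == 0 ? "" : text(titles.item(0)).trim();
    }

    /** Concatenates the text nodes below a node using only basic DOM calls. */
    static String text(Node node) {
        StringBuilder sb = new StringBuilder();
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) {
                sb.append(c.getNodeValue());
            } else {
                sb.append(text(c));
            }
        }
        return sb.toString();
    }

    /** Stand-in document source: the JDK parser, which needs well-formed input. */
    static Document parseWellFormed(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```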

 

java-html-parser

HTML Parser that produces a stream of tag objects, which can be further parsed into a searchable tree structure.

 