Java Open Source Projects Directory

...dedicated into Java open source projects

  • Increase font size
  • Default font size
  • Decrease font size
Crawlers

web-harvest

Web-Harvest is Open Source Web Data Extraction tool written in Java. It offers a way to collect desired Web pages and extract useful data from them. In order to do that, it leverages well established techniques and technologies for text/xml manipulation such as XSLT, XQuery and Regular Expressions. Web-Harvest mainly focuses on HTML/XML based web sites which still make vast majority of the Web content. On the other hand, it could be easily supplemented by custom Java libraries in order to augment its extraction capabilities.

 

java-web-crawler

Java Web Crawler is a simple Web crawling utility written in Java. It supports the robots exclusion standard.

 

arachnid

Arachnid is a Java-based web spider framework. It includes a simple HTML parser object that parses an input stream containing HTML content. Simple Web spiders can be created by sub-classing Arachnid and adding a few lines of code called after each page of a Web site is parsed.

 

weblech

WebLech is a fully featured web site download/mirror tool in Java, which supports many features required to download websites and emulate standard web-browser behaviour as much as possible. WebLech is multithreaded and will feature a GUI console.

 

webeater

A 100% pure Java program for web site retrieval and offline viewing.

 

heritrix

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

 

websphinx

WebSPHINX ( Website-Specific Processors for HTML INformation eXtraction) is a Java class library and interactive development environment for Web crawlers that browse and process Web pages automatically.

 

jobo

JoBo is a simple program to download complete websites to your local computer. Internally it is basically a web spider. The main advantage to other download tools is that it can automatically fill out forms (e.g. for automated login) and also use cookies for session handling. Compared to other products the GUI seems to be very simple, but the internal features matters ! Do you know any download tool that allows it to login to a web server and download content if that server uses a web forms for login and cookies for session handling ? It also features very flexible rules to limit downloads by URL, size and/or MIME type.

 
  • «
  •  Start 
  •  Prev 
  •  1 
  •  2 
  •  Next 
  •  End 
  • »


Page 1 of 2