crawler4j
Crawler4j is an open source Java crawler which provides a simple interface for crawling the web. Using it, you can set up a multi-threaded web crawler in 5 minutes!
Sample Usage

First, you need to create a crawler class that extends WebCrawler. This class decides which URLs should be crawled and handles the downloaded page. The following is a sample implementation:
import java.util.ArrayList;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // Skip URLs that point to binary, style, or media files.
    Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    public MyCrawler() {
    }

    // Decides whether the given URL should be crawled.
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        if (filters.matcher(href).matches()) {
            return false;
        }
        if (href.startsWith("http://www.ics.uci.edu/")) {
            return true;
        }
        return false;
    }

    // Called after the content of a URL has been downloaded successfully.
    public void visit(Page page) {
        int docid = page.getWebURL().getDocid();
        String url = page.getWebURL().getURL();
        String text = page.getText();
        ArrayList links = page.getURLs();
    }
}
As can be seen in the above code, there are two main functions that should be overridden:
shouldVisit: This function decides whether the given URL should be crawled or not.

visit: This function is called after the content of a URL is downloaded successfully. You can easily get the text, links, URL, and docid of the downloaded page, as sketched below.
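For instance, visit could simply report what was fetched. The following is only a minimal sketch of a possible body for MyCrawler.visit above, using just the accessors shown in the sample; what you actually do with the data (store it, index it, etc.) is up to you:

import java.util.ArrayList;

// A possible visit() body that only prints a summary of the downloaded page.
public void visit(Page page) {
    int docid = page.getWebURL().getDocid();
    String url = page.getWebURL().getURL();
    String text = page.getText();
    ArrayList links = page.getURLs();

    System.out.println("Docid: " + docid);
    System.out.println("URL: " + url);
    System.out.println("Text length: " + (text == null ? 0 : text.length()));
    System.out.println("Outgoing links: " + (links == null ? 0 : links.size()));
}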
You should also implement a controller class which specifies the seeds of the crawl, the folder in which crawl data should be stored, and the number of concurrent threads:
import edu.uci.ics.crawler4j.crawler.CrawlController;

public class Controller {
    public static void main(String[] args) throws Exception {
        // Folder in which intermediate crawl data is stored.
        CrawlController controller = new CrawlController("/data/crawl/root");

        // Seed URL from which the crawl starts.
        controller.addSeed("http://www.ics.uci.edu/");

        // Start the crawl with 10 concurrent crawler threads.
        controller.start(MyCrawler.class, 10);
    }
}

Politeness

Crawler4j is designed very efficiently and has the ability to crawl domains very fast (e.g., it has been able to crawl 200 Wikipedia pages per second). However, because this is against crawling policies and puts a huge load on servers (and they might block you!), since version 1.3 crawler4j by default waits at least 200 milliseconds between requests. This parameter can be tuned with the "setPolitenessDelay" function of the controller.
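For example, to be more conservative you could raise the delay to one second. The sketch below is only illustrative: the class name PoliteController is made up for this example, and it assumes setPolitenessDelay takes the delay in milliseconds, as the 200 ms default suggests:

import edu.uci.ics.crawler4j.crawler.CrawlController;

public class PoliteController {
    public static void main(String[] args) throws Exception {
        CrawlController controller = new CrawlController("/data/crawl/root");

        // Wait at least 1000 ms between consecutive requests
        // (assuming the delay is given in milliseconds).
        controller.setPolitenessDelay(1000);

        controller.addSeed("http://www.ics.uci.edu/");
        controller.start(MyCrawler.class, 10);
    }
}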
Dependencies

The following libraries are used in the implementation of crawler4j. To make life easier, all of them are bundled in the "crawler4j-dependencies-lib.zip" package:
Berkeley DB Java Edition 4.0.71 or higher
fastutil 5.1.5
DSI Utilities 1.0.10 or higher
Apache HttpClient 4.0.1
Apache Log4j 1.2.15
Apache Commons Logging 1.1.1
Apache Commons Codec 1.4

Source Code

The source code is available for checkout from this Subversion repository: https://crawler4j.googlecode.com/svn/trunk/