java - How to adapt the URL that I want to crawl in crawler4j -

I tried modifying the code from the crawler4j-quickstart example.

I want to crawl the following link:

https://www.google.com/search?biw=1366&bih=645&tbm=nws&q=%22obama%22&oq=%22obama%22&gs_l=serp.3..0l5.825041.826084.0.826833.5.5.0.0.0.0.187.572.2j3.5.0....0...1c.1.64.serp..0.3.333...0i13k1.tmd9narkiru

which is a Google News search link for the keyword "obama".

I tried modifying MyCrawler.java:

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches()
               && href.startsWith("https://www.google.com/search?biw=1366&bih=645&tbm=nws&q=%22obama%22&oq=%22obama%22&gs_l=serp.3..0l5.825041.826084.0.826833.5.5.0.0.0.0.187.572.2j3.5.0....0...1c.1.64.serp..0.3.333...0i13k1.tmd9narkiru/");
    }

Also, Controller.java:

    /*
     * For each crawl, you need to add some seed URLs. These are the first
     * URLs that are fetched, and then the crawler starts following links
     * which are found in these pages.
     */
    // controller.addSeed("http://www.ics.uci.edu/~lopes/");
    // controller.addSeed("http://www.ics.uci.edu/~welling/");
    controller.addSeed("https://www.google.com/search?biw=1366&bih=645&tbm=nws&q=%22obama%22&oq=%22obama%22&gs_l=serp.3..0l5.825041.826084.0.826833.5.5.0.0.0.0.187.572.2j3.5.0....0...1c.1.64.serp..0.3.333...0i13k1.tmd9narkiru");

    /*
     * Start the crawl. This is a blocking operation, meaning that your code
     * will reach the line after this only when crawling is finished.
     */
    controller.start(MyCrawler.class, numberOfCrawlers);

Then it shows this error:

    SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
    SLF4J: Defaulting to no-operation (NOP) logger implementation
    SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
    BUILD SUCCESSFUL (total time: 43 seconds)

Is my code modification wrong?

Update

I tried using a URL other than the Google search link, and it works. I'm guessing it cannot crawl the Google search link. Any idea how to tackle this?

The error you're receiving has nothing to do with your code modification. Instead, it is related to incorrect configuration and missing JARs.

An SLF4J binding is required in order for SLF4J to perform logging; otherwise it will use the NOP logger implementation, as you've seen in the error message.

To resolve this issue, add an SLF4J binding JAR file to your project, such as slf4j-simple-<version>.jar.
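If it is unclear whether a binding actually ended up on the classpath, a quick check along the following lines may help. This is only a sketch: the class name `Slf4jBindingCheck` and the helper `isOnClasspath` are made up for illustration, and it assumes an SLF4J version that looks up `org.slf4j.impl.StaticLoggerBinder` (the class named in the error message above); newer SLF4J releases locate their providers differently.

```java
public class Slf4jBindingCheck {

    // The class named in the error message; supplied by binding JARs such as slf4j-simple.
    static final String BINDER_CLASS = "org.slf4j.impl.StaticLoggerBinder";

    /** Returns true if the named class can be located on the classpath. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, Slf4jBindingCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (isOnClasspath(BINDER_CLASS)) {
            System.out.println("SLF4J binding found; logging should work.");
        } else {
            System.out.println("No SLF4J binding found; add a binding JAR to the project.");
        }
    }
}
```

Run it with the same classpath as the crawler project; if it reports that no binding was found, the JAR has not been picked up by the build.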

You may refer to the SLF4J manual for a more detailed explanation.

Update

I don't think you're allowed to crawl Google search results, based on Google's robots.txt, which disallows paths beginning with /search from being crawled, and on its ToS:

Don’t misuse our Services. For example, don’t interfere with our Services or try to access them using a method other than the interface and the instructions that we provide. You may use our Services only as permitted by law, including applicable export and re-export control laws and regulations. We may suspend or stop providing our Services to you if you do not comply with our terms or policies or if we are investigating suspected misconduct.
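To illustrate the robots.txt point, here is a rough sketch of how a `Disallow: /search` rule excludes the seed URL from the question. It is not crawler4j's own robots.txt handling and not a full parser (no `Allow` rules, wildcards, or user-agent groups); the class `RobotsPrefixCheck`, the method `isDisallowed`, and the hard-coded rule list are assumptions made for the example.

```java
import java.net.URI;
import java.util.List;

public class RobotsPrefixCheck {

    /** Returns true if the URL's path starts with any of the disallowed prefixes. */
    static boolean isDisallowed(String url, List<String> disallowedPrefixes) {
        String path = URI.create(url).getPath();
        if (path == null || path.isEmpty()) {
            path = "/"; // a bare host is treated as the root path
        }
        for (String prefix : disallowedPrefixes) {
            if (path.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumption: one hard-coded rule standing in for "Disallow: /search".
        List<String> rules = List.of("/search");

        // The path is "/search", so the rule matches.
        System.out.println(isDisallowed("https://www.google.com/search?tbm=nws&q=%22obama%22", rules));

        // The path is "/~lopes/", so the rule does not match.
        System.out.println(isDisallowed("http://www.ics.uci.edu/~lopes/", rules));
    }
}
```

A crawler that honours robots.txt would drop the Google seed before fetching anything, which is consistent with other URLs working while this one does not.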

You may consider using Google's Custom Search API, in conformance with the ToS.
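As a starting point, a request URL for the Custom Search JSON API could be assembled roughly like this. The sketch only builds the URL and makes no network call; the class `CustomSearchUrl`, the method `buildUrl`, and the placeholder key and engine ID are hypothetical, and the endpoint and the `key`, `cx`, and `q` parameters should be checked against Google's current documentation before use.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class CustomSearchUrl {

    // Assumed endpoint of the Custom Search JSON API; verify against the official docs.
    static final String ENDPOINT = "https://www.googleapis.com/customsearch/v1";

    /** Builds a query URL from an API key, a search engine ID (cx), and a search query. */
    static String buildUrl(String apiKey, String engineId, String query) {
        return ENDPOINT
                + "?key=" + URLEncoder.encode(apiKey, StandardCharsets.UTF_8)
                + "&cx=" + URLEncoder.encode(engineId, StandardCharsets.UTF_8)
                + "&q=" + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Placeholder credentials; replace with real values from your own Google account.
        System.out.println(buildUrl("YOUR_API_KEY", "YOUR_ENGINE_ID", "\"obama\""));
    }
}
```

The response is JSON rather than HTML, so it would be parsed directly instead of being fed to crawler4j as a seed.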
