Nov 07, 2015

There are several tutorials about how to install Nutch 2.x with HBase and Solr. However, I ran into errors with all of them. In this post I am going to describe how I integrated these products.

The integration was made with:

  • Apache Nutch 2.x
  • Apache HBase 0.94.14
  • Apache Solr 4.8.1

These versions are very important, because if you use another combination of versions, the stack will likely not work.

In our case, we are going to install a basic example under the directory /Applications/searchengine on OS X 10.8.5. Nevertheless, at some points we will also cover issues specific to Windows and Linux.

After extracting HBase, you need to configure it. Open hbase-site.xml, which you will find in HBase's conf directory, and modify it with:
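A minimal hbase-site.xml for a local, standalone setup could look like the following sketch; the data and ZooKeeper directories are illustrative and should point to writable locations on your machine:

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///Applications/searchengine/hbase-0.94.14/hbase-data</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/Applications/searchengine/hbase-0.94.14/zookeeper-data</value>
  </property>
</configuration>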

Next, you need to specify the Gora backend in $NUTCH_HOME/conf/nutch-site.xml.
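The property that selects the HBase backend is storage.data.store.class; a snippet along these lines goes inside the <configuration> element:

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>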

After that, ensure the HBase gora-hbase dependency is available in $NUTCH_HOME/ivy/ivy.xml.
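In practice this means uncommenting (or adding) the gora-hbase dependency in ivy.xml; the rev value below is illustrative and must match the Gora version shipped with your Nutch release:

<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />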

Then make sure that HBaseStore is set as the default data store in the gora.properties file. You will find this file in /conf. Add the following line:
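The line in question sets the default Gora data store:

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore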

Now go to the Nutch home directory and type the following command in your terminal:

ant runtime

This will build Apache Nutch and create the respective directories in Apache Nutch's home directory. This step is needed because Apache Nutch 2.x is only distributed as source code. You will need Ant 1.9.x in order to build Nutch 2.x properly, especially if you use Java 1.8.
We had to update our Ant version. Basically, what we did was:

  • Download and unzip the latest version of Ant
  • Copy the new version of Ant under the /Library directory.
  • Remove the current symbolic link to ant
  • Create a new symbolic link pointing to the new Ant

In the following extract you can see what was done:
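A rough sketch of those steps on OS X follows; the Ant version, download location and symlink path are assumptions and may differ on your machine:

sudo unzip ~/Downloads/apache-ant-1.9.4-bin.zip -d /Library        # unpack the new Ant under /Library
which ant                                                          # locate the current symbolic link
sudo rm /usr/local/bin/ant                                         # remove the old link
sudo ln -s /Library/apache-ant-1.9.4/bin/ant /usr/local/bin/ant    # point the link to the new Ant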

Part of the result of executing ant runtime is:

The tree structure of the generated directories would be as shown in the following picture:
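Roughly, the generated layout looks like this (directory names taken from a standard Nutch 2.x source tree; your listing may differ slightly):

apache-nutch-2.x/
├── build/
├── conf/
├── ivy/
├── src/
└── runtime/
    ├── deploy/
    └── local/
        ├── bin/
        ├── conf/
        ├── lib/
        └── plugins/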

The most important directory is runtime, which contains all the scripts required for crawling.

Now make sure HBase is started and working properly. Before starting it, the JAVA_HOME environment variable must be properly set. Then, to check whether HBase is running, go to the home directory of HBase and type the following command from your terminal:

./bin/hbase shell

You will get an output as follows:
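If the shell does not come up, double-check that JAVA_HOME is set and that HBase has actually been started. A minimal sequence on OS X, assuming the install directory used in this post, would be:

export JAVA_HOME=$(/usr/libexec/java_home)   # on Linux, point this at your JDK instead
cd /Applications/searchengine/hbase-0.94.14
./bin/start-hbase.sh
./bin/hbase shell
status                                       # inside the shell; should report one live server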

Setting up HBase on Windows

If you are installing on Windows, you will need Cygwin to start HBase. You can follow the instructions on this page to set up HBase on Windows. Be careful: due to a bug, you will need to create a dummy file, such as zzz.jar, under hbase/lib. If you do not create this file, you will get an error similar to:

java.lang.NoClassDefFoundError: org/apache/zookeeper/KeeperException

More information about this error is on this page.
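From the Cygwin shell, the dummy file can be created with a simple touch (the HBase path below is illustrative):

touch /cygdrive/c/hbase-0.94.14/lib/zzz.jar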

Besides, do not forget to set the parameter
export HBASE_MANAGES_ZK=false
in hbase-env.sh.

Besides, in hbase-site.xml add:
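The snippet here is presumably the property that, together with HBASE_MANAGES_ZK=false, stops HBase from launching its own ZooKeeper instance; treat the following as an assumption rather than the post's literal configuration:

<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>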

in order to avoid the error

Could not start ZK at requested port of 2181.

You can get more information at this link.

Part of the installation can be seen in the following images:

Starting HBase on Windows:

Verifying Apache Nutch installation

The installation of Nutch is finished. In order to verify your Apache Nutch installation, go to runtime/local and type the following command:

bin/nutch

You will get an output as follows:

Crawling websites

Crawling is driven by the Apache Nutch crawling tool and certain related tools for building and maintaining several data structures. These include the web database, the index, and a set of segments. The steps for crawling are:

  1. Add an agent name in the value field of the http.agent.name property in the nutch-site.xml file (a sample snippet follows this list).

  2. Under the runtime/local directory, create a directory called urls.
    mkdir -p urls
    If you are using Windows:
    mkdir urls
  3. Then you have to create a file called seed.txt under urls. In this file you will put the URLs you want to crawl, one URL per line. For example (on Linux or Mac),
    echo "http://nutch.apache.org" >> urls/seed.txt
  4. In order to filter URLs for crawling, you will have to edit the regex-urlfilter.txt file. This file is self-explanatory: lines starting with + will include matching URLs in the crawl, and lines starting with - will exclude them.
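For reference, a minimal agent-name entry in nutch-site.xml could look like this (the agent name MyNutchSpider is just an illustrative value), together with a regex-urlfilter.txt rule set that restricts the crawl to the seed domain:

<property>
  <name>http.agent.name</name>
  <value>MyNutchSpider</value>
</property>

# in regex-urlfilter.txt: accept the seed domain, skip everything else
+^http://nutch\.apache\.org/
-.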

Integration of Solr with Nutch

First of all, you need to download Solr and unzip it. In our case, Solr was installed under
/Applications/searchengine/solr-4.8.1/
Then you have to export the SOLR_HOME environment variable. You can do that in your .bash_profile or .bashrc. For example,
export SOLR_HOME=/Applications/searchengine/solr-4.8.1/example/solr
Now you can start Solr by going to the example directory under your Apache Solr home (/Applications/searchengine/solr-4.8.1/example) and typing:

java -jar start.jar

You will get the following output:

Open the following URL in your browser:
http://localhost:8983/solr/admin/
and you will see Apache Solr running, as shown in the following screenshot:


Solr can be started from Tomcat, JBoss and so on, but in this post Solr has been started as a standalone application.

Next, copy the schema.xml file from <nutch_home>/conf to <solr>/example/solr/collection1/conf. Now Nutch is integrated with Solr. You need to restart Solr by executing again:
java -jar start.jar
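For reference, the copy step above could look like this on OS X (the Nutch directory name apache-nutch-2.x is illustrative; use your actual Nutch home):

cp /Applications/searchengine/apache-nutch-2.x/conf/schema.xml /Applications/searchengine/solr-4.8.1/example/solr/collection1/conf/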

At this moment almost everything is ready to crawl your website. The only thing left to do is to set the plugin.includes property in the nutch-site.xml files under both the runtime/local/conf and /conf directories. The final nutch-site.xml under /conf is:
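A nutch-site.xml along these lines activates the Solr index writer; the plugin list below is a common choice for Nutch 2.x with Solr rather than necessarily the post's exact value, and the agent name is illustrative:

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyNutchSpider</value>
  </property>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>

The important part for the IndexWriters error mentioned below is that indexer-solr appears in plugin.includes, and the same value should be used in both copies of the file.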

The final nutch-site.xml under runtime/local/conf is:

This configuration of the plugin.includes property prevents the error

No IndexWriters activated – check your configuration

when the crawling of the websites starts.

Finally, follow these steps to start crawling the websites:

  1. Go to the home directory of HBase and execute:
    ./bin/start-hbase.sh
    You will get the following output:

    starting master, logging to /Applications/searchengine/hbase-0.94.14/bin/../logs/hbase-lostinsoftware-master-LostInSoftware.local.out

  2. Now go to the runtime/local directory of Apache Nutch and type the following command:
    bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/ 2

The meaning of each parameter of the command is:

  • urls/seed.txt: seed.txt is the file which contains the URLs to crawl.
  • TestCrawl: This is the crawl ID. A crawl data table named TestCrawl_Webpage will be automatically created inside Apache HBase, containing information on all the URLs crawled by Apache Nutch.
  • http://localhost:8983/solr/: This is the URL of the running Apache Solr instance.
  • 2: This is the number of iterations, which tells Apache Nutch after how many rounds the crawl will end.

If all succeeds, you will get the following output:

Now you can search for documents using the Solr console:
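You can also query Solr directly over HTTP; for example (the query term is illustrative):

http://localhost:8983/solr/collection1/select?q=nutch&wt=json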

 
