There are several tutorials on how to install Nutch 2.x with HBase and Solr, but I ran into errors with every one of them. In this post I describe how I got these products working together.
Basic ideas about Apache Spark
Apache Spark is a fast, general-purpose engine for large-scale data processing. It is written in Scala, a functional programming language that runs on the JVM. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. You can use Spark through the Spark Shell for learning or data exploration (in Scala or Python and, since 1.4, in R) or through Spark applications for large-scale data processing (mainly in Python, Scala or Java).
Basic ideas about Hadoop
Apache Hadoop is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.