
Run Mahout built-in examples on Hadoop

1. Install Mahout
$ git clone git://git.apache.org/mahout.git mahout-trunk
$ cd mahout-trunk
$ mvn install
or $ mvn install -DskipTests=true (to skip Mahout's tests)
$ cd bin
$ ./mahout
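
If the build succeeded, running the script with no arguments should print the list of runnable programs. The output looks roughly like this (the exact list depends on your snapshot):

$ ./mahout
# An example program must be given as the first argument.
# Valid program names are:
#   kmeans: : K-means clustering
#   seqdumper: : Generic Sequence File dumper
#   vectordump: : Dump vectors from a sequence file
#   ...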

2. Set up Hadoop home and Mahout home in ~/.bash_profile
$ vim ~/.bash_profile
------------------------------------------------------
# Hadoop-2.2.0
export HADOOP_HOME=/Users/hadoop-2.2.0
export PATH=$PATH:$HADOOP_HOME/bin


# Mahout Release 0.9 unreleased
export MAHOUT_HOME=/Users/mahout-trunk
export PATH=$PATH:$MAHOUT_HOME/bin
------------------------------------------------------
$ source ~/.bash_profile
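
To confirm the new environment is picked up (the paths above assume Hadoop and Mahout live directly under /Users; adjust them to your install locations):

$ echo $HADOOP_HOME
$ echo $MAHOUT_HOME
$ which mahout        # should resolve to ${MAHOUT_HOME}/bin/mahout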

3. Put data into HDFS
DataSet: synthetic_control.data (the UCI Synthetic Control Chart Time Series dataset)


$ hadoop fs -mkdir -p /user/<user-name>/testdata
or $ bin/hdfs dfs -mkdir -p /user/<user-name>/testdata

$ hadoop fs -put synthetic_control.data /user/<user-name>/testdata
or $ bin/hdfs dfs -copyFromLocal <local_synthetic_control.data> /user/<user-name>/testdata
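
A quick check that the upload landed where the example expects it:

$ hadoop fs -ls /user/<user-name>/testdata
# should list synthetic_control.data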

4. Run example on hadoop

$ cd ${MAHOUT_HOME}
$ mvn clean package
Then you will find mahout-examples-0.9-SNAPSHOT-job.jar under ${MAHOUT_HOME}/examples/target (the exact version in the jar name depends on the trunk snapshot you built).

$ hadoop jar ${MAHOUT_HOME}/examples/target/mahout-examples-0.9-SNAPSHOT-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

or $ mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
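
Run with no arguments, the Job falls back to built-in defaults (it reads testdata and writes to output under your HDFS home directory). It also accepts Mahout's common job options explicitly; the flags below (-i input, -o output, -dm distance measure, -k number of clusters, -x max iterations) are a sketch based on those common options, so verify them with --help for your snapshot:

$ mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job \
    -i /user/<user-name>/testdata -o /user/<user-name>/output \
    -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
    -k 5 -x 10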

5. Output is in folder /user/<user-name>/output on HDFS
$ bin/hdfs dfs -ls /user/<user-name>/output
1> clusteredPoints
shows the cluster-id assigned to each document-id;
mahout seqdumper reads the records (<IntWritable key, WeightedVectorWritable value>) of clusteredPoints
2> clusters-0~9 (i-th)
the <Text, Cluster> result of the i-th iteration of clustering
n -> number of samples in each cluster
c -> the center of each cluster
r -> the radius of each cluster
3> data
the original data in vector format; it can be read with mahout vectordump (the dumped records have no keys) or with mahout seqdumper (the dumped records have a key, and the value prints as a class description rather than the vector contents)
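
With the defaults above, the listing typically looks roughly like this (the number of clusters-* directories depends on how many iterations the run needed):

$ bin/hdfs dfs -ls /user/<user-name>/output
# clusteredPoints
# clusters-0 ... clusters-10-final
# data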

6. Dump SequenceFiles to local files (import/export SequenceFile formats)

1> seqdumper
$ mahout seqdumper -i /user/<user-name>/output/clusteredPoints/part-m-00000 -o ~/workspace/Mahout/mahout-examples-data/output/syntheticcontrol_kmeans_seqdumper.txt

Key: 161: Value: wt: 1.0 distance-squared: 1042.6314794478094 vec: [28.781, 34.463, 31.338, 31.283, 28.921, 33.760, 25.397, 27.785, 35.248, 27.116, 32.872, 29.217, 36.025, 32.337, 34.525, 32.872, 34.117, 26.524, 27.662, 26.369, 25.774, 29.270, 30.733, 29.505, 33.029, 25.040, 28.917, 24.344, 26.120, 34.942, 25.029, 26.631, 35.654, 28.435, 29.150, 28.158, 26.193, 33.318, 30.977, 27.044, 35.534, 26.235, 28.996, 32.004, 31.056, 34.255, 28.072, 28.940, 35.497, 29.747, 31.433, 24.556, 33.743, 25.047, 34.932, 34.988, 32.472, 33.376, 25.465, 25.872]


2> clusterdump

$ mahout clusterdump --input /user/<user-name>/output/clusters-10-final --pointsDir /user/<user-name>/output/clusteredPoints --output ~/workspace/Mahout/mahout-examples-data/output/clusteranalyze_clusterdump.txt
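
Each cluster in the clusterdump output carries the n/c/r fields described in step 5; a record looks roughly like this (the cluster id and numbers below are illustrative, not actual results):

VL-163{n=38 c=[29.552, 33.073, ...] r=[2.825, 3.394, ...]}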

3> vectordump
$ mahout vectordump -i /user/hadoop/output/data/part-m-00000


DataSet:
https://s3.amazonaws.com/asf-mail-archives-7-18-2011/index.html
20newsgroups dataset 

http://select.cs.cmu.edu/code/graphlab/datasets.html

Reference:
http://mahout.apache.org/users/basics/quickstart.html
http://www.ibm.com/developerworks/cn/java/j-lo-mahout/
Implementing a recommender engine using Hadoop and Mahout
https://cwiki.apache.org/confluence/display/MAHOUT/Twenty+Newsgroups
http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
Mahout/Hadoop on Amazon EC2


Others:
$ mkdir -p scaling_mahout/data/sample      # create a local working directory
$ hadoop fs -put rating.data rating.data   # upload a local file to the HDFS home directory
$ hadoop fs -getmerge output output.txt    # merge all part files under HDFS "output" into one local file
$ bin/hdfs dfs -rm -r /test                # remove an HDFS directory recursively

In Mahout the input data should be a SequenceFile (a Hadoop file format), so plain text files need to be converted to SequenceFiles first; one way is shown below.
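
Mahout's seqdirectory tool does this conversion, packing a directory of text files into SequenceFiles (the paths here are illustrative):

$ mahout seqdirectory -i /user/<user-name>/input-text -o /user/<user-name>/input-seq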

${MAHOUT_HOME}/conf/driver.classes.props -> if you want to add new algorithms, register them in this file so the mahout script can launch them
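
Each entry in driver.classes.props maps a driver class to the short command name and description that the mahout script exposes; the line below is an illustrative sketch of that format:

org.apache.mahout.clustering.kmeans.KMeansDriver = kmeans : K-means clustering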

Comments:

  1. Hi, please help. I am getting an error while running this command:
    $ mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

    Exception in thread "main" java.lang.IllegalStateException: No input clusters found in output/random-seeds/part-randomSeed. Check your -c argument.
    at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:213)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:147)
    at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.run(Job.java:135)
    at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:60)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:160)

    Replies
    1. Did you run this command before:

      $ hadoop jar ${MAHOUT_HOME}/examples/target/mahout-examples-0.9-SNAPSHOT-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

      which works the same as

      $ mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

  2. I am getting the same error, please help. I ran k-means on a CSV file, but it says no input clusters were found even after specifying the file generated from the canopy step.
