Set Up Hadoop 2.2.0 on Mac OS X

This tutorial shows how to set up Hadoop 2.2.0 on Mac OS X with a minimal single-node configuration, suitable for exercises and development. I did the installation on my 13-inch MacBook Pro Retina.

Enable Remote Access

  1. Open System Preferences and click on “Sharing”.
  2. Select the checkbox next to “Remote Login” to enable it.

The SSH server will be started in the background.

Configure SSH

$ ssh-keygen -t rsa -P ''
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ ssh localhost
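
Running the cat command above more than once appends the same key again. A minimal sketch of an idempotent variant follows; the function name add_authorized_key is hypothetical and not part of the original steps.

```shell
#!/bin/bash
# Hypothetical helper (not from the original tutorial): append a public key
# to an authorized_keys file only if that exact line is not already present.
add_authorized_key() {
    local pub_key_file="$1"
    local auth_file="$2"

    # Create the file if needed and restrict it to the owner;
    # sshd can reject keys when the file is writable by others.
    touch "$auth_file"
    chmod 600 "$auth_file"

    # -x matches whole lines, -F treats the key as a fixed string,
    # -f reads the pattern (the key line) from the public key file.
    if ! grep -qxF -f "$pub_key_file" "$auth_file"; then
        cat "$pub_key_file" >> "$auth_file"
    fi
}
```

For the setup above it could be called as: add_authorized_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys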

Download Hadoop 2.2.0

$ cd ~
$ curl -O http://ftp.halifax.rwth-aachen.de/apache/hadoop/common/hadoop-2.2.0/hadoop-2.2.0.tar.gz
$ tar xzf hadoop-2.2.0.tar.gz
$ mv hadoop-2.2.0 hadoop

Set Hadoop-related Environment Variables

$ vi ~/.profile
# Add these variables
export JAVA_HOME=`/usr/libexec/java_home -v 1.6`
export HADOOP_INSTALL=$HOME/hadoop
export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop/
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export HADOOP_YARN_HOME=$HADOOP_INSTALL
export PATH=$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin
$ . ~/.profile
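
Before going further, it can help to confirm that the profile was actually loaded. The following is a minimal sketch, assuming bash; check_hadoop_env is a hypothetical helper that only inspects three of the variables set above.

```shell
#!/bin/bash
# Hypothetical helper: report whether key variables from ~/.profile are set
# and point at existing directories. Uses bash indirect expansion (${!name}).
check_hadoop_env() {
    local name value status=0
    for name in JAVA_HOME HADOOP_INSTALL HADOOP_CONF_DIR; do
        value="${!name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set"
            status=1
        elif [ ! -d "$value" ]; then
            echo "$name points to a missing directory: $value"
            status=1
        else
            echo "$name=$value"
        fi
    done
    # Non-zero when at least one variable is unset or invalid.
    return $status
}
```

Calling check_hadoop_env in a new terminal prints one line per variable and returns a non-zero status if anything is missing.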

Configure Hadoop

Create directories

$ mkdir -p $HOME/data/hdfs/namenode
$ mkdir -p $HOME/data/hdfs/datanode

Edit etc/hadoop/hadoop-env.sh

export JAVA_HOME=`/usr/libexec/java_home -v 1.6`

Edit etc/hadoop/core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:8020</value>
    </property>
</configuration>

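Edit etc/hadoop/hdfs-site.xml (replace /Users/mabduh with your own home directory, so that the paths match the directories created above)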

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/Users/mabduh/data/hdfs/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/Users/mabduh/data/hdfs/datanode</value>
    </property>
</configuration>
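
Since the shell variable $HOME is not expanded inside this XML file, the absolute paths have to be written out. One way to avoid typing them by hand is to generate the file from the shell. This is only a sketch: write_hdfs_site is a hypothetical helper, and it overwrites the target file.

```shell
#!/bin/bash
# Hypothetical helper: write an hdfs-site.xml whose storage paths are derived
# from a given data root instead of a hard-coded user name.
write_hdfs_site() {
    local target_file="$1"
    local data_root="$2"   # e.g. $HOME/data/hdfs

    # The heredoc delimiter is unquoted so the shell expands ${data_root}.
    cat > "$target_file" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:${data_root}/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:${data_root}/datanode</value>
    </property>
</configuration>
EOF
}
```

With the directories created earlier, the call would be: write_hdfs_site "$HADOOP_INSTALL/etc/hadoop/hdfs-site.xml" "$HOME/data/hdfs"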

Edit etc/hadoop/yarn-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
</configuration>

Edit etc/hadoop/mapred-site.xml

$ cd $HADOOP_INSTALL
$ cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml

Then set the contents of etc/hadoop/mapred-site.xml to:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

Format Name Node

$ hdfs namenode -format
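
Formatting is meant to be done once; formatting again discards the existing file system metadata. A small guard can tell whether the name directory was already initialized. This sketch assumes that a formatted name directory contains a current/VERSION file; namenode_format_status is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: print "formatted" if the given NameNode directory
# already contains current/VERSION (assumed marker of a formatted directory),
# otherwise print "unformatted".
namenode_format_status() {
    local name_dir="$1"
    if [ -f "$name_dir/current/VERSION" ]; then
        echo "formatted"
    else
        echo "unformatted"
    fi
}
```

For example, run namenode_format_status "$HOME/data/hdfs/namenode" and only format when it prints "unformatted".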

Start Hadoop Services

$ start-dfs.sh
$ start-yarn.sh
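
After both scripts return, the JDK tool jps should list the NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager processes. A minimal sketch for checking this follows; missing_daemons is a hypothetical helper that reads jps output from standard input.

```shell
#!/bin/bash
# Hypothetical helper: read `jps` output on stdin and print the name of every
# expected Hadoop daemon that is not listed. Returns non-zero if any is missing.
missing_daemons() {
    local jps_output daemon status=0
    jps_output="$(cat)"
    for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
        # jps prints "<pid> <name>"; compare the second field exactly so that
        # "NameNode" is not satisfied by "SecondaryNameNode".
        if ! printf '%s\n' "$jps_output" |
             awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            echo "$daemon"
            status=1
        fi
    done
    return $status
}
```

Usage: jps | missing_daemons. No output means all five daemons are running; otherwise check the files under $HADOOP_INSTALL/logs.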

Run Hadoop Example

$ cd $HADOOP_INSTALL
$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 2 5
Number of Maps  = 2
Samples per Map = 5
2013-12-17 17:24:04.985 java[40261:1203] Unable to load realm info from SCDynamicStore
13/12/17 17:24:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Wrote input for Map #0
Wrote input for Map #1
Starting Job
13/12/17 17:24:06 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
13/12/17 17:24:06 INFO input.FileInputFormat: Total input paths to process : 2
13/12/17 17:24:06 INFO mapreduce.JobSubmitter: number of splits:2
...
...
13/12/17 17:24:15 INFO mapreduce.Job:  map 0% reduce 0%
13/12/17 17:24:23 INFO mapreduce.Job:  map 100% reduce 0%
13/12/17 17:24:29 INFO mapreduce.Job:  map 100% reduce 100%
13/12/17 17:24:29 INFO mapreduce.Job: Job job_1387297380543_0001 completed successfully
13/12/17 17:24:29 INFO mapreduce.Job: Counters: 43
	File System Counters
		FILE: Number of bytes read=50
		FILE: Number of bytes written=238825
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=532
		HDFS: Number of bytes written=215
		HDFS: Number of read operations=11
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=3
	Job Counters
		Launched map tasks=2
		Launched reduce tasks=1
		Data-local map tasks=2
		Total time spent by all maps in occupied slots (ms)=11613
		Total time spent by all reduces in occupied slots (ms)=4142
	Map-Reduce Framework
		Map input records=2
		Map output records=4
		Map output bytes=36
		Map output materialized bytes=56
		Input split bytes=296
		Combine input records=0
		Combine output records=0
		Reduce input groups=2
		Reduce shuffle bytes=56
		Reduce input records=4
		Reduce output records=0
		Spilled Records=8
		Shuffled Maps =2
		Failed Shuffles=0
		Merged Map outputs=2
		GC time elapsed (ms)=65
		CPU time spent (ms)=0
		Physical memory (bytes) snapshot=0
		Virtual memory (bytes) snapshot=0
		Total committed heap usage (bytes)=481087488
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=236
	File Output Format Counters
		Bytes Written=97
Job Finished in 23.596 seconds
Estimated value of Pi is 3.60000000000000000000
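
The estimate is coarse because the job used only 2 maps with 5 samples each, that is 10 points in total. Assuming the example computes 4 * (points inside the circle) / (total points), a result of 3.6 corresponds to 9 of the 10 points falling inside. The arithmetic can be reproduced with a short sketch; pi_estimate is a hypothetical helper, not part of Hadoop.

```shell
#!/bin/bash
# Hypothetical helper: reproduce the final arithmetic of a Monte Carlo style
# Pi estimate from the number of points inside the circle and the total.
pi_estimate() {
    local inside="$1" total="$2"
    # LC_ALL=C keeps "." as the decimal separator regardless of locale.
    LC_ALL=C awk -v i="$inside" -v t="$total" 'BEGIN { printf "%.1f\n", 4 * i / t }'
}

pi_estimate 9 10   # prints 3.6
```

Running the example with more maps and samples, for instance pi 16 1000, should bring the estimate closer to the real value.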