

Setting up HDFS cluster on 3 nodes
 Install Ubuntu 16.04 LTS on the 3 machines
 Use the same username (hduser) on all machines:

hduser@hduser-j    (master, also acts as a slave)
hduser@hduser-r    (slave 1)
hduser@hduser-s    (slave 2)

 Give static IPs to the 3 machines
hduser-j    192.168.0.3    (master node)
hduser-r    192.168.0.2    (slave 1)
hduser-s    192.168.0.1    (slave 2)
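One way to assign a static address on Ubuntu 16.04 is through /etc/network/interfaces; the snippet below is only a sketch for the master, and the interface name (eth0) and netmask are assumptions that must match your own network.

# /etc/network/interfaces on the master (eth0 and the netmask are placeholders)
auto eth0
iface eth0 inet static
    address 192.168.0.3
    netmask 255.255.255.0

Then restart networking (sudo systemctl restart networking) or reboot, and repeat with 192.168.0.2 and 192.168.0.1 on the two slaves.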
 Edit the file /etc/hosts on all 3 machines (master, slave 1 and slave 2) so that every machine has the same entries:

127.0.0.1      localhost
192.168.0.3    jayesh    hduser-j    master
192.168.0.2    rachit    hduser-r    slave1
192.168.0.1    sups      hduser-s    slave2

 Install oracle-java-8 on all machines
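One common route to Oracle Java 8 on Ubuntu 16.04 was the WebUpd8 PPA; the commands below are a sketch of that route (the PPA has since been retired, so any Java 8 JDK installed under /usr/lib/jvm/java-8-oracle, matching the JAVA_HOME set below, will do).

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
java -version    # verify that Java 8 is on the PATH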
 Set Java path in .bashrc file on all machines

export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export PATH=$PATH:/usr/lib/jvm/java-8-oracle/bin
 Run the command on all machines
source .bashrc

 Install the OpenSSH server on all machines using the following command
sudo apt-get install openssh-server
 Generate ssh keys and copy them to the other machines using the following commands
ssh-keygen
ssh-copy-id -i hduser@<ipaddress>
E.g. ssh-copy-id -i hduser@192.168.0.2

 Check that every machine can connect to every other machine over ssh by running the following
commands on each of them
ssh master
ssh slave1
ssh slave2
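If you prefer to script the key exchange, a minimal sketch to run on each machine (assuming the IP addresses above and that the hduser password is known) is:

# generate a key pair only if one does not exist yet
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
# push the public key to every node, including this one
for ip in 192.168.0.3 192.168.0.2 192.168.0.1; do
    ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@$ip
done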


 Download and extract hadoop-2.7.3 on all machines
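A possible way to fetch and unpack the release on each machine (the Apache archive URL below is one option; any mirror that still hosts 2.7.3 works):

cd ~
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
tar -xzf hadoop-2.7.3.tar.gz    # creates ~/hadoop-2.7.3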

 On all machines create the following two files (the contents of each file are listed below its path)
hadoop-2.7.3/etc/hadoop/masters
master

hadoop-2.7.3/etc/hadoop/slaves
master
slave1
slave2
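For convenience, the two files can also be written from the shell (a sketch using the hostnames defined in /etc/hosts above):

cd ~/hadoop-2.7.3/etc/hadoop
echo "master" > masters
printf "master\nslave1\nslave2\n" > slaves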
 Set the following environment variables in .bashrc file
export HADOOP_HOME=$HOME/hadoop-2.7.3
export HADOOP_CONF_DIR=$HOME/hadoop-2.7.3/etc/hadoop
export HADOOP_MAPRED_HOME=$HOME/hadoop-2.7.3
export HADOOP_COMMON_HOME=$HOME/hadoop-2.7.3
export HADOOP_HDFS_HOME=$HOME/hadoop-2.7.3
export YARN_HOME=$HOME/hadoop-2.7.3
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HOME/hadoop-2.7.3/bin


 Run the command source .bashrc on all machines

 Configure JAVA_HOME in hadoop-2.7.3/etc/hadoop/hadoop-env.sh on all machines

export JAVA_HOME=/usr/lib/jvm/java-8-oracle
 Create the NameNode and DataNode directories using the following commands on all
machines.
mkdir -p $HADOOP_HOME/hadoop2_data/hdfs/namenode
mkdir -p $HADOOP_HOME/hadoop2_data/hdfs/datanode
If these paths are not specified in hdfs-site.xml, the directories are created under /tmp
and get deleted every time the system restarts.

 Edit the following XML files on all machines. They are located in hadoop-2.7.3/etc/hadoop/
1. core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>
</configuration>

2. hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hduser/hadoop-2.7.3/hadoop2_data/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hduser/hadoop-2.7.3/hadoop2_data/hdfs/datanode</value>
</property>
</configuration>

3. yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>

4. mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
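Note that a stock hadoop-2.7.3 download ships mapred-site.xml.template rather than mapred-site.xml, so the file has to be copied before the edit in step 4. If the four files were edited only on the master, scp is one way to push them (together with masters and slaves) to the slaves; a sketch:

cd ~/hadoop-2.7.3/etc/hadoop
cp mapred-site.xml.template mapred-site.xml    # only needed once, before editing
scp core-site.xml hdfs-site.xml yarn-site.xml mapred-site.xml masters slaves hduser@slave1:~/hadoop-2.7.3/etc/hadoop/
scp core-site.xml hdfs-site.xml yarn-site.xml mapred-site.xml masters slaves hduser@slave2:~/hadoop-2.7.3/etc/hadoop/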

Now the Hadoop file system is configured on all machines.
 Run the following command on the master, from inside hadoop-2.7.3/bin
hadoop namenode -format
(hdfs namenode -format is the equivalent, non-deprecated form in Hadoop 2.x)
Note: This is required only the first time Hadoop is set up. Do not
format a running Hadoop file system; doing so erases all your HDFS data.



 On master run
./sbin/start-dfs.sh
This will start the NameNode, the secondary NameNode and a DataNode on the master,
and DataNodes on the slaves

 Check all machines by running the jps command on each.
We can also check in the master's browser at localhost:50070 and
in the slave machines' browsers at master:50070
 We can make directories in HDFS using hdfs dfs -mkdir <path> and
store files from the local machine into HDFS using hdfs dfs -put
<localsrc> ... <dst>
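For example (a sketch; the local file and the HDFS paths are hypothetical):

hdfs dfs -mkdir -p /user/hduser/input          # create a directory in HDFS
hdfs dfs -put ~/data.txt /user/hduser/input/   # copy a local file into it
hdfs dfs -ls /user/hduser/input                # list the directory to confirm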
 To stop HDFS, on master run
./sbin/stop-dfs.sh
Restart it and check that the files stored earlier still exist.

Setting up a Spark multi-node standalone cluster after setting up
Hadoop on the 3 nodes (continued from the steps above)
 The /etc/hosts file is already set up; the Java path and ssh keys are also already in place
from the Hadoop setup
 Download Spark and extract it into /home/hduser/ (the same place where we
already put Hadoop)
 Set the SPARK_HOME variable in .bashrc file on all machines
export SPARK_HOME=/home/hduser/spark-2.1.0-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH


 Run the command source .bashrc on all machines



 Make the exact same slaves file as we made for Hadoop and copy it to the
following path on all machines
/home/hduser/spark-2.1.0-bin-hadoop2.7/conf/



 Make the following edits to the spark-env.sh file on all machines (we can also edit it on one
machine and copy it to the others using scp).
export SPARK_WORKER_MEMORY=12G
export SPARK_WORKER_CORES=6
export SPARK_MASTER_HOST=master
The file is located in
/home/hduser/spark-2.1.0-bin-hadoop2.7/conf/

 Similarly edit the file spark-defaults.conf on all machines. The location of the file is the
same as above
spark.master           spark://master:7077
spark.driver.memory    12g
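In a fresh Spark download these files exist only as templates, so one possible way to create them before adding the settings above is:

cd /home/hduser/spark-2.1.0-bin-hadoop2.7/conf
cp spark-env.sh.template spark-env.sh
cp spark-defaults.conf.template spark-defaults.conf
cp slaves.template slaves    # if the slaves file was not already copied over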
 On master run the following command (from the Spark directory) to start Spark
./sbin/start-all.sh

This will start the Spark master and a worker on the master machine and the workers on the
slaves
 Check all machines by running the jps command
We can check the status of Spark in the master's browser at
localhost:8080 and in the slave machines' browsers at master:8080.
 Use the spark-submit command to submit any job to Spark. If HDFS is
running and we want to run a job on a file stored in HDFS, we can use the spark-submit
command as follows
./bin/spark-submit <jobname> hdfs://master:9000/<filepath>
The status of the application can also be seen at master:8080
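For instance, the word-count example bundled with Spark can be pointed at a file already stored in HDFS (a sketch; the examples jar name corresponds to the spark-2.1.0-bin-hadoop2.7 package and the input path is hypothetical):

./bin/spark-submit --master spark://master:7077 \
    --class org.apache.spark.examples.JavaWordCount \
    examples/jars/spark-examples_2.11-2.1.0.jar \
    hdfs://master:9000/user/hduser/input/data.txt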
 To stop Spark, run the following command on master
./sbin/stop-all.sh

