Learn how to install Hadoop on Ubuntu 20.04 LTS step by step. Hadoop is a Java-based, open-source framework that provides massive storage for any kind of data, enormous processing power, and the ability to handle a virtually unlimited number of concurrent tasks or jobs. Buy your own Ubuntu VPS and follow along with this article to start learning about Hadoop and take advantage of its most important ability: quickly storing and processing huge amounts of data of any kind.
This tutorial will be easier to follow if you have the prerequisites below in place:
1- A server running Ubuntu 20.04.
2- Sudo or root privileges on local/remote machines
3- 16 GB RAM / 8 vCPUs / 20 GB boot disk / 100 GB raw disk for data storage
Tutorial Install Hadoop On Ubuntu 20.04 LTS
Hadoop is famous for its computing power: the more computing nodes you add, the more processing power you get. By processing the data a company collects, Hadoop can derive results that support future decisions.
Install And Configure Apache Hadoop On Ubuntu 20.04
Let’s go through the 10 steps of this tutorial to get Hadoop up and running.
Step 1: How To Install Java
Since Hadoop is written in Java, you can install OpenJDK 8 from the default apt repositories after updating the system:
sudo apt update
sudo apt install openjdk-8-jdk
Run the below command to check the installed version of Java:
java -version
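If the installation succeeded, the output should look similar to the following; the exact update and build numbers depend on the package version installed:
openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1ubuntu1-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)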
Step 2: How To Create A Hadoop User
For security reasons, it is recommended to create a separate user to run Hadoop. To create a user named hadoop, type the following command:
sudo adduser hadoop
Step 3: How To Configure SSH key-based Authentication
In this step, you need to configure passwordless SSH authentication for the local system. So, log in with the user hadoop you created in the above step and run the command below:
su - hadoop
To generate public and private key pairs, type:
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 640 ~/.ssh/authorized_keys
Then, verify passwordless SSH authentication with the following command:
ssh localhost
Step 4: Install Hadoop On Ubuntu 20.04
Make sure you are logged in as the hadoop user. If not, switch to it with the following command:
su - hadoop
To download Hadoop 3.3.0, the version used in this tutorial, type the command shown below.
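For example, the 3.3.0 release can be fetched from the Apache archive (the exact mirror URL may vary):
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz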
Once the download is finished, use the following command to extract the downloaded file:
tar -xvzf hadoop-3.3.0.tar.gz
Use the command below to rename the extracted directory to hadoop:
mv hadoop-3.3.0 hadoop
To configure the Hadoop and Java environment variables on your system, open the file ~/.bashrc in your favorite text editor:
nano ~/.bashrc
And then, add the following lines:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64/
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Now, you can save and close the file.
To activate the environment variables, run:
source ~/.bashrc
Then, open the Hadoop environment variable file:
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Uncomment the variable JAVA_HOME and set it according to the Java installation path:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64/
Now, you can save and close the file when done.
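At this point you can optionally verify that the environment variables are picked up correctly by checking the installed Hadoop version:
hadoop version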
Step 5: How To Configure Hadoop
First, create the directories that will hold the NameNode and DataNode data:
mkdir -p ~/hadoopdata/hdfs/namenode
mkdir -p ~/hadoopdata/hdfs/datanode
Then, edit the file core-site.xml and update it with your system’s hostname:
nano $HADOOP_HOME/etc/hadoop/core-site.xml
Add the following configuration, replacing example.com with your system’s hostname:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://example.com:9000</value>
  </property>
</configuration>
Now, you can save and close the file. Then, edit the file hdfs-site.xml:
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Change the NameNode and DataNode directory paths as shown below:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
  </property>
</configuration>
Now you can save and close the file. Then, edit the file mapred-site.xml:
nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
Make the following changes:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
Now, you can save and close the file. Then, edit the file yarn-site.xml:
nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
And finally, make the following changes:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
Now, you can save and close the file when done.
Step 6: How To Start Hadoop Cluster
First, you should format the Namenode as the hadoop user. Use the command below to do this:
hdfs namenode -format
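If formatting succeeds, the log output should end with a message similar to the following, referencing the NameNode directory configured in the previous step:
INFO common.Storage: Storage directory /home/hadoop/hadoopdata/hdfs/namenode has been successfully formatted.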
To start the hadoop cluster, run:
start-dfs.sh
Then, type the following command to start the YARN service:
start-yarn.sh
Use the command below to check the status of all Hadoop services:
jps
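On a single-node setup such as this one, the output should list all of the Hadoop daemons, similar to the following (the process IDs on your system will differ):
12345 NameNode
12478 DataNode
12690 SecondaryNameNode
12893 ResourceManager
13021 NodeManager
13344 Jps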
Step 7: How To Configure The Firewall
Allow Hadoop connections through the firewall by running the following commands (as a user with sudo privileges):
sudo ufw allow 9870/tcp
sudo ufw allow 8088/tcp
Step 8: How To Log in To Hadoop Namenode And Resource Manager
You can visit the URL http://example.com:9870 (replacing example.com with your server’s hostname) to access the Namenode. There you can see the service summary screen.
Also, to access the Resource Manager, visit the URL http://example.com:8088 to see the Hadoop management screen.
Step 9: How To Check Hadoop Cluster
Now that the Hadoop cluster is installed and configured, you need to create some directories in the HDFS filesystem to test Hadoop.
Note: Make sure you are logged in as the hadoop user:
su - hadoop
Use the following command to create a directory in the HDFS filesystem:
hdfs dfs -mkdir /test1
hdfs dfs -mkdir /logs
And to list the directories created above, type:
hdfs dfs -ls /
Next, put some files into Hadoop’s file system. For example, copy the log files from the host machine into it:
hdfs dfs -put /var/log/* /logs/
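To confirm that the files were copied, you can list the contents of the /logs directory:
hdfs dfs -ls /logs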
Note: In case you need to check the above files and directory, you can do this in the Hadoop Namenode web interface.
Go to the Namenode web interface, click Utilities->Browse the file system. You should see the directories that you created earlier.
Step 10: How To Stop Hadoop Cluster
You need to run the stop-dfs.sh and stop-yarn.sh scripts as the hadoop user to stop the Hadoop Namenode and YARN services.
Run the command below to stop the Hadoop Namenode service:
stop-dfs.sh
To stop the Hadoop Resource Manager service, type:
stop-yarn.sh
Conclusion
In this article, you learned how to install Hadoop on Ubuntu 20.04 LTS. Depending on your business requirements, you can adopt this technology in your organization and benefit from it, as it fits any domain.