Learn how to install Hadoop on Ubuntu 20.04 LTS step by step. Hadoop is a Java-based, open-source framework that provides massive storage for any kind of data, enormous processing power, and the ability to handle a virtually unlimited number of concurrent tasks or jobs. Buy your own Ubuntu VPS and follow along with this article to start learning about Hadoop and take advantage of its most important ability: quickly storing and processing huge amounts of data of any kind.
This tutorial will be easier to follow if you have the prerequisites below in place:
1- A server running Ubuntu 20.04.
2- Sudo or root privileges on local/remote machines
3- 16 GB RAM / 8 vCPUs / 20 GB boot disk / 100 GB raw disk for data storage
Tutorial Install Hadoop On Ubuntu 20.04 LTS
Hadoop is famous for its computing power: the more computing nodes you add, the more processing power you get. By processing the data a company collects, Hadoop can derive results that support future decisions.
Install And Configure Apache Hadoop On Ubuntu 20.04
Let’s go through the 10 steps of this tutorial to get Hadoop up and running.
Step 1: How To Install Java
Since Hadoop is written in Java, you can install OpenJDK 8 from the default apt repositories after updating the system:
sudo apt update
sudo apt install openjdk-8-jdk
Run the below command to check the installed version of Java:
java -version
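If the installation succeeded, the output should look similar to the following; the exact update and build numbers depend on the package version installed:
openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1ubuntu1-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)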
Step 2: How To Create A Hadoop User
For security reasons, it is recommended to create a separate user to run Hadoop. To create a user named hadoop, type the following command:
sudo adduser hadoop
Step 3: How To Configure SSH key-based Authentication
In this step, you need to configure passwordless SSH authentication for the local system. So, log in with the user hadoop you created in the above step and run the command below:
su - hadoop
To generate public and private key pairs, type:
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 640 ~/.ssh/authorized_keys
Then, verify passwordless SSH authentication with the following command:
ssh localhost
Step 4: Install Hadoop On Ubuntu 20.04
Make sure you are logged in as the hadoop user. If not, switch to it with the following command:
su - hadoop
To download Hadoop 3.3.0, the version used in this tutorial, type the command shown below.
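For example, the 3.3.0 release can be fetched from the Apache archive (the exact mirror URL may vary):
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz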
Once the download is finished, use the following command to extract the downloaded file:
tar -xvzf hadoop-3.3.0.tar.gz
Use the command below to rename the extracted directory to hadoop:
mv hadoop-3.3.0 hadoop
To configure the Hadoop and Java environment variables on your system, open the file ~/.bashrc in your favorite text editor:
nano ~/.bashrc
And then, add the following lines:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64/
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Now, you can save and close the file.
To activate the environment variables, run:
source ~/.bashrc
Then, open the Hadoop environment variable file:
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Uncomment the variable JAVA_HOME and set it according to the Java installation path:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64/
Now, you can save and close the file when done.
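At this point you can optionally verify that the environment variables are picked up correctly by checking the installed Hadoop version:
hadoop version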
Step 5: How To Configure Hadoop
First, create the directories that will hold the NameNode and DataNode data:
mkdir -p ~/hadoopdata/hdfs/namenode
mkdir -p ~/hadoopdata/hdfs/datanode
Then, edit the file core-site.xml and update it with your system’s hostname:
nano $HADOOP_HOME/etc/hadoop/core-site.xml
Add the following configuration, replacing example.com with your system’s hostname:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://example.com:9000</value>
  </property>
</configuration>
Now, you can save and close the file. Then, edit the file hdfs-site.xml:
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Change the NameNode and DataNode directory paths as shown below:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
  </property>
</configuration>
Now you can save and close the file. Then, edit the file mapred-site.xml:
nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
Make the following changes:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
Now, you can save and close the file. Then, edit the file yarn-site.xml:
nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
And finally, make the following changes:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
Now, you can save and close the file when done.
Step 6: How To Start Hadoop Cluster
First, you should format the Namenode as the hadoop user. Use the command below to do this:
hdfs namenode -format
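If formatting succeeds, the log output should end with a message similar to the following, referencing the NameNode directory configured in the previous step:
INFO common.Storage: Storage directory /home/hadoop/hadoopdata/hdfs/namenode has been successfully formatted.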
To start the hadoop cluster, run:
start-dfs.sh
Then, type the following command to start the YARN service:
start-yarn.sh
Use the command below to check the status of all Hadoop services:
jps
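On a single-node setup such as this one, the output should list all of the Hadoop daemons, similar to the following (the process IDs on your system will differ):
12345 NameNode
12478 DataNode
12690 SecondaryNameNode
12893 ResourceManager
13021 NodeManager
13344 Jps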
Step 7: How To Configure The Firewall
Allow Hadoop connections through the firewall by running the following commands (as a user with sudo privileges):
sudo ufw allow 9870/tcp
sudo ufw allow 8088/tcp
Step 8: How To Log in To Hadoop Namenode And Resource Manager
You can visit the URL http://example.com:9870 (replacing example.com with your server’s hostname) to access the Namenode. There you can see the service summary screen.
Also, to access the Resource Manager, visit the URL http://example.com:8088 to see the Hadoop management screen.
Step 9: How To Check Hadoop Cluster
Now that the Hadoop cluster is installed and configured, you need to create some directories in the HDFS filesystem to test Hadoop.
Note: Make sure you are logged in as the hadoop user:
su - hadoop
Use the following command to create a directory in the HDFS filesystem:
hdfs dfs -mkdir /test1
hdfs dfs -mkdir /logs
And to list the directories created above, type:
hdfs dfs -ls /
Next, put some files into Hadoop’s file system. For example, copy the log files from the host machine into it:
hdfs dfs -put /var/log/* /logs/
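To confirm that the files were copied, you can list the contents of the /logs directory:
hdfs dfs -ls /logs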
Note: In case you need to check the above files and directory, you can do this in the Hadoop Namenode web interface.
Go to the Namenode web interface, click Utilities->Browse the file system. You should see the directories that you created earlier.
Step 10: How To Stop Hadoop Cluster
You need to run the stop-dfs.sh and stop-yarn.sh scripts as the hadoop user to stop the Hadoop Namenode and YARN services.
Run the command below to stop the Hadoop Namenode service:
stop-dfs.sh
To stop the Hadoop Resource Manager service, type:
stop-yarn.sh
Conclusion
In this article, you learned how to install Hadoop on Ubuntu 20.04 LTS. Depending on your business requirements, you can adopt this technology in your organization and benefit from it, as it fits any domain.