Hadoop Commands


Introduction to Apache Hadoop

Apache Hadoop is an open-source framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from a single server to thousands of machines, each offering local computation and storage. Hadoop is known for its reliability and ability to handle vast amounts of data efficiently.

The Hadoop Distributed File System (HDFS) is a fundamental component of Hadoop, providing scalable and reliable data storage. HDFS is designed to store very large files across a cluster of machines, ensuring fault tolerance and high availability.

Another critical component of Hadoop is the MapReduce programming model, which enables efficient data processing across the distributed storage provided by HDFS. MapReduce breaks down data processing tasks into two main phases: the Map phase, which filters and sorts the data, and the Reduce phase, which performs aggregation and summarization. Together with YARN (Yet Another Resource Negotiator), which manages and schedules resources in the Hadoop cluster, these components make Hadoop a robust platform for big data analytics, allowing organizations to gain insights from their data at unprecedented scale.
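The two phases can be illustrated with a minimal word-count sketch in plain Python. This is a local simulation of the Map and Reduce steps for explanation only, not actual Hadoop MapReduce code:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: sum the counts for each word, after the shuffle/sort step."""
    sorted_pairs = sorted(pairs, key=itemgetter(0))  # stands in for shuffle/sort
    return {word: sum(count for _, count in group)
            for word, group in groupby(sorted_pairs, key=itemgetter(0))}

counts = reduce_phase(map_phase(["big data big clusters", "data at scale"]))
print(counts)  # {'at': 1, 'big': 2, 'clusters': 1, 'data': 2, 'scale': 1}
```

In real Hadoop, the map and reduce functions run as distributed tasks on many nodes, and the framework performs the shuffle/sort between them.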

Hadoop Installation and Setup Guide

Before starting Hadoop, ensure you have completed the installation prerequisites: a compatible Java JDK installed, the JAVA_HOME and HADOOP_HOME environment variables set, and the Hadoop configuration files (core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml) updated for your environment.

Troubleshooting

If your DataNode is not running, follow these steps:

    1. DataNode and NameNode Version Mismatch

  1. Navigate to C:\Hadoop\data\dfs\current\namenode and open the VERSION file.
  2. Copy the clusterID value from the VERSION file.
  3. Navigate to C:\Hadoop\data\dfs\current\datanode and replace the clusterID value in the VERSION file in that directory with the value you copied, so that both files match.
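The manual copy-and-paste above can also be scripted. The sketch below assumes the VERSION files use the usual key=value layout with a clusterID entry; the function name and paths are illustrative:

```python
import re
from pathlib import Path

def sync_cluster_id(namenode_dir, datanode_dir):
    """Copy the clusterID from the NameNode VERSION file into the
    DataNode VERSION file so the two daemons agree."""
    nn_text = (Path(namenode_dir) / "VERSION").read_text()
    match = re.search(r"^clusterID=(.+)$", nn_text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("clusterID not found in NameNode VERSION file")
    cluster_id = match.group(1).strip()

    dn_version = Path(datanode_dir) / "VERSION"
    dn_text = re.sub(r"^clusterID=.*$", f"clusterID={cluster_id}",
                     dn_version.read_text(), flags=re.MULTILINE)
    dn_version.write_text(dn_text)
    return cluster_id

# Example, using the paths from this guide:
# sync_cluster_id(r"C:\Hadoop\data\dfs\current\namenode",
#                 r"C:\Hadoop\data\dfs\current\datanode")
```

Stop the Hadoop services before editing the VERSION files, and restart them afterwards.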

    2. Permission Issues

  1. Check whether Hadoop is failing to create files in the DataNode directory; this usually indicates missing write permissions.
  2. Navigate to C:\Hadoop\data\dfs\current\datanode and right-click on the directory.
  3. Select Properties, then go to the Security tab.
  4. Click on Edit and ensure that the user has full control by selecting Allow for all permissions.
  5. Click Apply and then OK to save the changes.
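Before (or after) changing permissions through the Security tab, a quick sketch like the following can confirm whether the current user can actually create files in the DataNode directory. The path in the comment is the one used in this guide:

```python
import tempfile

def can_write(directory):
    """Return True if the current user can create a file in `directory`."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory):
            pass  # the file is created and then removed immediately
        return True
    except OSError:
        return False

# Example, using the path from this guide:
# print(can_write(r"C:\Hadoop\data\dfs\current\datanode"))
```

If this returns False after applying full control in the Security tab, check that the permissions were applied to the folder itself and not only to its parent.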

In most cases, these steps will resolve the issue with the DataNode not running.

Example Workflow

  1. Start Command Prompt in Admin Mode:
    • Press Windows Key + X, select Command Prompt (Admin), and confirm any prompts.
  2. Navigate to the sbin Directory:
    • cd C:\Hadoop\sbin
  3. Start HDFS:
    • start-dfs.cmd
  4. Start YARN:
    • start-yarn.cmd
  5. Verify HDFS:
    • Open http://localhost:9870/ (NameNode web UI) in a browser.
  6. Verify YARN:
    • Open http://localhost:8088/ (ResourceManager web UI) in a browser.

Following these steps will help ensure that the Hadoop services are correctly started and running on the local machine. For more detailed information, refer to the official Apache Hadoop documentation. Detailed steps for starting these services are provided later on this page.

Starting Apache Hadoop Services

Follow these steps to start the Hadoop Distributed File System (HDFS) and YARN (Yet Another Resource Negotiator):

Step 1: Open Command Prompt in Administrator Mode

To avoid permission issues, it's important to run the command prompt with administrative privileges.

  1. Press Windows Key + X and select Command Prompt (Admin) or Windows PowerShell (Admin).
  2. Confirm any prompts to allow the command prompt to run as an administrator.

Step 2: Navigate to the Hadoop sbin Directory

Change the directory to the Hadoop sbin directory where the startup scripts are located.

cd C:\Hadoop\sbin
Description:
  • The command cd C:\Hadoop\sbin changes the current directory to the sbin folder in the Hadoop installation directory.
Output:
  • The prompt will change to C:\Hadoop\sbin, indicating that you are now in the correct directory.
Starting HDFS Services

Step 3: Start the Hadoop Distributed File System (HDFS)

Run the following command to start the HDFS daemons (NameNode, Secondary NameNode, and DataNode):

start-dfs.cmd
Description:
  • The command start-dfs.cmd starts the Hadoop Distributed File System (HDFS) daemons, including the NameNode, Secondary NameNode, and DataNode processes.
  • These processes are responsible for managing the distributed storage in Hadoop, ensuring data is stored across multiple nodes for fault tolerance and high availability.
Output:
  • Log messages indicating that the NameNode, Secondary NameNode, and DataNode services have started successfully.
  • The prompt will return, indicating the command has finished executing.
Starting YARN Services and Verification

Step 4: Start YARN

Run the following command to start YARN daemons (ResourceManager and NodeManager):

start-yarn.cmd
Description:
  • The command start-yarn.cmd starts the YARN (Yet Another Resource Negotiator) daemons, including the ResourceManager and NodeManager processes.
  • The ResourceManager is responsible for managing resources across the cluster and scheduling applications, while the NodeManager manages resources on a single node.
Output:
  • Log messages indicating that the ResourceManager and NodeManager services have started successfully.
  • The prompt will return, indicating the command has finished executing.

Step 5: Verify Hadoop Services

HDFS Verification

Open a web browser and navigate to the HDFS web UI to verify that the DataNode and NameNode are running correctly:

For the DataNode: http://localhost:9864/

For the NameNode: http://localhost:9870/

These pages report the status of the DataNode and NameNode; the NameNode UI also lets you browse the file system.

YARN Verification

To verify that YARN is running correctly, navigate to the YARN ResourceManager web UI:

http://localhost:8088/

This URL provides information about the cluster status, including running applications and available resources.
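The same checks can be scripted. The sketch below is a simple HTTP reachability probe (not an official Hadoop API) that reports whether each web UI is responding on its default port:

```python
from urllib.request import urlopen
from urllib.error import URLError

def ui_is_up(url, timeout=2):
    """Return True if the web UI at `url` answers an HTTP request with 200 OK."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (URLError, OSError):
        return False

for name, url in [("NameNode", "http://localhost:9870/"),
                  ("DataNode", "http://localhost:9864/"),
                  ("YARN ResourceManager", "http://localhost:8088/")]:
    print(f"{name}: {'up' if ui_is_up(url) else 'down'}")
```

If a UI reports down even though the daemon appears to have started, check the daemon's log files under the Hadoop logs directory for startup errors.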

Thank You

Thank you for visiting this website. If you have any questions or would like to get in touch, please feel free to contact me.
