Apache Hadoop is an open-source framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from a single server to thousands of machines, each offering local computation and storage. Hadoop is known for its reliability and ability to handle vast amounts of data efficiently.
The Hadoop Distributed File System (HDFS) is a fundamental component of Hadoop, providing scalable and reliable data storage. HDFS is designed to store very large files across a cluster of machines, ensuring fault tolerance and high availability.
Another critical component of Hadoop is the MapReduce programming model, which enables efficient data processing across the distributed storage provided by HDFS. MapReduce breaks down data processing tasks into two main phases: the Map phase, which filters and sorts the data, and the Reduce phase, which performs aggregation and summarization. Together with YARN (Yet Another Resource Negotiator), which manages and schedules resources in the Hadoop cluster, these components make Hadoop a robust platform for big data analytics, allowing organizations to gain insights from their data at unprecedented scales.
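The two phases can be sketched in plain Python as a toy word count (this is an illustration of the model, not actual Hadoop code; the shuffle step stands in for the sort-and-group work the framework performs between the phases):

```python
from itertools import groupby
from operator import itemgetter

# Map phase: emit a (word, 1) pair for every word in every input line
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

# Shuffle/sort: group the pairs by key, as the framework does between phases
def shuffle(pairs):
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, [value for _, value in group]

# Reduce phase: aggregate the counts for each word
def reduce_phase(grouped):
    return {word: sum(counts) for word, counts in grouped}

lines = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["the"])  # "the" occurs twice across the input lines
```

In a real Hadoop job the map and reduce functions run on different machines, and the shuffle moves data over the network; the logic, however, is the same.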
Before starting Hadoop, ensure you have a working winutils.exe. If it is not recognized or does not run, you will need to download the necessary msvcr120.dll file. Follow these steps:
1. Download the msvcr120.dll file (it ships with the Microsoft Visual C++ 2013 runtime).
2. Run winutils.exe again in Command Prompt.
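A quick way to check is to invoke the binary directly (assuming Hadoop lives in C:\Hadoop, as in the installation steps below):

```bat
:: Run winutils.exe from the Hadoop bin folder. If the runtime is
:: missing, Windows will complain that msvcr120.dll was not found.
C:\Hadoop\bin\winutils.exe
```

If the command prints usage information instead of an error dialog, the prerequisite is satisfied.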
Install Java and Hadoop as follows:
1. Create a Java directory with two folders: one for the JDK (C:\Java\jdk) and one for the JRE (C:\Java\jre). Ensure you update the paths accordingly in the installation wizard.
2. Create a Hadoop folder in the root of the Windows (C:) drive for installing Hadoop.
3. Set HADOOP_HOME to C:\Hadoop\bin in the system variables.
4. Set JAVA_HOME to the JDK path C:\Java\jdk in the system variables.
5. Add C:\Hadoop\bin and C:\Hadoop\sbin to the system PATH variable.
6. Add JAVA_HOME\bin to the user PATH variable.
7. Update the Hadoop configuration files (core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml) and the bin folder in the Hadoop directory. You can download the updated bin folder from here, then copy it directly into your Hadoop directory. For the XML files, navigate to C:\Hadoop\etc\hadoop, download the updated XML files, and copy them directly into this folder.
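For reference, a minimal single-node core-site.xml usually contains little more than the default filesystem address (the hdfs://localhost:9000 value here is a common single-node assumption; prefer the values in the downloaded files if they differ):

```xml
<configuration>
  <!-- URI of the default filesystem for a single-node setup -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```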
Edit the hadoop-env.cmd file to include the JAVA_HOME variable: set JAVA_HOME to C:\Java\jdk in the C:\Hadoop\etc\hadoop\hadoop-env.cmd file.
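Inside hadoop-env.cmd the change is a single line (using the JDK path from the installation steps above):

```bat
@rem Point Hadoop at the JDK installed under C:\Java
set JAVA_HOME=C:\Java\jdk
```

Installing the JDK under a path without spaces, as done here, avoids a common failure mode: Hadoop's batch scripts tend to break on paths such as C:\Program Files\Java.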
Format the NameNode by running:
hdfs namenode -format
This step is only necessary during the initial setup; reformatting later will wipe the NameNode's metadata.
If your DataNode is not running, follow these steps:
1. DataNode and NameNode Version Mismatch
Navigate to C:\Hadoop\data\dfs\current\namenode and open the VERSION file. Copy the clusterID value, then navigate to C:\Hadoop\data\dfs\current\datanode and paste the same clusterID value into the corresponding VERSION file in this directory.
2. Permission Issues
Navigate to C:\Hadoop\data\dfs\current\datanode, right-click on the directory, and grant your user full control under Properties > Security.
In most cases, these steps will resolve the issue with the DataNode not running.
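The permission fix can also be applied from an elevated Command Prompt; this icacls invocation is one way to grant the current user full control (an alternative to the right-click route):

```bat
:: Recursively grant the current user full control over the DataNode
:: directory; (OI)(CI)F = object + container inherit, full access
icacls C:\Hadoop\data\dfs\current\datanode /grant %USERNAME%:(OI)(CI)F /T
```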
To start the Hadoop services, press Windows Key + X, select Command Prompt (Admin), and confirm any prompts. Then navigate to the sbin directory and run the startup scripts:
cd C:\Hadoop\sbin
start-dfs.cmd
start-yarn.cmd
Following these steps will help ensure that Hadoop services are correctly started and running on the local machine. For more detailed information, refer to the official Apache Hadoop documentation. Detailed steps for starting these services are provided later in this page.
Follow these steps to start the Hadoop Distributed File System (HDFS) and YARN (Yet Another Resource Negotiator):
Step 1: Open Command Prompt in Administrator Mode
To avoid permission issues, it's important to run the command prompt with administrative privileges.
Press Windows Key + X and select Command Prompt (Admin) or Windows PowerShell (Admin).
Step 2: Navigate to the Hadoop sbin Directory
Change the directory to the Hadoop sbin directory where the startup scripts are located:
cd C:\Hadoop\sbin
This command changes the current directory to the sbin folder in the Hadoop installation directory. The prompt will then show C:\Hadoop\sbin, indicating that you are now in the correct directory.
Step 3: Start the Hadoop Distributed File System (HDFS)
Run the following command to start the HDFS daemons (NameNode, Secondary NameNode, and DataNode):
start-dfs.cmd
This command starts the Hadoop Distributed File System (HDFS) daemons, including the NameNode, Secondary NameNode, and DataNode processes.
Step 4: Start YARN
Run the following command to start YARN daemons (ResourceManager and NodeManager):
start-yarn.cmd
This command starts the YARN (Yet Another Resource Negotiator) daemons, including the ResourceManager and NodeManager processes.
Step 5: Verify Hadoop Services
HDFS Verification
Open a web browser and navigate to the HDFS web UI to verify that the DataNode and NameNode are running correctly:
For Datanode: http://localhost:9864/
For Namenode: http://localhost:9870/
These URLs show the status of the DataNode and NameNode; the NameNode UI also allows you to browse the file system.
YARN Verification
To verify that YARN is running correctly, navigate to the YARN ResourceManager web UI (by default at http://localhost:8088/):
This URL provides information about the cluster status, including running applications and available resources.
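You can also verify the daemons from the command line with the JDK's jps tool, which lists running Java processes (the names below assume a successful single-node startup):

```bat
:: jps ships with the JDK; after a successful start you should see
:: NameNode, DataNode, ResourceManager and NodeManager among the entries
jps
```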