What is a good big data tutorial

Apache Hadoop: distributed storage architecture for large amounts of data

Hadoop Distributed File System (HDFS)

HDFS is a highly available file system that stores large amounts of data across a computer cluster and is therefore responsible for data management within the framework. To this end, files are split into data blocks and distributed redundantly across different nodes, without any fixed ordering scheme. According to the developers, HDFS can manage hundreds of millions of files. Both the size of the data blocks and the degree of redundancy can be configured individually.
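To make the effect of block size and redundancy concrete, the following minimal sketch computes how many blocks a file is split into and how much raw storage its replicas occupy. The function name block_layout is hypothetical, and the values of 128 MiB per block and three replicas are assumptions that mirror commonly used HDFS settings; a real cluster may be configured differently.

```python
MIB = 1024 * 1024


def block_layout(file_size, block_size=128 * MIB, replication=3):
    """Return (block_count, last_block_size, raw_bytes_stored) for one file.

    Assumed defaults: 128 MiB blocks, 3 replicas (illustrative, configurable).
    """
    if file_size < 0 or block_size <= 0 or replication < 1:
        raise ValueError("sizes must be non-negative and replication at least 1")
    if file_size == 0:
        return 0, 0, 0
    # Ceiling division: a partially filled final block still counts as a block.
    block_count = -(-file_size // block_size)
    # The final block only holds the remainder, it is not padded to full size.
    last_block_size = file_size - (block_count - 1) * block_size
    # Every byte of the file is stored once per replica.
    raw_bytes_stored = file_size * replication
    return block_count, last_block_size, raw_bytes_stored


if __name__ == "__main__":
    blocks, last, raw = block_layout(300 * MIB)
    print(blocks, last // MIB, raw // MIB)  # 3 44 900
```

Under these assumptions a 300 MiB file becomes two full blocks plus one 44 MiB block, and the cluster stores 900 MiB in total for it.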

The Hadoop cluster basically works according to the master-slave principle. The architecture of the framework thus consists of a master node to which a large number of slave nodes are subordinate. This principle is also reflected in the structure of HDFS, which is based on a NameNode and various subordinate DataNodes. The NameNode manages all metadata on the file system, directory structures, and files. The actual data is stored on the subordinate DataNodes. To minimize data loss, files are broken down into individual blocks and stored several times on different nodes. In the standard configuration, each data block is stored in triplicate.
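The division of labor between the master and its subordinate nodes can be sketched as a small in-memory model. The class ToyNameNode and its methods are hypothetical and exist only for illustration; in particular, the "least-loaded node first" placement used here is a simplification and not the placement policy HDFS itself applies.

```python
from collections import defaultdict


class ToyNameNode:
    """Hypothetical model: holds metadata only, never the file contents."""

    def __init__(self, datanodes, replication=3):
        if replication > len(datanodes):
            raise ValueError("need at least as many DataNodes as replicas")
        self.replication = replication
        self.blocks_on_node = {node: set() for node in datanodes}
        self.file_blocks = defaultdict(list)  # path -> ordered block ids
        self.block_locations = {}             # block id -> set of DataNode names

    def add_file(self, path, block_count):
        """Register a file and assign each of its blocks to distinct nodes."""
        for index in range(block_count):
            block_id = f"{path}#blk{index}"
            # Simplified placement: least-loaded nodes first, name as tie-breaker.
            ranked = sorted(
                self.blocks_on_node,
                key=lambda node: (len(self.blocks_on_node[node]), node),
            )
            targets = ranked[: self.replication]
            for node in targets:
                self.blocks_on_node[node].add(block_id)
            self.block_locations[block_id] = set(targets)
            self.file_blocks[path].append(block_id)

    def locate(self, path):
        """Return [(block_id, [nodes...]), ...] so a client knows where to read."""
        return [
            (block_id, sorted(self.block_locations[block_id]))
            for block_id in self.file_blocks[path]
        ]


if __name__ == "__main__":
    namenode = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
    namenode.add_file("/logs/a.txt", block_count=2)
    for block_id, nodes in namenode.locate("/logs/a.txt"):
        print(block_id, nodes)
```

The point of the sketch is that the master answers "which nodes hold block X?" from its metadata, while the blocks themselves live only on the subordinate nodes.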

Each DataNode regularly sends the NameNode a sign of life, the so-called heartbeat. If this signal fails to arrive, the NameNode declares the respective slave "dead" and uses the copies on other nodes to ensure that enough replicas of the affected data blocks remain available in the cluster despite the failure. The NameNode thus plays a central role within the framework. To keep it from becoming a "single point of failure", it is common practice to place a SecondaryNameNode alongside this master node, which records all changes to the metadata and thus makes it possible to restore the central control instance.
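The heartbeat mechanism and the subsequent repair of lost replicas can be illustrated with a short sketch. The function names dead_nodes and re_replicate, the 30-second timeout, and the node names are all assumptions chosen for the example; they do not reproduce the actual timeouts or scheduling logic of HDFS.

```python
def dead_nodes(last_heartbeat, now, timeout):
    """Return the nodes whose last heartbeat is older than `timeout` seconds."""
    return {node for node, seen in last_heartbeat.items() if now - seen > timeout}


def re_replicate(block_locations, dead, live, replication=3):
    """Forget replicas on dead nodes and plan copies for under-replicated blocks.

    Mutates `block_locations` and returns a list of (block_id, source, target)
    copy instructions. `live` must not contain any dead node.
    """
    copies = []
    for block_id, holders in block_locations.items():
        holders -= dead  # replicas on dead nodes no longer count
        if not holders:
            continue  # every replica is gone, nothing left to copy from
        source = min(holders)  # any surviving replica can serve as the source
        candidates = sorted(live - holders)
        while len(holders) < replication and candidates:
            target = candidates.pop(0)
            copies.append((block_id, source, target))
            holders.add(target)
    return copies


if __name__ == "__main__":
    heartbeats = {"dn1": 100.0, "dn2": 95.0, "dn3": 40.0, "dn4": 99.0}
    dead = dead_nodes(heartbeats, now=101.0, timeout=30.0)
    live = set(heartbeats) - dead
    locations = {
        "blk0": {"dn1", "dn2", "dn3"},
        "blk1": {"dn2", "dn3", "dn4"},
    }
    print(sorted(dead))
    print(re_replicate(locations, dead, live))
```

In this example dn3 has been silent for too long, so both blocks drop to two replicas and each receives one new copy on a surviving node that does not yet hold it.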

In the transition from Hadoop 1 to Hadoop 2, HDFS was extended with additional safeguards: NameNode HA (High Availability) adds automatic failover, which brings a replacement component into service automatically if the NameNode fails. A snapshot function also makes it possible to restore the system to an earlier state. In addition, the Federation extension allows multiple NameNodes to be operated within one cluster.