How do I practice and learn Hadoop?

Software infrastructure

Hadoop is fully available under the Apache license; no additional licenses are required. Linux is the obvious choice as the underlying operating system and is likewise available free of charge in most distributions.

Of course, companies that want additional services, support, or extra functionality must pay for those extensions and support contracts.

Hadoop - Framework for Big Data

Hadoop is a Java-based framework built on the MapReduce algorithm from Google. Thanks to the Apache license, Hadoop is essentially available free of charge. Hadoop's task is to process and compute very large amounts of data efficiently in clusters. Administrators and developers have to work together for the cluster to function optimally. Hadoop can be installed locally or operated in the cloud.

This is what Hadoop is made of

Hadoop is built as a cluster. One node takes over the coordination (the NameNode), while the others store the data and perform the calculations (the DataNodes). The foundation is "Hadoop Common", which provides the common interface for all other components. MapReduce is the most important function for processing the data: the technology divides large amounts of data into smaller parts, distributes them to the nodes in the Hadoop cluster, and merges the results again after the calculation. Storage is handled by the underlying distributed file system, typically HDFS or alternatively GPFS, while MapReduce takes care of the computation across the cluster nodes. The MapReduce model was originally developed by Google.
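
How this division of labor looks in code is best shown by the classic WordCount example, written here against the standard org.apache.hadoop.mapreduce API. The mapper splits each line of input into words and emits (word, 1) pairs, Hadoop shuffles these pairs across the cluster, and the reducer sums the counts per word. Input and output paths are passed as arguments; this is a minimal sketch, not a production job.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: split each input line into words and emit (word, 1)
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce step: sum all counts that were emitted for the same word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-aggregate on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged as a JAR, such a job is submitted with "hadoop jar wordcount.jar WordCount <input> <output>", and the framework takes care of distributing the work to the cluster nodes.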

Local operation or cloud - Hadoop in the Azure cloud

To run Hadoop, companies need a cluster that provides the various nodes for the calculations. It is often easier, however, to operate such a cluster in the cloud. Microsoft, for example, offers HDInsight, an Azure cloud service for running a full-fledged Hadoop cluster. In contrast to many other Microsoft solutions, the company has not introduced standards of its own here but adheres entirely to the Hortonworks Data Platform (HDP).

Is Hadoop replacing business intelligence in the company?

Big data solutions like Hadoop complement business intelligence rather than replace it. In contrast to BI solutions, big data solutions do not require perfectly prepared data; they can produce useful reports and analyses from a large number of different data sources with completely different data. A BI system can, for example, show exactly which product was sold in which countries, with what share, revenue, and margin. This information remains important. Big data solutions, in turn, can reveal which customer group a product is particularly popular with, what connections exist with other products, and whether the transport of a product and its delivery time had an impact on sales figures. A connection between defects and the next generation's sales figures can also be detected.

IBM General Parallel File System in Big Data use

The IBM General Parallel File System (GPFS) is a special file system from IBM that can also be used in Hadoop clusters. Such clusters usually use the Hadoop Distributed File System (HDFS), but they can also use GPFS. Both file systems can process very large amounts of data extremely quickly and are therefore superior to conventional file systems for this purpose. One advantage of GPFS, for example, is fast access to very large files. The data is mirrored and distributed across hundreds or thousands of cluster nodes, yet remains accessible.

GPFS can also store data intelligently. When companies combine different storage technologies such as SSD, SAN, NAS, and DAS, GPFS can keep frequently used data in fast storage areas and move old files to slower volumes. This is particularly important when processing data with Hadoop.
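
GPFS expresses such placement rules in its own policy language. HDFS offers a comparable mechanism in recent Hadoop versions through storage policies, which the following minimal sketch sets via the Hadoop FileSystem API; the NameNode address and the directory names are invented for illustration, and the DataNodes must actually be configured with SSD and archive volumes for the policies to have an effect.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StorageTiering {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "namenode" is a placeholder for the cluster's NameNode host
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Keep frequently used data on SSD-backed volumes ...
        fs.setStoragePolicy(new Path("/data/hot"), "ALL_SSD");

        // ... and park old files on slow archive disks
        fs.setStoragePolicy(new Path("/data/archive"), "COLD");

        fs.close();
    }
}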

Hadoop in Amazon Web Services, Google Cloud Platform and Rackspace

In addition to Microsoft Azure HDInsight, Hadoop clusters can also be operated in Amazon Web Services (AWS). With AWS, the Hadoop cluster's data is stored in the AWS storage service S3. Rackspace also offers a cloud solution based on Apache Hadoop and the Hortonworks Data Platform, and Hadoop can likewise be operated on the Google Cloud Platform.
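
Because Hadoop addresses S3 through the s3a file system connector from the hadoop-aws module, S3 objects can be read much like HDFS files. A minimal sketch, assuming a hypothetical bucket and object; credentials would come from fs.s3a.access.key and fs.s3a.secret.key or from an instance profile:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3Read {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "my-hadoop-bucket" and the object path are invented examples
        FileSystem fs = FileSystem.get(URI.create("s3a://my-hadoop-bucket"), conf);
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("s3a://my-hadoop-bucket/input/data.txt"))))) {
            System.out.println(reader.readLine()); // print the first line as a smoke test
        }
    }
}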

The main Hadoop distributions

In addition to the options of operating Hadoop in Microsoft Azure HDInsight or Amazon Web Services, you can of course also rely on your own installations. The following providers are particularly well-known in this context:

• Hortonworks Data Platform

• Cloudera

• MapR

Extending Hadoop - YARN and Co.

There are numerous extensions on the market that expand Hadoop's functionality. Examples are Hadoop YARN and Apache Hive. Developers can use Hive to query the data stored in HDFS directly with SQL-like statements (HiveQL).
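
From Java, such a query typically goes through the HiveServer2 JDBC driver (org.apache.hive.jdbc.HiveDriver from the hive-jdbc artifact). A minimal sketch; the host, port, credentials, and the sales table are assumptions for illustration:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // Registers the driver explicitly; newer JDBC versions do this automatically
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://hive-host:10000/default", "hadoop", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT product, SUM(quantity) FROM sales GROUP BY product")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}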

Apache Spark also plays an important role in this context. YARN is a cluster management technology for Hadoop; many big data professionals also refer to YARN as MapReduce 2.0.
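
Whether a MapReduce job runs on YARN is a matter of configuration rather than code; the relevant properties are normally set cluster-wide in mapred-site.xml and yarn-site.xml. A short sketch of the two key settings, with "rm-host" as a placeholder for the ResourceManager:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class YarnJobSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Run MapReduce on YARN (MRv2) instead of the classic JobTracker
        conf.set("mapreduce.framework.name", "yarn");
        // Tell the client where the YARN ResourceManager lives
        conf.set("yarn.resourcemanager.hostname", "rm-host");
        Job job = Job.getInstance(conf, "job-on-yarn");
        System.out.println(job.getConfiguration().get("mapreduce.framework.name"));
    }
}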

With Apache ZooKeeper, the services of a Hadoop infrastructure can be coordinated centrally. Apache HCatalog is a table and storage management layer for the various data processing tools.
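
As an illustration of this central coordination, here is a minimal sketch using the ZooKeeper Java client that writes a small piece of shared state into a znode and reads it back; the connection string and the znode path are invented, and 2181 is ZooKeeper's default client port:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkDemo {
    public static void main(String[] args) throws Exception {
        // "zk-host" is a placeholder; the lambda is a no-op connection watcher
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 3000, event -> {});

        // Store a small piece of cluster-wide configuration in a znode
        zk.create("/demo-config", "value".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Any other client in the cluster can now read the same state
        System.out.println(new String(zk.getData("/demo-config", false, null)));
        zk.close();
    }
}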

Security and monitoring in the Hadoop cluster - Apache Knox and Chukwa

Apache Knox is a REST API gateway for Hadoop clusters. The extension strengthens Hadoop's security model and adds authentication and user roles.
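
In practice, a client then talks to the gateway over HTTPS instead of contacting the NameNode directly, for example to list an HDFS directory via the proxied WebHDFS API. A minimal sketch; the gateway host, the "default" topology, and the guest credentials are assumptions, and the gateway's TLS certificate must be trusted by the JVM:

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class KnoxList {
    public static void main(String[] args) throws Exception {
        // Knox listens on 8443 by default; topology name and path are examples
        URL url = new URL("https://knox-host:8443/gateway/default/webhdfs/v1/tmp?op=LISTSTATUS");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();

        // Knox authenticates the caller, e.g. via HTTP Basic against LDAP
        String auth = Base64.getEncoder().encodeToString("guest:guest-password".getBytes());
        conn.setRequestProperty("Authorization", "Basic " + auth);

        // The directory listing is returned as JSON in the response body
        System.out.println("HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}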

Apache Chukwa is a good choice for monitoring the Hadoop infrastructure. The solution builds on HDFS and the MapReduce framework and collects and analyzes log data from large distributed systems.

Oracle, IBM and Co. - extending Hadoop commercially

With Big Data SQL, Oracle offers a way to access big data stores via SQL queries. IBM InfoSphere BigInsights extends Hadoop with numerous capabilities: data can be managed more easily, and there are more options for querying it. (mje)