Apache Hadoop is an open-source framework for storing and processing large amounts of data across multiple commodity servers; its storage layer is a distributed file system. It can store both unstructured and structured data, and you can use Apache Pig and Apache Hive to write a schema around this data to give it structure. That makes it something you can query. Otherwise it would not be of much use, yes?
Hadoop data is stored in a Hadoop cluster. A Hadoop cluster is a single name node plus multiple data nodes that together make up the Hadoop Distributed File System (HDFS). The name node keeps track of which data is located on which data node, which in a cloud deployment means which virtual machine. The data nodes are responsible for storing the data blocks. Data nodes also run the batch jobs that retrieve data from the cluster when the user executes a query.
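To make the division of labor concrete, here is a minimal sketch of the kind of bookkeeping a name node does. It is a toy model, not Hadoop code: the class name, method names, file path, block IDs, and node names are all assumptions chosen for illustration.

```python
from collections import defaultdict


class NameNode:
    """Toy stand-in for the name node: it holds metadata only, never file contents."""

    def __init__(self):
        self.file_blocks = defaultdict(list)     # file path -> ordered block ids
        self.block_locations = defaultdict(set)  # block id -> data nodes holding a copy

    def register_block(self, path, block_id, datanodes):
        """Record that the given data nodes each store a copy of one block of a file."""
        if block_id not in self.file_blocks[path]:
            self.file_blocks[path].append(block_id)
        self.block_locations[block_id].update(datanodes)

    def locate(self, path):
        """Return (block id, sorted data nodes) pairs for a file, in block order."""
        blocks = self.file_blocks.get(path, [])
        return [(b, sorted(self.block_locations[b])) for b in blocks]


# Hypothetical file split into two blocks, each stored on three data nodes.
namenode = NameNode()
namenode.register_block("/logs/web.log", "blk_1", ["dn1", "dn2", "dn3"])
namenode.register_block("/logs/web.log", "blk_2", ["dn2", "dn3", "dn4"])
print(namenode.locate("/logs/web.log"))
```

A client would ask the name node where a file's blocks live and then read the blocks from the data nodes themselves.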
Hadoop queries and gathers data using batch jobs: MapReduce, Pig, Hive, plus other tools. These jobs are split into tasks that run in parallel on the data nodes, which is what gives a distributed storage scheme its performance boost over one big server, like some kind of UNIX mainframe.
MapReduce jobs crawl across the Hadoop Distributed File System (HDFS) in two steps: the map step reads the raw data and emits the records relevant to the query, and the reduce step combines those records into the result. Pig and Hive do the same thing. They let the developer write this MapReduce logic in a higher-level language: HiveQL, which closely resembles the SQL that practically every developer already knows, or Pig Latin, a data-flow scripting language. To use these against unstructured data, the developer writes a schema that describes the different types of data in Hadoop (logs, database extracts, Excel files, and others). These schemas can use regular expressions to split strings of text into their corresponding fields, which can then be queried with SQL-like statements.
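The idea can be sketched in a few lines of plain Python, with no Hadoop involved. The log layout, the regular expression, and the function names below are assumptions made up for this illustration. A real job would run the map and reduce steps in parallel across the data nodes instead of in one process.

```python
import re
from collections import defaultdict

# Hypothetical log layout: "<ip> <HTTP method> <path> <status code>"
LOG_PATTERN = re.compile(
    r"^(?P<ip>\S+) (?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3})$"
)


def map_phase(line):
    """Split one raw line into fields and emit (key, value) pairs.

    Lines that do not match the pattern are skipped.
    """
    match = LOG_PATTERN.match(line)
    if match:
        yield match.group("status"), 1


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between the two steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values emitted for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


sample_lines = [
    "10.0.0.1 GET /index.html 200",
    "10.0.0.2 GET /missing 404",
    "10.0.0.1 POST /login 200",
    "not a log line",
]
print(run_job(sample_lines))  # {'200': 2, '404': 1}
```

This is roughly what a SQL-like statement such as SELECT status, COUNT(*) ... GROUP BY status would ask for once a schema with a status field has been laid over the raw log text.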
Hadoop uses replication to provide fault tolerance. But how does one use Hadoop in a virtualized cloud environment? There the vCD (VMware vCloud Director) user might not have access to the vSphere configuration that spells out which virtual machine is assigned to which SAN LUNs and which blade chassis slot.
Why is this an issue? Hadoop by default makes 3 copies of each data block, and Hadoop is rack-aware. Its block placement algorithm copies these data blocks onto different storage media in a manner designed to provide data redundancy, and it takes into consideration the rack in which each physical server is located to provide additional data protection.
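As a rough illustration, the sketch below follows the default HDFS placement for three replicas: the first copy on the node that writes the block, the second on a node in a different rack, and the third on another node in that second rack. The function name, the rack and node names, and the simplification that a remote rack must have at least two nodes are assumptions for this example, not Hadoop's actual code.

```python
import random


def place_replicas(writer, topology, rng=random):
    """Pick three data nodes for one block.

    topology maps a rack name to the list of node names in that rack
    (all names here are hypothetical).
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    first = writer  # first copy stays on the writing node
    # Simplification: only consider remote racks that can hold two copies.
    remote_racks = [
        rack for rack, nodes in topology.items()
        if rack != rack_of[first] and len(nodes) >= 2
    ]
    if not remote_racks:
        raise ValueError("need a second rack with at least two nodes")
    second_rack = rng.choice(remote_racks)
    # Second and third copies go to two different nodes in the same remote rack.
    second, third = rng.sample(topology[second_rack], 2)
    return [first, second, third]


topology = {
    "rack1": ["node-a", "node-b"],
    "rack2": ["node-c", "node-d"],
    "rack3": ["node-e"],
}
print(place_replicas("node-a", topology, random.Random(0)))
```

With this layout, losing any single server or any single rack still leaves at least one copy of the block. The scheme only works if the rack names Hadoop is given reflect where the servers physically are.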
With vCD riding on top of vCenter, the customer does not have direct access to the vCenter details. So, in the worst case, multiple virtual machines could all be on the same or nearly the same rack and their data stored on the same LUN (a logical unit of storage carved out of a physical disk or array). Stratogen knows about this and configures vCenter to provide the required redundancy. But part of the responsibility of doing that falls on VMware, which is what the Stratogen cloud uses.
VMware is aware of this issue and has been working since 2012 to address it and to provide a tool for deploying Hadoop on VMware. First, they launched the open-source Serengeti project, a tool that makes deploying Hadoop clusters across multiple virtual machines easier. Second, VMware has dedicated programmers and architects to the Apache Hadoop community to contribute changes to Hadoop to “enhance the support for failure and locality topologies by making Hadoop virtualization-aware.”
VMware summarizes what they are doing and have done with the Apache Hadoop project as follows (I fixed their grammar mistakes. They are great engineers, but need a copy editor.):
The current Hadoop network topology (described in some previous issues like: Hadoop-692) works well in classic three-tier networks… However, it does not take into account other failure models or changes in the infrastructure that can affect network bandwidth efficiency like virtualization.
A virtualized platform has the following genes that shouldn’t be ignored by Hadoop topology in scheduling tasks, placing replicas, doing balancing or fetching blocks for reading:
1. VMs on the same physical host are affected by the same hardware failure. In order to match the reliability of a physical deployment, replication of data across two virtual machines on the same host should be avoided.
2. The network between VMs on the same physical host has higher throughput and lower latency and does not consume any physical switch bandwidth.
Thus, we propose to make Hadoop network topology extendable and introduce a new level in the hierarchical topology, a node group level, which maps well onto an infrastructure that is based on a virtualized environment.
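As you can see, the goal is to make Hadoop virtualization-aware by adding a node group level to its network topology, which both protects replicas from the failure of a single physical host and boosts performance.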
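A minimal sketch of the first rule in the list above (never put two replicas on virtual machines that share a physical host) might look like this. The three-level path format /rack/nodegroup/vm, the host and VM names, and the greedy selection are assumptions for illustration only. This is not the placement code VMware contributed, and it ignores the rack rules entirely.

```python
from collections import namedtuple

Location = namedtuple("Location", "rack node_group vm")


def parse_location(path):
    """Split a hypothetical topology path such as /rack1/hostA/vm1 into its levels."""
    rack, node_group, vm = path.strip("/").split("/")
    return Location(rack, node_group, vm)


def choose_replicas(candidates, count=3):
    """Greedily pick VMs so that no two replicas share a node group (physical host)."""
    chosen = []
    used_groups = set()
    for path in candidates:
        loc = parse_location(path)
        group = (loc.rack, loc.node_group)
        if group in used_groups:
            continue  # another replica already lives on this physical host
        chosen.append(path)
        used_groups.add(group)
        if len(chosen) == count:
            return chosen
    raise ValueError("not enough distinct node groups for the requested replicas")


candidates = [
    "/rack1/hostA/vm1",
    "/rack1/hostA/vm2",  # same physical host as vm1, so it is skipped
    "/rack1/hostB/vm3",
    "/rack2/hostC/vm4",
]
print(choose_replicas(candidates))
# ['/rack1/hostA/vm1', '/rack1/hostB/vm3', '/rack2/hostC/vm4']
```

Without the node group level, vm1 and vm2 would look like two independent servers in the same rack, and two copies of a block could end up on hardware that fails together.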
VMware Hadoop Project Serengeti
Serengeti is a tool that lets Hadoop administrators deploy and set up a Hadoop cluster more easily than with the native Hadoop tools. Among other things, Serengeti can:
- Tune Hadoop configuration
- Define storage (i.e., local or shared)
- Provide extensions to give Hive access to SQL databases
- Enable VMware vMotion for moving a cluster's virtual machines between hosts
- Provide additional control over HDFS clusters
VMware Hadoop Project Spring
Another VMware project is Spring. Spring is an open-source umbrella of projects. For example, the Spring Framework lets developers describe the relationships between Java classes in XML, so that objects can be instantiated and wired together from configuration files instead of explicitly in Java code. It also handles things like transactions.
The Spring Hadoop project lets programmers do various tasks, like writing Java code to run Hadoop tasks instead of using the Hadoop command line. It also extends the Spring Batch framework to manage the workflow of Hadoop batch jobs like MapReduce, Pig, and Hive. Spring also provides data access objects (think of JDBC or ODBC) for HBase data. HBase is a way to turn Hadoop into something similar to a relational database by providing random read/write access to the data there. Remember that Hadoop is not one file, like a database, but a collection of files, each of which could be of a different type. So HBase is an abstraction layer over that, as is Hadoop itself.
Find out more: http://www.stratogen.net/products/hadoop-hosting.html