Deploying Hadoop in the Virtualized Cloud

Apache Hadoop is a framework for storing and processing large amounts of data across multiple commodity servers, built around a distributed file system. It stores both structured and unstructured data, and you can use Apache Pig and Apache Hive to write a schema around unstructured data to give it structure. That makes it something you can query; otherwise it would not be of much use.

Hadoop data is stored in a Hadoop cluster. A Hadoop cluster is a single name node plus multiple data nodes that together make up the Hadoop Distributed File System (HDFS). The name node keeps track of which data is located on which machine (in a virtualized cloud, which virtual machine). The data nodes are responsible for writing the files there. Data nodes also run the batch jobs that retrieve data from the cluster when the user executes a query.

Hadoop queries and gathers data using batch jobs: MapReduce, Pig, Hive, and other tools. These are Hadoop tasks that run in parallel across the data nodes, which is what gives a distributed storage scheme its performance boost over one big server, such as a UNIX mainframe.

MapReduce jobs crawl across the Hadoop Distributed File System (HDFS) to obtain a subset of the data. The Map step reads the raw records and emits the key-value pairs that are relevant to the query; the Reduce step combines the values for each key into the result. Pig and Hive do the same thing at a higher level. They let the developer express this MapReduce logic in a query language instead of Java: Hive uses a SQL-like language (HiveQL), which is familiar to practically every developer, and Pig uses its own scripting language (Pig Latin). To use these against unstructured data, the developer writes a schema that describes the different types of data in Hadoop (logs, database extracts, Excel files, and others). These schemas can use regular expressions to split strings of text into their corresponding fields, which can then be queried using SQL.
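To make the Map and Reduce steps concrete, here is a minimal single-process sketch in Python. It is not Hadoop code and does not use any Hadoop API; the log format, the regular expression, and the function names (map_phase, shuffle, reduce_phase, run_job) are all assumptions made for illustration. It mimics what a query such as "SELECT status, SUM(bytes) FROM logs GROUP BY status" would ask a cluster to do.

```python
import re
from collections import defaultdict

# Hypothetical log format, assumed for this sketch: "<ip> <status> <bytes>"
LOG_PATTERN = re.compile(r"^(?P<ip>\S+) (?P<status>\d{3}) (?P<bytes>\d+)$")


def map_phase(line):
    """Split one raw line into fields and emit (key, value) pairs."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return []  # skip lines that do not fit the schema
    return [(match.group("status"), int(match.group("bytes")))]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values emitted for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    sample = [
        "10.0.0.1 200 512",
        "10.0.0.2 404 128",
        "10.0.0.3 200 256",
        "not a log line",
    ]
    print(run_job(sample))  # bytes served per status code
```

On a real cluster the map calls would run in parallel on the data nodes that hold each block, and only the grouped results would travel over the network to the reducers.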

Hadoop uses replication to provide fault tolerance. But how does one use Hadoop in a virtualized cloud environment? There the vCD (VMware vCloud Director) user might not have access to the vSphere configuration that spells out which virtual machine is assigned to which SAN LUNs and which blade chassis slot.

Why is this an issue? By default, Hadoop makes three copies of each data block, and Hadoop is rack-aware. Its data placement algorithm copies these blocks onto different storage media in a manner designed to provide data redundancy, and it takes into consideration the rack in which each physical server is located to provide additional data protection.
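The rack-aware placement described above can be sketched in a few lines of Python. This is a simplified illustration, not Hadoop's implementation: it assumes the commonly documented default policy (first replica on the writer's node, second on a node in a different rack, third on another node in that same remote rack), it picks nodes in sorted order rather than at random, and the node and rack names are hypothetical.

```python
def place_replicas(writer, topology):
    """Choose three nodes for one block.

    topology maps node name -> rack name (hypothetical names);
    writer is the node that is creating the block.
    """
    local_rack = topology[writer]

    # Replica 1: the writer's own node.
    first = writer

    # Replica 2: a node on a different rack, so a rack failure
    # cannot destroy every copy.
    remote = sorted(n for n, r in topology.items() if r != local_rack)
    if not remote:
        raise ValueError("need at least two racks")
    second = remote[0]

    # Replica 3: a different node on the same rack as replica 2,
    # which keeps the extra copy off the inter-rack link.
    same_rack = sorted(
        n for n, r in topology.items()
        if r == topology[second] and n != second
    )
    if not same_rack:
        raise ValueError("remote rack needs at least two nodes")
    third = same_rack[0]

    return [first, second, third]


if __name__ == "__main__":
    topology = {
        "node1": "/rack1",
        "node2": "/rack1",
        "node3": "/rack2",
        "node4": "/rack2",
    }
    print(place_replicas("node1", topology))
```

The sketch only works if the rack labels are truthful. If every virtual machine reports a different "rack" but they all run on one physical host, the three copies are no safer than one.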

With vCD riding on top of vCenter, the customer does not have direct access to the vCenter details. So, in the worst case, multiple virtual machines could all sit in the same or nearly the same rack, with their data stored on the same LUN (a logical unit of storage presented by the SAN). StratoGen knows about this and configures vCenter to provide the required redundancy. But part of the responsibility for doing that falls on VMware, whose platform the StratoGen cloud uses.

VMware is aware of this issue and has been working since 2012 to address it and to provide a tool for deploying Hadoop on VMware. First, it launched the open-source Serengeti project, a tool that makes deploying Hadoop clusters across multiple virtual machines easier. Second, VMware has dedicated programmers and architects to the Apache Hadoop community to contribute changes to Hadoop that “enhance the support for failure and locality topologies by making Hadoop virtualization-aware.”

VMware summarizes what it is doing and has done with the Apache Hadoop project as follows. (I fixed their grammar mistakes. They are great engineers, but they need a copy editor.)

The current Hadoop network topology (described in some previous issues like HADOOP-692) works well in classic three-tier networks… However, it does not take into account other failure models or changes in the infrastructure that can affect network bandwidth efficiency, like virtualization.

A virtualized platform has the following characteristics that shouldn’t be ignored by the Hadoop topology when scheduling tasks, placing replicas, balancing, or fetching blocks for reading:

1. VMs on the same physical host are affected by the same hardware failure. In order to match the reliability of a physical deployment, replication of data across two virtual machines on the same host should be avoided.

2. The network between VMs on the same physical host has higher throughput and lower latency and does not consume any physical switch bandwidth.

Thus, we propose to make Hadoop network topology extendable and introduce a new level in the hierarchical topology, a node group level, which maps well onto an infrastructure that is based on a virtualized environment.

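As you can see, the goal is to make Hadoop virtualization-aware by adding a node group level to its topology. That protects reliability (replicas avoid sharing a physical host) and boosts performance (tasks can exploit the fast network between VMs on the same host).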
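A rough way to picture the proposed node group level is as one more component in each node's topology path. The Python sketch below assumes hypothetical paths of the form /rack/host/vm, where the host component plays the role of the node group, and it measures distance as the number of hops to the closest common ancestor, which is how hierarchical topologies of this kind are usually described. It illustrates the idea only and is not the code VMware contributed.

```python
def distance(path_a, path_b):
    """Hops from each node up to their closest common ancestor, summed."""
    a = path_a.strip("/").split("/")
    b = path_b.strip("/").split("/")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


def node_group(path):
    """Everything above the VM name: the rack plus the physical host."""
    return path.rsplit("/", 1)[0]


def share_physical_host(replica_paths):
    """True if any two replicas sit in the same node group."""
    groups = [node_group(p) for p in replica_paths]
    return len(set(groups)) < len(groups)


if __name__ == "__main__":
    vm1 = "/rack1/host1/vm1"
    vm2 = "/rack1/host1/vm2"  # same physical host as vm1
    vm3 = "/rack1/host2/vm3"  # same rack, different host
    vm4 = "/rack2/host3/vm4"  # different rack

    print(distance(vm1, vm2))  # same node group
    print(distance(vm1, vm3))  # same rack
    print(distance(vm1, vm4))  # different rack
    print(share_physical_host([vm1, vm2, vm4]))
```

With the extra level, a scheduler can prefer the shortest distance when reading data (point 2 above) while a placement policy rejects any replica set for which share_physical_host returns True (point 1 above).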

VMware Hadoop Project Serengeti

Serengeti is a tool that lets Hadoop administrators deploy and set up a Hadoop cluster more easily than with the native Hadoop tools. Among other things, Serengeti can:

  • Tune Hadoop configuration
  • Define storage (i.e., local or shared)
  • Provide extensions to give Hive access to SQL databases
  • Enable VMware vMotion for moving clusters with machines
  • Provide additional control over HDFS clusters

VMware Hadoop Project Spring

Another VMware project is Spring. Spring is an open-source umbrella of projects. For example, the Spring Framework lets developers model relationships between Java classes using XML, so that objects can be instantiated and wired together in configuration files instead of explicitly in Java code. It also handles things like transactions.

The Spring Hadoop project lets programmers do various things, such as writing Java code to carry out Hadoop tasks instead of using the Hadoop command line. It also extends the Spring Batch framework to manage the workflow of Hadoop batch jobs like MapReduce, Pig, and Hive. Spring provides data access objects (think of JDBC or ODBC) for HBase data. HBase is a way to turn Hadoop into something similar to a database by providing random read and write access to the data there. Remember that Hadoop is not one file, like a database, but a collection of files, each of which could be of a different type. So HBase is an abstraction layer over that, as is Hadoop itself.


Migrating to the Cloud – Challenges and Considerations

As organizations continue to experience vibrant growth and rapid entry into new markets, the need to architect new data environments which perform flawlessly, deliver cutting edge technology solutions, and conserve resources has become paramount. It is often assumed that a transition from a private in-house data center to a cloud-based infrastructure is the direction in which most organizations should embark. However, there are multiple challenges and considerations that should be addressed before you take the plunge.

Preparing for Migration across the Enterprise

The decision to transition to the cloud is by no means a purely technical one. It involves important issues such as vendor selection, strategies to handle possible service disruption during the transition, and cost considerations, to name only a few. Let us examine them briefly:

Vendor Selection

With new Cloud hosting companies appearing on the horizon regularly and promoting themselves vigorously, choices may be difficult to make. Make sure you are looking at more than just the cost or the cheapest deal. Examine issues such as industry reputation, awards, and accreditations; read case studies; and ask to speak to a current customer. Find out whether telephone support is provided 24x7. Do members of your senior technical team have instant, direct access to their counterparts at the cloud hosting provider, or do they have to jump through several hoops to reach them? These often overlooked factors can end up costing more money in the long run, and what appears to be a cheaper provider could end up being much more expensive.

Service Disruption

Advance planning is the key to disruption management when connecting with the cloud. If your decision to consider the cloud involves only internal corporate data, a replication model may be the right answer. In this model, your data center and your Cloud operation function simultaneously until the transition is complete. However, if you have a large number of tier 1 customers who rely on you for service, as is the case with live chat, videoconferencing, and SaaS providers for instance, service disruption will have to be planned for well in advance, and your service provider should offer you a migration plan and assistance.

Resource Optimization and Costing

Cost savings are frequently mentioned as one of the main reasons why enterprises should vote for the Cloud. Having a hardware-free environment can certainly save a huge amount of money and resources. Outsourcing to a cloud hosting provider also gives you the option to re-deploy your technical workforce, giving them the ability to concentrate on your core IT. Resource optimization and re-deployment options will vary depending on whether you choose the public, private, or hybrid Cloud model.

Are you ready to migrate to the Cloud?

You are ready…

When there are frequent spikes in service usage and on-demand resources become an attractive proposition.
When your applications are known to perform better in the cloud (via previous testing).
When data privacy and regulatory compliance become top priorities because of new clients you have recently acquired.
When cost control is important and a pay-as-you-go model becomes viable.
If you are in need of a hardware refresh and want to lower your cost and optimize performance.
If you are moving to new premises and no longer have in-house space.
If you want to re-deploy technical resources and concentrate on your core IT.

Migration to the cloud, especially by the technically savvy, startups, and SaaS providers, has experienced a dramatic rise in the past few years, and for good reasons. Enterprise cloud computing investment is expected to grow from $76.9B in 2010 to $210B in 2016, according to a Gartner study.

Has your organization stepped into the cloud yet? Have you finally found your silver lining? What are some of the constraints you have experienced in your decision-making process? We would love to hear from you through your comments.

For more information, read the AIP Case Study.



Pexip showcases live virtual videoconferencing at InfoComm14, via StratoGen Hybrid Cloud

Live from the show floor….

StratoGen has partnered with Pexip to demonstrate live hybrid cloud videoconferencing that bridges the gap between on-premises and cloud-hosted environments. During this interactive demo, users are able to see instantly how easy it is to set up a private virtual meeting for users, regardless of location and capability.


Showcasing the Pexip Infinity product with the StratoGen hybrid cloud, users are able to see the solution working seamlessly. Each user has been given an instant personal virtual meeting room and can opt to experience true interoperability by dialling in with any client or device they choose.

The Pexip solution is particularly appealing because it’s revolutionising the unified communications industry.  Through the virtual backplane, Pexip’s virtualized collaboration platform provides a consistent experience to all connected endpoints, regardless of type, location, and other factors that typically hinder the meeting.

This, coupled with StratoGen’s industry-leading quality and experience in the cloud hosting space, gives a truly robust experience to the Pexip SaaS videoconferencing model, as demonstrated live at InfoComm14 today.

InfoComm14 is the largest annual conference and exhibition for AV buyers and sellers worldwide, held in Las Vegas, Nevada.
