Category Archives: Vmware Hosting

Deploying Hadoop in the Virtualized Cloud

Apache Hadoop is a distributed file system for storing large amounts of data across multiple commodity servers. It is said to store both unstructured and structured data, which is true, but you can use Apache Pig and Apache Hive to write a schema around this data to give it structure. That makes it something you can query. Otherwise it would not be of much use, yes?

Hadoop data is stored in a Hadoop cluster: a single namenode plus multiple datanodes that together make up the Hadoop Distributed File System (HDFS).  The namenode keeps track of which data is located on which machine.  The datanodes are responsible for storing the file blocks themselves.  Datanodes also run the batch jobs that retrieve data from the Hadoop cluster when the user executes a query.
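As a rough illustration of that split (a toy model, not real HDFS: no replication, tiny blocks, made-up names), the namenode holds only the metadata about which blocks make up a file and where they live, while the datanodes hold the blocks themselves:

```python
# Toy model of the namenode/datanode division of labour.
BLOCK_SIZE = 4  # bytes, tiny for illustration (the HDFS default is 128 MB)

class MiniHDFS:
    def __init__(self, datanodes):
        self.datanodes = {name: {} for name in datanodes}  # blocks stored per node
        self.namenode = {}  # filename -> [(block_id, node), ...]  (metadata only)

    def write(self, filename, data):
        placement = []
        nodes = list(self.datanodes)
        for i in range(0, len(data), BLOCK_SIZE):
            block_id = f"{filename}#{i // BLOCK_SIZE}"
            node = nodes[(i // BLOCK_SIZE) % len(nodes)]  # round-robin placement
            self.datanodes[node][block_id] = data[i:i + BLOCK_SIZE]
            placement.append((block_id, node))
        self.namenode[filename] = placement

    def read(self, filename):
        # Ask the namenode where the blocks live, then fetch them from datanodes.
        return "".join(self.datanodes[node][bid]
                       for bid, node in self.namenode[filename])

fs = MiniHDFS(["dn1", "dn2", "dn3"])
fs.write("log.txt", "abcdefghij")
print(fs.read("log.txt"))  # abcdefghij
```

The point of the split is that the namenode never touches file contents; clients talk to it only to learn block locations, then stream data directly to and from the datanodes.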

Hadoop queries and gathers data using batch jobs: MapReduce, Pig, Hive, and other tools.  These Hadoop tasks run in parallel, which is what gives a distributed storage scheme its performance boost over one big server, like some kind of UNIX mainframe.

MapReduce jobs crawl across the Hadoop Distributed File System (HDFS), transforming and filtering the data (the Map step) and then aggregating it into a result set (the Reduce step).  Pig and Hive generate the same kind of MapReduce logic for the developer: Hive from HiveQL, a dialect of SQL that practically every developer already knows, and Pig from its own scripting language, Pig Latin.  To use these against unstructured data, the developer writes a schema that describes the different types of data in Hadoop (logs, database extracts, Excel files, and so on).  These schemas use regular expressions to split strings of text into their corresponding fields, which can then be queried using SQL.
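The Map/Reduce pattern itself is simple enough to sketch in a few lines of Python. This is only an illustration of the pattern (a local word count over a list of strings), not a real Hadoop job:

```python
import itertools

def map_phase(lines):
    """Map: transform the input into (key, value) pairs -- here, (word, 1)."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: sort the pairs (the 'shuffle') and aggregate values per key."""
    for key, group in itertools.groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (key, sum(v for _, v in group))

# In a real cluster, each datanode runs map_phase over its local blocks in
# parallel and the framework shuffles intermediate pairs to the reducers.
logs = ["error disk full", "warn retry", "error disk full"]
counts = dict(reduce_phase(map_phase(logs)))
print(counts)  # {'disk': 2, 'error': 2, 'full': 2, 'retry': 1, 'warn': 1}
```

The parallelism comes for free: map tasks are independent, so they scale with the number of datanodes holding blocks of the input.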

Hadoop uses replication to provide fault tolerance.  But how does one use Hadoop in a virtualized cloud environment?  There the vCD (vCloud Director) user might not have access to the vSphere configuration that spells out which virtual machine is assigned to which SAN LUNs and which blade chassis slot.

Why is this an issue?  Hadoop by default makes 3 copies of each data block, and Hadoop is rack-aware.  The Hadoop data dispersal algorithm copies these data blocks onto different storage media in a manner designed to provide data redundancy, and it takes into consideration which rack each physical server is located in, to provide additional protection against rack-level failures.
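Rack awareness is something the administrator has to supply: Hadoop calls a user-provided topology script (configured via the `net.topology.script.file.name` property) with datanode addresses as arguments and expects one rack path per address on stdout. A minimal sketch, with a made-up host-to-rack inventory:

```python
#!/usr/bin/env python3
"""Sketch of a Hadoop rack-topology script.

Hadoop invokes this with one or more datanode IPs/hostnames as arguments;
the mapping below is a made-up example inventory."""
import sys

# Hypothetical mapping; in practice this comes from your datacentre inventory.
HOST_TO_RACK = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
}
DEFAULT_RACK = "/default-rack"  # Hadoop's convention for unknown hosts

def rack_for(host):
    return HOST_TO_RACK.get(host, DEFAULT_RACK)

if __name__ == "__main__":
    for host in sys.argv[1:]:
        print(rack_for(host))
```

In a vCD environment this is exactly the information the tenant may not have, which is the problem the rest of this post is about.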

With vCD riding on top of vCenter, the customer does not have direct access to the vCenter details.  So, in the worst case, multiple virtual machines could all be in the same or nearly the same rack, with their data stored on the same LUN (a logical partition of one physical drive).  StratoGen knows about this and configures vCenter to provide the required redundancy.  But part of the responsibility for doing that falls on VMware, whose platform the StratoGen cloud uses.

VMware is aware of this issue and has been working since 2012 to address it and provide a tool for deploying Hadoop in VMware. First, they launched the open-source Project Serengeti, a tool that makes deploying Hadoop clusters across multiple virtual machines easier. Second, VMware has dedicated programmers and architects to the Apache Hadoop community to contribute changes to Hadoop that “enhance the support for failure and locality topologies by making Hadoop virtualization-aware.”

VMware summarizes what they are doing and have done with the Apache Hadoop project as follows (I fixed their grammar mistakes; they are great engineers, but they need a copy editor):

The current Hadoop network topology (described in previous issues like HADOOP-692) works well in classic three-tier networks… However, it does not take into account other failure models or changes in the infrastructure that can affect network bandwidth efficiency, like virtualization.

A virtualized platform has the following characteristics that shouldn’t be ignored by the Hadoop topology when scheduling tasks, placing replicas, doing balancing, or fetching blocks for reading:

1. VMs on the same physical host are affected by the same hardware failure. In order to match the reliability of a physical deployment, replication of data across two virtual machines on the same host should be avoided.

2. The network between VMs on the same physical host has higher throughput and lower latency and does not consume any physical switch bandwidth.

Thus, we propose to make Hadoop network topology extendable and introduce a new level in the hierarchical topology, a node group level, which maps well onto an infrastructure that is based on a virtualized environment.

As you can see, the goal is to make Hadoop virtualization-aware, boosting both reliability and performance by adding a node group level to the topology.
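On the Hadoop side, enabling the node-group topology is a configuration change. The property names below come from the Hadoop Virtualization Extensions work; this is a sketch only, so verify the exact names against your Hadoop distribution’s documentation:

```xml
<!-- Sketch: node-group-aware topology and block placement,
     per the Hadoop Virtualization Extensions (HVE) work. -->
<property>
  <name>net.topology.impl</name>
  <value>org.apache.hadoop.net.NetworkTopologyWithNodeGroup</value>
</property>
<property>
  <name>net.topology.nodegroup.aware</name>
  <value>true</value>
</property>
<property>
  <name>dfs.block.replicator.classname</name>
  <value>org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyWithNodeGroup</value>
</property>
```

With these set, the block placement policy avoids putting two replicas on VMs in the same node group, i.e., on the same physical host.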

VMware Hadoop Project Serengeti

Serengeti is a tool that lets Hadoop administrators deploy and set up a Hadoop cluster more easily than using the Hadoop tools natively.  Some of what Serengeti does:

  • Tune Hadoop configuration
  • Define storage (i.e., local or shared)
  • Provide extensions to give Hive access to SQL databases
  • Enable VMware vMotion for moving cluster virtual machines
  • Provide additional control over HDFS clusters

VMware Hadoop Project Spring

Another VMware project is Spring.  Spring is an open-source umbrella of projects.  For example, the Spring Framework lets developers model relationships between Java classes using XML, so that objects can be wired together in configuration files instead of explicitly in Java code. It also handles things like transactions.

The Spring Hadoop project lets programmers write Java code to do Hadoop tasks instead of using the Hadoop command line. It also extends the Spring Batch framework to manage the workflow of Hadoop batch jobs like MapReduce, Pig, and Hive.  Spring provides data access objects (think of JDBC or ODBC) for HBase data.  HBase is a way to turn Hadoop into something similar to a relational database by providing random read/write access to the data there. Remember that Hadoop is not one file, like a database, but a collection of files, each of which could be of a different type. So HBase is an abstraction layer over that, as is Hadoop itself.


Pexip showcase live virtual videoconferencing at InfoComm14, via StratoGen Hybrid Cloud

Live from the show floor….

StratoGen has partnered with Pexip to demonstrate live hybrid cloud videoconferencing that bridges the gap between on-premises and cloud-hosted environments. During this interactive demo, users are able to see instantly how easy it is to set up a private virtual meeting for users, regardless of location and capability.


Showcasing the Pexip Infinity product with the StratoGen hybrid cloud, users are able to see the solution working seamlessly. Each user has been given an instant personal virtual meeting room and can opt to experience true interoperability by dialling in with any client or device they choose.

The Pexip solution is particularly appealing because it’s revolutionising the unified communications industry.  Through the virtual backplane, Pexip’s virtualized collaboration platform provides a consistent experience to all connected endpoints, regardless of type, location, and other factors that typically hinder the meeting.

This, coupled with StratoGen’s industry-leading quality and experience in the cloud hosting space, gives a truly robust experience to the Pexip SaaS videoconferencing model, as demonstrated live at InfoComm14 today.

InfoComm14 is the largest annual conference and exhibition for AV buyers and sellers worldwide, held in Las Vegas, Nevada.


Zerto Disaster Recovery – (At A Glance) by Walker Rowe, Guest Blogger, Computer Technology Writer

Zerto offers hypervisor-based replication, replacing the storage-based replication approach to disaster recovery.

The Zerto system is just one piece of software, the Virtual Replication Appliance, plus an administrator console. The name “appliance” is not a good choice, since it is not an appliance at all.  Instead, it is software installed on the physical machines that host the virtual machines. The installation is automatic, meaning it is pushed out as new machines are provisioned.

The administrator configures Virtual Protection Groups (VPGs) to provide application-level replication. A VPG is the set of virtual machines that comprise an application: for example, the web servers, application servers, and database servers. So it is a way to group virtual machines based on their function and treat them as a cohesive set.

For each Virtual Protection Group, the administrator configures the boot order of the virtual machines and which disaster recovery environment to replicate to.  The administrator also configures a priority for each group, which throttles the replication according to network congestion.
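Conceptually, a VPG is a small data structure: an ordered set of VMs recovered together, a target site, and a priority. The sketch below is hypothetical (the class, field names, and site names are made up for illustration; this is not Zerto’s API):

```python
from dataclasses import dataclass, field

@dataclass
class VPG:
    """Toy model of a Virtual Protection Group (names are illustrative)."""
    name: str
    target_site: str
    priority: str                                   # "high" | "medium" | "low"
    boot_order: list = field(default_factory=list)  # first VM boots first

    def failover_sequence(self):
        """Recover the application as a set, booting VMs in the defined order."""
        return [f"boot {vm} at {self.target_site}" for vm in self.boot_order]

# A typical three-tier app: bring up the database before the app and web tiers.
crm = VPG("crm-app", "dr-site-london", "high",
          boot_order=["db01", "app01", "web01"])
print(crm.failover_sequence()[0])  # boot db01 at dr-site-london
```

Treating the whole set as one unit is the point: failing over the database without its application servers, or in the wrong order, does not recover the application.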

The replication works by writing data blocks from the source datacentre to the target datacentre.  The source storage devices could be of one type and the target storage devices of a different type.  That does not matter, since the replication is done at the hypervisor layer and not the physical storage layer. It does not replicate the data; it replicates the write operation.

 Data is collected on the target location Virtual Replication Appliance in a journal.  This allows point-in-time recovery.  For example, it would let you roll back to the point in time before, say, someone dropped a table in production.
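The journal idea is easy to sketch: keep every replicated write with its timestamp, and rebuild the target’s state by replaying writes up to the chosen instant. A toy model (the names and structure are made up for illustration, not Zerto’s internals):

```python
from bisect import bisect_right

class ReplicationJournal:
    """Toy model of a recovery-side journal: each replicated write is kept
    with its timestamp so the target can be rolled to any point in time."""

    def __init__(self):
        self.entries = []  # (timestamp, block_id, data), appended in order

    def record(self, ts, block_id, data):
        self.entries.append((ts, block_id, data))

    def state_at(self, ts):
        """Replay writes up to (and including) ts to rebuild the disk image."""
        disk = {}
        cutoff = bisect_right([e[0] for e in self.entries], ts)
        for _, block_id, data in self.entries[:cutoff]:
            disk[block_id] = data
        return disk

journal = ReplicationJournal()
journal.record(100, "users_table", "intact")
journal.record(200, "users_table", "dropped")  # the accidental DROP TABLE
print(journal.state_at(150))  # {'users_table': 'intact'} -- before the mistake
```

Rolling back is then just choosing a timestamp before the bad write, rather than restoring from a nightly backup.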

Replication is over the wide area network.  The administrator can configure WAN data compression and throttling in Zerto, or leave them off if a data compression and traffic throttling solution is already in place.

Zerto is administered through a web page, or you can use it as a plug-in to VMware vCenter.


Zerto replaces the physical approach to replicating data with the abstracted logical approach that is appropriate to a virtual environment.  Because it is vendor and device-independent, you could use it to replace vendor-specific storage replication solutions like EMC SRDF (disk-array based), NetApp SnapMirror (disk-array based), Veritas Volume Replicator (OS-based), or EMC Replicator (appliance-based).

One advantage of doing so is that each of these approaches (disk-array, OS, and appliance) is based on replicating storage. Recovering all of that in the event of a disaster and getting the application working again is more difficult than if the virtual machines were replicated independently of the storage.  Using the hypervisor approach makes it more likely that you will meet the RPO (recovery point objective) stipulated in your disaster recovery plan.

 Zerto has signed up lots of large customers in the marketplace.  This includes many customers running enterprise-wide Siebel, Oracle and SAP systems.  They even list the storage vendor Fujitsu as a customer, leading one to wonder why Fujitsu would be adopting a product that would compete with their own storage replication technology.

StratoGen Disaster Recovery

At StratoGen we can administer the Zerto disaster recovery system for you with our Disaster Recovery offering.  With that, we provide an RTO of less than 15 minutes by replicating your applications in real-time across the StratoGen cloud, spreading that across distinct geographical locations for added security. Moving DR to the cloud lets you roll out a disaster recovery strategy with zero up-front capital expenditure.

For more information feel free to get in touch with our sales staff and explore the other blog posts and pages on our website.