Analyst Says AWS Can Help Tame Hadoop Management Complexity -- AWSInsider

Analyst Says AWS Can Help Tame Hadoop Management Complexity

By David Ramel
12/08/2015

The notorious complexity involved with deploying enterprise Hadoop implementations can be daunting, says an analyst in a new research note, but cloud services such as Amazon Web Services Inc. (AWS) are one way to address those issues.

However, while AWS offers some tools to help with running Big Data analytics in its cloud, they come with trade-offs and disadvantages, as do the other proposed approaches.

"AWS has its own homegrown, increasingly integrated set of Big Data services," George Gilbert said in his Nov. 24 "The Manageability Challenge Facing Hadoop" research note for Wikibon, an open community featuring contributions from writers, experts, practitioners, users, vendors and researchers. "There's Kenisis Firehose for dealing with high velocity streaming data, DynamoDB for operational processing, Redshift SQL MPP, a machine learning service, and the Data Pipeline to orchestrate everything."

Those services could be attractive alternatives to typical enterprise Hadoop distributions that include many moving parts -- leveraging up to 20 or more disparate components -- which leads to complex management and total cost of ownership (TCO) issues, Gilbert said. However, the native cloud approach comes with less customer choice and lacks the portability of open source technology, said Gilbert, who favors Hadoop-as-a-Service (HaaS) as a means to tame some of the attendant concerns.

Note that Gilbert doesn't see HaaS as a panacea for manageability challenges, however, just perhaps the best of several potential solutions that each come with those aforementioned trade-off pros and cons.

"Prospective customers as well as those who are still in proof-of-concept or pilot need to understand that there are no easy solutions," Gilbert said

Gilbert explored the trade-offs of three proposed alternative solutions to solving the TCO and manageability problem:

Running Hadoop-as-a-Service.
Using Spark as the computing core of Hadoop.
Building on the native services of the major cloud vendors such as AWS (Kinesis Firehose, DynamoDB, Redshift and more), Microsoft Azure, or Google Cloud Platform while integrating specialized third-party services such as Redis.

Spark, the current darling of the Big Data ecosystem -- described as the most active open source project -- might seem to be the logical choice for enterprise Big Data application development, but it doesn't ingest data, manage data or come with a database or file system. Cassandra is a popular choice to fill that latter role, Gilbert said. Spark also needs a service "to make sure a majority of the other services are live and talking to each other," Gilbert said, such as Zookeeper, which is described as "a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services." Kafka, meanwhile is becoming the de facto standard for Big Data ingestion. And, of course, Spark lacks a management console.

**[Click on image for larger view.]** Individual Projects in the Hortonworks Data Platform. *(source: Wikibon via Hortonworks)*

"So getting up and running with a Spark cluster takes no less than 12 servers: three for each of the services," Gilbert said. "Again, even though Spark is a single, unified processing engine, it requires at least four different services. And that's where the management complexity comes back into play. Each service has its own way of failing; its own way of managing access; its own attack surface; and its own admin model."

Another option, relying on native cloud services from providers such as AWS, Azure and Google Cloud Platform, can provide dramatic manageability gains, but that approach involves its own aforementioned trade-offs, such as limited choice and lack of open source portability.

"All the cloud providers will provide ever more powerful DevOps tools to simplify development and operations of applications on their platforms," Gilbert said. "But as soon as customers want the ability to plug in specialized third-party functionality, that tooling will likely break down. The overhead of opening up these future tools is far more difficult than building them knowing in advance just what services they'll be managing."

That leaves the managed HaaS service option, in which AWS also figures with its Elastic MapReduce (EMR) service. However, Gilbert said it apparently suffers from the separation of compute and storage resources favored by AWS, which increases management complexity, among other trade-offs.

Another HaaS vendor, Altiscale, simplifies Hadoop management with a purpose-built, proprietary hardware/software infrastructure. By being familiar with system internals, Altiscale can make Hadoop management less labor intensive, said Gilbert, who added the caveat: "Of course, customers have to get their data to the datacenters that host Altiscale, and they don't have the rich ecosystem of complementary tools on AWS."

So, with no "perfect" solution available for all scenarios, Gilbert offered the following concluding "action item" with which Wikibon research notes end:

Customers building their outward facing Web and mobile applications on public clouds while trying to build Hadoop applications on-premises should evaluate vendors offering it as-a-service. Hadoop already comes with significant administrative complexity by virtue of its multi-product design. On top of that, operating elastic applications couldn't be more different from the client-server systems IT has operated for decades.

"Hadoop's unprecedented pace of innovation comes precisely because it is an ecosystem, not a single product," Gilbert said. "Total cost of ownership and manageability have to change in order for 'Big Data' production applications to go mainstream. And if the Hadoop ecosystem doesn't fix the problem, there are alternatives competing for attention."

About the Author

David Ramel is an editor and writer at Converge 360.

Featured

Subscribe on YouTube

AWS Cloud Report

Email Address*Country*

Please type the letters/numbers you see above.

Upcoming Training Events

0 AM

VSLive! 3-Day Hands-On Training Seminar: Master Modern JavaScript: Unlock the Full Potential of Your Code
June 2-4, 2025

VSLive! 2-Day Hands-On Training Seminar: Asynchronous and Parallel Programming in C#
June 24-25, 2025

4-Hour Hands-on Workshop: MCP Demystified
June 30, 2025

VSLive! 4-Day Hands-On Training Seminar: Immersive .NET Full Stack Training: 4-Day Hands-On Experience
July 15-18, 2025

Securing IT in the AI Era
July 23, 2025

VSLive! 4-Hour In-Depth Workshop: Immersive .NET Full Stack Training: C# Interfaces: Effective Usage while Avoiding Pitfalls
July 29, 2025

Visual Studio Live! @ Microsoft HQ
August 4-8, 2025

TechMentor @ Microsoft HQ
August 11-15, 2025

4-Hour VSLive! Workshop: Testability in .NET
August 27, 2025

Microsoft 365 Security Masterclass
August 25-26, 2025

Visual Studio Live! San Diego
September 8-12, 2025

Live! 360 2-Day Hands-On Seminar: Swimming in the Lakes of Microsoft Fabric and AI – A Hands-on Experience
September 18-19, 2025

VSLive! 2-Day Hands-On Training Seminar: Hands-On with .NET Web Development in 2025
October 7-8, 2025

Live! 360 Orlando
November 16-21, 2025

Artificial Intelligence Live! Orlando
November 16-21, 2025

Cloud & Containers Live! Orlando
November 16-21, 2025

Cybersecurity & Ransomware Live! Orlando
November 16-21, 2025

Data Platform Live! Orlando
November 16-21, 2025

Visual Studio Live! Orlando
November 16-21, 2025

TechMentor Orlando
November 16-21, 2025

VSLive! 4-Day Hands-On Training Seminar: Immersive .NET Full Stack Training: 4-Day Hands-On Experience
December 16-19, 2025

Visual Studio Live! Las Vegas
March 16-20, 2026

Free Whitepapers

> More TechLibrary

Free Webcasts

> More Webcasts