Analyst Says AWS Can Help Tame Hadoop Management Complexity

The notorious complexity involved with deploying enterprise Hadoop implementations can be daunting, says an analyst in a new research note, but cloud services such as Amazon Web Services Inc. (AWS) are one way to address those issues.

However, while AWS offers some tools to help with running Big Data analytics in its cloud, they come with trade-offs and disadvantages, as do the other proposed approaches.

"AWS has its own homegrown, increasingly integrated set of Big Data services," George Gilbert said in his Nov. 24 "The Manageability Challenge Facing Hadoop" research note for Wikibon, an open community featuring contributions from writers, experts, practitioners, users, vendors and researchers. "There's Kinesis Firehose for dealing with high velocity streaming data, DynamoDB for operational processing, Redshift SQL MPP, a machine learning service, and the Data Pipeline to orchestrate everything."

Those services could be attractive alternatives to typical enterprise Hadoop distributions that include many moving parts -- leveraging up to 20 or more disparate components -- which leads to complex management and total cost of ownership (TCO) issues, Gilbert said. However, the native cloud approach comes with less customer choice and lacks the portability of open source technology, said Gilbert, who favors Hadoop-as-a-Service (HaaS) as a means to tame some of the attendant concerns.

Note, however, that Gilbert doesn't see HaaS as a panacea for manageability challenges, just perhaps the best of several potential solutions, each of which comes with its own trade-offs.

"Prospective customers as well as those who are still in proof-of-concept or pilot need to understand that there are no easy solutions," Gilbert said.

Gilbert explored the trade-offs of three proposed approaches to solving the TCO and manageability problems:

  • Running Hadoop-as-a-Service.
  • Using Spark as the computing core of Hadoop.
  • Building on the native services of the major cloud vendors such as AWS (Kinesis Firehose, DynamoDB, Redshift and more), Microsoft Azure, or Google Cloud Platform while integrating specialized third-party services such as Redis.

Spark, the current darling of the Big Data ecosystem -- described as the most active open source project -- might seem to be the logical choice for enterprise Big Data application development, but it doesn't ingest data, manage data or come with a database or file system. Cassandra is a popular choice to fill that latter role, Gilbert said. Spark also needs a service "to make sure a majority of the other services are live and talking to each other," Gilbert said, such as ZooKeeper, which is described as "a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services." Kafka, meanwhile, is becoming the de facto standard for Big Data ingestion. And, of course, Spark lacks a management console.

[Figure: Individual Projects in the Hortonworks Data Platform. (source: Wikibon via Hortonworks)]

"So getting up and running with a Spark cluster takes no less than 12 servers: three for each of the services," Gilbert said. "Again, even though Spark is a single, unified processing engine, it requires at least four different services. And that's where the management complexity comes back into play. Each service has its own way of failing; its own way of managing access; its own attack surface; and its own admin model."
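The arithmetic behind that 12-server figure can be sketched in a few lines. This is only an illustration, not something from the research note: it assumes the four services are the ones named above (Spark, Cassandra, ZooKeeper and Kafka), that each runs on three dedicated nodes, and that no nodes are shared. The function and variable names are hypothetical.

```python
# Illustrative sketch only: the service list and the "three nodes each,
# no sharing" layout are assumptions drawn from the article's description.
NODES_PER_SERVICE = 3
SERVICES = ["Spark", "Cassandra", "ZooKeeper", "Kafka"]


def minimum_servers(service_names, nodes_each=NODES_PER_SERVICE):
    """Return the server count when every service gets its own nodes."""
    return len(service_names) * nodes_each


total = minimum_servers(SERVICES)
print(f"{len(SERVICES)} services x {NODES_PER_SERVICE} nodes = {total} servers")
```

Under these assumptions, every additional service added to the stack raises the minimum footprint by another three machines, each with its own failure modes and admin model to manage.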

Another option, relying on native cloud services from providers such as AWS, Azure and Google Cloud Platform, can provide dramatic manageability gains, but that approach involves its own aforementioned trade-offs, such as limited choice and lack of open source portability.

"All the cloud providers will provide ever more powerful DevOps tools to simplify development and operations of applications on their platforms," Gilbert said. "But as soon as customers want the ability to plug in specialized third-party functionality, that tooling will likely break down. The overhead of opening up these future tools is far more difficult than building them knowing in advance just what services they'll be managing."

That leaves the managed HaaS option, in which AWS also figures with its Elastic MapReduce (EMR) service. However, Gilbert said EMR apparently suffers from the separation of compute and storage resources favored by AWS, which increases management complexity, among other trade-offs.

Another HaaS vendor, Altiscale, simplifies Hadoop management with a purpose-built, proprietary hardware/software infrastructure. By being familiar with system internals, Altiscale can make Hadoop management less labor intensive, said Gilbert, who added the caveat: "Of course, customers have to get their data to the datacenters that host Altiscale, and they don't have the rich ecosystem of complementary tools on AWS."

So, with no "perfect" solution available for all scenarios, Gilbert closed with the "action item" that concludes each Wikibon research note:

Customers building their outward facing Web and mobile applications on public clouds while trying to build Hadoop applications on-premises should evaluate vendors offering it as-a-service. Hadoop already comes with significant administrative complexity by virtue of its multi-product design. On top of that, operating elastic applications couldn't be more different from the client-server systems IT has operated for decades.

"Hadoop's unprecedented pace of innovation comes precisely because it is an ecosystem, not a single product," Gilbert said. "Total cost of ownership and manageability have to change in order for 'Big Data' production applications to go mainstream. And if the Hadoop ecosystem doesn't fix the problem, there are alternatives competing for attention."

About the Author

David Ramel is an editor and writer for Converge360.

