AWS Big Data Service Supports More Apache Tools

Amazon Web Services Inc. (AWS) announced that its Big Data processing service now supports more tools in the open source Apache ecosystem.

The Amazon Elastic MapReduce (Amazon EMR) Web service now supports Apache Tez, described by AWS as a "dataflow-driven data processing task orchestration," and Apache Phoenix, which provides "fast SQL for OLTP and operational analytics," according to a recent AWS blog post.

Amazon EMR provides a managed Hadoop framework for Big Data processing on Amazon EC2 compute instances and supports many other open source projects in the Hadoop ecosystem, such as the popular Apache Spark project. Tez and Phoenix now join that portfolio of supported projects.

"Tez runs on top of Apache Hadoop YARN," AWS said in its announcement. "Tez provides you with a set of dataflow definition APIs that allow you to define a DAG (Directed Acyclic Graph) of data processing tasks. Tez can be faster than Hadoop MapReduce, and can be used with both Hive and Pig."
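
To make the DAG idea concrete, here is a minimal sketch in plain Python. It is not the Tez API. It only illustrates the general pattern of declaring tasks and their upstream dependencies, then running each task once its inputs are ready. The task names, the sample data and the run_dag helper are all hypothetical.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, deps):
    """Run every task after all of its upstream tasks have finished.

    tasks: maps a task name to a callable (hypothetical example tasks).
    deps:  maps a task name to the ordered list of tasks it reads from.
    Returns the execution order and a dict of results keyed by task name.
    """
    results = {}
    order = []
    # static_order() yields each node only after all of its predecessors.
    for name in TopologicalSorter(deps).static_order():
        inputs = [results[upstream] for upstream in deps.get(name, ())]
        results[name] = tasks[name](*inputs)
        order.append(name)
    return order, results


# Hypothetical four-step dataflow: two reads feed a join, which feeds a total.
TASKS = {
    "read_orders": lambda: [("a", 2), ("b", 5), ("a", 3)],
    "read_prices": lambda: {"a": 10, "b": 1},
    "join": lambda orders, prices: [(k, qty * prices[k]) for k, qty in orders],
    "total": lambda rows: sum(amount for _, amount in rows),
}
DEPS = {
    "read_orders": [],
    "read_prices": [],
    "join": ["read_orders", "read_prices"],
    "total": ["join"],
}

if __name__ == "__main__":
    order, results = run_dag(TASKS, DEPS)
    print(order)
    print(results["total"])
```

Because the graph is acyclic, a valid execution order always exists. A real engine such as Tez would additionally distribute the tasks across a cluster, which this single-process sketch does not attempt.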

Phoenix, meanwhile, "uses HBase (another member of the Hadoop ecosystem) as its datastore," AWS said. "You can connect to Phoenix using a JDBC driver included on the cluster or from other applications that are running on or off of the cluster. Either way, you get access to fast, low-latency SQL with full ACID transaction capabilities. Your SQL queries are compiled into a series of HBase scans, the scans are run in parallel, and the results are aggregated to produce the result set."
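
The scan-then-aggregate flow AWS describes can be pictured with a small Python sketch. This is not Phoenix or HBase code. It assumes a toy in-memory table already split into key ranges (the hypothetical REGIONS list), scans each range in parallel, and merges the partial results, roughly what a query such as a filtered COUNT and SUM would need.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_region(rows, predicate):
    """Scan one region and return (count, sum) over the matching values."""
    count = 0
    total = 0
    for key, value in rows:
        if predicate(key, value):
            count += 1
            total += value
    return count, total


def parallel_aggregate(regions, predicate):
    """Run one scan per region in parallel, then merge the partial results."""
    workers = len(regions) or 1  # ThreadPoolExecutor requires at least one worker
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda rows: scan_region(rows, predicate), regions))
    count = sum(c for c, _ in partials)
    total = sum(t for _, t in partials)
    return count, total


# Hypothetical table, pre-split into three key ranges ("regions").
REGIONS = [
    [("a1", 10), ("a2", 25)],
    [("m1", 40), ("m2", 5)],
    [("z1", 30)],
]

if __name__ == "__main__":
    # Roughly: SELECT COUNT(*), SUM(value) FROM t WHERE value >= 20
    print(parallel_aggregate(REGIONS, lambda key, value: value >= 20))
```

Counts and sums merge cleanly because they are associative. Other aggregates, such as an average, would need each scan to return both pieces so the final step can combine them.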

AWS also updated several Big Data applications already in its arsenal, moving to HBase 1.2.1, Mahout 0.12.0 and Presto 0.147. A new Redshift JDBC driver is also available for working with data housed on clusters in the Redshift data warehouse service.

About the Author

David Ramel is an editor and writer for Converge360.

