Kinesis Simplifies Streaming Data Capture

Amazon Kinesis, the Big Data streaming analytics tool on the Amazon Web Services Inc. (AWS) cloud, has received several updates, including a new simplified capture mechanism.

Introduced in November 2013, Kinesis is a managed service that facilitates real-time processing of streaming data coming from sources such as Web clickstreams, e-commerce transactions, social media outlets, system logs, sensors and so on. "With Amazon Kinesis Client Library (KCL), you can build Amazon Kinesis applications and use streaming data to power real-time dashboards, generate alerts, implement dynamic pricing and advertising, and more," the AWS site says.

The tool yesterday received a series of updates, including a new abstraction layer for data ingestion: the Kinesis Producer Library (KPL), which lets developers optimize their applications for higher data throughput. Applications that put data into a Kinesis stream are called producers, and the KPL makes it simpler for such producers to achieve high write throughput. Previously, developers had to write their own complicated batching or multithreading logic, along with code to retry failed calls and to de-aggregate records in consumer applications.
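For a sense of the hand-written logic involved, here is a minimal sketch of batching with retries. It is not KPL code: `send_batch` is a hypothetical stand-in for a call to PutRecords, assumed to return one result per record with an "ErrorCode" key on failures, and `make_flaky_sender` is a fake used only for the demonstration. The 500-record cap reflects the PutRecords per-request limit.

```python
import time
from typing import Callable, Dict, List

MAX_RECORDS_PER_CALL = 500  # PutRecords accepts at most 500 records per request


def chunk(records: List[dict], size: int = MAX_RECORDS_PER_CALL) -> List[List[dict]]:
    """Split records into lists no longer than `size`."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def put_with_retry(send_batch: Callable[[List[dict]], List[Dict]],
                   records: List[dict],
                   max_attempts: int = 3,
                   backoff_seconds: float = 0.0) -> List[dict]:
    """Send records in batches, resending only the entries that failed.

    `send_batch` (hypothetical) returns one result dict per record, in order;
    a failed entry carries an "ErrorCode" key. Returns the records that still
    failed after `max_attempts` passes.
    """
    pending = list(records)
    for attempt in range(max_attempts):
        failed = []
        for batch in chunk(pending):
            results = send_batch(batch)
            failed.extend(rec for rec, res in zip(batch, results)
                          if "ErrorCode" in res)
        if not failed:
            return []
        pending = failed
        time.sleep(backoff_seconds * (2 ** attempt))  # simple exponential backoff
    return pending


def make_flaky_sender(fail_first_n: int):
    """Fake sender for the demo: rejects the first `fail_first_n` entries it sees."""
    state = {"seen": 0, "calls": 0}

    def send(batch: List[dict]) -> List[Dict]:
        state["calls"] += 1
        results = []
        for _ in batch:
            state["seen"] += 1
            if state["seen"] <= fail_first_n:
                results.append({"ErrorCode": "ProvisionedThroughputExceededException"})
            else:
                results.append({"SequenceNumber": str(state["seen"])})
        return results

    send.state = state
    return send
```

In a real producer, `send_batch` would wrap an SDK client and the backoff would be nonzero; the sketch only shows how much bookkeeping sits around a single logical "write these records" operation.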

"In order to achieve high throughput, you should combine multiple records into a single call to PutRecords," AWS exec Jeff Barr said in a blog post yesterday. "You should also consider aggregating multiple user records into a single Kinesis record, and then de-aggregating them immediately prior to consumption. Finally, you will need code to detect and retry failed calls.

"The new KPL will help you with all of the tasks that I identified above," Barr continued. "It will allow you to write to one or more Kinesis streams with automatic and configurable retry logic; collect multiple records and write them in batch fashion using PutRecords; aggregate user records to increase payload size and throughput, and submit Amazon CloudWatch metrics (including throughput and error rates) on your behalf."
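The aggregation step Barr describes can be illustrated with a short sketch. The framing below (a 4-byte length prefix before each user record) is an assumption chosen for simplicity; it is not the KPL's actual wire format, and the function names are hypothetical. The 1 MB default corresponds to the maximum Kinesis record size.

```python
import struct
from typing import Iterable, List


def aggregate(user_records: Iterable[bytes], max_bytes: int = 1024 * 1024) -> List[bytes]:
    """Pack user records into blobs of at most `max_bytes` each.

    Each user record is stored as a 4-byte big-endian length followed by
    its bytes, so the consumer can split the blob apart again.
    """
    blobs: List[bytes] = []
    current = bytearray()
    for rec in user_records:
        framed = struct.pack(">I", len(rec)) + rec
        if len(framed) > max_bytes:
            raise ValueError("user record too large to fit in one blob")
        if current and len(current) + len(framed) > max_bytes:
            blobs.append(bytes(current))  # current blob is full; start a new one
            current = bytearray()
        current += framed
    if current:
        blobs.append(bytes(current))
    return blobs


def deaggregate(blob: bytes) -> List[bytes]:
    """Recover the original user records from one aggregated blob."""
    records: List[bytes] = []
    offset = 0
    while offset < len(blob):
        (length,) = struct.unpack_from(">I", blob, offset)
        offset += 4
        records.append(blob[offset:offset + length])
        offset += length
    return records
```

A producer would send each blob as one Kinesis record, and a consumer would call `deaggregate` on each record it reads before handing the user records to application code.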

The KPL features an asynchronous, non-blocking Java API. It runs on Linux and OS X, with binary packages available for Amazon Linux AMI, Ubuntu, Red Hat Enterprise Linux (RHEL), OS X and OS X Server.

It works with a separate companion library for the consumer side, the Kinesis Client Library (KCL). "The KCL takes care of many of the more complex tasks associated with consuming and processing streaming data in a distributed fashion, including load balancing across multiple instances, responding to instance failures, checkpointing processed records and reacting to changes in sharding," Barr said. Shards are the unit of stream capacity, with one shard able to handle up to 1,000 write transactions per second and up to five read transactions per second.
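As a back-of-the-envelope illustration, the per-shard figures above can be turned into a shard estimate. This sketch assumes the figures are per-second limits and ignores the separate byte-throughput limits that also apply to a shard; the function name is hypothetical.

```python
import math

WRITES_PER_SHARD_PER_SEC = 1000  # write transactions one shard can absorb
READS_PER_SHARD_PER_SEC = 5      # read transactions one shard can serve


def shards_needed(writes_per_sec: float, reads_per_sec: float) -> int:
    """Estimate the shard count from transaction rates alone."""
    by_writes = math.ceil(writes_per_sec / WRITES_PER_SHARD_PER_SEC)
    by_reads = math.ceil(reads_per_sec / READS_PER_SHARD_PER_SEC)
    # A stream needs at least one shard; otherwise the tighter limit decides.
    return max(1, by_writes, by_reads)
```

For example, a stream taking 2,500 writes per second with four reads per second would need three shards by this estimate, with the write rate as the binding constraint.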

Other Kinesis updates include an increase in the maximum size of a record -- or blob of data, also known as a payload -- from 50 KB to 1 MB. "This gives you a lot more flexibility and opens the door to some interesting new ways to use Kinesis," Barr said. "For example, you can now send larger log files, semi-structured documents, e-mail messages, and other data types without having to split them into small chunks."
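To make the effect of the larger limit concrete, the following sketch counts how many records a payload needs under each limit. It assumes 1 KB is 1,024 bytes and ignores partition keys and any other per-record overhead; the helper names are hypothetical.

```python
import math
from typing import List

OLD_LIMIT = 50 * 1024     # previous maximum record size, 50 KB
NEW_LIMIT = 1024 * 1024   # new maximum record size, 1 MB


def records_required(payload_bytes: int, limit: int) -> int:
    """Number of records needed to carry a payload under a size limit."""
    return max(1, math.ceil(payload_bytes / limit))


def split_payload(payload: bytes, limit: int) -> List[bytes]:
    """Split a payload into chunks no larger than `limit` bytes."""
    chunks = [payload[i:i + limit] for i in range(0, len(payload), limit)]
    return chunks or [b""]  # an empty payload still occupies one record
```

Under these assumptions, a 300 KB log file that previously had to be split into six chunks now fits in a single record.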

Finally, a new pricing structure took effect that lowers the cost of putting small records into a stream.

About the Author

David Ramel is an editor and writer for Converge360.

