Data Delivery

The UPS of Data

How to choose the right data-delivery method for the right kind of data, and how to design the cloud infrastructure to make that happen.

I don't know of many kids who dream of working at UPS to deliver packages. Growing up in a small town in Ohio, I would have laughed at the thought that I would be in charge of a package-delivery business. In reality, that is what I am. The package my team delivers is data.

When UPS started in 1907 (as the American Messenger Company), did its founders think they were building a billion-dollar business? Did they think they would scale and compete with the U.S. Postal Service? How many of us take for granted how cheaply and efficiently we can get packages delivered to us overnight? Packages delivered by UPS are small and large, hot and cold. There are so many obscure and sensational things that have been shipped by UPS, some so custom that it's hard to put a price tag on the service.

In our data-driven world, there are many delivery mechanisms to help move our large, small, intricate, private and sensational data from our electronic warehouses to our waiting (and often anxious) consumers. In my current role, the problem is not a lack of options to store and deliver data -- it's choosing the correct mechanism. Can we think of IT departments as the UPS of data? Can we build and deliver data and information that were never dreamed of when we first started?

What we have found in health care is that there is a diverse set of needs when it comes to data delivery. For patient care, the data needs to be delivered with security, accuracy and high reliability. For research data, the need is for high speed and performance, with massive datasets spinning on high-IOPS disks. The results need to be returned to researchers as quickly as possible. For exchanging patient data, high security and reliability are what matter most.

If there were unlimited IT budgets, then all data would be delivered on the fastest hardware, networks, databases and applications. For Inova, we first concentrate on gathering, storing and transforming secure and reliable patient data. We do this within the solid, consistent systems and architecture of our health system. This secure environment is where we gather, review and de-identify our patient information before we move it to our research environments. Our first delivery priority is the security and reliability of the data.

With aggregate data sizes starting to measure in petabytes, we need to think about tiered disk storage based on cost and performance. We think about data as a series of temperatures: cold, medium and hot. For our research data, the cold data is large in size, mostly unprocessed and rarely accessed. This data can be placed on storage that is cost-efficient but durable.
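As a rough illustration of the idea, a simple rule can map a dataset's last-access time to a temperature. The sketch below is not our production logic, and the seven-day and 90-day cutoffs are hypothetical placeholders; the right thresholds depend on your own access patterns and budget.

    from datetime import datetime, timedelta, timezone
    from typing import Optional

    # Hypothetical cutoffs -- tune these to your own access patterns and budget.
    HOT_WINDOW = timedelta(days=7)
    MEDIUM_WINDOW = timedelta(days=90)

    def storage_tier(last_accessed: datetime, now: Optional[datetime] = None) -> str:
        """Map a dataset's last-access time (UTC-aware) to a storage 'temperature'."""
        now = now or datetime.now(timezone.utc)
        age = now - last_accessed
        if age <= HOT_WINDOW:
            return "hot"      # low-latency disks and databases
        if age <= MEDIUM_WINDOW:
            return "medium"   # SSD-backed storage
        return "cold"         # cheap, durable archive storage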

We use Amazon Web Services (AWS) for most of our long-term storage needs, and we continue to build Amazon S3 policies to automate the movement of our files to longer-term Amazon Glacier class storage. This significantly reduces cost versus on-premises storage options. However, there is a cost to retrieve this data from Glacier storage. For us, this data is rarely accessed, so that cost is minimal.
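To give a sense of what such a policy can look like, the boto3 snippet below attaches a lifecycle rule that transitions objects under a prefix to Glacier-class storage after 90 days. The bucket name, prefix and day count are illustrative placeholders, not our actual configuration.

    import boto3

    s3 = boto3.client("s3")

    # Bucket name, prefix and transition age are illustrative placeholders.
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-research-archive",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "archive-cold-research-data",
                    "Filter": {"Prefix": "raw/"},
                    "Status": "Enabled",
                    # Move rarely accessed objects to Glacier after 90 days.
                    "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                }
            ]
        },
    )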

Our medium data is accessed more frequently, but not every day. It is quite a bit smaller than the cold data, but can still be terabytes in size. We need this data to be accessible quickly and moved with little latency. We expect faster response times than the cold tier, but we are willing to wait seconds for the data to be put in motion. We use solid-state drives for this tier, which are costlier than cold storage. We use Amazon EC2 and on-premises disks to store and deliver this data.

Lastly, the hot data is smaller than the medium-tier data. This data has been further refined by extract, transform, load (ETL) processing or deeper analysis by our scientific teams, and we expect it to be moved and queried quickly. This is where we want the data and its manipulation to be handled in milliseconds, so the user-interaction latency is almost nonexistent. We use various places -- both on-premises and cloud infrastructure -- to store and deliver this data in a unique combination of high-speed disks and databases.

The ability to tier this data into cold, medium and hot has allowed us to spend our IT budget more effectively. Our goal as a "delivery" business is to allow our data consumers to get the most efficient use of their data and improve the way we allow them to request and interact with it.

How have you been able to deliver your data to your consumers? How can you utilize storage and delivery mechanisms to enhance their experience? Is this enabling them to do their jobs better? Leave a comment below.

About the Author

Aaron Black is the director of informatics for the Inova Translational Medicine Institute (ITMI), where he and his team are creating a hybrid IT architecture of cloud and on-premises technologies to support the ever-changing data types being collected in ITMI studies. Aaron is a certified Project Management Professional (PMP) and a Certified Scrum Master (CSM), and has dozens of technical certifications from Microsoft and accounting software vendors. He can be reached at @TheDataGuru or via LinkedIn.

