The Power of AWS Object Store -- AWSInsider

On using Amazon S3 to wrangle complex and constantly changing datasets.

By Aaron Black
05/19/2015

In genomics, the problem for our researchers and clinicians is not data storage (thanks to Amazon Web Services). It's data analysis.

There is no shortage of data today, and there won't be in the future. Those who persevere through this data deluge will be those who can manage it effectively and extract key insights and knowledge from it. Amazon S3, AWS' object store service, and the services built around it are key components in how we manage and quantify our data, which translates to more efficient processes and better insights.

From my perspective as a steward of extremely large and diverse datasets, any efficiency that can be gained from data quantification, categorization and movement can pay huge dividends. In past roles, my first step was to account for what we had and, more importantly, what we didn't have. This initial step may seem trivial, but with massive amounts of data and files, it can be a daunting and time-consuming endeavor. Even smaller datasets can be difficult if, as for many of us, the data is in constant motion. I can't tell you how many times in my career data movement has produced overtime heartache.

One of the first things I did when I got to my current role at Inova was go on a data-quantification and accounting mission. For context, our institute had been in business for almost three years before I was hired. We enroll participants in our research studies and generate large genomic and other biological datasets. On AWS, the best way for us to quantify and manage our data was to use the AWS API to create an inventory of what we had stored there.

Our AWS skillsets were raw in the beginning, but we were able to start quantifying our data early on after building a set of small Python-based scripts. Initially, the results of the scripts gave us useful information about our data objects, such as the fully qualified file location, file size (in bytes) and when the data file was last modified. We were able to quickly see the motion of our data, how much was coming in from our outside vendors, and how much was derived or secondary analysis data. We also could see the variability of our data, both in file sizes and file counts. Our data was "alive," moving and constantly growing.
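As a rough illustration of what such an inventory script might look like today, here is a minimal sketch that assumes the boto3 SDK. The article does not say which library the original scripts used, and the function name `iter_inventory` and the row field names are illustrative only.

```python
from typing import Any, Dict, Iterator


def iter_inventory(s3_client: Any, bucket: str, prefix: str = "") -> Iterator[Dict[str, Any]]:
    """Yield one row per object: full location, size in bytes, last-modified time.

    `s3_client` is assumed to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    # list_objects_v2 returns at most 1,000 keys per call, so page through them.
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches the prefix.
        for obj in page.get("Contents", []):
            yield {
                "location": f"s3://{bucket}/{obj['Key']}",
                "size_bytes": obj["Size"],
                "last_modified": obj["LastModified"].isoformat(),
            }
```

With credentials configured, something like `list(iter_inventory(boto3.client("s3"), "example-bucket"))` would collect the rows, where `example-bucket` is a placeholder name. Passing the client in as an argument keeps the function easy to exercise against a stub.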

At first, this was a manual process, and we only ran the scripts at specific intervals. At times, they would take hours to run, and certain times of day seemed to provide better performance. However, as our data grew and more people were working with it, we decided to automate. We settled on running the scripts at 2 a.m.

Step 1 was accounting; Step 2 was analysis. We were missing the context of our files. Therefore, once our processes and data needs normalized, we began to integrate that data into our internal data warehouse. We needed to know more about the objects we were storing, and that data did not exist within the AWS object store.

We started by adding internal metadata to our AWS data objects. For example, we wanted to answer the question, "From which study participants were biological data files derived?" To do this, we added columns to the data warehouse table that stored these objects and wrote custom Extract, Transform, Load (ETL) jobs to match our objects to participants.
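The matching step might be sketched as follows, under the purely hypothetical assumption that each object key has the layout `<vendor>/<participant_id>/<file name>` and that participant IDs look like `P` followed by digits. The article does not describe the real key layout or ETL tooling, so the pattern and the function name are illustrative.

```python
import re
from typing import Any, Dict, Iterable, List, Optional

# Hypothetical key layout: <vendor>/<participant_id>/<file name>
KEY_PATTERN = re.compile(r"^(?P<vendor>[^/]+)/(?P<participant_id>P\d+)/[^/]+$")


def enrich_with_participant(rows: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of inventory rows with vendor and participant_id columns added.

    Rows whose key does not match the assumed layout get None in both columns,
    so they can be reviewed rather than silently dropped.
    """
    enriched = []
    for row in rows:
        match = KEY_PATTERN.match(row["key"])
        vendor: Optional[str] = match.group("vendor") if match else None
        participant: Optional[str] = match.group("participant_id") if match else None
        # Build a new dict so the caller's rows are left untouched.
        enriched.append({**row, "vendor": vendor, "participant_id": participant})
    return enriched
```

Keeping unmatched rows with empty columns makes the gaps visible, which fits the accounting goal of knowing what is missing as well as what is present.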

Genomic data at this scale and size was new to our team and the health system, but we could now start answering questions about our data files in relationship to our participants, or the entire family of participants. We could now measure the different sizes of data objects based on the vendors who were providing us the data.
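A per-vendor size comparison of this kind could be computed with a small aggregation like the one below. It is a sketch that assumes each inventory row already carries a `vendor` label and a `size_bytes` value; both field names and the function name are illustrative rather than taken from the author's warehouse schema.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable


def summarize_by_vendor(rows: Iterable[Dict[str, Any]]) -> Dict[str, Dict[str, float]]:
    """Aggregate file count, total bytes, and mean bytes per file for each vendor."""
    totals: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {"files": 0, "total_bytes": 0}
    )
    for row in rows:
        stats = totals[row["vendor"]]
        stats["files"] += 1
        stats["total_bytes"] += row["size_bytes"]
    # Add the mean once the totals are complete.
    return {
        vendor: {**stats, "mean_bytes": stats["total_bytes"] / stats["files"]}
        for vendor, stats in totals.items()
    }
```

The same grouping could be keyed on a participant or family identifier instead of the vendor to answer the per-family questions.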

Step 3 was to use the information to drive business decisions. Information on our data in the context of our participants led us to better estimates of data storage costs and of the best way to move and share this data. We were able to estimate the data storage costs for the study a year in advance, and could now take proactive steps to suppress data redundancy across our cloud and on-premises resources.
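The arithmetic behind a year-ahead estimate can be sketched very simply. The version below assumes linear growth and a flat price per GB-month, which is a simplification: actual pricing is tiered and varies by storage class and region. The function name and every number passed to it are placeholders, not Inova's figures or AWS's prices.

```python
from typing import List


def project_storage_cost(
    current_gb: float,
    monthly_growth_gb: float,
    price_per_gb_month: float,
    months: int = 12,
) -> List[float]:
    """Return the projected storage bill for each of the next `months` months.

    Assumes linear growth and a flat per-GB price (both are simplifications).
    """
    costs = []
    stored_gb = current_gb
    for _ in range(months):
        # Each month adds the same amount of new data to what is already stored.
        stored_gb += monthly_growth_gb
        costs.append(stored_gb * price_per_gb_month)
    return costs
```

Summing the returned list gives the projected spend for the whole period, and the growth figure itself can be read off successive inventory runs.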

That knowledge alone has saved and will save us tens of thousands of dollars during the life of these studies, some of which could go on for more than 10 years! As our knowledge and curiosity grew, we started to use the AWS API to query storage class (S3 or Glacier), ETag (in most cases a hash of the object's contents) and other object variables to make our data more manageable, both in cost and data movement.
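One way the ETag could help with redundancy is to group objects that share both an ETag and a size. The sketch below assumes inventory rows with `key`, `etag` and `size_bytes` fields (illustrative names) taken from an S3 listing. Matches are only candidates, because objects uploaded in multiple parts, and some encrypted objects, have ETags that are not a plain MD5 of the content.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Tuple


def find_candidate_duplicates(
    rows: Iterable[Dict[str, Any]],
) -> Dict[Tuple[str, int], List[str]]:
    """Group object keys that share both ETag and size.

    Only groups with more than one key are returned.
    """
    groups: Dict[Tuple[str, int], List[str]] = defaultdict(list)
    for row in rows:
        # S3 returns the ETag wrapped in double quotes; strip them for comparison.
        etag = row["etag"].strip('"')
        groups[(etag, row["size_bytes"])].append(row["key"])
    return {sig: keys for sig, keys in groups.items() if len(keys) > 1}
```

Any group this returns would still deserve a byte-level or checksum comparison before a copy is deleted.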

We have now created intricate quality-control mechanisms for our data pipelines. With the knowledge we gained from our AWS object store data, we have streamlined these processes, and continue to manage and control them over time. We can now measure and visualize our data processing, looking for trends as well as anomalies, and make better decisions.
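The article does not describe how anomalies are detected, so the following is only one simple possibility: flag days whose ingested volume sits more than a chosen number of standard deviations from the mean. The function name, the input shape (a mapping of day label to bytes ingested) and the default threshold are all assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import Dict, List


def flag_anomalous_days(daily_bytes: Dict[str, int], threshold: float = 2.0) -> List[str]:
    """Return the days whose volume is more than `threshold` standard
    deviations away from the mean of all days supplied."""
    values = list(daily_bytes.values())
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Every day is identical, so nothing stands out.
        return []
    return [day for day, v in daily_bytes.items() if abs(v - mu) / sigma > threshold]
```

A check like this is crude, since a single large outlier inflates the standard deviation, but it shows how inventory history can feed a quality-control rule.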

I cannot overstate the ease with which we are able to find this data, ingest it and enrich it. It has made our data management processes more effective, which has translated to lower-cost and faster data transfers, making our leadership and end users happy. I highly recommend evaluating the power of the AWS object store as a data management tool.

About the Author

Aaron Black is the director of informatics for the Inova Translational Medicine Institute (ITMI), where he and his team are creating a hybrid IT architecture of cloud and on-premises technologies to support the ever-changing data types being collected in ITMI studies. Aaron is a certified Project Management Professional (PMP) and a Certified Scrum Master (CSM), and has dozens of technical certifications from Microsoft and accounting software vendors. He can be reached at @TheDataGuru or via LinkedIn.
