Data Delivery
The Power of AWS Object Store
On using Amazon S3 to wrangle complex and constantly changing datasets.
In genomics, the problem for our researchers and clinicians is not data storage (thanks to Amazon Web Services). It's data analysis.
There is no shortage of data today, and there won't be in the future. Those who persevere through this data deluge will be those who can manage it effectively and extract key insights and knowledge from it. Amazon S3, AWS' object store service, and the services built around it are key components in how we manage and quantify our data, which translates into more efficient processes and better insights.
From my perspective as a steward of extremely large and diverse datasets, any efficiency gained from quantifying, categorizing and moving data can pay huge dividends. In past roles, my first step was to account for what we had and, more importantly, what we didn't have. This initial step may seem trivial, but with massive amounts of data and files, it can be a daunting and time-consuming endeavor. Even smaller datasets can be difficult if, as for many of us, the data is in constant motion. I can't tell you how many times in my career data movement has produced heartache over time.
One of the first things I did when I arrived in my current role at Inova was go on a data-quantification and accounting mission. For context, our institute had been in business for almost three years before I was hired. We enroll participants in our research studies and generate large genomic and other biological datasets. For the data stored on AWS, the best way to quantify and manage it was to use the AWS API to create an inventory of every object we held.
Our AWS skillsets were raw in the beginning, but we were able to start quantifying our data early on after building a set of small Python-based scripts. Initially, the scripts gave us useful information about our data objects, such as the fully qualified object location, file size (in bytes) and when the file was last modified. We could quickly see the motion of our data: how much was coming in from our outside vendors, and how much was derived or secondary-analysis data. We could also see the variability of our data, both in file sizes and file counts. Our data was "alive," moving and constantly growing.
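For readers who want a concrete starting point, here is a minimal sketch of that kind of inventory script using boto3. The bucket name and output file are hypothetical; the fields are the same ones we pulled: key, size in bytes and last-modified time.

# Minimal S3 inventory sketch using boto3 (bucket and output file names are hypothetical).
import csv
import boto3

s3 = boto3.client("s3")

def inventory_bucket(bucket, prefix=""):
    """Yield key, size (bytes) and last-modified timestamp for every object."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"], obj["Size"], obj["LastModified"]

with open("s3_inventory.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["key", "size_bytes", "last_modified"])
    for key, size, modified in inventory_bucket("example-genomics-bucket"):
        writer.writerow([key, size, modified.isoformat()])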
At first, this was a manual process, and we only ran the scripts at specific intervals. At times they would take hours to run, and certain times of day seemed to perform better than others. As our data grew and more people were working with it, we decided to automate, and settled on running the scripts at 2 a.m.
Step 1 was accounting; Step 2 was analysis. We were missing the context of our files. Therefore, once our processes and data needs normalized, we began to integrate that data into our internal data warehouse. We needed to know more about the objects we were storing, and that data did not exist within the AWS object store.
We started by adding internal metadata to our AWS data objects. For example, we wanted to answer the question, "From which study participants were these biological data files derived?" To do this, we added columns to the data warehouse table that tracked these objects and wrote custom Extract, Transform, Load (ETL) jobs to match our objects to participants.
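As a rough illustration of that matching step (not our actual ETL code), the logic looks something like the following. The key-naming convention, table name and columns are invented for the example, and an in-memory SQLite database stands in for the warehouse.

# Hypothetical ETL step: derive a participant ID from each object key and attach it
# to the warehouse row for that object. All names and the schema are illustrative.
import re
import sqlite3  # in-memory stand-in for the actual warehouse connection

PARTICIPANT_PATTERN = re.compile(r"participant[-_](?P<pid>\d+)", re.IGNORECASE)

def participant_id_from_key(key):
    """Return the participant ID embedded in an object key, or None if absent."""
    match = PARTICIPANT_PATTERN.search(key)
    return match.group("pid") if match else None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE s3_objects (object_key TEXT, participant_id TEXT)")
conn.execute("INSERT INTO s3_objects VALUES ('genomes/participant_1042/sample.bam', NULL)")

for (key,) in conn.execute(
    "SELECT object_key FROM s3_objects WHERE participant_id IS NULL"
).fetchall():
    pid = participant_id_from_key(key)
    if pid is not None:
        conn.execute(
            "UPDATE s3_objects SET participant_id = ? WHERE object_key = ?",
            (pid, key),
        )
conn.commit()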
Genomic data at this scale was new to our team and the health system, but we could now start answering questions about our data files in relation to individual participants, or to an entire family of participants. We could also measure differences in object sizes across the vendors providing us the data.
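For example, a vendor-level size question becomes a simple aggregate once the metadata sits in the warehouse. The schema and sample rows below are illustrative only, again using an in-memory SQLite database as a stand-in.

# Illustrative aggregate: total stored bytes and file counts per vendor.
import sqlite3  # stand-in for the warehouse; schema and values are invented

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE s3_objects (object_key TEXT, vendor TEXT, size_bytes INTEGER)")
conn.executemany(
    "INSERT INTO s3_objects VALUES (?, ?, ?)",
    [
        ("genomes/a.bam", "vendor_a", 120_000_000_000),
        ("genomes/b.bam", "vendor_b", 95_000_000_000),
        ("genomes/c.bam", "vendor_a", 130_000_000_000),
    ],
)

for vendor, total_bytes, n_files in conn.execute(
    "SELECT vendor, SUM(size_bytes), COUNT(*) FROM s3_objects "
    "GROUP BY vendor ORDER BY SUM(size_bytes) DESC"
):
    print(f"{vendor}: {n_files} files, {total_bytes / 1e12:.2f} TB")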
Step 3 was using the information to drive business decisions. Information about our data in the context of our participants led to better estimates of data storage costs and of the best way to move and share the data. We were able to estimate the data storage costs for the study a year in advance, and could now take proactive steps to reduce data redundancy across our cloud and on-premises resources.
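To give a feel for the arithmetic (not our actual figures), a year-ahead estimate can be as simple as the sketch below. The footprint, growth rate and per-GB prices are placeholders, not published AWS pricing.

# Back-of-the-envelope storage forecast; every number here is a placeholder.
current_tb = 250                 # current footprint in terabytes (hypothetical)
monthly_growth_tb = 12           # expected new data per month (hypothetical)
standard_price_per_gb = 0.023    # assumed S3 Standard price, USD per GB-month
glacier_price_per_gb = 0.004     # assumed Glacier price, USD per GB-month

projected_tb = current_tb + 12 * monthly_growth_tb
gb = projected_tb * 1024
print(f"Projected footprint after 12 months: {projected_tb} TB")
print(f"Annualized cost if kept in Standard: ${gb * standard_price_per_gb * 12:,.0f}")
print(f"Annualized cost if tiered to Glacier: ${gb * glacier_price_per_gb * 12:,.0f}")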
That knowledge alone has saved, and will continue to save, us tens of thousands of dollars over the life of these studies, some of which could go on for more than 10 years! As our knowledge and curiosity grew, we started using the AWS API to query storage class (S3 Standard or Glacier), ETag (a hash of the object) and other object attributes to make our data more manageable, both in cost and in movement.
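Pulling those extra attributes is a small change to the same listing call. In this sketch the bucket name is again hypothetical.

# Sketch: pull storage class and ETag alongside the basics (bucket name is hypothetical).
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="example-genomics-bucket"):
    for obj in page.get("Contents", []):
        # Note: for multipart uploads the ETag is not a simple MD5 of the object.
        print(obj["Key"], obj["Size"], obj["StorageClass"], obj["ETag"])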
We have now created intricate quality-control mechanisms for our data pipelines. With the knowledge gained from our AWS object store data, we have streamlined these processes, and we continue to manage and refine them over time. We can measure and visualize our data processing, looking for trends as well as anomalies, and make better decisions.
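One simple form such a check can take, sketched against the inventory's last-modified timestamps, is shown below. The window and threshold values are arbitrary and would need tuning.

# Illustrative QC check over the inventory: flag days whose incoming object
# count deviates sharply from the recent average. Thresholds are arbitrary.
import statistics
from collections import Counter
from datetime import datetime, timedelta

def flag_anomalous_days(last_modified, window=30, threshold=3.0):
    """Flag days whose object counts sit more than `threshold` standard
    deviations from the mean of the trailing `window` days."""
    counts = Counter(ts.date() for ts in last_modified)
    days = sorted(counts)
    flagged = []
    for i, day in enumerate(days):
        history = [counts[d] for d in days[max(0, i - window):i]]
        if len(history) < 5:
            continue  # not enough history to judge
        mean = statistics.mean(history)
        spread = statistics.pstdev(history) or 1.0
        if abs(counts[day] - mean) > threshold * spread:
            flagged.append(day)
    return flagged

# Example: steady ingest with one unusually heavy day stands out.
timestamps = [datetime(2016, 1, 1) + timedelta(days=d, hours=h)
              for d in range(40) for h in range(3)]
timestamps += [datetime(2016, 1, 30, hour=h % 24) for h in range(40)]
print(flag_anomalous_days(timestamps))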
I cannot overstate the ease with which we are able to find this data, ingest it and enrich it. It has made our data management processes more effective, which has translated to lower-cost and faster data transfers, making our leadership and end users happy. I highly recommend evaluating the power of the AWS object store as a data management tool.
About the Author
Aaron Black is the director of informatics for the Inova Translational Medicine Institute (ITMI), where he and his team are creating a hybrid IT architecture of cloud and on-premises technologies to support the ever-changing data types being collected in ITMI studies. Aaron is a certified Project Management Professional (PMP) and a Certified Scrum Master (CSM), and has dozens of technical certifications from Microsoft and accounting software vendors. He can be reached at @TheDataGuru or via LinkedIn.