Startup Snowflake Launches Cloud-Based Data Warehouse Service on AWS
Former Microsoft exec Bob Muglia has launched a cloud-based data warehouse service on Amazon Web Services Inc. (AWS) that his company claims will significantly extend the limits of traditional analytics platforms.
Muglia described the Snowflake Elastic Data Warehouse from Snowflake Computing as a Big Data platform built from scratch and designed specifically to run in the public cloud. Its release is the latest effort to bring data warehousing to the masses.
Muglia joined Snowflake last year. The company, founded in 2012 by the lead architect for the Oracle RAC product along with other database and storage veterans, also said this week it received a Series C investment of $45 million from Altimeter Capital. That brings the total amount invested in the company to $71 million.
Snowflake says what separates its offering from other high-performance data warehouse technology is that it was built from scratch to run in the public cloud. Company officials argue that this design makes the service much more scalable, and less expensive because it uses low-cost cloud storage. Muglia said Snowflake has come across numerous customers who capture machine-generated, semi-structured data using Hadoop-based clusters and have struggled to transform that data into a form a traditional data warehouse can handle, so that a business analyst can connect to it with common BI tools such as Excel or Tableau.
"We just eliminate those steps and load the semi-structured data directly into Snowflake and immediately business analysts can run queries against it directly using the tools they understand, and this is a huge savings of complexity and time," Muglia said. With traditional data warehouses, "typically there's a loss that happens from a data perspective as you go through that transformation. Some data is not available because it is not transformed and also the data warehouses tend to only be able to handle a subset of the data. One of our customers was only loading one week's worth of data into their data warehouse because that's all they could support in that, whereas with Snowflake they could keep all the historical data around and when analysts wanted to run a query on data that's say three months old, that was no problem."
Muglia said it's not a problem because of the infinite amount of storage available in the back-end repository of Snowflake -- Amazon Web Services S3 storage. Many of the early customers using Snowflake are using hundreds of terabytes of S3 data. Muglia said the service can easily scale to multiple petabytes of data stored in S3.
"There's no limit to the amount of data you can store," Muglia said. "Obviously if you issued a query that for any reason needed to scan a petabyte of data, that would be very costly and it would be long query to run. Physics still apply but one of the key things is that the way we store that data and the information we gather about the data allows our query processor to do something we call pruning. If you stored five years' worth of data and let's say it was 5 petabytes in size, and you issued a query against one week's worth of that data, regardless of what week it was, we could just select exactly the data we needed. And let's just say we only need half a terabyte of something, we could munch through that relatively quickly and return the results pretty quickly even though you may have 5 petabytes of data."
Asked whether Snowflake uses AWS's Glacier archive service when a customer has petabytes of data, Muglia said that's not necessary given the current economics of storing data in S3. "It's probably the most economical place to store it actually," Muglia said. "Compared to enterprise-class storage it's crazy cheaper, and compared to putting it on nodes, which is what you'd need to do with Hadoop, it's also quite a bit less expensive. Even what people tend to think of as the low-cost alternative they talk about building things like data lakes using Hadoop. In those cases, the data is stored on the active nodes and while that's a whole lot cheaper than EMC storage, it's much more expensive than S3 would be and much more expensive therefore than Snowflake would be."
S3 storage from AWS costs about $30 per terabyte per month at today's rates, and Snowflake charges "a slight premium" on top of that for the service. Snowflake is a software-as-a-service (SaaS) offering, so the underlying cloud storage infrastructure is relevant only to the extent customers have other data there.
Although Muglia left Microsoft four years ago after 23 years there, it was ironic to hear him hawking AWS. As president of Microsoft's Server and Tools division, which was a $17 billion business when he left in 2011, Muglia was one of the first to extol the virtues of Azure to IT pros before and right after its launch -- he frequently gave the keynote addresses at the company's annual TechEd conferences. Snowflake hasn't ruled out offering its service on Azure in the future. The company has not conducted a detailed analysis of Microsoft's Azure Blob Storage. "It's feasible to do it in the short run. The question is customer demand," Muglia said. "I think there will probably be a day but it's not tomorrow that we'll do it."
Even though Snowflake runs on S3, it will compete most pointedly with Amazon's own Redshift data warehousing service and ultimately with Microsoft's Azure SQL Data Warehouse, announced in late April at the company's Build conference and set for release this summer. Snowflake Vice President of Product Marketing Jon Bock said that, for now, Amazon's Redshift is the most affordable cloud-based data warehouse service but argued that its underlying engine is based on code from existing database technology.
"We have the benefit of starting with a completely new architecture and writing the full code base ourselves," Bock said. "One simple example of that is semi-structured data support, that machine data that's increasingly common that people want to analyze. Traditional databases weren't designed for that at all. So what you ended up having to do is put another system in front of that, which is where Hadoop came in. Then you preprocess that data, take the result and load it onto a relational database. We spent time making sure you didn't have to do that with Snowflake. You can take that machine data, load it directly into Snowflake and be able to query it immediately without any of that preprocessing or delay in the pipeline."
If the service lives up to its claims, it should boast a nice valuation, or become an attractive takeover target.
About the Author
Jeffrey Schwartz is editor of Redmond magazine and also covers cloud computing for Virtualization Review's Cloud Report. In addition, he writes the Channeling the Cloud column for Redmond Channel Partner. Follow him on Twitter @JeffreySchwartz.