Good talk by Siva Raghupathy on Big Data architectural patterns.
If you have not already read my post on the zulily engineering blog, it is a good read on what we are building at zulily. Send me a note if you want to learn more or if you are interested in joining our team.
In July 2014 we started our journey to build a new data platform that would allow us to use big data to drive business decisions. I would like to start with a quote from our 2015 Q2 earnings call that was highlighted in various press outlets, and then share how we built a data platform that allows zulily to make decisions that were nearly impossible before.
“We compared two sets of customers from 2012 that both came in through the same channel, a display ad on a major homepage, but through two different types of ads,” [Darrell] Cavens said. “The first ad featured a globally distributed well-known shoe brand, and the second was a set of uniquely styled but unbranded women’s dresses. When we analyze customers coming in through the two different ad types, the shoe ad had more than twice the rate of customer activations on day one. But 2.5 years later, the spend from customers that came in through the women’s dresses ad was significantly higher than the shoe ad, with the difference increasing over time.” – www.bizjournals.com
Our vision is for every decision, at every level in the organization, to be driven by data. In early 2014 we realized that our existing data platform, a combination of a SQL Server database (a data warehouse primarily for structured operational data) and a Hadoop cluster (for unstructured data), was too limiting. We started by defining core principles for our new data platform (v3).
I need to read more about these five: Apache Flink, Apache Samza, Ibis, Apache Twill and Apache Mahout-Samsara. Mahout is the one I have read a bit about, but the others were not on my radar yet.
There are a lot of open source projects out there, and keeping track of them all is next to impossible. Here are five important ones in the Big Data space that you may not know about.
Good paper on how we leverage Hortonworks Data Platform for our big data processing and the benefits we get from it. You should also check out our joint video with Google on leveraging HDP + Google BigQuery.
zulily is a publicly traded, Seattle-based online retailer. The company launched in 2010 with a mission of bringing its customers (primarily moms) fresh and interesting product every day. The company has over four and a half million active customers and expects to do over $1B in sales for 2014.
Unlike search-based e-commerce sites whose users come looking for specific items, the zulily customer visits the company’s web site and mobile app on a daily basis to browse and discover products. The company focuses on crafting engaging, unique and compelling experiences for its customer.
The zulily experience promises “something special every day,” and creating that experience is no small feat. To do so, the company’s 500 merchandising professionals create over 100 daily “events” that launch more than 9,000 unique products each day.
To turn this raw content into an engaging experience for customers, zulily has invested heavily in personalization. Showing the right sales event and images to the right member at the right time is critical for zulily. A mother with a 6-year old daughter should have a completely different experience from that of an expecting mother.
Accomplishing this level of personalization required that zulily build systems to understand members coming to its web site and then instantly determine what to show them. To do this, zulily’s systems must capture, integrate and analyze many different inputs from a wide variety of sources.
The company was founded with a data platform built on a relational database, but two years after launch the number of events, SKUs, customers and interactions was growing too rapidly for that system to keep up. To continue delivering relevant, customized content to its rapidly growing customer base, zulily knew that it needed to modernize its platform.
For personalization at scale, zulily built a Hadoop-based system for collecting clickstream data from across the company’s web, mobile and email engagement channels. This system allowed the company to turn clickstream data into engines that produce personalized, targeted and relevant product recommendations.
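To make the clickstream-to-recommendations idea concrete, here is a minimal in-memory sketch. This is an illustration only, not zulily’s actual pipeline; the event fields and the simple item co-occurrence approach are my assumptions.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical clickstream events: (customer_id, product_id) pairs
# collected from web, mobile and email channels.
events = [
    ("cust1", "shoes"), ("cust1", "dress"),
    ("cust2", "shoes"), ("cust2", "toys"),
    ("cust3", "dress"), ("cust3", "toys"), ("cust3", "shoes"),
]

# Group the products each customer has viewed.
views = defaultdict(set)
for customer, product in events:
    views[customer].add(product)

# Count how often two products are viewed by the same customer.
cooccur = Counter()
for products in views.values():
    for a, b in combinations(sorted(products), 2):
        cooccur[(a, b)] += 1
        cooccur[(b, a)] += 1

def recommend(product, k=3):
    """Return products most often co-viewed with `product`."""
    scored = [(other, n) for (p, other), n in cooccur.items() if p == product]
    return [other for other, _ in sorted(scored, key=lambda x: -x[1])[:k]]

print(recommend("dress"))
```

A production system would of course run an aggregation like this over billions of events in Hadoop and feed richer signals into the ranking, but the shape of the computation is the same.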
The zulily platform helped the company achieve a new level of precision and maturity in its ability to personalize its customers’ experiences, but one challenge remained: its structured data (including transactions, customer records and product details) sat in one silo, separate from its clickstream data in Hadoop.
“We really struggled to integrate and analyze the data across the two different silos,” said Sudhir Hasbe, director of software engineering for data services at zulily. “We often found ourselves making decisions based exclusively on a single type of data, or needing to get developers involved to produce new reports or analyses.”
The need to constantly involve the company’s development team was expensive and time consuming, and it distracted the developers from their own priorities. Due to the complexity of its siloed data platform, the company found itself limited in its ability to respond agilely to changes in the marketplace or company strategy.
After zulily thoroughly examined its analytical priorities and the challenges posed by its current infrastructure, the company concluded that it needed to move beyond its legacy relational database.
“We knew we couldn’t build what we needed on traditional relational database technology,” said Hasbe. “We would have to innovate and take advantage of the latest technologies available in the industry for managing both structured and unstructured data.”
As Hasbe and his team further defined the company’s requirements, they formed a vision for what the company now calls the zulily Data Platform (ZDP). They planned to make ZDP a primary, central repository for all of the business’ data. It would:
- Support the company’s efforts to enhance the customer experience through better personalization and targeting,
- Give the company’s business analysts easy access to all of the company’s information,
- Allow the team to make smarter business decisions without needing IT support, and
- Scale to support the company’s growth over the long term.
To meet these goals, Hasbe and his team created a modern data architecture that combined the strengths of both Apache Hadoop and cloud computing to deliver a highly scalable unified data platform for structured and unstructured data.
ZDP is based on:
- Hortonworks Data Platform (HDP). With YARN as its architectural center, HDP provides a data platform for multi-workload data processing across an array of processing methods – from batch through interactive to real-time, supported by key capabilities required of an enterprise data platform – spanning governance, security and operations.
- Google Cloud Platform (GCP), Google’s public cloud infrastructure-as-a-service (IaaS) offering.
- Google BigQuery, a cloud-based tool for super-fast data query across massive data sets.
- Tableau, a visualization and reporting tool suite.
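To give a flavor of the cross-silo analysis this unified platform makes possible, here is a tiny sketch that joins structured order records (the old SQL Server silo) with clickstream events (the old Hadoop silo). The schemas and numbers are hypothetical; in ZDP the equivalent query would run over the full data sets in BigQuery.

```python
# Hypothetical structured order records (formerly the relational silo).
orders = [
    {"customer": "cust1", "total": 120.0},
    {"customer": "cust1", "total": 45.0},
    {"customer": "cust2", "total": 80.0},
]

# Hypothetical clickstream view events (formerly the Hadoop silo).
clicks = [
    {"customer": "cust1", "event": "view"},
    {"customer": "cust1", "event": "view"},
    {"customer": "cust2", "event": "view"},
    {"customer": "cust2", "event": "view"},
    {"customer": "cust2", "event": "view"},
    {"customer": "cust2", "event": "view"},
]

def revenue_per_view():
    """Join the two sources: revenue per product view, per customer --
    the kind of question that is hard to answer across separate silos."""
    spend, views = {}, {}
    for o in orders:
        spend[o["customer"]] = spend.get(o["customer"], 0.0) + o["total"]
    for c in clicks:
        views[c["customer"]] = views.get(c["customer"], 0) + 1
    return {cust: spend.get(cust, 0.0) / views[cust] for cust in views}

print(revenue_per_view())
```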
After identifying its path forward, the zulily team was able to move from its vision to an in-production data platform in a mere four months. It will migrate all existing data processing to the new platform by the end of 2014.
“Our new platform enables analytics scenarios that were difficult to achieve with our former technology stack,” said Hasbe. “And we now have the ability to scale both storage and analytics on demand. To finally be ahead of the company’s growth curve is exciting for us.”
For Luke Friang, zulily’s chief information officer, the ZDP platform creates important new opportunities for the company.
“Data is everything to us, yet we really struggled with how to properly consume and harvest a mass of data to provide our customers with a great experience,” said Friang. “Our new platform empowers us to use data all over the business. It drives the content of the email that our customers receive in the morning. It drives how and when we ship customers the products they order. It drives what a customer sees in the mobile app versus what they see in a browser on their computer. It’s allowing us to make sure that we’re tailoring customer experiences appropriately, throughout their entire lifecycle as a zulily customer.”
From Friang’s perspective, it all comes down to supporting the business’ ability to derive new insights and make quick decisions.
“This platform is allowing us to take questions that we couldn’t answer to the point where we can answer them,” he said. “It is allowing us to accelerate decision-making processes from weeks, days and hours to minutes, seconds and milliseconds. From an off-line analytics activity to a real-time decision-making process embedded within a piece of software. That’s the value.”
“Hortonworks’ depth of knowledge was invaluable to us in this process,” added Friang. “The responsiveness of their team, and their ability to get things done and get issues fixed, were key to our ability to get ZDP off the ground.”
This week Ambari 1.5.1 was released. I need to try it out. Check out the post from Hortonworks.
Yesterday the Apache Ambari community proudly released version 1.5.1. This is the result of constant, concerted collaboration among the Ambari project’s many members. This release represents the work of over 30 individuals over 5 months and, combined with the Ambari 1.5.0 release, resolves more than 1,000 JIRAs.
This version of Ambari makes huge strides in simplifying the deployment, management and monitoring of large Hadoop clusters, including those running Hortonworks Data Platform 2.1.
Ambari 1.5.1 contains many new features – let’s take a look at them.
I am evaluating various NoSQL technologies as part of my new role at zulily. Someone on our team forwarded this article, and it is a good read.
HBase offers both scalability and the economy of sharing the same infrastructure as Hadoop, but will its flaws hold it back? NoSQL experts square off.
HBase is modeled after Google BigTable and is part of the world’s most popular big data processing platform, Apache Hadoop. But will this pedigree guarantee HBase a dominant role in the competitive and fast-growing NoSQL database market?
Michael Hausenblas of MapR argues that Hadoop’s popularity and HBase’s scalability and consistency ensure success. The growing HBase community will surpass other open-source movements and will overcome a few technical wrinkles that have yet to be worked out.
Jonathan Ellis of DataStax, the support provider behind open-source Cassandra, argues that HBase’s flaws are too numerous and too intrinsic to Hadoop’s HDFS architecture to overcome. These flaws will forever limit HBase’s applicability to high-velocity workloads, he says.
Read what our two NoSQL experts have to say, and then weigh in with your opinion in the comments section below.
In my new role at zulily I am responsible for our Big Data Platform. I have been investigating the different options available in the space, especially the best MPP database product on the market that we could leverage. I came across this awesome article by Marcos Ortiz, and it is a great read.
Like the title says, choosing an enterprise-level Massively Parallel Processing (MPP) database is a big headache for every data science manager, mainly because there are so many very good choices across the tech world.
But I will give my top reasons for choosing a good platform of this kind.
Fast Query Processing
There is not much to explain here: fast query processing is critical for every data-driven business that wants to answer bigger questions and take action more quickly. If you have a platform where you can query huge data sets in a matter of seconds or minutes, that is a huge advantage over your competitors. So, thinking like a Product Manager focused on Big Data Analytics, this is critical for my company.
Integration with Apache Hadoop
Apache Hadoop has become the de facto analytics platform for Big Data processing. For a new business interested in Big Data, you have to build an integrated platform where Hadoop can play a critical role, and if you have a database that can communicate easily with the yellow elephant, you will be able to adapt more quickly to future changes in business analytics.
I am investigating various Hadoop offerings. This is a very good article by Sriram Krishnan and Eva Tse from Netflix. Awesome.
Hadoop has become the de facto standard for managing and processing hundreds of terabytes to petabytes of data. At Netflix, our Hadoop-based data warehouse is petabyte-scale, and growing rapidly. However, with the big data explosion in recent times, even this is not very novel anymore. Our architecture, however, is unique as it enables us to build a data warehouse of practically infinite scale in the cloud (both in terms of data and computational power).
In this article, we discuss our cloud-based data warehouse, how it is different from a traditional data center-based Hadoop infrastructure, and how we leverage the elasticity of the cloud to build a system that is dynamically scalable. We also introduce Genie, which is our in-house Hadoop Platform as a Service (PaaS) that provides REST-ful APIs for job execution and resource management.
In a traditional data center-based Hadoop data warehouse, the data is hosted on the Hadoop Distributed File System (HDFS). HDFS can be run on commodity hardware, and provides fault-tolerance and high-throughput access to large datasets. The most typical way to build a Hadoop data warehouse in the cloud would be to follow this model and store your data on HDFS on your cloud-based Hadoop clusters. However, as we describe in the next section, we have chosen to store all of our data on Amazon’s Simple Storage Service (S3), which is the core principle on which our architecture is based. A high-level overview of our architecture is shown below, followed by the details.
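To make the Genie idea more concrete, here is a minimal sketch of what submitting a Hadoop job through a Genie-style REST API might look like. The endpoint path, payload fields and job parameters are my assumptions for illustration, not Genie’s actual API.

```python
import json
from urllib import request

# Hypothetical job-submission payload for a Genie-style PaaS endpoint.
job = {
    "name": "daily-clickstream-aggregation",
    "clusterTag": "prod-hadoop",  # the service routes the job to a cluster
    "command": "hive",
    # Note the input lives in S3, not HDFS, per the architecture above.
    "script": "s3://example-bucket/scripts/aggregate.hql",
}

def build_submission(base_url, job):
    """Build (but do not send) the HTTP request that submits a job."""
    return request.Request(
        url=f"{base_url}/api/v1/jobs",  # assumed endpoint path
        data=json.dumps(job).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_submission("http://genie.example.com", job)
print(req.full_url, req.get_method())
```

The point of such an API is that clients never need to know which cluster runs their job; the service layer handles routing and resource management, which is what makes elastic, dynamically scalable clusters practical.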
Interesting article, but I think the three key companies to look at are Hortonworks, Cloudera and MapR. AWS with EMR and Azure with HDInsight will be interesting to watch too. I am planning to play around with the Hortonworks offering this week…
Network World – If you’ve got a lot of data, then Hadoop either is, or should be on your radar.
Once reserved for Internet empires like Google and Yahoo, the most popular and well-known big data management system is now creeping into the enterprise. There are two big reasons for that: 1) businesses have a lot more data to manage, and Hadoop is a great platform, especially for combining legacy data with new, unstructured data; 2) a lot of vendors are jumping into the game of offering support and services around Hadoop, making it more palatable for enterprises.
Most firms estimate that they are only analyzing 12% of the data that they already have, leaving 88% of it on the cutting-room floor.
— According to Forrester’s Software Survey Q4, 2013
“Hadoop is unstoppable as its open source roots grow wildly and deeply into enterprise data management architectures,” Forrester analysts Mike Gualtieri and Noel Yuhanna wrote recently in the company’s Wave Report on the Hadoop marketplace. “Forrester believes that Hadoop is a must-have data platform for large enterprises, forming the cornerstone of any flexible future data management platform. If you have lots of structured, unstructured, and/or binary data, there is a sweet spot for Hadoop in your organization.”
So where do you start? Forrester says there are a variety of places to go, and it evaluated nine vendors offering Hadoop services to find the pros and cons of each. Forrester concluded that there is no clear market leader at this point, with relatively young companies in this market offering compelling services alongside the tech titans.