Mar 14

Using Hadoop with Hortonworks Data Platform @ zulilY

Big Data, Technology

Good paper on how we leverage Hortonworks Data Platform for our big data processing and the benefits we get from it. You should also check out our joint video with Google on leveraging HDP + Google BigQuery.

zulily is a publicly traded, Seattle-based online retailer. The company launched in 2010 with a mission of bringing its customers (primarily moms) fresh and interesting product every day. The company has over four and a half million active customers and expects to do over $1B in sales for 2014.

Unlike search-based e-commerce sites whose users come looking for specific items, the zulily customer visits the company’s web site and mobile app on a daily basis to browse and discover products. The company focuses on crafting engaging, unique and compelling experiences for its customers.

The zulily experience promises “something special every day,” and creating that experience is no small feat. To do so, the company’s 500 merchandising professionals create over 100 daily “events” that launch more than 9,000 unique products each day.


To turn this raw content into an engaging experience for customers, zulily has invested heavily in personalization. Showing the right sales event and images to the right member at the right time is critical for zulily. A mother with a 6-year old daughter should have a completely different experience from that of an expecting mother.

Accomplishing this level of personalization required that zulily build systems to understand members coming to its web site and then instantly determine what to show them. To do this, zulily’s systems must capture, integrate and analyze many different inputs from a wide variety of sources.

The company was founded with a data platform built on a relational database, but two years after launch the number of events, SKUs, customers and interactions was growing too rapidly for that system to keep up. To continue delivering relevant, customized content to its rapidly growing customer base, zulily knew that it needed to modernize its platform.

For personalization at scale, zulily built a Hadoop-based system for collecting clickstream data from across the company’s web, mobile and email engagement channels. This system allowed the company to turn clickstream data into engines that produce personalized, targeted and relevant product recommendations.

The zulily platform helped it achieve a new level of precision and maturity in its ability to personalize its customers’ experiences, but one challenge still remained. It still had a silo of structured data in one place (including transactions, customer records and product details), which was separate from its clickstream data in Hadoop.

“We really struggled to integrate and analyze the data across the two different silos,” said Sudhir Hasbe, director of software engineering for data services at zulily. “We often found ourselves making decisions based exclusively on a single type of data, or needing to get developers involved to produce new reports or analyses.”

The need to constantly involve the company’s development team was expensive and time consuming, and it distracted the developers from focusing on their own priorities. Due to the complexity of its siloed data platform, the company found itself limited in its ability to agilely respond to changes in the marketplace or company strategy.


After zulily thoroughly examined its analytical priorities and the challenges posed by its current infrastructure, the company concluded that it needed to move beyond its legacy relational database.

“We knew we couldn’t build what we needed on traditional relational database technology,” said Hasbe. “We would have to innovate and take advantage of the latest technologies available in the industry for managing both structured and unstructured data.”

As Hasbe and his team further defined the company’s requirements, they formed a vision for what the company now calls the zulily Data Platform (ZDP). They planned to make ZDP a primary, central repository for all of the business’ data. It would:

  • Support the company’s efforts to enhance the customer experience through better personalization and targeting,
  • Give the company’s business analysts easy access to all of the company’s information,
  • Allow the team to make smarter business decisions without needing IT support, and
  • Scale to support the company’s growth over the long term.

To meet these goals, Hasbe and his team created a modern data architecture that combined the strengths of both Apache Hadoop and cloud computing to deliver a highly scalable unified data platform for structured and unstructured data.

ZDP is based on:

  • Hortonworks Data Platform (HDP). With YARN as its architectural center, HDP provides a data platform for multi-workload data processing across an array of processing methods – from batch through interactive to real-time, supported by key capabilities required of an enterprise data platform – spanning governance, security and operations.
  • Google Cloud Platform (GCP), Google’s public cloud infrastructure-as-a-service (IaaS) offering.
  • Google BigQuery, a cloud-based tool for super-fast data query across massive data sets.
  • Tableau, a visualization and reporting tool suite.

After identifying its path forward, the zulily team was able to move from its vision to an in-production data platform in a mere four months. It will migrate all existing data processing to the new platform by the end of 2014.

“Our new platform enables analytics scenarios that were difficult to achieve with our former technology stack,” said Hasbe. “And we now have the ability to scale both storage and analytics on demand. To finally be ahead of the company’s growth curve is exciting for us.”


For Luke Friang, zulily’s chief information officer, the ZDP platform creates important new opportunities for the company.

“Data is everything to us, yet we really struggled with how to properly consume and harvest a mass of data to provide our customers with a great experience,” said Friang. “Our new platform empowers us to use data all over the business. It drives the content of the email that our customers receive in the morning. It drives how and when we ship customers the products they order. It drives what customer sees in the mobile app versus what customer sees in a browser on their computer. It’s allowing us to make sure that we’re tailoring customer experiences appropriately, throughout the entire lifecycle as a zulily customer.”

From Friang’s perspective, it all comes down to supporting the business’ ability to derive new insights and make quick decisions.

“This platform is allowing us to take questions that we couldn’t answer to the point where we can answer them,” he said. “It is allowing us to accelerate decision making processes from weeks, days and hours to minutes, seconds and milliseconds. From an off-line analytics activity to a real-time decision-making processes embedded within a piece of software. That’s the value.”

“Hortonworks’ depth of knowledge was invaluable to us in this process,” added Friang. “The responsiveness of their team, and their ability to get things done and get issues fixed, were key to our ability to get ZDP off the ground.”

zulily does Hadoop. With Hortonworks Data Platform

Apr 26

Apache Ambari 1.5.1 is Released!

Big Data

Ambari 1.5.1 was released this week. Need to try it out. Check out the post from Hortonworks.

Yesterday the Apache Ambari community proudly released version 1.5.1. This is the result of constant, concerted collaboration among the Ambari project’s many members. This release represents the work of over 30 individuals over 5 months and, combined with the Ambari 1.5.0 release, resolves more than 1,000 JIRAs.


This version of Ambari makes huge strides in simplifying the deployment, management and monitoring of large Hadoop clusters, including those running Hortonworks Data Platform 2.1.

Ambari 1.5.1 contains many new features – let’s take a look at those.

Apache Ambari 1.5.1 is Released! | Hortonworks

Mar 27

Am investigating various offerings with Hadoop. This is a very good article by Sriram Krishnan and Eva Tse from Netflix. Awesome.

Hadoop has become the de facto standard for managing and processing hundreds of terabytes to petabytes of data. At Netflix, our Hadoop-based data warehouse is petabyte-scale, and growing rapidly. However, with the big data explosion in recent times, even this is not very novel anymore. Our architecture, however, is unique as it enables us to build a data warehouse of practically infinite scale in the cloud (both in terms of data and computational power).

In this article, we discuss our cloud-based data warehouse, how it is different from a traditional data center-based Hadoop infrastructure, and how we leverage the elasticity of the cloud to build a system that is dynamically scalable. We also introduce Genie, which is our in-house Hadoop Platform as a Service (PaaS) that provides REST-ful APIs for job execution and resource management.

Architectural Overview

In a traditional data center-based Hadoop data warehouse, the data is hosted on the Hadoop Distributed File System (HDFS). HDFS can be run on commodity hardware, and provides fault-tolerance and high throughput access to large datasets. The most typical way to build a Hadoop data warehouse in the cloud would be to follow this model, and store your data on HDFS on your cloud-based Hadoop clusters. However, as we describe in the next section, we have chosen to store all of our data on Amazon’s Simple Storage Service (S3), which is the core principle on which our architecture is based. A high-level overview of our architecture is shown below, followed by the details.

The Netflix Tech Blog: Hadoop Platform as a Service in the Cloud

Mar 19

Interesting article, but I think the 3 key companies to look at are Hortonworks, Cloudera and MapR. AWS with EMR and Azure with HDInsight will be interesting to watch too. I am planning to play around with the Hortonworks offering this week…

Network World – If you’ve got a lot of data, then Hadoop either is, or should be, on your radar.

Once reserved for the Internet empires like Google and Yahoo, the most popular and well-known big data management system is now creeping into the enterprise. There are two big reasons for that: 1) businesses have a lot more data to manage, and Hadoop is a great platform, especially for combining legacy data with new, unstructured data; 2) a lot of vendors are jumping into the game of offering support and services around Hadoop, making it more palatable for enterprises.

Most firms estimate that they are only analyzing 12% of the data that they already have, leaving 88% of it on the cutting-room floor.

— According to Forrester’s Software Survey Q4, 2013

“Hadoop is unstoppable as its open source roots grow wildly and deeply into enterprise data management architectures,” Forrester analysts Mike Gualtieri and Noel Yuhanna wrote recently in the company’s Wave Report on the Hadoop marketplace. “Forrester believes that Hadoop is a must-have data platform for large enterprises, forming the cornerstone of any flexible future data management platform. If you have lots of structured, unstructured, and/or binary data, there is a sweet spot for Hadoop in your organization.”

So where do you start? Forrester says there are a variety of places to go, and it evaluated nine vendors offering Hadoop services to find the pros and cons of each. Forrester concluded that there is no clear market leader at this point, with relatively young companies in this market offering compelling services alongside the tech titans.

Nine Hadoop companies you should know – Network World

Mar 18

Growth in the big data market is going to be staggering. If the numbers are accurate, this is phenomenal. Not sure if there is any other trend that has grown this fast in recent times…

The global Hadoop market is expected to grow at a compound annual growth rate of 58 percent between 2013 and 2020, according to a new report by Allied Market Research.

The market revenue was estimated to be $2 billion in 2013 and is expected to grow to $50.2 billion by 2020. A huge increase in raw structured and unstructured data and increasing demand for big data analytics are the major driving factors for the global Hadoop market, the report says. Hadoop provides faster, more cost-effective data processing for big data analytics than conventional data analysis tools such as relational database management systems.

Distributed computing and Hadoop platform security issues are currently hindering the growth of the market, Allied Market Research says. But with continuous technological growth these issues can be addressed, it says.
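
A quick sanity check on those figures, as a rough sketch only. It assumes annual compounding over 7 periods (2013 to 2020) and uses the $2 billion and $50.2 billion values quoted above; the function names are mine, not from the report.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate implied by a start and end value."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


def project(start_value, rate, years):
    """Value after compounding start_value at the given annual rate."""
    return start_value * (1.0 + rate) ** years


# Figures quoted from the report summary, in billions of dollars.
START_2013 = 2.0
END_2020 = 50.2
YEARS = 7  # assumption: 2013 -> 2020 counted as 7 compounding periods

implied_rate = cagr(START_2013, END_2020, YEARS)
projected_2020 = project(START_2013, 0.58, YEARS)

print(f"Implied CAGR from $2B to $50.2B: {implied_rate:.1%}")
print(f"$2B compounded at 58% for 7 years: ${projected_2020:.1f}B")
```

Under those assumptions the implied rate comes out just above 58 percent, and compounding $2 billion at exactly 58 percent lands a little under $50 billion, so the two headline numbers are at least consistent with each other.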

Hadoop Market Forecasted to Reach $50.2 Billion by 2020 – Information Management Online Article