Wednesday 27 February 2013

Storm and Hadoop: Convergence of Big-Data and Low-Latency Processing · YDN Blog

At Yahoo!, Hadoop plays a central role in providing personalized experiences for our users and creating value for our advertisers. To serve Yahoo!?s emerging business needs, the Cloud Engineering Group is working on a next generation platform that enables the convergence of big-data and low-latency processing.

Figure 1. Personalization based on User Interests

Yahoo! is enhancing its web properties and mobile applications to provide its users personalized experience based on interest profiles. To compute user interest, we process billions of events from our over 700 million users, and analyze 2.2 billion content every day. Since users change interest over time, we need to update user profiles to reflect their current interest. Figure 1 illustrates a conceptual architecture that describes how low-latency processing and batch processing are leveraged to update user interest profile for personalization.

Figure 2. Convergence of batch and low-latency processing

Enabling low-latency big-data processing is one of primary design goals of Yahoo!?s next-generation big-data platform. While MapReduce is a key design pattern for batch processing, additional design patterns will be supported over time. Stream/micro-batch processing is one of design patterns applicable to many Yahoo! use cases. In Q1 2013, we added Storm as a new service to our big-data platform. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for stream/micro-batch processing. As illustrated in Figure 2, our big-data platform enables Hadoop applications and Storm applications to share data via shared storage such as HBase.

Yahoo! engineering teams are developing technologies to enable Storm applications and Hadoop applications to be hosted on a single cluster.
? We have enhanced Storm to support Hadoop style security mechanism (including Kerberos authentication), and thus enable Storm applications authorized to access Hadoop datasets on HDFS and HBase.
? Storm is being integrated into Hadoop YARN for resource management. Storm-on-YARN enables Storm applications to utilize the computation resources in our tens of thousands of Hadoop computation nodes. YARN is used to launch Storm application master (Nimbus) on demand, and enables Nimbus to request resources for Storm application slaves (Supervisors).

Yahoo! is committed to working with open source community on big-data processing. To enable the convergence of low-latency big-data processing, Yahoo! is making our contribution on Storm and YARN available as open source. Additional details on these efforts will be shared at our HUG meetup in April 2013 and Hadoop Summit North America in June, 2013.


crawled from : Yahoo

No comments:

Post a Comment