10 Common Hadoop Use Cases


Hadoop | Author: xk537 | Posted: 2012-11-04 12:09:14

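This is a classic talk given on August 5, 2010 by Cloudera founder Jeff Hammerbacher. It walks through ten scenarios where Hadoop excels, along with the industries each one typically applies to. Where have you used it?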

10 Common Hadoop-able Problems Webinar 

Today’s speaker: Jeff Hammerbacher
• Studied Mathematics at Harvard
• Worked as a Quant on Wall Street
• Conceived, built, and led Data team at Facebook
• Nearly 30 amazing engineers and data scientists
• Several open source projects and research papers
• Founder of Cloudera
• Chief Scientist
• Also, check out the book “Beautiful Data”

What is Hadoop?

• A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license)
• Scalable data processing engine
  • Hadoop Distributed File System (HDFS): self-healing high-bandwidth clustered storage
  • MapReduce: fault-tolerant distributed processing
• Key value
  • Flexible -> store data without a schema and add it later as needed
  • Affordable -> cost / TB at a fraction of traditional options
  • Broadly adopted -> a large and active ecosystem
  • Proven at scale -> dozens of petabyte-plus implementations in production today
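To make the MapReduce bullet above concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps, using word counting as the job. It is plain Python, not the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical and chosen only for illustration. On a real cluster the framework would run the map and reduce steps on many machines and handle the shuffle and failure recovery itself.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def run_job(records):
    """Run map, shuffle, and reduce sequentially in one process."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Two toy input records standing in for lines of a much larger file.
word_counts = run_job(["hadoop stores data", "hadoop processes data"])
```

Because each call to `map_phase` looks at one record only, and each call to `reduce_phase` looks at one key only, both steps can be spread across machines without coordination, which is what makes the model parallel.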

Cloudera’s Distribution for Hadoop, version 3

The industry’s leading Hadoop distribution

[Diagram: components of the distribution: Hue, Hue SDK, Oozie, Hive, Pig, Flume, Sqoop, HBase, ZooKeeper]

• Open source – 100% Apache licensed
• Simplified – Component versions & dependencies managed for you
• Integrated – All components & functions interoperate through standard APIs
• Reliable – Patched with fixes from future releases to improve stability
• Supported – Employs project founders and committers for >70% of components

How does Cloudera know which problems are Hadoop-able?

• Talking to 1000s of users
• Supporting 100s of implementations
• Experience putting Hadoop into production with customers across a range of industries

What is common across Hadoop-able problems?

Nature of the data
• Complex data
• Multiple data sources
• Lots of it

Nature of the analysis
• Batch processing
• Parallel execution
• Spread data over a cluster of servers and take the computation to the data

What Analysis is Possible With Hadoop?

• Text mining
• Index building
• Graph creation and analysis
• Pattern recognition
• Collaborative filtering
• Prediction models
• Sentiment analysis
• Risk assessment

Benefits of Analyzing With Hadoop

• Previously impossible/impractical to do this analysis
• Analysis conducted at lower cost
• Analysis conducted in less time
• Greater flexibility

Topics
• Introduction
• 10 Common Hadoop-able Problems
• Summary
• Questions

1. Modeling True Risk

Solution with Hadoop
• Source, parse and aggregate disparate data sources to build a comprehensive data picture
  • e.g. credit card records, call recordings, chat sessions, emails, banking activity
• Structure and analyze
  • Sentiment analysis, graph creation, pattern recognition

Typical Industry
• Financial Services (Banks, Insurance)

2. Customer Churn Analysis

Solution with Hadoop
• Rapidly test and build a behavioral model of the customer from disparate data sources
• Structure and analyze with Hadoop
  • Traversing
  • Graph creation
  • Pattern recognition

Typical Industry
• Telecommunications, Financial Services

3. Recommendation Engine

Solution with Hadoop
• Batch processing framework
  • Allow execution in parallel over large datasets
• Collaborative filtering
  • Collecting ‘taste’ information from many users
  • Utilizing information to predict what similar users like

Typical Industry
• Ecommerce, Manufacturing, Retail
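The collaborative filtering idea above (collect taste information from many users, then predict what similar users like) can be sketched in a few lines. This is an illustrative user-based variant that measures similarity with the Jaccard index; the slides do not say which similarity measure or algorithm is used, so that choice, the `recommend` function, and the sample data are all assumptions.

```python
from collections import defaultdict

def jaccard(a, b):
    """Similarity of two item sets: size of overlap divided by size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(likes, user, top_n=3):
    """Score items the user has not seen by the similarity of users who like them."""
    seen = likes[user]
    scores = defaultdict(float)
    for other, items in likes.items():
        if other == user:
            continue
        similarity = jaccard(seen, items)
        for item in items - seen:
            scores[item] += similarity
    # Highest score first; ties broken by item name for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked[:top_n] if score > 0]

# Hypothetical 'taste' data: which items each user liked.
likes = {
    "ann": {"book_a", "book_b", "book_c"},
    "bob": {"book_a", "book_b", "book_d"},
    "cat": {"book_e"},
}
suggestions = recommend(likes, "ann")
```

Here `bob` shares two of four distinct items with `ann`, so his unseen item is suggested, while `cat` shares nothing and contributes no recommendation. At scale, the pairwise comparison is the expensive part, which is why it suits a parallel batch framework.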

4. Ad Targeting

Solution with Hadoop
• Data analysis can be conducted in parallel, reducing processing times from days to hours
• With Hadoop, as data volumes grow the only expansion cost is hardware
• Add more nodes without a degradation in performance

Typical Industry
• Advertising

5. Point of Sale Transaction Analysis

Solution with Hadoop
• Batch processing framework
  • Allow execution in parallel over large datasets
• Pattern recognition
  • Optimizing over multiple data sources
  • Utilizing information to predict demand

Typical Industry
• Retail

6. Analyzing Network Data to Predict Failure

Solution with Hadoop
• Take the computation to the data
• Expand the range of indexing techniques from simple scans to more complex data mining
• Better understand how the network reacts to fluctuations
  • How anomalies previously thought to be discrete may, in fact, be interconnected
• Identify leading indicators of component failure

Typical Industry
• Utilities, Telecommunications, Data Centers

7. Threat Analysis

Solution with Hadoop
• Parallel processing over huge datasets
• Pattern recognition to identify anomalies, i.e. threats

Typical Industry
• Security
• Financial Services
• General: spam fighting, click fraud
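As a rough illustration of "pattern recognition to identify anomalies", the sketch below flags hosts whose event count sits far from the average, measured in standard deviations (a z-score). The slides do not name a specific technique, so the z-score approach, the `find_anomalies` function, the threshold of 2.0, and the sample counts are all assumptions made for the example.

```python
from statistics import mean, pstdev

def find_anomalies(counts, threshold=2.0):
    """Return keys whose count lies more than `threshold` standard deviations from the mean."""
    values = list(counts.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Every count is identical, so nothing stands out.
        return []
    return sorted(k for k, v in counts.items() if abs(v - mu) / sigma > threshold)

# Hypothetical failed-login counts per host for one time window.
logins_per_host = {
    "h1": 10, "h2": 12, "h3": 11, "h4": 9,
    "h5": 10, "h6": 11, "h7": 95,
}
suspects = find_anomalies(logins_per_host)
```

In a batch setting, the per-host counts would typically be produced by an aggregation job over raw logs, with a check like this applied to the much smaller aggregated result.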

8. Trade Surveillance

Solution with Hadoop
• Batch processing framework
  • Allow execution in parallel over large datasets
• Pattern recognition
  • Detect trading anomalies and harmful behavior

Typical Industry
• Financial services
• Regulatory bodies

9. Search Quality

Solution with Hadoop
• Analyzing search attempts in conjunction with structured data
• Pattern recognition
  • Browsing pattern of users performing searches in different categories

Typical Industry
• Web
• Ecommerce

10. Data “Sandbox”

Solution with Hadoop
• With Hadoop, an organization can “dump” all of its data into an HDFS cluster
• Then use Hadoop to start trying out different analyses on the data
• See patterns or relationships that allow the organization to derive additional value from data

Typical Industry
• Common across all industries
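The sandbox approach relies on storing raw data first and deciding on a schema only when reading it, the same "store data without a schema and add it later" idea mentioned earlier. A minimal sketch of that schema-on-read pattern follows, assuming the raw data is newline-delimited JSON; the `read_with_schema` helper, the field names, and the sample lines are hypothetical.

```python
import json

# Raw lines as they might have been dumped into storage, unvalidated.
raw_lines = [
    '{"user": "u1", "event": "click", "ms": 120}',
    '{"user": "u2", "event": "view"}',
    'not valid json',
]

def read_with_schema(lines, schema):
    """Apply a schema at read time.

    `schema` maps field name -> cast function. Missing fields become None,
    and lines that are not JSON objects are skipped.
    """
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        rows.append({
            name: (cast(record[name]) if name in record else None)
            for name, cast in schema.items()
        })
    return rows

# One possible view of the data; a different analysis could pass another schema.
rows = read_with_schema(raw_lines, {"user": str, "ms": int})
```

The stored lines never change; each analysis supplies its own schema, which is what lets an organization try out different analyses on the same dump.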

Summary – 10 Common Hadoop-able Problems

1. Modeling true risk
2. Customer churn analysis
3. Recommendation engine
4. Ad targeting
5. PoS transaction analysis
6. Analyzing network data to predict failure
7. Threat analysis
8. Trade surveillance
9. Search quality
10. Data “sandbox”

Who is Cloudera?

• Enterprise software & services company providing the industry’s leading Hadoop-based data management platform
• Founding team came from large Web companies
• Products: Cloudera Enterprise & Cloudera’s Distribution for Hadoop
• All necessary packages, matched, tested and supported
• Tools to support production use of Hadoop
• The leading distribution for the enterprise
• Contributors and committers
• Fixing, patching and adding features



Source: ITPUB Blog. If you repost this article, please credit the source; otherwise legal action may be pursued.
