The Mason's Home (泥水匠的家)

Lei's Life in Technology

Category: Hadoop

ORIGINALLY PUBLISHED JANUARY 2012

Apache Hadoop is an excellent framework for processing, storing and analyzing large volumes of unstructured data, also known as Big Data. But getting a handle on all the project’s myriad components and sub-components, with names like Pig and Mahout, can be difficult.

Cloudera Co-Founder and CTO Amr Awadallah walked the Wikibon Community through the Hadoop ecosystem during a visit to #theCUBE at Hadoop World 2011 in New York City. Below is a glossary describing the key Hadoop components and sub-components, as defined both by Awadallah and Wikibon, as well as the live recording of Awadallah inside #theCUBE from the show floor.

This page will be updated as these and other Hadoop projects emerge/grow.


Hadoop Distributed File System: HDFS, the storage layer of Hadoop, is a distributed, scalable, Java-based file system adept at storing large volumes of unstructured data.

MapReduce: MapReduce is a software framework that serves as the compute layer of Hadoop. MapReduce jobs are divided into two (obviously named) parts. The “Map” function divides a query into multiple parts and processes data at the node level. The “Reduce” function aggregates the results of the “Map” function to determine the “answer” to the query.
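The two phases described above can be sketched in plain Python as a word count. This is only an illustration of the map/shuffle/reduce pattern running in a single process, not Hadoop's Java API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: aggregate all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    """Run map over every chunk, then reduce each group of values."""
    pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    # Each string stands in for the slice of data held on one node.
    print(word_count(["big data big", "data pipelines"]))
```

In a real cluster each chunk would be mapped on the node that stores it, and only the grouped intermediate pairs would travel across the network to the reducers.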

Hive: Hive is a Hadoop-based data warehousing-like framework originally developed by Facebook. It allows users to write queries in a SQL-like language called HiveQL, which are then converted to MapReduce. This allows SQL programmers with no MapReduce experience to use the warehouse and makes it easier to integrate with business intelligence and visualization tools such as MicroStrategy, Tableau, Revolution Analytics, etc.

Pig: Pig Latin is a Hadoop-based language developed by Yahoo. It is relatively easy to learn and is adept at very deep, very long data pipelines (a limitation of SQL).

HBase: HBase is a non-relational database that allows for low-latency, quick lookups in Hadoop. It adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts and deletes. eBay and Facebook use HBase heavily.
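The operations named above (keyed lookups, inserts, updates and deletes) can be pictured with a small in-memory stand-in. The `TinyTable` class, its method names and the `info:name` column label are hypothetical and chosen only for illustration; this is not the HBase client API and says nothing about how HBase stores data on disk.

```python
class TinyTable:
    """Hypothetical in-memory stand-in for a keyed table; not the HBase API."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, column, value):
        """Insert a new cell, or overwrite (update) an existing one."""
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key):
        """Look up one row directly by key, without scanning other rows."""
        return dict(self._rows.get(row_key, {}))

    def delete(self, row_key):
        """Remove a row; deleting a missing key is a no-op."""
        self._rows.pop(row_key, None)


if __name__ == "__main__":
    table = TinyTable()
    table.put("user#42", "info:name", "Ada")      # insert
    table.put("user#42", "info:name", "Ada L.")   # update
    print(table.get("user#42"))                   # keyed lookup
    table.delete("user#42")                       # delete
    print(table.get("user#42"))
```

The point of the sketch is the access pattern: every operation addresses a single row by its key, which is what makes low-latency lookups possible compared with a batch scan.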

Flume: Flume is a framework for populating Hadoop with data. Agents are deployed throughout one's IT infrastructure – inside web servers, application servers and mobile devices, for example – to collect data and integrate it into Hadoop.

Oozie: Oozie is a workflow processing system that lets users define a series of jobs written in multiple languages – such as MapReduce, Pig and Hive – then intelligently link them to one another. Oozie allows users to specify, for example, that a particular query is only to be initiated after specified previous jobs on which it relies for data are completed.
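The dependency rule described above amounts to ordering jobs so that each one runs only after the jobs it relies on. A minimal sketch using Python's standard `graphlib` module (Python 3.9+) follows; the job names and the `run_order` helper are assumptions for the example, and real Oozie workflows are defined in its own XML format rather than in Python.

```python
from graphlib import TopologicalSorter


def run_order(dependencies):
    """Return a job order in which every job follows the jobs it depends on.

    `dependencies` maps each job name to the set of jobs that must finish first.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical workflow mixing the three job types named in the text.
workflow = {
    "mapreduce_ingest": set(),
    "pig_clean": {"mapreduce_ingest"},
    "hive_report": {"pig_clean", "mapreduce_ingest"},
}

if __name__ == "__main__":
    for job in run_order(workflow):
        print("run", job)
```

If two jobs depended on each other, `TopologicalSorter` would raise `graphlib.CycleError`, which mirrors the fact that a workflow with circular dependencies can never start.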

Ambari: Ambari is a web-based set of tools for deploying, administering and monitoring Apache Hadoop clusters. Its development is being led by engineers from Hortonworks, which includes Ambari in its Hortonworks Data Platform.

Avro: Avro is a data serialization system that allows for encoding the schema of Hadoop files. It is adept at parsing data and performing remote procedure calls.
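The idea of encoding the schema together with the data can be sketched as follows: the writer stores the schema as a header, so a reader can parse the records using nothing but the file itself. This is a simplified illustration built on Python's `json` and `struct` modules. It is not Avro's actual binary format or API, and the `FORMATS` table, field names and function names are assumptions for the example.

```python
import json
import struct

# Hypothetical mapping from schema type names to struct format codes.
FORMATS = {"int": "<i", "double": "<d"}


def encode(schema, records):
    """Serialize the schema as a JSON header, followed by packed records."""
    header = json.dumps(schema).encode("utf-8")
    body = b"".join(
        struct.pack(FORMATS[ftype], record[name])
        for record in records
        for name, ftype in schema
    )
    # A 4-byte length prefix tells the reader where the header ends.
    return struct.pack("<I", len(header)) + header + body


def decode(blob):
    """Recover the schema from the blob itself, then unpack each record."""
    (header_len,) = struct.unpack_from("<I", blob, 0)
    schema = json.loads(blob[4:4 + header_len].decode("utf-8"))
    offset = 4 + header_len
    records = []
    while offset < len(blob):
        record = {}
        for name, ftype in schema:
            (record[name],) = struct.unpack_from(FORMATS[ftype], blob, offset)
            offset += struct.calcsize(FORMATS[ftype])
        records.append(record)
    return schema, records


if __name__ == "__main__":
    schema = [["user_id", "int"], ["score", "double"]]
    rows = [{"user_id": 1, "score": 9.5}, {"user_id": 2, "score": 7.25}]
    print(decode(encode(schema, rows)))
```

Because the schema travels with the data, the reader needs no separately distributed class definitions or column lists to interpret the bytes.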

Mahout: Mahout is a data mining library. It takes the most popular data mining algorithms for performing clustering, regression and statistical modeling and implements them using the MapReduce model.

Sqoop: Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop. It allows users to specify the target location inside of Hadoop and instruct Sqoop to move data from Oracle, Teradata or other relational databases to the target.

HCatalog: HCatalog is a centralized metadata management and sharing service for Apache Hadoop. It allows for a unified view of all data in Hadoop clusters and allows diverse tools, including Pig and Hive, to process any data elements without needing to know physically where in the cluster the data is stored.

BigTop: BigTop is an effort to create a more formal process or framework for packaging and interoperability testing of Hadoop's sub-projects and related components.
