ITPub Blog


What Is Apache Hadoop?

Translation · Hadoop · Author: jichuanlau07 · 2014-02-13 21:34:24


The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so it delivers a highly available service on top of a cluster of computers, each of which may be prone to failure.
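The idea above — detect and handle failures in the application layer rather than relying on hardware — can be sketched in a few lines of plain Python. This is an illustration only, not Hadoop code; the names `run_with_retries` and `make_flaky_task` are made up for the example:

```python
def run_with_retries(task, max_attempts=4):
    """Run a task, retrying on failure: fault handling lives in the
    application layer, not in the hardware."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except RuntimeError as err:
            # A real framework would reschedule the task, typically on
            # another node; this sketch simply retries in place.
            last_error = err
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error

def make_flaky_task(fail_times):
    """Build a simulated task that fails `fail_times` times, then succeeds."""
    state = {"remaining": fail_times}
    def task():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise RuntimeError("simulated node failure")
        return "done"
    return task

result = run_with_retries(make_flaky_task(2))  # fails twice, then succeeds
```

In Hadoop the same principle means that a failed map or reduce task is simply re-executed elsewhere in the cluster, so individual machines are allowed to fail.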

The project includes the following modules:

  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
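To make the MapReduce module concrete, here is a minimal word count in plain Python that mimics the map → shuffle → reduce phases. This is a sketch of the programming model only: real Hadoop jobs are written against the MapReduce API (in Java, or via Hadoop Streaming) and run distributed across the cluster, with the framework handling the shuffle.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hello hadoop", "hello world"]
counts = reduce_phase(shuffle(map_phase(lines)))
# counts == {"hello": 2, "hadoop": 1, "world": 1}
```

Because map and reduce operate on independent key groups, the framework can run them in parallel across many machines — which is what makes the model scale.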


Other Hadoop-related projects at Apache include:
  • Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig, and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro™: A data serialization system.
  • Cassandra™: A scalable multi-master database with no single points of failure.
  • Chukwa™: A data collection system for managing large distributed systems.
  • HBase™: A scalable, distributed database that supports structured data storage for large tables.
  • Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout™: A scalable machine learning and data mining library.
  • Pig™: A high-level data-flow language and execution framework for parallel computation.
  • Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
  • Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks for both batch and interactive use cases.
  • ZooKeeper™: A high-performance coordination service for distributed applications.
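To give a flavor of the "data summarization and ad hoc querying" that Hive offers, here is a toy GROUP BY using Python's standard-library sqlite3. This only illustrates the SQL-like query style: HiveQL actually runs over data stored in HDFS and is compiled into MapReduce or Tez jobs, and the `page_views` table here is invented for the example:

```python
import sqlite3

# Toy stand-in for a Hive-style summarization query. In Hive, the same
# SQL-like statement would be executed as distributed jobs over HDFS data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("home", 3), ("docs", 5), ("home", 2)],
)
rows = conn.execute(
    "SELECT page, SUM(views) FROM page_views GROUP BY page ORDER BY page"
).fetchall()
# rows == [("docs", 5), ("home", 5)]
```

The same aggregation could be written as a Pig data-flow script or a raw MapReduce job; Hive's appeal is that analysts can express it declaratively.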



From "ITPUB Blog", link: http://blog.itpub.net/23628945/viewspace-1081101/. Please credit the source when reposting.
