
Hadoop HDFS: Disk Usage of Small Files

Hadoop | Author: zfrzfrzfrzfr | 2014-03-11 00:45:18
When HDFS stores a file, its block size is very large, e.g. 64 MB per block, far larger than the block size of the machine's local file system.
In fact, a small file occupies only a small amount of disk space. The 64 MB block size is merely the unit into which files are split for storage. The real impact of small files on Hadoop is their impact on the NameNode.
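To make the NameNode point concrete, here is a rough sketch. It assumes the commonly quoted rule of thumb that each file and each block costs on the order of 150 bytes of NameNode heap; the exact figure varies by Hadoop version, so treat this as an estimate, not a measurement.

```python
# Rough sketch of why small files hurt the NameNode rather than the disk.
# Assumption: ~150 bytes of NameNode heap per file or block object
# (a widely quoted rule of thumb, not an exact figure).
BYTES_PER_OBJECT = 150

def namenode_heap(num_files, blocks_per_file=1):
    """Approximate NameNode memory needed to track the given files."""
    objects = num_files * (1 + blocks_per_file)  # one inode + its blocks
    return objects * BYTES_PER_OBJECT

# 1 GB stored as a single file (16 blocks of 64 MB each)...
one_big = namenode_heap(1, blocks_per_file=16)
# ...versus the same 1 GB stored as a million 1 KB files (1 block each).
many_small = namenode_heap(1_000_000, blocks_per_file=1)

print(one_big)     # a few KB of heap
print(many_small)  # hundreds of MB of heap
```

The disk usage is the same in both cases; only the NameNode's in-memory metadata explodes with the number of small files.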

[Appendix] Quoted below:

        I know that HDFS stores data using the regular Linux file system on the data nodes. My HDFS block size is 128 MB. Let's say that I have 10 GB of disk space in my Hadoop cluster; that means HDFS initially has 80 blocks as available storage.

If I create a small file of, say, 12.8 MB, the number of available HDFS blocks will become 79. What happens if I create another small file of 12.8 MB? Will the number of available blocks stay at 79, or will it come down to 78? In the former case, HDFS basically recalculates the number of available blocks after each block allocation based on the available free disk space, so the number of available blocks will become 78 only after more than 128 MB of disk space is consumed. Please clarify.


But before trying, my guess is that even if you can only allocate 80 full blocks in your configuration, you can allocate more than 80 non-empty files. This is because I think HDFS does not use a full block each time you allocate a non-empty file. Put another way, HDFS blocks are not a storage allocation unit, but a replication unit. I think the storage allocation unit of HDFS is the unit of the underlying filesystem (if you use ext4 with a block size of 4 KB and you create a 1 KB file in a cluster with a replication factor of 3, you consume 3 times 4 KB = 12 KB of hard disk space).

Enough guessing and thinking, let's try it. My lab configuration is as follows:

  • hadoop version 1.0.4
  • 4 data nodes, each with a little less than 5.0G of available space, ext4 block size of 4K
  • block size of 64 MB, default replication of 1

After starting HDFS, I have the following NameNode summary:

  • 1 files and directories, 0 blocks = 1 total
  • DFS Used: 112 KB
  • DFS Remaining: 19.82 GB

Then I do the following commands:

  • hadoop fs -mkdir /test
  • for f in $(seq 1 10); do hadoop fs -copyFromLocal ./1K_file /test/$f; done

With these results:

  • 12 files and directories, 10 blocks = 22 total
  • DFS Used: 122.15 KB
  • DFS Remaining: 19.82 GB

So the 10 files did not consume 10 times 64 MB: "DFS Remaining" did not change, and "DFS Used" grew by only about 10 KB.
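Comparing the two accounting models against the numbers reported above makes the conclusion stark. The figures here are taken directly from the NameNode summaries quoted in the experiment.

```python
MB = 1024 * 1024
HDFS_BLOCK = 64 * MB  # the lab's configured HDFS block size

# If each 1 KB file had reserved a full 64 MB HDFS block
# (replication 1), ten files would have consumed:
naive = 10 * HDFS_BLOCK  # 640 MiB

# What the NameNode summary actually showed: "DFS Used" grew from
# 112 KB to about 122.15 KB, i.e. roughly the 10 KB of file data.
observed_delta = 122.15 * 1024 - 112 * 1024

print(naive // MB)            # 640 MiB predicted by block-unit accounting
print(round(observed_delta))  # ~10 KB actually consumed
```

The observed growth is five orders of magnitude smaller than the block-unit prediction, which supports the guess that HDFS blocks are a replication unit, not a storage allocation unit.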


From the "ITPUB Blog", link: , if reprinting, please cite the source; otherwise legal liability will be pursued.
