ITPub Blog

Querying HDFS Records with Spark

Original · Hadoop · Author: jack22220613 · 2015-02-27 10:45:50
Note: CDH 5.3 is used throughout. The idea is to point sc.textFile at a table's warehouse directory in HDFS and verify that Spark's record count matches Impala's select count(*).

[root@cdh0 ~]# hadoop dfs -ls /user/hive/warehouse/db1.db/tab1
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Found 3 items
-rw-r--r--   3 impala hive  291745403 2015-01-23 11:00 /user/hive/warehouse/db1.db/tab1/7a4d0328b98fcf78-8052e0dbd8c225b6_2104527952_data.0.
drwxrwxrwt   - impala hive          0 2015-01-23 11:00 /user/hive/warehouse/db1.db/tab1/_impala_insert_staging
-rw-r--r--   3 root   hive  765975316 2015-01-23 16:08 /user/hive/warehouse/db1.db/tab1/xab.csv
[root@cdh0 ~]#

[cdh3:21000] > select count(*) from tab1;
Query: select count(*) from tab1
+----------+
| count(*) |
+----------+
| 4465132  |
+----------+

scala> var file = sc.textFile("hdfs:///user/hive/warehouse/db1.db/tab1")   // typed incorrectly -- the correct form is:
var file = sc.textFile("hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab1")
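As an aside on why the two URI forms differ: `hdfs:///...` carries an empty authority, so Hadoop clients resolve it against fs.defaultFS from core-site.xml, while `hdfs://cdh0:8020/...` names the NameNode explicitly. A minimal standard-library Python sketch of that structural difference:

```python
from urllib.parse import urlparse

# Two forms of the same HDFS path: without and with an explicit NameNode authority.
short = urlparse("hdfs:///user/hive/warehouse/db1.db/tab1")
full = urlparse("hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab1")

print(short.netloc)             # '' -> empty authority, resolved via fs.defaultFS
print(full.netloc)              # 'cdh0:8020' -> explicit NameNode host and port
print(short.path == full.path)  # True -> only the authority differs
```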
15/02/26 17:13:58 INFO MemoryStore: ensureFreeSpace(259058) called with curMem=560418, maxMem=278302556
15/02/26 17:13:58 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 253.0 KB, free 264.6 MB)
15/02/26 17:13:58 INFO MemoryStore: ensureFreeSpace(21187) called with curMem=819476, maxMem=278302556
15/02/26 17:13:58 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 20.7 KB, free 264.6 MB)
15/02/26 17:13:58 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:47344 (size: 20.7 KB, free: 265.3 MB)
15/02/26 17:13:58 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0
15/02/26 17:13:58 INFO SparkContext: Created broadcast 2 from textFile at <console>:12
file: org.apache.spark.rdd.RDD[String] = hdfs:///user/hive/warehouse/db1.db/tab1 MappedRDD[5] at textFile at <console>:12

scala> file.count()
15/02/26 17:14:00 INFO FileInputFormat: Total input paths to process : 2
15/02/26 17:14:00 INFO SparkContext: Starting job: count at <console>:15
15/02/26 17:14:00 INFO DAGScheduler: Got job 0 (count at <console>:15) with 9 output partitions (allowLocal=false)
15/02/26 17:14:00 INFO DAGScheduler: Final stage: Stage 0(count at <console>:15)
15/02/26 17:14:00 INFO DAGScheduler: Parents of final stage: List()
15/02/26 17:14:00 INFO DAGScheduler: Missing parents: List()
15/02/26 17:14:00 INFO DAGScheduler: Submitting Stage 0 (hdfs:///user/hive/warehouse/db1.db/tab1 MappedRDD[5] at textFile at <console>:12), which has no missing parents
15/02/26 17:14:00 INFO MemoryStore: ensureFreeSpace(2544) called with curMem=840663, maxMem=278302556
15/02/26 17:14:00 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 2.5 KB, free 264.6 MB)
15/02/26 17:14:00 INFO MemoryStore: ensureFreeSpace(1606) called with curMem=843207, maxMem=278302556
15/02/26 17:14:00 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 1606.0 B, free 264.6 MB)
15/02/26 17:14:00 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:47344 (size: 1606.0 B, free: 265.3 MB)
15/02/26 17:14:00 INFO BlockManagerMaster: Updated info of block broadcast_3_piece0
15/02/26 17:14:00 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:838
15/02/26 17:14:00 INFO DAGScheduler: Submitting 9 missing tasks from Stage 0 (hdfs:///user/hive/warehouse/db1.db/tab1 MappedRDD[5] at textFile at <console>:12)
15/02/26 17:14:00 INFO TaskSchedulerImpl: Adding task set 0.0 with 9 tasks
15/02/26 17:14:00 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, ANY, 1365 bytes)
15/02/26 17:14:00 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, ANY, 1365 bytes)
15/02/26 17:14:00 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, localhost, ANY, 1365 bytes)
15/02/26 17:14:00 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, localhost, ANY, 1320 bytes)
15/02/26 17:14:00 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, localhost, ANY, 1320 bytes)
15/02/26 17:14:00 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, localhost, ANY, 1320 bytes)
15/02/26 17:14:00 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, localhost, ANY, 1320 bytes)
15/02/26 17:14:00 INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID 7, localhost, ANY, 1320 bytes)
15/02/26 17:14:00 INFO TaskSetManager: Starting task 8.0 in stage 0.0 (TID 8, localhost, ANY, 1320 bytes)
15/02/26 17:14:00 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/02/26 17:14:00 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
15/02/26 17:14:00 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
15/02/26 17:14:00 INFO Executor: Running task 4.0 in stage 0.0 (TID 4)
15/02/26 17:14:00 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
15/02/26 17:14:00 INFO Executor: Running task 5.0 in stage 0.0 (TID 5)
15/02/26 17:14:00 INFO Executor: Running task 6.0 in stage 0.0 (TID 6)
15/02/26 17:14:00 INFO Executor: Running task 7.0 in stage 0.0 (TID 7)
15/02/26 17:14:00 INFO Executor: Running task 8.0 in stage 0.0 (TID 8)
15/02/26 17:14:00 INFO HadoopRDD: Input split: hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab1/xab.csv:0+134217728
15/02/26 17:14:00 INFO HadoopRDD: Input split: hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab1/xab.csv:402653184+134217728
15/02/26 17:14:00 INFO HadoopRDD: Input split: hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab1/7a4d0328b98fcf78-8052e0dbd8c225b6_2104527952_data.0.:134217728+134217728
15/02/26 17:14:00 INFO HadoopRDD: Input split: hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab1/xab.csv:134217728+134217728
15/02/26 17:14:00 INFO HadoopRDD: Input split: hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab1/xab.csv:268435456+134217728
15/02/26 17:14:00 INFO HadoopRDD: Input split: hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab1/xab.csv:536870912+134217728
15/02/26 17:14:00 INFO HadoopRDD: Input split: hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab1/xab.csv:671088640+94886676
15/02/26 17:14:00 INFO HadoopRDD: Input split: hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab1/7a4d0328b98fcf78-8052e0dbd8c225b6_2104527952_data.0.:268435456+23309947
15/02/26 17:14:00 INFO HadoopRDD: Input split: hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab1/7a4d0328b98fcf78-8052e0dbd8c225b6_2104527952_data.0.:0+134217728
15/02/26 17:14:00 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/02/26 17:14:00 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/02/26 17:14:00 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/02/26 17:14:00 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/02/26 17:14:00 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/02/26 17:14:01 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 1757 bytes result sent to driver
15/02/26 17:14:01 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 999 ms on localhost (1/9)
15/02/26 17:14:01 INFO Executor: Finished task 8.0 in stage 0.0 (TID 8). 1812 bytes result sent to driver
15/02/26 17:14:01 INFO TaskSetManager: Finished task 8.0 in stage 0.0 (TID 8) in 1688 ms on localhost (2/9)
15/02/26 17:14:03 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1812 bytes result sent to driver
15/02/26 17:14:03 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 2810 ms on localhost (3/9)
15/02/26 17:14:03 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1812 bytes result sent to driver
15/02/26 17:14:03 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 2849 ms on localhost (4/9)
15/02/26 17:14:03 INFO Executor: Finished task 5.0 in stage 0.0 (TID 5). 1812 bytes result sent to driver
15/02/26 17:14:03 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 3049 ms on localhost (5/9)
15/02/26 17:14:03 INFO Executor: Finished task 7.0 in stage 0.0 (TID 7). 1812 bytes result sent to driver
15/02/26 17:14:03 INFO TaskSetManager: Finished task 7.0 in stage 0.0 (TID 7) in 3083 ms on localhost (6/9)
15/02/26 17:14:03 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 1812 bytes result sent to driver
15/02/26 17:14:03 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 3146 ms on localhost (7/9)
15/02/26 17:14:03 INFO Executor: Finished task 6.0 in stage 0.0 (TID 6). 1812 bytes result sent to driver
15/02/26 17:14:03 INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 3390 ms on localhost (8/9)
15/02/26 17:14:03 INFO Executor: Finished task 4.0 in stage 0.0 (TID 4). 1812 bytes result sent to driver
15/02/26 17:14:03 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 3494 ms on localhost (9/9)
15/02/26 17:14:03 INFO DAGScheduler: Stage 0 (count at <console>:15) finished in 3.529 s
15/02/26 17:14:03 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/02/26 17:14:03 INFO DAGScheduler: Job 0 finished: count at <console>:15, took 3.711564 s
res5: Long = 4465132
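The nine tasks line up with the HDFS input splits of the two files under tab1: with a 128 MB split size (the 134217728-byte offsets in the Input split log lines), the 291745403-byte Impala data file yields 3 splits and the 765975316-byte xab.csv yields 6. A minimal sketch of that arithmetic (it ignores FileInputFormat's 10% slop allowance, which does not change the result for these sizes):

```python
import math

SPLIT_SIZE = 134217728  # 128 MB, matching the offsets in the Input split log lines

def num_splits(file_sizes):
    # Each file is split independently into ceil(size / SPLIT_SIZE) pieces.
    return sum(math.ceil(size / SPLIT_SIZE) for size in file_sizes)

# tab1: 291745403-byte data file (3 splits) + 765975316-byte xab.csv (6 splits)
print(num_splits([291745403, 765975316]))  # 9, matching "9 output partitions"
```

The same arithmetic predicts the 7 tasks for tab1_text and the 8 tasks for tab2 seen below.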


[root@cdh0 ~]# hadoop dfs -ls /user/hive/warehouse/db1.db/tab1_text
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Found 2 items
-rw-r--r--   3 root hive   99936802 2015-01-21 13:14 /user/hive/warehouse/db1.db/tab1_text/tab1_text.txt
-rw-r--r--   3 root hive  742707621 2015-01-22 09:18 /user/hive/warehouse/db1.db/tab1_text/xaa.csv

[cdh3:21000] > select count(*) from tab1_text;
Query: select count(*) from tab1_text
+----------+
| count(*) |
+----------+
| 3397064  |
+----------+

scala> var file=sc.textFile("hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab1_text")
15/02/26 17:17:04 INFO MemoryStore: ensureFreeSpace(259058) called with curMem=1129220, maxMem=278302556
15/02/26 17:17:04 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 253.0 KB, free 264.1 MB)
15/02/26 17:17:05 INFO MemoryStore: ensureFreeSpace(21187) called with curMem=1388278, maxMem=278302556
15/02/26 17:17:05 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 20.7 KB, free 264.1 MB)
15/02/26 17:17:05 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on localhost:47344 (size: 20.7 KB, free: 265.3 MB)
15/02/26 17:17:05 INFO BlockManagerMaster: Updated info of block broadcast_6_piece0
15/02/26 17:17:05 INFO SparkContext: Created broadcast 6 from textFile at <console>:12
file: org.apache.spark.rdd.RDD[String] = hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab1_text MappedRDD[9] at textFile at <console>:12

scala> file.count()
15/02/26 17:17:05 INFO FileInputFormat: Total input paths to process : 2
15/02/26 17:17:05 INFO SparkContext: Starting job: count at <console>:15
15/02/26 17:17:05 INFO DAGScheduler: Got job 2 (count at <console>:15) with 7 output partitions (allowLocal=false)
15/02/26 17:17:05 INFO DAGScheduler: Final stage: Stage 2(count at <console>:15)
15/02/26 17:17:05 INFO DAGScheduler: Parents of final stage: List()
15/02/26 17:17:05 INFO DAGScheduler: Missing parents: List()
15/02/26 17:17:05 INFO DAGScheduler: Submitting Stage 2 (hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab1_text MappedRDD[9] at textFile at <console>:12), which has no missing parents
15/02/26 17:17:05 INFO MemoryStore: ensureFreeSpace(2560) called with curMem=1409465, maxMem=278302556
15/02/26 17:17:05 INFO MemoryStore: Block broadcast_7 stored as values in memory (estimated size 2.5 KB, free 264.1 MB)
15/02/26 17:17:05 INFO MemoryStore: ensureFreeSpace(1617) called with curMem=1412025, maxMem=278302556
15/02/26 17:17:05 INFO MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 1617.0 B, free 264.1 MB)
15/02/26 17:17:05 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory on localhost:47344 (size: 1617.0 B, free: 265.3 MB)
15/02/26 17:17:05 INFO BlockManagerMaster: Updated info of block broadcast_7_piece0
15/02/26 17:17:05 INFO SparkContext: Created broadcast 7 from broadcast at DAGScheduler.scala:838
15/02/26 17:17:05 INFO DAGScheduler: Submitting 7 missing tasks from Stage 2 (hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab1_text MappedRDD[9] at textFile at <console>:12)
15/02/26 17:17:05 INFO TaskSchedulerImpl: Adding task set 2.0 with 7 tasks
15/02/26 17:17:05 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 18, localhost, ANY, 1331 bytes)
15/02/26 17:17:05 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 19, localhost, ANY, 1325 bytes)
15/02/26 17:17:05 INFO TaskSetManager: Starting task 2.0 in stage 2.0 (TID 20, localhost, ANY, 1325 bytes)
15/02/26 17:17:05 INFO TaskSetManager: Starting task 3.0 in stage 2.0 (TID 21, localhost, ANY, 1325 bytes)
15/02/26 17:17:05 INFO TaskSetManager: Starting task 4.0 in stage 2.0 (TID 22, localhost, ANY, 1325 bytes)
15/02/26 17:17:05 INFO TaskSetManager: Starting task 5.0 in stage 2.0 (TID 23, localhost, ANY, 1325 bytes)
15/02/26 17:17:05 INFO TaskSetManager: Starting task 6.0 in stage 2.0 (TID 24, localhost, ANY, 1325 bytes)
15/02/26 17:17:05 INFO Executor: Running task 2.0 in stage 2.0 (TID 20)
15/02/26 17:17:05 INFO Executor: Running task 1.0 in stage 2.0 (TID 19)
15/02/26 17:17:05 INFO Executor: Running task 6.0 in stage 2.0 (TID 24)
15/02/26 17:17:05 INFO Executor: Running task 4.0 in stage 2.0 (TID 22)
15/02/26 17:17:05 INFO Executor: Running task 3.0 in stage 2.0 (TID 21)
15/02/26 17:17:05 INFO Executor: Running task 5.0 in stage 2.0 (TID 23)
15/02/26 17:17:05 INFO Executor: Running task 0.0 in stage 2.0 (TID 18)
15/02/26 17:17:05 INFO HadoopRDD: Input split: hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab1_text/xaa.csv:0+134217728
15/02/26 17:17:05 INFO HadoopRDD: Input split: hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab1_text/xaa.csv:268435456+134217728
15/02/26 17:17:05 INFO HadoopRDD: Input split: hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab1_text/xaa.csv:536870912+134217728
15/02/26 17:17:05 INFO HadoopRDD: Input split: hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab1_text/xaa.csv:671088640+71618981
15/02/26 17:17:05 INFO HadoopRDD: Input split: hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab1_text/xaa.csv:402653184+134217728
15/02/26 17:17:05 INFO HadoopRDD: Input split: hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab1_text/xaa.csv:134217728+134217728
15/02/26 17:17:05 INFO HadoopRDD: Input split: hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab1_text/tab1_text.txt:0+99936802
15/02/26 17:17:07 INFO Executor: Finished task 6.0 in stage 2.0 (TID 24). 1975 bytes result sent to driver
15/02/26 17:17:07 INFO TaskSetManager: Finished task 6.0 in stage 2.0 (TID 24) in 1535 ms on localhost (1/7)
15/02/26 17:17:07 INFO Executor: Finished task 1.0 in stage 2.0 (TID 19). 1975 bytes result sent to driver
15/02/26 17:17:07 INFO TaskSetManager: Finished task 1.0 in stage 2.0 (TID 19) in 1595 ms on localhost (2/7)
15/02/26 17:17:07 INFO Executor: Finished task 4.0 in stage 2.0 (TID 22). 1975 bytes result sent to driver
15/02/26 17:17:07 INFO TaskSetManager: Finished task 4.0 in stage 2.0 (TID 22) in 1814 ms on localhost (3/7)
15/02/26 17:17:07 INFO Executor: Finished task 5.0 in stage 2.0 (TID 23). 1975 bytes result sent to driver
15/02/26 17:17:07 INFO TaskSetManager: Finished task 5.0 in stage 2.0 (TID 23) in 2095 ms on localhost (4/7)
15/02/26 17:17:08 INFO Executor: Finished task 3.0 in stage 2.0 (TID 21). 1975 bytes result sent to driver
15/02/26 17:17:08 INFO TaskSetManager: Finished task 3.0 in stage 2.0 (TID 21) in 2731 ms on localhost (5/7)
15/02/26 17:17:08 INFO Executor: Finished task 2.0 in stage 2.0 (TID 20). 1975 bytes result sent to driver
15/02/26 17:17:08 INFO TaskSetManager: Finished task 2.0 in stage 2.0 (TID 20) in 2879 ms on localhost (6/7)
15/02/26 17:17:08 INFO Executor: Finished task 0.0 in stage 2.0 (TID 18). 1975 bytes result sent to driver
15/02/26 17:17:08 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 18) in 3230 ms on localhost (7/7)
15/02/26 17:17:08 INFO DAGScheduler: Stage 2 (count at <console>:15) finished in 3.231 s
15/02/26 17:17:08 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
15/02/26 17:17:08 INFO DAGScheduler: Job 2 finished: count at <console>:15, took 3.248941 s
res7: Long = 3397064

scala> file.first()
15/02/26 17:25:25 INFO SparkContext: Starting job: first at <console>:15
15/02/26 17:25:25 INFO DAGScheduler: Got job 3 (first at <console>:15) with 1 output partitions (allowLocal=true)
15/02/26 17:25:25 INFO DAGScheduler: Final stage: Stage 3(first at <console>:15)
15/02/26 17:25:25 INFO DAGScheduler: Parents of final stage: List()
15/02/26 17:25:25 INFO DAGScheduler: Missing parents: List()
15/02/26 17:25:25 INFO DAGScheduler: Submitting Stage 3 (hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab1_text MappedRDD[9] at textFile at <console>:12), which has no missing parents
15/02/26 17:25:25 INFO MemoryStore: ensureFreeSpace(2584) called with curMem=1413642, maxMem=278302556
15/02/26 17:25:25 INFO MemoryStore: Block broadcast_8 stored as values in memory (estimated size 2.5 KB, free 264.1 MB)
15/02/26 17:25:25 INFO MemoryStore: ensureFreeSpace(1635) called with curMem=1416226, maxMem=278302556
15/02/26 17:25:25 INFO MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 1635.0 B, free 264.1 MB)
15/02/26 17:25:25 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on localhost:47344 (size: 1635.0 B, free: 265.3 MB)
15/02/26 17:25:25 INFO BlockManagerMaster: Updated info of block broadcast_8_piece0
15/02/26 17:25:25 INFO SparkContext: Created broadcast 8 from broadcast at DAGScheduler.scala:838
15/02/26 17:25:25 INFO DAGScheduler: Submitting 1 missing tasks from Stage 3 (hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab1_text MappedRDD[9] at textFile at <console>:12)
15/02/26 17:25:25 INFO TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
15/02/26 17:25:25 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 25, localhost, ANY, 1331 bytes)
15/02/26 17:25:25 INFO Executor: Running task 0.0 in stage 3.0 (TID 25)
15/02/26 17:25:25 INFO HadoopRDD: Input split: hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab1_text/tab1_text.txt:0+99936802
15/02/26 17:25:25 INFO Executor: Finished task 0.0 in stage 3.0 (TID 25). 3043 bytes result sent to driver
15/02/26 17:25:25 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 25) in 72 ms on localhost (1/1)
15/02/26 17:25:25 INFO DAGScheduler: Stage 3 (first at <console>:15) finished in 0.074 s
15/02/26 17:25:25 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
15/02/26 17:25:25 INFO DAGScheduler: Job 3 finished: first at <console>:15, took 0.090465 s
res8: String = 2014-06-24 00:00:00#2010100016003410#2010100106010600#2010100106000000#2010004000000000#41050#10611#10#锟斤拷锟斤拷#锟斤拷锟斤拷锟A1#WBJ00026A1#106#锟斤拷锟斤拷锟WBJ00026#106#ECR91#ECR91#ZY0801#锟斤拷锟斤拷锟斤拷#锟斤拷锟斤拷锟斤拷#MSC1#101##GGSN0#SGSN5#1#90#11#10611#0#282#1#116.376599#39.966201#10611#35#25#340#4#2#6##锟斤拷一平台#双锟斤拷锟斤拷#W###1#锟斤拷锟竭硷拷#3206#S444#430#1#1#锟斤拷讯#20090702#20100001#D+W#0#锟斤拷锟斤拷#0#192##锟斤拷锟斤拷锟斤拷锟斤拷锟斤拷锟斤拷锟斤拷锟斤拷路18锟斤拷楼锟斤拷#锟斤拷通#8#65#18#10713#9763###240#144###33#43#25#25#28###4#1#0##0#1#1#1#3#1#1#1#2####2009-05-17 00:00:00##锟斤拷#鲁锟斤拷4锟斤拷锟斤拷锟竭伙拷##锟睫帮拷装#锟睫帮拷装#锟睫帮拷装#锟睫帮拷装#锟睫帮拷装#1.2#1.2#15#15##W#锟斤拷锟斤拷锟斤拷锟斤拷锟斤拷#2010#7/8#锟斤拷锟斤拷锟斤拷#唯一VENDOR+CITY_ID+CI+CLT_CELL_NAME锟斤拷EXG_NE_CELL_W锟斤拷#45#锟斤拷太平庄锟斤拷#锟斤拷锟秸##锟斤拷锟斤拷###锟斤拷锟截革拷锟斤拷####MSS1#MGW1-1##锟杰硷拷锟斤拷锟斤拷#锟斤拷锟斤拷######1###MSC1#ODV-065R18K#8000#8000#0#20#1#2000#310#155#2000#1#2014-02-27 14:41:47#锟斤拷锟斤拷#锟斤拷#101#MGW1-1#0#...
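The runs of 锟斤拷 in that first row are not real data but the signature of an encoding mix-up: sc.textFile decodes bytes as UTF-8, so bytes from some other encoding become U+FFFD replacement characters, and when the UTF-8 bytes of those replacement characters are in turn rendered as GBK they come out as exactly 锟斤拷. A small Python illustration (that the original table data was GBK-encoded is an assumption):

```python
# "锟斤拷" is the classic artifact of a double encoding mix-up:
# undecodable bytes become U+FFFD (UTF-8 bytes EF BF BD), and a pair of those
# three-byte sequences, re-read as GBK two-byte characters, decodes to 锟斤拷.
replacement = "\ufffd" * 2               # two U+FFFD replacement characters
as_utf8 = replacement.encode("utf-8")    # b'\xef\xbf\xbd\xef\xbf\xbd'
print(as_utf8.decode("gbk"))             # 锟斤拷
```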
scala>

[root@cdh0 ~]# hadoop dfs -ls /user/hive/warehouse/db1.db/tab2
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Found 3 items
-rw-r--r--   3 impala hive  291745403 2015-01-23 11:37 /user/hive/warehouse/db1.db/tab2/c741ac886827620c-a143f0444264186_750804013_data.0.
-rw-r--r--   3 impala hive  291811534 2015-01-23 11:37 /user/hive/warehouse/db1.db/tab2/c741ac886827620c-a143f0444264187_45184522_data.0.
-rw-r--r--   3 impala hive  223643301 2015-01-23 11:37 /user/hive/warehouse/db1.db/tab2/c741ac886827620c-a143f0444264188_2007540501_data.0.


[cdh3:21000] > select count(*) from tab2;
Query: select count(*) from tab2
+----------+
| count(*) |
+----------+
| 3279912  |
+----------+
Fetched 1 row(s) in 3.84s

scala> var file=sc.textFile("hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab2")
15/02/26 17:29:58 INFO MemoryStore: ensureFreeSpace(259058) called with curMem=1417861, maxMem=278302556
15/02/26 17:29:58 INFO MemoryStore: Block broadcast_9 stored as values in memory (estimated size 253.0 KB, free 263.8 MB)
15/02/26 17:29:58 INFO MemoryStore: ensureFreeSpace(21187) called with curMem=1676919, maxMem=278302556
15/02/26 17:29:58 INFO MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 20.7 KB, free 263.8 MB)
15/02/26 17:29:58 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on localhost:47344 (size: 20.7 KB, free: 265.3 MB)
15/02/26 17:29:58 INFO BlockManagerMaster: Updated info of block broadcast_9_piece0
15/02/26 17:29:58 INFO SparkContext: Created broadcast 9 from textFile at <console>:12
file: org.apache.spark.rdd.RDD[String] = hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab2 MappedRDD[11] at textFile at <console>:12

scala> file.count()
15/02/26 17:29:59 INFO FileInputFormat: Total input paths to process : 3
15/02/26 17:29:59 INFO SparkContext: Starting job: count at <console>:15
15/02/26 17:29:59 INFO DAGScheduler: Got job 4 (count at <console>:15) with 8 output partitions (allowLocal=false)
15/02/26 17:29:59 INFO DAGScheduler: Final stage: Stage 4(count at <console>:15)
15/02/26 17:29:59 INFO DAGScheduler: Parents of final stage: List()
15/02/26 17:29:59 INFO DAGScheduler: Missing parents: List()
15/02/26 17:29:59 INFO DAGScheduler: Submitting Stage 4 (hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab2 MappedRDD[11] at textFile at <console>:12), which has no missing parents
15/02/26 17:29:59 INFO MemoryStore: ensureFreeSpace(2552) called with curMem=1698106, maxMem=278302556
15/02/26 17:29:59 INFO MemoryStore: Block broadcast_10 stored as values in memory (estimated size 2.5 KB, free 263.8 MB)
15/02/26 17:29:59 INFO MemoryStore: ensureFreeSpace(1617) called with curMem=1700658, maxMem=278302556
15/02/26 17:29:59 INFO MemoryStore: Block broadcast_10_piece0 stored as bytes in memory (estimated size 1617.0 B, free 263.8 MB)
15/02/26 17:29:59 INFO BlockManagerInfo: Added broadcast_10_piece0 in memory on localhost:47344 (size: 1617.0 B, free: 265.3 MB)
15/02/26 17:29:59 INFO BlockManagerMaster: Updated info of block broadcast_10_piece0
15/02/26 17:29:59 INFO SparkContext: Created broadcast 10 from broadcast at DAGScheduler.scala:838
15/02/26 17:29:59 INFO DAGScheduler: Submitting 8 missing tasks from Stage 4 (hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab2 MappedRDD[11] at textFile at <console>:12)
15/02/26 17:29:59 INFO TaskSchedulerImpl: Adding task set 4.0 with 8 tasks
15/02/26 17:29:59 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 26, localhost, ANY, 1363 bytes)
15/02/26 17:29:59 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 27, localhost, ANY, 1363 bytes)
15/02/26 17:29:59 INFO TaskSetManager: Starting task 2.0 in stage 4.0 (TID 28, localhost, ANY, 1363 bytes)
15/02/26 17:29:59 INFO TaskSetManager: Starting task 3.0 in stage 4.0 (TID 29, localhost, ANY, 1362 bytes)
15/02/26 17:29:59 INFO TaskSetManager: Starting task 4.0 in stage 4.0 (TID 30, localhost, ANY, 1362 bytes)
15/02/26 17:29:59 INFO TaskSetManager: Starting task 5.0 in stage 4.0 (TID 31, localhost, ANY, 1362 bytes)
15/02/26 17:29:59 INFO TaskSetManager: Starting task 6.0 in stage 4.0 (TID 32, localhost, ANY, 1364 bytes)
15/02/26 17:29:59 INFO TaskSetManager: Starting task 7.0 in stage 4.0 (TID 33, localhost, ANY, 1364 bytes)
15/02/26 17:29:59 INFO Executor: Running task 0.0 in stage 4.0 (TID 26)
15/02/26 17:29:59 INFO Executor: Running task 1.0 in stage 4.0 (TID 27)
15/02/26 17:29:59 INFO Executor: Running task 2.0 in stage 4.0 (TID 28)
15/02/26 17:29:59 INFO Executor: Running task 3.0 in stage 4.0 (TID 29)
15/02/26 17:29:59 INFO Executor: Running task 4.0 in stage 4.0 (TID 30)
15/02/26 17:29:59 INFO Executor: Running task 5.0 in stage 4.0 (TID 31)
15/02/26 17:29:59 INFO Executor: Running task 6.0 in stage 4.0 (TID 32)
15/02/26 17:29:59 INFO Executor: Running task 7.0 in stage 4.0 (TID 33)
15/02/26 17:29:59 INFO HadoopRDD: Input split: hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab2/c741ac886827620c-a143f0444264186_750804013_data.0.:268435456+23309947
15/02/26 17:29:59 INFO HadoopRDD: Input split: hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab2/c741ac886827620c-a143f0444264186_750804013_data.0.:134217728+134217728
15/02/26 17:29:59 INFO HadoopRDD: Input split: hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab2/c741ac886827620c-a143f0444264187_45184522_data.0.:0+134217728
15/02/26 17:29:59 INFO HadoopRDD: Input split: hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab2/c741ac886827620c-a143f0444264188_2007540501_data.0.:134217728+89425573
15/02/26 17:29:59 INFO HadoopRDD: Input split: hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab2/c741ac886827620c-a143f0444264188_2007540501_data.0.:0+134217728
15/02/26 17:29:59 INFO HadoopRDD: Input split: hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab2/c741ac886827620c-a143f0444264186_750804013_data.0.:0+134217728
15/02/26 17:29:59 INFO HadoopRDD: Input split: hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab2/c741ac886827620c-a143f0444264187_45184522_data.0.:134217728+134217728
15/02/26 17:29:59 INFO HadoopRDD: Input split: hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab2/c741ac886827620c-a143f0444264187_45184522_data.0.:268435456+23376078
15/02/26 17:29:59 INFO BlockManager: Removing broadcast 8
15/02/26 17:29:59 INFO BlockManager: Removing block broadcast_8
15/02/26 17:29:59 INFO MemoryStore: Block broadcast_8 of size 2584 dropped from memory (free 276602865)
15/02/26 17:29:59 INFO BlockManager: Removing block broadcast_8_piece0
15/02/26 17:29:59 INFO MemoryStore: Block broadcast_8_piece0 of size 1635 dropped from memory (free 276604500)
15/02/26 17:29:59 INFO BlockManagerInfo: Removed broadcast_8_piece0 on localhost:47344 in memory (size: 1635.0 B, free: 265.3 MB)
15/02/26 17:29:59 INFO BlockManagerMaster: Updated info of block broadcast_8_piece0
15/02/26 17:29:59 INFO ContextCleaner: Cleaned broadcast 8
15/02/26 17:29:59 INFO Executor: Finished task 5.0 in stage 4.0 (TID 31). 1812 bytes result sent to driver
15/02/26 17:29:59 INFO TaskSetManager: Finished task 5.0 in stage 4.0 (TID 31) in 198 ms on localhost (1/8)
15/02/26 17:30:00 INFO Executor: Finished task 2.0 in stage 4.0 (TID 28). 1812 bytes result sent to driver
15/02/26 17:30:00 INFO TaskSetManager: Finished task 2.0 in stage 4.0 (TID 28) in 888 ms on localhost (2/8)
15/02/26 17:30:00 INFO Executor: Finished task 3.0 in stage 4.0 (TID 29). 1812 bytes result sent to driver
15/02/26 17:30:00 INFO TaskSetManager: Finished task 3.0 in stage 4.0 (TID 29) in 1756 ms on localhost (3/8)
15/02/26 17:30:00 INFO Executor: Finished task 6.0 in stage 4.0 (TID 32). 1812 bytes result sent to driver
15/02/26 17:30:00 INFO TaskSetManager: Finished task 6.0 in stage 4.0 (TID 32) in 1841 ms on localhost (4/8)
15/02/26 17:30:01 INFO Executor: Finished task 7.0 in stage 4.0 (TID 33). 1812 bytes result sent to driver
15/02/26 17:30:01 INFO TaskSetManager: Finished task 7.0 in stage 4.0 (TID 33) in 2223 ms on localhost (5/8)
15/02/26 17:30:01 INFO Executor: Finished task 1.0 in stage 4.0 (TID 27). 1812 bytes result sent to driver
15/02/26 17:30:01 INFO TaskSetManager: Finished task 1.0 in stage 4.0 (TID 27) in 2569 ms on localhost (6/8)
15/02/26 17:30:01 INFO Executor: Finished task 0.0 in stage 4.0 (TID 26). 1812 bytes result sent to driver
15/02/26 17:30:01 INFO TaskSetManager: Finished task 0.0 in stage 4.0 (TID 26) in 2732 ms on localhost (7/8)
15/02/26 17:30:02 INFO Executor: Finished task 4.0 in stage 4.0 (TID 30). 1812 bytes result sent to driver
15/02/26 17:30:02 INFO TaskSetManager: Finished task 4.0 in stage 4.0 (TID 30) in 3238 ms on localhost (8/8)
15/02/26 17:30:02 INFO DAGScheduler: Stage 4 (count at <console>:15) finished in 3.242 s
15/02/26 17:30:02 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool
15/02/26 17:30:02 INFO DAGScheduler: Job 4 finished: count at <console>:15, took 3.259944 s
res12: Long = 3279912

scala>
scala> file.first()
15/02/26 17:30:11 INFO SparkContext: Starting job: first at <console>:15
15/02/26 17:30:11 INFO DAGScheduler: Got job 5 (first at <console>:15) with 1 output partitions (allowLocal=true)
15/02/26 17:30:11 INFO DAGScheduler: Final stage: Stage 5(first at <console>:15)
15/02/26 17:30:11 INFO DAGScheduler: Parents of final stage: List()
15/02/26 17:30:11 INFO DAGScheduler: Missing parents: List()
15/02/26 17:30:11 INFO DAGScheduler: Submitting Stage 5 (hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab2 MappedRDD[11] at textFile at <console>:12), which has no missing parents
15/02/26 17:30:11 INFO MemoryStore: ensureFreeSpace(2584) called with curMem=1698056, maxMem=278302556
15/02/26 17:30:11 INFO MemoryStore: Block broadcast_11 stored as values in memory (estimated size 2.5 KB, free 263.8 MB)
15/02/26 17:30:11 INFO MemoryStore: ensureFreeSpace(1635) called with curMem=1700640, maxMem=278302556
15/02/26 17:30:11 INFO MemoryStore: Block broadcast_11_piece0 stored as bytes in memory (estimated size 1635.0 B, free 263.8 MB)
15/02/26 17:30:11 INFO BlockManagerInfo: Added broadcast_11_piece0 in memory on localhost:47344 (size: 1635.0 B, free: 265.3 MB)
15/02/26 17:30:11 INFO BlockManagerMaster: Updated info of block broadcast_11_piece0
15/02/26 17:30:11 INFO SparkContext: Created broadcast 11 from broadcast at DAGScheduler.scala:838
15/02/26 17:30:11 INFO DAGScheduler: Submitting 1 missing tasks from Stage 5 (hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab2 MappedRDD[11] at textFile at <console>:12)
15/02/26 17:30:11 INFO TaskSchedulerImpl: Adding task set 5.0 with 1 tasks
15/02/26 17:30:11 INFO TaskSetManager: Starting task 0.0 in stage 5.0 (TID 34, localhost, ANY, 1363 bytes)
15/02/26 17:30:11 INFO Executor: Running task 0.0 in stage 5.0 (TID 34)
15/02/26 17:30:11 INFO HadoopRDD: Input split: hdfs://cdh0:8020/user/hive/warehouse/db1.db/tab2/c741ac886827620c-a143f0444264186_750804013_data.0.:0+134217728
15/02/26 17:30:11 INFO Executor: Finished task 0.0 in stage 5.0 (TID 34). 2053 bytes result sent to driver
15/02/26 17:30:11 INFO TaskSetManager: Finished task 0.0 in stage 5.0 (TID 34) in 61 ms on localhost (1/1)
15/02/26 17:30:11 INFO DAGScheduler: Stage 5 (first at <console>:15) finished in 0.061 s
15/02/26 17:30:11 INFO TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool
15/02/26 17:30:11 INFO DAGScheduler: Job 5 finished: first at <console>:15, took 0.078176 s
res13: String = 2014-07-18 23:12:03,0,0,"460011253616120                                 ",\N,\N,\N,143,30732,2,\N,1,1,467,-95.0,-6.0,\N,9738,10688,\N,2,"4;30732;null;-6.0;-95;null;2080100143030700;|1;30722;466;-23.0;-112;null;2080100143030701;",116.3434920,40.2198250,\N,2080100143030700,
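In the tab2 row above, \N is the default NULL marker that Hive and Impala write into text-format table data. When processing such files with plain sc.textFile, it is worth mapping that sentinel back to a real null; a minimal Python sketch of the idea (the sample fields are shortened from the row above):

```python
HIVE_NULL = r"\N"  # default serialization.null.format for Hive/Impala text tables

def parse_line(line, sep=","):
    # Split a text-table row on the delimiter and map the \N sentinel to None.
    return [None if field == HIVE_NULL else field for field in line.split(sep)]

row = parse_line(r"2014-07-18 23:12:03,0,\N,143,\N")
print(row)  # ['2014-07-18 23:12:03', '0', None, '143', None]
```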

scala> 15/02/26 18:54:46 INFO BlockManager: Removing broadcast 11
15/02/26 18:54:46 INFO BlockManager: Removing block broadcast_11
15/02/26 18:54:46 INFO MemoryStore: Block broadcast_11 of size 2584 dropped from memory (free 276602865)
15/02/26 18:54:46 INFO BlockManager: Removing block broadcast_11_piece0
15/02/26 18:54:46 INFO MemoryStore: Block broadcast_11_piece0 of size 1635 dropped from memory (free 276604500)
15/02/26 18:54:46 INFO BlockManagerInfo: Removed broadcast_11_piece0 on localhost:47344 in memory (size: 1635.0 B, free: 265.3 MB)
15/02/26 18:54:46 INFO BlockManagerMaster: Updated info of block broadcast_11_piece0
15/02/26 18:54:46 INFO ContextCleaner: Cleaned broadcast 11


From "ITPUB Blog", original link: http://blog.itpub.net/10037372/viewspace-1442473/. Please credit the source when reposting.
