ITPub博客


A First Try at Hive (Part 1)

Original post · Author: 水鸟eric · 2011-07-11 20:28:38
Scenario:
The SAs want the logs from our DNS servers loaded into a database so that reports can be run against them. The volume is substantial: 8 DNS servers, each producing roughly 300 MB of logs per hour, which works out to about 60 GB per day.

Analysis:
One option is Oracle: compressed tables, automatic partition management, and an hourly sqlldr load. That would shrink the data to roughly 15 GB per day, i.e. compression to about 25% of the original size.

The other option is Hadoop + Hive, and that is the approach tried below.
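As a sanity check on the sizing quoted above, a rough back-of-the-envelope calculation (not a benchmark):

```python
# Rough sizing check for the numbers quoted above.
servers = 8
mb_per_server_per_hour = 300

raw_gb_per_day = servers * mb_per_server_per_hour * 24 / 1024.0
compressed_gb_per_day = raw_gb_per_day * 0.25  # compression to ~25% of original

print(round(raw_gb_per_day, 2))         # ~56 GB, i.e. the "about 60 GB" above
print(round(compressed_gb_per_day, 2))  # ~14 GB, i.e. the "about 15 GB" above
```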

As a first step, feasibility needs to be verified. My test Hadoop cluster has two nodes: hadoop1 as the master node and hadoop2 as a slave, with HDFS replication set to 1.

OK, here are the documents I read before starting:
Hive DDL: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
          https://cwiki.apache.org/confluence/display/Hive/HiveQL
Hive Tutorial: https://cwiki.apache.org/confluence/display/Hive/Tutorial
               https://cwiki.apache.org/confluence/display/Hive/GettingStarted

I won't go into Hive in detail here; the documents above cover it very clearly.

The log format is quite regular; here is a sample line:
2011-07-08 10:00:00 sjcdns08.sjc.xxxx.com named[179]: [ID 873579 local5.info] queries: client 212.93.136.18#22219: query: shop.xxxx.fr IN A -

Since Hive supports external tables, we create the table as follows:

CREATE EXTERNAL TABLE dnslog(
  ts   STRING,
  dnsserver STRING,
  named_port STRING,
  ID STRING,
  Info STRING,
  Client STRING,
  Clientport STRING,
  Query STRING,
  Unknown1 STRING,
  Unknown2 STRING
  )
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([\\d -:]*) ([^ ]*) named\\[([^ ]*)\\]: \\[ID ([^ ]*) ([^ ]*) queries: client ([\\d\\.]*)#([\\d]*): query: ([^ ]*) IN ([^ ]*) ([^ ]*)",
  "output.format.string" = ""
)
STORED AS TEXTFILE
LOCATION '/user/hadoop/dnslog';

Note that /user/hadoop/dnslog here is an HDFS directory, and the regex can be tested at http://www.regexplanet.com/simple/index.html.
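Besides the web tester, the pattern can be checked locally. Python's `re` happens to interpret this particular pattern the same way as the Java regex engine Hive uses, so a quick sketch:

```python
import re

# The same pattern as in the DDL above (Java and Python regexes agree here).
PATTERN = re.compile(
    r"([\d -:]*) ([^ ]*) named\[([^ ]*)\]: \[ID ([^ ]*) ([^ ]*) "
    r"queries: client ([\d\.]*)#([\d]*): query: ([^ ]*) IN ([^ ]*) ([^ ]*)"
)

line = ("2011-07-08 10:00:00 sjcdns08.sjc.xxxx.com named[179]: "
        "[ID 873579 local5.info] queries: client 212.93.136.18#22219: "
        "query: shop.xxxx.fr IN A -")

m = PATTERN.match(line)
print(m.groups())
# Note: the 5th group captures "local5.info]" with a trailing "]",
# a quirk of the original pattern.
```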

hive> show tables;
OK
dnslog
invites
pokes
Time taken: 2.22 seconds
hive> desc dnslog;
OK
ts      string  from deserializer
dnsserver       string  from deserializer
named_port      string  from deserializer
id      string  from deserializer
info    string  from deserializer
client  string  from deserializer
clientport      string  from deserializer
query   string  from deserializer
unknown1        string  from deserializer
unknown2        string  from deserializer
Time taken: 0.198 seconds


Try loading 20 rows of test data: two files with 10 rows each.

hadoop dfs -put dnslog.txt dnslog/
hadoop dfs -put dnslog2.txt dnslog/
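The test files themselves can be generated locally; a minimal sketch, reusing the sample log line above with only the seconds varied:

```python
# Write a small test file of 10 DNS log lines in the format shown above.
lines = [
    "2011-07-08 10:00:{:02d} sjcdns08.sjc.xxxx.com named[179]: "
    "[ID 873579 local5.info] queries: client 212.93.136.18#22219: "
    "query: shop.xxxx.fr IN A -".format(i)
    for i in range(10)
]
with open("dnslog.txt", "w") as f:
    f.write("\n".join(lines) + "\n")
```

after which `hadoop dfs -put dnslog.txt dnslog/` uploads it as above.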

Now run a query.

Remember to add the Hive contrib jar in the Hive shell first:
add jar /oracle/binary/hbase/hive-0.7.1/lib/hive-contrib-0.7.1.jar

hive> select count(*) from dnslog;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapred.reduce.tasks=
Starting Job = job_201107010232_0026, Tracking URL = http://phxrueidb04.phx.xxxx.com:50030/jobdetails.jsp?jobid=job_201107010232_0026
Kill Command = /oracle/binary/hbase/CDH3/hadoop-0.20.2-cdh3u0/bin/hadoop job  -Dmapred.job.tracker=phxrueidb04.phx.xxxx.com:9001 -kill job_201107010232_0026
2011-07-04 22:36:42,942 Stage-1 map = 100%,  reduce = 0%
2011-07-04 22:36:49,984 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201107010232_0026
OK
20
Time taken: 10.293 seconds

As we can see, queries are already returning results.

Next, a time-based GROUP BY query:
hive> select substr(ts,0,10) as tss,count(*) as count1 from dnslog group by substr(ts,0,10);
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapred.reduce.tasks=
Starting Job = job_201107010232_0037, Tracking URL = http://phxrueidb04.phx.ebay.com:50030/jobdetails.jsp?jobid=job_201107010232_0037
Kill Command = /oracle/binary/hbase/CDH3/hadoop-0.20.2-cdh3u0/bin/hadoop job  -Dmapred.job.tracker=phxrueidb04.phx.ebay.com:9001 -kill job_201107010232_0037
2011-07-05 02:20:43,975 Stage-1 map = 0%,  reduce = 0%
2011-07-05 02:20:46,992 Stage-1 map = 40%,  reduce = 0%
2011-07-05 02:20:47,998 Stage-1 map = 60%,  reduce = 0%
2011-07-05 02:20:50,010 Stage-1 map = 100%,  reduce = 0%
2011-07-05 02:20:54,032 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201107010232_0037
OK
NULL    10
2011-06-08      1118517
Time taken: 15.899 seconds
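What this GROUP BY is doing, including where the NULL group comes from, can be mimicked locally. A sketch, assuming RegexSerDe's behavior of returning NULL columns for lines the regex does not match:

```python
import re
from collections import Counter

# Same pattern as in the table DDL above.
PATTERN = re.compile(
    r"([\d -:]*) ([^ ]*) named\[([^ ]*)\]: \[ID ([^ ]*) ([^ ]*) "
    r"queries: client ([\d\.]*)#([\d]*): query: ([^ ]*) IN ([^ ]*) ([^ ]*)"
)

lines = [
    "2011-06-08 11:22:33 sjcdns08.sjc.xxxx.com named[179]: "
    "[ID 873579 local5.info] queries: client 212.93.136.18#22219: "
    "query: shop.xxxx.fr IN A -",
    "some line that does not match the pattern",
]

# Equivalent of: select substr(ts,0,10), count(*) from dnslog
#                group by substr(ts,0,10)
counts = Counter()
for line in lines:
    m = PATTERN.match(line)
    ts = m.group(1)[:10] if m else None  # unmatched line -> NULL ts
    counts[ts] += 1

print(dict(counts))
```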

At this point, the data can be queried and presented as a table. (The NULL group comes from lines the regex failed to match; RegexSerDe returns NULL for every column of an unmatched row.)

Next up: try uploading a large volume of data, and test with both datanodes.
