Accessing DBpedia file content with Titan-Hadoop

Original post | Data Analysis | Author: std1984 | Date: 2014-09-28 17:39:29
Environment: CentOS, Titan-0.5.0-Hadoop2


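Titan-Hadoop can read RDF data in N_TRIPLES format. Download an nt-format file from DBpedia (for example: http://data.dws.informatik.uni-mannheim.de/dbpedia/2014/zh/labels_en_uris_zh.nt.bz2), decompress it, and write a properties file that describes the input, as follows: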
[cloudera@localhost titan-0.5.0-hadoop2]$ vi conf/hadoop/rdf-input.properties

# input graph parameters
titan.hadoop.input.format=com.thinkaurelius.titan.hadoop.formats.edgelist.rdf.RDFInputFormat
titan.hadoop.input.location=examples/labels_en_uris_zh.nt
titan.hadoop.input.conf.format=N_TRIPLES
titan.hadoop.input.conf.as-properties=http://www.w3.org/1999/02/22-rdf-syntax-ns#type
titan.hadoop.input.conf.use-localname=true
titan.hadoop.input.conf.literal-as-property=true


# output data parameters
titan.hadoop.output.format=com.thinkaurelius.titan.hadoop.formats.graphson.GraphSONOutputFormat
titan.hadoop.sideeffect.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
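
To make the input options above more concrete, here is a small Python sketch of how a single N-Triples line could end up as the vertex properties seen in the query output further down. It is an illustration only, not Titan's RDFInputFormat code: the names `local_name` and `triple_to_vertex`, the sample line, and the renaming of `label` to `label_` (inferred from the output shown in this post) are all assumptions.

```python
import re

# "<subject> <predicate> object ." where the object is an IRI or a literal.
# Simplified: escape sequences inside literals are not decoded.
TRIPLE = re.compile(
    r'^<([^>]*)>\s+<([^>]*)>\s+(<[^>]*>|".*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)\s*\.\s*$'
)


def local_name(iri):
    """Return the part after the last '#' or '/' (the idea behind use-localname=true)."""
    return re.split(r"[#/]", iri)[-1]


def triple_to_vertex(line):
    """Map one N-Triples line to a dict of vertex properties (illustrative only)."""
    match = TRIPLE.match(line.strip())
    if match is None:
        raise ValueError("not a recognizable N-Triples line: %r" % line)
    subject, predicate, obj = match.groups()
    vertex = {"uri": subject, "name": local_name(subject)}
    if obj.startswith('"'):
        # The idea behind literal-as-property=true: a literal object is stored
        # as a property on the subject vertex instead of becoming its own vertex.
        key = local_name(predicate)
        if key == "label":
            # Assumption: mirrors the 'label_' key visible in the post's output.
            key = "label_"
        vertex[key] = obj[1:obj.rindex('"')]
    # IRI objects would become edges in a real import; this sketch ignores them.
    return vertex


# Sample line of the assumed shape (not copied from the DBpedia dump).
SAMPLE = ('<http://dbpedia.org/resource/Want> '
          '<http://www.w3.org/2000/01/rdf-schema#label> "慾望"@zh .')

print(triple_to_vertex(SAMPLE))
# {'uri': 'http://dbpedia.org/resource/Want', 'name': 'Want', 'label_': '慾望'}
```

Because the labels file contains only literal objects, a mapping like this produces vertices with properties and no edges, which matches the IN_EDGES_CREATED=0 and OUT_EDGES_CREATED=0 counters in the job log.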

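Before running the query, the input file has to exist at the path given in titan.hadoop.input.location, while the DBpedia download is a .bz2 archive. A minimal Python sketch for decompressing it follows; the helper name `decompress_bz2` and the assumption that the archive sits in the current directory are illustrative, and a plain `bunzip2` on the command line would do the same job.

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2(archive, target, chunk_size=1 << 20):
    """Stream-decompress `archive` into `target` without loading it all into memory."""
    target = Path(target)
    target.parent.mkdir(parents=True, exist_ok=True)
    with bz2.open(archive, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
    return target


if __name__ == "__main__":
    # Assumed paths: the archive in the current directory, the target matching
    # titan.hadoop.input.location from the properties file.
    archive = Path("labels_en_uris_zh.nt.bz2")
    if archive.exists():
        out = decompress_bz2(archive, "examples/labels_en_uris_zh.nt")
        print("wrote", out)
```
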

Query the data:

[cloudera@localhost titan-0.5.0-hadoop2]$ gremlin.sh

gremlin> g = HadoopFactory.open("conf/hadoop/rdf-input.properties")
gremlin> g.V.map()

......

17:37:12 INFO  org.apache.hadoop.mapred.LocalJobRunner  - reduce > reduce
17:37:12 INFO  org.apache.hadoop.mapred.Task  - Task 'attempt_local1370056218_0005_r_000000_0' done.
17:37:13 INFO  org.apache.hadoop.mapreduce.Job  - Job job_local1370056218_0005 completed successfully
17:37:13 INFO  org.apache.hadoop.mapreduce.Job  - Counters: 35
        File System Counters
                FILE: Number of bytes read=2911187173
                FILE: Number of bytes written=3038059762
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
        Map-Reduce Framework
                Map input records=405909
                Map output records=405909
                Map output bytes=65118176
                Map output materialized bytes=66297322
                Input split bytes=268
                Combine input records=405909
                Combine output records=405909
                Reduce input groups=405909
                Reduce shuffle bytes=0
                Reduce input records=405909
                Reduce output records=0
                Spilled Records=811818
                Shuffled Maps =0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=5136
                CPU time spent (ms)=0
                Physical memory (bytes) snapshot=0
                Virtual memory (bytes) snapshot=0
                Total committed heap usage (bytes)=2091909120
        com.thinkaurelius.titan.hadoop.formats.edgelist.EdgeListInputMapReduce$Counters
                IN_EDGES_CREATED=0
                OUT_EDGES_CREATED=0
                VERTEX_PROPERTIES_CREATED=1217727
                VERTICES_CREATED=405909
                VERTICES_EMITTED=405909
        com.thinkaurelius.titan.hadoop.mapreduce.transform.PropertyMapMap$Counters
                VERTICES_PROCESSED=405909
        com.thinkaurelius.titan.hadoop.mapreduce.transform.VerticesMap$Counters
                EDGES_PROCESSED=0
                VERTICES_PROCESSED=405909
        File Input Format Counters
                Bytes Read=54114517
        File Output Format Counters
                Bytes Written=0
==>47994559900176       {label_=[慾望], _id=[47994559900176], name=[Want], uri=[http://dbpedia.org/resource/Want]}
==>60888991522182       {label_=[无机化学命名法], _id=[60888991522182], name=[IUPAC_nomenclature_of_inorganic_chemistry], uri=[http://dbpedia.org/resource/IUPAC_nomenclature_of_inorganic_chemistry]}
==>78841791384159       {label_=[诺伊斯塔特-格莱韦], _id=[78841791384159], name=[Neustadt-Glewe], uri=[http://dbpedia.org/resource/Neustadt-Glewe]}
==>78961407639797       {label_=[打狗英國領事館文化園區], _id=[78961407639797], name=[Former_British_Consulate_at_Takao], uri=[http://dbpedia.org/resource/Former_British_Consulate_at_Takao]}
==>95522075072286       {label_=[賴琳恩], _id=[95522075072286], name=[Lene_Lai], uri=[http://dbpedia.org/resource/Lene_Lai]}
==>153451821264409      {label_=[唐古韭], _id=[153451821264409], name=[Allium_tanguticum], uri=[http://dbpedia.org/resource/Allium_tanguticum]}
==>154857715280524      {label_=[温带], _id=[154857715280524], name=[Temperate_climate], uri=[http://dbpedia.org/resource/Temperate_climate]}
==>166027168671115      {label_=[GSh-18手槍], _id=[166027168671115], name=[GSh-18], uri=[http://dbpedia.org/resource/GSh-18]}
==>166513572484984      {label_=[WMA], _id=[166513572484984], name=[WMA], uri=[http://dbpedia.org/resource/WMA]}
==>182078824443170      {label_=[保罗·纳斯], _id=[182078824443170], name=[Paul_Nurse], uri=[http://dbpedia.org/resource/Paul_Nurse]}
==>211356647821663      {label_=[克魯克斯頓 (明尼蘇達州)], _id=[211356647821663], name=[Crookston,_Minnesota], uri=[http://dbpedia.org/resource/Crookston,_Minnesota]}
==>222227245802710      {label_=[我的女友是九尾狐], _id=[222227245802710], name=[My_Girlfriend_Is_a_Nine-Tailed_Fox], uri=[http://dbpedia.org/resource/My_Girlfriend_Is_a_Nine-Tailed_Fox]}
==>229972043766751      {label_=[李天荣], _id=[229972043766751], name=[Wilson_Lee_Flores], uri=[http://dbpedia.org/resource/Wilson_Lee_Flores]}
==>247488956381743      {label_=[1,2-双(二异丙基膦)乙烷], _id=[247488956381743], name=[1,2-Bis(diisopropylphosphino)ethane], uri=[http://dbpedia.org/resource/1,2-Bis(diisopropylphosphino)ethane]}
==>264200262547493      {label_=[欽迪龍屬], _id=[264200262547493], name=[Chindesaurus], uri=[http://dbpedia.org/resource/Chindesaurus]}
==>...
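
The counters in the job log can be cross-checked against the result rows. Each row shows three properties besides _id (label_, name, uri), and three properties times 405909 vertices gives exactly the reported VERTEX_PROPERTIES_CREATED=1217727. The Python sketch below does this arithmetic on the first result row; `parse_properties` is a hypothetical helper, and the reading that _id is not counted as a created property is an assumption inferred from the numbers.

```python
import re

# Counter values copied from the job log above.
VERTICES_CREATED = 405909
VERTEX_PROPERTIES_CREATED = 1217727

# First result row from the Gremlin output above.
ROW = ("==>47994559900176 {label_=[慾望], _id=[47994559900176], "
       "name=[Want], uri=[http://dbpedia.org/resource/Want]}")


def parse_properties(row):
    """Extract key=[value] pairs from one Gremlin result row into a dict."""
    body = row[row.index("{") + 1:row.rindex("}")]
    return dict(re.findall(r"(\w+)=\[(.*?)\](?:, |$)", body))


props = parse_properties(ROW)
# Assumption: _id is the vertex identifier and is not counted as a property.
user_props = [key for key in props if key != "_id"]

expected = len(user_props) * VERTICES_CREATED
print(sorted(user_props), expected, expected == VERTEX_PROPERTIES_CREATED)
```

The same log also reports Map input records=405909, equal to VERTICES_CREATED, which is consistent with the file holding one label triple per resource.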

Source: "ITPUB Blog", link: http://blog.itpub.net/16582684/viewspace-1283902/. If you repost this article, please credit the source; otherwise legal action may be pursued.
