ITPub博客

首页 > Linux操作系统 > Linux操作系统 > Amazing Algorithms with NoSQL: A MongoDB Example Part 2

Amazing Algorithms with NoSQL: A MongoDB Example Part 2

原创 Linux操作系统 作者:jieforest 时间:2012-06-21 12:37:57 0 删除 编辑
In part 1 of this article, I described the use of MongoDB to solve a specific Chemoinformatics problem, namely the computation of molecular similarities. Depending on the target Tanimoto coefficient, the MongoDB solution is able to screen a database of a million compounds in subsecond time.

To make this possible, queries only return chemical compounds which, in theory, are able to satisfy the particular target Tanimoto. Even though this optimization is in place, the number of compounds returned by this query increases significantly when the target Tanimoto is lowered.

The example code on the GitHub repository for instance, imports and indexes ~25000 chemical compounds. When a target Tanimoto of 0.8 is employed, the query returns ~700 compounds. When the target Tanimoto is lowered to 0.6, the number of returned compounds increases to ~7000.

Using the MongoDB explain functionality, one is able to observe that the internal MongoDB query execution time increases slightly, compared to the execution overhead to transfer the full list of 7000 compounds to the remote Java application. Hence, it would make more sense to perform. the calculations local to where the data is stored. Welcome to MongoDB’s build-in map-reduce functionality!


1. MongoDB molecular similarity map-reduce query

Map-reduce is a conceptual framework, introduced by Google, to enable the processing of huge datasets using a large number of processing nodes. The general idea is that a larger problem is divided in a set of smaller subproblems that can be answered (i.e. solved) by an individual processing node (the map-step).

Afterwards, the individual solutions are combined again to produce the final answer to the larger problem (the reduce-step). By making sure that the individual map and reduce steps can be computed independently of each other, this divide-and-conquer technique can be easily parallelized on a cluster of processing nodes. Let’s start by refactoring our solution to use MongoDB’s map-reduce functionality.

CODE:

01.te the essential numbers
02.int maxnumberofcompoundfingerprints = (int) (fingerprintsToFind.size() / 0.6);
03.int minnumberofcompoundfingerprints = (int) (fingerprintsToFind.size() * 0.6);
04.int numberoffingerprintstoconsider = fingerprintsToFind.size() - minnumberofcompoundfingerprints;
05.
06.List fingerprintsToConsider = fingerprintsToFind.subList(0,numberoffingerprintstoconsider+1);
07.
08.// Find all compounds that satisfy the specified conditions
09.DBObject compoundquery =
10.QueryBuilder.start(FINGERPRINTS_PROPERTY).in(fingerprintsToConsider)
11..and(FINGERPRINTCOUNT_PROPERTY).lessThanEquals(maxnumberofcompoundfingerprints)
12..and(FINGERPRINTCOUNT_PROPERTY).greaterThanEquals(minnumberofcompoundfingerprints)
13..get();
14.
15.// The map fuction
16.String map = "function() { " +
17."var found = 0; " +
18."var fingerprintslength = this.fingerprints.length; " +
19."for (i = 0; i < fingerprintslength; i++) { " +
20."if (fingerprintstofind[this.fingerprints[i]] === true) { found++; } " +
21."} " +
22."if (found >= minnumberofcompoundfingerprints) { emit (this.compound_cid, {found : found, total : this.fingerprint_count} ); } " +
23."}";
24.
25.// Execute the map reduce function
26.MapReduceCommand mr = new MapReduceCommand(compoundsCollection, map, "", null, MapReduceCommand.OutputType.INLINE, compoundquery);
27.
28.// Create a hashmap for the fingerprints to find (to speed up the javascript. execution)
29.Map tofind = new HashMap();
30.for(String fingerprinttofind : fingerprintsToFind) {
31.tofind.put(fingerprinttofind,true);
32.}
33.
34.// Set the map reduce scope
35.Map scope = new HashMap();
36.scope.put("fingerprintstofind",tofind);
37.scope.put("minnumberofcompoundfingerprints",minnumberofcompoundfingerprints);
38.mr.setScope(scope);
39.
40.// Execute the map reduce
41.MapReduceOutput ut = compoundsCollection.mapReduce(mr);
42.
43.// Iterate the results
44.for (DBObject result : out.results()) {
45.String compound_cid = (String)result.get("_id");
46.DBObject value = (DBObject)result.get("value");
47.
48.// Calculate the tanimoto coefficient
49.double totalcount = (Double)value.get("total");
50.double found = (Double)value.get("found");
51.double tanimoto = (Double)value.get("found") / ((Double)value.get("total") + fingerprintsToFind.size() - (Double)value.get("found"));
52.// We still need to check whether the tanimoto is really >= the required similarity
53.if (tanimoto >= 0.6) {
54.System.out.println(compound_cid + " " + (int)(tanimoto * 100) +"%");
55.}

来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/301743/viewspace-733548/,如需转载,请注明出处,否则将追究法律责任。

请登录后发表评论 登录
全部评论

注册时间:2008-04-23

  • 博文量
    443
  • 访问量
    508196