In one of my previous blog posts, I debated the superficial idea that you should own billions of data records before you are eligible to use NoSQL/Big Data technologies.
In this article, I try to illustrate my point, by employing NoSQL, and more specifically MongoDB, to solve a specific Chemoinformatics problem in a truly elegant and efficient way. The complete source code can be found on the Datablend public GitHub repository.
1. Molecular similarity theory
Molecular similarity refers to the similarity of chemical compounds with respect to their structural and/or functional qualities. By calculating molecular similarities, Chemoinformatics is able to help in the design of new drugs by screening large databases for potentially interesting chemical compounds. (This by applying the hypothesis that similar compounds generally exhibit similar biological activities.)
Unfortunately, finding substructures in chemical compounds is a NP-complete problem. Hence, calculating similarities for a particular target compound can take a very long time when considering millions of input compounds. Scientist solved this problem by introducing the notion of structural keys and fingerprints.
In case of structural keys, we precompute the answers on a couple of specific questions that try to capture the essential characteristics of a compound. Each answer is assigned a fixed location within abitstring.
At query time, a lot of time is saved by only executing substructure searches for compounds that have compatible structural keys. (Compatibility being computed by making use of efficient bit operators.)
When employing fingerprints, all linear substructure patterns of a certain length are calculated. As the number of potential patterns is huge, it is not possible to assign an individual bit position to each possible pattern (as is done with structural keys).
Instead, the fingerprints patterns are used in a hash. The downside of this approach is that, depending of the size of the hash, multiple fingerprint patterns share the same bit position, giving lead to potential false positives.
In this article, we will demonstrate the use of non-hashed fingerprints to calculate compound similarities (i.e. using the raw fingerprints).
This approach has two advantages:
1. We eliminate the chance of false positives.
2. The raw fingerprints can be used in other types of structural compound mining
来自 “ ITPUB博客 ” ，链接：http://blog.itpub.net/301743/viewspace-733070/，如需转载，请注明出处，否则将追究法律责任。