Efficient exploitation of the massive amount of modern-day life science data
Chemical design, like most scientific disciplines, is becoming increasingly data-intensive and dependent on our capacity to manage and exploit growing data resources. In particular, drug-discovery organizations increasingly need decision making that is informed by their growing internally generated data and its integration with external data.
Data-driven chemistry (for drug design, materials science, catalysis, and polymers) depends on researchers coping with the growth in data and finding ways to convert these resources into better decisions. Increasing the capacity of chemists to undertake data-driven research has the potential to improve decision making in drug discovery and to ensure that the greatest benefit is derived from the growth in data. The rapid increase in available data in the so-called Big Data era makes harnessing these resources and optimizing our research processes a prerequisite for future success.
At the core of chemical design is the “design, synthesis, testing and evaluation” cycle. Traditionally, all components of the cycle have been undertaken in the same laboratory under the control of a small team of synthetic chemists and a computational chemist as part of a multidisciplinary team. The most important task of the chemistry team is to evaluate new biological test results in the context of known chemistry rules, general and project-specific models, and any other available information, such as protein structures. The key to successful design chemistry is the ability to balance an array of often conflicting properties so that each round of design and synthesis improves the overall properties of the compound series (or at least facilitates future improvement). Design chemistry is therefore a data-driven task, with a requirement for immediate access to all available data if we want to ensure that the results of new testing truly influence the next rounds of synthesis.
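The balancing of conflicting properties described above is often formalized as multi-parameter scoring. The sketch below is one minimal, hypothetical way to do this: each property gets a desirability between 0 and 1 relative to a preferred window, and the candidates are ranked by the geometric mean of those desirabilities (which penalizes any single weak property). The property names, target windows, and example compound are invented for illustration, not taken from any specific project.

```python
# Hypothetical multi-parameter scoring sketch for ranking compounds
# between design rounds. Property names, target ranges, and the
# example candidate are illustrative only.

def desirability(value, low, high):
    """1.0 inside the preferred [low, high] window, falling off linearly outside."""
    if low <= value <= high:
        return 1.0
    dist = (low - value) if value < low else (value - high)
    return max(0.0, 1.0 - dist / (high - low))

def score(compound, targets):
    """Geometric mean of per-property desirabilities: one poor property drags
    the whole score down, mirroring the 'conflicting properties' trade-off."""
    product = 1.0
    for prop, (low, high) in targets.items():
        product *= desirability(compound[prop], low, high)
    return product ** (1.0 / len(targets))

targets = {"pIC50": (7.0, 10.0), "logP": (1.0, 3.0), "mol_weight": (200.0, 450.0)}
candidate = {"pIC50": 7.5, "logP": 3.5, "mol_weight": 410.0}
print(round(score(candidate, targets), 3))  # → 0.909 (logP slightly out of range)
```

A geometric rather than arithmetic mean is a deliberate design choice here: a compound that is excellent in two properties but unacceptable in a third should not rank highly.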
Chemical data-analysis workflow tools such as KNIME, TAVERNA and PIPELINE PILOT have been implemented in most pharmaceutical companies, providing user-friendly workbenches for experts and non-experts to undertake complex data-analysis tasks, including machine learning, analytics and visualization. TAVERNA and KNIME are open-source workflow tools with large communities developing and sharing new functionality, providing dissemination of methods and rigorous community testing. Even the largest commercial software providers, including Schrödinger, Tripos and CCG, now provide tools (nodes and extensions) for the KNIME community.
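Conceptually, these workflow tools model an analysis as a chain of "nodes", each transforming a table of rows, wired together on a canvas. The sketch below mimics that node-and-edge model in plain Python; the node names and the toy compound table are invented for illustration and are not KNIME's (or TAVERNA's) actual API.

```python
# Conceptual sketch of the node model behind workflow tools such as KNIME:
# each "node" is a function over a table of rows, and a workflow is an
# ordered composition of nodes. Data and node names are illustrative only.
from functools import reduce

def read_table(_):
    # Stand-in for a reader node pulling compound records from a file or database.
    return [{"id": "CPD-1", "logP": 2.1},
            {"id": "CPD-2", "logP": 4.8},
            {"id": "CPD-3", "logP": 0.9}]

def row_filter(rows, predicate):
    # Stand-in for a filter node: keep only rows satisfying the predicate.
    return [r for r in rows if predicate(r)]

def sorter(rows, key):
    # Stand-in for a sorter node.
    return sorted(rows, key=lambda r: r[key])

# Wire the nodes into a linear workflow, as a user would on the canvas.
workflow = [
    read_table,
    lambda rows: row_filter(rows, lambda r: r["logP"] <= 3.0),
    lambda rows: sorter(rows, "logP"),
]
result = reduce(lambda data, node: node(data), workflow, None)
print([r["id"] for r in result])  # compounds passing the filter, ordered by logP
```

The value of the real tools lies in making exactly this composition graphical and shareable, so non-programmers can assemble and rerun such chains.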
Despite their user-friendly nature, these workflow tools are not trivial to manage, especially when connecting them to database tools or other extensions. For this reason, the Dutch academic community benefits from the Netherlands eScience Center implementing an eScience platform around a workflow tool on its behalf.
This project delivers a local version of such an eScience-for-chemistry platform, supported by open-source databases (MySQL and PostgreSQL) and connected to chemistry-specific applications such as RDKit and CDK, as well as to the analytics and visualization capabilities of R, building on previously described infrastructures.
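To make the database layer of such a platform concrete, the sketch below shows the kind of query a workflow node might issue against the compound store. Python's standard-library sqlite3 stands in for the MySQL/PostgreSQL backends named above, and the schema, SMILES strings and assay values are invented for illustration; in the actual platform, chemical structures would be handled by RDKit or CDK rather than stored only as plain text.

```python
# Minimal sketch of a database-backed compound store. sqlite3 (standard
# library) stands in for MySQL/PostgreSQL; schema and data are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE compounds (
    id     INTEGER PRIMARY KEY,
    smiles TEXT NOT NULL,   -- structure as SMILES; RDKit/CDK would parse this
    pic50  REAL             -- illustrative assay potency value
)""")
conn.executemany(
    "INSERT INTO compounds (smiles, pic50) VALUES (?, ?)",
    [("CCO", 5.2), ("c1ccccc1O", 6.8), ("CC(=O)Nc1ccc(O)cc1", 7.4)],
)
conn.commit()

# A workflow node would issue queries like this to pull assay data for analysis.
rows = conn.execute(
    "SELECT smiles, pic50 FROM compounds WHERE pic50 >= ? ORDER BY pic50 DESC",
    (6.5,),
).fetchall()
print(rows)
```

Keeping the store behind standard SQL is what lets the same workflow run against a local test database or the organization's production backend unchanged.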
Such an approach has the potential to support many aspects of data-driven chemistry, but also other disciplines: the central workflow tool, KNIME (like TAVERNA and PIPELINE PILOT), is domain-independent and could support projects in many other fields in the future.