Faced with the surge in published scientific knowledge, researchers increasingly need tools to help them quickly analyze texts and extract accurate data. Text mining technologies have been developed to meet this need. However, these tools have been designed around the specificities of particular research fields, the types of text to be processed or the desired analysis, resulting in a fragmented landscape of incompatible text mining solutions.
Creating a platform for collaboration and knowledge sharing on text mining
The objective of the European OpenMinTeD project, funded under the Horizon 2020 programme, is to create a platform for collaboration and knowledge sharing on text mining for scientists in all fields. INRA, with the Bibliome-MaIAGE team and the DIST, is involved in the project along with 16 other academic partners whose contributions are coordinated by the Athena Research and Innovation Centre (ARC). The consortium is working on the integration of resources (scientific literature and annotation resources) and text mining software components, facilitating their reuse by making them interoperable. INRA's contribution to OpenMinTeD is to supply and integrate the Alvis technologies developed by the Bibliome team over many years. Because the design of the platform is guided by use cases, this contribution is part of a broader effort to design and implement innovative applications in the fields of agriculture and food.
Together with INRA units in food microbiology and the Migale bioinformatics platform, the Bibliome-MaIAGE team and the DIST have set up the Florilege application. Its objective is to bring together, in a unified representation, public information (from databases and scientific articles) on the positive flora of foods (useful for processing, biopreservation and probiotics).
Two other use cases have been developed by Bibliome-MaIAGE and DIST. The first was developed in collaboration with the Info Genomic Research Unit (URGI) within the WheatIS application, an integrated information system on wheat phenotypes and genotypes. The second, the "SeeDev" application built with the Institute of Plant Sciences Paris-Saclay, combines data from the "FLAGdb++" plant genome database with information, extracted from scientific publications, on the regulations involved in the development of Arabidopsis thaliana seeds. This allows researchers not only to obtain information on the activity of genes during seed development (their interactions or the proteins they produce, for example) but also to access the scientific texts describing this activity. Each of these innovative services integrates experimental data, expert data and data extracted en masse from text by OpenMinTeD into a unified, easy-to-access package.
The last OpenMinTeD consortium meeting took place from 12 to 14 February 2018 at the INRA research centre in Jouy-en-Josas. The partners, joined by Open Access communities providing content and by text mining IT communities, are currently completing the integration of their applications and components into the platform, which will be officially launched on 24 May 2018 in Brussels.
Extraction and formalization of knowledge from text
Leader: Claire Nédellec
The Bibliome group's objective is to develop new methods and technologies for the extraction and formalisation of fine-grained information and knowledge from textual documents such as scientific papers, patents and free-text fields of databases. The methods are mainly based on Natural Language Processing and Machine Learning algorithms.
Applying these methods to life sciences and agriculture requires new integrative approaches that interlink textual data with other experimental data so that they can be exploited together in analysis tools and bioinformatics platforms. It also requires user-friendly interfaces for training the text mining tools and for the visualisation and curation of their results.
Text mining in a focused domain, where only small corpora are available, relies on external resources such as nomenclatures, vocabularies and ontologies. The Bibliome group therefore also develops methods for designing vocabularies and ontologies. Using such formal resources also helps link the extracted information with other data.
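As an illustration of how a vocabulary can support text mining when little training data is available, here is a minimal Python sketch of dictionary-based term matching. It is not the Alvis implementation: the function name `annotate`, the vocabulary entries and the concept identifiers (`TAX:0001`, `HAB:0001`, and so on) are all made up for the example. Each matched mention is tied to a concept identifier, which is what makes it possible to link text to other data that use the same identifiers.

```python
import re

# Hypothetical mini-vocabulary: surface forms mapped to invented concept identifiers.
VOCABULARY = {
    "lactococcus lactis": "TAX:0001",
    "cheese": "HAB:0001",
    "fermented milk": "HAB:0002",
    "milk": "HAB:0003",
}


def annotate(text, vocabulary):
    """Return (start, end, mention, concept_id) tuples for vocabulary terms found in text."""
    annotations = []
    # Process longer terms first so "fermented milk" takes priority over "milk".
    for term in sorted(vocabulary, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            overlaps = any(
                match.start() < end and start < match.end()
                for start, end, _, _ in annotations
            )
            if not overlaps:
                annotations.append(
                    (match.start(), match.end(), match.group(0), vocabulary[term])
                )
    # Sort by position in the text.
    return sorted(annotations)


if __name__ == "__main__":
    sentence = "Lactococcus lactis is used in cheese and fermented milk production."
    for start, end, mention, concept_id in annotate(sentence, VOCABULARY):
        print(start, end, mention, concept_id)
```

Real systems typically add steps this sketch leaves out, such as handling of spelling variants, plurals and ambiguous terms.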
The Bibliome group has organized shared tasks on bacteria biotopes and on gene regulation in microorganisms and in plants since 2005 (e.g. LLL, BioNLP-ST).