Keynote, day 1
Prof. Barend Mons
Erasmus Medical Centre, University of Rotterdam and Department of Human Genetics,
Leiden University Medical Centre (the Netherlands)
Datastewardship for Discovery
Knowledge Discovery across data resources is hampered by the lack of standards and the poor adoption of existing standards by stakeholders. Data Interoperability overcomes the barriers of syntactic access with semantic use in one implementation. Optimal Interoperability is only attained when access and use can be completely automated: programming and interfaces conform to standards that specify consistent syntax and formats; and data are associated with metadata and terminology identifiers and codes that support computational aggregation and comparison of information that resides in separate resources.
Many European research programmes and public-private partnerships already make significant investments in data infrastructure to make data better Findable, Accessible, Interoperable and ultimately Reusable (FAIR), but without coordination such as that provided by ELIXIR, the large number of stakeholders and programmes within Europe will drive fragmentation and overlapping investments in data management, stewardship, analytics and technology approaches. Through the implementation of community-adopted and ELIXIR-endorsed standards and, importantly, a Europe-wide framework of experts and a credible supporting organisation, ELIXIR will drive the coordination efforts at both national and international level. ELIXIR is an Open Infrastructure - it will not “own” or “control” the data resources in Europe but will provide a coordinated Backbone that enables and assists partners (e.g. other ESFRI Research Infrastructures) to make use of existing solutions and to connect and interoperate their resources. Sustained infrastructure services for, e.g., identifier management, data access and mappings between resources drive “standards as the community-driven default” and enable long-term data management according to the FAIR principles.
Session 1 – Applied Research
Prof. Klaus Tochtermann
Head of ZBW Leibniz Information Centre of Economics - German National Library of Economics (Germany)
On the Evolution of Semantic Technologies in Scientific Libraries
Dr. Roman Klinger
Visiting Professor for Theoretical Computational Linguistics, Institute for Natural Language Processing, University of Stuttgart (Germany)
Sentiment Analysis and Opinion Mining in Product Reviews: Fine-grained Analysis and Cross-Linguality
Sentiment Analysis and Opinion Mining is often phrased as a text classification task or, in a more fine-grained setting, as a text segmentation task that extracts specific phrases denoting aspects under discussion and evaluating phrases with polarities assigned by an author. In that sense, sentiment analysis is, from a methodological point of view, similar to other information extraction tasks like named entity recognition or relation extraction. However, the specification of the text segments to be detected is different and hard to state.
In this talk, I give a short introduction to different challenges and applications of coarse-grained and fine-grained sentiment analysis and to different methods for addressing them. I will then introduce a model for the joint detection of evaluating phrases and associated aspects as mentioned in product reviews, such as those available from shop websites like Amazon. I conclude with a short overview of our recent work on training models across different languages.
Session 2 – Application Examples
Dr. Markus Bundschus
Head Scientific & Business Information Services, Roche Diagnostics GmbH, Penzberg (Germany)
Text and data mining @Roche: an industry perspective
Even though Text and Data Mining has been part of the technology portfolio in the industry for many years, only recently has it begun shifting from being a niche player towards becoming an integral part of business-critical processes. The range of applications in pharmaceutical companies is huge and diverse – from traditional use cases such as drug and biomarker discovery, analyzing clinical trials, or optimizing biotechnological production processes, to finding key opinion leaders, among others. Given the unprecedented growth of scientific knowledge represented in written documents, there are few alternatives to such automated processing techniques.
In this talk we discuss our strategy for successfully implementing text mining projects in a challenging industrial setting. We outline important design criteria that have to be selected critically to ensure broad use of these powerful technologies. We highlight selected use cases and discuss open research questions that are important to tackle from the industry perspective.
German National Library, Frankfurt/Main
Access to knowledge: Text mining and information extraction in the German National Library.
The German National Library (DNB) is faced with a massive increase of born-digital publications in its collections. In order to offer users access to these materials, the library is evaluating ways to use automated data analysis processes. Here, the library also has to consider established access systems and indexing rules and routines. At the same time, the metadata infrastructure is becoming more and more global, and data analysis and data linkage are becoming increasingly important and potentially valuable for reusing and enriching existing classification information. Some methods have been put into production; others are still in the project phase or simply experimental.
The presentation provides an overview of the approaches DNB has followed so far and highlights opportunities that may have an impact on the further development of the library and information infrastructure.
Prof. Martin Hofmann-Apitius
Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), St. Augustin (Germany)
Modelling hypothetical knowledge: Capturing and representing scientific speculation in text.
Speculative statements communicating experimental findings are frequently found in scientific articles, and their purpose is to provide an impetus for further investigations into the given topic. Automated recognition of speculative statements in scientific text has gained interest in recent years as systematic analysis of such statements could transform speculative thoughts into testable hypotheses. We describe here a pattern matching approach for the detection of speculative statements in scientific text that uses a dictionary of speculative patterns to classify sentences as hypothetical. To demonstrate the practical utility of our approach, we applied it to the domain of Alzheimer’s disease and showed that our automated approach captures a wide spectrum of scientific speculations on Alzheimer’s disease. Subsequent exploration of derived hypothetical knowledge leads to generation of a coherent overview on emerging knowledge niches, and can thus provide added value to ongoing research activities.
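The dictionary-based pattern matching described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the pattern list, the naive sentence splitter and all function names are assumptions chosen for illustration, and a real dictionary of speculative patterns would be far larger and curated.

```python
import re

# Hypothetical pattern dictionary for illustration only;
# it is NOT the dictionary used in the work described above.
SPECULATIVE_PATTERNS = [
    r"\bmay\b",
    r"\bmight\b",
    r"\bcould\b",
    r"\bsuggest(?:s|ed|ing)?\b",
    r"\bhypothesi[sz]e(?:s|d)?\b",
    r"\bit is (?:possible|likely|conceivable) that\b",
    r"\bremains? to be (?:determined|elucidated)\b",
]

COMPILED = [re.compile(p, re.IGNORECASE) for p in SPECULATIVE_PATTERNS]


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def matched_patterns(sentence):
    """Return the dictionary patterns that occur in the sentence."""
    return [p.pattern for p in COMPILED if p.search(sentence)]


def is_speculative(sentence):
    """Classify a sentence as hypothetical if any pattern matches."""
    return bool(matched_patterns(sentence))


def find_speculative(text):
    """Return all sentences of the text classified as hypothetical."""
    return [s for s in split_sentences(text) if is_speculative(s)]
```

Calling `find_speculative` on a short passage returns only the sentences that contain at least one dictionary pattern, and `matched_patterns` shows which patterns triggered the classification.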
Keynote, day 2
Dr. Dietrich Rebholz-Schuhmann
University of Zurich (Switzerland)
Resolving phenotypes to standard representations: a complex task
Capturing single phenotype traits or the full phenotype description is a complex task due to the large number of traits that form the phenotype, and due to the different types of qualities linked to individual phenotypes (e.g., lack of an organ, insufficient function, increase/decrease of a physiological parameter). For human, mouse and other model organisms, specific resources have been produced (e.g., Human Phenotype Ontology, Mouse Phenotype Ontology) to capture the description of a phenotype. This talk will give an overview of the use of public resources to denote a phenotype, of solutions for using model organism data in combination, and of the limitations of linking genes to diseases through data integration using terminologies and ontologies.
Session 4 – Translational Aspects
Prof. Lars Juhl Jensen
NNF Center for Protein Research, University of Copenhagen (Denmark)
Pragmatic text mining: From literature to electronic health records.
Text mining is rapidly becoming an essential tool for biomedical data mining. The literature is a vast source of knowledge, most of which is not captured by existing structured databases. Electronic health records (EHRs) are another underused textual data source, the mining of which has the potential for revealing unknown disease correlations and for improving post-approval monitoring of drugs. In my presentation I will introduce a pragmatic approach to mining the biomedical literature for drugs, proteins, subcellular compartments, tissues, diseases, and associations among them. I will also describe how we apply the same techniques to identify adverse reactions of drugs from the clinical narrative in electronic health records.
Dr. Juliane Fluck
Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), St. Augustin (Germany)
Extraction from scientific text of causal and correlative relationships used in systems biomedicine models of disease.
In order to build networks for systems biology from the literature, a UIMA-based extraction workflow using various named entity recognition processes and different relation extraction methods has been composed. The Unstructured Information Management Architecture (UIMA) is a Java-based framework that allows complicated workflows to be assembled from a set of NLP components. The new system processes scientific articles and writes the open-access Biological Expression Language (BEL) as output. BEL is a machine- and human-readable language with defined knowledge statements that can be used for knowledge representation, causal reasoning, and hypothesis generation. BEL-based disease models can now be generated by automatically processing hundreds of thousands of full-text documents, speeding up model generation in systems biomedicine at unprecedented scale.
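To make the output format more concrete, the following minimal Python sketch shows how an extracted relation between two named entities might be serialised as a BEL-style statement. It is an illustration only and not the workflow described above (which is built on UIMA): the class names, the small set of accepted relations and the example statement are assumptions, and the example is a made-up illustration rather than an extracted finding.

```python
import re
from dataclasses import dataclass

# Small subset of BEL relation names, for illustration only.
ALLOWED_RELATIONS = {
    "increases",
    "decreases",
    "association",
    "positiveCorrelation",
    "negativeCorrelation",
}


@dataclass(frozen=True)
class Term:
    """A normalised entity, e.g. produced by a named entity recognition step."""
    function: str   # BEL function abbreviation, e.g. "p", "a" or "path"
    namespace: str  # namespace label, e.g. "HGNC" (assumed here)
    name: str       # entity name within the namespace

    def to_bel(self):
        # Names with characters other than letters, digits or '_' are quoted.
        if re.fullmatch(r"\w+", self.name):
            value = self.name
        else:
            value = '"' + self.name.replace('"', '\\"') + '"'
        return f"{self.function}({self.namespace}:{value})"


@dataclass(frozen=True)
class Relation:
    """A causal or correlative relation between two terms."""
    subject: Term
    predicate: str
    obj: Term

    def __post_init__(self):
        if self.predicate not in ALLOWED_RELATIONS:
            raise ValueError(f"unsupported relation: {self.predicate}")

    def to_bel(self):
        return f"{self.subject.to_bel()} {self.predicate} {self.obj.to_bel()}"
```

For example, `Relation(Term("p", "HGNC", "APP"), "increases", Term("path", "MESH", "Alzheimer Disease")).to_bel()` yields a single statement line that is readable by both humans and machines.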
Session 5 – Enabling Technologies
Dr. Philipp Daumke
Averbis AG, Freiburg (Germany)
Large-Scale Patent Classification at the European Patent Office.
In the era of Smart Data and the explosion of data volumes of all kinds, organizations seek to leverage such data - be it patent information, research literature, social media data, etc. - for competitive advantage and to help achieve their strategic aims. The process of searching, filtering and categorizing large data sets typically goes far beyond simple keyword search. Semantic technologies paired with machine learning approaches from artificial intelligence are a promising way to support more fine-grained analysis of data.
The European Patent Office and Averbis recently entered into a collaboration for the pre-classification of incoming patent applications (use case 1) and the re-classification of existing classification schemes (use case 2). In this cooperation, various services are provided with the aim of automatically assigning patent applications to the right departments and automatically assigning new CPC codes to existing patents. The solution is based on complex linguistic and semantic analyses, as well as statistical machine learning processes. Up to 250,000 incoming patents are to be classified per year into up to 1,500 categories. In this talk, we present both use cases together with some technical background on the applied language technologies.
TEMIS Germany, Heidelberg (Germany)
Deloitte Consulting AG (Switzerland)
Text Mining and Compliance - Supporting access to complex regulatory legislation by natural language processing.
Legal regulatory frameworks are often of a complexity that makes it hard even for the expert reader to digest the necessary information. Verifying and ensuring compliance with the relevant legislation then often means having to handle countless mutually dependent regulations that refer to one another as well as to specific sophisticated technical terminologies. Natural language processing can provide essential support in managing and accessing information of this type. The presentation describes a use case around the FATCA ("Foreign Account Tax Compliance Act") legislation that has been successfully applied in the banking industry.
The take-home message may be that certain legal frameworks are today of a complexity that benefits from, or even requires, modern natural language processing technologies to make them accessible to the intended audience.
Dr. Anton Heijs
Datasciencesets, Gouda (the Netherlands)
Impact of developments in big data analytics for new use cases.
Developments in big data analytics technology and machine learning techniques have made data and text mining more powerful. This enables new use cases in pharma/biotech and healthcare where large amounts of structured and unstructured data can be used. Combining the analysis of structured data (table or image data) with text data can enable faster and richer insights. The algorithmic and technological developments in big data analytics that enable scalable processing and analysis with an overview of all the data will be discussed. The value of drill-down approaches, especially those using visualization, will be presented. The importance of detecting trends, patterns and semantics, especially in large text data sets, will also be analyzed. Some new use cases that benefit from scalable processing with an overview of all data will be discussed, including their requirements and the value they create. Especially in the medical and life sciences domain, these developments will have a huge impact, although there are still many challenging complexities that need to be addressed in the near future.