NIH VideoCasting

ContentMine: High-Throughput Extractions of Facts from Scientific Articles

Air date: Tuesday, November 15, 2016, 1:00:00 PM
Time displayed is Eastern Time, Washington DC Local
Views: Total views: 361 (152 Live, 209 On-demand)
Category: Special
Runtime: 01:25:58
Description: The NIH Frontiers in Data Science Lecture Series

"ContentMine: High-Throughput Extractions of Facts from Scientific Articles"

Dr. Peter Murray-Rust, University of Cambridge and Founder of the ContentMine Project

There are millions of scientific articles published each year, but much of the content is not accessible because it is non-machine-readable or hidden in supplemental information or bitmapped figures. Content Mining (Text-and-Data Mining/TDM) turns this semi-structured material into semantic form (XML) and annotates it with known metadata. EuropePMC, which works closely with PubMedCentral, provides an API for rapid full-text search and retrieval. ContentMine software then extracts "facts" with a number of "facet" tools: word search, regexes, bespoke text tools, chemical NLP (OSCAR), and tools for certain diagram types (phylogenetic trees). The "facts" can be mapped onto triples and incorporated into Wikidata or used to annotate the text to help human readers. Common facets are often supported by dictionaries, which can be easily extended by anyone with a list of words. Using heuristics, data can be extracted from common diagram types.

The vision is to develop a communal open toolbox that can be extended and validated for a wide range of purposes. However, many rightsholders are trying to control TDM through technical and legal means. There is a recent legal exception in the U.K. that allows for text mining of facts for scientific research. The University of Cambridge is doing this and publishing to the open web. This talk will have live demos, many accessible to the participants during the talk.

ABOUT THE SPEAKER: Dr. Peter Murray-Rust is Founder of the ContentMine project, which has used machines to liberate more than 100,000,000 facts from scientific literature. His research interests involve the automated analysis of data in scientific publications and the creation of virtual scientific communities. He has applied this to Chemistry through the development of the Chemical Markup Language (ChemML or CML). Dr. Murray-Rust holds a Doctor of Philosophy from the University of Oxford. His academic career spans more than thirty years in Computational Chemistry and Molecular Informatics at Glaxo Group Research in Greenford, the University of Nottingham, and the University of Cambridge. He is known internationally for his activism in scientific open access and open data, which has been primarily focused on making scientific knowledge from literature freely available.
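
The facet-based extraction mentioned in the description (word lists, regexes, and facts mapped onto triples) can be illustrated with a minimal sketch. This is not the ContentMine software or its API; every name here (Fact, dictionary_facet, regex_facet, extract_facts, the facet names, and the sample text) is a hypothetical chosen for illustration, and the triple is assumed to be a simple (article, facet, matched text) tuple.

```python
import re
from typing import Iterator, List, NamedTuple, Pattern, Tuple

# Hypothetical type: a facet is a name plus a compiled pattern.
Facet = Tuple[str, Pattern[str]]


class Fact(NamedTuple):
    """One extracted fact as an assumed (subject, predicate, object) triple."""
    subject: str    # hypothetical article identifier
    predicate: str  # facet name, e.g. "mentions_term"
    obj: str        # the text that matched


def dictionary_facet(name: str, words: List[str]) -> Facet:
    """Build a facet from a plain word list (whole words, case-insensitive)."""
    # Longest entries first so multi-word terms win over their prefixes.
    ordered = sorted(words, key=len, reverse=True)
    alternatives = "|".join(re.escape(w) for w in ordered)
    return name, re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def regex_facet(name: str, pattern: str) -> Facet:
    """Build a facet directly from a regular expression."""
    return name, re.compile(pattern)


def extract_facts(article_id: str, text: str, facets: List[Facet]) -> Iterator[Fact]:
    """Yield one triple per match, facet by facet, in order of appearance."""
    for name, pattern in facets:
        for match in pattern.finditer(text):
            yield Fact(article_id, name, match.group(0))


# Illustrative input only; not taken from any real article.
sample_text = "Zika virus was detected in Aedes aegypti; see PMC1234567 for details."
facets = [
    dictionary_facet("mentions_term", ["Aedes aegypti", "Zika virus"]),
    regex_facet("cites_pmcid", r"\bPMC\d+\b"),
]
facts = list(extract_facts("article-1", sample_text, facets))
```

Extending a dictionary facet under this sketch only requires adding words to the list passed to dictionary_facet, which mirrors the description's point that facets can be extended by anyone with a list of words.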

For more information go to https://datascience.nih.gov/community/datascience-at-nih/frontiers
NLM Title: ContentMine : high-throughput extractions of facts from scientific articles / Peter Murray-Rust.
Author: Murray-Rust, Peter.
National Institutes of Health (U.S.),
Publisher:
Subjects: Data Mining
Databases as Topic
Metadata
Publication Types: Lecture
Webcast
NLM Classification: W 26.55.I4
NLM ID: 101697723
CIT Live ID: 20292
Permanent link: https://videocast.nih.gov/watch=20292