Text-mining the research literature

The size and complexity of the research literature are growing rapidly, and this trend has been evident for a number of years. More nations are investing in research, and within established research nations, investment is increasing. In both cases, more research outputs are generated. Alongside increased investment, there is increasing pressure on researchers to publish more and more, especially in nations that have chosen to attach incentives to the volume of research.

The increasing scale of the research literature presents a considerable challenge for researchers, and maintaining an understanding of developments across even a narrow disciplinary focus can be difficult. When considering new questions and research directions, the difficulty of extracting latent insights from the literature can be a significant barrier.

Advances in AI and machine learning are beginning to offer real potential to help with this challenge. This is illustrated by an article from last year that applied relatively simple, unsupervised text-mining approaches to the materials science literature (a read-only version of the article is available).

This article reports on an analysis of text from the materials science literature, carried out with minimal human intervention. The authors were able to predict, with a relatively high level of accuracy, properties of materials, even when the literature did not itself contain reports on those properties. Most strikingly, the authors demonstrate that their approach could have predicted the discovery of novel materials. Using historically bounded sets of literature, they show that materials that exhibit a range of properties (thermoelectric or photovoltaic behaviour, for example) could be predicted years in advance of their ‘discovery’ using wholly empirical approaches.

The method used is based entirely on text analysis, so it could, at least in principle, be applied to other research domains. The study used only text from article abstracts, and the authors suggest that working with full text may actually be more difficult, due to the more complex and nuanced language used in the full articles. Some initial filtering of abstracts made for a more effective prediction process, and using a larger dataset (the full corpus of Wikipedia) performed less well.
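To give a feel for how purely text-based prediction of this kind can work, here is a minimal sketch in Python. It is not the authors' method: it swaps in a much cruder technique (document-level co-occurrence counts compared by cosine similarity), and the toy abstracts and the function names (cooccurrence_vectors, cosine, rank_candidates) are hypothetical, made up for illustration. The idea is only that a material can score as similar to a property word because the two appear in similar contexts, even if they never appear in the same abstract.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy "abstracts", invented for illustration only.
# Note that no material name shares an abstract with "thermoelectric".
ABSTRACTS = [
    "Bi2Te3 shows a high seebeck coefficient and low thermal conductivity",
    "thermoelectric devices need a high seebeck coefficient and low thermal conductivity",
    "PbTe shows a high seebeck coefficient",
    "NaCl is a common salt that dissolves in water",
]


def cooccurrence_vectors(docs):
    """Map each word to counts of the other words sharing an abstract with it."""
    vectors = defaultdict(Counter)
    for doc in docs:
        tokens = doc.lower().split()
        for i, word in enumerate(tokens):
            for j, other in enumerate(tokens):
                if i != j:
                    vectors[word][other] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(docs, candidates, property_word):
    """Rank candidate materials by context similarity to a property word."""
    vectors = cooccurrence_vectors(docs)
    target = vectors[property_word.lower()]
    scores = [(name, cosine(vectors[name.lower()], target)) for name in candidates]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_candidates(
        ABSTRACTS, ["NaCl", "PbTe", "Bi2Te3"], "thermoelectric"
    ):
        print(f"{name}: {score:.3f}")
```

On this toy data the two tellurides rank above NaCl, because their abstracts share vocabulary with the one that mentions thermoelectric devices. A real analysis would need a large, filtered corpus and far better text representations than raw counts.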

The authors conclude:

“Scientific progress relies on the efficient assimilation of existing knowledge in order to choose the most promising way forward and to minimize re-invention. As the amount of scientific literature grows, this is becoming increasingly difficult, if not impossible, for an individual scientist. We hope that this work will pave the way towards making the vast amount of information found in scientific literature accessible to individuals in ways that enable a new paradigm of machine-assisted scientific breakthroughs.”

There is huge potential in this ‘new paradigm’, and not just for scientific disciplines. Part of the challenge is assembling and accessing the required data, but many are seeking to address this issue, and assemble large corpuses for analysis (as reported in Nature last year).

Alongside the potential, there are some important implications for the processes and culture of research. Activities that have relied on time-consuming work, like extracting insight from the literature, may switch to machine-led or machine-augmented alternatives. So, the scale and skills of the workforce will need to change in response.

In the short term, researchers who are familiar with the tools, techniques and algorithms of AI and machine learning, and their limitations, are likely to be able to advance their research more effectively. And in the medium term, the policy challenge is to ensure these new approaches become embedded in the training of all researchers.


Written on January 31, 2020

Creative Commons Licence
© 2020 Steven Hill. Unless otherwise stated, this work is licensed under a Creative Commons Attribution 4.0 International License.