NLP and Knowledge Graphs for Research Cluster Prediction and Analysis
Results of this project are explained in the final report.
- Sponsored by: TUM Chair of Software Engineering for Business Information Systems (sebis)
- Project Lead: Dr. Ricardo Acevedo Cabra
- Scientific Lead: PhD candidate Tim Schopf
- TUM Co-Mentor: Dr. Ricardo Acevedo Cabra
- Term: Winter semester 2022
Motivation
The constantly increasing rate of new scientific publications, and with it the emergence of new research areas, brings innovation but also raises new challenges. One challenge is to organize scientific knowledge so that researchers can easily find relevant results and discover new findings. Because scientific knowledge is usually available in large quantities as unstructured text, it is very difficult for researchers to obtain an overview of research fields or scientific domains. It is similarly difficult to gain insight into the topics being researched at research institutions. Information about conducted research often exists only as unstructured text on homepages or intranet pages. In addition, websites are usually designed according to the organizational structure of a research institution rather than a logical structure based on research areas, so research-area-based navigation through institution websites is hardly possible. This makes exploring and navigating the topics researched at an institution very difficult for external and internal users alike. Structuring the scientific knowledge of research institutions and linking semantically related scientific domains offers researchers the potential for enhanced exploration of research areas.
Goal
The goal of this project is to model and visualize the different research clusters of research institutions based on the publications associated with them. Based on the modeled research clusters, it should be possible to get an overview of the research conducted at an institution, how large the research clusters are within the institution, how they relate to each other, how they evolved over time, and who the key people working in them are. To model the different research clusters, we first create a Knowledge Graph that stores information about research publications, their topics, authors, and the research institutions affiliated with the authors. A Knowledge Graph is a database that stores information as a graph of entities and the relationships between them, and it can be used to visualize the relationships between any of its data points. After the Knowledge Graph is created, we use graph clustering algorithms to model the different research clusters of research institutions. To achieve our goal, the following steps are required:
- Classifying the research publications according to the provided “Field of Study” ontology. The ontology consists of possible research topics that are connected to each other by hypernym ↔ hyponym relations, so it can also be understood as a hierarchy of research topics. For classification, we will apply a variety of NLP algorithms to the publication titles, abstracts, and keywords. This may include, but is not limited to, recent transformer models such as BERT, RoBERTa, DeBERTa, SBERT, SimCSE, etc.
- Constructing the Knowledge Graph based on the publication classifications. Information about the publication authors, their research institution affiliations, and citations will be provided and also needs to be included in the Knowledge Graph.
- Modeling research clusters of research institutions based on various graph clustering/community detection algorithms. In addition, we need to evaluate how graph clustering results differ from purely text-based topic models of research publications.
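The second and third steps can be illustrated with a minimal sketch. It assumes the publications are already classified, stores the Knowledge Graph as a plain list of triples instead of a graph database, and uses a simple deterministic label propagation as a stand-in for the graph clustering algorithms the project will actually evaluate. All publication IDs, author names, topics, relation names, and function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical classified publications: (publication, authors, topics).
PUBLICATIONS = [
    ("p1", ["alice", "bob"], ["nlp"]),
    ("p2", ["bob", "carol"], ["nlp", "knowledge_graphs"]),
    ("p3", ["alice", "carol"], ["nlp"]),
    ("p4", ["dave", "erin"], ["robotics"]),
    ("p5", ["erin", "frank"], ["robotics", "control"]),
    ("p6", ["dave", "frank"], ["robotics"]),
]


def build_triples(publications):
    """Store the graph as (subject, relation, object) triples."""
    triples = []
    for pub_id, authors, topics in publications:
        triples += [(author, "AUTHORED", pub_id) for author in authors]
        triples += [(pub_id, "HAS_TOPIC", topic) for topic in topics]
    return triples


def coauthor_graph(triples):
    """Project onto authors: two authors are linked if they share a publication."""
    authors_of = defaultdict(set)
    for subject, relation, obj in triples:
        if relation == "AUTHORED":
            authors_of[obj].add(subject)
    neighbors = defaultdict(set)
    for authors in authors_of.values():
        for author in authors:
            neighbors[author] |= authors - {author}
    return neighbors


def label_propagation(neighbors, max_rounds=20):
    """Asynchronous label propagation with a fixed node order and tie-breaking."""
    labels = {node: node for node in neighbors}
    for _ in range(max_rounds):
        changed = False
        for node in sorted(neighbors):
            counts = Counter(labels[n] for n in neighbors[node])
            if not counts:  # isolated author keeps its own label
                continue
            best = max(counts.values())
            # Ties are broken by the alphabetically smallest label.
            new_label = min(l for l, c in counts.items() if c == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels


def clusters_from_labels(labels):
    """Group nodes that ended up with the same label."""
    groups = defaultdict(set)
    for node, label in labels.items():
        groups[label].add(node)
    return sorted(sorted(group) for group in groups.values())


triples = build_triples(PUBLICATIONS)
clusters = clusters_from_labels(label_propagation(coauthor_graph(triples)))
print(clusters)
```

On this toy data, the two groups of co-authors end up in separate clusters. A real setup would likely keep the graph in a graph database and compare several community detection algorithms against text-based topic models, as described above.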
Data
- Corpus of research publications that includes titles, abstracts, authors, publication years, and citation information.
- List of authors that created the research publications. Authors are linked to their research publications as well as their current research institution.
- List of research institutions to which authors claim affiliations.
- Ontology of “Field of Study” concepts that describe the topic of a publication. Publications can be tagged with “Field of Study” concepts. The ontology contains more than 70,000 concepts.
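One possible in-memory representation of these four data sources is sketched below. The class names, field names, and the example concepts are assumptions made for illustration, not the actual schema of the provided data. The helper shows how a publication's “Field of Study” tags could be expanded to include all broader concepts (hypernyms) in the ontology.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass(frozen=True)
class Institution:
    institution_id: str
    name: str


@dataclass(frozen=True)
class Author:
    author_id: str
    name: str
    institution_id: str  # current research institution


@dataclass
class Publication:
    publication_id: str
    title: str
    abstract: str
    year: int
    author_ids: List[str]
    cited_ids: List[str] = field(default_factory=list)
    field_of_study_ids: List[str] = field(default_factory=list)


# Hypothetical ontology excerpt: concept -> list of broader concepts.
ONTOLOGY_PARENTS: Dict[str, List[str]] = {
    "text_classification": ["nlp"],
    "nlp": ["artificial_intelligence"],
    "artificial_intelligence": ["computer_science"],
    "computer_science": [],
}


def with_hypernyms(concept_ids: Iterable[str],
                   parents: Dict[str, List[str]]) -> Set[str]:
    """Return the given concepts plus all of their ancestors in the ontology."""
    seen: Set[str] = set()
    stack = list(concept_ids)
    while stack:
        concept = stack.pop()
        if concept in seen:
            continue
        seen.add(concept)
        stack.extend(parents.get(concept, []))
    return seen


publication = Publication(
    publication_id="p1",
    title="An example title",
    abstract="An example abstract.",
    year=2022,
    author_ids=["a1"],
    field_of_study_ids=["text_classification"],
)
expanded = with_hypernyms(publication.field_of_study_ids, ONTOLOGY_PARENTS)
print(sorted(expanded))
```

Storing only child-to-parent links keeps the ancestor lookup simple. With more than 70,000 concepts, the same traversal still works, since each concept is visited at most once.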
Main methods
- Various supervised and unsupervised text classification algorithms
- Various graph clustering algorithms
- Various text-based topic modeling algorithms
Required prior knowledge
- Fundamental knowledge in Natural Language Processing (NLP)
- Advanced Python skills
The following prior knowledge is advantageous, but not necessarily required:
- Fundamental knowledge about knowledge graphs, ontologies, and neo4j
- Previous experience with transformers and language models
Students accepted to this project should attend online workshops at the LRZ (dates TBA), unless they have proven knowledge. More information will be provided to accepted students.