Natural language processing for intelligent data mining and augmentation of probabilistic graphical networks.

This project took place in the summer term of 2022. You CANNOT apply to this project anymore!

The results of this project are explained in detail in the final report.

Probabilistic networks are models that consist of variables of interest and the probabilistic relationships between them. These models can be used for a variety of tasks, such as anomaly detection, prediction, hypothesis testing and reasoning.
Experts can construct probabilistic networks manually and then use them for inference. The parameters of such a pre-constructed network can be adjusted and learned from data, which improves reasoning. The limitation of these models, however, is their structure: because it is provided by an expert, it reflects the expert's understanding of the underlying problem and leaves no room for relationships the expert has not discovered.
The expert who defines the structure of a probabilistic network therefore needs to be able to validate the variables and the relationships between them, as well as to find hidden variables and augment the model with them.
Text documents containing information on the domain of interest can be used to validate and augment a probabilistic network by extracting the required information from them.
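
To make the notion of a probabilistic network concrete, here is a minimal sketch in plain Python. The network (Rain -> WetGrass <- Sprinkler), its variable names and all probability values are hypothetical and chosen only for illustration; they are not taken from the project. The sketch stores one conditional probability table per variable and answers a query by brute-force enumeration over all assignments, which is only practical for very small networks.

```python
from itertools import product

# Hypothetical toy network: Rain -> WetGrass <- Sprinkler.
# parents[var] lists the parents of each variable (the network structure).
parents = {"Rain": [], "Sprinkler": [], "WetGrass": ["Rain", "Sprinkler"]}

# cpt[var][parent_values] = P(var = True | parents = parent_values).
# All numbers are made up for illustration.
cpt = {
    "Rain": {(): 0.2},
    "Sprinkler": {(): 0.1},
    "WetGrass": {
        (True, True): 0.99,
        (True, False): 0.9,
        (False, True): 0.8,
        (False, False): 0.0,
    },
}


def joint(assignment):
    """Joint probability of a full assignment: product of the local CPT entries."""
    p = 1.0
    for var, pars in parents.items():
        p_true = cpt[var][tuple(assignment[q] for q in pars)]
        p *= p_true if assignment[var] else 1.0 - p_true
    return p


def query(target, evidence):
    """P(target = True | evidence), computed by enumerating all assignments."""
    variables = list(parents)
    totals = {True: 0.0, False: 0.0}
    for values in product([True, False], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(assignment[k] == v for k, v in evidence.items()):
            totals[assignment[target]] += joint(assignment)
    return totals[True] / (totals[True] + totals[False])


# Reasoning from effect to cause: how likely is rain, given that the grass is wet?
posterior = query("Rain", {"WetGrass": True})
print(f"P(Rain | WetGrass) = {posterior:.3f}")
```

In this picture, the expert supplies the `parents` dictionary (the structure), while the entries of `cpt` are the parameters that can be learned from data.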

Project Objective
In this project we will use novel data mining and data augmentation techniques to improve probabilistic network models based on information extracted from text documents.
The following steps are foreseen:

  1. Generate supporting statements (relationships and/or numbers) for the manually pre-configured probabilistic network. The main approach for this task is to investigate the probabilistic relationships between given variables based on information extracted from the text documents, and to augment the pre-defined network with the extracted data.
  2. Generate the structure of the probabilistic network from text documents. This task relies on automatically extracting the variables and their probabilistic relationships solely from the text documents, without any input from an expert.
  3. Approaches (1) and (2) can also be combined.
  4. Improve reasoning and inference of the existing network by means of parameters learned from the given data. This part of the project will be dedicated to learning the parameters of the final probabilistic network.
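
As a rough illustration of steps 1 and 4, the sketch below uses two deliberately simple stand-ins. For step 1 it counts how often two variables are mentioned in the same sentence, as crude evidence for a relationship; the project itself names transformer-based NLP methods, which this keyword matching does not attempt to reproduce. For step 4 it estimates one conditional probability from observed records by maximum likelihood with Laplace smoothing. The variable names, keyword lists, example sentences and records are all hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical variables of interest, each mapped to keywords that signal a mention.
VARIABLE_TERMS = {
    "smoking": ["smoking", "smoker", "tobacco"],
    "cancer": ["cancer", "tumor"],
    "exercise": ["exercise", "physical activity"],
}


def mentioned_variables(sentence):
    """Return the set of variables whose keywords occur in the sentence."""
    text = sentence.lower()
    return {
        var
        for var, terms in VARIABLE_TERMS.items()
        if any(term in text for term in terms)
    }


def edge_support(documents):
    """Count sentence-level co-occurrences for each (alphabetically sorted) variable pair."""
    counts = Counter()
    for doc in documents:
        # Naive sentence split on ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", doc):
            found = sorted(mentioned_variables(sentence))
            for pair in combinations(found, 2):
                counts[pair] += 1
    return counts


def estimate_cpt(records, child, parent, alpha=1.0):
    """Estimate P(child = True | parent = value) with Laplace smoothing alpha."""
    table = {}
    for value in (True, False):
        rows = [r for r in records if r[parent] == value]
        positives = sum(1 for r in rows if r[child])
        table[value] = (positives + alpha) / (len(rows) + 2 * alpha)
    return table


# Made-up example documents and observations.
docs = [
    "Smoking increases the risk of lung cancer. Regular exercise is healthy.",
    "Tobacco use is linked to cancer. Exercise may reduce cancer risk.",
]
support = edge_support(docs)

records = [
    {"smoking": True, "cancer": True},
    {"smoking": True, "cancer": False},
    {"smoking": True, "cancer": True},
    {"smoking": False, "cancer": False},
    {"smoking": False, "cancer": False},
]
learned = estimate_cpt(records, child="cancer", parent="smoking")

print(support)
print(learned)
```

Co-occurrence counts say nothing about the direction or sign of a relationship, so in practice they could at most flag candidate edges for the expert to review. The smoothing constant `alpha` keeps estimates away from 0 and 1 when few records are available.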

Algorithms:
Machine learning (advanced natural language processing (NLP) methods, transformer models, deep learning)
Data:
A set of documents related to the domain of interest covered in this project will be provided. Additionally, any open-source and freely available data can be used.
Tools:
Python