Deep Learning in Natural Language Processing for Analysis of Document Similarity

This project took place in the summer term 2021; you can NOT apply to this project anymore!

Results of this project are explained in detail in the final report and presentation.

One of the most significant developments in the area of natural language processing is the evolution of transformer machine learning models. Transformers are a novel deep learning architecture that solves sequence-to-sequence tasks without weakening long-range dependencies. Instead of recurrence, the architecture relies on a self-attention mechanism. Because the input sequence does not have to be processed sequentially, model training can be highly parallelized, which boosts the training speed and performance of the model.
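
To make the self-attention idea concrete, the following minimal sketch computes scaled dot-product self-attention for a toy sequence. It is illustrative only: the shapes, the NumPy implementation, and the single attention head are assumptions chosen for this example, not part of the project material; real transformer layers add multiple heads, masking, and learned weights.

    # Minimal sketch of scaled dot-product self-attention (illustrative only).
    import numpy as np

    def self_attention(x, w_q, w_k, w_v):
        """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_k) projection matrices."""
        q, k, v = x @ w_q, x @ w_k, x @ w_v             # project inputs to queries/keys/values
        scores = q @ k.T / np.sqrt(k.shape[-1])          # all pairwise token interactions at once
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the key dimension
        return weights @ v                               # weighted sum of the values

    rng = np.random.default_rng(0)
    seq_len, d_model, d_k = 5, 16, 8                     # toy sizes, assumptions for the example
    x = rng.normal(size=(seq_len, d_model))
    w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    print(self_attention(x, w_q, w_k, w_v).shape)        # (5, 8)

Note that the attention scores for all positions are computed in one matrix product, which is exactly what allows the parallelization mentioned above.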

Another important feature of transformer models is that they can be pre-trained and then fine-tuned for specific tasks. Since training from scratch on a massive text corpus takes an immense amount of computational resources, it is more practical to use released pre-trained models (such as BERT, XLNet, T5, RoBERTa) and to fine-tune them so that they perform well on the new task.
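
As a hedged illustration of this pre-train/fine-tune workflow, the sketch below loads a pre-trained BERT checkpoint and attaches a fresh classification head for a new task. The Hugging Face transformers library, the checkpoint name, and the number of labels are assumptions chosen for the example; the project description itself only names the model families.

    # Sketch: reuse a pre-trained checkpoint and fine-tune only a small task head.
    # (Hugging Face `transformers` is an assumed toolchain, not prescribed by the project.)
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    model_name = "bert-base-uncased"                     # pre-trained once at large computational cost
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=5                         # fresh classification head for the new task
    )

    # Only the task-specific training remains, e.g. with transformers.Trainer
    # or a plain PyTorch loop on the labelled data.
    inputs = tokenizer("Transformers avoid recurrence.", return_tensors="pt")
    outputs = model(**inputs)
    print(outputs.logits.shape)                          # (1, 5)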

The full transformer model consists of an encoder and a decoder. This architecture is especially suitable for language translation tasks. However, depending on the NLP task, it is possible to use only the encoder or only the decoder.
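
For document-similarity tasks such as this project, an encoder-only setup is a natural fit, since the encoder can turn a document into a fixed-size vector. The sketch below mean-pools BERT token embeddings to obtain such a vector; the library, checkpoint, and pooling strategy are assumptions made for illustration, not a prescribed approach.

    # Sketch: use only the encoder part to embed a document as a fixed-size vector.
    import torch
    from transformers import AutoTokenizer, AutoModel   # assumed toolchain

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    encoder = AutoModel.from_pretrained("bert-base-uncased")

    def embed(text):
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, 768)
        return hidden.mean(dim=1).squeeze(0)              # simple mean pooling -> (768,)

    print(embed("A news article about transformer models.").shape)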

Project Objective

The major goal of this project is to develop a natural language processing model capable of retrieving relevant documents from a database or search engine for given topics. In order to achieve this objective, the following steps are foreseen (a combined sketch follows the list below):

  1. Topic modeling – unsupervised clustering of documents based on their similarity/context

  2. Document search based on similarity/context metrics
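
The sketch below strings both steps together under stated assumptions: document embeddings come from a pre-trained sentence encoder (the sentence-transformers library, an assumed tool), topics are obtained by unsupervised clustering with scikit-learn's KMeans, and search ranks documents by cosine similarity to a query. The example documents, the model name, and the number of clusters are hypothetical.

    # Hedged end-to-end sketch of the two project steps.
    from sentence_transformers import SentenceTransformer   # assumed embedding tool
    from sklearn.cluster import KMeans
    from sklearn.metrics.pairwise import cosine_similarity

    documents = [                                            # hypothetical example corpus
        "Central bank raises interest rates to curb inflation.",
        "New transformer architecture improves translation quality.",
        "Stock markets react to the latest monetary policy decision.",
        "Researchers fine-tune BERT for document classification.",
    ]

    model = SentenceTransformer("all-MiniLM-L6-v2")          # pre-trained sentence encoder (assumption)
    embeddings = model.encode(documents)                      # (n_docs, dim)

    # Step 1: topic modeling as unsupervised clustering of the embeddings.
    topics = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
    print(dict(zip(documents, topics)))

    # Step 2: document search -- rank documents by similarity to a query.
    query_vec = model.encode(["deep learning for natural language processing"])
    scores = cosine_similarity(query_vec, embeddings)[0]
    for doc, score in sorted(zip(documents, scores), key=lambda p: -p[1]):
        print(f"{score:.3f}  {doc}")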

Algorithms:

Machine Learning (advanced natural language processing (NLP) methods, transformer models, deep learning)

Data:
A set of documents (articles and news) will be given. Additionally, all open-source and freely available data can be used.

Tools:
Python

Students accepted to this project should attend (unless they have proven knowledge) online workshops at the LRZ from 06.04.2021 to 09.04.2021 (9:00 AM to 5:00 PM). More information will be provided to students accepted to this project.