Deep Learning in Natural Language Processing for Analysis of Document Similarity
One of the most significant developments in natural language processing is the evolution of transformer machine learning models. Transformers are a deep learning architecture that can solve sequence-to-sequence tasks without weakening long-range dependencies. They rely on self-attention and dispense with recurrence entirely. Because the input sequence does not have to be processed sequentially, model training can be highly parallelized, boosting both the training speed and the performance of the model.
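The core of this parallelism is scaled dot-product self-attention: every position attends to every other position through a few matrix products, with no sequential recurrence. Below is a minimal NumPy sketch; the shapes, random inputs, and the `self_attention` helper are illustrative assumptions, not part of the project description.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a whole sequence at once.

    X: (seq_len, d_model) input embeddings; Wq/Wk/Wv: learned projections.
    All positions are processed in one matrix product (no recurrence),
    and long-range pairs cost the same as adjacent ones.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (seq_len, seq_len) affinities
    # Row-wise softmax turns affinities into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy dimensions for illustration only.
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
```

Each row of `weights` sums to 1 and describes how strongly one position attends to every other position in the sequence.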
Another important feature of the transformer model is that it can be pre-trained and then fine-tuned for specific tasks. Since training from scratch on a massive text corpus takes immense amounts of computational resources, it is more practicable to use a released pre-trained model (such as BERT, XLNet, T5, or RoBERTa) and fine-tune it to perform well on the new task.
The full transformer model consists of an encoder and a decoder. This architecture is especially suitable for language translation tasks. However, depending on the NLP task, it is also possible to use only the encoder (as in BERT) or only the decoder.
The major goal of this project is to develop a natural language processing model capable of retrieving relevant documents from a database or search engine based on given topics. To achieve this objective, the following steps are foreseen:
Topic modeling – unsupervised clustering of documents based on their similarity/context
Document search based on similarity/context metrics
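The document-search step could be prototyped as follows; this is a minimal sketch that stands in bag-of-words count vectors and cosine similarity for the transformer embeddings the project would actually use, and the example documents and helper names are assumptions for illustration.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts; a stand-in for transformer embeddings."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query, docs):
    """Rank documents by similarity to the query, most similar first."""
    qv = vectorize(query)
    scored = [(cosine(qv, vectorize(d)), d) for d in docs]
    return [d for score, d in sorted(scored, reverse=True)]

docs = [
    "transformer models for natural language processing",
    "recipes for baking sourdough bread",
    "deep learning methods for document similarity",
]
ranked = search("deep learning for language documents", docs)
```

In the project itself, `vectorize` would be replaced by embeddings from a fine-tuned encoder model, so that documents with similar context (not just shared words) score highly.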
Machine Learning (advanced natural language processing (NLP) methods, transformer models, deep learning)
A set of documents (articles and news) will be given. Additionally, all open-source and freely available data can be used.
Students accepted to this project should attend (unless they have proven knowledge) online workshops at the LRZ from 06.04.2021 to 09.04.2021 (9:00 AM to 5:00 PM). More information will be provided to students accepted to this project.