Using NLP for Adaptive Fact Extraction and Text Summarization

This project ran in the summer term 2020; you can no longer apply to it!

The results of this project are described in detail in the final documentation and presentation.

The internet is expected to reach 40,000 exabytes (40 billion terabytes!) of data in 2020, a large part of it textual. This explosion of data gives us plenty of content but few ways to understand and digest it all, and in online trend research and analysis in particular, comprehending the vast amount of available information is getting harder. One group of professionals especially affected by information overload is journalists. They spend hours combing through dry documents published by public institutions, often with little success in finding information relevant to articles that must be published every day. Crucial information can even be missed entirely due to the tedious nature of desk research. To address this major pain point, we at faktual are developing and training automated summarization algorithms to cut down reading time.

The goal of the project is to develop a natural language processing model capable of summarization that adapts to user-specified input. Recent progress on computational linguistics models has mostly focused on extracting the generically important sentences of a text, which conveys a general idea but misses the key points relevant to the user's needs. If time allows, the team will also investigate extracting relationships between entities in the text to improve summarization. Data crawled daily from government and other authoritative sources will be available for the project. The language used is Python, with standard machine learning/deep learning libraries. Experience in or knowledge of natural language processing is preferred.
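To make the idea of user-adaptive summarization concrete, here is a minimal sketch of one possible baseline: rank a document's sentences by their TF-IDF cosine similarity to a user query and return the highest-scoring ones. Everything here (the summarize function, the top_k parameter) is illustrative and not the project's actual code; the project itself would likely move beyond such a bag-of-words baseline to deep learning models.

```python
# Minimal sketch of query-adaptive extractive summarization:
# rank sentences by TF-IDF cosine similarity to a user query and
# return the top-scoring ones in document order. Illustrative only.
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def summarize(document: str, query: str, top_k: int = 3) -> str:
    # Naive sentence splitting; a real system would use a proper
    # sentence tokenizer (e.g. spaCy or NLTK).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    if len(sentences) <= top_k:
        return document

    # Fit TF-IDF on the document's sentences, then embed the query
    # in the same vector space.
    vectorizer = TfidfVectorizer(stop_words="english")
    sentence_vectors = vectorizer.fit_transform(sentences)
    query_vector = vectorizer.transform([query])

    # Score each sentence by its similarity to the user's query.
    scores = cosine_similarity(query_vector, sentence_vectors).ravel()

    # Keep the top_k sentences, restored to their original order
    # so the summary reads coherently.
    top_indices = sorted(scores.argsort()[::-1][:top_k])
    return " ".join(sentences[i] for i in top_indices)
```

With a query like "budget cuts in education", such a baseline would surface the sentences most related to that topic rather than the document's generically central ones, which is exactly the gap between objective and user-adapted summarization described above.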