Parsing Web 2.0 Sentences

Today, dependency parsing is an important component of many natural language processing (NLP) tasks, including machine translation, information extraction, question answering, and sentiment analysis. In recent years, data-driven statistical methods have enabled multilingual parsers to achieve high performance for many languages. Until now, however, most of the relevant research has focused on formal (well-written and edited) texts, generally using gold-standard treebanks, i.e., treebanks in which the preprocessing NLP stages (morphological tagging, multi-word expression chunking, etc.) were carried out almost perfectly by human annotators.

Today, the need to parse real data is greater than ever. Especially with the rise of Web 2.0, understanding the language spoken on social media has become an important and urgent requirement for many scientific and industrial studies. Real data differs significantly from formal texts. First, the preprocessing tasks must be automated, and the error margin introduced by the tools used in these steps drastically reduces parsing performance. More importantly, the informal language of the web contains far more errors than formal text: spelling rules are ignored, and a distinct jargon and web-specific styles are employed. As a result, the performance currently achieved on formal texts cannot be matched in this new domain. The aim of this project is to develop new methods for parsing web data in morphologically rich, free constituent order languages.
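As an illustration of the kind of noise the abstract describes, the sketch below shows one crude normalization step often applied to social-media text before parsing: collapsing character elongations (e.g. "soooo"). This is only a hedged, minimal example of the preprocessing challenge, not the project's actual pipeline.

```python
import re

def normalize_token(token):
    """Collapse runs of three or more identical characters to two,
    a common rough fix for elongations like "soooo" or "!!!".
    Illustrative sketch only, not the project's method."""
    return re.sub(r"(.)\1{2,}", r"\1\1", token)

def normalize(sentence):
    # Whitespace tokenization is itself a simplification; real
    # web text needs a dedicated tokenizer.
    return " ".join(normalize_token(t) for t in sentence.split())

print(normalize("omg this is soooo coool!!!"))  # → omg this is soo cool!!
```

Even after such normalization, spelling variation and web-specific jargon remain, which is why downstream parsers trained on edited text degrade on this domain.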

Gülsen Eryigit
Funded by
April 2018