Representation and contextualization for document understanding


dc.identifier.uri
dc.contributor.author Tran, Nam Khanh ger
dc.date.accessioned 2019-02-13T10:37:53Z
dc.date.available 2019-02-13T10:37:53Z
dc.date.issued 2019
dc.identifier.citation Tran, Nam Khanh: Representation and contextualization for document understanding. Hannover : Gottfried Wilhelm Leibniz Universität, Diss., 2019, xviii, 130 S. DOI: ger
dc.description.abstract Document understanding requires discovering meaningful patterns in text, which in turn involves analyzing documents and extracting information useful for a given purpose. Solving this task means dealing with a multitude of problems. With the goal of improving document understanding, we identify three main problems to study within the scope of this thesis. The first problem is learning text representations, which is considered the starting point for understanding documents. A good representation enables us to build applications around the semantics or meaning of the documents, rather than just around the keywords present in the texts. The second problem is acquiring document context. A document cannot be fully understood in isolation, since it may refer to knowledge that is not explicitly included in its textual content. To fully understand the meaning of the document, that prior knowledge therefore has to be retrieved to supplement the text. The last problem we address is recommending related information for textual documents. When consuming text, especially in applications such as e-readers and Web browsers, users are often attracted by the topics or entities that appear in the text. Gaining comprehension of these aspects can therefore help users not only explore those topics further but also better understand the text. In this thesis, we tackle the aforementioned problems and propose automated approaches that improve document representation and suggest relevant as well as missing information to support the interpretation of documents. To this end, we make the following contributions: Representation learning - the first contribution is to improve document representation, which serves as input to document understanding algorithms.
First, we adopt probabilistic methods to represent documents as mixtures of topics and propose a generalizable framework for improving the quality of topics learned from small collections. The proposed method adapts well to different application domains. Second, we focus on learning distributed representations of documents. We introduce multiplicative tree-structured Long Short-Term Memory (LSTM) networks, which integrate syntactic and semantic information from text into the standard LSTM architecture for improved representation learning. Finally, we investigate the usefulness of attention mechanisms for enhancing distributed representations. In particular, we propose Multihop Attention Networks, which can learn effective representations, and illustrate their usefulness in the application of question answering. Time-aware contextualization - the second contribution is to formalize the novel and challenging task of time-aware contextualization, where explicit context information is required to bridge the gap between the situation at the time of content creation and the situation at the time of content digestion. To solve this task, we propose a novel approach that automatically formulates queries for retrieving adequate contextualization candidates from an underlying knowledge source such as Wikipedia, and then ranks the candidates using learning-to-rank algorithms. Context-aware entity recommendation - the third contribution is to assist document exploration by recommending entities related to the entities mentioned in the documents. For this purpose, we first introduce the idea of contextual relatedness of entities and formalize the problem of context-aware entity recommendation. We then approach the problem with a statistically sound probabilistic model that incorporates temporal and topical context via embedding methods. ger
dc.language.iso eng ger
dc.publisher Hannover : Institutionelles Repositorium der Leibniz Universität Hannover
dc.rights Es gilt deutsches Urheberrecht. Das Dokument darf zum eigenen Gebrauch kostenfrei genutzt, aber nicht im Internet bereitgestellt oder an Außenstehende weitergegeben werden. ger
dc.subject document understanding eng
dc.subject representation learning eng
dc.subject time-aware contextualization eng
dc.subject context-aware entity recommendation eng
dc.subject Dokumentverständnis ger
dc.subject Lernen von Textrepräsentation ger
dc.subject zeitbewusste Kontextualisierung ger
dc.subject kontextbewusste Entitätsempfehlung ger
dc.subject.ddc 004 | Informatik ger
dc.title Representation and contextualization for document understanding ger
dc.type doctoralThesis ger
dc.type Text ger
dc.description.version publishedVersion ger
tib.accessRights frei zugänglich ger
