Tran, Nam Khanh: Representation and contextualization for document understanding. Hannover : Gottfried Wilhelm Leibniz Universität, Diss., 2019, xviii, 130 S. DOI: https://doi.org/10.15488/4440
Document understanding requires discovery of meaningful patterns in text, which in turn involves analyzing documents and extracting useful information for a certain purpose. There is a multitude of problems that need to be dealt with to solve this task. With the goal of improving document understanding, we identify three main problems to study within the scope of this thesis. The first problem is about learning text representation, which is considered as starting point to gain understanding of documents. The representation enables us to build applications around the semantics or meaning of the documents, rather than just around the keywords presented in the texts. The second problem is about acquiring document context. A document cannot be fully understood in isolation since it may refer to knowledge that is not explicitly included in its textual content. To obtain a full understanding of the meaning of the document, that prior knowledge, therefore, has to be retrieved to supplement the text in the document. The last problem we address is about recommending related information to textual documents. When consuming text especially in applications such as e-readers and Web browsers, users often get attracted by the topics or entities appeared in the text. Gaining comprehension of these aspects, therefore, can help users not only further explore those topics but also better understand the text. In this thesis, we tackle the aforementioned problems and propose automated approaches that improve document representation, and suggest relevant as well as missing information for supporting interpretations of documents. To this end, we make the following contributions as part of this thesis: Representation learning - the first contribution is to improve document representation which serves as input to document understanding algorithms. Firstly, we adopt probabilistic methods to represent documents as a mixture of topics and propose a generalizable framework for improving the quality of topics learned from small collections. The proposed method can be well adapted to different application domains. Secondly, we focus on learning the distributed representation of documents. We introduce multiplicative tree-structured Long Short-Term Memory (LSTM) networks which are capable of integrating syntactic and semantic information from text into the standard LSTM architecture for improved representation learning. Finally, we investigate the usefulness of attention mechanism for enhancing distributed representations. In particular, we propose Multihop Attention Networks which can learn effective representations and illustrate its usefulness in the application of question answering. Time-aware contextualization - the second contribution is to formalize the novel and challenging task of time-aware contextualization, where explicit context information is required for bridging the gap between the situation at the time of content creation and the situation at the time of content digestion. To solve this task, we propose a novel approach which automatically formulates queries for retrieving adequate contextualization candidates from an underlying knowledge source such as Wikipedia, and then ranks the candidates using learning-to-rank algorithms. Context-aware entity recommendation - the third contribution is to give assistance to document exploration by recommending related entities to the entities mentioned in the documents. For this purpose, we first introduce the idea of a contextual relatedness of entities and formalize the problem of context-aware entity recommendation. Then, we approach the problem by a statistically sound probabilistic model incorporating temporal and topical context via embedding methods.
|License of this version:||Es gilt deutsches Urheberrecht. Das Dokument darf zum eigenen Gebrauch kostenfrei genutzt, aber nicht im Internet bereitgestellt oder an Außenstehende weitergegeben werden.|
|Appears in Collections:||Fakultät für Elektrotechnik und Informatik|
Zur Erhebung der Downloadstatistiken kommen entsprechend dem „COUNTER Code of Practice for e-Resources“ international anerkannte Regeln und Normen zur Anwendung. COUNTER ist eine internationale Non-Profit-Organisation, in der Bibliotheksverbände, Datenbankanbieter und Verlage gemeinsam an Standards zur Erhebung, Speicherung und Verarbeitung von Nutzungsdaten elektronischer Ressourcen arbeiten, welche so Objektivität und Vergleichbarkeit gewährleisten sollen. Es werden hierbei ausschließlich Zugriffe auf die entsprechenden Volltexte ausgewertet, keine Aufrufe der Website an sich.