Capturing protein domain structure and function using self-supervision on domain architectures

dc.identifier.uri http://dx.doi.org/10.15488/12310
dc.identifier.uri https://www.repo.uni-hannover.de/handle/123456789/12408
dc.contributor.author Melidis, Damianos P.
dc.contributor.author Nejdl, Wolfgang
dc.date.accessioned 2022-06-21T05:47:17Z
dc.date.available 2022-06-21T05:47:17Z
dc.date.issued 2021
dc.identifier.citation Melidis, D.P.; Nejdl, W.: Capturing protein domain structure and function using self-supervision on domain architectures. In: Algorithms 14 (2021), Nr. 1, 28. DOI: https://doi.org/10.3390/a14010028
dc.description.abstract Protein sequence embeddings have been shown to improve the prediction of biological properties of unseen proteins. However, biological metadata do not exist for each individual amino acid, so the quality of each learned embedding vector cannot be measured separately. Consequently, current sequence embeddings cannot be intrinsically evaluated in a quantitative manner on the degree of biological information they capture. We address this drawback with our approach, dom2vec, which learns vector representations for protein domains rather than for each amino acid, because biological metadata do exist for each domain separately. To perform a reliable quantitative intrinsic evaluation in terms of biological knowledge, we selected the metadata related to the most distinctive biological characteristics of a domain: its structure, enzymatic function, and molecular function. Notably, dom2vec obtains an adequate level of performance in the intrinsic assessment; we can therefore draw an analogy between the local linguistic features in natural languages and the domain structure and function information in domain architectures. Moreover, we demonstrate the applicability of dom2vec to protein prediction tasks by comparing it with state-of-the-art sequence embeddings in three downstream tasks. We show that dom2vec outperforms sequence embeddings for toxin and enzymatic function prediction and is comparable with sequence embeddings in cellular location prediction. © 2021 by the authors. Licensee MDPI, Basel, Switzerland. eng
dc.language.iso eng
dc.publisher Basel : MDPI AG
dc.relation.ispartofseries Algorithms 14 (2021), Nr. 1
dc.rights CC BY 4.0 International
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.subject Enzymatic commission class eng
dc.subject Protein domain architectures eng
dc.subject Quantitative quality assessment eng
dc.subject SCOPe secondary structure class eng
dc.subject Word embeddings eng
dc.subject Amino acids eng
dc.subject Forecasting eng
dc.subject Linguistics eng
dc.subject Metadata eng
dc.subject Proteins eng
dc.subject Biological characteristic eng
dc.subject Biological information eng
dc.subject Biological properties eng
dc.subject Domain architectures eng
dc.subject Enzymatic functions eng
dc.subject Linguistic features eng
dc.subject Location prediction eng
dc.subject Protein prediction eng
dc.subject Embeddings eng
dc.subject.ddc 510 | Mathematics (Mathematik) ger
dc.title Capturing protein domain structure and function using self-supervision on domain architectures
dc.type Article
dc.type Text
dc.relation.essn 1999-4893
dc.relation.doi https://doi.org/10.3390/a14010028
dc.bibliographicCitation.issue 1
dc.bibliographicCitation.volume 14
dc.bibliographicCitation.firstPage 28
dc.description.version publishedVersion
tib.accessRights freely accessible

