dc.contributor.author | Thorvaldsen, Steinar | |
dc.contributor.author | Hössjer, Ola | |
dc.date.accessioned | 2024-09-24T07:23:59Z | |
dc.date.available | 2024-09-24T07:23:59Z | |
dc.date.issued | 2024-06-12 | |
dc.description.abstract | A large hindrance to analyzing information in genetic or protein sequence data has been a lack of a mathematical
framework for doing so. In this paper, we present a multinomial probability space X as a general foundation for
multicategory discrete data, where categories refer to variants/alleles of biosequences. The external information
that is infused in order to generate a sample of such data is quantified as a distance on X between the prior
distribution of data and the empirical distribution of the sample. A number of distances on X are treated. All of
them have an information theoretic interpretation, reflecting the information that the sampling mechanism
provides about which variants have a selective advantage and therefore appear more frequently compared to
prior expectations. This includes distances on X based on mutual information, conditional mutual information,
active information, and functional information. The functional information distance is singled out as particularly
useful. It is simple and has intuitive interpretations in terms of 1) a rejection sampling mechanism, where
functional entities are retained, whereas non-functional categories are censored, and 2) evolutionary waiting
times. The functional information is also a quasi-metric on X, with information being measured in an asymmetric,
mountainous landscape. This quasi-metric property is also retained for a robustified version of the functional
information distance that allows for mutations in the sampling mechanism. The functional information quasi-metric
has been applied with success on bioinformatics data sets, for proteins and sequence alignment of protein
families. | en_US |
dc.identifier.citation | Thorvaldsen, Hössjer. Use of directed quasi-metric distances for quantifying the information of gene families. Biosystems (Amsterdam. Print). 2024 | en_US |
dc.identifier.cristinID | FRIDAID 2279886 | |
dc.identifier.doi | 10.1016/j.biosystems.2024.105256 | |
dc.identifier.issn | 0303-2647 | |
dc.identifier.issn | 1872-8324 | |
dc.identifier.uri | https://hdl.handle.net/10037/34834 | |
dc.language.iso | eng | en_US |
dc.publisher | Elsevier | en_US |
dc.relation.journal | Biosystems (Amsterdam. Print) | |
dc.rights.accessRights | openAccess | en_US |
dc.rights.holder | Copyright 2024 The Author(s) | en_US |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0 | en_US |
dc.rights | Attribution 4.0 International (CC BY 4.0) | en_US |
dc.title | Use of directed quasi-metric distances for quantifying the information of gene families | en_US |
dc.type.version | publishedVersion | en_US |
dc.type | Journal article | en_US |
dc.type | Tidsskriftartikkel | en_US |
dc.type | Peer reviewed | en_US |