论文部分内容阅读
Background: Given the sequenced fragments from a pair of chromosomes, the goal of the haplotype assembly problem is to reconstruct the two haplotypes for the chromosomes.Many heuristic algorithms and parameterized algorithms have been introduced for the problem.So far for a given input instance, the algorithms all aim to construct a pair of haplotypes that is optimal or near-optimal.When there are many optimal solutions, they can only choose one randomly or the one they find at first.Hence enumerating multiple optimal solutions to the problem is of practical importance.Methods: Based on the characteristics of real DNA sequence fragment data, we propose a parameterized dynamic programming algorithm to enumerate multiple optimal solutions to the Minimum SNPs Removal (MSR) model of the haplotypes assembly problem when the fragments contain no missing data.Since some sequencing technologies tend to produce more errors in some specific DNA sequence sites than others, assuming there are just a few errorprone sites, MSR try to build the real pair of haplotypes with minimum SNPs deleted.We construct initial optimal solutions for the sub-dataset consisting of the data at the first SNP site ; we then consider the data containing the first two SNP sites, and so on.When reaching the last SNP site, we will obtain multiple optimal solutions to the MSR model.Results: The algorithm can find out at most k optimal solutions to the MSR model in the timecomplexity O(nkk1k2+mlogm+mkl) and the space complexity O(mn+kn2), where n is the number of SNPs, m is the number of fragments, kl is the maximum number of SNP sites that a fragment covers, and k2 is the maximum number of fragments that cover a SNP site.Extensive experimental results show that for a general input dataset, the minimum SNP site set whose removal make building a pair of haplotypes feasible is not unique, and that the different haplotypes pairs reconstructed by removing different SNP sets usually show different haplotype reconstruction accuracies.Conclusions: We propose an exact enumeration algorithm for the MSR model of the haplotype assembly problem.The algorithm can effectively provide multiple optimal solutions to biologists for further analyses .