Processing and analysis of complex nucleic acid sequence data
IPC classification information
Country / Type
United States(US) Patent
Granted
International Patent Classification (IPC 7th edition)
G01N-033/50
G06F-019/22
C12Q-001/68
Application number
US-0447087
(2012-04-13)
Patent number
US-9524369
(2016-12-20)
Inventor / Address
Drmanac, Radoje
Peters, Brock A.
Kermani, Bahram Ghaffarzadeh
Applicant / Address
Complete Genomics, Inc.
Agent / Address
Kilpatrick Townsend & Stockton LLP
Citation information
Times cited: 1
Patents cited: 107
Abstract
The present invention is directed to logic for analysis of nucleic acid sequence data that employs algorithms that lead to a substantial improvement in sequence accuracy and that can be used to phase sequence variations, e.g., in connection with the use of the long fragment read (LFR) process.
Representative claim
1. A method of analyzing genomic DNA of an organism to produce a phased sequence corresponding to at least a portion of a genome of the organism, the method comprising: providing a plurality of aliquots of the genomic DNA of the organism; tagging fragments of genomic DNA in each aliquot with a corresponding aliquot-specific tag sequence; sequencing the tagged fragments of genomic DNA to obtain a plurality of reads; receiving, at one or more computing devices, the plurality of reads corresponding to fragments of genomic DNA from the plurality of aliquots, each read comprising a sequence from a fragment of genomic DNA and an aliquot-specific tag sequence, wherein each aliquot contains less than a haploid genome equivalent of genomic DNA; determining, with the one or more computing devices, the aliquots from which the plurality of reads originate by identifying the aliquot-specific tag sequences; producing, with the one or more computing devices, the phased sequence from the reads by: identifying a plurality of heterozygous loci corresponding to at least a portion of the genome of the organism based on numbers of reads having different alleles at each of the plurality of heterozygous loci; and phasing the plurality of heterozygous loci to produce a first haplotype and a second haplotype, the phasing using the aliquots of origin for reads mapping to the plurality of heterozygous loci to determine which alleles at the heterozygous loci are on a same haplotype, wherein reads at different ones of the plurality of heterozygous loci and having the same aliquot of origin are determined to be from the same haplotype, the phased sequence corresponding to the first haplotype and the second haplotype of the at least a portion of the genome of the organism.
2. The method of claim 1, further comprising: identifying a first sequence variant at a first locus based on reads having a first allele and a second allele at the first locus; determining which reads correspond to which of the first and second haplotypes based on the aliquots of origin; and identifying as an error the first sequence variant at the first locus when the aliquots of origin for the reads of the first and second allele are inconsistent with an existence of both the first and second alleles at the first locus.
3. The method of claim 1, wherein at least 70 percent of the heterozygous loci are phased.
4. The method of claim 1, wherein a region of the at least a portion of the genome of the organism comprises a short tandem repeat, the method further comprising: determining a first number of reads of the first haplotype in the region; determining a second number of reads of the second haplotype in the region; comparing the first number with the second number; and based on the comparison, identifying an expansion of the short tandem repeat in the first haplotype or the second haplotype.
5. The method of claim 1, further comprising: producing, with the one or more computing devices, a plurality of assembled sequences that align to an overlap region of the genome, each assembled sequence in the overlap region corresponding to a different aliquot, wherein the plurality of heterozygous loci include N heterozygous loci, where N is an integer greater than one; wherein phasing the plurality of heterozygous loci includes: clustering the assembled sequences in a space of 2^N to 4^N possibilities based on the alleles at the N heterozygous loci for the respective assembled sequences, thereby creating a plurality of clusters; and identifying two clusters with a highest density.
6. The method of claim 5, wherein the phasing further includes: computing a matrix of N dimensions, each dimension corresponding to a heterozygous locus, where each matrix element corresponds to a number of assembled sequences having a combination of alleles corresponding to the matrix element; identifying a first matrix element and a second matrix element that are each a center of one of the two clusters; determining the first haplotype at the N heterozygous loci from the first matrix element; and determining the second haplotype at the N heterozygous loci from the second matrix element.
7. The method of claim 1, wherein the organism is a diploid mammal, the method further comprising: producing an assembled sequence for the first and second haplotypes using the phased sequence, wherein the assembled sequences comprise a genome call rate of 70% or greater and an exome call rate of 70% or greater.
8. The method of claim 7, wherein the assembled sequence comprises less than one false single nucleotide variant per megabase.
9. The method of claim 7, wherein the assembled sequence comprises less than 600 false single nucleotide variants per gigabase.
10. The method of claim 7, further comprising: calling a base at a position of the assembled sequence based on preliminary base calls for the position from two or more aliquots; and identifying the base call as true when the base call is present three or more times in reads from the two or more aliquots.
11. The method of claim 1, wherein phasing the plurality of heterozygous loci includes: for each of a plurality of pairs of heterozygous loci: determining a matrix of a number of shared aliquots between alleles at the heterozygous loci of the pair, the heterozygous loci of the pair being located within a specified distance of each other.
12. The method of claim 11, wherein phasing the plurality of heterozygous loci further includes: using each matrix to calculate a score and an orientation for the respective pair of heterozygous loci; and using the scores and the orientations to determine the first haplotype and the second haplotype.
13. The method of claim 12, wherein using the scores and the orientations to determine the first haplotype and the second haplotype includes: optimizing a graph of connections between pairs of heterozygous loci based on the scores and the orientations.
14. The method of claim 1, further comprising: identifying a phased SNP from the plurality of heterozygous loci, the phased SNP having a first allele and a second allele; identifying a locus that is a neighbor of the phased SNP, where the locus as a no-call, the locus having reads with a third allele and a fourth allele; calculating a first number of shared aliquots that include the first allele at the phased SNP and the third allele at the locus; and determining the third allele to be at the locus based on the first number of shared aliquots.
15. The method of claim 14, further comprising: determining the third allele is at the locus when the first number of shared aliquots is above a threshold, the threshold being two or more.
16. The method of claim 14, further comprising: calculating a second number of the shared aliquots that include the second allele at the phased SNP and the third allele at the locus; calculating a third number of the shared aliquots that include the first allele at the phased SNP and the fourth allele at the locus; determining the locus to be homozygous for the third allele when the first number and the second number are greater than a threshold, and the third number is less than the threshold.
17. The method of claim 14, further comprising: calculating a second number of the shared aliquots that include the second allele at the phased SNP and the fourth allele at the locus; and determining the locus to be heterozygous for the third allele and the fourth allele when all of the reads with the third allele share an aliquot with the first allele, and all of the reads with the fourth allele share an aliquot with the second allele.
18. The method of claim 1, further comprising: phasing at least 70 percent of the plurality of heterozygous loci.
19. The method of claim 1, wherein each aliquot-specific tag comprises an error-correction code, and wherein each read comprises correct tag sequence data or incorrect tag sequence data that comprises one or more errors, the method further comprising: using the error-correction code to correct the incorrect tag sequence data, thereby producing corrected tag sequence data and tag sequence data that cannot be corrected; using reads comprising the correct tag sequence data and the corrected tag sequence data in a first computer process that requires the tag sequence data and that produces a first output; and using reads comprising the tag sequence data that cannot be corrected in a second computer process that does not require the tag sequence data and that produces a second output.
20. The method of claim 19, wherein said first computer process is selected from a list comprising sample multiplexing, library multiplexing, phasing, and an error correction process that employs the tag sequence data, and wherein the second computer process comprises mapping, assembly, and pool-based statistics.
21. The method of claim 1, wherein the phased sequence is of a first region of the genome of the organism, the first region comprising a short tandem repeat, the method further comprising: comparing reads of the first haplotype of the region with reads of the second haplotype; and based on the comparison, identifying an expansion of the short tandem repeat in one of the first haplotype or the second haplotype.
22. The method of claim 1, wherein the genomic DNA is from a group consisting of the genome of the organism, an exome of the organism, a mixture of genomes of different organisms that includes the organism, a mixture of genomes of different cell types of the organism, and subsets thereof.
23. The method of claim 1, further comprising: amplifying the fragments of genomic DNA in each aliquot.
24. The method of claim 1, wherein the organism is a mammal.
25. The method of claim 1, wherein the organism is a human.
26. A computer-readable non-transitory storage medium storing instructions which, when executed by one or more computing devices, cause the one or more computing devices to analyzing genomic DNA of an organism to produce a phased sequence corresponding to at least a portion of a genome of the organism, the instructions comprising: receiving a plurality of reads corresponding to fragments of genomic DNA from a plurality of aliquots, each fragment of genomic DNA being tagged with an aliquot-specific tag sequence, and each read comprising sequence from a fragment of genomic DNA and an aliquot-specific tag sequence, wherein each aliquot contains less than a haploid genome equivalent of genomic DNA, wherein the plurality of reads are obtained by: providing the plurality of aliquots of the genomic DNA of the organism; tagging the fragments of genomic DNA in each aliquot with the corresponding aliquot-specific tag sequence; and sequencing the tagged fragments of genomic DNA to obtain a plurality of reads; determining the aliquots from which the plurality of reads originate by identifying the aliquot-specific tag sequences; producing the phased sequence from the reads by: identifying a plurality of heterozygous loci corresponding to at least a portion of the genome of the organism based on numbers of reads having different alleles at each of the plurality of heterozygous loci; and phasing the plurality of heterozygous loci to produce a first haplotype and a second haplotype, the phasing using the aliquots of origin for reads mapping to the plurality of heterozygous loci to determine which alleles at the heterozygous loci are on a same haplotype, wherein reads at different ones of the plurality of heterozygous loci and having the same aliquot of origin are determined to be from the same haplotype, the phased sequence corresponding to the first haplotype and the second haplotype of the at least a portion of the genome of the organism.
27. The computer-readable non-transitory storage medium of claim 25, wherein the instructions further comprise: identifying a first sequence variant at a first locus based on reads having a first allele and a second allele at the first locus; determining which reads correspond to which of the first and second haplotypes based on the aliquots of origin; and identifying as an error the first sequence variant at the first locus when the aliquots of origin for the reads of the first and second allele are inconsistent with an existence of both the first and second alleles at the first locus.
28. The computer-readable non-transitory storage medium of claim 25, wherein a region of the at least a portion of the genome of the organism comprises a short tandem repeat, wherein the instructions further comprise: determining a first number of reads of the first haplotype in the region; determining a second number of reads of the second haplotype in the region; comparing the first number with the second number; and based on the comparison, identifying an expansion of the short tandem repeat in the first haplotype or the second haplotype.
29. The computer-readable non-transitory storage medium of claim 25, wherein the instructions further comprise: producing a plurality of assembled sequences that align to an overlap region of the genome, each assembled sequence in the overlap region corresponding to a different aliquot, wherein the plurality of heterozygous loci include N heterozygous loci, where N is an integer greater than one; wherein phasing the plurality of heterozygous loci includes: clustering the assembled sequences in a space of 2^N to 4^N possibilities based on the alleles at the N heterozygous loci for the respective assembled sequences, thereby creating a plurality of clusters; and identifying two clusters with a highest density.
30. The computer-readable non-transitory storage medium of claim 29, wherein the phasing further includes: computing a matrix of N dimensions, each dimension corresponding to a heterozygous locus, where each matrix element corresponds to a number of assembled sequences having a combination of alleles corresponding to the matrix element; identifying a first matrix element and a second matrix element that are each a center of one of the two clusters; determining the first haplotype at the N heterozygous loci from the first matrix element; and determining the second haplotype at the N heterozygous loci from the second matrix element.
31. The computer-readable non-transitory storage medium of claim 25, wherein the instructions further comprise: producing an assembled sequence for the first and second haplotypes using the phased sequence; calling a base at a position of the assembled sequence based on preliminary base calls for the position from two or more aliquots; and identifying the base call as true when the base call is present three or more times in reads from the two or more aliquots.
32. The computer-readable non-transitory storage medium of claim 25, wherein phasing the plurality of heterozygous loci includes: for each of a plurality of pairs of heterozygous loci: determining a matrix of a number of shared aliquots between alleles at the heterozygous loci of the pair, the heterozygous loci of the pair being located within a specified distance of each other.
33. The computer-readable non-transitory storage medium of claim 32, wherein phasing the plurality of heterozygous loci further includes: using each matrix to calculate a score and an orientation for the respective pair of heterozygous loci; and using the scores and the orientations to determine the first haplotype and the second haplotype.
34. The computer-readable non-transitory storage medium of claim 33, wherein using the scores and the orientations to determine the first haplotype and the second haplotype includes: optimizing a graph of connections between pairs of heterozygous loci based on the scores and the orientations.
35. The computer-readable non-transitory storage medium of claim 26, wherein the instructions further comprise: identifying a phased SNP from the plurality of heterozygous loci, the phased SNP having a first allele and a second allele; identifying a locus that is a neighbor of the phased SNP, where the locus as a no-call, the locus having reads with a third allele and a fourth allele; calculating a first number of shared aliquots that include the first allele at the phased SNP and the third allele at the locus; and determining the third allele to be at the locus based on the first number of shared aliquots.
36. The computer-readable non-transitory storage medium of claim 35, wherein the instructions further comprise: determining the third allele is at the locus when the first number of shared aliquots is above a threshold, the threshold being two or more.
37. The computer-readable non-transitory storage medium of claim 35, wherein the instructions further comprise: calculating a second number of the shared aliquots that include the second allele at the phased SNP and the third allele at the locus; calculating a third number of the shared aliquots that include the first allele at the phased SNP and the fourth allele at the locus; determining the locus to be homozygous for the third allele when the first number and the second number are greater than a threshold, and the third number is less than the threshold.
38. The computer-readable non-transitory storage medium of claim 35, wherein the instructions further comprise: calculating a second number of the shared aliquots that include the second allele at the phased SNP and the fourth allele at the locus; and determining the locus to be heterozygous for the third allele and the fourth allele when all of the reads with the third allele share an aliquot with the first allele, and all of the reads with the fourth allele share an aliquot with the second allele.
39. The computer-readable non-transitory storage medium of claim 26, wherein the phased sequence is of a first region of the genome of the organism, the first region comprising a short tandem repeat, wherein the instructions further comprise: comparing reads of the first haplotype of the region with reads of the second haplotype; and based on the comparison, identifying an expansion of the short tandem repeat in one of the first haplotype or the second haplotype.
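The pairwise step recited in claims 11 and 12 (a matrix of shared aliquots for a pair of heterozygous loci, then a score and an orientation derived from it) can be illustrated with a minimal Python sketch. This is not the patented implementation. The input layout (each locus as a mapping from allele to the set of aliquot IDs in which that allele was observed), the function names, and the scoring rule (normalized difference between the diagonal and anti-diagonal counts) are assumptions made for illustration, and each locus is assumed to be biallelic.

```python
def shared_aliquot_matrix(locus_a, locus_b):
    """Count aliquots shared between each allele pair of two loci.

    Hypothetical input layout: each locus is a dict mapping an allele
    (e.g. "A") to the set of aliquot IDs whose reads showed that allele.
    Rows follow the sorted alleles of locus_a, columns those of locus_b.
    """
    alleles_a = sorted(locus_a)
    alleles_b = sorted(locus_b)
    matrix = [
        [len(locus_a[a].intersection(locus_b[b])) for b in alleles_b]
        for a in alleles_a
    ]
    return alleles_a, alleles_b, matrix


def score_and_orientation(matrix):
    """Derive a score and an orientation from a 2x2 shared-aliquot matrix.

    Illustrative rule (an assumption, not taken from the claims):
    orientation is +1 when the diagonal dominates (row-0 allele travels
    with column-0 allele), -1 when the anti-diagonal dominates, and 0
    when the data are uninformative. The score is the normalized
    difference between the two sums, in the range 0.0 to 1.0.
    """
    cis = matrix[0][0] + matrix[1][1]
    trans = matrix[0][1] + matrix[1][0]
    total = cis + trans
    if total == 0 or cis == trans:
        return 0.0, 0
    orientation = 1 if cis > trans else -1
    return abs(cis - trans) / total, orientation


# Toy data: aliquot IDs are arbitrary integers.
locus_1 = {"A": {1, 2, 3, 7}, "G": {4, 5, 6}}
locus_2 = {"C": {1, 2, 3}, "T": {4, 5, 7}}

rows, cols, m = shared_aliquot_matrix(locus_1, locus_2)
score, orientation = score_and_orientation(m)
print(rows, cols, m, round(score, 3), orientation)
```

In the toy data, allele A at the first locus shares three aliquots with allele C at the second locus and only one with T, so the sketch places A with C on one haplotype and G with T on the other. A full pipeline would combine many such pairwise results, for example through the graph optimization of claim 13, which this sketch does not attempt.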
Patents cited by this patent (107)
Barany, Francis; Liu, Jianzhao; Kirk, Brian W.; Zirvi, Monib; Gerry, Norman P.; Paty, Philip B., Accelerating identification of single nucleotide polymorphisms and alignment of clones in genomic sequencing.
Birkenmeyer Larry G. (Chicago IL) Carrino John J. (Gurnee IL) Dille Bruce J. (Antioch IL) Hu Hsiang-Yun (Libertyville IL) Kratochvil Jon D. (Kenosha WI) Laffler Thomas G. (Libertyville IL) Marshall R, Amplification of target nucleic acids using gap filling ligase chain reaction.
Whiteley Norman M. (San Carlos CA) Hunkapiller Michael W. (San Carlos CA) Glazer Alexander N. (Orinda CA), Detection of specific sequences in nucleic acids.
Pirrung Michael C. (Durham NC) Read J. Leighton (Palo Alto CA) Fodor Stephen P. A. (Palo Alto CA) Stryer Lubert (Stanford CA), Large scale photolithographic solid phase synthesis of polypeptides and receptor binding screening thereof.
Albrecht Glenn ; Brenner Sydney,GBX ; DuBridge Robert B. ; Lloyd David H. ; Pallas Michael C., Massively parallel signature sequencing by ligation of encoded adaptors.
Adams Christopher P. (Winter Hill MA) Kron Stephen Joseph (Boston MA), Method for performing amplification of nucleic acid with two primers bound to a single solid support.
Drmanac Radoje T. (Beograd YUX) Crkvenjakov Radomir B. (Beograd YUX), Method of determining an ordered sequence of subfragments of a nucleic acid fragment by hybridization of oligonucleotide.
Rothberg,Jonathan M.; Bader,Joel S.; Dewell,Scott B.; McDade,Keith; Simpson,John W.; Berka,Jan; Colangelo,Christopher M., Method of sequencing a nucleic acid.
Rothberg,Jonathan M.; Bader,Joel S.; Dewell,Scott B.; McDade,Keith; Simpson,John W.; Berka,Jan; Colangelo,Christopher M., Method of sequencing a nucleic acid.
Drmanac Radoje T. (Zvecanska 46 Beograd 11000) Crkvenjakov Radomir B. (Bulevar JNA 118 Beograd YUX 11000), Method of sequencing of genomes by hybridization of oligonucleotide probes.
Brennan Thomas M. (2000 Broadway ; No. 705 San Francisco CA 94115) Heyneker Herbert L. (360 Forest Ave. ; No. 506 Palo Alto CA 94301), Methods and compositions for determining the sequence of nucleic acids.
Drmanac Radoje T. ; Drmanac Snezana ; Hou Aaron ; Hauser Brian, Methods for sequencing repetitive sequences and for determining the order of sequence subfragments.
Heller Michael J. (Encinitas CA) Tu Eugene (San Diego CA) Butler William F. (Carlsbad CA), Molecular biological diagnostic systems including electrodes.
Urdea Michael S. (Alamo CA) Warner Brian (Martinez CA) Horn Thomas (Berkeley CA), Nucleic acid multimers and amplified nucleic acid hybridization assays using same.
Newman Peter J. (Shorewood WI) Aster Richard H. (Milwaukee WI), Polymorphism of human platelet membrane glycoprotein IIIa and diagnostic and therapeutic applications thereof.
Rabbani,Elazar; Stavrianopoulos,Jannis G.; Kirtikar,Dollie; Johnston,Kenneth H.; Thalenfeld,Barbara E., System, array and non-porous solid support comprising fixed or immobilized nucleic acids.
Cantor, Charles R.; Siddiqi, Fouad A., Use of nucleotide analogs in the analysis of oligonucleotide mixtures and in highly multiplexed nucleic acid sequencing.