Disclosed herein are compositions, systems and methods related to sequence assembly, such as nucleic acid sequence assembly of single reads and contigs into larger contigs and scaffolds through the use of read pair sequence information, such as read pair information indicative of nucleic acid sequence phase or physical linkage.
Representative claims
1. A method for nucleic acid sequence data assembly, comprising: (a) obtaining purified DNA; (b) binding the purified DNA with a DNA binding agent to form DNA/chromatin complexes; (c) incubating the DNA-chromatin complexes with restriction enzymes to leave sticky ends; (d) performing ligation to join ends of DNA; (e) sequencing ligated DNA junctions to generate paired end reads; (f) obtaining standard paired-end read distance frequency data; (g) obtaining grouped contig sequences; and (h) scaffolding the grouped contig sequences such that read pair distance frequency data for read pairs that map to separate contigs approximates the standard paired-end read distance frequency data, thereby assembling the sequence data of the nucleic acid.
2. The method of claim 1, wherein read pair distance frequency data for read pairs that map to separate contigs more closely approximates the standard paired-end read distance frequency data when read pair distance likelihood is improved by at least 5% compared to read pair distance likelihood calculated for unscaffolded contig sequences.
3. A method for scaffolding contigs of nucleic acid sequence information obtained from a biological sample, said method comprising: (a) obtaining a set of contig sequences having an initial configuration, wherein the contig sequences are obtained by extracting DNA from a biological material and sequencing the DNA; (b) obtaining a set of paired end reads, wherein the set of paired-end reads is obtained by digesting sample DNA to generate internal double strand breaks within the nucleic acid, allowing the double strand breaks to re-ligate randomly to form a plurality of re-ligation junctions, and sequencing across the plurality of re-ligation junctions; (c) obtaining standard paired-end read distance frequency data; (d) grouping contig pairs sharing sequence that coexists in at least one paired end read, thereby generating grouped contigs; and (e) scaffolding the grouped contigs such that read pair distance frequency data for read pairs that map to separate contigs more closely approximates the standard paired-end read distance frequency data by at least 5% relative to the read pair frequency data of the grouped contigs in the initial configuration.
4. The method of claim 3, wherein the sample DNA is crosslinked to at least one DNA binding agent.
5. The method of claim 3, wherein the sample DNA comprises isolated naked DNA.
6. The method of claim 5, wherein the isolated naked DNA is reassembled into reconstituted chromatin.
7. The method of claim 6, wherein the reconstituted chromatin is crosslinked.
8. The method of claim 6, wherein the reconstituted chromatin comprises a DNA binding protein.
9. The method of claim 6, wherein the reconstituted chromatin comprises a nanoparticle.
10. The method of claim 3, wherein the set of paired-end reads are obtained by digesting sample DNA to generate internal double strand breaks within the nucleic acid, allowing the double strand breaks to re-ligate randomly to form a plurality of re-ligation junctions, and sequencing on each side of the plurality of re-ligation junctions.
11. The method of claim 3, wherein standard paired-end read distance frequency data is obtained from paired-end reads where both reads map to a common contig.
12. The method of claim 3, wherein standard paired-end read distance frequency data is obtained from previously generated curves.
13. The method of claim 3, wherein said scaffolding comprises selecting a first set of putative adjacent contigs of said grouped contigs, determining a minimal distance order of said first set of putative adjacent contigs that reduces an aggregate measure of the read-pair distances for said read pairs, and scaffolding said first set of putative adjacent contigs so as to reduce said aggregate measure of the read-pair distance.
14. The method of claim 13, wherein said determining a minimal distance order comprises determining an expected read-pair distance for all possible contig configurations for at least one read pair that comprises reads mapping to two contigs of said first set of putative adjacent contigs.
15. The method of claim 14, comprising selecting a contig orientation from said all possible contig configurations that most improves the read pair distance likelihood compared to other possible configurations from said all possible contig configurations.
16. The method of claim 3, wherein the biological sample comprises a genome.
17. The method of claim 3, wherein the biological sample is a heterogeneous sample comprising a plurality of genomes.
18. A method for scaffolding contigs of nucleic acid sequence information comprising: (a) obtaining a set of contig sequences having an initial configuration; (b) obtaining a set of paired end reads; (c) obtaining standard paired-end read distance frequency data; (d) grouping contig pairs sharing sequence that coexists in at least one paired end read, thereby generating grouped contigs; and (e) scaffolding the grouped contigs such that read pair distance frequency data for read pairs that map to separate contigs more closely approximates the standard paired-end read distance frequency data by at least 5% relative to the read pair distance frequency data of the grouped contigs in the initial configuration.
19. The method of claim 18, wherein the scaffolding comprises at least one of ordering the grouped contigs, orienting the grouped contigs, merging at least two contigs end to end, inserting one contig into a second contig, and cleaving a contig into at least two constituent contigs.
20. The method of claim 18, wherein read pair distance frequency data for read pairs that map to separate contigs more closely approximates the paired-end read distance frequency data when read pair distance likelihood increases by at least 5% relative to the read pair distance frequency data of the group contigs in the initial configuration.
21. The method of claim 20, wherein read-pair distance likelihood is maximized.
22. The method of claim 18, wherein read pair distance frequency data for read pairs that map to separate contigs more closely approximates the standard paired-end read distance frequency data when a statistical measure of difference between the read pair distance frequency data for read pairs that map to separate contigs and the standard paired-end read distance frequency data decreases by at least 5% relative to the read pair distance frequency data of the grouped contigs in the initial configuration.
23. The method of claim 22, wherein the statistical measure of distance between the read pair distance frequency data for read pairs that map to separate contigs and the standard paired-end read distance frequency data comprises at least one of ANOVA, a t-test, and a X-squared test.
24. The method of claim 23, wherein read pair distance for read pairs that map to separate contigs more closely matches the paired-end read distance frequency data when deviation of read pair distance distribution among ordered contigs obtained as compared to standard paired-end read distance frequency decreases.
25. The method of claim 24, wherein deviation of read pair distance distribution among ordered contigs obtained as compared to standard paired-end read distance frequency is minimized.
26. A method of assembling contig sequence information into at least one scaffold, comprising (a) obtaining sequence information corresponding to a plurality of contigs, obtaining paired-end read information from a nucleic acid sample represented by the plurality of contigs, and (b) configuring the plurality of contigs such that deviation of a read pair distance parameter from a predicted read pair distance data set is decreased by at least 5% compared to the read pair distance parameter of plurality of contigs in an initial configuration, wherein the configuring occurs in less than 8 hours.
27. The method of claim 26, wherein the predicted read pair distance data set comprises a read pair distance likelihood curve.
28. The method of claim 26, wherein the read pair distance parameter is maximum distance likelihood relative to a read pair distance likelihood curve.
29. The method of claim 26, wherein the read pair distance parameter is variation relative to a read pair distance likelihood curve.
30. The method of claim 26, wherein 70% of the plurality of contigs are ordered and oriented so as to match the relative order and orientation of their sequences in the nucleic acid sample in no more than 8 hours.
31. The method of claim 1, wherein an N50 value is increased by at least 1% relative to unscaffolded sequence data.
32. The method of claim 3, wherein an N50 value is increased by at least 1% relative to the grouped contigs in the initial configuration.
33. The method of claim 18, wherein an N50 value is increased by at least 1% relative to the grouped contigs in the initial configuration.
34. The method of claim 26, wherein an N50 value is increased by at least 1% relative to the plurality of contigs in the initial configuration.
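Illustrative note (not part of the claims): the scoring idea in claims 11, 14 and 15 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the patented implementation. It assumes the standard paired-end read distance frequency data can be summarized by an exponential density whose mean is fitted from read pairs that map to a common contig. It also assumes exactly two contigs, a fixed gap of zero, and 0-based read positions. All function names and the toy numbers are hypothetical.

```python
import math
from itertools import product

def fit_scale(intra_contig_distances):
    """Mean separation of same-contig pairs (the exponential model's MLE)."""
    return sum(intra_contig_distances) / len(intra_contig_distances)

def log_likelihood(distances, scale):
    """Log-likelihood of separations under an exponential density (assumed model)."""
    return sum(-math.log(scale) - d / scale for d in distances)

def implied_distances(pairs, lengths, order, flips, gap=0):
    """Separation of each read pair when the two contigs are laid out in `order`.

    pairs   : list of (position in contig A, position in contig B)
    lengths : {"A": length of A, "B": length of B}
    order   : ("A", "B") or ("B", "A")
    flips   : {"A": bool, "B": bool}; True means the contig is reversed
    """
    first, second = order
    offset = {first: 0, second: lengths[first] + gap}

    def coord(name, pos):
        # Reversing a contig mirrors positions within that contig.
        local = lengths[name] - 1 - pos if flips[name] else pos
        return offset[name] + local

    return [abs(coord("A", a) - coord("B", b)) for a, b in pairs]

def best_configuration(pairs, lengths, scale, gap=0):
    """Score every order/orientation combination and keep the most likely one."""
    best = None
    for order in (("A", "B"), ("B", "A")):
        for flip_a, flip_b in product((False, True), repeat=2):
            flips = {"A": flip_a, "B": flip_b}
            dists = implied_distances(pairs, lengths, order, flips, gap)
            score = log_likelihood(dists, scale)
            if best is None or score > best[0]:
                best = (score, order, flips)
    return best

# Toy data (hypothetical): same-contig pair separations and four cross-contig pairs.
intra = [120, 300, 80, 500, 250]
scale = fit_scale(intra)
lengths = {"A": 1000, "B": 1000}
pairs = [(950, 20), (980, 60), (900, 100), (990, 10)]

score, order, flips = best_configuration(pairs, lengths, scale)
print(order, flips)
```

Each layout has a mirror image (both contigs reversed and swapped) that implies identical separations, so the eight enumerated combinations yield only four distinct scores. The sketch keeps the first of any tied pair.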