A deep basecaller system for Sanger sequencing and associated methods are provided. The methods use deep machine learning. A Deep Learning Model is used to determine scan labelling probabilities from an analyzed trace. A Neural Network is trained to learn the optimal mapping function that minimizes a Connectionist Temporal Classification (CTC) loss function. The CTC function calculates the loss by matching a target sequence against the predicted scan labelling probabilities. A Decoder generates the sequence with the maximum probability. A Basecall position finder using prefix beam search walks through the CTC labelling probabilities to find a scan range, and then the scan position of peak labelling probability within that range, for each called base. A Quality Value (QV) is determined by using a feature vector calculated from the CTC labelling probabilities as an index into a QV look-up table to find a quality score.
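The QV look-up described above can be sketched as follows. This is a minimal illustration in Python, assuming a hypothetical calibration table (`QV_TABLE`) and using the called label's probability at its peak scan as a stand-in for the patent's feature vector; the function name and bin count are invented for the example.

```python
import numpy as np

# Hypothetical calibration table: 60 quality bins (not from the patent).
QV_TABLE = np.arange(1, 61)

def quality_value(label_probs, peak_scan, base_label, n_bins=60):
    """Map the CTC labelling probability at a called base's peak scan
    to a quality score via a look-up table.

    label_probs : (scans, labels) array of CTC label probabilities
    peak_scan   : scan index of the called base's probability peak
    base_label  : label index of the called base
    """
    # Feature: the called label's probability at its peak. This is an
    # assumption; the patent computes a feature vector from the CTC
    # labelling probabilities.
    feature = label_probs[peak_scan, base_label]
    index = min(int(feature * n_bins), n_bins - 1)
    return QV_TABLE[index]
```

In practice the table would be calibrated so that each bin's score reflects the empirically observed error rate of basecalls falling in that bin.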
Representative Claims
1. A neural network control system comprising: a trace generator coupled to a Sanger Sequencer and generating a trace for a biological sample; a segmenter to divide the trace into scan windows; an aligner to shift the scan windows; logic to determine associated annotated basecalls for each of the scan windows to generate target annotated basecalls for use in training; a bi-directional recurrent neural network (BRNN) comprising: at least one long short-term memory (LSTM) or gated recurrent unit (GRU) layer; an output layer configured to output scan label probabilities for all scans in a scan window; a CTC loss function to calculate the loss between the output scan label probabilities and the target annotated basecalls; and a gradient descent optimizer configured as a closed-loop feedback control to the BRNN to update weights of the BRNN to minimize the loss against a minibatch of training samples randomly selected from the target annotated basecalls at each training step.

2. The system of claim 1, further comprising: each of the scan windows comprising 500 scans shifted by 250 scans.

3. The system of claim 1, further comprising: an aggregator to assemble the label probabilities for all scan windows to generate label probabilities for the entire trace.

4. The system of claim 3, further comprising: a deque max finder algorithm to identify scan positions for the basecalls based on an output of the CTC loss function and the basecalls.

5. The system of claim 3, further comprising: a prefix beam search decoder to transform the label probabilities for the entire trace into basecalls for the biological sample.

6. The system of claim 5, wherein the basecalls are at 5′ and 3′ ends of the biological sample.

7. The system of claim 1, wherein the trace is a sequence of raw dye RFUs.

8. The system of claim 1, wherein the trace is raw spectrum data collected from one or more capillary electrophoresis genetic analyzers.

9. The system of claim 1, further comprising: at least one generative adversarial network configured to inject noise into the trace.

10. The system of claim 1, further comprising: at least one generative adversarial network configured to inject spikes into the trace.

11. The system of claim 1, further comprising: at least one generative adversarial network configured to inject dye blob artifacts into the trace.

12. A process control method, comprising: operating a Sanger Sequencer to generate a trace for a biological sample; dividing the trace into scan windows; shifting the scan windows; determining associated annotated basecalls for each of the scan windows to generate target annotated basecalls; inputting the scan windows to a bi-directional recurrent neural network (BRNN) comprising: at least one long short-term memory (LSTM) or gated recurrent unit (GRU) layer; an output layer configured to output scan label probabilities for all scans in a scan window; a CTC loss function to calculate the loss between the output scan label probabilities and the target annotated basecalls; and applying the loss through a gradient descent optimizer configured as a closed-loop feedback control to the BRNN to update weights of the BRNN to minimize the loss against a minibatch of training samples randomly selected from the target annotated basecalls at each training step.

13. The method of claim 12, further comprising: each of the scan windows comprising 500 scans shifted by 250 scans.
14. The method of claim 12, further comprising: assembling the label probabilities for all scan windows to generate label probabilities for the entire trace.

15. The method of claim 14, further comprising: identifying scan positions for the basecalls based on an output of the CTC loss function and the basecalls.

16. The method of claim 14, further comprising: decoding the label probabilities for the entire trace into basecalls for the biological sample.

17. The method of claim 16, wherein the basecalls are at 5′ and 3′ ends of the biological sample.

18. The method of claim 12, wherein the trace is one of a sequence of raw dye RFUs, or raw spectrum data collected from one or more capillary electrophoresis genetic analyzers.

19. The method of claim 12, further comprising: at least one generative adversarial network configured to inject one or more of noise, spikes, or dye blob artifacts into the trace.

20. A method of training networks for basecalling a sequencing sample, comprising: for each sample in a plurality of sequencing samples, dividing a sequence of preprocessed relative fluorescence units (RFUs) into a plurality of scan windows, with a first predetermined number of scans shifted by a second predetermined number of scans; determining an annotated basecall for each scan window of the plurality of scan windows; constructing a plurality of training samples, wherein each training sample in the plurality of training samples comprises the scan windows with the first predetermined number of scans and the respective annotated basecall; for each of a plurality of iterations: i) randomly selecting a subset of the plurality of training samples, ii) receiving, by a neural network, the selected subset of the plurality of training samples, wherein the neural network comprises: one or more hidden layers of a plurality of Long Short-Term Memory (LSTM) units or Gated Recurrent Units (GRUs), an output layer, and a plurality of network elements, wherein each network element is associated with one or more weights, iii) outputting, by the output layer, label probabilities for all scans of the training samples in the selected subset of the plurality of training samples, iv) calculating a loss between the output label probabilities and the respective annotated basecalls, v) updating the weights of the plurality of network elements, using a network optimizer, to minimize the loss against the selected subset of the plurality of training samples, vi) storing a trained network in a plurality of trained networks, vii) evaluating the trained networks with a validation data set, and viii) returning to step i) until a predetermined number of training steps is reached or a validation loss or error rate can no longer improve; calculating an evaluation loss or an error rate for the plurality of trained networks, using an independent subset of the plurality of samples which were not included in the selected subsets of training samples; and selecting a best trained network from the plurality of trained networks, wherein the best trained network has a minimum evaluation loss or error rate.
21. The method of claim 20, further comprising: receiving a sequencing sample; dividing an entire trace of the sequencing sample into a second plurality of scan windows, with the first predetermined number of scans shifted by the second predetermined number of scans; outputting scan label probabilities for the second plurality of scan windows, by providing the second plurality of scan windows to the selected trained network; assembling the scan label probabilities for the second plurality of scan windows to generate label probabilities for the entire trace of the sequencing sample; determining basecalls for the sequencing sample based on the assembled scan label probabilities; determining scan positions for all the determined basecalls based on the scan label probabilities and the basecalls; and outputting the determined basecalls and the determined scan positions.

22. A method for quality valuation of a series of sequencing basecalls, comprising: receiving scan label probabilities, basecalls, and scan positions for a plurality of samples; generating a plurality of training samples based on the plurality of samples using the scan label probabilities around the center scan position of each basecall for each sample in the plurality of samples; assigning a category to each basecall of each sample of the plurality of training samples, wherein the category corresponds to one of correct or incorrect; for each of a plurality of iterations: i) randomly selecting a subset of the plurality of training samples, ii) receiving, by a neural network, the selected subset of the plurality of training samples, wherein the neural network comprises: one or more hidden layers, an output layer, and a plurality of network elements, wherein each network element is associated with a weight, iii) outputting, by the output layer, predicted error probabilities based on the scan label probabilities using a hypothesis function, iv) calculating a loss between the predicted error probabilities and the assigned category for each basecall of each sample of the subset of the plurality of training samples, v) updating the weights of the plurality of network elements, using a network optimizer, to minimize the loss against the selected subset of the plurality of training samples, vi) storing the neural network as a trained network in a plurality of trained networks, and vii) returning to step i) until a predetermined number of training steps is reached or a validation loss or error can no longer improve; calculating an evaluation loss or an error rate for each trained network in the plurality of trained networks, using an independent subset of the plurality of samples which were not included in the selected subsets of training samples; and selecting a best trained network from the plurality of trained networks, wherein the best trained network has a minimum evaluation loss or error rate.

23. The method of claim 22, further comprising: receiving scan label probabilities around basecall positions of an input sample; outputting error probabilities for the input sample, by providing the scan label probabilities around basecall positions of the input sample to the selected trained network; determining a plurality of quality scores based on the output error probabilities; and outputting the plurality of quality scores.
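The claims above lend themselves to a few illustrative sketches. First, a minimal PyTorch rendering of the BRNN and the closed-loop CTC training step of claims 1 and 12, assuming 4-channel dye RFU input, five labels (blank plus A, C, G, T), and invented hyperparameters (hidden size, layer count, learning rate):

```python
import torch
import torch.nn as nn

class BasecallerBRNN(nn.Module):
    """Bi-directional LSTM over scan windows, emitting per-scan
    log label probabilities (blank + A, C, G, T)."""
    def __init__(self, in_channels=4, hidden=128, num_labels=5):
        super().__init__()
        self.rnn = nn.LSTM(in_channels, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, num_labels)

    def forward(self, windows):
        # windows: (batch, scans, channels) -> (batch, scans, labels)
        h, _ = self.rnn(windows)
        return self.out(h).log_softmax(dim=-1)

def train_step(model, optimizer, ctc_loss, windows, targets, target_lengths):
    """One closed-loop update on a random minibatch of scan windows."""
    log_probs = model(windows).permute(1, 0, 2)          # CTC wants (T, N, C)
    input_lengths = torch.full((windows.size(0),), windows.size(1),
                               dtype=torch.long)
    loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = BasecallerBRNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # gradient descent optimizer
ctc_loss = nn.CTCLoss(blank=0)
```

Each training step would draw a random minibatch of annotated 500-scan windows (claims 2 and 13) and call `train_step`; `nn.CTCLoss` computes the loss between the output scan label probabilities and the integer-encoded target basecalls.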
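Claims 2-3 and 13-14 divide the trace into 500-scan windows shifted by 250 scans and then assemble the per-window probabilities into probabilities for the entire trace. A NumPy sketch, with averaging over the overlaps as an assumption (the claims say only that the windows are assembled):

```python
import numpy as np

def make_windows(trace, window=500, shift=250):
    """Divide a (scans, channels) trace into overlapping scan windows
    (claims 2 and 13: 500 scans shifted by 250)."""
    starts = range(0, max(1, len(trace) - window + 1), shift)
    return [(s, trace[s:s + window]) for s in starts]

def assemble(window_probs, total_scans, num_labels=5):
    """Assemble per-window label probabilities into label probabilities
    for the entire trace (claims 3 and 14)."""
    probs = np.zeros((total_scans, num_labels))
    counts = np.zeros((total_scans, 1))
    for start, p in window_probs:              # p: (window, num_labels)
        probs[start:start + len(p)] += p
        counts[start:start + len(p)] += 1
    return probs / np.maximum(counts, 1)       # mean over overlapping windows
```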
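Claims 5 and 16 decode the assembled label probabilities with a prefix beam search. Below is a compact, standard CTC prefix beam search in log space, without a language model; the beam width and the label ordering (blank = 0, then A, C, G, T) are assumptions for the example:

```python
import math
from collections import defaultdict

NEG_INF = -float("inf")

def logsumexp(*xs):
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_prefix_beam_search(log_probs, blank=0, beam_width=8):
    """log_probs: (T, C) per-scan log label probabilities.
    Returns the most probable collapsed label sequence (tuple of ints)."""
    # Each prefix carries (log P(prefix ending in blank),
    #                      log P(prefix ending in non-blank)).
    beams = {(): (0.0, NEG_INF)}
    T, C = len(log_probs), len(log_probs[0])
    for t in range(T):
        next_beams = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beams.items():
            for c in range(C):
                p = log_probs[t][c]
                if c == blank:
                    nb_b, nb_nb = next_beams[prefix]
                    next_beams[prefix] = (logsumexp(nb_b, p_b + p, p_nb + p), nb_nb)
                    continue
                end = prefix[-1] if prefix else None
                new_prefix = prefix + (c,)
                nb_b, nb_nb = next_beams[new_prefix]
                if c == end:
                    # Repeated label: extend only via the blank path ...
                    next_beams[new_prefix] = (nb_b, logsumexp(nb_nb, p_b + p))
                    # ... otherwise it collapses onto the same prefix.
                    sb_b, sb_nb = next_beams[prefix]
                    next_beams[prefix] = (sb_b, logsumexp(sb_nb, p_nb + p))
                else:
                    next_beams[new_prefix] = (nb_b, logsumexp(nb_nb, p_b + p, p_nb + p))
        # Keep the top beam_width prefixes by total probability.
        beams = dict(sorted(next_beams.items(),
                            key=lambda kv: logsumexp(*kv[1]),
                            reverse=True)[:beam_width])
    return max(beams.items(), key=lambda kv: logsumexp(*kv[1]))[0]

# Example label-ordering assumption: 0 = blank, 1..4 = A, C, G, T:
# sequence = "".join("-ACGT"[i] for i in ctc_prefix_beam_search(log_probs))
```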
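Claims 4 and 15 identify scan positions for the basecalls with a "deque max finder". A plausible reading is the classic monotonic-deque sliding-window maximum, used here to locate a peak-probability scan within each base's scan range; the function below is an illustrative sketch, not the patent's exact algorithm:

```python
from collections import deque

def sliding_window_argmax(values, width):
    """For each window of `width` consecutive scans, report the index of
    the maximum value, in O(n) total via a monotonic deque."""
    dq = deque()                 # indices whose values are decreasing
    peaks = []
    for i, v in enumerate(values):
        while dq and values[dq[-1]] <= v:
            dq.pop()             # drop entries dominated by the new value
        dq.append(i)
        if dq[0] <= i - width:
            dq.popleft()         # drop entries that left the window
        if i >= width - 1:
            peaks.append(dq[0])  # argmax of values[i-width+1 .. i]
    return peaks
```

Given a scan range for a called base, the base's scan position is then the reported argmax of its label probability within that range.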
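Claims 9-11 and 19 recite generative adversarial networks that inject noise, spikes, or dye blob artifacts into traces, presumably for training-data augmentation. A full GAN training loop is beyond a sketch; the fragment below shows only a hypothetical generator with an invented residual-convolution architecture (the discriminator and adversarial training are omitted):

```python
import torch
import torch.nn as nn

class TraceArtifactGenerator(nn.Module):
    """Hypothetical GAN generator that adds synthetic artifacts
    (noise, spikes, dye blobs) to a clean 4-channel trace window."""
    def __init__(self, channels=4, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels + 1, hidden, kernel_size=9, padding=4),
            nn.ReLU(),
            nn.Conv1d(hidden, channels, kernel_size=9, padding=4),
        )

    def forward(self, trace, z):
        # trace: (batch, 4, scans); z: (batch, 1, scans) latent noise.
        # Residual formulation: output = clean trace + learned artifact.
        return trace + self.net(torch.cat([trace, z], dim=1))
```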
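Finally, claims 22-23 train a small network whose hypothesis function maps the scan label probabilities around each basecall's center position to a predicted error probability, supervised by the correct/incorrect categories. A sketch assuming a ±2-scan feature window, a sigmoid hypothesis, and invented layer sizes, with the standard Phred-style conversion from error probability to quality score (a convention, not recited verbatim in the claims):

```python
import torch
import torch.nn as nn

class QVNet(nn.Module):
    """Predicts a per-basecall error probability from the scan label
    probabilities flattened over a small window around the call's center."""
    def __init__(self, num_labels=5, half_window=2, hidden=32):
        super().__init__()
        in_features = num_labels * (2 * half_window + 1)
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, features):
        # Sigmoid hypothesis function -> predicted error probability.
        return torch.sigmoid(self.net(features)).squeeze(-1)

def quality_scores(error_probs, max_q=60):
    """Phred-style conversion: Q = -10 * log10(p_err)."""
    q = -10.0 * torch.log10(error_probs.clamp_min(1e-6))
    return q.clamp(0, max_q)
```

Training would minimize `nn.BCELoss()` between the `QVNet` outputs and the 0/1 correct/incorrect categories on random minibatches, mirroring steps i)-vii) of claim 22.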