A deep basecaller system for Sanger sequencing and associated methods are provided. The methods use deep machine learning. A Deep Learning Model is used to determine scan labelling probabilities from an analyzed trace. A Neural Network is trained to learn the optimal mapping function that minimizes a Connectionist Temporal Classification (CTC) loss function. The CTC function calculates the loss by matching a target sequence against the predicted scan labelling probabilities. A Decoder generates the sequence with the maximum probability. A Basecall position finder using prefix beam search walks through the CTC labelling probabilities to find a scan range, and then the scan position of peak labelling probability within that range, for each called base. A Quality Value (QV) is determined by using a feature vector calculated from the CTC labelling probabilities as an index into a QV look-up table to find a quality score.
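The QV look-up described above can be sketched as follows. This is a minimal illustration in Python, assuming a hypothetical calibration table (`QV_TABLE`) and using the called label's probability at its peak scan as a stand-in for the patent's feature vector; the function name and bin count are invented for the example.

```python
import numpy as np

# Hypothetical calibration table: 60 quality bins (not from the patent).
QV_TABLE = np.arange(1, 61)

def quality_value(label_probs, peak_scan, base_label, n_bins=60):
    """Map the CTC labelling probability at a called base's peak scan
    to a quality score via a look-up table.

    label_probs : (scans, labels) array of CTC label probabilities
    peak_scan   : scan index of the called base's probability peak
    base_label  : label index of the called base
    """
    # Feature: the called label's probability at its peak. This is an
    # assumption; the patent computes a feature vector from the CTC
    # labelling probabilities.
    feature = label_probs[peak_scan, base_label]
    index = min(int(feature * n_bins), n_bins - 1)
    return QV_TABLE[index]
```

In practice the table would be calibrated so that each bin's score reflects the empirically observed error rate of basecalls falling in that bin.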
Representative Claims
1. A neural network control system comprising: a trace generator coupled to a Sanger Sequencer and generating a trace for a biological sample; a segmenter to divide the trace into scan windows; an aligner to shift the scan windows; logic to determine associated annotated basecalls for each of the scan windows to generate target annotated basecalls for use in training; a bi-directional recurrent neural network (BRNN) comprising: at least one long short-term memory (LSTM) or gated recurrent unit (GRU) layer; an output layer configured to output scan label probabilities for all scans in a scan window; a CTC loss function to calculate the loss between the output scan label probabilities and the target annotated basecalls; and a gradient descent optimizer configured as a closed-loop feedback control to the BRNN to update weights of the BRNN to minimize the loss against a minibatch of training samples randomly selected from the target annotated basecalls at each training step.

2. The system of claim 1, further comprising: each of the scan windows comprising 500 scans shifted by 250 scans.

3. The system of claim 1, further comprising: an aggregator to assemble the label probabilities for all scan windows to generate label probabilities for the entire trace.

4. The system of claim 3, further comprising: a deque max finder algorithm to identify scan positions for the basecalls based on an output of the CTC loss function and the basecalls.

5. The system of claim 3, further comprising: a prefix beam search decoder to transform the label probabilities for the entire trace into basecalls for the biological sample.

6. The system of claim 5, wherein the basecalls are at 5′ and 3′ ends of the biological sample.

7. The system of claim 1, wherein the trace is a sequence of raw dye RFUs.

8. The system of claim 1, wherein the trace is raw spectrum data collected from one or more capillary electrophoresis genetic analyzers.

9. The system of claim 1, further comprising: at least one generative adversarial network configured to inject noise into the trace.

10. The system of claim 1, further comprising: at least one generative adversarial network configured to inject spikes into the trace.

11. The system of claim 1, further comprising: at least one generative adversarial network configured to inject dye blob artifacts into the trace.

12. A process control method, comprising: operating a Sanger Sequencer to generate a trace for a biological sample; dividing the trace into scan windows; shifting the scan windows; determining associated annotated basecalls for each of the scan windows to generate target annotated basecalls; inputting the scan windows to a bi-directional recurrent neural network (BRNN) comprising: at least one long short-term memory (LSTM) or gated recurrent unit (GRU) layer; an output layer configured to output scan label probabilities for all scans in a scan window; a CTC loss function to calculate the loss between the output scan label probabilities and the target annotated basecalls; and applying the loss through a gradient descent optimizer configured as a closed-loop feedback control to the BRNN to update weights of the BRNN to minimize the loss against a minibatch of training samples randomly selected from the target annotated basecalls at each training step.

13. The method of claim 12, further comprising: each of the scan windows comprising 500 scans shifted by 250 scans.
14. The method of claim 12, further comprising: assembling the label probabilities for all scan windows to generate label probabilities for the entire trace.

15. The method of claim 14, further comprising: identifying scan positions for the basecalls based on an output of the CTC loss function and the basecalls.

16. The method of claim 14, further comprising: decoding the label probabilities for the entire trace into basecalls for the biological sample.

17. The method of claim 16, wherein the basecalls are at 5′ and 3′ ends of the biological sample.

18. The method of claim 12, wherein the trace is one of a sequence of raw dye RFUs, or raw spectrum data collected from one or more capillary electrophoresis genetic analyzers.

19. The method of claim 12, further comprising: at least one generative adversarial network configured to inject one or more of noise, spikes, or dye blob artifacts into the trace.

20. A method of training networks for basecalling a sequencing sample, comprising: for each sample in a plurality of sequencing samples, dividing a sequence of preprocessed relative fluorescence units (RFUs) into a plurality of scan windows, with a first predetermined number of scans shifted by a second predetermined number of scans; determining an annotated basecall for each scan window of the plurality of scan windows; constructing a plurality of training samples, wherein each training sample in the plurality of training samples comprises the scan windows with the first predetermined number of scans and the respective annotated basecall; for each of a plurality of iterations: i) randomly selecting a subset of the plurality of training samples, ii) receiving, by a neural network, the selected subset of the plurality of training samples, wherein the neural network comprises: one or more hidden layers of a plurality of Long Short-Term Memory (LSTM) units or Gated Recurrent Units (GRUs), an output layer, and a plurality of network elements, wherein each network element is associated with one or more weights, iii) outputting, by the output layer, label probabilities for all scans of the training samples in the selected subset of the plurality of training samples, iv) calculating a loss between the output label probabilities and the respective annotated basecalls, v) updating the weights of the plurality of network elements, using a network optimizer, to minimize the loss against the selected subset of the plurality of training samples, vi) storing a trained network in a plurality of trained networks, vii) evaluating the trained networks with a validation data set, and viii) returning to step i) until a predetermined number of training steps is reached or a validation loss or error rate can no longer improve; calculating an evaluation loss or an error rate for the plurality of trained networks, using an independent subset of the plurality of samples which were not included in the selected subsets of training samples; and selecting a best trained network from the plurality of trained networks, wherein the best trained network has a minimum evaluation loss or error rate.
21. The method of claim 20, further comprising: receiving a sequencing sample; dividing an entire trace of the sequencing sample into a second plurality of scan windows, with the first predetermined number of scans shifted by the second predetermined number of scans; outputting scan label probabilities for the second plurality of scan windows, by providing the second plurality of scan windows to the selected trained network; assembling the scan label probabilities for the second plurality of scan windows to generate label probabilities for the entire trace of the sequencing sample; determining basecalls for the sequencing sample based on the assembled scan label probabilities; determining scan positions for all the determined basecalls based on the scan label probabilities and the basecalls; and outputting the determined basecalls and the determined scan positions.

22. A method for quality valuation of a series of sequencing basecalls, comprising: receiving scan label probabilities, basecalls, and scan positions for a plurality of samples; generating a plurality of training samples based on the plurality of samples using the scan label probabilities around the center scan position of each basecall for each sample in the plurality of samples; assigning a category to each basecall of each sample of the plurality of training samples, wherein the category corresponds to one of correct or incorrect; for each of a plurality of iterations: i) randomly selecting a subset of the plurality of training samples, ii) receiving, by a neural network, the selected subset of the plurality of training samples, wherein the neural network comprises: one or more hidden layers, an output layer, and a plurality of network elements, wherein each network element is associated with a weight, iii) outputting, by the output layer, predicted error probabilities based on the scan label probabilities using a hypothesis function, iv) calculating a loss between the predicted error probabilities and the assigned category for each basecall of each sample of the subset of the plurality of training samples, v) updating the weights of the plurality of network elements, using a network optimizer, to minimize the loss against the selected subset of the plurality of training samples, vi) storing the neural network as a trained network in a plurality of trained networks, and vii) returning to step i) until a predetermined number of training steps is reached or a validation loss or error can no longer improve; calculating an evaluation loss or an error rate for each trained network in the plurality of trained networks, using an independent subset of the plurality of samples which were not included in the selected subsets of training samples; and selecting a best trained network from the plurality of trained networks, wherein the best trained network has a minimum evaluation loss or error rate.

23. The method of claim 22, further comprising: receiving scan label probabilities around basecall positions of an input sample; outputting error probabilities for the input sample, by providing the scan label probabilities around basecall positions of the input sample to the selected trained network; determining a plurality of quality scores based on the output error probabilities; and outputting the plurality of quality scores.
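The claims above lend themselves to a few illustrative sketches. First, a minimal PyTorch rendering of the BRNN and the closed-loop CTC training step of claims 1 and 12, assuming 4-channel dye RFU input, five labels (blank plus A, C, G, T), and invented hyperparameters (hidden size, layer count, learning rate):

```python
import torch
import torch.nn as nn

class BasecallerBRNN(nn.Module):
    """Bi-directional LSTM over scan windows, emitting per-scan
    log label probabilities (blank + A, C, G, T)."""
    def __init__(self, in_channels=4, hidden=128, num_labels=5):
        super().__init__()
        self.rnn = nn.LSTM(in_channels, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, num_labels)

    def forward(self, windows):
        # windows: (batch, scans, channels) -> (batch, scans, labels)
        h, _ = self.rnn(windows)
        return self.out(h).log_softmax(dim=-1)

def train_step(model, optimizer, ctc_loss, windows, targets, target_lengths):
    """One closed-loop update on a random minibatch of scan windows."""
    log_probs = model(windows).permute(1, 0, 2)          # CTC wants (T, N, C)
    input_lengths = torch.full((windows.size(0),), windows.size(1),
                               dtype=torch.long)
    loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = BasecallerBRNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # gradient descent optimizer
ctc_loss = nn.CTCLoss(blank=0)
```

Each training step would draw a random minibatch of annotated 500-scan windows (claims 2 and 13) and call `train_step`; `nn.CTCLoss` computes the loss between the output scan label probabilities and the integer-encoded target basecalls.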
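Claims 2-3 and 13-14 divide the trace into 500-scan windows shifted by 250 scans and then assemble the per-window probabilities into probabilities for the entire trace. A NumPy sketch, with averaging over the overlaps as an assumption (the claims say only that the windows are assembled):

```python
import numpy as np

def make_windows(trace, window=500, shift=250):
    """Divide a (scans, channels) trace into overlapping scan windows
    (claims 2 and 13: 500 scans shifted by 250)."""
    starts = range(0, max(1, len(trace) - window + 1), shift)
    return [(s, trace[s:s + window]) for s in starts]

def assemble(window_probs, total_scans, num_labels=5):
    """Assemble per-window label probabilities into label probabilities
    for the entire trace (claims 3 and 14)."""
    probs = np.zeros((total_scans, num_labels))
    counts = np.zeros((total_scans, 1))
    for start, p in window_probs:              # p: (window, num_labels)
        probs[start:start + len(p)] += p
        counts[start:start + len(p)] += 1
    return probs / np.maximum(counts, 1)       # mean over overlapping windows
```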
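Claims 5 and 16 decode the assembled label probabilities with a prefix beam search. Below is a compact, standard CTC prefix beam search in log space, without a language model; the beam width and the label ordering (blank = 0, then A, C, G, T) are assumptions for the example:

```python
import math
from collections import defaultdict

NEG_INF = -float("inf")

def logsumexp(*xs):
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_prefix_beam_search(log_probs, blank=0, beam_width=8):
    """log_probs: (T, C) per-scan log label probabilities.
    Returns the most probable collapsed label sequence (tuple of ints)."""
    # Each prefix carries (log P(prefix ending in blank),
    #                      log P(prefix ending in non-blank)).
    beams = {(): (0.0, NEG_INF)}
    T, C = len(log_probs), len(log_probs[0])
    for t in range(T):
        next_beams = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beams.items():
            for c in range(C):
                p = log_probs[t][c]
                if c == blank:
                    nb_b, nb_nb = next_beams[prefix]
                    next_beams[prefix] = (logsumexp(nb_b, p_b + p, p_nb + p), nb_nb)
                    continue
                end = prefix[-1] if prefix else None
                new_prefix = prefix + (c,)
                nb_b, nb_nb = next_beams[new_prefix]
                if c == end:
                    # Repeated label: extend only via the blank path ...
                    next_beams[new_prefix] = (nb_b, logsumexp(nb_nb, p_b + p))
                    # ... otherwise it collapses onto the same prefix.
                    sb_b, sb_nb = next_beams[prefix]
                    next_beams[prefix] = (sb_b, logsumexp(sb_nb, p_nb + p))
                else:
                    next_beams[new_prefix] = (nb_b, logsumexp(nb_nb, p_b + p, p_nb + p))
        # Keep the top beam_width prefixes by total probability.
        beams = dict(sorted(next_beams.items(),
                            key=lambda kv: logsumexp(*kv[1]),
                            reverse=True)[:beam_width])
    return max(beams.items(), key=lambda kv: logsumexp(*kv[1]))[0]

# Example label-ordering assumption: 0 = blank, 1..4 = A, C, G, T:
# sequence = "".join("-ACGT"[i] for i in ctc_prefix_beam_search(log_probs))
```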
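Claims 4 and 15 identify scan positions for the basecalls with a "deque max finder". A plausible reading is the classic monotonic-deque sliding-window maximum, used here to locate a peak-probability scan within each base's scan range; the function below is an illustrative sketch, not the patent's exact algorithm:

```python
from collections import deque

def sliding_window_argmax(values, width):
    """For each window of `width` consecutive scans, report the index of
    the maximum value, in O(n) total via a monotonic deque."""
    dq = deque()                 # indices whose values are decreasing
    peaks = []
    for i, v in enumerate(values):
        while dq and values[dq[-1]] <= v:
            dq.pop()             # drop entries dominated by the new value
        dq.append(i)
        if dq[0] <= i - width:
            dq.popleft()         # drop entries that left the window
        if i >= width - 1:
            peaks.append(dq[0])  # argmax of values[i-width+1 .. i]
    return peaks
```

Given a scan range for a called base, the base's scan position is then the reported argmax of its label probability within that range.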
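Claims 9-11 and 19 recite generative adversarial networks that inject noise, spikes, or dye blob artifacts into traces, presumably for training-data augmentation. A full GAN training loop is beyond a sketch; the fragment below shows only a hypothetical generator with an invented residual-convolution architecture (the discriminator and adversarial training are omitted):

```python
import torch
import torch.nn as nn

class TraceArtifactGenerator(nn.Module):
    """Hypothetical GAN generator that adds synthetic artifacts
    (noise, spikes, dye blobs) to a clean 4-channel trace window."""
    def __init__(self, channels=4, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels + 1, hidden, kernel_size=9, padding=4),
            nn.ReLU(),
            nn.Conv1d(hidden, channels, kernel_size=9, padding=4),
        )

    def forward(self, trace, z):
        # trace: (batch, 4, scans); z: (batch, 1, scans) latent noise.
        # Residual formulation: output = clean trace + learned artifact.
        return trace + self.net(torch.cat([trace, z], dim=1))
```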
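Finally, claims 22-23 train a small network whose hypothesis function maps the scan label probabilities around each basecall's center position to a predicted error probability, supervised by the correct/incorrect categories. A sketch assuming a ±2-scan feature window, a sigmoid hypothesis, and invented layer sizes, with the standard Phred-style conversion from error probability to quality score (a convention, not recited verbatim in the claims):

```python
import torch
import torch.nn as nn

class QVNet(nn.Module):
    """Predicts a per-basecall error probability from the scan label
    probabilities flattened over a small window around the call's center."""
    def __init__(self, num_labels=5, half_window=2, hidden=32):
        super().__init__()
        in_features = num_labels * (2 * half_window + 1)
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, features):
        # Sigmoid hypothesis function -> predicted error probability.
        return torch.sigmoid(self.net(features)).squeeze(-1)

def quality_scores(error_probs, max_q=60):
    """Phred-style conversion: Q = -10 * log10(p_err)."""
    q = -10.0 * torch.log10(error_probs.clamp_min(1e-6))
    return q.clamp(0, max_q)
```

Training would minimize `nn.BCELoss()` between the `QVNet` outputs and the 0/1 correct/incorrect categories on random minibatches, mirroring steps i)-vii) of claim 22.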