Apparatus and method for two computing elements in a fault-tolerant server to execute instructions in lockstep
IPC classification information
Country / Type
United States(US) Patent
Granted
International Patent Classification (IPC, 7th edition)
G06F-011/00
Application number
US-0832466
(2001-04-11)
§371/§102 date
20011016
Inventor / Address
Griffin, Gerry
McLoughlin, Michael
Applicant / Address
Stratus Technologies Bermuda Ltd.
Agent / Address
Kirkpatrick &
Citation information
Times cited: 21
Patents cited: 90
Abstract
An apparatus and method for a first computing element and a second computing element to execute in lockstep in a fault-tolerant server. In one embodiment, the first computing element provides a first instruction to a communications link and the second computing element provides a second instruction to a communications link. In one embodiment, a first local input-output (I/O) subsystem and a second local I/O subsystem are each in communication with the communications link. The first and/or the second local I/O subsystem compare the first instruction and the second instruction. In one embodiment, the first and second local I/O subsystems indicate a fault of the first computing element or the second computing element. Such a fault may be determined by a miscompare of the first instruction and the second instruction.
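The miscompare check described in the abstract can be illustrated with a minimal software sketch. It is only an analogy for what the patent describes as logic in the local I/O subsystems, and every name in it (`Miscompare`, `compare_lockstep`, integer-valued outputs) is a hypothetical chosen for illustration, not something taken from the patent.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Miscompare:
    """Records where the two output streams first diverged (hypothetical type)."""
    index: int
    first_output: int
    second_output: int


def compare_lockstep(first: Iterable[int],
                     second: Iterable[int]) -> Optional[Miscompare]:
    """Compare the outputs of two computing elements pairwise.

    Returns the first Miscompare found, or None if every compared pair matches.
    ASSUMPTION: outputs are modelled as integers and arrive in the same order;
    zip() stops at the shorter stream, so a length mismatch is not reported.
    """
    for index, (a, b) in enumerate(zip(first, second)):
        if a != b:
            return Miscompare(index, a, b)
    return None


if __name__ == "__main__":
    print(compare_lockstep([1, 2, 3], [1, 2, 3]))
    print(compare_lockstep([1, 2, 3], [1, 9, 3]))
```

A returned `Miscompare` corresponds to the fault indication in the abstract. A mismatch by itself does not say which computing element is at fault.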
Representative claims
1. A fault-tolerant server comprising: (a) a communications link comprising a switching fabric, a first communications channel, and a second communications channel; (b) a first computing element in electrical communication with the communications link, the first computing element providing a first output to the communications link; (c) a second computing element in electrical communication with the communications link, the second computing element providing a second output to the communications link; (d) a first local input-output (I/O) module in electrical communication with the first computing element and the communications link; and (e) a second local I/O module in electrical communication with the second computing element and the communications link, wherein at least one of the first local I/O module and the second local I/O module compares the first output and the second output and indicates a fault of at least one of the first computing element and the second computing element upon the detection of a miscompare of the first output and the second output, and wherein the first local I/O module is in electrical communication with the second local I/O module via a sync bus to synchronize the first local I/O module and the second local I/O module, the synchronization of the first local I/O module and the second local I/O module providing a verification of state information about the first computing element and the second computing element.
2. The fault-tolerant server of claim 1 wherein each computing element further comprises a respective Central Processing Unit (CPU) and a respective local mass storage device.
3. The fault-tolerant server of claim 2 wherein the switching fabric comprises: a first switching fabric in electrical communication with the CPU of the first computing element; and a second switching fabric in electrical communication with the CPU of the second computing element, wherein each respective switching fabric is in electrical communication with at least one of the first local I/O module and the second local I/O module.
4. The fault-tolerant server of claim 1 further comprising a priority module to assign a priority to each respective computing element.
5. The fault-tolerant server of claim 4 wherein each local I/O module further comprises I/O fault-tolerant logic to determine whether at least one of the first computing element and the second computing element is faulty based on the priority.
6. The fault-tolerant server of claim 1 wherein each local I/O module further comprises I/O fault-tolerant logic to determine whether the first output and the second output are equivalent.
7. The fault-tolerant server of claim 6 wherein each I/O fault-tolerant logic comprises a comparator.
8. The fault-tolerant server of claim 6 wherein each I/O fault-tolerant logic further comprises a buffer to hold at least one of the first output and the second output from at least one of the CPUs.
9. The fault-tolerant server of claim 1 further comprising a voter delay buffer to store at least one of the first output and the second output upon a miscompare of the first output and the second output.
10. The fault-tolerant server of claim 1 further comprising a first delay module in electrical communication with the first local I/O module to delay transmission of at least one output to the first local I/O module and a second delay module in electrical communication with the second local I/O module to delay transmission of at least one output to the second local I/O module.
11. The fault-tolerant server of claim 1 wherein the first computing element and the second computing element further comprise a 1U rack-mount motherboard.
12. The fault-tolerant server of claim 1 wherein each respective local I/O module is located on a same motherboard as the respective computing element.
13. A method for a first computing element and a second computing element to execute in lockstep in a fault-tolerant server, the method comprising the steps of: (a) establishing communication between the first computing element and a communications link, the communications link comprising a switching fabric, a first communications channel, and a second communications channel; (b) establishing communication between the second computing element and the communications link; (c) transmitting, by the first computing element, a first output to the communications link; (d) transmitting, by the second computing element, a second output to the communications link; and (e) comparing, by at least one of a local input-output (I/O) module of the first computing element and a local I/O module of the second computing element, the first output and the second output and indicating a fault of at least one of the first computing element and the second computing element in response thereto, wherein the local I/O module of the first computing element is in electrical communication with the local I/O module of the second computing element via a sync bus to enable synchronization of the local I/O modules, the synchronization of the local I/O modules providing a verification of state information about the first computing element and the second computing element.
14. The method of claim 13 further comprising the step of transmitting a stop command to each computing element when the first output does not equal the second output.
15. The method of claim 13 further comprising detecting an error introduced by the communications link.
16. The method of claim 13 further comprising assigning a priority to each respective computing element.
17. The method of claim 16 further comprising determining whether at least one of the first computing element and the second computing element is faulty based on the priority.
18. The method of claim 16 further comprising determining whether the first output and the second output are equivalent.
19. The method of claim 13 further comprising storing at least one of the first output and the second output from at least one of the computing elements for a predetermined amount of time.
20. The method of claim 13 further comprising storing at least one of the first output and the second output upon a miscompare of the first output and the second output.
21. The method of claim 13 wherein the transmitting of the first output and the transmitting of the second output to the communications link occur simultaneously.
22. An apparatus for enabling a first computing element and a second computing element to execute in lockstep in a fault-tolerant server, the apparatus comprising: (a) means for establishing communication between the first computing element and a communications link, the communications link comprising a switching fabric, a first communications channel, and a second communications channel; (b) means for establishing communication between the second computing element and the communications link; (c) means for transmitting, by the first computing element, a first output to the communications link; (d) means for transmitting, by the second computing element, a second output to the communications link; (e) means for comparing, by at least one of a local input-output (I/O) module of the first computing element and a local I/O module of the second computing element, the first output and the second output and indicating a fault of at least one of the first computing element and the second computing element in response thereto; and (d) means for synchronizing the local I/O module of the first computing element and the local I/O module of the second computing element, the synchronization of the local I/O modules providing a verification of state information about the first computing element and the second computing element.
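The handling of a miscompare recited in claims 4, 5, 9, 14, 16, 17 and 20 (priority assignment, a voter delay buffer, and a stop command) might be sketched in software as follows. The claims do not say how the priority is used to decide which element is faulty. The rule here, presuming the lower-priority element faulty, is an assumption made only for illustration, as are all class and field names.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ComputingElement:
    name: str
    priority: int            # hypothetical convention: larger value = more trusted
    stopped: bool = False


@dataclass
class LocalIOModule:
    """Hypothetical software stand-in for one local I/O module."""
    first: ComputingElement
    second: ComputingElement
    voter_delay_buffer: List[Tuple[int, int]] = field(default_factory=list)

    def check(self, first_output: int,
              second_output: int) -> Optional[ComputingElement]:
        """Return the element presumed faulty, or None if the outputs match."""
        if first_output == second_output:
            return None
        # Miscompare: hold both outputs for later inspection (claims 9 and 20).
        self.voter_delay_buffer.append((first_output, second_output))
        # Send a stop command to each computing element (claim 14).
        self.first.stopped = True
        self.second.stopped = True
        # ASSUMPTION (not stated in the claims): the lower-priority element
        # is presumed faulty. On a priority tie, min() returns the first one.
        return min((self.first, self.second), key=lambda ce: ce.priority)


if __name__ == "__main__":
    ce_a = ComputingElement("CE-A", priority=2)
    ce_b = ComputingElement("CE-B", priority=1)
    io_module = LocalIOModule(ce_a, ce_b)
    suspect = io_module.check(5, 7)
    print(suspect.name if suspect else "no fault")
```

The priority is needed because a two-way comparison can detect that the outputs differ but cannot identify which element produced the wrong one. Some additional tie-breaking input is required, and the claims assign that role to the priority.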
Patents cited by this patent (90)
Lord Christopher C. ; Schwartz David B., Active failure detection.
Bienvenu Jacques (Paris FRX) Carre Claude (La Varenne-St-Hilaire FRX) Dufond Patrick (Paris FRX) Tuong Duc L. (Paris FRX) deRivet Philippe-Hubert (Paris FRX) Verdier Henri (Paris FRX) Bradley John J., Apparatus and method for providing synchronization between processes and events occurring at different times in a data p.
Budde David L. (Portland OR) Carson David G. (Hillsboro OR) Cornish Anthony L. (Essex OR GB2) Johnson David B. (Portland OR) Peterson Craig B. (Portland OR), Apparatus for recovery from failures in a multiprocessing system.
Leavitt William I. ; Clemson Conrad R. ; Somers Jeffrey S. ; Chaves John M. ; Barbera David R. ; Clayton Shawn A., Digital data processing methods and apparatus for fault isolation.
Hendrie Gardner C. (Marlboro MA) Baty Kurt F. (Medway MA) Dynneson Ronald E. (Brighton MA) Falkoff Daniel M. (Natick MA) Reid Robert (Dunstable MA) Samson Joseph E. (Dover MA) Wolff Kenneth T. (Medwa, Digital data processor apparatus with pipelined fault tolerant bus protocol.
Samson Joseph E. (Dover MA) Wolff Kenneth T. (Medway MA) Reid Robert (Dunstable MA) Hendrie Gardner C. (Marlboro MA) Falkoff Daniel M. (Natick MA) Dynneson Ronald E. (Brighton MA) Clemson Daniel M. (, Digital data processor with high reliability.
Guenthner Russell W. (Glendale AZ) Eckard Clinton B. (Glendale AZ) Rabins Leonard (Scottsdale AZ) Shelly William A. (Phoenix AZ) Lange Ronald E. (Glendale AZ) Edwards David S. (Phoenix AZ) Flocken Br, Error detection in the basic processing unit of a VLSI central processor.
Horst Robert W. ; Baker William Edward ; Banton Randall G. ; Brown John Michael ; Bruckert William F. ; Bunton William Patterson ; Campbell Gary F. ; Coddington John Deane ; Cutts ; Jr. Richard W. ; , Fail-fast, fail-functional, fault-tolerant multiprocessor system.
Bissett Thomas D. (Northborough MA) Fiorentino Richard D. (Carlisle MA) Glorioso Robert M. (Stow MA) McCauley Diane T. (Hopkinton MA) McCollum James D. (Whitinsville MA) Tremblay Glenn A. (Upton MA), Fault resilient/fault tolerant computing.
Bissett Thomas D. ; Fitzgerald ; V Martin J. ; Leveille Paul A. ; McCollum James D. ; Muench Erik ; Tremblay Glenn A., Fault resilient/fault tolerant computing.
Bissett Thomas Dale ; Fiorentino Richard D. ; Glorioso Robert M. ; McCauley Diane T. ; McCollum James D. ; Tremblay Glenn A. ; Troiani Mario, Fault resilient/fault tolerant computing.
Bissett Thomas Dale ; Fiorentino Richard D. ; Glorioso Robert M. ; McCauley Diane T. ; McCollum James D. ; Tremblay Glenn A. ; Troiani Mario, Fault resilient/fault tolerant computing.
Bruckert William F. (Northboro MA) Bissett Thomas D. (Derry NH) Mazur Dennis (Worcester MA) Munzer John (Brookline MA), Fault tolerant, synchronized twin computer system with error checking of I/O communication.
Jewett Douglas E. (Austin TX) Bereiter Tom (Austin TX) Vetter Brian (Austin TX) Banton Randall G. (Austin TX) Cutts ; Jr. Richard W. (Georgetown TX) Westbrook ; deceased Donald C. (late of Austin TX , Fault-tolerant computer system with online recovery and reintegration of redundant components.
Vrba Richard Alan ; Klecka James Stevens ; Fey ; Jr. Kyran Wilfred ; Lamano Larry Leonard ; Mehta Nikhil A., High-performance fault tolerant computer system with clock length synchronization of loosely coupled processors.
Woods ; John M. ; Porter ; Marion G. ; Mills ; Donald V. ; Weller ; III ; Edward F. ; Patterson ; Garvin Wesley ; Monahan ; Earnest M., Input/output processing system utilizing locked processors.
Holm, Ingemar; Kohler, Helmut; Mannherz, Peter; Schumacher, Norbert; Zilles, Gerhard, Method and apparatus for checking the address and contents of a memory array.
Lawrence Kenneth J. (Rochester MN) McDermott Michael J. (Oronoco MN), Method and apparatus for deriving mirrored unit state when re-initializing a system.
Casorso Anthony J. (Westminster CO) Haldeman David P. (Broomfield CO), Method and apparatus for ensuring data integrity in a dynamically mapped data storage subsystem.
Goodrum Alan L. ; Autor Jeffrey S. ; Culley Paul R. ; Miller Joseph P. ; Tavallaei Siamak ; Basile Barry P. ; Richard Elizabeth A. ; Rose Eric E., Method and apparatus for identifying faulty devices in a computer system.
Green Gregory M. (Boxborough MA) Kohalmi Steven (Newton MA) Bricknell Karen R. (Berlin MA), Method and apparatus for validating I/O addresses in a fault-tolerant computer system.
BeMent Bradley Earl (Farmington Hills MI) Tiedje Kevin Mark (Farmington Hills MI) Crawford Robert Dennis (Livonia MI), Method and system for detecting fault conditions on multiplexed networks.
Bradley Frank (Cliffside Park NJ) Fletcher Royce (Santa Cruz CA), Method for determining reliability of high speed digital transmission by use of a synchronized low speed side channel.
Fujiwara Shinji (Yokohama JPX), Method of designated time interval reservation access process of online updating and backing up of large database versio.
Katzman James A. (San Jose CA) Bartlett Joel F. (Palo Alto CA) Bixler Richard M. (Sunnyvale CA) Davidow William H. (Atherton CA) Despotakis John A. (Pleasanton CA) Graziano Peter J. (Los Altos CA) Gr, Multiprocessor system.
Katzman James A. (San Jose CA) Bartlett Joel F. (Palo Alto CA) Bixler Richard M. (Sunnyvale CA) Davidow William H. (Atherton CA) Despotakis John A. (Pleasanton CA) Graziano Peter J. (Los Altos CA) Gr, Multiprocessor system.
Jewett Douglas E. (Austin TX), Multiprocessor system with each processor executing the same instruction sequence and hierarchical memory providing on d.
Whiteside Arliss E. (Royal Oak MI) Freedman Morris D. (Southfield MI) Tasar Omur (Harvard MA) Rothschild Alexander M. (Ann Arbor MI), Operations controller for a fault-tolerant multiple computer system.
Danielsen Carl M. (Lake Zurich IL) Dabbish Ezzat A. (Buffalo Grove IL) Puhl Larry C. (Sleepy Hollow IL), Redundant microprocessor control system using locks and keys.
Godiwala Nitin D. (Boylston MA) Maskas Barry A. (Sterling MA) Thaller Kurt M. (Acton MA) Metzger Jeffrey A. (Leominster MA), Scheme for error handling in a computer system.
Fogg ; Jr. Richard G. (Austin TX) Mathis Joseph R. (Georgetown TX) Nicholson James O. (Austin TX), System for DMA block data transfer based on linked control blocks.
Lamb Joseph M. (Hopedale MA), System using separate transfer circuits for performing different transfer operations respectively and scanning I/O devic.
Lemonovich, John E.; Sharp, William A.; Werner, James C.; Ding, Zhu; Berecek, Sean P., Cab signal receiver demodulator employing redundant, diverse field programmable gate arrays.
Corcoran, James J.; Danielson, Eric J.; Hemaidan, Samir S.; Roltgen, John W.; Sisson, James E.; Kovalan, Mark A.; Singer, Mark C., Dissimilar processor synchronization in fly-by-wire high integrity computing platforms and displays.
Rohleder, Michael; Fader, Joachim; Lenke, Frank; Baumeister, Markus, Fault tolerance of data processing steps operating in either a parallel operation mode or a non-synchronous redundant operation mode.
Baumann,Dietmar; Hofmann,Dirk; Vollert,Herbert; Nagel,Willi; Henke,Andreas; Foitzik,Bertram; Goetzelmann,Bernd, Method and device for monitoring a distributed system.
Bernick,David L.; Bruckert,William F.; Garcia,David J.; Jardine,Robert L.; Klecka,James S.; Mehra,Pankaj; Smullen,James R., Method and system executing user programs on non-deterministic processors.
Del Vigna, Jr., Paul; Jardine, Robert L., Method and system of aligning execution point of duplicate copies of a user program by exchanging information about instructions executed.
Kondo, Thomas J.; Jardine, Robert L; Bruckert, William F.; Garcia, David J.; Klecka, James S.; Smullen, James R.; Sprouse, Jeff; Stott, Graham B., Method and system of copying memory from a source processor to a target processor by duplicating memory writes.
Bernick,David L.; Bruckert,William F.; Garcia,David J.; Jardine,Robert L.; Mehra,Pankaj; Smullen,James R., Method and system of determining whether a user program has made a system level call.
Ple, Christophe, Process for maintaining execution synchronization between several asynchronous processors working in parallel and in a redundant manner.