[Patent] Fault detection, diagnosis, and prevention for complex computing systems


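IPC classification information
Country / type	United States (US) Patent, granted
International Patent Classification (IPC, 7th edition)	G06F-011/00 G06F-011/07
Application number	US-0022453 (2008-01-30)
Registration number	US-8949671 (2015-02-03)
Inventor / address	Mukherjee, Maharaj
Applicant / address	International Business Machines Corporation
Agent / address	Cantor Colburn LLP
Citation information	Cited by: 4; patents cited: 27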

Abstract

A method is provided for diagnosing failures in an object-oriented software system. The method comprises collecting runtime diagnostic information; maintaining a record of the diagnostic information in a storage buffer; and dynamically updating the record of the diagnostic information to include a group of the diagnostic information collected over a most recent occurrence of a predetermined interval. The diagnostic information includes at least one set of call stack information for at least one currently running application and at least one set of other information. Each of the at least one set of other information is selected from a set of memory access information, a set of data access information, and a set of paging information for each currently executing process.
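
The rolling record described in the abstract can be illustrated with a minimal Python sketch. This is not the patented implementation. The class name `DiagnosticRecord`, its parameters, the injectable `clock`, and the choice to model the "predetermined interval" as a time window combined with an entry cap are all assumptions made for the example.

```python
import time
import traceback
from collections import deque


class DiagnosticRecord:
    """Hypothetical rolling record: keeps only entries from the most recent interval."""

    def __init__(self, interval_seconds=60.0, max_entries=1000, clock=time.monotonic):
        self.interval_seconds = interval_seconds
        self.clock = clock  # injectable so the behaviour can be checked deterministically
        # A deque with maxlen silently discards the oldest entry once it is full.
        self.entries = deque(maxlen=max_entries)

    def collect(self, other_info):
        """Record the current call stack plus one set of other information."""
        entry = {
            "time": self.clock(),
            "call_stack": traceback.format_stack(),
            "other": dict(other_info),  # e.g. memory, data access, or paging counters
        }
        self.entries.append(entry)
        self._trim()
        return entry

    def _trim(self):
        """Drop entries that are older than the configured interval."""
        cutoff = self.clock() - self.interval_seconds
        while self.entries and self.entries[0]["time"] < cutoff:
            self.entries.popleft()

    def snapshot(self):
        """Return a copy of the entries that fall inside the current interval."""
        self._trim()
        return list(self.entries)
```

A caller would invoke `collect({...})` periodically and call `snapshot()` when a failure is detected. The entry cap and the time window are two independent bounds here, so whichever is reached first causes the oldest information to be discarded.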

Representative claims

1. A method for diagnosing failures in an object-oriented software system, the method comprising:
continually collecting runtime diagnostic information, the diagnostic information including at least one set of call stack information for at least one currently running application and at least one set of other information, each of the at least one other set of information being selected from a set of memory access information, a set of data access information, and a set of paging information for each currently executing process;
maintaining a record of the diagnostic information in a storage buffer including snap shots of any failure that occurs, wherein the snap shots are recorded at an instance resource parameters exceed a predetermined threshold related to stress conditions for runtime interactions between a set of objects in which the resource parameters include CPU utilization for one or more processors, memory utilization of logical and physical memory, page file usage, disk I/O utilization, a number of processes or threads concurrently being executed, length of a data access wait list, and network throughput, and wherein the failure, related to the snap shots, includes paging problems, deadlock, thrashing, and race conditions;
identifying and categorizing the runtime interactions between the set of objects in the software system upon localizing a cause of each occurrence of the failure that is detected;
generating a failure model classified by type and category of the stress conditions from the diagnostic information;
localizing one or more failure conditions within the failure model using a multivariate normal distribution;
dynamically updating the record of the diagnostic information to include a group of the diagnostic information collected over a most recent occurrence of a predetermined interval;
dynamically updating the failure model responsive to configuration changes, wherein the record of the diagnostic information is used to reproduce the failure for diagnostics;
dynamically evaluating the collected diagnostic information to diagnose causes of failure; and
providing preventative information during run time and changing operation based on the preventative information to avoid future failures.

2. The method of claim 1, further comprising monitoring the software system to detect occurrences of failures related to the runtime interactions between the set of objects in the software system, and evaluating the record of the diagnostic information to attempt to localize the cause of each occurrence of the failure that is detected, the failures related to the runtime interactions between the set of objects including one or more of data access violations, memory existing in an inconsistent state, and sudden impact from large resource usages.

3. The method of claim 2, further comprising monitoring a set of resources being utilized by the software system, and recording the snapshot of the record of the diagnostic information whenever any resource of the set of resources is being utilized beyond a specified threshold for that resource.

4. The method of claim 3, further comprising monitoring a set of resources being utilized by the software system, and recording a snapshot of the record of the diagnostic information whenever any of the set of resources is being utilized beyond a specified threshold for that resource.

5. The method of claim 4, wherein the method is performed by one or more separate processes executing within an address space of the software system, one or more threads of a process executing within the software system, one or more daemons executing within the software system, one or more services executing outside the address space of the software system, or combinations thereof.

6. The method of claim 1, wherein the diagnostic information is generated by one or more processes executing within the software system.

7. The method of claim 1, wherein the diagnostic information is generated by one or more threads of one or more processes executing within the software system.

8. The method of claim 1, wherein the software system is communicatively coupled to a data repository, and wherein the storage buffer is maintained in the data repository, such that in a state in which the storage buffer is full, the maintaining the record of the diagnostic information includes recording current diagnostic information over oldest diagnostic information existing in the storage buffer for the record.

9. The method of claim 1, wherein the record of the diagnostic information is maintained in a log trace file.

10. The method of claim 1, wherein the predetermined interval is specified according to a maximum size for the storage buffer, a maximum number of call stack changes, or a maximum period of time.

11. The method of claim 2, further comprising allowing an analyst to access the record of the diagnostic information to attempt identify and categorize the cause of each occurrence of a failure that is detected.

12. The method of claim 11, further comprising sorting and extracting information from the record of diagnostic information that is relevant to the each occurrence of a failure that is detected.

13. The method of claim 3, wherein the software system is communicatively coupled to a data repository, and wherein each snapshot of the record of diagnostic information that is recorded is maintained in a record of snapshots in the data repository.

14. The method of claim 13, further comprising utilizing the record of snapshots to predict occurrences of failures related to runtime interactions between a set of objects in the software system.

15. The method of claim 14, wherein utilizing the record of snapshots to predict occurrences of failures comprises performing regular regression testing.

16. The method of claim 14, wherein utilizing the record of snapshots to predict occurrences of failures comprises creating a failure model of data clusters for utilization conditions of the set of resources.

17. The method of claim 1, wherein the software system is an application selected from operating system applications, database management applications, server-side software applications, web-based applications, and client-side software applications.

18. The method of claim 1, wherein the software system is executing on a single processor, several processors in close proximity, or distributed across a network.

19. A system for diagnosing failures in an object-oriented software system, the system comprising:
a first software module, executed by a processor, to collect runtime diagnostic information, the diagnostic information including at least one set of call stack information for at least one currently running application in the software system, and at least one set of other information, each of the at least one set of other information being selected from a set of memory access information, a set of data access information, and a set of paging information for each currently executing process;
a storage buffer configured to maintain a record of the diagnostic information, the storage buffer being configured to automatically receive the diagnostic information collected by the first software module and dynamically update the record to include a group of the diagnostic information collected over a most recent occurrence of a predetermined interval wherein the storage buffer also receives and collects snap shots of any failure that occurs, wherein the snap shots are recorded at an instance resource parameters exceed a predetermined threshold related to stress conditions for runtime interactions between a set of objects in which the resource parameters include CPU utilization for one or more processors, memory utilization of logical and physical memory, page file usage, disk I/O utilization, a number of processes or threads concurrently being executed, length of a data access wait list, and network throughput, and wherein the failure, related to the snap shots, includes paging problems, deadlock, thrashing, and race conditions;
a failure diagnosis component for evaluating the diagnostic information record to determine diagnose causes of the failure by type; and
a failure prediction component having a prevention component configured to generate a failure model classified by type and category of the stress conditions from the diagnostic information, configured to:
identify and categorize the runtime interactions between the set of objects in the software system upon localizing a cause of each occurrence of a failure that is detected;
localize one or more failure conditions within the failure model using a multivariate normal distribution; and
update the failure model responsive to changes to configuration settings;
wherein the record of the diagnostic information is used to reproduce the failure for diagnostics.

20. The diagnostic system of claim 19, further comprising a second software module configured to be notified by the software system of the each occurrence of a failure related to the runtime interactions between the set of objects in the software system, the second software module being configured to access the storage buffer to evaluate the record of the diagnostic information to attempt to localize the cause of each occurrence of a failure for which the second software module receives notification.

21. The diagnostic system of claim 20, further comprising a third software module configured to monitor a set of resources being utilized by the software system, the third software module being further configured to record a snapshot of the record of the diagnostic information whenever any resource of the set of resources is being utilized beyond a specified threshold for that resource.

22. The diagnostic system of claim 21, wherein the first, second, and third software modules are implemented within one or more libraries of functions, one or more plug-in modules, one or more dynamic link-libraries, or combinations thereof.

23. A computer having a non transitory machine usable medium including computer readable instructions stored thereon for execution by a processor to perform a method for diagnosing failures in an object-oriented software system, the method comprising:
collecting runtime diagnostic information, the diagnostic information including at least one set of call stack information for at least one currently running application and at least one set of other information, each of the at least one set of other information being selected from a set of memory access information, a set of data access information, and a set of paging information for each currently executing process;
maintaining a record of the diagnostic information in a storage buffer including a snap shots of any failure that occurs, wherein the snap shots are recorded at an instance resource parameters exceed a predetermined threshold related to stress conditions for runtime interactions between a set of objects in which the resource parameters include CPU utilization for one or more processors, memory utilization of logical and physical memory, page file usage, disk I/O utilization, a number of processes or threads concurrently being executed, length of a data access wait list, and network throughput, and wherein the failure, related to the snap shots, includes paging problems, deadlock, thrashing, and race conditions;
identifying and categorizing the runtime interactions between the set of objects in the software system upon localizing a cause of each occurrence of the failure that is detected;
generating a failure model classified by type and category of the stress conditions from the diagnostic information;
localizing one or more failure conditions within the failure model using a multivariate normal distribution;
dynamically updating the record of the diagnostic information to include a group of the diagnostic information collected over a most recent occurrence of a predetermined interval;
dynamically updating the failure model responsive to configuration changes, wherein the record of the diagnostic information is used to reproduce the failure for diagnostics;
evaluating the collected diagnostic information to diagnose causes of failure; and
providing preventative information based on the evaluation to prevent future failures.

24. The computer-usable medium of claim 23, wherein the method further comprises monitoring the software system to detect occurrences of failures related to runtime interactions between a set of objects in the software system, and evaluating the record of the diagnostic information to attempt to localize the cause of each occurrence of the failure that is detected.

25. The computer-usable medium of claim 23, wherein the method further comprises monitoring a set of resources being utilized by the software system, and recording a snapshot of the record of the diagnostic information whenever any resource of the set of resources is being utilized beyond a specified threshold for that resource.

26. A data processing system comprising:
a central processing unit;
a random access memory for storing data and programs for execution by the central processing unit;
a first storage level comprising a nonvolatile storage device; and
computer readable instructions stored in the random access memory for execution by central processing unit to perform a method for diagnosing failures in an object-oriented software system, the method comprising:
collecting runtime diagnostic information, the diagnostic information including at least one set of call stack information for at least one currently running application and at least one set of other information, each of the at least one set of other information being selected from a set of memory access information, a set of data access information, and a set of paging information for each currently executing process;
maintaining a record of the diagnostic information in a storage buffer including a snap shots of any failure that occurs, wherein the snap shots are recorded at an instance resource parameters exceed a predetermined threshold related to stress conditions for runtime interactions between a set of objects in which the resource parameters include CPU utilization for one or more processors, memory utilization of logical and physical memory, page file usage, disk I/O utilization, a number of processes or threads concurrently being executed, length of a data access wait list, and network throughput, and wherein the failure, related to the snap shots, includes paging problems, deadlock, thrashing, and race conditions;
identifying and categorizing runtime interactions between a set of objects in the software system upon localizing a cause of each occurrence of a failure that is detected;
generating a failure model classified by type and category of stress conditions from the diagnostic information;
localizing one or more failure conditions within the failure model using a multivariate normal distribution;
dynamically updating the record of the diagnostic information to include a group of the diagnostic information collected over a most recent occurrence of a predetermined interval; and
dynamically updating the failure model responsive to configuration changes, wherein the record of the diagnostic information is used to reproduce the failure for diagnostics;
evaluating the collected diagnostic information to diagnose causes of failure; and
providing preventative information based on the evaluation to prevent future failures.

27. The data processing system of claim 26, wherein the method further comprises monitoring the software system to detect occurrences of failures related to runtime interactions between a set of objects in the software system, and evaluating the record of the diagnostic information to attempt to localize the cause of each occurrence of a failure that is detected.

28. The data processing system of claim 26, wherein the method further comprises monitoring a set of resources being utilized by the software system, and recording a snapshot of the record of the diagnostic information whenever any resource of the set of resources is being utilized beyond a specified threshold for that resource.
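
The independent claims name a multivariate normal distribution for localizing failure conditions but do not state how it is applied. The sketch below shows one plausible reading, offered only as an assumption: fit a mean vector and covariance matrix to resource-parameter vectors taken from failure snapshots, then score a new observation by its squared Mahalanobis distance, which is the quantity in the exponent of the multivariate normal density. The function names and the example vectors (CPU utilization, memory utilization) are hypothetical.

```python
def mean_and_covariance(samples):
    """Estimate the mean vector and sample covariance matrix (needs >= 2 samples)."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(d)]
    cov = [
        [sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / (n - 1)
         for j in range(d)]
        for i in range(d)
    ]
    return mean, cov


def solve(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def mahalanobis_squared(x, mean, cov):
    """Squared Mahalanobis distance (x - mean)^T cov^-1 (x - mean)."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    return sum(d * y for d, y in zip(diff, solve(cov, diff)))


# Hypothetical (CPU %, memory %) vectors from earlier failure snapshots.
failure_samples = [(90.0, 80.0), (94.0, 80.0), (90.0, 84.0), (94.0, 84.0)]
mean, cov = mean_and_covariance(failure_samples)
score = mahalanobis_squared((93.0, 83.0), mean, cov)
```

A small `score` means the observation lies close to the centre of the fitted failure cluster. Any cut-off applied to the score would be a tuning choice that the claims do not specify.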

Patents cited by this patent (27)

Mullen, David C., Arrangement for scheduling tasks based on probability of availability of resources at a future point in time.
Pawar, Sitaram; Bono, Jean Pierre; Bergant, Milena; Potnis, Ajay S.; Agrawal, Ashwin B., Automatic media error correction in a file server.
Meek, Christopher A.; Heckerman, David E.; Rounthwaite, Robert L.; Chickering, David Maxwell; Thiesson, Bo, Bayesian approach for learning regression decision graph models and regression models for time series analysis.
Cheng, Wu Tung; Tsai, Kun Han; Huang, Yu; Tamarapalli, Nagesh; Rajski, Janusz, Compactor independent fault diagnosis.
Banga, Gaurav, Computer assisted automatic error detection and diagnosis of file servers.
Ketterhagen, Thomas; Bramhall, Bruce; Graf, Nicholas; Okcu, Okan, Computer system configuration representation and transfer.
Brumme, Christopher W.; Trowbridge, Sean E.; Martin, Rudi; Liu, WeiWen; Grunkemeyer, Brian M.; Prakriya, Mahesh, Constrained execution regions.
Bodamer, Roger, Diagnostic methodology for debugging integrated software.
Stout, William F.; Hartz, Sarah M., Latent property diagnosing procedure.
Anglin, David M.; Adams, Vernon J.; Walker, Julia C.; Kleinfelter, Kevin P.; Nugent, Michael T., Method and apparatus for facilitating customer service communications in a computing environment.
Liu, Xiaowei, Method and system for broadband predistortion linearization.
Galuten, Albhy; Williams, Peter, Method and system for handling errors in a distributed computer system.
Chamberlain, Benjamin C., Method and system for restoring a computer to its original state after an unsuccessful patch installation attempt.
Dixon, III, Walter; Murren, Brian, Method and system for verifying a computer program.
da Silva, Luis A., Method for determining a source of failure during a file system access.
Petersen, Paul M.; Pellett, Flint, Method for finding errors in multithreaded applications.
Schultz, Len; Quach, Nhon Toai; Mulla, Dean; Hays, Jim; Fu, John, Method of correcting a machine check error.
Adibhatla, Sridhar; Ashby, Malcolm J., Methods and apparatus for model based diagnostics.
Weissman, Craig; Tamm, Steven; Fell, Simon; Wong, Simon; Fisher, Steve, Methods and systems for providing fault recovery to side effects occurring during data processing.
Oesterling, Christopher L.; Beiermeister, Frederick J.; Stefan, Jeffrey M., Providing status data for vehicle maintenance.
Wegerich, Stephan W.; Wilks, Alan D.; Nelligan, John D., Signal differentiation system using improved non-linear operator.
Ndumu, Divine T. (GB); Nwana, Hyacinth S. (GB); Lee, Lyndon C. (GB), Software system generation.
Draper, Andrew; Flaherty, Edward, Synchronization of hardware and software debuggers.
Emigholz, Kenneth F.; Wang, Robert K.; Woo, Stephen S.; McLain, Richard B.; Dash, Sourabh K.; Kendi, Thomas A., System and method for abnormal event detection in the operation of continuous industrial processes.
Botes, Par, System and method for point-in-time recovery of application resource sets.
Voas, Jeffrey M.; Ghosh, Anup K., System and method for software certification.
Chang, Shu-Ping; Gu, Xiaohui; Papadimitriou, Spyridon; Yu, Philip Shi-lung, Systems and methods for predictive failure management.

Patents citing this patent (4)

Mensah, Trevor, Communications device.
Charters, Graham C.; Evans, Lewis; Mitchell, Timothy J.; Pilkington, Adam J., Diagnostic stackframe buffer.
Shivanna, Suhas; Anders, Valentin; Malhotra, Sunil; Prabhakar, Omkar S., Robust hardware fault management system, method and framework for enterprise devices.
Findeisen, Piotr, System and method for collecting application performance data.

