Fault detection, diagnosis, and prevention for complex computing systems
IPC classification information
Country / Type
United States (US) Patent
Granted
International Patent Classification (IPC, 7th edition)
G06F-011/00
G06F-011/07
Application number
US-0022453
(2008-01-30)
Registration number
US-8949671
(2015-02-03)
Inventor / Address
Mukherjee, Maharaj
Applicant / Address
International Business Machines Corporation
Agent / Address
Cantor Colburn LLP
Citation information
Times cited: 4
Cited patents: 27
Abstract
A method is provided for diagnosing failures in an object-oriented software system. The method comprises collecting runtime diagnostic information; maintaining a record of the diagnostic information in a storage buffer; and dynamically updating the record of the diagnostic information to include a group of the diagnostic information collected over a most recent occurrence of a predetermined interval. The diagnostic information includes at least one set of call stack information for at least one currently running application and at least one set of other information. Each of the at least one set of other information is selected from a set of memory access information, a set of data access information, and a set of paging information for each currently executing process.
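The rolling record described in the abstract can be illustrated with a bounded buffer. The following is a minimal sketch, assuming the predetermined interval is expressed as a maximum number of samples; the names `DiagnosticSample`, `RollingDiagnosticRecord`, and `page_faults` are hypothetical and do not come from the patent.

```python
from collections import deque
from dataclasses import dataclass, field
import time


@dataclass
class DiagnosticSample:
    # Hypothetical record shape; the patent does not prescribe field names.
    timestamp: float
    call_stack: list
    other_info: dict = field(default_factory=dict)


class RollingDiagnosticRecord:
    """Keeps only the samples collected over the most recent interval.

    Assumption: the interval is a fixed sample count (max_samples).
    """

    def __init__(self, max_samples):
        # deque(maxlen=...) discards the oldest entry once the buffer is full.
        self._buffer = deque(maxlen=max_samples)

    def collect(self, call_stack, **other_info):
        sample = DiagnosticSample(time.time(), list(call_stack), dict(other_info))
        self._buffer.append(sample)
        return sample

    def snapshot(self):
        # Return a copy so later collection does not alter the snapshot.
        return list(self._buffer)


record = RollingDiagnosticRecord(max_samples=3)
for depth in range(5):
    # Illustrative data only: a growing call stack and a paging counter.
    record.collect([f"frame_{i}" for i in range(depth + 1)], page_faults=depth)

latest = record.snapshot()
print(len(latest), latest[0].other_info["page_faults"])  # prints "3 2"
```

Five samples are collected but only the last three survive, so the oldest retained sample is the one with `page_faults=2`. A time-based or size-based interval could be substituted for the sample count without changing the overall shape.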
Representative claim
1. A method for diagnosing failures in an object-oriented software system, the method comprising: continually collecting runtime diagnostic information, the diagnostic information including at least one set of call stack information for at least one currently running application and at least one set of other information, each of the at least one other set of information being selected from a set of memory access information, a set of data access information, and a set of paging information for each currently executing process; maintaining a record of the diagnostic information in a storage buffer including snap shots of any failure that occurs, wherein the snap shots are recorded at an instance resource parameters exceed a predetermined threshold related to stress conditions for runtime interactions between a set of objects in which the resource parameters include CPU utilization for one or more processors, memory utilization of logical and physical memory, page file usage, disk I/O utilization, a number of processes or threads concurrently being executed, length of a data access wait list, and network throughput, and wherein the failure, related to the snap shots, includes paging problems, deadlock, thrashing, and race conditions; identifying and categorizing the runtime interactions between the set of objects in the software system upon localizing a cause of each occurrence of the failure that is detected; generating a failure model classified by type and category of the stress conditions from the diagnostic information; localizing one or more failure conditions within the failure model using a multivariate normal distribution; dynamically updating the record of the diagnostic information to include a group of the diagnostic information collected over a most recent occurrence of a predetermined interval; dynamically updating the failure model responsive to configuration changes, wherein the record of the diagnostic information is used to reproduce the failure for diagnostics; dynamically evaluating the collected diagnostic information to diagnose causes of failure; and providing preventative information during run time and changing operation based on the preventative information to avoid future failures.
2. The method of claim 1, further comprising monitoring the software system to detect occurrences of failures related to the runtime interactions between the set of objects in the software system, and evaluating the record of the diagnostic information to attempt to localize the cause of each occurrence of the failure that is detected, the failures related to the runtime interactions between the set of objects including one or more of data access violations, memory existing in an inconsistent state, and sudden impact from large resource usages.
3. The method of claim 2, further comprising monitoring a set of resources being utilized by the software system, and recording the snapshot of the record of the diagnostic information whenever any resource of the set of resources is being utilized beyond a specified threshold for that resource.
4. The method of claim 3, further comprising monitoring a set of resources being utilized by the software system, and recording a snapshot of the record of the diagnostic information whenever any of the set of resources is being utilized beyond a specified threshold for that resource.
5. The method of claim 4, wherein the method is performed by one or more separate processes executing within an address space of the software system, one or more threads of a process executing within the software system, one or more daemons executing within the software system, one or more services executing outside the address space of the software system, or combinations thereof.
6. The method of claim 1, wherein the diagnostic information is generated by one or more processes executing within the software system.
7. The method of claim 1, wherein the diagnostic information is generated by one or more threads of one or more processes executing within the software system.
8. The method of claim 1, wherein the software system is communicatively coupled to a data repository, and wherein the storage buffer is maintained in the data repository, such that in a state in which the storage buffer is full, the maintaining the record of the diagnostic information includes recording current diagnostic information over oldest diagnostic information existing in the storage buffer for the record.
9. The method of claim 1, wherein the record of the diagnostic information is maintained in a log trace file.
10. The method of claim 1, wherein the predetermined interval is specified according to a maximum size for the storage buffer, a maximum number of call stack changes, or a maximum period of time.
11. The method of claim 2, further comprising allowing an analyst to access the record of the diagnostic information to attempt identify and categorize the cause of each occurrence of a failure that is detected.
12. The method of claim 11, further comprising sorting and extracting information from the record of diagnostic information that is relevant to the each occurrence of a failure that is detected.
13. The method of claim 3, wherein the software system is communicatively coupled to a data repository, and wherein each snapshot of the record of diagnostic information that is recorded is maintained in a record of snapshots in the data repository.
14. The method of claim 13, further comprising utilizing the record of snapshots to predict occurrences of failures related to runtime interactions between a set of objects in the software system.
15. The method of claim 14, wherein utilizing the record of snapshots to predict occurrences of failures comprises performing regular regression testing.
16. The method of claim 14, wherein utilizing the record of snapshots to predict occurrences of failures comprises creating a failure model of data clusters for utilization conditions of the set of resources.
17. The method of claim 1, wherein the software system is an application selected from operating system applications, database management applications, server-side software applications, web-based applications, and client-side software applications.
18. The method of claim 1, wherein the software system is executing on a single processor, several processors in close proximity, or distributed across a network.
19. A system for diagnosing failures in an object-oriented software system, the system comprising: a first software module, executed by a processor, to collect runtime diagnostic information, the diagnostic information including at least one set of call stack information for at least one currently running application in the software system, and at least one set of other information, each of the at least one set of other information being selected from a set of memory access information, a set of data access information, and a set of paging information for each currently executing process; a storage buffer configured to maintain a record of the diagnostic information, the storage buffer being configured to automatically receive the diagnostic information collected by the first software module and dynamically update the record to include a group of the diagnostic information collected over a most recent occurrence of a predetermined interval wherein the storage buffer also receives and collects snap shots of any failure that occurs, wherein the snap shots are recorded at an instance resource parameters exceed a predetermined threshold related to stress conditions for runtime interactions between a set of objects in which the resource parameters include CPU utilization for one or more processors, memory utilization of logical and physical memory, page file usage, disk I/O utilization, a number of processes or threads concurrently being executed, length of a data access wait list, and network throughput, and wherein the failure, related to the snap shots, includes paging problems, deadlock, thrashing, and race conditions; a failure diagnosis component for evaluating the diagnostic information record to determine diagnose causes of the failure by type; and a failure prediction component having a prevention component configured to generate a failure model classified by type and category of the stress conditions from the diagnostic information, configured to: identify and categorize the runtime interactions between the set of objects in the software system upon localizing a cause of each occurrence of a failure that is detected; localize one or more failure conditions within the failure model using a multivariate normal distribution; and update the failure model responsive to changes to configuration settings; wherein the record of the diagnostic information is used to reproduce the failure for diagnostics.
20. The diagnostic system of claim 19, further comprising a second software module configured to be notified by the software system of the each occurrence of a failure related to the runtime interactions between the set of objects in the software system, the second software module being configured to access the storage buffer to evaluate the record of the diagnostic information to attempt to localize the cause of each occurrence of a failure for which the second software module receives notification.
21. The diagnostic system of claim 20, further comprising a third software module configured to monitor a set of resources being utilized by the software system, the third software module being further configured to record a snapshot of the record of the diagnostic information whenever any resource of the set of resources is being utilized beyond a specified threshold for that resource.
22. The diagnostic system of claim 21, wherein the first, second, and third software modules are implemented within one or more libraries of functions, one or more plug-in modules, one or more dynamic link-libraries, or combinations thereof.
23. A computer having a non transitory machine usable medium including computer readable instructions stored thereon for execution by a processor to perform a method for diagnosing failures in an object-oriented software system, the method comprising: collecting runtime diagnostic information, the diagnostic information including at least one set of call stack information for at least one currently running application and at least one set of other information, each of the at least one set of other information being selected from a set of memory access information, a set of data access information, and a set of paging information for each currently executing process; maintaining a record of the diagnostic information in a storage buffer including a snap shots of any failure that occurs, wherein the snap shots are recorded at an instance resource parameters exceed a predetermined threshold related to stress conditions for runtime interactions between a set of objects in which the resource parameters include CPU utilization for one or more processors, memory utilization of logical and physical memory, page file usage, disk I/O utilization, a number of processes or threads concurrently being executed, length of a data access wait list, and network throughput, and wherein the failure, related to the snap shots, includes paging problems, deadlock, thrashing, and race conditions; identifying and categorizing the runtime interactions between the set of objects in the software system upon localizing a cause of each occurrence of the failure that is detected; generating a failure model classified by type and category of the stress conditions from the diagnostic information; localizing one or more failure conditions within the failure model using a multivariate normal distribution; dynamically updating the record of the diagnostic information to include a group of the diagnostic information collected over a most recent occurrence of a predetermined interval; dynamically updating the failure model responsive to configuration changes, wherein the record of the diagnostic information is used to reproduce the failure for diagnostics; evaluating the collected diagnostic information to diagnose causes of failure; and providing preventative information based on the evaluation to prevent future failures.
24. The computer-usable medium of claim 23, wherein the method further comprises monitoring the software system to detect occurrences of failures related to runtime interactions between a set of objects in the software system, and evaluating the record of the diagnostic information to attempt to localize the cause of each occurrence of the failure that is detected.
25. The computer-usable medium of claim 23, wherein the method further comprises monitoring a set of resources being utilized by the software system, and recording a snapshot of the record of the diagnostic information whenever any resource of the set of resources is being utilized beyond a specified threshold for that resource.
26. A data processing system comprising: a central processing unit; a random access memory for storing data and programs for execution by the central processing unit; a first storage level comprising a nonvolatile storage device; and computer readable instructions stored in the random access memory for execution by central processing unit to perform a method for diagnosing failures in an object-oriented software system, the method comprising: collecting runtime diagnostic information, the diagnostic information including at least one set of call stack information for at least one currently running application and at least one set of other information, each of the at least one set of other information being selected from a set of memory access information, a set of data access information, and a set of paging information for each currently executing process; maintaining a record of the diagnostic information in a storage buffer including a snap shots of any failure that occurs, wherein the snap shots are recorded at an instance resource parameters exceed a predetermined threshold related to stress conditions for runtime interactions between a set of objects in which the resource parameters include CPU utilization for one or more processors, memory utilization of logical and physical memory, page file usage, disk I/O utilization, a number of processes or threads concurrently being executed, length of a data access wait list, and network throughput, and wherein the failure, related to the snap shots, includes paging problems, deadlock, thrashing, and race conditions; identifying and categorizing runtime interactions between a set of objects in the software system upon localizing a cause of each occurrence of a failure that is detected; generating a failure model classified by type and category of stress conditions from the diagnostic information; localizing one or more failure conditions within the failure model using a multivariate normal distribution; dynamically updating the record of the diagnostic information to include a group of the diagnostic information collected over a most recent occurrence of a predetermined interval; and dynamically updating the failure model responsive to configuration changes, wherein the record of the diagnostic information is used to reproduce the failure for diagnostics; evaluating the collected diagnostic information to diagnose causes of failure; and providing preventative information based on the evaluation to prevent future failures.
27. The data processing system of claim 26, wherein the method further comprises monitoring the software system to detect occurrences of failures related to runtime interactions between a set of objects in the software system, and evaluating the record of the diagnostic information to attempt to localize the cause of each occurrence of a failure that is detected.
28. The data processing system of claim 26, wherein the method further comprises monitoring a set of resources being utilized by the software system, and recording a snapshot of the record of the diagnostic information whenever any resource of the set of resources is being utilized beyond a specified threshold for that resource.
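The claims recite localizing failure conditions with a multivariate normal distribution, but this record does not say how that is computed. One common reading, offered here only as an assumption, is to fit a mean and covariance to resource observations taken under ordinary load and to flag observations whose squared Mahalanobis distance exceeds a chi-square cutoff. The sketch below uses two resource parameters (CPU and memory utilization); the function names, the sample values, and the 0.99 confidence level are all hypothetical.

```python
import math


def fit_gaussian_2d(samples):
    """Estimate the mean and sample covariance of 2-D observations."""
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    sxx = sum((x - mx) ** 2 for x, _ in samples) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in samples) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in samples) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance, using the closed-form 2x2 inverse."""
    (sxx, sxy), (_, syy) = cov
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det


# Illustrative (cpu_utilization, memory_utilization) pairs under ordinary load.
baseline = [(0.30, 0.40), (0.35, 0.42), (0.32, 0.45),
            (0.28, 0.38), (0.40, 0.50), (0.33, 0.41)]
mean, cov = fit_gaussian_2d(baseline)

# For 2 degrees of freedom the chi-square quantile is -2 * ln(1 - p).
cutoff = -2 * math.log(1 - 0.99)


def is_stressed(point):
    return mahalanobis_sq(point, mean, cov) > cutoff


print(is_stressed((0.31, 0.42)), is_stressed((0.95, 0.97)))  # prints "False True"
```

A resource monitor in the spirit of claims 3 and 21 could call `is_stressed` on each new observation and record a snapshot of the diagnostic record when it returns true. More resource parameters would need a general matrix inverse in place of the 2x2 closed form.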
Patents cited by this patent (27)
Mullen, David C., Arrangement for scheduling tasks based on probability of availability of resources at a future point in time.
Meek, Christopher A.; Heckerman, David E.; Rounthwaite, Robert L.; Chickering, David Maxwell; Thiesson, Bo, Bayesian approach for learning regression decision graph models and regression models for time series analysis.
Anglin, David M.; Adams, Vernon J.; Walker, Julia C.; Kleinfelter, Kevin P.; Nugent, Michael T., Method and apparatus for facilitating customer service communications in a computing environment.
Weissman, Craig; Tamm, Steven; Fell, Simon; Wong, Simon; Fisher, Steve, Methods and systems for providing fault recovery to side effects occurring during data processing.
Emigholz, Kenneth F.; Wang, Robert K.; Woo, Stephen S.; McLain, Richard B.; Dash, Sourabh K.; Kendi, Thomas A., System and method for abnormal event detection in the operation of continuous industrial processes.