IPC Classification Information

Country / Type | United States (US) Patent, Granted
International Patent Classification (IPC, 7th ed.) |
Application No. | US-0533398 (2000-03-22)
Priority | GB-9908545 (1999-04-14)
Inventors / Address | Taylor, Michael James; Rowe, Simon Michael
Applicant / Address |
Agent / Address | Fitzpatrick, Cella, Harper &
Citation Info | Cited by: 38 / Citations: 9
Abstract
Image data from a plurality of cameras 2-1, 2-2, 2-3 showing the movements of a number of people, for example in a meeting, and sound data from a directional microphone array 4 are processed by a computer processing apparatus 24 to archive the data in a meeting archive database 60. The image data is processed to determine the three-dimensional position and orientation of each person's head and to determine at whom each person is looking. The sound data is processed to determine the direction from which the sound came. Processing is carried out to determine who is speaking by determining which person has his head in a position corresponding to the direction from which the sound came. Having determined which person is speaking, the personal speech recognition parameters for that person are selected and used to convert the sound data to text data. Image data to be archived is chosen by selecting the camera which best shows the speaking participant and the participant to whom he is speaking. Image data, sound data, text data and data defining at whom each person is looking are stored in the meeting archive database 60.
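The speaker-determination step described above — matching the sound's direction of arrival against each tracked head position — can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function and data layout (a dict of head positions, a unit vector from the microphone array) are hypothetical, and a real system would also use head orientation and temporal smoothing.

```python
import math

def identify_speaker(head_positions, mic_position, sound_direction):
    """Pick the person whose head lies closest to the direction from
    which the microphone array reports the sound arriving.

    head_positions : dict of person -> (x, y, z) head position
    mic_position   : (x, y, z) position of the microphone array
    sound_direction: unit vector (x, y, z) of the incoming sound
    """
    best_person, best_dot = None, -1.0
    for person, pos in head_positions.items():
        # Direction from the microphone array to this person's head.
        v = tuple(p - m for p, m in zip(pos, mic_position))
        norm = math.sqrt(sum(c * c for c in v))
        if norm == 0:
            continue
        # Cosine of the angle between the sound direction and this person;
        # the largest cosine means the smallest angular mismatch.
        dot = sum((c / norm) * d for c, d in zip(v, sound_direction))
        if dot > best_dot:
            best_person, best_dot = person, dot
    return best_person

heads = {"A": (1.0, 0.0, 0.0), "B": (0.0, 1.0, 0.0)}
print(identify_speaker(heads, (0.0, 0.0, 0.0), (0.0, 1.0, 0.0)))  # prints B
```

The same cosine test generalizes to any number of participants, which is why the abstract's approach needs only one directional microphone array rather than one microphone per person.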
Representative Claims
What is claimed is:

1. Image processing apparatus, comprising: an image data receiver for receiving image data recorded by a plurality of cameras showing the movements of a plurality of people; a speaker identifier for determining which of the people is speaking; a speech recipient identifier for determining at whom the speaker is looking; a position calculator for determining the position of the speaker and the position of the person at whom the speaker is looking; and camera selection means for selecting image data from the received image data on the basis of the determined positions of the speaker and the person at whom the speaker is looking, said camera selection means being arranged to select image data in which both the speaker and the person at whom the speaker is looking appear, and wherein the camera selection means is arranged to generate quality values representing a quality of the views that at least some of the cameras have of the speaker and the person at whom the speaker is looking, and to select the image data on the basis of which camera has the quality value representing the highest quality.

2. Apparatus according to claim 1, wherein the camera selection means is arranged to determine which of the cameras have a view of the speaker and the person at whom the speaker is looking, and to generate a respective quality value for each camera which has a view of the speaker and the person at whom the speaker is looking.

3. Apparatus according to claim 1, wherein the camera selection means is arranged to generate each quality value in dependence upon the position and orientation of the head of the speaker and the position and orientation of the head of the person at whom the speaker is looking.

4. Apparatus according to claim 1, wherein the camera selection means comprises: a data store for storing data defining a camera from which image data is to be selected for respective pairs of positions; and an image data selector arranged to use data stored in the data store to select the image data in dependence upon the positions of the speaker and the person at whom the speaker is looking.

5. Apparatus according to claim 1, wherein the speech recipient identifier and the position calculator comprise an image processor for processing the image data from at least one of the cameras to determine at whom the speaker is looking and the positions.

6. Apparatus according to claim 5, wherein the image processor is arranged to determine the position of each person and at whom each person is looking by processing the image data from the at least one camera.

7. Apparatus according to claim 5, wherein the image processor is arranged to track the position and orientation of each person's head in three dimensions.

8. Apparatus according to claim 1, wherein the speaker identifier is arranged to receive speech data from a plurality of microphones each of which is allocated to a respective one of the people, and to determine which of the people is speaking on the basis of the microphone from which the speech data was received.

9. Apparatus according to claim 1, further comprising a sound processor for processing sound data defining words spoken by the people to generate text data therefrom in dependence upon the result of the processing performed by the speaker identifier.

10. Apparatus according to claim 9, wherein the sound processor has associated therewith a store for storing respective voice recognition parameters for each of the people, and a parameter selector for selecting the voice recognition parameters to be used to process the sound data in dependence upon the person determined to be speaking by the speaker identifier.

11. Apparatus according to claim 9, further comprising a database for storing at least some of the received image data, the sound data, the text data produced by the sound processor and viewing data defining at whom at least the person who is speaking is looking, the database being arranged to store the data such that corresponding text data and viewing data are associated with each other and with the corresponding image data and sound data.

12. Apparatus according to claim 11, further comprising a data compressor for compressing the image data and the sound data for storage in the database.

13. Apparatus according to claim 12, wherein the data compressor comprises an encoder for encoding the image data and the sound data as MPEG data.

14. Apparatus according to claim 11, further comprising a gaze time data generator for generating gaze time data defining, for a predetermined period, the proportion of time spent by a given person looking at each of the other people during the predetermined period, and wherein the database is arranged to store the gaze time data so that it is associated with the corresponding image data, sound data, text data and viewing data.

15. Apparatus according to claim 14, wherein the predetermined period comprises a period during which the given person was talking.

16. A method of processing image data recorded by a plurality of cameras showing the movements of a plurality of people to select image data for storage, the method comprising: a speaker identification step of determining which of the people is speaking; a step of determining at whom the speaker is looking; a step of determining the position of the speaker and the position of the person at whom the speaker is looking; and a camera selection step for selecting image data on the basis of the determined positions of the speaker and the person at whom the speaker is looking, wherein, in the camera selection step, image data is selected in which both the speaker and the person at whom the speaker is looking appear, quality values are generated representing a quality of the views that at least some of the cameras have of the speaker and the person at whom the speaker is looking, and the image data is selected on the basis of which camera has the quality value representing the highest quality.

17. A method according to claim 16, wherein, in the camera selection step, processing is performed to determine which of the cameras have a view of the speaker and the person at whom the speaker is looking, and to generate a respective quality value for each camera which has a view of the speaker and the person at whom the speaker is looking.

18. A method according to claim 16, wherein, in the camera selection step, each quality value is generated in dependence upon the position and orientation of the head of the speaker and the position and orientation of the head of the person at whom the speaker is looking.

19. A method according to claim 16, wherein, in the camera selection step, pre-stored data defining a camera from which image data is to be selected for respective pairs of positions is used to select the image data in dependence upon the positions of the speaker and the person at whom the speaker is looking.

20. A method according to claim 16, wherein, in the steps of determining at whom the speaker is looking and determining the positions of the speaker and the person at whom the speaker is looking, image data from at least one of the cameras is processed to determine at whom the speaker is looking and the positions.

21. A method according to claim 20, wherein the image data from that at least one camera is processed to determine the position of each person and at whom each person is looking.

22. A method according to claim 20, wherein image data is processed to track the position and orientation of each person's head in three dimensions.

23. A method according to claim 16, wherein speech data is received from a plurality of microphones each of which is allocated to a respective one of the people, and, in the speaker identification step, it is determined which of the people is speaking on the basis of the microphone from which the speech data was received.

24. A method according to claim 16, further comprising a sound processing step of processing sound data defining words spoken by the people to generate text data therefrom in dependence upon the result of the processing performed in the speaker identification step.

25. A method according to claim 24, wherein the sound processing step includes selecting, from among stored respective voice recognition parameters for each of the people, the voice recognition parameters to be used to process the sound data in dependence upon the person determined to be speaking in the speaker identification step.

26. A method according to claim 24, further comprising the step of storing in a database at least some of the received image data, the sound data, the text data produced in the sound processing step and viewing data defining at whom at least the person who is speaking is looking, the data being stored in the database such that corresponding text data and viewing data are associated with each other and with the corresponding image data and sound data.

27. A method according to claim 26, wherein the image data and the sound data are stored in the database in compressed form.

28. A method according to claim 27, wherein the image data and the sound data are stored as MPEG data.

29. A method according to claim 26, further comprising the steps of generating data defining, for a predetermined period, the proportion of time spent by a given person looking at each of the other people during the predetermined period, and storing the data in the database so that it is associated with the corresponding image data, sound data, text data and viewing data.

30. A method according to claim 29, wherein the predetermined period comprises a period during which the given person was talking.

31. A method according to claim 26, further comprising the step of generating a signal conveying the database with data therein.

32. A method according to claim 31, further comprising the step of recording the signal either directly or indirectly to generate a recording thereof.

33. A method according to claim 16, further comprising the step of generating a signal conveying information defining the image data selected in the camera selection step.
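Claims 1-2 describe selecting the camera with the highest quality value among those that see both the speaker and the person being addressed. The claims leave the quality function open, so the sketch below uses an illustrative stand-in (inverse distance from the camera to the midpoint of the speaker/listener pair); the `cameras` data layout and the `sees` visibility set are hypothetical simplifications.

```python
import math

def select_camera(cameras, speaker_pos, listener_pos):
    """Return the id of the camera with the highest quality value for a
    view containing both the speaker and the person being addressed.

    cameras: dict of camera_id -> {'position': (x, y, z),
                                   'sees': set of positions in view}
    """
    midpoint = tuple((s + l) / 2 for s, l in zip(speaker_pos, listener_pos))
    best_id, best_quality = None, -1.0
    for cam_id, cam in cameras.items():
        # Per claim 1, only cameras in which BOTH participants appear
        # are candidates for selection.
        if speaker_pos not in cam["sees"] or listener_pos not in cam["sees"]:
            continue
        # Illustrative quality value: closer to the pair's midpoint is better.
        dist = math.dist(cam["position"], midpoint)
        quality = 1.0 / (1.0 + dist)
        if quality > best_quality:
            best_id, best_quality = cam_id, quality
    return best_id
```

Claim 3 suggests a richer quality function that also weighs the head orientations of both participants, e.g. penalizing cameras that see only the backs of heads; that would replace the distance term above.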
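The gaze time data of claims 14 and 29 — the proportion of a period that a given person spent looking at each other participant — reduces to counting per-frame viewing records. A minimal sketch, assuming a hypothetical per-frame viewing log rather than the patent's actual viewing-data format:

```python
from collections import Counter

def gaze_proportions(viewing_log, person, start, end):
    """For one person, compute the fraction of observed frames in
    [start, end) spent looking at each other participant.

    viewing_log: dict of frame index -> dict of person -> person looked at
    (a simplified stand-in for the patent's stored viewing data).
    """
    counts = Counter()
    total = 0
    for frame in range(start, end):
        target = viewing_log.get(frame, {}).get(person)
        if target is not None:
            counts[target] += 1
            total += 1
    # Normalize counts into proportions; empty if no observations.
    return {t: n / total for t, n in counts.items()} if total else {}
```

Per claim 15, the period of interest would typically be one during which the given person was talking, so the proportions indicate whom the speaker was addressing.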