[Paper] Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications

Senocak, Arda; Oh, Tae-Hyun; Kim, Junsik; Yang, Ming-Hsuan; Kweon, In So

doi:10.1109/tpami.2019.2952095

IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 5, 2021, pp. 1605-1619

Senocak, Arda (KAIST, School of Electrical Engineering, Daejeon, Republic of Korea); Oh, Tae-Hyun (POSTECH, Pohang, Korea); Kim, Junsik (KAIST, School of Electrical Engineering, Daejeon, Republic of Korea); Yang, Ming-Hsuan (University of California, Merced, CA, USA); Kweon, In So (KAIST, School of Electrical Engineering, Daejeon, Republic of Korea)

Abstract

Visual events are usually accompanied by sounds in our daily lives. However, can machines learn to correlate the visual scene and sound, and to localize the sound source, only by observing them as humans do? To investigate its empirical learnability, in this work we first present a novel unsupervised algorithm to address the problem of localizing sound sources in visual scenes. To achieve this goal, a two-stream network structure that handles each modality with an attention mechanism is developed for sound source localization. The network naturally reveals the localized response in the scene without human annotation. In addition, a new sound source dataset is developed for performance evaluation. Nevertheless, our empirical evaluation shows that the unsupervised method generates false conclusions in some cases. We show that these false conclusions cannot be fixed without human prior knowledge, owing to the well-known mismatch between correlation and causality. To fix this issue, we extend our network to supervised and semi-supervised settings via a simple modification, made possible by the general architecture of our two-stream network. We show that the false conclusions can be effectively corrected even with a small amount of supervision, i.e., in a semi-supervised setup. Furthermore, we present the versatility of the learned audio and visual embeddings on cross-modal content alignment, and we extend the proposed algorithm to a new application: sound-saliency-based automatic camera view panning in 360-degree videos.
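
The abstract does not spell out how the attention mechanism turns the two streams into a localization map, so the following is only a minimal sketch of one common way such a step can work, not the authors' implementation. It assumes the image stream yields a grid of feature vectors and the audio stream yields a single embedding of the same dimension; the function names `l2_normalize` and `localize_sound`, the use of cosine similarity, and the softmax over locations are all illustrative assumptions.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def localize_sound(visual_features, sound_embedding):
    """Return an attention map over grid cells and the attended visual vector.

    visual_features: H x W grid of D-dimensional lists (hypothetical image-stream output)
    sound_embedding: D-dimensional list (hypothetical audio-stream output)
    """
    sound = l2_normalize(sound_embedding)
    # Cosine similarity between the sound vector and every spatial location.
    scores = [[sum(a * b for a, b in zip(l2_normalize(cell), sound)) for cell in row]
              for row in visual_features]
    # Softmax over all locations so the map sums to 1 (assumed normalization).
    peak = max(max(row) for row in scores)
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    attention = [[e / total for e in row] for row in exps]
    # Attention-weighted sum of the visual features.
    dim = len(sound_embedding)
    attended = [0.0] * dim
    for feat_row, att_row in zip(visual_features, attention):
        for cell, weight in zip(feat_row, att_row):
            for k in range(dim):
                attended[k] += weight * cell[k]
    return attention, attended

# Toy 2x2 grid: only the top-right cell points in the same direction as the sound.
grid = [[[1.0, 0.0], [0.0, 1.0]],
        [[1.0, 0.0], [1.0, 0.0]]]
attention_map, attended_vector = localize_sound(grid, [0.0, 2.0])
```

In this toy case the attention map peaks at the one cell whose feature aligns with the sound embedding, which is the kind of localized response the abstract describes. No labels are involved in computing the map itself; how the embeddings are trained is outside the scope of this sketch.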

Open access (OA) type: GREEN

The author has self-archived the published version, post-print, or pre-print in a public repository, so the paper is freely available.
