[US Patent]
Architecture and execution for efficient mixed precision computations in single instruction multiple data/thread (SIMD/T) devices
IPC classification information
Country / Type
United States(US) Patent
Granted
International Patent Classification (IPC 7th edition)
G06F-015/76
G06F-009/38
G06F-009/30
G06F-009/46
G06F-008/41
Application number
US-0672694
(2015-03-30)
Registration number
US-10061592
(2018-08-28)
Inventors / Address
Lukyanov, Maxim
Grosul, Alexander
Alsup, Mitchell
Beylin, Boris
Applicant / Address
Samsung Electronics Co., Ltd.
Agent / Address
Innovation Counsel LLP
Citation information
Times cited: 0
Cited patents: 15
Abstract
A method for improving power, performance, area (PPA) for mixed precision computations in a processing environment. The method includes determining a braiding factor as a number of units of work encoded into a physical thread. A value of the braiding factor is determined based on a mix of precision requirements presented for individual units of work. Units of work are classified as instructions for applied code transformation based on associated precision requirements for the processing environment. Instruction inputs from specified registers are packed together into a destination register according to the determined value of the braiding factor. The packed instructions presented in vector form are executed with an instruction set architecture configured for executing packed instructions of different precisions.
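The packing step in the abstract can be illustrated with a small sketch. This is not the patented implementation: it assumes a 32-bit physical register (`BASIS_BITS`), assumes the braiding factor is limited by the widest precision among the units of work braided together, and uses unsigned integer lanes in place of real reduced-precision formats. The names `braiding_factor`, `pack`, `unpack`, and `packed_add` are hypothetical.

```python
BASIS_BITS = 32  # assumption: width of one physical register, in bits


def braiding_factor(precision_bits):
    """Units of work per physical thread (assumed: set by the widest requirement)."""
    widest = max(precision_bits)
    if widest > BASIS_BITS:
        raise ValueError("precision exceeds the basis width")
    return BASIS_BITS // widest


def pack(values, factor):
    """Pack one input per unit of work into a single destination register."""
    if len(values) != factor:
        raise ValueError("need exactly one value per lane")
    lane_bits = BASIS_BITS // factor
    lane_mask = (1 << lane_bits) - 1
    reg = 0
    for lane, value in enumerate(values):
        # Each value is truncated to its lane width and shifted into place.
        reg |= (value & lane_mask) << (lane * lane_bits)
    return reg


def unpack(reg, factor):
    """Split a packed register back into its per-lane values."""
    lane_bits = BASIS_BITS // factor
    lane_mask = (1 << lane_bits) - 1
    return [(reg >> (lane * lane_bits)) & lane_mask for lane in range(factor)]


def packed_add(a, b, factor):
    """Lane-wise add of two packed registers; carries never cross lanes."""
    sums = [x + y for x, y in zip(unpack(a, factor), unpack(b, factor))]
    return pack(sums, factor)  # pack() wraps each sum to its lane width
```

With two 16-bit units of work the factor is 2, so `pack([0x1234, 0xABCD], 2)` yields `0xABCD1234`, and a lane that overflows in `packed_add` wraps within its own 16 bits instead of corrupting its neighbor.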
Representative claims
1. A method for improving power, performance, area (PPA) for mixed precision computations in a processing environment, the method comprising: determining a braiding factor as a number of units of work encoded into a physical thread prior to processing, a unit of work being a set of input data or instructions for processing; determining a value of the braiding factor based on a mix of precision requirements presented for individual units of work; classifying units of work as instructions for applied code transformation based on associated precision requirements for the processing environment, which comprises a single instruction multiple thread (SIMT) or a single instruction multiple data (SIMD) processing architecture; after classifying the units of work as instructions for applied code transformation, replicating instructions for the units of work to generate replicated instructions that are identical instructions for processing by neighboring threads, wherein the replicating is done with precision requirements greater than precision requirements corresponding to the determined value of the braiding factor as multiple threads for SIMT or SIMD processing; packing instruction inputs from specified registers together into a destination register according to the determined value of the braiding factor; and executing the packed instructions presented in vector form with an instruction set architecture configured for executing packed instructions of different precisions.
2. The method of claim 1, wherein multiple units of work are packed for parallel processing of multiple data elements into the physical thread for execution based on associated precision requirements for the processing environment.
3. The method of claim 2, wherein the number of units of work packed into the physical thread is determined based on compiler analysis or explicit qualification in a source language.
4. The method of claim 3, wherein the mix of precision requirements are presented for the individual units of work in the source language to reduce fragmentation or scattering of contents in the register file.
5. The method of claim 2, further comprising: selectively replicating and packing instructions into the physical thread for units of work with precision requirements less than a basis precision of an instruction set architecture of the processing environment.
6. The method of claim 5, further comprising: narrowing or widening operands of instructions to be applied as necessary to ensure consistent precision types of instruction inputs according to precision requirements of an instruction output; and packing instruction inputs from specified registers together into a destination register according to the determined value of the braiding factor.
7. The method of claim 2, further comprising: handling flow control divergence per unit of work within packed instructions with predication masks.
8. The method of claim 7, wherein handling flow control further comprises: designating a predication mask for each unit of work within packed instructions to manage independent determination of branch outcomes for each unit of work; determining whether each unit of work within packed instructions is currently active or should be reactivated with the predication mask; and providing tracking information for each unit of work within packed instructions for inspection and modification by an executing program.
9. The method of claim 1, wherein the processing environment is included in a graphics processing unit (GPU) of a mobile electronic device.
10. A non-transitory computer-readable storage medium embodied thereon instructions being executable by at least one processor to perform a method for improving power, performance, area (PPA) for mixed precision computations in a processing environment, the method comprising: determining a braiding factor as a number of units of work encoded into a physical thread prior to processing, a unit of work being a set of input data or instructions for processing; determining a value of the braiding factor based on a mix of precision requirements presented for individual units of work; classifying units of work as instructions for applied code transformation based on associated precision requirements for the processing environment, which comprises a single instruction multiple thread (SIMT) or single instruction multiple data (SIMD) processing architecture; after classifying the units of work as instructions for applied code transformation, replicating instructions for the units of work to generate replicated instructions that are identical instructions for processing by neighboring threads, wherein the replicating is done with precision requirements greater than precision requirements corresponding to the determined value of the braiding factor as multiple threads for SIMT or SIMD processing; packing instruction inputs from specified registers together into a destination register according to the determined value of the braiding factor; and executing the packed instructions presented in vector form with an instruction set architecture configured for executing packed instructions of different precisions.
11. The non-transitory computer-readable storage medium of claim 10, wherein multiple units of work are packed for parallel processing of multiple data elements into the physical thread for execution based on associated precision requirements for the processing environment.
12. The non-transitory computer-readable storage medium of claim 11, wherein: the number of units of work packed into the physical thread are determined based on compiler analysis or explicit qualification in a source language; and the mix of precision requirements is presented for the individual units of work in the source language to reduce fragmentation or scattering of contents in the register file.
13. The non-transitory computer-readable storage medium of claim 11, further comprising: selectively replicating and packing instructions into the physical thread for units of work with precision requirements less than a basis precision of an instruction set architecture of the processing environment.
14. The non-transitory computer-readable storage medium of claim 13, further comprising: narrowing or widening operands of instructions to be applied as necessary to ensure consistent precision types of instruction inputs according to precision requirements of an instruction output; and packing instruction inputs from specified registers together into a destination register according to the determined value of the braiding factor.
15. The non-transitory computer-readable storage medium of claim 11, further comprising: handling flow control divergence per unit of work within packed instructions with predication masks.
16. The non-transitory computer-readable storage medium of claim 15, wherein handling flow control further comprises: designating a predication mask for each unit of work within packed instructions to manage independent determination of branch outcomes for each unit of work; determining whether each unit of work within packed instructions is currently active or should be reactivated with the predication mask; and providing tracking information for each unit of work within packed instructions for inspection and modification by an executing program.
17. The non-transitory computer-readable storage medium of claim 10, wherein the processing environment is included in a graphics processing unit (GPU) of a mobile electronic device.
18. A graphics processor for an electronic device comprising: one or more processing elements coupled to a memory device, wherein the one or more processing elements are configured to: determine a braiding factor as a number of units of work encoded into a physical thread; determine a value of the braiding factor based on a mix of precision requirements presented for individual units of work prior to processing, a unit of work being a set of input data or instructions for processing; classify units of work as instructions for applied code transformation based on associated precision requirements for the processing environment, which comprises a single instruction multiple thread (SIMT) or single instruction multiple data (SIMD) processing architecture; pack instruction inputs from specified registers together into a destination register according to the determined value of the braiding factor; and execute the packed instructions presented in vector form with an instruction set architecture configured for executing packed instructions of different precisions, wherein the one or more processing elements are further configured to: replicate instructions for the units of work with precision requirements greater than precision requirements corresponding to the determined value of the braiding factor as multiple threads for SIMT or SIMD processing, the replicated instructions being identical instructions for processing by neighboring threads.
19. The graphics processor of claim 18, wherein multiple units of work are packed for parallel processing of multiple data elements into the physical thread for execution based on associated precision requirements for the processing environment.
20. The graphics processor of claim 19, wherein the number of units of work packed into the physical thread is determined based on compiler analysis or explicit qualification in a source language, and the mix of precision requirements are presented for the individual units of work in the source language to reduce fragmentation or scattering of contents in the register file.
21. The graphics processor of claim 20, wherein the one or more processing elements are further configured to: selectively replicate and pack instructions into the physical thread for units of work with precision requirements less than a basis precision of an instruction set architecture of the processing environment.
22. The graphics processor of claim 19, wherein the one or more processing elements are further configured to: narrow or widen operands of instructions to be applied as necessary to ensure consistent precision types of instruction inputs according to precision requirements of an instruction output; pack instruction inputs from specified registers together into a destination register according to the determined value of the braiding factor; and handle flow control divergence per unit of work within packed instructions with predication masks.
23. The graphics processor of claim 22, wherein the one or more processing elements are further configured to: designate a predication mask for each unit of work within packed instructions to manage independent determination of branch outcomes for each unit of work; determine whether each unit of work within packed instructions is currently active or should be reactivated with the predication mask; and provide tracking information for each unit of work within packed instructions for inspection and modification by an executing program.
24. The graphics processor of claim 18, wherein the electronic device comprises a mobile electronic device.
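The per-unit-of-work predication recited in the flow-control claims can be sketched as follows. This is an illustration under stated assumptions, not the claimed hardware mechanism: lanes are modeled as a Python list, each mask is an integer with bit i standing for unit of work i, and the names `lanes_where`, `apply_masked`, and `braided_branch` are hypothetical.

```python
def lanes_where(values, predicate, active):
    """Bitmask of lanes that are active and whose value satisfies the predicate."""
    mask = 0
    for lane, value in enumerate(values):
        if (active >> lane) & 1 and predicate(value):
            mask |= 1 << lane
    return mask


def apply_masked(values, op, mask):
    """Apply op only to lanes whose mask bit is set; other lanes are untouched."""
    return [op(v) if (mask >> lane) & 1 else v for lane, v in enumerate(values)]


def braided_branch(values, active, predicate, then_op, else_op):
    """Run both sides of a branch over packed lanes, each under its own mask."""
    taken = lanes_where(values, predicate, active)
    not_taken = active & ~taken            # active lanes that fail the predicate
    values = apply_masked(values, then_op, taken)
    values = apply_masked(values, else_op, not_taken)
    # Returning both masks models the tracking information a program could
    # inspect; after the branch, the lanes in `active` are live again.
    return values, taken, not_taken
```

For example, with lanes `[5, 20, 7, 30]`, an active mask of `0b1011`, and the branch "if v > 10 then v - 10 else v + 1", the result is `[6, 10, 7, 20]`: lane 2 is inactive and stays unchanged, while the other lanes each follow their own branch outcome.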
Patents cited by this patent (15)
Crow, Franklin C.; Montrym, John S.; Craighead, Matthew J., Apparatus, system, and method for gamma correction of smoothed primitives.
Fahs, Brian; Nickolls, John R.; Moreton, Henry Packard; Coon, Brett W., Efficient implementation of arrays of structures on SIMT and SIMD architectures.
Lu, Yan-Hong; Chang, Jia-Yang; Kuo, Pao-Hung; Chang, Chia-Chi; Tsung, Pei-Kuei, Methods and systems for managing an instruction sequence with a divergent control flow in a SIMT architecture.
Abdallah, Mohammad A., Parallel processing of a sequential program using hardware generated threads and their instruction groups executing on plural execution units and accessing register file segments using dependency inheritance vectors across multiple engines.