

PIM(2024.05.31)

2024.05.31

Samsung Electronics and SK hynix Develop Memory Chips with Built-in Compute… “Response Speed 13x Faster than a GPU” (msn.com)

 

“The graphics processing units (GPUs) that run ChatGPT can use only 0.3% of their available performance, which leads to enormous cost and wasted power. SK hynix has a processing-in-memory (PIM) solution that reduces power loss and responds 13 times faster than a GPU.”

----------------------------------------------

 

Many modern and emerging applications must process increasingly large volumes of data. Unfortunately, prevalent computing paradigms are not designed to efficiently handle such large-scale data: The energy and performance costs to move this data between the memory subsystem and the CPU now dominate the total costs of computation. This forces system architects and designers to fundamentally rethink how to design computers. Processing-in-memory (PIM) is a computing paradigm that avoids most data movement costs by bringing computation to the data. New opportunities in modern memory systems are enabling architectures that can perform varying degrees of processing inside the memory subsystem. However, many practical system-level issues must be tackled to construct PIM architectures, including enabling workloads and programmers to easily take advantage of PIM. This article examines three key domains of work toward the practical construction and widespread adoption of PIM architectures. First, we describe our work on systematically identifying opportunities for PIM in real applications and quantify potential gains for popular emerging applications (e.g., machine learning, data analytics, genome analysis). Second, we aim to solve several key issues in programming these applications for PIM architectures. Third, we describe challenges that remain for the widespread adoption of PIM.

 


 

### Introduction
Many diverse applications have emerged as computing platforms of all kinds have become more pervasive in society. Many of these modern and emerging applications must now process very large datasets. For example, an object classification algorithm in an augmented reality application typically trains on millions of example images and video clips, and must classify a real-time high-definition video stream. To extract meaningful information from such huge volumes of data, applications turn to artificial intelligence (AI), i.e., machine learning and data analytics, to systematically mine the data and extract key properties of the dataset.

Due to this growing reliance on manipulating and mining large volumes of data, these modern applications greatly overwhelm the data storage and movement resources of a modern computer. In a modern computer, main memory (built from DRAM) cannot perform any operations on the data it stores.

As a result, any operation on data stored in memory requires the data to be moved from memory to the CPU over the memory channel, an off-chip bus with a limited pin count (e.g., conventional DDR memory uses a 64-bit memory channel). To move the data, the CPU must issue a request to the memory controller, which sends commands across the memory channel to the DRAM module that holds the data. The DRAM module reads out the data and returns it across the memory channel, after which the data passes through the cache hierarchy into the CPU caches. The CPU can operate on the data only once it has been loaded from the caches into a CPU register.
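The path described above makes the cost of every byte concrete. As a rough sketch (all hardware figures below are illustrative assumptions, not numbers from the article), the time and energy just to stream a large dataset across a conventional 64-bit DDR channel can be estimated as:

```python
# Back-of-the-envelope estimate of the cost of moving a large dataset
# over a conventional 64-bit DDR memory channel. All figures below are
# illustrative assumptions, not measurements from the article.

CHANNEL_WIDTH_BITS = 64      # conventional DDR channel width
TRANSFER_RATE_MTPS = 3200    # assumed DDR4-3200: 3200 megatransfers/s
ENERGY_PER_BIT_PJ = 20.0     # assumed off-chip DRAM access energy, pJ/bit

def channel_bandwidth_gbps():
    """Peak bandwidth in GB/s = channel width (bytes) x transfer rate."""
    return (CHANNEL_WIDTH_BITS / 8) * TRANSFER_RATE_MTPS / 1000.0

def movement_cost(dataset_bytes):
    """Return (seconds, joules) just to stream the dataset over the channel."""
    seconds = dataset_bytes / (channel_bandwidth_gbps() * 1e9)
    joules = dataset_bytes * 8 * ENERGY_PER_BIT_PJ * 1e-12
    return seconds, joules

# Streaming a 64 GiB dataset once:
secs, joules = movement_cost(64 * 2**30)
print(f"peak bandwidth: {channel_bandwidth_gbps():.1f} GB/s")
print(f"time: {secs:.2f} s, energy: {joules:.1f} J")
```

With these assumed figures, a single pass over a 64 GiB dataset already costs on the order of seconds and joules before any computation happens at all, which is precisely the movement cost that PIM aims to avoid.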

 

Unfortunately, for modern and emerging applications, the large amounts of data that need to move across the memory channel create a large data movement bottleneck in the computing system [13, 14]. The data movement bottleneck incurs a heavy penalty in terms of both performance and energy consumption [13–20]. First, there is a long latency and significant energy involved in bringing data from DRAM. Second, it is difficult to send a large number of requests to memory in parallel, in part because of the narrow width of the memory channel. Third, despite the costs of moving data from memory, much of this data is not reused by the CPU, rendering caching either highly inefficient or completely unnecessary [5, 21], especially for modern workloads with very large datasets and random access patterns. Today, the total cost of computation, in terms of both performance and energy, is dominated by the cost of data movement for modern data-intensive workloads such as machine learning and data analytics.

 

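The point that streamed data is rarely reused can be made concrete with a simple roofline-style estimate. This is a sketch under assumed hardware figures (the peak compute and channel bandwidth below are illustrative, not from the article):

```python
# Roofline-style sketch: a streaming reduction (e.g., summing a huge array)
# performs one floating-point add per 8-byte element, i.e. an arithmetic
# intensity of 0.125 FLOP/byte. With so little reuse, the CPU is limited by
# the memory channel, not by its compute throughput.
# All hardware figures are illustrative assumptions.

PEAK_COMPUTE_GFLOPS = 500.0   # assumed peak CPU throughput
PEAK_BANDWIDTH_GBPS = 25.6    # assumed 64-bit DDR4-3200 channel

def attainable_gflops(flops_per_byte):
    """Attainable throughput = min(peak compute, intensity x bandwidth)."""
    return min(PEAK_COMPUTE_GFLOPS, flops_per_byte * PEAK_BANDWIDTH_GBPS)

# Streaming sum: one add per 8-byte double, no reuse -> memory-bound.
streaming = attainable_gflops(1 / 8)
print(f"streaming sum: {streaming:.1f} GFLOP/s "
      f"({100 * streaming / PEAK_COMPUTE_GFLOPS:.2f}% of assumed peak)")
```

Under these assumptions a streaming sum attains only about 3.2 GFLOP/s, well under 1% of the assumed peak compute throughput — the same order of underutilization as the 0.3% GPU figure quoted at the top of the post.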