Hyunwoo KIM (김현우)

AI systems · FPGA acceleration · LLM inference · Seoul, KR

I work on the hardware/software boundary of AI systems: FPGA-based NPUs, LLM inference kernels, memory-bound workloads, and the runtimes that connect models to silicon.

My main project is pccx — a small research stack around a custom 64-bit ISA, an INT8 systolic array, runtime queues, and a Python-facing driver for edge FPGA inference. I care about the uncomfortable last mile of deployment: where the model graph finally meets memory bandwidth, queues, and hardware limits.
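
To give that stack a concrete shape, here is a minimal sketch of what submitting work to such a device from Python can look like. Everything in it (the Device class, enqueue_dma, enqueue_gemv, submit) is a hypothetical stand-in for illustration; it is not the pccx driver API.

```python
# Hypothetical sketch: every name here is an illustrative stand-in,
# not the real pccx driver API.
import numpy as np

class Device:
    """Toy command queue in front of a weight-stationary INT8 array."""
    def __init__(self):
        self.queue = []

    def enqueue_dma(self, host_array, scratch_slot):
        # A real driver would build a DMA descriptor; we just record it.
        self.queue.append(("dma", scratch_slot, host_array.nbytes))

    def enqueue_gemv(self, w_slot, x_slot, out_slot):
        self.queue.append(("gemv", w_slot, x_slot, out_slot))

    def submit(self):
        # Real hardware: write descriptors to a ring buffer, ring a
        # doorbell, poll for completion. Here: drain the software queue.
        work, self.queue = self.queue, []
        return work

dev = Device()
w = np.random.randint(-128, 128, (256, 256), dtype=np.int8)
x = np.random.randint(-128, 128, 256, dtype=np.int8)
dev.enqueue_dma(w, scratch_slot=0)   # weights into on-chip scratchpad
dev.enqueue_dma(x, scratch_slot=1)   # activation vector
dev.enqueue_gemv(w_slot=0, x_slot=1, out_slot=2)
print(dev.submit())
```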

01 · threads

Research threads

The topics I keep returning to when I read papers or build systems.

  • memory-bound inference
    When bandwidth becomes the model's real batch size

    Decode-phase LLM inference often feels less like "more MACs" and more like carefully shaped data movement. The first sketch after this list works out the arithmetic.

  • gemm / gemv
    The kernel shape matters more than the operation name

    GEMV is not merely GEMM with N = 1; the memory access pattern changes the whole optimization target. See the second sketch below.

  • runtime
    A driver is part of the accelerator

    Queues, synchronization, scratchpads, and transfer overlap decide whether the hardware feels fast or broken. The third sketch below shows the overlap pattern.

  • low-bit systems
    Quantization is a system design problem

    Weight precision, activation precision, packing, and hardware datapaths have to be reasoned about together. The fourth sketch below makes this concrete.
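
First sketch, for the memory-bound thread: the claim as arithmetic. In single-stream decode, every generated token streams essentially all the weights through memory once, so tokens/s is capped by bandwidth over weight footprint. The model size and bandwidth below are illustrative numbers, not measurements of any particular system.

```python
# Back-of-envelope decode ceiling: tokens/s <= bandwidth / bytes_per_token.
params = 7e9            # 7B-parameter model (illustrative)
bytes_per_weight = 1    # INT8 weights
dram_bw = 100e9         # 100 GB/s memory bandwidth (illustrative)

bytes_per_token = params * bytes_per_weight  # whole model streamed per step
# (Ignores KV-cache traffic, which only lowers the ceiling further.)
print(f"decode ceiling ~ {dram_bw / bytes_per_token:.1f} tokens/s")  # ~14.3
```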
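
Second sketch, for the GEMM/GEMV thread: arithmetic intensity in FLOPs per byte of weights moved. A batch of N activation vectors reuses each weight N times, so intensity grows linearly in N; at N = 1 there is no reuse at all, and the kernel lives on the bandwidth side of the roofline.

```python
def flops_per_weight_byte(n_batch, bytes_per_weight=1):
    # (M, K) x (K, N) GEMM: 2*M*K*N FLOPs over M*K*bytes_per_weight bytes,
    # so intensity is 2*N / bytes_per_weight, independent of M and K.
    return 2 * n_batch / bytes_per_weight

for n in (1, 8, 64):
    print(f"N={n}: {flops_per_weight_byte(n):.0f} FLOPs per weight byte")
# N=1 (GEMV) is ~2 FLOPs/byte: bandwidth-bound on essentially any device.
# N=64 is 128 FLOPs/byte: the same "multiply" is now a compute problem.
```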
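
Third sketch, for the runtime thread: the ping-pong double-buffering pattern that lets the DMA for the next tile overlap compute on the current one. The device API here (alloc_scratch, enqueue_dma, enqueue_compute, wait) is an assumed generic interface, not the pccx one.

```python
def run_tiles(dev, tiles):
    """Ping-pong over two scratchpad buffers so the transfer for tile i+1
    overlaps the compute on tile i."""
    bufs = [dev.alloc_scratch(), dev.alloc_scratch()]
    done = [None, None]                  # completion event per buffer
    for i, tile in enumerate(tiles):
        cur = i % 2
        if done[cur] is not None:
            dev.wait(done[cur])          # buffer frees once its compute ends
        dev.enqueue_dma(tile, bufs[cur]) # transfer starts immediately
        done[cur] = dev.enqueue_compute(bufs[cur])
    for e in done:
        if e is not None:
            dev.wait(e)                  # drain the queue before returning
```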
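
Fourth sketch, for the low-bit thread: even the simplest per-tensor symmetric INT8 scheme couples a calibration statistic (the scale), a storage format, and a datapath that accumulates INT8 products in INT32, which is exactly what a systolic array's accumulators do. A numpy version:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8: w ~= scale * q with q in [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(512, 512).astype(np.float32)
x = np.random.randn(512).astype(np.float32)
qw, sw = quantize_int8(w)
qx, sx = quantize_int8(x)

# Hardware-shaped matvec: INT8 x INT8 products accumulated in INT32
# (512 terms of at most 127*127 each fits comfortably), rescaled once.
acc = qw.astype(np.int32) @ qx.astype(np.int32)
y = acc.astype(np.float32) * (sw * sx)
print(np.abs(y - w @ x).max())  # end-to-end quantization error
```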

02 · toolbox

Toolbox

A compact snapshot. This is not meant to be a resume table.

  • hardware: SystemVerilog, Vitis HLS, FPGA bring-up, systolic-array datapaths
  • systems: C/C++, Python runtimes, queues, memory layout, profiling, small kernels
  • ai inference: Transformer inference, KV-cache, GEMM/GEMV, quantization, roofline-style analysis
  • writing: paper notes, architecture diagrams, reproducible experiment logs