Hyunwoo KIM (김현우)

AI systems · FPGA acceleration · LLM inference · Seoul, KR

I work on the hardware/software boundary of AI systems: FPGA-based NPUs, LLM inference kernels, memory-bound workloads, and the runtimes that connect models to silicon.

My main project is pccx — a small research stack around a custom 64-bit ISA, an INT8 systolic array, runtime queues, and a Python-facing driver for edge FPGA inference. I care about the uncomfortable last mile of deployment: where the model graph finally meets memory bandwidth, queues, and hardware limits.
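
To give that stack a concrete shape, here is a minimal sketch of what submitting work to such a device from Python can look like. Everything in it (the Device class, enqueue_dma, enqueue_gemv, submit) is a hypothetical stand-in for illustration; it is not the pccx driver API.

```python
# Hypothetical sketch: every name here is an illustrative stand-in,
# not the real pccx driver API.
import numpy as np

class Device:
    """Toy command queue in front of a weight-stationary INT8 array."""
    def __init__(self):
        self.queue = []

    def enqueue_dma(self, host_array, scratch_slot):
        # A real driver would build a DMA descriptor; we just record it.
        self.queue.append(("dma", scratch_slot, host_array.nbytes))

    def enqueue_gemv(self, w_slot, x_slot, out_slot):
        self.queue.append(("gemv", w_slot, x_slot, out_slot))

    def submit(self):
        # Real hardware: write descriptors to a ring buffer, ring a
        # doorbell, poll for completion. Here: drain the software queue.
        work, self.queue = self.queue, []
        return work

dev = Device()
w = np.random.randint(-128, 128, (256, 256), dtype=np.int8)
x = np.random.randint(-128, 128, 256, dtype=np.int8)
dev.enqueue_dma(w, scratch_slot=0)   # weights into on-chip scratchpad
dev.enqueue_dma(x, scratch_slot=1)   # activation vector
dev.enqueue_gemv(w_slot=0, x_slot=1, out_slot=2)
print(dev.submit())
```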

01 · threads

Research threads

The topics I keep returning to when I read papers or build systems.

  • memory-bound inference
    When bandwidth becomes the model's real batch size

    Decode-phase LLM inference often feels less like "more MACs" and more like carefully shaped data movement. The first sketch after this list works out the arithmetic.

  • gemm / gemv
    The kernel shape matters more than the operation name

    GEMV is not merely GEMM with N = 1; the memory access pattern changes the whole optimization target. See the second sketch below.

  • runtime
    A driver is part of the accelerator

    Queues, synchronization, scratchpads, and transfer overlap decide whether the hardware feels fast or broken. The third sketch below shows the overlap pattern.

  • low-bit systems
    Quantization is a system design problem

    Weight precision, activation precision, packing, and hardware datapaths have to be reasoned about together. The fourth sketch below makes this concrete.
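
First sketch, for the memory-bound thread: the claim as arithmetic. In single-stream decode, every generated token streams essentially all the weights through memory once, so tokens/s is capped by bandwidth over weight footprint. The model size and bandwidth below are illustrative numbers, not measurements of any particular system.

```python
# Back-of-envelope decode ceiling: tokens/s <= bandwidth / bytes_per_token.
params = 7e9            # 7B-parameter model (illustrative)
bytes_per_weight = 1    # INT8 weights
dram_bw = 100e9         # 100 GB/s memory bandwidth (illustrative)

bytes_per_token = params * bytes_per_weight  # whole model streamed per step
# (Ignores KV-cache traffic, which only lowers the ceiling further.)
print(f"decode ceiling ~ {dram_bw / bytes_per_token:.1f} tokens/s")  # ~14.3
```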
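
Second sketch, for the GEMM/GEMV thread: arithmetic intensity in FLOPs per byte of weights moved. A batch of N activation vectors reuses each weight N times, so intensity grows linearly in N; at N = 1 there is no reuse at all, and the kernel lives on the bandwidth side of the roofline.

```python
def flops_per_weight_byte(n_batch, bytes_per_weight=1):
    # (M, K) x (K, N) GEMM: 2*M*K*N FLOPs over M*K*bytes_per_weight bytes,
    # so intensity is 2*N / bytes_per_weight, independent of M and K.
    return 2 * n_batch / bytes_per_weight

for n in (1, 8, 64):
    print(f"N={n}: {flops_per_weight_byte(n):.0f} FLOPs per weight byte")
# N=1 (GEMV) is ~2 FLOPs/byte: bandwidth-bound on essentially any device.
# N=64 is 128 FLOPs/byte: the same "multiply" is now a compute problem.
```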
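
Third sketch, for the runtime thread: the ping-pong double-buffering pattern that lets the DMA for the next tile overlap compute on the current one. The device API here (alloc_scratch, enqueue_dma, enqueue_compute, wait) is an assumed generic interface, not the pccx one.

```python
def run_tiles(dev, tiles):
    """Ping-pong over two scratchpad buffers so the transfer for tile i+1
    overlaps the compute on tile i."""
    bufs = [dev.alloc_scratch(), dev.alloc_scratch()]
    done = [None, None]                  # completion event per buffer
    for i, tile in enumerate(tiles):
        cur = i % 2
        if done[cur] is not None:
            dev.wait(done[cur])          # buffer frees once its compute ends
        dev.enqueue_dma(tile, bufs[cur]) # transfer starts immediately
        done[cur] = dev.enqueue_compute(bufs[cur])
    for e in done:
        if e is not None:
            dev.wait(e)                  # drain the queue before returning
```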
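
Fourth sketch, for the low-bit thread: even the simplest per-tensor symmetric INT8 scheme couples a calibration statistic (the scale), a storage format, and a datapath that accumulates INT8 products in INT32, which is exactly what a systolic array's accumulators do. A numpy version:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8: w ~= scale * q with q in [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(512, 512).astype(np.float32)
x = np.random.randn(512).astype(np.float32)
qw, sw = quantize_int8(w)
qx, sx = quantize_int8(x)

# Hardware-shaped matvec: INT8 x INT8 products accumulated in INT32
# (512 terms of at most 127*127 each fits comfortably), rescaled once.
acc = qw.astype(np.int32) @ qx.astype(np.int32)
y = acc.astype(np.float32) * (sw * sx)
print(np.abs(y - w @ x).max())  # end-to-end quantization error
```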

02 · toolbox

Toolbox

A compact snapshot. This is not meant to be a resume table.

  • hardware: SystemVerilog, Vitis HLS, FPGA bring-up, systolic-array datapaths
  • systems: C/C++, Python runtimes, queues, memory layout, profiling, small kernels
  • ai inference: Transformer inference, KV-cache, GEMM/GEMV, quantization, roofline-style analysis
  • writing: paper notes, architecture diagrams, reproducible experiment logs