<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>hkimw Blog</title>
        <link>https://hkimw.github.io/hkimw/ko/blog</link>
        <description>hkimw Blog</description>
        <lastBuildDate>Fri, 17 Apr 2026 00:00:00 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>ko</language>
        <item>
            <title><![CDATA[[Paper] Attention Is All You Need]]></title>
            <link>https://hkimw.github.io/hkimw/ko/blog/attention-is-all-you-need</link>
            <guid>https://hkimw.github.io/hkimw/ko/blog/attention-is-all-you-need</guid>
            <pubDate>Fri, 17 Apr 2026 00:00:00 GMT</pubDate>
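            <description><![CDATA[This post covers the core concepts and mathematical principles of the Transformer architecture.]]></description>
            <content:encoded><![CDATA[<p>This post covers the core concepts and mathematical principles of the Transformer architecture.</p>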
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="1-transformer의-등장-배경">1. Background: Why the Transformer Emerged<a href="https://hkimw.github.io/hkimw/ko/blog/attention-is-all-you-need#1-transformer%EC%9D%98-%EB%93%B1%EC%9E%A5-%EB%B0%B0%EA%B2%BD" class="hash-link" aria-label="Direct link to 1. Background: Why the Transformer Emerged" title="Direct link to 1. Background: Why the Transformer Emerged" translate="no">​</a></h2>
<p>Before the Transformer, NLP was dominated by RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory networks). These models process data sequentially. Given the sentence "나는 학교에 간다" ("I go to school"), the model first processes '나는' ("I"), uses that result to process '학교에' ("to school"), and then uses that result in turn to process '간다' ("go").</p>
<p>This sequential approach has two critical limitations.</p>
<ol>
<li class="">
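<p><strong>No parallelization within a sequence:</strong> The computation for a word can start only after the previous word's computation has finished, so the time steps of a sequence cannot be spread across the available compute resources and run in parallel.</p>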
</li>
<li class="">
<p><strong>Long-term dependency problem:</strong> As a sentence gets longer, the information from words seen early on fades by the time the model reaches later positions.</p>
</li>
</ol>
<p>The Transformer started from the idea: "<strong>instead of feeding words in one at a time, feed in the whole sentence at once and compute the relationships between all words simultaneously</strong>". The key technique that makes this possible is the <strong>Attention</strong> mechanism.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="2-model-architecture">2. Model Architecture<a href="https://hkimw.github.io/hkimw/ko/blog/attention-is-all-you-need#2-model-architecture" class="hash-link" aria-label="Direct link to 2. Model Architecture" title="Direct link to 2. Model Architecture" translate="no">​</a></h2>
<p>The Transformer adopts an <strong>Encoder-Decoder</strong> structure designed for sequence transduction tasks such as machine translation.</p>
<div style="padding:1rem;background:#FBF8F3;border-radius:4px;border:1px solid #e0e0e0;margin:1rem 0"><svg viewBox="0 0 800 600" width="100%" height="100%" style="font-family:var(--ifm-font-family-monospace)"><defs><marker id="arrow" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="6" markerHeight="6" orient="auto-start-reverse"><path d="M 0 0 L 10 5 L 0 10 z" fill="#333"></path></marker></defs><g transform="translate(100, 50)"><rect x="0" y="0" width="250" height="500" fill="none" stroke="#666" stroke-width="2" stroke-dasharray="5,5"></rect><text x="125" y="-15" text-anchor="middle" fill="#333" font-weight="bold">Encoder (N=6)</text><rect x="50" y="450" width="150" height="30" fill="#eaeaea" stroke="#333" rx="4"></rect><text x="125" y="470" text-anchor="middle" font-size="12" fill="#333">Input Embedding</text><circle cx="125" cy="410" r="15" fill="#eaeaea" stroke="#333"></circle><text x="125" y="415" text-anchor="middle" font-size="16" fill="#333">+</text><text x="60" y="415" font-size="12" fill="#333">Positional Encoding</text><rect x="50" y="320" width="150" height="40" fill="#f0e6d2" stroke="#333" rx="4"></rect><text x="125" y="345" text-anchor="middle" font-size="12" fill="#333">Multi-Head Attention</text><rect x="50" y="250" width="150" height="30" fill="#e2ede2" stroke="#333" rx="4"></rect><text x="125" y="270" text-anchor="middle" font-size="12" fill="#333">Add &amp; Norm</text><rect x="50" y="160" width="150" height="40" fill="#e6f0f9" stroke="#333" rx="4"></rect><text x="125" y="185" text-anchor="middle" font-size="12" fill="#333">Feed Forward</text><rect x="50" y="90" width="150" height="30" fill="#e2ede2" stroke="#333" rx="4"></rect><text x="125" y="110" text-anchor="middle" font-size="12" fill="#333">Add &amp; Norm</text><path d="M 125 450 L 125 425" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><path d="M 125 395 L 125 360" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><path d="M 125 320 L 125 280" stroke="#333" 
stroke-width="1.5" marker-end="url(#arrow)"></path><path d="M 125 250 L 125 200" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><path d="M 125 160 L 125 120" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><path d="M 125 90 L 125 30" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><path d="M 125 380 L 25 380 L 25 265 L 50 265" fill="none" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><path d="M 125 225 L 25 225 L 25 105 L 50 105" fill="none" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path></g><g transform="translate(450, 50)"><rect x="0" y="0" width="250" height="500" fill="none" stroke="#666" stroke-width="2" stroke-dasharray="5,5"></rect><text x="125" y="-15" text-anchor="middle" fill="#333" font-weight="bold">Decoder (N=6)</text><rect x="50" y="450" width="150" height="30" fill="#eaeaea" stroke="#333" rx="4"></rect><text x="125" y="470" text-anchor="middle" font-size="12" fill="#333">Output Embedding</text><circle cx="125" cy="410" r="15" fill="#eaeaea" stroke="#333"></circle><text x="125" y="415" text-anchor="middle" font-size="16" fill="#333">+</text><text x="145" y="415" font-size="12" fill="#333">Positional Encoding</text><rect x="50" y="340" width="150" height="40" fill="#f0e6d2" stroke="#333" rx="4"></rect><text x="125" y="357" text-anchor="middle" font-size="12" fill="#333">Masked</text><text x="125" y="372" text-anchor="middle" font-size="12" fill="#333">Multi-Head Attention</text><rect x="50" y="280" width="150" height="30" fill="#e2ede2" stroke="#333" rx="4"></rect><text x="125" y="300" text-anchor="middle" font-size="12" fill="#333">Add &amp; Norm</text><rect x="50" y="210" width="150" height="40" fill="#f0e6d2" stroke="#333" rx="4"></rect><text x="125" y="235" text-anchor="middle" font-size="12" fill="#333">Multi-Head Attention</text><rect x="50" y="150" width="150" height="30" fill="#e2ede2" stroke="#333" rx="4"></rect><text x="125" y="170" text-anchor="middle" 
font-size="12" fill="#333">Add &amp; Norm</text><rect x="50" y="90" width="150" height="40" fill="#e6f0f9" stroke="#333" rx="4"></rect><text x="125" y="115" text-anchor="middle" font-size="12" fill="#333">Feed Forward</text><rect x="50" y="30" width="150" height="30" fill="#e2ede2" stroke="#333" rx="4"></rect><text x="125" y="50" text-anchor="middle" font-size="12" fill="#333">Add &amp; Norm</text><rect x="50" y="-40" width="150" height="20" fill="#eaeaea" stroke="#333" rx="4"></rect><text x="125" y="-25" text-anchor="middle" font-size="12" fill="#333">Linear</text><rect x="50" y="-80" width="150" height="20" fill="#eaeaea" stroke="#333" rx="4"></rect><text x="125" y="-65" text-anchor="middle" font-size="12" fill="#333">Softmax</text><text x="125" y="-100" text-anchor="middle" font-size="12" fill="#333">Output Probabilities</text><path d="M 125 450 L 125 425" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><path d="M 125 395 L 125 380" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><path d="M 125 340 L 125 310" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><path d="M 125 280 L 125 250" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><path d="M 125 210 L 125 180" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><path d="M 125 150 L 125 130" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><path d="M 125 90 L 125 60" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><path d="M 125 30 L 125 -20" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><path d="M 125 -40 L 125 -60" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><path d="M 125 -80 L 125 -95" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><path d="M 125 390 L 25 390 L 25 295 L 50 295" fill="none" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><path d="M 125 265 L 25 265 L 25 165 L 50 165" fill="none" stroke="#333" 
stroke-width="1.5" marker-end="url(#arrow)"></path><path d="M 125 140 L 25 140 L 25 45 L 50 45" fill="none" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path></g><path d="M 225 140 L 400 140 L 400 230 L 500 230" fill="none" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><text x="350" y="130" font-size="12" fill="#333">K, V</text><text x="510" y="260" font-size="12" fill="#333">Q</text></svg></div>
<ul>
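<li class=""><strong>Auto-regressive property:</strong> When generating output, the model consumes the symbols it has already generated as additional input for the next step. That is, it predicts the first word, then predicts the second word conditioned on that first word, and so on.</li>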
</ul>
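<p>The auto-regressive loop described above can be sketched as follows. This is a minimal illustration, not the paper's decoding procedure: <code>greedy_decode</code>, <code>toy_next_token</code>, <code>TOY_TABLE</code>, and the <code>BOS</code>/<code>EOS</code> marker tokens are all hypothetical names, and the lookup table merely stands in for a trained decoder.</p>

```python
def greedy_decode(next_token_fn, start_token, end_token, max_len):
    """Generate tokens one at a time, feeding every earlier output back in."""
    generated = [start_token]
    for _ in range(max_len):
        # The next prediction is conditioned on everything generated so far.
        token = next_token_fn(generated)
        generated.append(token)
        if token == end_token:
            break
    return generated


# Hypothetical stand-in for a trained decoder: maps the last token to the next one.
TOY_TABLE = {"BOS": "I", "I": "go", "go": "to", "to": "school", "school": "EOS"}


def toy_next_token(prefix):
    return TOY_TABLE[prefix[-1]]


print(greedy_decode(toy_next_token, "BOS", "EOS", max_len=10))
```

<p>Each call to <code>next_token_fn</code> receives the full prefix, which is why generation cannot be parallelized across output positions at inference time.</p>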
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="21-encoder">2.1 Encoder<a href="https://hkimw.github.io/hkimw/ko/blog/attention-is-all-you-need#21-encoder" class="hash-link" aria-label="Direct link to 2.1 Encoder" title="Direct link to 2.1 Encoder" translate="no">​</a></h3>
<p>The Encoder reads the source sentence (for example, a Korean sentence), works out the meaning and context of the words in it, and converts the sentence into a contextual representation with one vector per input position.</p>
<ul>
<li class="">
<p><strong>Layer stack:</strong> The Encoder is a stack of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>N</mi><mo>=</mo><mn>6</mn></mrow><annotation encoding="application/x-tex">N = 6</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord mathnormal" style="margin-right:0.109em">N</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">6</span></span></span></span> identical layers.</p>
</li>
<li class="">
<p><strong>Sub-layers:</strong> Each layer contains two sub-layers.</p>
<ol>
<li class="">
<p><strong>Multi-Head Self-Attention:</strong> Determines how the words within the sentence relate to one another.</p>
</li>
<li class="">
<p><strong>Position-wise Feed-Forward Network (FFN):</strong> A neural network, applied to each position separately, that further refines each word's features using the relational information gathered by attention.</p>
</li>
</ol>
</li>
<li class="">
<p><strong>Residual Connection and Layer Normalization:</strong>
The output of each sub-layer is computed as follows.</p>
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mi>O</mi><mi>u</mi><mi>t</mi><mi>p</mi><mi>u</mi><mi>t</mi><mo>=</mo><mi>L</mi><mi>a</mi><mi>y</mi><mi>e</mi><mi>r</mi><mi>N</mi><mi>o</mi><mi>r</mi><mi>m</mi><mo stretchy="false">(</mo><mi>x</mi><mo>+</mo><mi>S</mi><mi>u</mi><mi>b</mi><mi>l</mi><mi>a</mi><mi>y</mi><mi>e</mi><mi>r</mi><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">Output = LayerNorm(x + Sublayer(x))</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8778em;vertical-align:-0.1944em"></span><span class="mord mathnormal" style="margin-right:0.0278em">O</span><span class="mord mathnormal">u</span><span class="mord mathnormal">tp</span><span class="mord mathnormal">u</span><span class="mord mathnormal">t</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal">L</span><span class="mord mathnormal">a</span><span class="mord mathnormal" style="margin-right:0.0359em">y</span><span class="mord mathnormal" style="margin-right:0.0278em">er</span><span class="mord mathnormal" style="margin-right:0.109em">N</span><span class="mord mathnormal" style="margin-right:0.0278em">or</span><span class="mord mathnormal">m</span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" 
style="margin-right:0.0576em">S</span><span class="mord mathnormal">u</span><span class="mord mathnormal">b</span><span class="mord mathnormal" style="margin-right:0.0197em">l</span><span class="mord mathnormal">a</span><span class="mord mathnormal" style="margin-right:0.0359em">y</span><span class="mord mathnormal" style="margin-right:0.0278em">er</span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mclose">))</span></span></span></span></span>
<ul>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>x</mi></mrow><annotation encoding="application/x-tex">x</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em"></span><span class="mord mathnormal">x</span></span></span></span><strong>:</strong> The original input to the sub-layer.</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>S</mi><mi>u</mi><mi>b</mi><mi>l</mi><mi>a</mi><mi>y</mi><mi>e</mi><mi>r</mi><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">Sublayer(x)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.0576em">S</span><span class="mord mathnormal">u</span><span class="mord mathnormal">b</span><span class="mord mathnormal" style="margin-right:0.0197em">l</span><span class="mord mathnormal">a</span><span class="mord mathnormal" style="margin-right:0.0359em">y</span><span class="mord mathnormal" style="margin-right:0.0278em">er</span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mclose">)</span></span></span></span><strong>:</strong> The result of the Attention or FFN computation.</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>x</mi><mo>+</mo><mi>S</mi><mi>u</mi><mi>b</mi><mi>l</mi><mi>a</mi><mi>y</mi><mi>e</mi><mi>r</mi><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">x + Sublayer(x)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6667em;vertical-align:-0.0833em"></span><span class="mord mathnormal">x</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.0576em">S</span><span class="mord mathnormal">u</span><span class="mord mathnormal">b</span><span class="mord mathnormal" style="margin-right:0.0197em">l</span><span class="mord mathnormal">a</span><span class="mord mathnormal" style="margin-right:0.0359em">y</span><span class="mord mathnormal" style="margin-right:0.0278em">er</span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mclose">)</span></span></span></span> <strong>(Residual Connection):</strong> Adds the original input to the sub-layer's result. This keeps early information from being lost as the network gets deeper, which stabilizes training.</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>L</mi><mi>a</mi><mi>y</mi><mi>e</mi><mi>r</mi><mi>N</mi><mi>o</mi><mi>r</mi><mi>m</mi><mo stretchy="false">(</mo><mi mathvariant="normal">.</mi><mi mathvariant="normal">.</mi><mi mathvariant="normal">.</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">LayerNorm(...)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal">L</span><span class="mord mathnormal">a</span><span class="mord mathnormal" style="margin-right:0.0359em">y</span><span class="mord mathnormal" style="margin-right:0.0278em">er</span><span class="mord mathnormal" style="margin-right:0.109em">N</span><span class="mord mathnormal" style="margin-right:0.0278em">or</span><span class="mord mathnormal">m</span><span class="mopen">(</span><span class="mord">...</span><span class="mclose">)</span></span></span></span><strong>:</strong> Computes the mean and variance of the summed result and uses them to normalize the data to a consistent range.</p>
</li>
</ul>
</li>
<li class="">
<p><strong>Uniform dimension:</strong> To make the residual connections straightforward, every sub-layer in the model, as well as the embedding layers, produces outputs of a fixed dimension <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>d</mi><mrow><mi>m</mi><mi>o</mi><mi>d</mi><mi>e</mi><mi>l</mi></mrow></msub><mo>=</mo><mn>512</mn></mrow><annotation encoding="application/x-tex">d_{model} = 512</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8444em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">m</span><span class="mord mathnormal mtight">o</span><span class="mord mathnormal mtight">d</span><span class="mord mathnormal mtight">e</span><span class="mord mathnormal mtight" style="margin-right:0.0197em">l</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">512</span></span></span></span>.</p>
</li>
</ul>
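<p>The sub-layer wrapper above, LayerNorm(x + Sublayer(x)), can be sketched for a single position vector as follows. This is a simplified illustration under stated assumptions: the learnable gain and bias that layer normalization normally carries are omitted, the <code>eps</code> value is an arbitrary choice, and <code>toy_sublayer</code> is a hypothetical stand-in for Attention or the FFN.</p>

```python
import math


def layer_norm(vector, eps=1e-6):
    """Normalize one position's features to zero mean and (nearly) unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + eps) for v in vector]


def residual_block(x, sublayer):
    """Compute LayerNorm(x + Sublayer(x)) for a single position vector x."""
    sub_out = sublayer(x)
    # Residual connection: the sub-layer output must have the same size as x.
    summed = [a + b for a, b in zip(x, sub_out)]
    return layer_norm(summed)


def toy_sublayer(x):
    # Hypothetical sub-layer used only for illustration: doubles every feature.
    return [2.0 * v for v in x]


out = residual_block([1.0, 2.0, 3.0, 4.0], toy_sublayer)
print(out)
```

<p>The element-wise addition only works because the input and the sub-layer output have the same length, which is the reason the model fixes a single output dimension everywhere.</p>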
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="22-decoder">2.2 Decoder<a href="https://hkimw.github.io/hkimw/ko/blog/attention-is-all-you-need#22-decoder" class="hash-link" aria-label="Direct link to 2.2 Decoder" title="Direct link to 2.2 Decoder" translate="no">​</a></h3>
<p>The Decoder generates the target sentence (for example, the translated English sentence) one token at a time, based on the contextual information produced by the Encoder. Like the Encoder, it consists of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>N</mi><mo>=</mo><mn>6</mn></mrow><annotation encoding="application/x-tex">N = 6</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord mathnormal" style="margin-right:0.109em">N</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">6</span></span></span></span> identical layers, but each layer has three sub-layers instead of two.</p>
<ol>
<li class="">
<p><strong>Masked Multi-Head Self-Attention:</strong></p>
<ul>
<li class="">
<p>When the Decoder generates an output word, this sub-layer masks the words that come after the current position (the future words) so that the model cannot look ahead.</p>
</li>
<li class="">
<p>For example, when predicting the third word, only the first and second words may be referenced. To enforce this, the similarity scores of the future words are set to <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>−</mo><mi mathvariant="normal">∞</mi></mrow><annotation encoding="application/x-tex">-\infty</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6667em;vertical-align:-0.0833em"></span><span class="mord">−</span><span class="mord">∞</span></span></span></span> before the Softmax, so that their Attention weights become 0 after the Softmax.</p>
</li>
</ul>
</li>
<li class="">
<p><strong>Multi-Head Attention (Encoder-Decoder Attention):</strong></p>
<ul>
<li class="">
<p>This is where the Decoder decides "which parts of the source sentence to focus on" in order to generate a word.</p>
</li>
<li class="">
<p>Here the Decoder uses its own state as the Query and attends over the Encoder's final output, which supplies the Keys and Values.</p>
</li>
</ul>
</li>
<li class="">
<p><strong>Position-wise Feed-Forward Network:</strong> Identical in structure to the Encoder's.</p>
</li>
</ol>
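<p>The masking step in the first sub-layer can be sketched as follows. This is a minimal illustration with made-up scores. It assumes the usual convention that row <code>i</code> of the score matrix belongs to position <code>i</code>, which may look at columns <code>0..i</code> only. The helper names <code>softmax</code> and <code>causal_mask_scores</code> are hypothetical.</p>

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_mask_scores(scores):
    """Set the score of every future position (column j after row i) to -inf."""
    n = len(scores)
    return [
        [scores[i][j] if i >= j else -math.inf for j in range(n)]
        for i in range(n)
    ]


# Made-up similarity scores for a 3-token sequence.
raw = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
weights = [softmax(row) for row in causal_mask_scores(raw)]
for row in weights:
    print([round(w, 4) for w in row])
```

<p>Because <code>exp(-inf)</code> evaluates to 0, every masked position receives an attention weight of exactly 0, while the remaining weights in each row still sum to 1.</p>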
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="3-attention-메커니즘">3. Attention Mechanism<a href="https://hkimw.github.io/hkimw/ko/blog/attention-is-all-you-need#3-attention-%EB%A9%94%EC%BB%A4%EB%8B%88%EC%A6%98" class="hash-link" aria-label="Direct link to 3. Attention Mechanism" title="Direct link to 3. Attention Mechanism" translate="no">​</a></h2>
<p>The Attention mechanism is the core of the Transformer. An Attention function can be described as mapping a Query and a set of Key-Value pairs to an output.</p>
<p>By analogy, it works like looking up information in a library.</p>
<ul>
<li class="">
<p><strong>Query (Q):</strong> The 'search term' the user types into the search box (the word we currently want to understand).</p>
</li>
<li class="">
<p><strong>Key (K):</strong> The 'index' or 'label' attached to each book in the library (the features that the other words expose).</p>
</li>
<li class="">
<p><strong>Value (V):</strong> The actual 'content' of the book (the information that the other words carry).</p>
</li>
</ul>
<p>(* In Self-Attention, <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>Q</mi><mo separator="true">,</mo><mi>K</mi><mo separator="true">,</mo><mi>V</mi></mrow><annotation encoding="application/x-tex">Q, K, V</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8778em;vertical-align:-0.1944em"></span><span class="mord mathnormal">Q</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.0715em">K</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.2222em">V</span></span></span></span> are all derived from the same input sentence; each is obtained by multiplying that input by a different weight matrix, which transforms it for its particular role.)</p>
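<p>The note above can be sketched with toy numbers. In this illustration the same input matrix <code>X</code> is multiplied by three different weight matrices to obtain Q, K, and V. The names <code>matmul</code>, <code>X</code>, <code>W_Q</code>, <code>W_K</code>, and <code>W_V</code>, the tiny dimension of 2, and all values are assumptions chosen for readability; in a real model the weight matrices are learned.</p>

```python
def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Toy input: 3 tokens, each a 2-dimensional vector (a real model would use 512).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical projection matrices, hand-picked here instead of learned.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.0, 1.0], [1.0, 0.0]]
W_V = [[2.0, 0.0], [0.0, 2.0]]

Q = matmul(X, W_Q)
K = matmul(X, W_K)
V = matmul(X, W_V)
print(Q, K, V, sep="\n")
```

<p>All three results have one row per token, but each row differs depending on which weight matrix was applied to the shared input.</p>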
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="31-scaled-dot-product-attention">3.1 Scaled Dot-Product Attention<a href="https://hkimw.github.io/hkimw/ko/blog/attention-is-all-you-need#31-scaled-dot-product-attention" class="hash-link" aria-label="Direct link to 3.1 Scaled Dot-Product Attention" title="Direct link to 3.1 Scaled Dot-Product Attention" translate="no">​</a></h3>
<p>To compute Attention, the paper proposes a method called 'Scaled Dot-Product Attention'. The formula is as follows.</p>
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mi>A</mi><mi>t</mi><mi>t</mi><mi>e</mi><mi>n</mi><mi>t</mi><mi>i</mi><mi>o</mi><mi>n</mi><mo stretchy="false">(</mo><mi>Q</mi><mo separator="true">,</mo><mi>K</mi><mo separator="true">,</mo><mi>V</mi><mo stretchy="false">)</mo><mo>=</mo><mi>s</mi><mi>o</mi><mi>f</mi><mi>t</mi><mi>m</mi><mi>a</mi><mi>x</mi><mo stretchy="false">(</mo><mfrac><mrow><mi>Q</mi><msup><mi>K</mi><mi>T</mi></msup></mrow><msqrt><msub><mi>d</mi><mi>k</mi></msub></msqrt></mfrac><mo stretchy="false">)</mo><mi>V</mi></mrow><annotation encoding="application/x-tex">Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal">A</span><span class="mord mathnormal">tt</span><span class="mord mathnormal">e</span><span class="mord mathnormal">n</span><span class="mord mathnormal">t</span><span class="mord mathnormal">i</span><span class="mord mathnormal">o</span><span class="mord mathnormal">n</span><span class="mopen">(</span><span class="mord mathnormal">Q</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.0715em">K</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.2222em">V</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:2.4483em;vertical-align:-0.93em"></span><span class="mord mathnormal">so</span><span class="mord mathnormal" style="margin-right:0.1076em">f</span><span class="mord 
mathnormal">t</span><span class="mord mathnormal">ma</span><span class="mord mathnormal">x</span><span class="mopen">(</span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.5183em"><span style="top:-2.2528em"><span class="pstrut" style="height:3em"></span><span class="mord"><span class="mord sqrt"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8572em"><span class="svg-align" style="top:-3em"><span class="pstrut" style="height:3em"></span><span class="mord" style="padding-left:0.833em"><span class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0315em">k</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span><span style="top:-2.8172em"><span class="pstrut" style="height:3em"></span><span class="hide-tail" style="min-width:0.853em;height:1.08em"><svg xmlns="http://www.w3.org/2000/svg" width="400em" height="1.08em" viewBox="0 0 400000 1080" preserveAspectRatio="xMinYMin slice"><path d="M95,702
c-2.7,0,-7.17,-2.7,-13.5,-8c-5.8,-5.3,-9.5,-10,-9.5,-14
c0,-2,0.3,-3.3,1,-4c1.3,-2.7,23.83,-20.7,67.5,-54
c44.2,-33.3,65.8,-50.3,66.5,-51c1.3,-1.3,3,-2,5,-2c4.7,0,8.7,3.3,12,10
s173,378,173,378c0.7,0,35.3,-71,104,-213c68.7,-142,137.5,-285,206.5,-429
c69,-144,104.5,-217.7,106.5,-221
l0 -0
c5.3,-9.3,12,-14,20,-14
H400000v40H845.2724
s-225.272,467,-225.272,467s-235,486,-235,486c-2.7,4.7,-9,7,-19,7
c-6,0,-10,-1,-12,-3s-194,-422,-194,-422s-65,47,-65,47z
M834 80h400000v40h-400000z"></path></svg></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.1828em"><span></span></span></span></span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.677em"><span class="pstrut" style="height:3em"></span><span class="mord"><span class="mord mathnormal">Q</span><span class="mord"><span class="mord mathnormal" style="margin-right:0.0715em">K</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8413em"><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.1389em">T</span></span></span></span></span></span></span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.93em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mclose">)</span><span class="mord mathnormal" style="margin-right:0.2222em">V</span></span></span></span></span>
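<p>The formula can be sketched in plain Python as follows. This is an unbatched, single-head illustration with made-up inputs, not an optimized implementation; the function names <code>softmax</code> and <code>attention</code> are hypothetical, and matrices are represented as lists of row vectors.</p>

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for matrices given as lists of rows."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # One row of Q K^T, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the rows of V.
        outputs.append(
            [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        )
    return outputs


# Made-up example: one query that matches the first key more closely.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out = attention(Q, K, V)
print(out)
```

<p>Since the query aligns with the first key, the first value row receives the larger weight, so the first component of the output is larger than the second.</p>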
<ul>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>Q</mi></mrow><annotation encoding="application/x-tex">Q</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8778em;vertical-align:-0.1944em"></span><span class="mord mathnormal">Q</span></span></span></span> <strong>(Query Matrix):</strong> The "question". A matrix that collects the vectors of the target words.</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>K</mi></mrow><annotation encoding="application/x-tex">K</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord mathnormal" style="margin-right:0.0715em">K</span></span></span></span> <strong>(Key Matrix):</strong> The "location". A matrix that collects the vectors of the words being referenced.</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>V</mi></mrow><annotation encoding="application/x-tex">V</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord mathnormal" style="margin-right:0.2222em">V</span></span></span></span> <strong>(Value Matrix):</strong> The "content". A matrix that collects the actual information vectors of the words being referenced.</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msup><mi>K</mi><mi>T</mi></msup></mrow><annotation encoding="application/x-tex">K^T</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8413em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.0715em">K</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8413em"><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.1389em">T</span></span></span></span></span></span></span></span></span></span></span><strong>:</strong> The transpose of the Key matrix. Its rows and columns are swapped so that the matrix product with the Query matrix is defined.</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>d</mi><mi>k</mi></msub></mrow><annotation encoding="application/x-tex">d_k</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8444em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0315em">k</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span><strong>:</strong> The dimensionality of the Query and Key vectors. 
(The paper uses <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>d</mi><mi>k</mi></msub><mo>=</mo><mn>64</mn></mrow><annotation encoding="application/x-tex">d_k = 64</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8444em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0315em">k</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">64</span></span></span></span>.)</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msqrt><msub><mi>d</mi><mi>k</mi></msub></msqrt></mrow><annotation encoding="application/x-tex">\sqrt{d_k}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.04em;vertical-align:-0.1828em"></span><span class="mord sqrt"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8572em"><span class="svg-align" style="top:-3em"><span class="pstrut" style="height:3em"></span><span class="mord" style="padding-left:0.833em"><span class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0315em">k</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span><span style="top:-2.8172em"><span class="pstrut" style="height:3em"></span><span class="hide-tail" style="min-width:0.853em;height:1.08em"><svg xmlns="http://www.w3.org/2000/svg" width="400em" height="1.08em" viewBox="0 0 400000 1080" preserveAspectRatio="xMinYMin slice"><path d="M95,702
c-2.7,0,-7.17,-2.7,-13.5,-8c-5.8,-5.3,-9.5,-10,-9.5,-14
c0,-2,0.3,-3.3,1,-4c1.3,-2.7,23.83,-20.7,67.5,-54
c44.2,-33.3,65.8,-50.3,66.5,-51c1.3,-1.3,3,-2,5,-2c4.7,0,8.7,3.3,12,10
s173,378,173,378c0.7,0,35.3,-71,104,-213c68.7,-142,137.5,-285,206.5,-429
c69,-144,104.5,-217.7,106.5,-221
l0 -0
c5.3,-9.3,12,-14,20,-14
H400000v40H845.2724
s-225.272,467,-225.272,467s-235,486,-235,486c-2.7,4.7,-9,7,-19,7
c-6,0,-10,-1,-12,-3s-194,-422,-194,-422s-65,47,-65,47z
M834 80h400000v40h-400000z"></path></svg></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.1828em"><span></span></span></span></span></span></span></span></span><strong>:</strong> The square root of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>d</mi><mi>k</mi></msub></mrow><annotation encoding="application/x-tex">d_k</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8444em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0315em">k</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span>. 
(In the paper, <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msqrt><mn>64</mn></msqrt><mo>=</mo><mn>8</mn></mrow><annotation encoding="application/x-tex">\sqrt{64} = 8</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.04em;vertical-align:-0.1328em"></span><span class="mord sqrt"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.9072em"><span class="svg-align" style="top:-3em"><span class="pstrut" style="height:3em"></span><span class="mord" style="padding-left:0.833em"><span class="mord">64</span></span></span><span style="top:-2.8672em"><span class="pstrut" style="height:3em"></span><span class="hide-tail" style="min-width:0.853em;height:1.08em"><svg xmlns="http://www.w3.org/2000/svg" width="400em" height="1.08em" viewBox="0 0 400000 1080" preserveAspectRatio="xMinYMin slice"><path d="M95,702
c-2.7,0,-7.17,-2.7,-13.5,-8c-5.8,-5.3,-9.5,-10,-9.5,-14
c0,-2,0.3,-3.3,1,-4c1.3,-2.7,23.83,-20.7,67.5,-54
c44.2,-33.3,65.8,-50.3,66.5,-51c1.3,-1.3,3,-2,5,-2c4.7,0,8.7,3.3,12,10
s173,378,173,378c0.7,0,35.3,-71,104,-213c68.7,-142,137.5,-285,206.5,-429
c69,-144,104.5,-217.7,106.5,-221
l0 -0
c5.3,-9.3,12,-14,20,-14
H400000v40H845.2724
s-225.272,467,-225.272,467s-235,486,-235,486c-2.7,4.7,-9,7,-19,7
c-6,0,-10,-1,-12,-3s-194,-422,-194,-422s-65,47,-65,47z
M834 80h400000v40h-400000z"></path></svg></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.1328em"><span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">8</span></span></span></span>.)</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>s</mi><mi>o</mi><mi>f</mi><mi>t</mi><mi>m</mi><mi>a</mi><mi>x</mi></mrow><annotation encoding="application/x-tex">softmax</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8889em;vertical-align:-0.1944em"></span><span class="mord mathnormal">so</span><span class="mord mathnormal" style="margin-right:0.1076em">f</span><span class="mord mathnormal">t</span><span class="mord mathnormal">ma</span><span class="mord mathnormal">x</span></span></span></span><strong>:</strong> A function that converts its inputs into probabilities between 0 and 1 whose total is 1. (Formula: <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mfrac><msup><mi>e</mi><msub><mi>x</mi><mi>i</mi></msub></msup><mrow><mo>∑</mo><msup><mi>e</mi><msub><mi>x</mi><mi>j</mi></msub></msup></mrow></mfrac></mrow><annotation encoding="application/x-tex">\frac{e^{x_i}}{\sum e^{x_j}}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.4413em;vertical-align:-0.5303em"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.911em"><span style="top:-2.6447em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mop op-symbol small-op mtight" style="position:relative;top:0em">∑</span><span class="mspace mtight" style="margin-right:0.1952em"></span><span class="mord mtight"><span class="mord mathnormal mtight">e</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.779em"><span style="top:-2.9714em;margin-right:0.0714em"><span class="pstrut" 
style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mord mathnormal mtight">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3448em"><span style="top:-2.3448em;margin-left:0em;margin-right:0.1em"><span class="pstrut" style="height:2.6595em"></span><span class="mord mathnormal mtight" style="margin-right:0.0572em">j</span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.5092em"><span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.394em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mord mathnormal mtight">e</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.7385em"><span style="top:-2.931em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mord mathnormal mtight">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3448em"><span style="top:-2.3448em;margin-left:0em;margin-right:0.1em"><span class="pstrut" style="height:2.6595em"></span><span class="mord mathnormal mtight">i</span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.3147em"><span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span><span 
class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.5303em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span></span></span></span>)</p>
</li>
</ul>
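<p>The softmax definition and the role of the square-root divisor can be checked with a few lines of plain Python. This is only a minimal sketch: the function name <code>softmax</code>, the variable names, and the three example scores are illustrative assumptions, not code or values from the paper.</p>

```python
import math

def softmax(scores):
    """Map a list of real-valued scores to probabilities that sum to 1."""
    # Subtracting the maximum does not change the result mathematically,
    # but it keeps exp() from overflowing when scores are large.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64                 # per-head dimension quoted in the list above
scale = math.sqrt(d_k)   # sqrt(64) = 8

raw_scores = [16.0, 8.0, 0.0]  # made-up dot-product scores (assumption)
weights_raw = softmax(raw_scores)
weights_scaled = softmax([s / scale for s in raw_scores])

print(weights_raw)
print(weights_scaled)
```

<p>With these made-up scores, the unscaled weights put almost all of the probability on the first entry, while the scaled weights come out at roughly 0.67, 0.24, and 0.09, so the other entries still contribute.</p>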
<hr>
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mi>A</mi><mi>t</mi><mi>t</mi><mi>e</mi><mi>n</mi><mi>t</mi><mi>i</mi><mi>o</mi><mi>n</mi><mo stretchy="false">(</mo><mi>Q</mi><mo separator="true">,</mo><mi>K</mi><mo separator="true">,</mo><mi>V</mi><mo stretchy="false">)</mo><mo>=</mo><mi>s</mi><mi>o</mi><mi>f</mi><mi>t</mi><mi>m</mi><mi>a</mi><mi>x</mi><mo stretchy="false">(</mo><mfrac><mrow><mi>Q</mi><msup><mi>K</mi><mi>T</mi></msup></mrow><msqrt><msub><mi>d</mi><mi>k</mi></msub></msqrt></mfrac><mo stretchy="false">)</mo><mi>V</mi></mrow><annotation encoding="application/x-tex">Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal">A</span><span class="mord mathnormal">tt</span><span class="mord mathnormal">e</span><span class="mord mathnormal">n</span><span class="mord mathnormal">t</span><span class="mord mathnormal">i</span><span class="mord mathnormal">o</span><span class="mord mathnormal">n</span><span class="mopen">(</span><span class="mord mathnormal">Q</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.0715em">K</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.2222em">V</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:2.4483em;vertical-align:-0.93em"></span><span class="mord mathnormal">so</span><span class="mord mathnormal" style="margin-right:0.1076em">f</span><span class="mord 
mathnormal">t</span><span class="mord mathnormal">ma</span><span class="mord mathnormal">x</span><span class="mopen">(</span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.5183em"><span style="top:-2.2528em"><span class="pstrut" style="height:3em"></span><span class="mord"><span class="mord sqrt"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8572em"><span class="svg-align" style="top:-3em"><span class="pstrut" style="height:3em"></span><span class="mord" style="padding-left:0.833em"><span class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0315em">k</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span><span style="top:-2.8172em"><span class="pstrut" style="height:3em"></span><span class="hide-tail" style="min-width:0.853em;height:1.08em"><svg xmlns="http://www.w3.org/2000/svg" width="400em" height="1.08em" viewBox="0 0 400000 1080" preserveAspectRatio="xMinYMin slice"><path d="M95,702
c-2.7,0,-7.17,-2.7,-13.5,-8c-5.8,-5.3,-9.5,-10,-9.5,-14
c0,-2,0.3,-3.3,1,-4c1.3,-2.7,23.83,-20.7,67.5,-54
c44.2,-33.3,65.8,-50.3,66.5,-51c1.3,-1.3,3,-2,5,-2c4.7,0,8.7,3.3,12,10
s173,378,173,378c0.7,0,35.3,-71,104,-213c68.7,-142,137.5,-285,206.5,-429
c69,-144,104.5,-217.7,106.5,-221
l0 -0
c5.3,-9.3,12,-14,20,-14
H400000v40H845.2724
s-225.272,467,-225.272,467s-235,486,-235,486c-2.7,4.7,-9,7,-19,7
c-6,0,-10,-1,-12,-3s-194,-422,-194,-422s-65,47,-65,47z
M834 80h400000v40h-400000z"></path></svg></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.1828em"><span></span></span></span></span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.677em"><span class="pstrut" style="height:3em"></span><span class="mord"><span class="mord mathnormal">Q</span><span class="mord"><span class="mord mathnormal" style="margin-right:0.0715em">K</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8413em"><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.1389em">T</span></span></span></span></span></span></span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.93em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mclose">)</span><span class="mord mathnormal" style="margin-right:0.2222em">V</span></span></span></span></span>
<ol>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>Q</mi><msup><mi>K</mi><mi>T</mi></msup></mrow><annotation encoding="application/x-tex">QK^T</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.0358em;vertical-align:-0.1944em"></span><span class="mord mathnormal">Q</span><span class="mord"><span class="mord mathnormal" style="margin-right:0.0715em">K</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8413em"><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.1389em">T</span></span></span></span></span></span></span></span></span></span></span> <strong>(Similarity computation):</strong> Multiply the Query matrix by the transposed Key matrix. This computes the dot product between every Query word vector and every Key word vector in a single matrix multiplication, producing a numeric score for how related (similar) each Query word is to each Key word. A larger value means the two words are more strongly related.</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mfrac><mrow><mi>Q</mi><msup><mi>K</mi><mi>T</mi></msup></mrow><msqrt><msub><mi>d</mi><mi>k</mi></msub></msqrt></mfrac></mrow><annotation encoding="application/x-tex">\frac{QK^T}{\sqrt{d_k}}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.6275em;vertical-align:-0.538em"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.0895em"><span style="top:-2.5864em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord sqrt mtight"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8622em"><span class="svg-align" style="top:-3em"><span class="pstrut" style="height:3em"></span><span class="mord mtight" style="padding-left:0.833em"><span class="mord mtight"><span class="mord mathnormal mtight">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3448em"><span style="top:-2.3488em;margin-left:0em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0315em">k</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.1512em"><span></span></span></span></span></span></span></span></span><span style="top:-2.8222em"><span class="pstrut" style="height:3em"></span><span class="hide-tail mtight" style="min-width:0.853em;height:1.08em"><svg xmlns="http://www.w3.org/2000/svg" width="400em" height="1.08em" viewBox="0 0 400000 1080" preserveAspectRatio="xMinYMin slice"><path d="M95,702
c-2.7,0,-7.17,-2.7,-13.5,-8c-5.8,-5.3,-9.5,-10,-9.5,-14
c0,-2,0.3,-3.3,1,-4c1.3,-2.7,23.83,-20.7,67.5,-54
c44.2,-33.3,65.8,-50.3,66.5,-51c1.3,-1.3,3,-2,5,-2c4.7,0,8.7,3.3,12,10
s173,378,173,378c0.7,0,35.3,-71,104,-213c68.7,-142,137.5,-285,206.5,-429
c69,-144,104.5,-217.7,106.5,-221
l0 -0
c5.3,-9.3,12,-14,20,-14
H400000v40H845.2724
s-225.272,467,-225.272,467s-235,486,-235,486c-2.7,4.7,-9,7,-19,7
c-6,0,-10,-1,-12,-3s-194,-422,-194,-422s-65,47,-65,47z
M834 80h400000v40h-400000z"></path></svg></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.1778em"><span></span></span></span></span></span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.4461em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">Q</span><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.0715em">K</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.9191em"><span style="top:-2.931em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mathnormal mtight" style="margin-right:0.1389em">T</span></span></span></span></span></span></span></span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.538em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span></span></span></span> <strong>(Scaling):</strong> Dot products tend to grow in magnitude as the dimensionality (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>d</mi><mi>k</mi></msub></mrow><annotation encoding="application/x-tex">d_k</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8444em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" 
style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0315em">k</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span>) increases; the paper notes that if the vector components are independent with mean 0 and variance 1, the dot product has variance equal to that dimensionality. When the scores become too large, the Softmax in the next step saturates, its gradient approaches 0, and training stalls. To prevent this, the scores are divided by <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msqrt><msub><mi>d</mi><mi>k</mi></msub></msqrt></mrow><annotation encoding="application/x-tex">\sqrt{d_k}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.04em;vertical-align:-0.1828em"></span><span class="mord sqrt"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8572em"><span class="svg-align" style="top:-3em"><span class="pstrut" style="height:3em"></span><span class="mord" style="padding-left:0.833em"><span class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0315em">k</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span><span style="top:-2.8172em"><span class="pstrut" style="height:3em"></span><span class="hide-tail" style="min-width:0.853em;height:1.08em"><svg xmlns="http://www.w3.org/2000/svg" width="400em" height="1.08em" viewBox="0 0 400000 1080" preserveAspectRatio="xMinYMin slice"><path d="M95,702
c-2.7,0,-7.17,-2.7,-13.5,-8c-5.8,-5.3,-9.5,-10,-9.5,-14
c0,-2,0.3,-3.3,1,-4c1.3,-2.7,23.83,-20.7,67.5,-54
c44.2,-33.3,65.8,-50.3,66.5,-51c1.3,-1.3,3,-2,5,-2c4.7,0,8.7,3.3,12,10
s173,378,173,378c0.7,0,35.3,-71,104,-213c68.7,-142,137.5,-285,206.5,-429
c69,-144,104.5,-217.7,106.5,-221
l0 -0
c5.3,-9.3,12,-14,20,-14
H400000v40H845.2724
s-225.272,467,-225.272,467s-235,486,-235,486c-2.7,4.7,-9,7,-19,7
c-6,0,-10,-1,-12,-3s-194,-422,-194,-422s-65,47,-65,47z
M834 80h400000v40h-400000z"></path></svg></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.1828em"><span></span></span></span></span></span></span></span></span> to keep them in a suitable range (scaling).</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>s</mi><mi>o</mi><mi>f</mi><mi>t</mi><mi>m</mi><mi>a</mi><mi>x</mi><mo stretchy="false">(</mo><mi mathvariant="normal">.</mi><mi mathvariant="normal">.</mi><mi mathvariant="normal">.</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">softmax(...)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal">so</span><span class="mord mathnormal" style="margin-right:0.1076em">f</span><span class="mord mathnormal">t</span><span class="mord mathnormal">ma</span><span class="mord mathnormal">x</span><span class="mopen">(</span><span class="mord">...</span><span class="mclose">)</span></span></span></span> <strong>(Turning scores into weights):</strong> The scaled scores are passed through the Softmax function. This converts the scores for each Query word into probability-like weights between 0 and 1 that sum to 1 across the Key words. For example, a weight of 0.9 means a very strong relation to that word, while 0.01 means the word can be almost ignored.</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>×</mo><mi>V</mi></mrow><annotation encoding="application/x-tex">\times V</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7667em;vertical-align:-0.0833em"></span><span class="mord">×</span><span class="mord mathnormal" style="margin-right:0.2222em">V</span></span></span></span> <strong>(Combining information):</strong> The Softmax weights are multiplied by the Value matrix, which holds the actual information. As a result, more of the information (Value) is taken from highly related words and less from weakly related ones, and the pieces are added together into a single weighted sum. This result is the final output of Attention.</p>
</li>
</ol>
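<p>The four steps above can be sketched in plain Python, using lists of rows in place of matrices. This is a simplified illustration under stated assumptions rather than the paper's implementation: the function names and the toy <code>Q</code>, <code>K</code>, <code>V</code> values are made up, and masking and batching are left out.</p>

```python
import math

def softmax(row):
    """Turn a list of scores into weights that sum to 1."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q: n x d_k, K: m x d_k, V: m x d_v, each given as a list of rows."""
    d_k = len(K[0])
    d_v = len(V[0])
    scale = math.sqrt(d_k)
    output = []
    for q in Q:
        # Steps 1 and 2: dot product of this query with every key, then scale.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        # Step 3: convert the scaled scores into weights.
        weights = softmax(scores)
        # Step 4: weighted sum of the value vectors.
        output.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)]
        )
    return output

# Toy inputs (assumptions): one query that matches the first key.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
result = scaled_dot_product_attention(Q, K, V)
print(result)
```

<p>Because the query is more similar to the first key, the output leans toward the first value vector, while the second value vector still contributes a smaller share.</p>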
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="32-multi-head-attention">3.2 Multi-Head Attention<a href="https://hkimw.github.io/hkimw/ko/blog/attention-is-all-you-need#32-multi-head-attention" class="hash-link" aria-label="3.2 Multi-Head Attention에 대한 직접 링크" title="3.2 Multi-Head Attention에 대한 직접 링크" translate="no">​</a></h3>
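<p>Rather than running the single Attention above only once, the Transformer linearly projects Q, K, and V into several lower-dimensional pieces and runs Attention on each piece in parallel. This is called Multi-Head Attention.</p>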
<!-- -->
<div style="padding:1.5rem;background:#FBF8F3;border-radius:4px;border:1px solid #e0e0e0;margin:1rem 0"><svg viewBox="0 0 800 500" width="100%" height="100%" style="font-family:var(--ifm-font-family-monospace)"><defs><marker id="arrow" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="6" markerHeight="6" orient="auto-start-reverse"><path d="M 0 0 L 10 5 L 0 10 z" fill="#333"></path></marker></defs><rect x="300" y="20" width="200" height="40" fill="#e8eaf6" stroke="#3f51b5" rx="4"></rect><text x="400" y="35" text-anchor="middle" font-size="12" fill="#333" font-weight="bold">Input Q, K, V</text><text x="400" y="50" text-anchor="middle" font-size="10" fill="#666">(d_model=512)</text><path d="M 400 60 L 400 90" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><rect x="250" y="90" width="300" height="40" fill="#e0f2f1" stroke="#00695c" rx="4"></rect><text x="400" y="105" text-anchor="middle" font-size="12" fill="#333" font-weight="bold">Linear Projections &amp; Split</text><text x="400" y="120" text-anchor="middle" font-size="10" fill="#666">(h=8 heads)</text><path d="M 300 130 L 150 160" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><path d="M 350 130 L 300 160" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><path d="M 450 130 L 500 160" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><path d="M 500 130 L 650 160" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><g transform="translate(100, 160)"><rect x="0" y="0" width="100" height="40" fill="#ffe0b2" stroke="#ef6c00" rx="4"></rect><text x="50" y="15" text-anchor="middle" font-size="12" fill="#333" font-weight="bold">Head 1</text><text x="50" y="30" text-anchor="middle" font-size="10" fill="#666">(d_k=64)</text><path d="M 50 40 L 50 70" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><rect x="0" y="70" width="100" height="40" fill="#ffe0b2" stroke="#ef6c00" rx="4"></rect><text x="50" y="85" text-anchor="middle" 
font-size="10" fill="#333">Scaled Dot-</text><text x="50" y="100" text-anchor="middle" font-size="10" fill="#333">Product Attn</text><path d="M 50 110 L 100 180" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path></g><g transform="translate(250, 160)"><rect x="0" y="0" width="100" height="40" fill="#ffe0b2" stroke="#ef6c00" rx="4"></rect><text x="50" y="15" text-anchor="middle" font-size="12" fill="#333" font-weight="bold">Head 2</text><text x="50" y="30" text-anchor="middle" font-size="10" fill="#666">(d_k=64)</text><path d="M 50 40 L 50 70" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><rect x="0" y="70" width="100" height="40" fill="#ffe0b2" stroke="#ef6c00" rx="4"></rect><text x="50" y="85" text-anchor="middle" font-size="10" fill="#333">Scaled Dot-</text><text x="50" y="100" text-anchor="middle" font-size="10" fill="#333">Product Attn</text><path d="M 50 110 L 100 180" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path></g><text x="400" y="220" text-anchor="middle" font-size="24" fill="#666" letter-spacing="5">...</text><g transform="translate(600, 160)"><rect x="0" y="0" width="100" height="40" fill="#ffe0b2" stroke="#ef6c00" rx="4"></rect><text x="50" y="15" text-anchor="middle" font-size="12" fill="#333" font-weight="bold">Head 8</text><text x="50" y="30" text-anchor="middle" font-size="10" fill="#666">(d_k=64)</text><path d="M 50 40 L 50 70" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><rect x="0" y="70" width="100" height="40" fill="#ffe0b2" stroke="#ef6c00" rx="4"></rect><text x="50" y="85" text-anchor="middle" font-size="10" fill="#333">Scaled Dot-</text><text x="50" y="100" text-anchor="middle" font-size="10" fill="#333">Product Attn</text><path d="M 50 110 L -100 180" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path></g><g transform="translate(250, 340)"><rect x="0" y="0" width="300" height="40" fill="#e0f2f1" stroke="#00695c" rx="4"></rect><text x="150" y="15" 
text-anchor="middle" font-size="12" fill="#333" font-weight="bold">Concatenate</text><text x="150" y="30" text-anchor="middle" font-size="10" fill="#666">(8 × 64 = 512 dim)</text><path d="M 150 40 L 150 70" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><rect x="50" y="70" width="200" height="40" fill="#e0f2f1" stroke="#00695c" rx="4"></rect><text x="150" y="85" text-anchor="middle" font-size="12" fill="#333" font-weight="bold">Final Linear Projection</text><path d="M 150 110 L 150 140" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><rect x="50" y="140" width="200" height="40" fill="#c8e6c9" stroke="#2e7d32" rx="4"></rect><text x="150" y="165" text-anchor="middle" font-size="12" fill="#333" font-weight="bold">Multi-Head Attention Output</text></g></svg></div>
<p>The paper splits the <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>d</mi><mrow><mi>m</mi><mi>o</mi><mi>d</mi><mi>e</mi><mi>l</mi></mrow></msub><mo>=</mo><mn>512</mn></mrow><annotation encoding="application/x-tex">d_{model} = 512</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8444em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">m</span><span class="mord mathnormal mtight">o</span><span class="mord mathnormal mtight">d</span><span class="mord mathnormal mtight">e</span><span class="mord mathnormal mtight" style="margin-right:0.0197em">l</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">512</span></span></span></span> dimensions into <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>h</mi><mo>=</mo><mn>8</mn></mrow><annotation encoding="application/x-tex">h = 8</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6944em"></span><span class="mord mathnormal">h</span><span class="mspace" style="margin-right:0.2778em"></span><span 
class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">8</span></span></span></span> heads. Each head therefore works with vectors of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>d</mi><mi>k</mi></msub><mo>=</mo><msub><mi>d</mi><mi>v</mi></msub><mo>=</mo><mn>512</mn><mi mathvariant="normal">/</mi><mn>8</mn><mo>=</mo><mn>64</mn></mrow><annotation encoding="application/x-tex">d_k = d_v = 512 / 8 = 64</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8444em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0315em">k</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.8444em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" 
style="margin-right:0.0359em">v</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord">512/8</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">64</span></span></span></span> dimensions.</p>
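<p>The per-head size can be checked with a few lines of Python. This is only a sketch: the names <code>D_MODEL</code>, <code>NUM_HEADS</code> and <code>split_heads</code> are made up for illustration, and slicing one already projected vector is an assumption about the layout (many implementations do it this way, while the paper describes a separate learned projection per head).</p>

```python
# Hypothetical constants taken from the paper's base configuration.
D_MODEL = 512
NUM_HEADS = 8
D_HEAD = D_MODEL // NUM_HEADS  # d_k = d_v = 64


def split_heads(vector, num_heads=NUM_HEADS):
    """Slice one projected d_model-sized vector into num_heads equal chunks."""
    if len(vector) % num_heads != 0:
        raise ValueError("vector length must be divisible by num_heads")
    size = len(vector) // num_heads
    return [vector[h * size:(h + 1) * size] for h in range(num_heads)]


projected = [float(i) for i in range(D_MODEL)]  # stand-in for one projected token
heads = split_heads(projected)
print(len(heads), len(heads[0]))  # 8 64
```

<p>Each of the 8 chunks is then handled by its own attention computation.</p>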
<p><strong>Why use multiple heads?</strong></p>
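<p>The relationships between words in a sentence can be read from several angles.
For example, in the sentence "He kicked the ball hard", the word 'kicked' can be linked to 'he' (the subject: who did it?) and also to 'the ball' (the object: what was kicked?).
A single attention head blends these relationships into one averaged view, whereas splitting into 8 heads lets each head focus on a different relationship, such as the subject, the object, or the tense, so that contextual features from different representation subspaces are captured at the same time.</p>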
<p>The 8 result matrices computed by the individual heads are concatenated into a single matrix, which is then multiplied by a linear projection matrix to give the final output matrix.</p>
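<p>The concatenate-then-project step can be sketched as follows for a single token. The sizes are toy values, the weights are random stand-ins, and the helper names <code>concat_heads</code>, <code>linear</code> and <code>w_o</code> are assumptions made for this example rather than anything defined in the paper.</p>

```python
import random


def concat_heads(head_outputs):
    """Join per-head vectors end to end into one vector."""
    joined = []
    for head in head_outputs:
        joined.extend(head)
    return joined


def linear(vector, weight):
    """Row vector times matrix; weight has len(vector) rows."""
    cols = len(weight[0])
    return [sum(vector[r] * weight[r][c] for r in range(len(vector)))
            for c in range(cols)]


random.seed(0)
NUM_HEADS, D_HEAD = 8, 4          # toy sizes; the paper uses 8 and 64
D_MODEL = NUM_HEADS * D_HEAD

# Stand-ins for the eight attention results of a single token.
head_outputs = [[random.uniform(-1, 1) for _ in range(D_HEAD)]
                for _ in range(NUM_HEADS)]
# Hypothetical output projection with shape (D_MODEL, D_MODEL).
w_o = [[random.uniform(-0.1, 0.1) for _ in range(D_MODEL)]
       for _ in range(D_MODEL)]

concatenated = concat_heads(head_outputs)
output = linear(concatenated, w_o)
print(len(concatenated), len(output))  # 32 32
```

<p>With the paper's sizes the concatenated vector would have 8 × 64 = 512 entries, so the output keeps the model width.</p>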
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="4-position-wise-feed-forward-network">4. Position-wise Feed-Forward Network<a href="https://hkimw.github.io/hkimw/ko/blog/attention-is-all-you-need#4-position-wise-feed-forward-network" class="hash-link" aria-label="4. Position-wise Feed-Forward Network에 대한 직접 링크" title="4. Position-wise Feed-Forward Network에 대한 직접 링크" translate="no">​</a></h2>
<p>After the attention sub-layer, the data passes through the fully connected feed-forward network (FFN) that every layer contains.</p>
<!-- -->
<p>"Position-wise" means that the same neural network is applied separately and independently at each word position in the sentence.</p>
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mi>F</mi><mi>F</mi><mi>N</mi><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo><mo>=</mo><mi>max</mi><mo>⁡</mo><mo stretchy="false">(</mo><mn>0</mn><mo separator="true">,</mo><mi>x</mi><msub><mi>W</mi><mn>1</mn></msub><mo>+</mo><msub><mi>b</mi><mn>1</mn></msub><mo stretchy="false">)</mo><msub><mi>W</mi><mn>2</mn></msub><mo>+</mo><msub><mi>b</mi><mn>2</mn></msub></mrow><annotation encoding="application/x-tex">FFN(x) = \max(0, xW_1 + b_1)W_2 + b_2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.1389em">F</span><span class="mord mathnormal" style="margin-right:0.1389em">F</span><span class="mord mathnormal" style="margin-right:0.109em">N</span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mop">max</span><span class="mopen">(</span><span class="mord">0</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal">x</span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1389em">W</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1389em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span 
class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord"><span class="mord mathnormal">b</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mclose">)</span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1389em">W</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1389em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:0.8444em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">b</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span 
style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span></span>
<ul>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>x</mi></mrow><annotation encoding="application/x-tex">x</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em"></span><span class="mord mathnormal">x</span></span></span></span><strong>:</strong> The input vector coming out of the attention layer. Its dimension is <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>d</mi><mrow><mi>m</mi><mi>o</mi><mi>d</mi><mi>e</mi><mi>l</mi></mrow></msub><mo>=</mo><mn>512</mn></mrow><annotation encoding="application/x-tex">d_{model} = 512</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8444em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">m</span><span class="mord mathnormal mtight">o</span><span class="mord mathnormal mtight">d</span><span class="mord mathnormal mtight">e</span><span class="mord mathnormal mtight" style="margin-right:0.0197em">l</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">512</span></span></span></span>.</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>W</mi><mn>1</mn></msub><mo separator="true">,</mo><msub><mi>b</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">W_1, b_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8889em;vertical-align:-0.1944em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1389em">W</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1389em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord"><span class="mord mathnormal">b</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span><strong>:</strong> The weight matrix and bias vector of the first linear transformation.</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>max</mi><mo>⁡</mo><mo stretchy="false">(</mo><mn>0</mn><mo separator="true">,</mo><mi mathvariant="normal">.</mi><mi mathvariant="normal">.</mi><mi mathvariant="normal">.</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\max(0, ...)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mop">max</span><span class="mopen">(</span><span class="mord">0</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord">...</span><span class="mclose">)</span></span></span></span><strong>:</strong> The ReLU (Rectified Linear Unit) activation function, applied element by element. A value below 0 inside the parentheses becomes 0, and a value above 0 is kept as it is. This is the key element that gives the network its non-linearity.</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>W</mi><mn>2</mn></msub><mo separator="true">,</mo><msub><mi>b</mi><mn>2</mn></msub></mrow><annotation encoding="application/x-tex">W_2, b_2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8889em;vertical-align:-0.1944em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1389em">W</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1389em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord"><span class="mord mathnormal">b</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span><strong>:</strong> The weight matrix and bias vector of the second linear transformation.</p>
</li>
</ul>
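<p>The formula and the "position-wise" property can be sketched in plain Python. The sizes are toy values and the weights are random, so this only illustrates the structure; the helper names <code>relu</code>, <code>affine</code>, <code>ffn</code> and <code>rand_matrix</code> are assumptions for this example.</p>

```python
import random


def relu(values):
    """max(0, v) applied element by element."""
    return [max(0.0, v) for v in values]


def affine(vector, weight, bias):
    """Compute vector @ weight + bias for one row vector."""
    return [sum(vector[r] * weight[r][c] for r in range(len(vector))) + bias[c]
            for c in range(len(bias))]


def ffn(x, w1, b1, w2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single position."""
    return affine(relu(affine(x, w1, b1)), w2, b2)


def rand_matrix(rows, cols):
    """Random stand-in for a learned weight matrix."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
D_MODEL, D_FF = 4, 16             # toy sizes; the paper uses 512 and 2048
w1, b1 = rand_matrix(D_MODEL, D_FF), [0.0] * D_FF
w2, b2 = rand_matrix(D_FF, D_MODEL), [0.0] * D_MODEL

sentence = [[random.uniform(-1, 1) for _ in range(D_MODEL)] for _ in range(3)]
# Position-wise: the same weights are applied to every position separately.
outputs = [ffn(token, w1, b1, w2, b2) for token in sentence]

# Reordering the positions only reorders the outputs.
swapped = [ffn(token, w1, b1, w2, b2) for token in reversed(sentence)]
print(swapped == list(reversed(outputs)))  # True
```

<p>Because no position looks at any other inside the FFN, all positions can be processed in parallel as one matrix multiplication.</p>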
<p>This network has a sandwich structure: a wide hidden layer sits between a narrower input and output.</p>
<ol>
<li class="">
<p><strong>Expansion:</strong> The input vector <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>x</mi></mrow><annotation encoding="application/x-tex">x</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em"></span><span class="mord mathnormal">x</span></span></span></span> (512 dimensions) is multiplied by the weight <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>W</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">W_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8333em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1389em">W</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1389em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span>, which expands it to <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>d</mi><mrow><mi>f</mi><mi>f</mi></mrow></msub><mo>=</mo><mn>2048</mn></mrow><annotation encoding="application/x-tex">d_{ff} = 2048</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9805em;vertical-align:-0.2861em"></span><span class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" 
style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.1076em">f</span><span class="mord mathnormal mtight" style="margin-right:0.1076em">f</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">2048</span></span></span></span> dimensions.</p>
</li>
<li class="">
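<p><strong>Activation:</strong> In the expanded space, the ReLU function extracts non-linear features of the data. Every negative component is set to 0 in the process, so only the positively activated features are passed on.</p>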
</li>
<li class="">
<p><strong>Compression:</strong> The result is then multiplied by the weight <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>W</mi><mn>2</mn></msub></mrow><annotation encoding="application/x-tex">W_2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8333em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1389em">W</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1389em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span>, which compresses it back to the original <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>d</mi><mrow><mi>m</mi><mi>o</mi><mi>d</mi><mi>e</mi><mi>l</mi></mrow></msub><mo>=</mo><mn>512</mn></mrow><annotation encoding="application/x-tex">d_{model} = 512</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8444em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">m</span><span class="mord mathnormal mtight">o</span><span class="mord mathnormal mtight">d</span><span class="mord 
mathnormal mtight">e</span><span class="mord mathnormal mtight" style="margin-right:0.0197em">l</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">512</span></span></span></span> dimensions for output.</p>
</li>
</ol>
<p>If attention is the step that gathers the 'relationships' between words, the FFN layer takes the gathered information and refines the 'meaning' carried by each word on its own into a richer representation. A large share of the learnable parameters (weights) also sits in the FFN's <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>W</mi><mn>1</mn></msub><mo separator="true">,</mo><msub><mi>W</mi><mn>2</mn></msub></mrow><annotation encoding="application/x-tex">W_1, W_2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8778em;vertical-align:-0.1944em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1389em">W</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1389em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1389em">W</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1389em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span> matrices: with the sizes above they hold about 2.1M weights per layer (2 × 512 × 2048), roughly twice the 1.05M (4 × 512 × 512) of one attention sub-layer's projection matrices.</p>
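<p>As a rough check of where the weights of one encoder layer sit, the counts can be computed directly. This sketch assumes the base sizes quoted in this post, counts the four attention projection matrices without biases, and ignores layer normalization and embeddings; the variable names are made up for illustration.</p>

```python
D_MODEL = 512   # model width from the paper's base configuration
D_FF = 2048     # inner FFN width from the paper's base configuration

# FFN: W1 (D_MODEL x D_FF) + b1 + W2 (D_FF x D_MODEL) + b2.
ffn_params = D_MODEL * D_FF + D_FF + D_FF * D_MODEL + D_MODEL

# Multi-head attention: query, key, value and output projections,
# each D_MODEL x D_MODEL (biases left out in this count).
attention_params = 4 * D_MODEL * D_MODEL

share = ffn_params / (ffn_params + attention_params)
print(ffn_params, attention_params, round(share, 3))  # 2099712 1048576 0.667
```

<p>Under these assumptions the FFN holds about two thirds of an encoder layer's weights. A decoder layer has a second attention sub-layer, so the share there is closer to one half.</p>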
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="5-positional-encoding">5. Positional Encoding<a href="https://hkimw.github.io/hkimw/ko/blog/attention-is-all-you-need#5-positional-encoding" class="hash-link" aria-label="5. Positional Encoding에 대한 직접 링크" title="5. Positional Encoding에 대한 직접 링크" translate="no">​</a></h2>
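<p>The Transformer drops the RNN structure in favor of parallel processing through matrix multiplication. This brings a serious drawback, however. The attention operation treats the set of words like an unordered 'bag of words', so on its own it cannot tell "I eat rice" from "rice I eat": reordering the input words merely reorders the outputs in the same way.</p>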
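<p>The order-blindness of attention can be seen in a small experiment. The sketch below uses scaled dot-product self-attention with the simplifying assumption that queries, keys and values are the raw embeddings (no learned projections), and the three 2-dimensional embeddings are invented for illustration.</p>

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with Q = K = V = tokens."""
    d_k = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
                  for key in tokens]
        weights = softmax(scores)
        outputs.append([sum(w * value[j] for w, value in zip(weights, tokens))
                        for j in range(d_k)])
    return outputs


# Hypothetical 2-dimensional embeddings for the three words.
embeddings = {"I": [1.0, 0.0], "rice": [0.0, 1.0], "eat": [1.0, 1.0]}
order_a = ["I", "rice", "eat"]
order_b = ["rice", "I", "eat"]

out_a = dict(zip(order_a, self_attention([embeddings[w] for w in order_a])))
out_b = dict(zip(order_b, self_attention([embeddings[w] for w in order_b])))

# Each word gets the same vector in both orders, up to rounding.
same = all(math.isclose(x, y, abs_tol=1e-12)
           for w in order_a for x, y in zip(out_a[w], out_b[w]))
print(same)  # True
```

<p>Since every word ends up with the same output vector regardless of where it stands, the order has to be injected into the input some other way.</p>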
<!-- -->
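<p>To solve this, a vector carrying position information is added to the embedding vector of each input word, so that the model can know the relative or absolute position (order) of the words in the sequence. This step is called <strong>Positional Encoding</strong>.</p>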
<p>To generate the position information, the paper uses sine and cosine functions of different frequencies.</p>
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mi>P</mi><msub><mi>E</mi><mrow><mo stretchy="false">(</mo><mi>p</mi><mi>o</mi><mi>s</mi><mo separator="true">,</mo><mn>2</mn><mi>i</mi><mo stretchy="false">)</mo></mrow></msub><mo>=</mo><mi>sin</mi><mo>⁡</mo><mo stretchy="false">(</mo><mi>p</mi><mi>o</mi><mi>s</mi><mi mathvariant="normal">/</mi><msup><mn>10000</mn><mrow><mn>2</mn><mi>i</mi><mi mathvariant="normal">/</mi><msub><mi>d</mi><mrow><mi>m</mi><mi>o</mi><mi>d</mi><mi>e</mi><mi>l</mi></mrow></msub></mrow></msup><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.0385em;vertical-align:-0.3552em"></span><span class="mord mathnormal" style="margin-right:0.1389em">P</span><span class="mord"><span class="mord mathnormal" style="margin-right:0.0576em">E</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3448em"><span style="top:-2.5198em;margin-left:-0.0576em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mopen mtight">(</span><span class="mord mathnormal mtight">p</span><span class="mord mathnormal mtight">os</span><span class="mpunct mtight">,</span><span class="mord mtight">2</span><span class="mord mathnormal mtight">i</span><span class="mclose mtight">)</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.3552em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" 
style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1.188em;vertical-align:-0.25em"></span><span class="mop">sin</span><span class="mopen">(</span><span class="mord mathnormal">p</span><span class="mord mathnormal">os</span><span class="mord">/1000</span><span class="mord"><span class="mord">0</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.938em"><span style="top:-3.113em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">2</span><span class="mord mathnormal mtight">i</span><span class="mord mtight">/</span><span class="mord mtight"><span class="mord mathnormal mtight">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3448em"><span style="top:-2.3488em;margin-left:0em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">m</span><span class="mord mathnormal mtight">o</span><span class="mord mathnormal mtight">d</span><span class="mord mathnormal mtight">e</span><span class="mord mathnormal mtight" style="margin-right:0.0197em">l</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.1512em"><span></span></span></span></span></span></span></span></span></span></span></span></span></span></span><span class="mclose">)</span></span></span></span></span>
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mi>P</mi><msub><mi>E</mi><mrow><mo stretchy="false">(</mo><mi>p</mi><mi>o</mi><mi>s</mi><mo separator="true">,</mo><mn>2</mn><mi>i</mi><mo>+</mo><mn>1</mn><mo stretchy="false">)</mo></mrow></msub><mo>=</mo><mi>cos</mi><mo>⁡</mo><mo stretchy="false">(</mo><mi>p</mi><mi>o</mi><mi>s</mi><mi mathvariant="normal">/</mi><msup><mn>10000</mn><mrow><mn>2</mn><mi>i</mi><mi mathvariant="normal">/</mi><msub><mi>d</mi><mrow><mi>m</mi><mi>o</mi><mi>d</mi><mi>e</mi><mi>l</mi></mrow></msub></mrow></msup><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.0385em;vertical-align:-0.3552em"></span><span class="mord mathnormal" style="margin-right:0.1389em">P</span><span class="mord"><span class="mord mathnormal" style="margin-right:0.0576em">E</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3448em"><span style="top:-2.5198em;margin-left:-0.0576em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mopen mtight">(</span><span class="mord mathnormal mtight">p</span><span class="mord mathnormal mtight">os</span><span class="mpunct mtight">,</span><span class="mord mtight">2</span><span class="mord mathnormal mtight">i</span><span class="mbin mtight">+</span><span class="mord mtight">1</span><span class="mclose mtight">)</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.3552em"><span></span></span></span></span></span></span><span class="mspace" 
style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1.188em;vertical-align:-0.25em"></span><span class="mop">cos</span><span class="mopen">(</span><span class="mord mathnormal">p</span><span class="mord mathnormal">os</span><span class="mord">/1000</span><span class="mord"><span class="mord">0</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.938em"><span style="top:-3.113em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">2</span><span class="mord mathnormal mtight">i</span><span class="mord mtight">/</span><span class="mord mtight"><span class="mord mathnormal mtight">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3448em"><span style="top:-2.3488em;margin-left:0em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">m</span><span class="mord mathnormal mtight">o</span><span class="mord mathnormal mtight">d</span><span class="mord mathnormal mtight">e</span><span class="mord mathnormal mtight" style="margin-right:0.0197em">l</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.1512em"><span></span></span></span></span></span></span></span></span></span></span></span></span></span></span><span class="mclose">)</span></span></span></span></span>
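<p>The two formulas above can be sketched in plain Python. The function name <code>positional_encoding</code> and the toy sizes are assumptions for this example, and the code assumes an even <code>d_model</code>.</p>

```python
import math


def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position vectors."""
    table = []
    for pos in range(max_len):
        row = [0.0] * d_model
        for i in range(d_model // 2):
            angle = pos / (10000 ** (2 * i / d_model))
            row[2 * i] = math.sin(angle)       # even index 2i
            row[2 * i + 1] = math.cos(angle)   # odd index 2i + 1
        table.append(row)
    return table


pe = positional_encoding(max_len=4, d_model=8)
print(pe[0][:4])  # [0.0, 1.0, 0.0, 1.0]

# The encoding is added element-wise to the word embedding at each position.
embedding = [0.5] * 8                      # stand-in word embedding
encoded = [e + p for e, p in zip(embedding, pe[1])]
```

<p>Position 0 always yields alternating 0 and 1 because sin(0) = 0 and cos(0) = 1; later positions differ in every sine/cosine pair.</p>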
<ul>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>p</mi><mi>o</mi><mi>s</mi></mrow><annotation encoding="application/x-tex">pos</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em"></span><span class="mord mathnormal">p</span><span class="mord mathnormal">os</span></span></span></span><strong>:</strong> The position index of the word within the sentence (for example, 0 for the first word and 1 for the second).</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>i</mi></mrow><annotation encoding="application/x-tex">i</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6595em"></span><span class="mord mathnormal">i</span></span></span></span><strong>:</strong> The index of a sine/cosine pair along the embedding dimension, rather than the index of a single component.<br>
<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>i</mi></mrow><annotation encoding="application/x-tex">i</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6595em"></span><span class="mord mathnormal">i</span></span></span></span> ranges from <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0</mn></mrow><annotation encoding="application/x-tex">0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0</span></span></span></span> to <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>d</mi><mrow><mi>m</mi><mi>o</mi><mi>d</mi><mi>e</mi><mi>l</mi></mrow></msub><mi mathvariant="normal">/</mi><mn>2</mn><mo>−</mo><mn>1</mn></mrow><annotation encoding="application/x-tex">d_{model}/2 - 1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">m</span><span class="mord mathnormal mtight">o</span><span class="mord mathnormal mtight">d</span><span class="mord mathnormal mtight">e</span><span class="mord mathnormal mtight" style="margin-right:0.0197em">l</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" 
style="height:0.15em"><span></span></span></span></span></span></span><span class="mord">/2</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">1</span></span></span></span>, so the even index (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>2</mn><mi>i</mi></mrow><annotation encoding="application/x-tex">2i</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6595em"></span><span class="mord">2</span><span class="mord mathnormal">i</span></span></span></span>) and the odd index (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>2</mn><mi>i</mi><mo>+</mo><mn>1</mn></mrow><annotation encoding="application/x-tex">2i+1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7429em;vertical-align:-0.0833em"></span><span class="mord">2</span><span class="mord mathnormal">i</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">1</span></span></span></span>) of each pair are assigned different trigonometric functions.</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>2</mn><mi>i</mi></mrow><annotation encoding="application/x-tex">2i</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6595em"></span><span class="mord">2</span><span class="mord mathnormal">i</span></span></span></span>, <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>2</mn><mi>i</mi><mo>+</mo><mn>1</mn></mrow><annotation encoding="application/x-tex">2i+1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7429em;vertical-align:-0.0833em"></span><span class="mord">2</span><span class="mord mathnormal">i</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">1</span></span></span></span><strong>:</strong> When the vector index is even (2i), the sine (sin) function is used; when it is odd (2i+1), the cosine (cos) function is used.</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>d</mi><mrow><mi>m</mi><mi>o</mi><mi>d</mi><mi>e</mi><mi>l</mi></mrow></msub></mrow><annotation encoding="application/x-tex">d_{model}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8444em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">m</span><span class="mord mathnormal mtight">o</span><span class="mord mathnormal mtight">d</span><span class="mord mathnormal mtight">e</span><span class="mord mathnormal mtight" style="margin-right:0.0197em">l</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span><strong>:</strong> The total number of dimensions of the Embedding vector (512).</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msup><mn>10000</mn><mrow><mn>2</mn><mi>i</mi><mi mathvariant="normal">/</mi><msub><mi>d</mi><mrow><mi>m</mi><mi>o</mi><mi>d</mi><mi>e</mi><mi>l</mi></mrow></msub></mrow></msup></mrow><annotation encoding="application/x-tex">10000^{2i/d_{model}}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.888em"></span><span class="mord">1000</span><span class="mord"><span class="mord">0</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.888em"><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">2</span><span class="mord mathnormal mtight">i</span><span class="mord mtight">/</span><span class="mord mtight"><span class="mord mathnormal mtight">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3448em"><span style="top:-2.3488em;margin-left:0em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">m</span><span class="mord mathnormal mtight">o</span><span class="mord mathnormal mtight">d</span><span class="mord mathnormal mtight">e</span><span class="mord mathnormal mtight" style="margin-right:0.0197em">l</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.1512em"><span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span><strong>:</strong> The denominator term that determines the frequency.
As the index <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>i</mi></mrow><annotation encoding="application/x-tex">i</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6595em"></span><span class="mord mathnormal">i</span></span></span></span> grows, the denominator grows, so the frequency drops and the wave varies more slowly from one position to the next.</p>
</li>
</ul>
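<p>With this formula, every position (pos) in the sentence and every dimension (i) of the vector receives a continuous real value that follows its own pattern. Because trigonometric functions are used, the values of the position vector oscillate as regular waves between -1 and 1.</p>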
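<p>As a rough illustration of the formula above, the following Python sketch computes the position vector for a single position. The function name <code>positional_encoding</code> and its parameters are assumptions made for this example; it follows the sin/cos pairing and the base of 10000 described in the list above, not any particular library's implementation.</p>

```python
import math

def positional_encoding(pos, d_model=512, base=10000.0):
    """Return the sinusoidal position vector for one position as a list of floats.

    Hypothetical helper for illustration; assumes d_model is even.
    """
    vector = [0.0] * d_model
    for i in range(d_model // 2):  # i = 0 .. d_model/2 - 1
        angle = pos / (base ** (2 * i / d_model))
        vector[2 * i] = math.sin(angle)      # even index 2i uses sin
        vector[2 * i + 1] = math.cos(angle)  # odd index 2i+1 uses cos
    return vector

pe0 = positional_encoding(0)
pe5 = positional_encoding(5)
print(pe0[:4])             # [0.0, 1.0, 0.0, 1.0]
print(min(pe5), max(pe5))  # both values lie between -1 and 1
```

<p>Position 0 yields alternating 0 and 1 because sin(0) = 0 and cos(0) = 1, and the values at any other position stay between -1 and 1.</p>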
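<p>The 512-dimensional 'position vector' generated by this mathematical rule is simply added (+) to each word's original 'Embedding vector' just before the data enters the first layer of the Encoder or Decoder. As a result, while training, the model can pick up not only the meaning of each word but also, by tracing these trigonometric wave patterns back, the order of the words, such as "this word is near the front of the sentence" or "that word sits in the very next position", that is, their relative position.</p>]]></content:encoded>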
            <category>논문</category>
            <category>transformer</category>
            <category>nlp</category>
            <category>딥러닝</category>
        </item>
        <item>
            <title><![CDATA[[Paper] GPT-1 Key Summary]]></title>
            <link>https://hkimw.github.io/hkimw/ko/blog/gpt-1</link>
            <guid>https://hkimw.github.io/hkimw/ko/blog/gpt-1</guid>
            <pubDate>Fri, 17 Apr 2026 00:00:00 GMT</pubDate>
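            <description><![CDATA[This note summarizes the architecture and training process of the GPT-1 paper, combining mathematical definitions with intuitive explanations.]]></description>
            <content:encoded><![CDATA[<p>This note summarizes the architecture and training process of the GPT-1 paper, combining mathematical definitions with intuitive explanations.</p>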
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="1-언어-모델의-핵심-기초-개념">1. Core Concepts of Language Models<a href="https://hkimw.github.io/hkimw/ko/blog/gpt-1#1-%EC%96%B8%EC%96%B4-%EB%AA%A8%EB%8D%B8%EC%9D%98-%ED%95%B5%EC%8B%AC-%EA%B8%B0%EC%B4%88-%EA%B0%9C%EB%85%90" class="hash-link" aria-label="Direct link to 1. Core Concepts of Language Models" title="Direct link to 1. Core Concepts of Language Models" translate="no">​</a></h2>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="1-context-window">1) Context Window<a href="https://hkimw.github.io/hkimw/ko/blog/gpt-1#1-context-window" class="hash-link" aria-label="Direct link to 1) Context Window" title="Direct link to 1) Context Window" translate="no">​</a></h3>
<ul>
<li class="">
<p><strong>Definition</strong>: The <strong>maximum number of words (tokens)</strong> the model can process at once, that is, the sequence length <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>k</mi></mrow><annotation encoding="application/x-tex">k</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6944em"></span><span class="mord mathnormal" style="margin-right:0.0315em">k</span></span></span></span>. The computational complexity of the Transformer's Self-Attention with respect to this length is <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>O</mi><mo stretchy="false">(</mo><msup><mi>k</mi><mn>2</mn></msup><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">O(k^2)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.0641em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.0278em">O</span><span class="mopen">(</span><span class="mord"><span class="mord mathnormal" style="margin-right:0.0315em">k</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8141em"><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span><span class="mclose">)</span></span></span></span>.</p>
</li>
<li class="">
<p><strong>Intuition</strong>:</p>
<ul>
<li class="">
<p><strong>Advantage</strong>: The larger the Context Window (the value of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>k</mi></mrow><annotation encoding="application/x-tex">k</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6944em"></span><span class="mord mathnormal" style="margin-right:0.0315em">k</span></span></span></span>), the further back the model can look at earlier words. With more hints available, it can grasp the context more precisely, and its next-word predictions tend to become more accurate.</p>
</li>
<li class="">
<p><strong>Disadvantage</strong>: The Transformer must compute the relationship (Attention) between every pair of words. So if the context window becomes 10 times longer, the attention computation grows with the square, to about 100 times. In other words, increasing <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>k</mi></mrow><annotation encoding="application/x-tex">k</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6944em"></span><span class="mord mathnormal" style="margin-right:0.0315em">k</span></span></span></span> runs directly into the limits of hardware memory and training cost, which makes it a practical barrier.</p>
</li>
</ul>
</li>
</ul>
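<p>The quadratic growth can be checked with simple arithmetic. The sketch below counts the token pairs that full Self-Attention has to score; <code>attention_pair_count</code> is a hypothetical helper, and 512 is only an illustrative window size.</p>

```python
def attention_pair_count(k):
    """Number of (query, key) token pairs that full self-attention scores for k tokens."""
    return k * k

base_window = 512                 # illustrative window size, chosen for this example
longer_window = 10 * base_window  # a window 10 times longer

ratio = attention_pair_count(longer_window) / attention_pair_count(base_window)
print(ratio)  # 100.0
```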
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="2-maximize-likelihood-최대-우도-추정">2) Maximize Likelihood (Maximum Likelihood Estimation)<a href="https://hkimw.github.io/hkimw/ko/blog/gpt-1#2-maximize-likelihood-%EC%B5%9C%EB%8C%80-%EC%9A%B0%EB%8F%84-%EC%B6%94%EC%A0%95" class="hash-link" aria-label="Direct link to 2) Maximize Likelihood (Maximum Likelihood Estimation)" title="Direct link to 2) Maximize Likelihood (Maximum Likelihood Estimation)" translate="no">​</a></h3>
<ul>
<li class="">
<p><strong>Definition</strong>: The mathematical objective that optimizes the model's internal parameters <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="normal">Θ</mi></mrow><annotation encoding="application/x-tex">\Theta</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord">Θ</span></span></span></span> so as to maximize the conditional probability (Likelihood) of the actual correct word that follows a given context.</p>
</li>
<li class="">
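<p><strong>Intuition</strong>: Put simply, this is the most fundamental 'goal' of language-model training. As the model reads vast amounts of text, it keeps adjusting its internal circuitry (parameters) so that the word it predicts matches the word actually written in the text.</p>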
</li>
</ul>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="2-gpt의-뼈대-트랜스포머-디코더-transformer-decoder">2. The Backbone of GPT: Transformer Decoder<a href="https://hkimw.github.io/hkimw/ko/blog/gpt-1#2-gpt%EC%9D%98-%EB%BC%88%EB%8C%80-%ED%8A%B8%EB%9E%9C%EC%8A%A4%ED%8F%AC%EB%A8%B8-%EB%94%94%EC%BD%94%EB%8D%94-transformer-decoder" class="hash-link" aria-label="Direct link to 2. The Backbone of GPT: Transformer Decoder" title="Direct link to 2. The Backbone of GPT: Transformer Decoder" translate="no">​</a></h2>
<p>The Transformer originally published by Google consisted of an encoder (understanding the input) and a decoder (generating the output) for machine translation. GPT, however, boldly drops the encoder and adopts <strong>a structure that stacks only decoders, 12 layers deep</strong>.</p>
<ul>
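<li class=""><strong>Why only the decoder?</strong>
The essence of GPT is <strong>next-word prediction (Auto-regressive)</strong>. Inside the decoder there is a key mechanism called <strong>Masked Self-Attention</strong>. When the model processes the current word, the words that come later are masked so that it cannot see them, which prevents 'cheating'. This structure fits GPT's philosophy perfectly: the model must infer what comes next using only the context from the past up to the present.</li>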
</ul>
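<p>A minimal sketch of how such a mask could work is shown below, assuming attention scores are held in a plain list of lists. The helper names (<code>causal_mask</code>, <code>mask_scores</code>, <code>softmax</code>) are hypothetical and written for this note; real implementations operate on tensors.</p>

```python
import math

def causal_mask(k):
    """mask[i][j] is True when position i may attend to position j (j is not in the future)."""
    return [[j in range(i + 1) for j in range(k)] for i in range(k)]

def mask_scores(scores, mask, blocked=float("-inf")):
    """Replace the scores of future positions so that softmax gives them zero weight."""
    return [[s if allowed else blocked for s, allowed in zip(row, mask_row)]
            for row, mask_row in zip(scores, mask)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    top = max(row)
    exps = [math.exp(s - top) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

scores = [[0.0] * 3 for _ in range(3)]  # toy scores: every pair equally similar
weights = [softmax(row) for row in mask_scores(scores, causal_mask(3))]
print(weights[0])  # [1.0, 0.0, 0.0]
print(weights[1])  # [0.5, 0.5, 0.0]
```

<p>The first token can only attend to itself, while the second splits its attention between the first two tokens and gives zero weight to the third, which lies in its future.</p>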
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="3-gpt-1의-2단계-학습-파이프라인">3. GPT-1's Two-Stage Training Pipeline<a href="https://hkimw.github.io/hkimw/ko/blog/gpt-1#3-gpt-1%EC%9D%98-2%EB%8B%A8%EA%B3%84-%ED%95%99%EC%8A%B5-%ED%8C%8C%EC%9D%B4%ED%94%84%EB%9D%BC%EC%9D%B8" class="hash-link" aria-label="Direct link to 3. GPT-1's Two-Stage Training Pipeline" title="Direct link to 3. GPT-1's Two-Stage Training Pipeline" translate="no">​</a></h2>
<!-- -->
<div style="padding:1.5rem;background:#FBF8F3;border-radius:4px;border:1px solid #e0e0e0;margin:1rem 0"><svg viewBox="0 0 800 450" width="100%" height="100%" style="font-family:var(--ifm-font-family-monospace)"><defs><marker id="arrow" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="6" markerHeight="6" orient="auto-start-reverse"><path d="M 0 0 L 10 5 L 0 10 z" fill="#333"></path></marker></defs><text x="400" y="20" text-anchor="middle" fill="#333" font-weight="bold" font-size="18">GPT-1: 2-Stage Training Pipeline</text><g transform="translate(100, 50)"><rect x="0" y="0" width="250" height="350" fill="#e3f2fd" stroke="#1565c0" rx="4" stroke-dasharray="4,4"></rect><text x="125" y="25" text-anchor="middle" font-size="14" fill="#1565c0" font-weight="bold">Stage 1: Pre-training</text><text x="125" y="45" text-anchor="middle" font-size="12" fill="#666">Unsupervised</text><rect x="25" y="70" width="200" height="40" fill="#fff" stroke="#333" rx="4"></rect><text x="125" y="85" text-anchor="middle" font-size="12" fill="#333">Unlabeled Text Corpus</text><text x="125" y="100" text-anchor="middle" font-size="10" fill="#666">(BooksCorpus)</text><rect x="25" y="140" width="200" height="50" fill="#fff9c4" stroke="#fbc02d" rx="4"></rect><text x="125" y="160" text-anchor="middle" font-size="12" fill="#333" font-weight="bold">12-Layer Transformer</text><text x="125" y="175" text-anchor="middle" font-size="12" fill="#333">Decoder</text><rect x="25" y="220" width="200" height="40" fill="#e8f5e9" stroke="#2e7d32" rx="4"></rect><text x="125" y="235" text-anchor="middle" font-size="12" fill="#333">Objective (L1)</text><text x="125" y="250" text-anchor="middle" font-size="10" fill="#666">Maximize Log-Likelihood</text><rect x="25" y="290" width="200" height="30" fill="#fff" stroke="#333" rx="4"></rect><text x="125" y="310" text-anchor="middle" font-size="12" fill="#333">Next Token Prediction</text><path d="M 125 110 L 125 135" stroke="#333" stroke-width="1.5" 
marker-end="url(#arrow)"></path><path d="M 125 190 L 125 215" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><path d="M 125 260 L 125 285" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path></g><path d="M 350 200 L 440 200" stroke="#ef6c00" stroke-width="2" stroke-dasharray="6,4" marker-end="url(#arrow)"></path><text x="395" y="190" text-anchor="middle" font-size="12" fill="#ef6c00" font-weight="bold">Transfer Weights</text><g transform="translate(450, 50)"><rect x="0" y="0" width="250" height="350" fill="#fce4ec" stroke="#c2185b" rx="4" stroke-dasharray="4,4"></rect><text x="125" y="25" text-anchor="middle" font-size="14" fill="#c2185b" font-weight="bold">Stage 2: Fine-tuning</text><text x="125" y="45" text-anchor="middle" font-size="12" fill="#666">Supervised</text><rect x="25" y="70" width="200" height="40" fill="#fff" stroke="#333" rx="4"></rect><text x="125" y="85" text-anchor="middle" font-size="12" fill="#333">Labeled Task Data</text><text x="125" y="100" text-anchor="middle" font-size="10" fill="#666">(Input x, Label y)</text><rect x="25" y="140" width="200" height="40" fill="#fff9c4" stroke="#fbc02d" rx="4"></rect><text x="125" y="165" text-anchor="middle" font-size="12" fill="#333" font-weight="bold">Pre-trained Transformer</text><rect x="25" y="210" width="200" height="30" fill="#fff" stroke="#333" rx="4"></rect><text x="125" y="230" text-anchor="middle" font-size="12" fill="#333">Added Linear Layer (Wy)</text><rect x="25" y="260" width="200" height="40" fill="#e8f5e9" stroke="#2e7d32" rx="4"></rect><text x="125" y="275" text-anchor="middle" font-size="12" fill="#333">Objective (L3)</text><text x="125" y="290" text-anchor="middle" font-size="10" fill="#666">L2 + λ * L1</text><rect x="25" y="320" width="200" height="30" fill="#fff" stroke="#333" rx="4"></rect><text x="125" y="340" text-anchor="middle" font-size="12" fill="#333">Target Label Prediction</text><path d="M 125 110 L 125 135" stroke="#333" stroke-width="1.5" 
marker-end="url(#arrow)"></path><path d="M 125 180 L 125 205" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><path d="M 125 240 L 125 255" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><path d="M 125 300 L 125 315" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path></g></svg></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="1단계-unsupervised-pre-training-비지도-사전-학습">Stage 1: Unsupervised Pre-training<a href="https://hkimw.github.io/hkimw/ko/blog/gpt-1#1%EB%8B%A8%EA%B3%84-unsupervised-pre-training-%EB%B9%84%EC%A7%80%EB%8F%84-%EC%82%AC%EC%A0%84-%ED%95%99%EC%8A%B5" class="hash-link" aria-label="Direct link to Stage 1: Unsupervised Pre-training" title="Direct link to Stage 1: Unsupervised Pre-training" translate="no">​</a></h3>
<p>In this stage, the model teaches itself the general patterns of language from a large amount of unlabeled text data.</p>
<ul>
<li class=""><strong>Definition (Objective Function)</strong>:
Given a large unlabeled Corpus <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">U</mi><mo>=</mo><mo stretchy="false">{</mo><msub><mi>u</mi><mn>1</mn></msub><mo separator="true">,</mo><mo>…</mo><mo separator="true">,</mo><msub><mi>u</mi><mi>n</mi></msub><mo stretchy="false">}</mo></mrow><annotation encoding="application/x-tex">\mathcal{U} = \{u_1, \dots, u_n\}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord mathcal" style="margin-right:0.0993em">U</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mopen">{</span><span class="mord"><span class="mord mathnormal">u</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="minner">…</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord"><span class="mord mathnormal">u</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">n</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mclose">}</span></span></span></span>, the model is trained to maximize the following Log-Likelihood.</li>
</ul>
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><msub><mi>L</mi><mn>1</mn></msub><mo stretchy="false">(</mo><mi mathvariant="script">U</mi><mo stretchy="false">)</mo><mo>=</mo><munder><mo>∑</mo><mi>i</mi></munder><mi>log</mi><mo>⁡</mo><mi>P</mi><mo stretchy="false">(</mo><msub><mi>u</mi><mi>i</mi></msub><mi mathvariant="normal">∣</mi><msub><mi>u</mi><mrow><mi>i</mi><mo>−</mo><mi>k</mi></mrow></msub><mo separator="true">,</mo><mo>…</mo><mo separator="true">,</mo><msub><mi>u</mi><mrow><mi>i</mi><mo>−</mo><mn>1</mn></mrow></msub><mo separator="true">;</mo><mi mathvariant="normal">Θ</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">L_1(\mathcal{U}) = \sum_i \log P(u_i | u_{i-k}, \dots, u_{i-1}; \Theta)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord"><span class="mord mathnormal">L</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathcal" style="margin-right:0.0993em">U</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:2.3277em;vertical-align:-1.2777em"></span><span class="mop op-limits"><span class="vlist-t vlist-t2"><span 
class="vlist-r"><span class="vlist" style="height:1.05em"><span style="top:-1.8723em;margin-left:0em"><span class="pstrut" style="height:3.05em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span><span style="top:-3.05em"><span class="pstrut" style="height:3.05em"></span><span><span class="mop op-symbol large-op">∑</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:1.2777em"><span></span></span></span></span></span><span class="mspace" style="margin-right:0.1667em"></span><span class="mop">lo<span style="margin-right:0.0139em">g</span></span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.1389em">P</span><span class="mopen">(</span><span class="mord"><span class="mord mathnormal">u</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mord">∣</span><span class="mord"><span class="mord mathnormal">u</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">i</span><span class="mbin mtight">−</span><span class="mord mathnormal mtight" style="margin-right:0.0315em">k</span></span></span></span></span><span class="vlist-s">​</span></span><span 
class="vlist-r"><span class="vlist" style="height:0.2083em"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="minner">…</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord"><span class="mord mathnormal">u</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">i</span><span class="mbin mtight">−</span><span class="mord mtight">1</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2083em"><span></span></span></span></span></span></span><span class="mpunct">;</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord">Θ</span><span class="mclose">)</span></span></span></span></span>
<blockquote>
<p>When the model (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="normal">Θ</mi></mrow><annotation encoding="application/x-tex">Θ</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord">Θ</span></span></span></span>) is shown the previous words (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>u</mi><mrow><mi>i</mi><mo>−</mo><mi>k</mi></mrow></msub><mo separator="true">,</mo><mo>…</mo><mo separator="true">,</mo><msub><mi>u</mi><mrow><mi>i</mi><mo>−</mo><mn>1</mn></mrow></msub></mrow><annotation encoding="application/x-tex">u_{i-k} ,…,u_{i−1}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6389em;vertical-align:-0.2083em"></span><span class="mord"><span class="mord mathnormal">u</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">i</span><span class="mbin mtight">−</span><span class="mord mathnormal mtight" style="margin-right:0.0315em">k</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2083em"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="minner">…</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord"><span class="mord mathnormal">u</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">i</span><span class="mbin mtight">−</span><span class="mord mtight">1</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2083em"><span></span></span></span></span></span></span></span></span></span>), it computes the probability <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>P</mi><mo stretchy="false">(</mo><mo>⋯</mo><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">P(⋯)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.1389em">P</span><span class="mopen">(</span><span class="minner">⋯</span><span class="mclose">)</span></span></span></span> of correctly predicting the true next word (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>u</mi><mi>i</mi></msub></mrow><annotation encoding="application/x-tex">u_i</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">u</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span>). Taking the logarithm of that probability and summing it (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mo>∑</mo><mi>i</mi></msub></mrow><annotation encoding="application/x-tex">∑_i</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.0497em;vertical-align:-0.2997em"></span><span class="mop"><span class="mop op-symbol small-op" style="position:relative;top:0em">∑</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.162em"><span style="top:-2.4003em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2997em"><span></span></span></span></span></span></span></span></span></span>) over all of the text data gives <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>L</mi><mn>1</mn></msub><mo stretchy="false">(</mo><mi mathvariant="script">U</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">L_1(\mathcal{U})</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord"><span class="mord mathnormal">L</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathcal" style="margin-right:0.0993em">U</span><span class="mclose">)</span></span></span></span>.</p>
</blockquote>
<ul>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>L</mi><mn>1</mn></msub><mo stretchy="false">(</mo><mi mathvariant="script">U</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">L_1(\mathcal{U})</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord"><span class="mord mathnormal">L</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathcal" style="margin-right:0.0993em">U</span><span class="mclose">)</span></span></span></span>  :</p>
<ul>
<li class="">This is the objective function.<br>
<!-- -->Here, <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">U</mi></mrow><annotation encoding="application/x-tex">\mathcal{U}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord mathcal" style="margin-right:0.0993em">U</span></span></span></span> is the large unlabeled text corpus used as training data.<br>
<!-- -->In other words, it is a score for how well the model understands (predicts) the data <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">U</mi></mrow><annotation encoding="application/x-tex">\mathcal{U}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord mathcal" style="margin-right:0.0993em">U</span></span></span></span>.</li>
</ul>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mo>∑</mo><mi>i</mi></msub></mrow><annotation encoding="application/x-tex">∑_i</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.0497em;vertical-align:-0.2997em"></span><span class="mop"><span class="mop op-symbol small-op" style="position:relative;top:0em">∑</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.162em"><span style="top:-2.4003em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2997em"><span></span></span></span></span></span></span></span></span></span>​ :</p>
<ul>
<li class="">For every token position i in the text (data), add up the log-probability terms described below.</li>
</ul>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>l</mi><mi>o</mi><mi>g</mi></mrow><annotation encoding="application/x-tex">log</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8889em;vertical-align:-0.1944em"></span><span class="mord mathnormal" style="margin-right:0.0197em">l</span><span class="mord mathnormal">o</span><span class="mord mathnormal" style="margin-right:0.0359em">g</span></span></span></span>:</p>
<ul>
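<li class="">This is the logarithm. Each probability is a number between 0 and 1, so multiplying the probabilities of many tokens drives the product toward 0 and eventually causes floating-point underflow. Taking the log turns the product into a sum (∑), which is far more stable to compute.</li>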
</ul>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>P</mi><mo stretchy="false">(</mo><mo>⋯</mo><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">P(⋯)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.1389em">P</span><span class="mopen">(</span><span class="minner">⋯</span><span class="mclose">)</span></span></span></span>:</p>
<ul>
<li class="">The probability. Here <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>P</mi></mrow><annotation encoding="application/x-tex">P</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord mathnormal" style="margin-right:0.1389em">P</span></span></span></span> is the conditional probability computed by a Transformer decoder with parameters <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="normal">Θ</mi></mrow><annotation encoding="application/x-tex">\Theta</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord">Θ</span></span></span></span>.</li>
</ul>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>u</mi><mi>i</mi></msub></mrow><annotation encoding="application/x-tex">u_i</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">u</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span>​ :</p>
<ul>
<li class="">The <strong>current (next) token</strong> that the model must predict.</li>
</ul>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>u</mi><mrow><mi>i</mi><mo>−</mo><mi>k</mi></mrow></msub><mo separator="true">,</mo><mo>…</mo><mo separator="true">,</mo><msub><mi>u</mi><mrow><mi>i</mi><mo>−</mo><mn>1</mn></mrow></msub></mrow><annotation encoding="application/x-tex">u_{i-k} ,…,u_{i−1}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6389em;vertical-align:-0.2083em"></span><span class="mord"><span class="mord mathnormal">u</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">i</span><span class="mbin mtight">−</span><span class="mord mathnormal mtight" style="margin-right:0.0315em">k</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2083em"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="minner">…</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord"><span class="mord mathnormal">u</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">i</span><span class="mbin mtight">−</span><span class="mord 
mtight">1</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2083em"><span></span></span></span></span></span></span></span></span></span>:</p>
<ul>
<li class="">The tokens that appear before <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>u</mi><mi>i</mi></msub></mrow><annotation encoding="application/x-tex">u_i</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">u</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span>. <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>k</mi></mrow><annotation encoding="application/x-tex">k</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6944em"></span><span class="mord mathnormal" style="margin-right:0.0315em">k</span></span></span></span> is the context window size, that is, how many previous tokens the model can see at once. In short, this is the <strong>preceding context</strong>.</li>
</ul>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="normal">Θ</mi></mrow><annotation encoding="application/x-tex">Θ</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord">Θ</span></span></span></span> (theta):</p>
<ul>
<li class="">The <strong>parameters (weights) of the model</strong> we want to train.</li>
</ul>
</li>
</ul>
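<p>The terms above can be put together in a few lines of Python. This is only a minimal sketch: <code>l1_objective</code> and <code>toy_prob</code> are hypothetical names, and the lookup table stands in for the Transformer decoder that would supply the probabilities in practice. The second half illustrates why the log is used: a long product of probabilities underflows to zero, while the equivalent sum of logs stays finite.</p>

```python
import math

def l1_objective(tokens, k, prob_fn):
    """Sum of log P(u_i | u_{i-k}, ..., u_{i-1}) over all positions i."""
    total = 0.0
    for i, target in enumerate(tokens):
        context = tuple(tokens[max(0, i - k):i])  # at most k previous tokens
        total += math.log(prob_fn(target, context))
    return total

# Hypothetical stand-in for the Transformer decoder: a fixed lookup table.
def toy_prob(target, context):
    table = {
        ((), "the"): 0.5,
        (("the",), "cat"): 0.25,
        (("the", "cat"), "sat"): 0.5,
    }
    return table[(context, target)]

tokens = ["the", "cat", "sat"]
score = l1_objective(tokens, k=2, prob_fn=toy_prob)
# score equals log(0.5) + log(0.25) + log(0.5) = log(0.0625)

# Why the log matters: multiplying many probabilities underflows,
# while adding their logs does not.
product = 1.0
log_sum = 0.0
for _ in range(2000):
    product *= 0.5
    log_sum += math.log(0.5)
print(score, product, log_sum)
```

<p>Training maximizes this score, which is equivalent to minimizing the negative log-likelihood that most frameworks report as the loss.</p>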
<hr>
<ul>
<li class="">
<p><strong>Intuition</strong>:</p>
<ul>
<li class="">
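<p><strong>Method</strong>: The model reads huge amounts of text available on the internet (news, books, wikis, and so on) in order and is asked to fill in the blank, that is, the next word.<br>
<!-- -->(<em>Note: the main corpus GPT-1 was actually trained on is BooksCorpus, a collection of roughly 7,000 unpublished books. Because books contain long stretches of contiguous text, this reportedly helped the model learn long-range dependencies.</em>)</p>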
</li>
<li class="">
<p><strong>Why it is unsupervised</strong>: Nobody has to attach labels by hand. A sentence such as "The capital of South Korea is [Seoul]" is both the question and the answer.</p>
</li>
<li class="">
<p><strong>Result</strong>: By playing this huge yet simple "guess the next word" game, the model learns grammar, common-sense knowledge about the world, and the logic of context on its own.</p>
</li>
</ul>
</li>
</ul>
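<p>To make the "the sentence is its own answer key" point concrete, here is a small sketch of how raw text can be turned into training pairs without any manual labeling. The helper name <code>next_token_pairs</code> is hypothetical, and splitting on whitespace is a simplifying assumption; GPT-1 itself works on subword (BPE) tokens.</p>

```python
def next_token_pairs(tokens, k):
    """Turn raw tokens into (context, target) pairs for next-token prediction."""
    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - k):i]  # up to k preceding tokens
        pairs.append((context, tokens[i]))
    return pairs

# Whitespace splitting is a simplification made for this illustration.
sentence = "The capital of South Korea is Seoul"
pairs = next_token_pairs(sentence.split(), k=3)
for context, target in pairs:
    print(context, "->", target)
```

<p>Every pair comes straight from the text itself, which is why this stage needs no human annotation.</p>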
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="2단계-supervised-fine-tuning-지도-미세-조정">Stage 2: Supervised Fine-tuning<a href="https://hkimw.github.io/hkimw/ko/blog/gpt-1#2%EB%8B%A8%EA%B3%84-supervised-fine-tuning-%EC%A7%80%EB%8F%84-%EB%AF%B8%EC%84%B8-%EC%A1%B0%EC%A0%95" class="hash-link" aria-label="2단계: Supervised Fine-tuning (지도 미세 조정)에 대한 직접 링크" title="2단계: Supervised Fine-tuning (지도 미세 조정)에 대한 직접 링크" translate="no">​</a></h3>
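<p>Once pre-training is complete, the model is tuned for the specific task we actually want to solve (sentiment analysis, multiple-choice questions, and so on). Because this stage uses labeled data, it is supervised learning.</p>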
<ul>
<li class=""><strong>Definition (Objective Function)</strong>:
Given an input sequence <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msup><mi>x</mi><mn>1</mn></msup><mo separator="true">,</mo><mo>…</mo><mo separator="true">,</mo><msup><mi>x</mi><mi>m</mi></msup></mrow><annotation encoding="application/x-tex">x^1, \dots, x^m</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.0085em;vertical-align:-0.1944em"></span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8141em"><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="minner">…</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.6644em"><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">m</span></span></span></span></span></span></span></span></span></span></span> and its label <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>y</mi></mrow><annotation encoding="application/x-tex">y</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em"></span><span class="mord mathnormal" style="margin-right:0.0359em">y</span></span></span></span> drawn from a labeled dataset <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">C</mi></mrow><annotation encoding="application/x-tex">\mathcal{C}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord mathcal" style="margin-right:0.0583em">C</span></span></span></span>, the prediction probability and the objective function are defined as follows.</li>
</ul>
<hr>
<h3>Label prediction probability</h3>
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mi>P</mi><mo stretchy="false">(</mo><mi>y</mi><mi mathvariant="normal">∣</mi><msup><mi>x</mi><mn>1</mn></msup><mo separator="true">,</mo><mo>…</mo><mo separator="true">,</mo><msup><mi>x</mi><mi>m</mi></msup><mo stretchy="false">)</mo><mo>=</mo><mtext>softmax</mtext><mo stretchy="false">(</mo><msubsup><mi>h</mi><mi>l</mi><mi>m</mi></msubsup><msub><mi>W</mi><mi>y</mi></msub><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">P(y | x^1, \dots, x^m) = \text{softmax}(h_l^m W_y)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.1141em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.1389em">P</span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.0359em">y</span><span class="mord">∣</span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8641em"><span style="top:-3.113em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="minner">…</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.7144em"><span style="top:-3.113em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span 
class="mord mathnormal mtight">m</span></span></span></span></span></span></span></span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1.0361em;vertical-align:-0.2861em"></span><span class="mord text"><span class="mord">softmax</span></span><span class="mopen">(</span><span class="mord"><span class="mord mathnormal">h</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.7144em"><span style="top:-2.453em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0197em">l</span></span></span><span style="top:-3.113em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">m</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.247em"><span></span></span></span></span></span></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1389em">W</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em"><span style="top:-2.55em;margin-left:-0.1389em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0359em">y</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em"><span></span></span></span></span></span></span><span class="mclose">)</span></span></span></span></span>
<ul>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msup><mi>x</mi><mn>1</mn></msup><mo separator="true">,</mo><mo>…</mo><mo separator="true">,</mo><msup><mi>x</mi><mi>m</mi></msup></mrow><annotation encoding="application/x-tex">x^1, \dots, x^m</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.0085em;vertical-align:-0.1944em"></span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8141em"><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="minner">…</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.6644em"><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">m</span></span></span></span></span></span></span></span></span></span></span> :</p>
<ul>
<li class="">The input sentence (data), made up of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>m</mi></mrow><annotation encoding="application/x-tex">m</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em"></span><span class="mord mathnormal">m</span></span></span></span> tokens. (Example: "This movie is so much fun")</li>
</ul>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>y</mi></mrow><annotation encoding="application/x-tex">y</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em"></span><span class="mord mathnormal" style="margin-right:0.0359em">y</span></span></span></span>:</p>
<ul>
<li class="">The correct label we need to predict. (Example: Positive or Negative)</li>
</ul>
</li>
<li class="">
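<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msubsup><mi>h</mi><mi>l</mi><mi>m</mi></msubsup></mrow><annotation encoding="application/x-tex">h_l^m</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9775em;vertical-align:-0.2831em"></span><span class="mord"><span class="mord mathnormal">h</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.6644em"><span style="top:-2.4169em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.0197em">l</span></span></span></span><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">m</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2831em"><span></span></span></span></span></span></span></span></span></span> :</p>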
<ul>
<li class="">The <strong>final hidden state</strong> produced by the last layer (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>l</mi></mrow><annotation encoding="application/x-tex">l</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6944em"></span><span class="mord mathnormal" style="margin-right:0.0197em">l</span></span></span></span>) of the pre-trained Transformer at the last token (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>m</mi></mrow><annotation encoding="application/x-tex">m</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em"></span><span class="mord mathnormal">m</span></span></span></span>). You can think of it as the <strong>core meaning of the sentence</strong>, a summary the model forms after reading the sentence from start to finish.</li>
</ul>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>W</mi><mi>y</mi></msub></mrow><annotation encoding="application/x-tex">W_y</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9694em;vertical-align:-0.2861em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1389em">W</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em"><span style="top:-2.55em;margin-left:-0.1389em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0359em">y</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em"><span></span></span></span></span></span></span></span></span></span>​ :</p>
<ul>
<li class="">The weights of a linear layer newly added for the specific task (classification). It takes the model's summary <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo stretchy="false">(</mo><msubsup><mi>h</mi><mi>l</mi><mi>m</mi></msubsup><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">(h_{l}^m)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.0331em;vertical-align:-0.2831em"></span><span class="mopen">(</span><span class="mord"><span class="mord mathnormal">h</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.6644em"><span style="top:-2.4169em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.0197em">l</span></span></span></span><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">m</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2831em"><span></span></span></span></span></span></span><span class="mclose">)</span></span></span></span> and maps it to one score per label.</li>
</ul>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>s</mi><mi>o</mi><mi>f</mi><mi>t</mi><mi>m</mi><mi>a</mi><mi>x</mi></mrow><annotation encoding="application/x-tex">softmax</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8889em;vertical-align:-0.1944em"></span><span class="mord mathnormal">so</span><span class="mord mathnormal" style="margin-right:0.1076em">f</span><span class="mord mathnormal">t</span><span class="mord mathnormal">ma</span><span class="mord mathnormal">x</span></span></span></span>:</p>
<ul>
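<li class="">The softmax function. It converts the raw scores produced by <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>W</mi><mi>y</mi></msub></mrow><annotation encoding="application/x-tex">W_y</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9694em;vertical-align:-0.2861em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1389em">W</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em"><span style="top:-2.55em;margin-left:-0.1389em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0359em">y</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em"><span></span></span></span></span></span></span></span></span></span> into probabilities that sum to 1 (100%). (Example: 0.9 probability of positive, 0.1 probability of negative)</li>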
</ul>
</li>
</ul>
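<p>The prediction formula can be traced by hand with a tiny example. The sketch below is illustrative only: the 3-dimensional hidden state, the weight matrix, and the two-label setup (positive, negative) are made-up numbers, and the function names are hypothetical. In the real model the hidden state comes from the last Transformer layer and the weights are learned during fine-tuning.</p>

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(h, W_y):
    """Compute softmax(h W_y) for a row vector h and a (d x num_labels) matrix W_y."""
    num_labels = len(W_y[0])
    logits = [
        sum(h[d] * W_y[d][c] for d in range(len(h)))
        for c in range(num_labels)
    ]
    return softmax(logits)

# Made-up values: a hidden state of size 3 and two labels (positive, negative).
h_lm = [1.0, 0.5, -1.0]
W_y = [
    [2.0, 0.0],
    [1.0, 1.0],
    [0.0, 1.0],
]
probs = classify(h_lm, W_y)
print(probs)  # probabilities for (positive, negative)
```

<p>Here the two raw scores are 2.5 and -0.5, so softmax assigns most of the probability mass to the first label.</p>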
<hr>
<h3>Fine-tuning objective function</h3>
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><msub><mi>L</mi><mn>2</mn></msub><mo stretchy="false">(</mo><mi mathvariant="script">C</mi><mo stretchy="false">)</mo><mo>=</mo><munder><mo>∑</mo><mrow><mo stretchy="false">(</mo><mi>x</mi><mo separator="true">,</mo><mi>y</mi><mo stretchy="false">)</mo></mrow></munder><mi>log</mi><mo>⁡</mo><mi>P</mi><mo stretchy="false">(</mo><mi>y</mi><mi mathvariant="normal">∣</mi><msup><mi>x</mi><mn>1</mn></msup><mo separator="true">,</mo><mo>…</mo><mo separator="true">,</mo><msup><mi>x</mi><mi>m</mi></msup><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">L_2(\mathcal{C}) = \sum_{(x,y)} \log P(y | x^1, \dots, x^m)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord"><span class="mord mathnormal">L</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathcal" style="margin-right:0.0583em">C</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:2.566em;vertical-align:-1.516em"></span><span class="mop op-limits"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.05em"><span 
style="top:-1.809em;margin-left:0em"><span class="pstrut" style="height:3.05em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mopen mtight">(</span><span class="mord mathnormal mtight">x</span><span class="mpunct mtight">,</span><span class="mord mathnormal mtight" style="margin-right:0.0359em">y</span><span class="mclose mtight">)</span></span></span></span><span style="top:-3.05em"><span class="pstrut" style="height:3.05em"></span><span><span class="mop op-symbol large-op">∑</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:1.516em"><span></span></span></span></span></span><span class="mspace" style="margin-right:0.1667em"></span><span class="mop">lo<span style="margin-right:0.0139em">g</span></span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.1389em">P</span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.0359em">y</span><span class="mord">∣</span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8641em"><span style="top:-3.113em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="minner">…</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.7144em"><span style="top:-3.113em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span 
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">m</span></span></span></span></span></span></span></span><span class="mclose">)</span></span></span></span></span>
<ul>
<li class=""><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>L</mi><mn>2</mn></msub><mo stretchy="false">(</mo><mi mathvariant="script">C</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">L_2(\mathcal{C})</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord"><span class="mord mathnormal">L</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathcal" style="margin-right:0.0583em">C</span><span class="mclose">)</span></span></span></span>
<ul>
<li class="">The objective function for the second training stage (fine-tuning). <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">C</mi></mrow><annotation encoding="application/x-tex">\mathcal{C}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord mathcal" style="margin-right:0.0583em">C</span></span></span></span> denotes a labeled dataset in which humans have annotated each input with its correct answer (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>y</mi></mrow><annotation encoding="application/x-tex">y</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em"></span><span class="mord mathnormal" style="margin-right:0.0359em">y</span></span></span></span>), for example reviews paired with star ratings.</li>
</ul>
</li>
<li class=""><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mo>∑</mo><mrow><mo stretchy="false">(</mo><mi>x</mi><mo separator="true">,</mo><mi>y</mi><mo stretchy="false">)</mo></mrow></msub></mrow><annotation encoding="application/x-tex">\sum_{(x,y)}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.2247em;vertical-align:-0.4747em"></span><span class="mop"><span class="mop op-symbol small-op" style="position:relative;top:0em">∑</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.2253em"><span style="top:-2.4003em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mopen mtight">(</span><span class="mord mathnormal mtight">x</span><span class="mpunct mtight">,</span><span class="mord mathnormal mtight" style="margin-right:0.0359em">y</span><span class="mclose mtight">)</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.4747em"><span></span></span></span></span></span></span></span></span></span>:<!-- -->
<ul>
<li class="">Sum the log-probability below over every (input text x, correct answer y) pair in the dataset C.</li>
</ul>
</li>
<li class=""><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>log</mi><mo>&#x2061;</mo><mi>P</mi><mo stretchy="false">(</mo><mo>…</mo><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\log P(\ldots)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mop">lo<span style="margin-right:0.0139em">g</span></span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.1389em">P</span><span class="mopen">(</span><span class="minner">…</span><span class="mclose">)</span></span></span></span>:<!-- -->
<ul>
<li class="">The logarithm of the probability that the model assigns to the true answer <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>y</mi></mrow><annotation encoding="application/x-tex">y</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em"></span><span class="mord mathnormal" style="margin-right:0.0359em">y</span></span></span></span>.</li>
</ul>
</li>
</ul>
<p>(Here <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msubsup><mi>h</mi><mi>l</mi><mi>m</mi></msubsup></mrow><annotation encoding="application/x-tex">h_l^m</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9775em;vertical-align:-0.2831em"></span><span class="mord"><span class="mord mathnormal">h</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.6644em"><span style="top:-2.4169em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0197em">l</span></span></span><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">m</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2831em"><span></span></span></span></span></span></span></span></span></span> is the final activation vector of the last Transformer block, and <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>W</mi><mi>y</mi></msub></mrow><annotation encoding="application/x-tex">W_y</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9694em;vertical-align:-0.2861em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1389em">W</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em"><span style="top:-2.55em;margin-left:-0.1389em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord 
mathnormal mtight" style="margin-right:0.0359em">y</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em"><span></span></span></span></span></span></span></span></span></span> is the weight matrix of the output layer.)</p>
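<p>As a rough illustration of how this objective could be evaluated, the sketch below multiplies a final activation vector by an output weight matrix, applies a softmax, and sums the log-probability of the true label over a tiny dataset. The function names, the 3-dimensional activations, and the two-class weights are all made-up toy values, not anything taken from GPT-1 itself.</p>

```python
import math


def softmax(scores):
    # Subtract the maximum first; the result is unchanged but exp() stays small.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_log_prob(h_final, w_y, label):
    # Logit k is the dot product of h_final with column k of w_y.
    num_classes = len(w_y[0])
    logits = [sum(h * row[k] for h, row in zip(h_final, w_y))
              for k in range(num_classes)]
    return math.log(softmax(logits)[label])


def l2_objective(dataset, w_y):
    # Sum of log P(y | x) over every (final activation, label) pair.
    return sum(label_log_prob(h, w_y, y) for h, y in dataset)


# Toy values (hypothetical): 3-dimensional activations, 2 classes.
W_Y = [[1.0, -1.0], [0.5, 0.0], [-0.5, 1.0]]
DATASET = [([2.0, 1.0, 0.0], 0), ([0.0, 0.0, 3.0], 1)]
L2 = l2_objective(DATASET, W_Y)
print(round(L2, 4))
```

<p>Because every term is the logarithm of a probability, the sum is never positive; fine-tuning pushes it toward zero.</p>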
<hr>
<ul>
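<li class=""><strong>Using an auxiliary objective</strong>:
During supervised fine-tuning, GPT-1 also keeps the stage-1 language modeling objective (next-token prediction) as an auxiliary term, which the paper reports improves the generalization of the supervised model and speeds up convergence.</li>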
</ul>
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><msub><mi>L</mi><mn>3</mn></msub><mo stretchy="false">(</mo><mi mathvariant="script">C</mi><mo stretchy="false">)</mo><mo>=</mo><msub><mi>L</mi><mn>2</mn></msub><mo stretchy="false">(</mo><mi mathvariant="script">C</mi><mo stretchy="false">)</mo><mo>+</mo><mi>λ</mi><mo>⋅</mo><msub><mi>L</mi><mn>1</mn></msub><mo stretchy="false">(</mo><mi mathvariant="script">C</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord"><span class="mord mathnormal">L</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">3</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathcal" style="margin-right:0.0583em">C</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord"><span class="mord mathnormal">L</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" 
style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathcal" style="margin-right:0.0583em">C</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:0.6944em"></span><span class="mord mathnormal">λ</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">⋅</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord"><span class="mord mathnormal">L</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathcal" style="margin-right:0.0583em">C</span><span class="mclose">)</span></span></span></span></span>
<ul>
<li class=""><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>L</mi><mn>3</mn></msub><mo stretchy="false">(</mo><mi mathvariant="script">C</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">L_3(\mathcal{C})</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord"><span class="mord mathnormal">L</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">3</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathcal" style="margin-right:0.0583em">C</span><span class="mclose">)</span></span></span></span>:<!-- -->
<ul>
<li class="">The combined objective that the model ultimately maximizes during fine-tuning.</li>
</ul>
</li>
<li class=""><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>L</mi><mn>2</mn></msub><mo stretchy="false">(</mo><mi mathvariant="script">C</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">L_2(\mathcal{C})</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord"><span class="mord mathnormal">L</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathcal" style="margin-right:0.0583em">C</span><span class="mclose">)</span></span></span></span>:<!-- -->
<ul>
<li class="">The label-prediction objective described above (supervised learning).</li>
</ul>
</li>
<li class=""><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>L</mi><mn>1</mn></msub><mo stretchy="false">(</mo><mi mathvariant="script">C</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">L_1(\mathcal{C})</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord"><span class="mord mathnormal">L</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathcal" style="margin-right:0.0583em">C</span><span class="mclose">)</span></span></span></span>:<!-- -->
<ul>
<li class="">The next-token prediction objective introduced at the start (the one used for pretraining). Here, however, the next tokens are predicted on the text of the labeled dataset currently being trained on (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">C</mi></mrow><annotation encoding="application/x-tex">{\mathcal{C}}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord"><span class="mord mathcal" style="margin-right:0.0583em">C</span></span></span></span></span>), not on the large unlabeled pretraining corpus (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">U</mi></mrow><annotation encoding="application/x-tex">{\mathcal{U}}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord"><span class="mord mathcal" style="margin-right:0.0993em">U</span></span></span></span></span>).</li>
</ul>
</li>
<li class=""><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>λ</mi></mrow><annotation encoding="application/x-tex">\lambda</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6944em"></span><span class="mord mathnormal">λ</span></span></span></span> (lambda):<!-- -->
<ul>
<li class="">A scalar weight. Label prediction (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>L</mi><mn>2</mn></msub></mrow><annotation encoding="application/x-tex">L_2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8333em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">L</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span>) remains the main task, and this value is the dial that sets how much next-token prediction (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>L</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">L_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8333em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">L</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span>) is mixed into training alongside it. (The GPT-1 paper sets it to 0.5.)</li>
</ul>
</li>
</ul>
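<p>The way the two terms combine can be sketched with a few lines of Python. This is only a toy calculation: the per-example probabilities below are invented, and <code>combined_objective</code> and <code>lm_log_likelihood</code> are hypothetical helper names rather than functions from any GPT-1 code base.</p>

```python
import math


def lm_log_likelihood(token_probs):
    # L1 for one text: sum of log P(token | preceding tokens).
    return sum(math.log(p) for p in token_probs)


def combined_objective(examples, lam=0.5):
    # L3 = L2 + lam * L1, with both terms summed over the labeled dataset.
    l2 = sum(math.log(ex["label_prob"]) for ex in examples)
    l1 = sum(lm_log_likelihood(ex["token_probs"]) for ex in examples)
    return l2 + lam * l1, l2, l1


# Made-up model outputs for two labeled examples.
EXAMPLES = [
    {"label_prob": 0.9, "token_probs": [0.5, 0.25, 0.8]},
    {"label_prob": 0.6, "token_probs": [0.4, 0.5]},
]
L3, L2, L1 = combined_objective(EXAMPLES, lam=0.5)
print(round(L2, 4), round(L1, 4), round(L3, 4))
```

<p>Setting <code>lam</code> to 0 recovers plain supervised fine-tuning, while larger values give the language modeling term more influence on the gradient.</p>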
<h3><strong>Why bring <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>L</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">L_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8333em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">L</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span> back in after pretraining is already done?</strong></h3>
<blockquote>
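<p><strong>Better generalization (less overfitting):</strong><br>
<!-- -->If the model focuses only on predicting labels, it can lose track of what the text actually means and learn shallow shortcuts instead, such as always answering "positive" whenever a particular word appears (overfitting). Having it keep predicting the next token preserves its ability to understand context in depth.</p>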
</blockquote>
<blockquote>
<p><strong>Faster convergence:</strong><br>
<!-- -->Because the model keeps attending to the structure of language while it learns the task, it reaches a good solution faster.</p>
</blockquote>
<blockquote>
<p><strong>Preserving pretrained knowledge:</strong><br>
<!-- -->It guards against catastrophic forgetting, in which the weights learned at great cost during pretraining are degraded by training on a single narrow task.</p>
</blockquote>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="4-task-aware-input-transformations-작업-인식-입력-변환">4. Task-aware input transformations<a href="https://hkimw.github.io/hkimw/ko/blog/gpt-1#4-task-aware-input-transformations-%EC%9E%91%EC%97%85-%EC%9D%B8%EC%8B%9D-%EC%9E%85%EB%A0%A5-%EB%B3%80%ED%99%98" class="hash-link" aria-label="Direct link to 4. Task-aware input transformations" title="Direct link to 4. Task-aware input transformations" translate="no">​</a></h2>
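<p>The key point of this technique is that <strong>the pretrained 12-layer decoder is left untouched</strong>. With no change to the architecture, the model handles a variety of tasks by reshaping only the text input with special tokens.</p>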
<!-- -->
<div style="padding:1.5rem;background:#FBF8F3;border-radius:4px;border:1px solid #e0e0e0;margin:1rem 0;overflow-x:auto"><svg viewBox="0 0 900 250" width="100%" height="100%" style="font-family:var(--ifm-font-family-monospace)"><defs><marker id="arrow" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="6" markerHeight="6" orient="auto-start-reverse"><path d="M 0 0 L 10 5 L 0 10 z" fill="#333"></path></marker></defs><text x="450" y="20" text-anchor="middle" fill="#333" font-weight="bold" font-size="16">Task-aware Input Transformations (Multiple Choice)</text><g transform="translate(50, 60)"><rect x="0" y="0" width="60" height="30" fill="#ffe0b2" stroke="#f57c00" rx="4"></rect><text x="30" y="20" text-anchor="middle" font-size="12" fill="#333">&lt;S&gt;</text><rect x="70" y="0" width="120" height="30" fill="#e1bee7" stroke="#8e24aa" rx="4"></rect><text x="130" y="20" text-anchor="middle" font-size="12" fill="#333">Premise</text><rect x="200" y="0" width="60" height="30" fill="#ffe0b2" stroke="#f57c00" rx="4"></rect><text x="230" y="20" text-anchor="middle" font-size="12" fill="#333">$</text><rect x="270" y="0" width="100" height="30" fill="#e1bee7" stroke="#8e24aa" rx="4"></rect><text x="320" y="20" text-anchor="middle" font-size="12" fill="#333">Option 1</text><rect x="380" y="0" width="60" height="30" fill="#ffe0b2" stroke="#f57c00" rx="4"></rect><text x="410" y="20" text-anchor="middle" font-size="12" fill="#333">&lt;E&gt;</text><path d="M 450 15 L 480 15" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><rect x="490" y="-10" width="100" height="50" fill="#bbdefb" stroke="#1976d2" rx="4"></rect><text x="540" y="10" text-anchor="middle" font-size="12" fill="#333">Transformer</text><text x="540" y="25" text-anchor="middle" font-size="10" fill="#666">+ Linear</text><path d="M 600 15 L 630 35" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path></g><g transform="translate(50, 130)"><rect x="0" y="0" width="60" height="30" fill="#ffe0b2" 
stroke="#f57c00" rx="4"></rect><text x="30" y="20" text-anchor="middle" font-size="12" fill="#333">&lt;S&gt;</text><rect x="70" y="0" width="120" height="30" fill="#e1bee7" stroke="#8e24aa" rx="4"></rect><text x="130" y="20" text-anchor="middle" font-size="12" fill="#333">Premise</text><rect x="200" y="0" width="60" height="30" fill="#ffe0b2" stroke="#f57c00" rx="4"></rect><text x="230" y="20" text-anchor="middle" font-size="12" fill="#333">$</text><rect x="270" y="0" width="100" height="30" fill="#e1bee7" stroke="#8e24aa" rx="4"></rect><text x="320" y="20" text-anchor="middle" font-size="12" fill="#333">Option 2</text><rect x="380" y="0" width="60" height="30" fill="#ffe0b2" stroke="#f57c00" rx="4"></rect><text x="410" y="20" text-anchor="middle" font-size="12" fill="#333">&lt;E&gt;</text><path d="M 450 15 L 480 15" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><rect x="490" y="-10" width="100" height="50" fill="#bbdefb" stroke="#1976d2" rx="4"></rect><text x="540" y="10" text-anchor="middle" font-size="12" fill="#333">Transformer</text><text x="540" y="25" text-anchor="middle" font-size="10" fill="#666">+ Linear</text><path d="M 600 15 L 630 -5" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path></g><g transform="translate(680, 80)"><rect x="0" y="0" width="80" height="60" fill="#c8e6c9" stroke="#388e3c" rx="4"></rect><text x="40" y="35" text-anchor="middle" font-size="12" fill="#333">Softmax</text><path d="M 90 30 L 120 30" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><rect x="130" y="0" width="80" height="60" fill="#fff" stroke="#333" rx="4"></rect><text x="170" y="25" text-anchor="middle" font-size="12" fill="#333">Output</text><text x="170" y="45" text-anchor="middle" font-size="10" fill="#666">Probabilities</text></g><path d="M 370 190 L 370 210" stroke="#999" stroke-width="2" stroke-dasharray="4,4"></path></svg></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="1-특수-token의-역할">1) Roles of the special tokens<a href="https://hkimw.github.io/hkimw/ko/blog/gpt-1#1-%ED%8A%B9%EC%88%98-token%EC%9D%98-%EC%97%AD%ED%95%A0" class="hash-link" aria-label="Direct link to 1) Roles of the special tokens" title="Direct link to 1) Roles of the special tokens" translate="no">​</a></h3>
<ul>
<li class="">
<p><strong><code>&lt;S&gt; (Start)</code> token</strong>: Placed at the very front of the sequence, it acts as an <strong>anchor</strong> that signals the start of a new task.</p>
<ul>
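<li class=""><em>How it differs from positional encoding</em>: Positional encoding tells the model where each token physically sits, whereas the <code>&lt;S&gt; (Start)</code> token is a structural reset signal that marks the input as a new, self-contained problem cut off from any earlier context. Without it, the first word would have to play both its semantic role and this structural role at once, which could place an extra burden on the attention computation.</li>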
</ul>
</li>
<li class="">
<p><strong><code>$ (Delim)</code> token</strong>: A <strong>delimiter</strong> that separates pieces of text with different roles, such as the passage and an answer option.</p>
</li>
<li class="">
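<p><strong><code>&lt;E&gt; (Extract)</code> token</strong>: The token appended at the very end of the sequence. By the time the decoder reaches it, all of the preceding context has been computed, so it acts as the <strong>trigger for extracting a single summary vector</strong> that packs in the meaning of the whole input.</p>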
</li>
</ul>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="2-객관식-문제-multiple-choice-처리-메커니즘">2) How multiple-choice questions are handled<a href="https://hkimw.github.io/hkimw/ko/blog/gpt-1#2-%EA%B0%9D%EA%B4%80%EC%8B%9D-%EB%AC%B8%EC%A0%9C-multiple-choice-%EC%B2%98%EB%A6%AC-%EB%A9%94%EC%BB%A4%EB%8B%88%EC%A6%98" class="hash-link" aria-label="Direct link to 2) How multiple-choice questions are handled" title="Direct link to 2) How multiple-choice questions are handled" translate="no">​</a></h3>
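<p>Consider a multiple-choice reading question of the kind found in the Korean-language section of the CSAT (one passage, four options). It is processed as follows.</p>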
<ol>
<li class="">
<p><strong>Batch construction</strong>: The four options are not concatenated into one long text. Instead, one independent sequence is built per option, as follows.</p>
<ul>
<li class="">
<p><code>&lt;S&gt; (Start)</code> + passage + <code>$ (Delim)</code> + option 1 + <code>&lt;E&gt; (Extract)</code></p>
</li>
<li class="">
<p><code>&lt;S&gt; (Start)</code> + passage + <code>$ (Delim)</code> + option 2 + <code>&lt;E&gt; (Extract)</code> (and likewise for the remaining options)</p>
</li>
</ul>
</li>
<li class="">
<p><strong>Parallel computation</strong>: The four independent sequences are stacked into a batch and passed through the model in a single forward pass.</p>
</li>
<li class="">
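<p><strong>Scoring</strong>: The four vectors output at each sequence's final <code>&lt;E&gt; (Extract)</code> token are fed through the same linear classifier, giving one unnormalized score (logit) per option, four in total. These scores are then gathered and passed through a softmax to obtain the probability that each option is the correct answer.</p>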
</li>
</ol>
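<p>The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the strings <code>"S"</code>, <code>"$"</code> and <code>"E"</code> are placeholders for the Start, Delim and Extract tokens, and <code>toy_score</code> is a hypothetical word-overlap scorer standing in for the real Transformer plus shared linear classifier.</p>

```python
import math

# Placeholder strings standing in for the Start, Delim and Extract tokens.
START, DELIM, EXTRACT = "S", "$", "E"


def build_sequences(passage, options):
    # One independent sequence per option:
    # Start + passage + Delim + option + Extract.
    return [[START] + passage + [DELIM] + option + [EXTRACT]
            for option in options]


def softmax(logits):
    shift = max(logits)  # stabilizes exp() without changing the result
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_score(sequence):
    # Hypothetical stand-in for "Transformer + shared linear classifier":
    # counts how many option tokens also occur in the passage.
    split = sequence.index(DELIM)
    passage_tokens = set(sequence[1:split])
    option_tokens = sequence[split + 1:-1]
    return float(sum(token in passage_tokens for token in option_tokens))


def answer_probabilities(passage, options, score_fn=toy_score):
    sequences = build_sequences(passage, options)
    logits = [score_fn(seq) for seq in sequences]  # one logit per option
    return softmax(logits)  # normalized across the options


PASSAGE = "the cat sat on the mat".split()
OPTIONS = [["cat", "sat"], ["dog", "ran"], ["the", "mat", "sat"], ["bird"]]
SEQUENCES = build_sequences(PASSAGE, OPTIONS)
PROBS = answer_probabilities(PASSAGE, OPTIONS)
print(SEQUENCES[0])
print([round(p, 3) for p in PROBS])
```

<p>Note that the softmax runs across the four options rather than across a vocabulary, which is why the same classifier can score any number of options without changing the model.</p>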
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="5-수학적-처리와-오차-계산-학습의-완성">5. Mathematical processing and loss computation (completing training)<a href="https://hkimw.github.io/hkimw/ko/blog/gpt-1#5-%EC%88%98%ED%95%99%EC%A0%81-%EC%B2%98%EB%A6%AC%EC%99%80-%EC%98%A4%EC%B0%A8-%EA%B3%84%EC%82%B0-%ED%95%99%EC%8A%B5%EC%9D%98-%EC%99%84%EC%84%B1" class="hash-link" aria-label="Direct link to 5. Mathematical processing and loss computation (completing training)" title="Direct link to 5. Mathematical processing and loss computation (completing training)" translate="no">​</a></h2>
<p>These are the mathematical steps needed to compare the raw scores the model outputs against the true answer and update (train) the parameters.</p>
<!-- -->
<div style="padding:1.5rem;background:#FBF8F3;border-radius:4px;border:1px solid #e0e0e0;margin:1rem 0"><svg viewBox="0 0 800 250" width="100%" height="100%" style="font-family:var(--ifm-font-family-monospace)"><defs><marker id="arrow" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="6" markerHeight="6" orient="auto-start-reverse"><path d="M 0 0 L 10 5 L 0 10 z" fill="#333"></path></marker></defs><g transform="translate(50, 40)"><rect x="0" y="0" width="150" height="40" fill="#f5f5f5" stroke="#9e9e9e" rx="4"></rect><text x="75" y="15" text-anchor="middle" font-size="12" fill="#333" font-weight="bold">Model Logits</text><text x="75" y="30" text-anchor="middle" font-size="10" fill="#666">[10, 5, 1, -2]</text><path d="M 160 20 L 190 20" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><rect x="200" y="0" width="100" height="40" fill="#ffe0b2" stroke="#f57c00" rx="4"></rect><text x="250" y="25" text-anchor="middle" font-size="12" fill="#333">Softmax σ(z)</text><path d="M 310 20 L 340 20" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><rect x="350" y="0" width="150" height="40" fill="#e3f2fd" stroke="#1565c0" rx="4"></rect><text x="425" y="15" text-anchor="middle" font-size="12" fill="#333" font-weight="bold">Predicted Prob (q)</text><text x="425" y="30" text-anchor="middle" font-size="10" fill="#666">[0.99, 0.01, 0.00, 0.00]</text><path d="M 510 20 L 580 60" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path></g><g transform="translate(50, 140)"><rect x="0" y="0" width="150" height="40" fill="#f5f5f5" stroke="#9e9e9e" rx="4"></rect><text x="75" y="25" text-anchor="middle" font-size="12" fill="#333" font-weight="bold">True Label (c=1)</text><path d="M 160 20 L 190 20" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><rect x="200" y="0" width="100" height="40" fill="#ffe0b2" stroke="#f57c00" rx="4"></rect><text x="250" y="25" text-anchor="middle" font-size="12" fill="#333">One-hot Enc.</text><path d="M 310 
20 L 340 20" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><rect x="350" y="0" width="150" height="40" fill="#e8f5e9" stroke="#2e7d32" rx="4"></rect><text x="425" y="15" text-anchor="middle" font-size="12" fill="#333" font-weight="bold">Target Prob (p)</text><text x="425" y="30" text-anchor="middle" font-size="10" fill="#666">[1.0, 0.0, 0.0, 0.0]</text><path d="M 510 20 L 580 -20" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path></g><g transform="translate(640, 80)"><rect x="0" y="0" width="120" height="60" fill="#ffcdd2" stroke="#d32f2f" rx="4"></rect><text x="60" y="25" text-anchor="middle" font-size="12" fill="#d32f2f" font-weight="bold">Cross-Entropy</text><text x="60" y="45" text-anchor="middle" font-size="12" fill="#333">Loss H(p,q)</text><path d="M 60 70 L 60 100" stroke="#333" stroke-width="1.5" marker-end="url(#arrow)"></path><rect x="10" y="110" width="100" height="30" fill="#f5f5f5" stroke="#9e9e9e" rx="4"></rect><text x="60" y="130" text-anchor="middle" font-size="10" fill="#333">Backpropagate</text></g></svg></div>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="1-softmax-소프트맥스-함수">1) Softmax<a href="https://hkimw.github.io/hkimw/ko/blog/gpt-1#1-softmax-%EC%86%8C%ED%94%84%ED%8A%B8%EB%A7%A5%EC%8A%A4-%ED%95%A8%EC%88%98" class="hash-link" aria-label="Direct link to 1) Softmax" title="Direct link to 1) Softmax" translate="no">​</a></h3>
<ul>
<li class=""><strong>Definition</strong>: Converts the unnormalized score <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>z</mi><mi>i</mi></msub></mrow><annotation encoding="application/x-tex">z_i</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.044em">z</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em"><span style="top:-2.55em;margin-left:-0.044em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span> that the linear classifier produces for each class into a probability.</li>
</ul>
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mi>σ</mi><mo stretchy="false">(</mo><mi mathvariant="bold">z</mi><msub><mo stretchy="false">)</mo><mi>i</mi></msub><mo>=</mo><mfrac><msup><mi>e</mi><msub><mi>z</mi><mi>i</mi></msub></msup><mrow><munderover><mo>∑</mo><mrow><mi>j</mi><mo>=</mo><mn>1</mn></mrow><mi>K</mi></munderover><msup><mi>e</mi><msub><mi>z</mi><mi>j</mi></msub></msup></mrow></mfrac></mrow><annotation encoding="application/x-tex">\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.0359em">σ</span><span class="mopen">(</span><span class="mord mathbf">z</span><span class="mclose"><span class="mclose">)</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:2.6484em;vertical-align:-1.307em"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.3414em"><span style="top:-2.1288em"><span class="pstrut" style="height:3em"></span><span class="mord"><span 
class="mop"><span class="mop op-symbol small-op" style="position:relative;top:0em">∑</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.9812em"><span style="top:-2.4003em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.0572em">j</span><span class="mrel mtight">=</span><span class="mord mtight">1</span></span></span></span><span style="top:-3.2029em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.0715em">K</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.4358em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord"><span class="mord mathnormal">e</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.6065em"><span style="top:-3.0051em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.044em">z</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3281em"><span style="top:-2.357em;margin-left:-0.044em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0572em">j</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" 
style="height:0.2819em"><span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.677em"><span class="pstrut" style="height:3em"></span><span class="mord"><span class="mord"><span class="mord mathnormal">e</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.6644em"><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.044em">z</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3281em"><span style="top:-2.357em;margin-left:-0.044em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.143em"><span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:1.307em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span></span></span></span></span>
<ul>
<li class="">
<p><strong>Intuition</strong>:
The four scores produced by the linear classifier (e.g., 10, 5, 1, -2) vary widely in magnitude. There are two reasons to apply Softmax instead of comparing them directly.</p>
<ol>
<li class="">
<p><strong>Conversion to a probability distribution</strong>: Softmax rescales the scores so that they sum to exactly 1 (100%), with each value satisfying <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0</mn><mo>&lt;</mo><mi>σ</mi><mo>&lt;</mo><mn>1</mn></mrow><annotation encoding="application/x-tex">0 &lt; \sigma &lt; 1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6835em;vertical-align:-0.0391em"></span><span class="mord">0</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">&lt;</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.5782em;vertical-align:-0.0391em"></span><span class="mord mathnormal" style="margin-right:0.0359em">σ</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">&lt;</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">1</span></span></span></span> (illustratively, 70%, 20%, 8%, 2%). Because it uses the exponential function (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>e</mi></mrow><annotation encoding="application/x-tex">e</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em"></span><span class="mord mathnormal">e</span></span></span></span>), larger scores take a disproportionately larger share and smaller scores shrink toward zero, which pushes the model toward confident predictions.</p>
</li>
<li class="">
<p><strong>Differentiability</strong>: Training by backpropagation requires every operation in the computation graph to be differentiable, and Softmax is differentiable everywhere.</p>
</li>
</ol>
</li>
</ul>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="2-one-hot-encoding-원-핫-인코딩">2) One-hot Encoding<a href="https://hkimw.github.io/hkimw/ko/blog/gpt-1#2-one-hot-encoding-%EC%9B%90-%ED%95%AB-%EC%9D%B8%EC%BD%94%EB%94%A9" class="hash-link" aria-label="Direct link to 2) One-hot Encoding" title="Direct link to 2) One-hot Encoding" translate="no">​</a></h3>
<ul>
<li class=""><strong>Definition</strong>: When the correct answer is class <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>c</mi></mrow><annotation encoding="application/x-tex">c</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em"></span><span class="mord mathnormal">c</span></span></span></span>, the target probability distribution <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>p</mi></mrow><annotation encoding="application/x-tex">p</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em"></span><span class="mord mathnormal">p</span></span></span></span> is defined as follows.</li>
</ul>
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mi>p</mi><mo stretchy="false">(</mo><mi>i</mi><mo stretchy="false">)</mo><mo>=</mo><mrow><mo fence="true">{</mo><mtable rowspacing="0.36em" columnalign="left left" columnspacing="1em"><mtr><mtd><mstyle scriptlevel="0" displaystyle="false"><mn>1</mn></mstyle></mtd><mtd><mstyle scriptlevel="0" displaystyle="false"><mrow><mtext>if&nbsp;</mtext><mi>i</mi><mo>=</mo><mi>c</mi></mrow></mstyle></mtd></mtr><mtr><mtd><mstyle scriptlevel="0" displaystyle="false"><mn>0</mn></mstyle></mtd><mtd><mstyle scriptlevel="0" displaystyle="false"><mrow><mtext>if&nbsp;</mtext><mi>i</mi><mo mathvariant="normal">≠</mo><mi>c</mi></mrow></mstyle></mtd></mtr></mtable></mrow></mrow><annotation encoding="application/x-tex">p(i) = \begin{cases} 1 &amp; \text{if } i = c \\ 0 &amp; \text{if } i \neq c \end{cases}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal">p</span><span class="mopen">(</span><span class="mord mathnormal">i</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:3em;vertical-align:-1.25em"></span><span class="minner"><span class="mopen delimcenter" style="top:0em"><span class="delimsizing size4">{</span></span><span class="mord"><span class="mtable"><span class="col-align-l"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.69em"><span style="top:-3.69em"><span class="pstrut" style="height:3.008em"></span><span class="mord"><span class="mord">1</span></span></span><span style="top:-2.25em"><span class="pstrut" style="height:3.008em"></span><span 
class="mord"><span class="mord">0</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:1.19em"><span></span></span></span></span></span><span class="arraycolsep" style="width:1em"></span><span class="col-align-l"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.69em"><span style="top:-3.69em"><span class="pstrut" style="height:3.008em"></span><span class="mord"><span class="mord text"><span class="mord">if&nbsp;</span></span><span class="mord mathnormal">i</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mord mathnormal">c</span></span></span><span style="top:-2.25em"><span class="pstrut" style="height:3.008em"></span><span class="mord"><span class="mord text"><span class="mord">if&nbsp;</span></span><span class="mord mathnormal">i</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel"><span class="mrel"><span class="mord vbox"><span class="thinbox"><span class="rlap"><span class="strut" style="height:0.8889em;vertical-align:-0.1944em"></span><span class="inner"><span class="mord"><span class="mrel"></span></span></span><span class="fix"></span></span></span></span></span><span class="mspace nobreak"></span><span class="mrel">=</span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mord mathnormal">c</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:1.19em"><span></span></span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span></span></span></span></span>
<ul>
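<li class=""><strong>Intuition</strong>: For the computer to compare its predicted probabilities (70%, 20%, 8%, 2%) with the true answer, the answer must also be in the form of a probability distribution. If the correct answer is class 2, one-hot encoding assigns 100% (1.0) to position 2 and 0% (0.0) to every other position, producing <code>[0.0, 1.0, 0.0, 0.0]</code>.</li>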
</ul>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="3-cross-entropy-loss-크로스-엔트로피-오차">3) Cross-Entropy Loss<a href="https://hkimw.github.io/hkimw/ko/blog/gpt-1#3-cross-entropy-loss-%ED%81%AC%EB%A1%9C%EC%8A%A4-%EC%97%94%ED%8A%B8%EB%A1%9C%ED%94%BC-%EC%98%A4%EC%B0%A8" class="hash-link" aria-label="Direct link to 3) Cross-Entropy Loss" title="Direct link to 3) Cross-Entropy Loss" translate="no">​</a></h3>
<ul>
<li class=""><strong>Definition</strong>: Measures the discrepancy (loss) between the model's predicted probability distribution <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>q</mi></mrow><annotation encoding="application/x-tex">q</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em"></span><span class="mord mathnormal" style="margin-right:0.0359em">q</span></span></span></span> and the true target distribution <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>p</mi></mrow><annotation encoding="application/x-tex">p</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em"></span><span class="mord mathnormal">p</span></span></span></span>.</li>
</ul>
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mi>H</mi><mo stretchy="false">(</mo><mi>p</mi><mo separator="true">,</mo><mi>q</mi><mo stretchy="false">)</mo><mo>=</mo><mo>−</mo><munder><mo>∑</mo><mi>x</mi></munder><mi>p</mi><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo><mi>log</mi><mo>⁡</mo><mi>q</mi><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">H(p, q) = -\sum_{x} p(x) \log q(x)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.0813em">H</span><span class="mopen">(</span><span class="mord mathnormal">p</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.0359em">q</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:2.3em;vertical-align:-1.25em"></span><span class="mord">−</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mop op-limits"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.05em"><span style="top:-1.9em;margin-left:0em"><span class="pstrut" style="height:3.05em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">x</span></span></span></span><span style="top:-3.05em"><span class="pstrut" style="height:3.05em"></span><span><span class="mop op-symbol large-op">∑</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" 
style="height:1.25em"><span></span></span></span></span></span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal">p</span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mop">lo<span style="margin-right:0.0139em">g</span></span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.0359em">q</span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mclose">)</span></span></span></span></span>
<p>When the target is one-hot encoded, every term except the one for the true class <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>c</mi></mrow><annotation encoding="application/x-tex">c</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em"></span><span class="mord mathnormal">c</span></span></span></span> vanishes, so the loss reduces to the negative logarithm of the probability assigned to that class.
The closer the probability <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>q</mi><mo stretchy="false">(</mo><mi>c</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">q(c)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.0359em">q</span><span class="mopen">(</span><span class="mord mathnormal">c</span><span class="mclose">)</span></span></span></span> that the model assigns to the true class is to 1, the closer the loss is to 0; as that probability approaches 0, the loss grows without bound.</p>
<ul>
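<li class=""><strong>Intuition</strong>:
MSE (mean squared error) is used for continuous values (regression), such as predicting house prices. For multiple-choice or classification problems, however, <strong>Cross-Entropy, which measures how far apart two probability distributions (prediction vs. target) are</strong>, is a much better fit.
The model computes the loss between its prediction (e.g., <code>[0.1, 0.7, 0.05, 0.15]</code>) and the target (<code>[0, 1, 0, 0]</code>), then adjusts its internal parameters in the direction that reduces this loss, gradually improving its accuracy.</li>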
</ul>]]></content:encoded>
            <category>논문</category>
            <category>gpt</category>
            <category>nlp</category>
            <category>llm</category>
            <category>딥러닝</category>
        </item>
        <item>
            <title><![CDATA[[Daily] Spring, and a New Start]]></title>
            <link>https://hkimw.github.io/hkimw/ko/blog/daily-first</link>
            <guid>https://hkimw.github.io/hkimw/ko/blog/daily-first</guid>
            <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
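            <description><![CDATA[As the cherry blossoms begin to bloom, I'm giving my personal homepage a fresh start too.]]></description>
            <content:encoded><![CDATA[<p>As the cherry blossoms begin to bloom, I'm giving my personal homepage a fresh start too.</p>
<p>These days I'm working on a GPU programming project in the lab, and I lose track of time whenever I'm writing code.<br>
<!-- -->The thrill of the first time a CUDA kernel behaves exactly as expected... it still gives me a rush 😄</p>
<p>My goal is to blog consistently, and I plan to record not only study notes but also light everyday stories like this one.</p>
<p>Today I finished setting up the site over a cup of coffee.<br>
<!-- -->It was a lovely, spring-like day.</p>]]></content:encoded>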
            <category>일상</category>
        </item>
        <item>
            <title><![CDATA[[Jabdori] I Rebuilt My Personal Homepage with Docusaurus]]></title>
            <link>https://hkimw.github.io/hkimw/ko/blog/jabdori-first</link>
            <guid>https://hkimw.github.io/hkimw/ko/blog/jabdori-first</guid>
            <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[I've finally set up a proper personal homepage. What I used to maintain only as a GitHub Profile README has now moved to a static site built on Docusaurus.]]></description>
            <content:encoded><![CDATA[<p>I've finally set up a proper personal homepage. What I used to maintain only as a GitHub Profile README has now moved to a static site built on Docusaurus.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="왜-docusaurus인가">Why Docusaurus<a href="https://hkimw.github.io/hkimw/ko/blog/jabdori-first#%EC%99%9C-docusaurus%EC%9D%B8%EA%B0%80" class="hash-link" aria-label="Direct link to Why Docusaurus" title="Direct link to Why Docusaurus" translate="no">​</a></h2>
<ul>
<li class=""><strong>Markdown first</strong>: Managing blog posts as <code>.md</code> files is all it takes.</li>
<li class=""><strong>React extensibility</strong>: Custom pages such as papers, projects, and a chatbot can be built freely as React components.</li>
<li class=""><strong>GitHub Pages deployment</strong>: A single push to the <code>gh-pages</code> branch completes the deployment.</li>
<li class=""><strong>Dark mode out of the box</strong>: No need to implement it yourself 😄</li>
</ul>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="이-사이트의-구성">How This Site Is Organized<a href="https://hkimw.github.io/hkimw/ko/blog/jabdori-first#%EC%9D%B4-%EC%82%AC%EC%9D%B4%ED%8A%B8%EC%9D%98-%EA%B5%AC%EC%84%B1" class="hash-link" aria-label="Direct link to How This Site Is Organized" title="Direct link to How This Site Is Organized" translate="no">​</a></h2>
<table><thead><tr><th>Section</th><th>Contents</th></tr></thead><tbody><tr><td>Home</td><td>Introduction, tech stack, contact</td></tr><tr><td>Blog</td><td>Study / Jabdori / Daily / Review / News</td></tr><tr><td>Papers</td><td>Archive of papers I have written</td></tr><tr><td>Projects</td><td>GitHub repositories &amp; release showcase</td></tr><tr><td>Chatbot</td><td>AI Q&amp;A chatbot about me (planned)</td></tr></tbody></table>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="앞으로-할-것들">What's Next<a href="https://hkimw.github.io/hkimw/ko/blog/jabdori-first#%EC%95%9E%EC%9C%BC%EB%A1%9C-%ED%95%A0-%EA%B2%83%EB%93%A4" class="hash-link" aria-label="Direct link to What's Next" title="Direct link to What's Next" translate="no">​</a></h2>
<ul class="contains-task-list containsTaskList_mC6p">
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->Deploy and connect the chatbot for real</li>
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->Fill in the paper and project data</li>
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->Blog consistently (the hardest part...)</li>
</ul>
<p>I plan to use this as a space for recording things without pressure. Please stop by often!</p>]]></content:encoded>
            <category>잡도리</category>
        </item>
        <item>
            <title><![CDATA[[News] AI/HPC Weekly Clippings: 2026.04.14]]></title>
            <link>https://hkimw.github.io/hkimw/ko/blog/news-first</link>
            <guid>https://hkimw.github.io/hkimw/ko/blog/news-first</guid>
            <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[A roundup of this week's notable news in my areas of interest (deep learning inference, GPU architecture, HPC).]]></description>
            <content:encoded><![CDATA[<p>A roundup of this week's notable news in my areas of interest (deep learning inference, GPU architecture, HPC).</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="이번-주-주요-소식">This Week's Highlights<a href="https://hkimw.github.io/hkimw/ko/blog/news-first#%EC%9D%B4%EB%B2%88-%EC%A3%BC-%EC%A3%BC%EC%9A%94-%EC%86%8C%EC%8B%9D" class="hash-link" aria-label="Direct link to This Week's Highlights" title="Direct link to This Week's Highlights" translate="no">​</a></h2>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="1-nvidia-blackwell-2세대-추론-벤치마크-공개">1. NVIDIA Blackwell Second-Generation Inference Benchmarks Released<a href="https://hkimw.github.io/hkimw/ko/blog/news-first#1-nvidia-blackwell-2%EC%84%B8%EB%8C%80-%EC%B6%94%EB%A1%A0-%EB%B2%A4%EC%B9%98%EB%A7%88%ED%81%AC-%EA%B3%B5%EA%B0%9C" class="hash-link" aria-label="Direct link to 1. NVIDIA Blackwell Second-Generation Inference Benchmarks Released" title="Direct link to 1. NVIDIA Blackwell Second-Generation Inference Benchmarks Released" translate="no">​</a></h3>
<p>Benchmark results have been published showing that the next-generation Blackwell architecture delivers up to <strong>4×</strong> the FP8 inference throughput of the H100.<br>
<!-- -->The large improvement in memory bandwidth efficiency during the LLM decoding phase is especially noteworthy.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="2-flashattention-3-논문-arxiv-공개">2. FlashAttention-3 Paper Posted on arXiv<a href="https://hkimw.github.io/hkimw/ko/blog/news-first#2-flashattention-3-%EB%85%BC%EB%AC%B8-arxiv-%EA%B3%B5%EA%B0%9C" class="hash-link" aria-label="Direct link to 2. FlashAttention-3 Paper Posted on arXiv" title="Direct link to 2. FlashAttention-3 Paper Posted on arXiv" translate="no">​</a></h3>
<p>The third paper in the Flash Attention series has been released.<br>
<!-- -->It uses the <strong>Tensor Memory Accelerator (TMA)</strong> of the Hopper architecture (H100) together with asynchronous pipelining to make the Attention kernel more efficient.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="3-pytorch-27-릴리즈">3. PyTorch 2.7 Released<a href="https://hkimw.github.io/hkimw/ko/blog/news-first#3-pytorch-27-%EB%A6%B4%EB%A6%AC%EC%A6%88" class="hash-link" aria-label="Direct link to 3. PyTorch 2.7 Released" title="Direct link to 3. PyTorch 2.7 Released" translate="no">​</a></h3>
<p>Along with stability improvements to <code>torch.compile</code>, automatic CUDA Graph support has been strengthened.</p>
<hr>
<p><em>These are my personal notes, so they may contain errors. Be sure to check the original sources!</em></p>]]></content:encoded>
            <category>뉴스</category>
            <category>AI</category>
            <category>GPU</category>
        </item>
        <item>
            <title><![CDATA[[Review] "CUDA by Example": The Best Book for Getting Started with GPUs]]></title>
            <link>https://hkimw.github.io/hkimw/ko/blog/review-first</link>
            <guid>https://hkimw.github.io/hkimw/ko/blog/review-first</guid>
            <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[An introduction to the book that helped me the most when I was first learning CUDA programming.]]></description>
            <content:encoded><![CDATA[<p>An introduction to the book that helped me the most when I was first learning CUDA programming.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="책-정보">Book Details<a href="https://hkimw.github.io/hkimw/ko/blog/review-first#%EC%B1%85-%EC%A0%95%EB%B3%B4" class="hash-link" aria-label="Direct link to Book Details" title="Direct link to Book Details" translate="no">​</a></h2>
<ul>
<li class=""><strong>Title</strong>: CUDA by Example: An Introduction to General-Purpose GPU Programming</li>
<li class=""><strong>Authors</strong>: Jason Sanders, Edward Kandrot</li>
<li class=""><strong>Publisher</strong>: Addison-Wesley Professional (2010)</li>
<li class=""><strong>Difficulty</strong>: ⭐⭐☆☆☆ (beginner)</li>
</ul>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="왜-좋은가">Why It's Good<a href="https://hkimw.github.io/hkimw/ko/blog/review-first#%EC%99%9C-%EC%A2%8B%EC%9D%80%EA%B0%80" class="hash-link" aria-label="Direct link to Why It's Good" title="Direct link to Why It's Good" translate="no">​</a></h2>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="예제-중심-구성">Example-Driven Structure<a href="https://hkimw.github.io/hkimw/ko/blog/review-first#%EC%98%88%EC%A0%9C-%EC%A4%91%EC%8B%AC-%EA%B5%AC%EC%84%B1" class="hash-link" aria-label="Direct link to Example-Driven Structure" title="Direct link to Example-Driven Structure" translate="no">​</a></h3>
<p>Rather than leading with theory, the book shows <strong>code that actually runs</strong> first and then explains it, which makes it intuitive.<br>
<!-- -->The material progresses naturally: writing kernels → memory management → texture/constant memory → streams.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="다루는-핵심-개념">Key Concepts Covered<a href="https://hkimw.github.io/hkimw/ko/blog/review-first#%EB%8B%A4%EB%A3%A8%EB%8A%94-%ED%95%B5%EC%8B%AC-%EA%B0%9C%EB%85%90" class="hash-link" aria-label="Direct link to Key Concepts Covered" title="Direct link to Key Concepts Covered" translate="no">​</a></h3>
<table><thead><tr><th>Chapter</th><th>Topic</th></tr></thead><tbody><tr><td>3</td><td>Writing &amp; launching a basic kernel</td></tr><tr><td>4</td><td>Parallel reduction</td></tr><tr><td>5</td><td>Thread cooperation &amp; Shared Memory</td></tr><tr><td>9</td><td>Atomic operations (Atomics)</td></tr><tr><td>10</td><td>CUDA streams</td></tr></tbody></table>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="아쉬운-점">Shortcomings<a href="https://hkimw.github.io/hkimw/ko/blog/review-first#%EC%95%84%EC%89%AC%EC%9A%B4-%EC%A0%90" class="hash-link" aria-label="Direct link to Shortcomings" title="Direct link to Shortcomings" translate="no">​</a></h2>
<ul>
<li class="">Because the book dates from 2010, it does not cover recent architectures (Volta/Ampere/Hopper).</li>
<li class="">For warp-level primitives (such as <code>__shfl_sync</code>), you need to consult NVIDIA's official Programming Guide separately.</li>
</ul>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="추천-대상">Who Should Read It<a href="https://hkimw.github.io/hkimw/ko/blog/review-first#%EC%B6%94%EC%B2%9C-%EB%8C%80%EC%83%81" class="hash-link" aria-label="Direct link to Who Should Read It" title="Direct link to Who Should Read It" translate="no">​</a></h2>
<p>I <strong>strongly recommend</strong> it to anyone who knows C and is starting CUDA for the first time.<br>
<!-- -->For serious optimization, move on to the Programming Guide and GTC talks afterward.</p>
<p><strong>Overall score: 4 / 5</strong> ⭐⭐⭐⭐☆</p>]]></content:encoded>
            <category>리뷰</category>
            <category>CUDA</category>
            <category>책</category>
        </item>
        <item>
            <title><![CDATA[[Study] CUDA Kernel Optimization: Notes on Memory Access Patterns]]></title>
            <link>https://hkimw.github.io/hkimw/ko/blog/study-first</link>
            <guid>https://hkimw.github.io/hkimw/ko/blog/study-first</guid>
            <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[While studying deep learning inference optimization, I wrote up how much memory access patterns affect performance when writing CUDA kernels.]]></description>
            <content:encoded><![CDATA[<p>While studying deep learning inference optimization, I wrote up how much memory access patterns affect performance when writing CUDA kernels.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="핵심-개념">Key Concepts<a href="https://hkimw.github.io/hkimw/ko/blog/study-first#%ED%95%B5%EC%8B%AC-%EA%B0%9C%EB%85%90" class="hash-link" aria-label="Direct link to Key Concepts" title="Direct link to Key Concepts" translate="no">​</a></h2>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="coalesced-memory-access">Coalesced Memory Access<a href="https://hkimw.github.io/hkimw/ko/blog/study-first#coalesced-memory-access" class="hash-link" aria-label="Direct link to Coalesced Memory Access" title="Direct link to Coalesced Memory Access" translate="no">​</a></h3>
<p>When the threads in a warp access <strong>consecutive addresses</strong>, GPU global memory can serve those accesses together as a single transaction.<br>
<!-- -->Non-contiguous (strided) access increases the number of transactions, and bandwidth efficiency drops sharply.</p>
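<p>To make the transaction count concrete, here is a simplified host-side model in plain C, not a measurement of real hardware. It assumes a 32-thread warp in which thread <code>t</code> reads the element at index <code>t * stride</code> from an aligned base address, and it counts how many aligned memory segments those reads touch. The function name <code>segments_touched</code> and the segment size passed in are assumptions made for this illustration; actual transaction granularity depends on the GPU architecture.</p>

```c
#include <assert.h>
#include <stddef.h>

#define WARP_SIZE 32

/* Simplified model: count the distinct aligned segments touched when
 * thread t of a warp reads the element at index t * stride.
 * elem_bytes and segment_bytes are model parameters (assumptions),
 * not values queried from a device. */
size_t segments_touched(size_t stride, size_t elem_bytes, size_t segment_bytes) {
    assert(stride > 0 && elem_bytes > 0 && segment_bytes > 0);

    size_t count = 0;
    size_t last_segment = 0;
    for (size_t t = 0; t < WARP_SIZE; ++t) {
        size_t byte_offset = t * stride * elem_bytes;
        size_t segment = byte_offset / segment_bytes;
        /* Offsets grow with t, so a changed segment index is a new segment. */
        if (t == 0 || segment != last_segment) {
            ++count;
            last_segment = segment;
        }
    }
    return count;
}
```

<p>Under this model, with 4-byte floats and 128-byte segments, a stride of 1 touches 1 segment, a stride of 2 touches 2, and a stride of 32 touches 32, one per thread, which is the worst case for a warp.</p>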
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="shared-memory-활용">Using Shared Memory<a href="https://hkimw.github.io/hkimw/ko/blog/study-first#shared-memory-%ED%99%9C%EC%9A%A9" class="hash-link" aria-label="Direct link to Using Shared Memory" title="Direct link to Using Shared Memory" translate="no">​</a></h3>
<p>Shared Memory is on-chip SRAM that physically shares storage with the L1 cache. Staging data into it one tile at a time greatly reduces the number of global memory accesses.</p>
<div class="language-c codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-c codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">__global__ </span><span class="token keyword" style="font-style:italic">void</span><span class="token plain"> </span><span class="token function" style="color:rgb(130, 170, 255)">matmul_tiled</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token keyword" style="font-style:italic">float</span><span class="token plain"> </span><span class="token operator" style="color:rgb(137, 221, 255)">*</span><span class="token plain">A</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> </span><span class="token keyword" style="font-style:italic">float</span><span class="token plain"> </span><span class="token operator" style="color:rgb(137, 221, 255)">*</span><span class="token plain">B</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> </span><span class="token keyword" style="font-style:italic">float</span><span class="token plain"> </span><span class="token operator" style="color:rgb(137, 221, 255)">*</span><span class="token plain">C</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> </span><span class="token keyword" style="font-style:italic">int</span><span class="token plain"> N</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    
__shared__ </span><span class="token keyword" style="font-style:italic">float</span><span class="token plain"> sA</span><span class="token punctuation" style="color:rgb(199, 146, 234)">[</span><span class="token plain">TILE</span><span class="token punctuation" style="color:rgb(199, 146, 234)">]</span><span class="token punctuation" style="color:rgb(199, 146, 234)">[</span><span class="token plain">TILE</span><span class="token punctuation" style="color:rgb(199, 146, 234)">]</span><span class="token punctuation" style="color:rgb(199, 146, 234)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    __shared__ </span><span class="token keyword" style="font-style:italic">float</span><span class="token plain"> sB</span><span class="token punctuation" style="color:rgb(199, 146, 234)">[</span><span class="token plain">TILE</span><span class="token punctuation" style="color:rgb(199, 146, 234)">]</span><span class="token punctuation" style="color:rgb(199, 146, 234)">[</span><span class="token plain">TILE</span><span class="token punctuation" style="color:rgb(199, 146, 234)">]</span><span class="token punctuation" style="color:rgb(199, 146, 234)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic">// ...</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token punctuation" style="color:rgb(199, 146, 234)">}</span><br></span></code></pre></div></div>
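<p>The kernel body above is elided in the original, and the sketch below does not attempt to fill it in. Instead, it is a CPU-only emulation in plain C of the tiling idea, written to show why staging tiles reduces traffic to the large arrays. The function name <code>matmul_tiled_cpu</code>, the tile width of 4, and the read counter are all assumptions made for this illustration; the local arrays <code>sA</code> and <code>sB</code> merely stand in for Shared Memory.</p>

```c
#include <assert.h>
#include <stddef.h>

#define TILE 4 /* hypothetical tile width; the source does not state one */

/* CPU emulation of tiled C = A * B for row-major n x n matrices
 * (n must be a multiple of TILE). Returns how many elements were read
 * from A and B, which play the role of global memory here. */
size_t matmul_tiled_cpu(const float *A, const float *B, float *C, size_t n) {
    assert(n % TILE == 0);
    size_t global_reads = 0;

    /* Each (by, bx) pair corresponds to one thread block's output tile. */
    for (size_t by = 0; by < n; by += TILE) {
        for (size_t bx = 0; bx < n; bx += TILE) {
            float acc[TILE][TILE] = {{0.0f}};

            /* Walk along the shared dimension one tile at a time. */
            for (size_t t = 0; t < n; t += TILE) {
                float sA[TILE][TILE]; /* stand-in for __shared__ sA */
                float sB[TILE][TILE]; /* stand-in for __shared__ sB */

                /* Stage one tile of A and one tile of B. */
                for (size_t ty = 0; ty < TILE; ++ty) {
                    for (size_t tx = 0; tx < TILE; ++tx) {
                        sA[ty][tx] = A[(by + ty) * n + (t + tx)];
                        sB[ty][tx] = B[(t + ty) * n + (bx + tx)];
                        global_reads += 2;
                    }
                }

                /* Every product below reuses the staged values. */
                for (size_t ty = 0; ty < TILE; ++ty) {
                    for (size_t tx = 0; tx < TILE; ++tx) {
                        for (size_t k = 0; k < TILE; ++k) {
                            acc[ty][tx] += sA[ty][k] * sB[k][tx];
                        }
                    }
                }
            }

            /* Write the finished output tile. */
            for (size_t ty = 0; ty < TILE; ++ty) {
                for (size_t tx = 0; tx < TILE; ++tx) {
                    C[(by + ty) * n + (bx + tx)] = acc[ty][tx];
                }
            }
        }
    }
    return global_reads;
}
```

<p>For an 8 × 8 problem this emulation reads 256 elements from <code>A</code> and <code>B</code>, whereas a naive triple loop that reads both operands for every multiply would perform 2 × 8³ = 1024 reads. In general the tiled count is 2n³ / TILE, so a wider tile means fewer global reads.</p>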
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="오늘의-실험-결과">Today's Experimental Results<a href="https://hkimw.github.io/hkimw/ko/blog/study-first#%EC%98%A4%EB%8A%98%EC%9D%98-%EC%8B%A4%ED%97%98-%EA%B2%B0%EA%B3%BC" class="hash-link" aria-label="Direct link to Today's Experimental Results" title="Direct link to Today's Experimental Results" translate="no">​</a></h2>
<table><thead><tr><th>Implementation</th><th>Throughput (GFLOPS)</th></tr></thead><tbody><tr><td>Naive (global memory)</td><td>42</td></tr><tr><td>Coalesced</td><td>198</td></tr><tr><td>+ Shared Memory</td><td>573</td></tr></tbody></table>
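<p>Going from the naive kernel to the coalesced kernel with Shared Memory tiling gave roughly a <strong>13.6× speedup</strong> (42 → 573 GFLOPS). Relative to the coalesced-only kernel, tiling itself added about 2.9× (198 → 573 GFLOPS).</p>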
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="다음-목표">Next Goals<a href="https://hkimw.github.io/hkimw/ko/blog/study-first#%EB%8B%A4%EC%9D%8C-%EB%AA%A9%ED%91%9C" class="hash-link" aria-label="Direct link to Next Goals" title="Direct link to Next Goals" translate="no">​</a></h2>
<ul>
<li class="">Bank conflict analysis and padding strategies</li>
<li class="">Using the read-only cache via <code>__ldg()</code></li>
<li class="">Patterns for minimizing warp divergence</li>
</ul>]]></content:encoded>
            <category>공부</category>
            <category>CUDA</category>
            <category>GPU</category>
        </item>
    </channel>
</rss>