[CSE URP] ViT Self-Attention Structure


dlrpskdi 2023. 8. 25. 05:52

* I wrote this post while studying, to better understand the attention structure and the Transformer. Please note that this material was not covered in depth in URP itself :)

References (GitHub & Hugging Face)

https://nlpinkorean.github.io/illustrated-transformer/

https://github.com/hyunwoongko/transformer/blob/master/models/layers/multi_head_attention.py

https://github.com/rwightman/pytorch-image-models/blob/a520da9b495422bc773fb5dfe10819acb8bd7c5c/timm/models/vision_transformer.py#L183-L208

The animal didn't cross the street because it was too tired (example sentence)

Using the sentence above as an example: before self-attention is applied, the sentence is first tokenized and embedded.
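As a rough illustration of these two preprocessing steps, here is a minimal sketch. It assumes plain whitespace tokenization and a random embedding table; real models use subword tokenizers and learned embeddings, and the names `vocab`, `embedding_table`, and `d_model` are made up for this example.

```python
import numpy as np

sentence = "The animal didn't cross the street because it was too tired"

# Assumption: whitespace tokenization (real models use subword tokenizers).
tokens = sentence.split()

# Toy vocabulary: each distinct token gets an integer id, in order of appearance.
vocab = {tok: idx for idx, tok in enumerate(dict.fromkeys(tokens))}
token_ids = np.array([vocab[tok] for tok in tokens])

# Hypothetical embedding table: one random d_model-dimensional row per vocabulary entry.
d_model = 8
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

# Embedding lookup: one vector per token, shape (sequence length, d_model).
X = embedding_table[token_ids]
print(tokens)
print(X.shape)
```

The resulting matrix `X` (one row per token) is what self-attention takes as input.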

self-attention?
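Self-attention extracts the relationships among Query (Q), Key (K), and Value (V). In self-attention, Q, K, and V all come from the same source: the vectors of every token in the input sentence.

Q, K, V?
Q, K, and V are produced by multiplying the sequence of input word vectors by the trainable matrices W_Q, W_K, and W_V. For example, if the input words are X_1 = “Thinking” and X_2 = “Machines”, multiplying each by W_Q, W_K, and W_V yields (q_1, k_1, v_1) and (q_2, k_2, v_2), respectively.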
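The projection described above can be sketched as follows. This is only an illustration: the vectors for “Thinking” and “Machines” and the matrices W_Q, W_K, W_V are random here, and the sizes `d_model` and `d_k` are arbitrary assumptions. In a real model the matrices are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 4, 3  # assumed toy sizes

# Two toy input vectors standing in for X_1 = "Thinking" and X_2 = "Machines".
X = rng.normal(size=(2, d_model))

# Trainable in a real model; random placeholders here.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

# One matrix product per projection gives the q, k, v vectors of every token.
Q = X @ W_Q
K = X @ W_K
V = X @ W_V

# Row i holds (q_i, k_i, v_i) for the i-th input word.
q_1, k_1, v_1 = Q[0], K[0], V[0]
q_2, k_2, v_2 = Q[1], K[1], V[1]
print(Q.shape, K.shape, V.shape)
```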

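* QK^T computes how related Q and K are and outputs a score, which is then scaled by \sqrt{d_k}:

score = QK^T / \sqrt{d_k}

For the example sentence above, the computation proceeds as follows.

The Q of “The” is scored against the K of every word in the sentence to measure how related they are, then the same is done for “animal”, and so on for each word. In \sqrt{d_k}, d_k is the dimension of the key vectors. The scores are divided by this value because dot products grow in magnitude as the key dimension increases; the scaling compensates for that and also gives more stable gradients. Each row of scores then passes through softmax, which maps the scores to values between 0 and 1 that sum to 1.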
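To see why the scaling matters, the following sketch compares raw and scaled scores for several key dimensions and then applies softmax. The inputs are random standard-normal vectors, an assumption made only for illustration; the helper `softmax` is defined here and is not taken from any library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum first for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len = 11  # number of words in the example sentence

spread = {}
for d_k in (4, 64, 1024):
    Q = rng.normal(size=(seq_len, d_k))
    K = rng.normal(size=(seq_len, d_k))
    raw = Q @ K.T                 # unscaled scores, spread grows with d_k
    scaled = raw / np.sqrt(d_k)   # scaled scores, spread stays roughly constant
    spread[d_k] = (raw.std(), scaled.std())
    print(d_k, raw.std(), scaled.std())

# Softmax over the last axis: each row (one query) becomes weights in [0, 1].
weights = softmax(scaled, axis=-1)
print(weights.sum(axis=-1))
```

Without the division, large scores push softmax toward a near one-hot output, where gradients become very small.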

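The scores that have passed through softmax are then multiplied by V. Written as a formula and applied to the example sentence above, this looks as follows.

Formula )

Attention(Q, K, V) = softmax(QK^T / \sqrt{d_k}) V

Example )

(Figure: the formula applied to the example sentence)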

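Multiplying each score by V is how the relationship between Q and K gets reflected in V. Put simply, a K that is closely related to Q is more important (high score) and a K that is weakly related is less important (low score), and this importance is what gets applied to V.

In the figure, a V with a high score appears sharp, like the one at the top, and the lower the score, the fainter it becomes, like the ones further down.

Finally, the weighted V vectors are summed to obtain the value corresponding to each token. For example, summing all the weighted V vectors for “The” gives a vector that represents the meaning of “The” within the whole sentence. Defined and implemented as a Python class, this looks like the following :)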
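A tiny numeric sketch of this weighting, using hand-picked softmax weights and value vectors (both are made-up numbers chosen so the result is easy to check by hand):

```python
import numpy as np

# Assumed attention weights after softmax: one row per query, rows sum to 1.
weights = np.array([
    [0.9, 0.1],   # query 1 attends mostly to key 1
    [0.5, 0.5],   # query 2 attends equally to both keys
])

# Assumed value vectors, one row per token.
V = np.array([
    [10.0, 0.0],
    [0.0, 10.0],
])

# Each output row is the weighted sum of the value rows.
context = weights @ V
print(context)
# Row 1: 0.9 * [10, 0] + 0.1 * [0, 10] = [9, 1]
# Row 2: 0.5 * [10, 0] + 0.5 * [0, 10] = [5, 5]
```

The first output stays close to the first value vector because its score was high, while the second output is an even mix of both.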

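import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor
from typing import Tuple


class ScaledDotProductAttention(nn.Module):
    """Computes softmax(QK^T / sqrt(dim)) V for batched inputs."""

    def __init__(self, dim: int):
        super().__init__()
        self.sqrt_dim = np.sqrt(dim)  # scaling factor sqrt(d_k)

    def forward(self, query: Tensor, key: Tensor, value: Tensor) -> Tuple[Tensor, Tensor]:
        # query: (batch, q_len, dim), key: (batch, k_len, dim), value: (batch, k_len, d_v)
        score = torch.bmm(query, key.transpose(1, 2)) / self.sqrt_dim  # (batch, q_len, k_len)
        attn = F.softmax(score, -1)       # normalize over the keys
        context = torch.bmm(attn, value)  # weighted sum of the values
        return context, attn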
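A short usage sketch for the class above. The class is repeated in compact form so the snippet runs on its own (with `math.sqrt` in place of `np.sqrt`, which gives the same value). The batch size, sequence length, and the `nn.Linear` layers standing in for W_Q, W_K, W_V are assumptions for illustration, not part of the original code.

```python
import math
from typing import Tuple

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor


class ScaledDotProductAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.sqrt_dim = math.sqrt(dim)

    def forward(self, query: Tensor, key: Tensor, value: Tensor) -> Tuple[Tensor, Tensor]:
        score = torch.bmm(query, key.transpose(1, 2)) / self.sqrt_dim
        attn = F.softmax(score, dim=-1)
        context = torch.bmm(attn, value)
        return context, attn


torch.manual_seed(0)
batch_size, seq_len, d_model, d_k = 2, 11, 16, 8  # assumed toy sizes

# Random stand-in for the embedded input sentence(s).
x = torch.randn(batch_size, seq_len, d_model)

# Hypothetical projection layers playing the role of W_Q, W_K, W_V.
w_q = nn.Linear(d_model, d_k, bias=False)
w_k = nn.Linear(d_model, d_k, bias=False)
w_v = nn.Linear(d_model, d_k, bias=False)

attention = ScaledDotProductAttention(dim=d_k)
# Self-attention: query, key, and value are all projections of the same x.
context, attn = attention(w_q(x), w_k(x), w_v(x))

print(context.shape)  # (batch, seq_len, d_k)
print(attn.shape)     # (batch, seq_len, seq_len)
```

Each row of `attn` holds one token's weights over all tokens in the sentence, and the matching row of `context` is that token's weighted sum of value vectors.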
