Pruning via Merging: Compressing LLMs via Manifold Alignment Based Layer Merging

EMNLP 2024

👀 Summary 👀


✨ Point ✨
(1-1) Extracting Layer Activations
(1-2) Applying the Diffusion Kernel algorithm (dimensionality reduction)

(2-1) Build a similarity matrix using NPIB
(2-2) Introduce an adaptive weight allocation function to determine the optimal merging ratio.
(2-3) Fuse (merge) the parameters of the selected (similar) layers via a weighted sum.

 

Getting the flow from the Abstract

The complexity and scale of LLMs make them hard to use in resource-limited environments.

Compression techniques such as parameter pruning have a problem: they do not make effective use of the knowledge held in the parameters they remove.

-> Manifold-Based Knowledge Alignment and Layer Merging Compression (MKA)

Uses manifold learning (the Diffusion Kernel algorithm) and the Normalized Pairwise Information Bottleneck (NPIB) to select(?) similar layers to merge.

(What exactly is the unit of a "layer" here? Surely they are not merging just the activations.)

=> Performance is preserved at a substantial compression ratio.

Applying quantization on top gives even better compression.

 


1. Introduction

computational resources, memory requirements, and energy consumption are all pain points

- Model compression can be roughly divided into two families

1) quantization

     - Uses fewer bits (low-precision values)

     - Depends on hardware support

     - Sometimes needs additional fine-tuning

2) pruning

     - Sometimes needs no retraining (isn't that the same point as above, just spun the other way;;)

     - hardware-friendly

     - While effective, pruning usually risks losing valuable model structures and determining how to prune the LLM with minimal disruption to the origin remains an unsolved problem [LLM-Pruner]

 

- To address these problems, the paper delves into model merging.

1) Merging several models (existing work has been limited to this)

     - Smoothly combines the strengths and knowledge of several models

     - Averages the weights of models that share the same architecture

     - Reportedly can even improve(!) performance by cancelling out biases and errors (ref)

2) Merging identical structures inside a single model (this paper goes here!)

     - Can a model be compressed by gradually merging knowledge across layers, reducing the total number of layers?

-> Proposes the Manifold-Based Knowledge Alignment and Layer Merging Compression (MKA) method

 

(1) Manifold Learning for LLM Knowledge:

Extract the layer activations for manifold learning

-> Apply the Diffusion Kernel algorithm to align knowledge across layers.

==> Captures the nonlinear structure in the activations better.

Allows dimensionality reduction while preserving the important activation features.

So the knowledge patterns of different layers can be compared effectively.

(2) Similarity Alignment Layer Merging:

- Use the Normalized Pairwise Information Bottleneck (NPIB) to compute a similarity matrix that quantifies the similarity between layers.

- This measure computes inter-layer similarity by maximizing mutual information while taking each layer's entropy into account.

(come again?)

- Based on this similarity matrix, similar layer pairs are selected and merged.

(But how is the merge actually done?? Just averaging, like the earlier methods??)

 

cf. Information Bottleneck

 

 

Main Contributions:

- innovative model compression technique: MKA (applies manifold learning for alignment / integrates knowledge across layers) -> reduces model size while preserving performance

- develop a manifold-based knowledge alignment approach (Diffusion Kernel & NPIB) -> captures similarity in parameter space well and makes alignment possible

- Uses a range of benchmark datasets and a range of LLMs -> achieves substantial compression without a large drop in model performance

(The first and second read like the same point; is there actually something more?)

 

2. Manifold-Based (dimensionality-reduction-based) Knowledge Alignment and Layer Merging

Builds on the redundancy found in the later layers of the model. [ref]

Layers with high input-output similarity are merged from back to front. (Seems to be the same direction as the method I had in mind.)

Because high-dimensional intermediate states are hard to analyze, the paper first describes how those states are extracted and reduced in dimension.

It then proposes a layer merging technique based on similarity alignment.

The technique aims to preserve performance by aligning the intermediate states as it merges.

 

2.1. Manifold Learning for LLM Knowledge

To align knowledge across LLM layers effectively, MKA uses manifold learning to capture the complex nonlinear dependencies inside the LLM's internal structure. This makes it possible to compare and align layer activations in a meaningful way, reducing model complexity while preserving the essential information.

Extract the layer activations H^l. (dataset: w)

Here, activations are the outputs of each layer given the input samples. (So.. H is the hidden representation, I guess; it is explained right below.)

The Diffusion Kernel algorithm is used to map the high-dimensional activations into a low-dimensional space. (Right, LLM internals run to things like 128 dimensions, so yes, that is high-dimensional..)

 

Extracting Layer Activations:

Extract the activations of each layer (H^l)

Constructing the Pairwise Distance Matrix:

Compute the pairwise Euclidean distance matrix (D)

This gives the distance between every pair of activations.
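The distance-matrix step can be sketched with NumPy as below. This is a minimal illustration, not the paper's code: the function name `pairwise_euclidean`, the toy shape (6 samples, 16 dimensions) and the random stand-in for H^l are all assumptions made for the example.

```python
import numpy as np

def pairwise_euclidean(H: np.ndarray) -> np.ndarray:
    """Return the (n, n) matrix of Euclidean distances between the rows of H."""
    sq_norms = np.sum(H ** 2, axis=1)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, computed for all pairs at once
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (H @ H.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negatives from round-off
    D = np.sqrt(sq_dists)
    np.fill_diagonal(D, 0.0)
    return D

# Toy stand-in for one layer's activations H^l (hypothetical data)
rng = np.random.default_rng(0)
H_l = rng.normal(size=(6, 16))
D = pairwise_euclidean(H_l)
```

Each row of `H_l` plays the role of one activation vector, so `D[a, b]` is the distance between sample a and sample b at that layer.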

 

Applying the Diffusion Kernel:

The Diffusion Kernel algorithm transforms the distance matrix (D) into a low-dimensional manifold representation (Φ_i).

σ_K : the kernel bandwidth parameter (set to 8 in the experiments, according to the paper)
EigVectors_d : eigenvectors corresponding to the d smallest eigenvalues of the Laplacian matrix L

This transformation captures the key features and relationships present in the activations and makes effective comparison between different layers possible.

  ..skipping the details for now..~
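Since the notes skip the details, here is a rough sketch of what a diffusion-kernel embedding usually looks like. The Gaussian kernel form and the symmetric normalised Laplacian are assumptions taken from the common diffusion-map construction, not confirmed from the paper; only σ_K = 8 and "eigenvectors of the d smallest eigenvalues of L" come from the notes. The function name `diffusion_embedding` and the toy data are hypothetical.

```python
import numpy as np

def diffusion_embedding(D: np.ndarray, sigma_k: float = 8.0, d: int = 2) -> np.ndarray:
    """Map a pairwise distance matrix D to d-dimensional coordinates."""
    # Gaussian affinity built from the distances (assumed kernel form)
    K = np.exp(-(D ** 2) / (2.0 * sigma_k ** 2))
    # Symmetric normalised graph Laplacian L = I - Deg^{-1/2} K Deg^{-1/2} (assumed)
    deg = K.sum(axis=1)
    inv_sqrt_deg = 1.0 / np.sqrt(deg)
    L = np.eye(D.shape[0]) - inv_sqrt_deg[:, None] * K * inv_sqrt_deg[None, :]
    # eigh returns eigenvalues in ascending order for symmetric matrices,
    # so the first d columns belong to the d smallest eigenvalues
    _, eigvecs = np.linalg.eigh(L)
    return eigvecs[:, :d]

# Toy activations and their pairwise Euclidean distances (hypothetical data)
rng = np.random.default_rng(0)
H_l = rng.normal(size=(10, 32))
D = np.linalg.norm(H_l[:, None, :] - H_l[None, :, :], axis=-1)
Phi = diffusion_embedding(D, sigma_k=8.0, d=3)
```

Each row of `Phi` is the low-dimensional coordinate of one sample. Some implementations drop the first eigenvector because it carries little information; the sketch keeps it to match the wording "d smallest eigenvalues".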

 

* Dataset used : the first question from the 57-question MMLU dataset

 

2.2. Similarity-based Layer Merging (now the actual merging)

Layer merging is carried out on a similarity basis, using the manifold learning representations.

The Normalized Pairwise Information Bottleneck (NPIB) metric is used to quantify the similarity between layers!

(1) Build a similarity matrix using NPIB

(2) Introduce an adaptive weight allocation function to determine the optimal merging ratio.

(3) Fuse the parameters of the selected (similar) layers via a weighted sum. (Waaait, a weighted sum?;;; Is such an old-school method okay?? Does that really preserve knowledge??? Probably worth trying out, though ... no distillation loss, just a weighted sum... lol, that would make things really simple, wow)

 

Constructing the Similarity Matrix (1)

The Normalized Pairwise Information Bottleneck (NPIB) quantifies the information shared between layers while normalizing by each layer's individual entropy, which is supposed to.. make it an ideal measure for comparing knowledge patterns across layers..

p(x, y) : joint probability distribution of E_i and E_j
p(x) : marginal probability distribution of E_i
p(y) : marginal probability distribution of E_j

This similarity matrix shows which layers have aligned knowledge representations. (i.e., it lets you decide which ones to merge)
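To make the idea concrete, here is a small sketch of an entropy-normalised mutual information score. It assumes the score has the form I(E_i; E_j) / sqrt(H(E_i) H(E_j)) and estimates the distributions with a plain 2-D histogram; both choices are my assumptions for illustration, and the exact NPIB definition should be checked against the paper. The names `entropy` and `npib_like` and the toy vectors are hypothetical.

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def npib_like(e_i: np.ndarray, e_j: np.ndarray, bins: int = 8) -> float:
    """Mutual information of two 1-D embeddings, normalised by their entropies."""
    joint, _, _ = np.histogram2d(e_i, e_j, bins=bins)
    p_xy = joint / joint.sum()          # joint distribution p(x, y)
    p_x = p_xy.sum(axis=1)              # marginal p(x)
    p_y = p_xy.sum(axis=0)              # marginal p(y)
    outer = p_x[:, None] * p_y[None, :]
    nz = p_xy > 0
    mi = float(np.sum(p_xy[nz] * np.log(p_xy[nz] / outer[nz])))
    return mi / np.sqrt(entropy(p_x) * entropy(p_y))

# Three toy "layer embeddings": the second is a noisy copy of the first,
# the third is unrelated (hypothetical data)
rng = np.random.default_rng(0)
base = rng.normal(size=500)
E = [base, base + 0.1 * rng.normal(size=500), rng.normal(size=500)]
S = np.array([[npib_like(a, b) for b in E] for a in E])
```

In this toy setup `S[0, 1]` comes out well above `S[0, 2]`, which is the behaviour you would want when picking layer pairs to merge.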

 

Calculate Weight Ratio (2)

The merging weight is set based on the similarity difference between layers.

When the similarity difference between the two layers is large, Ψ assigns a higher weight to the layer with the higher similarity and reduces the weight of the layer with the lower similarity.

(Huh? What is the reference layer here?;;;; Two layers are being compared, and one of the two is "more similar"??? What does that even mean)

λ_m : the merging ratio
Ψ : the adaptive weight allocation function

 

 

Merging Layer Parameters (3)

fused parameters ☞ weighted sum

L_m : the newly merged layer (the merger of L_i and L_j)
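A weighted-sum merge of two layers can be sketched as follows, assuming the fused parameters take the form θ_m = λ_m θ_i + (1 − λ_m) θ_j applied to every tensor of the layer. That form, the dict-of-arrays layout, the parameter names `attn.w` and `mlp.w`, and the fixed λ_m = 0.75 are all assumptions for illustration; how Ψ actually produces λ_m is not covered in these notes.

```python
import numpy as np

def merge_layers(theta_i: dict, theta_j: dict, lam: float) -> dict:
    """Fuse two layers' parameters: lam * theta_i + (1 - lam) * theta_j."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("merging ratio must lie in [0, 1]")
    if theta_i.keys() != theta_j.keys():
        raise ValueError("layers must have the same parameter names")
    return {name: lam * theta_i[name] + (1.0 - lam) * theta_j[name]
            for name in theta_i}

# Two toy layers with identical structure (hypothetical parameter names)
layer_i = {"attn.w": np.ones((2, 2)), "mlp.w": np.full((2, 3), 2.0)}
layer_j = {"attn.w": np.zeros((2, 2)), "mlp.w": np.full((2, 3), 4.0)}
layer_m = merge_layers(layer_i, layer_j, lam=0.75)
```

With λ_m = 0.5 this reduces to plain averaging, so the adaptive ratio is the only thing that separates it from the simple weight-averaging mentioned in the Introduction.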

 

 

3. Experiments

a comprehensive set of experiments to evaluate the effectiveness and generalizability of our MKA method across various domains

3.1. Experimental Setup

3.1.1. Datasets

(all measured by accuracy)

MMLU

PIQA

HellaSwag

RACE-H

BoolQ

 

3.1.2. LLMs

Llama-2

Llama-3

Mistral-7B

 

3.1.3. Baselines

(1. existing pruning methods / 2. existing pruning methods + quantization)

 

SparseGPT

ShortGPT

Reverse Pruning (a heuristic that treats a layer's importance as inversely proportional to its position in the model, aiming to preserve the early layers first.)

SmoothQuant

GPTQ

AWQ

 

3.2. In what ways does MKA surpass conventional pruning techniques?

MMLU dataset using the Llama3-8B, Llama3-70B, Mistral-7B, Llama2-7B, and Llama2-13B models

MMLU

- The compression ratio reaches 43.5% on Llama3-8B, 40% on Mistral-7B, and a surprising 57.5% on Llama2-13B

- Both approaches eventually see model performance collapse, but the merging approach can delay the collapse to some extent and keep performance stable for longer

- Because the proposed method is based on Reverse Prune (when did they say that..), the scores on Llama3-8B, Llama2-7B, and Llama2-13B are very close to Reverse Prune (in the other cases they were not always close...)

 

3.3. How Does MKA Combined with Quantization Perform Compared to Pruning Combined with Quantization?

 

- The pruned models can be quantized further while still holding their performance, reaching an even higher compression ratio

- For example, on Llama3-8B, at a compression ratio of 85.94%, MKA with SmoothQuant achieves 64.20%, far exceeding ShortGPT with SmoothQuant at 37.66%.

 

- Whaaaat?????? A pruning ratio above 80%?????? Is there anything even left?? How does that make sense,,?

I guess that is just how the numbers come out once quantization is involved ...........
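One way the 85.94% figure can arise is by multiplying the two savings together. The sketch below assumes 14 of Llama3-8B's 32 layers are merged away (43.75%) and that the remaining weights go from 16-bit to 4-bit; both assumptions are my own reading, since the notes only give the final number, and the calculation counts size purely by layers and bits.

```python
# Assumed setup (not stated in the notes): 32 layers, 14 merged away,
# weights stored in 16 bits before and 4 bits after quantization.
layers_total = 32
layers_merged_away = 14
bits_before = 16
bits_after = 4

layer_fraction_left = (layers_total - layers_merged_away) / layers_total  # 0.5625
bit_fraction_left = bits_after / bits_before                              # 0.25
combined_ratio = 1.0 - layer_fraction_left * bit_fraction_left

print(f"combined compression ratio: {combined_ratio * 100:.4f}%")
```

Under those assumptions the combined ratio is 1 − 0.5625 × 0.25 = 0.859375, i.e. 85.9375%, which rounds to the quoted 85.94%. So a ratio above 80% does not mean 80% of the layers are gone; most of it comes from the lower bit width.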

 

 

3.4. MKA vs. Other Pruning Methods on various benchmarks

Llama3-8B

ratios of {34.375%, 37.5%, 40.625%, 43.75%}
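These oddly precise ratios are what you get from whole layers out of 32, which is the number of transformer layers in Llama3-8B. A quick check, assuming the ratio is simply "layers merged away / total layers" (my reading, not something spelled out in the notes):

```python
from fractions import Fraction

TOTAL_LAYERS = 32  # Llama3-8B

# Ratio for k merged layers, kept as exact fractions to avoid rounding noise
ratios = {k: Fraction(k, TOTAL_LAYERS) for k in range(11, 15)}

for k, r in ratios.items():
    print(f"{k} layers -> {float(r) * 100}%")
```

So the four settings would correspond to merging away 11, 12, 13, and 14 layers respectively.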

 

- Good performance on every dataset

- For example, at a compression ratio of 34.375% on the MMLU dataset, our method can outperform ShortGPT by 21.92% and SparseGPT by 20.42%.

- Oh.. pruning really tanks performance on PIQA and HellaSwag (and MMLU)

3.5. Are Inter-Layer Knowledge Alignment Similarity Matrices Consistent Across different Large Models?

 

- Visualize the knowledge alignment and layer merging effects of MKA on various models. (Hm, it says "before and after MKA", but which part is the before and which is the after;;;;;;;)

- Overall, the later layers of the model show high similarity

- Importance of the early layers -> Additionally, when merging the earlier layers, we notice a collapse of the matrix in the final figure, suggesting that earlier layers have a significant influence on later layers.

 

4. Discussion

4.1. Extension to Multimodal and Specialized Models

- Also applicable to MoE and Mamba models. (Both show similar redundancy.)

- The similarity distributions of Jamba and Mixtral-8x7B show a slightly different trend from the LLMs(!) (Plausible given the different architectures, but I am curious about the reason.)

* Mixtral-8x7B : uses attention; Mistral 7B + Mixture of Experts

* Mamba : a State Space Model (SSM) based sequence model with a parallelizable, RNN-like structure that does not use attention

* Jamba : an open-source MoE model built on Mamba

 

 

4.2. Analysis of Similarity Measures

Llama3-8B

similarity metric : {Cosine Similarity, Mahalanobis Distance, Euclidean Distance, t-SNE Similarity, Autoencoder Similarity}

 

 

Cosine Similarity, Mahalanobis Distance, and Euclidean Distance show similar distribution patterns, with vertical stripes and varying heat values. However, Mahalanobis Distance shows irregular heat values within those stripes, which indicates a mismatch with the fused layer data structure. t-SNE Similarity looks random and lacks a consistent pattern. For Autoencoder Similarity, the high heat values do not line up with the appropriate merging regions or the regions where high similarity is expected.

- So the claim seems to be that computing similarity through manifold learning works best, since ordinary metrics come out looking odd like that.. (Other papers did not see this, though... maybe the results differ because the similarity was computed on a different matrix?)

 

4.3. Variations in Accuracy Across Different MMLU Subjects During Layer Merging

Subject : {College Medicine, College Biology, High School Psychology, College Physics}

- High School Psychology stays stable, with only slight fluctuations in accuracy

- College Biology drops sharply in accuracy at a 12.5% merging ratio and then recovers

- College Physics fluctuates frequently in accuracy, showing high sensitivity to layer merging

- College Medicine improves steadily, with only minor fluctuations

- What does it mean that performance goes up after removing more layers? (Ah, is this not being continually updated the way SLEB does it? Or is the set of pruned layers different at each ratio?)

 

5. Conclusion

(Lost interest once it turned out the merge is just a weighted sum .. :(

Here the adaptive ratio was computed from similarity-something-or-other,

but rethinking that part could at least yield a method that computes an adaptive ratio some new way and then takes the weighted sum.

(Because I want to propose a method that does not need importance scores or similarity combinations at all) )

It looks complicated, sensitive, and not that great; did novelty alone carry it?

I did not pick apart every equation (there is no code either, so I do not know how it was actually done); I read it at a high level, focusing on the methodology and the important parts.

The merge unit is said to be the layer, but it never says exactly which weights' parameters were merged, so I cannot tell.... All of them, maybe?

 

Limitations

A limitation: the manifold learning step depends heavily on the quality of the input dataset.

The number (amount) of sample data also significantly impacts the manifold learning result.

Keeping the condition number at or below 2000 is critical for ensuring the accuracy of the learned manifold representations. If the dataset used to extract activations does not adequately cover the model's operational range, the learned manifold representation may fail to capture the true geometric structure of the data.

The current MKA implementation has been tested mainly on transformer-based architectures. The authors believe deep neural networks are inherently redundant, but whether MKA can be applied to other architectures such as CNNs or RNNs, and whether it would achieve the same compression benefit there, has not yet been explored enough.

* Condition Number : the condition number of a function y=f(x) expresses how much the output y changes relative to a small change in the input x; it is a measure of the function's sensitivity
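For a matrix, the condition number in the 2-norm is the ratio of the largest to the smallest singular value, and NumPy exposes it as `np.linalg.cond`. The sketch below only illustrates the definition and a check against the 2000 threshold; the notes do not say which matrix the paper measures, so the two diagonal matrices here are made-up examples.

```python
import numpy as np

THRESHOLD = 2000.0  # bound quoted in the Limitations section

# Hypothetical matrices: one well-conditioned, one badly conditioned
A_good = np.diag([1.0, 0.5])
A_bad = np.diag([1.0, 1e-4])

cond_good = np.linalg.cond(A_good)  # largest / smallest singular value
cond_bad = np.linalg.cond(A_bad)

for name, c in [("A_good", cond_good), ("A_bad", cond_bad)]:
    status = "ok" if c <= THRESHOLD else "too ill-conditioned"
    print(f"{name}: condition number {c:.1f} -> {status}")
```

A large condition number means small perturbations in the input can be amplified a lot in the output, which is why an ill-conditioned matrix makes the downstream eigen-decomposition less trustworthy.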

 

 

 

 

 

 

 

 

 

 

 


References:

Manifold Learning https://bkshin.tistory.com/entry/%EC%BB%B4%ED%93%A8%ED%84%B0-%EB%B9%84%EC%A0%84-7-%EC%98%A4%ED%86%A0%EC%9D%B8%EC%BD%94%EB%8D%94AutoEncoder%EC%99%80-%EB%A7%A4%EB%8B%88%ED%8F%B4%EB%93%9C-%ED%95%99%EC%8A%B5Manifold-Learning

Condition number https://pasus.tistory.com/103