
LaCo: Large Language Model Pruning via Layer Collapse

 

EMNLP 2024

Code

👀 Summary 👀


✨ Point ✨


 

Getting the gist from the Abstract

 

Methods such as quantization, knowledge distillation, and model pruning are each constrained by practical issues (hardware support, extensive training, and changes to the model's internal structure).

-> The paper proposes Layer Collapse (LaCo), a concise layer-wise pruner.

The method collapses the rear layers of the model into an earlier layer. -> It shrinks the model while preserving its structure.

Even at pruning ratios of 25-30%, the pruned model keeps more than 80% of the original performance (outperforming existing SOTA pruning methods). Really..?

In addition, the paper runs post-training experiments and discusses layer-wise similarity and various pruning ratios.

 


1. Introduction

Transformer-based LLMs show considerable ability across a wide range of tasks, but as models grow, so does their demand for computational resources.

Methods for speeding up inference, cutting training cost, and producing smaller models: quantization, knowledge distillation, model pruning

However, each of these methods has drawbacks.

- quantization: requires specific hardware and sometimes affects model performance (is it really OK to write something this obvious, which applies to the other methods too?)

- KD: a smaller model has to be retrained.

- non-structured pruning: the model becomes sparse, which degrades performance, and specific hardware is required.

- structured pruning: changes the model's structure or reduces its portability.

 

With these issues in mind, the paper proposes a new method.

- Prune some layers directly from an already trained LLM.

- Substitute the parameters of one layer for multiple layers (i.e., merge several layers into one).

The authors found that merging the parameter differentials of a given layer with its subsequent layers does not significantly affect model performance.

They call this Reserving-Differences-while-Seeking-Common (RDSC) Layer Merge.

 

 

In this paper:

- Removing 30-50% of the layers preserves performance without additional training. Tests on a range of benchmarks show it outperforms SOTA pruning methods.

- It preserves the internal structure of the LLM, so the pruned model can be integrated into existing applications without changing the system implementation.

- Post-training is carried out to check whether the compressed model has effectively inherited the original parameters and can be restored to the original model's level with only minimal training.

 

 

2. Method

2.1. Reserving-Differences-while-Seeking-Common Layer Merge

l : a layer of the LLM
$\theta_l$ : all parameters of the l-th layer
$\theta^*_l$ : the final merged layer

In the merge, $\theta_{l+k} - \theta_l$ is the layer-wise parameter difference between layer l+k and layer l.
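Written out with the symbols above, and assuming the form of the merge rule used in the LaCo paper, the merged layer is the base layer plus the sum of the differences of the layers that follow it. The only extra symbol here is m, the number of following layers merged into layer l:

$$
\theta^*_l = \theta_l + \sum_{k=1}^{m} \left(\theta_{l+k} - \theta_l\right)
$$

With m = 1 this reduces to $\theta^*_l = \theta_{l+1}$, so a single-step merge simply keeps the later layer's parameters.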

In practice, the self-attention (SAN) and MLP layers are merged separately.

The m layers that have been merged in are then discarded.

Later pruning steps keep applying RDSC Layer Merge, which can be seen as layers continuously collapsing onto a specific layer. Hence the name 'Layer Collapse'.

 

2.2. Layer Collapse

Starting from the topmost layer (i.e., the rear of the model), adjacent layers are merged dynamically.

A few-shot calibration sample set is used to keep the performance loss relative to the original model to a minimum.

(1) Preparation

M : LLM
C : number of layers to merge at a time
[L, H] : range of layers eligible for merging
I : minimum interval between merge operations
D : few-shot calibration data
T : similarity threshold between the original model and the merged model

 

(2) Pruning (lines 1-17)

l : layer pointer, initialized to H - C, so the computation works downward from the rear layers.

K : the smaller of C - 1 (the number of layers to merge, minus 1) and M^* - l (the total number of layers in the current model, minus l)

 

RDSC Layer Merge (lines 4-5)

- Merge the K layers immediately after layer l into layer l, then remove those K now-redundant layers.
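The merge step can be sketched in plain Python. This is a minimal illustration, not the authors' code: the `Layer` type (a dict from a parameter name to a flat list of weights) and the function name `rdsc_merge` are assumptions made for the example. Each named parameter is merged on its own, which mirrors handling the SAN and MLP weights separately.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flat list of weights.
Layer = Dict[str, List[float]]


def rdsc_merge(layers: List[Layer], l: int, k: int) -> List[Layer]:
    """Merge layers l+1 .. l+k into layer l, then drop them (returns a new list)."""
    base = layers[l]
    merged: Layer = {}
    for name, theta_l in base.items():
        out = list(theta_l)
        for j in range(1, k + 1):
            theta_lj = layers[l + j][name]
            # Add the difference (theta_{l+j} - theta_l), element by element.
            out = [o + (b - a) for o, a, b in zip(out, theta_l, theta_lj)]
        merged[name] = out
    # Keep everything before l, the merged layer, and everything after l+k.
    return layers[:l] + [merged] + layers[l + k + 1:]


# Toy model with four layers and a single two-element parameter per layer.
layers = [{"w": [1.0, 2.0]}, {"w": [1.5, 2.0]}, {"w": [1.0, 3.0]}, {"w": [9.0, 9.0]}]
pruned = rdsc_merge(layers, l=0, k=2)
print(pruned)  # [{'w': [1.5, 3.0]}, {'w': [9.0, 9.0]}]
```

Layers 1 and 2 are folded into layer 0 and removed, so the four-layer toy model ends up with two layers of the same shape.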

 

 

Calculate similarity (line 6)

- Using the calibration data (D), obtain the output hidden states of the last layer of both the original model and the compressed model.

- Compute the similarity score (s) between the two.
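A minimal sketch of the similarity score, assuming it is the cosine similarity of the two models' last-layer hidden states averaged over the calibration samples. The function names and the "one vector per calibration sample" representation are assumptions for illustration; a real setup would take these vectors from the model's forward pass.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def avg_cos_sim(orig_states: List[Sequence[float]],
                pruned_states: List[Sequence[float]]) -> float:
    """Average cosine similarity over calibration samples (one vector per sample)."""
    sims = [cosine(u, v) for u, v in zip(orig_states, pruned_states)]
    return sum(sims) / len(sims)


# Two calibration samples: the first pair is identical, the second is orthogonal.
orig_states = [[1.0, 0.0], [0.0, 2.0]]
pruned_states = [[1.0, 0.0], [2.0, 0.0]]
s = avg_cos_sim(orig_states, pruned_states)
print(s)  # 0.5
```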

 

 

Merge Evaluation and Adjustment (lines 7-15)

- If the similarity score exceeds the threshold (T), the merge is accepted.

The pointer l then moves down by the interval I.

- Because accepted merges reduce the number of layers, the pointer l may end up outside the valid range, so l is reset to M^* - C. (line 11)
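Putting the steps together, the loop can be sketched as follows. This is a hedged toy version, not the paper's implementation: layers are single floats, `toy_merge` and the `similarity` callback are hypothetical stand-ins, and two details are assumptions. First, a rejected merge moves the pointer down by one layer, which matches the worst case described in the complexity analysis. Second, because the list is 0-indexed, K is capped at the number of layers that actually follow l.

```python
from typing import Callable, List

Layers = List[float]  # toy stand-in: one scalar "parameter" per layer


def toy_merge(layers: Layers, l: int, k: int) -> Layers:
    """RDSC-style merge of layers l+1 .. l+k into layer l (toy scalar version)."""
    merged = layers[l] + sum(layers[l + j] - layers[l] for j in range(1, k + 1))
    return layers[:l] + [merged] + layers[l + k + 1:]


def layer_collapse(model: Layers, C: int, L: int, H: int, I: int, T: float,
                   similarity: Callable[[Layers, Layers], float],
                   merge: Callable[[Layers, int, int], Layers] = toy_merge) -> Layers:
    pruned = list(model)
    l = H - C                                  # start near the top of the range
    while l >= L:
        # 0-based index: len(pruned) - l - 1 layers follow layer l.
        K = min(C - 1, len(pruned) - l - 1)
        candidate = merge(pruned, l, K)
        s = similarity(candidate, model)       # compare against the original model
        if s > T:
            pruned = candidate                 # accept the merge
            l -= I                             # move down by the minimum interval
            if l > len(pruned) - C:            # pointer fell outside the shrunken model
                l = len(pruned) - C
        else:
            l -= 1                             # assumption: reject, try one layer lower
    return pruned


model = [float(i) for i in range(8)]
accept_all = layer_collapse(model, C=3, L=0, H=8, I=1, T=0.5,
                            similarity=lambda cand, orig: 1.0)
reject_all = layer_collapse(model, C=3, L=0, H=8, I=1, T=0.5,
                            similarity=lambda cand, orig: 0.0)
print(accept_all)  # [0.0, 10.0]
print(reject_all == model)  # True
```

The two dummy similarity functions only exercise the control flow: one accepts every merge, the other rejects every merge and leaves the model untouched after visiting each position once.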

 

2.3. Complexity Analysis

The complexity is dominated by model inference.

In the worst case, L = 0, H = the total number of layers, and s < T in every iteration, so every layer is visited.

-> O(H × ||D||)

e.g., with Llama2-13B (40 layers) and 10 calibration samples, at most 400 inference passes are needed,

so pruning finishes within minutes on a single GPU.
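The worst-case count above is just the product of the two factors in the bound. A tiny sketch, with a hypothetical helper name, reproduces the Llama2-13B figure:

```python
def max_inference_passes(num_layers: int, num_calibration_samples: int) -> int:
    """Worst case: every layer position is tried once, each on every calibration sample."""
    return num_layers * num_calibration_samples


worst_case = max_inference_passes(num_layers=40, num_calibration_samples=10)
print(worst_case)  # 400
```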

 

 

3. Experiments

3.1. Models

Llama2-7B, 13B

Baichuan2-7B, 13B (Chinese, English)

 

3.2. Benchmarks

Evaluation tool: OpenCompass evaluation framework

 

- Reasoning: CMNLI, HellaSwag, PIQA

- Language: CHID, WSC

- Knowledge: CommonSenseQA, BoolQ

- Examination: MMLU, CMMLU

- Understanding: Race-High/Middle, XSum, C3

Zero-shot or few-shot (no additional training)

Evaluation

- perplexity (PPL) mode; generation (GEN) mode for CHID, XSum, WSC

(OpenCompass converts the scores so that a higher score means better performance)

 

3.3. Baselines

SOTA structured pruning methods were chosen as baselines.

- LLM-Pruner, SliceGPT (both of these methods outperform SparseGPT)

 

3.4. Settings

Hyperparameter Setting

Appendix A.

 

Calibration data

Llama2 : 10 random samples from English Wikipedia

Baichuan2 : 5 random samples each from English and Chinese Wikipedia

- Eng: 10 random samples from English Wikipedia

- Cn: Chinese Wikipedia

GPU

8 Nvidia A100 80GB GPUs

 

3.5. Main Results

 

- Compared with the baselines, LaCo performs slightly better even though its pruning ratio is higher.

- Its reasoning ability is slightly lower, but still at a comparable level.

- Overall, LaCo performs well, retaining about 80% of the original model's performance (whereas the other baselines do not even reach 70%).

 

Appendix D.5

- Notably, on the three benchmarks evaluated in GEN mode (CHID, XSum, WSCG), LLMs pruned with LaCo keep relatively stable performance, while models pruned with the existing methods degrade, with some results dropping to 0.00.

- Models pruned with the existing methods tend to generate meaningless, repetitive output (Table 23).

- LaCo also outperforms the baselines on Llama2-70B.

In conclusion, LaCo is a strong pruner,

and because it relies only on parameter differences and additions, without changing the model's internal structure, it is a concise and efficient pruning method.

 

3.6. Comparison of Perplexity

- Llama2-7B

27% sparsity

500 sentences selected from Wikipedia (length of 512 tokens)

 

 

 

 

3.7. Pruning Time

Llama2-7B / 27% sparsity / A100 GPU

Only the main pruning process is timed; model loading, data loading, and model saving are excluded.

- LaCo has lower time complexity and prunes faster.

 

3.8. Memory Usage and Inference Speed

Llama2-13B / English Wiki dataset / bf16 / A100 GPU

-> consumes less memory / achieves faster inference speed!

- The baselines actually ran slower at inference than the dense model (oh..), whereas LaCo has no such problem.

 

 

 

4. Further Analysis

4.1. Post-training and Re-pruning

4.1.1. Post-training

Because pruning inevitably costs some performance, the authors tested whether the LaCo model preserves the original model's parameters well enough to recover through post-training.

Llama2-7B / Baichuan2-7B

- Post-training was run with the LLaMA-Factory framework.

- Training converges quickly: the loss drops sharply and stabilizes after about 250 steps.

- The final convergence losses of the pruned 5B-size Llama2-7B and Baichuan2-7B models are 1.6 and 2.0 respectively, quite close to the values given in the technical reports for Llama2-7B (1.75) and Baichuan2-7B (1.9).

- Four Nvidia A100 80GB GPUs were used; training took about 28 hours and 35 hours respectively.

(For reference, training a 5B-parameter LLM from scratch would take hundreds of A100 GPUs running for several months.)

 

---

[ Evaluation ]

Appendix E

- For Llama2-7B, performance improves further after post-training.

   -> The consistent score gains show that a model pruned with LaCo effectively inherits the original model's parameters and can recover performance through low-cost post-training.

- For Baichuan2-7B, in contrast, some benchmarks improve while others decline.

   -> The authors speculate that post-training had limited effect because the pre-training data draws on diverse sources and its distribution differs from that of the post-training data.

4.1.2. Re-pruning

Since post-training recovered performance, the next experiment asks: can the model be pruned further, to a compression ratio of about 50%?

- Llama2-7B / 17 layers (55%)

Appendix E

- The re-pruned model retains about 70% of the original 7B model's performance.

- Better data and more data would likely give even better results.

 

4.2. Layer-wise Similarity

[ Weight similarity analysis ]

The largest L2 distance does not exceed 200; that is, adjacent layers are very similar.

Given the size of the MLP matrices (11008*4096) and of the SAN q, k, v matrices (4096*4096), the change in values between adjacent layers is small.

(An L2 distance of 200 over a 4096*4096 matrix corresponds to a root-mean-square difference of about 0.05 per element.)
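The per-element figure follows from dividing the L2 distance by the square root of the number of entries. A small check using the matrix sizes quoted above; the helper name is hypothetical, and the result is the root-mean-square difference rather than a plain mean of absolute differences:

```python
import math


def rms_per_element(l2_distance: float, rows: int, cols: int) -> float:
    """RMS difference per entry implied by an L2 (Frobenius) distance."""
    return l2_distance / math.sqrt(rows * cols)


san_rms = rms_per_element(200.0, 4096, 4096)    # SAN q, k, v matrices
mlp_rms = rms_per_element(200.0, 11008, 4096)   # MLP matrices
print(round(san_rms, 4))  # 0.0488
print(round(mlp_rms, 4))  # 0.0298
```

So the 0.05 figure corresponds to the 4096*4096 case; for the larger MLP matrices the same distance implies an even smaller per-element difference.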

 

[ Block output similarity analysis (5-a) ]

For layers 3-28, the cosine similarity is almost 1.

[ Block merge (5-b) ]

An experiment to verify that RDSC Layer Merge can replace several layers with one:

- Merge 4 consecutive layers between layers 10 and 19 into a single layer.

- Evaluate the cosine similarity between the merged layer's output and the output of the original last layer (what does that even mean)

The lowest cosine similarity over the 4096-dimensional vectors is above 0.996, confirming that RDSC Layer Merge preserves the representations well.

 

 

 

4.3. Varying Pruning Ratio

- Llama2-7B / Llama2-13B

- 10%, 25%, 50%

- Performance declines as the pruning ratio increases.

- However, performance stays at a similar level from 10% to 25%, so LaCo works stably in this range.

- Even at a 50% ratio, the model retains about 70% of the original performance.

 

5. Related Work

- Model Quantization

- Knowledge Distillation

- Model Pruning

 

6. Conclusion

This paper proposes Layer Collapse (LaCo), a concise layer-wise structured pruning method. LaCo quickly reduces model size by merging rear layers into an earlier layer. It needs no special hardware support and preserves the model's intrinsic structure. In experiments, LaCo significantly outperforms current SOTA structured pruning methods and reveals potential parameter redundancy in existing LLMs. The authors also run ablation studies over various LaCo settings. Post-training the pruned models confirms that LaCo effectively inherits the original model's parameters. Finally, they discuss the motivation from the perspective of layer-wise similarity and explore the performance of LaCo-pruned models at different pruning ratios.

 

 

Limitations

1. Because the method is layer-wise, the pruning ratio cannot be set freely.

2. Hyperparameters such as T (the similarity threshold between the original model and the merged model) need to be tuned.

3. Like the existing work (baselines), the method lacks a theoretical proof. (our method lacks a complete theoretical proof)