

A Simple Linear Patch Revives Layer-Pruned Large Language Models

NeurIPS 2025

👀 Summary 👀


✨ Method recap ✨
The paper focuses on the observation that activation magnitudes are badly mismatched, channel by channel, between the layers on either side of a pruned block.

To bring the activation scales back in line, it introduces scaling factors.
 1. channel-wise scaling : d
      The per-channel ratio of the mean activation magnitude of the activations (X) at the two layers that pruning makes adjacent.
 2. token-wise scaling : H
      Some tokens produce outliers (e.g. [BOS] ...). A Hadamard transform is applied to soften them.

The two scaling steps are merged into a single patch matrix P. (dim x dim)
P is then fine-tuned through offline distillation with a KL-divergence loss.

Done.

---
* That said, for choosing which layers to prune, they state that they simply used cosine similarity, as is common in prior work.
So this is an existing method plus a newly inserted layer, and I can't help thinking it would naturally do better.
The novelty seems to come from pairing an observed phenomenon (activation magnitude mismatch) with an impressive-looking technique borrowed from elsewhere (Hadamard transform) to build the new patch.
* And if there are outlier activations.. those come from the teacher model too, so shouldn't they be preserved? Does the Hadamard transform preserve them?? Or do those outlier activations gradually fade away when no layers are removed? If not, there's no reason to get rid of them.

---
Even the table of contents looks tidy.
It is a clever idea.
A new layer..
Still, if it gets fine-tuned anyway, this looks like nothing more than an initialization. (A good initialization? How good exactly??)
And matching the activation to the output of the later layer before feeding it in.... why does that work at all??
Is it because the activation that originally goes in as input is similar anyway???

 

Following the flow through the Abstract

 

Layer pruning is a widely used way to compress LLMs.

However, existing layer pruning methods suffer substantial performance degradation.

 

The paper finds that much of this degradation comes from a previously overlooked problem: mismatched activation magnitudes at the pruning interface.

The activation scale (????) differs a lot before and after the pruned block, which causes a distributional shift as the activations pass through the remaining layers.

What exactly is a "different activation scale"?? I'll be annoyed if this isn't explained properly.

 

To solve this, the paper proposes LinearPatch,

a lightweight, plug-and-play method that fuses two operations at the pruning interface into a single matrix multiply.

(i) a Hadamard transformation to suppress the massive outliers that arise at particular tokens

Interesting, I'm curious about this part.. even massive outliers are presumably part of the original model's knowledge.. wouldn't sampling more data be enough? Removing them on purpose?

Do they say which calibration samples or which layers the massive outliers come from? Presumably they do.

(ii) channel-wise scaling to align the activation statistics

 

 

On LLaMA-3-8B, LINEARPATCH retains 94.15% of the original performance even when 5 of the 32 layers are pruned, 4% higher than the previous SOTA method. (5 layers is .... only about 15% pruned ....... I should present my results like this too...... )

Further refining the patch with memory-efficient offline distillation on 5K unlabeled samples raises the retention to 95.16% within 30 minutes on a single GPU.


1. Introduction

 

Layer pruning is emerging because it does not rely on hardware-specific optimizations or low-level kernel modifications.

In contrast, unstructured pruning is hard to accelerate because of its irregular memory access patterns,

and structured pruning often requires changes to the model architecture or custom kernels.

Layer pruning! It is a simple approach that removes unnecessary layers without any extra dependencies.  --> But it comes with a critical challenge: severe performance degradation.


This work identifies a new phenomenon that explains the degradation: a mismatch in activation magnitude across layers and tokens at the pruning point.
Specifically, when some layers are pruned, the activations of the remaining layers often sit at different scales, so the activations of the layer before the pruning point may not align with those of the layer after it. The mismatch is made worse by the extreme outliers observed in the activations of special tokens (e.g. [BOS] or delimiter tokens). (ref1, ref2)
As a result, the pruned LLM suffers a severe activation mismatch, which ultimately leads to performance degradation.

 

To address this issue, the paper proposes LinearPatch, a plug-and-play method designed to mitigate the activation mismatch described above. LinearPatch can be applied on top of various pruning metrics with little effort.

First, a Hadamard transformation is applied to suppress the activations of special tokens (== outliers).

 

Then a channel-wise scaling parameter is introduced to close the gap in activation magnitude. By spectral theory, the Hadamard transformation and the diagonal channel-wise scaling can be expressed as a single real symmetric matrix, which is what LinearPatch uses. (wait, what? this is background I don't have)

The method aligns activation magnitudes effectively while adding almost no inference overhead.

 

Beyond alignment, the pruned LLM is further improved with memory-efficient knowledge distillation. Specifically, only the LINEARPATCH matrix is fine-tuned while all other model parameters stay frozen. As few as 5,000 samples are enough, and for a 7B-scale model it finishes within 30 minutes on a single GPU.

 

Experimental results~~

With 5 layers pruned from LLaMA-3-8B, LINEARPATCH retains 94.15% of the original performance on the benchmarks,

clearly ahead of recent methods such as LLM-Streamline (90.84%). (Nice.)

 

 

 

2. Related Work

Weight Pruning

- (unstructured) Wanda

- (structured) removes entire groups of weights (attention heads, MLP neurons, or hidden dimensions)

   - N:M sparsity

   - more hardware-friendly than unstructured pruning, but it still requires retraining.

 

 

Layer Pruning

Enter layer pruning.

Unlike width pruning, which often produces irregular architectures, layer pruning removes entire Transformer layers (i.e. both the Attention and MLP modules), so deployment and acceleration are easier.

- ShortGPT (scores each layer's importance with the cosine similarity between the layer's input and output, then removes the least important layers)

- SLEB (ppl + iterative! pruning)

- Shortened LLaMA (Taylor, ppl (computed in one shot) + LoRA)

- UIDL (introduces the angular distance between layers to identify and remove a block of consecutive layers, then applies QLoRA)

- LLM-Streamline (cosine sim + consecutive layer selection + replacement with a lightweight layer)

 

 

 

3. Method

3.1. Preliminaries on LLM Layer Pruning

- The basic formula of a transformer layer.

X: Input activation

theta : parameters

 

Pruning Metrics.

Commonly used metrics are cosine similarity [ShortGPT, LLM-Streamline], gradient-based scores [Shortened Llama, LLM-Pruner], and perplexity-based scores [Shortened Llama, SLEB].
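
To make the cosine-similarity metric a bit more concrete, here is a minimal sketch that scores a layer by how much it changes its input. The function names and toy vectors are made up for this note, and defining importance as `1 - cosine similarity` follows the ShortGPT-style idea from Related Work rather than any exact formula taken from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def layer_importance(hidden_in, hidden_out):
    """Hypothetical score: a layer whose output points the same way as its input scores near 0."""
    return 1.0 - cosine_similarity(hidden_in, hidden_out)

# Toy hidden states for a single token (assumed values, not real activations)
x_in = [1.0, 2.0, 3.0]
barely_changed = [1.1, 2.0, 2.9]
heavily_changed = [3.0, -1.0, 0.5]

scores = {
    "layer_a": layer_importance(x_in, barely_changed),
    "layer_b": layer_importance(x_in, heavily_changed),
}
# The layer with the lowest score is the first pruning candidate
least_important = min(scores, key=scores.get)
print(least_important, scores)
```

In a real setting the score would be averaged over many tokens of a calibration set instead of a single vector.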

 

Layer Pruning.

- The formula after pruning

If n consecutive layers starting from the l*-th layer are removed, the input of layer l* is fed directly into the parameters (layer) at index l*+n.
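
The index bookkeeping above can be sketched with a toy layer stack. Everything here is illustrative: `prune_layers`, `forward`, and the lambda "layers" are hypothetical stand-ins, and indices are 0-based.

```python
def prune_layers(layers, l_star, n):
    """Drop n consecutive layers starting at index l_star (0-based)."""
    if l_star < 0 or n < 0 or l_star + n > len(layers):
        raise ValueError("pruned range must lie inside the layer stack")
    return layers[:l_star] + layers[l_star + n:]

def forward(layers, x):
    """Run x through the layers in order."""
    for layer in layers:
        x = layer(x)
    return x

# Toy "layers": layer i just adds i to a scalar activation
toy_layers = [lambda x, i=i: x + i for i in range(8)]

# Remove layers 3 and 4, so the input of layer 3 now goes straight into layer 5
pruned = prune_layers(toy_layers, l_star=3, n=2)

full_output = forward(toy_layers, 0)
pruned_output = forward(pruned, 0)
print(len(pruned), full_output, pruned_output)
```

The gap between the two outputs is the toy analogue of the mismatch that the patch is meant to compensate for.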

 

However, Figure 1 shows that this creates a large mismatch in channel magnitude at the pruning interface, which severely degrades model performance. (cont. sections 3.2 / 3.3)

 

 

3.2. Channel Magnitude Alignment

Layer-wise Channel Mismatch.

As figure1(a) shows, the magnitude of the hidden states varies across layers and channels.

* channel: refers to the model's hidden dimension. 4096 for Llama2-7b.

 

To mitigate this, a channel-wise scaling factor is computed statistically.

For each channel k, the ratio of the mean activation magnitude at layer l* and at layer (l*+n) is computed on a calibration set.

This yields the scaling vector d.

 

-> Channel-wise scaling is applied
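
A minimal sketch of how such a scaling vector could be computed from calibration activations. The direction of the ratio (layer l*+n over layer l*) is my assumption, chosen so that multiplying by d moves the earlier activations toward the later scale; the function name and toy numbers are made up.

```python
def channel_scales(x_before, x_after):
    """Per-channel ratio of mean |activation|.

    x_before: activations entering the pruned block (layer l*), one row per token.
    x_after:  activations entering layer l*+n in the unpruned model, same shape.
    """
    num_channels = len(x_before[0])
    d = []
    for k in range(num_channels):
        mean_before = sum(abs(row[k]) for row in x_before) / len(x_before)
        mean_after = sum(abs(row[k]) for row in x_after) / len(x_after)
        d.append(mean_after / mean_before)
    return d

# Two tokens, two channels (toy calibration data)
x_before = [[1.0, -2.0], [3.0, 2.0]]
x_after = [[2.0, -8.0], [6.0, 8.0]]

d = channel_scales(x_before, x_after)
# Apply the same d to every token, channel by channel
scaled = [[value * d[k] for k, value in enumerate(row)] for row in x_before]
print(d, scaled)
```

Note that one d_k is shared by every token in channel k, which is exactly what becomes a problem once outlier tokens show up (section 3.3).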

 

Quantitative Evaluation.

An additional scaling factor alpha is used to perturb the scaling around d.

As figure1(b) shows, plain alpha==1 performed best. Moving away from it caused severe performance degradation.

 

 

3.3. Token Magnitude Smoothing

Token-wise Scaling Mismatch

According to recent work, massive outliers with magnitudes above 10^3 exist for particular tokens such as the [BOS] token or delimiter tokens.

So a single per-channel scale d_k may not suit every token within the channel. (figure2(a))

X_i,k : the activations of channel k for batch i

σ(·) : standard deviation

A smaller σ_d means the scaling is more consistent across tokens. (since the standard deviation is small)

However, when 9 layers of LLaMA-2-7B are pruned, σ_d = 2137.75, which shows a severe token-level mismatch.

 

Hadamard Transformation

-> Handles the per-token scaling

According to recent work [30, 34, 4, 45], applying a Hadamard transform can suppress outliers.

* Hadamard transform : a linear transform. Every entry of the matrix is +1 or -1 (up to normalization), and the matrix is orthogonal.

 

 

 

1️⃣ Building the Walsh–Hadamard matrix H₂ (2x2)

H_2 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}

- 1/root2 is the normalization constant. Multiplying by it keeps the vector length unchanged after the transform.

--> Multiplying a vector by H_2 rotates it by 45 degrees combined with a flip of one axis (strictly speaking H_2 is a reflection, since its determinant is -1).

 

2️⃣ Building the larger matrix H₂ⁿ (recursively)

H_{2^n} = H_2 \otimes H_{2^{n-1}}

Here ⊗ is the Kronecker product

 

3️⃣ When C is not 2ⁿ

C = 2^n m \quad \Rightarrow \quad H_C = H_{2^n} \otimes H_m
• If the channel count is not a power of two, it can be split into its largest power-of-two factor and the remaining factor m, provided a Hadamard matrix of size m exists.
• Orthogonality is preserved under this construction.

 

 

Thanks to the orthogonality of the Hadamard matrix (H^T * H = I), the following holds:

** Multiplying the activation X by H mixes the activation values evenly across channels,

** and multiplying by H^T again brings back the original values.

In other words, the activations are rotated without any loss of information.

 

 

This rotation redistributes the outliers across all channels and makes the activation distribution more balanced across channels.
With the rotated activations it becomes easier to apply the same scaling parameter d to every token, and σ_d drops to 230.32.
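
A toy illustration of the redistribution effect, using a butterfly-style fast Walsh-Hadamard transform instead of an explicit matrix. The token vector is invented for this note (one channel carries a 10^3-sized outlier), and the power-of-two length requirement is a limitation of this sketch, not of the paper.

```python
import math

def fwht(x):
    """Normalized fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in x]

def l2_norm(x):
    return math.sqrt(sum(v * v for v in x))

# One "outlier token": channel 0 is about 1000x larger than the rest
token = [1000.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

rotated = fwht(token)      # outlier energy is spread over all 8 channels
restored = fwht(rotated)   # the normalized transform is its own inverse

print(max(abs(v) for v in token), max(abs(v) for v in rotated))
print(l2_norm(token), l2_norm(rotated))
```

The largest entry shrinks while the vector length stays the same, and applying the transform twice returns the original token, which matches the "no information loss" point above.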

(But.. ah. So every activation received as the output (of layer l*) is multiplied by that H matrix and by the scaling parameter d, and the result is fed in as the input of the next layer (l*+n)..? - Then this has little to do with the original input of layer (l*+n); it seems to convert the activation to a magnitude similar to that layer's output before feeding it in, so what does that mean??? Oh, so this transform really is the whole method?)

 

 

3.4. LinearPatch: the Ultimate Recipe

 

First the Hadamard transform is applied to X, and then the result is scaled by D in the rotated space.

The two operations are fused into a single symmetric matrix P = H D H^T.

The last equality comes from the spectral theorem [21]: every real symmetric matrix can be decomposed into an orthogonal matrix (H) and a diagonal matrix (D). (sigh.. what is this saying..)
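
A small numeric check of the fusion, assuming row-vector activations (tokens x channels) so that the patched activation is X P with P = H D H^T. The 4x4 size, the values in `d`, and the toy `X` are all invented for illustration.

```python
def matmul(a, b):
    """Plain matrix product for lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

# Normalized 4x4 Hadamard matrix
H = [[0.5 * v for v in row] for row in (
    (1, 1, 1, 1),
    (1, -1, 1, -1),
    (1, 1, -1, -1),
    (1, -1, -1, 1),
)]
d = [2.0, 0.5, 1.0, 3.0]   # assumed channel scales in the rotated space
D = [[d[i] if i == j else 0.0 for j in range(4)] for i in range(4)]

# The patch is computed once, offline
P = matmul(matmul(H, D), transpose(H))

X = [[1.0, -2.0, 0.5, 4.0], [1000.0, 3.0, -1.0, 2.0]]   # two toy token rows

three_gemms = matmul(matmul(matmul(X, H), D), transpose(H))   # rotate, scale, rotate back
one_gemm = matmul(X, P)                                       # same result, single multiply

print(P)
print(one_gemm)
```

Because the diagonal of D holds the eigenvalues of P, a strictly positive d makes P positive definite, which is the constraint that gets relaxed later during fine-tuning.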

 

figure3. Shows that the patch matrix P effectively bridges the gap that appears in the layer-pruned LLM. LINEARPATCH also cuts the transformation overhead and enables efficient fine-tuning, because it needs only a single GEMM (General Matrix Multiplication, i.e. just an ordinary matrix multiply) instead of three separate GEMM operations.

 

 

Memory-Efficient Offline Knowledge Distillation (training!!!!!!!!!!!!!!!!!!!)

Conventional KD has to keep both the Teacher and the Student in GPU memory, which is impractical for LLMs because of the huge memory footprint. In contrast, LINEARPATCH, built on Eq. (9), supports a memory-efficient offline distillation strategy: only the teacher model's inputs and outputs are stored, and the teacher stays offline during distillation.

 

Using a small training corpus X (e.g. 5,000 samples), the teacher's top-K output logit probability distribution o_t and its indices are extracted. In practice K=100, which cuts memory usage by 320x compared with storing the full 32K vocabulary.
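
A rough sketch of the top-K bookkeeping and the 320x figure. The six-entry distribution stands in for a real vocabulary, `top_k` is a made-up helper, and "32K" is read as 32,000 entries so that the ratio comes out as stated. The ratio counts stored probabilities only and ignores the extra space for the indices.

```python
def top_k(probs, k):
    """Return (indices, values) of the k largest probabilities, largest first."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return order, [probs[i] for i in order]

# Toy teacher distribution over a 6-token vocabulary
teacher_probs = [0.02, 0.40, 0.05, 0.30, 0.01, 0.22]
indices, values = top_k(teacher_probs, k=3)

# Storage ratio per token position: full vocabulary vs. top-K entries
vocab_size, K = 32_000, 100
saving = vocab_size / K

print(indices, values, saving)
```

Only `indices` and `values` need to be written to disk per position, so the teacher never has to be loaded while the patch is being trained.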

 

Likewise, the student's top-K output logit probability distribution o_s is collected at the same indices.

 

The patch matrix P is then trained to minimize the KL divergence between the two logit probability distributions.
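
A minimal sketch of a KL loss restricted to the stored top-K entries. Renormalizing both sides over those K entries is my assumption (the notes do not say how the truncated distributions are normalized), and the numbers are toy values.

```python
import math

def renormalize(values):
    """Rescale positive values so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]

def topk_kl(teacher_topk, student_at_same_indices):
    """KL(teacher || student) over the K stored entries, after renormalizing both."""
    p = renormalize(teacher_topk)
    q = renormalize(student_at_same_indices)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

o_t = [0.40, 0.30, 0.22]          # teacher top-K probabilities (stored offline)
o_s_good = [0.40, 0.30, 0.22]     # student that matches the teacher
o_s_bad = [0.10, 0.60, 0.20]      # student that disagrees

loss_good = topk_kl(o_t, o_s_good)
loss_bad = topk_kl(o_t, o_s_bad)
print(loss_good, loss_bad)
```

In the actual method this loss would be backpropagated into P only, with every other parameter frozen.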

 

During fine-tuning, the positive-definite constraint on P is dropped for more flexibility, and the remaining model parameters are frozen to minimize memory usage.

* positive-definite constraint : the constraint that all eigenvalues of the matrix must be strictly greater than 0. It is used for numerical stability of the model, but it limits the freedom of fine-tuning.

The whole procedure is lightweight: for example, fine-tuning LLaMA-2-7B finishes in 30 minutes on a single NVIDIA V100 GPU.

 

 

 

 

- Average performance does rise as k grows, but k=100 is a sensible choice in terms of benefits versus costs.

 

- MSE reportedly did worse than KL and showed signs of overfitting.

 

 

 

4. Experiments

4.1. Setup

Models and Baselines.

[Models]

LLaMA2-7b, 13b

LLaMA3-8b

Baichuan2-7b

DeepSeek-R1-Distill

 

[Baselines]

(gradient based)

LLM-Pruner

(ppl based)

SLEB

(Taylor based)

Shortened Llama

(cosine sim based)

ShortGPT

LLM-Streamline

 

 

Evaluation.

(ppl)

WikiText-2

C4

PTB

(NLU)

MMLU

(QA)

ARC-e / c

BoolQ

HellaSwag

PIQA

WinoGrande

WSC273

RACE-h

COPA

 

- MMLU uses the official code; everything else uses lm-eval-harness

 

4.2. Implementation Details

Calibration and Fine-tuning

 Calibration :

A calibration dataset is needed to choose the layers to prune and to initialize the channel-wise scaling parameters.

128 samples with sequence length 2048 are randomly drawn from WikiText-2.

 

Appendix C

- Tested with different numbers of wiki-2 samples.

- Weighing data size against performance gain, 128 is the best choice.

Appendix D

- Using a calibration set from the same domain as the target domain improves performance.

- ppl barely changes even when the domain differs -> stability of our method (.........wow, so that's how you spin it...)

- ppl stays about the same even when data quality differs -> stability of our method (22)

 

 

 Fine-tuning :
For fine-tuning LINEARPATCH, we use AdamW with a learning rate of 1e-4, training for one epoch on 5,000 WikiText-2 sentences of length 2048.

Appendix E

- Comparing data size against the performance gain, 5000 is the best value.

 

Resource Consumption

- Uses PyTorch

- single NVIDIA V100 GPU with 24GB memory

- For a 7b model, initializing LinearPatch takes 30 seconds and fine-tuning finishes in 30 minutes.

 

 

Pruning Configurations

 Following prior work, the pruning ratio is kept below 30%

 

4.3. Main Results

 

We first evaluate the effectiveness of LINEARPATCH in the training-free setting. Specifically, we focus on commonsense question answering (QA) benchmarks and perplexity (PPL) benchmarks, which are widely used to measure how well a pruned large language model (LLM) retains its reasoning and language modeling abilities. For a fair comparison, none of the considered approaches performs fine-tuning. In particular, the LoRA-based fine-tuning stage is excluded for LLM-Pruner, and LLM-Streamline follows its official protocol except that layer replacement and offline distillation are removed; this variant is denoted LLM-Streamline (None). Results on additional LLM backbones and benchmarks are in Appendix I.

(??????????????? How can you just strip out LLM-Streamline's core method?????? If I were its author I'd be furious)

 

4.3.1. Comparison on Training-free Methods

 

Results on QA Benchmarks

 

Results on PPL Benchmarks

 


 

 

4.3.2. Comparison on Post-training Methods

 

Results on QA Benchmarks

 

Results on PPL Benchmarks

 

 

4.4. Discussions and Ablation Studies

Tunable Parameters and Loss Functions

 

 

The Ingredients of LinearPatch

 

Online Inference Overhead

 

Offline Storage Overhead

 

 

5. Conclusion

 

 

6. Limitation and Broader Impact

Limitation

Layer pruning can degrade model performance unevenly across tasks. For example, some question answering (QA) tasks may remain robust, while complex reasoning or context-dependent tasks may degrade significantly. Future work needs to build a framework for evaluating the trade-off between efficiency gains and task-specific performance.

 

Broader Impact

Layer pruning methods substantially reduce the compute cost of deploying LLMs, making them accessible to more users. However, these methods do not address the social biases embedded in LLMs; such biases often originate in the training data and can affect fairness and inclusivity. Deploying LLMs ethically is therefore very important.