

Streamlining Redundant Layers to Compress Large Language Models

ICLR 2025 Spotlight

Code

👀 Summary 👀
LLM-Streamline
1. Layer pruning
    * Pick the run of consecutive layers to remove by cosine similarity
   (other metrics are ruled out because they factor in vector magnitude; ppl is ruled out because it is too data-dependent)
2. Layer replacement
     * Replacement layer architecture: FFN, or a Transformer block (same structure as the original model)
     * Fine-tuning: train on the input hidden vector of the first replaced layer and the output hidden vector of the last replaced layer
     * Fine-tuning loss: MSE loss


* Fix the number of layers to remove up front, then compute the cosine sim across spans of that length.
For example, to remove 7 layers, compute the cos sim between the hidden vector going into layer 0 and the one coming out of layer 6, then the same for layers 1-7, and so on; take the span with the highest similarity and drop it as one chunk. (See the official code.)
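
The selection step described above can be sketched roughly as follows. This is a minimal pure-Python illustration, not the official code: `pick_block`, `cosine`, and the toy `hidden_states` layout (entry i holds the per-token vectors entering layer i, the last entry holds the final output) are assumptions made for the example.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pick_block(hidden_states, n):
    """Return (start, score) for the n-layer span whose input and output
    hidden states are most similar, averaged over tokens.
    Layers start .. start + n - 1 are the removal candidates."""
    num_layers = len(hidden_states) - 1
    best_start, best_score = None, -2.0
    for start in range(num_layers - n + 1):
        before = hidden_states[start]      # vectors entering layer `start`
        after = hidden_states[start + n]   # vectors leaving layer `start + n - 1`
        score = sum(cosine(u, v) for u, v in zip(before, after)) / len(before)
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score

def unit(deg, scale=1.0):
    rad = math.radians(deg)
    return [scale * math.cos(rad), scale * math.sin(rad)]

# Toy model with 5 layers and 2 tokens; directions barely move across layers 2-3.
angles = [0, 40, 80, 82, 84, 130]
toy_states = [[unit(a), unit(a + 1, scale=2.0)] for a in angles]
start, score = pick_block(toy_states, n=2)
print(start, round(score, 3))
```

On this toy input the span starting at layer 2 wins, since the hidden state only rotates by 4 degrees across it.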




✨ Worth a look ✨
* ... so cosine similarity, which is unaffected by vector magnitude, is chosen.
(Hmm.. what happens when cosine similarity is high but the magnitudes differ a lot?? A clearer justification of what counts as 'similar vectors' would help.)
* In contrast, training the pruned model with LoRA can be seen as redistributing the function of the removed layers across the remaining layers.
* Every pruning method does poorly on the GSM8K benchmark!!
* Only SlimPajama is used as the calibration/training dataset
* -> The FFN converges faster than the Transformer layer

* Lots of benchmarks
* Proposes a new metric called Stability

--
The OpenReview reviews are very positive... I was curious why.
I read through the reviews too.. it seems to come down to the many fine-grained comparison experiments ..
Results on a wider range of models (families/sizes) and the comparison against LoRA -> added after reviewers asked for them

 

 

Getting the big picture from the Abstract

This is a study of layer pruning for LLMs, i.e. identifying and removing the least important layers, and it proposes LLM-Streamline.

 

LLM-Streamline has two stages.

1. layer pruning: remove the least important run of consecutive layers

2. layer replacement: train a lightweight network and substitute it for the pruned layers, to mitigate the performance loss.

 

In addition, it proposes a new metric called Stability.

The metric is meant to address the limitation of relying on accuracy alone in model compression tasks.

 

The method outperforms other SOTA pruning methods in both performance and training efficiency.

 


1. Introduction

As LLMs grow, their hardware requirements become severe, which limits their use in real-world scenarios. To lift this limitation, there has been a surge of work on model compression, which aims to build compact models while keeping performance high.

model compression => {kd, quantization, pruning}

Knowledge distillation achieves compression by transferring the capabilities of a larger teacher model to a smaller student model. Quantization compresses the model by quantizing the weights to lower precision. Alternatively, pruning compresses the model by eliminating unimportant parameters and modules.

 

This work focuses on the popular pruning approach. Earlier pruning work prunes at the granularity of dense matrices (SliceGPT), attention heads, filters, parameters, and so on. These methods are effective, but they often introduce structural irregularity into the model, which makes the pruned model inflexible to store and deploy.

Layer pruning, by contrast, simply reduces the depth of the LLM. It is very simple: just remove layers stored in a data structure such as nn.ModuleList. Exploring efficient layer-wise pruning methods is therefore important.

 

Layer pruning finds and removes the less important layers of an LLM. Concretely, each layer can be viewed as transforming the hidden states, so if the input and output hidden states of a layer are highly similar, the layer has little effect.

 

Related work

without further training - SLEB, ShortGPT

with finetuning - Shortened Llama, LaCo, Gromov et al.

However, removing layers outright can cause a larger performance drop.

A parameter-efficient fine-tuning technique such as LoRA (Hu et al., 2021) can also be used to train the pruned LLM,

but fine-tuning the model so that originally non-contiguous layers make up for the performance drop is not easy (see 2.3).

 

This work proposes a pruning method called LLM-Streamline.

It achieves high performance and training efficiency at once, with little training data.

   1. layer pruning 2. layer replacement

The lightweight network can take various forms (FFN, SwiGLU, Transformer, etc.).

 

In addition, the authors find a limitation in the metrics used to evaluate model compression methods.

Concretely, on NLU tasks that involve multiple-choice classification, the compressed model can happen to answer correctly on samples the original model was uncertain about. This leads to overestimating performance.

To address this, they propose a new metric, stability. It measures the consistency of predictions before and after pruning, and in particular also takes the original model's prediction confidence into account.

2. LLM-Streamline

 

2.1. Layer Redundancy in LLMs

Just the equation for an LLM layer

- Similarity between each layer's input/output hidden vectors is computed with cosine similarity

- The data used to measure layer importance is randomly sampled from the pre-training data.

Just the per-layer cosine sim of the hidden vectors, written out as an equation
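
For reference, the two formulas mentioned in these notes presumably look something like the following in standard notation. This is a reconstruction under assumptions, not a copy of the paper's equations: the symbols $x^{(\ell)}$ (hidden state entering layer $\ell$), $f$ (the layer's computation with parameters $\theta^{(\ell)}$, lumping attention and FFN into one residual update as a simplification), $\mathrm{LN}$ (layer normalization), and $n$ (number of layers in the span) are my own naming.

```latex
% pre-norm residual update of layer \ell
x^{(\ell+1)} = x^{(\ell)} + f\bigl(\mathrm{LN}(x^{(\ell)});\ \theta^{(\ell)}\bigr)

% cosine similarity between the input of layer i and the hidden state n layers later
\cos\bigl(x^{(i)}, x^{(i+n)}\bigr)
  = \frac{x^{(i)} \cdot x^{(i+n)}}
         {\lVert x^{(i)} \rVert_2 \, \lVert x^{(i+n)} \rVert_2}
```

A value close to 1 means the span of $n$ layers barely changes the direction of the hidden state.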

 

- Four models are used, to reduce the effect of model size and family.

 -> In every model, the input/output similarity over consecutive layers turns out to be high.

 

Discussion I: Why not use other similarity to measure the importance of layers? 

Dot product and Euclidean distance are also used, but they additionally factor in vector magnitude.

According to the work the paper cites, the hidden states of Transformers with a pre-norm architecture tend to grow in magnitude as layer depth increases.

As a result, dot-product similarity is biased toward being higher in later layers,

and Euclidean distance is biased toward being smaller in early layers.



So cosine similarity, which is unaffected by vector magnitude, is chosen ..
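
A quick numeric check of the magnitude argument. This is a toy sketch with made-up 2-D vectors: the same rotation is applied at two different scales, standing in for "early" (small norm) and "late" (large norm) layers.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

h_in = [1.0, 0.0]
h_out = [math.cos(0.3), math.sin(0.3)]   # same vector rotated by 0.3 rad

for scale in (1.0, 10.0):                # small-norm vs large-norm hidden states
    a = [scale * x for x in h_in]
    b = [scale * x for x in h_out]
    print(scale, round(dot(a, b), 3), round(euclidean(a, b), 3), round(cosine(a, b), 3))
```

Scaling both vectors by 10 multiplies the dot product by 100 and the Euclidean distance by 10, while the cosine similarity stays the same. So with hidden-state norms growing with depth, the first two measures end up comparing layers partly by norm rather than by how much the layer changes the representation.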

(Hmm.. what happens when cosine similarity is high but the magnitudes differ a lot?? A clearer justification of what counts as 'similar vectors' would help.)

 

 

Discussion II: Why not use perplexity as the metric to measure the importance of layers?

Prior work that uses ppl removes each layer one at a time, measures the change in the model's perplexity on pre-training data, and removes the layer whose removal changes perplexity the least.

However, the authors judge the ppl metric to be very data-sensitive: using different pre-training data changes which layers get removed,

so perplexity can be low on the data used for pruning while performance degrades on other datasets.

In contrast, cosine similarity is very stable and consistently selects the same layers.

Appendix A

2.2. Layer Pruning (step1)

 

 

 

2.3. Layer Replacement (step2)

 

Discussion: Layer Replacement or Fine-Tuning Pruned LLMs?

First, in terms of resource overhead, layer replacement places lighter demands on hardware than the alternatives. PEFT methods have to hold all of the model's weights, the activation values, and the optimizer state of the PEFT modules on the GPU. With layer replacement, the first stage only needs the model weights plus the overhead of the forward pass, and the second stage only needs the lightweight network's weights, activation values, and optimizer state.

Second, an MSE loss is used to distill the knowledge of the removed layers into the lightweight network.
In contrast, training the pruned model with LoRA can be seen as redistributing the function of the removed layers across the remaining layers.
Replacing the pruned layers with a lightweight network may therefore be an easier learning problem than redistributing their function across the remaining layers.
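
The two-stage idea can be sketched on a toy problem. This is only an illustration under heavy assumptions: hidden states are single numbers, `pruned_block` is a made-up stand-in for the removed layers, and the "lightweight network" is a residual map with two scalar parameters rather than an FFN or Transformer layer.

```python
def pruned_block(h):
    """Stand-in for the removed layers: three small residual updates."""
    for _ in range(3):
        h = h + 0.1 * h + 0.05
    return h

# Stage 1: run the frozen model once and cache (input, output) hidden states.
inputs = [k / 10 for k in range(-10, 11)]
targets = [pruned_block(h) for h in inputs]

def lightweight(h, w, b):
    """Tiny residual replacement network with two trainable parameters."""
    return h + w * h + b

def mse(w, b):
    return sum((lightweight(h, w, b) - t) ** 2 for h, t in zip(inputs, targets)) / len(inputs)

# Stage 2: fit only the lightweight network with plain gradient descent on MSE.
w, b, lr = 0.0, 0.0, 0.1
initial_loss = mse(w, b)
for _ in range(500):
    errors = [lightweight(h, w, b) - t for h, t in zip(inputs, targets)]
    grad_w = sum(2 * e * h for e, h in zip(errors, inputs)) / len(inputs)
    grad_b = sum(2 * e for e in errors) / len(inputs)
    w -= lr * grad_w
    b -= lr * grad_b
final_loss = mse(w, b)
print(initial_loss, final_loss)
```

Stage 2 never touches the original model: it only sees the cached pairs, which is the source of the memory savings described above.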

3. Metrics for Evaluating Pruned Models

 

3.1. Shortcoming of Accuracy Metric

The standard deviation (std) for TP and TN is markedly higher than for FN and FP.
→ This means the model is relatively more uncertain on FN and FP samples

3.2. Stability Metric

 

Unlike accuracy, stability focuses on the model's confidence in its answers and on the consistency of the model before and after pruning.

Stability is therefore a metric better aligned with the original goal of model compression: keeping the pruned model as close as possible to the original model.
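
A toy example of why accuracy alone can mislead. Note that this is not the paper's Stability formula, which these notes do not reproduce and which also weighs in the original model's confidence; `agreement` is just a plain before/after consistency rate, and the labels and predictions are invented.

```python
def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def agreement(preds_a, preds_b):
    """Fraction of samples on which two models give the same answer."""
    return sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)

labels   = [0, 1, 2, 3, 0, 1, 2, 3]
original = [0, 1, 2, 3, 1, 2, 3, 0]   # right on the first four samples only
pruned_a = [0, 1, 2, 3, 1, 2, 3, 0]   # behaves exactly like the original
pruned_b = [1, 2, 3, 0, 0, 1, 2, 3]   # right on the *other* four samples

for name, preds in (("pruned_a", pruned_a), ("pruned_b", pruned_b)):
    print(name, accuracy(preds, labels), agreement(preds, original))
```

Both pruned models score 0.5 accuracy, the same as the original, but `pruned_b` agrees with the original on none of the samples. A consistency-style metric separates the two cases where accuracy cannot.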

 

4. Experiments

4.1. Setup

Model

Llama2-7B, 13B

 

pruning ratio : 25%

lightweight network: 1) FFN (random init) 2) Transformer layer (inherits the parameters of the first pruned layer)

calibration dataset : SlimPajama

The data is randomly sampled, giving a final dataset of 30,000 samples

 500 samples are randomly selected and fed to the LLM to produce Fig. 2, and these 500 samples are used for layer pruning. The 30,000 samples are used to train the lightweight network

 * SlimPajama: CommonCrawl ~52.2%, C4 ~26.7%, GitHub ~5.2%, Books ~4.2%, arXiv ~4.6%, Wikipedia ~3.8%, StackExchange ~3.3%.

 

 

4.2. Benchmark

12 NLU tasks.

CMNLI, HellaSwag, PIQA, CHID, WSC, CommonsenseQA, BoolQ, MMLU, CMMLU, Race-High/Middle, C3

3 additional tasks. (using the OpenCompass framework)

XSum, GSM8K, StrategyQA

4.3. Baseline

LLM-Pruner
SliceGPT

LaCo

4.4 Main Results

-> accuracy results on the benchmarks

 

-> Stability (the metric they propose) results on the benchmarks

 

- Every pruning method does poorly on the GSM8K benchmark!!

 

 

Experiments were also run on OPT-1.3B, OPT-2.7B, OPT-6.7B, Baichuan-7B, Baichuan-13B, Baichuan2-7B, Baichuan2-13B (Yang et al., 2023), Llama3.1-8B, Llama3.1-70B (Dubey et al., 2024), and Mixtral-8x7B-v0.1 (Jiang et al., 2024) (Appendix E)

 

Appendix E

-> pruning ratio=50%

 

 

 

4.5. Impact of Different Lightweight Networks

While FFN achieves the best result, Transformer layer still has performance potential.

- Experiments with different lightweight network architectures

1) FFN 2) SwiGLU-based FFN 3) Transformer layer

 

+ Ways to initialize the Transformer layer

3-1) random 3-2) inherit the first pruned layer 3-3) inherit the last pruned layer 3-4) average of the pruned layers

 

=> FFN performs best. Among the Transformer layer variants, inheriting the first pruned layer gives the best result. The LaCo-inspired Layer-Avg, on the other hand, shows that weight averaging is not as effective as inheriting the first pruned layer

(But LaCo doesn't average, it adds the differences...)

 

-> The FFN converges faster

 

4.6. Impact of Different Pruning Ratios

The performance of the pruned model is linearly correlated with the number of parameters at modest pruning ratios.

Performance degrades linearly with the parameter count, which suggests that a model pruned with LLM-Streamline is comparable to a pre-trained model with the same number of parameters.

(Huh, why no comparison against a 3B Llama? Probably because it would break the argument lol. Is that really okay?)

 

4.7. Comparison of Layer Replacement and LoRA

Layer Replacement outperforms LoRA in both performance and GPU memory consumption

- Since layer replacement and LoRA have different training objectives, the layer replacement results are after an additional 1 epoch of training with the LM loss.

- Layer replacement is trained on 30,000 samples, LoRA on 300,000 samples.

- The LoRA rank is set to 128 to keep things comparable

Always better than LoRA. In other words, the method needs far less GPU memory and training data.

 

Appendix E.8.

[On the amount of training data for layer replacement]

Post-training on all of SlimPajama-6B does raise performance slightly, but that is underwhelming given the 100x increase in computational time.

 

5. Related Work

Layer pruning work concurrent with LLM-Streamline includes LaCo (Yang et al., 2024), ShortGPT (Men et al., 2024), UIDL (Gromov et al., 2024), SLEB (Song et al., 2024), and Shortened Llama (Kim et al., 2024).

  • LaCo (Yang et al., 2024) groups several consecutive layers together and compresses them by averaging their parameters. (But LaCo isn't averaging!!!)
  • ShortGPT (Men et al., 2024) uses the BI score, which is equivalent to cosine similarity, to assess layer importance and removes the less important layers.
  • UIDL (Gromov et al., 2024) likewise removes less important layers using angular distance, which corresponds to cosine similarity, and additionally uses QLoRA to improve performance.
  • SLEB (Song et al., 2024) computes layer importance from perplexity and removes the unimportant layers.
  • Shortened Llama (Kim et al., 2024) explores a range of layer selection metrics and analyzes the effect of continual pre-training and of LoRA after pruning.

Unlike conventional layer pruning methods, LLM-Streamline neither simply removes the pruned layers nor retrains the pruned model; it trains a lightweight model to replace those layers.

 

 

6. Conclusion

This paper proposes LLM-Streamline, a layer pruning-and-replacement algorithm for LLMs.
It also points out the limitations of the existing accuracy metric and proposes stability, a new metric for evaluating model compression.
Extensive experiments show that layer replacement with a lightweight network outperforms existing SOTA pruning methods, and beats other contemporary layer pruning methods in both efficiency and performance.