Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods

ICLR 2024 Workshop

👀 Summary 👀


✨ Point ✨
Importance is computed at the block level (Transformer block) (mag / taylor / ppl)
Retraining with LoRA / Continued Pretraining

 

Getting the flow from the abstract

There are two kinds of pruning, width and depth, and most prior work is width-only or a blend of width and depth.

 

This paper argues(?) that simple depth pruning alone can compress LLMs effectively.

It also speeds up inference and takes memory constraints into account (limited batch size).


 

1. Introduction

- This work is about structured pruning.

Structured pruning removes groups of unnecessary weights and facilitates hardware-agnostic acceleration.

* hardware-agnostic acceleration: speedups that are not tied to specific hardware and work well across a range of devices (e.g., CPU, GPU, NPU, FPGA).

 

- LLM inference follows an autoregressive decoding mechanism: tokens are predicted one at a time, conditioned on the preceding tokens.

This makes decoding memory-bound, which leaves the GPU's compute capacity considerably underutilized.

-> Increasing the batch size is the usual remedy, but on a constrained GPU setup the batch size has to stay small, so the paper aims to speed up inference in that setting as well.

 

- Because depth pruning removes large units, it has been considered inferior to width pruning, but this paper shows that this is not necessarily the case.

 

Contribution:

1. Under limited batch sizes, width pruning does not improve inference speed.

2. Proposes a simple yet effective depth pruning method.

3. At moderate pruning ratios, LoRA retraining matters for recovering performance; at larger ratios, full-parameter updates do.

 

2. Problem: Small-batch LLM Inference

 

our focus is on accelerating the inference of LLMs under small-batch conditions caused by hardware restrictions. (I think this is meant in the sense that inference is sped up through pruning rather than by increasing the batch size. Increasing the batch size is only an option when you are GPU-rich, so this is a method that also applies at small batch sizes.)

1. Width pruning does not improve generation speed, and can even slow it down when the weight dimensions become a poor fit for the GPU.

2. Meaningful speedups are achieved only through depth pruning.

 

3. Method: Block Pruning

The Transformer block itself is treated as the prunable unit.

Method: identify unimportant blocks with a simple metric, then prune them in one shot.

3.1. Evaluation of Block-level Importance

$W^{k,n}$: linear weight matrix

size: (d_out, d_in)
k: type of operation (e.g., the query projection in multi-head attention, the up projection in the FFN)
n: index of the Transformer block (the n-th block)

 

- Weight importance scores are computed at the output-neuron level... (what does that even mean?)

 

Magnitude(Mag). 

Weights with a small norm are considered less informative.

Taylor

Basic formula

: How much the model's loss changes when a weight $W_{i,j}^{k,n}$ is set to zero (i.e., the impact of removing it)

can be approximated by the absolute value of the product of that weight and the gradient of the loss with respect to it.

Right-hand side: the change approximated with the first derivative (Taylor expansion)

Importance is measured by the error that removing a given parameter introduces.
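
For reference, the two criteria above can be written out roughly as follows. This is a sketch reconstructed from the definitions in this section (the weight $W_{i,j}^{k,n}$ and the loss); the calibration set symbol $D$ and the exact form of the sums are assumptions here and may differ in detail from the paper's notation.

$$I^{n}_{\text{Mag}} = \sum_{k}\sum_{i}\sum_{j} \left| W_{i,j}^{k,n} \right|$$

$$\left| \mathcal{L}(W_{i,j}^{k,n}; D) - \mathcal{L}(W_{i,j}^{k,n} = 0; D) \right| \approx \left| \frac{\partial \mathcal{L}(D)}{\partial W_{i,j}^{k,n}} \, W_{i,j}^{k,n} \right|$$

$$I^{n}_{\text{Taylor}} = \sum_{k}\sum_{i}\sum_{j} \left| \frac{\partial \mathcal{L}(D)}{\partial W_{i,j}^{k,n}} \, W_{i,j}^{k,n} \right|$$

The block-level score is just the per-weight score summed over every matrix $k$ in block $n$, so a block with a low total is a candidate for removal.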

 

 

Mag+ and Taylor+

Prior work reports that the early Transformer blocks tend to be labeled unimportant, yet removing them actually degrades performance.

-> Keep the first 4 blocks and the last 2 blocks.
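
The scoring and the "+" protection rule described above can be sketched as follows. This is a minimal NumPy sketch, assuming toy random weights and gradients in place of a real model; `block_scores`, `prune_order`, and the 10-block setup are hypothetical names chosen for illustration, not the paper's code.

```python
import numpy as np

def block_scores(weights, grads=None):
    """Sum |W| (Mag) or |grad * W| (Taylor) over all matrices of each block."""
    scores = []
    for n, mats in enumerate(weights):
        total = 0.0
        for k, W in enumerate(mats):
            if grads is None:
                total += np.abs(W).sum()                # magnitude criterion
            else:
                total += np.abs(grads[n][k] * W).sum()  # first-order Taylor criterion
        scores.append(total)
    return np.array(scores)

def prune_order(scores, keep_first=4, keep_last=2):
    """Block indices from least to most important, skipping protected blocks."""
    n_blocks = len(scores)
    protected = set(range(keep_first)) | set(range(n_blocks - keep_last, n_blocks))
    ranked = np.argsort(scores, kind="stable")
    return [int(n) for n in ranked if int(n) not in protected]

# Toy stand-in for a model: 10 blocks, 2 weight matrices per block (assumption).
rng = np.random.default_rng(0)
toy_weights = [[rng.normal(size=(4, 4)) for _ in range(2)] for _ in range(10)]
toy_grads = [[rng.normal(size=(4, 4)) for _ in range(2)] for _ in range(10)]

mag_plus = prune_order(block_scores(toy_weights))
taylor_plus = prune_order(block_scores(toy_weights, toy_grads))
```

With 10 toy blocks, blocks 0-3 and 8-9 never appear in either ordering, so only blocks 4-7 are pruning candidates.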

 

Perplexity (PPL)

Measure the change in PPL when each block is removed (using a calibration set).

$\theta^{n}$: the model with the n-th block removed
s = 1,...,S: sequences
l = 1,...,L: tokens

- PPL is derived from the next-token prediction loss and requires only forward-pass computation.
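
As a rough illustration of the PPL criterion: given per-token log-probabilities from the model with block n removed, PPL is the exponential of the mean negative log-likelihood over all S sequences and L tokens. The sketch below assumes the log-probabilities are already available as an (S, L) array; `perplexity` is a hypothetical helper, and no real model is involved.

```python
import numpy as np

def perplexity(token_log_probs):
    """token_log_probs: shape (S, L), log p(x_l | x_<l) for each token."""
    token_log_probs = np.asarray(token_log_probs, dtype=float)
    # exp of the average negative log-likelihood over all S * L tokens
    return float(np.exp(-token_log_probs.mean()))

# Sanity check: a model that assigns probability 1/4 to every token.
uniform = np.full((3, 5), np.log(0.25))
ppl_uniform = perplexity(uniform)
```

Computing this once per removed block gives one score per block; the blocks whose removal raises PPL the least are the ones to prune.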

Removing the first or last blocks makes PPL spike.

(Oh... interesting. Removing middle layers actually changes PPL less.

Do the first and last blocks hold the important information?

PPL probably rises because the blocks are simply removed, right? It feels like some remedy such as distillation might fix that...)

Would the accuracy results look similar?

 

 

3.2. One-shot Pruning

With the blocks ranked by importance, it is time to prune.

Since the parameter count of each block is known, the model can be pruned to a desired size.
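
A minimal sketch of this one-shot step, assuming an importance ordering is already available: walk the blocks from least to most important and drop them until the parameter budget is met. The function name and the toy numbers are hypothetical.

```python
def select_blocks_to_remove(order, params_per_block, other_params, target_params):
    """Remove blocks in least-important-first order until the target size is reached."""
    total = other_params + sum(params_per_block)
    removed = []
    for n in order:
        if total <= target_params:
            break
        total -= params_per_block[n]
        removed.append(n)
    return sorted(removed), total

# Toy numbers (assumption): 8 blocks of 100 parameters each, plus 50
# parameters outside the blocks (embeddings, head), pruned down to 600.
removed, remaining = select_blocks_to_remove(
    order=[5, 3, 6, 2, 4],
    params_per_block=[100] * 8,
    other_params=50,
    target_params=600,
)
```

Everything is decided in a single pass over a fixed ranking, which is what makes it one-shot: the importance scores are not recomputed after each removal.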

 

Iterative pruning has the drawback of taking longer to compute than one-shot pruning.

Moreover, the authors observed that the retraining strategy matters more than which pruning scheme is used.

 

3.3. Retraining for Performance Restoration

Recent work suggests that structured pruning is feasible with no retraining or with low retraining cost.

However, the types of retraining methods have not been analyzed sufficiently, so the paper examines them.

But if you prune and then retrain... is there any benefit beyond a good initialization? Hmm.

How do you make use of a large model? Thinking about that, I drifted off to MoE and back... large models and experts...

Low-Rank Adaptation (LoRA)

Following prior work that applied it to width pruning (Ma et al. (2023)) (review), LoRA is applied here as well.

It was effective, but it stops working properly as the pruning ratio increases (for both width and depth).
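
To make the contrast with full-parameter updates concrete, here is a minimal NumPy sketch of the LoRA idea: the pretrained weight stays frozen and only a low-rank pair of matrices is trained. The shapes, the rank `r`, and the names below are assumptions for illustration; this is not the training setup used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 4

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, r))                # trainable, zero-initialized so the update starts at 0

def lora_forward(x, scale=1.0):
    """y = x W^T + scale * x A^T B^T, i.e. the weight is W + scale * B A."""
    return x @ W.T + scale * (x @ A.T) @ B.T

x = rng.normal(size=(2, d_in))
full_params = W.size            # what a full-parameter update would train
lora_params = A.size + B.size   # what LoRA trains instead
```

Here LoRA trains 512 values against 4096 for the full matrix, which hints at why it is cheap but may lack the capacity to repair a heavily pruned model.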

Continued Pretraining (CPT)

All parameters are updated on a large-scale pretraining corpus.

It needs more resources than LoRA, but it learns faster and performs better than training from random initialization.

CPT -> LoRA

After CPT, LoRA is applied with an instruction dataset to see whether it brings further gains.

 

4. Experimental Setup

Source Model

LLaMA-7B

Vicuna-{7B, 13B}-v1.3

Baseline

[ Width pruning ]

LLM-Pruner

FLAP

Wanda-sp

[ Retraining-free block pruning method ]

SLEB

 

 

Data

BookCorpus

Alpaca (for LoRA)

SlimPajama (for CPT)

Evaluation

zero-shot accuracy (BoolQ, PIQA, HellaSwag, WinoGrande, ARC-easy, ARC-challenge, OpenbookQA)

zero-shot PPL (WikiText2, PTB)

Latency and Throughput

(to show that inference is faster)

batch size: M

output sequence length: L

latency: T (time to generate M × L output tokens)

throughput: M L / T
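
The two metrics above can be measured with a simple timer. The sketch below uses a dummy workload in place of a real model's generation loop; `measure` and `dummy_generate` are hypothetical names, and a real GPU measurement would also need warm-up runs and device synchronization.

```python
import time

def measure(generate_fn, batch_size, output_len):
    """Return (latency T in seconds, throughput M * L / T in tokens per second)."""
    start = time.perf_counter()
    generate_fn(batch_size, output_len)
    latency = time.perf_counter() - start
    throughput = batch_size * output_len / latency
    return latency, throughput

def dummy_generate(batch_size, output_len):
    # Stand-in for autoregressive decoding: one step per output token.
    for _ in range(output_len):
        sum(range(1000 * batch_size))

latency, throughput = measure(dummy_generate, batch_size=2, output_len=16)
```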

 

Implementation

..

5. Results

5.1. Moderate Pruning and LoRA Retraining

- Width pruning does not improve LLM inference efficiency.

- In some cases width pruning actually makes inference slower, because the pruned dimensions are no longer GPU-friendly (e.g., an FFN hidden size that is not divisible by 8).

- Depth pruning, by contrast, is faster and reduces memory access.

- With the same LoRA retraining as LLM-Pruner, performance comes out similar to width pruning...

- Compared with SLEB: relatively poor as the size gets smaller.

 

 

 

5.2. Aggressive Pruning and CPT Retraining

- At large pruning ratios (fewer than 3.7B parameters), both LoRA-based tuning and retraining-free approaches perform poorly.

- CPT is effective / CPT -> LoRA slightly improves zero-shot accuracy and slightly worsens PPL.

- CPT costs more compute than LoRA, but it is far more efficient (1 GPU for a day) than doing CPT on the base model (8 GPUs for 2 weeks).

 

- Even at 60% pruning (2.7B), the model from this method still generates well.

- At the same model size, starting from a pruned model gives better results than random initialization.

5.3. Applicability with Quantization

 

 

With GPTQ, VRAM usage is reduced without a large drop in performance.

 

* PTQ: post-training quantization; after model training is finished, the numbers (parameters, computation results) are converted to "smaller" representations to make a heavy, slow model lighter (faster).

* VRAM: Video RAM, memory dedicated to the GPU
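
To illustrate what "making the numbers smaller" means, here is a sketch of the simplest form of post-training quantization: symmetric round-to-nearest to int8 with a single scale per tensor. This is deliberately not GPTQ, which uses a more elaborate error-compensating procedure; the names and the toy matrix are assumptions.

```python
import numpy as np

def quantize_int8(W):
    """Symmetric per-tensor quantization: W is approximated by q * scale."""
    scale = np.abs(W).max() / 127.0
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8)).astype(np.float32)
q, scale = quantize_int8(W)
max_err = float(np.abs(dequantize(q, scale) - W).max())
```

Each weight now takes 1 byte instead of 4, and the reconstruction error per weight is bounded by about half a quantization step (`scale / 2`).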

5.4. Ablation Study

 

5.4.1. Importance Criteria for Block Pruning

Methods without the '+' failed to retain the essential initial blocks -> performance drops.

- Taylor outperforms Mag, which relies only on weight magnitude.

 

5.4.2. Structural Unit for Depth Pruning

Measures the effect of removing individual modules (MHA, FFN) instead of whole blocks (+LoRA).

- Above 5B, module-level removal gives slightly higher accuracy; otherwise block-level removal performs better.

This runs counter to the common belief that removing smaller units leads to better performance. (Huh? Naturally, if an MHA or FFN in the middle is removed, the learned flow is broken, so I'd expect it to be really bad. Then again, performance seems fairly high considering that.)

 

- The paper notes that because the modules in a block work together, handling them independently may not be optimal.

     - In Table 6, for the 5.3B model, some stretches are left with only consecutive FFNs -> with attention gone, the model's ability to handle word interactions probably suffers.

     - With block-level pruning, by contrast, neighboring blocks probably take over functions similar to the ones that were lost.

 

5.4.3. Calibration Data Volume

- Calibration data is used to compute block-level importance.

- As Table 7 shows, even 10 samples are enough.

- For Taylor+, accuracy actually drops when 1K samples are used.

(Investigating the cause is left to future research.)

6. Related Work

* SparseGPT (Frantar and Alistarh, 2023) addresses the layer-wise reconstruction problem for pruning by computing Hessian inverses.

* structured pruning removes organized patterns, such as layers (Fan et al., 2020; Jha et al., 2023), (I only skimmed the abstracts, but these don't look layer-level to me...) FFN's hidden sizes (Nova et al., 2023; Santacroce et al., 2023), and some hybrid forms (Lagunas et al., 2021; Xia et al., 2022; Kwon et al., 2022; Kurtic et al., 2023)

* Sheared-LLaMA (Xia et al., 2024) introduces a mask learning phase aimed at identifying prunable components in both the network's width and depth.

* depth pruning approaches (Song et al., 2024 (SLEB); Men et al., 2024 (ShortGPT); Tang et al., 2024 (Rethinking...))