

SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks

ICML 2024

Code

👀 Summary 👀


✨ Point ✨


 

Getting the flow from the Abstract

Existing pruning methods struggle to speed up end-to-end LLM inference.

The paper proposes a new approach that removes unnecessary transformer blocks (streamline: to simplify, to make more efficient).

- It builds on the high similarity between the outputs of neighboring blocks.

(Oh, so this paper prunes based on redundancy/similarity rather than an importance score.)

Huh? But if it was done with metric2, then it isn't similarity-based, is it??

LLM inference gets faster while performance is maintained.


1. Introduction

The sheer number of parameters makes it hard to deploy these models in real-world services (higher memory usage, computational demands).

So techniques for building smaller, more efficient models matter.

Drawback of network pruning (removing parameters): sparse matrices are hard to process. Today's GPUs are optimized for dense matrix computation.

 

In the realm of LLMs, a significant similarity in output is observed among successive transformer blocks (Din et al., 2023; Liu et al., 2023).

Because of the residual path inside each transformer block, the outputs of successive blocks end up quite similar, which results in redundancy.

(Hmm.. so if you take a model that has finished pre-training, remove the residual connections, and fine-tune it, could it pick up finer-grained information? Reasoning ability, for example..)

* The residual path was introduced to stabilize backpropagation during training (oh, I see..)

 

 

SLEB is proposed.

With this method, careful elimination of redundant transformer blocks reportedly does not hurt text generation ability.

2. Motivation

2.1. Pruning

There are two ways to build compact and fast LLMs:

1. Improve the efficiency of individual blocks

2. Reduce the total number of blocks

Challenge 1) Limitation in Achieving LLM Inference Speedup:

Pruning comes in two main types: 1. Unstructured 2. Structured

- Unstructured pruning removes individual weights, which leaves sparse weight matrices.

This creates complex data access patterns, complicates management, and can even get in the way of accelerating the model.

Studies report that on NVIDIA GPUs, unstructured pruning generally needs a high sparsity of 90% or more before it yields a speedup.

In practice even 50% pruning is hard.

- Structured pruning removes units of weights (each method defines its own unit).

The goal is to keep a hardware-friendly dense matrix format.

Contrary to the ideal, the speedup is not proportional to the pruning ratio.

- GPU something something.. 2:4 pruning techniques (skipping this for now..)
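For reference, a 2:4 pattern keeps at most two nonzero weights in every group of four consecutive weights. The sketch below is only an illustration of that pattern: it picks the survivors by magnitude, which is an assumption made for simplicity and not the selection rule of any particular method, and `prune_2_of_4` is a hypothetical helper name.

```python
import numpy as np

def prune_2_of_4(weights: np.ndarray) -> np.ndarray:
    """Zero the two smallest-magnitude entries in each group of four
    consecutive weights along the last axis (illustrative 2:4 pattern)."""
    rows, cols = weights.shape
    assert cols % 4 == 0, "column count must be a multiple of 4"
    groups = weights.reshape(rows, cols // 4, 4)
    # Indices of the two smallest |w| inside every group of four.
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return (groups * mask).reshape(rows, cols)

w = np.array([[0.1, -0.9, 0.4, 0.05, 2.0, -0.3, 0.2, 1.5]])
pruned = prune_2_of_4(w)
print(pruned)
```

Half of the weights are zeroed, but in a regular layout that hardware with 2:4 sparsity support can exploit, unlike a fully unstructured mask.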

 

 

- Recent pruning methods!! [LLM-Pruner] [SliceGPT]

They remove whole channels (rows/columns) of the weight matrices (channel-wise), so the dense format is preserved.

But it needs the help of extensive fine-tuning [LLM-Pruner]

No big gain in speed [SliceGPT]

- A newer pruning method [Deja Vu]

Based on the input context, it dynamically decides whether to bypass specific parts of a layer's computation.

Effective inference speedup in the single-batch scenario, but it shares problems with Early Exit.. contd...

 

2.2. Early Exit

Reducing the number of transformer blocks directly improves processing speed. Early Exit is a method that takes advantage of this.

Once the model reaches a certain confidence level, it stops and emits its output.

There are also methods that skip transformer blocks - this strategy suggests that removing the early blocks of an LLM in particular may be easier (Oh???? That conflicts;; if you are skipping them anyway, isn't that the same as removing them??)

* Drawback: it requires dynamic decision-making or extensive training to be effective.

Early exit raises ppl (performance drops).

The more blocks are removed, the higher ppl gets (performance drops) - by testing all possible removable points of consecutive blocks

==> So simply removing consecutive blocks from an LLM is not effective without dynamic decision-making and training.

 

Challenge 2) Limitation in Acceleration in Multi-batch Settings

LLMs mostly run in multi-batch scenarios, and the layers to skip can differ from token to token.

This complicates the implementation or reduces efficiency.

Challenge 3) Inability to Reduce Memory Requirements:

Dynamic methods such as early exit still have to keep all of the model's parameters in memory, so they do not reduce memory usage.

Challenge 4) Resource-Intensive Training

From the experiment above -> the output only becomes close to the final result in the later part of the model.

Even with early exit, roughly 90% of the transformer blocks are still needed. In other words, early exit takes a lot of training.

That makes the early-exit strategy hard to adopt for very large models such as LLaMA2-70B.

 

Solution!!

1) LLM Inference Speedup

Use the transformer block as the unit of removal.

2) Acceleration in Multi-batch Setting

After thorough redundancy verification, transformer blocks are removed statically.

In this respect it is the same as traditional pruning techniques.

(i.e., I think it means it is not dynamic..)

3) Reduction in Memory Requirements

Possible because the unnecessary blocks are removed outright.

4) Training-free Compression (oh-)

The redundancy analysis is training-free, so no intensive (re)training is needed.

 

 

3. Proposed SLEB

3.1. Output Similarity across Transformer Blocks

x_i : output of the i-th transformer block
T_i : computation result of the i-th transformer block

Compute the cosine similarity between the outputs of the transformer blocks.

-> Similarity varies across the model as a whole, but adjacent blocks consistently show high similarity.

This suggests there is potential redundancy inside the model.
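A minimal sketch of this kind of similarity analysis, on toy data rather than real hidden states. The residual-style update `hidden + 0.3 * noise` is an assumption standing in for a transformer block, and `cosine_similarity_matrix` is a hypothetical helper; a real analysis would collect the block outputs from an actual LLM.

```python
import numpy as np

def cosine_similarity_matrix(block_outputs: np.ndarray) -> np.ndarray:
    """block_outputs has shape (num_blocks, hidden_dim).
    Returns the (num_blocks, num_blocks) matrix of pairwise cosine similarities."""
    norms = np.linalg.norm(block_outputs, axis=1, keepdims=True)
    unit = block_outputs / norms
    return unit @ unit.T

# Toy residual stream: each "block" adds a small update to its input.
rng = np.random.default_rng(0)
hidden = rng.normal(size=512)
outputs = []
for _ in range(8):
    hidden = hidden + 0.3 * rng.normal(size=512)  # assumed residual update
    outputs.append(hidden.copy())

sim = cosine_similarity_matrix(np.stack(outputs))
adjacent = np.diag(sim, k=1)  # similarity between block i and block i+1
print("mean adjacent similarity:", adjacent.mean())
print("first vs last block:", sim[0, -1])
```

In this toy stream, neighboring outputs stay closer to each other than the first and last outputs do, which is the qualitative pattern the notes describe.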

 

- This is where you can see the fundamental misalignment of early exit:

it proceeds while keeping a contiguous run of blocks.

There is also the risk of dropping an important block that is really needed.

 

3.2. Redundancy Verification of Transformer Blocks

The first priority is identifying the unnecessary blocks.

Metric 1. Compute the distance between each block's input and output.

A small value suggests the block has only a minor impact on overall LLM inference.

* cosine similarity
A_j : input of the j-th block
B_j : output of the j-th block
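As a rough reconstruction from the labels above, and assuming that "distance" means one minus the cosine similarity between A_j and B_j (an assumption, not a quote from the paper), Metric 1 would look something like this:

```latex
\mathrm{Metric}^{1}_{j} \;=\; 1 - \frac{A_j \cdot B_j}{\lVert A_j \rVert \, \lVert B_j \rVert}
```

A block whose output points in nearly the same direction as its input scores close to 0 and therefore looks removable under this metric.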

* ppl increases sharply

-> minor changes in that block can be amplified, especially if the block lies in the early stage of the LLM, leading to a more substantial impact on the overall results.

==> Early blocks matter

* That graph looks like the limitation of [Shortened LLaMA], doesn't it? It just removes layers in order of the ppl increase measured by taking out one layer at a time

 

Metric 2. Next-token probability when each block is removed

M_j : the LLM with the j-th block removed
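One plausible way to write this down, assuming the score is the average negative log-likelihood that M_j assigns to calibration tokens w_1, ..., w_K (the token notation and the averaging are assumptions for illustration):

```latex
\mathrm{Metric}^{2}_{j} \;=\; -\frac{1}{K} \sum_{k=1}^{K} \log p_{M_j}\!\left(w_k \mid w_{<k}\right),
\qquad
\mathrm{ppl}(M_j) \;=\; \exp\!\left(\mathrm{Metric}^{2}_{j}\right)
```

Under this reading, ranking blocks by the score is equivalent to ranking them by the perplexity of the model after that block is removed, since the exponential is monotonic.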

 

* Better than Metric 1, but ppl rises sharply once more blocks are removed.

Probably because the relative importance of the remaining blocks shifts every time a block is removed.

e.g., when removing 7 blocks from the 6.7B model, the 3rd, 4th, 5th, 6th, 7th, 8th, and 10th blocks were reportedly removed (i.e., a run of consecutive ones all goes).

Hm?? Isn't this metric just ppl..?

https://bitrader.tistory.com/77

 

 

Metric 3. iterative removal process

Re-identify the redundant block after every single removal.

The block to remove is then chosen based on the updated state of the LLM.

M' : the LLM with blocks already removed in the previous steps
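As a sketch under stated assumptions: write M'_j for M' with block j additionally removed, and score it by the average negative log-likelihood on calibration tokens w_1, ..., w_K (both M'_j and the token notation are illustrative, not taken from the notes):

```latex
\mathrm{Metric}^{3}_{j}(M') \;=\; -\frac{1}{K} \sum_{k=1}^{K} \log p_{M'_j}\!\left(w_k \mid w_{<k}\right)
```

The block with the lowest score is removed, M' is updated to M'_j, and the scores of the remaining blocks are recomputed before the next removal.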

 

* Performance is decent

e.g., when removing 7 blocks from the 6.7B model, blocks were reportedly removed in the order 6, 7, 3, 24, 18, 30, 11

 

 

3.3. Proposed SLEB Algorithm

Calibration data is used to compute the redundancy of the transformer blocks.

This lets the model be streamlined without any additional training.

(So..... simple...)
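The iterative procedure can be sketched as a greedy loop that re-scores every remaining block after each removal. This is a toy illustration, not the authors' code: the "blocks" are small residual linear maps, and the score is the mean squared error against the full model's output on calibration inputs, used here as a stand-in for a token-likelihood score. The names `sleb_style_prune`, `forward`, and `toy_score` are hypothetical.

```python
import numpy as np
from typing import Callable, List, Tuple

def sleb_style_prune(
    num_blocks: int,
    num_remove: int,
    score_fn: Callable[[List[int]], float],
) -> Tuple[List[int], List[int]]:
    """Greedy iterative removal: at each step, try removing each remaining
    block, keep the removal with the lowest score, then re-score."""
    kept = list(range(num_blocks))
    removed: List[int] = []
    for _ in range(num_remove):
        # Score of the current model with block j additionally removed.
        scores = {j: score_fn([b for b in kept if b != j]) for j in kept}
        victim = min(scores, key=scores.get)
        kept.remove(victim)
        removed.append(victim)
    return kept, removed

# Toy model: block i maps x -> x + W_i x (assumed stand-in for a transformer block).
rng = np.random.default_rng(0)
dim, num_blocks = 16, 8
scales = [0.5, 0.01, 0.4, 0.02, 0.3, 0.6, 0.015, 0.45]
weights = [s * rng.normal(size=(dim, dim)) / np.sqrt(dim) for s in scales]
calib = rng.normal(size=(32, dim))  # toy calibration inputs

def forward(kept, x):
    for i in kept:
        x = x + x @ weights[i].T
    return x

reference = forward(range(num_blocks), calib)

def toy_score(kept: List[int]) -> float:
    # Stand-in score: deviation from the full model's output (assumption).
    return float(np.mean((forward(kept, calib) - reference) ** 2))

kept, removed = sleb_style_prune(num_blocks, 3, toy_score)
print("removed:", removed, "kept:", kept)
```

In this toy setup the three blocks with near-zero updates (indices 1, 3, 6) are the ones that get removed. Only forward passes are needed, which matches the training-free character of the method described above.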

 

4. Experiments

4.1. Experimental Setup

- NVIDIA A100 GPUs equipped with 80GB of memory

- SLEB requires 2 GPUs for pruning OPT-66B and LLaMA-70B, and 1 GPU for pruning smaller models

- Pruning LLaMA 70B takes 1.5 hours

Pruning time: Appendix A.1.

- Completed with the inference pass alone, no fine-tuning.

- Calibration data: 128 samples randomly selected from WikiText-2 [following SliceGPT]

- Experiments use pruning ratios of roughly 10% or 20%

 

[Model]

OPT family

LLaMA-2 family

 

[Baseline]

(2:4 structured pruning methods)

SparseGPT

Wanda

DSnoT

 

(channel-wise pruning)

LLM-Pruner

SliceGPT

 

4.2. Elimination of Transformer Blocks using SLEB

- Which transformer blocks get removed differs from model to model

(-> so you need something like Metric 3 that can reflect this~)

 

Appendix A.2.

 

4.3. Language Modeling

[Data]

C4 validation dataset

WikiText-2 

 

C4 dataset - ppl results

WikiText2 dataset - ppl results (Appendix B.2.)

- The pruning ratio is lower than for the other pruning methods, but inference speed is better.

- Wanda and DSnoT fail completely on OPT-66B

(Wait, ppl didn't even go up much.......... hey, that's good)

- Even with the transformer block as the removal unit, which is huge (and looks like a bad idea), it performs quite well

 

 

 

4.4. Dependency on Calibration Dataset

White bars: calibration and evaluation on the same data; colored bars: different data

- SLEB shows the lowest dependence on the calibration dataset.

- Earlier methods measure redundancy at the layer level, whereas SLEB uses Metric 3 to assess the redundancy of each transformer block at the level of the whole network.

This approach makes full use of the information in the pre-trained LLM and shows lower dependence on the dataset.

(The number of calibration samples!!! is not tested in this paper.) But SliceGPT, which it cites for the setup, did test it, down to the number of samples and the sequence length.. ([SliceGPT] Appendix A.3.)

Appendix B.3.

 

4.5. Zero-shot Tasks

[Task]

PIQA

WinoGrande

HellaSwag

ARC-easy / challenge

LM Evaluation Harness

Average accuracy

LLaMA degrades more than OPT does ....

Appendix B.4.

 

4.6. Speedup

Appendix B.5.

 

4.7. Compatibility with Post-Training Quantization

 

 

 

5. Conclusion