Pruning via Merging: Compressing LLMs via Manifold Alignment Based Layer Merging

EMNLP 2024

👀 Summary 👀

✨ Point ✨
(1-1) Extracting Layer Activations
(1-2) Applying the Diffusion Kernel algorithm (dimensionality reduction)

(2-1) Build a similarity matrix using NPIB
(2-2) Introduce an adaptive weight allocation function to determine the optimal merging ratio.
(2-3) Fuse (merge) the parameters of the selected (similar) layers via a weighted sum.

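Getting the flow from the Abstract

Because of their complexity and scale, LLMs are hard to use in resource-limited environments.

Compression techniques such as parameter pruning have a problem: they fail to make effective use of the knowledge held in the parameters they remove.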

-> Manifold-Based Knowledge Alignment and Layer Merging compression (MKA)

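Uses manifold learning (the Diffusion Kernel algorithm), and uses the Normalized Pairwise Information Bottleneck (NPIB) to select(?) the similar layers to merge

(What is the unit of a "layer" here? Surely they aren't merging only the activations)

=> performance is preserved, with a substantial compression ratio

With quantization applied on top, it compresses even further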

1. Introduction

computational resources, memory requirements, and energy consumption :(

- Model compression can be roughly split into two approaches

1) quantization

- uses fewer bits (low-precision values)

- depends on hardware support

- sometimes needs additional finetuning

2) pruning

- sometimes needs no retraining (wait, that's the same thing as above, it's all in how you phrase it;;)

- hardware-friendly

- While effective, pruning usually risks losing valuable model structures and determining how to prune the LLM with minimal disruption to the origin remains an unsolved problem [LLM-Pruner]

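- To address the problems above, the authors delve into model merging.

1) Merging multiple models (existing research has been limited to this)

- seamlessly combines the strengths and knowledge of several models

- averages the weights of models that share the same architecture

- reportedly can even improve(!) performance by canceling out biases and errors (ref)

2) Merging identical structures inside a single model (this paper!)

- Can a model be compressed by gradually merging knowledge across layers, reducing the total number of layers?

-> proposes the Manifold-Based Knowledge Alignment and Layer Merging Compression (MKA) method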

(1) Manifold Learning for LLM Knowledge:

Extract the layers' activations for manifold learning

-> Apply the Diffusion Kernel algorithm to align knowledge across layers.

==> This captures the nonlinear structure in the activations better.

Dimensionality can be reduced while preserving the important activation features.

So the knowledge patterns of different layers can be compared effectively.

(2) Similarity Alignment Layer Merging:

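- Use the Normalized Pairwise Information Bottleneck (NPIB) to compute a similarity matrix that quantifies the similarity between layers.

- This measure computes the similarity between layers by maximizing their mutual information while taking each layer's entropy into account.

(what does that even mean)

- Based on this similarity matrix, similar layer pairs are selected and merged.

(But how is the merge done?? Just averaging, like the earlier methods??)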

cf. Information Bottleneck

Main Contributions:

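- innovative model compression technique: MKA (applies manifold learning for alignment / integrates knowledge across layers) -> reduces model size while preserving performance

- develop a manifold-based knowledge alignment approach (Diffusion Kernel & NPIB) -> captures and aligns similarity in parameter space well

- various benchmark datasets & various LLMs -> substantial compression without a large drop in model performance

(The first and second sound like the same thing, not sure what the second adds?)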

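2. Manifold-Based (dimensionality-reduction-based) Knowledge Alignment and Layer Merging

Builds on the redundancy present in the latter layers of the model. [ref]

Layers with high input-output similarity are merged back to front. (Seems to be the same direction as the approach I had in mind)

Because the high-dimensional intermediate states are hard to analyze,

the section first explains how those states are extracted and reduced in dimension.

Then it proposes a layer merging technique based on similarity alignment.

The technique aims to preserve performance by merging the intermediate states while aligning them.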

2.1. Manifold Learning for LLM Knowledge

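To align knowledge across LLM layers effectively, MKA uses manifold learning to capture the complex nonlinear dependencies inside the LLM's internal structure. This makes it possible to compare and align layer activations in a meaningful way, reducing model complexity while preserving the key information.

Extract the layer activations H^l. (dataset: w)

Here, activations means each layer's outputs for the given input samples. (~~so.. is H a hidden representation then~~ explained right below)

The Diffusion Kernel algorithm is used to map the high-dimensional activations into a low-dimensional space. (Oh right, LLM internals are 128 dimensions and so on, so yeah, that is high-dimensional..)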

Extracting Layer Activations:

Extract the activations of each layer (H^l)

Constructing the Pairwise Distance Matrix:

Compute the pairwise Euclidean distance matrix (D)

This gives the distance between every pair of activations.
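
A rough sketch of this step (my own toy example, not the authors' code): assuming the activations are stacked as the rows of an N×D NumPy array, the pairwise Euclidean distance matrix D can be computed as below. The function name `pairwise_euclidean` and the toy input are made up for illustration.

```python
import numpy as np

def pairwise_euclidean(H: np.ndarray) -> np.ndarray:
    """Return the (N, N) matrix of Euclidean distances between the rows of H."""
    sq = np.sum(H * H, axis=1)                          # squared norm of each row
    d2 = sq[:, None] + sq[None, :] - 2.0 * (H @ H.T)    # ||a||^2 + ||b||^2 - 2 a.b
    np.maximum(d2, 0.0, out=d2)                         # clip tiny negatives from round-off
    return np.sqrt(d2)

# Toy "activations": three samples in a 2-dimensional space (illustrative only).
H = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
D = pairwise_euclidean(H)
print(D)
```

The matrix is symmetric with a zero diagonal, so only the upper triangle carries information.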

Applying the Diffusion Kernel:

The Diffusion Kernel algorithm transforms the distance matrix (D) into a low-dimensional manifold representation (Φ_i).

σ_K : the kernel bandwidth parameter (set to 8 in the experiments)
EigVectors_d : eigenvectors corresponding to the d smallest eigenvalues of the Laplacian matrix L

This transformation captures the essential features and relationships within the activations,

enabling effective comparison between different layers.
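
These notes don't record the exact kernel formula, so the following is only a minimal sketch under assumptions: a Gaussian kernel K = exp(-D²/σ_K²), the unnormalized graph Laplacian L = Deg − K, and an embedding made of the eigenvectors belonging to the d smallest eigenvalues. The function name `diffusion_embedding` and the random toy activations are hypothetical; only σ_K = 8 and the "d smallest eigenvalues of L" part come from the text.

```python
import numpy as np

def diffusion_embedding(D: np.ndarray, sigma: float = 8.0, d: int = 2) -> np.ndarray:
    """Map an (N, N) distance matrix to an (N, d) embedding (assumed kernel form)."""
    K = np.exp(-(D ** 2) / (sigma ** 2))      # Gaussian affinity (assumption)
    deg = K.sum(axis=1)                       # degree of each node
    L = np.diag(deg) - K                      # unnormalized graph Laplacian (assumption)
    eigvals, eigvecs = np.linalg.eigh(L)      # eigh returns eigenvalues in ascending order
    return eigvecs[:, :d]                     # eigenvectors of the d smallest eigenvalues

# Toy activations: 6 samples, 16 features (illustrative only).
rng = np.random.default_rng(0)
H = rng.normal(size=(6, 16))
D = np.linalg.norm(H[:, None, :] - H[None, :, :], axis=-1)
Phi = diffusion_embedding(D, sigma=8.0, d=2)
print(Phi.shape)  # (6, 2)
```

One thing to keep in mind with this formulation: for a connected graph the smallest Laplacian eigenvalue is 0 and its eigenvector is constant, so the first column carries no information. Whether the paper drops it is not something these notes capture.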

..skipping this for now..~

* Dataset used : the first question from each of the 57 MMLU subjects

2.2. Similarity-based Layer Merging (now the actual merging)

Similarity-based layer merging is performed on top of the manifold learning representations.

The Normalized Pairwise Information Bottleneck (NPIB) metric is used to quantify the similarity between layers!

(1) Build a similarity matrix using NPIB

(2) Introduce an adaptive weight allocation function to determine the optimal merging ratio.

(3) Fuse the parameters of the selected (similar) layers via a weighted sum. (Waaaait, a weighted sum?;;; Is such an old-school method okay?? Does that even preserve knowledge????? Seems worth trying experimentally ... no distillation loss, just a weighted sum.. lol, that would make things super simple, wow)

Constructing the Similarity Matrix (1)

The Normalized Pairwise Information Bottleneck (NPIB) quantifies the amount of information shared between layers while normalizing by each layer's individual entropy,

which makes it.. an ideal measure.. for comparing knowledge patterns across layers.

p(x, y) : joint probability distribution of E_i and E_j
p(x) : marginal probability distribution of E_i
p(y) : marginal probability distribution of E_j

The similarity matrix above shows which layers have aligned knowledge representations. (i.e., it lets you decide which ones to merge)
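
Since the equation itself is missing from these notes, here is a hedged sketch of what such a metric could look like. It assumes NPIB is the mutual information I(E_i; E_j) divided by sqrt(H(E_i) · H(E_j)), with p(x, y), p(x), p(y) estimated from a 2-D histogram of the flattened embeddings. The normalization form, the histogram estimator, the bin count, and the names `npib` and `similarity_matrix` are all my assumptions.

```python
import numpy as np

def npib(x: np.ndarray, y: np.ndarray, bins: int = 8) -> float:
    """Mutual information normalized by the geometric mean of the entropies (assumed form)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                 # joint distribution p(x, y)
    px = pxy.sum(axis=1)                      # marginal p(x)
    py = pxy.sum(axis=0)                      # marginal p(y)
    outer = px[:, None] * py[None, :]
    nz = pxy > 0                              # skip empty cells to avoid log(0)
    mi = float(np.sum(pxy[nz] * np.log(pxy[nz] / outer[nz])))
    hx = float(-np.sum(px[px > 0] * np.log(px[px > 0])))
    hy = float(-np.sum(py[py > 0] * np.log(py[py > 0])))
    if hx == 0.0 or hy == 0.0:                # constant input: no information to share
        return 0.0
    return mi / float(np.sqrt(hx * hy))

def similarity_matrix(embeddings, bins: int = 8) -> np.ndarray:
    """Pairwise NPIB between flattened per-layer embeddings."""
    n = len(embeddings)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            S[i, j] = npib(np.ravel(embeddings[i]), np.ravel(embeddings[j]), bins)
    return S

# Toy "layers": layer 1 is a noisy copy of layer 0, layer 2 is unrelated.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
embeddings = [base, base + 0.1 * rng.normal(size=200), rng.normal(size=200)]
S = similarity_matrix(embeddings)
print(np.round(S, 3))
```

With this normalization a layer compared with itself scores 1, and the noisy copy scores higher than the unrelated layer, which is the behavior you would want when picking merge candidates.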

Calculate Weight Ratio (2)

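The merging weight is set based on the similarity difference between layers.

When the similarity difference between two layers is large, Ψ assigns a higher weight to the layer with higher similarity and reduces the weight of the layer with lower similarity.

(What does this mean? What is the reference layer?;;;; We're comparing two layers, so it's the more similar one of the two??? I don't get it)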

λ_m : the merging ratio
Ψ : the adaptive weight allocation function

Merging Layer Parameters (3)

L_m : the newly merged layer (the merger of L_i and L_j)
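
To make the weighted sum concrete, here is a minimal sketch assuming the merge is L_m = λ_m · L_i + (1 − λ_m) · L_j applied parameter by parameter. The exact form of Ψ is not recorded in these notes, so `psi` below is a hypothetical placeholder that just clamps the similarity score into [0, 1]; which layer gets λ_m and which gets 1 − λ_m is also my assumption, and layers are represented as plain dicts of flat lists to keep the example dependency-free.

```python
from typing import Dict, List

Params = Dict[str, List[float]]

def psi(similarity: float) -> float:
    """Hypothetical stand-in for the adaptive weight allocation function Ψ."""
    return min(1.0, max(0.0, similarity))

def merge_layers(theta_i: Params, theta_j: Params, lam: float) -> Params:
    """Weighted sum of two layers: lam * theta_i + (1 - lam) * theta_j."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("merging ratio must lie in [0, 1]")
    if theta_i.keys() != theta_j.keys():
        raise ValueError("layers must have the same parameter names")
    merged: Params = {}
    for name in theta_i:
        a, b = theta_i[name], theta_j[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [lam * u + (1.0 - lam) * v for u, v in zip(a, b)]
    return merged

# Toy layers with one weight vector and one bias (illustrative values only).
layer_i = {"w": [1.0, 3.0], "b": [0.0]}
layer_j = {"w": [3.0, 5.0], "b": [4.0]}
lam = psi(0.25)
layer_m = merge_layers(layer_i, layer_j, lam)
print(layer_m)  # {'w': [2.5, 4.5], 'b': [3.0]}
```

Because the merge is a plain convex combination, it only makes sense when both layers have identical parameter shapes, which holds for the repeated blocks of a transformer.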

3. Experiments

a comprehensive set of experiments to evaluate the effectiveness and generalizability of our MKA method across various domains

3.1. Experimental Setup

3.1.1. Datasets

(all measured by accuracy)

MMLU

PIQA

HellaSwag

RACE-H

BoolQ

3.1.2. LLMs

Llama-2

Llama-3

Mistral-7B

3.1.3. Baselines

(1. existing pruning methods / 2. existing pruning methods + quantization)

SparseGPT

ShortGPT

Reverse Pruning (a heuristic that treats a layer's importance as inversely proportional to its position in the model, aiming to preserve the early layers first.)

SmoothQuant

GPTQ

AWQ

3.2. In what ways does MKA surpass conventional pruning techniques?

MMLU dataset using the Llama3-8B, Llama3-70B, Mistral-7B, Llama2-7B, and Llama2-13B models

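- The compression ratio reaches 43.75% on Llama3-8B, 40% on Mistral-7B, and a surprising 57.5% on Llama2-13B

- Both methods experience a collapse in model performance, but the merging approach delays the layer collapse to some degree and can keep model performance stable

- Because the proposed method is based on Reverse Prune (when did they say that..), the scores on Llama3-8B, Llama2-7B, and Llama2-13B are very close to Reverse Prune's (in the other cases they were not always similar ..)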

3.3. How Does MKA Combined with Quantization Perform Compared to Pruning Combined with Quantization?

- The pruned models can additionally be quantized while still maintaining performance, reaching even higher compression ratios

- For example, on Llama3-8B, at a compression ratio of 85.94%, MKA with SmoothQuant achieves 64.20%, far exceeding ShortGPT with SmoothQuant at 37.66%.

- Whaaaaat?????? A pruning ratio over 80%?????? Is there anything even left?? Does that make sense,,?

I guess that's just how it is on the quantization side ...........
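
The 85.94% figure looks less mysterious once layer merging and quantization are multiplied together. One breakdown that reproduces the number, under my own assumptions (the breakdown is not stated in these notes): the ratio is counted in layers, Llama3-8B has 32 transformer layers, 14 of them are merged away (43.75%), and the remaining weights go from 16 bits to 4 bits. The helper names are made up for illustration.

```python
def layer_ratio(merged_layers: int, total_layers: int) -> float:
    """Fraction of layers removed by merging."""
    return merged_layers / total_layers

def combined_ratio(merged_layers: int, total_layers: int,
                   bits_before: int, bits_after: int) -> float:
    """Compression ratio after layer merging followed by weight quantization."""
    remaining_layers = 1.0 - merged_layers / total_layers
    remaining_bits = bits_after / bits_before
    return 1.0 - remaining_layers * remaining_bits

# Layer-only ratios for a 32-layer model with 11..14 layers merged away.
for k in (11, 12, 13, 14):
    print(k, f"{layer_ratio(k, 32):.3%}")   # 34.375%, 37.500%, 40.625%, 43.750%

# 14 of 32 layers merged, then 16-bit -> 4-bit weights (assumed breakdown).
print(f"{combined_ratio(14, 32, 16, 4):.4%}")  # 85.9375%
```

So more than half of the layers are still there; most of the remaining savings come from the lower bit width.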

3.4. MKA vs. Other Pruning Methods on various benchmarks

Llama3-8B

ratios of {34.375%, 37.5%, 40.625%, 43.75%}

- performs well on every dataset

- For example, at a compression ratio of 34.375% on the MMLU dataset, our method can outperform ShortGPT by 21.92% and SparseGPT by 20.42%.

- Oh.. pruning really tanks PIQA and HellaSwag (and MMLU) performance

3.5. Are Inter-Layer Knowledge Alignment Similarity Matrices Consistent Across different Large Models?

- Visualize the knowledge alignment and layer merging effects of MKA on various models. (Huh, it says before and after MKA, but I can't tell which part is before and which is after;;;;;;;)

- Overall, the later layers of the model show high similarity

- Importance of the early layers -> Additionally, when merging the earlier layers, we notice a collapse of the matrix in the final figure, suggesting that earlier layers have a significant influence on later layers.

4. Discussion

4.1. Extension to Multimodal and Specialized Models

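- Also applicable to MoE and Mamba models. (Both show similar redundancy)

- The similarity distributions of Jamba and Mixtral-8x7B show a slightly different trend from those of the LLMs(!) (makes sense given the different architectures, but I'm curious about the reason)

* Mixtral-8x7B : uses attention, Mistral 7B + Mixture of Experts

* Mamba : a sequence model based on a State Space Model (SSM), with a parallelizable RNN-like structure that uses no attention

* Jamba : an open-source MoE model based on Mamba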

4.2. Analysis of Similarity Measures

Llama3-8B

similarity metric : {Cosine Similarity, Mahalanobis Distance, Euclidean Distance, t-SNE Similarity, Autoencoder Similarity}

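Cosine Similarity, Mahalanobis Distance, and Euclidean Distance can be seen to show similar distribution patterns, with vertical stripes and varying heat values. However, Mahalanobis Distance shows irregular heat values within those stripes, indicating a mismatch with the fused layer data structure. t-SNE Similarity is random and lacks a consistent pattern. For Autoencoder Similarity, the high heat values do not line up with suitable merging regions or with regions of expected high similarity.

- So the takeaway seems to be that computing similarity through manifold learning works best, since plain generic metrics come out looking odd like that.. (Other papers didn't see this... maybe the results differ because the matrix the similarity was computed on is different)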
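
For reference, this is roughly what the simplest of the compared baselines looks like: a layer-by-layer cosine similarity matrix. It is a sketch under my own assumptions (one flattened activation vector per layer, stacked as rows); the function name `cosine_similarity_matrix` and the toy vectors are made up, and this is not the paper's evaluation code.

```python
import numpy as np

def cosine_similarity_matrix(acts: np.ndarray) -> np.ndarray:
    """Cosine similarity between every pair of rows of acts (one row per layer)."""
    norms = np.linalg.norm(acts, axis=1, keepdims=True)
    unit = acts / np.clip(norms, 1e-12, None)   # guard against all-zero rows
    return unit @ unit.T

# Toy per-layer activation vectors (illustrative only).
acts = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [2.0, 0.0]])
S_cos = cosine_similarity_matrix(acts)
print(S_cos)
```

Cosine similarity ignores vector magnitude (rows 0 and 2 above score 1.0), which is one reason its heatmap can look different from a metric computed on manifold embeddings.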

4.3. Variations in Accuracy Across Different MMLU Subjects During Layer Merging

Subject : {College Medicine, College Biology, High School Psychology, College Physics}

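- High School Psychology stays stable, with only slight fluctuations in accuracy

- College Biology drops sharply in accuracy at a 12.5% merging ratio and then recovers

- College Physics fluctuates frequently in accuracy, showing high sensitivity to layer merging

- College Medicine improves steadily, with minimal fluctuation

- What does it mean that performance goes up after removing more layers? (Oh, maybe this isn't updated cumulatively like SLEB? Is the pruning combination different at each ratio?)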

5. Conclusion

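(I lost interest once I saw that the merge is just a weighted sum .. :(

Here the adaptive ratio is derived from similarity-something-something,

but I could at least take away the idea of rethinking that part: come up with a new way to compute the adaptive ratio, then do a weighted sum.

(because I want to propose a method that doesn't need importance scores or similarity combinations) )

It seems complicated, finicky, and not that great.. did novelty carry the whole thing?

I didn't pick apart every equation (and there's no code, so I can't tell how it was done); I read it at a high level, focusing on the methodology and the important parts

The merge unit is said to be a layer, but it doesn't say exactly which weights' parameters were merged, so I can't tell.... maybe all of them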

Limitations

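The manifold learning step depends heavily on the quality of the input dataset.

The number (amount) of sample data also significantly impacts the manifold learning result.

Keeping the condition number below 2000 is critical for ensuring the accuracy of the learned manifold representations. If the dataset used to extract activations does not sufficiently cover the model's operational range, the learned manifold representations may fail to capture the true geometric structure of the data.

The current implementation of MKA has been tested mainly on transformer-based architectures. The authors believe deep neural networks inherently contain redundancy, but whether MKA can be applied to other neural network architectures, such as CNNs or RNNs, and achieve the same compression effect has not yet been sufficiently explored.

* Condition Number : the condition number of a function y=f(x) indicates how large the relative change in the output y is for a small relative change in the input x; it is a measure of the function's sensitivity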
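
For a matrix, the 2-norm condition number is the ratio of the largest to the smallest singular value, and NumPy exposes it as `np.linalg.cond`. The sketch below checks a matrix against the 2000 limit mentioned above. Which matrix the paper actually monitors is not recorded in these notes, so the toy matrices and the helper names `condition_number` and `within_limit` are purely illustrative.

```python
import numpy as np

def condition_number(M: np.ndarray) -> float:
    """2-norm condition number: largest singular value / smallest singular value."""
    return float(np.linalg.cond(M))

def within_limit(M: np.ndarray, limit: float = 2000.0) -> bool:
    """True if the matrix is at or below the conditioning limit."""
    return condition_number(M) <= limit

well = np.diag([2.0, 1.0])    # singular values 2 and 1    -> condition number 2
ill = np.diag([1.0, 1e-4])    # singular values 1 and 1e-4 -> condition number 1e4

print(condition_number(well), within_limit(well))
print(condition_number(ill), within_limit(ill))
```

A large condition number means small perturbations of the input (for example, a slightly different sample set) can change the result a lot, which fits the note above about sensitivity to the dataset.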

References:

Manifold Learning https://bkshin.tistory.com/entry/%EC%BB%B4%ED%93%A8%ED%84%B0-%EB%B9%84%EC%A0%84-7-%EC%98%A4%ED%86%A0%EC%9D%B8%EC%BD%94%EB%8D%94AutoEncoder%EC%99%80-%EB%A7%A4%EB%8B%88%ED%8F%B4%EB%93%9C-%ED%95%99%EC%8A%B5Manifold-Learning

Condition number https://pasus.tistory.com/103

