[Arxiv 2024] PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs

Introduction

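The authors propose PrefixQuant, which builds on QuaRot: it keeps outlier tokens lossless during weight-activation (WA) quantization and adds EfficientQAT fine-tuning, achieving fairly good results for W4A4 static quantization. However, like CushionCache, PrefixQuant keeps all outlier tokens lossless but never discusses how adding the prefix itself affects model accuracy.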

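The authors find that, for static quantization, the activation distribution of outlier tokens differs markedly from that of other tokens. Without special handling of the outlier tokens, the calibrated quantization parameters hurt the quantization accuracy of the non-outlier tokens: for example, the down_proj input of outlier tokens contains massive outliers, while their KV cache is particularly flat. If the outlier tokens are kept lossless and calibration is done on the remaining tokens only, the quantization range becomes smaller and quantization accuracy improves.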
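The effect described above can be illustrated with a toy example. This is a minimal sketch, not the paper's implementation: it assumes simple max-based calibration of a single per-tensor scale, and the activation values and helper names (`quantize`, `static_scale`, `mean_abs_error`) are made up for illustration.

```python
def quantize(x, scale, bits=4):
    """Symmetric uniform fake-quantization of one value with a fixed scale."""
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax - 1, min(qmax, round(x / scale)))
    return q * scale


def static_scale(values, bits=4):
    """Per-tensor static scale from calibration values (max calibration, an assumption)."""
    qmax = 2 ** (bits - 1) - 1
    return max(abs(v) for v in values) / qmax


def mean_abs_error(values, scale, bits=4):
    """Average absolute quantization error over the given values."""
    return sum(abs(v - quantize(v, scale, bits)) for v in values) / len(values)


# Hypothetical activations: ordinary tokens are small, one outlier token is massive.
normal = [0.3, -0.7, 1.1, -0.2, 0.9, -1.3, 0.5, 0.05]
outlier = [250.0]

scale_all = static_scale(normal + outlier)  # range dominated by the outlier token
scale_normal = static_scale(normal)         # outlier token kept out of calibration

err_all = mean_abs_error(normal, scale_all)
err_normal = mean_abs_error(normal, scale_normal)
print(f"calibrated with outlier:    scale={scale_all:.3f}, error on normal tokens={err_all:.3f}")
print(f"calibrated without outlier: scale={scale_normal:.3f}, error on normal tokens={err_normal:.3f}")
```

With the outlier included, the 4-bit step is so coarse that every ordinary value collapses to zero; excluding it gives a much finer step for the same tokens.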
Definition of Outlier Token. Outlier tokens are located from the input activations of down_proj,
with the threshold $\eta=64$.
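The detection formula itself is not reproduced in these notes, so the following is only a sketch under an assumption: a token is flagged when its peak absolute activation exceeds $\eta$ times the median of the per-token peaks. The function name `find_outlier_tokens` and the toy activations are hypothetical.

```python
from statistics import median

ETA = 64  # threshold quoted in the notes


def find_outlier_tokens(activations, eta=ETA):
    """Return indices of tokens whose peak magnitude exceeds eta * median peak.

    activations: one list of floats per token, standing in for the down_proj
    input of a single transformer block. The ratio-to-median criterion is an
    assumption, not taken from these notes.
    """
    token_max = [max(abs(v) for v in row) for row in activations]
    med = median(token_max)
    return [i for i, m in enumerate(token_max) if m > eta * med]


# Hypothetical activations for five tokens with three channels each.
acts = [
    [900.0, -3.0, 0.5],  # token 0 carries a massive value
    [0.8, -1.2, 0.4],
    [1.0, 0.3, -0.9],
    [-1.1, 0.2, 0.6],
    [0.7, -0.5, 1.3],
]
print(find_outlier_tokens(acts))
```

Using the median as the reference keeps the threshold stable even when one token is orders of magnitude larger than the rest.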
Number of Outlier Tokens. The number of outlier tokens for each model is measured on the calibration set as $o=\lceil\max(\mathbf O)\rceil$, where $\mathbf O\in\mathbb R^b$ holds the outlier-token counts measured across all transformer blocks.
Which Tokens to Prefix? The top-$o$ high-frequency outlier tokens + [BOS].
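The two selection steps above can be sketched as follows. This is an illustration only: the per-block counts and token occurrences are invented, the helper names are hypothetical, and placing [BOS] after the outlier tokens is an assumption about ordering.

```python
import math
from collections import Counter


def num_prefix_outliers(per_block_counts):
    """o = ceil(max(O)), with one outlier-token count per transformer block."""
    return math.ceil(max(per_block_counts))


def choose_prefix(outlier_token_occurrences, o, bos="[BOS]"):
    """Pick the o most frequent outlier tokens, then append [BOS] (order assumed)."""
    top = [tok for tok, _ in Counter(outlier_token_occurrences).most_common(o)]
    return top + [bos]


# Hypothetical statistics gathered on a calibration set.
per_block_counts = [1.0, 1.6, 1.2]                 # O, one entry per block
occurrences = [".", "\n", ".", "the", "\n", "."]   # tokens flagged as outliers

o = num_prefix_outliers(per_block_counts)
prefix = choose_prefix(occurrences, o)
print(o, prefix)
```

Taking the maximum over blocks sizes the prefix for the block that produces the most outlier tokens.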

Settings. Weights use per-channel symmetric quantization. The KV cache uses per-head symmetric static quantization for 4-bit and per-tensor symmetric static quantization for 8-bit. Activations use per-tensor static quantization. The calibration set is 8 Pile samples with a sequence length of 1024, and the initial scale values are found by grid search. The fine-tuning set is 512 Pile samples with a context length of 1024.
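To make two of these settings concrete, here is a minimal sketch of per-channel symmetric weight scales and of a grid search for a static activation scale. The notes only say that grid search finds the initial scale; the search space (shrinking the max-based scale in fixed steps) and the MSE objective are assumptions, and all names and values are hypothetical.

```python
def fake_quant(values, scale, bits=4):
    """Symmetric uniform quantize-dequantize with one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]


def mse(a, b):
    """Mean squared error between two equally long lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def grid_search_scale(values, bits=4, steps=100):
    """Try fractions k/steps of the max-based scale and keep the lowest-MSE one."""
    qmax = 2 ** (bits - 1) - 1
    base = max(abs(v) for v in values) / qmax
    best_scale, best_err = base, mse(values, fake_quant(values, base, bits))
    for k in range(1, steps):
        scale = base * k / steps
        err = mse(values, fake_quant(values, scale, bits))
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale, best_err


def per_channel_scales(weight, bits=4):
    """One symmetric scale per output channel (row) of a weight matrix."""
    qmax = 2 ** (bits - 1) - 1
    return [max(abs(w) for w in row) / qmax for row in weight]


# Hypothetical calibration activations and a tiny 2x2 weight matrix.
calib = [0.1, -0.2, 0.15, 0.3, -0.25, 0.05, 0.2, -0.1, 0.12, 4.0]
scale, err = grid_search_scale(calib)
print(f"activation scale={scale:.4f}, mse={err:.5f}")
print(per_channel_scales([[0.7, -1.4], [0.07, 0.035]]))
```

A static scale is fixed after calibration, so no per-token statistics are needed at inference time, which is where the speed advantage over dynamic quantization comes from.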

Comparison Results.
Results on weight-only quantization.
Inference Speed. (1) Static Quantization Speedup.
(2) Linear Layer Speedup. For low-bit matrix multiplication, we use the 4-bit GEMM kernel from CUTLASS and design a custom kernel for W4A4 GEMV. We also integrate the de-quantization process into the GEMM and GEMV kernels.
(3) End-to-end speedup. KV cache quantization is not used in this speed test (it saves memory footprint through more computation overhead and only achieves speedup with large batch sizes).
Ablation Studies. (1) Main Components.
(2) Number of Prefixed Tokens.
(3) Content of Prefixed Tokens.
Quantization Time.

Chen, Mengzhao, et al. “PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs.” arXiv preprint arXiv:2410.05265 (2024).
code: https://github.com/chenmnz/prefixquant
