[Arxiv 2024] PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs

Contents

  • Introduction
  • Method
  • Experiments
  • References

Introduction

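  • The authors propose PrefixQuant, built on QuaRot. By keeping outlier tokens lossless during weight-activation (WA) quantization and adding EfficientQAT fine-tuning, it achieves fairly good results with W4A4 static quantization. However, as with CushionCache, although PrefixQuant keeps all outlier tokens lossless, it never discusses how adding the prefix itself affects model accuracy.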

Method

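  • The authors observe that, for static quantization, the activation distribution of outlier tokens differs markedly from that of other tokens. Without special handling of the outlier tokens, the calibrated quantization parameters hurt the quantization accuracy of the non-outlier tokens: for example, the down_proj inputs of outlier tokens contain massive outliers, while their KV cache is unusually flat. If the outlier tokens are kept lossless and calibration is done on the remaining tokens, the quantization range becomes smaller and quantization accuracy improves.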
  • Definition of Outlier Token. Outlier tokens are located from the input activations of down_proj, using the threshold $\eta=64$.
  • Number of Outlier Tokens. The number of outlier tokens for each model is measured on the calibration set as $o=\lceil\max(\mathbf O)\rceil$, where $\mathbf O\in\mathbb R^b$ holds the outlier-token counts collected across all transformer blocks.
  • Which Tokens to Prefix? The top-$o$ high-frequency outlier tokens + [BOS].
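
The three steps above (flag outlier tokens, count them, pick the prefix) can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the ratio-to-median test is one plausible reading of how the threshold $\eta=64$ is applied, the fractional per-block counts assume the counts are averaged over calibration samples, and every function name here is hypothetical rather than taken from the PrefixQuant code base.

```python
import math
from collections import Counter
from statistics import median

ETA = 64  # threshold eta from the notes above


def outlier_positions(token_max_abs, eta=ETA):
    """Indices of tokens whose peak |activation| exceeds eta times the median peak.

    token_max_abs[i] is the largest absolute down_proj input value of token i.
    ASSUMPTION: the ratio-to-median test is an illustrative criterion.
    """
    med = median(token_max_abs)
    return [i for i, m in enumerate(token_max_abs) if m / med > eta]


def num_prefix_tokens(per_block_counts):
    """o = ceil(max(O)), with one (possibly averaged) outlier count per block."""
    return math.ceil(max(per_block_counts))


def choose_prefix(outlier_tokens, o, bos="[BOS]"):
    """Top-o most frequent outlier tokens, followed by [BOS].

    ASSUMPTION: placing [BOS] last simply mirrors the order written in the notes.
    """
    top = [tok for tok, _ in Counter(outlier_tokens).most_common(o)]
    return top + [bos]


if __name__ == "__main__":
    peaks = [900.0, 1.2, 0.9, 1.1, 1.0, 800.0, 1.3, 0.8]  # toy per-token peaks
    print(outlier_positions(peaks))
    o = num_prefix_tokens([0.0, 1.5, 1.75])  # toy per-block averages
    print(choose_prefix([".", "\n", ".", "the", ".", "\n"], o))
```

Using the median as the reference keeps the test robust: a handful of massive values barely moves the median, so they stand out clearly against it.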

  • Block-wise Fine-tuning. EfficientQAT is used to fine-tune the scales and weights.
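
As a rough sketch of what such block-wise fine-tuning optimizes, assuming the usual block reconstruction objective (squared error between the full-precision block output and the quantized block output). The symbols are illustrative and not taken from the notes: $\mathcal B$ is a full-precision transformer block, $\mathcal B_q$ its quantized counterpart, $\mathbf X$ the block input, and $\mathbf W$, $\mathbf s$ the trainable weights and scales.

```latex
\min_{\mathbf{W},\,\mathbf{s}} \;
\left\lVert \mathcal{B}(\mathbf{X}) - \mathcal{B}_q(\mathbf{X};\, \mathbf{W}, \mathbf{s}) \right\rVert_2^2
```

Each block is optimized on its own, so only one block's parameters and activations need to be held for training at a time.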

Experiments

  • Settings. Weights use per-channel symmetric quantization; the KV cache uses per-head symmetric static quantization for 4-bit and per-tensor symmetric static quantization for 8-bit; activations use per-tensor static quantization. The calibration set is 8 Pile samples with a sequence length of 1024, and the initial scales are found by grid search. The fine-tuning set is 512 Pile samples with a context length of 1024.
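
To make "per-tensor static quantization with a grid-searched initial scale" concrete, here is a minimal standard-library sketch. It assumes the grid search sweeps clipping ratios of the max-abs scale and keeps the one with the lowest mean squared error on the calibration values; the notes do not spell out the search procedure, so that choice, the toy data, and all function names are assumptions. The demo also shows the effect described in the Method section: one massive value in the calibration data inflates the static scale and ruins the accuracy of ordinary values.

```python
def quantize(x, scale, bits=4):
    """Symmetric quantize-dequantize of one value with a fixed (static) scale."""
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax - 1, min(qmax, round(x / scale)))
    return q * scale


def mse(values, scale, bits=4):
    """Mean squared quantization error of `values` under one shared scale."""
    return sum((v - quantize(v, scale, bits)) ** 2 for v in values) / len(values)


def grid_search_scale(calib, bits=4, steps=100):
    """Pick the scale that minimises calibration MSE.

    ASSUMPTION: candidates are clipping ratios k/steps of the max-abs scale.
    """
    qmax = 2 ** (bits - 1) - 1
    base = max(abs(v) for v in calib) / qmax
    candidates = [base * k / steps for k in range(1, steps + 1)]
    return min(candidates, key=lambda s: mse(calib, s, bits))


if __name__ == "__main__":
    normal = [-1.0, -0.6, -0.2, 0.1, 0.4, 0.7, 0.9, 1.0]  # ordinary activations
    with_outlier = normal + [500.0]                        # one massive outlier

    scale_clean = grid_search_scale(normal)
    scale_dirty = grid_search_scale(with_outlier)

    # Error on the ordinary values under each static scale.
    print("clean scale:", scale_clean, "mse:", mse(normal, scale_clean))
    print("dirty scale:", scale_dirty, "mse:", mse(normal, scale_dirty))
```

Because the scale is fixed after calibration, nothing at inference time can shrink it again for ordinary tokens, which is why removing the outliers from the calibrated range matters more for static than for dynamic quantization.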

  • Comparison Results.
  • Results on weight-only quantization.
  • Inference Speed. (1) Static Quantization Speedup.
    (2) Linear Layer Speedup. For low-bit matrix multiplication, we use the 4-bit GEMM kernel from CUTLASS and design a custom kernel for W4A4 GEMV. We also integrate the de-quantization process into the GEMM and GEMV kernels.
    (3) End-to-end speedup. KV cache quantization is not used in this speed test (it saves memory footprint through more computation overhead and only achieves speedup with large batch sizes).
  • Ablation Studies. (1) Main Components.
    (2) Number of Prefixed Tokens.
    (3) Content of Prefixed Tokens.
  • Quantization Time.

References

  • Chen, Mengzhao, et al. “PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs.” arXiv preprint arXiv:2410.05265 (2024).
  • Code: https://github.com/chenmnz/prefixquant
