Preface

On February 24, 2023, Meta AI released LLaMA, a family of large language models with up to 65 billion parameters. On March 3, someone "leaked" its trained weights; the weights can of course also be obtained through Meta's official application form (applications are generally approved). This time we will run llama-int8 as well as GPTQ-for-LLaMa, a 4-bit quantized model with a web UI.

Overview

  • The model comes in four sizes: 7B, 13B, 30B, and 65B parameters

  • Meta claims that LLaMA-13B already outperforms GPT-3

  • Memory and VRAM usage while the models run: [screenshot of RAM/VRAM usage omitted]

  • Available model variants:

  • Meta AI original model: https://github.com/facebookresearch/llama

Environment setup

  • Windows 10 Professional 64 Bit
  • NVIDIA RTX 3090
  • CUDA 11.6
  • cuDNN 8.8.1

Creating the conda environment

conda create -n textgen python=3.10.9
conda activate textgen
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
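
As an optional sanity check, you can confirm that this PyTorch build actually sees the GPU and was compiled against CUDA 11.6 (should print True and 11.6):

python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"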

Quick installation

Here is the requirements.txt from the conda virtual environment I created:

accelerate==0.18.0
aiofiles==23.1.0
aiohttp==3.8.4
aiosignal==1.3.1
altair==4.2.2
anyio==3.6.2
asttokens @ file:///opt/conda/conda-bld/asttokens_1646925590279/work
async-timeout==4.0.2
attrs==22.2.0
backcall @ file:///home/ktietz/src/ci/backcall_1611930011877/work
bitsandbytes @ git+https://github.com/Keith-Hon/bitsandbytes-windows.git@85ff11a7f04af73bc83cbe6ed0eb1a77ade0697b
certifi @ file:///C:/b/abs_85o_6fm0se/croot/certifi_1671487778835/work/certifi
charset-normalizer==3.1.0
click==8.1.3
colorama @ file:///C:/b/abs_a9ozq0l032/croot/colorama_1672387194846/work
comm @ file:///C:/b/abs_1419earm7u/croot/comm_1671231131638/work
contourpy==1.0.7
cycler==0.11.0
datasets==2.10.1
debugpy @ file:///C:/ci_310/debugpy_1642079916595/work
decorator @ file:///opt/conda/conda-bld/decorator_1643638310831/work
dill==0.3.6
entrypoints==0.4
executing @ file:///opt/conda/conda-bld/executing_1646925071911/work
fairscale==0.4.13
fastapi==0.95.0
ffmpy==0.3.0
filelock==3.11.0
fire==0.5.0
fonttools==4.39.2
frozenlist==1.3.3
fsspec==2023.3.0
gradio==3.24.1
gradio_client==0.0.5
h11==0.14.0
httpcore==0.16.3
httpx==0.23.3
huggingface-hub==0.13.3
idna==3.4
ipykernel @ file:///C:/b/abs_b4f07tbsyd/croot/ipykernel_1672767104060/work
ipython @ file:///C:/b/abs_d1yx5tjhli/croot/ipython_1680701887259/work
jedi @ file:///C:/ci/jedi_1644315428305/work
Jinja2==3.1.2
jsonschema==4.17.3
jupyter_client @ file:///C:/b/abs_059idvdagk/croot/jupyter_client_1680171872444/work
jupyter_core @ file:///C:/b/abs_9d0ttho3bs/croot/jupyter_core_1679906581955/work
kiwisolver==1.4.4
linkify-it-py==2.0.0
Markdown==3.4.3
markdown-it-py==2.2.0
MarkupSafe==2.1.2
matplotlib==3.7.1
matplotlib-inline @ file:///C:/ci/matplotlib-inline_1661934094726/work
mdit-py-plugins==0.3.3
mdurl==0.1.2
multidict==6.0.4
multiprocess==0.70.14
nest-asyncio @ file:///C:/b/abs_3a_4jsjlqu/croot/nest-asyncio_1672387322800/work
ninja==1.11.1
numpy==1.24.2
orjson==3.8.9
packaging @ file:///C:/b/abs_ed_kb9w6g4/croot/packaging_1678965418855/work
pandas==1.5.3
parso @ file:///opt/conda/conda-bld/parso_1641458642106/work
peft==0.2.0
pickleshare @ file:///tmp/build/80754af9/pickleshare_1606932040724/work
Pillow==9.4.0
platformdirs @ file:///C:/b/abs_73cc5cz_1u/croots/recipe/platformdirs_1662711386458/work
prompt-toolkit @ file:///C:/b/abs_6coz5_9f2s/croot/prompt-toolkit_1672387908312/work
psutil==5.9.4
pure-eval @ file:///opt/conda/conda-bld/pure_eval_1646925070566/work
pyarrow==11.0.0
pydantic==1.10.7
pydub==0.25.1
Pygments @ file:///opt/conda/conda-bld/pygments_1644249106324/work
pyparsing==3.0.9
pyrsistent==0.19.3
python-dateutil @ file:///tmp/build/80754af9/python-dateutil_1626374649649/work
python-multipart==0.0.6
pytz==2023.3
pywin32==305.1
PyYAML==6.0
pyzmq @ file:///C:/ci/pyzmq_1657616000714/work
quant-cuda @ file:///E:/text-generation-webui/repositories/GPTQ-for-LLaMa/quant_cuda-0.0.0-cp310-cp310-win_amd64.whl
regex==2023.3.23
requests==2.28.2
responses==0.18.0
rfc3986==1.5.0
semantic-version==2.10.0
sentencepiece==0.1.97
six @ file:///tmp/build/80754af9/six_1644875935023/work
sniffio==1.3.0
stack-data @ file:///opt/conda/conda-bld/stack_data_1646927590127/work
starlette==0.26.1
termcolor==2.2.0
tokenizers==0.13.2
toolz==0.12.0
torch==1.13.1+cu116
torchaudio==0.13.1+cu116
torchvision==0.14.1+cu116
tornado @ file:///C:/ci/tornado_1662476985533/work
tqdm==4.65.0
traitlets @ file:///C:/b/abs_e5m_xjjl94/croot/traitlets_1671143896266/work
transformers @ git+https://github.com/huggingface/transformers@4c01231e67f0d699e0236c11178c956fb9753a17
typing_extensions==4.5.0
uc-micro-py==1.0.1
urllib3==1.26.15
uvicorn==0.21.1
wcwidth @ file:///Users/ktietz/demo/mc3/conda-bld/wcwidth_1629357192024/work
websockets==10.4
wincertstore==0.2
xxhash==3.2.0
yarl==1.8.2
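
To recreate the environment from this file, note that the `@ file:///...` entries point at conda-build caches and a locally compiled quant_cuda wheel on my machine; drop or replace those lines before installing on another computer:

pip install -r requirements.txt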

Running the model with a web UI

git clone https://github.com/oobabooga/text-generation-webui.git
cd text-generation-webui
pip install -r requirements.txt

mkdir repositories
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git
cd GPTQ-for-LLaMa
pip install ninja
conda install -c conda-forge cudatoolkit-dev
python setup_cuda.py install
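
If the build succeeded, the compiled extension should import cleanly (quant_cuda is the module name of the wheel that setup_cuda.py installs):

python -c "import quant_cuda; print('quant_cuda OK')"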

Downloading the model weights

It is recommended to first create a models folder in the repository root:

  • Meta AI original weight files:

    models
    ├── llama-7b
    │   ├── consolidated.00.pth
    │   ├── params.json
    │   └── checklist.chk
    └── tokenizer.model
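
A minimal shell sketch for laying out this tree and verifying the download (assuming a Git Bash or WSL-style shell on Windows; checklist.chk is Meta's MD5 manifest for the 7B shard):

# create the expected layout, starting from the repository root
mkdir -p models/llama-7b
mv consolidated.00.pth params.json checklist.chk models/llama-7b/
mv tokenizer.model models/

# verify the weights against the MD5 manifest
cd models/llama-7b && md5sum -c checklist.chk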

Weights torrent file: Safe-LLaMA-HF (3-26-23).zip

Weights magnet link: magnet:?xt=urn:btih:ZXXDAUWYLRUXXBHUYEMS6Q5CE5WA3LVA&dn=LLaMA
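
One way to fetch the weights from the magnet link above (assuming the aria2 downloader is installed):

aria2c "magnet:?xt=urn:btih:ZXXDAUWYLRUXXBHUYEMS6Q5CE5WA3LVA&dn=LLaMA"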

  • Hugging Face converted weight files: create a models folder under the root of text-generation-webui. For example, to import the 13B weights here, git clone the decapoda-research/llama-13b-hf repository into it and rename the resulting folder to llama-13b (rename sketch shown after the snippet below):
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/decapoda-research/llama-13b-hf

# if you want to clone without large files – just their pointers
# prepend your git clone with the following env var:
GIT_LFS_SKIP_SMUDGE=1
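
After the clone finishes, rename the folder to the name used above so the web UI can find it (a minimal sketch, assuming the clone was run inside text-generation-webui/models):

mv llama-13b-hf llama-13b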

I have already downloaded both kinds of weight files and shared them on Baidu Netdisk; extraction code: 1234

Running the models

# run GPTQ-for-LLaMA web client
python server.py --cai-chat --model llama-7b --no-stream

# run llama-int8
python example.py --ckpt_dir [TARGET_DIR]/7b --tokenizer_path [TARGET_DIR]/tokenizer.model --max_batch_size=1
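
To watch the memory/VRAM usage mentioned in the overview while a model is loaded, nvidia-smi can poll the GPU:

# print used/total VRAM every 2 seconds
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 2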

Running the model with the web UI looks like this:

[Screenshot: the model chatting in the text-generation-webui interface]

Mapping the port to the public internet

Since I am on a campus network, the service has to be tunneled to the public internet before it can be accessed externally. After trying several tools, I am using ngrok as a temporary solution.
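
A minimal example, assuming the web UI is listening on Gradio's default port 7860:

ngrok http 7860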

