llama.cpp
Inference of the LLaMA model in pure C/C++
Hot topics:
- RMSNorm implementation / fixes: https://github.com/ggerganov/llama.cpp/issues/173
- Cache input prompts for faster initialization: https://github.com/ggerganov/llama.cpp/issues/64
- Create a llama.cpp logo: https://github.com/ggerganov/llama.cpp/issues/105
Description
The main goal is to run the model using 4-bit quantization on a MacBook
- Plain C/C++ implementation without dependencies
- Apple silicon first-class citizen - optimized via ARM NEON
- AVX2 support for x86 architectures
- Mixed F16/F32 precision
- 4-bit quantization support
- Runs on the CPU
This was hacked in an evening - I have no idea if it works correctly.
Please do not make conclusions about the model based on the results from this implementation.
For all I know, it can be completely wrong. This project is for educational purposes.
New features will probably be added mostly through community contributions.
Supported platforms:
- [X] Mac OS
- [X] Linux
- [X] Windows (via CMake)
- [X] Docker
Here is a typical run using LLaMA-7B:
make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 512
I llama.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS: -framework Accelerate
I CC: Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX: Apple clang version 14.0.0 (clang-1400.0.29.202)
make: Nothing to be done for `default'.
main: seed = 1678486056
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size = 512.00 MB, n_mem = 16384
llama_model_load: .................................... done
llama_model_load: model size = 4017.27 MB / num tensors = 291
main: prompt: 'Building a website can be done in 10 simple steps:'
main: number of tokens in prompt = 15
1 -> ''
8893 -> 'Build'
292 -> 'ing'
263 -> ' a'
4700 -> ' website'
508 -> ' can'
367 -> ' be'
2309 -> ' done'
297 -> ' in'
29871 -> ' '
29896 -> '1'
29900 -> '0'
2560 -> ' simple'
6576 -> ' steps'
29901 -> ':'
sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000
Building a website can be done in 10 simple steps:
1) Select a domain name and web hosting plan
2) Complete a sitemap
3) List your products
4) Write product descriptions
5) Create a user account
6) Build the template
7) Start building the website
8) Advertise the website
9) Provide email support
10) Submit the website to search engines
A website is a collection of web pages that are formatted with HTML. HTML is the code that defines what the website looks like and how it behaves.
The HTML code is formatted into a template or a format. Once this is done, it is displayed on the user's browser.
The web pages are stored in a web server. The web server is also called a host. When the website is accessed, it is retrieved from the server and displayed on the user's computer.
A website is known as a website when it is hosted. This means that it is displayed on a host. The host is usually a web server.
A website can be displayed on different browsers. The browsers are basically the software that renders the website on the user's screen.
A website can also be viewed on different devices such as desktops, tablets and smartphones.
Hence, to have a website displayed on a browser, the website must be hosted.
A domain name is an address of a website. It is the name of the website.
The website is known as a website when it is hosted. This means that it is displayed on a host. The host is usually a web server.
A website can be displayed on different browsers. The browsers are basically the software that renders the website on the user’s screen.
A website can also be viewed on different devices such as desktops, tablets and smartphones. Hence, to have a website displayed on a browser, the website must be hosted.
A domain name is an address of a website. It is the name of the website.
A website is an address of a website. It is a collection of web pages that are formatted with HTML. HTML is the code that defines what the website looks like and how it behaves.
The HTML code is formatted into a template or a format. Once this is done, it is displayed on the user’s browser.
A website is known as a website when it is hosted
main: mem per token = 14434244 bytes
main: load time = 1332.48 ms
main: sample time = 1081.40 ms
main: predict time = 31378.77 ms / 61.41 ms per token
main: total time = 34036.74 ms
Here is another demo of running both LLaMA-7B and whisper.cpp on a single M1 Pro MacBook:
https://user-images.githubusercontent.com/1991296/224442907-7693d4be-acaa-4e01-8b4f-add84093ffff.mp4
Usage
Here are the steps for the LLaMA-7B model:
# build this repo
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# obtain the original LLaMA model weights and place them in ./models
ls ./models
65B 30B 13B 7B tokenizer_checklist.chk tokenizer.model
# install Python dependencies
python3 -m pip install torch numpy sentencepiece
# convert the 7B model to ggml FP16 format
python3 convert-pth-to-ggml.py models/7B/ 1
# quantize the model to 4-bits
./quantize.sh 7B
# run the inference
./main -m ./models/7B/ggml-model-q4_0.bin -n 128
Currently, it's best to use Python 3.9 or Python 3.10, since sentencepiece has not yet published a wheel for Python 3.11.
When running the larger models, make sure you have enough disk space to store all the intermediate files.
Memory/Disk Requirements
As the models are currently fully loaded into memory, you will need adequate disk space to save them and sufficient RAM to load them. At the moment, memory and disk requirements are the same.
| model | original size | quantized size (4-bit) |
|-------|---------------|-------------------------|
| 7B    | 13 GB         | 3.9 GB                  |
| 13B   | 24 GB         | 7.8 GB                  |
| 30B   | 60 GB         | 19.5 GB                 |
| 65B   | 120 GB        | 38.5 GB                 |
Interactive mode
If you want a more ChatGPT-like experience, you can run in interactive mode by passing the -i parameter.
In this mode, you can always interrupt generation by pressing Ctrl+C and enter one or more lines of text, which will be converted into tokens and appended to the current context. You can also specify a reverse prompt with the parameter -r "reverse prompt string". This will result in user input being prompted whenever the exact tokens of the reverse prompt string are encountered in the generation. A typical use is a prompt that makes LLaMA emulate a chat between multiple users, say Alice and Bob, together with -r "Alice:".
Here is an example of a few-shot interaction, invoked with the command:
./main -m ./models/13B/ggml-model-q4_0.bin -n 256 --repeat_penalty 1.0 --color -i -r "User:" \
-p \
"Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.
User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User:"
Note the use of --color to distinguish between user input and generated text.
Android
You can easily run llama.cpp on Android devices via termux.
First, obtain the Android NDK and then build with CMake:
$ mkdir build-android
$ cd build-android
$ export NDK=<your_ndk_directory>
$ cmake -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-23 -DCMAKE_C_FLAGS=-march=armv8.4a+dotprod ..
$ make
Install termux on your device and run termux-setup-storage to get access to your SD card.
Finally, copy the llama binary and the model files to your device storage. Here is an example of an interactive session running on a Pixel 5 phone:
https://user-images.githubusercontent.com/271616/225014776-1d567049-ad71-4ef2-b050-55b0b3b9274c.mp4
Docker
Prerequisites
- Docker must be installed and running on your system.
- Create a folder to store big models and intermediate files (e.g. I use /llama/models)
Images
We have two Docker images available for this project:
- ghcr.io/ggerganov/llama.cpp:full: This image includes both the main executable and the tools to convert LLaMA models into ggml and to quantize them to 4 bits.
- ghcr.io/ggerganov/llama.cpp:light: This image only includes the main executable.
Usage
The easiest way to download the models, convert them to ggml and optimize them is with the --all-in-one command, which is included in the full Docker image.
docker run -v /llama/models:/models ghcr.io/ggerganov/llama.cpp:full --all-in-one "/models/" 7B
After that you are ready to play!
docker run -v /llama/models:/models ghcr.io/ggerganov/llama.cpp:full --run -m /models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 512
Or with the light image:
docker run -v /llama/models:/models ghcr.io/ggerganov/llama.cpp:light -m /models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 512
Limitations
- We don't know yet how much the quantization affects the quality of the generated text
- The token sampling can probably be improved
- The Accelerate framework is actually currently unused, since I found that for the tensor shapes typical of the decoder there is no benefit compared to the ARM_NEON intrinsics implementation. Of course, it's possible that I simply don't know how to utilize it properly. In any case, you can even disable it with LLAMA_NO_ACCELERATE=1 make, since the current implementation does not make any BLAS calls
Contributing
- Contributors can open PRs
- Collaborators can push branches to the llama.cpp repo and merge PRs into the master branch
- Collaborators will be invited based on contributions
- Any help with managing issues and PRs is very appreciated!
- Make sure to read this: Inference at the edge
Coding guidelines
- Avoid adding third-party dependencies, extra files, extra headers, etc.
- Always consider cross-compatibility with other operating systems and architectures
- Avoid fancy-looking modern STL constructs, use basic for loops, avoid templates, keep it simple
- There are no strict rules for the code style, but try to follow the patterns in the code (indentation, spaces, etc.). Vertical alignment makes things more readable and easier to batch edit
- Clean up any trailing whitespace, use 4 spaces for indentation, brackets on the same line, void * ptr, int & a
- Look at the good first issues for tasks suitable for first contributions