Prerequisites
Use amdgpu-install --usecase=graphics,rocm without the opencl use case, which might cause HIP issues at the moment.
Install
# clone text-generation-webui
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
# create and activate a venv
python3 -m venv venv
source venv/bin/activate
# install dependencies
pip3 install torch --index-url https://download.pytorch.org/whl/rocm5.6
# the wheels listed in requirements_amd.txt are built for this specific version of torch
# other torch versions will fail when loading dynamic libraries
pip3 install -r requirements_amd.txt
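Before moving on, it can help to confirm that the torch build installed above targets ROCm and can see the GPU. The following is a minimal sketch and not part of text-generation-webui: check_torch_rocm is a hypothetical helper name, and it assumes python3 resolves to the interpreter of the activated venv. It relies on torch.version.hip, which is a version string on ROCm builds and None otherwise, and on torch.cuda.is_available(), since ROCm builds expose the GPU through the torch.cuda namespace.

```shell
# Hypothetical helper (not from the original guide): report whether the
# installed torch build targets ROCm and whether it can see a GPU.
check_torch_rocm() {
    python3 - <<'EOF'
try:
    import torch
except ImportError:
    print("torch is not installed in this environment")
    raise SystemExit(1)
print("torch:", torch.__version__)
# a version string on ROCm builds, None on CPU or CUDA builds
print("HIP runtime:", torch.version.hip)
# ROCm builds expose the GPU through the torch.cuda namespace
print("GPU visible:", torch.cuda.is_available())
EOF
}
```

Run check_torch_rocm inside the activated venv. If the HIP runtime line shows None, the installed wheel is probably not the ROCm build from the index URL above.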
BitsAndBytes
BitsAndBytes is used in transformers when load_in_8bit or load_in_4bit is enabled. Unfortunately, it has poor ROCm support and low performance on Navi 31.
If you only want to run some LLMs locally, quantized models in GGML or GPTQ formats might suit your needs better.
To use BitsAndBytes for other purposes, a tutorial about building BitsAndBytes for ROCm with limited features might be added in the future.
Here is a promising fork if you are willing to try it by yourself (disclaimer: I haven’t tested it yet):
Launch
source venv/bin/activate
# use the first gpu if there are many
export HIP_VISIBLE_DEVICES=0
# override the gfx version
export HSA_OVERRIDE_GFX_VERSION=11.0.0
python3 ./server.py --listen
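If the same shell is also used with other GPUs or override values, the exports above can be turned into defaults that do not clobber values already set. This is only a sketch under that assumption: set_rocm_env is a hypothetical helper name, and the fallback values are the ones used in this guide for a single Navi 3x card.

```shell
# Hypothetical helper (not from the original guide): apply the guide's values
# only where the caller has not already set one.
set_rocm_env() {
    # ${VAR:-default} keeps an existing non-empty value and falls back otherwise
    export HIP_VISIBLE_DEVICES="${HIP_VISIBLE_DEVICES:-0}"
    export HSA_OVERRIDE_GFX_VERSION="${HSA_OVERRIDE_GFX_VERSION:-11.0.0}"
}
```

Calling set_rocm_env before python3 ./server.py --listen gives the same environment as the commands above, while a value exported earlier, such as a different gfx override, is left untouched.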
Performance
Tested on an RX 7900 XTX in a PCIe 4.0 x16 slot.
7B Transformers inference
INFO:Loading 7B-hf...
INFO:Loaded the model in 10.36 seconds.
Output generated in 11.64 seconds (28.08 tokens/s, 327 tokens, context 52, seed 832229644)
7B 4-bit AutoGPTQ inference
INFO:Loading Wizard-Vicuna-7B-Uncensored-GPTQ...
INFO:Loaded the model in 2.10 seconds.
Output generated in 8.93 seconds (45.90 tokens/s, 410 tokens, context 52, seed 159282383)
7B 4-bit ExLlama inference
INFO:Loading Wizard-Vicuna-7B-Uncensored-GPTQ...
INFO:Loaded the model in 1.44 seconds.
Output generated in 6.15 seconds (76.30 tokens/s, 469 tokens, context 51, seed 353336174)
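To compare the three runs, the tokens/s figure can be pulled out of the "Output generated" lines and expressed relative to the first run. The sketch below assumes the log lines have exactly the shape shown above. Both function names are hypothetical, and so is the webui.log file name in the usage comment.

```shell
# Hypothetical helpers (not from the original guide).

# Print the tokens/s figure from every "Output generated" line on stdin;
# lines without a "(<number> tokens/s" part are skipped.
tokens_per_second() {
    sed -n 's/.*(\([0-9.]*\) tokens\/s.*/\1/p'
}

# Read one number per line and print each one relative to the first.
speedup_vs_first() {
    awk 'NR == 1 { base = $1 } { printf "%.2fx\n", $1 / base }'
}

# usage, assuming the console output was saved to webui.log:
#   tokens_per_second < webui.log | speedup_vs_first
```

Fed the three result lines above in order, this prints 1.00x, 1.63x and 2.72x, so in this test ExLlama was roughly 2.7 times as fast as the unquantized Transformers run.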
Caveats
RuntimeError: HIP error: invalid argument or Memory access fault
These errors usually occur when the GPU is misidentified. Setting the gfx version explicitly should solve the problem:
# export only the value that matches your GPU
# for navi 3x
export HSA_OVERRIDE_GFX_VERSION=11.0.0
# for navi 2x, use this value instead
# export HSA_OVERRIDE_GFX_VERSION=10.3.0
python3 ./server.py --listen
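The choice between the two values can also be scripted from the gfx target name that tools such as rocminfo report for the card. This is a sketch under an assumption that is not stated in this guide: that gfx110x targets are Navi 3x parts and gfx103x targets are Navi 2x parts. gfx_override_for is a hypothetical helper name, and it covers only the two families mentioned here.

```shell
# Hypothetical helper (not from the original guide): map a gfx target name
# to the override value used in this guide.
# Assumption: gfx110x = navi 3x, gfx103x = navi 2x.
gfx_override_for() {
    case "$1" in
        gfx110*) echo "11.0.0" ;;  # navi 3x
        gfx103*) echo "10.3.0" ;;  # navi 2x
        *)
            # anything else is outside the scope of this guide
            echo "no override known for '$1'" >&2
            return 1
            ;;
    esac
}

# usage, for a card reported as gfx1100:
#   export HSA_OVERRIDE_GFX_VERSION="$(gfx_override_for gfx1100)"
```

Returning a non-zero status for unknown targets keeps the script from silently exporting a value that does not belong to the installed card.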