
Run Llama 2 Locally in 7 Lines! (Apple Silicon Mac)

Written by Suyog Sonwalkar

Super fast way of getting Llama 2 running locally on your Mac

TLDR:

xcode-select --install  # make sure git & clang are installed
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
curl -L https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q4_K_M.bin --output ./models/llama-2-7b-chat.ggmlv3.q4_K_M.bin  # download the 4-bit quantized 7B chat weights (~4GB)
LLAMA_METAL=1 make  # build llama.cpp with Apple Metal (GPU) support
./main -m ./models/llama-2-7b-chat.ggmlv3.q4_K_M.bin -n 1024 -ngl 1 -p "Give me a list of things to do in NYC"  # -n: max tokens to generate, -ngl: offload layers to the GPU

NOTE: The 7B model weights are about 4GB in size, so make sure you have enough free disk space on your machine.
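Once the download and build finish, you can also run the model as an interactive chat instead of a one-shot prompt. Here's a minimal sketch using llama.cpp's interactive flags as of the version I used (-i enters interactive mode, -r sets the reverse prompt that hands control back to you):

./main -m ./models/llama-2-7b-chat.ggmlv3.q4_K_M.bin -n 1024 -ngl 1 --color -i -r "User:" -p "A chat between a user and a helpful assistant.
User: Hi, what can you do?
Assistant:"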

What is this doing?

This uses Georgi Gerganov's amazing llama.cpp project to run Llama 2. The commands download a 4-bit quantized set of weights for Llama 2 7B Chat, prepared by TheBloke (https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML), place them in the models directory of llama.cpp, and then build llama.cpp with Apple's Metal acceleration enabled.
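If you'd rather skip the GPU path (for example on an Intel Mac), a plain make produces a CPU-only build; the run command is the same minus the -ngl flag:

make  # CPU-only build, no Metal
./main -m ./models/llama-2-7b-chat.ggmlv3.q4_K_M.bin -n 1024 -p "Give me a list of things to do in NYC"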

This lets you run Llama 2 locally with minimal work. The 7B weights should run on machines with 8GB of RAM (16GB is more comfortable). Larger models like 13B or 70B will require significantly more RAM.
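For example, if you have the RAM for it, swapping in the 13B chat weights only changes the download URL and the -m path. A sketch, assuming TheBloke publishes the 13B GGML files under the same naming scheme (the q4_K_M file is roughly 8GB):

curl -L https://huggingface.co/TheBloke/Llama-2-13B-Chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q4_K_M.bin --output ./models/llama-2-13b-chat.ggmlv3.q4_K_M.bin
./main -m ./models/llama-2-13b-chat.ggmlv3.q4_K_M.bin -n 1024 -ngl 1 -p "Give me a list of things to do in NYC"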

Note that the Llama 2 non-chat (base) weights are also available at https://huggingface.co/TheBloke/Llama-2-7B-GGML; however, if you want a simple chat experience, the chat-tuned weights are the better choice.
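If you do want the base model, e.g. for raw text completion, the workflow is identical; this sketch assumes the base repo follows the same file naming as the chat repo:

curl -L https://huggingface.co/TheBloke/Llama-2-7B-GGML/resolve/main/llama-2-7b.ggmlv3.q4_K_M.bin --output ./models/llama-2-7b.ggmlv3.q4_K_M.bin
./main -m ./models/llama-2-7b.ggmlv3.q4_K_M.bin -n 256 -ngl 1 -p "The best neighborhoods to visit in New York City are"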

Performance

On an M2 Max MacBook Pro, I was able to get 35–40 tokens per second using the LLAMA_METAL build flag. Your performance may vary depending on your Apple Silicon chip.
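To measure throughput on your own machine, you don't need anything extra: llama.cpp prints timing statistics when a run finishes, and the eval-time line reports generation speed in tokens per second. One way to get a quick, repeatable number (the --ignore-eos flag keeps generation going for the full -n tokens; flag names as of the version I used):

./main -m ./models/llama-2-7b-chat.ggmlv3.q4_K_M.bin -n 128 -ngl 1 --ignore-eos -p "Write a short story about a robot."
# then look for the "llama_print_timings: eval time" line in the output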

Thanks and acknowledgements

It’s amazing that we can get state-of-the-art large language models running locally so soon after their release. This wouldn’t be possible without Meta open-sourcing the Llama 2 weights, the llama.cpp project, and TheBloke’s optimized Llama 2 model weights on Hugging Face.

LastMile AI

If you want to learn more about AI tools and large language models like Llama, check out LastMile AI.

We would also appreciate your feedback on our initial product offering, available at lastmileai.dev.

