📝 devlog v8 [ 2025-12-26 ]
setting up a micro llm on luckfox pico max (cortex a7 & 256mb ram)
utilising the luckfox's integrated 1 tops npu to run micro llms.

create dir architecture for cleanliness in /opt:

mkdir /opt/llm

then in llm/:

mkdir src bin models logs
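(for reference, the same layout can be made in one shot with brace expansion - equivalent to the two mkdir calls above, assuming the shell supports it; plain busybox sh may not:)

mkdir -p /opt/llm/{src,bin,models,logs}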
using llama.cpp and building from source:

sudo git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
sudo make LLAMA_NO_ACCELERATE=1

after a successful config, compile with:
sudo make -j$(nproc)

after compile, move binaries:
sudo cp ./bin/* /opt/llm/bin/

place model(s) in /opt/llm/models. using gpt-mini 124m due to ram constraints.
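(for reference, one way to get the gguf onto the board is scp from a workstation, assuming ssh access as root; the local filename here is just an example:)

ssh root@<luckfox-ip> "mkdir -p /opt/llm/models/gpt-mini"
scp gpt-mini-q6.gguf root@<luckfox-ip>:/opt/llm/models/gpt-mini/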
test (160-token context, 128 tokens generated, 2 threads):

./llama-cli -m /opt/llm/models/gpt-mini/gpt-mini-q6.gguf -c 160 -n 128 -t 2

success.
create a .service file @ /etc/systemd/system/llm.service

unit file:
[Unit]
Description=Micro LLM
After=network.target
[Service]
ExecStart=/opt/llm/bin/llama-server -m /opt/llm/models/gpt-mini/gpt-mini-q6.gguf --host 0.0.0.0 --port 8080 --threads 1
WorkingDirectory=/opt/llm
Restart=always
User=root
[Install]
WantedBy=multi-user.target

reload:

sudo systemctl daemon-reload

enable:

sudo systemctl enable llm.service

start:

sudo systemctl start llm.service

now the luckfox max will run this model as a service on each boot.
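to check it actually came up and watch the model load, the usual systemd checks apply:

systemctl status llm.service
journalctl -u llm.service -f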
the model can be accessed via the cli, or by opening <luckfox-ip>:8080 in a browser.
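quick sketch of hitting it over http with curl - this assumes llama-server's built-in /completion endpoint (newer builds also expose an openai-style /v1/chat/completions):

curl http://<luckfox-ip>:8080/completion -H "Content-Type: application/json" -d '{"prompt": "hello", "n_predict": 32}'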
🔍 insights

compile will take forever on the luckfox - cross compile elsewhere (see the sketch below)
q2 model loads and runs, as does q6
q3, q4, q5 and q8 refuse to load
small llm models like this output nonsense/garbage, but the proof of concept is there
micro llms do run on a cortex a7 with only 256mb of ram, just not very coherently
stats: prompt - 3.4 t/s & generation - 1.1 t/s
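rough cross-compile sketch for next time, assuming an armhf toolchain on a faster x86 box and llama.cpp's cmake build (exact flag names may differ between versions, and the toolchain needs to match the board's libc - the luckfox sdk ships its own):

# on the build machine, inside the llama.cpp checkout
cmake -B build \
  -DCMAKE_SYSTEM_NAME=Linux \
  -DCMAKE_SYSTEM_PROCESSOR=arm \
  -DCMAKE_C_COMPILER=arm-linux-gnueabihf-gcc \
  -DCMAKE_CXX_COMPILER=arm-linux-gnueabihf-g++ \
  -DGGML_NATIVE=OFF
cmake --build build -j$(nproc)
# then push the results to the board
scp build/bin/* root@<luckfox-ip>:/opt/llm/bin/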