📝 devlog v8 [ 2025-12-26 ]
setting up a micro llm on luckfox pico max (cortex a7 & 256mb ram)
utilising the luckfox's integrated 1 tops npu to run micro llms.

create dir architecture for cleanliness in /opt:

mkdir /opt/llm

then in llm/:

mkdir src bin models logs
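(for reference, the same layout can be made in one shot with brace expansion - equivalent to the two mkdir calls above, assuming the shell supports it; plain busybox sh may not:)

mkdir -p /opt/llm/{src,bin,models,logs}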
using llama.cpp and building from source:

sudo git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
sudo make LLAMA_NO_ACCELERATE=1

after a successful config, compile with:
sudo make -j$(nproc)

after compile, move binaries:
sudo cp ./bin/* /opt/llm/bin/

place model(s) in /opt/llm/models. using gpt-mini 124m due to ram constraints.
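(for reference, one way to get the gguf onto the board is scp from a workstation, assuming ssh access as root; the local filename here is just an example:)

ssh root@<luckfox-ip> "mkdir -p /opt/llm/models/gpt-mini"
scp gpt-mini-q6.gguf root@<luckfox-ip>:/opt/llm/models/gpt-mini/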
test (160-token context, 128 tokens generated, 2 threads):

./llama-cli -m /opt/llm/models/gpt-mini/gpt-mini-q6.gguf -c 160 -n 128 -t 2

success.
create a .service file @ /etc/systemd/system/llm.service

unit file:
[Unit]
Description=Micro LLM
After=network.target
[Service]
ExecStart=/opt/llm/bin/llama-server -m /opt/llm/models/gpt-mini/gpt-mini-q6.gguf --host 0.0.0.0 --port 8080 --threads 1
WorkingDirectory=/opt/llm
Restart=always
User=root
[Install]
WantedBy=multi-user.target

reload:

sudo systemctl daemon-reload

enable:

sudo systemctl enable llm.service

start:

sudo systemctl start llm.service

now the luckfox max will run this model as a service on each boot.
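to check it actually came up and watch the model load, the usual systemd checks apply:

systemctl status llm.service
journalctl -u llm.service -f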
the model can be accessed via the cli, or by opening <luckfox-ip>:8080 in a browser.
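quick sketch of hitting it over http with curl - this assumes llama-server's built-in /completion endpoint (newer builds also expose an openai-style /v1/chat/completions):

curl http://<luckfox-ip>:8080/completion -H "Content-Type: application/json" -d '{"prompt": "hello", "n_predict": 32}'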
🔍 insights

compile will take forever on the luckfox - cross compile elsewhere (see the sketch below)
q2 model loads and runs, as does q6
q3, q4, q5 and q8 refuse to load
small llm models like this output nonsense/garbage, but the proof of concept is there
micro llms do run on a cortex a7 with only 256mb of ram, just not very coherently
stats: prompt - 3.4 t/s & generation - 1.1 t/s
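rough cross-compile sketch for next time, assuming an armhf toolchain on a faster x86 box and llama.cpp's cmake build (exact flag names may differ between versions, and the toolchain needs to match the board's libc - the luckfox sdk ships its own):

# on the build machine, inside the llama.cpp checkout
cmake -B build \
  -DCMAKE_SYSTEM_NAME=Linux \
  -DCMAKE_SYSTEM_PROCESSOR=arm \
  -DCMAKE_C_COMPILER=arm-linux-gnueabihf-gcc \
  -DCMAKE_CXX_COMPILER=arm-linux-gnueabihf-g++ \
  -DGGML_NATIVE=OFF
cmake --build build -j$(nproc)
# then push the results to the board
scp build/bin/* root@<luckfox-ip>:/opt/llm/bin/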