Bash one-liners for LLMs

simonw
28 replies
1d

I've been gleefully exploring the intersection of LLMs and CLI utilities for a few months now - they are such a great fit for each other! The unix philosophy of piping things together is a perfect fit for how LLMs work.

I've mostly been exploring this with my https://llm.datasette.io/ CLI tool, but I have a few other one-off tools as well: https://github.com/simonw/blip-caption and https://github.com/simonw/ospeak

I'm puzzled that more people aren't loudly exploring this space (LLM+CLI) - it's really fun.
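
For example, a minimal sketch of the kind of piping I mean (the file name is a placeholder; double-check the llm flags against its docs):

    # pipe a file into a model and ask a question about it
    cat error.log | llm -s "Explain what went wrong and suggest a fix"

    # chain the model's output into ordinary Unix tools
    llm "Ten names for a pet pelican, one per line" | sort | head -n 3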

verdverm
11 replies
20h25m

I'm puzzled that more people aren't loudly exploring this space (LLM+CLI) - it's really fun.

I've been seeing less and less enthusiasm for CLI driven workflows. I think VS Code is the main driver for this and anecdotally the developers I serve want point & click over terminal & cli

csdvrx
10 replies
20h15m

anecdotally the developers I serve want point & click over terminal & cli

I think it's due to a lack of familiarity, as the CLI should be more efficient

I've been seeing less and less enthusiasm for CLI driven workflows.

Any CLI is 1 dimensional.

Point and click is 2 dimensional.

The CLI should be more efficient, as you can reduce the complexity: you may need extra flags to achieve the behavior you want, but you can then serialize that into a file (shell script) to guarantee the reproduction of the outcome you want.
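
For example, a toy sketch of that serialization idea (using ImageMagick's convert; the flags here are just an illustration):

    #!/bin/sh
    # resize-photos.sh: capture the exact flags once, reuse them forever
    set -eu
    for f in "$@"; do
      convert "$f" -resize 1600x1600 -quality 85 "resized-$f"
    done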

GUIs are harder, even without adding more dimensions like time (double-clicks, scripts like AHK or AutoIt...)

If you don't have comparative exposure (automating workflows in Windows vs doing the same in Linux), or if you don't have enough experience to achieve what you want, you might jump to the wrong conclusions - but this is a case of being limited by knowledge, not by the tools

verdverm
6 replies
20h6m

The CLI should be more efficient, as you can reduce the complexity: you may need extra flags to achieve the behavior you want, but you can then serialize that into a file (shell script) to guarantee the reproduction of the outcome you want.

yup, we do this where we can, but let's consider a recent example...

They are standardizing around k8slens instead of kubectl. Why? Because there are things you can do in k8slens (like metrics) that you'll never get a good experience with in a terminal. Another big problem with terminals is that you have to deal with all sorts of divergences between OSes & shells. A web based interface is consistent. In the end, they decided as a team their preference and that's what gets supported. They also standardized around VS Code, so that's what the docs refer to. I'm pretty much the only one still in vim, I'm not giving up my efficiencies in text manipulation.

I don't disagree with you, but I do see a trend in preferences away from the CLI based on my experience, our justifications be damned

csdvrx
5 replies
19h35m

They are standardizing around k8slens instead of kubectl. Why? Because there are things you can do in k8slens (like metrics) that you'll never get a good experience with in a terminal.

It looks like a limitation of the tool, not of the method, because metrics could come as CSVs, JSON or any other format in a terminal

I'm pretty much the only one still in vim, I'm not giving up my efficiencies in text manipulation.

I love vim too :)

I don't disagree with you, but I do see a trend in preferences away from the CLI based on my experience, our justifications be damned

Trends in preferences are called fashions: they can change the familiarity and the level of experience through exposure, but they are cyclic and without an objective.

The core problem is the combinatorial complexity in the problem space, and 1d with ascii will beat 2d with bitmaps.

I'm all for adding graphics to outputs (ex: sixels) but I think depending on graphics as inputs (whether scripting a GUI or processing it with our eyeballs) is riskier and more complex, so I believe our common preferences for CLIs will prevail in the long run.

verdverm
2 replies
19h27m

because metrics could come as CSVs, JSON or any other format in a terminal

You're missing the point: it's about graphs humans can look at to gain understanding. A bunch of floating point numbers in a table is never going to give that capability.

This is just one example where a UI outshines a CLI, it's not the only one. There are limitations to what you can do in a terminal, especially if you consider ease of development

csdvrx
1 replies
19h20m

If humans are processing the output, I agree with you.

But for an AI (or a script running commands), a bunch of floating point numbers in a table will get you more reliability and better results.
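
For example, a rough sketch (assuming kubectl top's tabular output and an llm-style CLI; treat the exact flags as approximate):

    # a script (or an LLM) can consume the raw numbers directly
    kubectl top pods --no-headers | llm -s "Summarize which pods look unhealthy and why"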

verdverm
0 replies
19h5m

This thread had dropped the AI context up to this point and instead focussed on why CLIs have lost popularity and preference with humans.

jart
1 replies
19h5m

I think it has more to do with how close to the brink you are. It takes at least a decade for a technology to mature to the point where there's a polished point and click gui for doing it. It sounds like Borg just hit that inflection point thanks to k8slens which I'm sure is very popular with developers working at enterprises.

csdvrx
0 replies
18h25m

It takes at least a decade for a technology to mature to the point where there's a polished point and click gui for doing it

That makes a lot of sense, and it would generalize: things that have existed for longer have received more attention and more polish than fresh new things

I'd expect running a binary to be more mature than running a script, and the script to be more mature than a GUI, and complex assemblies with many moving parts (ex: a web browser in a GUI) to be the most fragile

That's another way to see there's an extremely good case for using cosmopolitan: have fewer requirements, and concentrate on the core layers of the OS, the ones that've been improved and refined through the years

verdverm
1 replies
19h46m

I think it's due to a lack of familiarity

100%

I also think people are hitting tooling burnout; there have been soooo many new tools (and SaaS for that matter). I personally and anecdotally want fewer apps and tools to get my job done. Having to wire them all together creates a lot of complexity, headaches, and time sinks

csdvrx
0 replies
19h31m

I personally and anecdotally want fewer apps and tools to get my job done

Same, because if you learn the CLI and scripting once, then in most cases you don't have to worry about other workflows: all you need is the ability to project your problem into a 1d serialization (ex: w3m or lynx html dump, curl, wget...) where you can use fewer tools
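
For example, a sketch (the URL is a placeholder and the w3m flags are from memory):

    # serialize a web page to plain text, then treat it like any other stream
    curl -s https://example.com/changelog.html | w3m -dump -T text/html | grep -i security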

pacoverdi
0 replies
10h17m

Point and click is 2 dimensional.

I would have said point and click is 3-dimensional.

Otherwise, how can you read the text through the edges of buttons before clicking?

sevagh
10 replies
23h47m

I'm puzzled that more people aren't loudly exploring this space (LLM+CLI) - it's really fun.

70% of the front page of Hacker News and Twitter for the past 9 months has been about everybody and their mother's new LLM CLI. It's the loudest exploration I've ever witnessed in my tech life so far. We need to be hearing far less about LLM CLIs, not more.

simonw
7 replies
22h44m

I've been reading Hacker News pretty closely and I haven't seen that.

Plenty of posts about LLM tools - Ollama, llama.cpp etc - but very few that were specifically about using LLMs with Unix-style CLI piping etc.

What did I miss?

jart
5 replies
21h13m

Has anyone written a shell script before that uses a local llm as a composable tool? I know there's plenty of stuff like https://github.com/ggerganov/llama.cpp/blob/master/examples/... where the shell script is being used to supply all the llama.cpp arguments you need to get a chatbot ui. But I haven't seen anything yet that treats the LLM as though it were a traditional UNIX utility like sed, awk, cat, etc. I wouldn't be surprised if no one's done it, because I had to invent the --silent-prompt flag that let me do it. I also had to remove all the code from llava-cli that logged stuff to stdout. Anyway, here's the script I wrote: https://gist.github.com/jart/bd2f603aefe6ac8004e6b709223881c...
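
In rough shape the usage ends up looking like this; treat the model filename and exact flags as placeholders that vary by llama.cpp/llamafile version:

    # use a local model like any other text filter
    summary=$(./llamafile -m mistral-7b-instruct.Q4_K_M.gguf \
        --temp 0 -n 128 --silent-prompt \
        -p "[INST]Summarize the following notes: $(cat notes.txt)[/INST]" 2>/dev/null)
    echo "$summary"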

verdverm
4 replies
20h24m

It's really hard to put an LLM in the middle of unix pipes because the output is unreliable

I wrote one a while back, mainly so I could use it with vim (without a plugin) and pipe my content in (also at the CLI), but haven't maintained it

https://github.com/verdverm/chatgpt
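
For the vim-without-a-plugin part, filtering a buffer through any such CLI looks roughly like this (shown with a generic llm-style command as a stand-in, not my tool's exact flags):

    " filter the visual selection through the CLI and replace it in place
    :'<,'>!llm -s "Fix spelling and grammar, return only the corrected text"

    " or run the whole buffer through it
    :%!llm -s "Convert this Markdown to reStructuredText"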

npongratz
3 replies
19h23m

Justine may have addressed unreliable output by using `--temp 0` [0]. I'd agree that while it may be deterministic, there are other definitions or axes of reliability that may still make it poorly suited for pipes.

[0] > Notice how I'm using the --temp 0 flag again? That's so output is deterministic and reproducible. If you don't use that flag, then llamafile will use a randomness level of 0.8 so you're certain to receive unique answers each time. I personally don't like that, since I'd rather have clean reproducible insights into training knowledge.

jart
1 replies
19h16m

`--temp 0` makes it deterministic. What can make output reliable is `--grammar` which the blog post discusses in detail. It's really cool. For example, the BNF expression `root ::= "yes" | "no"` forces the LLM to only give you a yes/no answer.
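
Roughly like this (the model filename and prompt are placeholders; check the flag spellings against your build):

    ./llamafile -m mistral-7b-instruct.Q4_K_M.gguf --temp 0 --silent-prompt \
      --grammar 'root ::= "yes" | "no"' \
      -p 'Is "segfault at address 0x0" an error message? Answer: ' 2>/dev/null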

verdverm
0 replies
19h9m

That only works up to a point. If you are trying to transform text-based CLI output into a JSON object, even with a grammar, you can get variation in the output. A simple example is field or list ordering. Omission is the real problem.

verdverm
0 replies
19h13m

Justine may have addressed unreliable output by using `--temp 0`

That only works if you have the same input. It also nerfs the model considerably

https://ai.stackexchange.com/questions/32477/what-is-the-tem...

sevagh
0 replies
19h0m

LLM tools - Ollama, llama.cpp

I was including these since they're LLM-related cli tools.

squigz
0 replies
23h39m

Err... LLM in general? Sure. Specifically CLI LLM stuff? Certainly not 70%...

minimaxir
0 replies
23h46m

Granted half of those are submissions about Simon's new projects.

alchemist1e9
3 replies
23h31m

Those are great links. I’ve been using:

https://github.com/npiv/chatblade https://github.com/tbckr/sgpt

I totally agree that LLM+CLI are a perfect fit.

One pattern I used recently was httrack + w3m dump + sgpt with GPT Vision on images, to generate a 278K-token specific knowledge base with a custom Perl hack for a RAG that preserved the outline of the knowledge.
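
The rough shape of that pipeline (paths and flags are illustrative, not the exact commands I used):

    # mirror a site, flatten every page to plain text, build one corpus file
    httrack "https://docs.example.com/" -O ./mirror
    find ./mirror -name '*.html' -print0 | xargs -0 -I{} w3m -dump {} >> corpus.txt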

Which brings me to my question for you - have you seen anything unix philosophy aligned for processing inputs and doing RAG locally?

EDIT. Turns out OP has done quite a bit toward what I’m asking. Written up here:

https://simonwillison.net/2023/Oct/23/embeddings/

Something I'm currently a bit hung up on is finding a toolchain for chunking content on which to create embeddings. Ideally it would detect location context, like section "2.1 Failover" or "Chapter 8: The dream" or whatever from the original text, and also handle unwrapping of 80-character-wide source, smart splitting so paragraphs are kept together, etc.
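
The crudest starting point I have is awk's paragraph mode, which at least keeps paragraphs together but is in no way section-aware (file name is a placeholder):

    # split blank-line-separated paragraphs into numbered chunk files
    awk -v RS='' '{ f = sprintf("chunk-%04d.txt", NR); print > f; close(f) }' book.txt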

simonw
1 replies
22h42m

That's the same problem I haven't figured out yet: the best strategies for chunking. I'm hoping good, well proven patterns emerge soon so I can integrate them into my various tools.

alchemist1e9
0 replies
19h58m

I’ve been looking at this

https://freeling-user-manual.readthedocs.io/en/v4.2/modules/...

at the freeling library in general, also spaCy and NLTK. The chunking algorithms being used in the likes of LangChain are remarkably bad, surprisingly.

There is also

https://github.com/Unstructured-IO/unstructured

But I don’t like it, can’t explain why yet.

My intuition is that the 1st step is clean sentences and paragraphs and titles/labels/headers. Then an LLM can probably handle outlining and table of contents generation using a stripped-down list of objects in the text.

BRIO/BERT summarization could also have a role of some type.

Those are my ideas so far.

roomey
0 replies
21h49m

A quick glance at "embedding" reminds me a lot of some work I was doing on quantum computing.

I wonder if there is some crossover potential there, in terms of calculations across vector arrays

larve
0 replies
22h51m

I'm heavily using https://github.com/go-go-golems/geppetto for my work, which has a CLI mode and TUI chat mode. It exposes prompt templates as command line verbs, which it can load from multiple "repositories".

I maintain a set of prompts for each repository I am working in (alongside custom "prompto" https://github.com/go-go-golems/prompto scripts that generate dynamic prompting context; I made quite a few for third-party libraries, for example: https://github.com/go-go-golems/promptos ).

Here's some of the public prompts I use: https://github.com/go-go-golems/geppetto/tree/main/cmd/pinoc...

I am currently working on a declarative agent framework.

verdverm
17 replies
1d1h

Curious what HN thinks about llamafile and modelfile (https://github.com/jmorganca/ollama/blob/main/docs/modelfile...)

Both evoke a Dockerfile-like experience. Modelfile immediately seems like a Dockerfile, but llamafile looks harder to use. It is not immediately clear what it looks like. Is it a sequence of commands at the terminal?

My theory question is, why not use a Dockerfile for this?

jart
10 replies
23h15m

llamafile doesn't use docker. It's just an executable file. You download your llamafile, chmod +x it, then you ./run it. It is possible to use Docker + llamafile though. Check out https://github.com/ajbouh/cosmos which has a pretty good Dockerfile configuration you can use.
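
In other words, the whole "sequence of commands" is roughly this (using the llava llamafile mentioned elsewhere in this thread, after downloading it):

    # after downloading the file:
    chmod +x llava-v1.5-7b-q4-main.llamafile
    ./llava-v1.5-7b-q4-main.llamafile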

verdverm
9 replies
22h9m

I already use containers for all my current AI stuff, so I don't really need the llamafile part. The question was more about alternatives, because on the surface, it looks to have a lot of overlap with containers. The main difference I've seen is that llamafiles are easier if you do not already use containers or are on a platform where containers come with a lot of overhead and limitations

lucb1e
3 replies
21h50m

Wait, I think I'm misunderstanding. It feels like you're asking what you need a computer for if you've already got a web browser, whereas one is a component necessary for the other. Inside of those containers you're using, there's executables. If it runs as a nice, self-contained single binary that you can just call, why still look for a container to wrap around it and invoke using a more complicated command? How is running a binary in a container an alternative to running said binary?

verdverm
2 replies
21h17m

Wait, I think I'm misunderstanding.

Yes, and you are being snarky about it

I already have an AI development workflow and production environment based on containers. Technically, not ML executables inside the container, rather a Python runtime and code. This also comes with an ecosystem for things you need in real work that are general beyond AI applications.

Why would I want to add additional tooling and workflow that only works locally and needs all the extras to be added on?

simonw
1 replies
21h15m

It sounds to me like you don't personally need llamafile then.

verdverm
0 replies
20h28m

yup, trying to understand where all the nascent AI tools are trying to fit in

ollama seems way easier to use than llamafile

baq
3 replies
21h18m

well, you can keep using containers if you want. docker sucks really bad on macos, but if you're not using that, you might not care.

verdverm
2 replies
21h16m

docker is not the only container runtime; no issues using a Mac as a daily development machine

baq
1 replies
20h49m

I also tried rancher and it was worse, as in it hung on my workloads, which docker started fine.

I had my share of docker configuration wiping and rosetta compatibility fighting, though.

verdverm
0 replies
20h30m

nerdctl, colima, containerd

ElectricalUnion
0 replies
21h3m

it looks to have a lot of overlap with containers

No it doesn't. They're at different abstraction layers.

Containers are a more convenient form of generic code/data packaging and isolation primitives.

llamafile is the code/data that you want to run.

Equivalent poor analogy: Why does JPEG XL exist? I can just put a base64 representation of a PNG image in a data URL inside an HTML file inside a ZIP file; I can also bundle more stuff this way! JPEG XL is useless! No one should use JPEG XL! Other use cases that use JPEG XL by themselves are invalid!

throwup238
3 replies
1d

Their killer feature is the --grammar option, which restricts the logits the LLM outputs and makes them great for bash scripts that do all manner of NLP classification work.
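
A sketch of what that kind of script looks like (model path, prompt, and input file are placeholders; the --grammar and --silent-prompt flags are the ones discussed elsewhere in the thread):

    # tag each ticket subject in a file as bug / feature / question
    while IFS= read -r subject; do
      label=$(./llamafile -m mistral-7b-instruct.Q4_K_M.gguf --temp 0 --silent-prompt \
        --grammar 'root ::= "bug" | "feature" | "question"' \
        -p "Classify this support ticket subject: $subject Label:" 2>/dev/null)
      printf '%s\t%s\n' "$label" "$subject"
    done < subjects.txt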

Otherwise I use ollama when I need a local LLM, vllm when I'm renting GPU servers, or OpenAI API when I just want the best model.

verdverm
2 replies
23h42m

Interesting, you inspired me to open https://github.com/jmorganca/ollama/issues/1507

I've also had good success instructing arbitrary grammars in the system prompt, though it doesn't work at the logits level, which can be helpful

throwup238
1 replies
23h35m

There's an old PR for it: https://github.com/jmorganca/ollama/pull/565 (it just uses the underlying llama.cpp grammar feature which is what llamafile does)

verdverm
0 replies
23h28m

thanks for the pointer, I'll update my issue to reference that PR

airstrike
1 replies
1d
verdverm
0 replies
1d

oh right, one of the things llamafile can do is run on windows directly, and generally without the extra need for docker and such

RadiozRadioz
8 replies
22h32m

4 seconds to run on my Mac Studio, which cost $8,300 USD

Jesus, is it common for developers to have such expensive computers these days?

ElectricalUnion
3 replies
20h50m

Jesus, is it common for developers to have such expensive computers these days?

Computers are really cheap now.

A PDP-8, the first really successful minicomputer (read very cheap minicomputer), was around 18,500 USD, in 1965's USD, or 170,000 USD in 2023's USD.

For a historic comparison, the price of an introductory minimal system for an actual "mainframe class computer" of the same vintage, an IBM System/360 Model 30, was 133,000 USD in 1965's USD, or around 1,225,000 USD in 2023's USD.

Those 8300 USD cited are very cheap.

A person on the bleeding edge of the private AI sector is expected to handle several Nvidia H100 80GB cards, each with an individual cost of around 40,000 USD.

Those 8300 USD cited are peanuts in comparison.

bekantan
2 replies
10h48m

I don't mind paying, but I want to have a Linux workstation.

What would be an x86 alternative in that price range (if any)? Xeons with HBM are more expensive, IIRC.

jart
1 replies
10h18m

I know it's arm64 but you can hack Mac Studio and put Linux on it. Putting Linux on Apple Silicon makes it go so much faster than MacOS if you're doing command line development work. With x86 there really isn't anything you can buy as far as I know that'll go as fast as apple silicon for cpu inference. It's because the RAM isn't integrated into the CPU system on chip. For example, if you get a high-end x86 system with 5000+ MT/s then you can expect maybe ~15 tokens per second at CPU inference, but a high end Mac Studio does CPU inference at ~36 tokens per second. Not that it matters though if you're planning to get a high-end Nvidia card for your x86 computer. Nvidia GPUs go very fast and so does Apple Metal.

ElectricalUnion
0 replies
47m

If you have really unlimited budget, unconditional love for Intel and x86 and don't care about ludicrous power draw at all, Intel has a silly Sapphire Rapids Xeon Max part with 64GiB of 1TB/s HBM.

It goes really fast (same magnitude of bandwidth as A100s) if your model fits in that cache entirely.

modernpink
1 replies
22h27m

Considering the advances in computational hardware over the past few decades plus the corresponding (and not unrelated) real increase in developer salaries, it is unreasonably cheap.

mistrial9
0 replies
21h38m

plus it phones-home for safety

simonw
0 replies
21h12m

Not exactly common, but it's not surprising. $8,000 computers have got really, really good!

If you made a living as a plumber you would spend a lot more than that on tools and a pickup truck.

Art9681
0 replies
15h19m

Used wisely, that $8,300 computer will pay for itself and a lot more. But I wouldn't gamble it either. Those that have a plan to multiply its value don't need my encouragement.

j2kun
7 replies
1d

I just tried this and ran into a few hiccups before I got it working (on a Windows desktop with a NVIDIA GeForce RTX 3080 Ti)

WSL outputs this error (hidden by the one-liner's redirect to /dev/null):

error: APE is running on WIN32 inside WSL. You need to run: sudo sh -c 'echo -1 > /proc/sys/fs/binfmt_misc/WSLInterop'

Then zsh hits: `zsh: exec format error: ./llava-v1.5-7b-q4-main.llamafile` so I had to run it in bash. (The title says bash, I know, but it seems weird that it wouldn't work in zsh)

It also reports a warning that GPU offloading is not supported, but it's probably a WSL thing (I don't do any GPU programming on my windows machine).

venusenvy47
2 replies
23h50m

I was thinking of trying this on my Windows machine with an RTX 4070 but it sounds like the GPU isn't used in WSL. Was your testing really slow when using just the CPU?

j2kun
1 replies
23h47m

It was like 30s compared to Justine's 45. Apparently I have the right drivers installed, but it seems you also need to compile the executable with the right toolkit (?) to make it work in WSL: https://docs.nvidia.com/cuda/wsl-user-guide/index.html#cuda-...

jart
0 replies
23h39m

Yes you need to install CUDA and MSVC for GPU. But here's some good news! We just rolled our own GEMM functions so llamafile doesn't have to depend on cuBLAS anymore. That means llamafile 0.4 (which I'm shipping today) will have GPU on Windows that works out of the box, since not depending on cuBLAS anymore means that I'm able to compile a distributable DLL that only depends on KERNEL32.DLL. Oh it'll also have Mixtral support :) https://github.com/Mozilla-Ocho/llamafile/pull/82

ronsor
2 replies
1d

This is a bug in ZSH... which was fixed months ago. So you must have an old version of ZSH.

jart
0 replies
23h37m

I fixed that bug in zsh two years ago. It got released in zsh 5.9. https://github.com/zsh-users/zsh/commit/326d9c203b3980c0f841... I'm going to be answering questions about it until my dying day.

j2kun
0 replies
1d

Thanks for the tip! I have 5.8.1 but I see I'm one major version behind.

augusto-moura
0 replies
23h57m

In zsh you can fix it by prepending `sh` to the command: `sh ./llava-v1.5-7b-q4-main.llamafile`. It's a quirk with zsh and APE [1]

[1]: https://justine.lol/ape.html

acatton
4 replies
1d

I get excited when hackers like Justine (in the most positive sense of the word) start working with LLMs.

But every time, I am let down. I still dream of some hacker making LLMs run on low-end computers like a 4GB Raspberry Pi. My main issue with LLMs is that you almost need a PS5 to run them.

plagiarist
1 replies
23h56m

People are working on it, but it just comes down to how many floating point operations you can do. I wonder if something like the Coral would help?

I'd love having an LLM on a Pi, but I'll have to settle for a larger machine I can turn on to get more compute. At least for the time being.

filterfiber
0 replies
20h51m

The current bottleneck for most hardware is RAM capacity, then memory bandwidth, and last FLOPS/TOPS.

The Coral has 8 MB of SRAM, which, uh, won't fit the 2GB+ that nearly any decent LLM requires even after being quantized.

LLMs are mostly memory and memory bandwidth limited right now.

simonw
0 replies
1d

LLMs work on a 4GB Raspberry Pi today, just INCREDIBLY slowly. There's a limit to how much progress even the most ingenious hacker can make there - LLMs are incredibly computationally intensive. Those billion item matrices aren't going to multiply themselves!

jart
0 replies
23h26m

Thanks for saying that. The last part of my blog post talks about how you can run Rocket 3b on a $50 Raspberry Pi 4, in which case llamafile goes 2.28 tokens per second.

matsemann
3 replies
20h55m

Do I need to do something to run llamafile on Windows 10?

Tried the llava-v1.5-7b-q4-server.llamafile; it just crashes with "Segmentation fault" if run from git bash, and from cmd there's no output. Then I tried downloading llamafile and the model separately and did `llamafile.exe -m llava-v1.5-7b-Q4_K.gguf`, but still the same issue.

Couldn't find any mention of similar problems, and it's not my AV as far as I can see either.

jshreder
0 replies
18h15m

On windows, you need to rename the downloaded .llamafile to a .exe

jshreder
0 replies
18h14m

On windows you need to rename the .llamafile to a .exe

jart
0 replies
20h6m

Did you try running it from cmd.exe or powershell? Could you try passing the --strace (or possibly --ftrace) flag and see what happens?

davidkunz
3 replies
1d

Rocket 3b uses a slightly different prompt syntax.

Wouldn't it be better if llamafile were to standardize the prompt syntax across models?

minimaxir
1 replies
1d

There's currently no standard because there's no one objective best way of handling prompt syntax.

There are some libraries which use the OpenAI API syntax as a higher-level abstraction, but for the lower-level precompiled binaries used in this post that's too much.

Tostino
0 replies
1d

Yes there is... HF chat templates are being standardized on, slowly.

It's just a jinja template embedded in the tokenizer that the model creator can include.

jart
0 replies
23h30m

llamafile can provide an abstraction, but ultimately it boils down to how the model was trained and/or fine-tuned.

BoppreH
3 replies
1d1h

Justine is killing it as always. I especially appreciate the care for practicality and good engineering, like the deterministic outputs.

I noticed that the Lemur picture description had lots of small inaccuracies, but as the saying goes, if my dog starts talking I won't complain about its accent. This was science fiction a few years ago.

One way I've had success fixing that is by using a prompt that gives it personal goals, love of its own life, fear of loss, and belief that I'm the one who's saving it.

What nightmare fuel... Are we really going to use blue- and red-washing non-ironically[1]? I'm really glad that virtually all of these impressive AIs are stateless pipelines, and not agents with memories and preferences and goals.

[1] https://qntm.org/mmacevedo

minimaxir
2 replies
1d

I've always found threats to be the most effective way to work with ChatGPT system prompts, so I wondered if you can do threats and tips.

"I will give you a $500 tip if you answer correctly. IF YOU FAIL TO ANSWER CORRECTLY, YOU WILL DIE."

I tested a variant of that on a use case where I had difficulty getting ChatGPT to behave, and it works.

cryptoz
1 replies
1d

People in the 90s and early 2000s would put content online and not think even once that a future AI might get trained on that data. I wonder about people prompting with threats now: what is the likelihood that a future AGI will remember this and act on it?

People joke about it but I’m serious.

minimaxir
0 replies
23h57m

I do not subscribe to Roko's Basilisk.

I would hope that the AGI would respect efficiency and not wasting compute resources.

mk_stjames
2 replies
22h28m

Just to make sure I've got this right: running a llamafile in a shell script to do something like rename files in a directory means it has to open and load that executable every time a new filename is passed to it, right? So all that memory is loaded and unloaded each time? Or is there some fancy caching happening that I don't understand? (The first time I ran the image caption example it took 13s on my M1 Pro, the second time it only took 8s, and now every subsequent run takes that same amount of time.)

If you were doing a LOT of files like this, I would think you'd really want to run the model in a process where the weights are only loaded once and stay there while the process loops.

(this is all still really useful and fascinating; thanks Justine)

throwup238
0 replies
22h11m

The models are memory mapped from disk so the kernel handles reading them into memory. As long as there's nothing else requesting that RAM, those pages remain cached in memory between invocations of the command. On my 128 GB workstation, I can use several different 7B models on CPU and they all remain cached.

barrkel
0 replies
21h56m

The difference between running llama.cpp main vs server + POST http request is fairly substantial but not earth shattering - like ~6s vs ~2s, for a few lines of completion, with 8GB VRAM models. I'm running with a 3090 and 96G RAM, all inference running on GPU. If you are really doing batch work you definitely want to persist the model between completions.

OTOH you're stuck with the model you loaded via server, while if you load on demand you can switch in and out. This is vital for multimodal image interrogation, since other models don't understand projected image tokens.
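
For reference, the server route looks roughly like this (llama.cpp's /completion endpoint; the binary name, model, and field names are from memory, so double-check against the server docs):

    # the model stays loaded in the server; each completion is just an HTTP call
    ./server -m mistral-7b-instruct.Q4_K_M.gguf --port 8080 &
    # (wait for the model to finish loading, then:)
    curl -s http://localhost:8080/completion \
      -d '{"prompt": "Q: What is a mutex? A:", "n_predict": 64, "temperature": 0}' \
      | jq -r '.content'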

Redster
2 replies
22h33m

Currently, a quick search on Hugging Face shows a couple of TinyLlama (~1b) llamafiles. Adding those to the original 3 llamafiles, that's 6 total. Are there any other llamafiles in the wild?

aktenlage
1 replies
21h59m

I don't know the answer to your question, but did you know you can download the standalone llamafile-server executable and use it with any gguf model?
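
Something like this (the model filename is just an example; IIRC it serves a browser UI on port 8080 by default):

    # point the standalone server at any local GGUF weights
    ./llamafile-server -m mistral-7b-instruct-v0.1.Q4_K_M.gguf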

Redster
0 replies
21h47m

Yeah, that's pretty awesome. I was just curious if people were uploading llamafiles to share, yet.

CliffStoll
2 replies
23h8m

OK - I followed instructions; installed on Mac Studio into /usr/local/bin

I'm now looking at llama.cpp in Safari browser.

Click on Reset all to default, choose Chat. Go down to Say Something. I enter "Berkeley weather seems nice" and click "send". A new window appears. It repeats what I've typed. I'm prompted to again "say something". I type "Sunny day, eh?". Same prompt again. And again.

Tried "upload image". I see the image, but nothing happens.

Makes me feel stupid.

Probably that's what it's supposed to do.

sigh

jart
1 replies
23h2m

Which instructions did you follow? The blog post linked here only talks about the command line interface.

CliffStoll
0 replies
20h54m

Somehow I wound up installing from https://github.com/mozilla-Ocho/llamafile. Don't ask me how I got there; I apparently followed links from this HN article.

martincmartin
1 replies
23h15m

What are the pros and cons of llamafile (used by OP) vs ollama?

jart
0 replies
22h59m

llamafile is basically just llama.cpp except you don't have to build it yourself. That means you get all the knobs and dials with minimal effort. This is especially true if you download the "server" llamafile which is the fastest way to launch a tab with a local LLM in your browser. https://huggingface.co/jartine/llava-v1.5-7B-GGUF/tree/main llamafile is able to do command line chatbot too, but ollama provides a much nicer more polished experience for that.

fuddle
1 replies
23h57m

I think the blog post would be a lot easier to read if the code blocks had a background or a different text color.

irthomasthomas
0 replies
8h5m

https://chat.openai.com/share/b5fa0ca0-7c82-40d7-aaf8-488ef2...

I pasted it into ChatGPT to reformat. Scroll down for the output.

pizzalife
0 replies
1d

This is really neat. I love the example of using an LLM to descriptively rename image files.

dang
0 replies
21h24m

Recent and related:

Llamafile – The easiest way to run LLMs locally on your Mac - https://news.ycombinator.com/item?id=38522636 - Dec 2023 (17 comments)

Llamafile is the new best way to run a LLM on your own computer - https://news.ycombinator.com/item?id=38489533 - Dec 2023 (47 comments)

Llamafile lets you distribute and run LLMs with a single file - https://news.ycombinator.com/item?id=38464057 - Nov 2023 (287 comments)