Very few open-source LLMs explicitly claim to support structured output, but they're capable enough, and they've presumably seen enough JSON and JSON Schema examples during training, that with enough system-prompt tweaking they should behave.
Open-source models are actually _better_ at structured outputs, because you can constrain them with tools like JSONFormer that hook into the model's decoding internals (https://www.reddit.com/r/LocalLLaMA/comments/17a4zlf/reliabl...). The structured outputs can be arbitrary grammars, not just JSON (https://github.com/outlines-dev/outlines#using-context-free-...).
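For anyone who hasn't tried these, a minimal sketch of the JSONFormer approach, assuming a locally loaded HuggingFace causal LM (the model name and schema below are just placeholders, not anything specific from this thread):

```
# Rough sketch: JSONFormer constrains decoding so the output matches a JSON schema.
# The model name and schema are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from jsonformer import Jsonformer

model_name = "databricks/dolly-v2-3b"  # swap for whatever you run locally
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# JSONFormer generates only the values; the keys and structure come from the schema.
schema = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string"},
        "confidence": {"type": "number"},
    },
}

prompt = "Classify the sentiment of: 'The battery life is terrible.'"
result = Jsonformer(model, tokenizer, schema, prompt)()  # returns a Python dict
print(result)
```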
Yeah, JSON mode in Ollama, which isn’t even the full llama.cpp grammar functionality, performs better than OpenAI for me at this point. I don’t understand how they can be raking in billions of dollars and can’t even get this basic stuff right.
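For context, Ollama's JSON mode is just a `format` flag on the generate endpoint; a rough sketch against a local server (the model name is arbitrary, and you still want to ask for JSON in the prompt or it tends to ramble):

```
# Rough sketch: Ollama's JSON mode via its local HTTP API.
# Assumes `ollama serve` is running and a model such as "mistral" has been pulled.
import json
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral",
        "prompt": "Extract the product and sentiment from: 'The battery life is terrible.' "
                  "Respond in JSON with keys 'product' and 'sentiment'.",
        "format": "json",   # constrains decoding to valid JSON
        "stream": False,
    },
)
data = json.loads(resp.json()["response"])  # "response" holds the model's JSON string
print(data)
```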
llama.cpp performs better than what?
GPT-3.5 Turbo, one of the GPT-4 models? Via the API or the app?
JSON mode and function-calling with a JSON schema in the OpenAI API.
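(For anyone following along, that means roughly the following with the official Python client; the function name and schema here are only illustrative:)

```
# Rough sketch of OpenAI JSON mode and function calling with a JSON Schema.
# Assumes OPENAI_API_KEY is set; the function name and schema are illustrative.
from openai import OpenAI

client = OpenAI()

# JSON mode: the model is constrained to emit a valid JSON object.
chat = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Reply in JSON with keys 'sentiment' and 'reason'."},
        {"role": "user", "content": "The battery life is terrible."},
    ],
)

# Function calling: the JSON schema describes the arguments the model should produce.
tools = [{
    "type": "function",
    "function": {
        "name": "record_sentiment",
        "parameters": {
            "type": "object",
            "properties": {
                "sentiment": {"type": "string", "enum": ["Positive", "Neutral", "Negative"]},
            },
            "required": ["sentiment"],
        },
    },
}]
call = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    messages=[{"role": "user", "content": "The battery life is terrible."}],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "record_sentiment"}},
)

print(chat.choices[0].message.content)
print(call.choices[0].message.tool_calls[0].function.arguments)
```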
Right, but which model?
It makes a huge difference.
I’ve been using OpenChat 3.5 1210 most recently. Before that, Mistral-OpenOrca. Both return JSON more consistently than gpt-3.5-turbo.
gpt-3.5-turbo is not the benchmark
I don’t know what point you’re trying to make. They also return JSON more consistently than gpt-4, but I don’t use that because it’s overkill and expensive for my text extraction tasks.
Because people have different interests and want to hear your results for different reasons.
Some want to consider results relative to cost, and some are interested only in how it compares to SOTA.
I mean, sure, but the parent should also just explicitly state what it is they were asking or claiming. I’ve answered every question asked. Making vague declarations about something not being “the benchmark,” while not stating what you think “the benchmark” should be, is unhelpful.
Yes, but you should also instruct the model to follow that specific pattern in its answer, or else the accuracy of the response degrades even though it's following your grammar/pattern/whatever.
For example, if you use Llama-2-7b for classification (three categories, "Positive", "Negative", "Neutral"), you might write a grammar like this:
```
root ::= "{" ws "sentiment:" ws sentiment "}"
sentiment ::= ("Positive" | "Neutral" | "Negative" )
ws ::= [ \t\n]*
```
But if the model doesn't know it has to generate this schema, the accuracy of classifications drops because it's trying to say other things (e.g., "As an AI language model...") which then get suppressed and "converted" to the grammar.
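To make that concrete, here is a rough sketch with llama-cpp-python that enforces the grammar above and also tells the model what shape to produce (the model path is a placeholder):

```
# Rough sketch: enforce the grammar above with llama-cpp-python, while also
# telling the model in the prompt what shape to produce. Model path is a placeholder.
from llama_cpp import Llama, LlamaGrammar

GRAMMAR = r"""
root ::= "{" ws "sentiment:" ws sentiment "}"
sentiment ::= ("Positive" | "Neutral" | "Negative")
ws ::= [ \t\n]*
"""

llm = Llama(model_path="./llama-2-7b.Q4_K_M.gguf")  # placeholder path
grammar = LlamaGrammar.from_string(GRAMMAR)

prompt = (
    "Classify the sentiment of the review as Positive, Neutral, or Negative.\n"
    "Answer only with {sentiment: <label>} and nothing else.\n"
    "Review: The battery life is terrible.\n"
    "Answer: "
)
out = llm(prompt, grammar=grammar, max_tokens=32, temperature=0.0)
print(out["choices"][0]["text"])  # something like {sentiment: Negative}
```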
Similarly, I think it is important to provide an “|” alternative in the grammar that defines an error response, and to explain to the model that it should use that format to explain why it cannot complete the requested operation when it runs into something invalid.
Otherwise, it is forced to always provide a gibberish success response that you likely won’t catch.
I’ve tested this with Mixtral, and it seems capable of deciding between the normal response and error response based on the validity of the data passed in with the request. I’m sure it can still generate gibberish in the required success response format, but I never actually saw it do that in my limited testing, and it is much less likely when the model has an escape hatch.
Can you elaborate? So you instruct the model to either follow the grammar OR say why it can't do that? But the model has no idea this grammar exists (you can tell it the schema but the model doesn't know its tokens are going through a logprobs modification).
No, the grammar can do OR statements. You provide two grammars, essentially. You always want to tell the model about the expected response formats, so that it can provide the best response it can, even though you’re forcing it to fit the grammar anyways.
In JSON Schema, you can do a “oneOf” between two types, and you can easily convert a JSON Schema into the grammar that llama.cpp expects. One of the types would be the success response; the other would be an error response, such as a JSON object containing only a required string field “ErrorResponse”, which you explain to the model is used to provide the reason it cannot complete the request. It will literally fill in an explanation when it runs into troublesome data, at least in my experience.
Then the model can “choose” which type to respond with, and the grammar will allow either.
If everything makes sense, the model should provide the successful response you’re requesting, otherwise it can let you know something weird is going on by responding with an error.
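A rough sketch of what such a schema could look like (the extraction fields are made up for illustration; only the “ErrorResponse” branch comes from the description above):

```
# Rough sketch of the "success or error" oneOf schema described above.
# The invoice fields are invented examples; "ErrorResponse" is the escape hatch.
schema = {
    "oneOf": [
        {   # success response
            "type": "object",
            "properties": {
                "invoice_number": {"type": "string"},
                "total": {"type": "number"},
            },
            "required": ["invoice_number", "total"],
            "additionalProperties": False,
        },
        {   # error response: the model explains why it could not comply
            "type": "object",
            "properties": {"ErrorResponse": {"type": "string"}},
            "required": ["ErrorResponse"],
            "additionalProperties": False,
        },
    ]
}
```

You would convert this to the grammar format your runtime expects and, in the prompt, describe both shapes and when to use each.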
Ah I see. So you give the entire "monadic" grammar to the LLM, both as a `grammar` argument and as part of the prompt so it knows the "can't do that" option exists.
I'm aware of the "OR" statements in grammar (my original comment uses that). In my experience though, small models quickly get confused when you add extra layers to the JSON schema.
I wouldn’t provide the grammar itself directly, since I feel like the models probably haven’t seen much of that kind of grammar during training, but just JSON examples of what success and error look like, as well as an explanation of the task. The model will need to generate JSON (at least with the grammar I’ve been providing), so seeing JSON examples seems beneficial.
But, this is all very new stuff, so certainly worth experimenting with all sorts of different approaches.
As far as small models getting confused, I’ve only really tested this with Mixtral, but it’s entirely possible that regular Mistral or other small models would get confused… more things I would like to get around to testing.
I've tested giving the JSON schema to the model (bigger ones can handle multi-layer schemas) __without__ grammar and it was still able to generate the correct answer. To me it feels more natural than grammar enforcement because the model stays in its "happy place". I then sometimes add the grammar on top to guarantee the desired output structure.
This is obviously not efficient because the model has to process many more tokens at each interaction, and its context window gets full quicker as well. I wonder if others have found better solutions.
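For what it's worth, the “schema in the prompt, grammar optionally layered on top” pattern looks roughly like this with llama-cpp-python (the model path, schema, and grammar are illustrative):

```
# Rough sketch: put the JSON Schema in the prompt, then optionally enforce it with a grammar.
# Model path, schema, and grammar are placeholders.
import json
from llama_cpp import Llama, LlamaGrammar

schema = {
    "type": "object",
    "properties": {"sentiment": {"type": "string", "enum": ["Positive", "Neutral", "Negative"]}},
    "required": ["sentiment"],
}

prompt = (
    "You are an information extraction system.\n"
    f"Respond with a single JSON object matching this JSON Schema:\n{json.dumps(schema)}\n"
    "Review: The battery life is terrible.\n"
    "JSON: "
)

llm = Llama(model_path="./mixtral-8x7b.Q4_K_M.gguf")  # placeholder path

# Option 1: trust the model to follow the schema it was shown.
soft = llm(prompt, max_tokens=64, temperature=0.0)

# Option 2: same prompt, but also guarantee the structure with a grammar
# (hand-written here; it could also be generated from the schema).
grammar = LlamaGrammar.from_string(r"""
root ::= "{\"sentiment\": " value "}"
value ::= "\"Positive\"" | "\"Neutral\"" | "\"Negative\""
""")
hard = llm(prompt, max_tokens=64, temperature=0.0, grammar=grammar)

print(soft["choices"][0]["text"])
print(hard["choices"][0]["text"])
```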
This can sometimes be fixed with a few-shot example for in-context learning.
But you are right that the model can go off the rails if it is being forced too far from where its 'happy place' is, especially for smaller models.
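Something like the following prefix, so the model has already seen the exact output shape before the real input (the examples are made up):

```
# Rough sketch of a few-shot prefix for in-context learning; the examples are invented.
FEW_SHOT = (
    "Review: Absolutely love it, works perfectly.\n"
    '{"sentiment": "Positive"}\n\n'
    "Review: It arrived on time, nothing special.\n"
    '{"sentiment": "Neutral"}\n\n'
)

def build_prompt(review: str) -> str:
    # The few-shot examples demonstrate the pattern; the grammar (if any) only enforces it.
    return FEW_SHOT + f"Review: {review}\n"

print(build_prompt("The battery life is terrible."))
```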
There are now several open source models that are fine tuned for function calling including:
* Functionary [https://github.com/MeetKai/functionary]
* NexusRaven [https://github.com/nexusflowai/NexusRaven-V2]
* Gorilla [https://github.com/ShishirPatil/gorilla]
Could be interesting to try some of these exercises with these models.
... and I spent the last few hours trying them out :)
A low-latency, high-quality function-calling API product may be a billion-dollar business in two years.
What are your findings?
They are not at the level of gpt-4 tool calling. But at least they are open source and they will get better.
That last link is interesting. See https://github.com/outlines-dev/outlines#using-context-free-... specifically
Sure, that's "correct" per the definition of the grammar, but it's also one of the worst possible ways to get to the number 5.