Very few open-source LLMs explicitly claim to support structured output, but they're capable enough, and they've presumably seen enough JSON and JSON Schema examples during training, that with enough system-prompt tweaking they should behave.
Open-source models are actually _better_ at structured outputs, because you can constrain them with tools like JSONFormer that hook into the model's decoding internals (https://www.reddit.com/r/LocalLLaMA/comments/17a4zlf/reliabl...). The structured outputs can be arbitrary grammars, not just JSON (https://github.com/outlines-dev/outlines#using-context-free-...).
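For anyone who hasn't tried these, a minimal sketch of the JSONFormer approach, assuming a locally loaded HuggingFace causal LM (the model name and schema below are just placeholders, not anything specific from this thread):

```
# Rough sketch: JSONFormer constrains decoding so the output matches a JSON schema.
# The model name and schema are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from jsonformer import Jsonformer

model_name = "databricks/dolly-v2-3b"  # swap for whatever you run locally
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# JSONFormer generates only the values; the keys and structure come from the schema.
schema = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string"},
        "confidence": {"type": "number"},
    },
}

prompt = "Classify the sentiment of: 'The battery life is terrible.'"
result = Jsonformer(model, tokenizer, schema, prompt)()  # returns a Python dict
print(result)
```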
Yeah, JSON mode in Ollama, which isn’t even the full llama.cpp grammar functionality, performs better than OpenAI for me at this point. I don’t understand how they can be raking in billions of dollars and can’t even get this basic stuff right.
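For context, Ollama's JSON mode is just a `format` flag on the generate endpoint; a rough sketch against a local server (the model name is arbitrary, and you still want to ask for JSON in the prompt or it tends to ramble):

```
# Rough sketch: Ollama's JSON mode via its local HTTP API.
# Assumes `ollama serve` is running and a model such as "mistral" has been pulled.
import json
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral",
        "prompt": "Extract the product and sentiment from: 'The battery life is terrible.' "
                  "Respond in JSON with keys 'product' and 'sentiment'.",
        "format": "json",   # constrains decoding to valid JSON
        "stream": False,
    },
)
data = json.loads(resp.json()["response"])  # "response" holds the model's JSON string
print(data)
```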
llama.cpp performs better than what?
GPT-3.5 Turbo, one of the GPT-4 models? Via the API or the app?
JSON mode and function-calling with a JSON schema in the OpenAI API.
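(For anyone following along, that means roughly the following with the official Python client; the function name and schema here are only illustrative:)

```
# Rough sketch of OpenAI JSON mode and function calling with a JSON Schema.
# Assumes OPENAI_API_KEY is set; the function name and schema are illustrative.
from openai import OpenAI

client = OpenAI()

# JSON mode: the model is constrained to emit a valid JSON object.
chat = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Reply in JSON with keys 'sentiment' and 'reason'."},
        {"role": "user", "content": "The battery life is terrible."},
    ],
)

# Function calling: the JSON schema describes the arguments the model should produce.
tools = [{
    "type": "function",
    "function": {
        "name": "record_sentiment",
        "parameters": {
            "type": "object",
            "properties": {
                "sentiment": {"type": "string", "enum": ["Positive", "Neutral", "Negative"]},
            },
            "required": ["sentiment"],
        },
    },
}]
call = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    messages=[{"role": "user", "content": "The battery life is terrible."}],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "record_sentiment"}},
)

print(chat.choices[0].message.content)
print(call.choices[0].message.tool_calls[0].function.arguments)
```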
Right, but which model?
It makes a huge difference.
I’ve been using OpenChat 3.5 1210 most recently. Before that, Mistral-OpenOrca. Both return JSON more consistently than gpt-3.5-turbo.
gpt-3.5-turbo is not the benchmark
I don’t know what point you’re trying to make. They also return JSON more consistently than gpt-4, but I don’t use that because it’s overkill and expensive for my text extraction tasks.
Because people have different interests and want to hear your results for different reasons.
Some want to consider results relative to cost, and some are interested only in how it compares to SOTA.
I mean, sure, but the parent should also just explicitly state what it is they were asking or claiming. I’ve answered every question asked. Making vague declarations about something not being “the benchmark,” while not stating what you think “the benchmark” should be, is unhelpful.
Yes, but you should also instruct the model to follow that specific pattern in its answer, or else the accuracy of the response degrades even though it's following your grammar/pattern/whatever.
For example, if you use Llama-2-7b for classification (three categories, "Positive", "Negative", "Neutral"), you might write a grammar like this:
```
root ::= "{" ws "sentiment:" ws sentiment "}"
sentiment ::= ("Positive" | "Neutral" | "Negative" )
ws ::= [ \t\n]*
```
But if the model doesn't know it has to generate this schema, the accuracy of classifications drops because it's trying to say other things (e.g., "As an AI language model...") which then get suppressed and "converted" to the grammar.
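To make that concrete, here is a rough sketch with llama-cpp-python that enforces the grammar above and also tells the model what shape to produce (the model path is a placeholder):

```
# Rough sketch: enforce the grammar above with llama-cpp-python, while also
# telling the model in the prompt what shape to produce. Model path is a placeholder.
from llama_cpp import Llama, LlamaGrammar

GRAMMAR = r"""
root ::= "{" ws "sentiment:" ws sentiment "}"
sentiment ::= ("Positive" | "Neutral" | "Negative")
ws ::= [ \t\n]*
"""

llm = Llama(model_path="./llama-2-7b.Q4_K_M.gguf")  # placeholder path
grammar = LlamaGrammar.from_string(GRAMMAR)

prompt = (
    "Classify the sentiment of the review as Positive, Neutral, or Negative.\n"
    "Answer only with {sentiment: <label>} and nothing else.\n"
    "Review: The battery life is terrible.\n"
    "Answer: "
)
out = llm(prompt, grammar=grammar, max_tokens=32, temperature=0.0)
print(out["choices"][0]["text"])  # something like {sentiment: Negative}
```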
Similarly, I think it is important to provide an “|” alternative in the grammar that defines an error response, and to explain to the model that it should use that format to explain why it cannot complete the requested operation when it runs into something invalid.
Otherwise, it is forced to always provide a gibberish success response that you likely won’t catch.
I’ve tested this with Mixtral, and it seems capable of deciding between the normal response and error response based on the validity of the data passed in with the request. I’m sure it can still generate gibberish in the required success response format, but I never actually saw it do that in my limited testing, and it is much less likely when the model has an escape hatch.
Can you elaborate? So you instruct the model to either follow the grammar OR say why it can't do that? But the model has no idea this grammar exists (you can tell it the schema but the model doesn't know its tokens are going through a logprobs modification).
No, the grammar can do OR statements. You provide two grammars, essentially. You always want to tell the model about the expected response formats, so that it can provide the best response it can, even though you’re forcing it to fit the grammar anyways.
In JSON Schema, you can do a “oneOf” between two types, and you can easily convert a JSON Schema into the grammar that llama.cpp expects. One of the types would be the success response; the other would be an error response, such as a JSON object containing only a required string field “ErrorResponse”, which you explain to the model is used to provide the reason it cannot complete the request. It will literally fill in an explanation when it runs into troublesome data, at least in my experience.
Then the model can “choose” which type to respond with, and the grammar will allow either.
If everything makes sense, the model should provide the successful response you’re requesting, otherwise it can let you know something weird is going on by responding with an error.
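A rough sketch of what such a schema could look like (the extraction fields are made up for illustration; only the “ErrorResponse” branch comes from the description above):

```
# Rough sketch of the "success or error" oneOf schema described above.
# The invoice fields are invented examples; "ErrorResponse" is the escape hatch.
schema = {
    "oneOf": [
        {   # success response
            "type": "object",
            "properties": {
                "invoice_number": {"type": "string"},
                "total": {"type": "number"},
            },
            "required": ["invoice_number", "total"],
            "additionalProperties": False,
        },
        {   # error response: the model explains why it could not comply
            "type": "object",
            "properties": {"ErrorResponse": {"type": "string"}},
            "required": ["ErrorResponse"],
            "additionalProperties": False,
        },
    ]
}
```

You would convert this to the grammar format your runtime expects and, in the prompt, describe both shapes and when to use each.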
Ah I see. So you give the entire "monadic" grammar to the LLM, both as a `grammar` argument and as part of the prompt so it knows the "can't do that" option exists.
I'm aware of the "OR" statements in grammar (my original comment uses that). In my experience though, small models quickly get confused when you add extra layers to the JSON schema.
I wouldn’t provide the grammar itself directly, since I feel like the models probably haven’t seen much of that kind of grammar during training, but just JSON examples of what success and error look like, as well as an explanation of the task. The model will need to generate JSON (at least with the grammar I’ve been providing), so seeing JSON examples seems beneficial.
But, this is all very new stuff, so certainly worth experimenting with all sorts of different approaches.
As far as small models getting confused, I’ve only really tested this with Mixtral, but it’s entirely possible that regular Mistral or other small models would get confused… more things I would like to get around to testing.
I've tested giving the JSON schema to the model (bigger ones can handle multi-layer schemas) __without__ grammar and it was still able to generate the correct answer. To me it feels more natural than grammar enforcement because the model stays in its "happy place". I then sometimes add the grammar on top to guarantee the desired output structure.
This is obviously not efficient because the model has to process many more tokens at each interaction, and its context window gets full quicker as well. I wonder if others have found better solutions.
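For what it's worth, the “schema in the prompt, grammar optionally layered on top” pattern looks roughly like this with llama-cpp-python (the model path, schema, and grammar are illustrative):

```
# Rough sketch: put the JSON Schema in the prompt, then optionally enforce it with a grammar.
# Model path, schema, and grammar are placeholders.
import json
from llama_cpp import Llama, LlamaGrammar

schema = {
    "type": "object",
    "properties": {"sentiment": {"type": "string", "enum": ["Positive", "Neutral", "Negative"]}},
    "required": ["sentiment"],
}

prompt = (
    "You are an information extraction system.\n"
    f"Respond with a single JSON object matching this JSON Schema:\n{json.dumps(schema)}\n"
    "Review: The battery life is terrible.\n"
    "JSON: "
)

llm = Llama(model_path="./mixtral-8x7b.Q4_K_M.gguf")  # placeholder path

# Option 1: trust the model to follow the schema it was shown.
soft = llm(prompt, max_tokens=64, temperature=0.0)

# Option 2: same prompt, but also guarantee the structure with a grammar
# (hand-written here; it could also be generated from the schema).
grammar = LlamaGrammar.from_string(r"""
root ::= "{\"sentiment\": " value "}"
value ::= "\"Positive\"" | "\"Neutral\"" | "\"Negative\""
""")
hard = llm(prompt, max_tokens=64, temperature=0.0, grammar=grammar)

print(soft["choices"][0]["text"])
print(hard["choices"][0]["text"])
```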
This can sometimes be fixed with a few-shot example for in-context learning.
But you are right that the model can go off the rails if it is being forced too far from where its 'happy place' is, especially for smaller models.
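Something like the following prefix, so the model has already seen the exact output shape before the real input (the examples are made up):

```
# Rough sketch of a few-shot prefix for in-context learning; the examples are invented.
FEW_SHOT = (
    "Review: Absolutely love it, works perfectly.\n"
    '{"sentiment": "Positive"}\n\n'
    "Review: It arrived on time, nothing special.\n"
    '{"sentiment": "Neutral"}\n\n'
)

def build_prompt(review: str) -> str:
    # The few-shot examples demonstrate the pattern; the grammar (if any) only enforces it.
    return FEW_SHOT + f"Review: {review}\n"

print(build_prompt("The battery life is terrible."))
```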
There are now several open source models that are fine tuned for function calling including:
* Functionary [https://github.com/MeetKai/functionary]
* NexusRaven [https://github.com/nexusflowai/NexusRaven-V2]
* Gorilla [https://github.com/ShishirPatil/gorilla]
Could be interesting to try some of these exercises with these models.
... and I spent the last few hours trying them out :)
A low-latency, high-quality function-calling API product may be a billion-dollar business in two years.
What are your findings?
They are not at the level of gpt-4 tool calling. But at least they are open source and they will get better.
That last link is interesting. See https://github.com/outlines-dev/outlines#using-context-free-... specifically
Sure, that's "correct" per the definition of the grammar, but it's also one of the worst possible ways to get to the number 5.