Structured Outputs in the API

titzer
36 replies
1d

It's so wild that the bar for AI performance is both absurdly high and absurdly low at the same time. To specify an output format (language or grammar) for solving a computational problem is one of the oldest exercises around. On the one hand, it's breathtakingly mundane that the model can now do the most basic of tasks: conform to an output specification. It's weird reading this kind of self-congratulatory blog post about it, like OpenAI has just discovered flint knives. On the other hand, a computer system can process natural language with extremely ambiguous, open-ended problems, compute solutions to said problems, even correct its own mistakes--and then it can format the output correctly. And then on yet another hand, it only took about 10^25 floating point operations (yeah, just ten trillion trillion, right!?) to get this outcome.

throwawaymaths
12 replies
22h52m

On the one hand, it's breathtakingly mundane that the model can now do the most basic of tasks: conform to an output specification.

I highly doubt it's the model that does this... It's very likely code injected into the token picker. You could put this into any model all the way down to gpt-2.

crowcroft
11 replies
22h26m

I wonder if you get 90% of the way there with prompt engineering, and then the last 10% is just brute force: validate the output, and if it fails, rerun the prompt.

My assumption is that if that's all this is, they would have done it a long time ago, though.

jeeceebees
4 replies
22h14m

You can just mask the output probabilities for each token based on which options are valid according to a grammar.

There are quite a few open source implementations of this e.g. https://github.com/outlines-dev/outlines
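In case it helps to see the idea concretely, here's a toy sketch of that masking step (names are made up; real libraries like outlines compile the grammar to a state machine rather than re-checking prefixes like this):

    import math

    def constrained_greedy_pick(logits, vocab, generated_so_far, is_valid_prefix):
        # logits: score per token id; vocab: token id -> token text.
        # is_valid_prefix: True if a string can still be extended into
        # something the grammar accepts.
        best_id, best_score = None, -math.inf
        for token_id, token_text in enumerate(vocab):
            if not is_valid_prefix(generated_so_far + token_text):
                continue  # mask out: this token can't lead to valid output
            if logits[token_id] > best_score:
                best_id, best_score = token_id, logits[token_id]
        return best_id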

contravariant
3 replies
21h1m

You could simply censor invalid tokens, but that does rely on 2 assumptions.

1. There is always a valid next token.

2. This greedy algorithm doesn't result in a qualitatively different distribution from a rejection sampling algorithm.

The latter isn't too obvious, and may in fact be (very) false. Look up maze generation algorithms if you want some feeling for the effects this could have.

If you just want a quick argument, consider what happens if picking the most likely token would increase the chance of an invalid token further down the line to nearly 100%. By the time your token-picking algorithm has any effect it would be too late to fix it.

throwawaymaths
2 replies
20h6m

Sorry, how could there not be a valid next token? Presumably your interface would generate a state machine with appropriate masking arrays, and iirc generally speaking all 256 byte choices are in the token list. There's no way to get stuck in a place where the JSON is invalid? Can you give an example?

If you want to be really clever about your picker, a deterministic result would blat out all the known possible strings.

For example, if you had an object with a defined set of properties, you could just go ahead and not bother generating tokens for the property names and just tokenize, e.g. `{"foo":"` (6-ish tokens), without even passing through the LLM. As soon as an unescaped `"` arrives, you know the continuation must be `,"bar":"`, for example.
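Roughly like this, assuming a fixed two-property schema (the `foo`/`bar` names are just the ones from the example above):

    # Scaffolding between values is fully determined by the schema, so it can
    # be emitted verbatim without sampling; only the values need the LLM.
    KEYS = ["foo", "bar"]

    def forced_chunk(i):
        opener = "{" if i == 0 else ","
        return f'{opener}"{KEYS[i]}":"'

    print(forced_chunk(0))  # {"foo":"  <- emitted for free, no LLM call
    print(forced_chunk(1))  # ,"bar":"  <- spliced in once the value's closing " arrives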

This greedy algorithm doesn't result in a qualitatively different distribution from a rejection sampling algorithm.

It absolutely will. But so will adding an extra newline in your prompt, for example. That sort of thing is part and parcel of how LLMs work.

contravariant
1 replies
19h35m

Hmm, I think any example where it can get stuck is going to be a bit contrived, since really it's a question of how easy it is to recognize a valid prefix. Say, for example, you want the LLM to generate a valid chess match and it ends up in a situation with just 2 kings left. If you're not careful with your definitions you could end up in an endless loop.

That said if you know all valid prefixes in your language in advance then you can always realise when a token leaves no valid continuations.

It absolutely will. But so will adding an extra newline

A newline is less likely to dramatically drop the quality; a greedy method could easily end up driving itself into a dead end (if not grammatically then semantically).

Say you want it to give a weather prediction consisting of a description followed by a tag 'sunny' or 'cloudy' and your model is on its way to generate

    {
      "desc": "Strong winds followed by heavy rainfall.",
      "tag": "stormy"
    }
If it ever gets to the 's' in stormy it will be forced to pick 'sunny', even if that makes no sense in context.
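A two-line illustration of why that lock-in happens under prefix constraints:

    TAGS = ["sunny", "cloudy"]

    def reachable(partial):
        # tags still consistent with what's been generated so far
        return [t for t in TAGS if t.startswith(partial)]

    print(reachable(""))   # ['sunny', 'cloudy']
    print(reachable("s"))  # ['sunny'] -- once the model starts "stormy",
                           # the constraint forces it to finish "sunny"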

arjvik
0 replies
19h25m

Schema needs to be a part of the prompt as well so it can associatively recall the options

senko
2 replies
20h54m

Using this in a naive way can easily degenerate into the LLM outputting syntactically/grammatically valid tokens that make no sense, like in this example: https://community.openai.com/t/json-format-causes-infinite-n...

This might be even more pronounced when the output is restricted more using the JSON schema.

So the heavy lifting here was most likely to align the model to avoid/minimize such outcomes, not in tweaking the token sampler.

dilap
1 replies
20h27m

Isn't your example showing an issue w/ the opposite approach, where someone is getting bad output w/ an earlier OpenAI JSON mode that worked via training rather than mechanical output restriction to conform to a schema?

FWIW (not too much!) I have used llama.cpp grammars to restrict fine-tuned phi2 models to specific formats (not JSON in particular, but an expected format), and I didn't hit any issues like this.

I am not intuitively seeing why restricting sampling to tokens matching a schema would cause the LLM to converge on valid tokens that make no sense...

Are there examples of this happening w/ people using e.g. jsonformer?

TheEzEzz
0 replies
18h43m

You're basically taking the model "off policy" when you bias the decoder, which can definitely make weird things happen.

crowcroft
0 replies
22h0m

Oh, thanks for the links. Super interesting!

throwawaymaths
0 replies
20h20m

Yeah but that's hugely wasteful of tokens.

thruway516
9 replies
22h58m

I dont understand your complaint at all. If you develop a new revolutionary technology called an automobile, developing steering, brakes, starter, mufflers for it is a pretty big deal even if reins, clamps, mufflers and keys are mundane and have existed for decades. Structured outputs are a pretty big step in making this magic actually usable by developers as opposed to generating impressive cat pictures or whatever has captured the public imagination.

Bjartr
6 replies
22h54m

I don't think it was a complaint, just an observation.

thruway516
5 replies
22h9m

Yes, probably. But considering that non-deterministic output is the nature of the beast with LLMs, and we're (mostly) engineers here, calling any part of this mundane sounds almost more like fighting words than just an observation.

the8thbit
4 replies
21h40m

Extremely pedantic, but is "non-deterministic" really the right language? The same input will always produce the same output, provided you haven't intentionally configured the system to use the model non-deterministically. It seems like the right way to describe it is as a chaotic deterministic system. The same input will always produce the same output, but small shifts in the input or weights can result in dramatic and difficult to predict changes in outputs.

visarga
2 replies
21h30m

The same input will always produce the same output

Not guaranteed even with the same seed. If you don't perform all operations in exactly the same order, even a simple float32 sum, if batched differently, will result in a different final value. This depends on the load factor and how resources are allocated.

taneq
0 replies
12h51m

This doesn’t mean LLMs are inherently non-deterministic, just that current common implementations are non-deterministic.

simonw
0 replies
21h27m

Yeah, the fact that floating point multiplication isn't associative is a real pain for producing deterministic outputs - especially when you're running massively parallel computations on GPUs (or multiple GPUs) making the order of operations even less predictable.

davedx
0 replies
21h31m

Llms are indeed non deterministic

jappgar
1 replies
20h45m

Structured outputs are hard... but they claimed to have solved this a year ago.

They were lying, of course, and meanwhile charged output tokens for malformed JSON.

jiggawatts
0 replies
16h0m

Structured output is trivial: just select the output tokens from the given list of probability values filtered for the allowed next tokens in the schema.

Other LLM vendors figured this out many months ago.

berkes
2 replies
9h7m

I am so often surprised by the AI community's software. Often unpleasantly surprised, often just eye-rolling.

When we first started using the OpenAI APIs, the first thing I reached for was some way "to be certain that the response is properly formatted". There wasn't one. A common solution was (is?) "just run the query again, until you can parse the JSON". Really? After decades of software engineering, we still go for the "have you tried turning it off and on again" on all levels.
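For anyone who hasn't seen it in the wild, the pattern being complained about is literally this (a minimal sketch; `call_llm` stands in for whatever client wrapper you use):

    import json

    def ask_for_json(prompt, call_llm, max_attempts=3):
        for _ in range(max_attempts):
            reply = call_llm(prompt)
            try:
                return json.loads(reply)  # hope it parses this time
            except json.JSONDecodeError:
                continue                  # ...turn it off and on again
        raise ValueError("model never produced parseable JSON")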

Then I reached for common, popular tools: everyone's using them, they ought to be good, right? But many of these tools, from langchain to dify to promptflow, are a mess (Edit: to alter the tone: I'm honestly impressed by the breadth and depth of these tools. I'm just surprised by their lack of stability). Nearly all of them suffer from always-outdated documentation. Several will break right after installing, due to today's ecosystem updates that haven't been incorporated entirely yet. Understandably: they operate in an ecosystem that changes by the hour. But after decades of software engineering, I want stuff that's stable, predictable, documented. If that means I'm running LLM models from a year ago: fine. At least it'll work. Sure, this constant state of brokenness is fine for a PoC, a demo, or some early stage startup. But it's terrible for something that I want to ensure still works in 12 months, or 4 years even, without a weekly afternoon of upgrade-all-dependencies-and-hope-it-works, then update-my-business-logic-code-to-match-the-now-deprecated-APIs.

KoolKat23
1 replies
7h12m

Well the simple answer is don't use it then.

In the same way people revert to older stable releases. You're welcome to revert to writing boilerplate code yourself.

The reason people are excited and use these tools is that they show promise; they already offer significant benefits even if they aren't "stable".

berkes
0 replies
6h54m

"Just write your own if you aren't happy" doesn't make my critique invalid.

My problem is that this community, and thus the ecosystem, is repeating many mistakes, reinventing wheels, employing known bad software-engineering practices, or lacking any software-engineering practices at all. It is, in short, not learning from decades of work.

It's not standing on the shoulders of giants, it's kludging together wobbly ladders to race to the top. Even if "getting to the top first" is the primary goal, it's not the fastest way. And it's certainly not the most durable way.

(To be clear: I have seen the JS (node) community doing the exact same, racing fast towards cliffs and walls, realizing this, throwing it all out, racing to another cliff and so on. And I see many areas in the Python community doing this too. Problems have been solved! For decades! Learn from these instead of repeating the entire train of failures to maybe finally solve the problems in your niche/language/platform/ecosystem)

srcreigh
1 replies
23h49m

If I wanted to be a silly pedant, I’d say that Turing machines are language specifications and thus it’s theoretically impossible for an LLM or any program to validate output formats in general.

jes5199
0 replies
23h9m

in _general_ sure, but if you restricted each token to conform to a Kleene-star grammar you should be able to guarantee that you get something that parses according to a context-free grammar

scarmig
1 replies
22h30m

I have struggled writing valid YAML before (my tokenizer doesn't handle whitespace very well). And it probably takes me a quadrillion operations on the reals to get a minimal YAML file (I think your 10^25 fp ops is an overestimate--I think it's more like 10^18-10^19).

It's kind of like an inverse Moravec's paradox.

theturtle32
0 replies
20h22m

Relatable!!

codingwagie
1 replies
23h57m

I think it will take a long time for the world at large to realize and then operationalize the potential of this "mundane" technology. It is revolutionary, and also sitting in plain sight. Such a huge technological shift that was considered decades out only a few years ago

ben_w
0 replies
22h58m

Although I am an optimist* about what this can do, I am very much aware — from personal experience — how easy it is to see more than is really there.

The realisation of the tech might be fantastic new things… or it might be that people like me are Clever Hans-ing the models.

* that may be the wrong word; "strong capabilities" is what I think is present, those can be used for ill effects which is pessimistic.

tommica
0 replies
22h58m

For some reason it reminds me of my civilization runs - rush to certain high level tech and then after that discovery writing :D

ramraj07
0 replies
22h57m

This is like saying “we shouldn’t be celebrating a computer that can talk, my parrot can do that!”

raincole
0 replies
21h57m

I don't know, it doesn't sound wild at all to me. Human languages are very imprecise, vague and error-tolerant, which is the opposite of an output format like JSON. So the conclusion that a model can't do these two things well at the same time is quite intuitive.

The wild part is that a model trained on so much human language text can still output mostly compilable code.

m3kw9
0 replies
22h16m

It’s doing more: it allows the user to provide input in natural language, and the output is JSON in the format that the API defines.

jumploops
23 replies
21h52m

By using JSON mode, GPT-4{o} has been able to do this reliably for months (100k+ calls).

We use GPT-4o to build dynamic UI+code[0], and almost all of our calls are using JSON mode. Previously it mostly worked, but we had to do some massaging on our end (backtick removal, etc.).
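That massaging is usually something like this (a sketch of the general pattern, not our actual code):

    import json, re

    def parse_model_json(text):
        # strip the ```json fences the model sometimes wraps around its answer
        text = re.sub(r"^```[a-zA-Z]*\s*|\s*```$", "", text.strip())
        return json.loads(text)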

With that said, this will be great for GPT-4o-mini, as it often struggles/forgets to format things as we ask.

Note: we haven't had the same success rate with function calling compared to pure JSON mode, as function calling seems to add a level of indirection that can reduce the quality of the LLM's output, YMMV.

Anyhow, excited for this!

[0]https://magicloops.dev

qwertox
10 replies
21h18m

What a cool product! I was about to recommend you submit it as a "Show HN", but it turns out that it already got submitted one year ago.

Would you mind sharing a bit on how things have evolved?

jumploops
9 replies
21h4m

Thanks and great question :)

When we first launched, the tool was very manual; you had to generate each step via the UI. We then added a "Loop Creator agent" that now builds Loops for you without intervention. Over the past few months we've mostly been fixing feature gaps and improving the Loop Creator.

Based on recent user feedback, we've put a few things in motion:

- Form generator (for manual loops)

- Chrome extension (for local automations)

- In-house Google Sheets integration

- Custom outputs (charts, tables, etc.)

- Custom Blocks (shareable with other users)

With these improvements, you'll be able to create "single page apps" like this one I made for my wife's annual mango tasting party[0].

In addition to those features, we're also launching a new section for Loop templates + educational content/how-tos, in an effort to help people get started.

To be super candid, the Loop Creator has been a pain. We started at an 8% success rate and we're only just now at 25%. Theoretically we should be able to hit 80%+ based on existing loop requests, but we're running into limits with the current state of LLMs.

[0]https://mangota.ngo

gleb
3 replies
20h53m

Where do you get such a large variety of mangoes?

jumploops
1 replies
20h27m

My mother-in-law is the President of the Central Florida Fruit Society, and is in charge of sourcing mangoes for their annual party. She sends us all the excess mangoes :)

As I understand it, this year's mangoes mostly came from Merritt Island, as there was some not-so-great weather in southern Florida.

JamesSwift
0 replies
5h14m

Mango production was down here in Rockledge too (right next to Merritt Island). Bad year all around for production. Last year was awesome though.

Are you in central Florida, or is that just your in-laws? I'd love to see a talk on this at the local Orlando devs meetup.

tomcam
0 replies
20h28m

Asking the important questions

frabjoused
2 replies
17h38m

Thanks for all the detail and I’m fascinated by this.

We’re working on fairly similar problems, would love to have a chat and share ideas and experiences.

Is this something you’d be interested in?

jumploops
1 replies
16h20m

Always happy to chat!

You can ping me at {username}@gmail.com

frabjoused
0 replies
29m

Sent!

samstave
0 replies
18h40m

This would be great in the extension format:

highlights-text --> Right-Click --> New ML --> (smart dropdown for watch [price|name|date|{typed-in-prompt-instructions}] --> TAB --> (smart frequency - tabbing through {watch blah (and its auto-filling every N ) --> NAME_ML=ML01.

THEN:

highlights-text --> Right-Click .... WHEN {ML01} == N DO {this|ML0X} --> ML00

ML00 == EMAIL|CSV|GDrive results.

ML11 == Graph all the above outputs.

:-)

NotMichaelBay
0 replies
2h59m

To be super candid, the Loop Creator has been a pain. We started at an 8% success rate and we're only just now at 25%. Theoretically we should be able to hit 80%+ based on existing loop requests, but we're running into limits with the current state of LLMs.

That blows my mind! You have users paying for it and it only has a 25% success rate for loops created by users? I've been working for about a year on an LLM-based product and haven't launched yet because only 50-60% of my test cases are passing.

diego_sandoval
3 replies
19h50m

Can I use Magic Loops to generate Magic Loops for me?

jumploops
2 replies
19h30m

Technically yes, but it would require reverse-engineering some of our APIs.

Practically speaking, we have quite a few use-cases where users call Loops from other Loops, so we're investigating a first-class API to generate Loops in one go.

Similar to regular software engineering, what you put in is what you get out, so we've been hesitant to launch this with the current state of LLMs/the Loop Creator as it will fail more often than not.

samstave
1 replies
18h34m

This would be great in the extension format:

highlights-text --> Right-Click --> New ML --> (smart dropdown for watch [price|name|date|{typed-in-prompt-instructions}] --> TAB --> (smart frequency - tabbing through {watch blah (and its auto-filling every N ) --> NAME_ML=ML01.

THEN:

highlights-text --> Right-Click .... WHEN {ML01} == N DO {this|ML0X} --> ML00

ML00 == EMAIL|CSV|GDrive results.

ML11 == Graph all the above outputs.

:-)

--

A MasterLoop would be good - where you have all [public or private] loops register - and then you can route logic based on loops that exist - and since its promptable - it can summarize and suggest logic when weaving loops into cohesive lattices of behavior. And if loops can subscribe to the output of other loops -- when youre looking for certain output strands -- you can say:

Find all the MLB loops daily and summarize what they say about only the Dodgers and the Giants - and keep a running table for the pitchers, and catchers stats only.

EDIT: Aside from just subscribing, maybe a loop can #include# the loop in its behavior to accomplish its goal/assimilate its function / mate/spawn. :-)

These loops are pretty Magic!

jumploops
0 replies
17h50m

where you have all [public or private] loops register - and then you can route logic based on loops that exist - and since its promptable

This is actually where we started :)

We had trouble keeping all the context in via RAG and the original 8k token window, but we're aiming to bring this back in the (hopefully near) future.

__jl__
2 replies
16h57m

Thanks for sharing your experience.

We also get pretty reliable JSON output (on a smaller scale though) even without JSON mode. We usually don't use JSON mode because we often include a chain of thought part in <brainstorming> and then ask for JSON in <json> tags. With some prompt engineering, we get over 98% valid JSON in complex prompts (with long context and modestly complex JSON format). We catch the rest with json5.loads, which is only used as a fallback if json.loads fails.
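The fallback looks roughly like this (the tag names are the ones mentioned above; the extraction helper itself is just a sketch):

    import json
    import json5  # more forgiving parser, used only when strict parsing fails

    def parse_json_block(completion: str):
        # pull the <json>...</json> section out of the completion and parse it
        start = completion.index("<json>") + len("<json>")
        end = completion.index("</json>", start)
        block = completion[start:end].strip()
        try:
            return json.loads(block)
        except json.JSONDecodeError:
            return json5.loads(block)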

4o-mini has been less reliable for us particularly with large context. The new structured output might make it possible to use mini in more situations.

borsch
0 replies
16h29m

my recommendation? use chain of thought, then feed that into a second prompt asking for json

JimDabell
0 replies
13h51m

The linked article includes a section on this, under “Separating a final answer from supporting reasoning or additional commentary”. They suggest defining a JSON schema with a reasoning field and an answer field.
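That is, something along these lines (the two field names are the ones the article suggests; the rest of the schema is just an example):

    reasoning_then_answer = {
        "type": "object",
        "properties": {
            "reasoning": {"type": "string"},  # generated first, acts as chain of thought
            "answer": {"type": "string"},     # the part you actually consume downstream
        },
        "required": ["reasoning", "answer"],
        "additionalProperties": False,
    }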

tomcam
1 replies
20h30m

Very interesting. Did you build magicloops using this tech?

jumploops
0 replies
20h6m

We first built Magic Loops with GPT-4, about a year ago, well before JSON mode was a thing.

We had to do a bunch of extra prompting to make it work, as GPT would often include backticks or broken JSON (most commonly extra commas). At the time, YAML was a much better approach.

Thankfully we've been able to remove most of these hacks, but we still use a best effort JSON parser[0] to help stream partial UI back to the client.

[0]https://www.npmjs.com/package/best-effort-json-parser

geepytee
1 replies
17h39m

This model appears to be full of surprises.

The 50% drop in price for inputs and 33% for outputs vs. the previous 4o model is huge.

It also appears to be topping various benchmarks, ZeroEval's Leaderboard on hugging face [0] actually shows that it beats even Claude 3.5 Sonnet on CRUX [1] which is a code reasoning benchmark.

Shameless plug, I'm the co-founder of Double.bot (YC W23). After seeing the leaderboard above we actually added it to our copilot for anyone to try for free [2]. We try to add all new models the same day they are released

[0]https://huggingface.co/spaces/allenai/ZeroEval

[1]https://crux-eval.github.io/

[2]https://double.bot/

usaar333
0 replies
14h6m

ZeroEval's Leaderboard on hugging face [0] actually shows that it beats even Claude 3.5 Sonnet on CRUX [1] which is a code reasoning benchmark.

The previous version of 4o also beat 3.5 Sonnet on Crux.

turnsout
0 replies
18h10m

Had the same experience with function calling—we get much better results simply asking for JSON. With simple schemas (basically dictionaries), gpt-4 and 4o are basically bulletproof.

jjcm
19 replies
1d

Interesting tidbit at the very end that's worth noting for anyone using the API today:

By switching to the new gpt-4o-2024-08-06, developers save 50% on inputs ($2.50/1M input tokens) and 33% on outputs ($10.00/1M output tokens) compared to gpt-4o-2024-05-13.

scrollop
11 replies
1d

From what I've learned from OpenAI, the "latest" "cheaper" model will perform worse than the previous model on various tasks (esp reasoning).

ralusek
4 replies
1d

I don't think it's been well enough acknowledged that all of the shortcuts LLM builders have been taking to compress/refine/index the attention mechanism seem to result in dumber models.

GPT 4 Turbo was more like GPT 3.9, and GPT 4o is more like GPT 3.7.

alach11
1 replies
15h39m

Do you have benchmarks demonstrating this? In my own personal/team benchmarks, I've seen 4o consistently outperform the original gpt-4.

maeil
0 replies
13h49m

I'm building a product that requires complex LLM flows and out of OpenAI's "cheap" tier models, the old versions of Turbo-3.5 are far better than the last versions of it and 4o-mini. I have a number of tasks that the former consistently succeed at and the latter consistently fail at regardless of prompting.

Leaderboards and benchmarks are very misleading as OpenAI is optimizing for them, like in the past when certain CPU manufacturers would optimize for synthetic benchmarks.

FWIW, these aren't chat use cases, for which the newer models may well be better.

Der_Einzige
0 replies
22h21m

They try to gaslight us and tell us this isn't true because of benchmarks, as though anyone has done anything but do the latent space exploration equivalent of throwing darts at the ocean from space.

It's taken years to get even preliminary reliable decision boundary examples from LLMs because doing so is expensive.

samstave
2 replies
23h51m

Am I the only one who wants to know, 1,000%, *WHY* things are like this?

Is it a natural function of how models evolve?

Is it engineered as such? Why? Marketing/money/resources/what?

WHO makes these decisions and why?

---

I have been building a thing with Claude 3.5 pro account and its *utter fn garbage* of an experience.

It lies, hallucinates, malevolently changes code it was already told was correct, removes features, explicitly ignores project files. It has no search, no line numbers, and so much screen real estate is consumed by useless empty space. It ignores stated style guides. It gets CAUGHT forgetting about a premise we were actively working on, then condescendingly apologizes: "oh you're correct - I should have been using XYZ knowledge"

It makes things FN harder to learn.

If I had any claude engineers sitting in the room watching what a POS service it is from a project continuity point...

Its evil. It actively f's up things.

One should have the ability to CHARGE the model token credit when it Fs up so bad.

NO FN SEARCH??? And when asked for line numbers in its output - it's in txt...

Seriously, I practically want not just a refund, I want claude to pay me for my time correcting its mistakes.

ChatGPT does the same thing. It forgets things committed to memory - refactors successful things back out of files. Etc.

It's been a really eye opening and frustrating experience, and my squinting suspicion is that it's specifically intentional:

They dont want people using a $20/month AI plan to actually be able to do any meaningful work and build a product.

scrollop
0 replies
22h54m

Use an API from the top models with a good frontend, then, and use precise instructions.

It's odd, as many people praise claude's coding capabilities.

campers
0 replies
15h25m

It is difficult to get the AI models to get everything right every time. I noticed too that it would sometimes remove comments etc when re-writing code.

The way to get better results is with agentic workflows that break down the task into smaller steps, so the models can iteratively arrive at a correct result. One important step I added to mine is a review step (in the reviewChanges.ts file) in my workflow at https://github.com/TrafficGuard/nous/blob/main/src/swe/codeE...

This gets the diff and asks questions like:

- Are there any redundant changes in the diff?

- Was any code removed in the changes which should not have been?

- Review the style of the code changes in the diff carefully against the original code.

Maybe try using that, or the package that I use which does the actual code edits called Aider https://aider.chat/

scrollop
0 replies
1d

Also, is it a coincidence that a cheaper (potentially faster?) model has been released (just) before they roll out the "new" voice mode (which boasts very low latency)?

codingwagie
0 replies
23h55m

Its usually a distilled smaller model

campers
0 replies
11h12m

That's ok from the perspective of it making room for a more capable and expensive GPT-5 model to compete with Opus 3.5 when that arrives this year. The significant price drop for a small loss in quality is a reasonable tradeoff. Then GPT-4o becomes the mid tier and GPT-4o-mini the low tier.

There were 100 days between Claude 3.0 Opus and Claude 3.5 Sonnet being released, which gave us similar capability at an 80% price reduction. When I was using Opus I was thinking this is nice, but the cost does add up. Having Sonnet 3.5 so soon after was a nice surprise.

One more round of 80% price cuts after that combined with building out the multi-step agentic workflows should provide some decent capabilities!

ComputerGuru
4 replies
23h58m

If you use the undecorated gpt-4o do you automatically get the latest?

tedsanders
0 replies
23h15m

We'll update gpt-4o in 3 weeks. (We've always updated it a couple of weeks after launch, so no one is immediately surprised by a new model drop.)

daguava
0 replies
23h25m

The un-postfixed version will point to the older model for the next 3 weeks their docs say

OutOfHere
0 replies
22h44m

For the record, you should never use that in an application. Always explicitly note the full versioned model name. This will prevent bad surprises because not every new version is an improvement; sometimes they get worse, especially at specific tasks.

minimaxir
1 replies
1d

The new price is also now reflected on the pricing page: https://openai.com/api/pricing/

It's weird that's only a footnote when it's actually a major shift.

sjnair96
0 replies
1d

I also looked up the same. I wonder why. They must have a subsequent announcement regarding this I'd expect.

leetharris
16 replies
1d

At the bottom:

Acknowledgements: Structured Outputs takes inspiration from excellent work from the open source community: namely, the outlines, jsonformer, instructor, guidance, and lark libraries.

It is cool to see them acknowledge this, but it's also lame for a company named "OpenAI" to acknowledge getting their ideas from open source, then contributing absolutely NOTHING back to open source with their own implementation.

spencerchubb
11 replies
1d

Is offering gpt4o for free through chatgpt not enough of a contribution? They didn't release source code, but they made a product free to use

notarobot123
3 replies
23h59m

This isn't generosity, it's a well known and much used strategy for market penetration. Free until-we-decide-otherwise is very much not the same as open source.

spencerchubb
1 replies
22h16m

So if something is free but only temporarily, then that cancels out the generosity? Also, you and I have no idea how long the features will remain free. If anything, chatgpt has been making more features and stronger models free over time.

simonw
0 replies
21h57m

Sometimes it does, yeah. It's not unheard of for companies to deliberately operate at a loss in order to drive out their competition, then raise prices again. This is known as "predatory pricing".

rvense
0 replies
23h23m

Insofar as it is a conscious strategy to make it more expensive at a later date, it is actually sort of the opposite of generosity.

echelon
3 replies
23h57m

That can actually make competition from open source harder. New upstarts that are open source can't compete with free service from OpenAI and can't make money to grow their development or offerings.

OpenAI wants to kill everything that isn't OpenAI.

spencerchubb
1 replies
22h15m

So should OpenAI make their product less accessible, in order to make it easier for competition? That makes no sense

oblio
0 replies
21h41m

I call chicken. Let them make all their products paid.

Hint: they won't, it would kill their company. The hype around OpenAI is based on people using it for free, at least at the start.

Heck, even drug dealers know this trick!

ben_w
0 replies
22h47m

New open source models* still wouldn't be able to compete even if OpenAI was forcibly shut down.

Hardware's too expensive, and will be for a while, because all the big players are trying to get in on it.

* cue arguments: "'open weights' or 'open training data'?"; "does the Meta offering count or are they being sneaky and evil?"; etc.

mplewis
1 replies
1d

No. If it were free you'd be able to use it as a programming API. It's not free and it's not unlimited - it's a time-limited marketing tool.

spencerchubb
0 replies
22h18m

How are you defining the word free?

talldayo
0 replies
1d

Free service != Open software

warkdarrior
1 replies
1d

it's also lame for a company named "OpenAI" to acknowledge getting their ideas from open source, then contributing absolutely NOTHING back to open source with their own implementation

Maybe those projects were used as-is by OpenAI, so there was nothing new to contribute.

sirspacey
0 replies
20h18m

You don’t think anyone will use it to contribute to open source projects?

Seems like an obvious net gain for the community.

g15jv2dp
0 replies
18h29m

If you're unhappy about people using your work without compensating you or contributing back, don't release your work as free software. You can't have your cake and eat it too...

__jl__
12 replies
23h40m

There is another big change in gpt-4o-2024-08-06: It supports 16k output tokens compared to 4k before. I think it was only available in beta before. So gpt-4o-2024-08-06 actually brings three changes. Pretty significant for API users

1. Reliable structured outputs

2. Reduced costs: 50% for input, 33% for output

3. Up to 16k output tokens, compared to 4k

https://platform.openai.com/docs/models/gpt-4o

santiagobasulto
9 replies
21h27m

I’ve noticed that lately GPT has gotten more and more verbose. I’m wondering if it’s a subtle way to “raise prices”, as the average response is going to incur more tokens, which of course makes any API conversation keep growing in tokens (each IN message concatenates the previous OUT messages).

tedsanders
4 replies
20h15m

GPT has indeed been getting more verbose, but revenue has zero bearing on that decision. There's always a tradeoff here, and we do our imperfect best to pick a default that makes the most people happy.

I suspect the reason why most big LLMs have ended up in a pretty verbose spot is that it's easier for users to scroll & skim than to ask follow-up questions (which requires formulation + typing + waiting for a response).

With regard to this new gpt-4o model: you'll find it actually bucks the recent trend and is less verbose than its predecessor.

OJFord
2 replies
16h50m

I suspect the reason why most big LLMs have ended up in a pretty verbose spot is that it's easier for users to scroll & skim than to ask follow-up questions

Maybe it's a 'technical' user divide, but that seems wrong to me. I would much rather a succinct answer that I can probe further or clarify if necessary.

Lately it's going against my custom prompt/profile whatever it's called - to tell it to assume some level of competence, a bit about my background etc., to keep it brief - and it's worse than it was when I created that out of annoyance with it.

Like earlier I asked something about some detail of AWS networking and using reachability analyser with VPC endpoints/peering connections/Lambda or something, and it starts waffling on like 'first, establish the ID of your Virtual Private Cloud Endpoint. Step 1. To locate the ID, go to ...'

zarzavat
0 replies
13h10m

There’s an interesting discrepancy here.

Human users are charged by the number of messages, so longer responses are preferable because follow up questions use up your message allowance.

APIs are charged by token so shorter messages are preferable as you don’t pay for unnecessary tokens.

condiment
0 replies
9h20m

I’ve noticed this as well with coding questions. I will give it problematic code and ask a question about behavior, but it will attempt to reply with a solution to a problem. And even if I prompt it to avoid providing solutions, it ignores my instruction and blasts out huge blocks of useless and typically incorrect code. And once it overwhelms my subtle inquiries with nonsense, it gets stuck repeating itself and I just have to start a new session over.

For me this is one of the strongest motivators for running LLMs locally- even if they’re measurably worse, they’re a far better tool because they don’t change behavior over time.

zamadatix
0 replies
17h47m

Do changes in verbosity tuning have a meaningful impact on the average "correctness" of the responses?

Also your about page is very suspicious for someone at an AI company ;).

sashank_1509
1 replies
20h41m

They also spend more to generate more tokens. The more obvious reason is that it seems like people rate responses better the longer they are. LMSYS demonstrated that GPT tops the leaderboard because it tends to give much longer and more detailed answers, and it seems like OpenAI is optimizing for, or trying to maximize, LMSYS.

maeil
0 replies
14h2m

Agree with this take, though in an even broader way; they're optimizing for the leaderboards and benchmarks in general. Longer outputs lead to better scores on those. Even in this thread I see a lot of comments bring them up, so it works for marketing.

My take is that the leaderboards and benchmarks are still very flawed if you're using LLMs for any non-chat purpose. In the product I'm building, I have to use all of the big 4 models (GPT, Claude, Llama, Gemini), because for each of them there is at least one task that it performs much better than the other 3.

throwaway48540
0 replies
20h36m

It's a subtle way to make it smarter. Making it write out the "thinking process" and decisions has always helped with reliability and quality.

sophiabits
0 replies
20h0m

I’ve especially noticed this with gpt-4o-mini [1], and it’s a big problem. My particular use case involves keeping a running summary of a conversation between a user and the LLM, and 4o-mini has a really bad tendency of inventing details in order to hit the desired summary word limit. I didn’t see this with 4o or earlier models

Fwiw my subjective experience has been that non-technical stakeholders tend to be more impressed with / agreeable to longer AI outputs, regardless of underlying quality. I have lost count of the number of times I’ve been asked to make outputs longer. Maybe this is just OpenAI responding to what users want?

[1] https://sophiabits.com/blog/new-llms-arent-always-better#exa...

bilater
0 replies
58m

I have not been able to get it to output anywhere close to the max though (even setting max tokens high). Are there any hacks to use to coax the model to produce longer outputs?

Culonavirus
0 replies
22h15m

That's actually pretty impressive... if they didn't dumb it down that is, which only time will tell.

gamegoblin
10 replies
1d

I'm glad they gave up on their "fine-tuning is all you need" approach to structured output. It's possible fine-tuning will work in the long term, but in the short-term, people are trying to build things, and fine-tuning wasn't cutting it.

Surprised it took them so long — llama.cpp got this feature 1.5 years ago (actually an even more general version of it that allows the user to provide any context free grammar, not just JSON schema)

chhabraamit
4 replies
1d

How does llama.cpp’s grammar adherence work?

Does it keep validating the predicted tokens and backtrack when it’s not valid?

gamegoblin
3 replies
1d

It's essentially an Earley Parser[0]. It maintains a set of all possible currently valid parses, and zeroes out the probability of any token that isn't valid in at least 1 of the current potential parse trees.

There are contrived grammars you can give it that will make it use exponential memory, but in practice most real-world grammars aren't like this.

[0] https://en.wikipedia.org/wiki/Earley_parser

orlp
1 replies
5h53m

You don't even need that for JSON. JSON can be expressed using an LR(1) grammar, so you can do it in linear time and space.

gamegoblin
0 replies
1h54m

Yes, the llama.cpp work supports arbitrary CFGs, not just JSON

tcdent
2 replies
1d

GPT is still a language model, so at some point it's still just tokens.

Is this just a schema validation layer on their end to avoid the round trip (and cost) of repeating the call?

gamegoblin
1 replies
1d

Language models like GPT output a large vector of probabilities for the next token. Then a sampler decides which of those tokens to pick.

The simplest algorithm for getting good quality output is to just always pick the highest probability token.

If you want more creativity, maybe you pick randomly among the top 5 highest probability tokens or something. There are a lot of methods.

All that grammar-constrained decoding does is zero out the probability of any token that would violate the grammar.
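As a sketch of that pipeline (illustrative only; real samplers work on tensors, not dicts):

    import random

    def sample_token(probs, allowed, top_k=5):
        # probs: token_id -> probability; allowed: token ids valid under the grammar
        masked = {t: p for t, p in probs.items() if t in allowed}  # zero out the rest
        candidates = sorted(masked, key=masked.get, reverse=True)[:top_k]
        weights = [masked[t] for t in candidates]
        return random.choices(candidates, weights=weights, k=1)[0]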

nickreese
0 replies
22h17m

Thank you for this explanation. A few things just clicked for me.

Der_Einzige
0 replies
22h26m

For many things, fine-tuning as we know it will NEVER fully solve it; there's no hope. Even fine-tuning a model to not use the letter "e" to an overwhelming degree doesn't entirely prevent it, only reduces its chances to increasingly small amounts. Shameless self-plug, and from before the ChatGPT era too! https://paperswithcode.com/paper/most-language-models-can-be...

BoorishBears
0 replies
1d

I was surprised it took so long until I reached this line:

The model can fail to follow the schema if the model chooses to refuse an unsafe request. If it chooses to refuse, the return message will have the refusal boolean set to true to indicate this.

I'm not sure how they implemented that, maybe they've figured out a way to give the grammar a token or set of tokens that are always valid mid generation and indicate the model would rather not continue generating.

Right now JSON generation is one of the most reliable ways to get around refusals, and they managed not to introduce that weakness into their model

pton_xd
9 replies
22h51m

Isn't "we hardcoded JSON into the latest model" kind of the opposite direction, strategically, from "we're on the way to AGI and I need 7 trillion to get there?"

isoprophlex
5 replies
22h45m

You are witnessing the final stages in the evolution of OpenAI from a messianic hype machine to Yet Another Product Company.

Hence all the people leaving, too.

gardenhedge
4 replies
22h34m

I am out of the loop - employees are leaving OpenAI?

dangrossman
3 replies
22h30m

John Schulman, one of the co-founders of artificial intelligence company OpenAI, has left the ChatGPT maker for rival Anthropic, he said in a post on social media platform X late Monday.

OpenAI's President and co-founder Greg Brockman is also taking a sabbatical through the end of the year, he said in a X post late Monday.

Peter Deng, a vice-president of product, also left in recent months, a spokesperson said. And earlier this year, several members of the company’s safety teams exited.

That's after co-founder and Chief Scientist Ilya Sutskever left in May.

oblio
1 replies
21h44m

Are there any co-founders left?

sashank_1509
0 replies
20h39m

Sam Altman, for one.

g15jv2dp
0 replies
18h33m

Some employees quitting in a 1500 people company? Impossible. Any departure must be interpreted as the doom of openai, there's no other possibility.

nsonha
0 replies
14h38m

AGI is useless if you can't figure out how to employ it as part of a system, instead of just chit chat

chamomeal
0 replies
14h45m

I mean it’s not groundbreaking, but it makes it much easier to make simple AI tools that aren’t chat-based. It definitely has me interested.

GPT-4 has been so mind-blowingly cool, but most of the interesting applications I can think of involve 10 steps of “ok now make sure GPT has actually formatted the question as a list of strings… ok now make sure GPT hasn’t responded with a refusal to answer the question…”

Idk what the deal is with their weird hype persona thing, but I’m stoked about this release

KaiMagnus
0 replies
22h39m

Yeah, definitely a way to end up with a Siri like mess if you do this long enough. The use case is there and it’s going to be very useful, but the magic is wearing off.

zoogeny
7 replies
23h3m

Totally tangential, totally not related to the post (unless you squint your eyes and really blur things) ...

I was thinking about the old canard of the sufficiently smart compiler. It made me think about LLM output and how in some way the output of a LLM could be bytecode as much as it could be the English language. You have a tokenized input and the translated output. You have a massive and easily generatable training set. I wonder if, one day, our compilers will be LLMs?

pjc50
2 replies
23h1m

Why would you tolerate an unreliable compiler with no assured relationship between its inputs and its outputs? Have people just got too comfortable with the C++ model of "UB means I can insert a security bug for you"?

bigyikes
1 replies
22h41m

In a hypothetical future where the reliability of LLMs improves, I can imagine the model being able to craft optimizations that a traditional compiler cannot.

Like there are already cases where hand-rolling assembly can eke out performance gains, but few do that because it’s so arduous. If the LLM could do it reliably it’d be a huge win.

It’s a big if, but not outside the realm of possibility.

zoogeny
0 replies
22h32m

I agree it is currently a pipe dream. But if I was looking for a doctoral research idea, it might be fun to work on something like that.

Lots of potential avenues to explore, e.g. going from a high-level language to some IR, from some IR to bytecode, or straight from high-level to machine code.

I mean, -O3 is already so much of a black box that I can't understand it. And the tedium of hand optimizing massive chunks of code is why we automate it at all. Boredom is something we don't expect LLMs to suffer, so having one pore over some kind of representation and apply optimizations seems totally reasonable. And if it had some kinds of "emergent behaviors" based on intelligence that allow it to beat the suite of algorithmic optimization we program into compilers, it could actually be a benefit.

thih9
0 replies
21h20m

I guess an actual compiler would be cheaper and more reliable.

In theory we could do the same with mathematical computations, 2+2=4 and the like; but computing the result seems easier.

shepherdjerred
0 replies
18h49m

Compilers require strict semantics and deterministic output. It’s the exact opposite of AI.

I could see AI being used (in a deterministic way) to make decisions about what optimizations to apply, to improve error messages, or make languages easier to use/reason about, but not for the frontend/backend/optimizations themselves.

killthebuddha
0 replies
22h13m

A function that implements natural language -> bytecode is IMO way more likely to be under the hood an LLM operating a compiler (or maybe a compiler operating LLMs) rather than a "bare" LLM. From an end user's perspective maybe it won't matter but I think it's an important technical point. IMO there's no evidence that an LLM will ever be the best way to execute general purpose computations.

jcims
0 replies
23h1m

You definitely could, not far removed from text to image or text to audio generators.

nichochar
7 replies
23h54m

I'm a little confused why you have to specify "strict: true" to get this behavior. It is obviously always desired; I would be surprised for people to ever specify "strict: false". That API design leaves something to be desired.

I also learned about constrained decoding[1], which they give a brief explanation of. This is a really clever technique! It will increase reliability as well as reduce latency (fewer tokens to pick from) once the initial artifacts are loaded.

[1] https://www.aidancooper.co.uk/constrained-decoding/

athyuttamre
5 replies
22h58m

Hi, I work on the OpenAI API — structured outputs schemas have limitations (e.g. all fields must be required, no additional properties allowed): https://platform.openai.com/docs/guides/structured-outputs/s....

If your schema is not supported, but you still want to use the model to generate output, you would use `strict: false`. Unfortunately we cannot make `strict: true` the default because it would break existing users. We hope to make it the default in a future API version.
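For reference, a request with strict mode on looks roughly like this (going by the docs linked above; the schema itself is just an example):

    from openai import OpenAI

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": "What's the weather like?"}],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "weather_report",
                "strict": True,  # all fields required, additionalProperties false
                "schema": {
                    "type": "object",
                    "properties": {
                        "desc": {"type": "string"},
                        "tag": {"type": "string", "enum": ["sunny", "cloudy"]},
                    },
                    "required": ["desc", "tag"],
                    "additionalProperties": False,
                },
            },
        },
    )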

Der_Einzige
3 replies
22h24m

You should also mention that before you had done custom alignment accounting for this feature, it was an excellent alignment breaker (therefore a big no-no to release too early).

For example, if I ask an LLM to generate social security numbers, it will give the whole "I'm sorry Hal, I can't do that". If I ban all tokens except numbers and hyphens, prior to your "refusal = True" approach, it was guaranteed that even "aligned" models would generate what appeared to be social security numbers.

ethical_source
2 replies
20h54m

And if LLMs can generate plausible social security numbers, our civilization will fall /s

Christ, I hate the AI safety people who brain-damage models so that they refuse to do things trivial to do by other means. Is LLM censorship preventing bad actors from generating social security numbers? Obviously not. THEN WHY DOES DAMAGING AN LLM TO MAKE IT REFUSE THIS TASK MAKE CIVILIZATION BETTER OFF?

History will not be kind to safetyist luddites.

consteval
0 replies
1h51m

I don't think it's about better off or worse off, it's about PR and perception.

The wider consumer base is practically itching for a reason to avoid this tech. If it gets out, that could be a problem.

It was the same issue with Gemini's image gen. Sure, the problem Google had was bad. But could you even imagine what it would've been if they did nothing? Meaning, the model could generate horrendously racist images? That 100% would've been a worse PR outcome. Imagine someone gens a mistral image and attributes it to Google's model. Like... that's bad for Google. Really bad.

Terretta
0 replies
20h8m

I'm less concerned with the AI teams lobotomizing utility, more concerned with damage to language, particularly redefining the term "safe" to mean something like "what we deem suitable".

That said, when zero "safety" is at stake might be the best time to experiment with how to build and where to put safety latches, for when we get to a point we mean actual safety. I'm even OK with models that default to parental control for practice provided it can be switched off.

thrance
0 replies
4h12m

Hi, if you're allowed to answer, are there any future plans to support custom CFGs through the API? Like llama.cpp does with its GBNF format.

dgellow
0 replies
23h22m

Could you elaborate a bit re: the API? What do you dislike other than the “strict: true”?

OutOfHere
7 replies
21h29m

Using this feature will obviously "lock you in" to OpenAI, specifically to this model too, at least until other companies catch on. While text prompts can more easily be moved to other LLMs, this feature cannot currently be ported as such. I would use it only if a text prompt is insufficient despite retries.

moralestapia
1 replies
20h57m

OTOH, not using it could "lock you out" of building a cool product for your users, so ...

OutOfHere
0 replies
19h0m

It depends. Quite often the intended schema is simple enough that parsing plaintext lines using a regex is sufficient. I do however have to use a verbose prompt to get it to comply with the expected format. Combine this with retries in case of rare failures, and the job is done.

I think this new feature is more relevant for intricate schemas and dynamic schemas when a text prompt cannot do the job.

dtquad
1 replies
21h17m

OpenAI-style JSON mode and function calling rapidly became the industry standard way of doing it. It will probably also happen for this feature.

toomuchtodo
0 replies
21h13m

“S3 compatible”

faizshah
0 replies
20h41m

The converse API in AWS bedrock lets you use function calling across a number of different providers (doesn’t support OpenAI): https://docs.aws.amazon.com/bedrock/latest/userguide/convers...

I have been using it so that my agents aren’t specific to a particular model or api.

Like others have said many other providers already have function calling and json schema for structure outputs.

PufPufPuf
0 replies
20h58m

This feature has existed for quite some time in several inference libraries, like Outlines, under the names "constrained decoding" or "guided decoding". Some even include it in their OpenAI-compatible API in a very similar form (allowing to pass in a JSON Schema). All this required doing your own inference, though -- so the announcement really just brings this popular feature "to the masses".

BoorishBears
0 replies
21h25m

It's already supported by multiple other providers. Fireworks, Together, probably more.

gdiamos
0 replies
22h54m

Yeah! - outlines, guidance, jsonformer were inspiring for this line of work

msoad
1 replies
22h44m

Why not JSON Schema?

gdiamos
0 replies
21h41m

We did some user studies and found that people found it less intuitive.

radarsat1
0 replies
23h5m

Looks useful!

cvhc
6 replies
23h45m

I wonder why the top level has to be an object instead of an array... I have some pretty normal use cases where I expect the model to extract a list of objects from the text.

``` openai.BadRequestError: Error code: 400 - {'error': {'message': 'Invalid schema for response_format \'PolicyStatements\': schema must be a JSON Schema of \'type: "object"\', got \'type: "array"\'.', 'type': 'invalid_request_error', 'param': 'response_format', 'code': None}} ```

I know I can always put the array into a single-key object, but it's just so annoying that I also have to modify the prompts accordingly to accommodate this.
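The workaround, for anyone hitting the same thing (the field names here are hypothetical, pick whatever fits your prompt):

    statements_schema = {
        "type": "object",
        "properties": {
            "statements": {                      # wrap the array in a single key
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {"effect": {"type": "string"}},
                    "required": ["effect"],
                    "additionalProperties": False,
                },
            },
        },
        "required": ["statements"],
        "additionalProperties": False,
    }
    # then read result["statements"] instead of a bare top-level list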

tomComb
1 replies
23h24m

Well, this wouldn’t be a very satisfying explanation, but these JSON objects are often represented as Python dictionaries and those can’t have top level arrays.

Too
0 replies
13h6m

Try json.loads("[1,2,3]"), you'll get a list back.

The reasons others already posted about extensibility are more correct.

simonw
0 replies
22h1m

I've regretted designing APIs that return an array rather than an object in the past.

It's all about the extensibility. If you return an object you can add extra keys, for things like "an error occurred, here are the details", or "this is truncated, here's how to paginate it", or a logs key for extra debug messages, or information about the currently authenticated user.

None of those are possible if the root is an array.

moritzwarhier
0 replies
23h17m

It's a relatively common convention for JSON APIs.

Possible reasons:

- Extensibility without breaking changes

- Forcing an object simplifies parsing of API responses, ideally the key should describe the contents, like additional metadata. It also simplifies validation, if considered separate from parsing

- Forcing the root of the API response to be an object makes sure that there is a single entry point into consuming it. There is no way to place non-descript heterogenous data items next to each other

- Imagine that you want to declare types (often generated from JSON schemas) for your API responses. That means you should refrain from placing different types, or a single too broad type in an array. Arrays should be used in a similar way to stricter languages, and not contain unexpected types. A top-level array invites dumping unspecified data to the client that is expensive and hard to process

- The blurry line between arrays and objects in JS does not cleanly map to other languages, not even very dynamic ones like PHP or Python. I'm aware that JSON and JS object literals are not the same. But even the JSON subset of JS (apart from number types, where it's not a subset AFAIK) already creates interesting edge cases for serialization and deserialization

manquer
0 replies
23h29m

I can't say for OpenAI, but in general I have seen and used this design pattern to keep consistency of root object output and remove a lot of unnecessary validations and branching flows

Otherwise you will have to handle both scenarios in code everywhere, since you don't know whether the root is an object or an array. If the root has a key that conforms to a known schema, then validation becomes easier to write for that scenario.

Similar reasons to why so many APIs wrap all responses with a key like 'data', 'value' or 'error', or why in RESTful HTTP APIs collection endpoints (say GET /v1/my-object) do not mix with resource URIs (GET /v1/my-object/1): the former always returns an array, the latter always an object.

heliophobicdude
0 replies
23h21m

Back in the old days, top level arrays were a security risk because the array constructor in JS could be redefined and do bad-guy stuff. I cannot think of any json parsing clients that are vulnerable to this.

surfingdino
5 replies
23h38m

Is it still NTSAT (Never The Same Answer Twice)?

H8crilA
4 replies
23h24m

Yes, this happens by design.

oblio
3 replies
21h40m

Interesting, why? Is there no theoretical way to have stable models? Or some kind of executive decision?

H8crilA
2 replies
11h32m

It's just how this sort of model works at inference time. You can set the temperature to zero, and then you'll get mostly constant outputs (but not quite, since as far as I can tell nobody is working on determinizing such systems, and it is not really deterministic).

oblio
1 replies
7h24m

I'm going to have to dust off some ancient math... I think LLMs are based on neural networks, which are based on regression models, which are based on systems of differential equations, for which, last time I studied this stuff, we only had probabilistic solutions?

It's been a bunch of decades since I did anything even remotely related to these areas (and I was never good at math).

H8crilA
0 replies
1h25m

Check out 3blue1brown, they have a series on transformers. Very approachable. The randomness is very explicit in inference.

H8crilA
5 replies
23h25m

How is this different from function calling?

tedsanders
2 replies
23h17m

Under the hood, it's quite similar to function calling. A few differences:

- Structured Outputs is a bit more straightforward. e.g., you don't have to pretend you're writing a function where the second arg could be a two-page report to the user, and then pretend the "function" was called successfully by returning {"success": true}

- Having two interfaces lets us teach the model different default behaviors and styles, depending on which you use

- Another difference is that our current implementation of function calling can return both a text reply plus a function call (e.g., "Let me look up that flight for you"), whereas Structured Outputs will only return the JSON

(I worked on this feature at OpenAI.)
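
To illustrate the difference between the two interfaces, a sketch against the chat completions API as described in the announcement; the schema, names, and temperatures involved are made up for the example:

```
from openai import OpenAI

client = OpenAI()

schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "temperature_c": {"type": "number"}},
    "required": ["city", "temperature_c"],
    "additionalProperties": False,
}

# Structured Outputs: the assistant's reply itself must match the schema.
structured = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "weather_report", "schema": schema, "strict": True},
    },
)

# Function calling: the same schema describes the arguments of a "function"
# the model may decide to call.
with_tools = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[{
        "type": "function",
        "function": {"name": "report_weather", "parameters": schema, "strict": True},
    }],
)
```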

technics256
1 replies
22h20m

How can we enable a text reply alongside a function call? Usually the returned message is only a tool call when it calls a tool?

tedsanders
0 replies
22h12m

There's no special interface, but you can write an instruction in a system message in the first position. E.g., "Before each function call, explain to the user what you're about to do." It's not super reliable, but the model can do it. Few-shot prompting might help as well.
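
A rough sketch of that prompting pattern (the tool here is hypothetical, and as noted above the behavior isn't guaranteed):

```
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system",
         "content": "Before each function call, explain to the user what you're about to do."},
        {"role": "user", "content": "Is flight UA123 on time?"},
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "lookup_flight",  # hypothetical tool
            "parameters": {
                "type": "object",
                "properties": {"flight_number": {"type": "string"}},
                "required": ["flight_number"],
                "additionalProperties": False,
            },
        },
    }],
)

msg = resp.choices[0].message
print(msg.content)     # may hold the "Let me look up that flight..." text
print(msg.tool_calls)  # may hold the lookup_flight call
```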

nsonha
0 replies
14h8m

I guess practically this will be the equivalent of function calling with a single tool. The AI spits out some payload that you presumably feed into whatever you do next. There are only 2 possible outcomes: it can/cannot produce the payload, which is equivalent to calling/not calling the next functionality.

With function/tool calling, you give it a list of tools with payloads and the AI will select between them, or refuse, so there are more than 2 possible outcomes.

(I've never used any of these APIs; this is just from reading the docs.)

binarymax
0 replies
23h19m

Function calling uses JSON mode. While it has been mostly correct, I do get an incorrectly formatted response sometimes (maybe 1 in 10k requests?). So it sounds like this fixes that bug.

wewtyflakes
4 replies
1d

Why would someone want `strict` to be anything other than `true`?

ComputerGuru
1 replies
23h56m

There are many reasons, though I am not sure which they had in mind. One thing is that LLMs in general tend to do better when they can be more verbose in their output and sort of "think aloud" to reach an answer. Insisting on a strict output format would rob it of those benefits (because the model doesn't just not emit those stages, it skips them entirely, or else you'd be paying for those elided output tokens).

wewtyflakes
0 replies
23h50m

But then why would someone specify that the response has to be in a given JSON schema (by presence of the schema itself), but then also not care if it is actually using that schema (by specifying `strict` as `false`)? That is the use-case I can't wrap my head around.

tedsanders
0 replies
23h6m

We didn't cover this in the announcement post, but there are a few reasons:

- The first request with each JSON schema will be slow, as we need to preprocess the JSON schema into a context-free grammar. If you don't want that latency hit (e.g., you're prototyping, or have a use case that uses variable one-off schemas), then you might prefer "strict": false

- You might have a schema that isn't covered by our subset of JSON schema. (To keep performance fast, we don't support some more complex/long-tail features.)

- In JSON mode and Structured Outputs, failures are rarer but more catastrophic. If the model gets too confused, it can get stuck in loops where it just prints technically valid output forever without ever closing the object. In these cases, you can end up waiting a minute for the request to hit the max_token limit, and you also have to pay for all those useless tokens. So if you have a really tricky schema, and you'd rather get frequent failures back quickly instead of infrequent failures back slowly, you might also want "strict": false

But in 99% of cases, you'll want "strict": true.

davidkunz
0 replies
23h59m

Maybe if you can't precisely model your structure with (OpenAI's subset of) JSON schema.

paradite
4 replies
23h15m

Really important update that was not mentioned:

gpt-4o-2024-08-06 has 16,384 tokens output limit instead of 4,096 tokens.

https://platform.openai.com/docs/models/gpt-4o

We don't need the GPT-4o Long Output anymore.

OutOfHere
2 replies
22h54m

But is this also the default or just the max? Is the default 4k or 16k?

Also, the question of the default value applies both at the server level and at the SDK level.

paradite
1 replies
13h41m

Unlike Anthropic, OpenAI models don't require a `max_tokens` setting for API calls, so I assume the max token output limit is automatically applied to API calls.

Otherwise the max token output limit stated on the models page would be meaningless.

floam
0 replies
22h43m

Long Output is 64K though.

behnamoh
4 replies
23h56m

Well, there goes one of the big advantages of open-source models...

For a long time, I was relying on such guaranteed structured outputs as a "secret sauce" that only worked using llama.cpp's GBNF grammars. Now OpenAI has introduced the same concept, just a bit more accessible (since you provide a JSON schema and they convert it to a grammar for you).

Those of you who have used GBNF, do you think it still has any advantage over what OpenAI just announced?

ejones
2 replies
23h40m

FWIW, llama.cpp has always had a JSON schema -> GBNF converter, although it launched as a companion script. Now I think it's more integrated in the CLI and server.

But yeah I mean, GBNF or other structured output solutions would of course allow you to supply formats other than JSON schema. It sounds conceivable, though, that OpenAI could expose the grammars directly in the future.

behnamoh
1 replies
23h22m

I think for certain tasks it's still easier to write the grammar directly. Does converting from a JSON schema to a CFG limit the capabilities of the grammar? i.e., are there things a JSON schema can't represent that a context-free grammar can?

ejones
0 replies
22h12m

You might be right that they're similarly powerful. In some cases, an arbitrary output format might in and of itself be desirable. Like it might result in token savings or be more natural for the LLM. For instance, generating code snippets to an API or plain text with constraints.

And this is more esoteric, but technically in the case of JSON I suppose you could embed a grammar inside a JSON string, which I'm not sure JSON schema can express.

J_Shelby_J
0 replies
21h52m

JSON is a sub-set of what GBNF can do, so there are still advantages to that approach. But even GBNF doesn’t go far enough. Ever try to restrict a model to a single sentence?

```
root ::= " " item{{{min_count},{max_count}}}
item ::= [A-Z] [^\r\n\x0b\x0c\x85\u2028\u2029.?!]+ [a-z] (". " | "? " | "! ")
```

This kinda works if you don't mind no abbreviations, but you can't do something like this with JSON grammars afaik.

simonw
3 replies
23h51m

The price decrease is particularly notable because it represents a 50% cut in the price to handle image inputs, across any OpenAI model.

Previously image inputs on GPT-4o-mini were priced the SAME as GPT-4o, so using mini wouldn't actually save you any money on image analysis.

This new gpt-4o-2024-08-06 model is 50% cheaper than both GPT-4o AND GPT-4o-mini for image inputs, as far as I can tell.

UPDATE: I may be wrong about this. The pricing calculator for image inputs on https://openai.com/api/pricing/ doesn't indicate any change in price for the new model.

jeffharris
1 replies
22h7m

yep image input on the new model is also 50% cheaper

and apologies for the outdated pricing calculator ... we'll be updating it later today

brianjking
0 replies
17h10m

So we can send an image + text to the new structured output model and use the chain of thought schema?

I'm getting an error.

```
openai.BadRequestError: Error code: 400 - {'error': {'message': 'Invalid content type. image_url is only supported by certain models.', 'type': 'invalid_request_error', 'param': 'messages.[1].content.[1].type', 'code': None}}
```

minimaxir
0 replies
23h48m

The calculator doesn't account for the fact that there are now two different prices in a given price matrix.

ramoz
3 replies
22h37m

Can someone explain how this is different/better than the current state of function calling (which I’ve been using to get a consistent json schema response without issue)?

zbyforgotp
0 replies
22h28m

This is guaranteed; function calling without it is not. The old way can work for you, but my experience is different, especially with complex schemas.

jacobsimon
0 replies
22h29m

For starters, the naming is much less confusing. But the behavior also appears to be enforced/validated at some layer (hopefully?), which function calling did not seem to be. I was experimenting with it a couple weeks ago and it would work like 75% of the time but would often give me invalid results for schemas with relatively simple nested objects.

say_it_as_it_is
2 replies
23h0m

This puts like a dozen popular python libraries out of business

fastball
0 replies
16h46m

Once it actually works, sure.

I just tried to use structured outputs with the latest release (openai-python 1.40) and it doesn't think Structured Outputs is a thing.

EDIT: turns out my JSON schema is too large (800 lines + recursive) and seems to be breaking OpenAI's "convert to CFG" step. Whoops.

AStrangeMorrow
0 replies
22h55m

It at least depends on the approach and use case: stuff like outlines (https://github.com/outlines-dev/outlines) that actually changes the sampling to adhere to a grammar and can be used with local/custom models shouldn't be too impacted. Those aren't really used on top of OpenAI models anyway.

elpocko
2 replies
1d

Doesn't the BNF grammar approach in llama.cpp solve this issue in a generic way that should work with any model? Why wouldn't they use that?

ejones
0 replies
23h51m

Similar approach to llama.cpp under the hood - they convert the schema to a grammar. Llama.cpp's implementation was specific to the ggml stack, but what they've built sounds similar to Outlines, which they acknowledged.

HanClinto
0 replies
21h31m

llama.cpp's GBNF grammar is generic, and indeed works with any model.

I can't speak for other approaches, but while llama.cpp's implementation is nice in that it always generates valid grammars token-by-token (and doesn't require any backtracking), it has a downside: with ambiguous grammars (where we're not always sure where we are in the grammar until generation finishes), it keeps all valid parsing option stacks in memory at the same time. This is good for the no-backtracking case, but it adds a (sometimes significant) cost in terms of being rather "explosive" in memory usage (especially with a particularly large or poorly-formed grammar). Creating a grammar that is openly hostile and crashes the inference server is not difficult.

People have done a lot of work to try and address some of the more egregious cases, but the memory load can be significant.

One example of memory optimization: https://github.com/ggerganov/llama.cpp/pull/6616

I'm not entirely sure what other options there are for approaches to take, but I'd be curious to learn how other libraries (Outlines, jsonformer) handle syntax validation.

srcreigh
1 replies
23h51m

The tokens that are valid at the beginning of the output include things like {, {“, {\n, etc. However, once the model has already sampled {“val, then { is no longer a valid token

Oops, this is incorrect. {“val{“:2} is valid json.

(modulo iOS quotes lol)

jhgg
0 replies
23h6m

Valid JSON, sure, but that key does not conform to the schema provided in the example. The LLM must generate valid JSON that also conforms to the provided schema.

munro
1 replies
5h6m

Wow, this is cool -- the first time I've seen generating well-formed output made this easy; the use of Pydantic is clever/easy...

Which has led me to the prior art: using the same Pydantic API to generate structured output from local LLMs

https://github.com/outlines-dev/outlines?tab=readme-ov-file#...
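
For reference, the Outlines pattern being pointed at looks roughly like this (adapted from the project's README; the model name and schema are just examples):

```
from pydantic import BaseModel
from outlines import models, generate

class Order(BaseModel):
    order_id: int
    fulfilled: bool
    delivered: bool

# Any local model supported by the transformers backend should work here.
model = models.transformers("mistralai/Mistral-7B-Instruct-v0.2")
generator = generate.json(model, Order)

order = generator("Extract the order: order 42 was fulfilled but never delivered.")
print(order)  # an Order instance constrained to match the schema
```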

Terretta
0 replies
4h48m

Commercial use of LLMs is one of those spaces where a startup probably needs more than skating ahead of the incumbents on the incumbents' “missing” features as its value prop. Even if the startup has a year or two head start, it may find its beta “OBE” (overcome by events):

https://dottxt.co

That said, Outlines has been making this concept portable for a long time; it's the langchain the community deserves:

https://outlines-dev.github.io/outlines/cookbook/

mariarmestre
1 replies
8h36m

Hmmm... this is not eligible for their zero data retention policy anymore. Not sure how this will go down.

blackcat201
1 replies
17h20m

Do beware on some reasoning tasks: our recent work [0] actually found that it may cause some performance degradation, as well as possible weakening of reasoning, when output is forced into JSON. I really hope they fix this in the latest GPT-4o version.

[0] https://arxiv.org/abs/2408.02442

kiratp
0 replies
16h13m

Thank you! This confirms my intuition!

Structured generation seems counter to every other signal we have that chain of thought etc improves performance.

agtech_andy
1 replies
22h14m

I have had a lot of success using BoundaryML (https://www.boundaryml.com/) for this. They have also been super responsive for any of my questions.

aaronvg
0 replies
21h15m

Thanks for the shoutout! We benchmarked our approach against other function-calling techniques and we've been able to beat all other approaches every time (even by 8%!), just by getting better at parsing the data and representing schemas with fewer tokens, using type definitions instead of JSON schema.

You can take a look at our BFCL results on that site or the github: https://github.com/BoundaryML/baml

We'll be publishing our comparison against OpenAI structured outputs in the next 2 days, and a deeper dive into our results, but we aim to include this kind of constrained generation as a capability in the BAML DSL long-term anyway!

jodacola
0 replies
1d

Amusingly, I immediately thought 9.11 - but in the context of a newer version of software. Ever have those moments where you're so deep in context of some ecosystem that you skip right past the basics, like 9.9 being a larger number than 9.11?

tuan3w
0 replies
17h27m

Just curious whether structured outputs/constrained generation improve model accuracy, e.g. for information extraction problems. Does anyone have experience with this, and if so, why does it help or hurt?

tomduncalf
0 replies
10h55m

Would constraining the tokens that the model can choose from impact its “intelligence”/“creativity” in other (possibly negative) ways?

tipsytoad
0 replies
3h58m

How is this different from instructor? github.com/jxnl/instructor

Namely, why did they take so long for something that just seems like a wrapper around function calling?

thrance
0 replies
17h20m

And you still can't provide a custom grammar to the API...

The company I work for desperately needs the LLM to consistently generate results in a subset of HTML. I was able to craft a small grammar file that does just that, in under 5 minutes, and use it successfully with llama.cpp. Yet there is still no API offering this basic feature that could really benefit everyone.

Instead we have a thousand garbage medium articles with "tips & tricks" on how to prompt the AI to get better results. It's as if people don't care anymore about consistency and reliability.
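
For anyone curious what that looks like locally, a sketch using the llama-cpp-python bindings (API names as I recall them from that library's docs; the grammar and model path are placeholders), restricting output to a tiny HTML subset:

```
from llama_cpp import Llama, LlamaGrammar

# A toy GBNF grammar: only <p> and <b> elements containing plain text.
HTML_SUBSET = r'''
root  ::= block+
block ::= "<p>" text "</p>" | "<b>" text "</b>"
text  ::= [^<>]+
'''

llm = Llama(model_path="model.gguf")  # placeholder path
grammar = LlamaGrammar.from_string(HTML_SUBSET)

out = llm(
    "Summarize the announcement as simple HTML:",
    grammar=grammar,
    max_tokens=256,
)
print(out["choices"][0]["text"])
```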

tarofchaos
0 replies
22h13m

Two years too late. I think we are going through a bozo period at OpenAI where small things are being highlighted as achievements.

sub7
0 replies
19h29m

To "look up all my orders in May that were fulfilled and not delivered" is a 1 line query in every ORM I know of.

Why in the world would I literally pass around a schema file with each request that is (according to the example) 91 lines of code and pay someone for the privilege of maybe doing it correctly?

sansseriff
0 replies
22h53m

Preprocessing new schema takes 'under 10 seconds'. That's... a huge range? Unless the preprocessing time is a small fraction of the inference time, I don't see the point.

I'm working on an app that dynamically generates schemas based on user input (a union of arbitrary types pulled from a library). The resulting schema is often in the 800-token range. Curious how long that would take to preprocess.

roseway4
0 replies
18h49m

We extensively use vLLM's support for Outlines Structured Output with small language models (llama3 8B, for example) in Zep[0][1]. OpenAI's Structured Output is a great improvement on JSON mode, but it is rather primitive compared to vLLM and Outlines.

# Very Limited Field Typing

OpenAI offers a very limited set of types[2] (String, Number, Boolean, Object, Array, Enum, anyOf) without the ability to define patterns and max/min lengths. Outlines supports defining arbitrary RegEx patterns, making extracting currencies, phone numbers, zip codes, comma-separated lists, and more a trivial exercise.

# High Schema Setup Cost / Latency

vLLM and Outlines offer near-zero-cost schema setup: RegEx finite state machine construction is extremely cheap on the first inference call, while OpenAI's context-free grammar generation has a significant latency penalty of "under ten seconds to a minute". This may not impact "warmed-up" inference but could present issues if schemas are more dynamic in nature.

Right now, this feels like a good first step, focusing on ensuring the right fields are present in schema-ed output. However, it doesn't yet offer the functionality to ensure the format of field contents beyond a primitive set of types. It will be interesting to watch where OpenAI takes this.

[0] https://help.getzep.com/structured-data-extraction

[1] https://help.getzep.com/dialog-classification

[2] https://platform.openai.com/docs/guides/structured-outputs/s...

nerdjon
0 replies
1d

I have a bad feeling that this is just going to introduce more shovelware apps that try to shove AI use in without really understanding what they are going to get back.

Yay, I can now ensure the JSON object will look how I want, but let's completely disregard any concern about whether or not the data returned is valuable.

I don't understand why we are already treating these systems as general purpose AI when they are not. (Ok I do understand it, but it is frustrating).

The example given is "look up all my orders in May of last year that were fulfilled but not delivered on time".

First, I have found these models incredibly dumb when it comes to handling time. But even beyond that, if you really are going to do this, I really hope you double-check the data before presenting what you get back as true. And worse, that is just double-checking that what it gives back to you is accurate, not checking whether there is something it isn't telling you about.

Every time I try to experiment with supplying data and asking for data back, they fall flat on their face before we even get to the JSON being formatted properly. That was not the issue that needed solving yet, when it still fundamentally messes up the data, often just returning wrong information. Sometimes it will be right, though, and that is the problem: it may luck out and be right enough times that you gain confidence in it and stop double-checking what it is giving back to you.

I guarantee you someone is going to have a discussion about using this, feeding it data, and then storing the response in a database.

myprotegeai
0 replies
17h16m

I send dynamic JSON schemas to LLMs in manifest [1] to simulate function calls based on function signatures. These structured outputs will be enormously useful.

Other posts claim that you can generate jsonschema-conformant output reliably without this, and while I mostly agree, there is an edge case where gpt4o struggles, and that is simple data types. For example, a string, in jsonschema, has a schema of simply {"type": "string"}, and an example value of "hello world". However, gpt4o would produce something like {"value": "hello world"} with very high probability. I had to include specific few-shot examples of what not to do in order to make this simple case reliable. I suspect there are other non-obvious cases.

1. https://github.com/amoffat/manifest

mugivarra69
0 replies
22h2m

Cohere had this a while ago.

msp26
0 replies
21h50m

Is the JSON actually being fed into the LLM's context or is it still being converted into typescript?

The previous setup didn't allow for custom types, only objects/string/num/bool.

Are the enums put into context or purely used for constrained sampling?

jappgar
0 replies
20h48m

It's nice that they're not making me pay for broken json anymore but touting this as a "feature" is laughable.

It's a bug fix. They should never have been charging for malformed responses in the first place!

irgolic
0 replies
22h49m

Wasn’t this already fully supported in the tool calling API?

hallqv
0 replies
2h49m

They promised us super intelligence - all we got was valid json

franze
0 replies
12h29m

Finally!

At gpt.franzai.com we have been using a three-step check until now (rough sketch below):

- if ChatGPT does not return valid JSON, trim anything before and after the last {}

- if that is still not valid JSON, feed it the full response and ask for valid JSON

- if that is still not valid JSON, put the full response into self-made JSON so we at least get something we can work with, in a "something did not work out" response
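
A rough Python sketch of that fallback chain (`call_model` stands in for whatever performs the completion; all names are made up):

```
import json

def robust_json(call_model, prompt: str) -> dict:
    """Three-step fallback: trim, ask again, then wrap the raw text."""
    raw = call_model(prompt)

    # 1. Trim anything before the first '{' and after the last '}'.
    trimmed = raw[raw.find("{"): raw.rfind("}") + 1]
    try:
        return json.loads(trimmed)
    except ValueError:
        pass

    # 2. Feed the full response back and ask for valid JSON.
    retry = call_model(f"Return the following as valid JSON only:\n{raw}")
    try:
        return json.loads(retry)
    except ValueError:
        pass

    # 3. Wrap the raw text in self-made JSON so callers still get something.
    return {"error": "something did not work out", "raw": raw}
```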

enobrev
0 replies
23h55m

In a startup I was working on last year, I had a surprisingly good experience with using a json-schema in my prompt. I had to tweak the json response a bit because it was always invalid, but the issue was generally a missing colon or misplaced bracket. Data-wise it stuck to the schema very well, and cleaning up the json was simple enough that we got to zero parsing errors. I believe this was with 3.5.

Sadly, that project was a final (relatively successful) attempt at getting traction before the startup was sold and is no longer live.

Edit: Ouch, are the down-votes disbelief? Annoyance? Not sure what the problem is.

damsta
0 replies
22h6m

Can we get something like that in Gemini 1.5 Flash?

brap
0 replies
8h57m

Structured outputs seem like an important step forward for LLMs to make them more useful (this includes intermediate "outputs" that are internal to the model, e.g. producing code and then running it without AI, producing mathematical/logical statements and then proving them without AI, etc). Basically, once you have structure, you can use non-AI tools that are far more efficient and don't hallucinate. These tools can either be external, or embedded in the model itself (think RAG, reasoning, etc.).

I wonder if JSON is just a first step, and eventually this will be generalized to any formal grammar...

blixt
0 replies
7h39m

I was impressed by Microsoft's AICI, where the idea is that a WASM program can choose the next tokens. And, relatedly, their Guidance [1] framework, which can use CFGs and programs during local inference to even speed it up with context-aware token filling. I hope this implies API-based LLMs may be moving in a similar direction.

[1] https://github.com/guidance-ai/guidance

bamboozled
0 replies
17h39m

This is a game changer

armcat
0 replies
9h46m

This is awesome, and it simplifies a lot of my workflows when using their APIs directly. I also want to give a shout out to the Outlines team, https://github.com/outlines-dev/outlines; they've been doing structured outputs for the last 12 months, and their open source lib can be applied across all open-weight and closed/API-based models. Most likely Outlines heavily inspired the OpenAI team; maybe they even used some of its codebase.

adagradschool
0 replies
23h1m

While text and image generation are getting cheaper at a significant rate, audio still seems to be just as expensive with ElevenLabs. I wonder why it is so.

MeetBagelPPML
0 replies
49m

This is super cool!

If you're looking to get in early on a free tool that allows you to finetune your model using encrypted data sets you can sign up here: https://waitlist.bagel.net/

MattDaEskimo
0 replies
21h58m

"We have ripped code from a bunch of open-source variations and slapped it behind our brutally abstracted API.

Interoperable with other external models like the open source versions? What, are you mad?"

LAC-Tech
0 replies
19h55m

Good to see JSON Schema being more widely adopted. I remember doing a project a few years ago in XML just because XML schemas were everywhere and JSON ones were still not really used.

Havoc
0 replies
10h11m

Surprised to see OpenAI staff in comments. I don’t recall that being the case on previous threads. (Hello!)

Brosper
0 replies
19h47m

I think they would like to have something like artifacts in Claude