HN comments for: RouteLLM: A framework for serving and evaluating LLM routers

furyofantares

11 replies

4d16h

2024-07-10 01:45:26 UTC

I don't really get who these are for - do people use them in their projects?

I don't find success just using a prompt against some other model without having some way to evaluate it and usually updating it for that model.

veb

5 replies

4d16h

2024-07-10 02:05:07 UTC

From what I understand, it's from people using it in their workflows - say, Claude but keep hitting the rate limits, so they have to wait until Claude says "you got 10 messages left until 9pm", so when they hit that, or before they switch to (maybe) ChatGPT manually.

With the router thingy, it keeps a record, so you know every query where you stand, and can switch to another model automatically instead of interrupting workflow?

I may be explaining this very badly, but I think that's one use-case for how these LLM Routers help.

Kiro

3 replies

4d9h

2024-07-10 09:11:22 UTC

I don't think that's a use case since you don't get rate limited when using the API.

Onawa

1 replies

4d5h

2024-07-10 13:13:37 UTC

We get rate limited when using Azure's OpenAI API. As a gov contractor working with AI, I have limited means for getting access to frontier LLMs. So routing tools that can fail over to another model can be useful.

fkyoureadthedoc

0 replies

4d4h

2024-07-10 14:29:26 UTC

Same. Initially we just load balanced between various regions, ultimately bought some PTUs.

kordlessagain

0 replies

4d4h

2024-07-10 14:13:30 UTC

Anthropic Build Tier 4: 4,000 RPM, 400,000 TPM, 50,000,000 TPD for Claude 3.5 Sonnet

PiRho3141

0 replies

4d15h

2024-07-10 02:59:33 UTC

This is for applications that use LLMs or Chat GPT via API.

vatican_banker

0 replies

4d16h

2024-07-10 02:31:50 UTC

Trained routers are provided out of the box, which we have shown to reduce costs by up to 85%

The answer is here. This is a cost-saving tool.

All companies and their moms want to be in the GenAI game but have strict budgets. Tools like this help to keep GenAI projects within budget.

rodrigobahiense

0 replies

4d11h

2024-07-10 06:52:31 UTC

For the company I work for, one of the most important aspects is ensuring we can fallback to different models in case of content filtering since they are not equally sensitive/restrict.

monarchwadia

0 replies

4d11h

2024-07-10 06:43:55 UTC

I think a lot of people are just interested in hitting the LLM without any bells or whistles, from Typescript. A low level connector lib would come in handy, yeah? https://github.com/monarchwadia/ragged

killerstorm

0 replies

3d9h

2024-07-11 09:32:23 UTC

Yeah. I can't even consistently get JSON output out of all models. What are people doing that they don't care about output format?...

brandall10

0 replies

4d16h

2024-07-10 02:04:48 UTC

You may have a variety of model types/sizes, fine tunes, etc, that serve different purposes - optimizing for cost/speed/specificity of task. At least that's the general theory with routing. This one only seems to optimize for cost/quality.

vatican_banker

5 replies

4d16h

2024-07-10 02:34:22 UTC

The tool currently allows only one set of strong and weak models.

I’d be really good to allow more than two models and change dynamically based on multiple constraints like latency, reasoning complexity, costs, etc.

voiper1

1 replies

4d12h

2024-07-10 06:12:39 UTC

I think unify.ai (like openrouter) does that - it has several paramters you can choose from.

But the underlying "how to choose a model that's smart enough but not too smart" seems difficult to understand.

TechDebtDevin

0 replies

4d10h

2024-07-10 08:07:30 UTC

Its just sentiment analysis.

Oras

0 replies

4d9h

2024-07-10 09:26:03 UTC

Portkey does that with configuration. You assign a base model, then add more models with weight to load-balance.

ModelBox

0 replies

2d5h

2024-07-12 13:27:13 UTC

actually ModelBox(model.box) offers that, the autorouter function can dynamically switch to different models according to latency, geo-position and costs.

KTibow

0 replies

4d15h

2024-07-10 02:52:31 UTC

Some of that is already possible, since it can generate a difficulty score for a prompt that could be manually mapped between models based on ranges.

tananaev

5 replies

4d15h

2024-07-10 02:50:31 UTC

The problem is to understand how complex the request is, you have to use a smart enough model.

ethegwo

0 replies

4d15h

2024-07-10 03:04:01 UTC

The weak-to-strong assumption is that it is easier to eval the result of a task than to generate it. If it is wrong, human can not make a stronger intelligence than us.

PiRho3141

0 replies

4d15h

2024-07-10 02:57:17 UTC

Not true. You can easily train a BERT single class classification model without having to train an LLM.

Grimblewald

0 replies

4d14h

2024-07-10 04:33:13 UTC

not true at all, you could have a efficient cheap model which is generally terrible at most things but has a savant like capacity for categorizing tasks by requirement and difficulty. Even easier when you dont need to support multiple languages and a truly staggering breadth of domains, like a conventional llm does. You could train a really small model to reject out of domain requests and partition the rest, running at a fraction of the cost of a more capable model.

Fripplebubby

0 replies

4d3h

2024-07-10 15:03:44 UTC

In this paper, they tried a couple different methods for determining how similar the incoming request is to requests that they have scored in their dataset. Actually, one of the best methods they used does not involve using a model at all to evaluate the incoming query (similarity-weighted ranking) although it _does_ use pre-trained embeddings.

Using this, they were able to produce quite good results applying this similarity measurement to unseen queries using a standard benchmark. The leap of faith here is assuming that the same query similarity method will continue to bear fruit when extended to queries that aren't benchmarkable.

CuriouslyC

0 replies

4d15h

2024-07-10 02:58:33 UTC

You can distill evaluation ability

fbnbr

2 replies

4d4h

2024-07-10 14:39:27 UTC

This RouteLLM framework sounds really promising, especially for cost optimization. It reminds me of the KNN-router project ([https://github.com/pulzeai-oss/knn-router](https://github.co...), which uses a k-nearest neighbors approach to route queries to the most appropriate models.

What I like about these kinds of solutions is that they address the practical challenges of using multiple LLMs. Rate limits, cost per token, and even just choosing the right model for the job can be a real headache.

KNN-router, for example, lets you define your own logic for routing queries, so you can factor in things like model accuracy, response time, and cost. You can even set up fallback models for when your primary model is unavailable.

It's cool to see these kinds of tools emerging because it shows that people are starting to think seriously about how to build robust, cost-effective LLM pipelines. This is going to be crucial as more and more companies start incorporating LLMs into their products and services.

antupis

0 replies

3d12h

2024-07-11 06:07:43 UTC

Cost is a plus but at least what I see is that getting good response time is even bigger. Something like OpenAI Azure instances are inconsistent and it is far too normal to get a 40sec lag with responses with gpt4-o.

Terretta

0 replies

3d7h

2024-07-11 10:58:14 UTC

Addendum to parent to make link clickable:

https://github.com/pulzeai-oss/knn-router

// HN doesn't handle squared circle as MD.

daghamm

1 replies

4d10h

2024-07-10 08:25:43 UTC

My take from this is that 85% of times we don't need a powerfull LLM like 4o.

Or am I reading this wrong? :)

thomashop

0 replies

4d9h

2024-07-10 08:55:53 UTC

You're reading it right. They have developed a system that automatically decides which model is sufficient, depending on your inputs, saving you costs even within one conversation stream.

The OpenAI-compatible API allows you to talk to the router like a regular GPT model.

bangaladore

1 replies

4d1h

2024-07-10 16:49:07 UTC

I've been using OpenRouter only for personal use, not for its router functionality, so I can use the API of various models (or open-source models) without signing up and prepaying/paying a subscription on all their websites.

I believe OpenRouter also provides an API that does the same thing as RouteLLM. Again, you only have to pay OpenRouter, not every model's service you use.

localfirst

0 replies

3d22h

2024-07-10 20:25:43 UTC

OpenRouter is also interesting solution but I almost end up using like one or two LLMs and I rarely feel the need to switch between different LLMs so I ask why I am even using openrouter in the first place.

worstspotgain

0 replies

4d16h

2024-07-10 01:44:26 UTC

I like their "LLM isovalue" graph, and the idea that different vendors can be forced to partake in the same synergy/distillation scheme. Vendors dislike these schemes, but they're probably OK with them as long as they're niche.

localfirst

0 replies

3d22h

2024-07-10 20:20:53 UTC

solution for a non-critical problem imho

im open to differing opinions but after dealing with langchain, premature optimization for non-critical problems is rampant in this space rn

TZubiri

0 replies

3d14h

2024-07-11 04:03:38 UTC

Or just use a single LLM provider.

Problem solved, next.

PetrBrzyBrzek

0 replies

3d22h

2024-07-10 20:05:13 UTC

There is a similar project called NotDiamond, which is available on Hugging Face: https://huggingface.co/notdiamond/notdiamond-0001.

Havoc

0 replies

4d9h

2024-07-10 09:00:12 UTC

Interesting that it is generalizable to other pairs. That implies some sort of prompt property or characteristic that could be widely used.

I don’t think using different models is the right approach though. They behave differently. Better to use a big and small one from same family. Or alternatively using this to drive whether to give the ai more “thinking time” via chain of thought or agents.