One straightforward way to get started is to understand embeddings without any AI/deep learning magic. Just pick a vocabulary of words (say, some 50k words), pick a unique index between 0 and 49,999 for each of the words, and then produce an embedding by adding +1 to the given index for a given word each time it occurs in a text. Then normalize the embedding so it sums to one.
Presto -- embeddings! And you can use cosine similarity with them and all that good stuff and the results aren't totally terrible.
The rest of "embeddings" builds on top of this basic strategy (smaller vectors, filtering out words/tokens that occur frequently enough that they don't signify similarity, handling synonyms or words that are related to one another, etc. etc.). But stripping out the deep learning bits really does make it easier to understand.
Those would really just be identifiers. I think the key property of embeddings is that the dimensions each individually mean/measure something, and therefore the dot product of two embeddings (similarity of direction of the vectors) is a meaningful similarity measure of the things being represented.
The classic example is word embeddings such as word2vec or GloVe, where, due to the embeddings being meaningful in this way, one can see vector relationships such as "man - woman" = "king - queen".
In this case each dimension is the presence of a word in a particular text. So when you take the dot product of two texts you are effectively counting the number of words the two texts have in common (subject to some normalization constants, depending on how you normalize the embedding). Cosine similarity still works even for these super naive embeddings, which makes it slightly easier to understand before getting into any mathy stuff.
You are 100% right this won't give you the word embedding analogies like king - man + woman ≈ queen or stuff like that. This embedding has no concept of relationships between words.
But that doesn't seem to be what you are describing in terms of using incrementing indices and adding occurrence counts.
If you want to create a bag of words text embedding then you set the number of embedding dimensions to the vocabulary size and the value of each dimension to the global count of the corresponding word.
Heh -- my explanation isn't the clearest I realize, but yes, it is BoW.
Eg fix your vocab of 50k words (or whatever) and enumerate it.
Then to make an embedding for some piece of text:

1. Initialize an all-zero vector of size 50k.

2. For each word in the text, add one at the index of the corresponding word (per our enumeration). If the word isn't in the 50k words in your vocabulary, discard it.

3. (Optionally) normalize the embedding to length 1 (though you don't really need this and can leave it off for the toy example).
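A minimal sketch of those steps in Python, with a toy vocabulary and made-up sentences just to show the mechanics (normalizing to unit length here so the dot product is exactly cosine similarity):

    import numpy as np

    vocab = ["cat", "dog", "sat", "mat", "ran"]        # stand-in for the 50k-word vocabulary
    index = {w: i for i, w in enumerate(vocab)}        # the enumeration

    def bow_embedding(text):
        vec = np.zeros(len(vocab))                     # 1. all-zero vector of size |vocab|
        for word in text.lower().split():
            if word in index:                          # 2. count in-vocab words, discard the rest
                vec[index[word]] += 1
        norm = np.linalg.norm(vec)
        return vec / norm if norm else vec             # 3. (optional) normalize to unit length

    a = bow_embedding("the cat sat on the mat")
    b = bow_embedding("the dog sat on the mat")
    print(np.dot(a, b))                                # cosine similarity, since both are unit length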
This is not the best way to understand where modern embeddings are coming from.
True, but what is the best way?
Are you talking about sentence/text chunk embeddings, or just embeddings in general?
If you need high quality text embeddings (e.g. to use with a vector DB for text chunk retrieval), then they are going to come from the output of a language model, either a local one or using an embeddings API.
Other embeddings are normally going to be learnt in end-to-end fashion.
I disagree. In most subjects, recapitulating the historical development of a thing helps motivate modern developments. E.g.:
1. Start with bag of words. Represent each word as a vector of all zeros except for a one at that word's index. Then a document is the sum (or average) of all the words in the document. We now have a method of embedding a variable length piece of text into a fixed size vector, and we start to see how "similar" is approximately "close", though clearly there are some issues. We're somewhere at the start of NLP now.
2. One big issue is that there are a lot of common noisy words (like "good", "think", "said", etc.) that can make the embeddings more similar than we feel they should be. So now we develop strategies for reducing the impact of those words on our vector. Remember how we just summed up the individual word vectors in (1)? Now we'll scale each word vector by its frequency, so that the more frequent the word is in our corpus, the smaller we'll make the corresponding word vector. That brings us to tf-idf embeddings.
3. Another big issue is that our representation of words doesn't capture word similarity at all. The sentences "I hate when it rains" and "I dislike when it rains" should be more similar than "I hate when it rains" and "I like when it rains", but with our embeddings from (2) the similarity of the two pairs is going to be about the same. So now we revisit our method of constructing word vectors and start to explore ways to "smear" words out. This is where things like word2vec and GloVe pop up as methods of creating distributed representations of words. Now we can represent documents by summing/averaging/tf-idfing our word vectors the same as we did in (2). (There's a code sketch of steps 1-3 after this list.)
4. Now we notice there is an issue where words can have multiple meanings depending on their surrounding context. Think of things like irony, metaphor, humor, etc. Consider "She rolled her eyes and said, 'Don't you love it here?'" and "She rolled the dough and said, 'Don't you love it here?'". Odds are, the similarity per (3) is going to be pretty high, despite the fact that it's clear these are wildly different meanings. The issue is that our model in (3) just uses a static operation for combining our words, and because of that we aren't capturing the fact that "Don't you love it here" shouldn't mean the same thing in the first and second sentences. So now we start to consider ways in which we can combine our word vectors differently and let the context affect the way in which we combine them.
5. And that brings us to now, where we have a lot more compute than we did before and access to way bigger corpora, so we can do some really interesting things, but it's all still the basic steps of breaking down text into its constituent parts, representing those numerically, and then defining a method to combine the various parts to produce a final representation for a document. The above steps greatly help by showing the motivation for each change and understanding why we do the things we do today.
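To make (1)-(3) concrete, here's a rough sketch with toy data; the random vectors in step 3 are just stand-ins for trained word2vec/GloVe vectors, since the point is only the combining operation:

    import numpy as np

    docs = ["i hate when it rains", "i dislike when it rains", "i like when it rains"]
    vocab = sorted({w for d in docs for w in d.split()})
    idx = {w: i for i, w in enumerate(vocab)}

    # 1. Bag of words: each document is the sum of one-hot word vectors.
    bow = np.zeros((len(docs), len(vocab)))
    for d, doc in enumerate(docs):
        for w in doc.split():
            bow[d, idx[w]] += 1

    # 2. tf-idf: scale down words that show up in many documents.
    df = (bow > 0).sum(axis=0)              # in how many documents does each word appear?
    tfidf = bow * np.log(len(docs) / df)    # words appearing in every document get weight 0 here

    # 3. Distributed word vectors: swap one-hots for dense vectors, then average.
    rng = np.random.default_rng(0)
    word_vecs = {w: rng.normal(size=50) for w in vocab}   # pretend these came from word2vec
    doc_vecs = np.array([np.mean([word_vecs[w] for w in d.split()], axis=0) for d in docs])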
They're not, I get why you think that though.
They're making a vector for a text that's the term frequencies in the document.
It's one step simpler than tfidf which is a great starting point.
Are you saying it's pure chance that operations like "man - woman" = "king - queen" (and many, many other similar relationships and analogies) work?
If not please explain this comment to those of us ignorant in these matters :)
3Blue1Brown has some other examples in his videos about transformers; most notable, I think, is that Hitler - Germany + Italy ≈ Mussolini!
https://www.3blue1brown.com/lessons/gpt
It’s not pure chance that the above calculus shakes out, but it doesn’t have to be that way. If you are embedding on a word-by-word level then it can happen; if the units are a little smaller or larger than words, it’s not immediately clear what the calculation is doing.
But the main difference here is you get 1 embedding for the document in question, not an embedding per word like word2vec. So it’s something more like “document about OS/2 warp” - “wiki page for ibm” + “wiki page for Microsoft” = “document on windows 3.1”
OK, sounds counter-intuitive, but I'll take your word for it!
It seems odd since the basis of word similarity captured in this type of way is that word meanings are associated with local context, which doesn't seem related to these global occurrence counts.
Perhaps it works because two words with similar occurrence counts are more likely to often appear close to each other than two words where one has a high count, and another a small count? But this wouldn't seem to work for small counts, and anyways the counts are just being added to the base index rather than making similar-count words closer in the embedding space.
Do you have any explanation for why this captures any similarity in meaning?
Ah I think I see the confusion here. They are describing creating an embedding of a document or piece of text. At the base, the embedding of a single word would just be a single 1. There is absolutely no help with word similarity.
The problem of multiple meanings isn't solved by this approach at all, at least not directly.
Talking about the "gravity of a situation" in a political piece makes the text a bit more similar to physics discussions about gravity. But most of the words won't match as well, so your document vector is still more similar to other political pieces than physics.
Going up the scale, here's a few basic starting points that were (are?) the backbone of many production text AI/ML systems.
1. Bag of words. Here your vector has a 1 for words that are present, and 0 for ones that aren't.
2. Bag of words with a count. A little better, now we've got the information that you said "gravity" fifty times not once. Normalise it so text length doesn't matter and everything fits into 0-1.
3. TF-IDF. It's not very useful to know that you said a common word a lot. Most texts do; what we care about is texts that say it more than you'd expect, so we take into account how often the words appear in the entire corpus. (Quick sketch at the end of this comment.)
These don't help with words, but given how simple they are they are shockingly useful. They have their stupid moments, although one benefit is that it's very easy to debug why they cause a problem.
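If you want to poke at these, here's a quick scikit-learn sketch of the three variants (toy corpus; TfidfVectorizer handles the normalisation for you):

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    docs = ["gravity of the situation", "gravity bends spacetime", "the situation escalated"]

    bow_binary = CountVectorizer(binary=True).fit_transform(docs)  # 1. word present or not
    bow_counts = CountVectorizer().fit_transform(docs)             # 2. raw counts (normalise if you like)
    tfidf = TfidfVectorizer().fit_transform(docs)                  # 3. counts reweighted by corpus rarity

    print(tfidf.toarray().round(2))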
I'm trying to understand this approach. Maybe I am expecting too much out of this basic approach, but how does this create a similarity between words with indices close to each other? Wouldn't it just be a popularity contest - the more common words have higher indices and vice versa? For instance, "king" and "prince" wouldn't necessarily have similar indices, but they are semantically very similar.
It doesn't even work as described for popularity - one word starts at 49,999 and one starts at 0.
Yeah, that is a poorly written description. I think they meant that each word gets a unique index location into an array, and the value at that word's index location is incremented whenever the word occurs.
Maybe the idea is to order your vocabulary into some kind of “semantic rainbow”? Like a one-dimensional embedding?
You are expecting too much out of this basic approach. The "simple" similarity search in word2vec (used in https://semantle.com/ if you haven't seen it) is based on _multiple_ embeddings like this one (it's a simple neural network not a simple embedding).
King doesn’t need to appear commonly with prince. It just needs to appear in the same context as prince.
It also leaves out the old “tf idf” normalization of considering how common a word is broadly (less interesting) vs in that particular document. Kind of like a shittier attention. Used to make a big difference.
It's a document embedding, not a word embedding.
This is a simple example where it scores their frequency. If you scored every word by their frequency only you might have embeddings like this:
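(Made-up numbers, purely to illustrate the shape:)

    the: 0.058, and: 0.027, king: 0.0004, prince: 0.0001, spacetime: 0.000002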
That's a very simple 1D embedding, and like you said it would only give you popularity. But say you wanted other stuff too, like its vulgarity, prevalence over time, whether it's slang or not, how likely it is to start or end a sentence, etc. You would need more than one number. In text-embedding-ada-002 there are 1536 numbers in the array (vector), so it's more like a long list of opaque scores.

The numbers don't mean anything in and of themselves. The values don't represent qualities of the words; they're just numbers in relation to others in the training data. They're different numbers in different training data because all the words are scored in relation to each other, like a graph. So when you compute them, you arrive at words and meanings in the training data the way you would arrive at a point in a coordinate space if you subtracted one [x,y,z] from another [x,y,z] in 3D.

So the rage about a vector DB is that it's a database for arrays of numbers (vectors), designed and optimized for computing them against each other, instead of, say, SQL or NoSQL, which are all about retrieval etc.
So king vs prince etc. - when you take into account the 1536 numbers, you can imagine how, compared to other words in the training data, they would actually be similar, always used in the same kinds of ways, and indeed semantically similar - you'd be able to "arrive" at that fact, and arrive at antonyms, synonyms, their French alternatives, etc., but the system doesn't "know" that stuff. Throw in Burger King training data and talk about French fries a lot, though, and you'd mess up the embeddings when it comes to arriving at the French version of a king! You might get "pomme de terre".
Is that really an embedding? I normally think of an embedding as an approximate lower-dimensional matrix of coefficients that operate on a reduced set of composite variables that map the data from a nonlinear to linear space.
You're right that what I described isn't what people commonly think of as embeddings (given how far we've moved past the above description), but broadly an embedding is anything (in NLP at least) that maps text into a fixed length vector. When you make embeddings like this, the nice thing is that cosine similarity has an easy to understand meaning: count the number of words two documents have in common (subject to some normalization constant).
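Concretely, for binary bag-of-words vectors A and B, cos(A, B) = |A ∩ B| / sqrt(|A| · |B|): the count of shared words, divided by a normalization term based on how many distinct words each document has. With raw counts instead of 0/1 entries the numerator becomes a weighted overlap, but the intuition is the same.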
Most fancy modern embedding strategies basically start with this and then proceed to build on top of it to reduce dimensions, represent words as vectors in their own right, pass this into some neural layer, etc.
A lot of people here are trying to describe to you that no, this is not at all the starting point of modern embeddings. This has none of the properties of embeddings.
What you're describing is an idea from the 90s that was a dead end. Bag of words representations.
It has no relationship to modern methods. It's based on totally different theory (bow instead of the distributional hypothesis).
There is no conceptual or practical path from what you describe to what modern embeddings are. It's horribly misleading.
Eh, I disagree. When I began working in ML everything was about word2vec and glove and the state of the art for embedding documents was adding together all the word embeddings and it made no sense to me but it worked.
Learning about BoW and simple ways of converting text to fixed length vectors that can be used in ML algos clarified a whole lot for me, especially the fact that embeddings aren’t magic, they are just a way to convert text to a fixed length vector.
BoW and tf-idf vectors are still workhorses for routine text classification tasks despite their limitations, so they aren’t really a dead end. Similarly, a lot of things that follow BoW make a whole lot more sense if you think of them as addressing limitations of BoW.
Well, you've fooled yourself into thinking you understand something when you don't. I say this as someone with a PhD in the topic, who has taught many students, and published dozens of papers in the space.
The operation of adding BoW vectors together has nothing to do with the operation of adding together word embeddings. Well, aside from both nominally being addition.
It's like saying you understand what's happening because you can add velocity vectors and then you go on to add the binary vectors that represent two binary programs and expect the result to give you a program with the average behavior of both. Obviously that doesn't happen, you get a nonsense binary.
They may both be arrays of numbers but mathematically there's no relationship between the two. Thinking that there's a relationship between them leads to countless nonsense conclusions: the idea that you can keep adding word embeddings to create document embeddings like you keep adding BoWs, the notion that average BoWs mean the same thing as average word embeddings, the notion that normalizing BoWs is the same as normalizing word embeddings and will lead to the same kind of search results, etc. The errors you get with BoWs are totally different from the errors you get with word or sentence or document embeddings. And how you fix those errors is totally different.
No. Nothing at all makes sense about word embeddings from the point of BoW.
Also, yes BoW is a total dead end. They have been completely supplanted. There's never any case where someone should use them.
There is no conceptual or practical path from what you describe to what modern embeddings are.
There certainly is. At least there is a strong relation between bag-of-words representations and methods like word2vec. I am sure you know all of this, but I think it's worth expanding a bit on this, since the top-level comment describes things in a rather confusing way.
In traditional Information Retrieval, two kinds of vectors were typically used: document vectors and term vectors. If you make a |D| x |T| matrix (where |D| is the number of documents and |T| is the number of terms that occur across all documents), we can go through a corpus and note in each |T|-length row, for a particular document, the frequency of each term in that document (frequency here means the raw counts or something like TF-IDF). Each row is a document vector, each column a term vector. The cosine similarity between two document vectors will tell you whether two documents are similar, because similar documents are likely to have similar terms. The cosine similarity between two term vectors will tell you whether two terms are similar, because similar terms tend to occur in similar documents. The top-level comment seems to have explained document vectors in a clumsy way.
Over time (we are talking 70s-90s), people found that term vectors did not really work well, because documents are often too coarse-grained as context. So, term vectors were redefined via |T| x |T| matrices where, if you have such a matrix C, C[i][j] contains how often the j-th term occurs in the context of the i-th term. Since this type of matrix is not bound to documents, you can choose the context size based on the goals you have in mind. For instance, you could count only terms that are within a distance of 10 tokens of the occurrences of term i.
One refinement is that rather than raw frequencies, we can use some other measure. One issue with raw frequencies is that a frequent word like "the" will co-occur with pretty much every word, so its frequency in the term vector is not particularly informative, but its large frequency will have an outsized influence on e.g. dot products. So, people would typically use pointwise mutual information (PMI) instead. It's beyond the scope of a comment to explain PMI fully, but intuitively you can think of the PMI of two words to mean: how much more often do the words co-occur than chance? This will result in a low PMI for e.g. PMI(information, the) but a high PMI for PMI(information, retrieval). Then it's also common practice to replace negative PMI values by zero, which leads to PPMI (positive PMI).
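(For reference, the formula is PMI(w, c) = log( P(w, c) / (P(w) · P(c)) ), with the probabilities estimated from corpus counts, and PPMI(w, c) = max(PMI(w, c), 0).)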
So, what do we have now? A |T|x|T| matrix with PPMI scores, where each row (or column) can be used as a word vector. However, it's a bit unwieldy, because the vectors are large (|T|) and typically somewhat sparse. So people started to apply dimensionality reduction, e.g. by applying Singular Value Decomposition (SVD, I'll skip the details here of how to use it for dimensionality reduction). So, suppose that we use SVD to reduce the vector dimensionality to 300, we are left with a |T|x300 matrix and we finally have dense vectors, similar to e.g. word2vec.
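A compressed sketch of that counts → PPMI → SVD pipeline, using a tiny made-up co-occurrence matrix in place of one counted from a real corpus:

    import numpy as np

    # counts[i][j] = how often the j-th term occurs in the context of the i-th term (toy numbers)
    counts = np.array([[0., 4., 1.],
                       [4., 0., 2.],
                       [1., 2., 0.]])

    p_wc = counts / counts.sum()                 # joint probabilities
    p_w = p_wc.sum(axis=1, keepdims=True)        # marginal over contexts
    p_c = p_wc.sum(axis=0, keepdims=True)        # marginal over terms

    with np.errstate(divide="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    ppmi = np.maximum(pmi, 0.0)                  # clamp negatives (and the -inf from zero counts) to 0

    # Dimensionality reduction: keep only the top-k components of the SVD.
    U, S, Vt = np.linalg.svd(ppmi)
    k = 2
    word_vectors = U[:, :k] * S[:k]              # dense k-dimensional word vectors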
Now, the interesting thing is that people have found that word2vec's skipgram with negative sampling (SGNS) is implicitly factorizing a PMI-based word-context matrix [1], exactly like the IR folks were doing before. Conversely, if you matrix-multiply the word and context embedding matrices that come out of word2vec SGNS, you get an approximation of the |T|x|T| PMI matrix (or |T|x|C| if a different vocab is used for the context).
Summarized, there is a strong conceptual relation between bag-of-word representations of old days and word2vec.
Whether it's an interesting route didactically for understanding embeddings is up for debate. It's not like the mathematics behind word2vec are complex (understanding the dot product and the logistic function goes a long way) and understanding word2vec in terms of 'neural net building blocks' makes it easier to go from word2vec to modern architectures. But in an exhaustive course about word representations, it certainly makes sense to link word embeddings to prior work in IR.
[1] https://proceedings.neurips.cc/paper_files/paper/2014/file/f...
How does this enable cosine similarity usage? I don't get the link between incrementing a word's index by its count in a text and how this ends up with words that have similar meanings having a high cosine similarity value.
I think they are talking about bag-of-words. If you apply a dimensionality reduction technique like SVD or even random projection on bag-of-words, you can effectively create a basic embedding. Check out latent semantic indexing / latent semantic analysis.
You're right, that approach doesn't enable getting embeddings for an individual word. But it would work for comparing similarity of documents - not that well of course, but it's a toy example that might feel more intuitive
Really appreciate you explaining this idea, I want to try this! It wasn't clear to me until I read the discussion that you meant that you'd have similarity of entire documents, not among words.
Yes! And that’s an oversight on my part — word embeddings are interesting but I usually deal with documents when doing nlp work and only deal with word embeddings when thinking about how to combine them into a document embedding.
Give it a shot! I’d grab a corpus like https://scikit-learn.org/stable/datasets/real_world.html#the... to play with and see what you get. It’s not going to be amazing, but it’s a great way to build some baseline intuition for nlp work with text that you can do on a laptop.
Aren't you just describing a bag-of-words model?
https://en.wikipedia.org/wiki/Bag-of-words_model
Yes! And the follow up that cosine similarity (for BoW) is a super simple similarity metric based on counting up the number of words the two vectors have in common.
Embeddings must be trained, otherwise they don't have any meaning, and are just random numbers.
I think that strips away way too much. What you describe is “counting words”. It produces 50,000-dimensional vectors (with most components zero for the vast majority of texts) for each text, so it’s not a proper embedding.
What makes embeddings useful is that they do dimensionality reduction (https://en.wikipedia.org/wiki/Dimensionality_reduction) while keeping enough information to keep dissimilar texts away from each other.
I also doubt your claim “and the results aren't totally terrible”. In most texts, the dimensions with highest values will be for very common words such as “a”, “be”, etc (https://en.wikipedia.org/wiki/Most_common_words_in_English)
A slightly better simple view of how embeddings can work in search is by using principal component analysis. If you take a corpus, compute TF-IDF vectors (https://en.wikipedia.org/wiki/Tf–idf) for all texts in it, then compute the n ≪ 50,000 top principal components of the set of vectors and then project each of your 50,000-dimensional vectors on those n vectors, you’ve done the dimension reduction and still, hopefully, are keeping similar texts close together and distinct texts far apart from each other.
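A sketch of that pipeline with scikit-learn; TruncatedSVD skips the mean-centering of strict PCA, which is the usual compromise for sparse tf-idf matrices (this is essentially latent semantic analysis), and the corpus here is made up:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "the senate debated the budget bill",
        "lawmakers passed the budget bill",
        "the probe measured gravity near jupiter",
    ]

    tfidf = TfidfVectorizer().fit_transform(docs)                 # |D| x |T| sparse matrix
    reduced = TruncatedSVD(n_components=2).fit_transform(tfidf)   # |D| x n dense vectors, n << |T|

    print(cosine_similarity(reduced))  # the two budget documents should land closer to each other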