One straightforward way to get started is to understand embeddings without any AI/deep learning magic. Just pick a vocabulary of words (say, some 50k words), pick a unique index between 0 and 49,999 for each of the words, and then produce an embedding by adding +1 to the given index for a given word each time it occurs in a text. Then normalize the embedding so it sums to one.
Presto -- embeddings! And you can use cosine similarity with them and all that good stuff and the results aren't totally terrible.
The rest of "embeddings" builds on top of this basic strategy (smaller vectors, filtering out words/tokens that occur frequently enough that they don't signify similarity, handling synonyms or words that are related to one another, etc. etc.). But stripping out the deep learning bits really does make it easier to understand.
Those would really just be identifiers. I think the key property of embeddings is that the dimensions each individually mean/measure something, and therefore the dot product of two embeddings (similarity of direction of the vectors) is a meaningful similarity measure of the things being represented.
The classic example is word embeddings such as word2vec or GloVe, where, due to the embeddings being meaningful in this way, one can see vector relationships such as "man - woman" = "king - queen".
In this case each dimension is the presence of a word in a particular text. So when you take the dot product of two texts you are effectively counting the number of words the two texts have in common (subject to some normalization constants, depending on how you normalize the embedding). Cosine similarity still works even for these super naive embeddings, which makes it slightly easier to understand before getting into any mathy stuff.
You are 100% right this won't give you the word embedding analogies like king - man + woman ≈ queen or stuff like that. This embedding has no concept of relationships between words.
But that doesn't seem to be what you are describing in terms of using incrementing indices and adding occurrence counts.
If you want to create a bag of words text embedding then you set the number of embedding dimensions to the vocabulary size and the value of each dimension to the global count of the corresponding word.
Heh -- my explanation isn't the clearest I realize, but yes, it is BoW.
Eg fix your vocab of 50k words (or whatever) and enumerate it.
Then to make an embedding for some piece of text:

1. Initialize an all-zero vector of size 50k.

2. For each word in the text, add one at the index of the corresponding word (per our enumeration). If the word isn't in the 50k words in your vocabulary, discard it.

3. (Optionally) normalize the embedding to length 1 (though you don't really need this and can leave it off for the toy example).
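A minimal sketch of those steps in Python, with a toy vocabulary and made-up sentences just to show the mechanics (normalizing to unit length here so the dot product is exactly cosine similarity):

    import numpy as np

    vocab = ["cat", "dog", "sat", "mat", "ran"]        # stand-in for the 50k-word vocabulary
    index = {w: i for i, w in enumerate(vocab)}        # the enumeration

    def bow_embedding(text):
        vec = np.zeros(len(vocab))                     # 1. all-zero vector of size |vocab|
        for word in text.lower().split():
            if word in index:                          # 2. count in-vocab words, discard the rest
                vec[index[word]] += 1
        norm = np.linalg.norm(vec)
        return vec / norm if norm else vec             # 3. (optional) normalize to unit length

    a = bow_embedding("the cat sat on the mat")
    b = bow_embedding("the dog sat on the mat")
    print(np.dot(a, b))                                # cosine similarity, since both are unit length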
This is not the best way to understand where modern embeddings are coming from.
True, but what is the best way?
Are you talking about sentence/text chunk embeddings, or just embeddings in general?
If you need high quality text embeddings (e.g. to use with a vector DB for text chunk retrieval), then they are going to come from the output of a language model, either a local one or using an embeddings API.
Other embeddings are normally going to be learnt in end-to-end fashion.
I disagree. In most subjects, recapitulating the historical development of a thing helps motivate modern developments. E.g.:
1. Start with bag of words. Represent each word as a vector of all zeros except for a one at that word's index. Then a document is the sum (or average) of all the words in the document. We now have a method of embedding a variable length piece of text into a fixed size vector, and we start to see how "similar" is approximately "close", though clearly there are some issues. We're somewhere at the start of NLP now.
2. One big issue is that there are a lot of common noisy words (like "good", "think", "said", etc.) that can make the embeddings more similar than we feel they should be. So now we develop strategies for reducing the impact of those words on our vector. Remember how we just summed up the individual word vectors in (1)? Now we'll scale each word vector by its frequency, so that the more frequent the word is in our corpus, the smaller we'll make the corresponding word vector. That brings us to tf-idf embeddings.
3. Another big issue is that our representation of words doesn't capture word similarity at all. The sentences "I hate when it rains" and "I dislike when it rains" should be more similar than "I hate when it rains" and "I like when it rains", but with our embeddings from (2) the similarity of the two pairs is going to be about the same. So now we revisit our method of constructing word vectors and start to explore ways to "smear" words out. This is where things like word2vec and GloVe pop up as methods of creating distributed representations of words. Now we can represent documents by summing/averaging/tf-idfing our word vectors the same as we did in (2). (There's a code sketch of steps 1-3 after this list.)
4. Now we notice there is an issue where words can have multiple meanings depending on their surrounding context. Think of things like irony, metaphor, humor, etc. Consider "She rolled her eyes and said, 'Don't you love it here?'" and "She rolled the dough and said, 'Don't you love it here?'". Odds are, the similarity per (3) is going to be pretty high, despite the fact that it's clear these are wildly different meanings. The issue is that our model in (3) just uses a static operation for combining our words, and because of that we aren't capturing the fact that "Don't you love it here" shouldn't mean the same thing in the first and second sentences. So now we start to consider ways in which we can combine our word vectors differently and let the context affect the way in which we combine them.
5. And that brings us to now, where we have a lot more compute than we did before and access to way bigger corpora, so we can do some really interesting things, but it's all still the basic steps of breaking down text into its constituent parts, representing those numerically, and then defining a method to combine the various parts to produce a final representation for a document. The above steps greatly help by showing the motivation for each change and understanding why we do the things we do today.
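To make (1)-(3) concrete, here's a rough sketch with toy data; the random vectors in step 3 are just stand-ins for trained word2vec/GloVe vectors, since the point is only the combining operation:

    import numpy as np

    docs = ["i hate when it rains", "i dislike when it rains", "i like when it rains"]
    vocab = sorted({w for d in docs for w in d.split()})
    idx = {w: i for i, w in enumerate(vocab)}

    # 1. Bag of words: each document is the sum of one-hot word vectors.
    bow = np.zeros((len(docs), len(vocab)))
    for d, doc in enumerate(docs):
        for w in doc.split():
            bow[d, idx[w]] += 1

    # 2. tf-idf: scale down words that show up in many documents.
    df = (bow > 0).sum(axis=0)              # in how many documents does each word appear?
    tfidf = bow * np.log(len(docs) / df)    # words appearing in every document get weight 0 here

    # 3. Distributed word vectors: swap one-hots for dense vectors, then average.
    rng = np.random.default_rng(0)
    word_vecs = {w: rng.normal(size=50) for w in vocab}   # pretend these came from word2vec
    doc_vecs = np.array([np.mean([word_vecs[w] for w in d.split()], axis=0) for d in docs])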
They're not, I get why you think that though.
They're making a vector for a text that's the term frequencies in the document.
It's one step simpler than tfidf which is a great starting point.
Are you saying it's pure chance that operations like "man - woman" = "king - queen" (and many, many other similar relationships and analogies) work?
If not please explain this comment to those of us ignorant in these matters :)
3Blue1Brown has some other examples in his videos about transformers; most notable, I think, is that Hitler - Germany + Italy ≈ Mussolini!
https://www.3blue1brown.com/lessons/gpt
It’s not pure chance that the above calculus shakes out, but it doesn’t have to be that way. If you are embedding on a word-by-word level then it can happen; if the units are a little smaller or larger than words, it’s not immediately clear what the calculation is doing.
But the main difference here is you get 1 embedding for the document in question, not an embedding per word like word2vec. So it’s something more like “document about OS/2 warp” - “wiki page for ibm” + “wiki page for Microsoft” = “document on windows 3.1”
OK, sounds counter-intuitive, but I'll take your word for it!
It seems odd since the basis of word similarity captured in this type of way is that word meanings are associated with local context, which doesn't seem related to these global occurrence counts.
Perhaps it works because two words with similar occurrence counts are more likely to often appear close to each other than two words where one has a high count, and another a small count? But this wouldn't seem to work for small counts, and anyways the counts are just being added to the base index rather than making similar-count words closer in the embedding space.
Do you have any explanation for why this captures any similarity in meaning?
Ah I think I see the confusion here. They are describing creating an embedding of a document or piece of text. At the base, the embedding of a single word would just be a single 1. There is absolutely no help with word similarity.
The problem of multiple meanings isn't solved by this approach at all, at least not directly.
Talking about the "gravity of a situation" in a political piece makes the text a bit more similar to physics discussions about gravity. But most of the words won't match as well, so your document vector is still more similar to other political pieces than physics.
Going up the scale, here's a few basic starting points that were (are?) the backbone of many production text AI/ML systems.
1. Bag of words. Here your vector has a 1 for words that are present, and 0 for ones that aren't.
2. Bag of words with a count. A little better, now we've got the information that you said "gravity" fifty times not once. Normalise it so text length doesn't matter and everything fits into 0-1.
3. TF-IDF. It's not very useful to know that you said a common word a lot. Most texts do; what we care about is texts that say it more than you'd expect, so we take into account how often the words appear in the entire corpus. (Quick sketch at the end of this comment.)
These don't help with words, but given how simple they are they are shockingly useful. They have their stupid moments, although one benefit is that it's very easy to debug why they cause a problem.
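If you want to poke at these, here's a quick scikit-learn sketch of the three variants (toy corpus; TfidfVectorizer handles the normalisation for you):

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    docs = ["gravity of the situation", "gravity bends spacetime", "the situation escalated"]

    bow_binary = CountVectorizer(binary=True).fit_transform(docs)  # 1. word present or not
    bow_counts = CountVectorizer().fit_transform(docs)             # 2. raw counts (normalise if you like)
    tfidf = TfidfVectorizer().fit_transform(docs)                  # 3. counts reweighted by corpus rarity

    print(tfidf.toarray().round(2))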
I'm trying to understand this approach. Maybe I am expecting too much out of this basic approach, but how does this create a similarity between words with indices close to each other? Wouldn't it just be a popularity contest - the more common words have higher indices and vice versa? For instance, "king" and "prince" wouldn't necessarily have similar indices, but they are semantically very similar.
It doesn't even work as described for popularity - one word starts at 49,999 and one starts at 0.
Yeah, that is a poorly written description. I think they meant that each word gets a unique index location into an array, and the value at that word's index location is incremented whenever the word occurs.
Maybe the idea is to order your vocabulary into some kind of “semantic rainbow”? Like a one-dimensional embedding?
You are expecting too much out of this basic approach. The "simple" similarity search in word2vec (used in https://semantle.com/ if you haven't seen it) is based on _multiple_ embeddings like this one (it's a simple neural network not a simple embedding).
King doesn’t need to appear commonly with prince. It just needs to appear in the same context as prince.
It also leaves out the old “tf idf” normalization of considering how common a word is broadly (less interesting) vs in that particular document. Kind of like a shittier attention. Used to make a big difference.
It's a document embedding, not a word embedding.
This is a simple example where it scores their frequency. If you scored every word by their frequency only you might have embeddings like this:
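(Made-up numbers, purely to illustrate the shape:)

    the: 0.058, and: 0.027, king: 0.0004, prince: 0.0001, spacetime: 0.000002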
That's a very simple 1D embedding, and like you said it would only give you popularity. But say you wanted other stuff too, like its vulgarity, prevalence over time, whether it's slang or not, how likely it is to start or end a sentence, etc. You would need more than one number. In text-embedding-ada-002 there are 1536 numbers in the array (vector), so it's more like a long list of opaque scores.

The numbers don't mean anything in and of themselves. The values don't represent qualities of the words; they're just numbers in relation to others in the training data. They're different numbers in different training data because all the words are scored in relation to each other, like a graph. So when you compute them, you arrive at words and meanings in the training data the way you would arrive at a point in a coordinate space if you subtracted one [x,y,z] from another [x,y,z] in 3D.

So the rage about a vector DB is that it's a database for arrays of numbers (vectors), designed and optimized for computing them against each other, instead of, say, SQL or NoSQL, which are all about retrieval etc.
So king vs prince etc. - when you take into account the 1536 numbers, you can imagine how, compared to other words in the training data, they would actually be similar, always used in the same kinds of ways, and indeed semantically similar - you'd be able to "arrive" at that fact, and arrive at antonyms, synonyms, their French alternatives, etc., but the system doesn't "know" that stuff. Throw in Burger King training data and talk about French fries a lot, though, and you'd mess up the embeddings when it comes to arriving at the French version of a king! You might get "pomme de terre".
Is that really an embedding? I normally think of an embedding as an approximate lower-dimensional matrix of coefficients that operate on a reduced set of composite variables that map the data from a nonlinear to linear space.
You're right that what I described isn't what people commonly think of as embeddings (given how far we've moved past the above description), but broadly an embedding is anything (in NLP at least) that maps text into a fixed length vector. When you make embeddings like this, the nice thing is that cosine similarity has an easy to understand meaning: count the number of words two documents have in common (subject to some normalization constant).
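Concretely, for binary bag-of-words vectors A and B, cos(A, B) = |A ∩ B| / sqrt(|A| · |B|): the count of shared words, divided by a normalization term based on how many distinct words each document has. With raw counts instead of 0/1 entries the numerator becomes a weighted overlap, but the intuition is the same.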
Most fancy modern embedding strategies basically start with this and then proceed to build on top of it to reduce dimensions, represent words as vectors in their own right, pass this into some neural layer, etc.
A lot of people here are trying to describe to you that no, this is not at all the starting point of modern embeddings. This has none of the properties of embeddings.
What you're describing is an idea from the 90s that was a dead end. Bag of words representations.
It has no relationship to modern methods. It's based on totally different theory (bow instead of the distributional hypothesis).
There is no conceptual or practical path from what you describe to what modern embeddings are. It's horribly misleading.
Eh, I disagree. When I began working in ML everything was about word2vec and glove and the state of the art for embedding documents was adding together all the word embeddings and it made no sense to me but it worked.
Learning about BoW and simple ways of converting text to fixed length vectors that can be used in ML algos clarified a whole lot for me, especially the fact that embeddings aren’t magic, they are just a way to convert text to a fixed length vector.
BoW and tf-idf vectors are still workhorses for routine text classification tasks despite their limitations, so they aren’t really a dead end. Similarly, a lot of things that follow BoW make a whole lot more sense if you think of them as addressing limitations of BoW.
Well, you've fooled yourself into thinking you understand something when you don't. I say this as someone with a PhD in the topic, who has taught many students, and published dozens of papers in the space.
The operation of adding BoW vectors together has nothing to do with the operation of adding together word embeddings. Well, aside from both nominally being addition.
It's like saying you understand what's happening because you can add velocity vectors and then you go on to add the binary vectors that represent two binary programs and expect the result to give you a program with the average behavior of both. Obviously that doesn't happen, you get a nonsense binary.
They may both be arrays of numbers but mathematically there's no relationship between the two. Thinking that there's a relationship between them leads to countless nonsense conclusions: the idea that you can keep adding word embeddings to create document embeddings like you keep adding BoWs, the notion that average BoWs mean the same thing as average word embeddings, the notion that normalizing BoWs is the same as normalizing word embeddings and will lead to the same kind of search results, etc. The errors you get with BoWs are totally different from the errors you get with word or sentence or document embeddings. And how you fix those errors is totally different.
No. Nothing at all makes sense about word embeddings from the point of BoW.
Also, yes BoW is a total dead end. They have been completely supplanted. There's never any case where someone should use them.
There is no conceptual or practical path from what you describe to what modern embeddings are.
There certainly is. At least there is a strong relation between bag-of-words representations and methods like word2vec. I am sure you know all of this, but I think it's worth expanding a bit on this, since the top-level comment describes things in a rather confusing way.
In traditional Information Retrieval, two kinds of vectors were typically used: document vectors and term vectors. If you make a |D| x |T| matrix (where |D| is the number of documents and |T| is the number of terms that occur across all documents), we can go through a corpus and note in each |T|-length row, for a particular document, the frequency of each term in that document (frequency here means the raw counts or something like TF-IDF). Each row is a document vector, each column a term vector. The cosine similarity between two document vectors will tell you whether two documents are similar, because similar documents are likely to have similar terms. The cosine similarity between two term vectors will tell you whether two terms are similar, because similar terms tend to occur in similar documents. The top-level comment seems to have explained document vectors in a clumsy way.
Over time (we are talking 70s-90s), people found that term vectors did not really work well, because documents are often too coarse-grained as context. So, term vectors were redefined via |T| x |T| matrices where, if you have such a matrix C, C[i][j] contains how often the j-th term occurs in the context of the i-th term. Since this type of matrix is not bound to documents, you can choose the context size based on the goals you have in mind. For instance, you could count only terms that are within a distance of 10 tokens of the occurrences of term i.
One refinement is that rather than raw frequencies, we can use some other measure. One issue with raw frequencies is that a frequent word like "the" will co-occur with pretty much every word, so its frequency in the term vector is not particularly informative, but its large frequency will have an outsized influence on e.g. dot products. So, people would typically use pointwise mutual information (PMI) instead. It's beyond the scope of a comment to explain PMI fully, but intuitively you can think of the PMI of two words to mean: how much more often do the words co-occur than chance? This will result in a low PMI for e.g. PMI(information, the) but a high PMI for PMI(information, retrieval). Then it's also common practice to replace negative PMI values by zero, which leads to PPMI (positive PMI).
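(For reference, the formula is PMI(w, c) = log( P(w, c) / (P(w) · P(c)) ), with the probabilities estimated from corpus counts, and PPMI(w, c) = max(PMI(w, c), 0).)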
So, what do we have now? A |T|x|T| matrix with PPMI scores, where each row (or column) can be used as a word vector. However, it's a bit unwieldy, because the vectors are large (|T|) and typically somewhat sparse. So people started to apply dimensionality reduction, e.g. by applying Singular Value Decomposition (SVD, I'll skip the details here of how to use it for dimensionality reduction). So, suppose that we use SVD to reduce the vector dimensionality to 300, we are left with a |T|x300 matrix and we finally have dense vectors, similar to e.g. word2vec.
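A compressed sketch of that counts → PPMI → SVD pipeline, using a tiny made-up co-occurrence matrix in place of one counted from a real corpus:

    import numpy as np

    # counts[i][j] = how often the j-th term occurs in the context of the i-th term (toy numbers)
    counts = np.array([[0., 4., 1.],
                       [4., 0., 2.],
                       [1., 2., 0.]])

    p_wc = counts / counts.sum()                 # joint probabilities
    p_w = p_wc.sum(axis=1, keepdims=True)        # marginal over contexts
    p_c = p_wc.sum(axis=0, keepdims=True)        # marginal over terms

    with np.errstate(divide="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    ppmi = np.maximum(pmi, 0.0)                  # clamp negatives (and the -inf from zero counts) to 0

    # Dimensionality reduction: keep only the top-k components of the SVD.
    U, S, Vt = np.linalg.svd(ppmi)
    k = 2
    word_vectors = U[:, :k] * S[:k]              # dense k-dimensional word vectors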
Now, the interesting thing is that people have found that word2vec's skipgram with negative sampling (SGNS) is implicitly factorizing a PMI-based word-context matrix [1], exactly like the IR folks were doing before. Conversely, if you matrix-multiply the word and context embedding matrices that come out of word2vec SGNS, you get an approximation of the |T|x|T| PMI matrix (or |T|x|C| if a different vocab is used for the context).
Summarized, there is a strong conceptual relation between bag-of-word representations of old days and word2vec.
Whether it's an interesting route didactically for understanding embeddings is up for debate. It's not like the mathematics behind word2vec are complex (understanding the dot product and the logistic function goes a long way) and understanding word2vec in terms of 'neural net building blocks' makes it easier to go from word2vec to modern architectures. But in an exhaustive course about word representations, it certainly makes sense to link word embeddings to prior work in IR.
[1] https://proceedings.neurips.cc/paper_files/paper/2014/file/f...
How does this enable cosine similarity usage? I don't get the link between incrementing a word's index by its count in a text and how this ends up with words that have similar meanings having a high cosine similarity value.
I think they are talking about bag-of-words. If you apply a dimensionality reduction technique like SVD or even random projection on bag-of-words, you can effectively create a basic embedding. Check out latent semantic indexing / latent semantic analysis.
You're right, that approach doesn't enable getting embeddings for an individual word. But it would work for comparing similarity of documents - not that well of course, but it's a toy example that might feel more intuitive
Really appreciate you explaining this idea, I want to try this! It wasn't clear to me until I read the discussion that you meant that you'd have similarity of entire documents, not among words.
Yes! And that’s an oversight on my part — word embeddings are interesting but I usually deal with documents when doing nlp work and only deal with word embeddings when thinking about how to combine them into a document embedding.
Give it a shot! I’d grab a corpus like https://scikit-learn.org/stable/datasets/real_world.html#the... to play with and see what you get. It’s not going to be amazing, but it’s a great way to build some baseline intuition for nlp work with text that you can do on a laptop.
Aren't you just describing a bag-of-words model?
https://en.wikipedia.org/wiki/Bag-of-words_model
Yes! And the follow up that cosine similarity (for BoW) is a super simple similarity metric based on counting up the number of words the two vectors have in common.
Embeddings must be trained, otherwise they don't have any meaning, and are just random numbers.
I think that strips away way too much. What you describe is “counting words”. It produces 50,000-dimensional vectors (with most components zero for the vast majority of texts) for each text, so it’s not a proper embedding.
What makes embeddings useful is that they do dimensionality reduction (https://en.wikipedia.org/wiki/Dimensionality_reduction) while keeping enough information to keep dissimilar texts away from each other.
I also doubt your claim “and the results aren't totally terrible”. In most texts, the dimensions with highest values will be for very common words such as “a”, “be”, etc (https://en.wikipedia.org/wiki/Most_common_words_in_English)
A slightly better simple view of how embeddings can work in search is by using principal component analysis. If you take a corpus, compute TF-IDF vectors (https://en.wikipedia.org/wiki/Tf–idf) for all texts in it, then compute the n ≪ 50,000 top principal components of the set of vectors and then project each of your 50,000-dimensional vectors on those n vectors, you’ve done the dimension reduction and still, hopefully, are keeping similar texts close together and distinct texts far apart from each other.
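A sketch of that pipeline with scikit-learn; TruncatedSVD skips the mean-centering of strict PCA, which is the usual compromise for sparse tf-idf matrices (this is essentially latent semantic analysis), and the corpus here is made up:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "the senate debated the budget bill",
        "lawmakers passed the budget bill",
        "the probe measured gravity near jupiter",
    ]

    tfidf = TfidfVectorizer().fit_transform(docs)                 # |D| x |T| sparse matrix
    reduced = TruncatedSVD(n_components=2).fit_transform(tfidf)   # |D| x n dense vectors, n << |T|

    print(cosine_similarity(reduced))  # the two budget documents should land closer to each other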