ChromaDB Defaults to L2 Distance — Why that might not be the best choice
I recently ran into an issue while working with ChromaDB. By default, ChromaDB uses L2 (Euclidean) distance for similarity queries.
But guess what? For most text embedding scenarios, that’s not the best choice. After I switched to cosine, my results were easily 10x better.
The Basics: What’s an “Embedding” Anyway?
In most NLP (Natural Language Processing) tasks, we turn words or sentences into vectors, which are basically lists of numbers. Example:
“Cats are cute” -> [0.21, -0.43, 0.59, …, 0.10]
These numbers try to capture the meaning of the sentence.
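For a concrete picture, here is a minimal sketch of how such a vector is produced, assuming the sentence-transformers library; the model name is just an example, and any similar embedding model works:

# Turn a sentence into an embedding vector (sketch; assumes sentence-transformers is installed).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embedding = model.encode("Cats are cute")

print(embedding[:5])   # the first few numbers of the vector
print(len(embedding))  # 384 dimensions for this particular model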
Different distance metrics
L2 Distance (Default in ChromaDB)
What It Is
L2 distance (or “Euclidean distance”) is the straight-line distance between two points in space. Think of measuring how far one point is from another if you drew a line between them.
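In code, it is just the norm of the difference between the two vectors. A quick sketch with NumPy (the numbers are made up):

import numpy as np

a = np.array([0.21, -0.43, 0.59])
b = np.array([0.20, -0.40, 0.61])

# Euclidean (L2) distance: the straight-line distance between the two points
l2 = np.linalg.norm(a - b)
print(l2)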
Why It Can Be Wrong for Text
If your embeddings come from something like Sentence Transformers, those vectors are usually normalized — meaning each vector’s length is 1.
L2 distance then starts measuring tiny differences in each dimension.
But if you only care whether two sentences mean the same thing (i.e., face the same “direction”), L2 can give you weird results.
It’s kind of like saying:
“Two roads are the same if they start at the same place and end at the exact same spot.”
But for text, all we care about is whether the roads head in the same direction, not their exact length.
Cosine Similarity (or Distance)
What’s This “Angle” People Keep Talking About?
Cosine similarity is about the angle between two vectors.
Picture two arrows on a piece of paper.
The smaller the angle (i.e., the closer they are to pointing the same way), the higher the similarity.
In text terms, it means if two sentences share the same topic or meaning, they point roughly the same way in vector space — so Cosine says, “Yep, these are close!”
Why It’s Better for Text
When embeddings are normalized, they all have the same length. If two vectors are “facing the same way,” that means they talk about similar concepts. Cosine easily picks that up and says “High similarity!” Meanwhile, L2 might complain about tiny coordinate differences that don’t really matter if the overall direction is the same.
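Here is a small sketch of cosine similarity with NumPy, so you can see the “angle” idea in actual numbers (the vectors are made up):

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([0.20, 0.50, 0.10])
b = np.array([0.25, 0.45, 0.15])  # slightly different numbers, same rough direction

print(cosine_similarity(a, b))  # ~0.99: nearly identical direction, so "very similar"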
Silly Example
Imagine you have a recipe for cookies:
[milk=2, sugar=3, flour=5]
Someone doubles the recipe:
[milk=4, sugar=6, flour=10]
The direction is the same (just double the ingredients), but L2 sees them as “far apart.”
Cosine sees them as basically the same recipe, just scaled up.
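You can check this in a few lines of NumPy, using the recipe vectors above:

import numpy as np

original = np.array([2, 3, 5])   # milk, sugar, flour
doubled = np.array([4, 6, 10])   # same recipe, doubled

# L2 sees a big gap because every coordinate moved
l2 = np.linalg.norm(original - doubled)  # sqrt(4 + 9 + 25) ≈ 6.16

# Cosine only looks at direction, so the two recipes are identical
cos = np.dot(original, doubled) / (np.linalg.norm(original) * np.linalg.norm(doubled))  # 1.0

print(l2, cos)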
Dot Product: The Magnitude + Angle Hybrid
What is a Dot Product?
The dot product of two vectors is the sum of the products of their corresponding entries. For vectors a = (a₁, a₂, …) and b = (b₁, b₂, …), a · b = a₁b₁ + a₂b₂ + …
It combines magnitude (how long each vector is) with the angle between them: a · b = |a| |b| cos θ.
In plain words:
If two vectors are big and also point in the same direction, the dot product will be large.
If they point in opposite directions, the dot product might even be negative.
If you care about both how “big” the embeddings are and how they’re aligned, dot product might be useful.
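A tiny sketch that ties those three statements to numbers (again, made-up vectors):

import numpy as np

a = np.array([3.0, 4.0])    # a "big" vector pointing up and to the right
b = np.array([6.0, 8.0])    # same direction, even bigger
c = np.array([-3.0, -4.0])  # opposite direction

print(np.dot(a, b))  # 50.0  -> large: big vectors pointing the same way
print(np.dot(a, c))  # -25.0 -> negative: opposite directions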
When to Use Dot Product?
If your embeddings are not normalized and you want to factor in “intensity” or “magnitude.” For example, in recommendation systems, a higher magnitude might mean someone really likes certain items.
For text, many embeddings are normalized to length=1, so dot product becomes basically the same as cosine similarity. In that case, they’re interchangeable.
Imagine Alice and Bob both love coffee. Alice adds a certain mix of sugar, cream, and chocolate syrup. Bob uses exactly the same ratios, but doubles every ingredient. They’re both using the same “flavor profile,” just that Bob’s is bigger and sweeter.
When we talk about dot product, we’re basically saying: “Yes, these two recipes have the same direction (same ingredients) and Bob’s just supersizing it.” The more they match in both type and amount, the bigger the dot product.
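And if you normalize the vectors to length 1 first, the dot product and cosine similarity give the exact same number, which is why they are interchangeable for typical text embeddings. A quick sketch with made-up “recipes”:

import numpy as np

alice = np.array([1.0, 2.0, 3.0])  # sugar, cream, chocolate syrup
bob = alice * 2                    # same ratios, just doubled

# Normalize both to unit length
alice_n = alice / np.linalg.norm(alice)
bob_n = bob / np.linalg.norm(bob)

dot = np.dot(alice_n, bob_n)
cos = np.dot(alice, bob) / (np.linalg.norm(alice) * np.linalg.norm(bob))
print(dot, cos)  # both 1.0: on unit vectors, dot product == cosine similarity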
Why ChromaDB’s Default L2 Might Bite You
Because ChromaDB defaults to L2:
If you’re storing text embeddings, you might get “just okay” results, not “amazing” results.
Your nearest neighbors might be thrown off by small numeric differences that have zero impact on the actual meaning of the text.
So…Which Metric Should I Use?
Text / NLP → Usually Cosine is best. You’re looking for angle similarity.
Images or numeric data → L2 can be fine, since magnitude matters there.
If you have Dot Product as an option (and your vectors aren’t normalized), it can be handy for recommendation systems or other cases where you do want both magnitude and alignment.
How to Switch in ChromaDB?
Super simple:
import chromadb

client = chromadb.Client()
collection = client.create_collection(
    name="my_collection",
    metadata={"hnsw:space": "cosine"}  # or "ip" for inner (dot) product; the default is "l2"
)
(Exact config may differ based on version — check ChromaDB docs.)
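Once the collection is created with cosine, adding and querying works as usual. A quick sketch, assuming ChromaDB’s default embedding function is available (the texts and ids are just examples):

collection.add(
    documents=["Cats are cute", "Dogs are loyal", "The stock market dipped today"],
    ids=["doc1", "doc2", "doc3"],
)

results = collection.query(query_texts=["adorable kittens"], n_results=2)
print(results["documents"])  # semantically closest texts first
print(results["distances"])  # cosine distances (smaller = more similar)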
Wrap-Up
ChromaDB is great, but its default L2 distance for text embeddings can be “wrong” in the sense that it measures raw coordinate differences instead of the angle (direction) between vectors.
Cosine distance (or similarity) is usually your friend when working with normalized text embeddings.
Understanding why is as simple as: Do you care about the overall direction (meaning) or the exact numeric differences?
Remember, if you have text embeddings that are all unit-length, switch to Cosine. It’ll give you better “semantic” results — i.e., the results you actually want when your user says, “Hey, give me texts that mean the same thing.”