Version: v24.1

Vector Similarity Search in DQL

Dgraph v24 introduces a vector data type and similarity search to the DQL query language.

This guide shows how to use vector embeddings and similarity search in Dgraph. This example uses Ratel for schema updates, mutations, and queries, but you can use any DQL client.

Define Schema

Define a DQL schema with a vector predicate. You can set this via the Ratel schema tab using the bulk edit option, or use any DQL client:

<Issue.description>: string .
<Issue.vector_embedding>: float32vector @index(hnsw(metric:"euclidean")) .

type <Issue> {
  Issue.description
  Issue.vector_embedding
}

Vector predicates use the float32vector type and are indexed with the hnsw (Hierarchical Navigable Small World) index. The hnsw index supports three distance metrics: cosine, euclidean, and dotproduct. This example uses euclidean distance.

Insert Data

Insert data containing vector embeddings using a DQL mutation. You can paste this into Ratel as a mutation, or use curl, pydgraph, or any DQL client:

{
  "set": [
    {
      "dgraph.type": "Issue",
      "Issue.vector_embedding": "[0.25, 0.47, 0.8, 0.27]",
      "Issue.description": "Intermittent timeouts. Logs show no such host error."
    },
    {
      "dgraph.type": "Issue",
      "Issue.vector_embedding": "[0.57, 0.23, 0.68, 0.41]",
      "Issue.description": "Bug when user adds record with blank surName. Field is required so should be checked in web page."
    },
    {
      "dgraph.type": "Issue",
      "Issue.vector_embedding": "[0.26, 0.12, 0.77, 0.57]",
      "Issue.description": "Delays on responses every 30 minutes with high network latency in backplane"
    },
    {
      "dgraph.type": "Issue",
      "Issue.vector_embedding": "[0.45, 0.49, 0.72, 0.2]",
      "Issue.description": "Slow queries intermittently. The host is not found according to logs."
    },
    {
      "dgraph.type": "Issue",
      "Issue.vector_embedding": "[0.52, 0.05, 0.22, 0.82]",
      "Issue.description": "Some timeouts. It seems to be a DNS host lookup issue. Seeing No Such Host message."
    },
    {
      "dgraph.type": "Issue",
      "Issue.vector_embedding": "[0.33, 0.64, 0.16, 0.68]",
      "Issue.description": "Host and DNS issues are causing timeouts in the User Details web page"
    }
  ]
}
Note:

For simplicity, this example uses small 4-dimensional vectors. In production, you would typically use vectors generated by ML models (e.g., embeddings from language models) which are usually 384, 512, 768, or more dimensions. The embeddings in this example represent four concepts in the four vector dimensions: slowness/delays, logging/messages, networks, and GUIs/web pages.

Basic Similarity Query

Use the similar_to() function to find similar items. For example, to find issues similar to a new issue description "Slow response and delay in my network!", represent it as the vector [0.28, 0.75, 0.35, 0.48].

The similar_to() function takes three parameters:

  1. The DQL field name (predicate)
  2. The number of results to return
  3. The vector to search for
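
Conceptually, similar_to performs a k-nearest-neighbor lookup using the distance metric declared on the index (here, euclidean). The real HNSW search is approximate, but the following offline Python sketch (no Dgraph client required) shows the exact computation a k=3 euclidean lookup over the sample embeddings would perform:

```python
import math

# Sample embeddings from the mutation above, in insertion order
embeddings = [
    [0.25, 0.47, 0.8, 0.27],   # intermittent timeouts, no such host
    [0.57, 0.23, 0.68, 0.41],  # blank surName bug
    [0.26, 0.12, 0.77, 0.57],  # delays every 30 minutes
    [0.45, 0.49, 0.72, 0.2],   # slow queries, host not found
    [0.52, 0.05, 0.22, 0.82],  # DNS host lookup issue
    [0.33, 0.64, 0.16, 0.68],  # host and DNS issues in web page
]

def euclidean(a, b):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn(query, vectors, k):
    """Return the indices of the k nearest vectors by euclidean distance."""
    ranked = sorted(range(len(vectors)), key=lambda i: euclidean(query, vectors[i]))
    return ranked[:k]

query_vec = [0.28, 0.75, 0.35, 0.48]  # "Slow response and delay in my network!"
print(knn(query_vec, embeddings, 3))  # -> [5, 3, 0]
```

The three nearest issues are the sixth, fourth, and first — the ones whose embeddings emphasize the same slowness/network/web-page dimensions as the query vector.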

query slownessWithLogs {
  simVec(func: similar_to(Issue.vector_embedding, 3, "[0.28, 0.75, 0.35, 0.48]")) {
    uid
    Issue.description
  }
}

Using Query Variables

You can use query variables to pass the vector dynamically:

query test($vec: float32vector) {
  simVec(func: similar_to(Issue.vector_embedding, 3, $vec)) {
    uid
    Issue.description
  }
}

When making the request, set the variable vec to a JSON float array:

{
  "vec": [0.28, 0.75, 0.35, 0.48]
}

Computing Vector Distances and Similarity Scores

The similar_to() function uses the hnsw index with the distance metric declared in the schema (in this case, euclidean distance).

In some cases, you may want to compute the distance or similarity score explicitly. Keep in mind:

  • Distance: Lower values indicate more similarity
  • Similarity score: Higher values indicate more similarity

Dgraph v24 introduces the dot function to compute the dot product of vectors, which you can use to compute various similarity metrics.

Distance Metrics

Given two vectors A = [a_1, a_2, ..., a_n] and B = [b_1, b_2, ..., b_n]:

Euclidean distance is the L2 norm of A - B:

D = \sqrt{(a_1 - b_1)^2 + ... + (a_n - b_n)^2}

Which can be expressed as:

D = \sqrt{(A - B) \cdot (A - B)}

Cosine similarity measures the angle between two vectors:

cosine(A, B) = \frac{A \cdot B}{||A|| \cdot ||B||}

Cosine similarity ranges from -1 to 1 (where 1 means identical vectors). It's often converted to cosine distance:

cosine\_distance(A, B) = 1 - cosine(A, B)

When vectors are normalized (||A|| = 1 and ||B|| = 1), which is usually the case for embeddings produced by ML models, the cosine computation simplifies to a single dot product:

dotproduct\_distance(A, B) = 1 - A \cdot B

A common use case is to compute a similarity score or confidence. For normalized vectors:

similarity(A, B) = \frac{1 + A \cdot B}{2}

This metric ranges from 0 to 1, with 1 being as similar as possible, making it useful for applying thresholds.
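
These formulas translate directly into code. The following Python sketch (standard library only) implements each metric and checks the identity stated above: for unit-length vectors, cosine distance and dot-product distance coincide.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    diff = [x - y for x, y in zip(a, b)]
    return math.sqrt(dot(diff, diff))  # sqrt((A-B) . (A-B))

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def cosine_distance(a, b):
    return 1.0 - cosine_similarity(a, b)

def dotproduct_distance(a, b):
    # Equivalent to cosine distance only when a and b are unit length
    return 1.0 - dot(a, b)

def similarity_score(a, b):
    # Maps the dot product of normalized vectors into [0, 1]
    return (1.0 + dot(a, b)) / 2.0

def normalize(a):
    n = math.sqrt(dot(a, a))
    return [x / n for x in a]

q = normalize([0.28, 0.75, 0.35, 0.48])
v = normalize([0.25, 0.47, 0.8, 0.27])
print(abs(cosine_distance(q, v) - dotproduct_distance(q, v)) < 1e-12)  # -> True
```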

Computing Distances in DQL

Here's an example query that computes euclidean, cosine, and dot product distances:

query slownessWithLogs($vec: float32vector) {
  simVec(func: similar_to(Issue.vector_embedding, 3, $vec)) {
    uid
    Issue.description
    vemb as Issue.vector_embedding

    euclidean_distance: math(sqrt(($vec - vemb) dot ($vec - vemb)))

    dotproduct_distance: math(1.0 - (($vec) dot vemb))

    cosine as math((($vec) dot vemb) / sqrt((($vec) dot ($vec)) * (vemb dot vemb)))
    cosine_distance: math(1.0 - cosine)

    similarity_score: math((1.0 + (($vec) dot vemb)) / 2.0)
  }
}

You typically compute the same distance as defined in the index, or use the similarity score.

Ordering Results by Similarity Score

The following query computes the similarity score in a variable and uses it to order the 3 closest nodes by similarity:

query slownessWithLogs($vec: float32vector) {
  var(func: similar_to(Issue.vector_embedding, 3, $vec)) {
    vemb as Issue.vector_embedding
    score as math((1.0 + (($vec) dot vemb)) / 2.0)
  }
  # score is now a map of uid -> similarity_score

  simVec(func: uid(score), orderdesc: val(score)) {
    uid
    Issue.description
    score: val(score)
  }
}
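
The same two-step pattern (collect candidates, score, then order) can be mirrored offline. The Python sketch below scores the three euclidean-nearest issues from this example and orders them by descending similarity score; the "issue1"/"issue4"/"issue6" labels are illustrative stand-ins for the uids Dgraph would return.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# The three nearest neighbors returned by similar_to in this example
embeddings = {
    "issue1": [0.25, 0.47, 0.8, 0.27],
    "issue4": [0.45, 0.49, 0.72, 0.2],
    "issue6": [0.33, 0.64, 0.16, 0.68],
}

query_vec = [0.28, 0.75, 0.35, 0.48]

# score as math((1.0 + ($vec dot vemb)) / 2.0)
scores = {uid: (1.0 + dot(query_vec, v)) / 2.0 for uid, v in embeddings.items()}

# simVec(func: uid(score), orderdesc: val(score))
ordered = sorted(scores, key=scores.get, reverse=True)
print(ordered)  # -> ['issue6', 'issue4', 'issue1']
```

Note that the ordering by similarity score differs slightly from the euclidean ranking (issue6, issue4, issue1 versus issue6, issue4, issue1 here, but the two metrics can disagree in general, especially for non-normalized vectors like these).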

Summary

This guide demonstrates how to:

  • Define a schema with vector predicates and hnsw indexes
  • Insert data with vector embeddings
  • Perform similarity searches using the similar_to() function
  • Compute various distance metrics and similarity scores using the dot function

For production use cases, you would typically:

  1. Generate vector embeddings from your text/data using ML models (e.g., sentence transformers, OpenAI embeddings)
  2. Store these embeddings in Dgraph
  3. Use similar_to() to find semantically similar items
  4. Optionally compute similarity scores to filter or rank results