What is vector search?

An introduction to vector search/nearest neighbors

Vector search is the process of finding the data points most similar to a reference data point, using vector representations of the data. If you are unfamiliar with vectors, we recommend reading "New to vectors?" first.


Words in a 3D space. Each word is represented by a vector of [x,y,z] values.

In the example above, each data point is represented by a vector of [x,y,z] values. We can search for the vectors closest to the word 'school'; as you can see, words such as 'elementary' and 'students' are the closest.
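The lookup described above can be sketched as a brute-force nearest-neighbor search. This is a minimal illustration, not the product's implementation: the coordinates in `WORDS` are made up for the example, and the `euclidean` and `nearest` helpers are hypothetical names.

```python
import math

# Hypothetical [x, y, z] coordinates, invented purely for illustration.
WORDS = {
    "school": (0.90, 0.80, 0.10),
    "elementary": (0.85, 0.75, 0.15),
    "students": (0.80, 0.90, 0.20),
    "banana": (0.10, 0.20, 0.90),
    "engine": (0.20, 0.10, 0.30),
}


def euclidean(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query_word, k=2):
    """Return the k words whose vectors lie closest to the query word's vector."""
    query_vector = WORDS[query_word]
    distances = [
        (word, euclidean(query_vector, vector))
        for word, vector in WORDS.items()
        if word != query_word  # skip the query itself
    ]
    distances.sort(key=lambda pair: pair[1])  # smallest distance first
    return [word for word, _ in distances[:k]]


closest = nearest("school", k=2)
print(closest)
```

With these toy coordinates, 'elementary' and 'students' come out as the two nearest neighbors of 'school'. Comparing the query against every stored vector is fine for a handful of points; large collections usually need an approximate index instead.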


Vector search works because similar data points are located close together in the vector space, so the distance between two vectors reflects the conceptual similarity or difference between the items they represent. This frees us from the time-consuming traditional process of word-matching search.

Vector search is not limited to text data. Using different algorithms or neural models, we can vectorize various data types, such as images or even audio files.
For instance, in an image search over an animal dataset, we can use a reference image of a dog to retrieve all images of dogs or similar animals, such as wolves, based on the similarity score computed between vectors.


Vector search in image space. The image on top is the reference image. The second and third images, another dog and a rabbit, are assigned 78% and 34% similarity to it, respectively.
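One common way to compute such a similarity score is cosine similarity, sketched below. This is an assumption for illustration: the source does not say which measure produced the percentages in the figure, and the four-dimensional vectors here are invented stand-ins for real image embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical embeddings; a real vectorizer would produce these from images.
reference_dog = [0.9, 0.1, 0.3, 0.7]
candidates = {
    "other_dog": [0.8, 0.2, 0.4, 0.6],
    "rabbit": [0.1, 0.9, 0.7, 0.1],
}

scores = {
    name: cosine_similarity(reference_dog, vector)
    for name, vector in candidates.items()
}
# Rank candidates from most to least similar to the reference image.
ranked = sorted(scores, key=scores.get, reverse=True)

for name in ranked:
    print(f"{name}: {scores[name]:.0%}")
```

The toy numbers will not reproduce the percentages in the figure; the point is only that the second dog scores higher than the rabbit against the reference.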

What is needed for vector search?

  1. A vectorizer model (i.e. a tool that turns data into vectors). We provide you with no-code access to state-of-the-art vectorizers for different data types.
  2. An engine that performs a quick vector search and returns the results. Traditional and vector search are embedded in our insight extraction tools (e.g. Explorer).
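The two components listed above can be sketched together as follows. Everything here is hypothetical: `toy_vectorize` is a word-count stand-in that only keeps the example runnable (a real vectorizer would be a trained model), and `BruteForceIndex` is an illustrative engine, not the one used in Explorer.

```python
import math
from collections import Counter


def toy_vectorize(text, vocabulary):
    """Stand-in vectorizer: one word-count dimension per vocabulary term."""
    counts = Counter(text.lower().split())
    return [float(counts[term]) for term in vocabulary]


def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class BruteForceIndex:
    """Minimal search engine: stores vectors and scans all of them per query."""

    def __init__(self, vectorize):
        self.vectorize = vectorize
        self.items = []  # list of (document, vector) pairs

    def add(self, document):
        self.items.append((document, self.vectorize(document)))

    def search(self, query, k=1):
        query_vector = self.vectorize(query)
        scored = [(cosine(query_vector, vec), doc) for doc, vec in self.items]
        scored.sort(key=lambda pair: pair[0], reverse=True)  # best match first
        return [doc for _, doc in scored[:k]]


documents = [
    "students attend elementary school",
    "the engine of the car is loud",
    "rabbits eat carrots",
]
vocabulary = sorted({word for doc in documents for word in doc.split()})

index = BruteForceIndex(lambda text: toy_vectorize(text, vocabulary))
for doc in documents:
    index.add(doc)

results = index.search("school students", k=1)
print(results)
```

Keeping the vectorizer and the engine separate means either one can be swapped without touching the other, for example replacing the word-count stand-in with a neural model.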

Limitations of vector search

  1. Vectors are great for representing concepts, so their main limitation shows up when searching for exact names (e.g. Samsung-21) and IDs (e.g. 123fd).
  2. The accuracy of vector search is highly dependent on the model generating the vectors or, more precisely, on the data that the vectorizer was trained on.
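One possible way to work around the first limitation is to route ID-like queries to traditional exact matching and send everything else to vector search. The sketch below is only an illustration of that idea: the `ID_PATTERN` heuristic and the `choose_search_mode` function are assumptions made for this example, not a description of how any particular tool behaves.

```python
import re

# Heuristic (assumed for illustration): a single token made of letters,
# digits, or hyphens that contains at least one digit looks like a name or ID.
ID_PATTERN = re.compile(r"^(?=.*\d)[A-Za-z0-9-]+$")


def choose_search_mode(query):
    """Return 'exact' for ID-like queries and 'vector' for everything else."""
    cleaned = query.strip()
    return "exact" if ID_PATTERN.match(cleaned) else "vector"


for example in ["Samsung-21", "123fd", "phones with a good camera", "school"]:
    print(f"{example!r} -> {choose_search_mode(example)}")
```

A heuristic like this will misclassify some queries (for instance, a product name without digits), so it is best treated as a starting point rather than a rule.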