
Distance Metrics in Vector Search


from:  https://weaviate.io/blog/distance-metrics-in-vector-search

 


Vector databases - like Weaviate - use machine learning models to analyze data and calculate vector embeddings. The vector embeddings are stored together with the data in a database, and later are used to query the data.

In a nutshell, a vector embedding is an array of numbers that is used to describe an object. For example, strawberries could have the vector [3, 0, 1] – though in practice the array would be a lot longer than that.

Note that the meaning of each value in the array depends on the machine learning model used to generate it.

To judge how similar two objects are, we can compare their vector values using various distance metrics.

In the context of vector search, a similarity measure is a function that takes two vectors as input and calculates a distance value between them. The distance can take many forms: it can be the geometric distance between two points, the angle between the vectors, a count of differing vector components, and so on. Ultimately, we use the calculated distance to judge how close or far apart two vector embeddings are. These metrics are used in machine learning for classification and clustering tasks, and especially in semantic search.

Distance Metrics convey how similar or dissimilar two vector embeddings are.

In this article, we explore the variety of distance metrics, the idea behind each, how they are calculated, and how they compare to each other.

If you already have a working knowledge of vector search, then you can skip straight to the Cosine Distance section.

Vectors in Multi-Dimensional Space

Vector databases keep the semantic meaning of your data by representing each object as a vector embedding. Each embedding is a point in a high-dimensional space. For example, the vector for a banana (whether from the text or the image) is located near the vector for apples and far from the vector for cats.

Vectors Example

The above image is a visual representation of a vector space. To perform a search, your search query is converted to a vector - similar to your data vectors. The vector database then computes the similarity between the search query and the collection of data points in the vector space.

Vector Databases are Fast

The best thing is that vector databases can query large datasets containing tens or hundreds of millions of objects and still respond to queries in a tiny fraction of a second.

Without going too deep into the details, one of the big reasons vector databases are so fast is that they use Approximate Nearest Neighbor (ANN) algorithms to index the data based on its vectors. ANN algorithms organize the index so that vectors that are closely related are stored next to each other.

Check out this article, "Why is Vector Search so Fast", to learn how vector databases work.

Why are there Different Distance Metrics?

Depending on the machine learning model used, vectors can have ~100 dimensions or go into thousands of dimensions.

The time it takes to calculate the distance between two vectors grows based on the number of vector dimensions. Furthermore, some similarity measures are more compute-heavy than others. That might be a challenge for calculating distances between vectors with thousands of dimensions.

For that reason, we have different distance metrics that balance the speed and accuracy of calculating distances between vectors.

Distance Metrics

Cosine Similarity

The cosine similarity measures the angle between two vectors in a multi-dimensional space – with the idea that similar vectors point in a similar direction. Cosine similarity is commonly used in Natural Language Processing (NLP). It measures the similarity between documents regardless of the magnitude.

This is advantageous because if two documents are far apart by Euclidean distance, the angle between them could still be small. For example, if the word ‘fruit' appears 30 times in one document and 10 times in the other, that is a clear difference in magnitude, but the documents can still be similar if we only consider the angle. The smaller the angle is, the more similar the documents are.

The cosine similarity and cosine distance have an inverse relationship. As the distance between two vectors increases, the similarity will decrease. Likewise, if the distance decreases, then the similarity between the two vectors increases.

The cosine similarity is calculated as:

cosine similarity(A, B) = (A · B) / (||A|| × ||B||)

A · B is the dot product of the vectors A and B

||A|| and ||B|| are the lengths (magnitudes) of the two vectors

||A|| × ||B|| is the product of the two lengths

The cosine distance formula is then: 1 - Cosine Similarity

Let's use an example to calculate the similarity between two fruits – strawberries (vector A) and blueberries (vector B). Since our data is already represented as a vector, we can calculate the distance.

Strawberry → [4, 0, 1]

Blueberry → [3, 0, 1]

A · B = (4 × 3) + (0 × 0) + (1 × 1) = 13
||A|| = √(4² + 0² + 1²) = √17 ≈ 4.123
||B|| = √(3² + 0² + 1²) = √10 ≈ 3.162
cosine similarity = 13 / (4.123 × 3.162) ≈ 0.997
cosine distance = 1 − 0.997 = 0.003

A distance of 0 indicates that the vectors point in the same direction, whereas a distance of 2 represents vectors pointing in opposite directions. Here the similarity between the two vectors is about 0.997 and the distance is about 0.003. This means that strawberries and blueberries are closely related.
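To make the calculation concrete, here is a minimal Python sketch of the cosine similarity and cosine distance for the two fruit vectors above (plain Python, no libraries; the function name is ours, not from any particular client):

    import math

    def cosine_similarity(a, b):
        # Dot product of the two vectors
        dot = sum(x * y for x, y in zip(a, b))
        # Lengths (magnitudes) of each vector
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b)

    strawberry = [4, 0, 1]
    blueberry = [3, 0, 1]

    similarity = cosine_similarity(strawberry, blueberry)
    distance = 1 - similarity
    print(round(similarity, 3), round(distance, 3))  # 0.997 0.003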

Dot Product

The dot product takes two vectors, multiplies their corresponding components, and sums the results. It is also known as the scalar product since the output is a single (scalar) value. The dot product shows the alignment of two vectors: it is negative if the vectors point in opposing directions and positive if they point in a similar direction.

Dot Product direction

The dot product formula is:

A · B = A1×B1 + A2×B2 + … + An×Bn

Use the dot product to recalculate the distance between the two vectors.

Strawberry → [4, 0, 1]

Blueberry → [3, 0, 1]

A · B = (4 × 3) + (0 × 0) + (1 × 1) = 13

The dot product of the two vectors is 13. To turn this into a distance, we take the negative of the dot product: -13 in this case. Negating the dot product preserves the intuition that a smaller distance value means the vectors are more similar.
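Here is the same calculation as a short Python sketch (the helper name is ours); negating the dot product turns the similarity into a distance, as described above:

    def dot_product(a, b):
        # Multiply corresponding components and sum the results
        return sum(x * y for x, y in zip(a, b))

    strawberry = [4, 0, 1]
    blueberry = [3, 0, 1]

    dot = dot_product(strawberry, blueberry)  # 13
    distance = -dot                           # -13: a lower value means the vectors are more similar
    print(dot, distance)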

Squared Euclidean (L2-Squared)

The Squared Euclidean distance (L2-Squared) is calculated by summing the squared differences between the components of the two vectors. The distance can be any value between zero and infinity. If the distance is zero, the vectors are identical; the larger the distance, the farther apart the vectors are.

The squared Euclidean distance formula is:

d(A, B) = (A1 − B1)² + (A2 − B2)² + … + (An − Bn)²

The squared Euclidean distance of strawberries [4, 0, 1] and blueberries [3, 0, 1] is equal to 1:

(4 − 3)² + (0 − 0)² + (1 − 1)² = 1
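And as a minimal Python sketch (the function name is ours):

    def squared_euclidean(a, b):
        # Sum of squared differences between corresponding components
        return sum((x - y) ** 2 for x, y in zip(a, b))

    print(squared_euclidean([4, 0, 1], [3, 0, 1]))  # 1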

Manhattan (L1 Norm or Taxicab Distance)

Manhattan distance, also known as the "L1 norm" or "Taxicab Distance", calculates the distance between a pair of vectors. The metric is calculated by summing the absolute differences between the components of the two vectors.

d(A, B) = |A1 − B1| + |A2 − B2| + … + |An − Bn|

The name comes from the grid layout of the streets of Manhattan. The city is laid out with buildings on every block and one-way streets. If you're trying to go from point A to point B, the shortest path isn't a straight line, because you cannot drive through buildings; you have to travel along the grid. The fastest route is the one with the fewest twists and turns.

Manhattan example
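As a sketch, the Manhattan distance simply sums the absolute component differences; for the same fruit vectors it also comes out to 1 (the function name is ours):

    def manhattan(a, b):
        # Sum of absolute differences between corresponding components
        return sum(abs(x - y) for x, y in zip(a, b))

    print(manhattan([4, 0, 1], [3, 0, 1]))  # |4-3| + |0-0| + |1-1| = 1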

Hamming

The Hamming distance is a metric for comparing two vectors of equal length. It counts the number of positions at which the two vectors differ - in other words, how many substitutions are needed to turn one vector into the other. The fewer changes required, the more similar the vectors.

There are two ways to implement the Hamming distance:

  1. Compare two numeric vectors
  2. Compare two binary vectors

Weaviate has implemented the first method, comparing numeric vectors. In the next section, I will describe an idea to use the Hamming distance in tandem with Binary Passage Retrieval.

Let's use an example to calculate the Hamming distance. Imagine we have a dataset containing a variety of fruits and vegetables. Your first query is to see which item pairs best with your banana pancakes. To achieve that, we compare the banana pancakes vector with the other vectors, like this:

Banana Pancakes [5, 6, 8]    Hamming Distance
Blueberries     [5, 6, 9]    1
Broccoli        [8, 2, 9]    3

As seen above, blueberries are a better pairing than broccoli. This was determined by comparing the values at each position of the vector representations of the foods.
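A minimal Python sketch of this numeric variant, counting the positions at which the values differ (the function name is ours):

    def hamming(a, b):
        # Count the positions at which the two vectors differ
        return sum(1 for x, y in zip(a, b) if x != y)

    banana_pancakes = [5, 6, 8]
    print(hamming(banana_pancakes, [5, 6, 9]))  # Blueberries: 1
    print(hamming(banana_pancakes, [8, 2, 9]))  # Broccoli: 3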

Hamming distance and Binary Passage Retrieval

Binary Passage Retrieval (BPR) translates vectors into binary sequences. For example, if you have text data that has been converted into a vector ("Hi there" -> [0.2618, 0.1175, 0.38, …]), it can then be translated into a string of binary numbers (0 or 1). Although this condenses the information in the vector, the technique can preserve much of the semantic structure despite representing each component as a 0 or 1.
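BPR learns its binary codes as part of model training, so the snippet below is only a loose illustration of the idea rather than the actual BPR algorithm: it simply thresholds each component of a made-up embedding to produce a bit sequence.

    def to_binary(vector, threshold=0.0):
        # Illustration only: map each component to 1 if it exceeds the threshold, else 0.
        # Real BPR learns the hashing function during training.
        return [1 if x > threshold else 0 for x in vector]

    embedding = [0.2618, 0.1175, -0.38, 0.05, -0.1]  # made-up example values
    print(to_binary(embedding))  # [1, 1, 0, 1, 0]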

To compute the Hamming distance between two such strings, you compare the bits at each position in the sequence. This is done with an XOR bit operation. XOR stands for "exclusive or": if the bits at a given position do not match, the output is 1. Keep in mind that the strings need to be of equal length to perform the comparison. Below is an example of comparing two binary sequences.

Hamming and BPR

There are three positions where the numbers are different (highlighted above). Therefore, the Hamming distance is equal to 3. Norouzi et al. stated that binary sequences are storage efficient and allow storing massive datasets in memory.
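For two equal-length bit strings, the comparison can be done with XOR, as described above. A small sketch (the bit strings here are arbitrary examples, not the ones from the figure):

    def hamming_binary(a, b):
        # XOR each pair of bits: the result is 1 when they differ, 0 when they match
        assert len(a) == len(b), "bit strings must be the same length"
        return sum(int(x) ^ int(y) for x, y in zip(a, b))

    print(hamming_binary("101101", "100011"))  # 3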

Comparison of Different Distance Metrics

Cosine versus Dot Product

To calculate the cosine distance, you need to use the dot product. Conversely, the dot product can be written in terms of the angle between the two vectors and their magnitudes (A · B = ||A|| × ||B|| × cos θ). So what is the difference between the two metrics? The cosine distance tells you only about the angle, whereas the dot product reflects both the angle and the magnitude. If you normalize your data, the magnitude is no longer observable; thus, if your data is normalized, the cosine and dot product metrics are exactly the same.
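A quick sketch of that last point: once the vectors are normalized to unit length, the dot product and the cosine similarity return the same value.

    import math

    def normalize(v):
        # Scale the vector to unit length (magnitude 1)
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    a = normalize([4, 0, 1])
    b = normalize([3, 0, 1])

    dot = sum(x * y for x, y in zip(a, b))
    print(round(dot, 3))  # 0.997 -- identical to the cosine similarity computed earlier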

Manhattan versus Euclidean Distance

The Manhattan distance (L1 norm) and Euclidean distance (L2 norm) are two metrics used in machine learning models. As a distance, the L1 norm sums the absolute differences between the vector components, while the L2 norm takes the square root of the sum of the squared differences. The Manhattan distance is faster to calculate, since it only requires absolute values rather than squares and a square root.

Generally, there is an accuracy vs. speed tradeoff when choosing between the Manhattan and Euclidean distance. It is hard to say precisely when the Manhattan distance will be more accurate than the Euclidean distance; however, Manhattan is faster to compute since you don't have to square the differences. The Manhattan distance also tends to become the better choice as the dimensionality of your data increases. For more information on which distance metric to use in high-dimensional spaces, check out this paper by Aggarwal et al.
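A small sketch comparing the two on the same pair of vectors: the Manhattan version only needs an absolute value per dimension, while the Euclidean version squares every difference and takes a square root at the end.

    import math

    def manhattan(a, b):
        # L1: sum of absolute differences
        return sum(abs(x - y) for x, y in zip(a, b))

    def euclidean(a, b):
        # L2: square root of the sum of squared differences
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    a, b = [4, 0, 1], [3, 0, 1]
    print(manhattan(a, b), euclidean(a, b))  # 1 1.0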

How to Choose a Distance Metric

As a rule of thumb, it is best to use the distance metric that matches the model you're using. For example, if you're using a Siamese Neural Network (SNN), the contrastive loss function incorporates the Euclidean distance. Similarly, when fine-tuning a sentence transformer you define the loss function; CosineSimilarityLoss, for example, takes two embeddings and computes their similarity based on the cosine similarity.

To summarize, there is no ‘one size fits all' distance metric. It depends on your data, model, and application. As mentioned above, there are cases where the cosine distance and dot product are the same; however, the magnitude may or may not be important. The same goes for the accuracy/speed tradeoff between the Manhattan and Euclidean distance.

Use the distance metric that matches the model that you're using.

Implementation of Distance Metrics in Weaviate

In total, Weaviate users can choose between five different distance metrics to support their dataset. The Weaviate documentation describes each metric in detail. Weaviate makes it easy to choose a metric depending on your application: with one edit to your schema, you can use any of the metrics implemented in Weaviate (cosine, dot, l2-squared, hamming, and manhattan), or you have the flexibility to create your own!
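As a rough sketch of what that schema edit can look like - based on the vectorIndexConfig setting described in the Weaviate documentation; the class name here is hypothetical and the exact fields may differ between versions:

    # Hypothetical class definition; "distance" under vectorIndexConfig selects the metric
    # (one of "cosine", "dot", "l2-squared", "hamming", or "manhattan").
    article_class = {
        "class": "Article",
        "vectorIndexConfig": {
            "distance": "dot",
        },
    }

    # With a v3-style Weaviate Python client, this dict could then be passed to
    # client.schema.create_class(article_class).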

Distance Implementations and Optimizations

Even with ANN-indexes, which reduce the number of distance calculations necessary, a vector database still spends a large portion of its compute time calculating vector distances. As a result, it is very important that the engine can do this not just correctly, but also efficiently.

The distance metrics in Weaviate have been optimized to be highly efficient using "Single Instruction, Multiple Data" (SIMD) instruction sets. Using these instructions, a CPU can do multiple calculations in a single CPU cycle. To achieve this, we had to write some of our distance metrics in pure Assembly code. Here is an overview of the current state of optimizations, including which distance metrics have SIMD optimizations for which architectures.

Open-source Contributions

Weaviate is open-source and values feedback and input from the community. A community member contributed to the Weaviate project by adding two new metrics to the 1.15 release. How cool is that! If this is something you're interested in, here is the repository with the implementation of the current metrics.

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

 
