
Coursera NLP Specialization Course 01: Natural Language Processing with Classification and Vector Spaces Notes, Week 04 (Final)


Natural Language Processing with Classification and Vector Spaces

Course Certificate

(course certificate image omitted)

This post contains my study notes for the course Natural Language Processing with Classification and Vector Spaces. They are for personal learning only; if there is any infringement, please contact me for removal.


Week 04: Machine Translation and Document Search

Learn to transform word vectors and assign them to subsets using locality sensitive hashing, in order to perform machine translation and document search.

Learning Objectives


  • Gradient descent
  • Approximate nearest neighbors
  • Locality sensitive hashing
  • Hash functions
  • Hash tables
  • K nearest neighbors
  • Document search
  • Machine translation
  • Frobenius norm

Transforming word vectors

In the previous week, I showed you how we can plot word vectors. Now, you will see how you can take a word vector and learn a mapping that will allow you to translate words by learning a “transformation matrix”. Here is a visualization:

(figure omitted)

Note that the word “chat” in French means cat. You can learn that by taking the vector corresponding to “cat” in English, multiplying it by a matrix that you learn, and then using cosine similarity between the output and all the French vectors. You should see that the closest result is the vector that corresponds to “chat”.

Here is a visualization of that showing you the aligned vectors:

(figure omitted)

Note that X corresponds to the matrix of English word vectors and Y corresponds to the matrix of French word vectors. R is the mapping matrix.
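To make this concrete, here is a minimal sketch of translating a single word once some transformation matrix is available. The tiny 2-dimensional embeddings and the matrix R below are made-up placeholders; the real assignment uses 300-dimensional embeddings and learns R with gradient descent:

import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two 1-D vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Made-up toy embeddings (2-dimensional, purely for illustration)
en_embeddings = {"cat": np.array([1.0, 2.0])}
fr_embeddings = {"chat": np.array([2.1, 0.9]), "chien": np.array([-1.0, 3.0])}

# Pretend this R has already been learned (made-up numbers)
R = np.array([[0.1, 1.0],
              [1.0, -0.1]])

e = en_embeddings["cat"]
f_pred = np.dot(e, R)  # transform the English embedding toward the French space

# Pick the French word whose embedding is most similar to the transformed vector
best = max(fr_embeddings, key=lambda w: cosine_similarity(f_pred, fr_embeddings[w]))
print(best)  # 'chat' with these toy numbers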

Steps required to learn R:

  • Initialize R
  • In a for loop, repeat:

$$Loss = \|\mathbf{XR} - \mathbf{Y}\|_F$$

$$g = \frac{d}{dR} Loss$$

$$R = R - \alpha * g$$
Here is an example showing how the Frobenius norm works. In general,

$$\|\mathbf{A}\|_F \equiv \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} |a_{ij}|^2}$$

For example, with

$$\mathbf{A} = \begin{pmatrix} 2 & 2 \\ 2 & 2 \end{pmatrix}$$

$$\|\mathbf{A}\|_F = \sqrt{2^2 + 2^2 + 2^2 + 2^2} = 4$$

In summary you are making use of the following:

  • $\mathbf{XR} \approx \mathbf{Y}$
  • minimize $\|\mathbf{XR} - \mathbf{Y}\|_F^2$
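As a quick illustration of these steps, here is a minimal NumPy sketch of the training loop on made-up toy matrices (the sizes and learning rate are placeholder assumptions; the graded version appears later in this post as align_embeddings):

import numpy as np

np.random.seed(1)
m, n = 8, 4                      # toy sizes: 8 word pairs, 4-dimensional embeddings
X = np.random.rand(m, n)         # English embeddings, one word per row
Y = np.random.rand(m, n)         # French embeddings, aligned row by row
R = np.random.rand(n, n)         # transformation matrix to learn
alpha = 0.01                     # learning rate

for step in range(100):
    diff = X @ R - Y                      # XR - Y
    loss = np.sum(diff ** 2) / m          # squared Frobenius norm, averaged over rows
    gradient = (2 / m) * (X.T @ diff)     # d(loss)/dR
    R -= alpha * gradient                 # gradient descent update

print(f"loss after training: {loss:.4f}")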

Lab: Rotation matrices in R2

Vector manipulation in Python

In this lab, you will have the opportunity to practice once again with the NumPy library. This time, we will explore some advanced operations with arrays and matrices.

At the end of the previous module, we used PCA to transform a set of many variables into a set of only two uncorrelated variables. This was done by means of a transformation of the data called rotation.

In this week’s assignment, you will need to find a transformation matrix from English to French vector space embeddings. Such a transformation matrix is nothing else but a matrix that rotates and scales vector spaces.

In this notebook, we will explain in detail the rotation transformation.

Transforming vectors

There are three main vector transformations:

  • Scaling
  • Translation
  • Rotation

In previous notebooks, we applied the first two kinds of transformations. Now, let us learn how to use a fundamental transformation on vectors called rotation.

The rotation operation changes the direction of a vector, leaving unaffected its dimensionality and its norm. Let us explain this with some examples.

In the following cells, we will define a NumPy matrix and a column vector as a NumPy array. Soon we will explain how this is related to matrix rotation.

import numpy as np                     # Import numpy for array manipulation
import matplotlib.pyplot as plt        # Import matplotlib for charts
from utils_nb import plot_vectors      # Function to plot vectors (arrows)
Example 1
# Create a 2 x 2 matrix
R = np.array([[-2, 0],
              [0, 2]])
              
x = np.array([[1, 1]]) # Create a row vector as a NumPy array with a single row

The dot product between a square matrix and the transpose of a row vector produces a rotation and scaling of the original vector.

Remember that our recommended way to get the dot product in Python is np.dot(a, b):

y = np.dot(R, x.T) # Apply the dot product between R and x.T
y                  # Column vector as a NumPy array with a single column

Output

array([[-2],
       [ 2]])

We are going to use Pyplot to visually inspect the effect of the rotation on 2D vectors. For that, we have created a function plot_vectors() that takes care of all the intricate parts of the visual formatting. The code for this function is inside the utils_nb.py file.

Now we can plot the vector $\vec{x} = [1, 1]$ in a Cartesian plane. The Cartesian plane will be centered at [0, 0], and its x and y limits will be between [-4, +4].

plot_vectors([x], axes=[4, 4], fname='transform_x.svg')

Output

(figure omitted)

Now, let’s plot in the same system our vector $\vec{x} = [1, 1]$ and the dot product of the matrix with x.T:

$$R = \begin{bmatrix} -2 & 0 \\ 0 & 2 \end{bmatrix}$$

$$y = R \cdot x.T$$

plot_vectors([x, y], axes=[4, 4], fname='transformx_and_y.svg')

Output

(figure omitted)

Note that the vector x (black) is transformed into vector y (blue).

Example 2

We are going to use Pyplot to visually inspect the effect of the rotation on 2D vectors. For that, we have created a function that takes care of all the intricate parts of the visual formatting. The following procedure plots an arrow within a Pyplot canvas.

Data that is composed of 2 real attributes belongs to a $R \times R$ or $R^2$ space. Rotation matrices in $R^2$ rotate a given vector $\vec{x}$ by a counterclockwise angle $\theta$ in a fixed coordinate system. Rotation matrices are of the form:

$$Ro = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$$

(Note: This notebook uses $y = Ro \cdot x.T$. But if you use $y = x \cdot Ro$, then the rotation matrices in $R^2$ rotate a given vector $\vec{x}$ by a clockwise angle $\theta$ in a fixed coordinate system.)

The trigonometric functions in NumPy require the angle in radians, not in degrees. In the next cell, we define a rotation matrix that rotates vectors counterclockwise by $100^{\circ}$.

angle = 100 * (np.pi / 180) # Convert degrees to radians

Ro = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle), np.cos(angle)]])

x2 = np.array([[2, 2]])    # Row vector as a NumPy array
y2 = np.dot(Ro, x2.T)

print('Rotation matrix')
print(Ro)
print('\nRotated vector')
print(y2)

print('\n x2 norm', np.linalg.norm(x2))
print('\n y2 norm', np.linalg.norm(y2))
print('\n Rotation matrix norm', np.linalg.norm(Ro))

Output

Rotation matrix
[[-0.17364818 -0.98480775]
 [ 0.98480775 -0.17364818]]

Rotated vector
[[-2.31691186]
 [ 1.62231915]]

 x2 norm 2.8284271247461903

 y2 norm 2.82842712474619

 Rotation matrix norm 1.414213562373095

plot_vectors([x2, y2], fname='transform_02.svg')

Output

(figure omitted)

Some points to note:

  • The norm of the input vector is the same as the norm of the output vector. Rotation matrices do not modify the norm of the vector, only its direction.
  • The norm of any $R^2$ rotation matrix is always $\sqrt{2} \approx 1.414214$.

Frobenius Norm

The Frobenius norm is the generalization to matrices of the already known norm function for vectors:

$$\|\vec{a}\| = \sqrt{\vec{a} \cdot \vec{a}}$$

For a given $R^2$ matrix $A$, the Frobenius norm is defined as:

$$\|\mathbf{A}\|_F \equiv \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} |a_{ij}|^2}$$

A = np.array([[2, 2],
              [2, 2]])

np.square() is a way to square each element of a matrix. Its outcome is equivalent to that of using the * operator with numpy arrays.

A_squared = np.square(A)
A_squared

A_squared = A * A
A_squared

Output (the same for both)

array([[4, 4],
       [4, 4]])

Now you can sum over the elements of the resulting array, and then get the square root of the sum.

A_Frobenius = np.sqrt(np.sum(A_squared))
A_Frobenius

Output:4.0

That was the extended version of the np.linalg.norm() function. You can check that it yields the same result.

print('Frobenius norm of the Rotation matrix')
print(np.sqrt(np.sum(Ro * Ro)), '== ', np.linalg.norm(Ro))

Output

Frobenius norm of the Rotation matrix
1.414213562373095 ==  1.414213562373095

Congratulations!! We’ve covered a few more matrix operations in this lab. This will come in handy in this week’s programming assignment!

K-nearest neighbors

After you have computed the output of XR you get a vector. You then need to find the most similar vectors to your output. Here is a visual example:

(figure omitted)

In the video, we mentioned that if you were in San Francisco and had friends all over the world, you would want to find your nearest neighbors. Going over all the countries one at a time would be expensive, so we will introduce hashing to show you how to do the lookup much faster.

Hash tables and hash functions

Imagine you had to cluster the following figures into different buckets:

(figure omitted)

Note that the blue, red, and gray figures would each be clustered with one another.

(figure omitted)

You can think of a hash function as a function that takes data of arbitrary size and maps it to a fixed value. The values returned are known as hash values, or simply hashes.

(figure omitted)

The diagram above shows a concrete example of a hash function which takes a vector and returns a value. Then you can mod that value by the number of buckets and put that number in its corresponding bucket. For example, 14 is in the 4th bucket, 17 & 97 are in the 7th bucket. Let’s take a look at how you can do it using some code.

(figure omitted)

The code snippet above creates a basic hash table which consists of hashed values inside their buckets. basic_hash_table takes in value_l (a list of values to be hashed) and n_buckets, and mods each value by the number of buckets. To create the hash_table, you first initialize n_buckets empty buckets (each value will go to one of them). For each value in your list of values, you feed it into your hash_function, get the hash_value, and append the value to the list of values in the corresponding bucket.

Now, given an input, you don’t have to compare it to all the other examples; you only need to compare it to the values in the hash bucket that the input has been hashed to.
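The code from the screenshot is not reproduced in these notes; a minimal sketch of such a basic hash table (essentially the same basic_hash_table function that appears in the lab further down) looks like this:

def basic_hash_table(value_l, n_buckets):
    # Hash function: remainder of integer division by the number of buckets
    def hash_function(value, n_buckets):
        return int(value) % n_buckets

    hash_table = {i: [] for i in range(n_buckets)}    # one empty list per bucket
    for value in value_l:
        hash_value = hash_function(value, n_buckets)  # bucket id for this value
        hash_table[hash_value].append(value)          # store the value in its bucket
    return hash_table

print(basic_hash_table([100, 10, 14, 17, 97], n_buckets=10))
# 100 and 10 land in bucket 0, 14 in bucket 4, 17 and 97 in bucket 7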

When hashing you sometimes want similar words or similar numbers to be hashed to the same bucket. To do this, you will use “locality sensitive hashing.” Locality is another word for “location”. So locality sensitive hashing is a hashing method that cares very deeply about assigning items based on where they’re located in vector space.

Locality sensitive hashing

Locality sensitive hashing is a technique that allows you to hash similar inputs into the same buckets with high probability.

(figure omitted)

Instead of the typical buckets we have been using, you can think of clustering the points by deciding whether they are above or below the line. Now as we go to higher dimensions (say n-dimensional vectors), you would be using planes instead of lines. Let’s look at a concrete example:

(figure omitted)

Given some point located at (1,1) and three vectors $V_1 = (1, 2)$, $V_2 = (-1, 1)$, $V_3 = (-2, -1)$, you will see what happens when we take the dot product. First note that the dashed line is our plane. The vector $P = (1, 1)$ is perpendicular to that line (plane). Any vector above the dashed line, when dotted with (1,1), gives a positive number. Any vector below the dashed line, when dotted with (1,1), gives a negative number. Any vector on the dashed line dotted with (1,1) gives a dot product of 0.

Here is how to visualize a projection (i.e. a dot product between two vectors):

(figure omitted)

When you take the dot product of a vector $V_1$ and $P$, and then take the magnitude or length of that result, you get the black line (labelled as Projection). The sign of the dot product indicates on which side of the plane the vector lies.

Locality Sensitive Hashing (LSH) is a technique for solving the Approximate Nearest Neighbor (ANN) search problem. In the ANN problem, given a query point, we want to quickly find the data points in a large dataset that are closest to that query point.

LSH performs approximate nearest neighbor search by mapping data points to buckets in a hash space. Similar data points are mapped to the same bucket with higher probability than dissimilar data points. In this way, LSH places similar data points in the same bucket, which speeds up the approximate nearest neighbor search.

The core idea of LSH is to design a hash function such that similar data points are mapped to the same hash bucket with high probability, while dissimilar data points are mapped to different buckets with high probability. Commonly used LSH methods include:

  1. MinHash: generates hash signatures by randomly permuting the dataset and taking the minimum hash value, enabling approximate matching of similar data points.
  2. Locality-Sensitive Hashing for Cosine Similarity (LSH-Cosine): hash functions specially designed for approximate nearest neighbor search based on cosine similarity.
  3. Locality-Sensitive Hashing for Euclidean Distance (LSH-Euclidean): hash functions specially designed for approximate nearest neighbor search based on Euclidean distance.

LSH searches large datasets efficiently and suits scenarios that need fast approximate nearest neighbor lookups, such as image search, recommender systems, and data deduplication. However, because the search is approximate, LSH may reduce the accuracy of the results, so it may not be suitable for applications with strict accuracy requirements.

Multiple Planes

You can use multiple planes to get a single hash value. Let’s take a look at the following example:

(figure omitted)

Given some point denoted by $v$, you can run it through several projections $P_1, P_2, P_3$ to get one hash value. If you compute $P_1 v^T$ you get a positive number, so you set $h_1 = 1$. $P_2 v^T$ gives you a positive number, so you get $h_2 = 1$. $P_3 v^T$ is a negative number, so you set $h_3$ to be 0. You can then compute the hash value as follows:

$$hash = 2^0 \times h_1 + 2^1 \times h_2 + 2^2 \times h_3 = 1 \times 1 + 2 \times 1 + 4 \times 0 = 3$$

Another way to think of it is that, for each plane, you ask on which side of it the point lies (i.e. 1 or 0), until your point is bounded by the surrounding planes. The hash value is then defined as:

$$hash\_value = \sum_{i}^{H} 2^{i} \times h_{i}$$

Here is how you can code it up:

(figure omitted)

P_l is the list of planes. You initialize the value to 0, and then you iterate over all the planes P, keeping track of the index. You get the sign by finding the sign of the dot product between v and your plane P. If it is positive you set it equal to 1, otherwise you set it equal to 0. You then add the score for the $i$th plane to the hash value by computing $2^i \times h_i$.
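The slide's code is not reproduced here; a minimal sketch of that loop (the same idea as the hash_multi_plane function in the lab further down) is:

import numpy as np

def hash_multi_plane(P_l, v):
    hash_value = 0
    for i, P in enumerate(P_l):                 # P_l: list of plane normal vectors
        sign = np.sign(np.dot(P, v.T)).item()   # which side of plane i is v on?
        h_i = 1 if sign >= 0 else 0             # encode the side as 1 or 0
        hash_value += 2**i * h_i                # add this plane's contribution
    return hash_value

planes = [np.array([[1, 1]]), np.array([[-1, 1]]), np.array([[-1, -1]])]
print(hash_multi_plane(planes, np.array([[2, 2]])))  # 3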

Lab: Hash tables

Hash functions and multiplanes

In this lab, we are going to practice the most important concepts related to the hash functions explained in the videos. You will be using these in this week’s assignment.

A key point for the lookup using hash functions is the calculation of the hash key or bucket id that we assign for a given entry. In this notebook, we will cover:

  • Basic hash tables
  • Multiplanes
  • Random planes

Basic Hash tables

Hash tables are data structures that allow indexing data to make lookup tasks more efficient.
In this part, you will see the implementation of the simplest hash function.

import numpy as np                # library for array and matrix manipulation
import pprint                     # utilities for console printing 
from utils_nb import plot_vectors # helper function to plot vectors
import matplotlib.pyplot as plt   # visualization library

pp = pprint.PrettyPrinter(indent=4) # Instantiate a pretty printer

In the next cell, we will define a straightforward hash function for integer numbers. The function will receive a list of integer numbers and the desired amount of buckets. The function will produce a hash table stored as a dictionary, where keys contain the hash keys, and the values will provide the hashed elements of the input list.

The hash function is just the remainder of the integer division between each element and the desired number of buckets.

def basic_hash_table(value_l, n_buckets):
    
    def hash_function(value, n_buckets):
        return int(value) % n_buckets
    
    hash_table = {i:[] for i in range(n_buckets)} # Initialize all the buckets in the hash table as empty lists

    for value in value_l:
        hash_value = hash_function(value,n_buckets) # Get the hash key for the given value
        hash_table[hash_value].append(value) # Add the element to the corresponding bucket
    
    return hash_table

Now let’s see the hash table function in action. The pretty print function (pprint()) will produce a visually appealing output.

value_l = [100, 10, 14, 17, 97] # Set of values to hash
hash_table_example = basic_hash_table(value_l, n_buckets=10)
pp.pprint(hash_table_example)

Output

{   0: [100, 10],
    1: [],
    2: [],
    3: [],
    4: [14],
    5: [],
    6: [],
    7: [17, 97],
    8: [],
    9: []}

In this case, the bucket key is simply the rightmost digit of each number.

Planes

Multiplane hash functions are another type of hash function. They are based on the idea of numbering every single region that is formed by the intersection of n planes. In the following code, we show the most basic form of the multiplane principle. First, with a single plane:

P = np.array([[1, 1]]) # Define a single plane. 
fig, ax1 = plt.subplots(figsize=(8, 8)) # Create a plot

plot_vectors([P], axes=[2, 2], ax=ax1) # Plot the plane P as a vector

# Plot  random points. 
for i in range(0, 10):
        v1 = np.array(np.random.uniform(-2, 2, 2)) # Get a pair of random numbers between -2 and 2
        side_of_plane = np.sign(np.dot(P, v1.T)) 
        
        # Color the points depending on the sign of the result of np.dot(P, point.T)
        if side_of_plane == 1:
            ax1.plot([v1[0]], [v1[1]], 'bo') # Plot blue points
        else:
            ax1.plot([v1[0]], [v1[1]], 'ro') # Plot red points

plt.show()

Output

(figure omitted)

The first thing to note is that the vector that defines the plane does not mark the boundary between the two sides of the plane. It marks the direction in which you find the ‘positive’ side of the plane. Not intuitive at all!

If we want to plot the separation plane, we need to plot a line that is perpendicular to our vector P. We can get such a line using a $90^{\circ}$ rotation matrix.

Feel free to change the direction of the plane P.

P = np.array([[1, 2]])  # Define a single plane. You may change the direction

# Get a new plane perpendicular to P. We use a rotation matrix
PT = np.dot([[0, 1], [-1, 0]], P.T).T  

fig, ax1 = plt.subplots(figsize=(8, 8)) # Create a plot with custom size

plot_vectors([P], colors=['b'], axes=[2, 2], ax=ax1) # Plot the plane P as a vector

# Plot the plane P as 2 vectors.
# We scale by 4 just to get the arrows outside the current box
plot_vectors([PT * 4, PT * -4], colors=['k', 'k'], axes=[4, 4], ax=ax1)

# Plot 20 random points. 
for i in range(0, 20):
        v1 = np.array(np.random.uniform(-4, 4, 2)) # Get a pair of random numbers between -4 and 4 
        side_of_plane = np.sign(np.dot(P, v1.T)) # Get the sign of the dot product with P
        # Color the points depending on the sign of the result of np.dot(P, point.T)
        if side_of_plane == 1:
            ax1.plot([v1[0]], [v1[1]], 'bo') # Plot a blue point
        else:
            ax1.plot([v1[0]], [v1[1]], 'ro') # Plot a red point

plt.show()

Output

(figure omitted)

Now, let us see what is inside the code that colors the points.

P = np.array([[1, 1]])      # Single plane
v1 = np.array([[1, 2]])     # Sample point 1
v2 = np.array([[-1, 1]])    # Sample point 2
v3 = np.array([[-2, -1]])   # Sample point 3
np.dot(P, v1.T) # array([[3]])
np.dot(P, v2.T) # array([[0]])
np.dot(P, v3.T) # array([[-3]])

The function below checks on which side of the plane P the vector v is located.

def side_of_plane(P, v):
    dotproduct = np.dot(P, v.T) # Get the dot product P * v'
    sign_of_dot_product = np.sign(dotproduct) # The sign of the elements of the dotproduct matrix 
    sign_of_dot_product_scalar = sign_of_dot_product.item() # The value of the first item
    return sign_of_dot_product_scalar
side_of_plane(P, v1) # In which side is [1, 2] # 1
side_of_plane(P, v2) # In which side is [-1, 1] # 0

side_of_plane(P, v3) # In which side is [-2, -1] # -1

Hash Function with multiple planes

In the following section, we are going to define a hash function with a list of three custom planes in 2D.

P1 = np.array([[1, 1]])   # First plane 2D
P2 = np.array([[-1, 1]])  # Second plane 2D
P3 = np.array([[-1, -1]]) # Third plane 2D
P_l = [P1, P2, P3]  # List of arrays. It is the multi plane

# Vector to search
v = np.array([[2, 2]])

The next function creates a hash value based on a set of planes. The output value is a combination of the sides of the planes on which the vector is located, with respect to the collection of planes.

We can think of this list of planes as a set of basic hash functions, each of which can produce only 1 or 0 as output.

def hash_multi_plane(P_l, v):
    hash_value = 0
    for i, P in enumerate(P_l):
        sign = side_of_plane(P,v)
        hash_i = 1 if sign >=0 else 0
        hash_value += 2**i * hash_i
    return hash_value
hash_multi_plane(P_l, v) # Find the hash bucket that contains this vector

Output: 3

Random Planes

In the cell below, we create a set of three random planes

np.random.seed(0)
num_dimensions = 2 # is 300 in assignment
num_planes = 3 # is 10 in assignment
random_planes_matrix = np.random.normal(
                       size=(num_planes,
                             num_dimensions))
print(random_planes_matrix)

Output

[[ 1.76405235  0.40015721]
 [ 0.97873798  2.2408932 ]
 [ 1.86755799 -0.97727788]]

v = np.array([[2, 2]])

The next function is similar to the side_of_plane() function, but it evaluates more than one plane each time. The result is an array with the side of the plane for v, one entry per plane in the set P.

# Side of the plane function. The result is a matrix
def side_of_plane_matrix(P, v):
    dotproduct = np.dot(P, v.T)
    sign_of_dot_product = np.sign(dotproduct) # Get the sign (+1, 0 or -1) of each element of the dot product
    return sign_of_dot_product

Get the side of the plane of the vector [2, 2] for the set of random planes.

sides_l = side_of_plane_matrix(
            random_planes_matrix, v)
sides_l

Output

array([[1.],
       [1.],
       [1.]])

Now, let us use the former function to define our multiplane hash function

def hash_multi_plane_matrix(P, v, num_planes):
    sides_matrix = side_of_plane_matrix(P, v) # Get the side of planes for P and v
    hash_value = 0
    for i in range(num_planes):
        sign = sides_matrix[i].item() # Get the value inside the matrix cell
        hash_i = 1 if sign >=0 else 0
        hash_value += 2**i * hash_i # sum 2^i * hash_i
        
    return hash_value

Print the bucket hash for the vector v = [2, 2].

hash_multi_plane_matrix(random_planes_matrix, v, num_planes)

Output: 7

Note

This showed you how to make one set of random planes. You will make multiple sets of random planes in order to make the approximate nearest neighbors more accurate.

Document vectors

Before we finish this lab, remember that you can represent a document as a vector by adding up the word vectors for the words inside the document. In this example, our embedding contains only three words, each represented by a 3D array.

word_embedding = {"I": np.array([1,0,1]),
                   "love": np.array([-1,0,1]),
                   "learning": np.array([1,0,1])
                  }
words_in_document = ['I', 'love', 'learning', 'not_a_word']
document_embedding = np.array([0,0,0])
for word in words_in_document:
    document_embedding += word_embedding.get(word,0)
    
print(document_embedding)

Output

[1 0 3]

Congratulations! You’ve now completed this lab on hash functions and multiplanes!

Approximate nearest neighbors

Approximate nearest neighbors does not give you the full nearest neighbors but gives you an approximation of the nearest neighbors. It usually trades off accuracy for efficiency. Look at the following plot:

(figure omitted)

You are trying to find the nearest neighbors for the red vector (point). The first set of random planes gave you the green points. You then ran it a second time, and this time you got the blue points. The third time you got the orange points as the neighbors. So you can see that the more times you do it, the more likely you are to get all the neighbors. Here is the code for one set of random planes (shown only on the slide; a rough sketch follows below). Make sure you understand what is going on.

(figure omitted)
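The code from that slide is not reproduced here. As a rough illustration of repeating the hashing with several independent sets of random planes, here is a self-contained toy sketch (the sizes, the bucket_of helper, and the random "document" vectors are all made-up assumptions, not the assignment's code):

import numpy as np

np.random.seed(0)
num_dims, num_planes, num_universes = 300, 10, 25

# One independent set ("universe") of random planes per repetition
planes_l = [np.random.normal(size=(num_dims, num_planes)) for _ in range(num_universes)]

def bucket_of(v, planes):
    # Map a (1, num_dims) row vector to a bucket id using one set of planes
    h = (np.dot(v, planes) >= 0).astype(int).flatten()       # side of each plane as 0/1
    return int(sum(2**i * h_i for i, h_i in enumerate(h)))   # combine into one bucket id

docs = np.random.normal(size=(1000, num_dims))   # toy stand-ins for document vectors
query = np.random.normal(size=(1, num_dims))

# Collect candidate neighbors: any document that shares a bucket with the query
# in at least one universe. More universes -> more of the true neighbors found.
candidates = set()
for planes in planes_l:
    q_bucket = bucket_of(query, planes)
    for i, doc in enumerate(docs):
        if bucket_of(doc.reshape(1, -1), planes) == q_bucket:
            candidates.add(i)

print(f"{len(candidates)} candidate neighbors out of {len(docs)} documents")

With more universes, the union of bucket-mates grows, so more of the true nearest neighbors end up among the candidates, at the cost of more hashing work.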

Approximate Nearest Neighbors (ANN) refers to methods for quickly finding approximately nearest neighbors in large datasets. In the nearest neighbor problem, given a query point, we want to find the data points in the dataset closest to it. However, for large datasets, computing exact nearest neighbors can be very time-consuming, because it requires traversing the entire dataset, computing the distance from every data point to the query point, and picking the closest one.

The goal of approximate nearest neighbor search is to sacrifice some accuracy for speed, finding data points approximately closest to the query in much less time, without traversing the entire dataset. ANN methods fall into two broad categories:

  1. Hashing-based methods: these map data points to buckets in a hash space. Similar data points are mapped to the same hash bucket with higher probability than dissimilar data points, which speeds up the approximate nearest neighbor search. Common hashing-based methods include Locality Sensitive Hashing (LSH) and hashing with hyperplanes.

  2. Index-based methods: these build data structures that organize the dataset for fast approximate nearest neighbor lookup. Common index-based methods include KD-trees (K-Dimensional Trees), ball trees, and nearest neighbor graphs.

Approximate nearest neighbor methods search large datasets efficiently and suit applications that need fast approximate lookups, such as image search, recommender systems, and speech recognition. However, because some accuracy is sacrificed, the results may be less precise, so applications with strict accuracy requirements need to balance search speed against search accuracy.

Searching documents

The previous video shows you a toy example of how you can actually represent a document as a vector.

(figure omitted)

In this example, you just add the word vectors of a document to get the document vector. So in summary you should now be familiar with the following concepts:

(figure omitted)

Good luck with the programming assignment!

Practice Quiz

(quiz screenshots omitted)

Programming Assignment: Word Translation

Assignment 4 - Naive Machine Translation and LSH

You will now implement your first machine translation system and then you
will see how locality sensitive hashing works. Let’s get started by importing
the required functions!

If you are running this notebook in your local computer, don’t forget to
download the twitter samples and stopwords from nltk.

nltk.download('stopwords')
nltk.download('twitter_samples')
import pdb
import pickle
import string

import time

import nltk
import numpy as np
from nltk.corpus import stopwords, twitter_samples

from utils import (cosine_similarity, get_dict,
                   process_tweet)
from os import getcwd

import w4_unittest

# add folder, tmp2, from our local workspace containing pre-downloaded corpora files to nltk's data path
filePath = f"{getcwd()}/tmp2/"
nltk.data.path.append(filePath)

1. The Word Embeddings Data for English and French Words

Write a program that translates English to French.

The Data

The full dataset for English embeddings is about 3.64 gigabytes, and the French
embeddings are about 629 megabytes. To prevent the Coursera workspace from
crashing, we’ve extracted a subset of the embeddings for the words that you’ll
use in this assignment.

The subset of data

To do the assignment on the Coursera workspace, we’ll use the subset of word embeddings.

en_embeddings_subset = pickle.load(open("./data/en_embeddings.p", "rb"))
fr_embeddings_subset = pickle.load(open("./data/fr_embeddings.p", "rb"))

Look at the data

  • en_embeddings_subset: the key is an English word, and the value is a
    300 dimensional array, which is the embedding for that word.
'the': array([ 0.08007812,  0.10498047,  0.04980469,  0.0534668 , -0.06738281, ....
  • fr_embeddings_subset: the key is a French word, and the value is a 300
    dimensional array, which is the embedding for that word.
'la': array([-6.18250e-03, -9.43867e-04, -8.82648e-03,  3.24623e-02,...

Load two dictionaries mapping the English to French words

  • A training dictionary
  • and a testing dictionary.
# loading the english to french dictionaries
en_fr_train = get_dict('./data/en-fr.train.txt')
print('The length of the English to French training dictionary is', len(en_fr_train))
en_fr_test = get_dict('./data/en-fr.test.txt')
print('The length of the English to French test dictionary is', len(en_fr_test))

Output

The length of the English to French training dictionary is 5000
The length of the English to French test dictionary is 1500

Looking at the English French dictionary

  • en_fr_train is a dictionary where the key is the English word and the value
    is the French translation of that English word.
{'the': 'la',
 'and': 'et',
 'was': 'était',
 'for': 'pour',
  • en_fr_test is similar to en_fr_train, but is a test set. We won’t look at it
    until we get to testing.

1.1 Generate Embedding and Transform Matrices

Exercise 1 - get_matrices

Translating English dictionary to French by using embeddings.

You will now implement a function get_matrices, which takes the loaded data
and returns matrices X and Y.

Inputs:

  • en_fr : English to French dictionary
  • en_embeddings : English to embeddings dictionary
  • fr_embeddings : French to embeddings dictionary

Returns:

  • Matrix X and matrix Y, where each row in X is the word embedding for an
    english word, and the same row in Y is the word embedding for the French
    version of that English word.

(figure omitted)

Use the en_fr dictionary to ensure that the ith row in the X matrix
corresponds to the ith row in the Y matrix.

Instructions: Complete the function get_matrices():

  • Iterate over English words in the en_fr dictionary.
  • Check if the word has both an English and a French embedding.
# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def get_matrices(en_fr, french_vecs, english_vecs):
    """
    Input:
        en_fr: English to French dictionary
        french_vecs: French words to their corresponding word embeddings.
        english_vecs: English words to their corresponding word embeddings.
    Output: 
        X: a matrix where each row is an English word embedding.
        Y: a matrix where each row is the French word embedding aligned with the same row of X.
    """

    ### START CODE HERE ###

    # X_l and Y_l are lists of the english and french word embeddings
    X_l = list()
    Y_l = list()

    # get the english words (the keys in the dictionary) and store in a set()
    english_set = set(english_vecs.keys())

    # get the french words (keys in the dictionary) and store in a set()
    french_set = set(french_vecs.keys())

    # store the french words that are part of the english-french dictionary (these are the values of the dictionary)
    french_words = set(en_fr.values())

    # loop through all english, french word pairs in the english french dictionary
    for en_word, fr_word in en_fr.items():

        # check that the french word has an embedding and that the english word has an embedding
        if fr_word in french_set and en_word in english_set:

            # get the english embedding
            en_vec = english_vecs[en_word]

            # get the french embedding
            fr_vec = french_vecs[fr_word]

            # add the english embedding to the list
            X_l.append(en_vec)

            # add the french embedding to the list
            Y_l.append(fr_vec)

    # stack the vectors of X_l into a matrix X
    X = np.vstack(X_l)

    # stack the vectors of Y_l into a matrix Y
    Y = np.vstack(Y_l)
    ### END CODE HERE ###

    return X, Y

Now we will use the function get_matrices() to obtain the matrices X_train and Y_train
of English and French word embeddings in their corresponding vector spaces.

# UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# You do not have to input any code in this cell, but it is relevant to grading, so please do not change anything

# getting the training set:
X_train, Y_train = get_matrices(
    en_fr_train, fr_embeddings_subset, en_embeddings_subset)

# Test your function
w4_unittest.test_get_matrices(get_matrices)

Output

All tests passed

2 - Translations

(figure omitted)

Write a program that translates English words to French words using word embeddings and vector space models.

2.1 - Translation as Linear Transformation of Embeddings

Given dictionaries of English and French word embeddings you will create a transformation matrix R

  • Given an English word embedding $\mathbf{e}$, you can multiply $\mathbf{e}\mathbf{R}$ to get a new word embedding $\mathbf{f}$.
    • Both $\mathbf{e}$ and $\mathbf{f}$ are row vectors.
  • You can then compute the nearest neighbors to $\mathbf{f}$ in the French embeddings and recommend the word that is most similar to the transformed word embedding.

Describing translation as the minimization problem

Find a matrix R that minimizes the following equation.

$$\arg \min_{\mathbf{R}} \|\mathbf{XR} - \mathbf{Y}\|_{F} \tag{1}$$

Frobenius norm

The Frobenius norm of a matrix $A$ (assuming it is of dimension $m, n$) is defined as the square root of the sum of the absolute squares of its elements:

$$\|\mathbf{A}\|_{F} \equiv \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} |a_{ij}|^{2}} \tag{2}$$

Actual loss function

In real-world applications, the Frobenius norm loss

$$\|\mathbf{XR} - \mathbf{Y}\|_{F}$$

is often replaced by its squared value divided by $m$:

$$\frac{1}{m}\|\mathbf{XR} - \mathbf{Y}\|_{F}^{2}$$

where $m$ is the number of examples (rows in $\mathbf{X}$).

  • The same R is found when using this loss function versus the original Frobenius norm.
  • The reason for taking the square is that it’s easier to compute the gradient of the squared Frobenius.
  • The reason for dividing by m m m is that we’re more interested in the average loss per embedding than the loss for the entire training set.
    • The loss for all training set increases with more words (training examples),
      so taking the average helps us to track the average loss regardless of the size of the training set.
We will now implement the translation mechanism described in this section.

Exercise 2 - compute_loss

Step 1: Computing the loss

  • The loss function will be squared Frobenius norm of the difference between
    matrix and its approximation, divided by the number of training examples m m m.
  • Its formula is:
    $$L(X, Y, R) = \frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{n} (a_{ij})^{2}$$

where $a_{ij}$ is the value in the $i$th row and $j$th column of the matrix $\mathbf{XR} - \mathbf{Y}$.

Instructions: complete the compute_loss() function

  • Compute the approximation of Y by matrix multiplying X and R
  • Compute difference XR - Y
  • Compute the squared Frobenius norm of the difference and divide it by $m$.
# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def compute_loss(X, Y, R):
    '''
    Inputs: 
        X: a matrix of dimension (m,n) where the rows are the English embeddings.
        Y: a matrix of dimension (m,n) where the rows correspond to the French embeddings.
        R: a matrix of dimension (n,n) - transformation matrix from English to French vector space embeddings.
    Outputs:
        L: a scalar value - the loss for the given X, Y and R.
    '''
    ### START CODE HERE ###
    # m is the number of rows in X
    m = X.shape[0]
        
    # diff is XR - Y    
    diff = np.dot(X, R) - Y

    # diff_squared is the element-wise square of the difference    
    diff_squared = diff**2

    # sum_diff_squared is the sum of the squared elements
    sum_diff_squared = np.sum(diff_squared)

    # loss is the sum_diff_squared divided by the number of examples (m)
    loss = sum_diff_squared / m
    ### END CODE HERE ###
    return loss
# Testing your implementation.
np.random.seed(123)
m = 10
n = 5
X = np.random.rand(m, n)
Y = np.random.rand(m, n) * .1
R = np.random.rand(n, n)
print(f"Expected loss for an experiment with random matrices: {compute_loss(X, Y, R):.4f}" ) 

Output

Expected loss for an experiment with random matrices: 8.1866

Expected output:

Expected loss for an experiment with random matrices: 8.1866
# Test your function
w4_unittest.test_compute_loss(compute_loss)

Output

 All tests passed

Exercise 3 - compute_gradient

Step 2: Computing the gradient of loss with respect to transform matrix R

  • Calculate the gradient of the loss with respect to transform matrix R.

  • The gradient is a matrix that encodes how much a small change in R
    affects the change in the loss function.

  • The gradient gives us the direction in which we should decrease R
    to minimize the loss.

  • $m$ is the number of training examples (the number of rows in $X$).

  • The formula for the gradient of the loss function $L(X,Y,R)$ is:

$$\frac{d}{dR} L(X,Y,R) = \frac{d}{dR}\left(\frac{1}{m}\|XR - Y\|_F^2\right) = \frac{2}{m} X^T (XR - Y)$$

Instructions: Complete the compute_gradient function below.

# UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def compute_gradient(X, Y, R):
    '''
    Inputs: 
        X: a matrix of dimension (m,n) where the rows are the English embeddings.
        Y: a matrix of dimension (m,n) where the rows correspond to the French embeddings.
        R: a matrix of dimension (n,n) - transformation matrix from English to French vector space embeddings.
    Outputs:
        g: a matrix of dimension (n,n) - gradient of the loss function L for given X, Y and R.
    '''
    ### START CODE HERE ###
    # m is the number of rows in X
    m = X.shape[0]

    # gradient is X^T(XR - Y) * 2/m    
    gradient = 2. / m * np.dot(X.T, (np.dot(X, R) - Y))
    
    ### END CODE HERE ###
    return gradient
# Testing your implementation.
np.random.seed(123)
m = 10
n = 5
X = np.random.rand(m, n)
Y = np.random.rand(m, n) * .1
R = np.random.rand(n, n)
gradient = compute_gradient(X, Y, R)
print(f"First row of the gradient matrix: {gradient[0]}")

Output

First row of the gradient matrix: [1.3498175  1.11264981 0.69626762 0.98468499 1.33828969]

Expected output:

First row of the gradient matrix: [1.3498175  1.11264981 0.69626762 0.98468499 1.33828969]
# Test your function
w4_unittest.test_compute_gradient(compute_gradient)

Output

 All tests passed

Step 3: Finding the optimal R with Gradient Descent Algorithm

Gradient Descent

Gradient descent is an iterative algorithm used to search for the optimum of a function.

  • Earlier, we mentioned that the gradient of the loss with respect to the matrix encodes how much a tiny change in some coordinate of that matrix affects the loss function.
  • Gradient descent uses that information to iteratively change matrix R until we reach a point where the loss is minimized.

Training with a fixed number of iterations

Most of the time we iterate for a fixed number of training steps rather than iterating until the loss falls below a threshold.

Pseudocode:

  1. Calculate gradient $g$ of the loss with respect to the matrix $R$.
  2. Update $R$ with the formula:
    $$R_{\text{new}} = R_{\text{old}} - \alpha g$$

where $\alpha$ is the learning rate, which is a scalar.

Learning Rate

  • The learning rate or “step size” $\alpha$ is a coefficient which decides how much we want to change $R$ in each step.
  • If we change $R$ too much, we could skip the optimum by taking too large of a step.
  • If we make only small changes to $R$, we will need many steps to reach the optimum.
  • The learning rate $\alpha$ is used to control those changes.
  • Values of $\alpha$ are chosen depending on the problem, and we'll use learning_rate $= 0.0003$ as the default value for our algorithm.

Exercise 4 - align_embeddings

Implement align_embeddings()

# UNQ_C5 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def align_embeddings(X, Y, train_steps=100, learning_rate=0.0003, verbose=True, compute_loss=compute_loss, compute_gradient=compute_gradient):
    '''
    Inputs:
        X: a matrix of dimension (m,n) where the rows are the English embeddings.
        Y: a matrix of dimension (m,n) where the rows correspond to the French embeddings.
        train_steps: positive int - describes how many steps the gradient descent algorithm will take.
        learning_rate: positive float - describes how large the gradient descent steps will be.
    Outputs:
        R: a matrix of dimension (n,n) - the projection matrix that minimizes the F norm ||X R -Y||^2
    '''
    np.random.seed(129)

    # the number of columns in X is the number of dimensions for a word vector (e.g. 300)
    # R is a square matrix with length equal to the number of dimensions in the word embedding
    R = np.random.rand(X.shape[1], X.shape[1])

    for i in range(train_steps):
        if verbose and i % 25 == 0:
            print(f"loss at iteration {i} is: {compute_loss(X, Y, R):.4f}")
        ### START CODE HERE ###
        # use the function that you defined to compute the gradient
        gradient = compute_gradient(X, Y, R)

        # update R by subtracting the learning rate times gradient
        R -= learning_rate * gradient
        ### END CODE HERE ###
    return R
# UNQ_C6 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# You do not have to input any code in this cell, but it is relevant to grading, so please do not change anything

# Testing your implementation.
np.random.seed(129)
m = 10
n = 5
X = np.random.rand(m, n)
Y = np.random.rand(m, n) * .1
R = align_embeddings(X, Y)

Output

loss at iteration 0 is: 3.7242
loss at iteration 25 is: 3.6283
loss at iteration 50 is: 3.5350
loss at iteration 75 is: 3.4442

Expected Output:

loss at iteration 0 is: 3.7242
loss at iteration 25 is: 3.6283
loss at iteration 50 is: 3.5350
loss at iteration 75 is: 3.4442
# Test your function
w4_unittest.test_align_embeddings(align_embeddings)

Output

All tests passed

Calculate Transformation matrix R

Using just the training set, find the transformation matrix $\mathbf{R}$ by calling the function align_embeddings().

NOTE: The code cell below will take a few minutes to fully execute (~3 mins)

# UNQ_C7 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# You do not have to input any code in this cell, but it is relevant to grading, so please do not change anything
R_train = align_embeddings(X_train, Y_train, train_steps=400, learning_rate=0.8)

Output

loss at iteration 0 is: 963.0146
loss at iteration 25 is: 97.8292
loss at iteration 50 is: 26.8329
loss at iteration 75 is: 9.7893
loss at iteration 100 is: 4.3776
loss at iteration 125 is: 2.3281
loss at iteration 150 is: 1.4480
loss at iteration 175 is: 1.0338
loss at iteration 200 is: 0.8251
loss at iteration 225 is: 0.7145
loss at iteration 250 is: 0.6534
loss at iteration 275 is: 0.6185
loss at iteration 300 is: 0.5981
loss at iteration 325 is: 0.5858
loss at iteration 350 is: 0.5782
loss at iteration 375 is: 0.5735

Expected Output

loss at iteration 0 is: 963.0146
loss at iteration 25 is: 97.8292
loss at iteration 50 is: 26.8329
loss at iteration 75 is: 9.7893
loss at iteration 100 is: 4.3776
loss at iteration 125 is: 2.3281
loss at iteration 150 is: 1.4480
loss at iteration 175 is: 1.0338
loss at iteration 200 is: 0.8251
loss at iteration 225 is: 0.7145
loss at iteration 250 is: 0.6534
loss at iteration 275 is: 0.6185
loss at iteration 300 is: 0.5981
loss at iteration 325 is: 0.5858
loss at iteration 350 is: 0.5782
loss at iteration 375 is: 0.5735

2.2 - Testing the Translation

k-Nearest Neighbors Algorithm

k-Nearest neighbors algorithm

  • k-NN is a method which takes a vector as input and finds the other vectors in the dataset that are closest to it.
  • The ‘k’ is the number of “nearest neighbors” to find (e.g. k=2 finds the closest two neighbors).

Searching for the Translation Embedding

Since we’re approximating the translation function from English to French embeddings by a linear transformation matrix $\mathbf{R}$, most of the time we won’t get the exact embedding of a French word when we transform the embedding $\mathbf{e}$ of some particular English word into the French embedding space.

  • This is where $k$-NN becomes really useful! By using $1$-NN with $\mathbf{eR}$ as input, we can search for an embedding $\mathbf{f}$ (as a row) in the matrix $\mathbf{Y}$ which is the closest to the transformed vector $\mathbf{eR}$.

Cosine Similarity

The cosine similarity between vectors $u$ and $v$ is calculated as the cosine of the angle between them.
The formula is

$$\cos(u,v) = \frac{u \cdot v}{\|u\| \, \|v\|}$$

  • $\cos(u,v) = 1$ when $u$ and $v$ lie on the same line and have the same direction.
  • $\cos(u,v) = -1$ when they have exactly opposite directions.
  • $\cos(u,v) = 0$ when the vectors are orthogonal (perpendicular) to each other.

Note: Distance and similarity are pretty much opposite things.

  • We can obtain a distance metric from cosine similarity, but the cosine similarity can’t be used directly as a distance metric.
  • When the cosine similarity increases (towards $1$), the “distance” between the two vectors decreases (towards $0$).
  • We can define the cosine distance between $u$ and $v$ as
    $$d_{\text{cos}}(u,v) = 1 - \cos(u,v)$$
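The assignment imports cosine_similarity from utils; a minimal NumPy version consistent with the formula above (an assumption for 1-D vectors, not necessarily the exact utils implementation, which also handles matrix inputs) would be:

import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two 1-D vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 0.0, 1.0])
v = np.array([2.0, 0.0, 2.0])
print(cosine_similarity(u, v))      # ~1.0 (same direction)
print(1 - cosine_similarity(u, v))  # ~0.0 (cosine distance)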

Exercise 5 - nearest_neighbor

Complete the function nearest_neighbor()

Inputs:

  • Vector v,
  • A set of possible nearest neighbors candidates
  • k nearest neighbors to find.
  • The distance metric should be based on cosine similarity.
  • cosine_similarity function is already implemented and imported for you. Its arguments are two vectors and it returns the cosine of the angle between them.
  • Iterate over rows in candidates, and save the result of similarities between current row and vector v in a python list. Take care that similarities are in the same order as row vectors of candidates.
  • Now you can use numpy argsort to sort the indices for the rows of candidates.
# UNQ_C8 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def nearest_neighbor(v, candidates, k=1, cosine_similarity=cosine_similarity):
    """
    Input:
      - v, the vector you are going find the nearest neighbor for
      - candidates: a set of vectors where we will find the neighbors
      - k: top k nearest neighbors to find
    Output:
      - k_idx: the indices of the top k closest vectors in sorted form
    """
    ### START CODE HERE ###
    similarity_l = []

    # for each candidate vector...
    for row in candidates:
        # get the cosine similarity
        cos_similarity = cosine_similarity(v, row)

        # append the similarity to the list
        similarity_l.append(cos_similarity)

    # sort the similarity list and get the indices of the sorted list    
    sorted_ids = np.argsort(similarity_l)
    
    # Reverse the order of the sorted_ids array
    sorted_ids = sorted_ids[::-1]
    
    # get the indices of the k most similar candidate vectors
    k_idx = sorted_ids[0:k]
    ### END CODE HERE ###
    return k_idx
# UNQ_C9 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# You do not have to input any code in this cell, but it is relevant to grading, so please do not change anything

# Test your implementation:
v = np.array([1, 0, 1])
candidates = np.array([[1, 0, 5], [-2, 5, 3], [2, 0, 1], [6, -9, 5], [9, 9, 9]])
print(candidates[nearest_neighbor(v, candidates, 3)])

Output

[[2 0 1]
 [1 0 5]
 [9 9 9]]

Expected Output:

[[2 0 1] [1 0 5] [9 9 9]]

# Test your function
w4_unittest.test_nearest_neighbor(nearest_neighbor)

Output

All tests passed

Test your Translation and Compute its Accuracy

Exercise 6 - test_vocabulary

Complete the function test_vocabulary which takes in the English embedding matrix $X$, the French embedding matrix $Y$ and the $R$ matrix, and returns the accuracy of translations from $X$ to $Y$ by $R$.

  • Iterate over transformed English word embeddings and check if the
    closest French word vector belongs to the French word that is the actual
    translation.
  • Obtain an index of the closest French embedding by using
    nearest_neighbor (with argument k=1), and compare it to the index
    of the English embedding you have just transformed.
  • Keep track of the number of times you get the correct translation.
  • Calculate accuracy as $$\text{accuracy} = \frac{\#(\text{correct predictions})}{\#(\text{total predictions})}$$
# UNQ_C10 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def test_vocabulary(X, Y, R, nearest_neighbor=nearest_neighbor):
    '''
    Input:
        X: a matrix where the rows are the English embeddings.
        Y: a matrix where the rows correspond to the French embeddings.
        R: the transform matrix which translates word embeddings from
        English to French word vector space.
    Output:
        accuracy: the fraction of words translated correctly
    '''

    ### START CODE HERE ###
    # The prediction is X times R
    pred = np.dot(X, R)

    # initialize the number correct to zero
    num_correct = 0

    # loop through each row in pred (each transformed embedding)
    for i in range(len(pred)):
        # get the index of the nearest neighbor of pred at row 'i'; also pass in the candidates in Y
        pred_idx = nearest_neighbor(pred[i], Y, 1)

        # if the index of the nearest neighbor equals the row index i...
        if pred_idx == i:
            # increment the number correct by 1.
            num_correct += 1

    # accuracy is the number correct divided by the number of rows in 'pred' (also number of rows in X)
    accuracy = num_correct / pred.shape[0]

    ### END CODE HERE ###

    return accuracy

Let’s see how your translation mechanism works on unseen data:

X_val, Y_val = get_matrices(en_fr_test, fr_embeddings_subset, en_embeddings_subset)

# UNQ_C11 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# You do not have to input any code in this cell, but it is relevant to grading, so please do not change anything

acc = test_vocabulary(X_val, Y_val, R_train)  # this might take a minute or two
print(f"accuracy on test set is {acc:.3f}")

Output

accuracy on test set is 0.557

Expected Output:

0.557

You managed to translate words from one language to another without ever seeing them,
with almost 56% accuracy, by using some basic linear algebra and learning
a mapping of words from one language to another!

# Test your function
w4_unittest.unittest_test_vocabulary(test_vocabulary)

Output

All tests passed

3 - LSH and Document Search

In this part of the assignment, you will implement a more efficient version
of k-nearest neighbors using locality sensitive hashing.
You will then apply this to document search.

  • Process the tweets and represent each tweet as a vector (represent a
    document with a vector embedding).
  • Use locality sensitive hashing and k nearest neighbors to find tweets
    that are similar to a given tweet.
# get the positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')
all_tweets = all_positive_tweets + all_negative_tweets

3.1 - Getting the Document Embeddings

Bag-of-words (BOW) Document Models

Text documents are sequences of words.

  • The ordering of words makes a difference. For example, sentences “Apple pie is
    better than pepperoni pizza.” and “Pepperoni pizza is better than apple pie”
    have opposite meanings due to the word ordering.
  • However, for some applications, ignoring the order of words can allow
    us to train an efficient and still effective model.
  • This approach is called Bag-of-words document model.

Document Embeddings

  • Document embedding is created by summing up the embeddings of all words
    in the document.
  • If we don’t know the embedding of some word, we can ignore that word.

Exercise 7 - get_document_embedding

Complete the get_document_embedding() function.

  • The function get_document_embedding() encodes the entire document as a “document” embedding.
  • It takes in a document (as a string) and a dictionary, en_embeddings
  • It processes the document, and looks up the corresponding embedding of each word.
  • It then sums them up and returns the sum of all word vectors of that processed tweet.
# UNQ_C12 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def get_document_embedding(tweet, en_embeddings, process_tweet=process_tweet):
    '''
    Input:
        - tweet: a string
        - en_embeddings: a dictionary of word embeddings
    Output:
        - doc_embedding: sum of all word embeddings in the tweet
    '''
    doc_embedding = np.zeros(300)

    ### START CODE HERE ###
    # process the document into a list of words (process the tweet)
    processed_doc = process_tweet(tweet)
    for word in processed_doc:
        # add the word embedding to the running total for the document embedding
        doc_embedding += en_embeddings.get(word, 0)
    ### END CODE HERE ###
    return doc_embedding
# UNQ_C13 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# You do not have to input any code in this cell, but it is relevant to grading, so please do not change anything

# testing your function
custom_tweet = "RT @Twitter @chapagain Hello There! Have a great day. :) #good #morning http://chapagain.com.np"
tweet_embedding = get_document_embedding(custom_tweet, en_embeddings_subset)
tweet_embedding[-5:]

Output

array([-0.00268555, -0.15378189, -0.55761719, -0.07216644, -0.32263184])

Expected output:

array([-0.00268555, -0.15378189, -0.55761719, -0.07216644, -0.32263184])
# Test your function
w4_unittest.test_get_document_embedding(get_document_embedding)

Output

All tests passed

Exercise 8 - get_document_vecs

Store all document vectors into a dictionary

Now, let’s store all the tweet embeddings into a dictionary.
Implement get_document_vecs()

# UNQ_C14 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def get_document_vecs(all_docs, en_embeddings, get_document_embedding=get_document_embedding):
    '''
    Input:
        - all_docs: list of strings - all tweets in our dataset.
        - en_embeddings: dictionary with words as the keys and their embeddings as the values.
    Output:
        - document_vec_matrix: matrix of tweet embeddings.
        - ind2Doc_dict: dictionary with indices of tweets in vecs as keys and their embeddings as the values.
    '''

    # the dictionary's key is an index (integer) that identifies a specific tweet
    # the value is the document embedding for that document
    ind2Doc_dict = {}

    # this is list that will store the document vectors
    document_vec_l = []

    for i, doc in enumerate(all_docs):

        ### START CODE HERE ###
        # get the document embedding of the tweet
        doc_embedding = get_document_embedding(doc, en_embeddings)

        # save the document embedding into the ind2Tweet dictionary at index i
        ind2Doc_dict[i] = doc_embedding

        # append the document embedding to the list of document vectors
        document_vec_l.append(doc_embedding)

        ### END CODE HERE ###

    # convert the list of document vectors into a 2D array (each row is a document vector)
    document_vec_matrix = np.vstack(document_vec_l)

    return document_vec_matrix, ind2Doc_dict
document_vecs, ind2Tweet = get_document_vecs(all_tweets, en_embeddings_subset)

# UNQ_C15 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# You do not have to input any code in this cell, but it is relevant to grading, so please do not change anything

print(f"length of dictionary {len(ind2Tweet)}")
print(f"shape of document_vecs {document_vecs.shape}")

Output

length of dictionary 10000
shape of document_vecs (10000, 300)

Expected Output

length of dictionary 10000
shape of document_vecs (10000, 300)

# Test your function. This cell may take some seconds to run.
w4_unittest.test_get_document_vecs(get_document_vecs)

Output

 All tests passed

3.2 - Looking up the Tweets

Now you have a matrix of dimension (m, d), where m is the number of tweets
(10,000) and d is the dimension of the embeddings (300). You will now input a
tweet, and use cosine similarity to see which tweet in our corpus is most
similar to your tweet.
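The cosine_similarity helper used in the next cell was defined earlier in this notebook. As a rough sketch only (an illustrative helper assuming each row of the matrix is one document vector, not necessarily identical to the notebook's own function), a vectorized row-wise cosine similarity could look like this:

import numpy as np

def cosine_similarity_rows(doc_matrix, query_vec):
    """Cosine similarity between every row of doc_matrix and query_vec."""
    dots = doc_matrix @ query_vec                                  # (m,) dot products
    norms = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
    return dots / norms                                            # (m,) similarities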

my_tweet = 'i am sad'
process_tweet(my_tweet)
tweet_embedding = get_document_embedding(my_tweet, en_embeddings_subset)


# UNQ_C16 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# You do not have to input any code in this cell, but it is relevant to grading, so please do not change anything

# this gives you a similar tweet as your input.
# this implementation is vectorized...
idx = np.argmax(cosine_similarity(document_vecs, tweet_embedding))
print(all_tweets[idx])

Output

@hanbined sad pray for me :(((
Expected Output
@hanbined sad pray for me :(((

3.3 - Finding the most Similar Tweets with LSH

You will now implement locality sensitive hashing (LSH) to identify the most similar tweet.

  • Instead of looking at all 10,000 vectors, you can just search a subset to find
    its nearest neighbors.

Let’s say your data points are plotted like this:

[Figure: example data points plotted in the vector space]

You can divide the vector space into regions and search within one region for nearest neighbors of a given vector.

[Figure: the vector space divided into regions (hash buckets)]

N_VECS = len(all_tweets)       # This many vectors.
N_DIMS = len(ind2Tweet[1])     # Vector dimensionality.
print(f"Number of vectors is {N_VECS} and each has {N_DIMS} dimensions.")

Output

Number of vectors is 10000 and each has 300 dimensions.

Choosing the number of planes

  • Each plane divides the space into $2$ parts.
  • So $n$ planes divide the space into $2^{n}$ hash buckets.
  • We want to organize 10,000 document vectors into buckets so that every bucket has about $16$ vectors.
  • For that we need $\frac{10000}{16}=625$ buckets.
  • We're interested in the number of planes $n$ such that $2^{n}=625$, which gives $n=\log_{2}625=9.29\approx 10$ (see the quick check below).
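As a quick check of this arithmetic (an illustrative snippet, not part of the graded assignment):

import numpy as np

n_vectors = 10000                                # total number of document vectors
vectors_per_bucket = 16                          # desired bucket size
n_buckets = n_vectors / vectors_per_bucket       # 625.0 buckets
n_planes = int(np.ceil(np.log2(n_buckets)))      # log2(625) ≈ 9.29, rounded up to 10
print(n_buckets, n_planes)                       # 625.0 10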
# The number of planes. We use log2(625) to have ~16 vectors/bucket.
N_PLANES = 10
# Number of times to repeat the hashing to improve the search.
N_UNIVERSES = 25

3.4 - Getting the Hash Number for a Vector

For each vector, we need to get a unique number associated with that vector in order to assign it to a “hash bucket”.

Hyperplanes in Vector Spaces

  • In a $3$-dimensional vector space, the hyperplane is a regular plane. In a $2$-dimensional vector space, the hyperplane is a line.
  • Generally, a hyperplane is a subspace whose dimension is $1$ lower than that of the original vector space.
  • A hyperplane is uniquely defined by its normal vector.
  • The normal vector $n$ of the plane $\pi$ is the vector to which all vectors in the plane $\pi$ are orthogonal (perpendicular in the $3$-dimensional case).

Using Hyperplanes to Split the Vector Space

We can use a hyperplane to split the vector space into $2$ parts.

  • All vectors whose dot product with a plane’s normal vector is positive are on one side of the plane.
  • All vectors whose dot product with the plane’s normal vector is negative are on the other side of the plane.

Encoding Hash Buckets

  • For a vector, we can take its dot product with all the planes, then encode this information to assign the vector to a single hash bucket.
  • When the vector is pointing to the opposite side of the hyperplane than normal, encode it by 0.
  • Otherwise, if the vector is on the same side as the normal vector, encode it by 1.
  • If you calculate the dot product with each plane in the same order for every vector, you’ve encoded each vector’s unique hash ID as a binary number, like [0, 1, 1, … 0], as illustrated in the toy sketch below.
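Here is a toy 2-D illustration of this sign-based encoding (the normal vector and the points below are made up for demonstration and are not part of the assignment):

import numpy as np

normal = np.array([1.0, 1.0])            # normal vector of the hyperplane (a line in 2-D)
points = np.array([[ 2.0,  1.0],         # same side as the normal  -> sign +1
                   [-1.0, -3.0],         # opposite side            -> sign -1
                   [ 1.0, -1.0]])        # exactly on the plane     -> sign  0
signs = np.sign(points @ normal)         # [ 1. -1.  0.]
bits = (signs >= 0).astype(int)          # treat "on the plane" as positive: [1 0 1]
print(signs, bits)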

Exercise 9 - hash_value_of_vector

We’ve initialized the hash table hashes for you. It is a list of N_UNIVERSES matrices, each of which describes its own hash table. Each matrix has N_DIMS rows and N_PLANES columns. Every column of such a matrix is an N_DIMS-dimensional normal vector for one of the N_PLANES hyperplanes used to create the buckets of that particular hash table.

Exercise: Your task is to complete the function hash_value_of_vector which places vector v in the correct hash bucket.

  • First multiply your vector v with a corresponding plane. This will give you a vector of dimension $(1,\text{N\_planes})$.
  • You will then convert every element in that vector to 0 or 1.
  • You create a hash vector by doing the following: if the element is negative, it becomes a 0, otherwise you change it to a 1.
  • You then compute the unique number for the vector by iterating over N_PLANES
  • Then you multiply $2^{i}$ times the corresponding bit (0 or 1).
  • You will then store that sum in the variable hash_value.

Instructions: Create a hash for the vector in the function below.
Use this formula:

$$\text{hash} = \sum_{i=0}^{N-1} \left( 2^{i} \times h_{i} \right)$$
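For example, with three planes and the bit vector h = [1, 0, 1], the hash value is 1·2⁰ + 0·2¹ + 1·2² = 5. A minimal sketch of that sum (illustrative only):

h = [1, 0, 1]                                    # example bit vector from three planes
hash_value = sum((2 ** i) * h_i for i, h_i in enumerate(h))
print(hash_value)                                # 5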

Create the sets of planes

  • Create multiple (25) sets of planes (the planes that divide up the region).
  • You can think of these as 25 separate ways of dividing up the vector space with a different set of planes.
  • Each element of this list contains a matrix with 300 rows (the word vectors have 300 dimensions) and 10 columns (there are 10 planes in each “universe”).
np.random.seed(0)
planes_l = [np.random.normal(size=(N_DIMS, N_PLANES))
            for _ in range(N_UNIVERSES)]
# UNQ_C17 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def hash_value_of_vector(v, planes):
    """Create a hash for a vector; hash_id says which random hash to use.
    Input:
        - v:  vector of tweet. Its dimension is (1, N_DIMS)
        - planes: matrix of dimension (N_DIMS, N_PLANES) - the set of planes that divide up the region
    Output:
        - res: a number which is used as a hash for your vector

    """
    ### START CODE HERE ###
    # for the set of planes,
    # calculate the dot product between the vector and the matrix containing the planes
    # remember that planes has shape (300, 10)
    # The dot product will have the shape (1,10)    
    dot_product = np.dot(v, planes)
        
    # get the sign of the dot product (1,10) shaped vector
    sign_of_dot_product = np.sign(dot_product)

    # set h to be false (equivalent to 0 when used in operations) if the sign is negative,
    # and true (equivalent to 1) if the sign is positive (1,10) shaped vector
    # if the sign is 0, i.e. the vector is in the plane, consider the sign to be positive
    h = sign_of_dot_product >= 0

    # remove extra un-used dimensions (convert this from a 2D to a 1D array)
    h = np.squeeze(h)

    # initialize the hash value to 0
    hash_value = 0

    n_planes = planes.shape[1]
    for i in range(n_planes):
        # increment the hash value by 2^i * h_i        
        hash_value += 2**i * h[i]
        
    ### END CODE HERE ###

    # cast hash_value as an integer
    hash_value = int(hash_value)

    return hash_value
# UNQ_C18 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# You do not have to input any code in this cell, but it is relevant to grading, so please do not change anything

np.random.seed(0)
idx = 0
planes = planes_l[idx]  # get one 'universe' of planes to test the function
vec = np.random.rand(1, 300)
print(f" The hash value for this vector,",
      f"and the set of planes at index {idx},",
      f"is {hash_value_of_vector(vec, planes)}")

Output

The hash value for this vector, and the set of planes at index 0, is 768

Expected Output

The hash value for this vector, and the set of planes at index 0, is 768
# Test your function
w4_unittest.test_hash_value_of_vector(hash_value_of_vector)

Output

All tests passed

3.5 - Creating a Hash Table

Exercise 10 - make_hash_table

Given that you have a unique number for each vector (or tweet), you now want to create a hash table. With a hash table, given a hash_id, you can quickly look up the corresponding vectors. This significantly reduces the search time.

[Figure: tweet vectors hashed into the buckets of a hash table]

We have given you the make_hash_table function, which maps each tweet vector to a bucket and stores the vector there. It returns the hash_table and the id_table. The id_table lets you know which tweet a vector in a certain bucket corresponds to.

# UNQ_C19 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)

# This is the code used to create a hash table: 
# This function is already implemented for you. Feel free to read over it.

### YOU CANNOT EDIT THIS CELL

def make_hash_table(vecs, planes, hash_value_of_vector=hash_value_of_vector):
    """
    Input:
        - vecs: list of vectors to be hashed.
        - planes: the matrix of planes in a single "universe", with shape (embedding dimensions, number of planes).
    Output:
        - hash_table: dictionary - keys are hashes, values are lists of vectors (hash buckets)
        - id_table: dictionary - keys are hashes, values are list of vectors id's
                            (it's used to know which tweet corresponds to the hashed vector)
    """
    # number of planes is the number of columns in the planes matrix
    num_of_planes = planes.shape[1]

    # number of buckets is 2^(number of planes)
    # ALTERNATIVE SOLUTION COMMENT:
    # num_buckets = pow(2, num_of_planes)
    num_buckets = 2**num_of_planes

    # create the hash table as a dictionary.
    # Keys are integers (0,1,2.. number of buckets)
    # Values are empty lists
    hash_table = {i: [] for i in range(num_buckets)}

    # create the id table as a dictionary.
    # Keys are integers (0,1,2... number of buckets)
    # Values are empty lists
    id_table = {i: [] for i in range(num_buckets)}

    # for each vector in 'vecs'
    for i, v in enumerate(vecs):
        # calculate the hash value for the vector
        h = hash_value_of_vector(v, planes)

        # store the vector into hash_table at key h,
        # by appending the vector v to the list at key h
        hash_table[h].append(v) # @REPLACE None

        # store the vector's index 'i' (each document is given a unique integer 0,1,2...)
        # the key is the h, and the 'i' is appended to the list at key h
        id_table[h].append(i) # @REPLACE None

    return hash_table, id_table
# UNQ_C20 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# You do not have to input any code in this cell, but it is relevant to grading, so please do not change anything
planes = planes_l[0]  # get one 'universe' of planes to test the function
tmp_hash_table, tmp_id_table = make_hash_table(document_vecs, planes)

print(f"The hash table at key 0 has {len(tmp_hash_table[0])} document vectors")
print(f"The id table at key 0 has {len(tmp_id_table[0])} document indices")
print(f"The first 5 document indices stored at key 0 of id table are {tmp_id_table[0][0:5]}")

Output

The hash table at key 0 has 3 document vectors
The id table at key 0 has 3 document indices
The first 5 document indices stored at key 0 of id table are [3276, 3281, 3282]

Expected output

The hash table at key 0 has 3 document vectors
The id table at key 0 has 3 document indices
The first 5 document indices stored at key 0 of id table are [3276, 3281, 3282]
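As an optional sanity check (illustrative, not part of the assignment), you can look at the average bucket occupancy of this universe; with 10 planes there are 1024 buckets for 10,000 vectors, i.e. roughly 10 vectors per bucket on average:

sizes = [len(ids) for ids in tmp_id_table.values()]       # one entry per bucket
print(sum(sizes) / len(sizes))                             # ≈ 10000 / 1024 ≈ 9.8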
# Test your function
w4_unittest.test_make_hash_table(make_hash_table)

Output

 All tests passed

3.6 - Creating all Hash Tables

You can now hash your vectors and store them in hash tables that allow you to
quickly look up and search for similar vectors. Run the cell below to create
the hashes. You end up with several tables, each containing all the vectors.
Given a vector, you identify its bucket in every table and then iterate over
just those buckets, considering far fewer vectors than the full corpus. The
more tables you use, the more accurate your lookup will be, but also the
longer it will take.

# Creating the hashtables
def create_hash_id_tables(n_universes):
    hash_tables = []
    id_tables = []
    for universe_id in range(n_universes):  # there are 25 hashes
        print('working on hash universe #:', universe_id)
        planes = planes_l[universe_id]
        hash_table, id_table = make_hash_table(document_vecs, planes)
        hash_tables.append(hash_table)
        id_tables.append(id_table)
    
    return hash_tables, id_tables

hash_tables, id_tables = create_hash_id_tables(N_UNIVERSES)

Approximate K-NN

Exercise 11 - approximate_knn

Implement approximate K nearest neighbors using locality sensitive hashing,
to search for documents that are similar to a given document at the
index doc_id.

Inputs
  • doc_id is the index into the document list all_tweets.
  • v is the document vector for the tweet in all_tweets at index doc_id.
  • planes_l is the list of planes (the global variable created earlier).
  • k is the number of nearest neighbors to search for.
  • num_universes_to_use: to save time, we can use fewer than the total
    number of available universes. By default, it’s set to N_UNIVERSES,
    which is 25 for this assignment.
  • hash_tables: list with hash tables for each universe.
  • id_tables: list with id tables for each universe.

The approximate_knn function finds a subset of candidate vectors that
are in the same “hash bucket” as the input vector ‘v’. Then it performs
the usual k-nearest neighbors search on this subset (instead of searching
through all 10,000 tweets).

# UNQ_C21 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# This is the code used to do the fast nearest neighbor search. Feel free to go over it
def approximate_knn(doc_id, v, planes_l, hash_tables, id_tables, k=1, num_universes_to_use=25, hash_value_of_vector=hash_value_of_vector):
    """Search for k-NN using hashes."""
    #assert num_universes_to_use <= N_UNIVERSES

    # Vectors that will be checked as possible nearest neighbor
    vecs_to_consider_l = list()

    # list of document IDs
    ids_to_consider_l = list()

    # create a set for ids to consider, for faster checking if a document ID already exists in the set
    ids_to_consider_set = set()

    # loop through the universes of planes
    for universe_id in range(num_universes_to_use):

        # get the set of planes from the planes_l list, for this particular universe_id
        planes = planes_l[universe_id]

        # get the hash value of the vector for this set of planes
        hash_value = hash_value_of_vector(v, planes)

        # get the hash table for this particular universe_id
        hash_table = hash_tables[universe_id]

        # get the list of document vectors for this hash table, where the key is the hash_value
        document_vectors_l = hash_table[hash_value]

        # get the id_table for this particular universe_id
        id_table = id_tables[universe_id]

        # get the subset of documents to consider as nearest neighbors from this id_table dictionary
        new_ids_to_consider = id_table[hash_value]

        ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###

        # loop through the subset of document vectors to consider
        for i, new_id in enumerate(new_ids_to_consider):
            
            if doc_id == new_id:
                continue

            # if the document ID is not yet in the set ids_to_consider...
            if new_id not in ids_to_consider_set:
                # access document_vectors_l list at index i to get the embedding
                # then append it to the list of vectors to consider as possible nearest neighbors
                document_vector_at_i = document_vectors_l[i]          

                # append the new_id (the index for the document) to the list of ids to consider
                vecs_to_consider_l.append(document_vector_at_i)
                ids_to_consider_l.append(new_id)

                # also add the new_id to the set of ids to consider
                # (use this to check if new_id is not already in the IDs to consider)
                ids_to_consider_set.add(new_id)

        ### END CODE HERE ###

    # Now run k-NN on the smaller set of vecs-to-consider.
    print("Fast considering %d vecs" % len(vecs_to_consider_l))

    # convert the vecs to consider set to a list, then to a numpy array
    vecs_to_consider_arr = np.array(vecs_to_consider_l)

    # call nearest neighbors on the reduced list of candidate vectors
    nearest_neighbor_idx_l = nearest_neighbor(v, vecs_to_consider_arr, k=k)

    # Use the nearest neighbor index list as indices into the ids to consider
    # create a list of nearest neighbors by the document ids
    nearest_neighbor_ids = [ids_to_consider_l[idx]
                            for idx in nearest_neighbor_idx_l]

    return nearest_neighbor_ids
#document_vecs, ind2Tweet
doc_id = 0
doc_to_search = all_tweets[doc_id]
vec_to_search = document_vecs[doc_id]
# UNQ_C22 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# You do not have to input any code in this cell, but it is relevant to grading, so please do not change anything

# Sample
nearest_neighbor_ids = approximate_knn(
    doc_id, vec_to_search, planes_l, hash_tables, id_tables, k=3, num_universes_to_use=5)

Output

(10,)
(10,)
(10,)
(10,)
(10,)
(10,)
(10,)
(10,)
(10,)
(10,)
Fast considering 77 vecs
print(f"Nearest neighbors for document {doc_id}")
print(f"Document contents: {doc_to_search}")
print("")

for neighbor_id in nearest_neighbor_ids:
    print(f"Nearest neighbor at document id {neighbor_id}")
    print(f"document contents: {all_tweets[neighbor_id]}")

Output

Nearest neighbors for document 0
Document contents: #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)

Nearest neighbor at document id 51
document contents: #FollowFriday @France_Espana @reglisse_menthe @CCI_inter for being top engaged members in my community this week :)
Nearest neighbor at document id 2478
document contents: #ShareTheLove @oymgroup @musicartisthere for being top HighValue members this week :) @nataliavas http://t.co/IWSDMtcayt
Nearest neighbor at document id 105
document contents: #FollowFriday @straz_das @DCarsonCPA @GH813600 for being top engaged members in my community this week :)
# Test your function
w4_unittest.test_approximate_knn(approximate_knn, hash_tables, id_tables)

Output

 All tests passed

4 Conclusion

Congratulations - Now you can look up vectors that are similar to the
encoding of your tweet using LSH!

grades

[Figure: screenshot of the assignment grades]

Postscript

On March 14, 2024, I spent one day finishing the quizzes and labs for Week 3 and Week 4. Counting from enrolling in this course on March 10, 2024, through March 14, 2024, five days passed, four of which were spent studying, so it took about four days to complete Natural Language Processing with Classification and Vector Spaces. I hope to find time to finish the remaining three courses of this Specialization, though I don't know when that will be.

From: https://blog.csdn.net/shizheng_Li/article/details/136718293
