Similarity Functions in Python

Jan 29, 2020

Similarity functions measure the ‘distance’ between two vectors, numbers, or pairs. That distance indicates how alike the two objects are: they are deemed similar if the distance between them is small, and dissimilar if it is large.

Measures of Similarity

Euclidean Distance

The simplest measure: the ordinary straight-line distance between two points.

When data is dense or continuous, this is often the best proximity measure. The Euclidean distance between two points is the length of the straight line segment connecting them, and it is given by the Pythagorean theorem.

Implementation in Python

from math import sqrt

def euclidean_distance(x, y):
    return sqrt(sum(pow(a - b, 2) for a, b in zip(x, y)))
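
As a quick sanity check, here is a minimal sketch that applies the same formula to the classic 3-4-5 right triangle and compares the result with `math.dist` from the standard library (available in Python 3.8 and later). The sample points are made up for illustration.

```python
from math import dist, isclose, sqrt

def euclidean_distance(x, y):
    # Square root of the sum of squared coordinate differences
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Illustrative points: the legs are 3 and 4, so the hypotenuse is 5
p, q = (0, 0), (3, 4)
d = euclidean_distance(p, q)
print(d)  # 5.0

# Cross-check against the standard library on a 3-dimensional pair
u, v = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(isclose(euclidean_distance(u, v), dist(u, v)))  # True
```

Note that `zip` silently stops at the shorter input, so both points are assumed to have the same number of coordinates.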

Manhattan Distance

Manhattan distance is a metric in which the distance between two points is the sum of the absolute differences of their Cartesian coordinates. Put simply, for a Point A and a Point B in a plane, we add the absolute difference between their x-coordinates to the absolute difference between their y-coordinates. In other words, the distance is measured along axes at right angles rather than along the straight line between the points.

In a plane with p1 at (x1, y1) and p2 at (x2, y2):

Manhattan distance = |x1 - x2| + |y1 - y2|

The Manhattan distance metric is also known as Manhattan length, rectilinear distance, L1 distance, L1 norm, Minkowski’s L1 distance, taxi cab metric, or city block distance.

Implementation in Python

def manhattan_distance(x, y):
    # Sum of the absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))
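
To make the idea of measuring along axes at right angles concrete, the sketch below evaluates the formula for two made-up grid points and for a pair of 3-dimensional points. The sample values are illustrative only.

```python
def manhattan_distance(x, y):
    # Sum of the absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

# Illustrative points: 3 blocks across plus 4 blocks up
a, b = (1, 2), (4, 6)
grid_distance = manhattan_distance(a, b)
print(grid_distance)  # 7

# The same function works for any number of coordinates
print(manhattan_distance([10, 20, 10], [10, 20, 20]))  # 10
```

For the same pair of grid points the straight-line (Euclidean) distance is 5, which shows that the route along the axes is never shorter than the direct one.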

Minkowski Distance

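The Minkowski distance is a generalized metric that includes both the Euclidean distance and the Manhattan distance as special cases. For two points x and y with n coordinates and an order p ≥ 1, it is defined as:

Minkowski distance = (Σ |x_i - y_i|^p)^(1/p), summed over the coordinates i = 1, ..., n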

When p = 2, Minkowski distance is the same as the Euclidean distance.

When p = 1, Minkowski distance is the same as the Manhattan distance.

from decimal import Decimal

def nth_root(value, n_root):
    # Take the n-th root by raising to the power 1/n, rounded to 3 decimal places
    root_value = 1 / float(n_root)
    return round(Decimal(value) ** Decimal(root_value), 3)

def minkowski_distance(x, y, p_value):
    # Sum |a - b|^p over all coordinates, then take the p-th root
    return nth_root(sum(pow(abs(a - b), p_value) for a, b in zip(x, y)), p_value)

print(minkowski_distance([0, 3, 4, 5], [7, 6, 3, -1], 3))
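
The two special cases (p = 1 and p = 2) can be checked numerically. The sketch below uses a plain floating-point version of the formula, which is an assumption made for brevity (no `Decimal`, no rounding), together with the same sample vectors as the example above.

```python
from math import isclose, sqrt

def minkowski(x, y, p):
    # (sum of |a - b|^p) raised to the power 1/p, in plain floats
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y = [0, 3, 4, 5], [7, 6, 3, -1]

# Reference values computed directly from the two simpler formulas
manhattan = sum(abs(a - b) for a, b in zip(x, y))
euclidean = sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(isclose(minkowski(x, y, 1), manhattan))  # True
print(isclose(minkowski(x, y, 2), euclidean))  # True
```

The comparison uses `isclose` rather than `==` because raising to the power 1/p in floating point may differ from `sqrt` in the last digits.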

Cosine Similarity

The cosine similarity metric is the normalized dot product of two vectors. Computing it amounts to finding the cosine of the angle between the two vectors. The cosine of 0° is 1, and it is less than 1 for any other angle. It is thus a judgment of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude.

Cosine similarity is most often used in positive space (vectors with no negative components, such as word counts), where the outcome is neatly bounded in [0, 1]. One of the reasons for its popularity is that it is very efficient to evaluate, especially for sparse vectors, because only the non-zero components contribute to the dot product.

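from math import sqrt

def square_rooted(x):
    # Length (Euclidean norm) of x, rounded to 3 decimal places
    return round(sqrt(sum(a * a for a in x)), 3)

def cosine_similarity(x, y):
    # Dot product divided by the product of the two vector lengths
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = square_rooted(x) * square_rooted(y)
    return round(numerator / float(denominator), 3)

print(cosine_similarity([3, 45, 7, 2], [2, 54, 13, 15]))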
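
The three reference cases described earlier (same orientation, 90°, and diametrically opposed) can be reproduced with a few made-up 2-dimensional vectors. This is a minimal sketch that skips the intermediate rounding of the vector lengths and only rounds the printed result.

```python
from math import sqrt

def cosine_similarity(x, y):
    # Dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

v = [1, 2]
print(round(cosine_similarity(v, [10, 20]), 3))  # same direction, longer vector: 1.0
print(round(cosine_similarity(v, [-2, 1]), 3))   # perpendicular: 0.0
print(round(cosine_similarity(v, [-1, -2]), 3))  # opposite direction: -1.0
```

The first case illustrates the independence from magnitude: scaling a vector by 10 leaves the similarity at 1. A vector of all zeros has no direction, so this sketch would raise `ZeroDivisionError` for it.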

Jaccard Similarity

Jaccard Similarity is used to find similarities between sets. The Jaccard similarity measures similarity between finite sample sets and is defined as the cardinality of the intersection of sets divided by the cardinality of the union of the sample sets.

To find the Jaccard similarity between two sets A and B, take the ratio of the cardinality of A ∩ B to the cardinality of A ∪ B.

Cardinality: the number of elements in a set.

If A and B are sets, their cardinalities are denoted by |A| and |B|.

Jaccard Similarity J(A,B) = |A∩B|/|A∪B|

Implementation in Python

def jaccard_similarity(x, y):
    # Size of the intersection divided by the size of the union
    intersection_cardinality = len(set(x) & set(y))
    union_cardinality = len(set(x) | set(y))
    return intersection_cardinality / float(union_cardinality)

print(jaccard_similarity([0, 1, 2, 5, 6], [0, 2, 3, 5, 7, 9]))
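
Working through the sample lists by hand: the intersection is {0, 2, 5} (3 elements) and the union is {0, 1, 2, 3, 5, 6, 7, 9} (8 elements), so the similarity is 3/8. The sketch below reproduces that and also guards against two empty inputs, where the union is empty and the ratio is undefined. Returning 1.0 in that case is an assumption of this sketch, not part of the definition.

```python
def jaccard_similarity(x, y):
    set_x, set_y = set(x), set(y)
    union = set_x | set_y
    if not union:
        # Assumption for this sketch: two empty sets count as identical
        return 1.0
    return len(set_x & set_y) / len(union)

a = [0, 1, 2, 5, 6]
b = [0, 2, 3, 5, 7, 9]
print(jaccard_similarity(a, b))  # 0.375
```

Because the inputs are converted to sets, repeated elements in a list are counted only once.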

Written by Ashutosh Kumar