# Similarity Functions in Python

Similarity functions measure the ‘distance’ between two objects, such as numbers, vectors, or pairs. They quantify how alike the two objects are: the objects are deemed similar if the distance between them is small, and dissimilar if it is large.

# Measures of Similarity

**Euclidean Distance**

The Euclidean distance is the simplest measure: the straight-line distance between two points.

It is a common choice of proximity measure when the data is dense or continuous. The Euclidean distance between two points is the length of the straight line segment connecting them, and it is given by the Pythagorean theorem: the square root of the sum of the squared differences between corresponding coordinates.

Implementation in Python

    from math import sqrt

    def euclidean_distance(x, y):
        # Square root of the summed squared coordinate differences.
        return sqrt(sum(pow(a - b, 2) for a, b in zip(x, y)))
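As a quick sanity check, the sketch below applies the same formula to a 3-4-5 right triangle and compares the result with `math.dist` from the standard library (available in Python 3.8 and later). The sample points are illustrative and do not come from the original text.

```python
from math import dist, sqrt

def euclidean_distance(x, y):
    # Square root of the summed squared coordinate differences.
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Illustrative points forming a 3-4-5 right triangle.
p, q = (0, 0), (3, 4)
print(euclidean_distance(p, q))  # 5.0
print(dist(p, q))                # 5.0, standard-library cross-check
```

Because the function zips the two sequences, it works for any number of dimensions, but it silently ignores extra coordinates if the inputs differ in length.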

**Manhattan Distance**

Manhattan distance is a metric in which the distance between two points is the sum of the absolute differences of their Cartesian coordinates. For two points A and B in a plane, it is the absolute difference of their x-coordinates plus the absolute difference of their y-coordinates. In other words, the distance is measured along axes at right angles rather than along a straight line.

In a plane with p1 at (x1, y1) and p2 at (x2, y2):

Manhattan distance = |x1 – x2| + |y1 – y2|

The Manhattan distance is also known as Manhattan length, rectilinear distance, L1 distance, L1 norm, city block distance, Minkowski’s L1 distance, or the taxicab metric.

Implementation in Python

    def manhattan_distance(x, y):
        # Sum of the absolute coordinate differences.
        return sum(abs(a - b) for a, b in zip(x, y))
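To make the formula concrete, here is a small worked example. The points are made up for illustration: for (1, 2) and (4, 6) the distance is |1 – 4| + |2 – 6| = 3 + 4 = 7.

```python
def manhattan_distance(x, y):
    # Sum of the absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(x, y))

# Illustrative points: |1 - 4| + |2 - 6| = 3 + 4
print(manhattan_distance((1, 2), (4, 6)))        # 7

# The same function handles more than two dimensions.
print(manhattan_distance((1, 2, 3), (4, 0, 3)))  # 5
```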

**Minkowski Distance**

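The Minkowski distance is a generalization of both the Euclidean distance and the Manhattan distance. For two points x = (x1, ..., xn) and y = (y1, ..., yn) and an order p ≥ 1, it is defined as:

Minkowski distance = (|x1 – y1|^p + |x2 – y2|^p + ... + |xn – yn|^p)^(1/p)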

When **p = 2**, Minkowski distance is the same as the Euclidean distance.

When **p = 1**, Minkowski distance is the same as the Manhattan distance.

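Implementation in Python

    from decimal import Decimal

    def nth_root(value, n_root):
        # Raise value to the power 1/n_root, rounded to 3 decimal places.
        root_value = 1 / float(n_root)
        return round(Decimal(value) ** Decimal(root_value), 3)

    def minkowski_distance(x, y, p_value):
        # p-th root of the summed p-th powers of the absolute differences.
        return nth_root(sum(pow(abs(a - b), p_value) for a, b in zip(x, y)), p_value)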
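The two special cases can be checked numerically. The sketch below is a simplified variant that uses plain floats instead of `Decimal` (an assumption made here to keep the example short) and compares the results for p = 1 and p = 2 against the Manhattan and Euclidean distances for an illustrative pair of points.

```python
def minkowski_distance(x, y, p):
    # p-th root of the summed p-th powers of the absolute differences.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

# Illustrative points; coordinate differences are 3 and 4.
x, y = (0, 0), (3, 4)

print(minkowski_distance(x, y, 1))  # 7.0, the Manhattan distance 3 + 4
print(minkowski_distance(x, y, 2))  # 5.0, the Euclidean distance sqrt(9 + 16)
print(minkowski_distance(x, y, 3))  # a value between the largest difference (4) and 5
```

For larger p the largest coordinate difference increasingly dominates the sum, so the result shrinks toward it.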

**Cosine Similarity**

Cosine similarity is the normalized dot product of two vectors. Computing it amounts to finding the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any other angle. **It is thus a judgment of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude.**

Cosine similarity is often used in positive space, where every vector component is non-negative and the outcome is neatly bounded in [0,1]. One of the reasons for its popularity is that it is very efficient to evaluate, especially for sparse vectors, since only the non-zero dimensions contribute to the dot product.

Implementation in Python

    from math import sqrt

    def square_rooted(x):
        # Euclidean norm of x, rounded to 3 decimal places.
        return round(sqrt(sum(a * a for a in x)), 3)

    def cosine_similarity(x, y):
        numerator = sum(a * b for a, b in zip(x, y))
        denominator = square_rooted(x) * square_rooted(y)
        return round(numerator / float(denominator), 3)

    print(cosine_similarity([3, 45, 7, 2], [2, 54, 13, 15]))
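The claim that cosine similarity depends on orientation and not magnitude can be illustrated with a few hand-picked vectors. This is a minimal sketch: it skips the intermediate rounding of the norms and rounds only the printed result, and the vectors are chosen for illustration.

```python
from math import sqrt

def cosine_similarity(x, y):
    # Dot product divided by the product of the Euclidean norms.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

v = [1, 2]
print(round(cosine_similarity(v, [10, 20]), 3))  # 1.0, same direction but 10x longer
print(round(cosine_similarity(v, [-2, 1]), 3))   # 0.0, perpendicular
print(round(cosine_similarity(v, [-1, -2]), 3))  # -1.0, opposite direction
```

Note that the function divides by zero if either input is the zero vector, since a zero vector has no orientation.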

**Jaccard Similarity**

Jaccard similarity measures the similarity between finite sets. It is defined as the cardinality of the intersection of the sets divided by the cardinality of their union.

Cardinality is the number of elements in a set. For two sets A and B, with cardinalities denoted |A| and |B|:

Jaccard Similarity J(A,B) = |A∩B|/|A∪B|

Implementation in Python

    def jaccard_similarity(x, y):
        # Size of the intersection divided by the size of the union.
        intersection_cardinality = len(set(x).intersection(y))
        union_cardinality = len(set(x).union(y))
        return intersection_cardinality / float(union_cardinality)
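A short worked example may help here. The sketch below uses two made-up lists and also guards against an empty union; returning 1.0 in that case is an assumption made for this example, not something the definition prescribes.

```python
def jaccard_similarity(x, y):
    a, b = set(x), set(y)
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(union)

# Illustrative inputs: intersection {0, 2, 5} has 3 elements,
# union {0, 1, 2, 3, 5, 6, 7, 9} has 8 elements.
print(jaccard_similarity([0, 1, 2, 5, 6], [0, 2, 3, 5, 7, 9]))  # 0.375
```

Because the inputs are converted to sets, duplicate elements and ordering have no effect on the result.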
