Sunday, February 20, 2022

K-Nearest Neighbor

Introduction

There are many blogs about machine learning models, and about KNN (K-Nearest Neighbor) in particular. However, most of them explain the theory using library code. This blog describes the KNN algorithm from scratch, with code snippets that make the written theory easier to follow.


A common question is whether KNN can be used for classification or regression. The answer is that KNN can be used as both a classification and a regression model.

KNN for classification problem

The Iris dataset is used both to train the model and to test it.
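
For reproducibility, a minimal setup sketch is shown below. It assumes the dataset is loaded with scikit-learn's `load_iris` and split with `train_test_split` (the original post does not specify how the data is loaded):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    # Load the Iris dataset: 150 samples, 4 features, 3 classes.
    iris = load_iris()
    train_points, test_points, train_labels, test_labels = train_test_split(
        iris.data, iris.target, test_size=0.2, random_state=42)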

The implementation has two core functionalities:

  1. Calculate the distance between two data points.
  2. Extract k data points by sorting the distance.

1. Calculate the distance between two data points:

There are multiple methods to calculate the distance between two points, such as Minkowski distance, Manhattan distance, Euclidean distance, Cosine distance, Jaccard distance, and Hamming distance. A detailed discussion of when to use each of them is beyond the scope of this blog.

For simplicity, Euclidean distance is used here, as it is the most common distance metric. It measures the straight-line distance between two multidimensional points:

`D\left(p_{1}, p_{2}\right) = \sqrt{\sum_{i=1}^{n} \left(p_{2_{i}} - p_{1_{i}}\right)^{2}}`

Where,
n = number of features
`p_{1}, p_{2}` = the two selected data points
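
For example, for two 2-dimensional points `p_{1} = (1, 2)` and `p_{2} = (4, 6)`:

`D(p_{1}, p_{2}) = \sqrt{(4-1)^{2} + (6-2)^{2}} = \sqrt{9 + 16} = 5`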

The Python code snippet is given below (it assumes `numpy` is imported as `np`, as in the setup above):

    def calculate_euclidean_distance(data_point_1, data_point_2):
        # Sum the squared differences over all features, then take the root.
        distance = 0.0
        for i in range(len(data_point_1)):
            distance += (data_point_1[i] - data_point_2[i]) ** 2
        return np.sqrt(distance)
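
Any of the other metrics listed earlier can be dropped in with the same interface. For example, a Manhattan distance variant (shown only as a sketch; the rest of this blog keeps using Euclidean distance) could look like this:

    def calculate_manhattan_distance(data_point_1, data_point_2):
        # Sum of absolute coordinate differences (L1 norm).
        distance = 0.0
        for i in range(len(data_point_1)):
            distance += abs(data_point_1[i] - data_point_2[i])
        return distance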

2. Extract k data points by sorting the distances:

This functionality is very basic: compute the distance between the current point and every training point, then sort those distances. The k points with the smallest distances are extracted, and their training labels are looked up. The label with the maximum count is taken as the prediction result.

    def get_k_nearest_points_index(user_point, neighbor_points, k):
        # Distance from the query point to every training point.
        neighbor_distances = []
        for idx, neighbor_point in enumerate(neighbor_points):
            distance = calculate_euclidean_distance(user_point, neighbor_point)
            neighbor_distances.append((idx, distance))
        # Sort ascending by distance and keep the k closest (index, distance) pairs.
        neighbor_distances.sort(key=lambda n: n[1])
        return neighbor_distances[:k]

After applying the above two functionalities, we have the indices of the k nearest training points. The labels of those points can be looked up through the indices, and the most frequent label is taken as the predicted class.

One question may be raised: what if the k nearest labels are all different, so that no label has a maximum count?

The points are already sorted by distance, so when every label occurs exactly once, the first label wins the vote, i.e., the prediction falls back to the label of the single nearest point.
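
A quick illustration of this tie-breaking behavior (a toy example; the labels are assumed to be already sorted by distance):

    # Labels of the 3 nearest neighbors, already sorted by distance.
    k_labels = ["setosa", "versicolor", "virginica"]

    # Every label occurs exactly once, so max() returns the first
    # maximal element: the label of the single nearest neighbor.
    print(max(k_labels, key=k_labels.count))  # -> setosa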

    def predict_for_classification(test_point, train_points, train_labels, k):
        k_nearest_point_indices = get_k_nearest_points_index(test_point,
                                                             train_points, k)
        # Map each neighbor's index to its training label.
        k_labels = [train_labels[idx] for idx, _ in k_nearest_point_indices]
        # Majority vote; on a tie, max() keeps the first (nearest) label.
        return max(k_labels, key=k_labels.count)
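
A usage sketch is given below. It assumes the `train_points`/`test_points` split from the Introduction and an arbitrary choice of `k = 5`:

    # Evaluate the classifier on the held-out Iris test points.
    correct = 0
    for test_point, true_label in zip(test_points, test_labels):
        predicted = predict_for_classification(test_point, train_points,
                                               train_labels, k=5)
        correct += int(predicted == true_label)
    print("Accuracy:", correct / len(test_points))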

KNN for regression:

KNN regression is simple once classification is well understood. The main change is that instead of taking the label with the maximum count, the continuous label values of the k nearest neighbors are averaged.

The first two steps are the same as above:

  1. Calculate the distance between two data points.
  2. Extract k data points by sorting the distance.
Then, the prediction for regression is as follows:

    def predict_for_regression(test_point, train_points, train_labels, k):
        k_nearest_point_indices = get_k_nearest_points_index(test_point,
                                                             train_points, k)
        # Average the continuous label values of the k nearest neighbors.
        neighbor_labels = [train_labels[idx] for idx, _ in k_nearest_point_indices]
        return sum(neighbor_labels) / len(neighbor_labels)
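
A small usage sketch with made-up one-dimensional data (for illustration only; the label is roughly twice the feature value):

    train_points = [[1.0], [2.0], [3.0], [4.0], [5.0]]
    train_labels = [2.1, 3.9, 6.2, 8.0, 9.8]

    # The 2 nearest neighbors of [3.5] are [3.0] and [4.0], so the
    # prediction is the average of their labels: (6.2 + 8.0) / 2 = 7.1.
    print(predict_for_regression([3.5], train_points, train_labels, k=2))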

Conclusion:

KNN is a very basic algorithm: it calculates distances and predicts the same behavior as its nearest neighbors. It can be used for both classification and regression. The common functionalities are calculating the distances and sorting them to extract the k nearest points, and there are multiple distance metrics to choose from, as listed above. The difference between the two tasks is that classification takes the most frequent of the nearest labels, while regression averages the continuous label values.
