KNN - You’ve heard about it, but what is it?

K-Nearest Neighbors, often called KNN for short, is a machine learning model used to generate predictions. Those predictions can be scored, giving us a percentage of how well the model did.

I’ll be using the Breast Cancer Wisconsin (Original) Data Set from UCI, but first let’s go over some basics.

Starting Small

Here is a small, fake dataset with 14 dots. Are you able to make two groups out of them?

import matplotlib.pyplot as plt
from matplotlib import style
style.use('fivethirtyeight')

# Two groups of points, all drawn in the same color for now
dataset = {'a': [[1,2],[3,1],[0,0],[4.25,1.5],[1,3],[2,3],[3,5]],
           'b': [[6,5],[7,7],[7,4],[7.5,6.5],[7,5.5],[8,8],[8,6]]}

# Plot every point in every group
[[plt.scatter(ii[0], ii[1], s=100, color='navy')
  for ii in dataset[i]] for i in dataset];

[Figure: the 14 dots plotted, all in one color]

I bet you could, but here’s how I would do it.

[Figure: the dots separated into two groups]

Now, let’s introduce a new dot: the blue one. We’ll place it at [5,7].

import matplotlib.pyplot as plt
from matplotlib import style
style.use('fivethirtyeight')

# The data the model bases its predictions on ('k' = black, 'r' = red)
dataset = {'k': [[1,2],[3,1],[0,0],[4.25,1.5],[1,3],[2,3],[3,5]],
           'r': [[6,5],[7,7],[7,4],[7.5,6.5],[7,5.5],[8,8],[8,6]]}

# Introducing something new
new_features = [5, 7]

# Plot every point, colored by its group, in one line
[[plt.scatter(ii[0], ii[1], s=100, color=i) for ii in dataset[i]] for i in dataset];

# Add the new point, drawn in the style's default color (blue)
plt.scatter(new_features[0], new_features[1], s=100);

[Figure: both groups plus the new blue dot at [5,7]]

Now for the big question…

Which group does the new blue dot belong in?

If you’d put it in the bottom-left group, I’d love to hear why. As for me, I would classify it with the upper-right group, because it’s closer to more of the red dots than to the black ones.

If you agree with me, congratulations, we’ve basically just done K-Nearest Neighbors.

But wait, what’s the K?

K is whatever you want it to be, but since the neighbors’ votes need a tie breaker, it should be an odd number (at least for a two-class problem). Typically, K defaults to 5; that’s sklearn’s default too. But that still doesn’t explain what it is: K is the number of closest dots that get to vote on which group the new dot belongs to. Each of those neighbors tries to recruit the new dot to its own side, and the majority wins.
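To make that concrete, here’s a minimal sketch of the voting, reusing the toy dots above (math.dist, from the standard library, computes the straight-line distance between two points):

from math import dist  # Euclidean distance, Python 3.8+

dataset = {'k': [[1,2],[3,1],[0,0],[4.25,1.5],[1,3],[2,3],[3,5]],
           'r': [[6,5],[7,7],[7,4],[7.5,6.5],[7,5.5],[8,8],[8,6]]}
new_dot = [5, 7]
K = 5

# Distance from the new dot to every existing dot, tagged with its group
distances = [(dist(point, new_dot), group)
             for group in dataset
             for point in dataset[group]]

# The K closest dots each cast one vote for their own group
votes = [group for _, group in sorted(distances)[:K]]
print(votes)                             # ['r', 'r', 'r', 'r', 'k']
print(max(set(votes), key=votes.count))  # 'r': the red group wins, 4 votes to 1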

[Figure: the K nearest dots voting on the blue dot]

Let’s Build This Thing With The Cancer Dataset Mentioned Above

The code below will import the data into Python for us, and we’ll move forward from there.

import pandas as pd

# The column names of the dataset
columns = [
           'id',
           'Clump Thickness',
           'Uniformity of Cell Size',
           'Uniformity of Cell Shape',
           'Marginal Adhesion',
           'Single Epithelial Cell Size',
           'Bare Nuclei',
           'Bland Chromatin',
           'Normal Nucleoli',
           'Mitoses',
           'Class'
           ]

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'

# Load the data, flag missing values ('?') with an extreme placeholder, and drop the id column
df = pd.read_csv(url, names=columns)
df.replace('?', -99999, inplace=True)
df.drop(['id'], axis=1, inplace=True)

# A function with a big long name
def replace_spaces_with_underscore_in_column_names_and_make_lowercase(df):
  """
  Accepts a dataframe.
  Alters the column names, replacing spaces with '_' and lowercasing them.
  Returns the dataframe.
  """
  labels = list(df.columns)
  for i in range(len(df.columns)):
    labels[i] = labels[i].replace(' ', '_')
    labels[i] = labels[i].lower()
  df.columns = labels

  return df

# Invoke the function
df = replace_spaces_with_underscore_in_column_names_and_make_lowercase(df)

# Show the cleaned-up dataframe
print(df)

Splitting The Dataframe

Now we need to split the dataset, doing what’s usually called a “train/test split”, which will give us X_train, X_test, y_train, and y_test.

# What the model studies: the features
features = df[['clump_thickness', 
               'uniformity_of_cell_size',
               'uniformity_of_cell_shape', 
               'marginal_adhesion',
               'single_epithelial_cell_size',
               'bare_nuclei', 
               'bland_chromatin',
               'normal_nucleoli',
               'mitoses']].astype(int).values.tolist()

# What the model tries to guess
target = df['class'].astype(int).values.tolist()

# Another import
from sklearn.model_selection import train_test_split

# Perform the split of the data
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=42)

Notice the last line: the test_size parameter is 0.2. That means we’re setting aside 20% of the data for testing our results (think: deciding whether the blue dot should be red or black); that’s where X_test and y_test come into play. Conversely, the other 80% of the data is used for training. Hence the name “train/test split”. (I know, it’s so original!)

The features are all the columns except the target. The target is what we want to be able to predict.
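If you want to double-check the split, a quick sanity check helps. The full dataset has 699 rows, so with test_size=0.20 we should see roughly 559 training rows and 140 test rows (sklearn rounds the test portion up):

# Sanity check: ~80% of the rows for training, ~20% held out for testing
print(len(X_train), len(X_test))  # should print 559 140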

Standardizing

What does standardizing mean? In short, it rescales every feature onto the same scale so that no single column dominates the distance calculations; standardization makes all the variables contribute equally to the similarity measure. We’ll only do this to the X data (the features), though.

# Another import
from sklearn.preprocessing import StandardScaler

# Standardize the X stuff
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
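What are fit_transform and transform actually doing? fit_transform learns each column’s mean and standard deviation from the training data and rescales that data; transform applies those same training statistics to the test data, so nothing about the test set leaks into training. Here’s a tiny self-contained sketch of the arithmetic, using made-up numbers:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix, purely for illustration
X = np.array([[1.0, 10.0],
              [3.0, 10.0],
              [5.0, 40.0]])

scaler = StandardScaler().fit(X)

# StandardScaler computes z = (x - mean) / std, column by column
manual = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(scaler.transform(X), manual))  # True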

WAIT! Let’s Cheat Real Quick

This code will give us a glimpse of what we should expect when we’ve finished building our own class.

Let’s run it, and then walk through it real quick.

# Even more imports
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
s_pred = model.predict(X_test)
acc_score = accuracy_score(y_test, s_pred)
print(f'The sklearn accuracy: {acc_score}')

output:

The sklearn accuracy: 0.9714285714285714

Okay, so what just happened? First of all, we had some imports to bring in. Then we instantiated our model, telling it to use the nearest 5 points by saying n_neighbors=5.

Then we had to fit it. For KNN, fitting mostly just means storing the training data (sklearn also builds a lookup structure under the hood) so it’s ready when predictions are requested.

After fitting the model (the learning phase), we need to get what the computer thinks the predictions ought to be; we’ll call them s_pred, which stands for sklearn’s predictions.

Next we come to generating an accuracy score (followed by printing it, which we see above). How well did sklearn do? Comparing the actual labels to the predicted ones, sklearn was 97.14% correct at “assigning the blue dots to the correct category”; in this case, it correctly classified 97.14% of the 20% we held out, predicting which patients had cancer and which didn’t.
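In case accuracy_score feels like magic, it’s just the fraction of predictions that match reality. A minimal sketch of the same arithmetic:

# accuracy = number of correct predictions / total number of predictions
correct = sum(1 for actual, guess in zip(y_test, s_pred) if actual == guess)
print(correct / len(y_test))  # 0.9714..., same value as accuracy_score above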

Building Our Own Model To Try And Do The Same Thing

import numpy as np

class KNN:
    def __init__(self, K):
        self.K = K
        self.X_train = None
        self.y_train = None

    def fit(self, X_train, y_train):
        self.X_train = X_train
        self.y_train = y_train

    def euclidean_distance(self, row1, row2):
        return np.linalg.norm(row1-row2)

    def predict(self, pred):
        
        # One prediction per row we're asked about
        predictions = []

        # Go through each row
        for p in pred:
            
            # Distances from this row to every training row
            distances = []

            # Every row in the training set
            for i, v in enumerate(self.X_train):

                # Get euclidean distance
                distance = self.euclidean_distance(v, p)

                # Append distance to list
                distances.append([distance, i])

            # Sort smallest to biggest
            sorted_distances = sorted(distances)

            # Keep only the K smallest distances
            k_distances = sorted_distances[:self.K]

            # Labels of the K nearest neighbors
            votes = [self.y_train[i[1]] for i in k_distances]

            # Tally the votes; the most common label wins
            result = max(set(votes), key=votes.count)

            # Append result to the predictions list
            predictions.append(result)

        # Return all the predictions
        return predictions

That’s a lot of code; let’s look at it more closely.

It’s a class whose __init__ accepts K from the user and starts X_train and y_train off as None; they’ll come into play later.

Nearly every class has an __init__ method; kinda like the class clown, somebody’s gotta fill the role.

Then we’ve got the fit method. There’s no real crunching here: KNN is a “lazy” learner, so fit simply stores X_train and y_train for predict to use later.

It’s followed by the euclidean_distance method, which does some fancy math (linear algebra) to figure out how far one point is from another; think how far the blue dot is from every other dot. (Remember the blue dot above?)
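For example, the distance from the blue dot at [5,7] to the red dot at [6,5] is sqrt((5-6)^2 + (7-5)^2) = sqrt(5) ≈ 2.24, and np.linalg.norm computes exactly that:

import numpy as np

row1 = np.array([5, 7])  # the blue dot
row2 = np.array([6, 5])  # one of the red dots

print(np.linalg.norm(row1 - row2))          # 2.2360679...
print(np.sqrt(((row1 - row2) ** 2).sum()))  # the same thing, spelled out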

Next, and lastly, comes the predict method, which was the hardest for me to figure out. But if you’ve followed along with everything I’ve said up to this point, it shouldn’t be that scary.

Let’s explain it backwards, starting at the bottom.

Starting at the end, predict returns a list of predictions. Each prediction comes from the training points (think dots above) voting on which group a new point should join, and the vote is limited to the K closest points (5 in this case), found by measuring the distance from the new point to every point in the training data.
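The voting line is worth a second look: max(set(votes), key=votes.count) returns whichever label appears most often among the K neighbors. For example, with this dataset’s labels (2 = benign, 4 = malignant):

votes = [2, 2, 4, 2, 4]  # labels of the 5 nearest neighbors
print(max(set(votes), key=votes.count))  # 2: benign wins, 3 votes to 2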

Phew! That’s a lot!

Let’s Test The Class

Because we’ve seen this kind of code above, I won’t explain every line.

model = KNN(K=5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
my_acc_score = accuracy_score(y_test, y_pred)

print(f'My Class Accuracy: {my_acc_score}')

First we instantiate our model using the class we just made; then, just like when we ran sklearn’s model, we fit, predict, and get an accuracy score.

Conclusion

Let’s run it and see what our score is. We’ll use the same training data and the same testing data as above, when we ran the sklearn model.

And we get…

output:

My Class Accuracy: 0.9714285714285714

The exact same score as the sklearn model! How amazing is that?

Hopefully you understand K-Nearest Neighbors a bit better now. I know I sure do!


My code can be found here
KNN Documentation
sklearn’s KNN class can be found here
UCI Breast Cancer Wisconsin (Original) Data Set