Python Tutorial
File handling, python modules, python numpy, python pandas, python matplotlib, python scipy, machine learning, python mysql, python mongodb, python reference, module reference, python how to, python examples, machine learning - k-means.
On this page, W3schools.com collaborates with NYC Data Science Academy , to deliver digital training content to our students.
K-means is an unsupervised learning method for clustering data points. The algorithm iteratively divides data points into K clusters by minimizing the variance in each cluster.
Here, we will show you how to estimate the best value for K using the elbow method, then use K-means clustering to group the data points into clusters.
How does it work?
First, each data point is randomly assigned to one of the K clusters. Then, we compute the centroid (functionally the center) of each cluster, and reassign each data point to the cluster with the closest centroid. We repeat this process until the cluster assignments for each data point are no longer changing.
K-means clustering requires us to select K, the number of clusters we want to group the data into. The elbow method lets us graph the inertia (a distance-based metric) and visualize the point at which it starts decreasing linearly. This point is referred to as the "elbow" and is a good estimate for the best value for K based on our data.
Start by visualizing some data points:
ADVERTISEMENT
Now we utilize the elbow method to visualize the intertia for different values of K:
The elbow method shows that 2 is a good value for K, so we retrain and visualize the result:
Example Explained
Import the modules you need.
import matplotlib.pyplot as plt from sklearn.cluster import KMeans
You can learn about the Matplotlib module in our "Matplotlib Tutorial .
scikit-learn is a popular library for machine learning.
Create arrays that resemble two variables in a dataset. Note that while we only use two variables here, this method will work with any number of variables:
x = [4, 5, 10, 4, 3, 11, 14 , 6, 10, 12] y = [21, 19, 24, 17, 16, 25, 24, 22, 21, 21]
Turn the data into a set of points:
data = list(zip(x, y)) print(data)
[(4, 21), (5, 19), (10, 24), (4, 17), (3, 16), (11, 25), (14, 24), (6, 22), (10, 21), (12, 21)]
In order to find the best value for K, we need to run K-means across our data for a range of possible values. We only have 10 data points, so the maximum number of clusters is 10. So for each value K in range(1,11), we train a K-means model and plot the intertia at that number of clusters:
inertias = [] for i in range(1,11): kmeans = KMeans(n_clusters=i) kmeans.fit(data) inertias.append(kmeans.inertia_) plt.plot(range(1,11), inertias, marker='o') plt.title('Elbow method') plt.xlabel('Number of clusters') plt.ylabel('Inertia') plt.show()
We can see that the "elbow" on the graph above (where the interia becomes more linear) is at K=2. We can then fit our K-means algorithm one more time and plot the different clusters assigned to the data:
kmeans = KMeans(n_clusters=2) kmeans.fit(data) plt.scatter(x, y, c=kmeans.labels_) plt.show()
COLOR PICKER
Contact Sales
If you want to use W3Schools services as an educational institution, team or enterprise, send us an e-mail: [email protected]
Report Error
If you want to report an error, or if you want to make a suggestion, send us an e-mail: [email protected]
Top Tutorials
Top references, top examples, get certified.
Part 1: The Algorithm
- Calculating Euclidean distances
- Assigning data points to closest centroids
- Update assignment of points to centroids
- Update centroids
- Clustering 2D points
Download the starter code for this assignment. Then, extract (unzip) the contents anywhere on your computer. In VSCode, navigate to File | Open Folder and select the homework4 folder. Inside the homework4 folder you’ll find a number of folders and files:
The folders:
- data : a directory containing the data your program needs
- results : a directory where your plots will be automatically saved
Python files:
- analysis.py : where you’ll do your analysis for part 2
- answers.txt : where you’ll write your answers for questions in part 2
- kmeans.py : where you’ll write your k-means clustering algorithm for parts 1 and 2
- test_kmeans.py : program to test kmeans.py
- test_analysis.py : code to test analysis.py
- utils.py : helper code for the assignment
If the starter code you downloaded doesn’t have the right structure, pictured below, try downloading it again by accessing the same site via Incognito mode.
Do not modify anything in utils.py . The file is provided to help you complete your assignment.
Part 1: Implementing K-means clustering
In Part 1, you will design and implement a k-means clustering algorithm. You only need to modify kmeans.py for Part 1. Below is an overview of the data structures data and centroids , which you will need to work with in order to implement the algorithm:
data: A list of data points, where each data point is a list of floats.
Example: data = [[0.34, -0.2, 1.13, 4.3], [5.1, -12.6, -7.0, 1.9], [-15.7, 0.06, -7.1, 11.2]] Here, data contains three data points, where each data point is four-dimensional. In this example, point 1 is [0.34, -0.2, 1.13, 4.3] , point 2 is [5.1, -12.6, -7.0, 1.9] and point 3 is [-15.7, 0.06, -7.1, 11.2] .
centroids: A dictionary of centroids, where the keys are strings (centroid names) and the values are lists (centroid locations).
Example: centroids = {"centroid1": [1.1, 0.2, -3.1, -0.4], "centroid2": [9.3, 6.1, -4.7, 0.18]} . Here, centroids contain two key-value pairs. You should NOT change the names of the centroids when you later update their locations.
When you first run the tests (before you implement anything) you will see an error about missing implementations. As you work your way through the assignment, you’ll start to see less errors as the tests pass. Seeing assertion errors for parts that you haven’t implemented yet is normal and expected. You can run your tests at any time by opening the terminal ( Terminal | New Terminal ) and typing python test_kmeans.py or by just running the test_kmeans.py file in VSCode.
We have provided several test functions that you should use to check your implementation as you go through the assignment. To run the tests, open the terminal ( Terminal | New Terminal or control-` ) and paste in python test_kmeans.py then press enter/return.
You will only be working in kmeans.py for Part 1. Here is a brief overview of the steps, which are described below:
- Implement euclidean_distance() , which will calculate the distance between two points.
- Implement get_closest_centroid() , which allows you to find the closest centroid for a point.
- Implement assign_points_to_centroids() , which sets the centroid assignment for a point.
- Implement mean_of_points() , which calculates the mean of a list of points. Also, implement update_centroids() , which will update all of the centroids to the new mean of the points assigned to them.
- Run the entire algorithm!
- Check code quality
- Submit your work!
Calculating Euclidean distances ¶
Implement euclidean_distance() using the euclidean distance formula to calculate the distance between two points:
def euclidean_distance ( point1 , point2 ): """ Calculate the Euclidean distance between two data points. Arguments: point1: a non-empty list of floats representing a data point point2: a non-empty list of floats representing a data point Returns: the Euclidean distance between two data points Example: Code: point1 = [1.1, 1, 1, 0.5] point2 = [4, 3.14, 2, 1] print(euclidean_distance(point1, point2)) Output: 3.7735394525564456 """ After writing this function, you should write the code:
python test_kmeans.py
You should see the following output once you have correctly implemented euclidean_distance() :
If you receive an error that looks like
You should check your implementation of euclidean distance to make sure it actually returns a valid value. The exception message will change to reflect the next un-implemented function as you progress through the assignment.
Assigning data points to closest centroids ¶
Implement get_closest_centroid() to find the closest centroid for a data point:
You should see the following output once you have correctly implemented get_closest_centroid() :
Again, you will also see an error message for the next test, as you have not implemented those functions yet.
Update assignment of points to centroids ¶
Implement assign_points_to_centroids() to assign data points to their closest centroid:
If there are no data points assigned to a centroid, then you should not include that centroid in the returned dictionary.
You should see the following output from running test_kmeans.py once you have correctly implemented assign_points_to_centroids() :
Update centroids ¶
Do not modify list_of_points or hard-code the dimensionality of the data. Your code should be able to run on data points with any dimension .
Implement mean_of_points() to find the average of a cluster:
Then, implement the update_centroids() to update the centroid to be the average of the clusters:
You should see the following output from test_kmeans.py once you have correctly implemented mean_of_points() and update_centroids() :
At this point, if you completed all parts to the k-means clustering algorithm you should see following output when you run the tests:
Clustering 2D points ¶
Now that you have completed all parts needed to run your k-means clustering algorithm, we can now run it on actual data. Below you should see four lines of code that will help to do exactly that. These four will load the 2D data points and the initial centroids we provided in the .csv files. After that, main_2d() will execute the k-means algorithm and write_centroids_tofile() will save the final centroids to a file to save them permanently.
Now run your program again by typing python kmeans.py in the terminal window. If you are successful, you should see:
In addition, there should be 7 plots generated, in the results/2D/ folder. If you examine these 7 plots, they should be identical to the steps shown in the the gif shown in the background section of this spec.
Quality ¶
Make sure to provide descriptive comments for each function in docstring format
If you discuss this assignment with one or more classmates, you must specify with whom you collaborated in the header comment in your submission. Otherwise, you may be guilty of academic misconduct and face consequences. Note that you may not collaborate in a way that is prohibited, even if you cite the collaboration.
Your assignment should pass two checks: flake8 and our code quality guidelines. The code quality guidelines are very thorough.
Submission ¶
Submit kmeans.py to Gradescope for Part 1 .
IMAGES
VIDEO