🤔 Into the Unknown, clustering¶

Suppose someone gave you a file of data that looked like the following:

Data file: plants_1.csv

Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
93,5.8,2.6,4.0,1.2
49,5.3,3.7,1.5,0.2
98,6.2,2.9,4.3,1.3
31,4.8,3.1,1.6,0.2
125,6.7,3.3,5.7,2.1
14,4.3,3.0,1.1,0.1
15,5.8,4.0,1.2,0.2
72,6.1,2.8,4.0,1.3
27,5.0,3.4,1.6,0.4
134,6.3,2.8,5.1,1.5
6,5.4,3.9,1.7,0.4
126,7.2,3.2,6.0,1.8
139,6.0,3.0,4.8,1.8
107,4.9,2.5,4.5,1.7
65,5.6,2.9,3.6,1.3
13,4.8,3.0,1.4,0.1
100,5.7,2.8,4.1,1.3
118,7.7,3.8,6.7,2.2
8,5.0,3.4,1.5,0.2
121,6.9,3.2,5.7,2.3
23,4.6,3.6,1.0,0.2
144,6.8,3.2,5.9,2.3
88,6.3,2.3,4.4,1.3
82,5.5,2.4,3.7,1.0
59,6.6,2.9,4.6,1.3
22,5.1,3.7,1.5,0.4
96,5.7,3.0,4.2,1.2
47,5.1,3.8,1.6,0.2
30,4.7,3.2,1.6,0.2
90,5.5,2.5,4.0,1.3
95,5.6,2.7,4.2,1.3
79,6.0,2.9,4.5,1.5
111,6.5,3.2,5.1,2.0
25,4.8,3.4,1.9,0.2
17,5.4,3.9,1.3,0.4
5,5.0,3.6,1.4,0.2
48,4.6,3.2,1.4,0.2
141,6.7,3.1,5.6,2.4
26,5.0,3.0,1.6,0.2
43,4.4,3.2,1.3,0.2
70,5.6,2.5,3.9,1.1
109,6.7,2.5,5.8,1.8
55,6.5,2.8,4.6,1.5
56,5.7,2.8,4.5,1.3
89,5.6,3.0,4.1,1.3
150,5.9,3.0,5.1,1.8
110,7.2,3.6,6.1,2.5
99,5.1,2.5,3.0,1.1
20,5.1,3.8,1.5,0.3
2,4.9,3.0,1.4,0.2
145,6.7,3.3,5.7,2.5
12,4.8,3.4,1.6,0.2
52,6.4,3.2,4.5,1.5
142,6.9,3.1,5.1,2.3
87,6.7,3.1,4.7,1.5
138,6.4,3.1,5.5,1.8
148,6.5,3.0,5.2,2.0
11,5.4,3.7,1.5,0.2
9,4.4,2.9,1.4,0.2
21,5.4,3.4,1.7,0.2
123,7.7,2.8,6.7,2.0
61,5.0,2.0,3.5,1.0
76,6.6,3.0,4.4,1.4
38,4.9,3.1,1.5,0.1
135,6.1,2.6,5.6,1.4
40,5.1,3.4,1.5,0.2
66,6.7,3.1,4.4,1.4
92,6.1,3.0,4.6,1.4
45,5.1,3.8,1.9,0.4
10,4.9,3.1,1.5,0.1
18,5.1,3.5,1.4,0.3
137,6.3,3.4,5.6,2.4
35,4.9,3.1,1.5,0.1
75,6.4,2.9,4.3,1.3
116,6.4,3.2,5.3,2.3
119,7.7,2.6,6.9,2.3
127,6.2,2.8,4.8,1.8
112,6.4,2.7,5.3,1.9
129,6.4,2.8,5.6,2.1
37,5.5,3.5,1.3,0.2
39,4.4,3.0,1.3,0.2
50,5.0,3.3,1.4,0.2
94,5.0,2.3,3.3,1.0
140,6.9,3.1,5.4,2.1
97,5.7,2.9,4.2,1.3
28,5.2,3.5,1.5,0.2
86,6.0,3.4,4.5,1.6
83,5.8,2.7,3.9,1.2
44,5.0,3.5,1.6,0.6
80,5.7,2.6,3.5,1.0
51,7.0,3.2,4.7,1.4
149,6.2,3.4,5.4,2.3
77,6.8,2.8,4.8,1.4
146,6.7,3.0,5.2,2.3
3,4.7,3.2,1.3,0.2
53,6.9,3.1,4.9,1.5
64,6.1,2.9,4.7,1.4
81,5.5,2.4,3.8,1.1
29,5.2,3.4,1.4,0.2
114,5.7,2.5,5.0,2.0
36,5.0,3.2,1.2,0.2
101,6.3,3.3,6.0,2.5
54,5.5,2.3,4.0,1.3
67,5.6,3.0,4.5,1.5
71,5.9,3.2,4.8,1.8
42,4.5,2.3,1.3,0.3
63,6.0,2.2,4.0,1.0
34,5.5,4.2,1.4,0.2
133,6.4,2.8,5.6,2.2
24,5.1,3.3,1.7,0.5
104,6.3,2.9,5.6,1.8
57,6.3,3.3,4.7,1.6
113,6.8,3.0,5.5,2.1
124,6.3,2.7,4.9,1.8
78,6.7,3.0,5.0,1.7
128,6.1,3.0,4.9,1.8
69,6.2,2.2,4.5,1.5
105,6.5,3.0,5.8,2.2
46,4.8,3.0,1.4,0.3
32,5.4,3.4,1.5,0.4

Your lead researcher tells you that this dataset contains measurements of several species of plants. What you have to work with is the following:

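The identifier of the plant (Id)
The length of the sepal in centimeters (SepalLengthCm)
The width of the sepal in centimeters (SepalWidthCm)
The length of the petal in centimeters (PetalLengthCm)
The width of the petal in centimeters (PetalWidthCm)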
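
One way to get these rows into Python is with the standard csv module. The following is only a sketch: the function name load_plants is hypothetical, and the two sample rows are copied from the file above so that the example runs without the file on disk.

```python
import csv
import io


def load_plants(lines):
    """Parse CSV text (a file object or any iterable of lines) into a list of dicts.

    Id is converted to int and every measurement to float.
    """
    plants = []
    for row in csv.DictReader(lines):
        plant = {"Id": int(row["Id"])}
        for name, value in row.items():
            if name != "Id":
                plant[name] = float(value)
        plants.append(plant)
    return plants


# Two rows copied from plants_1.csv so the sketch is self-contained.
sample = io.StringIO(
    "Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm\n"
    "93,5.8,2.6,4.0,1.2\n"
    "49,5.3,3.7,1.5,0.2\n"
)
plants = load_plants(sample)
print(plants[0]["Id"], plants[0]["PetalLengthCm"])
```

With the real file you would pass an open file instead of the sample, for example by calling load_plants(f) inside a with open("plants_1.csv", newline="") as f: block.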

How many different species of plants are there in the data? Your research team has no idea but would like you to use your data science skills to answer the question.

Your first instinct (which is a good one) is to visualize the data using Altair and look for patterns that might help you figure this out.

Whatever your answer to the previous question, the majority of this project is to implement an algorithm called k-means clustering.

The idea behind k-means clustering is easy enough to describe, but implementing it will give you practice in all of your Python skills.

1. Determine the number of clusters you want to find; call it N. We will represent each cluster by a point in space called the centroid. The centroid will be the point in the center of the cluster. We will call each cluster \(N_i\).
2. Now randomly place N centroids in your data.
3. Repeat the following until your clusters stop changing, or you have reached some predetermined limit:
   1. For each point in your data, find the centroid that it is closest to. Add that point to the cluster \(N_i\). You can use a list of Id values to keep track of which plants belong to which cluster.
   2. Now, using the points you have put in each cluster, compute new coordinates for the centroid by taking the average of each feature. For example, the average of PetalWidth and the average of SepalWidth over the plants in a cluster would give us a new point in 2-d space for that cluster's centroid.
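
The steps above can be sketched in plain Python as follows. This is a minimal illustration, not the required solution. The names distance and kmeans are hypothetical, the starting centroids are picked from the data points themselves (one possible reading of "randomly place"), and an empty cluster simply keeps its old centroid. The six sample points are (PetalLengthCm, PetalWidthCm) pairs copied from the file above.

```python
import math
import random


def distance(p, q):
    """Straight-line distance between two points with the same number of features."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, n_clusters, max_iterations=100, seed=0):
    """Cluster points, a dict mapping Id -> tuple of features.

    Returns (centroids, clusters), where clusters[i] is the list of Ids
    assigned to centroids[i].
    """
    rng = random.Random(seed)
    # Assumption: start from randomly chosen data points.
    centroids = rng.sample(list(points.values()), n_clusters)
    clusters = None
    for _ in range(max_iterations):
        # Step 3.1: put each Id in the cluster of its closest centroid.
        new_clusters = [[] for _ in range(n_clusters)]
        for plant_id, p in points.items():
            nearest = min(range(n_clusters),
                          key=lambda i: distance(p, centroids[i]))
            new_clusters[nearest].append(plant_id)
        # Stop as soon as the clusters no longer change.
        if new_clusters == clusters:
            break
        clusters = new_clusters
        # Step 3.2: move each centroid to the average of its members.
        for i, ids in enumerate(clusters):
            if ids:  # an empty cluster keeps its previous centroid
                members = [points[pid] for pid in ids]
                centroids[i] = tuple(sum(col) / len(col)
                                     for col in zip(*members))
    return centroids, clusters


# Id -> (PetalLengthCm, PetalWidthCm), copied from plants_1.csv.
points = {
    93: (4.0, 1.2), 49: (1.5, 0.2), 98: (4.3, 1.3),
    31: (1.6, 0.2), 14: (1.1, 0.1), 72: (4.0, 1.3),
}
centroids, clusters = kmeans(points, 2)
print(clusters)
```

On this tiny sample the three short-petaled plants (Ids 14, 31, 49) end up in one cluster and the other three in the second. Which cluster gets number 0 depends on the random starting centroids.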

Visualizing K-Means Clustering

[Interactive visualization: k-means clustering. You choose the number of clusters and the number of centroids, then step through the algorithm until convergence. Each round finds the closest centroid to each point, groups the points that share the same closest centroid, and then updates each centroid to be the mean of the points in its group. The visualization also reports the mean square point-centroid distance.]

K-means is different from regression in that we are not using “the answers” to help us learn. This algorithm is part of a class of machine learning algorithms known as unsupervised learning. That is, the k-means algorithm just does its best to see if it can make sense of the data.

Implementing K-Means¶

Now your task is to implement the algorithm we have described and that you have experimented with in the visualization. At the end you should write a csv file that contains the Id number, the two measures you have chosen to cluster around (chosen from your visualization), and the number of the cluster that each plant belongs to. You will use this file in the next step.
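
Writing that file could look something like the sketch below. It assumes your clustering produced a dict mapping each Id to its two chosen measures and a list of Id lists, one per cluster. The function name write_clusters, the column name cluster, and the sample values are illustrative choices, not part of the assignment.

```python
import csv
import io


def write_clusters(out, points, clusters, feature_names):
    """Write one row per plant: Id, its chosen measures, and its cluster number."""
    writer = csv.writer(out)
    writer.writerow(["Id", *feature_names, "cluster"])
    for cluster_number, ids in enumerate(clusters):
        for plant_id in ids:
            writer.writerow([plant_id, *points[plant_id], cluster_number])


# Hypothetical clustering result for four plants from plants_1.csv.
points = {49: (1.5, 0.2), 31: (1.6, 0.2), 93: (4.0, 1.2), 98: (4.3, 1.3)}
clusters = [[49, 31], [93, 98]]

# An in-memory buffer stands in for a real file so the sketch runs anywhere.
output = io.StringIO()
write_clusters(output, points, clusters, ["PetalLengthCm", "PetalWidthCm"])
print(output.getvalue())
```

To write a real file, open it with open("clusters.csv", "w", newline="") (the file name is up to you) and pass the file object in place of the buffer.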

Now that you have clustered the points, it's time to graph them again and color code them! You’ll need to add a new column to your Data object; call it species. The value for each row will be the number of the cluster that the plant belongs to. This will be fairly easy to do if you wrote a csv file in the previous step.

In a typical application using k-means clustering you would now begin a more thorough investigation of your clusters. What can you figure out about them? If you are doing an investigation for a web site, you may discover that you have 5 different kinds of customers. Knowing that there are different kinds of customers lets you develop campaigns that reach each customer in a way better suited to their category. In the case of our plant data we would discover that there are in fact three species of plants in the data, each corresponding to a different kind of flower.

We have been visualizing the data in two dimensions, but we can calculate our distances in any number of dimensions. The Pythagorean theorem, which allows us to calculate the distance, works for as many dimensions as we want: \(dist = \sqrt{x^2 + y^2 + z^2 + \dots}\), where x, y, z, and so on are the differences between the two points along each dimension. Update your code so that it can take advantage of all four of the features you are given. Does this move any plants into a different cluster?
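
As a small illustration of the formula, the sketch below computes the distance between two plants using all four measurements. The function name distance is a hypothetical choice, and the two tuples are rows 93 and 49 from the data file in column order (sepal length, sepal width, petal length, petal width).

```python
import math


def distance(p, q):
    """Distance between two points with any number of dimensions."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    # Square the difference along each dimension, add them up, take the root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


plant_93 = (5.8, 2.6, 4.0, 1.2)
plant_49 = (5.3, 3.7, 1.5, 0.2)

four_d = distance(plant_93, plant_49)          # all four features
two_d = distance(plant_93[2:], plant_49[2:])   # petal length and width only
print(four_d, two_d)
```

Because the function loops over however many values each point has, the same code works for two, four, or more features. In Python 3.8 and later, math.dist(p, q) computes the same quantity.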
