#4 B565-Data Mining Homework #4 Due on Thursday, March 7, 2023, 11:59 pm February 21, 2023 1 Student Name B565-Data Mining HW 4 k-means Algorithm in Theory This part is provided to help you implement k-means clustering algorithm. 1: ALGORITHM k-means 2: INPUT (data ∆, distance d : ∆2 → R≥0, centoid number k, threshold τ) 3: OUTPUT(Setofcentoids{c1,c2,…,ck}) 4: 5: *** Dom(∆) denotes domain of data. 6: 7: *** Assume centroid is structure c = (v ∈ DOM(∆),B ⊆ ∆) 8: *** cv is the centroid value and cB is the set of nearest points. 9: *** ci means centroid at it iteration. 10: 11: i=0 12: *** Initialize Centroids 13: forj=1,kdo 14: cij .v ← random(Dom(∆)) 15: cij.B ← ∅ 16: end for 17: 18: repeat 19: i←i+1 20: *** Assign data point to nearest centroid 21: for δ ∈ ∆ do 22: cij .B ← cB ∪ {δ}, where mincij {d(δ, cij .v)} 23: end for 24: for j = 1, k do 25: *** Get size of centroid 26: n ← |cij.B| 27: *** Update centroid with average 28: cij.v ← (1/n)Pδ∈cij.B δ 29: *** Remove data from centroid 30: cij.B ← ∅ 31: end for 32: *** Calculate scalar product (abuse notation and structure slightly) 33: *** See notes 34: until ((1/k) Pk ||ci−1 − ci ||) < τ j=1 jj 35: return({ci1,ci2,…,cik}) k-means on a tiny data set. Here are the inputs: ∆ = d((x1, y1), (x2, y2)) = {(2, 5), (1, 5), (22, 55), (42, 12), (15, 16)} (1) [(x1 − x2)2 + (y1 − y2)2)]1/2 k=2 (3) τ = 10 (4) Observe that Dom(∆) = R2. We now work through k-means. We ignore the uninformative assignments. We remind the reader that T means transpose. 1: i←0 (2) Page 2 of 9 Student Name B565-Data Mining HW 4 2: *** Randomly assign value to first centroid. 3: c01.v←random(Dom(∆))=(16,19) 4: *** Randomly assign value to second centroid. 5: c02.v←random(Dom(∆))=(2,5) 6: i←i+1 7: *** Associate each datum with nearest centroid 8: c1.B={(22,55),(42,12),(15,16)} 9: c12 .B = {(2, 5), (1, 5)} 10: *** Update centroids 11: c1.v←(26.3,27.7)=(1/3)((22,55)+(42,12)+(15,16)) 12: c12.v←(1.5,5)=(1/2)((2,5)+(1,5)) 13: *** The convergence condition is split over the next few lines to explicitly show the calculations 16: Since the threshold is met (6.9 < 10), k-means stops, returning {(26.3, 27.7), (1.5, 5)} 14: (1/k)Pk ||ci−1−ci||=(1/2)(||c0−c1||+||c0−c1||)=(1/2)(||? 2?−?1.5?||+||?16?−?26.3?||) j=1jj 1122 551927.7 15: = (1/2)[(?.5?T ?.5?)(1/2) + (?−9.7?T ?−9.7?)(1/2) )] = (1/2) (√.5 + √169.7) ∼ (1/2)(13.7) = 6.9 0 0 −8.7 −8.7 Page 3 of 9 Student Name B565-Data Mining HW 4 Problem 1 For this problem we are going to use a diabetes data set collected from many US hospitals on the purpose of analyzing the factors causing readmission of diabetic patients. You can download the data from here [link]. The web-page comes with a downloadable link along with the description of the data. Answer the following questions [20 points]: 1. Briefly describe this data set–what is its purpose? How should it be used? What are the kinds of data it’s using? Discussion of data Answer here. . . 2. Using R/Python, show code that answers the following questions: (a) How many entries are in the data set? R/Python script # Sample R Script With Highlighting # Sample Python Script With Highlighting (b) How many unknown or missing data are in the data set? R/Python script # Sample R Script With Highlighting # Sample Python Script With Highlighting (c) Create histograms for attributes {age, num lab procedures, num medications}. Label the plots properly. Discuss the distribution of values eg, are uniform, skewed, normal. Place images of these histograms into the document. Show the Python or R code that you used below and discuss below that. R/Python script # Sample R Script With Highlighting # Sample Python Script With Highlighting Discussion of Findings Answer here. . . Problem 1 continued on next page. . . Page 4 of 9 Student Name B565-Data Mining HW 4 Plots Place images here with suitable captions. 3. Make a scatter plots of [time in hospital, num medications] and [num medications, num lab procedures] variables and color the data points with the class variable [readmitted]. Discuss the plots, ie, do you observe any relationships between variables? Show the Python or R code that you used below and discussion below that. R/Python script # Sample R Script With Highlighting # Sample Python Script With Highlighting Discussion of Findings Answer here. . . Plots Place images here with suitable captions. Problem 2 The pseudo-code for k-means and a running example of k-means on a small data set are provided above. Answer the following questions [10 points]: 2.1 Does k-means always converge? Given your answer, a bound on the iterate must be included. How is its value determined? 2.2 What is the run-time of this algorithm? Problem 3 Implement Lloyd’s algorithm for k-means (see algorithm k-means above) in R or Python and call this program Ck. As you present your code explain your protocol for [20 points] 1. initializing centroids 2. maintaining k centroids 3. deciding ties 4. stopping criteria Problem 3 continued on next page. . . Page 5 of 9 Student Name B565-Data Mining HW 4 R/Python Code subsectionR/Python script # Sample R Script With Highlighting # Sample Python Script With Highlighting Discussion of Initialization of Centroids Answer here. . . Discussion of Maintaining k Centroids Answer here. . . Discussion of Deciding Ties Answer here. . . Discussion of Stopping Criteria Answer here. . . Problem 4 In this question, you are asked to run your program, Ck, against the Diabetes data set from Problem 1 and New York Times Comments data set [link](use the file nyt-comments-2020.csv as your data set). Upon stopping, you will calculate the quality of the centroids and of the partition. For each centroid ci, form two counts: yi ← X [δ.C = “y”], readmitted/selected δ∈ci.B ni ← X [δ.C == “n”], not readmitted/selected δ∈ci.B where[x=y] returns 1ifTrue,0 otherwise. For example,[2=3]+[0=0]+[34=34]=2 The centroid ci is classified as readmitted/selected if yi > ni and not readmitted/selected otherwise. We can now calculate a simple error rate. Assume ci is good. Then the error is: ni ni + yi k = X error(ci) i=1 error(ci) = We can find the total error rate easily: Error({c1, c2, . . . , ck}) Report the total error rates for k = 2, . . . 5 for 20 runs each for both data sets. Presenting the results that are easily understandable. Plots are generally a good way to convey complex ideas quickly, ie, box plot. Discuss your results. Note: The error calculation method mentioned above is generalized for both data sets where the first data set asks if a patient was readmitted or not and the second data set asks if a comment was selected by editor Problem 4 continued on next page. . . Page 6 of 9 Student Name B565-Data Mining HW 4 or not. Data preparation: Like any other data mining problem, before feeding your data to the clustering algo- rithms, you will have to perform data cleaning, feature engineering, and feature selection on both data sets. For the second data set (NY Times Comments data set), you will be using commentBody column as your input feature (you have to build a corpus and perform an appropriate vector-space representation technique) [20 points]. R/Python script # Sample R Script With Highlighting # Sample Python Script With Highlighting Discussion of Findings Answer here. . . Plots Place images here with suitable captions. Problem 5 The k-means algorithm provided above stops when centroids become stable (Line 34). In theory, k-means converges once SSE is minimized k SSE=XX||x−cj.v||2 jx∈cj .B In this question, you are asked to use SSE as stopping criterion. Run your program, CkSSE , against the Diabetes and New York Times Comments data sets. Report the total error rates for k = 2, . . . 5 for 20 runs each for both data sets. Moreover, compare k-means and kmeans++’s run time for k = 2, . . . 5 for 20 runs using both data sets. Presenting the results that are easily understandable. Plots are generally a good way to convey complex ideas quickly , ie, box plot. Discuss your results [20 points]. R/Python script # Sample R Script With Highlighting # Sample Python Script With Highlighting Discussion of Findings Answer here. . . Plots Place images here with suitable captions. Page 7 of 9 Student Name B565-Data Mining HW 4 Problem 6 Traditional k-means initialization is based on choosing values from a uniform distribution. In this question, you are asked to improve k-means through initialization. k-means ++ is an extended k-means clustering algorithm and induces non-uniform distributions over the data that serve as the initial centroids. Read the paper and discuss the idea in a paragraph. Implement this idea to improve your k-means program. Run your program, Ck++, against the Diabetes and New York Times Comments data sets. Report the total error rates for k = 2, . . . 5 for 20 runs each for both data sets. Moreover, compare Ck, CkSSE and Ck++’s run time for k = 2, . . . 5 for 20 runs using both data sets. Presenting the results that are easily understandable. Plots are generally a good way to convey complex ideas quickly, ie, box plot. Discuss your results [20 points]. R/Python script # Sample R Script With Highlighting # Sample Python Script With Highlighting Discussion of Findings Answer here. . . Plots Place images here with suitable captions. Problem 7 In this question, you are asked to make use of the R/Python libraries for k-means. The elbow technique is used to determine optimal cluster number. Find the optimal cluster number for the Diabetes and New York Times Comments data sets using elbow method (for 2 ≤ k ≤ 15). Provide plots that show the total SSE for each k. Discuss your results [20 points]. R/Python script # Sample R Script With Highlighting # Sample Python Script With Highlighting Discussion of Findings Answer here. . . Plots Place images here with suitable captions. Page 8 of 9 Student Name B565-Data Mining HW 4 Extra credit This part is optional. 1. Ball-k-means calculates the distance of a data point from the centroid to find the annular region in which the data point resides. The annular region helps determine which neighbor centroids should be included in distance computations. This improves the run-time over earlier approaches by avoid- ing expensive computations. Run your program, Ckball , against the Diabetes and New York Times Comments data sets. Report the total error rates for k = 2, . . . 5 for 20 runs each for both data sets. Moreover, compare Ck, CkSSE , Ck++ and Ckball ‘s run time for k = 2, . . . 5 for 20 runs using both data sets. Presenting the results that are easily understandable. Plots are generally a good way to convey complex ideas quickly , ie, box plot. Discuss your results [30 points]. R/Python script # Sample R Script With Highlighting # Sample Python Script With Highlighting Discussion of Findings Answer here. . . Plots Place images here with suitable captions. 2. The student who has the fastest implementation of all four clustering algorithms will receive extra 20 points [20 points]. Submission You must use LATEX to turn in your assignments. Please submit the following two files via Canvas: 1. A .pdf with the name yourname-hw4-everything.pdf which you will get after compiling your .tex file. 2. A .zip file with the name yourname-hw4.zip which should contain your .tex, .pdf, codes(.py, .ipynb, .R, or .Rmd), and a README file. The README file should contain information about dependencies and how to run your codes