Notes:
------

The R project provides a command-line environment with graphical output.
The "Introduction to R" manual contains most of the information you will
need to experiment with R. The full reference manual (2000+ pages)
contains all the information that can be found with the help() command.

Although the following instructions have been tested on Linux machines,
the functionality should be the same on Windows machines. Take care with
the default directory where files are placed.

You may need to read the manual to understand the differences between a
matrix and a data frame. The dissimilarity matrix must have a specific
format. It is better to construct it with a measure available in R (like
the correlation), or to transform an existing dissimilarity data matrix
with the "as.dist()" function, than to build it by hand.

You can view the contents of an object by entering its name on the
command line.

REMEMBER: In R, the conventional assignment operator is "<-" and not "="!
Current versions of R also accept "=" at the top level, but "<-" is the
form used in the manuals and in these notes.

Commands you will/may need for clustering:
------------------------------------------

-- help()
   Provides help on a function. With "help.search()" you can search the
   documentation when you do not know the exact function name.
   "help.start()" gives a particularly nice HTML interface to manuals
   and references.

-- history()
   Shows the most recent commands. You can also use the up/down arrows
   to step through previously entered commands.

-- library()
   Loads an R package. To load the clustering package, enter
   "library(cluster)".

-- read.table()
   Loads a table from a file and stores it in a data frame. Each line
   of the file becomes a row of the data frame, and each field on a line
   becomes a column. Fields are separated by whitespace by default.

-- ls()
   Shows the objects that are currently stored in R.

-- rm()
   Deletes an object from memory.

-- plot()
   Graphically plots an object. It has various parameters that can be
   explored with the help() command.
   Some objects (like the agnes object) supply their own instructions
   to plot() for drawing them, along with additional arguments that
   control the output.

-- agnes()
   The agglomerative clustering function. It accepts as input a data
   matrix or data frame, or a dissimilarity matrix. It has parameters to
   specify whether the input is a dissimilarity matrix (diss), the
   distance metric used to calculate dissimilarities when you provide a
   data matrix (metric), the clustering method (method), and others.
   All parameters can be explored with the help() command.

-- diana()
   A divisive hierarchical clustering function. Its use is similar to
   agnes().

-- as.dist()
   Transforms a matrix or data frame to a dissimilarity matrix (make
   sure the result is what you need).

-- quit()
   Exits the R environment. You will be given the option to save your
   work (all your objects and the commands you entered during your
   session). If you save it, a couple of files (.RData and .Rhistory)
   are written in your directory, and by default they are loaded
   automatically the next time you start R in the same directory.

Example of the use of R for clustering:
---------------------------------------

This example assumes that you have a directory containing the file
"R_dist-3" and the file "names". R should be started in that directory.
The file "R_dist-3" contains the 3-mer distributions of 104 bacteria and
the file "names" contains the names of these bacteria.

Start R

> library(cluster)          # Load the clustering package (for agnes)
> mydata3 <- read.table("R_dist-3")
> mydata3                   # Just to view the data
> mycor3 <- cor(mydata3)    # Correlation between the columns of mydata3
                            # (note: a similarity, not a dissimilarity)
> mycor3                    # View the correlation matrix
> names <- read.table("names", sep="\n")
> rownames(mycor3) <- names$V1
> mycor3                    # View the row-annotated matrix
> myagnes3 <- agnes(mycor3) # mycor3 is a plain matrix, so agnes() treats
                            # its rows as observations and clusters them
                            # by Euclidean distance (the default metric)
> plot(myagnes3, which.plots=2, main="Correlation Dendrogram of 3-mer Distributions (average method)")
                            # plot() dispatches to the agnes plot method;
                            # which.plots=2 selects the dendrogram

Et voila!
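
A note on the example above: cor() returns correlations, which are
similarities (1 means identical profiles), whereas the notes recommend
handing the clustering functions a dissimilarity matrix. The sketch below
shows one possible way to do that with as.dist(). It is only an
illustration under stated assumptions: the data is synthetic (the real
"R_dist-3" file is not used), the names toy, toycor, toydiss and toyagnes
are made up, and "1 - correlation" is just one common choice of
conversion, not necessarily the one intended in the original notes.

```r
library(cluster)  # provides agnes()

set.seed(1)
# Synthetic stand-in for the 3-mer table (an assumption for illustration):
# 64 rows (one per 3-mer) and 6 columns (one per organism).
toy <- as.data.frame(matrix(runif(64 * 6), nrow = 64, ncol = 6))
colnames(toy) <- paste0("org", 1:6)

toycor  <- cor(toy)              # 6 x 6 correlation matrix (a similarity)
toydiss <- as.dist(1 - toycor)   # assumed conversion: 0 = perfectly correlated

# agnes() detects "dist" objects on its own; diss = TRUE is spelled out
# here only to make the intent explicit.
toyagnes <- agnes(toydiss, diss = TRUE, method = "average")

toyagnes$ac                      # agglomerative coefficient
# plot(toyagnes, which.plots = 2)  # uncomment to draw the dendrogram
```

With real data, the toy data frame would be replaced by the result of
read.table(), and the labels would come from the column names or from the
"names" file.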
Notes by (R beginner) Dimitris Papamichail (dimitris@cs.sunysb.edu)

---------- Forwarded message ----------

Switching between clustering algorithms should not be difficult once you
have the data. The cluster package seems to have six clustering methods,
namely agnes, clara, diana, fanny, mona and pam, which also seem to
accept pretty much the same style of data (a data matrix or data frame
and, for agnes, diana, fanny and pam, alternatively a dissimilarity
matrix). OK, mona is for binary variables, so it is probably not
suitable, but the others seem fine. You can find information on each of
them with help().

Applying different cost functions may be trickier. If they are not
available directly from R (like the correlation cor(), or the euclidean
and manhattan metrics that are embedded in the cluster functions), it
would probably be easier to create the similarity/dissimilarity matrix
outside of R (with your favorite programming language). Once you learn
how to use the operators and loop constructs of R, it would probably be
easier to do it in R, but you need to study the manual a little more for
that. An advantage of R when creating matrices is that you can apply
operations directly on arrays and data frames (avoiding loops in some
cases). But it may take some time to learn how, and even I have limited
experience with that.

For more information I would open the reference guide of R (the 2000+
page reference), which has a section on clustering with all the
available functions "clustered" together.

Hope that helps a bit. Good luck.

Dimitris

> Hi Dimitris,
>
> I have read your instruction of R throughly, which is very clear and easy to
> understand. But I think it may not be enough for my homework. I need to
> change different clustering algorithms and different cost functions to
> analysis the data. Would you please give me some hints about doing it?
>
> Thanks a lot,
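
As a rough illustration of the reply above, the sketch below changes the
built-in metric and method of agnes(), builds a hand-made dissimilarity
without explicit loops, and feeds that same dissimilarity to several
clustering functions. Everything specific in it is an assumption made for
the example: the data is random, the object names (x, d_cor, res_agnes
and so on) are made up, "1 - correlation between rows" is just one
possible cost function, and k = 2 for pam() is arbitrary.

```r
library(cluster)  # provides agnes(), diana(), pam()

set.seed(2)
# Synthetic data (an assumption): 8 observations (rows), 5 measurements each.
x <- matrix(rnorm(8 * 5), nrow = 8,
            dimnames = list(paste0("obs", 1:8), NULL))

# Built-in metrics and methods, selected through arguments of agnes()
ag_euc <- agnes(x, metric = "euclidean", method = "average")
ag_man <- agnes(x, metric = "manhattan", method = "complete")

# A hand-made dissimilarity computed on whole matrices, no loops:
# t(x) puts the observations in columns, cor() correlates the columns.
d_cor <- as.dist(1 - cor(t(x)))

# The same "dist" object can be passed to several algorithms
res_agnes <- agnes(d_cor, diss = TRUE)
res_diana <- diana(d_cor, diss = TRUE)
res_pam   <- pam(d_cor, k = 2, diss = TRUE)  # k = 2 chosen arbitrarily

res_pam$clustering   # cluster number assigned to each observation
```

A dissimilarity matrix computed outside of R could be used in the same
way: read it with read.table(), convert it with as.dist(), and pass the
result to the clustering function.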