Clustering Microarray Data

                                           Heather Turner

                                          Department of Statistics
                                         University of Warwick, UK




Heather Turner (University of Warwick)                               1/9
Overview of Microarray Experiment

          Array of p genes (×n)  −→  Scanned image (×n)  −→  n × p matrix

Example: Serum Stimulation of Human Fibroblasts
(Eisen, Spellman, Brown & Botstein, PNAS, 1998)

          9,800 spots representing 8,600 genes
          12 samples taken over a 24-hour period
          Highlighted clusters can be roughly categorised as genes involved in
                  A cholesterol biosynthesis
                  B the cell cycle
                  C the immediate–early response
                  D signaling and angiogenesis
                  E wound healing and tissue remodelling

Why the need for specialised techniques?

          Application
                  Dimensions of the data are nonstandard (small n, large p)
          Structure
                  Both genes and sample clusters may be of interest
                  Co-expression may be restricted to a subset of the attributes
                  Genes/samples may belong to more than one group
                  Many “uninteresting” genes
          Nature
                  Clusters of interest may not be characterised by similar
                  expression profiles
                  Samples may be taken over time
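The point that co-expression may be restricted to a subset of the attributes can be illustrated with a small sketch. The two gene profiles below are hypothetical values made up for illustration, and `pearson` is a helper written here, not part of any cited method. The genes are perfectly correlated over the first four samples but uncorrelated over all eight, so a similarity measure computed over every sample would miss the relationship.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical expression values for two genes over 8 samples.
# The genes move together in samples 0-3 and in opposite
# directions in samples 4-7.
gene_a = [1.0, 2.0, 3.0, 4.0, 1.0, 4.0, 2.0, 3.0]
gene_b = [1.0, 2.0, 3.0, 4.0, 4.0, 1.0, 3.0, 2.0]
subset = range(4)  # the samples in which the genes are co-expressed

# Correlation over all samples versus over the subset only.
r_all = pearson(gene_a, gene_b)
r_subset = pearson([gene_a[i] for i in subset],
                   [gene_b[i] for i in subset])

print(f"all samples: {r_all:.2f}, subset: {r_subset:.2f}")
```

Comparing `r_all` with `r_subset` shows why methods that search for clusters over subsets of samples are of interest here.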


One-way Clustering Techniques

          Increased structural flexibility
                  Overlapping non-exhaustive clusters
                          Gene shaving: Hastie et al, Genome Biol., 2000
                  Context-specific clusters
                          Clustering On Subsets of Attributes (COSA): Friedman
                          and Meulman, JRSS B, 2004


Two-way Clustering Techniques
          Use conventional one-way methods iteratively
                  Sample clusters within gene clusters
                          Inter-related two-way clustering: Tang et al, BIBE 01
                          EMMIX-GENE: McLachlan et al, Bioinformatics, 2002
                  Clusters within two-way clusters
                          Coupled Two-Way Clustering (CTWC): Getz et al, PNAS, 2003

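The idea of applying a conventional one-way method iteratively, here to find sample clusters within gene clusters, can be sketched as follows. This is a minimal illustration, not the procedure of any method cited on this slide. It assumes plain k-means as the one-way method, uses the first k rows as initial centres for simplicity, and runs on a small made-up 4 × 4 matrix (samples in rows, genes in columns). The names `kmeans`, `samples_within_gene_clusters` and `expr` are hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(rows, k, iters=20):
    """Plain k-means on a list of numeric rows; returns one label per row.

    Assumption for this sketch: the first k rows are the initial centres.
    """
    centres = [list(r) for r in rows[:k]]
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assign every row to its nearest centre.
        labels = [min(range(k), key=lambda c: sq_dist(r, centres[c]))
                  for r in rows]
        # Move each centre to the mean of its current members.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def samples_within_gene_clusters(matrix, k_genes, k_samples):
    """Cluster genes first, then cluster samples within each gene cluster.

    matrix is n samples x p genes, matching the n x p layout of the slides.
    """
    gene_profiles = [list(col) for col in zip(*matrix)]  # p x n
    gene_labels = kmeans(gene_profiles, k_genes)
    nested = {}
    for g in range(k_genes):
        cols = [j for j, lab in enumerate(gene_labels) if lab == g]
        # Restrict every sample to the genes of this cluster only.
        sub = [[row[j] for j in cols] for row in matrix]
        nested[g] = (cols, kmeans(sub, k_samples))
    return gene_labels, nested

# Hypothetical data: 4 samples (rows) x 4 genes (columns).
expr = [
    [5.0, 5.0, 6.0, 4.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 5.0, 0.0, 6.0],
    [5.0, 0.0, 5.0, 1.0],
]
gene_labels, nested = samples_within_gene_clusters(expr, k_genes=2, k_samples=2)
for g, (cols, sample_labels) in nested.items():
    print(f"gene cluster {g}: genes {cols}, sample labels {sample_labels}")
```

On this toy matrix the samples group differently within each gene cluster, which is the kind of structure a single one-way clustering of the samples cannot express.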
Co-clustering Techniques
          Simultaneously cluster both genes and samples
                  Two-way partition
                          Spectral bi-clustering: Kluger et al, Genome Res., 2003
                          Co-clustering: Cho et al, SIAM ICDM 04
                  Conjugate clusters
                          Double Conjugated Clustering (DCC): Busygin et al, SIAM
                          ICDM 02

Biclustering Techniques
          Retrieve isolated two-way clusters: biclusters
                  Clusters based on latent model
                          Rich probabilistic models: Segal et al, Bioinformatics, 2001
                  Biclusters
                          SAMBA: Tanay et al, Bioinformatics, 2002
                          Plaid models: Lazzeroni and Owen, Statist. Sinica, 2002

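As one concrete instance of retrieving isolated, possibly overlapping biclusters, the plaid model is usually written as a sum of layers. The notation below is an assumption added here and does not appear on the slides: i indexes genes, j indexes samples, k indexes layers, and layer k = 0 is a background layer containing every gene and sample.

```latex
Y_{ij} = \sum_{k=0}^{K} \left( \mu_k + \alpha_{ik} + \beta_{jk} \right)
         \rho_{ik}\, \kappa_{jk} + \varepsilon_{ij},
\qquad \rho_{ik},\ \kappa_{jk} \in \{0, 1\}
```

Here the indicator ρ_ik marks gene i as a member of layer k and κ_jk marks sample j as a member of layer k, so each layer with k ≥ 1 is a bicluster. Because no constraint forces the indicators to sum to one, genes and samples may belong to several biclusters or to none.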
Current Situation

          Many novel methods, few used in practice
                  Molecular biologists often have limited (access to) statistical
                  expertise
                  Limited number of methods in publicly available software
          Little work on performance evaluation
          Development of methods continues
                  Improved algorithms
                  Time series
                  Three-way data
                  Integration of other sources of data



