
SAMBAR, or Subtyping Agglomerated Mutations By Annotation Relations, is a method to identify subtypes based on somatic mutation data. SAMBAR was used to identify mutational subtypes in 23 cancer types from The Cancer Genome Atlas (Kuijjer ML, Paulson JN, Salzman P, Ding W, Quackenbush J, British Journal of Cancer (May 16, 2018), doi: 10.1038/s41416-018-0109-7; also posted on BioRxiv).

SAMBAR’s input is a matrix that contains the number of non-synonymous mutations \(N_{ij}\) in sample \(i\) and gene \(j\). SAMBAR first (optionally) subsets these data to a set of 2,219 cancer-associated genes from the Catalogue Of Somatic Mutations In Cancer (COSMIC) and Östlund et al. (Network-based identification of novel cancer genes, 2010, Mol Cell Prot), or to a user-defined list. It then divides the number of non-synonymous mutations by the gene’s length \(L_j\), defined as the number of non-overlapping exonic base pairs of the gene. For each sample, SAMBAR then calculates the overall cancer-associated mutation rate by summing the mutation scores over all cancer-associated genes \(j'\). It removes samples for which the mutation rate is zero and divides the mutation scores of the remaining samples by the sample’s mutation rate, resulting in a matrix of mutation rate-adjusted scores \(G\):

\(G_{ij}=\frac{N_{ij}/L_{j}}{\displaystyle\sum_{j'}\left(N_{ij'}/L_{j'}\right)}\)
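The normalization described above can be sketched in a few lines of base R. This is only an illustration on made-up data, not the package's internal code: the objects `mut_counts`, `gene_length`, `length_adjusted`, and `mutation_rate` are hypothetical names, and the optional subsetting to cancer-associated genes is left out.

```r
# Toy mutation counts: rows are samples, columns are genes (hypothetical data).
mut_counts <- matrix(
  c(2, 0, 1,
    0, 0, 0,
    1, 3, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("s1", "s2", "s3"), c("geneA", "geneB", "geneC"))
)

# Hypothetical gene lengths in non-overlapping exonic base pairs.
gene_length <- c(geneA = 1000, geneB = 3000, geneC = 500)

# Step 1: divide each gene's mutation count by that gene's length.
length_adjusted <- sweep(mut_counts, 2, gene_length[colnames(mut_counts)], "/")

# Step 2: per-sample mutation rate = sum of the length-adjusted scores.
mutation_rate <- rowSums(length_adjusted)

# Step 3: drop samples with a zero mutation rate, then divide each remaining
# row by its own mutation rate (the vector is recycled down the rows).
keep <- mutation_rate > 0
G <- length_adjusted[keep, , drop = FALSE] / mutation_rate[keep]

print(round(G, 3))
```

In this toy example sample `s2` has no mutations and is removed, and every remaining row of `G` sums to 1.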


The next step in SAMBAR is de-sparsification of these gene mutation scores (agglomerated mutations) into pathway mutation (annotation relation) scores. SAMBAR converts a (user-defined) gene signature (.gmt format) into a binary matrix \(M\) that records whether a gene \(j\) belongs to a pathway \(q\). It then calculates pathway mutation scores \(P\) by correcting the sum of mutation scores of all genes in a pathway for the number of pathways \(q'\) a gene belongs to, and for the number of cancer-associated genes present in that pathway:

\(P_{iq}=\frac{\displaystyle\sum_{j \in q} G_{ij}/{\displaystyle\sum_{q'} M_{jq'}}}{\displaystyle\sum_{j} M_{jq}}\)
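As a rough illustration of this formula, the sketch below builds \(M\) from two made-up gene sets written in the tab-separated .gmt layout (set name, description, member genes) and computes \(P\) from a toy \(G\). All pathway, gene, and object names are hypothetical, and the code is a base R approximation of the step described above rather than the package's own implementation.

```r
# Hypothetical gene sets in .gmt layout: name, description, then member genes,
# all separated by tabs.
gmt_lines <- c(
  "PATHWAY_1\tna\tgeneA\tgeneB",
  "PATHWAY_2\tna\tgeneB\tgeneC\tgeneD"
)

# Parse each line into a named list of gene vectors.
fields <- strsplit(gmt_lines, "\t", fixed = TRUE)
gene_sets <- lapply(fields, function(f) f[-(1:2)])
names(gene_sets) <- vapply(fields, function(f) f[1], character(1))

# Toy mutation rate-adjusted scores G (samples x genes); each row sums to 1.
G <- matrix(
  c(0.5, 0.5, 0.0,
    0.0, 0.2, 0.8),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("s1", "s2"), c("geneA", "geneB", "geneC"))
)

# Binary membership matrix M (genes x pathways), restricted to the genes in G.
# geneD is not in G, so it does not count towards PATHWAY_2.
M <- vapply(gene_sets, function(gs) as.numeric(colnames(G) %in% gs),
            numeric(ncol(G)))
rownames(M) <- colnames(G)

# Numerator: divide each gene's score by the number of pathways containing it.
n_pathways <- rowSums(M)
in_any <- n_pathways > 0
G_weighted <- sweep(G[, in_any, drop = FALSE], 2, n_pathways[in_any], "/")

# Sum over the genes in each pathway, then divide by the pathway's gene count.
M_used <- M[in_any, , drop = FALSE]
P <- sweep(G_weighted %*% M_used, 2, colSums(M_used), "/")

print(P)
```

Here `geneB` belongs to both pathways, so its score is halved before being added to each pathway sum, and both sums are then divided by 2, the number of genes from `G` present in each pathway.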

Finally, SAMBAR uses binomial distance to cluster the pathway mutation scores. The cluster dendrogram is then divided into \(k\) groups (or a range of \(k\) groups), and the cluster assignments are returned in a list.
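The clustering step can be approximated in base R as follows. The helper `binomial_dist` is a hypothetical function written for this sketch; it follows the binomial dissimilarity formula given in the documentation of `vegdist` in the vegan package. The toy matrix `P`, the complete-linkage setting, and the naming of the result list are assumptions made for illustration and may differ from what SAMBAR does internally.

```r
# Binomial dissimilarity between two non-negative score vectors, following the
# formula documented for vegan::vegdist(method = "binomial").
binomial_dist <- function(x, y) {
  n <- x + y
  keep <- n > 0                           # columns where both scores are 0 are skipped
  x <- x[keep]; y <- y[keep]; n <- n[keep]
  tx <- ifelse(x > 0, x * log(x / n), 0)  # treat 0 * log(0) as 0
  ty <- ifelse(y > 0, y * log(y / n), 0)
  sum((tx + ty - n * log(0.5)) / n)
}

# Toy pathway mutation scores (samples x pathways), hypothetical values.
P <- matrix(
  c(0.40, 0.10, 0.00,
    0.35, 0.15, 0.00,
    0.00, 0.05, 0.45,
    0.00, 0.10, 0.40),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("s", 1:4), paste0("pathway", 1:3))
)

# Pairwise distance matrix between samples.
n <- nrow(P)
D <- matrix(0, n, n, dimnames = list(rownames(P), rownames(P)))
for (a in seq_len(n)) {
  for (b in seq_len(n)) {
    D[a, b] <- binomial_dist(P[a, ], P[b, ])
  }
}

# Hierarchical clustering; the linkage method here is an assumption.
tree <- hclust(as.dist(D), method = "complete")

# Cut the dendrogram for a range of k and collect the assignments in a list.
k_range <- 2:3
clusters <- lapply(k_range, function(k) cutree(tree, k = k))
names(clusters) <- paste0("k", k_range)

print(clusters$k2)
```

With these toy scores, `s1` and `s2` share mutated pathways and fall into one group for \(k=2\), while `s3` and `s4` form the other.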

Example: Subtyping Uterine Corpus Endometrial Carcinoma

As an example, we’ve added mutation data of Uterine Corpus Endometrial Carcinoma (UCEC) primary tumor samples, obtained from The Cancer Genome Atlas, to this package in data(mut.ucec). We’ve also added gene lengths (the number of non-overlapping exonic base pairs) for 23,459 genes (hg19) in data(exon.size), and a list of cancer-associated genes in data(genes). Finally, the package includes a .gmt file that contains Hallmarks pathways from MSigDb.

To run SAMBAR with default settings on these data: subtypes <- sambar()

Download the signature set from our public AWS S3 bucket using system("curl -O"), then call SAMBAR with the path where the signature set is stored: subtypes <- sambar(mutdata=mut.ucec, esize=exon.size, signatureset="./h.all.v6.1.symbols.gmt", cangenes=genes, kmin=2, kmax=4)

This will run SAMBAR on the UCEC mutation data, de-sparsifying the scores of cancer-associated genes based on the MSigDb “Hallmark” gene sets. It will return a list of the samples' cluster assignments for \(k=2\) to \(k=4\) subtypes.

Instead of using the default signatureset, you can download any .gmt file from MSigDb and add the path to this file in signatureset=PATH_TO_FILE.


Please note that, in order to cluster the data using the binomial distance, SAMBAR removes samples that have mutation scores of 0 across all pathways, and pathways that have mutation scores of 0 across all samples. Using a small number of genes of interest (cangenes), or pathways that include only a small number of genes, could result in errors. We have not yet added user-friendly error checks to the package but plan to do so in the future.
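Until such checks exist, a user could run a simple sanity check on a score matrix before clustering. The sketch below is not part of the package: `check_scores` is a hypothetical helper, the toy matrix `P` is made up, and the conditions it tests are only a plausible minimum for hierarchical clustering with a chosen `kmax`.

```r
# Hypothetical pathway score matrix (samples x pathways) with an all-zero
# sample (s2) and an all-zero pathway (pathway3).
P <- matrix(
  c(0.3, 0.2, 0,
    0.0, 0.0, 0,
    0.1, 0.4, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("s1", "s2", "s3"), paste0("pathway", 1:3))
)

# Drop all-zero samples and all-zero pathways, as described in the note above.
P_filtered <- P[rowSums(P) > 0, colSums(P) > 0, drop = FALSE]

# Illustrative guard: stop with a readable message if too little data is left.
check_scores <- function(scores, kmax) {
  if (nrow(scores) < 2 || ncol(scores) < 1) {
    stop("Too few samples or pathways left after removing all-zero rows/columns.")
  }
  if (kmax > nrow(scores)) {
    stop("kmax exceeds the number of remaining samples.")
  }
  invisible(TRUE)
}

check_scores(P_filtered, kmax = 2)
```

Calling `check_scores(P_filtered, kmax = 3)` on this example stops with an error, because only two samples remain after filtering.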