diff --git a/DESCRIPTION b/DESCRIPTION index 5e6c8dc..d0d232f 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -16,6 +16,47 @@ Authors@R: c( role = "ctb", comment = "'esApply' and 'BiobaseDevelopment' vignette translation from Sweave to Rmarkdown / HTML" ), + Package: Biobase +Title: Biobase: Base functions for Bioconductor +Description: Functions that are needed by many other packages or which + replace R functions. +biocViews: Infrastructure +URL: https://bioconductor.org/packages/Biobase +BugReports: https://github.com/Bioconductor/Biobase/issues +Version: 2.59.0 +License: Artistic-2.0 +Authors@R: c( + person("R.", "Gentleman", role="aut"), + person("V.", "Carey", role = "aut"), + person("M.", "Morgan", role="aut"), + person("S.", "Falcon", role="aut"), + person("Haleema", "Khan", + role = "ctb", + comment = "'esApply' and 'BiobaseDevelopment' vignette translation from Sweave to Rmarkdown / HTML" + ), + person("Paul", "Villafuerte", + role = "ctb", + comment = "'ExpressionSetIntroduction' vignette translation from Sweave to Rmarkdown / HTML" + ) + person("Bioconductor Package Maintainer", + role = "cre", + email = "maintainer@bioconductor.org" + )) +Suggests: tools, tkWidgets, ALL, RUnit, golubEsets, BiocStyle, knitr +Depends: R (>= 2.10), BiocGenerics (>= 0.27.1), utils +Imports: methods +VignetteBuilder: knitr +LazyLoad: yes +Collate: tools.R strings.R environment.R vignettes.R packages.R + AllGenerics.R VersionsClass.R VersionedClasses.R + methods-VersionsNull.R methods-VersionedClass.R DataClasses.R + methods-aggregator.R methods-container.R methods-MIAxE.R + methods-MIAME.R methods-AssayData.R + methods-AnnotatedDataFrame.R methods-eSet.R + methods-ExpressionSet.R methods-MultiSet.R methods-SnpSet.R + methods-NChannelSet.R anyMissing.R rowOp-methods.R + updateObjectTo.R methods-ScalarObject.R zzz.R + person("Bioconductor Package Maintainer", role = "cre", email = "maintainer@bioconductor.org" diff --git a/vignettes/ExpressionSetIntroduction.Rmd b/vignettes/ExpressionSetIntroduction.Rmd new file mode 100644 index 0000000..89740e4 --- /dev/null +++ b/vignettes/ExpressionSetIntroduction.Rmd @@ -0,0 +1,430 @@ +--- +title: An Introduction to Bioconductor's *ExpressionSet* Class +author: +- name: "Seth Falcon" +- name: "Martin Morgan" +- name: "Robert Gentleman" +- name: "Paul Villafuerte" + affiliation: "Vignette translation from Sweave to Rmarkdown / HTML" +date: "6 October, 2006; revised `r format(Sys.time(), '%B %d, %Y')`" +vignette: > + %\VignetteIndexEntry{An introduction to Biobase and ExpressionSets} + %\VignetteKeywords{tutorial, environment, graphics, ExpressionSet} + %\VignetteDepends{Biobase} + %\VignetteEncoding{UTF-8} + %\VignetteEngine{knitr::rmarkdown} +output: + BiocStyle::html_document: + number_sections: true + toc: true +editor_options: + markdown: + wrap: 72 +--- + +# Introduction + +`r Biocpkg('Biobase')` is part of the Bioconductor project, and is used +by many other packages. `r Biocpkg('Biobase')` contains standardized +data structures to represent genomic data. The *ExpressionSet* class is +designed to combine several different sources of information into a +single convenient structure. An *ExpressionSet* can be manipulated +(e.g., subsetted, copied) conveniently, and is the input or output from +many Bioconductor functions. + +The data in an *ExpressionSet* is complicated, consisting of expression +data from microarray experiments (`assayData`; `assayData` is used to +hint at the methods used to access different data components, as we will +see below), 'meta-data' describing samples in the experiment +(`phenoData`), annotations and meta-data about the features on the chip +or technology used for the experiment (`featureData`, `annotation`), +information related to the protocol used for processing each sample (and +usually extracted from manufacturer files, `protocolData`), and a +flexible structure to describe the experiment (`experimentData`). The +*ExpressionSet* class coordinates all of this data, so that you do not +usually have to worry about the details. However, an *ExpressionSet* +needs to be created in the first place, and creation can be complicated. + +In this introduction we learn how to create and manipulate +*ExpressionSet* objects, and practice some basic *R* +skills. + +# Preliminaries + +## Installing Packages + +If you are reading this document and have not yet installed any software +on your computer, visit [http://bioconductor.org](http://bioconductor.org) and follow the +instructions for installing **R** and Bioconductor. Once you +have installed **R** and Bioconductor, you are ready to go +with this document. In the future, you might find that you need to +install one or more additional packages. The best way to do this is to +start an **R** session and evaluate commands like + +```{r install-pkg, eval=FALSE} +if (!require("BiocManager")) + install.packages("BiocManager") +BiocManager::install("Biobase") +``` + +## Loading Packages + +The definition of the *ExpressionSet* class along with many methods for manipulating *ExpressionSet* objects are defined in the +`r Biocpkg('Biobase')` package. In general, you need to load class and +method definitions before you use them. When using Bioconductor, this +means loading **R** packages using `library` or `require`. + +```{r loadlib, message=FALSE} +library("Biobase") +``` + +**Exercise 1** + +*What happens when you try to load a package that is not installed?* + +| When using `library`, you get an error message. With `require`, the +| return value is `FALSE` and a warning is printed. + +# Building an ExpressionSet From .CEL and other files + +Many users have access to .CEL or other files produced by microarray +chip manufacturer hardware. Usually the strategy is to use a +Bioconductor package such as `r Biocpkg('affyPLM')`, `r Biocpkg('affy')`, +`r Biocpkg('oligo')`, or `r Biocpkg('limma')` to +read these files. These Bioconductor packages have functions (e.g., +`ReadAffy`, `expresso`, or `justRMA` in `r Biocpkg('affy')`) to read CEL +files and perform preliminary preprocessing, and to represent the +resulting data as an *ExpressionSet* or other type of object. Suppose +the result from reading and preprocessing CEL or other files is named +`object`, and `object` is different from *ExpressionSet*; a good bet is +to try, e.g., + +```{r convert,eval=FALSE} +library(convert) +as(object, "ExpressionSet") +``` + +It might be the case that no converter is +available. The path then is to extract relevant data from `object` and +use this to create an *ExpressionSet* using the instructions below. + +# Building an ExpressionSet From Scratch + +As mentioned in the introduction, the data from many high-throughput +genomic experiments, such as microarray experiments, usually consist of several conceptually distinct parts: assay data, phenotypic meta-data, feature annotations and meta-data, and a description of the experiment. We'll construct each of these components, and then assemble them into an *ExpressionSet*. + +## Assay data + +One important part of the experiment is a matrix of 'expression' values. The values are usually derived from microarrays of one sort or another, perhaps after initial processing by manufacturer software or +Bioconductor packages. The matrix has $F$ rows and $S$ columns, where +$F$ is the number of features on the chip and $S$ is the number of +samples. + +A likely scenario is that your assay data is in a 'tab-delimited' text +file (as exported from a spreadsheet, for instance) with rows +corresponding to features and columns to samples. The strategy is to +read this file into [R]{.sans-serif} using the `read.table` command, +converting the result to a *matrix*. A typical command to read a +tab-delimited file that includes column 'headers' is + +```{r read-table-geneData} +dataDirectory <- system.file("extdata", package="Biobase") +exprsFile <- file.path(dataDirectory, "exprsData.txt") +exprs <- as.matrix(read.table(exprsFile, header=TRUE, sep="\t", + row.names=1, + as.is=TRUE)) +``` + + The first two lines +create a file path pointing to where the assay data is stored; replace +these with a character string pointing to your own file, e.g, + +```{r exprsFile,eval=FALSE} +exprsFile <- "c:/path/to/exprsData.txt" +``` + +(Windows users: note the use of `/` rather than `\`; this is because +*R* treats the `\` character as an 'escape' sequence to +change the meaning of the subsequent character). See the help pages for `read.table` for more detail. A common variant is that the character separating columns is a comma ("comma-separated values", or "csv" files), in which case the `sep` argument might be `sep=","`. + +It is always important to verify that the data you have read matches +your expectations. At a minimum, check the class and dimensions of +`geneData` and take a peak at the first several rows + +```{r geneData-peak} +class(exprs) +dim(exprs) +colnames(exprs) +head(exprs[,1:5]) +``` + +At this point, we can create a minimal *ExpressionSet* object using the `ExpressionSet` constructor: + +```{r ExpressionSet-basic} +minimalSet <- ExpressionSet(assayData=exprs) +``` + +We'll get more benefit from expression sets by creating a richer object that coordinates phenotypic and other data with our expression data, as outlined in the following sections. + +## Phenotypic data + +Phenotypic data summarizes information about the samples (e.g., sex, +age, and treatment status; referred to as 'covariates'). The information +describing the samples can be represented as a table with $S$ rows and +$V$ columns, where $V$ is the number of covariates. An example of +phenotypic data can be input with + +```{r pData} +pDataFile <- file.path(dataDirectory, "pData.txt") +pData <- read.table(pDataFile, + row.names=1, header=TRUE, sep="\t") +dim(pData) +rownames(pData) +summary(pData) +``` + +There are three columns of data, and 26 rows. Note that +the number of rows of phenotypic data match the number of columns of +expression data, and indeed that the row and column names are +identically ordered: + +```{r geneCovariate-geneData-name-matc} +all(rownames(pData)==colnames(exprs)) +``` + +This is an essential feature of the relationship between the assay and phenotype data; *ExpressionSet* will complain if these names do not match. + +Phenotypic data can take on a number of different forms. For instance, +some covariates might reasonably be represented as numeric values. Other covariates (e.g., gender, tissue type, or cancer status) might better be represented as `factor` objects (see the help page for `factor` for more information). It is especially important that the phenotypic data are encoded correctly; the `colClasses` argument to `read.table` can be helpful in correctly inputing (and ignoring, if desired) columns from the file. + +**Exercise 2** + +*What class does read.table return?* + +**Exercise 3** + +*Determine the column names of `pData`. Hint: `apropos("name")`.* + +```{r colnames} + names(pData) +``` + +**Exercise 4**. *Use `sapply` to determine the classes of each column of `pData`. Hint: read the help page for `sapply`.* + +```{r sapplyClasses} + sapply(pData, class) +``` + +**Exercise 5**. *What is the sex and Case/Control status of the 15th and 20th samples? And for the sample(s) with score greater than 0.8.* + +```{r simpleSubsetting} + pData[c(15, 20), c("gender", "type")] + pData[pData$score>0.8,] +``` + +Investigators often find that the meaning of simple column names does +not provide enough information about the covariate -- What is the +cryptic name supposed to represent? What units are the covariates +measured in? We can create a data frame containing such meta-data (or +read the information from a file using `read.table`) with + +```{r metadata-create} +metadata <- data.frame(labelDescription= + c("Patient gender", + "Case/control status", + "Tumor progress on XYZ scale"), + row.names=c("gender", "type", "score")) +``` + +This creates a *data.frame* object with a single column called `labelDescription`, and with row names identical to the column names of the *data.frame* containing the phenotypic data. The column `labelDescription` *must* be present; other columns are optional. + +Bioconductor's *`r Biocpkg('Biobase')`* package provides a class called +*AnnotatedDataFrame* that conveniently stores and manipulates the +phenotypic data and its metadata in a coordinated fashion. Create and +view an *AnnotatedDataFrame* instance with: + +```{r AnnotatedDataFrame} +phenoData <- new("AnnotatedDataFrame", + data=pData, varMetadata=metadata) +phenoData +``` + +Some useful operations on an *AnnotatedDataFrame* include *sampleNames*, *pData* (to extract the original `pData` *data.frame*), and *varMetadata*. In addition, *AnnotatedDataFrame* objects can be subset much like a *data.frame*: + +```{r AnnotatedDataFrame-subset} +head(pData(phenoData)) +phenoData[c("A","Z"),"gender"] +pData(phenoData[phenoData$score>0.8,]) +``` + +## Annotations and feature data + +Meta-data on features is as important as meta-data on samples, and can +be very large and diverse. A single chip design (i.e., collection of +features) is likely to be used in many different experiments, and it +would be inefficient to repeatedly collect and coordinate the same +meta-data for each *ExpressionSet* instance. Instead, the ideas +is to construct specialized meta-data packages for each type of chip +or instrument. Many of these packages are available from the +Bioconductor web site. These packages contain information such as the +gene name, symbol and chromosomal location. There are other meta-data +packages that contain the information that is provided by other +initiatives such as GO and KEGG. The `r Biocpkg('annotate')` and +`r Biocpkg('AnnotationDbi')` packages provides basic data manipulation +tools for the meta-data packages. + +The appropriate way to create annotation data for features is very +straight-forward: we provide a character string identifying the type of chip used in the experiment. For instance, the data we are using is from the Affymetrix hgu95av2 chip: + +```{r} +annotation <- "hgu95av2" +``` + +It is also possible to record information about features that are +unique to the experiment (e.g., flagging particularly relevant +features). This is done by creating or modifying an +`AnnotatedDataFrame` like that for `phenoData` but +with row names of the *AnnotatedDataFrame* matching rows of the +assay data. + +## Experiment description + +Basic description about the experiment (e.g., the investigator or lab +where the experiment was done, an overall title, and other notes) can +be recorded by creating a *MIAME* object. One way to create a +*MIAME* object is to use the `new` function: + +```{r} +experimentData <- new("MIAME", + name="Pierre Fermat", + lab="Francis Galton Lab", + contact="pfermat@lab.not.exist", + title="Smoking-Cancer Experiment", + abstract="An example ExpressionSet", + url="www.lab.not.exist", + other=list( + notes="Created from text files" + )) +``` + +Usually, `new` takes as arguments the class name and +pairs of names and values corresponding to different slots in the +class; consult the help page for *MIAME* for details of +available slots. + +## Assembling an *ExpressionSet* + +An *ExpressionSet* object is created by assembling its component +parts and callng the `ExpressionSet` constructor: + +```{r ExpressionSetFinally} +exampleSet <- ExpressionSet(assayData=exprs, + phenoData=phenoData, + experimentData=experimentData, + annotation="hgu95av2") +``` + +Note that the names on the right of each equal sign can refer to any +object of appropriate class for the argument. See the help page for +*ExpressionSet* for more information. + +We created a rich data object to coordinate diverse sources of +information. Less rich objects can be created by providing less +information. As mentioned earlier, a minimal expression set can be +created with + +```{r ExpressionSet-minimal} +minimalSet <- ExpressionSet(assayData=exprs) +``` + +Of course this object has no information about phenotypic or feature +data, or about the chip used for the assay. + +# *ExpressionSet* Basics + +Now that you have an *ExpressionSet* instance, let's explore +some of the basic operations. You can get an overview of the +structure and available methods for *ExpressionSet* objects by +reading the help page: + +```{r helpExpressionSet, eval=FALSE} +help("ExpressionSet-class") +``` + +When you print an *ExpressionSet* object, a brief summary of +the contents of the object is displayed (displaying the entire object +would fill your screen with numbers): + +```{r showExpressionSet} +exampleSet +``` + + +## Accessing Data Elements + +A number of accessor functions are available to extract data from an +*ExpressionSet* instance. You can access the columns of the phenotype data (an *AnnotatedDataFrame* instance) using $: + +```{r usingDollar} +exampleSet$gender[1:5] +exampleSet$gender[1:5] == "Female" +``` + +You can retrieve the names of +the features using `featureNames`. For many microarray datasets, the +feature names are the probe set identifiers. + +```{r featureNames} +featureNames(exampleSet)[1:5] +``` + +The unique identifiers of the samples +in the data set are available via the `sampleNames` method. The +`varLabels` method lists the column names of the phenotype data: + +```{r sampleNames} +sampleNames(exampleSet)[1:5] +varLabels(exampleSet) +``` + +Extract the expression matrix of sample information using `exprs`: + +```{r exprs} +mat <- exprs(exampleSet) +dim(mat) +``` + +### Subsetting + +Probably the most useful operation to perform on *ExpressionSet* objects +is subsetting. Subsetting an *ExpressionSet* is very similar to +subsetting the expression matrix that is contained within the +*ExpressionSet*, the first argument subsets the features and the second +argument subsets the samples. Here are some examples: Create a new +*ExpressionSet* consisting of the 5 features and the first 3 samples: + +```{r first10} +vv <- exampleSet[1:5, 1:3] +dim(vv) +featureNames(vv) +sampleNames(vv) +``` + +Create a subset consisting of only the male samples: + +```{r males} +males <- exampleSet[ , exampleSet$gender == "Male"] +males +``` + + +# What was used to create this document + +The version number of *R* and the packages and their versions that were +used to generate this document are listed below. + +```{r echo=FALSE} +sessionInfo() +``` + +