0 Load some libraries

library(vegan, warn.conflicts=F, quietly=T)

## This is vegan 2.5-6

library(GUniFrac, warn.conflicts=F, quietly=T)
library(phyloseq, warn.conflicts=F, quietly=T)
library(ggplot2, warn.conflicts=F, quietly=T)
library(dplyr, warn.conflicts=F, quietly=T)

1 A multivariate microbiome dataset

The dataset we will analyze is about the effect of smoking on the upper respiratory tract.(Charlson et al., PLOS ONE,5(12):0015216). The dataset used here is from the throat microbiome of the left body side. It contains 60 subjects (32 non-smokers and 28 smokers). This example dataset is also used in the book: “Statistical Analysis of Microbiome Data with R”, by Xia, Sun and Chen. The results from our analysis should be the same as presented in the book. First load the data and explore some of its properties.

data(throat.otu.tab)
data(throat.meta)
throat_meta <- select(throat.meta, SmokingStatus, Age, Sex, PackYears)

1.1 First analysis

The throat.otu.tab dataset contains the counts for 856 different OTUs for the 60 subjects. Most OTUs only have counts for very few subjects, but some OTUs (e.g. 3227 in column 33) have many counts for most subjects. The metadata (throat_meta) provides information on Smokerstatus, Age, Sex and PackYears.

To plot the Richness of the samples and the abundances, the data will be converted to the phyloseq format.

OTU = otu_table(as.matrix(throat.otu.tab), taxa_are_rows = FALSE)
SAM = sample_data(throat_meta)
physeq<-merge_phyloseq(phyloseq(OTU),SAM)
physeq

1.1.1 Plot Richness

Explore whether there is a difference between Smokers and Non-Smokers in terms of Chao1 and Shannon measures of Richness.

plot_richness(physeq, measures = c("Chao1", "Shannon"),x = "SmokingStatus", color = "SmokingStatus")

Three smokers have a much smaller Shannon value. ### ? What does it mean to have a small Shannon diversity?

1.1.2 Plot Abundance Bar

Explore whether there is a difference between Smokers and Non-Smokers or between males and females in terms of Abundance.

plot_bar(physeq, fill="SmokingStatus")

plot_bar(physeq, fill="Sex")

? What is your conclusion with respect to Richness and Abundance?

2 Multivariate analysis, Unconstrained Ordination

Now we will focus on differences between the samples. Based on 856 OTUs can one see whether the smokers are different from the non-smokers? We will start with Principal Component Analysis.

2.1 Principal Component Analysis (PCA)

PCA aims to visualize the data using a score plot in which both the subjects and also the OTUs are represented in a 2 dimensional plot. PCA aims to make such a map using principal components that maximize the explained variation in the data. However, since PCA suffers from the many zeros in the dataset, a Hellinger transformation is used to ‘partly’ remove the zero problem.

abund_hell <- decostand (throat.otu.tab, 'hell') # Hellinger transformation
pca_hell<- rda(abund_hell)
summary (pca_hell)

The PCA is calculated using the RDA function. It provides information on how much variance is calculated per principal component, and it also provides the loading values for each OTU and the scores for each subject.

PC1 and 2 explain 18% and 15% of variation, and after that the 3rd only 8%. We will focus on first 2 PCs.

Scoreplot and loading plot

plot(pca_hell, type="n")
points(pca_hell, col=throat_meta$SmokingStatus)

ordiplot(pca_hell, display = "species", type = "text", main = "Loading plot")

In the score plot, it can be seen that the smokers (red) and non-smokers partly overlap, but on the right hand site we mainly see smokers. Looking at the loading plot, some OTUs stick out of the bunch. These are the ones with large count values (e.g. see OTU 3227 at the lower left corner). OTU 3418 will have high values in the samples that can be found at the same spot of the score plot (e.g. ESC_1.63_OPL).

2.2: Principal Coordinates Analysis (PCoA)

Because of the “strange” distribution of the abundance data, PCA might not be the most appropriate method for ordination. Another approach is to PCoA to visualize the subjects in a 2D plot. PCoA is applied to a subject by subject distance matrix. Here we will apply PCoA on the smokers data using the Bray Curtis distance matrix.

#Calculate Bray Curtis Distance
bcdist_throat <-vegdist(throat.otu.tab, "bray")
PCoA <- cmdscale (bcdist_throat, eig = TRUE,k = 2)

PCoA$eig

explainedvar <- round(PCoA$eig[c(1,2)] / sum(PCoA$eig), 2) * 100
explainedvar

# Plot Eigenvalues
plot(PCoA$eig, xlab = "PCoA", ylab = "Eigenvalue",las = 1, cex.lab = 1.5, pch = 16)

# Score Plot
plot(PCoA$points[ ,1], PCoA$points[,2], col = throat_meta$SmokingStatus, main = "PCoA scores plot")

First the eigenvalues are shown, with 2.79 and 2.21 for the first two components. It can be seen that these explain 20 and 16% of the variation respectively.The eigenvalue plot also shows that the first two are much larger than the rest and a 2D plot seems sufficient. The PCoA score plot shows that the smokers (red) are clustered somewhat at the left side of the plot. The PCA and PCoA scoreplots give generally similar information. The PCoA does not have a loading plot as the information on the OTUs have been lost when the distance is calculated.

3: Constrained ordination

In the constrained ordination analysis, the goal is to explore how the variation in OTUs between the subject can be explained using the cofactors present in the throat_meta data. Thus how much variation can be explained by SmokerStatus, Age, Sex and PackYears. We will look at Redundancy analysis (RDA), which is an extension of PCA, and also at Distance Based RDA which is an extension of PCoA.

3.1 Redundancy analysis.

For the RDA analysis we again use the Hellinger transformation.

abund_hell <- decostand (throat.otu.tab,"hell")
# Apply rda on abundance data, using SmokingStatus, Age, Sex, PackYears as restrictions
rda_hell<- rda(abund_hell, throat_meta)

# Show results
summary (rda_hell)
coef(rda_hell)
RsquareAdj(rda_hell)

# Triplot with species (sp), explanatory variables (CN) and scores as linear combinations of the explanatory variables (LC)

plot(rda_hell, display=c("sp", "lc", "cn"),main="Triplot RDA - scaling 2")

There are a lot of results shown: First the partitioning of the variance. Only 10% of the variation in the Hellinger transformed data can be described by the 4 cofactors, and thus 90% of the variation does not relate to Smoking, Age, Sex or Packyears. Then the Eigenvalues per component are shown. There are 4 RDA components, and 55 PCs.

? Why are there 4 RDA components?

Then we can see that from the explained part by the cofactors, 77% can be explained by the first two RDA components. Finally the RDA scores (and PC loadings) are provided (which are called the species scores) as well as their scores (here they are colled site scores). The latter are represented in two ways, as Site scores (weighted sums of species scores), and as Site constraints (linear combinations of constraining variables).

The figure shows a Triplot with information on samples (green) represented as scores calculated as linear combinations of the explanatory variables, the species (red) and the explanatory variables themselves. ### ? What is the interpretation of the green triangles on a straight line in the left side of the plot?

2 types of scaling are distinguished in RDA. Note that in PCA the loadings were normalized to length 1 and the lengths of the scores were representing the amount of variance explained by the PC. This is called Scaling type 1. The opposite where the scores are normalized to length one and the length of the loadings represent the size of the component is called Scaling type 2.

Now we will look whether the explained variation by the explanatory variables is significantly larger than can be expected by chance.

rda_hell<- rda(abund_hell ~ SmokingStatus + Age + Sex + PackYears, throat_meta)
# significance testing Whole model
set.seed (1)
anova(rda_hell, step=1000)
# significance testing Whole per RDA component
anova(rda_hell, by="axis", step=1000)

anova(rda_hell, by="margin", step=1000)

First we can see that the RDA model is significant (F = 1.6, p = 0.07). ### ? What does this mean?

Then we can conclude there is at least a single RDA component significant. Finally, to see which of the explanatory variables has influence on the OTUs a marginal test is used. The marginal test shows how much the new explanatory variable contributes when the others are already in the model. It seems like smoking status and Sex have an influence, although interpretation is complex as the explanatory variables are rather correlated (i.e. Age and PackYears, PackYears and SmokingStatus) It is possible to test different models (e.g. without SmokingStatus and Age, as this is somehow also also represented by PackYears).

rda_hell2<- rda(abund_hell ~ Sex + PackYears, throat_meta)
# significance testing Whole model
set.seed (1)
anova(rda_hell2, step=1000)
# significance testing Whole per RDA component
anova(rda_hell2, by="axis", step=1000)

anova(rda_hell2, by="margin", step=1000)

? What is your conclusion?

3.2. Distance Based RDA

From the PCA and PCoA analyses we already realized that a distance based approach could be more appropriate for OTU data with so many zeros. There is also a distance based RDA methods that starts with a PCoA on a distance matrix.

# apply distance based RDA method
DBRDA_throat  <-dbrda(bcdist_throat ~ ., throat_meta)
DBRDA_throat

#Make plot
plot(DBRDA_throat, display = c("lc","cn"), main="Biplot DBRDA")

#significance testing
set.seed (123)
anova(DBRDA_throat, step=1000)
anova(DBRDA_throat, by="axis", step=1000)
anova(DBRDA_throat, by="margin", step=1000)

? What is your interpretation of the DBRDA results and figure?

4: DO IT YOURSELF

The throat dataset has 856 OTU features however most of them are only found in one or two subjects. Furthermore the number of reads is rather different per subject. As a preprocessing we could take only OTU features into account that are present in at least 20 individuals. Then you could compare A Hellinger transform and a Centered Log Ratio for PCA on the smaller dataset. Also you could see how the new preprocessing performs on PCoA, RDA and DBRDA.

# SELECT ONLY SPECIES PRESENT IN 20 OR MORE INDIVIDUALS
nonzeros = colSums(throat.otu.tab!=0)
a = nonzeros>20
throat.otu.select = throat.otu.tab[,a]

## CLR
throat.otu.select1 = throat.otu.select + 1
throat.otu.select.clr = t(apply(throat.otu.select1,1,function(x){log(x) - mean(log(x))}))

#PCA on Hellinger transformed data
throat.otu.select.hell <- decostand (throat.otu.select, 'hell') # Hellinger transformation
pca_sel_hell<- rda(throat.otu.select.hell)
summary (pca_sel_hell)
plot(pca_sel_hell, type="n", main = "Scores Hellinger")
points(pca_sel_hell, col=throat_meta$SmokingStatus)

ordiplot(pca_sel_hell, display = "species", type = "text", main = "Loading Hellinger")

# PCA on CLR transformed data
pca_sel_clr<- rda(throat.otu.select.clr)
summary (pca_sel_clr)
plot(pca_sel_clr, type="n", main = "Scores CLR")
points(pca_sel_clr, col=throat_meta$SmokingStatus)

ordiplot(pca_sel_clr, display = "species", type = "text", main = "Loading CLR")

# PCoA
bcdist_throat_sel <-vegdist(throat.otu.select, "bray")
PCoA_sel <- cmdscale (bcdist_throat_sel, eig = TRUE,k = 2)
explainedvar_sel <- round(PCoA_sel$eig[c(1,2)] / sum(PCoA_sel$eig), 2) * 100
explainedvar_sel

# Score Plot
plot(PCoA_sel$points[ ,1], PCoA_sel$points[,2], col = throat_meta$SmokingStatus, main = "PCoA scores plot")

# RDA
rda_hell_sel<- rda(throat.otu.select.hell, throat_meta)
coef(rda_hell_sel)
RsquareAdj(rda_hell_sel)
plot(rda_hell_sel, display=c("sp", "lc", "cn"),main="Triplot RDA - scaling 2")

rda_hell_sel <- rda(throat.otu.select.hell ~ SmokingStatus + Age + Sex + PackYears, throat_meta)
# significance testing Whole model
set.seed (1)
anova(rda_hell_sel, step=1000)

# significance testing Whole per RDA component
anova(rda_hell_sel, by="axis", step=1000)
anova(rda_hell_sel, by="margin", step=1000)

# DBRDA
DBRDA_throat_sel  <-dbrda(bcdist_throat_sel ~., throat_meta)
DBRDA_throat_sel
plot(DBRDA_throat_sel, display = c("lc","cn"), main="Biplot DBRDA")

#significance testing
set.seed (123)
anova(DBRDA_throat_sel, step=1000)
anova(DBRDA_throat_sel, by="axis", step=1000)
anova(DBRDA_throat_sel, by="margin", step=1000)

Multivariate Data Analysis of microbiome data

Johan Westerhuis

11/30/2020

0 Load some libraries

1 A multivariate microbiome dataset

1.1 First analysis

1.1.1 Plot Richness

1.1.2 Plot Abundance Bar

? What is your conclusion with respect to Richness and Abundance?

2 Multivariate analysis, Unconstrained Ordination

2.1 Principal Component Analysis (PCA)

Scoreplot and loading plot

2.2: Principal Coordinates Analysis (PCoA)

3: Constrained ordination

3.1 Redundancy analysis.

? Why are there 4 RDA components?

? What is your conclusion?

3.2. Distance Based RDA

? What is your interpretation of the DBRDA results and figure?

4: DO IT YOURSELF

Multivariate Data Analysis of microbiome data

Johan Westerhuis

11/30/2020

0 Load some libraries

1 A multivariate microbiome dataset

1.1 First analysis

1.1.1 Plot Richness

1.1.2 Plot Abundance Bar

? What is your conclusion with respect to Richness and Abundance?

2 Multivariate analysis, Unconstrained Ordination

2.1 Principal Component Analysis (PCA)

Scoreplot and loading plot

? Combining score and loading plot, can you find a species related that is more abundant in smokers?

2.2: Principal Coordinates Analysis (PCoA)

3: Constrained ordination

3.1 Redundancy analysis.

? Why are there 4 RDA components?

? What is your conclusion?

3.2. Distance Based RDA

? What is your interpretation of the DBRDA results and figure?

4: DO IT YOURSELF