AMADA


Welcome to AMADA - Analysis of Multidimensional Astronomical DAtasets

AMADA allows iterative exploration of, and information retrieval from, high-dimensional data sets. It does so by performing a hierarchical clustering analysis for different choices of correlation matrix and a principal component analysis on the original data. AMADA also provides a set of modern visualization and data-mining diagnostics, and the user can switch between them using the different tabs.

Install R and RStudio from

http://www.r-project.org
http://www.rstudio.com

Install Required libraries

install.packages('ape',dependencies=TRUE)
install.packages('circlize',dependencies=TRUE)
install.packages('corrplot',dependencies=TRUE)
install.packages('devtools',dependencies=TRUE)
install.packages('fpc',dependencies=TRUE)
install.packages('ggplot2',dependencies=TRUE)
install.packages('ggthemes',dependencies=TRUE)
install.packages('MASS',dependencies=TRUE)
install.packages('markdown',dependencies=TRUE)
install.packages('mclust',dependencies=TRUE)
install.packages('minerva',dependencies=TRUE)
install.packages('mvtnorm',dependencies=TRUE)
install.packages('pcaPP',dependencies=TRUE)
install.packages('pheatmap',dependencies=TRUE)
install.packages('phytools',dependencies=TRUE)
install.packages('qgraph',dependencies=TRUE)
install.packages('RColorBrewer',dependencies=TRUE)
install.packages('RCurl',dependencies=TRUE)
install.packages('squash',dependencies=TRUE)
### 'stats' ships with base R and does not need to be installed
install.packages('shiny',dependencies=TRUE)
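
As an optional convenience, the sketch below checks which of the packages named above are not yet installed, so that only those need to be fetched. The helper name `missing_packages` is hypothetical and not part of AMADA; the install call is left commented out so nothing is downloaded unless you choose to.

```r
# Package names copied from the install.packages() calls above;
# 'stats' is left out because it ships with base R.
required <- c("ape", "circlize", "corrplot", "devtools", "fpc", "ggplot2",
              "ggthemes", "MASS", "markdown", "mclust", "minerva", "mvtnorm",
              "pcaPP", "pheatmap", "phytools", "qgraph", "RColorBrewer",
              "RCurl", "squash", "shiny")

# Hypothetical helper: return the subset of `pkgs` that is not installed.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # Uncomment to install only the missing ones:
  # install.packages(to_install, dependencies = TRUE)
}
```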

Install the AMADA R package from GitHub

require(devtools)

install_github("RafaelSdeSouza/AMADA")

Run Shiny App


require(shiny)
runUrl('https://github.com/RafaelSdeSouza/AMADA_shiny/archive/master.zip')
### If the above does not work, try this
### options("download.file.extra" = "--no-check-certificate") 

Data Input

AMADA allows users either to use the available datasets or to upload their own. Check the bottom of the 'Import Dataset' panel to see whether the data have been properly imported. The data can be viewed by clicking the tab "Dataset" on the main page.

Available datasets

The available datasets follow the nomenclature of their respective source articles. I recommend that users check the original articles or catalogs for a better understanding of what the variables mean.

Import dataset

Data must be imported as a CSV/TXT file with named columns separated by spaces. The file may contain an arbitrary number of columns and rows. Missing data, if present, should be marked as NA. An example of how a dataset should be formatted can be found by clicking the tab "Dataset" on the main page.
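
The format described above can be produced from R itself. The following is a minimal sketch, assuming a small made-up table: the column names, values, and file name are illustrative only and do not come from any AMADA dataset.

```r
# Hypothetical example data; names and values are illustrative only.
catalog <- data.frame(
  mass     = c(10.2, 11.5, NA, 9.8),
  redshift = c(0.05, 0.12, 0.30, NA),
  sfr      = c(1.3, 0.4, 2.1, 0.9)
)

# Write a space-separated text file with a header row and NA for missing values.
path <- file.path(tempdir(), "my_catalog.txt")
write.table(catalog, file = path, sep = " ", na = "NA",
            row.names = FALSE, quote = FALSE)

# Read the file back to confirm that the layout round-trips.
check <- read.table(path, header = TRUE, sep = " ", na.strings = "NA")
str(check)
```

Setting `quote = FALSE` and `row.names = FALSE` keeps the file to plain named columns, which is the layout the import panel is described as expecting.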

Control Options

On the left panel, the user can choose among different methods of analysis and visualization. Once the combination is chosen, click on the button "Make it so" to update the plots. The following options are available:

Fraction of data to display: choose the percentage of data displayed on the screen.

Correlation method: choose among Pearson, Spearman or Maximum Information Coefficient (MIC).

Display numbers: choose if correlation coefficients should be displayed in the heatmap.

Dendrogram type: choose among Phylogram, Cladogram or Fan.

Graph layout: choose among Spring or Circular.

Chord diagram colour: choose among different colour schemes.

Number of PCs: choose the number of principal components to display as Nightingale charts.

PCA method: choose among Standard PCA or Robust PCA.

Employed Analysis

The current version of AMADA allows the user to choose among different correlation methods and types of PCA.

Principal Components Analysis

PCA: An orthogonal transformation that linearly converts a dataset into a set of uncorrelated variables called principal components (PCs). The PCs are computed by diagonalization of the data correlation matrix, with the resulting eigenvectors corresponding to the PCs and the resulting eigenvalues to the variance explained by each PC. The eigenvector corresponding to the largest eigenvalue gives the direction of greatest variance (PC1), the one corresponding to the second largest eigenvalue gives the direction of the next highest variance (PC2), and so on (e.g., Jolliffe 2002).

Robust PCA: Robust principal component analysis using the projection-pursuit principle. The data are projected onto a lower-dimensional space chosen so that a robust measure of the variance of the projected data is maximized (Croux, Filzmoser and Oliveira, 2007).
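
The diagonalization described above can be reproduced in a few lines of base R. This is a minimal sketch on simulated data (the variables `x1`, `x2`, `x3` are made up for illustration), not AMADA's own code; it cross-checks the eigenvalues against `prcomp()` run on standardized data.

```r
set.seed(42)

# Hypothetical data: three variables, two of them strongly correlated.
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + 0.2 * rnorm(n)
x3 <- rnorm(n)
X  <- cbind(x1 = x1, x2 = x2, x3 = x3)

# Diagonalize the correlation matrix.
eig <- eigen(cor(X))
eig$values                     # variance explained by each PC, in decreasing order
eig$vectors                    # columns are the PC directions
eig$values / sum(eig$values)   # fraction of the total variance per PC

# Cross-check: PCA on centred and scaled data yields the same variances.
pca <- prcomp(X, center = TRUE, scale. = TRUE)
all.equal(pca$sdev^2, eig$values)
```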

Hierarchical Clustering

An unsupervised learning technique whose aim is to find hidden structures in the dataset. Instead of finding a single partitioning of the data, hierarchical clustering builds a hierarchy of partitions, which may reveal interesting structure in the dataset at multiple levels of association. A clear advantage is that the number of clusters does not need to be specified in advance. Nonetheless, the method implicitly assumes a measure of similarity between pairs of objects, which in our case is given by the correlation distance d(x,y) = 1-|corr(x,y)|. The outcome is a hierarchical representation in which the clusters at each level of the hierarchy are created by merging clusters at the next lower level. We employ an agglomerative approach, which starts with a single cluster assigned to each object and then progressively merges the two closest clusters until a single cluster remains.

Number of Clusters: To guide the user, we display an optimal number of clusters according to the Calinski and Harabasz (1974) index. The groups are color-coded in the dendrogram and graph visualizations.
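
As an illustration of the agglomerative procedure with the correlation distance, here is a minimal base-R sketch on simulated variables. The variable names are made up, and the linkage rule (`method = "average"`) is an assumption, since the text above does not say which linkage AMADA uses.

```r
set.seed(1)

# Hypothetical data: two latent signals, each traced by two noisy variables.
# a2 is anti-correlated with a1, so the absolute value in the distance matters.
n <- 300
a <- rnorm(n)
b <- rnorm(n)
X <- cbind(a1 =  a + 0.1 * rnorm(n),
           a2 = -a + 0.1 * rnorm(n),
           b1 =  b + 0.1 * rnorm(n),
           b2 =  b + 0.1 * rnorm(n))

# Correlation distance between variables: d(x,y) = 1 - |corr(x,y)|.
d <- as.dist(1 - abs(cor(X)))

# Agglomerative clustering; the linkage choice here is an assumption.
hc <- hclust(d, method = "average")

# Cut the tree into two groups and inspect the membership of each variable.
groups <- cutree(hc, k = 2)
print(groups)
# plot(hc)   # draws the dendrogram
```

Because the distance uses the absolute correlation, strongly anti-correlated variables such as `a1` and `a2` are treated as close to each other.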
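
To show what the Calinski and Harabasz index measures, the sketch below computes it from scratch in base R as the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom. The function `calinski_harabasz`, the simulated points, and the linkage choice are all illustrative assumptions; this is not AMADA's implementation, and how AMADA applies the index to clustered variables is not specified here.

```r
# Hypothetical helper: Calinski-Harabasz index for points (rows) and labels.
calinski_harabasz <- function(points, labels) {
  points  <- as.matrix(points)
  n       <- nrow(points)
  k       <- length(unique(labels))
  overall <- colMeans(points)
  within  <- 0
  between <- 0
  for (g in unique(labels)) {
    members <- points[labels == g, , drop = FALSE]
    centre  <- colMeans(members)
    # Sum of squared distances of the members to their cluster centre.
    within  <- within + sum(sweep(members, 2, centre)^2)
    # Cluster size times squared distance of its centre to the overall mean.
    between <- between + nrow(members) * sum((centre - overall)^2)
  }
  (between / (k - 1)) / (within / (n - k))
}

set.seed(123)

# Simulated data: three well-separated groups of 30 points in two dimensions.
centres <- rbind(c(0, 0), c(10, 0), c(0, 10))
points <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(30, centres[i, 1], 0.5), rnorm(30, centres[i, 2], 0.5))
}))

# Evaluate the index for several cuts of the same tree and keep the largest.
hc     <- hclust(dist(points), method = "average")
ks     <- 2:6
ch     <- sapply(ks, function(k) calinski_harabasz(points, cutree(hc, k)))
best_k <- ks[which.max(ch)]
print(data.frame(k = ks, index = ch))
```

The number of clusters with the largest index is taken as the suggested one; for groups as well separated as these, one would expect the index to peak at the true number of groups.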

Correlation Method

Pearson: a measure of the linear correlation between two variables X and Y (Pearson 1895).

Spearman: a measure of the monotonic correlation between two variables X and Y (Spearman 1904).

Maximum Information Coefficient: a measure of linear or non-linear correlation between two variables X and Y (Reshef et al. 2011). The current MIC implementation does not support missing values (NA).
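
The difference between the three measures can be seen on simulated data. The sketch below is illustrative only: the variables are made up, and the MIC part runs only if the minerva package is installed, assuming its `mine()` function returns the coefficient in the `MIC` element.

```r
set.seed(7)
x <- runif(100, 0, 5)

# A monotonic but non-linear relation: Spearman sees a perfect association,
# Pearson does not.
y_mono        <- exp(x)
pearson_mono  <- cor(x, y_mono, method = "pearson")
spearman_mono <- cor(x, y_mono, method = "spearman")

# A non-monotonic relation: both Pearson and Spearman can miss it.
y_sym        <- (x - 2.5)^2
pearson_sym  <- cor(x, y_sym, method = "pearson")
spearman_sym <- cor(x, y_sym, method = "spearman")

print(c(pearson_mono = pearson_mono, spearman_mono = spearman_mono,
        pearson_sym = pearson_sym, spearman_sym = spearman_sym))

# Optional: MIC is designed to pick up relations like y_sym as well.
# Rows containing NA must be removed first, e.g. with complete.cases().
if (requireNamespace("minerva", quietly = TRUE)) {
  print(minerva::mine(x, y_sym)$MIC)
}
```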

Visualization

AMADA offers many different plots to represent the results of the correlation analysis and unsupervised learning of the datasets.

The user can choose any of the following plots:

Heatmap: Plots a correlation matrix color-coded by the correlation level between each pair of variables (e.g., Kolde 2013). For visualization purposes, the rows and columns are arranged according to a hierarchical clustering, with a dendrogram drawn at the edges of the matrix.

Distogram: Plots a distance matrix containing the pairwise distances between all variables (e.g., Eklund 2012). The distance used is the correlation distance, given by d(x,y) = 1-|corr(x,y)|.

Dendrogram: Plots the dendrogram of the hierarchical clustering applied to the catalog variables. Options are: Phylogram, Cladogram or Fan. This type of visualization is adapted from tools for phylogenetic studies (e.g., Paradis et al. 2003).

Graph: Plots a clustered graph built in such a way that each vertex represents a different parameter and the thickness of each edge is weighted by the degree of correlation between the corresponding pair of variables (Epskamp et al. 2012). The configuration is such that highly correlated parameters appear closer together in the graph.

Chord diagram: Plots a matrix using a circular layout. The columns and rows are represented by segments around the circle. Individual cells are shown as ribbons, which connect the corresponding row and column segments (Gu 2014). The thickness of each ribbon is weighted by the degree of correlation between the corresponding pair of variables. For a given choice of colour palette, the colour intensity ranges from fully anti-correlated to fully correlated.

Nightingale chart: Plots a polar barplot. The length of each strip represents the relative contribution of a variable to the i-th Principal Component. This plot is inspired by the original chart of Nightingale (1858), probably one of the most influential visualizations of all time, which Florence Nightingale used to convince Queen Victoria to improve hygiene at military hospitals, thereby saving the lives of thousands of soldiers.

References

R Core Team (2014). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Package dependencies

ape, phytools, squash, fpc, minerva, MASS, corrplot, qgraph, ggplot2, ggthemes, reshape, pcaPP, mvtnorm, circlize