JMP, a statistical software package from SAS, is designed for dynamic data visualization. It allows study teams to obtain descriptive statistics and perform simple data analysis.
All modules in this series are presented by Ross A. Dierkhising, M.S., a master's-level biostatistician who also consults through CCaTS' BERD Resource.
These modules are intended for researchers who want to learn about the technologies available in the Medical Genome Facility (formerly the Advanced Genomics Technology Center) at Mayo Clinic and receive training on commercial and public bioinformatics software, public bioinformatics databases, and genome browsers.
Effective use of bioinformatics software enables researchers to study — on a genome-wide scale — gene expression, exon composition of transcripts, protein-binding sites, genotypes, gene copy number variations, DNA methylation and other molecular events.
During a laboratory experiment, researchers may obtain RNA or DNA samples, which are then processed and analyzed. This module provides an overview of the technology that translates RNA or DNA information into a digital form — that is, computer files. It focuses on methods and software that are used to analyze these files and obtain biological interpretation of the results.
'Essentials of Microarray Technology'
High-throughput microarray technologies enable researchers to study gene expression, exon composition of transcripts, protein-binding sites, SNPs, gene copy number variations and other molecular events on a genome-wide scale. There are numerous platforms, but all follow a similar basic concept, which researchers must understand to design quality experiments.
'Essentials of Microarray Technology: Affymetrix and Illumina Platforms'
High-throughput microarray technologies enable researchers to study gene expression, exon composition of transcripts, protein-binding sites, SNPs, gene copy number variations and other molecular events on a genome-wide scale. Although multiple platforms implement this technology, a number of key principles are critical for understanding its potential as well as its limits.
Affymetrix has additional arrays that can be utilized to ask specific questions of a sample. Illumina microarray technology has proved to be one of the best platforms for gene expression profiling, microRNA and DNA methylation profiling, and SNP detection.
This module explains key principles and features of Affymetrix microarray technology, the background on exon arrays and tiling arrays that use Affymetrix technology, and key principles and features of Illumina microarray technology. It also provides background information on how to analyze Illumina data.
'Obtaining Data and Gene Expression Profiles From Gene Expression Omnibus Microarray Database and File Decompression'
Gene Expression Omnibus (GEO) is a free database maintained by the National Center for Biotechnology Information (NCBI) that holds thousands of data sets from published expression studies. Public microarray databases are online repositories of microarray data of different types (gene expression, exon arrays and SNPs). They are often supplied with some data analysis tools, visualization tools or both.
These databases contain data generated with different microarray platforms, including spotted arrays, Affymetrix, Illumina and Agilent. Researchers can use GEO to gather preliminary data on a gene or genes of interest. This module explains how to obtain experimental data from public databases, search for genes of interest and download the corresponding data.
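As a rough illustration of the file decompression step named in the module title, the following minimal Python sketch unpacks a gzip-compressed file using only the standard library. It assumes the downloaded file is gzip-compressed; the file name and the helper name `decompress_gzip` are hypothetical and are not taken from the module.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def decompress_gzip(src: Path, dest: Path) -> Path:
    """Decompress the gzip file at src into dest and return dest."""
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        # Stream the data in chunks so large files do not need to fit in memory.
        shutil.copyfileobj(fin, fout)
    return dest


if __name__ == "__main__":
    # Build a tiny stand-in for a downloaded file (hypothetical name and content).
    workdir = Path(tempfile.mkdtemp())
    compressed = workdir / "example_series_matrix.txt.gz"
    with gzip.open(compressed, "wb") as handle:
        handle.write(b"ID_REF\tSAMPLE_1\n")

    plain = decompress_gzip(compressed, workdir / "example_series_matrix.txt")
    print(plain.read_text())
```

Streaming with `shutil.copyfileobj` is one reasonable choice here because expression data files can be large; other archive formats would need a different tool.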
'Using Partek Genomics Suite for Microarray Data Analysis: The Basics'
Partek GS is an excellent software application for the analysis of gene expression, exon composition of transcripts, copy number variation, gene annotation and more. This is an introductory-level module that covers basic functionalities of the software.
'Using Ingenuity Pathway Analysis Software for Gene Pathways Analysis' (502E00CMF120035)
Ingenuity Pathway Analysis is one of the main software applications for analyzing molecular pathways and biological networks of genes and proteins and for data mining of biological annotations. It also provides data visualization and reporting tools. This is an introductory-level module that explains how to use the basic functionalities of the software.
'Introduction to Cytoscape Software' (502E00CMF130043)
Cytoscape is a free software application for analysis and integration of gene network and gene interaction data. It is powerful software for integration and visualization of complex biological data of various types, such as complex gene networks, gene expression data (microarray, PCR or next-generation sequencing), methylation data, gene copy number variations and more.
Cytoscape accepts various formats of gene-protein-metabolite interaction files, directly uploads data from a large number of databases, and has a large set of tools for functional analysis and annotation of genes, proteins and metabolites, including tools for Gene Ontology analysis.
Upon completion of this module, participants will be able to install Cytoscape on their computers, load data into the software, use the main controls and tools, perform a simple analysis, and visually represent the results.
'An Introduction to the Sequence Read Archive and Conversion of SRA Format to FASTQ Format' (502E00CMF140075)
Massively parallel sequencing technologies (next-generation sequencing, or NGS) are increasingly used to quantify gene expression, gradually replacing microarray technologies. Data generated by NGS platforms demands the development of storage devices, data transfer methods and hardware that can efficiently handle very large volumes of data. This also extends to the data analysis software.
This module shows how NGS data obtained in gene expression experiments (RNA-seq) are archived in the Sequence Read Archive (SRA), a National Institutes of Health database. It also explains how this data can be retrieved from the archive and used for data analysis.
'Introduction to Integrative Genomics Viewer (IGV)' (502E00CMF140076)
Genome browsers are software for biological interpretation and visualization of data. With the advent of next-generation sequencing (NGS) technologies, genome browsers became one of the key components of analytical workflows. They can also be used to mine genomics data and visualize results obtained with various other types of technologies.
One of the leading genome browsers is Integrative Genomics Viewer (IGV), which was developed at the Broad Institute. This module shows how to use IGV for visualization and analysis of NGS data.
Galaxy Software (502E00GLXY0001)
This curriculum includes the following content:
'Introduction to Galaxy Software'
Massively parallel sequencing, also called next-generation sequencing, generates massive amounts of data. Storage and analysis of this data require specialized software and hardware. Galaxy is a major free system (software and hardware) that meets those requirements.
Galaxy has software tools for the analysis of ChIP-seq, RNA-seq and DNA-seq data (including methylation analysis), transcription factor binding analysis, genotyping analysis, copy number variation, gene expression, and gene/DNA variant detection, as well as the EMBOSS package of tools.
In this module, participants learn how to set up a free account with Galaxy, use the main controls and import data from the UCSC Genome Browser, which is well integrated with Galaxy. They also learn how to do a simple analysis using Galaxy.
'Loading Data into Galaxy Software'
Data files generated in experiments using next-generation sequencing technology are very large, up to hundreds of gigabytes (HiSeq). Smaller files (less than 2 GB) can be uploaded into the software directly from a computer, but uploading bigger files (up to 50 GB) requires FTP client software.
This module shows how to upload small and big files into Galaxy.
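The size thresholds described above can be expressed as a small decision rule. The sketch below is illustrative only: the function name `choose_upload_method` and its return labels are hypothetical, it assumes decimal gigabytes (10^9 bytes), and the handling of files above 50 GB is an assumption, since the description does not say what happens in that case.

```python
GB = 10**9  # assumption: decimal gigabytes; the source does not say which convention applies

DIRECT_LIMIT = 2 * GB   # files smaller than this can be uploaded directly
FTP_LIMIT = 50 * GB     # files up to this size need FTP client software


def choose_upload_method(size_bytes: int) -> str:
    """Return a hypothetical label for how a file of this size would be uploaded."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes < DIRECT_LIMIT:
        return "direct"
    if size_bytes <= FTP_LIMIT:
        return "ftp"
    # Not covered by the module description; flagged rather than guessed at.
    return "too_large"


if __name__ == "__main__":
    for size in (500 * 10**6, 10 * GB, 120 * GB):
        print(size, choose_upload_method(size))
```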
'ChIP-seq Analysis with Galaxy Software (Part 1)'
This module shows how to analyze changes in the methylation status of DNA using data generated with a ChIP-seq method (more specifically, methylated DNA immunoprecipitation, or MeDIP) and sequenced with the Illumina Genome Analyzer IIx. The module uses FASTQ files as input data, so most techniques used in this analysis are applicable to other types of ChIP-seq data.
The whole analysis involves many steps, so to make it easier to understand and learn, the module is divided into three parts, each of which explains a particular analytical process. This first part shows how to identify DNA regions that have a different level of methylation in different samples.
'Analysis of Genome-Wide Methylation Pattern Using Galaxy Software (Part 2)'
Various experimental treatments or biological states (such as embryonic development) may affect the genome-wide methylation pattern, that is, the number of hypermethylated sites in introns, exons, promoter regions and CpG islands.
Once differentially methylated sites (hyper- and hypomethylated) are identified, the next step in the analysis is to characterize the methylation pattern (the distribution of frequencies of differentially methylated sites in specified genomic regions) and find how it differs between samples. This module explains how to do this type of analysis using Galaxy software.
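To make the idea of a frequency distribution across genomic regions concrete, here is a minimal Python sketch. It is not the Galaxy workflow taught in the module; it assumes each differentially methylated site has already been labeled with a region type, and the function name `region_frequencies` and the example labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def region_frequencies(site_regions: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of sites falling in each region type."""
    counts = Counter(site_regions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {region: n / total for region, n in counts.items()}


if __name__ == "__main__":
    # Made-up region labels for four hypermethylated sites in one sample.
    hypermethylated = ["promoter", "promoter", "exon", "intron"]
    for region, freq in sorted(region_frequencies(hypermethylated).items()):
        print(f"{region}: {freq:.2f}")
```

Computing the same distribution for each sample or treatment group gives the values that would then be compared between groups.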
'Methylation Analysis of Promoter Regions Using Galaxy Software (Part 3)'
It is widely accepted in the literature that methylation of promoter regions suppresses gene expression. Specific locations of methylated sites may affect binding of transcription factors to the promoter. This is why detailed analysis of methylation of gene promoters is especially important in the context of the whole study.
This module explains how to analyze differential methylation of promoter regions on a genome-wide scale using Galaxy software.
'Preparation of FASTQ Files for RNA-seq Analysis Using Galaxy Software (Part 1)'
Investigators commonly receive their data in BAM or FASTQ format files. It is preferable to start analysis with FASTQ files to be able to check the quality of sequence reads and remove low-quality fragments. FASTQ files come in different flavors depending on the specifics of the sequencing platform that was used to generate them (for example, single read versus paired-end read).
There are software tools developed specifically to convert BAM files into high-quality FASTQ files. This module demonstrates how to install these software tools on a computer and process BAM files.
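The quality check mentioned above (inspecting sequence reads and removing low-quality ones) can be sketched in a few lines of Python. This is an illustration rather than the tools demonstrated in the module. It assumes four-line FASTQ records with Phred+33 quality encoding, filters whole reads by mean quality, and reads all lines into memory, which is only suitable for small files. The function names and the threshold of 20 are hypothetical choices.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality string)


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence, quality) from four-line FASTQ records."""
    stripped = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("FASTQ input is truncated: line count is not a multiple of 4")
    for i in range(0, len(stripped), 4):
        header, seq, plus, qual = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence and quality lengths differ for {header}")
        yield header, seq, qual


def mean_phred(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (assumes Phred+33 by default)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def filter_reads(records: Iterable[Record], min_mean_quality: float = 20.0) -> Iterator[Record]:
    """Keep only reads whose mean Phred score meets the threshold."""
    for header, seq, qual in records:
        if mean_phred(qual) >= min_mean_quality:
            yield header, seq, qual


if __name__ == "__main__":
    example = ["@read1", "ACGT", "+", "IIII", "@read2", "ACGT", "+", "!!!!"]
    for header, seq, qual in filter_reads(parse_fastq(example)):
        print(header, seq, mean_phred(qual))
```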
'RNA-seq Analysis Using Galaxy Software (Part 2)'
Typically, the goal of RNA-seq analysis is to identify genes that are differentially expressed in the groups of samples being compared. The analysis can also help find alternatively spliced transcripts and alternative transcription start sites (TSS) in those groups.
Starting with FASTQ files, it takes numerous analytical steps to obtain the final result. There are multiple parameters for each algorithm used at each step of the analysis, and these parameters need to be set correctly. This module walks participants through the major steps of the analytical workflow and explains how to correctly set the parameters.
'Interpretation of the Results of RNA-seq Analysis Using Galaxy Software (Part 3)'
The output of the RNA-seq analysis is a set of genomic coordinates of the regions with the requested properties (differential gene expression, alternative splicing, alternative transcription start sites (TSS) and more). To understand the biological result of the experiment, these coordinates need to be annotated with gene or transcript symbols, names or IDs, positions of TSS, and more.
This module demonstrates how to obtain these annotations and map them to the results of the RNA-seq analysis.
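Mapping annotations onto genomic coordinates comes down to finding which annotated features overlap each result region. The Python sketch below shows that idea under simple assumptions: coordinates are half-open intervals on named chromosomes, and the gene symbols and positions are made up for illustration. It is not the Galaxy procedure shown in the module, and the pairwise comparison would be too slow for genome-scale data.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Interval(NamedTuple):
    chrom: str
    start: int  # inclusive
    end: int    # exclusive (half-open interval, an assumption of this sketch)


class Gene(NamedTuple):
    symbol: str
    chrom: str
    start: int
    end: int


def overlaps(region: Interval, gene: Gene) -> bool:
    """True if the region and the gene share at least one base."""
    return (
        region.chrom == gene.chrom
        and region.start < gene.end
        and gene.start < region.end
    )


def annotate(regions: Sequence[Interval], genes: Sequence[Gene]) -> List[Tuple[Interval, List[str]]]:
    """Attach the symbols of all overlapping genes to each region."""
    return [(r, [g.symbol for g in genes if overlaps(r, g)]) for r in regions]


if __name__ == "__main__":
    # Hypothetical annotation table and result regions.
    genes = [
        Gene("GENE_A", "chr1", 100, 200),
        Gene("GENE_B", "chr1", 150, 300),
        Gene("GENE_C", "chr2", 100, 200),
    ]
    regions = [Interval("chr1", 180, 190), Interval("chr1", 300, 400)]
    for region, symbols in annotate(regions, genes):
        print(region, symbols or "no overlapping gene")
```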
'Using UCSC Genome Browser for Data Visualization and Analysis' (502E00CATS15090)
Knowledge about DNA elements of the genomes of different species is growing exponentially. The volume and complexity of genomic information pose many challenges for data analysis, one of which is visual representation of genomic data.
The genome browser developed by the University of California, Santa Cruz (UCSC) — known as the UCSC Genome Browser — is one of the main bioinformatics resources that handles this challenge. Not only does this software enable researchers to visually represent complex genomic data, but it also empowers scientists to explore vast amounts of genomic information gathered by the research community worldwide and recorded in the UCSC databases.
This module explains how to visually represent genomic data using the UCSC Genome Browser and also how to search and retrieve specific information from UCSC databases.
'Gene Pathways Analysis With MetaCore Software' (502E00CATS15094)
MetaCore is one of the most advanced software programs for analyzing molecular pathways and biological networks of genes and proteins and for data mining of biological annotations. It also provides data visualization and reporting tools. Because the input data is a list of gene or protein IDs, this software can analyze data obtained with various technological platforms, including microarray, next-generation sequencing, PCR, ELISA and protein mass spectrometry.
This is an introductory-level module that explains how to import data into the software and use basic functionalities for simple pathway and network analysis.