Bioinformatics Software Packages

This software is free. You can redistribute and/or modify it under the terms of the GNU General Public License, as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed with helpful intent, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

Please give credit where credit is due. If you use functions from Mayo Clinic, please acknowledge the original contributor of the material.



Genotype imputation has become a standard tool in genetics, but performing this analysis correctly requires considerable expertise and is time- and labor-intensive. We developed an IMPUTE2-based genotype imputation workflow that greatly simplifies imputation and speeds it up significantly by using multiple CPUs on a computer cluster. The user simply provides a genotype dataset, and the workflow carries out the detailed steps: matching the strand of the input genotypes to the reference, smart segmentation of the genome, and generation of QC metrics.
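
Two of the steps named above, strand matching and genome segmentation, can be illustrated with a minimal Python sketch. This is not the workflow's own code: the function names `classify_strand` and `segment_chromosome`, the return labels, and the fixed-size windowing are assumptions made for illustration, and the workflow's "smart" segmentation is likely more involved.

```python
# Hypothetical helpers for illustration only; not part of the workflow.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify_strand(study_alleles, ref_alleles):
    """Compare a study SNP's two alleles with the reference panel's alleles.

    Returns "match", "flip", "ambiguous", or "mismatch".
    Only single-base A/C/G/T alleles are handled.
    """
    study = frozenset(a.upper() for a in study_alleles)
    ref = frozenset(a.upper() for a in ref_alleles)
    flipped = frozenset(COMPLEMENT[a] for a in study)
    # A/T and C/G SNPs read the same on both strands, so the strand
    # cannot be inferred from the alleles alone.
    if study == flipped:
        return "ambiguous"
    if study == ref:
        return "match"
    if flipped == ref:
        return "flip"
    return "mismatch"


def segment_chromosome(length, window):
    """Split positions 1..length into consecutive 1-based inclusive windows."""
    return [(start, min(start + window - 1, length))
            for start in range(1, length + 1, window)]
```

For example, `classify_strand(("T", "C"), ("A", "G"))` reports a strand flip, and each window returned by `segment_chromosome` could be submitted as a separate cluster job.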

Availability and implementation: The workflow works on the two most popular cluster management systems, Sun Grid Engine (SGE) and Portable Batch System (PBS). It is available under the GNU General Public License.

Authors: Hugues Sicotte, Naresh Prodduturi

Hugues Sicotte, Ph.D.

GenomeSmasher

GenomeSmasher is a set of tools for creating diploid FASTA files containing SNPs, indels, duplications, deletions and translocations. These FASTA files can then be used with next-generation sequencing simulators to create artificial sequencing experiments. The tools are intended for assessing the performance and reliability of data analysis in next-generation sequencing pipelines.
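
As a rough illustration of the idea, the Python sketch below applies SNPs to a reference sequence to produce two haplotypes and formats them as FASTA records. It is not GenomeSmasher's code: the function names, the `(position, alt, genotype)` tuple layout, and the genotype labels `"hom"`, `"het1"` and `"het2"` are all assumptions, and only SNPs are handled (no indels or larger rearrangements).

```python
# Hypothetical sketch; names and data layout are illustrative assumptions.
def apply_snps(reference, snps):
    """Return (hap1, hap2) after applying SNPs to a reference sequence.

    snps is an iterable of (position, alt, genotype) where position is
    0-based and genotype is "hom" (both haplotypes), "het1" (first
    haplotype only) or "het2" (second haplotype only).
    """
    hap1, hap2 = list(reference), list(reference)
    for pos, alt, genotype in snps:
        if genotype in ("hom", "het1"):
            hap1[pos] = alt
        if genotype in ("hom", "het2"):
            hap2[pos] = alt
    return "".join(hap1), "".join(hap2)


def to_fasta(name, sequence, width=60):
    """Format one sequence as a FASTA record with fixed-width lines."""
    lines = [f">{name}"]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


hap1, hap2 = apply_snps("ACGTACGT", [(1, "T", "hom"), (4, "G", "het2")])
diploid_fasta = to_fasta("chrTest_hap1", hap1) + to_fasta("chrTest_hap2", hap2)
```

The two records in `diploid_fasta` could then be handed to a read simulator, one haplotype per record.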

Authors: Steven N. Hart, Naresh Prodduturi

Steven N. Hart, Ph.D.

MACE (Model-based Analysis of ChIP-exo)

Precisely mapping protein-binding sites on the genome is fundamentally important for understanding many biological processes. A recently developed technology, ChIP-exo, can define protein-binding sites with higher resolution and sensitivity than its precursors, such as ChIP-chip and ChIP-seq. While higher spatial resolution allows for a better understanding of protein-DNA interactions, there have been no dedicated methods for analyzing ChIP-exo data, and this lack of analytical methods significantly restricts the usefulness of ChIP-exo. Researchers have developed a novel analysis framework named MACE (model-based analysis of ChIP-exo) to detect the two binding borders of protein-DNA interactions from ChIP-exo data. MACE was applied to yeast Reb1 and human CTCF ChIP-exo data, and the identified binding sites were evaluated using multiple layers of independent evidence, including cognate motif, sequence conservation, nucleosome positioning and open chromatin states. The researchers demonstrated that this analysis framework, which is tailored to ChIP-exo data, defines protein-binding sites sensitively and specifically.

Liguo Wang, Ph.D.


SoftSearch

SoftSearch was developed as a sensitive structural variant (SV) detection tool for Illumina paired-end next-generation sequencing data. To increase sensitivity, SoftSearch uses soft-clipping and read-pair strategies simultaneously. Soft clips are proxies for split reads: they indicate that part of the read maps to the reference genome but the other part is not localized at the same place (for example, breakpoint-spanning reads). Discordant read-pairs are a read and its mate where the insert size is greater than (or less than) the expected distribution of the dataset, or where the mapping orientation of the reads is unexpected (for example, both on the same strand). SoftSearch looks for areas of the genome with soft-clipping that also have discordant read-pairs supporting the anomaly. Once areas meeting both conditions are identified, the read and mate information is extracted directly from the BAM file containing the discordant reads, obviating the need for time-consuming and error-prone complex alignment strategies. Only a small number of soft-clipped bases and discordant read-pairs are needed to identify an SV, even though neither signal on its own would be sufficient to make an SV call, which highlights SoftSearch’s improved sensitivity (see Performance).
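
The two signals described above can be sketched in Python. This is an illustrative simplification, not SoftSearch's implementation: the function names and the `min_insert`/`max_insert` thresholds are assumptions, the inputs are plain values rather than BAM records, and orientation checking is reduced to the same-strand case mentioned in the text.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_bases(cigar):
    """Return (left, right) soft-clip lengths from a SAM CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) may sit outside soft clips, so drop them first.
    ops = [(n, op) for n, op in ops if op != "H"]
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return left, right


def is_discordant(insert_size, read_reverse, mate_reverse,
                  min_insert, max_insert):
    """Flag a pair whose orientation or insert size is unexpected.

    min_insert and max_insert are hypothetical bounds that would be
    derived from the dataset's insert-size distribution.
    """
    if read_reverse == mate_reverse:  # both reads on the same strand
        return True
    return not (min_insert <= abs(insert_size) <= max_insert)
```

A region would become a candidate only when reads with nonzero soft clips and pairs flagged by `is_discordant` coincide there, mirroring the requirement that both conditions hold.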

  • SoftSearch source code
    • Authors: Steven N. Hart, Jaysheel Bhavsar, Saurabh Baheti, Vivekananda (Vivek) Sarangi, Jean-Pierre A. Kocher.

      Steven N. Hart, Ph.D.


      SNPPicker

      SNPPicker is a post-processor that optimizes the selection of tag SNPs from common bin-tagging programs. It uses a multi-step search strategy in combination with a statistical model to produce optimal genotyping panels. Authors: Hugues Sicotte, David N. Rider, Gregory A. Poland, Neelam Dhiman, Jean-Pierre A. Kocher. [03/2011]

      Publication: SNPPicker: high quality tag SNP selection across multiple populations

      Dave N. Rider


      TREAT

      TREAT (Targeted RE-sequencing Annotation Tool) is a comprehensive, open-framework, end-to-end solution for analyzing and interpreting targeted re-sequencing data. TREAT encompasses sequence alignment, variant calling, variant annotation, variant filtering, and visualization in one analytic workflow. The rich set of annotations provided by TREAT enables detected variants to be filtered by their functional characteristics, and visualizations at the variant positions allow investigators to closely examine the variants of interest. An Amazon Cloud Image of TREAT is provided for researchers with no access to local bioinformatics infrastructure, with instructions given in the tutorial below. The source code for local installation is available via the link below.

      Authors: Yan W. Asmann, Sumit Middha, Asif Hossain, Saurabh Baheti, Ying Li, High-Seng Chai, Zhifu Sun, Patrick H. Duffy, Ahmed A. Hadad, Asha Nair, Xiaoyu Liu, Yuji Zhang, Eric W. Klee, Jean-Pierre A. Kocher. [06/2011]

      Publication: TREAT: a bioinformatics tool for variant annotations and visualizations in targeted and exome sequencing data

      TREAT files available upon request.

      Patrick H. Duffy


      SnowShoes-FTD

      SnowShoes-FTD is a bioinformatics tool that identifies fusion transcripts from paired-end transcriptome sequencing data. The tool applies multiple false-positive filtering steps and nominates fusion candidates with high confidence (approaching a 100% true positive rate). The unique features of SnowShoes-FTD include: (i) the ability to discover multiple fusion isoforms in which the two gene partners give rise to transcripts with different junctions; (ii) prediction of potential fusion mechanisms, including inversion, translocation, and/or interstitial deletion; (iii) identification of whether the junction point in a fusion transcript occurs at the boundaries of known exons, which implies that the fusion event may have occurred within an intron in the DNA and then been transcribed into the fusion transcript.

      Furthermore, SnowShoes-FTD greatly simplifies validation of the fusion candidates by providing a 5’ to 3’ oriented template region spanning the fusion junction point that is long enough for designing primers for PCR validation. SnowShoes-FTD also predicts the protein sequences of the fusion genes from the known transcript sequences of the fusion partners and distinguishes in-frame from out-of-frame fusion products. In addition, for in-frame fusion proteins it identifies mutations at the fusion junction points, including non-synonymous single amino acid changes and insertions. The source code of SnowShoes-FTD is provided in two formats: one configured to run on the Sun Grid Engine for parallelization and shorter run time, and the other formatted to run on a single Linux node.
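
The in-frame versus out-of-frame distinction comes down to whether the 3’ partner keeps its reading frame after the junction. The Python sketch below shows that arithmetic under simplifying assumptions. It is not the SnowShoes-FTD code: the function and parameter names are hypothetical, and both breakpoints are assumed to fall within the coding sequence (CDS) of known transcripts.

```python
# Hypothetical helpers; breakpoints are assumed to lie within each CDS.
def is_in_frame(five_prime_cds_bases, three_prime_cds_offset):
    """Check whether a fusion preserves the 3' partner's reading frame.

    five_prime_cds_bases: number of coding bases the 5' partner
        contributes before the junction.
    three_prime_cds_offset: number of coding bases that precede the
        junction in the 3' partner's own transcript.
    The frame is preserved when the two counts agree modulo 3.
    """
    return (five_prime_cds_bases - three_prime_cds_offset) % 3 == 0


def fuse_cds(five_prime_cds, five_prime_cds_bases,
             three_prime_cds, three_prime_cds_offset):
    """Join the 5' partner's CDS up to the junction with the rest of the 3' CDS."""
    return (five_prime_cds[:five_prime_cds_bases]
            + three_prime_cds[three_prime_cds_offset:])
```

For instance, a 5’ partner contributing 6 coding bases joined to a 3’ partner at coding offset 3 stays in frame, whereas contributing 7 bases would shift the frame.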

      Note: The SnowShoes-FTD download package contains the tool itself, the reference files necessary to run the tool, and the test data. Because of its large size, we will set up an FTP transfer site for each request. We apologize for the inconvenience and are looking for alternative sites to host the download.

      Authors: Yan W. Asmann, Asif Hossain, Brian M. Necela, Sumit Middha, Krishna R. Kalari, Zhifu Sun, H.S. Chai, D.W. Williamson, Derek C. Radisky, G.P. Schroth, Jean-Pierre A. Kocher, Edith A. Perez, E. Aubrey Thompson

      Publication: A novel bioinformatics pipeline for identification and characterization of fusion transcripts in breast cancer and normal cell lines

      Please contact the author to gain access to the software:
      Yan W. Asmann, Ph.D.


      SAAP-RRBS

      Reduced representation bisulfite sequencing (RRBS) is a cost-effective approach for genome-wide methylation pattern profiling. Analyzing RRBS sequencing data is challenging, and specialized alignment/mapping programs are needed. Although such programs have been developed, a comprehensive solution that provides researchers with good-quality, analyzable data is still lacking.

      To address this need, we have developed a Streamlined Analysis and Annotation Pipeline for RRBS data (SAAP-RRBS) that integrates read quality assessment/clean-up, alignment, methylation data extraction, annotation, reporting, and visualization. With this package, bioinformaticians or investigators can start from sequencing reads and quickly get a fully annotated CpG methylation report, allowing more time for biological interpretation. The SAAP-RRBS program:

      • Conducts read quality checks, adapter trimming, alignment, methylation extraction, annotation and visualization for sequence reads in FASTQ format (single-end or paired-end RRBS).
      • Conducts further downstream analyses for aligned BAM files from other RRBS aligners (single end RRBS).
      • Conducts comprehensive annotations for a CpG list with chromosome and location.
      • Is highly automated and fast: running the whole pipeline on an RRBS sample with 50 million reads takes 4-6 hours.
      • Offers two run modes. For users without a cluster environment, it can be run on a single Linux machine, one sample at a time (non-sge mode).
      • Allows users with a cluster environment to submit jobs to the cluster to run multiple samples simultaneously for fast processing (sge mode).
      • Can handle both single-end and paired-end sequencing.
      • Provides summary reports for all samples in a run so users can quickly grasp their data.
      • Can be adapted to a different aligner and is extensible to whole genome sequencing data.
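
The methylation extraction step rests on a simple ratio: after bisulfite conversion an unmethylated cytosine is read as T while a methylated cytosine is still read as C, so the methylation level at a CpG is C / (C + T). The Python sketch below illustrates this. It is not taken from SAAP-RRBS: the function names, the input layout, and the `min_coverage` default of 10 are assumptions.

```python
# Hypothetical sketch of per-CpG methylation reporting.
def methylation_level(c_count, t_count):
    """Fraction of reads supporting methylation at one CpG.

    Returns None when no reads cover the site.
    """
    total = c_count + t_count
    if total == 0:
        return None
    return c_count / total


def cpg_report(counts, min_coverage=10):
    """Build (chrom, pos, coverage, percent_methylated) rows.

    counts maps (chrom, pos) to (c_count, t_count). Sites covered by
    fewer than min_coverage reads are skipped; the default of 10 is an
    arbitrary illustrative threshold.
    """
    rows = []
    for (chrom, pos), (c, t) in sorted(counts.items()):
        coverage = c + t
        if coverage < min_coverage:
            continue
        rows.append((chrom, pos, coverage, round(100.0 * c / coverage, 1)))
    return rows
```

A site with 8 C reads and 2 T reads would be reported as 80.0% methylated at coverage 10.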

      Authors: Zhifu Sun, Saurabh Baheti, Sumit Middha, Rahul Kanwar, Y. Zhang, X. Li, Andreas S. Beutler, Eric W. Klee, Yan W. Asmann, E. Aubrey Thompson, Jean-Pierre A. Kocher

      Publication: SAAP-RRBS: streamlined analysis and annotation pipeline for reduced representation bisulfite sequencing

      Saurabh Baheti

      Zhifu Sun, M.D.