PIQA: Pipeline for Illumina G1 Genome Analyzer Data Quality Assessment

A. Martinez*, E. Ballesteros*, C. Feng*, M. Rojas*, H. Koshinsky**, V.Y. Fofanov**, P. Havlak* and Y. Fofanov*

*Bioinformatics Laboratory, University of Houston, Houston TX, USA

**Eureka Genomics Corp., Houston TX, USA

 Abstract

PIQA is a quality analysis pipeline designed to examine genomic reads produced by Next Generation Sequencing technology (Illumina G1 Genome Analyzer). A short statistical summary, together with tile-by-tile and cycle-by-cycle graphical representations of cluster density, quality scores, and nucleotide frequencies, allows easy identification of various technical problems, including defective tiles, mistakes in sample/library preparation, and abnormalities in the frequencies of appearance of sequenced genomic reads. PIQA is written in the R statistical programming language and is compatible with the bustard, fastq, and scarf Illumina G1 Genome Analyzer data formats.

Bioinformatics 2009


Features:
  1. PIQA takes as input unfiltered data (files in "bustard" format; see Note 1) or filtered data (files in either "scarf" or "fastq" format).

  2. PIQA writes a comma-separated text file (an example summary file is available for download) and three linked HTML files to a user-chosen directory.

  3. The comma-separated file summarizes the quality information by lane, tile, and cycle, and is used to produce the plots in the HTML pages. The file is available to the user for further analyses.

  4. The main HTML page is "PIQA_report.html" which displays five plots:

    • Density of reads per tile
    • Proportion of base-calls by tile
    • Proportion of base-calls by cycle
    • Average base-call quality per tile
    • Average base-call quality per cycle
  5. The main HTML page links to two other HTML pages: one shows the proportion of base-calls per tile for each cycle, and the other shows the average quality of base-calls per tile for each cycle.

  6. The data used to generate each plot can be downloaded as a comma-separated text file.

  7. The PIQA documentation is available online and as a PDF file.
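
Because the summary file described in item 3 is plain comma-separated text, it can be loaded back into R for further analysis. The sketch below is only an illustration: the file name and the column names (`lane`, `tile`, `cycle`, `mean_quality`) are hypothetical stand-ins, since the real layout is defined by PIQA's output. A small mock file is written first so that the example runs on its own.

```r
# Write a small mock summary file so the example is self-contained.
# NOTE: the file name and column names are hypothetical; check the header
# of the file PIQA actually produces before reusing this code.
mock <- data.frame(
  lane = 1,
  tile = rep(1:2, each = 3),
  cycle = rep(1:3, times = 2),
  mean_quality = c(30, 28, 25, 31, 27, 20)
)
summary_file <- file.path(tempdir(), "piqa_summary_example.csv")
write.csv(mock, summary_file, row.names = FALSE)

# Load the summary and average the quality per cycle across tiles.
piqa_summary <- read.csv(summary_file)
quality_by_cycle <- aggregate(mean_quality ~ cycle, data = piqa_summary, FUN = mean)
print(quality_by_cycle)
```

With a real PIQA summary file, only the `read.csv()` call and the column names in the formula would need to change.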

 

Download:

PIQA is a compiled R package and it is available for different operating systems:

  • Linux (Ubuntu) 32 bit processor

  • Linux (Red Hat) 64 bit processor

  • Linux (Debian) 32 bit processor

  • Windows 32 bit processor (see Note 2)

  • MacOSX 32 (Processor Intel Core Duo)

  • MacOSX 64 (Processor Intel Core 2 Duo)

 

Installation (see Note 3):

Linux/MacOSX: Install PIQA as a regular R package using the root account. In a terminal window, change to the directory where the package was downloaded, then type at the operating system prompt, e.g.:

 > R CMD INSTALL PIQA_1.0_R_i486-pc-linux-gnu.tar.gz

Windows 32: Start R and from the "Packages" menu select "Install package(s) from local zip files", then choose the downloaded file (PIQA_1.0.zip).

 

 

Use:

Once PIQA is installed (see the "Installation" section above), load the package by typing at R's prompt:

 > library("PIQA")

Although the R implementation lets the user interact directly with all the functions defined in the package, our intention was to provide a "push-button" application. All the user has to do is type at R's prompt:

 > PIQA_rpt(data_format, in_d, out_d, lane)

The parameters are:

  • "data_format" is one of "bustard", "fastq" or "scarf".

  • "in_d" is either an input directory (if "data_format" is "bustard") or a file name (if "data_format" is "fastq" or "scarf").

  • "out_d" is the directory (path included) in which the results will be output.

  • "lane" is the number of the flow cell lane to be processed (1 ≤ lane ≤ 8).

For optional parameters, see the documentation.
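
The parameter rules above can be checked before a run is started. The following is a minimal sketch, not part of PIQA: `check_piqa_args` is a hypothetical helper, and it assumes that the output directory must already exist and that lanes are numbered 1 to 8.

```r
# Hypothetical helper (not part of PIQA): checks the four required
# arguments of PIQA_rpt() before a run is started.
check_piqa_args <- function(data_format, in_d, out_d, lane) {
  data_format <- match.arg(data_format, c("bustard", "fastq", "scarf"))
  if (data_format == "bustard") {
    # "bustard" input is a directory
    if (!dir.exists(in_d)) stop("in_d must be an existing directory: ", in_d)
  } else {
    # "fastq" and "scarf" input is a single file
    if (!file.exists(in_d) || dir.exists(in_d)) {
      stop("in_d must be an existing file: ", in_d)
    }
  }
  # Assumption: the output directory has to exist beforehand.
  if (!dir.exists(out_d)) stop("out_d must be an existing directory: ", out_d)
  # Assumption: the flow cell has lanes numbered 1 to 8.
  if (!(length(lane) == 1 && lane %in% 1:8)) {
    stop("lane must be a single integer from 1 to 8")
  }
  invisible(TRUE)
}

# Example with temporary paths standing in for real data.
in_file <- tempfile(fileext = ".txt")
file.create(in_file)
check_piqa_args("fastq", in_file, tempdir(), 1)
# With PIQA installed, the run itself would then be:
# PIQA_rpt("fastq", in_file, tempdir(), 1)
```

Failing early with a clear message is cheaper than discovering a wrong path partway through a long "bustard" run.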

 

 

Examples (see Note 4):

Download and install the PIQA package as detailed above, then follow the next steps:

  1. Download an example dataset (WARNING: The files are big!):

    • Zip file in "bustard" format (~ 1072 MB)

    • Zip file in "fastq" format (~ 110 MB)

    • Zip file in "scarf" format (~ 85 MB)

     

  2. Extract the files into a known location (directory) on your computer. The following examples assume the files have been extracted into c:\ on a Windows system, or into R's working directory on a Linux system (you can get R's working directory by typing "getwd()" at R's prompt).

     

  3. For the "bustard" data format (see Note 5), type at R's prompt:

      > library("PIQA")

    • On a Windows system you could type: > type_of_data="bustard"; in_d="c:/bustard"; out_d="c:/"; lane=1;

    • On a Linux system you could type: > type_of_data="bustard"; in_d="bustard"; out_d=getwd(); lane=1;

    • Finally type (on either system): > PIQA_rpt(type_of_data, in_d, out_d, lane)

     

  4. For the "fastq" data format (see Note 5), type at R's prompt:

      > library("PIQA")

    • On a Windows system you could type: > type_of_data="fastq"; in_d="c:/s_1_sequence.txt"; out_d="c:/"; lane=1;

    • On a Linux system you could type: > type_of_data="fastq"; in_d="s_1_sequence.txt"; out_d=getwd(); lane=1;

    • Finally type (on either system): > PIQA_rpt(type_of_data, in_d, out_d, lane)

     

  5. For the "scarf" data format (see Note 5), type at R's prompt:

      > library("PIQA")

    • On a Windows system you could type: > type_of_data="scarf"; in_d="c:/s_7_sequence.txt"; out_d="c:/"; lane=7;

    • On a Linux system you could type: > type_of_data="scarf"; in_d="s_7_sequence.txt"; out_d=getwd(); lane=7;

    • Finally type (on either system): > PIQA_rpt(type_of_data, in_d, out_d, lane)

An example of the resulting report is also available.
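
Each example above processes one lane per call. To cover several lanes of filtered data, the calls can be generated in a loop. The sketch below assumes that the per-lane files follow the "s_<lane>_sequence.txt" naming seen in the examples and that all of them are in "fastq" format. `lane_files` is a hypothetical helper, not part of PIQA, and PIQA_rpt is called only when the package is installed. Whether reports for different lanes can share one output directory is not stated here, so the sketch writes each lane to its own subdirectory.

```r
# Hypothetical helper: builds the expected per-lane file paths, assuming the
# "s_<lane>_sequence.txt" naming used in the examples above.
lane_files <- function(in_dir, lanes = 1:8) {
  paths <- file.path(in_dir, sprintf("s_%d_sequence.txt", lanes))
  names(paths) <- lanes
  paths
}

in_dir   <- getwd()   # directory holding the extracted files
out_root <- getwd()   # parent directory for the per-lane reports
files    <- lane_files(in_dir)

for (lane in names(files)) {
  if (!file.exists(files[[lane]])) next   # skip lanes with no data file
  out_d <- file.path(out_root, paste0("lane_", lane))
  dir.create(out_d, showWarnings = FALSE)
  if (requireNamespace("PIQA", quietly = TRUE)) {
    # Assumption: every lane file is in "fastq" format.
    PIQA::PIQA_rpt("fastq", files[[lane]], out_d, as.integer(lane))
  }
}
```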

 

 

Notes:
  1. When dealing with data in the "bustard" format, PIQA assumes the structure of the directory and file naming conventions of the Illumina Pipeline.

  2. Under Windows Vista, R must be run as administrator (to open R in this mode, right-click on R's icon and select the option "Run as administrator").

  3. PIQA requires the R2HTML package to generate its output. Make sure R2HTML is installed in your R system before running PIQA. On a computer with Internet access, install R2HTML by typing at R's prompt:

      > install.packages("R2HTML")

  4. Processing unfiltered data ("bustard" format) takes about 20 times longer than processing filtered data ("fastq" or "scarf" formats), which takes about 20 seconds on a commercial desktop machine.

  5. In the examples, the "in_d" parameter is a text file for "fastq" and "scarf" data and a directory for "bustard" data.
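
The R2HTML requirement in Note 3 can be checked from R before PIQA is run. This is a small sketch using only base R; `missing_packages` is a hypothetical helper name, not something PIQA provides.

```r
# Hypothetical helper: returns the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages("R2HTML")
if (length(to_install) > 0) {
  message("Missing: ", paste(to_install, collapse = ", "),
          "; install with install.packages()")
}
```

The check only reports what is missing, so it is safe to run on a machine without Internet access.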