IGB supports many file formats, in both compressed and uncompressed form. See the table below for details and, when available, links to resources describing each format.
IGB also supports some file formats using third-party plug-ins.
See below for details.
Table of supported file formats
A mostly-obsolete XML format used internally at Affymetrix.
A binary indexed version of the SAM format used for displaying alignment data. See SAMtools for more details.
Plain text version of BAM format. Supported in IGB 6.6 and higher. We recommend using this only for smaller files.
A tabular format developed for use with the UCSC genome browser. IGB supports four, twelve, and fourteen column BED format. In IGB, the thirteenth and fourteenth columns of fourteen-column BED format (also called BED detail format) are interpreted as gene name and description, respectively.
Same as the wiggle format. See below for details.
The bigBed format stores annotation items that can either be simple, or a linked collection of exons, much as bed files do. BigBed files are created initially from bed type files, using the program
Like the bigBed format, this is an indexed form of a WIG file, which allows data to be loaded partially and greatly shortens load times. See http://genome.ucsc.edu/goldenPath/help/bigWig.html.
Binary graph format developed by Affymetrix.
.bps, .bgn, .brs, .bsnp, .brpt, .bp1
Binary annotation formats developed specifically for IGB by Affymetrix. These are generally not documented anywhere and are being retired. If you are using any of these formats, we recommend using tabix-indexed, bgzip-compressed BED detail (fourteen column BED) files instead.
A binary format for sequence data originally developed for IGB by Affymetrix to speed up loading sequence data over the network. This format is no longer used. Starting with IGB 7.0, we will no longer support this file format. We recommend using 2bit instead of BNIB for sequence files.
Binary files generated by Affymetrix software. Signal values and presence calls for probesets on Affymetrix microarrays. There are multiple sub-formats, identifiable from the file contents. IGB can read most, but not all, of these formats.
Output from Affymetrix Genotyping Console representing a summary of all chromosomal regions exhibiting copy number change.
Copy Number Analysis
The Affymetrix CNT file format is a tabular format representing output from the Affymetrix CNAT program.
A text format for representing chromosome band (ideogram) data. Examples are available from the IGBQuickLoad.org site under human genome directories.
DAS XML files
.das, .dasxml, .das2xml
.egr, .egr.txt, .sin
EGR is a tabular format for associating arbitrary numbers of scores with scored genomic intervals. The extension ".egr" is preferred. The other extensions are kept for backwards compatibility. Files generated from Affymetrix GeneChip Operating Software (GCOS) or ExACT (Exon Array Computational Tool) software.
.fa, .fasta, .fna, .fsa, .mpfa, .fas
Sequence data in a simple ASCII format. See here. Recommended only for short sequences. Otherwise, use 2Bit (see below) or follow these instructions to convert to the ".bnib" format. Note: IGB does not support the use of the Control-A character in the header lines.
NCBI's file format. Experimentally supported in IGB 6.4.
GFF (General Feature Format)
.gff, .gtf, .gff3
General Feature Format. There are several types of GFF file that use incompatible syntax. The original GFF format is GFF1. A variant called GTF is also used. GFF3 has been proposed to extend on GFF and to constrain the specification more tightly to avoid mutually-incompatible versions of GFF. If IGB has difficulty reading your GFF file, make sure the header includes the GFF version, as indicated in the GFF specification documents.
Tab-delimited graph format. A simple text format containing two columns of numbers separated by a single space or tab. The first number is the base position; the second number is the score. Because this format does not include chromosome names, we recommend you use .sgr or .wig formats instead.
.psl, .psl3, .link.psl
PSLX is an extension to the PSL format that shows sequence data in each alignment. Supported in IGB 6.4.
.sin, .egr, .egr.txt
An outdated format, replaced now by EGR files.
Tab-delimited graph format. Sequence graph files that show base coordinate scores. These files are generated by CNAT (the Affymetrix Chromosome Copy Number Analysis Tool software). Each line of an .sgr text file contains the chromosome identifier, followed by two columns of numbers separated by a single space or a tab. The first number is the base position; the second number is the score.
See Expression Graphs format.
Tally files are created by the bam_tally program (using options -P -B 0), and contain mismatch pileup information. The display is identical to the MisMatch Pileup view mode. The Tally files contain the sequence reference. The plugin will use a tabix index if available.
USeq is a binary indexed format used to display graph and annotation data. Supported in IGB 6.2. For more information about it, see: http://useq.sourceforge.net.
VCF (.vcf): Variant Call Format (VCF) is a flexible and extendable format for variation data such as single nucleotide variants, insertions/deletions, copy number variants and structural variants. More information on the VCF file format can be found here: https://github.com/samtools/hts-specs
This is a text format for graphical data designed for the UCSC genome browser. IGB supports all 3 subtypes: BED, variableStep, fixedStep. For more information, see the UCSC Web page describing the format: http://genome.ucsc.edu/goldenPath/help/wiggle.html. Files in wiggle format can use UCSC track lines to specify colors and other properties.
2Bit is a compact format for DNA sequences developed by UCSC. Supported in IGB 6.3. See http://genome.ucsc.edu/FAQ/FAQformat.html#format7 for more information about it.
File Types Supported Through Plug-ins
Some file formats can be read after installing the appropriate plug-in. See Plug-ins.
About GFF and its variants
GFF stands for 'general feature format' or 'gene finding format'; it is a tab-delimited file with 9 columns. There are several types of GFF files that use incompatible syntax. The original GFF format is GFF1. A variant called GTF is also used. GFF3 has been proposed to extend on GFF and to constrain the specification more tightly to avoid mutually-incompatible versions of GFF. Some GFF files created by Affymetrix make use of extensions to GFF that are specific to IGB. These are indicated in the file headers by lines beginning with "##IGB-".
IGB can handle most versions of GFF/GTF, but may have difficulty with some rarely-used advanced features. IGB does not read any FASTA data that is included in some GFF3 files. If IGB has difficulty reading your GFF file, make sure that the header contains a line similar to ##gff-version 2 that identifies the correct format number: 1, 2, or 3.
The GFF format is described at http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml
The GTF format is described here http://genes.cs.wustl.edu/GTF2.html
The GFF3 format is described here http://song.sourceforge.net/gff3-jan04.shtml
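Before loading a GFF file, it can help to confirm that the header declares a version. The following Python sketch scans the leading comment lines for a ##gff-version line. The function name find_gff_version and the sample lines are made up for illustration; this is not how IGB itself reads the header.

```python
import re

# Matches a version line such as "##gff-version 2" or "##gff-version 3.1.26".
_VERSION_RE = re.compile(r"^##gff-version\s+(\d+)")


def find_gff_version(lines):
    """Return the major GFF version declared in the header, or None."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if not line.startswith("#"):
            break  # the header ends at the first feature line
        match = _VERSION_RE.match(line)
        if match:
            return int(match.group(1))
    return None


# Hypothetical file content: one header line and one 9-column feature line.
sample = [
    "##gff-version 2",
    "chr1\tsource\texon\t100\t200\t.\t+\t.\tgene1",
]
print(find_gff_version(sample))  # prints 2
```

If the function returns None, the file has no version line in its header, and adding one is the first thing to try.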
About the .wig format (also called bedGraph)
A "wig" file is a data file that associates numerical values (e.g., read coverage) with a region of the genome. For example, here is an excerpt:
Note that the top line of the file contains information (meta-data) about the data set, including its name. When you open the file in IGB, you'll see this "name" attribute again in the track label.
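To illustrate how the "name" attribute can be read from the top line, here is a minimal Python sketch. The excerpt inside the code is hypothetical (it is not taken from a real data set), and parse_track_line is a made-up helper, not part of IGB.

```python
import shlex


def parse_track_line(line):
    """Split a UCSC-style track line into a dict of attribute -> value."""
    tokens = shlex.split(line)  # keeps quoted values such as "Sample coverage" whole
    if not tokens or tokens[0] != "track":
        raise ValueError("not a track line")
    attrs = {}
    for token in tokens[1:]:
        key, sep, value = token.partition("=")
        if sep:
            attrs[key] = value
    return attrs


# Hypothetical excerpt: a track line followed by two data lines
# (chromosome, start, end, value).
excerpt = '''track type=bedGraph name="Sample coverage" description="Read depth"
chr1\t100\t200\t12
chr1\t200\t300\t7.5
'''
header = parse_track_line(excerpt.splitlines()[0])
print(header["name"])  # prints Sample coverage
```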
Partial data loading using tabix indexed files
IGB supports partial data loading of several file types using tabix indexing. Supported file types include SAM, BED, BEDGRAPH, PSL, and PSLX.
Indexed files allow for faster searching and loading. The indexed file and its index (.tbi file) must reside in the same folder either on your local computer or on a server. More about tabix can be found here.
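For local files, the same-folder requirement can be checked with a short Python sketch. It assumes the usual tabix naming convention, in which the index for genes.bed.gz is named genes.bed.gz.tbi. The function name has_tabix_index and the file name are made up for illustration.

```python
import os
import tempfile


def has_tabix_index(data_path):
    """Return True if '<data_path>.tbi' exists next to the data file."""
    return os.path.isfile(data_path + ".tbi")


# Demonstration with empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    data = os.path.join(folder, "genes.bed.gz")
    open(data, "wb").close()
    before = has_tabix_index(data)  # index not created yet
    open(data + ".tbi", "wb").close()
    after = has_tabix_index(data)   # index now sits beside the data file

print(before, after)  # prints False True
```

The check only covers files on your local computer; for files on a server, the index must be reachable at the matching URL.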
Sequence File Formats
IGB supports fasta and 2bit formats. Older versions of IGB also support an IGB-specific format called bnib. Newer versions of IGB will probably still open bnib files, but as of IGB 7.0, we are no longer including the bnib format in our testing process.
FASTA files contain sequence data in a simple ASCII format. For details, search the web for "FASTA format".
BNIB is an older format developed for IGB that represents sequence data very compactly. 2bit, developed at UCSC, is also a compact binary format for sequence data. Because a number of open source tools are available for working with 2bit, IGB now uses it instead of bnib.
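To give a sense of why a two-bits-per-base encoding is compact, the Python sketch below packs four bases into each byte, using the base codes from the UCSC 2bit description (T=0, C=1, A=2, G=3). It is only an illustration of the packing idea. Real .2bit files also contain a header, a sequence index, and blocks describing N runs and masked regions, none of which is modeled here, and the function names are made up.

```python
# Two-bit codes for each base, as used in the UCSC 2bit description.
CODES = {"T": 0, "C": 1, "A": 2, "G": 3}
BASES = "TCAG"  # index in this string == two-bit code


def pack(seq):
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4].upper()
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        byte <<= 2 * (4 - len(chunk))  # pad a short final chunk with zero bits
        out.append(byte)
    return bytes(out)


def unpack(data, length):
    """Recover the first 'length' bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


packed = pack("ACGTACGT")
print(len(packed))        # prints 2 (eight bases fit in two bytes)
print(unpack(packed, 8))  # prints ACGTACGT
```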
.egr and .sin Formats
EGR files (also known as Scored Interval, or .sin, format files) are tab-delimited files. They can contain one or more scores associated with named annotations or with named or unnamed genomic regions. They have an optional header section, which is a list of tag-value pairs, one per line, in the form: # tag = value. Currently the only tags used by the parser are of the form score$i (score name tags are optional). If score name tags are present, then score number $i will be named according to the value of the score$i tag. If any score name tags are missing, default names will be created.
It is recommended that a tag-value pair with the genome version, such as # genome_version = H_sapiens_May_2004, be included to indicate which genome assembly the sequence coordinates are based on. This helps ensure that the file is compared only with other annotations from the same assembly.
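The header convention can be sketched in Python as follows. This is an illustration rather than IGB's parser: the function names are made up, and the fallback score names ("score0", "score1", ...) are an assumption, since the default names IGB generates are not specified here.

```python
def parse_egr_header(lines):
    """Collect '# tag = value' pairs from the top of an EGR file."""
    tags = {}
    for line in lines:
        if not line.startswith("#"):
            break  # the header ends at the first data line
        tag, sep, value = line[1:].partition("=")
        if sep:
            tags[tag.strip()] = value.strip()
    return tags


def score_names(tags, n_scores):
    """Name each score column from its score$i tag, if present."""
    # ASSUMPTION: fall back to "score<i>" when a tag is missing;
    # IGB's actual default names may differ.
    return [tags.get(f"score{i}", f"score{i}") for i in range(n_scores)]


header_lines = [
    "# genome_version = H_sapiens_Apr_2003",
    "# score0 = A375",
    "# score1 = FHS",
    "chr22\t14433291\t14433388\t+\t140.642\t175.816",
]
tags = parse_egr_header(header_lines)
print(tags["genome_version"])  # prints H_sapiens_Apr_2003
print(score_names(tags, 2))    # prints ['A375', 'FHS']
```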
Data in the file
There are three versions of this format. They can all be described this way, where the parentheses indicate optional elements: (annot_id) ((seqid) min_coord max_coord strand) [score]*
seqid is a word string [a-zA-Z_0-9]+
strand can be '+', '-', or '.' for "unknown"
annot_id is a word string [a-zA-Z_0-9]+
All lines must have the same number of columns.
Format 1 has tab-delimited lines with 4 required columns; any additional columns are scores: seqid min_coord max_coord strand [score]*
Format 2 has tab-delimited lines with 5 required columns; any additional columns are scores: annot_id seqid min_coord max_coord strand [score]*
Format 3 has tab-delimited lines with 1 required column; any additional columns are scores: annot_id [score]*
The IGB parser should be able to distinguish between these based on the number of fields and the presence and position of the strand field. For use in IGB, EGR format 3 depends on prior loading of annotations with matching ids.
Format 1:
# genome_version = H_sapiens_Apr_2003
# score0 = A375
# score1 = FHS
chr22 14433291 14433388 + 140.642 175.816
chr22 14433586 14433682 + 52.3838 58.1253
chr22 14434054 14434140 + 36.2883 40.7145
Format 2:
# genome_version = H_sapiens_Apr_2003
# score0 = A375
# score1 = FHS
gene1 chr22 14433291 14433388 + 140.642 175.816
gene2 chr22 14433586 14433682 + 52.3838 58.1253
gene3 chr22 14434054 14434140 + 36.2883 40.7145
Format 3 (assumes annotations with the names gene1, gene2, and gene3 are already loaded):
# genome_version = H_sapiens_Apr_2003
# score0 = A375
# score1 = FHS
gene1 140.642 175.816
gene2 52.3838 58.1253
gene3 36.2883 40.7145
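The way a parser might tell the three layouts apart, following the column definitions given earlier in this section, can be sketched in Python. The heuristic below looks at the number of fields and the position of the strand field. It is a guess at the general approach, not IGB's actual implementation, and classify_egr_line is a made-up name.

```python
STRANDS = {"+", "-", "."}


def classify_egr_line(line):
    """Return (format_number, scores) for one tab-delimited EGR data line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= 5 and fields[4] in STRANDS:
        kind, n_required = 2, 5  # annot_id seqid min_coord max_coord strand
    elif len(fields) >= 4 and fields[3] in STRANDS:
        kind, n_required = 1, 4  # seqid min_coord max_coord strand
    else:
        kind, n_required = 3, 1  # annot_id only
    scores = [float(x) for x in fields[n_required:]]
    return kind, scores


print(classify_egr_line("chr22\t14433291\t14433388\t+\t140.642\t175.816")[0])         # prints 1
print(classify_egr_line("gene1\tchr22\t14433291\t14433388\t+\t140.642\t175.816")[0])  # prints 2
print(classify_egr_line("gene1\t140.642\t175.816")[0])                                # prints 3
```

Because the strand symbol can never appear where a coordinate is expected, checking the fourth and fifth fields is enough to separate the three cases in this sketch.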