
Introduction

IGB supports multiple file formats, in both compressed and uncompressed forms. A brief overview of the most common formats is provided below. For greater detail on working with some of these formats, see the linked pages. Note that for IGB to recognize a file, the file name must end with one of the supported file format extensions listed below.

File Types Summary

Type Extension Description
Affymetrix XML .axml A mostly-obsolete XML format used internally at Affymetrix.
BAM .bam A binary indexed version of the SAM format used for displaying alignment data. See SAMtools for more details.
Note: Be sure you've indexed your BAM file (you should have a .bai file as well). Supported in IGB 6.3.
SAM .sam The original SAM format. Much larger file size than BAM, and slower to work with; we strongly recommend the indexed BAM format. Supported in IGB 6.6.
BAR .bar Binary graph format developed by Affymetrix. Generated from tiling arrays by TAS (Tiling Analysis Software.)
BED .bed A tabular format developed for use with the UCSC genome browser. The specification is available at http://genome.ucsc.edu/goldenPath/help/customTrack.html#BED
BGR .bgr Binary graph format.
Binary Files .bps, .bgn, .brs, .bsnp, .brpt, .bp1 Binary annotation formats developed specifically for IGB by Affymetrix. These are generally not documented.
BNIB .bnib A binary format for sequence data developed for IGB by Affymetrix to speed up loading sequence data over the network.
CHP .chp Binary files generated by Affymetrix software. Signal values and presence calls for probesets on Affymetrix microarrays. There are multiple sub-formats, identifiable from the file contents. IGB can read most, but not all, of these formats.
CN_SEGMENTS .cn_segments Output from Affymetrix Genotyping Console representing a summary of all chromosomal regions exhibiting copy number change.
Copy Number Analysis .cnt The Affymetrix CNT file format is a tabular format representing output from the Affymetrix CNAT program.
Cytoband .cyt A text format for cytoband data to represent cytoBand.txt files from the UCSC genome browser.
DAS XML files .das, .dasxml, .das2xml XML formats returned from DAS servers. See http://www.biodas.org. See DAS/1 specification and DAS/2 specification
Expression Graphs .egr, .egr.txt, .sin EGR is a tabular format for associating arbitrary numbers of scores with scored genomic intervals. The extension ".egr" is preferred. The other extensions are kept for backwards compatibility. These files are generated by Affymetrix GeneChip Operating Software (GCOS) or ExACT (Exon Array Computational Tool) software.
FASTA .fa, .fasta Sequence data in a simple ASCII format. See the FASTA Format section below. Recommended only for short sequences. Otherwise, use 2Bit (see below) or convert to the ".bnib" format (see Converting FASTA to BNIB). Note: IGB does not support the use of the Control-A character in the header lines.
GenBank .gb, .gen NCBI's file format. Experimentally supported in IGB 6.4.
GFF (General Feature Format) .gff, .gtf, .gff3 General Feature Format. There are several types of GFF files that use incompatible syntax. The original GFF format is GFF1. A variant called GTF is also used. GFF3 has been proposed to extend GFF and to constrain the specification more tightly to avoid mutually incompatible versions of GFF. If IGB has difficulty reading your GFF file, make sure the header includes the GFF version, as indicated in the GFF specification documents.
GR .gr Tab-delimited graph format. A simple text format containing two columns of numbers separated by a single space or tab.  The first number is the base position; the second number is the score. Because this format does not include chromosome names, we recommend you use .sgr or .wig formats instead.
PSL .psl, .psl3, .link.psl PSL is a tabular format used for representing alignments in UCSC's BLAT tool.
PSLX .pslx PSLX is an extension to the PSL format that shows sequence data in each alignment.   Supported in IGB 6.4.
SAM .sam See the SAM entry above. In IGB versions before 6.6, the SAM format is not directly supported; convert to the binary BAM format via SAMtools, and be sure to index your file as well.
Scored Intervals .sin, .egr, .egr.txt See Expression Graphs.
Scored Map .map An outdated format, replaced now by EGR files.
SGR .sgr Tab-delimited graph format. Sequence graph files that show base coordinate scores. These files are generated by CNAT (the Affymetrix Chromosome Copy Number Analysis Tool software). The format of .sgr text files is: chromosome identification, then two columns of numbers separated by a single space or a tab. The first number is the base position; the second number is the score.
SIN .sin See Expression Graphs format.
USeq .useq USeq is a binary indexed format used to display graph and annotation data. Supported in IGB 6.2. For more information about it, see: http://useq.sourceforge.net.
Wiggle .wig This is a text format for graphical data designed for the UCSC genome browser. IGB supports all 3 subtypes: BED, variableStep, fixedStep. For more information, see the UCSC Web page describing the format: http://genome.ucsc.edu/goldenPath/help/wiggle.html. Files in wiggle format can use UCSC track lines to specify colors and other properties.
2Bit .2bit 2Bit is a compact format for DNA sequences developed by UCSC. Supported in IGB 6.3. See http://genome.ucsc.edu/FAQ/FAQformat.html#format7 for more information about it.
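
Because IGB recognizes a file only by its extension, it can help to check file names before loading them. The following is a minimal Python sketch, not IGB's own logic: the extension list is copied from the table above, while the function name `has_supported_extension` and the case-sensitive comparison are assumptions made for illustration. It does not account for any additional suffix a compressed file might carry.

```python
# Extensions copied from the File Types Summary table above.
SUPPORTED_EXTENSIONS = (
    ".axml", ".bam", ".sam", ".bar", ".bed", ".bgr",
    ".bps", ".bgn", ".brs", ".bsnp", ".brpt", ".bp1",
    ".bnib", ".chp", ".cn_segments", ".cnt", ".cyt",
    ".das", ".dasxml", ".das2xml",
    ".egr", ".egr.txt", ".sin",
    ".fa", ".fasta", ".gb", ".gen",
    ".gff", ".gtf", ".gff3", ".gr",
    ".psl", ".psl3", ".link.psl", ".pslx",
    ".map", ".sgr", ".useq", ".wig", ".2bit",
)


def has_supported_extension(file_name):
    """Return True if file_name ends with one of the listed extensions.

    Hypothetical helper: the comparison is case-sensitive, which is an
    assumption and may not match how IGB itself compares names.
    """
    # str.endswith accepts a tuple of candidate suffixes.
    return file_name.endswith(SUPPORTED_EXTENSIONS)
```

For example, `has_supported_extension("reads.bam")` is true, while a file named `reads.txt` would need to be renamed before IGB could recognize it.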

Additional File Types Supported Through Plugins

In IGB 6.6, we have added the ability to handle VCF (the preferred format of the 1000 Genomes Project), BigWig, and BigBed (UCSC formats) through the use of Plugins. Go to the Plugin tab and activate the appropriate Plugins. More about VCF here; both bigWig and bigBed details can be found here.

.bed Format

.bed is a tabular format for genomic annotations that was developed for use with the UCSC genome browser. The full specifications are available at the UCSC Web site: http://genome.ucsc.edu/goldenPath/help/customTrack.html#BED

Here is an example of a simple BED file with all 12 columns filled. (Spaces should be replaced with tab characters. IGB will accept spaces instead of tabs, but that is not recommended for compatibility with other programs.)

track name=pairedReads description="Clone Paired Reads" useScore=1
chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512
chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601
Note: IGB will also tolerate, and safely ignore, an extra column placed before the sequence name (first) column.
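
To make the 12 columns concrete, here is a small Python sketch that splits one BED line into named fields. It is an illustration only, not IGB's parser. The field names follow the UCSC BED specification; the `BedRecord` type and the `parse_bed_line` function are hypothetical names introduced here. It splits on any whitespace, so it accepts both tabs and spaces, but it would fail on a name that itself contains spaces, and it does not handle the optional extra leading column mentioned above.

```python
from collections import namedtuple

# Field names as given in the UCSC BED specification.
BedRecord = namedtuple("BedRecord", [
    "chrom", "chrom_start", "chrom_end", "name", "score", "strand",
    "thick_start", "thick_end", "item_rgb",
    "block_count", "block_sizes", "block_starts",
])


def parse_bed_line(line):
    """Parse one 12-column BED data line into a BedRecord."""
    fields = line.split()
    if len(fields) != 12:
        raise ValueError("expected 12 columns, found %d" % len(fields))
    # Comma-separated lists may carry a trailing comma, so skip empty items.
    block_sizes = [int(x) for x in fields[10].split(",") if x]
    block_starts = [int(x) for x in fields[11].split(",") if x]
    return BedRecord(
        fields[0], int(fields[1]), int(fields[2]), fields[3],
        int(fields[4]), fields[5], int(fields[6]), int(fields[7]),
        fields[8], int(fields[9]), block_sizes, block_starts,
    )


record = parse_bed_line(
    "chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512"
)
# Block starts are relative to chrom_start, so add it to get
# genomic coordinates for each block.
blocks = [
    (record.chrom_start + start, record.chrom_start + start + size)
    for start, size in zip(record.block_starts, record.block_sizes)
]
```

For the cloneA line, `blocks` holds the two intervals (1000, 1567) and (4512, 5000); the second block ends exactly at the feature end, as the BED specification requires.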

.gff and Related Formats

GFF stands for 'general feature format' or 'gene finding format'; it is a tab-delimited file with 9 columns. There are several types of GFF files that use incompatible syntax. The original GFF format is GFF1. A variant called GTF is also used. GFF3 has been proposed to extend on GFF and to constrain the specification more tightly to avoid mutually-incompatible versions of GFF. Some GFF files created by Affymetrix make use of extensions to GFF that are specific to IGB. These are indicated in the file headers by lines beginning with "##IGB-".

IGB can handle most versions of GFF/GTF, but may have difficulty with some rarely-used advanced features. IGB does not read any FASTA data that is included in some GFF3 files. If IGB has difficulty reading your GFF file, make sure that there is a line in the header similar to ##gff-version 2 that identifies the correct format number 1, 2 or 3.
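
One way to check a file before loading it is to look for the version line in the header. The Python sketch below is only an illustration under simple assumptions: the function name `gff_version` is hypothetical, the header is taken to be the leading run of lines that start with "#", and only the leading integer of the version is returned.

```python
import re


def gff_version(lines):
    """Return the version from a '##gff-version' header line, or None.

    Scans only the leading comment lines and stops at the first data line.
    """
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        if not stripped.startswith("#"):
            break  # first data line reached; the header is over
        match = re.match(r"##gff-version\s+(\d+)", stripped)
        if match:
            return int(match.group(1))
    return None
```

If this returns None for your file, adding a line such as ##gff-version 2 at the top (with the number that matches your file) is the fix suggested above.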

The GFF format is described at http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml

The GTF format is described here http://genes.cs.wustl.edu/GTF2.html

The GFF3 format is described here http://song.sourceforge.net/gff3-jan04.shtml

FASTA Format

FASTA files contain sequence data in a simple ASCII format.

For more information, see the Wikipedia entry on the FASTA format: http://en.wikipedia.org/wiki/FASTA_format

Note: We recommend using FASTA for short sequences only. For loading data into IGB, or populating a DAS or Quickload server, we recommend you use the 2Bit sequence format developed at UC Santa Cruz for their genome browser or IGB's own BNIB format.

BNIB Format

BNIB is a format developed for IGB that makes it possible to represent sequence data in a very compact format, thus reducing the amount of network traffic required to send sequence data to the IGB viewer when you press the Load Sequence In View or Load All Sequence buttons. To create BNIB files, see Converting FASTA to BNIB.

.egr and .sin Formats

EGR files (also known as Scored Interval, or .sin, files) are tab-delimited files with an optional header. They can contain one or more scores associated with named annotations or with named or unnamed genomic regions. The header section is a list of tag-value pairs, one per line, in the form: # tag = value. Currently the only tags used by the parser are of the form score$i (score name tags are optional). If score name tags are present, then score number $i will be named according to the value of the score$i tag. If any score name tags are missing, default names will be created.

It is recommended that a tag value pair with the genome version, such as #genome_version = H_sapiens_May_2004 , be included to indicate which genome assembly the sequence coordinates are based on. This will ensure that the file is being compared to other annotations from the same assembly.
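
The header rules above can be sketched in Python as follows. This is an illustration, not the IGB parser: the function names `parse_egr_header` and `score_names` are hypothetical, and because the text does not say what the default score names look like, the fallback "score0", "score1", ... used here is purely an assumption.

```python
def parse_egr_header(lines):
    """Collect '# tag = value' pairs from the leading header lines."""
    tags = {}
    for line in lines:
        if not line.startswith("#"):
            break  # the header ends at the first non-comment line
        body = line[1:]
        if "=" not in body:
            continue  # a comment line that is not a tag-value pair
        tag, value = body.split("=", 1)
        tags[tag.strip()] = value.strip()
    return tags


def score_names(tags, score_count):
    """Name each score column from its score$i tag.

    The fallback name 'score<i>' is an assumption for illustration;
    the names IGB actually generates are not documented here.
    """
    return [tags.get("score%d" % i, "score%d" % i) for i in range(score_count)]


header = [
    "# genome_version = H_sapiens_Apr_2003",
    "# score0 = A375",
    "# score1 = FHS",
]
tags = parse_egr_header(header)
names = score_names(tags, 2)
```

With this header, `tags["genome_version"]` is "H_sapiens_Apr_2003" and `names` is ["A375", "FHS"].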

- Data in the file

There are three versions of this format. They can all be described this way, where the parentheses indicate optional elements:

(annot_id) ((seqid) min_coord max_coord strand) [score]*

  1. seqid is word string [a-zA-Z_0-9]+
  2. min_coord is int
  3. max_coord is int
  4. strand can be '+', '-', or '.' for "unknown"
  5. score is float
  6. annot_id is word string [a-zA-Z_0-9]+

All lines must have the same number of columns.

  • Format 1 has tab-delimited lines with 4 required columns; any additional columns are scores: seqid min_coord max_coord strand [score]*
  • Format 2 has tab-delimited lines with 5 required columns; any additional columns are scores: annot_id seqid min_coord max_coord strand [score]*
  • Format 3 has tab-delimited lines with 1 required column; any additional columns are scores: annot_id [score]*

The IGB parser should be able to distinguish between these based on the combination of the number of fields and the presence and position of the strand field. For use in IGB, Format 3 depends on prior loading of annotations with matching IDs.
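
As an illustration of how the three formats can be told apart from the field count and the strand position, here is a minimal Python sketch. It is not the IGB parser; the function name `detect_egr_format` is hypothetical, and the sketch checks only the strand column rather than validating the coordinates or scores.

```python
STRANDS = {"+", "-", "."}


def detect_egr_format(line):
    """Return 1, 2, or 3 for a tab-delimited EGR data line."""
    fields = line.rstrip("\n").split("\t")
    # Format 1: seqid min_coord max_coord strand [score]*
    if len(fields) >= 4 and fields[3] in STRANDS:
        return 1
    # Format 2: annot_id seqid min_coord max_coord strand [score]*
    if len(fields) >= 5 and fields[4] in STRANDS:
        return 2
    # Format 3: annot_id [score]*
    if fields[0]:
        return 3
    raise ValueError("empty EGR line")
```

A line with the strand in the fourth column is Format 1, a line with the strand in the fifth column is Format 2, and anything else with a leading identifier is treated as Format 3.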

Examples

  • Format 1:
    # genome_version = H_sapiens_Apr_2003
    # score0 = A375
    # score1 = FHS
    chr22 14433291 14433388 + 140.642 175.816
    chr22 14433586 14433682 + 52.3838 58.1253
    chr22 14434054 14434140 + 36.2883 40.7145
  • Format 2:
    # genome_version = H_sapiens_Apr_2003
    # score0 = A375
    # score1 = FHS
    gene1 chr22 14433291 14433388 + 140.642 175.816
    gene2 chr22 14433586 14433682 + 52.3838 58.1253
    gene3 chr22 14434054 14434140 + 36.2883 40.7145
  • Format 3 (assumes annotations with the names gene1, gene2, and gene3 are already loaded):
    # genome_version = H_sapiens_Apr_2003
    # score0 = A375
    # score1 = FHS
    gene1 140.642 175.816
    gene2 52.3838 58.1253
    gene3 36.2883 40.7145

.wig Format

A "wig" file is a data file that associates numerical values (e.g., read coverage) with a region of the genome. For example, here is an excerpt from a sample "wig" file created by a program called wiggle, which is part of the TopHat distribution:

track type=bedGraph name="HotK1T2 multi-mapping coverage"
chr1    0       3707    0
chr1    3707    3713    1
chr1    3713    3730    2
chr1    3730    3737    3
chr1    3737    3745    4
and so on.
Note that the top line of the file contains information (meta-data) about the data set, including its name. When you open the file in IGB, you'll see this "name" attribute again in the track label.
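
The structure of the excerpt above can be read with a few lines of Python. This sketch is an illustration under simple assumptions, not IGB's parser: the function names `parse_track_line` and `parse_data_line` are hypothetical, and the data-line parser handles only the four-column layout shown in the excerpt (chromosome, start, end, value).

```python
import shlex


def parse_track_line(line):
    """Return the key=value attributes of a 'track' line as a dict."""
    # shlex.split keeps a quoted value such as name="..." in one token
    # and removes the quotes.
    tokens = shlex.split(line)
    attrs = {}
    for token in tokens[1:]:  # tokens[0] is the word 'track'
        if "=" in token:
            key, value = token.split("=", 1)
            attrs[key] = value
    return attrs


def parse_data_line(line):
    """Parse 'chrom start end value' into a typed tuple."""
    chrom, start, end, value = line.split()
    return chrom, int(start), int(end), float(value)


attrs = parse_track_line(
    'track type=bedGraph name="HotK1T2 multi-mapping coverage"'
)
row = parse_data_line("chr1    3707    3713    1")
```

Here `attrs["name"]` is the text that IGB shows in the track label, and `row` is ("chr1", 3707, 3713, 1.0).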

tabix Indexed Files (IGB 6.6)

File types (SAM, BED, WIG, PSL) that have been indexed using tabix are now supported. Indexed files allow for faster searching and loading. More about tabix can be found here.
