Pittsburgh Supercomputing Center 

Advancing the state-of-the-art in high-performance computing,
communications and data analytics.

BEDTools

 

BEDTools is a collection of utilities for comparing, summarizing, and intersecting genomic features in the UCSC Genome Browser BED format.

BEDTools allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF.

Installed on blacklight, biou

Other resources that may be helpful include:

Programs in the BEDTools package

closestBed returns the closest feature to each entry in a BED file
complementBed returns all intervals _not_ spanned by the features in a BED file
coverageBed creates a BED graph file based on the the overlaps of two BED files
fastaFromBed creates FASTA sequences based on intervals in a BED file
genomeCoverageBed creates either a histogram or a "per base" report of genome coverage
intersectBed returns overlaps between two BED files
linksBed creates an HTML file of links to the UCSC or a custom browser
mergeBed merges overlapping features into a single feature
peIntersectBed returns overlaps between a paired-end BED file and a regular BED file
sortBed sorts a BED file by chrom, then start position
subtractBed removes the portion of an interval that is overlapped by another feature
windowBed returns overlaps between two BED files based on a user-defined window

Running BEDTools

On blacklight

Make the BEDTools commands available for use through the module command.  To load the BEDTools module enter:

module load bedtools

On biou

The BEDTools programs are available through the Galaxy instance on biou.

To make the BEDTools programs available through the command line, csh users should enter the following command:

source /packages/bin/SETUP_BIO_SOFTWARE

To make the BEDTools programs available through the command line, bash users should enter the following command:

source /packages/bin/SETUP_BIO_SOFTWARE

 

Usage Examples

closestBed

Very similar to intersectBed, but finds the closest (not necessarily overlapping) feature in B to each feature in A. If multiple features in B overlap a feature in A, the one with the highest overlap with respect to A is reported.

complementBed

Returns those intervals in a given genome that do not overlap the entries in a BED file.

  • For example, report all intervals in the human genome that are not covered by repetitive elements.
    $ complementBed -i repeatMasker.bed -g hg18.genome
    That is:
    RepeatMasker: ============ ==== =========== ========
    Output: **** ******** ************ ************* *******
    

    NOTE: Genome files must contain a single line for each chromosome in the following tab-delimited format. If you wish to create a genome file for a different species, you can easily do this by using the UCSC Genome Browser's MySQL interface or use the Table Browser (table chromInfo). Here's an example for how to collect the genome info for hg18.

    mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from hg18.chromInfo" > hg18.genome

    NOTE: Be sure to remove the header line (chrom size) in the resulting file prior to running genomeCoverageBed

    For example, here is hg18.genome.

    chr1 247249719
    chr1_random 1663265
    chr10 135374737
    chr10_random 113275
    chr11 134452384
    chr11_random 215294
    chr12 132349534
    chr13 114142980
    chr13_random 186858
    chr14 106368585
    chr15 100338915
    chr15_random 784346
    chr16 88827254
    chr16_random 105485
    chr17 78774742
    chr17_random 2617613
    chr18 76117153
    chr18_random 4262
    chr19 63811651
    chr19_random 301858
    chr2 242951149
    chr2_random 185571
    chr20 62435964
    chr21 46944323
    chr21_random 1679693
    chr22 49691432
    chr22_random 257318
    chr22_h2_hap1 63661
    chr3 199501827
    chr3_random 749256
    chr4 191273063
    chr4_random 842648
    chr5 180857866
    chr5_random 143687
    chr5_h2_hap1 1794870
    chr6 170899992
    chr6_random 1875562
    chr6_cox_hap1 4731698
    chr6_qbl_hap2 4565931
    chr7 158821424
    chr7_random 549659
    chr8 146274826
    chr8_random 943810
    chr9 140273252
    chr9_random 1146434
    chrM 16571
    chrX 154913754
    chrX_random 1719168
    chrY 57772954
    

coverageBed

Reports the number of entries in A that overlap with each feature in B. For example, one could use BED B as say 10Kb "windows" across a genome. If BED A contained the coordinates of sequence alignments, one could use coverageBed to count the number of reads that map to each window. This would be a preliminary step copy-number analysis. The output is in BEDGRAPH format and will allow you to plot the result on the UCSC Genome Browser.

  • For example, column four is the number of reads in "A" that overlapped each "window" in "B" :
    $ coverageBed -a illuminaReads.bed -b windows.bed | head
    chr1 0 10000 0
    chr1 10001 20000 33
    chr1 20001 30000 42
    chr1 30001 40000 71
    

fastaFromBed

Extracts sequences in FASTA format based for each of the intervals defined in a BED file. The headers in the input FASTA file must exactly match the chrom column (column 1) in the BED file. If a strand is provided, the sequence will be reported accordingly (i.e. "-" returns a reverse complement). Otherwise, everything is reported on the "+" strand.

  • The example below will report those intervals in the human genome that do not contain an annotated repeat.
    $ fastaFromBed -db hg18.fa -bed segDups.hg18.bed -fo segDups.hg18.fasta

genomeCoverageBed

Computes a histogram of sequence coverage (aligned sequences in BED format) for a given genome. Optionally, (-d) it will also report the depth of coverage at each base on each chromosome in the genome file (-g) supplied. The (-max) option will "lump" all positions with a depth >= max into a single bin. I thank Royden Clark for his help in developing the speedy algorithm to compute coverage. By default, the output format is chrom, depth, number of bases at that depth, size of chrom, fraction of bases at depth.

  • For example, here's a look at "Joe's genome" coverage on chromosome 1, binning everything with a depth >= 10 together:
    $ genomeCoverageBed -g hg18.genome -i joe.bed -max 10 | grep -P "chr1\t"
    chr1 0 88757187 247249719 0.358978
    chr1 1 26410800 247249719 0.106818
    chr1 2 21892230 247249719 0.088543
    chr1 3 18653191 247249719 0.0754427
    chr1 4 16000409 247249719 0.0647136
    chr1 5 13564576 247249719 0.0548618
    chr1 6 11582171 247249719 0.046844
    chr1 7 9722031 247249719 0.0393207
    chr1 8 8168742 247249719 0.0330384
    chr1 9 6765343 247249719 0.0273624
    chr1 10 25733039 247249719 0.104077
    

intersectBed

Reports overlaps between features in two BED files.

NOTE: When intersecting SNPs, make sure the coordinate conform to the UCSC format. That is, the start position for each SNP should be SNP position - 1 and the end position should be SNP position. E.g. chr7 10000001 10000002 rs123464

  • Report the base-pair overlap between the features in two BED files.
    $ intersectBed -a reads.bed -b genes.bed
  • Report whether each entry in A overlaps ANY entry in B.
    $ intersectBed -a reads.bed -b genes.bed -u
  • Report those entries in A that overlap NO entries in B. Like "grep -v"
    $ intersectBed -a reads.bed -b genes.bed -v
  • Report the count of entries in B that A overlaps.
    $ intersectBed -a reads.bed -b genes.bed -c
  • Report the entire, original A entry for each overlap with B.
    $ intersectBed -a reads.bed -b genes.bed -wa
  • Report the entire, original B entry for each overlap.
    $ intersectBed -a reads.bed -b genes.bed -wb
  • Read BED A from stdin. Useful for stringing together commands.
    NOTE: -a must be "stdin".

    For example, find genes that overlap LINEs but not SINEs.

    $ intersectBed -a genes.bed -b LINES.bed | intersectBed -a stdin -b SINEs.bed -v

linksBed

  • Creates HTML links from a BED file.
    $ linksBed myRegions.bed > myRegions.html

mergeBed

Merges overlapping or "book-ended" entries into a single entry.

  • Merge overlapping repetitive elements into a single entry
    $ mergeBed -i repeatMasker.bed
  • Merge overlapping repetitive elements into a single entry, returning the number of entries merged
    $ mergeBed -i repeatMasker.bed -n
  • Merge _nearby_ repetitive elements into a single entry, so long as they are within 1000 bp of one another
    $ mergeBed -i repeatMasker.bed -d 1000

peIntersectBed

Reports overlaps between features in a "paired-end" BEDPE file and a regular BED feature file.

OVERLAPS WITH PAIRED-END "ENDS"

  • Which paired-end reads overlap with a repeat on EITHER end?
    $ peIntersectBed -a reads.bedpe -b repeatMasked.bed -type either
  • >Which structural variations overlap with a repeat on BOTH ends?
    $ peIntersectBed -a reads.bedpe -b repeatMasked.bed -type both
  • Which paired-end reads DO NOT overlap with a repeat on either end?
    $ peIntersectBed -a reads.bedpe -b repeatMasked.bed -type NEITHER
  • Which structural variations overlap with a repeat on ONE and ONLY ONE end?
    $ peIntersectBed -a reads.bedpe -b repeatMasked.bed -type xor

OVERLAPS WITH THE "SPAN" OF THE PAIRED-END READ OR EVENT

  • Which pair-end reads overlap span a gap based on the "inner-span"?
        E.g. Pair =====....................=====
        Gap 0000000000000000
     
        $ peIntersectBed -a reads.bedpe -b gaps.bed -type inspan
    
  • Which pair-end reads overlap span a gene based on the "inner-span"?
        E.g. Pair =====....................=====
        Gene 00000000000000000000000
        $ peIntersectBed -a reads.bedpe -b genes.bed -type outspan
    

sortBed

  • Sort a BED file in ascending order, first by chrom, then by start position.
    $ sortBed scrambled.bed
    Accepts input from stdin:
    $ cat scrambled.bed | sortBed -i stdin > sorted.bed

subtractBed

For each feature in A, it returns the fraction of that feature that is not overlapped by B. If a feature in A is entirely "spanned" by any feature in B, it will not be reported.

windowBed

Very similar to intersectBed, but allows one to look for overlaps within a window flanking each A entry.

  • Report all genes that are within 10000 bp UPSTREAM and DOWNSTREAM of CNVs.
    $ windowBed -a CNVs.bed -b genes.bed -w 10000

    Allows input from stdin in the same manner as intersectBed.

 

Common Problems

  1. Why am I not finding overlap with my SNPs?

    When intersecting SNPs, make sure the coordinate conform to the UCSC format. That is, the start position for each SNP should be SNP position - 1 and the end position should be SNP position. For example:

    chr7 10000001 10000002 rs123464