BEDTools is a collection of utilities for comparing, summarizing, and intersecting genomic features in the UCSC Genome Browser BED format.
BEDTools allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF.
Other resources that may be helpful include:
- AR Quinlan and IM Hall, BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 26 (6): 841–842 (2010).
- Website: http://code.google.com/p/bedtools/
- Manual: http://bedtools.readthedocs.org/en/latest/
Programs in the BEDTools package
|closestBed||returns the closest feature to each entry in a BED file|
|complementBed||returns all intervals _not_ spanned by the features in a BED file|
|coverageBed||creates a BED graph file based on the the overlaps of two BED files|
|fastaFromBed||creates FASTA sequences based on intervals in a BED file|
|genomeCoverageBed||creates either a histogram or a "per base" report of genome coverage|
|intersectBed||returns overlaps between two BED files|
|linksBed||creates an HTML file of links to the UCSC or a custom browser|
|mergeBed||merges overlapping features into a single feature|
|peIntersectBed||returns overlaps between a paired-end BED file and a regular BED file|
|sortBed||sorts a BED file by chrom, then start position|
|subtractBed||removes the portion of an interval that is overlapped by another feature|
|windowBed||returns overlaps between two BED files based on a user-defined window|
Make the BEDTools commands available for use through the module command. To load the BEDTools module enter:
module load bedtools
The BEDTools programs are available through the Galaxy instance on biou.
To make the BEDTools programs available through the command line, csh users should enter the following command:
To make the BEDTools programs available through the command line, bash users should enter the following command:
Very similar to intersectBed, but finds the closest (not necessarily overlapping) feature in B to each feature in A. If multiple features in B overlap a feature in A, the one with the highest overlap with respect to A is reported.
- Find the closest ALU to each gene.
$ closestBed -a genes.bed -b ALUs.bed
Returns those intervals in a given genome that do not overlap the entries in a BED file.
- For example, report all intervals in the human genome that are not covered by repetitive elements.
$ complementBed -i repeatMasker.bed -g hg18.genome That is: RepeatMasker: ============ ==== =========== ======== Output: **** ******** ************ ************* *******
NOTE: Genome files must contain a single line for each chromosome in the following tab-delimited format. If you wish to create a genome file for a different species, you can easily do this by using the UCSC Genome Browser's MySQL interface or use the Table Browser (table chromInfo). Here's an example for how to collect the genome info for hg18.
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from hg18.chromInfo" > hg18.genome
NOTE: Be sure to remove the header line (chrom size) in the resulting file prior to running genomeCoverageBed
For example, here is hg18.genome.
chr1 247249719 chr1_random 1663265 chr10 135374737 chr10_random 113275 chr11 134452384 chr11_random 215294 chr12 132349534 chr13 114142980 chr13_random 186858 chr14 106368585 chr15 100338915 chr15_random 784346 chr16 88827254 chr16_random 105485 chr17 78774742 chr17_random 2617613 chr18 76117153 chr18_random 4262 chr19 63811651 chr19_random 301858 chr2 242951149 chr2_random 185571 chr20 62435964 chr21 46944323 chr21_random 1679693 chr22 49691432 chr22_random 257318 chr22_h2_hap1 63661 chr3 199501827 chr3_random 749256 chr4 191273063 chr4_random 842648 chr5 180857866 chr5_random 143687 chr5_h2_hap1 1794870 chr6 170899992 chr6_random 1875562 chr6_cox_hap1 4731698 chr6_qbl_hap2 4565931 chr7 158821424 chr7_random 549659 chr8 146274826 chr8_random 943810 chr9 140273252 chr9_random 1146434 chrM 16571 chrX 154913754 chrX_random 1719168 chrY 57772954
Reports the number of entries in A that overlap with each feature in B. For example, one could use BED B as say 10Kb "windows" across a genome. If BED A contained the coordinates of sequence alignments, one could use coverageBed to count the number of reads that map to each window. This would be a preliminary step copy-number analysis. The output is in BEDGRAPH format and will allow you to plot the result on the UCSC Genome Browser.
- For example, column four is the number of reads in "A" that overlapped each "window" in "B" :
$ coverageBed -a illuminaReads.bed -b windows.bed | head chr1 0 10000 0 chr1 10001 20000 33 chr1 20001 30000 42 chr1 30001 40000 71
Extracts sequences in FASTA format based for each of the intervals defined in a BED file. The headers in the input FASTA file must exactly match the chrom column (column 1) in the BED file. If a strand is provided, the sequence will be reported accordingly (i.e. "-" returns a reverse complement). Otherwise, everything is reported on the "+" strand.
- The example below will report those intervals in the human genome that do not contain an annotated repeat.
$ fastaFromBed -db hg18.fa -bed segDups.hg18.bed -fo segDups.hg18.fasta
Computes a histogram of sequence coverage (aligned sequences in BED format) for a given genome. Optionally, (-d) it will also report the depth of coverage at each base on each chromosome in the genome file (-g) supplied. The (-max) option will "lump" all positions with a depth >= max into a single bin. I thank Royden Clark for his help in developing the speedy algorithm to compute coverage. By default, the output format is chrom, depth, number of bases at that depth, size of chrom, fraction of bases at depth.
- For example, here's a look at "Joe's genome" coverage on chromosome 1, binning everything with a depth >= 10 together:
$ genomeCoverageBed -g hg18.genome -i joe.bed -max 10 | grep -P "chr1\t" chr1 0 88757187 247249719 0.358978 chr1 1 26410800 247249719 0.106818 chr1 2 21892230 247249719 0.088543 chr1 3 18653191 247249719 0.0754427 chr1 4 16000409 247249719 0.0647136 chr1 5 13564576 247249719 0.0548618 chr1 6 11582171 247249719 0.046844 chr1 7 9722031 247249719 0.0393207 chr1 8 8168742 247249719 0.0330384 chr1 9 6765343 247249719 0.0273624 chr1 10 25733039 247249719 0.104077
Reports overlaps between features in two BED files.
NOTE: When intersecting SNPs, make sure the coordinate conform to the UCSC format. That is, the start position for each SNP should be SNP position - 1 and the end position should be SNP position. E.g. chr7 10000001 10000002 rs123464
- Report the base-pair overlap between the features in two BED files.
$ intersectBed -a reads.bed -b genes.bed
- Report whether each entry in A overlaps ANY entry in B.
$ intersectBed -a reads.bed -b genes.bed -u
- Report those entries in A that overlap NO entries in B. Like "grep -v"
$ intersectBed -a reads.bed -b genes.bed -v
- Report the count of entries in B that A overlaps.
$ intersectBed -a reads.bed -b genes.bed -c
- Report the entire, original A entry for each overlap with B.
$ intersectBed -a reads.bed -b genes.bed -wa
- Report the entire, original B entry for each overlap.
$ intersectBed -a reads.bed -b genes.bed -wb
- Read BED A from stdin. Useful for stringing together commands.
NOTE: -a must be "stdin".
For example, find genes that overlap LINEs but not SINEs.
$ intersectBed -a genes.bed -b LINES.bed | intersectBed -a stdin -b SINEs.bed -v
- Creates HTML links from a BED file.
$ linksBed myRegions.bed > myRegions.html
Merges overlapping or "book-ended" entries into a single entry.
- Merge overlapping repetitive elements into a single entry
$ mergeBed -i repeatMasker.bed
- Merge overlapping repetitive elements into a single entry, returning the number of entries merged
$ mergeBed -i repeatMasker.bed -n
- Merge _nearby_ repetitive elements into a single entry, so long as they are within 1000 bp of one another
$ mergeBed -i repeatMasker.bed -d 1000
Reports overlaps between features in a "paired-end" BEDPE file and a regular BED feature file.
OVERLAPS WITH PAIRED-END "ENDS"
- Which paired-end reads overlap with a repeat on EITHER end?
$ peIntersectBed -a reads.bedpe -b repeatMasked.bed -type either
- >Which structural variations overlap with a repeat on BOTH ends?
$ peIntersectBed -a reads.bedpe -b repeatMasked.bed -type both
- Which paired-end reads DO NOT overlap with a repeat on either end?
$ peIntersectBed -a reads.bedpe -b repeatMasked.bed -type NEITHER
- Which structural variations overlap with a repeat on ONE and ONLY ONE end?
$ peIntersectBed -a reads.bedpe -b repeatMasked.bed -type xor
OVERLAPS WITH THE "SPAN" OF THE PAIRED-END READ OR EVENT
- Which pair-end reads overlap span a gap based on the "inner-span"?
E.g. Pair =====....................===== Gap 0000000000000000 $ peIntersectBed -a reads.bedpe -b gaps.bed -type inspan
- Which pair-end reads overlap span a gene based on the "inner-span"?
E.g. Pair =====....................===== Gene 00000000000000000000000 $ peIntersectBed -a reads.bedpe -b genes.bed -type outspan
- Sort a BED file in ascending order, first by chrom, then by start position.
$ sortBed scrambled.bedAccepts input from stdin:
$ cat scrambled.bed | sortBed -i stdin > sorted.bed
For each feature in A, it returns the fraction of that feature that is not overlapped by B. If a feature in A is entirely "spanned" by any feature in B, it will not be reported.
- Remove introns from genes.
$ subtractBed -a genes.bed -b introns.bed
Very similar to intersectBed, but allows one to look for overlaps within a window flanking each A entry.
- Report all genes that are within 10000 bp UPSTREAM and DOWNSTREAM of CNVs.
$ windowBed -a CNVs.bed -b genes.bed -w 10000
- Why am I not finding overlap with my SNPs?
When intersecting SNPs, make sure the coordinate conform to the UCSC format. That is, the start position for each SNP should be SNP position - 1 and the end position should be SNP position. For example:
chr7 10000001 10000002 rs123464