Random Projection-based IBD Detection (RaPID)

Ardalan Naseri, Xiaoming Liu, Shaojie Zhang, Degui Zhi

1. Usage information

The binary (trial) version of the program for Linux (64-bit) can be downloaded from here. RaPID is free for academic use. For commercial use, please contact authors.

    chomod +x RaPID_v.1.2

    Usage: ./RaPID_v.1.2 <option(s)>

    Options:

    -h,--help					Show this help message
    -i,--input <INPUT FILE>			Input file path in MACS format
    -o,--output <OUTPUT FOLDER> 		Output folder. The final result will be results.max in the output folder
    -a,--no-allele-frequency                    Ignore MAFs. By default the sites are selected at random weighted by their MAFs.
    -w,--window <WINDOW SIZE> 			Window size for subsampling
    -r,--run <NUMBER OF RUNS>			Number of runs
    -s,--support <NUMBER OF SUCCESSES>		Minimum number of successes to consider a hit
    -m,--minimum <MINIMUM IBD LENGTH>		Minimum IBD length in terms of SNPs
    -l,--length <MINIMUM IBD LENGTH>		Minimum IBD length in cM
    -t,--total <TOTAL LENGTH>			Length of chromosome in cM or Mbps
    -c,--count-sites                            Count the number of sites line by line. Useful if the MaCS file is not well formatted.
    -p,--percentage-sites                       Percentage of the variant sites to cover a given minimum length (m = p*N*l/t). Not applicable if the minimum length is given in terms of the number of sites(default 0.9)


Required:

    <INPUT FILE> <OUTPUT FOLDER> <WINDOW SIZE> <NUMBER OF RUNS> <NUMBER OF SUCCESSES> <MINIMUM IBD LENGTH>
    <TOTAL LENGTH> required if minimum IBD length is given in cM or Mbps

The input file should be given in MaCS format. You can use this script to convert VCF files into macs format.

You can find an example of a panel with 4000 haplotypes with the chromosome length of 10Mbps in macs format here. Unzip the file using

    gunzip 4k_1e7_e0.001.tar.gz
    tar -xvf 4k_1e7_e0.001.tar

To run the program for detecting the IBDs with a minimum length of 1.5 Mbps type:

    RaPID_v.1.2 -i 4k_1e7_e0.001 -o output -r 10 -w 35 -s 2 -l 1.5 -t 10

The program outputs all detected IBD segments in the following format:

    MATCH <hap1_index> <hap2_index> <starting_position> <ending_position> <length>
<starting_position> denotes the start of the IBD segments in terms of variant sites. <ending_position> denotes the end of the IBD segments, respectively.

For example:

    MATCH 0 1369 29280 43739 14460

0 is the index of the first haplotype and 1369 is the index of the second haplotype. 29280 is the starting site, 43739 is the ending site and 14460 is the length of the IBD in terms of variant sites.

2. Parameters selection

The java interface RaPID_Params.jar can be used to estimate the parameters:

    java -jar RaPID_Params.jar



Java Interface Java_Params.jar to estimate the parameters of RaPID

3. Benchmarking

Simulation Data: 4000 haplotypes of the length 10Mbps were generated using MaCS simulator:

    macs 4000 10000000 -seeds 1 -T -t 1.167 -r 0.8453 -h 1000 -G 21133448 -eG 1.183e-07 5072027 -eG 7.098e-07 1183473 -eN 3.549e-06 0.0001420 -eN 5.915e-05
    0.001136 2>trees.txt 1> sites.txt

MaCS simulator can be downloaded from https://github.com/gchen98/macs.

IBD segments can be extracted from the generated files using Dendropy python library (http://www.dendropy.org/). The Python script extract_IBDs.py uses Dendropy package and extracts the IBD segments.

For benchmarking (sequencing data) the following commands with corresponding parameters were used:

    RaPID_v.1.2 -i <input_file> -o <output_folder> -r 10 -w 80 -s 2 -l 1.5 -t 10
    RaPID_v.1.2 -i <input_file> -o <output_folder> -r 10 -w 110 -s 2 -l 2 -t 10
    RaPID_v.1.2 -i <input_file> -o <output_folder> -r 10 -w 120 -s 2 -l 2.5 -t 10
    RaPID_v.1.2 -i <input_file> -o <output_folder> -r 10 -w 180 -s 2 -l 3.0 -t 10
    RaPID_v.1.2 -i <input_file> -o <output_folder> -r 10 -w 200 -s 2 -l 3.5 -t 10
    RaPID_v.1.2 -i <input_file> -o <output_folder> -r 10 -w 250 -s 2 -l 4.0 -t 10
    RaPID_v.1.2 -i <input_file> -o <output_folder> -r 10 -w 280 -s 2 -l 4.5 -t 10
    RaPID_v.1.2 -i <input_file> -o <output_folder> -r 10 -w 300 -s 2 -l 5.0 -t 10

For benchmarking (SNP array data) the following commands with corresponding parameters were used:

    RaPID_v.1.2 -i <input_file> -o <output_folder> -r 10 -w 2 -s 2 -l 1.5 -t 10
    RaPID_v.1.2 -i <input_file> -o <output_folder> -r 10 -w 3 -s 2 -l 2 -t 10
    RaPID_v.1.2 -i <input_file> -o <output_folder> -r 10 -w 4 -s 2 -l 2.5 -t 10
    RaPID_v.1.2 -i <input_file> -o <output_folder> -r 10 -w 5 -s 2 -l 3.0 -t 10
    RaPID_v.1.2 -i <input_file> -o <output_folder> -r 10 -w 5 -s 2 -l 3.5 -t 10
    RaPID_v.1.2 -i <input_file> -o <output_folder> -r 10 -w 5 -s 2 -l 4.0 -t 10
    RaPID_v.1.2 -i <input_file> -o <output_folder> -r 10 -w 6 -s 2 -l 4.5 -t 10
    RaPID_v.1.2 -i <input_file> -o <output_folder> -r 10 -w 6 -s 2 -l 5.0 -t 10

4. Applying RaPID on 1000G Project Data

All autosomal chromosomes from 1000G project data can be downloaded from: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502

To convert the VCF files into macs format and extract the bi-alleleic sites use the Python script convert_vcf_to_macs.py

The following commands with corresponding parameters were used:
 
    RaPID_v.1.2 -i <input_file> -o <chr_name> -r 10 -w 60 -s 2 -l 1.0 -t <chr_length>
    RaPID_v.1.2 -i <input_file> -o <chr_name> -r 10 -w 65 -s 2 -l 1.11 -t <chr_length>
    RaPID_v.1.2 -i <input_file> -o <chr_name> -r 10 -w 70 -s 2 -l 1.25 -t <chr_length>
    RaPID_v.1.2 -i <input_file> -o <chr_name> -r 10 -w 75 -s 2 -l 1.43 -t <chr_length>
    RaPID_v.1.2 -i <input_file> -o <chr_name> -r 10 -w 90 -s 2 -l 1.67 -t <chr_length>
    RaPID_v.1.2 -i <input_file> -o <chr_name> -r 10 -w 110 -s 2 -l 2.0 -t <chr_length>
    RaPID_v.1.2 -i <input_file> -o <chr_name> -r 10 -w 120 -s 2 -l 2.5 -t <chr_length>
    RaPID_v.1.2 -i <input_file> -o <chr_name> -r 10 -w 190 -s 2 -l 3.33 -t <chr_length>
    RaPID_v.1.2 -i <input_file> -o <chr_name> -r 10 -w 300 -s 2 -l 5 -t <chr_length>
    RaPID_v.1.2 -i <input_file> -o <chr_name> -r 10 -w 450 -s 2 -l 10 -t <chr_length>
The Python script compute_kinship.py computes the kinship among all populations and plots the kinship matrix.

5. 1000G Project Results

Detected IBD segments in 1000G project

Visualization of shared IBD segments among populations in 1000G project


6. RaPID version v.1.0 for RECOMB 2017