Random Projection-based IBD Detection (RaPID)

Ardalan Naseri, Xiaoming Liu, Degui Zhi, Shaojie Zhang

1. Usage information

The binary version of the program for Linux (64-bit) can be downloaded from here.

    chomod +x RaPID_v.1.0

    Usage: ./RaPID_v.1.0 <option(s)>

    Options:

    -h,--help					Show this help message
    -i,--input <INPUT FILE>			Input file path in MACS format
    -o,--output <OUTPUT FOLDER> 		Output folder. The final result will be results.max in the output folder
    -w,--window <WINDOW SIZE> 			Window size for subsampling
    -r,--run <NUMBER OF RUNS>			Number of runs
    -s,--support <NUMBER OF SUCCESSES>		Minimum number of successes to consider a hit
    -m,--minimum <MINIMUM IBD LENGTH>		Minimum IBD length in terms of SNPs
    -l,--length <MINIMUM IBD LENGTH>		Minimum IBD length in cM
    -t,--total <TOTAL LENGTH>			Length of chromosome in cM or Mbps

Required:

    <INPUT FILE> <OUTPUT FOLDER> <WINDOW SIZE> <NUMBER OF RUNS> <NUMBER OF SUCCESSES> <MINIMUM IBD LENGTH>
    <TOTAL LENGTH> required if minimum IBD length is given in cM or Mbps

The input file should be given in MaCS format. You can use this script to convert VCF files into macs format.

You can find an example of a panel with 4000 haplotypes with the chromosome length of 10Mbps in macs format here. Unzip the file using

    gunzip 4k_1e7_e0.001.tar.gz
    tar -xvf 4k_1e7_e0.001.tar

To run the program for detecting the IBDs with a minimum length of 1.5 Mbps type:

    RaPID_v.1.0 -i 4k_1e7_e0.001 -o output -r 40 -w 35 -s 20 -l 1.5 -t 10

The program outputs all detected IBD segments in the following format:

    MATCH <hap1_index> <hap2_index> <starting_position> <ending_position> <length>
<starting_position> denotes the start of the IBD segments in terms of variant sites. <ending_position> denotes the end of the IBD segments, respectively.

For example:

    MATCH 0 1369 29280 43739 14460

0 is the index of the first haplotype and 1369 is the index of the second haplotype. 29280 is the starting site, 43739 is the ending site and 14460 is the length of the IBD in terms of variant sites.

2. Parameters selection

The java interface RaPID_Params.jar can be used to estimate the parameters:

    java -jar RaPID_Params.jar



Java Interface Java_Params.jar to estimate the parameters of RaPID

3. Benchmarking

Simulation Data: 4000 haplotypes of the length 10Mbps were generated using MaCS simulator:

    macs 4000 10000000 -seeds 1 -T -t 1.167 -r 0.8453 -h 1000 -G 21133448 -eG 1.183e-07 5072027 -eG 7.098e-07 1183473 -eN 3.549e-06 0.0001420 -eN 5.915e-05
    0.001136 2>trees.txt 1> sites.txt

MaCS simulator can be downloaded from https://github.com/gchen98/macs.

IBD segments can be extracted from the generated files using Dendropy python library (http://www.dendropy.org/). The Python script extract_IBDs.py uses Dendropy package and extracts the IBD segments.

For benchmarking the following commands with corresponding parameters were used:

    RaPID_v.1.0 -i <input_file> -o <output_folder> -r 80 -w 75 -s 40 -l 1.5 -t 10
    RaPID_v.1.0 -i <input_file> -o <output_folder> -r 80 -w 105 -s 40 -l 2 -t 10
    RaPID_v.1.0 -i <input_file> -o <output_folder> -r 80 -w 125 -s 40 -l 2.5 -t 10
    RaPID_v.1.0 -i <input_file> -o <output_folder> -r 80 -w 160 -s 40 -l 3.0 -t 10
    RaPID_v.1.0 -i <input_file> -o <output_folder> -r 80 -w 185 -s 40 -l 3.5 -t 10
    RaPID_v.1.0 -i <input_file> -o <output_folder> -r 80 -w 215 -s 40 -l 4.0 -t 10
    RaPID_v1.0 -i <input_file> -o <output_folder> -r 80 -w 235 -s 40 -l 4.5 -t 10
    RaPID_v.1.0 -i <input_file> -o <output_folder> -r 80 -w 265 -s 40 -l 5.0 -t 10

4. Applying RaPID on 1000G Project Data

All autosomal chromosomes from 1000G project data can be downloaded from: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502

To convert the VCF files into macs format and extract the bi-alleleic sites use the Python script convert_vcf_to_macs.py

For 1000G project data, we assumed an error rate of 0.001 which corresponds to genotyping quality of 30 and estimated the parameters accordingly. The following commands with corresponding parameters were used:

    RaPID_v.1.0 -i <input_file> -o <chr_name> -r 40 -w 35 -s 20 -l 1.5 -t <chr_length>
    RaPID_v.1.0 -i <input_file> -o <chr_name> -r 40 -w 180 -s 20 -l 5.0 -t <chr_length>
The Python script compute_kinship.py computes the kinship among all populations and plots the kinship matrix.