fillGenotype

First, download the fillGenotype program and the sample data.

1. Downloading the program

Because the program is written in C, the compiled executable differs from platform to platform. If you use a Linux operating system with an x86_64 CPU, the download address of the compiled program is:

fillGenotype_86_64.gz

2. Downloading the sample data

The sample data comes from rice chromosome 4. It contains 55,819 SNP sites across 950 inbred lines.

Download address: exam.input.gz

After downloading the two files, use the gunzip command to extract them, because both are compressed with gzip.
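The extraction step can be sketched as the small shell function below. The function name extract_downloads is hypothetical and is introduced only for illustration, and the archive names assume the downloads were saved under the names shown above. Note that gunzip names its result after the archive with the .gz suffix removed.

```shell
# Hypothetical helper: decompress each named .gz archive in place.
# gunzip replaces NAME.gz with NAME, so fillGenotype_86_64.gz
# becomes fillGenotype_86_64 and exam.input.gz becomes exam.input.
extract_downloads() {
    for archive in "$@"; do
        if [ -f "$archive" ]; then
            gunzip "$archive"
        else
            echo "not found: $archive" >&2
        fi
    done
}

# Usage, run in the directory that holds the two downloads:
# extract_downloads fillGenotype_86_64.gz exam.input.gz
```

Since the commands on this page refer to the program as fillGenotype, you may need to rename the extracted file first, for example with mv fillGenotype_86_64 fillGenotype.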

Second, set the parameters and run the program.

1. After extracting the program's .gz file, use the chmod command to change its permissions. The command is as follows:

chmod +x fillGenotype

This step makes the program executable on a Linux system.

Then use the command below to view the program's parameters.

./fillGenotype --help

You will see the following help information:

usage: ./fillGenotype -w <window-size> -k <K-value> -p <noequal-punish> -r <percent-ratio> -i <input-fle> -o <output-file>
-w, --windowSize
define window size
-k, --K-value
define the number of neighbor
-p, --noequalPunish
define punish when two genotypes are not same. noequalPunish must be integer. Normally gapScore is 5, equalScore is 10
-r, --percentRatio
define the ratio which the most genotype account for in all neighbor genotype
-i, --inputFile
the name of input file
-o, --outputFile
the name of output file

The values of the -w, -k, -r, and -p parameters are chosen to make the imputation of missing genotypes more accurate.

To impute our example file, we used the following parameters:

-w 80

-k 5

-p -7

-r 0.7
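To make the -k and -r values concrete: one reading of the help text is that, with -k 5 and -r 0.7, the most common genotype among the 5 neighbors must account for at least 70% of them, that is, at least 4 of the 5. The shell sketch below illustrates that reading only. It is not the program's actual code, the function name vote_genotype is hypothetical, and printing "unresolved" when the threshold is not met is an assumption about what happens in that case.

```shell
# Hypothetical illustration of a ratio-thresholded majority vote.
# Usage: vote_genotype RATIO GENOTYPE...
# Prints the most common genotype if its share of all neighbors is at
# least RATIO; otherwise prints "unresolved" (an assumed placeholder).
vote_genotype() {
    ratio=$1
    shift
    # One genotype per line, count duplicates, largest count first.
    printf '%s\n' "$@" | sort | uniq -c | sort -rn |
        awk -v r="$ratio" -v k="$#" '
            NR == 1 {
                # $1 = count of the most common genotype, $2 = the genotype
                if ($1 / k >= r) print $2; else print "unresolved"
            }'
}

vote_genotype 0.7 A A A A T   # 4 of 5 neighbors agree (0.8 >= 0.7)
vote_genotype 0.7 A A A T T   # 3 of 5 neighbors agree (0.6 < 0.7)
```

The first call prints A and the second prints unresolved, which shows how a stricter -r value trades coverage for confidence under this reading.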

A typical command is:

fillGenotype -w 80 -k 5 -p -7 -r 0.7 -i example.input -o example.output

The parameters must be given in a fixed order: first -w [PARAMETER_VALUE_SPECIFIED], second -k [PARAMETER_VALUE_SPECIFIED], third -p [PARAMETER_VALUE_SPECIFIED], fourth -r [PARAMETER_VALUE_SPECIFIED], fifth -i [INPUT_FILE], and sixth -o [OUTPUT_FILE].
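Because the order cannot change, a small wrapper can help avoid mistakes. The sketch below is a hypothetical helper (build_fill_cmd is not part of fillGenotype) that takes the six values positionally and always emits the flags in the required order. It assumes the executable is ./fillGenotype in the current directory and that the extracted sample file is named exam.input. It only prints the command, so you can check it before running it.

```shell
# Hypothetical helper: build_fill_cmd WINDOW K PUNISH RATIO INPUT OUTPUT
# Prints a fillGenotype command line with the flags in the required
# order (-w, -k, -p, -r, -i, -o). Assumes the binary is ./fillGenotype.
build_fill_cmd() {
    if [ "$#" -ne 6 ]; then
        echo "usage: build_fill_cmd WINDOW K PUNISH RATIO INPUT OUTPUT" >&2
        return 1
    fi
    printf './fillGenotype -w %s -k %s -p %s -r %s -i %s -o %s\n' \
        "$1" "$2" "$3" "$4" "$5" "$6"
}

# Prints: ./fillGenotype -w 80 -k 5 -p -7 -r 0.7 -i exam.input -o exam.output
build_fill_cmd 80 5 -7 0.7 exam.input exam.output
```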

After the program finishes, you will get a result file like this one: exam.output.

On our test Linux server, imputing the missing genotypes took 142 minutes.

If that is too long for you, you can use the head command to select the first 2000 lines of the input file:

head -2000 exam.input > input.file
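If you want to confirm how much of the file the subset covers, the step above can be wrapped as in the sketch below. The function name make_subset is hypothetical. The sketch counts plain text lines only and makes no assumption about what a line of the input file represents.

```shell
# Hypothetical helper: make_subset N INPUT OUTPUT
# Copies the first N lines of INPUT to OUTPUT and reports both line counts.
make_subset() {
    head -n "$1" "$2" > "$3" || return 1
    total=$(wc -l < "$2")
    kept=$(wc -l < "$3")
    # Arithmetic expansion strips any padding that wc adds to its output.
    echo "kept $((kept)) of $((total)) lines"
}

# Usage:
# make_subset 2000 exam.input input.file
```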

Third, to make the output file easier to inspect, we wrote a Perl script that converts the output file into an image.

Download address: showMatrix.pl.gz

Whatever your operating system (Linux or Windows), the command for this script is:

perl showMatrix.pl example.input example.output 5 output_image.jpg

A sample image is shown below:

Base 'A' is shown in red, 'T' in green, 'C' in pink, and 'G' in blue. A missing base (one that was not sequenced) is shown in white.

Of course, you can modify the script to assign whatever colors you like.