Two basic types of linkage analysis are available, Twopoint and Multipoint. For Twopoint linkage analysis, only the IBDs at specific markers or candidate loci are required. For Multipoint linkage analysis, multipoint IBDs (MIBDs) are required, which in turn requires that marker locations have been mapped. IBD stands for Identity By Descent, and MIBD stands for Multipoint Identity By Descent.
IBDs and MIBDs are matrices that contain one value for each pair of individuals in a pedigree. To save space, unrelated pairs of individuals are left out and their value is assumed to be zero. The matrices are stored in files which are compressed automatically by SOLAR using the GNU gzip program. Even with compression, the files can take up quite a bit of disk space. The first line in matrix files created by SOLAR looks like strange data but is actually a checksum to ensure the matrix is used only with the correct pedigree. This checksum line is created by the matcrc command. It is not necessary for user-created matrix files to have this checksum line, but it is recommended and easy to do with the matcrc command.
SOLAR uses an approximate method for computing multipoint IBDs. This method is discussed in detail in:
Almasy L, Blangero J (1998) Multipoint quantitative trait linkage analysis in general pedigrees. Am J Hum Genet 62:1198-1211.
It is also possible to use the multipoint IBDs computed by another genetic analysis program. This is discussed in Section 5.4.
Preparation of marker-specific IBD files requires the following commands (in this order):
load pedigree pedigree-filename
load freq freq-filename          ;# Optional
load marker marker-filename
freq mle                         ;# Required if no load freq
ibddir ibd-dirname
ibd
(Depending on the situation, some of these commands are not required. A simplified version of the following discussion appears in the tutorial, Section 3.7.)
Use the load freq command if you have prior knowledge of the
allele frequencies. Otherwise, a simple counting method,
identical to that used by the PEDSYS program genefreq, is used
by the load marker command to determine the allele frequencies.
If you use the simple-count allele frequencies provided by
load marker, it is recommended that you compute maximum
likelihood estimates of the frequencies with the freq mle
command (though that process might take a lot of computer time
and memory). If you load allele frequency information which has
already been maximized with load freq, you may choose not to
use the freq mle command.
The order of the commands is important. The pedigree data
must be loaded first so that SOLAR can determine the family
structure and generate its internal indexing. The marker data
and frequency data must be loaded next. If marker allele
frequencies have been determined previously, these can be
loaded either before or after loading the marker data. If
allele frequencies are not known ahead of time, simply omit
the load freq step.
Only one pedigree file can be loaded at a time. Issuing the
load pedigree command causes any previously loaded pedigree
data to be unloaded. Pedigree data
need only be loaded once. Thereafter, each time SOLAR is run
from the same working directory, the same pedigree data will
still be loaded. Pedigree data must be loaded before marker
data can be loaded, and when pedigree data is unloaded, any
currently loaded marker data will also be unloaded.
Only one marker file can be loaded at a time. The marker data
will remain loaded until either a new marker file is loaded or
the marker unload command is given. Once loaded, the marker
file does not have to be reloaded in subsequent SOLAR runs.
Similarly, only one freq file can be loaded at a time. The
allele frequency data will remain loaded until either a new
freq file is loaded or the freq unload command is given. Once
loaded, the freq file does not have to be reloaded in
subsequent SOLAR runs.
By default, the markers in the marker file are assumed to be
autosomal. If they are X-linked, the XLinked option must be
set with the ibdoption command before the marker file is
loaded. Alternatively, the -xlinked option can be specified in
the load marker command. Since there is no mechanism for
declaring the X-linked status of individual markers, the
markers in the marker file must either all be autosomal or all
be X-linked.
NOTE: At present, marker-specific IBDs for X-linked loci can be computed only with the Curtis and Sham method. This means that twopoint linkage analysis of X-linked loci is restricted to non-inbred pedigrees with limited looping. Also, it is not currently possible to conduct a multipoint linkage analysis for X-linked markers.
Computing the maximum likelihood estimates for the allele
frequencies (using the freq mle command) will improve the
accuracy with which marker genotypes are imputed for those
individuals who are not typed. By default, IBDs will not be
computed for markers with simple-count allele frequencies
generated by the load marker command. That is, MLE allele
frequencies are required when prior frequency data is not
available. This behavior can be overridden either by setting
the NoMLE option with the ibdoption command, or by giving the
-nomle argument to the ibd command.
SOLAR uses an external program, allfreq, to compute MLE allele frequencies. Allfreq is an extension of the program MENDEL. The marker genotypes are expected to follow the rules of Mendelian inheritance. If a marker discrepancy is encountered, the following error message will be displayed:
Mendelian inconsistency found near individual ID = ID
If the pedigree file contains a family ID (FAMID) field, the
family ID will be included in the error message. This
information may be helpful in diagnosing the discrepancy, but
in general you will need to use other software to ensure that
your marker data is clean.
During the maximization procedure, the freq mle command
displays the number of iterations and the improvement in the
log likelihood obtained thus far. If a SOLAR script which
includes the freq mle command is run as a background job, this
display may cause the job to hang or abort. To turn off the
display, use the verbosity min command.
Once computed, MLE allele frequencies will be in effect while
the marker data is loaded. When new marker data is loaded,
the allele frequencies for previously loaded markers will not
be retained (unless the frequencies were loaded from a file by
the load freq command). In order to keep a record of the
allele frequencies that were used to compute IBDs, MLE allele
frequencies should be saved to a file with the freq save
command. The load freq command can be used to restore the MLE
allele frequencies from this file at a later time if desired.
New marker data will not be loaded until either the MLE allele
frequencies are saved or the previous marker data is unloaded
with the marker unload -nosave command. Likewise, the
freq load command will not replace unsaved MLE allele
frequencies with new frequency data unless the -nosave option
is specified.
The directory where the IBD files are to be created must be
specified with the ibddir command before the IBD computation
can be started. While it is perfectly legal to store the IBD
files in the current working directory (specified with
ibddir .) you may find it more convenient to create a
subdirectory to hold the files.
SOLAR uses one of two methods to compute marker-specific IBDs. Which method is used depends on the family structure. The first method is the Curtis and Sham algorithm, in which the LINKAGE/FASTLINK package is used to compute the required likelihoods. This method is not applicable in the case of inbreeding. Furthermore, although LINKAGE/FASTLINK is capable of handling multiple loops, we have chosen to sidestep the issue of choosing multiple loopbreakers. Therefore, if inbreeding is present or if more than one loopbreaker is required in the case of non-inbreeding loops, SOLAR will not use the Curtis and Sham method.
There is an important caveat regarding IBD computation using
the LINKAGE-based method. Because the method requires many
invocations of the LINKAGE programs unknown and mlink, there
will be a high volume of I/O as input and output files are
read and written. As a result, the speed of file I/O will be
the limiting performance factor. If possible, avoid the use
of remote file systems, e.g. NFS, for which the file I/O must
be done over a network. On systems that support an in-memory
file system, e.g. Solaris' /tmp, consider running the IBD
calculations there. We have observed a two-fold or better
performance increase by running in /tmp.
The second method, which is applicable in all cases, is a
Monte Carlo algorithm. First, all missing genotypes are
imputed in a random fashion and the likelihood of the imputed
genotype vector is calculated. A recursive algorithm due to
Davis and Weeks is then used to compute IBDs given the imputed
genotype vector. This process is performed repetitively and a
weighted average of the IBDs is accumulated for each pair of
individuals, where the weight is the likelihood of observing
the imputed genotype vector. The number of iterations
performed defaults to 200, but may be changed with the
ibdoption command. In the case that all
individuals are typed, as in a simulated data set, there are
no missing genotypes to impute. There is no need to iterate,
so the number of iterations may be set to one. Since the
Davis and Weeks algorithm is quite fast, for completely-typed
data the Monte Carlo method will give the best performance.
For this reason, the Monte Carlo method is chosen
automatically to compute IBDs for completely-typed markers.
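The weighted averaging step described above can be sketched in a few lines of Python. This is an illustration only, not SOLAR's implementation: the function name weighted_mean_ibd and the layout of its input are hypothetical, and the sketch assumes the imputation and the Davis and Weeks recursion have already produced one likelihood and one set of pairwise IBDs per iteration.

```python
from collections import defaultdict


def weighted_mean_ibd(replicates):
    """Average pairwise IBDs over Monte Carlo replicates.

    `replicates` is an iterable of (likelihood, ibds) tuples, where `ibds`
    maps a pair of individual IDs to the IBD computed for one imputed
    genotype vector. The name and layout are assumptions made for this
    illustration; they are not SOLAR data structures.
    """
    weighted_sum = defaultdict(float)
    total_weight = 0.0
    for likelihood, ibds in replicates:
        # Each replicate is weighted by the likelihood of its imputed genotypes.
        total_weight += likelihood
        for pair, value in ibds.items():
            weighted_sum[pair] += likelihood * value
    if total_weight == 0.0:
        raise ValueError("all replicates have zero likelihood")
    return {pair: s / total_weight for pair, s in weighted_sum.items()}
```

With a single replicate the weight cancels out, which is consistent with the remark that one iteration suffices when every individual is typed.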
The ibdoption command lets you display or modify the options
in effect related to IBD and MIBD calculation. These options
are:
XLinked   select this option for X-linked marker data
NoMLE     if this option is chosen, MLE allele frequencies are not required for IBD calculation
MCarlo    if this option is chosen, the Monte Carlo method will be used to calculate IBDs
MibdWin   size (in cM) of the multipoint IBD window - the MIBDs at a given chromosome location depend only on markers inside or on the boundary of the window centered at that location
Use the command help ibdoption for more information.
During the IBD computation procedure, the ibd command displays
the number of pedigrees processed thus far. If a SOLAR script
which includes the ibd command is run as a background job, this
display may cause the job to hang or abort. To turn off the
display, use the verbosity min command.
Preparation of multipoint IBD (MIBD) files requires the following commands:
load pedigree pedigree-filename
load map map-filename
ibddir ibd-dirname
mibddir mibd-dirname
mibd relate                      ;# Not required
mibd merge                       ;# Not required
mibd means                       ;# Not required
mibd [from to] incr
As in the case of marker-specific IBD preparation, the order
of the commands is important. First the pedigree data is
loaded and then the map data. The marker-specific IBDs must
have already been computed and reside in the directory
specified by the ibddir command.
Since multipoint IBDs are computed on a per-chromosome basis,
the map file includes only the markers on a particular
chromosome. To compute MIBDs for more than one chromosome at
a time, you can create a Tcl script which performs the
steps listed above once for each chromosome. Of course, you
won't need to reload the pedigree data for each chromosome,
nor will it be necessary to run the mibd relate command more
than once, since the relative-class file depends only on
pedigree information, not marker or map data.
When computing MIBDs, SOLAR converts the distances between
pairs of markers to recombination fractions. By default,
SOLAR assumes the mapping function to be Kosambi. It is also
possible to use the Haldane mapping function. Use the command
help map for more information.
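For reference, the two mapping functions named above can be written out as follows. The sketch uses the standard textbook forms of the Kosambi and Haldane functions with the distance given in cM. The function names are ours, and nothing here is taken from SOLAR's source code.

```python
import math


def haldane_theta(distance_cm):
    """Recombination fraction for a map distance under Haldane's function."""
    d = distance_cm / 100.0  # centimorgans -> Morgans
    return 0.5 * (1.0 - math.exp(-2.0 * d))


def kosambi_theta(distance_cm):
    """Recombination fraction for a map distance under Kosambi's function."""
    d = distance_cm / 100.0  # centimorgans -> Morgans
    return 0.5 * math.tanh(2.0 * d)
```

Both functions return 0 at distance 0 and approach 0.5 as the distance grows. For the same distance, Kosambi gives a slightly larger recombination fraction than Haldane because it allows for interference.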
The directory where the MIBD files are to be created must be
specified with the mibddir command before the multipoint IBD
computation can be started. While it is perfectly legal to
store the MIBD files in the current working directory (you
would use the command mibddir .), you may find it more
convenient to create a subdirectory to hold the files.
The relative-class file is created with the mibd relate
command. Next, the marker-specific IBDs are merged into a
single file with the mibd merge command. The mibd means
command computes the mean IBD by relative-class. The last mibd
command listed above computes multipoint IBDs at incr cM
intervals from location from through location to. If only the
incr argument is given, multipoint IBDs are computed at incr cM
intervals from location 0 through the location of the last
marker in the map. For convenience, only the last of the mibd
commands need be given. The first three mibd commands
(mibd relate, mibd merge, mibd means) will be issued
automatically.
Since the identification of relative classes requires knowledge
of the family structure but not marker or map information, the
mibd relate command can be issued as soon as the pedigree data
has been loaded. The pedigree classes command can then be used
to display a tally of the relative classes present in the data
set. It is possible that the mibd relate command will
fail because an unknown relationship is encountered.
SOLAR cannot handle arbitrary relationships, relying instead
on information that has been worked out in advance for an
extensive set of relative classes. While the set of known
relationships has been expanding with new SOLAR releases,
occasionally a data set will contain relative classes not yet
handled by SOLAR. If your data set contains such a class, you
must contact the SOLAR developers for assistance.
NOTE: At present it is not possible to compute multipoint IBDs for X-linked markers.
During the MIBD computation procedure, the mibd command
displays the marker location currently being processed. If a
SOLAR script which includes the mibd command is run as a
background job, this display may cause the job to hang or
abort. To turn off the display, use the verbosity min command.
You are not limited to using the multipoint IBDs computed by SOLAR. A number of genetic analysis programs are available which may be used to compute multipoint IBDs. If your pedigree data contains relative classes not supported by SOLAR, or if you prefer to use exact multipoint IBDs rather than SOLAR's approximation, you can compute MIBDs with one of these programs and then use those MIBDs in your SOLAR multipoint analyses.
There are two important steps required in order to use another program to compute MIBDs suitable for use in SOLAR. First, the input files required by that program must be created. For a number of genetic analysis programs, this step can be performed using Mega2, which is available at:
http://watson.hgen.pitt.edu/mega2.html
The second step is to convert the output of the other program into SOLAR-ready MIBD files. (See Section 8.6 for a description of MIBD files.) Some genetic analysis programs may provide support for this step; for example, Loki has the capability to generate SOLAR-ready MIBD files directly. In general, however, some post-processing of the program output will be necessary. SOLAR provides a mechanism, described in Section 5.5, by which multipoint IBDs can be read from a comma-delimited file and used to generate SOLAR-ready MIBD files. This process is referred to as importing MIBDs. Thus, if a genetic analysis program outputs a file in the appropriate comma-delimited format, or if the program output is translated into this format, then the creation of MIBD files can be performed by SOLAR.
SOLAR currently provides direct support for computing
multipoint IBDs using SimWalk2, Loki, GeneHunter, and Merlin.
The input files needed to compute MIBDs are created by the
mibd prep command, while the resulting output files are
processed with the mibd import command.
For SimWalk2, the necessary commands are:
load pedigree pedigree-filename
load freq freq-filename          ;# Optional
load marker marker-filename
freq mle                         ;# Required if no load freq
load map map-filename
mibd prep simwalk
The following SimWalk2 input files will be created:
BATCH2.DAT        - control file
swmibd.map        - map file
swmibd.loc        - locus file
swmibd.ped        - pedigree/genotype data
mibdchr<chr>.loc  - map file for SOLAR plots
The file mibdchr<chr>.loc (where <chr> is the chromosome
number from the map file) should be moved to the directory where
the SimWalk2-computed multipoint IBDs will be stored.
SimWalk2 (version 2.91 or higher) can then be run to compute
the MIBDs. A number of output files will be generated by SimWalk2.
The files which contain the MIBD results needed by SOLAR are
named IBD-01.nnn, where nnn is the pedigree number, e.g. 001.
These files must be combined into a single output file, which
can be gzipped to save space if desired. This file can then be
processed by SOLAR using the commands:
mibddir mibd-dirname
mibd import simwalk chr -f output-filename
The chromosome number chr must be given so that SOLAR knows how to name the MIBD files it creates (see Section 8.6).
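The step of combining the per-pedigree SimWalk2 files can be done outside SOLAR. The Python sketch below is one possible way to do it, under two assumptions that are ours rather than the manual's: that "combined" means plain concatenation in pedigree-number order, and that the helper name combine_ibd_files and the output filename are free choices.

```python
import glob
import gzip
import os
import shutil


def combine_ibd_files(src_dir, out_path, pattern="IBD-01.*"):
    """Concatenate per-pedigree IBD files into one gzipped file.

    Files matching `pattern` in `src_dir` are appended in sorted name
    order, so IBD-01.001 comes before IBD-01.002. Simple concatenation
    is an assumption about what the combined file should look like.
    Returns the list of input files that were used.
    """
    inputs = sorted(glob.glob(os.path.join(src_dir, pattern)))
    if not inputs:
        raise FileNotFoundError(f"no files matching {pattern} in {src_dir}")
    with gzip.open(out_path, "wb") as out:
        for name in inputs:
            with open(name, "rb") as src:
                shutil.copyfileobj(src, out)
    return inputs
```

The resulting file would then be the output-filename passed to mibd import simwalk with the -f option. Choose an output name that does not itself match the IBD-01.* pattern.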
For Loki, the necessary commands are:
load pedigree pedigree-filename
load freq freq-filename          ;# Optional
load marker marker-filename
freq mle                         ;# Required if no load freq
load map map-filename
mibd prep loki
The following Loki input files will be created:
lkmibd.data       - pedigree/genotype data
lkmibd.prep       - prep parameter file
lkmibd.loki       - loki parameter file
mibdchr<chr>.loc  - map file for SOLAR plots
The file mibdchr<chr>.loc should be moved to the directory
where the Loki-computed multipoint IBDs will be stored. Loki
can then be run to compute the MIBDs. The MIBDs will be stored
in a gzipped output file named loki.ibd.gz. This file can be
processed by SOLAR using the commands:
mibddir mibd-dirname
mibd import loki
Unlike with SimWalk2, it is not necessary to give the chromosome number, because it is stored in one of the Loki input files.
For GeneHunter, the necessary commands are:
load pedigree pedigree-filename
load freq freq-filename          ;# Optional
load marker marker-filename
freq mle                         ;# Required if no load freq
load map map-filename
mibd prep genehunter
The following GeneHunter input files will be created:
ghmibd.cmd        - control file
ghmibd.loc        - locus file
ghmibd.ped        - pedigree/genotype data
mibdchr<chr>.loc  - map file for SOLAR plots
The file mibdchr<chr>.loc should be moved to the directory
where the GeneHunter-computed multipoint IBDs will be stored.
GeneHunter can then be run to compute the MIBDs. To use the
control file ghmibd.cmd, start GeneHunter and enter the
command run ghmibd.cmd. The MIBDs will be stored in an output
file named ghmibd.ibd. This file can be processed by SOLAR
using the commands:
mibddir mibd-dirname
mibd import genehunter chr
The chromosome number chr must be given so that SOLAR knows how to name the MIBD files it creates (see Section 8.6).
For Merlin, the necessary commands are:
load pedigree pedigree-filename
load freq freq-filename          ;# Optional
load marker marker-filename
freq mle                         ;# Required if no load freq
load map map-filename
mibd prep merlin
The following Merlin input files will be created:
mlmibd.cmd        - Merlin IBD command
mlmibd.dat        - data description file
mlmibd.ped        - pedigree/genotype data
mlmibd.frq        - allele frequency file
mlmibd.map        - map file
mibdchr<chr>.loc  - map file for SOLAR plots
The file mibdchr<chr>.loc should be moved to the directory
where the Merlin-computed multipoint IBDs will be stored.
Merlin can then be run to compute the multipoint IBDs by
entering the Unix command contained in the file mlmibd.cmd.
The MIBDs will be stored in an output file named merlin.ibd.
This file can be processed by SOLAR using the commands:
mibddir mibd-dirname
mibd import merlin chr
The chromosome number chr must be given so that
SOLAR knows how to name the MIBD files it creates (see
Section 8.6).
5.5 Importing and Exporting IBDs and MIBDs
We have noted that it is OK to use externally-computed
multipoint IBDs, i.e. MIBDs computed by another genetic
analysis program, in a SOLAR multipoint analysis. It is also
OK to use externally-computed marker-specific IBDs in a
twopoint analysis. However, the use of IBDs/MIBDs computed by
a program other than SOLAR is complicated by two issues. The
first issue, the format of the IBD/MIBD files, is trivial. The
more difficult problem is that SOLAR's IBD/MIBD files are keyed
by the indexed IDs (IBDIDs) assigned by SOLAR rather than the
IDs from the pedigree file.
To facilitate the use of externally-computed IBDs and MIBDs,
SOLAR provides an import feature for the ibd and mibd commands.
The ibd import command reads marker-specific IBDs, keyed by
real IDs, from a comma-delimited file, and creates a
SOLAR-formatted IBD file keyed by IBDIDs. See Section 4.1 for
a description of the format of comma-delimited files. The
comma-delimited input file must contain at least the following
fields: ID1, ID2, and IBD. If the pedigree file contains a
family ID (FAMID) field, then a FAMID field must appear in the
input file as well. The input file can contain IBDs for more
than one marker provided a MARKER field is present which
contains the marker names. An input file which contains IBDs
for more than one marker must be sorted on the MARKER field.
If the input file contains a D7 field, the d7 values from that
field will be included in the IBD file(s) that are created.
See Section 8.5 for a description of d7 coefficients.
Following are the first few lines from an example input file:
MARKER,FAMID,ID1,ID2,IBD,D7
D1S53,Smith,John,John,1,1
D1S53,Smith,John,Karen,0.500135,0.07112
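Before running ibd import, it can be useful to check a comma-delimited file against the field requirements listed above. The Python sketch below does this for text already read into memory. The function name check_ibd_import is hypothetical, and the sketch assumes that "sorted on the MARKER field" can be checked by confirming that all rows for a given marker are contiguous. It does not attempt to guess the collating order SOLAR expects.

```python
import csv
import io


def check_ibd_import(text, has_famid=False):
    """Return a list of problems found in a comma-delimited IBD import file.

    `has_famid` should be True when the pedigree file has a FAMID field.
    An empty list means no problems were detected by these checks.
    """
    reader = csv.DictReader(io.StringIO(text))
    fields = reader.fieldnames or []
    required = ["ID1", "ID2", "IBD"] + (["FAMID"] if has_famid else [])
    problems = [f"missing field {name}" for name in required if name not in fields]
    if "MARKER" in fields:
        finished = set()  # markers whose block of rows has already ended
        current = None
        for row in reader:
            marker = row["MARKER"]
            if marker != current:
                if marker in finished:
                    problems.append(f"rows for marker {marker} are not contiguous")
                if current is not None:
                    finished.add(current)
                current = marker
    return problems
```

For the example file shown above, check_ibd_import(text, has_famid=True) reports no problems.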
The mibd import command reads multipoint IBDs, keyed by real
IDs, from a comma-delimited file, and creates SOLAR-formatted
MIBD files keyed by IBDIDs. The comma-delimited input file
must contain at least the following fields: CHROMO, LOCATION,
ID1, ID2, and IBD. If the pedigree file contains a family ID
(FAMID) field, then a FAMID field must appear in the input file
as well. The CHROMO field contains the chromosome numbers,
while the LOCATION field contains the chromosomal locations in
cM. An input file which contains MIBDs for more than one
chromosomal location must be sorted on the CHROMO field first
and then on the LOCATION field. If the input file contains a
D7 field, the d7 values from that field will be included in the
MIBD file(s) that are created. Following are the first few
lines from an example input file:
CHROMO,LOCATION,ID1,ID2,IBD
6,93,A0457,A1082,0.369
6,93,A0457,A0119,0.14576
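The required ordering (CHROMO first, then LOCATION) can be produced with a short script before running mibd import. The Python sketch below works on text already read into memory. It is an illustration with a hypothetical function name, and it assumes that CHROMO values are integers and that a numeric sort is what is wanted.

```python
import csv
import io


def sort_mibd_rows(text):
    """Sort comma-delimited MIBD rows by CHROMO, then by LOCATION.

    Assumes integer chromosome numbers and numeric locations in cM.
    The sort is stable, so rows for the same location keep their
    original relative order. All columns are passed through unchanged.
    """
    reader = csv.DictReader(io.StringIO(text))
    fields = reader.fieldnames
    rows = sorted(reader, key=lambda r: (int(r["CHROMO"]), float(r["LOCATION"])))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```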
When importing IBDs (and MIBDs), SOLAR assumes that the input file contains an entry for every pair of individuals whose IBD value is non-zero. Any pair of individuals who do not have an entry in the input file will be assumed to have IBD = 0. However, in the case that there are no entries in the input file for the pairs of individuals in a particular pedigree, an IBD value of -1 is assigned to the main diagonal entries in the IBD matrix for that pedigree. In a linkage analysis, all pairs of individuals in that pedigree will be treated as having ibd = phi2, the expected IBD allele-sharing at a locus chosen at random.
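The rules in the preceding paragraph can be restated as a small lookup function. This Python sketch models only the documented behavior as seen by a linkage analysis. It is not SOLAR code, and the function name, the use of frozenset keys, and the phi2 lookup table are all assumptions made for illustration.

```python
def effective_ibd(entries, pedigree_of, phi2, id1, id2):
    """Return the IBD value a linkage analysis would use for a pair.

    entries     -- imported IBDs, keyed by frozenset of the two IDs
    pedigree_of -- maps each individual ID to its pedigree
    phi2        -- expected IBD sharing, keyed like `entries`
    """
    pair = frozenset((id1, id2))
    if pair in entries:
        return entries[pair]
    ped = pedigree_of[id1]
    if pedigree_of[id2] != ped:
        return 0.0  # individuals in different pedigrees are unrelated
    # Does the input file have any entry at all for this pedigree?
    ped_has_entries = any(
        all(pedigree_of[member] == ped for member in key) for key in entries
    )
    if not ped_has_entries:
        return phi2[pair]  # pedigree absent from the file: fall back to phi2
    return 0.0  # pair missing from a pedigree that has entries: IBD = 0
```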
SOLAR also provides an export feature for the ibd and mibd
commands. The ibd export command writes the IBDs and d7
coefficients from one or more marker-specific IBD files to a
comma-delimited file. A MARKER field is included which
contains the marker name(s). The IBDIDs in the SOLAR IBD file
are translated to real IDs, with a FAMID field added if family
IDs are present in the pedigree file. The mibd export command
writes the IBDs and d7 coefficients at every chromosomal
location (for which a SOLAR MIBD file exists) on one or more
chromosomes to a comma-delimited file. A CHROMO field is
included to identify the chromosome(s), and a LOCATION field is
included which gives the chromosomal locations in cM.
Since the exported IBD (MIBD) files can easily become quite large,
it may be a good idea to limit the number of markers (chromosomes)
exported to any one file. The mibd export command has a -byloc
option, which specifies that a separate export file is to be
created for each chromosomal
location. This feature can be handy in the event you are planning
to merge files containing MIBDs exported from different pedigrees,
as, for example, in the situations described in the following
paragraphs. Before MIBDs can be imported from the merged files,
the files must be sorted on chromosomal location. Sorting is not
necessary, however, if each merged file contains MIBDs for a single
location.
The ability to export and import IBDs/MIBDs is useful whenever the
SOLAR indexing must be modified. An obvious example is
when one or more pedigrees in a data set are altered after
IBDs have been computed. The indexed IDs (IBDIDs) that are
assigned to individuals in the altered pedigrees, as well as
those of individuals in following pedigrees, will likely be
changed. This makes the existing SOLAR IBD files, which are
keyed by IBDIDs, unusable. In this case, you can
export the existing IBDs to a comma-delimited file. The IBDs
for pairs of individuals from the pedigrees to be altered
should be deleted from this file because those IBDs may
change. Next, load the altered pedigrees into SOLAR, compute
new IBDs for those pedigrees, and export the new IBDs, appending
them to the comma-delimited file. Make sure the comma-delimited
file is properly sorted -- by marker in the case of IBDs or by
chromosomal location in the case of MIBDs. Finally, load the
entire pedigree file, including the altered pedigrees, into
SOLAR to generate the new indexing for the data set, and
import the IBDs from the comma-delimited file.
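The file editing in this procedure (delete the rows for the altered pedigrees, append the newly exported rows, and re-sort) can be scripted. The Python sketch below handles the marker-specific case. It is a hypothetical helper, and it assumes that both exported files have the same columns, that a FAMID field is present to identify the altered pedigrees, and that sorting on the MARKER text is sufficient.

```python
import csv
import io


def replace_pedigrees(old_text, new_text, altered_famids):
    """Rebuild an exported IBD file after some pedigrees have changed.

    Rows whose FAMID is in `altered_famids` are dropped from `old_text`,
    the rows in `new_text` are appended, and the result is sorted on
    MARKER. The sort is stable, so rows within a marker keep their order.
    """
    old_reader = csv.DictReader(io.StringIO(old_text))
    fields = old_reader.fieldnames
    kept = [row for row in old_reader if row["FAMID"] not in altered_famids]
    new_rows = list(csv.DictReader(io.StringIO(new_text)))
    rows = sorted(kept + new_rows, key=lambda r: r["MARKER"])
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

For MIBDs the same approach applies, with the sort key changed to the CHROMO and LOCATION fields.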
Another situation in which it is necessary to export and import IBDs (or MIBDs) is when there are two or more subgroups within your data set for which the marker allele frequencies are different. In order to use the correct allele frequencies when computing IBDs, you must load each subgroup into SOLAR separately, compute the IBDs using the frequency data for that subgroup, and export the IBDs to a comma-delimited file. Merge the exported IBD files, making sure the merged file is sorted appropriately. Finally, load the entire data set into SOLAR, and import the IBDs from the merged file.