SOLAR has four major high level genetic modeling commands:
Command      Modeling Performed
-------      ------------------
polygenic    Quantitative Genetic (and/or covariate screening)
twopoint     Twopoint
multipoint   Multipoint
bayesavg     Bayesian Model Averaging (covariates or linkage elements)
These commands (which are actually SOLAR/Tcl scripts)
create, maximize, and compare many models automatically.
Modeling can also be done more manually using the elementary
SOLAR commands. The basic command underlying all the large
analysis scripts is maximize, which adjusts model parameter
values to achieve the maximum likelihood.
In rare cases, SOLAR may be unable to maximize models for which the data are barely adequate, or the curvature of the likelihood surface exceeds the numerical limits. The loglikelihood and/or the parameter values may go to illegal numerical values. This is called Convergence Failure. Techniques for overcoming convergence failure are discussed below in Section 6.8.
Quantitative Genetic or Polygenic analysis can be done using
the polygenic command. First, you must load the pedigree and
phenotypes files (if you have not done so already), specify
your trait, and specify your covariates. Then you can run
polygenic. (If you have created any other models before, you
should give the model new command first.)
solar> example
solar> load pedigree gaw10.ped
solar> load phenotypes gaw10.phen
solar> model new
solar> trait q4
solar> covariate sex age age*sex age^2 age^2*sex
solar> polygenic
Covariates may include sex (found in the pedigree file)
and any of the variables included in the phenotypes file.
Variables may be raised to any power using the ^ operator
(for example, age^2 is age squared).
Interactions are specified using the *
operator (so age*sex
is the interaction of
age and sex, usually called age by sex). You may list
all the covariates in one command (as above) and/or use
several commands; the list of covariates is cumulative. If
you simply want to include ALL the variables (other than the
trait) in the phenotypes file as covariates, you may use the
automodel
or allcovar
commands (which will also include all the age and sex
interactions and squares as shown above). In the covariate
command, the pound sign (#) is a shortcut for both variables
and their interactions, and commas can be used to separate
multiple exponents. So the above list of 5 covariates could
be abbreviated like this:
covariate age^1,2#sex
The polygenic command will calculate H2r (the polygenic
heritability), the significance of H2r, and the
proportion of variance caused by the covariates. The
coefficients of the covariates are the values of the
parameters in the model itself. You can display the entire
current model with the model command, or just the parameters
with the parameters command. If we wanted to see the value of
just one parameter, bage (the beta coefficient or parameter
for age), we could use the command:
solar> parameter bage =
0.006262359675
To see the calculated standard error for this parameter, we
use the se selector:

solar> parameter bage se
0.002376826451
A popular option, covariate screening, can be added to this
with the -screen argument:

solar> polygenic -screen

(-screen may be abbreviated as -s)
Covariate screening will determine the statistical
significance of each covariate, and remove it from the
final model if it falls below a specified level. By default,
the significance criterion level is 0.1 (chosen to retain
covariates unless they are unlikely to be important). The
significance criterion can be changed by using the -prob
argument:
solar> polygenic -screen -prob 0.05
If you would like to determine the significance of each
covariate, but would not like to remove any of them
from the final model, you can force all the covariates into
the final model with the -all
argument:
solar> model new
solar> trait q4
solar> covar age^1,2#sex
solar> polygenic -screen -all
You can also selectively force covariates into the final model
using one or more -fix (or -f) arguments. Each one is followed
by the name of one covariate:
solar> polygenic -s -fix sex -fix age
When completed, polygenic will provide a summary of the results, which will look something like this:
Pedigree:    gaw10.ped
Phenotypes:  gaw10.phen
Trait:       q4          Individuals:  1000

H2r is 0.5502247  p = 2.8143773e-29  (Significant)
H2r Std. Error:  0.0573958

      age  p = 0.0085273  (Significant)
      sex  p = 0.2913649  (Not Sig., but fixed)
  age*sex  p = 0.2032678  (Not Significant)
    age^2  p = 0.2947688  (Not Significant)
age^2*sex  p = 0.5929531  (Not Significant)

The following covariates were removed from final models:
    age*sex  age^2  age^2*sex

Proportion of Variance Due to All Final Covariates Is  0.0069282

Output files and models are in directory q4/
Summary results are in q4/polygenic.out
Loglikelihoods and chi's are in q4/polygenic.logs.out
Best model is named poly and null0 (currently loaded)
Final models are named poly, spor, nocovar
Constrained covariate models are named no<covariate>

Residual Kurtosis is -0.0538, within normal range
Starting with SOLAR version 2.0.5, a Residual Kurtosis test
is performed. The residuals of the trait
(after the covariate influences have been removed) are
written to a file named polygenic.residuals
and then the
kurtosis is computed. If there are no covariates, only
the kurtosis of the trait itself is tested. If the Residual
Kurtosis exceeds 0.8, you are given a warning. In that case,
you are advised to see Note 5 in the help message for polygenic
.
High kurtosis indicates a trait distribution not well suited
for SOLAR linkage analysis, because unrealistically high LOD
scores might be produced. However, you can mitigate the effects
of high kurtosis using the tdist or lodadj commands, discussed
below.
Output files will be written to a subdirectory named by the
trait or outdir commands. They will include a file named
polygenic.out, which is like the summary shown above. There
will also be a file named polygenic.logs.out which will show
all the loglikelihoods used in the calculations. There will
also be many models, including the final model, which will be
named poly.mod. For convenience, a copy of that model named
null0.mod will be made for use in twopoint and multipoint
linkage scans. The final polygenic model will be left in
memory at the conclusion. If screening was done, it will
include only the covariates which met the significance
criterion or were fixed by the user. The information summary
shown above mentions the important files and models written.
A twopoint scan is done by the twopoint command (itself a
SOLAR/Tcl script), which will scan all the markers currently
in the ibddir. The LOD score and H2q1 (the proportion of
variance associated with a linkage element) will be reported
for each marker. The report is displayed on your terminal and
also written to the file twopoint.out. If you later get
additional markers, you may append them to the previous
twopoint.out file using the -append option.
Twopoint scanning requires a model named null0 in the output
directory. The usual way to create this is with the polygenic
command. If you are appending new markers to an existing
twopoint.out, you must be sure you are using the same null
model. If not, you should use the -overwrite option instead
of the -append option, and scan the markers all over again.
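For orientation, a minimal twopoint session might look like
the following sketch. It reuses the example data loaded
earlier; the ibddir name gaw10ibd is hypothetical, standing in
for wherever your IBD files were prepared:

solar> load pedigree gaw10.ped
solar> load phenotypes gaw10.phen
solar> trait q4
solar> covariate age
solar> polygenic
solar> ibddir gaw10ibd
solar> twopoint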
Twopoint scanning can be done with custom parameterizations. See Section 9.5 for a discussion. One special case of this is the analysis of dominance, which is described in Section 9.4.
The multipoint
command will scan the selected chromosomes (or the whole
genome) for the highest scoring quantitative trait locus
(QTL). If a criterion LOD score is specified,
and that criterion is met by the highest LOD score,
multipoint will lock in the highest QTL
and perform another scan to find another QTL. An example was
given in Chapter 3. If no criterion is specified,
multipoint
will simply make one pass to
find the best single QTL and stop.
Multipoint requires a model named null0 in the output
directory, and the usual way to create this is with the
polygenic command. It also requires MIBD files (whose
preparation is described in Section 5.3).
Before giving the multipoint command, the following other commands are also required:
mibddir      select multipoint IBD directory
chromosome   specify chromosome(s)
interval     select scan interval and range
The following commands are OPTIONAL:
finemap      establish LOD threshold for finemapping (defaults to 0.588)
Here is another example multipoint scan:
mibddir gaw10mibd
chromosome 6-10 12 14-23
interval 5
multipoint 3 1.9
In this example, multipoint will perform one pass; then, if the LOD for the best QTL is greater than 3, it will make another pass. After the second pass, it will continue making passes as long as the LOD for the best QTL reaches 1.9 or higher.
As it is proceeding, multipoint
will show
the LOD score and H2q1
for each locus
tested. When it is done, it will show the highest scoring
locus for the last pass, and the QTL's included in the final model.
Summary output is written to a file named
multipoint.out
. This simply summarizes
the loci that were scanned in each pass, and the QTL's
included in the final model. The detailed results for each
pass are written to a file named
multipoint<pass>.out
, for example,
multipoint1.out
for the first pass,
multipoint2.out
for the second pass, and
so on. The best model containing all loci that met the
specified criteria is saved in link.mod
.
The best model for each pass is also saved as the null model
for the next pass, named null<pass>, for example, null1 is the
model with the best QTL from pass 1, null2 is the model with
the best QTL from pass 2 along with the best QTL from pass 1,
and so on.
The Bayesian Model Averaging method identifies important
elements (either linkage elements or covariates) by computing
a statistic called the Bayesian Information Criterion (BIC)
for a model containing each possible combination (or set) of
those elements, and then performing additional statistical
tests on the models having the highest BICs and the elements
they contain. The SOLAR command which uses this method is
bayesavg.
bayesavg has been modified recently. Rather than computing
the effective log(n) by maximizing a fully saturated model
with all the candidate elements, an estimated log(n) derived
from the number of elements is used. Later, after the window
of important models has been identified, an exact log(n) is
computed from the model with the best BIC. Not only is this
approach more accurate, it bypasses the step of having to
maximize a fully saturated model. It
does, however, require rewriting the main output file to
correct the BIC values. During scanning, the main output file
is bayesavg.est (or bayesavg_cov.est for covariate analysis).
Later this file is copied to a file named bayesavg.nose (or
bayesavg_cov.nose) which uses the final log(n) but does not
have standard error estimates. Finally, the file bayesavg.out
(or bayesavg_cov.out) is written, which has standard error
estimates for all the models in the window (unless the -nose
option has been specified, in which case the .nose file is
simply renamed to .out). A file named bayesavg.avg (or
bayesavg_cov.avg) has the final Bayesian model averages.
One way of starting bayesavg assumes that you have already
created a model with all the elements in it. This would be
most likely in the case of linkage elements. After having run
a multipoint scan finding several important linkage elements,
you could immediately proceed to bayesavg. The default mode
is linkage model averaging:
solar> mibddir gaw10mibd
solar> model new
solar> trait q4
solar> covar age
solar> polygenic
solar> multipoint -ov 1
solar> bayesavg
It's also possible to use bayesavg with a starting model
containing a slew of covariates. However, there are also
alternatives where you start with a base model containing
only fixed features, to which you want to add a set of
variable features. Perhaps the most useful option for this is
the -list option, where you specify a file containing all the
covariates (or linkage elements) you want tested. The file
contains the name of each item on a separate line. When you
specify a list, the covariates already in the model become
fixed elements.
model new
trait q4
covar age
polygenic
bayesavg -cov -list testcovars.lst
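The contents of testcovars.lst are simply the candidate names,
one per line. For example, a hypothetical list of three
candidate covariates (using phenotype names that appear in a
later example) might be:

height
weight
BMI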
SOLAR QTN analysis is based on Bayesian covariate model
averaging. Each candidate single nucleotide polymorphism
(SNP) is coded as a covariate in a phenotypes file. The -qtn
option has all the features suitable for this application.
It automatically invokes the -cov and -stop options. The
-stop option stops the analysis when no models of the current
size (that is, degrees of freedom) have been included in the
window. When this happens, it is impossible for any models of
larger size to be included in the window either. (This option
will probably become the default in a future release.) This
is very important for QTN analysis because otherwise the set
of all possible models could become astronomically huge.
Also, the -qtn option includes all covariates with the prefix
snp_ as variable elements, and all the others as fixed
elements. Thus, you can add all the regular covariates and
the SNPs to the starting model (which need not be maximized).
Also, a special window file is produced named bayesavg_cov.win
which contains a matrix of the SNPs included in each window
model.
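As a sketch of such a run (the snp_ covariate names here are
hypothetical, and the phenotypes file containing them is
assumed to be loaded already):

solar> model new
solar> trait q4
solar> covar age sex snp_1 snp_2 snp_3
solar> bayesavg -qtn

Here age and sex would be treated as fixed elements, while
snp_1, snp_2, and snp_3 would be the variable elements.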
Before using bayesavg on a large set of SNPs, screening
should be done using qtnm and plotqtn to eliminate redundant
SNPs. Once again, this is done to help prevent the number of
models that need to be tested from becoming astronomically
huge.
When there are going to be more than 20 SNPs tested (which
would correspond to 1,048,575 models with 1 or more elements
in them), it is advisable to use the -max option, because the
amount of storage necessary to keep all the possible
combinations in computer memory (RAM) will start becoming a
significant problem. SOLAR might crash when starting the
analysis if your computer does not have sufficient memory.
Generally, it will not be necessary to run through all the
possible degrees of freedom anyway. In most cases, -max 10 or
less is sufficient. (Use a progressively smaller number as
the number of SNPs increases.) When using -max, however, be
sure that the BIC has started to decline by the last degree
of freedom actually tested; otherwise a larger -max may be
required.
If you are using the standard SOLAR parameterization, but the
polygenic script does not perform exactly the quantitative
genetic analysis you require, you can probably do it at a
lower level by using the maximize command to maximize models
you have set up exactly as you want using the parameter,
constraint, omega, and mu commands. This is less necessary
now than it has been in the past, because polygenic can now
handle constrained covariate beta parameters, which
previously was one of its more glaring limitations. (See
Note 4 in the documentation for polygenic for further
details.)
The direct use of the maximize command was introduced in
Section 3.5 of the Tutorial (Chapter 3). If you need to use a
custom parameterization, that is discussed in Section 9.5.
Similar ideas can be employed to do a custom linkage analysis.
Underlying all modeling commands in SOLAR is the maximize
command, which will maximize the loglikelihood of the model
by adjusting the values of the parameters, such as the
variance components and covariate betas. In addition, the
standard errors for each parameter will be computed, though
this is a time consuming computation which may be switched
off with the standerr option:

option standerr 0
If there are covariates, it is usually best to maximize a
sporadic model (having no genetic elements) first
because the determination of covariate betas can then be done
analytically. Then you can maximize the corresponding
polygenic model. If you are using the standard
parameters, you can use the spormod command to set up the
variance parameters, constraints, and omega as required to
make a model sporadic, and polymod to set up the parameters,
constraints, and omega as required to make a model polygenic.
You can call these commands one after another, in either
order, and as many times as you like (at least that was the
design intent). There is also a linkmod command which sets up
the parameters, constraints, and omega as required for a
linkage model. It is similar to the other commands, but it
requires that a polygenic model has been set up first. After
linkmod, you could go back to a polygenic model with the
polymod command.
If you have been working with other models in the same
session, it is safest to wipe the slate clean with the
model new
command.
solar> model new
solar> trait q4
solar> covariate age
solar> spormod
solar> maximize
solar> polymod
solar> maximize poly.out
solar> save model q4/null0
Illustrated above is what the polygenic command does in its
most minimal form, without running any extra models to compute
statistics, etc. The most important thing is that a null0
model has been created to use as the null in linkage analysis.
To make it easy to compare different models, SOLAR has
commands such as loglike (return the natural log likelihood
of the most recently maximized model), chi (compute the
probability of rejecting the null hypothesis given a
chi-square value and its degrees of freedom), and clod
(compute the LOD score given two loglikelihoods). The
following example will compute the p value for the covariate
age in a polygenic model (similar to what is done by the
polygenic -screen command). Notice that chi here returns a
string such as "p = 0.0153740". The Tcl set command is used
to save intermediate results in variables, and the Tcl expr
command is used to evaluate mathematical expressions.
solar> model new
solar> trait q4
solar> covariate age
solar> spormod
solar> maximize
solar> polymod
solar> maximize
solar> set with_age [loglike]
solar> parameter bage = 0
solar> constraint bage = 0
solar> maximize
solar> set without_age [loglike]
solar> set chisq [expr 2 * ($with_age - $without_age)]
solar> set deg_freedom 1
solar> chi $chisq $deg_freedom
p = 0.0153740
In the next example, the p value of H2r in a polygenic model
will be computed, similar to the way it is done by the
polygenic command. In this case, we will use the chi command
with the -number option to return a number, so that it can be
divided by two (which is done because H2r is fixed to a
boundary).
solar> spormod
solar> maximize
solar> set spor_ll [loglike]
solar> polymod
solar> maximize
solar> set poly_ll [loglike]
solar> set chisq [expr 2 * ($poly_ll - $spor_ll)]
solar> set p2 [chi -number $chisq 1]
solar> set p [expr $p2 / 2]
solar> puts "p = $p"
p = 3.6566122e-29
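The clod command mentioned above can be used in the same
style. A hypothetical continuation of this session, assuming
clod takes the two loglikelihoods with the null model's first:

solar> clod $spor_ll $poly_ll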
After a model has been maximized, all the parameter values in
memory are updated. The current value of a particular
parameter can be displayed with the parameter command, or the
entire model can be displayed with the model command. Models
can also be saved with the save model command, and loaded
with the load model command.

6.8 Techniques for Resolving Convergence Failure
The optimization algorithms used by SOLAR are very flexible
and robust, and do not strictly require trait or covariate
values to follow a multivariate normal distribution (though
one would expect SOLAR to operate best when all the data are
normally distributed and there is plenty of them). SOLAR
includes a number of heuristics based on many years of
experience with data that has not always been entirely
adequate. However, despite all this, there is no guarantee of
success. SOLAR maximization will sometimes be unable to find
the optimal parameter values. This is called Convergence
Failure.
Convergence Failure happens most frequently with linkage
models. In the scan output files produced by the multipoint
command, such as multipoint1.out, there is a final column
(with no header) which gives an error abbreviation, if
applicable. This is where you may see the abbreviation
ConvrgErr. Convergence errors, like lousy weather, should not
stop SOLAR from scanning the genome, at least once. If
convergence errors have occurred during one scan, however,
additional scans conditioned on that scan will not be
performed, and a warning message will be displayed. When you
plot the results, points having convergence errors will be
shown with an asterisk symbol. (The symbol is placed at an
interpolated position on the LOD curve; however, you must
remember that the LOD at any point where convergence failure
occurred is actually unknown.) Unfortunately, convergence
failure is probably more likely to occur at interesting
points, such as peaks having the highest LOD score. However,
sometimes convergence failure occurs right next to zero LOD
scores on both sides. Such convergence failures can probably
be ignored: even if they are hiding single point peaks, those
peaks probably reflect random noise rather than an underlying
biological process. So, the first thing one can do about SOME
convergence errors is: (1) When convergence errors occur at
unimportant QTL's (surrounded by zero LOD's on both sides),
they can probably be ignored.

6.8.1 Get Better Data
(2) When possible, one of the most effective ways to solve
convergence failures is with better data.
Better data can be better in many ways. It can simply
be a larger data set. It can also be a data set in which
outliers have been removed. SOLAR does not
automatically remove outliers. It can also be a better-cleaned
data set, in which Mendelian errors have been removed. The
cleaning of genetic data should be considered a part of every
project based on genetic data. Statistical genetics programs,
SOLAR included, are very sensitive to and can be thrown off
by even a small number of errors. Here at SFBR, data cleaning
is a major process involving many people, and some staff are
dedicated to it. SOLAR already checks for Mendelian errors,
but it does not necessarily detect all errors. Another
program which is very useful for finding data errors is
Simwalk.
6.8.2 Transform Data

Often it is possible to avoid convergence failure by
transforming the data in some way. The ideal data for
SOLAR would have a multivariate normal distribution,
with a mean of zero and a standard deviation of 1.0. Usually
larger values for the standard deviation are OK but frequently
smaller values cause problems. In particular, a trait
standard deviation smaller than 0.5 has often been found to
cause problems, so SOLAR now has a warning for that. One
simple and often useful change is to multiply the trait
values by some constant to overcome a small standard
deviation. Other traditional transformations, such
as taking the natural log or square root of each value, may
also be useful. Typically one looks at a histogram of the
data and/or gets some statistics such as the standard
deviation and kurtosis. (SOLAR has a command stats to get the
statistics, including kurtosis, for any variable.) Then one
decides upon a suitable transformation. Kurtosis less than
0.8 (where zero is defined as the value for a standard normal
distribution) is highly desirable. Users of Pedsys can use
the tally program. Other usable programs include Gauss and
Excel. The transformation of data prior to statistical
analysis is discussed in many textbooks, including Genetics
and Analysis of Quantitative Traits by Lynch and Walsh
(1998, Sinauer Associates, Inc.).
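For instance, one might inspect the example trait used
earlier, in a sketch that assumes stats takes the variable
name as its argument:

solar> load phenotypes gaw10.phen
solar> stats q4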
NOTE: The data transformations described below are intended
for use with quantitative traits only. The things you can
do with discrete traits are limited and not technically
rigorous. However, sometimes you can do exploratory research
by pretending your discrete trait is a poorly distributed
quantitative trait. That can be accomplished by giving the
command

option EnableDiscrete 0

before running polygenic.
6.8.2.1 zscore
SOLAR has a command zscore to perform a simple transformation
of the trait so that at least the Mean and SD are near
optimal values. The formula is:
zscore = (value - Mean) / SD
To use the zscore command, give the command zscore after
specifying your trait. Then run polygenic and multipoint as
usual. Be sure to read the documentation for zscore.
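A minimal sketch of that ordering, reusing the example trait
from above:

solar> trait q4
solar> zscore
solar> polygenic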
Inverse normalization is an important new feature added to SOLAR version 4. When a variable (trait or covariate) is inverse normalized, it resembles a normally distributed variable, yet retains much of the original information as far as a linkage analysis is concerned. As with zscore, the variable will have a mean near 0.0 and a standard deviation near 1.0, but it will also have a distribution that approximates normal.
SOLAR provides two ways to inverse normalize. The most
convenient is to inverse normalize on-the-fly with the
define
command, for example:
define iapoe = inormal_apoe
trait iapoe
The prefix "inormal_" (the underscore is mandatory) causes the prefixed variable to be inverse normalized during maximization. Note that if the phenotype name includes special characters, the entire term must be included in angle brackets, for example:
define iapoe = <inormal_apoe-e4>
The define command can also be used with the simple mathematical operators (+ - * /) and with the math functions from C++ (such as log, log10, sqrt, and exp) which are listed in the help documentation for the define command. Multiple phenotypes may be included in a single define.
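For instance, these features might be combined as in the
following sketch; the defined name lnbmi and the arithmetic
are hypothetical, assuming the operator and function syntax
described in the define documentation, and height and weight
are phenotypes from the example below:

define lnbmi = log(weight / (height * height))
trait lnbmi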
The other way to inverse normalize is using the inormal
command,
which allows you to create a new file with the inormalized
variable for use with SOLAR or other programs.
Another approach is to residualize the trait. To do this, you
start the analysis by doing a polygenic analysis of your
trait and covariates. After the polygenic command, run the
residual command.
This will produce a file named
residual.out
with a residualized trait now
called residual
. Now load the file
residual.out as a phenotypes file, and select the variable
residual
as your trait. Now you do not
need to select any covariate, as all covariate effects are
already incorporated into the residual value, at least as
they appeared in the polygenic model.
trait apoe
covar age sex age*sex height weight BMI
polygenic
residual
load phenotypes residual.out
model new
trait residual
polygenic
...
multipoint
Models using the standard parameterization set up by SOLAR automatically have the parameter boundaries managed by SOLAR. This is done differently for the variance component parameters and the covariate beta parameters.
The boundaries of covariate beta parameters are almost always
handled quite well by SOLAR. In most cases they can be left
alone. If uninitialized, the parameter boundaries are estimated
and set in one of the earliest phases of the
maximize
command. The initial boundaries
are based on the maximum and minimum values of the covariate
variable and the trait. The covariate boundaries are
initially set so that the covariate variable could explain
twice the variation of the trait variable. (This default
ratio is controlled by the option AutoCovarBound.)
If the covariate boundaries are too narrow, they are
automatically increased by covboundincr each time, up to a
maximum of covboundretries times. You can adjust the option
AutoCovarBound using the option command, and adjust the other
factors using the covboundincr and covboundretries commands.
But generally you will not need to mess with these (I have
never known anyone to try), unless you are having trouble
with your sporadic or polygenic models.
SOLAR already has fairly complicated automatic management of
variance component parameter boundaries based on our
experience. However, it is not necessarily perfect, and may
require some adjustment to handle the most difficult
convergence situations. The basic command to use is boundary,
and you should read the documentation for the boundary
command as well as the documentation under boundary-notes.
You can browse that documentation at any time with the help
command.
The standard variance component parameters of SOLAR (such as
e2 and h2r) have natural boundaries of 0 and 1. But allowing
the parameters
this entire range frequently causes convergence failure, so
artificial boundaries for each parameter are positioned
and moved within that range. At the "end" of the Fortran
maximization algorithm, if any of the variance component
parameters has hit its boundary, the hit boundary is moved and
then the maximization algorithm is called again. On the other
hand, if there is a convergence failure, boundaries around all the
variance components are pulled-in (crunched), and
maximization is tried again. If boundaries have already been
crunched and there is still a convergence failure, a second level
of more intense crunching is done. During
multipoint
scanning, there are additional
types of retries which may be tried. Ultimately the
effect of all retries will either be a successful convergence,
which is what usually happens, or a convergence failure.
Prior to running multipoint
, you can give
boundary
commands to alter the way the
automatic boundary management is done. For example, the
command boundary wide start can be given to use the wide
natural boundaries whenever possible (unless an actual
convergence failure occurs anyway). For more
examples and discussion, see the documentation for the
boundary command.
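A sketch of where such a command fits, assuming the polygenic
null model has already been created as described earlier:

solar> boundary wide start
solar> multipoint -ov 3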
The Fortran maximization algorithm Search.f also has a
number of convergence settings which can be controlled through
the
option
command. Generally most users
should not be touching these settings. Improper adjustment of
the convergence settings could lead to invalid results.
There are four basic settings available, named after the
Fortran variables they control: MaxStep, Conv, NConv, and
Tol. It has proven useful (starting with version 2.0.4) to
use a larger value for Conv for discrete models, so there is
now a separate option for Conv(Discrete). There are also two
new options to deal with discrete models: BCliff and
MaxCliffs.
Since options are model dependent, you should specify
or change an option AFTER giving the trait
command, but before giving a maximization command such as
polygenic. If you load a model, option values are taken from
the model, with defaults used otherwise. If you give the
model new command, all options are set to their default
values.
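For example, the required ordering looks like this (the
MaxStep value shown is arbitrary, chosen only to illustrate
placement):

solar> model new
solar> trait q4
solar> option MaxStep 10
solar> polygenic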
MaxStep
(which is called MXSTEP in
Search.f) determines the number of step decrements which are
allowed in each iteration. At the beginning of each
iteration, Search.f tries to predict the optimal point by solving
the quadratic equation. However, often there is more
curvature in the likelihood function than expected, and the
predicted point goes over a cliff where likelihood gets worse
rather than better. When this happens, Search.f backs up, only
going a fraction of the distance in the direction of the
predicted optimal point. Once again, that can fail. The
number of times SOLAR is allowed to "back up" is limited by
MaxStep
. Sometimes it helps to increase
this value. You can see the number of steps that are taken in
each iteration in the maximization output files or in the
verbose maximization output.
Conv
and NConv
work
together to determine when maximization has completed. When
the loglikelihood has changed by less than
Conv
for NConv
iterations, maximization is deemed complete. It sometimes
helps to make Conv larger to allow maximization to complete when
the likelihood surface is extremely curved. This is similar
to increasing the tolerances for parameters, which is
addressed by the Tol
setting.
Maximization involves computing the loglikelihood of one set
of parameter values and then trying to find a set of parameter
values that has an even higher loglikelihood. Unfortunately,
sometimes a test point of parameter values will lead to
an incomputable loglikelihood. Such an incomputable value,
like division by zero, is called Not A Number or
NaN. The code directing discrete trait maximizations,
Optima.f, had a tendency to ignore these problems under some
circumstances. So an additional mechanism has been introduced
which handles this problem in the same way that Search.f
backs up from unexpectedly large (but not incomputable) drops
in loglikelihood value. The bad "point" is considered to be
"over the cliff", so we back up from it. Since the situation
has apparently gotten very bad, we may need to back up more
than usual, so there is a new MaxCliffs setting which takes
the place of the smaller MaxStep setting. There is also a
factor, BCliff, which determines how "much" to back up. The
default value, 0.1, means that each time we back up from a
cliff, we back up by a factor of 10. This is the same (not
adjustable) value that is used under "ordinary" circumstances
where the likelihood has fallen by a very large but still
computable amount.
Sometimes a multipoint run will terminate with a "Constraints
not Satisfied" error. This seems very strange at first,
because your model started with satisfied constraints, so why
should it not end with satisfied constraints? Unfortunately,
during maximization there can be numerical problems (such as
when very small numbers are added to very large numbers) such
that, at the end of a maximization step the constraints are
no longer satisfied. This is best understood as a kind of
convergence error resulting from numerical limits. There is
an EXPERIMENTAL option intended to correct this problem. You
can try the command

option EnforceConstraints 1

before running polygenic.
Unfortunately, in our experience, this simply leads to other
kinds of convergence errors later on. Residualization as
described above generally works better, if you can use it.
SOLAR has two mechanisms to deal with trait
distributions that are significantly different from
multivariate normality. These are also applicable
when the Residual Kurtosis (calculated by the
polygenic
command) is greater than 0.8.
These mechanisms are embodied in the commands
tdist
and lodadj
.
Using either of these commands will result in more accurate
LOD scores when the trait distribution is not ideal.
The tdist command
The tdist
command sets up an extra parameter,
t_param
, and enables it by setting the
option tdist
to 1. All you need to do is
give the tdist
command prior to running
polygenic
or another maximization command:
solar> model new
solar> trait apoe1
solar> covar age sex age*sex
solar> tdist
solar> polygenic
Although tdist is quick and easy to use, the extra parameter
will slow down maximization somewhat, which will slow down
genome scans. To turn it off, give the tdist -off command.
Empirical LOD Adjustment (lodadj)
Empirical LOD Adjustment, as realized by the lodadj command,
is one of the most important features of SOLAR. It corrects
for the inaccuracy caused by a non-normal trait distribution,
but unlike tdist it does not make maximization slower or less
robust.
To use LOD adjustment, you must first calculate the required
adjustment by simulation. This is done by the lodadj -calc
command, and it may require a considerable amount of time to
complete. Ultimately you end up with a factor less than 1.0
(such as 0.9) by which all your LOD scores are (automatically)
multiplied. After the LOD adjustment is computed, you simply
need to turn it on when desired with the lodadj command and
off with the lodadj -off command.
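A sketch of the overall flow, assuming the trait and polygenic
null model are already set up as described earlier (any
additional arguments to lodadj -calc are not shown here):

solar> lodadj -calc
solar> lodadj
solar> multipoint -ov 3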
There is a full discussion of lodadj
in
Section 10.2.
Advanced modeling topics including discrete traits, bivariate (two traits), dominance, household groups, and arbitrary parameterization are now discussed in Chapter 9.