Chapter 8

Intermediate Files Created By SOLAR

SOLAR generates many files which it later re-uses. Normally there is no need to modify these files, and doing so might cause problems. But in some cases, users want or need to know about these files.

8.1 FORTRAN temporary files

FORTRAN temporary files with names such as:

        tmp.FAAA3Baa8z

may be left behind if a SOLAR command is terminated unexpectedly. Temporary files are normally deleted automatically at the end of some operation or at the end of a SOLAR session. Files with names such as this (prefixed by tmp.) may be deleted when SOLAR is not running if you wish to recover disk space (otherwise, they are harmless). You should exit from SOLAR first to be safe. Do not summarily delete any other files created by SOLAR.
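If you want to see which leftover temporary files are present before deleting anything, a small script can list them. The following is only a sketch, not part of SOLAR: the function name find_fortran_temp_files is hypothetical, and the only rule it applies is the tmp. prefix described above. It lists files and deletes nothing.

```python
from pathlib import Path


def find_fortran_temp_files(directory="."):
    """Return regular files in `directory` whose names start with 'tmp.'.

    Hypothetical helper: it only lists candidates and never deletes them.
    """
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.name.startswith("tmp.")
    )


if __name__ == "__main__":
    # Report each candidate and its size; run this only after exiting SOLAR.
    for path in find_fortran_temp_files("."):
        print(path.name, path.stat().st_size, "bytes")
```

Review the list by hand before removing anything, since other files created by SOLAR must be left alone.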

8.2 Pedigree Data: pedindex.out

8.2.1 When and why pedindex.out is created

The load pedigree command creates a file named pedindex.out (among other files; see section 8.2.3 for more) in the current working directory. Other SOLAR commands look for this file when they need to access pedigree information. This simplifies most SOLAR commands because they can read pedigree information in a simplified, verified, and canonical form. (One should not work with more than one pedigree file from within the same working directory as this can cause serious problems.)

If you change the pedigree file, you will need to give the load pedigree command again to create a new pedindex.out file. If you have created IBD and/or MIBD files, you will also need to create them all over again, even if no genotypic information has changed, since they depend upon the specific indexing in the pedindex.out file, which might change even if you do not change any of your ID's.

Once a pedigree file has been loaded, it is not necessary to load it again within the same working directory, even when starting SOLAR again at a later date. SOLAR detects the presence of an already existing pedindex.out file. If you reload the same pedigree file, nothing will be damaged because the same pedindex.out will be re-created each time you load the exact same source pedigree file. But unless you have changed the pedigree data file, reloading it is just a waste of time.

On the other hand, you must NOT try to load pedindex.out itself as a pedigree file using load pedigree. Not only is this not necessary, it will not work as you may think. The pedigree loading procedure is not idempotent. In other words, the individuals in pedindex.out will be ordered differently than in your original pedigree file, and loading pedindex.out itself would result in yet another ordering.

8.2.2 What pedindex.out contains

The pedindex.out file associates each ID in your pedigree file with a sequential ID used by all SOLAR commands. The field name for these sequential ID's is IBDID. IBDID's are unique in the entire file. Fathers and mothers are identified by fields named FIBDID and MIBDID which provide the respective parent IBDID. (Note: the field names are found inside the companion code file pedindex.cde, described in the next section.)

An all new sequential pedigree identification number named PEDNO is also created which is independent of any PEDNO you might already have in your pedigree file. The new PEDNO assignments are based entirely on father and mother information available in your file to ensure consistency. These new PEDNO assignments are the ones to which SOLAR commands such as pedlod refer. If you already have a PEDNO field in your pedigree file, it is ignored, and the PEDNO assignments in pedindex.out are likely to be different.

If your ID's are sequential within numbered families, and not unique in all your data, you must use a field named FAMID to identify the families to which individuals belong for identification purposes. If present in your pedigree file, the FAMID field will be carried over unchanged into the pedindex.out file. But SOLAR may decide to divide your families further if, for example, it detects individuals such as marry-ins who have no genetic relation to other individuals in a family. Marry-ins will become singleton pedigrees with their own unique PEDNO in the pedindex.out file. But this will not affect the FAMID, which is used simply for individual identification purposes.

8.2.3 Other files created by the command load pedigree

A PEDSYS code file named pedindex.cde is created which identifies the fields in the fixed width pedindex.out file. Because of this file, the pedindex.out can be read with such PEDSYS programs as browse. Within SOLAR, you can write scripts that read pedindex.out using the tablefile command, which can read files either in PEDSYS or comma delimited format.

A state file named pedigree.info is created which points to the loaded pedigree file and contains basic information about the pedigree. This basic information is displayed by the pedigree show command.

The load pedigree command also builds one or two matrix files, the Kinship Matrix phi2.gz and, optionally, the Household Matrix house.gz. These matrix files are described in the following sections.

8.3 Kinship Matrix: phi2.gz

The load pedigree command also creates a file named phi2.gz which contains a matrix of two times the kinship coefficient. Currently this file is not used in the normal course of operations (for quantitative traits) because kinship coefficients are generated on-the-fly by SOLAR in a way that reduces memory requirements. However, it is used during the analysis of discrete traits because the discrete trait modeling code was simplified by removing the on-the-fly kinship computation.

In some cases, very sophisticated SOLAR users can substitute a modified phi2.gz file to perform a special kind of analysis. In order to force SOLAR to use the external matrix file for a quantitative trait, you need to give a suitable load matrix command or use the loadkin command which does this using the standard matrix identifiers phi2 and delta7 (described below). This should be done before running polygenic. During model maximization, if a matrix has been loaded with the identifier name phi2, it supersedes any on-the-fly calculated values for phi2.

It is instructive to understand the phi2.gz file as it is a template for all matrix files, such as the IBD and MIBD files described below.

All matrix files are compressed with GNU gzip, which achieves a high compression factor and saves a considerable amount of space. Even in their compressed form, a full set of MIBD files for the whole genome can take a lot of space (possibly hundreds of megabytes).

When uncompressed, every SOLAR matrix file is found to have a very simple format. There are three or four columns of numbers which must be space delimited. The first two columns are sequential identifiers, based on the IBDID sequencing in pedindex.out (and not necessarily any user sequencing, as described above.) There must be one (and only one) line for each pair of individuals. The last one or two columns are the coefficients related to that pair of individuals. Each type of matrix file has different types of coefficients. The coefficients should begin in the fourteenth character column, or higher, counting the first character column as number one.
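The format just described can be read with a few lines of Python. This is a minimal sketch rather than SOLAR's own reader: the function name read_solar_matrix is hypothetical, and it assumes that every line, including a leading checksum line if one is present, parses as two integer identifiers followed by one or two numbers. Because a later line for the same pair replaces an earlier one, a fake 1,1 first line is overwritten by the 1,1 line that follows it.

```python
import gzip


def read_solar_matrix(path):
    """Read a gzipped matrix file into {(id1, id2): (coefficient, ...)}.

    Hypothetical helper. Assumes whitespace-delimited lines with two
    integer identifiers followed by one or two numeric coefficients.
    """
    entries = {}
    with gzip.open(path, "rt") as handle:
        for line in handle:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            if len(fields) not in (3, 4):
                raise ValueError(f"expected 3 or 4 columns, got: {line!r}")
            id1, id2 = int(fields[0]), int(fields[1])
            coefficients = tuple(float(value) for value in fields[2:])
            # A later line for the same pair replaces an earlier one.
            entries[(id1, id2)] = coefficients
    return entries
```

For example, read_solar_matrix("phi2.gz") would return a dictionary keyed by IBDID pairs, with each value holding the coefficients for that pair in file order.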

Starting with SOLAR Version 4, a "checksum" is added as the first line in every matrix file as a fake 1,1 element. It is immediately followed by and overwritten by the real 1,1 element (or a fake 1,1 guard element with value 0). This "checksum" is actually a polynomial Cyclic Redundancy Check (CRC) computed with the Unix cksum command on the pedindex.out file. This enables the load matrix command (used often in model files) to check that the matrix file was created with the exact same pedigree. Even the smallest change to a pedigree file will change the IBDID's computed during the load pedigree command, which will necessitate that all matrices be regenerated with the new IBDID's. Note that it is not the name of the pedigree file or the timestamp of the pedindex.out file, but the data they contain that counts, so this checking won't be affected if you copy or reload the same pedigree file. The possibility of the CRC checking not detecting a pedigree/matrix mismatch is astronomically small.

If you are creating your own matrix files, you can generate and prepend the required CRC using the matcrc command, which should be run on the matrix file after it has been gzipped by you (matcrc will gunzip the file, prepend the CRC, and then re-gzip the file). This must be run in a directory where the matching pedigree file is loaded. Version 4 of SOLAR does not require that all matrices have CRC's, though this requirement may be added in some future SOLAR version, and it is desirable anyway to have the pedigree/matrix checking. Without this checking, it is all too easy to use an obsolete matrix file with a mismatching modified pedigree.

The phi2.gz file has phi2 and delta7 coefficients. phi2 is the kinship coefficient phi times 2, a term which occurs frequently in genetic covariance equations. (We might have liked to name this 2phi, but many computer programs don't like names which begin with numbers.)

For pedigrees without inbreeding, the coefficient we call delta7 is the same as delta7 from the Jacquard condensed coefficients of identity, delta1 - delta9. When inbreeding is not present, Jacquard's delta7 is the probability that a pair of individuals share exactly two alleles identical by descent (IBD) at a randomly chosen locus. For pedigrees with inbreeding however, our delta7 may differ from Jacquard's delta7, and should not be used.

For non-inbred pedigrees, we can also express IBD-allele sharing using the Cotterman coefficients, K0 - K2. K0 is the probability, at a random locus, of sharing no alleles IBD, K1 is the probability of sharing exactly one allele IBD, and K2 is the probability of sharing two alleles IBD. Hence, in the case of no inbreeding, our delta7 is equivalent to K2.
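For non-inbred pairs, both coefficients follow directly from the Cotterman coefficients: delta7 equals K2 as stated above, and phi2 equals K1/2 + K2 (a standard identity for non-inbred relatives, not something taken from the SOLAR code). The sketch below works this out with exact fractions; the function name phi2_and_delta7 is hypothetical.

```python
from fractions import Fraction


def phi2_and_delta7(k0, k1, k2):
    """Return (phi2, delta7) for a non-inbred pair from Cotterman K0, K1, K2.

    Hypothetical helper. Uses phi2 = K1/2 + K2 and delta7 = K2, which
    hold only when neither individual is inbred.
    """
    k0, k1, k2 = Fraction(k0), Fraction(k1), Fraction(k2)
    if min(k0, k1, k2) < 0 or k0 + k1 + k2 != 1:
        raise ValueError("K0, K1, K2 must be non-negative and sum to 1")
    return k1 / 2 + k2, k2


if __name__ == "__main__":
    # Full sibs: K0 = 1/4, K1 = 1/2, K2 = 1/4
    print(phi2_and_delta7("1/4", "1/2", "1/4"))
    # Parent-offspring: K0 = 0, K1 = 1, K2 = 0
    print(phi2_and_delta7(0, 1, 0))
```

Full sibs give phi2 = 1/2 and delta7 = 1/4, while a parent-offspring pair gives phi2 = 1/2 and delta7 = 0. Pass the coefficients as fractions or fraction strings, since decimal floats such as 0.1 may not sum to exactly 1.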

phi2 and delta7 are named terms which may be used in the SOLAR omega (covariance) equation. Normally the omega is automatically set up for you by other commands (such as polygenic or multipoint) but it is available for examination or modification for custom analysis (see Section 9.5 for an introduction to custom analyses). Only phi2 is currently set up automatically; analysis involving dominance would involve using delta7 as well, and we have not automated that yet (but it is described in Section 9.4).

8.4 Household Matrix: house.gz

When a load pedigree command is executed, the matrix file house.gz will be created automatically if the pedigree file has an HHID field (or a field mapped to HHID with the field command.) The household matrix file contains only one meaningful coefficient at this time. It is simply a 1 if the two individuals are members of the same household, and 0 if they are not. We name this coefficient house.
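The house coefficient is simple enough to illustrate directly. The sketch below is not how SOLAR builds house.gz; it only shows the rule stated above. The function name household_coefficients is hypothetical, it assumes every individual has a known household identifier, and the ordering of each pair in the result is an arbitrary choice for this example.

```python
from itertools import combinations_with_replacement


def household_coefficients(hhid_by_id):
    """Return {(id_a, id_b): house} for every pair, including self-pairs.

    Hypothetical helper: house is 1 when both individuals have the same
    household identifier and 0 otherwise.
    """
    ids = sorted(hhid_by_id)
    return {
        (a, b): 1 if hhid_by_id[a] == hhid_by_id[b] else 0
        for a, b in combinations_with_replacement(ids, 2)
    }


if __name__ == "__main__":
    # Individuals 1 and 2 share household "A"; individual 3 lives in "B".
    for pair, house in household_coefficients({1: "A", 2: "A", 3: "B"}).items():
        print(pair, house)
```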

The household matrix need not necessarily refer to household but could refer to any other shared environmental grouping, such as neighborhood, city, tribe, favorite musical genre, etc.

To include household effects in an analysis, use the house command prior to giving the polygenic command.

8.5 IBD Data: ibd.<marker>.gz

The ibd command creates files named ibd.<marker>.gz, where <marker> is the marker name. These files contain the ibd and d7 coefficients for each pair of related individuals for which ibd > 0. ibd is computed as .5 * p(1) + p(2), where p(i) is the probability of sharing exactly i alleles identical by descent at this marker locus. In non-inbred relationships, ibd gives the expected proportion of alleles shared identical by descent at this marker locus. d7 is simply p(2). These coefficients are the marker-specific analogues of the phi2 and delta7 coefficients described in Section 8.3. If no member of a pedigree has been genotyped for the marker, an IBD value of -1 is assigned to the main diagonal entries in the IBD matrix for that pedigree. In a linkage analysis, all pairs of individuals in that pedigree will be treated as having ibd = phi2, the expected IBD allele-sharing at a locus chosen at random.
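The two formulas above can be checked with a short calculation. This is only an illustration of the arithmetic, not SOLAR's IBD estimation code; the function name ibd_coefficients and its input validation are assumptions made for the example.

```python
def ibd_coefficients(p1, p2):
    """Return (ibd, d7) from the probabilities of sharing 1 and 2 alleles IBD.

    Hypothetical helper implementing ibd = .5 * p(1) + p(2) and d7 = p(2).
    """
    if p1 < 0 or p2 < 0 or p1 + p2 > 1 + 1e-9:
        raise ValueError("p1 and p2 must be probabilities with p1 + p2 <= 1")
    return 0.5 * p1 + p2, p2


if __name__ == "__main__":
    # A pair with p(1) = 0.5 and p(2) = 0.25 at this marker
    print(ibd_coefficients(0.5, 0.25))
```

With p(1) = 0.5 and p(2) = 0.25 the result is ibd = 0.5 and d7 = 0.25, the same values that phi2 and delta7 take for full sibs at a random locus.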

IBD files will be created in the directory specified by the ibddir command. For example, the command ibddir ibd will result in the IBD files being created in a subdirectory named ibd. You can refer to the current directory with the name . (dot).

The IBD files are used by the twopoint command when doing twopoint analyses. twopoint uses the command linkmod -2p to set up the omega equation and the parameters required for a twopoint linkage model. linkmod -2p uses the load matrix command to load the required matrices.

d7 is only available if the IBDs were computed by the Curtis and Sham method (or if IBDs were imported from a package which computed d7). Pedigrees which contain inbreeding or multiple marriage loops, or for which all individuals have been typed, result in the Monte Carlo method being used by SOLAR, and in those cases d7 is not available. In those cases, it is suggested that you use another genetics package to compute IBDs, which might give more exact numbers anyway, and that is especially advantageous for analyses of dominance. (Analysis of dominance is discussed in Chapter 9).

A number of working files are required for the IBD calculation process. These files exist in subdirectories created by the load marker command, one directory for each marker. For more information, use the command help marker.

The first line in IBD matrix files created by SOLAR looks like strange data but is actually a checksum to ensure the matrix is used with the correct pedigree only. This checksum line is created by the matcrc command. It is not necessary for user-created matrix files to have this checksum line, but it is recommended and easy to do with the matcrc command.

8.6 MIBD Data: mibd.<chr>.<loc>.gz

The mibd command creates files named mibd.<chr>.<loc>.gz, where <chr> is the chromosome number and <loc> is the chromosomal location in cM. These files contain the mibd coefficients for each pair of related individuals for which mibd > 0. They will be created in the directory specified by the mibddir command. For example, mibddir mibd will result in the MIBD files being created in a subdirectory named mibd. You can refer to the current directory with the name . (dot).
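Scripts that post-process MIBD files often need to recover the chromosome and location from the file name. The following sketch does this with a regular expression based only on the naming pattern given above. The function name parse_mibd_name is hypothetical, and both parts are returned as strings because the exact form of <chr> and <loc> is not assumed here.

```python
import re

# Pattern for mibd.<chr>.<loc>.gz; <chr> is taken to contain no dots.
MIBD_NAME = re.compile(r"^mibd\.(?P<chr>[^.]+)\.(?P<loc>.+)\.gz$")


def parse_mibd_name(filename):
    """Return (chr, loc) as strings, or None if the name does not match.

    Hypothetical helper based only on the mibd.<chr>.<loc>.gz pattern.
    """
    match = MIBD_NAME.match(filename)
    if match is None:
        return None
    return match.group("chr"), match.group("loc")


if __name__ == "__main__":
    print(parse_mibd_name("mibd.9.25.gz"))
    print(parse_mibd_name("phi2.gz"))
```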

For historical reasons, the second column of MIBD files created by SOLAR contains a copy of phi2 from the phi2 matrix. This is no longer needed, and it is an obsolescent feature, because there is now a separate phi2.gz file. Imported MIBD files may contain d7, the MIBD analogue of delta7, in their second column, if the genetic package computes it. It makes more sense for the second column to contain d7 than anything else now (though it is unlikely that SOLAR will compute d7 for MIBDs). See section 5.5 for information on how to use MIBDs computed by another program.

The first line in MIBD matrix files created by SOLAR looks like strange data but is actually a checksum to ensure the matrix is used with the correct pedigree only. This checksum line is created by the matcrc command. It is not necessary for user-created matrix files to have this checksum line, but it is recommended and easy to do with the matcrc command.

mibd is the multipoint analogue of the ibd coefficient in marker-specific IBD files, which are described in the previous section. If no member of a pedigree has been genotyped for any of the markers on this chromosome, an MIBD value of -1 is assigned to the main diagonal entries in the MIBD matrix for that pedigree. In a linkage analysis, all pairs of individuals in that pedigree will be treated as having mibd = phi2, the expected IBD allele-sharing at a locus chosen at random.

The MIBD files are used by the multipoint command when doing multipoint analysis. multipoint uses the linkmod command to set up the parameters in each multipoint model. linkmod uses the load matrix command to load a particular MIBD file.

In addition to the files above, four other files are created and used to compute MIBDs: mibdrel.ped, which stores the relative class (e.g. full sib, parent-offspring) for each pair of individuals in a pedigree; mibdchr<chr>.mrg, which stores the marker-specific IBDs for all the markers on chromosome <chr>; mibdchr<chr>.mean, which stores, for each relative class, the mean IBD value at each marker on chromosome <chr>; and mibdchr<chr>.loc, which stores the location in cM of each marker on chromosome <chr>.

8.7 Model Files: *.mod

Models are constructed automatically by commands such as polygenic, twopoint, and multipoint. In the course of these commands, important models are automatically saved, and may be reloaded later with the load model command. Models may also be saved at any time with the save model command. To identify models, SOLAR automatically tacks on a .mod extension to all model filenames (unless they already have a .mod extension).

The model file itself is a text file consisting of a sequence of SOLAR commands such as solarmodel, parameter, constraint, trait, covariate, omega, mu, matrix, and option. The first command must be a solarmodel command (which identifies the SOLAR version under which the model was created so that any upgrade issues can be resolved). Model files do not contain load pedigree or load phenotypes commands. Those commands must be given prior to using a model at least once in any working directory.

There is usually some current model in effect in SOLAR. Usually this is the best model created by the previous command. When starting solar, the empty model is in effect. (To clear any previous model and start over with a new empty model, use the model new command.)

If the current model has been maximized (so that its parameters have been set to their maximum likelihood estimates) the model will include a loglike command giving the (natural) log likelihood associated with the maximized model. The basic maximization command is maximize, but commands such as polygenic, multipoint, and twopoint automatically do maximization for you so you may not need to use the actual maximize command except in special cases such as those described in Chapter 9.

When any maximization starts, the starting model is saved as last.mod, which might permit you to go back to the previous model if something goes wrong. In many cases, however, you cannot depend on last.mod to take you back before your last command. Sometimes commands create and maximize several intermediate models, and in that case last.mod will represent the last of these intermediate models. Even the maximize command itself sometimes creates intermediate models in the process of trying to resolve convergence difficulties.

It is more useful to load specially named models that may result from particular commands such as polygenic and multipoint. For example, polygenic always creates a model named null0.mod, and multipoint creates a model named null1.mod (and null2.mod, etc. if oligogenic scanning is performed). For a detailed description of the models created by any particular command, see the command documentation for that command.

The default directory for loading and saving models is the current working directory. However, commands such as multipoint and twopoint save models into the maximization output directory named by the trait or outdir command. Therefore, if you are going to load any of the models created by a previous command, you need to either specify that directory name explicitly:

        solar> load model q4/null0

Or, if you have previously specified the trait or outdir or loaded another model (which specifies the trait) you can use the full_filename command:

        solar> load model [full_filename null0]

(The full_filename command is most useful in scripts where you don't know what the trait is going to be, or whether the user has used the outdir command.)

There is also a command read_model which lets you read the likelihood and/or parameter values from previously stored models without loading them.

8.8 SOLAR maximization output files

If you maximize a model with the maximize command, or if you use the verbosity max command before running a command which maximizes many models such as polygenic or multipoint, you will see a lot of information about the model, the data, and the history of maximization displayed on your terminal. This information may also be written to a file for you to read later (even if you haven't used the verbosity max command). The default name for this maximization output file if you use the maximize command is solar.out, and it is written to the maximization output directory described in the previous section. Other commands which do maximization name these files after the models they create, except with a .out suffix, for example:

polygenic: poly.out spor.out nocovar.out
multipoint: null1.out null2.out

Maximization output files show the final values for all parameters, the standard errors (unless standard error calculations were turned off), the Loglikelihood, the Normalized Quadratic value, Descriptive Statistics for the Quantitative Variables, and the Iteration History. (Note: if there are retries because of maximization convergence difficulties, the Iteration History will only correspond to the last maximization retry.) Some of these values are saved within the corresponding model file itself, but a few are not. For that reason, there is a command to read certain information from the maximization output files which is not available in model files: read_output.