Most data input files to SOLAR must be either PEDSYS
data files or Comma Delimited data files. (The
map and freq files, which are relatively short, are
exceptions. They are space delimited, and have fields
prescribed by position.) SOLAR can read both PEDSYS data
files and Comma Delimited data files interchangeably. (Some
database systems call Comma Delimited files "Comma Separated"
or "CSV." Most database systems can export suitable files
whatever they happen to call them.) SOLAR automatically
determines the type of files you have and reads them
appropriately. PEDSYS data files require matching
code files (which have the same name up to the last dot
but with the extension .cde or .CDE) which describe the data fields (see
section 4.1.4 below). Comma delimited files have a header
line (described below in section 4.1.3) to identify the data
fields.
In most cases, missing data is specified by blanking individual data fields. For PEDSYS files, this means the data field is filled with space characters. For Comma Delimited files, this is usually specified by having no characters between the delimiting commas or the first or last comma and the beginning or ending of the line, but it could also be specified by any number of space characters. SOLAR ignores leading and trailing space characters for any data field.
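The blank-field convention for Comma Delimited files can be illustrated with a minimal Python sketch. This is not SOLAR's own reader; the function name read_records and the choice of None to represent missing data are assumptions made for illustration only.

```python
import csv
import io

def read_records(text):
    """Parse comma delimited text into dicts; blank fields become None."""
    reader = csv.reader(io.StringIO(text))
    # The first record is the header; surrounding spaces are not significant.
    header = [name.strip() for name in next(reader)]
    records = []
    for row in reader:
        # A field that is empty or holds only spaces is treated as missing.
        values = [field.strip() or None for field in row]
        records.append(dict(zip(header, values)))
    return records

# EF is missing for individual 1, AGE is missing for individual 2.
sample = "ID,AGE,EF\n1,34,\n2,  ,0.5\n"
for record in read_records(sample):
    print(record)
```

Both ways of blanking a field (nothing between the commas, or only space characters) end up as the same missing value here.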
The names you use for data fields should be (and in many cases
must be) unambiguous in their case insensitive form.
For example, you must not have one field named AGE and
another named age. SOLAR will not be able to
tell these names apart. As much as possible within a
Tcl-based program running on Unix or Linux,
SOLAR has been designed to be case insensitive (because we
like it that way). The SOLAR command names (such as trait),
however, must all be entered in
lower case. Generally we use lower case for everything or
nearly so, but there is no requirement that you do this.
Field names in pedigree and phenotype files can be any length but must be unique within their first 18 characters. (In some messages, SOLAR might truncate your names to even fewer than 18 characters, but this will not affect operation.)
It is recommended that you only use characters A-Z, a-z, 0-9,
and underscore (_) in your field names. Some other special
characters may work, but are not recommended. The following
characters will particularly cause problems: / ^ # : ,
and space. This is only true for the
field names; in the actual data itself special characters are
allowed for some data types, though most data for SOLAR
consists of numbers.
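The naming rules above can be checked before a file is loaded. The following is a rough Python sketch, not part of SOLAR: the function name check_field_names and the wording of its messages are invented here, and it applies the recommended character set (A-Z, a-z, 0-9, underscore) more strictly than SOLAR itself does.

```python
import re

# Only the characters the text recommends: letters, digits, underscore.
SAFE_NAME = re.compile(r"^[A-Za-z0-9_]+$")

def check_field_names(names, significant=18):
    """Return a list of human-readable problems found in the field names."""
    problems = []
    seen = {}
    for name in names:
        if not SAFE_NAME.match(name):
            problems.append(f"{name!r}: empty or uses characters outside A-Z, a-z, 0-9, _")
        # Names must differ in their case insensitive form, within 18 characters.
        key = name.lower()[:significant]
        if key in seen:
            problems.append(f"{name!r}: not distinguishable from {seen[key]!r}")
        else:
            seen[key] = name
    return problems

print(check_field_names(["ID", "AGE", "age", "blood pressure"]))
```

An empty result means the names satisfy both the uniqueness rule and the recommended character set.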
Comma Delimited files must have a first record (called the header) which is a comma delimited list of the field names. Most database systems produce a suitable header record, but if yours does not you can edit the file to add one with an ASCII text file editor such as vi or emacs. Note that SOLAR does not allow field names to have spaces in them, and does not like null field names (names with no characters in them). Beware of database systems which may leave null field names at the end of the header when the file was created by certain SQL operations. (Most trouble that people have with SOLAR relates to data file formatting or data errors. If you are having trouble, please look at your data files carefully.) Here is a typical example of a header line:
ID,AGE,EF,Q1,Q2,Q3,Q4,Q5
It is permitted to suffix each field name with a type identification word after a colon (:). This is fairly common for comma delimited files exported from some database systems. SOLAR, however, does not use these type identifications. The type of each data field is determined by its name, and, in most cases, numbers only are expected. SOLAR ignores the contents of fields which it isn't asked to read, so they can contain anything (except commas or newlines). A header line with type identifications might look like this:
id:char,age:int,ef:int,q1:float,q2:float,q3:float,d1:int
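As an illustration of how such a header can be reduced to plain field names, here is a short Python sketch. It is an assumption-laden example rather than SOLAR's own parser: the function name parse_header is invented, and raising an error on a null field name is a design choice made here because the text warns against such names.

```python
def parse_header(line):
    """Split a header line into field names, dropping any ':type' suffix."""
    names = []
    for item in line.rstrip("\r\n").split(","):
        # Everything after the first colon is a type word, which is not used.
        name = item.split(":", 1)[0].strip()
        if not name:
            raise ValueError("null field name in header")
        names.append(name)
    return names

print(parse_header("id:char,age:int,ef:int,q1:float,q2:float,q3:float,d1:int"))
```

A trailing comma in the header, as left behind by some database exports, produces a null field name and is reported rather than silently accepted.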
This section applies to the use of PEDSYS data files only.
When using PEDSYS files, the names of data fields are defined in the Code file (.cde). We have decided to use the concatenation of the first 3 (out of 4) Mnemonics with restrictions described below. SOLAR ignores the "Description" field of code files because it is wordy and usually contains spaces (and such strings would not make good variable names).
Successive mnemonics are concatenated to produce field names of up to 18 characters. Concatenation of mnemonics is terminated by the first blank mnemonic or the first occurrence of a PEDSYS "standard" mnemonic (GENO, PHENO, G'TYPE, P'TYPE, INFERD). (This is a change from the behavior of SOLAR versions prior to 1.5.0.)
For example, consider this (edited by hand) code file:
REFORMATTED REFORMATTED MERGE merge.out & zno_ma.out
 6 EGO PERMANENT ID          ID                           C
 6 AGE AT CLINIC VISIT       AGE    CLINIC                R
 2 Diabetes Medication       Diabet Meds                  R
 2 Diabetes Status           DiabetStat                   I
 2 Weight at clinic          Weight PHENO                 R
 2 Antilipid Drugs           Lip    Meds                  R
The SOLAR field names would be ID, AGECLINIC, DiabetMeds, DiabetStat, Weight, and LipMeds.
SOLAR is forgiving in allowing the name "DiabetStat" even though the "S" occurs in the 7th position (where real PEDSYS programs would expect a space). This is for the convenience of people editing these files by hand. But beware that PEDSYS programs will clobber a name like DiabetStat with mandatory spaces in the 7th, 14th, and 21st columns, so it is not a good idea to do this if you intend to run your data files through PEDSYS programs later.
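The concatenation rule can be sketched in a few lines of Python. This is an illustration only, not SOLAR's code: it assumes the mnemonics have already been split out of the code file record (the column handling, including the forgiving treatment of a name like DiabetStat, is not modeled), and the names STANDARD and field_name are invented here.

```python
# PEDSYS "standard" mnemonics that terminate concatenation (from the text).
STANDARD = {"GENO", "PHENO", "G'TYPE", "P'TYPE", "INFERD"}

def field_name(mnemonics, max_len=18):
    """Join up to the first three mnemonics, stopping at a blank or standard one."""
    parts = []
    for mnemonic in mnemonics[:3]:
        mnemonic = mnemonic.strip()
        if not mnemonic or mnemonic in STANDARD:
            break
        parts.append(mnemonic)
    return "".join(parts)[:max_len]

print(field_name(["AGE", "CLINIC", "", ""]))    # AGECLINIC
print(field_name(["Weight", "PHENO", "", ""]))  # Weight
```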
Very early versions of SOLAR, prior to 1.0.1, required a special
"Fisher" or "Fisher SFBR" format created by the PEDSYS TRANSLAT
program. Those data formats are no longer either required or
supported efficiently; contact us if you need help with files in
those formats.
4.2 Pedigree File
The pedigree file consists of one record for each individual
in the data set. Each record must include the following fields:
ego ID, father ID, mother ID, sex
In addition, a family ID is required when ego IDs are not unique across the entire data set! If the data set contains genetically identical individuals, an MZ-twin ID must be present (as described below). If an analysis of household effects is planned, a household ID can be included (also described below). The default field names are:
id, fa, mo, sex, famid, mztwin, hhid
ego, sire, and dam are also accepted, by default, in
place of id, fa, and mo respectively.
You can
set up SOLAR to use different field names by using the field
command (discussed below). You do not necessarily need to change
your names to match ours (though it might be easiest in the
long run if you do).
A blank parental ID or a parental ID of 0 (zero) signifies a missing parent. (This is an exception to the general rule that SOLAR requires a blank field for missing data. This is an accommodation to the many people who have been using data formatted for the linkage program.)
SOLAR requires that either both parents are unknown (i.e. the individual is a founder) or that both parents are known. (In this case, we are only referring to knowledge concerning identity, not knowledge about genotypes or phenotypes, and the only aspect of identity we are really concerned with here is the existence of each individual in the pedigree and their genetic relationship(s) to other individuals.)
If only one parent is known, you can either choose to blank the known parent, or create new dummy parent(s) to fill in for ones that are unknown. If a dummy is used, it should only be used in such a way that does not create invalid information. For example, if there are two children, the creation of a dummy parent might imply that they are full siblings, when they might be only half-siblings. If the children are half-siblings sharing a single known parent, a separate dummy parent should be created for each child.
On the other hand, in cases where the known parent does not contribute any information anyway (phenotypic information, genotypic information, or relationship information, including a fact such as that children are siblings or half-siblings) it may be preferable to blank the known parent. A statistical geneticist ought to be able to determine the correct strategy for any particular pedigree in any particular kind of analysis.
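Before deciding how to handle such cases, it helps to find the records that have exactly one known parent. The Python sketch below shows one way to do that. It is not a SOLAR feature: it assumes each record is a dict keyed by the default field names id, fa, and mo, and the function names are invented for this example.

```python
def is_missing(parent_id):
    """A blank parental ID or a parental ID of 0 signifies a missing parent."""
    return parent_id.strip() in ("", "0")

def one_parent_records(records):
    """Return the ego IDs of records with exactly one known parent."""
    flagged = []
    for rec in records:
        # Both missing (founder) and both known are fine; a mismatch is not.
        if is_missing(rec["fa"]) != is_missing(rec["mo"]):
            flagged.append(rec["id"])
    return flagged

pedigree = [
    {"id": "1", "fa": "", "mo": ""},    # founder
    {"id": "2", "fa": "0", "mo": "0"},  # founder, zero-coded
    {"id": "3", "fa": "1", "mo": "2"},  # both parents known
    {"id": "4", "fa": "1", "mo": ""},   # only the father is known
]
print(one_parent_records(pedigree))
```

Each flagged individual then needs either the known parent blanked or a dummy parent added, following the guidance above.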
Sex may be encoded as M/F, m/f, or 1/2. The missing value for sex is 0, U, u, or blank.
The MZ-twin ID is used to designate genetically identical individuals, e.g. monozygotic twins or triplets. Each member of a group of identical individuals should be assigned the same MZ-twin ID. Twin IDs must be unique across the entire data set. If there are no genetically identical individuals in the data set, this field need not be present in the pedigree file. Any number or alphanumeric name may be used as MZ-twin ID. 0 (zero) or blank signifies that the individual is not a MZ-twin.
The Household ID, if present, will be used to generate a matrix file (house.gz) that can be used later to include a variance component for household or any other shared environmental effects. Household IDs must be unique across the entire data set. Any number or alphanumeric name may be used as a Household ID. 0 (zero) or blank signifies that the individual is a singleton household (i.e. not in the same household as anyone else). Singleton households are allowed but warnings might be displayed to prevent this from being specified by accident.
The family ID field (indicated by the famid name) is required
only when ego IDs are not unique across the entire data set.
If, for example, IDs are sequential integers unique only within
pedigrees, then the family ID must be included. Or if, for
example, a data set consists of nuclear families, and the same
ego ID may appear in more than one family, then the family ID
must be included. You must use the famid mnemonic for this
disambiguating family ID, or a field mapped to it with the
field command. The name pedno is reserved for another use (as
described next).
At the time the pedigree file is loaded, SOLAR indexes the
data set. This indexing is internal and should not be
confused with any external indexing the user may have imposed
upon the data set. This indexing information is stored in a
file named pedindex.out in the directory
where SOLAR is running when the pedigree file is loaded. (Be
careful about deleting files unless you are sure they are not
needed by SOLAR!) SOLAR ignores supplied pedigree delineations
(in famid or pedno fields) and builds up pedigrees from
scratch using only the parental identifications supplied.
SOLAR must be very strict about this because the requirements
of linkage analysis are very strict. The field pedno in the
pedindex.out file represents the delineation of pedigrees
that SOLAR has created. A field named pedno in your pedigree
file is ignored.
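To make the idea of building pedigrees from parental identifications alone more concrete, here is a Python sketch that groups individuals into connected sets with a union-find structure. This is only an illustration of the concept: the source does not describe SOLAR's actual indexing algorithm or the layout of pedindex.out, and the sketch assumes ego IDs are unique across the data set (no famid needed). The function name delineate_pedigrees is invented.

```python
def delineate_pedigrees(records):
    """Group individuals into pedigrees using only parental IDs (union-find)."""
    parent_of = {}

    def find(x):
        parent_of.setdefault(x, x)
        while parent_of[x] != x:
            parent_of[x] = parent_of[parent_of[x]]  # path halving
            x = parent_of[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent_of[root_b] = root_a

    for ego, fa, mo in records:
        find(ego)  # register the individual even if a founder
        for parent in (fa, mo):
            if parent not in ("", "0"):  # blank or 0 means missing parent
                union(ego, parent)

    groups = {}
    for person in list(parent_of):
        groups.setdefault(find(person), set()).add(person)
    return sorted(groups.values(), key=sorted)

# (ego ID, father ID, mother ID)
records = [("1", "", ""), ("2", "", ""), ("3", "1", "2"),
           ("4", "0", "0"), ("6", "", ""), ("5", "4", "6")]
print(delineate_pedigrees(records))
```

Any family or pedigree numbering supplied in the input plays no part in the grouping, which mirrors the behavior described above.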
Once a pedigree file has been loaded, it is not necessary to
load it again in later SOLAR sessions in the same working
directory. When starting, SOLAR looks for pedindex.out
and other files containing saved
state information from previous SOLAR sessions in the
current directory. The pedindex.out file
is discussed further in Section 8.2.
The concise reference specification for the pedigree file format is shown by the command file-pedigree or the help message for it.
The phenotypes file consists of one record for each individual. Each record must include an ego ID and one or more phenotypic values (which may be blank to signify missing data) and may also include quantitative covariate factors such as age:
ego ID, phen 1, phen 2, ...
Just as in the pedigree file, the ego ID may be
specified with a field named id, a field named ego,
or a field mapped to id with the field
command. Also, a family ID is required when your
ego IDs are not unique across the entire data set, and,
if needed, it must be specified with a
famid field or one mapped to it. The
header for the example gaw10.phen
phenotypes file (which does not require famid)
looks like this:
ID,AGE,EF,Q1,Q2,Q3,Q4,Q5
This header shows the ID, age, one quantitative environmental factor, and five quantitative phenotypes.
The phenotypes file may also contain other data, such as pedigree data. You could use one file as both your phenotype and your pedigree file, though that is not necessarily recommended.
If your data has probands and you wish to employ ascertainment
correction, the phenotypes file must have a proband
field, for which the default name is probnd.
In this field, blank ( ) or zero (0) signifies
non-proband, and anything else signifies proband. A decimal
point is not permitted after the zero. The presence of a
proband field automatically turns on ascertainment correction
during maximization. Ascertainment correction can be turned
off with the command:
field probnd -none
The phenotype field names may be anything within certain rules: no spaces, tabs, or slashes; in addition, certain special characters such as *#,^/-+ can cause problems in the names of phenotypes used as covariates. If you stick with alphabetic characters, numeric characters, and underscores you will be safe.
The phenotype data fields must always contain numbers, either with or without decimal points. Zero (0) is always considered a permissible value. Missing data must always be specified with a blank.
Floating or fixed point numbers must always include a decimal point; numbers without a decimal point might be assumed to be integers. Binary, discrete or categorical values should be indicated with consecutive integers (e.g. 0/1 or 1/2 or 2/3). SOLAR checks all phenotype fields to see if they contain only two consecutive integers and judges them discrete if they do, and quantitative otherwise. The presence or absence of a decimal point, or any type information in the header line or code file is ignored.
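The discrete-versus-quantitative rule can be expressed as a small Python sketch. It is an interpretation of the description above, not SOLAR's implementation: the function name is_discrete is invented, and the treatment of a field holding only one distinct value (judged not discrete here) is an assumption, since the text does not cover that case.

```python
def is_discrete(values):
    """True if the non-missing values are exactly two consecutive integers."""
    distinct = set()
    for raw in values:
        raw = raw.strip()
        if not raw:
            continue  # a blank field means missing data
        number = float(raw)
        if number != int(number):
            return False  # a non-integer value makes the field quantitative
        distinct.add(int(number))
    if len(distinct) != 2:
        return False
    low, high = sorted(distinct)
    return high - low == 1

print(is_discrete(["0", "1", "", "1.0"]))  # True
print(is_discrete(["1", "2", "3"]))        # False
```

Note that "1.0" and "1" are treated alike, so the presence or absence of a decimal point does not change the outcome.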
Discrete variables with more than 2 classes cannot be used as traits in SOLAR. For use as covariates, they should be decomposed into multiple binary variables as described in Section 9.1.1.1.
The load phenotypes command creates a file
named phenotypes.info in the current
directory. Once a phenotypes file has been loaded, it need
not be loaded again in the same directory, unless you
change the file.
SOLAR automatically removes from the maximization sample any
individuals missing required phenotypic data, and any
pedigrees in which all the non-probands are missing required
phenotypic data. (SOLAR cannot use a pedigree
containing only probands, unless ascertainment correction is
turned off with the command field probnd -none.)
You need not remove these individuals or
pedigrees yourself. You will get counts of the numbers of
pedigrees and individuals included and excluded in the
maximization output files (described below) or by giving the
verbosity max command prior to other
commands. Even if individuals are excluded from a
maximization sample, they may still have some importance in
having contributed to relationship information and/or
genetic inferences.
An exception to the above paragraph relates to traits in a bivariate analysis. By default, an individual need only have one of the two specified traits to be included in the maximization sample.
The concise reference specification for the phenotypes file format is shown by the command file-phenotypes or the help message for it.
The marker file contains genotype data for one or more marker loci. The file consists of one record for each individual who has been typed for one or more of these markers. Each record must contain the following fields:
ego ID, genotype1, genotype2, ...
In addition, a family ID field must be included when ego IDs are not unique across the entire data set. If, however, each ego ID is unique to an individual and an individual may appear multiple times in the data set, then the family ID should not be included. The same genotypic data is then associated with every occurrence of an individual.
The default field names are id and famid.
ego is also accepted by default in place of id. You
can set up SOLAR to use different field names by using the
field command.
The scheme used to encode genotypes may vary from field to field. SOLAR recognizes many standard coding schemes, but the safest way to code genotypes is with the forward slash to separate the alleles. For example:
AB E1 E3 123/456
A blank genotype field denotes missing data, as do the genotypes 0/0 and -/-. SOLAR requires that either both alleles are typed or both alleles are missing, except for male genotypes at X-linked marker loci. In that case, either a single allele is specified (the other allele is blank, 0, or -), or the genotype is coded as a homozygote.
237/243    valid female X-linked marker genotype
/251       valid male X-linked marker genotype
251/0      valid male X-linked marker genotype
-/251      valid male X-linked marker genotype
251/251    valid male X-linked marker genotype
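The genotype rules above can be summarized in a Python sketch. It covers only the slash-separated coding scheme, not the other schemes SOLAR recognizes, and it is not SOLAR's own parser. The function name classify_genotype and its return labels are invented, and rejecting a heterozygous male genotype at an X-linked locus is this sketch's reading of the rule that such genotypes be coded as a single allele or as a homozygote.

```python
# Allele codes that denote an untyped allele (from the text).
MISSING_ALLELES = {"", "0", "-"}

def classify_genotype(genotype, male_xlinked=False):
    """Return 'missing', 'valid', or 'invalid' for a slash-coded genotype."""
    text = genotype.strip()
    if not text:
        return "missing"
    alleles = [a.strip() for a in text.split("/")]
    if len(alleles) != 2:
        return "invalid"
    typed = [a for a in alleles if a not in MISSING_ALLELES]
    if not typed:
        return "missing"
    if len(typed) == 2:
        # With two alleles, a male X-linked genotype must be a homozygote.
        if male_xlinked and typed[0] != typed[1]:
            return "invalid"
        return "valid"
    # Exactly one allele typed: only allowed for males at X-linked loci.
    return "valid" if male_xlinked else "invalid"

for g in ["/251", "251/0", "-/251", "251/251"]:
    print(g, classify_genotype(g, male_xlinked=True))
```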
The marker loci in the marker file must all be autosomal or
all be X-linked. By default, SOLAR assumes that the markers
are autosomal. If the markers are X-linked, then either the
XLinked option must be set with the ibdoption
command prior to loading the
marker file, or the -xlinked option must
be given in the load marker command.
Once a marker file has been loaded, it is not necessary to load it again in subsequent SOLAR runs from the same working directory.
The concise reference specification for the marker file format is shown by the command file-marker or the help message for it.
The map file contains chromosomal locations in cM for a set of marker loci on a single chromosome. The first line of the file contains the chromosome number, and (optionally) the name of the mapping function to be used to convert inter-marker distances to recombination fractions. Currently, the Kosambi and Haldane mapping functions are allowed. The default mapping function is Kosambi. The chromosome number can be any character string not containing a blank or a forward slash (/), although the use of integers is recommended. For example, the strings '01' and '15q' are allowed. Each line after the first line consists of the following space-delimited fields:
marker name, marker location (in cM)
The markers must be in ascending order of location.
Example:
20 Haldane
D20S101 0.0
D20S202 34.2
D20S303 57.5
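A map file of this shape can be checked with a short Python sketch before loading it. This is an illustration under stated assumptions, not SOLAR's reader: fields are split on any run of whitespace, blank lines are skipped, and the function name parse_map is invented.

```python
def parse_map(text):
    """Parse map-file text into (chromosome, mapping function, marker list)."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    chromosome = lines[0][0]
    # The mapping function is optional; the text gives Kosambi as the default.
    function = lines[0][1] if len(lines[0]) > 1 else "Kosambi"
    markers = [(name, float(location)) for name, location in lines[1:]]
    locations = [location for _, location in markers]
    if locations != sorted(locations):
        raise ValueError("markers must be in ascending order of location")
    return chromosome, function, markers

example = "20 Haldane\nD20S101 0.0\nD20S202 34.2\nD20S303 57.5\n"
print(parse_map(example))
```

The chromosome is kept as a string because values such as '01' and '15q' are allowed.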
The concise reference specification for the map file format is shown by the command file-map or the help message for it.
The freq file contains allele frequency data for a set of marker loci, one line per marker. Each line consists of the following space-delimited fields:
marker name, all_1 name, all_1 freq, all_2 name, all_2 freq, ...
The allele frequencies for a marker must sum to 1 (a small roundoff error is tolerated).
Allele frequency information is used when IBDs are computed for a marker that is not completely typed, i.e. there are individuals for whom genotype data is not available.
Example:
D20S101 123 0.2457 127 0.1648 133 0.5895
IGF1 A 0.4 B 0.3 C 0.1 F 0.2
ApoE E1 .125 E2 .25 E3 .625
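The sum-to-1 requirement can be checked line by line with a Python sketch like the following. It is not SOLAR's reader: the function name parse_freq_line is invented, and the tolerance of 1e-4 is an arbitrary assumption, since the text does not say how much roundoff error is accepted.

```python
def parse_freq_line(line, tolerance=1e-4):
    """Parse one freq-file line into (marker name, {allele: frequency})."""
    fields = line.split()
    marker, pairs = fields[0], fields[1:]
    if len(pairs) % 2 != 0:
        raise ValueError(f"{marker}: allele names and frequencies must pair up")
    # Fields alternate: allele name, allele frequency, allele name, ...
    freqs = {pairs[i]: float(pairs[i + 1]) for i in range(0, len(pairs), 2)}
    if abs(sum(freqs.values()) - 1.0) > tolerance:
        raise ValueError(f"{marker}: allele frequencies do not sum to 1")
    return marker, freqs

print(parse_freq_line("ApoE E1 .125 E2 .25 E3 .625"))
```

Allele names are kept as strings, so numeric names such as 123 and symbolic names such as E1 are handled the same way.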
Once a freq file has been loaded, it is not necessary to load it again in subsequent SOLAR runs from the same working directory.
The concise reference specification for the freq file format is shown by the command file-freq or the help message for it.
If the field names in your data files do not match those names
expected by SOLAR, the easiest thing to do might be to
edit the names in your data files. But if this is impossible
or impractical, you can also use the field command to map
the SOLAR built-in names to your names. Your field
command settings are saved in a file named field.info
so that they do not need to be
re-entered each time you start SOLAR.
(field commands you have written to a file named .solar
in the current or home directory
will have even higher precedence.)
With no arguments, the field command shows
all the fields whose mapping can be changed, and what their
current mapping is.
solar> field
Default   Current   Meaning
-------   -------   -------
ID        ID        (Individual Permanent ID) [mandatory]
FA        FA        (Father's Permanent ID) [mandatory]
MO        MO        (Mother's Permanent ID) [mandatory]
SEX       SEX       (M/F, m/f, or 1/2) [mandatory]
PROBND    PROBND    [optional]
MZTWIN    MZTWIN    [optional]
FAMID     FAMID     [optional]
HHID      HHID      [optional]
With two arguments, the command maps the second name to the first built-in one. For example, the command
field id person
would map the built-in name id to a user-provided name person.
The command help field gives more details about the field command.