Chapter 4

Input Data File Formats

4.1 Basic Input Data File Formats Supported

Most data input files to SOLAR must be either PEDSYS data files or Comma Delimited data files. (The map and freq files, which are relatively short, are exceptions. They are space delimited, and have fields prescribed by position.) SOLAR can read both PEDSYS data files and Comma Delimited data files interchangeably. (Some database systems call Comma Delimited files "Comma Separated" or "CSV." Most database systems can export suitable files whatever they happen to call them.) SOLAR automatically determines the type of files you have and reads them appropriately. PEDSYS data files require matching code files (which have the same name up to the last dot but with the extension .cde or .CDE) which describe the data fields (see section 4.1.4 below). Comma delimited files have a header line (described below in section 4.1.3) to identify the data fields.

4.1.1 Use Blank to specify missing data

In most cases, missing data is specified by blanking individual data fields. For PEDSYS files, this means the data field is filled with space characters. For Comma Delimited files, this is usually specified by having no characters between the delimiting commas or the first or last comma and the beginning or ending of the line, but it could also be specified by any number of space characters. SOLAR ignores leading and trailing space characters for any data field.
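The blank-means-missing convention for Comma Delimited files can be sketched in Python as follows. This is only an illustration, not SOLAR's own reader: the function name parse_record is hypothetical, and the sketch assumes simple fields with no quoting.

```python
def parse_record(line):
    """Split one comma delimited record; blank fields become None (missing)."""
    fields = []
    for raw in line.rstrip("\n").split(","):
        value = raw.strip(" ")  # leading and trailing spaces are ignored
        # An empty field, or one holding only spaces, is treated as missing.
        fields.append(value if value != "" else None)
    return fields
```

A field that is empty, or that contains only spaces, comes back as None whether it sits between two commas or at either end of the line.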

4.1.2 Field Name Lengths and Characters

The names you use for data fields should be (and in many cases must be) unambiguous in their case insensitive form. For example, you must not have one field named AGE and another named age. SOLAR will not be able to tell these names apart. As much as possible within a Tcl-based program running on Unix or Linux, SOLAR has been designed to be case insensitive (because we like it that way). The SOLAR command names (such as trait), however, must all be entered in lower case. Generally we use lower case for everything or nearly so, but there is no requirement that you do this.

Field names in pedigree and phenotype files can be any length but must be unique within their first 18 characters. (In some messages, SOLAR might truncate your names to even fewer than 18 characters, but this will not affect operation.)

It is recommended that you only use characters A-Z, a-z, 0-9, and underscore (_) in your field names. Some other special characters may work, but are not recommended. The following characters will particularly cause problems: / ^ # : , and space. This is only true for the field names; in the actual data itself special characters are allowed for some data types, though most data for SOLAR consists of numbers.
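The naming rules in this section can be checked before loading a file. The following Python sketch is not part of SOLAR, and the name check_field_names is hypothetical. It flags names that contain characters outside the recommended set, and names that are not distinct when compared case insensitively on their first 18 characters.

```python
import re

# Recommended characters only: A-Z, a-z, 0-9, and underscore.
RECOMMENDED = re.compile(r"^[A-Za-z0-9_]+$")

def check_field_names(names, unique_len=18):
    """Return a list of human-readable problems found in the field names."""
    problems = []
    seen = {}
    for name in names:
        if not RECOMMENDED.match(name):
            problems.append(f"{name!r}: contains characters outside A-Z, a-z, 0-9, _")
        # Compare case insensitively on the first unique_len characters.
        key = name.lower()[:unique_len]
        if key in seen:
            problems.append(f"{name!r}: not distinct from {seen[key]!r}")
        else:
            seen[key] = name
    return problems
```

An empty result means the names pass both checks. A null (empty) name is reported as a character problem.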

4.1.3 Comma Delimited Field Names

Comma Delimited files must have a first record (called the header) which is a comma delimited list of the field names. Most database systems produce a suitable header record, but if yours does not you can edit the file to add one with an ASCII text file editor such as vi or emacs. Note that SOLAR does not allow field names to have spaces in them, and does not like null field names (names with no characters in them). Beware of database systems which may leave null field names at the end of the header when the file was created by certain SQL operations. (Most trouble that people have with SOLAR relates to data file formatting or data errors. If you are having trouble, please look at your data files carefully.) Here is a typical example of a header line:

        ID,AGE,EF,Q1,Q2,Q3,Q4,Q5

It is permitted to suffix each field name with a type identification word after a colon (:). This is fairly common for comma delimited files exported from some database systems. SOLAR, however, does not use these type identifications. The type of each data field is determined by its name, and, in most cases, numbers only are expected. SOLAR ignores the contents of fields which it isn't asked to read, so they can contain anything (except commas or newlines). A header line with type identifications might look like this:

        id:char,age:int,ef:int,q1:float,q2:float,q3:float,d1:int
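Since the type identifications are not used, a tool that preprocesses such files can simply discard them. A minimal Python sketch follows; the function name header_field_names is hypothetical and the code is not taken from SOLAR.

```python
def header_field_names(header_line):
    """Return the field names from a header line, dropping any ':type' suffix."""
    names = []
    for field in header_line.strip().split(","):
        # Keep only the text before the first colon, if there is one.
        name = field.split(":", 1)[0].strip()
        names.append(name)
    return names
```

Applied to the header above, this gives the names id, age, ef, q1, q2, q3, and d1. A header without type identifications passes through unchanged.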

4.1.4 PEDSYS Field Names

This section applies to the use of PEDSYS data files only.

When using PEDSYS files, the names of data fields are defined in the Code file (.cde). We have decided to use the concatenation of the first 3 (out of 4) Mnemonics with restrictions described below. SOLAR ignores the "Description" field of code files because it is wordy and usually contains spaces (and such strings would not make good variable names).

Successive mnemonics are concatenated to produce field names of up to 18 characters. Concatenation of mnemonics is terminated by the first blank mnemonic or the first occurrence of a PEDSYS "standard" mnemonic (GENO, PHENO, G'TYPE, P'TYPE, INFERD). (This is a change from the behavior of SOLAR versions prior to 1.5.0.)

For example, consider this (edited by hand) code file:

	    REFORMATTED REFORMATTED MERGE merge.out & zno_ma.out  
	    6 EGO PERMANENT ID      ID                          C
	    6 AGE AT CLINIC VISIT   AGE    CLINIC               R
	    2 Diabetes Medication   Diabet Meds                 R
	    2 Diabetes Status       DiabetStat                  I
	    2 Weight at clinic      Weight PHENO                R
	    2 Antilipid Drugs       Lip    Meds                 R

The SOLAR field names would be ID, AGECLINIC, DiabetMeds, DiabetStat, Weight, and LipMeds.

SOLAR is forgiving in allowing the name "DiabetStat" even though the "S" occurs in the 7th position (where real PEDSYS programs would expect a space). This is for the convenience of people editing these files by hand. But beware that PEDSYS programs will clobber a name like DiabetStat with mandatory spaces in the 7th, 14th, and 21st columns, so it is not a good idea to do this if you intend to run your data files through PEDSYS programs later.
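The concatenation rule can be sketched in Python as follows. This is an illustration only, not SOLAR's code. The function name solar_field_name is hypothetical, the mnemonics are assumed to have already been split out of the code file record, and the standard mnemonics are assumed to be matched exactly as written above.

```python
# PEDSYS "standard" mnemonics that terminate concatenation.
STANDARD_MNEMONICS = {"GENO", "PHENO", "G'TYPE", "P'TYPE", "INFERD"}

def solar_field_name(mnemonics, max_used=3):
    """Concatenate leading mnemonics, stopping at a blank or standard one."""
    parts = []
    for mnemonic in mnemonics[:max_used]:  # only the first 3 of 4 are considered
        mnemonic = mnemonic.strip()
        if mnemonic == "" or mnemonic in STANDARD_MNEMONICS:
            break
        parts.append(mnemonic)
    return "".join(parts)
```

With the mnemonics from the example code file, AGE and CLINIC give AGECLINIC, Weight followed by PHENO gives Weight, and Lip and Meds give LipMeds.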

4.1.5 Obsolete Formats

Very early versions of SOLAR, prior to 1.0.1, required a special "Fisher" or "Fisher SFBR" format created by the PEDSYS TRANSLAT program. Those data formats are no longer required, and they are no longer supported efficiently; contact us if you need help with files in those formats.

4.2 Pedigree File

The pedigree file consists of one record for each individual in the data set. Each record must include the following fields:

	ego ID, father ID, mother ID, sex

In addition, a family ID is required when ego IDs are not unique across the entire data set. If the data set contains genetically identical individuals, an MZ-twin ID must be present (as described below). If an analysis of household effects is planned, a household ID can be included (also described below). The default field names are:

	id, fa, mo, sex, famid, mztwin, hhid

ego, sire, and dam are also accepted, by default, in place of id, fa, and mo respectively. You can set up SOLAR to use different field names by using the field command (discussed below). You do not necessarily need to change your names to match ours (though it might be easiest in the long run if you do).

A blank parental ID or a parental ID of 0 (zero) signifies a missing parent. (This is an exception to the general rule that SOLAR requires a blank field for missing data. This is an accommodation to the many people who have been using data formatted for the linkage program.)

SOLAR requires that either both parents are unknown (i.e. the individual is a founder) or that both parents are known. (In this case, we are only referring to knowledge concerning identity, not knowledge about genotypes or phenotypes, and the only aspect of identity we are really concerned with here is the existence of each individual in the pedigree and their genetic relationship(s) to other individuals.)

If only one parent is known, you can either choose to blank the known parent, or create new dummy parent(s) to fill in for ones that are unknown. If a dummy is used, it should only be used in such a way that does not create invalid information. For example, if there are two children, the creation of a dummy parent might imply that they are full siblings, when they might be only half-siblings. If the children are half-siblings sharing a single known parent, a separate dummy parent should be created for each child.

On the other hand, in cases where the known parent does not contribute any information anyway (phenotypic information, genotypic information, or relationship information, including a fact such as that children are siblings or half-siblings) it may be preferable to blank the known parent. A statistical geneticist ought to be able to determine the correct strategy for any particular pedigree in any particular kind of analysis.

Sex may be encoded as M/F, m/f, or 1/2. The missing value for sex is 0, U, u, or blank.
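The parent and sex rules above lend themselves to a simple check that can be run before loading a pedigree file. The Python sketch below is an illustration under stated assumptions, not SOLAR's own validation: the function name check_pedigree_record is hypothetical, and the fields are assumed to have already been read as strings.

```python
MISSING_PARENT = {"", "0"}  # blank or 0 signifies a missing parent
VALID_SEX = {"M", "F", "m", "f", "1", "2",  # recognized sex codes
             "0", "U", "u", ""}             # missing-value codes for sex

def check_pedigree_record(ego, fa, mo, sex):
    """Return a list of problems found in one pedigree record."""
    problems = []
    fa_known = fa.strip() not in MISSING_PARENT
    mo_known = mo.strip() not in MISSING_PARENT
    # Either both parents must be known or both must be unknown.
    if fa_known != mo_known:
        problems.append(f"{ego}: only one parent is known")
    if sex.strip() not in VALID_SEX:
        problems.append(f"{ego}: unrecognized sex code {sex!r}")
    return problems
```

A founder with blank or 0 parents passes, as does an individual with both parents given. A record with exactly one known parent is reported so that it can be fixed by one of the strategies described above.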

The MZ-twin ID is used to designate genetically identical individuals, e.g. monozygotic twins or triplets. Each member of a group of identical individuals should be assigned the same MZ-twin ID. Twin IDs must be unique across the entire data set. If there are no genetically identical individuals in the data set, this field need not be present in the pedigree file. Any number or alphanumeric name may be used as MZ-twin ID. 0 (zero) or blank signifies that the individual is not a MZ-twin.

The Household ID, if present, will be used to generate a matrix file (house.gz) that can be used later to include a variance component for household or any other shared environmental effects. Household IDs must be unique across the entire data set. Any number or alphanumeric name may be used as a Household ID. 0 (zero) or blank signifies that the individual is a singleton household (i.e. not in the same household as anyone else). Singleton households are allowed but warnings might be displayed to prevent this from being specified by accident.

The family ID field (indicated by the famid name) is required only when ego IDs are not unique across the entire data set. If, for example, IDs are sequential integers unique only within pedigrees, then the family ID must be included. Or if, for example, a data set consists of nuclear families, and the same ego ID may appear in more than one family, then the family ID must be included. You must use the famid mnemonic for this disambiguating family ID, or a field mapped to it with the field command. The name pedno is reserved for another use (as described next).
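One way to find out whether a pedigree file needs a family ID is to look for ego IDs that occur in more than one record. The following Python sketch does this; the function name duplicated_ego_ids is hypothetical, and IDs are compared here as exact strings after stripping spaces, which is an assumption of the sketch.

```python
from collections import Counter

def duplicated_ego_ids(ego_ids):
    """Return the sorted ego IDs that occur in more than one record."""
    counts = Counter(ego_id.strip() for ego_id in ego_ids)
    return sorted(ego_id for ego_id, n in counts.items() if n > 1)
```

If the returned list is not empty, the ego IDs are not unique across the data set and a famid field (or a field mapped to it) is needed.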

At the time the pedigree file is loaded, SOLAR indexes the data set. This indexing is internal and should not be confused with any external indexing the user may have imposed upon the data set. This indexing information is stored in a file named pedindex.out in the directory where SOLAR is running when the pedigree file is loaded. (Be careful about deleting files unless you are sure they are not needed by SOLAR!) SOLAR ignores supplied pedigree delineations (in famid or pedno fields) and builds up pedigrees from scratch using only the parental identifications supplied. SOLAR must be very strict about this because the requirements of linkage analysis are very strict. The field pedno in the pedindex.out file represents the delineation of pedigrees that SOLAR has created. A field named pedno in your pedigree file is ignored.

Once a pedigree file has been loaded, it is not necessary to load it again in later SOLAR sessions in the same working directory. When starting, SOLAR looks for pedindex.out and other files containing saved state information from previous SOLAR sessions in the current directory. The pedindex.out file is discussed further in Section 8.2.

The concise reference specification for the pedigree file format is shown by the command file-pedigree or the help message for it.

4.3 Phenotypes File

The phenotypes file consists of one record for each individual. Each record must include an ego ID and one or more phenotypic values (which may be blank to signify missing data) and may also include quantitative covariate factors such as age:

	ego ID, phen 1, phen 2, ...

Just as in the pedigree file, the ego ID may be specified with a field named id, a field named ego, or a field mapped to id with the field command. Also, a family ID is required when your ego IDs are not unique across the entire data set, and, if needed, it must be specified with a famid field or one mapped to it. The header for the example gaw10.phen phenotypes file (which does not require famid) looks like this:

        ID,AGE,EF,Q1,Q2,Q3,Q4,Q5

This header shows the ID, age, one quantitative environmental factor, and five quantitative phenotypes.

The phenotypes file may also contain other data, such as pedigree data. You could use one file as both your phenotype and your pedigree file, though that is not necessarily recommended.

If your data has probands and you wish to employ ascertainment correction, the phenotypes file must have a proband field, for which the default name is probnd. In this field, blank ( ) or zero (0) signifies non-proband, and anything else signifies proband. A decimal point is _not_ permitted after the zero. The presence of a proband field automatically turns on ascertainment correction during maximization. Ascertainment correction can be turned off with the command:

        field probnd -none 

The phenotype field names may be anything within certain rules (no spaces, tabs, or slashes; also certain special characters such as *#,^/-+ can cause problems in the names of phenotypes used as covariates). If you stick with alphabetic characters, numeric characters, and underscores you will be safe.

The phenotype data fields must always contain numbers, either with or without decimal points. Zero (0) is always considered a permissible value. Missing data must always be specified with a blank.

Floating or fixed point numbers must always include a decimal point; numbers without a decimal point might be assumed to be integers. Binary, discrete or categorical values should be indicated with consecutive integers (e.g. 0/1 or 1/2 or 2/3). SOLAR checks all phenotype fields to see if they contain only two consecutive integers and judges them discrete if they do, and quantitative otherwise. The presence or absence of a decimal point, or any type information in the header line or code file is ignored.
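The rule for telling discrete from quantitative fields can be sketched in Python as follows. This is an illustration of the rule as stated above, not SOLAR's code. The function name is_discrete is hypothetical, and the sketch assumes that a value such as 1.0 counts as the integer 1, since the presence of a decimal point is ignored.

```python
def is_discrete(values):
    """True if the non-blank values are exactly two consecutive integers."""
    distinct = set()
    for text in values:
        text = text.strip()
        if text == "":
            continue  # blank means missing data
        number = float(text)
        if not number.is_integer():
            return False  # a non-integer value makes the field quantitative
        distinct.add(int(number))
    if len(distinct) != 2:
        return False
    low, high = sorted(distinct)
    return high - low == 1
```

A field coded 0/1 or 1/2 is judged discrete, while a field coded 0/2, or one with three or more distinct values, is judged quantitative.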

Discrete variables with more than 2 classes cannot be used as traits in SOLAR. For use as covariates, they should be decomposed into multiple binary variables as described in Section 9.1.1.1.

The load phenotypes command creates a file named phenotypes.info in the current directory. Once a phenotypes file has been loaded, it need not be loaded again in the same directory, unless you change the file.

SOLAR automatically removes individuals missing any required phenotypic data and pedigrees in which all the non-probands are missing required phenotypic data from the maximization sample. (SOLAR cannot use a pedigree containing only probands, unless ascertainment correction is turned off with the command field probnd -none.) You need not remove these individuals or pedigrees yourself. You will get counts of the numbers of pedigrees and individuals included and excluded in the maximization output files (described below) or by giving the verbosity max command prior to other commands. Even if individuals are excluded from a maximization sample, they may still have some importance in having contributed to relationship information and/or genetic inferences.

An exception to the above paragraph relates to traits in a bivariate analysis. By default, an individual need only have one of the two specified traits to be included in the maximization sample.

The concise reference specification for the phenotypes file format is shown by the command file-phenotypes or the help message for it.

4.4 Marker File

The marker file contains genotype data for one or more marker loci. The file consists of one record for each individual who has been typed for one or more of these markers. Each record must contain the following fields:

	ego ID, genotype1, genotype2, ...

In addition, a family ID field must be included when ego IDs are not unique across the entire data set. If, however, each ego ID is unique to an individual and an individual may appear multiple times in the data set, then the family ID should not be included. The same genotypic data is then associated with every occurrence of an individual.

The default field names are id and famid. ego is also accepted by default in place of id. You can set up SOLAR to use different field names by using the field command.

The scheme used to encode genotypes may vary from field to field. SOLAR recognizes many standard coding schemes, but the safest way to code genotypes is with the forward slash to separate the alleles. For example:

	AB
	E1 E3
	123/456

A blank genotype field denotes missing data, as do the genotypes 0/0 and -/-. SOLAR requires that either both alleles are typed or both alleles are missing, except for male genotypes at X-linked marker loci. In that case, either a single allele is specified (the other allele is blank, 0, or -), or the genotype is coded as a homozygote.

	237/243   valid female X-linked marker genotype
	   /251   valid male X-linked marker genotype
	  251/0   valid male X-linked marker genotype
	  -/251   valid male X-linked marker genotype
	251/251   valid male X-linked marker genotype
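The rules for slash-separated genotypes can be sketched in Python as follows. This is an illustration only: the function name parse_genotype is hypothetical, only the forward slash coding is handled, and returning a single male X-linked allele in homozygote form is a choice made for this sketch rather than documented SOLAR behavior.

```python
MISSING_ALLELE = {"", "0", "-"}  # blank, 0, or - denotes an untyped allele

def parse_genotype(field, male_xlinked=False):
    """Return (allele1, allele2), or None if the genotype is missing."""
    field = field.strip()
    if field == "":
        return None  # blank genotype field: missing data
    left, sep, right = field.partition("/")
    if sep == "":
        raise ValueError(f"this sketch handles only slash-separated genotypes: {field!r}")
    a1, a2 = left.strip(), right.strip()
    typed = [a for a in (a1, a2) if a not in MISSING_ALLELE]
    if len(typed) == 0:
        return None  # e.g. 0/0 or -/-
    if len(typed) == 2:
        if male_xlinked and a1 != a2:
            raise ValueError(f"male X-linked genotype must have one allele: {field!r}")
        return (a1, a2)
    # Exactly one allele is typed: allowed only for males at X-linked loci.
    if male_xlinked:
        return (typed[0], typed[0])
    raise ValueError(f"both alleles must be typed or both missing: {field!r}")
```

With male_xlinked=True, the four male genotypes listed above all yield the same result, ('251', '251').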

The marker loci in the marker file must all be autosomal or all be X-linked. By default, SOLAR assumes that the markers are autosomal. If the markers are X-linked, then either the XLinked option must be set with the ibdoption command prior to loading the marker file, or the -xlinked option must be given in the load marker command.

Once a marker file has been loaded, it is not necessary to load it again in subsequent SOLAR runs from the same working directory.

The concise reference specification for the marker file format is shown by the command file-marker or the help message for it.

4.5 Map File

The map file contains chromosomal locations in cM for a set of marker loci on a single chromosome. The first line of the file contains the chromosome number, and (optionally) the name of the mapping function to be used to convert inter-marker distances to recombination fractions. Currently, the Kosambi and Haldane mapping functions are allowed. The default mapping function is Kosambi. The chromosome number can be any character string not containing a blank or a forward slash (/), although the use of integers is recommended. For example, the strings '01' and '15q' are allowed. Each line after the first line consists of the following space-delimited fields:

	marker name, marker location (in cM)

The markers must be in ascending order of location.

Example:

	20 Haldane
	D20S101         0.0
	D20S202        34.2
	D20S303        57.5

The concise reference specification for the map file format is shown by the command file-map or the help message for it.

4.6 Freq File

The freq file contains allele frequency data for a set of marker loci, one line per marker. Each line consists of the following space-delimited fields:

	marker name, all_1 name, all_1 freq, all_2 name, all_2 freq, ...

The allele frequencies for a marker must sum to 1 (a small roundoff error is tolerated).

Allele frequency information is used when IBDs are computed for a marker that is not completely typed, i.e. there are individuals for whom genotype data is not available.

Example:

	D20S101 123 0.2457 127 0.1648 133 0.5895
	IGF1 A 0.4 B 0.3 C 0.1 F 0.2
	ApoE E1 .125 E2 .25 E3 .625
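A line of a freq file can be parsed and its frequencies checked as in the following Python sketch. This is an illustration, not SOLAR's code. The function name parse_freq_line is hypothetical, and the tolerance of 1e-4 is an assumption, since the size of the roundoff error that SOLAR tolerates is not stated here.

```python
def parse_freq_line(line, tolerance=1e-4):
    """Parse one freq file line into (marker_name, {allele_name: frequency})."""
    tokens = line.split()
    marker, rest = tokens[0], tokens[1:]
    # After the marker name, tokens must come in (allele name, frequency) pairs.
    if len(rest) == 0 or len(rest) % 2 != 0:
        raise ValueError(f"{marker}: expected allele name/frequency pairs")
    freqs = {rest[i]: float(rest[i + 1]) for i in range(0, len(rest), 2)}
    total = sum(freqs.values())
    if abs(total - 1.0) > tolerance:  # tolerance is an assumed value
        raise ValueError(f"{marker}: allele frequencies sum to {total}, not 1")
    return marker, freqs
```

All three example lines above pass this check, since each set of frequencies sums to 1.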

Once a freq file has been loaded, it is not necessary to load it again in subsequent SOLAR runs from the same working directory.

The concise reference specification for the freq file format is shown by the command file-freq or the help message for it.

4.7 The Field Command

If the field names in your data files do not match those names expected by SOLAR, the easiest thing to do might be to edit the names in your data files. But if this is impossible or impractical, you can also use the field command to map the SOLAR built-in names to your names. Your field command settings are saved in a file named field.info so that they do not need to be re-entered each time you start SOLAR. (Field commands you have written to a file named .solar in the current or home directory will have even higher precedence.)

With no arguments, the field command shows all the fields whose mapping can be changed, and what their current mapping is.

        solar> field

        Default   Current      Meaning
        -------   -------      -------
        ID        ID            (Individual Permanent ID) [mandatory]
        FA        FA            (Father's Permanent ID) [mandatory]
        MO        MO            (Mother's Permanent ID) [mandatory]
        SEX       SEX           (M/F, m/f, or 1/2) [mandatory]
        PROBND    PROBND        [optional]
        MZTWIN    MZTWIN        [optional]
        FAMID     FAMID         [optional]
        HHID      HHID          [optional]

With two arguments, the command maps the built-in name (given first) to your own field name (given second). For example, the command

	field id person

would map the built-in name id to a user-provided name person. help field gives more details about the field command.