Galgor: Commands which deal with Genetic Algorithm and Monte Carlo.
# Michal Vieth,H. Daigler, C.L. Brooks III -Dec-15-1997 Initial release.
     The commands described in this node are associated with genetic
algorithm module for conformational searches and docking of small ligands to 
rigid proteins. The full description of the GA features is presented
in the paper "Rational approach to docking. Optimizing the search algorithm"
* Menu:
* Implementation::      A brief description of the anatomy of GA
* Syntax::              Syntax of the replication commands
* Description::         Description of key words and commands usage
* Restrictions::        Restrictions on usage
* Examples::            Supplementary examples of the use of GA
    Genetic Algorithm and Monte Carlo:  Description and Discussion
Name               Keyword               Module
GA setup           GALGOR SETUP          genetic.src
Genetic algorithm  GALGOR EVOLVE         genetic.src, genetic2.src
Monte Carlo        GALGOR EVOLVE MCARLO  genetic.src, genetic2.src  
This code was created by Michal Vieth, Heidi Daigler and  Charles Brooks III
at The Scripps Research Institute during the summer/fall of 1997
based on the code provided by Charles Brooks and Heidi Daigler, Department
of Chemistry, Carnegie Mellon University developed during the summer of 1994.
Its purpose is to enable monte carlo and genetic algorithm based conformational
searches to be performed on peptides/proteins, small organic molecules and 
docking of (small) ligands to their receptors.
It builds upon the replica
ideas of Leo Caves to make multiple copies of the system, i.e., the
chromosomes.  These chromosomes make up a population of molecular
conformations which are crossed and mutated according to the specific
genetic algorithm used.  It is recomended to use a generational update
version of genetic algorithm with elitism 1 or 2, and and island/nicheing
ideas to evolve subpopulations independently.
Each chromosome is represented by the set of
internal coordinates (and rigid body degrees of freedom if desired) deemed to
"evolve", i.e., the genes.  The genes are represented as real numbers. The
chromosome are crossed at a randomly chosen gene or random genes are
mutated with a given rate of mutation.  In addition migration of individuals
between subpopulations is used if the island model is employed.
From each population, the
members suitable for crossing, the "parents", are chosen based upon
their ranking, as measured by a priori chosen preference to select
higher ranked individuals. Evolutionary presure of 1.2 based of scaled fitness
following Jones, JMB 245, 43-53 is used for parents selection.
The potential energy function used can be all of CHARMM's energy
terms, including constraints, or any subset of them.  The user energy
term can be utilized to include any special functional form which is
particularly amenable to the GA search.  The current implementation
uses the standard CHARMM energy functions.  
   For docking the essential part of the energy function is the use
of soft core VDW potential that permits ligands to diffuse into the
protein.  The protein in docking is rigid, and at this time it is
not possible to include partial flexibility.
   Monte Carlo scheme uses the same engine, the only difference is that
the members of the entire population become completly separate entities
which do not exchange any information. Their evolution by simulated
annealing is similar to MCSS molecular dynamics.
   Entire approach to docking and comparison of performance of GA/MD/MC
is described in detail in two back to back papers submitted to J.Comp.Chem.
"Rational approach to docking I and II."
In the initial set-up stage, the GALGorithm keyword is parsed and the
main subroutine for the genetic algorithm is called from CHARMM
(GENETIC).  The keyword SETUp causes a number of things to occur.
First the number of chromosomes (replicas) and the atoms whose
internal coordinates are to be sampled and replicated are parsed.
Following this, the specific IC variables which are to be "evolved"
are selected by the VARIable_ICs sub-parser.  This is carried out in a
separate routine which is modeled after the routine used for
processing the IC edit command.  Finally, the replica subroutine from
Leo Caves is called to replicate the system and the original subsystem
atoms are deleted by a call equivalent to delete atoms select segid
orig end.  Each new segment generated by the replica command is given
the prefix C (for chromosome) followed by the number of that replica
(chromosome).  The apropriate pointer structures for
inclusion/exclusion of dihedral, angle and bond ICs are then created
and control is passed back to the CHARMM level, where specific
manipulations can be performed on the individual chromosomes, e.g., to
vary initial conformations around the initial progenerator of all
chromosomes.
Evolution of the population of chromosomes is controlled by calling
GALGorithm with the EVOLve keyword from CHARMM.  This uses a parser
which sets-up the control variables for the genetic algorithm
evolution and then runs the genetic evolution.  The specific functions
performed by this portion of the code are:
   1.  Parse parameters controlling the GA evolution.
GA follows the scheme:
   2.  Evolve the population of chromosomes.  This involves the following
       steps:
       a) Evaluate the energy of the population at step 0.
       b) Rank the population members in accordance with the viability.
       c) Choose parents to develop the next generation based on roulette
          wheel selection according to scaled fitness.
       d) Mate parents by crossover or  random mutations in genes, 
       e) Reconstruct (via IC build-like procedure) the cartesian positions of
          the new generation from modified ICs and if applicable from
          the modified values of the rigid degrees of freedom. 
       f) Evaluate energies of the new generation 
       g) if elitist
          model is being used transfer fittest parent into the new generation.
          If steady state/generational update is used replace least fit
          parents by children. if elitism is used retain some most fit parents.
          IF evolutionary strategy is pick the best members to form a new
          population
       h) Begin again at step b), cycle through to convergence or 
          MAXGenerations or NEVALuations
 
Monte Carlo proceeds via the following scheme:
   2.  Evolve the population of chromosomes.  This involves the following
       steps:
       a) Evaluate the energies of each replica 
       b) Generate new replica by mutations
       c) Reconstruct (via IC build-like procedure) the cartesian positions of
          the new generation from modified ICs and if applicable from
          the modified values of the rigid degrees of freedom. 
       d) Evaluate energies of the new replicas
       e) Accept new replicas with Boltzmann probability
       h) Begin again at step b), cycle through
          MAXGenerations or NEVALuations
                    Syntax for the Genetic ALGORithm command
           GALGOR  
setup:     {[SETUP] [CHROmosomes int] [atom selection] -
           {SEED 3x(<resnum> <atom>)}
           {[VARIable_IC] -
           {[DIHEdral] [IMPRO] [INCLude] [4x<atom selection]} {[DEPE] -
           [4x<atom selection] [OFFSet int]} -
           {{[BOND] [3x atom selection]} [ALL] [NONE]} -
           {{[ANGLE] [2x atom selection]} [ALL] [NONE]} -
           {TRAN} {ROTA} -
           [END]}
evolution: {[EVOLVE] -
GA parameters:
           [PARENTS int] [CHILDREN int]  -
           {[STEADY] [ELITE int]} [EPRES real] -
           [CROSsover_rate real] [MUTAtion_rate real] [GSIZE int] -
           [NICHes int] [INTEraction_frequency int] -
MC parameters :
           {[MCARlo] [TBEG real] [TEND real] [TFRQ int]
           [TJWAlking real] [IJWAlking real]} -
Evolution parameters:
           [MAXGenerations int] [NEVAluations int] [IPRINT int] -
           { [TOLE real] [TOLC real]} [PRIN_frequency int] -
Initialization parameters:
           {[RANDomization] [ISEED int] [RDIHE real] [RROTA real] -
           {[RTRA real] [RXTR real] [RYTR real] [RZTR real] [OFTR real]}} -
Step sizes:
           [TRANslation_step real] -
           [ROTAtion_step real] [BOSTep real] [ANSTep real] -
           [IMPRoper_step real] [PINTernal real] [PALL real] [PCALL real] 
           [PTALL real] -
           {[IBIG_step_frq int] [BTRAN real] [BROTA real] [BDIST real] -
Nonbonded parameters:
           {[RMIN real]} [QNOE] [NBFRQ real] - 
Exit parameters:
           [DELETE] [CLEAR] [LEAVE int]}}
     There are two basic parts of running genetic algorithm or Monte Carlo
in CHARMM:
Description of the basic key words of the genetic algorithm :
 
The following is the description of the setup commands for setting up the
system
Keyword/Syntax    Default   Purpose
SETUP                       setting up the data structure
CHRO              50        Number of chromosomes in a population
atom sele                   The atoms to be used as chromosomes
SEED 3x(<resnum> <atom>)    Specify seed atoms for ic builds
                            Default to use first three atoms in PSF
VARI                        Definition of the active variables - genes. 
                            This keyword acts in similar way to IC EDIT. It has
                            to be followed by definition of internal 
                            coordinates and end with END statement.
DIHE INCL <sele> ALL        Within VARI specifies the selection of dihedral 
                            angles to be used as active variables. May be
                            followed by DIHE DEPE.
DIHE DEPE <sele> OFFS value Within VARI specifies the dihedral that can be
                            computed from the value of the dihedral defined
                            immediately before by adding the OFFSet value,
                            This specifies dihedral dependency if two or more
                            quartets of atoms  describe rotation about the same
                            bond                      
DIHE IMPR <sele> ALL        Improper dihedrals to be used as active variables.
BOND <sele>      ALL        Bonds to be used as active variable
ANGLE <sele>     ALL        Angles to be used as active variables
Example:
   GALGOR SETUP -
   CHRO 50  SELE segid maa END -
   SEED 1 c 2 n 2 ca -
   VARIable_ICs
   DIHE  INCL  1 C     2 N     2 CA    2 C   end
   DIHE  DEPE  2 CA    2 C     3 N     3 CA  end OFFSET -2.0
   DIHE  IMPR  2 N     1 CL    1 C     1 O   end
   BOND ALL
   ANGLE ALL
   TRAN ROTA
   END
The following is the description of the setup commands for evolution of
the system
Keyword/Syntax    Default   Purpose
EVOLVE                      The beginning of the evolution setup
STEA              off       The type of evolutionary algorithm to be used.
                            IF STEAdy key word is absent evolutionary strategy
                            is used, in which the new population consists of
                            CHRO most fit individuals that are chosen from
                            2*CHRO parents and children. If STEA key word is
                            present generational (or steady state) update 
                            is used in which children replace least fit 
                            parents in the population. STEA is highly 
                            recomended for problems with two or more variables
                            to avoid premature convergence of the population. 
 
PARENTS           CHRO      The number of parents to be used in mating,
                            typically all chromosomes are permitted to be
                            parents
ELITE             2         So called elitism of the population - 
                            The number of parents to be retained in the next
                            generation. If STEA key word is used with elitism
                            equal to CHRO-2 a typical steady state update
                            is carried, otherwise the update is called 
                            generational. With no STEA key word in evolutionary
                            strategy elitism is not changing the 
                            evolution process.
CHILDREN          CHRO-ELIT The number of children created in each mating
EPRES             1.2       Evolutionary presure, this is defined as the 
                            relative probability to select the most fit 
                            individual with respect to average individual.
                            The scaled fitness is linear and based on Jones, 
                            JMB 245, 43-5
NICHes            1         The number of subpopulations (niches) to be evolved
                            independently, This is used as yet another mean
                            to avoid premature convergence.
INTE              100       Migration frequency between niches.  This is
                            indicates how frequently individuals migrate from
                            one subpopulation to another. THe essence of
                            migration is top replace NICHE-1 worst members
                            in each subpopulation by the best members of
                            other subpopulations
CROSsover         0.6       The probability of mating by crossover [0,1], 
                            higher values tend to cause faster convergence
MUTA              0.4       The probability of mating by mutation, higher
                            values prevent fast convergence. In general the
                            sum of MUTA and CROS would be 1.0, unless we
                            want to create children as clones of some parents.
                            The clones will make sense in generational update,
                            less sense in evolutionary strategy and no sense
                            in the steady state update.
GSIZE             1         The number of variables coding for a gene. Since
                            we use floating point numbers to represent a gene
                            1 is strongly recomended. Higher values would
                            affect the meaning of crossover changing it to
                            mutation if it occurred within a gene.
MCARLO            off       Turns on Monte Carlo procedure, instead of GA,
                            CHROM act now as separate replicas not interacting
                            with each other and each evolving by means of
                            mutations independently
TBEG              300.0     Initial temperature of the system in K
TEND              300.0     Final temperature in K
TFRQ              10        Temperature change frequency, the temperature 
                            change is calculated in the following way :
                            tchange=(TEND-TBEG)/(MAXG/TFRQ)
IJWALK            MAXG+1    Frequency of sampling at higher temperature, that
                            allows sporadically for higher acceptance ratio.
                            The use of this jwalker is not recommended unless
                            the system has critically low acceptance rate.
                            The acceptance rate is printed with energy in place
                            of GRMS printout
TJWALK            400.0     Temperature allowing for better acceptance ratio
                            turned on every IJWALK generations. This is turned
                            on only for replicas whose acceptance ratio is
                            less than 1%
                            
MAXGeneration     10        The total number of generations (or MC steps) for
                            which the system is to evolve
NEVAL             150000    The maximum number of energy evaluations performed
                            on the system
IPRINT            5         The number of chromosomes for which the energies
                            are to be printed
PRINT_freq        100       Printing frequency of chromosome energies. The 
                            chromosome are ranked based on energy, and
                            depending on the print level various terms are 
                            included. By default all energy terms are reported.
TOLCO             0.0       The value of the RMS deviation between the average
                            chromosomes and all other chromosomes (measured
                            by values of all genes) for which the evolution
                            is terminated due to convergence in variable space.
                            IF the value is zero the convergence is not checked
TOLER             0.0       The value of the RMS deviation between the energy 
                            of the average chromosome and all other chromosomes
                            for which the evolution due to convergence in 
                            energy. If the value is zero the convergence is
                            not checked.
RAND              off       IF this keyword is used the variables in all
                            chromosomes are randomized around their original
                            values with the spread equal to their step sizes
ISEED             31415     Seed number for the ranom generator
RDIHE             0.0       The spread for generating the initial value of 
                            dihedrals if RAND key word is used. With a default
                            value dihedrals will be randomized about their
                            initial values (with RAND present)
RROTA             0.0       The spread in initial values of euler angles 
                            generated randomly if RAND key word is used. The
                            values are in radians. IF the value is zero
                            euler angles will be randomized about their
                            initial values
RTRA              0.0       The maximum distance of the center of mass 
                            of molecules from the point specified by RXTR
                            RYTR RZTR for the random initial placement of 
                            molecules.
RXTR              0.0       The location of the center for generation
RYTR              0.0       of initial centers of masses 
RZTR              0.0   
OFTRA             0.2       The minimum distance of the center of mass 
                            of molecules from the point specified by RXTR
                            RYTR RZTR for the random initial placement of 
                            molecules.
TRAN              0.6       The magnitude of the maximum value for the 
                            step size of translations, the actual value is
                            calculated : (Random number-0.5)*TRAN
                            
ROTA              0.5       The magnitude of the maximum value for the 
                            step size of rotations, the actual value is
                            calculated : (Random number-0.5)*ROTA. The
                            values are in radians.
BOSTep            0.002     The magnitude of the maximum value for the 
                            step size of bond distance change
ANSTEP            2.0       The magnitude of the maximum value for the 
                            step size of bond angle change in degrees
IMPROPER          2.0       The magnitude of the maximum value for the 
                            step size of improper dihedral angle change 
                            in degrees
PINT              0.5       Probability to mutate internal degrees of 
                            freedom, the probability to mutate rigid                                       body degrees of freedom is 1-pint. This applies 
                            only for the system with active translations or
                            rotations, otherwise the value of PINT is 1.
PALL             0.0        Relative probability to mutate all 3 
                            translational or rotational degrees of
                            freedom, default 0 with respect to other
                            rigid body mutations
PCALl            0.0        Relative probability to mutate all 6 rigid
                            body degrees of freedom
PTALl            0.0        Probability to mutate all internal degrees
                            of freedom simultaneously
IBIGstep         MAXG+10    The frequency of mutations with big steps
BTRAN            0.6        The step for translations every IBIG steps
BROTA            0.5        The step for rotations every IBIG steps
BDIS             30.0       The step for dihedrals change every IBIG steps
QNOE             off        The flag turning on the NOE potential
NBFRQ            1          Nonbonded interaction list update frequency
RMIN             0.0        The soft core switching distance, the force
                            for all nonbonded interactions at RMIN*SIGMA
                            is the same as for the distances lower that 
                            RMIN*SIGMA. The two above flags turn on the soft
                            core VDW potential. The strongly recomended
                            value for RMIN in the initial stages of docking 
                            is 0.885.
LEAVE            CHROM      The number of chromosomes to be left at the end
                            of evolution
DELETE           off        Delete all chromosomes except the first LEAVE
CLEAR            off        CLearing all GA routines
Example for evolution of bonds, angles, improper dihedrals and dihedrals.
GALGorithm EVOLve -
STEA elite 2 EPRE 1.2 -
MAXG 1000 -
NICHE 1  50 GSIZE 1 MUTA 0.4 CROS 0.6 -
RAND ISEED 1423 -
PINT 1.0 TOLC 0.01  -
PRIN 10 DIST 20.0 IPRO 3.0 ANST 1.5 BOST 0.002 -
NBFR 10 DELETE CLEAR
1) Since the setup of GA module is based on the replica module (with deletion
   of the parent replica) the requirements for running GA code are identical 
   to the requirements for replicas. 
2) Energy has to be called before the evolution step. 
3) For any interaction of chromosomes with environment PSF (i.e. protein
   to which chromosomes are docked) it is assumed that chromosomes are
   in the PSF before the environment
4) Due to the existing bag in the nonobonded list generation after deletion
   of the parent replica PSF in order to avoid any errors in generation of
   nonbonded list with environment it is necessary to define PSF for the
   environment twice and then delete the first environment PSF:
replica nrep 10 sele segid ligand end
dele atom sele segid ligand end
or 
GALGOR SETUP chrom 10 sele segid ligand end
open unit 1 read form name "6Atim.pdb"
read sequ pdb unit 1
close unit 1
generate 3pti setup nodihe
open unit 1 read form name "6Atim.pdb"
read sequ pdb unit 1
close unit 1
generate 3ptb setup nodihe
dele atom sele segid 3pti end
   5) In the process of generation Cartesian coordinates from internal
      coordinates it is ASSUMED THAT a dihedral angle IS DEFINED in
      the IC table for the FIRST THREE atom in the PSF. If the first
      three atoms from PSF cannot be used as seeds RTF has to be modified 
      so the first three atoms define a dihedral angle in IC table . In a 
      later stage the GA_place routine should be modified to allow for 
      different atom seeding.
   6) In order to run consecutive evolutions of GA with different parameters
      or consecutive runs of MC no DELETE nor CLEAR key words can be used.
      Otherwise segmentation fault will be reported.
   7) IC FILL command has to be used when cartesian coordinated of chromosomes
      are read from external file if one wants to use initial values of 
      internal coordinates corresponding to the cartesian coordinates.
      Otherwise internal coordinates will be taken from the parent chromosome.
      GA_
   8) The supported energy terms are: USER, ANGLE, BOND, UREYB, DIHE, 
      IMDIHE, VDW, ELEC, NOE, CHARM, CDIHE.
   
Supplementary examples.
_________________________________________________________
Alanine dipeptide rigid bond/angle/improper setup:
_________________________________________________________
Read sequ card
* maa
*
3
AMN ALA CBX
Generate maa setup 
ic parameters  
ic seed 1 cl 1 c 1 o  !provide the three atoms to "seed" building
ic build  !build the structure based upon the internal coordinates (ics)
set npar 20
GAlgorithm SETUp -
   CHROmosomes 20 select segid maa end -
   VARIable_ICs
    DIHEdral INCLude  1 C     2 N     2 CA    2 C   end
    DIHEdral INCLude  2 N     2 CA    2 C     3 N   end
   END
Nbonds cdie eps 1.0 cutnb 99.0 ctofnb 90.0 wrnmxd 99.0 swit vswit
energy
ic fill
_________________________________________________________
Alanine dipeptide global minimum by generational update GA:
_________________________________________________________
GALGorithm EVOLve -
RAND iseed @num rdihe 180.0 -
  steady epres 1.2 -
  MAXGenerations 1000 -
  niches 2 inte 10 -
  gsize 1 muta 0.2  pinte 1.0 cone -
  print 10 Dist 30.0 anst 1.0 toler 0.01 delete leave 10 clear
Alanine dipeptide global minimum by evolutionary strategy:
_________________________________________________________
GALGorithm EVOLve -
RAND iseed @num rdihe 180.0 -
  MAXGenerations 1000 -
  niches 2 inte 10 -
  gsize 1 muta 0.2  pinte 1.0 cone -
  print 10 Dist 30.0 anst 1.0 toler 0.01 delete leave 10 clear
__________________________________________________________
Alanine dipeptide global minimum by simulated annealing MC:
_________________________________________________________
GALGorithm EVOLve -
mCarlo   Tbeg 500.0 Tend 250.0  tfrq 10 -
RAND iseed 3213  rdihe 360.0 -
  MAXGenerations 5000  -
  niches 2 inte 10 -
  gsize 1 muta 0.2  pinte 1.0 cone  -
  ibig 50 bdis 100.0 -
  print 10 Dist 20.0 anst 1.0 toler 0.01 delete clear
_________________________________________________________
GA docking of g3p to tim:
_________________________________________________________
setup:
set nchrom 150
set nliga @nchrom
set npar @nchrom
GAlgorithm SETUp -
   CHROmosomes @nchrom select segid g3p end -
   VARIable_IC
    dihe incl  1 O1    1 C2    1 C4    1 O7    end
    dihe depe  1 O1    1 C2    1 C4    1 C8    end offset -120.0
    dihe incl  1 C2    1 C4    1 C8    1 O10   end
    dihe depe  1 O7    1 C4    1 C8    1 O10   end offset -120.0
    dihe incl  1 C4    1 C8    1 O10   1 P14   end
    dihe incl  1 H1    1 O1    1 C2    1 C4    end
    dihe incl  1 C2    1 C4    1 O7    1 H2    end
    dihe depe  1 C8    1 C4    1 O7    1 H2    end offset 120.0
    dihe incl   1 C8    1 O10   1 P14   1 O15  end
    dihe depe   1 C8    1 O10   1 P14   1 O16  end offset 120.0
    dihe depe   1 C8    1 O10   1 P14   1 O17  end offset -120.0
rota tran
   END
mult nliga by 2
mult nchrom by 2
open unit 1 read form name "6Atim.pdb"
read sequ pdb unit 1
close unit 1
generate 3pti setup nodihe
open unit 1 read form name "6Atim.pdb"
read sequ pdb unit 1
close unit 1
generate 3ptb setup nodihe
dele atom sele segid 3pti end
open unit 1 read form name 6tim_m.crd
read coor ignore unit 1 append
close unit 1
cons fix sele segid 3pt* end
fast on
wrnlev -1
energy rdie eps 3.0 cutnb 13.5 ctofnb 8.0 ctonnb 6.0 swit vswit inbfr 5 -
qrmin rmin 0.885
GALGorithm EVOLve -
RAND iseed @num  rtran 11. rxtr -0.7 rytr -4.2 rztr -9. oftra 0.3 -
rrota 6.28 rdihe 360.0  -
steady  -
 elite 2 epres 1.2  -
 MAXGenerations 400  -
 niches 5 inte 100 nbfrq 40 pcall .3 pall 0.3  -
 ibigstep 40 btran 1.8  brota 1.8  bdist 40.0  -
 qrmin rmin 0.885 qnoe -
 gsize 1 muta 0.7 cross .3 tran 1. rota 1.  pint 0.3 - 
  print 50 Dist 30.0
energy rdie eps 3.0 cutnb 12.5 ctofnb 8.0 ctonnb 6.0 swit vswit inbfr 5
!!!!!!!!!!!!!!!!!
!!!!!!! EVOLUTION
!!!!!!!!!!!!!!!!!
GALGorithm EVOLve -
steady  -
nevalu 5000000 -
 elite 2 epres 1.1   -
 MAXGenerations 70  -
 niches 5 inte 25 nbfrq 50 pcall .2 pall 0.2  -
 ibigstep 50 btran 1.  brota 1. bdist 60.0 qrmin rmin 0.65 qnoe -
 gsize 1 muta 0.3 cross .7 tran 0.5 rota .4 pint 0.6 - 
 print 50 Dist 30.0 
GALGorithm EVOLve -
steady  -
nevalu 5000000 -
 elite 2 epres 1.1   -
 MAXGenerations 30  -
 niches 5 inte 5 nbfrq 30 pcall .2 pall 0.2  -
 ibigstep 30 btran 1.  brota 1. bdist 40.0  qnoe qrmin rmin 0.5 -
 gsize 1 muta 0.3 cross .7 tran 0.5 rota .4 pint 0.6 -
 print 50 Dist 30.0 delete clear leave 10
CHARMM Documentation / Rick_Venable@nih.gov