Quick start

From TurboGAP
Revision as of 07:36, 24 July 2021 by Miguel Caro (talk | contribs)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

This is a quick start guide to fitting and using GAP potentials using an atomic database and some combination of descriptors. You can do this with QUIP/GAP alone, or add TurboGAP support for faster and more accurate many-body descriptors of the SOAP type. You can run some of these potentials with TurboGAP, provided that you only use distance_2b, angle_3b or soap_turbo descriptors. In addition, you can also use your GAP to run molecular dynamics simulations on LAMMPS with the QUIP-LAMMPS interface, although some advanced functionality like van der Waals corrections is not currently supported on LAMMPS.



  • You need either your own or a preexisting atomic database. If you are generating your own, you need (usually) a DFT package that can predict energy and forces for given atomic configurations. Examples of such packages are VASP, GPAW, Quantum Espresso, ABINIT, FHI AIMS, CASTEP, CP2K, etc.
  • You need a working QUIP and GAP installation. QUIP is for free, and GAP is for free for non-commercial research. Below are some quick instructions on how to get going with this.


Optional packages are not required but they improve the ease/scope of making and using GAP potentials. Don't be lazy and install them (we will use them at discretion throughout the guide below).

  • The Atomic Simulation Environment (ASE) is not required but it is always a useful addition to an atomistic toolset, and particularly useful to generate the XYZ database files used by QUIP/GAP.
  • A TurboGAP installation if you want to use soap_turbo many-body descriptors, which are faster and more accurate than the default SOAP descriptors. In addition, TurboGAP supports SOAP compression (making it even faster). TurboGAP cannot (currently) be used on its own to training a GAP: it is used as a library in combination with QUIP/GAP. TurboGAP is free for non-commercial research. TurboGAP can also be used to run parallel MD calculations, but the number of available implementations of integrators, thermostats, barostats, etc., is way below what LAMMPS can offer (see next point), as is the efficiency and scalability of the parallel implementation. That said, for specific problems with the right balance between computation and communication, TurboGAP can run GAP-based MD significantly faster than LAMMPS (about an order of magnitude faster).
  • A LAMMPS installation, if you want to use your GAP potentials to run large-scaling molecular dynamics simulations. A custom installation of LAMMPS with QUIP support is required (more details below). At the moment some advanced functionality like van der Waals corrections is not available through the LAMMPS interface.

Building a database

The most essential requirement to train a ML potential, be it GAP or something else, is training data. For QUIP, a database of atomic configurations is an extended XYZ file with atomic structures (i.e., atomic coordinates), accompanied by energies, forces and/or virials. A single XYZ file, e.g., database.xyz, contains a concatenation of individual atomic structures. What kinds of configurations are contained here and how well they span the regions of configuration space that are relevant to our potential will be absolutely critical in determining the accuracy of our trained GAP.

Good databases for general-purpose interatomic potentials contain a combination of crystalline and distorted structures, dimer, trimers, surfaces, liquids, etc. Specific-purpose interatomic potentials can, on the other hand, focus on particular structures. Our "training set" may contain a lot of structures, of the order of thousands of simulation boxes and hundreds of thousands of unique atomic environments. Not all of these will be used for the kernel-ridge regression during production, only the sparse set will be used for that; however, the full training set will be used to find the optimal regression coefficients.

As an example, let us use GPAW, in combination with ASE, to compute energies and forces for a bunch of elemental phosphorus structures. Note that this is just to get to grips with the workflow; you can generate the training data with another DFT code (although ASE is always handy for parsing the output of the calculation). E.g., if you use VASP, you can import the OUTCAR with ASE (atoms = read("OUTCAR")) and the energy, forces, etc., will be imported to the ASE Atoms() object.

But let us get back to generating a (rather crappy) database of P structures with GPAW. The code below will compute the energy for an isolated atom, assign it label "isolated_atom" and write it to file; then, for 20 configurations corresponding to the P2 dissociation curve; then, 50 very stupidly constructed bulk configurations in periodic boundary conditions. Note that we are using an LCAO basis set since we want to do this fast, and we are not leaving a lot of vacuum around the non-periodic structure. If you hope to publish your database, you should fix this and other problems with this example.

 1 import numpy as np
 2 from ase import Atoms
 3 from ase.io import read,write
 4 from gpaw import GPAW
 7 # Compute an isolated P atom
 8 if 1:
 9     print("Doing isolated atom...")
10     atoms = Atoms("P", positions=[[0,0,0]], pbc=False)
11     atoms.center(vacuum=4.)
12     calc = GPAW(xc="PBE", txt="gpaw.out", spinpol=True,
13                 mode="lcao", basis="dzp")
14     atoms.set_calculator(calc)
15     e = atoms.get_potential_energy(force_consistent=True)
16     f = atoms.get_forces()
17     atoms.info["config_type"] = "isolated_atom"
18     write("database/isolated_atom.xyz", atoms)
21 # Do a dimer curve
22 if 1:
23     list = np.arange(1., 5., 0.2)
24     for i in range(0, len(list)):
25         print("Doing %i/%i dimer configurations..." % (i, len(list)))
26         d = list[i]
27         atoms = Atoms("P2", positions = [[0,0,0], [d,0,0]], pbc=False)
28         atoms.center(vacuum = 4.)
29         calc = GPAW(xc="PBE", txt="gpaw.out", mode="lcao", basis="dzp")
30         atoms.set_calculator(calc)
31         e = atoms.get_potential_energy(force_consistent=True)
32         f = atoms.get_forces()
33         atoms.info["config_type"] = "dimer"
34         write("database/dimer_%i.xyz" % i, atoms)
37 # Do some highly distorted bulk structures in cubic boxes
38 # From high to low density, 10 structures at each density
39 # with 8 atoms in each of them
40 if 1:
41     n = 0
42     nstruc = 10
43     natoms = 8
44     list = np.arange(3.5, 6., 0.5)
45     ncalc = nstruc*len(list)
46     for L in list:
47         cell = [L, L, L]
48         atoms = Atoms("P", positions=[[0,0,0]], cell = cell, pbc=True)
49 #       We add nstruc structures
50         for i in range(0,nstruc):
51 #           natoms atoms to each structure
52             while len(atoms) < natoms:
53                 j = len(atoms)
54                 atom_too_close = True
55                 while atom_too_close:
56                     pos = L * np.random.sample(3)
57                     atoms_temp = atoms + Atoms("P", positions=[pos])
58                     atom_too_close = False
59                     for k in range(0,j):
60                         d = atoms_temp.get_distance(k, j, mic=True)
61                         if d < 1.5:
62                             atom_too_close = True
63                             break
64                 atoms = atoms_temp.copy()
65 #           Do the calculation
66             calc = GPAW(xc="PBE", txt="gpaw.out", mode="lcao", basis="dzp")
67             atoms.set_calculator(calc)
68             print("Doing %i/%i random configurations..." % (n, ncalc))
69             e = atoms.get_potential_energy(force_consistent=True)
70             f = atoms.get_forces()
71             atoms.info["config_type"] = "random"
72             write("database/random_%i.xyz" % n, atoms)
73             n += 1

We have given a label ("config_type") to structures generated in different ways because this will allow us to provide per-configuration regularization parameters, a nice feature, and very useful, of QUIP/GAP. Importantly, you should put all of the individual XYZ files into a single "database" XYZ file, which is easily achieved by concatenating all of them. For example, given the files generated above, you could do (adjusting the ranges as needed):

1 cat database/isolated_atom.xyz > database.xyz
2 for i in $(seq 0 1 19); do cat database/dimer_${i}.xyz >> database.xyz; done
3 for i in $(seq 0 1 49); do cat database/random_${i}.xyz >> database.xyz ; done

If you want to avoid running the codes above for whatever reason, you can download the crappy database here: database.xyz.

Installing TurboGAP (optional step)

If you want to be able to use TurboGAP functionality, e.g., soap_turbo descriptors (which are faster and more accurate than regular soap) you need to build the TurboGAP library to be used in combination with QUIP/GAP. Note that you need to have Blas/Lapack libraries locally installed.

To proceed with the installation, clone the TurboGAP repo:

 1 # Create and cd into a directory
 2 mkdir /myapps/turbogap
 3 cd /myapps/turbogap
 5 # Clone the turbogap development repo
 6 git clone --recursive https://github.com/mcaroba/turbogap.git turbogap_dev
 8 # cd and make the code (you may want to edit the Makefile, load compiler modules, etc., depending on the specifics of your system)
 9 cd turbogap_dev
10 make

This should have built the TurboGAP binaries and libraries. The git --recursive flag will take care of cloning also the GAP_turbo source code.

Installing QUIP/GAP

To install QUIP and GAP (they are two separate packages with two separate licenses) you need to get QUIP from libAtoms or clone the git repo (it is probably best/safest to get the development version to make sure all functionality is available):

1 # Create and cd into a directory
2 mkdir /myapps/quip
3 cd /myapps/quip
5 # Clone the QUIP development repo
6 git clone --recursive https://github.com/libAtoms/QUIP.git quip_dev
8 # cd into the quip directory
9 cd quip_dev

The git --recursive flag will take care of cloning also the GAP source code.

Now you are ready to build QUIP/GAP. First, you need to select your architecture. For a list of possible architectures, do ls arch. If you want to use QUIP/GAP together with TurboGAP, make sure that your chosen compiler is compatible (e.g., you use gfortran to build both codes). The architecture is selected by exporting the QUIP_ARCH environment variable. After that, you make config. You'll be asked whether or not to install a bunch of functionality. Don't get crazy with your ENTER key, you need to select "y" (yes) for GAP (the default is "n") and, if you're planning to use TurboGAP functionality, you need to provide QUIP with the TurboGAP root directory, /myapps/turbogap/turbogap_dev in the example above. Optionally, if you want to build LAMMPS with QUIP functionality, you need to make the QUIP library. Example:

 1 # cd up from the src directory
 2 cd ..
 4 # Export your architecture
 5 export QUIP_ARCH="linux_x86_64_gfortran"
 7 # Configure the installation (make sure to 1) select "yes" to GAP and 2) provide the TurboGAP root directory if you want TurboGAP support)
 8 make config
10 # Build the code
11 make
13 # Optional step if you want to use QUIP with LAMMPS
14 make libquip

The code will be built into a build/your_architecture directory (e.g., build/linux_x86_64_gfortran), where you will find all the relevant binaries, e.g., quip, gap_fit, etc. Make sure to add that directory to your PATH.

Fitting a GAP potential

Now, equipped with our crappy P database (or your own fancy database, of course) we can get hands on with the task of fitting a GAP potential. GAP fitting is done with the gap_fit command. These are the needed ingredients:

  • Descriptors. For each descriptor, a "sub-GAP" is fitted. That is, a GAP is actually (generally speaking) a sum of GAPs. You can combine N-body descriptors (e.g., two-body, three-body, etc.) with many-body (SOAP) descriptors, use more than one SOAP, use only N-body, ... the possible combinations are endless. We can start out with a combination of one two-body and one many-body (SOAP) descriptors, for our crappy P potential.
  • A kernel definition. How do you use the descriptors to construct a kernel? Typical choices are Gaussian kernels (squared exponentials) for N-body descriptors and dot-product kernels for SOAP descriptors.
  • Sparse set. The number of "reference" configurations that are used to construct the kernel. During production, you make predictions by comparing your current descriptor to all of those in the sparse set. During training, you compare all of your training descriptors (including those in the sparse set) to the sparse set descriptors. Each sub-GAP has its own sparse set. The cost of evaluating each sub-GAP grows (approximately) linearly with this number.
  • Sparsification technique. How do you choose the sparse set configurations? Common ways are CUR decomposition and plain-and-simple random.
  • Regularization.

[Section under construction]

This is an example gap_fit command using a soap many-body descriptor:

 1 # force should be "force" or "DUMMY"
 3 forces=0
 5 if [ $forces -eq 1 ]; then
 6 virial_name="DUMMY"
 7 force_name="force"
 8 energy_name="free_energy"
 9 else
10 virial_name="DUMMY"
11 force_name="DUMMY"
12 energy_name="free_energy"
13 fi
15 gap_fit atoms_filename=database.xyz \
16         gap={ distance_Nb order=2 compact_clusters=T cutoff=5.0 n_sparse=20 covariance_type=ard_se \
17                   delta=0.5 theta_uniform=1.0 sparse_method=uniform : \
18               soap l_max=8 n_max=8 atom_sigma=0.5 zeta=4 cutoff=5.0 \
19                   central_weight=1.0 config_type_n_sparse={random:100} \
20                   delta=0.1 f0=0.0 covariance_type=dot_product sparse_method=cur_points } \
21         default_sigma={0.002 0.2 0.2 0.2} \
22         config_type_sigma={dimer:0.0002:0.02:0.02:0.02:random:0.002:0.2:0.2:0.2} \
23         energy_parameter_name=${energy_name} force_parameter_name=${force_name} virial_parameter_name=${virial_name} \
24         sparse_jitter=1.0e-8 do_copy_at_file=F sparse_separate_file=T gp_file=crappy_P.xml

This command will run for some time. In the example above, I have disabled using forces, because the can use a lot of memory, and not fit into your crappy workstation. For publication-grade GAPs, we usually need to use a computer cluster with "fat memory" nodes. But if your database is small and/or you leave forces out, you may be able to train a GAP on a small machine. In the future it is expected that GAP fit support will be extended to use distributed memory (across compute nodes). Currently, only access to the same physical memory is supported (i.e., one can only run gap_fit on a single compute node.

Whether you trained with or without forces, gap_fit will produce a (series of) XML file (starting by "crappy_P.xml" in the example above) with the information necessary to evaluate your GAP and run GAP MD.

Testing your potential with QUIP

Installing LAMMPS (optional step)

Running molecular dynamics with LAMMPS (optional step)