1 change: 1 addition & 0 deletions docs/source/howto/index.rst
@@ -22,4 +22,5 @@ How-To Guides
installation
plugins_develop
cookbook
tricks_real_world_runs
faq
296 changes: 296 additions & 0 deletions docs/source/howto/tricks_real_world_runs.rst
@@ -0,0 +1,296 @@
.. _how-to:real-world-tricks:

========================================
Real-world calculations: tips and tricks
========================================

This how-to page collects tips and tricks that are useful for real-world AiiDA simulations.

How to provide specific configuration to a calculation job
==========================================================

When submitting a job, you can provide specific instructions to be included in the job submission script (e.g. for a SLURM scheduler).
All these additional instructions can be provided via the ``metadata.options`` dictionary of the calculation job.

In particular, you can provide:

- custom scheduler directives, such as additional ``#SBATCH`` lines for SLURM schedulers
- text prepended to the job script (e.g. to load specific modules or set environment variables that are not already specified in the computer/code setup)
- additional ``mpirun`` parameters (e.g. to bind processes to cores)


Basic pattern
-------------

These options are set on the process builder under ``builder.metadata.options`` and are available for all ``CalcJob`` plugins (e.g. the ``PwCalculation`` of ``aiida-quantumespresso``):

.. code-block:: python

   builder = PwCalculation.get_builder()

   # Required scheduler resources (example: 2 nodes x 16 MPI processes per node)
   builder.metadata.options.resources = {
       'num_machines': 2,
       'num_mpiprocs_per_machine': 16,
   }

   # Optional scheduler settings
   builder.metadata.options.max_wallclock_seconds = 2 * 60 * 60  # 2 hours
   builder.metadata.options.queue_name = 'debug'    # scheduler queue/partition
   builder.metadata.options.account = 'proj123'     # accounting/project (if required)

If you are submitting a ``PwBaseWorkChain`` instead, these options should be set inside the ``pw`` input namespace:

.. code-block:: python

   builder = PwBaseWorkChain.get_builder()

   # Required scheduler resources (example: 2 nodes x 16 MPI processes per node)
   builder.pw.metadata.options.resources = {
       'num_machines': 2,
       'num_mpiprocs_per_machine': 16,
   }

   # Optional scheduler settings
   builder.pw.metadata.options.max_wallclock_seconds = 2 * 60 * 60  # 2 hours
   builder.pw.metadata.options.queue_name = 'debug'    # scheduler queue/partition
   builder.pw.metadata.options.account = 'proj123'     # accounting/project (if required)

Custom scheduler directives (e.g. extra ``#SBATCH``)
----------------------------------------------------

Use ``custom_scheduler_commands`` to inject raw scheduler lines near the top of the submit script (before any non-scheduler command):

.. code-block:: python

   builder.metadata.options.custom_scheduler_commands = """
   #SBATCH --constraint=mc
   #SBATCH --exclusive
   #SBATCH --hint=nomultithread
   """.strip()

Notes:

- Keep the lines valid for your scheduler (Slurm here; adapt to PBS/LSF/etc.).
- Use this when a directive is not covered by a dedicated option.
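
For illustration, with the resources from the basic pattern above (2 nodes x 16 tasks, 2 hours of walltime) plus these custom directives, the beginning of the generated SLURM submit script would look roughly as follows. This is a hedged sketch: the exact set and order of lines (and the job name shown here) depends on the scheduler plugin, the AiiDA version, and your computer setup.

.. code-block:: bash

   #!/bin/bash
   #SBATCH --no-requeue
   #SBATCH --job-name="aiida-1234"
   #SBATCH --output=_scheduler-stdout.txt
   #SBATCH --error=_scheduler-stderr.txt
   #SBATCH --partition=debug
   #SBATCH --account=proj123
   #SBATCH --nodes=2
   #SBATCH --ntasks-per-node=16
   #SBATCH --time=02:00:00
   #SBATCH --constraint=mc
   #SBATCH --exclusive
   #SBATCH --hint=nomultithread

Note that the custom directives are appended after the directives generated from the dedicated options.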


Prepend/append shell text to the job script
-------------------------------------------

Use ``prepend_text`` to add shell commands immediately before launching the code, and ``append_text`` for commands executed right after the code finishes:

.. code-block:: python

   builder.metadata.options.prepend_text = """
   echo "Run started on $(hostname) at $(date)"
   """.strip()

   builder.metadata.options.append_text = """
   echo "Run finished on $(hostname) at $(date)"
   """.strip()

Tip: for simple environment variables you can also use ``environment_variables`` (AiiDA will export them for you):

.. code-block:: python

   builder.metadata.options.environment_variables = {
       'OMP_NUM_THREADS': '1',
   }
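
To see what this option amounts to, here is an illustrative re-implementation (not AiiDA's actual code) of how the entries are turned into ``export`` lines in the submit script; single quotes are used by default, double quotes if ``environment_variables_double_quotes`` is set:

.. code-block:: python

   def render_exports(environment_variables, double_quotes=False):
       """Illustrative sketch of how environment variables become export lines."""
       quote = '"' if double_quotes else "'"
       return [
           f'export {name}={quote}{value}{quote}'
           for name, value in environment_variables.items()
       ]

   print(render_exports({'OMP_NUM_THREADS': '1'}))
   # -> ["export OMP_NUM_THREADS='1'"]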



Extra parameters to mpirun (or equivalent)
------------------------------------------

Set ``mpirun_extra_params`` to pass flags to the MPI launcher in addition to the computer's configured ``mpirun_command``:

.. code-block:: python

   # Example for OpenMPI process binding
   builder.metadata.options.mpirun_extra_params = [
       '--bind-to', 'core', '--map-by', 'socket:PE=2',
   ]

.. note::

   ``mpirun_extra_params`` is a list/tuple of strings; AiiDA will join them with spaces. Keep launcher-specific flags consistent with your cluster (OpenMPI, MPICH, ``srun``, etc.).
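
As a sketch of what this means for the launch line (illustrative values: the actual ``mpirun_command`` comes from the computer configuration, and the executable and input file names from the code and plugin):

.. code-block:: python

   mpirun_command = ['mpirun', '-np', '32']  # assumed Computer configuration
   mpirun_extra_params = ['--bind-to', 'core', '--map-by', 'socket:PE=2']

   # The launcher pieces are joined with spaces when the submit script is written
   launch_line = ' '.join(mpirun_command + mpirun_extra_params) + ' pw.x -in aiida.in'
   print(launch_line)
   # -> mpirun -np 32 --bind-to core --map-by socket:PE=2 pw.x -in aiida.in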


Full list of metadata available
-------------------------------

Here is the full list of options that can be set in ``builder.metadata``.

.. dropdown:: Click to see all available metadata options

   The following fields can be set on ``builder.metadata``:

   - call_link_label (str): The label to use for the CALL link if the process is called by another process.
   - computer (Computer | None): When using a "local" code, set the computer on which the calculation should be run.
   - description (str | None): Description to set on the process node.
   - disable_cache (bool | None): Do not consider the cache for this process, ignoring all other caching configuration rules.
   - dry_run (bool): When set to True, prepare the calculation job for submission but do not actually launch it.
   - label (str | None): Label to set on the process node.
   - options (Namespace):

     - account (str | None): Set the account to use for the queue on the remote computer.
     - additional_retrieve_list (list | tuple | None): Relative file paths to retrieve in addition to what the plugin specifies.
     - append_text (str): Text appended to the scheduler-job script just after the code execution.
     - custom_scheduler_commands (str): Raw scheduler directives inserted before any non-scheduler command (e.g. extra ``#SBATCH`` lines).
     - environment_variables (dict): Environment variables to export for this calculation.
     - environment_variables_double_quotes (bool): If True, use double quotes instead of single quotes to escape ``environment_variables``.
     - import_sys_environment (bool): If True, the submission script will load the system environment variables.
     - input_filename (str): Name of the main input file written to the remote working directory.
     - max_memory_kb (int | None): Maximum memory in kilobytes to request from the scheduler.
     - max_wallclock_seconds (int | None): Wallclock time in seconds requested from the scheduler.
     - mpirun_extra_params (list | tuple): Extra parameters passed to the MPI launcher in addition to the computer's configured command.
     - output_filename (str): Name of the primary output file produced by the code.
     - parser_name (str): Entry point name of the parser to use for this calculation.
     - prepend_text (str): Text prepended in the scheduler-job script just before the code execution.
     - priority (str | None): Job priority (if supported by the scheduler).
     - qos (str | None): Quality of service to use for the queue on the remote computer.
     - queue_name (str | None): Name of the queue/partition on the remote computer.
     - rerunnable (bool | None): Whether the job can be requeued/rerun by the scheduler.
     - resources (dict) [required]: Scheduler resources (e.g. number of nodes, CPUs, MPI processes per machine). The exact keys are scheduler-plugin dependent.
     - scheduler_stderr (str): Filename to which the scheduler stderr stream is written.
     - scheduler_stdout (str): Filename to which the scheduler stdout stream is written.
     - stash (Namespace): Directives to stash files after the calculation completes.

       - dereference (bool | None): Whether to follow symlinks while stashing (applies to certain stash modes).
       - source_list (tuple | list | None): Relative filepaths in the remote directory to be stashed.
       - stash_mode (str | None): Mode with which to perform stashing; a value of ``aiida.common.datastructures.StashMode``.
       - target_base (str | None): Base location to stash files to (e.g. an absolute path on the remote computer for the copy mode).

     - submit_script_filename (str): Filename to which the job submission script is written.
     - withmpi (bool): Whether to run the code with the MPI launcher.
     - without_xml (bool | None): If True, the parser will not fail if a normally expected XML file is missing in the retrieved folder (plugin-dependent).

   - store_provenance (bool): If False, provenance will not be stored in the database (use with care).


Understand the builder structure
================================

When you are running a complex workflow, it is often useful to know which inputs it (or rather, its builder) accepts.
This is particularly useful when you are using a workflow for the first time, or when it contains many nested sub-workchains.
You can use the following in a ``verdi shell`` to print the structure of inputs accepted by a workflow (or any process class):

.. code-block:: python

   from aiida_quantumespresso.workflows.pw.base import PwBaseWorkChain
   PwBaseWorkChain.spec().get_description()['inputs'].keys()
   # -> dict_keys(['_attrs', 'metadata', 'max_iterations', 'clean_workdir', 'handler_overrides', 'pw', 'kpoints', 'kpoints_distance', 'kpoints_force_parity'])

Or via tab completion:

.. code-block:: python

   builder = PwBaseWorkChain.get_builder()
   builder.<TAB>

.. code-block:: text

   _attrs            handler_overrides   kpoints_distance   max_iterations
   clean_workdir     kpoints             metadata           pw
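
Because ``spec().get_description()`` returns plain nested dictionaries, a nested namespace such as ``pw`` can be explored with ordinary dictionary access, e.g. ``PwBaseWorkChain.spec().get_description()['inputs']['pw'].keys()``. The pattern is sketched below with a small mock description dictionary (the real one, produced by AiiDA, is much larger):

.. code-block:: python

   # Mock stand-in for the dictionary returned by spec().get_description()
   description = {
       'inputs': {
           'clean_workdir': {'valid_type': 'Bool'},
           'pw': {  # a nested namespace is itself a dictionary of ports
               'code': {'valid_type': 'AbstractCode'},
               'parameters': {'valid_type': 'Dict'},
               'structure': {'valid_type': 'StructureData'},
           },
       },
   }

   print(sorted(description['inputs']['pw'].keys()))
   # -> ['code', 'parameters', 'structure']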


How to interactively explore the provenance of a node
=====================================================

If a calculation or workflow node is stored in the database, it is possible to explore its provenance interactively via the ``verdi shell`` (or a Jupyter notebook).
For example, to explore the provenance of a calculation with pk ``<pk>``:

.. code-block:: python

   from aiida import orm
   pw_calc = orm.load_node(<pk>)

   pw_calc.inputs.<TAB>
   # -> dict_keys(['code', 'kpoints', 'settings', 'parameters', 'parent_folder', 'pseudos', 'structure'])

   pw_calc.outputs.<TAB>
   # -> dict_keys(['output_parameters', 'output_structure', 'output_trajectory', 'retrieved', 'remote_folder'])

It is possible to inspect, for example, the creator of a given ``remote_folder`` (in this case, ``pw_calc`` itself):

.. code-block:: python

   remote_folder = pw_calc.outputs.remote_folder

   remote_folder.creator
   # -> <CalcJobNode: uuid: 'a1b2c3d4-e5f6-7g8h-9i0j-k1l2m3n4o5p6' (pk: 123) (aiida.calculations:quantumespresso.pw)>

   remote_folder.creator.pk
   # -> 123

   remote_folder.creator.process_type
   # -> 'aiida.calculations:quantumespresso.pw'

From the ``creator``, it is possible to go back to its inputs and outputs, and so on.
It is also possible to find the higher-level workflow that called a given calculation via the ``caller`` attribute:

.. code-block:: python

   pw_calc.caller
   # -> <WorkChainNode: uuid: 'z1y2x3w4-v5u6-t7s8-r9q0-p1o2n3m4l5k6' (pk: 456) (aiida.workflows:quantumespresso.pw.base)>

   pw_calc.caller.pk
   # -> 456

   pw_calc.outputs.remote_folder.creator.caller.pk
   # -> 456
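
Following ``caller`` repeatedly walks up the provenance until the top-level workflow is reached (``caller`` is ``None`` for a top-level process). The idea is sketched below with mock objects standing in for AiiDA process nodes:

.. code-block:: python

   class MockNode:
       """Minimal stand-in for an AiiDA process node exposing `caller`."""
       def __init__(self, pk, caller=None):
           self.pk = pk
           self.caller = caller

   top = MockNode(789)                 # top-level workflow
   base = MockNode(456, caller=top)    # sub-workchain called by `top`
   calc = MockNode(123, caller=base)   # calculation called by `base`

   def top_level_caller(node):
       """Walk up the caller chain until the outermost process is reached."""
       while node.caller is not None:
           node = node.caller
       return node

   print(top_level_caller(calc).pk)
   # -> 789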


How to quickly inspect a calculation
=====================================

There are a few ways to inspect the raw inputs/outputs of a calculation as read/written by the executable.

Go to the remote folder of a calculation
----------------------------------------

If a calculation failed or behaved unexpectedly, it is often useful to inspect its working directory on the remote computer. To open the remote folder of a calculation with pk ``<pk>``, use:

.. code-block:: console

   verdi calcjob gotocomputer <pk>

This opens an SSH session directly in the remote working directory of the calculation.

Dump the retrieved files of a calculation
-----------------------------------------

If you want to inspect the retrieved files of a calculation, you can use:

.. code-block:: console

   verdi process dump <pk>

This creates a folder in your current directory containing the inputs and all the retrieved files of the calculation.
It is particularly useful for inspecting the files of a failed calculation, or for re-running the calculation locally (or elsewhere) for debugging.

Once you have checked that a calculation failed and understood what happened, you may want to re-submit it: see :ref:`how-to:quick-restart` below.

.. _how-to:quick-restart:

How to quickly re-submit something: get_builder_restart()
=========================================================

If you want to re-submit a process (calculation or workflow), e.g. because it failed due to wrong inputs or insufficient resources, you can use the ``get_builder_restart()`` method of the process node.
This is particularly useful for complex workflows with many inputs, where you do not want to build the process builder from scratch again.
``get_builder_restart()`` returns a process builder pre-populated with all the inputs of the previous run, so that you can modify only what needs to change and then submit it again.
For example, to re-submit a workchain with pk ``<pk>``:

.. code-block:: python

   from aiida import orm
   from aiida.engine import submit

   failed_pw_base_workchain = orm.load_node(<pk>)
   builder = failed_pw_base_workchain.get_builder_restart()

   # Modify the builder if needed
   builder.pw.metadata.options.max_wallclock_seconds = 4 * 60 * 60  # 4 hours

   new_calc = submit(builder)