Batch job schedulers manage the job queues and execution on a compute resource. AiiDA ships with plugins for a range of schedulers, and this section describes the interface of these plugins.
Follow these instructions to add support for a custom scheduler.
The PBSPro scheduler is supported (tested: version 12.1).
All the main features are supported with this scheduler.
Use the NodeNumberJobResource (PBS-like) when setting job resources.
The SLURM scheduler is supported (tested: version 2.5.4).
The SGE scheduler (Sun Grid Engine, now called Oracle Grid Engine) and some of its main variants/forks are supported (tested: version GE 6.2u3).
Use the ParEnvJobResource (SGE-like) when setting job resources.
The IBM LSF scheduler is supported (tested: version 9.1.3 on the CERN lxplus cluster).
Torque (based on OpenPBS) is supported (tested: version 2.4.16 from Ubuntu).
The direct scheduler plugin simply executes the command in a new bash shell, puts it in the background and checks for its process ID (PID) to determine when the execution is completed.
Its main purpose is debugging on the local machine. Use a proper batch scheduler for any production calculations.
Warning
Compared to a proper batch scheduler, direct execution mode is fragile. In particular:
- There is no queueing, i.e. all calculations run in parallel.
- PID numbering is reset during reboots.
Do not use the direct scheduler for running on a supercomputer. The job will end up running on the login node (which is typically forbidden), and if your centre has multiple login nodes, AiiDA may get confused if subsequent SSH connections end up at a different login node (causing AiiDA to infer that the job has completed).
Unsurprisingly, different schedulers have different ways of specifying the resources for a job (such as the number of required nodes or the numbers of MPI processes per node).
In AiiDA, these differences are accounted for by subclasses of the JobResource class. The previous section lists which subclass to use with a given scheduler.
All subclasses define at least the get_tot_num_mpiprocs() method, which returns the total number of MPI processes requested, but otherwise have slightly different interfaces, described below.
Note
You can manually load a specific JobResource subclass by directly importing it, e.g.
```python
from aiida.schedulers.datastructures import NodeNumberJobResource
```
In practice, however, the appropriate class will be inferred from the scheduler configured for the relevant AiiDA computer, and you can simply set the relevant fields in the metadata.options input dictionary of the CalcJob.
For a scheduler with job resources of type NodeNumberJobResource, this could be:
```python
from aiida.orm import load_code

inputs = {
    # The configured code to be used, which also defines the computer
    'code': load_code('somecode@localhost'),
    'metadata': {
        'options': {
            'resources': {'num_machines': 4, 'num_mpiprocs_per_machine': 16},
        }
    }
}
```
The NodeNumberJobResource class is used for specifying job resources in PBS and SLURM.
The class has the following attributes:
- res.num_machines: the number of machines (also called nodes) on which the code should run
- res.num_mpiprocs_per_machine: the number of MPI processes to use on each machine
- res.tot_num_mpiprocs: the total number of MPI processes that this job requests
- res.num_cores_per_machine: the number of cores to use on each machine
- res.num_cores_per_mpiproc: the number of cores to run each MPI process on
You need to specify only two of the first three fields above, but they have to be defined upon construction. We suggest using the first two, for instance:
```python
res = NodeNumberJobResource(num_machines=4, num_mpiprocs_per_machine=16)
```
This asks the scheduler to allocate 4 machines, with 16 MPI processes on each machine, for a total of 4*16=64 MPI processes.
When creating a new computer, you will be asked for a default_mpiprocs_per_machine. If specified, it will automatically be used as the default value for num_mpiprocs_per_machine whenever creating the resources for that computer.
If you prefer using res.tot_num_mpiprocs instead, make sure it is a multiple of res.num_machines and/or res.num_mpiprocs_per_machine.
The first three fields are related by the equation:
res.num_machines * res.num_mpiprocs_per_machine = res.tot_num_mpiprocs
The num_cores_per_machine and num_cores_per_mpiproc fields are optional and must satisfy the equation:
res.num_cores_per_mpiproc * res.num_mpiprocs_per_machine = res.num_cores_per_machine
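The arithmetic enforced by these two equations can be illustrated with a small self-contained sketch (this is not AiiDA code; the function name complete_node_resources is invented for illustration). Given any two of the first three fields, the third follows from the first equation:

```python
# Illustrative sketch (not the real NodeNumberJobResource class) of the
# relation num_machines * num_mpiprocs_per_machine = tot_num_mpiprocs.
def complete_node_resources(num_machines=None,
                            num_mpiprocs_per_machine=None,
                            tot_num_mpiprocs=None):
    """Given any two of the three fields, derive the third and check consistency."""
    given = sum(v is not None for v in
                (num_machines, num_mpiprocs_per_machine, tot_num_mpiprocs))
    if given < 2:
        raise ValueError('at least two of the three fields must be specified')
    if num_machines is None:
        num_machines = tot_num_mpiprocs // num_mpiprocs_per_machine
    elif num_mpiprocs_per_machine is None:
        num_mpiprocs_per_machine = tot_num_mpiprocs // num_machines
    elif tot_num_mpiprocs is None:
        tot_num_mpiprocs = num_machines * num_mpiprocs_per_machine
    # Consistency check, per the first equation above
    if num_machines * num_mpiprocs_per_machine != tot_num_mpiprocs:
        raise ValueError('num_machines * num_mpiprocs_per_machine != tot_num_mpiprocs')
    return num_machines, num_mpiprocs_per_machine, tot_num_mpiprocs

print(complete_node_resources(num_machines=4, num_mpiprocs_per_machine=16))
# (4, 16, 64)
```

The real class performs analogous validation at construction time, which is why an inconsistent combination of fields is rejected immediately rather than at submission.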
In PBSPro, the num_mpiprocs_per_machine and num_cores_per_machine fields are used for mpiprocs and ppn respectively.
In Torque, the num_mpiprocs_per_machine field is used for ppn unless num_cores_per_machine is specified.
The ParEnvJobResource class is used for specifying the resources of SGE and similar schedulers, which require specifying a parallel environment and the total number of CPUs requested.
- res.parallel_env: the parallel environment in which you want to run your job (a string)
- res.tot_num_mpiprocs: the total number of MPI processes requested
Both attributes are required. No checks are done on the consistency between the specified parallel environment and the total number of MPI processes requested (for instance, some parallel environments may have been configured by your cluster administrator to run on a single machine). It is your responsibility to make sure that the information is valid, otherwise the submission will fail.
Setting the fields directly in the class constructor:
```python
res = ParEnvJobResource(parallel_env='mpi', tot_num_mpiprocs=64)
```
And setting the fields using the metadata.options input dictionary of the CalcJob:
```python
inputs = {
    'metadata': {
        'options': {
            'resources': {'parallel_env': 'mpi', 'tot_num_mpiprocs': 64},
        }
    }
}
```
A scheduler plugin allows AiiDA to communicate with a specific type of scheduler. The plugin should subclass the Scheduler class and implement a number of methods that instruct AiiDA how to execute key commands, such as submitting a new job or requesting the currently active jobs. To get you started, you can download this template and implement the following methods:
- _get_joblist_command: returns the command to report full information on existing jobs.
- _get_detailed_job_info_command: returns the command to get detailed information on a job, even after the job has finished.
- _get_submit_script_header: returns the submit script header.
- _get_submit_command: returns the string to submit a given script.
- _parse_joblist_output: parses the queue output string, as returned by executing the command returned by _get_joblist_command.
- _parse_submit_output: parses the output of the submit command, as returned by executing the command returned by _get_submit_command.
- _get_kill_command: returns the command to kill the job with the specified job ID.
- _parse_kill_output: parses the output of the kill command.
- parse_output: parses the output of the scheduler.
All these methods have to be implemented, except for _get_detailed_job_info_command and parse_output, which are optional. In addition to these methods, the _job_resource_class class attribute needs to be set to a subclass of JobResource. For schedulers that work like SLURM, Torque and PBS, one can most likely reuse the NodeNumberJobResource class that ships with aiida-core. Schedulers that work like LSF and SGE may be able to reuse ParEnvJobResource instead. If neither of these fits, one can implement a custom subclass; a template for this, the class TemplateJobResource, is already included in the template file.
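To make the shape of such a plugin concrete, here is a hedged, self-contained sketch of a few of the methods listed above. The scheduler commands (mysub, mystat, mykill) and the class name are entirely hypothetical, and a plain class is used so the example runs without AiiDA installed; a real plugin would subclass aiida.schedulers.Scheduler and set _job_resource_class to a JobResource subclass.

```python
# Hypothetical sketch of a scheduler plugin. The commands mysub/mystat/mykill
# are invented for illustration; in a real plugin, subclass
# aiida.schedulers.Scheduler and set _job_resource_class appropriately.
class MySchedulerSketch:
    # In a real plugin, e.g.: _job_resource_class = NodeNumberJobResource
    _job_resource_class = None

    def _get_joblist_command(self, user=None):
        # Command to report full information on existing jobs.
        return ['mystat', '--full'] + (['--user', user] if user else [])

    def _get_submit_command(self, submit_script):
        # String used to submit a given script.
        return f'mysub {submit_script}'

    def _parse_submit_output(self, retval, stdout, stderr):
        # Extract the job ID from output such as "Submitted job 12345".
        return stdout.strip().split()[-1]

    def _get_kill_command(self, jobid):
        # Command to kill the job with the specified job ID.
        return f'mykill {jobid}'


sched = MySchedulerSketch()
print(sched._get_submit_command('job.sh'))
print(sched._parse_submit_output(0, 'Submitted job 12345\n', ''))
```

The parsing methods are where most scheduler-specific work lives: they must translate the raw text output of the scheduler's commands into the job IDs and job states that AiiDA tracks.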
To inform AiiDA about your new scheduler plugin you must register an entry point in the aiida.schedulers entry point group. Refer to the section on how to register plugins for instructions.
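As a sketch of what such a registration can look like in a modern Python package, assuming a hypothetical package aiida-myscheduler with module aiida_myscheduler.scheduler (all names here are invented for illustration):

```toml
# pyproject.toml of the (hypothetical) plugin package
[project.entry-points."aiida.schedulers"]
"myscheduler" = "aiida_myscheduler.scheduler:MyScheduler"
```

Once the package is installed, the entry point name (here "myscheduler") is what you select when configuring a computer to use the new scheduler.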