-# ClusterManagers.jl
+# HTCondorClusterManager.jl
 
-The `ClusterManagers.jl` package implements code for different job queue systems commonly used on compute clusters.
+The `HTCondorClusterManager.jl` package implements code for HTCondor clusters.
 
-> [!WARNING]
-> This package is not currently being actively maintained or tested.
->
-> We are in the process of splitting this package up into multiple smaller packages, with a separate package for each job queue system.
->
-> We are seeking maintainers for these new packages. If you are an active user of any of the job queue systems listed below and are interested in being a maintainer, please open a GitHub issue - say that you are interested in being a maintainer, and specify which job queue system you use.
-
-## Available job queue systems
-
-### In this package
-
-The following managers are implemented in this package (the `ClusterManagers.jl` package):
-
-| Job queue system | Command to add processors |
-| ---------------- | ------------------------- |
-| Local manager with CPU affinity setting | `addprocs(LocalAffinityManager(;np=CPU_CORES, mode::AffinityMode=BALANCED, affinities=[]); kwargs...)` |
-
-### Implemented in external packages
-
-| Job queue system | External package | Command to add processors |
-| ---------------- | ---------------- | ------------------------- |
-| Slurm | [SlurmClusterManager.jl](https://github.com/JuliaParallel/SlurmClusterManager.jl) | `addprocs(SlurmManager(); kwargs...)` |
-| Load Sharing Facility (LSF) | [LSFClusterManager.jl](https://github.com/JuliaParallel/LSFClusterManager.jl) | `addprocs_lsf(np::Integer; bsub_flags=``, ssh_cmd=``)` or `addprocs(LSFManager(np, bsub_flags, ssh_cmd, retry_delays, throttle))` |
-| Kubernetes (K8s) | [K8sClusterManagers.jl](https://github.com/beacon-biosignals/K8sClusterManagers.jl) | `addprocs(K8sClusterManager(np; kwargs...))` |
-| Azure scale-sets | [AzManagers.jl](https://github.com/ChevronETC/AzManagers.jl) | `addprocs(vmtemplate, n; kwargs...)` |
-
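As a quick illustration of the external-packages table above, here is a minimal sketch of the Slurm entry. This is an editor's example, not part of the diffed README; it assumes the Julia session is started inside an existing Slurm allocation (e.g. via `sbatch` or `salloc`) and that `SlurmClusterManager.jl` is installed.

```julia
# Sketch based on the SlurmClusterManager.jl row in the table above.
# Assumes this session runs inside a Slurm allocation (sbatch/salloc).
using Distributed, SlurmClusterManager

addprocs(SlurmManager())   # launches Julia workers on the allocated Slurm tasks
@everywhere println("worker $(myid()) on $(gethostname())")
```
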
-### Not currently being actively maintained
-
-> [!WARNING]
-> The following managers are not currently being actively maintained or tested.
->
-> We are seeking maintainers for these managers. If you are an active user of any of the job queue systems listed below and are interested in being a maintainer, please open a GitHub issue - say that you are interested in being a maintainer, and specify which job queue system you use.
->
+Implemented in this package:
 
 | Job queue system | Command to add processors |
 | ---------------- | ------------------------- |
-| Sun Grid Engine (SGE) via `qsub` | `addprocs_sge(np::Integer; qsub_flags=``)` or `addprocs(SGEManager(np, qsub_flags))` |
-| Sun Grid Engine (SGE) via `qrsh` | `addprocs_qrsh(np::Integer; qsub_flags=``)` or `addprocs(QRSHManager(np, qsub_flags))` |
-| PBS (Portable Batch System) | `addprocs_pbs(np::Integer; qsub_flags=``)` or `addprocs(PBSManager(np, qsub_flags))` |
-| Scyld | `addprocs_scyld(np::Integer)` or `addprocs(ScyldManager(np))` |
 | HTCondor | `addprocs_htc(np::Integer)` or `addprocs(HTCManager(np))` |
 
-### Custom managers
-
-You can also write your own custom cluster manager; see the instructions in the [Julia manual](https://docs.julialang.org/en/v1/manual/distributed-computing/#ClusterManagers).
-
-## Notes on specific managers
-
-### Slurm: please see [SlurmClusterManager.jl](https://github.com/JuliaParallel/SlurmClusterManager.jl)
-
-For Slurm, please see the [SlurmClusterManager.jl](https://github.com/JuliaParallel/SlurmClusterManager.jl) package.
-
-### Using `LocalAffinityManager` (for pinning local workers to specific cores)
-
-- Linux-only feature.
-- Requires the Linux `taskset` command to be installed.
-- Usage: `addprocs(LocalAffinityManager(;np=CPU_CORES, mode::AffinityMode=BALANCED, affinities=[]); kwargs...)` (a short usage sketch follows this list).
-
-where
-
-- `np` is the number of workers to be started.
-- `affinities`, if specified, is a list of CPU IDs. As many workers as there are entries in `affinities` are launched, and each worker is pinned to its specified CPU ID.
-- `mode` (used only when `affinities` is not specified; either `COMPACT` or `BALANCED`) - `COMPACT` pins the requested number of workers to cores in increasing order, for example worker 1 => CPU0, worker 2 => CPU1, and so on. `BALANCED` tries to spread the workers, which is useful when there are multiple CPU sockets, each with multiple cores: in `BALANCED` mode the workers are spread across sockets. The default is `BALANCED`.
-
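A minimal sketch of the `LocalAffinityManager` usage described in the removed section above (Linux only, `taskset` required). The worker count and CPU IDs are illustrative, and the example assumes `LocalAffinityManager` and `BALANCED` are exported as shown in the usage line above.

```julia
# Sketch of the LocalAffinityManager usage described above (Linux only,
# requires `taskset`); worker counts and CPU IDs are illustrative.
using Distributed, ClusterManagers

# Start 4 workers spread across CPU sockets (BALANCED is the default mode):
addprocs(LocalAffinityManager(np=4, mode=BALANCED))

# Or pin exactly one worker to each listed CPU ID:
addprocs(LocalAffinityManager(affinities=[0, 2, 4, 6]))
```
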
-### Using `ElasticManager` (dynamically adding workers to a cluster)
-
-The `ElasticManager` is useful in scenarios where we want to dynamically add workers to a cluster.
-It achieves this by listening on a known port on the master. The launched workers connect to this
-port and publish their own host/port information for other workers to connect to.
-
-On the master, you need to create an instance of `ElasticManager`. The constructors defined are:
-
-```julia
-ElasticManager(;addr=IPv4("127.0.0.1"), port=9009, cookie=nothing, topology=:all_to_all, printing_kwargs=())
-ElasticManager(port) = ElasticManager(;port=port)
-ElasticManager(addr, port) = ElasticManager(;addr=addr, port=port)
-ElasticManager(addr, port, cookie) = ElasticManager(;addr=addr, port=port, cookie=cookie)
-```
-
-You can set `addr=:auto` to automatically use the host's private IP address on the local network, which will allow other workers on this network to connect. You can also use `port=0` to let the OS choose a random free port for you (some systems may not support this). Once created, printing the `ElasticManager` object prints the command which you can run on workers to connect them to the master, e.g.:
-
-```julia
-julia> em = ElasticManager(addr=:auto, port=0)
-ElasticManager:
-  Active workers : []
-  Number of workers to be added : 0
-  Terminated workers : []
-  Worker connect command :
-    /home/user/bin/julia --project=/home/user/myproject/Project.toml -e 'using ClusterManagers; ClusterManagers.elastic_worker("4cOSyaYpgSl6BC0C","127.0.1.1",36275)'
-```
-
-By default, the printed command uses the absolute path to the current Julia executable and activates the same project as the current session. You can change either of these defaults by passing `printing_kwargs=(absolute_exename=false, same_project=false)` to the first form of the `ElasticManager` constructor.
-
-Once workers are connected, you can print the `em` object again to see them added to the list of active workers.
-
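To make the master/worker flow above concrete, here is a sketch of both sides. This is an editor's example: the address, port, and cookie values are illustrative placeholders, and the worker-side call mirrors the `ClusterManagers.elastic_worker` command printed in the output shown above.

```julia
# Master side: listen for incoming workers on a fixed address and port.
# Values are illustrative; the cookie must match on both sides.
using ClusterManagers, Sockets

em = ElasticManager(addr=IPv4("0.0.0.0"), port=9009, cookie="mycookie")
println(em)   # prints the exact connect command for workers

# Worker side: run in a separate Julia process on a machine that can reach
# the master (replace 10.0.0.5 with the master's address):
#
#   using ClusterManagers
#   ClusterManagers.elastic_worker("mycookie", "10.0.0.5", 9009)
```
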
-### Sun Grid Engine (SGE)
-
-See [`docs/sge.md`](docs/sge.md)
+The functionality in this package originally used to live in [ClusterManagers.jl](https://github.com/JuliaParallel/ClusterManagers.jl).
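
As a usage illustration of the HTCondor entry retained in the table above, here is a minimal sketch. It assumes this session runs on an HTCondor submit node and that `HTCondorClusterManager` exports the same `addprocs_htc`/`HTCManager` API that ClusterManagers.jl provided; the worker count is illustrative.

```julia
# Sketch of the HTCondor commands from the table above; assumes an HTCondor
# submit node and that HTCondorClusterManager exports addprocs_htc / HTCManager.
using Distributed, HTCondorClusterManager

addprocs_htc(4)            # submit 4 Julia workers through HTCondor
# equivalently:
# addprocs(HTCManager(4))

@everywhere println("worker $(myid()) is running")
```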