Conversation

@laraPPr (Collaborator) commented Mar 4, 2025

No description provided.

@casparvl (Collaborator) commented Mar 4, 2025

Ugh... our tests are being cancelled because of a brownout https://github.blog/changelog/2025-01-15-github-actions-ubuntu-20-runner-image-brownout-dates-and-other-breaking-changes/

It is a good point though; I guess we should update the version of Ubuntu used in these tests. But from a comment in our workflows:

ubuntu <= 20.04 is required for python 3.6

Not entirely sure why, but I guess that means we'll lose Python 3.6 support. We should discuss whether that's acceptable to us, or whether we need to do something else to keep that support.

It does not really explain why Python 3.6 is not supported with newer Ubuntu - we could give that another try and see what fails...

@casparvl (Collaborator) left a comment

I'm fine with this fix as it resolves your issue. One remark: you might consider still purging the environment, but then resetting export SLURM_CONF=/etc/slurm/slurm.conf_dodrio using https://reframe-hpc.readthedocs.io/en/stable/config_reference.html#config.systems.partitions.env_vars
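Roughly, a minimal sketch of what I have in mind (the system/partition names, hostnames and the other fields are placeholders, not our actual config):

# Minimal sketch of a ReFrame settings fragment; only the partition-level
# 'env_vars' entry is the point, everything else is a placeholder.
site_configuration = {
    'systems': [
        {
            'name': 'hortense',            # placeholder system name
            'hostnames': ['login.*'],      # placeholder
            'partitions': [
                {
                    'name': 'cpu_rome',    # placeholder partition name
                    'scheduler': 'slurm',
                    'launcher': 'mpirun',
                    'environs': ['builtin'],
                    # re-export SLURM_CONF in every generated job, so it is
                    # set again after the 'module --force purge'
                    'env_vars': [
                        ['SLURM_CONF', '/etc/slurm/slurm.conf_dodrio'],
                    ],
                },
            ],
        },
    ],
}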

Anyway, you could do that in a follow-up PR if you prefer that approach. At least this PR fixes your immediate issue.

Edit: I can't merge this yet anyway because of the failing CI. So you could also still change it in this PR if you want to :)

@boegel (Contributor) commented Mar 4, 2025

Ugh... our tests are being cancelled because of a brownout https://github.blog/changelog/2025-01-15-github-actions-ubuntu-20-runner-image-brownout-dates-and-other-breaking-changes/

It is a good point though; I guess we should update the version of Ubuntu used in these tests. But from a comment in our workflows:

ubuntu <= 20.04 is required for python 3.6

Not entirely sure why, but I guess that means we'll lose Python 3.6 support. We should discuss whether that's acceptable to us, or whether we need to do something else to keep that support.

It does not really explain why Python 3.6 is not supported with newer Ubuntu - we could give that another try and see what fails...

If we want to keep running tests with Python 3.6, the approach we're taking in EasyBuild may be helpful: easybuilders/easybuild-framework#4783

@laraPPr (Collaborator, Author) commented Mar 5, 2025

I'm fine with this fix as it resolves your issue. One remark: you might consider still purging the environment, but then resetting export SLURM_CONF=/etc/slurm/slurm.conf_dodrio using https://reframe-hpc.readthedocs.io/en/stable/config_reference.html#config.systems.partitions.env_vars

Anyway, you could do that in a follow-up PR if you prefer that approach. At least this PR fixes your immediate issue.

Edit: I can't merge this yet anyway because of the failing CI. So you could also still change it in this PR if you want to :)

That sets the environment variable on the partition, but it does not need to be set on the partition. It only needs to be set on the system from which ReFrame is launching the jobs, and I do not think you can set that in the config.

That environment variable is also set and unset by a sticky module, so once the environment is purged the variable will always be unset. So I do not think it will work if I set it in the ci_config, for instance.

@casparvl (Collaborator) commented Mar 5, 2025

It only needs to be set on the system from which ReFrame is launching the jobs, and I do not think you can set that in the config.

I think you can: https://reframe-hpc.readthedocs.io/en/stable/config_reference.html#config.systems.env_vars

Though the docs are not entirely clear on whether this indeed modifies the environment where the ReFrame runtime is being run, rather than the test job. But considering the other options at the system level (e.g. setting the staging dir), I'm pretty sure that's what it does.
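For concreteness, a minimal sketch of what I mean (again, the system name, hostnames and the rest of the config are placeholders):

# Minimal sketch with 'env_vars' at the system level instead of per
# partition; names and values other than SLURM_CONF are placeholders.
site_configuration = {
    'systems': [
        {
            'name': 'hortense',        # placeholder system name
            'hostnames': ['login.*'],  # placeholder
            # set SLURM_CONF for this system as a whole
            'env_vars': [
                ['SLURM_CONF', '/etc/slurm/slurm.conf_dodrio'],
            ],
            'partitions': [
                # ... partitions as before ...
            ],
        },
    ],
}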

But, again, up to you :) I'm also happy to go for the 'don't purge' solution. Either way works. The advantage of being able to purge is that you have a more controlled environment: only what is set in the ReFrame config file will be set/loaded.

Let me know what you prefer :) I'll at least retrigger the CI, I think the brownout is over...

@laraPPr (Collaborator, Author) commented Mar 6, 2025

Tested env_vars, and this is the resulting job script. The environment variable is set in the job script, not in the environment where ReFrame runs.

#!/bin/bash
#SBATCH --job-name="rfm_EESSI_LAMMPS_lj_248c5679"
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH --time=0:30:0
#SBATCH --mem=5200M
module swap cluster/doduo
module --force purge
source /cvmfs/software.eessi.io/versions/2023.06/init/bash
module load env/vsc/doduo
export SLURM_CONF=/etc/slurm/slurm.conf_doduo
module load LAMMPS/29Aug2024-foss-2023b-kokkos
export OMP_NUM_THREADS=1
export I_MPI_PIN_CELL=core
export I_MPI_PIN_DOMAIN=1:compact
export OMPI_MCA_rmaps_base_mapping_policy=slot:PE=1
mpirun -np 2 lmp -in in.lj
echo "EESSI_CVMFS_REPO: $EESSI_CVMFS_REPO"
echo "EESSI_SOFTWARE_SUBDIR: $EESSI_SOFTWARE_SUBDIR"
echo "FULL_MODULEPATH: $(module --location show LAMMPS/29Aug2024-foss-2023b-kokkos)"

@casparvl (Collaborator) commented
Tested env_vars, and this is the resulting job script.

And you set this at the system level, not at the partition level in your config? Strange... That's not what I would expect. Also, I don't understand what the difference would then be between setting this at the system level and at the partition level...

@laraPPr (Collaborator, Author) commented Apr 17, 2025

@casparvl It works by adding SLURM_CONF to env_vars at the system level. I did discover that the GPU drivers have not been exposed on our system yet, but the jobs are now being submitted to the queue, so this resolves #242.

@casparvl (Collaborator) left a comment

Lgtm!

@laraPPr (Collaborator, Author) commented Apr 23, 2025

@casparvl can you hit merge? EESSI GPU host_injections is now also set up on Hortense, and I'm testing as we speak; everything looks good.

@casparvl merged commit e6340ce into EESSI:main on Apr 24, 2025 (15 checks passed)