Skip to content

Conversation

@wdconinc
Copy link
Contributor

This PR adds multithreading support to ddsim, through the -j flag. More later.

BEGINRELEASENOTES

  • Multithreading support in ddsim

ENDRELEASENOTES

@github-actions
Copy link

github-actions bot commented Mar 11, 2024

Test Results

  9 files    9 suites   1h 40m 30s ⏱️
 26 tests  23 ✅ 0 💤  3 ❌
201 runs  184 ✅ 0 💤 17 ❌

For more details on these failures, see this check.

Results for commit d0ae8cd.

♻️ This comment has been updated with latest results.

@wdconinc
Copy link
Contributor Author

This is still a work in progress, but let me give some more background here as to the plan (the "more later" from the top).

  • The intention is, of course, full backwards compatibility so this runs in the single-threaded G4RunManager with command line options as given now (and therefore must pass all tests unchanged).
  • This PR already works in multi-threaded mode for generators like the gun, but
    • Not for external (hepmc) input files: I would like to consider this beyond the scope of this PR since this will require some changes to the input file reader to make sure it does not read the same sequence of events in each file.
    • Adapting to MT has caused single-threaded running to break: As written above, this needs to get fixed, of course.
  • There may be changes to the output files that are due to event reordering. These are likely unavoidable.

It is likely that I will be doing some rewrites of the last commit (5bf711c) which is a bit too big for my taste and not as clear as one would like for non-squash merging.

@wdconinc
Copy link
Contributor Author

The status of this PR after #1417 is that we can now run MT gun simulations, like so:

ddsim --compactFile DDDetectors/compact/SiD.xml -j8 -G -N160 --gun.distribution uniform --outputFile SiD.edm4hep.root

Main missing feature is ensuring inputFiles is serialized. Right now this just reads the same event sequence for all threads.

Comment on lines -80 to -81
self._g4gun = None
self._g4gps = None
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: Geant4 gun and GPS event generators are not initialized correctly due to this removal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Despite modified implementation, this remain a TODO since only one thread reads the macroFile... (as of da36a3e). And Geant4GeneratorWrapper doesn't allow setting properties when shared = True.

@wdconinc
Copy link
Contributor Author

The primary remaining difficulty I'm having here is with the event input files (I'm using HepMC3), which we create in:

gen = DDG4.GeneratorAction(kernel, "Geant4InputAction/hepmc%d" % index)
gen.Parameters = self.hepmc3.getParameters()
gen.Input = "HEPMC3FileReader|" + inputFile

Simply making the GeneratorAction shared with DDG4.GeneratorAction(kernel, "Geant4InputAction/hepmc%d" % index, shared=True) fails because the properties (Parameters, Input) that are subsequently set are properties of the Geant4InputAction and not exposed to the GeneratorAction handle gen.

To get around this, I went through setting up the necessary DDG4.InputAction support (duplicating what I felt was too much stuff along the way), and it still ended up not working.

So, I'm open to input as to how best to have a single Geant4InputAction where we can hand-out events sequentially while locking access with G4AutoLock in Geant4InputAction::create_reader and Geant4InputAction::operator().

@MarkusFrankATcernch @andresailer Do you have any suggestions?

@MarkusFrankATcernch
Copy link
Contributor

Hi Wouter,

This construct is not simple. Let's see what is necessary before the hacking starts...

For Geant4 the GeneratorAction belongs to the worker thread's execution flow, not the main thread.
This idea is also reflected in DDG4. If you look at DD4hep/blob/master/DDG4/examples/SiDSim_MT.py you can clearly see this.

Now what one probably wants is somehow a structure that this worker thread bound GeneratorAction is sharable between all workers (share same input file, share common parametrization etc) and use this action from all threads. Secondly the underlying reader implementation should be accessible by the configuration (python with ddg4, ddsim, etc) to allow more fine grained customization than just setting some input file and defining a factory.

From what I see somehow the input stage needs then to be re-designed. In particular exchanging data between the actions building the queue of GeneratorActions which define the input stage, cannot use the Geant4Event and its extension mechanism to pass these data. The Geant4Event is bound to the worker thread via the Geant4Context.

Having said all this, it looks necessary that the queue of GeneratorActions needs to be cut in two:

  • the first part being locked and single threaded doing the file reading (multiple inputs, spillover, multiple collisions per crossing, etc) and the data preparation with user exposure of the queue elements before execution starts.
  • the second part is not locked and can then finalize the event generation (smearing, primary vertex movements, etc.) and passes all the initial elements to the Geant4Event, which is then passed to Geant4 for the actual simulation work.

How to do something like this in a backwards compatible way is not so obvious to me and should be designed properly.
As I use to say: software is never difficult or complicated unless there is a lack of thinking upfront.
Maybe we should first collect all the goodies we want to see in such an implementation. The parameter setup you mention here is already a first step.

@MarkusFrankATcernch
Copy link
Contributor

So after coming back from the drawing board....
What you effectively want is a transition from the following object model as it is implemented now:

Geant4-Input

to an input stage which looks much rather like this (details are left out here):

Geant4-MT

Here you have

  • a single threaded simple input stage filling the Geant4Event structure. The Geant4Events must be requested by the initializing action (Geant4GenerationInit) from the Geant4EventQueue object to limit the number of queued events (memory, etc). This compromises the first actions from the above model.
  • these Geant4Event entities are then pushed into a queue from which these Geant4Events are then requested by the multi-threaded simulation.
  • the multi-thread action sequence then finalizes the primaruy generation and the rest continues as shown here: DD4hep/blob/master/DDG4/examples/SiDSim_MT.py

In principle this is not too difficult, but still quite more difficult that you have imagined upfront.

In addition:

  • the Geant4InputAction objects in the queue need their properties be exposed to the user
  • the entire machinery must be embedded in ddsim. How this can be done elegantly has to be seen.

@wdconinc
Copy link
Contributor Author

While I agree that the diagram you outline (with an event queue) is optimal for throughput performance, I wonder if there is another less optimal but still functional approach (which may have some threads waiting unnecessarily for some of the time). What I had originally anticipated was a pattern where there are multiple Geant4InputActions sequences, one per thread as now, but the action that reads from disk would have a shared reader that is locked. This would mimic the approach of the Geant4OutputAction where each thread has an output event action, but they share and lock the output file.

@MarkusFrankATcernch
Copy link
Contributor

@wdconinc
We should under no circumstances deviate too far from the processing model Geant4 uses.
Geant4 expects the G4GeneratorAction to be part of the worker thread and each worker thread is only provided with a unique seed for the random generator for this event. That's it.

Where one actually puts the Geant4EventQueue is sort of user-gusto here, but it must be located before the
G4Primary particles and vertices are created. If you want to work with input files a la HepMC, you have to go the complicated way I am afraid. Otherwise you would need one input per thread.

If we invest here, we should do it properly....even if it costs some development.

@wdconinc
Copy link
Contributor Author

wdconinc commented Apr 2, 2025

In addition:

  • the Geant4InputAction objects in the queue need their properties be exposed to the user

See #1440, which exposes the properties correctly now.

@MarkusFrankATcernch
Copy link
Contributor

@wdconinc @murnanedaniel
So in other words, only the raw DDG4 python interface can currently be used to run multi-threaded.
Correct Wouter ?

@wdconinc
Copy link
Contributor Author

@wdconinc @murnanedaniel
So in other words, only the raw DDG4 python interface can currently be used to run multi-threaded.
Correct Wouter ?

The scope of this PR has always been to make multi-threaded running work with ddsim. I don't know what other ways of running DDG4 people are using, but I'm not familiar with them.

@murnanedaniel
Copy link

Hi @wdconinc. Thanks for the quick response. I have been trying to reproduce your working setup with input hepmc3 files, but get this error, when setting to 8 threads:

=======================================================================
Geant4Kernel     INFO  +++ Created worker instance id=1046861376l
UserInitialization INFO  +++ Executing Geant4UserActionInitialization::Build. Context:0x7f9b303af6a0 Kernel:0x7f9b303afb30 [0]
PyG4Init         INFO  +++ Worker:0 Build PYTHON Worker [empty]....
Geant4Kernel     INFO  ++ Registered global action RunInit of type dd4hep::sim::Test::Geant4TestRunAction
Geant4UI         INFO  +++ RunAction> Install Geant4 control directory:/ddg4.0/RunAction/
Traceback (most recent call last):
  File "/global/cfs/cdirs/m4958/usr/danieltm/ColliderML/software/OtherLibraries/dd4hep-mt/lib/python3.11/site-packages/DDSim/DD4hepSimulation.py", line 480, in __setupWorker
    self.__setupGeneratorActions(kernel, geant4)
  File "/global/cfs/cdirs/m4958/usr/danieltm/ColliderML/software/OtherLibraries/dd4hep-mt/lib/python3.11/site-packages/DDSim/DD4hepSimulation.py", line 339, in __setupGeneratorActions
    shared = (kernel.RunManagerType == "G4MTRunManager")
              ^^^^^^^^^^^^^^^^^^^^^
  File "/global/cfs/cdirs/m4958/usr/danieltm/ColliderML/software/OtherLibraries/dd4hep-mt/lib/python3.11/site-packages/DDG4.py", line 161, in _getKernelProperty
    elif hasattr(self, name):
         ^^^^^^^^^^^^^^^^^^^
  File "/global/cfs/cdirs/m4958/usr/danieltm/ColliderML/software/OtherLibraries/dd4hep-mt/lib/python3.11/site-packages/DDG4.py", line 161, in _getKernelProperty
    elif hasattr(self, name):
         ^^^^^^^^^^^^^^^^^^^
  File "/global/cfs/cdirs/m4958/usr/danieltm/ColliderML/software/OtherLibraries/dd4hep-mt/lib/python3.11/site-packages/DDG4.py", line 161, in _getKernelProperty
    elif hasattr(self, name):
         ^^^^^^^^^^^^^^^^^^^
  [Previous line repeated 995 more times]
RecursionError: maximum recursion depth exceeded
TypeError: 'NoneType' object cannot be interpreted as an integer
PyG4Init         FATAL +++ Worker setup command returned -1, not SUCCESS (1). Terminating setup

**************************************************** 
*  A runtime error has occured :                     
*    +++ Worker setup command returned -1, not SUCCESS (1). Terminating setup
*  the program will have to be terminated - sorry.   
**************************************************** 

I removed the elif python handler test that was causing the recursion, leading to (I believe) the real error, which is this:

=======================================================================
Geant4Kernel     INFO  +++ Created worker instance id=16168512l
UserInitialization INFO  +++ Executing Geant4UserActionInitialization::Build. Context:0x7f01f43af6a0 Kernel:0x7f01f43afb30 [0]
PyG4Init         INFO  +++ Worker:0 Build PYTHON Worker [empty]....
Geant4Kernel     INFO  ++ Registered global action RunInit of type dd4hep::sim::Test::Geant4TestRunAction
Geant4UI         INFO  +++ RunAction> Install Geant4 control directory:/ddg4.0/RunAction/
Traceback (most recent call last):
  File "/global/cfs/cdirs/m4958/usr/danieltm/ColliderML/software/OtherLibraries/dd4hep-mt/lib/python3.11/site-packages/DDSim/DD4hepSimulation.py", line 480, in __setupWorker
    self.__setupGeneratorActions(kernel, geant4)
  File "/global/cfs/cdirs/m4958/usr/danieltm/ColliderML/software/OtherLibraries/dd4hep-mt/lib/python3.11/site-packages/DDSim/DD4hepSimulation.py", line 339, in __setupGeneratorActions
    shared = (kernel.RunManagerType == "G4MTRunManager")
              ^^^^^^^^^^^^^^^^^^^^^
  File "/global/cfs/cdirs/m4958/usr/danieltm/ColliderML/software/OtherLibraries/dd4hep-mt/lib/python3.11/site-packages/DDG4.py", line 162, in _getKernelProperty
    raise KeyError(msg)
KeyError: 'Geant4Kernel::GetProperty [Unhandled]: Cannot access Kernel.RunManagerType'
TypeError: 'NoneType' object cannot be interpreted as an integer
PyG4Init         FATAL +++ Worker setup command returned -1, not SUCCESS (1). Terminating setup

**************************************************** 
*  A runtime error has occured :                     
*    +++ Worker setup command returned -1, not SUCCESS (1). Terminating setup
*  the program will have to be terminated - sorry.   
**************************************************** 

For context, here is the full output log: full_output.log
My environment should be perfectly vanilla - LCG 107, running on the OpenDataDetector, with Pythia-generated hepmc3 files. I will continue to debug, but hoped you might have a quick idea.

@murnanedaniel
Copy link

Okay, I made a couple of quick tweaks and it does seem to work, which is really promising. Firstly, the tweaks I made are here: murnanedaniel@cbbe86e. I'm not sure if you agree with them, or why they are not necessary in your setup?

Secondly, there is one final error appearing after the simulation that doesn't appear in the main DD4hep. You can see it at the end of the successful log file: full_output_mt.log
Any idea where this is coming from?

@murnanedaniel
Copy link

murnanedaniel commented May 14, 2025

@wdconinc Just following up here, since we will soon be running a large set of simulations with your MT changes. I just want to ensure that the small error that appears, and the tweaks that I needed to make, are still consistent with the working version of ddsim as you see it. On that same point - have you validated that MT sim of hepmc3 inputs produces more or less identical outputs to DD4hep main?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants