Conversation

SudeshnaBora (Contributor):

No description provided.

@SudeshnaBora SudeshnaBora self-assigned this Nov 4, 2021
Comment on lines 451 to 454
cudaMemcpy(&num_spiking_neurons,
&dev_array_spikegeneratorgroup__spikespace[synapses_pre_eventspace_idx][_num__array_spikegeneratorgroup__spikespace - 1],
sizeof(int32_t), cudaMemcpyDeviceToHost);
num_blocks = num_parallel_blocks * num_spiking_neurons;
Member:

This command will execute in the default stream 0. Use cudaMemcpyAsync. You might have to declare the host memory to be pinned memory (not sure, never tried this; try without first, I'm sure you will get an error if you need to use pinned memory). Here are some slides on that that I just found.

Member:

Here, we might have to be careful. When using cudaMemcpyAsync, the host will not wait for the copy to finish. But we need the host to have access to the number of spiking neurons (which we are copying from device to host here). You might have to synchronize this stream with the host here for things to work when using cudaMemcpyAsync. That means something like cudaStreamSynchronize(this_stream);.

I'm wondering if we could avoid this synchronization by calling the memcpy before the push kernel instead. But that is something to think of later maybe. For now, leave it as is with the synchronization I mentioned above.
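A minimal sketch of the pattern being suggested (`this_stream` and the surrounding variable names follow the snippets above; `d_eventspace` is a hypothetical device pointer standing in for the spikespace array):

```cuda
// Sketch: asynchronous device-to-host copy followed by an explicit
// stream synchronization, so the host can safely read the result.
// Assumes `this_stream` was created earlier with cudaStreamCreate.
int32_t num_spiking_neurons;

// Enqueue the copy in `this_stream` instead of the default stream 0.
// If this fails, the host buffer may need to be pinned via cudaMallocHost.
cudaMemcpyAsync(&num_spiking_neurons,
                d_eventspace + _num_spikespace - 1,
                sizeof(int32_t), cudaMemcpyDeviceToHost, this_stream);

// Block the host until everything queued in `this_stream` has finished;
// only then is `num_spiking_neurons` valid on the host.
cudaStreamSynchronize(this_stream);

num_blocks = num_parallel_blocks * num_spiking_neurons;
```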

Comment on lines 268 to 270
cudaMemcpy(&num_spiking_neurons,
dev_array_spikegeneratorgroup__spikespace[current_idx_array_spikegeneratorgroup__spikespace] + _num_spikespace - 1,
sizeof(int32_t), cudaMemcpyDeviceToHost)
Member:

This is also executed in Stream 0. But don't worry about this one for now. It is only relevant when we have heterogeneous delays. You are testing on mushroombody, right? It does not have heterogeneous delays so this cudaMemcpy is never executed!

);

// advance spike queues
_advance_kernel_synapses_pre_push_spikes<<<1, num_parallel_blocks>>>();
Member:

Executes in stream 0. Execute it in the same stream as the main kernel of this file. You need them in the same stream since the advance kernel sets the correct spike queue into which spiking synapses are collected. Hence you need to run the advance kernel before the actual push kernel. When they are executed in the same stream, they run in sequence, which is what you want.

raise NotImplementedError("Using OpenMP in a CUDA standalone project is not supported")

def generate_objects_source(self, writer, arange_arrays, synapses, static_array_specs, networks):
def generate_objects_source(self, writer, arange_arrays, synapses, static_array_specs, networks,stream_info):
Member:

Suggested change
def generate_objects_source(self, writer, arange_arrays, synapses, static_array_specs, networks,stream_info):
def generate_objects_source(self, writer, arange_arrays, synapses, static_array_specs, networks, stream_info):

multisynaptic_idx_vars=multisyn_vars,
profiled_codeobjects=self.profiled_codeobjects)
profiled_codeobjects=self.profiled_codeobjects,
parallelize=True,
Member:

Should become a preference later on. Just putting it here as a TODO so we don't forget.

for key in streams_organization:
for object in streams_organization[key]:
streams_details[object.name] = count
count += 1
Member:

As discussed, let's make the default 0. Or do we even need a default? Can't we just pass 0 to the kernel (which would run it in the actual CUDA default stream)? Let's check this later.
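A small sketch of the mapping the loop above builds, with 0 as the proposed default (the names `streams_organization` and `streams_details` come from the snippet; the object names and the `stream_for` helper are made up for illustration):

```python
# Build a {object_name: stream_index} map; unregistered objects fall back
# to 0, i.e. the actual CUDA default stream.
streams_organization = {
    "before_synapses": ["neurongroup_stateupdater"],      # made-up names
    "synapses": ["synapses_pre_push", "synapses_pre"],
}

streams_details = {}
count = 1  # stream 0 stays reserved as the default
for key in streams_organization:
    for obj_name in streams_organization[key]:
        streams_details[obj_name] = count
        count += 1

def stream_for(name):
    # A default of 0 means: run in the actual CUDA default stream.
    return streams_details.get(name, 0)
```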

# create all random numbers needed for the next clock cycle
for clock in net._clocks:
run_lines.append(f'{net.name}.add(&{clock.name}, _run_random_number_buffer);')
run_lines.append(f'{net.name}.add(&{clock.name}, _run_random_number_buffer, {self.stream_info["default"]});')
Member:

The random number buffer is a special case. It is not generated from common_group.cu, but is defined separately in rand.cu. So you don't need to add a stream argument here at all (I think this should even fail, because _run_random_number_buffer in rand.cu is defined without arguments).

For context: The random number buffer has a fixed size of memory on the GPU (which can be controlled via a preference). It generates random numbers from the host, knowing how many random numbers the kernels will require. The kernels then use this data for multiple time steps (where _run_random_number_buffer only increments the data pointer into the random numbers). Only when the generated numbers on the GPU are used up are new numbers generated.

Each random number generation call should generate enough random numbers to occupy the entire GPU. So no need for concurrent kernel execution here at all.
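The refill-on-exhaustion behaviour described above can be illustrated with a small host-side model (a pure illustration of the concept, not the actual rand.cu implementation; all names and numbers here are made up):

```python
class RandomNumberBuffer:
    """Illustrative model: generate a large batch of numbers up front and
    only regenerate once the batch is used up."""

    def __init__(self, capacity, per_step):
        self.capacity = capacity   # fixed buffer size (preference-controlled)
        self.per_step = per_step   # numbers consumed per time step
        self.offset = capacity     # start "empty" to force a fill on first use
        self.generations = 0       # counts the expensive GPU generation calls

    def run(self):
        # Called once per time step: usually just advances the data pointer.
        if self.offset + self.per_step > self.capacity:
            self.generations += 1  # expensive: a new batch is generated
            self.offset = 0
        self.offset += self.per_step

buf = RandomNumberBuffer(capacity=100, per_step=10)
for _ in range(25):
    buf.run()
# Only 3 generation calls were needed for 25 time steps.
```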

{% block kernel_call %}
_run_kernel_{{codeobj_name}}<<<num_blocks, num_threads>>>(
_run_kernel_{{codeobj_name}}<<<num_blocks, num_threads,0,stream>>>(
Member:

Please stick to the code formatting in the files

Suggested change
_run_kernel_{{codeobj_name}}<<<num_blocks, num_threads,0,stream>>>(
_run_kernel_{{codeobj_name}}<<<num_blocks, num_threads, 0, stream>>>(

@denisalevi (Member):

I added a bunch of review comments. As you fix them (and push them), feel free to "Resolve conversation"

SudeshnaBora and others added 4 commits April 6, 2022 16:03
This also adds an unused `stream` parameter to the RNG function, which
is the only network function that always runs in the default stream (for
now).

// go through each list of func group - 2 loops
for(int i=0; i<func_groups.size(); i++){
for(int j=0; j<func_groups.size(); j++){
Member:

The second loop is wrong:

Suggested change
for(int j=0; j<func_groups.size(); j++){
for(int j=0; j<func_groups[i].size(); j++){

func(custom_stream[j]);
}
// reset the func group for that sub stream
func_groups.resize(0);
Member:

After each function group, you need to synchronize host and device. Check the documentation whether cudaDeviceSynchronize() will do the job or whether you need to synchronize all streams individually.
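One way to realize "synchronize after each function group", sketched under the assumption that every function takes its stream as an argument (as in the loop above) and that `custom_stream[j]` holds pre-created streams:

```cuda
// Sketch: run each function group across concurrent streams, then make the
// host wait before launching the next group.
for (size_t i = 0; i < func_groups.size(); i++) {
    // Launch every function of group i in its own stream (may run concurrently).
    for (size_t j = 0; j < func_groups[i].size(); j++) {
        func_groups[i][j](custom_stream[j]);
    }
    // Barrier after the group: cudaDeviceSynchronize() waits for all work on
    // the device, across all streams. Per-stream
    // cudaStreamSynchronize(custom_stream[j]) calls would work as well.
    cudaDeviceSynchronize();
}
```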

SudeshnaBora and others added 2 commits May 21, 2022 19:12
Required for `cudaStream_t` in `network.cu`
cudaMemcpy(&num_spiking_neurons,
dev{{_eventspace}}[current_idx{{_eventspace}}] + _num_{{owner.event}}space - 1,
sizeof(int32_t), cudaMemcpyDeviceToHost)
);
Member:

Above this line is another cudaMemcpy that needs to become a cudaMemcpyAsync. You don't need a cudaStreamSynchronize here, since you will have to perform one below anyways (see next comment).

_advance_kernel_{{codeobj_name}}<<<1, num_parallel_blocks, 0, stream>>>();

CUDA_CHECK_ERROR("_advance_kernel_{{codeobj_name}}");

Member:

The _advance_kernel needs to finish before we can call the synapses push kernel (which happens after this line in the generated code, based on common_group.cu). Therefore, add another cudaStreamSynchronize here for the same stream.
