Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
187 changes: 181 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,12 +3,187 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Wayne Wu
* [LinkedIn](https://www.linkedin.com/in/wayne-wu/), [Personal Website](https://www.wuwayne.com/)
* Tested on: Windows 10, AMD Ryzen 5 5600X @ 3.70GHz 32GB, RTX 3070 8GB (personal)

### (TODO: Your README)
## Background

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
This project examines two fundamental GPU programming algorithms, **Scan** and **Stream Compaction**.

### Scan
Scanning is the process of calculating the prefix sums of all elements in an array. Five different implementations are examined in this project:

1. **CPU**:
Non-parallelized scanning in CPU.
2. **GPU Naive**:
Naively parallelized setup.
3. **GPU Work Efficient**:
Parallelized setup that reduces the amount of work required by breaking the scanning process into two phases: reduce phase and down-sweep phase. See `kernUpSweep()` and `kernDownSweep()` kernels for implementation details.
4. **GPU Work Efficient with Shared Memory (Extra Credit)**:
Based on the concept from 3, both reduction and down-sweep phases are now done in one kernel (`kernScan()`) using shared memory. In order to support **an array of arbitrary size**, the implementation uses the block-sum method to recursively divide the array into manageable chunks by each block and aggregate divided results together using an auxiliary array (See `recursiveScan()`).
5. **GPU using Thrust API**:
Uses the `inclusive_scan()` call from the Thrust library.

### Stream Compaction
Stream compaction is the process of only keeping the elements in an array that satisfy a specific condition, thereby compacting the array. Three different implementations are examined in this project:

1. **CPU without Scan**:
Non-parallelized compaction without using scan functions.
2. **CPU with Scan**
Non-parallelized compaction that uses the scan and scatter paradigm of stream compaction.
3. **GPU Work Efficient**
Parallelized setup that uses the scan and scatter paradigm of stream compaction.
The scan function used is identical to Implementation 3 of scanning (i.e. `kernUpSweep()` and `kernDownSweep()`).

### Power of Two
All implementations are made to support both an array size that is a power of two and one that isn't.

## Optimal Block Sizes

We begin by optimizing the block sizes for each implementation such that all implementations are analyzed and compared in the optimized state.

### GPU Naive: **128**
![Figure 1](img/naive-blocksize.png)

*Figure 1. GPU Naive Implementation Block Size Comparison*

### GPU Work-Efficient: **128**
![Figure 2](img/efficient-blocksize.png)

*Figure 2. GPU Work-Efficient (w/o Shared Memory) Implementation Block Size Comparison*

It should be noted here that the algorithm also dynamically shrinks/grows the block size depending on the iteration depth, such that the index value does not overflow. 128 is the optimal starting block size.


### GPU Work-Efficient with Shared Memory: **128**
This implementation does not use a `blockSize` variable unlike the previous two methods.
Instead, a `B` variable is used which is the number of elements that will be processed by each block.
The actual number of threads required is `B/2`.

The optimal `B` value is 128, as all other `B` values fail at `N = 2^25`.

## Sample Output
Below is the output of the test program used for testing and timing the different implementations. `N = 2^24`.
```
****************
** SCAN TESTS **
****************
[ 3 42 32 13 26 31 45 7 20 2 5 1 10 ... 1 0 ]
==== cpu scan, power-of-two ====
elapsed time: 9.7722ms (std::chrono Measured)
[ 3 45 77 90 116 147 192 199 219 221 226 227 237 ... 410841500 410841500 ]
==== cpu scan, non-power-of-two ====
elapsed time: 9.9391ms (std::chrono Measured)
[ 3 45 77 90 116 147 192 199 219 221 226 227 237 ... 410841449 410841488 ]
passed
==== naive scan, power-of-two ====
elapsed time: 9.41562ms (CUDA Measured)
[ 3 45 77 90 116 147 192 199 219 221 226 227 237 ... 410841500 410841500 ]
passed
==== naive scan, non-power-of-two ====
elapsed time: 9.46506ms (CUDA Measured)
[ 3 45 77 90 116 147 192 199 219 221 226 227 237 ... 0 0 ]
passed
==== work-efficient scan, power-of-two ====
elapsed time: 4.65533ms (CUDA Measured)
[ 3 45 77 90 116 147 192 199 219 221 226 227 237 ... 410841500 410841500 ]
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 4.6632ms (CUDA Measured)
[ 3 45 77 90 116 147 192 199 219 221 226 227 237 ... 410841449 410841488 ]
passed
==== work-efficient scan (shared memory, recursive), power-of-two ====
elapsed time: 0.965856ms (CUDA Measured)
[ 3 45 77 90 116 147 192 199 219 221 226 227 237 ... 410841500 410841500 ]
passed
==== work-efficient scan (shared memory, recursive), non-power-of-two ====
elapsed time: 1.99728ms (CUDA Measured)
[ 3 45 77 90 116 147 192 199 219 221 226 227 237 ... 410841449 410841488 ]
passed
==== thrust scan, power-of-two ====
elapsed time: 0.47568ms (CUDA Measured)
[ 3 45 77 90 116 147 192 199 219 221 226 227 237 ... 410841500 410841500 ]
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.976416ms (CUDA Measured)
[ 3 45 77 90 116 147 192 199 219 221 226 227 237 ... 410841449 410841488 ]
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 0 2 2 1 0 0 3 2 2 2 1 1 3 ... 0 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 24.0027ms (std::chrono Measured)
[ 2 2 1 3 2 2 2 1 1 3 1 3 2 ... 3 3 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 24.8678ms (std::chrono Measured)
[ 2 2 1 3 2 2 2 1 1 3 1 3 2 ... 3 3 ]
passed
==== cpu compact with scan ====
elapsed time: 56.7523ms (std::chrono Measured)
[ 2 2 1 3 2 2 2 1 1 3 1 3 2 ... 3 3 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 8.44557ms (CUDA Measured)
[ 2 2 1 3 2 2 2 1 1 3 1 3 2 ... 3 3 ]
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 8.28093ms (CUDA Measured)
[ 2 2 1 3 2 2 2 1 1 3 1 3 2 ... 3 3 ]
passed
Press any key to continue . . .

```

## Performance Analysis

### Scan
![Figure 3](img/scan-performance.png)

*Figure 3. Scan Algorithm Performance Comparison (Power of 2)*

From Figure 3, we can observe the following:
* **GPU Naive** implementation is equally inefficient as the CPU implementation
* **GPU Work-Efficient** implementation is better than the CPU implementation.
However the trajectory of the curve still implies that it is getting exponentially slower with larger elements.
* **GPU Work-Efficient with Shared Memory** is significantly more efficient than without shared memory.
* **GPU Thrust** implementation is the fastest out of all.

#### GPU Work-Efficient
The work-efficient algorithm (without shared memory) is faster than the naive approach as there is overall less work to do for scanning. However, the performance bottlenecks are still noticeable as they mainly come from:
1. Accessing Global Memory - there are lots of read and write operations from/to the global memory.
2. Unused Threads - there are many threads that are unused, especially when the index offset is large.
3. Memory Random Access - global memory are accessed randomly with no particular order.

To fix these, we can:
1. Use Shared Memory - this is essentially the shared memory scanning implementation as discussed in the next section.
2. Use Warp Partitioning - ensure that all the non-active threads are under the same warp so that the warp can be deactivated automatically once the warp is no longer in use.
3. Ensure Memory Coalescing - ensure that we're accessing memory contiguously by rearranging the data array in each iteration.

#### GPU Work-Efficient with Shared Memory
By using shared memory, the need of accessing global memory is significantly reduced. Additionally, shared memory is faster with random memory access. Therefore, problem 1 and problem 3 from above are greatly alleviated. However, with the usage of shared memory, we also introduce a bottleneck from **bank conflicts**. As such, there is still space for improvement once we remove the bank conflicts.

#### GPU Thrust
![Figure 4](img/thrust-timeline.png)

*Figure 4. Nsight timeline for the Thrust implementation*

If we look at the Nsight timeline for the Thrust implementation, we can see that it calls the `cudaMemcpyAsync` function instead of `cudaMemcpy` like all other implementations. This suggests that there may be an asynchronous process going on in Thrust's implementation that copies the buffer data while another process is running, thus having some performance gain.

#### Non Power of 2
![Figure 5](img/scan-performance-npot.png)

*Figure 5. Scan Algorithm Performance Comparison (Non Power of 2)*

There is no significant performance difference between power-of-two sized versus. non-power-of-two-sized arrays.

### Stream Compaction
![Figure 6](img/compact-performance.png)

*Figure 5. Stream Compaction Algorithm Performance Comparison (Non Power of 2)*

As the stream compaction algorithm uses the GPU Work-Efficient (w/o Shared Memory) scan function,
it naturally follows the same performance result. Performance gain can be achieved by alleviating the problems discussed above.
Binary file added img/compact-performance.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/efficient-blocksize.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/naive-blocksize.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/scan-performance-npot.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/scan-performance.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/thrust-timeline.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
32 changes: 23 additions & 9 deletions src/main.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 24; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
Expand Down Expand Up @@ -51,7 +51,7 @@ int main(int argc, char* argv[]) {
printDesc("naive scan, power-of-two");
StreamCompaction::Naive::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

/* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
Expand All @@ -64,35 +64,49 @@ int main(int argc, char* argv[]) {
printDesc("naive scan, non-power-of-two");
StreamCompaction::Naive::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, non-power-of-two");
StreamCompaction::Efficient::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan (shared memory, recursive), power-of-two");
StreamCompaction::Efficient::recursiveScan(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan (shared memory, recursive), non-power-of-two");
StreamCompaction::Efficient::recursiveScan(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("thrust scan, power-of-two");
StreamCompaction::Thrust::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("thrust scan, non-power-of-two");
StreamCompaction::Thrust::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

printf("\n");
Expand Down Expand Up @@ -137,14 +151,14 @@ int main(int argc, char* argv[]) {
printDesc("work-efficient compact, power-of-two");
count = StreamCompaction::Efficient::compact(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, c, true);
printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient compact, non-power-of-two");
count = StreamCompaction::Efficient::compact(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, c, true);
printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

system("pause"); // stop Win32 console from closing on exit
Expand Down
13 changes: 11 additions & 2 deletions stream_compaction/common.cu
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,11 @@ namespace StreamCompaction {
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n)
return;

bools[index] = (idata[index] > 0) ? 1 : 0;
}

/**
Expand All @@ -32,7 +36,12 @@ namespace StreamCompaction {
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n)
return;

if (bools[index] == 1)
odata[indices[index]] = idata[index];
}

}
Expand Down
6 changes: 6 additions & 0 deletions stream_compaction/common.h
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,12 @@ inline int ilog2ceil(int x) {
return x == 1 ? 0 : ilog2(x - 1) + 1;
}

inline int imakepower2(int x) {
while ((x & (x - 1)) != 0)
++x;
return x;
}

namespace StreamCompaction {
namespace Common {
__global__ void kernMapToBoolean(int n, int *bools, const int *idata);
Expand Down
Loading