Binary file added Performance.xlsx
126 changes: 120 additions & 6 deletions README.md
@@ -3,12 +3,126 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

- * (TODO) YOUR NAME HERE
- * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
- * Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Yuxuan Zhu
* [LinkedIn](https://www.linkedin.com/in/andrewyxzhu/)
* Tested on: Windows 10, i7-7700HQ @ 2.80GHz 16GB, GTX 1050 4096MB (Personal Laptop)

- ### (TODO: Your README)
**Introduction**

- Include analysis, etc. (Remember, this is public, so don't put
- anything here that you don't want to share with the world.)
This project implements several versions of the exclusive scan algorithm and of stream compaction.
I implemented one CPU version of exclusive scan and four GPU versions.

The CPU version of exclusive scan supports scanning in place by keeping the running sum in local variables. It is surprisingly fast, even for millions of elements.

The first GPU version is the naive scan, which repeatedly sums pairs of elements over log n passes and therefore performs O(n log n) additions in total. A minimal sketch of one pass is shown below.
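
As an illustration, one pass of the naive scan looks roughly like the kernel below. This is a minimal sketch with made-up names, not the exact code in `stream_compaction/naive.cu`.

```
// One pass of the naive scan. The host launches ceil(log2(n)) passes, doubling
// `offset` each time and ping-ponging the input/output buffers; a final shift
// turns the inclusive result into an exclusive scan.
__global__ void kernNaiveScanPass(int n, int offset, int *odata, const int *idata) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= n) {
        return;
    }
    if (k >= offset) {
        odata[k] = idata[k - offset] + idata[k]; // add the element `offset` slots to the left
    } else {
        odata[k] = idata[k];                     // nothing that far to the left; copy through
    }
}
```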
The second GPU version is the work-efficient scan, which consists of an up-sweep (reduction) phase and a down-sweep phase over a balanced binary tree. I launch a different number of threads at each level of the tree to reduce the number of "wasted" threads and improve performance; a sketch of the two kernels follows.
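
Below is a hedged sketch of the two kernels; the real ones live in `stream_compaction/efficient.cu`, and the names and indexing convention here are assumptions. At each level only one thread per touched tree node is launched, which is what keeps threads from being wasted:

```
// Up-sweep (parallel reduction): each launched thread folds a left child into
// its parent. For level d, stride = 2^(d+1) and numActive = n / stride.
__global__ void kernUpSweep(int numActive, int stride, int *data) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numActive) return;
    int right = (t + 1) * stride - 1;
    data[right] += data[right - stride / 2];
}

// Down-sweep: after the host zeroes data[n - 1], walk back down the tree,
// swapping and accumulating partial sums to produce the exclusive scan.
__global__ void kernDownSweep(int numActive, int stride, int *data) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numActive) return;
    int right = (t + 1) * stride - 1;
    int left = right - stride / 2;
    int tmp = data[left];
    data[left] = data[right];  // left child takes the parent's value
    data[right] += tmp;        // right child takes parent + old left subtree sum
}
```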
The third GPU version optimizes the second one by using shared memory to reduce the frequency of global memory accesses. It is roughly twice as fast as the second version, and it is faster than the CPU version for larger arrays. (This is the extra-credit implementation.) A per-block sketch is shown below.
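
The per-block part of the optimized scan can be sketched as below. This assumes a standard Blelloch scan over a shared-memory tile, a power-of-two block size, and `blockDim.x * sizeof(int)` bytes of dynamic shared memory at launch; the actual kernel in `stream_compaction/efficient.cu` may differ in details.

```
// Scan one tile of blockDim.x elements entirely in shared memory, and write
// the tile total to blockSums so the tiles can be stitched together later.
__global__ void kernScanBlockShared(int n, int *odata, const int *idata, int *blockSums) {
    extern __shared__ int tile[];
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    tile[tid] = (gid < n) ? idata[gid] : 0;  // load the tile, padding the tail with zeros
    __syncthreads();

    // Up-sweep (reduction) within the tile.
    for (int stride = 2; stride <= blockDim.x; stride <<= 1) {
        int idx = (tid + 1) * stride - 1;
        if (idx < blockDim.x) {
            tile[idx] += tile[idx - stride / 2];
        }
        __syncthreads();
    }

    // Record the tile total, then clear the root for the down-sweep.
    if (tid == 0) {
        blockSums[blockIdx.x] = tile[blockDim.x - 1];
        tile[blockDim.x - 1] = 0;
    }
    __syncthreads();

    // Down-sweep within the tile to produce its exclusive scan.
    for (int stride = blockDim.x; stride >= 2; stride >>= 1) {
        int idx = (tid + 1) * stride - 1;
        if (idx < blockDim.x) {
            int left = idx - stride / 2;
            int tmp = tile[left];
            tile[left] = tile[idx];
            tile[idx] += tmp;
        }
        __syncthreads();
    }

    if (gid < n) {
        odata[gid] = tile[tid];
    }
}
```

The block sums written to `blockSums` are then scanned themselves, and a final kernel adds each block's scanned sum back to every element of that block.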
The fourth GPU version simply calls the Thrust library; it is highly optimized and extremely fast for large arrays. A minimal usage example follows.
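
For reference, the Thrust path presumably reduces to a single call to `thrust::exclusive_scan`. A minimal standalone usage (generic Thrust, not necessarily the exact wrapper in `stream_compaction/thrust.cu`) looks like this:

```
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Exclusive scan of n ints from a host array into a host array via Thrust.
void thrustExclusiveScan(int n, int *odata, const int *idata) {
    thrust::device_vector<int> dv_in(idata, idata + n);  // host -> device copy
    thrust::device_vector<int> dv_out(n);
    thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
    thrust::copy(dv_out.begin(), dv_out.end(), odata);   // device -> host copy
}
```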

I also implemented two CPU versions and two GPU versions of stream compaction.
The first CPU version compacts the array in a single sequential pass; it is quite fast.
The second CPU version simulates the GPU approach on the CPU by mapping elements to booleans, scanning, and scattering; it is a lot slower.
The first GPU version uses the work-efficient scan to implement compaction; it is quite fast.
The second GPU version uses the optimized (shared-memory) work-efficient scan; it is faster than the first GPU version and often faster than the CPU version. The overall GPU pipeline is sketched below.
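
In outline, the GPU compaction path is map → scan → scatter. The sketch below assumes `kernMapToBoolean` and `kernScatter` from `stream_compaction/common.cu` are visible, uses the `blockSize` macro from `common.h`, and calls a hypothetical `deviceScan` that exclusive-scans device arrays; it is not the exact code in `stream_compaction/efficient.cu`.

```
// Flag non-zero elements, exclusive-scan the flags to get each kept element's
// output index, then scatter. Returns the compacted length.
int compactOnDevice(int n, int *dev_odata, const int *dev_idata) {
    int *dev_bools, *dev_indices;
    cudaMalloc(&dev_bools, n * sizeof(int));
    cudaMalloc(&dev_indices, n * sizeof(int));

    dim3 numBlocks((n + blockSize - 1) / blockSize);
    kernMapToBoolean<<<numBlocks, blockSize>>>(n, dev_bools, dev_idata);

    deviceScan(n, dev_indices, dev_bools);  // exclusive scan of the 0/1 flags

    kernScatter<<<numBlocks, blockSize>>>(n, dev_odata, dev_idata, dev_bools, dev_indices);

    // The compacted length is the last scan value plus the last flag.
    int lastIndex = 0, lastBool = 0;
    cudaMemcpy(&lastIndex, dev_indices + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&lastBool, dev_bools + n - 1, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dev_bools);
    cudaFree(dev_indices);
    return lastIndex + lastBool;
}
```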

**Performance Analysis**

I empirically found that the best block size is 256 for all versions of scan.

![Scan](img/Scan_Performance.png)

The graph above shows the performance comparison among the different versions of the scan algorithm. The CPU version is almost always faster, unless the array size is extremely large. The Thrust implementation is always strictly faster than mine. This is reasonable, since my understanding of GPU performance is still limited and I did not optimize everything; for example, I did not try to reduce shared-memory bank conflicts, and I do not yet know what Thrust does under the hood. To find the bottlenecks in my own implementations, I commented out individual kernels and measured how much performance improved. This showed that the kernel that adds each block's increment to every element of that block was very slow because of frequent global memory reads and writes, so I restructured that step to use shared memory. A sketch of the kind of change involved is shown below.
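
As an illustration of the kind of change involved (the actual kernels in `stream_compaction/efficient.cu` may be structured differently), the increment step can be rewritten so that each block reads its increment from global memory once and shares it through shared memory:

```
// Before: every thread reads blockIncrements[blockIdx.x] from global memory.
__global__ void kernAddIncrementNaive(int n, int *data, const int *blockIncrements) {
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) {
        data[gid] += blockIncrements[blockIdx.x];
    }
}

// After: one thread per block loads the increment into shared memory, and the
// rest of the block reads it from there.
__global__ void kernAddIncrementShared(int n, int *data, const int *blockIncrements) {
    __shared__ int inc;
    if (threadIdx.x == 0) {
        inc = blockIncrements[blockIdx.x];
    }
    __syncthreads();
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) {
        data[gid] += inc;
    }
}
```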

![Compaction](img/Stream_Compaction_Performance.png)

The graph above shows the performance comparison among the different versions of the stream compaction algorithm. The CPU version is outperformed by my optimized GPU version once the array size exceeds roughly 100,000 elements.

Below is the output of my test program.


```

****************
** SCAN TESTS **
****************
[ 34 43 38 22 3 18 25 14 49 31 41 44 20 ... 9 0 ]
==== cpu scan, power-of-two ====
elapsed time: 9.4994ms (std::chrono Measured)
[ 0 34 77 115 137 140 158 183 197 246 277 318 362 ... 244933885 244933894 ]
==== cpu scan, non-power-of-two ====
elapsed time: 6.9843ms (std::chrono Measured)
[ 0 34 77 115 137 140 158 183 197 246 277 318 362 ... 244933830 244933839 ]
passed
==== naive scan, power-of-two ====
elapsed time: 26.3491ms (CUDA Measured)
[ 0 34 77 115 137 140 158 183 197 246 277 318 362 ... 244933885 244933894 ]
passed
==== naive scan, non-power-of-two ====
elapsed time: 25.6866ms (CUDA Measured)
[ 0 34 77 115 137 140 158 183 197 246 277 318 362 ... 0 0 ]
passed
==== work-efficient scan, power-of-two ====
elapsed time: 18.4584ms (CUDA Measured)
[ 0 34 77 115 137 140 158 183 197 246 277 318 362 ... 244933885 244933894 ]
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 17.1187ms (CUDA Measured)
[ 0 34 77 115 137 140 158 183 197 246 277 318 362 ... 244933830 244933839 ]
passed
==== optimized work-efficient scan, power-of-two ====
elapsed time: 8.34717ms (CUDA Measured)
[ 0 34 77 115 137 140 158 183 197 246 277 318 362 ... 244933885 244933894 ]
passed
==== optimized work-efficient scan, non-power-of-two ====
elapsed time: 7.95645ms (CUDA Measured)
[ 0 34 77 115 137 140 158 183 197 246 277 318 362 ... 244933830 244933839 ]
passed
==== thrust scan, power-of-two ====
elapsed time: 1.05674ms (CUDA Measured)
[ 0 34 77 115 137 140 158 183 197 246 277 318 362 ... 244933885 244933894 ]
passed
==== thrust scan, non-power-of-two ====
elapsed time: 1.1897ms (CUDA Measured)
[ 0 34 77 115 137 140 158 183 197 246 277 318 362 ... 244933830 244933839 ]
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 1 1 0 3 3 1 2 3 1 3 0 2 1 ... 3 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 31.6402ms (std::chrono Measured)
[ 1 1 3 3 1 2 3 1 3 2 1 2 1 ... 2 3 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 28.173ms (std::chrono Measured)
[ 1 1 3 3 1 2 3 1 3 2 1 2 1 ... 3 2 ]
passed
==== cpu compact with scan ====
elapsed time: 52.5519ms (std::chrono Measured)
[ 1 1 3 3 1 2 3 1 3 2 1 2 1 ... 2 3 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 21.0639ms (CUDA Measured)
[ 1 1 3 3 1 2 3 1 3 2 1 2 1 ... 2 3 ]
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 20.3782ms (CUDA Measured)
[ 1 1 3 3 1 2 3 1 3 2 1 2 1 ... 3 2 ]
passed
==== optimized work-efficient compact, power-of-two ====
elapsed time: 11.0875ms (CUDA Measured)
[ 1 1 3 3 1 2 3 1 3 2 1 2 1 ... 2 3 ]
passed
==== optimized work-efficient compact, non-power-of-two ====
elapsed time: 11.0332ms (CUDA Measured)
[ 1 1 3 3 1 2 3 1 3 2 1 2 1 ... 3 2 ]
passed


```
Binary file added img/Scan_Performance.png
Binary file added img/Stream_Compaction_Performance.png
51 changes: 40 additions & 11 deletions src/main.cpp
@@ -13,7 +13,7 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

- const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 10000000; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
@@ -51,11 +51,11 @@ int main(int argc, char* argv[]) {
printDesc("naive scan, power-of-two");
StreamCompaction::Naive::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

- /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
- onesArray(SIZE, c);
//* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
/**onesArray(SIZE, c);
printDesc("1s array for finding bugs");
StreamCompaction::Naive::scan(SIZE, c, a);
printArray(SIZE, c, true); */
@@ -64,35 +64,50 @@ int main(int argc, char* argv[]) {
printDesc("naive scan, non-power-of-two");
StreamCompaction::Naive::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, non-power-of-two");
StreamCompaction::Efficient::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(NPOT, c, true);
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("optimized work-efficient scan, power-of-two");
StreamCompaction::Efficient::optimizedScan(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("optimized work-efficient scan, non-power-of-two");
StreamCompaction::Efficient::optimizedScan(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);


zeroArray(SIZE, c);
printDesc("thrust scan, power-of-two");
StreamCompaction::Thrust::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("thrust scan, non-power-of-two");
StreamCompaction::Thrust::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(NPOT, c, true);
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

printf("\n");
@@ -137,14 +152,28 @@ int main(int argc, char* argv[]) {
printDesc("work-efficient compact, power-of-two");
count = StreamCompaction::Efficient::compact(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(count, c, true);
printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient compact, non-power-of-two");
count = StreamCompaction::Efficient::compact(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(count, c, true);
printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

zeroArray(SIZE, c);
printDesc("optimized work-efficient compact, power-of-two");
count = StreamCompaction::Efficient::optimizedCompact(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

zeroArray(SIZE, c);
printDesc("optimized work-efficient compact, non-power-of-two");
count = StreamCompaction::Efficient::optimizedCompact(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

system("pause"); // stop Win32 console from closing on exit
15 changes: 13 additions & 2 deletions stream_compaction/common.cu
@@ -23,7 +23,15 @@ namespace StreamCompaction {
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
- // TODO
int index = blockIdx.x * blockDim.x + threadIdx.x;
if (index < n) {
if (idata[index] == 0) {
bools[index] = 0;
}
else {
bools[index] = 1;
}
}
}

/**
@@ -32,7 +40,10 @@
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
- // TODO
int index = blockIdx.x * blockDim.x + threadIdx.x;
if (index < n && bools[index] == 1) {
odata[indices[index]] = idata[index];
}
}

}
2 changes: 2 additions & 0 deletions stream_compaction/common.h
@@ -12,6 +12,8 @@

#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)
#define blockSize 256
#define efficientBlockSize 256

/**
* Check for CUDA errors; print and exit if there was a problem.
42 changes: 36 additions & 6 deletions stream_compaction/cpu.cu
@@ -19,7 +19,14 @@ namespace StreamCompaction {
*/
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
- // TODO
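// Exclusive scan with a running sum. Reading idata[i] into `curr` before
// writing odata[i] keeps this correct even when odata and idata alias
// (the in-place case mentioned in the README).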
int curr;
int sum = idata[0];
odata[0] = 0;
for (int i = 1; i < n; i++) {
curr = idata[i];
odata[i] = sum;
sum += curr;
}
timer().endCpuTimer();
}

@@ -30,9 +37,14 @@ namespace StreamCompaction {
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
- // TODO
int count = 0;
for (int i = 0; i < n; i++) {
if (idata[i] != 0) {
odata[count++] = idata[i];
}
}
timer().endCpuTimer();
- return -1;
return count;
}

/**
@@ -41,10 +53,28 @@ namespace StreamCompaction {
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
- timer().startCpuTimer();
- // TODO
int* indicator = new int[n];
timer().startCpuTimer();
for (int i = 0; i < n; i++) {
if (idata[i] == 0) {
indicator[i] = 0;
}
else {
indicator[i] = 1;
}
}
odata[0] = 0; // odata is currently the array storing the exclusive scan result
for (int i = 1; i < n; i++) {
odata[i] = odata[i - 1] + indicator[i - 1];
}
int count = odata[n - 1] + indicator[n - 1];
for (int i = 0; i < n; i++) {
if (indicator[i] != 0) {
odata[odata[i]] = idata[i]; // odata[i] <= i, so this never overwrites a scan value that is still needed; odata becomes the compacted array
}
}
timer().endCpuTimer();
- return -1;
delete[] indicator; // free the temporary indicator array
return count;
}
}
}