diff --git a/README.md b/README.md index 0e38ddb..2f791d3 100644 --- a/README.md +++ b/README.md @@ -3,12 +3,225 @@ CUDA Stream Compaction **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** -* (TODO) YOUR NAME HERE - * (TODO) [LinkedIn](), [personal website](), [twitter](), etc. -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +* Bowen Deng + * [LinkedIn](https://www.linkedin.com/in/bowen-deng-7dbw13), [twitter](https://twitter.com/7Dbw13) +* Tested on: Windows 10, AMD Ryzen 9 5900HX with Radeon Graphics @ 3.30GHz 16GB, GeForce RTX 3070 Laptop GPU 8GB (Personal Computer) -### (TODO: Your README) +## Abstract -Include analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +GPU stream compaction and scan (prefix sum) algorithm implemented in CUDA. +> Examples +* scan: + - goal: produce a prefix sum array of a given array (we only care about exclusive scan here) + - input + - [1 5 0 1 2 0 3] + - output + - [0 1 6 6 7 9 9] +* compact: + - goal: pack the elements != 0 closely and neatly + - input + - [1 5 0 1 2 0 3] + - output + - [1 5 1 2 3] + +Different kinds of scan algorithms implemented in this project: + +1. CPU based scan +2. Naive GPU scan +3. Work-Efficient GPU scan +4. GPU scan using Thrust +5. Upgraded GPU efficient scan (*extra credit*) +6. GPU scan using shared memory (*extra credit*) + +Different kinds of compaction implemented in this project: + +1. CPU based compaction +2. CPU compaction using scan +3. GPU compaction using work-efficient GPU scan + +## CMake Note + +The `CMakeLists.txt` files are modified. New files are added in the following way + +``` +. +└── stream_compaction + ├── radix_sort.cu + └── radix_sort.h +``` + +## Output + +Run the test program on array with size `65536` +``` +**************** +** SCAN TESTS ** +**************** + [ 29 17 2 3 30 15 27 26 30 1 2 23 4 ... 
43 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 0.0295ms (std::chrono Measured) + [ 0 29 46 48 51 81 96 123 149 179 180 182 205 ... 1610735 1610778 ] +==== cpu scan, non-power-of-two ==== + elapsed time: 0.0266ms (std::chrono Measured) + [ 0 29 46 48 51 81 96 123 149 179 180 182 205 ... 1610664 1610685 ] + passed +==== naive scan, power-of-two ==== + elapsed time: 0.093184ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + elapsed time: 0.093184ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 0.089088ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 0.090112ms (CUDA Measured) + passed +==== work-efficient-upgraded scan, power-of-two ==== + elapsed time: 0.090112ms (CUDA Measured) + passed +==== work-efficient-upgraded scan, non-power-of-two ==== + elapsed time: 0.089088ms (CUDA Measured) + passed +==== work-efficient-shared scan, power-of-two ==== + elapsed time: 0.038912ms (CUDA Measured) + passed +==== work-efficient-shared scan, non-power-of-two ==== + elapsed time: 0.038912ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 0.041984ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 0.045056ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 1 1 0 3 0 3 3 0 0 3 0 1 2 ... 3 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 0.1044ms (std::chrono Measured) + [ 1 1 3 3 3 3 1 2 2 3 3 1 3 ... 1 3 ] + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 0.1054ms (std::chrono Measured) + [ 1 1 3 3 3 3 1 2 2 3 3 1 3 ... 
3 3 ] + passed +==== cpu compact with scan ==== + elapsed time: 0.2209ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 0.136256ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 0.137216ms (CUDA Measured) + passed +``` + +## Performance Analysis + +### Measurement Metric + +`std::chrono` and CUDA event are applied to measure the time cost of algorithms on CPU and GPU respectively. The unit used for timing is `ms`. + +### Block Size Optimization + +To optimize the block sizes for each kind of scan algorithm, I run all GPU algorithms with different block sizes on an array with size `65536`. The results are shown as below + +![](img/block.png) + +According to this, the block size of the upgraded GPU efficient scan is chosen to be `128`, while all others are set to be `256`. + +### Scan Performance with Different Array Size + +![](img/scan.png) + +From the result we can see: + +- The CPU based scan performs surprisingly well for a small array, but as the size grows, its performance drops dramatically due to high computation complexity. +- The naive GPU scan is even worse than the CPU method, and its performance also drops for a large size since the computation is also complex. +- The work-efficient GPU scan has a performance similar to naive method at first, but it outperforms both naive GPU and CPU scans when the size is large enough. It can be considered that memory I/Os limits the performance for small arrays. +- The Thrust GPU scan is robust. Its performance curve shows almost a linear increment. +- The upgraded GPU efficient scan and GPU scan using shared memory will be talked about with details in extra credit part. + +### Compaction Performance with Different Array Size + +![](img/compact.png) + +Notice that when scan is applied, the performance of CPU compaction drops sharply. 
This is because scan brings additional computation to the compaction algorithm, which is necessary for parallel computing in GPU method. + +## Extra Credit + +### Why is My GPU Approach So Slow + +Obviously, the work-efficient GPU scan is not so efficient that it is even worse than the CPU based scan and similar to naive GPU scan, when the size of array is not large enough. The reason is that as the loop level of upper/down sweep goes deeper, more and more threads becomes inactive. Let's take a look into this part for example +``` +int index = threadIdx.x + (blockIdx.x * blockDim.x); +if (index >= n) { + return; +} + +// Only for multiple of 2^(d+1) +if ((index & ((1 << (d + 1)) - 1)) == 0) { + idata[index + (1 << (d + 1)) - 1] += idata[index + (1 << d) - 1]; +} +``` +The code above performs reduction on `idata`. Note the `if` condition here. Every loop `d` is increased by `1`, and so the number of threads, whose `index` meets the condition, will be halved. These threads do nothing actually. + +To resolve this, I multiply `index` of each thread by `2^(d+1)`. Now all threads with new `index` less than `n` satisfy the previous condition. A new version is like +``` +unsigned long int index = threadIdx.x + (blockIdx.x * blockDim.x); + +// Index hack +// Make use of all threads +index *= (1 << (d + 1)); + +if (index >= n) { + return; +} + +// 'index' is now multiple of 2^(d+1) +idata[index + (1 << (d + 1)) - 1] += idata[index + (1 << d) - 1]; +``` +Thus each thread is utilized. 
At the same time, since the total number of active threads stays unchanged, the number of blocks launched should be halved every loop +``` +// Reduction for log(n) times +for (int d = 0; d < ilog2ceil(n); d++) { + // Halve the number of blocks launched in each turn + act_n /= 2; + dim3 fullBlocksPerGrid((act_n + BLOCK_SIZE_UPGRADED - 1) / BLOCK_SIZE_UPGRADED); + kern_reduction << > > (n, d, idata); + checkCUDAError("kern_reduction failed!"); +} +``` +As a result, a performance gain can be observed in the performance figure when the size of array is large enough. + +### Radix Sort + +A radix sort is implemented using the upgraded GPU efficient scan in `radix_sort.cu`. It can be called by +``` +Sort::radix_sort(int n, int num_bits, int *odata, const int *idata); +``` +where `n` is the length of the array, and `num_bits` refers to number of bits in keys. This function sorts `idata` from smaller to larger and stores the result to `odata`. + +The result of test added in `main.cpp` is (`5`-bit keys, array size `16`) +``` +*************** +** SORT TEST ** +*************** + [ 18 14 7 11 16 5 3 10 1 25 21 11 13 3 19 16 ] +==== radix sort, power-of-two ==== + [ 1 3 3 5 7 10 11 11 13 14 16 16 18 19 21 25 ] +==== radix sort, non-power-of-two ==== + [ 1 3 5 7 10 11 11 13 14 16 18 21 25 ] +``` + +### GPU Scan Using Shared Memory + +The shared memory-based scan is implemented according to [GPU Gem Ch 39](https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html). However, due to the memory limitation, the original method can only handle arrays with size smaller than twice the maximal number of threads in one block (`2048` for my computer). + +To resolve this, first I break the input array into blocks, and then run the original scan on each block. The sum of each block is stored into a new array. The increments between blocks can be computed by scanning this new array. 
Finally adding these increments to each element in the corresponding block results in a fully-scanned array. + +From the performance figure, this method shows significantly improvement on the performance, and it even beats Thrust scan in my experiment settings. diff --git a/img/block.png b/img/block.png new file mode 100644 index 0000000..03eef2b Binary files /dev/null and b/img/block.png differ diff --git a/img/compact.png b/img/compact.png new file mode 100644 index 0000000..f9f316a Binary files /dev/null and b/img/compact.png differ diff --git a/img/scan.png b/img/scan.png new file mode 100644 index 0000000..c67e3f3 Binary files /dev/null and b/img/scan.png differ diff --git a/src/main.cpp b/src/main.cpp index 896ac2b..5faf056 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -11,9 +11,11 @@ #include #include #include +#include #include "testing_helpers.hpp" -const int SIZE = 1 << 8; // feel free to change the size of array +const int SIZE = 1 << 16; // feel free to change the size of array +//const int SIZE = 1 << 8; // feel free to change the size of array const int NPOT = SIZE - 3; // Non-Power-Of-Two int *a = new int[SIZE]; int *b = new int[SIZE]; @@ -36,13 +38,13 @@ int main(int argc, char* argv[]) { // At first all cases passed because b && c are all zeroes. 
zeroArray(SIZE, b); printDesc("cpu scan, power-of-two"); - StreamCompaction::CPU::scan(SIZE, b, a); + StreamCompaction::CPU::scan(SIZE, b, a, true); printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); printArray(SIZE, b, true); zeroArray(SIZE, c); printDesc("cpu scan, non-power-of-two"); - StreamCompaction::CPU::scan(NPOT, c, a); + StreamCompaction::CPU::scan(NPOT, c, a, true); printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); printArray(NPOT, b, true); printCmpResult(NPOT, b, c); @@ -64,23 +66,65 @@ int main(int argc, char* argv[]) { printDesc("naive scan, non-power-of-two"); StreamCompaction::Naive::scan(NPOT, c, a); printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + //printArray(NPOT, c, true); printCmpResult(NPOT, b, c); + //zeroArray(SIZE, c); + //printDesc("naive-shared scan, power-of-two"); + //StreamCompaction::Naive_Shared::scan(SIZE, c, a, true); + //printElapsedTime(StreamCompaction::Naive_Shared::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + ////printArray(SIZE, c, true); + //printCmpResult(SIZE, b, c); + + //zeroArray(SIZE, c); + //printDesc("naive-shared scan, non-power-of-two"); + //StreamCompaction::Naive_Shared::scan(NPOT, c, a, true); + //printElapsedTime(StreamCompaction::Naive_Shared::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + ////printArray(NPOT, c, true); + //printCmpResult(NPOT, b, c); + zeroArray(SIZE, c); printDesc("work-efficient scan, power-of-two"); - StreamCompaction::Efficient::scan(SIZE, c, a); + StreamCompaction::Efficient::scan(SIZE, c, a, true); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); //printArray(SIZE, c, true); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("work-efficient 
scan, non-power-of-two"); - StreamCompaction::Efficient::scan(NPOT, c, a); + StreamCompaction::Efficient::scan(NPOT, c, a, true); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); //printArray(NPOT, c, true); printCmpResult(NPOT, b, c); + zeroArray(SIZE, c); + printDesc("work-efficient-upgraded scan, power-of-two"); + StreamCompaction::Efficient_Upgraded::scan(SIZE, c, a, true); + printElapsedTime(StreamCompaction::Efficient_Upgraded::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + //printArray(SIZE, c, true); + printCmpResult(SIZE, b, c); + + zeroArray(SIZE, c); + printDesc("work-efficient-upgraded scan, non-power-of-two"); + StreamCompaction::Efficient_Upgraded::scan(NPOT, c, a, true); + printElapsedTime(StreamCompaction::Efficient_Upgraded::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + //printArray(NPOT, c, true); + printCmpResult(NPOT, b, c); + + zeroArray(SIZE, c); + printDesc("work-efficient-shared scan, power-of-two"); + StreamCompaction::Efficient_Shared::scan(SIZE, c, a, true); + printElapsedTime(StreamCompaction::Efficient_Shared::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + //printArray(SIZE, c, true); + printCmpResult(SIZE, b, c); + + zeroArray(SIZE, c); + printDesc("work-efficient-shared scan, non-power-of-two"); + StreamCompaction::Efficient_Shared::scan(NPOT, c, a, true); + printElapsedTime(StreamCompaction::Efficient_Shared::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + //printArray(NPOT, c, true); + printCmpResult(NPOT, b, c); + zeroArray(SIZE, c); printDesc("thrust scan, power-of-two"); StreamCompaction::Thrust::scan(SIZE, c, a); @@ -130,7 +174,7 @@ int main(int argc, char* argv[]) { printDesc("cpu compact with scan"); count = StreamCompaction::CPU::compactWithScan(SIZE, c, a); printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono 
Measured)"); - printArray(count, c, true); + //printArray(count, c, true); printCmpLenResult(count, expectedCount, b, c); zeroArray(SIZE, c); @@ -147,6 +191,29 @@ int main(int argc, char* argv[]) { //printArray(count, c, true); printCmpLenResult(count, expectedNPOT, b, c); + printf("\n"); + printf("***************\n"); + printf("** SORT TEST **\n"); + printf("***************\n"); + + // Sort test + + int bits = 5; + int sort_size = 1 << 4; + + genArray(sort_size, a, (1 << bits) - 1); + printArray(sort_size, a, true); + + zeroArray(sort_size, b); + printDesc("radix sort, power-of-two"); + Sort::radix_sort(sort_size, bits, b, a); + printArray(sort_size, b, true); + + zeroArray(sort_size, c); + printDesc("radix sort, non-power-of-two"); + Sort::radix_sort(sort_size - 3, bits, c, a); + printArray(sort_size - 3, c, true); + system("pause"); // stop Win32 console from closing on exit delete[] a; delete[] b; diff --git a/stream_compaction/CMakeLists.txt b/stream_compaction/CMakeLists.txt index 567795b..7b34ba9 100644 --- a/stream_compaction/CMakeLists.txt +++ b/stream_compaction/CMakeLists.txt @@ -4,6 +4,7 @@ set(headers "naive.h" "efficient.h" "thrust.h" + "radix_sort.h" ) set(sources @@ -12,6 +13,7 @@ set(sources "naive.cu" "efficient.cu" "thrust.cu" + "radix_sort.cu" ) list(SORT headers) diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index 2ed6d63..8bd7bbf 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -23,7 +23,12 @@ namespace StreamCompaction { * which map to 0 will be removed, and elements which map to 1 will be kept. 
*/ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { - // TODO + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= n) { + return; + } + + bools[index] = (idata[index] != 0); } /** @@ -32,8 +37,15 @@ namespace StreamCompaction { */ __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices) { - // TODO - } + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= n) { + return; + } + // Assume bools initialized with 0s + if (bools[index]) { + odata[indices[index]] = idata[index]; + } + } } } diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index 719fa11..15581bd 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -17,10 +17,20 @@ namespace StreamCompaction { * For performance analysis, this is supposed to be a simple for loop. * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first. */ - void scan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); - // TODO - timer().endCpuTimer(); + void scan(int n, int *odata, const int *idata, bool timing_on) { + if (timing_on) { + timer().startCpuTimer(); + } + + // Exclusive scan + odata[0] = 0; + for (int i = 1; i < n; i++) { + odata[i] = odata[i - 1] + idata[i - 1]; + } + + if (timing_on) { + timer().endCpuTimer(); + } } /** @@ -30,9 +40,18 @@ namespace StreamCompaction { */ int compactWithoutScan(int n, int *odata, const int *idata) { timer().startCpuTimer(); - // TODO + + // Simply pick non-zero elements + int num_element = 0; + for (int i = 0; i < n; i++) { + if (idata[i]) { + odata[num_element] = idata[i]; + num_element++; + } + } + timer().endCpuTimer(); - return -1; + return num_element; } /** @@ -41,10 +60,36 @@ namespace StreamCompaction { * @returns the number of elements remaining after compaction. 
*/ int compactWithScan(int n, int *odata, const int *idata) { + // Memory allocation + int *bool_buffer = new int[n]; + int *scan_buffer = new int[n]; + timer().startCpuTimer(); - // TODO + + // Set 1 for non-zero elements + for (int i = 0; i < n; i++) { + bool_buffer[i] = (idata[i] != 0); + } + + // Scan + scan(n, scan_buffer, bool_buffer, false); + + // Scatter + for (int i = 0; i < n; i++) { + if (bool_buffer[i]) { + odata[scan_buffer[i]] = idata[i]; + } + } + + // Compute the number of elements remaining after compaction + int num_element = bool_buffer[n - 1] + scan_buffer[n - 1]; + timer().endCpuTimer(); - return -1; + + // Memory free + delete[] bool_buffer; + delete[] scan_buffer; + return num_element; } } } diff --git a/stream_compaction/cpu.h b/stream_compaction/cpu.h index 873c047..3e9119c 100644 --- a/stream_compaction/cpu.h +++ b/stream_compaction/cpu.h @@ -6,7 +6,7 @@ namespace StreamCompaction { namespace CPU { StreamCompaction::Common::PerformanceTimer& timer(); - void scan(int n, int *odata, const int *idata); + void scan(int n, int *odata, const int *idata, bool timing_on); int compactWithoutScan(int n, int *odata, const int *idata); diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index 2db346e..d068a61 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -3,6 +3,9 @@ #include "common.h" #include "efficient.h" +// Block size used for CUDA kernel launch +#define BLOCK_SIZE 128 + namespace StreamCompaction { namespace Efficient { using StreamCompaction::Common::PerformanceTimer; @@ -12,13 +15,99 @@ namespace StreamCompaction { return timer; } + // Add each value at (index+2^(d+1)-1) to the value at (index+2^d-1) in place + __global__ void kern_reduction(int n, int d, int* idata) { + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= n) { + return; + } + + // Only for multiple of 2^(d+1) + if ((index & ((1 << (d + 1)) - 1)) == 0) { + idata[index + (1 << (d + 1)) - 1] += 
idata[index + (1 << d) - 1]; + } + } + + // Up-Sweep phase of efficient scan + void up_sweep(int n, int* idata) { + dim3 fullBlocksPerGrid((n + BLOCK_SIZE - 1) / BLOCK_SIZE); + + // Reduction for log(n) times + for (int d = 0; d < ilog2ceil(n); d++) { + kern_reduction << > > (n, d, idata); + checkCUDAError("kern_reduction failed!"); + } + } + + // Swap values of left and right children, then add value of left to right + __global__ void kern_child_swap_add(int n, int d, int* idata) { + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= n) { + return; + } + + // Only for multiple of 2^(d+1) + if ((index & ((1 << (d + 1)) - 1)) == 0) { + int temp = idata[index + (1 << d) - 1]; + idata[index + (1 << d) - 1] = idata[index + (1 << (d + 1)) - 1]; + idata[index + (1 << (d + 1)) - 1] += temp; + } + } + + // Set last element to zero + __global__ void kern_clear_root(int n, int *idata) { + idata[n - 1] = 0; + } + + // Down-Sweep phase of efficient scan + void down_sweep(int n, int* idata) { + dim3 fullBlocksPerGrid((n + BLOCK_SIZE - 1) / BLOCK_SIZE); + + // Set root to zero + kern_clear_root << <1, 1 >> > (n, idata); + checkCUDAError("kern_clear_root failed!"); + + // log(n) passes + for (int d = ilog2ceil(n) - 1; d >= 0; d--) { + kern_child_swap_add << > > (n, d, idata); + checkCUDAError("kern_child_swap_add failed!"); + } + } + /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. 
*/ - void scan(int n, int *odata, const int *idata) { - timer().startGpuTimer(); - // TODO - timer().endGpuTimer(); + void scan(int n, int *odata, const int *idata, bool timing_on) { + // Create device array + // Rounded to next power of two + int round_n = 1 << ilog2ceil(n); + int *dev_array; + cudaMalloc((void**)&dev_array, round_n * sizeof(int)); + checkCUDAError("cudaMalloc failed!"); + + // Copy data to GPU + cudaMemcpy(dev_array, idata, sizeof(int) * n, cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy failed!"); + + if (timing_on) { + timer().startGpuTimer(); + } + + up_sweep(round_n, dev_array); + + down_sweep(round_n, dev_array); + + if (timing_on) { + timer().endGpuTimer(); + } + + // Copy data back + cudaMemcpy(odata, dev_array, sizeof(int) * n, cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy back failed!"); + + // Cleanup + cudaFree(dev_array); + checkCUDAError("cudaFree failed!"); } /** @@ -31,10 +120,385 @@ namespace StreamCompaction { * @returns The number of elements remaining after compaction. 
*/ int compact(int n, int *odata, const int *idata) { + dim3 fullBlocksPerGrid((n + BLOCK_SIZE - 1) / BLOCK_SIZE); + + // Used for computing the number of elements remaining after compaction + int *last_elements = new int[2]; + + // Create device array + int *dev_array; + int *dev_bool_buffer; + int *dev_scan_buffer; + int *dev_res; + cudaMalloc((void **)&dev_array, n * sizeof(int)); + cudaMalloc((void **)&dev_bool_buffer, n * sizeof(int)); + cudaMalloc((void **)&dev_scan_buffer, n * sizeof(int)); + cudaMalloc((void **)&dev_res, n * sizeof(int)); + checkCUDAError("cudaMalloc failed!"); + + // Copy data to GPU + cudaMemcpy(dev_array, idata, sizeof(int) * n, cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy failed!"); + timer().startGpuTimer(); - // TODO + + // Set 1 for non-zero elements + StreamCompaction::Common::kernMapToBoolean << > > (n, dev_bool_buffer, dev_array); + checkCUDAError("kernMapToBoolean failed!"); + + // Scan + scan(n, dev_scan_buffer, dev_bool_buffer, false); + + // Scatter + StreamCompaction::Common::kernScatter << > > (n, dev_res, dev_array, dev_bool_buffer, dev_scan_buffer); + checkCUDAError("kernScatter failed!"); + timer().endGpuTimer(); - return -1; + + // Fetch last element of bool array and scan array respectively + cudaMemcpy(last_elements, dev_bool_buffer + n - 1, sizeof(int), cudaMemcpyDeviceToHost); + cudaMemcpy(last_elements + 1, dev_scan_buffer + n - 1, sizeof(int), cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy back failed!"); + + // Compute the number of elements remaining after compaction + int num_element = last_elements[0] + last_elements[1]; + free(last_elements); + + // Copy data back + cudaMemcpy(odata, dev_res, sizeof(int) * num_element, cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy back failed!"); + + // Cleanup + cudaFree(dev_array); + cudaFree(dev_bool_buffer); + cudaFree(dev_scan_buffer); + cudaFree(dev_res); + checkCUDAError("cudaFree failed!"); + + return num_element; + } + } + + + + namespace 
Efficient_Upgraded { + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer &timer() + { + static PerformanceTimer timer; + return timer; + } + + // Add each value at (index+2^(d+1)-1) to the value at (index+2^d-1) in place + __global__ void kern_reduction(int n, int d, int *idata) { + unsigned long int index = threadIdx.x + (blockIdx.x * blockDim.x); + + // Index hack + // Make use of all threads + index *= (1 << (d + 1)); + + if (index >= n) { + return; + } + + // 'index' is now multiple of 2^(d+1) + idata[index + (1 << (d + 1)) - 1] += idata[index + (1 << d) - 1]; + } + + // Up-Sweep phase of efficient scan + void up_sweep(int n, int *idata) { + // Number of active elements in array + int act_n = n; + + // Reduction for log(n) times + for (int d = 0; d < ilog2ceil(n); d++) { + // Halve the number of blocks launched in each turn + act_n /= 2; + dim3 fullBlocksPerGrid((act_n + BLOCK_SIZE - 1) / BLOCK_SIZE); + kern_reduction << > > (n, d, idata); + checkCUDAError("kern_reduction failed!"); + } + } + + // Swap values of left and right children, then add value of left to right + __global__ void kern_child_swap_add(int n, int d, int *idata) { + unsigned long int index = threadIdx.x + (blockIdx.x * blockDim.x); + + // Index hack + // Make use of all threads + index *= (1 << (d + 1)); + + if (index >= n) { + return; + } + + // 'index' is now multiple of 2^(d+1) + int temp = idata[index + (1 << d) - 1]; + idata[index + (1 << d) - 1] = idata[index + (1 << (d + 1)) - 1]; + idata[index + (1 << (d + 1)) - 1] += temp; + } + + // Set last element to zero + __global__ void kern_clear_root(int n, int *idata) { + idata[n - 1] = 0; + } + + // Down-Sweep phase of efficient scan + void down_sweep(int n, int *idata) { + // Set root to zero + kern_clear_root << <1, 1 >> > (n, idata); + checkCUDAError("kern_clear_root failed!"); + + // Number of active elements in array + int act_n = n / (1 << (ilog2ceil(n) + 1)) < 1 ? 
1 : n / (1 << (ilog2ceil(n) + 1)); + + // log(n) passes + for (int d = ilog2ceil(n) - 1; d >= 0; d--) { + // Double the number of blocks launched in each turn + act_n *= 2; + dim3 fullBlocksPerGrid((act_n + BLOCK_SIZE - 1) / BLOCK_SIZE); + kern_child_swap_add << > > (n, d, idata); + checkCUDAError("kern_child_swap_add failed!"); + } + } + + /** + * Performs prefix-sum (aka scan) on idata, storing the result into odata. + */ + void scan(int n, int *odata, const int *idata, bool timing_on) { + // Create device array + // Rounded to next power of two + int round_n = 1 << ilog2ceil(n); + int *dev_array; + cudaMalloc((void **)&dev_array, round_n * sizeof(int)); + checkCUDAError("cudaMalloc failed!"); + + // Copy data to GPU + cudaMemcpy(dev_array, idata, sizeof(int) * n, cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy failed!"); + + if (timing_on) { + timer().startGpuTimer(); + } + + up_sweep(round_n, dev_array); + + down_sweep(round_n, dev_array); + + if (timing_on) { + timer().endGpuTimer(); + } + + // Copy data back + cudaMemcpy(odata, dev_array, sizeof(int) * n, cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy back failed!"); + + // Cleanup + cudaFree(dev_array); + checkCUDAError("cudaFree failed!"); + } + } + + + + namespace Efficient_Shared { + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer &timer() + { + static PerformanceTimer timer; + return timer; + } + + // Perform inclusive scan in place on arr, and store sum of this block to sum + // Use shared memory to reduce memory access latency + // Notice that this can only process within ONE block, so n is at most as TWICE as max number of threads in a block + // + // Reference: GPU Gem Ch 39 Example 39.2 + // https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda + __global__ void kern_prescan_inclusive(int n, int *arr, int *sum) { + extern __shared__ int shared_buffer[]; + + int index = threadIdx.x; + int act_index = threadIdx.x 
+ blockIdx.x * blockDim.x; + int block_index = blockIdx.x; + int offset = 1; + + // Copy data to shared memory + shared_buffer[index * 2] = arr[act_index * 2]; + shared_buffer[index * 2 + 1] = arr[act_index * 2 + 1]; + + // Up-Sweep + for (int d = n >> 1; d > 0; d >>= 1) { + // Synchronize all threads at each turn + __syncthreads(); + + // Reduction + if (index < d) { + shared_buffer[offset * (2 * index + 2) - 1] += shared_buffer[offset * (2 * index + 1) - 1]; + } + + // At next turn, double the stride to access + offset *= 2; + } + + // Clear root + if (index == 0) { + shared_buffer[n - 1] = 0; + } + + // Down-Sweep + for (int d = 1; d < n; d *= 2) { + // At next turn, halve the stride to access + offset >>= 1; + + // Synchronize all threads at each turn + __syncthreads(); + + // Swap values of left and right children, then add value of left to right + if (index < d) { + int temp = shared_buffer[offset * (2 * index + 1) - 1]; + shared_buffer[offset * (2 * index + 1) - 1] = shared_buffer[offset * (2 * index + 2) - 1]; + shared_buffer[offset * (2 * index + 2) - 1] += temp; + } + } + + __syncthreads(); + + // Copy data back + arr[act_index * 2] = shared_buffer[index * 2 + 1]; + + // Write sum of block + if (index * 2 + 2 == n) { + arr[act_index * 2 + 1] += shared_buffer[index * 2 + 1]; + sum[block_index] = arr[act_index * 2 + 1]; + } + else { + arr[act_index * 2 + 1] = shared_buffer[index * 2 + 2]; + } + } + + // Perform exclusive scan in place on arr + // Use shared memory to reduce memory access latency + // Notice that this can only process within ONE block, so n is at most as TWICE as max number of threads in a block + // + // Reference: GPU Gem Ch 39 Example 39.2 + // https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda + __global__ void kern_prescan_exclusive(int n, int *arr) { + extern __shared__ int shared_buffer[]; + + int index = threadIdx.x; + int offset = 1; + + // Copy data to shared memory + 
shared_buffer[index * 2] = arr[index * 2]; + shared_buffer[index * 2 + 1] = arr[index * 2 + 1]; + + // Up-Sweep + for (int d = n >> 1; d > 0; d >>= 1) { + // Synchronize all threads at each turn + __syncthreads(); + + // Reduction + if (index < d) { + shared_buffer[offset * (2 * index + 2) - 1] += shared_buffer[offset * (2 * index + 1) - 1]; + } + + // At next turn, double the stride to access + offset *= 2; + } + + // Clear root + if (index == 0) { + shared_buffer[n - 1] = 0; + } + + // Down-Sweep + for (int d = 1; d < n; d *= 2) { + // At next turn, halve the stride to access + offset >>= 1; + + // Synchronize all threads at each turn + __syncthreads(); + + // Swap values of left and right children, then add value of left to right + if (index < d) { + int temp = shared_buffer[offset * (2 * index + 1) - 1]; + shared_buffer[offset * (2 * index + 1) - 1] = shared_buffer[offset * (2 * index + 2) - 1]; + shared_buffer[offset * (2 * index + 2) - 1] += temp; + } + } + + __syncthreads(); + + // Copy data back + arr[index * 2] = shared_buffer[index * 2]; + arr[index * 2 + 1] = shared_buffer[index * 2 + 1]; + } + + // Add block increments to each element in the corresponding block + __global__ void kern_add_increment(int n, int *arr, int *sum) { + int index = threadIdx.x + (blockIdx.x * blockDim.x); + int block_index = blockIdx.x; + if (index >= n) { + return; + } + + arr[index * 2] += sum[block_index]; + arr[index * 2 + 1] += sum[block_index]; + } + + /** + * Performs prefix-sum (aka scan) on idata, storing the result into odata. 
+ */ + void scan(int n, int *odata, const int *idata, bool timing_on) { + int num_blocks = (n + (2 * BLOCK_SIZE) - 1) / (2 * BLOCK_SIZE); + int round_num_blocks = 1 << ilog2ceil(num_blocks); + + // Create device array + int *dev_array; + int *dev_block_sums; + cudaMalloc((void **)&dev_array, n * sizeof(int)); + cudaMalloc((void **)&dev_block_sums, round_num_blocks * sizeof(int)); + checkCUDAError("cudaMalloc failed!"); + + // Copy data to GPU + cudaMemcpy(dev_array, idata, sizeof(int) * n, cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy failed!"); + + if (timing_on) { + timer().startGpuTimer(); + } + + dim3 halfBlocksPerGrid(num_blocks); + // Scan each block and record block sums + kern_prescan_inclusive << > > ((2 * BLOCK_SIZE), dev_array, dev_block_sums); + checkCUDAError("kern_prescan failed!"); + + // Scan block sums + Efficient_Upgraded::up_sweep(round_num_blocks, dev_block_sums); + + Efficient_Upgraded::down_sweep(round_num_blocks, dev_block_sums); + + // Add increments + kern_add_increment << > > (n, dev_array, dev_block_sums); + checkCUDAError("kern_add_increment failed!"); + + // Set identity + odata[0] = 0; + + if (timing_on) { + timer().endGpuTimer(); + } + + // Copy data back + // Shift inclusive scan to exclusive scan + cudaMemcpy(odata + 1, dev_array, sizeof(int) * (n - 1), cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy back failed!"); + + // Cleanup + cudaFree(dev_array); + cudaFree(dev_block_sums); + checkCUDAError("cudaFree failed!"); } } } diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h index 803cb4f..bd70aa8 100644 --- a/stream_compaction/efficient.h +++ b/stream_compaction/efficient.h @@ -6,8 +6,24 @@ namespace StreamCompaction { namespace Efficient { StreamCompaction::Common::PerformanceTimer& timer(); - void scan(int n, int *odata, const int *idata); + void scan(int n, int *odata, const int *idata, bool timing_on); int compact(int n, int *odata, const int *idata); } + + + + namespace Efficient_Upgraded { 
+ StreamCompaction::Common::PerformanceTimer &timer(); + + void scan(int n, int *odata, const int *idata, bool timing_on); + } + + + + namespace Efficient_Shared { + StreamCompaction::Common::PerformanceTimer &timer(); + + void scan(int n, int *odata, const int *idata, bool timing_on); + } } diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu index 4308876..06d481a 100644 --- a/stream_compaction/naive.cu +++ b/stream_compaction/naive.cu @@ -3,6 +3,9 @@ #include "common.h" #include "naive.h" +// Block size used for CUDA kernel launch +#define BLOCK_SIZE 128 + namespace StreamCompaction { namespace Naive { using StreamCompaction::Common::PerformanceTimer; @@ -11,15 +14,146 @@ namespace StreamCompaction { static PerformanceTimer timer; return timer; } - // TODO: __global__ + + // Add each value at (index-2^(d-1)) to the value at (index) + __global__ void kern_add_pairs(int n, int d, const int* idata, int* odata) { + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= n || index < (1 << (d - 1))) { + return; + } + + odata[index] = idata[index] + idata[index - (1 << (d - 1))]; + } /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. 
*/ void scan(int n, int *odata, const int *idata) { + // Create device array and a buffer + int *dev_array; + int *dev_array_buf; + cudaMalloc((void **)&dev_array, n * sizeof(int)); + cudaMalloc((void **)&dev_array_buf, n * sizeof(int)); + checkCUDAError("cudaMalloc failed!"); + + // Copy data to GPU + cudaMemcpy(dev_array, idata, sizeof(int) * n, cudaMemcpyHostToDevice); + cudaMemcpy(dev_array_buf, idata, sizeof(int) * n, cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy failed!"); + + dim3 fullBlocksPerGrid((n + BLOCK_SIZE - 1) / BLOCK_SIZE); + timer().startGpuTimer(); - // TODO + + // Add for log(n) times + for (int d = 1; d <= ilog2ceil(n); d++) { + kern_add_pairs << > > (n, d, dev_array, dev_array_buf); + checkCUDAError("kern_add_pairs failed!"); + + // Ping-pong the buffers + cudaMemcpy(dev_array, dev_array_buf, sizeof(int) * n, cudaMemcpyDeviceToDevice); + checkCUDAError("ping-pong failed!"); + } + + // Set identity + odata[0] = 0; + timer().endGpuTimer(); + + // Copy data back + // Shift inclusive scan to exclusive scan + cudaMemcpy(odata + 1, dev_array, sizeof(int) * (n - 1), cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy back failed!"); + + // Cleanup + cudaFree(dev_array); + cudaFree(dev_array_buf); + checkCUDAError("cudaFree failed!"); + } + } + + + + namespace Naive_Shared { + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer &timer() + { + static PerformanceTimer timer; + return timer; + } + + // Perform scan on arr + // Use shared memory to reduce memory access latency + // Notice that this can only process within ONE block, so n is at most as SAME as max number of threads in a block + // + // Reference: GPU Gem Ch 39 Example 39.1 + // https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda + // Bug in origin code + __global__ void kern_prescan(int n, int *arr) { + extern __shared__ int shared_buffer[]; + + int index = threadIdx.x; + int out_index = 0, in_index = 1; + 
+ // Copy data to shared memory + shared_buffer[index] = arr[index]; + + __syncthreads(); + + // Add pairs + for (int offset = 1; offset < n; offset *= 2) { + // Swap indices for two halves of array + shared_buffer[in_index * n + index] = shared_buffer[out_index * n + index]; + out_index = 1 - out_index; + in_index = 1 - out_index; + + if (index >= offset) { + shared_buffer[out_index * n + index] += shared_buffer[in_index * n + index - offset]; + } + else { + shared_buffer[out_index * n + index] = shared_buffer[in_index * n + index]; + } + + // Synchronize all threads at each turn + __syncthreads(); + } + + // Copy data back + // Shift by 1 and set 0 for first element + arr[index] = index > 0 ? shared_buffer[out_index * n + index - 1] : 0; + } + + /** + * Performs prefix-sum (aka scan) on idata, storing the result into odata. + */ + void scan(int n, int *odata, const int *idata, bool timing_on) { + // Create device array + int *dev_array; + cudaMalloc((void **)&dev_array, n * sizeof(int)); + checkCUDAError("cudaMalloc failed!"); + + // Copy data to GPU + cudaMemcpy(dev_array, idata, sizeof(int) * n, cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy failed!"); + + if (timing_on) { + timer().startGpuTimer(); + } + + kern_prescan << <1, n, n * 2 * sizeof(int) >> > (n, dev_array); + checkCUDAError("kern_prescan failed!"); + + if (timing_on) { + timer().endGpuTimer(); + } + + // Copy data back + cudaMemcpy(odata, dev_array, sizeof(int) * n, cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy back failed!"); + + // Cleanup + cudaFree(dev_array); + checkCUDAError("cudaFree failed!"); } } } diff --git a/stream_compaction/naive.h b/stream_compaction/naive.h index 37dcb06..49d90c7 100644 --- a/stream_compaction/naive.h +++ b/stream_compaction/naive.h @@ -8,4 +8,12 @@ namespace StreamCompaction { void scan(int n, int *odata, const int *idata); } + + + + namespace Naive_Shared { + StreamCompaction::Common::PerformanceTimer &timer(); + + void scan(int n, int *odata, const 
int *idata, bool timing_on); + } } diff --git a/stream_compaction/radix_sort.cu b/stream_compaction/radix_sort.cu new file mode 100644 index 0000000..c10d5a0 --- /dev/null +++ b/stream_compaction/radix_sort.cu @@ -0,0 +1,149 @@ +#include "radix_sort.h" + +// Block size used for CUDA kernel launch +#define BLOCK_SIZE 128 + +namespace Sort { + // Map each element in idata to 0/1 contrary to its d-th bit + __global__ void kern_map_bit_to_bool(int n, int d, int *rbools, const int *idata) { + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= n) { + return; + } + + rbools[index] = !((idata[index] >> d) & 1); + } + + // Generate the indices of split result for elements with true keys + __global__ void kern_gen_true_key_index(int n, int falses, int *odata, const int *scan) { + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= n) { + return; + } + + odata[index] = index - scan[index] + falses; + } + + // Generate the indices of split result for all elements + __global__ void kern_gen_index(int n, int *odata, const int *rbools, const int *scan, const int *t) { + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= n) { + return; + } + + odata[index] = rbools[index] ? scan[index] : t[index]; + } + + // Scatter based on index array + __global__ void kern_scatter(int n, int *odata, const int *addr, const int *idata) { + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= n) { + return; + } + + odata[addr[index]] = idata[index]; + } + + /* + * Performs split on idata at turn d, storing the result into odata. + * Output array with false keys before true keys. + * + * @param n The number of elements in idata. + * @param idata The array of elements to split. + * @param rbools True/False for bit d (reversed). + * @param odata The result array. 
+ */ + void split(int n, int *odata, const int *idata, const int *rbools) { + // Create device array + int *dev_scan_buffer; + int *dev_true_buffer; + int *dev_index_buffer; + cudaMalloc((void **)&dev_scan_buffer, n * sizeof(int)); + cudaMalloc((void **)&dev_true_buffer, n * sizeof(int)); + cudaMalloc((void **)&dev_index_buffer, n * sizeof(int)); + checkCUDAError("cudaMalloc failed!"); + + // Exclusive scan on reversed bool array + StreamCompaction::Efficient_Upgraded::scan(n, dev_scan_buffer, rbools, false); + + // Used for computing the number of elements remaining after compaction + int *last_elements = new int[2]; + + // Fetch last element of reversed bool array and scan array respectively + cudaMemcpy(last_elements, rbools + n - 1, sizeof(int), cudaMemcpyDeviceToHost); + cudaMemcpy(last_elements + 1, dev_scan_buffer + n - 1, sizeof(int), cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy back failed!"); + + // Compute the number of total falses + int total_falses = last_elements[0] + last_elements[1]; + free(last_elements); + + // Generate index array for writing true keys + dim3 fullBlocksPerGrid((n + BLOCK_SIZE - 1) / BLOCK_SIZE); + kern_gen_true_key_index << > > (n, total_falses, dev_true_buffer, dev_scan_buffer); + checkCUDAError("kern_gen_true_key_index failed!"); + + // Generate index array for writing all keys + kern_gen_index << > > (n, dev_index_buffer, rbools, dev_scan_buffer, dev_true_buffer); + checkCUDAError("kern_gen_index failed!"); + + // Scatter to output + kern_scatter << > > (n, odata, dev_index_buffer, idata); + checkCUDAError("kern_scatter failed!"); + + // Cleanup + cudaFree(dev_scan_buffer); + cudaFree(dev_true_buffer); + cudaFree(dev_index_buffer); + checkCUDAError("cudaFree failed!"); + } + + /* + * Performs radix sort on idata, storing the result into odata. + * Sort data from smaller to larger. + * + * @param n The number of elements in idata. + * @param num_bits The maximum number of bits. 
+ * @param idata The array of elements to sort. + * @param odata The result array. + */ + void radix_sort(int n, int num_bits, int *odata, const int *idata) { + // Create device array + int *dev_array; + int *dev_res; + int *dev_bool_buffer; + cudaMalloc((void **)&dev_array, n * sizeof(int)); + cudaMalloc((void **)&dev_res, n * sizeof(int)); + cudaMalloc((void **)&dev_bool_buffer, n * sizeof(int)); + checkCUDAError("cudaMalloc failed!"); + + // Copy data to GPU + cudaMemcpy(dev_array, idata, sizeof(int) * n, cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy failed!"); + + dim3 fullBlocksPerGrid((n + BLOCK_SIZE - 1) / BLOCK_SIZE); + + // Split for num_bits times + for (int k = 0; k < num_bits; k++) { + // Generate bool array + kern_map_bit_to_bool << > > (n, k, dev_bool_buffer, dev_array); + checkCUDAError("kern_map_bit_to_bool failed!"); + + split(n, dev_res, dev_array, dev_bool_buffer); + + // Ping-pong the buffers + cudaMemcpy(dev_array, dev_res, sizeof(int) * n, cudaMemcpyDeviceToDevice); + checkCUDAError("ping-pong failed!"); + } + + // Copy data back + cudaMemcpy(odata, dev_res, sizeof(int) * n, cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy back failed!"); + + // Cleanup + cudaFree(dev_array); + cudaFree(dev_res); + cudaFree(dev_bool_buffer); + checkCUDAError("cudaFree failed!"); + } +} \ No newline at end of file diff --git a/stream_compaction/radix_sort.h b/stream_compaction/radix_sort.h new file mode 100644 index 0000000..f6313e6 --- /dev/null +++ b/stream_compaction/radix_sort.h @@ -0,0 +1,11 @@ +#pragma once + +#include +#include "common.h" +#include "efficient.h" + +namespace Sort { + void split(int n, int *odata, const int *idata, const int *rbools); + + void radix_sort(int n, int num_bits, int *odata, const int *idata); +} \ No newline at end of file diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu index 1def45e..9ce53f9 100644 --- a/stream_compaction/thrust.cu +++ b/stream_compaction/thrust.cu @@ -18,11 +18,20 @@ 
namespace StreamCompaction { * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { + // Create thrust vector and cast to GPU + thrust::host_vector thrust_idata(idata, idata + n); + thrust::device_vector dev_thrust_idata = thrust_idata; + thrust::device_vector dev_thrust_odata(n); + timer().startGpuTimer(); - // TODO use `thrust::exclusive_scan` - // example: for device_vectors dv_in and dv_out: - // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); + + thrust::exclusive_scan(dev_thrust_idata.begin(), dev_thrust_idata.end(), dev_thrust_odata.begin()); + timer().endGpuTimer(); + + int *dev_odata = thrust::raw_pointer_cast(dev_thrust_odata.data()); + cudaMemcpy(odata, dev_odata, sizeof(int) * n, cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy back failed!"); } } }