diff --git a/README.md b/README.md
index d63a6a1..bdf357e 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,72 @@
**University of Pennsylvania, CIS 565: GPU Programming and Architecture,
Project 1 - Flocking**
-* (TODO) YOUR NAME HERE
- * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Dongying Liu
+ * [LinkedIn](https://www.linkedin.com/in/dongying-liu/), [personal website](https://vivienliu1998.wixsite.com/portfolio)
+* Tested on: Windows 11, i7-11700 @ 2.50GHz, NVIDIA GeForce RTX 3060
-### (TODO: Your README)
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+# Project Description
+
+Uniform Grid, Coherent Memory Buffer, 100,000 Boids.
+
+Uniform Grid, Coherent Memory Buffer, 50,000 Boids.
+
+This project implements a Boids flocking simulation.
+Particles (boids) behave according to three rules:
+1. **cohesion** - boids move towards the perceived center of mass of their neighbors
+2. **separation** - boids avoid getting too close to their neighbors
+3. **alignment** - boids generally try to move in the same direction and at the same speed as their neighbors
+
+These three rules specify a boid's velocity change in a timestep.
+The code is based on [Conard Parker's notes](http://www.vergenet.net/~conrad/boids/pseudocode.html) with slight adaptations.
+The main goal of this project is to get familiar with writing simple CUDA kernels and launching them from the host.
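+
+The velocity update is just an accumulation of these three rules over the boids within each rule's distance. Below is a minimal CPU-style sketch of that logic (the CUDA kernel `computeVelocityChange` in `src/kernel.cu` follows the same structure; the rule distances/scales shown here are illustrative):
+
+```cpp
+#include <vector>
+#include <glm/glm.hpp>
+
+// Illustrative rule parameters; the actual values live in src/kernel.cu.
+static const float rule1Distance = 5.0f, rule2Distance = 3.0f, rule3Distance = 5.0f;
+static const float rule1Scale = 0.01f, rule2Scale = 0.1f, rule3Scale = 0.1f;
+
+// Per-boid rule accumulation: cohesion, separation, alignment.
+glm::vec3 velocityChange(int iSelf, const std::vector<glm::vec3>& pos,
+                         const std::vector<glm::vec3>& vel) {
+  glm::vec3 center(0.0f), away(0.0f), avgVel(0.0f);
+  int countRule1 = 0, countRule3 = 0;
+  for (int i = 0; i < (int)pos.size(); ++i) {
+    if (i == iSelf) continue;
+    float d = glm::distance(pos[iSelf], pos[i]);
+    if (d < rule1Distance) { center += pos[i]; countRule1++; }    // cohesion
+    if (d < rule2Distance) { away   -= (pos[i] - pos[iSelf]); }   // separation
+    if (d < rule3Distance) { avgVel += vel[i]; countRule3++; }    // alignment
+  }
+  glm::vec3 v1(0.0f), v2(0.0f), v3(0.0f);
+  if (countRule1 > 0) v1 = (center / (float)countRule1 - pos[iSelf]) * rule1Scale;
+  v2 = away * rule2Scale;
+  if (countRule3 > 0) v3 = (avgVel / (float)countRule3) * rule3Scale;
+  return v1 + v2 + v3;
+}
+```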
+
+I've implemented the update function (the main simulation step) in three ways:
+
+1. **NAIVE** - Each boid checks every other boid in the simulation to compute its resulting velocity.
+
+2. **UNIFORM GRID** - Each boid is pre-processed into a uniform grid cell. Each boid then only needs to check the boids in its 27 neighboring cells to compute its velocity change.
+
+3. **COHERENT GRID** - Same as the uniform grid, but the position and velocity buffers are also sorted by grid cell index, so the simulation reads from contiguous memory (see the reshuffle sketch below).
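+
+The reshuffle for the coherent grid is just a gather by the sorted index array. A minimal sketch, mirroring `kernCreateCohBuffer` in `src/kernel.cu`:
+
+```cuda
+// Gather pos/vel into cell-sorted order so the neighbor loops read contiguous memory.
+// arrayIndices[i] is the original boid index of the i-th entry after sorting by grid cell.
+__global__ void kernGatherCoherent(int N, const int* arrayIndices,
+                                   const glm::vec3* pos, const glm::vec3* vel,
+                                   glm::vec3* posCoh, glm::vec3* velCoh) {
+  int i = blockIdx.x * blockDim.x + threadIdx.x;
+  if (i >= N) return;
+  int src = arrayIndices[i];  // where this boid's data lives in the unsorted buffers
+  posCoh[i] = pos[src];
+  velCoh[i] = vel[src];
+}
+```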
+
+# Performance Analysis
+
+## Framerate change with increasing count of boids
+
+
+
+
+As the graph shows, the FPS generally declines as the number of boids increases.
+The naive method performs worse than the other two, since every boid has to check every other boid, whereas the two grid-based methods use the grid cells to narrow down the number of boids each boid needs to check.
+We can also see that the coherent grid performs slightly better than the uniform grid, since the position and velocity buffers are rearranged so that memory accesses within a cell are contiguous.
+
+
+
+When the visualization is enabled, the reported framerate includes rendering, not just the simulation. As the graph shows, performance is better with the visualization turned off.
+
+## Framerate change with increasing block size
+
+Tested with the coherent grid and 500,000 boids.
+
+From the graph we can see only a mild falloff in FPS as the block size increases. I think the reason is that changing the block size does not change the total number of threads launched; it only changes how the threads are grouped, i.e. how many blocks there are and how many threads are in each block.
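+
+Concretely, the grid size in this project is derived from the boid count and the block size, so a larger block size just means fewer blocks while the total thread count stays the same. A sketch based on the launch code in `src/kernel.cu` (the block counts in the comments are just the arithmetic for 500,000 boids):
+
+```cpp
+// Launch configuration used by the kernels in this project:
+dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+// numObjects threads are needed either way; blockSize only changes the grouping.
+// e.g. 500,000 boids: blockSize = 128  -> 3907 blocks of 128 threads
+//                     blockSize = 1024 ->  489 blocks of 1024 threads
+// kernUpdateVelNeighborSearchCoherent<<<fullBlocksPerGrid, blockSize>>>(...);
+```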
+
+# Questions
+
+**Q: For each implementation, how does changing the number of boids affect performance? Why do you think this is?**
+
+When the number of boids increases, the framerate decreases significantly. For each boid, we check the boids in the 27 surrounding grid cells (its own cell and its 26 neighbors) to compute its velocity change. As the total number of boids grows, the number of boids in each grid cell also grows, so even though we only look at 27 cells, the total number of boids checked per boid still increases. That said, limiting the number of boids checked per boid does clearly improve the framerate, which can be seen by comparing the naive method with the other two methods.
+
+**Q: For each implementation, how does changing the block count and block size affect performance? Why do you think this is?**
+
+As discussed in the Performance Analysis section, I think the reason is that changing the block count and block size does not change the total number of threads launched; it only changes how the threads are grouped, i.e. how many blocks there are and how many threads are in each block.
+
+**Q: For the coherent uniform grid: did you experience any performance improvements with the more coherent uniform grid? Was this the outcome you expected? Why or why not?**
+
+Yes, there is a performance improvement with the coherent uniform grid, and it is what I expected. I think the reason is contiguous memory access: reading scattered, non-contiguous memory is slow. Although we spend extra time creating the sorted position and velocity buffers, the time saved on memory access is much greater than that cost, so it is a good trade-off.
+
+**Q: Did changing cell width and checking 27 vs 8 neighboring cells affect performance? Why or why not? Be careful: it is insufficient (and possibly incorrect) to say that 27-cell is slower simply because there are more cells to check!**
+
+I toggled between checking 27 neighbor cells with cell_width = neighbor_distance and checking 8 neighbor cells with cell_width = 2 * neighbor_distance, using the uniform grid. When boids_count = 50,000, checking 8 neighbors (780 FPS) turned out to be slightly faster than checking 27 neighbors (690 FPS). When boids_count = 100,000, checking 27 neighbors (523 FPS) was faster than checking 8 neighbors (370 FPS).
+Intuitively, checking 27 neighbors with cell_width = neighbor_distance (assume neighbor_distance = 1) amounts to searching a 3x3x3 volume around the boid, whereas checking 8 neighbors with cell_width = 2 * neighbor_distance amounts to searching a 4x4x4 volume. So checking 8 neighbors tends to cover more volume and therefore more boids (assuming the boids are distributed roughly uniformly per unit cell), which is why checking 27 neighbors with the smaller cell width can perform better than checking 8 neighbors with the doubled cell width.
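+
+Taking the neighborhood distance $d$ as the unit, the searched volume in the two configurations works out to:
+
+```math
+27 \times d^3 = 27\,d^3 \quad \text{vs.} \quad 8 \times (2d)^3 = 64\,d^3
+```
+
+so the 8-cell configuration covers roughly 2.4x the volume, and (at uniform boid density) roughly 2.4x the candidate boids per boid.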
diff --git a/images/block_size.png b/images/block_size.png
new file mode 100644
index 0000000..336bd60
Binary files /dev/null and b/images/block_size.png differ
diff --git a/images/boid_count_with_visualization.png b/images/boid_count_with_visualization.png
new file mode 100644
index 0000000..94c6277
Binary files /dev/null and b/images/boid_count_with_visualization.png differ
diff --git a/images/boid_count_without_visualization.png b/images/boid_count_without_visualization.png
new file mode 100644
index 0000000..2c64562
Binary files /dev/null and b/images/boid_count_without_visualization.png differ
diff --git a/images/project.gif b/images/project.gif
new file mode 100644
index 0000000..98c8b05
Binary files /dev/null and b/images/project.gif differ
diff --git a/images/result.gif b/images/result.gif
new file mode 100644
index 0000000..b2212ec
Binary files /dev/null and b/images/result.gif differ
diff --git a/images/result2.gif b/images/result2.gif
new file mode 100644
index 0000000..2c06fdb
Binary files /dev/null and b/images/result2.gif differ
diff --git a/images/visualization_comparison.png b/images/visualization_comparison.png
new file mode 100644
index 0000000..0f37fdd
Binary files /dev/null and b/images/visualization_comparison.png differ
diff --git a/src/kernel.cu b/src/kernel.cu
index 74dffcb..695664f 100644
--- a/src/kernel.cu
+++ b/src/kernel.cu
@@ -17,6 +17,9 @@
#define checkCUDAErrorWithLine(msg) checkCUDAError(msg, __LINE__)
+#define CHECKING_8_NEIGHBOURS 0
+#define CHECKING_27_NEIGHBOURS 1
+
/**
* Check for CUDA errors; print and exit if there was a problem.
*/
@@ -37,7 +40,7 @@ void checkCUDAError(const char *msg, int line = -1) {
*****************/
/*! Block size used for CUDA kernel launch. */
-#define blockSize 128
+#define blockSize 1024
// LOOK-1.2 Parameters for the boids algorithm.
// These worked well in our reference implementation.
@@ -85,6 +88,8 @@ int *dev_gridCellEndIndices; // to this cell?
// TODO-2.3 - consider what additional buffers you might need to reshuffle
// the position and velocity data to be coherent within cells.
+glm::vec3* dev_pos_coh;
+glm::vec3* dev_vel_coh;
// LOOK-2.1 - Grid parameters based on simulation parameters.
// These are automatically computed for you in Boids::initSimulation
@@ -152,12 +157,18 @@ void Boids::initSimulation(int N) {
checkCUDAErrorWithLine("cudaMalloc dev_vel2 failed!");
// LOOK-1.2 - This is a typical CUDA kernel invocation.
- kernGenerateRandomPosArray<<<fullBlocksPerGrid, blockSize>>>(1, numObjects,
- dev_pos, scene_scale);
+ kernGenerateRandomPosArray<<<fullBlocksPerGrid, blockSize>>>(1,
+ numObjects,
+ dev_pos,
+ scene_scale);
checkCUDAErrorWithLine("kernGenerateRandomPosArray failed!");
// LOOK-2.1 computing grid params
+#if CHECKING_8_NEIGHBOURS
gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance);
+#elif CHECKING_27_NEIGHBOURS
+ gridCellWidth = std::max(std::max(rule1Distance, rule2Distance), rule3Distance);
+#endif
int halfSideCount = (int)(scene_scale / gridCellWidth) + 1;
gridSideCount = 2 * halfSideCount;
@@ -169,6 +180,24 @@ void Boids::initSimulation(int N) {
gridMinimum.z -= halfGridWidth;
// TODO-2.1 TODO-2.3 - Allocate additional buffers here.
+ cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int));
+ checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!");
+
+ cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int));
+ checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!");
+
+ cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int));
+ checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!");
+
+ cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int));
+ checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!");
+
+ cudaMalloc((void**)&dev_pos_coh, N * sizeof(glm::vec3));
+ checkCUDAErrorWithLine("cudaMalloc dev_pos_coh failed!");
+
+ cudaMalloc((void**)&dev_vel_coh, N * sizeof(glm::vec3));
+ checkCUDAErrorWithLine("cudaMalloc dev_vel_coh failed!");
+
cudaDeviceSynchronize();
}
@@ -230,10 +259,55 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities)
* in the `pos` and `vel` arrays.
*/
__device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) {
- // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
- // Rule 2: boids try to stay a distance d away from each other
- // Rule 3: boids try to match the speed of surrounding boids
- return glm::vec3(0.0f, 0.0f, 0.0f);
+ glm::vec3 posSelf = pos[iSelf];
+ glm::vec3 velSelf = vel[iSelf];
+ glm::vec3 vel1 = glm::vec3(0.0f);
+ glm::vec3 vel2 = glm::vec3(0.0f);
+ glm::vec3 vel3 = glm::vec3(0.0f);
+
+ glm::vec3 perceivedCenter(0.0f, 0.0f, 0.0f);
+ glm::vec3 awayDist(0.0f, 0.0f, 0.0f);
+ glm::vec3 perceivedVel(0.0f, 0.0f, 0.0f);
+ int neighborCountRule1 = 0;
+ int neighborCountRule3 = 0;
+
+ for (int idx = 0; idx < N; ++idx) {
+ if (idx == iSelf)
+ continue;
+ glm::vec3 posOther = pos[idx];
+ //if (iSelf == 1002) {
+ // printf("distance: %f\n", glm::distance(posSelf, posOther));
+ //}
+ // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
+ float distance = glm::distance(posSelf, posOther);
+ if (distance < rule1Distance) {
+ perceivedCenter += posOther;
+ neighborCountRule1++;
+ }
+ // Rule 2: boids try to stay a distance d away from each other
+ if (distance < rule2Distance) {
+ awayDist -= (posOther - posSelf);
+ }
+ // Rule 3: boids try to match the speed of surrounding boids
+ if (distance < rule3Distance) {
+ perceivedVel += vel[idx];
+ neighborCountRule3++;
+ }
+ }
+ //if (iSelf == 1002) {
+ // printf("perceivedCenter: %f, %f, %f \n", perceivedCenter.x, perceivedCenter.y, perceivedCenter.z);
+ // printf("neighborCountRule1: %d \n", neighborCountRule1);
+ //}
+ if (neighborCountRule1 > 0) {
+ perceivedCenter = perceivedCenter / glm::vec3(neighborCountRule1);
+ vel1 = (perceivedCenter - posSelf) * glm::vec3(rule1Scale);
+ }
+ vel2 = awayDist * glm::vec3(rule2Scale);
+ if (neighborCountRule3 > 0) {
+ perceivedVel /= glm::vec3(neighborCountRule3);
+ vel3 = perceivedVel * glm::vec3(rule3Scale);
+ }
+ return (vel1 + vel2 + vel3);
}
/**
@@ -241,10 +315,26 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *po
* For each of the `N` bodies, update its position based on its current velocity.
*/
__global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
- glm::vec3 *vel1, glm::vec3 *vel2) {
+ glm::vec3 *vel1, glm::vec3 *vel2) {
// Compute a new velocity based on pos and vel1
// Clamp the speed
// Record the new velocity into vel2. Question: why NOT vel1?
+
+ // get the boid by index
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (index >= N) {
+ return;
+ }
+ glm::vec3 velSelf = vel1[index];
+ // calculate vel change according to rules
+ glm::vec3 velChanged = computeVelocityChange(N, index, pos, vel1);
+ glm::vec3 newVel = velSelf + velChanged;
+ // clamp the speed
+ if (glm::length(newVel) > maxSpeed) {
+ newVel = glm::normalize(newVel) * glm::vec3(maxSpeed);
+ }
+ // update boid newVel to vel2 a new buffer
+ vel2[index] = newVel;
}
/**
@@ -270,6 +360,9 @@ __global__ void kernUpdatePos(int N, float dt, glm::vec3 *pos, glm::vec3 *vel) {
thisPos.z = thisPos.z > scene_scale ? -scene_scale : thisPos.z;
pos[index] = thisPos;
+ //if (index == 1002) {
+ // printf("thisPos: %f, %f, %f \n", thisPos.x, thisPos.y, thisPos.z);
+ //}
}
// LOOK-2.1 Consider this method of computing a 1D index from a 3D grid index.
@@ -283,12 +376,27 @@ __device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) {
}
__global__ void kernComputeIndices(int N, int gridResolution,
- glm::vec3 gridMin, float inverseCellWidth,
- glm::vec3 *pos, int *indices, int *gridIndices) {
+ glm::vec3 gridMin, float inverseCellWidth,
+ glm::vec3 *pos, int *indices, int *gridIndices) {
// TODO-2.1
// - Label each boid with the index of its grid cell.
// - Set up a parallel array of integer indices as pointers to the actual
// boid data in pos and vel1/vel2
+
+ int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+ if (index >= N) {
+ return;
+ }
+ glm::vec3 boidPos = pos[index];
+ // according to position, calculate the x, y, z for the boid
+ int x = (boidPos.x - gridMin.x) * inverseCellWidth;
+ int y = (boidPos.y - gridMin.y) * inverseCellWidth;
+ int z = (boidPos.z - gridMin.z) * inverseCellWidth;
+ // calculate the gridCellIdx where the boid is
+ int gridCellIdx = gridIndex3Dto1D(x, y, z, gridResolution);
+ // update the dev_particleGridIndices and dev_particleArrayIndices buffers
+ gridIndices[index] = gridCellIdx;
+ indices[index] = index;
}
// LOOK-2.1 Consider how this could be useful for indicating that a cell
@@ -300,12 +408,69 @@ __global__ void kernResetIntBuffer(int N, int *intBuffer, int value) {
}
}
+// 2.3 for making coh buffer
+__global__ void kernCreateCohBuffer(int N, int* arrayIndicesBuffer,
+ glm::vec3* posBuffer, glm::vec3* velBuffer,
+ glm::vec3* posCohBuffer, glm::vec3* velCohBuffer) {
+ int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+ if (index >= N) {
+ return;
+ }
+ int idx = arrayIndicesBuffer[index];
+ posCohBuffer[index] = posBuffer[idx];
+ velCohBuffer[index] = velBuffer[idx];
+}
+
__global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices,
- int *gridCellStartIndices, int *gridCellEndIndices) {
+ int *gridCellStartIndices, int *gridCellEndIndices) {
// TODO-2.1
// Identify the start point of each cell in the gridIndices array.
// This is basically a parallel unrolling of a loop that goes
// "this index doesn't match the one before it, must be a new cell!"
+ //int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+ //if (index >= N) {
+ // return;
+ //}
+ //int startIdx = 0;
+ //int endIdx = 1;
+ //while(endIdx <= N){
+ // int startGridCellIdx = particleGridIndices[startIdx];
+ // int endGridCellIdx = particleGridIndices[endIdx];
+ // if(endIdx == N){
+ // gridCellStartIndices[startGridCellIdx] = startIdx;
+ // gridCellEndIndices[startGridCellIdx] = endIdx;
+ // break;
+ // }
+ // if(startGridCellIdx != endGridCellIdx){ // if gridCellIdx different
+ // gridCellStartIndices[startGridCellIdx] = startIdx;
+ // gridCellEndIndices[startGridCellIdx] = endIdx;
+ // startIdx = endIdx;
+ // endIdx++;
+ // }
+ // else{ // gridCellIdx same
+ // endIdx++;
+ // }
+ //}
+ //get thread index
+ int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+ if (index >= N) return;
+
+ // get the current cell and the next cell (if > length of array, will be -1)
+ int gridCell = particleGridIndices[index];
+ int gridCellNext = index + 1 >= N ? -1 : particleGridIndices[index + 1];
+
+ // the first entry in particleGridIndices always starts its cell
+ if (index == 0) {
+ gridCellStartIndices[gridCell] = index;
+ }
+ // if the current cellIdx does not equal the next cellIdx,
+ // mark this cell's endIdx and the next cell's startIdx as index+1 --> [startIdx, endIdx)
+ // (not an "else if", so a cell boundary right after index 0 is still recorded)
+ if (gridCell != gridCellNext) {
+ gridCellEndIndices[gridCell] = index + 1;
+ if (gridCellNext != -1) {
+ gridCellStartIndices[gridCellNext] = index + 1;
+ }
+ }
}
__global__ void kernUpdateVelNeighborSearchScattered(
@@ -322,6 +487,137 @@ __global__ void kernUpdateVelNeighborSearchScattered(
// - Access each boid in the cell and compute velocity change from
// the boids rules, if this boid is within the neighborhood distance.
// - Clamp the speed change before putting the new speed in vel2
+
+ int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+ if (index >= N) {
+ return;
+ }
+ // get the boid info by index
+ glm::vec3 boidVel = vel1[index];
+ glm::vec3 boidPos = pos[index];
+ //if (index == 1002) {
+ // printf("boidVel: %f, %f, %f \n", boidVel.x, boidVel.y, boidVel.z);
+ // printf("boidPos: %f, %f, %f \n", boidPos.x, boidPos.y, boidPos.z);
+ //}
+
+#if CHECKING_27_NEIGHBOURS
+ // calculate the grid cell that this particle is in base on the boidPos
+ glm::vec3 boidGridCell = glm::floor((boidPos - gridMin) * glm::vec3(inverseCellWidth));
+
+ glm::vec3 perceivedCenter(0.0f, 0.0f, 0.0f);
+ int neighborCountRule1 = 0;
+ glm::vec3 awayDist(0.0f, 0.0f, 0.0f);
+ glm::vec3 perceivedVel(0.0f, 0.0f, 0.0f);
+ int neighborCountRule3 = 0;
+
+for (int z = -1; z <= 1; ++z) {
+ int nbGridCellZ = boidGridCell.z + z;
+ if (nbGridCellZ > (gridResolution - 1) || nbGridCellZ < 0)
+ continue;
+ for(int y = -1; y <= 1; ++y){
+ int nbGridCellY = boidGridCell.y + y;
+ if(nbGridCellY > (gridResolution - 1) || nbGridCellY < 0)
+ continue;
+ for (int x = -1; x <= 1; ++x) {
+ int nbGridCellX = boidGridCell.x + x;
+ if (nbGridCellX > (gridResolution - 1) || nbGridCellX < 0)
+ continue;
+ int nbGridCellIdx = gridIndex3Dto1D(nbGridCellX, nbGridCellY, nbGridCellZ, gridResolution);
+ int nbBoidStartIdx = gridCellStartIndices[nbGridCellIdx];
+ int nbBoidEndIdx = gridCellEndIndices[nbGridCellIdx];
+ if (nbBoidStartIdx == -1 && nbBoidEndIdx == -1)
+ continue;
+ for(int i = nbBoidStartIdx; i < nbBoidEndIdx; ++i){
+ int nbBoidIdx = particleArrayIndices[i];
+ if(nbBoidIdx == index)
+ continue;
+ glm::vec3 nbPos = pos[nbBoidIdx];
+ float distance = glm::distance(boidPos, nbPos);
+ if(distance < rule1Distance){
+ perceivedCenter += nbPos;
+ neighborCountRule1++;
+ }
+ if(distance < rule2Distance){
+ awayDist -= (nbPos - boidPos);
+ }
+ if(distance < rule3Distance){
+ perceivedVel += vel1[nbBoidIdx];
+ neighborCountRule3++;
+ }
+ }
+ }
+ }
+ }
+#elif CHECKING_8_NEIGHBOURS
+ // calculate the grid cell that this particle is in base on the boidPos
+ glm::vec3 roundedBoidPos = glm::vec3(glm::round(boidPos.x), glm::round(boidPos.y), glm::round(boidPos.z));
+
+ glm::vec3 boidGridCell = (roundedBoidPos - gridMin) * glm::vec3(inverseCellWidth);
+
+ glm::vec3 perceivedCenter(0.0f, 0.0f, 0.0f);
+ int neighborCountRule1 = 0;
+ glm::vec3 awayDist(0.0f, 0.0f, 0.0f);
+ glm::vec3 perceivedVel(0.0f, 0.0f, 0.0f);
+ int neighborCountRule3 = 0;
+
+ for (int z = -1; z <= 0; ++z) {
+ int nbGridCellZ = boidGridCell.z + z;
+ if (nbGridCellZ > (gridResolution - 1) || nbGridCellZ < 0)
+ continue;
+ for (int y = -1; y <= 0; ++y) {
+ int nbGridCellY = boidGridCell.y + y;
+ if (nbGridCellY > (gridResolution - 1) || nbGridCellY < 0)
+ continue;
+ for (int x = -1; x <= 0; ++x) {
+ int nbGridCellX = boidGridCell.x + x;
+ if (nbGridCellX > (gridResolution - 1) || nbGridCellX < 0)
+ continue;
+ int nbGridCellIdx = gridIndex3Dto1D(nbGridCellX, nbGridCellY, nbGridCellZ, gridResolution);
+ int nbBoidStartIdx = gridCellStartIndices[nbGridCellIdx];
+ int nbBoidEndIdx = gridCellEndIndices[nbGridCellIdx];
+ if (nbBoidStartIdx == -1 && nbBoidEndIdx == -1)
+ continue;
+ for (int i = nbBoidStartIdx; i < nbBoidEndIdx; ++i) {
+ int nbBoidIdx = particleArrayIndices[i];
+ if (nbBoidIdx == index)
+ continue;
+ glm::vec3 nbPos = pos[nbBoidIdx];
+ float distance = glm::distance(boidPos, nbPos);
+ if (distance < rule1Distance) {
+ perceivedCenter += nbPos;
+ neighborCountRule1++;
+ }
+ if (distance < rule2Distance) {
+ awayDist -= (nbPos - boidPos);
+ }
+ if (distance < rule3Distance) {
+ perceivedVel += vel1[nbBoidIdx];
+ neighborCountRule3++;
+ }
+ }
+ }
+ }
+ }
+#endif
+
+ glm::vec3 velDelta1 = glm::vec3(0.0f);
+ glm::vec3 velDelta2 = glm::vec3(0.0f);
+ glm::vec3 velDelta3 = glm::vec3(0.0f);
+ if (neighborCountRule1 > 0) {
+ perceivedCenter = perceivedCenter / glm::vec3(neighborCountRule1);
+ velDelta1 = (perceivedCenter - boidPos) * glm::vec3(rule1Scale);
+ }
+ velDelta2 = awayDist * glm::vec3(rule2Scale);
+ if (neighborCountRule3 > 0) {
+ perceivedVel /= glm::vec3(neighborCountRule3);
+ velDelta3 = perceivedVel * glm::vec3(rule3Scale);
+ }
+ glm::vec3 newVel = boidVel + velDelta1 + velDelta2 + velDelta3;
+ // clamp the speed
+ if (glm::length(newVel) > maxSpeed) {
+ newVel = glm::normalize(newVel) * glm::vec3(maxSpeed);
+ }
+ vel2[index] = newVel;
}
__global__ void kernUpdateVelNeighborSearchCoherent(
@@ -341,14 +637,98 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
// - Access each boid in the cell and compute velocity change from
// the boids rules, if this boid is within the neighborhood distance.
// - Clamp the speed change before putting the new speed in vel2
+
+ int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+ if (index >= N) {
+ return;
+ }
+ // get the boid info by index
+ glm::vec3 boidVel = vel1[index];
+ glm::vec3 boidPos = pos[index];
+ // calculate the grid cell that this particle is in base on the boidPos
+ glm::vec3 boidGridCell = glm::floor((boidPos - gridMin) * glm::vec3(inverseCellWidth));
+
+ glm::vec3 perceivedCenter(0.0f, 0.0f, 0.0f);
+ int neighborCountRule1 = 0;
+ glm::vec3 awayDist(0.0f, 0.0f, 0.0f);
+ glm::vec3 perceivedVel(0.0f, 0.0f, 0.0f);
+ int neighborCountRule3 = 0;
+
+ for (int z = -1; z <= 1; ++z) {
+ int nbGridCellZ = boidGridCell.z + z;
+ if (nbGridCellZ > (gridResolution - 1) || nbGridCellZ < 0)
+ continue;
+ for (int y = -1; y <= 1; ++y) {
+ int nbGridCellY = boidGridCell.y + y;
+ if (nbGridCellY > (gridResolution - 1) || nbGridCellY < 0)
+ continue;
+ for (int x = -1; x <= 1; ++x) {
+ int nbGridCellX = boidGridCell.x + x;
+ if (nbGridCellX > (gridResolution - 1) || nbGridCellX < 0)
+ continue;
+ int nbGridCellIdx = gridIndex3Dto1D(nbGridCellX, nbGridCellY, nbGridCellZ, gridResolution);
+ int nbBoidStartIdx = gridCellStartIndices[nbGridCellIdx];
+ int nbBoidEndIdx = gridCellEndIndices[nbGridCellIdx];
+ if (nbBoidStartIdx == -1 && nbBoidEndIdx == -1)
+ continue;
+ for (int nbBoidIdx = nbBoidStartIdx; nbBoidIdx < nbBoidEndIdx; ++nbBoidIdx) {
+ if (nbBoidIdx == index)
+ continue;
+ glm::vec3 nbPos = pos[nbBoidIdx];
+ float distance = glm::distance(boidPos, nbPos);
+ if (distance < rule1Distance) {
+ perceivedCenter += nbPos;
+ neighborCountRule1++;
+ }
+ if (distance < rule2Distance) {
+ awayDist -= (nbPos - boidPos);
+ }
+ if (distance < rule3Distance) {
+ perceivedVel += vel1[nbBoidIdx];
+ neighborCountRule3++;
+ }
+ }
+ }
+ }
+ }
+ glm::vec3 velDelta1 = glm::vec3(0.0f);
+ glm::vec3 velDelta2 = glm::vec3(0.0f);
+ glm::vec3 velDelta3 = glm::vec3(0.0f);
+ if (neighborCountRule1 > 0) {
+ perceivedCenter = perceivedCenter / glm::vec3(neighborCountRule1);
+ velDelta1 = (perceivedCenter - boidPos) * glm::vec3(rule1Scale);
+ }
+ velDelta2 = awayDist * glm::vec3(rule2Scale);
+ if (neighborCountRule3 > 0) {
+ perceivedVel /= glm::vec3(neighborCountRule3);
+ velDelta3 = perceivedVel * glm::vec3(rule3Scale);
+ }
+ glm::vec3 newVel = boidVel + velDelta1 + velDelta2 + velDelta3;
+ // clamp the speed
+ if (glm::length(newVel) > maxSpeed) {
+ newVel = glm::normalize(newVel) * glm::vec3(maxSpeed);
+ }
+ vel2[index] = newVel;
}
/**
* Step the entire N-body simulation by `dt` seconds.
*/
void Boids::stepSimulationNaive(float dt) {
+ dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
// TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
+ kernUpdateVelocityBruteForce << < fullBlocksPerGrid, threadsPerBlock >> > ( numObjects,
+ dev_pos,
+ dev_vel1,
+ dev_vel2);
+ checkCUDAErrorWithLine("kernUpdateVelocityBruteForce failed!");
+ kernUpdatePos << < fullBlocksPerGrid, threadsPerBlock >> > ( numObjects,
+ dt,
+ dev_pos,
+ dev_vel2);
+ checkCUDAErrorWithLine("kernUpdatePos failed!");
// TODO-1.2 ping-pong the velocity buffers
+ cudaMemcpy(dev_vel1, dev_vel2, sizeof(glm::vec3) * numObjects, cudaMemcpyDeviceToDevice);
}
void Boids::stepSimulationScatteredGrid(float dt) {
@@ -364,10 +744,43 @@ void Boids::stepSimulationScatteredGrid(float dt) {
// - Perform velocity updates using neighbor search
// - Update positions
// - Ping-pong buffers as needed
+
+ // TODO: kernResetIntBuffer to set start end buffer to -1
+ dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+ dim3 gridCellBlocksPerGrid((gridCellCount + blockSize - 1) / blockSize);
+ kernComputeIndices << < fullBlocksPerGrid, threadsPerBlock >> > (numObjects, gridSideCount,
+ gridMinimum, gridInverseCellWidth,
+ dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+ kernResetIntBuffer << < gridCellBlocksPerGrid, threadsPerBlock >> > (gridCellCount, dev_gridCellStartIndices, -1);
+ kernResetIntBuffer << < gridCellBlocksPerGrid, threadsPerBlock >> > (gridCellCount, dev_gridCellEndIndices, -1);
+ thrust::device_ptr<int> dev_thrust_keys(dev_particleGridIndices);
+ thrust::device_ptr<int> dev_thrust_values(dev_particleArrayIndices);
+ thrust::sort_by_key(dev_thrust_keys, dev_thrust_keys + numObjects, dev_thrust_values);
+ kernIdentifyCellStartEnd << < fullBlocksPerGrid, threadsPerBlock >> > (numObjects, dev_particleGridIndices,
+ dev_gridCellStartIndices, dev_gridCellEndIndices);
+ kernUpdateVelNeighborSearchScattered << < fullBlocksPerGrid, threadsPerBlock >> > (numObjects, gridSideCount, gridMinimum,
+ gridInverseCellWidth, gridCellWidth,
+ dev_gridCellStartIndices, dev_gridCellEndIndices,
+ dev_particleArrayIndices,
+ dev_pos, dev_vel1, dev_vel2);
+ checkCUDAErrorWithLine("kernUpdateVelNeighborSearchScattered failed!");
+ kernUpdatePos << < fullBlocksPerGrid, threadsPerBlock >> > (numObjects,
+ dt,
+ dev_pos,
+ dev_vel2);
+ checkCUDAErrorWithLine("kernUpdatePos failed!");
+ // TODO-1.2 ping-pong the velocity buffers
+ //dev_vel1 = dev_vel2;
+ //glm::vec3* tmp = dev_vel1;
+ //dev_vel1 = dev_vel2;
+ //dev_vel2 = tmp;
+ cudaMemcpy(dev_vel1, dev_vel2, sizeof(glm::vec3) * numObjects, cudaMemcpyDeviceToDevice);
}
+
+
void Boids::stepSimulationCoherentGrid(float dt) {
- // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid
+ // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid
// Uniform Grid Neighbor search using Thrust sort on cell-coherent data.
// In Parallel:
// - Label each particle with its array index as well as its grid index.
@@ -382,6 +795,36 @@ void Boids::stepSimulationCoherentGrid(float dt) {
// - Perform velocity updates using neighbor search
// - Update positions
// - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+
+ dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+ dim3 gridCellBlocksPerGrid((gridCellCount + blockSize - 1) / blockSize);
+ kernComputeIndices << < fullBlocksPerGrid, threadsPerBlock >> > (numObjects, gridSideCount,
+ gridMinimum, gridInverseCellWidth,
+ dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+ kernResetIntBuffer << < gridCellBlocksPerGrid, threadsPerBlock >> > (gridCellCount, dev_gridCellStartIndices, -1);
+ kernResetIntBuffer << < gridCellBlocksPerGrid, threadsPerBlock >> > (gridCellCount, dev_gridCellEndIndices, -1);
+ thrust::device_ptr<int> dev_thrust_keys(dev_particleGridIndices);
+ thrust::device_ptr<int> dev_thrust_values(dev_particleArrayIndices);
+ thrust::sort_by_key(dev_thrust_keys, dev_thrust_keys + numObjects, dev_thrust_values);
+ kernIdentifyCellStartEnd << < fullBlocksPerGrid, threadsPerBlock >> > (numObjects, dev_particleGridIndices,
+ dev_gridCellStartIndices, dev_gridCellEndIndices);
+
+ kernCreateCohBuffer << < fullBlocksPerGrid, threadsPerBlock >> > (numObjects, dev_particleArrayIndices,
+ dev_pos, dev_vel1,
+ dev_pos_coh, dev_vel_coh);
+ kernUpdateVelNeighborSearchCoherent << < fullBlocksPerGrid, threadsPerBlock >> > (numObjects, gridSideCount, gridMinimum,
+ gridInverseCellWidth, gridCellWidth,
+ dev_gridCellStartIndices, dev_gridCellEndIndices,
+ dev_pos_coh, dev_vel_coh, dev_vel2);
+ checkCUDAErrorWithLine("kernUpdateVelNeighborSearchCoherent failed!");
+ kernUpdatePos << < fullBlocksPerGrid, threadsPerBlock >> > (numObjects,
+ dt,
+ dev_pos_coh,
+ dev_vel2);
+ checkCUDAErrorWithLine("kernUpdatePos failed!");
+ //// TODO-1.2 ping-pong the velocity buffers
+ cudaMemcpy(dev_vel1, dev_vel2, sizeof(glm::vec3) * numObjects, cudaMemcpyDeviceToDevice);
+ cudaMemcpy(dev_pos, dev_pos_coh, sizeof(glm::vec3) * numObjects, cudaMemcpyDeviceToDevice);
}
void Boids::endSimulation() {
@@ -390,6 +833,14 @@ void Boids::endSimulation() {
cudaFree(dev_pos);
// TODO-2.1 TODO-2.3 - Free any additional buffers here.
+ cudaFree(dev_gridCellStartIndices);
+ cudaFree(dev_gridCellEndIndices);
+ cudaFree(dev_particleArrayIndices);
+ cudaFree(dev_particleGridIndices);
+
+ cudaFree(dev_pos_coh);
+ cudaFree(dev_vel_coh);
+
}
void Boids::unitTest() {
diff --git a/src/main.cpp b/src/main.cpp
index b82c8c6..a759d65 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -14,11 +14,11 @@
// LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID
#define VISUALIZE 1
-#define UNIFORM_GRID 0
+#define UNIFORM_GRID 1
#define COHERENT_GRID 0
// LOOK-1.2 - change this to adjust particle count in the simulation
-const int N_FOR_VIS = 5000;
+const int N_FOR_VIS = 100000;
const float DT = 0.2f;
/**
@@ -217,8 +217,8 @@ void initShaders(GLuint * program) {
double timebase = 0;
int frame = 0;
- Boids::unitTest(); // LOOK-1.2 We run some basic example code to make sure
- // your CUDA development setup is ready to go.
+ //Boids::unitTest(); // LOOK-1.2 We run some basic example code to make sure
+ // // your CUDA development setup is ready to go.
while (!glfwWindowShouldClose(window)) {
glfwPollEvents();