Vulkan, C++, GLSL
A real-time cloudscape renderer in Vulkan.
It is based on the cloud system 'NUBIS' that was implemented for the Decima Engine by Guerrilla Games for the game 'Horizon Zero Dawn'.
Unreal Engine 4
Hand of God is an asymmetric co-op game, where one player is the ‘Runner’ and uses a traditional game controller while the other is a ‘God’ and uses the HTC Vive to reach into the world for an immersive feel.
The goal is for the 'Runner' to move and stay ahead of the Darkness that swirls along their dangerous path, and for the 'God' to help the Runner through this arduous journey. If the Runner cannot move forward fast enough, the Darkness envelops them for a grim end.
Monte Carlo path tracing is a rendering technique that aims to represent global illumination as accurately as possible with respect to reality. When combined with physically accurate models of surfaces, accurate models of light sources, and optically-correct cameras, path tracing can produce still images that are indistinguishable from photographs.
GPU Flocking Simulation written in CUDA
Tested on: Windows 10, i7-7700HQ @ 2.8GHz, 32GB, GTX 1070 (laptop GPU) 8074MB
Can simulate over 2100 lights at 60 FPS.
A WebGL implementation of Clustered Deferred and Clustered Forward Shading Techniques.
A procedural multi-layer dungeon generator that builds levels from a dynamically generated Voronoi-like graph after it has been heavily modified by various filters.
Realistic fog, crumbling pathways, and terrain are added over the basic level layout to give it a unique mysterious foreign world aesthetic.
Vulkan, C++, GLSL
This project is a real-time cloudscape renderer in Vulkan that was made as the final project for the course, CIS 565: GPU Programming and Architecture, at the University of Pennsylvania. It is based on the cloud system 'NUBIS' that was implemented for the Decima Engine by Guerrilla Games. The clouds were originally made for the game 'Horizon Zero Dawn' and were described in the following SIGGRAPH 2015 and 2017 presentations:
The project is explained in detail on the GitHub repository.
Monte Carlo path tracing is a rendering technique that aims to represent global illumination as accurately as possible with respect to reality. The algorithm does this by approximating the integral over all the illuminance arriving at a point in the scene. A portion of this illuminance will go towards the viewpoint camera, and this portion is determined by a surface reflectance function (BRDF). This integration procedure is repeated for every pixel in the output image. When combined with physically accurate models of surfaces, accurate models of light sources, and optically-correct cameras, path tracing can produce still images that are indistinguishable from photographs.
Path tracing naturally simulates many effects that have to be added as special cases for other methods (such as ray tracing or scanline rendering); these effects include soft shadows, depth of field, motion blur, caustics, ambient occlusion, and indirect lighting.
Due to its accuracy and unbiased nature, path tracing is used to generate reference images when testing the quality of other rendering algorithms. In order to get high quality images from path tracing, a large number of rays must be traced to avoid visible noisy artifacts.
Mario; 5k triangles;
With BVH at 1 sample/pixel: 4594 milliseconds;
Without BVH at 1 sample/pixel: 454041 milliseconds
Render Time Sped up ~98.8 times (a ~9,783% speed increase)
BVH Construction Time For Wahoo.obj: 156 milliseconds
If an implicit surface can be defined by an equation set equal to zero, it defines a Signed Distance Field (SDF). I used SDFs during intersection tests to determine whether we are inside, outside, or actually intersecting the object.
Thin convex camera lenses can be mimicked by having rays shot from the camera pass through a point sampled on the lens, then bending the original ray according to the properties of the lens. The technique is an approximation of the actual lens physics and produces better renders with more samples per pixel.
Numerous material models, including transmissive and microfacet models, were implemented. Each material is defined by properties that give the path tracer a great deal of versatility and variety for composing scenes.
The differences among the three major integrators are readily apparent when comparing the renders:
Naive Integrator: Renders scenes based on the Light Transport Equation, but implemented in a very naive sense, without any multiple importance sampling, Russian roulette termination, etc.
Direct Lighting Integrator: Renders the scene without any global illumination by simply ignoring secondary bounces of the ray once it hits something in the scene.
Full Lighting Integrator: Renders the scene based upon the Light Transport Equation but takes into account Multiple Importance Sampling, Russian roulette termination, and conditionals for various types of light sources, and is implemented in an iterative rather than recursive manner. All these features let it avoid the blown-out images that can occur in the naive integrator, keep global illumination, terminate rays early with very little impact on render quality, etc.
A volumetric integrator also exists; it is essentially the full lighting integrator, with additional ray marching through the scene to account for the influence of participating media.
Cornell Box, 2 lights; WRONG!; No MIS;
Cornell Box, 2 lights; Correct!; With MIS;
The color at every point in the scene is correctly scaled via Multiple Importance Sampling by weighting the contribution of each incoming ray at that point with respect to the actual energy the light rays bring in. The ideal weighting changes with the scene because it depends on the sampling biases we introduce to make the image converge faster. These biases include light importance sampling, BSDF importance sampling, cosine-weighted sampling, etc. Instead of relying on any one of these biased forms of sampling, we get the best result across all kinds of scenes by taking a weighted average of them.
They are used to weight the energy contributions of rays to assist with Multiple Importance Sampling. The power and balance heuristics are just different rules of thumb for this weighting.
It is a technique employed to cut a ray's bounce depth short once the energy returned by the ray falls below a randomized threshold. The lost energy is compensated for by scaling up the surviving rays' energy by an appropriate amount. When averaged over many samples, the loss in render accuracy is barely noticeable, if at all.
The Path Tracer handles various primitives such as implicit surfaces, triangles for meshes, and the basic ones like cubes, spheres, discs, and squareplanes.
The sampler written for the path tracer is responsible for creating samples on the various primitives listed above. Samples start from a uniform stratified distribution on a square plane and are then warped to resemble whatever shape we desire. This also allows biases, such as cosine-weighted sampling, to be introduced in a controllable manner from a common starting point.
This project implements Clustered Deferred and Clustered Forward Shading Techniques. These techniques improve performance for scenes with many dynamic lights.
Typical forward shading involves looping over every object and then looping over every single light in the scene to accumulate the light energy on the object. If the number of lights is high, this can lead to a drastic drop in performance. We need to cull lights so that only those lights which actually contribute energy to a fragment are looped over. Clustering is a solution that bins lights into 3D buckets based on what each light can influence. Deferred shading improves on this further by shading only the fragments with the smallest z-depth, which is accomplished via an initial pass that writes a "g-buffer".
These solutions allow you to have an enormous number of dynamic lights in the scene when optimized. This project can run over 2100 lights at 60 FPS.
The performance analysis of deferred shading vs. forward plus shading vs. forward shading is explained in detail on the GitHub repository.
Clustering is the binning technique used in this project to store lights in a spatial data structure based on their area of influence. Clusters are very similar to the cells that make up a 3D grid, where each cell represents a certain spatial boundary and does not overlap any other cell. A collection of clusters isn't simply called a grid because the clusters are made by slicing up the camera's view frustum, which makes each cluster a tiny frustum instead of a cuboid.
Clusters are used to determine whether a specific light influences any part of the volume the cluster occupies: a data structure representing the cluster stores all the lights that affect it. A light can be stored in multiple clusters.
It is surprisingly hard to do frustum-based culling, simply due to the ease with which false positives can pop up. There are multiple proven techniques for solving such issues, for example fast frustum culling.
The clusters used in this project were uniformly spaced in z, slicing the view frustum into 15 sections along each dimension. The number of slices per dimension affects the size of the cluster buffer and so is a good parameter to experiment with. Additionally, it may be beneficial to split the z dimension exponentially (or in some other non-uniform way), because real-time applications usually sample distant objects at lower resolutions. This is something that can be and has been taken advantage of; for example, the game DOOM implemented an exponential division of the z dimension.
In WebGL, we can't simply create buffers to store data that can be accessed by the shaders. Thus, we store our data by packing it into textures. The textures in this project have been laid out with a single cluster's data occupying an entire column. It would be more memory coherent to implement it such that the cluster's data occupies an entire row. Within the column itself however we have packed data relatively tight. Every pixel stores four floats. The first float in every column is the number of lights in that cluster. The floats that follow are just the light indices in no particular order.
This packing into textures additionally means that we have to unpack data in the shader itself. Additionally since we only store light indices we also need to store all of the lights in the scene in another texture. This texture also has to be packed and unpacked.
Forward plus is a shading technique that culls lights in some fashion and then bins them into either 2D tiles or, as in this project, 3D clusters. This means there is overhead for light culling, but the payoff during shading is enormous: every fragment loops over only a select few lights rather than all the lights in the scene.
The performance gain can be many-fold as the number of lights in the scene increases: about 17-18 times faster than forward shading at 500 lights.
The graphs show that at 1000 lights, forward plus is only about 5 times faster than forward shading. This is much better, of course, but not as great as 17x. This might be because the cluster texture no longer fits well in memory, leading to many expensive calls to global memory during shading.
For a regular scene, it is typical to end up with multiple fragments at various depths for every pixel in the fragment buffer. An obvious optimization is to perform the complex shading calculations, including lighting, only for the topmost fragment, i.e., the one that is actually seen and has the lowest depth.
Deferred shading is an improvement over forward plus shading. It uses two render passes: the first loops over fragments and creates a "g-buffer" that keeps only the topmost fragment, i.e., the one with the smallest depth, storing its associated data in one or more buffers. This data can be whatever is required for shading, such as positions, colors, normals, depth, specular, etc.
Not having to loop over all the fragments for a pixel and perform light accumulation on each of them leads to massive performance boosts, even over forward plus. This boost ranges anywhere from 2 to 8 times. Performance can be improved further by using smaller, more data-coherent textures, and thus compact g-buffers.
The performance of both clustered forward plus and deferred shading is heavily dependent upon the size of the cluster buffer, which is based mainly on the number of clusters and the maximum number of lights a cluster can store. Making this texture data-coherent goes a long way toward making the shading technique run fast. This, however, is hard to know intuitively and was more of a guessing process.
Reducing the number of g-buffers (or the number of components that make up the g-buffer) makes deferred shading faster, and can be done by storing data in them compactly. Colors often don't need the alpha component, normals can be reconstructed from just two of their components, and data can be stored in 24-bit floats instead of 32-bit ones; these are just some of the ways to achieve this compression.
This project implemented the following layout for the 2 g-buffers used:
If we assume the magnitude of a normal is one, which it usually is because it's simply a direction vector, we can reconstruct the z-value of the normal from its x and y values. Since the magnitude is the square root of x² + y² + z², we get |z| = sqrt(1 - x² - y²). The sign of z is positive in camera space for all the fragments we can see. Using these two pieces of information, we can easily reconstruct the normal from only its x and y components.
The debug view for the project can be found on the GitHub repository.
Unreal Engine 4
You are running from the encroaching Darkness of a dying world. Your path is hard to see and disintegrating. Creatures emerge from the shadows to drag you into the Darkness. Fortunately a God has decided to help you, but no God is perfect. How long can you last?
Hand of God is an asymmetric co-op game. One player is the ‘Runner’, and uses a traditional game controller to move and stay ahead of the Darkness that swirls along the dangerous path. If the Runner cannot move forward fast enough, the Darkness envelops them for a grim end. The other player is the ‘God’, trying to help the Runner last as long as possible. This is a God, however, who needs the Runner's help. The Runner collects resources that enable the God to generate and place items that help the Runner, such as blocks to cross gaps in the world, and to fend off creatures that emerge from the shadows. The God is immersed in the world of the Runner via the HTC Vive and has a broader view of the world itself.
This project served as an introduction to CUDA kernels, how to use them, and how to analyze their performance. This was done by implementing the Reynolds Boids algorithm in parallel on the GPU.
Boids is a crowd simulation algorithm developed by Craig Reynolds in 1986, modeled after the flocking behaviors exhibited by birds and fish.
The most basic version of the algorithm operates on three rules:
Cohesion: Boids will try to move towards the perceived center of mass of other boids around them.
Alignment: Boids will tend to steer in the perceived average velocity of other boids around them.
Separation: Boids will try to keep some amount of distance between them and other boids.
The naive approach for computing the new positions and velocities for the boids is to check each boid against every other boid and apply the three rules described above. This is extremely slow with a time complexity of O(N^2).
A simple way to implement this in CUDA is to have one thread for each boid. Each thread would loop over the entire position and velocity buffers (skipping itself) and compute a new velocity for that boid. Then, another CUDA kernel would apply the position update using the new velocity data.
It is quite obvious that a spatial data structure can greatly improve the algorithm's performance. If we reduce the number of boids each boid needs to check against, we decrease the amount of work each thread needs to do. Because the three rules only apply within a certain radius, organizing the boids on a uniformly spaced grid allows us to perform a very efficient neighbor search and only check a very limited number of boids. We choose our uniform grid cell width to be twice the maximum search radius, so that at most a boid may be influenced by other boids in the 8 cells directly around it.
To create the "grid" we create an additional CUDA kernel which fills a buffer with the respective grid cell indices. A parallel sorting algorithm, thrust::sort_by_key, is then used to sort boid indices by their corresponding grid cell indices. Furthermore, we store the "start" and "end" indices for each grid cell in two new buffers by checking for transitions in the sorted grid cell buffer.
Now, instead of checking against all other boids, each thread in our velocity update can determine the boid's grid index based on its position; this then allows us to determine the 8 neighboring cells in 3D space, and only apply rules for those boids that have an index between "start" and "end" of a neighboring cell.
We can make the uniform grid spatial structure better by making memory accesses more coherent. The uniform scattered grid implementation does unnecessary hopping and "pointer"-chasing. Instead of using an additional sorted buffer to find the index into our particle data, we can shuffle the particle data so that it is also sorted by grid cell index. This allows the GPU to load in and cache our position and velocity data for the boids, resulting in fewer global memory calls. This small change ends up yielding significant performance improvements.
As would be expected, all three approaches suffer slowdowns as the number of boids increases. However, the line graph above shows drastic differences. Using the uniform scattered grid greatly improves performance, and the uniform coherent grid surpasses even that. Intuitively, this makes sense. With the uniform scattered grid we greatly reduced the number of boids that have to be checked against for each boid. Making the boid data memory-coherent then allowed us to access memory faster because of better caching and a reduction in global memory calls, which is what makes the uniform coherent grid perform that much better.
Looking at the bar graph, we can see that the block size really only affects one implementation, namely the naive one, beginning at a block size of 32. This might be because at 32 or fewer threads per block, there is only one warp (a group of 32 threads) in a block. These smaller blocks mean that we need a larger number of blocks. With every warp in its own block, we lose the performance benefits of shared memory within a block and instead need to allocate memory for each of our very many blocks.
It is curious that the block size doesn't affect the uniform scattered and uniform coherent grid implementations much. At a block size of 8, however, the data table shows a drop in framerate for both of them, although it isn't as drastic as it is for the naive implementation. My guess is that even smaller block sizes would yield worse and worse framerates for all implementations, for the reasons described above.
As can be seen in the table above, increasing the number of neighboring grid cells checked, while appropriately changing the width of the cells, results in a lower framerate. This is because we have essentially increased the overhead of maintaining the grid without a significant reduction in the number of boids we check. If we tuned the ratio of cell width to the number of neighboring grid cells checked, there could be cases where checking the 27 neighboring grid cells yields a better framerate.
If we zoom in on the first few data points, we notice a sudden jump that almost doubles our framerate as the number of boids increases. At first glance this seems absurd, but it is more likely that in some situations the number of boids does not map well onto the memory of the underlying architecture, which leads to a frustrating reduction in framerate.
While coding this project, I ran into a ridiculous bug that made my uniform scattered and uniform coherent grids perform significantly worse than my naive implementation. I may have inadvertently coded up the worst-case scenario for both methods. The issue arose because I had divided my boid's position along all three axes by the 'grid resolution', and then used that position to calculate an index into the grid cells. Because of that division, the index I obtained for every boid was the same, i.e., they were all in the same grid cell. This meant I was basically running my naive implementation inside one grid cell, plus the overhead of creating and maintaining extra buffers for the grid structure and doing a bunch of computation to determine which grid cells to look at. A stupid mistake, but with mildly interesting insight.
The 'Interesting Level Generator' is a procedural multi-layer dungeon generator that builds levels from a dynamically generated Voronoi-like graph after it has been heavily modified by various filters.
Realistic fog, crumbling pathways, and terrain are added over the basic level layout to give it a unique mysterious foreign world aesthetic.
Procedural 3D game level ( multiple stacks of 2D maps ).
Implicit surfaces to create terrain on the resulting map floors
Custom fog shader
The specifications below will be implemented if I have enough time in the final week
Collision handling so player can’t walk through geometry.
Using sampling and rejection to layout the slabs in a given space.
Then use a fake Voronoi generation technique to create a graph. The fake Voronoi technique consists of first starting with 3 connected nodes; then, for every new node you want to add to the graph, you find the 2 closest neighbours in the existing graph and form edges to them.
We can improve the above technique a bit by sorting the positions of the slabs along one axis. This makes the connections look more like a Voronoi pattern.
This graph is highly connected so we randomly remove connections.
The graph can end up with intersecting edges in a few edge cases, so we carry out 2D line intersection tests and remove any intersections if they exist.
Walkways between slabs:
Because of the graph we created above, we have edge and node data; in other words, we know which points to connect. Between any 2 points we could simply lay out planes to connect them, but this is boring.
Instead, we can use instancing to place many tiny cubes along the line segment and then randomly remove cubes to make it look like a crumbling walkway.
We also need to give the walkway some width, so we take the cross product of the line segment's direction with the y-axis (up) to get a horizontal orthonormal direction. We then add instanced cubes not just on the line segment, but across this width for each segment.
We can create multiple 2D maps at different heights.
Interlevel Connecting Paths:
This is a similar problem to the one we solved in the "Walkways between slabs" section, but now in 3D.
For every layer we pick a random node/slab as the starting point of our path between layers.
For the end point of that line segment we search through the nodes in the layer above for rooms that are beyond a certain distance from the randomly picked starting node (so paths don’t go straight up and there is more complexity in connections), and form a list of these “toNodes”.
Pick a random “toNode”, and using the random starting node we have a 3D line segment.
Create a similar instancing cubes setup for these paths as we did with the walkways.
Remove random lines, and also carry out 3D intersection tests and remove any intersecting paths, if they exist.
Path shifting for Paths:
To make the paths connecting walkways seem more organic, and to prevent janky-looking paths that start at the center of one cell and end at the center of the other (a problem, since a player could never reach the higher layer and would be stuck underneath the "toNode"), we shift paths to the edges of the cells they connect: simply offset by the width and length in the correct direction.
To add organic paths we should shift by both the width and length.
Created in the shader, with global control of density, color, and an on/off switch.
A good approximation of fog fall-off is: e^(-(fogDensity² × distance²))
Fog also appears to have different densities at the periphery of our vision, so we need to account for rimColor.
Terrain was created in the shader.
Create an elevation map and a moisture map using smoothed 2D noise with different seed values (or different noise functions).
Use the elevation map to deform vertices in the shader.
Create a moisture map ( similar to the elevation map ).
Use the float values from the elevation and moisture as uv values to determine colors from gradients.
Grid based Acceleration:
Takes too much memory, and so was never used for anything.
kD tree acceleration Structure for Collision Handling
Character with basic character controls
Collision handling for the character
Trampolines to fill up empty and negative space