Zaknafein
« on: May 31, 2009, 09:03:06 PM »
Ever since I played Minecraft (and its inspiration, Infiniminer) I've wanted to make a similar first-person world-carving game with a huge world and maybe some artificial life in there, or at least environmental effects. But these games render a HUGE number of similar tiles, not unlike my main project Fez, and the usual way to do that in DirectX is geometry instancing.

I've read that Arius wants to implement hardware instancing in TV3D, but that technology is only available on SM3.0 hardware, so I wanted to milk as much performance as I can out of SM2.0-compatible shader instancing (using vertex shader constants).

I've got something working now, but it's a bit awkward to use: it's a class that is completely parallel to TVMesh/TVMinimesh, it uses Managed DirectX extensively, and the shader doesn't get the standard TV3D semantics. But the performance is great, so it's encouraging.

- TVMiniMesh (52 instances per batch), not textured, not transformed, using only SetEnableArray and SetPositionArray, ~43k instances: 26 FPS.
- My custom InstancedMesh class (248 instances per batch), using a low-res cubemap, with a matrix transformation for rotation, ~43k instances: 60 FPS!

I'll keep you updated and eventually post the source!

Edit: Three InstancedMeshes with different textures and a very WIP random world generation algorithm: [screenshot]

Edit 2: Hardware occlusion queries are in! And since all my data is in an octree, I can use this to my advantage and do broad-phase occlusion culling. In this screenshot there are over 20k instances under the ground, but the culling only renders 7k! [screenshot] This comes with a general performance hit because the world needs to be re-rendered to a RenderSurface, but it ended up being faster in many cases.
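To put the two batch sizes in perspective, here is a back-of-the-envelope sketch of how many draw calls each approach needs. It assumes exactly 43,000 instances (the post only says ~43k) and one draw call per batch; the class and method names are made up for illustration.

```csharp
using System;

static class InstancingMath
{
    // Number of draw calls needed when a single call can carry
    // at most `instancesPerBatch` instances.
    public static int BatchCount(int instanceCount, int instancesPerBatch)
    {
        if (instancesPerBatch <= 0)
            throw new ArgumentOutOfRangeException(nameof(instancesPerBatch));

        // Integer ceiling of instanceCount / instancesPerBatch.
        return (instanceCount + instancesPerBatch - 1) / instancesPerBatch;
    }
}
```

Under those assumptions, `BatchCount(43000, 52)` gives 827 batches while `BatchCount(43000, 248)` gives 174, which is at least consistent with the framerate gap reported above.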
« Last Edit: June 01, 2009, 11:00:07 PM by Zaknafein »

rootsage
« Reply #1 on: June 01, 2009, 01:50:57 AM »
Wow, this looks great Zak! This stuff is somewhere on my list of TODOs; looks like you got to it first. Definitely very useful, thanks for putting your time into this. I can't wait to see how you do things in the source; maybe they'll be similar to my ideas. I will most certainly keep checking back on this thread for updates. How much time has been invested into this so far?
while( !( succeed = try_again()) ); ------ 10 print "Is this recursive?" 20 goto 10

Lenn
« Reply #2 on: June 01, 2009, 09:07:49 AM »
Amazing! Really would love to play with this. Keep us posted. 

Zaknafein
« Reply #3 on: June 01, 2009, 10:55:09 AM »
Quote from: rootsage
How much time has been invested into this so far?

I started working on this maybe 5 days ago, on nights and weekends. I already had an octree class (that I'd never used) from Fez, so I started with that until I got frustum culling using the octree working, then worked on the InstancedMesh. So it's been pretty quick to set up.

Zaknafein
« Reply #4 on: June 01, 2009, 11:15:05 PM »
I added some realtime controls to compare the performance of the various settings I have in place. The clear winner (in FPS and number of rendered instances) on a still frame is hardware occlusion queries, when they're used carefully. My current sweet spot is to traverse the octree until I'm 3 levels away from individual unit cells, and to have a HW occlusion RenderSurface that's 3 times smaller than the main buffer. I also made a simpler vertex shader pass for the occlusion render, which speeds it up a tiny bit.

(The framerates below are super low because I'm on a sh**ty laptop with integrated graphics.)

[Screenshots: No Culling / Frustum-AABB Culling / Hardware Occlusion Queries (uses the same AABBs)]

Now as far as culling goes, I have one problem. I don't know of a good, fast algorithm to check frustum-AABB intersection or containment (fixed, see Edit 2) (I'd like to distinguish both cases), so my current method has some false positives and false negatives. On the borders of the screen, chunks sometimes get culled even though they're visible, and it will render stuff that's clearly not on screen.

Since my frustum-box culling doesn't work well, I can't use it as a pre-pass to the occlusion culling method. I feel like a hybrid method incorporating both (so that there are fewer queries, which are pretty slow...) would be the best thing ever.

Another thing is that the culling gets done every time the view matrix changes, which means it's a lot faster when you don't move than when you do. This may get annoying.

I also have no temporal optimization; the result of last frame's culling is not used in the current frame's culling. There should be ways to use that data without producing artifacts, but I haven't read/thought much about it yet. So I'd love to hear ideas if you have any!

Edit: For the record, my current AABB-frustum culling algorithm is the following:
1. Check if the camera position is inside the AABB; if so, there's an intersection.
2. Check if there are box vertices on either side of a frustum "cross-plane"; if there are vertices on both sides, it's an intersection (idea taken from this gamedev.net thread... and unlike its author thinks, it's NOT fail-safe).
3. Check if some (intersects) or all (contains) of the box vertices are inside all of the frustum planes (positive dot product).
It does all three steps in sequence, going from fastest to slowest. But surely there is something better and more fail-safe.

Edit 2: Oh yeah! I've found something that works really well and is faster too: the n-vertex/p-vertex method, described here: http://zach.in.tu-clausthal.de/teaching/cg_literatur/lighthouse3d_view_frustum_culling/index.html
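For readers who don't want to follow the link, here is a minimal sketch of the n-vertex/p-vertex idea. It is not the code from the demo: it uses `System.Numerics` types instead of Managed DirectX, the names `Containment` and `FrustumCulling.Classify` are invented for illustration, and it assumes the six plane normals point into the frustum.

```csharp
using System.Numerics;

enum Containment { Outside, Intersects, Inside }

static class FrustumCulling
{
    // Assumed convention: plane normals point INTO the frustum, so a point p
    // is on the inner side of a plane when Dot(Normal, p) + D >= 0.
    public static Containment Classify(Plane[] planes, Vector3 min, Vector3 max)
    {
        var result = Containment.Inside;
        foreach (Plane plane in planes)
        {
            Vector3 n = plane.Normal;

            // p-vertex: the box corner that lies furthest along the normal.
            var pVertex = new Vector3(
                n.X >= 0 ? max.X : min.X,
                n.Y >= 0 ? max.Y : min.Y,
                n.Z >= 0 ? max.Z : min.Z);

            // n-vertex: the diagonally opposite corner.
            var nVertex = new Vector3(
                n.X >= 0 ? min.X : max.X,
                n.Y >= 0 ? min.Y : max.Y,
                n.Z >= 0 ? min.Z : max.Z);

            // Even the most favourable corner is behind this plane:
            // the whole box is outside the frustum.
            if (Plane.DotCoordinate(plane, pVertex) < 0)
                return Containment.Outside;

            // The least favourable corner is behind the plane:
            // the box straddles it, so it cannot be fully contained.
            if (Plane.DotCoordinate(plane, nVertex) < 0)
                result = Containment.Intersects;
        }
        return result;
    }
}
```

Only two corners are tested per plane instead of eight, and the same loop distinguishes intersection from containment. The test is conservative: a box near a frustum corner can be reported as `Intersects` even though it is just outside, which is usually acceptable for culling.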
« Last Edit: June 02, 2009, 12:07:00 AM by Zaknafein »

Trashcan
« Reply #5 on: June 02, 2009, 11:42:27 AM »
Why render so many adjacent tiles separately? Are you able to scale your instances? If you have, say, a 2x2 grid of similarly textured tiles, you could merge them and tile the texture.

Zaknafein
« Reply #6 on: June 02, 2009, 11:52:43 AM »
Scaling the instances would mean cutting the number of instances per batch by half. Unless I have several InstancedMeshes with different sizes... But then one mesh per size per texture, that sounds kind of inefficient.
I'll definitely try "grouping" at some point though!

Zaknafein
« Reply #7 on: June 02, 2009, 09:55:10 PM »
More on occlusion queries... I have added many optimizations after talking with Sylvain.
- The RenderSurface's color buffer doesn't get written to, so you can choose the smallest format possible and it'll still work. In my case that format is R5G6B5 (16 BPP).
- Occlusion queries are asynchronous, so it's a much better idea to use the query results in the next frame instead of waiting in the current frame, to reduce stalls. This introduces some problems (what was visible last frame != what's visible right now), but they can be addressed. It's my next goal.
- I can use the visible node set from the last frame to render into the occlusion RenderSurface. This way I exploit some form of temporal coherency, because what wasn't visible a frame ago has a small chance of occluding something right now. If it did, it's no big deal, we'll cull it next frame. (I know, this is confusing.)
- I managed to use frustum culling to reduce the number of occlusion queries: if you don't see a node, you don't need to query its occlusion. The speed difference is invisible on a screenshot, but generally moving around the world is slightly smoother.
- I traded an occlusion query draw call for a point-in-AABB test, for the case where the camera is inside a node whose occlusion I want to test. It's faster to test that in code than with the GPU, and easier on the fillrate...

So it's getting pretty sweet. I just need to fix the "nodes disappear near the camera edges" problem brought on by using the last-frame data... And then I want to polish my minimesh class and it's time for a release!

P.S. Another temporal optimization that Sylvain mentioned, but that I can't be bothered to try right now, is categorizing nodes according to their occlusion state. If a node is occluded two frames in a row, it has a good chance of staying that way, so you can categorize it as such. Same for nodes that are visible twice in a row. You can group all nodes that have similar "chances" into a single occlusion query until the group changes state, then you ungroup it. Sounds like a lot of work, but it could be worth it... I'll see about trying it out.
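The categorization idea from the P.S. can be sketched as a small piece of bookkeeping that is independent of the GPU side. This is a hypothetical illustration, not something from the demo: `OcclusionHistory`, `NodeState` and the integer node IDs are invented names, and the "two frames in a row" threshold is exposed as a parameter.

```csharp
using System.Collections.Generic;

enum NodeState { Unknown, StableVisible, StableOccluded }

sealed class OcclusionHistory
{
    // Per node: the last query result and how many frames in a row it has held.
    private readonly Dictionary<int, (bool visible, int streak)> history =
        new Dictionary<int, (bool visible, int streak)>();

    // Call once per frame with the query result that came back for a node
    // (typically the result of the query issued on the previous frame).
    public void Record(int nodeId, bool visible)
    {
        if (history.TryGetValue(nodeId, out var h) && h.visible == visible)
            history[nodeId] = (visible, h.streak + 1);
        else
            history[nodeId] = (visible, 1); // first result, or the state flipped
    }

    // A node counts as "stable" once it has reported the same result
    // for at least `stableAfter` consecutive frames.
    public NodeState Classify(int nodeId, int stableAfter = 2)
    {
        if (!history.TryGetValue(nodeId, out var h) || h.streak < stableAfter)
            return NodeState.Unknown;
        return h.visible ? NodeState.StableVisible : NodeState.StableOccluded;
    }
}
```

Nodes classified as `StableOccluded` would be the candidates for sharing a single grouped query; as soon as a group reports visible pixels, its members drop back to `Unknown` and get individual queries again.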
« Last Edit: June 02, 2009, 09:59:47 PM by Zaknafein »

GD
« Reply #8 on: June 03, 2009, 03:44:39 PM »
Hey Zak, nice stuff! I'm sure I'll repeat what others said, but anyway. Here's my bbox-frustum visibility code, using mid/size bboxes, dot products and frustum planes.

    public bool IsBBoxVisible(ref BoundingBox b)
    {
        // Camera position (Pos) is inside the box: always visible.
        if ((Math.Abs(b.Mid.x - Pos.x) < b.Size.x) &&
            (Math.Abs(b.Mid.y - Pos.y) < b.Size.y) &&
            (Math.Abs(b.Mid.z - Pos.z) < b.Size.z))
            return true;

        for (int i = 0; i < 6; i++)
        {
            // Plane equation evaluated at the box center.
            float m = b.Mid.x * planes[i].x + b.Mid.y * planes[i].y + b.Mid.z * planes[i].z + planes[i].w;
            // Extent of the box projected onto the plane normal;
            // absPlanes[i].x is abs(planes[i].x), and so on.
            float n = b.Size.x * absPlanes[i].x + b.Size.y * absPlanes[i].y + b.Size.z * absPlanes[i].z;
            // The whole box is on the positive side of plane i: not visible.
            if (m > n) return false;
        }
        return true;
    }

I hope you render low-poly cubes onto the occlusion surface, not your trixels. Set TVScene::SetColorWriteEnable to false when rendering occluders.

Zaknafein
« Reply #9 on: June 03, 2009, 04:49:19 PM »
Thanks GD! I don't use trixels in this mini-project, although I might try to put that stuff in Fez...
I'm not sure if your testing method is faster or slower than mine. They're pretty similar. But the p/n-vertex method can distinguish intersection and containment, which is useful in my case, so I think I'll keep mine. Thanks for posting though.

Zaknafein
« Reply #10 on: June 06, 2009, 07:38:38 PM »
Alright, time for an early release! http://theinstructionlimit.com/samples/OctreeCulling/OctreeRendering_v1.zip

There are four modes in this release:
- No culling, just render everything all the time (no TV3D frustum culling either, because it's not using a TVMesh)
- Frustum culling (the depth of octree traversal can be changed with a realtime setting)
- Hardware occlusion queries; nodes that aren't in the view frustum may be queried too.
- Frustum culling + HW occlusion queries; only nodes in the frustum will be queried (plus additional optimizations).

There is a new option called "synchronous queries". You can choose whether you want to stall the pipeline with the GetData call in the same frame as the query itself. If you use this with the Frustum+Occlusion mode, only nodes intersecting with (and not contained in!) the view frustum will be synchronously tested. The rest are done asynchronously.

The reason is that I haven't found a perfect, fast way to stop the edge-of-camera artifacts using just asynchronous queries. So I guess you need to have some synchronous queries, but not all of them; I use the frustum intersection data to find out which. This works pretty well.

I've found that with a 64³ world, it's often faster to render the whole world than to do complicated things to find what's visible. This is a little annoying, but my efforts will probably make more sense with a much bigger/more complex world.

The stuff that's missing:
- Comments and cleanup
- I haven't profiled the CPU-side performance of the culling code much
- Grouping of queries based on how often they're occluded/visible
- The InstancedMesh class is not very friendly; it uses MDX semantics. I want to mix it up with more TVMesh functionality.

Still, enjoy!
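The sync/async split described in this post boils down to a small decision per octree node. The following is only a sketch of that decision, not code from the released zip; `FrustumResult`, `QueryMode` and `QueryPolicy.Choose` are hypothetical names.

```csharp
enum FrustumResult { Outside, Intersects, Inside }

enum QueryMode { Skip, Synchronous, Asynchronous }

static class QueryPolicy
{
    // Decides how a node's occlusion query should be handled in the
    // Frustum + Occlusion mode, based on the frustum test result.
    public static QueryMode Choose(FrustumResult frustumResult)
    {
        switch (frustumResult)
        {
            case FrustumResult.Outside:
                // Not in view at all: culled, no query needed.
                return QueryMode.Skip;
            case FrustumResult.Intersects:
                // Straddles the screen border, where stale last-frame data
                // causes popping: wait for this frame's result.
                return QueryMode.Synchronous;
            default:
                // Fully inside the frustum: last frame's result is good enough.
                return QueryMode.Asynchronous;
        }
    }
}
```

The point of the design is that the expensive stalls are limited to the thin shell of nodes on the frustum boundary, which is usually a small fraction of what's on screen.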

Shargot
« Reply #11 on: June 07, 2009, 02:32:44 AM »
This is wonderful! Thank you very much :) You have done very serious, useful and interesting work.

winspy
« Reply #12 on: June 07, 2009, 03:19:13 AM »
The occlusion culling is wonderful! I wish there were a separate example showing how to work with the occlusion culling functions (to cull actors/meshes/landscapes)!

Mietze
« Reply #13 on: June 07, 2009, 05:22:51 AM »
Quote from: Zaknafein
I haven't found a perfect, fast way to stop the edge-of-camera artifacts with using just asynchronous queries.

What about using a slightly bigger frustum for the testing? This may do the job and be cheaper.
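One possible reading of the "slightly bigger frustum" idea is to push each frustum plane outward by a small margin before testing. The sketch below assumes normalized plane normals that point into the frustum (a point is inside when Dot(Normal, p) + D >= 0); `FrustumPadding.Inflate` is a made-up helper, and widening the field of view instead would be an equally valid interpretation.

```csharp
using System.Numerics;

static class FrustumPadding
{
    // Returns a copy of the planes, each pushed outward by `margin` world
    // units. With unit-length inward normals, increasing D by `margin`
    // moves the plane `margin` units away from the frustum interior.
    public static Plane[] Inflate(Plane[] planes, float margin)
    {
        var result = new Plane[planes.Length];
        for (int i = 0; i < planes.Length; i++)
            result[i] = new Plane(planes[i].Normal, planes[i].D + margin);
        return result;
    }
}
```

Nodes that are just off-screen then still count as "in view", so their query results are already available when the camera turns toward them.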

Zaknafein
« Reply #14 on: June 08, 2009, 01:43:28 AM »
@Mietze : I'll definitely try it out. I'm currently refactoring the project, putting things in their right places and isolating the culling code from the instancing/rendering. Then I'll try some other ways to make it fast & pretty.

@winspy : This demo is really made for filled blocky worlds. It's going to need coding and maybe a redesign to allow sparse polygonal worlds. I think GD's TVSM might be a better starting point for that.

Trashcan
« Reply #15 on: June 09, 2009, 12:28:31 PM »
Would your shader compile for PS1.1?

Zaknafein
« Reply #16 on: June 09, 2009, 12:58:49 PM »
I'm not sure how many vertex shader constants you have in vs_1_1... 64? It would have to be modified, like have two different codepaths. And the speed increase would be smaller.
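To make the "smaller speed increase" concrete, here is a sketch of how the batch size follows from the constant register budget. As far as I know, Direct3D 9 guarantees at least 96 constant registers for vs_1_1 and 256 for vs_2_0. The other two inputs are assumptions picked only because they reproduce the 248 figure from the first post: 8 registers reserved for things like the view-projection matrix, and one float4 register per instance. The actual shader layout isn't given in this thread.

```csharp
using System;

static class ConstantBudget
{
    // Instances that fit in one batch when every instance consumes
    // `registersPerInstance` float4 constants and `reservedRegisters`
    // are set aside for per-batch data (both values are assumptions).
    public static int InstancesPerBatch(
        int constantRegisters, int reservedRegisters, int registersPerInstance)
    {
        if (registersPerInstance <= 0)
            throw new ArgumentOutOfRangeException(nameof(registersPerInstance));

        int available = Math.Max(0, constantRegisters - reservedRegisters);
        return available / registersPerInstance;
    }
}
```

With those assumptions, `InstancesPerBatch(256, 8, 1)` gives 248 and `InstancesPerBatch(96, 8, 1)` gives 88, so a vs_1_1 path would need roughly three times as many batches.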
« Last Edit: June 09, 2009, 04:27:02 PM by Zaknafein »

Trashcan
« Reply #17 on: June 10, 2009, 03:34:36 PM »
I decided to see what kind of performance I could get with some LOD in an environment like this. Hope you don't mind that I used your texture, Zak--it's really nice for visualizing blocks.

I forgot to capture the title where the stats are, but the scene contains ~19,000 blocks (minimeshes), while it appears to be rendering ~30,000 blocks (it's a very large cube, most of which you can't see). I just used the standard TV minimeshes with a custom shader that enables texture tiling per-minimesh; this allowed me to make 1 minimesh appear to be 8 or 16 blocks with almost no work.

Excuse the framerate; Camtasia + Intel graphics aren't great for 3D, and I'm doing zero culling (next step).

http://www.youtube.com/watch?v=txthzVU9j38
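The merging step isn't shown in this post, so the following is only a guess at what the bookkeeping could look like: a uniform 2x2x2 group of cells is replaced by one scaled instance whose texture is tiled twice per face. It assumes the world is a `byte[,,]` grid with even dimensions where 0 means empty; `BlockMerging` and its methods are invented names.

```csharp
static class BlockMerging
{
    // True when the 2x2x2 group whose minimum corner is (x, y, z) is
    // completely filled with one block type, so it could be drawn as a
    // single scaled instance with a tiled texture.
    public static bool CanMerge(byte[,,] world, int x, int y, int z)
    {
        byte type = world[x, y, z];
        if (type == 0) return false; // 0 = empty cell (assumed convention)

        for (int dx = 0; dx < 2; dx++)
            for (int dy = 0; dy < 2; dy++)
                for (int dz = 0; dz < 2; dz++)
                    if (world[x + dx, y + dy, z + dz] != type)
                        return false;
        return true;
    }

    // Counts how many instances are needed after merging every uniform,
    // grid-aligned 2x2x2 group. Assumes all dimensions are even.
    public static int InstanceCountAfterMerge(byte[,,] world)
    {
        int count = 0;
        for (int x = 0; x + 1 < world.GetLength(0); x += 2)
            for (int y = 0; y + 1 < world.GetLength(1); y += 2)
                for (int z = 0; z + 1 < world.GetLength(2); z += 2)
                {
                    if (CanMerge(world, x, y, z)) { count++; continue; }

                    // Mixed group: every non-empty cell stays its own instance.
                    for (int dx = 0; dx < 2; dx++)
                        for (int dy = 0; dy < 2; dy++)
                            for (int dz = 0; dz < 2; dz++)
                                if (world[x + dx, y + dy, z + dz] != 0) count++;
                }
        return count;
    }
}
```

In a mostly solid world the majority of groups are uniform, so the instance count can drop by close to a factor of 8 for a single merge level.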

Zaknafein
« Reply #18 on: June 10, 2009, 03:41:39 PM »
Ah, very nice! Looks like it's worth it. I'll have to integrate that in my code then. 

Stelios_81
« Reply #19 on: October 20, 2009, 07:42:11 AM »
My thanks as well. I could see the issue rising in my project and your timing was perfect!