东汉书院 (Donghan Academy) is a leader in graphics education, an advocate of self-developed engines, and a publisher of low-level technology research. On our mission to equip more people with kernel-level competitiveness, we will keep bringing you open technical lectures and videos; thank you for your support along the way. We are taking concrete action to help readers build a systematic body of graphics knowledge: what you get here is not just code without context, but the story behind the code and a precise, thorough understanding. We do not simply hand people a book or a rough direction and leave them to self-study; we design the courses ourselves, so that people see which details actually lie behind the material.
A quick interlude: news about an engine competition, for anyone interested; this is also part of 东汉书院's founding mission:
Reference link to the original article
I was looking around the Unreal source the other day and, inspired by some excellent breakdowns of how popular games render a frame, I thought to try something similar with it as well, to study how it renders a frame (with the default settings/scene setup).
Since we have access to the source code, it is possible to study the renderer source to see what it does. It is quite a beast though, and rendering paths depend a lot on the context, so a clean, low-level API call list will be easier to follow (looking into the code to fill in any missing gaps).
I put together a simple scene with a few static and dynamic props, a few lights, volumetric fog, transparent objects and a particle effect to cover a large enough range of materials and rendering methods.
So, I ran the Editor through RenderDoc and triggered a capture. This might not be representative of what a real game frame will look like, but it should give us a rough idea of how Unreal performs rendering of a typical frame (I haven't changed any of the default settings and I am targeting "highest quality" on PC):
Disclaimer: the following analysis is based on the GPU capture and renderer source code (version 4.17.1), without prior Unreal experience really. If I have missed something, or got anything wrong, please let me know in the comments.
Helpfully, Unreal’s draw call list is clean and well annotated so it should make our work easier. The list can look different in case you are missing some entities/materials in your scene or you are targeting lower quality. For example if you are rendering no particles, the ParticleSimulation passes will be missing.
The SlateUI render pass includes all API calls the Unreal Editor performs to render its UI so we will ignore it, focusing instead on all passes under Scene.
Particle Simulation
The frame begins with the ParticleSimulation pass. It calculates particle motion and other properties for each particle emitter we have in the scene on the GPU, writing to two rendertargets: one RGBA32_Float for positions and one RGBA16_Float for velocities (and a couple of time/life related data). This, for example, is the output for the RGBA32_Float rendertarget, each pixel corresponding to the world position of a sprite:
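For a concrete picture of what those two rendertargets look like at the API level, here is a minimal D3D11-style sketch (my own illustration under assumed texture sizes, not Unreal code):

```cpp
// Minimal D3D11 sketch of the two particle-state rendertargets observed in
// the capture: RGBA32F positions and RGBA16F velocities. Illustration only.
#include <d3d11.h>

void CreateParticleStateTargets(ID3D11Device* device, UINT size,
                                ID3D11Texture2D** positions,
                                ID3D11Texture2D** velocities)
{
    D3D11_TEXTURE2D_DESC desc = {};
    desc.Width = size;                 // one texel per particle (size x size grid)
    desc.Height = size;
    desc.MipLevels = 1;
    desc.ArraySize = 1;
    desc.SampleDesc.Count = 1;
    desc.Usage = D3D11_USAGE_DEFAULT;
    desc.BindFlags = D3D11_BIND_RENDER_TARGET | D3D11_BIND_SHADER_RESOURCE;

    desc.Format = DXGI_FORMAT_R32G32B32A32_FLOAT;   // world-space positions
    device->CreateTexture2D(&desc, nullptr, positions);

    desc.Format = DXGI_FORMAT_R16G16B16A16_FLOAT;   // velocities (+ time/life data)
    device->CreateTexture2D(&desc, nullptr, velocities);
}
```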
In this case the particle effect I added to the scene seems to have two emitters that require GPU simulation without collision, so the corresponding rendering passes can be run early in the frame.
Z-Prepass
Next up is the PrePass render pass, which is essentially a z-prepass. This renders all the opaque meshes to an R24G8 depth buffer:
It is worth noting that Unreal uses reverse-Z when rendering to the depth buffer, meaning that the near plane is mapped to 1 and the far plane to 0. This allows for better precision along the depth range and reduces z-fighting on distant meshes. The name of the rendering pass suggests that the pass was triggered by a "DBuffer". This refers to the decal buffer Unreal Engine uses to render deferred decals. It requires the scene depth, so it activates the Z-prepass. The z-buffer is used in other contexts though, such as for occlusion calculations and screen space reflections, as we will see next.
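To make the reverse-Z mapping concrete, here is a small sketch of a perspective projection with the near/far terms swapped. This is my own illustration (left-handed, row-vector convention with a 0..1 clip range), not Unreal's actual matrix code:

```cpp
// Reverse-Z perspective projection sketch: z = nearZ maps to 1, z = farZ to 0.
#include <cmath>

struct Mat4 { float m[4][4] = {}; };

Mat4 PerspectiveReverseZ(float fovY, float aspect, float nearZ, float farZ)
{
    const float h = 1.0f / std::tan(fovY * 0.5f);
    Mat4 p;
    p.m[0][0] = h / aspect;
    p.m[1][1] = h;
    p.m[2][2] = nearZ / (nearZ - farZ);           // reverse-Z depth scale
    p.m[3][2] = -farZ * nearZ / (nearZ - farZ);   // reverse-Z depth offset
    p.m[2][3] = 1.0f;                             // w' = view-space z
    return p;
}
// The rest of the pipeline must flip too: the depth test becomes
// GREATER(_EQUAL) and the depth buffer is cleared to 0 instead of 1.
```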
Some render passes in the list appear to be empty, like the ResolveSceneDepth, which I guess is for platforms that actually need “resolving” a rendertarget before using it as a texture (the PC doesn’t) as well as ShadowFrustumQueries which looks like it is a dummy marker, as the actual occlusion tests for shadows take place in the next render pass.
Testing for occlusion
BeginOcclusionTests handles all occlusion tests in a frame. Unreal uses hardware occlusion queries for occlusion testing by default. In short, this works in 3 steps:
1. We render everything that we regard as an occluder (i.e. a large solid mesh) to a depth buffer.
2. We create an occlusion query, issue it and render the prop we wish to determine occlusion for. This is done using a z-test and the depth buffer we produced in step 1. The query will return the number of pixels that passed the z-test, so if it is zero this means that the prop is behind a solid mesh. Since rendering a full prop mesh for occlusion can be expensive, we typically use the bounding box of that prop as a proxy. If it is not visible, then the prop is definitely not visible.
3. We read the query results back to the CPU and, based on the number of pixels rendered, we can decide to submit the prop for rendering or not (even if a small number of pixels are visible we might decide that it is not worth rendering the prop).
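These three steps map almost one-to-one onto the graphics API. As a reference point, here is a minimal D3D11 sketch of a single hardware occlusion query (my illustration, not Unreal's code; DrawBoundingBoxProxy is a hypothetical helper):

```cpp
#include <d3d11.h>

void DrawBoundingBoxProxy(ID3D11DeviceContext* ctx); // hypothetical: draws the bbox

UINT64 CountVisiblePixels(ID3D11Device* device, ID3D11DeviceContext* ctx)
{
    // 1. Create the query object.
    D3D11_QUERY_DESC desc = {};
    desc.Query = D3D11_QUERY_OCCLUSION;
    ID3D11Query* query = nullptr;
    device->CreateQuery(&desc, &query);

    // 2. Render the proxy (bounding box) between Begin/End with the z-test on
    //    and depth writes off; the occluders are already in the z-buffer.
    ctx->Begin(query);
    DrawBoundingBoxProxy(ctx);
    ctx->End(query);

    // 3. Read back the number of pixels that passed the z-test. Spinning here
    //    creates the CPU-GPU sync point discussed below; a real engine polls
    //    the result a few frames later instead.
    UINT64 visiblePixels = 0;
    while (ctx->GetData(query, &visiblePixels, sizeof(visiblePixels), 0) == S_FALSE)
        ; // busy-wait, for illustration only
    query->Release();
    return visiblePixels;
}
```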
Unreal uses different types of occlusion queries based on the context:
Hardware occlusion queries have disadvantages though: they have "drawcall" granularity, meaning that they require the renderer to submit one drawcall per mesh (or mesh batch) that needs determining occlusion for, which can increase the number of drawcalls per frame significantly; and they require a CPU readback, which introduces CPU-GPU sync points and makes the CPU wait until the GPU has finished processing the query. They are not that great for instanced geometry as well, but we'll ignore this for now.
The CPU-GPU sync point problem Unreal solves like any other engine that uses queries, by deferring the reading of query data for a number of frames (a small sketch of this deferred readback follows below). This approach works, although it might introduce props popping into the screen with a fast moving camera (in practice it might not be a massive problem though, since doing occlusion culling using bounding boxes is conservative, meaning that a mesh will in all likelihood be marked as visible before it actually is). The additional drawcall overhead problem remains though, and it is not easy to solve. Unreal tries to mitigate it by grouping queries like this: at first it renders all opaque geometry to the z-buffer (the Z-prepass discussed earlier). Then it issues individual queries for every prop it needs to test for occlusion. At the end of the frame it retrieves query data from the previous (or further back) frame and decides prop visibility. If a prop is visible, it marks it as renderable for the next frame. On the other hand, if it is invisible, it adds it to a "grouped" query which batches the bounding boxes of up to 8 props and uses that to determine visibility during the next frame. If the group becomes visible next frame (as a whole), it breaks it up and issues individual queries again. If the camera and the props are static (or slowly moving), this approach reduces the number of necessary occlusion queries by a factor of 8. The only weirdness I noticed was during the batching of the occluded props, which seems to be random and not based on spatial proximity.
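A minimal sketch of the deferred-readback idea, assuming a hypothetical GpuQuery wrapper (a real implementation would poll an API query object such as the D3D11 one shown earlier):

```cpp
#include <cstdint>
#include <vector>

// Stand-in for an API query object; hypothetical, for illustration only.
struct GpuQuery
{
    bool     ready  = true;
    uint64_t pixels = 0;
};

constexpr int kLatencyFrames = 2;            // read results this many frames late

std::vector<GpuQuery> g_frames[kLatencyFrames];

// Called once per frame: consume the queries issued kLatencyFrames ago, so
// the CPU never stalls waiting for the GPU to finish the current frame.
void ConsumeOldQueries(uint32_t frameIndex, std::vector<bool>& visibility)
{
    std::vector<GpuQuery>& old = g_frames[frameIndex % kLatencyFrames];
    for (size_t i = 0; i < old.size() && i < visibility.size(); ++i)
    {
        // If a result is somehow still not ready, keep the prop visible:
        // over-drawing is safe, culling something visible is not.
        visibility[i] = !old[i].ready || old[i].pixels > 0;
    }
    old.clear();                             // slot reused for this frame's queries
}
```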
This process corresponds to the IndividualQueries and GroupedQueries markers in the renderpass list above. The GroupedQueries part is empty as the engine did not manage to produce any during the previous frame.
To wrap up the occlusion pass, ShadowFrustumQueries issues hardware occlusion queries for the bounding meshes of the local (point or spot) lights (both non and shadowcasting it appears, contrary to what the name declares). If they are occluded, there is no point in doing any lighting/shadowing calculations for them. Worth noting is that although we have 4 shadow casting local lights in the scene (for which we need to calculate a shadowmap every frame), the number of drawcalls under ShadowFrustumQueries is 3. I suspect this is because one of the lights' bounding volume intersects the camera's near plane, so Unreal assumes that it will be visible anyway. Also worth mentioning is that for dynamic lights, where a cubemap shadowmap will be calculated, we submit a sphere shape for occlusion tests,
while for stationary lights, for which Unreal calculates per-object shadows (more on this later), a frustum is submitted:
Finally I assume that PlanarReflectionQueries refers to occlusion tests performed when calculating planar reflections (produced by transforming the camera behind/below the reflection plane and redrawing the meshes).
Hi-Z buffer generation
Next, Unreal creates a Hi-Z buffer (the HZB SetupMipXX passes) stored as 16-bit floating point (texture format R16_Float). This takes the depth buffer produced during the Z-prepass as an input and creates a mip chain of depths (i.e. downsamples it successively). It also seems to resample the first mip to power-of-two dimensions for convenience:
Since Unreal uses reverse-Z, as mentioned earlier, the pixel shader uses the min operator during downscaling.
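Here is a CPU-side sketch of what one such downsample step computes (my illustration; the engine does this in a pixel shader on the GPU):

```cpp
// One Hi-Z downsample step over a tightly packed power-of-two float depth mip.
#include <algorithm>
#include <vector>

std::vector<float> DownsampleHiZ(const std::vector<float>& src, int srcW, int srcH)
{
    int dstW = srcW / 2, dstH = srcH / 2;
    std::vector<float> dst(dstW * dstH);
    for (int y = 0; y < dstH; ++y)
        for (int x = 0; x < dstW; ++x)
        {
            // Gather the 2x2 source block.
            float d00 = src[(2 * y)     * srcW + 2 * x];
            float d10 = src[(2 * y)     * srcW + 2 * x + 1];
            float d01 = src[(2 * y + 1) * srcW + 2 * x];
            float d11 = src[(2 * y + 1) * srcW + 2 * x + 1];
            // With reverse-Z, smaller depth = farther away, so taking the min
            // keeps the most distant (most conservative) depth of the block.
            dst[y * dstW + x] = std::min(std::min(d00, d10), std::min(d01, d11));
        }
    return dst;
}
```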
Shadowmap rendering
Next follows the shadowmap calculation render pass (ShadowDepths).
In the scene I have added a “Stationary” directional Light, 2 “Movable” point lights, 2 “Stationary” point lights and a “Static” point light, which all cast shadows:
For stationary lights, the renderer bakes shadows for static props and calculates shadows only for dynamic (movable) props. With movable lights it calculates shadows for everything every frame (totally dynamic). Finally for static lights it bakes light+shadows into the lightmap, so they should never appear during rendering.
For the directional light I have also added cascaded shadowmaps with 3 splits, to see how they are handled by Unreal. Unreal creates a 3×1 shadowmap R16_TYPELESS texture (3 tiles in a row, one for each split), which it clears every frame (so no staggered shadowmap split updates based on distance). Then, during the Atlas0 pass, it renders all solid props into the corresponding shadowmap tile:
As the call list above corroborates, only Split0 has some geometry to render, so the other tiles are empty. The shadowmap is rendered without using a pixel shader, which offers double the shadowmap generation speed. Worth noting is that the "Stationary" and "Movable" distinction does not seem to hold for the directional light: the renderer renders all props (including static ones) to the shadowmap.
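At the API level, depth-only rendering into an atlas tile boils down to binding only a depth view, unbinding the pixel shader, and pointing the viewport at the tile. A minimal D3D11 sketch (my illustration with assumed tile sizes, not Unreal code):

```cpp
#include <d3d11.h>

void DrawCasterGeometry(ID3D11DeviceContext* ctx); // hypothetical: draws the casters

void RenderCascadeTile(ID3D11DeviceContext* ctx, ID3D11DepthStencilView* atlasDSV,
                       int splitIndex, float tileSize)
{
    // Bind the depth view with no color targets: depth-only rendering.
    ctx->OMSetRenderTargets(0, nullptr, atlasDSV);

    // No pixel shader bound; the hardware only runs the depth path, which
    // is what makes shadowmap generation roughly twice as fast.
    ctx->PSSetShader(nullptr, nullptr, 0);

    // Point the viewport at this split's tile inside the 3x1 atlas.
    D3D11_VIEWPORT vp = {};
    vp.TopLeftX = splitIndex * tileSize;
    vp.TopLeftY = 0.0f;
    vp.Width    = tileSize;
    vp.Height   = tileSize;
    vp.MaxDepth = 1.0f;
    ctx->RSSetViewports(1, &vp);

    DrawCasterGeometry(ctx);
}
```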
Next up is the Atlas1 pass, which renders shadowmaps for all stationary point lights. In my scene only the Rock prop is marked as "movable" (dynamic). For stationary lights and dynamic props, Unreal uses per-object shadowmaps which it stores in a texture atlas, meaning that it renders one shadowmap tile per dynamic prop per light:
Finally, for dynamic (Movable) lights, Unreal produces a traditional cubemap shadowmap for each (the CubemapXX passes), using a geometry shader to select which cube face to render to (to reduce the number of drawcalls). In it, it only renders dynamic props, using shadowmap caching for the static/stationary props. The CopyCachedShadowMap pass copies the cached cubemap shadowmap, and then the dynamic prop shadowmap depths are rendered on top. This, for example, is a face of the cached cube shadowmap for a dynamic light (the output of CopyCachedShadowMap):
And this is with the dynamic Rock prop rendered in:
The cubemap for the static geometry is cached and not produced every frame because the renderer knows that the light is not actually moving (although marked as “Movable”). If the light is animated, the renderer will actually render the “cached” cubemap with all the static/stationary geometry every frame, before it adds the dynamic props to the shadowmap (this is from a separate test I did to verify this):
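The caching logic, expressed as a small control-flow sketch (hypothetical helper names standing in for the engine's passes):

```cpp
void RenderStaticGeometryToCubeCache();  // rebuild cached static/stationary depths
void CopyCachedShadowMap();              // the CopyCachedShadowMap pass
void RenderDynamicPropsOnTop();          // dynamic casters drawn into the cubemap

struct ShadowCache { bool valid = false; };

void UpdateCubemapShadow(ShadowCache& cache, bool lightMovedThisFrame)
{
    if (lightMovedThisFrame)
        cache.valid = false;             // cached static depths are stale

    if (!cache.valid)
    {
        RenderStaticGeometryToCubeCache();
        cache.valid = true;              // reused until the light moves again
    }

    CopyCachedShadowMap();               // start from the cached static depths
    RenderDynamicPropsOnTop();           // then add the movable props
}
```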
The single Static light does not appear at all in the drawcall list, confirming that it does not affect dynamic props, only static ones, through the pre-baked lightmap.
Finally, a word of advice: if you have stationary lights in the scene, make sure that you bake lighting before doing any profiling in the Editor (at least; I am not sure what running the game as "standalone" does). If you don't, Unreal seems to treat them as dynamic, producing cubemaps instead of using per-object shadows.
In the next blog post we continue the exploration of how Unreal renders a frame by looking into light grid generation, g-prepass and lighting.
End of this installment.
Our core focus is engine internals and their commercialization, which may not be suitable for absolute beginners. Our official material mainly covers analyses of the engines on the market from various angles, along with big-picture technical discussion; readers interested in these areas can follow the 东汉书院 and 图形之心 public accounts.
Scattered fragments can never capture every aspect of graphics; only a systematic body of knowledge lets you fully understand and master a discipline, and that is the art of it. We have prepared all kinds of content for you. 东汉书院 awaits.