IT圈老男孩1 posted on 2022-6-19 19:41

A Beginner's Guide to Unreal 4 Performance

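Over the past four months, our cohort of about 50 students built a racing game together in Unreal 4.
I was responsible for performance on the team. My main job was to optimize the game so that it reached 60 FPS in the worst case (4 players in split screen plus 4 AI).
At Alpha, every level ran at around 30 FPS. Without modifying the engine or writing a custom render path, optimization alone brought three of the four levels to a stable 60 FPS by RTM, with the remaining one holding steadily above 50 FPS.
This article is aimed at new learners and intermediate-level readers. It mainly introduces the built-in tools UE4 provides and the terminology of the engine's internal rendering implementation. For lack of time, the body of the article is left in English rather than translated.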
Performance Profiling Tools Available in Unreal 4

As an advanced commercial engine, Unreal 4 ships with many built-in profiling tools. Developers can choose among them according to the scenario to reach the required performance. However, they need enough knowledge of the relevant hardware or software to understand the terms and concepts that appear in the data.
# Frontend Tools

## Session Frontend(ref)

### Enable the Profiler:


[*]Run the game with the parameter -messaging (for example UE4Editor-Win64-Debug.exe -messaging).
[*]Run the UFE with the parameter -messaging (for example UnrealFrontend-Win64-Debug.exe -messaging).
[*]Select Session Frontend from the Developer Tools section of the Window menu bar, then select the Profiler tab.
The steps above are the instructions from Unreal's official guide. Several points need clarification:

[*]First, the Session Frontend tool (UFE) is enabled by default while you are developing the game in the Unreal Editor, whether you play in the editor or run the game with a standalone launch.

[*]Second, ‘Run the game’ means running the actual build executable. If you launch the build with -messaging, you can profile the game in Session Frontend, though with more limited information. For more detail about the Session Frontend content, see the
[*]Third, you can also run the console commands `stat startfile` and `stat stopfile` to capture a .ue4stats file. This is Unreal 4's stats dump file, which UFE can load and visualize.
The file is written under a `Profiling` folder. Depending on the kind of build you are using, that is `${BuildDir}/Profiling` for a packaged build or `${ProjectDir}/Saved/Profiling/UnrealStats` in editor mode.


### What can I get from session frontend?

Once the data has been dumped, it is available for analysis in the UFE window. In short, it records how CPU time is spent, broken down along several dimensions. For example, the Graph view plots time against frames: the x-axis is the frame index and the y-axis is how much time the selected category spent in that frame. The official documentation in the reference links explains the UFE layout well.
The stat names are confusing at first glance. Here are some tips for identifying what each stat means:

[*]CPU Stall - Wait for Event/Sleep …
CPU Stall means the thread is mostly idle during that time. A thread cannot move on to the next frame until all threads have finished the current one, so a thread that finishes its work early has to wait, and that wait shows up as `Wait for Event`. `CPU Stall - Sleep` means the thread is paused and sleeping, doing nothing.

[*]XXX Tick, Tick Time
The keyword is Tick. In most cases, a Tick event (together with its child events) is the per-frame update of the thing it is named after. Most problems show up here.

[*]Self
Self is the time spent in the function itself, excluding the time spent in the functions it calls.
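
To make the Self column concrete, here is a minimal sketch of how a self time can be derived from inclusive times. It is not UFE's actual data model: `ProfilerEvent`, `SelfTimeMs`, and the event names in `MakeExampleTick` are hypothetical and exist only for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical profiler event: a name, its inclusive time, and its child events.
struct ProfilerEvent {
    std::string name;
    double inclusiveMs;  // time from entry to exit, children included
    std::vector<ProfilerEvent> children;
};

// Self time = inclusive time minus the inclusive time of the direct children.
double SelfTimeMs(const ProfilerEvent& event) {
    double childrenMs = 0.0;
    for (const ProfilerEvent& child : event.children) {
        childrenMs += child.inclusiveMs;
    }
    return event.inclusiveMs - childrenMs;
}

// Example tree (made-up numbers): a 5 ms tick that spends 3 ms and 1.5 ms in two callees.
ProfilerEvent MakeExampleTick() {
    return ProfilerEvent{"VehicleTick", 5.0, {
        ProfilerEvent{"UpdatePhysics", 3.0, {}},
        ProfilerEvent{"UpdateAI", 1.5, {}},
    }};
}
```

With these numbers, `SelfTimeMs(MakeExampleTick())` is 0.5 ms: the tick looks expensive, but nearly all of its cost sits in its children, so that is where to look first.
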
### Figure out bottleneck through UFE

If the frame is CPU bound, you can probably finish the profiling with the UFE alone. In a typical process, the team finds a frame rate issue, so we open the UFE and dump some data. We then need to work out whether the game tick is too slow, the render path (CPU side) is too slow, or the GPU is the bottleneck because it has too many things to draw.
If the game tick is too slow, the game scripts are not efficient enough. In that case, refer to the general Blueprint tips in the As a Programmer and As a (Level) Designer sections. Generally, most of the useful information is found under the Game Thread.

## GPU Visualizer(ref)

In the editor, typing `ProfileGPU` in the console dumps the current frame and shows a visual breakdown of the data in the GPU Visualizer. All times shown here are GPU times.





As shown in the picture, the rendering path in UE4 has many different stages.
Take one frame from the game and analyze it in RenderDoc as an example. As a side note, you can enable the RenderDoc plugin in UE4 under `Edit > Plugins > Rendering`. The stats shown in RenderDoc match the ones in the GPU Visualizer.
This is a link to the RenderDoc frame capture used in the example






[*]The SlateUI render pass covers everything related to the Unreal editor, so we can ignore it.
[*]The ParticleSimulation pass computes particle properties for all GPU particle emitters and encodes them in two render targets: RGBA32_Float for positions and RGBA16_Float for velocities. Our sample frame does not involve this pass.
[*]PrePass DDM_AllOpaque. This is technically a Z-prepass, where the engine renders all opaque meshes to an R24G8 depth buffer. Note that UE4 uses reverse-Z when rendering depth, which gives more precision.
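
As a rough illustration of what reverse-Z means (a sketch, not engine code), the two functions below map a view-space distance to depth with a conventional D3D-style 0-to-1 projection and with its reversed counterpart. The function names and the near/far plane parameters are assumptions made for this example.

```cpp
// Conventional D3D-style depth: 0 at the near plane, 1 at the far plane.
float StandardDepth(float viewZ, float nearZ, float farZ) {
    return (farZ / (farZ - nearZ)) * (1.0f - nearZ / viewZ);
}

// Reverse-Z depth: 1 at the near plane, 0 at the far plane.
// Algebraically this is 1 - StandardDepth(viewZ, nearZ, farZ).
float ReverseDepth(float viewZ, float nearZ, float farZ) {
    return (nearZ / (farZ - nearZ)) * (farZ / viewZ - 1.0f);
}
```

Both mappings are hyperbolic in `viewZ`, so most of the depth range is spent close to the camera. Reversing the range sends distant geometry toward 0, where floating-point values are most densely spaced, which is the usual explanation for the precision gain with floating-point depth formats.
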




[*]ResolveSceneDepthTexture. This does nothing in this frame. According to the source code:
void FSceneRenderTargets::ResolveSceneDepthTexture(FRHICommandList& RHICmdList, const FResolveRect& ResolveRect)
{
        SCOPED_DRAW_EVENT(RHICmdList, ResolveSceneDepthTexture);

        if (ResolveRect.IsValid())
        {
                RHICmdList.SetScissorRect(true, ResolveRect.X1, ResolveRect.Y1, ResolveRect.X2, ResolveRect.Y2);
        }

        FSceneRenderTargets& SceneContext = FSceneRenderTargets::Get(RHICmdList);
        uint32 CurrentNumSamples = SceneDepthZ->GetDesc().NumSamples;

        const EShaderPlatform CurrentShaderPlatform = GShaderPlatformForFeatureLevel;
        if ((CurrentNumSamples <= 1 || !RHISupportsSeparateMSAAAndResolveTextures(CurrentShaderPlatform)) || !GAllowCustomMSAAResolves)
        {
                RHICmdList.CopyToResolveTarget(GetSceneDepthSurface(), GetSceneDepthTexture(), true, FResolveParams());
        }
        else
        {
                ResolveDepthTexture(RHICmdList, GetSceneDepthSurface(), GetSceneDepthTexture(), FResolveParams());
        }

        if (ResolveRect.IsValid())
        {
                RHICmdList.SetScissorRect(false, 0, 0, 0, 0);
        }
}
This step is platform dependent. My current platform (Win10 x64) did not hit this part.

[*]ShadowFrustumQueries.
This event is triggered in two places:
Source\Runtime\Renderer\Private\SceneOcclusion.cpp(1287)
Source\Runtime\Renderer\Private\SceneOcclusion.cpp(1447)
Since it is empty in this frame, no actual drawing step is involved.

[*]BeginOcclusionTests. All of Unreal’s occlusion tests happen here. Unreal uses hardware occlusion queries for occlusion testing by default.


Different types of occlusion queries are applied depending on the context. Unreal picks the bounding volume to test based on the geometry. For example, it submits a sphere for dynamic point lights (our sample frame has none) and a cube for general shapes. In GroupedQueries, several geometries are grouped into a single draw call.





[*]HZB SetupMip xxx. Unreal builds the Hi-Z buffer, stored as R16_Float textures. It takes the depth buffer rendered in the PrePass as input and outputs a mipmap chain, where each level is downsampled from the previously generated one.
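
The mip chain construction can be sketched on the CPU as follows. This is an illustration under stated assumptions, not Unreal's shader: it assumes power-of-two dimensions, reverse-Z depth, and a conservative "farthest depth" reduction (which under reverse-Z is a `min`). `DepthMip`, `DownsampleFarthest`, and `BuildHzbChain` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One level of the chain: row-major depth values, width * height entries.
struct DepthMip {
    std::size_t width;
    std::size_t height;
    std::vector<float> texels;
};

// Build the next level by reducing each 2x2 block to one texel.
// Assumption: reverse-Z depth, so the farthest sample is the smallest value.
DepthMip DownsampleFarthest(const DepthMip& src) {
    DepthMip dst{std::max<std::size_t>(src.width / 2, 1),
                 std::max<std::size_t>(src.height / 2, 1), {}};
    dst.texels.resize(dst.width * dst.height);
    for (std::size_t y = 0; y < dst.height; ++y) {
        for (std::size_t x = 0; x < dst.width; ++x) {
            // Clamp so a source that is only one texel wide or tall stays in bounds.
            const std::size_t x0 = std::min(2 * x, src.width - 1);
            const std::size_t x1 = std::min(2 * x + 1, src.width - 1);
            const std::size_t y0 = std::min(2 * y, src.height - 1);
            const std::size_t y1 = std::min(2 * y + 1, src.height - 1);
            dst.texels[y * dst.width + x] = std::min(
                {src.texels[y0 * src.width + x0], src.texels[y0 * src.width + x1],
                 src.texels[y1 * src.width + x0], src.texels[y1 * src.width + x1]});
        }
    }
    return dst;
}

// Each level is downsampled from the previously generated one, down to 1x1.
std::vector<DepthMip> BuildHzbChain(const DepthMip& mip0) {
    std::vector<DepthMip> chain{mip0};
    while (chain.back().width > 1 || chain.back().height > 1) {
        chain.push_back(DownsampleFarthest(chain.back()));
    }
    return chain;
}
```

Keeping the farthest depth per block means a coarse texel never claims an occluder is closer than it really is, so a test against a coarse level can only err on the side of drawing too much.
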













[*]ShadowDepths. Computes the shadow maps. It creates quads for directional lights and static point lights, and cube maps for movable point lights.
[*]ComputeLightGrid. The light grid is generated in this stage using compute shaders. In this sample, the light grid is 20x12x32. The light grid splits the space into small boxes and is used for clustered shading, so that Unreal can support more lights in the scene. The shader file is at Engine\Shaders\Private\LightGridInjection.usf. Unreal uses the light grid in the Volumetric Fog pass (to add light scattering to the fog), the environment reflection pass, and the translucency rendering pass.
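
To give a feel for where a size like 20x12x32 can come from, here is a small sketch of screen-space grid sizing and cell indexing. The 64-pixel cell size and the 1280x720 viewport are assumptions chosen because they happen to reproduce 20x12; the capture's actual resolution and settings are not stated above. `GridDims`, `ComputeLightGridDims`, and `CellIndex` are hypothetical names.

```cpp
// Grid size in cells along screen X, screen Y, and depth.
struct GridDims {
    int x;
    int y;
    int z;
};

// Number of cells needed to cover the viewport, rounding up in X and Y.
GridDims ComputeLightGridDims(int viewWidth, int viewHeight, int cellSizePx, int numSlicesZ) {
    return GridDims{(viewWidth + cellSizePx - 1) / cellSizePx,
                    (viewHeight + cellSizePx - 1) / cellSizePx,
                    numSlicesZ};
}

// Flatten (cellX, cellY, sliceZ) into a linear index into a per-cell light list table.
int CellIndex(const GridDims& dims, int cellX, int cellY, int sliceZ) {
    return (sliceZ * dims.y + cellY) * dims.x + cellX;
}
```

Under those assumptions, `ComputeLightGridDims(1280, 720, 64, 32)` yields 20x12x32, or 7680 cells. Each cell stores only the lights that touch it, so a shaded pixel loops over a short local list instead of every light in the scene.
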



[*]CompositionBeforeBasePass. Here it clears the GBuffers to get ready for the main rendering part.



[*]BasePass. First, let’s clarify the EBasePassDrawListType enum value in UE4. This draw event is logged in Engine\Source\Runtime\Renderer\Private\BasePassRendering.cpp(970), SCOPED_DRAW_EVENTF(RHICmdList, StaticType, TEXT("Static EBasePassDrawListType=%d"), DrawType);
The definition of the type:
enum EBasePassDrawListType
{
        EBasePass_Default=0,
        EBasePass_Masked,
        EBasePass_MAX
};
Unreal renders the Masked items first, then the other items.
In each base pass, every draw call takes 11 textures as input and outputs to seven separate render targets (GBuffer A~E, SceneColor, SceneDepth).
The input textures are as follows:





I only want to talk about texture 0. Another source says it samples lighting information from 3 mipmapped atlases that appear to cache shadows and surface normals, but I am not sure about this. Please do tell me if you have more information :D
The output render targets are as follows:




GBufferA: RGB10A2_UNORM, world normal
GBufferB: PBR material properties(metalness, roughness, specular intensity...)
GBufferC: Albedo in RGB, AO in Alpha
GBufferD: Custom Data based on the shading model
GBufferE: Pre-baked shadowing factors

The remaining steps are fairly self-explanatory from their names. For more information about how Unreal renders a frame, check the reference links.
# Console Commands

## stat xx commands



There are a bunch of commands to see the statistics of the game. The most commonly used ones are:
Stat FPS
Stat UNIT
Stat UnitGraph
Stat Game
Stat RHI
Stat GPU

You can play around with them to see what they do.
# Debug Render Views

Here is the location of the debug view in Unreal Editor:





Be aware of these debug render views, which help to optimize the level.
For example, the following one is the view rendered in the wireframe mode:


The selected mesh is rendered in yellow. Static Meshes are cyan, Stationary Meshes are magenta, and Movable Meshes are purple. Through this view, you get an overview of how densely meshes are packed in a given view and whether every mesh has a reasonable mobility setting.
The Shader Complexity & Quads view helps check whether LODs are set up correctly and whether the shaders for different geometry are reasonably complex.


General Guides in Development

# Zoo Levels

Zoo levels isolate a particular feature and also serve as living documentation for other developers (they are commonly referenced in the TDD/GDD).
Performance-wise, zoos are set up to stress-test one particular aspect of the engine. A zoo level should be tuned until it runs at exactly 60 FPS, which gives a precise idea of how much of that aspect we can afford in this engine on this device. Below are several categories to build zoos around when starting with a new engine or a new device. All of them are optional, so skip any category that is trivial for the game.

Mesh: static/stationary/movable
Physics: collision…
lighting: static/stationary/movable, point/directional/spot


One critical point bears repeating: always build zoos in isolation. Extra costs creep in very easily and affect the frame rate. For example, when testing how many movable meshes a given view can hold, it is tempting to give each mesh a small script that rotates it a little every frame, but enough of these small scripts affect performance if we spawn them without constraints. So be careful with the zoo level, and double-check the result with the built-in frontend tools.
# Be aware of your bottleneck

The bottleneck is the most significant issue dragging down the frame rate. For example, suppose that in one frame your logic (CPU) takes 20 ms and the GPU needs 16 ms to render the scene. The GPU then stalls for another 4 ms, so the frame takes 20 ms. We call this case CPU bound. Similarly, if the CPU takes 16 ms and the GPU takes 20 ms, the frame still takes 20 ms, and we call it GPU bound.
Furthermore, a GPU-bound frame can be vertex bound or pixel/fragment bound.
If it is vertex bound, the most common reason is that the GPU takes too long to process all the vertices, which means either there are too many triangles or the per-vertex shader logic is too complicated.
If it is fragment bound, the cause is commonly related to the screen resolution. Maybe there are too many post-processing steps, or some fragment shader is too heavy.
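
The arithmetic above can be written down as a small helper. This is a simplified model that assumes CPU and GPU work overlap fully, so the frame takes as long as the slower of the two. `FrameTiming`, `Classify`, and `MeetsBudget` are hypothetical names, not engine API.

```cpp
#include <algorithm>
#include <string>

struct FrameTiming {
    double frameMs;     // time the frame actually takes
    double cpuStallMs;  // time the CPU spends waiting for the GPU
    double gpuStallMs;  // time the GPU spends waiting for the CPU
    std::string bound;  // "CPU" or "GPU"
};

// The slower side sets the frame time, and the faster side stalls for the difference.
FrameTiming Classify(double cpuMs, double gpuMs) {
    const double frameMs = std::max(cpuMs, gpuMs);
    return FrameTiming{frameMs, frameMs - cpuMs, frameMs - gpuMs,
                       cpuMs >= gpuMs ? "CPU" : "GPU"};
}

// A frame meets the target when it fits in 1000 / targetFps milliseconds.
bool MeetsBudget(const FrameTiming& timing, double targetFps) {
    return timing.frameMs <= 1000.0 / targetFps;
}
```

`Classify(20.0, 16.0)` reports a 20 ms CPU-bound frame with a 4 ms GPU stall. Note that at a 60 FPS target the budget is about 16.7 ms, so neither example frame in the text would meet it, and only work on the bound side shortens the frame.
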

# Performance in Daily Development Tips

## When Should We Be Concerned about Performance?

While the game is being developed, things change rapidly every day. We do not want to worry about performance too early, because the content may be prototype or temporary content that gets replaced later.
At Vertical Slice, the whole team needs to care about performance, because the vertical slice is commonly treated as the final quality level of the game. At that point, artists can optimize assets for the level, designers can lay out the level with final-quality meshes, and programmers can profile the level and make targeted optimizations.
At Alpha and later milestones, performance gets top priority. The frame rate should be monitored through daily builds so that we notice potential problems immediately and can allocate team resources to handle them if necessary.
## As a Programmer


[*]Most CPU bottlenecks are not hard to find and optimize, so make it work first!
[*]When there is a bottleneck from your blueprint or your code, you can use Session Frontend to profile your logic.
[*]Some general things to avoid:

[*]Don’t cast types every frame.
[*]Don’t search for actors by class every frame.
[*]Be aware of what each operation costs. Even a simple-looking Blueprint node can be expensive (the cast is a good example).
[*]Expensive updates can often run at a lower frequency. For example, say you have 8 units, each with an expensive operation that runs 60 times/sec. If you instead update one unit per frame, each unit still gets updated 7.5 times/sec.
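
The last tip can be sketched as a round-robin scheduler. This is plain C++ rather than Unreal code, and `Unit`, `StaggeredUpdater`, and `UpdatesPerSecond` are hypothetical names used only to illustrate the idea.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical unit with an expensive update we want to run less often.
struct Unit {
    int expensiveUpdates = 0;
    void ExpensiveUpdate() { ++expensiveUpdates; }
};

// Round-robin scheduler: each frame, only one unit runs its expensive update.
class StaggeredUpdater {
public:
    explicit StaggeredUpdater(std::size_t unitCount) : units_(unitCount) {}

    // Call once per frame.
    void Tick() {
        if (units_.empty()) return;
        units_[next_].ExpensiveUpdate();
        next_ = (next_ + 1) % units_.size();
    }

    const std::vector<Unit>& Units() const { return units_; }

private:
    std::vector<Unit> units_;
    std::size_t next_ = 0;
};

// Updates each unit receives per second when one unit is updated per frame.
double UpdatesPerSecond(double fps, std::size_t unitCount) {
    return fps / static_cast<double>(unitCount);
}
```

With 8 units at 60 FPS, `UpdatesPerSecond(60.0, 8)` gives 7.5, matching the figure above, while the per-frame cost drops from eight expensive updates to one. Whether 7.5 updates per second is enough depends on the operation, so treat the stagger interval as something to tune.
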

## As an Artist


[*]Every model should have a budget. As an artist, if you are not sure about the budget, ask your programmers or designers! If they cannot tell you the budget, push them and treat it as a blocker. This greatly reduces the chance of having to remake or optimize the models later.
[*]Save your budget and make your models as simple as possible. If it is a cube, there is no excuse for it to have more than 12 triangles.
[*]Be aware of how expensive your materials are. If you are not sure, consult your programmer teammates.
[*]When making models, communicate with your designers about how they will use them, so that you can make targeted optimizations (like removing hidden triangles or making LODs).
## As a (Level) Designer


[*]As with artists, be aware of your budget so that you can lay out your level with more confidence. That means pushing your programmers instead of waiting to hear from them.
[*]When white-boxing your level, always keep in mind that we need to hit 60 FPS. Check all the player views to make sure no view has too many unnecessary details, contents, or overdraw.
[*]When working with Blueprint, be aware that even a small script can be non-trivial when thousands of them run in a frame. Your programmer teammates will be glad to give advice.

Credits

Thanks for the help and review from the following friends:

[*]Xia Hua(Bryan) @做个小游戏
[*]Joe Holan

reference:


[*]https://tw.wxwenku.com/d/105049640
[*]https://interplayoflight.wordpress.com/2017/10/25/how-unreal-renders-a-frame/
[*]https://docs.unrealengine.com/en-us/Engine/Performance/GPU

acecase posted on 2022-6-19 19:48

Impressive.

pc8888888 posted on 2022-6-19 19:50

I can't follow it, but it's impressive.

yukamu posted on 2022-6-19 19:59

It's all thanks to the producer managing things well, otherwise there would have been no time to write this.

acecase posted on 2022-6-19 20:03

Shouldn't the frame rate figures say which OS and hardware platform they were measured on?

johnsoncodehk posted on 2022-6-19 20:03

Sorry, which part are you referring to?

yukamu posted on 2022-6-19 20:06

The beginning of the article.

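Arzie100 posted on 2022-6-19 20:14

Ah, right, I indeed didn't mention that. But since there is no comparison with other work, it shouldn't affect understanding of the content :P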

七彩极 posted on 2022-6-19 20:18

I didn't quite follow the CPU Stall part. For the Wait for Event stat, does a higher or a lower value mean a greater performance cost?

Ylisar posted on 2022-6-19 20:27

It means this thread is idle and waiting for other threads.