Graph Performance: 1296 Glyphs processed in ~11ms on Raspi 4b

Graph Perf Update: 1296 chars to Region per Frame: (updated post 2x)

  • RaspiPi4 11.34ms (regioned) + 5.5ms (draw)
  • PC 1.93ms (regioned) + 0.28ms (draw)

Performance update from commit 607eb99b9cad227dd7be6d149c6b6cf57d060c35
(Note: There I mentioned the total duration for 20 frames, not per frame)

Edit: I have updated the logs and numbers, now with vsync swap-buffers turned off. However, this seems to make no difference on the Raspi 4b. Also added screenshots of the visual performance analysis.Above a screenshot from a Raspberry Pi 4b (console) using a 800×480 pixel screen.

All Raspberry Pi 4 results were using our DRM EGL/GBM console driver and the Open Source ES3.1 driver and GLContext.

regioned is the process where all single pre-computed OutlineShape instances per Glyph are processed to become one Region. This process includes our Font layouting and Region.addOutlineShape().

Region.addOutlineShape() itself performs the triangulation of the shapes, compounding of all vertices and pushing all data down to the VBO buffer, ready to be rendered. Hence, the crucial Graph hotspot.

A Performance @ 2.4.0 with 119,787 vertices:
doc/curve/tests/perf00/rpi4_old.log
– RaspiPi4 57.20ms (regioned) + 23.4ms (draw)

B Performance @ last commit with 81,092 vertices:
doc/curve/tests/perf01/rpi4_7.log + doc/curve/tests/perf01/pc_7.log
– RaspiPi4 11.76ms (regioned) + 3.5ms (draw)
– PC 3.4ms (regioned) + 0.35ms (draw)

C Now with 81,092 vertices from a 1296 character string and font FreeSans (vsync off):
doc/curve/tests/perf02/rpi4_10.log + doc/curve/tests/perf02/pc_10.log
– RaspiPi4 11.34ms (regioned) + 5.5ms (draw)
– PC 1.93ms (regioned) + 0.28ms (draw)

Hence we have achieved a reasonable performance enhanced from A -> C.

Most important is that neither Flight Recorder nor Visual VM could identify
Region.addOutlineShape()‘s triangulation nor its compounding of all vertices to be a significant bottleneck.
After further triangulation bugfixes (delauny tessellation), we will re-validate this part.

Enhancements of VBO GLArrayData data management
where Region.addOutlineShape() finally pushes the data into the VBO helped to remove certain overhead.

The buffer-size enhancements including API-hooks to count the required vertices & indices to issue Region.setBufferCapacity() helped to ease the GC.

Raspi 4b Flight Recorder Screenshot after 1 minute of intensive method sampling

The following JVM launch options were used.

-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints
-XX:FlightRecorderOptions=stackdepth=2048,threadbuffersize=16k

Raspi 4b Visual VM Screenshot after 1 minute of intensive CPU sampling (10ms)

Interestingly Visual VM still shows swapBuffers to be a performance hog, despite not using VSync.

Perhaps I have to test on a lower resolution, since this test here use a screen pixel size of 1920 x 1080, not quite the usual embedded device resolution maybe 😉

 

3 thoughts on “Graph Performance: 1296 Glyphs processed in ~11ms on Raspi 4b”

  1. OK, almost same on a different Raspi-4b machine using a 800 x 480 resolution screen.
    Here we have 10.65ms for region and 3.75ms to draw w/ a slightly longer text of 1315 chars instead of 1296.

    VisualVM showed swap-buffer being the hog 😉

    (Produced with commit 9e3070acf9c70a8b333435889ae77e581cd65b32)

    GLRegion: for GLProfile[GLES3/GLES3.hw] using int32_t indiced: true
    GLRegion: Region[vbaa, q 1, dirty 0, vertices 85758, box [ dim 38.57701 x 20.159 x 0.0, box 0.672 / -19.418001 / 0.0 .. 39.249012 / 0.741 / 0.0, ctr 19.960506 / -9.338501 / 0.0 ]]
    Text length: 1315
    VBORegion2PVBAAES2: idx32 true
    indices [elements 44,750 cnt / 61,614 cap, bytes 537,000 cnt / 739,368 cap, filled 72.6%, left 27.4%]
    vertices [elements 85,758 cnt / 101,930 cap, bytes 1,029,096 cnt / 1,223,160 cap, filled 84.1%, left 15.9%]
    params [elements 85,758 cnt / 101,930 cap, bytes 1,029,096 cnt / 1,223,160 cap, filled 84.1%, left 15.9%]
    color [null]
    total [bytes 2,595,192 / 3,185,688], filled 81.5%, left 18.5%]
    20 / 40: Perf Frame40: Total: graph 3, txt 213, draw 75, txt+draw 288 [ms]
    20 / 40: Perf Frame40: PerLoop: graph 173,006, txt 10,651,301, draw 3,750,390, txt+draw 14,401,692 [ns]

  2. One unexpected overhead was the chosen FSAA from the GLCapabilities, … memories ..
    https://jogamp.org/cgit/jogl.git/commit/?id=adfe4abec47313d2c533096f6c3e9a94d2fc4571

    Graph GPUUISceneNewtDemo: Filter out all FSAA (multisample) caps if undesired (Graph VBAA + MSAA); Add NonFSAAGLCapabilitiesChooser
    Notable: On RaspiPi4b w/ Mesa3D’s Broadcom/VC driver,
    the chosen capabilities is a multisamnple one even though not requested.

    This causes
    – extra performance overhead
    – doubled AA: 1st our VBAA, then the FSAA (multisample) -> loss of sharpness

    Simply dropping the undersired FSAA helps and ups performance
    on the Raspi board (22 -> 35 fps).

Comments are closed.