One PolyData Mapper Instead of Two: Measuring Vertex Pulling on the Desktop 

June 11, 2026

A data-driven case for unifying vtkOpenGLPolyDataMapper and vtkOpenGLLowMemoryPolyDataMapper.

TL;DR

VTK ships two OpenGL polydata mappers that do the same job in two different ways, and we maintain both. I wrote the second one, a WebGL2/GLES 3.0 mapper, back in 2023 because VTK.wasm needed polydata rendering, and I built it on vertex pulling. The name comes from the vertex shader deciding which vertex data to read, as opposed to the traditional way, where vertex data is supplied automatically via attributes. The historical reason for keeping the two mappers separate was a performance belief: vertex pulling is too slow for the desktop. We turned that belief into measurements on a native NVIDIA GL stack, and it does not hold up. With a two-line change, switching to an indexed draw, vertex pulling matches the classic interleaved-VBO mapper on GPU time across every workload we tried, carries a realistic 4-attribute vertex with no penalty, and adds no measurable CPU overhead per frame. The assumption that kept the mappers apart is no longer relevant. This post shows the numbers and argues we should merge the two mappers.

The maintenance problem

VTK has two code paths for drawing polygonal surfaces through OpenGL:

  • vtkOpenGLPolyDataMapper is the desktop default for OpenGL 3.2+. Geometry goes into an interleaved Vertex Buffer Object plus a per-primitive Index Buffer Object; vertices are fetched by fixed-function hardware; wide lines and wireframe mesh rendering lean on geometry shaders.
  • vtkOpenGLLowMemoryPolyDataMapper is built for OpenGL ES / WebGL2, where the feature set is narrower. It uses vertex pulling: the raw data arrays live in texture buffers instead of VBOs, and the vertex shader reconstructs each vertex from gl_VertexID with texelFetch. No geometry shaders, no dedicated VBO/IBO/SSBO machinery, just texture buffers and shader code.
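To make the difference between the two fetch styles concrete, here is a toy CPU-side model in Python. It is not VTK code: the arrays and the function names classic_fetch and pulled_fetch are made up for illustration, and plain list indexing stands in for what texelFetch on gl_VertexID does in the real shader.

```python
# Toy CPU model of the two fetch styles; illustrative only, not VTK code.
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
normals = [(0.0, 0.0, 1.0)] * 3

# Classic path: one interleaved record per vertex (position followed by normal).
interleaved = [p + n for p, n in zip(positions, normals)]


def classic_fetch(vertex_id):
    """Model of fixed-function fetch: slice the interleaved record into attributes."""
    record = interleaved[vertex_id]
    return record[0:3], record[3:6]


def pulled_fetch(vertex_id):
    """Model of vertex pulling: index each raw array with the vertex id."""
    return positions[vertex_id], normals[vertex_id]


for vertex_id in range(len(positions)):
    # Both styles must reconstruct exactly the same vertex.
    print(vertex_id, classic_fetch(vertex_id) == pulled_fetch(vertex_id))
```

Both paths produce the same vertex. What differs is who does the lookup: the fetch hardware in the classic case, shader code in the pulling case.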

Two mappers means two implementations of every feature, two places for every bug, and a steady drift between “what works on the desktop” and “what works on the web.” Every new capability gets built, reviewed, and tested twice. That is the maintenance burden we’d like to eliminate while we migrate to WebGPU and transition the OpenGL backend to maintenance mode.

The obvious move is to keep one. And the low-memory, pulling-based mapper (vtkOpenGLLowMemoryPolyDataMapper) looks appealing: it already runs everywhere WebGL2 runs, it has a smaller surface area and memory footprint, and it doesn’t depend on geometry shaders. The only thing standing in the way is the desktop performance question.

The objection, stated precisely

“Vertex pulling is slower on the desktop.” That’s a reasonable assumption. Fixed-function vertex fetch (used when VBOs are defined) is purpose-built silicon; replacing it with texelFetch in the shader sounds like trading a hardware path for a software one. If that costs 1.4× on every frame, no desktop user wants the merged mapper.

So we measured it. The rule for the whole investigation was: don’t argue about the assumption, turn it into a benchmark. Every claim below is backed by a test that ships in Rendering/OpenGL2/Testing/Cxx/.

How we measured

All numbers here are from a single workstation: an NVIDIA RTX A5000, driver 580.159.04, native OpenGL 4.6 on Linux, rendering offscreen.

Two measurement details matter:

  1. Use the GPU timer, not the wall clock. We time frames with vtkOpenGLRenderTimer (GL timestamp queries), which reports the actual GPU execution time of the draw, free of CPU submission cost and the per-frame glFinish floor. On this driver the timer works (vtkOpenGLRenderTimer::IsSupported() is true); on some others (Apple’s GL-on-Metal) it returns zero and you’re stuck with whole-frame wall time, which is much noisier.
  2. Disable vsync or you measure how fast your monitor is. Even an offscreen render honored the driver swap interval here, pinning every variant to a flat 16.67 ms (60 Hz) and making the GPU column read garbage. Running with __GL_SYNC_TO_VBLANK=0 vblank_mode=0 exposes the real sub-millisecond draw costs. This one bit me for a good ten minutes; consider it a public-service announcement.
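One cheap guard against the vsync trap is to check whether the recorded frame times all sit on a multiple of the display refresh period. The sketch below is a hypothetical helper, not part of the benchmarks; the function name, the 60 Hz default, and the 0.1 ms tolerance are assumptions chosen for illustration.

```python
def looks_vsync_pinned(frame_ms, refresh_hz=60.0, tolerance_ms=0.1):
    """Return True when every frame time is within tolerance of a whole
    number of refresh periods, which suggests vsync is capping the frame rate."""
    period = 1000.0 / refresh_hz  # 16.67 ms at 60 Hz
    for t in frame_ms:
        # Nearest whole number of refresh periods, at least one.
        multiples = max(1, round(t / period))
        if abs(t - multiples * period) > tolerance_ms:
            return False
    # An empty sample set proves nothing, so report "not pinned".
    return len(frame_ms) > 0


print(looks_vsync_pinned([16.67, 16.66, 16.68]))  # flat 60 Hz frames
print(looks_vsync_pinned([0.18, 0.19, 0.31]))     # sub-millisecond draws
```

A check like this could run before any numbers are trusted, so a pinned run is rejected rather than reported.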

Result 1: the slowness was the draw call, not pulling

Here’s the measurement that reframes everything. Same geometry (2 M triangles, ~1 M shared vertices, small viewport so the workload is vertex-bound), three ways of drawing it, GPU milliseconds per frame:

| draw path | GPU ms/frame |
| --- | --- |
| non-indexed pulling (glDrawArraysInstanced) | 0.31 |
| indexed pulling (glDrawElementsInstanced) | 0.18 |
| classic interleaved VBO/IBO | 0.19 |

Table 1: (GPU ms/frame)

The low-memory mapper’s reputation for being ~1.7× slower is real for the non-indexed draw. With glDrawArraysInstanced, the post-transform vertex cache can’t help: every shared vertex is re-fetched and re-shaded for each triangle that uses it (~6× on a closed mesh).

The fix is almost embarrassingly simple! Issue the draw as glDrawElementsInstanced with an element array buffer instead. With an indexed draw, gl_VertexID is the fetched index value, so the post-transform cache works exactly as it does for the classic mapper, and the shader can use gl_VertexID directly as the point ID. One indexed draw call restores the cache, and pulling lands at 0.18 ms vs the classic mapper’s 0.19 ms. Bit-identical output.
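The cache argument can be illustrated with a toy model. The Python sketch below counts vertex-shader runs for a regular grid mesh drawn both ways. It assumes a simple FIFO cache keyed on the index value; the cache size of 32 is a made-up parameter, and real GPUs use different and more sophisticated reuse schemes, so treat the counts as qualitative only.

```python
from collections import deque


def grid_triangles(n):
    """Index triples for an n x n grid of quads, two triangles per quad."""
    tris = []
    for row in range(n):
        for col in range(n):
            a = row * (n + 1) + col  # top-left corner of the quad
            b = a + 1                # top-right
            c = a + (n + 1)          # bottom-left
            d = c + 1                # bottom-right
            tris.append((a, b, c))
            tris.append((b, d, c))
    return tris


def shaded_non_indexed(tris):
    """Without indices, every triangle corner is a fresh vertex-shader run."""
    return 3 * len(tris)


def shaded_indexed(tris, cache_size=32):
    """Count vertex-shader runs with a FIFO cache keyed on the index value."""
    cache = deque(maxlen=cache_size)  # the oldest entry is evicted when full
    runs = 0
    for tri in tris:
        for index in tri:
            if index not in cache:
                runs += 1
                cache.append(index)
    return runs


tris = grid_triangles(100)
print("non-indexed runs:", shaded_non_indexed(tris))
print("indexed runs:    ", shaded_indexed(tris))
```

In the ideal limit, where the cache never evicts a vertex that is still needed, the indexed draw shades each unique vertex once. For a mesh with about twice as many triangles as vertices, that is 3T runs against V, which approaches the ~6× figure quoted above.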

We swept three workload regimes (vertex-bound, setup-bound, and fill-bound) to make sure this wasn’t a single lucky workload:

| workload | indexed pulling | classic VBO/IBO | pulling / classic |
| --- | --- | --- | --- |
| vertex-bound, 2.0 M tris | 0.180 | 0.199 | 0.90 |
| vertex-bound, 4.5 M tris | 0.360 | 0.369 | 0.98 |
| setup-bound, 0.5 M / huge viewport | 0.126 | 0.140 | 0.90 |
| fill-bound, 3 K tris / huge viewport | 0.103 | 0.115 | 0.89 |

Table 2: (GPU ms/frame; < 1.0 means pulling is faster.)

Indexed pulling ties or beats the classic mapper in every regime, including the tiny-triangle “setup-bound” case where fixed-function fetch was most expected to pull ahead. On this hardware, it simply doesn’t.

Result 2: no CPU tax either

GPU time isn’t the whole story. Pulling rebinds texture buffers and could cost more on the CPU side per frame. So we instrumented the benchmark to time the vtkRenderWindow::Render() call itself, which measures pure CPU command submission separately from the GPU drain (glFinish).

| triangles | indexed pulling, CPU ms | classic, CPU ms |
| --- | --- | --- |
| 2.0 M | 0.125 | 0.138 |
| 4.5 M | 0.128 | 0.126 |
| 8.0 M | 0.128 | 0.122 |

Table 3: (CPU ms/frame)

CPU submission is flat at ~0.13 ms and essentially identical between the two paths, across a 4× range of triangle counts. There is no per-triangle CPU scaling and nothing pulling-specific to optimize; that ~0.13 ms is shared renderer/window work (clear, camera, state), not the mapper.

This result is worth dwelling on because it corrected us. An earlier draft of the analysis read a wobble in the whole-frame wall-clock numbers as “pulling has CPU overhead that grows with triangle count”, a plausible story that turned out to be measurement noise on the noisier timer. Reading the code settled it: the per-frame pulling path (vtkDrawTexturedElements::PreDraw/Draw/PostDraw) is O(1), buffer uploads early-return once the data is resident, and the shader string-substitution only re-runs when something actually changes, which it doesn’t frame to frame. The direct CPU measurement column then confirmed what we read in the code. The lesson learned was to trust the GPU timer, and verify timing intuitions against the actual hot path.

Result 3: a realistic pipeline doesn’t change the verdict

Real meshes don’t carry just positions. They carry normals, colors, texture coordinates, and other attributes. Pulling fetches each attribute from its own texture buffer with a separate texelFetch, where the classic mapper packs them into one cache-friendly interleaved record. Does pulling’s per-attribute fetch fall behind as the vertex gets fatter?

We swept the attribute count 1→4 on both paths (4.5 M triangles, vertex-bound):

| attributes | pulling GPU ms | classic GPU ms |
| --- | --- | --- |
| position | 0.362 | 0.355 |
| + normal | 0.351 | 0.368 |
| + color | 0.353 | 0.379 |
| + texcoord | 0.376 | 0.469 |

Table 4: (GPU ms/frame)

Adding three extra per-vertex buffer fetches raises pulling’s GPU time by only ~4–6%: the independent fetches are latency-hidden, and there’s no cliff. Pulling stays at parity or better the whole way. (The classic column’s growth is partly its own per-fragment shading turning on as features are enabled, so read the flat pulling column as the clean result, not the gap.)
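As a quick check on how flat the pulling column is, the arithmetic below computes the percent change from the position-only row to the four-attribute row, using the Table 4 values as published. The helper name growth_percent is made up for this sketch.

```python
# GPU ms/frame from Table 4: position-only first, all four attributes last.
pulling = [0.362, 0.351, 0.353, 0.376]
classic = [0.355, 0.368, 0.379, 0.469]


def growth_percent(series):
    """Percent change from the first entry to the last."""
    return 100.0 * (series[-1] - series[0]) / series[0]


print(f"pulling: {growth_percent(pulling):+.1f}%")
print(f"classic: {growth_percent(classic):+.1f}%")
```

Measured against the position-only baseline, pulling grows by about 3.9% and the classic column by about 32.1%, though the classic figure includes the per-fragment shading cost noted above.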

The one place it’s nuanced: how you generate shaders

Not everything came out in pulling’s favor, and the place it didn’t is the most useful thing we learned for how to build the merged mapper.

A unified mapper has to choose a shader-generation strategy: specialize (string-substitute an exact shader per configuration, VTK’s current approach in the desktop GL mapper; the compiler sees only the active code and can unroll and constant-fold) or run an über-shader that contains every path and selects at runtime by branching on a uniform (simpler to maintain, fewer compile hitches, but sized for the worst case). Because the number of texture fetches depends on the number of lights in a scene, we pitted them against each other across different light counts, with two material types:

| 32 lights | specialized | über-shader | winner |
| --- | --- | --- | --- |
| ALU-bound (no texturing) | 0.244 | 0.39 | specialized, 1.6× |
| latency-bound (per-light texture fetch) | 0.514 | 0.499 | über, ~1.03× |

Table 5: (GPU ms/frame)

The verdict flips with the workload. When the shader is compute-bound, specialization wins big. But make the same shader memory-bound (a texture fetch per light, the realistic case for a textured material) and the über-shader comes out ahead. At low light counts the two are a wash either way.

The takeaway isn’t “pick one.” It’s default to a uniform-branching über-shader (simpler, fewer permutations, and as fast or faster for realistic textured materials) and reserve string-substitution specialization for the ALU-bound, constant-folding hot paths: high light or clip-plane counts with no per-fragment texturing. That’s a concrete design input to consider when merging these mappers, and we’d have guessed it backwards without measuring!
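That rule of thumb can be written down as a small decision function. The Python sketch below is hypothetical: the function name, its parameters, and especially the threshold of 16 are assumptions for illustration. The benchmark only tells us that 32 lights without texturing favors specialization and that low light counts are a wash, so a real threshold would have to come from further measurement.

```python
def choose_shader_strategy(num_lights, num_clip_planes, per_fragment_texturing,
                           high_count_threshold=16):
    """Return "uber" by default and "specialized" only for ALU-bound hot paths.

    high_count_threshold is a placeholder value, not a measured crossover point.
    """
    heavy_alu = max(num_lights, num_clip_planes) >= high_count_threshold
    if heavy_alu and not per_fragment_texturing:
        # Many lights or clip planes and no texture fetches to hide behind:
        # unrolling and constant folding pay off here.
        return "specialized"
    # Textured materials and low counts: the uniform-branching shader is
    # as fast or faster and has fewer permutations to maintain.
    return "uber"


print(choose_shader_strategy(32, 0, per_fragment_texturing=False))
print(choose_shader_strategy(32, 0, per_fragment_texturing=True))
print(choose_shader_strategy(2, 0, per_fragment_texturing=False))
```

Keeping the choice in one function like this would make it easy to retune once numbers from other GPUs are available.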

What this means for merging the mappers

Let’s compare our assumptions to the contradictory evidence we have gathered:

  • “Pulling is slower on the desktop.” Not on GPU time: indexed pulling ties or beats the classic mapper in every regime we measured on a native NVIDIA GL stack.
  • “The draw-call overhead will hurt.” Pulling’s per-frame CPU is flat and identical to the classic mapper’s.
  • “Realistic multi-attribute vertices will expose the per-fetch cost.” Going from one attribute to four costs pulling only ~4–6%, with no penalty versus the interleaved VBO.

The performance wall that justified two mappers wasn’t real. What’s there instead points to maintaining a single mapper, built on indexed vertex pulling, that runs on WebGL2 and the desktop at desktop-competitive speed, and a clear recommendation for its shader backend (uniform branching by default, targeted specialization where it constant-folds). One implementation of every feature. One place to fix every bug. The WebGL2/desktop drift goes away because there’s nothing to drift.

Honest caveats

This is a prototype measured on one machine. Before committing to the merge, we should:

  • Validate on other GPUs/drivers: AMD, Intel, and Apple at minimum. The story is unlikely to invert (the indexed-draw cache win is architectural), but the setup-bound margins are where a different driver could behave differently.
  • Validate no performance losses in VTK.wasm with WebGL2: The indexed pulling approach is unlikely to cause a performance drop here, but we should test this under VTK.wasm as well.
  • Account for the desktop features that lean on geometry shaders today (wide lines, some glyphing, point sprites) and confirm the pulling equivalents cover them at parity. This is feature work, not a perf unknown.
  • Keep the GPU timer and the vsync-off discipline in whatever CI perf guard we add, so a regression shows up as a number, not a vibe such as “it feels slow”.

None of these are the old assumptions. They’re the typical things that need to be worked on when landing a merge whose core performance question has already been answered.

Try it yourself

The prototype and all three benchmarks live on my branch https://gitlab.kitware.com/jaswant.panchumarti/vtk/-/tree/indexed-vertex-pulling-prototype:

cmake --build <build-dir> --target vtkRenderingOpenGL2CxxTests

# data flow: non-indexed vs indexed pulling vs classic, with GPU + CPU columns
__GL_SYNC_TO_VBLANK=0 vblank_mode=0 \
  <build-dir>/bin/vtkRenderingOpenGL2CxxTests TestDrawTexturedElementsIndexedPerf

# multi-attribute sweep
__GL_SYNC_TO_VBLANK=0 vblank_mode=0 \
  <build-dir>/bin/vtkRenderingOpenGL2CxxTests TestDrawTexturedElementsMultiAttributePerf

# shader strategy: specialized vs uber, ALU- and latency-bound
__GL_SYNC_TO_VBLANK=0 vblank_mode=0 \
  <build-dir>/bin/vtkRenderingOpenGL2CxxTests TestDrawTexturedElementsShaderStrategyPerf 200 2000 200

Run them on your hardware. If the numbers hold, and the indexed-draw result should, then maintaining two mappers is a cost we’re choosing to keep paying for a performance gap that no longer exists. Can we pull this off? The preliminary data says yes.

AI use disclosure

Anthropic’s Claude Opus 4.8 was leveraged to code up benchmarks that validate/rule out assumptions.

Acknowledgements

VTK is an open source toolkit developed by an extended community. Refer to VTK’s GitLab repository for a detailed capture of contributions and enhancements. Research reported in this publication was supported by the National Institute of Biomedical Imaging and Bioengineering of the National Institutes of Health under Award Number R01EB014955. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
