This blog post was originally published at NVIDIA’s website. It is reprinted here with the permission of NVIDIA.
The constantly increasing compute throughput of NVIDIA GPUs presents a new opportunity for optimizing vision AI workloads: keeping the hardware fed with data. As GPU performance continues to scale, traditional data pipeline stages, such as I/O from storage, host-to-device data transfers (PCIe), and CPU-bound processing like decoding and resizing, don’t always keep pace. This disparity can create a bottleneck where the accelerator is left waiting for data, a challenge often called GPU starvation. Closing this data-to-tensor gap requires a smarter data pipeline designed to align with modern, high-performance hardware.
This post introduces the NVIDIA CUDA-accelerated implementation of SMPTE VC-6 (ST 2117-1), a codec architected for massively parallel computation. We explore how VC-6’s native features, such as its hierarchical multi-resolution structure and selective decoding and fetching, are a natural fit for the parallel architecture of GPUs. By directly mapping the codec’s inherent parallelism to a GPU’s architecture, we can build a more efficient path from compressed bits to model-ready tensors. We’ll cover the performance gains of moving from CPU and OpenCL to a CUDA implementation, demonstrating how this approach delivers accelerated decoding that demanding AI applications can consume directly.
What is VC-6?
SMPTE VC-6 is an international standard for image and video coding designed from the ground up for direct, efficient interaction with modern compute architectures, particularly GPUs. Instead of encoding an image as a single, flat block of pixels, VC-6 generates an efficient multi-resolution hierarchy. Figure 1 shows an example, demonstrating powers-of-two downscaling between resolutions (8K, 4K, Full HD).
Figure 1. VC-6 hierarchical reconstruction
The encoding process works as follows:
- The source image is recursively downsampled to create multiple layers, called echelons, each representing a different level of quality (LoQ).
- The smallest echelon serves as the low-resolution root LoQ and is encoded directly.
- The encoder then reconstructs upwards. For each higher level, it upsamples the lower-resolution version and subtracts it from the original to capture the difference, or residuals.
- The final bitstream contains the root LoQ followed by these successive residual layers.
Figure 2. VC-6 encoder pipeline
The VC-6 decoder reverses the process successively through every LoQ: it starts with the root LoQ, upsamples to the next LoQ, and adds the corresponding residuals, repeating until the target resolution/LoQ is reached (a conceptual sketch follows the list below). Crucially, every component, whether it’s a color plane, an echelon, or a specific image tile, can be accessed and decoded independently and in parallel. This structure allows developers to:
- Transfer only the bytes that matter, reducing I/O, bandwidth, memory use, and memory accesses while maximizing throughput.
- Decode only what’s needed, at any LoQ, producing tensors closer to the model’s required input size without a full decode and resize.
- Access specific regions of interest (RoI) within each LoQ instead of processing the entire frame, saving significant computation.
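To make the reconstruction loop concrete, here is a minimal sketch in Python. It is illustrative only: nearest-neighbor doubling stands in for the standard’s actual upsampler, and the upsample_2x and reconstruct helpers (and the residual-layer layout) are hypothetical names for exposition.

import numpy as np

def upsample_2x(plane):
    # Nearest-neighbor doubling; a stand-in for VC-6's actual upsampler
    return plane.repeat(2, axis=0).repeat(2, axis=1)

def reconstruct(root_loq, residual_layers, target_loq):
    # Climb the hierarchy one echelon at a time: upsample the current
    # surface, then add that echelon's residuals
    surface = root_loq.astype(np.int32)
    for residuals in residual_layers[:target_loq]:
        surface = upsample_2x(surface) + residuals
    return surface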
The following table summarizes the architectural benefits of VC-6:
| Feature | SMPTE VC-6 (ST 2117) |
| --- | --- |
| Core architecture | Hierarchical, S-tree predictive, parallel |
| Selective data recall | Native support. The bitstream structure allows fetching only the bytes required for a partial request. |
| Selective resolution (LoQ) decode | Native support. Intrinsic to the hierarchical LoQ structure; produces a surface near the target size without a full decode and resize. |
| RoI decode | Native support. Intrinsic to the navigable S-tree structure; pull just the tiles that matter for the model stage. |
| Parallel decode capability | Massively parallel. Plane/LoQ/tiled residual independence enables fine-grained GPU parallelism. |
| Max bit depth | Up to 31 bits per component |
| Multi-plane support | Native, up to 255 planes (e.g., RGB, alpha, depth) |
Table 1. VC-6 capabilities for AI
As the table highlights, VC-6’s native support for selective resolution, RoI decoding, and especially selective data recall makes it well suited for AI pipelines where efficiency and targeted data access are paramount.
I/O reduction with partial data recall
Beyond decode speed, VC-6’s ability to selectively recall data dramatically reduces I/O. Traditional codecs typically require reading the entire file, even for lower-resolution outputs. With VC-6, you only fetch the bytes needed for the target LoQ, RoI, or plane (e.g., a specific color plane). For CPU decoding, the benefit is fewer bytes moving from the network or storage to RAM. For GPU decoding, this also holds, with the added benefit of reduced PCIe bandwidth and VRAM usage.
As illustrated in Figure 3, on the first 100 images of the DIV2K dataset (where one dimension is equal to 2,040 pixels and the other varies), we observed:
- LoQ1 (medium resolution, 1,020 pixels) transfers ~63% of the total file bytes.
- LoQ2 (low-resolution, 510 pixels) transfers ~27%.
Figure 3. Average file size required to decode different LoQs, 3 bpp example
That translates to I/O savings of ~37% and ~72%, respectively, versus full resolution, proportionally reducing network, storage, PCIe, and memory traffic. You can subsequently fetch the remaining layers (or only the tiles for a small RoI) without reprocessing the entire file. For data loaders, this is an immediate way to lift throughput or increase batch sizes without changing model code.
Mapping VC-6 to GPU: a natural fit for parallelism
The architecture of VC-6 aligns well with the GPU’s single instruction, multiple thread (SIMT) execution model. Its design intentionally minimizes inter-dependencies to facilitate massive parallelism.
- Component independence: Image data is partitioned into tiles, planes, and echelons that can be processed independently. VC-6 encodes the information needed for independent tile decoding in a way that enables parallel processing of hundreds of thousands of tiles with minimal impact on compression efficiency.
- Simple, local operations: Unlike codecs using block-based DCT or wavelets, VC-6’s core pixel transforms operate on small, independent 2×2 pixel neighborhoods, which simplifies GPU kernel design.
- Memory efficiency: The entropy coding is designed to be inherently massively parallel, and lookups have a very low memory footprint, with tables small enough to fit into shared memory or even registers, making the process highly suitable for SIMT execution.
Although the term hierarchy may suggest serial processing, VC-6 minimizes inter-dependencies by incorporating two hierarchies that operate in largely orthogonal dimensions, offering a unique structure for concurrent processing.
This architectural parallelism, originally used for low-latency, random-access video editing workflows in the CPU and OpenCL versions, is also a perfect match for the high-throughput demands of AI.
AI training pipelines are designed to maximize throughput, where a framework like the PyTorch DataLoader spawns parallel processes to hide latency. Much like advanced editing systems, these AI workflows require fast, on-demand access to different resolutions and regions of an image. The opportunity to apply these features to accelerate AI workloads was the primary driver for creating a dedicated CUDA implementation. A native CUDA library enables targeted optimizations that maximize throughput, fully leveraging VC-6’s architectural strengths for the AI ecosystem.
VC-6 Python library with CUDA acceleration
V-Nova and NVIDIA collaborated to optimize VC-6 for the CUDA platform, recognizing that it’s the de facto standard in the AI ecosystem. Porting VC-6 from OpenCL ensures seamless integration with tools like PyTorch and the broader AI pipeline without additional CPU copies or synchronization points.
Moving VC-6 to CUDA provides several key advantages:
- Minimizes overhead: It avoids the expensive context-switching overhead between AI workloads and the OpenCL implementation.
- Enhances interoperability: It provides direct integration with the CUDA Tensor ecosystem. CUDA streams enable memory exchange without the need for CPU synchronization.
- Unlocks advanced profiling: It enables the use of powerful tools like NVIDIA Nsight Systems and NVIDIA Nsight Compute to identify and address performance bottlenecks.
- GPU hardware intrinsics: CUDA makes it possible to use all available hardware intrinsics on NVIDIA GPUs.
The current VC-6 CUDA path is in alpha, with native batching and further CUDA-enabled optimizations, motivated by new AI requirements, on the roadmap. Even at this stage, the performance gains over the OpenCL and CPU implementations are already significant, providing a strong foundation for further development.
Installation and usage
The VC-6 Python package is distributed as a pre-compiled Python wheel, enabling straightforward installation through pip. Once installed, you can create VC-6 codec objects and start encoding, decoding, and transcoding. An example of how to encode and decode a VC-6 bitstream follows (visit our GitHub repo for more complete samples):
from vnova.vc6_cuda12 import codec as vc6codec # for CUDA
# from vnova.vc6_opencl import codec as vc6codec # for OpenCL
# from vnova.vc6_metal import codec as vc6codec # for Metal
# Set up encoder and decoder instances (CPU backend, CPU memory)
encoder = vc6codec.EncoderSync(1920, 1080, vc6codec.CodecBackendType.CPU,
                               vc6codec.PictureFormat.RGB_8,
                               vc6codec.ImageMemoryType.CPU)
encoder.set_generic_preset(vc6codec.EncoderGenericPreset.LOSSLESS)
decoder = vc6codec.DecoderSync(1920, 1080, vc6codec.CodecBackendType.CPU,
                               vc6codec.PictureFormat.RGB_8,
                               vc6codec.ImageMemoryType.CPU)

# Encode a raw RGB file, then decode the bitstream back to a raw file
encoded_image = encoder.read("example_1920x1080_rgb8.rgb")
decoder.write(encoded_image.memoryview, "recon_example_1920x1080_rgb8.rgb")
GPU memory output
In the case of the CUDA package (vc6_cuda12), the decoder output can expose a CUDA array interface. To enable this feature, create the decoder with ImageMemoryType.CUDA_DEVICE as the output memory type, as in the example below. The output images will then have __cuda_array_interface__ and can be used with other libraries like CuPy, PyTorch, and nvImageCodec.
from vnova.vc6_cuda12 import codec as vc6codec  # for CUDA only
import cupy

# Set up a GPU decoder instance with CUDA device output
decoder = vc6codec.DecoderSync(1920, 1080, vc6codec.CodecBackendType.CPU,
                               vc6codec.PictureFormat.RGB_8,
                               vc6codec.ImageMemoryType.CUDA_DEVICE)

# Decode from file; the result stays in GPU memory
decoded_image = decoder.read("example_1920x1080_rgb8.vc6")

# Wrap the decoded image in a CuPy array (zero-copy), then download to
# the CPU and write the raw pixels to a file
cuarray = cupy.asarray(decoded_image)
with open("reconstruction_example_1920x1080_rgb8.rgb", "wb") as decoded_file:
    decoded_file.write(cuarray.get().tobytes())
For sync and async decoders, accessing __cuda_array_interface__ is blocking and implicitly waits for the result to be ready in the image. The __cuda_array_interface__ always contains one-dimensional data of unsigned 8-bit type, like the CPU version; adjusting dimensions (or the type, in the case of 10-bit formats) is up to the user.
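For example, here is a minimal sketch of reshaping the flat buffer from the 1920×1080 RGB_8 decode above into a height × width × channels view and handing it to PyTorch; the zero-copy hand-off assumes PyTorch consumes the CUDA array interface, which torch.as_tensor does for GPU-resident arrays.

import torch

# Reinterpret the flat uint8 buffer as height x width x channels
frame = cuarray.reshape(1080, 1920, 3)

# Hand off to PyTorch without a copy via the CUDA array interface
tensor = torch.as_tensor(frame, device="cuda")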
Partial decode and I/O operations
To perform a partial decoding, decoder functions accept an optional parameter that describes the region of interest. In the following example, the decoder will only read and process the data required for decoding the quarter-resolution image.
# Read and decode quarter resolution (echelon 1).
# FrameRegion can also be used to describe a target rectangle.
decoded_image = decoder.read("example_1920x1080_rgb8.vc6",
                             vc6codec.FrameRegion(echelon=1))
Partial data recall is also possible with the other decode functions that operate on memory rather than file paths. For that purpose, a utility function is exposed in the VC-6 library to separately peek at the file header and report the required size for the target LoQ. The exact usage is shown in the GitHub samples.
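To illustrate how partial decode slots into a training input pipeline, here is a hypothetical sketch of a PyTorch map-style dataset that decodes every sample at quarter resolution (echelon 1). Decoder construction and FrameRegion follow the samples above; the memoryview attribute on decoded images, the per-worker lazy decoder, and the file list are assumptions for exposition, not documented API behavior.

import torch
from torch.utils.data import Dataset, DataLoader
from vnova.vc6_cuda12 import codec as vc6codec

class VC6QuarterResDataset(Dataset):
    def __init__(self, paths):
        self.paths = paths
        self.decoder = None  # created lazily, once per DataLoader worker

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        if self.decoder is None:
            self.decoder = vc6codec.DecoderSync(
                1920, 1080, vc6codec.CodecBackendType.CPU,
                vc6codec.PictureFormat.RGB_8, vc6codec.ImageMemoryType.CPU)
        # Fetch and decode only the bytes needed for echelon 1
        image = self.decoder.read(self.paths[idx],
                                  vc6codec.FrameRegion(echelon=1))
        flat = torch.frombuffer(bytearray(image.memoryview), dtype=torch.uint8)
        return flat.reshape(540, 960, 3)  # quarter resolution of 1920x1080

loader = DataLoader(VC6QuarterResDataset(["example_1920x1080_rgb8.vc6"]),
                    batch_size=1, num_workers=2)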
Performance benchmarks: CPU compared to OpenCL and CUDA
We evaluated VC-6 on an NVIDIA RTX PRO 6000 Blackwell Server Edition using the DIV2K dataset (800 images), measuring per-image decode time across the CPU, OpenCL (GPU), and CUDA implementations for different LoQs. For batched tests, we used a “pseudo-batch” approach, which simulates native batching by running multiple asynchronous single-image decoders in parallel to maximize throughput. The same harness can be used to reproduce the results, which are illustrated in Figure 4.
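As a simplified stand-in for that harness, the pseudo-batch idea can be sketched with synchronous decoders on a thread pool; the real harness uses asynchronous decoders and reuses decoder instances, and the file names below are placeholders.

from concurrent.futures import ThreadPoolExecutor
from vnova.vc6_cuda12 import codec as vc6codec

def decode_one(path):
    # One decoder per task keeps the sketch simple; the benchmark instead
    # reuses a pool of asynchronous decoders
    decoder = vc6codec.DecoderSync(1920, 1080, vc6codec.CodecBackendType.CPU,
                                   vc6codec.PictureFormat.RGB_8,
                                   vc6codec.ImageMemoryType.CUDA_DEVICE)
    return decoder.read(path)

paths = [f"img_{i:03d}.vc6" for i in range(32)]  # placeholder file names
with ThreadPoolExecutor(max_workers=8) as pool:
    decoded = list(pool.map(decode_one, paths))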
Figure 4. VC-6 decoding performance across implementations
The move to CUDA shows a clear performance uplift.
- For single-image decoding, CUDA is up to 13x faster than the CPU (1.24 ms vs. 15.95 ms).
- When compared to the existing GPU implementation, the CUDA version is 1.2x to 1.6x faster than OpenCL. In the future, CUDA will also unlock access to dedicated hardware intrinsics.
- Efficiency improves with batching on all platforms, and we expect further uplift from a native batch decoder.
Profiling with Nsight and the road ahead
Nsight Systems shows the decode work split between the CPU (bitstream parse, root nodes) and the GPU (tile residual decode, reconstruction). The latency-optimized single-image path under-utilizes the GPU; throughput mode is where CUDA shines. Three hotspots guided our plan:
- Kernel‑launch overhead in upsampling chains: At low LoQs, small kernels interleave with launch overhead. CUDA Graphs prototypes significantly reduce inter‑kernel gaps. We’re also exploring kernel fusion across early LoQs whose intermediates are never consumed.
- Kernel efficiency: Nsight Compute flagged branch divergence, register spills, and non‑coalesced IO in some stages. Cleaning these up should raise occupancy and throughput.
- Kernel-level parallelism: Currently, each decode launches its own chain of kernels, which scales poorly compared to enlarging the launch grid: the GPU imposes a hard limit on concurrently executing kernels, so a per-image series of kernels cannot scale indefinitely.
Upsampling chains
Figure 5. Upsampling kernel chains
A chain of upsampling kernels (Figure 5) reconstructs the image at successively higher LoQs. At lower LoQs, the proportion of overhead (white) relative to useful computation (blue) is significant. Techniques such as CUDA Graphs or kernel fusion can speed up computation by reducing the overhead between these kernels.
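As a hedged illustration of the CUDA Graphs idea, using a toy upsampling chain rather than VC-6’s actual kernels, CuPy’s experimental stream-capture API (CuPy >= 10) can record the chain once and then replay it with a single CPU-side launch:

import cupy as cp

stream = cp.cuda.Stream(non_blocking=True)
# Pre-allocated planes for four LoQs, each double the previous size
planes = [cp.zeros((270 * 2**i, 480 * 2**i), dtype=cp.float32)
          for i in range(4)]

def upsample_chain():
    # Stand-in for the chain in Figure 5: each step fills the next plane
    for lo, hi in zip(planes, planes[1:]):
        hi[...] = cp.repeat(cp.repeat(lo, 2, axis=0), 2, axis=1)

with stream:
    upsample_chain()            # warm-up so capture sees no new allocations
    stream.begin_capture()
    upsample_chain()
    graph = stream.end_capture()

graph.launch(stream)            # replays the whole chain with one launch
stream.synchronize()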
The Nsight trace also shows low GPU utilization due to the small grid dimensions. In particular, the first upsample kernel launches only a single block, which occupies just one streaming multiprocessor (SM). With 188 SMs on an RTX PRO 6000, decoding a single image essentially uses only 1/188th of the GPU.
In theory, this would let us use the other 187/188ths of the GPU to decode additional images in parallel. In practice, this is called “kernel-level parallelism,” and it isn’t the best way to utilize an NVIDIA GPU.
Kernel-level parallelism
Figure 6. Kernel-level parallelism
The Nsight Systems trace in Figure 6 shows three concurrent decoders launched as Python threads on the CPU (bottom) and their corresponding GPU activity (top). Each decode (blue) runs on its own stream. While this enables the GPU scheduler to launch these kernels simultaneously, it’s more efficient to launch a single, larger grid on the GPU. Kernel-level parallelism can lead to scheduling conflicts and resource contention, and there is a hard limit on concurrently executing kernels that depends on the compute capability. As another upside, a single larger grid also significantly reduces the CPU overhead seen across the three concurrent threads.
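To illustrate the distinction with a toy element-wise kernel (not VC-6’s actual decode kernels): the first pattern below launches one small kernel per image on separate streams, while the second covers the whole batch with a single, larger launch.

import cupy as cp

# Toy stand-in for a decode stage: add residuals to an upsampled base
add_residuals = cp.ElementwiseKernel(
    'T base, T resid', 'T out', 'out = base + resid', 'add_residuals')

images = [cp.zeros((1080, 1920, 3), dtype=cp.uint8) for _ in range(3)]
resids = [cp.ones((1080, 1920, 3), dtype=cp.uint8) for _ in range(3)]

# Kernel-level parallelism: one launch per image, each on its own stream
streams = [cp.cuda.Stream(non_blocking=True) for _ in range(3)]
for img, res, s in zip(images, resids, streams):
    with s:
        add_residuals(img, res)

# Preferred: a single larger grid that covers the whole batch at once
batched = add_residuals(cp.stack(images), cp.stack(resids))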
Conclusions
AI pipelines don’t just need faster models; they need data delivered at the rate the accelerators can consume it. By aligning VC-6’s hierarchical, selective architecture with CUDA’s powerful parallelism, we can significantly accelerate the path from storage to tensor. This approach complements established libraries by providing an AI-native solution for workloads where selective LoQ/RoI decoding and GPU-resident data offer immediate advantages.
The CUDA implementation is a practical building block you can use today to make your data pipelines faster and more efficient. While the current alpha version already delivers benefits, ongoing collaboration with NVIDIA engineers on native batching and kernel optimizations promises to unlock even greater throughput. As a next step, this initial CUDA implementation will enable tighter integration with popular AI SDKs and data loading pipelines. If you’re building high-throughput, multimodal AI systems, now is the time to explore how VC-6 on CUDA can accelerate your workflows.
Get started
The VC-6 SDKs for CUDA (alpha), OpenCL, and CPU are available with C++ and Python APIs.
- SDK and docs: Access the SDK Portal and documentation via V-Nova.
- Trial access: Contact V-Nova at [email protected] for the CUDA alpha wheel and benchmark scripts.
- Samples on GitHub
Andreas Kieslinger
Senior Development Technology Engineer for Generative AI and LLMs, NVIDIA
Maximilian Müller
Developer Technology Engineer for Professional Visualization, NVIDIA
Ricardo Monteiro
Senior Video Coding Developer Technology Engineer, NVIDIA
Guendalina Cobianchi
Senior Vice President of Strategic Analytics & Business Insights, V-Nova
Adam Kelly
Product Manager, V-Nova
Vinod Balakrishnan
Senior Software Engineer, V-Nova
Nima Shirvanian
Principal Software Engineer and Tech Lead of the VC-6 Codec, V-Nova