JavaOne 2026

Running GPU-Accelerated AI Inference from Java at Uber Scale

Summary

Uber's Michelangelo platform powers thousands of latency-critical machine learning models, from real-time ETAs and marketplace optimization to large-scale deep learning and LLM inference. As we enabled GPU-backed online inference, the serving stack experienced a 10–100× increase in traffic, with per-service QPS growing from the hundreds into the 10K+ range. While GPUs dramatically improved model throughput and cost efficiency, this sudden traffic amplification exposed new bottlenecks across Java-based inference services.

In this talk, we share how Uber scaled GPU-accelerated inference while maintaining strict tail-latency SLOs in a JVM-centric production environment. We describe how Java services integrate with NVIDIA Triton Inference Server to serve models ranging from small neural networks to multi-hundred-gigabyte LLMs, and how GPU efficiency techniques—such as Multi-Instance GPU (MIG), model disaggregation, and dynamic batching—shifted pressure onto JVM memory management, garbage collection, threading, and metrics pipelines.
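As a taste of the integration pattern discussed in the session, the sketch below builds a KServe-v2 `/infer` request such as a Java service might send to Triton's REST endpoint. The model name, input tensor name, and endpoint host are illustrative assumptions, not Uber's actual configuration; the request is constructed but not sent, so the sketch runs without a live server.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

/** Minimal sketch of calling Triton's KServe-v2 REST API from Java.
 *  Endpoint, model name, and tensor name are hypothetical examples. */
public class TritonClientSketch {

    /** Build a KServe-v2 /infer JSON body for a single FP32 input tensor. */
    static String buildInferRequest(String inputName, float[] data) {
        String values = IntStream.range(0, data.length)
            .mapToObj(i -> Float.toString(data[i]))
            .collect(Collectors.joining(","));
        return String.format(
            "{\"inputs\":[{\"name\":\"%s\",\"shape\":[1,%d],"
                + "\"datatype\":\"FP32\",\"data\":[%s]}]}",
            inputName, data.length, values);
    }

    public static void main(String[] args) {
        String body = buildInferRequest("input__0", new float[] {0.5f, 1.5f});
        // The request a client would POST (host and model name are made up):
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://triton.example:8000/v2/models/eta_model/infer"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
        System.out.println(request.uri() + " " + body);
        // HttpClient.newHttpClient().send(request, ...) would execute it;
        // omitted here so the sketch does not require a running Triton server.
    }
}
```

In production such calls typically go over Triton's gRPC interface with generated stubs and connection pooling; the REST form above is just the simplest way to show the wire-level shape of an inference request.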

We dive deep into the JVM tuning techniques required to handle this traffic surge, including heap sizing strategies, G1GC pause-time tuning, executor thread configuration, off-heap memory considerations, and CPU/GPU resource isolation in containerized deployments. We also share production lessons learned from operating thousands of Java inference services, including the impact of Java upgrades, service warm-up strategies, and metrics optimizations under extreme QPS.
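One of the executor patterns implied above can be sketched as a fixed-size, bounded-queue thread pool: sizing the pool to the CPUs visible inside the container and rejecting excess work rather than queueing it keeps tail latency predictable. The sizes and policy below are illustrative assumptions, not the session's actual numbers.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Sketch of a bounded executor for a latency-critical inference service.
 *  Pool and queue sizes here are illustrative; real values come from load tests. */
public class InferenceExecutorSketch {

    static ThreadPoolExecutor newInferenceExecutor() {
        // availableProcessors() respects cgroup CPU limits on modern JVMs,
        // which matters for CPU isolation in containerized deployments.
        int cores = Runtime.getRuntime().availableProcessors();
        return new ThreadPoolExecutor(
            cores, cores,                        // fixed size: no thread churn under bursts
            0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(cores * 4), // bounded queue caps queueing delay
            new ThreadPoolExecutor.AbortPolicy() // shed load instead of growing latency
        );
    }

    public static void main(String[] args) throws Exception {
        ThreadPoolExecutor pool = newInferenceExecutor();
        Future<Integer> result = pool.submit(() -> 21 * 2); // stand-in for an inference call
        System.out.println("result=" + result.get());
        pool.shutdown();
    }
}
```

On the GC side, a comparable starting point (again an assumption, not a prescription from the talk) is pinning the heap with equal `-Xms`/`-Xmx` and setting a pause target such as `-XX:+UseG1GC -XX:MaxGCPauseMillis=50`, then validating against the service's tail-latency SLO.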

Attendees will leave with practical guidance for running low-latency, GPU-accelerated inference from Java, and concrete patterns for scaling JVM-based services when hardware acceleration fundamentally changes traffic dynamics.

Profile

Type: Learning Session (50 min)

Track: Machine Learning and Artificial Intelligence

Audience Level: Expert

Speaker: Baojun Liu

Session: Wednesday, March 18th at 9:30 AM in Room 203