Minisymposium Presentation

Speeding Up LLM Inference via Sequential Speculative Decoding

Tuesday, June 17, 2025

16:30 - 17:00 CEST

Climate, Weather and Earth Sciences

Chemistry and Materials

Computer Science and Applied Mathematics

Engineering

Life Sciences

Physics

Presenter

Ravi Tandon

The University of Arizona

Ravi Tandon is the Litton Industries John M. Leonis Distinguished Associate Professor in the Department of ECE at the University of Arizona. He received the B.Tech. degree in Electrical Engineering from IIT Kanpur in 2004 and the Ph.D. degree in ECE from the University of Maryland, College Park in 2010. From 2010 to 2012, he was a post-doctoral research associate at Princeton University. He received an NSF CAREER Award in 2017, the 2018 Keysight Early Career Professor Award, a Best Paper Award at the 2011 IEEE Globecom conference, and the Craig M. Berge Faculty Fellowship at the University of Arizona in 2024.

Description

As Large Language Models (LLMs) grow in size and capability, their high computational cost poses a major challenge for real-time applications, making efficient inference a critical research problem. Speculative Decoding (SD) has emerged as a promising technique to accelerate LLM inference: a smaller draft model generates candidate tokens, which a larger target model then verifies in parallel to ensure statistical consistency. However, the need for frequent verification calls to the target LLM limits the potential speedup of SD. We propose SPRINTER, which uses a low-complexity verifier trained to predict whether tokens generated by the draft model would be accepted by the target LLM. By performing approximate sequential verification, SPRINTER eliminates the need for constant verification by the target LLM, which is invoked only when a token is deemed unacceptable. This significantly reduces the number of calls to the larger model, enabling further acceleration. We present a theoretical analysis of SPRINTER, examining the statistical properties of the generated tokens and the expected reduction in latency as a function of the verifier. Our evaluations on multiple datasets and model pairs demonstrate that approximate verification can maintain high-quality generation while achieving greater speedups than standard SD.
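The control flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function and parameter names (`sprinter_style_decode`, `draft_next`, `verifier_accepts`, `target_next`) are hypothetical, the three "models" are toy stand-ins, and the sketch assumes that on a rejection the target model simply supplies the replacement token, a detail the abstract does not specify.

```python
from typing import Callable, List, Tuple


def sprinter_style_decode(
    draft_next: Callable[[List[int]], int],
    verifier_accepts: Callable[[List[int], int], bool],
    target_next: Callable[[List[int]], int],
    prompt: List[int],
    max_new_tokens: int,
) -> Tuple[List[int], int]:
    """Generate tokens one at a time, calling the target only on verifier rejections.

    All three callables are hypothetical stand-ins for the draft model,
    the low-complexity verifier, and the target LLM.
    """
    tokens = list(prompt)
    target_calls = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        candidate = draft_next(tokens)
        if verifier_accepts(tokens, candidate):
            # Verifier predicts the target would accept: keep the draft token
            # without touching the target model.
            tokens.append(candidate)
        else:
            # Assumption: the target replaces the rejected token.
            tokens.append(target_next(tokens))
            target_calls += 1
    return tokens, target_calls


# Toy, deterministic stand-ins (not real models).
def toy_draft(context: List[int]) -> int:
    return len(context) % 5


def toy_verifier(context: List[int], candidate: int) -> bool:
    return candidate != 0  # reject only token 0


def toy_target(context: List[int]) -> int:
    return 9


tokens, target_calls = sprinter_style_decode(
    toy_draft, toy_verifier, toy_target, prompt=[1], max_new_tokens=10
)
print(tokens)        # [1, 1, 2, 3, 4, 9, 1, 2, 3, 4, 9]
print(target_calls)  # 2
```

In this toy run the target is called for 2 of the 10 generated tokens. In practice the saving would depend on how often the verifier accepts draft tokens and on how accurate its predictions are.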

Authors

Meiyu Zhong

The University of Arizona

Noel Teku

The University of Arizona

Ravi Tandon

The University of Arizona
