From 10 to 1000 Tokens/Second: Cursor AI's Secret Weapon Revealed
In the rapidly evolving landscape of AI-assisted coding, Cursor AI has emerged as a groundbreaking tool that's transforming how developers write code.
One of its most impressive features is "Speculative Edits" - a novel approach that dramatically improves code generation speed while maintaining high accuracy.
What are Speculative Edits?
It's a technique that anticipates low-entropy edits (highly predictable code changes) and avoids regenerating them token by token, cutting down on repetitive work. Whoa, stop, that's a lot to take in at once, so let's understand it step by step!
Speculative Edits is an innovative variation of speculative decoding that leverages the initial prompt as a guide for faster code generation.
Unlike traditional approaches, where tokens are generated strictly one after another, Speculative Edits can validate and accept multiple tokens per forward pass, significantly reducing latency.
The Problem with Traditional Generation
In traditional language models, token generation follows a strict sequential process:
- Generate token 1
- Use token 1 to generate token 2
- Use tokens 1 and 2 to generate token 3, and so on...
This sequential nature creates inherent latency in the generation process. When working with code, where precision and speed are crucial, this can become a significant bottleneck.
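Schematically, with model_step as a dummy stand-in for a real forward pass, the loop looks like this:

def model_step(tokens: list[int]) -> int:
    # Dummy stand-in: a real model runs a forward pass over ALL previous
    # tokens, which is exactly why the steps can't run in parallel
    return (sum(tokens) * 31 + len(tokens)) % 50257

tokens = [101, 202, 303]  # prompt token ids
for _ in range(5):
    tokens.append(model_step(tokens))  # step t+1 must wait for step t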
How Speculative Edits Works
Speculative Edits takes a different approach:
- Uses the initial prompt as a "speculation" of what might come next
- Validates these speculations in parallel
- Generates new tokens only when necessary
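Here's the core idea in a minimal sketch (my own illustration, not Cursor's actual code): run a single forward pass over the context plus the speculated draft, and count how many leading draft tokens the model agrees with. The function name and the greedy-acceptance rule here are assumptions for illustration:

import torch

def validate_speculation(model, context_ids, draft_ids):
    # One forward pass over context + draft yields the model's prediction
    # at every draft position simultaneously
    seq = torch.cat([context_ids, draft_ids], dim=-1)
    with torch.no_grad():
        logits = model(seq).logits
    n_ctx = context_ids.shape[-1]
    # The prediction for draft token j comes from the position just before it
    preds = logits[0, n_ctx - 1:-1, :].argmax(dim=-1)
    agree = (preds == draft_ids[0]).long()
    # Accept the longest prefix of the draft the model agrees with (greedy)
    return int(agree.cumprod(dim=0).sum())

Every accepted token costs only a fraction of a full decoding step, which is where the speedup comes from.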
Cursor appears to have trained a 70B-or-larger model on this speculation objective, which is what powers their fast/instant-apply feature.
Let's look at a practical implementation comparing vanilla editing vs speculative editing:
Let's Implement Speculative Edits from Scratch
First, I implemented a vanilla edit, or put simply, a plain generate loop with a Hugging Face transformer:
import torch

def vanilla_edit(prompt: str, max_tokens: int) -> str:
    model, tokenizer = load_model_and_tokenizer()
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    attention_mask = torch.ones_like(input_ids)  # ones_like inherits the device
    with torch.no_grad():
        # Standard greedy decoding: every token waits on the one before it
        output = model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_length=input_ids.shape[1] + max_tokens,
            num_return_sequences=1,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(output[0], skip_special_tokens=True)
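Both functions call a load_model_and_tokenizer helper that isn't shown above; a minimal version (one reasonable way to write it, using the SmolLM-135M checkpoint from my benchmark below) looks like this:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_model_and_tokenizer(name: str = "HuggingFaceTB/SmolLM-135M"):
    # Pick the best available device: MPS on Apple Silicon, else CUDA, else CPU
    if torch.backends.mps.is_available():
        device = "mps"
    elif torch.cuda.is_available():
        device = "cuda"
    else:
        device = "cpu"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).to(device)
    return model, tokenizer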
Now for speculative edits, there are three important pieces of logic to keep in mind:
- A check that identifies whether the predicted token matches the input prompt
- Skipping generation when the speculation is correct
- Advancing the token counter efficiently
def speculative_edit(prompt: str, max_tokens: int) -> str:
    model, tokenizer = load_model_and_tokenizer()
    model.eval()

    # Initial tokenization: the prompt doubles as our speculation
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    attention_mask = torch.ones_like(input_ids)
    generated_ids = input_ids.clone()
    total_tokens_generated = 0

    with torch.no_grad():
        while total_tokens_generated < max_tokens:
            # Generate with context window management (cap context at 1024 tokens)
            outputs = model(generated_ids[:, -1024:],
                            attention_mask=attention_mask[:, -1024:])
            next_token_logits = outputs.logits[:, -1, :]
            next_token = torch.argmax(next_token_logits, dim=-1).unsqueeze(-1)

            # Speculation validation: if the prediction matches the corresponding
            # prompt token, skip the append (it's already in context) and just
            # advance the counter
            if total_tokens_generated < input_ids.shape[1] - 1:
                if next_token.item() == input_ids[0, total_tokens_generated + 1].item():
                    total_tokens_generated += 1
                    continue

            # Speculation missed: fall back to normal token generation
            generated_ids = torch.cat([generated_ids, next_token], dim=-1)
            attention_mask = torch.cat([attention_mask,
                                        torch.ones_like(next_token)], dim=-1)
            total_tokens_generated += 1

            # Stop once we've processed as many tokens as the prompt held
            if total_tokens_generated >= input_ids.shape[1]:
                break

    return tokenizer.decode(generated_ids[0], skip_special_tokens=True)
How Did It Perform?
I tested both vanilla and speculative edits on an M1 Air using the HuggingFaceTB/SmolLM-135M model, and the results were pretty good!
The prompt provided:
PROMPT_SPECULATION = """
Add type hints to this function
def calculate_average(numbers):
    total = 0
    for num in numbers:
        total += num
    return total / len(numbers)
"""
Speculative edits completed generation in 5.33 seconds while vanilla generation took 7.22 seconds, a 26% improvement in generation time!
Think of the Real-World Impact!
Cursor AI's implementation of Speculative Edits has shown impressive results:
- Achieved speeds of ~1000 tokens/second
- ~13x speedup over vanilla inference using Llama-3-70b
- ~9x speedup over their previous GPT-4 speculative edits deployment
You Can Implement Speculative Edits Too
But here are some tips that can make this much easier:
- Implement proper batching for multiple requests
- Consider model quantization for better performance (see the sketch after this list)
- Implement proper device management (CPU/GPU/MPS)
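For example, the quantization tip can be as simple as loading the model in half precision (a sketch, not something I benchmarked here; 4-bit loading via bitsandbytes is another option if your hardware supports it):

import torch
from transformers import AutoModelForCausalLM

# Half-precision weights roughly halve memory use and speed up matmuls on
# most GPUs; the quality impact for short code edits is usually negligible
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM-135M", torch_dtype=torch.float16
)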
By cleverly leveraging the initial prompt as a guide for parallel token generation, Speculative Edits achieves impressive speed improvements while maintaining accuracy.
As demonstrated in our implementation, this approach can be practically applied to create faster, more responsive code generation systems.
The success of Cursor AI's implementation shows that Speculative Edits isn't just a theoretical improvement - it's a practical solution that's already making a difference in real-world development environments. This blog was inspired by their interview problem.
Follow me here so you don't miss future blogs. See ya :)