From 10 to 1000 Tokens/Second: Cursor AI's Secret Weapon Revealed
In the rapidly evolving landscape of AI-assisted coding, Cursor AI has emerged as a groundbreaking tool that's transforming how developers write code.
One of its most impressive features is "Speculative Edits" - a novel approach that dramatically improves code generation speed while maintaining high accuracy.
What are Speculative Edits?
It's a technique that anticipates low-entropy edits (highly predictable code changes) and avoids regenerating them token by token, cutting down on repetitive work. Whoa, stop, that's a lot to take in at once, so let's understand it step by step!
Speculative Edits is an innovative variation of speculative decoding that leverages the initial prompt as a guide for faster code generation.
Unlike traditional approaches, where tokens are generated strictly one after another, Speculative Edits can validate and accept multiple tokens per forward pass, significantly reducing latency.
The Problem with Traditional Generation
In traditional language models, token generation follows a strict sequential process:
- Generate token 1
- Use token 1 to generate token 2
- Use tokens 1 and 2 to generate token 3, and so on...
This sequential nature creates inherent latency in the generation process. When working with code, where precision and speed are crucial, this can become a significant bottleneck.
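Schematically, with model_step as a dummy stand-in for a real forward pass, the loop looks like this:

def model_step(tokens: list[int]) -> int:
    # Dummy stand-in: a real model runs a forward pass over ALL previous
    # tokens, which is exactly why the steps can't run in parallel
    return (sum(tokens) * 31 + len(tokens)) % 50257

tokens = [101, 202, 303]  # prompt token ids
for _ in range(5):
    tokens.append(model_step(tokens))  # step t+1 must wait for step t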
How Speculative Edits Works
Speculative Edits takes a different approach:
- Uses the initial prompt as a "speculation" of what might come next
- Validates these speculations in parallel
- Generates new tokens only when necessary
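Here's the core idea in a minimal sketch (my own illustration, not Cursor's actual code): run a single forward pass over the context plus the speculated draft, and count how many leading draft tokens the model agrees with. The function name and the greedy-acceptance rule here are assumptions for illustration:

import torch

def validate_speculation(model, context_ids, draft_ids):
    # One forward pass over context + draft yields the model's prediction
    # at every draft position simultaneously
    seq = torch.cat([context_ids, draft_ids], dim=-1)
    with torch.no_grad():
        logits = model(seq).logits
    n_ctx = context_ids.shape[-1]
    # The prediction for draft token j comes from the position just before it
    preds = logits[0, n_ctx - 1:-1, :].argmax(dim=-1)
    agree = (preds == draft_ids[0]).long()
    # Accept the longest prefix of the draft the model agrees with (greedy)
    return int(agree.cumprod(dim=0).sum())

Every accepted token costs only a fraction of a full decoding step, which is where the speedup comes from.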
Cursor appears to have trained a 70B-or-larger model on this speculation objective, which is what powers their fast/instant-apply feature.
Let's look at a practical implementation comparing vanilla editing vs speculative editing:
Let's Implement Speculative Edits from Scratch
First, I implemented a vanilla edit, or put simply, a plain generate loop with a Hugging Face transformer:
import torch

def vanilla_edit(prompt: str, max_tokens: int) -> str:
    model, tokenizer = load_model_and_tokenizer()
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    attention_mask = torch.ones_like(input_ids)  # ones_like inherits the device
    with torch.no_grad():
        # Standard greedy decoding: every token waits on the one before it
        output = model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_length=input_ids.shape[1] + max_tokens,
            num_return_sequences=1,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(output[0], skip_special_tokens=True)
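Both functions call a load_model_and_tokenizer helper that isn't shown above; a minimal version (one reasonable way to write it, using the SmolLM-135M checkpoint from my benchmark below) looks like this:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_model_and_tokenizer(name: str = "HuggingFaceTB/SmolLM-135M"):
    # Pick the best available device: MPS on Apple Silicon, else CUDA, else CPU
    if torch.backends.mps.is_available():
        device = "mps"
    elif torch.cuda.is_available():
        device = "cuda"
    else:
        device = "cpu"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).to(device)
    return model, tokenizer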
Now for speculative edits, there are three important pieces of logic to keep in mind:
- A check that identifies whether the predicted token matches the input prompt
- Skipping generation when the speculation is correct
- Advancing the token counter efficiently
def speculative_edit(prompt: str, max_tokens: int) -> str:
    model, tokenizer = load_model_and_tokenizer()
    model.eval()

    # Initial tokenization: the prompt doubles as our speculation
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    attention_mask = torch.ones_like(input_ids)
    generated_ids = input_ids.clone()
    total_tokens_generated = 0

    with torch.no_grad():
        while total_tokens_generated < max_tokens:
            # Generate with context window management (cap context at 1024 tokens)
            outputs = model(generated_ids[:, -1024:],
                            attention_mask=attention_mask[:, -1024:])
            next_token_logits = outputs.logits[:, -1, :]
            next_token = torch.argmax(next_token_logits, dim=-1).unsqueeze(-1)

            # Speculation validation: if the prediction matches the corresponding
            # prompt token, skip the append (it's already in context) and just
            # advance the counter
            if total_tokens_generated < input_ids.shape[1] - 1:
                if next_token.item() == input_ids[0, total_tokens_generated + 1].item():
                    total_tokens_generated += 1
                    continue

            # Speculation missed: fall back to normal token generation
            generated_ids = torch.cat([generated_ids, next_token], dim=-1)
            attention_mask = torch.cat([attention_mask,
                                        torch.ones_like(next_token)], dim=-1)
            total_tokens_generated += 1

            # Stop once we've processed as many tokens as the prompt held
            if total_tokens_generated >= input_ids.shape[1]:
                break

    return tokenizer.decode(generated_ids[0], skip_special_tokens=True)
How Did It Perform?
I tested both vanilla and speculative edits on an M1 Air using the HuggingFaceTB/SmolLM-135M model, and the results were pretty good!
The prompt provided:
PROMPT_SPECULATION = """
Add type hints to this function
def calculate_average(numbers):
    total = 0
    for num in numbers:
        total += num
    return total / len(numbers)
"""
Speculative edits completed generation in 5.33 seconds while vanilla generation took 7.22 seconds, a 26% improvement in generation time!
Think of the Real-World Impact!
Cursor AI's implementation of Speculative Edits has shown impressive results:
- Achieved speeds of ~1000 tokens/second
- ~13x speedup over vanilla inference using Llama-3-70b
- ~9x speedup over their previous GPT-4 speculative edits deployment
You Can Implement Speculative Edits Too
But here are some tips that can make this much easier:
- Implement proper batching for multiple requests
- Consider model quantization for better performance (see the sketch after this list)
- Implement proper device management (CPU/GPU/MPS)
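For example, the quantization tip can be as simple as loading the model in half precision (a sketch, not something I benchmarked here; 4-bit loading via bitsandbytes is another option if your hardware supports it):

import torch
from transformers import AutoModelForCausalLM

# Half-precision weights roughly halve memory use and speed up matmuls on
# most GPUs; the quality impact for short code edits is usually negligible
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM-135M", torch_dtype=torch.float16
)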
By cleverly leveraging the initial prompt as a guide for parallel token generation, Speculative Edits achieves impressive speed improvements while maintaining accuracy.
As demonstrated in our implementation, this approach can be practically applied to create faster, more responsive code generation systems.
The success of Cursor AI's implementation shows that Speculative Edits isn't just a theoretical improvement - it's a practical solution that's already making a difference in real-world development environments. This blog was inspired by their interview problem.
Follow me here so you don't miss future blogs. See ya :)