Subho's research at your service 🫡

From 10 to 1000 Tokens/Second: Cursor AI's Secret Weapon Revealed

In the rapidly evolving landscape of AI-assisted coding, Cursor AI has emerged as a groundbreaking tool that's transforming how developers write code.

One of its most impressive features is "Speculative Edits" - a novel approach that dramatically improves code generation speed while maintaining high accuracy.

What are Speculative Edits?

It's a technique that anticipates low-entropy actions (highly predictable code changes) and minimizes user input by reducing repetitive tasks. Whoa, stop, that's way too much to process. Let's take it step by step!

Speculative Edits is an innovative variation of speculative decoding that leverages the initial prompt as a guide for faster code generation.

[Figure: speculative decoding]

Unlike traditional approaches, where tokens are generated strictly one at a time, Speculative Edits can validate and accept multiple tokens in a single forward pass, significantly reducing latency.

The Problem with Traditional Generation

In traditional language models, token generation follows a strict sequential process:

  1. Generate token 1
  2. Use token 1 to generate token 2
  3. Use tokens 1 and 2 to generate token 3, and so on...

This sequential nature creates inherent latency in the generation process. When working with code, where precision and speed are crucial, this can become a significant bottleneck.
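
In code, that loop looks roughly like this (a schematic sketch only, not the implementation later in this post; `model`, `ids`, and `max_new_tokens` are placeholders for any Hugging Face causal LM, its tokenized context, and a generation budget):

# Schematic: one full forward pass per generated token
for _ in range(max_new_tokens):
    logits = model(ids).logits          # forward pass over the whole context
    next_id = logits[0, -1].argmax()    # greedy pick of the single next token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)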

How Speculative Edits Works

Speculative Edits takes a different approach:

  1. Uses the initial prompt as a "speculation" of what might come next
  2. Validates these speculations in parallel
  3. Generates new tokens only when necessary
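
Here's a minimal sketch of step 2, assuming `ctx` and `draft` are [1, n] token tensors and `model` is an already-loaded causal LM. The key point is that the whole draft chunk is scored by one forward pass:

# Score the context plus the ENTIRE draft in one forward pass
logits = model(torch.cat([ctx, draft], dim=-1)).logits
# logits at position i predict token i + 1, so these rows predict the draft slots
preds = logits[0, -draft.shape[1] - 1:-1].argmax(-1)
# Length of the prefix where the model agrees with the draft
accepted = int((preds == draft[0]).long().cumprod(0).sum())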

Cursor seems to have trained a 70B-or-larger model on a speculation objective, which is what powers their fast/instant-apply feature.

Let's look at a practical implementation comparing vanilla editing vs speculative editing:

Let's Implement Speculative Edits from Scratch
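
Both snippets below call a load_model_and_tokenizer() helper that isn't shown. Here's a minimal version, assuming the HuggingFaceTB/SmolLM-135M checkpoint used for the benchmark later in this post:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_model_and_tokenizer():
    # Assumption: the small checkpoint benchmarked on the M1 Air below
    model_name = "HuggingFaceTB/SmolLM-135M"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    return model, tokenizer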

Here I implemented the vanilla edit, or to keep it simple, a plain generate loop with HF Transformers:

def vanilla_edit(prompt: str, max_tokens: int) -> str:
    model, tokenizer = load_model_and_tokenizer()

    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    attention_mask = torch.ones_like(input_ids).to(model.device)

    with torch.no_grad():
        # Standard autoregressive generation: one decoding step per new token
        output = model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_length=input_ids.shape[1] + max_tokens,
            num_return_sequences=1,
            do_sample=False,  # greedy decoding, to match the speculative version
            pad_token_id=tokenizer.eos_token_id
        )

    return tokenizer.decode(output[0], skip_special_tokens=True)

Now for speculative edits. There are three important pieces to keep in mind: use the prompt's own tokens as the draft, validate a whole chunk of draft tokens in a single forward pass, and fall back to the model's own prediction as soon as the draft diverges:

def speculative_edit(prompt: str, max_tokens: int, chunk_size: int = 8) -> str:
    model, tokenizer = load_model_and_tokenizer()
    model.eval()

    # Initial tokenization
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    generated_ids = input_ids.clone()

    # Speculation: an edit mostly repeats the original code, so use the
    # prompt's own tokens as the draft of what comes next
    draft = input_ids[0].tolist()
    draft_pos = 0
    total_tokens_generated = 0

    with torch.no_grad():
        while total_tokens_generated < max_tokens:
            # Append a chunk of draft tokens to the current context
            spec = draft[draft_pos:draft_pos + chunk_size]
            spec_tensor = torch.tensor([spec], dtype=torch.long, device=model.device)
            candidate = torch.cat([generated_ids, spec_tensor], dim=-1)

            # ONE forward pass scores the whole chunk (context capped at 1024)
            logits = model(candidate[:, -1024:]).logits
            # preds[i] is the model's greedy prediction for draft slot i;
            # the final entry predicts the token right after the chunk
            preds = torch.argmax(logits[0, -(len(spec) + 1):, :], dim=-1)

            # Speculation validation: accept draft tokens while the model agrees
            accepted = 0
            while accepted < len(spec) and preds[accepted].item() == spec[accepted]:
                accepted += 1

            # Keep the accepted draft tokens plus the model's own next token
            next_token = preds[accepted].view(1, 1)
            generated_ids = torch.cat(
                [generated_ids, spec_tensor[:, :accepted], next_token], dim=-1)
            total_tokens_generated += accepted + 1
            draft_pos += accepted + 1

            # Stop at end-of-sequence or once the draft is exhausted
            if next_token.item() == tokenizer.eos_token_id or draft_pos >= len(draft):
                break

    return tokenizer.decode(generated_ids[0], skip_special_tokens=True)
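
Two shortcuts in this sketch are worth calling out: the [:, -1024:] slice crudely caps the context window instead of using a KV cache, and after a divergence the draft is assumed to stay aligned with the output, something a production diff-apply system would handle more carefully.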

How did it perform?

I tested both vanilla edit and speculative edit on an M1 MacBook Air using the model HuggingFaceTB/SmolLM-135M, and the results were pretty good!

The prompt provided:

PROMPT_SPECULATION = """
Add type hints to this function
def calculate_average(numbers):
    total = 0
    for num in numbers:
        total += num
    return total / len(numbers)
"""

Speculative edits completed generation in 5.33 seconds while vanilla generation took 7.22 seconds, a roughly 26% improvement in generation time!

Think of the Real-World Impact!

Cursor AI's implementation of Speculative Edits has shown impressive results: as this post's title hints, apply speeds reportedly jumped from around 10 to around 1000 tokens per second.

You Can Implement Speculative Edits Too

BUT here are some tips that make it much easier: keep the validation to a single batched forward pass, use greedy decoding so draft checks are deterministic, and cap the context you feed the model to bound per-step latency.

By cleverly leveraging the initial prompt as a draft for parallel token validation, Speculative Edits achieves impressive speed improvements while maintaining accuracy.

As demonstrated in our implementation, this approach can be practically applied to create faster, more responsive code generation systems.

The success of Cursor AI's implementation shows that Speculative Edits isn't just a theoretical improvement; it's a practical solution that's already making a difference in real-world development environments. This blog was inspired by their interview exam.

Follow me here so you don't miss out on future blogs. See ya :)