In our previous blog post, we introduced ApexGuru as an AI-powered coach that helps developers write efficient, scalable Apex code with greater confidence. In this post, we take you behind the scenes to explore the specialized AI foundation that powers ApexGuru. You’ll learn how it leverages custom-trained models, real-world org telemetry, and intelligent filtering to deliver precise, contextual, and actionable insights — purpose-built for Apex performance optimization at scale.

Introduction

In today’s performance-critical environments, writing functional Apex code is just the starting point. Whether you’re building responsive Agentforce actions or optimizing real-time application logic, ensuring scalability and efficiency is essential. CPU limits, SOQL inefficiencies, and unoptimized DML patterns can introduce latency, degrade agent experience, and impact end-user responsiveness.

At Salesforce, we’ve long recognized this challenge. But solving it at scale — with thousands of Apex classes across orgs, each with unique patterns, runtime behaviors, and technical debt — requires a fundamentally different approach.

That’s why we built ApexGuru, an AI-powered, runtime-aware performance engine embedded in Scale Center. ApexGuru doesn’t just detect anti-patterns. It surfaces the ones that matter most, based on actual org behavior. And it doesn’t just offer code suggestions. It delivers fixes that are precise, contextual, and verified against Apex’s real-world semantics.

This post unpacks the AI research and engineering that powers ApexGuru — from model training and prompt design to runtime prioritization and precision benchmarking.

Why we built ApexGuru: A product perspective

From the beginning, ApexGuru was designed with a core principle in mind: maximize developer impact per hour spent. Let’s take a look at some of the ways that ApexGuru helps developers do just that.

Prioritizing what matters

Refactoring Apex takes real time — across development, review, and regression testing. So ApexGuru doesn’t just detect anti-patterns. It ranks them by runtime footprint, leveraging live org telemetry (e.g., Apex CPU and DB time) to guide developers toward the most meaningful optimizations.

Rather than surfacing 500 unbulkified SOQLs, ApexGuru flags the three that contribute the most to CPU usage, giving teams clear direction and measurable payoff.

Screenshot showing the runtime-aware recommendation prioritization UI
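For context, an “unbulkified SOQL” is a query issued inside a loop, once per record, instead of a single query over the whole collection. The sketch below is a generic illustration of that pattern and its bulkified counterpart; the class, method, and Region__c field names are hypothetical and not taken from ApexGuru output.

```apex
public with sharing class RegionAssignmentExample {
    // Unbulkified: one SOQL query per Contact. Under bulk load (e.g., 200
    // records in a trigger), this burns CPU time and quickly approaches the
    // "too many SOQL queries" governor limit.
    public static void setRegions(List<Contact> contacts) {
        for (Contact c : contacts) {
            Account acct = [SELECT Region__c FROM Account WHERE Id = :c.AccountId];
            c.Region__c = acct.Region__c;
        }
    }

    // Bulkified: one query for all parent Accounts, resolved through a Map.
    public static void setRegionsBulk(List<Contact> contacts) {
        Set<Id> accountIds = new Set<Id>();
        for (Contact c : contacts) {
            accountIds.add(c.AccountId);
        }
        Map<Id, Account> accountsById = new Map<Id, Account>(
            [SELECT Region__c FROM Account WHERE Id IN :accountIds]
        );
        for (Contact c : contacts) {
            Account acct = accountsById.get(c.AccountId);
            if (acct != null) {
                c.Region__c = acct.Region__c;
            }
        }
    }
}
```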

Prescriptive, contextual fixes

Each recommendation is grounded in context: which entry point it impacts, how often it runs, what class it lives in, and what the fix looks like side-by-side with the current code. This eliminates guesswork and manual reasoning overhead. The system also supports PDF exports for performance planning and sprint planning across teams.

Why general-purpose LLMs don’t cut it

You could take an Apex class and pass it to an LLM like GPT-4 or Claude with a simple prompt like “Find performance issues in this class.” In practice, however, this leads to recommendations that are simply not targeted enough. We also assessed these LLMs on detecting a specific kind of anti-pattern, prompting them with long, detailed instructions (over 1,700 tokens) alongside the Apex methods, and our tests showed that their outputs were still far from reliable.

  • GPT-4o-mini: 62.9% precision, with nearly four out of every 10 flagged issues being false positives
  • GPT-4o: Higher precision (85.3%), but only 52.7% recall, missing almost half of the real issues
  • Latency: Often exceeded 10 seconds per class, making org-scale reviews impractical

To address these gaps, we decided to train smaller models that would specialize in detecting and fixing Apex anti-patterns and also offer speedy inference.

Detecting and fixing redundant SOQL patterns

Let’s start with an exercise: here is an Apex snippet from a dummy Salesforce org. Can you spot the performance issue?
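Here’s a representative sketch of the kind of snippet in question. CustomObject__c and Status__c match the discussion that follows; the class name and the remaining fields (ExpirationDate__c, LastTouched__c) are stand-ins.

```apex
public with sharing class RecordCleanupService {
    public static void deactivateStaleRecords(Id ownerId) {
        Date today = Date.today();
        Date cutoff = today.addDays(-90);

        // Query 1: records past their expiration date
        List<CustomObject__c> expired = [
            SELECT Id, Name, Status__c
            FROM CustomObject__c
            WHERE OwnerId = :ownerId
              AND ExpirationDate__c < :today
        ];
        for (CustomObject__c rec : expired) {
            rec.Status__c = false;
        }
        update expired;

        // Query 2: same object, similar filter, identical mutation
        List<CustomObject__c> inactive = [
            SELECT Id, Name, Status__c
            FROM CustomObject__c
            WHERE OwnerId = :ownerId
              AND LastTouched__c < :cutoff
        ];
        for (CustomObject__c rec : inactive) {
            rec.Status__c = false;
        }
        update inactive;
    }
}
```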

At first glance, this appears to be a clean implementation. But the two SOQL queries are redundant and mergeable — both pull data from CustomObject__c using similar filter logic and identical mutation behavior (Status__c = false).

The semantic redundancy here increases database CPU time and can lead to governor limit issues under high load. But detecting this programmatically requires understanding relationship fields, loop structure, field projection, and bulk DML operations.
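Continuing the hypothetical sketch above, the two queries can be collapsed into a single query with a combined filter and one bulk update:

```apex
public static void deactivateStaleRecords(Id ownerId) {
    Date today = Date.today();
    Date cutoff = today.addDays(-90);

    // One SOQL query with merged filter logic, one bulk DML statement
    List<CustomObject__c> stale = [
        SELECT Id, Name, Status__c
        FROM CustomObject__c
        WHERE OwnerId = :ownerId
          AND (ExpirationDate__c < :today OR LastTouched__c < :cutoff)
    ];
    for (CustomObject__c rec : stale) {
        rec.Status__c = false;
    }
    update stale;
}
```

For this code path, the merged version halves the query and DML statement counts and avoids updating the same record twice when it matches both filters.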

The ApexGuru approach: Specialization and system-level filtering

We tackled the challenge by building a multi-stage detection pipeline:

  1. Heuristic filtering: We use static analyzers to eliminate 70% of false candidates, such as SOQLs separated by DML statements or SOQLs in separate if-else branches that don’t require merging (see the sketch after this list). This pre-filtering ensures that only meaningful candidates reach the next stage, reducing model load and false positives.
  2. Post-trained LLMs (XGen-AG): We trained compact Apex-specific models on curated datasets of redundant vs. non-redundant query patterns. These models are better at handling the caveats in detecting and fixing the SOQL redundancy anti-pattern. Unlike external LLMs, these models are trained on Apex-specific constructs and guided by Salesforce domain knowledge — resulting in better fix quality and contextual accuracy.
  3. Human-in-the-loop evaluation: Test set construction with expert labeling and prompt evaluation ensure production-level correctness before inference is allowed in ApexGuru. This adds an additional layer of quality control, ensuring only validated logic makes its way into the live system.
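To make the first stage concrete, here is a hypothetical example of a candidate the static pre-filter discards: two queries on the same object that sit in mutually exclusive if-else branches, so only one of them ever runs and there is nothing to merge. The class, method, and field names other than CustomObject__c and Status__c are made up.

```apex
public with sharing class CandidateFilterExample {
    // Discarded by the heuristic pre-filter: the branches are mutually
    // exclusive, so the two queries never execute together and merging
    // them would not save any database work.
    public static List<CustomObject__c> fetchRecords(Boolean expiredOnly) {
        Date today = Date.today();
        if (expiredOnly) {
            return [
                SELECT Id, Status__c
                FROM CustomObject__c
                WHERE ExpirationDate__c < :today
            ];
        } else {
            return [
                SELECT Id, Status__c
                FROM CustomObject__c
                WHERE Status__c = true
            ];
        }
    }
}
```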

The chart below visualizes the effectiveness of this pipeline.

  • On the Y-axis, we have accuracy (higher is better). On the X-axis, we measure p90 response time (lower is better).
  • The yellow star marks ApexGuru’s sweet spot: It delivers high accuracy with the lowest latency, outperforming GPT-4o, Llama-4, and Mixtral across the board. Notably, GPT-4o (even with advanced prompting) and Llama-4 Scout take significantly longer and still fall short in precision and recall.

A scatter plot comparing model accuracy vs. p90 response time with XGen-ApexGuru

The table below reinforces this comparison.

  • It shows that ApexGuru leads not just in accuracy (84.42%) but also in precision (89.2%) and recall (79.8%), all while maintaining a p90 response time of just 8.7 seconds
  • By contrast, GPT-4o’s response time is over 21 seconds, with lower recall and F1 scores — making it less suitable for org-scale scanning

A comparison table of LLMs showing detection metrics, with XGen-ApexGuru highlighting accuracy, precision, recall, and lowest response time

Taken together, these benchmarks validate ApexGuru’s technical edge. By combining lightweight, specialized LLMs with rule-based preprocessing and expert feedback, it achieves the right balance of speed, accuracy, and practicality for real-world Salesforce development. As with any machine learning system, occasional false positives may arise. ApexGuru is designed for high precision, but we encourage developers to use their own judgment when reviewing suggestions — especially in edge cases where nuanced business logic may apply.

How ApexGuru trains its specialized models

To ensure that ApexGuru delivers accurate and actionable performance insights, we developed a robust multi-stage training and evaluation pipeline. Here’s how it works:

  1. Seed with synthetic and real code samples: Using Apex code from public repos and curated synthetic examples, we simulate common SOQL anti-patterns with domain-expert guidance.
  2. Generate labeled examples: A rule-based generator creates high-confidence “Yes” and “No” labels (e.g., mergeable vs. non-mergeable SOQLs). A synthetic generator transforms unlabeled code into snippets that either contain or deliberately avoid specific anti-pattern types (using permissively licensed, open-weight LLMs).
  3. Triage via preprocessing: Samples are triaged — high-accuracy samples move straight into training, while ambiguous ones (“Unsure”) are flagged for LLM-based detectors or manual evaluation.
  4. Enrich and refine: Model-based detectors (e.g., trained on prior samples) provide additional labels, and LLMs generate natural language explanations or fix recommendations.
  5. Clean, augment, evaluate: A data cleaning and augmentation pipeline ensures training quality. Post-processing filters and human evaluation further polish the training set.
  6. Model training and testing: Final training datasets are tested with real-world orgs, and experimental outputs feed back into refining generation recipes, prompts, and detection rules.

This setup allows us to rapidly expand to new anti-pattern types while maintaining strict quality controls.

Flowchart showing ApexGuru’s AI training pipeline from code generation to model refinement using domain knowledge and human evaluation.

Measurable impact at scale

Since its general availability in February 2024, ApexGuru has:

  • Served over 8,800 orgs
  • Delivered 80,000+ performance optimization recommendations

Recommendations range from CPU-heavy trigger chains to redundant queries, inefficient joins, unused methods, and more — all ranked by real-world impact.

ApexGuru represents a new class of AI systems: not general, but deeply specialized; not prompt-based, but pipeline-optimized; and not just insightful, but actionable.

By combining domain-specific models, runtime telemetry, and code-aware fix generation, we’re helping Salesforce teams shift from reactive firefighting to proactive performance excellence across apps, flows, and Agentforce automations. Whether you’re cleaning up legacy code or scaling your next high-volume product, ApexGuru is built to make your Apex better — where it counts.

Our goal: Build the most advanced code performance advisor ever made for Salesforce — powered by specialized AI, runtime-aware prioritization, and secure delivery via ApexGuru.

About the authors

Mayuresh Verma is a Product Manager at Salesforce working on the Scalability Products portfolio. Reach out to him on LinkedIn.

Akhilesh Gotmare is a Senior Research Staff member at Salesforce working on ApexGuru. Reach out to him on LinkedIn.

Get the latest Salesforce Developer blog posts and podcast episodes via Slack or RSS.
