At Dreamforce 2023, one of the most popular AI sessions was “Supercharge Development Using CodeGen, Our In-House LLM.” It was presented by Sridhar Raghavan and Erik Nijkamp, both members of Salesforce’s AI Research team. In this session, they talked about CodeGen, Salesforce’s own large language model (LLM) for code understanding and code generation. A variant of CodeGen, specifically fine-tuned for Apex, is what currently powers Einstein for Developers.
A large language model is a trained deep-learning model that can interpret and generate text. It produces text one word at a time, analyzing the words that came before to predict which one should come next. Because source code is also text, and highly structured text at that, code generation is a task well suited to an LLM.
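To make the "predict the next word" idea concrete, here is a deliberately tiny sketch. It is not how CodeGen works internally: it swaps the neural network for a simple bigram frequency table, and the function names and sample tokens are made up for illustration. The loop structure is the point, since each new token is chosen by looking at what has been generated so far.

```python
from collections import Counter, defaultdict


def train_bigram_model(tokens):
    """Count, for each token, which tokens follow it in the training data."""
    following = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        following[current][nxt] += 1
    return following


def generate(model, start, max_new_tokens=5):
    """Greedy generation: keep appending the most frequent follower of the last token."""
    output = [start]
    for _ in range(max_new_tokens):
        candidates = model.get(output[-1])
        if not candidates:
            break  # nothing ever followed this token in the training data
        next_token, _count = candidates.most_common(1)[0]
        output.append(next_token)
    return output


# Toy "training data": a handful of code-like tokens (illustrative only).
corpus = "return x ; return x ; return y ;".split()
model = train_bigram_model(corpus)
print(" ".join(generate(model, "return", max_new_tokens=3)))  # prints: return x ; return
```

A real LLM replaces the frequency table with billions of learned weights and conditions on the whole preceding context rather than only the last token, but the generate-one-token-then-repeat loop is the same.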
Salesforce decided to create and train an open-source in-house LLM for code understanding and generation, and we called it CodeGen. CodeGen has been trained on a wide variety of programming languages, including Apex.
So far, we’ve released three major versions of CodeGen:
- CodeGen 1.0: In early 2022, we released CodeGen 1.0. At that point, CodeGen was by far the largest open-source LLM, with up to 16B parameters (the learned weights that determine how the LLM behaves).
- CodeGen 2.0: The 1.0 models reached the threshold of being useful in developers' daily work, so we decided to invest more in CodeGen. In early 2023, we released CodeGen 2.0 with improved quality, and we started to use it internally as our AI-powered developer tooling. It demonstrated the ability to save ~90 minutes per day, per developer.
- CodeGen 2.5: Based on these results, we decided to optimize CodeGen for production use cases, and we released CodeGen 2.5 in July 2023. The 2.5 model not only delivers high-quality responses but also has low latency, which matters for both user experience and cost to serve, two metrics that are critical for the business case. CodeGen 2.5 was trained on the StarCoderData dataset, which contains 783GB of code in 86 programming languages. Today, the CodeGen 2.5 model sees over 600,000 downloads per month and is widely used.
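To see why model size feeds into cost to serve, a back-of-the-envelope estimate helps. The sketch below assumes each parameter is stored as a single number at a given precision and counts only the memory for the weights themselves. Real deployments need additional memory for activations and caches, so treat these figures as a lower bound, not as measurements of any CodeGen release. The helper name is hypothetical.

```python
# Bytes needed to store one parameter at common numeric precisions.
BYTES_PER_PARAMETER = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_parameters, precision):
    """Approximate memory for the weights alone, in decimal gigabytes (10**9 bytes)."""
    return num_parameters * BYTES_PER_PARAMETER[precision] / 1e9


# The 16B figure comes from the CodeGen 1.0 description above.
for precision in BYTES_PER_PARAMETER:
    print(f"16B parameters at {precision}: ~{weight_memory_gb(16e9, precision):.0f} GB")
```

At 16-bit precision, a 16B-parameter model needs roughly 32 GB for its weights alone, which is one reason smaller, well-optimized models are attractive for production use.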
The LLM behind Einstein for Developers
So what about the LLM that powers Einstein for Developers? This model is based on CodeGen 2.5, but it was fine-tuned for Apex (note: support for LWC will be coming soon). Although based on open-source CodeGen 2.5, the Einstein for Developers model is proprietary.
We then added further capabilities to it, including:
- Continuous learning of Apex: We repeatedly run internal testing and gather feedback from our Apex experts, which we then use to improve the model. Also, we analyze customer feedback coming in through the extension to help improve the model tuning strategy.
- Contextual grounding: At the moment, the extension includes some basic grounding features. It can ground a request with custom object metadata, provided the object is referenced in the prompt (for example, the prompt mentions “CustomObject__c”), and with the contents of the active open file in VS Code or Code Builder. More advanced contextual grounding capabilities with org metadata are on the roadmap.
- Security, enabled by our trust layer: Customer prompts, customer feedback, and Einstein-generated code are protected by Salesforce’s trust boundary, and no data ever leaves Salesforce.
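The contextual grounding described above can be pictured as a prompt-assembly step. The sketch below is purely illustrative and is not the Einstein for Developers implementation: the function name, the hard-coded metadata dictionary, and the comment-style prompt layout are all assumptions. It shows the two behaviors named in the list: metadata is included only for custom objects the prompt actually mentions, and the active file's contents are added as extra context.

```python
import re

# Hypothetical, hard-coded stand-in for custom object metadata from an org.
CUSTOM_OBJECT_FIELDS = {
    "Invoice__c": ["Amount__c", "Due_Date__c", "Status__c"],
    "Shipment__c": ["Carrier__c", "Tracking_Number__c"],
}

# Custom object API names end in "__c".
CUSTOM_OBJECT_PATTERN = re.compile(r"\b[A-Za-z]\w*__c\b")


def build_grounded_prompt(user_prompt, active_file_text, metadata=CUSTOM_OBJECT_FIELDS):
    """Prepend context to the request: metadata for referenced objects, then the open file."""
    sections = []
    # dict.fromkeys removes duplicate mentions while keeping their order.
    for name in dict.fromkeys(CUSTOM_OBJECT_PATTERN.findall(user_prompt)):
        if name in metadata:
            sections.append(f"// Object {name} has fields: {', '.join(metadata[name])}")
    if active_file_text.strip():
        sections.append("// Active file:\n" + active_file_text.strip())
    sections.append("// Request: " + user_prompt)
    return "\n".join(sections)


print(build_grounded_prompt(
    "Write a trigger that flags overdue Invoice__c records",
    "public class InvoiceService {}",
))
```

Because `Shipment__c` is never mentioned in the request, its metadata stays out of the prompt, which keeps the context short and relevant.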
The future of CodeGen and Einstein for Developers
The future of CodeGen is bright! The Salesforce AI Research team is continuously gathering feedback from the developer community and also from our own products, such as Einstein for Developers. These learnings will be incorporated into future versions of CodeGen to benefit the wider developer community.
Regarding Einstein for Developers itself, as part of the roadmap presented at Dreamforce this year, we are working on bringing some highly requested features to the product, such as in-line autocomplete built into the editor, writing test cases, explaining code, support for LWC, and more!
If you want to know more about CodeGen, be sure to watch our Dreamforce session video. Remember that CodeGen is open source, so if you’re interested in exploring further, you can check out the code for its different versions on GitHub.
Moreover, if you want to see Einstein for Developers in action, you can do it right away by installing the VS Code extension (beta), or you can also install it in Code Builder. Try Einstein for Developers today, and please do provide feedback on your experience through the extension, this GitHub repo, or the Einstein for Developers Trailblazer Community group.