Join us in an enlightening discussion with Greg Bennett, Director of Conversational Design at Salesforce. We cover how he applies his knowledge of interactional sociolinguistics and discourse analysis to create more natural and intuitive interactions with generative AI. He also shares anecdotes from his tenure at Microsoft working on Cortana, and how the different writing systems of Japanese inform the design of chatbot dialogues.
In the second part of our discussion, Greg takes us through his path to Salesforce and how he made the business case for starting conversation design there. He provides insights into the careful design of prompts for models and model providers and the role of context in conversational AI. We then switch gears to explore the future of conversational design, the relevance of prompt engineering, and the exciting possibilities of generative AI.
Listen in as we unpack the evolution of conversational interfaces, the future of human-AI interaction, and the emerging vernacular of AI slang.
Show Highlights:
- Greg’s journey to Salesforce and his role in pioneering conversation design, particularly the design of prompts for models and model providers.
- The significance of context in conversational AI, along with the future of conversational design and the emerging discipline of prompt engineering.
- The potential of generative AI and its impact on Salesforce’s user interface.
- The evolution of conversational interfaces, human-AI interactions, and the development of AI slang.
- The importance of prompt writing as a fundamental skill for conversational AI, with insights into the orchestration of prompts as a system.
Links:
- Greg on Twitter: https://twitter.com/gabennett45
- Greg on LinkedIn: https://www.linkedin.com/in/gab45/
Episode Transcript
Greg Bennett:
It’s funny actually, when I told my parents that I was going to major in linguistics, they were like, “Are you going to be flipping burgers for the rest of your life?” Which I was like…
Josh Birk:
That is Greg Bennett, Director of Conversational Design here at Salesforce. I’m Josh Birk, your host of the Salesforce Developer Podcast, and here on the podcast you’ll hear stories and insights from developers, for developers. Today we sit down and talk with Greg about conversational design, about linguistics, and about how humans can interact with generative AI. But first, we’ll continue where we left off, with his early years.
Greg Bennett:
That’s not great, but I think it’s interesting. I’ve always sort of followed my interests, followed my passion, and my passion really is language. I’ve always loved it. I loved learning foreign languages as a kid, and my foray into linguistics in college really was motivated by a breakup that I had over instant messaging. And I knew the breakup was coming, and it was my first major breakup, and I was like, “I have to figure out how I knew it was coming.”
And so there was a class in school called Text and Talk, which I thought was going to be about text messaging and it turns out that was about five or six years ahead of its time. But yeah, it introduced me to the field of sociolinguistics and discourse analysis. Basically understanding how we negotiate relationships with what we say and how we say it. And so I didn’t really sort of have any intention at the time to move into anything around computers and I thought that I would end up becoming a professor. And so when I ended up having the job that I have now, it’s like, “All right, parents, look at me now. Bet you thought I wasn’t going to do this, huh?”
Josh Birk:
I love it. How many languages do you speak, by the way?
Greg Bennett:
Well, so I’m super picky because of my training in linguistics. I don’t feel comfortable claiming fluency in too many, but my native language is English, I can speak Japanese for business purposes and I can analyze and read Spanish and Korean to some degree.
Josh Birk:
Nice. Now a little bit of a tangent here, I just love, the title is what had me put this question in. In 2012, you wrote a paper on the representation of laughter as a contextualization cue in online Japanese discourse. Can you just give me the high level elevator pitch of what that was? That sounds fascinating.
Greg Bennett:
Wow, you really did your digging.
Josh Birk:
Oh yeah, I do my homework. I do do my homework, yeah.
Greg Bennett:
That’s amazing. So yeah, that was actually my master’s paper, and that was really, like I had said, about how we as speakers orient to a conversation and communicate the nature of the relationship and of what we’re saying in chat, when we don’t have our voice, our face, our hands, anything to communicate that extra layer of meaning. And so in Japanese, there are four writing systems that are used. So there’s kanji, which is the Chinese characters; hiragana and katakana, which are the Japanese characters; and then Romaji, which is essentially the Western Romanized characters.
And it’s kind of like how you express LOL in English. Expressing laughter in Japanese can be done using any one of those four writing systems. And depending on which one you use and what you’ve said before, it can come across as sarcasm, it can come across as playfulness. So when we are designing a chatbot dialogue in Japanese, for example (now, we wouldn’t really do this for enterprise use cases), if we wanted to convey some sense of laughter or amusement at what someone has said, we can leverage the particular writing system in Japanese that conveys politeness, or a little bit more formality, so as to make sure that we’re saying, “Okay, this is intended to be warm as opposed to teasing you.”
Josh Birk:
Interesting, nice. And I love how this is all coming together, I almost feel like I’m cheating and using a little foreshadowing on the pod right now. But sticking with this, because it kind of sounds like you’re already leaning into the online world, you’re leaning into electronic communications, when did your work start getting entwined with AI itself?
Greg Bennett:
So yeah, my start in the whole world of computers really was when I started at Microsoft working on Cortana. At the time, that was their competitor to Apple Siri. And I think at that time, that’s really when I started basically trying to frame everything that I’ve learned in linguistics about human communication as a system. It’s a system of features of language that we can pull and push on in order to effect a particular change or desired user experience. And that’s really where I think my start with AI was in that if we were going to have a system do this, we have to think about how the model works, what the constraints are, what types of data we have on the input side to then essentially design what the output should be.
Josh Birk:
Got you. I also think that’s kind of fascinating, and maybe you can help describe this a little bit. With the rise of ChatGPT and Bard and these new bots, I don’t know if people appreciate the overlap between Cortana and Siri and Alexa, the natural language processing behind them, and the AI itself. Can you describe a little bit of what’s similar and what’s different between me talking to Cortana and me talking to ChatGPT?
Greg Bennett:
Sure. I mean, I think the big thing is that Cortana doesn’t exist anymore, which is a huge heartbreak for me. But based on what it was when I worked on it eight years ago or so, and what ChatGPT is now, conversational experiences at that time were a lot more manual. So if you wanted Cortana to be able to say something in a certain way, we literally had conversation designers at Microsoft who would write the dialogue of what it would say, and the AI piece was more on the input side, in terms of understanding and trying to categorize what it is that the user has said.
Whereas with ChatGPT, yes, it has the sort of processing and understanding side… Rather, I should say it has the processing side, because Emily Bender is a very prominent linguist and figure in the field of conversational AI from the University of Washington, and I completely agree with her when she says that these machines aren’t capable of understanding, because understanding implies cognition and there’s nothing there. But in terms of processing what the user says, yes, you can say all these things to it in the chat and it can process that. The real big change is that it can generate a response that’s not manually written by someone; it’ll predict based on the training data that it has been given, in this case the internet, for ChatGPT the internet up until 2022… Actually no, I take that back, for ChatGPT the internet up until 2021, to then essentially predict, “Okay, this is what comes next in the sentence.”
Josh Birk:
Got it. How did you get introduced to Salesforce and how would you describe your current job?
Greg Bennett:
Yeah, so my introduction to Salesforce was actually through a Google group. I want to say about seven and a half, almost eight years ago. I met with, I believe a researcher at the time named Becky Buck. And so we were chatting essentially about what my interests were in user experience, user research, and she had mentioned that they were hiring. And so that was really sort of my introduction to Salesforce. I interviewed with the head of Research and Insights, Nalini Kotamraju, and she and Jenny Williams and at the time the head of UX, Justin McGuire, they all really gave me my start in UX at Salesforce.
I started out as a researcher, and that was primarily on Sales Cloud. We were developing Einstein for the first time at the time, so that was super exciting. And it was really actually comforting to me to be on the Einstein project, because I was very unfamiliar with anything related to Salesforce. I knew what a lead was in the office, but when they would say accounts and opportunities, I would be really lost. And so the fact that they would talk about things like one-shot and few-shot learning, I was like, “Okay, at least I know what that is.” So that was really where I started on Sales Cloud. That’s really where I cut my teeth in the Salesforce world.
And then I would say about a year and a half after that, that’s when we essentially started building the Einstein Bot builder. And at that point I said, “Everybody out of my way,” because if we want to create something that can help businesses build relationships with their customers using chat, yeah, well shoot, we can do it in English, we can do it in Japanese, we can have laughter, we can have all this kind of fun stuff in it. I did that for a few years, we did voice shortly after that, and then I moved into that realm for Einstein Voice.
And that’s really when I was able to make the business case for starting conversation design at Salesforce, where we have a really rigorous and regimented way of approaching how we design the experience when the experience itself is the language, and the language is turn-taking, an exchange back and forth. So that was really when I was able to essentially found the conversation design practice and role. When we acquired Slack, that’s when it kicked things into overdrive, because with 17 lines of business making conversational apps at the same time, I was like, “That can’t be just me.” So I hired a team, we started developing the Slack apps for our Slack integration, and then a couple years later we have Einstein GPT.
And now really what my job is, and the job of my team, is to ensure that our prompts for our models and the model providers include interaction design. So for example, in Sales Cloud, you may be a salesperson who is using Einstein GPT for Sales to write an email to a customer of yours. Well, in this case, GPT isn’t writing something for you, it’s writing something for you to send to someone else, so GPT needs to understand who’s the sender, who’s the recipient, how do they orient to one another, what is the information that is relevant from your CRM that we have to pull in for context, and make it personalized to you and the recipient in order for it to be a success. And so that is fundamentally all driven by the prompts that we on my team engineer and design. So that is essentially what my role is now.
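To make that concrete, here is a minimal sketch of the kind of grounded prompt Greg describes. All names, field choices, and wording below are illustrative assumptions, not Salesforce’s actual schema or Einstein GPT’s real prompt format:

```python
# Hypothetical sketch: assembling an email-generation prompt grounded in CRM
# data. Field and record names here are illustrative only.

def build_email_prompt(sender: dict, recipient: dict, opportunity: dict) -> str:
    """Set the stage for the model: who is writing, to whom, and why."""
    return (
        f"You are drafting an email on behalf of {sender['name']}, "
        f"a {sender['role']} at {sender['company']}.\n"
        f"The recipient is {recipient['name']}, {recipient['role']} "
        f"at {recipient['company']}.\n"
        f"Relationship: {recipient['relationship']}.\n"
        f"Relevant CRM context: the open opportunity '{opportunity['name']}' "
        f"is in stage '{opportunity['stage']}'.\n"
        "Write a concise, professional email advancing this deal, matching "
        "the formality appropriate to emailing an executive."
    )

prompt = build_email_prompt(
    sender={"name": "Dana Lee", "role": "Account Executive", "company": "Acme"},
    recipient={"name": "Tim Cookies", "role": "CEO", "company": "Fruit Co",
               "relationship": "met once at a conference"},
    opportunity={"name": "Fruit Co Renewal", "stage": "Negotiation"},
)
print(prompt)
```

The point is that sender, recipient, relationship, and CRM context are stated explicitly rather than left for the model to guess, which is exactly the stage-setting Greg describes next.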
Josh Birk:
Got you. And you’ve described this exercise as kind of like setting a stage. Can you tell me a little bit more about that?
Greg Bennett:
Yeah, so if we go back to what I was saying, because you had asked before what’s the difference between conversational AI five or 10 years ago and now, the big difference is that manual piece, where you had a conversation designer or somebody who was literally writing letter by letter what the bot would say to a user, and now the large language model is doing it. And so the way I like to position it is you’re going from being an actor on stage who is literally saying the words to getting behind the camera, or getting into the director’s seat, and directing the LLM or Einstein GPT to say the words in the way that you envision for how the interaction should go.
And so it’s more about being able to give it scope: this is who’s saying this, this is who’s sending this email, this is who’s receiving it, here’s how they know each other. Because if you don’t tell the LLM these things, it’s just going to take a crack at it. It’s going to do what it can based on what it has seen on the internet. And so if you don’t tell it, “Hey, you’re going to email Tim Cookies, who is the CEO of this company,” then it’s just going to say something like, “Hey, Tim, what’s up?” Maybe that’s not what you want to say to the CEO of a company that you’re trying to close a deal with.
Josh Birk:
Got you. And so… Oh gosh, how to phrase this? I was curious to find out just how much human hands are involved on both sides of the exercise, right? When I write a prompt, what that is isn’t necessarily 100% what the LLM is going to see, and as you’re describing, what I get back wasn’t 100% what the LLM would’ve originally produced; we’re kind of giving it a template and training it. How much is the human and how much is the machine, I guess, is where I’m going?
Greg Bennett:
Oh, okay, I see. Yeah, I mean, because the prompt that you give to the LLM is exactly what it will see, and that’s precisely how you give it scope. So you keep it from running around on what it knows on the internet and tell it more like, “No, run in this way.”
And so I think in terms of the relationship between what’s human and what’s machine, what’s human is, I think, two main things. The first is the prompt, and so the engineer or designer who is coming up with the scope of what the model output is supposed to serve in a particular interaction, and then designing the prompt to give that scope to the model. So here’s who’s sending the email, here’s who’s receiving it, here’s the context in which it’s being received, here’s the style that we want it to be sent in, or what a lot of people like to call the tone, here’s what language we even want to send it in. All of this stuff is given inside of the prompt itself.
I think the other human piece of this is the Salesforce user. The information that you put in your CRM is part of what gives the prompt its context. We reference data directly from the CRM, using the API name for different fields on records in Salesforce, to give the detail that is needed in order to make the model output as personal as possible for you as the Salesforce user. What the machine does is really the interpretation of that entire prompt: being able to parse apart what the boundaries of this model output are, and how to essentially embed the information that has been given inside of the prompt into the model output, based on the prompt itself as well as the settings that you may give it. So for example, if you turn down the temperature setting in OpenAI, essentially you’re telling the model: not as much variation, stick a little bit closer to the prompt rather than getting “creative”. That’s really essentially what the machine is doing.
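For readers who want to see where that knob lives, here is a minimal sketch using OpenAI’s Python client; the model name and prompt text are our own illustrative choices, not what Einstein GPT actually sends:

```python
# Minimal temperature sketch (assumes `pip install openai` and an
# OPENAI_API_KEY in the environment). Model and prompt are illustrative.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # hypothetical model choice for this example
    messages=[{
        "role": "user",
        "content": "Write a two-sentence follow-up email to a CEO after a demo.",
    }],
    # Lower temperature means less variation: the model sticks closer to the
    # prompt instead of getting "creative".
    temperature=0.2,
)
print(response.choices[0].message.content)
```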
Josh Birk:
The dangers of the AI hallucinating, I guess.
Greg Bennett:
Well, yeah, and I think that’s also another really interesting concept is the concept of hallucination. People will often say, “Oh, it hallucinated.” And I’m like, “But it doesn’t think, it doesn’t have a brain. It’s just a word calculator, so it didn’t hallucinate.”
Josh Birk:
So yeah, this is a word that’s being thrown around a lot right now. Let’s level set for people listening, when somebody says Bard hallucinated or just generally AI hallucinated, what do they mean? And then go ahead and what’s your personal opinion about what that term is doing to the conversation?
Greg Bennett:
Yeah, so I think, and again, I’m going to put my linguist hat on, what I hear when people talk about hallucinations is they’re really saying the model made something up, and this isn’t accurate. Whatever it produced isn’t necessarily consistent either with whatever we know to be true in the world, or with whatever we know to be true in the prior conversation, in all of the conversation that we’ve had up until this very turn with the large language model; there’s some sort of incongruency with something that happened previously. And that incongruency, whether that’s with the conversation up until now or with the outside world, is perceived as a hallucination.
I think why people call it a hallucination is because of the conversational competence that the large language models now have. They’re able to respond really quickly in a sort of more robust format with more variation than we’ve ever experienced before in conversational AI. And so there’s this perception there that if someone has conversational competence and they say something that is factually incongruent with conversation prior to this or with the outside world, that must be some sort of hallucination like you’ve misinterpreted something. But I think the challenge with using the term hallucination is, again, presuming that this technology has cognition. It’s not thinking, it’s just a predictor of what should come next in the sentence based on a large corpus. And so to call it a hallucination is to almost anthropomorphize this technology when it’s not a human. So I think it’s probably more accurate to say it made something up or it’s factually inaccurate.
Josh Birk:
I have sometimes described it as it’s trying to connect the dots, and so it just creates filler basically and the filler is really realistic sounding, but it may not have anything to do with actual reality kind of thing.
Greg Bennett:
Sure. I think, yeah, it’s trying to fill that next slot in the sentence. And I think what’s interesting about that is that it’s doing what it’s made to do: it’s predicting what should come next. It wasn’t made to be a fact-checker, because again, it’s only trained up until a certain date. But it was made to say whatever could feasibly come next in the sentence.
Josh Birk:
Yeah. And it’s interesting, it’s such a crazy concept versus the laws of robotics. Its main goal is to provide you an answer, and it may complete that goal in different ways. But if facts aren’t its main goal, then an answer that sounds like what it thinks you wanted to hear is its main goal. Does that sound accurate?
Greg Bennett:
Or an answer that sounds like it would be statistically plausible based on what everybody else has said on the internet because it doesn’t think, it just spits out words based on a prediction.
Josh Birk:
Right. So we can all agree on ChatGPT is not dreaming of electric sheep?
Greg Bennett:
No, exactly. No dreams, no dreams of sheep, no cognition. Just a word calculator.
Josh Birk:
Just a lot of math.
Greg Bennett:
Yeah, exactly. And I shouldn’t say just a word calculator because a word calculator’s really powerful. We haven’t had that before, but it isn’t a crystal ball, it’s not a magician, it’s a prediction, a statistical prediction.
Josh Birk:
Yeah. Going back a little more to the technical aspects of speaking to a machine, and I just tripped over my own question, and I know this could be long, but can you describe the SPEAKING model to me, and how does that play into prompt design for conversations?
Greg Bennett:
Sorry, I’m just so impressed. I’ve never had anyone who’s really gotten so deep into the work before and asked me specific questions. Yeah, so the SPEAKING model, huge shout-out to Denise Martinez on my team, who is also a linguist. She and I have a very similar background in terms of our academic training. And I was aware of the SPEAKING grid before she introduced it, but I didn’t think to introduce it to the field; she did. She was often running into challenges around, how do we operationalize context? If we’re going to create a conversational experience, where does that conversational experience live? How are users orienting to it? What are they doing while they’re potentially also conversing with this app or this experience? What are the needs and norms of that particular conversation?
All that stuff is contained in the SPEAKING grid, which is a framework that was created by Dell Hymes within the field of the ethnography of communication. Essentially, each item in the SPEAKING grid represents a piece of the context of the conversation at hand. So happy to provide a link or give a little bit more, because the SPEAKING grid has very specific scope, but the sort of TL;DR is that it’s essentially a tool for you to walk through step-by-step to understand who the participants in the interaction are, where the interaction is happening, and what the considered ideologies or norms of the interaction are, so that you can essentially encode that in the conversation design, along with what the expected outcomes are.
All of that, I think, when you’re using it for prompt engineering, gets you as a prompt engineer or designer to think through: what exactly is this prompt supposed to do? Who is it for? What is the model output for? Where is it going to go? It’s almost like, if we’re going back to the theater model, thinking through everything that’s going on on stage. Is this a satire, or is it a comedy, or is it a tragedy? Who are the characters? How do they orient to one another? What kind of language do they speak? What’s considered permissible versus offensive? And what are they trying to do in this scene? That’s what the SPEAKING grid essentially outlines for you.
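Hymes’s SPEAKING is a mnemonic: Setting, Participants, Ends, Act sequence, Key, Instrumentalities, Norms, Genre. Here is one way you might operationalize it as a checklist for prompt design; the dataclass and the sales-email example are our own illustration, not Salesforce’s internal tooling:

```python
# The SPEAKING grid (Dell Hymes) as a simple checklist for prompt design.
from dataclasses import dataclass

@dataclass
class SpeakingGrid:
    setting: str            # S: where and when the interaction happens
    participants: str       # P: who is involved and how they relate
    ends: str               # E: goals and expected outcomes
    act_sequence: str       # A: order and form of the exchange
    key: str                # K: tone or manner (formal, playful, etc.)
    instrumentalities: str  # I: channel and register (email, chat, voice)
    norms: str              # N: what's permissible versus offensive
    genre: str              # G: kind of speech event (pitch, apology, demo)

email_context = SpeakingGrid(
    setting="Follow-up after a product demo, end of quarter",
    participants="Account executive writing to a prospect's CEO",
    ends="Advance the deal toward a signed contract",
    act_sequence="Greeting, recap, value statement, call to action",
    key="Warm but formal",
    instrumentalities="Business email, English",
    norms="No slang, no pressure tactics",
    genre="Sales follow-up email",
)
```

Walking through each field before writing a prompt forces the “who, where, and to what end” questions Greg lists above.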
Josh Birk:
Nice, nice. Going to the other side of things, people have described prompt writing, being able to talk to the machine, as something that’s turning into a fundamental skill, if not maybe even a transformative skill. What are your thoughts there? How important is it going to be for people to be good at creating prompts that get the outcomes they’re looking for?
Greg Bennett:
I think it’s going to be really important in the near term. I think long term, what we should expect with prompts is that there is more of an orchestration of a system of prompts rather than having to do it manually. So in the same way that conversation design underwent a shift from seven, eight years ago where you were manually writing every little thing that the bot would say to now you’re instructing a large language model to produce that text or that discourse, now we’re in this phase of, “Okay, well, getting the model to produce text or discourse is a manual thing because you have to create a prompt in order to do it. What if we got to a point where we could orchestrate those prompts where we don’t have to do quite so much manually?”
So I think in order to get to that future, you have to be really strong in prompt engineering now, because the only way to orchestrate those prompts is if they’re of quality now. And so that’s why it’s really important now to follow a rigorous, systematic approach to prompt engineering: you have to have a way to isolate a variable. So if we’re trying to change the style or the tone of the model output, and we give it a specific example inside of the prompt saying, “Okay, I want you to express enthusiasm using intensifiers and exclamation points,” but then we decide to take out the exclamation points part and replace it with emojis, and what we get out of the model is not what we expected, well, we know why, because we isolated that specific piece of the prompt; we changed one variable. But if you do your prompts differently every single time you’re trying to get a model output, it’s going to make it really hard for you to isolate, “Okay, what part of the prompt really is the problem here?”
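As a concrete illustration of isolating one variable at a time (our own example, not Salesforce’s test harness), keep everything fixed except the piece under test, so any change in output can be traced to a single change in the prompt:

```python
# Hold the base prompt constant and vary exactly one piece.
BASE = (
    "Write a two-sentence reply thanking a customer for their feedback. "
    "Express enthusiasm using intensifiers and {emphasis}."
)

variants = {
    "exclamation_points": BASE.format(emphasis="exclamation points"),
    "emojis": BASE.format(emphasis="emojis"),
}

for name, prompt in variants.items():
    print(f"--- variant: {name} ---")
    print(prompt)
    # Send `prompt` to the model of your choice here and log the output
    # alongside the variant name, so differences map to one variable.
```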
Josh Birk:
Got you. And that kind of dovetails, because out of the papers you gave me to read for research, one of them was on conversational evolution in IRC, and I’m going to admit, I skimmed it, I didn’t read all of it, but I did find it fascinating, because I’m old, I was around during the early IRC days. I remember a world before LOL and emojis and all of those. I think I actually remember the first person who said LOL to me, you know what I mean? So with all of this rise of abbreviations and slang and emojis, because we’re in a textual world that lacks certain context but has new context, do you see a similar evolution? Given what you were saying there about the difference between an exclamation mark and a smiley face, do you think prompt design from a human point of view is going to evolve in that direction? Are we going to come up with AI slang so that the AI has shortcuts to know, “Oh, that’s what the human was actually thinking”?
Greg Bennett:
I think we could. When I mentioned before also about orchestrating the prompts, I think shortcuts like that would be exactly how you do it, where instead this shortcut is an abstraction of a prompt that lives underneath, and that prompt that lives underneath is something we’re designing now. We don’t have a robust enough system yet to be able to create abstractions that we can layer together really quickly. But I absolutely think that we would move to a world in which that kind of work would happen at scale, because right now it just doesn’t scale.
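In code, that kind of shortcut might look like a macro that expands into a fuller prompt underneath. The tokens and wording below are pure speculation on our part, following Greg’s description of shortcuts as abstractions over designed prompts:

```python
# Speculative sketch of "AI slang": a short token expands to the fuller
# prompt it abstracts. Token names and expansions are invented examples.
SHORTCUTS = {
    "#warm-formal": "Use a warm but formal style: intensifiers, no slang, no emojis.",
    "#exec-brief": "Keep it under three sentences; the reader is a busy executive.",
}

def expand(prompt: str) -> str:
    """Replace each shortcut token with the prompt text it abstracts."""
    for token, expansion in SHORTCUTS.items():
        prompt = prompt.replace(token, expansion)
    return prompt

print(expand("Write a follow-up email. #warm-formal #exec-brief"))
```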
Josh Birk:
Right, right. And, I know we’re speaking very theoretically and at a high level, we can’t share internal Figmas or anything like that, but how are you seeing this influence the overall Salesforce UI?
Greg Bennett:
I think what’s really interesting about this is that generative AI is inherently multimodal. You have generative language, you have generative image, you have generative video and sound. I think what it’s going to do to the Salesforce UI is very much in line with the vision of Kat Holmes, our head of design at Salesforce, which is making an experience that is more personalized and interactive for the end user. So are you able to get the information that you want in an embedded fashion, where it’s inside of the UI and you don’t have to navigate somewhere else, as opposed to, like I said, having to leave the experience to fetch what you want? Being able to control it with language as an input, rather than having to navigate through too many clicks, for example.
Alan Ross, our senior director of UX engineering, has deep experience in the world of generative AI, particularly on the graphical element of the experience, and I think he articulates very clearly and grippingly how you can use a natural language interface to generate interactive images and interactive components. Imagine a world in which you could say, “Give me my top leads for the quarter,” and instead of having to navigate to a dashboard or a report, the report gets created in front of you based on what you said.
Josh Birk:
Yeah, yeah. Now I’ve been telling people, “Just try it.” There’s free versions of almost all of these things, ChatGPT, Bard, Bing, Midjourney I think is dropping their free tier for a little while or something like that. But I’m like, the best thing you can do is just get in there and start talking to these things so that you can get your hands on with it. Do you have further advice for people who want to get that prompt writing skill under their hat?
Greg Bennett:
I mean, I definitely think getting your hands on it right away is the best way to do it, to dive in. There’s a lot of content out there right now on prompt engineering that’s coming up. I think I just saw MIT offer a prompt engineering class to the tune of about $4,000, so I don’t know if that one’s going to be the one that you choose, but there are lots of other very cost-effective options out there that you can explore.
But I definitely think getting into the tools helps. Voiceflow, which is essentially a voice AI prototyping tool that we also use here at Salesforce, has a lot of tutorials and content around things like prompt engineering and prompt chaining, so how you can link multiple prompts together in order to essentially get a more complex or more contextualized final model output, and you can sign up for their interface as well. Anthropic, for example, you can register for their beta and get in and test out their model.
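Prompt chaining, in sketch form (our illustration, tied to no particular vendor’s API): the output of one prompt becomes context for the next, building up a more contextualized final result step by step.

```python
# Sketch of prompt chaining. `call_model` is a stand-in: wire it to any
# model provider (OpenAI, Anthropic, etc.) before running.

def call_model(prompt: str) -> str:
    """Stand-in for a real model call."""
    raise NotImplementedError("wire this to your model provider")

def chained_email(notes: str) -> str:
    # Step 1: distill raw notes into key facts.
    facts = call_model(f"Summarize the key facts in these notes:\n{notes}")
    # Step 2: draft an email grounded only in those facts.
    draft = call_model(f"Using only these facts, draft a short email:\n{facts}")
    # Step 3: adjust style without changing the content.
    return call_model(f"Rewrite this email in a warm, formal tone:\n{draft}")
```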
And really, I think the biggest advice I would give is to approach this like you’re a researcher. Observe yourself. Make sure to save the prompts, the things that you say or type to the model, every time you send one, so that you can watch that evolve. And again, isolate specific variables. So okay, the first prompt I sent was, “Hey, write me an email,” but the next one was, “Write me an email to a CEO,” see how it changes. “Okay, what if I change the CEO part to another type of role? What if I start adding and building onto the prompt that I sent?” And systematically call out what these pieces are: “Okay, here is the command, here is the context, here is the style,” for example. Really, I think, observe yourself and take notes, and try to create a library or a system.
Josh Birk:
And that’s our show. Now before we go, I did ask after Greg’s favorite non-technical hobby, and well, he didn’t think it was a surprise, but it was kind of a surprise to me.
Greg Bennett:
I mean, I guess it’s not going to come as a huge surprise because of my background in linguistics and all this stuff around words, my favorite non-technical hobby is songwriting.
Josh Birk:
Okay. That came as a surprise, I got to tell you, actually, I thought you were going to say reading or writing or something. Songwriting, nice.
Greg Bennett:
Yeah, yeah. I mean, mostly the lyrics. I don’t write my own musical compositions because I’m not a musician really, but finding a composition and hearing the melody and then thinking through, “Okay, what is it that I want to communicate or say in these lyrics?” And making sure that they fit the sound. It’s interesting, I was asked years ago to think of how to compare my work to a hobby, and at the time I was working on Einstein Forecasting, I was like, “How the hell am I going to compare forecasting and songwriting?” The truth is that I do see the similarities because it’s like when I do songwriting, I’m not doing it completely from scratch, I’m working with some sort of partner or existing composition, and that’s the boundary that I’m working within.
Josh Birk:
I want to thank Greg for the great conversation and information, and as always, I want to thank you for listening. Now, if you want to learn more about this show, head on over to developer.salesforce.com/podcast where you can hear old episodes, see the show notes, and find links to your favorite podcast service. Thanks again, everybody. I’ll talk to you next week.