11 Apr 2017, 16:45 · Technologies · Watson Conversation · Chatbots

Chatbot Best Practices

One of the most frequent questions clients ask the Bluemix Garage is “can you build us a chatbot?”. This reflects an industry-wide trend towards more natural language in computerised interactions, and towards automating interactions currently handled by humans. There are currently more than 33,000 chatbots on Facebook Messenger alone. Many businesses are turning to Watson Conversation to help take out cost and improve user satisfaction. Our Hursley Labs colleague Simon Burns has written an excellent series of articles on how to write great Watson chatbots, which you should definitely go read. Think of this as a supplement, with our experiences from the field.

Users will expect human-like behaviour

While you should make it clear to users that they are not dealing with a person, that doesn’t mean you should ignore their expectation of human-like behaviour (this is the positive side of skeuomorphism). Building some charm and humour into your chatbot will almost certainly improve the user experience. On one chatbot project we did, the chatbot would detect profanity and respond with a mild rebuke (“Oh, I don’t like that word” or something similar). When we analysed the conversation logs, we found many users would apologise to the chatbot after it asked them not to swear. I think it’s lovely that people are saying “sorry” to computers, and I wouldn’t want a world where everything was so sterile and transactional that those kinds of characterful interactions weren’t possible.

Conversational patterns

Acknowledge your limitations and don’t be afraid to say “I don’t know”

When users head beyond the scope you’ve defined for your chatbot, be open about it. Let them know this isn’t something the bot can handle, and give some suggestions for things they can ask. “I’m only a computer and I don’t know everything” can work wonders.

The Watson Conversation service returns a confidence for each detected intent. In general, users will be a lot more disappointed by an incorrect or nonsensical response than they will be by a message saying “I’m just a bot, so I don’t know that.” In order to maintain user trust in the system, you should check the confidence of the top intent, and if it’s below a threshold, return a response like “This is too hard for me.” How damaging a wrong answer is depends on the domain, which will affect the value you choose for the confidence threshold.
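As a concrete sketch of that check, assuming the Conversation v1 message endpoint, a placeholder workspace ID and credentials, and an illustrative threshold of 0.5:

    import requests

    CONVERSATION_URL = "https://gateway.watsonplatform.net/conversation/api/v1"
    WORKSPACE_ID = "your-workspace-id"  # placeholder
    AUTH = ("username", "password")     # placeholder service credentials
    CONFIDENCE_THRESHOLD = 0.5          # illustrative; tune per domain

    def respond(user_text, context=None):
        """Send the user's text to the service and apply a confidence floor."""
        response = requests.post(
            "{}/workspaces/{}/message".format(CONVERSATION_URL, WORKSPACE_ID),
            params={"version": "2017-02-03"},
            auth=AUTH,
            json={"input": {"text": user_text}, "context": context or {}},
        ).json()
        intents = response.get("intents", [])
        # Intents come back sorted by confidence; if the top one is below the
        # threshold, admit defeat rather than guess.
        if not intents or intents[0]["confidence"] < CONFIDENCE_THRESHOLD:
            return "I'm just a bot, so I don't know that."
        return "\n".join(response["output"]["text"])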

Users like concision

When we imagine a chatbot, we imagine something so human-like that users will interact with it exactly as they would a person. Users, however, may see the text input box and expect it to behave like a search engine. This is ok! Don’t require your users to type four words when one would suffice. A conversational interface shouldn’t be a burden.

Unfortunately, when users input only one word, with no surrounding context, it’s sometimes hard for a conversation engine to correctly interpret the intent. Be aware of this when creating your training data.

For example, the Garage wrote a system to help users plan holidays. We were expecting to be able to understand destinations we didn’t know about from the surrounding context, but we found users preferred single words, which made understanding much harder. It was easy to answer “I want to fly to Chippawa” with “I’m sorry, we don’t fly there”, but what’s the correct answer to “Chippawa” on its own? It could be a food, or a place, or a type of kayak …

As well as supporting single-word inputs, consider supporting one-tap responses. For example, if a user needs to respond with “yes” or “no”, don’t force them to type that out; offer buttons, as sketched below. The same is true where there are only a handful of possible responses, and even where there are an arbitrary number of responses but a subset are very popular. Buttons can also be useful for starting a conversation, to let users know what kinds of things they can ask.
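As an illustration, here is a sketch of a yes/no question sent with one-tap quick replies, using Facebook Messenger’s Send API payload format (the recipient ID and page token are placeholders; other channels have their own equivalents):

    import requests

    PAGE_TOKEN = "your-page-access-token"  # placeholder

    # A yes/no question with one-tap quick replies instead of free typing.
    message = {
        "recipient": {"id": "USER_PSID"},  # placeholder user ID
        "message": {
            "text": "Would you like to see flight options?",
            "quick_replies": [
                {"content_type": "text", "title": "Yes", "payload": "YES"},
                {"content_type": "text", "title": "No", "payload": "NO"},
            ],
        },
    }

    requests.post(
        "https://graph.facebook.com/v2.6/me/messages",
        params={"access_token": PAGE_TOKEN},
        json=message,
    )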

Don’t repeat yourself

One pattern we’ve noticed in chatbot interactions is that the conversation may often circle back to the same node in the conversation tree. This will almost always frustrate users, so there are a few things you should do to handle this:

  • Provide multiple responses for each node. The Watson Conversation service allows you to specify a range of responses for a single node in the conversation tree. It can select one at random or work sequentially through them. Take advantage of this.
  • Detect repeat visits to a single node, and do something about it. If the conversation is going in circles, something has almost certainly gone wrong. This is a good point to hand off to a person, or to make the scope and limitations of the chatbot clear, perhaps with a humorous apology (see the sketch below).
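A minimal sketch of that detection, assuming your application keeps a per-session counter and that the message response includes the output.nodes_visited list:

    from collections import Counter

    MAX_VISITS = 2  # illustrative: a third visit to the same node suggests a loop

    def going_in_circles(response, visit_counts):
        """Count node visits per session and flag likely loops.

        visit_counts is a collections.Counter kept for the whole conversation.
        """
        for node in response.get("output", {}).get("nodes_visited", []):
            visit_counts[node] += 1
            if visit_counts[node] > MAX_VISITS:
                return True
        return False

    # Per session:
    #   visits = Counter()
    #   if going_in_circles(response, visits):
    #       reply = "Sorry, I seem to be going in circles. Shall I fetch a human?"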

Detect frustration, and handle it

More generally, use the technical capabilities available to understand your users’ emotions. If they’re happy, that’s great, but if they’re frustrated, that should be addressed. For example, “I can see this isn’t going so well. Would you like to talk to a real person?”
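For example, a sketch of that check using the Tone Analyzer v3 service (the endpoint, version date, and 0.7 anger threshold are assumptions to adapt):

    import requests

    TONE_URL = "https://gateway.watsonplatform.net/tone-analyzer/api/v3/tone"
    AUTH = ("username", "password")  # placeholder credentials
    ANGER_THRESHOLD = 0.7            # illustrative; calibrate against real logs

    def seems_frustrated(user_text):
        """Return True if Tone Analyzer scores the utterance as angry."""
        doc = requests.post(
            TONE_URL,
            params={"version": "2016-05-19"},
            auth=AUTH,
            json={"text": user_text},
        ).json()
        for category in doc["document_tone"]["tone_categories"]:
            for tone in category["tones"]:
                if tone["tone_id"] == "anger" and tone["score"] > ANGER_THRESHOLD:
                    return True
        return False

    # if seems_frustrated(text):
    #     reply = "I can see this isn't going so well. Want to talk to a real person?"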

Hand off to a human

Some interactions are too complex for a computer to handle — or rare enough that it’s not worth teaching a computer to handle them. Detect these, and hand off to a person. Even if a chatbot just handles the first, mechanical, part of a conversation or the most common questions, there can be great efficiency gains.

Context is important

Users will expect you to preserve context. The Watson Conversation service has a mechanism to add information to the context and read it back, but what to do with that context will be particular to your domain. Too little context-dependent behaviour will frustrate users, but too much will increase the complexity of your conversation tree beyond what’s manageable: too many closely-related nodes with subtly different behaviour for different conversation histories are a maintenance headache.
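The mechanics are simple: whatever context you send is echoed back, with any updates the dialog made, so your application stores it between turns. A minimal sketch (placeholder IDs and credentials again):

    import requests

    CONVERSATION_URL = "https://gateway.watsonplatform.net/conversation/api/v1"
    WORKSPACE_ID = "your-workspace-id"  # placeholder
    AUTH = ("username", "password")     # placeholder credentials

    def converse(user_text, context):
        """One conversation turn: send the stored context, keep the returned one."""
        response = requests.post(
            "{}/workspaces/{}/message".format(CONVERSATION_URL, WORKSPACE_ID),
            params={"version": "2017-02-03"},
            auth=AUTH,
            json={"input": {"text": user_text}, "context": context},
        ).json()
        # Persist the returned context (e.g. in the user's session) and send it
        # back with the next turn.
        return response["output"]["text"], response["context"]

    # context = {}
    # reply, context = converse("I want to fly to Toronto", context)
    # reply, context = converse("next Tuesday", context)  # the destination survives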

Slow down responses

Users are less likely to trust a response that seems to come back implausibly fast. Consider inserting a small pause before printing out a response, or printing the response a word at a time for a more natural effect (yep, more skeuomorphism!).
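On a simple text channel the effect can be as small as this sketch (the delay values are guesses; tune them by watching users):

    import sys
    import time

    def type_out(reply, thinking_pause=0.8, delay_per_word=0.15):
        """Print a reply a word at a time, after a short 'thinking' pause."""
        time.sleep(thinking_pause)      # pretend to think
        for word in reply.split():
            sys.stdout.write(word + " ")
            sys.stdout.flush()
            time.sleep(delay_per_word)  # pretend to type
        sys.stdout.write("\n")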

Open user intents

Open user intents, where the range of things users might say is effectively unbounded, are particularly hard to handle, and misinterpreted intents quickly lead to user frustration.

Here are some things we’ve found can help:

  • Provide lots and lots of examples to help the tool
  • Turn an open intent into a large, but closed, intent by programmatically creating an entity (see the sketch after this list)
  • Pre-process with AlchemyLanguage to add extra semantic understanding (for example, AlchemyLanguage can detect the input language)
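For the second point, the Conversation REST API can create entities programmatically, so a long list of values, say every destination you actually serve, can be loaded as one closed entity rather than typed in by hand. A sketch, assuming the v1 entities endpoint is available on your instance (placeholder IDs and credentials):

    import requests

    CONVERSATION_URL = "https://gateway.watsonplatform.net/conversation/api/v1"
    WORKSPACE_ID = "your-workspace-id"  # placeholder
    AUTH = ("username", "password")     # placeholder credentials

    def create_destination_entity(destinations):
        """Load every destination we serve as values of one closed entity."""
        entity = {
            "entity": "destination",
            "values": [{"value": name} for name in destinations],
        }
        return requests.post(
            "{}/workspaces/{}/entities".format(CONVERSATION_URL, WORKSPACE_ID),
            params={"version": "2017-02-03"},
            auth=AUTH,
            json=entity,
        )

    # create_destination_entity(["Toronto", "London", "Singapore"])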

Then, observe your users

User Testing

Observing real users interacting with your product is an essential part of user-centred design. It’s extra-essential (if that’s possible) with chatbots, because supporting natural user interactions is the defining characteristic of the system. The range of possible interactions is almost unlimited, which makes all of design, implementation, and user testing a bigger challenge.

Users doing a user test may behave differently from users in the field

One thing we’ve observed is that because chatbots are a relatively new technology, users in a user study would often try to test the limits of the technology. For example, in a chatbot designed to help users track a lost parcel, test users (who, in most test groups, hadn’t actually lost a parcel) might be more interested in the answers to questions like “how are you?” or “how smart are you?” than “where is my [fictitious] parcel?”. This can skew the results of user testing. It doesn’t mean user testing shouldn’t be done, but it does mean user testing isn’t a substitute for monitoring and tuning interactions in the live system.

Test, Monitor, Tune

One of the first things you should do when developing a chatbot system is instrument it, so that it can be properly monitored in the field. Make a dashboard where you can review, every day, which queries did not work. The Watson Conversation service pre-instruments the conversation and comes with a built-in dashboard. As with search engine development, looking at the mistakes (and there will be mistakes) is what helps you improve. The requirement to iterate fast and catch problems early is even higher for chatbots than it is for apps.
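Even with the built-in dashboard, it’s worth logging each exchange somewhere you can query yourself. A minimal sketch of the kind of record we mean (the field choice is ours, not anything the service mandates):

    import json
    import time

    def log_exchange(log_file, user_text, response):
        """Append one conversation turn to a JSON-lines log for the daily review."""
        intents = response.get("intents", [])
        record = {
            "timestamp": time.time(),
            "input": user_text,
            "intent": intents[0]["intent"] if intents else None,
            "confidence": intents[0]["confidence"] if intents else 0.0,
            "output": response.get("output", {}).get("text", []),
        }
        log_file.write(json.dumps(record) + "\n")

The daily review is then simple: sort by confidence, ascending, and read the worst exchanges first.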

It’s important to be prepared for the fact that users will use the tool in ways you don’t expect. You will miss obvious user intents in the development phase, because you have one idea of how the chatbot will be used, and your users will have different ones.

Solicit user feedback

If things go well, at some point manual review of all of the conversation logs will become unsustainable, because you’ve got so many users. This is great, but it doesn’t mean there’s no longer any need to monitor or tune. At this point, switch focus to identifying and correcting cases where the bot’s confidence was low, or where tone analysis indicated users were frustrated. It’s also a very good idea to let your users tell you when things are going wrong using a non-conversational mechanism: thumbs-up and thumbs-down buttons, a star-rating field, or a ‘This answer did not help me’ button. As well as capturing cases where the conversation is going wrong, giving feedback may make your users feel better, and you can use the feedback to take corrective action within the conversation, or hand off to a human.

In summary:

  • User-test as much as possible before launch
  • Closely monitor the dashboard post-launch and tune the conversation tree daily
  • Add new paths to the integration tests to ensure they keep working as the tree is tuned
  • Make the conversation tree as rich as possible, to cover common paths (but keep the scope realistic)

… and don’t forget to observe the rest of the system (technical considerations)

You need a lot of automated tests - and you should write them first

Good conversation trees can get pretty complex. Just like good code, they need a robust set of tests to make sure that they behave the way you want, and that stuff which works continues to work as you layer in more function. (Just as working code can be regressed by new function, a working conversation tree can change in unexpected and unwanted ways when new nodes or examples are added.)

You will probably be exploring the behaviour of your conversation tree manually as you go along, but this isn’t enough. The Watson Conversation service provides a REST API which lends itself well to automated tests, so get these built into your process as soon as possible. We use test-driven development in the Garage, so we write the automated tests for the conversation before we change anything in the tree itself.

For each intent, we check a couple of example user inputs and confirm that the output from the conversation tree includes the expected intent and any expected context. We don’t want our tests to be brittle, so we generally check just for the returned intent, and not the precise text of the returned message. We also found writing some automated tests for entities helpful, although entity detection is more straightforward than intent recognition.
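Concretely, such a test looks something like this sketch (pytest, with placeholder endpoint and credentials; the intent names belong to a hypothetical travel bot):

    import pytest
    import requests

    CONVERSATION_URL = "https://gateway.watsonplatform.net/conversation/api/v1"
    WORKSPACE_ID = "staging-workspace-id"  # test against staging, not live
    AUTH = ("username", "password")        # placeholder credentials

    def top_intent(user_text):
        """Ask the conversation service for the best intent for one input."""
        response = requests.post(
            "{}/workspaces/{}/message".format(CONVERSATION_URL, WORKSPACE_ID),
            params={"version": "2017-02-03"},
            auth=AUTH,
            json={"input": {"text": user_text}},
        ).json()
        intents = response.get("intents", [])
        return intents[0]["intent"] if intents else None

    # We assert the intent, not the exact response text, so the tests don't
    # break every time someone rewords a message.
    @pytest.mark.parametrize("text,expected", [
        ("I want to fly to Toronto", "book_flight"),
        ("book me a flight", "book_flight"),
        ("where is my luggage", "lost_baggage"),
        ("my suitcase never arrived", "lost_baggage"),
    ])
    def test_intent_recognition(text, expected):
        assert top_intent(text) == expected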

Think about DevOps

One problem we have found in the past is that our automated tests would fail, letting us know that there was a problem with the conversation, but too late: the problematic changes had already been made to a live conversation workspace. Conversation workspace changes should be managed using the same blue-green-deployment principles as the rest of the application.

In other words, both to avoid regressions and to avoid delivering half-baked changes, it’s a good idea to edit a copy of the conversation workspace, and then switch the application over to use that copy only when you’re satisfied everything is working. This can be done programmatically as part of your build script, or as a separate build process. You can either update the live workspace with the JSON of the staging workspace once the tests pass, or keep the staging workspace as it is and use a user-provided service to switch which workspace the application uses. The old ‘live’ workspace then becomes the new ‘staging’ workspace.
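A sketch of the first variant: export the staging workspace and copy it over the live one, run only after the tests pass (Conversation v1 endpoints; the IDs and credentials are placeholders, and the list of read-only fields to strip may need adjusting):

    import requests

    CONVERSATION_URL = "https://gateway.watsonplatform.net/conversation/api/v1"
    AUTH = ("username", "password")   # placeholder credentials
    VERSION = {"version": "2017-02-03"}

    def promote(staging_id, live_id):
        """Copy the staging workspace's content over the live workspace."""
        export = requests.get(
            "{}/workspaces/{}".format(CONVERSATION_URL, staging_id),
            params=dict(VERSION, export="true"),
            auth=AUTH,
        ).json()
        # Strip read-only fields before posting the content to the live copy.
        for field in ("workspace_id", "created", "updated", "status"):
            export.pop(field, None)
        requests.post(
            "{}/workspaces/{}".format(CONVERSATION_URL, live_id),
            params=VERSION,
            auth=AUTH,
            json=export,
        ).raise_for_status()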

Back up your work

As part of the DevOps flow to test and promote workspaces, take backups! A conversation workspace is a valuable asset, and it should be source-controlled.
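The same export call gives you the backup; write it to a file and commit it. A sketch (the pretty-printing and key sorting are only there to keep diffs readable):

    import json
    import requests

    def backup_workspace(workspace_id, path):
        """Export a workspace to a JSON file suitable for source control."""
        export = requests.get(
            "https://gateway.watsonplatform.net/conversation/api/v1/workspaces/"
            + workspace_id,
            params={"version": "2017-02-03", "export": "true"},
            auth=("username", "password"),  # placeholder credentials
        ).json()
        with open(path, "w") as f:
            # Stable key order keeps the file diff-friendly between exports.
            json.dump(export, f, indent=2, sort_keys=True)

    # backup_workspace("your-workspace-id", "conversation/workspace.json")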

Finally …

There’s a reason bots are generating so much interest. After decades of buttons on screens, they offer us a different way of interacting with computers. From a business perspective, they can automate repetitive communication and open up the possibility of voice transactions. For us as developers and designers, they’re an interesting new challenge in user experience design and a new set of patterns to work out. We’ve enjoyed the Garage chatbot projects, and we’re confident there will be many more to come.