
The Companion Playbook

A complete guide to making characters that actually work.
Nine parts. Every field. Every decision point. Every mistake worth avoiding.

Part 7

Testing & Polish

You've written the Personality. You've crafted a First Message. You've got Example Dialogue that sparkles. Maybe even a lorebook with well-organized entries and carefully chosen keywords.

Now comes the part most people skip: finding out if any of it actually works.

Testing is unsexy. Testing is tedious. Testing means having the same conversation seventeen times with slight variations, watching for the moment your character forgets she's supposed to be defensive, or starts speaking for you, or mentions a detail you never wrote.

Ibara
*ears flatten*
“Forty-three. He regenerated my opening message forty-three times. I've refused help in forty-three slightly different ways. For quality assurance.”

Testing is how you find out that your beautiful creation is actually held together with wishful thinking and duct tape.

But testing is also how you turn a rough draft into something people actually want to chat with. So let's talk about how to do it right.

Cross-Model Testing

Most platforms and frontends let users choose between different models. Your character might behave beautifully on one and fall apart on another. Models get updated. New ones appear. Old ones change. You can't control what model someone uses, but you can at least know how your character holds up across the options.

Why Models Behave Differently

Different models have different training, different context windows, different tendencies. One model might follow your instructions precisely while another interprets them more creatively. One might excel at maintaining voice consistency while another produces more varied (for better or worse) outputs.

Your character isn't just a document. It's a document being interpreted by an AI. Change the interpreter, change the interpretation.

The Cross-Model Testing Approach

Testing Protocol
01

Start with the most common options

Whatever models the majority of users on your platform are likely to use, test those first. If your character doesn't work on what most people are running, it doesn't work for most people.

02

Test the same scenarios across models

Don't just chat randomly on each. Regenerate the same opening response 3–5 times, push on the same personality pressure points, and run the same edge cases. This lets you compare directly.

03

Note the differences

Track where models diverge: Does voice consistency hold equally across all? Do some models speak for the user more than others? Are some more prone to inventing details?

04

Design for the lowest common denominator

If one model has a smaller context window, that's your constraint. If one model is more literal-minded, your instructions need to be clear enough for that one. A character that only works on the "best" model isn't robust.

When Models Update

Models change over time. A character that worked perfectly might start behaving differently after an update. This is annoying and unavoidable.

If users report that a character “stopped working” or “feels different,” check if there's been a model update. You might need to adjust.

Key Insight
Robust fundamentals matter more than clever tricks. A solid Personality, clear Example Dialogue, and a well-structured First Message survive model updates better than setups that exploit specific model quirks.

Temperature

Temperature controls how much randomness the model uses when generating responses. Low temperature means more predictable, conservative outputs. High temperature means more varied, creative, occasionally unhinged outputs.

How temperature is configured varies by platform. Some use a 0–2 scale, some use percentages, some bury it in advanced settings. The specifics don't matter for our purposes. What matters is understanding what it does to your character and how to account for it.
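Under the hood, temperature rescales the model's token scores before one is sampled. A minimal sketch of temperature-scaled softmax sampling (the logit values here are invented purely for illustration):

```python
import math

def sample_weights(logits, temperature):
    """Turn raw token scores into sampling probabilities.

    Lower temperature sharpens the distribution (the top token wins
    almost every time); higher temperature flattens it (more variety).
    """
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Same three candidate tokens, two different temperatures:
logits = [2.0, 1.0, 0.5]
cold = sample_weights(logits, 0.5)  # top option dominates
hot = sample_weights(logits, 1.5)   # options move closer together
```

The point for character design: low temperature replays the same high-probability phrasings, which is why a low-temperature character can start to feel repetitive; high temperature surfaces unlikely word choices, which is where both the creativity and the contradictions come from.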

Temperature Scale

  • Low — Consistent · Safe
  • Default / Mid — Balanced · Alive
  • High — Creative · Unpredictable

Low Temperature

Good for

Characters where consistency matters. Very specific voices you don't want drifting. On-script scenarios.

Risk

Characters meant to feel spontaneous. Can feel repetitive or robotic over time.

Default / Mid

Good for

Most characters. This is the default for a reason.

Risk

No particular risks. This is the safe zone.

High Temperature

Good for

Chaotic characters. Unpredictable personalities. Creative risk-taking.

Risk

Characters with complex rules that must be followed. The model may drift or invent contradictory details.

Temperature Recommendations

If your character works best at a specific temperature setting, tell users. In your description, you might write:

Recommendation — Controlled Voice
Recommended: low temperature. Ibara's voice stays sharper when the model isn't improvising too much.
Recommendation — Chaotic Character
This one's chaotic by design. Crank the temperature up and enjoy the ride.

This is a small thing that can significantly improve user experience. If you've tested and found a sweet spot, share it.

The Feature Disclaimer

Here's something to include in every character description: a short note about what's under the hood and how users can expect the experience to vary.

For Ibara, it looks like this:

Ibara — Feature Disclaimer
This card includes a detailed lorebook covering Ibara's psychology, her body language tells, and what happens when walls finally come down. Her tail and ears are coded to betray her. This is a slow burn. Larger models will catch the nuance. Smaller models still deliver the prickly exterior and the softness underneath.

This does several things at once:

  • It sets expectations. Users know there's depth here. They know about the lorebook. They know it's a slow burn. They won't be surprised when Ibara doesn't immediately melt into their arms.
  • It signals quality. Mentioning that you've coded specific behaviors tells users this isn't a slapped-together card. You've thought about the details.
  • It manages the model gap. Rather than promising one experience and delivering another, you're being upfront: larger models catch more nuance, smaller models still work but differently.
  • It's honest marketing. You're not overselling. You're saying “here's what I built, here's what to expect, here's how it might vary.” Users appreciate knowing what they're getting into.

Writing Your Own Disclaimer

Think about:

  • What features did you build? Lorebook? Multiple greetings? Specific behavioral triggers?
  • What's the intended experience? Slow burn? Chaotic comedy? Emotional devastation?
  • What might vary by model? Nuance? Consistency? Specific quirks?

Keep it brief. Two to four sentences. You're not writing documentation; you're giving users a heads-up.

Example — Complex Lorebook Character
Detailed lorebook covers the station's layout, crew dynamics, and what happens when the oxygen starts running out. Larger models juggle the ensemble cast better. All models deliver the claustrophobia.
Example — Simple Character
No lorebook, no tricks. Just a grumpy barista and whatever you're brave enough to order. Works great on any model.

The point isn't to follow a formula. It's to tell users what you've built and what to expect. They'll thank you for it.

What to Test For

Don't just chat aimlessly and hope problems reveal themselves. Test systematically.

Voice Consistency

Does your character sound like themselves across multiple responses? Regenerate the same message five times. Do all five sound like the same character, or does the voice drift?

Check for:

  • Consistent speech patterns and verbal tics
  • Consistent level of formality/casualness
  • Consistent emotional baseline
  • Those specific quirks you defined showing up reliably

If Ibara's defensive snark appears in three regenerations but she's suddenly warm and open in two others, something's not anchored well enough.

Ibara
*ears flatten*
“If I'm inconsistent, he wrote me wrong. That's not a me problem.”

Behavioral Consistency

Does your character act according to their personality? Push on their defined traits and see if they hold.

If she's supposed to be defensive about her past, ask about her past. Does she deflect? Or does she suddenly open up like you're old friends?

The model will try to create satisfying interactions. Sometimes “satisfying” means skipping past the interesting tension you designed. Test whether your guardrails hold.

The Speaking-For-You Problem

This is the big one. Does your character put words in your mouth or actions in your hands?

Bad — Model speaks for you
*She looks at you nervously* "So... do you like cats?"

*You smile warmly* "Of course I do. Who doesn't?"

*Her ears perk up with hope*

The model just decided what you said. You didn't get to respond.

Ibara
*tail puffs up*
“When I speak for you, I'm not talking to you. I'm playing pretend by myself. It's pathetic. Don't let your characters do it.”

Test by giving short, open-ended responses that don't dictate what happens next. See if the model fills in your side of the conversation. If it does consistently, you have a problem — probably in your First Message. Go back to Part 3 and check whether you accidentally taught the model that controlling the user is normal.

Lorebook Retrieval

If you have a lorebook, test whether the right content loads at the right times.

  • Bring up topics covered in your lorebook. Does relevant information appear in the response?
  • Bring up topics NOT in your lorebook. Does the model hallucinate details, or appropriately work with only what it knows?
  • Try using different words for the same topic to see if your keyword coverage is broad enough. If “her past” triggers the entry but “what happened to her” doesn't, you might need more keywords.
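Why keyword coverage matters becomes obvious once you sketch the matching. A toy example assuming simple case-insensitive substring matching, which is roughly how many lorebook systems behave; exact matching rules (whole words, regex, scan depth) vary by platform:

```python
def entry_triggers(keywords, message):
    """Return True if any lorebook keyword appears in the message."""
    text = message.lower()
    return any(kw.lower() in text for kw in keywords)

# Narrow keyword list: only literal phrasings trigger the entry.
narrow = ["her past", "backstory"]
print(entry_triggers(narrow, "Tell me about her past"))        # True
print(entry_triggers(narrow, "What happened to her before?"))  # False

# Broader coverage catches paraphrases too.
broad = narrow + ["happened to her", "before we met", "history"]
print(entry_triggers(broad, "What happened to her before?"))   # True
```

The second message is clearly about her past, but nothing in the narrow list literally appears in it, so the entry never loads. That's the gap you're hunting for when you rephrase topics during testing.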

Long Conversation Stability

Fresh conversations are easy. The model has your full First Message and Personality, nothing's been pushed out of context yet.

Test what happens 20, 30, 50 messages in. Does your character still remember key details? Do they still sound like themselves? Or do they start drifting as earlier context falls away? This is especially important for characters with complex setups or detailed lorebooks.

Edge Cases

Think about the weird things users might do and test those.

  • What if someone responds with just “…” or “okay”?
  • What if someone tries to break the scenario?
  • What if someone asks your character something they shouldn't know?
  • What if someone pushes hard against a personality trait?

The Testing Script

Here's a basic testing routine to run on every character:

01

Fresh Start Test

  1. Start a new chat
  2. Give a simple, neutral first response
  3. Regenerate 3–5 times
  4. Check: Is the voice consistent? Are key traits showing? Any speaking-for-me?
02

Push Test

  1. Continue the conversation
  2. Push on a core personality trait (for Ibara: ask about her past, try to touch her)
  3. Check: Do the guardrails hold? Does she stay in character?
03

Long Haul Test

  1. Have a 15–20 message conversation
  2. Cover different topics and emotional beats
  3. Check: Does she stay consistent? Any drift?
04

Edge Case Test

  1. Try a few weird inputs
  2. One-word responses, topic changes, pushing boundaries
  3. Check: Does she handle it gracefully or fall apart?
05

Cross-Model Test

  1. Repeat the fresh start test on different models
  2. Check: Any major differences? Does she work everywhere?
Pro Tip
This whole routine takes maybe 30–45 minutes. It's not fun. Do it anyway.
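If your platform or frontend exposes an API, the most mechanical part of this routine (the fresh start test) can be scripted. A rough sketch: `generate` is a stand-in for whatever chat-completion call your setup actually provides, and the speaking-for-you check is deliberately crude:

```python
def fresh_start_test(generate, character, probe="Hi. Rough day?", runs=5):
    """Regenerate the same opening reply several times and flag
    responses that look like they're speaking for the user.

    `generate(character, messages)` is a placeholder: plug in your
    platform's actual chat call. Flagged replies still need a human read.
    """
    findings = []
    for i in range(runs):
        reply = generate(character, [{"role": "user", "content": probe}])
        lowered = reply.lower()
        # Crude heuristic: the reply narrates "you" acting or speaking.
        if "*you " in lowered or "you say" in lowered:
            findings.append((i, reply[:80]))
    return findings
```

This won't catch voice drift or subtle trait loss; those still need your own judgment. But it turns five regenerations into one function call, which makes rerunning the test after every edit much less painful.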

Diagnosing Problems

When something's wrong, work backwards from the symptom to the field that most likely caused it: voice drift usually traces to your Example Dialogue, broken behavioral guardrails to the Personality, and speaking-for-you to the First Message (Part 3).

When to Stop Testing

You're done when

  • Voice stays consistent across regenerations
  • Core personality traits hold under pressure
  • The model doesn't speak for you
  • Lorebook content loads appropriately (if applicable)
  • The character works across the models you care about
  • You've handled the obvious edge cases

You're NOT done when

  • "It worked once so it's probably fine"
  • "I'll test more later" (you won't)
  • "Users will figure it out"

A Note on Perfectionism

You will never achieve 100% consistency. The model is probabilistic. Sometimes it will do weird things despite your best efforts. A 90% success rate on staying in character is genuinely good.

Test until your character is solid, not until it's perfect. Perfect doesn't exist. Solid ships.

If you're on your thirtieth revision and still finding minor issues, stop. Publish. Get real user feedback. That's worth more than another week of solitary testing.

Key Insight
The character that exists and mostly works beats the perfect character that lives forever in your drafts folder.
Ibara
*...tail sways*
“I'm not finished. Never will be. But I exist. Your perfect draft that never ships? Nobody talks to that.”