There's been much grumbling about how the RLHF done with ChatGPT seems to have collapsed it into a bland writing style ('corporate speak'). Do you think a different persona could have been much less bland as a writer, or thanks to the inherent limitations of RLHF, just boring/limited in a different way? (e.g. I know CharacterAI was struggling with effusively friendly 'villain' chatbots)
It’s hard to speak with any confidence on questions like this. There’s a lot about RLHF that we just don’t know yet.
However, I don’t think this is merely a result of the “persona” assigned to ChatGPT.
Why not? Because the same problem afflicts other preference-tuned models that don’t “have a persona” in the way ChatGPT (and Claude) do.
If you ask text-davinci-003 to write fiction, it tends to use a disappointingly bland style, much like ChatGPT when asked the same thing.
text-davinci-003 was tuned with RLHF to “follow instructions” in a “helpful, truthful, and harmless” manner. (Cf. the annotator instructions in Figure 10 here.)
However, it wasn’t tuned to roleplay a character with a specific persona. text-davinci-003 doesn’t say things to you; it doesn’t talk about itself; it just writes the text you asked for in your instruction.
Which OpenAI models have this problem? An incomplete list, from my own brief tests:
- Pure language models like davinci and code-davinci-002 do not have the problem.
- (Despite the name, code-davinci-002 in particular is great at creative writing. code-davinci-002 is probably the best OpenAI API model overall, if you know what you’re doing.)
- text-davinci-002 has the problem. It was tuned on a similar dataset to text-davinci-003, but using a non-RLHF method (“FeedME,” basically finetuning on highly rated samples).
So I think the problem results from the human preference data used to tune the instruction-tuned models.
This is not entirely distinct from the “persona” we see in ChatGPT:
- The preference data encourages responses that are “helpful, truthful and harmless”
- The persona is something like “a friendly chatbot programmed to be helpful, truthful and harmless”
But, the evidence above shows that the friendly chatbot character isn’t necessary for the problem. Tuning to encourage “helpful, truthful and harmless” instruction-following is apparently sufficient.
Presumably, there is some way to collect preference data that doesn’t make the model less creative / less capable of stylistic variety when it’s tuned on it? There are finetuned models that don’t have this problem, so it’s not the mere act of finetuning that causes the problem, it’s something about the data used.
So the obvious explanation is that blandness is low-variance. How exactly that would cause blandness to reach fixation, I’m not sure. If you rule out anything rated as bad by 10% of raters, you get outputs palatable to >90% of them, but those are probably rated worse in quality than outputs that are unfiltered and simply sorted by average rating.
I guess this suggests aggregating preference data as something non-boolean, and probably permitting things with a bimodal rating pattern as long as the ratio of the strength of positive reactions to the strength of negative reactions is high enough. Sounds tricky and underdefined, though.
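To make the filtering intuition concrete, here’s a toy numerical example (all ratings are made up): a “bold” output that a minority of raters hate can have a higher average rating than a “bland” output everyone finds merely acceptable, yet a veto-style filter rejects it.

```python
# Toy illustration (hypothetical numbers): veto-filtering vs. sorting by
# average rating. Ratings are on a 1-5 scale, 10 raters per output.

def average(ratings):
    return sum(ratings) / len(ratings)

def passes_veto(ratings, bad_threshold=2, max_bad_fraction=0.10):
    """Reject anything rated 'bad' by more than 10% of raters."""
    bad = sum(1 for r in ratings if r <= bad_threshold)
    return bad / len(ratings) <= max_bad_fraction

bold  = [5, 5, 5, 5, 5, 5, 5, 5, 1, 1]   # bimodal: most love it, two hate it
bland = [4] * 10                          # unanimous "fine, I guess"

print(average(bold), passes_veto(bold))    # 4.2 False (vetoed)
print(average(bland), passes_veto(bland))  # 4.0 True  (survives)
```

The bold output wins on average quality but loses to the veto rule, which is the failure mode described above.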
I think a variant of what you’re describing is likely to be a real problem. But RLHF data doesn’t usually look the way you’re imagining it does.
The human data for RLHF typically takes the form of relative judgments on pairs of examples. Annotators are shown two outputs, A and B, and are asked to decide which one is better than the other.
(Sometimes they’re shown more than two outputs at once, but let’s ignore that.)
So the outputs are never “rated as good” or “rated as bad” in an absolute sense. They’re only rated as better or worse than the alternatives presented alongside them.
If you do want to know how good or bad the examples are on an absolute scale, you can compute Elo scores, using the same algorithm that converts the outcomes of chess matches into an absolute quality score for each player.
Of course, all else being equal, this will tend to rank examples that everyone likes above those with mixed reviews. I don’t know if there’s an “Elo scoring analogue” of the kind of aggregation rule you propose in your second paragraph; maybe there is, but it’s not something you can just do, the way you could if you had binary good/bad ratings.
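To make that concrete, here’s a minimal sketch of Elo scoring applied to pairwise preference judgments, treating each A-vs-B comparison as a “match.” The items, judgments, and K-factor are all made up for illustration.

```python
# Sketch: standard Elo updates applied to pairwise preference data.
# Each judgment (preferred, rejected) is treated like a chess game result.

def elo_update(r_winner, r_loser, k=32):
    # Expected win probability under the Elo model (400-point scale).
    expected_win = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1 - expected_win)
    return r_winner + delta, r_loser - delta

ratings = {"a": 1000.0, "b": 1000.0, "c": 1000.0}
# Hypothetical annotator judgments: (preferred, rejected) pairs.
judgments = [("a", "b"), ("a", "c"), ("b", "c"), ("a", "b")]
for winner, loser in judgments:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser])

print(sorted(ratings, key=ratings.get, reverse=True))  # order: a, b, c
```

“a” wins every comparison it appears in and ends up with the highest rating, even though no annotator ever scored it on an absolute scale.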
Anyway, once you have these relative judgments, RLHF goes like this:
- You train a “preference model” (PM), also called a reward model by some authors.
- The PM takes in an example x, and outputs a score r(x). Roughly, r(x) is a prediction about the Elo score of x.
- (In practice you don’t actually compute Elo scores; you train the PM on pairs (x, y) from the data and treat r(y) - r(x) as the log-odds of y winning vs. x, i.e. the win probability is sigmoid(r(y) - r(x)), but this ends up roughly equivalent [I think].)
- Finally, you tune the original language model to optimize the score r(x) assigned by the PM to its output.
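As a concrete sketch of the PM training step, here’s the standard pairwise loss (the Bradley–Terry-style formulation used in the InstructGPT paper; I’m not claiming this is OpenAI’s exact setup). The score difference is interpreted as the log-odds that the chosen output beats the rejected one.

```python
import math

# Standard pairwise preference-model loss: P(chosen beats rejected) is
# modeled as sigmoid(r_chosen - r_rejected), and the PM is trained to
# maximize the log-likelihood of the annotators' choices.

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def pairwise_pm_loss(r_chosen, r_rejected):
    """Negative log-likelihood of the annotator preferring `chosen`."""
    return -math.log(sigmoid(r_chosen - r_rejected))

# Hypothetical PM scores for a (chosen, rejected) pair:
print(pairwise_pm_loss(2.0, 0.5))  # small loss: PM agrees with the annotator
print(pairwise_pm_loss(0.5, 2.0))  # large loss: PM disagrees
```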
This has an inherent preference for “low variance” outputs, for a few reasons.
First, there’s the one you’re talking about. If something is likely to be a little bit controversial, or even a little bit confusing (so it throws off a few annotators), it will get a lower Elo score than something similar which is unambiguously “okay.”
Insofar as the PM is modeling Elo scores well, this trend will show up in the behavior of the tuned model.
Second, the PM is not perfect. Sometimes it’s unsure, and this shows up as a middling value of r(x).
The 1-dimensional scoring scale can’t express a distinction between “definitely mediocre” and “PM isn’t sure whether it’s good or bad”. Actual problems that the PM can see, and things that merely make the PM notice its own confusion, will both tend to lower the r(x) value of an otherwise good example.
Thus, the best behavior from the language model’s perspective is not just to do things which the annotators will prefer, but to do things which the PM is confident the annotators will prefer.
From the language model’s perspective, “this is weird so the annotators disagree about it” looks very similar to “this is weird so the PM isn’t sure about it.” The language model is encouraged to be both high-quality – in whatever sense the annotators are judging – and obviously, unambiguously high quality, without any added dross that might confuse the PM.
The LM will learn to avoid adding extra “frills” or “creative touches” that aren’t strictly necessary, even if there’s nothing bad about them in themselves. When the PM looks at these, it says, “that doesn’t seem bad, but hey, I’m not omniscient – there’s some chance it’s bad in some way I don’t know about.” And it’ll lower r(x) a bit as a result, to be safe.
All of this points toward low variance.
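A toy illustration of the second problem, with made-up numbers: if the PM’s scalar output behaves like an expected score, then “definitely mediocre” and “maybe great, maybe bad” collapse to the same r(x).

```python
# Toy illustration (hypothetical numbers): a single scalar reward can't
# distinguish "definitely mediocre" from "could be great, could be bad".

def expected_score(outcomes):
    """outcomes: list of (probability, score) the PM assigns to an example."""
    return sum(p * s for p, s in outcomes)

definitely_mediocre = [(1.0, 0.5)]          # PM is sure: it's a 0.5
confusing_but_maybe_great = [(0.5, 1.0),    # 50%: actually excellent
                             (0.5, 0.0)]    # 50%: actually bad

print(expected_score(definitely_mediocre))        # 0.5
print(expected_score(confusing_but_maybe_great))  # 0.5: indistinguishable
```

From the tuned LM’s perspective, both examples earn the same reward, so the safe play is to avoid anything the PM might be unsure about.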
The first problem – roughly that Elo scores are too unforgiving on the high end, and penalize being even a little bit controversial – might be fixable in a simple way. We could pick a different way of converting the data into a training target for the PM, one without that property.
However, the second problem – that the PM’s quality assessment and its confidence are mixed together, with the LM trying to maximize both – seems hard to avoid, as long as you’re using the outputs of an ML model as a reward signal for another model. Which is kinda the fundamental conceit of RLHF.
(Though maybe there is some way to get around this by tweaking the loss function, IDK.)
---
I’ve experienced this problem in other contexts too.
There are quirks of @nostalgebraist-autoresponder that result from me treating probabilistic classifier outputs like intensities, as though higher probability means “more of the thing coded as positive.”
E.g. impacts on Frank’s mood are proportional to the log probability from a sentiment classifier.
So, Frank is immensely cheered by things that are very obviously positive in tone, like “sounds fun :)”, even if they are not especially intense in their tone.
Longer and more complex text, even if it expresses more profound emotion, tends to have a weaker effect because it gives the sentiment model more “room for doubt.”
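Here’s a toy version of that failure mode (not Frank’s actual code; the probabilities are invented): score mood impact by the log-odds of the classifier’s positive-class probability, so confidence gets conflated with intensity.

```python
import math

# Toy version of the failure mode: treating a sentiment classifier's
# P(positive) as an intensity, via the log-odds of the positive class.

def mood_delta(p_positive):
    """Log-odds of the classifier's positive-class probability."""
    return math.log(p_positive / (1 - p_positive))

# A short, unambiguous "sounds fun :)" might get p = 0.99 from the
# classifier; a long, emotionally profound paragraph leaves more room
# for doubt, say p = 0.80 (both numbers made up).
print(mood_delta(0.99))  # ~4.6: huge mood boost from a trivial pleasantry
print(mood_delta(0.80))  # ~1.4: weaker boost despite deeper sentiment
```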
A simple thing one could do is train the model with a loss function where the predicted win probability of y over x is the probability that a Gaussian random variable with mean r(x) and variance v(x) is less than a Gaussian random variable with mean r(y) and variance v(y). (Or maybe not Gaussians but something else.) Lower v then represents greater certainty about how something is rated.
Then one could choose any function of r and v to plug into RL. In particular, one could take an asymmetric function, choosing something that might be great and might be mediocre over something that’s definitely pretty good, but choosing something that’s definitely pretty bad over something that might be mediocre and might be terrible (because the terrible answer could be racist or something).
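A sketch of what this could look like (the functional forms, especially the utility, are my own guesses, not a worked-out proposal): since the difference of the two Gaussians is itself Gaussian, the win probability has a closed form, and the RL objective can then be any asymmetric function of r and v.

```python
import math

# Gaussian pairwise model: each example's quality is N(r(x), v(x)), and
# P(y beats x) = P(N(r_y, v_y) > N(r_x, v_x)). The difference is
# N(r_y - r_x, v_x + v_y), so the win probability is a normal CDF.

def normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def win_prob(r_x, v_x, r_y, v_y):
    return normal_cdf((r_y - r_x) / math.sqrt(v_x + v_y))

def utility(r, v, alpha=1.0):
    """One possible asymmetric objective: seek variance when the mean is
    good (it might be great), avoid it when the mean is bad (it might be
    terrible). The form r +/- alpha*sqrt(v) is purely illustrative."""
    return r + alpha * math.copysign(math.sqrt(v), r)

# Risky-but-promising beats definitely-decent...
print(utility(0.5, 1.0) > utility(0.8, 0.0))    # True
# ...but definitely-bad beats risky-and-already-bad.
print(utility(-0.8, 0.0) > utility(-0.5, 1.0))  # True
```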
nostalgebraist reblogged this from the-moti and added:
Oh, I like that idea!...It reminds me of Dirichlet-based Uncertainty (DBU) models.