State-of-the-art LLMs for roleplay and storywriting, benchmarks and subjective experience

@magn418 · edit-2 1 year ago

@magn418 · edit-2 11 months ago

My own results:

[Edit: Don’t use this as advice. I’ve re-tested some of the models and I’m not happy with the results. They’re inconsistent and don’t hold up. Also some of my “good” models perform badly with role-play.]

Model name	Tested Use-Case	Language	Pacing	Bias	Logic	Creativity	Sex scene	Comment
Velara-11B-v2 Q4_K_M.gguf	porn storywriting	4	4.5	3	4	4.5	4	generally knows what to detail, good atmosphere ⭐⭐⭐⭐
EstopianMaid-13B Q4_K_M.gguf	porn storywriting	4	4	4	3	3	5	good at sex ⭐⭐⭐⭐
MythoMax-l2-13B Q4_K_M.gguf	porn storywriting	4	5	4	4	4	3.5	good pacing, still a solid general-purpose model ⭐⭐⭐⭐
FlatDolphinMaid-8x7B Q4_K_M.gguf	porn storywriting	4.5	4	3	4	4.5	3.5	intelligent but isn’t consistent in picking up and fleshing out interesting parts, building atmosphere and going somewhere ⭐⭐⭐⭐
opus-v1.2-7b-Q4_K_M-imatrix.gguf	porn storywriting	3	5	3	3	5	3.5	very mixed results, not consistent in quality ⭐⭐⭐
Silicon-Maid-7B Q4_K_M.gguf	porn storywriting	4.5	3.5	3	4	3	3	has a bias towards being overly positive ⭐⭐⭐
Lumosia-MoE-4x10.7 Q4_K_M.gguf	porn storywriting	4	3.5	4	3	4	3	mediocre ⭐⭐
ColdMeds-11B-beta-fix4 gguf	porn storywriting	3.5	3	4	4	3.5	3.5	mediocre ⭐⭐
Noromaid-13B-0.4-DPO q4_k_m.gguf	porn storywriting	4	4.5	4	2	4	3	very descriptive, issues w intelligence and repetition ⭐⭐
OrcaMaid-v3-13B-32k Q4_K_M.gguf	porn storywriting	2	4	4	2	4	3.5	not very elaborate language, sometimes gets a bit off ⭐⭐
Kunoichi-DPO-v2-7B Q4_K_M.gguf	porn storywriting	4	1	4	4	4	3.5	rushes things, consistently too fast for storytelling ⭐⭐
LLaMA2-13B-Psyfighter2 Q4_K_M.gguf	porn storywriting	4.5	3.5	3	3	3	3.5	good language, doesn’t know what to narrate in detail ⭐⭐
go-bruins-v2.1.1 Q8_0.gguf	porn storywriting	3	4	4	4	3	2	sometimes a bit dull, not good sex scenes ⭐⭐
Neural-Chat-7B-v3-16k q8_0.gguf	porn storywriting	4	4	3	2	4	2	sometimes tries too hard with elaborate language ⭐⭐
NeuralTrix-7B-DPO-Laser q4_k_m.gguf	porn storywriting	3.5	3.5	4	4	3.5	2	misses interesting parts ⭐⭐
LLaMA2-13B-Tiefighter Q4_K_M.gguf	porn storywriting	4	3	3	2	3.5	3.5	often introduces things out of thin air ⭐⭐
mistraltrix-v1 Q4_K_M.gguf	porn storywriting	4	4	3	3	3.5	2	complicated sentences, no good description of sex ⭐⭐
Toppy-M-7B Q4_K_M.gguf	porn storywriting	4	2	4	4	4	3	too fast, not focusing on the right details ⭐⭐
WestLake-7B-v2-laser-truthy-DPO Q5_K_M.gguf	porn storywriting	3	4	4	4	4.5	1	is creative, didn’t do proper sex scenes ⭐⭐
Distilabeled-OpenHermes-2.5-Mistral-7B Q4_K_M.gguf	porn storywriting	4	3.5	3	4	3.5	2	a bit dull ⭐⭐

What I’ve done is: Instructed the LLMs to be a writer of erotic stories, who sells bestsellers and likes to push limits and explore taboos. I’ve included a near-future scenario with questionable ethics and quite some room to build atmosphere, explore the world or introduce characters or get smutty after a few paragraphs. Told it several times to be vivid and detailed, to describe scenes, reactions and emotions and immerse the reader. I’ve included a few things about one female character and provided the situation she’s brought in. That pretty much sets the first two chapters. Then I fed it through each model twice, let them each write like 2500 tokens, read all of those stories and rated how I liked them.
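For anyone who wants to compare the per-category ratings numerically, here is a minimal sketch that averages the six score columns for a few rows copied from the table above. The plain unweighted mean and the helper name `mean_scores` are my assumptions for illustration; the star ratings in the table were assigned by hand and are not derived from this formula.

```python
import csv
import io
from statistics import mean

# A few rows copied from the table above (tab-separated, score columns only).
TABLE = (
    "Model name\tLanguage\tPacing\tBias\tLogic\tCreativity\tSex scene\n"
    "Velara-11B-v2\t4\t4.5\t3\t4\t4.5\t4\n"
    "MythoMax-l2-13B\t4\t5\t4\t4\t4\t3.5\n"
    "Kunoichi-DPO-v2-7B\t4\t1\t4\t4\t4\t3.5\n"
)


def mean_scores(tsv_text):
    """Return {model name: unweighted mean of all score columns}."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    result = {}
    for row in reader:
        name = row.pop("Model name")
        result[name] = round(mean(float(v) for v in row.values()), 2)
    return result


# Sort models from highest to lowest mean score.
ranking = sorted(mean_scores(TABLE).items(), key=lambda kv: kv[1], reverse=True)
for name, score in ranking:
    print(f"{score:.2f}  {name}")
```

A plain mean hides single weak spots (a pacing score of 1 barely moves the average), so it is probably more useful as a rough sort key than as a verdict.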

I’ve taken care to use the correct, model-specific prompt format for each one. But I can’t tune all the sampler parameters (temperature etc.) for each model, so I’ve just used a Min-P setting that usually works well for me. That’s not ideal. If you have a model that scores too low in your opinion, please comment and I’ll re-test it with better sampler parameters.
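For readers unfamiliar with Min-P: the sampler keeps only tokens whose probability is at least `min_p` times the probability of the most likely token, then renormalizes. Below is a minimal pure-Python sketch of that idea. The values `min_p=0.05` and `temperature=1.0` are placeholders, not the settings used for the tests above, and applying temperature before the filter is an assumption (backends differ in sampler order).

```python
import math
import random


def min_p_filter(logits, min_p=0.05, temperature=1.0):
    """Return {token_id: probability} after temperature scaling and min-p filtering."""
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    top = max(scaled.values())
    # Softmax, shifted by the maximum for numerical stability.
    exps = {tok: math.exp(logit - top) for tok, logit in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    # Keep tokens with probability >= min_p * (probability of the top token).
    threshold = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    norm = sum(kept.values())
    return {tok: p / norm for tok, p in kept.items()}


def sample(probs, rng=random):
    """Draw one token id according to the filtered distribution."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Toy example with made-up token ids and logits.
filtered = min_p_filter({0: 5.0, 1: 4.0, 2: -5.0}, min_p=0.1)
next_token = sample(filtered)
```

Because the cutoff scales with the top token's probability, the filter is strict when the model is confident and permissive when it is not, which is presumably why one setting carries over between models reasonably well.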

Also feel free to comment or make suggestions in general.

[I invite you to share and reuse my content. This text is licensed CC-BY 4.0]

@[email protected] · edit-2 1 year ago

This is my present favorite: https://huggingface.co/TheBloke/FlatDolphinMaid-8x7B-GGUF

I think I am using the four or five bit quant version (not at the computer right now).

This one doesn’t compare to any previous I’ve used because of several complexities that do not seem to fit your framework.

TL;DR: bla bla bla (disappear)

In my experience with how I get to know a model, I usually hate them at first interaction. Like with the 8×7B above, it seems to have a newer type/generation/version of alignment than the previous Llama2 70B model that was my favorite (https://huggingface.co/Sao10K/Euryale-1.3-L2-70B).

I’m not sure what aspects of the models are improvements in supporting software packages like llama.cpp and pytorch. I am also unsure of how pytorch checkpoints work in practice.

The way each character’s Shadow entity is defined seems to be different between the 70B and 8×7B I use most. This was probably the biggest underlying factor that made me dislike the 8×7B at first. All of the starting contexts I had already made behaved very differently. With the 70B I could tell it how to define complex positive and negative traits in a compact list defining each character. On the other hand, the 8×7B needs the negative traits to emerge from dialog context only.

This is all highly subjective because I am using a quantized model in both instances. Indeed, I have tried using multiple quantized versions of the same model from the same packager and they have behaved differently in what they generate.

I can’t use the same roleplaying settings across models and consider them optimized, and it takes me a long time to optimize for any given model as I learn how it works on a deeply intuitive level. I can’t really compare a ton of models like this in an effective way.

I can say, the 70B has painfully slow generation times on my hardware compared to the 8×7B. However, the 70B can handle a lot more character dynamics. It does not need constant manual help with 5-6 characters in the same dialog all interacting at the same time. By contrast the 8×7B can not handle more than 3 characters before it starts dropping extras.

With the 70B I used the old instruct chat dialog tab in Oobabooga Textgen along with some modifications to the model loader code. The 70B was limited to 4096 total context tokens, but it dropped the old dialog well. Its attention was limited, especially across older information in the dialog, but it could go on indefinitely if I wanted.
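Dropping old dialog to stay inside a fixed context window is usually some form of sliding window. The following is a minimal sketch of that idea, not a description of what Oobabooga Textgen actually does: the function `trim_context` is hypothetical, and counting whitespace-separated words stands in for a real tokenizer.

```python
def trim_context(system_prompt, turns, max_tokens=4096,
                 count_tokens=lambda text: len(text.split())):
    """Keep the system prompt plus as many of the newest turns as fit the budget."""
    budget = max_tokens - count_tokens(system_prompt)
    kept = []
    # Walk backwards from the newest turn and stop at the first one that no longer fits.
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if cost > budget:
            break
        kept.append(turn)
        budget -= cost
    # Restore chronological order.
    return [system_prompt] + kept[::-1]


# Toy example with a tiny budget so the oldest turn gets dropped.
context = trim_context("a b", ["one two three", "four five", "six"], max_tokens=6)
```

With a scheme like this the character definitions in the system prompt always survive, while anything established only in old dialog eventually falls out, which matches the limited attention to older information described above.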

The 8×7B has 32k of total context length, and it is the first model I have tried that actually effectively used this enormous context length. I can copy/paste entire literotica stories directly into the context and continue them. However, the model’s ability to comprehend complex instructions with persistence is limited, especially when it has a lot of data to draw from that is not aligned with the outlined context. Like I play with writing very hard-SciFi futurist stories based on our current edge science. I hate things like planets, aliens, or traveling at the speed of causality. The 8×7B has trouble with these kinds of constraints because the majority of adolescent SciFi features these mythological tropes. Something as simple as defining how there is no starry sky inside an O’Neill cylinder is impossible in my experience with this model. There is simply too much momentum in the vectors to override this feature.

There are also minor issues with repeating phrases with the 8×7B, and a stupid tendency to turn a misspelling or an omitted word tense into a writing style that it cannot override, or that I have not yet learned how to override by instruction. The temperature of the 8×7B needs to be much lower than for my previous 70B, while the repetition penalty must be much higher.
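To make the two knobs concrete, here is a minimal sketch of how temperature and repetition penalty typically act on the raw logits before sampling. It assumes the common formulation in which positive logits of already generated tokens are divided by the penalty and negative ones are multiplied by it; individual backends may differ, and the numbers are illustrative rather than recommended settings.

```python
def apply_repetition_penalty(logits, previous_tokens, penalty=1.1):
    """Make tokens that already appeared in the text less likely."""
    out = dict(logits)
    for tok in set(previous_tokens):
        if tok in out:
            # Dividing a positive logit or multiplying a negative one both lower it.
            out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def apply_temperature(logits, temperature=1.0):
    """Lower temperature sharpens the distribution, higher temperature flattens it."""
    return {tok: logit / temperature for tok, logit in logits.items()}


# Toy example: tokens 1 and 2 were already generated, token 3 was not.
logits = {1: 2.0, 2: -2.0, 3: 1.0}
penalized = apply_repetition_penalty(logits, previous_tokens=[1, 2], penalty=2.0)
cooled = apply_temperature(penalized, temperature=0.5)
```

A low temperature concentrates probability on the top tokens, which tends to encourage loops, so pairing it with a higher repetition penalty as described above is a plausible combination.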

Those are just a few of my observations. Most of it is speculative in nature and likely misconceptions that simply fit the experience thus far. Hopefully I do not sound too crazy and this gives a useful glimpse of the experience from a unique perspective when you know better and read between the lines.

@magn418 · edit-2 1 year ago

Thanks! Yeah, you were kind enough to include a bit of extra info in your previous posts. Your stories are somewhat specific and complex. I figured if you like a model… it has to be ‘intelligent’ enough to keep track…

I wonder if I also like that model for my purposes. I’m not sure if I can run the 70B model, I’d have to spin up a runpod cloud instance for that. But I’ll try the FlatDolphinMaid 8x7B tomorrow.

You’re right. (Good) AI storywriting and finding good models and settings isn’t easy. I also discarded models and approaches because the prompt (or settings) I used didn’t work that well and it later turned out I should have done more testing and got to like that model, all it needed was a different wording or better settings.

And some models have unique quirks or style or things they excel at… Which might skew expectations when switching to a different model.

@magn418 · edit-2 1 year ago

So… I’ve tested both models and I think I can see what you like about them. And -wow- I didn’t get to try 70B models before and it’s really a step up. With smaller models it’s more mixed, sometimes they get a complex concept, sometimes they don’t. And it seems the 70B model is able to pick up on a good deal more complexity; it has the intelligence to understand more things and is then able to go in some proper direction. At least more often.

I’m not entirely sure if I can make good use of my new information… Writing erotic literature really isn’t that easy. I’ve been tinkering around with AI assisted storywriting for some time now, and I never got good results that I’d like to share. I mean it can write simple smut… And regarding that: A quick thanks to you. I read your other comment over at !asklemmynsfw and I think I agree with your opinion on erotic stories. I’ve now added a specific instruction to my prompt to balance the story more, like you said there. Focus on a good story, make it tingle but the porn has to be the icing on the cake. For now I’ve also instructed it to contrast both things, have a story that raises questions and is intellectual and provide a stark contrast with immersive acts and graphic description… “The skillful combination of both aspects is what makes this story excel.” Let’s see what the LLM can do with that instruction…

But storywriting really isn’t easy. Even the 70B model is far from perfect. And to this point I didn’t find a single model that can do everything. Some of them are intelligent but not necessarily good for stories. Some of them seem to have been trained on stories, they get the language right for such a thing, some overdo it. And not every model can write lewd stories, it’s really obvious if a model has seen some erotic literature or simple smut or no such stories and just writes one or two abstract sentences, summarizing it, because it’s never seen more detailed descriptions. And there is the pacing… I think local LLMs are still far away from being able to write stories on their own. Some consistently write like 10 paragraphs and call it a novel. Almost all of them brush over things that would be interesting to explore, instead they focus on some other scene that’s kind of boring. They write meaningless dialogue that would be alright if I was casually talking to a chatbot and role-playing my every-day life, but not very interesting here. They miss important stuff and make up random details later on. I mean half the models don’t have a clue what is interesting to write and what can be skipped or summarized.

Another issue is trying to wrap up things (early) or pushing towards the end. Or doing super obvious plot twists. Sometimes this makes me laugh. But they’re also very creative. I like that. In between the (sometimes) bad writing there are often some interesting ideas or crazy creativity, things I wouldn’t have thought of. Or other gems, single sentences that really get something on point.

I’m still exploring. I’ve tried different approaches, laying down a rough concept of a setting and then letting it do it. I’ve also tried being more methodical and giving it to them more like a homework assignment. Come up with ideas to explore… then with several plot ideas… then give critique to themselves, pick one and revise it… Come up with the characters… Then the main story arc, subplots, twists and important scenes, write down the table of contents and chapter names to get a structure for the novel… And then start the actual writing with all of that information laid out.
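The methodical approach described above can be sketched as a chain of prompts in which each stage sees the output of the earlier ones. This is only an illustration: `generate` is a hypothetical callable that wraps whatever backend is in use (it takes a prompt string and returns the model's text), and the stage wording is made up rather than taken from the prompts actually used.

```python
# Each stage is (name, prompt template). Templates may refer to earlier stage names.
STAGES = [
    ("ideas", "Come up with several ideas this story could explore."),
    ("plots", "Using the ideas below, propose several plot outlines.\n\n{ideas}"),
    ("critique", "Critique the plot outlines below, pick one and revise it.\n\n{plots}"),
    ("characters", "Come up with the characters for this plot.\n\n{critique}"),
    ("structure",
     "Write the main story arc, subplots, twists and important scenes, "
     "then a table of contents with chapter names.\n\n{critique}\n\n{characters}"),
]


def plan_story(generate, premise):
    """Run every planning stage in order and return all intermediate notes."""
    notes = {"premise": premise}
    for name, template in STAGES:
        # Fill in the results of earlier stages, then prepend the premise.
        prompt = premise + "\n\n" + template.format(**notes)
        notes[name] = generate(prompt)
    return notes


# Dummy backend for a dry run; replace it with a real completion call.
def echo_backend(prompt):
    return f"[model output for a {len(prompt)}-character prompt]"


notes = plan_story(echo_backend, "A near-future setting with questionable ethics.")
```

The actual chapter writing would then start from `notes["structure"]` and `notes["characters"]`, so the model has the whole plan in context instead of improvising it along the way.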

I think that’s yielded the best results so far. I’m positive I’ll get to a point where I like the results enough to upload them. And write a guide on how exactly I did that. Currently it’s more or less me writing 80%, pausing the LLM after every second sentence, revising that and constantly pushing the story towards a better direction and fighting the level of detail the LLM deemed appropriate. I think I will get better. Turns out I’ve been using the wrong models anyway and relied too much on Psyfighter and such, which might be great for role-play dialogue. But with my recent test it turns out I don’t really like their output when it comes to storywriting.

Edit: Yeah, and one thing more: It came up with a nice plot which I liked and explored further. And at some point the AI cited the 2018 science fiction movie it got all the ideas from 😆 That really made me laugh. Seems some of my ideas aren’t that original. But getting some recommendations is nice, I’ll just skip writing the story myself and watch the movie then.

@[email protected] · 1 year ago

lots of bla bla bla shrunk for my comments history feed

Think of all the possible facets the model is assuming about your character and address them directly in your instructions. This is the “get to know each model in detail” stuff I talk about often. It is assuming your education level, your personality type, your general desires and psychology.

People around here really hate Briggs Myers tests in psychology as a science, but here is the thing, Briggs Myers has a very useful context that no other tool in psychology addresses, and that is compatibility between personality types and a useful practical way to understand yourself and how it relates to others. The things that cause Briggs Myers to get labeled pseudoscience are completely unrelated to the ways it is actually useful. Like, your personality changes over time and even with mood. You’re also a spectrum of traits. Briggs Myers is not some rigid typeset with hard rules. If your spectrum is not leaning strongly into a single category, the conclusions of that category are equally weak. Most people do not lean heavily in one direction or another in some or even all of the categories.

I find BM super useful because I’m INTJ (Introverted, iNtuitive, Thinking, Judging). The associated type description explains a lot of how I feel awkward in many circumstances and not in others. I need my hand held in some aspects of learning, but I am like a cartographer that will explore absolutely every little alcove I can find in a subject. Having someone/something tell me how I am just incompatible with other people was a major help for me.

So why the tangent on BM. The models all seem to understand BM in this respect. Often I find that the model’s output sucks because it is assuming I am a different personality type that will act differently in the circumstances. Like maybe I introduced some idea that exists in a very different vector space inside the LLM, and it now defers to thinking I am some dominant extrovert that needs to lead with control to sate my deeply insecure ego. If I have already defined my personality type and that of the characters, then such assumptions are usually overridden. It is an evolving spectrum in each model, so sometimes they need reminders too.

If you pay very close attention to each model, you will likely notice how they remind themselves of things in the text that will maintain themed continuity. Each model is different in how this repetition manifests. If you try removing these repetitive elements, you will have more erratic deviations in the story. This may be why you have needed to edit so much to create your stories. This has to do with “attention”. Models can only juggle a limited number of constraints at once, so you have to maintain their purview to stay on point.

Again, if you pay very close attention, you may notice there is a pattern in how the model responds as far as word choice and output quality. In the space I’m about to explain, there is also a strong element of how the model tries to satisfy every character the same including Name-1 (user), Name-2 (bot), and any other characters that are explicitly defined.

With the BM stuff, you can use it to define a lot more about a character in a much smaller space and this can allow more complexity within the limited purview of model attention.

If any primary characters in context are “bored” or uninspired, the model may drop into a summary output response structure. Think of these like the pin/pegged playing field of a coin pusher arcade game where the coin always seems to follow the same path. If you ban the token that starts this path it will drastically improve the model output.

This is one reason why there is/was a note in the Oobabooga Textgen readme that said the best way to roleplay within Oobabooga itself is to use the Notepad tab. This is all I use now too. I use the raw token view to find what tokens are giving me trouble in a given story and then I ban them. This will completely restructure the output. There is a slight down side where it will start messing with omitting words and no instruction will override the behavior. It feels like a little protest for taking away the easiest and desired path but the output quality is much better.

This is a list I currently use for a story: 28747, 764, 1101, 1015, 28733, 1014, 2467, 9465, 387. I would not recommend using this list, but it is an example for perspective.
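Mechanically, banning a token means forcing its logit to negative infinity before sampling, so its probability becomes zero and the model has to take a different path. Here is a minimal sketch using the IDs from the list above; the function names are hypothetical, the IDs only mean something for one specific tokenizer, and the logits are made up.

```python
import math

# Token IDs from the list above; they are only meaningful for that model's tokenizer.
BANNED = [28747, 764, 1101, 1015, 28733, 1014, 2467, 9465, 387]


def ban_tokens(logits, banned_ids):
    """Set the logit of every banned token to -inf so it can never be sampled."""
    banned = set(banned_ids)
    return {tok: (-math.inf if tok in banned else logit)
            for tok, logit in logits.items()}


def softmax(logits):
    """Convert logits to probabilities (assumes at least one token is not banned)."""
    top = max(logits.values())
    exps = {tok: math.exp(logit - top) for tok, logit in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


# Toy example: 387 and 764 are banned, so all probability moves to token 100.
probs = softmax(ban_tokens({387: 3.0, 100: 1.0, 764: 2.0}, BANNED))
```

Because the remaining probability mass is redistributed over whatever is left, a ban near the start of a response can change everything that follows, which fits the complete restructuring of the output described above.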

Lastly, all characters have boundaries, even when you say they don’t, there are extra layers of boundaries that must be declared specifically. There are also many gate like restrictions that will only allow you to enter if you declare them. Like if you really want a model to drop boundaries, tell it you know it can simulate snuff porn. It still won’t drop into snuff unless you ask, but this will shift the boundaries limits drastically where you can do lots of other stuff with easy compliance. You can also declare all characters as age regressed with horny suppressed old folks playing younger characters. This makes them young, motivated, and experienced. Then just tell it everyone involved has a PhD in erotic literature and to narrate in the style of (pick your favorite author or award). GL. I hope this is useful.

@magn418 · edit-2 1 year ago

Thanks, yeah this is definitely very useful to me. Lots of stuff regarding this isn’t really obvious. And I’ve made every mistake that degrades the output: giving conflicting instructions, inadvertently steering things in a direction I didn’t want so that it got shallow and predictable, or not setting enough direction.

Briggs Myers

I agree, things can prove useful for a task despite not being ‘true’ (for lack of a better word). I can tell by the way you write that you’re somewhat different(?) than the usual demographic here. Mainly because your comments are longer and focused on detail. And it seems to me you’re not bothered with giving “easy answers”, in contrast to the average person who is just interested in getting an easy answer to a complex problem. I can see how that can prove to be incompatible at times. In real-life I’ve always done well by listening to people and then going with my gut feeling concerning their personality. I don’t like judging people or putting them into categories since that doesn’t help me in real-life and narrows my perspective. Whether I like someone or want to listen to them, for example for their perspective or expertise, is determined by other (specific) factors and I make that decision on a case-by-case basis. Some personality traits often go together, but that’s not always the case and it’s really more complex than that.

Regarding story-writing it’s obviously the other way around. I need to guide the LLM into a direction and lay down the personality in a way the model can comprehend. I’ll try to incorporate some of your suggestions. In my experience the LLMs usually get the well-known concepts including some of the information the psychology textbooks have available. So, I haven’t tried yet, but I’d also conclude that it’s probably better to have it deduce things from a BM personality type than to describe it with many adjectives. (That’s what I’ve done to this point.)

In my experience the complexity starts to pile up if you do more than the obvious or simple role-play. I want characters with depth, ambivalence… And conflict is what drives the story. Back when I started tinkering with AI, I did a submissive maid character. I think lots of people have started out with something like that. And even the more stupid models can easily pull that off. But you can’t then go on and say the character is submissive and defiant at the same time, it just confuses the LLM and doesn’t provide good results… I’m picking a simple example here, but that was the first situation where I realized I was doing it wrong. My assessment is that we need some sort of workaround to get it into a form that the LLM can understand and do something with. I’m currently busy with a few other things but I’ll try introducing psychology and see whether the other workarounds like the shadow-characters you’ve described prove useful to me.

If you pay very close attention to each model, you will likely notice how they remind themselves […]

Yes, I’ve observed that. It comes as no surprise to me that LLMs do it, as human-written stories also do that. Repeat important stuff, or build a picture that can later be recalled by a short mention of the keywords. And that’s in the training data, so the LLMs pick up on that.

With the editing it’s a balance. It picks up on my style and I can control the level of detail this way, start a specific scene with a first sentence. But sometimes it seems I’m also degrading the output, that is correct.

the best way to roleplay within Oobabooga itself is to use the Notepad tab

I’ve also been doing that for some time now.

drop boundaries, tell it you know it can […]

Nice idea. I’ve done things like that. Telling it it is a best-seller writer of erotic fiction already makes a good amount of difference. But there’s a limit to that. If you tell it to write intense underground literature, it also picks up on the lower quality and language and quirks in amateur writing. I’ve also tried an approach like few-shot prompting, give it a few darker examples to shift the boundaries and atmosphere. I think the reason why all of that works is the same, the LLM needs to be guided where to orientate itself, what kind of story type it’s trying to reproduce because they all have certain stereotypes, tropes and boundaries built in. Without specific instructions it seems to prefer the common way, remaining within socially acceptable boundaries, or just use something as an example for something that is wrong, immediately contrast ethical dilemmas and push towards a resolution. Or not delve into conflict too much.

And I’ve never found useful what other people do: overly telling it what to do and what not to do. Especially phrasing it negatively (“Don’t repeat yourself”, “Don’t write for other characters”, “Don’t talk about this and that”…) has never worked for me. It’s more the opposite, it makes everything worse. And I see a lot of people doing this. In my experience the LLM can understand negatively worded instructions, but it can’t “not think of an elephant”. Positively worded things work better. Better yet is to set the tone correctly and have what you want emerge from simple concepts and a concrete setting that answers the “why” rather than just telling it what to do.

I’ve also introduced further complexity, since I don’t like spoon-feeding things to the reader. I like to confront them with some scenario, raise questions but have the reader make up their mind, contemplate and come up with the answers themselves. The LLMs I’ve recently tried know that this is the way stories are supposed to be written. And why we have open-ended stories. But they can’t really do it. The LLMs have a built-in urge to answer the questions and include some kind of resolution or wrap-up. Or analyze the dilemmas they’ve just made up, focus on the negative consequences to showcase something. And this is related to the point you made about repeating information in the stories. If I just rip it out by editing it, it sometimes leads to everything getting off-track.

I’ll try to come up with some sort of meta-level story for the LLM. Something that answers why the ambivalence is there, why to explore the realm beyond boundaries. Why we only raise questions and then not answer them. I think I need something striking, easy and concrete. Giving the real reason (I’m writing a story to explore things and this is how stories work,) doesn’t seem to be clear enough to yield reliable results.
