Simple Markov chains have been the workhorse of blogposts and websites about text generation for many years. They are suprisingly effective at producing quite realistic sentences trained from a quite small corpus. A year ago I wrote a simple twitter bot trained on Trump’s tweet history and it was retweeted and favourited a suprising amount, and often by people who hadn’t realized the joke. Some of this can definitely be chalked up to bots retweeting bots - but for me it was certainly a demonstration of their effectiveness.
In terms of state of the art text generation, they’ve been surpassed by LTSMs and other RNN variants that are able to learn to condition themselves on longer or shorter sequences with different weights or do things like include noise in the activations to allow for ‘true’ creativity - but the simple Markov chain model still has it’s appeal. The subject of a later post may be a more formal comparison between what is happening in this style of Markov chain vs. an RNN model.
In this post I will extend the normal application of Markov chains by building novel words from two seperate corpora, but first a quick review:
The logic of Markov chains for text generation is quite simple - the theory is that by looking at the frequency that words or character n-grams follow each other in the corpus, we can create a probabalistic model of it. What the ‘Markov’ part of the name means is that it’s a Markov process (memoryless) - so the prediction only depends on the current state. In this instance that means that we have a sliding window approach to the state, as it is only the last n-1 characters of the string generated - it doesn’t care what path has brought the code to where it is, it simply looks at the previous 2 characters (in the case we’re using tri-grams) and says “in the training corpus, these 2 characters were followed by an e 25% of the time and an a 75% of the time”, and makes a selection based on that probability.
Portmanteaus
So what I want to do is create a program that can take 2 sets of words and create novel combinations of them (or new words drawn from a similar distribution) that flow together smoothly.
So in terms of the Markov chain, what I want to do is this:
- Start off with a walk down a chain from one corpus.
- Continue this walk until I hit a state that occurs in the 2nd corpus’s chain.
- Transition to walking down a chain from the 2nd corpus, starting at this state.
- If the transition state isn’t found or a transition made, start over.
The Code
A few quick notes on this code:
-
it’s not highly optimized. The topic of a future post may be re-writing this for dealing with very large corpora quickly (likely using numpy arrays as transition tables instead of the dict structure, and better I/O reading.)
-
I’ve used generators at most points, so it should already be able to handle pretty unreasonably long strings without memory constraint issues. The other advantage of this approach is that it allows us to naturally walk down the first chain until we don’t need it anymore.
-
the interface isn’t quite polished - a good idea for extending this code would be to allow for a custom weight function for predict and better utilities for loading the corpus in different forms (different seperators, file formats, etc.). If refactoring I’d probably reorganize the way the functions are structed as well (lots of this logic could just happen in _init_ and still be quite readable, but when I started writing this I wasn’t totally sure where I was going with it.)
And here’s a quick and dirty function that will handle the logic for transitioning between chains. Right now it doesn’t use the frequency information for deciding when to transition, but just walks down the first chain until it finds any overlap.
The Results
Here are some example outputs given different training corpora. These are all done on 4-gram chains (meaning it looks for a shared state of length 3 to transition.) Using corpora with distinct patterns makes the most obvious use of this technique, which is why I’ve selected the ones I have here.
Countries+Pokemon | Fruits+Countries | Countries+Elements |
---|---|---|
Yemence | Lychelles | Luxembodium |
Cubonee | Duritan | Afghanum |
Ghandour | Prunei | Guine |
Kyrgyzsta | Blackcurra Leone | Philicon |
Grenipede | Mangladesh | Kyrgyzstatine |