I was a bit of a fan of House MD when it was on TV, and one day not long ago I stumbled upon a webpage with transcripts of every episode. So to practice some text manipulation and web retrieval skills, I decided to write a (very simple) Markov chain text generator. It produces short lines of text that are intended to sound like something House might have said.
- “Listen, Vogler’s suggestion was idiotic — we’re amazing.”
- “I need to look there, cut it off.”
- “It’s a parasite. Because if we can confirm an infection from the bus.”
- “Hi. Guess I should have told her that every day should be naked pictures.”
- “What did I fire you. BOO! Just tell them how appalling the doctor.”
- “Cameron? Surgery is supposed to be a man.”
- “Now I sobered up and he’s the treatment.”
- “MRI to rule out hep-E. Because it’s why I get it.”
- “Take me back just a solid gold, black president.”
So here’s how it works. Everything House ever said is extracted and stored in a long string of text, which is broken into word pairs. For example, in the episode called “Distractions,” House says…
What have you got a date or something? 40% of his body, if the burns unit can prevent an infection, his body will regenerate maybe 10%, surgeons will do 20 or so grafts, 6 months in this room he’ll end up with a series of nasty scars, maybe some pain but he’ll live.
This would be broken into word pairs (What, have) and (have, you) and (you, got) and (got, a) and (a, date) and so forth. The entire corpus of everything House ever said is broken up this way.
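As a rough illustration of that splitting step, here is a minimal Python sketch. It is not the code from the GitHub repository: the function names `tokenize` and `word_pairs` and the tokenizing regular expression are assumptions made for this example. The only behavior taken from the description is that text becomes consecutive word pairs and that punctuation marks count as words.

```python
import re

def tokenize(text):
    # Assumed tokenizer: a run of word characters (plus apostrophes,
    # hyphens, and % signs) is one token, and any other non-space
    # character, such as a period or comma, is its own token.
    return re.findall(r"[\w'’%-]+|[^\w\s]", text)

def word_pairs(tokens):
    # Pair every token with the token that follows it.
    return list(zip(tokens, tokens[1:]))

sample = "What have you got a date or something?"
pairs = word_pairs(tokenize(sample))
print(pairs[:5])
# [('What', 'have'), ('have', 'you'), ('you', 'got'), ('got', 'a'), ('a', 'date')]
```

Because the question mark is its own token, the last pair for this sentence is (something, ?).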
To begin generating text, a random word pair is chosen from among all pairs that begin with a capitalized word. From then on, each new word is added by looking up every occurrence of the most recent word pair in the corpus and choosing, at random, one of the pairs that follow it. For example, in the middle of one of the samples listed above, the word pair (I, should) was used, and among all the word pairs that ever follow that one, the pair (should, have) was chosen. Among all the pairs that ever follow (should, have), the pair (have, told) was chosen. This process continues until the text is at least 10 words long (punctuation marks are counted as “words”) and then stops as soon as a period, exclamation point, or question mark is reached.
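The generation loop described above could be sketched in Python roughly as follows. This is an illustrative reconstruction under stated assumptions, not the repository's implementation: the names `build_chain` and `generate`, the tokenizer, the `max_tokens` safety cap, and the choice to stop early when a pair has no recorded follower are all made up for this example. Storing the follower of each pair is equivalent to storing the next pair, since consecutive pairs overlap by one word.

```python
import random
import re
from collections import defaultdict

END_MARKS = {".", "!", "?"}

def tokenize(text):
    # Assumed tokenizer: words and punctuation marks are separate tokens.
    return re.findall(r"[\w'’%-]+|[^\w\s]", text)

def build_chain(tokens):
    # Map each word pair to every token that follows it in the corpus.
    # Duplicates are kept, so frequent followers are chosen more often
    # (an assumption; the description only says "at random").
    chain = defaultdict(list)
    for first, second, third in zip(tokens, tokens[1:], tokens[2:]):
        chain[(first, second)].append(third)
    return chain

def generate(chain, min_tokens=10, max_tokens=60, rng=random):
    # Start from a random pair whose first token is capitalized.
    starts = [pair for pair in chain if pair[0][:1].isupper()]
    if not starts:
        raise ValueError("corpus has no pair starting with a capitalized word")
    out = list(rng.choice(starts))
    while len(out) < max_tokens:
        # Stop once the line is long enough and ends a sentence.
        if len(out) >= min_tokens and out[-1] in END_MARKS:
            break
        followers = chain.get((out[-2], out[-1]))
        if not followers:
            break  # dead end: this pair only occurs at the end of the corpus
        out.append(rng.choice(followers))
    return out

# Tiny made-up corpus, used only to exercise the functions.
corpus = ("I should have told her. I should have known better. "
          "She said I should have told her that.")
chain = build_chain(tokenize(corpus))
line = " ".join(generate(chain))
# Reattach punctuation marks to the word before them.
print(re.sub(r"\s+([^\w\s])", r"\1", line))
```

With a corpus this small the output mostly repeats the input; the variety in the samples above comes from having every line House ever said to draw on.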
My code for this is on GitHub.