Class MarkovText
java.lang.Object
com.github.yellowstonegames.text.MarkovText
A simple Markov chain text generator; call
analyze(CharSequence) once on a large sample text, then you can
call chain(long) many times to get odd-sounding "remixes" of the sample text. This is an order-2 Markov
chain, so it chooses the next word based on the previous two words. This is meant to allow easy
serialization of the necessary data to call chain(); if you can store the words and processed
arrays in some serialized form, then you can reassign them to the same fields to avoid calling analyze(). One way to
do this conveniently is to use stringSerialize() after calling analyze() once and to save the resulting
String; then, rather than calling analyze() again on future runs, you would call
stringDeserialize(String) to create the MarkovText without needing any repeated analysis.
-
Field Summary
Fields
Modifier and Type, Field - Description
com.github.tommyettinger.ds.IntIntMap pairs - Map of all pairs of words encountered to the position in the order they were encountered.
int[][] processed - Complicated data that mixes probabilities of words using their indices in words and the indices of word pairs in pairs, generated during the latest call to analyze(CharSequence).
String[] words - All words (case-sensitive and counting some punctuation as part of words) that this MarkovText encountered during the latest call to analyze(CharSequence).
-
Constructor Summary
Constructors
Constructor - Description
MarkovText() - Creates an empty MarkovText; you should call analyze(CharSequence) before doing anything else with this new object.
-
Method Summary
Modifier and Type, Method - Description
void analyze(CharSequence corpus) - This is the main necessary step before using a MarkovText; you must call this method at some point before you can call any other methods.
String chain(long seed) - Generate a roughly-sentence-sized piece of text based on the previously analyzed corpus text (using analyze(CharSequence)) that terminates when stop punctuation is used (".", "!", "?", or "..."), or once the length would be greater than 200 characters without encountering stop punctuation (it terminates such a sentence with "." or "...").
String chain(long seed, int maxLength) - Generate a roughly-sentence-sized piece of text based on the previously analyzed corpus text (using analyze(CharSequence)) that terminates when stop punctuation is used (".", "!", "?", or "...") or once the maxLength would be exceeded by any other words (it terminates such a sentence with "." or "...").
void changeNames(Translator translator) - After calling analyze(CharSequence), you can optionally call this to alter any words in this MarkovText that were used as a proper noun (determined by whether they were capitalized in the middle of a sentence), changing them to a ciphered version using the given Translator.
void changeNames(Translator translator, Collection<? extends CharSequence> names) - After calling analyze(CharSequence), you can optionally call this to alter any words in this MarkovText that are present in the given Collection, changing them to a ciphered version using the given Translator.
MarkovText copy() - Copies the String array words and the 2D jagged int array processed into a new MarkovText.
static MarkovText stringDeserialize(String data) - Recreates an already-analyzed MarkovText given a String produced by stringSerialize().
String stringSerialize() - Returns a representation of this MarkovText as a String; use stringDeserialize(String) to get a MarkovText back from this String.
-
Field Details
-
words
All words (case-sensitive and counting some punctuation as part of words) that this MarkovText encountered during the latest call to analyze(CharSequence). Will be null if analyze(CharSequence) was never called.
-
pairs
public com.github.tommyettinger.ds.IntIntMap pairs
Map of all pairs of words encountered to the position in the order they were encountered. Pairs are stored using their 16-bit words indices placed into the most-significant bits for the first word and the least-significant bits for the second word. The size of this IntIntMap is likely to be larger than the String array words, but should be equal to processed.length. Will be null if analyze(CharSequence) was never called.
-
processed
public int[][] processed
Complicated data that mixes probabilities of words using their indices in words and the indices of word pairs in pairs, generated during the latest call to analyze(CharSequence). This is a jagged 2D array. Will be null if analyze(CharSequence) was never called.
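The bit layout described for pairs can be illustrated with a small sketch. The class and method names here (WordPairKey, pack, first, second) are hypothetical and are not part of MarkovText; the sketch only assumes what the description states, namely two 16-bit indices into words, with the first word in the most-significant bits and the second word in the least-significant bits of one int.

```java
/** Hypothetical helper for illustration; not part of MarkovText. */
final class WordPairKey {
    /** Packs two word indices (each 0..65535) into one int key. */
    static int pack(int first, int second) {
        // first index goes in the upper 16 bits, second index in the lower 16 bits
        return (first & 0xFFFF) << 16 | (second & 0xFFFF);
    }

    /** Recovers the index of the first word from a packed key. */
    static int first(int key) {
        return key >>> 16; // unsigned shift, so large indices stay non-negative
    }

    /** Recovers the index of the second word from a packed key. */
    static int second(int key) {
        return key & 0xFFFF;
    }

    public static void main(String[] args) {
        int key = pack(3, 5);
        System.out.println(Integer.toHexString(key));        // prints 30005
        System.out.println(first(key) + " " + second(key));  // prints 3 5
    }
}
```

With 16 bits per index, a key built this way can distinguish at most 65,536 different words.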
-
-
Constructor Details
-
MarkovText
public MarkovText()
Creates an empty MarkovText; you should call analyze(CharSequence) before doing anything else with this new object.
-
-
Method Details
-
analyze
This is the main necessary step before using a MarkovText; you must call this method at some point before you can call any other methods. You can serialize this MarkovText after calling this method to avoid needing to call it again on later runs, or even include serialized MarkovText objects with a game so that this only needs to be called during pre-processing. This method analyzes the pairings of words in a (typically large) corpus text, including some punctuation as part of words and some kinds as their own "words." Because this is an order-2 chain, it uses the two preceding words to determine the subsequent word. When it finishes processing, it stores the results in words and processed, which allows other methods to be called (they will throw a NullPointerException if analyze() hasn't been called).
Parameters:
corpus - a typically-large sample text in the style that should be mimicked
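To make the idea of choosing the next word from the two preceding words concrete, here is a minimal, self-contained sketch of an order-2 word chain. It is not the library's implementation: it uses a HashMap of lists instead of the words, pairs, and processed fields, it always starts from the first two tokens of the corpus, and the class name OrderTwoSketch and the maxWords parameter are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Hypothetical order-2 chain for illustration; not the MarkovText implementation. */
final class OrderTwoSketch {
    // maps "first second" to every word that followed that pair in the corpus
    private final Map<String, List<String>> followers = new HashMap<>();
    private String[] tokens = new String[0];

    void analyze(CharSequence corpus) {
        followers.clear();
        tokens = corpus.toString().trim().split("\\s+");
        for (int i = 0; i + 2 < tokens.length; i++) {
            String pair = tokens[i] + " " + tokens[i + 1];
            followers.computeIfAbsent(pair, k -> new ArrayList<>()).add(tokens[i + 2]);
        }
    }

    String chain(long seed, int maxWords) {
        if (tokens.length < 2) return "";
        Random random = new Random(seed); // same seed, same output
        String a = tokens[0], b = tokens[1];
        StringBuilder sb = new StringBuilder(a).append(' ').append(b);
        for (int n = 2; n < maxWords; n++) {
            List<String> options = followers.get(a + " " + b);
            if (options == null) break; // this pair was never followed by anything
            String next = options.get(random.nextInt(options.size()));
            sb.append(' ').append(next);
            a = b;
            b = next;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        OrderTwoSketch sketch = new OrderTwoSketch();
        sketch.analyze("the cat sat on the mat");
        // every pair in this tiny corpus has exactly one follower, so the seed does not matter
        System.out.println(sketch.chain(1L, 10)); // prints the cat sat on the mat
    }
}
```

Storing repeated followers in the list means a word that follows a pair more often is proportionally more likely to be picked, which is one simple way to represent the probabilities the real class keeps in processed.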
-
changeNames
After calling analyze(CharSequence), you can optionally call this to alter any words in this MarkovText that were used as a proper noun (determined by whether they were capitalized in the middle of a sentence), changing them to a ciphered version using the given Translator. Normally you would initialize a Translator with a Language that matches the style you want for all names in this text, then pass that to this method during pre-processing (not necessarily at runtime, since this method isn't especially fast if the corpus was large). This method modifies this MarkovText in-place.
Parameters:
translator - a Translator that will be used to translate proper nouns in this MarkovText's word array
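The rule "capitalized in the middle of a sentence" can be sketched as follows. This is only a guess at the general idea, not the check that changeNames(Translator) actually performs; the class name ProperNounSketch, the whitespace tokenizing, and the way trailing punctuation is stripped are all assumptions for this example.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical proper-noun finder for illustration; not the MarkovText logic. */
final class ProperNounSketch {
    /** Returns words that start with an uppercase letter but do not start a sentence. */
    static Set<String> find(String text) {
        Set<String> found = new LinkedHashSet<>();
        String[] tokens = text.trim().split("\\s+");
        // start at 1: the very first token always begins a sentence
        for (int i = 1; i < tokens.length; i++) {
            String prev = tokens[i - 1];
            boolean sentenceStart = prev.endsWith(".") || prev.endsWith("!") || prev.endsWith("?");
            String word = tokens[i].replaceAll("\\p{Punct}+$", ""); // drop trailing punctuation
            if (!sentenceStart && !word.isEmpty() && Character.isUpperCase(word.charAt(0))) {
                found.add(word);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(find("Yesterday we met Alice. She knows Bob, and Bob knows Carol!"));
        // prints [Alice, Bob, Carol]
    }
}
```

A rule this simple also flags words such as the English pronoun "I", which is one reason passing an explicit Collection of names to the other overload can be the safer choice.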
-
changeNames
After calling analyze(CharSequence), you can optionally call this to alter any words in this MarkovText that are present in the given Collection, changing them to a ciphered version using the given Translator. Normally you would initialize a Translator with a Language that matches the style you want for all names in this text, then pass that to this method during pre-processing (not necessarily at runtime, since this method isn't especially fast if the corpus was large). This method modifies this MarkovText in-place.
A good way to use this when you have some group of names, but might encounter them in different cases (such as ALL CAPS in a header, or Capitalized Like A Name), is to use a CaseInsensitiveSet as the Collection. It won't handle punctuation next to a word, though; a somewhat different approach is needed for that.
Parameters:
translator - a Translator that will be used to translate proper nouns in this MarkovText's word array
names - a Collection of names that, if a word from the word array is also in names, will be translated
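The case-insensitive matching and its punctuation limitation can be demonstrated with the standard library alone. In this sketch a TreeSet ordered by String.CASE_INSENSITIVE_ORDER stands in for CaseInsensitiveSet, and stripPunctuation is a hypothetical helper showing one possible way to handle punctuation next to a word; neither is taken from the library.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical name-matching demo; TreeSet stands in for CaseInsensitiveSet. */
final class NameMatching {
    /** Builds a set whose contains() ignores case. */
    static Set<String> caseInsensitive(String... names) {
        Set<String> set = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        for (String name : names) set.add(name);
        return set;
    }

    /** Removes ASCII punctuation from both ends of a word. */
    static String stripPunctuation(String word) {
        return word.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
    }

    public static void main(String[] args) {
        Set<String> names = caseInsensitive("Alice", "Bob");
        System.out.println(names.contains("ALICE"));                    // prints true
        System.out.println(names.contains("Alice,"));                   // prints false
        System.out.println(names.contains(stripPunctuation("Alice,"))); // prints true
    }
}
```

The second lookup fails because the comma is part of the word being tested, which matches the limitation described above.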
-
chain
Generate a roughly-sentence-sized piece of text based on the previously analyzed corpus text (using analyze(CharSequence)) that terminates when stop punctuation is used (".", "!", "?", or "..."), or once the length would be greater than 200 characters without encountering stop punctuation (it terminates such a sentence with "." or "...").
Parameters:
seed - the seed for the random decisions this makes, as a long; any long can be used
Returns:
a String generated from the analyzed corpus text's word placement, usually a small sentence
-
chain
Generate a roughly-sentence-sized piece of text based on the previously analyzed corpus text (using analyze(CharSequence)) that terminates when stop punctuation is used (".", "!", "?", or "...") or once the maxLength would be exceeded by any other words (it terminates such a sentence with "." or "...").
Parameters:
seed - the seed for the random decisions this makes, as a long; any long can be used
maxLength - the maximum length for the generated String, in number of characters
Returns:
a String generated from the analyzed corpus text's word placement, usually a small sentence
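The two stopping conditions described for chain can be sketched in isolation. This is an assumption-laden illustration, not the library's code: the class name SentenceLimiter is hypothetical, stop punctuation is assumed to be attached to the end of a word, and an over-long sentence is always closed with "." rather than "...".

```java
/** Hypothetical length limiter for illustration; not the MarkovText implementation. */
final class SentenceLimiter {
    /** Joins words until one ends with stop punctuation or maxLength would be exceeded. */
    static String limit(String[] words, int maxLength) {
        StringBuilder sb = new StringBuilder();
        for (String word : words) {
            // length after appending this word, including the separating space
            int needed = sb.length() == 0 ? word.length() : sb.length() + 1 + word.length();
            boolean stops = word.endsWith(".") || word.endsWith("!") || word.endsWith("?");
            // reserve one character for a closing "." if this word does not end the sentence
            if (needed + (stops ? 0 : 1) > maxLength) break;
            if (sb.length() > 0) sb.append(' ');
            sb.append(word);
            if (stops) return sb.toString(); // stop punctuation ends the sentence
        }
        return sb.append('.').toString(); // ran out of room or words: close the sentence
    }

    public static void main(String[] args) {
        String[] words = {"The", "quick", "brown", "fox."};
        System.out.println(limit(words, 200)); // prints The quick brown fox.
        System.out.println(limit(words, 10));  // prints The quick.
    }
}
```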
-
stringSerialize
Returns a representation of this MarkovText as a String; use stringDeserialize(String) to get a MarkovText back from this String. The words and processed fields must have been given values by direct assignment, by calling analyze(CharSequence), or by building this MarkovText with stringDeserialize(String). Uses spaces to separate words and a tab to separate the two fields.
Returns:
a String that can be used to store the analyzed words and frequencies in this MarkovText
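A format with spaces between words and a tab between the two fields could look like the sketch below. Only those two separators come from the description; how the jagged int array is written (commas within a row, semicolons between rows) and the names SpaceTabFormat, wordsOf, and processedOf are assumptions invented for this example, so the real output of stringSerialize() may differ.

```java
import java.util.Arrays;

/** Hypothetical serialization format for illustration; not the MarkovText format. */
final class SpaceTabFormat {
    /** Writes words separated by spaces, then a tab, then the jagged array. */
    static String serialize(String[] words, int[][] processed) {
        StringBuilder sb = new StringBuilder(String.join(" ", words)).append('\t');
        for (int i = 0; i < processed.length; i++) {
            if (i > 0) sb.append(';');          // assumed row separator
            for (int j = 0; j < processed[i].length; j++) {
                if (j > 0) sb.append(',');      // assumed value separator
                sb.append(processed[i][j]);
            }
        }
        return sb.toString();
    }

    /** Reads back the words stored before the tab. */
    static String[] wordsOf(String data) {
        return data.substring(0, data.indexOf('\t')).split(" ");
    }

    /** Reads back the jagged array stored after the tab. */
    static int[][] processedOf(String data) {
        String section = data.substring(data.indexOf('\t') + 1);
        // an empty section is read as zero rows; a lone empty row cannot be told apart here
        if (section.isEmpty()) return new int[0][];
        String[] rows = section.split(";", -1);
        int[][] result = new int[rows.length][];
        for (int i = 0; i < rows.length; i++) {
            result[i] = rows[i].isEmpty() ? new int[0]
                    : Arrays.stream(rows[i].split(",")).mapToInt(Integer::parseInt).toArray();
        }
        return result;
    }

    public static void main(String[] args) {
        String data = serialize(new String[] {"Hello", "world."}, new int[][] {{1, 2}, {3}});
        System.out.println(Arrays.toString(wordsOf(data)));         // prints [Hello, world.]
        System.out.println(Arrays.deepToString(processedOf(data))); // prints [[1, 2], [3]]
    }
}
```

A space-separated word list works here because words produced by splitting on whitespace cannot themselves contain spaces or tabs.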
-
stringDeserialize
Recreates an already-analyzed MarkovText given a String produced by stringSerialize().
Parameters:
data - a String returned by stringSerialize()
Returns:
a MarkovText that is ready to generate text with chain(long)
-
copy
Copies the String array words and the 2D jagged int array processed into a new MarkovText. None of the arrays will be equivalent references, but the Strings (being immutable) will be the same objects in both MarkovText instances. This is primarily useful with changeNames(Translator), which can produce several variants on names given several initial copies produced with this method.
Returns:
a copy of this MarkovText
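The copy semantics described here (new arrays, shared immutable Strings) can be shown with plain arrays. The class name CopySketch and the deepCopy helper are hypothetical; this is a sketch of the general technique, not the body of copy().

```java
/** Hypothetical copy demo for illustration; not the MarkovText implementation. */
final class CopySketch {
    /** Copies a jagged array so that no row is shared with the source. */
    static int[][] deepCopy(int[][] source) {
        int[][] copy = new int[source.length][];
        for (int i = 0; i < source.length; i++) {
            copy[i] = source[i].clone(); // each row gets its own array
        }
        return copy;
    }

    public static void main(String[] args) {
        String[] words = {"Alice", "waved."};
        int[][] processed = {{1, 2, 3}, {4}};

        String[] wordsCopy = words.clone();          // new array, same String objects
        int[][] processedCopy = deepCopy(processed);

        wordsCopy[0] = "Zhara";      // renaming in the copy...
        processedCopy[0][0] = 99;

        System.out.println(words[0]);                 // prints Alice
        System.out.println(processed[0][0]);          // prints 1
        System.out.println(wordsCopy[1] == words[1]); // prints true
    }
}
```

Because the arrays are independent, a name change applied to one copy leaves the original and any other copies untouched.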
-