Package squidpony

Class MarkovText

java.lang.Object
squidpony.MarkovText
All Implemented Interfaces:
Serializable

public class MarkovText
extends Object
implements Serializable
A simple Markov chain text generator; call analyze(CharSequence) once on a large sample text, then you can call chain(long) many times to get odd-sounding "remixes" of the sample text. This is an order-2 Markov chain, so it chooses the next word based on the previous two words; MarkovTextLimited is an order-1 Markov chain, and is faster, but produces lousy output because it only uses one previous word. This is meant to allow easy serialization of the necessary data to call chain(); if you can store the words and processed arrays in some serialized form, then you can reassign them to the same fields to avoid calling analyze(). One way to do this conveniently is to use serializeToString() after calling analyze() once and to save the resulting String; then, rather than calling analyze() again on future runs, you would call deserializeFromString(String) to create the MarkovText without needing any repeated analysis.
Created by Tommy Ettinger on 1/30/2018.
See Also:
Serialized Form
  • Field Summary

    Fields 
    Modifier and Type Field Description
    IntIntOrderedMap pairs
    Map of all pairs of words encountered to the position in the order they were encountered.
    int[][] processed
    Complicated data that mixes probabilities of words using their indices in words and the indices of word pairs in pairs, generated during the latest call to analyze(CharSequence).
    String[] words
    All words (case-sensitive and counting some punctuation as part of words) that this encountered during the latest call to analyze(CharSequence).
  • Constructor Summary

    Constructors 
    Constructor Description
    MarkovText()  
  • Method Summary

    Modifier and Type Method Description
    void analyze(CharSequence corpus)
    This is the main necessary step before using a MarkovText; you must call this method at some point before you can call any other methods.
    String chain(long seed)
    Generate a roughly-sentence-sized piece of text based on the previously analyzed corpus text (using analyze(CharSequence)) that terminates when stop punctuation is used (".", "!", "?", or "..."), or once the length would be greater than 200 characters without encountering stop punctuation (it terminates such a sentence with "." or "...").
    String chain(long seed, int maxLength)
    Generate a roughly-sentence-sized piece of text based on the previously analyzed corpus text (using analyze(CharSequence)) that terminates when stop punctuation is used (".", "!", "?", or "...") or once adding another word would exceed maxLength (it terminates such a sentence with "." or "...").
    void changeNames(NaturalLanguageCipher translator)
    After calling analyze(CharSequence), you can optionally call this to alter any words in this MarkovText that were used as a proper noun (determined by whether they were capitalized in the middle of a sentence), changing them to a ciphered version using the given NaturalLanguageCipher.
    MarkovText copy()
    Copies the String array words and the 2D jagged int array processed into a new MarkovText.
    static MarkovText deserializeFromString(String data)
    Recreates an already-analyzed MarkovText given a String produced by serializeToString().
    String serializeToString()
    Returns a representation of this MarkovText as a String; use deserializeFromString(String) to get a MarkovText back from this String.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • words

      public String[] words
      All words (case-sensitive and counting some punctuation as part of words) that this encountered during the latest call to analyze(CharSequence). Will be null if analyze(CharSequence) was never called.
    • pairs

      Map of all pairs of words encountered to the position in the order they were encountered. Each pair is stored as a single int key built from the two words' 16-bit indices in words: the first word's index occupies the most-significant 16 bits and the second word's index occupies the least-significant 16 bits. The size of this IntIntOrderedMap is likely to be larger than the String array words, but should be equal to processed.length. Will be null if analyze(CharSequence) was never called.
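The packing described for pairs can be illustrated with a small sketch. PairKeySketch and its methods are hypothetical names used only for illustration; they are not part of MarkovText, and the library's own code may build its keys differently.

```java
/** Hypothetical helper (not part of MarkovText): packs two 16-bit word indices into one int key. */
final class PairKeySketch {
    /** First word's index goes in the upper 16 bits, second word's index in the lower 16 bits. */
    static int pack(int first, int second) {
        return ((first & 0xFFFF) << 16) | (second & 0xFFFF);
    }

    /** Recovers the first word's index; the unsigned shift keeps large indices non-negative. */
    static int first(int key) {
        return key >>> 16;
    }

    /** Recovers the second word's index from the lower 16 bits. */
    static int second(int key) {
        return key & 0xFFFF;
    }

    public static void main(String[] args) {
        int key = pack(3, 5);
        System.out.println(key + " -> " + first(key) + ", " + second(key));
    }
}
```

If keys are built this way, each word index has to fit in 16 bits, which would cap a corpus at 65536 distinct words; that limit is an inference from the description above rather than something this page states.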
    • processed

      public int[][] processed
      Complicated data that mixes probabilities of words using their indices in words and the indices of word pairs in pairs, generated during the latest call to analyze(CharSequence). This is a jagged 2D array. Will be null if analyze(CharSequence) was never called.
  • Constructor Details

  • Method Details

    • analyze

      public void analyze(CharSequence corpus)
      This is the main necessary step before using a MarkovText; you must call this method at some point before you can call any other methods. You can serialize this MarkovText after calling this method to avoid needing to call it again on later runs, or even include serialized MarkovText objects with a game so that this only needs to be called during pre-processing. This method analyzes the pairings of words in a (typically large) corpus text, including some punctuation as part of words and some kinds as their own "words." Because this is an order-2 Markov chain, it uses the two preceding words to determine the subsequent word. When it finishes processing, it stores the results in words and processed, which allows other methods to be called (they will throw a NullPointerException if analyze() hasn't been called).
      Parameters:
      corpus - a typically-large sample text in the style that should be mimicked
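To make the order-2 idea concrete, here is a minimal, self-contained sketch of a chain that picks each word from the two words before it. It is not MarkovText's implementation: OrderTwoChainSketch, its fields, the whitespace-only tokenizing, the 50-word cap, and the use of java.util.Random are all assumptions made for illustration, and the real class stores its data in words, pairs and processed instead of a map of lists.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Hypothetical order-2 Markov chain; an illustration only, not the library's code. */
final class OrderTwoChainSketch {
    // Maps "previous two words" (joined by one space) to every word seen right after that pair.
    private final Map<String, List<String>> followers = new HashMap<>();
    // Pair keys in first-seen order, so the starting pair depends only on the seed.
    private final List<String> keys = new ArrayList<>();

    void analyze(CharSequence corpus) {
        String[] tokens = corpus.toString().trim().split("\\s+");
        for (int i = 0; i + 2 < tokens.length; i++) {
            String key = tokens[i] + ' ' + tokens[i + 1];
            List<String> options = followers.get(key);
            if (options == null) {
                options = new ArrayList<>();
                followers.put(key, options);
                keys.add(key);
            }
            // Repeated followers are kept, so frequent continuations are picked more often.
            options.add(tokens[i + 2]);
        }
    }

    String chain(long seed) {
        if (keys.isEmpty()) return "";
        Random rng = new Random(seed);
        String[] pair = keys.get(rng.nextInt(keys.size())).split(" ");
        String a = pair[0], b = pair[1];
        StringBuilder sb = new StringBuilder(a).append(' ').append(b);
        // Stop at sentence-ending punctuation, at a pair with no known follower, or after 50 words.
        for (int i = 0; i < 50 && !endsSentence(b); i++) {
            List<String> options = followers.get(a + ' ' + b);
            if (options == null) break;
            String next = options.get(rng.nextInt(options.size()));
            sb.append(' ').append(next);
            a = b;
            b = next;
        }
        return sb.toString();
    }

    private static boolean endsSentence(String word) {
        return word.endsWith(".") || word.endsWith("!") || word.endsWith("?");
    }

    public static void main(String[] args) {
        OrderTwoChainSketch sketch = new OrderTwoChainSketch();
        sketch.analyze("the cat sat on the mat. the cat ran to the mat and sat.");
        System.out.println(sketch.chain(42L));
    }
}
```

Keying on two words rather than one is what separates this from an order-1 chain: the text stays more coherent, but a small corpus leaves few choices per pair, so the output tends to repeat the sample closely.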
    • changeNames

      public void changeNames(NaturalLanguageCipher translator)
      After calling analyze(CharSequence), you can optionally call this to alter any words in this MarkovText that were used as a proper noun (determined by whether they were capitalized in the middle of a sentence), changing them to a ciphered version using the given NaturalLanguageCipher. Normally you would initialize a NaturalLanguageCipher with a FakeLanguageGen that matches the style you want for all names in this text, then pass that to this method during pre-processing (not necessarily at runtime, since this method isn't especially fast if the corpus was large). This method modifies this MarkovText in-place.
      Parameters:
      translator - a NaturalLanguageCipher that will be used to translate proper nouns in this MarkovText's word array
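The proper-noun rule mentioned above (a word counts as a name if it is capitalized in the middle of a sentence) can be sketched as follows. ProperNounSketch is a hypothetical helper, not part of MarkovText; the tokenizing and the punctuation it strips are assumptions, and the library's own detection may differ.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper (not part of MarkovText): finds words capitalized mid-sentence. */
final class ProperNounSketch {
    static Set<String> find(String text) {
        Set<String> found = new LinkedHashSet<>();
        boolean sentenceStart = true;
        for (String w : text.trim().split("\\s+")) {
            // A capital letter only suggests a name when the word does not begin a sentence.
            if (!w.isEmpty() && !sentenceStart && Character.isUpperCase(w.charAt(0))) {
                found.add(w.replaceAll("[.,;:!?]+$", ""));
            }
            sentenceStart = w.endsWith(".") || w.endsWith("!") || w.endsWith("?");
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(find("The knight met Gawain near Camelot. Then Gawain left."));
    }
}
```

A rule this simple also flags words such as the pronoun "I", and it misses names that only ever appear at the start of a sentence.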
    • chain

      public String chain(long seed)
      Generate a roughly-sentence-sized piece of text based on the previously analyzed corpus text (using analyze(CharSequence)) that terminates when stop punctuation is used (".", "!", "?", or "..."), or once the length would be greater than 200 characters without encountering stop punctuation (it terminates such a sentence with "." or "...").
      Parameters:
      seed - the seed for the random decisions this makes, as a long; any long can be used
      Returns:
      a String generated from the analyzed corpus text's word placement, usually a small sentence
    • chain

      public String chain(long seed, int maxLength)
      Generate a roughly-sentence-sized piece of text based on the previously analyzed corpus text (using analyze(CharSequence)) that terminates when stop punctuation is used (".", "!", "?", or "...") or once adding another word would exceed maxLength (it terminates such a sentence with "." or "...").
      Parameters:
      seed - the seed for the random decisions this makes, as a long; any long can be used
      maxLength - the maximum length for the generated String, in number of characters
      Returns:
      a String generated from the analyzed corpus text's word placement, usually a small sentence
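The two stopping conditions described for chain can be sketched on a fixed list of words. LengthCapSketch is hypothetical and simplifies the documented behavior: it always closes a cut-off sentence with "." (this page does not say when "..." is chosen instead), and it reserves one character for that closing period.

```java
/** Hypothetical helper (not part of MarkovText): joins words until a stop mark or a length cap. */
final class LengthCapSketch {
    static String join(String[] words, int maxLength) {
        StringBuilder sb = new StringBuilder();
        for (String w : words) {
            // Characters this word would add, counting the separating space if one is needed.
            int needed = (sb.length() == 0 ? 0 : 1) + w.length();
            // Stop before a word that would not fit, leaving one character for the closing ".".
            if (sb.length() + needed + 1 > maxLength) {
                return sb.append('.').toString();
            }
            if (sb.length() > 0) sb.append(' ');
            sb.append(w);
            // Stop punctuation at the end of a word finishes the sentence early.
            if (w.endsWith(".") || w.endsWith("!") || w.endsWith("?")) {
                return sb.toString();
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(join(new String[] {"the", "quick", "brown", "fox"}, 12));
    }
}
```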
    • serializeToString

      Returns a representation of this MarkovText as a String; use deserializeFromString(String) to get a MarkovText back from this String. The words and processed fields must have been given values by either direct assignment, calling analyze(CharSequence), or building this MarkovText with the aforementioned deserializeFromString(String) method. Uses spaces to separate words and a tab to separate the two fields.
      Returns:
      a String that can be used to store the analyzed words and frequencies in this MarkovText
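The layout described above (words separated by spaces, then a tab, then the second field) can be sketched with a round trip. TabSeparatedSketch is hypothetical, and the way it writes processed (rows split by ';', values by ',') is an assumption made only so the example is complete; this page does not specify how serializeToString() encodes that field.

```java
/** Hypothetical round trip (not MarkovText's format): "word word word<TAB>row;row". */
final class TabSeparatedSketch {
    static String serialize(String[] words, int[][] processed) {
        StringBuilder sb = new StringBuilder(String.join(" ", words)).append('\t');
        for (int r = 0; r < processed.length; r++) {
            if (r > 0) sb.append(';');
            for (int c = 0; c < processed[r].length; c++) {
                if (c > 0) sb.append(',');
                sb.append(processed[r][c]);
            }
        }
        return sb.toString();
    }

    /** Everything before the tab is the space-separated word list. */
    static String[] words(String data) {
        return data.substring(0, data.indexOf('\t')).split(" ");
    }

    /** Everything after the tab is the jagged array; this sketch assumes no row is empty. */
    static int[][] processed(String data) {
        String field = data.substring(data.indexOf('\t') + 1);
        if (field.isEmpty()) return new int[0][];
        String[] rows = field.split(";");
        int[][] out = new int[rows.length][];
        for (int r = 0; r < rows.length; r++) {
            String[] cells = rows[r].split(",");
            out[r] = new int[cells.length];
            for (int c = 0; c < cells.length; c++) {
                out[r][c] = Integer.parseInt(cells[c]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String data = serialize(new String[] {"The", "cat", "sat."}, new int[][] {{0, 1}, {2, 3, 4}});
        System.out.println(data.replace("\t", "<TAB>"));
    }
}
```

A space-separated word list only works because words never contain spaces, and a tab is a safe field separator for the same reason.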
    • deserializeFromString

      public static MarkovText deserializeFromString(String data)
      Recreates an already-analyzed MarkovText given a String produced by serializeToString().
      Parameters:
      data - a String returned by serializeToString()
      Returns:
      a MarkovText that is ready to generate text with chain(long)
    • copy

      public MarkovText copy()
      Copies the String array words and the 2D jagged int array processed into a new MarkovText. None of the arrays will be equivalent references, but the Strings (being immutable) will be the same objects in both MarkovText instances. This is primarily useful with changeNames(NaturalLanguageCipher), which can produce several variants on names given several initial copies produced with this method.
      Returns:
      a copy of this MarkovText
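The copy semantics described above (new arrays, shared String objects) can be sketched in plain Java. CopySketch is a hypothetical helper and not the library's code; it only shows what "none of the arrays will be equivalent references" means for a jagged int array and a String array.

```java
/** Hypothetical helper (not part of MarkovText): copies arrays the way copy() is described. */
final class CopySketch {
    /** Copies the outer array and every row, so no int[] is shared with the source. */
    static int[][] copyJagged(int[][] src) {
        int[][] out = new int[src.length][];
        for (int i = 0; i < src.length; i++) {
            out[i] = src[i].clone();
        }
        return out;
    }

    /** Copies the array itself; the immutable String elements are shared, which is safe. */
    static String[] copyWords(String[] src) {
        return src.clone();
    }

    public static void main(String[] args) {
        String[] words = {"Gawain", "rode", "north."};
        String[] copied = copyWords(words);
        copied[0] = "Ywain"; // changing the copy leaves the original array untouched
        System.out.println(words[0] + " / " + copied[0]);
    }
}
```

Because each copy owns its arrays, renaming words in one copy, as changeNames(NaturalLanguageCipher) is described as doing, would not affect the others.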