Computes n-grams from text.

Usage

topicsGrams(
  data,
  ngram_window = c(1, 3),
  stopwords = stopwords::stopwords("en", source = "snowball"),
  occurance_rate = 0,
  removal_mode = "frequency",
  removal_rate_most = NULL,
  removal_rate_least = NULL,
  pmi_threshold = 0,
  top_frequent = 200
)

Arguments

data

(tibble) The data containing the text from which n-grams are computed.

ngram_window

(numeric vector) The minimum and maximum n-gram length, e.g. c(1, 3).

stopwords

(character vector) The stopwords to remove, e.g. stopwords::stopwords("en", source = "snowball").

occurance_rate

(numerical) The occurrence rate (0-1). Terms that occur in fewer than (occurance_rate)*(number of documents) documents are removed. Example: if the training dataset has 1000 documents and the occurrence rate is set to 0.05, terms that appear in fewer than 50 documents are removed.

removal_mode

(character) The mode of removal: "term", "frequency" or "percentage".

removal_rate_most

(numeric) The rate of most frequent ngrams to remove

removal_rate_least

(numeric) The rate of least frequent ngrams to remove

pmi_threshold

(integer) The PMI (pointwise mutual information) threshold. Set to 0 to disable PMI filtering.

top_frequent

(integer) The number of most frequently occurring n-grams to include in the output.
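
To make the ngram_window and occurance_rate arguments more concrete, here is a minimal base R sketch of the idea. It is not the package's implementation: make_ngrams is a hypothetical helper, the whitespace tokenisation is an assumption, and the toy documents are invented for illustration.

```r
# Hypothetical helper (not part of the package): returns all n-grams of
# lengths window[1]..window[2] from one whitespace-tokenised string.
make_ngrams <- function(text, window = c(1, 3)) {
  tokens <- strsplit(tolower(text), "\\s+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  out <- character(0)
  for (n in seq(window[1], window[2])) {
    if (length(tokens) < n) next  # document too short for this n
    starts <- seq_len(length(tokens) - n + 1)
    grams <- vapply(
      starts,
      function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

# Toy corpus (assumed): one string per document.
docs <- c("good day", "good night", "good day today", "bad day")

# Unigrams and bigrams for each document.
grams_per_doc <- lapply(docs, make_ngrams, window = c(1, 2))

# Minimum number of documents a term must appear in: 0.5 * 4 = 2.
occurance_rate <- 0.5
min_docs <- occurance_rate * length(docs)

# Document frequency: count each n-gram at most once per document.
doc_freq <- table(unlist(lapply(grams_per_doc, unique)))

# Keep n-grams that appear in at least min_docs documents.
kept <- names(doc_freq)[as.vector(doc_freq) >= min_docs]
print(kept)
```

In this sketch only the n-grams found in at least two of the four documents survive. This mirrors the 1000-document example given for occurance_rate, at a smaller scale.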

Value

A list containing a tibble of the n-grams with their frequency and probability, and a tibble containing the relative frequency of the n-grams for each user.
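
As a rough illustration of the kind of frequency and probability table described above, the following base R sketch counts n-grams and keeps the most frequent ones. It rests on assumptions: probability is taken here to be the n-gram's count divided by the total count, the column names are hypothetical, and a plain data.frame stands in for the tibble. The function's actual output may be structured differently.

```r
# Toy vector of extracted n-grams (invented for illustration).
ngrams <- c("good", "day", "good day", "good", "night", "good", "day")

# Count each n-gram and sort from most to least frequent.
freq <- sort(table(ngrams), decreasing = TRUE)

# Assumed layout: one row per n-gram with its count and its share of
# all counts (the "probability" under this sketch's assumption).
result <- data.frame(
  ngram = names(freq),
  freq = as.vector(freq),
  prob = as.vector(freq) / sum(freq),
  stringsAsFactors = FALSE
)

# Keep only the top_frequent most frequent n-grams.
top_frequent <- 2
top <- head(result, top_frequent)
print(top)
```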
