The function for creating a document term matrix

topicsDtm(
  data,
  ngram_window = c(1, 3),
  stopwords = stopwords::stopwords("en", source = "snowball"),
  removalword = "",
  occ_rate = 0,
  removal_mode = "none",
  removal_rate_most = 0,
  removal_rate_least = 0,
  split = 1,
  seed = 42L,
  save_dir = "./results",
  load_dir = NULL,
  threads = 1
)

Arguments

data

(list) the text data, with each entry belonging to a unique id

ngram_window

(list) the minimum and maximum n-gram length, e.g. c(1,3)

stopwords

(stopwords) the stopwords to remove, e.g. stopwords::stopwords("en", source = "snowball")

removalword

(string) the word to remove

occ_rate

(integer) the rate of occurrence of a word to be removed

removal_mode

(string) the mode of removal: "none", "frequency", "term" or "percentage". "frequency" removes all words that occur less often than removal_rate_least or more often than removal_rate_most. "term" removes an absolute number of the least frequent and most frequent terms, as given by removal_rate_least and removal_rate_most. "percentage" removes the share of terms given by removal_rate_least and removal_rate_most, relative to the number of terms in the matrix.

removal_rate_most

(integer) the rate of most frequent words to be removed; its meaning depends on removal_mode

removal_rate_least

(integer) the rate of least frequent words to be removed; its meaning depends on removal_mode

split

(float) the proportion of the data to be used for training

seed

(integer) the random seed for reproducibility

save_dir

(string) the directory in which to save the results; default is "./results". If NULL, no results are saved.

load_dir

(string) the directory to load from.

threads

(integer) the number of threads to use
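
The three non-default removal_mode settings described above can be sketched in base R on a toy vector of term frequencies. This is only an illustration under stated assumptions, not the package's implementation: filter_terms() is a hypothetical helper, the "frequency" bounds are assumed to be inclusive, and "percentage" is assumed to be given on a 0-100 scale and rounded down.

```r
# Hypothetical helper, NOT part of the package: applies a removal mode
# to a named vector of term frequencies and returns the surviving terms.
filter_terms <- function(freq, mode = "none", rate_least = 0, rate_most = 0) {
  if (mode == "none") {
    return(freq)
  }
  if (mode == "frequency") {
    # Assumption: keep terms whose count lies within [rate_least, rate_most].
    return(freq[freq >= rate_least & freq <= rate_most])
  }
  # "term" interprets the rates as absolute numbers of terms.
  n_least <- rate_least
  n_most <- rate_most
  if (mode == "percentage") {
    # Assumption: rates are percentages of the vocabulary size, rounded down.
    n_least <- floor(length(freq) * rate_least / 100)
    n_most <- floor(length(freq) * rate_most / 100)
  }
  ordered <- sort(freq)  # ascending: least frequent terms come first
  idx <- seq_along(ordered)
  ordered[idx > n_least & idx <= length(ordered) - n_most]
}

# Toy frequencies (made up for illustration, no ties).
freq <- c(the = 900, feel = 120, sad = 40, tired = 25, alone = 12,
          empty = 8, numb = 6, weary = 4, bleak = 2, adrift = 1)

by_frequency  <- filter_terms(freq, "frequency", rate_least = 4, rate_most = 500)
by_term       <- filter_terms(freq, "term", rate_least = 1, rate_most = 1)
by_percentage <- filter_terms(freq, "percentage", rate_least = 20, rate_most = 10)
```

In this sketch the same pair of numbers means a count threshold, a number of terms, or a share of the vocabulary, depending on the mode, which is why removal_rate_least and removal_rate_most should always be read together with removal_mode.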

Value

the document term matrix
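
To make the returned structure and the effect of ngram_window concrete, here is a minimal base-R sketch that builds a small document-term matrix from two made-up texts. It is an assumption-laden illustration rather than the package's code: ngrams() is a hypothetical helper, tokens are split on non-letters, n-grams are joined with "_", and no stopwords are removed.

```r
# Hypothetical helper: all n-grams of length n_min..n_max from one text.
ngrams <- function(text, n_min = 1, n_max = 3) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens <- tokens[nzchar(tokens)]  # drop empty strings from leading separators
  out <- character(0)
  for (n in n_min:n_max) {
    if (length(tokens) < n) next
    starts <- seq_len(length(tokens) - n + 1)
    out <- c(out, vapply(
      starts,
      function(i) paste(tokens[i:(i + n - 1)], collapse = "_"),
      character(1)
    ))
  }
  out
}

# Two made-up documents, named by id.
docs <- c(d1 = "I feel sad", d2 = "I feel tired and sad")
grams <- lapply(docs, ngrams)
vocab <- sort(unique(unlist(grams)))

# Rows are documents, columns are terms, cells are counts.
toy_dtm <- t(vapply(
  grams,
  function(g) as.integer(table(factor(g, levels = vocab))),
  integer(length(vocab))
))
colnames(toy_dtm) <- vocab
```

With a window of c(1, 3), each document contributes its unigrams, bigrams and trigrams, so the vocabulary grows much faster than the number of distinct words; this is one reason the removal arguments matter.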

Examples

# \donttest{
# Create a Dtm and remove the terms that occur fewer than 4 times or more than 500 times.
dtm <- topicsDtm(data = dep_wor_data$Depphrase,
                 removal_mode = "frequency",
                 removal_rate_least = 4,
                 removal_rate_most = 500)
#> [1] "The Dtm, data, and summary are saved in./results/seed_42/dtms.rds"

# Create Dtm and remove the single least frequent and the single most frequent term.
dtm <- topicsDtm(data = dep_wor_data$Depphrase,
                 removal_mode = "term",
                 removal_rate_least = 1,
                 removal_rate_most = 1)
#> [1] "The Dtm, data, and summary are saved in./results/seed_42/dtms.rds"

# Create Dtm and remove the 1% least frequent and 1% most frequent terms.
dtm <- topicsDtm(data = dep_wor_data$Depphrase,
                 removal_mode = "percentage",
                 removal_rate_least = 1,
                 removal_rate_most = 1)
#> [1] "The Dtm, data, and summary are saved in./results/seed_42/dtms.rds"
                 
# Load precomputed Dtm from directory
dtm <- topicsDtm(load_dir = "./results",
                 seed = 42)
#> [1] "The Dtm, data, and summary are saved in./results/seed_42/dtms.rds"
# }
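
How the split and seed arguments interact can be sketched in base R as follows. This is an assumption about the general technique (a seeded random train/test split), not a description of the package's internals; split_data() is a hypothetical helper.

```r
# Hypothetical helper: seeded random split into training and test parts.
split_data <- function(data, split = 1, seed = 42L) {
  set.seed(seed)  # same seed gives the same partition
  n <- length(data)
  train_idx <- sort(sample.int(n, size = round(split * n)))
  test_idx <- setdiff(seq_len(n), train_idx)  # safe even if train_idx is empty
  list(train = data[train_idx], test = data[test_idx])
}

texts <- letters[1:10]  # stand-in for ten documents
part_a <- split_data(texts, split = 0.8, seed = 42L)
part_b <- split_data(texts, split = 0.8, seed = 42L)
full   <- split_data(texts, split = 1)
```

With the default split = 1 all documents end up in the training part, and repeating the call with the same seed reproduces the same partition.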