The function for creating a document term matrix
(list) the list containing the text data with each entry belonging to a unique id
(list) the minimum and maximum n-gram length, e.g. c(1,3)
(stopwords) the stopwords to remove, e.g. stopwords::stopwords("en", source = "snowball")
(string) the word to remove
(integer) the rate of occurrence of a word to be removed
(string) the mode of removal: "none", "frequency", "term" or "percentage". "frequency" removes all words that occur fewer times than removal_rate_least or more times than removal_rate_most; "term" removes an absolute number of the least frequent and most frequent terms, as given by removal_rate_least and removal_rate_most; "percentage" removes the share of terms given by removal_rate_least and removal_rate_most, relative to the number of terms in the matrix
(integer) the rate of most frequent words to be removed, functionality depends on removal_mode
(integer) the rate of least frequent words to be removed, functionality depends on removal_mode
(float) the proportion of the data to be used for training
(integer) the random seed for reproducibility
(string) the directory in which to save the results; the default is "./results". If NULL, no results are saved
(string) the directory to load from.
(integer) the number of threads to use
the document term matrix
# \donttest{
# Create a Dtm and remove the terms that occur fewer than 4 times or more than 500 times.
dtm <- topicsDtm(data = dep_wor_data$Depphrase,
                 removal_mode = "frequency",
                 removal_rate_least = 4,
                 removal_rate_most = 500)
#> [1] "The Dtm, data, and summary are saved in./results/seed_42/dtms.rds"
# Create Dtm and remove the 1 least frequent and 1 most frequent term.
dtm <- topicsDtm(data = dep_wor_data$Depphrase,
                 removal_mode = "term",
                 removal_rate_least = 1,
                 removal_rate_most = 1)
#> [1] "The Dtm, data, and summary are saved in./results/seed_42/dtms.rds"
# Create Dtm and remove the 1% least frequent and 1% most frequent terms.
dtm <- topicsDtm(data = dep_wor_data$Depphrase,
                 removal_mode = "percentage",
                 removal_rate_least = 1,
                 removal_rate_most = 1)
#> [1] "The Dtm, data, and summary are saved in./results/seed_42/dtms.rds"
# Load precomputed Dtm from directory
dtm <- topicsDtm(load_dir = "./results",
                 seed = 42)
#> [1] "The Dtm, data, and summary are saved in./results/seed_42/dtms.rds"
# }
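
The three removal modes can be compared on a toy vector of term counts. The sketch below is only an illustration of one plausible reading of the argument descriptions above, not the package's actual implementation: `sketch_removal()` and `toy` are hypothetical names, and the strict comparisons in "frequency" mode, the rounding down in "percentage" mode, and the handling of ties are all assumptions.

```r
# Hypothetical helper, NOT part of the package: applies one reading of the
# removal modes to a named vector of term counts.
sketch_removal <- function(freq, mode, least = NULL, most = NULL) {
  freq <- sort(freq)  # ascending order: rarest terms first
  n <- length(freq)
  drop <- character(0)
  if (mode == "frequency") {
    # Drop terms occurring fewer than `least` or more than `most` times.
    if (!is.null(least)) drop <- c(drop, names(freq)[freq < least])
    if (!is.null(most)) drop <- c(drop, names(freq)[freq > most])
  } else if (mode == "term") {
    # Drop an absolute number of terms from each end of the ranking.
    if (!is.null(least)) drop <- c(drop, names(head(freq, least)))
    if (!is.null(most)) drop <- c(drop, names(tail(freq, most)))
  } else if (mode == "percentage") {
    # Drop a percentage of the vocabulary from each end (assumed rounded down).
    if (!is.null(least)) drop <- c(drop, names(head(freq, floor(n * least / 100))))
    if (!is.null(most)) drop <- c(drop, names(tail(freq, floor(n * most / 100))))
  }
  # With mode "none", nothing is dropped and all terms are returned.
  freq[setdiff(names(freq), drop)]
}

# Toy counts (made up for illustration only).
toy <- c(the = 900, sad = 40, tired = 12, anxious = 7, zzz = 1)

kept_frequency <- names(sketch_removal(toy, "frequency", least = 4, most = 500))
kept_term <- names(sketch_removal(toy, "term", least = 1, most = 1))
kept_percentage <- names(sketch_removal(toy, "percentage", least = 20, most = 20))
```

On this toy vector all three calls drop `zzz` and `the` and keep `anxious`, `tired` and `sad`, but for different reasons: a count threshold, an absolute number of terms, and a share of the five-term vocabulary, respectively.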