Predict the topic distribution of new documents with a trained topic model.

Usage

topicsPreds(
  model,
  data,
  num_iterations = 200,
  sampling_interval = 10,
  burn_in = 10,
  seed = 42,
  create_new_dtm = FALSE
)

Arguments

model

(list) The trained model.

data

(tibble) The text variable for which you want to infer the topic distribution. This can be the same data used to create the dtm, or new data.

num_iterations

(integer) The number of iterations to run the model.

sampling_interval

(integer) The number of iterations between consecutive samples collected during the Gibbs sampling process. This technique, known as thinning, reduces the correlation between consecutive samples and improves the quality of the final estimates by making them more independent. Purpose: by specifying a sampling_interval, you avoid collecting highly correlated samples, which leads to more robust and accurate topic distributions. Example: if sampling_interval = 10, the algorithm collects a sample every 10 iterations (e.g., at iteration 10, 20, 30, and so on); see the sketch below. Typical values: default 10; range 5 to 50 (depending on the complexity and size of the data).
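A minimal sketch of the resulting sampling schedule, assuming samples are collected at every sampling_interval-th iteration after the burn-in (the sampler's exact internal indexing may differ):

# Illustrative only: which iterations contribute samples
num_iterations <- 200
burn_in <- 10
sampling_interval <- 10

# Discard the burn-in, then keep every sampling_interval-th iteration
sampled_iterations <- seq(
  from = burn_in + sampling_interval,
  to = num_iterations,
  by = sampling_interval
)
sampled_iterations # 20, 30, 40, ..., 200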

burn_in

(integer) The number of initial iterations discarded during the Gibbs sampling process. These early iterations may not be representative of the final sampling distribution because the model is still stabilizing. Purpose: the burn-in period allows the model to converge to a more stable state before samples are collected, improving the quality of the inferred topic distributions. Example: if burn_in = 50, the first 50 iterations of the Gibbs sampling process are discarded, and sampling begins afterward. Typical values: the function's default is 10, though 50 to 100 is common; range 10 to 1000 (larger datasets or more complex models may require a longer burn-in period).
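For example, a longer chain with a heavier burn-in and wider thinning could be requested as follows (the parameter values here are illustrative, not recommendations):

preds <- topicsPreds(
  model = model,
  data = dep_wor_data$Depphrase,
  num_iterations = 1000,
  sampling_interval = 20,
  burn_in = 100,
  seed = 42
)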

seed

(integer) A seed to set for reproducibility.

create_new_dtm

(boolean) If applying the model to new data (i.e., data not used in training), it can help to create a new dtm. This option is currently experimental and uses the textmineR::CreateDtm() function rather than the topicsDtm() function, which offers more options.
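A minimal sketch of scoring unseen texts with a fresh dtm (new_texts is a hypothetical character vector standing in for documents not used in training):

# new_texts is hypothetical; substitute your own unseen documents
preds_new <- topicsPreds(
  model = model,
  data = new_texts,
  create_new_dtm = TRUE
)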

Value

A tibble of the predictions: The rows represent the documents, and the columns represent the topics. The values in the cells indicate the proportion of each topic within the corresponding document.

Examples

# \donttest{
# Predict topics for new data with the trained model

dtm <- topicsDtm(
  data = dep_wor_data$Depphrase)

model <- topicsModel(dtm = dtm, # output of topicsDtm()
                     num_topics = 20,
                     num_top_words = 10,
                     num_iterations = 1000,
                     seed = 42)
                     
preds <- topicsPreds(model = model, # output of topicsModel()
                     data = dep_wor_data$Depphrase)
# }
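Since the returned values are topic proportions, each row should sum to approximately 1. A quick sanity check (a sketch, assuming preds contains only numeric topic columns):

# Inspect the inferred topic distributions
head(preds)

# Each document's proportions should sum to roughly 1
rowSums(preds)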
