Writing custom RSMTool sections

RSMTool allows users to include custom Jupyter notebooks in the reports generated by rsmtool, rsmeval, and rsmcompare using the custom_sections fields in their configuration files. This can be particularly useful if a researcher wants to include custom analyses specific to their own scoring engine; she can do so while still using the familiar RSMTool pipeline.

Available variables

When writing such notebooks, some or all of the Python variables below will be available for use.

experiment_id

The experiment ID from the respective configuration file.

description

The description string from the respective configuration file.

model_name

The name of the model from the respective configuration file.

context

Two possible values: rsmtool or rsmeval, depending on which command-line tool is being used to generate the report (not available in rsmcompare notebooks).

use_scaled_predictions

A boolean Python variable containing the value of the setting with the same name from the respective configuration file.

exclude_zero_scores

A boolean Python variable containing the value of the setting with the same name from the respective configuration file.

length_column (rsmtool only)

The name of the column in the training and/or evaluation data containing response length. None if not specified in the configuration file.

second_human_score_column

The name of the column in the evaluation data containing the second human score. None if not specified in the respective configuration file.

groups_eval

A list containing the names of metadata or subgroup columns as specified in the respective configuration file.

min_items

The minimal number of items expected from each candidate. The value is set to 0 if the user did not specify a minimal number in the respective configuration file.

features_used (rsmtool only)

A list containing the names of all the features that are used for training the model.
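Because only some of these variables are defined in any given run, a custom notebook may want to check for a name before using it. The following is a minimal sketch of that idea. The helper describe_experiment and the example_namespace dictionary are hypothetical and only stand in for the notebook's own namespace, which you would normally obtain with globals().

```python
def describe_experiment(namespace):
    """Build a short summary from whichever variables are defined."""
    parts = []
    for name in ("experiment_id", "model_name", "context"):
        # Some names are missing for certain tools, e.g. `context`
        # is not available in rsmcompare notebooks.
        if name in namespace:
            parts.append(f"{name}={namespace[name]}")
    # `length_column` exists only for rsmtool and may be None.
    if namespace.get("length_column") is not None:
        parts.append(f"length_column={namespace['length_column']}")
    return ", ".join(parts)


# Stand-in values for illustration only (assumption). In a real custom
# section, call describe_experiment(globals()) instead.
example_namespace = {
    "experiment_id": "demo_experiment",
    "context": "rsmeval",
    "length_column": None,
}
summary = describe_experiment(example_namespace)
print(summary)
```

Looking names up in a dictionary rather than referencing them directly lets the same notebook run under rsmtool, rsmeval, and rsmcompare without raising a NameError.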

In addition, several pandas data frames are also available. Many of these contain the same information as the intermediate CSV files produced by rsmtool or rsmeval. We have made these available to authors of custom notebooks to avoid the need for reading them from disk.

df_features (rsmtool only)

A data frame containing information about the feature columns that were included in the final model training. Same information as in feature.csv.

df_betas (rsmtool only)

A data frame containing the relative and standardized coefficients. Same information as in betas.csv.

df_train_orig (rsmtool only)

df_test_orig (rsmtool only)

Data frames containing the original training and testing data as specified in the config file, without any changes.

df_train (rsmtool only)

df_train_preproc (rsmtool only)

df_test (rsmtool only)

df_test_preproc (rsmtool only)

Data frames containing the raw and pre-processed feature values.

df_train_other_columns (rsmtool only)

df_test_other_columns

Data frames containing the unused columns from the training and evaluation data.

df_train_responses_with_excluded_flags (rsmtool only)

df_test_responses_with_excluded_flags

Data frames containing the flagged responses.

df_train_length (rsmtool only)

A data frame containing response lengths under the length column for the training data, along with the response IDs under the spkitemid column. This data frame is only available if (a) length_column was specified in the configuration file, (b) no values in that column are missing, and (c) the standard deviation of the values in that column is greater than 0.
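The three conditions can be illustrated with a short check on a pandas data frame. This is only a sketch of the rule as described here, not RSMTool's own implementation. The function name length_column_is_usable, the sample frames, and the column name "length" are hypothetical.

```python
import pandas as pd


def length_column_is_usable(df, length_column):
    """Return True if the length column meets conditions (a), (b), and (c)."""
    # (a) a length column must have been specified (and be present)
    if length_column is None or length_column not in df.columns:
        return False
    values = df[length_column]
    # (b) no missing values are allowed
    if values.isnull().any():
        return False
    # (c) the values must actually vary
    return bool(values.std() > 0)


# Hypothetical training data for illustration only.
df_ok = pd.DataFrame({"spkitemid": ["r1", "r2", "r3"], "length": [120, 85, 240]})
df_constant = pd.DataFrame({"spkitemid": ["r1", "r2", "r3"], "length": [100, 100, 100]})
df_missing = pd.DataFrame({"spkitemid": ["r1", "r2", "r3"], "length": [120, None, 240]})

print(length_column_is_usable(df_ok, "length"))
print(length_column_is_usable(df_constant, "length"))
print(length_column_is_usable(df_missing, "length"))
```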

df_test_human_scores

A data frame containing the two human scores for the responses in the evaluation data under the sc1 and sc2 columns, along with the response IDs under the spkitemid column. This frame is only available if second_human_score_column was specified in the config file.

Note

This data frame will contain NaN for the responses for which no numeric second human score was available or for which the second score was 0 and exclude_zero_scores was set to true.
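Since the sc2 column may contain NaN, analyses that compare the two human scores usually need to drop those rows first. The sketch below computes exact agreement on the double-scored responses only. The data frame built here is a made-up stand-in. In a custom section, df_test_human_scores would already be defined when a second human score column was specified.

```python
import pandas as pd

# Stand-in for the frame provided to the notebook (illustrative values only).
df_test_human_scores = pd.DataFrame(
    {
        "spkitemid": ["r1", "r2", "r3", "r4"],
        "sc1": [3, 2, 4, 1],
        "sc2": [3, float("nan"), 3, 1],
    }
)

# Keep only the responses that have a numeric second human score.
df_double_scored = df_test_human_scores.dropna(subset=["sc2"])
num_double_scored = len(df_double_scored)

# Proportion of double-scored responses where both raters gave the same score.
exact_agreement = (df_double_scored["sc1"] == df_double_scored["sc2"]).mean()

print(f"{num_double_scored} double-scored responses")
print(f"exact agreement: {exact_agreement:.2f}")
```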

df_pred_preproc

A data frame containing the raw and post-processed predictions for the evaluation data.

df_feature_subset_specs (rsmtool only)

A data frame containing the contents of feature_subset_file if it was specified in the configuration file. None if not specified.

Finally, the following variables are also available, but you are strongly encouraged not to re-read the files under these directories, since their contents are already available as data frames.

output_dir

The output sub-directory under the experiment output directory that contains all the intermediate CSV files.

figure_dir

The figure sub-directory under the experiment output directory that contains all of the generated SVG and PNG figures.

Note

All data frames apart from df_train_orig and df_test_orig contain an spkitemid column which contains the unique response IDs.

All data frames except the df_*_other_columns contain an sc1 column which contains the human score for the responses.

df_train_orig and df_test_orig will contain the response IDs and human scores under columns with the original names, not spkitemid and sc1.
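Because spkitemid is shared across the data frames, it can serve as the key for combining them. The sketch below joins predictions with the unused evaluation columns and summarizes human scores by one of those columns. Both frames are made-up stand-ins, and the column names raw and prompt are assumptions chosen for illustration. Check the actual column names in your own experiment before relying on them.

```python
import pandas as pd

# Stand-ins for frames that a custom section would already have.
df_pred_preproc = pd.DataFrame(
    {
        "spkitemid": ["r1", "r2", "r3", "r4"],
        "sc1": [3, 2, 4, 1],
        "raw": [2.8, 2.4, 3.5, 1.2],  # assumed name for a prediction column
    }
)
df_test_other_columns = pd.DataFrame(
    {
        "spkitemid": ["r1", "r2", "r3", "r4"],
        "prompt": ["A", "B", "A", "B"],  # hypothetical unused column
    }
)

# Join on the shared response ID. validate= raises an error if an ID
# appears more than once on either side.
df_merged = df_pred_preproc.merge(
    df_test_other_columns, on="spkitemid", how="left", validate="one_to_one"
)

# Mean human score for each value of the extra column.
mean_sc1_by_prompt = df_merged.groupby("prompt")["sc1"].mean()
print(mean_sc1_by_prompt)
```

Note that the df_*_other_columns frames do not carry sc1, so the human score has to come from the other side of the merge.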