Advanced Uses of RSMTool
In addition to providing the rsmtool
utility for training and evaluating regression-based scoring models, the RSMTool package also provides six other command-line utilities for more advanced users.
rsmeval
- Evaluate external predictions
RSMTool provides the rsmeval
command-line utility to evaluate existing predictions and generate a report with all the built-in analyses. This can be useful in scenarios where the user wants to use more sophisticated machine learning algorithms not available in RSMTool to build the scoring model but still wants to be able to evaluate that model’s predictions using the standard analyses.
For example, say a researcher has an existing automated scoring engine for grading short responses that extracts the features and computes the predicted score. This engine uses a large number of binary, sparse features. She cannot use rsmtool
to train her model since it requires numeric features. So, she uses scikit-learn to train her model.
Once the model is trained, the researcher wants to evaluate her engine’s performance using the analyses recommended by the educational measurement community as well as conduct additional investigations for specific subgroups of test-takers. However, these kinds of analyses are not available in scikit-learn
. She can use rsmeval
to set up a customized report using a combination of existing and custom sections and quickly produce the evaluation that is useful to her.
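The last step of such a pipeline might look like the following minimal sketch: after scoring responses with an external model (e.g., one trained with scikit-learn), write out a predictions file that rsmeval can evaluate. The IDs and score values here are made up; the column names (ID, system, human, human2) match the tutorial configuration used later on this page.

```python
# Hypothetical sketch: write system and human scores to a CSV file
# in a format rsmeval can evaluate. Values are illustrative only.
import csv

# predictions from some external scoring engine (made-up values)
records = [
    {"ID": "resp_001", "system": 3.2, "human": 3, "human2": 4},
    {"ID": "resp_002", "system": 4.7, "human": 5, "human2": 5},
]

with open("ASAP2_scores.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["ID", "system", "human", "human2"])
    writer.writeheader()
    writer.writerows(records)
```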
Tutorial
For this tutorial, you first need to install RSMTool and make sure the conda environment in which you installed it is activated.
Workflow
rsmeval
is designed for evaluating existing machine scores. Once you have the scores computed for all the responses in your data, the next steps are fairly straightforward:
Create a data file in one of the supported formats containing the computed system scores and the human scores you want to compare against.
Create an experiment configuration file describing the evaluation experiment you would like to run.
Run that configuration file with rsmeval and generate the experiment HTML report as well as the intermediate CSV files.
Examine the HTML report to check various aspects of model performance.
Note that the above workflow does not use any customization features, e.g., choosing which sections to include in the report or adding custom analysis sections. However, we will stick with this workflow for our tutorial since it is likely to be the most common use case.
ASAP Example
We are going to use the same example from the 2012 Kaggle competition on automated essay scoring that we used for the rsmtool tutorial.
Generate scores
rsmeval
is designed for researchers who have developed their own scoring engine for generating scores and would like to produce an evaluation report for those scores. For this tutorial, we will use the scores we generated for the ASAP2 evaluation set as part of the rsmtool tutorial.
Create a configuration file
The next step is to create an experiment configuration file in .json
format.
1{
2 "experiment_id": "ASAP2_evaluation",
3 "description": "Evaluation of the scores generated using rsmtool.",
4 "predictions_file": "ASAP2_scores.csv",
5 "system_score_column": "system",
6 "human_score_column": "human",
7 "id_column": "ID",
8 "trim_min": 1,
9 "trim_max": 6,
10 "second_human_score_column": "human2",
11 "scale_with": "asis"
12}
Let’s take a look at the options in our configuration file.
Line 2: We define an experiment ID.
Line 3: We also provide a description which will be included in the experiment report.
Line 4: We list the path to the file with the predicted and human scores. For this tutorial we used .csv format, but RSMTool also supports several other input file formats.
Line 5: This field indicates that the system scores in our .csv file are located in a column named system.
Line 6: This field indicates that the human (reference) scores in our .csv file are located in a column named human.
Line 7: This field indicates that the unique IDs for the responses in the .csv file are located in a column named ID.
Lines 8-9: These fields indicate that the lowest score on the scoring scale is a 1 and the highest score is a 6. This information is usually part of the rubric used by human graders.
Line 10: This field indicates that scores from a second set of human graders are also available (useful for comparing the agreement between human-machine scores to the agreement between two sets of humans) and are located in the human2 column in the .csv file.
Line 11: This field indicates that the provided machine scores are already re-scaled to match the distribution of human scores. rsmeval itself will not perform any scaling and the report will refer to these as scaled scores.
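If you prefer, the same configuration file can also be written programmatically; here is a minimal sketch using only the standard library, with the field values taken verbatim from the tutorial:

```python
# Write the tutorial's rsmeval configuration file programmatically.
import json

config = {
    "experiment_id": "ASAP2_evaluation",
    "description": "Evaluation of the scores generated using rsmtool.",
    "predictions_file": "ASAP2_scores.csv",
    "system_score_column": "system",
    "human_score_column": "human",
    "id_column": "ID",
    "trim_min": 1,
    "trim_max": 6,
    "second_human_score_column": "human2",
    "scale_with": "asis",
}

with open("config_rsmeval.json", "w") as f:
    json.dump(config, f, indent=4)
```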
Documentation for all of the available configuration options is available here.
Note
You can also use our nifty capability to automatically generate rsmeval
configuration files rather than creating them manually.
Run the experiment
Now that we have our scores in the right format and our configuration file in .json
format, we can use the rsmeval command-line script to run our evaluation experiment.
$ cd examples/rsmeval
$ rsmeval config_rsmeval.json
This should produce output like:
Output directory: /Users/nmadnani/work/rsmtool/examples/rsmeval
Assuming given system predictions are already scaled and will be used as such.
predictions: /Users/nmadnani/work/rsmtool/examples/rsmeval/ASAP2_scores.csv
Processing predictions
Saving pre-processed predictions and the metadata to disk
Running analyses on predictions
Starting report generation
Merging sections
Exporting HTML
Executing notebook with kernel: python3
Once the run finishes, you will see the output, figure, and report sub-directories in the current directory. Each of these directories contains useful information but we are specifically interested in the report/ASAP2_evaluation_report.html file, which is the final evaluation report.
Examine the report
Our experiment report contains all the information we would need to evaluate the provided system scores against the human scores. It includes:
The distributions for the human versus the system scores.
Several different metrics indicating how well the machine’s scores agree with the humans’.
Information about human-human agreement and the difference between human-human and human-system agreement.
… and much more.
Input
rsmeval
requires a single argument to run an experiment: the path to a configuration file. It can also take an output directory as an optional second argument. If the latter is not specified, rsmeval
will use the current directory as the output directory.
Here are all the arguments to the rsmeval
command-line script.
- config_file
The JSON configuration file for this experiment.
- output_dir (optional)
The output directory where all the files for this experiment will be stored.
- -f, --force
If specified, the contents of the output directory will be overwritten even if it already contains the output of another rsmeval experiment.
- -h, --help
Show help message and exit.
- -V, --version
Show version number and exit.
Experiment configuration file
This is a file in .json
format that provides overall configuration options for an rsmeval
experiment. Here’s an example configuration file for rsmeval
.
Note
To make it easy to get started with rsmeval
, we provide a way to automatically generate configuration files both interactively as well as non-interactively. Novice users will find interactive generation more helpful while more advanced users will prefer non-interactive generation. See this page for more details.
Next, we describe all of the rsmeval
configuration fields in detail. There are four required fields and the rest are all optional. We first describe the required fields and then the optional ones (sorted alphabetically).
experiment_id
An identifier for the experiment that will be used to name the report and all intermediate files. It can be any combination of alphanumeric values, must not contain spaces, and must be no longer than 200 characters.
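A sketch of these constraints as a validation check (the regex is our own approximation, not rsmtool code; underscores are allowed since the tutorial's own experiment IDs use them):

```python
# Approximate validation of the experiment_id constraints described above:
# alphanumeric characters and underscores, no spaces, at most 200 chars.
import re

def is_valid_experiment_id(experiment_id):
    return re.fullmatch(r"[A-Za-z0-9_]{1,200}", experiment_id) is not None
```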
predictions_file
The path to the file with predictions to evaluate. The file should be in one of the supported formats. Each row should correspond to a single response and contain the predicted and observed scores for this response. In addition, there should be a column with a unique identifier (ID) for each response. The path can be absolute or relative to the location of the configuration file.
system_score_column
The name for the column containing the scores predicted by the system. These scores will be used for evaluation.
trim_min
The single numeric value for the lowest possible integer score that the machine should predict. This value will be used to compute the floor value for trimmed (bound) machine scores as trim_min - trim_tolerance.
trim_max
The single numeric value for the highest possible integer score that the machine should predict. This value will be used to compute the ceiling value for trimmed (bound) machine scores as trim_max + trim_tolerance.
Note
Although the trim_min
and trim_max
fields are optional for rsmtool
, they are required for rsmeval
.
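The floor and ceiling computation can be sketched in a few lines (illustrative only, not rsmeval internals; 0.4998 is the documented default for trim_tolerance):

```python
# Sketch of score trimming with the tutorial's scale (1-6) and the
# default trim_tolerance; predictions are bound to the padded range.
trim_min, trim_max, trim_tolerance = 1, 6, 0.4998

def trim(score):
    floor = trim_min - trim_tolerance
    ceiling = trim_max + trim_tolerance
    return min(max(score, floor), ceiling)
```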
candidate_column (Optional)
The name for an optional column in the predictions file containing unique candidate IDs. Candidate IDs are different from response IDs since the same candidate (test-taker) might have responded to multiple questions.
custom_sections (Optional)
A list of custom, user-defined sections to be included in the final report. These are IPython notebooks (.ipynb
files) created by the user. The list must contain paths to the notebook files, either absolute or relative to the configuration file. All custom notebooks have access to some pre-defined variables.
description (Optional)
A brief description of the experiment. This will be included in the report. The description can contain spaces and punctuation. It’s blank by default.
exclude_zero_scores (Optional)
By default, responses with human scores of 0 will be excluded from evaluations. Set this field to false
if you want to keep responses with scores of 0. Defaults to true
.
file_format (Optional)
The format of the intermediate files. Options are csv
, tsv
, or xlsx
. Defaults to csv
if this is not specified.
flag_column (Optional)
This field makes it possible to only use responses with particular values in a given column (e.g. only responses with a value of 0
in a column called ADVISORY
). The field takes a dictionary in Python format where the keys are the names of the columns and the values are lists of values for responses that will be evaluated. For example, a value of {"ADVISORY": 0}
will mean that rsmeval
will only use responses for which the ADVISORY
column has the value 0. Defaults to None
.
Note
If several conditions are specified (e.g., {"ADVISORY": 0, "ERROR": 0}
) only those responses which satisfy all the conditions will be selected for further analysis (in this example, these will be the responses where the ADVISORY
column has a value of 0 and the ERROR
column has a value of 0).
Note
When reading the values in the supplied dictionary, rsmeval
treats numeric strings, floats and integers as the same value. Thus 1
, 1.0
, "1"
and "1.0"
are all treated as 1.0.
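The filtering semantics can be sketched as follows (our own illustration, not rsmeval internals; the ADVISORY and ERROR column names come from the examples above):

```python
# Sketch of flag_column filtering: keep only rows whose flag columns
# contain one of the allowed values, normalizing numeric strings,
# floats, and integers to the same value as described in the note.
def normalize(value):
    """Treat 1, 1.0, "1", and "1.0" as the same value."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return value

flag_column = {"ADVISORY": [0], "ERROR": [0]}
rows = [
    {"ID": "r1", "ADVISORY": 0, "ERROR": "0"},
    {"ID": "r2", "ADVISORY": 1, "ERROR": 0},
    {"ID": "r3", "ADVISORY": "0.0", "ERROR": 0},
]
kept = [
    r for r in rows
    if all(normalize(r[col]) in {normalize(v) for v in vals}
           for col, vals in flag_column.items())
]
# kept contains r1 and r3: both conditions must be satisfied
```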
general_sections (Optional)
RSMTool provides pre-defined sections for rsmeval
(listed below) and, by default, all of them are included in the report. However, you can choose a subset of these pre-defined sections by specifying a list as the value for this field.
data_description
: Shows the total number of responses, along with any responses that have been excluded due to non-numeric/zero scores or flag columns.
data_description_by_group
: Shows the total number of responses for each of the subgroups specified in the configuration file. This section only covers the responses used to evaluate the model.
consistency
: Shows metrics for human-human agreement, the difference (“degradation”) between the human-human and human-system agreement, and the disattenuated human-machine correlations. This notebook is only generated if the config file specifies second_human_score_column.
evaluation
: Shows the standard set of evaluations recommended for scoring models on the evaluation data:
a table showing human-system association metrics;
the confusion matrix; and
a barplot showing the distributions for both human and machine scores.
evaluation_by_group
: Shows barplots with the main evaluation metrics by each of the subgroups specified in the configuration file.
fairness_analyses
: Additional fairness analyses suggested in Loukina, Madnani, & Zechner, 2019. The notebook shows:
percentage of variance in squared error explained by subgroup membership
percentage of variance in raw (signed) error explained by subgroup membership
percentage of variance in raw (signed) error explained by subgroup membership when controlling for human score
plots showing estimates for each subgroup for each model
true_score_evaluation
: Evaluation of system scores against the true scores estimated according to test theory. The notebook shows:
Number of single and double-scored responses.
Variance of human rater errors and estimated variance of true scores
Mean squared error (MSE) and proportional reduction in mean squared error (PRMSE) when predicting true score with system score.
intermediate_file_paths
: Shows links to all of the intermediate files that were generated while running the evaluation.
sysinfo
: Shows all Python packages along with versions installed in the current environment while generating the report.
human_score_column (Optional)
The name for the column containing the human scores for each response. The values in this column will be used as observed scores. Defaults to sc1
.
Note
All responses with non-numeric values or zeros in either human_score_column
or system_score_column
will be automatically excluded from evaluation. You can use exclude_zero_scores (Optional) to keep responses with zero scores.
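The exclusion rule can be sketched as follows (an approximation, not rsmeval internals):

```python
# Sketch of the automatic exclusion rule: a score value is usable only
# if it is numeric and, unless exclude_zero_scores is false, non-zero.
def is_usable(value, exclude_zero=True):
    try:
        score = float(value)
    except (TypeError, ValueError):
        return False  # non-numeric values are always excluded
    return not (exclude_zero and score == 0)
```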
id_column (Optional)
The name of the column containing the response IDs. Defaults to spkitemid
, i.e., if this is not specified, rsmeval
will look for a column called spkitemid
in the prediction file.
min_items_per_candidate (Optional)
An integer value for the minimum number of responses expected from each candidate. If any candidates have fewer responses than the specified value, all responses from those candidates will be excluded from further analysis. Defaults to None
.
min_n_per_group (Optional)
A single numeric value or a dictionary with keys as the group names listed in the subgroups field and values as the thresholds for the groups. When specified, only groups with at least this number of instances will be displayed in the tables and plots contained in the report. Note that this parameter only affects the HTML report and the figures. For all analyses – including the computation of the population parameters – data from all groups will be used. In addition, the intermediate files will still show the results for all groups.
Note
If you supply a dictionary, it must contain a key for every subgroup listed in the subgroups field. If no threshold is to be applied for some of the groups, set the threshold value for those groups to 0 in the dictionary.
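A sketch of how such a threshold affects what is displayed (illustrative; the subgroup column "L1" and its category counts are made up):

```python
# Sketch of min_n_per_group display filtering: for a subgroup column
# (here a hypothetical "L1"), only categories with at least the
# threshold number of responses appear in the report's tables and
# plots; all categories still contribute to the analyses themselves.
min_n_per_group = {"L1": 10}  # threshold for the "L1" subgroup column
counts_by_category = {"english": 150, "spanish": 12, "other": 3}

displayed = [cat for cat, n in counts_by_category.items()
             if n >= min_n_per_group["L1"]]
# only "english" and "spanish" would be shown in the report
```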
rater_error_variance (Optional)
True score evaluations require an estimate of rater error variance. By default, rsmeval
will compute this variance from double-scored responses in the data. However, in some cases, one may wish to compute the variance on a different sample of responses. In such cases, this field can be used to set the rater error variance to a precomputed value which is then used as-is by rsmeval
. You can use the rsmtool.utils.variance_of_errors function to compute rater error variance outside the main evaluation pipeline.
scale_with (Optional)
In many scoring applications, system scores are re-scaled so that their mean and standard deviation match those of the human scores for the training data.
If you want rsmeval
to re-scale the supplied predictions, you need to provide – as the value for this field – the path to a second file in one of the supported formats containing the human scores and predictions of the same system on its training data. This file must have two columns: the human scores under the sc1 column and the predicted scores under the prediction column.
This field can also be set to "asis"
if the scores are already scaled. In this case, no additional scaling will be performed by rsmeval
but the report will refer to the scores as “scaled”.
Defaults to "raw"
which means that no rescaling is performed and the report refers to the scores as “raw”.
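When a training-data file is supplied, the re-scaling matches the mean and standard deviation of the system scores to those of the human scores. A sketch under that assumption (illustrative values; we use the population standard deviation here, which may differ from rsmeval's exact computation):

```python
# Sketch of mean/sd matching: map a system score onto the human score
# distribution observed on the training data.
from statistics import mean, pstdev

human_train = [2, 3, 4, 5, 3]             # sc1 column of the scale_with file
system_train = [2.2, 2.9, 4.1, 4.8, 3.0]  # prediction column

def rescale(score):
    z = (score - mean(system_train)) / pstdev(system_train)
    return z * pstdev(human_train) + mean(human_train)
```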
second_human_score_column (Optional)
The name for an optional column in the test data containing a second human score for each response. If specified, additional information about human-human agreement and degradation will be computed and included in the report. Note that this column must contain either numbers or be empty. Non-numeric values are not accepted. Note also that the exclude_zero_scores (Optional) option below will apply to this column too.
Note
You do not need to have second human scores for all responses to use this option. The human-human agreement statistics will be computed as long as there is at least one response with numeric value in this column. For responses that do not have a second human score, the value in this column should be blank.
section_order (Optional)
A list containing the order in which the sections in the report should be generated. Any specified order must explicitly list:
Either all pre-defined sections if a value for the general_sections field is not specified OR the sections specified using general_sections, and
All custom section names specified using custom_sections, i.e., file prefixes only, without the path and without the .ipynb extension.
subgroups (Optional)
A list of column names indicating grouping variables used for generating analyses specific to each of those defined subgroups. For example, ["prompt", "gender", "native_language", "test_country"]
. These subgroup columns need to be present in the input predictions file. If subgroups are specified, rsmeval
will generate:
tables and barplots showing human-system agreement for each subgroup on the evaluation set.
trim_tolerance (Optional)
The single numeric value that will be used to pad the trimming range specified in trim_min
and trim_max
. This value will be used to compute the ceiling and floor values for trimmed (bound) machine scores as trim_max
+ trim_tolerance
for ceiling value and trim_min
-trim_tolerance
for floor value.
Defaults to 0.4998.
Note
For more fine-grained control over the trimming range, you can set trim_tolerance
to 0 and use trim_min
and trim_max
to specify the exact floor and ceiling values.
use_thumbnails (Optional)
If set to true
, the images in the HTML will be set to clickable thumbnails rather than full-sized images. Upon clicking the thumbnail, the full-sized images will be displayed in a separate tab in the browser. If set to false
, full-sized images will be displayed as usual. Defaults to false
.
use_wandb (Optional)
If set to true
, the generated reports and all intermediate tables will be logged to Weights & Biases.
The Weights & Biases entity and project name should be specified in the appropriate configuration fields.
The tables and plots will be logged in a section named rsmeval in a new run under the given project, and the report will be
added to a reports section in that run. In addition, some evaluation metrics will be logged to the run’s history; see more details
here. Defaults to false
.
wandb_project (Optional)
The Weights & Biases project name if logging to Weights & Biases is enabled. If a project by this name does not already exist, it will be created.
Important
Before using Weights & Biases for the first time, users should log in and provide their API key as described in W&B Quickstart guidelines.
Note that when using W&B logging, the rsmtool run may take significantly longer due to the network traffic being sent to W&B.
wandb_entity (Optional)
The Weights & Biases entity name if logging to Weights & Biases is enabled. Entity can be a user name or the name of a team or organization.
Output
rsmeval
produces a set of folders in the output directory. If logging to Weights & Biases is enabled,
the reports and all intermediate files are also logged to the specified Weights & Biases project.
report
This folder contains the final RSMEval report in HTML format as well as in the form of a Jupyter notebook (a .ipynb
file).
output
This folder contains all of the intermediate files produced as part of the various analyses performed, saved as .csv
files. rsmeval
will also save in this folder a copy of the configuration file. Fields not specified in the original configuration file will be pre-populated with default values.
figure
This folder contains all of the figures generated as part of the various analyses performed, saved as .svg
files.
Intermediate files
Although the primary output of rsmeval
is an HTML report, we also want the user to be able to conduct additional analyses outside of rsmeval
. To this end, all of the tables produced in the experiment report are saved as files in the format specified by the file_format
parameter in the output
directory. The following sections describe all of the intermediate files that are produced.
Note
The names of all files begin with the experiment_id
provided by the user in the experiment configuration file. In addition, the names for certain columns are set to default values in these files irrespective of what they were named in the original data files. This is because RSMEval standardizes these column names internally for convenience. These values are:
spkitemid for the column containing response IDs.
sc1 for the column containing the human scores used as observed scores.
sc2 for the column containing the second human scores, if this column was specified in the configuration file.
candidate for the column containing candidate IDs, if this column was specified in the configuration file.
Predictions
filename: pred_processed
This file contains the post-processed predicted scores: the predictions from the model are truncated, rounded, and re-scaled (if requested).
Flagged responses
filename: test_responses_with_excluded_flags
This file contains all of the rows in the input predictions file that were filtered out based on conditions specified in flag_column.
Note
If the predictions file contained columns with internal names such as sc1
that were not actually used by rsmeval
, they will still be included in these files but their names will be changed to ##name##
(e.g. ##sc1##
).
Excluded responses
filename: test_excluded_responses
This file contains all of the rows in the predictions file that were filtered out because of non-numeric or zero scores.
Response metadata
filename: test_metadata
This file contains the metadata columns (id_column
, subgroups
if provided) for all rows in the predictions file that were used in the evaluation.
Unused columns
filename: test_other_columns
This file contains all of the columns from the input predictions file that are not present in the *_pred_processed
and *_metadata
files. They only include the rows that were not filtered out.
Note
If the predictions file contained columns with internal names such as sc1
but these columns were not actually used by rsmeval
, these columns will also be included in these files but their names will be changed to ##name##
(e.g. ##sc1##
).
Human scores
filename: test_human_scores
This file contains the human scores, if available in the input predictions file, under a column called sc1
with the response IDs under the spkitemid
column.
If second_human_score_column
was specified, then it also contains the values in the predictions file from that column under a column called sc2
. Only the rows that were not filtered out are included.
Note
If exclude_zero_scores
was set to true
(the default value), all zero scores in the second_human_score_column
will be replaced by nan
.
Data composition
filename: data_composition
This file contains the total number of responses in the input predictions file. If applicable, the table will also include the number of different subgroups.
Excluded data composition
filenames: test_excluded_composition
This file contains the composition of the set of excluded responses, e.g., why were they excluded and how many for each such exclusion.
Subgroup composition
filename: data_composition_by_<SUBGROUP>
There will be one such file for each of the specified subgroups and it contains the total number of responses in that subgroup.
Evaluation metrics
eval: This file contains the descriptives for predicted and human scores (mean, std. dev., etc.) as well as the association metrics (correlation, quadratic weighted kappa, SMD, etc.) for the raw as well as the post-processed scores.
eval_by_<SUBGROUP>: the same information as in *_eval.csv computed separately for each subgroup. However, rather than SMD, a difference of standardized means (DSM) will be calculated using z-scores.
eval_short: a shortened version of eval that contains specific descriptives for predicted and human scores (mean, std. dev., etc.) and association metrics (correlation, quadratic weighted kappa, SMD, etc.) for specific score types chosen based on recommendations by Williamson (2012). Specifically, the following columns are included (the raw or scale version is chosen depending on the value of use_scaled_predictions in the configuration file):
h_mean
h_sd
corr
sys_mean [raw/scale trim]
sys_sd [raw/scale trim]
SMD [raw/scale trim]
adj_agr [raw/scale trim_round]
exact_agr [raw/scale trim_round]
kappa [raw/scale trim_round]
wtkappa [raw/scale trim]
sys_mean [raw/scale trim_round]
sys_sd [raw/scale trim_round]
SMD [raw/scale trim_round]
R2 [raw/scale trim]
RMSE [raw/scale trim]
score_dist: the distributions of the human scores and the rounded raw/scaled predicted scores, depending on the value of use_scaled_predictions.
confMatrix: the confusion matrix between the human scores and the rounded raw/scaled predicted scores, depending on the value of use_scaled_predictions.
Note
Please note that for raw scores, SMD values are likely to be affected by possible differences in scale.
Human-human Consistency
These files are created only if a second human score has been made available via the second_human_score_column
option in the configuration file.
consistency: contains descriptives for both human raters as well as the agreement metrics between their ratings.
consistency_by_<SUBGROUP>: contains the same metrics as in the consistency file computed separately for each group. However, rather than SMD, a difference of standardized means (DSM) will be calculated using z-scores.
degradation: shows the differences between human-human agreement and machine-human agreement for all association metrics and all forms of predicted scores.
Evaluations based on test theory
disattenuated_correlations: shows the correlation between human-machine scores, human-human scores, and the disattenuated human-machine correlation computed as the human-machine correlation divided by the square root of the human-human correlation.
disattenuated_correlations_by_<SUBGROUP>: contains the same metrics as in the disattenuated_correlations file computed separately for each group.
true_score_eval: evaluations of system scores against estimated true scores. Contains total counts of single and double-scored responses, variance of human rater error, estimated true score variance, and mean squared error (MSE) and proportional reduction in mean squared error (PRMSE) when predicting true score using system score.
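The disattenuated correlation described above can be computed directly (illustrative values):

```python
# Disattenuated human-machine correlation: the human-machine correlation
# divided by the square root of the human-human correlation.
import math

r_human_machine = 0.75  # correlation between human and machine scores
r_human_human = 0.81    # correlation between the two human raters

disattenuated = r_human_machine / math.sqrt(r_human_human)
```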
Additional fairness analyses
These files contain the results of additional fairness analyses suggested in Loukina, Madnani, & Zechner, 2019.
<METRICS>_by_<SUBGROUP>.ols: a serialized object of type pandas.stats.ols.OLS containing the fitted model for estimating the variance attributed to a given subgroup membership for a given metric. The subgroups are defined by the configuration file. The metrics are osa (overall score accuracy), osd (overall score difference), and csd (conditional score difference).
<METRICS>_by_<SUBGROUP>_ols_summary.txt: a text file containing a summary of the above model.
estimates_<METRICS>_by_<SUBGROUP>: coefficients, confidence intervals and p-values estimated by the model for each subgroup.
fairness_metrics_by_<SUBGROUP>: the \(R^2\) (percentage of variance) and p-values for all models.
rsmpredict
- Generate new predictions
RSMTool provides the rsmpredict
command-line utility to generate predictions for new data using a model already trained using the rsmtool
utility. This can be useful when processing a new set of responses to the same task without needing to retrain the model.
rsmpredict
pre-processes the feature values according to user specifications before using them to generate the predicted scores. The generated scores are post-processed in the same manner as they are in rsmtool
output.
Note
No score is generated for responses with non-numeric values for any of the features included in the model.
If the original model specified transformations for some of the features and these transformations led to NaN
or Inf
values when applied to the new data, rsmpredict
will raise a warning. No score will be generated for such responses.
Tutorial
For this tutorial, you first need to install RSMTool and make sure the conda environment in which you installed it is activated.
Workflow
Important
Although this tutorial provides feature values for the purpose of illustration, rsmpredict
does not include any functionality for feature extraction; the tool is designed for researchers who use their own NLP/Speech processing pipeline to extract features for their data.
rsmpredict
allows you to generate the scores for new data using an existing model trained using RSMTool. Therefore, before starting this tutorial, you first need to complete the rsmtool tutorial, which will produce a trained RSMTool model. You will also need to process the new data to extract the same features as the ones used in the model.
Once you have the features for the new data and the RSMTool model, using rsmpredict
is fairly straightforward:
Create a file containing the features for the new data. The file should be in one of the supported formats.
Create an experiment configuration file describing the experiment you would like to run.
Run that configuration file with rsmpredict to generate the predicted scores.
Note
You do not need human scores to run rsmpredict since it does not produce any evaluation analyses. If you do have human scores for the new data and you would like to evaluate the system on this new data, you can first run rsmpredict to generate the predictions and then run rsmeval on the output of rsmpredict to generate an evaluation report.
ASAP Example
We are going to use the same example from the 2012 Kaggle competition on automated essay scoring that we used for the rsmtool tutorial. Specifically, we are going to use the linear regression model we trained in that tutorial to generate scores for new data.
Note
If you have not already completed that tutorial, please do so now. You may need to complete it again if you deleted the output files.
Extract features
We will first need to generate features for the new set of responses for which we want to predict scores. For this experiment, we will simply re-use the test set from the rsmtool
tutorial.
Note
The features used with rsmpredict
should be generated using the same NLP/Speech processing pipeline that generated the features used in the rsmtool
modeling experiment.
Create a configuration file
The next step is to create an rsmpredict experiment configuration file in .json
format.
1{
2 "experiment_dir": "../rsmtool",
3 "experiment_id": "ASAP2",
4 "input_features_file": "../rsmtool/test.csv",
5 "id_column": "ID",
6 "human_score_column": "score",
7 "second_human_score_column": "score2"
8}
Let’s take a look at the options in our configuration file.
Line 2: We give the path to the directory containing the output of the rsmtool experiment.
Line 3: We provide the experiment_id of the rsmtool experiment used to train the model. This can usually be read off the output/<experiment_id>.model file in the rsmtool experiment output directory.
Line 4: We list the path to the data file with the feature values for the new data. For this tutorial we used .csv format, but RSMTool also supports several other input file formats.
Line 5: This field indicates that the unique IDs for the responses in the .csv file are located in a column named ID.
Lines 6-7: These fields indicate that there are two sets of human scores in our .csv file, located in the columns named score and score2. The values from these columns will be added to the output file containing the predictions, which can be useful if we want to evaluate the predictions using rsmeval.
Documentation for all of the available configuration options is available here.
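For reference, the same configuration can also be assembled programmatically, which is handy when generating many configuration files in a loop. This is just a sketch using the standard json module; the output file name is arbitrary:

```python
import json

# Build the rsmpredict configuration shown above as a plain
# dictionary and write it to disk as JSON.
config = {
    "experiment_dir": "../rsmtool",
    "experiment_id": "ASAP2",
    "input_features_file": "../rsmtool/test.csv",
    "id_column": "ID",
    "human_score_column": "score",
    "second_human_score_column": "score2",
}

with open("config_rsmpredict.json", "w") as f:
    json.dump(config, f, indent=4)
```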
Note
You can also use our nifty capability to automatically generate rsmpredict
configuration files rather than creating them manually.
Run the experiment
Now that we have the model, the features in the right format, and our configuration file in .json
format, we can use the rsmpredict command-line script to generate the predictions and to save them in predictions.csv
.
$ cd examples/rsmpredict
$ rsmpredict config_rsmpredict.json predictions.csv
This should produce output like:
WARNING: The following extraneous features will be ignored: {'spkitemid', 'sc1', 'sc2', 'LENGTH'}
Pre-processing input features
Generating predictions
Rescaling predictions
Trimming and rounding predictions
Saving predictions to /Users/nmadnani/work/rsmtool/examples/rsmpredict/predictions.csv
You should now see a file named predictions.csv
in the current directory which contains the predicted scores for the new data in the predictions
column.
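Because the output is a plain delimited file, it can be consumed with standard CSV tooling. Here is a minimal sketch using Python's csv module on a made-up two-row sample shaped like the rsmpredict output columns (the response IDs and score values are hypothetical):

```python
import csv
import io

# A tiny sample in the shape of the rsmpredict predictions file;
# all values below are invented for illustration.
sample = """spkitemid,sc1,sc2,raw,raw_trim,raw_trim_round,scale,scale_trim,scale_trim_round
RESPONSE_1,3,4,3.42,3.42,3,3.38,3.38,3
RESPONSE_2,2,2,1.87,1.87,2,1.91,1.91,2
"""

rows = list(csv.DictReader(io.StringIO(sample)))
# Pull out the rounded scaled score for each response.
scores = {row["spkitemid"]: int(row["scale_trim_round"]) for row in rows}
print(scores)  # {'RESPONSE_1': 3, 'RESPONSE_2': 2}
```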
Input
rsmpredict
requires two arguments to generate predictions: the path to a configuration file and the path to the output file where the generated predictions are saved in .csv
format.
If you also want to save the pre-processed feature values, rsmpredict
can take a third optional argument --features
to specify the path to a .csv
file to save these values.
Here are all the arguments to the rsmpredict
command-line script.
- config_file
The JSON configuration file for this experiment.
- output_file
The output
.csv
file where predictions will be saved.
- --features <preproc_feats_file>
If specified, the pre-processed values for the input features will also be saved in this
.csv
file.
- -h, --help
Show help message and exit.
- -V, --version
Show version number and exit.
Experiment configuration file
This is a file in .json
format that provides overall configuration options for an rsmpredict
experiment. Here’s an example configuration file for rsmpredict
.
Note
To make it easy to get started with rsmpredict
, we provide a way to automatically generate configuration files both interactively and non-interactively. Novice users will find interactive generation more helpful, while more advanced users will prefer non-interactive generation. See this page for more details.
Next, we describe all of the rsmpredict
configuration fields in detail. There are three required fields and the rest are all optional. We first describe the required fields and then the optional ones (sorted alphabetically).
experiment_dir
The path to the directory containing the rsmtool
model to use for generating predictions. This directory must contain a sub-directory called output
with the model files, feature pre-processing parameters, and score post-processing parameters. The path can be absolute or relative to the location of the configuration file.
experiment_id
The experiment_id
used to create the rsmtool
model files being used for generating predictions. If you do not know the experiment_id
, you can find it by looking at the prefix of the .model
file under the output
directory.
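If you script this lookup, the experiment_id can be recovered from the prefix of the .model file name. A sketch, using a throwaway directory with a fake model file standing in for a real rsmtool output directory:

```python
from pathlib import Path
import tempfile

# Recover the experiment_id from the prefix of the .model file in an
# rsmtool output directory. The directory and file here are created
# purely for illustration.
with tempfile.TemporaryDirectory() as exp_dir:
    output_dir = Path(exp_dir) / "output"
    output_dir.mkdir()
    (output_dir / "ASAP2.model").touch()

    model_files = list(output_dir.glob("*.model"))
    experiment_id = model_files[0].stem

print(experiment_id)  # ASAP2
```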
input_features_file
The path to the file with the raw feature values that will be used for generating predictions. The file should be in one of the supported formats. Each row should correspond to a single response and contain feature values for this response. In addition, there should be a column with a unique identifier (ID) for each response. The path can be absolute or relative to the location of the configuration file. Note that the feature names must be the same as those used in the original rsmtool
experiment.
Note
rsmpredict
will only generate predictions for responses in this file that have numeric values for the features included in the rsmtool
model.
See also
rsmpredict
does not require human scores for the new data since it does not evaluate the generated predictions. If you do have the human scores and want to evaluate the new predictions, you can use the rsmeval command-line utility.
candidate_column (Optional)
The name for the column containing unique candidate IDs. This column will be named candidate
in the output file with predictions.
file_format (Optional)
The format of the intermediate files. Options are csv
, tsv
, or xlsx
. Defaults to csv
if this is not specified.
flag_column (Optional)
See description in the rsmtool configuration file for further information. No filtering will be done by rsmpredict
, but the contents of all specified columns will be added to the predictions file using the original column names.
human_score_column (Optional)
The name for the column containing human scores. This column will be renamed to sc1
.
id_column (Optional)
The name of the column containing the response IDs. Defaults to spkitemid
, i.e., if this is not specified, rsmpredict
will look for a column called spkitemid
in the prediction file.
There are several other options in the configuration file that, while not directly used by rsmpredict
, can simply be passed through from the input features file to the output predictions file. This can be particularly useful if you want to subsequently run rsmeval to evaluate the generated predictions.
predict_expected_scores (Optional)
If the original model was a probabilistic SKLL classifier, then expected scores (probability-weighted averages over the contiguous numeric score points) can be generated as the machine predictions instead of the most likely score point, which would be the default. Set this field to true
to compute expected scores as predictions. Defaults to false
.
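Conceptually, an expected score is just the probability-weighted average of the score points. A small sketch with made-up class probabilities:

```python
# Expected score from a probabilistic classifier's output: a
# probability-weighted average over the contiguous numeric score
# points. The probabilities below are invented for illustration.
score_points = [1, 2, 3, 4]
probabilities = [0.05, 0.20, 0.60, 0.15]  # must sum to 1

expected_score = sum(p * s for p, s in zip(probabilities, score_points))
print(round(expected_score, 2))  # 2.85, rather than the most likely point, 3
```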
Note
If the model in the original rsmtool
experiment is an SVC, that original experiment must have been run with predict_expected_scores
set to true
. This is because SVC classifiers are fit differently if probabilistic output is desired, in contrast to other probabilistic SKLL classifiers.

You may see slight differences in expected score predictions if you run the experiment on different machines or on different operating systems, most likely due to very small probability values for certain score points, which can affect floating point computations.
second_human_score_column (Optional)
The name for the column containing the second human score. This column will be renamed to sc2
.
standardize_features (Optional)
If this option is set to false
, features will not be standardized by subtracting the mean and dividing by the standard deviation. Defaults to true
.
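As a sketch, standardization transforms each feature value by subtracting the mean and dividing by the standard deviation. In a real run, these two parameters come from the original rsmtool training data rather than from the new data; the values below are made up:

```python
from statistics import mean, stdev

# Standardize a column of feature values (invented numbers). In an
# actual rsmpredict run, mu and sigma would be the pre-processing
# parameters saved by the original rsmtool experiment.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu, sigma = mean(values), stdev(values)

standardized = [(v - mu) / sigma for v in values]
```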
subgroups (Optional)
A list of column names indicating grouping variables used for generating analyses specific to each of those defined subgroups. For example, ["prompt", "gender", "native_language", "test_country"]
. All these columns will be included in the predictions file with their original names.
truncate_outliers (Optional)
If this option is set to false
, outliers (values more than 4 standard deviations away from the mean) in feature columns will not be truncated. Defaults to true
.
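The truncation itself amounts to clipping values at four standard deviations on either side of the mean. A sketch with made-up numbers; in a real run the mean and standard deviation come from the original rsmtool training data:

```python
from statistics import mean, stdev

# Clip feature values to the [mean - 4*sd, mean + 4*sd] range.
# The training values below are invented; they stand in for the
# pre-processing parameters saved by the original rsmtool experiment.
train_values = [3.1, 2.9, 3.0, 3.2, 2.8, 3.1, 2.9, 3.0]
mu, sigma = mean(train_values), stdev(train_values)
lo, hi = mu - 4 * sigma, mu + 4 * sigma

new_values = [3.0, 2.7, 50.0]  # 50.0 is an obvious outlier
truncated = [min(max(v, lo), hi) for v in new_values]
```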
use_wandb (Optional)
If set to true
, the generated report and the predictions table will be logged to Weights & Biases.
The Weights & Biases entity and project name should be specified in the appropriate configuration fields.
The predictions table will be logged in a section named rsmpredict in a new run under the given project, and the report will be
added to a reports section in that run.
Defaults to false
.
wandb_project (Optional)
The Weights & Biases project name if logging to Weights & Biases is enabled. If a project by this name does not already exist, it will be created.
Important
Before using Weights & Biases for the first time, users should log in and provide their API key as described in W&B Quickstart guidelines.
Note that when using W&B logging, the rsmtool run may take significantly longer due to the network traffic being sent to W&B.
wandb_entity (Optional)
The Weights & Biases entity name if logging to Weights & Biases is enabled. Entity can be a user name or the name of a team or organization.
Output
rsmpredict
produces a .csv
file with predictions for all responses in new data set, and, optionally, a .csv
file with pre-processed feature values. If any of the responses had non-numeric feature values in
the original data or after applying transformations, these are saved in a file named PREDICTIONS_NAME_excluded_responses.csv
where PREDICTIONS_NAME
is the name of the predictions file supplied by the user without the extension.
The predictions .csv
file contains the following columns:
spkitemid: the unique response IDs from the original feature file.
sc1 and sc2: the human scores for each response from the original feature file (human_score_column and second_human_score_column, respectively).
raw: raw predictions generated by the model.
raw_trim, raw_trim_round, scale, scale_trim, scale_trim_round: raw scores post-processed in different ways.
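The trimming and rounding behind these post-processed columns can be sketched as follows, assuming a 1-4 score range and a trim tolerance of 0.4998 (both are properties of the original rsmtool experiment; the raw scores below are made up):

```python
# Sketch of the post-processing behind raw_trim and raw_trim_round.
trim_min, trim_max, tolerance = 1, 4, 0.4998

def trim(score):
    """Clip a raw prediction into [trim_min - tol, trim_max + tol]."""
    return min(max(score, trim_min - tolerance), trim_max + tolerance)

raw = [0.3, 2.6, 5.1]                               # hypothetical raw predictions
raw_trim = [trim(s) for s in raw]                   # [0.5002, 2.6, 4.4998]
raw_trim_round = [int(round(s)) for s in raw_trim]  # [1, 3, 4]
```

The scale* columns apply the same trimming and rounding after rescaling the raw scores.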
If logging to Weights & Biases is enabled, these csv files are also logged to the specified Weights & Biases project.
rsmcompare
- Create a detailed comparison of two scoring models
RSMTool provides the rsmcompare
command-line utility to compare two models and to generate a detailed comparison report including differences between the two models. This can be useful in many scenarios, e.g., say the user wants to compare the changes in model performance after adding a new feature to the model. To use rsmcompare
, the user must first run two experiments using either rsmtool or rsmeval. rsmcompare
can then be used to compare the outputs of these two experiments to each other.
Note
Currently rsmcompare
takes the outputs of the analyses generated during the original experiments and creates comparison tables. These comparison tables were designed with a specific comparison scenario in mind: comparing a baseline model with a model which includes new feature(s). The tool can certainly be used for other comparison scenarios if the researcher feels that the generated comparison output is appropriate.
rsmcompare
can be used to compare:
Two rsmtool experiments, or
Two rsmeval experiments, or
An rsmtool experiment with an rsmeval experiment (in this case, only the evaluation analyses will be compared).
Note
It is strongly recommended that the original experiments as well as the comparison experiment are all done using the same version of RSMTool.
Tutorial
For this tutorial, you first need to install RSMTool and make sure the conda environment in which you installed it is activated.
Workflow
rsmcompare
is designed to compare two existing rsmtool
or rsmeval
experiments. To use rsmcompare
you need to:
Create an experiment configuration file describing the comparison experiment you would like to run.
Run that configuration file with rsmcompare and generate the comparison experiment HTML report.
Examine the HTML report to compare the two models.
Note that the above workflow does not use the customization features of rsmcompare
, e.g., choosing which sections to include in the report or adding custom analyses sections etc. However, we will stick with this workflow for our tutorial since it is likely to be the most common use case.
ASAP Example
We are going to use the same example from the 2012 Kaggle competition on automated essay scoring that we used for the rsmtool tutorial.
Run rsmtool
(or rsmeval
) experiments
rsmcompare
compares the results of two existing rsmtool
(or rsmeval
) experiments. For this tutorial, we will compare the model trained in the rsmtool tutorial to itself.
Note
If you have not already completed that tutorial, please do so now. You may need to complete it again if you deleted the output files.
Create a configuration file
The next step is to create an experiment configuration file in .json
format.
1{
2 "comparison_id": "ASAP2_vs_ASAP2",
3 "experiment_id_old": "ASAP2",
4 "experiment_dir_old": "../rsmtool/",
5 "description_old": "RSMTool experiment.",
6 "use_scaled_predictions_old": true,
7 "experiment_id_new": "ASAP2",
8 "experiment_dir_new": "../rsmtool",
9 "description_new": "RSMTool experiment (copy).",
10 "use_scaled_predictions_new": true
11}
Let’s take a look at the options in our configuration file.
Line 2: We provide an ID for the comparison experiment.
Line 3: We provide the experiment_id for the experiment we want to use as a baseline.
Line 4: We also give the path to the directory containing the output of the original baseline experiment.
Line 5: We give a short description of this baseline experiment. This will be shown in the report.
Line 6: This field indicates that the baseline experiment used scaled scores for some evaluation analyses.
Line 7: We provide the experiment_id for the new experiment. We use the same experiment ID for both experiments since we are comparing the experiment to itself.
Line 8: We also give the path to the directory containing the output of the new experiment. As above, we use the same path because we are comparing the experiment to itself.
Line 9: We give a short description of the new experiment. This will also be shown in the report.
Line 10: This field indicates that the new experiment also used scaled scores for some evaluation analyses.
Documentation for all of the available configuration options is available here.
Note
You can also use our nifty capability to automatically generate rsmcompare
configuration files rather than creating them manually.
Run the experiment
Now that we have the two experiments we want to compare and our configuration file in .json
format, we can use the rsmcompare command-line script to run our comparison experiment.
$ cd examples/rsmcompare
$ rsmcompare config_rsmcompare.json
This should produce output like:
Output directory: /Users/nmadnani/work/rsmtool/examples/rsmcompare
Starting report generation
Merging sections
Exporting HTML
Executing notebook with kernel: python3
Once the run finishes, you will see an HTML file named ASAP2_vs_ASAP2_report.html
. This is the final rsmcompare
comparison report.
Examine the report
Our experiment report contains all the information we would need to compare the new model to the baseline model. It includes:
Comparison of feature distributions between the two experiments.
Comparison of model coefficients between the two experiments.
Comparison of model performance between the two experiments.
Note
Since we are comparing the experiment to itself, the comparison is not very interesting, e.g., the differences between various values will always be 0.
Input
rsmcompare
requires a single argument to run an experiment: the path to a configuration file. It can also take an output directory as an optional second argument. If the latter is not specified, rsmcompare
will use the current directory as the output directory.
Here are all the arguments to the rsmcompare
command-line script.
- config_file
The JSON configuration file for this experiment.
- output_dir (optional)
The output directory where the report files for this comparison will be stored.
- -h, --help
Show help message and exit.
- -V, --version
Show version number and exit.
Experiment configuration file
This is a file in .json
format that provides overall configuration options for an rsmcompare
experiment. Here’s an example configuration file for rsmcompare
.
Note
To make it easy to get started with rsmcompare
, we provide a way to automatically generate configuration files both interactively and non-interactively. Novice users will find interactive generation more helpful, while more advanced users will prefer non-interactive generation. See this page for more details.
Next, we describe all of the rsmcompare
configuration fields in detail. There are seven required fields and the rest are all optional. We first describe the required fields and then the optional ones (sorted alphabetically).
comparison_id
An identifier for the comparison experiment that will be used to name the report. It can be any combination of alphanumeric values, must not contain spaces, and must not be any longer than 200 characters.
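A quick validity check for such identifiers can be sketched as follows. Note that underscores are allowed in practice, as in the tutorial's ASAP2_vs_ASAP2; the exact accepted character set here is an assumption:

```python
import re

# Check the documented constraints on comparison_id: alphanumeric
# characters (plus underscore/hyphen, an assumption based on IDs like
# "ASAP2_vs_ASAP2"), no spaces, at most 200 characters.
def is_valid_comparison_id(comparison_id):
    return (len(comparison_id) <= 200
            and re.fullmatch(r"[A-Za-z0-9_\-]+", comparison_id) is not None)

print(is_valid_comparison_id("ASAP2_vs_ASAP2"))  # True
print(is_valid_comparison_id("has spaces"))      # False
```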
experiment_id_old
An identifier for the “baseline” experiment. This ID should be identical to the experiment_id
used when the baseline experiment was run, whether rsmtool
or rsmeval
. The results for this experiment will be listed first in the comparison report.
experiment_id_new
An identifier for the experiment with the “new” model (e.g., the model with new feature(s)). This ID should be identical to the experiment_id
used when the experiment was run, whether rsmtool
or rsmeval
. The results for this experiment will be listed second in the comparison report.
experiment_dir_old
The directory with the results for the “baseline” experiment. This directory is the output directory that was used for the experiment and should contain subdirectories output
and figure
generated by rsmtool
or rsmeval
.
experiment_dir_new
The directory with the results for the experiment with the new model. This directory is the output directory that was used for the experiment and should contain subdirectories output
and figure
generated by rsmtool
or rsmeval
.
description_old
A brief description of the “baseline” experiment. The description can contain spaces and punctuation.
description_new
A brief description of the experiment with the new model. The description can contain spaces and punctuation.
custom_sections (Optional)
A list of custom, user-defined sections to be included into the final report. These are IPython notebooks (.ipynb
files) created by the user. The list must contain paths to the notebook files, either absolute or relative to the configuration file. All custom notebooks have access to some pre-defined variables.
general_sections (Optional)
RSMTool provides pre-defined sections for rsmcompare
(listed below) and, by default, all of them are included in the report. However, you can choose a subset of these pre-defined sections by specifying a list as the value for this field.
feature_descriptives
: Compares the descriptive statistics for all raw feature values included in the model:
a table showing mean, standard deviation, skewness and kurtosis;
a table showing the number of truncated outliers for each feature;
a table with percentiles and outliers; and
a table with correlations between raw feature values and human score in each model and the correlation between the values of the same feature in these two models. Note that this table only includes features and responses which occur in both training sets.
features_by_group
: Shows boxplots for both experiments with distributions of raw feature values by each of the subgroups specified in the configuration file.
preprocessed_features
: Compares analyses of preprocessed features:
histograms showing the distributions of preprocessed features values;
the correlation matrix between all features and the human score;
a table showing marginal correlations between all features and the human score; and
a table showing partial correlations between all features and the human score.
preprocessed_features_by_group
: Compares analyses of preprocessed features by subgroups: marginal and partial correlations between each feature and human score for each subgroup.
consistency
: Compares metrics for human-human agreement, the difference (‘degradation’) between the human-human and human-system agreement, and the disattenuated correlations for the whole dataset and by each of the subgroups specified in the configuration file.
score_distributions
:
tables showing the distributions for both human and machine scores; and
confusion matrices for human and machine scores.
model
: Compares the parameters of the two regression models. For linear models, it also includes the standardized and relative coefficients.
evaluation
: Compares the standard set of evaluations recommended for scoring models on the evaluation data.
true_score_evaluation
: Compares the evaluation of system scores against the true scores estimated according to test theory. The notebook shows:
Number of single and double-scored responses.
Variance of human rater errors and estimated variance of true scores
Mean squared error (MSE) and proportional reduction in mean squared error (PRMSE) when predicting true score with system score.
pca
: Shows the results of principal components analysis on the processed feature values for the new model only:
the principal components themselves;
the variances; and
a Scree plot.
notes
: Notes explaining the terminology used in comparison reports.
sysinfo
: Shows all Python packages along with versions installed in the current environment while generating the report.
section_order (Optional)
A list containing the order in which the sections in the report should be generated. Any specified order must explicitly list:
Either all pre-defined sections if a value for the general_sections field is not specified OR the sections specified using general_sections, and
All custom section names specified using custom_sections, i.e., file prefixes only, without the path and without the .ipynb extension.
subgroups (Optional)
A list of column names indicating grouping variables used for generating analyses specific to each of those defined subgroups. For example, ["prompt", "gender", "native_language", "test_country"]
.
Note
In order to include subgroups analyses in the comparison report, both experiments must have been run with the same set of subgroups.
use_scaled_predictions_old (Optional)
Set to true
if the “baseline” experiment used scaled machine scores for confusion matrices, score distributions, subgroup analyses, etc. Defaults to false
.
use_scaled_predictions_new (Optional)
Set to true
if the experiment with the new model used scaled machine scores for confusion matrices, score distributions, subgroup analyses, etc. Defaults to false
.
Warning
For rsmtool
and rsmeval
, primary evaluation analyses are computed on both raw and scaled scores, but some analyses (e.g., the confusion matrix) are only computed for either raw or re-scaled scores based on the value of use_scaled_predictions
. rsmcompare
uses the existing outputs and does not perform any additional evaluations. Therefore if this field was set to true
in the original experiment but is set to false
for rsmcompare
, the report will be internally inconsistent: some evaluations will use raw scores whereas others will use scaled scores.
use_thumbnails (Optional)
If set to true
, the images in the HTML will be set to clickable thumbnails rather than full-sized images. Upon clicking the thumbnail, the full-sized images will be displayed in a separate tab in the browser. If set to false
, full-sized images will be displayed as usual. Defaults to false
.
use_wandb (Optional)
If set to true
, the generated reports and all intermediate tables will be logged to Weights & Biases.
The Weights & Biases entity and project name should be specified in the appropriate configuration fields.
The report will be added to a reports section in a new run under the given project.
Defaults to false
.
wandb_project (Optional)
The Weights & Biases project name if logging to Weights & Biases is enabled. If a project by this name does not already exist, it will be created.
Important
Before using Weights & Biases for the first time, users should log in and provide their API key as described in W&B Quickstart guidelines.
Note that when using W&B logging, the rsmtool run may take significantly longer due to the network traffic being sent to W&B.
wandb_entity (Optional)
The Weights & Biases entity name if logging to Weights & Biases is enabled. Entity can be a user name or the name of a team or organization.
Output
rsmcompare
produces the comparison report in HTML format as well as in the form of a Jupyter notebook (a .ipynb
file) in the output directory.
If logging to Weights & Biases is enabled,
the report will be also logged to the specified Weights & Biases project.
rsmsummarize
- Compare multiple scoring models
RSMTool provides the rsmsummarize
command-line utility to compare multiple models and to generate a comparison report. Unlike rsmcompare
which creates a detailed comparison report between two models, rsmsummarize
can be used to create a more general overview of multiple models.
rsmsummarize
can be used to compare:
Multiple rsmtool experiments, or
Multiple rsmeval experiments, or
A mix of rsmtool and rsmeval experiments (in this case, only the evaluation analyses will be compared).
Note
It is strongly recommended that the original experiments as well as the summary experiment are all done using the same version of RSMTool.
Tutorial
For this tutorial, you first need to install RSMTool and make sure the conda environment in which you installed it is activated.
Workflow
rsmsummarize
is designed to compare several existing rsmtool
or rsmeval
experiments. To use rsmsummarize
you need to:
Run two or more experiments using rsmtool or rsmeval.
Create an experiment configuration file describing the comparison experiment you would like to run.
Run that configuration file with rsmsummarize and generate the comparison experiment HTML report.
Examine the HTML report to compare the models.
Note that the above workflow does not use the customization features of rsmsummarize
, e.g., choosing which sections to include in the report or adding custom analyses sections etc. However, we will stick with this workflow for our tutorial since it is likely to be the most common use case.
ASAP Example
We are going to use the same example from the 2012 Kaggle competition on automated essay scoring that we used for the rsmtool tutorial.
Run rsmtool
and rsmeval
experiments
rsmsummarize
compares the results of two or more existing rsmtool
(or rsmeval
) experiments. For this tutorial, we will compare the model trained in the rsmtool tutorial to the evaluations we obtained in the rsmeval tutorial.
Note
If you have not already completed these tutorials, please do so now. You may need to complete them again if you deleted the output files.
Create a configuration file
The next step is to create an experiment configuration file in .json
format.
1{
2 "summary_id": "model_comparison",
3 "description": "a comparison of the results of the rsmtool sample experiment, rsmeval sample experiment and once again the rsmtool sample experiment",
4 "experiment_dirs": ["../rsmtool", "../rsmeval", "../rsmtool"],
5 "experiment_names":["RSMTool experiment 1", "RSMEval experiment", "RSMTool experiment 2"]
6}
Let’s take a look at the options in our configuration file.
Line 2: We provide the summary_id for the comparison. This will be used to generate the name of the final report.
Line 3: We give a short description of this comparison experiment. This will be shown in the report.
Line 4: We also give the list of paths to the directories containing the outputs of the experiments we want to compare.
Line 5: Since we want to compare experiments that all used the same experiment id (
ASAP2
), we instead list the names that we want to use for each experiment in the summary report.
Documentation for all of the available configuration options is available here.
Note
You can also use our nifty capability to automatically generate rsmsummarize
configuration files rather than creating them manually.
Run the experiment
Now that we have the list of the experiments we want to compare and our configuration file in .json
format, we can use the rsmsummarize command-line script to run our comparison experiment.
$ cd examples/rsmsummarize
$ rsmsummarize config_rsmsummarize.json
This should produce output like:
Output directory: /Users/nmadnani/work/rsmtool/examples/rsmsummarize
Starting report generation
Merging sections
Exporting HTML
Executing notebook with kernel: python3
Once the run finishes, you will see a new folder report
containing an HTML file named model_comparison_report.html
. This is the final rsmsummarize
summary report.
Examine the report
Our experiment report contains the overview of main aspects of model performance. It includes:
Brief description of all experiments.
Information about model parameters and model fit for all rsmtool experiments.
Model performance for all experiments.
Note
Some of the information, such as model fit and model parameters, is only available for rsmtool
experiments.
Input
rsmsummarize
requires a single argument to run an experiment: the path to a configuration file. You can specify which models you want to compare and the name of the report by supplying the path to a configuration file. It can also take an output directory as an optional second argument. If the latter is not specified, rsmsummarize
will use the current directory as the output directory.
Here are all the arguments to the rsmsummarize
command-line script.
- config_file
The JSON configuration file for this experiment.
- output_dir (optional)
The output directory where the report and intermediate
.csv
files for this comparison will be stored.
- -f, --force
If specified, the contents of the output directory will be overwritten even if it already contains the output of another rsmsummarize experiment.
- -h, --help
Show help message and exit.
- -V, --version
Show version number and exit.
Experiment configuration file
This is a file in .json
format that provides overall configuration options for an rsmsummarize
experiment. Here’s an example configuration file for rsmsummarize
.
Note
To make it easy to get started with rsmsummarize
, we provide a way to automatically generate configuration files both interactively and non-interactively. Novice users will find interactive generation more helpful, while more advanced users will prefer non-interactive generation. See this page for more details.
Next, we describe all of the rsmsummarize
configuration fields in detail. There are two required fields and the rest are all optional. We first describe the required fields and then the optional ones (sorted alphabetically).
summary_id
An identifier for the rsmsummarize
experiment. This will be used to name the report. It can be any combination of alphanumeric values, must not contain spaces, and must not be any longer than 200 characters.
experiment_dirs
The list of the directories with the results of the experiments. These directories should be the output directories used for each experiment and should contain subdirectories output
and figure
generated by rsmtool
or rsmeval
.
custom_sections (Optional)
A list of custom, user-defined sections to be included into the final report. These are IPython notebooks (.ipynb
files) created by the user. The list must contain paths to the notebook files, either absolute or relative to the configuration file. All custom notebooks have access to some pre-defined variables.
description (Optional)
A brief description of the summary. The description can contain spaces and punctuation.
experiment_names (Optional)
The list of experiment names to use in the summary report and intermediate files. The names should be listed in the same order as the experiments in experiment_dirs. When this field is not specified, the report will show the original experiment_id
for each experiment.
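Since the names are matched to the experiment directories by position, a quick sanity check that the two lists line up can be sketched as follows (configuration values copied from the tutorial):

```python
# Verify that experiment_names lines up one-to-one with
# experiment_dirs before writing the rsmsummarize configuration.
config = {
    "summary_id": "model_comparison",
    "experiment_dirs": ["../rsmtool", "../rsmeval", "../rsmtool"],
    "experiment_names": ["RSMTool experiment 1", "RSMEval experiment",
                         "RSMTool experiment 2"],
}

assert len(config["experiment_dirs"]) == len(config["experiment_names"])
pairs = dict(zip(config["experiment_names"], config["experiment_dirs"]))
print(pairs)
```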
file_format (Optional)
The format of the intermediate files generated by rsmsummarize
. Options are csv
, tsv
, or xlsx
. Defaults to csv
if this is not specified.
Note
In the rsmsummarize
context, the file_format
parameter refers to the format of the intermediate files generated by rsmsummarize
, not the intermediate files generated by the original experiment(s) being summarized. The format of these files does not have to match the format of the files generated by the original experiment(s).
general_sections (Optional)
RSMTool provides pre-defined sections for rsmsummarize
(listed below) and, by default, all of them are included in the report. However, you can choose a subset of these pre-defined sections by specifying a list as the value for this field.
preprocessed_features
: Compares marginal and partial correlations between all features and the human score, and optionally response length if this was computed for any of the models.
model
: Compares the parameters of the two regression models. For linear models, it also includes the standardized and relative coefficients.
evaluation
: Compares the standard set of evaluations recommended for scoring models on the evaluation data.
true_score_evaluation
: Compares the evaluation of system scores against the true scores estimated according to test theory. The notebook shows:
Number of single and double-scored responses.
Variance of human rater errors and estimated variance of true scores.
Mean squared error (MSE) and proportional reduction in mean squared error (PRMSE) when predicting true score with system score.
intermediate_file_paths
: Shows links to all of the intermediate files that were generated while running the summary.
sysinfo
: Shows all Python packages along with versions installed in the current environment while generating the report.
section_order (Optional)
A list containing the order in which the sections in the report should be generated. Any specified order must explicitly list:
Either all pre-defined sections if a value for the general_sections field is not specified OR the sections specified using general_sections, and
All custom section names specified using custom_sections, i.e., file prefixes only, without the path and without the .ipynb extension.
use_thumbnails (Optional)
If set to true
, the images in the HTML will be set to clickable thumbnails rather than full-sized images. Upon clicking the thumbnail, the full-sized images will be displayed in a separate tab in the browser. If set to false
, full-sized images will be displayed as usual. Defaults to false
.
use_wandb (Optional)
If set to true
, the generated report will be logged to Weights & Biases.
The Weights & Biases entity and project name should be specified in the appropriate configuration fields.
The report will be added to a reports section in a new run under the given project.
Defaults to false
.
wandb_project (Optional)
The Weights & Biases project name if logging to Weights & Biases is enabled. If a project by this name does not already exist, it will be created.
Important
Before using Weights & Biases for the first time, users should log in and provide their API key as described in W&B Quickstart guidelines.
Note that when using W&B logging, the rsmtool run may take significantly longer due to the network traffic being sent to W&B.
wandb_entity (Optional)
The Weights & Biases entity name if logging to Weights & Biases is enabled. Entity can be a user name or the name of a team or organization.
Output
rsmsummarize
produces a set of folders in the output directory. If logging to Weights & Biases is enabled,
the reports and all intermediate files are also logged to the specified Weights & Biases project.
report
This folder contains the final rsmsummarize
report in HTML format as well as in the form of a Jupyter notebook (a .ipynb
file).
output
This folder contains all of the intermediate files produced as part of the
various analyses performed, saved as .csv
files. rsmsummarize
will also save in this folder a copy of the
configuration file. Fields not specified in the original configuration file will
be pre-populated with default values.
figure
This folder contains all of the figures that may be generated as part of the various analyses performed, saved as .svg
files. Note that no figures are generated by the existing rsmsummarize
notebooks.
Intermediate files
Although the primary output of rsmsummarize is an HTML report, we also want the user to be able to conduct additional analyses outside of RSMTool. To this end, all of the tables produced in the experiment report are saved as files in the format specified by the file_format
parameter in the output
directory. The following sections describe all of the intermediate files that are produced.
Note
The names of all files begin with the summary_id
provided by the user in the experiment configuration file.
Marginal and partial correlations with score
filenames: margcor_score_all_data
, pcor_score_all_data
pcor_score_no_length_all_data
The first file contains the marginal correlations between each pre-processed feature and human score. The second file contains the partial correlation between each pre-processed feature and human score after controlling for all other features. The third file contains the partial correlations between each pre-processed feature and human score after controlling for response length, if length_column
was specified in the configuration file.
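To illustrate the difference between these correlations, here is a stdlib-only sketch using the standard first-order partial correlation formula. Note this is only an illustration with toy data and a single confound, not RSMTool's implementation (which controls for all other features at once):

```python
import math

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def partial_corr(x, y, z):
    """First-order partial correlation between x and y after
    controlling for a single confound z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Toy data: one feature, human scores, and response length as confound
feature = [2.0, 3.1, 4.2, 5.0, 6.1]
score = [1.0, 2.0, 3.0, 3.5, 4.5]
length = [120, 180, 150, 260, 240]

marginal = pearson(feature, score)              # margcor analogue
partial = partial_corr(feature, score, length)  # pcor-after-length analogue
```

A large gap between the marginal and partial values suggests that a feature's apparent relationship with the human score is partly explained by the confound.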
Model information
model_summary
This file contains the main information about the models included in the report, including:
Total number of features
Total number of features with non-negative coefficients
The learner
The label used to train the model
betas
: standardized coefficients (for built-in models only).
model_fit
: R squared and adjusted R squared computed on the training set. Note that these values are always computed on raw predictions without any trimming or rounding.
Note
If the report includes a combination of rsmtool
and rsmeval
experiments, the summary tables with model information will only include rsmtool
experiments since no model information is available for rsmeval
experiments.
Evaluation metrics
eval_short
- descriptives for predicted and human scores (mean, std. dev., etc.) and association metrics (correlation, quadratic weighted kappa, SMD, etc.) for specific score types chosen based on recommendations by Williamson (2012). Specifically, the following columns are included (the raw
or scale
version is chosen depending on the value of use_scaled_predictions
in the configuration file).
h_mean
h_sd
corr
sys_mean [raw/scale trim]
sys_sd [raw/scale trim]
SMD [raw/scale trim]
adj_agr [raw/scale trim_round]
exact_agr [raw/scale trim_round]
kappa [raw/scale trim_round]
wtkappa [raw/scale trim_round]
sys_mean [raw/scale trim_round]
sys_sd [raw/scale trim_round]
SMD [raw/scale trim_round]
R2 [raw/scale trim]
RMSE [raw/scale trim]
Note
Please note that for raw scores, SMD values are likely to be affected by possible differences in scale.
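Among the columns above, wtkappa refers to quadratically weighted kappa. As an illustration of the underlying metric (RSMTool uses its own implementation; this stdlib-only sketch just shows the standard formula on toy data):

```python
from collections import Counter

def quadratic_weighted_kappa(human, system, min_score, max_score):
    """Quadratically weighted kappa between two equal-length lists
    of integer scores on the scale [min_score, max_score]."""
    n = len(human)
    span = (max_score - min_score) ** 2
    observed = Counter(zip(human, system))      # joint counts
    h_marg, s_marg = Counter(human), Counter(system)
    num = den = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            w = (i - j) ** 2 / span
            num += w * observed[(i, j)]
            den += w * h_marg[i] * s_marg[j] / n   # chance-expected counts
    return 1.0 - num / den

# Toy example: one adjacent disagreement out of six responses
human = [1, 2, 3, 4, 4, 2]
system = [1, 2, 3, 3, 4, 2]
kappa = quadratic_weighted_kappa(human, system, 1, 4)
```

Perfect agreement yields 1.0, and the quadratic weights penalize large disagreements more heavily than adjacent ones.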
Evaluations based on test theory
true_score_eval
: evaluations of system scores against estimated true scores. Contains total counts of single and double-scored responses, variance of human rater error, estimated true score variance, and mean squared error (MSE) and proportional reduction in mean squared error (PRMSE) when predicting true score using system score.
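The PRMSE idea can be illustrated with a simplified sketch: the proportional reduction in mean squared error relative to always predicting the mean true score. Note this is a toy illustration that assumes true scores are known, whereas RSMTool estimates them (and their variance) from double-scored responses:

```python
from statistics import mean, pvariance

def prmse_sketch(true_scores, system_scores):
    """Toy PRMSE: proportional reduction in mean squared error when
    predicting (here, assumed-known) true scores with system scores,
    relative to always predicting the mean true score."""
    mse = mean((t - s) ** 2 for t, s in zip(true_scores, system_scores))
    return 1.0 - mse / pvariance(true_scores)

# Hypothetical scores; a value near 1 means system scores track
# the true scores closely
true_scores = [2.0, 3.0, 4.0, 3.5, 2.5]
system_scores = [2.1, 2.9, 3.8, 3.6, 2.4]
prmse = prmse_sketch(true_scores, system_scores)
```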
rsmxval
- Run cross-validation experiments
RSMTool provides the rsmxval
command-line utility to run cross-validation experiments with scoring models. Why would one want to use cross-validation rather than just using the simple train-and-evaluate loop provided by the rsmtool
utility? Using cross-validation can provide more accurate estimates of scoring model performance since those estimates are averaged over multiple train-test splits that are randomly selected based on the data. Using a single train-test split may lead to biased estimates of performance since those estimates will depend on the specific characteristics of that split. Using cross-validation is more likely to provide estimates of how well the scoring model will generalize to unseen test data, and more easily flag problems with overfitting and selection bias, if any.
Cross-validation experiments in RSMTool consist of the following steps:
The given training data file is first shuffled randomly (with a fixed seed for reproducibility) and then split into the requested number of folds. It is also possible for the user to provide a CSV file containing a pre-determined set of folds, e.g., from another part of the data pipeline.
For each fold (or train-test split),
rsmtool
is run to train a scoring model on the training split and evaluate it on the test split. All of the outputs for each of the
rsmtool
runs are saved on disk and represent the per-fold performance.
The predictions generated by
rsmtool
for each of the folds are all combined into a single file, which is then used as input for
rsmeval
. The output of this evaluation run is saved to disk and provides a more accurate estimate of the predictive performance of a scoring model trained on the given data.
A summary report comparing all of the folds is generated by running
rsmsummarize
on all of the fold directories created in the previous steps; its output is also saved to disk. This summary can be useful to see whether the performance of any fold stands out for any reason, which could point to a potential problem.
Finally, a scoring model is trained on the complete training data file using
rsmtool
, which also generates a report containing only the feature and model descriptives. This model is what will most likely be deployed for inference, assuming the analyses produced in this and the previous steps meet the stakeholders' requirements.
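The shuffle-and-split logic of the first step above can be sketched as follows (a simplified stand-in for rsmxval's internals; the seed value here is an arbitrary assumption):

```python
import random

def make_folds(response_ids, num_folds, seed=123456):
    """Shuffle the IDs with a fixed seed for reproducibility, then
    deal them out round-robin into num_folds roughly equal folds.
    (The seed value is an arbitrary assumption, not rsmxval's.)"""
    ids = list(response_ids)
    random.Random(seed).shuffle(ids)
    folds = [[] for _ in range(num_folds)]
    for position, response_id in enumerate(ids):
        folds[position % num_folds].append(response_id)
    return folds

folds = make_folds(range(10), 3)
```

Each fold then serves once as the evaluation split, with the remaining folds forming the training split.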
Tutorial
For this tutorial, you first need to install RSMTool and make sure the conda environment in which you installed it is activated.
Workflow
rsmxval
is designed to run cross-validation experiments using a single file containing human scores and features. Just like rsmtool
, rsmxval
does not provide any functionality for feature extraction and assumes that users will extract features on their own. The workflow steps are as follows:
Create a data file in one of the supported formats containing the extracted features for each response in the data along with human score(s) assigned to it.
Create an experiment configuration file describing the cross-validation experiment you would like to run.
Run that configuration file with rsmxval and generate its outputs.
Examine the various HTML reports to check various aspects of model performance.
Note that unlike rsmtool
and rsmeval
, rsmxval
currently does not support customization of the HTML reports generated in each step. This functionality may be added in future versions.
ASAP Example
We are going to use the same example from the 2012 Kaggle competition on automated essay scoring that we used for the rsmtool tutorial.
Extract features
We are using the same features for this data as described in the rsmtool tutorial.
Create a configuration file
The next step is to create an experiment configuration file in .json
format.
1{
2 "experiment_id": "ASAP2_xval",
3 "description": "Cross-validation with two human scores using a LinearRegression model.",
4 "train_file": "train.csv",
5 "folds": 3,
6 "train_label_column": "score",
7 "id_column": "ID",
8 "model": "LinearRegression",
9 "trim_min": 1,
10 "trim_max": 6,
11 "second_human_score_column": "score2",
12 "use_scaled_predictions": true
13}
Let’s take a look at the options in our configuration file.
Line 2: We define an experiment ID used to identify the files produced as part of this experiment.
Line 3: We provide a description which will be included in the various reports.
Line 4: We list the path to our training file with the feature values and human scores. For this tutorial, we used
.csv
format, but several other input file formats are also supported.
Line 5: This field indicates the number of cross-validation folds we want to use. If this field is not specified,
rsmxval
uses 5-fold cross-validation by default.
Line 6: This field indicates that the human (reference) scores in our
.csv
file are located in a column named
score
.
Line 7: This field indicates that the unique IDs for the responses in the
.csv
file are located in a column named
ID
.
Line 8: We choose to use a linear regression model to combine the feature values into a score.
Lines 9-10: These fields indicate that the lowest score on the scoring scale is a 1 and the highest score is a 6. This information is usually part of the rubric used by human graders.
Line 11: This field indicates that scores from a second set of human graders are also available (useful for comparing the agreement between human-machine scores to the agreement between two sets of humans) and are located in the
score2
column in the training
.csv
file.
Line 12: Next, we indicate that we would like to use the scaled scores for all our evaluation analyses at each step.
Documentation for all of the available configuration options is available here.
Note
You can also use our nifty capability to automatically generate rsmxval
configuration files rather than creating them manually.
Run the experiment
Now that we have our input file and our configuration file, we can use the rsmxval command-line script to run our evaluation experiment.
$ cd examples/rsmxval
$ rsmxval config_rsmxval.json output
This should produce output like:
Output directory: output
Saving configuration file.
Generating 3 folds after shuffling
Running RSMTool on each fold in parallel
Progress: 100%|███████████████████████████████████████████████| 3/3 [00:08<00:00, 2.76s/it]
Creating fold summary
Evaluating combined fold predictions
Training model on full data
Once the run finishes, you will see an output
sub-directory in the current directory. Under this directory you will see multiple sub-directories, each corresponding to a different cross-validation step, as described here.
Examine the reports
The cross-validation experiment produces multiple HTML reports – an rsmtool
report for each of the 3 folds (output/folds/{01,02,03}/report/ASAP2_xval_fold{01,02,03}.html
), the evaluation report for the cross-validated predictions (output/evaluation/report/ASAP2_xval_evaluation_report.html
), a report summarizing the salient characteristics of the 3 folds (output/fold-summary/report/ASAP2_xval_fold_summary_report.html
), and a report showing the feature and model descriptives (output/final-model/report/ASAP2_xval_model_report.html
). Examining these reports will provide a relatively complete picture of how well the predictive performance of the scoring model will generalize to unseen data.
Input
rsmxval
requires a single argument to run an experiment: the path to a configuration file. It can also take an output directory as an optional second argument. If the latter is not specified, rsmxval
will use the current directory as the output directory.
Here are all the arguments to the rsmxval
command-line script.
- config_file
The JSON configuration file for this cross-validation experiment.
- output_dir (optional)
The output directory where all the sub-directories and files for this cross-validation experiment will be stored. If a non-empty directory with the same name already exists, an error will be raised.
- -h, --help
Show help message and exit.
- -V, --version
Show version number and exit.
Experiment configuration file
This is a file in .json
format that provides overall configuration options for an rsmxval
experiment. Here’s an example configuration file for rsmxval
.
Note
To make it easy to get started with rsmxval
, we provide a way to automatically generate configuration files both interactively as well as non-interactively. Novice users will find interactive generation more helpful while more advanced users will prefer non-interactive generation. See this page for more details.
Configuration files for rsmxval
are almost identical to rsmtool
configuration files with only a few differences. Next, we describe the three required rsmxval
configuration fields in detail.
experiment_id
An identifier for the experiment that will be used as part of the names of the reports and intermediate files produced in each of the steps. It can be any combination of alphanumeric values, must not contain spaces, and must not be any longer than 200 characters. Suffixes are added to this experiment ID by each of the steps for the reports and files they produce, i.e., _fold<N>
in the per-fold rsmtool
step where <N>
is a two digit number, _evaluation
by the rsmeval
evaluation step, _fold_summary
by the rsmsummarize
step, and _model
by the final full-data rsmtool
step.
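Given this suffix scheme, the report basenames produced by a cross-validation run can be predicted from the experiment ID; here is a small illustrative sketch (the helper function is hypothetical, not part of RSMTool):

```python
def report_basenames(experiment_id, num_folds):
    """Basenames of the reports produced by each rsmxval step,
    following the suffix scheme described above."""
    names = [f"{experiment_id}_fold{n:02d}" for n in range(1, num_folds + 1)]
    names.append(f"{experiment_id}_evaluation")
    names.append(f"{experiment_id}_fold_summary")
    names.append(f"{experiment_id}_model")
    return names

names = report_basenames("ASAP2_xval", 3)
```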
model
The machine learner you want to use to build the scoring model. Possible values include built-in linear regression models as well as all of the learners available via SKLL. With SKLL learners, you can customize the tuning objective and also compute expected scores as predictions.
train_file
The path to the training data feature file in one of the supported formats. Each row should correspond to a single response and contain numeric feature values extracted for this response. In addition, there should be a column with a unique identifier (ID) for each response and a column with the human score for each response. The path can be absolute or relative to the config file's location.
Important
Unlike rsmtool
, rsmxval
does not accept an evaluation set and will raise an error if the test_file
field is specified.
Next, we will describe the optional fields that are unique to rsmxval
.
folds (Optional)
The number of folds to use for cross-validation. This should be an integer and defaults to 5.
folds_file (Optional)
The path to a file containing custom, pre-specified folds to be used for cross-validation. This should be a .csv
file (no other formats are accepted) and should contain only two columns: id
and fold
. The id
column should contain the same IDs of the responses that are contained in train_file
above. The fold
column should contain an integer representing which fold the response with the id
belongs to. IDs not specified in this file will be skipped and not included in the cross-validation at all. Just like train_file
, this path can be absolute or relative to the config file’s location. Here’s an example of a folds file containing 2 folds.
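A folds file with the required id and fold columns can also be generated programmatically; the sketch below builds a 2-fold file in memory with hypothetical response IDs:

```python
import csv
import io

# Assign hypothetical response IDs round-robin to 2 folds
rows = [("RESP_%03d" % i, i % 2 + 1) for i in range(1, 7)]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "fold"])   # exactly the two required columns
writer.writerows(rows)
folds_csv = buf.getvalue()        # write this out as a .csv file
```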
Note
If both folds_file
and folds
are specified, then the former will take precedence unless it contains a non-existent path.
use_wandb (Optional)
If set to true
, the generated reports and all intermediate tables will be logged to Weights & Biases.
The Weights & Biases entity and project name should be specified in the appropriate configuration fields.
A new run will be created in the specified project and it will include different sections for the different steps of the
experiment:
The rsmxval section will include the combined predictions file generated by the folds.
The rsmeval section will include the tables and plots from the evaluation run on the predictions file.
The rsmtool section will include the output of the final
rsmtool
run that creates the final model.
All the reports generated in the process will be logged to the reports section.
In addition, some evaluation metrics will be logged to the run’s history with a name representing the context in which each score is calculated (see more details here).
Note that the output of the individual rsmtool runs for each fold are not logged to W&B.
Defaults to false
.
In addition to the fields described so far, an rsmxval
configuration file also accepts the following optional fields used by rsmtool
:
candidate_column
description
exclude_zero_scores
feature_subset
feature_subset_file
features
file_format
flag_column
flag_column_test
id_column
length_column
min_items_per_candidate
min_n_per_group
predict_expected_scores
rater_error_variance
second_human_score_column
select_transformations
sign
skll_fixed_parameters
skll_objective
standardize_features
subgroups
train_label_column
trim_max
trim_min
trim_tolerance
truncate_outliers
use_scaled_predictions
use_thumbnails
use_truncation_thresholds
skll_grid_search_jobs
use_wandb
wandb_entity
wandb_project
Please refer to these fields’ descriptions on the page describing the rsmtool configuration file.
Output
rsmxval
produces a set of folders in the output directory. If logging to Weights & Biases is enabled,
the reports generated in this run are also logged to the specified Weights & Biases project.
folds
This folder contains the output of each of the per-fold rsmtool
experiments. It contains as many sub-folders as the number of specified folds, named 01
, 02
, 03
, etc. Each of these numbered sub-folders contains the output of one rsmtool
experiment conducted using the training split of that fold as the training data and the test split as the evaluation data. Each of the sub-folders contains the output directories produced by rsmtool. The report for each fold lives in the report
sub-directory, e.g., the report for the first fold is found at folds/01/report/<experiment_id>_fold01_report.html
, and so on. The messages that are usually printed out by rsmtool
to the screen are instead logged to a file and saved to disk as, e.g., folds/01/rsmtool.log
.
evaluation
This folder contains the output of the rsmeval
evaluation experiment that uses the cross-validated predictions from each fold. This folder contains the output directories produced by rsmeval. The evaluation report can be found at evaluation/report/<experiment_id>_evaluation_report.html
. The messages that are usually printed out by rsmeval
to the screen are instead logged to a file and saved to disk as evaluation/rsmeval.log
.
fold-summary
This folder contains the output of the rsmsummarize
experiment that provides a quick summary of all of the folds in a single, easily-scanned report. The folder contains the output directories produced by rsmsummarize. The summary report can be found at fold-summary/report/<experiment_id>_fold_summary_report.html
. The messages that are usually printed out by rsmsummarize
to the screen are instead logged to a file and saved to disk as fold-summary/rsmsummarize.log
.
final-model
This folder contains the output of the rsmtool
experiment that trains a model on the full training data and provides a report showing the feature and model descriptives. It contains the output directories produced by rsmtool. The primary artifacts of this experiment are the report (final-model/report/<experiment_id>_model_report.html
) and the final trained model (final-model/output/<experiment_id>_model.model
). The messages that are usually printed out by rsmtool
to the screen are instead logged to a file and saved to disk as final-model/rsmtool.log
.
Note
Every rsmtool
experiment requires both a training and an evaluation set. However, in this step, we are using the full training data to train the model and rsmxval
does not use a separate test set. Therefore, we simply randomly sample 10% of the full training data as a dummy test set to make sure that rsmtool
runs successfully. The report in this step only contains the model and feature descriptives and, therefore, does not use this dummy test set at all. Users should ignore any intermediate files under the final-model/output
and final-model/figure
sub-directories that are derived from this dummy test set. If needed, the data used as the dummy test set can be found at final-model/dummy_test.csv
(or in the chosen format).
In addition to these folders, rsmxval
will also save a copy of the configuration file in the output directory at the same-level as the above folders. Fields not specified in the original configuration file will be pre-populated with default values.
rsmexplain
- Explain non-linear models
RSMTool provides the rsmexplain
command-line utility to generate a report explaining the predictions made by a model trained using the rsmtool
utility. These explanations contain useful information about the contribution of each feature to the final score, even if the model is non-linear or black-box in nature. The rsmexplain command-line utility uses the SHAP library to compute the explanations.
Note
rsmexplain
uses the sampling explainer which is model agnostic and should, in principle, work for any type of model. However, rsmexplain
currently only supports regressors since they are the most popular model type used for automated scoring.
Tutorial
For this tutorial, you first need to install RSMTool and make sure the conda environment in which you installed it is activated.
Workflow
rsmexplain
is designed to explain the predictions from a model trained as part of an existing rsmtool
experiment. The steps to do this are as follows:
Successfully run an rsmtool experiment so that the model we would like to explain is trained and available.
Create an experiment configuration file describing the explanation experiment you would like to run.
Run the created configuration file with rsmexplain to generate the explanation HTML report.
Examine the HTML report to see the explanations for the
rsmtool
model on the selected responses.
ASAP Example
We are going to use the same example from the 2012 Kaggle competition on automated essay scoring that we used for the rsmtool tutorial.
Run rsmtool
experiments with chosen model
rsmexplain
requires an existing rsmtool
experiment with a trained model. For this tutorial, we will explain the model trained as part of the rsmtool tutorial.
Note
If you have not already completed the rsmtool
tutorial, please do so now. You may need to complete it again if you deleted the output files.
Create a configuration file
The next step is to create an experiment configuration file for rsmexplain
in .json
format.
1{
2 "description": "Explaining a linear regression model trained on all features.",
3 "experiment_dir": "../rsmtool",
4 "experiment_id": "ASAP2",
5 "background_data": "../rsmtool/train.csv",
6 "explain_data": "../rsmtool/test.csv",
7 "id_column": "ID",
8 "sample_size": 1,
9 "num_features_to_display": 15
10}
Let’s take a look at the options in our configuration file.
Line 2: We give a short description of this experiment. This will be shown in the report.
Line 3: We give the path to the directory containing the output of the original rsmtool experiment. Note that this is the top-level directory that contains the
output
directory produced by
rsmtool
.
Line 4: We provide the
experiment_id
of the rsmtool experiment used to train the model. This can usually be read off the
output/<experiment_id>.model
file in the rsmtool experiment output directory.
Line 5: We provide the path to the data file that will be used as the background distribution.
Line 6: We provide the path to the data file that will be used to generate the explanations.
Line 7: This field indicates that the unique IDs for the responses in both data files are located in a column named
ID
.
Line 8: This field indicates that we wish to explain one randomly chosen example from the second data file. If we wished to explain a specific example from that file, we would use the sample_ids option instead.
Line 9: This field indicates the number of top features that should be displayed in the plots in the
rsmexplain
report.
Documentation for all of the available configuration options is available here.
Note
You can also use our nifty capability to automatically generate rsmexplain
configuration files rather than creating them manually.
Run explanation experiment
Now that we have the rsmtool
experiment, the data files, and our configuration file, we can use the rsmexplain command-line script to run our explanation experiment.
$ cd examples/rsmexplain
$ rsmexplain config_rsmexplain.json
This should produce output like:
Output directory: /Users/nmadnani/work/rsmtool/examples/rsmexplain
Saving configuration file.
WARNING: The following extraneous features will be ignored: {'LENGTH', 'score2', 'score'}
Pre-processing input features
WARNING: The following extraneous features will be ignored: {'LENGTH', 'score2', 'score'}
Pre-processing input features
Generating SHAP explanations for 1 examples from ../rsmtool/test.csv
100%|███████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 21.99it/s]
Merging sections
Exporting HTML
Success
Once the run finishes, you will see the output
, figure
, and report
sub-directories in the current directory. Each of these directories contains useful information, but we are specifically interested in the report/ASAP2_explain_report.html
file, which is the final explanation report.
Examine the report
Our experiment report contains all the information we would need to explain the trained model. It includes:
The various absolute value variants of the SHAP values.
Several SHAP plots indicating how different features contribute to the predicted score. Since we chose to explain a single example in this tutorial, the following plots will be displayed in the report: global bar plot, beeswarm plot, decision plot, and waterfall plot.
Note
We encourage you to re-run the tutorial by modifying the configuration file to explain multiple examples instead of a single one. You can do so either by setting sample_size to a value larger than 1, by explicitly specifying multiple example indices via sample_ids, or by setting sample_range to an appropriate range of example indices. For a multiple-example explanation run, the following plots will be displayed in the report: global bar plot, beeswarm plot, and heatmap plots.
Input
rsmexplain
requires only one argument to generate the explanation report: the path to a configuration file.
Here are all the arguments to the rsmexplain
command-line script.
- config_file
The JSON configuration file for this experiment.
- output_dir
The output directory where all the files for this experiment will be stored.
- -f, --force
If specified, the contents of the output directory will be overwritten even if it already contains the output of another rsmexplain experiment.
- -h, --help
Show help message and exit.
- -V, --version
Show version number and exit.
Experiment configuration file
This is a file in .json
format that provides overall configuration options for an rsmexplain
experiment. Here’s an example configuration file for rsmexplain
.
Note
To make it easy to get started with rsmexplain
, we provide a way to automatically generate configuration files both interactively as well as non-interactively. Novice users will find interactive generation more helpful while more advanced users will prefer non-interactive generation. See this page for more details.
Next, we describe all of the rsmexplain
configuration fields in detail. There are four required fields and the rest are all optional. We first describe the required fields and then the optional ones (sorted alphabetically).
background_data
The path to the background data feature file in one of the supported formats. Each row should correspond to a single response and contain the numeric feature values extracted for this response. In addition, there should be a column containing a unique identifier (ID) for each response. This path can be absolute or relative to the location of the config file. It must contain at least 300 responses to ensure meaningful explanations.
explain_data
The path to the file containing the data that we want to explain. The file should be in one of the supported formats. Each row should correspond to a single response and contain numeric feature values extracted for this response. In addition, there should be a column containing a unique identifier (ID) for each response. The path can be absolute or relative to the location of the config file.
experiment_id
An identifier for the rsmexplain
experiment. This will be used to name the report. It can be any combination of alphanumeric values, must not contain spaces, and must not be any longer than 200 characters.
experiment_dir
The directory containing the rsmtool model that we want to explain. This directory should contain an output sub-directory, and that sub-directory should contain two files: <experiment_id>.model and <experiment_id>_feature.csv. Note that <experiment_id> refers to the one defined in this same configuration file. As an example of this directory structure, take a look at the existing_experiment directory here.
background_kmeans_size (Optional)
The size of the k-means sample for background sampling. Defaults to 500. We summarize the dataset specified in background_data with this many k-means clusters (each cluster is weighted by the number of data points it represents) and then use the summarized data set for sampling instead of the original. The k-means clustering allows us to speed up the explanation process but may sacrifice some accuracy. The default value of 500 has been shown to provide a good balance between speed and accuracy in our experiments. You may use a higher value if you have a very large or very diverse background dataset and you want to ensure that it’s accurately summarized.
Warning
background_kmeans_size must be smaller than the size of the original background data. If it is not, you may see errors like this: ValueError: n_samples=500 should be >= n_clusters=750.
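This constraint can be checked before launching a long-running experiment; here is a minimal sketch (the helper name is ours, and the error message simply mirrors the one scikit-learn produces):

```python
def check_kmeans_size(num_background_rows: int, background_kmeans_size: int = 500) -> None:
    """Raise an error if the k-means sample size exceeds the background data size."""
    if num_background_rows < background_kmeans_size:
        raise ValueError(
            f"n_samples={num_background_rows} should be >= "
            f"n_clusters={background_kmeans_size}"
        )

# asking for 500 clusters from only 300 background rows is invalid
try:
    check_kmeans_size(300, 500)
except ValueError as exc:
    print(exc)  # n_samples=300 should be >= n_clusters=500
```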
custom_sections (Optional)
A list of custom, user-defined sections to be included in the final report. These are IPython notebooks (.ipynb files) created by the user. The list must contain paths to the notebook files, either absolute or relative to the configuration file. All custom notebooks have access to some pre-defined variables.
description (Optional)
A brief description of the rsmexplain experiment that will be shown at the top of the report. The description can contain spaces and punctuation.
general_sections (Optional)
RSMTool provides pre-defined sections for rsmexplain (listed below) and, by default, all of them are included in the report. However, you can choose a subset of these pre-defined sections by specifying a list as the value for this field.
data_description: Shows the number and identity of the responses being explained.
shap_values: Shows different types of SHAP values for the features.
shap_plots: Shows various SHAP explanation plots for the features.
sysinfo: Shows all Python packages, along with their versions, installed in the environment used to generate the report.
id_column (Optional)
The name of the column containing the response IDs. Defaults to spkitemid, i.e., if this field is not specified, rsmexplain will look for a column called spkitemid in both the background_data and explain_data files. Note that the name of the id_column must be the same in these two files.
num_features_to_display (Optional)
The number of top features to display in rsmexplain plots. Defaults to 15.
sample_ids (Optional)
If we want to explain a specific set of responses from the explain_data file, we can specify their IDs here as a comma-separated string. Note that the IDs must be values from the id_column. For example, if explain_data has IDs of the form "EXAMPLE_1", "EXAMPLE_2", etc., and we want to explain the fifth, tenth, and twelfth responses, the value of this field must be set to "EXAMPLE_5, EXAMPLE_10, EXAMPLE_12". Defaults to None.
sample_range (Optional)
If we want to explain a specific range of responses from the explain_data file, we can specify that range here. Note that the range is specified in terms of the location of the responses in the explain_data file and that the locations are zero-indexed. So, for example, to explain only the first 50 responses in the file, we should set a value of "0-49" for this option. Defaults to None.
sample_size (Optional)
If we want to explain a random sample of the responses in explain_data, we can specify the size of that random sample here. For example, to explain a random sample of 10 responses, we would set this field to 10. Defaults to None.
Note
At most one of sample_ids, sample_range, or sample_size may be specified. If none of them is specified, explanations will be generated for the entire set of responses in explain_data, which could be very slow, depending on its size.
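The three sampling options can be illustrated with a short sketch of how each one selects responses from a zero-indexed list of IDs (the variable names and parsing logic here are ours, not rsmexplain internals):

```python
import random

# hypothetical explain_data IDs: EXAMPLE_1 ... EXAMPLE_100
all_ids = [f"EXAMPLE_{i}" for i in range(1, 101)]

# sample_ids: explicit comma-separated IDs from the id_column
sample_ids = "EXAMPLE_5, EXAMPLE_10, EXAMPLE_12"
chosen_by_id = [s.strip() for s in sample_ids.split(",")]

# sample_range: zero-indexed positions, inclusive on both ends
sample_range = "0-49"
start, end = (int(x) for x in sample_range.split("-"))
chosen_by_range = all_ids[start:end + 1]

# sample_size: a random sample of the given size
random.seed(12345)  # for a reproducible illustration
chosen_by_size = random.sample(all_ids, 10)

print(len(chosen_by_range))  # 50
```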
section_order (Optional)
A list containing the order in which the sections in the report should be generated. Any specified order must explicitly list:
Either all pre-defined sections (if a value for the general_sections field is not specified) or the sections specified using general_sections, and
All custom section names specified using custom_sections, i.e., file prefixes only, without the path and without the .ipynb extension.
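For example, keeping all four pre-defined sections and adding a custom notebook, a valid ordering might look like this sketch (the notebook name and path are hypothetical):

```json
{
    "general_sections": ["data_description", "shap_values", "shap_plots", "sysinfo"],
    "custom_sections": ["notebooks/fairness_checks.ipynb"],
    "section_order": ["data_description", "fairness_checks", "shap_values", "shap_plots", "sysinfo"]
}
```

Note that the custom section appears in section_order as its file prefix only ("fairness_checks"), without the path or the .ipynb extension.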
show_auto_cohorts (Optional)
If this option is set to true, auto cohort bar plots will be displayed. These plots can be useful for detecting interaction effects between cohorts and features: if a cohort shows a high feature value, there may be an interaction between that cohort and the feature. Defaults to false. These plots are not shown by default because they may be unstable or misleading if explain_data is not large enough. For smaller datasets, SHAP may not be able to detect strong feature interactions and compute clear cohorts; if that happens, the plots will be too specific to be useful. If you have a large enough dataset, you can set this option to true and see whether the plots are useful.
Important
By default, the auto cohort bar plots are treated as a custom section and added at the end of the report, after the system information section. The section_order option can be used to move this section to a different place in the report. Use "auto_cohorts" as the name for this section when specifying an order.
standardize_features (Optional)
If this option is set to false, the feature values for the responses in background_data and explain_data will not be standardized using the mean and standard deviation parameters from the rsmtool experiment. These parameters are expected to be part of the feature information contained in <experiment_dir>/output/<experiment_id>_feature.csv. Defaults to true.
Important
If experiment_dir contains the rsmtool configuration file, that file's value for standardize_features will override the value specified by the user. The reason is that if rsmtool trained the model with (or without) standardized features, then rsmexplain must do the same for the explanations to be meaningful.
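Conceptually, standardization converts each raw feature value to standard units using the mean and standard deviation saved from the rsmtool training run; a minimal sketch with made-up statistics (the function name and numbers are ours):

```python
def standardize(value: float, train_mean: float, train_sd: float) -> float:
    """Convert a raw feature value to standard units using training-set statistics."""
    return (value - train_mean) / train_sd

# hypothetical per-feature statistics of the kind stored in <experiment_id>_feature.csv
print(standardize(7.5, train_mean=5.0, train_sd=2.5))  # 1.0
```

Because the same transformation must be applied at training and explanation time, rsmexplain reads these parameters from the rsmtool experiment rather than recomputing them on the new data.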
truncate_outliers (Optional)
If this option is set to false, outliers (values more than 4 standard deviations away from the mean) in feature columns will not be truncated. Defaults to true.
use_wandb (Optional)
If set to true, the generated report will be logged to Weights & Biases. The Weights & Biases entity and project name should be specified in the appropriate configuration fields. The report will be added to a reports section in a new run under the given project. Defaults to false.
wandb_project (Optional)
The Weights & Biases project name if logging to Weights & Biases is enabled. If a project by this name does not already exist, it will be created.
Important
Before using Weights & Biases for the first time, users should log in and provide their API key as described in the W&B Quickstart guide.
Note that when using W&B logging, the rsmexplain run may take significantly longer due to the network traffic being sent to W&B.
wandb_entity (Optional)
The Weights & Biases entity name if logging to Weights & Biases is enabled. Entity can be a user name or the name of a team or organization.
Output
rsmexplain produces a set of folders in the output directory. If logging to Weights & Biases is enabled, the report is also logged to the specified Weights & Biases project.
report
This folder contains the final explanation report in HTML format as well as in the form of a Jupyter notebook (a .ipynb file).
output
This folder contains various SHAP values and their absolute-value variants. rsmexplain also saves a copy of the configuration file in this folder; fields not specified in the original configuration file will be pre-populated with default values. The SHAP explanation object is saved as <experiment_id>_explanation.pkl, and a mapping between the position of each explained response in the data file and its unique ID is saved in <experiment_id>_ids.pkl.
figure
This folder contains all of the figures with the various SHAP plots, saved as .svg files.