Advanced Uses of RSMTool
In addition to providing the rsmtool
utility for training and evaluating regression-based scoring models, the RSMTool package also provides six other command-line utilities for more advanced users.
rsmeval
- Evaluate external predictions
RSMTool provides the rsmeval
command-line utility to evaluate existing predictions and generate a report with all the built-in analyses. This can be useful in scenarios where the user wants to use more sophisticated machine learning algorithms not available in RSMTool to build the scoring model but still wants to be able to evaluate that model’s predictions using the standard analyses.
For example, say a researcher has an existing automated scoring engine for grading short responses that extracts the features and computes the predicted score. This engine uses a large number of binary, sparse features. She cannot use rsmtool
to train her model since it requires numeric features. So, she uses scikit-learn to train her model.
Once the model is trained, the researcher wants to evaluate her engine’s performance using the analyses recommended by the educational measurement community as well as conduct additional investigations for specific subgroups of test-takers. However, these kinds of analyses are not available in scikit-learn
. She can use rsmeval
to set up a customized report using a combination of existing and custom sections and quickly produce the evaluation that is useful to her.
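The last step of such a pipeline might look like the following minimal sketch: after scoring responses with an external model (e.g., one trained with scikit-learn), write out a predictions file that rsmeval can evaluate. The IDs and score values here are made up; the column names (ID, system, human, human2) match the tutorial configuration used later on this page.

```python
# Hypothetical sketch: write system and human scores to a CSV file
# in a format rsmeval can evaluate. Values are illustrative only.
import csv

# predictions from some external scoring engine (made-up values)
records = [
    {"ID": "resp_001", "system": 3.2, "human": 3, "human2": 4},
    {"ID": "resp_002", "system": 4.7, "human": 5, "human2": 5},
]

with open("ASAP2_scores.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["ID", "system", "human", "human2"])
    writer.writeheader()
    writer.writerows(records)
```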
Tutorial
For this tutorial, you first need to install RSMTool and make sure the conda environment in which you installed it is activated.
Workflow
rsmeval
is designed for evaluating existing machine scores. Once you have the scores computed for all the responses in your data, the next steps are fairly straightforward:
Create a data file in one of the supported formats containing the computed system scores and the human scores you want to compare against.
Create an experiment configuration file describing the evaluation experiment you would like to run.
Run that configuration file with rsmeval and generate the experiment HTML report as well as the intermediate CSV files.
Examine the HTML report to check various aspects of model performance.
Note that the above workflow does not use any customization features, e.g., choosing which sections to include in the report or adding custom analysis sections. However, we will stick with this workflow for our tutorial since it is likely to be the most common use case.
ASAP Example
We are going to use the same example from the 2012 Kaggle competition on automated essay scoring that we used for the rsmtool tutorial.
Generate scores
rsmeval
is designed for researchers who have developed their own scoring engine for generating scores and would like to produce an evaluation report for those scores. For this tutorial, we will use the scores we generated for the ASAP2 evaluation set as part of the rsmtool tutorial.
Create a configuration file
The next step is to create an experiment configuration file in .json
format.
1{
2 "experiment_id": "ASAP2_evaluation",
3 "description": "Evaluation of the scores generated using rsmtool.",
4 "predictions_file": "ASAP2_scores.csv",
5 "system_score_column": "system",
6 "human_score_column": "human",
7 "id_column": "ID",
8 "trim_min": 1,
9 "trim_max": 6,
10 "second_human_score_column": "human2",
11 "scale_with": "asis"
12}
Let’s take a look at the options in our configuration file.
Line 2: We define an experiment ID.
Line 3: We also provide a description which will be included in the experiment report.
Line 4: We list the path to the file with the predicted and human scores. For this tutorial we used .csv format, but RSMTool also supports several other input file formats.
Line 5: This field indicates that the system scores in our .csv file are located in a column named system.
Line 6: This field indicates that the human (reference) scores in our .csv file are located in a column named human.
Line 7: This field indicates that the unique IDs for the responses in the .csv file are located in a column named ID.
Lines 8-9: These fields indicate that the lowest score on the scoring scale is a 1 and the highest score is a 6. This information is usually part of the rubric used by human graders.
Line 10: This field indicates that scores from a second set of human graders are also available (useful for comparing the agreement between human-machine scores to the agreement between two sets of humans) and are located in the human2 column in the .csv file.
Line 11: This field indicates that the provided machine scores are already re-scaled to match the distribution of human scores. rsmeval itself will not perform any scaling and the report will refer to these as scaled scores.
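If you prefer, the same configuration file can also be written programmatically; here is a minimal sketch using only the standard library, with the field values taken verbatim from the tutorial:

```python
# Write the tutorial's rsmeval configuration file programmatically.
import json

config = {
    "experiment_id": "ASAP2_evaluation",
    "description": "Evaluation of the scores generated using rsmtool.",
    "predictions_file": "ASAP2_scores.csv",
    "system_score_column": "system",
    "human_score_column": "human",
    "id_column": "ID",
    "trim_min": 1,
    "trim_max": 6,
    "second_human_score_column": "human2",
    "scale_with": "asis",
}

with open("config_rsmeval.json", "w") as f:
    json.dump(config, f, indent=4)
```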
Documentation for all of the available configuration options is available here.
Note
You can also use our nifty capability to automatically generate rsmeval
configuration files rather than creating them manually.
Run the experiment
Now that we have our scores in the right format and our configuration file in .json
format, we can use the rsmeval command-line script to run our evaluation experiment.
$ cd examples/rsmeval
$ rsmeval config_rsmeval.json
This should produce output like:
Output directory: /Users/nmadnani/work/rsmtool/examples/rsmeval
Assuming given system predictions are already scaled and will be used as such.
predictions: /Users/nmadnani/work/rsmtool/examples/rsmeval/ASAP2_scores.csv
Processing predictions
Saving pre-processed predictions and the metadata to disk
Running analyses on predictions
Starting report generation
Merging sections
Exporting HTML
Executing notebook with kernel: python3
Once the run finishes, you will see the output, figure, and report sub-directories in the current directory. Each of these directories contains useful information but we are specifically interested in the report/ASAP2_evaluation_report.html file, which is the final evaluation report.
Examine the report
Our experiment report contains all the information we would need to evaluate the provided system scores against the human scores. It includes:
The distributions for the human versus the system scores.
Several different metrics indicating how well the machine’s scores agree with the humans’.
Information about human-human agreement and the difference between human-human and human-system agreement.
… and much more.
Input
rsmeval
requires a single argument to run an experiment: the path to a configuration file. It can also take an output directory as an optional second argument. If the latter is not specified, rsmeval
will use the current directory as the output directory.
Here are all the arguments to the rsmeval
command-line script.
- config_file
The JSON configuration file for this experiment.
- output_dir (optional)
The output directory where all the files for this experiment will be stored.
- -f, --force
If specified, the contents of the output directory will be overwritten even if it already contains the output of another rsmeval experiment.
- -h, --help
Show help message and exit.
- -V, --version
Show version number and exit.
Experiment configuration file
This is a file in .json
format that provides overall configuration options for an rsmeval
experiment. Here’s an example configuration file for rsmeval
.
Note
To make it easy to get started with rsmeval
, we provide a way to automatically generate configuration files both interactively as well as non-interactively. Novice users will find interactive generation more helpful while more advanced users will prefer non-interactive generation. See this page for more details.
Next, we describe all of the rsmeval
configuration fields in detail. There are four required fields and the rest are all optional. We first describe the required fields and then the optional ones (sorted alphabetically).
experiment_id
An identifier for the experiment that will be used to name the report and all intermediate files. It can be any combination of alphanumeric values, must not contain spaces, and must be no longer than 200 characters.
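A sketch of these constraints as a validation check (the regex is our own approximation, not rsmtool code; underscores are allowed since the tutorial's own experiment IDs use them):

```python
# Approximate validation of the experiment_id constraints described above:
# alphanumeric characters and underscores, no spaces, at most 200 chars.
import re

def is_valid_experiment_id(experiment_id):
    return re.fullmatch(r"[A-Za-z0-9_]{1,200}", experiment_id) is not None
```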
predictions_file
The path to the file with predictions to evaluate. The file should be in one of the supported formats. Each row should correspond to a single response and contain the predicted and observed scores for this response. In addition, there should be a column with a unique identifier (ID) for each response. The path can be absolute or relative to the location of the configuration file.
system_score_column
The name for the column containing the scores predicted by the system. These scores will be used for evaluation.
trim_min
The single numeric value for the lowest possible integer score that the machine should predict. This value will be used to compute the floor value for trimmed (bound) machine scores as trim_min - trim_tolerance.
trim_max
The single numeric value for the highest possible integer score that the machine should predict. This value will be used to compute the ceiling value for trimmed (bound) machine scores as trim_max + trim_tolerance.
Note
Although the trim_min
and trim_max
fields are optional for rsmtool
, they are required for rsmeval
.
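The floor and ceiling computation can be sketched in a few lines (illustrative only, not rsmeval internals; 0.4998 is the documented default for trim_tolerance):

```python
# Sketch of score trimming with the tutorial's scale (1-6) and the
# default trim_tolerance; predictions are bound to the padded range.
trim_min, trim_max, trim_tolerance = 1, 6, 0.4998

def trim(score):
    floor = trim_min - trim_tolerance
    ceiling = trim_max + trim_tolerance
    return min(max(score, floor), ceiling)
```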
candidate_column (Optional)
The name for an optional column in the predictions file containing unique candidate IDs. Candidate IDs are different from response IDs since the same candidate (test-taker) might have responded to multiple questions.
custom_sections (Optional)
A list of custom, user-defined sections to be included in the final report. These are IPython notebooks (.ipynb
files) created by the user. The list must contain paths to the notebook files, either absolute or relative to the configuration file. All custom notebooks have access to some pre-defined variables.
description (Optional)
A brief description of the experiment. This will be included in the report. The description can contain spaces and punctuation. It’s blank by default.
exclude_zero_scores (Optional)
By default, responses with human scores of 0 will be excluded from evaluations. Set this field to false
if you want to keep responses with scores of 0. Defaults to true
.
file_format (Optional)
The format of the intermediate files. Options are csv
, tsv
, or xlsx
. Defaults to csv
if this is not specified.
flag_column (Optional)
This field makes it possible to only use responses with particular values in a given column (e.g. only responses with a value of 0
in a column called ADVISORY
). The field takes a dictionary in Python format where the keys are the names of the columns and the values are lists of values for responses that will be evaluated. For example, a value of {"ADVISORY": 0}
will mean that rsmeval
will only use responses for which the ADVISORY
column has the value 0. Defaults to None
.
Note
If several conditions are specified (e.g., {"ADVISORY": 0, "ERROR": 0}
) only those responses which satisfy all the conditions will be selected for further analysis (in this example, these will be the responses where the ADVISORY
column has a value of 0 and the ERROR
column has a value of 0).
Note
When reading the values in the supplied dictionary, rsmeval
treats numeric strings, floats and integers as the same value. Thus 1
, 1.0
, "1"
and "1.0"
are all treated as 1.0.
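The filtering semantics can be sketched as follows (our own illustration, not rsmeval internals; the ADVISORY and ERROR column names come from the examples above):

```python
# Sketch of flag_column filtering: keep only rows whose flag columns
# contain one of the allowed values, normalizing numeric strings,
# floats, and integers to the same value as described in the note.
def normalize(value):
    """Treat 1, 1.0, "1", and "1.0" as the same value."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return value

flag_column = {"ADVISORY": [0], "ERROR": [0]}
rows = [
    {"ID": "r1", "ADVISORY": 0, "ERROR": "0"},
    {"ID": "r2", "ADVISORY": 1, "ERROR": 0},
    {"ID": "r3", "ADVISORY": "0.0", "ERROR": 0},
]
kept = [
    r for r in rows
    if all(normalize(r[col]) in {normalize(v) for v in vals}
           for col, vals in flag_column.items())
]
# kept contains r1 and r3: both conditions must be satisfied
```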
general_sections (Optional)
RSMTool provides pre-defined sections for rsmeval
(listed below) and, by default, all of them are included in the report. However, you can choose a subset of these pre-defined sections by specifying a list as the value for this field.
data_description
: Shows the total number of responses, along with any responses that have been excluded due to non-numeric/zero scores or flag columns.
data_description_by_group
: Shows the total number of responses for each of the subgroups specified in the configuration file. This section only covers the responses used to evaluate the model.
consistency
: Shows metrics for human-human agreement, the difference (“degradation”) between the human-human and human-system agreement, and the disattenuated human-machine correlations. This notebook is only generated if the config file specifies second_human_score_column.
evaluation
: Shows the standard set of evaluations recommended for scoring models on the evaluation data:
a table showing human-system association metrics;
the confusion matrix; and
a barplot showing the distributions for both human and machine scores.
evaluation_by_group
: Shows barplots with the main evaluation metrics by each of the subgroups specified in the configuration file.
fairness_analyses
: Additional fairness analyses suggested in Loukina, Madnani, & Zechner, 2019. The notebook shows:
percentage of variance in squared error explained by subgroup membership
percentage of variance in raw (signed) error explained by subgroup membership
percentage of variance in raw (signed) error explained by subgroup membership when controlling for human score
plots showing estimates for each subgroup for each model
true_score_evaluation
: Evaluation of system scores against the true scores estimated according to test theory. The notebook shows:
Number of single and double-scored responses.
Variance of human rater errors and estimated variance of true scores
Mean squared error (MSE) and proportional reduction in mean squared error (PRMSE) when predicting true score with system score.
intermediate_file_paths
: Shows links to all of the intermediate files that were generated while running the evaluation.
sysinfo
: Shows all Python packages along with versions installed in the current environment while generating the report.
human_score_column (Optional)
The name for the column containing the human scores for each response. The values in this column will be used as observed scores. Defaults to sc1
.
Note
All responses with non-numeric values or zeros in either human_score_column
or system_score_column
will be automatically excluded from evaluation. You can use exclude_zero_scores (Optional) to keep responses with zero scores.
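The exclusion rule can be sketched as follows (an approximation, not rsmeval internals):

```python
# Sketch of the automatic exclusion rule: a score value is usable only
# if it is numeric and, unless exclude_zero_scores is false, non-zero.
def is_usable(value, exclude_zero=True):
    try:
        score = float(value)
    except (TypeError, ValueError):
        return False  # non-numeric values are always excluded
    return not (exclude_zero and score == 0)
```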
id_column (Optional)
The name of the column containing the response IDs. Defaults to spkitemid
, i.e., if this is not specified, rsmeval
will look for a column called spkitemid
in the prediction file.
min_items_per_candidate (Optional)
An integer value for the minimum number of responses expected from each candidate. If any candidates have fewer responses than the specified value, all responses from those candidates will be excluded from further analysis. Defaults to None
.
min_n_per_group (Optional)
A single numeric value or a dictionary with keys as the group names listed in the subgroups field and values as the thresholds for the groups. When specified, only groups with at least this number of instances will be displayed in the tables and plots contained in the report. Note that this parameter only affects the HTML report and the figures. For all analyses – including the computation of the population parameters – data from all groups will be used. In addition, the intermediate files will still show the results for all groups.
Note
If you supply a dictionary, it must contain a key for every subgroup listed in the subgroups field. If no threshold is to be applied for some of the groups, set the threshold value for those groups to 0 in the dictionary.
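A sketch of how such a threshold affects what is displayed (illustrative; the subgroup column "L1" and its category counts are made up):

```python
# Sketch of min_n_per_group display filtering: for a subgroup column
# (here a hypothetical "L1"), only categories with at least the
# threshold number of responses appear in the report's tables and
# plots; all categories still contribute to the analyses themselves.
min_n_per_group = {"L1": 10}  # threshold for the "L1" subgroup column
counts_by_category = {"english": 150, "spanish": 12, "other": 3}

displayed = [cat for cat, n in counts_by_category.items()
             if n >= min_n_per_group["L1"]]
# only "english" and "spanish" would be shown in the report
```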
rater_error_variance (Optional)
True score evaluations require an estimate of rater error variance. By default, rsmeval
will compute this variance from double-scored responses in the data. However, in some cases, one may wish to compute the variance on a different sample of responses. In such cases, this field can be used to set the rater error variance to a precomputed value which is then used as-is by rsmeval
. You can use the rsmtool.utils.variance_of_errors function to compute rater error variance outside the main evaluation pipeline.
scale_with (Optional)
In many scoring applications, system scores are re-scaled so that their mean and standard deviation match those of the human scores for the training data.
If you want rsmeval
to re-scale the supplied predictions, you need to provide – as the value for this field – the path to a second file in one of the supported formats containing the human scores and predictions of the same system on its training data. This file must have two columns: the human scores under the sc1 column and the predicted scores under the prediction column.
This field can also be set to "asis"
if the scores are already scaled. In this case, no additional scaling will be performed by rsmeval
but the report will refer to the scores as “scaled”.
Defaults to "raw"
which means that no rescaling is performed and the report refers to the scores as “raw”.
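When a training-data file is supplied, the re-scaling matches the mean and standard deviation of the system scores to those of the human scores. A sketch under that assumption (illustrative values; we use the population standard deviation here, which may differ from rsmeval's exact computation):

```python
# Sketch of mean/sd matching: map a system score onto the human score
# distribution observed on the training data.
from statistics import mean, pstdev

human_train = [2, 3, 4, 5, 3]             # sc1 column of the scale_with file
system_train = [2.2, 2.9, 4.1, 4.8, 3.0]  # prediction column

def rescale(score):
    z = (score - mean(system_train)) / pstdev(system_train)
    return z * pstdev(human_train) + mean(human_train)
```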
second_human_score_column (Optional)
The name for an optional column in the test data containing a second human score for each response. If specified, additional information about human-human agreement and degradation will be computed and included in the report. Note that this column must contain either numbers or be empty. Non-numeric values are not accepted. Note also that the exclude_zero_scores (Optional) option below will apply to this column too.
Note
You do not need to have second human scores for all responses to use this option. The human-human agreement statistics will be computed as long as there is at least one response with numeric value in this column. For responses that do not have a second human score, the value in this column should be blank.
section_order (Optional)
A list containing the order in which the sections in the report should be generated. Any specified order must explicitly list:
Either all pre-defined sections if a value for the general_sections field is not specified OR the sections specified using general_sections, and
All custom section names specified using custom_sections, i.e., file prefixes only, without the path and without the .ipynb extension.
subgroups (Optional)
A list of column names indicating grouping variables used for generating analyses specific to each of those defined subgroups. For example, ["prompt", "gender", "native_language", "test_country"]
. These subgroup columns need to be present in the input predictions file. If subgroups are specified, rsmeval
will generate:
tables and barplots showing human-system agreement for each subgroup on the evaluation set.
trim_tolerance (Optional)
The single numeric value that will be used to pad the trimming range specified in trim_min
and trim_max
. This value will be used to compute the ceiling and floor values for trimmed (bound) machine scores as trim_max
+ trim_tolerance
for ceiling value and trim_min
-trim_tolerance
for floor value.
Defaults to 0.4998.
Note
For more fine-grained control over the trimming range, you can set trim_tolerance
to 0 and use trim_min
and trim_max
to specify the exact floor and ceiling values.
use_thumbnails (Optional)
If set to true
, the images in the HTML will be set to clickable thumbnails rather than full-sized images. Upon clicking the thumbnail, the full-sized images will be displayed in a separate tab in the browser. If set to false
, full-sized images will be displayed as usual. Defaults to false
.
use_wandb (Optional)
If set to true
, the generated reports and all intermediate tables will be logged to Weights & Biases.
The Weights & Biases entity and project name should be specified in the appropriate configuration fields.
The tables and plots will be logged in a section named rsmeval in a new run under the given project, and the report will be
added to a reports section in that run. In addition, some evaluation metrics will be logged to the run’s history; see more details
here. Defaults to false
.
wandb_project (Optional)
The Weights & Biases project name if logging to Weights & Biases is enabled. If a project by this name does not already exist, it will be created.
Important
Before using Weights & Biases for the first time, users should log in and provide their API key as described in W&B Quickstart guidelines.
Note that when using W&B logging, the rsmtool run may take significantly longer due to the network traffic being sent to W&B.
wandb_entity (Optional)
The Weights & Biases entity name if logging to Weights & Biases is enabled. Entity can be a user name or the name of a team or organization.
Output
rsmeval
produces a set of folders in the output directory. If logging to Weights & Biases is enabled,
the reports and all intermediate files are also logged to the specified Weights & Biases project.
report
This folder contains the final RSMEval report in HTML format as well as in the form of a Jupyter notebook (a .ipynb
file).
output
This folder contains all of the intermediate files produced as part of the various analyses performed, saved as .csv
files. rsmeval
will also save in this folder a copy of the configuration file. Fields not specified in the original configuration file will be pre-populated with default values.
figure
This folder contains all of the figures generated as part of the various analyses performed, saved as .svg
files.
Intermediate files
Although the primary output of rsmeval
is an HTML report, we also want the user to be able to conduct additional analyses outside of rsmeval
. To this end, all of the tables produced in the experiment report are saved as files in the format specified by the file_format
parameter in the output
directory. The following sections describe all of the intermediate files that are produced.
Note
The names of all files begin with the experiment_id
provided by the user in the experiment configuration file. In addition, the names for certain columns are set to default values in these files irrespective of what they were named in the original data files. This is because RSMEval standardizes these column names internally for convenience. These values are:
spkitemid for the column containing response IDs.
sc1 for the column containing the human scores used as observed scores.
sc2 for the column containing the second human scores, if this column was specified in the configuration file.
candidate for the column containing candidate IDs, if this column was specified in the configuration file.
Predictions
filename: pred_processed
This file contains the post-processed predicted scores: the predictions from the model are truncated, rounded, and re-scaled (if requested).
Flagged responses
filename: test_responses_with_excluded_flags
This file contains all of the rows in the input predictions file that were filtered out based on conditions specified in flag_column.
Note
If the predictions file contained columns with internal names such as sc1
that were not actually used by rsmeval
, they will still be included in these files but their names will be changed to ##name##
(e.g. ##sc1##
).
Excluded responses
filename: test_excluded_responses
This file contains all of the rows in the predictions file that were filtered out because of non-numeric or zero scores.
Response metadata
filename: test_metadata
This file contains the metadata columns (id_column
, subgroups
if provided) for all rows in the predictions file that were used in the evaluation.
Unused columns
filename: test_other_columns
This file contains all of the columns from the input predictions file that are not present in the *_pred_processed
and *_metadata
files. They only include the rows that were not filtered out.
Note
If the predictions file contained columns with internal names such as sc1
but these columns were not actually used by rsmeval
, these columns will also be included in these files but their names will be changed to ##name##
(e.g. ##sc1##
).
Human scores
filename: test_human_scores
This file contains the human scores, if available in the input predictions file, under a column called sc1
with the response IDs under the spkitemid
column.
If second_human_score_column
was specified, then it also contains the values in the predictions file from that column under a column called sc2
. Only the rows that were not filtered out are included.
Note
If exclude_zero_scores
was set to true
(the default value), all zero scores in the second_human_score_column
will be replaced by nan
.
Data composition
filename: data_composition
This file contains the total number of responses in the input predictions file. If applicable, the table will also include the number of different subgroups.
Excluded data composition
filenames: test_excluded_composition
This file contains the composition of the set of excluded responses, e.g., why were they excluded and how many for each such exclusion.
Subgroup composition
filename: data_composition_by_<SUBGROUP>
There will be one such file for each of the specified subgroups and it contains the total number of responses in that subgroup.
Evaluation metrics
eval: This file contains the descriptives for predicted and human scores (mean, std. dev., etc.) as well as the association metrics (correlation, quadratic weighted kappa, SMD, etc.) for the raw as well as the post-processed scores.
eval_by_<SUBGROUP>: the same information as in *_eval.csv computed separately for each subgroup. However, rather than SMD, a difference of standardized means (DSM) will be calculated using z-scores.
eval_short: a shortened version of eval that contains specific descriptives for predicted and human scores (mean, std. dev., etc.) and association metrics (correlation, quadratic weighted kappa, SMD, etc.) for specific score types chosen based on recommendations by Williamson (2012). Specifically, the following columns are included (the raw or scale version is chosen depending on the value of use_scaled_predictions in the configuration file):
h_mean
h_sd
corr
sys_mean [raw/scale trim]
sys_sd [raw/scale trim]
SMD [raw/scale trim]
adj_agr [raw/scale trim_round]
exact_agr [raw/scale trim_round]
kappa [raw/scale trim_round]
wtkappa [raw/scale trim]
sys_mean [raw/scale trim_round]
sys_sd [raw/scale trim_round]
SMD [raw/scale trim_round]
R2 [raw/scale trim]
RMSE [raw/scale trim]
score_dist: the distributions of the human scores and the rounded raw/scaled predicted scores, depending on the value of use_scaled_predictions.
confMatrix: the confusion matrix between the human scores and the rounded raw/scaled predicted scores, depending on the value of use_scaled_predictions.
Note
Please note that for raw scores, SMD values are likely to be affected by possible differences in scale.
Human-human Consistency
These files are created only if a second human score has been made available via the second_human_score_column
option in the configuration file.
consistency: contains descriptives for both human raters as well as the agreement metrics between their ratings.
consistency_by_<SUBGROUP>: contains the same metrics as in the consistency file computed separately for each group. However, rather than SMD, a difference of standardized means (DSM) will be calculated using z-scores.
degradation: shows the differences between human-human agreement and machine-human agreement for all association metrics and all forms of predicted scores.
Evaluations based on test theory
disattenuated_correlations: shows the correlation between human-machine scores, human-human scores, and the disattenuated human-machine correlation computed as the human-machine correlation divided by the square root of the human-human correlation.
disattenuated_correlations_by_<SUBGROUP>: contains the same metrics as in the disattenuated_correlations file computed separately for each group.
true_score_eval: evaluations of system scores against estimated true scores. Contains total counts of single and double-scored responses, variance of human rater error, estimated true score variance, and mean squared error (MSE) and proportional reduction in mean squared error (PRMSE) when predicting true score using system score.
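The disattenuated correlation described above can be computed directly (illustrative values):

```python
# Disattenuated human-machine correlation: the human-machine correlation
# divided by the square root of the human-human correlation.
import math

r_human_machine = 0.75  # correlation between human and machine scores
r_human_human = 0.81    # correlation between the two human raters

disattenuated = r_human_machine / math.sqrt(r_human_human)
```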
Additional fairness analyses
These files contain the results of additional fairness analyses suggested in Loukina, Madnani, & Zechner, 2019.
<METRICS>_by_<SUBGROUP>.ols: a serialized object of type pandas.stats.ols.OLS containing the fitted model for estimating the variance attributed to a given subgroup membership for a given metric. The subgroups are defined by the configuration file. The metrics are osa (overall score accuracy), osd (overall score difference), and csd (conditional score difference).
<METRICS>_by_<SUBGROUP>_ols_summary.txt: a text file containing a summary of the above model.
estimates_<METRICS>_by_<SUBGROUP>: coefficients, confidence intervals and p-values estimated by the model for each subgroup.
fairness_metrics_by_<SUBGROUP>: the \(R^2\) (percentage of variance) and p-values for all models.
rsmpredict
- Generate new predictions
RSMTool provides the rsmpredict
command-line utility to generate predictions for new data using a model already trained using the rsmtool
utility. This can be useful when processing a new set of responses to the same task without needing to retrain the model.
rsmpredict
pre-processes the feature values according to user specifications before using them to generate the predicted scores. The generated scores are post-processed in the same manner as they are in rsmtool
output.
Note
No score is generated for responses with non-numeric values for any of the features included in the model.
If the original model specified transformations for some of the features and these transformations led to NaN
or Inf
values when applied to the new data, rsmpredict
will raise a warning. No score will be generated for such responses.
Tutorial
For this tutorial, you first need to install RSMTool and make sure the conda environment in which you installed it is activated.
Workflow
Important
Although this tutorial provides feature values for the purpose of illustration, rsmpredict
does not include any functionality for feature extraction; the tool is designed for researchers who use their own NLP/Speech processing pipeline to extract features for their data.
rsmpredict
allows you to generate the scores for new data using an existing model trained using RSMTool. Therefore, before starting this tutorial, you first need to complete the rsmtool tutorial, which will produce a trained RSMTool model. You will also need to process the new data to extract the same features as the ones used in the model.
Once you have the features for the new data and the RSMTool model, using rsmpredict
is fairly straightforward:
Create a file containing the features for the new data. The file should be in one of the supported formats.
Create an experiment configuration file describing the experiment you would like to run.
Run that configuration file with rsmpredict to generate the predicted scores.
Note
You do not need human scores to run rsmpredict since it does not produce any evaluation analyses. If you do have human scores for the new data and you would like to evaluate the system on this new data, you can first run rsmpredict to generate the predictions and then run rsmeval on the output of rsmpredict to generate an evaluation report.
ASAP Example
We are going to use the same example from the 2012 Kaggle competition on automated essay scoring that we used for the rsmtool tutorial. Specifically, we are going to use the linear regression model we trained in that tutorial to generate scores for new data.
Note
If you have not already completed that tutorial, please do so now. You may need to complete it again if you deleted the output files.
Extract features
We will first need to generate features for the new set of responses for which we want to predict scores. For this experiment, we will simply re-use the test set from the rsmtool
tutorial.
Note
The features used with rsmpredict
should be generated using the same NLP/Speech processing pipeline that generated the features used in the rsmtool
modeling experiment.
Create a configuration file
The next step is to create an rsmpredict experiment configuration file in .json
format.
1{
2 "experiment_dir": "../rsmtool",
3 "experiment_id": "ASAP2",
4 "input_features_file": "../rsmtool/test.csv",
5 "id_column": "ID",
6 "human_score_column": "score",
7 "second_human_score_column": "score2"
8}
Let’s take a look at the options in our configuration file.
Line 2: We give the path to the directory containing the output of the rsmtool experiment.
Line 3: We provide the experiment_id of the rsmtool experiment used to train the model. This can usually be read off the output/<experiment_id>.model file in the rsmtool experiment output directory.
Line 4: We list the path to the data file with the feature values for the new data. For this tutorial we used .csv format, but RSMTool also supports several other input file formats.
Line 5: This field indicates that the unique IDs for the responses in the .csv file are located in a column named ID.
Lines 6-7: These fields indicate that there are two sets of human scores in our .csv file, located in the columns named score and score2. The values from these columns will be added to the output file containing the predictions, which can be useful if we want to evaluate the predictions using rsmeval.
Documentation for all of the available configuration options is available here.
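For reference, the same configuration can also be assembled programmatically, which is handy when generating many configuration files in a loop. This is just a sketch using the standard json module; the output file name is arbitrary:

```python
import json

# Build the rsmpredict configuration shown above as a plain
# dictionary and write it to disk as JSON.
config = {
    "experiment_dir": "../rsmtool",
    "experiment_id": "ASAP2",
    "input_features_file": "../rsmtool/test.csv",
    "id_column": "ID",
    "human_score_column": "score",
    "second_human_score_column": "score2",
}

with open("config_rsmpredict.json", "w") as f:
    json.dump(config, f, indent=4)
```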
Note
You can also use our nifty capability to automatically generate rsmpredict
configuration files rather than creating them manually.
Run the experiment
Now that we have the model, the features in the right format, and our configuration file in .json
format, we can use the rsmpredict command-line script to generate the predictions and to save them in predictions.csv
.
$ cd examples/rsmpredict
$ rsmpredict config_rsmpredict.json predictions.csv
This should produce output like:
WARNING: The following extraneous features will be ignored: {'spkitemid', 'sc1', 'sc2', 'LENGTH'}
Pre-processing input features
Generating predictions
Rescaling predictions
Trimming and rounding predictions
Saving predictions to /Users/nmadnani/work/rsmtool/examples/rsmpredict/predictions.csv
You should now see a file named predictions.csv
in the current directory which contains the predicted scores for the new data in the predictions
column.
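Because the output is a plain delimited file, it can be consumed with standard CSV tooling. Here is a minimal sketch using Python's csv module on a made-up two-row sample shaped like the rsmpredict output columns (the response IDs and score values are hypothetical):

```python
import csv
import io

# A tiny sample in the shape of the rsmpredict predictions file;
# all values below are invented for illustration.
sample = """spkitemid,sc1,sc2,raw,raw_trim,raw_trim_round,scale,scale_trim,scale_trim_round
RESPONSE_1,3,4,3.42,3.42,3,3.38,3.38,3
RESPONSE_2,2,2,1.87,1.87,2,1.91,1.91,2
"""

rows = list(csv.DictReader(io.StringIO(sample)))
# Pull out the rounded scaled score for each response.
scores = {row["spkitemid"]: int(row["scale_trim_round"]) for row in rows}
print(scores)  # {'RESPONSE_1': 3, 'RESPONSE_2': 2}
```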
Input
rsmpredict
requires two arguments to generate predictions: the path to a configuration file and the path to the output file where the generated predictions are saved in .csv
format.
If you also want to save the pre-processed feature values, rsmpredict
can take a third optional argument --features
to specify the path to a .csv
file to save these values.
Here are all the arguments to the rsmpredict
command-line script.
- config_file
The JSON configuration file for this experiment.
- output_file
The output
.csv
file where predictions will be saved.
- --features <preproc_feats_file>
If specified, the pre-processed values for the input features will also be saved in this
.csv
file.
- -h, --help
Show help message and exit.
- -V, --version
Show version number and exit.
Experiment configuration file
This is a file in .json
format that provides overall configuration options for an rsmpredict
experiment. Here’s an example configuration file for rsmpredict
.
Note
To make it easy to get started with rsmpredict
, we provide a way to automatically generate configuration files both interactively and non-interactively. Novice users will find interactive generation more helpful, while more advanced users will prefer non-interactive generation. See this page for more details.
Next, we describe all of the rsmpredict
configuration fields in detail. There are three required fields and the rest are all optional. We first describe the required fields and then the optional ones (sorted alphabetically).
experiment_dir
The path to the directory containing the rsmtool
model to use for generating predictions. This directory must contain a sub-directory called output
with the model files, feature pre-processing parameters, and score post-processing parameters. The path can be absolute or relative to the location of the configuration file.
experiment_id
The experiment_id
used to create the rsmtool
model files being used for generating predictions. If you do not know the experiment_id
, you can find it by looking at the prefix of the .model
file under the output
directory.
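If you script this lookup, the experiment_id can be recovered from the prefix of the .model file name. A sketch, using a throwaway directory with a fake model file standing in for a real rsmtool output directory:

```python
from pathlib import Path
import tempfile

# Recover the experiment_id from the prefix of the .model file in an
# rsmtool output directory. The directory and file here are created
# purely for illustration.
with tempfile.TemporaryDirectory() as exp_dir:
    output_dir = Path(exp_dir) / "output"
    output_dir.mkdir()
    (output_dir / "ASAP2.model").touch()

    model_files = list(output_dir.glob("*.model"))
    experiment_id = model_files[0].stem

print(experiment_id)  # ASAP2
```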
input_features_file
The path to the file with the raw feature values that will be used for generating predictions. The file should be in one of the supported formats. Each row should correspond to a single response and contain feature values for this response. In addition, there should be a column with a unique identifier (ID) for each response. The path can be absolute or relative to the location of the configuration file. Note that the feature names must be the same as those used in the original rsmtool
experiment.
Note
rsmpredict
will only generate predictions for responses in this file that have numeric values for the features included in the rsmtool
model.
See also
rsmpredict
does not require human scores for the new data since it does not evaluate the generated predictions. If you do have the human scores and want to evaluate the new predictions, you can use the rsmeval command-line utility.
candidate_column (Optional)
The name for the column containing unique candidate IDs. This column will be named candidate
in the output file with predictions.
file_format (Optional)
The format of the intermediate files. Options are csv
, tsv
, or xlsx
. Defaults to csv
if this is not specified.
flag_column (Optional)
See description in the rsmtool configuration file for further information. No filtering will be done by rsmpredict
, but the contents of all specified columns will be added to the predictions file using the original column names.
human_score_column (Optional)
The name for the column containing human scores. This column will be renamed to sc1
.
id_column (Optional)
The name of the column containing the response IDs. Defaults to spkitemid
, i.e., if this is not specified, rsmpredict
will look for a column called spkitemid
in the prediction file.
There are several other options in the configuration file that, while not directly used by rsmpredict
, can simply be passed through from the input features file to the output predictions file. This can be particularly useful if you want to subsequently run rsmeval to evaluate the generated predictions.
predict_expected_scores (Optional)
If the original model was a probabilistic SKLL classifier, then expected scores (probability-weighted averages over the contiguous numeric score points) can be generated as the machine predictions instead of the most likely score point, which would be the default. Set this field to true
to compute expected scores as predictions. Defaults to false
.
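Conceptually, an expected score is just the probability-weighted average of the score points. A small sketch with made-up class probabilities:

```python
# Expected score from a probabilistic classifier's output: a
# probability-weighted average over the contiguous numeric score
# points. The probabilities below are invented for illustration.
score_points = [1, 2, 3, 4]
probabilities = [0.05, 0.20, 0.60, 0.15]  # must sum to 1

expected_score = sum(p * s for p, s in zip(probabilities, score_points))
print(round(expected_score, 2))  # 2.85, rather than the most likely point, 3
```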
Note
If the model in the original rsmtool
experiment is an SVC, that original experiment must have been run with predict_expected_scores
set to true
. This is because SVC classifiers are fit differently if probabilistic output is desired, in contrast to other probabilistic SKLL classifiers.

You may see slight differences in expected score predictions if you run the experiment on different machines or on different operating systems, most likely due to very small probability values for certain score points, which can affect floating point computations.
second_human_score_column (Optional)
The name for the column containing the second human score. This column will be renamed to sc2
.
standardize_features (Optional)
If this option is set to false
, features will not be standardized by subtracting the mean and dividing by the standard deviation. Defaults to true
.
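As a sketch, standardization transforms each feature value by subtracting the mean and dividing by the standard deviation. In a real run, these two parameters come from the original rsmtool training data rather than from the new data; the values below are made up:

```python
from statistics import mean, stdev

# Standardize a column of feature values (invented numbers). In an
# actual rsmpredict run, mu and sigma would be the pre-processing
# parameters saved by the original rsmtool experiment.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu, sigma = mean(values), stdev(values)

standardized = [(v - mu) / sigma for v in values]
```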
subgroups (Optional)
A list of column names indicating grouping variables used for generating analyses specific to each of those defined subgroups. For example, ["prompt", "gender", "native_language", "test_country"]
. All these columns will be included in the predictions file with their original names.
truncate_outliers (Optional)
If this option is set to false
, outliers (values more than 4 standard deviations away from the mean) in feature columns will not be truncated. Defaults to true
.
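The truncation itself amounts to clipping values at four standard deviations on either side of the mean. A sketch with made-up numbers; in a real run the mean and standard deviation come from the original rsmtool training data:

```python
from statistics import mean, stdev

# Clip feature values to the [mean - 4*sd, mean + 4*sd] range.
# The training values below are invented; they stand in for the
# pre-processing parameters saved by the original rsmtool experiment.
train_values = [3.1, 2.9, 3.0, 3.2, 2.8, 3.1, 2.9, 3.0]
mu, sigma = mean(train_values), stdev(train_values)
lo, hi = mu - 4 * sigma, mu + 4 * sigma

new_values = [3.0, 2.7, 50.0]  # 50.0 is an obvious outlier
truncated = [min(max(v, lo), hi) for v in new_values]
```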
use_wandb (Optional)
If set to true
, the generated report and the predictions table will be logged to Weights & Biases.
The Weights & Biases entity and project name should be specified in the appropriate configuration fields.
The predictions table will be logged in a section named rsmpredict in a new run under the given project, and the report will be
added to a reports section in that run.
Defaults to false
.
wandb_project (Optional)
The Weights & Biases project name if logging to Weights & Biases is enabled. If a project by this name does not already exist, it will be created.
Important
Before using Weights & Biases for the first time, users should log in and provide their API key as described in W&B Quickstart guidelines.
Note that when using W&B logging, the rsmtool run may take significantly longer due to the network traffic being sent to W&B.
wandb_entity (Optional)
The Weights & Biases entity name if logging to Weights & Biases is enabled. Entity can be a user name or the name of a team or organization.
Output
rsmpredict
produces a .csv
file with predictions for all responses in new data set, and, optionally, a .csv
file with pre-processed feature values. If any of the responses had non-numeric feature values in
the original data or after applying transformations, these are saved in a file named PREDICTIONS_NAME_excluded_responses.csv
where PREDICTIONS_NAME
is the name of the predictions file supplied by the user without the extension.
The predictions .csv
file contains the following columns:
spkitemid: the unique response IDs from the original feature file.
sc1 and sc2: the human scores for each response from the original feature file (human_score_column and second_human_score_column, respectively).
raw: raw predictions generated by the model.
raw_trim, raw_trim_round, scale, scale_trim, scale_trim_round: raw scores post-processed in different ways.
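The trimming and rounding behind these post-processed columns can be sketched as follows, assuming a 1-4 score range and a trim tolerance of 0.4998 (both are properties of the original rsmtool experiment; the raw scores below are made up):

```python
# Sketch of the post-processing behind raw_trim and raw_trim_round.
trim_min, trim_max, tolerance = 1, 4, 0.4998

def trim(score):
    """Clip a raw prediction into [trim_min - tol, trim_max + tol]."""
    return min(max(score, trim_min - tolerance), trim_max + tolerance)

raw = [0.3, 2.6, 5.1]                               # hypothetical raw predictions
raw_trim = [trim(s) for s in raw]                   # [0.5002, 2.6, 4.4998]
raw_trim_round = [int(round(s)) for s in raw_trim]  # [1, 3, 4]
```

The scale* columns apply the same trimming and rounding after rescaling the raw scores.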
If logging to Weights & Biases is enabled, these csv files are also logged to the specified Weights & Biases project.
rsmcompare
- Create a detailed comparison of two scoring models
RSMTool provides the rsmcompare
command-line utility to compare two models and to generate a detailed comparison report including differences between the two models. This can be useful in many scenarios, e.g., say the user wants to compare the changes in model performance after adding a new feature to the model. To use rsmcompare
, the user must first run two experiments using either rsmtool or rsmeval. rsmcompare
can then be used to compare the outputs of these two experiments to each other.
Note
Currently rsmcompare
takes the outputs of the analyses generated during the original experiments and creates comparison tables. These comparison tables were designed with a specific comparison scenario in mind: comparing a baseline model with a model which includes new feature(s). The tool can certainly be used for other comparison scenarios if the researcher feels that the generated comparison output is appropriate.
rsmcompare
can be used to compare:
Two rsmtool experiments, or
Two rsmeval experiments, or
An rsmtool experiment with an rsmeval experiment (in this case, only the evaluation analyses will be compared).
Note
It is strongly recommended that the original experiments as well as the comparison experiment are all done using the same version of RSMTool.
Tutorial
For this tutorial, you first need to install RSMTool and make sure the conda environment in which you installed it is activated.
Workflow
rsmcompare
is designed to compare two existing rsmtool
or rsmeval
experiments. To use rsmcompare
you need to:
Create an experiment configuration file describing the comparison experiment you would like to run.
Run that configuration file with rsmcompare and generate the comparison experiment HTML report.
Examine the HTML report to compare the two models.
Note that the above workflow does not use the customization features of rsmcompare
, e.g., choosing which sections to include in the report or adding custom analyses sections etc. However, we will stick with this workflow for our tutorial since it is likely to be the most common use case.
ASAP Example
We are going to use the same example from the 2012 Kaggle competition on automated essay scoring that we used for the rsmtool tutorial.
Run rsmtool
(or rsmeval
) experiments
rsmcompare
compares the results of two existing rsmtool
(or rsmeval
) experiments. For this tutorial, we will compare the model trained in the rsmtool tutorial to itself.
Note
If you have not already completed that tutorial, please do so now. You may need to complete it again if you deleted the output files.
Create a configuration file
The next step is to create an experiment configuration file in .json
format.
1{
2 "comparison_id": "ASAP2_vs_ASAP2",
3 "experiment_id_old": "ASAP2",
4 "experiment_dir_old": "../rsmtool/",
5 "description_old": "RSMTool experiment.",
6 "use_scaled_predictions_old": true,
7 "experiment_id_new": "ASAP2",
8 "experiment_dir_new": "../rsmtool",
9 "description_new": "RSMTool experiment (copy).",
10 "use_scaled_predictions_new": true
11}
Let’s take a look at the options in our configuration file.
Line 2: We provide an ID for the comparison experiment.
Line 3: We provide the experiment_id for the experiment we want to use as a baseline.
Line 4: We also give the path to the directory containing the output of the original baseline experiment.
Line 5: We give a short description of this baseline experiment. This will be shown in the report.
Line 6: This field indicates that the baseline experiment used scaled scores for some evaluation analyses.
Line 7: We provide the experiment_id for the new experiment. We use the same experiment ID for both experiments since we are comparing the experiment to itself.
Line 8: We also give the path to the directory containing the output of the new experiment. As above, we use the same path because we are comparing the experiment to itself.
Line 9: We give a short description of the new experiment. This will also be shown in the report.
Line 10: This field indicates that the new experiment also used scaled scores for some evaluation analyses.
Documentation for all of the available configuration options is available here.
Note
You can also use our nifty capability to automatically generate rsmcompare
configuration files rather than creating them manually.
Run the experiment
Now that we have the two experiments we want to compare and our configuration file in .json
format, we can use the rsmcompare command-line script to run our comparison experiment.
$ cd examples/rsmcompare
$ rsmcompare config_rsmcompare.json
This should produce output like:
Output directory: /Users/nmadnani/work/rsmtool/examples/rsmcompare
Starting report generation
Merging sections
Exporting HTML
Executing notebook with kernel: python3
Once the run finishes, you will see an HTML file named ASAP2_vs_ASAP2_report.html
. This is the final rsmcompare
comparison report.
Examine the report
Our experiment report contains all the information we would need to compare the new model to the baseline model. It includes:
Comparison of feature distributions between the two experiments.
Comparison of model coefficients between the two experiments.
Comparison of model performance between the two experiments.
Note
Since we are comparing the experiment to itself, the comparison is not very interesting, e.g., the differences between various values will always be 0.
Input
rsmcompare
requires a single argument to run an experiment: the path to a configuration file. It can also take an output directory as an optional second argument. If the latter is not specified, rsmcompare
will use the current directory as the output directory.
Here are all the arguments to the rsmcompare
command-line script.
- config_file
The JSON configuration file for this experiment.
- output_dir (optional)
The output directory where the report files for this comparison will be stored.
- -h, --help
Show help message and exit.
- -V, --version
Show version number and exit.
Experiment configuration file
This is a file in .json
format that provides overall configuration options for an rsmcompare
experiment. Here’s an example configuration file for rsmcompare
.
Note
To make it easy to get started with rsmcompare
, we provide a way to automatically generate configuration files both interactively and non-interactively. Novice users will find interactive generation more helpful, while more advanced users will prefer non-interactive generation. See this page for more details.
Next, we describe all of the rsmcompare
configuration fields in detail. There are seven required fields and the rest are all optional. We first describe the required fields and then the optional ones (sorted alphabetically).
comparison_id
An identifier for the comparison experiment that will be used to name the report. It can be any combination of alphanumeric values, must not contain spaces, and must not be any longer than 200 characters.
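A quick validity check for such identifiers can be sketched as follows. Note that underscores are allowed in practice, as in the tutorial's ASAP2_vs_ASAP2; the exact accepted character set here is an assumption:

```python
import re

# Check the documented constraints on comparison_id: alphanumeric
# characters (plus underscore/hyphen, an assumption based on IDs like
# "ASAP2_vs_ASAP2"), no spaces, at most 200 characters.
def is_valid_comparison_id(comparison_id):
    return (len(comparison_id) <= 200
            and re.fullmatch(r"[A-Za-z0-9_\-]+", comparison_id) is not None)

print(is_valid_comparison_id("ASAP2_vs_ASAP2"))  # True
print(is_valid_comparison_id("has spaces"))      # False
```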
experiment_id_old
An identifier for the “baseline” experiment. This ID should be identical to the experiment_id
used when the baseline experiment was run, whether rsmtool
or rsmeval
. The results for this experiment will be listed first in the comparison report.
experiment_id_new
An identifier for the experiment with the “new” model (e.g., the model with new feature(s)). This ID should be identical to the experiment_id
used when the experiment was run, whether rsmtool
or rsmeval
. The results for this experiment will be listed second in the comparison report.
experiment_dir_old
The directory with the results for the “baseline” experiment. This directory is the output directory that was used for the experiment and should contain subdirectories output
and figure
generated by rsmtool
or rsmeval
.
experiment_dir_new
The directory with the results for the experiment with the new model. This directory is the output directory that was used for the experiment and should contain subdirectories output
and figure
generated by rsmtool
or rsmeval
.
description_old
A brief description of the “baseline” experiment. The description can contain spaces and punctuation.
description_new
A brief description of the experiment with the new model. The description can contain spaces and punctuation.
custom_sections (Optional)
A list of custom, user-defined sections to be included into the final report. These are IPython notebooks (.ipynb
files) created by the user. The list must contain paths to the notebook files, either absolute or relative to the configuration file. All custom notebooks have access to some pre-defined variables.
general_sections (Optional)
RSMTool provides pre-defined sections for rsmcompare
(listed below) and, by default, all of them are included in the report. However, you can choose a subset of these pre-defined sections by specifying a list as the value for this field.
feature_descriptives
: Compares the descriptive statistics for all raw feature values included in the model:
a table showing mean, standard deviation, skewness and kurtosis;
a table showing the number of truncated outliers for each feature;
a table with percentiles and outliers; and
a table with correlations between raw feature values and human score in each model and the correlation between the values of the same feature in these two models. Note that this table only includes features and responses which occur in both training sets.
features_by_group
: Shows boxplots for both experiments with distributions of raw feature values by each of the subgroups specified in the configuration file.
preprocessed_features
: Compares analyses of preprocessed features:
histograms showing the distributions of preprocessed features values;
the correlation matrix between all features and the human score;
a table showing marginal correlations between all features and the human score; and
a table showing partial correlations between all features and the human score.
preprocessed_features_by_group
: Compares analyses of preprocessed features by subgroups: marginal and partial correlations between each feature and human score for each subgroup.
consistency
: Compares metrics for human-human agreement, the difference (‘degradation’) between the human-human and human-system agreement, and the disattenuated correlations for the whole dataset and by each of the subgroups specified in the configuration file.
score_distributions
:
tables showing the distributions for both human and machine scores; and
confusion matrices for human and machine scores.
model
: Compares the parameters of the two regression models. For linear models, it also includes the standardized and relative coefficients.
evaluation
: Compares the standard set of evaluations recommended for scoring models on the evaluation data.
true_score_evaluation
: Compares the evaluation of system scores against the true scores estimated according to test theory. The notebook shows:
Number of single and double-scored responses.
Variance of human rater errors and estimated variance of true scores
Mean squared error (MSE) and proportional reduction in mean squared error (PRMSE) when predicting true score with system score.
pca
: Shows the results of principal components analysis on the processed feature values for the new model only:
the principal components themselves;
the variances; and
a Scree plot.
notes
: Notes explaining the terminology used in comparison reports.
sysinfo
: Shows all Python packages along with versions installed in the current environment while generating the report.
section_order (Optional)
A list containing the order in which the sections in the report should be generated. Any specified order must explicitly list:
Either all pre-defined sections if a value for the general_sections field is not specified OR the sections specified using general_sections, and
All custom section names specified using custom_sections, i.e., file prefixes only, without the path and without the .ipynb extension.
subgroups (Optional)
A list of column names indicating grouping variables used for generating analyses specific to each of those defined subgroups. For example, ["prompt", "gender", "native_language", "test_country"]
.
Note
In order to include subgroups analyses in the comparison report, both experiments must have been run with the same set of subgroups.
use_scaled_predictions_old (Optional)
Set to true
if the “baseline” experiment used scaled machine scores for confusion matrices, score distributions, subgroup analyses, etc. Defaults to false
.
use_scaled_predictions_new (Optional)
Set to true
if the experiment with the new model used scaled machine scores for confusion matrices, score distributions, subgroup analyses, etc. Defaults to false
.
Warning
For rsmtool
and rsmeval
, primary evaluation analyses are computed on both raw and scaled scores, but some analyses (e.g., the confusion matrix) are only computed for either raw or re-scaled scores based on the value of use_scaled_predictions
. rsmcompare
uses the existing outputs and does not perform any additional evaluations. Therefore if this field was set to true
in the original experiment but is set to false
for rsmcompare
, the report will be internally inconsistent: some evaluations will use raw scores whereas others will use scaled scores.
use_thumbnails (Optional)
If set to true
, the images in the HTML will be set to clickable thumbnails rather than full-sized images. Upon clicking the thumbnail, the full-sized images will be displayed in a separate tab in the browser. If set to false
, full-sized images will be displayed as usual. Defaults to false
.
use_wandb (Optional)
If set to true
, the generated reports and all intermediate tables will be logged to Weights & Biases.
The Weights & Biases entity and project name should be specified in the appropriate configuration fields.
The report will be added to a reports section in a new run under the given project.
Defaults to false
.
wandb_project (Optional)
The Weights & Biases project name if logging to Weights & Biases is enabled. If a project by this name does not already exist, it will be created.
Important
Before using Weights & Biases for the first time, users should log in and provide their API key as described in W&B Quickstart guidelines.
Note that when using W&B logging, the rsmtool run may take significantly longer due to the network traffic being sent to W&B.
wandb_entity (Optional)
The Weights & Biases entity name if logging to Weights & Biases is enabled. Entity can be a user name or the name of a team or organization.
Output
rsmcompare
produces the comparison report in HTML format as well as in the form of a Jupyter notebook (a .ipynb
file) in the output directory.
If logging to Weights & Biases is enabled,
the report will be also logged to the specified Weights & Biases project.
rsmsummarize
- Compare multiple scoring models
RSMTool provides the rsmsummarize
command-line utility to compare multiple models and to generate a comparison report. Unlike rsmcompare
which creates a detailed comparison report between two models, rsmsummarize
can be used to create a more general overview of multiple models.
rsmsummarize
can be used to compare:
Multiple rsmtool experiments, or
Multiple rsmeval experiments, or
A mix of rsmtool and rsmeval experiments (in this case, only the evaluation analyses will be compared).
Note
It is strongly recommended that the original experiments as well as the summary experiment are all done using the same version of RSMTool.
Tutorial
For this tutorial, you first need to install RSMTool and make sure the conda environment in which you installed it is activated.
Workflow
rsmsummarize
is designed to compare several existing rsmtool
or rsmeval
experiments. To use rsmsummarize
you need to:
Run two or more experiments using rsmtool or rsmeval.
Create an experiment configuration file describing the comparison experiment you would like to run.
Run that configuration file with rsmsummarize and generate the comparison experiment HTML report.
Examine the HTML report to compare the models.
Note that the above workflow does not use the customization features of rsmsummarize
, e.g., choosing which sections to include in the report or adding custom analyses sections etc. However, we will stick with this workflow for our tutorial since it is likely to be the most common use case.
ASAP Example
We are going to use the same example from the 2012 Kaggle competition on automated essay scoring that we used for the rsmtool tutorial.
Run rsmtool
and rsmeval
experiments
rsmsummarize
compares the results of two or more existing rsmtool
(or rsmeval
) experiments. For this tutorial, we will compare the model trained in the rsmtool tutorial to the evaluations we obtained in the rsmeval tutorial.
Note
If you have not already completed these tutorials, please do so now. You may need to complete them again if you deleted the output files.
Create a configuration file
The next step is to create an experiment configuration file in .json
format.
1{
2 "summary_id": "model_comparison",
3 "description": "a comparison of the results of the rsmtool sample experiment, rsmeval sample experiment and once again the rsmtool sample experiment",
4 "experiment_dirs": ["../rsmtool", "../rsmeval", "../rsmtool"],
5 "experiment_names":["RSMTool experiment 1", "RSMEval experiment", "RSMTool experiment 2"]
6}
Let’s take a look at the options in our configuration file.
Line 2: We provide the summary_id for the comparison. This will be used to generate the name of the final report.
Line 3: We give a short description of this comparison experiment. This will be shown in the report.
Line 4: We also give the list of paths to the directories containing the outputs of the experiments we want to compare.
Line 5: Since we want to compare experiments that all used the same experiment id (
ASAP2
), we instead list the names that we want to use for each experiment in the summary report.
Documentation for all of the available configuration options is available here.
Note
You can also use our nifty capability to automatically generate rsmsummarize
configuration files rather than creating them manually.
Run the experiment
Now that we have the list of the experiments we want to compare and our configuration file in .json
format, we can use the rsmsummarize command-line script to run our comparison experiment.
$ cd examples/rsmsummarize
$ rsmsummarize config_rsmsummarize.json
This should produce output like:
Output directory: /Users/nmadnani/work/rsmtool/examples/rsmsummarize
Starting report generation
Merging sections
Exporting HTML
Executing notebook with kernel: python3
Once the run finishes, you will see a new folder report
containing an HTML file named model_comparison_report.html
. This is the final rsmsummarize
summary report.
Examine the report
Our experiment report contains the overview of main aspects of model performance. It includes:
Brief description of all experiments.
Information about model parameters and model fit for all rsmtool experiments.
Model performance for all experiments.
Note
Some of the information, such as model fit and model parameters, is only available for rsmtool
experiments.
Input
rsmsummarize
requires a single argument to run an experiment: the path to a configuration file. You can specify which models you want to compare and the name of the report by supplying the path to a configuration file. It can also take an output directory as an optional second argument. If the latter is not specified, rsmsummarize
will use the current directory as the output directory.
Here are all the arguments to the rsmsummarize
command-line script.
- config_file
The JSON configuration file for this experiment.
- output_dir (optional)
The output directory where the report and intermediate
.csv
files for this comparison will be stored.
- -f, --force
If specified, the contents of the output directory will be overwritten even if it already contains the output of another rsmsummarize experiment.
- -h, --help
Show help message and exit.
- -V, --version
Show version number and exit.
Experiment configuration file
This is a file in .json
format that provides overall configuration options for an rsmsummarize
experiment. Here’s an example configuration file for rsmsummarize
.
Note
To make it easy to get started with rsmsummarize
, we provide a way to automatically generate configuration files both interactively and non-interactively. Novice users will find interactive generation more helpful, while more advanced users will prefer non-interactive generation. See this page for more details.
Next, we describe all of the rsmsummarize
configuration fields in detail. There are two required fields and the rest are all optional. We first describe the required fields and then the optional ones (sorted alphabetically).
summary_id
An identifier for the rsmsummarize
experiment. This will be used to name the report. It can be any combination of alphanumeric values, must not contain spaces, and must not be any longer than 200 characters.
experiment_dirs
The list of the directories with the results of the experiments. These directories should be the output directories used for each experiment and should contain subdirectories output
and figure
generated by rsmtool
or rsmeval
.
custom_sections (Optional)
A list of custom, user-defined sections to be included into the final report. These are IPython notebooks (.ipynb
files) created by the user. The list must contain paths to the notebook files, either absolute or relative to the configuration file. All custom notebooks have access to some pre-defined variables.
description (Optional)
A brief description of the summary. The description can contain spaces and punctuation.
experiment_names (Optional)
The list of experiment names to use in the summary report and intermediate files. The names should be listed in the same order as the experiments in experiment_dirs. When this field is not specified, the report will show the original experiment_id
for each experiment.
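Since the names are matched to the experiment directories by position, a quick sanity check that the two lists line up can be sketched as follows (configuration values copied from the tutorial):

```python
# Verify that experiment_names lines up one-to-one with
# experiment_dirs before writing the rsmsummarize configuration.
config = {
    "summary_id": "model_comparison",
    "experiment_dirs": ["../rsmtool", "../rsmeval", "../rsmtool"],
    "experiment_names": ["RSMTool experiment 1", "RSMEval experiment",
                         "RSMTool experiment 2"],
}

assert len(config["experiment_dirs"]) == len(config["experiment_names"])
pairs = dict(zip(config["experiment_names"], config["experiment_dirs"]))
print(pairs)
```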
file_format (Optional)
The format of the intermediate files generated by rsmsummarize
. Options are csv
, tsv
, or xlsx
. Defaults to csv
if this is not specified.
Note
In the rsmsummarize
context, the file_format
parameter refers to the format of the intermediate files generated by rsmsummarize
, not the intermediate files generated by the original experiment(s) being summarized. The format of these files does not have to match the format of the files generated by the original experiment(s).
general_sections (Optional)
RSMTool provides pre-defined sections for rsmsummarize
(listed below) and, by default, all of them are included in the report. However, you can choose a subset of these pre-defined sections by specifying a list as the value for this field.
preprocessed_features
: Compares marginal and partial correlations between all features and the human score, and optionally response length if this was computed for any of the models.
model
: Compares the parameters of the two regression models. For linear models, it also includes the standardized and relative coefficients.
evaluation
: Compares the standard set of evaluations recommended for scoring models on the evaluation data.
true_score_evaluation
: Compares the evaluation of system scores against the true scores estimated according to test theory. The notebook shows:
Number of single and double-scored responses.
Variance of human rater errors and estimated variance of true scores.
Mean squared error (MSE) and proportional reduction in mean squared error (PRMSE) when predicting true score with system score.
intermediate_file_paths
: Shows links to all of the intermediate files that were generated while running the summary.
sysinfo
: Shows all Python packages along with versions installed in the current environment while generating the report.
section_order (Optional)
A list containing the order in which the sections in the report should be generated. Any specified order must explicitly list:
Either all pre-defined sections if a value for the general_sections field is not specified OR the sections specified using general_sections, and
All custom section names specified using custom_sections, i.e., file prefixes only, without the path and without the .ipynb extension.
use_thumbnails (Optional)
If set to true
, the images in the HTML will be set to clickable thumbnails rather than full-sized images. Upon clicking the thumbnail, the full-sized images will be displayed in a separate tab in the browser. If set to false
, full-sized images will be displayed as usual. Defaults to false
.
use_wandb (Optional)
If set to true
, the generated report will be logged to Weights & Biases.
The Weights & Biases entity and project name should be specified in the appropriate configuration fields.
The report will be added to a reports section in a new run under the given project.
Defaults to false
.
wandb_project (Optional)
The Weights & Biases project name if logging to Weights & Biases is enabled. If a project by this name does not already exist, it will be created.
Important
Before using Weights & Biases for the first time, users should log in and provide their API key as described in W&B Quickstart guidelines.
Note that when using W&B logging, the rsmtool run may take significantly longer due to the network traffic being sent to W&B.
wandb_entity (Optional)
The Weights & Biases entity name if logging to Weights & Biases is enabled. Entity can be a user name or the name of a team or organization.
Output
rsmsummarize
produces a set of folders in the output directory. If logging to Weights & Biases is enabled,
the reports and all intermediate files are also logged to the specified Weights & Biases project.
report
This folder contains the final rsmsummarize
report in HTML format as well as in the form of a Jupyter notebook (a .ipynb
file).
output
This folder contains all of the intermediate files produced as part of the
various analyses performed, saved as .csv
files. rsmsummarize
will also save in this folder a copy of the
configuration file. Fields not specified in the original configuration file will
be pre-populated with default values.
figure
This folder contains all of the figures that may be generated as part of the various analyses performed, saved as .svg
files. Note that no figures are generated by the existing rsmsummarize
notebooks.
Intermediate files
Although the primary output of rsmsummarize is an HTML report, we also want the user to be able to conduct additional analyses outside of RSMTool. To this end, all of the tables produced in the experiment report are saved as files in the format specified by the file_format
parameter in the output
directory. The following sections describe all of the intermediate files that are produced.
Note
The names of all files begin with the summary_id
provided by the user in the experiment configuration file.
Marginal and partial correlations with score
filenames: margcor_score_all_data
, pcor_score_all_data
pcor_score_no_length_all_data
The first file contains the marginal correlations between each pre-processed feature and human score. The second file contains the partial correlation between each pre-processed feature and human score after controlling for all other features. The third file contains the partial correlations between each pre-processed feature and human score after controlling for response length, if length_column
was specified in the configuration file.
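To illustrate the difference between these correlations, here is a stdlib-only sketch using the standard first-order partial correlation formula. Note this is only an illustration with toy data and a single confound, not RSMTool's implementation (which controls for all other features at once):

```python
import math

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def partial_corr(x, y, z):
    """First-order partial correlation between x and y after
    controlling for a single confound z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Toy data: one feature, human scores, and response length as confound
feature = [2.0, 3.1, 4.2, 5.0, 6.1]
score = [1.0, 2.0, 3.0, 3.5, 4.5]
length = [120, 180, 150, 260, 240]

marginal = pearson(feature, score)              # margcor analogue
partial = partial_corr(feature, score, length)  # pcor-after-length analogue
```

A large gap between the marginal and partial values suggests that a feature's apparent relationship with the human score is partly explained by the confound.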
Model information
model_summary
This file contains the main information about the models included in the report, including:
Total number of features
Total number of features with non-negative coefficients
The learner
The label used to train the model
betas
: standardized coefficients (for built-in models only).
model_fit
: R squared and adjusted R squared computed on the training set. Note that these values are always computed on raw predictions without any trimming or rounding.
Note
If the report includes a combination of rsmtool
and rsmeval
experiments, the summary tables with model information will only include rsmtool
experiments since no model information is available for rsmeval
experiments.
Evaluation metrics
eval_short
- descriptives for predicted and human scores (mean, std. dev., etc.) and association metrics (correlation, quadratic weighted kappa, SMD, etc.) for specific score types chosen based on recommendations by Williamson (2012). Specifically, the following columns are included (the raw
or scale
version is chosen depending on the value of use_scaled_predictions
in the configuration file).
h_mean
h_sd
corr
sys_mean [raw/scale trim]
sys_sd [raw/scale trim]
SMD [raw/scale trim]
adj_agr [raw/scale trim_round]
exact_agr [raw/scale trim_round]
kappa [raw/scale trim_round]
wtkappa [raw/scale trim_round]
sys_mean [raw/scale trim_round]
sys_sd [raw/scale trim_round]
SMD [raw/scale trim_round]
R2 [raw/scale trim]
RMSE [raw/scale trim]
Note
Please note that for raw scores, SMD values are likely to be affected by possible differences in scale.
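Among the columns above, wtkappa refers to quadratically weighted kappa. As an illustration of the underlying metric (RSMTool uses its own implementation; this stdlib-only sketch just shows the standard formula on toy data):

```python
from collections import Counter

def quadratic_weighted_kappa(human, system, min_score, max_score):
    """Quadratically weighted kappa between two equal-length lists
    of integer scores on the scale [min_score, max_score]."""
    n = len(human)
    span = (max_score - min_score) ** 2
    observed = Counter(zip(human, system))      # joint counts
    h_marg, s_marg = Counter(human), Counter(system)
    num = den = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            w = (i - j) ** 2 / span
            num += w * observed[(i, j)]
            den += w * h_marg[i] * s_marg[j] / n   # chance-expected counts
    return 1.0 - num / den

# Toy example: one adjacent disagreement out of six responses
human = [1, 2, 3, 4, 4, 2]
system = [1, 2, 3, 3, 4, 2]
kappa = quadratic_weighted_kappa(human, system, 1, 4)
```

Perfect agreement yields 1.0, and the quadratic weights penalize large disagreements more heavily than adjacent ones.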
Evaluations based on test theory
true_score_eval
: evaluations of system scores against estimated true scores. Contains total counts of single and double-scored responses, variance of human rater error, estimated true score variance, and mean squared error (MSE) and proportional reduction in mean squared error (PRMSE) when predicting true score using system score.
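The PRMSE idea can be illustrated with a simplified sketch: the proportional reduction in mean squared error relative to always predicting the mean true score. Note this is a toy illustration that assumes true scores are known, whereas RSMTool estimates them (and their variance) from double-scored responses:

```python
from statistics import mean, pvariance

def prmse_sketch(true_scores, system_scores):
    """Toy PRMSE: proportional reduction in mean squared error when
    predicting (here, assumed-known) true scores with system scores,
    relative to always predicting the mean true score."""
    mse = mean((t - s) ** 2 for t, s in zip(true_scores, system_scores))
    return 1.0 - mse / pvariance(true_scores)

# Hypothetical scores; a value near 1 means system scores track
# the true scores closely
true_scores = [2.0, 3.0, 4.0, 3.5, 2.5]
system_scores = [2.1, 2.9, 3.8, 3.6, 2.4]
prmse = prmse_sketch(true_scores, system_scores)
```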
rsmxval
- Run cross-validation experiments
RSMTool provides the rsmxval
command-line utility to run cross-validation experiments with scoring models. Why would one want to use cross-validation rather than just using the simple train-and-evaluate loop provided by the rsmtool
utility? Using cross-validation can provide more accurate estimates of scoring model performance since those estimates are averaged over multiple train-test splits that are randomly selected based on the data. Using a single train-test split may lead to biased estimates of performance since those estimates will depend on the specific characteristics of that split. Using cross-validation is more likely to provide estimates of how well the scoring model will generalize to unseen test data, and more easily flag problems with overfitting and selection bias, if any.
Cross-validation experiments in RSMTool consist of the following steps:
The given training data file is first shuffled randomly (with a fixed seed for reproducibility) and then split into the requested number of folds. It is also possible for the user to provide a CSV file containing a pre-determined set of folds, e.g., from another part of the data pipeline.
For each fold (or train-test split),
rsmtool
is run to train a scoring model on the training split and evaluate it on the test split. All of the outputs for each of the
rsmtool
runs are saved on disk and represent the per-fold performance.
The predictions generated by
rsmtool
for each of the folds are all combined into a single file, which is then used as input for
rsmeval
. The output of this evaluation run is saved to disk and provides a more accurate estimate of the predictive performance of a scoring model trained on the given data.
A summary report comparing all of the folds is generated by running
rsmsummarize
on all of the fold directories created in the previous steps; its output is also saved to disk. This summary can be useful to see whether the performance of any fold stands out for any reason, which could point to a potential problem.
Finally, a scoring model is trained on the complete training data file using
rsmtool
, which also generates a report containing only the feature and model descriptives. This model is what will most likely be deployed for inference, assuming the analyses produced in this and the previous steps meet the stakeholders' requirements.
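The shuffle-and-split logic of the first step above can be sketched as follows (a simplified stand-in for rsmxval's internals; the seed value here is an arbitrary assumption):

```python
import random

def make_folds(response_ids, num_folds, seed=123456):
    """Shuffle the IDs with a fixed seed for reproducibility, then
    deal them out round-robin into num_folds roughly equal folds.
    (The seed value is an arbitrary assumption, not rsmxval's.)"""
    ids = list(response_ids)
    random.Random(seed).shuffle(ids)
    folds = [[] for _ in range(num_folds)]
    for position, response_id in enumerate(ids):
        folds[position % num_folds].append(response_id)
    return folds

folds = make_folds(range(10), 3)
```

Each fold then serves once as the evaluation split, with the remaining folds forming the training split.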
Tutorial
For this tutorial, you first need to install RSMTool and make sure the conda environment in which you installed it is activated.
Workflow
rsmxval
is designed to run cross-validation experiments using a single file containing human scores and features. Just like rsmtool
, rsmxval
does not provide any functionality for feature extraction and assumes that users will extract features on their own. The workflow steps are as follows:
Create a data file in one of the supported formats containing the extracted features for each response in the data along with human score(s) assigned to it.
Create an experiment configuration file describing the cross-validation experiment you would like to run.
Run that configuration file with rsmxval and generate its outputs.
Examine the various HTML reports to check various aspects of model performance.
Note that unlike rsmtool
and rsmeval
, rsmxval
currently does not support customization of the HTML reports generated in each step. This functionality may be added in future versions.
ASAP Example
We are going to use the same example from the 2012 Kaggle competition on automated essay scoring that we used for the rsmtool tutorial.
Extract features
We are using the same features for this data as described in the rsmtool tutorial.
Create a configuration file
The next step is to create an experiment configuration file in .json
format.
1{
2 "experiment_id": "ASAP2_xval",
3 "description": "Cross-validation with two human scores using a LinearRegression model.",
4 "train_file": "train.csv",
5 "folds": 3,
6 "train_label_column": "score",
7 "id_column": "ID",
8 "model": "LinearRegression",
9 "trim_min": 1,
10 "trim_max": 6,
11 "second_human_score_column": "score2",
12 "use_scaled_predictions": true
13}
Let’s take a look at the options in our configuration file.
Line 2: We define an experiment ID used to identify the files produced as part of this experiment.
Line 3: We provide a description which will be included in the various reports.
Line 4: We list the path to our training file with the feature values and human scores. For this tutorial, we used
.csv
format, but several other input file formats are also supported.
Line 5: This field indicates the number of cross-validation folds we want to use. If this field is not specified,
rsmxval
uses 5-fold cross-validation by default.
Line 6: This field indicates that the human (reference) scores in our
.csv
file are located in a column named
score
.
Line 7: This field indicates that the unique IDs for the responses in the
.csv
file are located in a column named
ID
.
Line 8: We choose to use a linear regression model to combine the feature values into a score.
Lines 9-10: These fields indicate that the lowest score on the scoring scale is a 1 and the highest score is a 6. This information is usually part of the rubric used by human graders.
Line 11: This field indicates that scores from a second set of human graders are also available (useful for comparing the agreement between human-machine scores to the agreement between two sets of humans) and are located in the
score2
column in the training
.csv
file.
Line 12: Next, we indicate that we would like to use the scaled scores for all our evaluation analyses at each step.
Documentation for all of the available configuration options is available here.
Note
You can also use our nifty capability to automatically generate rsmxval
configuration files rather than creating them manually.
Run the experiment
Now that we have our input file and our configuration file, we can use the rsmxval command-line script to run our evaluation experiment.
$ cd examples/rsmxval
$ rsmxval config_rsmxval.json output
This should produce output like:
Output directory: output
Saving configuration file.
Generating 3 folds after shuffling
Running RSMTool on each fold in parallel
Progress: 100%|███████████████████████████████████████████████| 3/3 [00:08<00:00, 2.76s/it]
Creating fold summary
Evaluating combined fold predictions
Training model on full data
Once the run finishes, you will see an output
sub-directory in the current directory. Under this directory you will see multiple sub-directories, each corresponding to a different cross-validation step, as described here.
Examine the reports
The cross-validation experiment produces multiple HTML reports – an rsmtool
report for each of the 3 folds (output/folds/{01,02,03}/report/ASAP2_xval_fold{01,02,03}.html
), the evaluation report for the cross-validated predictions (output/evaluation/report/ASAP2_xval_evaluation_report.html
), a report summarizing the salient characteristics of the 3 folds (output/fold-summary/report/ASAP2_xval_fold_summary_report.html
), and a report showing the feature and model descriptives (output/final-model/report/ASAP2_xval_model_report.html
). Examining these reports will provide a relatively complete picture of how well the predictive performance of the scoring model will generalize to unseen data.
Input
rsmxval
requires a single argument to run an experiment: the path to a configuration file. It can also take an output directory as an optional second argument. If the latter is not specified, rsmxval
will use the current directory as the output directory.
Here are all the arguments to the rsmxval
command-line script.
- config_file
The JSON configuration file for this cross-validation experiment.
- output_dir (optional)
The output directory where all the sub-directories and files for this cross-validation experiment will be stored. If a non-empty directory with the same name already exists, an error will be raised.
- -h, --help
Show help message and exit.
- -V, --version
Show version number and exit.
Experiment configuration file
This is a file in .json
format that provides overall configuration options for an rsmxval
experiment. Here’s an example configuration file for rsmxval
.
Note
To make it easy to get started with rsmxval
, we provide a way to automatically generate configuration files both interactively as well as non-interactively. Novice users will find interactive generation more helpful while more advanced users will prefer non-interactive generation. See this page for more details.
Configuration files for rsmxval
are almost identical to rsmtool
configuration files with only a few differences. Next, we describe the three required rsmxval
configuration fields in detail.
experiment_id
An identifier for the experiment that will be used as part of the names of the reports and intermediate files produced in each of the steps. It can be any combination of alphanumeric values, must not contain spaces, and must not be any longer than 200 characters. Suffixes are added to this experiment ID by each of the steps for the reports and files they produce, i.e., _fold<N>
in the per-fold rsmtool
step where <N>
is a two digit number, _evaluation
by the rsmeval
evaluation step, _fold_summary
by the rsmsummarize
step, and _model
by the final full-data rsmtool
step.
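Given this suffix scheme, the report basenames produced by a cross-validation run can be predicted from the experiment ID; here is a small illustrative sketch (the helper function is hypothetical, not part of RSMTool):

```python
def report_basenames(experiment_id, num_folds):
    """Basenames of the reports produced by each rsmxval step,
    following the suffix scheme described above."""
    names = [f"{experiment_id}_fold{n:02d}" for n in range(1, num_folds + 1)]
    names.append(f"{experiment_id}_evaluation")
    names.append(f"{experiment_id}_fold_summary")
    names.append(f"{experiment_id}_model")
    return names

names = report_basenames("ASAP2_xval", 3)
```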
model
The machine learner you want to use to build the scoring model. Possible values include built-in linear regression models as well as all of the learners available via SKLL. With SKLL learners, you can customize the tuning objective and also compute expected scores as predictions.
train_file
The path to the training data feature file in one of the supported formats. Each row should correspond to a single response and contain numeric feature values extracted for this response. In addition, there should be a column with a unique identifier (ID) for each response and a column with the human score for each response. The path can be absolute or relative to the config file's location.
Important
Unlike rsmtool
, rsmxval
does not accept an evaluation set and will raise an error if the test_file
field is specified.
Next, we will describe the optional fields that are unique to rsmxval
.
folds (Optional)
The number of folds to use for cross-validation. This should be an integer and defaults to 5.
folds_file (Optional)
The path to a file containing custom, pre-specified folds to be used for cross-validation. This should be a .csv
file (no other formats are accepted) and should contain only two columns: id
and fold
. The id
column should contain the same IDs of the responses that are contained in train_file
above. The fold
column should contain an integer representing which fold the response with the id
belongs to. IDs not specified in this file will be skipped and not included in the cross-validation at all. Just like train_file
, this path can be absolute or relative to the config file’s location. Here’s an example of a folds file containing 2 folds.
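A folds file with the required id and fold columns can also be generated programmatically; the sketch below builds a 2-fold file in memory with hypothetical response IDs:

```python
import csv
import io

# Assign hypothetical response IDs round-robin to 2 folds
rows = [("RESP_%03d" % i, i % 2 + 1) for i in range(1, 7)]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "fold"])   # exactly the two required columns
writer.writerows(rows)
folds_csv = buf.getvalue()        # write this out as a .csv file
```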
Note
If both folds_file
and folds
are specified, then the former will take precedence unless it contains a non-existent path.
use_wandb (Optional)
If set to true
, the generated reports and all intermediate tables will be logged to Weights & Biases.
The Weights & Biases entity and project name should be specified in the appropriate configuration fields.
A new run will be created in the specified project and it will include different sections for the different steps of the
experiment:
The rsmxval section will include the combined predictions file generated by the folds.
The rsmeval section will include the tables and plots from the evaluation run on the predictions file.
The rsmtool section will include the output of the final
rsmtool
run that creates the final model.
All the reports generated in the process will be logged to the reports section.
In addition, some evaluation metrics will be logged to the run’s history with a name representing the context in which each score is calculated (see more details here).
Note that the output of the individual rsmtool runs for each fold are not logged to W&B.
Defaults to false
.
In addition to the fields described so far, an rsmxval
configuration file also accepts the following optional fields used by rsmtool
:
candidate_column
description
exclude_zero_scores
feature_subset
feature_subset_file
features
file_format
flag_column
flag_column_test
id_column
length_column
min_items_per_candidate
min_n_per_group
predict_expected_scores
rater_error_variance
second_human_score_column
select_transformations
sign
skll_fixed_parameters
skll_objective
standardize_features
subgroups
train_label_column
trim_max
trim_min
trim_tolerance
truncate_outliers
use_scaled_predictions
use_thumbnails
use_truncation_thresholds
skll_grid_search_jobs
use_wandb
wandb_entity
wandb_project
Please refer to these fields’ descriptions on the page describing the rsmtool configuration file.
Output
rsmxval
produces a set of folders in the output directory. If logging to Weights & Biases is enabled,
the reports generated in this run are also logged to the specified Weights & Biases project.
folds
This folder contains the output of each of the per-fold rsmtool
experiments. It contains as many sub-folders as the number of specified folds, named 01
, 02
, 03
, etc. Each of these numbered sub-folders contains the output of one rsmtool
experiment conducted using the training split of that fold as the training data and the test split as the evaluation data. Each of the sub-folders contains the output directories produced by rsmtool. The report for each fold lives in the report
sub-directory, e.g., the report for the first fold is found at folds/01/report/<experiment_id>_fold01_report.html
, and so on. The messages that are usually printed out by rsmtool
to the screen are instead logged to a file and saved to disk as, e.g., folds/01/rsmtool.log
.
evaluation
This folder contains the output of the rsmeval
evaluation experiment that uses the cross-validated predictions from each fold. This folder contains the output directories produced by rsmeval. The evaluation report can be found at evaluation/report/<experiment_id>_evaluation_report.html
. The messages that are usually printed out by rsmeval
to the screen are instead logged to a file and saved to disk as evaluation/rsmeval.log
.
fold-summary
This folder contains the output of the rsmsummarize
experiment that provides a quick summary of all of the folds in a single, easily-scanned report. The folder contains the output directories produced by rsmsummarize. The summary report can be found at fold-summary/report/<experiment_id>_fold_summary_report.html
. The messages that are usually printed out by rsmsummarize
to the screen are instead logged to a file and saved to disk as fold-summary/rsmsummarize.log
.
final-model
This folder contains the output of the rsmtool
experiment that trains a model on the full training data and provides a report showing the feature and model descriptives. It contains the output directories produced by rsmtool. The primary artifacts of this experiment are the report (final-model/report/<experiment_id>_model_report.html
) and the final trained model (final-model/output/<experiment_id>_model.model
). The messages that are usually printed out by rsmtool
to the screen are instead logged to a file and saved to disk as final-model/rsmtool.log
.
Note
Every rsmtool
experiment requires both a training and an evaluation set. However, in this step, we are using the full training data to train the model and rsmxval
does not use a separate test set. Therefore, we simply randomly sample 10% of the full training data as a dummy test set to make sure that rsmtool
runs successfully. The report in this step only contains the model and feature descriptives and, therefore, does not use this dummy test set at all. Users should ignore any intermediate files under the final-model/output
and final-model/figure
sub-directories that are derived from this dummy test set. If needed, the data used as the dummy test set can be found at final-model/dummy_test.csv
(or in the chosen format).
In addition to these folders, rsmxval
will also save a copy of the configuration file in the output directory at the same-level as the above folders. Fields not specified in the original configuration file will be pre-populated with default values.
rsmexplain
- Explain non-linear models
RSMTool provides the rsmexplain
command-line utility to generate a report explaining the predictions made by a model trained using the rsmtool
utility. These explanations contain useful information about the contribution of each feature to the final score, even if the model is non-linear or black-box in nature. The rsmexplain command-line utility uses the SHAP library to compute the explanations.
Note
rsmexplain
uses the sampling explainer which is model agnostic and should, in principle, work for any type of model. However, rsmexplain
currently only supports regressors since they are the most popular model type used for automated scoring.
Tutorial
For this tutorial, you first need to install RSMTool and make sure the conda environment in which you installed it is activated.
Workflow
rsmexplain
is designed to explain the predictions from a model trained as part of an existing rsmtool
experiment. The steps to do this are as follows:
Successfully run an rsmtool experiment so that the model we would like to explain is trained and available.
Create an experiment configuration file describing the explanation experiment you would like to run.
Run the created configuration file with rsmexplain to generate the explanation HTML report.
Examine the HTML report to see the explanations for the
rsmtool
model on the selected responses.
ASAP Example
We are going to use the same example from the 2012 Kaggle competition on automated essay scoring that we used for the rsmtool tutorial.
Run rsmtool
experiments with chosen model
rsmexplain
requires an existing rsmtool
experiment with a trained model. For this tutorial, we will explain the model trained as part of the rsmtool tutorial.
Note
If you have not already completed the rsmtool
tutorial, please do so now. You may need to complete it again if you deleted the output files.
Create a configuration file
The next step is to create an experiment configuration file for rsmexplain
in .json
format.
1{
2 "description": "Explaining a linear regression model trained on all features.",
3 "experiment_dir": "../rsmtool",
4 "experiment_id": "ASAP2",
5 "background_data": "../rsmtool/train.csv",
6 "explain_data": "../rsmtool/test.csv",
7 "id_column": "ID",
8 "sample_size": 1,
9 "num_features_to_display": 15
10}
Let’s take a look at the options in our configuration file.
Line 2: We give a short description of this experiment. This will be shown in the report.
Line 3: We give the path to the directory containing the output of the original rsmtool experiment. Note that this is the top-level directory that contains the
output
directory produced by
rsmtool
.
Line 4: We provide the
experiment_id
of the rsmtool experiment used to train the model. This can usually be read off the
output/<experiment_id>.model
file in the rsmtool experiment output directory.
Line 5: We provide the path to the data file that will be used as the background distribution.
Line 6: We provide the path to the data file that will be used to generate the explanations.
Line 7: This field indicates that the unique IDs for the responses in both data files are located in a column named
ID
.
Line 8: This field indicates that we wish to explain one randomly chosen example from the second data file. If we wished to explain a specific example from that file, we would use the sample_ids option instead.
Line 9: This field indicates the number of top features that should be displayed in the plots in the
rsmexplain
report.
Documentation for all of the available configuration options is available here.
Note
You can also use our nifty capability to automatically generate rsmexplain
configuration files rather than creating them manually.
Run explanation experiment
Now that we have the rsmtool
experiment, the data files, and our configuration file, we can use the rsmexplain command-line script to run our explanation experiment.
$ cd examples/rsmexplain
$ rsmexplain config_rsmexplain.json
This should produce output like:
Output directory: /Users/nmadnani/work/rsmtool/examples/rsmexplain
Saving configuration file.
WARNING: The following extraneous features will be ignored: {'LENGTH', 'score2', 'score'}
Pre-processing input features
WARNING: The following extraneous features will be ignored: {'LENGTH', 'score2', 'score'}
Pre-processing input features
Generating SHAP explanations for 1 examples from ../rsmtool/test.csv
100%|███████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 21.99it/s]
Merging sections
Exporting HTML
Success
Once the run finishes, you will see the output
, figure
, and report
sub-directories in the current directory. Each of these directories contains useful information, but we are specifically interested in the report/ASAP2_explain_report.html
file, which is the final explanation report.
Examine the report
Our experiment report contains all the information we would need to explain the trained model. It includes:
The various absolute value variants of the SHAP values.
Several SHAP plots indicating how different features contribute to the predicted score. Since we chose to explain a single example in this tutorial, the following plots will be displayed in the report: global bar plot, beeswarm plot, decision plot, and waterfall plot.
Note
We encourage you to re-run the tutorial by modifying the configuration file to explain multiple examples instead of a single one. You can do so either by setting sample_size to a value larger than 1, by explicitly specifying multiple example indices via sample_ids, or by setting sample_range to an appropriate range of example indices. For a multiple-example explanation run, the following plots will be displayed in the report: global bar plot, beeswarm plot, and heatmap plots.
Input
rsmexplain
requires only one argument to generate the explanation report: the path to a configuration file.
Here are all the arguments to the rsmexplain
command-line script.
- config_file
The JSON configuration file for this experiment.
- output_dir
The output directory where all the files for this experiment will be stored.
- -f, --force
If specified, the contents of the output directory will be overwritten even if it already contains the output of another rsmexplain experiment.
- -h, --help
Show help message and exit.
- -V, --version
Show version number and exit.
Experiment configuration file
This is a file in .json
format that provides overall configuration options for an rsmexplain
experiment. Here’s an example configuration file for rsmexplain
.
Note
To make it easy to get started with rsmexplain
, we provide a way to automatically generate configuration files both interactively as well as non-interactively. Novice users will find interactive generation more helpful while more advanced users will prefer non-interactive generation. See this page for more details.
Next, we describe all of the rsmexplain
configuration fields in detail. There are four required fields and the rest are all optional. We first describe the required fields and then the optional ones (sorted alphabetically).
background_data
The path to the background data feature file in one of the supported formats. Each row should correspond to a single response and contain the numeric feature values extracted for this response. In addition, there should be a column containing a unique identifier (ID) for each response. This path can be absolute or relative to the location of the config file. It must contain at least 300 responses to ensure meaningful explanations.
explain_data
The path to the file containing the data that we want to explain. The file should be in one of the supported formats. Each row should correspond to a single response and contain numeric feature values extracted for this response. In addition, there should be a column containing a unique identifier (ID) for each response. The path can be absolute or relative to the location of the config file.
experiment_id
An identifier for the rsmexplain
experiment. This will be used to name the report. It can be any combination of alphanumeric values, must not contain spaces, and must not be any longer than 200 characters.
experiment_dir
The directory containing the rsmtool model that we want to explain. This directory should contain an output sub-directory, and that sub-directory should contain two files: <experiment_id>.model and <experiment_id>_feature.csv. Note that <experiment_id> refers to the one defined in this same configuration file. As an example of this directory structure, take a look at the existing_experiment directory here.
background_kmeans_size (Optional)
The size of the k-means sample for background sampling. Defaults to 500. We summarize the dataset specified in background_data with this many k-means clusters (each cluster is weighted by the number of data points it represents) and then use the summarized data set for sampling instead of the original. The k-means clustering allows us to speed up the explanation process but may sacrifice some accuracy. The default value of 500 has been shown to provide a good balance between speed and accuracy in our experiments. You may use a higher value if you have a very large or very diverse background dataset and you want to ensure that it’s accurately summarized.
Warning
background_kmeans_size must be smaller than the size of the original background data. If it is not, you may see errors like this: ValueError: n_samples=500 should be >= n_clusters=750.
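This constraint can be checked before launching a long-running experiment; here is a minimal sketch (the helper name is ours, and the error message simply mirrors the one scikit-learn produces):

```python
def check_kmeans_size(num_background_rows: int, background_kmeans_size: int = 500) -> None:
    """Raise an error if the k-means sample size exceeds the background data size."""
    if num_background_rows < background_kmeans_size:
        raise ValueError(
            f"n_samples={num_background_rows} should be >= "
            f"n_clusters={background_kmeans_size}"
        )

# asking for 500 clusters from only 300 background rows is invalid
try:
    check_kmeans_size(300, 500)
except ValueError as exc:
    print(exc)  # n_samples=300 should be >= n_clusters=500
```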
custom_sections (Optional)
A list of custom, user-defined sections to be included in the final report. These are IPython notebooks (.ipynb files) created by the user. The list must contain paths to the notebook files, either absolute or relative to the configuration file. All custom notebooks have access to some pre-defined variables.
description (Optional)
A brief description of the rsmexplain experiment that will be shown at the top of the report. The description can contain spaces and punctuation.
general_sections (Optional)
RSMTool provides pre-defined sections for rsmexplain (listed below) and, by default, all of them are included in the report. However, you can choose a subset of these pre-defined sections by specifying a list as the value for this field.
data_description: Shows the number and identity of the responses being explained.
shap_values: Shows different types of SHAP values for the features.
shap_plots: Shows various SHAP explanation plots for the features.
sysinfo: Shows all Python packages, along with their versions, installed in the environment used to generate the report.
id_column (Optional)
The name of the column containing the response IDs. Defaults to spkitemid, i.e., if this field is not specified, rsmexplain will look for a column called spkitemid in both the background_data and explain_data files. Note that the name of the id_column must be the same in these two files.
num_features_to_display (Optional)
The number of top features to display in rsmexplain plots. Defaults to 15.
sample_ids (Optional)
If we want to explain a specific set of responses from the explain_data file, we can specify their IDs here as a comma-separated string. Note that the IDs must be values from the id_column. For example, if explain_data has IDs of the form "EXAMPLE_1", "EXAMPLE_2", etc., and we want to explain the fifth, tenth, and twelfth responses, the value of this field must be set to "EXAMPLE_5, EXAMPLE_10, EXAMPLE_12". Defaults to None.
sample_range (Optional)
If we want to explain a specific range of responses from the explain_data file, we can specify that range here. Note that the range is specified in terms of the location of the responses in the explain_data file and that the locations are zero-indexed. So, for example, to explain only the first 50 responses in the file, we should set a value of "0-49" for this option. Defaults to None.
sample_size (Optional)
If we want to explain a random sample of the responses in explain_data, we can specify the size of that random sample here. For example, to explain a random sample of 10 responses, we would set this field to 10. Defaults to None.
Note
At most one of sample_ids, sample_range, or sample_size may be specified. If none of them is specified, explanations will be generated for the entire set of responses in explain_data, which could be very slow, depending on its size.
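The three sampling options can be illustrated with a short sketch of how each one selects responses from a zero-indexed list of IDs (the variable names and parsing logic here are ours, not rsmexplain internals):

```python
import random

# hypothetical explain_data IDs: EXAMPLE_1 ... EXAMPLE_100
all_ids = [f"EXAMPLE_{i}" for i in range(1, 101)]

# sample_ids: explicit comma-separated IDs from the id_column
sample_ids = "EXAMPLE_5, EXAMPLE_10, EXAMPLE_12"
chosen_by_id = [s.strip() for s in sample_ids.split(",")]

# sample_range: zero-indexed positions, inclusive on both ends
sample_range = "0-49"
start, end = (int(x) for x in sample_range.split("-"))
chosen_by_range = all_ids[start:end + 1]

# sample_size: a random sample of the given size
random.seed(12345)  # for a reproducible illustration
chosen_by_size = random.sample(all_ids, 10)

print(len(chosen_by_range))  # 50
```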
section_order (Optional)
A list containing the order in which the sections in the report should be generated. Any specified order must explicitly list:
Either all pre-defined sections (if a value for the general_sections field is not specified) or the sections specified using general_sections, and
All custom section names specified using custom_sections, i.e., file prefixes only, without the path and without the .ipynb extension.
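For example, keeping all four pre-defined sections and adding a custom notebook, a valid ordering might look like this sketch (the notebook name and path are hypothetical):

```json
{
    "general_sections": ["data_description", "shap_values", "shap_plots", "sysinfo"],
    "custom_sections": ["notebooks/fairness_checks.ipynb"],
    "section_order": ["data_description", "fairness_checks", "shap_values", "shap_plots", "sysinfo"]
}
```

Note that the custom section appears in section_order as its file prefix only ("fairness_checks"), without the path or the .ipynb extension.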
show_auto_cohorts (Optional)
If this option is set to true, auto cohort bar plots will be displayed. These plots can be useful for detecting interaction effects between cohorts and features: if a cohort shows a high feature value, there may be an interaction between that cohort and the feature. Defaults to false. These plots are not shown by default because they may be unstable or misleading if explain_data is not large enough. For smaller datasets, SHAP may not be able to detect strong feature interactions and compute clear cohorts; if that happens, the plots will be too specific to be useful. If you have a large enough dataset, you can set this option to true and see whether the plots are useful.
Important
By default, the auto cohort bar plots are treated as a custom section and added at the end of the report, after the system information section. The section_order option can be used to move this section to a different place in the report. Use "auto_cohorts" as the name for this section when specifying an order.
standardize_features (Optional)
If this option is set to false, the feature values for the responses in background_data and explain_data will not be standardized using the mean and standard deviation parameters from the rsmtool experiment. These parameters are expected to be part of the feature information contained in <experiment_dir>/output/<experiment_id>_feature.csv. Defaults to true.
Important
If experiment_dir contains the rsmtool configuration file, that file's value for standardize_features will override the value specified by the user. The reason is that if rsmtool trained the model with (or without) standardized features, then rsmexplain must do the same for the explanations to be meaningful.
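Conceptually, standardization converts each raw feature value to standard units using the mean and standard deviation saved from the rsmtool training run; a minimal sketch with made-up statistics (the function name and numbers are ours):

```python
def standardize(value: float, train_mean: float, train_sd: float) -> float:
    """Convert a raw feature value to standard units using training-set statistics."""
    return (value - train_mean) / train_sd

# hypothetical per-feature statistics of the kind stored in <experiment_id>_feature.csv
print(standardize(7.5, train_mean=5.0, train_sd=2.5))  # 1.0
```

Because the same transformation must be applied at training and explanation time, rsmexplain reads these parameters from the rsmtool experiment rather than recomputing them on the new data.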
truncate_outliers (Optional)
If this option is set to false, outliers (values more than 4 standard deviations away from the mean) in feature columns will not be truncated. Defaults to true.
use_wandb (Optional)
If set to true, the generated report will be logged to Weights & Biases. The Weights & Biases entity and project name should be specified in the appropriate configuration fields. The report will be added to a reports section in a new run under the given project. Defaults to false.
wandb_project (Optional)
The Weights & Biases project name if logging to Weights & Biases is enabled. If a project by this name does not already exist, it will be created.
Important
Before using Weights & Biases for the first time, users should log in and provide their API key as described in the W&B Quickstart guide.
Note that when using W&B logging, the rsmexplain run may take significantly longer due to the network traffic being sent to W&B.
wandb_entity (Optional)
The Weights & Biases entity name if logging to Weights & Biases is enabled. Entity can be a user name or the name of a team or organization.
Output
rsmexplain produces a set of folders in the output directory. If logging to Weights & Biases is enabled, the report is also logged to the specified Weights & Biases project.
report
This folder contains the final explanation report in HTML format as well as in the form of a Jupyter notebook (a .ipynb file).
output
This folder contains various SHAP values and their absolute-value variants. rsmexplain also saves a copy of the configuration file in this folder; fields not specified in the original configuration file will be pre-populated with default values. The SHAP explanation object is saved as <experiment_id>_explanation.pkl, and a mapping between the position of each explained response in the data file and its unique ID is saved in <experiment_id>_ids.pkl.
figure
This folder contains all of the figures with the various SHAP plots, saved as .svg files.