Advanced Uses of RSMTool¶
In addition to providing the rsmtool
utility for training and evaluating regression-based scoring models, the RSMTool package also provides four other command-line utilities for more advanced users.
rsmeval
- Evaluate external predictions¶
RSMTool provides the rsmeval
command-line utility to evaluate existing predictions and generate a report with all the built-in analyses. This can be useful in scenarios where the user wants to use more sophisticated machine learning algorithms not available in RSMTool to build the scoring model but still wants to be able to evaluate that model’s predictions using the standard analyses.
For example, say a researcher has an existing automated scoring engine for grading short responses that extracts the features and computes the predicted score. This engine uses a large number of binary, sparse features. She cannot use rsmtool
to train her model since it requires numeric features. So, she uses scikit-learn to train her model.
Once the model is trained, the researcher wants to evaluate her engine’s performance using the analyses recommended by the educational measurement community as well as conduct additional investigations for specific subgroups of test-takers. However, these kinds of analyses are not available in scikit-learn
. She can use rsmeval
to set up a customized report using a combination of existing and custom sections and quickly produce the evaluation that is useful to her.
Tutorial¶
For this tutorial, you first need to install RSMTool and make sure the conda environment in which you installed it is activated.
Workflow¶
rsmeval
is designed for evaluating existing machine scores. Once you have the scores computed for all the responses in your data, the next steps are fairly straightforward:
- Create a data file in one of the supported formats containing the computed system scores and the human scores you want to compare against.
- Create an experiment configuration file describing the evaluation experiment you would like to run.
- Run that configuration file with rsmeval and generate the experiment HTML report as well as the intermediate CSV files.
- Examine the HTML report to check various aspects of model performance.
Note that the above workflow does not use any customization features, e.g., choosing which sections to include in the report or adding custom analysis sections, etc. However, we will stick with this workflow for our tutorial since it is likely to be the most common use case.
ASAP Example¶
We are going to use the same example from the 2012 Kaggle competition on automated essay scoring that we used for the rsmtool tutorial.
Generate scores¶
rsmeval
is designed for researchers who have developed their own scoring engine for generating scores and would like to produce an evaluation report for those scores. For this tutorial, we will use the scores we generated for the ASAP2 evaluation set in the rsmtool tutorial.
Create a configuration file¶
The next step is to create an experiment configuration file in .json
format.
 1  {
 2      "experiment_id": "ASAP2_evaluation",
 3      "description": "Evaluation of the scores generated using rsmtool.",
 4      "predictions_file": "ASAP2_scores.csv",
 5      "system_score_column": "system",
 6      "human_score_column": "human",
 7      "id_column": "ID",
 8      "trim_min": 1,
 9      "trim_max": 6,
10      "second_human_score_column": "human2",
11      "scale_with": "asis"
12  }
Let’s take a look at the options in our configuration file.
- Line 2: We define an experiment ID.
- Line 3: We also provide a description which will be included in the experiment report.
- Line 4: We list the path to the file with the predicted and human scores. For this tutorial we used .csv format, but RSMTool also supports several other input file formats.
- Line 5: This field indicates that the system scores in our .csv file are located in a column named system.
- Line 6: This field indicates that the human (reference) scores in our .csv file are located in a column named human.
- Line 7: This field indicates that the unique IDs for the responses in the .csv file are located in a column named ID.
- Lines 8-9: These fields indicate that the lowest score on the scoring scale is a 1 and the highest score is a 6. This information is usually part of the rubric used by human graders.
- Line 10: This field indicates that scores from a second set of human graders are also available (useful for comparing the agreement between human-machine scores to the agreement between two sets of humans) and are located in the human2 column in the .csv file.
- Line 11: This field indicates that the provided machine scores are already re-scaled to match the distribution of human scores. rsmeval itself will not perform any scaling and the report will refer to these as scaled scores.
Documentation for all of the available configuration options is available here.
Note
You can also use our nifty capability to automatically generate rsmeval
configuration files rather than creating them manually.
Run the experiment¶
Now that we have our scores in the right format and our configuration file in .json
format, we can use the rsmeval command-line script to run our evaluation experiment.
$ cd examples/rsmeval
$ rsmeval config_rsmeval.json
This should produce output like:
Output directory: /Users/nmadnani/work/rsmtool/examples/rsmeval
Assuming given system predictions are already scaled and will be used as such.
predictions: /Users/nmadnani/work/rsmtool/examples/rsmeval/ASAP2_scores.csv
Processing predictions
Saving pre-processed predictions and the metadata to disk
Running analyses on predictions
Starting report generation
Merging sections
Exporting HTML
Executing notebook with kernel: python3
Once the run finishes, you will see the output, figure, and report sub-directories in the current directory. Each of these directories contains useful information, but we are specifically interested in the report/ASAP2_evaluation_report.html file, which is the final evaluation report.
Examine the report¶
Our experiment report contains all the information we would need to evaluate the provided system scores against the human scores. It includes:
- The distributions for the human versus the system scores.
- Several different metrics indicating how well the machine’s scores agree with the humans’.
- Information about human-human agreement and the difference between human-human and human-system agreement.
… and much more.
Input¶
rsmeval
requires a single argument to run an experiment: the path to a configuration file. It can also take an output directory as an optional second argument. If the latter is not specified, rsmeval
will use the current directory as the output directory.
Here are all the arguments to the rsmeval
command-line script.
-
config_file
¶
The JSON configuration file for this experiment.
-
output_dir
(optional)
¶ The output directory where all the files for this experiment will be stored.
-
-f
,
--force
¶
If specified, the contents of the output directory will be overwritten even if it already contains the output of another rsmeval experiment.
-
-h
,
--help
¶
Show help message and exit.
-
-V
,
--version
¶
Show version number and exit.
Experiment configuration file¶
This is a file in .json
format that provides overall configuration options for an rsmeval
experiment. Here’s an example configuration file for rsmeval
.
Note
To make it easy to get started with rsmeval
, we provide a way to automatically generate configuration files both interactively as well as non-interactively. Novice users will find interactive generation more helpful while more advanced users will prefer non-interactive generation. See this page for more details.
Next, we describe all of the rsmeval
configuration fields in detail. There are five required fields and the rest are all optional. We first describe the required fields and then the optional ones (sorted alphabetically).
experiment_id¶
An identifier for the experiment that will be used to name the report and all intermediate files. It can be any combination of alphanumeric values, must not contain spaces, and must not be any longer than 200 characters.
predictions_file¶
The path to the file with predictions to evaluate. The file should be in one of the supported formats. Each row should correspond to a single response and contain the predicted and observed scores for this response. In addition, there should be a column with a unique identifier (ID) for each response. The path can be absolute or relative to the location of the configuration file.
system_score_column¶
The name for the column containing the scores predicted by the system. These scores will be used for evaluation.
trim_min¶
The single numeric value for the lowest possible integer score that the machine should predict. This value will be used to compute the floor value for trimmed (bound) machine scores as trim_min
- trim_tolerance
.
trim_max¶
The single numeric value for the highest possible integer score that the machine should predict. This value will be used to compute the ceiling value for trimmed (bound) machine scores as trim_max
+ trim_tolerance
.
Note
Although the trim_min
and trim_max
fields are optional for rsmtool
, they are required for rsmeval
.
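For example, the configuration snippet below (a minimal sketch; the values are taken from the tutorial above) sets the scoring scale to 1-6:

{
    "trim_min": 1,
    "trim_max": 6
}

With the default trim_tolerance of 0.4998 (described below), the trimmed machine scores would then be bounded by a floor of 1 - 0.4998 = 0.5002 and a ceiling of 6 + 0.4998 = 6.4998.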
candidate_column (Optional)¶
The name for an optional column in the prediction file containing unique candidate IDs. Candidate IDs are different from response IDs since the same candidate (test-taker) might have responded to multiple questions.
custom_sections (Optional)¶
A list of custom, user-defined sections to be included into the final report. These are IPython notebooks (.ipynb
files) created by the user. The list must contain paths to the notebook files, either absolute or relative to the configuration file. All custom notebooks have access to some pre-defined variables.
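For example, a hedged sketch of this field with two hypothetical notebook paths (these filenames are purely illustrative) might look like:

{
    "custom_sections": ["notebooks/error_analysis.ipynb", "../shared/length_bias_checks.ipynb"]
}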
description (Optional)¶
A brief description of the experiment. This will be included in the report. The description can contain spaces and punctuation. It’s blank by default.
exclude_zero_scores (Optional)¶
By default, responses with human scores of 0 will be excluded from evaluations. Set this field to false
if you want to keep responses with scores of 0. Defaults to true
.
file_format (Optional)¶
The format of the intermediate files. Options are csv
, tsv
, or xlsx
. Defaults to csv
if this is not specified.
flag_column (Optional)¶
This field makes it possible to only use responses with particular values in a given column (e.g. only responses with a value of 0
in a column called ADVISORY
). The field takes a dictionary in Python format where the keys are the names of the columns and the values are lists of values for responses that will be evaluated. For example, a value of {"ADVISORY": 0}
will mean that rsmeval
will only use responses for which the ADVISORY
column has the value 0. Defaults to None
.
Note
If several conditions are specified (e.g., {"ADVISORY": 0, "ERROR": 0}
) only those responses which satisfy all the conditions will be selected for further analysis (in this example, these will be the responses where the ADVISORY
column has a value of 0 and the ERROR
column has a value of 0).
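As an illustrative sketch (the column names and values here are hypothetical), a configuration that filters on two advisory columns, using both the single-value and the list forms described above, might contain:

{
    "flag_column": {"ADVISORY": 0, "ERROR": [0, 1]}
}

With these settings, only responses whose ADVISORY value is 0 and whose ERROR value is either 0 or 1 would be used for the evaluation.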
Note
When reading the values in the supplied dictionary, rsmeval
treats numeric strings, floats and integers as the same value. Thus 1
, 1.0
, "1"
and "1.0"
are all treated as 1.0.
general_sections (Optional)¶
RSMTool provides pre-defined sections for rsmeval
(listed below) and, by default, all of them are included in the report. However, you can choose a subset of these pre-defined sections by specifying a list as the value for this field.
data_description
: Shows the total number of responses, along with any responses that have been excluded due to non-numeric/zero scores or flag columns.
data_description_by_group
: Shows the total number of responses for each of the subgroups specified in the configuration file. This section only covers the responses used to evaluate the model.
consistency
: Shows metrics for human-human agreement, the difference (“degradation”) between the human-human and human-system agreement, and the disattenuated human-machine correlations. This notebook is only generated if the configuration file specifies second_human_score_column.
evaluation
: Shows the standard set of evaluations recommended for scoring models on the evaluation data:
- a table showing human-system association metrics;
- the confusion matrix; and
- a barplot showing the distributions for both human and machine scores.
evaluation_by_group
: Shows barplots with the main evaluation metrics by each of the subgroups specified in the configuration file.
fairness_analyses
: Additional fairness analyses suggested in Loukina, Madnani, & Zechner, 2019. The notebook shows:
- percentage of variance in squared error explained by subgroup membership
- percentage of variance in raw (signed) error explained by subgroup membership
- percentage of variance in raw (signed) error explained by subgroup membership when controlling for human score
- plots showing estimates for each subgroup for each model
true_score_evaluation
: Evaluation of system scores against the true scores estimated according to test theory. The notebook shows:
- Number of single and double-scored responses.
- Variance of human rater errors and estimated variance of true scores
- Mean squared error (MSE) and proportional reduction in mean squared error (PRMSE) when predicting true score with system score.
intermediate_file_paths
: Shows links to all of the intermediate files that were generated while running the evaluation.
sysinfo
: Shows all Python packages along with versions installed in the current environment while generating the report.
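For example, a minimal sketch of a configuration that restricts the report to just a few of the sections listed above might look like:

{
    "general_sections": ["data_description", "evaluation", "sysinfo"]
}

Only the listed sections (plus any custom or special sections) would then appear in the final report.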
human_score_column (Optional)¶
The name for the column containing the human scores for each response. The values in this column will be used as observed scores. Defaults to sc1
.
Note
All responses with non-numeric values or zeros in either human_score_column
or system_score_column
will be automatically excluded from evaluation. You can use exclude_zero_scores (Optional) to keep responses with zero scores.
id_column (Optional)¶
The name of the column containing the response IDs. Defaults to spkitemid
, i.e., if this is not specified, rsmeval
will look for a column called spkitemid
in the prediction file.
min_items_per_candidate (Optional)¶
An integer value for the minimum number of responses expected from each candidate. If any candidates have fewer responses than the specified value, all responses from those candidates will be excluded from further analysis. Defaults to None
.
min_n_per_group (Optional)¶
A single numeric value or a dictionary with keys as the group names listed in the subgroups field and values as the thresholds for the groups. When specified, only groups with at least this number of instances will be displayed in the tables and plots contained in the report. Note that this parameter only affects the HTML report and the figures. For all analyses – including the computation of the population parameters – data from all groups will be used. In addition, the intermediate files will still show the results for all groups.
Note
If you supply a dictionary, it must contain a key for every subgroup listed in subgroups field. If no threshold is to be applied for some of the groups, set the threshold value for this group to 0 in the dictionary.
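For example, a hedged sketch using the dictionary form of this field (the subgroup names and thresholds are hypothetical) might look like:

{
    "subgroups": ["gender", "native_language"],
    "min_n_per_group": {"gender": 10, "native_language": 0}
}

With these settings, gender groups with fewer than 10 responses would be omitted from the report tables and plots, while no threshold would be applied to the native_language groups.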
rater_error_variance (Optional)¶
True score evaluations require an estimate of rater error variance. By default, rsmeval
will compute this variance from double-scored responses in the data. However, in some cases, one may wish to compute the variance on a different sample of responses. In such cases, this field can be used to set the rater error variance to a precomputed value which is then used as-is by rsmeval
. You can use the rsmtool.utils.variance_of_errors function to compute rater error variance outside the main evaluation pipeline.
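For example, a minimal sketch with a hypothetical precomputed value (e.g., one obtained separately via the rsmtool.utils.variance_of_errors function) might look like:

{
    "rater_error_variance": 0.35
}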
scale_with (Optional)¶
In many scoring applications, system scores are re-scaled so that their mean and standard deviation match those of the human scores for the training data.
If you want rsmeval
to re-scale the supplied predictions, you need to provide – as the value for this field – the path to a second file in one of the supported formats containing the human scores and predictions of the same system on its training data. This file must have two columns: the human scores under the sc1
column and the predicted score under the prediction
column.
This field can also be set to "asis"
if the scores are already scaled. In this case, no additional scaling will be performed by rsmeval
but the report will refer to the scores as “scaled”.
Defaults to "raw"
which means that no re-scaling is performed and the report refers to the scores as “raw”.
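For example, a hedged sketch of the two possible uses of this field (the file name is hypothetical) might look like:

{
    "scale_with": "train_human_and_system_scores.csv"
}

or, if the supplied predictions are already scaled:

{
    "scale_with": "asis"
}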
second_human_score_column (Optional)¶
The name for an optional column in the test data containing a second human score for each response. If specified, additional information about human-human agreement and degradation will be computed and included in the report. Note that this column must contain either numbers or be empty. Non-numeric values are not accepted. Note also that the exclude_zero_scores (Optional) option below will apply to this column too.
Note
You do not need to have second human scores for all responses to use this option. The human-human agreement statistics will be computed as long as there is at least one response with a numeric value in this column. For responses that do not have a second human score, the value in this column should be blank.
section_order (Optional)¶
A list containing the order in which the sections in the report should be generated. Any specified order must explicitly list:
- Either all pre-defined sections if a value for the general_sections field is not specified OR the sections specified using general_sections, and
- All custom section names specified using custom_sections, i.e., file prefixes only, without the path and without the .ipynb extension, and
- All special sections specified using special_sections.
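For example, a hedged sketch combining these fields (the custom notebook name is hypothetical) might look like:

{
    "general_sections": ["data_description", "evaluation", "sysinfo"],
    "custom_sections": ["notebooks/error_analysis.ipynb"],
    "section_order": ["data_description", "error_analysis", "evaluation", "sysinfo"]
}

Note that the custom section is referred to by its file prefix only, without the path or the .ipynb extension.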
special_sections (Optional)¶
A list specifying special ETS-only sections to be included into the final report. These sections are available only to ETS employees via the rsmextra
package.
subgroups (Optional)¶
A list of column names indicating grouping variables used for generating analyses specific to each of those defined subgroups. For example, ["prompt", "gender", "native_language", "test_country"]
. These subgroup columns need to be present in the input predictions file. If subgroups are specified, rsmeval
will generate:
- tables and barplots showing human-system agreement for each subgroup on the evaluation set.
trim_tolerance (Optional)¶
The single numeric value that will be used to pad the trimming range specified in trim_min
and trim_max
. This value will be used to compute the ceiling and floor values for trimmed (bound) machine scores as trim_max + trim_tolerance for the ceiling value and trim_min - trim_tolerance for the floor value.
Defaults to 0.4998.
Note
For more fine-grained control over the trimming range, you can set trim_tolerance
to 0 and use trim_min
and trim_max
to specify the exact floor and ceiling values.
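For example, a minimal sketch that pins the floor and ceiling to exact values (the values themselves are hypothetical) might look like:

{
    "trim_min": 0.5,
    "trim_max": 6.5,
    "trim_tolerance": 0
}

Here the trimmed machine scores would be bounded exactly by 0.5 and 6.5.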
use_thumbnails (Optional)¶
If set to true
, the images in the HTML will be set to clickable thumbnails rather than full-sized images. Upon clicking the thumbnail, the full-sized images will be displayed in a separate tab in the browser. If set to false
, full-sized images will be displayed as usual. Defaults to false
.
Output¶
rsmeval
produces a set of folders in the output directory.
report¶
This folder contains the final RSMEval report in HTML format as well as in the form of a Jupyter notebook (a .ipynb
file).
output¶
This folder contains all of the intermediate files produced as part of the various analyses performed, saved as .csv
files. rsmeval
will also save in this folder a copy of the configuration file. Fields not specified in the original configuration file will be pre-populated with default values.
figure¶
This folder contains all of the figures generated as part of the various analyses performed, saved as .svg
files.
Intermediate files¶
Although the primary output of rsmeval
is an HTML report, we also want the user to be able to conduct additional analyses outside of rsmeval
. To this end, all of the tables produced in the experiment report are saved as files in the format specified by the file_format
parameter in the output
directory. The following sections describe all of the intermediate files that are produced.
Note
The names of all files begin with the experiment_id
provided by the user in the experiment configuration file. In addition, the names for certain columns are set to default values in these files irrespective of what they were named in the original data files. This is because RSMEval standardizes these column names internally for convenience. These values are:
- spkitemid for the column containing response IDs.
- sc1 for the column containing the human scores used as observed scores.
- sc2 for the column containing the second human scores, if this column was specified in the configuration file.
- candidate for the column containing candidate IDs, if this column was specified in the configuration file.
Predictions¶
filename: pred_processed
This file contains the post-processed predicted scores: the predictions from the model are truncated, rounded, and re-scaled (if requested).
Flagged responses¶
filename: test_responses_with_excluded_flags
This file contains all of the rows in the input predictions file that were filtered out based on conditions specified in flag_column.
Note
If the predictions file contained columns with internal names such as sc1
that were not actually used by rsmeval
, they will still be included in these files but their names will be changed to ##name##
(e.g. ##sc1##
).
Excluded responses¶
filename: test_excluded_responses
This file contains all of the rows in the predictions file that were filtered out because of non-numeric or zero scores.
Response metadata¶
filename: test_metadata
This file contains the metadata columns (id_column
, subgroups
if provided) for all rows in the predictions file that were used in the evaluation.
Unused columns¶
filename: test_other_columns
This file contains all of the columns from the input predictions file that are not present in the *_pred_processed
and *_metadata
files. They only include the rows that were not filtered out.
Note
If the predictions file contained columns with internal names such as sc1
but these columns were not actually used by rsmeval
, these columns will also be included into these files but their names will be changed to ##name##
(e.g. ##sc1##
).
Human scores¶
filename: test_human_scores
This file contains the human scores, if available in the input predictions file, under a column called sc1
with the response IDs under the spkitemid
column.
If second_human_score_column
was specified, then it also contains the values in the predictions file from that column under a column called sc2
. Only the rows that were not filtered out are included.
Note
If exclude_zero_scores
was set to true
(the default value), all zero scores in the second_human_score_column
will be replaced by nan
.
Data composition¶
filename: data_composition
This file contains the total number of responses in the input predictions file. If applicable, the table will also include the number of different subgroups.
Excluded data composition¶
filename: test_excluded_composition
This file contains the composition of the set of excluded responses, i.e., why responses were excluded and how many responses fall under each exclusion category.
Subgroup composition¶
filename: data_composition_by_<SUBGROUP>
There will be one such file for each of the specified subgroups and it contains the total number of responses in that subgroup.
Evaluation metrics¶
- eval: This file contains the descriptives for predicted and human scores (mean, std. dev. etc.) as well as the association metrics (correlation, quadratic weighted kappa, SMD etc.) for the raw as well as the post-processed scores.
- eval_by_<SUBGROUP>: the same information as in *_eval.csv computed separately for each subgroup. However, rather than SMD, a difference of standardized means (DSM) will be calculated using z-scores.
- eval_short: a shortened version of eval that contains specific descriptives for predicted and human scores (mean, std. dev. etc.) and association metrics (correlation, quadratic weighted kappa, SMD etc.) for specific score types chosen based on recommendations by Williamson (2012). Specifically, the following columns are included (the raw or scale version is chosen depending on the value of use_scaled_predictions in the configuration file).
  - h_mean
  - h_sd
  - corr
  - sys_mean [raw/scale trim]
  - sys_sd [raw/scale trim]
  - SMD [raw/scale trim]
  - adj_agr [raw/scale trim_round]
  - exact_agr [raw/scale trim_round]
  - kappa [raw/scale trim_round]
  - wtkappa [raw/scale trim]
  - sys_mean [raw/scale trim_round]
  - sys_sd [raw/scale trim_round]
  - SMD [raw/scale trim_round]
  - R2 [raw/scale trim]
  - RMSE [raw/scale trim]
- score_dist: the distributions of the human scores and the rounded raw/scaled predicted scores, depending on the value of use_scaled_predictions.
- confMatrix: the confusion matrix between the human scores and the rounded raw/scaled predicted scores, depending on the value of use_scaled_predictions.
Note
Please note that for raw scores, SMD values are likely to be affected by possible differences in scale.
Human-human Consistency¶
These files are created only if a second human score has been made available via the second_human_score_column
option in the configuration file.
- consistency: contains descriptives for both human raters as well as the agreement metrics between their ratings.
- consistency_by_<SUBGROUP>: contains the same metrics as in the consistency file computed separately for each group. However, rather than SMD, a difference of standardized means (DSM) will be calculated using z-scores.
- degradation: shows the differences between human-human agreement and machine-human agreement for all association metrics and all forms of predicted scores.
Evaluations based on test theory¶
- disattenuated_correlations: shows the correlation between human-machine scores, human-human scores, and the disattenuated human-machine correlation computed as human-machine correlation divided by the square root of human-human correlation.
- disattenuated_correlations_by_<SUBGROUP>: contains the same metrics as in the disattenuated_correlations file computed separately for each group.
- true_score_eval: evaluations of system scores against estimated true score. Contains total counts of single and double-scored responses, variance of human rater error, estimated true score variance, and mean squared error (MSE) and proportional reduction in mean squared error (PRMSE) when predicting true score using system score.
Additional fairness analyses¶
These files contain the results of additional fairness analyses suggested in Loukina, Madnani, & Zechner, 2019.
- <METRICS>_by_<SUBGROUP>.ols: a serialized object of type pandas.stats.ols.OLS containing the fitted model for estimating the variance attributed to a given subgroup membership for a given metric. The subgroups are defined by the configuration file. The metrics are osa (overall score accuracy), osd (overall score difference), and csd (conditional score difference).
- <METRICS>_by_<SUBGROUP>_ols_summary.txt: a text file containing a summary of the above model.
- estimates_<METRICS>_by_<SUBGROUP>: coefficients, confidence intervals and p-values estimated by the model for each subgroup.
- fairness_metrics_by_<SUBGROUP>: the \(R^2\) (percentage of variance) and p-values for all models.
rsmpredict
- Generate new predictions¶
RSMTool provides the rsmpredict
command-line utility to generate predictions for new data using a model already trained using the rsmtool
utility. This can be useful when processing a new set of responses to the same task without needing to retrain the model.
rsmpredict
pre-processes the feature values according to user specifications before using them to generate the predicted scores. The generated scores are post-processed in the same manner as they are in rsmtool
output.
Note
No score is generated for responses with non-numeric values for any of the features included in the model.
If the original model specified transformations for some of the features and these transformations led to NaN
or Inf
values when applied to the new data, rsmpredict
will raise a warning. No score will be generated for such responses.
Tutorial¶
For this tutorial, you first need to install RSMTool and make sure the conda environment in which you installed it is activated.
Workflow¶
Important
Although this tutorial provides feature values for the purpose of illustration, rsmpredict
does not include any functionality for feature extraction; the tool is designed for researchers who use their own NLP/Speech processing pipeline to extract features for their data.
rsmpredict
allows you to generate the scores for new data using an existing model trained using RSMTool. Therefore, before starting this tutorial, you first need to complete the rsmtool tutorial, which will produce a trained RSMTool model. You will also need to process the new data to extract the same features as the ones used in the model.
Once you have the features for the new data and the RSMTool model, using rsmpredict
is fairly straightforward:
Create a file containing the features for the new data. The file should be in one of the supported formats.
Create an experiment configuration file describing the experiment you would like to run.
Run that configuration file with rsmpredict to generate the predicted scores.
Note
You do not need human scores to run
rsmpredict
since it does not produce any evaluation analyses. If you do have human scores for the new data and you would like to evaluate the system on this new data, you can first runrsmpredict
to generate the predictions and then runrsmeval
on the output ofrsmpredict
to generate an evaluation report.
ASAP Example¶
We are going to use the same example from the 2012 Kaggle competition on automated essay scoring that we used for the rsmtool tutorial. Specifically, we are going to use the linear regression model we trained in that tutorial to generate scores for new data.
Note
If you have not already completed that tutorial, please do so now. You may need to complete it again if you deleted the output files.
Extract features¶
We will first need to generate features for the new set of responses for which we want to predict scores. For this experiment, we will simply re-use the test set from the rsmtool
tutorial.
Note
The features used with rsmpredict
should be generated using the same NLP/Speech processing pipeline that generated the features used in the rsmtool
modeling experiment.
Create a configuration file¶
The next step is to create an rsmpredict experiment configuration file in .json
format.
1  {
2      "experiment_dir": "../rsmtool",
3      "experiment_id": "ASAP2",
4      "input_features_file": "../rsmtool/test.csv",
5      "id_column": "ID",
6      "human_score_column": "score",
7      "second_human_score_column": "score2"
8  }
Let’s take a look at the options in our configuration file.
- Line 2: We give the path to the directory containing the output of the rsmtool experiment.
- Line 3: We provide the experiment_id of the rsmtool experiment used to train the model. This can usually be read off the output/<experiment_id>.model file in the rsmtool experiment output directory.
- Line 4: We list the path to the data file with the feature values for the new data. For this tutorial we used .csv format, but RSMTool also supports several other input file formats.
- Line 5: This field indicates that the unique IDs for the responses in the .csv file are located in a column named ID.
- Lines 6-7: These fields indicate that there are two sets of human scores in our .csv file, located in the columns named score and score2. The values from these columns will be added to the output file containing the predictions, which can be useful if we want to evaluate the predictions using rsmeval.
Documentation for all of the available configuration options is available here.
Note
You can also use our nifty capability to automatically generate rsmpredict
configuration files rather than creating them manually.
Run the experiment¶
Now that we have the model, the features in the right format, and our configuration file in .json
format, we can use the rsmpredict command-line script to generate the predictions and to save them in predictions.csv
.
$ cd examples/rsmpredict
$ rsmpredict config_rsmpredict.json predictions.csv
This should produce output like:
WARNING: The following extraneous features will be ignored: {'spkitemid', 'sc1', 'sc2', 'LENGTH'}
Pre-processing input features
Generating predictions
Rescaling predictions
Trimming and rounding predictions
Saving predictions to /Users/nmadnani/work/rsmtool/examples/rsmpredict/predictions.csv
You should now see a file named predictions.csv
in the current directory which contains the predicted scores for the new data in the predictions
column.
Input¶
rsmpredict
requires two arguments to generate predictions: the path to a configuration file and the path to the output file where the generated predictions are saved in .csv
format.
If you also want to save the pre-processed feature values, rsmpredict can take a third optional argument --features
to specify the path to a .csv
file to save these values.
Here are all the arguments to the rsmpredict
command-line script.
-
config_file
¶
The JSON configuration file for this experiment.
-
output_file
¶
The output
.csv
file where predictions will be saved.
-
--features
<preproc_feats_file>
¶ If specified, the pre-processed values for the input features will also be saved in this
.csv
file.
-
-h
,
--help
¶
Show help message and exit.
-
-V
,
--version
¶
Show version number and exit.
Experiment configuration file¶
This is a file in .json
format that provides overall configuration options for an rsmpredict
experiment. Here’s an example configuration file for rsmpredict
.
Note
To make it easy to get started with rsmpredict
, we provide a way to automatically generate configuration files both interactively as well as non-interactively. Novice users will find interactive generation more helpful while more advanced users will prefer non-interactive generation. See this page for more details.
Next, we describe all of the rsmpredict
configuration fields in detail. There are three required fields and the rest are all optional. We first describe the required fields and then the optional ones (sorted alphabetically).
experiment_dir¶
The path to the directory containing rsmtool
model to use for generating predictions. This directory must contain a sub-directory called output
with the model files, feature pre-processing parameters, and score post-processing parameters. The path can be absolute or relative to the location of the configuration file.
experiment_id¶
The experiment_id
used to create the rsmtool
model files being used for generating predictions. If you do not know the experiment_id
, you can find it by looking at the prefix of the .model
file under the output
directory.
input_features_file¶
The path to the file with the raw feature values that will be used for generating predictions. The file should be in one of the supported formats. Each row should correspond to a single response and contain feature values for this response. In addition, there should be a column with a unique identifier (ID) for each response. The path can be absolute or relative to the location of the configuration file. Note that the feature names must be the same as used in the original rsmtool
experiment.
Note
rsmpredict
will only generate predictions for responses in this file that have numeric values for the features included in the rsmtool
model.
See also
rsmpredict
does not require human scores for the new data since it does not evaluate the generated predictions. If you do have the human scores and want to evaluate the new predictions, you can use the rsmeval command-line utility.
candidate_column (Optional)¶
The name for the column containing unique candidate IDs. This column will be named candidate
in the output file with predictions.
file_format (Optional)¶
The format of the intermediate files. Options are csv
, tsv
, or xlsx
. Defaults to csv
if this is not specified.
flag_column (Optional)¶
See description in the rsmtool configuration file for further information. No filtering will be done by rsmpredict
, but the contents of all specified columns will be added to the predictions file using the original column names.
human_score_column (Optional)¶
The name for the column containing human scores. This column will be renamed to sc1
.
id_column (Optional)¶
The name of the column containing the response IDs. Defaults to spkitemid
, i.e., if this is not specified, rsmpredict
will look for a column called spkitemid
in the prediction file.
There are several other options in the configuration file that, while not directly used by rsmpredict
, can simply be passed through from the input features file to the output predictions file. This can be particularly useful if you want to subsequently run rsmeval to evaluate the generated predictions.
predict_expected_scores (Optional)¶
If the original model was a probabilistic SKLL classifier, then expected scores (probability-weighted averages over the contiguous numeric score points) can be generated as the machine predictions instead of the most likely score point, which would be the default. Set this field to true
to compute expected scores as predictions. Defaults to false
.
Note
- If the model in the original
rsmtool
experiment is an SVC, that original experiment must have been run withpredict_expected_scores
set totrue
. This is because SVC classifiers are fit differently if probabilistic output is desired, in contrast to other probabilistic SKLL classifiers. - You may see slight differences in expected score predictions if you run the experiment on different machines or on different operating systems most likely due to very small probability values for certain score points which can affect floating point computations.
second_human_score_column (Optional)¶
The name for the column containing the second human score. This column will be renamed to sc2
.
standardize_features (Optional)¶
If this option is set to false
, features will not be standardized by subtracting the mean and dividing by the standard deviation. Defaults to true
.
subgroups (Optional)¶
A list of column names indicating grouping variables used for generating analyses specific to each of those defined subgroups. For example, ["prompt", "gender", "native_language", "test_country"]. All these columns will be included in the predictions file with the original names.
Output¶
rsmpredict
produces a .csv
file with predictions for all responses in the new data set, and, optionally, a .csv file with pre-processed feature values. If any of the responses had non-numeric feature values in the original data or after applying transformations, these are saved in a file named PREDICTIONS_NAME_excluded_responses.csv
where PREDICTIONS_NAME
is the name of the predictions file supplied by the user without the extension.
The predictions .csv
file contains the following columns:
- spkitemid: the unique response IDs from the original feature file.
- sc1 and sc2: the human scores for each response from the original feature file (human_score_column and second_human_score_column, respectively).
- raw: raw predictions generated by the model.
- raw_trim, raw_trim_round, scale, scale_trim, scale_trim_round: raw scores post-processed in different ways.
rsmcompare
- Create a detailed comparison of two scoring models¶
RSMTool provides the rsmcompare
command-line utility to compare two models and to generate a detailed comparison report including differences between the two models. This can be useful in many scenarios, e.g., say the user wants to compare the changes in model performance after adding a new feature into the model. To use rsmcompare
, the user must first run two experiments using either rsmtool or rsmeval. rsmcompare
can then be used to compare the outputs of these two experiments to each other.
Note
Currently rsmcompare
takes the outputs of the analyses generated during the original experiments and creates comparison tables. These comparison tables were designed with a specific comparison scenario in mind: comparing a baseline model with a model which includes new feature(s). The tool can certainly be used for other comparison scenarios if the researcher feels that the generated comparison output is appropriate.
rsmcompare
can be used to compare:
- Two
rsmtool
experiments, or - Two
rsmeval
experiments, or - An
rsmtool
experiment with anrsmeval
experiment (in this case, only the evaluation analyses will be compared).
Note
It is strongly recommended that the original experiments as well as the comparison experiment are all done using the same version of RSMTool.
Tutorial¶
For this tutorial, you first need to install RSMTool and make sure the conda environment in which you installed it is activated.
Workflow¶
rsmcompare
is designed to compare two existing rsmtool
or rsmeval
experiments. To use rsmcompare
you need:
- Two experiments that were run using rsmtool or rsmeval.
- Create an experiment configuration file describing the comparison experiment you would like to run.
- Run that configuration file with rsmcompare and generate the comparison experiment HTML report.
- Examine the HTML report to compare the two models.
Note that the above workflow does not use the customization features of rsmcompare
, e.g., choosing which sections to include in the report or adding custom analysis sections, etc. However, we will stick with this workflow for our tutorial since it is likely to be the most common use case.
ASAP Example¶
We are going to use the same example from the 2012 Kaggle competition on automated essay scoring that we used for the rsmtool tutorial.
Run rsmtool
(or rsmeval
) experiments¶
rsmcompare
compares the results of the two existing rsmtool
(or rsmeval
) experiments. For this tutorial, we will compare the model trained in the rsmtool tutorial to itself.
Note
If you have not already completed that tutorial, please do so now. You may need to complete it again if you deleted the output files.
Create a configuration file¶
The next step is to create an experiment configuration file in .json
format.
 1  {
 2      "comparison_id": "ASAP2_vs_ASAP2",
 3      "experiment_id_old": "ASAP2",
 4      "experiment_dir_old": "../rsmtool/",
 5      "description_old": "RSMTool experiment.",
 6      "use_scaled_predictions_old": true,
 7      "experiment_id_new": "ASAP2",
 8      "experiment_dir_new": "../rsmtool",
 9      "description_new": "RSMTool experiment (copy).",
10      "use_scaled_predictions_new": true
11  }
Let’s take a look at the options in our configuration file.
- Line 2: We provide an ID for the comparison experiment.
- Line 3: We provide the
experiment_id
for the experiment we want to use as a baseline. - Line 4: We also give the path to the directory containing the output of the original baseline experiment.
- Line 5: We give a short description of this baseline experiment. This will be shown in the report.
- Line 6: This field indicates that the baseline experiment used scaled scores for some evaluation analyses.
- Line 7: We provide the
experiment_id
for the new experiment. We use the same experiment ID for both experiments since we are comparing the experiment to itself. - Line 8: We also give the path to the directory containing the output of the new experiment. As above, we use the same path because we are comparing the experiment to itself.
- Line 9: We give a short description of the new experiment. This will also be shown in the report.
- Line 10: This field indicates that the new experiment also used scaled scores for some evaluation analyses.
Documentation for all of the available configuration options is available here.
Note
You can also use our nifty capability to automatically generate rsmcompare
configuration files rather than creating them manually.
Run the experiment¶
Now that we have the two experiments we want to compare and our configuration file in .json
format, we can use the rsmcompare command-line script to run our comparison experiment.
$ cd examples/rsmcompare
$ rsmcompare config_rsmcompare.json
This should produce output like:
Output directory: /Users/nmadnani/work/rsmtool/examples/rsmcompare
Starting report generation
Merging sections
Exporting HTML
Executing notebook with kernel: python3
Once the run finishes, you will see an HTML file named ASAP2_vs_ASAP2_report.html
. This is the final rsmcompare
comparison report.
Examine the report¶
Our experiment report contains all the information we would need to compare the new model to the baseline model. It includes:
- Comparison of feature distributions between the two experiments.
- Comparison of model coefficients between the two experiments.
- Comparison of model performance between the two experiments.
Note
Since we are comparing the experiment to itself, the comparison is not very interesting, e.g., the differences between various values will always be 0.
Input¶
rsmcompare
requires a single argument to run an experiment: the path to a configuration file. It can also take an output directory as an optional second argument. If the latter is not specified, rsmcompare
will use the current directory as the output directory.
Here are all the arguments to the rsmcompare
command-line script.
-
config_file
¶
The JSON configuration file for this experiment.
-
output_dir
(optional)
¶ The output directory where the report files for this comparison will be stored.
-
-h
,
--help
¶
Show help message and exit.
-
-V
,
--version
¶
Show version number and exit.
Experiment configuration file¶
This is a file in .json
format that provides overall configuration options for an rsmcompare
experiment. Here’s an example configuration file for rsmcompare
.
Note
To make it easy to get started with rsmcompare
, we provide a way to automatically generate configuration files both interactively as well as non-interactively. Novice users will find interactive generation more helpful while more advanced users will prefer non-interactive generation. See this page for more details.
Next, we describe all of the rsmcompare
configuration fields in detail. There are seven required fields and the rest are all optional. We first describe the required fields and then the optional ones (sorted alphabetically).
comparison_id¶
An identifier for the comparison experiment that will be used to name the report. It can be any combination of alphanumeric values, must not contain spaces, and must not be any longer than 200 characters.
experiment_id_old¶
An identifier for the “baseline” experiment. This ID should be identical to the experiment_id
used when the baseline experiment was run, whether rsmtool
or rsmeval
. The results for this experiment will be listed first in the comparison report.
experiment_id_new¶
An identifier for the experiment with the “new” model (e.g., the model with new feature(s)). This ID should be identical to the experiment_id
used when the experiment was run, whether rsmtool
or rsmeval
. The results for this experiment will be listed second in the comparison report.
experiment_dir_old¶
The directory with the results for the “baseline” experiment. This directory is the output directory that was used for the experiment and should contain subdirectories output
and figure
generated by rsmtool
or rsmeval
.
experiment_dir_new¶
The directory with the results for the experiment with the new model. This directory is the output directory that was used for the experiment and should contain subdirectories output
and figure
generated by rsmtool
or rsmeval
.
description_old¶
A brief description of the “baseline” experiment. The description can contain spaces and punctuation.
description_new¶
A brief description of the experiment with the new model. The description can contain spaces and punctuation.
custom_sections (Optional)¶
A list of custom, user-defined sections to be included into the final report. These are IPython notebooks (.ipynb
files) created by the user. The list must contain paths to the notebook files, either absolute or relative to the configuration file. All custom notebooks have access to some pre-defined variables.
general_sections (Optional)¶
RSMTool provides pre-defined sections for rsmcompare
(listed below) and, by default, all of them are included in the report. However, you can choose a subset of these pre-defined sections by specifying a list as the value for this field.
feature_descriptives
: Compares the descriptive statistics for all raw feature values included in the model:
- a table showing mean, standard deviation, skewness and kurtosis;
- a table showing the number of truncated outliers for each feature; and
- a table with percentiles and outliers;
- a table with correlations between raw feature values and human score in each model and the correlation between the values of the same feature in these two models. Note that this table only includes features and responses which occur in both training sets.
features_by_group
: Shows boxplots for both experiments with distributions of raw feature values by each of the subgroups specified in the configuration file.
preprocessed_features
: Compares analyses of preprocessed features:
- histograms showing the distributions of preprocessed features values;
- the correlation matrix between all features and the human score;
- a table showing marginal correlations between all features and the human score; and
- a table showing partial correlations between all features and the human score.
preprocessed_features_by_group
: Compares analyses of preprocessed features by subgroups: marginal and partial correlations between each feature and human score for each subgroup.
consistency
: Compares metrics for human-human agreement, the difference (‘degradation’) between the human-human and human-system agreement, and the disattenuated correlations for the whole dataset and by each of the subgroups specified in the configuration file.
score_distributions
:
- tables showing the distributions for both human and machine scores; and
- confusion matrices for human and machine scores.
model
: Compares the parameters of the two regression models. For linear models, it also includes the standardized and relative coefficients.
evaluation
: Compares the standard set of evaluations recommended for scoring models on the evaluation data.
true_score_evaluation
: compares the evaluation of system scores against the true scores estimated according to test theory. The notebook shows:
- Number of single and double-scored responses.
- Variance of human rater errors and estimated variance of true scores
- Mean squared error (MSE) and proportional reduction in mean squared error (PRMSE) when predicting true score with system score.
pca
: Shows the results of principal components analysis on the processed feature values for the new model only:
- the principal components themselves;
- the variances; and
- a Scree plot.
notes
: Notes explaining the terminology used in comparison reports.
sysinfo
: Shows all Python packages along with versions installed in the current environment while generating the report.
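For example, a minimal sketch of a configuration that limits the comparison report to a few of the sections listed above might look like:

{
    "general_sections": ["feature_descriptives", "model", "evaluation"]
}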
section_order (Optional)¶
A list containing the order in which the sections in the report should be generated. Any specified order must explicitly list:
- Either all pre-defined sections if a value for the general_sections field is not specified OR the sections specified using general_sections, and
- All custom section names specified using custom_sections, i.e., file prefixes only, without the path and without the .ipynb extension, and
- All special sections specified using special_sections.
special_sections (Optional)¶
A list specifying special ETS-only comparison sections to be included into the final report. These sections are available only to ETS employees via the rsmextra package.
subgroups (Optional)¶
A list of column names indicating grouping variables used for generating analyses specific to each of those defined subgroups. For example, ["prompt", "gender", "native_language", "test_country"]
.
Note
In order to include subgroups analyses in the comparison report, both experiments must have been run with the same set of subgroups.
use_scaled_predictions_old (Optional)¶
Set to true
if the “baseline” experiment used scaled machine scores for confusion matrices, score distributions, subgroup analyses, etc. Defaults to false
.
use_scaled_predictions_new (Optional)¶
Set to true
if the experiment with the new model used scaled machine scores for confusion matrices, score distributions, subgroup analyses, etc. Defaults to false
.
Warning
For rsmtool
and rsmeval
, primary evaluation analyses are computed on both raw and scaled scores, but some analyses (e.g., the confusion matrix) are only computed for either raw or re-scaled scores based on the value of use_scaled_predictions
. rsmcompare
uses the existing outputs and does not perform any additional evaluations. Therefore if this field was set to true
in the original experiment but is set to false
for rsmcompare
, the report will be internally inconsistent: some evaluations use raw scores whereas others will use scaled scores.
use_thumbnails (Optional)¶
If set to true
, the images in the HTML will be set to clickable thumbnails rather than full-sized images. Upon clicking the thumbnail, the full-sized images will be displayed in a separate tab in the browser. If set to false
, full-sized images will be displayed as usual. Defaults to false
.
Output¶
rsmcompare
produces the comparison report in HTML format as well as in the form of a Jupyter notebook (a .ipynb
file) in the output directory.
rsmsummarize
- Compare multiple scoring models¶
RSMTool provides the rsmsummarize
command-line utility to compare multiple models and to generate a comparison report. Unlike rsmcompare
which creates a detailed comparison report between the two models, rsmsummarize
can be used to create a more general overview of multiple models.
rsmsummarize
can be used to compare:
- Multiple
rsmtool
experiments, or - Multiple
rsmeval
experiments, or - A mix of
rsmtool
andrsmeval
experiments (in this case, only the evaluation analyses will be compared).
Note
It is strongly recommended that the original experiments as well as the summary experiment are all done using the same version of RSMTool.
Tutorial¶
For this tutorial, you first need to install RSMTool and make sure the conda environment in which you installed it is activated.
Workflow¶
rsmsummarize
is designed to compare several existing rsmtool
or rsmeval
experiments. To use rsmsummarize
you need:
- Two or more experiments that were run using rsmtool or rsmeval.
- Create an experiment configuration file describing the comparison experiment you would like to run.
- Run that configuration file with rsmsummarize and generate the comparison experiment HTML report.
- Examine the HTML report to compare the models.
Note that the above workflow does not use the customization features of rsmsummarize
, e.g., choosing which sections to include in the report or adding custom analysis sections, etc. However, we will stick with this workflow for our tutorial since it is likely to be the most common use case.
ASAP Example¶
We are going to use the same example from the 2012 Kaggle competition on automated essay scoring that we used for the rsmtool tutorial.
Run rsmtool
and rsmeval
experiments¶
rsmsummarize
compares the results of two or more existing rsmtool
(or rsmeval
) experiments. For this tutorial, we will compare the model trained in the rsmtool tutorial to the evaluations we obtained in the rsmeval tutorial.
Note
If you have not already completed these tutorials, please do so now. You may need to complete them again if you deleted the output files.
Create a configuration file¶
The next step is to create an experiment configuration file in .json
format.
1  {
2      "summary_id": "model_comparison",
3      "description": "a comparison of the results of the rsmtool sample experiment, rsmeval sample experiment and once again the rsmtool sample experiment",
4      "experiment_dirs": ["../rsmtool", "../rsmeval", "../rsmtool"],
5      "experiment_names": ["RSMTool experiment 1", "RSMEval experiment", "RSMTool experiment 2"]
6  }
Let’s take a look at the options in our configuration file.
- Line 2: We provide the summary_id for the comparison. This will be used to generate the name of the final report.
- Line 3: We give a short description of this comparison experiment. This will be shown in the report.
- Line 4: We also give the list of paths to the directories containing the outputs of the experiments we want to compare.
- Line 5: Since we want to compare experiments that all used the same experiment id (ASAP2), we instead list the names that we want to use for each experiment in the summary report.
Documentation for all of the available configuration options is available here.
Note
You can also use our nifty capability to automatically generate rsmsummarize
configuration files rather than creating them manually.
Run the experiment¶
Now that we have the list of the experiments we want to compare and our configuration file in .json
format, we can use the rsmsummarize command-line script to run our comparison experiment.
$ cd examples/rsmsummarize
$ rsmsummarize config_rsmsummarize.json
This should produce output like:
Output directory: /Users/nmadnani/work/rsmtool/examples/rsmsummarize
Starting report generation
Merging sections
Exporting HTML
Executing notebook with kernel: python3
Once the run finishes, you will see a new folder report
containing an HTML file named model_comparison_report.html
. This is the final rsmsummarize
summary report.
Examine the report¶
Our experiment report contains an overview of the main aspects of model performance. It includes:
- A brief description of all experiments.
- Information about model parameters and model fit for all rsmtool experiments.
- Model performance for all experiments.
Note
Some of the information such as model fit and model parameters are only available for rsmtool
experiments.
Input¶
rsmsummarize
requires a single argument to run an experiment: the path to a configuration file. You can specify which models you want to compare and the name of the report by supplying the path to a configuration file. It can also take an output directory as an optional second argument. If the latter is not specified, rsmsummarize
will use the current directory as the output directory.
Here are all the arguments to the rsmsummarize
command-line script.
-
config_file
¶
The JSON configuration file for this experiment.
-
output_dir
(optional)
¶ The output directory where the report and intermediate
.csv
files for this comparison will be stored.
-
-f
,
--force
¶
If specified, the contents of the output directory will be overwritten even if it already contains the output of another rsmsummarize experiment.
-
-h
,
--help
¶
Show help message and exit.
-
-V
,
--version
¶
Show version number and exit.
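For example, a typical invocation with a hypothetical configuration file and output directory might look like this:

$ rsmsummarize config_rsmsummarize.json summary_output
$ # re-run and overwrite the existing output directory
$ rsmsummarize config_rsmsummarize.json summary_output -f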
Experiment configuration file¶
This is a file in .json
format that provides overall configuration options for an rsmsummarize
experiment. Here’s an example configuration file for rsmsummarize
.
Note
To make it easy to get started with rsmsummarize
, we provide a way to automatically generate configuration files both interactively as well as non-interactively. Novice users will find interactive generation more helpful while more advanced users will prefer non-interactive generation. See this page for more details.
Next, we describe all of the rsmsummarize
configuration fields in detail. There are two required fields and the rest are all optional. We first describe the required fields and then the optional ones (sorted alphabetically).
summary_id¶
An identifier for the rsmsummarize
experiment. This will be used to name the report. It can be any combination of alphanumeric values, must not contain spaces, and must not be any longer than 200 characters.
experiment_dirs¶
The list of directories containing the results of the experiments to be compared. These directories should be the output directories used for each experiment and should contain subdirectories output
and figure
generated by rsmtool
or rsmeval
.
custom_sections (Optional)¶
A list of custom, user-defined sections to be included in the final report. These are IPython notebooks (.ipynb
files) created by the user. The list must contain paths to the notebook files, either absolute or relative to the configuration file. All custom notebooks have access to some pre-defined variables.
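For instance, a minimal configuration using one custom notebook might look like the following sketch, where notebooks/subgroup_trends.ipynb is a hypothetical path to a user-created notebook:

{
    "summary_id": "model_comparison_custom",
    "experiment_dirs": ["../rsmtool", "../rsmeval"],
    "custom_sections": ["notebooks/subgroup_trends.ipynb"]
}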
description (Optional)¶
A brief description of the summary. The description can contain spaces and punctuation.
experiment_names (Optional)¶
The list of experiment names to use in the summary report and intermediate files. The names should be listed in the same order as the experiments in experiment_dirs. When this field is not specified, the report will show the original experiment_id
for each experiment.
file_format (Optional)¶
The format of the intermediate files generated by rsmsummarize
. Options are csv
, tsv
, or xlsx
. Defaults to csv
if this is not specified.
Note
In the rsmsummarize
context, the file_format
parameter refers to the format of the intermediate files generated by rsmsummarize
, not the intermediate files generated by the original experiment(s) being summarized. The format of these files does not have to match the format of the files generated by the original experiment(s).
general_sections (Optional)¶
RSMTool provides pre-defined sections for rsmsummarize
(listed below) and, by default, all of them are included in the report. However, you can choose a subset of these pre-defined sections by specifying a list as the value for this field, as illustrated in the sketch after the section list below.
- preprocessed_features: compares marginal and partial correlations between all features and the human score, and optionally response length if this was computed for any of the models.
- model: compares the parameters of the regression models. For linear models, it also includes the standardized and relative coefficients.
- evaluation: compares the standard set of evaluations recommended for scoring models on the evaluation data.
- true_score_evaluation: compares the evaluation of system scores against the true scores estimated according to test theory. The notebook shows:
  - Number of single- and double-scored responses.
  - Variance of human rater errors and estimated variance of true scores.
  - Mean squared error (MSE) and proportional reduction in mean squared error (PRMSE) when predicting true score with system score.
- intermediate_file_paths: shows links to all of the intermediate files that were generated while running the summary.
- sysinfo: shows all Python packages along with versions installed in the current environment while generating the report.
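For example, to include only the evaluation-related sections and the system information in the summary report, a configuration could specify the following (the directory paths are placeholders):

{
    "summary_id": "model_comparison_eval_only",
    "experiment_dirs": ["../rsmtool", "../rsmeval"],
    "general_sections": ["evaluation", "true_score_evaluation", "sysinfo"]
}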
section_order (Optional)¶
A list containing the order in which the sections in the report should be generated (see the hypothetical example after this list). Any specified order must explicitly list:
- Either all pre-defined sections if a value for the general_sections field is not specified OR the sections specified using general_sections, and
- All custom section names specified using custom_sections, i.e., file prefixes only, without the path and without the .ipynb extension, and
- All special sections specified using special_sections.
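As a hypothetical example, the following sketch selects two pre-defined sections, adds one custom notebook, and orders all three explicitly; note that the custom section is referred to by its file prefix only:

{
    "summary_id": "model_comparison_ordered",
    "experiment_dirs": ["../rsmtool", "../rsmeval"],
    "general_sections": ["evaluation", "sysinfo"],
    "custom_sections": ["notebooks/subgroup_trends.ipynb"],
    "section_order": ["evaluation", "subgroup_trends", "sysinfo"]
}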
special_sections (Optional)¶
A list specifying special ETS-only comparison sections to be included into the final report. These sections are available only to ETS employees via the rsmextra package.
use_thumbnails (Optional)¶
If set to true
, the images in the HTML will be set to clickable thumbnails rather than full-sized images. Upon clicking the thumbnail, the full-sized images will be displayed in a separate tab in the browser. If set to false
, full-sized images will be displayed as usual. Defaults to false
.
Output¶
rsmsummarize
produces a set of folders in the output directory.
report¶
This folder contains the final rsmsummarize
report in HTML format as well as in the form of a Jupyter notebook (a .ipynb
file).
output¶
This folder contains all of the intermediate files produced as part of the various analyses performed, saved as .csv
files. rsmsummarize
will also save in this folder a copy of the configuration file. Fields not specified in the original configuration file will be pre-populated with default values.
figure¶
This folder contains all of the figures that may be generated as part of the various analyses performed, saved as .svg
files. Note that no figures are generated by the existing rsmsummarize
notebooks.
Intermediate files¶
Although the primary output of rsmsummarize is an HTML report, we also want the user to be able to conduct additional analyses outside of RSMTool. To this end, all of the tables produced in the experiment report are saved as files in the format specified by the file_format
parameter in the output
directory. The following sections describe all of the intermediate files that are produced.
Note
The names of all files begin with the summary_id
provided by the user in the experiment configuration file.
Marginal and partial correlations with score¶
filenames: margcor_score_all_data
, pcor_score_all_data
pcor_score_no_length_all_data
The first file contains the marginal correlations between each pre-processed feature and human score. The second file contains the partial correlation between each pre-processed feature and human score after controlling for all other features. The third file contains the partial correlations between each pre-processed feature and human score after controlling for response length, if length_column
was specified in the configuration file.
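For reference, the partial correlations reported here follow the conventional definition. One way to express it (a sketch, not necessarily the exact computation used internally) is via the precision matrix: if R is the correlation matrix of all pre-processed features together with the human score and P is the inverse of R, then

\rho_{\,i,\ \mathrm{score}\,\cdot\,\mathrm{rest}} \;=\; -\,\frac{P_{i,\ \mathrm{score}}}{\sqrt{P_{i,i}\;P_{\mathrm{score},\ \mathrm{score}}}}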
Model information¶
model_summary: This file contains the main information about the models included in the report, including:
- Total number of features
- Total number of features with non-negative coefficients
- The learner
- The label used to train the model

betas: standardized coefficients (for built-in models only).
model_fit: R squared and adjusted R squared computed on the training set. Note that these values are always computed on raw predictions without any trimming or rounding.
Note
If the report includes a combination of rsmtool
and rsmeval
experiments, the summary tables with model information will only include rsmtool
experiments since no model information is available for rsmeval
experiments.
Evaluation metrics¶
eval_short: descriptives for predicted and human scores (mean, std. dev., etc.) and association metrics (correlation, quadratic weighted kappa, SMD, etc.) for specific score types chosen based on recommendations by Williamson (2012). Specifically, the following columns are included (the raw or scale version is chosen depending on the value of use_scaled_predictions in the configuration file):
- h_mean
- h_sd
- corr
- sys_mean [raw/scale trim]
- sys_sd [raw/scale trim]
- SMD [raw/scale trim]
- adj_agr [raw/scale trim_round]
- exact_agr [raw/scale trim_round]
- kappa [raw/scale trim_round]
- wtkappa [raw/scale trim_round]
- sys_mean [raw/scale trim_round]
- sys_sd [raw/scale trim_round]
- SMD [raw/scale trim_round]
- R2 [raw/scale trim]
- RMSE [raw/scale trim]
Note
Please note that for raw scores, SMD values are likely to be affected by possible differences in scale.
Evaluations based on test theory¶
true_score_eval
: evaluations of system scores against estimated true scores. Contains total counts of single- and double-scored responses, variance of human rater error, estimated true score variance, and mean squared error (MSE) and proportional reduction in mean squared error (PRMSE) when predicting true score using system score.
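For reference, PRMSE is conventionally defined relative to the estimated true score variance; a minimal formulation consistent with the description above is

\mathrm{PRMSE} \;=\; 1 \;-\; \frac{\mathrm{MSE}(\text{system score},\ \text{true score})}{\hat{\sigma}^{2}_{\text{true}}}

where \hat{\sigma}^{2}_{\text{true}} denotes the estimated true score variance reported in this file.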
rsmxval
- Run cross-validation experiments¶
RSMTool provides the rsmxval
command-line utility to run cross-validation experiments with scoring models. Why would one want to use cross-validation rather than just using the simple train-and-evaluate loop provided by the rsmtool
utility? Using cross-validation can provide more accurate estimates of scoring model performance since those estimates are averaged over multiple train-test splits that are randomly selected based on the data. Using a single train-test split may lead to biased estimates of performance since those estimates will depend on the specific characteristics of that split. Using cross-validation is more likely to provide estimates of how well the scoring model will generalize to unseen test data, and more easily flag problems with overfitting and selection bias, if any.
Cross-validation experiments in RSMTool consist of the following steps (a conceptual sketch of the equivalent manual pipeline follows the list):
1. The given training data file is first shuffled randomly (with a fixed seed for reproducibility) and then split into the requested number of folds. It is also possible for the user to provide a CSV file containing a pre-determined set of folds, e.g., from another part of the data pipeline.
2. For each fold (or train-test split), rsmtool is run to train a scoring model on the training split and evaluate on the test split. All of the outputs for each of the rsmtool runs are saved on disk and represent the per-fold performance.
3. The predictions generated by rsmtool for each of the folds are all combined into a single file, which is then used as input for rsmeval. The output of this evaluation run is saved to disk and provides a more accurate estimate of the predictive performance of a scoring model trained on the given data.
4. A summary report comparing all of the folds is generated by running rsmsummarize on all of the fold directories created in Step 1 and its output is also saved to disk. This summary output can be useful to see if the performance for any of the folds stands out for any reason, which could point to a potential problem.
5. Finally, a scoring model is trained on the complete training data file using rsmtool, which also generates a report that contains only the feature and model descriptives. This model is what will most likely be deployed for inference, assuming the analyses produced in this step and Steps 1–4 meet the stakeholders' requirements.
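Conceptually, these steps are roughly equivalent to running the individual tools by hand. The following sketch uses hypothetical configuration file names and mirrors the directory layout described in the Output section below:

$ # Step 2: one rsmtool experiment per fold (using the splits from Step 1)
$ rsmtool fold01_config.json folds/01
$ rsmtool fold02_config.json folds/02
$ # ... one run per remaining fold
$ # Step 3: evaluate the combined cross-validated predictions
$ rsmeval evaluation_config.json evaluation
$ # Step 4: summarize the per-fold experiments
$ rsmsummarize fold_summary_config.json fold-summary
$ # Step 5: train the final model on the full training data
$ rsmtool final_model_config.json final-model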
Tutorial¶
For this tutorial, you first need to install RSMTool and make sure the conda environment in which you installed it is activated.
Workflow¶
rsmxval
is designed to run cross-validation experiments using a single file containing human scores and features. Just like rsmtool
, rsmxval
does not provide any functionality for feature extraction and assumes that users will extract features on their own. The workflow steps are as follows:
- Create a data file in one of the supported formats containing the extracted features for each response in the data along with human score(s) assigned to it.
- Create an experiment configuration file describing the cross-validation experiment you would like to run.
- Run that configuration file with rsmxval and generate its outputs.
- Examine the various HTML reports to check various aspects of model performance.
Note that unlike rsmtool
and rsmeval
, rsmxval
currently does not support customization of the HTML reports generated in each step. This functionality may be added in future versions.
ASAP Example¶
We are going to use the same example from the 2012 Kaggle competition on automated essay scoring that we used for the rsmtool tutorial.
Extract features¶
We are using the same features for this data as described in the rsmtool tutorial.
Create a configuration file¶
The next step is to create an experiment configuration file in .json
format.
 1  {
 2      "experiment_id": "ASAP2_xval",
 3      "description": "Cross-validation with two human scores using a LinearRegression model.",
 4      "train_file": "train.csv",
 5      "folds": 3,
 6      "train_label_column": "score",
 7      "id_column": "ID",
 8      "model": "LinearRegression",
 9      "trim_min": 1,
10      "trim_max": 6,
11      "second_human_score_column": "score2",
12      "use_scaled_predictions": true
13  }
Let’s take a look at the options in our configuration file.
- Line 2: We define an experiment ID used to identify the files produced as part of this experiment.
- Line 3: We provide a description which will be included in the various reports.
- Line 4: We list the path to our training file with the feature values and human scores. For this tutorial, we used .csv format, but several other input file formats are also supported.
- Line 5: This field indicates the number of cross-validation folds we want to use. If this field is not specified, rsmxval uses 5-fold cross-validation by default.
- Line 6: This field indicates that the human (reference) scores in our .csv file are located in a column named score.
- Line 7: This field indicates that the unique IDs for the responses in the .csv file are located in a column named ID.
- Line 8: We choose to use a linear regression model to combine the feature values into a score.
- Lines 9-10: These fields indicate that the lowest score on the scoring scale is a 1 and the highest score is a 6. This information is usually part of the rubric used by human graders.
- Line 11: This field indicates that scores from a second set of human graders are also available (useful for comparing the agreement between human-machine scores to the agreement between two sets of humans) and are located in the score2 column in the training .csv file.
- Line 12: Next, we indicate that we would like to use the scaled scores for all our evaluation analyses at each step.
Documentation for all of the available configuration options is available here.
Note
You can also use our nifty capability to automatically generate rsmxval
configuration files rather than creating them manually.
Run the experiment¶
Now that we have our input file and our configuration file, we can use the rsmxval command-line script to run our evaluation experiment.
$ cd examples/rsmxval
$ rsmxval config_rsmxval.json output
This should produce output like:
Output directory: output
Saving configuration file.
Generating 3 folds after shuffling
Running RSMTool on each fold in parallel
Progress: 100%|███████████████████████████████████████████████| 3/3 [00:08<00:00, 2.76s/it]
Creating fold summary
Evaluating combined fold predictions
Training model on full data
Once the run finishes, you will see an output
sub-directory in the current directory. Under this directory you will see multiple sub-directories, each corresponding to a different cross-validation step, as described here.
Examine the reports¶
The cross-validation experiment produces multiple HTML reports – an rsmtool
report for each of the 3 folds (output/folds/{01,02,03}/report/ASAP2_xval_fold{01,02,03}.html
), the evaluation report for the cross-validated predictions (output/evaluation/report/ASAP2_xval_evaluation_report.html
), a report summarizing the salient characteristics of the 3 folds (output/fold-summary/report/ASAP2_xval_fold_summary_report.html
), and a report showing the feature and model descriptives (output/final-model/report/ASAP2_xval_model_report.html
). Examining these reports will provide a relatively complete picture of how well the predictive performance of the scoring model will generalize to unseen data.
Input¶
rsmxval
requires a single argument to run an experiment: the path to a configuration file. It can also take an output directory as an optional second argument. If the latter is not specified, rsmxval
will use the current directory as the output directory.
Here are all the arguments to the rsmxval
command-line script.
-
config_file
¶
The JSON configuration file for this cross-validation experiment.
-
output_dir
(optional)
¶ The output directory where all the sub-directories and files for this cross-validation experiment will be stored. If a non-empty directory with the same name already exists, an error will be raised.
-
-h
,
--help
¶
Show help message and exit.
-
-V
,
--version
¶
Show version number and exit.
Experiment configuration file¶
This is a file in .json
format that provides overall configuration options for an rsmxval
experiment. Here’s an example configuration file for rsmxval
.
Note
To make it easy to get started with rsmxval
, we provide a way to automatically generate configuration files both interactively as well as non-interactively. Novice users will find interactive generation more helpful while more advanced users will prefer non-interactive generation. See this page for more details.
Configuration files for rsmxval
are almost identical to rsmtool
configuration files with only a few differences. Next, we describe the three required rsmxval
configuration fields in detail.
experiment_id¶
An identifier for the experiment that will be used as part of the names of the reports and intermediate files produced in each of the steps. It can be any combination of alphanumeric values, must not contain spaces, and must not be any longer than 200 characters. Suffixes are added to this experiment ID by each of the steps for the reports and files they produce, i.e., _fold<N>
in the per-fold rsmtool
step where <N>
is a two digit number, _evaluation
by the rsmeval
evaluation step, _fold_summary
by the rsmsummarize
step, and _model
by the final full-data rsmtool
step.
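For example, with the experiment_id ASAP2_xval used in the tutorial above, the effective experiment IDs for the individual steps would be:

ASAP2_xval_fold01, ASAP2_xval_fold02, ...    (per-fold rsmtool runs)
ASAP2_xval_evaluation                        (rsmeval step)
ASAP2_xval_fold_summary                      (rsmsummarize step)
ASAP2_xval_model                             (final full-data rsmtool step)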
model¶
The machine learner you want to use to build the scoring model. Possible values include built-in linear regression models as well as all of the learners available via SKLL. With SKLL learners, you can customize the tuning objective and also compute expected scores as predictions.
train_file¶
The path to the training data feature file in one of the supported formats. Each row should correspond to a single response and contain numeric feature values extracted for this response. In addition, there should be a column with a unique identifier (ID) for each response and a column with the human score for each response. The path can be absolute or relative to the config file's location.
Important
Unlike rsmtool
, rsmxval
does not accept an evaluation set and will raise an error if the test_file
field is specified.
Next, we will describe the two optional fields that are unique to rsmxval
.
folds (Optional)¶
The number of folds to use for cross-validation. This should be an integer and defaults to 5.
folds_file (Optional)¶
The path to a file containing custom, pre-specified folds to be used for cross-validation. This should be a .csv
file (no other formats are accepted) and should contain only two columns: id
and fold
. The id
column should contain the same IDs of the responses that are contained in train_file
above. The fold
column should contain an integer representing which fold the response with the id
belongs to. IDs not specified in this file will be skipped and not included in the cross-validation at all. Just like train_file
, this path can be absolute or relative to the config file’s location. Here’s an example of a folds file containing 2 folds.
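For illustration only (the response IDs below are made up), a folds file with two folds contains exactly the id and fold columns:

id,fold
RESP_001,1
RESP_002,1
RESP_003,2
RESP_004,2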
Note
If both folds_file
and folds
are specified, then the former will take precedence unless it contains a non-existent path.
In addition to the fields described so far, an rsmxval
configuration file also accepts the following optional fields used by rsmtool
:
candidate_column
description
exclude_zero_scores
feature_subset
feature_subset_file
features
file_format
flag_column
flag_column_test
id_column
length_column
min_items_per_candidate
min_n_per_group
predict_expected_scores
rater_error_variance
second_human_score_column
select_transformations
sign
skll_fixed_parameters
skll_objective
standardize_features
subgroups
train_label_column
trim_max
trim_min
trim_tolerance
use_scaled_predictions
use_thumbnails
use_truncation_thresholds
Please refer to these fields’ descriptions on the page describing the rsmtool configuration file.
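As a hypothetical sketch, an rsmxval configuration that combines the required fields with a few of these optional rsmtool fields might look like the following; the subgroup column names L1 and gender are placeholders for columns that would need to exist in the training file:

{
    "experiment_id": "ASAP2_xval_subgroups",
    "description": "Hypothetical cross-validation experiment with subgroup analyses.",
    "train_file": "train.csv",
    "model": "LinearRegression",
    "folds": 5,
    "train_label_column": "score",
    "id_column": "ID",
    "second_human_score_column": "score2",
    "subgroups": ["L1", "gender"],
    "file_format": "tsv",
    "use_scaled_predictions": true
}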
Output¶
rsmxval
produces a set of folders in the output directory.
folds¶
This folder contains the output of each of the per-fold rsmtool
experiments. It contains as many sub-folders as the number of specified folds, named 01
, 02
, 03
, etc. Each of these numbered sub-folders contains the output of one rsmtool
experiment conducted using the training split of that fold as the training data and the test split as the evaluation data. Each of the sub-folders contains the output directories produced by rsmtool. The report for each fold lives in the report
sub-directory, e.g., the report for the first fold is found at folds/01/report/<experiment_id>_fold01_report.html
, and so on. The messages that are usually printed out by rsmtool
to the screen are instead logged to a file and saved to disk as, e.g., folds/01/rsmtool.log
.
evaluation¶
This folder contains the output of the rsmeval
evaluation experiment that uses the cross-validated predictions from each fold. This folder contains the output directories produced by rsmeval. The evaluation report can be found at evaluation/report/<experiment_id>_evaluation_report.html
. The messages that are usually printed out by rsmeval
to the screen are instead logged to a file and saved to disk as evaluation/rsmeval.log
.
fold-summary¶
This folder contains the output of the rsmsummarize
experiment that provides a quick summary of all of the folds in a single, easily-scanned report. The folder contains the output directories produced by rsmsummarize. The summary report can be found at fold-summary/report/<experiment_id>_fold_summary_report.html
. The messages that are usually printed out by rsmsummarize
to the screen are instead logged to a file and saved to disk as fold-summary/rsmsummarize.log
.
final-model¶
This folder contains the output of the rsmtool
experiment that trains a model on the full training data and provides a report showing the feature and model descriptives. It contains the output directories produced by rsmtool. The primary artifacts of this experiment are the report (final-model/report/<experiment_id>_model_report.html
) and the final trained model (final-model/output/<experiment_id>_model.model
). The messages that are usually printed out by rsmtool
to the screen are instead logged to a file and saved to disk as final-model/rsmtool.log
.
Note
Every rsmtool
experiment requires both a training and an evaluation set. However, in this step, we are using the full training data to train the model and rsmxval
does not use a separate test set. Therefore, we simply randomly sample 10% of the full training data as a dummy test set to make sure that rsmtool
runs successfully. The report in this step only contains the model and feature descriptives and, therefore, does not use this dummy test set at all. Users should ignore any intermediate files under the final-model/output
and final-model/figure
sub-directories that are derived from this dummy test set. If needed, the data used as the dummy test set can be found at final-model/dummy_test.csv
(or in the chosen format).
In addition to these folders, rsmxval
will also save a copy of the configuration file in the output directory at the same-level as the above folders. Fields not specified in the original configuration file will be pre-populated with default values.