In addition to providing the rsmtool utility for training and evaluating regression-based scoring models, the RSMTool package also provides three other command-line utilities for more advanced users.

## rsmeval - Evaluate external predictions¶

RSMTool provides the rsmeval command-line utility to evaluate existing predictions and generate a report with all the built-in analyses. This can be useful in scenarios where the user wants to use more sophisticated machine learning algorithms not available in RSMTool to build the scoring model but still wants to be able to evaluate that model’s predictions using the standard analyses.

For example, say a researcher has an existing automated scoring engine for grading short responses that extracts the features and computes the predicted score. This engine uses a large number of binary, sparse features. She cannot use rsmtool to train her model since it requires numeric features. So, she uses scikit-learn to train her model.

Once the model is trained, the researcher wants to evaluate her engine’s performance using the analyses recommended by the educational measurement community as well as conduct additional investigations for specific subgroups of test-takers. However, these kinds of analyses are not available in scikit-learn. She can use rsmeval to set up a customized report using a combination of existing and custom sections and quickly produce the evaluation that is useful to her.

### Tutorial¶

For this tutorial, you first need to install RSMTool and make sure the conda environment in which you installed it is activated.

#### Workflow¶

rsmeval is designed for evaluating existing machine scores. Once you have the scores computed for all the responses in your data, the next steps are fairly straightforward:

1. Create a data file in one of the supported formats containing the computed system scores and the human scores you want to compare against.
2. Create an experiment configuration file describing the evaluation experiment you would like to run.
3. Run that configuration file with rsmeval and generate the experiment HTML report as well as the intermediate CSV files.
4. Examine the HTML report to check various aspects of model performance.

Note that the above workflow does not use any customization features, e.g., choosing which sections to include in the report or adding custom analysis sections. However, we will stick with this workflow for our tutorial since it is likely to be the most common use case.
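As a quick sketch of step 1, the predictions file can be assembled with pandas; the column names here mirror those used later in this tutorial, and the scores themselves are made up for illustration:

```python
import pandas as pd

# Hypothetical scores; in practice the system scores come from your own
# scoring engine and the human scores from the raters. Column names match
# the configuration file used later in this tutorial.
df = pd.DataFrame(
    {
        "ID": ["resp_001", "resp_002", "resp_003"],
        "human": [4, 2, 5],          # first human score
        "human2": [4, 3, 5],         # optional second human score
        "system": [3.8, 2.4, 4.9],   # machine predictions to evaluate
    }
)
df.to_csv("ASAP2_scores.csv", index=False)
```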

#### ASAP Example¶

We are going to use the same example from the 2012 Kaggle competition on automated essay scoring that we used for the rsmtool tutorial.

#### Generate scores¶

rsmeval is designed for researchers who have developed their own scoring engine for generating scores and would like to produce an evaluation report for those scores. For this tutorial, we will use the scores we generated for the ASAP2 evaluation set in the rsmtool tutorial.

#### Create a configuration file¶

The next step is to create an experiment configuration file in .json format.

```
 1  {
 2      "experiment_id": "ASAP2_evaluation",
 3      "description": "Evaluation of the scores generated using rsmtool.",
 4      "predictions_file": "ASAP2_scores.csv",
 5      "system_score_column": "system",
 6      "human_score_column": "human",
 7      "id_column": "ID",
 8      "trim_min": 1,
 9      "trim_max": 6,
10      "second_human_score_column": "human2",
11      "scale_with": "asis"
12  }
```

Let’s take a look at the options in our configuration file.

• Line 2: We define an experiment ID.
• Line 3: We also provide a description which will be included in the experiment report.
• Line 4: We list the path to the file with the predicted and human scores. For this tutorial we used .csv format, but RSMTool also supports several other input file formats.
• Line 5: This field indicates that the system scores in our .csv file are located in a column named system.
• Line 6: This field indicates that the human (reference) scores in our .csv file are located in a column named human.
• Line 7: This field indicates that the unique IDs for the responses in the .csv file are located in a column named ID.
• Lines 8-9: These fields indicate that the lowest score on the scoring scale is a 1 and the highest score is a 6. This information is usually part of the rubric used by human graders.
• Line 10: This field indicates that scores from a second set of human graders are also available (useful for comparing the agreement between human-machine scores to the agreement between two sets of humans) and are located in the human2 column in the .csv file.
• Line 11: This field indicates that the provided machine scores are already re-scaled to match the distribution of human scores. rsmeval itself will not perform any scaling and the report will refer to these as scaled scores.

Documentation for all of the available configuration options is available here.

Note

You can also use our nifty capability to automatically generate rsmeval configuration files rather than creating them manually.

#### Run the experiment¶

Now that we have our scores in the right format and our configuration file in .json format, we can use the rsmeval command-line script to run our evaluation experiment.

```
$ cd examples/rsmeval
$ rsmeval config_rsmeval.json
```


This should produce output like:

Output directory: /Users/nmadnani/work/rsmtool/examples/rsmeval
Assuming given system predictions are already scaled and will be used as such.
Processing predictions
Saving pre-processed predictions and the metadata to disk
Running analyses on predictions
Starting report generation
Merging sections
Exporting HTML
Executing notebook with kernel: python3


Once the run finishes, you will see the output, figure, and report sub-directories in the current directory. Each of these directories contains useful information, but we are specifically interested in the report/ASAP2_evaluation_report.html file, which is the final evaluation report.

#### Examine the report¶

Our experiment report contains all the information we would need to evaluate the provided system scores against the human scores. It includes:

1. The distributions for the human versus the system scores.
2. Several different metrics indicating how well the machine’s scores agree with the humans’.
3. Information about human-human agreement and the difference between human-human and human-system agreement.

… and much more.

### Input¶

rsmeval requires a single argument to run an experiment: the path to a configuration file. It can also take an output directory as an optional second argument. If the latter is not specified, rsmeval will use the current directory as the output directory.

Here are all the arguments to the rsmeval command-line script.

config_file

The JSON configuration file for this experiment.

output_dir (optional)

The output directory where all the files for this experiment will be stored.

-f, --force

If specified, the contents of the output directory will be overwritten even if it already contains the output of another rsmeval experiment.

-h, --help

Show help message and exit.

-V, --version

Show version number and exit.

### Experiment configuration file¶

This is a file in .json format that provides overall configuration options for an rsmeval experiment. Here’s an example configuration file for rsmeval.

Note

To make it easy to get started with rsmeval, we provide a way to automatically generate configuration files both interactively and non-interactively. Novice users will find interactive generation more helpful, while more advanced users will prefer non-interactive generation. See this page for more details.

Next, we describe all of the rsmeval configuration fields in detail. There are five required fields and the rest are all optional. We first describe the required fields and then the optional ones (sorted alphabetically).

#### experiment_id¶

An identifier for the experiment that will be used to name the report and all intermediate files. It can be any combination of alphanumeric characters, must not contain spaces, and must be no longer than 200 characters.

#### predictions_file¶

The path to the file with predictions to evaluate. The file should be in one of the supported formats. Each row should correspond to a single response and contain the predicted and observed scores for this response. In addition, there should be a column with a unique identifier (ID) for each response. The path can be absolute or relative to the location of the configuration file.

#### system_score_column¶

The name for the column containing the scores predicted by the system. These scores will be used for evaluation.

#### trim_min¶

The single numeric value for the lowest possible integer score that the machine should predict. This value will be used to compute the floor value for trimmed (bound) machine scores as trim_min - trim_tolerance.

#### trim_max¶

The single numeric value for the highest possible integer score that the machine should predict. This value will be used to compute the ceiling value for trimmed (bound) machine scores as trim_max + trim_tolerance.

Note

Although the trim_min and trim_max fields are optional for rsmtool, they are required for rsmeval.

#### candidate_column (Optional)¶

The name for an optional column in the predictions file containing unique candidate IDs. Candidate IDs are different from response IDs since the same candidate (test-taker) might have responded to multiple questions.

#### custom_sections (Optional)¶

A list of custom, user-defined sections to be included into the final report. These are IPython notebooks (.ipynb files) created by the user. The list must contain paths to the notebook files, either absolute or relative to the configuration file. All custom notebooks have access to some pre-defined variables.

#### description (Optional)¶

A brief description of the experiment. This will be included in the report. The description can contain spaces and punctuation. It’s blank by default.

#### exclude_zero_scores (Optional)¶

By default, responses with human scores of 0 will be excluded from evaluations. Set this field to false if you want to keep responses with scores of 0. Defaults to true.

#### file_format (Optional)¶

The format of the intermediate files. Options are csv, tsv, or xlsx. Defaults to csv if this is not specified.

#### flag_column (Optional)¶

This field makes it possible to only use responses with particular values in a given column (e.g. only responses with a value of 0 in a column called ADVISORY). The field takes a dictionary in Python format where the keys are the names of the columns and the values are lists of values for responses that will be evaluated. For example, a value of {"ADVISORY": 0} will mean that rsmeval will only use responses for which the ADVISORY column has the value 0. Defaults to None.

Note

If several conditions are specified (e.g., {"ADVISORY": 0, "ERROR": 0}) only those responses which satisfy all the conditions will be selected for further analysis (in this example, these will be the responses where the ADVISORY column has a value of 0 and the ERROR column has a value of 0).

Note

When reading the values in the supplied dictionary, rsmeval treats numeric strings, floats, and integers as the same value. Thus 1, 1.0, "1" and "1.0" are all treated as 1.0.
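A rough pandas sketch of this filtering logic (the column names and flag values here are hypothetical, and the normalization to float mirrors rsmeval's treatment of numeric strings, floats, and integers as one value):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "spkitemid": ["r1", "r2", "r3", "r4"],
        "ADVISORY": [0, 0, 1, "0"],  # mixed types, as might occur in a CSV
        "ERROR": [0, 1, 0, 0.0],
    }
)

flag_conditions = {"ADVISORY": [0], "ERROR": [0]}

# A response is kept only if it satisfies ALL conditions; values are
# coerced to float so that 0, 0.0, and "0" compare as equal.
mask = pd.Series(True, index=df.index)
for column, values in flag_conditions.items():
    allowed = {float(v) for v in values}
    mask &= df[column].astype(float).isin(allowed)

kept = df[mask]
```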

#### general_sections (Optional)¶

RSMTool provides pre-defined sections for rsmeval (listed below) and, by default, all of them are included in the report. However, you can choose a subset of these pre-defined sections by specifying a list as the value for this field.

• data_description: Shows the total number of responses, along with any responses that have been excluded due to non-numeric/zero scores or flag columns.

• data_description_by_group: Shows the total number of responses for each of the subgroups specified in the configuration file. This section only covers the responses used to evaluate the model.

• consistency: Shows metrics for human-human agreement, the difference (“degradation”) between the human-human and human-system agreement, and the disattenuated human-machine correlations. This notebook is only generated if the configuration file specifies second_human_score_column.

• evaluation: Shows the standard set of evaluations recommended for scoring models on the evaluation data:

• a table showing human-system association metrics;
• the confusion matrix; and
• a barplot showing the distributions for both human and machine scores.
• evaluation_by_group: Shows barplots with the main evaluation metrics for each of the subgroups specified in the configuration file.

• fairness_analyses: Additional fairness analyses suggested in Loukina, Madnani, & Zechner, 2019. The notebook shows:

• percentage of variance in squared error explained by subgroup membership
• percentage of variance in raw (signed) error explained by subgroup membership
• percentage of variance in raw (signed) error explained by subgroup membership when controlling for human score
• plots showing estimates for each subgroup for each model
• true_score_evaluation: evaluation of system scores against the true scores estimated according to test theory. The notebook shows:

• Number of single and double-scored responses.
• Variance of human rater errors and estimated variance of true scores
• Mean squared error (MSE) and proportional reduction in mean squared error (PRMSE) when predicting true score with system score.
• intermediate_file_paths: Shows links to all of the intermediate files that were generated while running the evaluation.

• sysinfo: Shows all Python packages, along with their versions, installed in the environment used to generate the report.

#### human_score_column (Optional)¶

The name for the column containing the human scores for each response. The values in this column will be used as observed scores. Defaults to sc1.

Note

All responses with non-numeric values or zeros in either human_score_column or system_score_column will be automatically excluded from evaluation. You can use exclude_zero_scores (Optional) to keep responses with zero scores.
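The exclusion logic described in the note can be sketched in pandas as follows (the data here is hypothetical):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "spkitemid": ["r1", "r2", "r3", "r4"],
        "human": [4, 0, "N/A", 3],   # one zero and one non-numeric score
        "system": [3.5, 2.0, 4.0, 2.8],
    }
)

# Coerce to numeric; non-numeric entries become NaN.
df["human"] = pd.to_numeric(df["human"], errors="coerce")

exclude_zero_scores = True  # the default

# Drop responses with non-numeric scores, and zeros unless disabled.
mask = df["human"].notna() & df["system"].notna()
if exclude_zero_scores:
    mask &= (df["human"] != 0) & (df["system"] != 0)

kept = df[mask]
```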

#### id_column (Optional)¶

The name of the column containing the response IDs. Defaults to spkitemid, i.e., if this is not specified, rsmeval will look for a column called spkitemid in the prediction file.

#### min_items_per_candidate (Optional)¶

An integer value for the minimum number of responses expected from each candidate. If any candidates have fewer responses than the specified value, all responses from those candidates will be excluded from further analysis. Defaults to None.

#### min_n_per_group (Optional)¶

A single numeric value or a dictionary with keys as the group names listed in the subgroups field and values as the thresholds for the groups. When specified, only groups with at least this number of instances will be displayed in the tables and plots contained in the report. Note that this parameter only affects the HTML report and the figures. For all analyses – including the computation of the population parameters – data from all groups will be used. In addition, the intermediate files will still show the results for all groups.

Note

If you supply a dictionary, it must contain a key for every subgroup listed in subgroups field. If no threshold is to be applied for some of the groups, set the threshold value for this group to 0 in the dictionary.
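For example, a hypothetical configuration excerpt that applies a threshold of 10 responses to one group and no threshold to another might look like this:

```
"subgroups": ["gender", "native_language"],
"min_n_per_group": {"gender": 10, "native_language": 0}
```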

#### rater_error_variance (Optional)¶

True score evaluations require an estimate of rater error variance. By default, rsmeval will compute this variance from double-scored responses in the data. However, in some cases, one may wish to compute the variance on a different sample of responses. In such cases, this field can be used to set the rater error variance to a precomputed value which is then used as-is by rsmeval. You can use the rsmtool.utils.variance_of_errors function to compute rater error variance outside the main evaluation pipeline.
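If you prefer to compute this value yourself, here is a minimal numpy sketch of the standard classical-test-theory estimator (illustrative only, with made-up ratings; it is not RSMTool's actual implementation):

```python
import numpy as np

# Two human ratings for the same (hypothetical) double-scored responses.
h1 = np.array([1, 2, 3, 4])
h2 = np.array([2, 2, 4, 4])

# If the two raters make independent errors with equal variance, then
# Var(h1 - h2) = 2 * Var(error), so halving the variance of the score
# differences estimates the rater error variance.
rater_error_variance = np.var(h1 - h2, ddof=1) / 2
```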

#### scale_with (Optional)¶

In many scoring applications, system scores are re-scaled so that their mean and standard deviation match those of the human scores for the training data.

If you want rsmeval to re-scale the supplied predictions, you need to provide – as the value for this field – the path to a second file in one of the supported formats containing the human scores and predictions of the same system on its training data. This file must have two columns: the human scores under the sc1 column and the predicted scores under the prediction column.

This field can also be set to "asis" if the scores are already scaled. In this case, no additional scaling will be performed by rsmeval but the report will refer to the scores as “scaled”.

Defaults to "raw", which means that no rescaling is performed and the report refers to the scores as “raw”.
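The rescaling itself is a linear transformation. Here is a numpy sketch, under the assumption that the new predictions are standardized against the training predictions and then mapped onto the training-set human-score distribution (all values are hypothetical):

```python
import numpy as np

# Hypothetical human scores (sc1) and system predictions on the
# model's training data, as supplied via scale_with.
sc1 = np.array([2.0, 3.0, 4.0, 5.0])
train_pred = np.array([2.5, 2.9, 3.7, 4.1])

# New predictions to rescale.
new_pred = np.array([3.0, 3.3])

# Standardize against the training predictions, then match the mean and
# standard deviation of the training-set human scores.
scaled = (new_pred - train_pred.mean()) / train_pred.std() * sc1.std() + sc1.mean()
```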

#### second_human_score_column (Optional)¶

The name for an optional column in the test data containing a second human score for each response. If specified, additional information about human-human agreement and degradation will be computed and included in the report. Note that this column must either contain numbers or be empty; non-numeric values are not accepted. Note also that the exclude_zero_scores (Optional) option below will apply to this column too.

Note

You do not need to have second human scores for all responses to use this option. The human-human agreement statistics will be computed as long as there is at least one response with numeric value in this column. For responses that do not have a second human score, the value in this column should be blank.

#### section_order (Optional)¶

A list containing the order in which the sections in the report should be generated. Any specified order must explicitly list:

1. Either all pre-defined sections if a value for the general_sections field is not specified OR the sections specified using general_sections, and
2. All custom section names specified using custom_sections, i.e., file prefixes only, without the path and without the .ipynb extension, and
3. All special sections specified using special_sections.

#### special_sections (Optional)¶

A list specifying special ETS-only sections to be included into the final report. These sections are available only to ETS employees via the rsmextra package.

#### subgroups (Optional)¶

A list of column names indicating grouping variables used for generating analyses specific to each of those defined subgroups. For example, ["prompt", "gender", "native_language", "test_country"]. These subgroup columns need to be present in the input predictions file. If subgroups are specified, rsmeval will generate:

• tables and barplots showing human-system agreement for each subgroup on the evaluation set.

#### trim_tolerance (Optional)¶

The single numeric value used to pad the trimming range specified by trim_min and trim_max, i.e., the ceiling and floor values for trimmed (bound) machine scores are computed as trim_max + trim_tolerance and trim_min - trim_tolerance, respectively. Defaults to 0.4998.

Note

For more fine-grained control over the trimming range, you can set trim_tolerance to 0 and use trim_min and trim_max to specify the exact floor and ceiling values.
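The trimming computation is simple clipping; here is a numpy sketch using the values from this tutorial's configuration:

```python
import numpy as np

trim_min, trim_max, trim_tolerance = 1, 6, 0.4998

floor = trim_min - trim_tolerance    # 0.5002
ceiling = trim_max + trim_tolerance  # 6.4998

# Out-of-range predictions are clipped to the floor/ceiling values.
raw_predictions = np.array([0.1, 3.7, 7.2])
trimmed = np.clip(raw_predictions, floor, ceiling)
```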

#### use_thumbnails (Optional)¶

If set to true, the images in the HTML will be set to clickable thumbnails rather than full-sized images. Upon clicking the thumbnail, the full-sized images will be displayed in a separate tab in the browser. If set to false, full-sized images will be displayed as usual. Defaults to false.

### Output¶

rsmeval produces a set of folders in the output directory.

#### report¶

This folder contains the final RSMEval report in HTML format as well as in the form of a Jupyter notebook (a .ipynb file).

#### output¶

This folder contains all of the intermediate files produced as part of the various analyses performed, saved as .csv files. rsmeval will also save in this folder a copy of the configuration file. Fields not specified in the original configuration file will be pre-populated with default values.

#### figure¶

This folder contains all of the figures generated as part of the various analyses performed, saved as .svg files.

### Intermediate files¶

Although the primary output of rsmeval is an HTML report, we also want the user to be able to conduct additional analyses outside of rsmeval. To this end, all of the tables produced in the experiment report are saved as files in the format specified by the file_format parameter in the output directory. The following sections describe all of the intermediate files that are produced.

Note

The names of all files begin with the experiment_id provided by the user in the experiment configuration file. In addition, the names for certain columns are set to default values in these files irrespective of what they were named in the original data files. This is because RSMEval standardizes these column names internally for convenience. These values are:

• spkitemid for the column containing response IDs.
• sc1 for the column containing the human scores used as observed scores
• sc2 for the column containing the second human scores, if this column was specified in the configuration file.
• candidate for the column containing candidate IDs, if this column was specified in the configuration file.

#### Predictions¶

filename: pred_processed

This file contains the post-processed predicted scores: the predictions from the model are truncated, rounded, and re-scaled (if requested).

#### Flagged responses¶

filename: test_responses_with_excluded_flags

This file contains all of the rows in the input predictions file that were filtered out based on conditions specified in flag_column.

Note

If the predictions file contained columns with internal names such as sc1 that were not actually used by rsmeval, they will still be included in these files but their names will be changed to ##name## (e.g. ##sc1##).

#### Excluded responses¶

filename: test_excluded_responses

This file contains all of the rows in the predictions file that were filtered out because of non-numeric or zero scores.

filename: test_metadata

This file contains the metadata columns (id_column and subgroups, if provided) for all rows in the predictions file that were used in the evaluation.

#### Unused columns¶

filename: test_other_columns

This file contains all of the columns from the input predictions file that are not present in the *_pred_processed and *_metadata files. It only includes the rows that were not filtered out.

Note

If the predictions file contained columns with internal names such as sc1 but these columns were not actually used by rsmeval, these columns will also be included into these files but their names will be changed to ##name## (e.g. ##sc1##).

#### Human scores¶

filename: test_human_scores

This file contains the human scores, if available in the input predictions file, under a column called sc1 with the response IDs under the spkitemid column.

If second_human_score_column was specified, then it also contains the values in the predictions file from that column under a column called sc2. Only the rows that were not filtered out are included.

Note

If exclude_zero_scores was set to true (the default value), all zero scores in the second_human_score_column will be replaced by nan.

#### Data composition¶

filename: data_composition

This file contains the total number of responses in the input predictions file. If applicable, the table will also include the number of different subgroups.

#### Excluded data composition¶

filename: test_excluded_composition

This file contains the composition of the set of excluded responses, i.e., why responses were excluded and how many responses fall under each exclusion category.

#### Subgroup composition¶

filename: data_composition_by_<SUBGROUP>

There will be one such file for each of the specified subgroups and it contains the total number of responses in that subgroup.

#### Evaluation metrics¶

• eval: This file contains the descriptives for predicted and human scores (mean, std. dev., etc.) as well as the association metrics (correlation, quadratic weighted kappa, SMD, etc.) for the raw as well as the post-processed scores.

• eval_by_<SUBGROUP>: the same information as in *_eval.csv computed separately for each subgroup. However, rather than SMD, a difference of standardized means (DSM) will be calculated using z-scores.

• eval_short: a shortened version of eval that contains specific descriptives for predicted and human scores (mean, std. dev., etc.) and association metrics (correlation, quadratic weighted kappa, SMD, etc.) for specific score types chosen based on recommendations by Williamson (2012). Specifically, the following columns are included (the raw or scale version is chosen depending on the value of use_scaled_predictions in the configuration file).

• h_mean
• h_sd
• corr
• sys_mean [raw/scale trim]
• sys_sd [raw/scale trim]
• SMD [raw/scale trim]
• exact_agr [raw/scale trim_round]
• kappa [raw/scale trim_round]
• wtkappa [raw/scale trim]
• sys_mean [raw/scale trim_round]
• sys_sd [raw/scale trim_round]
• SMD [raw/scale trim_round]
• R2 [raw/scale trim]
• RMSE [raw/scale trim]
• score_dist: the distributions of the human scores and the rounded raw/scaled predicted scores, depending on the value of use_scaled_predictions.

• confMatrix: the confusion matrix between the human scores and the rounded raw/scaled predicted scores, depending on the value of use_scaled_predictions.

Note

Please note that for raw scores, SMD values are likely to be affected by possible differences in scale.

#### Human-human Consistency¶

These files are created only if a second human score has been made available via the second_human_score_column option in the configuration file.

• consistency: contains descriptives for both human raters as well as the agreement metrics between their ratings.
• consistency_by_<SUBGROUP>: contains the same metrics as in consistency file computed separately for each group. However, rather than SMD, a difference of standardized means (DSM) will be calculated using z-scores.
• degradation: shows the differences between human-human agreement and machine-human agreement for all association metrics and all forms of predicted scores.

#### Evaluations based on test theory¶

• disattenuated_correlations: shows the correlation between human-machine scores, human-human scores, and the disattenuated human-machine correlation computed as human-machine correlation divided by the square root of human-human correlation.
• disattenuated_correlations_by_<SUBGROUP>: contains the same metrics as in disattenuated_correlations file computed separately for each group.
• true_score_eval: evaluations of system scores against estimated true score. Contains total counts of single and double-scored response, variance of human rater error, estimated true score variance, and mean squared error (MSE) and proportional reduction in mean squared error (PRMSE) when predicting true score using system score.
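As a simplified illustration of PRMSE, the sketch below assumes every response is single-scored and the rater error variance is already known; RSMTool's actual computation also handles mixtures of single- and double-scored responses. All numbers are hypothetical:

```python
import numpy as np

system = np.array([1.0, 2.0, 3.0])
human = np.array([1.0, 3.0, 3.0])
rater_error_variance = 0.2  # hypothetical, estimated separately

# The MSE against observed human scores includes rater error, so
# subtracting it approximates the MSE against the unobserved true scores.
mse_observed = np.mean((system - human) ** 2)
mse_true = mse_observed - rater_error_variance

# Estimated true-score variance: observed human-score variance minus
# the rater error variance.
var_true = np.var(human, ddof=1) - rater_error_variance

# PRMSE: proportional reduction in MSE when predicting the true score
# with the system score.
prmse = 1 - mse_true / var_true
```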

#### Fairness analyses¶

These files contain the results of the additional fairness analyses suggested in Loukina, Madnani, & Zechner, 2019.

• <METRICS>_by_<SUBGROUP>.ols: a serialized object of type pandas.stats.ols.OLS containing the fitted model for estimating the variance attributed to a given subgroup membership for a given metric. The subgroups are defined by the configuration file. The metrics are osa (overall score accuracy), osd (overall score difference), and csd (conditional score difference).
• <METRICS>_by_<SUBGROUP>_ols_summary.txt: a text file containing a summary of the above model
• estimates_<METRICS>_by_<SUBGROUP>: coefficients, confidence intervals and p-values estimated by the model for each subgroup.
• fairness_metrics_by_<SUBGROUP>: the R^2 (percentage of variance) and p-values for all models.

## rsmpredict - Generate new predictions¶

RSMTool provides the rsmpredict command-line utility to generate predictions for new data using a model already trained using the rsmtool utility. This can be useful when processing a new set of responses to the same task without needing to retrain the model.

rsmpredict pre-processes the feature values according to user specifications before using them to generate the predicted scores. The generated scores are post-processed in the same manner as they are in rsmtool output.

Note

No score is generated for responses with non-numeric values for any of the features included into the model.

If the original model specified transformations for some of the features and these transformations led to NaN or Inf values when applied to the new data, rsmpredict will raise a warning. No score will be generated for such responses.
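For instance, a log transformation learned on strictly positive training features will produce non-finite values if the new data contains zeros or negatives (hypothetical values):

```python
import numpy as np

# A feature that was strictly positive in the training data but not here.
feature = np.array([3.0, 0.0, -1.0])

# Suppress the runtime warnings so the non-finite results are visible.
with np.errstate(divide="ignore", invalid="ignore"):
    transformed = np.log(feature)  # finite, -inf, nan
```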

### Tutorial¶

For this tutorial, you first need to install RSMTool and make sure the conda environment in which you installed it is activated.

#### Workflow¶

Important

Although this tutorial provides feature values for the purpose of illustration, rsmpredict does not include any functionality for feature extraction; the tool is designed for researchers who use their own NLP/Speech processing pipeline to extract features for their data.

rsmpredict allows you to generate scores for new data using an existing model trained using RSMTool. Therefore, before starting this tutorial, you first need to complete the rsmtool tutorial, which will produce a trained RSMTool model. You will also need to process the new data to extract the same features as the ones used in the model.

Once you have the features for the new data and the RSMTool model, using rsmpredict is fairly straightforward:

1. Create a file containing the features for the new data. The file should be in one of the supported formats.

2. Create an experiment configuration file describing the experiment you would like to run.

3. Run that configuration file with rsmpredict to generate the predicted scores.

Note

You do not need human scores to run rsmpredict since it does not produce any evaluation analyses. If you do have human scores for the new data and you would like to evaluate the system on this new data, you can first run rsmpredict to generate the predictions and then run rsmeval on the output of rsmpredict to generate an evaluation report.

#### ASAP Example¶

We are going to use the same example from the 2012 Kaggle competition on automated essay scoring that we used for the rsmtool tutorial. Specifically, we are going to use the linear regression model we trained in that tutorial to generate scores for new data.

Note

If you have not already completed that tutorial, please do so now. You may need to complete it again if you deleted the output files.

#### Extract features¶

We will first need to generate features for the new set of responses for which we want to predict scores. For this experiment, we will simply re-use the test set from the rsmtool tutorial.

Note

The features used with rsmpredict should be generated using the same NLP/Speech processing pipeline that generated the features used in the rsmtool modeling experiment.

#### Create a configuration file¶

The next step is to create an rsmpredict experiment configuration file in .json format.

```
1  {
2      "experiment_dir": "../rsmtool",
3      "experiment_id": "ASAP2",
4      "input_features_file": "../rsmtool/test.csv",
5      "id_column": "ID",
6      "human_score_column": "score",
7      "second_human_score_column": "score2"
8  }
```

Let’s take a look at the options in our configuration file.

• Line 2: We give the path to the directory containing the output of the rsmtool experiment.
• Line 3: We provide the experiment_id of the rsmtool experiment used to train the model. This can usually be read off the output/<experiment_id>.model file in the rsmtool experiment output directory.
• Line 4: We list the path to the data file with the feature values for the new data. For this tutorial we used .csv format, but RSMTool also supports several other input file formats.
• Line 5: This field indicates that the unique IDs for the responses in the .csv file are located in a column named ID.
• Lines 6-7: These fields indicate that there are two sets of human scores in our .csv file, located in the columns named score and score2. The values from these columns will be added to the output file containing the predictions, which can be useful if we want to evaluate the predictions using rsmeval.

Documentation for all of the available configuration options is available here.

Note

You can also use our nifty capability to automatically generate rsmpredict configuration files rather than creating them manually.

#### Run the experiment¶

Now that we have the model, the features in the right format, and our configuration file in .json format, we can use the rsmpredict command-line script to generate the predictions and to save them in predictions.csv.

$ cd examples/rsmpredict
$ rsmpredict config_rsmpredict.json predictions.csv


This should produce output like:

WARNING: The following extraneous features will be ignored: {'spkitemid', 'sc1', 'sc2', 'LENGTH'}
Pre-processing input features
Generating predictions
Rescaling predictions
Trimming and rounding predictions


You should now see a file named predictions.csv in the current directory which contains the predicted scores for the new data in the predictions column.
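Once the predictions file exists, a quick sanity check is to correlate the human scores with the raw machine predictions. The snippet below is a self-contained sketch: the values are made up to stand in for the human score and raw prediction columns of predictions.csv, which the real file would supply.

```python
from math import sqrt

# Hypothetical values standing in for the human score and raw prediction
# columns of predictions.csv; the real file is produced by rsmpredict.
human = [2, 3, 4, 3, 2]
machine = [2.1, 2.9, 4.2, 3.3, 1.8]

# Pearson correlation computed from first principles
n = len(human)
mh, mm = sum(human) / n, sum(machine) / n
cov = sum((h - mh) * (m - mm) for h, m in zip(human, machine))
r = cov / sqrt(sum((h - mh) ** 2 for h in human) *
               sum((m - mm) ** 2 for m in machine))
```

For a full evaluation with the standard analyses, run rsmeval on the predictions file instead.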

### Input¶

rsmpredict requires two arguments to generate predictions: the path to a configuration file and the path to the output file where the generated predictions are saved in .csv format.

If you also want to save the pre-processed feature values, rsmpredict can take a third optional argument --features to specify the path to a .csv file in which to save these values.

Here are all the arguments to the rsmpredict command-line script.

config_file

The JSON configuration file for this experiment.

output_file

The output .csv file where predictions will be saved.

--features <preproc_feats_file>

If specified, the pre-processed values for the input features will also be saved in this .csv file.

-h, --help

Show help message and exit.

-V, --version

Show version number and exit.

### Experiment configuration file¶

This is a file in .json format that provides overall configuration options for an rsmpredict experiment. Here’s an example configuration file for rsmpredict.

Note

To make it easy to get started with rsmpredict, we provide a way to automatically generate configuration files both interactively and non-interactively. Novice users will find interactive generation more helpful, while more advanced users will prefer non-interactive generation. See this page for more details.

Next, we describe all of the rsmpredict configuration fields in detail. There are three required fields and the rest are all optional. We first describe the required fields and then the optional ones (sorted alphabetically).

#### experiment_dir¶

The path to the directory containing the rsmtool model to use for generating predictions. This directory must contain a sub-directory called output with the model files, feature pre-processing parameters, and score post-processing parameters. The path can be absolute or relative to the location of the configuration file.

#### experiment_id¶

The experiment_id used to create the rsmtool model files being used for generating predictions. If you do not know the experiment_id, you can find it by looking at the prefix of the .model file under the output directory.

#### input_features_file¶

The path to the file with the raw feature values that will be used for generating predictions. The file should be in one of the supported formats. Each row should correspond to a single response and contain the feature values for that response. In addition, there should be a column with a unique identifier (ID) for each response. The path can be absolute or relative to the location of the configuration file. Note that the feature names must be the same as those used in the original rsmtool experiment.

Note

rsmpredict will only generate predictions for responses in this file that have numeric values for the features included in the rsmtool model.

rsmpredict does not require human scores for the new data since it does not evaluate the generated predictions. If you do have the human scores and want to evaluate the new predictions, you can use the rsmeval command-line utility.

#### candidate_column (Optional)¶

The name for the column containing unique candidate IDs. This column will be named candidate in the output file with predictions.

#### file_format (Optional)¶

The format of the intermediate files. Options are csv, tsv, or xlsx. Defaults to csv if this is not specified.

#### flag_column (Optional)¶

See description in the rsmtool configuration file for further information. No filtering will be done by rsmpredict, but the contents of all specified columns will be added to the predictions file using the original column names.

#### human_score_column (Optional)¶

The name for the column containing human scores. This column will be renamed to sc1.

#### id_column (Optional)¶

The name of the column containing the response IDs. Defaults to spkitemid, i.e., if this is not specified, rsmpredict will look for a column called spkitemid in the prediction file.

There are several other options in the configuration file that, while not directly used by rsmpredict, can simply be passed through from the input features file to the output predictions file. This can be particularly useful if you want to subsequently run rsmeval to evaluate the generated predictions.

#### predict_expected_scores (Optional)¶

If the original model was a probabilistic SKLL classifier, then expected scores (probability-weighted averages over the contiguous numeric score points) can be generated as the machine predictions instead of the most likely score point, which is the default. Set this field to true to compute expected scores as predictions. Defaults to false.

Note

1. If the model in the original rsmtool experiment is an SVC, that original experiment must have been run with predict_expected_scores set to true. This is because SVC classifiers are fit differently if probabilistic output is desired, in contrast to other probabilistic SKLL classifiers.
2. You may see slight differences in expected score predictions if you run the experiment on different machines or operating systems, most likely due to very small probability values for certain score points, which can affect floating point computations.
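The difference between the default prediction and an expected score can be sketched in a few lines. This is only an illustration of the arithmetic; the probabilities below are invented, and in practice they come from the probabilistic SKLL classifier.

```python
# Hypothetical class probabilities over the score points, as a
# probabilistic classifier might output for one response.
score_points = [1, 2, 3, 4]
probabilities = [0.1, 0.2, 0.5, 0.2]

# Default behavior: predict the most likely score point.
most_likely = score_points[probabilities.index(max(probabilities))]

# predict_expected_scores: probability-weighted average of score points.
expected = sum(s * p for s, p in zip(score_points, probabilities))
```

Note that the expected score (here 2.8) is continuous, while the most likely score point (here 3) is always one of the discrete score points.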

#### second_human_score_column (Optional)¶

The name for the column containing the second human score. This column will be renamed to sc2.

#### standardize_features (Optional)¶

If this option is set to false, features will not be standardized, i.e., the mean will not be subtracted and the values will not be divided by the standard deviation. Defaults to true.

#### subgroups (Optional)¶

A list of column names indicating grouping variables used for generating analyses specific to each of those defined subgroups. For example, ["prompt", "gender", "native_language", "test_country"]. All of these columns will be included in the predictions file with their original names.

### Output¶

rsmpredict produces a .csv file with predictions for all responses in the new data set and, optionally, a .csv file with pre-processed feature values. If any of the responses had non-numeric feature values in the original data or after applying transformations, these are saved in a file named PREDICTIONS_NAME_excluded_responses.csv, where PREDICTIONS_NAME is the name of the predictions file supplied by the user without the extension.

The predictions .csv file contains the following columns:

• spkitemid : the unique response IDs from the original feature file.
• sc1 and sc2 : the human scores for each response from the original feature file (human_score_column and second_human_score_column, respectively).
• raw : raw predictions generated by the model.
• raw_trim, raw_trim_round, scale, scale_trim, scale_trim_round : raw scores post-processed in different ways.
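The post-processing behind these columns can be sketched as follows. This is a simplified illustration: the trim range and distribution statistics below are made up, "trimming" is shown as plain clipping to the score range (rsmtool actually allows a small tolerance beyond the range), and scaling maps raw predictions onto the human score distribution from the training data.

```python
# Hypothetical score range and training-set statistics (assumptions for
# illustration; the real values come from the rsmtool experiment output).
trim_min, trim_max = 1.0, 6.0
human_mean, human_sd = 3.4, 0.9      # human score distribution
machine_mean, machine_sd = 3.1, 1.2  # raw prediction distribution

raw = 6.73  # a raw prediction outside the score range

# Trimming: clip to the score range; rounding: nearest integer.
raw_trim = min(max(raw, trim_min), trim_max)
raw_trim_round = int(round(raw_trim))

# Scaling: map the raw prediction onto the human score distribution.
scale = (raw - machine_mean) / machine_sd * human_sd + human_mean
scale_trim = min(max(scale, trim_min), trim_max)
scale_trim_round = int(round(scale_trim))
```

Each output column is just one of these stages applied to the raw prediction.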

## rsmcompare - Create a detailed comparison of two scoring models¶

RSMTool provides the rsmcompare command-line utility to compare two models and to generate a detailed comparison report including differences between the two models. This can be useful in many scenarios, e.g., say the user wants to compare the changes in model performance after adding a new feature into the model. To use rsmcompare, the user must first run two experiments using either rsmtool or rsmeval. rsmcompare can then be used to compare the outputs of these two experiments to each other.

Note

Currently rsmcompare takes the outputs of the analyses generated during the original experiments and creates comparison tables. These comparison tables were designed with a specific comparison scenario in mind: comparing a baseline model with a model that includes new feature(s). The tool can certainly be used for other comparison scenarios if the researcher feels that the generated comparison output is appropriate.

rsmcompare can be used to compare:

1. Two rsmtool experiments, or
2. Two rsmeval experiments, or
3. An rsmtool experiment with an rsmeval experiment (in this case, only the evaluation analyses will be compared).

Note

It is strongly recommended that the original experiments as well as the comparison experiment are all run using the same version of RSMTool.

### Tutorial¶

For this tutorial, you first need to install RSMTool and make sure the conda environment in which you installed it is activated.

#### Workflow¶

rsmcompare is designed to compare two existing rsmtool or rsmeval experiments. To use rsmcompare, you need to:

1. Run two experiments using rsmtool or rsmeval.
2. Create an experiment configuration file describing the comparison experiment you would like to run.
3. Run that configuration file with rsmcompare to generate the comparison HTML report.
4. Examine the HTML report to compare the two models.

Note that the above workflow does not use the customization features of rsmcompare, e.g., choosing which sections to include in the report or adding custom analyses sections etc. However, we will stick with this workflow for our tutorial since it is likely to be the most common use case.

#### ASAP Example¶

We are going to use the same example from the 2012 Kaggle competition on automated essay scoring that we used for the rsmtool tutorial.

#### Run rsmtool (or rsmeval) experiments¶

rsmcompare compares the results of two existing rsmtool (or rsmeval) experiments. For this tutorial, we will compare the model trained in the rsmtool tutorial to itself.

Note

If you have not already completed that tutorial, please do so now. You may need to complete it again if you deleted the output files.

#### Create a configuration file¶

The next step is to create an experiment configuration file in .json format.

```json
{
    "comparison_id": "ASAP2_vs_ASAP2",
    "experiment_id_old": "ASAP2",
    "experiment_dir_old": "../rsmtool/",
    "description_old": "RSMTool experiment.",
    "use_scaled_predictions_old": true,
    "experiment_id_new": "ASAP2",
    "experiment_dir_new": "../rsmtool",
    "description_new": "RSMTool experiment (copy).",
    "use_scaled_predictions_new": true
}
```

Let’s take a look at the options in our configuration file.

• Line 2: We provide an ID for the comparison experiment.
• Line 3: We provide the experiment_id for the experiment we want to use as a baseline.
• Line 4: We also give the path to the directory containing the output of the original baseline experiment.
• Line 5: We give a short description of this baseline experiment. This will be shown in the report.
• Line 6: This field indicates that the baseline experiment used scaled scores for some evaluation analyses.
• Line 7: We provide the experiment_id for the new experiment. We use the same experiment ID for both experiments since we are comparing the experiment to itself.
• Line 8: We also give the path to the directory containing the output of the new experiment. As above, we use the same path because we are comparing the experiment to itself.
• Line 9: We give a short description of the new experiment. This will also be shown in the report.
• Line 10: This field indicates that the new experiment also used scaled scores for some evaluation analyses.

Documentation for all of the available configuration options is available here.

Note

You can also use our nifty capability to automatically generate rsmcompare configuration files rather than creating them manually.

#### Run the experiment¶

Now that we have the two experiments we want to compare and our configuration file in .json format, we can use the rsmcompare command-line script to run our comparison experiment.

$ cd examples/rsmcompare
$ rsmcompare config_rsmcompare.json


This should produce output like:

Output directory: /Users/nmadnani/work/rsmtool/examples/rsmcompare
Starting report generation
Merging sections
Exporting HTML
Executing notebook with kernel: python3


Once the run finishes, you will see an HTML file named ASAP2_vs_ASAP2_report.html. This is the final rsmcompare comparison report.

#### Examine the report¶

Our experiment report contains all the information we would need to compare the new model to the baseline model. It includes:

1. Comparison of feature distributions between the two experiments.
2. Comparison of model coefficients between the two experiments.
3. Comparison of model performance between the two experiments.

Note

Since we are comparing the experiment to itself, the comparison is not very interesting, e.g., the differences between various values will always be 0.

### Input¶

rsmcompare requires a single argument to run an experiment: the path to a configuration file. It can also take an output directory as an optional second argument. If the latter is not specified, rsmcompare will use the current directory as the output directory.

Here are all the arguments to the rsmcompare command-line script.

config_file

The JSON configuration file for this experiment.

output_dir (optional)

The output directory where the report files for this comparison will be stored.

-h, --help

Show help message and exit.

-V, --version

Show version number and exit.

### Experiment configuration file¶

This is a file in .json format that provides overall configuration options for an rsmcompare experiment. Here’s an example configuration file for rsmcompare.

Note

To make it easy to get started with rsmcompare, we provide a way to automatically generate configuration files both interactively and non-interactively. Novice users will find interactive generation more helpful, while more advanced users will prefer non-interactive generation. See this page for more details.

Next, we describe all of the rsmcompare configuration fields in detail. There are seven required fields and the rest are all optional. We first describe the required fields and then the optional ones (sorted alphabetically).

#### comparison_id¶

An identifier for the comparison experiment that will be used to name the report. It can be any combination of alphanumeric values, must not contain spaces, and must not be any longer than 200 characters.
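These constraints are easy to check programmatically. The sketch below is a hypothetical validator, not part of RSMTool: it treats underscores as valid (an assumption based on the tutorial's "ASAP2_vs_ASAP2" example) alongside letters and digits.

```python
import re

def is_valid_comparison_id(comparison_id: str) -> bool:
    """Check the comparison_id constraints described above: alphanumeric
    characters (plus underscore, by assumption), no spaces, at most 200
    characters. This helper is illustrative, not part of RSMTool."""
    return bool(re.fullmatch(r"[A-Za-z0-9_]{1,200}", comparison_id))
```

For instance, "ASAP2_vs_ASAP2" passes, while an ID containing spaces or exceeding 200 characters does not.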

#### experiment_id_old¶

An identifier for the “baseline” experiment. This ID should be identical to the experiment_id used when the baseline experiment was run, whether rsmtool or rsmeval. The results for this experiment will be listed first in the comparison report.

#### experiment_id_new¶

An identifier for the experiment with the “new” model (e.g., the model with new feature(s)). This ID should be identical to the experiment_id used when the experiment was run, whether rsmtool or rsmeval. The results for this experiment will be listed second in the comparison report.

#### experiment_dir_old¶

The directory with the results for the “baseline” experiment. This directory is the output directory that was used for the experiment and should contain subdirectories output and figure generated by rsmtool or rsmeval.

#### experiment_dir_new¶

The directory with the results for the experiment with the new model. This directory is the output directory that was used for the experiment and should contain subdirectories output and figure generated by rsmtool or rsmeval.

#### description_old¶

A brief description of the “baseline” experiment. The description can contain spaces and punctuation.

#### description_new¶

A brief description of the experiment with the new model. The description can contain spaces and punctuation.

#### custom_sections (Optional)¶

A list of custom, user-defined sections to be included in the final report. These are IPython notebooks (.ipynb files) created by the user. The list must contain paths to the notebook files, either absolute or relative to the configuration file. All custom notebooks have access to some pre-defined variables.

#### general_sections (Optional)¶

RSMTool provides pre-defined sections for rsmcompare (listed below) and, by default, all of them are included in the report. However, you can choose a subset of these pre-defined sections by specifying a list as the value for this field.

• feature_descriptives: Compares the descriptive statistics for all raw feature values included in the model:

• a table showing mean, standard deviation, skewness and kurtosis;
• a table showing the number of truncated outliers for each feature; and
• a table with percentiles and outliers;
• a table with correlations between raw feature values and human score in each model and the correlation between the values of the same feature in these two models. Note that this table only includes features and responses which occur in both training sets.
• features_by_group: Shows boxplots for both experiments with distributions of raw feature values by each of the subgroups specified in the configuration file.

• preprocessed_features: Compares analyses of preprocessed features:

• histograms showing the distributions of preprocessed feature values;
• the correlation matrix between all features and the human score;
• a table showing marginal correlations between all features and the human score; and
• a table showing partial correlations between all features and the human score.
• preprocessed_features_by_group: Compares analyses of preprocessed features by subgroups: marginal and partial correlations between each feature and human score for each subgroup.

• consistency: Compares metrics for human-human agreement, the difference (‘degradation’) between the human-human and human-system agreement, and the disattenuated correlations for the whole dataset and by each of the subgroups specified in the configuration file.

• score_distributions:

• tables showing the distributions for both human and machine scores; and
• confusion matrices for human and machine scores.
• model: Compares the parameters of the two regression models. For linear models, it also includes the standardized and relative coefficients.

• evaluation: Compares the standard set of evaluations recommended for scoring models on the evaluation data.

• true_score_evaluation: Compares the evaluation of system scores against the true scores estimated according to test theory. The notebook shows:

• Number of single- and double-scored responses;
• Variance of human rater errors and estimated variance of true scores; and
• Mean squared error (MSE) and proportional reduction in mean squared error (PRMSE) when predicting true score with system score.
• pca: Shows the results of principal components analysis on the processed feature values for the new model only:

• the principal components themselves;
• the variances; and
• a Scree plot.
• notes: Notes explaining the terminology used in comparison reports.

• sysinfo: Shows all Python packages along with versions installed in the current environment while generating the report.
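The PRMSE mentioned under true_score_evaluation above has a simple shape: one minus the ratio of the mean squared error to the variance of the true scores. The sketch below assumes true scores are known exactly; in practice rsmtool estimates the true-score variance from double-scored responses, so this conveys only the form of the metric.

```python
from statistics import mean, pvariance

# Made-up true scores and system scores for a handful of responses.
true_scores = [2.0, 3.0, 4.0, 3.0]
system_scores = [2.2, 2.8, 4.1, 3.3]

# Mean squared error of the system scores against the true scores.
mse = mean((s - t) ** 2 for s, t in zip(system_scores, true_scores))

# PRMSE: proportional reduction in MSE relative to true-score variance.
prmse = 1 - mse / pvariance(true_scores)
```

A PRMSE close to 1 means the system scores track the true scores well; a value near 0 means they predict no better than the mean.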

#### section_order (Optional)¶

A list containing the order in which the sections in the report should be generated. Any specified order must explicitly list:

1. Either all pre-defined sections if a value for the general_sections field is not specified OR the sections specified using general_sections, and
2. All custom section names specified using custom_sections, i.e., file prefixes only, without the path and without the .ipynb extension, and
3. All special sections specified using special_sections.
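For example, a configuration restricting the report to two pre-defined sections plus one custom notebook might look like this (the notebook path and the resulting ordering are invented for illustration; note that the custom section appears in section_order by its file prefix only):

```json
{
    "general_sections": ["model", "evaluation"],
    "custom_sections": ["notebooks/fairness.ipynb"],
    "section_order": ["model", "fairness", "evaluation"]
}
```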

#### special_sections (Optional)¶

A list specifying special ETS-only comparison sections to be included into the final report. These sections are available only to ETS employees via the rsmextra package.

#### subgroups (Optional)¶

A list of column names indicating grouping variables used for generating analyses specific to each of those defined subgroups. For example, ["prompt", "gender", "native_language", "test_country"].

Note

In order to include subgroups analyses in the comparison report, both experiments must have been run with the same set of subgroups.

#### use_scaled_predictions_old (Optional)¶

Set to true if the “baseline” experiment used scaled machine scores for confusion matrices, score distributions, subgroup analyses, etc. Defaults to false.

#### use_scaled_predictions_new (Optional)¶

Set to true if the experiment with the new model used scaled machine scores for confusion matrices, score distributions, subgroup analyses, etc. Defaults to false.

Warning

For rsmtool and rsmeval, primary evaluation analyses are computed on both raw and scaled scores, but some analyses (e.g., the confusion matrix) are only computed for either raw or re-scaled scores based on the value of use_scaled_predictions. rsmcompare uses the existing outputs and does not perform any additional evaluations. Therefore if this field was set to true in the original experiment but is set to false for rsmcompare, the report will be internally inconsistent: some evaluations use raw scores whereas others will use scaled scores.

#### use_thumbnails (Optional)¶

If set to true, the images in the HTML will be set to clickable thumbnails rather than full-sized images. Upon clicking the thumbnail, the full-sized images will be displayed in a separate tab in the browser. If set to false, full-sized images will be displayed as usual. Defaults to false.

### Output¶

rsmcompare produces the comparison report in HTML format as well as in the form of a Jupyter notebook (a .ipynb file) in the output directory.

## rsmsummarize - Compare multiple scoring models¶

RSMTool provides the rsmsummarize command-line utility to compare multiple models and to generate a comparison report. Unlike rsmcompare which creates a detailed comparison report between the two models, rsmsummarize can be used to create a more general overview of multiple models.

rsmsummarize can be used to compare:

1. Multiple rsmtool experiments, or
2. Multiple rsmeval experiments, or
3. A mix of rsmtool and rsmeval experiments (in this case, only the evaluation analyses will be compared).

Note

It is strongly recommended that the original experiments as well as the summary experiment are all run using the same version of RSMTool.

### Tutorial¶

For this tutorial, you first need to install RSMTool and make sure the conda environment in which you installed it is activated.

#### Workflow¶

rsmsummarize is designed to compare several existing rsmtool or rsmeval experiments. To use rsmsummarize, you need to:

1. Run two or more experiments using rsmtool or rsmeval.
2. Create an experiment configuration file describing the comparison experiment you would like to run.
3. Run that configuration file with rsmsummarize to generate the comparison HTML report.
4. Examine the HTML report to compare the models.

Note that the above workflow does not use the customization features of rsmsummarize, e.g., choosing which sections to include in the report or adding custom analyses sections etc. However, we will stick with this workflow for our tutorial since it is likely to be the most common use case.

#### ASAP Example¶

We are going to use the same example from the 2012 Kaggle competition on automated essay scoring that we used for the rsmtool tutorial.

#### Run rsmtool and rsmeval experiments¶

rsmsummarize compares the results of two or more existing rsmtool (or rsmeval) experiments. For this tutorial, we will compare the model trained in the rsmtool tutorial to the evaluations we obtained in the rsmeval tutorial.

Note

If you have not already completed these tutorials, please do so now. You may need to complete them again if you deleted the output files.

#### Create a configuration file¶

The next step is to create an experiment configuration file in .json format.

```json
{
    "summary_id": "model_comparison",
    "description": "a comparison of the results of the rsmtool sample experiment, rsmeval sample experiment and once again the rsmtool sample experiment",
    "experiment_dirs": ["../rsmtool", "../rsmeval", "../rsmtool"],
    "experiment_names": ["RSMTool experiment 1", "RSMEval experiment", "RSMTool experiment 2"]
}
```

Let’s take a look at the options in our configuration file.

• Line 2: We provide the summary_id for the comparison. This will be used to generate the name of the final report.
• Line 3: We give a short description of this comparison experiment. This will be shown in the report.
• Line 4: We also give the list of paths to the directories containing the outputs of the experiments we want to compare.
• Line 5: Since we want to compare experiments that all used the same experiment id (ASAP2), we instead list the names that we want to use for each experiment in the summary report.

Documentation for all of the available configuration options is available here.

Note

You can also use our nifty capability to automatically generate rsmsummarize configuration files rather than creating them manually.

#### Run the experiment¶

Now that we have the list of the experiments we want to compare and our configuration file in .json format, we can use the rsmsummarize command-line script to run our comparison experiment.

$ cd examples/rsmsummarize
$ rsmsummarize config_rsmsummarize.json


This should produce output like:

Output directory: /Users/nmadnani/work/rsmtool/examples/rsmsummarize
Starting report generation
Merging sections
Exporting HTML
Executing notebook with kernel: python3


Once the run finishes, you will see a new folder report containing an HTML file named model_comparison_report.html. This is the final rsmsummarize summary report.

#### Examine the report¶

Our experiment report contains an overview of the main aspects of model performance. It includes:

1. Brief description of all experiments.
2. Information about model parameters and model fit for all rsmtool experiments.
3. Model performance for all experiments.

Note

Some of the information, such as model fit and model parameters, is only available for rsmtool experiments.

### Input¶

rsmsummarize requires a single argument to run an experiment: the path to a configuration file, which specifies the models to compare and the name of the report. It can also take an output directory as an optional second argument. If the latter is not specified, rsmsummarize will use the current directory as the output directory.

Here are all the arguments to the rsmsummarize command-line script.

config_file

The JSON configuration file for this experiment.

output_dir (optional)

The output directory where the report and intermediate .csv files for this comparison will be stored.

-f, --force

If specified, the contents of the output directory will be overwritten even if it already contains the output of another rsmsummarize experiment.

-h, --help

Show help message and exit.

-V, --version

Show version number and exit.

### Experiment configuration file¶

This is a file in .json format that provides overall configuration options for an rsmsummarize experiment. Here’s an example configuration file for rsmsummarize.

Note

To make it easy to get started with rsmsummarize, we provide a way to automatically generate configuration files both interactively and non-interactively. Novice users will find interactive generation more helpful, while more advanced users will prefer non-interactive generation. See this page for more details.

Next, we describe all of the rsmsummarize configuration fields in detail. There are two required fields and the rest are all optional. We first describe the required fields and then the optional ones (sorted alphabetically).

#### summary_id¶

An identifier for the rsmsummarize experiment. This will be used to name the report. It can be any combination of alphanumeric values, must not contain spaces, and must not be any longer than 200 characters.

#### experiment_dirs¶

The list of directories with the results of the experiments to be summarized. These directories should be the output directories used for each experiment and should contain subdirectories output and figure generated by rsmtool or rsmeval.

#### custom_sections (Optional)¶

A list of custom, user-defined sections to be included in the final report. These are IPython notebooks (.ipynb files) created by the user. The list must contain paths to the notebook files, either absolute or relative to the configuration file. All custom notebooks have access to some pre-defined variables.

#### description (Optional)¶

A brief description of the summary. The description can contain spaces and punctuation.

#### experiment_names (Optional)¶

The list of experiment names to use in the summary report and intermediate files. The names should be listed in the same order as the experiments in experiment_dirs. When this field is not specified, the report will show the original experiment_id for each experiment.

#### file_format (Optional)¶

The format of the intermediate files generated by rsmsummarize. Options are csv, tsv, or xlsx. Defaults to csv if this is not specified.

Note

In the rsmsummarize context, the file_format parameter refers to the format of the intermediate files generated by rsmsummarize, not the intermediate files generated by the original experiment(s) being summarized. The format of these files does not have to match the format of the files generated by the original experiment(s).

#### general_sections (Optional)¶

RSMTool provides pre-defined sections for rsmsummarize (listed below) and, by default, all of them are included in the report. However, you can choose a subset of these pre-defined sections by specifying a list as the value for this field.

• preprocessed_features: Compares marginal and partial correlations between all features and the human score, and optionally response length if this was computed for any of the models.

• model: Compares the parameters of the regression models. For linear models, it also includes the standardized and relative coefficients.

• evaluation: Compares the standard set of evaluations recommended for scoring models on the evaluation data.

• true_score_evaluation: Compares the evaluation of system scores against the true scores estimated according to test theory. The notebook shows:

• Number of single- and double-scored responses;
• Variance of human rater errors and estimated variance of true scores; and
• Mean squared error (MSE) and proportional reduction in mean squared error (PRMSE) when predicting true score with system score.
• intermediate_file_paths: Shows links to all of the intermediate files that were generated while running the summary.

• sysinfo: Shows all Python packages along with versions installed in the current environment while generating the report.

#### section_order (Optional)¶

A list containing the order in which the sections in the report should be generated. Any specified order must explicitly list:

1. Either all pre-defined sections if a value for the general_sections field is not specified OR the sections specified using general_sections, and
2. All custom section names specified using custom_sections, i.e., file prefixes only, without the path and without the .ipynb extension, and
3. All special sections specified using special_sections.
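The three requirements above can be illustrated with a hypothetical configuration (my_custom_analysis is an invented custom section used only for illustration; note that it appears in section_order as a bare prefix, without the path or the .ipynb extension):

```json
{
    "general_sections": ["evaluation", "sysinfo"],
    "custom_sections": ["notebooks/my_custom_analysis.ipynb"],
    "section_order": ["evaluation", "my_custom_analysis", "sysinfo"]
}
```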

#### special_sections (Optional)¶

A list specifying special ETS-only comparison sections to be included in the final report. These sections are available only to ETS employees via the rsmextra package.

#### use_thumbnails (Optional)¶

If set to true, the images in the HTML report will be displayed as clickable thumbnails rather than full-sized images. Clicking a thumbnail opens the full-sized image in a separate browser tab. If set to false, full-sized images will be displayed as usual. Defaults to false.

### Output¶

rsmsummarize produces a set of folders in the output directory.

#### report¶

This folder contains the final rsmsummarize report in HTML format as well as in the form of a Jupyter notebook (a .ipynb file).

#### output¶

This folder contains all of the intermediate files produced as part of the various analyses performed, saved as .csv files. rsmsummarize will also save in this folder a copy of the configuration file. Fields not specified in the original configuration file will be pre-populated with default values.

#### figure¶

This folder contains all of the figures that may be generated as part of the various analyses performed, saved as .svg files. Note that no figures are generated by the existing rsmsummarize notebooks.

### Intermediate files¶

Although the primary output of rsmsummarize is an HTML report, we also want the user to be able to conduct additional analyses outside of RSMTool. To this end, all of the tables produced in the experiment report are saved as files in the output directory in the format specified by the file_format parameter. The following sections describe all of the intermediate files that are produced.

Note

The names of all files begin with the summary_id provided by the user in the experiment configuration file.

#### Marginal and partial correlations with score¶

filenames: margcor_score_all_data, pcor_score_all_data, pcor_score_no_length_all_data

The first file contains the marginal correlations between each pre-processed feature and the human score. The second file contains the partial correlations between each pre-processed feature and the human score after controlling for all other features. The third file contains the partial correlations between each pre-processed feature and the human score after controlling for response length, if length_column was specified in the configuration file.

#### Model information¶

• model_summary

This file contains the main information about the models included in the report, including:

• Total number of features
• Total number of features with non-negative coefficients
• The learner
• The label used to train the model
• betas: standardized coefficients (for built-in models only).
• model_fit: R squared and adjusted R squared computed on the training set. Note that these values are always computed on raw predictions without any trimming or rounding.

Note

If the report includes a combination of rsmtool and rsmeval experiments, the summary tables with model information will only include rsmtool experiments since no model information is available for rsmeval experiments.

#### Evaluation metrics¶

• eval_short - descriptives for predicted and human scores (mean, std. dev., etc.) and association metrics (correlation, quadratic weighted kappa, SMD, etc.) for specific score types chosen based on recommendations by Williamson (2012). Specifically, the following columns are included (the raw or scale version is chosen depending on the value of use_scaled_predictions in the configuration file).

• h_mean
• h_sd
• corr
• sys_mean [raw/scale trim]
• sys_sd [raw/scale trim]
• SMD [raw/scale trim]
• exact_agr [raw/scale trim_round]
• kappa [raw/scale trim_round]
• wtkappa [raw/scale trim_round]
• sys_mean [raw/scale trim_round]
• sys_sd [raw/scale trim_round]
• SMD [raw/scale trim_round]
• R2 [raw/scale trim]
• RMSE [raw/scale trim]

Note

Please note that for raw scores, SMD values are likely to be affected by possible differences in scale.

#### Evaluations based on test theory¶

• true_score_eval: evaluations of system scores against the estimated true scores. Contains the total counts of single- and double-scored responses, the variance of human rater error, the estimated true score variance, and the mean squared error (MSE) and proportional reduction in mean squared error (PRMSE) when predicting the true score using the system score.
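As a sketch of how these quantities relate (writing MSE for the mean squared error of the system score as a predictor of the estimated true score, and Var(T) for the estimated true score variance), PRMSE expresses how much of the true score variance is explained by the system score:

```
PRMSE = 1 - MSE / Var(T)
```

A PRMSE close to 1 means the system score predicts the true score much better than simply predicting its mean.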

## rsmxval - Run cross-validation experiments¶

RSMTool provides the rsmxval command-line utility to run cross-validation experiments with scoring models. Why use cross-validation rather than the simple train-and-evaluate loop provided by the rsmtool utility? Cross-validation can provide more accurate estimates of scoring model performance, since those estimates are averaged over multiple train-test splits randomly sampled from the data. A single train-test split may yield biased estimates of performance, since those estimates depend on the specific characteristics of that split. Cross-validation is more likely to indicate how well the scoring model will generalize to unseen test data and to flag problems with overfitting and selection bias, if any.

Cross-validation experiments in RSMTool consist of the following steps:

1. The given training data file is first shuffled randomly (with a fixed seed for reproducibility) and then split into the requested number of folds. It is also possible for the user to provide a CSV file containing a pre-determined set of folds, e.g., from another part of the data pipeline.
2. For each fold (or train-test split), rsmtool is run to train a scoring model on the training split and evaluate on the test split. All of the outputs for each of the rsmtool runs are saved on disk and represent the per-fold performance.
3. The predictions generated by rsmtool for each of the folds are all combined into a single file, which is then used as input for rsmeval. The output of this evaluation run is saved to disk and provides a more accurate estimate of the predictive performance of a scoring model trained on the given data.
4. A summary report comparing all of the folds is generated by running rsmsummarize on all of the per-fold rsmtool directories created in Step 2, and its output is also saved to disk. This summary output can be useful for checking whether the performance of any fold stands out for some reason, which could point to a potential problem.
5. Finally, a scoring model is trained on the complete training data file using rsmtool, which also generates a report that contains only the feature and model descriptives. The model is what will most likely be deployed for inference assuming the analyses produced in this step and Steps 1–4 meet the stakeholders’ requirements.

### Tutorial¶

For this tutorial, you first need to install RSMTool and make sure the conda environment in which you installed it is activated.

#### Workflow¶

rsmxval is designed to run cross-validation experiments using a single file containing human scores and features. Just like rsmtool, rsmxval does not provide any functionality for feature extraction and assumes that users will extract features on their own. The workflow steps are as follows:

1. Create a data file in one of the supported formats containing the extracted features for each response in the data along with human score(s) assigned to it.
2. Create an experiment configuration file describing the cross-validation experiment you would like to run.
3. Run that configuration file with rsmxval and generate its outputs.
4. Examine the various HTML reports to check various aspects of model performance.

Note that unlike rsmtool and rsmeval, rsmxval currently does not support customization of the HTML reports generated in each step. This functionality may be added in future versions.

#### ASAP Example¶

We are going to use the same example from the 2012 Kaggle competition on automated essay scoring that we used for the rsmtool tutorial.

#### Extract features¶

We are using the same features for this data as described in the rsmtool tutorial.

#### Create a configuration file¶

The next step is to create an experiment configuration file in .json format.

```json
{
    "experiment_id": "ASAP2_xval",
    "description": "Cross-validation with two human scores using a LinearRegression model.",
    "train_file": "train.csv",
    "folds": 3,
    "train_label_column": "score",
    "id_column": "ID",
    "model": "LinearRegression",
    "trim_min": 1,
    "trim_max": 6,
    "second_human_score_column": "score2",
    "use_scaled_predictions": true
}
```

Let’s take a look at the options in our configuration file.

• Line 2: We define an experiment ID used to identify the files produced as part of this experiment.
• Line 3: We provide a description which will be included in the various reports.
• Line 4: We list the path to our training file with the feature values and human scores. For this tutorial, we used .csv format, but several other input file formats are also supported.
• Line 5: This field indicates the number of cross-validation folds we want to use. If this field is not specified, rsmxval uses 5-fold cross-validation by default.
• Line 6: This field indicates that the human (reference) scores in our .csv file are located in a column named score.
• Line 7: This field indicates that the unique IDs for the responses in the .csv file are located in a column named ID.
• Line 8: We choose to use a linear regression model to combine the feature values into a score.
• Lines 9-10: These fields indicate that the lowest score on the scoring scale is a 1 and the highest score is a 6. This information is usually part of the rubric used by human graders.
• Line 11: This field indicates that scores from a second set of human graders are also available (useful for comparing the agreement between human-machine scores to the agreement between two sets of humans) and are located in the score2 column in the training .csv file.
• Line 12: Next, we indicate that we would like to use the scaled scores for all our evaluation analyses at each step.

Documentation for all of the available configuration options is available here.

Note

You can also use our nifty capability to automatically generate rsmxval configuration files rather than creating them manually.

#### Run the experiment¶

Now that we have our input file and our configuration file, we can use the rsmxval command-line script to run our evaluation experiment.

$ cd examples/rsmxval
$ rsmxval config_rsmxval.json output


This should produce output like:

Output directory: output
Saving configuration file.
Generating 3 folds after shuffling
Running RSMTool on each fold in parallel
Progress: 100%|███████████████████████████████████████████████| 3/3 [00:08<00:00,  2.76s/it]
Creating fold summary
Evaluating combined fold predictions
Training model on full data


Once the run finishes, you will see an output sub-directory in the current directory. Under this directory you will see multiple sub-directories, each corresponding to a different cross-validation step, as described here.

#### Examine the reports¶

The cross-validation experiment produces multiple HTML reports: an rsmtool report for each of the 3 folds (output/folds/{01,02,03}/report/ASAP2_xval_fold{01,02,03}.html), the evaluation report for the cross-validated predictions (output/evaluation/report/ASAP2_xval_evaluation_report.html), a report summarizing the salient characteristics of the 3 folds (output/fold-summary/report/ASAP2_xval_fold_summary_report.html), and a report showing the feature and model descriptives (output/final-model/report/ASAP2_xval_model_report.html). Examining these reports provides a relatively complete picture of how well the scoring model's predictive performance will generalize to unseen data.

### Input¶

rsmxval requires a single argument to run an experiment: the path to a configuration file. It can also take an output directory as an optional second argument. If the latter is not specified, rsmxval will use the current directory as the output directory.

Here are all the arguments to the rsmxval command-line script.

config_file

The JSON configuration file for this cross-validation experiment.

output_dir (optional)

The output directory where all the sub-directories and files for this cross-validation experiment will be stored. If a non-empty directory with the same name already exists, an error will be raised.

-h, --help

Show help message and exit.

-V, --version

Show version number and exit.

### Experiment configuration file¶

This is a file in .json format that provides overall configuration options for an rsmxval experiment. Here’s an example configuration file for rsmxval.

Note

To make it easy to get started with rsmxval, we provide a way to automatically generate configuration files both interactively as well as non-interactively. Novice users will find interactive generation more helpful while more advanced users will prefer non-interactive generation. See this page for more details.

Configuration files for rsmxval are almost identical to rsmtool configuration files with only a few differences. Next, we describe the three required rsmxval configuration fields in detail.

#### experiment_id¶

An identifier for the experiment that will be used as part of the names of the reports and intermediate files produced in each of the steps. It can be any combination of alphanumeric values, must not contain spaces, and must not be any longer than 200 characters. Suffixes are added to this experiment ID by each of the steps for the reports and files they produce, i.e., _fold<N> in the per-fold rsmtool step where <N> is a two-digit number, _evaluation by the rsmeval evaluation step, _fold_summary by the rsmsummarize step, and _model by the final full-data rsmtool step.
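Putting these suffixes together, the reports produced by each step would be named as follows (using the generic placeholder <experiment_id>, with the per-fold pattern shown for the first fold):

```
<experiment_id>_fold01_report.html         # per-fold rsmtool step (one per fold)
<experiment_id>_evaluation_report.html     # rsmeval evaluation step
<experiment_id>_fold_summary_report.html   # rsmsummarize step
<experiment_id>_model_report.html          # final full-data rsmtool step
```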

#### model¶

The machine learner you want to use to build the scoring model. Possible values include built-in linear regression models as well as all of the learners available via SKLL. With SKLL learners, you can customize the tuning objective and also compute expected scores as predictions.

#### train_file¶

The path to the training data feature file in one of the supported formats. Each row should correspond to a single response and contain numeric feature values extracted for that response. In addition, there should be a column with a unique identifier (ID) for each response and a column with the human score for each response. The path can be absolute or relative to the config file's location.

Important

Unlike rsmtool, rsmxval does not accept an evaluation set and will raise an error if the test_file field is specified.

Next, we will describe the two optional fields that are unique to rsmxval.

#### folds (Optional)¶

The number of folds to use for cross-validation. This should be an integer and defaults to 5.

#### folds_file (Optional)¶

The path to a file containing custom, pre-specified folds to be used for cross-validation. This should be a .csv file (no other formats are accepted) and should contain only two columns: id and fold. The id column should contain the same response IDs that are contained in train_file above. The fold column should contain an integer indicating the fold to which the response with that ID belongs. IDs not specified in this file will be skipped and not included in the cross-validation at all. Just like train_file, this path can be absolute or relative to the config file's location. Here's an example of a folds file containing 2 folds.
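To illustrate the expected layout, a minimal folds file with 2 folds might look like the following (the response IDs here are invented for illustration; in practice they must match the IDs in train_file):

```
id,fold
RESP_001,1
RESP_002,2
RESP_003,1
RESP_004,2
```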

Note

If both folds_file and folds are specified, then the former will take precedence unless the file it specifies does not exist.

In addition to the fields described so far, an rsmxval configuration file also accepts the following optional fields used by rsmtool:

• candidate_column
• description
• exclude_zero_scores
• feature_subset
• feature_subset_file
• features
• file_format
• flag_column
• flag_column_test
• id_column
• length_column
• min_items_per_candidate
• min_n_per_group
• predict_expected_scores
• rater_error_variance
• second_human_score_column
• select_transformations
• sign
• skll_fixed_parameters
• skll_objective
• standardize_features
• subgroups
• train_label_column
• trim_max
• trim_min
• trim_tolerance
• use_scaled_predictions
• use_thumbnails
• use_truncation_thresholds

Please refer to these fields’ descriptions on the page describing the rsmtool configuration file.

### Output¶

rsmxval produces a set of folders in the output directory.

#### folds¶

This folder contains the output of each of the per-fold rsmtool experiments. It contains as many sub-folders as the number of specified folds, named 01, 02, 03, etc. Each of these numbered sub-folders contains the output of one rsmtool experiment conducted using the training split of that fold as the training data and the test split as the evaluation data. Each of the sub-folders contains the output directories produced by rsmtool. The report for each fold lives in the report sub-directory, e.g., the report for the first fold is found at folds/01/report/<experiment_id>_fold01_report.html, and so on. The messages that are usually printed out by rsmtool to the screen are instead logged to a file and saved to disk as, e.g., folds/01/rsmtool.log.

#### evaluation¶

This folder contains the output of the rsmeval evaluation experiment that uses the cross-validated predictions from each fold. This folder contains the output directories produced by rsmeval. The evaluation report can be found at evaluation/report/<experiment_id>_evaluation_report.html. The messages that are usually printed out by rsmeval to the screen are instead logged to a file and saved to disk as evaluation/rsmeval.log.

#### fold-summary¶

This folder contains the output of the rsmsummarize experiment that provides a quick summary of all of the folds in a single, easily-scanned report. The folder contains the output directories produced by rsmsummarize. The summary report can be found at fold-summary/report/<experiment_id>_fold_summary_report.html. The messages that are usually printed out by rsmsummarize to the screen are instead logged to a file and saved to disk as fold-summary/rsmsummarize.log.

#### final-model¶

This folder contains the output of the rsmtool experiment that trains a model on the full training data and provides a report showing the feature and model descriptives. It contains the output directories produced by rsmtool. The primary artifacts of this experiment are the report (final-model/report/<experiment_id>_model_report.html) and the final trained model (final-model/output/<experiment_id>_model.model). The messages that are usually printed out by rsmtool to the screen are instead logged to a file and saved to disk as final-model/rsmtool.log.

Note

Every rsmtool experiment requires both a training and an evaluation set. However, in this step, we are using the full training data to train the model and rsmxval does not use a separate test set. Therefore, we simply randomly sample 10% of the full training data as a dummy test set to make sure that rsmtool runs successfully. The report in this step only contains the model and feature descriptives and, therefore, does not use this dummy test set at all. Users should ignore any intermediate files under the final-model/output and final-model/figure sub-directories that are derived from this dummy test set. If needed, the data used as the dummy test set can be found at final-model/dummy_test.csv (or in the chosen format).

In addition to these folders, rsmxval will also save a copy of the configuration file in the output directory at the same level as the above folders. Fields not specified in the original configuration file will be pre-populated with default values.