Using RSMTool

For most users, the primary means of using RSMTool will be via the command-line utility rsmtool. We refer to each run of rsmtool as an “experiment”.

Input

rsmtool requires a single argument to run an experiment: the path to a configuration file. It can also take an output directory as an optional second argument. If the latter is not specified, rsmtool will use the current directory as the output directory.

Here are all the arguments to the rsmtool command-line script.

config_file

The JSON configuration file for this experiment.

output_dir (optional)

The output directory where all the files for this experiment will be stored.

-f, --force

If specified, the contents of the output directory will be overwritten even if it already contains the output of another rsmtool experiment.

-h, --help

Show help message and exit.

-V, --version

Show version number and exit.
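
For example, assuming a configuration file named config.json and hypothetical paths, typical invocations might look like this:

rsmtool config.json
rsmtool config.json /path/to/output_dir
rsmtool -f config.json /path/to/existing_output_dir

The first form writes output to the current directory, the second writes it to the specified directory, and the third overwrites a directory that already contains output from a previous rsmtool experiment.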

Experiment configuration file

This is a file in .json format that provides overall configuration options for an rsmtool experiment. Here’s an example configuration file for rsmtool.

Note

To make it easy to get started with rsmtool, we provide a way to automatically generate configuration files, both interactively and non-interactively. Novice users will find interactive generation more helpful while more advanced users will prefer non-interactive generation. See this page for more details.

Next, we describe all of the rsmtool configuration fields in detail. There are four required fields and the rest are all optional. We first describe the required fields and then the optional ones (sorted alphabetically).

experiment_id

An identifier for the experiment that will be used to name the report and all intermediate files. It can be any combination of alphanumeric values, must not contain spaces, and must not be longer than 200 characters.

model

The machine learner you want to use to build the scoring model. Possible values include built-in linear regression models as well as all of the learners available via SKLL. With SKLL learners, you can customize the tuning objective and also compute expected scores as predictions.

train_file

The path to the training data feature file in one of the supported formats. Each row should correspond to a single response and contain numeric feature values extracted for this response. In addition, there should be a column with a unique identifier (ID) for each response and a column with the human score for each response. The path can be absolute or relative to the location of the configuration file.

test_file

The path to the evaluation data feature file in one of the supported formats. Each row should correspond to a single response and contain numeric feature values extracted for this response. In addition, there should be a column with a unique identifier (ID) for each response and a column with the human score for each response. The path can be absolute or relative to the location of the configuration file.

Note

  1. For both the training and evaluation files, the default behavior of rsmtool is to look for a column named spkitemid in order to get the unique IDs for each response and a column named sc1 to get the train/test labels. The optional fields id_column, train_label_column, and test_label_column can be used to specify different names for these columns.

  2. rsmtool also assumes that all other columns present in these files (other than those containing IDs and labels) contain feature values. If this is not the case, one can use other configuration file fields to identify columns containing non-feature information useful for various analyses, e.g., second_human_score_column, flag_column, subgroups, etc., described below.

  3. Any columns not explicitly identified in (1) and (2) will be considered feature columns and used by rsmtool in the model. To use only a subset of these remaining columns as features, one can employ the four optional fields features, feature_subset_file, feature_subset, and sign. See selecting feature columns for more details on how to achieve this.
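
To make the required fields concrete, here is a minimal configuration sketch using only these four fields (the file names are hypothetical; LinearRegression is one of the built-in models described later):

{
    "experiment_id": "my_experiment",
    "model": "LinearRegression",
    "train_file": "train.csv",
    "test_file": "test.csv"
}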

candidate_column (Optional)

The name for an optional column in the training and test data containing unique candidate IDs. Candidate IDs are different from response IDs since the same candidate (test-taker) might have responded to multiple questions.

custom_sections (Optional)

A list of custom, user-defined sections to be included into the final report. These are IPython notebooks (.ipynb files) created by the user. The list must contain paths to the notebook files, either absolute or relative to the configuration file. All custom notebooks have access to some pre-defined variables.

description (Optional)

A brief description of the experiment. This will be included in the report. The description can contain spaces and punctuation. It’s blank by default.

exclude_zero_scores (Optional)

By default, responses with human scores of 0 will be excluded from both the training and evaluation sets. Set this field to false if you want to keep responses with scores of 0. Defaults to true.

feature_subset (Optional)

Name of the pre-defined feature subset to be used if using subset-based column selection.

feature_subset_file (Optional)

Path to the feature subset file if using subset-based column selection.

features (Optional)

Path to the file with the list of features if using fine-grained column selection. Alternatively, you can pass a list of feature names to include in the experiment.
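
For example, a sketch that passes a list of hypothetical feature names directly instead of a file path:

{
    ...
    "features": ["feature1", "feature2", "feature3"],
    ...
}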

file_format (Optional)

The format of the intermediate files. Options are csv, tsv, or xlsx. Defaults to csv if this is not specified.

flag_column (Optional)

This field makes it possible to only use responses with particular values in a given column (e.g. only responses with a value of 0 in a column called ADVISORY). The field takes a dictionary in Python format where the keys are the names of the columns and the values are lists of values for responses that will be used to train the model. For example, a value of {"ADVISORY": 0} will mean that rsmtool will only use responses for which the ADVISORY column has the value 0. If this field is used without flag_column_test, the conditions will be applied to both the training and evaluation sets and the specified columns must be present in both sets. When this field is used in conjunction with flag_column_test, the conditions will be applied to the training set only and the specified columns must be present in the training set. Defaults to None.

Note

If several conditions are specified (e.g., {"ADVISORY": 0, "ERROR": 0}) only those responses which satisfy all the conditions will be selected for further analysis (in this example, these will be the responses where the ADVISORY column has a value of 0 and the ERROR column has a value of 0).

Note

When reading the values in the supplied dictionary, rsmtool treats numeric strings, floats and integers as the same value. Thus 1, 1.0, "1" and "1.0" are all treated as 1.0.

flag_column_test (Optional)

This field makes it possible to specify a separate Python flag dictionary for the evaluation set. If this field is not passed, and flag_column is passed, then the same advisories will be used for both training and evaluation sets.

When this field is used, the specified columns must be present in the evaluation set. Defaults to None, or to the value of flag_column if that field is specified. Use flag_column_test only if you want different filtering conditions for the evaluation set.

Note

When used, flag_column_test field determines all filtering conditions for the evaluation set. If it is used in conjunction with flag_column field, the filtering conditions defined in flag_column will only be applied to the training set. If you want to apply a subset of conditions to both partitions with additional conditions applied to the evaluation set only, you will need to specify the overlapping conditions separately for each partition.
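
For example, the following sketch (the column name and values are illustrative) trains the model only on responses with an ADVISORY value of 0 but evaluates on responses with an ADVISORY value of 0 or 1:

{
    ...
    "flag_column": {"ADVISORY": 0},
    "flag_column_test": {"ADVISORY": [0, 1]},
    ...
}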

general_sections (Optional)

RSMTool provides pre-defined sections for rsmtool (listed below) and, by default, all of them are included in the report. However, you can choose a subset of these pre-defined sections by specifying a list as the value for this field.

  • data_description: Shows the total number of responses in the training and evaluation sets, along with any responses that were excluded due to non-numeric features/scores or flag columns.

  • data_description_by_group: Shows the total number of responses in training and evaluation set for each of the subgroups specified in the configuration file. This section only covers the responses used to train/evaluate the model.

  • feature_descriptives: Shows the descriptive statistics for all raw feature values included in the model:

    • a table showing mean, standard deviation, min, max, correlation with human score etc.;

    • a table with percentiles and outliers; and

    • a barplot showing the number of truncated outliers for each feature.

  • features_by_group: Shows boxplots with distributions of raw feature values by each of the subgroups specified in the configuration file.

  • preprocessed_features: Shows analyses of preprocessed features:

    • histograms showing the distributions of preprocessed feature values;

    • the correlation matrix between all features and the human score;

    • a barplot showing marginal and partial correlations between all features and the human score, and, optionally, response length if length_column is specified in the config file.

  • dff_by_group: Differential feature functioning by group. The plots in this section show average feature values for each of the subgroups conditioned on human score.

  • consistency: Shows metrics for human-human agreement, the difference (“degradation”) between the human-human and human-system agreement, and the disattenuated human-machine correlations. This notebook is only generated if the config file specifies second_human_score_column.

  • model: Shows the parameters of the learned regression model. For linear models, it also includes the standardized and relative coefficients as well as model diagnostic plots.

  • evaluation: Shows the standard set of evaluations recommended for scoring models on the evaluation data:

    • a table showing human-system association metrics;

    • the confusion matrix; and

    • a barplot showing the distributions for both human and machine scores.

  • evaluation_by_group: Shows barplots with the main evaluation metrics by each of the subgroups specified in the configuration file.

  • fairness_analyses: Additional fairness analyses suggested in Loukina, Madnani, & Zechner, 2019. The notebook shows:

    • percentage of variance in squared error explained by subgroup membership

    • percentage of variance in raw (signed) error explained by subgroup membership

    • percentage of variance in raw (signed) error explained by subgroup membership when controlling for human score

    • plots showing estimates for each subgroup for each model

  • true_score_evaluation: Evaluation of system scores against the true scores estimated according to test theory. The notebook shows:

    • Number of single and double-scored responses.

    • Variance of human rater errors and estimated variance of true scores

    • Mean squared error (MSE) and proportional reduction in mean squared error (PRMSE) when predicting true score with system score.

  • pca: Shows the results of principal components analysis on the processed feature values:

    • the principal components themselves;

    • the variances; and

    • a Scree plot.

    The analysis keeps all components. The total number of components usually equals the total number of features. In cases where the total number of responses is smaller than the number of features, the number of components is the same as the number of responses.

  • intermediate_file_paths: Shows links to all of the intermediate files that were generated while running the experiment.

  • sysinfo: Shows all Python packages along with versions installed in the current environment while generating the report.
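
For example, a sketch that limits the report to a few of the pre-defined sections listed above:

{
    ...
    "general_sections": ["data_description", "model", "evaluation", "sysinfo"],
    ...
}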

id_column (Optional)

The name of the column containing the response IDs. Defaults to spkitemid, i.e., if this is not specified, rsmtool will look for a column called spkitemid in the training and evaluation files.

length_column (Optional)

The name for the optional column in the training and evaluation data containing response length. If specified, length is included in the inter-feature and partial correlation analyses. Note that this field should not be specified if you want to use the length column as an actual feature in the model. In the latter scenario, the length column will automatically be included in the analyses, like any other feature. If you specify length_column and include the same column name as a feature in the feature file, rsmtool will ignore the length_column setting. In addition, if length_column has missing values or if its standard deviation is 0 (both somewhat unlikely scenarios), rsmtool will not include any length-based analyses in the report.

min_items_per_candidate (Optional)

An integer value for the minimum number of responses expected from each candidate. If any candidates have fewer responses than the specified value, all responses from those candidates will be excluded from further analysis. Defaults to None.

min_n_per_group (Optional)

A single numeric value or a dictionary with keys as the group names listed in the subgroups field and values as the thresholds for the groups. When specified, only groups with at least this number of instances will be displayed in the tables and plots contained in the report. Note that this parameter only affects the HTML report and the figures. For all analyses – including the computation of the population parameters – data from all groups will be used. In addition, the intermediate files will still show the results for all groups.

Note

If you supply a dictionary, it must contain a key for every subgroup listed in subgroups field. If no threshold is to be applied for some of the groups, set the threshold value for this group to 0 in the dictionary.

Note

Any provided thresholds will be applied when displaying the feature descriptive analyses conducted on the training set and the results of the performance analyses computed on the evaluation set.
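
For example, assuming two hypothetical subgroups, the following sketch hides prompt groups with fewer than 50 responses while applying no threshold to gender groups:

{
    ...
    "subgroups": ["prompt", "gender"],
    "min_n_per_group": {"prompt": 50, "gender": 0},
    ...
}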

predict_expected_scores (Optional)

If a probabilistic SKLL classifier is chosen to build the scoring model, then expected scores — probability-weighted averages over contiguous, numeric score points — can be generated as the machine predictions instead of the most likely score point, which would be the default for a classifier. Set this field to true to compute expected scores as predictions. Defaults to false.

Note

You may see slight differences in expected score predictions if you run the experiment on different machines or on different operating systems, most likely due to very small probability values for certain score points, which can affect floating point computations.

rater_error_variance (Optional)

True score evaluations require an estimate of rater error variance. By default, rsmtool will compute this variance from double-scored responses in the data. However, in some cases, one may wish to compute the variance on a different sample of responses. In such cases, this field can be used to set the rater error variance to a precomputed value which is then used as-is by rsmtool. You can use the rsmtool.utils.variance_of_errors function to compute rater error variance outside the main evaluation pipeline.
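
For example, a sketch supplying a hypothetical precomputed value:

{
    ...
    "rater_error_variance": 0.25,
    ...
}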

second_human_score_column (Optional)

The name for an optional column in the test data containing a second human score for each response. If specified, additional information about human-human agreement and degradation will be computed and included in the report. Note that this column must either contain numbers or be empty; non-numeric values are not accepted. Note also that the exclude_zero_scores (Optional) option above will apply to this column too.

Note

You do not need to have second human scores for all responses to use this option. The human-human agreement statistics will be computed as long as there is at least one response with numeric value in this column. For responses that do not have a second human score, the value in this column should be blank.

section_order (Optional)

A list containing the order in which the sections in the report should be generated. Any specified order must explicitly list:

  1. Either all pre-defined sections if a value for the general_sections field is not specified OR the sections specified using general_sections, and

  2. All custom section names specified using custom_sections, i.e., file prefixes only, without the path and without the .ipynb extension.
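
For example, assuming a hypothetical custom notebook notebooks/my_custom_section.ipynb, a sketch satisfying both requirements might look like this:

{
    ...
    "general_sections": ["data_description", "model", "evaluation"],
    "custom_sections": ["notebooks/my_custom_section.ipynb"],
    "section_order": ["data_description", "my_custom_section", "model", "evaluation"],
    ...
}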

select_transformations (Optional)

If this option is set to true, the system will try to apply feature transformations to each of the features and then choose the transformation for each feature that yields the highest correlation with human score. The possible transformations are:

  • raw: no transformation, use original feature value

  • org: same as raw

  • inv: 1/x

  • sqrt: square root

  • addOneInv: 1/(x+1)

  • addOneLn: ln(x+1)

We only consider transformations that produce numeric results for all values for a given feature column. For example, if a feature column contains a single negative value, the sqrt transformation will be ignored even if it would have resulted in the highest correlation with human score for the remaining values. In addition, the inv and addOneInv transformations are never used for feature columns that contain both positive and negative values.

Defaults to false.

See also

It is also possible to manually apply transformations to any feature as part of the feature column selection process.

sign (Optional)

Name of the column containing expected correlation sign between each feature and human score if using subset-based column selection.

skll_fixed_parameters (Optional)

Any fixed hyperparameters to be used if a SKLL model is chosen to build the scoring model. This should be a dictionary with the names of the hyperparameters as the keys. To determine what hyperparameters are available for the SKLL learner you chose, consult the scikit-learn documentation for the learner with the same name as well as the SKLL documentation. Any values you specify here will override both the scikit-learn and SKLL defaults. The values for a key can be string, integer, float, or boolean depending on what the hyperparameter expects. Note that if this option is specified with the built-in linear regression models, it will simply be ignored.
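
For example, a sketch that fixes two hyperparameters for the SKLL SVR learner (the values shown are purely illustrative, not recommendations):

{
    ...
    "model": "SVR",
    "skll_fixed_parameters": {"C": 1.0, "kernel": "linear"},
    ...
}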

skll_grid_search_jobs (Optional)

Number of folds to run in parallel when using SKLL grid search. Defaults to 1 (no parallelization). For very large data sets, setting this option to a higher number (e.g., number of CPU cores) may significantly improve running time.

skll_objective (Optional)

The tuning objective to use if a SKLL model is chosen to build the scoring model. Possible values are the objectives available via SKLL. Defaults to neg_mean_squared_error for SKLL regressors and f1_score_micro for SKLL classifiers. Note that if this option is specified with the built-in linear regression models, it will simply be ignored.

standardize_features (Optional)

If this option is set to false, features will not be standardized by subtracting the mean and dividing by the standard deviation. Defaults to true.

subgroups (Optional)

A list of column names indicating grouping variables used for generating analyses specific to each of those defined subgroups. For example, ["prompt", "gender", "native_language", "test_country"]. These subgroup columns need to be present in both training and evaluation data. If subgroups are specified, rsmtool will generate:

  • description of the data by each subgroup;

  • boxplots showing the feature distribution for each subgroup on the training set; and

  • tables and barplots showing human-system agreement for each subgroup on the evaluation set.

test_label_column (Optional)

The name for the column containing the human scores in the test data. If set to fake, fake scores will be generated using randomly sampled integers. This option may be useful if you only need descriptive statistics for the data and do not care about the other analyses. Defaults to sc1.

train_label_column (Optional)

The name for the column containing the human scores in the training data. If set to fake, fake scores will be generated using randomly sampled integers. This option may be useful if you only need descriptive statistics for the data and do not care about the other analyses. Defaults to sc1.

Note

All responses with non-numeric values in either train_label_column or test_label_column and/or those with non-numeric values for relevant features will be automatically excluded from model training and evaluation. By default, zero scores in either train_label_column or test_label_column will also be excluded. See exclude_zero_scores (Optional) if you want to keep responses with zero scores.

trim_max (Optional)

The single numeric value for the highest possible integer score that the machine should predict. This value will be used to compute the ceiling value for trimmed (bound) machine scores as trim_max + trim_tolerance. Defaults to the highest observed human score in the training data or 10 if there are no numeric human scores available.

trim_min (Optional)

The single numeric value for the lowest possible integer score that the machine should predict. This value will be used to compute the floor value for trimmed (bound) machine scores as trim_min - trim_tolerance. Defaults to the lowest observed human score in the training data or 1 if there are no numeric human scores available.

trim_tolerance (Optional)

The single numeric value that will be used to pad the trimming range specified in trim_min and trim_max. This value will be used to compute the ceiling and floor values for trimmed (bound) machine scores as trim_max + trim_tolerance for the ceiling value and trim_min - trim_tolerance for the floor value. Defaults to 0.4998.

Note

For more fine-grained control over the trimming range, you can set trim_tolerance to 0 and use trim_min and trim_max to specify the exact floor and ceiling values.
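
For example, with the sketch below and the default trim_tolerance of 0.4998, trimmed machine scores will be bounded to [1 - 0.4998, 6 + 0.4998], i.e., [0.5002, 6.4998]:

{
    ...
    "trim_min": 1,
    "trim_max": 6,
    ...
}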

truncate_outliers (Optional)

If this option is set to false, outliers (values more than 4 standard deviations away from the mean) in feature columns will _not_ be truncated. Defaults to true.

use_scaled_predictions (Optional)

If set to true, certain evaluations (confusion matrices, score distributions, subgroup analyses) will use the scaled machine scores. If set to false, these evaluations will use the raw machine scores. Defaults to false.

Note

All evaluation metrics (e.g., kappa and Pearson correlation) are automatically computed for both scaled and raw scores.

use_thumbnails (Optional)

If set to true, the images in the HTML will be set to clickable thumbnails rather than full-sized images. Upon clicking the thumbnail, the full-sized images will be displayed in a separate tab in the browser. If set to false, full-sized images will be displayed as usual. Defaults to false.

use_truncation_thresholds (Optional)

If set to true, use the min and max columns specified in the features file to clamp outlier feature values. This is useful if users would like to clamp feature values based on some pre-defined boundaries, rather than having these boundaries calculated based on the training set. Defaults to false.

Note

If use_truncation_thresholds is set, a features file _must_ be specified, and this file _must_ include min and max columns. If no feature file is specified or these columns are missing, an error will be raised.
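
For example, a hypothetical features file providing truncation boundaries might look like this (feature names and bounds are illustrative):

feature,min,max
feature1,0,10
feature2,-2.5,2.5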

use_wandb (Optional)

If set to true, the generated reports and all intermediate tables will be logged to Weights & Biases. The Weights & Biases entity and project name can be specified in the appropriate configuration fields.

The tables and plots will be logged in a section named rsmtool in a new run under the given project, and the report will be added to a reports section in that run. In addition, some evaluation metrics will be logged to the run’s history for easy comparison between different runs in the project, with each metric name representing the context and the table from which the metric is taken; for example, rsmtool/consistency.SMD is the SMD metric from the consistency table calculated by rsmtool. Defaults to false.

wandb_project (Optional)

The Weights & Biases project name if logging to Weights & Biases is enabled. If a project by this name does not already exist, it will be created.

Important

  1. Before using Weights & Biases for the first time, users should log in and provide their API key as described in W&B Quickstart guidelines.

  2. Note that when using W&B logging, the rsmtool run may take significantly longer due to the network traffic being sent to W&B.

wandb_entity (Optional)

The Weights & Biases entity name if logging to Weights & Biases is enabled. Entity can be a user name or the name of a team or organization.
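
For example, a sketch enabling Weights & Biases logging with hypothetical project and entity names:

{
    ...
    "use_wandb": true,
    "wandb_project": "my-rsmtool-experiments",
    "wandb_entity": "my-team",
    ...
}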

Output

rsmtool produces a set of folders in the experiment output directory. This is either the current directory in which rsmtool is run or the directory specified as the second optional command-line argument. If logging to Weights & Biases is enabled, the reports and all intermediate files are also logged to the specified Weights & Biases project.

report

This folder contains the final RSMTool report in HTML format as well as in the form of a Jupyter notebook (a .ipynb file).

output

This folder contains all of the intermediate files produced as part of the various analyses performed, saved in the format specified by the file_format field (.csv by default). rsmtool will also save a copy of the configuration file in this folder. Fields not specified in the original configuration file will be pre-populated with default values.

figure

This folder contains all of the figures generated as part of the various analyses performed, saved as .svg files.

feature

This folder contains a .csv file that lists all features, their signs, and transformations as used in the final model, taking into account any manual or automatic feature selection. See feature column selection for more details.

Selecting Feature Columns

By default, rsmtool will use all columns included in the training and evaluation data files as features. The only exceptions are columns explicitly identified in the configuration file as containing non-feature information (e.g., id_column, train_label_column, test_label_column, etc.).

However, there are certain scenarios in which it is useful to choose specific columns in the data to be used as features. For example, let’s say that you have a large number of very different features and you want to use a different subset of features to score different types of questions on a test. In this case, the ability to easily choose the desired features for any rsmtool experiment becomes quite important. The alternative of manually pre-processing the data to remove the features you don’t need is quite cumbersome.

There are two ways to select specific columns in the data as features:

  1. Fine-grained column selection: In this method, you manually create a list of the columns that you wish to use as features for an rsmtool experiment. See fine-grained selection for more details.

  2. Subset-based column selection: In this method, you can pre-define subsets of features and then select entire subsets at a time for any rsmtool experiment. See subset-based selection for more details.

While fine-grained column selection is better for a single experiment, subset-based selection is more convenient when you need to run several experiments with somewhat different subsets of features.

Warning

rsmtool will filter the training and evaluation data to remove any responses with non-numeric values in any of the feature columns before training the model. If your data includes a column containing string values and you do not use any of these methods of feature selection nor specify this column as the id_column or the candidate_column or a subgroup column, rsmtool will filter out all the responses in the data.

Fine-grained column selection

To manually select columns to be used as features, you can provide a data file in one of the supported formats. The file must contain a column named feature which specifies the names of the feature columns that should be used for scoring model building. For additional flexibility, the same file also allows you to describe transformations to be applied to the values in these feature columns before being used in the model. The path to this file should be set as an argument to features in the experiment configuration file. (Note: If you do not wish to perform any feature transformations, but would simply like to select certain feature columns to include, you can also pass a list of feature names as an argument to features.)

Here’s an example of what such a file might look like.

feature,transform,sign
feature1,raw,1
feature2,inv,-1

There is one required column and two optional columns.

feature

The exact name of the column in the training and evaluation data files, including capitalization. Column names cannot contain hyphens. The following strings are reserved and cannot be used as feature column names: spkitemid, spkitemlab, itemType, r1, r2, score, sc, sc1, and adj. In addition, any column names provided as values for id_column, train_label_column, test_label_column, length_column, candidate_column, and subgroups may also not be used as feature column names.

transform (optional)

A transformation that should be applied to the column values before using it in the model. Possible values are:

  • raw: no transformation, use original value

  • org: same as raw

  • inv: 1/x

  • sqrt: square root

  • addOneInv: 1/(x+1)

  • addOneLn: ln(x+1)

Note that rsmtool will raise an exception if the values in the data do not allow the supplied transformation (for example, if inv is applied to a column which has 0 values). If you really want to use the transformation, you must pre-process your training and evaluation data files to remove the problematic cases.

If the feature file contains no transform column, rsmtool will use the original values for all features (raw transform).

sign (optional)

After transformation, the column values will be multiplied by this number, which can be either 1 or -1 depending on the expected sign of the correlation between transformed feature and human score. This mechanism is provided to ensure that all features in the final models have a positive correlation with the score, if that is so desired by the user.

If the feature file contains no sign column, rsmtool will multiply all values by 1.

When determining the sign, you should take into account the correlation between the original feature and the score as well as any applied transformations. For example, if you use a feature which has a negative correlation with the human score and apply the sqrt transformation, sign should be set to -1. However, if you use the same feature but apply the inv transformation, sign should now be set to 1.

To ensure that this is working as expected, you can check the sign of correlations for both raw and processed features in the final report.

Note

You can use the fine-grained method of column selection in combination with a model with automatic feature selection. In this case, the features that end up being used in the final model can be found in the .csv file in the feature folder in the experiment output directory.

Subset-based column selection

For more advanced users, rsmtool offers the ability to assign columns to named subsets in a data file in one of the supported formats and then select a set of columns by simply specifying the name of that pre-defined subset.

If you want to run multiple rsmtool experiments, each choosing from a large number of features, generating a separate feature file for each experiment listing columns to use can quickly become tedious.

Instead you can define feature subsets by providing a subset definition file in one of the supported formats which lists all feature names under a column named feature. Each subset is an additional column with a value of either 0 (denoting that the feature does not belong to the subset named by that column) or 1 (denoting that the feature does belong to the subset named by that column).

Here’s an example of a subset definition file, say subset.csv.

feature,A,B
feature1,0,1
feature2,1,1
feature3,1,0

In this example, feature2 and feature3 belong to a subset called “A”, and feature1 and feature2 belong to a subset called “B”.

This feature subset file can be provided to rsmtool using the feature_subset_file field in the configuration file. Then, to select a particular pre-defined subset of features, you simply set the feature_subset field in the configuration file to the name of the subset that you wish to use.

Then, in order to use feature subset “A” (feature2 and feature3) in an experiment, we need to set the following two fields in our experiment configuration file:

{
    ...
    "feature_subset_file": "subset.csv",
    "feature_subset": "A",
    ...
}

Transformations

Unlike in fine-grained selection, the feature subset file does not list any transformations to be applied to the feature columns. However, you can automatically select a transformation for each feature in the selected subset by applying all possible transformations and identifying the one which gives the highest correlation with the human score. To use this functionality, set the select_transformations field in the configuration file to true.

Signs

Some guidelines for building scoring models require all coefficients in the model to be positive and all features to have a positive correlation with human score. rsmtool can automatically flip the sign for any pre-defined feature subset. To use this functionality, the feature subset file should provide the expected correlation sign between each feature and human score under a column called sign_<SUBSET> where <SUBSET> is the name of the feature subset. Then, to tell rsmtool to flip the sign for this subset, you need to set the sign field in the configuration file to <SUBSET>.

To understand this, let’s re-examine our earlier example of a subset definition file subset.csv, but with an additional column.

feature,A,B,sign_A
feature1,0,1,+
feature2,1,1,-
feature3,1,0,+

Then, in order to use feature subset “A” (feature2 and feature3) in an experiment with the sign of feature2 flipped appropriately (multiplied by -1) to ensure positive correlations with score, we need to set the following three fields in our experiment configuration file:

{
    ...
    "feature_subset_file": "subset.csv",
    "feature_subset": "A",
    "sign": "A"
    ...
}

Note

If select_transformations is set to true, rsmtool is intelligent enough to take it into account when flipping the signs. For example, if the expected correlation sign for a given feature is negative, rsmtool will multiply the feature values by -1 if the sqrt transform has the highest correlation with score. However, if the best transformation turns out to be inv – which already changes the polarity of the feature – no such multiplication will take place.

Intermediate files

Although the primary output of RSMTool is an HTML report, we also want the user to be able to conduct additional analyses outside of RSMTool. To this end, all of the tables produced in the experiment report are saved as files in the output directory, in the format specified by the file_format parameter. The following sections describe all of the intermediate files that are produced.

Note

The names of all files begin with the experiment_id provided by the user in the experiment configuration file. In addition, the names for certain columns are set to default values in these files irrespective of what they were named in the original data files. This is because RSMTool standardizes these column names internally for convenience. These values are:

  • spkitemid for the column containing response IDs.

  • sc1 for the column containing the human scores used as training labels.

  • sc2 for the column containing the second human scores, if this column was specified in the configuration file.

  • length for the column containing response length, if this column was specified in the configuration file.

  • candidate for the column containing candidate IDs, if this column was specified in the configuration file.

Feature values

filenames: train_features, test_features, train_preprocessed_features, test_preprocessed_features

These files contain the raw and pre-processed feature values for the training and evaluation sets. They include only the rows that were used for training/evaluating the models after filtering. For models with feature selection, these files only include the features that ended up being included in the model.

Note

By default, RSMTool filters out non-numeric feature values and non-numeric/zero human scores from both the training and evaluation sets. Zero scores can be kept by setting exclude_zero_scores to false.

Flagged responses

filenames: train_responses_with_excluded_flags, test_responses_with_excluded_flags

These files contain all of the rows in the training and evaluation sets that were filtered out based on conditions specified in flag_column.

Note

If the training/evaluation files contained columns with internal names such as sc1 or length but these columns were not actually used by rsmtool, these columns will also be included in these files but their names will be changed to ##name## (e.g. ##sc1##).

Excluded responses

filenames: train_excluded_responses, test_excluded_responses

These files contain all of the rows in the training and evaluation sets that were filtered out because of feature values or scores. For models with feature selection, these files only include the features that ended up being included in the model.

Response metadata

filenames: train_metadata, test_metadata

These files contain the metadata columns (id_column, subgroups if provided) for the rows in the training and evaluation sets that were not excluded for some reason.

Unused columns

filenames: train_other_columns, test_other_columns

These files contain all of the columns from the original features files that are not present in the *_feature and *_metadata files. They only include the rows from the training and evaluation sets that were not filtered out.

Note

If the training/evaluation files contained columns with internal names such as sc1 or length but these columns were not actually used by rsmtool, these columns will also be included in these files but their names will be changed to ##name## (e.g. ##sc1##).

Response length

filename: train_response_lengths

If length_column is specified, then this file contains the values from that column for the training data under a column called length with the response IDs under the spkitemid column.

Human scores

filename: test_human_scores

This file contains the human scores for the evaluation data under a column called sc1 with the response IDs under the spkitemid column. If second_human_score_column was specified, then it also contains the values from that column under a column called sc2. Only the rows that were not filtered out are included.

Note

If exclude_zero_scores was set to true (the default value), all zero scores in the second_human_score_column will be replaced by nan.

Data composition

filename: data_composition

This file contains the total number of responses in training and evaluation set and the number of overlapping responses. If applicable, the table will also include the number of different subgroups for each set.

Excluded data composition

filenames: train_excluded_composition, test_excluded_composition

These files contain the composition of the set of excluded responses for the training and evaluation sets, e.g., why were they excluded and how many for each such exclusion.

Missing features

filename: train_missing_feature_values

This file contains the total number of non-numeric values for each feature. The counts in this table are based only on those responses that have a numeric human score in the training data.

Subgroup composition

filename: data_composition_by_<SUBGROUP>

There will be one such file for each of the specified subgroups and it contains the total number of responses in that subgroup in both the training and evaluation sets.

Feature descriptives

filenames: feature_descriptives, feature_descriptivesExtra

The first file contains the main descriptive statistics (mean, std. dev., correlation with human score, etc.) for all features included in the final model. The second file contains percentiles, mild, and extreme outliers for the same set of features. The values in both files are computed on raw feature values before pre-processing.

Feature outliers

filename: feature_outliers

This file contains the number and percentage of outlier values truncated to [MEAN-4*SD, MEAN+4*SD] during feature pre-processing for each feature included in the final model.

Inter-feature and score correlations

filenames: cors_orig, cors_processed

The first file contains the Pearson correlations between each pair of (raw) features and between each (raw) feature and the human score. The second file is the same but with the pre-processed feature values instead of the raw values.

Marginal and partial correlations with score

filenames: margcor_score_all_data, pcor_score_all_data, pcor_score_no_length_all_data

The first file contains the marginal correlations between each pre-processed feature and human score. The second file contains the partial correlation between each pre-processed feature and human score after controlling for all other features. The third file contains the partial correlations between each pre-processed feature and human score after controlling for response length, if length_column was specified in the configuration file.

Marginal and partial correlations with length

filenames: margcor_length_all_data, pcor_length_all_data

The first file contains the marginal correlations between each pre-processed feature and response length, if length_column was specified. The second file contains the partial correlations between each pre-processed feature and response length after controlling for all other features, if length_column was specified in the configuration file.

Principal components analyses

filenames: pca, pcavar

The first file contains the results of a Principal Components Analysis (PCA) using pre-processed feature values from the training set and its singular value decomposition. The second file contains the eigenvalues and variance explained by each component.

Various correlations by subgroups

Each of the following files may be produced for every subgroup, assuming all other information was also available.

  • margcor_score_by_<SUBGROUP>: the marginal correlations between each pre-processed feature and human score, computed separately for the subgroup.

  • pcor_score_by_<SUBGROUP>: the partial correlations between pre-processed features and human score after controlling for all other features, computed separately for the subgroup.

  • pcor_score_no_length_by_<SUBGROUP>: the partial correlations between each pre-processed feature and human score after controlling for response length (if available), computed separately for the subgroup.

  • margcor_length_by_<SUBGROUP>: the marginal correlations between each feature and response length (if available), computed separately for each subgroup.

  • pcor_length_by_<SUBGROUP>: partial correlations between each feature and response length (if available) after controlling for all other features, computed separately for each subgroup.

Note

All of the feature descriptive statistics, correlations (including those for subgroups), and PCA are computed only on the training set.

Model information

  • feature: pre-processing parameters for all features used in the model.

  • coefficients: model coefficients and intercept (for built-in models only).

  • coefficients_scaled: scaled model coefficients and intercept (linear models only). Although RSMTool generates scaled scores by scaling the predictions of the model, it is also possible to achieve the same result by scaling the coefficients instead. This file shows those scaled coefficients.

  • betas: standardized and relative coefficients (for built-in models only).

  • model_fit: R squared and adjusted R squared computed on the training set. Note that these values are always computed on raw predictions without any trimming or rounding.

  • .model: the serialized RSMTool Modeler object containing the fitted SKLL Learner object (before scaling the coefficients).

  • .ols: a serialized object of type pandas.stats.ols.OLS containing the fitted model (for built-in models excluding LassoFixedLambda and PositiveLassoCV).

  • ols_summary.txt: a text file containing a summary of the above model (for built-in models excluding LassoFixedLambda and PositiveLassoCV).

  • postprocessing_params: the parameters for trimming and scaling predicted scores. Useful for generating predictions on new data.

Predictions

filenames: pred_processed, pred_train

The first file contains the predicted scores for the evaluation set and the second file contains the predicted scores for the responses in the training set. Both of them contain the raw scores as well as different types of post-processed scores.

Evaluation metrics

  • eval: This file contains the descriptives for predicted and human scores (mean, std. dev., etc.) as well as the association metrics (correlation, quadratic weighted kappa, SMD, etc.) for the raw as well as the post-processed scores.

  • eval_by_<SUBGROUP>: the same information as in *_eval.csv computed separately for each subgroup. However, rather than SMD, a difference of standardized means (DSM) will be calculated using z-scores.

  • eval_short: a shortened version of eval that contains specific descriptives for predicted and human scores (mean, std. dev., etc.) and association metrics (correlation, quadratic weighted kappa, SMD, etc.) for specific score types chosen based on recommendations by Williamson (2012). Specifically, the following columns are included (the raw or scale version is chosen depending on the value of use_scaled_predictions in the configuration file).

    • h_mean

    • h_sd

    • corr

    • sys_mean [raw/scale_trim]

    • sys_sd [raw/scale_trim]

    • SMD [raw/scale_trim]

    • adj_agr [raw/scale_trim_round]

    • exact_agr [raw/scale_trim_round]

    • kappa [raw/scale_trim_round]

    • wtkappa [raw/scale_trim]

    • sys_mean [raw/scale_trim_round]

    • sys_sd [raw/scale_trim_round]

    • SMD [raw/scale_trim_round]

    • R2 [raw/scale_trim]

    • RMSE [raw/scale_trim]

  • score_dist: the distributions of the human scores and the rounded raw/scaled predicted scores, depending on the value of use_scaled_predictions.

  • confMatrix: the confusion matrix between the human scores and the rounded raw/scaled predicted scores, depending on the value of use_scaled_predictions.

Note

Please note that for raw scores, SMD values are likely to be affected by possible differences in scale.

  • true_score_eval: evaluation of how well system scores can predict true scores.

Human-human Consistency

These files are created only if a second human score has been made available via the second_human_score_column option in the configuration file.

  • consistency: contains descriptives for both human raters as well as the agreement metrics between their ratings.

  • consistency_by_<SUBGROUP>: contains the same metrics as in consistency file computed separately for each group. However, rather than SMD, a difference of standardized means (DSM) will be calculated using z-scores.

  • degradation: shows the differences between human-human agreement and machine-human agreement for all association metrics and all forms of predicted scores.

  • confMatrix_h1h2: the confusion matrix between the human scores for double-scored responses.

Evaluations based on test theory

  • disattenuated_correlations: shows the correlation between human-machine scores, human-human scores, and the disattenuated human-machine correlation computed as human-machine correlation divided by the square root of human-human correlation.

  • disattenuated_correlations_by_<SUBGROUP>: contains the same metrics as in disattenuated_correlations file computed separately for each group.

  • true_score_eval: evaluations of system scores against estimated true score. Contains total counts of single and double-scored responses, variance of human rater error, estimated true score variance, and mean squared error (MSE) and proportional reduction in mean squared error (PRMSE) when predicting true score using system score.

Additional fairness analyses

These files contain the results of additional fairness analyses suggested in Loukina, Madnani, & Zechner, 2019.

  • <METRICS>_by_<SUBGROUP>.ols: a serialized object of type pandas.stats.ols.OLS containing the fitted model for estimating the variance attributed to a given subgroup membership for a given metric. The subgroups are defined by the configuration file. The metrics are osa (overall score accuracy), osd (overall score difference), and csd (conditional score difference).

  • <METRICS>_by_<SUBGROUP>_ols_summary.txt: a text file containing a summary of the above model.

  • estimates_<METRICS>_by_<SUBGROUP>: coefficients, confidence intervals and p-values estimated by the model for each subgroup.

  • fairness_metrics_by_<SUBGROUP>: the R^2 (percentage of variance) and p-values for all models.

Built-in RSMTool Linear Regression Models

Models which use the full feature set

  • LinearRegression: A model that learns empirical regression weights using ordinary least squares regression (OLS).

  • EqualWeightsLR: A model with all feature weights set to 1.0; a naive model.

  • ScoreWeightedLR: a model that learns empirical regression weights using weighted least squares. The weights are determined based on the number of responses with different score levels. Score levels with a lower number of responses are assigned higher weights.

  • RebalancedLR: a model in which empirical regression weights are rebalanced by using a small portion of positive weights to replace negative beta values. This model has no negative coefficients.

Models with automatic feature selection

  • LassoFixedLambdaThenLR: A model that learns empirical OLS regression weights with feature selection using Lasso regression with all coefficients set to positive. The hyperparameter lambda is set to sqrt(n-lg(p)) where n is the number of responses and p is the number of features. This approach was chosen to balance the penalty for error against the penalty for too many coefficients, forcing Lasso to perform more aggressive feature selection, so it may not necessarily achieve the best possible performance. The feature set selected by LASSO is then used to fit an OLS linear regression. Note that while the original Lasso model is constrained to positive coefficients only, small negative coefficients may appear when the coefficients are re-estimated using OLS regression.

  • PositiveLassoCVThenLR: A model that learns empirical OLS regression weights with feature selection using Lasso regression with all coefficients set to positive. The hyperparameter lambda is optimized using cross-validation for log-likelihood. The feature set selected by LASSO is then used to fit an OLS linear regression. Note that this approach will likely produce a model with a large number of features, and any advantages of running Lasso would be effectively negated by later adding those features to the OLS regression.

  • NNLR: A model that learns empirical OLS regression weights with feature selection using non-negative least squares regression. Note that only the coefficients are constrained to be positive: the intercept can be either positive or negative.

  • NNLRIterative: A model that learns empirical OLS regression weights with feature selection using an iterative implementation of non-negative least squares regression. Under this implementation, an initial OLS model is fit. Then, any variables whose coefficients are negative are dropped and the model is re-fit. Any coefficients that are still negative after re-fitting are set to zero.

  • LassoFixedLambdaThenNNLR: A model that learns empirical OLS regression weights with feature selection using Lasso regression as above followed by non-negative least squares regression. The latter ensures that no feature has negative coefficients even when the coefficients are estimated using least squares without penalization.

  • LassoFixedLambda: same as LassoFixedLambdaThenLR but the model uses the original Lasso weights. Note that the coefficients in Lasso model are estimated using an optimization routine which may produce slightly different results on different systems.

  • PositiveLassoCV: same as PositiveLassoCVThenLR but using the original Lasso weights. Please note: the coefficients in Lasso model are estimated using an optimization routine which may produce slightly different results on different systems.

Note

  1. NNLR, NNLRIterative, LassoFixedLambdaThenNNLR, LassoFixedLambda and PositiveLassoCV all have no negative coefficients.

  2. For all feature selection models, the final set of features will be saved in the feature folder in the experiment output directory.