Using RSMTool¶
For most users, the primary means of using RSMTool will be via the command-line utility rsmtool. We refer to each run of rsmtool as an “experiment”.
Input¶
rsmtool requires a single argument to run an experiment: the path to a configuration file. It can also take an output directory as an optional second argument. If the latter is not specified, rsmtool will use the current directory as the output directory.
Here are all the arguments to the rsmtool command-line script.
config_file¶
The JSON configuration file for this experiment.
output_dir (optional)¶
The output directory where all the files for this experiment will be stored.
-f, --force¶
If specified, the contents of the output directory will be overwritten even if it already contains the output of another rsmtool experiment.
-h, --help¶
Show help message and exit.
-V, --version¶
Show version number and exit.
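For example, a typical invocation might look like this (the configuration file name and output directory are illustrative):
rsmtool my_config.json /path/to/output_dir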
Experiment configuration file¶
This is a file in .json format that provides overall configuration options for an rsmtool experiment. Here’s an example configuration file for rsmtool.
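A minimal configuration might look like the following, with the four required fields plus an optional description (the file names and model choice are illustrative):
{
    "experiment_id": "example_experiment",
    "model": "LinearRegression",
    "train_file": "train.csv",
    "test_file": "test.csv",
    "description": "A minimal example experiment"
}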
Note
To make it easy to get started with rsmtool, we provide a way to automatically generate configuration files, both interactively and non-interactively. Novice users will find interactive generation more helpful, while more advanced users will prefer non-interactive generation. See this page for more details.
Next, we describe all of the rsmtool configuration fields in detail. There are four required fields and the rest are all optional. We first describe the required fields and then the optional ones (sorted alphabetically).
experiment_id¶
An identifier for the experiment that will be used to name the report and all intermediate files. It can be any combination of alphanumeric values, must not contain spaces, and must not be any longer than 200 characters.
model¶
The machine learner you want to use to build the scoring model. Possible values include built-in linear regression models as well as all of the learners available via SKLL. With SKLL learners, you can customize the tuning objective and also compute expected scores as predictions.
train_file¶
The path to the training data feature file in one of the supported formats. Each row should correspond to a single response and contain numeric feature values extracted for this response. In addition, there should be a column with a unique identifier (ID) for each response and a column with the human score for each response. The path can be absolute or relative to the location of the configuration file.
test_file¶
The path to the evaluation data feature file in one of the supported formats. Each row should correspond to a single response and contain numeric feature values extracted for this response. In addition, there should be a column with a unique identifier (ID) for each response and a column with the human score for each response. The path can be absolute or relative to the location of the configuration file.
Note
1. For both the training and evaluation files, the default behavior of rsmtool is to look for a column named spkitemid in order to get the unique IDs for each response and a column named sc1 to get the train/test labels. The optional fields id_column and train_label_column can be used to specify different names for these columns.
2. rsmtool also assumes that all other columns present in these files (other than those containing IDs and labels) contain feature values. If this is not the case, one can use other configuration file fields to identify columns containing non-feature information useful for various analyses, e.g., second_human_score_column, flag_column, subgroups, et cetera below.
3. Any columns not explicitly identified in (1) and (2) will be considered feature columns and used by rsmtool in the model. To use only a subset of these remaining columns as features, one can employ the four optional fields features, feature_subset_file, feature_subset, and sign. See selecting feature columns for more details on how to achieve this.
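For instance, a training file that relies on the default column names might look like this (the IDs, scores, and feature values are illustrative):
spkitemid,sc1,feature1,feature2
RESP_001,3,0.52,1.20
RESP_002,4,0.81,0.94
Here, spkitemid supplies the response IDs, sc1 supplies the human scores, and the remaining columns are treated as features.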
candidate_column (Optional)¶
The name for an optional column in the training and test data containing unique candidate IDs. Candidate IDs are different from response IDs since the same candidate (test-taker) might have responded to multiple questions.
custom_sections (Optional)¶
A list of custom, user-defined sections to be included into the final report. These are IPython notebooks (.ipynb files) created by the user. The list must contain paths to the notebook files, either absolute or relative to the configuration file. All custom notebooks have access to some pre-defined variables.
description (Optional)¶
A brief description of the experiment. This will be included in the report. The description can contain spaces and punctuation. It’s blank by default.
exclude_zero_scores (Optional)¶
By default, responses with human scores of 0 will be excluded from both the training and evaluation sets. Set this field to false if you want to keep responses with scores of 0. Defaults to true.
feature_subset (Optional)¶
Name of the pre-defined feature subset to be used if using subset-based column selection.
feature_subset_file (Optional)¶
Path to the feature subset file if using subset-based column selection.
features (Optional)¶
Path to the file with the list of features if using fine-grained column selection. Alternatively, you can pass a list of feature names to include in the experiment.
file_format (Optional)¶
The format of the intermediate files. Options are csv, tsv, or xlsx. Defaults to csv if this is not specified.
flag_column (Optional)¶
This field makes it possible to only use responses with particular values in a given column (e.g., only responses with a value of 0 in a column called ADVISORY). The field takes a dictionary in Python format where the keys are the names of the columns and the values are lists of values for responses that will be used to train the model. For example, a value of {"ADVISORY": 0} means that rsmtool will only use responses for which the ADVISORY column has the value 0.
If this field is used without flag_column_test, the conditions will be applied to both the training and evaluation sets and the specified columns must be present in both sets.
When this field is used in conjunction with flag_column_test, the conditions will be applied to the training set only and the specified columns must be present in the training set.
Defaults to None.
Note
If several conditions are specified (e.g., {"ADVISORY": 0, "ERROR": 0}), only those responses which satisfy all the conditions will be selected for further analysis (in this example, these will be the responses where the ADVISORY column has a value of 0 and the ERROR column has a value of 0).
Note
When reading the values in the supplied dictionary, rsmtool treats numeric strings, floats, and integers as the same value. Thus 1, 1.0, "1", and "1.0" are all treated as 1.0.
flag_column_test (Optional)¶
This field makes it possible to use a separate Python flag dictionary for the evaluation set. If this field is not passed and flag_column is passed, then the same advisories will be used for both the training and evaluation sets.
When this field is used, the specified columns must be present in the evaluation set.
Defaults to None, or to the value of flag_column if flag_column is present. Use flag_column_test only if you want the filtering of the test set to differ from that of the training set.
Note
When used, the flag_column_test field determines all filtering conditions for the evaluation set. If it is used in conjunction with the flag_column field, the filtering conditions defined in flag_column will only be applied to the training set. If you want to apply a subset of conditions to both partitions with additional conditions applied to the evaluation set only, you will need to specify the overlapping conditions separately for each partition.
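For example, to train only on responses with ADVISORY values of 0 or 1 while evaluating only on responses with an ADVISORY value of 0, the two fields could be combined as follows (the column name and values are illustrative):
{
    ...
    "flag_column": {"ADVISORY": [0, 1]},
    "flag_column_test": {"ADVISORY": [0]},
    ...
}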
general_sections (Optional)¶
RSMTool provides pre-defined sections for rsmtool (listed below) and, by default, all of them are included in the report. However, you can choose a subset of these pre-defined sections by specifying a list as the value for this field.
data_description: Shows the total number of responses in the training and evaluation sets, along with any responses that have been excluded due to non-numeric features/scores or flag columns.
data_description_by_group: Shows the total number of responses in the training and evaluation sets for each of the subgroups specified in the configuration file. This section only covers the responses used to train/evaluate the model.
feature_descriptives: Shows the descriptive statistics for all raw feature values included in the model:
- a table showing mean, standard deviation, min, max, correlation with human score, etc.;
- a table with percentiles and outliers; and
- a barplot showing the number of truncated outliers for each feature.
features_by_group: Shows boxplots with distributions of raw feature values by each of the subgroups specified in the configuration file.
preprocessed_features: Shows analyses of preprocessed features:
- histograms showing the distributions of preprocessed feature values;
- the correlation matrix between all features and the human score; and
- a barplot showing marginal and partial correlations between all features and the human score, and, optionally, response length if length_column is specified in the config file.
dff_by_group: Differential feature functioning by group. The plots in this section show average feature values for each of the subgroups conditioned on human score.
consistency: Shows metrics for human-human agreement, the difference (“degradation”) between the human-human and human-system agreement, and the disattenuated human-machine correlations. This notebook is only generated if the config file specifies second_human_score_column.
model: Shows the parameters of the learned regression model. For linear models, it also includes the standardized and relative coefficients as well as model diagnostic plots.
evaluation: Shows the standard set of evaluations recommended for scoring models on the evaluation data:
- a table showing human-system association metrics;
- the confusion matrix; and
- a barplot showing the distributions for both human and machine scores.
evaluation_by_group: Shows barplots with the main evaluation metrics by each of the subgroups specified in the configuration file.
fairness_analyses: Additional fairness analyses suggested in Loukina, Madnani, & Zechner, 2019. The notebook shows:
- percentage of variance in squared error explained by subgroup membership;
- percentage of variance in raw (signed) error explained by subgroup membership;
- percentage of variance in raw (signed) error explained by subgroup membership when controlling for human score; and
- plots showing estimates for each subgroup for each model.
true_score_evaluation: Evaluation of system scores against the true scores estimated according to test theory. The notebook shows:
- the number of single- and double-scored responses;
- the variance of human rater errors and the estimated variance of true scores; and
- the mean squared error (MSE) and proportional reduction in mean squared error (PRMSE) when predicting true score with system score.
pca: Shows the results of principal components analysis on the processed feature values:
- the principal components themselves;
- the variances; and
- a Scree plot.
The analysis keeps all components. The total number of components usually equals the total number of features. In cases where the total number of responses is smaller than the number of features, the number of components is the same as the number of responses.
intermediate_file_paths: Shows links to all of the intermediate files that were generated while running the experiment.
sysinfo: Shows all Python packages along with versions installed in the current environment while generating the report.
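For example, to generate a shorter report containing only a few of these sections (the particular selection shown is just an illustration):
{
    ...
    "general_sections": ["data_description", "model", "evaluation"],
    ...
}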
id_column (Optional)¶
The name of the column containing the response IDs. Defaults to spkitemid, i.e., if this is not specified, rsmtool will look for a column called spkitemid in the training and evaluation files.
length_column (Optional)¶
The name for the optional column in the training and evaluation data containing response length. If specified, length is included in the inter-feature and partial correlation analyses. Note that this field should not be specified if you want to use the length column as an actual feature in the model. In the latter scenario, the length column will automatically be included in the analyses, like any other feature. If you specify length_column and include the same column name as a feature in the feature file, rsmtool will ignore the length_column setting. In addition, if length_column has missing values or if its standard deviation is 0 (both somewhat unlikely scenarios), rsmtool will not include any length-based analyses in the report.
min_items_per_candidate (Optional)¶
An integer value for the minimum number of responses expected from each candidate. If any candidates have fewer responses than the specified value, all responses from those candidates will be excluded from further analysis. Defaults to None.
min_n_per_group (Optional)¶
A single numeric value or a dictionary with keys as the group names listed in the subgroups field and values as the thresholds for the groups. When specified, only groups with at least this number of instances will be displayed in the tables and plots contained in the report. Note that this parameter only affects the HTML report and the figures. For all analyses – including the computation of the population parameters – data from all groups will be used. In addition, the intermediate files will still show the results for all groups.
Note
If you supply a dictionary, it must contain a key for every subgroup listed in the subgroups field. If no threshold is to be applied for some of the groups, set the threshold value for those groups to 0 in the dictionary.
Note
Any provided thresholds will be applied when displaying the feature descriptive analyses conducted on the training set and the results of the performance analyses computed on the evaluation set.
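For example, with two subgroups where a threshold should apply to only one of them (the subgroup names and threshold are illustrative):
{
    ...
    "subgroups": ["L1", "gender"],
    "min_n_per_group": {"L1": 10, "gender": 0},
    ...
}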
predict_expected_scores (Optional)¶
If a probabilistic SKLL classifier is chosen to build the scoring model, then expected scores — probability-weighted averages over contiguous, numeric score points — can be generated as the machine predictions instead of the most likely score point, which would be the default for a classifier. Set this field to true to compute expected scores as predictions. Defaults to false.
Note
You may see slight differences in expected score predictions if you run the experiment on different machines or on different operating systems, most likely due to very small probability values for certain score points, which can affect floating point computations.
rater_error_variance (Optional)¶
True score evaluations require an estimate of rater error variance. By default, rsmtool will compute this variance from double-scored responses in the data. However, in some cases, one may wish to compute the variance on a different sample of responses. In such cases, this field can be used to set the rater error variance to a precomputed value which is then used as-is by rsmtool. You can use the rsmtool.utils.variance_of_errors function to compute rater error variance outside the main evaluation pipeline.
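For example, to supply a precomputed variance estimate (the value shown is purely illustrative):
{
    ...
    "rater_error_variance": 0.35,
    ...
}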
second_human_score_column (Optional)¶
The name for an optional column in the test data containing a second human score for each response. If specified, additional information about human-human agreement and degradation will be computed and included in the report. Note that this column must either contain numbers or be empty; non-numeric values are not accepted. Note also that the exclude_zero_scores option will apply to this column too.
Note
You do not need to have second human scores for all responses to use this option. The human-human agreement statistics will be computed as long as there is at least one response with a numeric value in this column. For responses that do not have a second human score, the value in this column should be blank.
section_order (Optional)¶
A list containing the order in which the sections in the report should be generated. Any specified order must explicitly list:
- Either all pre-defined sections if a value for the general_sections field is not specified OR the sections specified using general_sections, and
- All custom section names specified using custom_sections, i.e., file prefixes only, without the path and without the .ipynb extension, and
- All special sections specified using special_sections.
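For example, when choosing a subset of the pre-defined sections together with one custom section (the section names and notebook path are illustrative), a valid order must list all of them:
{
    ...
    "general_sections": ["data_description", "model", "evaluation"],
    "custom_sections": ["notebooks/extra_analyses.ipynb"],
    "section_order": ["data_description", "extra_analyses", "model", "evaluation"],
    ...
}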
select_transformations (Optional)¶
If this option is set to true, the system will try to apply feature transformations to each of the features and then choose the transformation for each feature that yields the highest correlation with human score. The possible transformations are:
- raw: no transformation, use original feature value
- org: same as raw
- inv: 1/x
- sqrt: square root
- addOneInv: 1/(x+1)
- addOneLn: ln(x+1)
We only consider transformations that produce numeric results for all values for a given feature column. For example, if a feature column contains a single negative value, the sqrt transformation will be ignored even if it would have resulted in the highest correlation with human score for the remaining values. In addition, the inv and addOneInv transformations are never used for feature columns that contain both positive and negative values.
Defaults to false.
See also
It is also possible to manually apply transformations to any feature as part of the feature column selection process.
sign (Optional)¶
Name of the column containing expected correlation sign between each feature and human score if using subset-based column selection.
skll_fixed_parameters (Optional)¶
Any fixed hyperparameters to be used if a SKLL model is chosen to build the scoring model. This should be a dictionary with the names of the hyperparameters as the keys. To determine what hyperparameters are available for the SKLL learner you chose, consult the scikit-learn documentation for the learner with the same name as well as the SKLL documentation. Any values you specify here will override both the scikit-learn and SKLL defaults. The values for a key can be string, integer, float, or boolean depending on what the hyperparameter expects. Note that if this option is specified with the built-in linear regression models, it will simply be ignored.
skll_objective (Optional)¶
The tuning objective to use if a SKLL model is chosen to build the scoring model. Possible values are the objectives available via SKLL. Defaults to neg_mean_squared_error for SKLL regressors and f1_score_micro for SKLL classifiers. Note that if this option is specified with the built-in linear regression models, it will simply be ignored.
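For example, to build the scoring model with a SKLL learner, fix some of its hyperparameters, and tune the rest against a different objective (the learner, hyperparameter values, and objective shown are illustrative):
{
    ...
    "model": "RandomForestRegressor",
    "skll_fixed_parameters": {"n_estimators": 500, "max_depth": 5},
    "skll_objective": "pearson",
    ...
}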
special_sections (Optional)¶
A list specifying special ETS-only sections to be included into the final report. These sections are available only to ETS employees via the rsmextra package.
standardize_features (Optional)¶
If this option is set to false, features will not be standardized by subtracting the mean and dividing by the standard deviation. Defaults to true.
subgroups (Optional)¶
A list of column names indicating grouping variables used for generating analyses specific to each of those defined subgroups. For example, ["prompt", "gender", "native_language", "test_country"]. These subgroup columns need to be present in both training and evaluation data. If subgroups are specified, rsmtool will generate:
- description of the data by each subgroup;
- boxplots showing the feature distribution for each subgroup on the training set; and
- tables and barplots showing human-system agreement for each subgroup on the evaluation set.
test_label_column (Optional)¶
The name for the column containing the human scores in the test data. If set to fake, fake scores will be generated using randomly sampled integers. This option may be useful if you only need descriptive statistics for the data and do not care about the other analyses. Defaults to sc1.
train_label_column (Optional)¶
The name for the column containing the human scores in the training data. If set to fake, fake scores will be generated using randomly sampled integers. This option may be useful if you only need descriptive statistics for the data and do not care about the other analyses. Defaults to sc1.
Note
All responses with non-numeric values in either train_label_column or test_label_column and/or those with non-numeric values for relevant features will be automatically excluded from model training and evaluation. By default, zero scores in either train_label_column or test_label_column will also be excluded. See exclude_zero_scores if you want to keep responses with zero scores.
trim_max (Optional)¶
The single numeric value for the highest possible integer score that the machine should predict. This value will be used to compute the ceiling value for trimmed (bound) machine scores as trim_max + trim_tolerance. Defaults to the highest observed human score in the training data or 10 if there are no numeric human scores available.
trim_min (Optional)¶
The single numeric value for the lowest possible integer score that the machine should predict. This value will be used to compute the floor value for trimmed (bound) machine scores as trim_min - trim_tolerance. Defaults to the lowest observed human score in the training data or 1 if there are no numeric human scores available.
trim_tolerance (Optional)¶
The single numeric value that will be used to pad the trimming range specified in trim_min and trim_max. This value will be used to compute the ceiling and floor values for trimmed (bound) machine scores as trim_max + trim_tolerance for the ceiling value and trim_min - trim_tolerance for the floor value.
Defaults to 0.4998.
Note
For more fine-grained control over the trimming range, you can set trim_tolerance to 0 and use trim_min and trim_max to specify the exact floor and ceiling values.
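For example, on a 1-6 score scale with the default tolerance, trimmed scores are bounded to [1 - 0.4998, 6 + 0.4998] = [0.5002, 6.4998]:
{
    ...
    "trim_min": 1,
    "trim_max": 6,
    "trim_tolerance": 0.4998,
    ...
}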
use_scaled_predictions (Optional)¶
If set to true, certain evaluations (confusion matrices, score distributions, subgroup analyses) will use the scaled machine scores. If set to false, these evaluations will use the raw machine scores. Defaults to false.
Note
All evaluation metrics (e.g., kappa and Pearson correlation) are automatically computed for both scaled and raw scores.
use_thumbnails (Optional)¶
If set to true, the images in the HTML report will be displayed as clickable thumbnails rather than full-sized images. Upon clicking a thumbnail, the full-sized image will be displayed in a separate tab in the browser. If set to false, full-sized images will be displayed as usual. Defaults to false.
use_truncation_thresholds (Optional)¶
If set to true, use the min and max columns specified in the features file to clamp outlier feature values. This is useful if users would like to clamp feature values based on some pre-defined boundaries, rather than having these boundaries calculated based on the training set. Defaults to false.
Note
If use_truncation_thresholds is set, a features file must be specified, and this file must include min and max columns. If no features file is specified or these columns are missing, an error will be raised.
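For example, a features file with truncation boundaries might look like this (the feature names and boundary values are illustrative):
feature,min,max
feature1,0.0,5.2
feature2,0.5,3.4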
Output¶
rsmtool produces a set of folders in the experiment output directory. This is either the current directory in which rsmtool is run or the directory specified as the second optional command-line argument.
report¶
This folder contains the final RSMTool report in HTML format as well as in the form of a Jupyter notebook (a .ipynb file).
output¶
This folder contains all of the intermediate files produced as part of the various analyses performed, saved as .csv files. rsmtool will also save in this folder a copy of the configuration file. Fields not specified in the original configuration file will be pre-populated with default values.
figure¶
This folder contains all of the figures generated as part of the various analyses performed, saved as .svg files.
feature¶
This folder contains a .csv file that lists all features, signs, and transformations as used in the final model, taking into account any manual or automatic feature selection. See feature column selection for more details.
Selecting Feature Columns¶
By default, rsmtool will use all columns included in the training and evaluation data files as features. The only exceptions are any columns explicitly identified in the configuration file as containing non-feature information (e.g., id_column, train_label_column, test_label_column, etc.).
However, there are certain scenarios in which it is useful to choose specific columns in the data to be used as features. For example, let’s say that you have a large number of very different features and you want to use a different subset of features to score different types of questions on a test. In this case, the ability to easily choose the desired features for any rsmtool experiment becomes quite important. The alternative of manually pre-processing the data to remove the features you don’t need is quite cumbersome.
There are two ways to select specific columns in the data as features:
- Fine-grained column selection: In this method, you manually create a list of the columns that you wish to use as features for an rsmtool experiment. See fine-grained selection for more details.
- Subset-based column selection: In this method, you can pre-define subsets of features and then select entire subsets at a time for any rsmtool experiment. See subset-based selection for more details.
While fine-grained column selection is better for a single experiment, subset-based selection is more convenient when you need to run several experiments with somewhat different subsets of features.
Warning
rsmtool will filter the training and evaluation data to remove any responses with non-numeric values in any of the feature columns before training the model. If your data includes a column containing string values and you do not use any of these methods of feature selection nor specify this column as the id_column or the candidate_column or a subgroup column, rsmtool will filter out all the responses in the data.
Fine-grained column selection¶
To manually select columns to be used as features, you can provide a data file in one of the supported formats. The file must contain a column named feature which specifies the names of the feature columns that should be used for scoring model building. For additional flexibility, the same file also allows you to describe transformations to be applied to the values in these feature columns before being used in the model. The path to this file should be set as an argument to features in the experiment configuration file. (Note: If you do not wish to perform any feature transformations, but would simply like to select certain feature columns to include, you can also pass a list of feature names as an argument to features.)
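For example, the list-based alternative mentioned in the note above might look like this in the configuration file (the feature names are illustrative):
{
    ...
    "features": ["feature1", "feature2"],
    ...
}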
Here’s an example of what such a file might look like.
feature,transform,sign
feature1,raw,1
feature2,inv,-1
There is one required column and two optional columns.
feature¶
The exact name of the column in the training and evaluation data files, including capitalization. Column names cannot contain hyphens. The following strings are reserved and cannot be used as feature column names: spkitemid, spkitemlab, itemType, r1, r2, score, sc, sc1, and adj. In addition, any column names provided as values for id_column, train_label_column, test_label_column, length_column, candidate_column, and subgroups may also not be used as feature column names.
transform (optional)¶
A transformation that should be applied to the column values before they are used in the model. Possible values are:
- raw: no transformation, use original value
- org: same as raw
- inv: 1/x
- sqrt: square root
- addOneInv: 1/(x+1)
- addOneLn: ln(x+1)
Note that rsmtool will raise an exception if the values in the data do not allow the supplied transformation (for example, if inv is applied to a column which has 0 values). If you really want to use the transformation, you must pre-process your training and evaluation data files to remove the problematic cases.
If the feature file contains no transform column, rsmtool will use the original values for all features (the raw transform).
sign (optional)¶
After transformation, the column values will be multiplied by this number, which can be either 1 or -1 depending on the expected sign of the correlation between the transformed feature and the human score. This mechanism is provided to ensure that all features in the final models have a positive correlation with the score, if that is so desired by the user.
If the feature file contains no sign column, rsmtool will multiply all values by 1.
When determining the sign, you should take into account the correlation between the original feature and the score as well as any applied transformations. For example, if you use a feature which has a negative correlation with the human score and apply the sqrt transformation, sign should be set to -1. However, if you use the same feature but apply the inv transformation, sign should now be set to 1.
To ensure that this is working as expected, you can check the sign of correlations for both raw and processed features in the final report.
Note
You can use the fine-grained method of column selection in combination with a model with automatic feature selection. In this case, the features that end up being used in the final model can be found in the .csv file in the feature folder in the experiment output directory.
Subset-based column selection¶
For more advanced users, rsmtool offers the ability to assign columns to named subsets in a data file in one of the supported formats and then select a set of columns by simply specifying the name of that pre-defined subset.
If you want to run multiple rsmtool experiments, each choosing from a large number of features, generating a separate feature file for each experiment listing the columns to use can quickly become tedious.
Instead, you can define feature subsets by providing a subset definition file in one of the supported formats which lists all feature names under a column named feature. Each subset is an additional column with a value of either 0 (denoting that the feature does not belong to the subset named by that column) or 1 (denoting that the feature does belong to the subset named by that column).
Here’s an example of a subset definition file, say subset.csv.
feature,A,B
feature1,0,1
feature2,1,1
feature3,1,0
In this example, feature2 and feature3 belong to a subset called “A”, and feature1 and feature2 belong to a subset called “B”.
This feature subset file can be provided to rsmtool using the feature_subset_file field in the configuration file. Then, to select a particular pre-defined subset of features, you simply set the feature_subset field in the configuration file to the name of the subset that you wish to use.
For example, in order to use feature subset “A” (feature2 and feature3) in an experiment, we need to set the following two fields in our experiment configuration file:
{
...
"feature_subset_file": "subset.csv",
"feature_subset": "A",
...
}
Transformations¶
Unlike in fine-grained selection, the feature subset file does not list any transformations to be applied to the feature columns. However, you can automatically select a transformation for each feature in the selected subset by applying all possible transforms and identifying the one which gives the highest correlation with the human score. To use this functionality, set the select_transformations field in the configuration file to true.
Signs¶
Some guidelines for building scoring models require all coefficients in the model to be positive and all features to have a positive correlation with human score. rsmtool can automatically flip the sign for any pre-defined feature subset. To use this functionality, the feature subset file should provide the expected correlation sign between each feature and human score under a column called sign_<SUBSET> where <SUBSET> is the name of the feature subset. Then, to tell rsmtool to flip the sign for this subset, you need to set the sign field in the configuration file to <SUBSET>.
To understand this, let’s re-examine our earlier example of a subset definition file, subset.csv, but with an additional column.
feature,A,B,sign_A
feature1,0,1,+
feature2,1,1,-
feature3,1,0,+
Then, in order to use feature subset “A” (feature2 and feature3) in an experiment with the sign of feature2 flipped appropriately (multiplied by -1) to ensure positive correlations with score, we need to set the following three fields in our experiment configuration file:
{
...
"feature_subset_file": "subset.csv",
"feature_subset": "A",
"sign": "A"
...
}
Note
If select_transformations is set to true, rsmtool is intelligent enough to take it into account when flipping the signs. For example, if the expected correlation sign for a given feature is negative, rsmtool will multiply the feature values by -1 if the sqrt transform has the highest correlation with score. However, if the best transformation turns out to be inv – which already changes the polarity of the feature – no such multiplication will take place.
Intermediate files¶
Although the primary output of RSMTool is an HTML report, we also want the user to be able to conduct additional analyses outside of RSMTool. To this end, all of the tables produced in the experiment report are saved as files in the format specified by the file_format parameter in the output directory. The following sections describe all of the intermediate files that are produced.
Note
The names of all files begin with the experiment_id provided by the user in the experiment configuration file. In addition, the names for certain columns are set to default values in these files irrespective of what they were named in the original data files. This is because RSMTool standardizes these column names internally for convenience. These values are:
- spkitemid for the column containing response IDs.
- sc1 for the column containing the human scores used as training labels.
- sc2 for the column containing the second human scores, if this column was specified in the configuration file.
- length for the column containing response length, if this column was specified in the configuration file.
- candidate for the column containing candidate IDs, if this column was specified in the configuration file.
Feature values¶
filenames: train_features, test_features, train_preprocessed_features, test_preprocessed_features
These files contain the raw and pre-processed feature values for the training and evaluation sets. They include only the rows that were used for training/evaluating the models after filtering. For models with feature selection, these files only include the features that ended up being included in the model.
Note
By default, RSMTool filters out non-numeric feature values and non-numeric/zero human scores from both the training and evaluation sets. Zero scores can be kept by setting exclude_zero_scores to false.
Flagged responses¶
filenames: train_responses_with_excluded_flags, test_responses_with_excluded_flags
These files contain all of the rows in the training and evaluation sets that were filtered out based on conditions specified in flag_column.
Note
If the training/evaluation files contained columns with internal names such as sc1 or length but these columns were not actually used by rsmtool, these columns will also be included into these files but their names will be changed to ##name## (e.g., ##sc1##).
Excluded responses¶
filenames: train_excluded_responses, test_excluded_responses
These files contain all of the rows in the training and evaluation sets that were filtered out because of feature values or scores. For models with feature selection, these files only include the features that ended up being included in the model.
Response metadata¶
filenames: train_metadata, test_metadata
These files contain the metadata columns (id_column, subgroups if provided) for the rows in the training and evaluation sets that were not excluded for some reason.
Unused columns¶
filenames: train_other_columns, test_other_columns
These files contain all of the columns from the original feature files that are not present in the *_feature and *_metadata files. They only include the rows from the training and evaluation sets that were not filtered out.
Note
If the training/evaluation files contained columns with internal names such as sc1 or length but these columns were not actually used by rsmtool, these columns will also be included into these files but their names will be changed to ##name## (e.g., ##sc1##).
Response length¶
filename: train_response_lengths
If length_column is specified, then this file contains the values from that column for the training data under a column called length with the response IDs under the spkitemid column.
Human scores¶
filename: test_human_scores
This file contains the human scores for the evaluation data under a column called sc1 with the response IDs under the spkitemid column. If second_human_score_column was specified, then it also contains the values from that column under a column called sc2. Only the rows that were not filtered out are included.
Note
If exclude_zero_scores was set to true (the default value), all zero scores in the second_human_score_column will be replaced by nan.
Data composition¶
filename: data_composition
This file contains the total number of responses in the training and evaluation sets and the number of overlapping responses. If applicable, the table will also include the number of different subgroups for each set.
Excluded data composition¶
filenames: train_excluded_composition, test_excluded_composition
These files contain the composition of the set of excluded responses for the training and evaluation sets, e.g., why responses were excluded and how many were excluded for each reason.
Missing features¶
filename: train_missing_feature_values
This file contains the total number of non-numeric values for each feature. The counts in this table are based only on those responses that have a numeric human score in the training data.
Subgroup composition¶
filename: data_composition_by_<SUBGROUP>
There will be one such file for each of the specified subgroups and it contains the total number of responses in that subgroup in both the training and evaluation sets.
Feature descriptives¶
filenames: feature_descriptives, feature_descriptivesExtra
The first file contains the main descriptive statistics (mean, std. dev., correlation with human score, etc.) for all features included in the final model. The second file contains percentiles, mild, and extreme outliers for the same set of features. The values in both files are computed on raw feature values before pre-processing.
Feature outliers¶
filename: feature_outliers
This file contains the number and percentage of outlier values truncated to [MEAN-4*SD, MEAN+4*SD] during feature pre-processing for each feature included in the final model.
Inter-feature and score correlations¶
filenames: cors_orig, cors_processed
The first file contains the Pearson correlations between each pair of (raw) features and between each (raw) feature and the human score. The second file is the same but with the pre-processed feature values instead of the raw values.
Marginal and partial correlations with score¶
filenames: margcor_score_all_data, pcor_score_all_data, pcor_score_no_length_all_data
The first file contains the marginal correlations between each pre-processed feature and human score. The second file contains the partial correlations between each pre-processed feature and human score after controlling for all other features. The third file contains the partial correlations between each pre-processed feature and human score after controlling for response length, if length_column was specified in the configuration file.
Marginal and partial correlations with length¶
filenames: margcor_length_all_data, pcor_length_all_data
The first file contains the marginal correlations between each pre-processed feature and response length, if length_column was specified. The second file contains the partial correlations between each pre-processed feature and response length after controlling for all other features, if length_column was specified in the configuration file.
Principal components analyses¶
filenames: pca, pcavar
The first file contains the results of a Principal Components Analysis (PCA) using pre-processed feature values from the training set and its singular value decomposition. The second file contains the eigenvalues and variance explained by each component.
Various correlations by subgroups¶
Each of the following files may be produced for every subgroup, assuming all other required information was also available.
- margcor_score_by_<SUBGROUP>: the marginal correlations between each pre-processed feature and human score, computed separately for the subgroup.
- pcor_score_by_<SUBGROUP>: the partial correlations between pre-processed features and human score after controlling for all other features, computed separately for the subgroup.
- pcor_score_no_length_by_<SUBGROUP>: the partial correlations between each pre-processed feature and human score after controlling for response length (if available), computed separately for the subgroup.
- margcor_length_by_<SUBGROUP>: the marginal correlations between each feature and response length (if available), computed separately for each subgroup.
- pcor_length_by_<SUBGROUP>: the partial correlations between each feature and response length (if available) after controlling for all other features, computed separately for each subgroup.
Note
All of the feature descriptive statistics, correlations (including those for subgroups), and PCA are computed only on the training set.
Model information¶
- feature: pre-processing parameters for all features used in the model.
- coefficients: model coefficients and intercept (for built-in models only).
- coefficients_scaled: scaled model coefficients and intercept (linear models only). Although RSMTool generates scaled scores by scaling the predictions of the model, it is also possible to achieve the same result by scaling the coefficients instead. This file shows those scaled coefficients.
- betas: standardized and relative coefficients (for built-in models only).
- model_fit: R squared and adjusted R squared computed on the training set. Note that these values are always computed on raw predictions without any trimming or rounding.
- .model: the serialized SKLL Learner object containing the fitted model (before scaling the coefficients).
- .ols: a serialized object of type pandas.stats.ols.OLS containing the fitted model (for built-in models excluding LassoFixedLambda and PositiveLassoCV).
- ols_summary.txt: a text file containing a summary of the above model (for built-in models excluding LassoFixedLambda and PositiveLassoCV).
- postprocessing_params: the parameters for trimming and scaling predicted scores. Useful for generating predictions on new data.
Predictions¶
filenames: pred_processed, pred_train
The first file contains the predicted scores for the evaluation set and the second file contains the predicted scores for the responses in the training set. Both of them contain the raw scores as well as different types of post-processed scores.
Evaluation metrics¶
- eval: This file contains the descriptives for predicted and human scores (mean, std. dev., etc.) as well as the association metrics (correlation, quadratic weighted kappa, SMD, etc.) for the raw as well as the post-processed scores.
- eval_by_<SUBGROUP>: the same information as in *_eval.csv computed separately for each subgroup. However, rather than SMD, a difference of standardized means (DSM) will be calculated using z-scores.
- eval_short: a shortened version of eval that contains specific descriptives for predicted and human scores (mean, std. dev., etc.) and association metrics (correlation, quadratic weighted kappa, SMD, etc.) for specific score types chosen based on recommendations by Williamson (2012). Specifically, the following columns are included (the raw or scale version is chosen depending on the value of use_scaled_predictions in the configuration file):
  - h_mean
  - h_sd
  - corr
  - sys_mean [raw/scale_trim]
  - sys_sd [raw/scale_trim]
  - SMD [raw/scale_trim]
  - adj_agr [raw/scale_trim_round]
  - exact_agr [raw/scale_trim_round]
  - kappa [raw/scale_trim_round]
  - wtkappa [raw/scale_trim]
  - sys_mean [raw/scale_trim_round]
  - sys_sd [raw/scale_trim_round]
  - SMD [raw/scale_trim_round]
  - R2 [raw/scale_trim]
  - RMSE [raw/scale_trim]
- score_dist: the distributions of the human scores and the rounded raw/scaled predicted scores, depending on the value of use_scaled_predictions.
- confMatrix: the confusion matrix between the human scores and the rounded raw/scaled predicted scores, depending on the value of use_scaled_predictions.
Note
Please note that for raw scores, SMD values are likely to be affected by possible differences in scale.
- true_score_eval: evaluation of how well system scores can predict true scores.
Human-human Consistency¶
These files are created only if a second human score has been made available via the second_human_score_column option in the configuration file.
- consistency: contains descriptives for both human raters as well as the agreement metrics between their ratings.
- consistency_by_<SUBGROUP>: contains the same metrics as in the consistency file computed separately for each group. However, rather than SMD, a difference of standardized means (DSM) will be calculated using z-scores.
- degradation: shows the differences between human-human agreement and machine-human agreement for all association metrics and all forms of predicted scores.
Evaluations based on test theory¶
- disattenuated_correlations: shows the human-machine correlation, the human-human correlation, and the disattenuated human-machine correlation computed as the human-machine correlation divided by the square root of the human-human correlation.
- disattenuated_correlations_by_<SUBGROUP>: contains the same metrics as in the disattenuated_correlations file computed separately for each group.
- true_score_eval: evaluations of system scores against estimated true scores. Contains total counts of single- and double-scored responses, variance of human rater error, estimated true score variance, and mean squared error (MSE) and proportional reduction in mean squared error (PRMSE) when predicting true score using system score.
Additional fairness analyses¶
These files contain the results of additional fairness analyses suggested in Loukina, Madnani, & Zechner, 2019.
- <METRICS>_by_<SUBGROUP>.ols: a serialized object of type pandas.stats.ols.OLS containing the fitted model for estimating the variance attributed to a given subgroup membership for a given metric. The subgroups are defined by the configuration file. The metrics are osa (overall score accuracy), osd (overall score difference), and csd (conditional score difference).
- <METRICS>_by_<SUBGROUP>_ols_summary.txt: a text file containing a summary of the above model.
- estimates_<METRICS>_by_<SUBGROUP>: coefficients, confidence intervals, and p-values estimated by the model for each subgroup.
- fairness_metrics_by_<SUBGROUP>: the R squared (percentage of variance) and p-values for all models.
Built-in RSMTool Linear Regression Models¶
Models which use the full feature set¶
- LinearRegression: A model that learns empirical regression weights using ordinary least squares regression (OLS).
- EqualWeightsLR: A model with all feature weights set to 1.0; a naive model.
- ScoreWeightedLR: A model that learns empirical regression weights using weighted least squares. The weights are determined based on the number of responses at different score levels. Score levels with a lower number of responses are assigned higher weight.
- RebalancedLR: A model in which empirical regression weights are rebalanced by using a small portion of positive weights to replace negative beta values. This model has no negative coefficients.
Models with automatic feature selection¶
- LassoFixedLambdaThenLR: A model that learns empirical OLS regression weights with feature selection using Lasso regression with all coefficients constrained to be positive. The hyperparameter lambda is set to sqrt(n * lg(p)), where n is the number of responses and p is the number of features. This approach was chosen to balance the penalty for error against the penalty for too many coefficients and to force Lasso to perform more aggressive feature selection, so it may not necessarily achieve the best possible performance. The feature set selected by LASSO is then used to fit an OLS linear regression. Note that while the original Lasso model is constrained to positive coefficients only, small negative coefficients may appear when the coefficients are re-estimated using OLS regression.
- PositiveLassoCVThenLR: A model that learns empirical OLS regression weights with feature selection using Lasso regression with all coefficients constrained to be positive. The hyperparameter lambda is optimized using cross-validation for log-likelihood. The feature set selected by LASSO is then used to fit an OLS linear regression. Note that this approach will likely produce a model with a large number of features, and any advantages of running Lasso would be effectively negated by later adding those features to the OLS regression.
- NNLR: A model that learns empirical OLS regression weights with feature selection using non-negative least squares regression. Note that only the coefficients are constrained to be positive; the intercept can be either positive or negative.
- NNLRIterative: A model that learns empirical OLS regression weights with feature selection using an iterative implementation of non-negative least squares regression. Under this implementation, an initial OLS model is fit. Then, any variables whose coefficients are negative are dropped and the model is re-fit. Any coefficients that are still negative after re-fitting are set to zero.
- LassoFixedLambdaThenNNLR: A model that learns empirical OLS regression weights with feature selection using Lasso regression as above followed by non-negative least squares regression. The latter ensures that no feature has negative coefficients even when the coefficients are estimated using least squares without penalization.
- LassoFixedLambda: same as LassoFixedLambdaThenLR but the model uses the original Lasso weights. Note that the coefficients in the Lasso model are estimated using an optimization routine which may produce slightly different results on different systems.
- PositiveLassoCV: same as PositiveLassoCVThenLR but using the original Lasso weights. Please note: the coefficients in the Lasso model are estimated using an optimization routine which may produce slightly different results on different systems.
Note
- NNLR, NNLRIterative, LassoFixedLambdaThenNNLR, LassoFixedLambda, and PositiveLassoCV all have no negative coefficients.
- For all feature selection models, the final set of features will be saved in the feature folder in the experiment output directory.