Evaluation Metrics

This section documents the exact mathematical definitions of the primary metrics used in RSMTool for evaluating the performance of automated scoring engines. RSMTool reports also include many secondary evaluations, which are described in the documentation for the intermediate files and the report sections.

The following conventions are used in the formulas in this section:

\(N\) -- total number of responses in the evaluation set with numeric human scores and numeric system scores. Responses with zero human scores are excluded from evaluations by default, unless exclude_zero_scores was set to false.

\(M\) -- system score. The primary evaluation metrics in the RSMTool report are computed for all six types of scores. For some secondary evaluations, the user can choose between raw and scaled scores using the use_scaled_predictions configuration field for RSMTool or the scale_with field for RSMEval.

\(H\) -- human score, that is, the score values in test_label_column for RSMTool or human_score_column for RSMEval.

\(H2\) -- second human score (if available), that is, the score values in second_human_score_column.

\(N_2\) -- total number of responses in the evaluation set where both \(H\) and \(H2\) are available, numeric, and non-zero (unless exclude_zero_scores was set to false).

\(\bar{M}\) -- Mean of \(M\): \(\bar{M} = \displaystyle\sum_{i=1}^{N}{\frac{M_i}{N}}\)

\(\bar{H}\) -- Mean of \(H\): \(\bar{H} = \displaystyle\sum_{i=1}^{N}{\frac{H_i}{N}}\)

\(\sigma_M\) -- Standard deviation of \(M\): \(\sigma_M = \displaystyle\sqrt{\frac{\sum_{i=1}^{N}{(M_i-\bar{M})^2}}{N-1}}\)

\(\sigma_H\) -- Standard deviation of \(H\): \(\sigma_H = \displaystyle\sqrt{\frac{\sum_{i=1}^{N}{(H_i-\bar{H})^2}}{N-1}}\)

\(\sigma_{H2}\) -- Standard deviation of \(H2\): \(\sigma_{H2} = \displaystyle\sqrt{\frac{\sum_{i=1}^{N_2}{(H2_i-\bar{H2})^2}}{N_2-1}}\)

Accuracy Metrics (Observed score)

These metrics show how well system scores \(M\) predict observed human scores \(H\). The computed metrics are available in the intermediate file eval, with a subset of the metrics also available in the intermediate file eval_short.

Percent exact agreement (rounded scores only)

Percentage of responses where the human and system scores match exactly.

\(A = \displaystyle\sum_{i=1}^{N}\frac{w_i}{N} \times 100\)

where \(w_i=1\) if \(M_i = H_i\) and \(w_i=0\) if \(M_i \neq H_i\)

The percent exact agreement is computed using rsmtool.utils.agreement with tolerance set to 0.

Percent exact + adjacent agreement

Percentage of responses where the absolute difference between the human and system scores is 1 or less.

\(A_{adj} = \displaystyle\sum_{i=1}^{N}\frac{w_i}{N} \times 100\)

where \(w_i=1\) if \(|M_i-H_i| \leq 1\) and \(w_i=0\) if \(|M_i-H_i| \gt 1\).

The percent exact + adjacent agreement is computed using rsmtool.utils.agreement with tolerance set to 1.
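
As a rough illustration, both agreement formulas can be reproduced with a few lines of NumPy. This is only a minimal sketch with made-up toy scores, not the rsmtool.utils.agreement implementation itself:

    import numpy as np

    def percent_agreement(human, system, tolerance=0):
        """Percentage of responses where |system - human| <= tolerance."""
        human, system = np.asarray(human, dtype=float), np.asarray(system, dtype=float)
        return np.mean(np.abs(system - human) <= tolerance) * 100

    # toy scores for illustration only
    human = [1, 2, 3, 4, 3]
    system = [1, 3, 3, 2, 3]
    print(percent_agreement(human, system, tolerance=0))  # exact agreement: 60.0
    print(percent_agreement(human, system, tolerance=1))  # exact + adjacent: 80.0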

Cohen’s kappa (rounded scores only)

\(\kappa=1-\displaystyle\frac{\sum_{j=1}^{K}\sum_{k=1}^{K}{w_{jk}X_{jk}}}{\sum_{j=1}^{K}\sum_{k=1}^{K}{w_{jk}m_{jk}}}\)

with \(w_{jk} = 0\) when \(j = k\) and \(w_{jk} = 1\) when \(j \neq k\)

where:

  • \(K\) is the number of scale score categories (maximum observed rating - minimum observed rating + 1). Note that for the \(\kappa\) computation, the values of \(H\) and \(M\) are each shifted down by the minimum observed rating so that the lowest value is 0. This is done to support negative labels.

  • \(X_{jk}\) is the number of times where \(H=j\) and \(M=k\).

  • \(m_{jk}\) is the number of responses expected in cell \((j, k)\) under chance agreement:

    \(m_{jk} = \displaystyle\frac{n_{j+} \, n_{+k}}{N}\), where

    • \(n_{j+}\) - total number of responses where \(H_i=j\)

    • \(n_{+k}\) - total number of responses where \(M_i=k\)

Kappa is computed using skll.metrics.kappa with weights set to None and allow_off_by_one set to False (default).

Note

See this discussion for the explanation of how the SKLL implementation differs from the scikit-learn implementation. The two implementations might produce different results if the matrix contains missing labels. For example, consider the hypothetical scenario where our predictions only contain the labels 1, 2, and 4. In the SKLL implementation, the missing 3 will be automatically added to the list of labels whereas in the scikit-learn implementation, the 3 would only be added if a complete list of labels was passed to the function via the optional labels keyword argument.
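
The definition above can also be illustrated with a short NumPy sketch that builds the observed matrix \(X\), the chance-agreement matrix \(m\), and the weight matrix \(w\) directly. The toy scores are made up and this is not the SKLL implementation:

    import numpy as np

    def cohens_kappa(human, system):
        human, system = np.asarray(human, dtype=int), np.asarray(system, dtype=int)
        # shift scores so that the lowest observed rating is 0 (supports negative labels)
        minimum = min(human.min(), system.min())
        human, system = human - minimum, system - minimum
        k = int(max(human.max(), system.max())) + 1      # number of score categories
        n = len(human)
        observed = np.zeros((k, k))                      # X_jk: counts of (H=j, M=k)
        for h, m in zip(human, system):
            observed[h, m] += 1
        # m_jk: counts expected under chance agreement, n_j+ * n_+k / N
        expected = np.outer(np.bincount(human, minlength=k),
                            np.bincount(system, minlength=k)) / n
        weights = 1 - np.eye(k)                          # w_jk: 0 on diagonal, 1 elsewhere
        return 1 - (weights * observed).sum() / (weights * expected).sum()

    print(round(cohens_kappa([1, 2, 3, 2, 1], [1, 2, 3, 3, 2]), 3))  # 0.412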

Quadratic weighted kappa (QWK)

Unlike Cohen’s kappa, which is computed only for rounded scores, quadratic weighted kappa is computed for continuous scores using the following formula:

\(QWK=\displaystyle\frac{2 \times Cov(M,H)}{Var(H)+Var(M)+(\bar{M}-\bar{H})^2}\)

Note that in this case the variances and the covariance are computed by dividing by \(N\) rather than by \(N-1\) as is done elsewhere in this section.

QWK is computed using rsmtool.utils.quadratic_weighted_kappa with ddof set to 0.

See Haberman (2019) for the full derivation of this formula. The discrete case is simply treated as a special case of the continuous one.

Note

In RSMTool v6.x and earlier, QWK was computed using skll.metrics.kappa with weights set to "quadratic", and continuous scores were rounded before the computation. Both formulas produce the same values for discrete (rounded) scores, but QWK values for continuous scores computed by RSMTool starting with v7.0 will differ from those computed by earlier versions.
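
A minimal NumPy sketch of the continuous QWK formula above (note the population variances and covariance, i.e. dividing by \(N\)); the toy scores are made up and this is not the rsmtool implementation:

    import numpy as np

    def quadratic_weighted_kappa(human, system):
        human, system = np.asarray(human, dtype=float), np.asarray(system, dtype=float)
        covariance = np.cov(human, system, ddof=0)[0, 1]        # divide by N, not N - 1
        return (2 * covariance /
                (np.var(human, ddof=0) + np.var(system, ddof=0)
                 + (system.mean() - human.mean()) ** 2))

    # toy continuous system scores for illustration only
    print(round(quadratic_weighted_kappa([1, 2, 3, 4], [1.2, 2.1, 2.9, 3.6]), 3))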

Pearson Correlation coefficient (r)

\(r=\displaystyle\frac{\sum_{i=1}^{N}{(H_i-\bar{H})(M_i-\bar{M})}}{\sqrt{\sum_{i=1}^{N}{(H_i-\bar{H})^2} \sum_{i=1}^{N}{(M_i-\bar{M})^2}}}\)

The Pearson correlation coefficient is computed using scipy.stats.pearsonr.

If the variance of human or system scores is 0 (all scores are the same) or only one response is available, RSMTool returns None.

Note

In scipy v1.4.1 and later, the implementation uses the following formula:

\(r=\displaystyle\frac{H-\bar{H}}{\left\|H-\bar{H}\right\|_2}\cdot\frac{M-\bar{M}}{\left\|M-\bar{M}\right\|_2}\)

This implementation is more robust to very large values but is more likely to return a value slightly smaller than 1 (for example, 0.9999999999999998) for perfect correlation when n is small. See this comment for further detail.
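
Since the computation relies directly on scipy.stats.pearsonr, a usage sketch with made-up toy scores looks like this (only the correlation itself is of interest here; the p-value returned by scipy is ignored):

    from scipy.stats import pearsonr

    # toy scores for illustration only
    human = [1, 2, 3, 4, 3]
    system = [1.4, 2.2, 2.8, 3.9, 3.1]

    r, p_value = pearsonr(human, system)   # p_value is not used in this evaluation
    print(round(r, 3))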

Standardized mean difference (SMD)

This metric shows whether the distribution of system scores is centered close to the distribution of human scores: it expresses the difference between the system and human means in units of the standard deviation of the human scores.

\(SMD = \displaystyle\frac{\bar{M}-\bar{H}}{\sigma_H}\)

SMD between system and human scores is computed using rsmtool.utils.standardized_mean_difference with the method argument set to "unpooled".

Note

In RSMTool v6.x and earlier SMD was computed with the method argument set to "williamson" as described in Williamson et al. (2012). The values computed by RSMTool starting with v7.0 will be different from those computed by earlier versions.
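
The unpooled SMD is simple enough to spell out directly; here is a minimal NumPy sketch with made-up toy scores (note that only \(\sigma_H\) appears in the denominator):

    import numpy as np

    # toy scores for illustration only
    human = np.array([1, 2, 3, 4, 3], dtype=float)
    system = np.array([1.4, 2.2, 2.8, 3.9, 3.1])

    smd = (system.mean() - human.mean()) / human.std(ddof=1)   # "unpooled": sigma_H only
    print(round(smd, 3))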

Mean squared error (MSE)

The mean squared error of the system score \(M\) as a predictor of the observed human score \(H\):

\(MSE(H|M) = \displaystyle\frac{1}{N}\sum_{i=1}^{N}{(H_{i}-M_{i})^2}\)

MSE is computed using sklearn.metrics.mean_squared_error.

Proportional reduction in mean squared error for observed score (R2)

\(R2=1-\displaystyle\frac{MSE(H|M)}{\sigma_H^2}\)

R2 is computed using sklearn.metrics.r2_score. If only one response is available, RSMTool returns None.
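
Both metrics map directly onto scikit-learn calls; a minimal usage sketch with made-up toy scores, where the human scores are passed as the reference values:

    from sklearn.metrics import mean_squared_error, r2_score

    # toy scores for illustration only
    human = [1, 2, 3, 4, 3]
    system = [1.4, 2.2, 2.8, 3.9, 3.1]

    mse = mean_squared_error(human, system)   # MSE(H|M)
    r2 = r2_score(human, system)              # proportional reduction in MSE
    print(round(mse, 3), round(r2, 3))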

Accuracy Metrics (True score)

According to test theory, an observed score is a combination of the true score \(T\) and a measurement error. The true score cannot be observed, but its distribution parameters can be estimated from observed scores. Such an estimation requires that two human scores be available for at least a subset of responses in the evaluation set since these are necessary to estimate the measurement error component.

Evaluating the system against the true score produces performance estimates that are robust to errors in human scores and remain stable even when human-human agreement varies (see Loukina et al., 2020).

The true score evaluations computed by RSMTool are available in the intermediate file true_score_eval.

Proportional reduction in mean squared error for true scores (PRMSE)

PRMSE shows how well system scores can predict true scores. This metric generally varies between 0 (random prediction) and 1 (perfect prediction), although in some cases it can take negative values (suggesting a very bad fit) or exceed 1 (suggesting that the sample size is too small to reliably estimate the rater error variance).

PRMSE for true scores is defined similarly to PRMSE for observed scores, but with the true score \(T\) used instead of the observed score \(H\), that is, as the percentage of variance in the true scores explained by the system scores.

\(PRMSE=1-\displaystyle\frac{MSE(T|M)}{\sigma_T^2}\)

In the simple case where all responses have two human scores, \(MSE(T|M)\) (mean squared error when predicting true score with system score) and \(\sigma_T^2\) (variance of true score) are estimated from their observed score counterparts \(MSE(H|M)\) and \(\sigma_H^2\) as follows:

  • \(\hat{H}\) is used instead of \(H\) to compute \(MSE(\hat{H}|M)\) and \(\sigma_{\hat{H}}^2\). \(\hat{H}\) is the average of the two human scores for each response (\(\hat{H_i} = \frac{{H_i}+{H2_i}}{2}\)). These evaluations use \(\hat{H}\) rather than \(H\) because the measurement errors of the two raters are assumed to be random and can therefore partially cancel out, making the average \(\hat{H}\) closer to the true score \(T\) than either \(H\) or \(H2\).

  • To compute estimates for true scores, the values for observed scores are adjusted for variance of measurement errors (\(\sigma_{e}^2\)) in human scores defined as:

    \(\displaystyle\sigma_{e}^2 = \frac{1}{2 \times N_2}\sum_{i=1}^{N_2}{(H_{i} - H2_{i})^2}\)

In the simple case, where all responses are double-scored, \(MSE(T|M)\) is estimated as:

\(MSE(T|M) = MSE(\hat{H}|M)-\displaystyle\frac{1}{2}\sigma_{e}^2\)

and \(\sigma_T^2\) is estimated as:

\(\sigma_T^2 = \sigma_{\hat{H}}^2 - \displaystyle\frac{1}{2}\sigma_{e}^2\)
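
For the fully double-scored case, the estimates above can be reproduced with a short NumPy sketch. The toy scores are made up and this is not the rsmtool implementation, which also handles the general case described next:

    import numpy as np

    # toy scores for illustration only
    h1 = np.array([1, 2, 3, 4, 3], dtype=float)        # first human score H
    h2 = np.array([1, 2, 3, 4, 4], dtype=float)        # second human score H2
    system = np.array([1.4, 2.2, 2.8, 3.9, 3.1])       # system score M

    h_hat = (h1 + h2) / 2                                        # average human score
    var_errors = np.sum((h1 - h2) ** 2) / (2 * len(h1))          # sigma_e^2
    mse_true = np.mean((h_hat - system) ** 2) - var_errors / 2   # MSE(T|M)
    var_true = np.var(h_hat, ddof=1) - var_errors / 2            # sigma_T^2
    print(round(1 - mse_true / var_true, 3))                     # PRMSE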

The PRMSE formula implemented in RSMTool is more general and can also handle the case where the number of available ratings varies across the responses (e.g. only a subset of responses is double-scored). While rsmtool and rsmeval only support evaluations with two raters, the implementation of the PRMSE formula available via the API supports cases where some of the responses have more than two ratings available. The formula was derived by Matt S. Johnson and is explained in more detail in Loukina et al. (2020).

In this case, the variance of rater errors is computed as a pooled variance estimator.

We first calculate the within-response variance of the human ratings for each response, \(V_i\), using the denominator \(c_i - 1\):

\(V_{i} = \displaystyle\frac{\sum_{j=1}^{c_i} (H_{i,j} - \bar{H}_i)^2}{c_i-1}\)

where

  • \(H_{i,j}\) is the human score assigned by rater \(j\) to response \(i\)

  • \(c_i\) is the total number of human scores available for response \(i\). For double-scored responses this equals 2.

  • \(\bar{H}_i\) is the average human rating for response \(i\).

We then take a weighted average of these within-response variances:

\(\sigma_{e}^2 = \displaystyle\frac{\sum_{i=1}^N (c_i-1) \, V_{i}}{\sum_{i=1}^N (c_i-1)}\)

The true score variance \(\sigma_T^2\) is then estimated as

\(\sigma_T^2 = \displaystyle\frac{\sum_{i=1}^N c_i (\bar{H}_i - \bar{H})^2 - (N-1) \sigma_{e}^2}{c_\cdot - \frac{\sum_{i=1}^N c_i^2}{c_\cdot}}\)

where

  • \(c_\cdot = \sum_{i=1}^N c_i\) is the total number of observed human scores.

  • \(\bar{H}_i\) is the average human rating for response \(i\). For responses with only one rating this will be the single human score H.

Mean squared error \(MSE(T|M)\) is estimated as:

\(MSE(T|M) = \displaystyle\frac{1}{c_\cdot} \left (\sum_{i=1}^N c_i (\bar{H}_i - M_i)^2 - N\sigma_{e}^2 \right )\)

The formulas are derived to ensure consistent results regardless of the number of raters and the number of ratings available for each response.

PRMSE is computed using the rsmtool.utils.prmse_true function.

In some cases, it may be appropriate to compute the variance of human errors using a different sample than the one used for the main evaluations. This can be accomplished using rsmtool.utils.variance_of_errors together with the optional configuration field rater_error_variance in rsmtool or rsmeval.

Note

The PRMSE formula assigns a higher weight to discrepancies between the system score and the human score when the human score is the average of two or more ratings than when it is based on a single rating.
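
The general-case formulas can likewise be sketched in NumPy for a toy set of responses where some responses have one rating and some have two. The data are made up, this is not the rsmtool.utils.prmse_true implementation, and the overall mean \(\bar{H}\) is taken here as the mean of all observed human ratings:

    import numpy as np

    # toy data: human ratings per response (np.nan = no rating) and system scores
    ratings = np.array([[1, 1], [2, np.nan], [3, 3], [4, 3],
                        [3, np.nan], [2, 2], [4, 4], [1, 2]], dtype=float)
    system = np.array([1.5, 2.4, 2.6, 3.2, 2.8, 2.3, 3.7, 1.6])

    c = np.sum(~np.isnan(ratings), axis=1)               # c_i: number of ratings per response
    h_bar_i = np.nanmean(ratings, axis=1)                # average human rating per response
    n = len(system)
    c_total = c.sum()                                    # c_dot: total number of ratings

    # pooled variance of rater errors: within-response squared deviations divided by
    # the total degrees of freedom (single-scored responses contribute zero weight)
    ss_within = np.nansum((ratings - h_bar_i[:, None]) ** 2, axis=1)
    var_errors = ss_within.sum() / np.sum(c - 1)         # sigma_e^2

    h_bar = np.sum(c * h_bar_i) / c_total                # mean of all observed ratings
    var_true = ((np.sum(c * (h_bar_i - h_bar) ** 2) - (n - 1) * var_errors)
                / (c_total - np.sum(c ** 2) / c_total))                  # sigma_T^2
    mse_true = (np.sum(c * (h_bar_i - system) ** 2) - n * var_errors) / c_total  # MSE(T|M)
    print(round(1 - mse_true / var_true, 3))             # PRMSE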

Fairness

Fairness of automated scores is an important component of RSMTool evaluations (see Madnani et al., 2017).

When defining an experiment, the RSMTool user has the option of specifying which subgroups should be considered for such evaluations using the subgroups field. These subgroups are then used in all fairness evaluations.

All fairness evaluations are conducted on the evaluation set. The metrics are only computed for either raw_trim or scale_trim scores (see score postprocessing for further details) depending on the value of use_scaled_predictions in RSMTool or the value of scale_with in RSMEval.

Differences between standardized means for subgroups (DSM)

This is the standard evaluation used for quantifying subgroup differences. The metrics are available in the intermediate files eval_by_<SUBGROUP>.

DSM is computed as follows:

  1. For each response \(i\), compute z-scores for the human and system scores using the \(\bar{H}\), \(\bar{M}\), \(\sigma_H\), and \(\sigma_M\) computed on the whole evaluation set:

    \(z_{H_{i}} = \displaystyle\frac{H_i - \bar{H}}{\sigma_H}\)

    \(z_{M_{i}} = \displaystyle\frac{M_i - \bar{M}}{\sigma_M}\)

  2. For each response \(i\), calculate the difference between the system and human z-scores: \(z_{M_{i}} - z_{H_{i}}\)

  3. Calculate the mean of the differences \(z_{M_{i}} - z_{H_{i}}\) for each subgroup of interest.

DSM is computed using rsmtool.utils.difference_of_standardized_means with:

  • population_y_true_observed_mn = \(\bar{H}\) for the whole evaluation set

  • population_y_pred_mn = \(\bar{M}\) for the whole evaluation set

  • population_y_true_observed_sd = \(\sigma_H\) for the whole evaluation set

  • population_y_pred_sd = \(\sigma_M\) for the whole evaluation set
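
The three steps above can also be sketched directly in NumPy for a toy evaluation set with two made-up subgroups (this is not the rsmtool implementation):

    import numpy as np

    # toy scores and subgroup labels for illustration only
    human = np.array([1, 2, 3, 4, 3, 2], dtype=float)
    system = np.array([1.4, 2.2, 2.8, 3.9, 3.1, 2.5])
    group = np.array(["A", "A", "B", "B", "A", "B"])

    # step 1: z-scores using means and SDs from the whole evaluation set
    z_human = (human - human.mean()) / human.std(ddof=1)
    z_system = (system - system.mean()) / system.std(ddof=1)

    # steps 2 and 3: per-response differences, then the mean difference per subgroup
    diff = z_system - z_human
    dsm = {g: diff[group == g].mean() for g in np.unique(group)}
    print(dsm)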

Note

In RSMTool v6.x and earlier, subgroup differences were computed using standardized mean difference with the method argument set to "williamson". Since the differences computed in this manner were very sensitive to score distributions, RSMTool no longer uses this function to compute subgroup differences starting with v7.0.

Additional fairness evaluations

Starting with v7.0, RSMTool includes additional fairness analyses suggested in Loukina, Madnani, & Zechner, 2019. The computed metrics from these analyses are available in intermediate files fairness_metrics_by_<SUBGROUP>.

These include:

  • Overall score accuracy: percentage of variance in squared error \((M-H)^2\) explained by subgroup membership

  • Overall score difference: percentage of variance in the error \((M-H)\) explained by subgroup membership

  • Conditional score difference: percentage of variance in the error \((M-H)\) explained by subgroup membership when controlling for the human score

Please refer to the paper for full descriptions of these metrics.

The fairness metrics are computed using rsmtool.fairness_utils.get_fairness_analyses.
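
As a rough illustration of the overall score difference idea, the share of variance in the error \((M-H)\) explained by subgroup membership can be computed as the \(R^2\) obtained when each response's error is predicted by its subgroup mean. This is only a simplified sketch with made-up data, not the rsmtool.fairness_utils implementation, which fits the regression models described in the paper:

    import numpy as np

    # toy scores and subgroup labels for illustration only
    human = np.array([1, 2, 3, 4, 3, 2], dtype=float)
    system = np.array([1.4, 2.2, 2.8, 3.9, 3.1, 2.5])
    group = np.array(["A", "A", "B", "B", "A", "B"])

    error = system - human
    group_means = {g: error[group == g].mean() for g in np.unique(group)}
    predicted = np.array([group_means[g] for g in group])   # group-mean prediction

    ss_residual = np.sum((error - predicted) ** 2)
    ss_total = np.sum((error - error.mean()) ** 2)
    print(1 - ss_residual / ss_total)   # share of error variance explained by subgroup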

Human-human agreement

If scores from a second human (H2) are available, RSMTool computes the following additional human-human agreement metrics using only the \(N_2\) responses that contain numeric values in both the \(H\) and \(H2\) columns.

The computed metrics are available in the intermediate file consistency.

Percent exact agreement

Same as percent exact agreement for observed scores but substituting \(H2\) for \(M\) and \(N_2\) for \(N\).

Percent exact + adjacent agreement

Same as percent exact + adjacent agreement for observed scores but substituting \(H2\) for \(M\) and \(N_2\) for \(N\).

Cohen’s kappa

Same as Cohen’s kappa for observed scores but substituting \(H2\) for \(M\) and \(N_2\) for \(N\).

Quadratic weighted kappa (QWK)

Same as QWK for observed scores but substituting \(H2\) for \(M\) and \(N_2\) for \(N\).

Pearson Correlation coefficient (r)

Same as r for observed scores but substituting \(H2\) for \(M\) and \(N_2\) for \(N\).

Standardized mean difference (SMD)

\(SMD = \displaystyle\frac{\bar{H2}-\bar{H}}{ \sqrt{\frac{\sigma_{H}^2 + \sigma_{H2}^2}{2}}}\)

Unlike the SMD for human-system scores, the denominator in this case is the “pooled” standard deviation of \(H\) and \(H2\).

SMD between the two human scores is therefore computed using rsmtool.utils.standardized_mean_difference with the method argument set to "pooled".
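
A minimal sketch of the pooled version with made-up toy scores:

    import numpy as np

    # toy human scores for illustration only
    h1 = np.array([1, 2, 3, 4, 3], dtype=float)
    h2 = np.array([1, 2, 3, 4, 4], dtype=float)

    pooled_sd = np.sqrt((h1.var(ddof=1) + h2.var(ddof=1)) / 2)
    print(round((h2.mean() - h1.mean()) / pooled_sd, 3))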

Note

In RSMTool v6.x and earlier, SMD was computed with the method argument set to "williamson" as described in Williamson et al. (2012). Starting with v7.0, the values computed by RSMTool will be different from those computed by earlier versions.