Auto-generating configuration files

Configuration files for rsmtool, rsmeval, rsmexplain, rsmcompare, rsmpredict, rsmsummarize and rsmxval can be difficult to create manually due to the large number of configuration options supported by these tools. To make this easier for users, all of these tools support automatic creation of configuration files, both interactively and non-interactively.

Interactive generation

For novice users, it is easiest to use a guided, interactive mode to generate a configuration file. For example, to generate an rsmtool configuration file interactively, run the following command:

rsmtool generate --interactive --output example_rsmtool.json

The following screencast shows an example interactive session after the above command is run (click to play):

_images/demo.gif

The configuration file example_rsmtool.json generated via the session is shown below:

{
    // Reference: https://rsmtool.readthedocs.io/en/stable/usage_rsmtool.html#experiment-configuration-file
    "experiment_id": "test_rsmtool",
    "model": "LinearRegression",
    "train_file": "/Users/nmadnani/train.csv",
    "test_file": "/Users/nmadnani/test.csv",
    // OPTIONAL: replace default values below based on your data.
    "candidate_column": null,
    "custom_sections": null,
    "description": "Test experiment.",
    "exclude_zero_scores": true,
    "feature_subset": null,
    "feature_subset_file": null,
    "features": null,
    "file_format": "xlsx",
    "flag_column": null,
    "flag_column_test": null,
    "general_sections": [
        "data_description",
        "feature_descriptives",
        "preprocessed_features",
        "consistency",
        "model",
        "evaluation",
        "true_score_evaluation",
        "pca",
        "intermediate_file_paths",
        "sysinfo"
    ],
    "id_column": "ID",
    "length_column": "length",
    "min_items_per_candidate": null,
    "min_n_per_group": null,
    "predict_expected_scores": false,
    "second_human_score_column": null,
    "section_order": null,
    "select_transformations": false,
    "sign": null,
    "skll_objective": null,
    "standardize_features": true,
    "subgroups": [],
    "test_label_column": "score",
    "train_label_column": "score",
    "trim_max": 6,
    "trim_min": 1,
    "trim_tolerance": 0.4998,
    "use_scaled_predictions": true,
    "use_thumbnails": false,
    "use_truncation_thresholds": false
}

Note

Although we use rsmtool in the example above, the same instructions apply to all 7 tools; simply replace rsmtool with rsmeval, rsmcompare, etc.

There are some configuration options that can accept multiple inputs. For example, the experiment_dirs option for rsmsummarize takes a list of rsmtool experiment directories for a summary report. These options are handled differently in interactive mode. To illustrate this, let’s generate a configuration file for rsmsummarize by using the following command:

rsmsummarize generate --interactive --output example_rsmsummarize.json

The following screencast shows the interactive session (click to play):

_images/summary.gif

And here is the generated configuration file for rsmsummarize:

{
    // Reference: https://rsmtool.readthedocs.io/en/stable/advanced_usage.html#config-file-rsmsummarize
    "summary_id": "test_summary",
    "experiment_dirs": [
        "/Users/nmadnani/work",
        "/Users/nmadnani/work/rsmtool",
        "/Users/nmadnani/work/rsmtool/tests"
    ],
    // OPTIONAL: replace default values below based on your data.
    "custom_sections": null,
    "description": "This is a test.",
    "experiment_names": null,
    "file_format": "tsv",
    "general_sections": [
        "preprocessed_features",
        "model",
        "evaluation",
        "true_score_evaluation",
        "intermediate_file_paths",
        "sysinfo"
    ],
    "section_order": null,
    "subgroups": [],
    "use_thumbnails": true
}

Important

If you want to include subgroup information in the reports for rsmtool, rsmeval, rsmcompare, and rsmxval, add --subgroups to the command. For example, when you run rsmeval generate --interactive --subgroups, you will be prompted to enter the subgroup column names, and the general_sections list (if shown [1]) will also include subgroup-based sections. Since the subgroups option can accept multiple inputs, it is handled in the same way as the experiment_dirs option for rsmsummarize above.

We end with a list of important things to note about interactive generation:

  • Carefully read the instructions and notes displayed at the top when you first enter interactive mode.

  • If you do not specify an output file using --output, the generated configuration file will simply be printed out.

  • You may see messages like “invalid option” and “invalid file” on the bottom left while you are entering the value for a field. This is an artifact of real-time validation. For example, when choosing a training file for rsmtool, the message “invalid file” may be displayed while you navigate to the actual file. Once you get to a valid file, this message should disappear.

  • Required fields will not accept a blank input (just pressing enter) and will show an error message in the bottom left until a valid input is provided.

  • Optional fields will accept blank inputs since they have default values that will be used if no user input is provided. In some cases, default values are shown underlined in parentheses.

  • You can also use -i as an alias for --interactive and -g as an alias for --subgroups. So, for example, if you want to interactively generate a configuration file with subgroups for rsmtool, just run rsmtool generate -ig instead of rsmtool generate --interactive --subgroups.

  • The configuration files generated interactively contain comments (as indicated by // ...). While RSMTool handles JSON files with comments just fine, you may need to remove the comments manually if you wish to use these files outside of RSMTool.
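The manual comment removal mentioned in the last item can also be scripted. The following is a minimal sketch and not part of RSMTool: it assumes that comments only ever occupy whole lines beginning with //, as in the generated files shown above. The helper name strip_line_comments and the sample text are hypothetical.

```python
import json


def strip_line_comments(text):
    """Drop every line whose first non-blank characters are '//'.

    Assumption: comments occupy whole lines, as in the generated files.
    A '//' that appears inside a value (e.g. a URL) is left untouched.
    """
    kept = [line for line in text.splitlines()
            if not line.lstrip().startswith("//")]
    return "\n".join(kept)


# hypothetical excerpt written in the style of a generated file
sample = """{
    // Reference: https://rsmtool.readthedocs.io/en/stable/advanced_usage.html#config-file-rsmsummarize
    "summary_id": "test_summary",
    // OPTIONAL: replace default values below based on your data.
    "file_format": "tsv",
    "subgroups": []
}"""

# once the comment lines are gone, the standard json module can parse the text
config = json.loads(strip_line_comments(sample))
print(config["summary_id"])
```

Filtering whole lines rather than searching for // anywhere avoids damaging values that legitimately contain two slashes, such as URLs or paths.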

Non-interactive generation

Advanced users who are comfortable editing a configuration file by hand can instead generate a dummy configuration file non-interactively. To do so, simply omit the --interactive switch from the commands above. For example, to generate a dummy configuration file for rsmtool, run:

rsmtool generate --output dummy_rsmtool.json

Running this command prints the following warning to stderr:

WARNING: Automatically generated configuration files MUST be edited to add values
for required fields and even for optional ones depending on your data

This warning explains that the generated file cannot be used directly as input to rsmtool since the required fields are filled with dummy values. This can be confirmed by looking at the configuration file the command generates:

{
    // Reference: https://rsmtool.readthedocs.io/en/stable/usage_rsmtool.html#experiment-configuration-file
    // REQUIRED: replace "ENTER_VALUE_HERE" with the appropriate value!
    "experiment_id": "ENTER_VALUE_HERE",
    "model": "ENTER_VALUE_HERE",
    "train_file": "ENTER_VALUE_HERE",
    "test_file": "ENTER_VALUE_HERE",
    // OPTIONAL: replace default values below based on your data.
    "candidate_column": null,
    "custom_sections": null,
    "description": "",
    "exclude_zero_scores": true,
    "feature_subset": null,
    "feature_subset_file": null,
    "features": null,
    "file_format": "csv",
    "flag_column": null,
    "flag_column_test": null,
    "general_sections": [
        "data_description",
        "feature_descriptives",
        "preprocessed_features",
        "consistency",
        "model",
        "evaluation",
        "true_score_evaluation",
        "pca",
        "intermediate_file_paths",
        "sysinfo"
    ],
    "id_column": "spkitemid",
    "length_column": null,
    "min_items_per_candidate": null,
    "min_n_per_group": null,
    "predict_expected_scores": false,
    "second_human_score_column": null,
    "section_order": null,
    "select_transformations": false,
    "sign": null,
    "skll_objective": null,
    "standardize_features": true,
    "subgroups": [],
    "test_label_column": "sc1",
    "train_label_column": "sc1",
    "trim_max": null,
    "trim_min": null,
    "trim_tolerance": 0.4998,
    "use_scaled_predictions": false,
    "use_thumbnails": false,
    "use_truncation_thresholds": false
}

Note the two comments demarcating the locations of the required and optional fields. Note also that the required fields are filled with the dummy value “ENTER_VALUE_HERE” that must be manually edited by the user. The optional fields are filled with default values that may also need to be further edited depending on the data being used.
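Since a forgotten placeholder is easy to miss, it can help to check a configuration for leftover dummy values before using it. The sketch below is an illustration only: the helper find_unfilled_fields is hypothetical and not part of RSMTool, and it assumes the configuration has already been loaded into a dictionary.

```python
# the dummy value that non-interactive generation writes into required fields
PLACEHOLDER = "ENTER_VALUE_HERE"


def find_unfilled_fields(config):
    """Return the names of top-level fields still set to the dummy value."""
    return [name for name, value in config.items() if value == PLACEHOLDER]


# hypothetical, partially edited configuration
config = {
    "experiment_id": "test_experiment",
    "model": "ENTER_VALUE_HERE",
    "train_file": "train.csv",
    "test_file": "ENTER_VALUE_HERE",
    "description": "",
    "subgroups": [],
}

unfilled = find_unfilled_fields(config)
if unfilled:
    print("Still need values for:", ", ".join(unfilled))
```

A check like this only catches untouched required fields; optional fields still need to be reviewed against your data.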

Just like interactive generation, non-interactive generation is supported by all 7 tools: rsmtool, rsmeval, rsmexplain, rsmcompare, rsmpredict, rsmsummarize, and rsmxval.

Similarly, to include subgroup information in the reports for rsmtool, rsmeval, rsmcompare, and rsmxval, just add --subgroups (or -g) to the command. Note that, unlike in interactive mode, this only adds subgroup-based sections to the general_sections list in the output file. You will need to manually edit the subgroups option in the configuration file to enter the subgroup column names.

Generation API

Interactive generation is only meant for end users and can only be used via the 7 command-line tools rsmtool, rsmeval, rsmexplain, rsmcompare, rsmpredict, rsmsummarize, and rsmxval. It cannot be used via the RSMTool API.

However, non-interactive generation can be used via the API, which can be useful for more advanced RSMTool users. To illustrate, here’s some example Python code to generate a configuration for rsmtool in the form of a dictionary:

# import the ConfigurationGenerator class
from rsmtool.utils.commandline import ConfigurationGenerator

# instantiate it with the options as needed
#   we want a dictionary, not a string
#   we do not want to see any warnings
#   we want to include subgroup-based sections in the report
generator = ConfigurationGenerator('rsmtool',
                                   as_string=False,
                                   suppress_warnings=True,
                                   use_subgroups=True)

# generate the configuration dictionary
configdict = generator.generate()

# remember we still need to replace the dummy values
# for the required fields
configdict["experiment_id"] = "test_experiment"
configdict["model"] = "LinearRegression"
configdict["train_file"] = "train.csv"
configdict["test_file"] = "test.csv"

# and don't forget about adding the subgroups
configdict["subgroups"] = ["GROUP1", "GROUP2"]

# make other changes to optional fields based on your data
...

# now we can use this dictionary to run an rsmtool experiment via the API
from rsmtool import run_experiment
run_experiment(configdict, "/tmp/output")

For more details, refer to the API documentation for the ConfigurationGenerator class.