Quickstart

This page provides simple examples to get you started accessing data and training models with TableShift. If you would like to set up customized experiments or run experiments at scale, check out the examples in our GitHub repository here.

1. Environment Setup

We recommend using our prebuilt Docker environment as a starting point, due to the complex dependencies required to support the many tabular data models available in TableShift. You can fetch a prebuilt Docker image via:

# fetch the docker image
docker pull ghcr.io/jpgard/tableshift:latest

# run it to test your setup; this automatically launches examples/run_expt.py
docker run ghcr.io/jpgard/tableshift:latest --model xgb

# optionally, use the container interactively
docker run -it --entrypoint=/bin/bash ghcr.io/jpgard/tableshift:latest

You can change the entrypoint in the above code to /usr/local/bin/python to directly enter a Python interpreter and run Python code. For example, the block below loads a dataset for model training:

# To run Python code using the Docker environment from above, run:
# docker run -it --entrypoint=/usr/local/bin/python ghcr.io/jpgard/tableshift:latest

from tableshift import get_dataset

dataset_name = "diabetes_readmission"
dset = get_dataset(dataset_name)

# Fetch a pandas DataFrame of the training set
X_tr, y_tr, _, _ = dset.get_pandas("train")

# Fetch and use a pytorch DataLoader
train_loader = dset.get_dataloader("train", batch_size=1024)

for X, y, _, _ in train_loader:
    ...
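
To go one step further with the DataLoader, here is a minimal sketch of a custom PyTorch training loop. It assumes each batch yields a float feature tensor and binary labels (TableShift tasks are binary classification); the model architecture and optimizer settings are arbitrary illustrations, not recommended defaults:

import torch

# A tiny MLP sized from the first batch; purely illustrative.
X0, y0, _, _ = next(iter(train_loader))
model = torch.nn.Sequential(
    torch.nn.Linear(X0.shape[1], 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.BCEWithLogitsLoss()

for X, y, _, _ in train_loader:
    optimizer.zero_grad()
    logits = model(X.float()).squeeze(-1)
    loss = loss_fn(logits, y.float())
    loss.backward()
    optimizer.step()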

2. Model Training

TableShift includes implementations of 19 different models (described in the paper and in detail here). To train a model on a publicly available dataset, you can use the example script provided in the examples directory:

python examples/run_expt.py --experiment diabetes_readmission --model xgb

That's it!

3. Optional and Advanced Usage

This section outlines some advanced use cases: benchmarking your own models, running distributed hyperparameter tuning of the models available in TableShift, and, as an experimental and not officially supported feature, using your own datasets with TableShift.

Benchmarking Your Own Models

Training your own models should be managed in your own scripts (examples/run_expt.py is a good starting point). All you need to train your model is a few lines of Python code to access the TableShift data in Pandas, Torch, or Ray format (see the example above).
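
For example, a minimal sketch of benchmarking a scikit-learn model could look like the following. It assumes the features returned by get_pandas are already numeric after TableShift's preprocessing, and that the task exposes "id_test" and "ood_test" splits; adjust the split names to whatever your chosen task provides:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

from tableshift import get_dataset

dset = get_dataset("diabetes_readmission")
X_tr, y_tr, _, _ = dset.get_pandas("train")

# Any estimator with a fit/predict interface works here.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)

# Evaluate on the in-distribution and out-of-distribution test splits
# (split names assumed; adjust to the splits your dataset exposes).
for split in ("id_test", "ood_test"):
    X_te, y_te, _, _ = dset.get_pandas(split)
    print(split, accuracy_score(y_te, clf.predict(X_te)))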

Distributed hyperparameter tuning with Ray

We recommend tuning the hyperparameters of any method in order to fairly evaluate its performance on the tasks in TableShift.

If you would like to run distributed hyperparameter tuning with Ray on one of the available models, you can run

ulimit -u 127590 && python scripts/ray_train.py \
    --experiment adult \
    --num_samples 2 \
    --num_workers 1 \
    --cpu_per_worker 4 \
    --use_cached \
    --models xgb

Please note that Ray can require careful tuning based on your system resources for optimal performance. For more information, check the Ray Tune documentation.
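
If you would rather drive Ray Tune directly from Python instead of using scripts/ray_train.py, a minimal sketch is shown below. The search space, the "validation" split name, and the use of the pandas interface are assumptions for illustration only and do not mirror the settings in the TableShift tuning script:

from ray import tune
import xgboost as xgb
from sklearn.metrics import accuracy_score

from tableshift import get_dataset

dset = get_dataset("adult")
X_tr, y_tr, _, _ = dset.get_pandas("train")
X_val, y_val, _, _ = dset.get_pandas("validation")  # split name assumed

def train_xgb(config):
    # Train one XGBoost model and return its validation accuracy.
    # For large datasets, Ray recommends passing data via tune.with_parameters
    # rather than closing over it as done here.
    clf = xgb.XGBClassifier(**config)
    clf.fit(X_tr, y_tr)
    acc = accuracy_score(y_val, clf.predict(X_val))
    # Returning a dict reports it as the trial's final result; for
    # intermediate reporting, use the report API of your Ray version.
    return {"val_accuracy": acc}

tuner = tune.Tuner(
    train_xgb,
    param_space={
        "max_depth": tune.randint(2, 10),
        "learning_rate": tune.loguniform(1e-3, 1e-1),
    },
    tune_config=tune.TuneConfig(num_samples=2, metric="val_accuracy", mode="max"),
)
results = tuner.fit()
print(results.get_best_result().config)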

Using Your Own Datasets

The TableShift package is designed to support the benchmark datasets only. However, some users have expressed interest in "bringing their own data". To do this, you'll need to make the following changes to the TableShift Python source:

* Add a DataSource for your dataset in tableshift/core/data_source.py. The DataSource specifies how to fetch a dataset and how to check if it is cached locally.
* Add an ExperimentConfig in tableshift/configs/non_benchmark_configs.py. An ExperimentConfig defines how the data is split (i.e. the ID and OOD partitions) and optional grouping based on sensitive attributes.
* Add a TaskConfig for the dataset. A TaskConfig defines the features, data types, and any preprocessing for a dataset.