Pick a benchmark task from the searchable sidebar (filter by name, domain, or path). The task header describes the setup (univariate, multivariate, or covariate) with catalog context, frequency, and horizon when available, plus optional source links. Choose up to 5 models from the Model Overlay chips, which are ranked by leaderboard win rate. On live warehouse tasks (not the offline demo), pick the evaluation window and, when the series has multiple columns, the target variate. Toggle Show uncertainty to switch between point forecasts and p10–p90 prediction bands where the display data includes them. Figure 1 plots forecast trajectories against the holdout target from GCS; Figure 2 compares the same models on this task with a per-metric bar chart.
Figure 1. Forecast trajectories for selected models on this task. The orange line is the actual target from GCS display data (context + holdout). Shaded regions use p10–p90 intervals when present in the warehouse.
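If you want to reproduce a Figure 1-style view outside the dashboard, the sketch below shows the general shape of the plot: the actual target in orange, a point forecast line, and a shaded p10–p90 band. The GCS path and the column names (timestamp, target, model_a_p10/_p50/_p90) are assumptions for illustration, not the dashboard's actual display-data schema.

```python
# Minimal sketch: one model's forecast against the holdout target with a
# p10-p90 band. Path and column names are hypothetical; reading gs:// paths
# with pandas requires gcsfs, or point it at a local file instead.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_parquet("gs://your-bucket/display_data/task_example.parquet")  # hypothetical path
df["timestamp"] = pd.to_datetime(df["timestamp"])

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(df["timestamp"], df["target"], color="tab:orange", label="actual (context + holdout)")
ax.plot(df["timestamp"], df["model_a_p50"], color="tab:blue", label="model_a point forecast")

# Shade the p10-p90 interval only where both quantiles are present.
band = df.dropna(subset=["model_a_p10", "model_a_p90"])
ax.fill_between(band["timestamp"], band["model_a_p10"], band["model_a_p90"],
                color="tab:blue", alpha=0.2, label="model_a p10-p90")

ax.set_xlabel("time")
ax.set_ylabel("target variate")
ax.legend()
plt.tight_layout()
plt.show()
```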
Figure 2. Per-metric comparison of selected models on the current task.
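For reference, a per-metric comparison like Figure 2 is essentially a grouped bar chart of metric scores by model. The sketch below uses placeholder metric names and values, not leaderboard data.

```python
# Minimal sketch of a per-metric comparison: grouped bars, one group per metric,
# one bar per selected model. Metric names and scores are illustrative only.
import numpy as np
import matplotlib.pyplot as plt

metrics = ["MASE", "CRPS", "sMAPE"]   # assumed metric set
models = {                            # assumed scores per model
    "model_a": [0.82, 0.41, 0.12],
    "model_b": [0.95, 0.47, 0.15],
}

x = np.arange(len(metrics))
width = 0.8 / len(models)

fig, ax = plt.subplots(figsize=(8, 4))
for i, (name, scores) in enumerate(models.items()):
    ax.bar(x + i * width, scores, width, label=name)

ax.set_xticks(x + width * (len(models) - 1) / 2)
ax.set_xticklabels(metrics)
ax.set_ylabel("score (lower is better for these metrics)")
ax.legend()
plt.tight_layout()
plt.show()
```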
Wall-clock runtimes come from warehouse telemetry and worker timestamps on the latest evaluation slice, scoped by the task category control (All tasks, Univariate, Multivariate, or Covariate). Figure 3 plots each model's mean win rate (pooled leaderboard metrics) against total wall time in seconds (summed over tasks in that filter). When you pick a single category instead of All tasks, the per-task table renders as a heatmap of wall time by model, with column totals (see the sketch at the end of this section). The final table (Table 2) lists the Google Cloud Batch VM shape (machine type, vCPUs, RAM) and which models in this slice map to each tier.
Figure 3. One point per model: mean win rate (%, leaderboard metrics) versus total wall time in seconds (summed over tasks in the category filter), log-scaled x-axis. Dashed lines mark cohort medians; the shaded quadrant is faster-than-median wall time and higher-than-median win rate. Hover for model name and values.
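A minimal sketch of Figure 3's layout follows, assuming you already have per-model win rates and total wall times in hand. The values are placeholders, and the median lines and quadrant shading only approximate the dashboard's styling.

```python
# Minimal sketch: win rate vs total wall time, log x-axis, cohort medians as
# dashed lines, and the faster-and-better quadrant shaded. Numbers are
# placeholders, not leaderboard or telemetry data.
import numpy as np
import matplotlib.pyplot as plt

models = ["model_a", "model_b", "model_c", "model_d"]
wall_time_s = np.array([120.0, 900.0, 45.0, 3600.0])  # total seconds over tasks in the filter
win_rate = np.array([61.0, 48.0, 55.0, 70.0])          # mean win rate, %

med_t, med_w = np.median(wall_time_s), np.median(win_rate)
x_lo, x_hi = wall_time_s.min() / 2, wall_time_s.max() * 2

fig, ax = plt.subplots(figsize=(7, 5))
ax.set_xscale("log")
ax.set_xlim(x_lo, x_hi)
ax.set_ylim(0, 100)

# Shade the quadrant that is faster than the median and above the median win rate.
ax.fill_between([x_lo, med_t], med_w, 100, color="green", alpha=0.08)
ax.axvline(med_t, linestyle="--", color="gray")
ax.axhline(med_w, linestyle="--", color="gray")

ax.scatter(wall_time_s, win_rate)
for name, t, w in zip(models, wall_time_s, win_rate):
    ax.annotate(name, (t, w), textcoords="offset points", xytext=(5, 5))

ax.set_xlabel("total wall time (s, log scale)")
ax.set_ylabel("mean win rate (%)")
plt.tight_layout()
plt.show()
```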
Table 2. Google Cloud Batch worker VM shape (machine type, vCPUs, RAM) and models from the current leaderboard slice assigned to that tier.
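For the per-task wall-time heatmap mentioned above, the underlying data is essentially a task-by-model pivot with column totals. The sketch below shows one way to assemble it with pandas; the task names, models, and timings are made up, not warehouse telemetry.

```python
# Minimal sketch of the table behind the per-task wall-time heatmap:
# pivot task x model wall times, then append a totals row per model column.
import pandas as pd

runs = pd.DataFrame({
    "task":  ["m4_hourly", "m4_hourly", "traffic", "traffic"],
    "model": ["model_a",   "model_b",   "model_a", "model_b"],
    "wall_time_s": [42.0, 310.0, 18.5, 120.0],
})

# One row per task, one column per model, summed wall time in seconds.
wall = runs.pivot_table(index="task", columns="model",
                        values="wall_time_s", aggfunc="sum")

# Column totals, matching the totals shown under the heatmap.
wall.loc["total"] = wall.sum(axis=0)
print(wall)
```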