What must be covered
- Datasets: name, size (train/dev/test split), language/domain, source, license, preprocessing steps applied.
- Baselines: list every method you compare against. For each: name, reference, key settings used. Include the strongest known baseline.
- Evaluation metrics: define each metric mathematically. Explain why it is appropriate for your task.
- Implementation: framework (PyTorch, TF, ...), hardware (GPU model + count), training time, batch size, learning rate, optimizer, seeds for reproducibility.
- Statistical rigor: report mean ± std over N runs. State the significance tests used (t-test, Wilcoxon, bootstrap) and the threshold (p < 0.05). See the sketch after this list.
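
A minimal sketch of that reporting, assuming per-seed scores for two methods have already been collected (the scores and N=5 seeds below are illustrative placeholders); SciPy's Wilcoxon signed-rank test stands in for whichever significance test you choose:

```python
import numpy as np
from scipy import stats

# Hypothetical per-seed scores (one value per run) for our method and the
# strongest baseline, all evaluated on the SAME test set.
ours     = [0.812, 0.805, 0.819, 0.808, 0.815]
baseline = [0.791, 0.800, 0.795, 0.789, 0.797]

# Mean ± std over N runs (report both, plus N, in the paper).
print(f"ours:     {np.mean(ours):.3f} ± {np.std(ours, ddof=1):.3f} (N={len(ours)})")
print(f"baseline: {np.mean(baseline):.3f} ± {np.std(baseline, ddof=1):.3f} (N={len(baseline)})")

# Paired test over seeds; Wilcoxon signed-rank avoids the normality
# assumption of the t-test. Compare p against the 0.05 threshold.
stat, p = stats.wilcoxon(ours, baseline)
print(f"Wilcoxon signed-rank: statistic={stat:.3f}, p={p:.4f}, significant={p < 0.05}")
```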
Research questions to answer
- RQ1: Does our method outperform all baselines on the primary metric?
- RQ2: How does each component contribute? (ablation study; see the sketch after this list)
- RQ3: How does the method scale / generalize across datasets?
- RQ4: What is the computational cost vs accuracy trade-off?
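
For RQ2, a minimal ablation sketch under stated assumptions: `train_and_evaluate` and the component names are hypothetical placeholders for your own pipeline; each component is switched off one at a time and the primary metric is compared to the full model:

```python
def train_and_evaluate(config: dict) -> float:
    """Placeholder for the real pipeline: train with `config`, return the primary metric."""
    return 0.0  # replace with your training + evaluation code

FULL_CONFIG = {"use_attention": True, "use_pretraining": True, "use_augmentation": True}

results = {"full model": train_and_evaluate(FULL_CONFIG)}
for component in FULL_CONFIG:
    ablated = {**FULL_CONFIG, component: False}  # disable exactly one component
    results[f"w/o {component}"] = train_and_evaluate(ablated)

for name, score in results.items():
    delta = score - results["full model"]
    print(f"{name:>20}: {score:.3f}  (delta vs full: {delta:+.3f})")
```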
Do
- Use the same test set for ALL methods — no cherry-picking
- Re-run baselines yourself where possible and report the results obtained under the same conditions
- State whether you used public checkpoints and which version (see the sketch after this list)
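
If a baseline comes from a public Hugging Face checkpoint, one way to make the version statement concrete is to pin the revision at load time; the model name and revision below are illustrative, not a recommendation:

```python
import transformers
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # illustrative public checkpoint
REVISION = "main"                 # replace with the exact commit hash you used

# Pin the exact revision (branch, tag, or commit hash) and report it,
# together with the library version, in the paper.
print(f"transformers version: {transformers.__version__}")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, revision=REVISION)
model = AutoModel.from_pretrained(MODEL_NAME, revision=REVISION)
```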
Don't
- Compare only against outdated or weak baselines
- Report only the best run; report the average over multiple seeds instead
- Use different test sets for different methods