# PyBaMM AI Data Lab

This lab accompanies the Battery Modeling for AI article series. It generates a small, auditable synthetic battery-aging and impedance dataset with explicit labels, split groups, solver status, and figures.

The publication bundle keeps the sample data small. For large AI datasets, run the same scripts locally with a larger `--samples` value and store generated data outside the website repository.

## Install

```bash
cd pybamm-ai-data-lab
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

PyBaMM is used through the core package. The archived `pybamm-eis` package is not required; EIS support is available through `pybamm.EISSimulation` in current PyBaMM.

## Quick Run

```bash
python src/run_all.py --quick --backend auto --output .
```

For fast schema validation, `--quick --backend auto` resolves to the deterministic surrogate backend and records `backend_used=surrogate`. To require PyBaMM core solves and fail on solver errors, use `--backend pybamm`.
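The resolution rule can be sketched as below. This is an illustration, not the code in `src/run_all.py`: the function name `resolve_backend` is hypothetical, accepting `surrogate` as an explicit request is an assumption, and so is the behavior of `auto` without `--quick` (here it prefers PyBaMM when the package is importable).

```python
import importlib.util


def resolve_backend(requested: str, quick: bool) -> str:
    """Map the requested backend to the value recorded as backend_used."""
    if requested not in {"auto", "pybamm", "surrogate"}:
        raise ValueError(f"unknown backend: {requested!r}")
    if requested != "auto":
        # An explicit request is honored as-is; "pybamm" is expected to
        # fail loudly later if a solve does not succeed.
        return requested
    if quick:
        # Documented behavior: --quick --backend auto uses the surrogate.
        return "surrogate"
    # Assumption: without --quick, auto prefers PyBaMM when it is installed.
    has_pybamm = importlib.util.find_spec("pybamm") is not None
    return "pybamm" if has_pybamm else "surrogate"


print(resolve_backend("auto", quick=True))
```

Recording the resolved value rather than the requested one keeps each output row auditable: a reader can tell which rows came from real solves without knowing the command line.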

## Larger Dataset

```bash
python src/run_all.py --samples 200 --workers 4 --seed 7 --backend pybamm --output /tmp/pybamm-ai-dataset
```

The orchestration script accepts `--workers` so that the public interface stays stable, but the flag does not yet change execution: the current reference implementation runs sequentially to keep failures easy to inspect. Parallel execution can be added around the manifest rows without changing output schemas.
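One way parallel execution could wrap the manifest rows is sketched here using only the standard library. The names `solve_row` and `run_manifest` and the row fields are hypothetical, and the placeholder arithmetic stands in for a real backend solve. Threads keep the sketch simple; for CPU-bound PyBaMM solves, `concurrent.futures.ProcessPoolExecutor` exposes the same `map` interface and would likely be the better fit.

```python
from concurrent.futures import ThreadPoolExecutor


def solve_row(row: dict) -> dict:
    """Hypothetical stand-in for solving one manifest row."""
    try:
        # Placeholder computation; a real worker would call the backend here.
        value = 1.0 / row["cycle"]
        return {"sample_id": row["sample_id"], "status": "ok", "value": value}
    except Exception as exc:
        # Keep the failure attached to its row instead of aborting the run.
        return {"sample_id": row["sample_id"], "status": f"error: {exc}", "value": None}


def run_manifest(rows: list[dict], workers: int = 1) -> list[dict]:
    """Solve every row and return results in manifest order."""
    if workers <= 1:
        return [solve_row(row) for row in rows]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, so the output rows
        # line up with the manifest no matter which worker finishes first.
        return list(pool.map(solve_row, rows))


rows = [{"sample_id": i, "cycle": c} for i, c in enumerate([10, 0, 50])]
results = run_manifest(rows, workers=4)
for result in results:
    print(result["sample_id"], result["status"])
```

Because each row returns its own status, a failed solve stays inspectable in the output, which preserves the main benefit of the sequential implementation.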

## Train SOH/RUL Model

```bash
python src/train_soh_rul_model.py --labels results/sample_labels.csv --output results --figures figures
```

The training script uses scikit-learn to fit two tabular regressors: one for `soh` and one for `rul_to_80_cycles`. The inputs are operating metadata and EIS features. By default the data is split by `cell_design_id`, so frequency points or snapshots from the same simulated design never appear in both train and test.
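The idea behind a group split can be shown with a minimal, standard-library sketch. It is not the training script's implementation: `group_split`, the `test_fraction` default, and the example rows are all illustrative assumptions. In scikit-learn, `GroupShuffleSplit` and `GroupKFold` from `sklearn.model_selection` give the same guarantee.

```python
import random


def group_split(rows, group_key="cell_design_id", test_fraction=0.25, seed=7):
    """Split rows so each group lands entirely in train or entirely in test."""
    # Sort before shuffling so the result depends only on the seed.
    groups = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r[group_key] not in test_groups]
    test = [r for r in rows if r[group_key] in test_groups]
    return train, test


# Hypothetical labels: 8 designs with 5 aging snapshots each.
rows = [
    {"cell_design_id": f"design_{d}", "cycle": 100 * s, "soh": 1.0 - 0.01 * s}
    for d in range(8)
    for s in range(5)
]
train, test = group_split(rows)
train_ids = {r["cell_design_id"] for r in train}
test_ids = {r["cell_design_id"] for r in test}
print(len(train), len(test), train_ids.isdisjoint(test_ids))
```

A plain row-level shuffle would put snapshots of the same design on both sides of the split, and the held-out metrics would then overstate how well the model generalizes to unseen designs.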

## Outputs

- `results/sample_manifest.csv`: sample metadata, design groups, protocol, temperature, SOC, cycle, and split group.
- `results/sample_eis_spectra.csv`: frequency, complex impedance, and backend status for every spectrum point.
- `results/sample_labels.csv`: SOH, RUL proxy, degradation-mode labels, and EIS features.
- `results/quality_report.csv`: duplicate, missing-value, split-leakage, and backend-status checks.
- `results/sample_train_metrics.csv`: group-split MAE, RMSE, and R² for the SOH/RUL regressors.
- `results/sample_train_predictions.csv`: held-out predictions with true values and absolute errors.
- `results/sample_feature_importance.csv`: feature importance values for each target model.
- `figures/pybamm-pipeline.svg`: model-to-dataset pipeline.
- `figures/eis-label-schema.svg`: feature and label schema.
- `figures/aging-label-curves.png`: sample SOH and resistance labels.
- `figures/soh-rul-training-results.png`: held-out prediction scatter plots and feature importance.
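The kinds of checks summarized in `results/quality_report.csv` can be illustrated with a short sketch. The column names (`sample_id`, `split_group`, `split`) and the function `quality_checks` are assumptions made for this example; the actual report schema may differ.

```python
def quality_checks(rows, id_key="sample_id", group_key="split_group", split_key="split"):
    """Count duplicate ids and missing values, and list groups that leak across splits."""
    seen = set()
    duplicates = 0
    missing = 0
    splits_per_group = {}
    for row in rows:
        if row[id_key] in seen:
            duplicates += 1
        seen.add(row[id_key])
        # Treat None and empty strings as missing values.
        missing += sum(1 for value in row.values() if value is None or value == "")
        splits_per_group.setdefault(row[group_key], set()).add(row[split_key])
    # A group leaks when its rows appear in more than one split.
    leaked = sorted(g for g, splits in splits_per_group.items() if len(splits) > 1)
    return {"duplicates": duplicates, "missing_values": missing, "leaked_groups": leaked}


# Hypothetical rows with one duplicate id, one missing label, and one leaked group.
rows = [
    {"sample_id": "s1", "split_group": "design_a", "split": "train", "soh": 0.98},
    {"sample_id": "s2", "split_group": "design_a", "split": "test", "soh": 0.95},
    {"sample_id": "s2", "split_group": "design_b", "split": "test", "soh": None},
]
report = quality_checks(rows)
print(report)
```

Running a check like this before training catches leakage introduced upstream, for example when a manifest is regenerated with a different seed and merged with older rows.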

## Boundary

These outputs are physics-based synthetic labels. They are useful for pretraining, pipeline tests, sensitivity studies, and active-learning design. They are not a replacement for calibrated experimental cells, impedance fixtures, temperature control, or out-of-domain validation.
