THE BIG PICTURE
You have two scripts that work together: algae_predictor_by_water_type.py (v1 training + runtime predictor) and train_algae_models_v2.py (v2 training with substrate features). Both output pickle files that the report generator loads at runtime.
STEP 1: DATA LOADING
- Loads from beaches_full_export.json (v2) or beaches_cleaned_20260107.json (v1, temperature-validated)
- 355 beaches total, split by waterBodyType: 225 tidal, 82 lake, 48 river
- Each beach must have algaeData with avgChlorophyll and/or avgCI
- Beaches without algae data are skipped
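The loading and filtering rules above can be sketched roughly as follows. This is only an illustration, not the scripts' actual code: it assumes the export is a JSON array of beach objects, and the helper names (has_algae_data, group_by_water_type, load_beaches_by_water_type) are hypothetical.

```python
import json
from collections import defaultdict


def has_algae_data(beach):
    """True if the beach carries at least one of the two algae targets."""
    algae = beach.get("algaeData") or {}
    return algae.get("avgChlorophyll") is not None or algae.get("avgCI") is not None


def group_by_water_type(beaches):
    """Skip beaches without algae data, then bucket the rest by waterBodyType."""
    groups = defaultdict(list)
    for beach in beaches:
        if not has_algae_data(beach):
            continue  # no target value to train on
        groups[beach.get("waterBodyType", "unknown")].append(beach)
    return dict(groups)


def load_beaches_by_water_type(path):
    """Assumption: the file holds a JSON array of beach objects."""
    with open(path, encoding="utf-8") as f:
        return group_by_water_type(json.load(f))
```

Calling load_beaches_by_water_type("beaches_full_export.json") would then return one list of beaches per water type, ready for per-type training.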
STEP 2: FEATURE EXTRACTION (33 features in v2)
- Geographic (3): latitude, longitude, lat_abs
- Satellite temperature (4): avg, max, min, range, from the Landsat 8/9 thermal band. v1 validates these values, rejecting them if min < -5°C or max < 5°C (likely ice/snow contamination)
- Satellite clarity (3): turbidity, secchi depth, clarity score — from Sentinel-2
- Other indices (5): water_index, shoreline_risk, garbage, population_pressure, sediment_index
- NEW in v2 — Substrate (10): sand, pebbles, rocks, boulders, stone, mud + 4 engineered:
nutrient_retention = weighted score (mud×5 + stone×3 + rocks×2 + pebbles×1 + sand×0.5) / total
substrate_diversity = count of non-zero substrate types (0-5)
fine_coarse_ratio = (mud + sand) / (rocks + boulders + pebbles)
dark_substrate = mud + stone (dark substrate absorbs heat and warms the water)
- NEW in v2 — Dimensions (3): beach_width, beach_length, beach_area
- NEW in v2 — Biological (2): seaweed_total, kelp
- Engineered interactions (3): temp_clarity, warm_turbid, temp_nutrient
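The four engineered substrate features listed above could be computed along these lines. This is a sketch under stated assumptions, not the v2 script itself: it assumes "total" means the sum of the five weighted substrate amounts, that substrate_diversity counts those same five types (matching the 0-5 range), and that a zero denominator yields 0.0. The function name substrate_features is hypothetical.

```python
# Weights taken from the nutrient_retention formula; boulders have no weight there.
RETENTION_WEIGHTS = {"mud": 5.0, "stone": 3.0, "rocks": 2.0, "pebbles": 1.0, "sand": 0.5}


def substrate_features(substrate):
    """substrate maps a substrate name to its amount (units assumed consistent)."""

    def amount(name):
        return float(substrate.get(name) or 0.0)

    # Assumption: "total" is the sum of the five weighted substrate amounts.
    total = sum(amount(name) for name in RETENTION_WEIGHTS)
    weighted = sum(weight * amount(name) for name, weight in RETENTION_WEIGHTS.items())
    coarse = amount("rocks") + amount("boulders") + amount("pebbles")
    fine = amount("mud") + amount("sand")

    return {
        "nutrient_retention": weighted / total if total > 0 else 0.0,
        # Assumption: diversity counts the five weighted types, giving a 0-5 range.
        "substrate_diversity": sum(1 for name in RETENTION_WEIGHTS if amount(name) > 0),
        # Assumption: fall back to 0.0 instead of dividing by zero.
        "fine_coarse_ratio": fine / coarse if coarse > 0 else 0.0,
        "dark_substrate": amount("mud") + amount("stone"),
    }
```

For a beach with mud 20, sand 60, and rocks 20, this gives nutrient_retention = (100 + 40 + 30) / 100 = 1.7, substrate_diversity = 3, fine_coarse_ratio = 80 / 20 = 4.0, and dark_substrate = 20.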
STEP 3: WATER-TYPE-SPECIFIC ENGINEERING
Each water type gets 3 additional custom features that capture its unique physics:
LAKE:
thermal_stratification_proxy = temp_range × secchi
lake_bloom_risk = temp_avg × turbidity × nutrient_retention
warm_shallow = temp_avg / secchi
TIDAL:
tidal_exchange_proxy = turbidity / temp_range
coastal_upwelling = secchi × lat_abs / 100
mixing_index = temp_range × turbidity
RIVER:
flow_proxy = turbidity × temp_range
upstream_pollution = garbage × turbidity
nutrient_loading = population_pressure × turbidity × nutrient_retention
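The per-water-type formulas above translate fairly directly into code. The sketch below assumes the base features arrive as a dict keyed by the names used in the formulas, and that a zero denominator yields 0.0; both the dict layout and the helper names (safe_div, water_type_features) are hypothetical.

```python
def safe_div(numerator, denominator):
    """Assumption: a zero denominator yields 0.0; the real scripts may differ."""
    return numerator / denominator if denominator else 0.0


def water_type_features(water_type, f):
    """Return the 3 extra features for one water type from base-feature dict f."""
    if water_type == "lake":
        return {
            "thermal_stratification_proxy": f["temp_range"] * f["secchi"],
            "lake_bloom_risk": f["temp_avg"] * f["turbidity"] * f["nutrient_retention"],
            "warm_shallow": safe_div(f["temp_avg"], f["secchi"]),
        }
    if water_type == "tidal":
        return {
            "tidal_exchange_proxy": safe_div(f["turbidity"], f["temp_range"]),
            "coastal_upwelling": f["secchi"] * f["lat_abs"] / 100,
            "mixing_index": f["temp_range"] * f["turbidity"],
        }
    if water_type == "river":
        return {
            "flow_proxy": f["turbidity"] * f["temp_range"],
            "upstream_pollution": f["garbage"] * f["turbidity"],
            "nutrient_loading": f["population_pressure"] * f["turbidity"] * f["nutrient_retention"],
        }
    raise ValueError(f"unknown water type: {water_type!r}")
```

Keeping each water type in its own branch means a lake model never sees river-only features, which matches the idea of training one model set per water type.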
STEP 4: PREPROCESSING
- Imputation: SimpleImputer(strategy='median') fills missing values with the column median
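- Scaling: StandardScaler() (zero mean, unit variance). Tree-based models such as gradient boosting, random forest, and XGBoost do not need scaled inputs to converge, but because the models were trained on scaled features, the same fitted scaler must be applied at prediction time.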
- Both are fitted on the training data and then applied unchanged to new inputs at runtime
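A minimal sketch of the fit-once, reuse-at-runtime pattern described above, using scikit-learn's SimpleImputer and StandardScaler. The toy matrices are made up for illustration and do not come from the beach data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy training matrix with one missing value (np.nan); values are made up.
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])

imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()

# Fit both transformers on the training data only.
X_train_ready = scaler.fit_transform(imputer.fit_transform(X_train))

# At runtime, reuse the fitted objects: transform only, never refit.
X_new = np.array([[np.nan, 20.0]])
X_new_ready = scaler.transform(imputer.transform(X_new))
```

The missing value in X_new is filled with the training median of that column (2.5 here), not with a statistic computed from the new sample, which is why the fitted imputer and scaler have to be stored alongside the model.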
STEP 5: MODEL TRAINING & SELECTION
- v2 compares 4 algorithms per water type:
1. GradientBoosting (default): n_est=200, depth=5, lr=0.05
2. GradientBoosting (tuned): n_est=300, depth=4, lr=0.03, subsample=0.8
3. RandomForest: n_est=200, depth=10
4. XGBoost: n_est=200, depth=5, lr=0.05, subsample=0.8, colsample=0.8
- Best model selected by cross-validated R² (5-fold CV when n ≥ 30)
- XGBoost wins in most cases — hence "v2 = XGBoost + substrate features"
- Trains 2 targets per water type: chlorophyll-a concentration + cyanobacteria index
- v1 also trains a risk level model (RandomForest, max_depth=10) mapping to low/medium/high
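The comparison step could look roughly like the sketch below, which scores each candidate by mean 5-fold cross-validated R² and keeps the best. It is an illustration rather than the v2 script: the function names are hypothetical, "colsample" is assumed to mean XGBoost's colsample_bytree, XGBoost is treated as optional so the sketch runs without it installed, and the fallback used when n < 30 is not covered because the notes do not describe it.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score


def candidate_models(seed=0):
    """Build the candidate regressors with the hyperparameters listed above."""
    models = {
        "gb_default": GradientBoostingRegressor(
            n_estimators=200, max_depth=5, learning_rate=0.05, random_state=seed),
        "gb_tuned": GradientBoostingRegressor(
            n_estimators=300, max_depth=4, learning_rate=0.03, subsample=0.8,
            random_state=seed),
        "random_forest": RandomForestRegressor(
            n_estimators=200, max_depth=10, random_state=seed),
    }
    try:  # optional dependency: skipped if xgboost is not installed
        from xgboost import XGBRegressor
        models["xgboost"] = XGBRegressor(
            n_estimators=200, max_depth=5, learning_rate=0.05, subsample=0.8,
            colsample_bytree=0.8,  # assumption: "colsample" means colsample_bytree
            random_state=seed)
    except ImportError:
        pass
    return models


def select_best(X, y, cv=5):
    """Return (best_name, scores) using mean cross-validated R² per candidate."""
    scores = {
        name: float(np.mean(cross_val_score(model, X, y, cv=cv, scoring="r2")))
        for name, model in candidate_models().items()
    }
    best_name = max(scores, key=scores.get)
    return best_name, scores
```

In the real pipeline this selection would run once per water type and once per target (chlorophyll-a and cyanobacteria index), so a single run produces several independent winners.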
STEP 6: SERIALIZATION & DEPLOYMENT
- Models saved as .pkl files via pickle (~3MB each): algae_models_by_water_type.pkl (v1) and algae_models_by_water_type_v2.pkl (v2)
- Report generator loads v2 first, falls back to v1 if missing
- Each pickle contains: model objects + scaler + imputer + feature list per water type
- At prediction time: extract features → impute → scale → predict → classify risk
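The v2-first, v1-fallback loading described above can be sketched as follows. The file names come from the notes; the function names and the bundle layout (one dict per water type) are assumptions for illustration, not the report generator's actual code.

```python
import pickle
from pathlib import Path

V2_PATH = Path("algae_models_by_water_type_v2.pkl")
V1_PATH = Path("algae_models_by_water_type.pkl")


def save_bundle(bundle, path):
    """Write one bundle (models + scaler + imputer + feature list) to disk."""
    with open(path, "wb") as f:
        pickle.dump(bundle, f)


def load_models(v2_path=V2_PATH, v1_path=V1_PATH):
    """Return (version, bundle), preferring v2 and falling back to v1."""
    for version, path in (("v2", Path(v2_path)), ("v1", Path(v1_path))):
        if path.exists():
            with open(path, "rb") as f:
                return version, pickle.load(f)
    raise FileNotFoundError("no algae model pickle found")
```

Returning the version alongside the bundle lets the caller build the matching feature set, since v2 expects the substrate features and v1 does not. Note that pickle.load executes code from the file, so it should only be used on model files from a trusted source.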
KEY INSIGHT FOR HASSAN
The water-type-specific feature engineering is the clever part. A lake's bloom risk depends on thermal stratification (temp_range × secchi), while a river's depends on upstream nutrient loading (population × turbidity × substrate retention). Generic models miss these physics. That's also where his geomatics expertise would improve things — better atmospheric correction = better input features = better predictions.