Machine Learning Models for Cyanobacteria and Chlorophyll-a
The Algae Problem: Why It Matters
Harmful algal blooms (HABs) are one of the most serious water quality threats worldwide, causing beach closures, fish kills, and human health risks. Nimpact uses water-body-specific machine learning models trained on thousands of beaches to predict algae concentrations from satellite-derived environmental parameters.
Two Key Algae Metrics
Cyanobacteria Index (CI)
Measures the concentration of blue-green algae (cyanobacteria), the most harmful type. Some species produce toxins (microcystins, anatoxins) that can cause:
Liver damage in humans and animals
Neurological effects
Skin irritation and respiratory problems
Pet deaths from drinking contaminated water
Units: cells/mL (typical range 0-100,000)
Chlorophyll-a Concentration
Measures all algae types (green, brown, blue-green) via their primary photosynthetic pigment. High chlorophyll indicates:
High biological productivity (eutrophication)
Nutrient enrichment (phosphorus, nitrogen)
Potential for oxygen depletion and fish kills
Reduced water clarity
Units: μg/L (typical range 1-100 μg/L)
Why Prediction Is Challenging
Unlike temperature and clarity (which satellites measure directly), algae concentrations cannot be reliably measured from space for inland waters. Here's why:
Patchy Distribution: Blooms concentrate in mats and streaks rather than spreading evenly
Depth Variation: Some algae float (surface scums), while others stay suspended at various depths
Spectral Similarity: Algae reflectance overlaps with suspended sediments and CDOM
Atmospheric Interference: Haze and aerosols corrupt the weak algae signal
Nimpact's Solution: Rather than trying to measure algae directly, we predict concentrations from environmental drivers (temperature, clarity, nutrient proxies) that can be measured accurately. For inland waters, this "indirect measurement" approach achieves better accuracy than direct spectral analysis.
Water-Body-Specific Models
Nimpact uses three separate machine learning models because algae dynamics differ fundamentally across water body types:
# Algae Prediction Models (Random Forest)
RIVER Model (n=1,247 training sites):
- Cyanobacteria: R² = 0.69 (geography-based)
- Chlorophyll-a: R² = 0.49 (temperature, clarity)
- Rivers show elevated levels due to upstream nutrient loading
LAKE Model (n=2,183 training sites):
- Cyanobacteria: Not predictable (nutrient-limited)
- Chlorophyll-a: R² = 0.55 (temperature, clarity)
- Lakes show summer bloom patterns when warm water coincides with available nutrients
TIDAL Model (n=891 training sites):
- Cyanobacteria: Not predictable (salinity inhibits)
- Chlorophyll-a: R² = 0.61 (temperature, clarity, tidal mixing)
- Coastal blooms driven by upwelling and tidal cycles
This water-type stratification is critical. A river with high cyanobacteria is normal (upstream sources), while a lake with the same concentration indicates a serious local problem.
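The stratification above can be pictured as a lookup keyed by water body type. The following is a minimal sketch, not Nimpact's actual code: the names REPORTED_R2, reported_r2, and can_predict are hypothetical, and only the R² values are taken from the model summary above.

```python
from typing import Optional

# R² values copied from the model summary above.
# None marks a metric listed there as "Not predictable".
# The dictionary layout and key names are illustrative assumptions.
REPORTED_R2 = {
    "river": {"cyanobacteria": 0.69, "chlorophyll_a": 0.49},
    "lake": {"cyanobacteria": None, "chlorophyll_a": 0.55},
    "tidal": {"cyanobacteria": None, "chlorophyll_a": 0.61},
}


def reported_r2(water_body: str, metric: str) -> Optional[float]:
    """Return the reported R² for a water body type and metric.

    Returns None when the metric is listed as not predictable.
    Raises ValueError for an unknown water body type or metric.
    """
    try:
        return REPORTED_R2[water_body.lower()][metric]
    except KeyError:
        raise ValueError(
            f"unknown water body or metric: {water_body!r}, {metric!r}"
        ) from None


def can_predict(water_body: str, metric: str) -> bool:
    """True when a model exists for this water body type and metric."""
    return reported_r2(water_body, metric) is not None
```

Keeping "not predictable" as an explicit None, rather than omitting the entry, lets calling code tell a metric that is deliberately unmodeled apart from a misspelled key.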
Summer Bloom Analysis and Risk Assessment
The Summer Bloom Window
For lakes, Nimpact performs specialized summer bloom risk analysis because most severe blooms occur during the June-September warm season. This analysis is particularly important for northern prairie and temperate lakes.
The Northern Prairie Paradox
Northern lakes like Eagle Lake (SK) experience severe annual blooms at surprisingly low temperatures:
Peak Temperature: Only 15.5°C (would be considered "cool" in southern lakes)
# Bloom Risk Tiers
LOW: Cool water (< 12°C peak) OR oceanic (high salinity)
MEDIUM: Moderate temps (12-20°C) with some risk factors
HIGH: Warm temps (> 20°C) + seasonal pattern + shallow
EXTREME: Very warm (> 25°C) + all risk factors present
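One way to read the tiers above as a decision rule is sketched below. This is an interpretation under stated assumptions, not Nimpact's implementation: the function name and parameters are hypothetical, the risk factors are reduced to two booleans (seasonal pattern and shallow depth), and any case the tier list does not cover explicitly is assigned to MEDIUM.

```python
def bloom_risk_tier(
    peak_temp_c: float,
    high_salinity: bool = False,
    seasonal_pattern: bool = False,
    shallow: bool = False,
) -> str:
    """Map peak summer temperature and risk factors to a bloom risk tier.

    Assumptions (not stated in the source):
    - "all risk factors" means seasonal_pattern and shallow together;
    - warm water without those factors falls back to MEDIUM.
    """
    # LOW: cool water (< 12 °C peak) OR oceanic (high salinity)
    if high_salinity or peak_temp_c < 12.0:
        return "LOW"

    all_factors = seasonal_pattern and shallow

    # EXTREME: very warm (> 25 °C) with all risk factors present
    if peak_temp_c > 25.0 and all_factors:
        return "EXTREME"

    # HIGH: warm (> 20 °C) with seasonal pattern and shallow water
    if peak_temp_c > 20.0 and all_factors:
        return "HIGH"

    # MEDIUM: everything else, including the 12-20 °C band
    return "MEDIUM"
```

Under these assumptions, a 15.5°C peak like the Eagle Lake figure quoted above lands in MEDIUM even with both risk factors set, which is the kind of case the Northern Prairie Paradox describes: temperature thresholds alone can understate risk for northern lakes.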
Measured vs. Predicted Data
Nimpact uses both measured satellite data (when available) and machine learning predictions:
Data Source Hierarchy:
Direct Satellite Measurement: Chlorophyll-a from Sentinel-2 spectral analysis (when conditions allow)
ML Prediction: Used when direct measurement is impossible or unreliable (most cases for inland waters)
Unavailable: Cyanobacteria for lake and tidal waters (prediction requires nutrient data that satellites cannot provide)
For rivers, cyanobacteria predictions achieve good accuracy (R² = 0.69) because they're primarily driven by watershed geography and upstream sources, factors we can model. For lakes and tidal areas, cyanobacteria blooms are sporadic and nutrient-dependent, making prediction from satellite data alone impossible.
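The data source hierarchy can be expressed as a simple fallback chain. The sketch below is illustrative only: resolve_algae_value and its parameters are hypothetical names, and it assumes a measured value is passed as None whenever satellite conditions did not allow a reliable reading.

```python
from typing import Optional, Tuple


def resolve_algae_value(
    metric: str,
    water_body: str,
    measured: Optional[float] = None,
    predicted: Optional[float] = None,
) -> Tuple[Optional[float], str]:
    """Pick a value and label its source, following the hierarchy above.

    Returns (value, source), where source is one of
    "measured", "predicted", or "unavailable".
    """
    # Cyanobacteria is not predictable for lake and tidal waters.
    if metric == "cyanobacteria" and water_body in ("lake", "tidal"):
        return None, "unavailable"

    # 1. Direct satellite measurement (chlorophyll-a only, when available).
    if metric == "chlorophyll_a" and measured is not None:
        return measured, "measured"

    # 2. Machine learning prediction as the fallback.
    if predicted is not None:
        return predicted, "predicted"

    # 3. Nothing usable.
    return None, "unavailable"
```

Returning the source label alongside the value keeps it visible downstream whether a number came from a direct measurement or from a model.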
Nimpact Role: Pre-screening tool to identify beaches that need water quality sampling and laboratory analysis. High predicted risk triggers professional testing, not closure: satellites screen, labs confirm.