Platform Overview

MyBeachBook is a citizen science platform for water quality monitoring, combining field assessments with satellite analysis.

  • 355+ beaches in database
  • 357+ field assessments
  • 40+ data categories per assessment
  • 6 satellite data sources

What is MyBeachBook?

MyBeachBook is a cross-platform mobile application (Android, iOS, Web) built with Flutter. It enables citizen scientists to collect GPS-tagged, timestamped, photo-documented water quality assessments. On the backend, we run satellite analysis pipelines for algal bloom prediction, temporal trend analysis, and environmental monitoring.

The platform is backed by Firebase (Firestore, Storage, Auth) with Google Maps integration, Google ML Kit for on-device image labeling, and the iNaturalist API for species identification.

What data does the app collect?

Each field assessment captures 40+ data categories:

  • Flora: Kelp, seaweed, eelgrass, and other aquatic vegetation
  • Fauna: 25+ species including barnacles, clams, mussels, anemones
  • Substrate composition: Sand, pebbles, rocks, boulders, stone, mud (proportional)
  • Physical dimensions: Beach width, length, bluff height
  • Driftwood: Kindling, firewood, logs, trees
  • Environmental indicators: Garbage levels, wind conditions, human activity
  • Photos: ML-assisted identification via Google ML Kit (50% confidence threshold)

All submissions are geohash-validated (precision 9 = ~100m) to confirm the contributor is physically at the location, and admin-moderated before inclusion.
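
As an illustration of the location check, the sketch below encodes two coordinates with the standard geohash algorithm and compares them at a chosen precision. The function names and the same-cell rule are assumptions made for this example, not the app's actual implementation.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash_encode(lat: float, lon: float, precision: int = 9) -> str:
    """Encode a latitude/longitude pair as a geohash string."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    value, bit_count = 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, coord = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:
            value = (value << 1) | 1
            rng[0] = mid
        else:
            value <<= 1
            rng[1] = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits form one base-32 character
            chars.append(_BASE32[value])
            value, bit_count = 0, 0
    return "".join(chars)


def is_at_location(device, beach, precision: int = 9) -> bool:
    """Hypothetical check: both (lat, lon) points fall in the same geohash cell."""
    return geohash_encode(*device, precision) == geohash_encode(*beach, precision)
```

Two nearby points can still land in different cells when they straddle a cell boundary, so a production check would probably also accept neighbouring cells or fall back to a distance test.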

What makes this ground truth data unique?

Our data addresses a critical gap in remote sensing research: concurrent ground truth paired with satellite observations. Most remote sensing studies lack field-validated data collected at the same locations and time periods as satellite passes.

  • All assessments are GPS-tagged with sub-10m accuracy
  • Timestamped to correlate with satellite overpasses
  • Photo-documented for verification
  • Multiple contributions aggregated per site (averaged numerics, deduplicated species)
  • Quality-controlled through admin moderation
  • Covers tidal (225), lake (82), and river (48) environments
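
The per-site aggregation mentioned above (averaged numerics, deduplicated species) could look roughly like this. The record layout with `measurements` and `species` keys is hypothetical, and case-insensitive species matching is an assumption.

```python
from statistics import mean


def aggregate_site(assessments):
    """Merge several assessments of one site into a single record.

    Numeric fields are averaged over the assessments that report them;
    species lists are merged with duplicates removed (case-insensitive).
    """
    numeric = {}
    species = set()
    for a in assessments:
        for key, value in a.get("measurements", {}).items():
            numeric.setdefault(key, []).append(value)
        species.update(name.strip().lower() for name in a.get("species", []))
    return {
        "measurements": {k: mean(v) for k, v in numeric.items()},
        "species": sorted(species),
        "n_assessments": len(assessments),
    }
```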

Satellite Data Pipeline

Our backend processes multi-source satellite imagery using STAC APIs, independent of Google Earth Engine.

What satellite data sources do you use?
  • Sentinel-2 L2A (ESA) — Primary optical source. Accessed via Element84 STAC API with Planetary Computer and Copernicus fallbacks
  • Landsat 8/9 Collection 2 L2 (USGS) — Thermal band (lwir11) for water surface temperature
  • NASA SRTM — Elevation and terrain data (±6m vertical accuracy)
  • VIIRS DNB (NASA Black Marble) — Night lights for development pressure analysis
  • WorldPop — Population density at 100m resolution
  • JRC Global Surface Water — Lake extent detection and water permanence

What Sentinel-2 bands do you use?
  • B02 Blue (490nm) — Water color, turbidity assessment
  • B03 Green (560nm) — NDWI calculation, chlorophyll estimation, turbidity
  • B04 Red (665nm) — NDVI, chlorophyll proxy, sediment detection
  • B08 NIR (842nm) — NDWI, NDVI, floating debris detection
  • B11 SWIR1 (1610nm) — Urban development index (NDBI)
  • SCL — Scene Classification Layer for cloud masking

Data is downloaded at 120m resolution (from native 10-20m) for computational efficiency, with cloud masking via SCL using progressive thresholds (20% → 40% → 60% → 80% → 95%).
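
One way to read the progressive thresholds is as a retry loop that relaxes the allowed cloud cover until enough scenes are found. The sketch below assumes that reading; `search` is a hypothetical callable wrapping the actual STAC query.

```python
CLOUD_THRESHOLDS = (20, 40, 60, 80, 95)  # percent, taken from the text above


def search_with_relaxing_clouds(search, min_scenes=1, thresholds=CLOUD_THRESHOLDS):
    """Call search(max_cloud) with rising limits until enough scenes return.

    `search` takes a maximum cloud-cover percentage and returns a list of
    scenes. Returns (threshold_used, scenes), or (None, []) if nothing
    qualifies even at the loosest threshold.
    """
    for max_cloud in thresholds:
        scenes = search(max_cloud)
        if len(scenes) >= min_scenes:
            return max_cloud, scenes
    return None, []
```

Starting strict and loosening only when needed keeps the cleanest scenes for well-covered sites while still returning something for persistently cloudy ones.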

What spectral indices do you calculate?
  • NDWI = (Green - NIR) / (Green + NIR) — Water detection, shoreline dynamics, clarity proxy
  • NDVI = (NIR - Red) / (NIR + Red) — Vegetation health, 5-year trend analysis
  • NDBI = (SWIR1 - NIR) / (SWIR1 + NIR) — Urban development pressure (2015-2018 vs 2021-2024 comparison)
  • Chlorophyll proxy = (Green/Red - 0.9) × 27.3, clamped [0, 30] mg/m³
  • Chlorophyll Index = (Green - Red) / (Green + Red)
  • Turbidity = (Red × 0.3 + Green × 0.7) / 1000
  • Sediment ratio = Red / Green
  • Water Color Index = Blue/Green/Red band ratios (G/B and R/G)
  • Floating Debris Index = (NIR - Red) / (NIR + Red) × (Red / Green)
  • Shoreline Risk Proxy = std(NDWI over time) — temporal standard deviation of NDWI across 3 years of Sentinel-2 imagery (up to 20 scenes, 100m buffer, ≤20% cloud cover). Higher variance indicates a more dynamic or unstable shoreline. Paired with Water Index = mean(NDWI) for average water presence.
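
The formulas above translate directly into code. This is a minimal per-pixel sketch in plain Python; returning 0.0 for a zero denominator and using the population standard deviation for the shoreline proxy are assumptions, since the text does not specify either.

```python
from statistics import mean, pstdev


def _norm_diff(a, b):
    """Normalized difference (a - b) / (a + b); 0.0 when the sum is zero."""
    return (a - b) / (a + b) if (a + b) != 0 else 0.0


def ndwi(green, nir):
    return _norm_diff(green, nir)


def ndvi(nir, red):
    return _norm_diff(nir, red)


def ndbi(swir1, nir):
    return _norm_diff(swir1, nir)


def chlorophyll_proxy(green, red):
    """(Green/Red - 0.9) * 27.3, clamped to [0, 30] mg/m^3."""
    return min(30.0, max(0.0, (green / red - 0.9) * 27.3))


def turbidity(red, green):
    return (red * 0.3 + green * 0.7) / 1000


def floating_debris_index(nir, red, green):
    return _norm_diff(nir, red) * (red / green)


def shoreline_metrics(ndwi_series):
    """Return (Water Index, Shoreline Risk Proxy) for an NDWI time series."""
    return mean(ndwi_series), pstdev(ndwi_series)
```

For example, `ndwi(600, 200)` gives 0.5, a strongly water-like pixel, while a green/red ratio below 0.9 clamps the chlorophyll proxy to 0.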

Why STAC APIs instead of Google Earth Engine?

We migrated from GEE to STAC APIs for platform independence and reproducibility. Our pipeline uses xarray + stackstac for lazy loading and chunked processing, with multi-level caching (in-memory → disk via NetCDF → STAC download). This eliminates dependency on Google's infrastructure and gives us full control over the processing chain.
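
The multi-level cache can be pictured as a lookup that tries memory, then disk, then the network. The sketch below is illustrative only: it uses pickle files in place of NetCDF to stay dependency-free, and the `download` callable and class name are hypothetical.

```python
import pickle
from pathlib import Path


class TieredCache:
    """Look up memory first, then disk, then fall back to a download."""

    def __init__(self, cache_dir, download):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.download = download  # hypothetical callable: key -> data
        self.memory = {}

    def get(self, key):
        if key in self.memory:                      # level 1: in-memory
            return self.memory[key]
        path = self.cache_dir / f"{key}.pkl"        # key must be filename-safe
        if path.exists():                           # level 2: disk
            data = pickle.loads(path.read_bytes())
        else:                                       # level 3: remote fetch
            data = self.download(key)
            path.write_bytes(pickle.dumps(data))
        self.memory[key] = data
        return data
```

A second process pointed at the same directory skips the download entirely, which is the main benefit when reports for neighbouring sites reuse the same scenes.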

Primary STAC endpoint: Element84, with automatic fallback to Planetary Computer and Copernicus.
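
Endpoint fallback can be sketched as trying each provider in order and keeping the first usable answer. Here `query` and the client objects are hypothetical stand-ins for real STAC clients, and treating an empty result as a reason to fall back is an assumption.

```python
def first_successful(endpoints, query):
    """Try each endpoint in order; return (name, items) from the first that works.

    `endpoints` is an ordered list of (name, client) pairs and `query` is a
    callable that runs a search against one client. An endpoint is skipped
    if it raises or returns no items.
    """
    errors = {}
    for name, client in endpoints:
        try:
            items = query(client)
        except Exception as exc:  # network failures, HTTP errors, timeouts
            errors[name] = exc
            continue
        if items:
            return name, items
    raise RuntimeError(f"all STAC endpoints failed or were empty: {errors}")
```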

AI Algae Prediction Models

Water-type-specific machine learning models trained on satellite-derived features and citizen science ground truth.

How does the algae prediction model work?

We train separate models for each water body type (lake, river, tidal) because the physics driving algal growth differ between environments. The pipeline:

  1. Data loading — 355 beaches from our database, split by water body type
  2. Feature extraction — 33 features per beach (v2 model)
  3. Water-type-specific feature engineering — 3 custom features per type capturing unique physics
  4. Preprocessing — Median imputation + standard scaling
  5. Model training — 4-algorithm comparison, best selected by cross-validated R²
  6. Serialization — Pickle files loaded at report generation time
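
Step 4 of the pipeline above (median imputation followed by standard scaling) can be written out in a few lines. This is a dependency-free sketch of the technique rather than the training script itself, which most likely relies on scikit-learn; missing values are represented as `None` here by assumption.

```python
from statistics import mean, median, pstdev


def fit_preprocessor(rows):
    """Learn per-column medians, means and standard deviations.

    `rows` is a list of equal-length feature lists; missing values are None.
    Every column needs at least one observed value.
    """
    columns = list(zip(*rows))
    medians = [median(v for v in col if v is not None) for col in columns]
    filled = [[m if v is None else v for v in col] for col, m in zip(columns, medians)]
    means = [mean(col) for col in filled]
    stds = [pstdev(col) or 1.0 for col in filled]  # constant column: scale by 1
    return medians, means, stds


def transform(rows, params):
    """Impute missing values with the median, then scale to zero mean, unit variance."""
    medians, means, stds = params
    return [
        [((medians[j] if v is None else v) - means[j]) / stds[j] for j, v in enumerate(row)]
        for row in rows
    ]
```

Fitting on the training split only and reusing the same parameters at report time keeps the scaling consistent between training and prediction.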

What features does the model use?

The v2 model uses 33 input features organized into categories:

  • Geographic (3): latitude, longitude, absolute latitude
  • Satellite temperature (4): average, max, min, range — from Landsat 8/9 thermal band. Temperature-validated to reject ice/snow contamination (min < -5°C or max < 5°C)
  • Satellite clarity (3): turbidity, Secchi depth, clarity score — from Sentinel-2
  • Environmental indices (5): water index, shoreline risk, garbage levels, population pressure, sediment index
  • Substrate (10): sand, pebbles, rocks, boulders, stone, mud + 4 engineered features
  • Physical dimensions (3): beach width, length, area
  • Biological (2): seaweed total, kelp presence
  • Interaction terms (3): temp×clarity, temp×turbidity, temp×nutrient retention
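
The ice/snow rejection rule for the temperature features can be expressed as a small validation step. The function names are illustrative, and returning `None` for a rejected series (so that it can be imputed later) is an assumption about how invalid data is handled.

```python
def is_valid_water_temperature(temp_min_c, temp_max_c):
    """Reject series that look like ice or snow rather than open water.

    Thresholds follow the text: invalid if min < -5 °C or max < 5 °C.
    """
    return not (temp_min_c < -5.0 or temp_max_c < 5.0)


def temperature_features(temps_c):
    """Return the four temperature features, or None if validation fails."""
    t_min, t_max = min(temps_c), max(temps_c)
    if not is_valid_water_temperature(t_min, t_max):
        return None
    return {
        "temp_avg": sum(temps_c) / len(temps_c),
        "temp_max": t_max,
        "temp_min": t_min,
        "temp_range": t_max - t_min,
    }
```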

Engineered substrate features:

nutrient_retention = weighted score (mud×5 + stone×3 + rocks×2 + pebbles×1 + sand×0.5) / total
substrate_diversity = count of non-zero substrate types (0-5)
fine_coarse_ratio = (mud + sand) / (rocks + boulders + pebbles)
dark_substrate = mud + stone   // heat absorption proxy
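
A possible Python rendering of these four definitions follows. Several details are assumptions: `total` is taken as the sum of all six substrate proportions, the diversity count covers the five weighted types so that it stays within the stated 0-5 range, and a zero denominator yields 0.0.

```python
RETENTION_WEIGHTS = {"mud": 5, "stone": 3, "rocks": 2, "pebbles": 1, "sand": 0.5}
SUBSTRATES = ("sand", "pebbles", "rocks", "boulders", "stone", "mud")


def substrate_features(s):
    """Compute the four engineered substrate features from proportions."""

    def get(name):
        return float(s.get(name, 0.0))

    total = sum(get(n) for n in SUBSTRATES)
    weighted = sum(w * get(n) for n, w in RETENTION_WEIGHTS.items())
    coarse = get("rocks") + get("boulders") + get("pebbles")
    return {
        "nutrient_retention": weighted / total if total else 0.0,
        "substrate_diversity": sum(1 for n in RETENTION_WEIGHTS if get(n) > 0),
        "fine_coarse_ratio": (get("mud") + get("sand")) / coarse if coarse else 0.0,
        "dark_substrate": get("mud") + get("stone"),  # heat absorption proxy
    }
```

A beach that is half sand, a quarter mud and a quarter rocks scores a nutrient retention of 2.0 and a fine/coarse ratio of 3.0 under these assumptions.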

What are the water-type-specific features?

Each water body type gets 3 additional custom features that capture its unique physical dynamics:

LAKE:
  thermal_stratification_proxy = temp_range × secchi
  lake_bloom_risk = temp_avg × turbidity × nutrient_retention
  warm_shallow = temp_avg / secchi
RIVER:
  flow_proxy = turbidity × temp_range
  upstream_pollution = garbage × turbidity
  nutrient_loading = population_pressure × turbidity × nutrient_retention
TIDAL:
  tidal_exchange_proxy = turbidity / temp_range
  coastal_upwelling = secchi × lat_abs / 100
  mixing_index = temp_range × turbidity

For example, a lake's bloom risk depends on thermal stratification and nutrient retention in fine sediments, while a river's depends on upstream population pressure and flow-mediated nutrient loading. Generic models miss these physics.
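
The nine custom features can be computed from the base features with one small dispatch function. The dictionary keys mirror the names used in the formulas above; the zero-denominator guard is an assumption added for this sketch.

```python
def _safe_div(a, b):
    """Divide, returning 0.0 when the denominator is zero (an assumption)."""
    return a / b if b else 0.0


def water_type_features(water_type, f):
    """Return the 3 custom features for 'lake', 'river' or 'tidal'."""
    if water_type == "lake":
        return {
            "thermal_stratification_proxy": f["temp_range"] * f["secchi"],
            "lake_bloom_risk": f["temp_avg"] * f["turbidity"] * f["nutrient_retention"],
            "warm_shallow": _safe_div(f["temp_avg"], f["secchi"]),
        }
    if water_type == "river":
        return {
            "flow_proxy": f["turbidity"] * f["temp_range"],
            "upstream_pollution": f["garbage"] * f["turbidity"],
            "nutrient_loading": f["population_pressure"] * f["turbidity"] * f["nutrient_retention"],
        }
    if water_type == "tidal":
        return {
            "tidal_exchange_proxy": _safe_div(f["turbidity"], f["temp_range"]),
            "coastal_upwelling": f["secchi"] * f["lat_abs"] / 100,
            "mixing_index": f["temp_range"] * f["turbidity"],
        }
    raise ValueError(f"unknown water type: {water_type!r}")
```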

What algorithms are compared?

The v2 training script compares four model configurations (two GradientBoosting variants, RandomForest, and XGBoost) per water type and selects the best by 5-fold cross-validated R²:

1. GradientBoosting (default): n_estimators=200, max_depth=5, learning_rate=0.05
2. GradientBoosting (tuned): n_estimators=300, max_depth=4, learning_rate=0.03, subsample=0.8
3. RandomForest: n_estimators=200, max_depth=10
4. XGBoost: n_estimators=200, max_depth=5, learning_rate=0.05, subsample=0.8, colsample=0.8

XGBoost wins in most cases. Two targets are trained per water type: chlorophyll-a concentration and cyanobacteria index.
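
The selection rule itself (highest mean R² across 5 folds) is independent of the model library. The sketch below shows it in plain Python with hypothetical `fit` callables standing in for the scikit-learn and XGBoost regressors; it uses contiguous, unshuffled folds, which is an assumption about the split.

```python
from statistics import mean


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot if ss_tot else 0.0


def cross_val_r2(fit, X, y, k=5):
    """Mean R^2 over k contiguous folds; fit(X, y) returns a predict function."""
    n = len(X)
    scores = []
    for i in range(k):
        lo, hi = i * n // k, (i + 1) * n // k
        X_train, y_train = X[:lo] + X[hi:], y[:lo] + y[hi:]
        predict = fit(X_train, y_train)
        scores.append(r2_score(y[lo:hi], [predict(x) for x in X[lo:hi]]))
    return mean(scores)


def select_best(candidates, X, y, k=5):
    """Return (name, score) of the candidate with the highest CV R^2."""
    scored = {name: cross_val_r2(fit, X, y, k) for name, fit in candidates.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]
```

Scoring on held-out folds rather than on the training data is what keeps the comparison fair between a flexible model and a simpler one.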

What accuracy do the models achieve?

V2 model results (XGBoost + substrate features, 33 input features):

  • Lakes — Chlorophyll-a: R² = 0.62 (+12.9% over v1)
  • Rivers — Chlorophyll-a: R² = 0.59 (+237% over v1)
  • Tidal — Chlorophyll-a: R² = 0.59
  • Rivers — Cyanobacteria Index: R² = 0.69

Satellite measurement accuracy:

  • Water surface temperature: ±0.5-2°C
  • Clarity indices: ±20-40%
  • Satellite vs in-situ chlorophyll: ±30-40%
  • GPS positional accuracy: <10m

Improving retrieval algorithm accuracy through better atmospheric correction and validation methodology is a key area where research collaboration would have the most impact.

Algae Heatmap Pipeline

Multi-year temporal analysis of algal bloom patterns from Sentinel-2 imagery.

How are algae heatmaps generated?
  • Temporal scope: 5-year analysis (current year minus 4), summer/growing season only (May–September for northern hemisphere)
  • Per year: Query Sentinel-2, load Green/Red/NIR bands, apply water mask (NDWI > 0.1)
  • Chlorophyll proxy: (Green/Red - 0.9) × 27.3, clamped [0, 30] mg/m³
  • Lake detection: JRC Global Surface Water dataset
  • Sampling grid: 100m spacing across entire lake surface (e.g. 845 points for a medium lake)
  • Output formats: Per-year PNG heatmaps (blue→green→yellow→red), animated GIF (220 frames, 800ms per year), temporal MP4 video (week-by-week, 30 FPS)
  • Trend detection: Linear comparison first vs last year (>20% = Increasing, <-20% = Decreasing)
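
The year window and the first-versus-last-year trend rule can be sketched as follows. The "Stable" label for changes within ±20% and the handling of a zero baseline are assumptions, since the text names only the Increasing and Decreasing cases.

```python
def season_years(current_year, n_years=5):
    """Years analysed: current year minus 4 through the current year."""
    return list(range(current_year - (n_years - 1), current_year + 1))


def classify_trend(first_year_mean, last_year_mean, threshold_pct=20.0):
    """Compare the first and last yearly mean chlorophyll values."""
    if first_year_mean == 0:
        return "Stable" if last_year_mean == 0 else "Increasing"
    change_pct = (last_year_mean - first_year_mean) / first_year_mean * 100
    if change_pct > threshold_pct:
        return "Increasing"
    if change_pct < -threshold_pct:
        return "Decreasing"
    return "Stable"
```

Because only the endpoints are compared, a single unusual first or last summer can dominate the label; fitting a slope through all five years would be a more robust alternative.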

Report System

Comprehensive environmental reports combining satellite analysis with citizen science data.

What does a premium report include?

Each report is a 24+ page document (Word + PDF) covering 10 major sections:

  • Satellite-derived water quality metrics
  • Multi-temporal satellite imagery analysis (1984–present via Landsat + Sentinel-2)
  • Environmental maps (terrain, land cover, night lights, population density)
  • Lake-specific analysis with algae heatmaps
  • Seasonal water temperature (4 seasons from Landsat thermal)
  • 5-year algae heatmaps + animated GIF + temporal video
  • Nearest-beach comparison (5 closest sites in database)
  • Regional benchmarking by climate zone (355-beach database)
  • GIS export package with FGDC-compliant shapefiles
  • Environmental recommendations and government contacts

Composite scoring:

  • Overall Beach Quality (0-100) = Natural Features (30%) + Environmental Stability (25%) + Water Quality (20%) + Vegetation Health (15%) + Low Development (10%)
  • Water Quality (0-100) = Turbidity (30%) + Sediment (30%) + NDWI (20%) + Color (20%)
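
Both composite scores are weighted sums, which a short sketch makes concrete. It assumes every component has already been normalized to a 0-100 scale; the dictionary keys are illustrative names.

```python
OVERALL_WEIGHTS = {
    "natural_features": 0.30,
    "environmental_stability": 0.25,
    "water_quality": 0.20,
    "vegetation_health": 0.15,
    "low_development": 0.10,
}
WATER_QUALITY_WEIGHTS = {"turbidity": 0.30, "sediment": 0.30, "ndwi": 0.20, "color": 0.20}


def weighted_score(components, weights):
    """Weighted sum of 0-100 component scores; the result is also 0-100."""
    missing = set(weights) - set(components)
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(components[name] * w for name, w in weights.items())
```

Since each weight set sums to 1.0, a site scoring 80 on every component also scores 80 overall.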

Technical Architecture

Key technical decisions and system design.

What is the system architecture?
  • Mobile app: Flutter (Dart) — cross-platform (Android, iOS, Web)
  • Backend: Firebase (Firestore, Storage, Auth, App Check)
  • Report server: Flask + WebSocket for real-time progress updates to the mobile app during report generation
  • Satellite processing: Python (xarray, stackstac, rasterio, pystac-client)
  • ML pipeline: scikit-learn, XGBoost, pickle serialization
  • Data access: STAC APIs (Element84 primary, Planetary Computer + Copernicus fallbacks)
  • Caching: Multi-level — in-memory → disk (NetCDF .nc files) → remote STAC download
  • Crash recovery: JSON checkpoint/resume system after each processing phase
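
A JSON checkpoint/resume loop of the kind listed under crash recovery might look like this. The phase names, state layout and file handling are assumptions made for this sketch.

```python
import json
from pathlib import Path


def run_with_checkpoints(phases, checkpoint_path):
    """Run (name, func) phases in order, saving state to JSON after each one.

    On restart, phases already recorded in the checkpoint are skipped.
    Each func takes the accumulated results dict and returns a
    JSON-serializable value.
    """
    path = Path(checkpoint_path)
    state = json.loads(path.read_text()) if path.exists() else {"done": [], "results": {}}
    for name, func in phases:
        if name in state["done"]:
            continue  # completed before the crash
        state["results"][name] = func(state["results"])
        state["done"].append(name)
        tmp = path.with_suffix(".tmp")
        tmp.write_text(json.dumps(state))
        tmp.replace(path)  # write then rename, so a crash mid-write keeps the old file
    return state["results"]
```

If a phase raises, the checkpoint still holds every earlier phase, so rerunning the same call resumes at the failed phase instead of repeating downloads.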

How is water sampling validated?

Before running analysis on any location, we verify the sampling point is actually in permanent water using a 5-year NDWI timeseries. If the pin falls on land, an 8-directional search automatically adjusts toward the nearest deeper water pixel.
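
The 8-directional adjustment can be sketched as a search that walks outward along the compass directions until it meets a water pixel. This version works on a boolean water mask (for instance, pixels whose multi-year NDWI stays above a threshold) and returns the first hit; it does not model the "deeper water" preference, and the grid layout and step limit are assumptions.

```python
# Row/column offsets for N, NE, E, SE, S, SW, W, NW
DIRECTIONS = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]


def snap_to_water(water_mask, row, col, max_steps=50):
    """Return the nearest water pixel along the 8 compass directions.

    `water_mask` is a 2-D list of booleans (True = permanent water).
    Returns (row, col), or None if no water is found within max_steps.
    """
    n_rows, n_cols = len(water_mask), len(water_mask[0])
    if water_mask[row][col]:
        return row, col
    for step in range(1, max_steps + 1):
        for d_row, d_col in DIRECTIONS:
            r, c = row + d_row * step, col + d_col * step
            if 0 <= r < n_rows and 0 <= c < n_cols and water_mask[r][c]:
                return r, c
    return None
```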

Climate zones for regional benchmarking are latitude-based: Arctic (>60°), Temperate Cold (45–60°), Temperate Warm (30–45°), Subtropical (15–30°), Tropical (<15°).
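
The latitude bands map to a simple lookup. Using the absolute latitude (so that the southern hemisphere mirrors the northern) and assigning the shared boundaries at 30° and 45° to the colder band are assumptions made for this sketch.

```python
def climate_zone(latitude):
    """Map a latitude in degrees to its benchmarking climate zone.

    Arctic is strictly above 60 and Tropical strictly below 15, as in the
    text; the 30 and 45 boundaries are assigned to the colder zone here.
    """
    lat = abs(latitude)
    if lat > 60:
        return "Arctic"
    if lat >= 45:
        return "Temperate Cold"
    if lat >= 30:
        return "Temperate Warm"
    if lat >= 15:
        return "Subtropical"
    return "Tropical"
```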

Research Collaboration

We actively seek academic partnerships to improve our satellite retrieval algorithms and validate models against ground truth.

What kind of research partnerships are you looking for?

We're seeking academic collaborators in:

  • Remote sensing / geomatics — Atmospheric correction, retrieval algorithm optimization, geometric validation
  • Limnology / aquatic ecology — Eutrophication modeling, harmful algal bloom prediction
  • Environmental science — Citizen science methodology, water quality monitoring frameworks
  • Data science / ML — Feature engineering, model architectures, transfer learning across water body types

We bring: a deployed platform, 355+ ground truth sites, satellite processing infrastructure, municipal relationships, and potential co-funding through Alberta Innovates and other programs.

Is the platform suitable for graduate student projects?

Yes. The platform is a natural fit for graduate research, including:

  • Field data collection campaigns with built-in GPS/timestamp validation
  • Algorithm development with existing satellite processing infrastructure
  • Validation studies comparing satellite-derived metrics against in-situ measurements
  • Citizen science methodology research

We're happy to support HQP (Highly Qualified Personnel) components for NSERC or other grant applications.

Where are the biggest opportunities for improvement?
  • Atmospheric correction — Current pipeline uses L2A (pre-corrected) data. Custom atmospheric correction could improve retrieval accuracy.
  • Band ratio optimization — Chlorophyll proxy formula is empirical. Regionally-calibrated coefficients would improve accuracy.
  • Temporal resolution — Fusing Sentinel-2 (5-day revisit) with Landsat (16-day) and potentially Sentinel-3/MODIS for daily coverage.
  • Ground truth expansion — More in-situ water samples (chlorophyll, nutrients, turbidity) paired with satellite overpasses.
  • Model generalization — Transfer learning across climate zones and water body types.

Interested in Collaborating?

We're always looking for research partners to push the boundaries of satellite-derived water quality monitoring.

Get in Touch