Research FAQ | Nimpact Environmental

Platform Overview

MyBeachBook is a citizen science platform for water quality monitoring, combining field assessments with satellite analysis.

355+ Beaches in Database
357+ Field Assessments
40+ Data Categories per Assessment
6 Satellite Data Sources

What is MyBeachBook?

MyBeachBook is a cross-platform mobile application (Android, iOS, Web) built with Flutter. It enables citizen scientists to collect GPS-tagged, timestamped, photo-documented water quality assessments. On the backend, we run satellite analysis pipelines for algal bloom prediction, temporal trend analysis, and environmental monitoring.

The platform is backed by Firebase (Firestore, Storage, Auth) with Google Maps integration, Google ML Kit for on-device image labeling, and the iNaturalist API for species identification.

What data does the app collect?

Each field assessment captures 40+ data categories:

Flora: Kelp, seaweed, eelgrass, and other aquatic vegetation
Fauna: 25+ species including barnacles, clams, mussels, anemones
Substrate composition: Sand, pebbles, rocks, boulders, stone, mud (proportional)
Physical dimensions: Beach width, length, bluff height
Driftwood: Kindling, firewood, logs, trees
Environmental indicators: Garbage levels, wind conditions, human activity
Photos: ML-assisted identification via Google ML Kit (50% confidence threshold)

All submissions are geohash-validated (precision 9 = ~100m) to confirm the contributor is physically at the location, and admin-moderated before inclusion.
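
As a rough illustration of a geohash-based location check, the sketch below encodes two coordinates with the standard geohash algorithm and counts how many leading characters they share. The function names, and the idea of comparing shared prefixes, are assumptions for illustration only and are not taken from the app's validation code.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash_encode(lat: float, lon: float, precision: int = 9) -> str:
    """Encode a latitude/longitude pair with the standard geohash algorithm."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits are interleaved, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


def shared_prefix_length(hash_a: str, hash_b: str) -> int:
    """Number of leading characters two geohashes have in common."""
    count = 0
    for a, b in zip(hash_a, hash_b):
        if a != b:
            break
        count += 1
    return count
```

A longer shared prefix generally means the two points are closer together, although two nearby points on opposite sides of a cell boundary can share only a short prefix, so a production check would likely also compare neighbouring cells or a direct distance.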

What makes this ground truth data unique?

Our data addresses a critical gap in remote sensing research: concurrent ground truth paired with satellite observations. Most remote sensing studies lack field-validated data collected at the same locations and time periods as satellite passes.

All assessments are GPS-tagged with sub-10m accuracy
Timestamped to correlate with satellite overpasses
Photo-documented for verification
Multiple contributions aggregated per site (averaged numerics, deduplicated species)
Quality-controlled through admin moderation
Covers tidal (225), lake (82), and river (48) environments
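
The per-site aggregation mentioned above (averaged numerics, deduplicated species) could look roughly like the following. The record layout, with a `numerics` dict and a `species` list per assessment, is a hypothetical structure chosen for this sketch.

```python
from statistics import mean


def aggregate_site(assessments):
    """Combine several assessments of one site into a single record.

    Each assessment is assumed (for this sketch) to look like:
    {"numerics": {"beach_width_m": 12.0}, "species": ["Kelp", "barnacle"]}
    """
    numeric_values = {}
    species = set()
    for assessment in assessments:
        for key, value in assessment.get("numerics", {}).items():
            if value is not None:  # skip fields the contributor left blank
                numeric_values.setdefault(key, []).append(value)
        # Normalize names so "Kelp" and "kelp" count as one species.
        species.update(name.strip().lower() for name in assessment.get("species", []))
    return {
        "numerics": {key: mean(values) for key, values in numeric_values.items()},
        "species": sorted(species),
        "n_assessments": len(assessments),
    }
```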

Satellite Data Pipeline

Our backend processes multi-source satellite imagery using STAC APIs, independent of Google Earth Engine.

What satellite data sources do you use?

Sentinel-2 L2A (ESA) — Primary optical source. Accessed via Element84 STAC API with Planetary Computer and Copernicus fallbacks
Landsat 8/9 Collection 2 L2 (USGS) — Thermal band (lwir11) for water surface temperature
NASA SRTM — Elevation and terrain data (±6m vertical accuracy)
VIIRS DNB (NASA Black Marble) — Night lights for development pressure analysis
WorldPop — Population density at 100m resolution
JRC Global Surface Water — Lake extent detection and water permanence

What Sentinel-2 bands do you use?

B02 Blue (490nm) — Water color, turbidity assessment
B03 Green (560nm) — NDWI calculation, chlorophyll estimation, turbidity
B04 Red (665nm) — NDVI, chlorophyll proxy, sediment detection
B08 NIR (842nm) — NDWI, NDVI, floating debris detection
B11 SWIR1 (1610nm) — Urban development index (NDBI)
SCL — Scene Classification Layer for cloud masking

Data is downloaded at 120m resolution (from native 10-20m) for computational efficiency, with cloud masking via SCL using progressive thresholds (20% → 40% → 60% → 80% → 95%).
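
One plausible reading of the progressive thresholds is that the maximum acceptable cloud fraction is relaxed step by step until enough usable scenes are found. The sketch below shows that loop under this assumption. The `search` callable and the `min_scenes` parameter are hypothetical; in practice `search` might wrap a STAC query filtered on cloud cover.

```python
CLOUD_THRESHOLDS = (20, 40, 60, 80, 95)  # percent, from the list above


def search_with_progressive_cloud_limit(search, min_scenes=1, thresholds=CLOUD_THRESHOLDS):
    """Relax the cloud limit until `search` returns at least `min_scenes` scenes.

    `search(max_cloud_pct)` is a caller-supplied function that returns a list
    of scenes at or below the given cloud percentage.
    Returns (threshold_used, scenes); scenes may be short if nothing qualifies.
    """
    scenes = []
    used = None
    for max_cloud in thresholds:
        used = max_cloud
        scenes = search(max_cloud)
        if len(scenes) >= min_scenes:
            break
    return used, scenes
```

Starting strict and relaxing only when needed keeps the cleanest imagery where it exists while still returning something for persistently cloudy sites.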

What spectral indices do you calculate?

NDWI = (Green - NIR) / (Green + NIR) — Water detection, shoreline dynamics, clarity proxy
NDVI = (NIR - Red) / (NIR + Red) — Vegetation health, 5-year trend analysis
NDBI = (SWIR1 - NIR) / (SWIR1 + NIR) — Urban development pressure (2015-2018 vs 2021-2024 comparison)
Chlorophyll proxy = (Green/Red - 0.9) × 27.3, clamped [0, 30] mg/m³
Chlorophyll Index = (Green - Red) / (Green + Red)
Turbidity = (Red × 0.3 + Green × 0.7) / 1000
Sediment ratio = Red / Green
Water Color Index = Blue/Green/Red band ratios (G/B and R/G)
Floating Debris Index = (NIR - Red) / (NIR + Red) × (Red / Green)
Shoreline Risk Proxy = std(NDWI over time) — temporal standard deviation of NDWI across 3 years of Sentinel-2 imagery (up to 20 scenes, 100m buffer, ≤20% cloud cover). Higher variance indicates a more dynamic or unstable shoreline. Paired with Water Index = mean(NDWI) for average water presence.
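
Several of the formulas above translate directly into NumPy. The following is a minimal sketch, assuming band values arrive as arrays (or scalars) on a common grid; the function names and the NaN handling for zero denominators are choices made for this example.

```python
import numpy as np


def _ratio(numerator, denominator):
    """Element-wise division that yields NaN where the denominator is zero."""
    numerator = np.asarray(numerator, dtype=float)
    denominator = np.asarray(denominator, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(denominator == 0, np.nan, numerator / denominator)


def normalized_difference(a, b):
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return _ratio(a - b, a + b)


def ndwi(green, nir):
    return normalized_difference(green, nir)


def ndvi(nir, red):
    return normalized_difference(nir, red)


def ndbi(swir1, nir):
    return normalized_difference(swir1, nir)


def chlorophyll_proxy(green, red):
    """(Green/Red - 0.9) x 27.3, clamped to [0, 30] mg/m^3."""
    return np.clip((_ratio(green, red) - 0.9) * 27.3, 0.0, 30.0)


def turbidity(red, green):
    red = np.asarray(red, dtype=float)
    green = np.asarray(green, dtype=float)
    return (red * 0.3 + green * 0.7) / 1000.0


def shoreline_risk(ndwi_stack):
    """Temporal standard deviation of NDWI; axis 0 is assumed to be time."""
    return np.nanstd(np.asarray(ndwi_stack, dtype=float), axis=0)


def water_index(ndwi_stack):
    """Temporal mean of NDWI; axis 0 is assumed to be time."""
    return np.nanmean(np.asarray(ndwi_stack, dtype=float), axis=0)
```

For example, `ndwi(0.3, 0.1)` evaluates to 0.5, and a green/red ratio of 1.5 gives a chlorophyll proxy of about 16.4 mg/m³.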

Why STAC APIs instead of Google Earth Engine?

We migrated from GEE to STAC APIs for platform independence and reproducibility. Our pipeline uses xarray + stackstac for lazy loading and chunked processing, with multi-level caching (in-memory → disk via NetCDF → STAC download). This eliminates dependency on Google's infrastructure and gives us full control over the processing chain.

Primary STAC endpoint: Element84, with automatic fallback to Planetary Computer and Copernicus.
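
The fallback order can be expressed as a small loop that tries each endpoint in turn. This is a sketch only: the endpoint labels are placeholders rather than real URLs, and `fetch` stands in for whatever function performs the actual STAC search.

```python
STAC_ENDPOINTS = ("element84", "planetary_computer", "copernicus")  # labels, not URLs


def fetch_with_fallback(fetch, endpoints=STAC_ENDPOINTS):
    """Try each endpoint in order and return (endpoint_name, result).

    `fetch(endpoint_name)` is a caller-supplied function that raises on failure.
    """
    errors = {}
    for name in endpoints:
        try:
            return name, fetch(name)
        except Exception as exc:  # a real pipeline would catch narrower errors
            errors[name] = repr(exc)
    raise RuntimeError(f"All STAC endpoints failed: {errors}")
```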

AI Algae Prediction Models

Water-type-specific machine learning models trained on satellite-derived features and citizen science ground truth.

How does the algae prediction model work?

We train separate models for each water body type (lake, river, tidal) because the physics driving algal growth differ between environments. The pipeline:

Data loading — 355 beaches from our database, split by water body type
Feature extraction — 33 features per beach (v2 model)
Water-type-specific feature engineering — 3 custom features per type capturing unique physics
Preprocessing — Median imputation + standard scaling
Model training — 4-algorithm comparison, best selected by cross-validated R²
Serialization — Pickle files loaded at report generation time

What features does the model use?

The v2 model uses 33 input features organized into categories:

Geographic (3): latitude, longitude, absolute latitude
Satellite temperature (4): average, max, min, range — from Landsat 8/9 thermal band. Temperature-validated to reject ice/snow contamination (min < -5°C or max < 5°C)
Satellite clarity (3): turbidity, Secchi depth, clarity score — from Sentinel-2
Environmental indices (5): water index, shoreline risk, garbage levels, population pressure, sediment index
Substrate (10): sand, pebbles, rocks, boulders, stone, mud + 4 engineered features
Physical dimensions (3): beach width, length, area
Biological (2): seaweed total, kelp presence
Interaction terms (3): temp×clarity, temp×turbidity, temp×nutrient retention
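
The temperature validation rule and the three interaction terms listed above are simple enough to sketch directly. Function and key names here are illustrative assumptions.

```python
def temperature_is_valid(temp_min_c, temp_max_c):
    """Reject readings that look like ice/snow contamination.

    A reading is rejected when min < -5 degrees C or max < 5 degrees C.
    """
    return not (temp_min_c < -5.0 or temp_max_c < 5.0)


def interaction_terms(temp_avg, clarity, turbidity, nutrient_retention):
    """Products of average temperature with three other features."""
    return {
        "temp_x_clarity": temp_avg * clarity,
        "temp_x_turbidity": temp_avg * turbidity,
        "temp_x_nutrient_retention": temp_avg * nutrient_retention,
    }
```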

Engineered substrate features:

nutrient_retention = weighted score (mud×5 + stone×3 + rocks×2 + pebbles×1 + sand×0.5) / total
substrate_diversity = count of non-zero substrate types (0-5)
fine_coarse_ratio = (mud + sand) / (rocks + boulders + pebbles)
dark_substrate = mud + stone (heat absorption proxy)
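
A minimal sketch of these four engineered features follows. It assumes that `total` is the sum of all six substrate proportions and that a zero denominator yields `None`; neither detail is stated above. The diversity count here covers all six substrate types, whereas the stated range is 0-5, so the production code may exclude one type.

```python
def substrate_features(sand, pebbles, rocks, boulders, stone, mud):
    """Compute the engineered substrate features from substrate proportions."""
    parts = [sand, pebbles, rocks, boulders, stone, mud]
    total = sum(parts)  # assumption: "total" means the sum of all six types
    weighted = mud * 5 + stone * 3 + rocks * 2 + pebbles * 1 + sand * 0.5
    coarse = rocks + boulders + pebbles
    return {
        "nutrient_retention": weighted / total if total else None,
        "substrate_diversity": sum(1 for p in parts if p > 0),
        "fine_coarse_ratio": (mud + sand) / coarse if coarse else None,
        "dark_substrate": mud + stone,  # heat absorption proxy
    }
```

For a beach that is 50% sand, 20% pebbles, 10% rocks, 10% stone, and 10% mud, this gives a nutrient retention score of 1.45 and a fine-to-coarse ratio of 2.0.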

What are the water-type-specific features?

Each water body type gets 3 additional custom features that capture its unique physical dynamics:

LAKE:
thermal_stratification_proxy = temp_range × secchi
lake_bloom_risk = temp_avg × turbidity × nutrient_retention
warm_shallow = temp_avg / secchi

RIVER:
flow_proxy = turbidity × temp_range
upstream_pollution = garbage × turbidity
nutrient_loading = population_pressure × turbidity × nutrient_retention

TIDAL:
tidal_exchange_proxy = turbidity / temp_range
coastal_upwelling = secchi × lat_abs / 100
mixing_index = temp_range × turbidity

For example, a lake's bloom risk depends on thermal stratification and nutrient retention in fine sediments, while a river's depends on upstream population pressure and flow-mediated nutrient loading. Generic models miss these physics.
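
The nine formulas above can be sketched as a single function that dispatches on water body type. The input dictionary keys and the NaN result for zero denominators are assumptions made for this example.

```python
import math


def _safe_div(numerator, denominator):
    """Return NaN instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def water_type_features(water_type, f):
    """Compute the three custom features for one water body type.

    `f` is a dict of base features (hypothetical key names).
    """
    kind = water_type.lower()
    if kind == "lake":
        return {
            "thermal_stratification_proxy": f["temp_range"] * f["secchi"],
            "lake_bloom_risk": f["temp_avg"] * f["turbidity"] * f["nutrient_retention"],
            "warm_shallow": _safe_div(f["temp_avg"], f["secchi"]),
        }
    if kind == "river":
        return {
            "flow_proxy": f["turbidity"] * f["temp_range"],
            "upstream_pollution": f["garbage"] * f["turbidity"],
            "nutrient_loading": f["population_pressure"] * f["turbidity"] * f["nutrient_retention"],
        }
    if kind == "tidal":
        return {
            "tidal_exchange_proxy": _safe_div(f["turbidity"], f["temp_range"]),
            "coastal_upwelling": f["secchi"] * f["lat_abs"] / 100,
            "mixing_index": f["temp_range"] * f["turbidity"],
        }
    raise ValueError(f"Unknown water body type: {water_type!r}")
```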

What algorithms are compared?

The v2 training script compares four algorithms per water type and selects the best by 5-fold cross-validated R²:

1. GradientBoosting (default): n_estimators=200, max_depth=5, learning_rate=0.05
2. GradientBoosting (tuned): n_estimators=300, max_depth=4, learning_rate=0.03, subsample=0.8
3. RandomForest: n_estimators=200, max_depth=10
4. XGBoost: n_estimators=200, max_depth=5, learning_rate=0.05, subsample=0.8, colsample=0.8

XGBoost wins in most cases. Two targets are trained per water type: chlorophyll-a concentration and cyanobacteria index.
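
A comparison of this kind can be sketched with scikit-learn alone. The example below wraps each candidate in a pipeline with median imputation and standard scaling, then ranks candidates by mean 5-fold cross-validated R². XGBoost is left out to keep the sketch free of extra dependencies, and the function names and the fixed `random_state` are assumptions rather than details of the v2 training script.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def candidate_models(random_state=0):
    """Three of the four candidates listed above (XGBoost omitted here)."""
    return {
        "gb_default": GradientBoostingRegressor(
            n_estimators=200, max_depth=5, learning_rate=0.05, random_state=random_state
        ),
        "gb_tuned": GradientBoostingRegressor(
            n_estimators=300, max_depth=4, learning_rate=0.03, subsample=0.8,
            random_state=random_state,
        ),
        "random_forest": RandomForestRegressor(
            n_estimators=200, max_depth=10, random_state=random_state
        ),
    }


def select_best_model(X, y, cv=5):
    """Return (best_name, scores) ranked by mean cross-validated R^2."""
    scores = {}
    for name, model in candidate_models().items():
        # Imputation and scaling live inside the pipeline so they are
        # re-fitted on each training fold and never see the held-out fold.
        pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler(), model)
        scores[name] = float(cross_val_score(pipeline, X, y, cv=cv, scoring="r2").mean())
    best = max(scores, key=scores.get)
    return best, scores
```

Keeping preprocessing inside the cross-validated pipeline avoids leaking information from validation folds into the imputation medians and scaling statistics.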

What accuracy do the models achieve?

V2 model results (XGBoost + substrate features, 33 input features):

Lakes — Chlorophyll-a: R² = 0.62 (+12.9% over v1)
Rivers — Chlorophyll-a: R² = 0.59 (+237% over v1)
Tidal — Chlorophyll-a: R² = 0.59
Rivers — Cyanobacteria Index: R² = 0.69

Satellite measurement accuracy:

Water surface temperature: ±0.5-2°C
Clarity indices: ±20-40%
Satellite vs in-situ chlorophyll: ±30-40%
GPS positional accuracy: <10m

Improving retrieval algorithm accuracy through better atmospheric correction and validation methodology is a key area where research collaboration would have the most impact.

Algae Heatmap Pipeline

Multi-year temporal analysis of algal bloom patterns from Sentinel-2 imagery.

How are algae heatmaps generated?

Temporal scope: 5-year analysis (the current year and the four preceding years), summer/growing season only (May–September for northern hemisphere)
Per year: Query Sentinel-2, load Green/Red/NIR bands, apply water mask (NDWI > 0.1)
Chlorophyll proxy: (Green/Red - 0.9) × 27.3, clamped [0, 30] mg/m³
Lake detection: JRC Global Surface Water dataset
Sampling grid: 100m spacing across entire lake surface (e.g. 845 points for a medium lake)
Output formats: Per-year PNG heatmaps (blue→green→yellow→red), animated GIF (220 frames, 800ms per year), temporal MP4 video (week-by-week, 30 FPS)
Trend detection: Linear comparison first vs last year (>20% = Increasing, <-20% = Decreasing)
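
The trend rule in the last step can be sketched as a percentage change between the first and last year. The "Stable" and "Undetermined" labels are assumptions; only the Increasing and Decreasing thresholds are given above.

```python
def classify_trend(first_year_mean, last_year_mean, threshold_pct=20.0):
    """Label the change in mean chlorophyll proxy between first and last year."""
    if first_year_mean == 0:
        return "Undetermined"  # assumption: no baseline to compare against
    change_pct = (last_year_mean - first_year_mean) / first_year_mean * 100.0
    if change_pct > threshold_pct:
        return "Increasing"
    if change_pct < -threshold_pct:
        return "Decreasing"
    return "Stable"  # assumption: label for changes within +/-20%
```

For instance, a site whose seasonal mean rises from 10 to 13 mg/m³ (a 30% change) would be labeled Increasing.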

Report System

Comprehensive environmental reports combining satellite analysis with citizen science data.

What does a premium report include?

Each report is a 24+ page document (Word + PDF) covering 10 major sections:

Satellite-derived water quality metrics
Multi-temporal satellite imagery analysis (1984–present via Landsat + Sentinel-2)
Environmental maps (terrain, land cover, night lights, population density)
Lake-specific analysis with algae heatmaps
Seasonal water temperature (4 seasons from Landsat thermal)
5-year algae heatmaps + animated GIF + temporal video
Nearest-beach comparison (5 closest sites in database)
Regional benchmarking by climate zone (355-beach database)
GIS export package with FGDC-compliant shapefiles
Environmental recommendations and government contacts

Composite scoring:

Overall Beach Quality (0-100) = Natural Features (30%) + Environmental Stability (25%) + Water Quality (20%) + Vegetation Health (15%) + Low Development (10%)
Water Quality (0-100) = Turbidity (30%) + Sediment (30%) + NDWI (20%) + Color (20%)
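
Both composite scores are weighted sums. The sketch below assumes each component has already been normalized to a 0-100 sub-score, which the formulas above imply but do not spell out; the dictionary keys are illustrative.

```python
OVERALL_WEIGHTS = {
    "natural_features": 0.30,
    "environmental_stability": 0.25,
    "water_quality": 0.20,
    "vegetation_health": 0.15,
    "low_development": 0.10,
}

WATER_QUALITY_WEIGHTS = {
    "turbidity": 0.30,
    "sediment": 0.30,
    "ndwi": 0.20,
    "color": 0.20,
}


def weighted_score(components, weights):
    """Weighted sum of 0-100 sub-scores; raises if a component is missing."""
    missing = set(weights) - set(components)
    if missing:
        raise KeyError(f"Missing components: {sorted(missing)}")
    return sum(weights[name] * components[name] for name in weights)
```

With sub-scores of 80 (turbidity), 60 (sediment), 50 (NDWI), and 100 (color), the Water Quality score comes to 72.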

Technical Architecture

Key technical decisions and system design.

What is the system architecture?

Mobile app: Flutter (Dart) — cross-platform (Android, iOS, Web)
Backend: Firebase (Firestore, Storage, Auth, App Check)
Report server: Flask + WebSocket for real-time progress updates to the mobile app during report generation
Satellite processing: Python (xarray, stackstac, rasterio, pystac-client)
ML pipeline: scikit-learn, XGBoost, pickle serialization
Data access: STAC APIs (Element84 primary, Planetary Computer + Copernicus fallbacks)
Caching: Multi-level — in-memory → disk (NetCDF .nc files) → remote STAC download
Crash recovery: JSON checkpoint/resume system after each processing phase
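
A JSON checkpoint/resume loop of the kind listed under crash recovery might look like the following. This is a sketch under stated assumptions: the state layout, the function names, and the write-then-rename step are illustrative, and phase results are assumed to be JSON-serializable.

```python
import json
import os
from pathlib import Path


def load_checkpoint(path):
    """Load saved state, or start fresh if no checkpoint exists yet."""
    p = Path(path)
    if not p.exists():
        return {"completed": [], "results": {}}
    return json.loads(p.read_text())


def save_checkpoint(path, state):
    """Write to a temporary file, then rename, so a crash cannot leave half a file."""
    tmp = Path(str(path) + ".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, path)


def run_phases(phases, checkpoint_path):
    """Run (name, function) phases in order, skipping any already completed."""
    state = load_checkpoint(checkpoint_path)
    for name, func in phases:
        if name in state["completed"]:
            continue  # finished in an earlier run
        state["results"][name] = func()
        state["completed"].append(name)
        save_checkpoint(checkpoint_path, state)  # persist after each phase
    return state
```

If a run fails partway through, calling `run_phases` again with the same checkpoint path picks up at the first phase that has not been recorded as completed.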

How is water sampling validated?

Before running analysis on any location, we verify that the sampling point is actually in permanent water using a 5-year NDWI timeseries. If the pin falls on land, an 8-directional search automatically moves it to the nearest deeper-water pixel.
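
An 8-directional search over a gridded NDWI mean could be sketched as follows. The water threshold of 0.1, the step limit, and the use of the highest NDWI value as a stand-in for "deeper" water are all assumptions made for this example.

```python
import numpy as np

# N, NE, E, SE, S, SW, W, NW as (row, col) offsets
DIRECTIONS = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]


def snap_to_water(mean_ndwi, row, col, water_threshold=0.1, max_steps=50):
    """Move a pin to the nearest water pixel along the 8 compass directions.

    `mean_ndwi` is a 2-D grid of time-averaged NDWI. Returns (row, col) of the
    adjusted pin, or None if no water is found within `max_steps` pixels.
    """
    grid = np.asarray(mean_ndwi, dtype=float)
    if grid[row, col] > water_threshold:
        return row, col  # already on water
    n_rows, n_cols = grid.shape
    for step in range(1, max_steps + 1):
        candidates = []
        for d_row, d_col in DIRECTIONS:
            r, c = row + d_row * step, col + d_col * step
            if 0 <= r < n_rows and 0 <= c < n_cols and grid[r, c] > water_threshold:
                candidates.append((grid[r, c], r, c))
        if candidates:
            # Among equally distant hits, prefer the strongest water signal.
            _, r, c = max(candidates)
            return r, c
    return None
```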

Climate zones for regional benchmarking are latitude-based: Arctic (>60°), Temperate Cold (45–60°), Temperate Warm (30–45°), Subtropical (15–30°), Tropical (<15°).
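
The latitude bands map directly to a lookup function. This sketch assumes absolute latitude is used, so both hemispheres are treated alike, and assigns each shared boundary value (60°, 45°, 30°, 15°) to the band listed for it in the ranges above; neither detail is stated explicitly.

```python
def climate_zone(latitude):
    """Map a latitude in degrees to one of the five benchmarking zones."""
    lat = abs(latitude)  # assumption: southern latitudes mirror northern ones
    if lat > 60:
        return "Arctic"
    if lat >= 45:
        return "Temperate Cold"
    if lat >= 30:
        return "Temperate Warm"
    if lat >= 15:
        return "Subtropical"
    return "Tropical"
```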

Research Collaboration

We actively seek academic partnerships to improve our satellite retrieval algorithms and validate models against ground truth.

What kind of research partnerships are you looking for?

We're seeking academic collaborators in:

Remote sensing / geomatics — Atmospheric correction, retrieval algorithm optimization, geometric validation
Limnology / aquatic ecology — Eutrophication modeling, harmful algal bloom prediction
Environmental science — Citizen science methodology, water quality monitoring frameworks
Data science / ML — Feature engineering, model architectures, transfer learning across water body types

We bring: a deployed platform, 355+ ground truth sites, satellite processing infrastructure, municipal relationships, and potential co-funding through Alberta Innovates and other programs.

Is the platform suitable for graduate student projects?

Yes. The platform is a natural fit for graduate research, including:

Field data collection campaigns with built-in GPS/timestamp validation
Algorithm development with existing satellite processing infrastructure
Validation studies comparing satellite-derived metrics against in-situ measurements
Citizen science methodology research

We're happy to support HQP (Highly Qualified Personnel) components for NSERC or other grant applications.

Where are the biggest opportunities for improvement?

Atmospheric correction — Current pipeline uses L2A (pre-corrected) data. Custom atmospheric correction could improve retrieval accuracy.
Band ratio optimization — Chlorophyll proxy formula is empirical. Regionally-calibrated coefficients would improve accuracy.
Temporal resolution — Fusing Sentinel-2 (5-day revisit) with Landsat (16-day) and potentially Sentinel-3/MODIS for daily coverage.
Ground truth expansion — More in-situ water samples (chlorophyll, nutrients, turbidity) paired with satellite overpasses.
Model generalization — Transfer learning across climate zones and water body types.
