
This post walks through the pollen forecasting workflow shown in the diagram. It covers how raw data is prepared, how two models are trained, and how results are evaluated in one final pipeline.
Diagram
Overview
Pollen forecasting gets hard fast: data arrives at different frequencies, locations are unevenly sampled, and weather effects lag over time. This workflow is designed to make those constraints explicit instead of hiding them.
The pipeline is split into three phases. First, we build a modeling dataset from pollen, climate, NDVI, and land-cover inputs. Next, we train a classifier and a regressor on a location-based split. Last, we combine predictions in a single evaluation run to measure how the system performs on unseen locations.
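The location-based split mentioned above can be sketched in plain Python. This is a minimal illustration, not the post's actual split code: the row layout, the `location_id` key, and the helper name `split_by_location` are assumptions.

```python
import random
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def split_by_location(
    rows: Sequence[Row],
    n_test_locations: int,
    seed: int = 0,
    key: str = "location_id",  # hypothetical column name
) -> Tuple[List[Row], List[Row]]:
    """Hold out whole locations so no test location appears in training."""
    locations = sorted({row[key] for row in rows})
    if not 0 < n_test_locations < len(locations):
        raise ValueError("n_test_locations must leave at least one training location")
    rng = random.Random(seed)
    test_locations = set(rng.sample(locations, n_test_locations))
    train = [row for row in rows if row[key] not in test_locations]
    test = [row for row in rows if row[key] in test_locations]
    return train, test


# Toy data: 12 locations with 3 daily rows each, mirroring a 10/2 split.
rows = [
    {"location_id": f"loc_{i:02d}", "day": d, "pollen": float(i + d)}
    for i in range(12)
    for d in range(3)
]
train_rows, test_rows = split_by_location(rows, n_test_locations=2, seed=42)
train_locs = {r["location_id"] for r in train_rows}
test_locs = {r["location_id"] for r in test_rows}
print(len(train_locs), len(test_locs), train_locs & test_locs)  # 10 2 set()
```

If the data lives in a pandas DataFrame, scikit-learn's `GroupShuffleSplit` with the location IDs passed as `groups` is a common way to get the same guarantee.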
Diagram breakdown
The workflow has three connected phases:
- Data engineering and feature matrix construction
  - `1_clean_pollen.py` converts hourly pollen rows into daily data and creates unique location records.
  - `2_download_climate_V2.py` pulls climate variables, including wind direction.
  - AppEEARS exports NDVI and LULC data.
  - `5_merge_all_data_V2.py` joins sources and interpolates missing NDVI points.
  - `6_advanced_feature_engineering_V2.py` creates GDD, seasonal, lag, rolling, and wind-vector features, then writes `FINAL_MODELING_DATASET.csv`.
- Hybrid model training
  - Data is split by location: 10 locations for training, 2 unseen locations for testing.
  - `7_train_model_A` trains an `XGBClassifier` to detect whether pollen season is active.
  - `8_train_model_B` trains an `XGBRegressor` to estimate pollen amount.
- Final evaluation pipeline
  - `9_run_final_pipeline_TUNED.py` runs both models and computes metrics.
  - Combination rule: if Model A predicts "inactive season," final pollen output is `0`; otherwise, use Model B output.
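The feature families named in the data engineering phase (GDD, lag, rolling, and wind-vector features) can be sketched with the standard library alone. The post does not show the feature script, so everything here is an assumption for illustration: the function names, the 5.0 degree base temperature, the simple averaging form of GDD, and the meteorological convention for wind direction.

```python
import math
from itertools import accumulate
from typing import List, Optional, Sequence, Tuple


def wind_vector(speed: float, direction_deg: float) -> Tuple[float, float]:
    """Split wind into east-west (u) and north-south (v) components.

    Assumes the meteorological convention: direction is where the wind
    blows FROM, in degrees clockwise from north.
    """
    rad = math.radians(direction_deg)
    return -speed * math.sin(rad), -speed * math.cos(rad)


def daily_gdd(t_max: float, t_min: float, base: float = 5.0) -> float:
    """Growing degree days for one day; the base temperature is an assumed value."""
    return max(0.0, (t_max + t_min) / 2.0 - base)


def lag(values: Sequence[float], n: int) -> List[Optional[float]]:
    """Value from n steps earlier; None where no history exists yet."""
    return [values[i - n] if i >= n else None for i in range(len(values))]


def rolling_mean(values: Sequence[float], window: int) -> List[Optional[float]]:
    """Trailing mean over the current and previous window-1 values."""
    out: List[Optional[float]] = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough history to fill the window
        else:
            out.append(sum(values[i + 1 - window : i + 1]) / window)
    return out


# Four days of made-up temperatures for a single location, in date order.
t_max = [12.0, 15.0, 9.0, 20.0]
t_min = [4.0, 7.0, 1.0, 10.0]
gdd = [daily_gdd(hi, lo) for hi, lo in zip(t_max, t_min)]
cumulative_gdd = list(accumulate(gdd))
gdd_lag_1 = lag(gdd, 1)
gdd_roll_3 = rolling_mean(gdd, 3)
u, v = wind_vector(5.0, 270.0)  # a westerly wind blows toward the east

print(gdd)             # [3.0, 6.0, 0.0, 10.0]
print(cumulative_gdd)  # [3.0, 9.0, 9.0, 19.0]
print(gdd_lag_1)       # [None, 3.0, 6.0, 0.0]
```

Lag and rolling features only make sense when computed per location on date-sorted rows; mixing locations in one sequence would leak values across sites. Encoding wind as two components also avoids the artificial jump between 359 and 1 degrees that a raw direction column has.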
Key insights
- Data quality work is not optional. In this kind of system, forecast errors often trace back to merge gaps, inconsistent timestamps, or imprecise location mapping.
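- Splitting by location is a stricter test than a random split. A random split puts rows from the same location in both train and test, so scores can look good without the model generalizing; a location split checks performance on places the model has never seen.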
- The hybrid rule is simple but useful: first decide whether pollen activity exists, then estimate magnitude only when it does.
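The hybrid rule can be written as a small gating function. This is a sketch under assumptions: the function name `combine_predictions` is hypothetical, and Model A's output is assumed to be encoded as 1 for an active season and 0 for an inactive one.

```python
from typing import List, Sequence


def combine_predictions(
    season_active: Sequence[int],
    pollen_estimate: Sequence[float],
) -> List[float]:
    """Return 0 where Model A says inactive, else Model B's estimate."""
    if len(season_active) != len(pollen_estimate):
        raise ValueError("both models must predict for the same rows")
    return [
        float(amount) if active else 0.0
        for active, amount in zip(season_active, pollen_estimate)
    ]


# Made-up outputs: Model A flags (1 = active), Model B amounts.
model_a_flags = [0, 1, 1, 0]
model_b_amounts = [12.5, 40.0, 3.2, 0.7]
final = combine_predictions(model_a_flags, model_b_amounts)
print(final)  # [0.0, 40.0, 3.2, 0.0]
```

One consequence of this design is that Model A acts as a hard gate: a false "inactive" prediction forces the output to zero no matter what Model B estimated, so classifier recall on active days deserves attention during evaluation.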
Next steps
- Add feature importance reporting for both models so you can inspect what drives each prediction.
- Track metrics per location, not just global averages, to find weak regions early.
- Add retraining checkpoints when new climate or land-use data is ingested.
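Per-location tracking, as suggested above, can start as a simple grouping step. The sketch below uses mean absolute error, which is an assumption since the post does not name its metrics; the record layout and the name `mae_by_location` are also hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, Iterable, List, Tuple


def mae_by_location(
    records: Iterable[Tuple[str, float, float]],
) -> Dict[str, float]:
    """Mean absolute error per location from (location_id, y_true, y_pred)."""
    abs_errors: DefaultDict[str, List[float]] = defaultdict(list)
    for location_id, y_true, y_pred in records:
        abs_errors[location_id].append(abs(y_true - y_pred))
    return {loc: sum(errs) / len(errs) for loc, errs in abs_errors.items()}


# Made-up evaluation rows for two held-out locations.
records = [
    ("site_a", 10.0, 12.0),
    ("site_a", 0.0, 0.0),
    ("site_b", 50.0, 20.0),
    ("site_b", 30.0, 40.0),
]
per_location = mae_by_location(records)
global_mae = sum(abs(t - p) for _, t, p in records) / len(records)
print(per_location)  # {'site_a': 1.0, 'site_b': 20.0}
print(global_mae)    # 10.5
```

In this toy data the global figure of 10.5 hides the fact that one site is twenty times worse than the other, which is the kind of gap a per-location report is meant to surface.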