Flux Data Processing – Workflow Summary

Utah Geological Survey – Flux Monitoring Network

This page provides a high-level overview of the eddy covariance data processing pipeline. For full technical detail on each step, see the complete workflow reference.

Workflow Flowchart

Pipeline at a Glance

The pipeline converts preprocessed eddy covariance and meteorological data into quality-controlled, AmeriFlux-formatted output through five processing steps and three review checkpoints.

Step	Notebook	Input	Key Operations	Output
1	`1_compile_and_preprocess`	Raw data logger or EasyFlux web files	Compile datalogger files · standardize columns	`*_preprocessed.parquet`
2	`2_create_raw_data`	Preprocessed parquets	Merge data from different data streams · Merge met and eddy data · fix time alignment issues	`*_raw.parquet`
3	`3_qc_data`	`*_raw.parquet`	Calibration corrections · physical limits · SoilVue G calc · manual QC · signal flags	`*_qc.parquet`
4	`4_ameriflux`	`*_qc.parquet`	Signal-strength filter · drop non-AmeriFlux columns · format timestamps	`*_HH_*.csv`
5	`5_fluxqaqc`	`*_HH_*.csv`	Gap-fill redundant sensors · EBR correction · ET gap-fill · sensitivity tests	Daily ET + HTML reports

Review notebooks (read-only – findings feed corrections back into Step 3):

Notebook	When to run	Purpose
3a – Variable Review	After Step 3	Summary statistics, distributions, outlier detection
3b – Plot Review	After Step 3	Quick time-series sweep of every variable
4b – AmeriFlux Plot Review	After Step 4	Final visual check before AmeriFlux submission

Step-by-Step Summary

Step 1 – Compile & Preprocess

-> Full details

Create compiled, cleaned versions of the data from each data stream (e.g., CSFlux web/datalogger, AmeriFlux eddy web/datalogger, MetStats, MetAF).

Organize datalogger files into a single directory by table name (e.g., Statistics_Ameriflux, Statistics, Flux_AmerifluxFormat, Flux_CSFormat)
Compile data from a single data stream into a dataframe
Clean data by applying the renaming dictionary, setting data types, and fixing timestamp issues
Subset the data to only the 30- or 60-minute records, depending on user input
Present data for review to identify misnamed columns and missing data
Export data into separate parquet files for data from each data stream

Output: {station}_{timestart}_{timeend}_preprocessed.parquet
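
The compile, clean, and subset operations above can be sketched with pandas alone. This is a minimal illustration, not the notebook's code: the `RENAME` mapping, the column names, and the `compile_stream` helper are all hypothetical, and the example assumes timestamps mark the end of each averaging period.

```python
import pandas as pd

# Hypothetical renaming dictionary; the real mapping is site-specific.
RENAME = {"TIMESTAMP": "TIMESTAMP_END", "Ta_Avg": "TA", "LE_wpl": "LE"}


def compile_stream(frames, interval_minutes=30):
    """Concatenate logger tables, standardize columns, keep one interval."""
    df = pd.concat(frames, ignore_index=True).rename(columns=RENAME)
    # Fix timestamps: parse, drop unparseable rows, remove duplicates, sort.
    df["TIMESTAMP_END"] = pd.to_datetime(df["TIMESTAMP_END"], errors="coerce")
    df = df.dropna(subset=["TIMESTAMP_END"])
    df = df.drop_duplicates(subset="TIMESTAMP_END")
    df = df.set_index("TIMESTAMP_END").sort_index()
    # Set data types: anything that is not numeric becomes NaN.
    df = df.apply(pd.to_numeric, errors="coerce")
    # Keep only records that fall on the requested interval boundary.
    on_boundary = (df.index.minute % interval_minutes == 0) & (df.index.second == 0)
    return df[on_boundary]


# Two overlapping example files from the same (made-up) data stream.
file_a = pd.DataFrame({"TIMESTAMP": ["2024-06-01 00:30", "2024-06-01 01:00"],
                       "Ta_Avg": ["21.5", "21.1"]})
file_b = pd.DataFrame({"TIMESTAMP": ["2024-06-01 01:00", "2024-06-01 01:07",
                                     "2024-06-01 01:30"],
                       "Ta_Avg": ["21.1", "20.9", "NAN"]})
compiled = compile_stream([file_a, file_b])
# compiled.to_parquet(...) would then write the preprocessed file.
```

The duplicate 01:00 record is kept once, the off-interval 01:07 record is dropped, and the logger's text placeholder for a missing value becomes NaN.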

Step 2 – Create Raw Dataset

-> Full details

Assemble the preprocessed files into a single dataset and correct any datetime shifts.

Load preprocessed parquets for each data stream (CSFlux web/datalogger, AmeriFlux eddy web/datalogger, MetStats, MetAF)
Compare and merge eddy data – CSFlux and AmeriFlux eddy streams are compared for differences; the AmeriFlux stream is primary, with CSFlux filling gaps and providing unique columns (e.g., G_PLATE, diagnostic fields)
Compare and merge met data – MetStats and MetAF streams are compared and combined
Detect and correct temporal shifts – SoilVue sensor data (EC_3_, K_3_, SWC_3_, TS_3_) may be offset by one time step; cross-correlation detects the lag and a frequency shift corrects it. Historical timestamp misalignments are also identified and corrected.
Combine eddy and met – merge the two streams, resolve duplicate columns, validate 30-minute interval integrity
Standardize column naming – apply AmeriFlux positional suffixes (_1_1_1, _1_1_2, etc.)
Trim to station record – drop data before the station install date (retrieved from the database API)

Output: {station}_{timestart}_{timeend}_raw.parquet
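
Two of the operations above lend themselves to a short sketch: the primary/fallback merge and the lag detection with a frequency shift. The code below is an illustration under stated assumptions, not the micromet implementation. The helper names are hypothetical, and the lag search uses a plain lagged Pearson correlation in pandas as a stand-in for whatever cross-correlation routine the notebook calls.

```python
import numpy as np
import pandas as pd


def merge_primary_with_fallback(primary, fallback):
    """Keep primary values; use fallback for gaps and for columns only it has."""
    return primary.combine_first(fallback)


def detect_lag(reference, candidate, max_lag=4):
    """Return the lag, in time steps, at which candidate best matches reference.

    A positive result means candidate is delayed relative to reference.
    """
    best_lag, best_r = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        # Undo a trial delay of `lag` steps, then correlate on the overlap.
        r = reference.corr(candidate.shift(-lag))
        if pd.notna(r) and r > best_r:
            best_lag, best_r = lag, r
    return best_lag


index = pd.date_range("2024-06-01", periods=96, freq="30min")

# Merge: the AmeriFlux stream wins, CSFlux fills the gap and adds G_PLATE.
ameriflux = pd.DataFrame({"LE": [100.0, np.nan]}, index=index[:2])
csflux = pd.DataFrame({"LE": [90.0, 95.0], "G_PLATE": [5.0, 6.0]}, index=index[:2])
eddy = merge_primary_with_fallback(ameriflux, csflux)

# Lag: a synthetic series delayed by exactly one 30-minute step.
rng = np.random.default_rng(0)
reference = pd.Series(np.sin(np.arange(96) * 2 * np.pi / 48)
                      + 0.1 * rng.normal(size=96), index=index)
delayed = reference.shift(1)

lag = detect_lag(reference, delayed)
# Frequency shift: move the timestamps, not the values, then realign.
corrected = delayed.shift(periods=-lag, freq="30min").reindex(index)
```

Shifting with `freq` moves the index labels rather than the values, so the record at the far end of the series is left empty after realignment.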

Step 3 – Quality Control

-> Full details

The largest and most site-specific step:

Calibration corrections – date-gated fixes for soil heat flux storage thickness, precipitation calibration factors, and G_PLATE sign inversions
SoilVue G calculation – derive ground heat flux from temperature/moisture profiles using the soil_heat library (Johansen thermal model)
Physical limits – Reformatter.finalize() applies range limits, converts SWC units, standardizes SSITC encoding, and produces a limit report
Manual corrections – field-day precipitation, G_PLATE zeros, SoilVue spikes, wind direction offsets, sensor-specific spike removal
Signal-strength flags – H2O/CO2 signal flags (0/1/2) and wind direction obstruction flags
Gap-fill G – linear regression between redundant G sources to impute missing values

Output: {station}_{daterange}_qc.parquet + limit report CSV
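
The regression gap-fill for G can be sketched as follows. This is a minimal example with made-up values, assuming two redundant G series on the same index; `gap_fill_by_regression` is a hypothetical helper, not a micromet function.

```python
import numpy as np
import pandas as pd
from scipy import stats


def gap_fill_by_regression(target, predictor):
    """Fill NaNs in target from predictor using an ordinary least-squares fit."""
    both = target.notna() & predictor.notna()
    fit = stats.linregress(predictor[both], target[both])
    estimate = fit.slope * predictor + fit.intercept
    # Only missing target values are replaced; measured values are untouched.
    return target.fillna(estimate), fit


index = pd.date_range("2024-06-01", periods=5, freq="30min")
g_soilvue = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], index=index)
g_plate = 2.0 * g_soilvue + 1.0
g_plate.iloc[2] = np.nan  # simulated sensor dropout

g_filled, fit = gap_fill_by_regression(g_plate, g_soilvue)
```

Checking `fit.rvalue` before accepting the fill is a reasonable safeguard, since a weak relationship between the two sensors makes the imputed values unreliable.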

Review: 3a & 3b

-> 3a details – -> 3b details

Run after Step 3 to evaluate data quality. Issues found here are resolved by adding correction blocks in Notebook 3 and re-running Steps 3–5.

3a: Summary statistics, data availability, and outlier detection for each variable.

3b: Interactive Plotly time series for every column – a rapid visual sweep for spikes, gaps, or artifacts.

Step 4 – AmeriFlux Export

-> Full details

Converts the QC dataset into an AmeriFlux-compliant half-hourly CSV:

IRGA-derived variables set to NaN where signal strength < 0.8
Non-AmeriFlux columns dropped; all-NaN columns removed
NaN replaced with -9999; timestamps formatted as YYYYMMDDHHmm

Output: {station}_HH_{start}_{end}.csv
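
The three export rules above can be sketched in a few lines of pandas. The column lists and the signal-strength column name are hypothetical placeholders, and the example assumes the index holds the end of each half-hour period.

```python
import numpy as np
import pandas as pd

# Hypothetical column lists; the real selections are site-specific.
IRGA_COLUMNS = ["LE", "FC"]
AMERIFLUX_COLUMNS = ["LE", "FC", "TA", "WS"]


def to_ameriflux(qc, signal_column="H2O_SIG_STRGTH", threshold=0.8):
    """Apply the signal filter, drop extra columns, and format for export."""
    out = qc.copy()
    weak = out[signal_column] < threshold
    out.loc[weak, [c for c in IRGA_COLUMNS if c in out]] = np.nan
    # Keep only recognized columns, then remove any that are entirely empty.
    out = out[[c for c in AMERIFLUX_COLUMNS if c in out]].dropna(axis=1, how="all")
    end = out.index
    start = end - pd.Timedelta(minutes=30)
    out.insert(0, "TIMESTAMP_START", start.strftime("%Y%m%d%H%M").tolist())
    out.insert(1, "TIMESTAMP_END", end.strftime("%Y%m%d%H%M").tolist())
    return out.fillna(-9999).reset_index(drop=True)


index = pd.date_range("2024-06-01 00:30", periods=3, freq="30min")
qc = pd.DataFrame({"LE": [120.0, 80.0, 95.0],
                   "FC": [-2.0, -1.5, np.nan],
                   "TA": [20.0, 19.5, 19.0],
                   "H2O_SIG_STRGTH": [0.95, 0.6, 0.9],
                   "BATT_V": [12.5, 12.5, 12.4]}, index=index)
hh = to_ameriflux(qc)
# hh.to_csv(path, index=False) would then write the half-hourly file.
```

The filter runs before the missing-value substitution so that weak-signal records and true gaps both end up as -9999 in the exported file.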

Step 5 – Flux QAQC

-> Full details

Runs fluxdataqaqc for energy balance ratio (EBR) correction and ET gap-filling:

Gap-fill redundant NETRAD and G sensors via linear regression
EBR correction applied to LE; ET gap-filled using ETrF × gridMET ETr, where ETrF is the ratio of measured ET to reference ET (ETr)
Data subset by year and season for analysis
Sensitivity runs with different Rn/G input combinations

Outputs: EBR-corrected daily ET, HTML diagnostic reports, optional daily CSV
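
The arithmetic behind the EBR correction and the ETrF gap-fill is shown below in simplified form. This is not the fluxdataqaqc algorithm, which applies its own filtering and smoothing to the ratio; the sketch corrects each day independently, scales H alongside LE for illustration, and assumes a constant latent heat of vaporization. Function and column names are hypothetical.

```python
import numpy as np
import pandas as pd

LATENT_HEAT = 2.45e6  # J kg-1, a commonly used constant (assumption)
SECONDS_PER_DAY = 86400


def ebr_correct(daily):
    """Scale daily turbulent fluxes so they sum to the available energy."""
    out = daily.copy()
    out["EBR"] = (daily["H"] + daily["LE"]) / (daily["NETRAD"] - daily["G"])
    out["LE_corr"] = daily["LE"] / out["EBR"]
    out["H_corr"] = daily["H"] / out["EBR"]
    # Daily mean W m-2 -> mm day-1 of water (1 kg m-2 equals 1 mm depth).
    out["ET_corr"] = out["LE_corr"] * SECONDS_PER_DAY / LATENT_HEAT
    return out


def gap_fill_et(et, etr):
    """Fill missing ET as interpolated ETrF times reference ET."""
    etrf = (et / etr).interpolate(limit_direction="both")
    return et.fillna(etrf * etr)


days = pd.date_range("2024-06-01", periods=3, freq="D")
daily = pd.DataFrame({"NETRAD": [150.0, 160.0, 155.0],
                      "G": [10.0, 12.0, 11.0],
                      "H": [40.0, 45.0, 42.0],
                      "LE": [72.0, 80.0, 75.0]}, index=days)
corrected = ebr_correct(daily)

et = pd.Series([3.0, np.nan, 4.0], index=days)
etr = pd.Series([6.0, 7.0, 8.0], index=days)  # stand-in for gridMET ETr
et_filled = gap_fill_et(et, etr)
```

On the first example day the turbulent fluxes total 112 W m-2 against 140 W m-2 of available energy, so the ratio is 0.8 and the corrected fluxes sum to 140 W m-2.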

Key Libraries

Library	Role
micromet	Core pipeline: `Reformatter`, `validate`, `merge`, `data_cleaning`, `fix_g_values`, `timestamps`, `columns`, `eddy_plots`
soil_heat	SoilVue-derived ground heat flux (Johansen model)
fluxdataqaqc	EBR correction, ET gap-fill (`Data`, `QaQc`, `Plot`)
pandas / numpy	Data wrangling and array operations
scipy	Cross-correlation and linear regression
plotly / bokeh	Interactive diagnostics and HTML reports

Directory Structure

M:/Shared drives/UGS_Flux/
├── Data_Downloads/compiled/
│   ├── preprocessed_site_data/   ← preprocessed parquets
│   └── {stationid}/              ← raw .dat source files
└── Data_Processing/final_database_tables/
    ├── raw/          *_raw.parquet
    ├── qc/           *_qc.parquet
    └── ameriflux/    *_HH_*.csv
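
A small helper can keep file locations consistent with the layout above. The sketch below is hypothetical: `stage_path` is not part of the pipeline, the station code and date strings are placeholders, and the date format used in real file names is an assumption.

```python
from pathlib import Path

ROOT = Path("M:/Shared drives/UGS_Flux")
TABLES = ROOT / "Data_Processing" / "final_database_tables"

# Directory for each parquet-producing stage, taken from the tree above.
STAGE_DIRS = {
    "preprocessed": ROOT / "Data_Downloads" / "compiled" / "preprocessed_site_data",
    "raw": TABLES / "raw",
    "qc": TABLES / "qc",
}


def stage_path(stage, station, start, end):
    """Build the parquet path for one stage of one station's record."""
    return STAGE_DIRS[stage] / f"{station}_{start}_{end}_{stage}.parquet"


# Placeholder station code and dates, for illustration only.
raw_file = stage_path("raw", "US-XYZ", "20240101", "20241231")
```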

Adapting to Other Sites

Copy the notebooks and update:

station, interval, date_range – station code, measurement interval, and date bounds
Calibration correction dates and factors (Notebook 3)
Sensor failure date ranges and affected variables
Wind direction offsets between instruments
Signal-strength bad-period date ranges
Column selection for eddy merging (Notebooks 1/2) and AmeriFlux export (Notebook 4)
.ini config for fluxdataqaqc (Notebook 5)
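
One way to keep these site-specific values together is a single settings object that the correction code reads from. The sketch below is an assumption about how that could look, not how the notebooks are organized; `SiteSettings`, `apply_site_settings`, and the column names are hypothetical.

```python
from dataclasses import dataclass, field

import numpy as np
import pandas as pd


@dataclass
class SiteSettings:
    station: str
    interval: int                  # measurement interval in minutes
    date_range: tuple              # (start, end) bounds of the record
    wind_dir_offset: float = 0.0   # degrees added to the WD column
    # Each entry: (start, end, [columns]) to blank during a sensor failure.
    sensor_failures: list = field(default_factory=list)


def apply_site_settings(df, cfg):
    """Trim to the date bounds, blank failed sensors, rotate wind direction."""
    out = df.loc[cfg.date_range[0]:cfg.date_range[1]].copy()
    for start, end, columns in cfg.sensor_failures:
        out.loc[start:end, columns] = np.nan
    if "WD" in out:
        out["WD"] = (out["WD"] + cfg.wind_dir_offset) % 360
    return out


index = pd.date_range("2024-06-01", periods=4, freq="30min")
data = pd.DataFrame({"WD": [350.0, 355.0, 5.0, 10.0],
                     "TA": [20.0, 21.0, 22.0, 23.0]}, index=index)
cfg = SiteSettings(
    station="US-XYZ",  # placeholder station code
    interval=30,
    date_range=("2024-06-01", "2024-06-01"),
    wind_dir_offset=10.0,
    sensor_failures=[("2024-06-01 00:30", "2024-06-01 01:00", ["TA"])],
)
site = apply_site_settings(data, cfg)
```

Keeping the date ranges in data rather than scattered through notebook cells makes it easier to see what changes when the notebooks are copied to a new site.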

See the full workflow document for detailed guidance.