micromet.format.transformers package

Submodules

micromet.format.transformers.cleanup module

Column cleanup and type conversion functions for the reformatter pipeline.

This module handles dropping unwanted columns, setting proper data types, and filtering soil-related columns.

micromet.format.transformers.cleanup.drop_extra_soil_columns(df, config, logger)[source]

Drop redundant or unused soil-related columns from the DataFrame.

This function identifies and removes soil-related columns that are considered extra or redundant based on the provided configuration.

Parameters:

df (DataFrame) – The input DataFrame with soil-related columns.
config (dict) – The configuration dictionary containing lists of columns to drop.
logger (Logger) – The logger for tracking the column dropping process.

Returns:

The DataFrame with extra soil columns removed.

Return type:

DataFrame

micromet.format.transformers.cleanup.drop_extras(df, config)[source]

Drop extra or unwanted columns from the DataFrame based on configuration.

This function removes columns from the DataFrame that are listed in the ‘drop_cols’ section of the configuration dictionary.

Parameters:

df (DataFrame) – The input DataFrame.
config (dict) – The configuration dictionary containing the list of columns to drop.

Returns:

The DataFrame with the specified columns removed.

Return type:

DataFrame
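
The behaviour described above can be sketched in a few lines. This is a minimal illustration, not the package's implementation: the function name drop_listed_columns is hypothetical, and it assumes that ‘drop_cols’ is a flat list of column names and that names absent from the DataFrame are silently skipped.

```python
import pandas as pd


def drop_listed_columns(df: pd.DataFrame, config: dict) -> pd.DataFrame:
    """Hypothetical stand-in for drop_extras (assumed behaviour)."""
    # Assumption: config["drop_cols"] is a flat list of column names.
    to_drop = [c for c in config.get("drop_cols", []) if c in df.columns]
    # DataFrame.drop returns a new object, so the input is left untouched.
    return df.drop(columns=to_drop)


df = pd.DataFrame({"TA_1_1_1": [21.5], "RECORD": [1], "BattV": [12.9]})
config = {"drop_cols": ["RECORD", "BattV", "NotPresent"]}
cleaned = drop_listed_columns(df, config)
print(list(cleaned.columns))
```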

micromet.format.transformers.cleanup.process_and_match_columns(df_full, amflux)[source]

Clean the column names of df_full by removing ‘_1’, ‘_2’, ‘_3’, and ‘_4’ suffixes, compare the cleaned names against an ‘amflux’ variable list, and return a DataFrame of the results. Unmatched columns are also printed.

Parameters:

df_full (DataFrame) – The DataFrame whose columns need to be cleaned and matched.
amflux (DataFrame or Series) – A DataFrame that contains the ‘Variable’ column, or the Series of variables to match against.

Returns:

A DataFrame containing the original columns, the cleaned columns, and a boolean indicating whether the cleaned column is in the amflux list.

Return type:

DataFrame
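
The matching step can be illustrated without pandas. The sketch below is an assumption-laden stand-in (the name match_columns and the output keys are hypothetical): it assumes the suffixes are stripped repeatedly from the end of the name, so that ‘TA_1_1_1’ becomes ‘TA’, and it returns a list of dictionaries rather than a DataFrame.

```python
import re

# Assumption: one or more trailing _1 .. _4 groups are removed from each name.
_SUFFIX = re.compile(r"(_[1-4])+$")


def match_columns(columns, variables):
    """Hypothetical helper: clean names and flag whether each one is known."""
    allowed = set(variables)
    rows = []
    for original in columns:
        cleaned = _SUFFIX.sub("", original)
        rows.append(
            {"original": original, "cleaned": cleaned, "matched": cleaned in allowed}
        )
    return rows


rows = match_columns(["TA_1_1_1", "SWC_2", "FOO_1", "NETRAD"], ["TA", "SWC", "NETRAD"])
unmatched = [r["original"] for r in rows if not r["matched"]]
print(unmatched)
```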

micromet.format.transformers.cleanup.set_number_types(df, logger)[source]

Convert columns in a DataFrame to the appropriate numeric types.

This function iterates through the columns of a DataFrame and converts them to numeric types (integer or float) where appropriate. It handles special cases for certain columns and logs warnings for duplicate columns.

Parameters:

df (DataFrame) – The input DataFrame.
logger (Logger) – The logger for tracking the type conversion process.

Returns:

The DataFrame with columns converted to numeric types.

Return type:

DataFrame

micromet.format.transformers.columns module

Column naming and organization functions for the reformatter pipeline.

This module handles column renaming, prefix normalization, legacy format updates, and column ordering operations.

micromet.format.transformers.columns.col_order(df, logger)[source]

Reorder DataFrame columns to place priority columns at the beginning.

This function moves specified columns (‘TIMESTAMP_END’, ‘TIMESTAMP_START’) to the front of the DataFrame for better readability and consistency.

Parameters:

df (DataFrame) – The input DataFrame.
logger (Logger) – The logger for tracking the reordering process.

Returns:

The DataFrame with columns reordered.

Return type:

DataFrame

micromet.format.transformers.columns.make_unique(cols)[source]

Make a list of column names unique by appending numeric suffixes to duplicates.

This function takes a list of column names and ensures that all names are unique by appending a numeric suffix (e.g., ‘.1’, ‘.2’) to any duplicate names.

Parameters: cols (list) – A list of column names.
Returns: A list of unique column names.
Return type: list
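
One plausible way to produce such suffixes is to count how often each name has already been seen. The sketch below is illustrative only (make_unique_names is a hypothetical name); it assumes the first occurrence keeps its original name and does not guard against a generated name such as ‘A.1’ colliding with a column that already carries that name.

```python
def make_unique_names(cols):
    """Hypothetical sketch: append .1, .2, ... to repeated names."""
    seen = {}
    unique = []
    for name in cols:
        if name in seen:
            # Repeat: bump the counter and use it as the suffix.
            seen[name] += 1
            unique.append(f"{name}.{seen[name]}")
        else:
            # First occurrence keeps its name unchanged.
            seen[name] = 0
            unique.append(name)
    return unique


names = make_unique_names(["A", "B", "A", "A", "B"])
print(names)  # ['A', 'B', 'A.1', 'A.2', 'B.1']
```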

micromet.format.transformers.columns.make_unique_cols(df)[source]

Ensure that all column names in a DataFrame are unique.

This function uses the make_unique helper function to append numeric suffixes to any duplicate column names, ensuring that every column has a unique identifier.

Parameters: df (DataFrame) – The input DataFrame.
Returns: A copy of the DataFrame with unique column names.
Return type: DataFrame

micromet.format.transformers.columns.modernize_soil_legacy(df, logger)[source]

Update legacy soil sensor column names to a standardized format.

This function identifies and renames legacy soil sensor columns to a modern, standardized format based on predefined mapping rules for depth and orientation.

Parameters:

df (DataFrame) – The input DataFrame with legacy soil sensor column names.
logger (Logger) – The logger for tracking the modernization process.

Returns:

The DataFrame with updated soil sensor column names.

Return type:

DataFrame

micromet.format.transformers.columns.normalize_prefixes(df, logger)[source]

Normalize column name prefixes for soil and temperature measurements.

This function standardizes column name prefixes by renaming them based on a set of predefined patterns. For example, it can change ‘BulkEC_’ to ‘EC_’.

Parameters:

df (DataFrame) – The input DataFrame with columns to be normalized.
logger (Logger) – The logger for tracking the normalization process.

Returns:

The DataFrame with normalized column name prefixes.

Return type:

DataFrame

micromet.format.transformers.columns.rename_columns(df, data_type, config, logger)[source]

Rename DataFrame columns based on configuration and standardize their names.

This function renames columns using a predefined mapping from the configuration, normalizes soil and temperature-related prefixes, and converts all column names to uppercase.

Parameters:

df (DataFrame) – The input DataFrame with columns to be renamed.
data_type (str) – The type of data (‘eddy’ or ‘met’), which determines which renaming map to use.
config (dict) – The configuration dictionary containing the renaming maps.
logger (Logger) – The logger for tracking the renaming process.

Returns:

The DataFrame with renamed and standardized column names.

Return type:

DataFrame

micromet.format.transformers.corrections module

Data correction functions for the reformatter pipeline.

This module contains variable-specific corrections and data value fixes, including handling special values, unit conversions, and merging duplicate columns.

micromet.format.transformers.corrections.apply_fixes(df, logger)[source]

Apply a set of minor, variable-specific data corrections.

This function serves as a pipeline for applying several small, targeted fixes to the data, such as correcting ‘TAU’ values, converting soil water content to percent, and scaling SSITC test values.

Parameters:

df (DataFrame) – The input DataFrame to be fixed.
logger (Logger) – The logger for tracking the fixes being applied.

Returns:

The DataFrame with all fixes applied.

Return type:

DataFrame

micromet.format.transformers.corrections.fill_na_drop_dups(df)[source]

Merge any number of duplicate columns with numeric suffixes (.1, .2, …), treating -9999 as missing, and drop redundant duplicates.

This function groups columns by their base name (the part before a trailing .<number> suffix). For each group, it merges values across the base column (if present) and all suffixed duplicates by preferring the first non-missing value at each row. During merging, the sentinel value -9999 is treated as missing (converted to NaN). After merging, remaining missing values are filled back with -9999 and all duplicate suffixed columns are dropped, preserving the base column as the canonical result.

Parameters: df (DataFrame) – Input DataFrame that may contain duplicate columns named with numeric suffixes (e.g., "A.1", "A.2", …). The unsuffixed base column (e.g., "A") is optional. Sentinel missing values are expected to be encoded as -9999.
Returns: A new DataFrame where, for each base column, all suffixed duplicates have been merged into the base column and the duplicates removed. Any remaining missing values are filled with -9999.
Return type: DataFrame

Notes

Columns are grouped by the regex pattern r"^(?P<base>.+?)\.(?P<idx>\d+)$". Columns not matching this pattern are treated as base columns.
Merge precedence follows ascending numeric suffix order, with the base column (if present) considered first.
The input DataFrame is not modified in place; a copy is returned.

Examples

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({
...     "A":   [1, -9999, 3, -9999],
...     "A.1": [np.nan,  2,   -9999, 4],
...     "A.2": [-9999,   9,   np.nan, -9999],
...     "B.1": [10, -9999, np.nan, 13],   # no base 'B' column present
...     "B.3": [np.nan, 11, 12, -9999]
... })
>>> fill_na_drop_dups(df)
     A     B
0    1  10.0
1    2  11.0
2    3  12.0
3    4  13.0

micromet.format.transformers.corrections.fix_swc_percent(df, logger)[source]

Convert fractional soil water content (SWC) values to percentages.

This function checks soil water content columns (those starting with ‘SWC_’) and, if the values appear to be fractional (<= 1.5), multiplies them by 100 to convert them to percentages.

Parameters:

df (DataFrame) – The input DataFrame with SWC columns.
logger (Logger) – The logger for tracking the conversion process.

Returns:

The DataFrame with SWC values converted to percentages where applicable.

Return type:

DataFrame
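
A rough sketch of this check follows. It is not the package's code: swc_to_percent is a hypothetical name, the test "appears to be fractional" is assumed to mean that the column maximum is at most 1.5, and handling of the -9999 sentinel is deliberately left out.

```python
import pandas as pd


def swc_to_percent(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: scale fractional SWC_ columns to percent."""
    out = df.copy()
    for col in out.columns:
        if not col.startswith("SWC_"):
            continue
        col_max = out[col].max(skipna=True)
        # Assumption: a column whose maximum is <= 1.5 holds fractions.
        if pd.notna(col_max) and col_max <= 1.5:
            out[col] = out[col] * 100
    return out


df = pd.DataFrame(
    {
        "SWC_1_1_1": [0.21, 0.35],  # fractional, will be scaled
        "SWC_2_1_1": [21.0, 35.0],  # already in percent, left alone
        "TA_1_1_1": [0.5, 0.9],     # not an SWC column, left alone
    }
)
result = swc_to_percent(df)
print(result)
```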

micromet.format.transformers.corrections.rating(x)[source]

Categorize a numeric value into a discrete rating level (0, 1, or 2).

This function categorizes a numeric value into one of three levels:

- 0 for values between 0 and 3.
- 1 for values between 4 and 6.
- 2 for all other values.

Parameters: x (numeric or None) – The input value to be rated.
Returns: The rating level (0, 1, or 2).
Return type: int
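
The three levels can be written as a small function. In this sketch (rate is a hypothetical name) the ranges are assumed to be inclusive, and None, NaN, and values falling between 3 and 4 are assumed to land in the "all other values" bucket; the documentation does not state how the real function treats those cases.

```python
def rate(x):
    """Hypothetical sketch of the three-level rating."""
    if x is None:
        return 2  # assumption: None counts as "all other values"
    if 0 <= x <= 3:
        return 0
    if 4 <= x <= 6:
        return 1
    return 2  # negatives, values above 6, NaN, and the 3-4 gap


ratings = [rate(v) for v in (0, 3, 5, 9, None)]
print(ratings)  # [0, 0, 1, 2, 2]
```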

micromet.format.transformers.corrections.scale_and_convert(column)[source]

Apply a rating transformation and convert the column to float type.

This function applies a ‘rating’ function to each element of the Series and then converts the entire Series to float.

Parameters: column (Series) – The input Series to be transformed.
Returns: The transformed and converted Series.
Return type: Series

micromet.format.transformers.corrections.ssitc_scale(df, logger)[source]

Scale SSITC (Signal Strength and Integrity Test) columns.

This function checks specific SSITC columns and, if their values exceed a certain threshold (3), applies a scaling and rating transformation to them.

Parameters:

df (DataFrame) – The input DataFrame with SSITC columns.
logger (Logger) – The logger for tracking the scaling process.

Returns:

The DataFrame with SSITC columns scaled where applicable.

Return type:

DataFrame

micromet.format.transformers.corrections.tau_fixer(df, threshold=0.5, logger=None)[source]

Replace zero values in the ‘TAU’ columns with NaN and flip the sign if needed.

The function loops through all columns with TAU in the name that do not also have SSITC or QC in the name.

For each such column, it replaces zero and negative-infinity values with NaN. This is often done to handle cases where zero represents a missing or invalid measurement.

The function also determines whether to reverse the sign of TAU. If more than the specified threshold of TAU values are positive, it flips the sign of all TAU values.

Parameters:

df (DataFrame) – The input DataFrame with one or more ‘TAU’ columns.
threshold (float) – The fraction of positive TAU values above which the sign is flipped. Defaults to 0.5.
logger (Logger, optional) – The logger for tracking the corrections. Defaults to None.

Returns:

The DataFrame with zero values in ‘TAU’ replaced by NaN and the sign flipped where applicable.

Return type:

DataFrame
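
The logic described above might look roughly like the following. This is a sketch under stated assumptions rather than the actual implementation: fix_tau is a hypothetical name, the threshold is read as a fraction of the non-missing values, and logging is omitted.

```python
import numpy as np
import pandas as pd


def fix_tau(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Hypothetical sketch of the TAU clean-up and sign check."""
    out = df.copy()
    tau_cols = [
        c for c in out.columns if "TAU" in c and "SSITC" not in c and "QC" not in c
    ]
    for col in tau_cols:
        # Zeros and -inf are treated as missing.
        series = out[col].astype(float).replace([0.0, -np.inf], np.nan)
        valid = series.dropna()
        # Assumption: flip when the positive share of valid values exceeds threshold.
        if len(valid) > 0 and (valid > 0).mean() > threshold:
            series = -series
        out[col] = series
    return out


df = pd.DataFrame({"TAU": [0.0, 0.2, 0.3, -0.1], "TAU_SSITC_TEST": [0, 1, 2, 0]})
fixed = fix_tau(df)
print(fixed)
```

Here two of the three valid TAU values are positive, so the column is negated; the SSITC column is skipped entirely.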

micromet.format.transformers.interval_updates module

This module contains a dictionary of the datetimes at which the sampling frequency was updated from 30 minutes to 60 minutes for eddy data (first item in the list) and met data (second item in the list).

It also contains a function that subsets a DataFrame with a MultiIndex of STATIONID and DATETIME_END so that it includes only data from before or after the interval switch.

micromet.format.transformers.interval_updates.subset_interval(df, date_dict, interval, data_type)[source]

Subset a MultiIndex DataFrame based on station ID, a date cutoff, and a data type, using a single vectorized boolean mask.

Parameters:

df (DataFrame) – MultiIndex DataFrame with levels ‘STATIONID’ and ‘DATETIME_END’.
date_dict (dict) – Dictionary whose keys are ‘STATIONID’ values and whose values are lists of two date strings [date1, date2].
interval (int) – Condition for subsetting: 30 for dates <= cutoff, 60 for dates > cutoff.
data_type (str) – Determines which date to use as the cutoff: ‘eddy’ uses the first date (index 0); ‘met’ uses the second date (index 1).

Returns:

The subsetted DataFrame containing data from all relevant stations.

Return type:

DataFrame
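
A vectorized mask of this kind could be built by mapping each row's station to its cutoff and comparing the two index levels element-wise. The sketch below is one possible reading, not the package's code: subset_by_interval is a hypothetical name, the station IDs and dates are invented for illustration, and rows whose station is missing from the dictionary are assumed to be dropped.

```python
import pandas as pd


def subset_by_interval(df, date_dict, interval, data_type):
    """Hypothetical sketch: keep rows before or after each station's cutoff."""
    pos = 0 if data_type == "eddy" else 1
    stations = df.index.get_level_values("STATIONID")
    times = df.index.get_level_values("DATETIME_END")
    # Map every row's station to its cutoff; unknown stations become NaT.
    cutoffs = pd.to_datetime(stations.map({k: v[pos] for k, v in date_dict.items()}))
    # Comparisons against NaT are False, so unknown stations drop out.
    mask = (times <= cutoffs) if interval == 30 else (times > cutoffs)
    return df.loc[mask]


stamps = pd.to_datetime(["2024-01-01 00:30", "2024-01-01 01:00", "2024-01-01 02:00"])
index = pd.MultiIndex.from_product(
    [["US-AAA", "US-BBB"], stamps], names=["STATIONID", "DATETIME_END"]
)
df = pd.DataFrame({"TA": range(6)}, index=index)
date_dict = {  # invented cutoffs: [eddy, met]
    "US-AAA": ["2024-01-01 01:00", "2024-01-01 00:30"],
    "US-BBB": ["2024-01-01 00:30", "2024-01-01 01:00"],
}
before = subset_by_interval(df, date_dict, 30, "eddy")
after = subset_by_interval(df, date_dict, 60, "eddy")
print(before)
print(after)
```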

micromet.format.transformers.timestamp_update module

Various scripts for addressing timestamp issues in the data.

micromet.format.transformers.timestamp_update.process_by_interval(in_df, key, interval_dict, datatype)[source]

Use the interval_updates dictionary to identify when the data switched from 30- to 60-minute sampling, then process the data accordingly.

micromet.format.transformers.timestamp_update.resample_alternating_frequency_with_other(df, min_records_threshold=24)[source]

Identify contiguous blocks of data, resample the 30-minute and 60-minute blocks, and assign ‘OTHER’ to the timestep for unclassified (non-gap) blocks.

micromet.format.transformers.timestamp_update.resample_single_frequency_switch(df, sample_size=100)[source]

Resample a DataFrame based on a single detected frequency switch (30 min to 60 min). The function uses the mode of the first sample_size records (100 by default) to robustly determine the initial frequency, handling minor clock jitter and occasional gaps.

Parameters:

df (DataFrame) – DataFrame with a DatetimeIndex.
sample_size (int) – The number of initial records to analyze when determining the starting frequency. Defaults to 100.

Returns:

The resampled DataFrame with a ‘timestep’ column.

Return type:

DataFrame

micromet.format.transformers.timestamps module

Timestamp transformation functions for the reformatter pipeline.

This module handles all datetime-related operations including timestamp detection, conversion, resampling, and formatting.

micromet.format.transformers.timestamps.add_ameriflux_timestamps(df, interval_minutes=30)[source]

Create TIMESTAMP_START and TIMESTAMP_END columns from a DatetimeIndex in the YYYYMMDDHHmm format required by AmeriFlux.
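
The YYYYMMDDHHmm format corresponds to the strftime pattern "%Y%m%d%H%M". The sketch below formats a single timestamp pair; ameriflux_stamps is a hypothetical helper, and it assumes that the index value marks the end of the averaging period, which the documentation does not state explicitly.

```python
from datetime import datetime, timedelta

AMERIFLUX_FORMAT = "%Y%m%d%H%M"  # YYYYMMDDHHmm


def ameriflux_stamps(period_end: datetime, interval_minutes: int = 30):
    """Hypothetical helper: return (TIMESTAMP_START, TIMESTAMP_END) strings."""
    # Assumption: the given timestamp is the END of the averaging period.
    period_start = period_end - timedelta(minutes=interval_minutes)
    return (
        period_start.strftime(AMERIFLUX_FORMAT),
        period_end.strftime(AMERIFLUX_FORMAT),
    )


start, end = ameriflux_stamps(datetime(2024, 1, 1, 0, 0))
print(start, end)  # 202312312330 202401010000
```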

micromet.format.transformers.timestamps.fix_timestamps(df, logger)[source]

Convert the timestamp column to datetime objects and handle missing values.

This function identifies the timestamp column, converts it to datetime objects, and removes any rows where the timestamp could not be parsed.

Parameters:

df (DataFrame) – The input DataFrame with a timestamp column.
logger (Logger) – The logger for tracking progress and warnings.

Returns:

The DataFrame with a ‘DATETIME_END’ column of datetime objects.

Return type:

DataFrame

micromet.format.transformers.timestamps.infer_datetime_col(df, logger)[source]

Infer the name of the timestamp column in a DataFrame.

This function searches for a timestamp column in the DataFrame by checking a list of common names (e.g., ‘TIMESTAMP_END’). If a matching column is found, its name is returned. Otherwise, it logs a warning and returns the name of the first column.

Parameters:

df (DataFrame) – The DataFrame to search for a timestamp column.
logger (Logger) – The logger to use for warning messages.

Returns:

The name of the timestamp column if found, otherwise the name of the first column.

Return type:

str | None
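
The lookup-with-fallback behaviour can be sketched on a plain list of column names. Everything here apart from ‘TIMESTAMP_END’ is an assumption: guess_datetime_column is a hypothetical name, the other candidate names are invented, and returning None for an empty column list is only a guess at why the return type includes None.

```python
import logging

# Only 'TIMESTAMP_END' comes from the documentation; the rest are invented.
CANDIDATES = ("TIMESTAMP_END", "TIMESTAMP", "DATETIME_END")


def guess_datetime_column(columns, logger):
    """Hypothetical sketch: pick a known timestamp name or fall back."""
    for name in CANDIDATES:
        if name in columns:
            return name
    if not columns:
        return None  # assumption: nothing to fall back to
    logger.warning("No known timestamp column found; using %r", columns[0])
    return columns[0]


log = logging.getLogger("sketch")
found = guess_datetime_column(["RECORD", "TIMESTAMP_END", "TA"], log)
fallback = guess_datetime_column(["Date", "TA"], log)
print(found, fallback)
```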

micromet.format.transformers.timestamps.resample_timestamps(df, interval, logger)[source]

Resample a DataFrame to 30- or 60-minute intervals.

This function resamples the DataFrame to a fixed 30- or 60-minute frequency based on the ‘DATETIME_END’ column. It also handles duplicate timestamps by selecting the first available value.

Parameters:

df (DataFrame) – The input DataFrame with a ‘DATETIME_END’ column.
interval (int) – The resampling interval in minutes (30 or 60).
logger (Logger) – The logger for tracking progress.

Returns:

The resampled DataFrame with a 30- or 60-minute frequency index.

Return type:

DataFrame
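
With pandas, this kind of resampling is commonly done by indexing on the timestamp column and taking the first value in each bin. The sketch below is an approximation, not the package's code: resample_fixed is a hypothetical name, and "first available value" is interpreted as the first non-missing value per bin, which is what Resampler.first returns.

```python
import pandas as pd


def resample_fixed(df: pd.DataFrame, interval: int) -> pd.DataFrame:
    """Hypothetical sketch: regular grid, first non-missing value per bin."""
    indexed = df.set_index("DATETIME_END").sort_index(kind="mergesort")
    # Duplicate timestamps share a bin, so .first() keeps the earliest value.
    return indexed.resample(f"{interval}min").first()


df = pd.DataFrame(
    {
        "DATETIME_END": pd.to_datetime(
            ["2024-01-01 00:30", "2024-01-01 00:30", "2024-01-01 01:30"]
        ),
        "TA": [1.0, 99.0, 3.0],
    }
)
resampled = resample_fixed(df, 30)
print(resampled)
```

The duplicate 00:30 row collapses to its first value, and the missing 01:00 step appears as a NaN row.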

micromet.format.transformers.timestamps.timestamp_reset(df, minutes=30)[source]

Reset TIMESTAMP_START and TIMESTAMP_END columns based on the DataFrame index.

This function generates new ‘TIMESTAMP_START’ and ‘TIMESTAMP_END’ columns based on the DataFrame’s datetime index. ‘TIMESTAMP_START’ is calculated by subtracting the specified number of minutes from ‘TIMESTAMP_END’.

Parameters:

df (DataFrame) – The input DataFrame with a datetime index.
minutes (int) – The interval in minutes between ‘TIMESTAMP_START’ and ‘TIMESTAMP_END’. Defaults to 30.

Returns:

The DataFrame with updated ‘TIMESTAMP_START’ and ‘TIMESTAMP_END’ columns.

Return type:

DataFrame

micromet.format.transformers.validation module

Data validation and quality control functions for the reformatter pipeline.

This module handles applying physical limits to data values and detecting stuck or anomalous sensor readings.

micromet.format.transformers.validation.apply_physical_limits(df, how='mask', inplace=False, prefer_longest_key=True, return_mask=False, round_et=True)[source]

Apply physical Min/Max bounds to columns in a DataFrame.

This function applies physical limits (minimum and maximum) to the columns of a DataFrame. It can either mask out-of-bounds values with NaN or clip them to the limits.

Parameters:

df (DataFrame) – The input DataFrame to which the limits will be applied.
how (str) – The method to use for applying limits: ‘mask’ (default) or ‘clip’.
inplace (bool) – If True, modify the DataFrame in place. Defaults to False.
prefer_longest_key (bool) – If True, prefer longer matching keys from the limits dictionary. Defaults to True.
return_mask (bool) – If True, return a boolean mask of the values that were flagged. Defaults to False.
round_et (bool) – If True, ET values below 0 will be rounded to 1 digit before applying variable limits. Defaults to True.

Returns:

A tuple containing:

- The DataFrame with physical limits applied.
- A boolean mask of flagged values (if return_mask is True).
- A report summarizing the number of flagged values for each column.

Return type:

tuple[DataFrame, DataFrame | None, DataFrame]
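
The difference between masking and clipping is easiest to see on a toy table. The sketch below is illustrative only: limit_values and the LIMITS dictionary are hypothetical, the bounds are invented, and limit keys are assumed to match columns by prefix with the longest matching key winning, as one reading of prefer_longest_key.

```python
import pandas as pd

# Invented bounds for illustration; the real limits table is not shown here.
LIMITS = {"TA": (-50.0, 60.0), "RH": (0.0, 100.0)}


def limit_values(df: pd.DataFrame, how: str = "mask"):
    """Hypothetical sketch: mask or clip out-of-range values and report counts."""
    out = df.copy()
    flagged = pd.DataFrame(False, index=df.index, columns=df.columns)
    for col in out.columns:
        keys = [k for k in LIMITS if col.startswith(k)]
        if not keys:
            continue
        low, high = LIMITS[max(keys, key=len)]  # longest matching key wins
        bad = (out[col] < low) | (out[col] > high)
        flagged[col] = bad
        if how == "clip":
            out[col] = out[col].clip(lower=low, upper=high)
        else:
            out[col] = out[col].mask(bad)  # out-of-range values become NaN
    report = flagged.sum().rename("n_flagged").to_frame()
    return out, flagged, report


df = pd.DataFrame({"TA_1_1_1": [20.0, 75.0], "RH_1_1_1": [-5.0, 50.0]})
masked, flags, report = limit_values(df, how="mask")
clipped, _, _ = limit_values(df, how="clip")
print(masked, clipped, report, sep="\n")
```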

micromet.format.transformers.validation.mask_stuck_values(df, threshold, columns=None, tolerance=None, mask_value=nan, return_mask=False)[source]

Detect and mask ‘stuck’ values in a datetime-indexed DataFrame.

A run is considered ‘stuck’ when the series does not change (within an optional numeric tolerance) for at least threshold. Threshold can be a count of rows (int) or a time duration (str like ‘30min’ / ‘2H’ or pd.Timedelta).

Parameters:

df (DataFrame) – DataFrame with a DatetimeIndex (required).
threshold (Union[int, str, Timedelta]) – Minimum length of a non-changing run to be masked. If int, a count of consecutive rows (e.g., 5). If str or Timedelta, a minimum duration (e.g., ‘30min’, pd.Timedelta(‘2H’)).
columns (Optional[Iterable[str]]) – Subset of columns to check. Defaults to all columns.
tolerance (Optional[float]) – For numeric columns only: treat changes with absolute difference <= tolerance as ‘no change’. If None, exact equality is used.
mask_value (any, default np.nan) – Value to assign to masked entries.
return_mask (bool) – If True, also return a boolean DataFrame mask where True marks masked cells.

Return type:

Union[Tuple[DataFrame, DataFrame], Tuple[DataFrame, DataFrame, DataFrame]]

Returns:

masked_df (pd.DataFrame) – Copy of df with stuck runs masked.
report (pd.DataFrame) – Tidy report with one row per masked run, with columns [‘column’, ‘value’, ‘start’, ‘end’, ‘n_rows’, ‘duration’, ‘threshold_type’, ‘threshold_value’].
mask_df (pd.DataFrame (optional)) – Boolean DataFrame (same shape as df[columns]) with True where values were masked.

Notes

NaNs act as boundaries and are never considered part of a ‘stuck’ run.
For irregular time steps and time-based thresholds, the run ‘duration’ is computed as end_time - start_time (inclusive of row timestamps).
Entire runs that meet/exceed the threshold are masked (not just the tail beyond threshold).
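
The run-detection idea in these notes can be sketched for a single Series and a row-count threshold. This is a simplified illustration, not the library function: mask_stuck_rows is a hypothetical name, time-based thresholds and the report are omitted, and tolerance is assumed to be applied to the difference between consecutive rows.

```python
import numpy as np
import pandas as pd


def mask_stuck_rows(series: pd.Series, threshold: int, tolerance=None):
    """Hypothetical sketch: mask runs of >= threshold unchanged rows."""
    tol = 0.0 if tolerance is None else tolerance
    # A comparison involving NaN is False, so NaNs always break a run.
    same_as_prev = series.diff().abs() <= tol
    run_id = (~same_as_prev).cumsum()
    # Non-missing rows per run; a NaN forms its own run with count 0.
    run_len = series.groupby(run_id).transform("count")
    stuck = (run_len >= threshold) & series.notna()
    return series.mask(stuck), stuck


index = pd.date_range("2024-01-01", periods=8, freq="30min")
values = pd.Series([1.0, 2.0, 2.0, 2.0, 3.0, np.nan, 4.0, 4.0], index=index)
masked, stuck = mask_stuck_rows(values, threshold=3)
print(masked)
```

The three consecutive 2.0 readings are masked in full, while the pair of 4.0 readings after the NaN is too short to count as stuck.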

Module contents

Data transformation functions for the reformatter pipeline.

This package contains modular transformation functions organized by category:

- timestamps: Datetime handling and resampling
- columns: Column naming, renaming, and organization
- validation: Data quality checks and boundary enforcement
- corrections: Variable-specific data fixes
- cleanup: Column filtering and type setting

For backward compatibility, all functions are re-exported at the package level.

micromet.format.transformers.apply_fixes(df, logger)[source]

Apply a set of minor, variable-specific data corrections.

This function serves as a pipeline for applying several small, targeted fixes to the data, such as correcting ‘TAU’ values, converting soil water content to percent, and scaling SSITC test values.

Parameters:

df (DataFrame) – The input DataFrame to be fixed.
logger (Logger) – The logger for tracking the fixes being applied.

Returns:

The DataFrame with all fixes applied.

Return type:

DataFrame

micromet.format.transformers.apply_physical_limits(df, how='mask', inplace=False, prefer_longest_key=True, return_mask=False, round_et=True)[source]

Apply physical Min/Max bounds to columns in a DataFrame.

This function applies physical limits (minimum and maximum) to the columns of a DataFrame. It can either mask out-of-bounds values with NaN or clip them to the limits.

Parameters:

df (DataFrame) – The input DataFrame to which the limits will be applied.
how (str) – The method to use for applying limits: ‘mask’ (default) or ‘clip’.
inplace (bool) – If True, modify the DataFrame in place. Defaults to False.
prefer_longest_key (bool) – If True, prefer longer matching keys from the limits dictionary. Defaults to True.
return_mask (bool) – If True, return a boolean mask of the values that were flagged. Defaults to False.
round_et (bool) – If True, ET values below 0 will be rounded to 1 digit before applying variable limits. Defaults to True.

Returns:

A tuple containing:

- The DataFrame with physical limits applied.
- A boolean mask of flagged values (if return_mask is True).
- A report summarizing the number of flagged values for each column.

Return type:

tuple[DataFrame, DataFrame | None, DataFrame]

micromet.format.transformers.col_order(df, logger)[source]

Reorder DataFrame columns to place priority columns at the beginning.

This function moves specified columns (‘TIMESTAMP_END’, ‘TIMESTAMP_START’) to the front of the DataFrame for better readability and consistency.

Parameters:

df (DataFrame) – The input DataFrame.
logger (Logger) – The logger for tracking the reordering process.

Returns:

The DataFrame with columns reordered.

Return type:

DataFrame

micromet.format.transformers.drop_extra_soil_columns(df, config, logger)[source]

Drop redundant or unused soil-related columns from the DataFrame.

This function identifies and removes soil-related columns that are considered extra or redundant based on the provided configuration.

Parameters:

df (DataFrame) – The input DataFrame with soil-related columns.
config (dict) – The configuration dictionary containing lists of columns to drop.
logger (Logger) – The logger for tracking the column dropping process.

Returns:

The DataFrame with extra soil columns removed.

Return type:

DataFrame

micromet.format.transformers.drop_extras(df, config)[source]

Drop extra or unwanted columns from the DataFrame based on configuration.

This function removes columns from the DataFrame that are listed in the ‘drop_cols’ section of the configuration dictionary.

Parameters:

df (DataFrame) – The input DataFrame.
config (dict) – The configuration dictionary containing the list of columns to drop.

Returns:

The DataFrame with the specified columns removed.

Return type:

DataFrame

micromet.format.transformers.fill_na_drop_dups(df)[source]

Merge any number of duplicate columns with numeric suffixes (.1, .2, …), treating -9999 as missing, and drop redundant duplicates.

This function groups columns by their base name (the part before a trailing .<number> suffix). For each group, it merges values across the base column (if present) and all suffixed duplicates by preferring the first non-missing value at each row. During merging, the sentinel value -9999 is treated as missing (converted to NaN). After merging, remaining missing values are filled back with -9999 and all duplicate suffixed columns are dropped, preserving the base column as the canonical result.

Parameters: df (DataFrame) – Input DataFrame that may contain duplicate columns named with numeric suffixes (e.g., "A.1", "A.2", …). The unsuffixed base column (e.g., "A") is optional. Sentinel missing values are expected to be encoded as -9999.
Returns: A new DataFrame where, for each base column, all suffixed duplicates have been merged into the base column and the duplicates removed. Any remaining missing values are filled with -9999.
Return type: DataFrame

Notes

Columns are grouped by the regex pattern r"^(?P<base>.+?)\.(?P<idx>\d+)$". Columns not matching this pattern are treated as base columns.
Merge precedence follows ascending numeric suffix order, with the base column (if present) considered first.
The input DataFrame is not modified in place; a copy is returned.

Examples

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({
...     "A":   [1, -9999, 3, -9999],
...     "A.1": [np.nan,  2,   -9999, 4],
...     "A.2": [-9999,   9,   np.nan, -9999],
...     "B.1": [10, -9999, np.nan, 13],   # no base 'B' column present
...     "B.3": [np.nan, 11, 12, -9999]
... })
>>> fill_na_drop_dups(df)
     A     B
0    1  10.0
1    2  11.0
2    3  12.0
3    4  13.0

micromet.format.transformers.fix_swc_percent(df, logger)[source]

Convert fractional soil water content (SWC) values to percentages.

This function checks soil water content columns (those starting with ‘SWC_’) and, if the values appear to be fractional (<= 1.5), multiplies them by 100 to convert them to percentages.

Parameters:

df (DataFrame) – The input DataFrame with SWC columns.
logger (Logger) – The logger for tracking the conversion process.

Returns:

The DataFrame with SWC values converted to percentages where applicable.

Return type:

DataFrame

micromet.format.transformers.fix_timestamps(df, logger)[source]

Convert the timestamp column to datetime objects and handle missing values.

This function identifies the timestamp column, converts it to datetime objects, and removes any rows where the timestamp could not be parsed.

Parameters:

df (DataFrame) – The input DataFrame with a timestamp column.
logger (Logger) – The logger for tracking progress and warnings.

Returns:

The DataFrame with a ‘DATETIME_END’ column of datetime objects.

Return type:

DataFrame

micromet.format.transformers.infer_datetime_col(df, logger)[source]

Infer the name of the timestamp column in a DataFrame.

This function searches for a timestamp column in the DataFrame by checking a list of common names (e.g., ‘TIMESTAMP_END’). If a matching column is found, its name is returned. Otherwise, it logs a warning and returns the name of the first column.

Parameters:

df (DataFrame) – The DataFrame to search for a timestamp column.
logger (Logger) – The logger to use for warning messages.

Returns:

The name of the timestamp column if found, otherwise the name of the first column.

Return type:

str | None

micromet.format.transformers.make_unique(cols)[source]

Make a list of column names unique by appending numeric suffixes to duplicates.

This function takes a list of column names and ensures that all names are unique by appending a numeric suffix (e.g., ‘.1’, ‘.2’) to any duplicate names.

Parameters: cols (list) – A list of column names.
Returns: A list of unique column names.
Return type: list

micromet.format.transformers.make_unique_cols(df)[source]

Ensure that all column names in a DataFrame are unique.

This function uses the make_unique helper function to append numeric suffixes to any duplicate column names, ensuring that every column has a unique identifier.

Parameters: df (DataFrame) – The input DataFrame.
Returns: A copy of the DataFrame with unique column names.
Return type: DataFrame

micromet.format.transformers.mask_stuck_values(df, threshold, columns=None, tolerance=None, mask_value=nan, return_mask=False)[source]

Detect and mask ‘stuck’ values in a datetime-indexed DataFrame.

A run is considered ‘stuck’ when the series does not change (within an optional numeric tolerance) for at least threshold. Threshold can be a count of rows (int) or a time duration (str like ‘30min’ / ‘2H’ or pd.Timedelta).

Parameters:

df (DataFrame) – DataFrame with a DatetimeIndex (required).
threshold (Union[int, str, Timedelta]) – Minimum length of a non-changing run to be masked. If int, a count of consecutive rows (e.g., 5). If str or Timedelta, a minimum duration (e.g., ‘30min’, pd.Timedelta(‘2H’)).
columns (Optional[Iterable[str]]) – Subset of columns to check. Defaults to all columns.
tolerance (Optional[float]) – For numeric columns only: treat changes with absolute difference <= tolerance as ‘no change’. If None, exact equality is used.
mask_value (any, default np.nan) – Value to assign to masked entries.
return_mask (bool) – If True, also return a boolean DataFrame mask where True marks masked cells.

Return type:

Union[Tuple[DataFrame, DataFrame], Tuple[DataFrame, DataFrame, DataFrame]]

Returns:

masked_df (pd.DataFrame) – Copy of df with stuck runs masked.
report (pd.DataFrame) – Tidy report with one row per masked run, with columns [‘column’, ‘value’, ‘start’, ‘end’, ‘n_rows’, ‘duration’, ‘threshold_type’, ‘threshold_value’].
mask_df (pd.DataFrame (optional)) – Boolean DataFrame (same shape as df[columns]) with True where values were masked.

Notes

NaNs act as boundaries and are never considered part of a ‘stuck’ run.
For irregular time steps and time-based thresholds, the run ‘duration’ is computed as end_time - start_time (inclusive of row timestamps).
Entire runs that meet/exceed the threshold are masked (not just the tail beyond threshold).

micromet.format.transformers.modernize_soil_legacy(df, logger)[source]

Update legacy soil sensor column names to a standardized format.

This function identifies and renames legacy soil sensor columns to a modern, standardized format based on predefined mapping rules for depth and orientation.

Parameters:

df (DataFrame) – The input DataFrame with legacy soil sensor column names.
logger (Logger) – The logger for tracking the modernization process.

Returns:

The DataFrame with updated soil sensor column names.

Return type:

DataFrame

micromet.format.transformers.normalize_prefixes(df, logger)[source]

Normalize column name prefixes for soil and temperature measurements.

This function standardizes column name prefixes by renaming them based on a set of predefined patterns. For example, it can change ‘BulkEC_’ to ‘EC_’.

Parameters:

df (DataFrame) – The input DataFrame with columns to be normalized.
logger (Logger) – The logger for tracking the normalization process.

Returns:

The DataFrame with normalized column name prefixes.

Return type:

DataFrame

micromet.format.transformers.process_and_match_columns(df_full, amflux)[source]

Clean the column names of df_full by removing ‘_1’, ‘_2’, ‘_3’, and ‘_4’ suffixes, compare the cleaned names against an ‘amflux’ variable list, and return a DataFrame of the results. Unmatched columns are also printed.

Parameters:

df_full (DataFrame) – The DataFrame whose columns need to be cleaned and matched.
amflux (DataFrame or Series) – A DataFrame that contains the ‘Variable’ column, or the Series of variables to match against.

Returns:

A DataFrame containing the original columns, the cleaned columns, and a boolean indicating whether the cleaned column is in the amflux list.

Return type:

DataFrame

micromet.format.transformers.rating(x)[source]

Categorize a numeric value into a discrete rating level (0, 1, or 2).

This function categorizes a numeric value into one of three levels:

- 0 for values between 0 and 3.
- 1 for values between 4 and 6.
- 2 for all other values.

Parameters:

x (numeric or None) – The input value to be rated.

Returns:

The rating level (0, 1, or 2).

Return type:

int
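
The three levels can be sketched as below. Two points are assumptions rather than documented behavior: the range bounds are treated as inclusive, and None is counted among the "other values". The name `rating_sketch` is hypothetical.

```python
def rating_sketch(x):
    """Map a value to 0, 1, or 2 following the ranges described above."""
    if x is None:
        # Assumption: None is treated as one of the "other values".
        return 2
    if 0 <= x <= 3:
        return 0
    if 4 <= x <= 6:
        return 1
    # Everything else, including negatives and values above 6.
    return 2

levels = [rating_sketch(v) for v in (1, 5, 9, 3.5, None)]
```

Under the inclusive reading, a non-integer value such as 3.5 falls between the two ranges and is rated 2.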

micromet.format.transformers.rename_columns(df, data_type, config, logger)[source]

Rename DataFrame columns based on configuration and standardize their names.

This function renames columns using a predefined mapping from the configuration, normalizes soil and temperature-related prefixes, and converts all column names to uppercase.

Parameters:

df (DataFrame) – The input DataFrame with columns to be renamed.
data_type (str) – The type of data (‘eddy’ or ‘met’), which determines which renaming map to use.
config (dict) – The configuration dictionary containing the renaming maps.
logger (Logger) – The logger for tracking the renaming process.

Returns:

The DataFrame with renamed and standardized column names.

Return type:

DataFrame

micromet.format.transformers.resample_timestamps(df, interval, logger)[source]

Resample a DataFrame to 30- or 60-minute intervals.

This function resamples the DataFrame to a fixed 30- or 60-minute frequency based on the ‘DATETIME_END’ column. It also handles duplicate timestamps by selecting the first available value.

Parameters:

df (DataFrame) – The input DataFrame with a ‘DATETIME_END’ column.
interval (int) – The resampling interval in minutes (30 or 60).
logger (Logger) – The logger for tracking progress.

Returns:

The resampled DataFrame with a 30- or 60-minute frequency index.

Return type:

DataFrame
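
A minimal pandas sketch of this kind of resampling is shown below. It assumes duplicates are resolved by keeping the first row per ‘DATETIME_END’ and that empty intervals are left as NaN; the function name `resample_sketch` and the ‘TA’ demo column are hypothetical.

```python
import pandas as pd

def resample_sketch(df: pd.DataFrame, interval: int = 30) -> pd.DataFrame:
    if interval not in (30, 60):
        raise ValueError("interval must be 30 or 60 minutes")
    out = df.copy()
    out["DATETIME_END"] = pd.to_datetime(out["DATETIME_END"])
    # Keep the first row for any repeated timestamp.
    out = out.drop_duplicates(subset="DATETIME_END", keep="first")
    out = out.set_index("DATETIME_END").sort_index()
    # Fixed-frequency index; intervals with no data become NaN rows.
    return out.resample(f"{interval}min").first()

raw = pd.DataFrame({
    "DATETIME_END": ["2024-01-01 00:30", "2024-01-01 00:30", "2024-01-01 01:30"],
    "TA": [1.0, 2.0, 3.0],
})
resampled = resample_sketch(raw, interval=30)
```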

micromet.format.transformers.scale_and_convert(column)[source]

Apply a rating transformation and convert the column to float type.

This function applies a ‘rating’ function to each element of the Series and then converts the entire Series to float.

Parameters:

column (Series) – The input Series to be transformed.

Returns:

The transformed and converted Series.

Return type:

Series

micromet.format.transformers.set_number_types(df, logger)[source]

Convert columns in a DataFrame to the appropriate numeric types.

This function iterates through the columns of a DataFrame and converts them to numeric types (integer or float) where appropriate. It handles special cases for certain columns and logs warnings for duplicate columns.

Parameters:

df (DataFrame) – The input DataFrame.
logger (Logger) – The logger for tracking the type conversion process.

Returns:

The DataFrame with columns converted to numeric types.

Return type:

DataFrame
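
The sketch below illustrates one way such a conversion could work. The special-case columns are not listed above, so they are omitted; coercing unparseable values to NaN and keeping the first of any duplicated column are assumptions, and `set_number_types_sketch` is a hypothetical name.

```python
import logging
import pandas as pd

def set_number_types_sketch(df: pd.DataFrame, logger: logging.Logger) -> pd.DataFrame:
    dupes = df.columns[df.columns.duplicated()].tolist()
    if dupes:
        logger.warning("Duplicate columns found: %s", dupes)
    # Assumption: keep only the first occurrence of a duplicated column.
    out = df.loc[:, ~df.columns.duplicated()].copy()
    for col in out.columns:
        # to_numeric picks integer or float as needed; bad values become NaN.
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out

text_df = pd.DataFrame({"TA": ["1.5", "x"], "N": ["1", "2"]})
converted = set_number_types_sketch(text_df, logging.getLogger("sketch"))
```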

micromet.format.transformers.ssitc_scale(df, logger)[source]

Scale SSITC (steady state and integral turbulence characteristics test) columns.

This function checks specific SSITC columns and, if their values exceed a certain threshold (3), applies a scaling and rating transformation to them.

Parameters:

df (DataFrame) – The input DataFrame with SSITC columns.
logger (Logger) – The logger for tracking the scaling process.

Returns:

The DataFrame with SSITC columns scaled where applicable.

Return type:

DataFrame
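
A vectorized sketch of the conditional scaling is given below. The column names in `SSITC_COLUMNS` are hypothetical, "exceed the threshold" is read as "the column maximum is greater than 3", and the rating ranges are treated as inclusive; none of these details are confirmed by the description above.

```python
import numpy as np
import pandas as pd

# Hypothetical column list; the real function checks its own set of columns.
SSITC_COLUMNS = ["FC_SSITC_TEST", "LE_SSITC_TEST"]

def ssitc_scale_sketch(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for col in SSITC_COLUMNS:
        if col not in out.columns:
            continue
        s = out[col]
        # Only rescale columns that contain values above 3.
        if s.max() > 3:
            rated = np.select([s.between(0, 3), s.between(4, 6)], [0, 1], default=2)
            out[col] = rated.astype(float)
    return out

flags = pd.DataFrame({"FC_SSITC_TEST": [1, 5, 9], "LE_SSITC_TEST": [0, 1, 2]})
scaled = ssitc_scale_sketch(flags)
```

In this demo only the first column is rescaled, because the second never exceeds 3.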

micromet.format.transformers.tau_fixer(df, threshold=0.5, logger=None)[source]

Replace zero values in the ‘TAU’ column with NaN and flip the sign if needed.

Loops through all columns with TAU in the name that don’t also have SSITC or QC in the name.

This function checks for zero values or negative infinity values in the ‘TAU’ column and replaces them with NaN. This is often done to handle cases where zero represents a missing or invalid measurement.

The function also determines whether to reverse the sign of TAU. If more than the specified threshold of TAU values are positive, it flips the sign of all TAU values.

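Parameters:

df (DataFrame) – The input DataFrame with one or more ‘TAU’ columns.
threshold (float) – The fraction of positive TAU values above which the sign of all TAU values is flipped. Defaults to 0.5.
logger (Logger, optional) – The logger for tracking the process. Defaults to None.

Returns:

The DataFrame with zero and negative-infinity values in the ‘TAU’ columns replaced by NaN and the sign flipped where needed.

Return type:

DataFrame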
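
The behavior described above can be sketched as follows. This is an illustration only: it assumes the positive fraction is computed over non-missing values after the zero and negative-infinity replacement, and `tau_fixer_sketch` is a hypothetical name.

```python
import numpy as np
import pandas as pd

def tau_fixer_sketch(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    out = df.copy()
    # TAU columns, excluding quality-flag columns (SSITC or QC in the name).
    tau_cols = [c for c in out.columns
                if "TAU" in c and "SSITC" not in c and "QC" not in c]
    for col in tau_cols:
        s = out[col].astype(float).replace([0.0, -np.inf], np.nan)
        valid = s.dropna()
        # Flip the sign when more than `threshold` of the values are positive.
        if len(valid) > 0 and (valid > 0).mean() > threshold:
            s = -s
        out[col] = s
    return out

tau_demo = pd.DataFrame({"TAU": [0.0, 0.2, 0.3, -0.1],
                         "TAU_SSITC_TEST": [0, 1, 2, 0]})
fixed = tau_fixer_sketch(tau_demo)
```

In the demo, two of the three remaining TAU values are positive, which exceeds the default threshold of 0.5, so the whole column changes sign.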

micromet.format.transformers.timestamp_reset(df, minutes=30)[source]

Reset TIMESTAMP_START and TIMESTAMP_END columns based on the DataFrame index.

This function generates new ‘TIMESTAMP_START’ and ‘TIMESTAMP_END’ columns based on the DataFrame’s datetime index. The ‘TIMESTAMP_START’ is calculated by subtracting the specified number of minutes from the index timestamp.

Parameters:

df (DataFrame) – The input DataFrame with a datetime index.
minutes (int) – The length of the interval, in minutes, between ‘TIMESTAMP_START’ and ‘TIMESTAMP_END’. Defaults to 30.

Returns:

The DataFrame with updated ‘TIMESTAMP_START’ and ‘TIMESTAMP_END’ columns.

Return type:

DataFrame
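
A minimal sketch of the timestamp reset is shown below. It assumes the datetime index marks the end of each averaging period and that timestamps are stored as integers in YYYYMMDDHHMM form; both are assumptions, and `timestamp_reset_sketch` is a hypothetical name.

```python
import pandas as pd

def timestamp_reset_sketch(df: pd.DataFrame, minutes: int = 30) -> pd.DataFrame:
    out = df.copy()
    fmt = "%Y%m%d%H%M"  # assumed YYYYMMDDHHMM layout
    end = out.index
    # Assumption: the index is the period end, so the start lies `minutes` earlier.
    start = end - pd.Timedelta(minutes=minutes)
    out["TIMESTAMP_START"] = [int(t) for t in start.strftime(fmt)]
    out["TIMESTAMP_END"] = [int(t) for t in end.strftime(fmt)]
    return out

idx = pd.date_range("2024-01-01 00:30", periods=2, freq="30min")
stamped = timestamp_reset_sketch(pd.DataFrame({"TA": [1.0, 2.0]}, index=idx))
```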