micromet.format package

Subpackages

micromet.format.transformers package

Submodules

micromet.format.compare module

Relationship comparison and coordinated outlier plots (SciPy version).

Aligns two time series (treats -9999 as NaN)
Fits y ~ x with scipy.stats.linregress
Flags outliers from residuals using robust MAD (or STD)
Produces three coordinated plots:
(1) scatter + regression line + highlighted outliers
(2) x time series with the same outliers highlighted
(3) y time series with the same outliers highlighted

class micromet.format.compare.FitResult(coef, intercept, r2, y_hat, residuals)[source]

Bases: object

Linear regression summary.

coef: float

intercept: float

r2: float

residuals: ndarray

y_hat: ndarray

micromet.format.compare.align(x, y, x_index=None, y_index=None, x_name='X', y_name='Y', how='inner')[source]

Align two series on their index and drop rows with NaNs.

This function prepares two time series for comparison by coercing them to pandas Series, aligning them based on their time index, and removing any rows that contain missing values (NaNs) in either series.

Parameters:

x (Union[Series, DataFrame, ndarray, Iterable]) – The first time series (independent variable).
y (Union[Series, DataFrame, ndarray, Iterable]) – The second time series (dependent variable).
x_index (Optional[DatetimeIndex]) – The time index for the x series, if not already a Series or DataFrame. Defaults to None.
y_index (Optional[DatetimeIndex]) – The time index for the y series, if not already a Series or DataFrame. Defaults to None.
x_name (str) – The name to assign to the x series. Defaults to “X”.
y_name (str) – The name to assign to the y series. Defaults to “Y”.
how (str) – The method for joining the two series, as in pd.concat. Defaults to “inner”.

Returns:

A DataFrame containing the two aligned and cleaned series as columns, indexed by their common time index.

Return type:

DataFrame
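
The alignment step described above can be sketched with plain pandas. This is a minimal illustration, not the library's implementation: align_sketch is a hypothetical name, the inputs are assumed to already be Series, and replacing the -9999 sentinel inside this step is an assumption based on the module overview.

```python
import numpy as np
import pandas as pd

def align_sketch(x: pd.Series, y: pd.Series, how: str = "inner") -> pd.DataFrame:
    """Hypothetical stand-in for align(): join on the index, drop incomplete rows."""
    df = pd.concat([x.rename("X"), y.rename("Y")], axis=1, join=how)
    # Assumption: the -9999 sentinel is converted to NaN before rows are dropped.
    df = df.replace(-9999, np.nan)
    return df.dropna()

idx = pd.date_range("2024-01-01", periods=4, freq="30min")
x = pd.Series([1.0, 2.0, -9999.0, 4.0], index=idx)
y = pd.Series([2.0, 4.1, 6.2], index=idx[1:])  # y starts one step later

aligned = align_sketch(x, y)  # keeps only timestamps valid in both series
```

With how="inner", only timestamps present in both inputs survive the join, so the first x value is dropped along with the sentinel row.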

micromet.format.compare.compare_and_plot(x, y, *, x_index=None, y_index=None, x_label='X', y_label='Y', title=None, method='mad', k=3.0, join='inner', point_size=8)[source]

Align, fit, detect outliers, and render coordinated plots.

This function provides a comprehensive analysis of the relationship between two time series. It produces a figure with three subplots:
1. A scatter plot of y vs. x with a regression line and outliers.
2. A time series plot of x with outliers highlighted.
3. A time series plot of y with outliers highlighted.

Parameters:

x (Union[Series, DataFrame, ndarray, Iterable]) – The first time series (independent variable).
y (Union[Series, DataFrame, ndarray, Iterable]) – The second time series (dependent variable).
x_index (Optional[DatetimeIndex]) – The time index for x. Defaults to None.
y_index (Optional[DatetimeIndex]) – The time index for y. Defaults to None.
x_label (str) – Label for the x-axis. Defaults to “X”.
y_label (str) – Label for the y-axis. Defaults to “Y”.
title (Optional[str]) – Title for the scatter plot. Defaults to None.
method (Literal['mad', 'std']) – Method for outlier detection. Defaults to “mad”.
k (float) – Threshold for outlier detection. Defaults to 3.0.
join (str) – Method for aligning the series. Defaults to “inner”.
point_size (int) – Size of the scatter plot points. Defaults to 8.

Returns:

A tuple containing the matplotlib Figure and a dictionary of results, including the aligned data, outlier mask, and fit statistics.

Return type:

Tuple[Figure, dict]

Raises:

ValueError – If there is no overlapping, non-NaN data between the inputs.

micromet.format.compare.compare_report(x, y, **kwargs)[source]

Return a tidy per-record report with predictions, residuals, and outlier flags.

This function wraps compare_and_plot to generate a detailed DataFrame report for each data point, including the predicted value, residual, and an outlier flag.

Parameters:

x (Union[Series, DataFrame, ndarray, Iterable]) – The first time series (independent variable).
y (Union[Series, DataFrame, ndarray, Iterable]) – The second time series (dependent variable).
**kwargs – Additional keyword arguments passed directly to compare_and_plot.

Returns:

A DataFrame with columns for the original data, the predicted y-values (y_hat), residuals, and a boolean outlier flag.

Return type:

DataFrame

micromet.format.compare.fit_linear(x, y)[source]

Fit a linear model y ~ x using scipy.stats.linregress.

This function performs a simple linear regression and returns the key results, including the fitted values and residuals.

Parameters:

x (Union[Series, ndarray]) – The independent variable data (predictor).
y (Union[Series, ndarray]) – The dependent variable data (response).

Returns:

A dataclass object containing the regression coefficient, intercept, R-squared value, predicted y values (y_hat), and the residuals.

Return type:

FitResult
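
As a rough sketch of what such a fit involves, the snippet below calls scipy.stats.linregress and packs the results into a small dataclass. FitSketch and fit_linear_sketch are hypothetical names that mirror the documented fields, and the residual sign convention (observed minus predicted) is an assumption.

```python
from dataclasses import dataclass

import numpy as np
from scipy.stats import linregress

@dataclass
class FitSketch:
    """Hypothetical container mirroring the documented FitResult fields."""
    coef: float
    intercept: float
    r2: float
    y_hat: np.ndarray
    residuals: np.ndarray

def fit_linear_sketch(x, y) -> FitSketch:
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    res = linregress(x, y)                 # slope, intercept, rvalue, ...
    y_hat = res.intercept + res.slope * x  # fitted values
    # Assumption: residuals are observed minus predicted.
    return FitSketch(float(res.slope), float(res.intercept),
                     float(res.rvalue ** 2), y_hat, y - y_hat)

fit = fit_linear_sketch([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])  # exactly y = 2x + 1
```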

micromet.format.compare.outlier_mask_from_residuals(residuals, method='mad', k=3.0)[source]

Flag outliers from residuals using MAD (robust, default) or STD.

This function identifies outliers in a set of residuals based on a specified statistical method.

Parameters:

residuals (Union[Series, ndarray]) – An array or Series of residuals from a model fit.
method (Literal['mad', 'std']) – The method for outlier detection. “mad” (Median Absolute Deviation) is robust to extreme values, while “std” (Standard Deviation) is the conventional, non-robust approach. Defaults to “mad”.
k (float) – The number of scaled MADs or standard deviations beyond which a point is considered an outlier. Defaults to 3.0.

Returns:

A boolean array of the same size as residuals, where True indicates that the corresponding residual is an outlier.

Return type:

ndarray

Raises:

ValueError – If method is not “mad” or “std”.
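
The MAD rule can be illustrated with a few lines of NumPy. This is a sketch under stated assumptions rather than the library's code: mad_outliers is a hypothetical name, the 1.4826 factor (which makes the MAD comparable to a standard deviation for normally distributed data) is assumed to be the scaling meant by "scaled MADs", and the zero-spread guard is an added safety choice.

```python
import numpy as np

def mad_outliers(residuals, k: float = 3.0) -> np.ndarray:
    """Hypothetical sketch: flag residuals more than k scaled MADs from the median."""
    r = np.asarray(residuals, dtype=float)
    med = np.nanmedian(r)
    mad = np.nanmedian(np.abs(r - med))
    scale = 1.4826 * mad  # assumed consistency factor for normal data
    if scale == 0:
        # Assumption: with zero spread, nothing is flagged.
        return np.zeros(r.shape, dtype=bool)
    return np.abs(r - med) > k * scale

res = np.array([0.1, -0.2, 0.0, 0.15, -0.1, 5.0])
mask = mad_outliers(res)  # only the last residual is far from the bulk
```

Because the median and MAD barely move when one value is extreme, the large residual does not inflate the threshold the way it would with a standard deviation.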

micromet.format.file_compile module

Compile files by substring into a single directory.

Key logic:

- Group by exact filename (case-sensitive match on the filename itself).
- Within each group, deduplicate items that have the same (creation_time, size).
- If more than one unique item remains and both creation_time and size differ across them, copy all, labeled sequentially: name_1.ext, name_2.ext, …
- Otherwise (effectively duplicates), copy only one.
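
The grouping rules above can be sketched as a pure planning function that decides destination names without touching the disk. plan_copies is a hypothetical helper, and reading "both creation_time and size differ" as "more than one distinct time and more than one distinct size" is an assumption.

```python
from collections import defaultdict
from pathlib import PurePath

def plan_copies(files, zero_pad: int = 1) -> dict:
    """Hypothetical sketch. files: iterable of (path, create_ts, size).

    Returns a mapping of source path -> destination filename.
    """
    groups = defaultdict(list)
    for path, create_ts, size in files:
        groups[PurePath(path).name].append((path, create_ts, size))

    plan = {}
    for name, items in groups.items():
        # Deduplicate on the (creation_time, size) pair, keeping the first seen.
        unique = {}
        for path, create_ts, size in items:
            unique.setdefault((create_ts, size), path)
        times = {key[0] for key in unique}
        sizes = {key[1] for key in unique}
        if len(unique) > 1 and len(times) > 1 and len(sizes) > 1:
            stem, ext = PurePath(name).stem, PurePath(name).suffix
            for i, path in enumerate(unique.values(), start=1):
                plan[path] = f"{stem}_{i:0{zero_pad}d}{ext}"
        else:
            # Effectively duplicates: keep a single copy under the original name.
            plan[next(iter(unique.values()))] = name
    return plan

plan = plan_copies([
    ("a/data.dat", 100.0, 10),
    ("b/data.dat", 100.0, 10),   # same time and size as a/data.dat
    ("c/data.dat", 200.0, 20),   # genuinely different file
    ("d/other.dat", 5.0, 1),
])
```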

class micromet.format.file_compile.FileInfo(path, size, create_ts, mtime_ts)[source]

Bases: object

A container for file metadata.

path

The full path to the file.

Type: Path

size

The size of the file in bytes.

Type: int

create_ts

The creation timestamp of the file. This may be platform-dependent.

Type: float

mtime_ts

The modification timestamp of the file.

Type: float

create_ts: float

mtime_ts: float

path: Path

size: int

micromet.format.file_compile.compile_files(root, outdir, contains, case_sensitive=False, dry_run=False, use_mtime=False, sequential_zero_pad=1)[source]

Compile files from a source directory to a destination, handling duplicates.

This function scans a directory tree for files containing a specific substring in their names, groups them by filename, and then copies them to an output directory. It includes logic to handle duplicate files based on their creation time and size.

Parameters:

root (Path) – The root directory to search for files.
outdir (Path) – The directory where the compiled files will be saved.
contains (str) – The substring that filenames must contain to be included.
case_sensitive (bool) – If True, the search for contains is case-sensitive. Defaults to False.
dry_run (bool) – If True, the function will only print the actions it would take without actually copying any files. Defaults to False.
use_mtime (bool) – If True, use the file’s modification time instead of its creation time for comparisons. Defaults to False.
sequential_zero_pad (int) – The number of digits to use for zero-padding when creating sequential filenames for duplicates. Defaults to 1.

Return type:

None

micromet.format.headers module

Header detection and repair utilities for delimited text files.

This module provides functions to detect missing headers in data files and repair them by borrowing headers from peer files. It supports both single-file processing and batch operations across directories.

Key Features

Automatic delimiter detection using csv.Sniffer with fallback heuristics
Header presence detection with multiple strategies
Peer file matching based on filename similarity and column count
Directory-based batch processing for duplicate files
Support for UTF-8, UTF-8-sig, and Latin-1 encodings

micromet.format.headers.apply_header(header_file, target_file, *, inplace=False)[source]

Apply a header from a reference file to a data file and return a DataFrame.

This function reads column names from header_file and applies them to target_file, which is assumed to lack a header row. The result is returned as a pandas DataFrame. Optionally, the function can overwrite target_file with the updated version, keeping a backup as *.bak.

Parameters:

header_file (Path) – Path to the file containing the correct column headers.
target_file (Path) – Path to the file that is missing column headers.
inplace (bool) – If True, the modified DataFrame is written back to target_file, and a backup of the original file is saved with a .bak extension. Default is False.

Returns:

DataFrame containing the contents of target_file with headers applied from header_file.

Return type:

DataFrame

Notes

The delimiter is inferred using a sniffing function to ensure consistent parsing between the header and target files.

micromet.format.headers.count_columns(path, delimiter)[source]

Count the number of columns in the first non-empty row of a file.

Parameters:

path (Path) – The path to the file.
delimiter (str) – The delimiter character to use for splitting rows into columns.

Returns:

The number of columns detected in the first non-empty row. Returns 0 if the file is empty or contains only empty rows.

Return type:

int

micromet.format.headers.detect_delimiter_and_header(path, sample_size=64000)[source]

Detect the delimiter and presence of a header in a text file.

Uses csv.Sniffer to determine the delimiter and whether a header row exists. Includes fallbacks for both detection steps if the sniffer fails.

Parameters:

path (Path) – The path to the file to inspect.
sample_size (int) – The number of bytes to read from the beginning of the file to use for detection. Defaults to 64,000.

Returns:

A tuple containing:
- The detected delimiter character (e.g., ‘,’).
- A boolean that is True if a header is detected, False otherwise.

Return type:

Tuple[str, bool]
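
A minimal sketch of sniffer-based detection with fallbacks is shown below. It operates on an in-memory text sample rather than a path, and the function name, the candidate delimiter set, and both fallback heuristics are assumptions for illustration, not the library's actual logic.

```python
import csv

def detect_sketch(sample: str, candidates: str = ",\t;|"):
    """Hypothetical sketch: return (delimiter, has_header) for a text sample."""
    first = sample.splitlines()[0] if sample else ""
    try:
        delimiter = csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # Assumed fallback: the candidate that occurs most often in the first line.
        delimiter = max(candidates, key=first.count)
    try:
        has_header = csv.Sniffer().has_header(sample)
    except csv.Error:
        # Assumed fallback: any alphabetic character suggests a header.
        has_header = any(ch.isalpha() for ch in first)
    return delimiter, has_header

with_header = (
    "TIMESTAMP,TA,RH\n"
    "202401010000,1.5,80.2\n"
    "202401010030,1.7,79.9\n"
    "202401010100,1.9,79.1\n"
)
without_header = with_header.split("\n", 1)[1]  # same data, header line removed

result_with = detect_sketch(with_header)
result_without = detect_sketch(without_header)
```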

micromet.format.headers.find_header_donor(target, delimiter, expected_cols, min_name_sim=0.4)[source]

Find a peer file to serve as a header “donor”.

Searches the same directory as the target file for a suitable file to borrow a header from. A donor is considered suitable if it:
- Is a file with a common text extension.
- Has a detectable header and the same delimiter.
- Has the same number of columns as the target.
- Has a filename similarity above min_name_sim.

Among candidates, the one with the closest modification time to the target is chosen. Ties are broken by selecting the one with the highest name similarity.

Parameters:

target (Path) – The path to the file that needs a header.
delimiter (str) – The delimiter used in the target file.
expected_cols (int) – The number of columns in the target file.
min_name_sim (float) – The minimum name similarity ratio (0.0 to 1.0) required for a file to be considered a potential donor. Defaults to 0.4.

Returns:

A tuple containing the path to the donor file and its raw header line, or None if no suitable donor is found.

Return type:

Optional[Tuple[Path, str]]
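
The selection rule (closest modification time, ties broken by higher name similarity) can be expressed as a single sort key. The sketch below works on precomputed metadata; Candidate and pick_donor are hypothetical names, and treating a similarity equal to min_name_sim as acceptable is an assumption.

```python
from typing import NamedTuple, Optional, Sequence

class Candidate(NamedTuple):
    """Hypothetical record for a potential donor file."""
    name: str
    mtime: float
    similarity: float

def pick_donor(target_mtime: float, candidates: Sequence[Candidate],
               min_name_sim: float = 0.4) -> Optional[Candidate]:
    eligible = [c for c in candidates if c.similarity >= min_name_sim]
    # Smallest time gap wins; negating similarity makes the higher score win ties.
    return min(eligible,
               key=lambda c: (abs(c.mtime - target_mtime), -c.similarity),
               default=None)

chosen = pick_donor(100.0, [
    Candidate("a.dat", 110.0, 0.6),
    Candidate("b.dat", 90.0, 0.8),    # same 10 s gap as a.dat, higher similarity
    Candidate("c.dat", 100.0, 0.1),   # closest in time but below the threshold
    Candidate("d.dat", 150.0, 0.99),
])
```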

micromet.format.headers.fix_all_in_parent(parent, searchstr='*_AmeriFluxFormat_*.dat')[source]

Recursively scan a parent directory for files with duplicate names and fix missing headers.

This function searches parent for files matching a given pattern. If duplicate filenames are found such that one version has a header and another does not, the header is copied from the former to the latter. The target files are overwritten in-place, and a .bak backup is created for each.

Parameters:

parent (Path) – Root directory to scan for matching files. All subdirectories are included recursively.
searchstr (str) – Glob-style pattern to match filenames (default is “*_AmeriFluxFormat_*.dat”).

Returns:

A dictionary mapping filenames to lists of paths where they were found.

Return type:

dict

Notes

Files are grouped by basename and inspected line-by-line to determine whether they contain a header.
If multiple files have headers, only the first one is used as the donor.
Files with no header and no matching header source are skipped.

micromet.format.headers.fix_directory_pairs(dir_with_headers, dir_without_headers)[source]

Apply headers from a directory of correctly formatted files to a directory of files missing headers.

This function loops through all files in dir_without_headers. For each file that lacks a header, it attempts to find a matching file by name in dir_with_headers and uses it to patch the missing header. The original file is overwritten, and a .bak backup is created.

Parameters:

dir_with_headers (Path) – Directory containing files with valid headers.
dir_without_headers (Path) – Directory containing files that may be missing headers.

Return type:

None

Notes

This function assumes that files in both directories are named identically, and that headers can be determined by inspecting the first line of each file.

micromet.format.headers.get_first_line_raw(path)[source]

Return the first line of a file as raw text, without trailing newlines.

Parameters:

path (Path) – The path to the file.

Returns:

The content of the first line.

Return type:

str

micromet.format.headers.header_line_is_valid(header_line, delimiter, expected_cols)[source]

Check if a header line has the expected number of columns.

This function properly handles quoted fields.

Parameters:

header_line (str) – The raw header line text.
delimiter (str) – The delimiter character.
expected_cols (int) – The number of columns the header should have.

Returns:

True if the parsed header has the correct number of columns, False otherwise.

Return type:

bool
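
To show why quoted fields matter for this check, here is a small sketch that parses the line with csv.reader instead of splitting on the delimiter. header_has_n_columns is a hypothetical name and not the library's implementation.

```python
import csv

def header_has_n_columns(header_line: str, delimiter: str, expected_cols: int) -> bool:
    """Hypothetical sketch: count fields with a CSV parser so quoted delimiters are ignored."""
    fields = next(csv.reader([header_line], delimiter=delimiter), [])
    return len(fields) == expected_cols

line = 'TIMESTAMP,"TA, 2 m",RH'
naive_count = len(line.split(","))               # splits inside the quoted field
is_valid = header_has_n_columns(line, ",", 3)    # parser keeps the quoted field whole
```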

micromet.format.headers.looks_like_header(line, alpha_thresh=0.2)[source]

Heuristically determine if a line appears to be a header.

This function checks if a line from a text file is likely to be a header row by checking for the presence of alphabetic characters.

Parameters:

line (str) – A single line of text from a file.
alpha_thresh (float) – The minimum fraction of fields that must contain alphabetic characters to be considered a header. Defaults to 0.2 (20%).

Returns:

True if the line is likely a header, False otherwise.

Return type:

bool
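
The heuristic can be sketched as the fraction of fields that contain at least one letter. The function name and the explicit delimiter argument are assumptions (the documented function takes only the line and the threshold), as is the use of >= for the comparison.

```python
def looks_like_header_sketch(line: str, delimiter: str = ",",
                             alpha_thresh: float = 0.2) -> bool:
    """Hypothetical sketch: True when enough fields contain alphabetic characters."""
    fields = [f.strip() for f in line.split(delimiter)]
    if not any(fields):
        return False  # empty or whitespace-only line
    with_alpha = sum(any(ch.isalpha() for ch in f) for f in fields)
    return with_alpha / len(fields) >= alpha_thresh

header_guess = looks_like_header_sketch("TIMESTAMP,TA,RH")
data_guess = looks_like_header_sketch("202401010000,1.5,80.2")
```

A heuristic of this kind can be misled by data rows containing tokens such as NaN or values in scientific notation (for example 1e-5), which is one reason to use a threshold rather than requiring zero letters.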

micromet.format.headers.name_similarity(a, b)[source]

Calculate the similarity ratio between two strings.

Uses difflib.SequenceMatcher for the comparison.

Parameters:

a (str) – The first string.
b (str) – The second string.

Returns:

A similarity score between 0.0 and 1.0.

Return type:

float
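
For reference, a similarity ratio from difflib.SequenceMatcher looks like the sketch below. Whether the library normalizes case or strips extensions before comparing is not stated, so this version compares the raw strings; similarity_sketch is a hypothetical name.

```python
from difflib import SequenceMatcher

def similarity_sketch(a: str, b: str) -> float:
    """Hypothetical sketch: 2 * matches / total length, in the range 0.0 to 1.0."""
    return SequenceMatcher(None, a, b).ratio()

same = similarity_sketch("site_2024_01.dat", "site_2024_01.dat")
close = similarity_sketch("site_2024_01.dat", "site_2024_02.dat")
unrelated = similarity_sketch("abc", "xyz")
```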

micromet.format.headers.open_text(path, encodings=None)[source]

Open a text file, trying a list of encodings until one succeeds.

Parameters:

path (Path) – The path to the text file.
encodings (list[str] | None) – A list of character encodings to try, in order. Defaults to [“utf-8-sig”, “utf-8”, “latin-1”].

Returns:

An open file object.

Return type:

TextIOWrapper

Raises:

Exception – If all attempted encodings fail, the last exception is re-raised.
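
The try-each-encoding pattern can be sketched as follows. This is an illustration under assumptions: open_text_sketch is a hypothetical name, and decoding only a leading sample to validate the encoding is a simplification. Note that latin-1 maps every byte to a character, so as the last entry it acts as a catch-all that does not raise a decode error.

```python
import tempfile
from pathlib import Path

def open_text_sketch(path, encodings=("utf-8-sig", "utf-8", "latin-1")):
    """Hypothetical sketch: return a file opened with the first encoding that decodes."""
    last_err = None
    for enc in encodings:
        fh = open(path, "r", encoding=enc, newline="")
        try:
            fh.read(4096)  # decode a sample so a wrong encoding fails here
            fh.seek(0)
            return fh
        except UnicodeDecodeError as err:
            fh.close()
            last_err = err
    if last_err is None:
        raise ValueError("no encodings were supplied")
    raise last_err

with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "latin1.dat"
    sample.write_bytes(b"T\xe9mp,RH\n1.5,80\n")  # 0xE9 is not valid UTF-8 here
    with open_text_sketch(sample) as fh:
        used_encoding = fh.encoding
        first_line = fh.readline()
```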

micromet.format.headers.patch_file(donor, target)[source]

Apply a header from a donor file to a target file.

This function reads the header from a donor file and applies it to a target file that is assumed to be missing a header. The modified data is returned as a DataFrame and written back to the target file.

Parameters:

donor (Path) – The path to the file with the correct header.
target (Path) – The path to the file that needs a header.

Returns:

A DataFrame containing the data from the target file with the new header.

Return type:

DataFrame

micromet.format.headers.prepend_header_in_place(path, header_line)[source]

Insert a header line at the top of a file.

This function reads the entire file, then writes it back with the provided header line at the beginning. It attempts to preserve the original newline style.

Parameters:

path (Path) – The path to the file to be modified.
header_line (str) – The header line to prepend to the file.

Return type:

None
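
One way to preserve the newline style is to work on raw bytes and reuse whichever line ending the file already contains. The sketch below assumes exactly that approach; prepend_header_sketch is a hypothetical name, and encoding the header as UTF-8 is an assumption.

```python
import tempfile
from pathlib import Path

def prepend_header_sketch(path: Path, header_line: str) -> None:
    """Hypothetical sketch: insert header_line at the top, matching existing line endings."""
    raw = path.read_bytes()
    newline = b"\r\n" if b"\r\n" in raw else b"\n"
    header = header_line.rstrip("\r\n").encode("utf-8")  # assumed encoding
    path.write_bytes(header + newline + raw)

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "headerless.dat"
    target.write_bytes(b"1,2\r\n3,4\r\n")  # Windows-style line endings
    prepend_header_sketch(target, "A,B")
    patched = target.read_bytes()
```

Operating on bytes leaves the body of the file untouched, so no characters are re-encoded as a side effect of adding the header.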

micromet.format.headers.process_file(path, min_sim, make_backup)[source]

Detect and repair a headerless delimited text file in place.

The function inspects path to determine its delimiter and whether the file already contains a header row. If a header is missing, it searches for a “donor” file in the same directory with a compatible delimiter and column count, and with column-name similarity above min_sim. When a donor is found, its header is prepended to path (optionally creating a .bak backup first). Progress is reported via print messages.

Parameters:

path (Path) – Path to the target text file to check and possibly fix.
min_sim (float) – Minimum similarity threshold (0–1) for column-name matching when selecting a donor header. Higher values are stricter.
make_backup (bool) – If True, write a bytes-for-bytes backup alongside the file at path.with_suffix(path.suffix + ".bak") before modifying the file.

Returns:

The file at path may be modified in place as a side effect.

Return type:

None

Raises:

OSError – If reading or writing the file fails.
Exception – Any error originating from helper functions may propagate.

micromet.format.headers.read_colnames(path)[source]

Read column names from the first line of a file.

This function infers the delimiter, reads the first line of the file, and returns the column names.

Parameters:

path (Path) – The path to the file.

Returns:

A list of column names.

Return type:

list[str]

micromet.format.headers.scan(root, min_sim=0.5, backup=False)[source]

Recursively scan a directory tree and fix headerless text files.

Walks root with Path.rglob("*") and applies process_file() to every file whose extension is in {".dat"}. Exceptions raised by process_file() are caught and reported, allowing the scan to continue.

Parameters:

root (Path) – Directory to search recursively for candidate text files.
min_sim (float) – Minimum column-name similarity (0–1) when selecting a donor header; passed through to process_file().
backup (bool) – If True, create a .bak file for each modified file; passed through to process_file() as make_backup.

Return type:

None

Side Effects

May modify files in place by inserting a header line.
May create .bak files adjacent to modified files when backup=True.
Prints progress, skip, and error messages to standard output.

micromet.format.headers.sniff_delimiter(path, sample_bytes=2048, default=',')[source]

Infer the most likely delimiter used in a text file.

This function reads a sample from the beginning of a file and uses csv.Sniffer to detect the delimiter.

Parameters:

path (Path) – The path to the file.
sample_bytes (int) – The number of bytes to read for the sample. Defaults to 2048.
default (str) – The delimiter to return if detection fails. Defaults to “,”.

Returns:

The detected or default delimiter.

Return type:

str

micromet.format.merge module

micromet.format.merge.fillna_with_second_df(df1, df2, suffix1='_df1', suffix2='_df2')[source]

Merges two DataFrames by index, prioritizing data from df1 and using df2 to fill any missing (NaN) values introduced by the outer merge for any columns that match between the two dataframes.

Parameters:

df1 (DataFrame) – The primary DataFrame whose index and values are prioritized.
df2 (DataFrame) – The secondary DataFrame used to fill NaN values in df1’s columns.
suffix1 (str) – The suffix to apply to columns from df1 during the merge. The default is ‘_df1’. This suffix is removed from the output. Choose a suffix that does not occur in any column name of either DataFrame.
suffix2 (str) – The suffix to apply to columns from df2 during the merge. The default is ‘_df2’. These columns are dropped from the output. Choose a suffix that does not occur in any column name of either DataFrame.

Returns:

A merged DataFrame containing the union of both indices. Columns are filled: df1’s value if present, otherwise df2’s value. The final column names are stripped of suffix1.

Return type:

DataFrame

Notes

This function assumes that the column names (excluding suffixes) in both DataFrames are the same for matching purposes.
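
The merge-then-fill behavior can be approximated with an outer index merge followed by fillna on each shared column. This is a sketch under assumptions, not the library's code: fill_from_second is a hypothetical name, and the resulting column order may differ from the real function.

```python
import numpy as np
import pandas as pd

def fill_from_second(df1: pd.DataFrame, df2: pd.DataFrame,
                     suffix1: str = "_df1", suffix2: str = "_df2") -> pd.DataFrame:
    """Hypothetical sketch: outer-merge on the index, prefer df1, fall back to df2."""
    shared = df1.columns.intersection(df2.columns)
    merged = df1.merge(df2, how="outer", left_index=True, right_index=True,
                       suffixes=(suffix1, suffix2))
    for col in shared:
        # df1's value where present, otherwise df2's value.
        merged[col] = merged[col + suffix1].fillna(merged[col + suffix2])
        merged = merged.drop(columns=[col + suffix1, col + suffix2])
    return merged

df1 = pd.DataFrame({"TA": [1.0, np.nan]}, index=[0, 1])
df2 = pd.DataFrame({"TA": [9.0, 2.0, 3.0]}, index=[0, 1, 2])
filled = fill_from_second(df1, df2)
```

In this example the value 9.0 from df2 is ignored because df1 already has a value at index 0, while the NaN at index 1 and the row at index 2 (absent from df1) are both filled from df2.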

micromet.format.reformatter module

This module provides the Reformatter class for cleaning and standardizing station data for flux/met processing, with integrated timestamp alignment checks.

class micromet.format.reformatter.Reformatter(var_limits_csv=None, drop_soil=True, check_timestamps=False, site_lat=None, site_lon=None, site_utc_offset=-7, logger=None)[source]

Bases: object

A class to clean and standardize station data for flux/met processing.

This class provides a pipeline for preparing raw station data by applying a series of transformations, including fixing timestamps, renaming columns, applying physical limits, and checking timestamp alignment.

Parameters:

var_limits_csv (str | Path | None) – Path to a CSV file containing variable limits. If not provided, default limits are used.
drop_soil (bool) – If True, extra soil-related columns are dropped. Defaults to True.
check_timestamps (bool) – If True, perform timestamp alignment analysis on radiation data. Defaults to False.
site_lat (float | None) – Latitude of the site (required if check_timestamps=True).
site_lon (float | None) – Longitude of the site (required if check_timestamps=True).
site_utc_offset (int) – UTC offset in hours for the site (required if check_timestamps=True).
logger (Logger | None) – A logger for tracking the reformatting process. If not provided, a default logger is used.

logger

The logger used for logging messages.

Type: logging.Logger

config

A dictionary of configuration parameters for the reformatting process.

Type: dict

varlimits

A DataFrame containing the physical limits for each variable.

Type: pd.DataFrame

drop_soil

A flag indicating whether to drop extra soil columns.

Type: bool

check_timestamps

A flag indicating whether to perform timestamp alignment checks.

Type: bool

site_lat

The latitude of the site.

Type: float

site_lon

The longitude of the site.

Type: float

site_utc_offset

The UTC offset of the site in hours.

Type: float

__init__(var_limits_csv=None, drop_soil=True, check_timestamps=False, site_lat=None, site_lon=None, site_utc_offset=-7, logger=None)[source]

Initialize the Reformatter.

Parameters:

var_limits_csv (str | Path | None) – Path to a CSV file containing variable limits.
drop_soil (bool) – If True, extra soil-related columns are dropped. Defaults to True.
check_timestamps (bool) – If True, perform timestamp alignment analysis. Defaults to False.
site_lat (float | None) – Latitude of the site (required if check_timestamps=True).
site_lon (float | None) – Longitude of the site (required if check_timestamps=True).
site_utc_offset (int) – UTC offset in hours (required if check_timestamps=True).
logger (Logger | None) – A logger for tracking the reformatting process.

finalize(df)[source]: Finalize the data by applying cleaning and standardization steps.

prepare(df, interval=30, data_type='eddy')[source]: Current method - keep for backward compatibility

preprocess(df, data_type='eddy', interval=30)[source]: Preprocess the data by applying initial cleaning and standardization steps.

process(df, interval, data_type='eddy')[source]

Prepare the data by applying a series of cleaning and standardization steps.

This method takes a DataFrame of station data and applies a pipeline of transformations to clean and standardize it. The steps include fixing timestamps, renaming columns, setting numeric types, resampling, applying physical limits, and optionally checking timestamp alignment.

Parameters:

df (DataFrame) – The input DataFrame of station data.
data_type (str) – The type of data being processed (e.g., ‘eddy’, ‘met’). This is used to determine which column renaming map to use. Defaults to ‘eddy’.
interval (int) – The sampling interval of the data, in minutes; must be either 30 or 60.

Returns:

A tuple containing:
- The prepared DataFrame with standardized and cleaned data.
- A report DataFrame detailing the changes made during the application of physical limits.
- A dictionary with timestamp alignment results (if check_timestamps=True), or None otherwise. Contains keys: ‘summary’, ‘composites’, ‘flags’.

Return type:

Tuple[DataFrame, DataFrame, Optional[Dict]]

micromet.format.reformatter_vars module

This module contains the configuration dictionary for the data reformatter.

The config dictionary holds several key-value pairs that control the behavior of the data reformatting process. This includes mappings for renaming columns, lists of variables for different data types (e.g., ‘eddy’ and ‘met’), and lists of columns to be dropped.

Module contents

This package contains modules for formatting and transforming data.