micromet.format package
Subpackages
- micromet.format.transformers package
- Submodules
- micromet.format.transformers.cleanup module
- micromet.format.transformers.columns module
- micromet.format.transformers.corrections module
- micromet.format.transformers.interval_updates module
- micromet.format.transformers.timestamp_update module
- micromet.format.transformers.timestamps module
- micromet.format.transformers.validation module
- Module contents
- apply_fixes()
- apply_physical_limits()
- col_order()
- drop_extra_soil_columns()
- drop_extras()
- fill_na_drop_dups()
- fix_swc_percent()
- fix_timestamps()
- infer_datetime_col()
- make_unique()
- make_unique_cols()
- mask_stuck_values()
- modernize_soil_legacy()
- normalize_prefixes()
- process_and_match_columns()
- rating()
- rename_columns()
- resample_timestamps()
- scale_and_convert()
- set_number_types()
- ssitc_scale()
- tau_fixer()
- timestamp_reset()
Submodules
micromet.format.compare module
Relationship comparison and coordinated outlier plots (SciPy version).
Aligns two time series (treats -9999 as NaN)
Fits y ~ x with scipy.stats.linregress
Flags outliers from residuals using robust MAD (or STD)
Produces three coordinated plots: (1) scatter + regression line + highlighted outliers (2) x time series with the same outliers highlighted (3) y time series with the same outliers highlighted
- class micromet.format.compare.FitResult(coef, intercept, r2, y_hat, residuals)[source]
Bases: object
Linear regression summary.
- micromet.format.compare.align(x, y, x_index=None, y_index=None, x_name='X', y_name='Y', how='inner')[source]
Align two series on their index and drop rows with NaNs.
This function prepares two time series for comparison by coercing them to pandas Series, aligning them based on their time index, and removing any rows that contain missing values (NaNs) in either series.
- Parameters:
x (Union[Series,DataFrame,ndarray,Iterable]) – The first time series (independent variable).
y (Union[Series,DataFrame,ndarray,Iterable]) – The second time series (dependent variable).
x_index (Optional[DatetimeIndex]) – The time index for the x series, if not already a Series or DataFrame. Defaults to None.
y_index (Optional[DatetimeIndex]) – The time index for the y series, if not already a Series or DataFrame. Defaults to None.
x_name (str) – The name to assign to the x series. Defaults to “X”.
y_name (str) – The name to assign to the y series. Defaults to “Y”.
how (str) – The method for joining the two series, as in pd.concat. Defaults to “inner”.
- Returns:
A DataFrame containing the two aligned and cleaned series as columns, indexed by their common time index.
- Return type:
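The alignment step described above can be illustrated with a minimal sketch. It is not the package's implementation: the helper name `align_sketch` and the `missing` parameter are hypothetical, and the -9999 sentinel handling follows the module summary rather than the function's own source.

```python
import numpy as np
import pandas as pd


def align_sketch(x, y, x_name="X", y_name="Y", how="inner", missing=-9999):
    """Join two Series on their index and drop rows with missing values."""
    # Name the inputs so the joined columns are predictable.
    xs = pd.Series(x).rename(x_name)
    ys = pd.Series(y).rename(y_name)
    # Join on the index; "inner" keeps only timestamps present in both.
    joined = pd.concat([xs, ys], axis=1, join=how)
    # Treat the sentinel as missing, then drop incomplete rows.
    return joined.replace(missing, np.nan).dropna()


idx = pd.date_range("2024-01-01", periods=4, freq="30min")
x = pd.Series([1.0, 2.0, -9999.0, 4.0], index=idx)
y = pd.Series([2.0, 4.0, 6.0], index=idx[1:])
aligned = align_sketch(x, y)
```

Here only two rows survive: the first timestamp has no y value, and the third carries the -9999 sentinel in x.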
- micromet.format.compare.compare_and_plot(x, y, *, x_index=None, y_index=None, x_label='X', y_label='Y', title=None, method='mad', k=3.0, join='inner', point_size=8)[source]
Align, fit, detect outliers, and render coordinated plots.
This function provides a comprehensive analysis of the relationship between two time series. It produces a figure with three subplots:
1. A scatter plot of y vs. x with a regression line and outliers.
2. A time series plot of x with outliers highlighted.
3. A time series plot of y with outliers highlighted.
- Parameters:
x (Union[Series,DataFrame,ndarray,Iterable]) – The first time series (independent variable).
y (Union[Series,DataFrame,ndarray,Iterable]) – The second time series (dependent variable).
x_index (Optional[DatetimeIndex]) – The time index for x. Defaults to None.
y_index (Optional[DatetimeIndex]) – The time index for y. Defaults to None.
x_label (str) – Label for the x-axis. Defaults to “X”.
y_label (str) – Label for the y-axis. Defaults to “Y”.
title (Optional[str]) – Title for the scatter plot. Defaults to None.
method (Literal['mad','std']) – Method for outlier detection. Defaults to “mad”.
k (float) – Threshold for outlier detection. Defaults to 3.0.
join (str) – Method for aligning the series. Defaults to “inner”.
point_size (int) – Size of the scatter plot points. Defaults to 8.
- Returns:
A tuple containing the matplotlib Figure and a dictionary of results, including the aligned data, outlier mask, and fit statistics.
- Return type:
- Raises:
ValueError – If there is no overlapping, non-NaN data between the inputs.
- micromet.format.compare.compare_report(x, y, **kwargs)[source]
Return a tidy per-record report with predictions, residuals, and outlier flags.
This function wraps compare_and_plot to generate a detailed DataFrame report for each data point, including the predicted value, residual, and an outlier flag.
- Parameters:
- Returns:
A DataFrame with columns for the original data, the predicted y-values (y_hat), residuals, and a boolean outlier flag.
- Return type:
- micromet.format.compare.fit_linear(x, y)[source]
Fit a linear model y ~ x using scipy.stats.linregress.
This function performs a simple linear regression and returns the key results, including the fitted values and residuals.
- Parameters:
- Returns:
A dataclass object containing the regression coefficient, intercept, R-squared value, predicted y values (y_hat), and the residuals.
- Return type:
- micromet.format.compare.outlier_mask_from_residuals(residuals, method='mad', k=3.0)[source]
Flag outliers from residuals using MAD (robust, default) or STD.
This function identifies outliers in a set of residuals based on a specified statistical method.
- Parameters:
residuals (Union[Series,ndarray]) – An array or Series of residuals from a model fit.
method (Literal['mad','std']) – The method for outlier detection. “mad” (Median Absolute Deviation) is robust to extreme values, while “std” (Standard Deviation) is the conventional, non-robust alternative. Defaults to “mad”.
k (float) – The number of scaled MADs or standard deviations beyond which a point is considered an outlier. Defaults to 3.0.
- Returns:
A boolean array of the same size as residuals, where True indicates that the corresponding residual is an outlier.
- Return type:
- Raises:
ValueError – If method is not “mad” or “std”.
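As an illustration of the MAD option, the sketch below flags residuals that lie more than k scaled MADs from the median. The function name `mad_outlier_mask`, the 1.4826 scaling constant, and the handling of a zero MAD are assumptions; the package may scale or guard differently.

```python
import numpy as np


def mad_outlier_mask(residuals, k=3.0):
    """Return True where a residual lies more than k scaled MADs from the median."""
    r = np.asarray(residuals, dtype=float)
    med = np.nanmedian(r)
    mad = np.nanmedian(np.abs(r - med))
    # Assumption: 1.4826 rescales the MAD so it is comparable to a standard
    # deviation for normally distributed data.
    scale = 1.4826 * mad
    if scale == 0:
        # Assumption: with no spread at all, flag nothing.
        return np.zeros(r.shape, dtype=bool)
    return np.abs(r - med) > k * scale


res = np.array([0.1, -0.2, 0.0, 0.15, -0.1, 5.0])
mask = mad_outlier_mask(res)
```

Only the last residual (5.0) is flagged. Because the median and MAD barely move when a single extreme value is present, the threshold stays tight around the bulk of the data.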
micromet.format.file_compile module
Compile files by substring into a single directory.
Key logic:
- Group by exact filename (case-sensitive match on the filename itself).
- Within each group, deduplicate items that have the same (creation_time, size).
- If >1 unique items remain and both creation_time and size differ across them, copy all, labeled sequentially: name_1.ext, name_2.ext, …
- Else (effectively duplicates), copy only one.
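The grouping rules above can be sketched without touching the filesystem. This is an illustration only: `plan_copies` is a hypothetical helper, plain tuples stand in for FileInfo, and "both differ" is read here as "more than one distinct creation time and more than one distinct size within the group".

```python
from collections import defaultdict
from pathlib import Path


def plan_copies(files):
    """Map output filename -> source path for (path, size, create_ts) tuples."""
    groups = defaultdict(list)
    for path, size, create_ts in files:
        # Group by exact filename, ignoring the parent directory.
        groups[Path(path).name].append((Path(path), size, create_ts))

    plan = {}
    for name, items in groups.items():
        # Deduplicate on (create_ts, size), keeping the first path seen.
        unique = {}
        for path, size, create_ts in items:
            unique.setdefault((create_ts, size), path)
        keys = list(unique)
        times_differ = len({key[0] for key in keys}) > 1
        sizes_differ = len({key[1] for key in keys}) > 1
        if len(keys) > 1 and times_differ and sizes_differ:
            # Genuinely different versions: keep all, numbered sequentially.
            stem, ext = Path(name).stem, Path(name).suffix
            for i, key in enumerate(keys, start=1):
                plan[f"{stem}_{i}{ext}"] = unique[key]
        else:
            # Effectively duplicates: keep a single copy under the original name.
            plan[name] = unique[keys[0]]
    return plan


files = [
    ("a/site_Flux.dat", 100, 1.0),
    ("b/site_Flux.dat", 100, 1.0),   # same size and time as the first
    ("c/site_Flux.dat", 250, 2.0),   # differs in both size and time
    ("a/other_Flux.dat", 50, 3.0),
    ("b/other_Flux.dat", 50, 3.0),
]
plan = plan_copies(files)
```

In this example `site_Flux.dat` yields two numbered outputs, while `other_Flux.dat` collapses to one copy.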
- class micromet.format.file_compile.FileInfo(path, size, create_ts, mtime_ts)[source]
Bases: object
A container for file metadata.
- path
The full path to the file.
- Type:
Path
- micromet.format.file_compile.compile_files(root, outdir, contains, case_sensitive=False, dry_run=False, use_mtime=False, sequential_zero_pad=1)[source]
Compile files from a source directory to a destination, handling duplicates.
This function scans a directory tree for files containing a specific substring in their names, groups them by filename, and then copies them to an output directory. It includes logic to handle duplicate files based on their creation time and size.
- Parameters:
root (Path) – The root directory to search for files.
outdir (Path) – The directory where the compiled files will be saved.
contains (str) – The substring that filenames must contain to be included.
case_sensitive (bool) – If True, the search for contains is case-sensitive. Defaults to False.
dry_run (bool) – If True, the function will only print the actions it would take without actually copying any files. Defaults to False.
use_mtime (bool) – If True, use the file’s modification time instead of its creation time for comparisons. Defaults to False.
sequential_zero_pad (int) – The number of digits to use for zero-padding when creating sequential filenames for duplicates. Defaults to 1.
- Return type:
micromet.format.headers module
Header detection and repair utilities for delimited text files.
This module provides functions to detect missing headers in data files and repair them by borrowing headers from peer files. It supports both single-file processing and batch operations across directories.
Key Features
Automatic delimiter detection using csv.Sniffer with fallback heuristics
Header presence detection with multiple strategies
Peer file matching based on filename similarity and column count
Directory-based batch processing for duplicate files
Support for UTF-8, UTF-8-sig, and Latin-1 encodings
- micromet.format.headers.apply_header(header_file, target_file, *, inplace=False)[source]
Apply a header from a reference file to a data file and return a DataFrame.
This function reads column names from header_file and applies them to target_file, which is assumed to lack a header row. The result is returned as a pandas DataFrame. Optionally, the function can overwrite target_file with the updated version, keeping a backup as *.bak.
- Parameters:
header_file (Path) – Path to the file containing the correct column headers.
target_file (Path) – Path to the file that is missing column headers.
inplace (bool) – If True, the modified DataFrame is written back to target_file, and a backup of the original file is saved with a .bak extension. Default is False.
- Returns:
DataFrame containing the contents of target_file with headers applied from header_file.
- Return type:
Notes
The delimiter is inferred using a sniffing function to ensure consistent parsing between the header and target files.
- micromet.format.headers.count_columns(path, delimiter)[source]
Count the number of columns in the first non-empty row of a file.
- micromet.format.headers.detect_delimiter_and_header(path, sample_size=64000)[source]
Detect the delimiter and presence of a header in a text file.
Uses csv.Sniffer to determine the delimiter and whether a header row exists. Includes fallbacks for both detection steps if the sniffer fails.
- Parameters:
- Returns:
A tuple containing:
- The detected delimiter character (e.g., ‘,’).
- A boolean that is True if a header is detected, False otherwise.
- Return type:
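A minimal sketch of sniffing with fallbacks is shown below. It works on a text sample rather than a path, and the helper name, the candidate delimiter set, and both fallback rules are assumptions, not the package's actual heuristics.

```python
import csv


def sniff_delimiter_and_header(sample, candidates=",\t;|"):
    """Return (delimiter, has_header) for a text sample."""
    first = sample.splitlines()[0] if sample else ""
    try:
        delimiter = csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # Fallback (assumption): the candidate seen most often in the first line.
        delimiter = max(candidates, key=first.count)
    try:
        has_header = csv.Sniffer().has_header(sample)
    except csv.Error:
        # Fallback (assumption): any letter in the first line suggests a header.
        has_header = any(ch.isalpha() for ch in first)
    return delimiter, has_header


with_header = (
    "TIMESTAMP,TA,RH\n"
    "202401010000,1.5,80.2\n"
    "202401010030,1.7,79.9\n"
    "202401010100,1.9,79.1\n"
)
no_header = with_header.split("\n", 1)[1]  # same data without the first row

result_with = sniff_delimiter_and_header(with_header)
result_without = sniff_delimiter_and_header(no_header)
```

Restricting the sniffer to a short candidate list keeps it from picking characters such as `.` that happen to occur a consistent number of times in numeric rows.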
- micromet.format.headers.find_header_donor(target, delimiter, expected_cols, min_name_sim=0.4)[source]
Find a peer file to serve as a header “donor”.
Searches the same directory as the target file for a suitable file to borrow a header from. A donor is considered suitable if it:
- Is a file with a common text extension.
- Has a detectable header and the same delimiter.
- Has the same number of columns as the target.
- Has a filename similarity above min_name_sim.
Among candidates, the one with the closest modification time to the target is chosen. Ties are broken by selecting the one with the highest name similarity.
- Parameters:
target (Path) – The path to the file that needs a header.
delimiter (str) – The delimiter used in the target file.
expected_cols (int) – The number of columns in the target file.
min_name_sim (float) – The minimum name similarity ratio (0.0 to 1.0) required for a file to be considered a potential donor. Defaults to 0.4.
- Returns:
A tuple containing the path to the donor file and its raw header line, or None if no suitable donor is found.
- Return type:
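The ranking rule described above (closest modification time first, name similarity as the tie-breaker) can be sketched as follows. `pick_donor` is a hypothetical helper that takes (name, mtime) pairs assumed to have already passed the extension, delimiter, and column-count checks; whether the threshold comparison is strict is also an assumption.

```python
from difflib import SequenceMatcher


def pick_donor(target_name, target_mtime, candidates, min_name_sim=0.4):
    """Return the best donor name from (name, mtime) pairs, or None."""
    scored = []
    for name, mtime in candidates:
        sim = SequenceMatcher(None, target_name, name).ratio()
        # Assumption: a similarity equal to the threshold is accepted.
        if sim >= min_name_sim:
            # Smaller time gap wins; higher similarity breaks ties.
            scored.append((abs(mtime - target_mtime), -sim, name))
    return min(scored)[2] if scored else None


donor = pick_donor(
    "site_Flux_2.dat",
    1000.0,
    [
        ("site_Flux_1.dat", 1500.0),
        ("site_Flux_3.dat", 1100.0),
        ("readme.txt", 1001.0),  # closest in time, but the name is too different
    ],
)
no_donor = pick_donor("site_Flux_2.dat", 1000.0, [("readme.txt", 1001.0)])
```

The similarity filter runs before the time ranking, so an unrelated file cannot win merely because it was modified at nearly the same moment.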
- micromet.format.headers.fix_all_in_parent(parent, searchstr='*_AmeriFluxFormat_*.dat')[source]
Recursively scan a parent directory for files with duplicate names and fix missing headers.
This function searches parent for files matching a given pattern. If duplicate filenames are found such that one version has a header and another does not, the header is copied from the former to the latter. The target files are overwritten in-place, and a .bak backup is created for each.
- Parameters:
- Returns:
A dictionary mapping filenames to lists of paths where they were found.
- Return type:
Notes
Files are grouped by basename and inspected line-by-line to determine whether they contain a header.
If multiple files have headers, only the first one is used as the donor.
Files with no header and no matching header source are skipped.
- micromet.format.headers.fix_directory_pairs(dir_with_headers, dir_without_headers)[source]
Apply headers from a directory of correctly formatted files to a directory of files missing headers.
This function loops through all files in dir_without_headers. For each file that lacks a header, it attempts to find a matching file by name in dir_with_headers and uses it to patch the missing header. The original file is overwritten, and a .bak backup is created.
- Parameters:
- Return type:
Notes
This function assumes that files in both directories are named identically, and that headers can be determined by inspecting the first line of each file.
- micromet.format.headers.get_first_line_raw(path)[source]
Return the first line of a file as raw text, without trailing newlines.
- micromet.format.headers.header_line_is_valid(header_line, delimiter, expected_cols)[source]
Check if a header line has the expected number of columns.
This function properly handles quoted fields.
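To show why quoted fields matter here, the sketch below counts columns with csv.reader instead of str.split. The helper name is hypothetical and this is one plausible way to perform the check, not necessarily the package's.

```python
import csv


def header_has_expected_columns(header_line, delimiter, expected_cols):
    """Count fields with csv.reader so delimiters inside quotes are ignored."""
    fields = next(csv.reader([header_line], delimiter=delimiter))
    return len(fields) == expected_cols


line = 'TIMESTAMP,"TA, air",RH'
naive_count = len(line.split(","))                      # splits inside the quotes
is_valid = header_has_expected_columns(line, ",", 3)    # respects the quotes
```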
- micromet.format.headers.looks_like_header(line, alpha_thresh=0.2)[source]
Heuristically determine if a line appears to be a header.
This function checks if a line from a text file is likely to be a header row by checking for the presence of alphabetic characters.
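One plausible reading of alpha_thresh is a minimum fraction of alphabetic characters in the line. The sketch below implements that reading; the helper name and the exact way the threshold is applied (non-whitespace characters as the denominator, inclusive comparison) are assumptions.

```python
def alpha_fraction_header(line, alpha_thresh=0.2):
    """Treat a line as a header if enough of its characters are letters."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return False
    alpha = sum(c.isalpha() for c in chars)
    # Assumption: the threshold is a fraction of non-whitespace characters.
    return alpha / len(chars) >= alpha_thresh


header_like = alpha_fraction_header("TIMESTAMP,TA,RH")
data_like = alpha_fraction_header("202401010000,1.5,80.2")
```

A fractional threshold, rather than a test for any letter at all, lets numeric rows that contain an occasional exponent marker such as `e` still count as data.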
- micromet.format.headers.name_similarity(a, b)[source]
Calculate the similarity ratio between two strings.
Uses difflib.SequenceMatcher for the comparison.
- micromet.format.headers.open_text(path, encodings=None)[source]
Open a text file, trying a list of encodings until one succeeds.
- Parameters:
- Returns:
An open file object.
- Return type:
- Raises:
Exception – If all attempted encodings fail, the last exception is re-raised.
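The try-each-encoding pattern can be sketched as below. Unlike open_text, this hypothetical helper returns the decoded text and the encoding that worked rather than an open file object, and the default encoding order is an assumption.

```python
import tempfile
from pathlib import Path


def read_text_with_fallback(path, encodings=("utf-8-sig", "utf-8", "latin-1")):
    """Return (text, encoding) using the first encoding that decodes the file."""
    if not encodings:
        raise ValueError("at least one encoding is required")
    last_error = None
    for enc in encodings:
        try:
            return Path(path).read_text(encoding=enc), enc
        except UnicodeDecodeError as exc:
            # Remember the failure and try the next encoding.
            last_error = exc
    # Every encoding failed: re-raise the last error.
    raise last_error


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "latin1.dat"
    sample.write_bytes("café,1.5\n".encode("latin-1"))
    text, used = read_text_with_fallback(sample)
```

Decoding errors surface only when bytes are actually read, so a helper that hands back an open file object has to read some data before it can know the encoding is suitable.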
- micromet.format.headers.patch_file(donor, target)[source]
Apply a header from a donor file to a target file.
This function reads the header from a donor file and applies it to a target file that is assumed to be missing a header. The modified data is returned as a DataFrame and written back to the target file.
- micromet.format.headers.prepend_header_in_place(path, header_line)[source]
Insert a header line at the top of a file.
This function reads the entire file, then writes it back with the provided header line at the beginning. It attempts to preserve the original newline style.
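A minimal sketch of the read-then-rewrite approach follows. It works on bytes so the existing line endings are untouched; the helper name, the CRLF detection rule, and the UTF-8 encoding of the header are assumptions.

```python
import tempfile
from pathlib import Path


def prepend_header(path, header_line):
    """Rewrite the file with header_line first, matching its newline style."""
    raw = Path(path).read_bytes()
    # Assumption: any CRLF in the file means the whole file uses CRLF.
    newline = b"\r\n" if b"\r\n" in raw else b"\n"
    header = header_line.rstrip("\r\n").encode("utf-8")
    Path(path).write_bytes(header + newline + raw)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data.dat"
    target.write_bytes(b"1,2\r\n3,4\r\n")
    prepend_header(target, "A,B")
    patched = target.read_bytes()
```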
- micromet.format.headers.process_file(path, min_sim, make_backup)[source]
Detect and repair a headerless delimited text file in place.
The function inspects path to determine its delimiter and whether the file already contains a header row. If a header is missing, it searches for a “donor” file in the same directory with a compatible delimiter and column count, and with column-name similarity above min_sim. When a donor is found, its header is prepended to path (optionally creating a .bak backup first). Progress is reported via print messages.
- Parameters:
path (Path) – Path to the target text file to check and possibly fix.
min_sim (float) – Minimum similarity threshold (0–1) for column-name matching when selecting a donor header. Higher values are stricter.
make_backup (bool) – If True, write a byte-for-byte backup alongside the file at path.with_suffix(path.suffix + ".bak") before modifying the file.
- Returns:
The file at path may be modified in place as a side effect.
- Return type:
- Raises:
- micromet.format.headers.read_colnames(path)[source]
Read column names from the first line of a file.
This function infers the delimiter, reads the first line of the file, and returns the column names.
- micromet.format.headers.scan(root, min_sim=0.5, backup=False)[source]
Recursively scan a directory tree and fix headerless text files.
Walks root with Path.rglob("*") and applies process_file() to every file whose extension is in {".dat"}. Exceptions raised by process_file() are caught and reported, allowing the scan to continue.
- Parameters:
root (Path) – Directory to search recursively for candidate text files.
min_sim (float) – Minimum column-name similarity (0–1) when selecting a donor header; passed through to process_file().
backup (bool) – If True, create a .bak file for each modified file; passed through to process_file() as make_backup.
- Return type:
- Returns:
None
Side Effects
- May modify files in place by inserting a header line.
- May create .bak files adjacent to modified files when backup=True.
- Prints progress, skip, and error messages to standard output.
micromet.format.merge module
- micromet.format.merge.fillna_with_second_df(df1, df2, suffix1='_df1', suffix2='_df2')[source]
Merges two DataFrames by index, prioritizing data from df1 and using df2 to fill any missing (NaN) values introduced by the outer merge for any columns that match between the two dataframes.
- Parameters:
df1 (DataFrame) – The primary DataFrame whose index and values are prioritized.
df2 (DataFrame) – The secondary DataFrame used to fill NaN values in df1’s columns.
suffix1 (str) – The suffix applied to columns from df1 during the merge. Defaults to ‘_df1’. This suffix is removed from the output. Choose a suffix that does not occur in any column name of either DataFrame.
suffix2 (str) – The suffix applied to columns from df2 during the merge. Defaults to ‘_df2’. These columns are dropped from the output. Choose a suffix that does not occur in any column name of either DataFrame.
- Returns:
A merged DataFrame containing the union of both indices. Columns are filled: df1’s value if present, otherwise df2’s value. The final column names are stripped of suffix1.
- Return type:
Notes
This function assumes that the column names (excluding suffixes) in both DataFrames are the same for matching purposes.
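The merge-then-fill behavior can be sketched with a suffixed outer merge on the index. `fill_from_second` is a hypothetical stand-in: how the real function orders columns, or treats columns that exist in only one frame, is not stated here and is assumed.

```python
import numpy as np
import pandas as pd


def fill_from_second(df1, df2, suffix1="_df1", suffix2="_df2"):
    """Outer-merge on the index, preferring df1 and filling gaps from df2."""
    shared = [c for c in df1.columns if c in df2.columns]
    merged = df1.merge(
        df2, how="outer", left_index=True, right_index=True,
        suffixes=(suffix1, suffix2),
    )
    for col in shared:
        # Take df1's value where present, otherwise fall back to df2's.
        merged[col] = merged[col + suffix1].fillna(merged[col + suffix2])
        merged = merged.drop(columns=[col + suffix1, col + suffix2])
    return merged


df1 = pd.DataFrame({"TA": [1.0, np.nan]}, index=[0, 1])
df2 = pd.DataFrame({"TA": [9.0, 2.0, 3.0]}, index=[0, 1, 2])
result = fill_from_second(df1, df2)
```

Row 0 keeps df1's value, row 1 is filled from df2, and row 2 exists only because the outer merge takes the union of both indices. For shared columns this matches what pandas' `DataFrame.combine_first` does.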
micromet.format.reformatter module
This module provides the Reformatter class for cleaning and standardizing station data for flux/met processing, with integrated timestamp alignment checks.
- class micromet.format.reformatter.Reformatter(var_limits_csv=None, drop_soil=True, check_timestamps=False, site_lat=None, site_lon=None, site_utc_offset=-7, logger=None)[source]
Bases: object
A class to clean and standardize station data for flux/met processing.
This class provides a pipeline for preparing raw station data by applying a series of transformations, including fixing timestamps, renaming columns, applying physical limits, and checking timestamp alignment.
- Parameters:
var_limits_csv (str|Path|None) – Path to a CSV file containing variable limits. If not provided, default limits are used.
drop_soil (bool) – If True, extra soil-related columns are dropped. Defaults to True.
check_timestamps (bool) – If True, perform timestamp alignment analysis on radiation data. Defaults to False.
site_lat (float|None) – Latitude of the site (required if check_timestamps=True).
site_lon (float|None) – Longitude of the site (required if check_timestamps=True).
site_utc_offset (int) – UTC offset in hours for the site (required if check_timestamps=True).
logger (Logger|None) – A logger for tracking the reformatting process. If not provided, a default logger is used.
- logger
The logger used for logging messages.
- Type:
- varlimits
A DataFrame containing the physical limits for each variable.
- Type:
pd.DataFrame
- __init__(var_limits_csv=None, drop_soil=True, check_timestamps=False, site_lat=None, site_lon=None, site_utc_offset=-7, logger=None)[source]
Initialize the Reformatter.
- Parameters:
var_limits_csv (str|Path|None) – Path to a CSV file containing variable limits.
drop_soil (bool) – If True, extra soil-related columns are dropped. Defaults to True.
check_timestamps (bool) – If True, perform timestamp alignment analysis. Defaults to False.
site_lat (float|None) – Latitude of the site (required if check_timestamps=True).
site_lon (float|None) – Longitude of the site (required if check_timestamps=True).
site_utc_offset (int) – UTC offset in hours (required if check_timestamps=True).
logger (Logger|None) – A logger for tracking the reformatting process.
- prepare(df, interval=30, data_type='eddy')[source]
Current method - keep for backward compatibility
- preprocess(df, data_type='eddy', interval=30)[source]
Preprocess the data by applying initial cleaning and standardization steps.
- process(df, interval, data_type='eddy')[source]
Prepare the data by applying a series of cleaning and standardization steps.
This method takes a DataFrame of station data and applies a pipeline of transformations to clean and standardize it. The steps include fixing timestamps, renaming columns, setting numeric types, resampling, applying physical limits, and optionally checking timestamp alignment.
- Parameters:
- Returns:
A tuple containing:
- The prepared DataFrame with standardized and cleaned data.
- A report DataFrame detailing the changes made during the application of physical limits.
- A dictionary with timestamp alignment results (if check_timestamps=True), or None otherwise. Contains keys: ‘summary’, ‘composites’, ‘flags’.
- Return type:
micromet.format.reformatter_vars module
This module contains the configuration dictionary for the data reformatter.
The config dictionary holds several key-value pairs that control the behavior of the data reformatting process. This includes mappings for renaming columns, lists of variables for different data types (e.g., ‘eddy’ and ‘met’), and lists of columns to be dropped.
Module contents
This package contains modules for formatting and transforming data.