Function descriptions¶

class pylenm.PylenmDataFactory(data: pandas.core.frame.DataFrame)¶

Bases: object

Class object that initializes Pylenm given data.

add_dist_to_source(XX, source_coordinate=[436642.7, 3681927.09], col_name='dist_to_source')¶

Adds a column to the data with the distance from each record to the source coordinate.

Parameters

XX (pd.DataFrame) – data with coordinate information
source_coordinate (list, optional) – source coordinate. Defaults to [436642.70,3681927.09].
col_name (str, optional) – name to assign new column. Defaults to ‘dist_to_source’.

Returns

returns original data with additional column with the distance.

Return type

pd.DataFrame
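
As a rough, dependency-free illustration of the computation described above (not pylenm's own code), the sketch below appends a straight-line distance to each record. The 'Easting' and 'Northing' keys and the helper name add_dist_to_source_sketch are assumptions made for this example; the actual method operates on a pd.DataFrame.

```python
import math


def add_dist_to_source_sketch(records, source_coordinate=(436642.70, 3681927.09),
                              col_name="dist_to_source"):
    """Return copies of the records with an added distance column.

    Assumes each record is a dict with 'Easting' and 'Northing' keys
    (hypothetical names chosen for this sketch).
    """
    sx, sy = source_coordinate
    out = []
    for rec in records:
        new_rec = dict(rec)  # copy so the caller's data is left untouched
        new_rec[col_name] = math.hypot(rec["Easting"] - sx, rec["Northing"] - sy)
        out.append(new_rec)
    return out


# Made-up wells: one at the source, one offset by (3, 4) metres.
wells = [
    {"STATION_ID": "W1", "Easting": 436642.70, "Northing": 3681927.09},
    {"STATION_ID": "W2", "Easting": 436645.70, "Northing": 3681931.09},
]
result = add_dist_to_source_sketch(wells)
```

The distances are in the same units as the coordinates, so UTM inputs would give metres.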

cluster_data(data, analyte_name=['ANALYTE_NAME'], n_clusters=4, filter=False, col=None, equals=[], year_interval=5, y_label='Concentration', return_clusters=False)¶

Clusters time series concentration data using kmeans algorithm and plots it.

Parameters

data (pd.DataFrame) – data to be used in clustering.
analyte_name (list, optional) – analytes to use to cluster. Defaults to [“ANALYTE_NAME”].
n_clusters (int, optional) – number of clusters for kmeans. Defaults to 4.
filter (bool, optional) – flag to indicate filtering. Defaults to False.
col (str, optional) – column to filter. Example: col=’STATION_ID’. Defaults to None.
equals (list, optional) – values to filter col by. Examples: equals=[‘FAI001A’, ‘FAI001B’]. Defaults to [].
year_interval (int, optional) – plot x_label interval in years. Defaults to 5.
y_label (str, optional) – y axis label. Defaults to ‘Concentration’.
return_clusters (bool, optional) – flag to return cluster assignment. Defaults to False.

dist(p1, p2)¶

2D Euclidean distance function

Parameters

p1 (tuple) – first point
p2 (tuple) – second point

Returns

Euclidean distance

Return type

float
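
For reference, the 2D Euclidean distance between (x1, y1) and (x2, y2) is sqrt((x2 - x1)^2 + (y2 - y1)^2). A minimal standalone sketch of that formula follows; the name dist_sketch is hypothetical and this is not taken from the pylenm source.

```python
import math


def dist_sketch(p1, p2):
    """2D Euclidean distance between two (x, y) points."""
    return math.hypot(p2[0] - p1[0], p2[1] - p1[1])


d = dist_sketch((0.0, 0.0), (3.0, 4.0))
```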

filter_by_column(data=None, col=None, equals=[])¶

Filters construction data based on one column. You can specify only ONE column to filter by, but can select MANY values for the entry.

Parameters

data (pd.DataFrame, optional) – dataframe to filter. Defaults to None.
col (str, optional) – column to filter. Example: col=’STATION_ID’. Defaults to None.
equals (list, optional) – values to filter col by. Examples: equals=[‘FAI001A’, ‘FAI001B’]. Defaults to [].

Returns

returns filtered dataframe

Return type

pd.DataFrame
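
The one-column, many-values behaviour can be sketched without pandas as below. This is an illustration only: filter_by_column_sketch is a hypothetical helper, rows are plain dicts rather than a DataFrame, and the station 'FAI001C' and the RESULT values are made up.

```python
def filter_by_column_sketch(rows, col, equals):
    """Keep the rows whose value in `col` is one of the values in `equals`."""
    allowed = set(equals)
    return [row for row in rows if row.get(col) in allowed]


rows = [
    {"STATION_ID": "FAI001A", "RESULT": 1.2},
    {"STATION_ID": "FAI001B", "RESULT": 0.8},
    {"STATION_ID": "FAI001C", "RESULT": 2.4},
]
subset = filter_by_column_sketch(rows, col="STATION_ID", equals=["FAI001A", "FAI001B"])
```

With a DataFrame, the same selection is commonly written as data[data[col].isin(equals)].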

filter_wells(units)¶

Returns a list of the well names filtered by the unit(s) specified.

Parameters: units (list) – Letter of the well to be filtered (e.g. [‘A’] or [‘A’, ‘D’])
Returns: well names filtered by the unit(s) specified
Return type: list

fit_gp(X, y, xx, model=None, smooth=True)¶

Fits Gaussian Process for X and y and returns both the GP model and the predicted values

Parameters

X (numpy.array) – array of dimension (number of wells, 2) where each element is a pair of UTM coordinates.
y (numpy.array) – array of size (number of wells) where each value corresponds to a concentration value at a well.
xx (numpy.array) – prediction locations
model (GaussianProcessRegressor, optional) – model to fit. Defaults to None.
smooth (bool, optional) – flag to toggle WhiteKernel on and off. Defaults to True.

Returns

GP model, prediction of xx

Return type

GaussianProcessRegressor, numpy.array

getCleanData(analytes)¶

Creates a table filling the data from the concentration dataset for a given analyte list where the columns are multi-indexed as follows [analytes, well names] and the index is all of the dates in the dataset. Many NaN should be expected.

Parameters: analytes (list) – list of analyte names to use
Returns: pd.DataFrame
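
To make the table layout concrete, here is a small sketch that builds the same shape with plain dicts: one row per date, one column per (analyte, well) pair, and None standing in for NaN. The function name pivot_sketch and the sample values are assumptions; the column names follow those listed under simplify_data.

```python
def pivot_sketch(rows, analytes):
    """Build {date: {(analyte, well): result}} from long-format rows.

    Cells without a measurement hold None (a stand-in for NaN). If a
    (date, analyte, well) combination repeats, the last row wins.
    """
    wanted = set(analytes)
    kept = [r for r in rows if r["ANALYTE_NAME"] in wanted]
    columns = sorted({(r["ANALYTE_NAME"], r["STATION_ID"]) for r in kept})
    dates = sorted({r["COLLECTION_DATE"] for r in rows})  # every date in the data
    table = {d: {c: None for c in columns} for d in dates}
    for r in kept:
        table[r["COLLECTION_DATE"]][(r["ANALYTE_NAME"], r["STATION_ID"])] = r["RESULT"]
    return table


rows = [
    {"COLLECTION_DATE": "2015-01-01", "STATION_ID": "W1", "ANALYTE_NAME": "TRITIUM", "RESULT": 2.0},
    {"COLLECTION_DATE": "2015-01-01", "STATION_ID": "W2", "ANALYTE_NAME": "TRITIUM", "RESULT": 3.0},
    {"COLLECTION_DATE": "2015-02-01", "STATION_ID": "W1", "ANALYTE_NAME": "TRITIUM", "RESULT": 2.5},
    {"COLLECTION_DATE": "2015-02-01", "STATION_ID": "W1", "ANALYTE_NAME": "DEPTH_TO_WATER", "RESULT": 40.0},
]
table = pivot_sketch(rows, ["TRITIUM"])
```

Well W2 has no tritium sample on the second date, so that cell stays empty, which is why many missing values are expected in the real table.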

getCommonDates(analytes, lag=[3, 7, 10])¶

Creates a table which counts the number of wells within a range specified by a list of lag days.

Parameters

analytes (list) – list of analyte names to use
lag (list, optional) – list of days to look ahead and behind the specified date (+/-). Defaults to [3,7,10].

Returns

pd.DataFrame
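
One plausible reading of this count, sketched with the standard library only: for each sampling date and each lag, count the distinct wells that have a sample within +/- lag days. The helper name, the input layout, and the output layout are assumptions for illustration and may differ from the table pylenm returns.

```python
from datetime import date


def count_wells_within_lag(samples, lags=(3, 7, 10)):
    """samples: list of (well_name, date). Returns {date: {lag: n_wells}}."""
    all_dates = sorted({d for _, d in samples})
    table = {}
    for center in all_dates:
        row = {}
        for lag in lags:
            # distinct wells sampled no more than `lag` days from `center`
            wells = {w for w, d in samples if abs((d - center).days) <= lag}
            row[lag] = len(wells)
        table[center] = row
    return table


samples = [
    ("W1", date(2015, 1, 1)),
    ("W2", date(2015, 1, 3)),
    ("W3", date(2015, 1, 9)),
]
table = count_wells_within_lag(samples)
```

Dates whose counts rise sharply with a small lag are good candidates for joint analysis, since many wells were sampled close together.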

getData()¶

Returns the concentration data in pylenm

Returns: concentration data that was passed into pylenm
Return type: pd.DataFrame

getJointData(analytes, lag=3)¶

Creates a table filling the data from the concentration dataset for a given analyte list where the columns are multi-indexed as follows [analytes, well names] and the index is the date ranges specified by the lag.

Parameters

analytes (list) – list of analyte names to use
lag (int, optional) – number of days to look ahead and behind the specified date (+/-). Defaults to 3.

Returns

pd.DataFrame

get_Best_GP(X, y, smooth=True, seed=42)¶

Returns the best Gaussian Process model for a given X and y.

Parameters

X (numpy.array) – array of dimension (number of wells, 2) where each element is a pair of UTM coordinates.
y (numpy.array) – array of size (number of wells) where each value corresponds to a concentration value at a well.
smooth (bool, optional) – flag to toggle WhiteKernel on and off. Defaults to True.
seed (int, optional) – random state setting. Defaults to 42.

Returns

best GP model

Return type

GaussianProcessRegressor

get_Best_Wells(X, y, xx, ref, initial, max_wells, ft=['Elevation'], regression='linear', verbose=True, smooth=True, model=None)¶

Greedy optimization function that selects a subset of wells so as to minimize the MSE relative to a reference map.

Parameters

X (numpy.array) – array of dimension (number of wells, 2) where each element is a pair of UTM coordinates.
y (numpy.array) – array of size (number of wells) where each value corresponds to a concentration value at a well.
xx (numpy.array) – prediction locations
ref (numpy.array) – reference field to optimize for (aka best/true map)
initial (list) – indices of wells as the starting wells for optimization
max_wells (int) – number of wells to optimize for
ft (list, optional) – feature names to train on. Defaults to [‘Elevation’].
regression (str, optional) – choice between ‘linear’ for linear regression, ‘rf’ for random forest regression, ‘ridge’ for ridge regression, or ‘lasso’ for lasso regression. Defaults to ‘linear’.
verbose (bool, optional) – Defaults to True.
smooth (bool, optional) – flag to toggle WhiteKernel on and off. Defaults to True.
model (GaussianProcessRegressor, optional) – model to fit. Defaults to None.

Returns

index of best wells in order from best to worst

Return type

list
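
The greedy idea can be sketched as follows: starting from the initial wells, repeatedly add whichever remaining well lowers the MSE against the reference map the most. To stay self-contained, this sketch swaps the Gaussian Process and regression step for a simple nearest-neighbour interpolator, so it illustrates the selection loop only. All function names here are hypothetical.

```python
import math


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def nearest_neighbor_predict(X, y, xx):
    """Stand-in interpolator: each prediction copies the closest well's value."""
    preds = []
    for q in xx:
        nearest = min(range(len(X)), key=lambda i: math.dist(X[i], q))
        preds.append(y[nearest])
    return preds


def greedy_select(X, y, xx, ref, initial, max_wells):
    """Greedily add the well that most reduces the MSE against `ref`."""
    selected = list(initial)
    while len(selected) < max_wells:
        best_idx, best_err = None, float("inf")
        for cand in range(len(X)):
            if cand in selected:
                continue
            trial = selected + [cand]
            pred = nearest_neighbor_predict(
                [X[i] for i in trial], [y[i] for i in trial], xx
            )
            err = mse(ref, pred)
            if err < best_err:
                best_idx, best_err = cand, err
        if best_idx is None:  # no candidates left
            break
        selected.append(best_idx)
    return selected


# Made-up wells on a line; the reference map is the full set of values.
X = [(0.0, 0.0), (1.0, 0.0), (2.5, 0.0), (6.0, 0.0)]
y = [0.0, 1.0, 2.5, 6.0]
selected = greedy_select(X, y, xx=X, ref=y, initial=[0], max_wells=3)
```

Each pass refits the interpolator once per remaining candidate, so the cost grows roughly with the number of wells times max_wells. With a Gaussian Process in place of the stand-in, each of those fits is considerably more expensive.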

get_Construction_Data()¶

Returns the construction data in pylenm

Returns: construction data that was passed into pylenm
Return type: pd.DataFrame

get_MCL(analyte_name)¶

Returns the Maximum Concentration Limit value for the specified analyte. Example: ‘TRITIUM’ returns 1.3

Parameters: analyte_name (str) – name of the analyte to be processed
Returns: MCL value
Return type: float

get_analyte_details(analyte_name, filter=False, col=None, equals=[], save_to_file=False, save_dir='analyte_details')¶

Returns a table with details pertaining to the specified analyte and, if save_to_file is True, saves it as a csv file to save_dir. Details include the well names, the date ranges and the number of unique samples.

Parameters

analyte_name (str) – name of the analyte to be processed
filter (bool, optional) – whether to filter the data. Defaults to False.
col (str, optional) – column to filter. Example: col=’STATION_ID’. Defaults to None.
equals (list, optional) – values to filter col by. Examples: equals=[‘FAI001A’, ‘FAI001B’]. Defaults to [].
save_to_file (bool, optional) – whether to save data to file. Defaults to False.
save_dir (str, optional) – name of the directory you want to save the csv file to. Defaults to ‘analyte_details’.

Returns

Table with well information

Return type

pd.DataFrame

get_data_summary(analytes=None, sort_by='date', ascending=False, filter=False, col=None, equals=[])¶

Returns a dataframe with a summary of the data for certain analytes. Summary includes the date ranges and the number of unique samples and other statistics for the analyte results.

Parameters

analytes (list, optional) – list of analyte names to be processed. If left empty, a list of all the analytes in the data will be used. Defaults to None.
sort_by (str, optional) – {‘date’, ‘samples’, ‘wells’} sorts the data by either the dates by entering: ‘date’, the samples by entering: ‘samples’, or by unique well locations by entering ‘wells’. Defaults to ‘date’.
ascending (bool, optional) – flag to sort in ascending order. Defaults to False.
filter (bool, optional) – flag to indicate filtering. Defaults to False.
col (str, optional) – column to filter. Example: col=’STATION_ID’. Defaults to None.
equals (list, optional) – values to filter col by. Examples: equals=[‘FAI001A’, ‘FAI001B’]. Defaults to [].

Returns

Table with well information

Return type

pd.DataFrame

get_unit(analyte_name)¶

Returns the unit of the analyte you specify. Example: ‘DEPTH_TO_WATER’ may return ‘ft’

Parameters: analyte_name (str) – name of the analyte to be processed
Returns: unit of analyte
Return type: str

get_well_analytes(well_name=None, filter=False, col=None, equals=[])¶

Displays the analyte names available at given well locations.

Parameters

well_name (str, optional) – name of the well. If left empty, all wells are returned. Defaults to None.
filter (bool, optional) – flag to indicate filtering. Defaults to False.
col (str, optional) – column to filter. Example: col=’STATION_ID’. Defaults to None.
equals (list, optional) – values to filter col by. Examples: equals=[‘FAI001A’, ‘FAI001B’]. Defaults to [].

Returns

None

interpolate_topo(X, y, xx, ft=['Elevation'], model=None, smooth=True, regression='linear', seed=42)¶

Spatially interpolates the water table as a function of topographic metrics using a Gaussian Process. Uses regression to generate a trendline and adds the values to the GP map.

Parameters

X (numpy.array) – training values. Must include “Easting” and “Northing” columns.
y (numpy.array) – array of size (number of wells) where each value corresponds to a concentration value at a well.
xx (numpy.array) – prediction locations
ft (list, optional) – feature names to train on. Defaults to [‘Elevation’].
model (GaussianProcessRegressor, optional) – model to fit. Defaults to None.
smooth (bool, optional) – flag to toggle WhiteKernel on and off. Defaults to True.
regression (str, optional) – choice between ‘linear’ for linear regression, ‘rf’ for random forest regression, ‘ridge’ for ridge regression, or ‘lasso’ for lasso regression. Defaults to ‘linear’.
seed (int, optional) – random state setting. Defaults to 42.

Returns

prediction of locations xx

Return type

numpy.array

interpolate_well_data(well_name, analytes, frequency='2W')¶

Resamples the data based on the frequency specified and interpolates the values of the analytes.

Parameters

well_name (str) – name of the well to be processed.
analytes (list) – list of analyte names to use
frequency (str, optional) – {‘D’, ‘W’, ‘M’, ‘Y’} frequency to interpolate. See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html for valid frequency inputs (e.g. ‘W’ = every week, ‘D’ = every day, ‘2W’ = every 2 weeks). Defaults to ‘2W’.

Returns

pd.DataFrame
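
The resample-then-interpolate idea can be illustrated with a standard-library sketch that places irregular samples on a regular grid by linear interpolation. It is not pylenm's implementation: the helper name and the step_days parameter are assumptions, and pandas aligns its resampling bins differently from the simple grid used here.

```python
from datetime import date, timedelta


def resample_linear(samples, step_days=14):
    """Linearly interpolate (date, value) samples onto a regular grid.

    Needs at least two samples with distinct dates. The grid starts at the
    first sample date and advances by `step_days` up to the last sample date.
    """
    samples = sorted(samples)
    end = samples[-1][0]
    grid = samples[0][0]
    out = []
    i = 0
    while grid <= end:
        # move to the segment [samples[i], samples[i + 1]] that contains `grid`
        while i < len(samples) - 2 and samples[i + 1][0] < grid:
            i += 1
        (d0, v0), (d1, v1) = samples[i], samples[i + 1]
        frac = (grid - d0).days / (d1 - d0).days
        out.append((grid, v0 + frac * (v1 - v0)))
        grid += timedelta(days=step_days)
    return out


# Two made-up measurements four weeks apart, resampled every two weeks.
samples = [(date(2020, 1, 1), 10.0), (date(2020, 1, 29), 38.0)]
resampled = resample_linear(samples, step_days=14)
```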

interpolate_wells_by_analyte(analyte, frequency='2W', rm_outliers=True, z_threshold=3)¶

Resamples analyte data based on the frequency specified and interpolates the values in between. NaN values are replaced with the average value per well.

Parameters

analyte (str) – analyte name for interpolation of all present wells.
frequency (str, optional) – {‘D’, ‘W’, ‘M’, ‘Y’} frequency to interpolate. See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html for valid frequency inputs (e.g. ‘W’ = every week, ‘D’ = every day, ‘2W’ = every 2 weeks). Defaults to ‘2W’.
rm_outliers (bool, optional) – flag to remove outliers in the data. Defaults to True.
z_threshold (int, optional) – z_score threshold to eliminate outliers. Defaults to 3.

Returns

interpolated data

Return type

pd.DataFrame

jointData_is_set(lag)¶

Checks to see if getJointData function was already called and saved for given lag.

Parameters: lag (int) – number of days to look ahead and behind the specified date (+/-)
Returns: True if JointData was already calculated, False otherwise.
Return type: bool

mse(y_true, y_pred)¶

Error Metric: Mean Squared Error

Parameters

y_true (numpy.array) – true values
y_pred (numpy.array) – predicted values

Returns

mean squared error

Return type

float
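
Mean squared error is the average of the squared differences between true and predicted values. A minimal standalone sketch, with the hypothetical name mse_sketch and plain lists instead of numpy arrays:

```python
def mse_sketch(y_true, y_pred):
    """Mean of the squared differences between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


value = mse_sketch([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Only the last pair differs, by 2, so the result is 4 / 3.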

plot_MCL(well_name, analyte_name, year_interval=5, save_dir='plot_MCL')¶

Plots the linear regression line of data given the analyte_name and well_name. The plot includes the prediction where the line of best fit intersects with the Maximum Concentration Limit (MCL).

Parameters

well_name (str) – name of the well to be processed
analyte_name (str) – name of the analyte to be processed
year_interval (int, optional) – interval, in years, between x axis labels (e.g. 1 = every year, 5 = every 5 years, …). Defaults to 5.
save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_MCL’.
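
The prediction mentioned above amounts to fitting a line to the time series and solving for the time at which it reaches the limit. A minimal sketch of that arithmetic, using ordinary least squares written out by hand; the function names, the sample values, and the limit of 1.0 are made up, and whether pylenm fits raw or log-transformed concentrations is not assumed here.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def crossing_time(slope, intercept, limit):
    """Time at which the fitted line equals `limit`, or None for a flat line."""
    if slope == 0:
        return None
    return (limit - intercept) / slope


years = [2000.0, 2005.0, 2010.0, 2015.0]
conc = [5.0, 4.0, 3.0, 2.0]  # made-up, steadily declining values
slope, intercept = fit_line(years, conc)
year_at_limit = crossing_time(slope, intercept, limit=1.0)
```

Here the values fall by 0.2 per year, so the line reaches 1.0 in 2020. A crossing that lies before the last observation, or a rising trend, would need separate handling.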

plot_PCA_by_date(date, analytes, lag=0, n_clusters=4, return_clusters=False, min_samples=3, show_labels=True, save_dir='plot_PCA_by_date', filter=False, col=None, equals=[])¶

Generates a PCA biplot (PCA score plot + loading plot) of the data given a date in the dataset. The data is also clustered into n_clusters.

Parameters

date (str) – date to be analyzed
analytes (list) – list of analyte names to use
lag (int, optional) – number of days to look ahead and behind the specified date (+/-). Defaults to 0.
n_clusters (int, optional) – number of clusters to split the data into. Defaults to 4.
return_clusters (bool, optional) – flag to return the cluster data to be used for spatial plotting. Defaults to False.
min_samples (int, optional) – minimum number of samples the result should contain in order to execute. Defaults to 3.
show_labels (bool, optional) – choose whether or not to show the name of the wells. Defaults to True.
save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_PCA_by_date’.
filter (bool, optional) – flag to indicate filtering. Defaults to False.
col (str, optional) – column to filter. Example: col=’STATION_ID’. Defaults to None.
equals (list, optional) – values to filter col by. Examples: equals=[‘FAI001A’, ‘FAI001B’]. Defaults to [].

plot_PCA_by_well(well_name, analytes, interpolate=False, frequency='2W', min_samples=10, show_labels=True, save_dir='plot_PCA_by_well')¶

Generates a PCA biplot (PCA score plot + loading plot) of the data given a well_name in the dataset. Only uses the 6 important analytes.

Parameters

well_name (str) – name of the well to be processed
analytes (list) – list of analyte names to use
interpolate (bool, optional) – choose to interpolate the data. Defaults to False.
frequency (str, optional) – {‘D’, ‘W’, ‘M’, ‘Y’} frequency to interpolate. See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html for valid frequency inputs (e.g. ‘W’ = every week, ‘D’ = every day, ‘2W’ = every 2 weeks). Defaults to ‘2W’.
min_samples (int, optional) – minimum number of samples the result should contain in order to execute. Defaults to 10.
show_labels (bool, optional) – choose whether or not to show the name of the wells. Defaults to True.
save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_PCA_by_well’.

plot_PCA_by_year(year, analytes, n_clusters=4, return_clusters=False, min_samples=10, show_labels=True, save_dir='plot_PCA_by_year', filter=False, col=None, equals=[])¶

Generates a PCA biplot (PCA score plot + loading plot) of the data given a year in the dataset. The data is also clustered into n_clusters.

Parameters

year (int) – year to be analyzed
analytes (list) – list of analyte names to use
n_clusters (int, optional) – number of clusters to split the data into. Defaults to 4.
return_clusters (bool, optional) – flag to return the cluster data to be used for spatial plotting. Defaults to False.
min_samples (int, optional) – minimum number of samples the result should contain in order to execute. Defaults to 10.
show_labels (bool, optional) – choose whether or not to show the name of the wells. Defaults to True.
save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_PCA_by_year’.
filter (bool, optional) – flag to indicate filtering. Defaults to False.
col (str, optional) – column to filter. Example: col=’STATION_ID’. Defaults to None.
equals (list, optional) – values to filter col by. Examples: equals=[‘FAI001A’, ‘FAI001B’]. Defaults to [].

plot_all_corr_by_well(analytes, remove_outliers=True, z_threshold=4, interpolate=False, frequency='2W', save_dir='plot_correlation', log_transform=False, fontsize=20)¶

Plots the correlations with the physical plots as well as the important analytes over time for each well in the dataset.

Parameters

analytes (list) – list of analyte names to use
remove_outliers (bool, optional) – choose whether or not to remove the outliers. Defaults to True.
z_threshold (int, optional) – z_score threshold to eliminate outliers. Defaults to 4.
interpolate (bool, optional) – choose whether or not to interpolate the data. Defaults to False.
frequency (str, optional) – {‘D’, ‘W’, ‘M’, ‘Y’} frequency to interpolate. Note: See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html for valid frequency inputs (e.g. ‘W’ = every week, ‘D’ = every day, ‘2W’ = every 2 weeks). Defaults to ‘2W’.
save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_correlation’.
log_transform (bool, optional) – flag for log base 10 transformation. Defaults to False.
fontsize (int, optional) – font size. Defaults to 20.

plot_all_correlation_heatmap(show_symmetry=True, color=True, save_dir='plot_correlation_heatmap')¶

Plots a heatmap of the correlations of the important analytes over time for each well in the dataset.

Parameters

show_symmetry (bool, optional) – choose whether or not the heatmap should show the same information twice over the diagonal. Defaults to True.
color (bool, optional) – choose whether or not the plot should be in color or in greyscale. Defaults to True.
save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_correlation_heatmap’.

plot_all_data(log_transform=True, alpha=0, year_interval=2, plot_inline=True, save_dir='plot_data')¶

Plot concentrations over time for every well and analyte with a smoothed curve on interpolated data points.

Parameters

log_transform (bool, optional) – choose whether or not the data should be transformed to log base 10 values. Defaults to True.
alpha (int, optional) – value between 0 and 10 for line smoothing. Defaults to 0.
plot_inline (bool, optional) – choose whether or not to show plot inline. Defaults to True.
year_interval (int, optional) – interval, in years, between x axis labels (e.g. 1 = every year, 5 = every 5 years, …). Defaults to 2.
save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_data’.

plot_all_time_series(analyte_name=None, title='Dataset: Time ranges', x_label='Well', y_label='Year', x_label_size=8, marker_size=30, min_days=10, x_min_lim=None, x_max_lim=None, y_min_date=None, y_max_date=None, sort_by_distance=True, source_coordinate=[436642.7, 3681927.09], log_transform=False, cmap=<matplotlib.colors.LinearSegmentedColormap object>, drop_cols=[], return_data=False, filter=False, col=None, equals=[], cbar_min=None, cbar_max=None, reverse_y_axis=False, fontsize=20, figsize=(20, 6), dpi=300, y_2nd_label=None)¶

Plots the start and end date of analyte readings for different locations/sensors/wells with colored concentration reading.

Parameters

analyte_name (str, optional) – analyte to examine. Defaults to None.
title (str, optional) – plot title. Defaults to ‘Dataset: Time ranges’.
x_label (str, optional) – x axis label. Defaults to ‘Well’.
y_label (str, optional) – y axis label. Defaults to ‘Year’.
x_label_size (int, optional) – x axis label font size. Defaults to 8.
marker_size (int, optional) – point size for time series. Defaults to 30.
min_days (int, optional) – minimum number of days required to plot the time series. Defaults to 10.
x_min_lim (int, optional) – x axis starting point. Defaults to None.
x_max_lim (int, optional) – x axis ending point. Defaults to None.
y_min_date (str, optional) – y axis starting date. Defaults to None.
y_max_date (str, optional) – y axis ending date. Defaults to None.
sort_by_distance (bool, optional) – flag to sort by distance from source center. Defaults to True.
source_coordinate (list, optional) – Easting, Northing coordinate of source center. Defaults to [436642.70,3681927.09].
log_transform (bool, optional) – flag to toggle log base 10 transformation. Defaults to False.
cmap (cmap, optional) – color map for plotting. Defaults to mpl.cm.rainbow.
drop_cols (list, optional) – columns, usually wells, to exclude. Defaults to [].
return_data (bool, optional) – flag to return data. Defaults to False.
filter (bool, optional) – flag to indicate filtering. Defaults to False.
col (str, optional) – column to filter. Example: col=’STATION_ID’. Defaults to None.
equals (list, optional) – values to filter col by. Examples: equals=[‘FAI001A’, ‘FAI001B’]. Defaults to [].
cbar_min (float, optional) – color bar lower boundary. Defaults to None.
cbar_max (float, optional) – color bar upper boundary. Defaults to None.
reverse_y_axis (bool, optional) – flag that reverses y axis. Defaults to False.
fontsize (int, optional) – plot font size. Defaults to 20.
figsize (tuple, optional) – matplotlib style figure size. Defaults to (20,6).
dpi (int, optional) – DPI of figure. Defaults to 300.
y_2nd_label (str, optional) – color bar label manual override. Defaults to None.

plot_all_time_series_simple(analyte_name=None, start_date=None, end_date=None, title='Dataset: Time ranges', x_label='Well', y_label='Year', min_days=10, x_min_lim=-5, x_max_lim=170, y_min_date='1988-01-01', y_max_date='2020-01-01', return_data=False, filter=False, col=None, equals=[])¶

Plots the start and end date of analyte readings for different locations/sensors/wells.

Parameters

analyte_name (str, optional) – analyte to examine. Defaults to None.
start_date (str, optional) – start date of horizontal time to show alignment. Defaults to None.
end_date (str, optional) – end date of horizontal time to show alignment. Defaults to None.
title (str, optional) – plot title. Defaults to ‘Dataset: Time ranges’.
x_label (str, optional) – x axis label. Defaults to ‘Well’.
y_label (str, optional) – y axis label. Defaults to ‘Year’.
min_days (int, optional) – minimum number of days required to plot the time series. Defaults to 10.
x_min_lim (int, optional) – x axis starting point. Defaults to -5.
x_max_lim (int, optional) – x axis ending point. Defaults to 170.
y_min_date (str, optional) – y axis starting date. Defaults to ‘1988-01-01’.
y_max_date (str, optional) – y axis ending date. Defaults to ‘2020-01-01’.
return_data (bool, optional) – flag to return data. Defaults to False.
filter (bool, optional) – flag to indicate filtering. Defaults to False.
col (str, optional) – column to filter. Example: col=’STATION_ID’. Defaults to None.
equals (list, optional) – values to filter col by. Examples: equals=[‘FAI001A’, ‘FAI001B’]. Defaults to [].

plot_coordinates_to_map(gps_data, center=[33.271459, -81.675873], zoom=14) → ipyleaflet.leaflet.Map¶

Plots the well locations on an interactive map given coordinates.

Parameters

gps_data (pd.DataFrame) – Data frame with the following column names: station_id, latitude, longitude, color. If the color column is not passed, the default color will be blue.
center (list, optional) – latitude and longitude coordinates to center the map view. Defaults to [33.271459, -81.675873].
zoom (int, optional) – value to determine the initial scale of the map. Defaults to 14.

Returns

ipyleaflet.Map

plot_corr_by_date_range(date, analytes, lag=0, min_samples=10, save_dir='plot_corr_by_date', log_transform=False, fontsize=20, returnData=False, no_log=None)¶

Plots the correlations with the physical plots as well as the correlations of the important analytes for ALL the wells on a specified date, or over a range of dates if a lag greater than 0 is specified.

Parameters

date (str) – date to be analyzed
analytes (list) – list of analyte names to use
lag (int, optional) – number of days to look ahead and behind the specified date (+/-). Defaults to 0.
min_samples (int, optional) – minimum number of samples the result should contain in order to execute. Defaults to 10.
save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_corr_by_date’.
log_transform (bool, optional) – flag for log base 10 transformation. Defaults to False.
fontsize (int, optional) – font size. Defaults to 20.
returnData (bool, optional) – flag to return data used to perform correlation analysis. Defaults to False.
no_log (list, optional) – list of column names to not apply log transformation to. Defaults to None.

plot_corr_by_well(well_name, analytes, remove_outliers=True, z_threshold=4, interpolate=False, frequency='2W', save_dir='plot_correlation', log_transform=False, fontsize=20, returnData=False, remove=[], no_log=None)¶

Plots the correlations with the physical plots as well as the correlations of the important analytes over time for a specified well.

Parameters

well_name (str) – name of the well to be processed
analytes (list) – list of analyte names to use
remove_outliers (bool, optional) – choose whether or not to remove the outliers. Defaults to True.
z_threshold (int, optional) – z_score threshold to eliminate outliers. Defaults to 4.
interpolate (bool, optional) – choose whether or not to interpolate the data. Defaults to False.
frequency (str, optional) – {‘D’, ‘W’, ‘M’, ‘Y’} frequency to interpolate. Note: See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html for valid frequency inputs (e.g. ‘W’ = every week, ‘D’ = every day, ‘2W’ = every 2 weeks). Defaults to ‘2W’.
save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_correlation’.
log_transform (bool, optional) – flag for log base 10 transformation. Defaults to False.
fontsize (int, optional) – font size. Defaults to 20.
returnData (bool, optional) – flag to return data used to perform correlation analysis. Defaults to False.
remove (list, optional) – wells to remove. Defaults to [].
no_log (list, optional) – list of column names to not apply log transformation to. Defaults to None.

Returns

None

plot_corr_by_year(year, analytes, remove_outliers=True, z_threshold=4, min_samples=10, save_dir='plot_corr_by_year', log_transform=False, fontsize=20, returnData=False, no_log=None)¶

Plots the correlations with the physical plots as well as the correlations of the important analytes for ALL the wells in specified year.

Parameters

year (int) – year to be analyzed
analytes (list) – list of analyte names to use
remove_outliers (bool, optional) – choose whether or not to remove the outliers. Defaults to True.
z_threshold (int, optional) – z_score threshold to eliminate outliers. Defaults to 4.
min_samples (int, optional) – minimum number of samples the result should contain in order to execute. Defaults to 10.
save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_corr_by_year’.
log_transform (bool, optional) – flag for log base 10 transformation. Defaults to False.
fontsize (int, optional) – font size. Defaults to 20.
returnData (bool, optional) – flag to return data used to perform correlation analysis. Defaults to False.
no_log (list, optional) – list of column names to not apply log transformation to. Defaults to None.

plot_correlation_heatmap(well_name, show_symmetry=True, color=True, save_dir='plot_correlation_heatmap')¶

Plots a heatmap of the correlations of the important analytes over time for a specified well.

Parameters

well_name (str) – name of the well to be processed
show_symmetry (bool, optional) – choose whether or not the heatmap should show the same information twice over the diagonal. Defaults to True.
color (bool, optional) – choose whether or not the plot should be in color or in greyscale. Defaults to True.
save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_correlation_heatmap’.

Returns

None

plot_data(well_name, analyte_name, log_transform=True, alpha=0, plot_inline=True, year_interval=2, x_label='Years', y_label='', save_dir='plot_data', filter=False, col=None, equals=[])¶

Plot concentrations over time of a specified well and analyte with a smoothed curve on interpolated data points.

Parameters

well_name (str) – name of the well to be processed
analyte_name (str) – name of the analyte to be processed
log_transform (bool, optional) – choose whether or not the data should be transformed to log base 10 values. Defaults to True.
alpha (int, optional) – value between 0 and 10 for line smoothing. Defaults to 0.
plot_inline (bool, optional) – choose whether or not to show plot inline. Defaults to True.
year_interval (int, optional) – interval, in years, between x axis labels (e.g. 1 = every year, 5 = every 5 years, …). Defaults to 2.
x_label (str, optional) – x axis label. Defaults to ‘Years’.
y_label (str, optional) – y axis label. Defaults to ‘’.
save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_data’.
filter (bool, optional) – flag to indicate filtering. Defaults to False.
col (str, optional) – column to filter. Example: col=’STATION_ID’. Defaults to None.
equals (list, optional) – values to filter col by. Examples: equals=[‘FAI001A’, ‘FAI001B’]. Defaults to [].

Returns

None

query_data(well_name, analyte_name)¶

Filters the data by the specified well_name and analyte_name.

Parameters

well_name (str) – name of the well to be processed
analyte_name (str) – name of the analyte to be processed

Returns

filtered data based on query conditions

Return type

pd.DataFrame

remove_outliers(data, z_threshold=4)¶

Removes outliers from a dataframe based on the z_scores and returns the new dataframe.

Parameters

data (pd.DataFrame) – data from which the outliers are to be removed
z_threshold (int, optional) – z_score threshold to eliminate. Defaults to 4.

Returns

data with outliers removed

Return type

pd.DataFrame
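
Z-score filtering keeps the values whose distance from the mean, measured in standard deviations, does not exceed the threshold. The sketch below shows the idea on a plain list. It assumes the population standard deviation and a "keep if |z| <= threshold" rule, either of which may differ from pylenm's implementation, and remove_outliers_sketch is a hypothetical name.

```python
import statistics


def remove_outliers_sketch(values, z_threshold=4):
    """Drop values whose absolute z-score exceeds `z_threshold`."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumed)
    if std == 0:
        return list(values)  # all values identical, nothing to remove
    return [v for v in values if abs((v - mean) / std) <= z_threshold]


# A lower threshold is used here because the sample is tiny.
cleaned = remove_outliers_sketch([1, 2, 3, 2, 1, 100], z_threshold=2)
```

Under these assumptions no z-score in a series of n values can exceed sqrt(n - 1), so a series needs at least 18 values before a threshold of 4 can remove anything.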

setConstructionData(construction_data: pandas.core.frame.DataFrame, verbose=True)¶

Imports the additional well information as a separate DataFrame.

Parameters

construction_data (pd.DataFrame) – Data with additional details.
verbose (bool, optional) – Prints success message. Defaults to True.

Returns

None

setData(data: pandas.core.frame.DataFrame, verbose: bool = True) → None¶

Saves the dataset into pylenm

Parameters

data (pd.DataFrame) – Dataset to be imported.
verbose (bool, optional) – Prints success message. Defaults to True.

Returns

None

simplify_data(data=None, inplace=False, columns=None, save_csv=False, file_name='data_simplified', save_dir='data/')¶

Removes all columns except ‘COLLECTION_DATE’, ‘STATION_ID’, ‘ANALYTE_NAME’, ‘RESULT’, and ‘RESULT_UNITS’.

If the user specifies additional columns in addition to the ones listed above, those columns will be kept. The function returns a dataframe and has an optional parameter to be able to save the dataframe to a csv file.

Parameters

data (pd.DataFrame, optional) – data to simplify. Defaults to None.
inplace (bool, optional) – save data to current working dataset. Defaults to False.
columns (list, optional) – list of any additional columns on top of [‘COLLECTION_DATE’, ‘STATION_ID’, ‘ANALYTE_NAME’, ‘RESULT’, and ‘RESULT_UNITS’] to be kept in the dataframe. Defaults to None.
save_csv (bool, optional) – flag to determine whether or not to save the dataframe to a csv file. Defaults to False.
file_name (str, optional) – name of the csv file you want to save. Defaults to ‘data_simplified’.
save_dir (str, optional) – name of the directory you want to save the csv file to. Defaults to ‘data/’.

Returns

pd.DataFrame
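
The column selection described above can be sketched on plain dicts as follows. This illustrates the keep-core-plus-extras rule only. The helper name simplify_rows, the extra columns 'LAB_CODE' and 'COMMENT', and the sample values are made up for the example.

```python
CORE_COLUMNS = ["COLLECTION_DATE", "STATION_ID", "ANALYTE_NAME", "RESULT", "RESULT_UNITS"]


def simplify_rows(rows, columns=None):
    """Keep the core columns plus any extra columns the caller asks for."""
    extras = [c for c in (columns or []) if c not in CORE_COLUMNS]
    keep = CORE_COLUMNS + extras
    return [{k: row[k] for k in keep if k in row} for row in rows]


rows = [
    {
        "COLLECTION_DATE": "2015-01-01",
        "STATION_ID": "FAI001A",
        "ANALYTE_NAME": "TRITIUM",
        "RESULT": 2.0,
        "RESULT_UNITS": "pCi/mL",
        "LAB_CODE": "X1",   # hypothetical extra column that is kept
        "COMMENT": "dup",   # hypothetical extra column that is dropped
    }
]
simplified = simplify_rows(rows, columns=["LAB_CODE"])
```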