Function descriptions

class pylenm.PylenmDataFactory(data: pandas.core.frame.DataFrame)

Bases: object

Class object that initializes Pylenm given data.

add_dist_to_source(XX, source_coordinate=[436642.7, 3681927.09], col_name='dist_to_source')

adds column to data with the distance of a record to the source coordinate

Parameters
  • XX (pd.DataFrame) – data with coordinate information

  • source_coordinate (list, optional) – source coordinate. Defaults to [436642.70,3681927.09].

  • col_name (str, optional) – name to assign new column. Defaults to ‘dist_to_source’.

Returns

returns original data with additional column with the distance.

Return type

pd.DataFrame
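The calculation described above can be sketched without pylenm as follows. This is only an illustration: the helper name `add_distance_column` and the coordinate column names `Easting` and `Northing` are assumptions, and the real method may expect different column labels.

```python
import numpy as np
import pandas as pd

def add_distance_column(df, source_coordinate=(436642.70, 3681927.09),
                        col_name="dist_to_source"):
    """Return a copy of df with the straight-line distance to the source."""
    out = df.copy()
    # Assumed column names; adjust to match the coordinate columns in your data.
    dx = out["Easting"] - source_coordinate[0]
    dy = out["Northing"] - source_coordinate[1]
    out[col_name] = np.hypot(dx, dy)
    return out

# Two made-up wells: one at the source, one offset by (3, 4).
wells = pd.DataFrame({
    "STATION_ID": ["W1", "W2"],
    "Easting": [436642.70, 436645.70],
    "Northing": [3681927.09, 3681931.09],
})
result = add_distance_column(wells)
```

The input frame is copied, so the original data keeps its columns unchanged.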

cluster_data(data, analyte_name=['ANALYTE_NAME'], n_clusters=4, filter=False, col=None, equals=[], year_interval=5, y_label='Concentration', return_clusters=False)

Clusters time series concentration data using kmeans algorithm and plots it.

Parameters
  • data (pd.DataFrame) – data to be used in clustering.

  • analyte_name (list, optional) – analytes to use to cluster. Defaults to [“ANALYTE_NAME”].

  • n_clusters (int, optional) – number of clusters for kmeans. Defaults to 4.

  • filter (bool, optional) – flag to indicate filtering. Defaults to False.

  • col (str, optional) – column to filter. Example: col=’STATION_ID’. Defaults to None.

  • equals (list, optional) – values to filter col by. Examples: equals=[‘FAI001A’, ‘FAI001B’]. Defaults to [].

  • year_interval (int, optional) – plot x_label interval in years. Defaults to 5.

  • y_label (str, optional) – y axis label. Defaults to ‘Concentration’.

  • return_clusters (bool, optional) – flag to return cluster assignment. Defaults to False.

dist(p1, p2)

2D Euclidean distance function

Parameters
  • p1 (tuple) – first point

  • p2 (tuple) – second point

Returns

Euclidean distance

Return type

float

filter_by_column(data=None, col=None, equals=[])

Filters construction data based on one column. You can specify only ONE column to filter by, but can select MANY values for the entry.

Parameters
  • data (pd.DataFrame, optional) – dataframe to filter. Defaults to None.

  • col (str, optional) – column to filter. Example: col=’STATION_ID’. Defaults to None.

  • equals (list, optional) – values to filter col by. Examples: equals=[‘FAI001A’, ‘FAI001B’]. Defaults to [].

Returns

returns filtered dataframe

Return type

pd.DataFrame
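A one-column, many-values filter of this kind is commonly written with `isin` in pandas. The sketch below is not pylenm's code; the helper `filter_rows` and the `AQUIFER` column with its values are hypothetical and exist only to make the example runnable.

```python
import pandas as pd

def filter_rows(df, col, equals):
    """Keep the rows whose value in `col` is one of the entries in `equals`."""
    return df[df[col].isin(equals)]

# Made-up construction data for illustration.
construction = pd.DataFrame({
    "STATION_ID": ["FAI001A", "FAI001B", "FAI002A"],
    "AQUIFER": ["UAZ", "LAZ", "UAZ"],
})
subset = filter_rows(construction, col="STATION_ID",
                     equals=["FAI001A", "FAI001B"])
```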

filter_wells(units)

Returns a list of the well names filtered by the unit(s) specified.

Parameters

units (list) – Letter of the well to be filtered (e.g. [‘A’] or [‘A’, ‘D’])

Returns

well names filtered by the unit(s) specified

Return type

list

fit_gp(X, y, xx, model=None, smooth=True)

Fits Gaussian Process for X and y and returns both the GP model and the predicted values

Parameters
  • X (numpy.array) – array of dimension (number of wells, 2) where each element is a pair of UTM coordinates.

  • y (numpy.array) – array of size (number of wells) where each value corresponds to a concentration value at a well.

  • xx (numpy.array) – prediction locations

  • model (GaussianProcessRegressor, optional) – model to fit. Defaults to None.

  • smooth (bool, optional) – flag to toggle WhiteKernel on and off. Defaults to True.

Returns

GP model, prediction of xx

Return type

GaussianProcessRegressor, numpy.array
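For orientation, a fit of this shape can be sketched directly with scikit-learn. The kernel choice (a Matern kernel, with a WhiteKernel added when `smooth` is true), the hyperparameter values, and the function name `fit_gp_sketch` are assumptions for the example and may differ from what pylenm does internally.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

def fit_gp_sketch(X, y, xx, smooth=True):
    """Fit a GP on well coordinates X and values y, then predict at xx."""
    kernel = Matern(length_scale=50.0, nu=1.5)  # assumed kernel
    if smooth:
        # The white-noise term lets the GP smooth over measurement noise.
        kernel = kernel + WhiteKernel(noise_level=1.0)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True,
                                  random_state=42)
    gp.fit(X, y)
    return gp, gp.predict(xx)

# Made-up UTM coordinates (easting, northing) and concentrations.
X = np.array([[436600.0, 3681900.0], [436650.0, 3681950.0],
              [436700.0, 3681900.0], [436650.0, 3681850.0],
              [436650.0, 3681900.0]])
y = np.array([1.0, 2.0, 3.0, 2.0, 2.1])
xx = np.array([[436625.0, 3681900.0], [436675.0, 3681925.0],
               [436650.0, 3681875.0]])

gp, y_pred = fit_gp_sketch(X, y, xx)
```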

getCleanData(analytes)

Creates a table filling the data from the concentration dataset for a given analyte list where the columns are multi-indexed as follows [analytes, well names] and the index is all of the dates in the dataset. Many NaN should be expected.

Parameters

analytes (list) – list of analyte names to use

Returns

pd.DataFrame
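The table layout described above (dates as the index, columns multi-indexed by analyte and well) resembles what `pivot_table` produces in pandas. The following is a sketch under that assumption, using made-up records and the column names listed under simplify_data; it is not the library's implementation.

```python
import pandas as pd

# Made-up long-format concentration records.
records = pd.DataFrame({
    "COLLECTION_DATE": pd.to_datetime(["2015-01-01", "2015-01-01", "2015-02-01"]),
    "STATION_ID": ["W1", "W2", "W1"],
    "ANALYTE_NAME": ["TRITIUM", "TRITIUM", "DEPTH_TO_WATER"],
    "RESULT": [1.0, 2.0, 3.0],
})

# One row per date, one column per (analyte, well) pair.
# Pairs with no sample on a given date are left as NaN.
table = records.pivot_table(
    index="COLLECTION_DATE",
    columns=["ANALYTE_NAME", "STATION_ID"],
    values="RESULT",
)
```

Because most wells are not sampled on most dates, a table built this way is sparse, which is why many NaN values should be expected.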

getCommonDates(analytes, lag=[3, 7, 10])

Creates a table which counts the number of wells within a range specified by a list of lag days.

Parameters
  • analytes (list) – list of analyte names to use

  • lag (list, optional) – list of days to look ahead and behind the specified date (+/-). Defaults to [3,7,10].

Returns

pd.DataFrame
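To make the lag idea concrete, the sketch below counts how many distinct wells have a sample within +/- lag days of one target date, for each lag in a list. The sample data and the helper `wells_within_lag` are hypothetical; the actual method builds a table over all dates in the dataset.

```python
from datetime import date

# Made-up (well, sampling date) pairs.
samples = [
    ("W1", date(2015, 1, 1)),
    ("W2", date(2015, 1, 3)),
    ("W3", date(2015, 1, 8)),
    ("W4", date(2015, 1, 20)),
]

def wells_within_lag(samples, target, lags=(3, 7, 10)):
    """Count distinct wells sampled within +/- lag days of `target`."""
    counts = {}
    for lag in lags:
        wells = {w for w, d in samples if abs((d - target).days) <= lag}
        counts[lag] = len(wells)
    return counts

counts = wells_within_lag(samples, date(2015, 1, 1))
# counts == {3: 2, 7: 3, 10: 3}
```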

getData()

Returns the concentration data in pylenm

Returns

concentration data that was passed into pylenm

Return type

pd.DataFrame

getJointData(analytes, lag=3)

Creates a table filling the data from the concentration dataset for a given analyte list where the columns are multi-indexed as follows [analytes, well names] and the index is the date ranges specified by the lag.

Parameters
  • analytes (list) – list of analyte names to use

  • lag (int, optional) – number of days to look ahead and behind the specified date (+/-). Defaults to 3.

Returns

pd.DataFrame

get_Best_GP(X, y, smooth=True, seed=42)

Returns the best Gaussian Process model for a given X and y.

Parameters
  • X (numpy.array) – array of dimension (number of wells, 2) where each element is a pair of UTM coordinates.

  • y (numpy.array) – array of size (number of wells) where each value corresponds to a concentration value at a well.

  • smooth (bool, optional) – flag to toggle WhiteKernel on and off. Defaults to True.

  • seed (int, optional) – random state setting. Defaults to 42.

Returns

best GP model

Return type

GaussianProcessRegressor

get_Best_Wells(X, y, xx, ref, initial, max_wells, ft=['Elevation'], regression='linear', verbose=True, smooth=True, model=None)

Greedy optimization function that selects a subset of wells so as to minimize the MSE from a reference map.

Parameters
  • X (numpy.array) – array of dimension (number of wells, 2) where each element is a pair of UTM coordinates.

  • y (numpy.array) – array of size (number of wells) where each value corresponds to a concentration value at a well.

  • xx (numpy.array) – prediction locations

  • ref (numpy.array) – reference field to optimize for (aka best/true map)

  • initial (list) – indices of wells as the starting wells for optimization

  • max_wells (int) – number of wells to optimize for

  • ft (list, optional) – feature names to train on. Defaults to [‘Elevation’].

  • regression (str, optional) – choice between ‘linear’ for linear regression, ‘rf’ for random forest regression, ‘ridge’ for ridge regression, or ‘lasso’ for lasso regression. Defaults to ‘linear’.

  • verbose (bool, optional) – Defaults to True.

  • smooth (bool, optional) – flag to toggle WhiteKernel on and off. Defaults to True.

  • model (GaussianProcessRegressor, optional) – model to fit. Defaults to None.

Returns

index of best wells in order from best to worst

Return type

list
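The greedy strategy can be illustrated with a deliberately simplified stand-in. In the sketch below a nearest-neighbour predictor replaces the Gaussian Process and regression steps, so it shows only the selection loop (add, one at a time, the well that most reduces the MSE against the reference map) and not pylenm's actual model. All names and data are hypothetical.

```python
import math

def nearest_value(wells, values, point):
    """Predict at `point` using the value of the closest selected well."""
    idx = min(range(len(wells)), key=lambda i: math.dist(wells[i], point))
    return values[idx]

def mean_squared_error(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def greedy_select(X, y, xx, ref, initial, max_wells):
    """Grow the well set one index at a time, minimizing MSE against ref."""
    selected = list(initial)
    while len(selected) < max_wells:
        best_idx, best_err = None, math.inf
        for cand in range(len(X)):
            if cand in selected:
                continue
            trial = selected + [cand]
            pred = [nearest_value([X[i] for i in trial],
                                  [y[i] for i in trial], p) for p in xx]
            err = mean_squared_error(ref, pred)
            if err < best_err:
                best_idx, best_err = cand, err
        selected.append(best_idx)
    return selected

# Made-up wells on a line, with values equal to their x coordinate.
X = [(0.0, 0.0), (9.0, 0.0), (5.0, 0.0), (2.0, 0.0)]
y = [0.0, 9.0, 5.0, 2.0]
xx = [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0)]   # prediction locations
ref = [0.0, 5.0, 10.0]                        # reference map at xx

order = greedy_select(X, y, xx, ref, initial=[0], max_wells=3)
# order == [0, 1, 2]
```

A greedy search evaluates each remaining candidate once per step, so it is far cheaper than testing every subset, at the cost of not guaranteeing the globally best subset.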

get_Construction_Data()

Returns the construction data in pylenm

Returns

construction data that was passed into pylenm

Return type

pd.DataFrame

get_MCL(analyte_name)

Returns the Maximum Concentration Limit value for the specified analyte. Example: ‘TRITIUM’ returns 1.3

Parameters

analyte_name (str) – name of the analyte to be processed

Returns

MCL value

Return type

float

get_analyte_details(analyte_name, filter=False, col=None, equals=[], save_to_file=False, save_dir='analyte_details')

Returns a csv file saved to save_dir with details pertaining to the specified analyte. Details include the well names, the date ranges and the number of unique samples.

Parameters
  • analyte_name (str) – name of the analyte to be processed

  • filter (bool, optional) – whether to filter the data. Defaults to False.

  • col (str, optional) – column to filter. Example: col=’STATION_ID’. Defaults to None.

  • equals (list, optional) – values to filter col by. Examples: equals=[‘FAI001A’, ‘FAI001B’]. Defaults to [].

  • save_to_file (bool, optional) – whether to save data to file. Defaults to False.

  • save_dir (str, optional) – name of the directory you want to save the csv file to. Defaults to ‘analyte_details’.

Returns

Table with well information

Return type

pd.DataFrame

get_data_summary(analytes=None, sort_by='date', ascending=False, filter=False, col=None, equals=[])

Returns a dataframe with a summary of the data for certain analytes. Summary includes the date ranges and the number of unique samples and other statistics for the analyte results.

Parameters
  • analytes (list, optional) – list of analyte names to be processed. If left empty, a list of all the analytes in the data will be used. Defaults to None.

  • sort_by (str, optional) – {‘date’, ‘samples’, ‘wells’} sorts the data by either the dates by entering: ‘date’, the samples by entering: ‘samples’, or by unique well locations by entering ‘wells’. Defaults to ‘date’.

  • ascending (bool, optional) – flag to sort in ascending order. Defaults to False.

  • filter (bool, optional) – flag to indicate filtering. Defaults to False.

  • col (str, optional) – column to filter. Example: col=’STATION_ID’. Defaults to None.

  • equals (list, optional) – values to filter col by. Examples: equals=[‘FAI001A’, ‘FAI001B’]. Defaults to [].

Returns

Table with well information

Return type

pd.DataFrame

get_unit(analyte_name)

Returns the unit of the analyte you specify. Example: ‘DEPTH_TO_WATER’ may return ‘ft’

Parameters

analyte_name (str) – name of the analyte to be processed

Returns

unit of analyte

Return type

str

get_well_analytes(well_name=None, filter=False, col=None, equals=[])

Displays the analyte names available at given well locations.

Parameters
  • well_name (str, optional) – name of the well. If left empty, all wells are returned. Defaults to None.

  • filter (bool, optional) – flag to indicate filtering. Defaults to False.

  • col (str, optional) – column to filter. Example: col=’STATION_ID’. Defaults to None.

  • equals (list, optional) – values to filter col by. Examples: equals=[‘FAI001A’, ‘FAI001B’]. Defaults to [].

Returns

None

interpolate_topo(X, y, xx, ft=['Elevation'], model=None, smooth=True, regression='linear', seed=42)

Spatially interpolates the water table as a function of topographic metrics using a Gaussian Process. Uses regression to generate a trendline and adds the values to the GP map.

Parameters
  • X (numpy.array) – training values. Must include “Easting” and “Northing” columns.

  • y (numpy.array) – array of size (number of wells) where each value corresponds to a concentration value at a well.

  • xx (numpy.array) – prediction locations

  • ft (list, optional) – feature names to train on. Defaults to [‘Elevation’].

  • model (GaussianProcessRegressor, optional) – model to fit. Defaults to None.

  • smooth (bool, optional) – flag to toggle WhiteKernel on and off. Defaults to True.

  • regression (str, optional) – choice between ‘linear’ for linear regression, ‘rf’ for random forest regression, ‘ridge’ for ridge regression, or ‘lasso’ for lasso regression. Defaults to ‘linear’.

  • seed (int, optional) – random state setting. Defaults to 42.

Returns

prediction at locations xx

Return type

numpy.array

interpolate_well_data(well_name, analytes, frequency='2W')

Resamples the data based on the frequency specified and interpolates the values of the analytes.

Parameters
  • well_name (str) – name of the well to be processed.

  • analytes (list) – list of analyte names to use

  • frequency (str, optional) – {‘D’, ‘W’, ‘M’, ‘Y’} frequency to interpolate. See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html for valid frequency inputs (e.g. ‘W’ = every week, ‘D’ = every day, ‘2W’ = every 2 weeks). Defaults to ‘2W’.

Returns

pd.DataFrame
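Resampling followed by interpolation can be sketched in plain pandas as below. This shows the general technique on a made-up series with linear interpolation; whether pylenm uses exactly these calls or this interpolation method is an assumption.

```python
import pandas as pd

# Made-up irregular measurements for one analyte at one well.
dates = pd.to_datetime(["2020-01-05", "2020-02-02", "2020-03-01"])
series = pd.Series([1.0, 3.0, 5.0], index=dates)

# Put the data on a regular two-week grid, then fill the empty
# bins by linear interpolation between neighbouring values.
resampled = series.resample("2W").mean().interpolate()
```

The `frequency` strings accepted by the method follow the pandas offset aliases, so the same string can be tried directly in `resample` to preview the resulting grid.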

interpolate_wells_by_analyte(analyte, frequency='2W', rm_outliers=True, z_threshold=3)

Resamples analyte data based on the frequency specified and interpolates the values in between. NaN values are replaced with the average value per well.

Parameters
  • analyte (str) – analyte name for interpolation of all present wells.

  • frequency (str, optional) – {‘D’, ‘W’, ‘M’, ‘Y’} frequency to interpolate. See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html for valid frequency inputs (e.g. ‘W’ = every week, ‘D’ = every day, ‘2W’ = every 2 weeks). Defaults to ‘2W’.

  • rm_outliers (bool, optional) – flag to remove outliers in the data. Defaults to True.

  • z_threshold (int, optional) – z_score threshold to eliminate outliers. Defaults to 3.

Returns

interpolated data

Return type

pd.DataFrame

jointData_is_set(lag)

Checks to see if getJointData function was already called and saved for given lag.

Parameters

lag (int) – number of days to look ahead and behind the specified date (+/-)

Returns

True if JointData was already calculated, False, otherwise.

Return type

bool

mse(y_true, y_pred)

Error Metric: Mean Squared Error

Parameters
  • y_true (numpy.array) – true values

  • y_pred (numpy.array) – predicted values

Returns

mean squared error

Return type

float

plot_MCL(well_name, analyte_name, year_interval=5, save_dir='plot_MCL')

Plots the linear regression line of data given the analyte_name and well_name. The plot includes the prediction where the line of best fit intersects with the Maximum Concentration Limit (MCL).

Parameters
  • well_name (str) – name of the well to be processed

  • analyte_name (str) – name of the analyte to be processed

  • year_interval (int, optional) – plot by how many years to appear in the axis, e.g. (1 = every year, 5 = every 5 years, …). Defaults to 5.

  • save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_MCL’.
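The prediction mentioned in the description amounts to solving the fitted line for the time at which it equals the MCL. The sketch below uses an ordinary least-squares line on invented data and the example MCL value 1.3 quoted under get_MCL; any transformation pylenm applies before fitting is not reproduced here, and `fit_line` is a hypothetical helper.

```python
def fit_line(x, y):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up declining concentrations over time.
years = [2000, 2005, 2010, 2015]
conc = [9.0, 7.0, 5.0, 3.0]
mcl = 1.3

slope, intercept = fit_line(years, conc)
# Solve slope * year + intercept = mcl for year.
year_at_mcl = (mcl - intercept) / slope
```

With a slope of zero the line never crosses the limit, so a real implementation would need to guard against that case.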

plot_PCA_by_date(date, analytes, lag=0, n_clusters=4, return_clusters=False, min_samples=3, show_labels=True, save_dir='plot_PCA_by_date', filter=False, col=None, equals=[])

Generates a PCA biplot (PCA score plot + loading plot) of the data given a date in the dataset. The data is also clustered into n_clusters.

Parameters
  • date (str) – date to be analyzed

  • analytes (list) – list of analyte names to use

  • lag (int, optional) – number of days to look ahead and behind the specified date (+/-). Defaults to 0.

  • n_clusters (int, optional) – number of clusters to split the data into. Defaults to 4.

  • return_clusters (bool, optional) – flag to return the cluster data to be used for spatial plotting. Defaults to False.

  • min_samples (int, optional) – minimum number of samples the result should contain in order to execute. Defaults to 3.

  • show_labels (bool, optional) – choose whether or not to show the name of the wells. Defaults to True.

  • save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_PCA_by_date’.

  • filter (bool, optional) – flag to indicate filtering. Defaults to False.

  • col (str, optional) – column to filter. Example: col=’STATION_ID’. Defaults to None.

  • equals (list, optional) – values to filter col by. Examples: equals=[‘FAI001A’, ‘FAI001B’]. Defaults to [].

plot_PCA_by_well(well_name, analytes, interpolate=False, frequency='2W', min_samples=10, show_labels=True, save_dir='plot_PCA_by_well')

Generates a PCA biplot (PCA score plot + loading plot) of the data given a well_name in the dataset. Only uses the 6 important analytes.

Parameters
  • well_name (str) – name of the well to be processed

  • analytes (list) – list of analyte names to use

  • interpolate (bool, optional) – choose to interpolate the data. Defaults to False.

  • frequency (str, optional) – {‘D’, ‘W’, ‘M’, ‘Y’} frequency to interpolate. See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html for valid frequency inputs (e.g. ‘W’ = every week, ‘D’ = every day, ‘2W’ = every 2 weeks). Defaults to ‘2W’.

  • min_samples (int, optional) – minimum number of samples the result should contain in order to execute. Defaults to 10.

  • show_labels (bool, optional) – choose whether or not to show the name of the wells. Defaults to True.

  • save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_PCA_by_well’.

plot_PCA_by_year(year, analytes, n_clusters=4, return_clusters=False, min_samples=10, show_labels=True, save_dir='plot_PCA_by_year', filter=False, col=None, equals=[])

Generates a PCA biplot (PCA score plot + loading plot) of the data given a year in the dataset. The data is also clustered into n_clusters.

Parameters
  • year (int) – year to be analyzed

  • analytes (list) – list of analyte names to use

  • n_clusters (int, optional) – number of clusters to split the data into. Defaults to 4.

  • return_clusters (bool, optional) – flag to return the cluster data to be used for spatial plotting. Defaults to False.

  • min_samples (int, optional) – minimum number of samples the result should contain in order to execute. Defaults to 10.

  • show_labels (bool, optional) – choose whether or not to show the name of the wells. Defaults to True.

  • save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_PCA_by_year’.

  • filter (bool, optional) – flag to indicate filtering. Defaults to False.

  • col (str, optional) – column to filter. Example: col=’STATION_ID’. Defaults to None.

  • equals (list, optional) – values to filter col by. Examples: equals=[‘FAI001A’, ‘FAI001B’]. Defaults to [].

plot_all_corr_by_well(analytes, remove_outliers=True, z_threshold=4, interpolate=False, frequency='2W', save_dir='plot_correlation', log_transform=False, fontsize=20)

Plots the correlations with the physical plots as well as the important analytes over time for each well in the dataset.

Parameters
  • analytes (list) – list of analyte names to use

  • remove_outliers (bool, optional) – choose whether or not to remove the outliers. Defaults to True.

  • z_threshold (int, optional) – z_score threshold to eliminate outliers. Defaults to 4.

  • interpolate (bool, optional) – choose whether or not to interpolate the data. Defaults to False.

  • frequency (str, optional) – {‘D’, ‘W’, ‘M’, ‘Y’} frequency to interpolate. Note: See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html for valid frequency inputs (e.g. ‘W’ = every week, ‘D’ = every day, ‘2W’ = every 2 weeks). Defaults to ‘2W’.

  • save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_correlation’.

  • log_transform (bool, optional) – flag for log base 10 transformation. Defaults to False.

  • fontsize (int, optional) – font size. Defaults to 20.

plot_all_correlation_heatmap(show_symmetry=True, color=True, save_dir='plot_correlation_heatmap')

Plots a heatmap of the correlations of the important analytes over time for each well in the dataset.

Parameters
  • show_symmetry (bool, optional) – choose whether or not the heatmap should show the same information twice over the diagonal. Defaults to True.

  • color (bool, optional) – choose whether or not the plot should be in color or in greyscale. Defaults to True.

  • save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_correlation_heatmap’.

plot_all_data(log_transform=True, alpha=0, year_interval=2, plot_inline=True, save_dir='plot_data')

Plot concentrations over time for every well and analyte with a smoothed curve on interpolated data points.

Parameters
  • log_transform (bool, optional) – choose whether or not the data should be transformed to log base 10 values. Defaults to True.

  • alpha (int, optional) – value between 0 and 10 for line smoothing. Defaults to 0.

  • plot_inline (bool, optional) – choose whether or not to show plot inline. Defaults to True.

  • year_interval (int, optional) – plot by how many years to appear in the axis e.g.(1 = every year, 5 = every 5 years, …). Defaults to 2.

  • save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_data’.

plot_all_time_series(analyte_name=None, title='Dataset: Time ranges', x_label='Well', y_label='Year', x_label_size=8, marker_size=30, min_days=10, x_min_lim=None, x_max_lim=None, y_min_date=None, y_max_date=None, sort_by_distance=True, source_coordinate=[436642.7, 3681927.09], log_transform=False, cmap=<matplotlib.colors.LinearSegmentedColormap object>, drop_cols=[], return_data=False, filter=False, col=None, equals=[], cbar_min=None, cbar_max=None, reverse_y_axis=False, fontsize=20, figsize=(20, 6), dpi=300, y_2nd_label=None)

Plots the start and end date of analyte readings for different locations/sensors/wells with colored concentration reading.

Parameters
  • analyte_name (str, optional) – analyte to examine. Defaults to None.

  • title (str, optional) – plot title. Defaults to ‘Dataset: Time ranges’.

  • x_label (str, optional) – x axis label. Defaults to ‘Well’.

  • y_label (str, optional) – y axis label. Defaults to ‘Year’.

  • x_label_size (int, optional) – x axis label font size. Defaults to 8.

  • marker_size (int, optional) – point size for time series. Defaults to 30.

  • min_days (int, optional) – minimum number of days required to plot the time series. Defaults to 10.

  • x_min_lim (int, optional) – x axis starting point. Defaults to None.

  • x_max_lim (int, optional) – x axis ending point. Defaults to None.

  • y_min_date (str, optional) – y axis starting date. Defaults to None.

  • y_max_date (str, optional) – y axis ending date. Defaults to None.

  • sort_by_distance (bool, optional) – flag to sort by distance from source center. Defaults to True.

  • source_coordinate (list, optional) – Easting, Northing coordinate of source center. Defaults to [436642.70,3681927.09].

  • log_transform (bool, optional) – flag to toggle log base 10 transformation. Defaults to False.

  • cmap (cmap, optional) – color map for plotting. Defaults to mpl.cm.rainbow.

  • drop_cols (list, optional) – columns, usually wells, to exclude. Defaults to [].

  • return_data (bool, optional) – flag to return data. Defaults to False.

  • filter (bool, optional) – flag to indicate filtering. Defaults to False.

  • col (str, optional) – column to filter. Example: col=’STATION_ID’. Defaults to None.

  • equals (list, optional) – values to filter col by. Examples: equals=[‘FAI001A’, ‘FAI001B’]. Defaults to [].

  • cbar_min (float, optional) – color bar lower boundary. Defaults to None.

  • cbar_max (float, optional) – color bar upper boundary. Defaults to None.

  • reverse_y_axis (bool, optional) – flag that reverses y axis. Defaults to False.

  • fontsize (int, optional) – plot font size. Defaults to 20.

  • figsize (tuple, optional) – matplotlib style figure size. Defaults to (20,6).

  • dpi (int, optional) – DPI of figure. Defaults to 300.

  • y_2nd_label (str, optional) – color bar label manual override. Defaults to None.

plot_all_time_series_simple(analyte_name=None, start_date=None, end_date=None, title='Dataset: Time ranges', x_label='Well', y_label='Year', min_days=10, x_min_lim=-5, x_max_lim=170, y_min_date='1988-01-01', y_max_date='2020-01-01', return_data=False, filter=False, col=None, equals=[])

Plots the start and end date of analyte readings for different locations/sensors/wells.

Parameters
  • analyte_name (str, optional) – analyte to examine. Defaults to None.

  • start_date (str, optional) – start date of horizontal time to show alignment. Defaults to None.

  • end_date (str, optional) – end date of horizontal time to show alignment. Defaults to None.

  • title (str, optional) – plot title. Defaults to ‘Dataset: Time ranges’.

  • x_label (str, optional) – x axis label. Defaults to ‘Well’.

  • y_label (str, optional) – y axis label. Defaults to ‘Year’.

  • min_days (int, optional) – minimum number of days required to plot the time series. Defaults to 10.

  • x_min_lim (int, optional) – x axis starting point. Defaults to -5.

  • x_max_lim (int, optional) – x axis ending point. Defaults to 170.

  • y_min_date (str, optional) – y axis starting date. Defaults to ‘1988-01-01’.

  • y_max_date (str, optional) – y axis ending date. Defaults to ‘2020-01-01’.

  • return_data (bool, optional) – flag to return data. Defaults to False.

  • filter (bool, optional) – flag to indicate filtering. Defaults to False.

  • col (str, optional) – column to filter. Example: col=’STATION_ID’. Defaults to None.

  • equals (list, optional) – values to filter col by. Examples: equals=[‘FAI001A’, ‘FAI001B’]. Defaults to [].

plot_coordinates_to_map(gps_data, center=[33.271459, -81.675873], zoom=14) → ipyleaflet.leaflet.Map

Plots the well locations on an interactive map given coordinates.

Parameters
  • gps_data (pd.DataFrame) – Data frame with the following column names: station_id, latitude, longitude, color. If the color column is not passed, the default color will be blue.

  • center (list, optional) – latitude and longitude coordinates to center the map view. Defaults to [33.271459, -81.675873].

  • zoom (int, optional) – value to determine the initial scale of the map. Defaults to 14.

Returns

ipyleaflet.Map

plot_corr_by_date_range(date, analytes, lag=0, min_samples=10, save_dir='plot_corr_by_date', log_transform=False, fontsize=20, returnData=False, no_log=None)

Plots the correlations with the physical plots as well as the correlations of the important analytes for ALL the wells on a specified date, or over a range of dates if a lag greater than 0 is specified.

Parameters
  • date (str) – date to be analyzed

  • analytes (list) – list of analyte names to use

  • lag (int, optional) – number of days to look ahead and behind the specified date (+/-). Defaults to 0.

  • min_samples (int, optional) – minimum number of samples the result should contain in order to execute. Defaults to 10.

  • save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_corr_by_date’.

  • log_transform (bool, optional) – flag for log base 10 transformation. Defaults to False.

  • fontsize (int, optional) – font size. Defaults to 20.

  • returnData (bool, optional) – flag to return data used to perform correlation analysis. Defaults to False.

  • no_log (list, optional) – list of column names to not apply log transformation to. Defaults to None.

plot_corr_by_well(well_name, analytes, remove_outliers=True, z_threshold=4, interpolate=False, frequency='2W', save_dir='plot_correlation', log_transform=False, fontsize=20, returnData=False, remove=[], no_log=None)

Plots the correlations with the physical plots as well as the correlations of the important analytes over time for a specified well.

Parameters
  • well_name (str) – name of the well to be processed

  • analytes (list) – list of analyte names to use

  • remove_outliers (bool, optional) – choose whether or not to remove the outliers. Defaults to True.

  • z_threshold (int, optional) – z_score threshold to eliminate outliers. Defaults to 4.

  • interpolate (bool, optional) – choose whether or not to interpolate the data. Defaults to False.

  • frequency (str, optional) – {‘D’, ‘W’, ‘M’, ‘Y’} frequency to interpolate. Note: See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html for valid frequency inputs (e.g. ‘W’ = every week, ‘D’ = every day, ‘2W’ = every 2 weeks). Defaults to ‘2W’.

  • save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_correlation’.

  • log_transform (bool, optional) – flag for log base 10 transformation. Defaults to False.

  • fontsize (int, optional) – font size. Defaults to 20.

  • returnData (bool, optional) – flag to return data used to perform correlation analysis. Defaults to False.

  • remove (list, optional) – wells to remove. Defaults to [].

  • no_log (list, optional) – list of column names to not apply log transformation to. Defaults to None.

Returns

None

plot_corr_by_year(year, analytes, remove_outliers=True, z_threshold=4, min_samples=10, save_dir='plot_corr_by_year', log_transform=False, fontsize=20, returnData=False, no_log=None)

Plots the correlations with the physical plots as well as the correlations of the important analytes for ALL the wells in a specified year.

Parameters
  • year (int) – year to be analyzed

  • analytes (list) – list of analyte names to use

  • remove_outliers (bool, optional) – choose whether or not to remove the outliers. Defaults to True.

  • z_threshold (int, optional) – z_score threshold to eliminate outliers. Defaults to 4.

  • min_samples (int, optional) – minimum number of samples the result should contain in order to execute. Defaults to 10.

  • save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_corr_by_year’.

  • log_transform (bool, optional) – flag for log base 10 transformation. Defaults to False.

  • fontsize (int, optional) – font size. Defaults to 20.

  • returnData (bool, optional) – flag to return data used to perform correlation analysis. Defaults to False.

  • no_log (list, optional) – list of column names to not apply log transformation to. Defaults to None.

plot_correlation_heatmap(well_name, show_symmetry=True, color=True, save_dir='plot_correlation_heatmap')

Plots a heatmap of the correlations of the important analytes over time for a specified well.

Parameters
  • well_name (str) – name of the well to be processed

  • show_symmetry (bool, optional) – choose whether or not the heatmap should show the same information twice over the diagonal. Defaults to True.

  • color (bool, optional) – choose whether or not the plot should be in color or in greyscale. Defaults to True.

  • save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_correlation_heatmap’.

Returns

None

plot_data(well_name, analyte_name, log_transform=True, alpha=0, plot_inline=True, year_interval=2, x_label='Years', y_label='', save_dir='plot_data', filter=False, col=None, equals=[])

Plot concentrations over time of a specified well and analyte with a smoothed curve on interpolated data points.

Parameters
  • well_name (str) – name of the well to be processed

  • analyte_name (str) – name of the analyte to be processed

  • log_transform (bool, optional) – choose whether or not the data should be transformed to log base 10 values. Defaults to True.

  • alpha (int, optional) – value between 0 and 10 for line smoothing. Defaults to 0.

  • plot_inline (bool, optional) – choose whether or not to show plot inline. Defaults to True.

  • year_interval (int, optional) – plot by how many years to appear in the axis e.g.(1 = every year, 5 = every 5 years, …). Defaults to 2.

  • x_label (str, optional) – x axis label. Defaults to ‘Years’.

  • y_label (str, optional) – y axis label. Defaults to ‘’.

  • save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_data’.

  • filter (bool, optional) – flag to indicate filtering. Defaults to False.

  • col (str, optional) – column to filter. Example: col=’STATION_ID’. Defaults to None.

  • equals (list, optional) – values to filter col by. Examples: equals=[‘FAI001A’, ‘FAI001B’]. Defaults to [].

Returns

None

query_data(well_name, analyte_name)

Filters the data by the specified well_name and analyte_name.

Parameters
  • well_name (str) – name of the well to be processed

  • analyte_name (str) – name of the analyte to be processed

Returns

filtered data based on query conditions

Return type

pd.DataFrame

remove_outliers(data, z_threshold=4)

Removes outliers from a dataframe based on the z_scores and returns the new dataframe.

Parameters
  • data (pd.DataFrame) – data from which the outliers are to be removed

  • z_threshold (int, optional) – z_score threshold to eliminate. Defaults to 4.

Returns

data with outliers removed

Return type

pd.DataFrame
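A z-score filter of this kind can be sketched as follows. The helper `drop_outliers` is hypothetical, and the use of the sample standard deviation (the pandas default) is an assumption; pylenm may compute z-scores differently.

```python
import pandas as pd

def drop_outliers(df, z_threshold=4):
    """Keep rows where every column lies within z_threshold standard deviations."""
    z = (df - df.mean()) / df.std()  # column-wise z-scores
    return df[(z.abs() < z_threshold).all(axis=1)]

# Thirty ordinary readings plus one extreme value (made-up data).
data = pd.DataFrame({"RESULT": [10.0] * 15 + [11.0] * 15 + [1000.0]})
cleaned = drop_outliers(data, z_threshold=4)
```

Note that a single extreme value can only exceed a high threshold when there are enough samples; with very few rows, no point can reach a z-score of 4.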

setConstructionData(construction_data: pandas.core.frame.DataFrame, verbose=True)

Imports the additional well information as a separate DataFrame.

Parameters
  • construction_data (pd.DataFrame) – Data with additional details.

  • verbose (bool, optional) – Prints success message. Defaults to True.

Returns

None

setData(data: pandas.core.frame.DataFrame, verbose: bool = True) → None

Saves the dataset into pylenm

Parameters
  • data (pd.DataFrame) – Dataset to be imported.

  • verbose (bool, optional) – Prints success message. Defaults to True.

Returns

None

simplify_data(data=None, inplace=False, columns=None, save_csv=False, file_name='data_simplified', save_dir='data/')

Removes all columns except ‘COLLECTION_DATE’, ‘STATION_ID’, ‘ANALYTE_NAME’, ‘RESULT’, and ‘RESULT_UNITS’.

If the user specifies additional columns in addition to the ones listed above, those columns will be kept. The function returns a dataframe and has an optional parameter to be able to save the dataframe to a csv file.

Parameters
  • data (pd.DataFrame, optional) – data to simplify. Defaults to None.

  • inplace (bool, optional) – save data to current working dataset. Defaults to False.

  • columns (list, optional) – list of any additional columns on top of [‘COLLECTION_DATE’, ‘STATION_ID’, ‘ANALYTE_NAME’, ‘RESULT’, and ‘RESULT_UNITS’] to be kept in the dataframe. Defaults to None.

  • save_csv (bool, optional) – flag to determine whether or not to save the dataframe to a csv file. Defaults to False.

  • file_name (str, optional) – name of the csv file you want to save. Defaults to ‘data_simplified’.

  • save_dir (str, optional) – name of the directory you want to save the csv file to. Defaults to ‘data/’.

Returns

pd.DataFrame
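The column selection described above can be approximated in a few lines of pandas. This sketch is not the library's code: `keep_columns`, the `LAB_CODE` and `COMMENTS` columns, and the unit string are invented for the example.

```python
import pandas as pd

BASE_COLUMNS = ["COLLECTION_DATE", "STATION_ID", "ANALYTE_NAME",
                "RESULT", "RESULT_UNITS"]

def keep_columns(df, extra=None):
    """Return a copy with only the base columns plus any requested extras."""
    wanted = BASE_COLUMNS + list(extra or [])
    # Skip names that are absent rather than raising a KeyError.
    return df[[c for c in wanted if c in df.columns]].copy()

# Made-up raw record with two columns beyond the base set.
raw = pd.DataFrame({
    "COLLECTION_DATE": ["2015-01-01"],
    "STATION_ID": ["W1"],
    "ANALYTE_NAME": ["TRITIUM"],
    "RESULT": [1.0],
    "RESULT_UNITS": ["pCi/mL"],
    "LAB_CODE": ["X"],
    "COMMENTS": [""],
})
slim = keep_columns(raw, extra=["LAB_CODE"])
```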