Function descriptions¶
- class pylenm.PylenmDataFactory(data: pandas.core.frame.DataFrame)¶
Bases: object
Class object that initializes Pylenm given data.
- add_dist_to_source(XX, source_coordinate=[436642.7, 3681927.09], col_name='dist_to_source')¶
Adds a column to the data with the distance of each record from the source coordinate.
- Parameters
XX (pd.DataFrame) – data with coordinate information
source_coordinate (list, optional) – source coordinate. Defaults to [436642.70,3681927.09].
col_name (str, optional) – name to assign new column. Defaults to ‘dist_to_source’.
- Returns
returns original data with additional column with the distance.
- Return type
pd.DataFrame
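As a rough illustration of what this method computes, the plain-Python sketch below appends a distance field to each record. The record layout (`Easting` and `Northing` keys) and the helper name `add_dist_sketch` are assumptions made for this example; it is not pylenm's implementation.

```python
import math

def add_dist_sketch(records, source_coordinate=(436642.7, 3681927.09),
                    col_name="dist_to_source"):
    """Return copies of the records with the distance to the source added.

    Assumes each record is a dict with 'Easting' and 'Northing' keys
    (a hypothetical layout chosen for this sketch).
    """
    sx, sy = source_coordinate
    out = []
    for rec in records:
        new_rec = dict(rec)  # copy so the input records are left untouched
        new_rec[col_name] = math.hypot(rec["Easting"] - sx, rec["Northing"] - sy)
        out.append(new_rec)
    return out

# Made-up wells: one at the source, one offset by (3, 4) from it.
wells = [
    {"STATION_ID": "W1", "Easting": 436642.7, "Northing": 3681927.09},
    {"STATION_ID": "W2", "Easting": 436645.7, "Northing": 3681931.09},
]
with_dist = add_dist_sketch(wells)
```

The sketch returns new records rather than modifying the input, which mirrors the documented behaviour of returning the original data with one additional column.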
- cluster_data(data, analyte_name=['ANALYTE_NAME'], n_clusters=4, filter=False, col=None, equals=[], year_interval=5, y_label='Concentration', return_clusters=False)¶
Clusters time series concentration data using kmeans algorithm and plots it.
- Parameters
data (pd.DataFrame) – data to be used in clustering.
analyte_name (list, optional) – analytes to use to cluster. Defaults to [“ANALYTE_NAME”].
n_clusters (int, optional) – number of clusters for kmeans. Defaults to 4.
filter (bool, optional) – flag to indicate filtering. Defaults to False.
col (str, optional) – column to filter. Example: col=’STATION_ID’. Defaults to None.
equals (list, optional) – values to filter col by. Examples: equals=[‘FAI001A’, ‘FAI001B’]. Defaults to [].
year_interval (int, optional) – plot x_label interval in years. Defaults to 5.
y_label (str, optional) – y axis label. Defaults to ‘Concentration’.
return_clusters (bool, optional) – flag to return cluster assignment. Defaults to False.
- dist(p1, p2)¶
2D Euclidean distance function
- Parameters
p1 (tuple) – first point
p2 (tuple) – second point
- Returns
Euclidean distance
- Return type
float
- filter_by_column(data=None, col=None, equals=[])¶
Filters construction data based on one column. You can specify only ONE column to filter by, but can select MANY values for the entry.
- Parameters
data (pd.DataFrame, optional) – dataframe to filter. Defaults to None.
col (str, optional) – column to filter. Example: col=’STATION_ID’. Defaults to None.
equals (list, optional) – values to filter col by. Examples: equals=[‘FAI001A’, ‘FAI001B’]. Defaults to [].
- Returns
returns filtered dataframe
- Return type
pd.DataFrame
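The one-column, many-values filter described above can be sketched without pandas as follows. The helper name `filter_by_column_sketch`, the list-of-dicts layout, and the `FAI002C` entry are assumptions for illustration only.

```python
def filter_by_column_sketch(rows, col, equals):
    """Keep the rows whose value in `col` is one of the values in `equals`."""
    wanted = set(equals)
    return [row for row in rows if row.get(col) in wanted]

# Made-up rows; only the STATION_ID values from the docstring are reused.
rows = [
    {"STATION_ID": "FAI001A"},
    {"STATION_ID": "FAI001B"},
    {"STATION_ID": "FAI002C"},
]
filtered = filter_by_column_sketch(rows, col="STATION_ID",
                                   equals=["FAI001A", "FAI001B"])
```

On a pandas DataFrame `df`, the same selection is commonly written as `df[df[col].isin(equals)]`.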
- filter_wells(units)¶
Returns a list of the well names filtered by the unit(s) specified.
- Parameters
units (list) – Letter of the well to be filtered (e.g. [‘A’] or [‘A’, ‘D’])
- Returns
well names filtered by the unit(s) specified
- Return type
list
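A minimal sketch of unit-based filtering is shown below. It assumes that the unit letter is the final character of the well name (as in `FAI001A`); that naming convention and the helper name `filter_wells_sketch` are assumptions, not something this reference confirms.

```python
def filter_wells_sketch(well_names, units):
    """Keep well names whose final character matches one of `units`.

    Assumption: the unit letter is the last character of the well name.
    """
    suffixes = tuple(units)  # str.endswith accepts a tuple of suffixes
    return [name for name in well_names if name.endswith(suffixes)]

# Made-up well names for illustration.
names = ["FAI001A", "FAI001B", "FAI002D"]
selected = filter_wells_sketch(names, ["A", "D"])
```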
- fit_gp(X, y, xx, model=None, smooth=True)¶
Fits Gaussian Process for X and y and returns both the GP model and the predicted values
- Parameters
X (numpy.array) – array of dimension (number of wells, 2) where each element is a pair of UTM coordinates.
y (numpy.array) – array of size (number of wells) where each value corresponds to a concentration value at a well.
xx (numpy.array) – prediction locations
model (GaussianProcessRegressor, optional) – model to fit. Defaults to None.
smooth (bool, optional) – flag to toggle WhiteKernel on and off. Defaults to True.
- Returns
GP model, prediction of xx
- Return type
GaussianProcessRegressor, numpy.array
- getCleanData(analytes)¶
Creates a table filling the data from the concentration dataset for a given analyte list where the columns are multi-indexed as follows [analytes, well names] and the index is all of the dates in the dataset. Many NaN should be expected.
- Parameters
analytes (list) – list of analyte names to use
- Returns
pd.DataFrame
- getCommonDates(analytes, lag=[3, 7, 10])¶
Creates a table which counts the number of wells within a range specified by a list of lag days.
- Parameters
analytes (list) – list of analyte names to use
lag (list, optional) – list of days to look ahead and behind the specified date (+/-). Defaults to [3,7,10].
- Returns
pd.DataFrame
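To make the role of `lag` concrete, the sketch below counts how many distinct wells have a sample within +/- `lag` days of one date, once per lag value. The `(well, date)` tuple layout, the sample values, and the helper name are hypothetical; the actual method builds a table over all dates in the dataset.

```python
from datetime import date, timedelta

def count_wells_within_lag(samples, center, lag_days):
    """Count distinct wells sampled within +/- lag_days of `center`.

    `samples` is a list of (well_name, sample_date) pairs
    (a hypothetical layout chosen for this sketch).
    """
    window = timedelta(days=lag_days)
    wells = {well for well, day in samples if abs(day - center) <= window}
    return len(wells)

# Made-up sampling events.
samples = [
    ("W1", date(2015, 1, 1)),
    ("W2", date(2015, 1, 4)),
    ("W3", date(2015, 1, 9)),
    ("W1", date(2015, 1, 2)),
]
counts = {lag: count_wells_within_lag(samples, date(2015, 1, 1), lag)
          for lag in [3, 7, 10]}
```

A wider lag can only keep or increase the count, which is why comparing several lags helps when choosing a window for `getJointData`.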
- getData()¶
Returns the concentration data in pylenm
- Returns
concentration data that was passed into pylenm
- Return type
pd.DataFrame
- getJointData(analytes, lag=3)¶
Creates a table filling the data from the concentration dataset for a given analyte list where the columns are multi-indexed as follows [analytes, well names] and the index is the date ranges specified by the lag.
- Parameters
analytes (list) – list of analyte names to use
lag (int, optional) – number of days to look ahead and behind the specified date (+/-). Defaults to 3.
- Returns
pd.DataFrame
- get_Best_GP(X, y, smooth=True, seed=42)¶
Returns the best Gaussian Process model for a given X and y.
- Parameters
X (numpy.array) – array of dimension (number of wells, 2) where each element is a pair of UTM coordinates.
y (numpy.array) – array of size (number of wells) where each value corresponds to a concentration value at a well.
smooth (bool, optional) – flag to toggle WhiteKernel on and off. Defaults to True.
seed (int, optional) – random state setting. Defaults to 42.
- Returns
best GP model
- Return type
GaussianProcessRegressor
- get_Best_Wells(X, y, xx, ref, initial, max_wells, ft=['Elevation'], regression='linear', verbose=True, smooth=True, model=None)¶
Greedy optimization function to select a subset of wells so as to minimize the MSE from a reference map.
- Parameters
X (numpy.array) – array of dimension (number of wells, 2) where each element is a pair of UTM coordinates.
y (numpy.array) – array of size (number of wells) where each value corresponds to a concentration value at a well.
xx (numpy.array) – prediction locations
ref (numpy.array) – reference field to optimize for (aka best/true map)
initial (list) – indices of wells as the starting wells for optimization
max_wells (int) – number of wells to optimize for
ft (list, optional) – feature names to train on. Defaults to [‘Elevation’].
regression (str, optional) – choice between ‘linear’ for linear regression, ‘rf’ for random forest regression, ‘ridge’ for ridge regression, or ‘lasso’ for lasso regression. Defaults to ‘linear’.
verbose (bool, optional) – Defaults to True.
smooth (bool, optional) – flag to toggle WhiteKernel on and off. Defaults to True.
model (GaussianProcessRegressor, optional) – model to fit. Defaults to None.
- Returns
index of best wells in order from best to worst
- Return type
list
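The greedy idea can be sketched as a forward-selection loop: starting from the initial wells, repeatedly add the well whose inclusion gives the lowest MSE against the reference map. In the sketch below, inverse-distance weighting is swapped in for the Gaussian Process and regression step so that the example stays dependency-free; the helper names and data are hypothetical, and this is not pylenm's implementation.

```python
import math

def idw_predict(points, values, targets, power=2.0):
    """Inverse-distance-weighted prediction (stand-in for the GP model)."""
    preds = []
    for tx, ty in targets:
        num = den = 0.0
        exact = None
        for (px, py), v in zip(points, values):
            d = math.hypot(tx - px, ty - py)
            if d == 0.0:          # target coincides with a well
                exact = v
                break
            w = 1.0 / d ** power
            num += w * v
            den += w
        preds.append(exact if exact is not None else num / den)
    return preds

def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def greedy_select(X, y, xx, ref, initial, max_wells):
    """Greedily add the well that most reduces the MSE against `ref`."""
    selected = list(initial)
    while len(selected) < max_wells:
        best_idx, best_err = None, math.inf
        for i in range(len(X)):
            if i in selected:
                continue
            trial = selected + [i]
            pred = idw_predict([X[j] for j in trial], [y[j] for j in trial], xx)
            err = mse(ref, pred)
            if err < best_err:
                best_idx, best_err = i, err
        if best_idx is None:      # no candidates left
            break
        selected.append(best_idx)
    return selected

# Made-up wells, one prediction location, and its reference value.
X = [(0.0, 0.0), (10.0, 0.0), (5.0, 5.0)]
y = [1.0, 10.0, 4.0]
xx = [(10.0, 0.0)]
ref = [10.0]
chosen = greedy_select(X, y, xx, ref, initial=[0], max_wells=2)
```

Because each step refits on every remaining candidate, the cost grows with the number of wells times `max_wells`, which is worth keeping in mind when the fitted model is a Gaussian Process rather than this lightweight stand-in.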
- get_Construction_Data()¶
Returns the construction data in pylenm
- Returns
construction data that was passed into pylenm
- Return type
pd.DataFrame
- get_MCL(analyte_name)¶
Returns the Maximum Concentration Limit value for the specified analyte. Example: ‘TRITIUM’ returns 1.3
- Parameters
analyte_name (str) – name of the analyte to be processed
- Returns
MCL value
- Return type
float
- get_analyte_details(analyte_name, filter=False, col=None, equals=[], save_to_file=False, save_dir='analyte_details')¶
Returns a csv file saved to save_dir with details pertaining to the specified analyte. Details include the well names, the date ranges and the number of unique samples.
- Parameters
analyte_name (str) – name of the analyte to be processed
filter (bool, optional) – whether to filter the data. Defaults to False.
col (str, optional) – column to filter. Example: col=’STATION_ID’. Defaults to None.
equals (list, optional) – values to filter col by. Examples: equals=[‘FAI001A’, ‘FAI001B’]. Defaults to [].
save_to_file (bool, optional) – whether to save data to file. Defaults to False.
save_dir (str, optional) – name of the directory you want to save the csv file to. Defaults to ‘analyte_details’.
- Returns
Table with well information
- Return type
pd.DataFrame
- get_data_summary(analytes=None, sort_by='date', ascending=False, filter=False, col=None, equals=[])¶
Returns a dataframe with a summary of the data for certain analytes. Summary includes the date ranges and the number of unique samples and other statistics for the analyte results.
- Parameters
analytes (list, optional) – list of analyte names to be processed. If left empty, a list of all the analytes in the data will be used. Defaults to None.
sort_by (str, optional) – {‘date’, ‘samples’, ‘wells’} sorts the data by either the dates by entering: ‘date’, the samples by entering: ‘samples’, or by unique well locations by entering ‘wells’. Defaults to ‘date’.
ascending (bool, optional) – flag to sort in ascending order. Defaults to False.
filter (bool, optional) – flag to indicate filtering. Defaults to False.
col (str, optional) – column to filter. Example: col=’STATION_ID’. Defaults to None.
equals (list, optional) – values to filter col by. Examples: equals=[‘FAI001A’, ‘FAI001B’]. Defaults to [].
- Returns
Table with well information
- Return type
pd.DataFrame
- get_unit(analyte_name)¶
Returns the unit of the analyte you specify. Example: ‘DEPTH_TO_WATER’ may return ‘ft’
- Parameters
analyte_name (str) – name of the analyte to be processed
- Returns
unit of analyte
- Return type
str
- get_well_analytes(well_name=None, filter=False, col=None, equals=[])¶
Displays the analyte names available at given well locations.
- Parameters
well_name (str, optional) – name of the well. If left empty, all wells are returned. Defaults to None.
filter (bool, optional) – flag to indicate filtering. Defaults to False.
col (str, optional) – column to filter. Example: col=’STATION_ID’. Defaults to None.
equals (list, optional) – values to filter col by. Examples: equals=[‘FAI001A’, ‘FAI001B’]. Defaults to [].
- Returns
None
- interpolate_topo(X, y, xx, ft=['Elevation'], model=None, smooth=True, regression='linear', seed=42)¶
Spatially interpolates the water table as a function of topographic metrics using a Gaussian Process. Uses regression to generate a trendline and adds the values to the GP map.
- Parameters
X (numpy.array) – training values. Must include “Easting” and “Northing” columns.
y (numpy.array) – array of size (number of wells) where each value corresponds to a concentration value at a well.
xx (numpy.array) – prediction locations
ft (list, optional) – feature names to train on. Defaults to [‘Elevation’].
model (GaussianProcessRegressor, optional) – model to fit. Defaults to None.
smooth (bool, optional) – flag to toggle WhiteKernel on and off. Defaults to True.
regression (str, optional) – choice between ‘linear’ for linear regression, ‘rf’ for random forest regression, ‘ridge’ for ridge regression, or ‘lasso’ for lasso regression. Defaults to ‘linear’.
seed (int, optional) – random state setting. Defaults to 42.
- Returns
prediction at locations xx
- Return type
numpy.array
- interpolate_well_data(well_name, analytes, frequency='2W')¶
Resamples the data based on the frequency specified and interpolates the values of the analytes.
- Parameters
well_name (str) – name of the well to be processed.
analytes (list) – list of analyte names to use
frequency (str, optional) – {‘D’, ‘W’, ‘M’, ‘Y’} frequency to interpolate. See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html for valid frequency inputs. (e.g. ‘W’ = every week, ‘D’ = every day, ‘2W’ = every 2 weeks). Defaults to ‘2W’.
- Returns
pd.DataFrame
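The resample-then-interpolate idea can be illustrated with a fixed-step linear interpolation in plain Python. This is a simplified sketch under stated assumptions: it uses a fixed number of days per step instead of pandas frequency strings (pandas anchors weekly frequencies to a weekday, which this sketch does not reproduce), it needs at least two samples on distinct dates, and the helper name and data are hypothetical.

```python
from datetime import date, timedelta

def resample_linear(samples, step_days):
    """Linearly interpolate (date, value) samples onto a regular grid.

    The grid starts at the first sample date and advances by `step_days`
    until the last sample date is passed.
    """
    samples = sorted(samples)
    end = samples[-1][0]
    out, k = [], 0
    t = samples[0][0]
    while t <= end:
        # Advance to the segment [samples[k], samples[k + 1]] that contains t.
        while k + 1 < len(samples) - 1 and samples[k + 1][0] <= t:
            k += 1
        (t0, v0), (t1, v1) = samples[k], samples[k + 1]
        frac = (t - t0).days / (t1 - t0).days
        out.append((t, v0 + frac * (v1 - v0)))
        t += timedelta(days=step_days)
    return out

# Made-up readings for one analyte at one well, two weeks apart.
readings = [
    (date(2015, 1, 1), 0.0),
    (date(2015, 1, 15), 14.0),
    (date(2015, 1, 29), 0.0),
]
weekly = resample_linear(readings, step_days=7)
```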
- interpolate_wells_by_analyte(analyte, frequency='2W', rm_outliers=True, z_threshold=3)¶
Resamples analyte data based on the frequency specified and interpolates the values in between. NaN values are replaced with the average value per well.
- Parameters
analyte (str) – analyte name for interpolation of all present wells.
frequency (str, optional) – {‘D’, ‘W’, ‘M’, ‘Y’} frequency to interpolate. See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html for valid frequency inputs. (e.g. ‘W’ = every week, ‘D’ = every day, ‘2W’ = every 2 weeks). Defaults to ‘2W’.
rm_outliers (bool, optional) – flag to remove outliers in the data. Defaults to True.
z_threshold (int, optional) – z_score threshold to eliminate outliers. Defaults to 3.
- Returns
interpolated data
- Return type
pd.DataFrame
- jointData_is_set(lag)¶
Checks to see if getJointData function was already called and saved for given lag.
- Parameters
lag (int) – number of days to look ahead and behind the specified date (+/-)
- Returns
True if JointData was already calculated, False, otherwise.
- Return type
bool
- mse(y_true, y_pred)¶
Error Metric: Mean Squared Error
- Parameters
y_true (numpy.array) – true values
y_pred (numpy.array) – predicted values
- Returns
mean squared error
- Return type
float
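For reference, the metric is the average of the squared differences between true and predicted values. A dependency-free sketch (the helper name `mse_sketch` is hypothetical):

```python
def mse_sketch(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# One of three predictions is off by 2, so the squared errors sum to 4.
err = mse_sketch([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```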
- plot_MCL(well_name, analyte_name, year_interval=5, save_dir='plot_MCL')¶
Plots the linear regression line of data given the analyte_name and well_name. The plot includes the prediction where the line of best fit intersects with the Maximum Concentration Limit (MCL).
- Parameters
well_name (str) – name of the well to be processed
analyte_name (str) – name of the analyte to be processed
year_interval (int, optional) – plot by how many years to appear in the axis e.g.(1 = every year, 5 = every 5 years, …). Defaults to 5.
save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_MCL’.
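The intersection described above follows from the fitted line: if the fit is `y = slope * x + intercept`, the line reaches the limit at `x = (MCL - intercept) / slope`. The sketch below works through that arithmetic with an ordinary least-squares fit; the data points and helper names are made up for illustration, and the limit value 1.3 is simply reused from the `get_MCL` example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def crossing_point(slope, intercept, limit):
    """x at which the fitted line equals `limit`, or None for a flat line."""
    if slope == 0:
        return None
    return (limit - intercept) / slope

# Made-up, steadily declining values over time (x in years).
years = [2000.0, 2005.0, 2010.0, 2015.0]
values = [3.0, 2.5, 2.0, 1.5]
slope, intercept = fit_line(years, values)
year_at_mcl = crossing_point(slope, intercept, limit=1.3)
```

With these numbers the slope is -0.1 per year, so the line reaches 1.3 two years after the last sample.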
- plot_PCA_by_date(date, analytes, lag=0, n_clusters=4, return_clusters=False, min_samples=3, show_labels=True, save_dir='plot_PCA_by_date', filter=False, col=None, equals=[])¶
Generates a PCA biplot (PCA score plot + loading plot) of the data given a date in the dataset. The data is also clustered into n_clusters.
- Parameters
date (str) – date to be analyzed
analytes (list) – list of analyte names to use
lag (int, optional) – number of days to look ahead and behind the specified date (+/-). Defaults to 0.
n_clusters (int, optional) – number of clusters to split the data into. Defaults to 4.
return_clusters (bool, optional) – Flag to return the cluster data to be used for spatial plotting. Defaults to False.
min_samples (int, optional) – minimum number of samples the result should contain in order to execute. Defaults to 3.
show_labels (bool, optional) – choose whether or not to show the name of the wells. Defaults to True.
save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_PCA_by_date’.
filter (bool, optional) – flag to indicate filtering. Defaults to False.
col (str, optional) – column to filter. Example: col=’STATION_ID’. Defaults to None.
equals (list, optional) – values to filter col by. Examples: equals=[‘FAI001A’, ‘FAI001B’]. Defaults to [].
- plot_PCA_by_well(well_name, analytes, interpolate=False, frequency='2W', min_samples=10, show_labels=True, save_dir='plot_PCA_by_well')¶
Generates a PCA biplot (PCA score plot + loading plot) of the data given a well_name in the dataset. Only uses the 6 important analytes.
- Parameters
well_name (str) – name of the well to be processed
analytes (list) – list of analyte names to use
interpolate (bool, optional) – choose to interpolate the data. Defaults to False.
frequency (str, optional) – {‘D’, ‘W’, ‘M’, ‘Y’} frequency to interpolate. See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html for valid frequency inputs. (e.g. ‘W’ = every week, ‘D’ = every day, ‘2W’ = every 2 weeks). Defaults to ‘2W’.
min_samples (int, optional) – minimum number of samples the result should contain in order to execute. Defaults to 10.
show_labels (bool, optional) – choose whether or not to show the name of the wells. Defaults to True.
save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_PCA_by_well’.
- plot_PCA_by_year(year, analytes, n_clusters=4, return_clusters=False, min_samples=10, show_labels=True, save_dir='plot_PCA_by_year', filter=False, col=None, equals=[])¶
Generates a PCA biplot (PCA score plot + loading plot) of the data given a year in the dataset. The data is also clustered into n_clusters.
- Parameters
year (int) – year to be analyzed
analytes (list) – list of analyte names to use
n_clusters (int, optional) – number of clusters to split the data into. Defaults to 4.
return_clusters (bool, optional) – Flag to return the cluster data to be used for spatial plotting. Defaults to False.
min_samples (int, optional) – minimum number of samples the result should contain in order to execute. Defaults to 10.
show_labels (bool, optional) – choose whether or not to show the name of the wells. Defaults to True.
save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_PCA_by_year’.
filter (bool, optional) – flag to indicate filtering. Defaults to False.
col (str, optional) – column to filter. Example: col=’STATION_ID’. Defaults to None.
equals (list, optional) – values to filter col by. Examples: equals=[‘FAI001A’, ‘FAI001B’]. Defaults to [].
- plot_all_corr_by_well(analytes, remove_outliers=True, z_threshold=4, interpolate=False, frequency='2W', save_dir='plot_correlation', log_transform=False, fontsize=20)¶
Plots the correlations with the physical plots as well as the important analytes over time for each well in the dataset.
- Parameters
analytes (list) – list of analyte names to use
remove_outliers (bool, optional) – choose whether or not to remove the outliers. Defaults to True.
z_threshold (int, optional) – z_score threshold to eliminate outliers. Defaults to 4.
interpolate (bool, optional) – choose whether or not to interpolate the data. Defaults to False.
frequency (str, optional) – {‘D’, ‘W’, ‘M’, ‘Y’} frequency to interpolate. Note: See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html for valid frequency inputs. (e.g. ‘W’ = every week, ‘D’ = every day, ‘2W’ = every 2 weeks). Defaults to ‘2W’.
save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_correlation’.
log_transform (bool, optional) – flag for log base 10 transformation. Defaults to False.
fontsize (int, optional) – font size. Defaults to 20.
- plot_all_correlation_heatmap(show_symmetry=True, color=True, save_dir='plot_correlation_heatmap')¶
Plots a heatmap of the correlations of the important analytes over time for each well in the dataset.
- Parameters
show_symmetry (bool, optional) – choose whether or not the heatmap should show the same information twice over the diagonal. Defaults to True.
color (bool, optional) – choose whether or not the plot should be in color or in greyscale. Defaults to True.
save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_correlation_heatmap’.
- plot_all_data(log_transform=True, alpha=0, year_interval=2, plot_inline=True, save_dir='plot_data')¶
Plot concentrations over time for every well and analyte with a smoothed curve on interpolated data points.
- Parameters
log_transform (bool, optional) – choose whether or not the data should be transformed to log base 10 values. Defaults to True.
alpha (int, optional) – value between 0 and 10 for line smoothing. Defaults to 0.
plot_inline (bool, optional) – choose whether or not to show plot inline. Defaults to True.
year_interval (int, optional) – plot by how many years to appear in the axis e.g.(1 = every year, 5 = every 5 years, …). Defaults to 2.
save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_data’.
- plot_all_time_series(analyte_name=None, title='Dataset: Time ranges', x_label='Well', y_label='Year', x_label_size=8, marker_size=30, min_days=10, x_min_lim=None, x_max_lim=None, y_min_date=None, y_max_date=None, sort_by_distance=True, source_coordinate=[436642.7, 3681927.09], log_transform=False, cmap=<matplotlib.colors.LinearSegmentedColormap object>, drop_cols=[], return_data=False, filter=False, col=None, equals=[], cbar_min=None, cbar_max=None, reverse_y_axis=False, fontsize=20, figsize=(20, 6), dpi=300, y_2nd_label=None)¶
Plots the start and end date of analyte readings for different locations/sensors/wells with colored concentration reading.
- Parameters
analyte_name (str, optional) – analyte to examine. Defaults to None.
title (str, optional) – plot title. Defaults to ‘Dataset: Time ranges’.
x_label (str, optional) – x axis label. Defaults to ‘Well’.
y_label (str, optional) – y axis label. Defaults to ‘Year’.
x_label_size (int, optional) – x axis label font size. Defaults to 8.
marker_size (int, optional) – point size for time series. Defaults to 30.
min_days (int, optional) – minimum number of days required to plot the time series. Defaults to 10.
x_min_lim (int, optional) – x axis starting point. Defaults to None.
x_max_lim (int, optional) – x axis ending point. Defaults to None.
y_min_date (str, optional) – y axis starting date. Defaults to None.
y_max_date (str, optional) – y axis ending date. Defaults to None.
sort_by_distance (bool, optional) – flag to sort by distance from source center. Defaults to True.
source_coordinate (list, optional) – Easting, Northing coordinate of source center. Defaults to [436642.70,3681927.09].
log_transform (bool, optional) – flag to toggle log base 10 transformation. Defaults to False.
cmap (cmap, optional) – color map for plotting. Defaults to mpl.cm.rainbow.
drop_cols (list, optional) – columns, usually wells, to exclude. Defaults to [].
return_data (bool, optional) – flag to return data. Defaults to False.
filter (bool, optional) – flag to indicate filtering. Defaults to False.
col (str, optional) – column to filter. Example: col=’STATION_ID’. Defaults to None.
equals (list, optional) – values to filter col by. Examples: equals=[‘FAI001A’, ‘FAI001B’]. Defaults to [].
cbar_min (float, optional) – color bar lower boundary. Defaults to None.
cbar_max (float, optional) – color bar upper boundary. Defaults to None.
reverse_y_axis (bool, optional) – flag that reverses y axis. Defaults to False.
fontsize (int, optional) – plot font size. Defaults to 20.
figsize (tuple, optional) – matplotlib style figure size. Defaults to (20,6).
dpi (int, optional) – DPI of figure. Defaults to 300.
y_2nd_label (str, optional) – color bar label manual override. Defaults to None.
- plot_all_time_series_simple(analyte_name=None, start_date=None, end_date=None, title='Dataset: Time ranges', x_label='Well', y_label='Year', min_days=10, x_min_lim=-5, x_max_lim=170, y_min_date='1988-01-01', y_max_date='2020-01-01', return_data=False, filter=False, col=None, equals=[])¶
Plots the start and end date of analyte readings for different locations/sensors/wells.
- Parameters
analyte_name (str, optional) – analyte to examine. Defaults to None.
start_date (str, optional) – start date of horizontal time to show alignment. Defaults to None.
end_date (str, optional) – end date of horizontal time to show alignment. Defaults to None.
title (str, optional) – plot title. Defaults to ‘Dataset: Time ranges’.
x_label (str, optional) – x axis label. Defaults to ‘Well’.
y_label (str, optional) – y axis label. Defaults to ‘Year’.
min_days (int, optional) – minimum number of days required to plot the time series. Defaults to 10.
x_min_lim (int, optional) – x axis starting point. Defaults to -5.
x_max_lim (int, optional) – x axis ending point. Defaults to 170.
y_min_date (str, optional) – y axis starting date. Defaults to ‘1988-01-01’.
y_max_date (str, optional) – y axis ending date. Defaults to ‘2020-01-01’.
return_data (bool, optional) – flag to return data. Defaults to False.
filter (bool, optional) – flag to indicate filtering. Defaults to False.
col (str, optional) – column to filter. Example: col=’STATION_ID’. Defaults to None.
equals (list, optional) – values to filter col by. Examples: equals=[‘FAI001A’, ‘FAI001B’]. Defaults to [].
- plot_coordinates_to_map(gps_data, center=[33.271459, -81.675873], zoom=14) ipyleaflet.leaflet.Map ¶
Plots the well locations on an interactive map given coordinates.
- Parameters
gps_data (pd.DataFrame) – Data frame with the following column names: station_id, latitude, longitude, color. If the color column is not passed, the default color will be blue.
center (list, optional) – latitude and longitude coordinates to center the map view. Defaults to [33.271459, -81.675873].
zoom (int, optional) – value to determine the initial scale of the map. Defaults to 14.
- Returns
ipyleaflet.Map
- plot_corr_by_date_range(date, analytes, lag=0, min_samples=10, save_dir='plot_corr_by_date', log_transform=False, fontsize=20, returnData=False, no_log=None)¶
Plots the correlations with the physical plots as well as the correlations of the important analytes for ALL the wells on a specified date or range of dates if a lag greater than 0 is specified.
- Parameters
date (str) – date to be analyzed
analytes (list) – list of analyte names to use
lag (int, optional) – number of days to look ahead and behind the specified date (+/-). Defaults to 0.
min_samples (int, optional) – minimum number of samples the result should contain in order to execute. Defaults to 10.
save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_corr_by_date’.
log_transform (bool, optional) – flag for log base 10 transformation. Defaults to False.
fontsize (int, optional) – font size. Defaults to 20.
returnData (bool, optional) – flag to return data used to perform correlation analysis. Defaults to False.
no_log (list, optional) – list of column names to not apply log transformation to. Defaults to None.
- plot_corr_by_well(well_name, analytes, remove_outliers=True, z_threshold=4, interpolate=False, frequency='2W', save_dir='plot_correlation', log_transform=False, fontsize=20, returnData=False, remove=[], no_log=None)¶
Plots the correlations with the physical plots as well as the correlations of the important analytes over time for a specified well.
- Parameters
well_name (str) – name of the well to be processed
analytes (list) – list of analyte names to use
remove_outliers (bool, optional) – choose whether or not to remove the outliers. Defaults to True.
z_threshold (int, optional) – z_score threshold to eliminate outliers. Defaults to 4.
interpolate (bool, optional) – choose whether or not to interpolate the data. Defaults to False.
frequency (str, optional) – {‘D’, ‘W’, ‘M’, ‘Y’} frequency to interpolate. Note: See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html for valid frequency inputs. (e.g. ‘W’ = every week, ‘D’ = every day, ‘2W’ = every 2 weeks). Defaults to ‘2W’.
save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_correlation’.
log_transform (bool, optional) – flag for log base 10 transformation. Defaults to False.
fontsize (int, optional) – font size. Defaults to 20.
returnData (bool, optional) – flag to return data used to perform correlation analysis. Defaults to False.
remove (list, optional) – wells to remove. Defaults to [].
no_log (list, optional) – list of column names to not apply log transformation to. Defaults to None.
- Returns
None
- plot_corr_by_year(year, analytes, remove_outliers=True, z_threshold=4, min_samples=10, save_dir='plot_corr_by_year', log_transform=False, fontsize=20, returnData=False, no_log=None)¶
Plots the correlations with the physical plots as well as the correlations of the important analytes for ALL the wells in specified year.
- Parameters
year (int) – year to be analyzed
analytes (list) – list of analyte names to use
remove_outliers (bool, optional) – choose whether or not to remove the outliers. Defaults to True.
z_threshold (int, optional) – z_score threshold to eliminate outliers. Defaults to 4.
min_samples (int, optional) – minimum number of samples the result should contain in order to execute. Defaults to 10.
save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_corr_by_year’.
log_transform (bool, optional) – flag for log base 10 transformation. Defaults to False.
fontsize (int, optional) – font size. Defaults to 20.
returnData (bool, optional) – flag to return data used to perform correlation analysis. Defaults to False.
no_log (list, optional) – list of column names to not apply log transformation to. Defaults to None.
- plot_correlation_heatmap(well_name, show_symmetry=True, color=True, save_dir='plot_correlation_heatmap')¶
Plots a heatmap of the correlations of the important analytes over time for a specified well.
- Parameters
well_name (str) – name of the well to be processed
show_symmetry (bool, optional) – choose whether or not the heatmap should show the same information twice over the diagonal. Defaults to True.
color (bool, optional) – choose whether or not the plot should be in color or in greyscale. Defaults to True.
save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_correlation_heatmap’.
- Returns
None
- plot_data(well_name, analyte_name, log_transform=True, alpha=0, plot_inline=True, year_interval=2, x_label='Years', y_label='', save_dir='plot_data', filter=False, col=None, equals=[])¶
Plot concentrations over time of a specified well and analyte with a smoothed curve on interpolated data points.
- Parameters
well_name (str) – name of the well to be processed
analyte_name (str) – name of the analyte to be processed
log_transform (bool, optional) – choose whether or not the data should be transformed to log base 10 values. Defaults to True.
alpha (int, optional) – value between 0 and 10 for line smoothing. Defaults to 0.
plot_inline (bool, optional) – choose whether or not to show plot inline. Defaults to True.
year_interval (int, optional) – plot by how many years to appear in the axis e.g.(1 = every year, 5 = every 5 years, …). Defaults to 2.
x_label (str, optional) – x axis label. Defaults to ‘Years’.
y_label (str, optional) – y axis label. Defaults to ‘’.
save_dir (str, optional) – name of the directory you want to save the plot to. Defaults to ‘plot_data’.
filter (bool, optional) – flag to indicate filtering. Defaults to False.
col (str, optional) – column to filter. Example: col=’STATION_ID’. Defaults to None.
equals (list, optional) – values to filter col by. Examples: equals=[‘FAI001A’, ‘FAI001B’]. Defaults to [].
- Returns
None
- query_data(well_name, analyte_name)¶
Filters the concentration data by the specified well_name and analyte_name.
- Parameters
well_name (str) – name of the well to be processed
analyte_name (str) – name of the analyte to be processed
- Returns
filtered data based on query conditions
- Return type
pd.DataFrame
- remove_outliers(data, z_threshold=4)¶
Removes outliers from a dataframe based on the z_scores and returns the new dataframe.
- Parameters
data (pd.DataFrame) – data from which the outliers are removed
z_threshold (int, optional) – z_score threshold to eliminate. Defaults to 4.
- Returns
data with outliers removed
- Return type
pd.DataFrame
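A z-score filter of this kind can be sketched on a single list of values as follows. The sketch assumes the population standard deviation and keeps values whose absolute z-score does not exceed the threshold; whether pylenm uses exactly these conventions, and how it combines several columns, is not stated here, so treat the details and the helper name as assumptions.

```python
import statistics

def remove_outliers_sketch(values, z_threshold=4):
    """Drop values whose absolute z-score exceeds `z_threshold`.

    Assumption: z-scores use the population standard deviation.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:                      # all values identical, nothing to drop
        return list(values)
    return [v for v in values if abs((v - mean) / std) <= z_threshold]

# Made-up readings with one obvious outlier.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 100]
cleaned = remove_outliers_sketch(readings, z_threshold=2)
```

With the population standard deviation, the largest possible z-score in a sample of n values is sqrt(n - 1), so a threshold of 4 can only remove points when there are at least 18 values; the example uses a threshold of 2 for that reason.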
- setConstructionData(construction_data: pandas.core.frame.DataFrame, verbose=True)¶
Imports the additional well information as a separate DataFrame.
- Parameters
construction_data (pd.DataFrame) – Data with additional details.
verbose (bool, optional) – Prints success message. Defaults to True.
- Returns
None
- setData(data: pandas.core.frame.DataFrame, verbose: bool = True) None ¶
Saves the dataset into pylenm
- Parameters
data (pd.DataFrame) – Dataset to be imported.
verbose (bool, optional) – Prints success message. Defaults to True.
- Returns
None
- simplify_data(data=None, inplace=False, columns=None, save_csv=False, file_name='data_simplified', save_dir='data/')¶
Removes all columns except ‘COLLECTION_DATE’, ‘STATION_ID’, ‘ANALYTE_NAME’, ‘RESULT’, and ‘RESULT_UNITS’.
If the user specifies additional columns in addition to the ones listed above, those columns will be kept. The function returns a dataframe and has an optional parameter to be able to save the dataframe to a csv file.
- Parameters
data (pd.DataFrame, optional) – data to simplify. Defaults to None.
inplace (bool, optional) – save data to current working dataset. Defaults to False.
columns (list, optional) – list of any additional columns on top of [‘COLLECTION_DATE’, ‘STATION_ID’, ‘ANALYTE_NAME’, ‘RESULT’, and ‘RESULT_UNITS’] to be kept in the dataframe. Defaults to None.
save_csv (bool, optional) – flag to determine whether or not to save the dataframe to a csv file. Defaults to False.
file_name (str, optional) – name of the csv file you want to save. Defaults to ‘data_simplified’.
save_dir (str, optional) – name of the directory you want to save the csv file to. Defaults to ‘data/’.
- Returns
pd.DataFrame