This page was generated from docs/source/notebooks/2) pyLEnM - Unsupervised Learning.ipynb.

Case 2 - Unsupervised Learning

Welcome to the demonstration notebook, where we’ll go over the unsupervised learning functions in the pylenm package. Let’s get started!

Setup

Make sure to install pylenm from https://pypi.org/project/pylenm/ by running pip install pylenm in your environment terminal. Once completed, you should be able to import the package. Note: to update to the latest version of pylenm run: pip install pylenm --upgrade

[1]:
# pip install pylenm
[2]:
# Import our packages
import pylenm
from pylenm import PylenmDataFactory
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option('display.max_rows', 100) # Display a custom number of rows for dataframe

We can verify the package version by typing: pylenm.__version__

[3]:
pylenm.__version__
[3]:
'0.2'
[4]:
url_1 = 'https://raw.githubusercontent.com/ALTEMIS-DOE/pylenm/master/notebooks/data/FASB_Data_thru_3Q2015_Reduced_Demo.csv'
url_2 = 'https://github.com/ALTEMIS-DOE/pylenm/blob/master/notebooks/data/FASB%20Well%20Construction%20Info.xlsx?raw=true'
concentration_data = pd.read_csv(url_1)
construction_data = pd.read_excel(url_2)

# Create instance
pylenm_df = PylenmDataFactory(concentration_data) # Save concentration data
pylenm_df.simplify_data(inplace=True)
pylenm_df.setConstructionData(construction_data) # Save construction data
Successfully imported the data!

Successfully imported the construction data!

Functions

The getCleanData() function is a useful preprocessing tool that reshapes the original concentration dataset into a layout better suited to analysis: one row per collection date and one column per analyte and well. Let’s take a closer look at the function:

[5]:
# We'll save a list of the analytes we want to look at and pass it to the functions below
# analytes = ['TRITIUM','IODINE-129','SPECIFIC CONDUCTANCE', 'PH','URANIUM-238', 'DEPTH_TO_WATER']
analytes = ['TRITIUM','SPECIFIC CONDUCTANCE', 'PH','URANIUM-238', 'DEPTH_TO_WATER']
pylenm_df.getCleanData(analytes)
[5]:
ANALYTE_NAME DEPTH_TO_WATER ... URANIUM-238
STATION_ID FBI 14D FBI 15D FBI 17D FEX 4 FIB 1 FIB 8 FOB 1D FOB 2C FOB 2D FOB 13D ... FSP 2B FSP 2C FSP 47A FSP-072A FSP-072B FSP-12A FSP204A FSP226A FSP249A FSP249B
COLLECTION_DATE
1990-01-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1990-01-02 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1990-01-03 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1990-01-06 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1990-01-07 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2015-09-10 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-09-21 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-09-22 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-09-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-09-24 17.4 15.4 25.16 NaN NaN NaN NaN NaN NaN 21.5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

2421 rows × 773 columns

As you can see above, there are many missing values (NaN). That’s expected, since a concentration value doesn’t exist for every combination of analyte, well, and date.
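
The sparsity follows directly from the reshaping. As a minimal sketch (not pylenm’s actual implementation), pivoting a long-format table into a date-by-(analyte, well) grid leaves NaN wherever a combination was never sampled. The toy records, the well names, and the RESULT column name below are assumptions made purely for illustration.

```python
import pandas as pd

# Hypothetical long-format records; the values and the RESULT column are made up.
records = pd.DataFrame({
    'COLLECTION_DATE': pd.to_datetime(['1990-01-01', '1990-01-01', '1990-01-02']),
    'STATION_ID': ['WELL A', 'WELL B', 'WELL A'],
    'ANALYTE_NAME': ['TRITIUM', 'PH', 'PH'],
    'RESULT': [120.0, 5.1, 4.9],
})

# One row per date, one column per (analyte, well) pair.
wide = records.pivot_table(index='COLLECTION_DATE',
                           columns=['ANALYTE_NAME', 'STATION_ID'],
                           values='RESULT')

# Any (date, analyte, well) combination without a measurement becomes NaN.
n_missing = int(wide.isna().sum().sum())
print(wide)
print('Missing cells:', n_missing)
```

With only three measurements spread over two dates and three (analyte, well) columns, half of the six cells are empty. The real dataset has 773 such columns, so most cells on any given date are NaN.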

The next function, getCommonDates(), shows how many wells have data available within a given range of days around each date. Let’s look at an example, followed by a plot showing how much more data can be extracted from the dataset by using a lag.

[6]:
lags = [1,3,7,12]
shared_dates = pylenm_df.getCommonDates(analytes=analytes, lag=lags)
shared_dates
[6]:
Date Ranges Number of wells
Dates Lag
1990-01-01 1 1989-12-31 - 1990-01-02 16
3 1989-12-29 - 1990-01-04 24
7 1989-12-25 - 1990-01-08 41
12 1989-12-20 - 1990-01-13 53
1990-01-02 1 1990-01-01 - 1990-01-03 14
... ... ... ...
2015-09-23 12 2015-09-11 - 2015-10-05 15
2015-09-24 1 2015-09-23 - 2015-09-25 8
3 2015-09-21 - 2015-09-27 14
7 2015-09-17 - 2015-10-01 15
12 2015-09-12 - 2015-10-06 15

9684 rows × 2 columns
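
To make the effect of the lag concrete, here is a minimal sketch of the counting idea. It assumes a well counts toward a date when it has at least one sample of every requested analyte inside the window of date ± lag days. That criterion, the toy records, and the helper name wells_in_window are assumptions for illustration; getCommonDates() may count differently.

```python
import pandas as pd

# Hypothetical sampling log: which well was sampled for which analyte, and when.
samples = pd.DataFrame({
    'COLLECTION_DATE': pd.to_datetime(['1990-01-01', '1990-01-01', '1990-01-01',
                                       '1990-01-04', '1990-01-09', '1990-01-09']),
    'STATION_ID': ['WELL A', 'WELL A', 'WELL B', 'WELL B', 'WELL C', 'WELL C'],
    'ANALYTE_NAME': ['TRITIUM', 'PH', 'TRITIUM', 'PH', 'TRITIUM', 'PH'],
})

def wells_in_window(df, date, lag, analytes):
    """Count wells that have every analyte sampled within date +/- lag days."""
    center = pd.Timestamp(date)
    start = center - pd.Timedelta(days=lag)
    end = center + pd.Timedelta(days=lag)
    # Keep only samples inside the window (both ends inclusive) for the chosen analytes.
    in_window = df[df['COLLECTION_DATE'].between(start, end)]
    in_window = in_window[in_window['ANALYTE_NAME'].isin(analytes)]
    # A well qualifies when all requested analytes appear at least once.
    per_well = in_window.groupby('STATION_ID')['ANALYTE_NAME'].nunique()
    return int((per_well == len(set(analytes))).sum())

counts = {lag: wells_in_window(samples, '1990-01-01', lag, ['TRITIUM', 'PH'])
          for lag in [1, 3, 7, 12]}
print(counts)
```

In this toy log, widening the window lets wells whose analytes were sampled a few days apart count as having complete data.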

Let’s create a plot to examine how the number of wells with available data changes as we increase the lag.

[7]:
colors = ['r', 'b', 'g', 'y']

fig, axs = plt.subplots(nrows=1, ncols=4,figsize=(20,5), sharex=True, sharey=True)
for i, ax in enumerate(axs):
    data = np.array(shared_dates[shared_dates.index.get_level_values('Lag')==lags[i]]['Number of wells'])
    axs[i].plot(data, color=colors[i], label='Lag of '+str(lags[i]))
    axs[i].legend(loc="upper right")
    axs[i].set_xlabel('Time series')
    axs[i].set_ylabel('Number of wells')
    stats_text = str('Number of wells stats:\nMIN: {}\nMEAN: {}\nMAX: {}'.format(data.min(), round(data.mean(), 2), data.max()))
    axs[i].text(0.5,-0.4, stats_text, size=12, ha="center", transform=axs[i].transAxes)
../_images/notebooks_2)_pyLEnM_-_Unsupervised_Learning_14_0.png

As we can see, the average number of wells increases significantly as we increase the lag. With this insight, we can decide which lag is most suitable for our data. For the purpose of this demonstration, we will continue the rest of the examples with a lag of 12.

getJointData() takes getCleanData() one step further and saves the data according to the specified lag. You’ll notice that the index is no longer a single date but a range of dates. The new range is (date - lag) through (date + lag).

[8]:
lag = 12
jointData = pylenm_df.getJointData(analytes, lag=lag)
jointData
GENERATING DATA WITH A LAG OF 12.
Progress:
1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 11%, 12%, 13%, 14%, 15%, 16%, 17%, 18%, 19%, 20%, 21%, 22%, 23%, 24%, 25%, 26%, 27%, 28%, 29%, 30%, 31%, 32%, 33%, 34%, 35%, 36%, 37%, 38%, 39%, 40%, 41%, 42%, 43%, 44%, 45%, 46%, 47%, 48%, 49%, 50%, 51%, 52%, 53%, 54%, 55%, 56%, 57%, 58%, 59%, 60%, 61%, 62%, 63%, 64%, 65%, 66%, 67%, 68%, 69%, 70%, 71%, 72%, 73%, 74%, 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, Completed
[8]:
ANALYTE_NAME DEPTH_TO_WATER ... URANIUM-238
STATION_ID FBI 14D FBI 15D FBI 17D FEX 4 FIB 1 FIB 8 FOB 1D FOB 2C FOB 2D FOB 13D ... FSP 2B FSP 2C FSP 47A FSP-072A FSP-072B FSP-12A FSP204A FSP226A FSP249A FSP249B
1989-12-20 - 1990-01-13 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1989-12-21 - 1990-01-14 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1989-12-22 - 1990-01-15 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1989-12-25 - 1990-01-18 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1989-12-26 - 1990-01-19 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2015-08-29 - 2015-09-22 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 0.237 NaN NaN 0.104 0.0457 1.02 0.479 NaN 0.287 0.169
2015-09-09 - 2015-10-03 17.4 15.4 25.16 NaN NaN NaN NaN NaN NaN 21.5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-09-10 - 2015-10-04 17.4 15.4 25.16 NaN NaN NaN NaN NaN NaN 21.5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-09-11 - 2015-10-05 17.4 15.4 25.16 NaN NaN NaN NaN NaN NaN 21.5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-09-12 - 2015-10-06 17.4 15.4 25.16 NaN NaN NaN NaN NaN NaN 21.5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

2421 rows × 773 columns
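
The index labels above follow the (date - lag) through (date + lag) rule. A small sketch of how such a label can be built with pandas is shown here; the helper name date_range_label is hypothetical and not part of pylenm.

```python
import pandas as pd

def date_range_label(date, lag):
    """Build a 'start - end' label spanning lag days on either side of date."""
    center = pd.Timestamp(date)
    start = center - pd.Timedelta(days=lag)
    end = center + pd.Timedelta(days=lag)
    return '{} - {}'.format(start.date(), end.date())

label = date_range_label('1990-01-01', 12)
print(label)
```

For 1990-01-01 with a lag of 12 this yields 1989-12-20 - 1990-01-13, matching the first row of the output above.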

[9]:
# analytes = ['TRITIUM','SPECIFIC CONDUCTANCE', 'PH','URANIUM-238', 'DEPTH_TO_WATER']

pylenm_df.plot_corr_by_well(well_name='FSB 95DR', analytes=analytes, log_transform=True, remove_outliers=True, z_threshold=1.3, remove=['1999-07-28'], no_log=['PH'])


pylenm_df.plot_corr_by_well(well_name='FSB 95DR', analytes=analytes,
                         interpolate=True, frequency='M',
                         remove_outliers=True, z_threshold=1.3, log_transform=True, remove=['1999-07-28'], no_log=['PH'])
../_images/notebooks_2)_pyLEnM_-_Unsupervised_Learning_17_0.png
../_images/notebooks_2)_pyLEnM_-_Unsupervised_Learning_17_1.png
[10]:
pylenm_df.plot_corr_by_date_range('1993-02-21', analytes=analytes, returnData=True)
[10]:
ANALYTE_NAME DEPTH_TO_WATER PH SPECIFIC CONDUCTANCE TRITIUM URANIUM-238
STATION_ID
FSB 91C 66.76 4.875 314.500000 946.00 1.81
FSB 91D 63.92 3.600 381.000000 1820.00 102.00
FSB 93C 65.75 5.100 374.000000 1510.00 1.00
FSB 94C 71.40 4.370 1955.000000 10200.00 10.30
FSB 94DR 69.11 3.115 2585.000000 25000.00 1190.00
FSB 96AR 127.52 7.315 175.500000 9.06 1.00
FSB 97A 133.39 7.070 266.333333 335.00 1.00
FSB 97C 76.40 3.615 2095.000000 17700.00 622.50
FSB 97D 73.88 3.735 2415.000000 20900.00 559.00
FSB 98AR 131.98 7.280 157.000000 15.30 1.00
FSB 98C 74.36 NaN NaN NaN NaN
FSB 98D 71.25 NaN NaN NaN NaN
FSB 99A 136.36 7.180 162.000000 115.00 1.00
FSB 99C 76.59 5.390 318.500000 1700.00 5.57
FSB 99D 78.15 4.960 36.000000 34.90 8.08
FSB102C 5.19 4.545 280.500000 897.00 1.00
FSB103C 38.06 5.775 242.000000 675.00 1.00
FSB104C 16.67 5.280 441.000000 1270.00 1.00
FSB104D 18.65 3.720 659.500000 4920.00 273.00
FSB114A 95.86 8.470 184.000000 0.70 1.17
FSB114C 37.40 5.765 57.500000 3.09 1.00
FSB114D 33.52 5.120 47.000000 8.86 1.00
../_images/notebooks_2)_pyLEnM_-_Unsupervised_Learning_18_1.png
[11]:
pylenm_df.plot_corr_by_date_range('1993-02-21', lag=lag, analytes=analytes, log_transform=True, no_log=['PH'])
../_images/notebooks_2)_pyLEnM_-_Unsupervised_Learning_19_0.png
[12]:
pylenm_df.plot_corr_by_date_range('1993-02-21', lag=lag, analytes=analytes, log_transform=True)
../_images/notebooks_2)_pyLEnM_-_Unsupervised_Learning_20_0.png
[13]:
pylenm_df.plot_corr_by_year(2015, analytes=analytes, remove_outliers=True, z_threshold=3, no_log=['PH'])
../_images/notebooks_2)_pyLEnM_-_Unsupervised_Learning_21_0.png
[14]:
pylenm_df.plot_PCA_by_date('1993-02-21', analytes)
../_images/notebooks_2)_pyLEnM_-_Unsupervised_Learning_22_0.png
[15]:
pylenm_df.plot_PCA_by_date('1993-02-21', analytes, lag=lag)
../_images/notebooks_2)_pyLEnM_-_Unsupervised_Learning_23_0.png
[16]:
pylenm_df.plot_PCA_by_date('1993-02-21', analytes, lag=0, filter=True, col='AQUIFER', equals=['LAZ_UTRAU'])
../_images/notebooks_2)_pyLEnM_-_Unsupervised_Learning_24_0.png
[17]:
pylenm_df.plot_PCA_by_well(well_name='FSB 95DR', analytes=analytes)
../_images/notebooks_2)_pyLEnM_-_Unsupervised_Learning_25_0.png
[18]:
pylenm_df.plot_PCA_by_year(2015, analytes=analytes)
../_images/notebooks_2)_pyLEnM_-_Unsupervised_Learning_26_0.png

Clustering

[19]:
tritium = pylenm_df.interpolate_wells_by_analyte('TRITIUM', frequency='2W')
tritium = np.log10(tritium)
tritium = tritium.dropna(axis=1)
tritium
[19]:
FAS-091 FAS-092 FSB113A FSB113C FSB113D FSB114C FSB114D FSB115C FSB115D FSB116C ... FSB 88C FSB 92C FSB 88D FSB 89C FSB 89D FSB 90C FSB 90D FSB 91C FSB 91D FSP249B
2002-12-08 2.442409 2.594370 1.993302 1.595075 2.810878 0.443920 0.524787 0.773549 0.753364 0.874275 ... 3.145675 3.065711 3.179308 2.147290 2.861863 2.472391 3.084693 2.617371 3.543351 2.836930
2002-12-22 2.442409 2.594370 1.953661 1.591948 2.799402 0.438073 0.512563 0.770535 0.750288 0.872181 ... 3.139085 3.067431 3.174294 2.168966 2.896594 2.465324 3.100708 2.646217 3.541925 2.783829
2003-01-05 2.442409 2.594370 1.910036 1.588797 2.787615 0.432146 0.499985 0.767500 0.747190 0.870078 ... 3.132393 3.069143 3.169222 2.189612 2.928751 2.458141 3.116153 2.673266 3.540494 2.739983
2003-01-19 2.442409 2.594370 1.861534 1.585624 2.775498 0.426137 0.487032 0.764444 0.744069 0.867964 ... 3.125596 3.070849 3.164090 2.209321 2.958691 2.450836 3.131067 2.698729 3.539059 2.691207
2003-02-02 2.442409 2.594370 1.806927 1.582427 2.763034 0.420043 0.473681 0.761366 0.740926 0.865840 ... 3.118692 3.074719 3.158896 2.228174 2.982648 2.443407 3.132189 2.714732 3.537619 2.636251
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2015-08-02 2.294466 2.452767 1.405076 1.879211 2.305351 0.080904 0.272331 0.338456 0.250420 0.396199 ... 1.503116 2.510826 0.760163 1.391016 2.194692 2.037884 1.443446 2.188659 0.457377 2.152420
2015-08-16 2.294466 2.452537 1.409933 1.878522 2.305351 0.079181 0.274158 0.338456 0.250420 0.396199 ... 1.502427 2.516323 0.761928 1.390935 2.201397 2.037426 1.445604 2.186108 0.469822 2.153339
2015-08-30 2.294466 2.452306 1.409933 1.878522 2.305351 0.079181 0.274158 0.338456 0.250420 0.396199 ... 1.502427 2.517196 0.761928 1.390935 2.201397 2.037426 1.445604 2.186108 0.469822 2.154257
2015-09-13 2.294466 2.452075 1.409933 1.878522 2.305351 0.079181 0.274158 0.338456 0.250420 0.396199 ... 1.502427 2.517196 0.761928 1.390935 2.201397 2.037426 1.445604 2.186108 0.469822 2.155126
2015-09-27 2.294466 2.451869 1.409933 1.878522 2.305351 0.079181 0.274158 0.338456 0.250420 0.396199 ... 1.502427 2.517196 0.761928 1.390935 2.201397 2.037426 1.445604 2.186108 0.469822 2.155336

335 rows × 156 columns

[20]:
n_samples = tritium.shape[0]  # number of time steps (rows) per well
n_wells = tritium.shape[1]    # number of wells (columns)
# Fraction of repeated values in each well's interpolated time series.
# Built in one step because DataFrame.append was removed in pandas 2.0.
rptData = pd.DataFrame({
    'station_id': tritium.columns,
    'ratio_repeated': [tritium[well].duplicated().sum() / n_samples for well in tritium.columns],
})
std_ratio = rptData['ratio_repeated'].std()
# Flag wells whose ratio of repeated values exceeds 1.5 standard deviations
bad_wells = rptData[rptData['ratio_repeated'] > 1.5*std_ratio]
bad_well_names = bad_wells.station_id.to_list()
# Remaining wells are counted against the number of columns, not rows
print("Bad wells: {}\nRemaining wells: {}".format(len(bad_well_names), n_wells - len(bad_well_names)))
tritium = tritium.drop(bad_well_names, axis=1)
Bad wells: 44
Remaining wells: 112
[21]:
tritium.plot(legend=False, figsize=(10,5))
# Remove outliers once and reuse the result for both the plot and the shape check
tritium_rm = pylenm_df.remove_outliers(tritium, z_threshold=2.5)
tritium_rm.plot(legend=False, figsize=(10,5))
print(tritium.shape)
print(tritium_rm.shape)
(335, 112)
(136, 112)
../_images/notebooks_2)_pyLEnM_-_Unsupervised_Learning_30_1.png
../_images/notebooks_2)_pyLEnM_-_Unsupervised_Learning_30_2.png
[22]:
pylenm_df.get_Construction_Data().head()
[22]:
FACILITY_ID SITE_GROUP SITE_NAME STATION_TYPE WELL USE AQUIFER WELL_USE LATITUDE LONGITUDE EASTING ... SZ_BOT(FT MSL) GROUND_ELEVATION REFERENCE_ELEVATION_CODE REFERENCE_ELEVATION TOTAL_DEPTH CONSTR_OBJ_DIAMETER CONSTR_OBJ_MATERIAL PUMP_TYPE COMPLETION_DATE DATE_SEALED
STATION_ID
FAI001A SRS GSA F & H-AREA HAZARDOUS WASTE MANAGEMENT FACILITI... MONITORING WELL Auxiliary Observation UAZ_UTRAU ACTIVE ECO-SENSITIVE 33.273872 -81.622904 441989.564 ... 231.30 250.1 C 252.63 19.10 2.0 PVC NONE 2016-03-22 NaT
FAI001B SRS GSA F & H-AREA HAZARDOUS WASTE MANAGEMENT FACILITI... MONITORING WELL Auxiliary Observation UAZ_UTRAU ACTIVE ECO-SENSITIVE 33.273873 -81.622891 441990.781 ... 240.60 250.2 C 252.73 9.90 2.0 PVC NONE 2016-03-22 NaT
FAI001C SRS GSA F & H-AREA HAZARDOUS WASTE MANAGEMENT FACILITI... MONITORING WELL Auxiliary Observation UAZ_UTRAU ACTIVE ECO-SENSITIVE 33.273874 -81.622895 441990.432 ... 242.68 250.2 C 252.74 7.82 2.0 PVC NONE 2016-03-22 NaT
FAI001D SRS GSA F & H-AREA HAZARDOUS WASTE MANAGEMENT FACILITI... MONITORING WELL Auxiliary Observation UAZ_UTRAU ACTIVE ECO-SENSITIVE 33.273874 -81.622901 441989.928 ... 246.75 250.1 C 252.56 3.65 2.0 PVC NONE 2016-03-22 NaT
FAI002A SRS GSA F & H-AREA HAZARDOUS WASTE MANAGEMENT FACILITI... MONITORING WELL Auxiliary Observation UAZ_UTRAU ACTIVE ECO-SENSITIVE 33.263961 -81.685462 436156.287 ... 165.88 185.1 C 187.58 3.65 2.0 PVC NONE 2016-03-29 NaT

5 rows × 22 columns

[23]:
cluster_data = pylenm_df.cluster_data(analyte_name= 'Tritium', data = tritium_rm, n_clusters=5, year_interval=3, return_clusters=True, filter=True, col='AQUIFER', equals=['UAZ_UTRAU'], y_label = 'Log Concentration')
['FSB 97D', 'FSB136D', 'FSB125DR', 'FSB 90D', 'FSP204A', 'FSB135D', 'FSB 79', 'FSP  2A', 'FOB 14D', 'FSB130D', 'FSB 88D', 'FSB123D', 'FSB 87D', 'FSB134D', 'FSB116D', 'FPZ008AR', 'FSB129D', 'FOB 13D', 'FSB112DR', 'FPZ  6B', 'FSP249B', 'FSB122D', 'FSB 98D', 'FSB108D', 'FSB133D', 'FSB 94DR', 'FSP-072A', 'FSB 95DR', 'FSB115D', 'FSB109D', 'FSB 89D', 'FBI 14D', 'FSP  2B', 'FPZ  2A', 'FPZ  7B', 'FSB132D', 'FSB 92D', 'FSB127D', 'FSB 76', 'FPZ008BR', 'FEX  4', 'FPZ  7A', 'FSB 91D', 'FSB138D', 'FSB124D', 'FSB120D', 'FSP-072B', 'FSB137D', 'FSB 93D', 'FSB117D', 'FSB114D', 'FSB104D', 'FPZ  3A', 'FPZ  4A', 'FSP249A', 'FSB126D', 'FSP226A', 'FSP-12A', 'FSB 99D', 'FSB118D', 'FPZ  6A', 'FSB 78', 'FSB128D', 'FSP 47A']
../_images/notebooks_2)_pyLEnM_-_Unsupervised_Learning_32_1.png
[24]:
cluster_data[['STATION_ID', 'color']].head()
[24]:
STATION_ID color
0 FBI 14D orange
1 FEX 4 red
2 FOB 13D green
3 FOB 14D orange
4 FPZ 2A purple
[25]:
pylenm_df.plot_coordinates_to_map(cluster_data)