For our final tutorial, we decided to look into the public and private investments of US politicians. In 2012, Congress passed the STOCK (Stop Trading on Congressional Knowledge) Act, which required that stock transactions by members of Congress be disclosed to the public. More recently, politicians have drawn attention for potential insider trading as more people have become aware that they are legally allowed to trade individual stocks. We plan to look at a dataset scraped directly from the Senate stock disclosure website (https://efdsearch.senate.gov/search/) to see whether this access has allowed politicians to earn superior returns, and to what degree. The dataset includes each senator and their trading data; however, it lacks party affiliation, state, etc. We pulled in two other datasets to add committee and party affiliation for each senator, and we built classification models to see if we can predict party and committee from the trade data.
In building our models, we had three specific goals in mind. One, can we predict the party of the trader based on the trade data? Two, can we predict the committee of the trader based on the trade data? Three, can we predict the type of trade (sale, purchase, exchange, etc.) based on the trade data? We have over 600,000 entries, and we want to see whether any of these targets can actually be predicted from the trades. Our plan is to use classification models, since we are predicting classes (e.g., Republican vs. Democrat) rather than a continuous value. Going in, the plan was to use a K-nearest neighbors model as a baseline classifier for each target and, if necessary, bring in more powerful classification models such as XGBoost and random forest. We found that we could predict party and transaction type with accuracies over 99%, but committee affiliation proved much harder to predict from trade data. This suggests that while party is associated with trading behavior, committee affiliation has far less relevance. We included transaction type to see whether there are patterns in the trade data that indicate whether a senator would buy or sell in a given situation. Ultimately, our conclusion is that party and transaction type are predictable from the trade data, while committee affiliation is not.
# data handling
import os
import json
import warnings
from datetime import datetime
import numpy as np
import pandas as pd
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated
from google.colab import drive
# fuzzy string matching
from fuzzywuzzy import fuzz, process
# plotting
import matplotlib.pyplot as plt
import seaborn as sns
import squarify
import circlify
import scikitplot as skplt
from simple_colors import *
# modeling
from sklearn import preprocessing
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (RandomForestRegressor, GradientBoostingRegressor,
                              AdaBoostRegressor, BaggingRegressor, RandomForestClassifier)
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
import xgboost as xgb
drive.mount("/content/drive")
The first dataset comes from https://efdsearch.senate.gov/search/ and was scraped using the tools found here (https://github.com/neelsomani/senator-filings). It includes investments from US senators dating back to 2012, when the STOCK Act was passed. The investments include individual public stocks in large companies as well as mutual funds and exchange-traded funds. The dataset reports the amount of each investment as a range (e.g., $1,001 - $15,000), and each transaction is labeled as a purchase or sale and dated. This is useful because we can identify when senators entered and exited investments and compare those dates to the stock or fund's performance. We can also step back and check whether any legislation passed at the time would have affected the performance of the companies these senators were invested in.
1 - filer - identifies individual or state that filed the transaction
2 - filed-date - day the investment was reported
3 - filed-time - time the investment was reported
4 - transaction-date - day the investment was made
5 - owner - who actually bought the asset (self or spouse)
6 - ticker - ticker of company if asset is public stock
7 - asset-name - full-name of the investment
8 - transaction-type - if the transaction was purchased, full or partially sold, or exchanged
9 - amount - value ranges of the investment, 11 increments beginning with $1,001 - $15,000 and ending with over $50,000,000
10 - asset-type - different investment types, including public and private stock, stock options, corporate and municipal bonds, commodities, and cryptocurrency
11 - type - if the transaction was purchased, full or partially sold, or exchanged
# download senate financial investments
url = 'https://raw.githubusercontent.com/jeremiak/us-senate-financial-disclosure-data/master/output/transactions.csv'
df_s = pd.read_csv(url, low_memory = False)
df_s.head()
drop = ['report-id',
'report-title',
'comment',
'comments',
'transactor',
'asset-id',
'date',
'amount-range'] #columns to drop
df_s.drop(drop, inplace=True, axis=1) #drop unnecessary columns
pd.options.mode.chained_assignment = None #silence SettingWithCopyWarning for the slicing below
df2 = df_s.dropna(subset=["ticker"]) #dropping rows without ticker
df2["filed-date"] = pd.to_datetime(df2["filed-date"], dayfirst=True) #convert filed-date to datetime64
df2["transaction-date"] = pd.to_datetime(df2["transaction-date"], dayfirst=True, errors = "coerce") #convert transaction-date into datetime64
df2 = df2[df2["transaction-date"] >= "2016-1-1"]
df2 = df2[df2["transaction-date"] <= "2022-11-18"]
df2.reset_index(drop=True, inplace=True) #reset dataframe index
df2
df2.isnull().sum()
Below, we combine two columns that ultimately represent the same thing: the dataset stores the transaction type under two different column names.
df2["transaction_type"] = df2["transaction-type"].fillna("") + df2["type"].fillna("")
df2["transaction_type"].replace("", np.nan, inplace=True)
df2.drop(["type", "transaction-type"],axis=1,inplace=True)
df2.head()
df2.dtypes
The second dataset comes from https://dataverse.harvard.edu/file.xhtml?fileId=6708559&version=6.1 and was uploaded as a CSV file to GitHub. It contains each senator's name, party affiliation, and state. This will be helpful for looking at differences in trade volume across political parties and states. We will ultimately build a classification model to see if we can predict a senator's party affiliation from the first dataset.
1 - state_po - state that the senator represents
2 - candidate - name of senator
3 - party_simplified - political party, split into Democrat, Republican, or Libertarian
url = "https://raw.githubusercontent.com/RobbieCase/stuff/main/1976-2020-senate.csv"
df_sen = pd.read_csv(url, encoding='latin1') #read csv
df_sen = df_sen[df_sen["year"] >= 2016] #take values from 2016 on as dataset includes data from 2017 on
df_sen.columns
to_drop = ["year", "state", "state_fips", 'state_fips', 'state_cen', 'state_ic',
'office', 'district', 'stage', 'special',
'writein', 'mode', 'candidatevotes', 'totalvotes', 'unofficial',
'version','party_detailed']
df_sen.drop(to_drop, axis=1, inplace=True) #drop irrelevant collumns
df_sen.dropna(axis=0, inplace=True) #drop null rows
df_sen.isnull().sum()
df_sen.head(10)
df3 = df2.copy()
df4 = df_sen.copy()
name = list(df3["filer"].unique()) #list of unique senator names from the trade data
names = list(df4["candidate"].values) #list of senator names from the election dataset
Here, we utilize a tool with a funny name called "fuzzywuzzy." Fuzzywuzzy uses Levenshtein distance to match similar strings with one another, which we need because the names are formatted differently between the two datasets (different casing, name order, middle initials, etc.).
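As a quick illustration of what process.extract returns (the names below are made up for the example, and the exact scores depend on fuzzywuzzy's default scorer):
toy_names = ["WHITEHOUSE, SHELDON", "PERDUE, DAVID", "COLLINS, SUSAN M."] #hypothetical candidate list
print(process.extract("Sheldon Whitehouse", toy_names, limit=2)) #list of (match, score) pairs, best match first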
key = {}
for i in name:
key[i] = (process.extract(i, names, limit=2)) #run fuzzy wuzzy to match strings of names
merger = [key[name][0][0] for name in df3['filer']] #take the best fuzzy match for each filer
df3['merger'] = merger #column of matched names to join on
df3 = df3.merge(df4, how = 'left', left_on = 'merger', right_on='candidate') #merge df3 and df4
df3.drop(["candidate","merger"],axis=1, inplace = True) #drop merger
df3 = df3[df3["party_simplified"].str.contains("OTHER")==False]
The third dataset comes from https://github.com/unitedstates/congress-legislators and was uploaded as a CSV file to GitHub. It contains each senator's name and committee assignment, and is taken from a GitHub repository that is continuously updated with accurate government information. This will be helpful for looking at differences in trade volume across committees and states. We will once again build a classification model to see if we can predict a senator's committee affiliation from the initial trade dataset.
1 - Name - name of the senator
2 - Committee Name - name of the committee the senator is a member of
sen_url = 'https://raw.githubusercontent.com/RobbieCase/stuff/main/Senate_assignments_103-117.tab' #read tab dataset
df_co = pd.read_csv(sen_url, sep='\t', engine='python')
df_co.columns = df_co.iloc[0] #use the first row as the column headers
df_co = df_co.iloc[1: , :] #drop the header row from the data
df_co.columns
dropping = ['Congress', 'Committee Code', 'ID #', 'Maj/Min',
'Rank Within Party', 'Party Code', 'Date of Appointment',
'Date of Termination', 'Senior Party Member', 'Committee Seniority',
'Committee Period of Service',
'Committee status at end of this Congress',
'Committee continuity of assignment in next Congress',
'Appointment Citation', 'State Code', 'District',
'State Name', 'Notes']
df_co.drop(dropping,axis=1,inplace=True) #drop irrelevant data
main = df3["filer"].unique() ##make df from unique senators
name = []
for i in main: #make list from unique senators
name.append(i)
names = []
for i in df_co["Name"].values: #make list from names in new dataset
names.append(i)
key = {}
for i in name:
key[i] = (process.extract(i, names, limit=2)) #run fuzzy wuzzy for string matching of the names
merger = [key[name][0][0] for name in df3['filer']] #take the best fuzzy match for each filer
df3['merger'] = merger #column of matched names to join on
df3 = df3.merge(df_co, how = 'left', left_on = 'merger', right_on='Name') #merge df3 and df_co
df3.drop(["merger","Name"],axis=1, inplace = True) #drop merger data
df3.head()
Below is the exploratory data analysis we did on these three datasets. We set out to see whether trading behavior has any relationship to a senator's party and committee affiliation, using graphs of trade volume and frequency by party and committee. We will ultimately build models to predict party and committee affiliation from the trade data, along with predicting whether a given senator will buy or sell a given stock.
disclosures = df3.copy()
disclosures.rename(columns={"filed-date":"filed_date","filed-time":"filed_time","transaction-date":"transaction_date" ,"asset-name":"asset","asset-type":"type", "state_po":"state", "party_simplified":"party", "Committee Name": "committee"}, inplace=True)
disclosures.head(25)
disclosures.to_csv('disclosures.csv')
party_date = disclosures.groupby(['transaction_date', 'party']).count()['amount'].reset_index()
sns.lineplot(data=party_date, x='transaction_date', y='amount', hue='party', alpha=0.5).set(title='Trades by Party Over Time')
The graph above shows the number of trades by party over the period the data was collected. As we can see, both major parties are similarly active in their trading; the Libertarian line is barely visible because it represents such a small sample size.
sns.histplot(data=party_date, x="amount", log_scale = True, hue='party', alpha=0.5, kde = True, multiple = 'dodge').set(title='Trade Volume Based On Party')
The graph above shows the distribution of daily trade volume by party. The x-axis is the number of trades made in a day, and the y-axis counts how many days had that volume; for example, if there were 50 days with exactly 100 trades, the bar at 100 would have a count of 50. The parties look relatively similar in terms of how much they trade.
warnings.filterwarnings("ignore", 'This pattern has match groups') #filter out warnings
freq_committee = disclosures[disclosures["committee"].str.contains("Budget|Finance|Environment and Public Works|Banking, Housing, and Urban Affairs|Agriculture, Nutrition, and Forestry|Aging (Special Committee)|Judiciary|Energy and Natural Resources|Health, Education, Labor, and Pensions|Homeland Security and Governmental Affairs|Armed Services|Foreign Relations|Appropriations|Intelligence (Select Committee)|Commerce, Science, and Transportation|Veterans' Affairs|Rules and Administration|Economic (Joint Committee)|Small Business and Entrepreneurship|Indian Affairs (Select Committee)") == True] #take 20 most active committees
committee_amt = freq_committee.groupby(['transaction_date','party', 'committee']).count()['amount'].reset_index() #make df from committee,party, and count of volume
committee_amt.head()
sns.set(rc={'figure.figsize':(16.7,10.27)})
# create bar graph displaying the most trading volume based on committee and political party
com_date = disclosures.groupby(['party', 'committee']).count()['amount'].reset_index()
com_date = com_date.nlargest(32, 'amount')
com_date.set_index('amount', inplace=True)
com_date.sort_index(ascending=False, inplace=True)
com_date.reset_index(inplace=True)
com_table = pd.pivot_table(com_date, values='amount', index=['committee'], # create a pivot table to plot a bar graph to display
columns=['party'], aggfunc=np.sum, fill_value=0, sort=False)
com_table.plot.bar(figsize=(20,10), title='Trade Volume Based on Committee', ylabel='count') # create bar graph
The graph above shows the committees with the largest trade volumes, i.e., the 20 most financially active committees. The largest and most active across both parties is the Budget committee. One interesting note: on Energy and Natural Resources, Democrats traded far more heavily than Republicans, while the reverse held for Agriculture, Nutrition, and Forestry.
freq_committee["committee"].value_counts()
top_coms = pd.DataFrame({"committee": ["Budget", "Finance","Environment and Public Works","Banking, Housing, and Urban Affairs","Agriculture, Nutrition, and Forestry",
"Judiciary","Energy and Natural Resources", "Health, Education, Labor, and Pensions","Homeland Security and Governmental Affairs",
"Armed Services","Foreign Relations","Appropriations","Commerce, Science, and Transportation","Veterans' Affairs",
"Rules and Administration","Small Business and Entrepreneurship "],
"count": [75763,57987,46015,42123,39430,35145,32200,28432,27129,26163,24746,23875,20834,17224,15624,13169]}) #create new df for treeplot
squarify.plot(sizes=top_coms['count'], label=top_coms['committee'], alpha=.8, text_kwargs={'fontsize':8}) #create treemap with squarify
plt.axis('off')
plt.show()
The plot above is a treemap showing the heavy hitters in trade volume by committee. You can see how the committees compare in terms of investment activity; unsurprisingly, Budget and Finance take up a good portion of the chart.
trade_vol = disclosures.groupby(["filer", "party"]).count()['amount'].reset_index() # return the 20 senators with the most trades
trade_vol = trade_vol.nlargest(20, 'amount')
trade_vol.set_index('amount', inplace=True)
trade_vol.sort_index(ascending=False, inplace=True)
trade_vol.reset_index(inplace=True)
trade_table = pd.pivot_table(trade_vol, values='amount', index=['filer'], # create a pivot table to plot a bar graph to display
columns=['party'], aggfunc=np.sum, fill_value=0, sort=False)
trade_table.plot.bar(figsize=(10,6), title='Individual Senator Trade Volume', ylabel='count') # create bar graph
Above are the senators who executed the most trades, colored by party. This helps reveal potential outliers that could skew the data for either party. While both Democrats and Republicans are represented, Democrats have a few more outliers, and the largest Republican outlier is David Perdue.
modeldf = df3.copy()
modeldf.isnull().sum()
modeldf.drop(["asset-type"], axis=1, inplace = True)
modeldf.dropna(axis = 0 , inplace = True) #drop null values
We elect to drop the null values so they do not interfere with our model.
modeldf.count()
modeldf.dtypes
modeldf.rename(columns={"filed-date":"filed_date","filed-time":"filed_time","transaction-date":"transaction_date" ,"asset-name":"asset","asset-type":"type", "state_po":"state", "party_simplified":"party", "Committee Name": "committee"}, inplace=True) #rename columns
Here, we use label encoding to convert the categorical columns to numeric values that the models can work with. The encoder assigns a unique integer to each distinct value in a column. The mappings for a few of the more substantial columns are printed out below as well.
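As a minimal sketch of what the encoder does (toy values, not drawn from our dataset):
toy_le = preprocessing.LabelEncoder()
print(toy_le.fit_transform(["Purchase", "Sale (Full)", "Purchase", "Exchange"])) #[1 2 1 0]; integers follow alphabetical order
print(dict(zip(toy_le.classes_, range(len(toy_le.classes_))))) #{'Exchange': 0, 'Purchase': 1, 'Sale (Full)': 2}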
le = preprocessing.LabelEncoder() #label encode categorical variables
modeldf['filer']= le.fit_transform(modeldf['filer']) #utilize label encoder on column
modeldf['owner']= le.fit_transform(modeldf['owner'])
print(blue('owner', 'bold'))
for i, j in enumerate(le.classes_): #print out the labels for specific columns
print(j, ":", i)
modeldf['ticker']= le.fit_transform(modeldf['ticker'])
modeldf['asset']= le.fit_transform(modeldf['asset'])
modeldf['amount']= le.fit_transform(modeldf['amount'])
print(blue('amount', 'bold'))
for i, j in enumerate(le.classes_):
print(j, ":", i)
modeldf['transaction_type']= le.fit_transform(modeldf['transaction_type'])
print(blue('type', 'bold'))
for i, j in enumerate(le.classes_):
print(j, ":", i)
modeldf['state']= le.fit_transform(modeldf['state'])
modeldf['party']= le.fit_transform(modeldf['party'])
print(blue('party', 'bold'))
for i, j in enumerate(le.classes_):
print(j, ":", i)
modeldf['committee']= le.fit_transform(modeldf['committee'])
print(blue('committee', 'bold'))
for i, j in enumerate(le.classes_):
print(j, ":", i)
modeldf['filed_date']= le.fit_transform(modeldf['filed_date'])
modeldf['filed_time']= le.fit_transform(modeldf['filed_time'])
modeldf['transaction_date']= le.fit_transform(modeldf['transaction_date'])
modeldf
modeldf.dtypes
First we are going to attempt to use a classification model to predict the party of the trader based on the trade data. We are using a classification model rather than a regression model as our goal is to predict the party of the given trader. We are going to use KNN initially, and if our model lacks accuracy, we may bring in a different classification model to compare accuracy.
modeldf2 = modeldf.copy() #copy dataframe for modeling
to_drop = ["filer", "owner", "state", "committee"]
modeldf2.drop(to_drop, axis=1, inplace= True) #drop columns that could skew prediction
X = modeldf2.drop('party', axis = 1)
y = modeldf2['party'] #assign party to y to train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25) #train test split for modeling
knn = KNeighborsClassifier(n_neighbors = 20) #knn classifier with 20 neighbors
knn.fit(X_train, y_train) #fitting x and y train
knn.score(X_test, y_test) #accuracy of model
The model is extremely accurate at predicting party based on trade data. Using KNN, we were able to get an accuracy of over 99%, thus the trade data is significant in predicting the party of the trader.
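The choice of 20 neighbors was somewhat arbitrary; if we wanted to check how sensitive the result is to k, a quick sweep like the following sketch (reusing the same train/test split) would do it:
for k in [5, 10, 20, 50]: #compare accuracy across a few values of k
    knn_k = KNeighborsClassifier(n_neighbors=k)
    knn_k.fit(X_train, y_train)
    print(k, knn_k.score(X_test, y_test))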
y_pred = knn.predict(X_test) #assign df with predicted vals
Party
DEMOCRAT : 0
LIBERTARIAN : 1
REPUBLICAN : 2
df = pd.DataFrame({'Actual':y_test, 'Predicted':y_pred}) #actual vs predicted party
df
The model is extremely accurate, and predicted the party correctly in nearly every case.
fig = plt.figure(figsize=(15,6))
skplt.metrics.plot_confusion_matrix(y_test, y_pred,
title="Party Model Confusion Matrix",
cmap="Purples")
As we can see, the model correctly classified Democrats and Republicans, but was less successful with Libertarians given their significantly smaller sample size.
Now we are again going to use a classification model, this time to predict the committee of the trader based on the trade data. We are using a classification model rather than a regression model as our goal is to predict the committee of the given trader.
cpred = modeldf.copy() #copy dataframe for modeling
to_drop = ["filer", "owner", "state", "party"]
cpred.drop(to_drop, axis=1, inplace= True)
#drop columns that could skew prediction
X = cpred.drop('committee', axis = 1)
y = cpred['committee'] #assign committee to y to train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25) #train test split for modeling
knn = KNeighborsClassifier(n_neighbors=100) #knn classifier with 100 neighbors
knn.fit(X_train, y_train) #fitting knn training dfs
knn.score(X_test, y_test) #accuracy of model
KNN doesn't seem to be cutting it in this case. Let's add some more firepower and use XGBoost to see if performance improves at all.
xgb_model = xgb.XGBClassifier(objective="multi:softprob", random_state=42) #XGBoost classifier with a multiclass objective
xgb_model.fit(X_train, y_train) #fit xgb model with x and y train
xgb_model.score(X_test, y_test) #accuracy of model
XGBoost only slightly improved the model's accuracy, so we can conclude that the trade data is not very useful for predicting committee affiliation.
y_pred = xgb_model.predict(X_test) #predicted committees from the stronger XGBoost model
Committee
APPROPRIATIONS : 0
Aging (Special Committee) : 1
Agriculture, Nutrition, and Forestry : 2
Appropriations : 3
Armed Services : 4
Banking, Housing, and Urban Affairs : 5
Budget : 6
Commerce, Science, and Transportation : 7
ENVIRONMENT AND PUBLIC WORKS : 8
Economic (Joint Committee) : 9
Energy and Natural Resources : 10
Environment and Public Works : 11
Ethics (Select Committee) : 12
Finance : 13
Foreign Relations : 14
Governmental Affairs : 15
Health, Education, Labor, and Pensions : 16
Homeland Security and Governmental Affairs : 17
Indian Affairs (Select Committee) : 18
Intelligence (Select Committee) : 19
JUDICIARY : 20
Judiciary : 21
Labor and Human Resources : 22
Library (Joint) : 23
Majority whip : 24
Minority whip : 25
Printing (Joint) : 26
Rules and Administration : 27
Small Business : 28
Small Business and Entrepreneurship : 29
Taxation (Joint) : 30
Veterans' Affairs : 31
df = pd.DataFrame({'Actual':y_test, 'Predicted':y_pred}) #actual vs predicted committee
df
As we can see, the model is not very accurate in predicting committee.
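To quantify where the committee model breaks down, one could print per-class precision and recall; a quick sketch using scikit-learn's classification_report:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, zero_division=0)) #per-committee precision/recall; rare committees will score poorly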
Lastly, we attempt to use a classification model to predict the transaction type of a trade based on the rest of the trade data. In effect, this model predicts trading behavior: whether a senator buys or sells in a given situation.
tpred = modeldf.copy() #copy dataframe for modeling
to_drop = ["filer", "owner"]
tpred.drop(to_drop, axis=1, inplace= True) #drop columns that could skew prediction
X = tpred.drop('transaction_type', axis = 1)
y = tpred['transaction_type'] #assign type to y to train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25) #train test split for modeling
knn = KNeighborsClassifier(n_neighbors = 20) #run knn classifier with 20 neighbors
knn.fit(X_train, y_train) #fit knn classifier with x and y train
knn.score(X_test, y_test) #model accuracy
Here we have some very solid results. Using KNN, we can accurately predict whether a trade was a purchase, partial sale, full sale, or exchange. Let's see if we can improve accuracy with a different model.
rf = RandomForestClassifier() #run rf classifier
rf.fit(X_train, y_train) #fit x and y train to rf classifier
rf.score(X_test, y_test) #model accuracy
Using random forest, we were able to increase the model's accuracy to over 99%! This shows that random forest is a better classification model than KNN for this data.
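As a quick follow-up sketch, the fitted forest's feature importances hint at which columns drive the transaction-type prediction:
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances) #higher values mean the feature contributes more to the trees' splits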
y_pred = rf.predict(X_test) #df with predicted types
Transaction Type
Exchange : 0
Purchase : 1
Sale : 2
Sale (Full) : 3
Sale (Partial) : 4
df = pd.DataFrame({'Actual':y_test, 'Predicted':y_pred}) #actual vs predicted transaction types
df
The model is very accurate, and predicted the trade type correctly in nearly every case.
fig = plt.figure(figsize=(15,6))
skplt.metrics.plot_confusion_matrix(y_test, y_pred,
title="Transaction Type Model Confusion Matrix",
cmap="Oranges")
Shown above is a confusion matrix for the transaction-type model. As is evident, the model is successful in predicting the correct classes, with purchase, sale (full), and sale (partial) being the most represented. This is as expected, as they are the most common in the dataset.
Ultimately, with the given data we were able to build accurate prediction models for party affiliation and for whether a senator will buy or sell a certain stock based on the trade data. We were unsuccessful in building an accurate model to predict committee affiliation. This is likely because committee affiliation has less relevance to trading behavior than we had presumed. We went in under the presumption that it would be significant, so this goes to show that one's committee assignment does not necessarily dictate how one trades.
Datasets:
https://dataverse.harvard.edu/file.xhtml?fileId=4640697&version=1.0
https://github.com/jeremiak/us-senate-financial-disclosure-data
https://dataverse.harvard.edu/file.xhtml?fileId=6708559&version=6.1
Fuzzy wuzzy matching on dataframe:
https://www.geeksforgeeks.org/how-to-do-fuzzy-matching-on-pandas-dataframe-column-using-python/
Seaborn:
https://www.geeksforgeeks.org/plot-the-size-of-each-group-in-a-groupby-object-in-pandas/
Treemap:
https://www.python-graph-gallery.com/treemap/
XGBoost:
https://xgboost.readthedocs.io/en/stable/
Random Forest:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
Label encoding:
https://www.geeksforgeeks.org/ml-label-encoding-of-datasets-in-python/
%%shell
jupyter nbconvert --to html /content/drive/MyDrive/ds_project.ipynb