For our final tutorial, we decided to look into the public and private investments of US politicians. In 2012, Congress passed the STOCK (Stop Trading on Congressional Knowledge) Act, which required that stock transactions by members of Congress be disclosed to the public. More recently, politicians have drawn attention for potential insider trading as more people have become aware that they are legally allowed to trade individual stocks. We plan to look at a dataset scraped directly from the Senate stock disclosure website (https://efdsearch.senate.gov/search/) to see whether this access has allowed politicians to earn superior returns, and to what degree. The dataset includes each senator and their trading data; however, it lacks party affiliation, state, etc. We pulled in two other datasets to add committee and party affiliation for each senator, and we built classification models to see if we can predict party and committee from the trade data.
In building our models, we had three specific goals in mind. One, can we predict the party of the trader based on the trade data? Two, can we predict the committee of the trader based on the trade data? Three, can we predict the type of trade (sale, purchase, exchange, etc.) based on the trade data? We have over 600,000 entries, and we want to see whether any of these targets can actually be predicted from the trades. Our plan is to use classification models, since we are predicting classes (e.g., Republican vs. Democrat) rather than a continuous value. Going in, the plan was to use a K-nearest neighbors model as a baseline classifier for each target and, if necessary, bring in more powerful classification models such as XGBoost and random forest. We found that we could predict party and transaction type with accuracies over 99%, but committee affiliation proved much harder to predict from trade data. This suggests that while party is associated with trading behavior, committee affiliation has far less relevance. We included transaction type to see whether there are patterns in the trade data that indicate whether a senator would buy or sell in a given situation. Ultimately, our conclusion is that party and transaction type are predictable from the trade data, while committee affiliation is not.
# data handling
import os
import json
import warnings
from datetime import datetime
import numpy as np
import pandas as pd
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated
from google.colab import drive
# fuzzy string matching
from fuzzywuzzy import fuzz, process
# plotting
import matplotlib.pyplot as plt
import seaborn as sns
import squarify
import circlify
import scikitplot as skplt
from simple_colors import *
# modeling
from sklearn import preprocessing
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (RandomForestRegressor, GradientBoostingRegressor,
                              AdaBoostRegressor, BaggingRegressor, RandomForestClassifier)
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
import xgboost as xgb
drive.mount("/content/drive")
The first dataset comes from https://efdsearch.senate.gov/search/ and was scraped using the tools found here (https://github.com/neelsomani/senator-filings). It includes investments from US senators dating back to 2012, when the STOCK Act was passed. The investments include individual public stocks in large companies as well as mutual funds and exchange-traded funds. The dataset reports the amount of each investment as a range (e.g., $1,001 - $15,000), and each transaction is labeled as a purchase or sale and dated. This is useful because we can identify when senators entered and exited investments and compare those dates to the stock or fund's performance. We can also step back and check whether any legislation passed at the time would have affected the performance of the companies these senators were invested in.
1 - filer - identifies individual or state that filed the transaction
2 - filed-date - day the investment was reported
3 - filed-time - time the investment was reported
4 - transaction-date - day the investment was made
5 - owner - who actually bought the asset (self or spouse)
6 - ticker - ticker of company if asset is public stock
7 - asset-name - full-name of the investment
8 - transaction-type - if the transaction was purchased, full or partially sold, or exchanged
9 - amount - value ranges of the investment, 11 increments beginning with $1,001 - $15,000 and ending with over $50,000,000
10 - asset-type - different investment types, including public and private stock, stock options, corporate and municipal bonds, commodities, and cryptocurrency
11 - type - if the transaction was purchased, full or partially sold, or exchanged
# download senate financial investments
url = 'https://raw.githubusercontent.com/jeremiak/us-senate-financial-disclosure-data/master/output/transactions.csv'
df_s = pd.read_csv(url, low_memory = False)
df_s.head()
drop = ['report-id',
'report-title',
'comment',
'comments',
'transactor',
'asset-id',
'date',
'amount-range'] #columns to drop
df_s.drop(drop, inplace=True, axis=1) #drop unnecessary columns
pd.options.mode.chained_assignment = None #silence SettingWithCopyWarning for the slicing below
df2 = df_s.dropna(subset=["ticker"]) #dropping rows without ticker
df2["filed-date"] = pd.to_datetime(df2["filed-date"], dayfirst=True) #convert filed-date to datetime64
df2["transaction-date"] = pd.to_datetime(df2["transaction-date"], dayfirst=True, errors = "coerce") #convert transaction-date into datetime64
df2 = df2[df2["transaction-date"] >= "2016-1-1"]
df2 = df2[df2["transaction-date"] <= "2022-11-18"]
df2.reset_index(drop=True, inplace=True) #reset dataframe index
df2
df2.isnull().sum()
Below, we combine two columns that ultimately represent the same thing: the dataset stores the transaction type under two different column names.
df2["transaction_type"] = df2["transaction-type"].fillna("") + df2["type"].fillna("")
df2["transaction_type"].replace("", np.nan, inplace=True)
df2.drop(["type", "transaction-type"],axis=1,inplace=True)
df2.head()
df2.dtypes
The second dataset comes from https://dataverse.harvard.edu/file.xhtml?fileId=6708559&version=6.1 and was uploaded as a CSV file to GitHub. It contains each senator's name, party affiliation, and state. This will be helpful for looking at differences in trade volume across political parties and states. We will ultimately build a classification model to see if we can predict a senator's party affiliation from the first dataset.
1 - state_po - state that the senator represents
2 - candidate - name of senator
3 - party_simplified - political party, split into Democrat, Republican, or Libertarian
url = "https://raw.githubusercontent.com/RobbieCase/stuff/main/1976-2020-senate.csv"
df_sen = pd.read_csv(url, encoding='latin1') #read csv
df_sen = df_sen[df_sen["year"] >= 2016] #take values from 2016 on as dataset includes data from 2017 on
df_sen.columns
to_drop = ["year", "state", "state_fips", 'state_fips', 'state_cen', 'state_ic',
'office', 'district', 'stage', 'special',
'writein', 'mode', 'candidatevotes', 'totalvotes', 'unofficial',
'version','party_detailed']
df_sen.drop(to_drop, axis=1, inplace=True) #drop irrelevant collumns
df_sen.dropna(axis=0, inplace=True) #drop null rows
df_sen.isnull().sum()
df_sen.head(10)
df3 = df2.copy()
df4 = df_sen.copy()
name = list(df3["filer"].unique()) #list of unique senator names from the trade data
names = list(df4["candidate"].values) #list of senator names from the election dataset
Here, we utilize a tool with a funny name called "fuzzywuzzy." Fuzzywuzzy uses Levenshtein distance to match similar strings with one another, which we need because the names are formatted differently between the two datasets (different casing, name order, middle initials, etc.).
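As a quick illustration of what process.extract returns (the names below are made up for the example, and the exact scores depend on fuzzywuzzy's default scorer):
toy_names = ["WHITEHOUSE, SHELDON", "PERDUE, DAVID", "COLLINS, SUSAN M."] #hypothetical candidate list
print(process.extract("Sheldon Whitehouse", toy_names, limit=2)) #list of (match, score) pairs, best match first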
key = {}
for i in name:
key[i] = (process.extract(i, names, limit=2)) #run fuzzy wuzzy to match strings of names
merger = [key[name][0][0] for name in df3['filer']] #take the best fuzzy match for each filer
df3['merger'] = merger #column of matched names to join on
df3 = df3.merge(df4, how = 'left', left_on = 'merger', right_on='candidate') #merge df3 and df4
df3.drop(["candidate","merger"],axis=1, inplace = True) #drop merger
df3 = df3[df3["party_simplified"].str.contains("OTHER")==False]
The third dataset comes from https://github.com/unitedstates/congress-legislators and was uploaded as a CSV file to GitHub. It contains each senator's name and committee assignment, and is taken from a GitHub repository that is continuously updated with accurate government information. This will be helpful for looking at differences in trade volume across committees and states. We will once again build a classification model to see if we can predict a senator's committee affiliation from the initial trade dataset.
1 - Name - name of the senator
2 - Committee Name - name of the committee the senator is a member of
sen_url = 'https://raw.githubusercontent.com/RobbieCase/stuff/main/Senate_assignments_103-117.tab' #read tab dataset
df_co = pd.read_csv(sen_url, sep='\t', engine='python')
df_co.columns = df_co.iloc[0] #use the first row as the column headers
df_co = df_co.iloc[1: , :] #drop the header row from the data
df_co.columns
dropping = ['Congress', 'Committee Code', 'ID #', 'Maj/Min',
'Rank Within Party', 'Party Code', 'Date of Appointment',
'Date of Termination', 'Senior Party Member', 'Committee Seniority',
'Committee Period of Service',
'Committee status at end of this Congress',
'Committee continuity of assignment in next Congress',
'Appointment Citation', 'State Code', 'District',
'State Name', 'Notes']
df_co.drop(dropping,axis=1,inplace=True) #drop irrelevant data
main = df3["filer"].unique() ##make df from unique senators
name = []
for i in main: #make list from unique senators
name.append(i)
names = []
for i in df_co["Name"].values: #make list from names in new dataset
names.append(i)
key = {}
for i in name:
key[i] = (process.extract(i, names, limit=2)) #run fuzzy wuzzy for string matching of the names
merger = [key[name][0][0] for name in df3['filer']] #take the best fuzzy match for each filer
df3['merger'] = merger #column of matched names to join on
df3 = df3.merge(df_co, how = 'left', left_on = 'merger', right_on='Name') #merge df3 and df_co
df3.drop(["merger","Name"],axis=1, inplace = True) #drop merger data
df3.head()
Below is the exploratory data analysis we did on these three datasets. We set out to see whether trading behavior has any relationship to a senator's party and committee affiliation, using graphs of trade volume and frequency by party and committee. We will ultimately build models to predict party and committee affiliation from the trade data, along with predicting whether a given senator will buy or sell a given stock.
disclosures = df3.copy()
disclosures.rename(columns={"filed-date":"filed_date","filed-time":"filed_time","transaction-date":"transaction_date" ,"asset-name":"asset","asset-type":"type", "state_po":"state", "party_simplified":"party", "Committee Name": "committee"}, inplace=True)
disclosures.head(25)
disclosures.to_csv('disclosures.csv')
party_date = disclosures.groupby(['transaction_date', 'party']).count()['amount'].reset_index()
sns.lineplot(data=party_date, x='transaction_date', y='amount', hue='party', alpha=0.5).set(title='Trades by Party Over Time')
The graph above shows the number of trades by party over the period the data was collected. As we can see, both major parties are similarly active in their trading; the Libertarian line is barely visible because it represents such a small sample size.
sns.histplot(data=party_date, x="amount", log_scale = True, hue='party', alpha=0.5, kde = True, multiple = 'dodge').set(title='Trade Volume Based On Party')
The graph above shows the distribution of daily trade volume by party. The x-axis is the number of trades made in a day, and the y-axis counts how many days had that volume; for example, if there were 50 days with exactly 100 trades, the bar at 100 would have a count of 50. The parties look relatively similar in terms of how much they trade.
warnings.filterwarnings("ignore", 'This pattern has match groups') #filter out warnings
freq_committee = disclosures[disclosures["committee"].str.contains("Budget|Finance|Environment and Public Works|Banking, Housing, and Urban Affairs|Agriculture, Nutrition, and Forestry|Aging (Special Committee)|Judiciary|Energy and Natural Resources|Health, Education, Labor, and Pensions|Homeland Security and Governmental Affairs|Armed Services|Foreign Relations|Appropriations|Intelligence (Select Committee)|Commerce, Science, and Transportation|Veterans' Affairs|Rules and Administration|Economic (Joint Committee)|Small Business and Entrepreneurship|Indian Affairs (Select Committee)") == True] #take 20 most active committees
committee_amt = freq_committee.groupby(['transaction_date','party', 'committee']).count()['amount'].reset_index() #make df from committee,party, and count of volume
committee_amt.head()
sns.set(rc={'figure.figsize':(16.7,10.27)})
# create bar graph displaying the most trading volume based on committee and political party
com_date = disclosures.groupby(['party', 'committee']).count()['amount'].reset_index()
com_date = com_date.nlargest(32, 'amount')
com_date.set_index('amount', inplace=True)
com_date.sort_index(ascending=False, inplace=True)
com_date.reset_index(inplace=True)
com_table = pd.pivot_table(com_date, values='amount', index=['committee'], # create a pivot table to plot a bar graph to display
columns=['party'], aggfunc=np.sum, fill_value=0, sort=False)
com_table.plot.bar(figsize=(20,10), title='Trade Volume Based on Committee', ylabel='count') # create bar graph
The graph above shows the committees with the largest trade volumes, i.e., the 20 most financially active committees. The largest and most active across both parties is the Budget committee. One interesting note: on Energy and Natural Resources, Democrats traded far more heavily than Republicans, while the reverse held for Agriculture, Nutrition, and Forestry.
freq_committee["committee"].value_counts()
top_coms = pd.DataFrame({"committee": ["Budget", "Finance","Environment and Public Works","Banking, Housing, and Urban Affairs","Agriculture, Nutrition, and Forestry",
"Judiciary","Energy and Natural Resources", "Health, Education, Labor, and Pensions","Homeland Security and Governmental Affairs",
"Armed Services","Foreign Relations","Appropriations","Commerce, Science, and Transportation","Veterans' Affairs",
"Rules and Administration","Small Business and Entrepreneurship "],
"count": [75763,57987,46015,42123,39430,35145,32200,28432,27129,26163,24746,23875,20834,17224,15624,13169]}) #create new df for treeplot
squarify.plot(sizes=top_coms['count'], label=top_coms['committee'], alpha=.8, text_kwargs={'fontsize':8}) #create treemap with squarify
plt.axis('off')
plt.show()
The plot above is a treemap showing the heavy hitters in trade volume by committee. You can see how the committees compare in terms of investment activity; unsurprisingly, Budget and Finance take up a good portion of the chart.
trade_vol = disclosures.groupby(["filer", "party"]).count()['amount'].reset_index() # return the 20 senators with the most trades
trade_vol = trade_vol.nlargest(20, 'amount')
trade_vol.set_index('amount', inplace=True)
trade_vol.sort_index(ascending=False, inplace=True)
trade_vol.reset_index(inplace=True)
trade_table = pd.pivot_table(trade_vol, values='amount', index=['filer'], # create a pivot table to plot a bar graph to display
columns=['party'], aggfunc=np.sum, fill_value=0, sort=False)
trade_table.plot.bar(figsize=(10,6), title='Individual Senator Trade Volume', ylabel='count') # create bar graph
Above are the senators who executed the most trades, colored by party. This helps reveal potential outliers that could skew the data for either party. While both Democrats and Republicans are represented, Democrats have a few more outliers, and the largest Republican outlier is David Perdue.
modeldf = df3.copy()
modeldf.isnull().sum()
modeldf.drop(["asset-type"], axis=1, inplace = True)
modeldf.dropna(axis = 0 , inplace = True) #drop null values
We elect to drop the null values so they do not interfere with our model.
modeldf.count()
modeldf.dtypes
modeldf.rename(columns={"filed-date":"filed_date","filed-time":"filed_time","transaction-date":"transaction_date" ,"asset-name":"asset","asset-type":"type", "state_po":"state", "party_simplified":"party", "Committee Name": "committee"}, inplace=True) #rename columns
Here, we use label encoding to convert the categorical columns to numeric values that the models can work with. The encoder assigns a unique integer to each distinct value in a column. The mappings for a few of the more substantial columns are printed out below as well.
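As a minimal sketch of what the encoder does (toy values, not drawn from our dataset):
toy_le = preprocessing.LabelEncoder()
print(toy_le.fit_transform(["Purchase", "Sale (Full)", "Purchase", "Exchange"])) #[1 2 1 0]; integers follow alphabetical order
print(dict(zip(toy_le.classes_, range(len(toy_le.classes_))))) #{'Exchange': 0, 'Purchase': 1, 'Sale (Full)': 2}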
le = preprocessing.LabelEncoder() #label encode categorical variables
modeldf['filer']= le.fit_transform(modeldf['filer']) #utilize label encoder on column
modeldf['owner']= le.fit_transform(modeldf['owner'])
print(blue('owner', 'bold'))
for i, j in enumerate(le.classes_): #print out the labels for specific columns
print(j, ":", i)
modeldf['ticker']= le.fit_transform(modeldf['ticker'])
modeldf['asset']= le.fit_transform(modeldf['asset'])
modeldf['amount']= le.fit_transform(modeldf['amount'])
print(blue('amount', 'bold'))
for i, j in enumerate(le.classes_):
print(j, ":", i)
modeldf['transaction_type']= le.fit_transform(modeldf['transaction_type'])
print(blue('type', 'bold'))
for i, j in enumerate(le.classes_):
print(j, ":", i)
modeldf['state']= le.fit_transform(modeldf['state'])
modeldf['party']= le.fit_transform(modeldf['party'])
print(blue('party', 'bold'))
for i, j in enumerate(le.classes_):
print(j, ":", i)
modeldf['committee']= le.fit_transform(modeldf['committee'])
print(blue('committee', 'bold'))
for i, j in enumerate(le.classes_):
print(j, ":", i)
modeldf['filed_date']= le.fit_transform(modeldf['filed_date'])
modeldf['filed_time']= le.fit_transform(modeldf['filed_time'])
modeldf['transaction_date']= le.fit_transform(modeldf['transaction_date'])
modeldf
modeldf.dtypes
First we are going to attempt to use a classification model to predict the party of the trader based on the trade data. We are using a classification model rather than a regression model as our goal is to predict the party of the given trader. We are going to use KNN initially, and if our model lacks accuracy, we may bring in a different classification model to compare accuracy.
modeldf2 = modeldf.copy() #copy dataframe for modeling
to_drop = ["filer", "owner", "state", "committee"]
modeldf2.drop(to_drop, axis=1, inplace= True) #drop columns that could skew prediction
X = modeldf2.drop('party', axis = 1)
y = modeldf2['party'] #assign party to y to train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25) #train test split for modeling
knn = KNeighborsClassifier(n_neighbors = 20) #knn classifier with 20 neighbors
knn.fit(X_train, y_train) #fitting x and y train
knn.score(X_test, y_test) #accuracy of model
The model is extremely accurate at predicting party based on trade data. Using KNN, we were able to get an accuracy of over 99%, thus the trade data is significant in predicting the party of the trader.
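The choice of 20 neighbors was somewhat arbitrary; if we wanted to check how sensitive the result is to k, a quick sweep like the following sketch (reusing the same train/test split) would do it:
for k in [5, 10, 20, 50]: #compare accuracy across a few values of k
    knn_k = KNeighborsClassifier(n_neighbors=k)
    knn_k.fit(X_train, y_train)
    print(k, knn_k.score(X_test, y_test))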
y_pred = knn.predict(X_test) #assign df with predicted vals
Party
DEMOCRAT : 0
LIBERTARIAN : 1
REPUBLICAN : 2
df = pd.DataFrame({'Actual':y_test, 'Predicted':y_pred}) #actual vs predicted party
df
The model is extremely accurate, and predicted the party correctly in nearly every case.
fig = plt.figure(figsize=(15,6))
skplt.metrics.plot_confusion_matrix(y_test, y_pred,
title="Party Model Confusion Matrix",
cmap="Purples")
As we can see, the model correctly classified Democrats and Republicans, but was less successful with Libertarians given their significantly smaller sample size.
Now we are again going to use a classification model, this time to predict the committee of the trader based on the trade data. We are using a classification model rather than a regression model as our goal is to predict the committee of the given trader.
cpred = modeldf.copy() #copy dataframe for modeling
to_drop = ["filer", "owner", "state", "party"]
cpred.drop(to_drop, axis=1, inplace= True)
#drop columns that could skew prediction
X = cpred.drop('committee', axis = 1)
y = cpred['committee'] #assign committee to y to train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25) #train test split for modeling
knn = KNeighborsClassifier(n_neighbors=100) #knn classifier with 100 neighbors
knn.fit(X_train, y_train) #fitting knn training dfs
knn.score(X_test, y_test) #accuracy of model
KNN doesn't seem to be cutting it in this case. Let's add some more firepower and use XGBoost to see if performance improves at all.
xgb_model = xgb.XGBClassifier(objective="multi:softprob", random_state=42) #XGBoost classifier with a multiclass objective
xgb_model.fit(X_train, y_train) #fit xgb model with x and y train
xgb_model.score(X_test, y_test) #accuracy of model
XGBoost only slightly improved the model's accuracy, so we can conclude that the trade data is not very useful for predicting committee affiliation.
y_pred = xgb_model.predict(X_test) #predicted committees from the stronger XGBoost model
Committee
APPROPRIATIONS : 0
Aging (Special Committee) : 1
Agriculture, Nutrition, and Forestry : 2
Appropriations : 3
Armed Services : 4
Banking, Housing, and Urban Affairs : 5
Budget : 6
Commerce, Science, and Transportation : 7
ENVIRONMENT AND PUBLIC WORKS : 8
Economic (Joint Committee) : 9
Energy and Natural Resources : 10
Environment and Public Works : 11
Ethics (Select Committee) : 12
Finance : 13
Foreign Relations : 14
Governmental Affairs : 15
Health, Education, Labor, and Pensions : 16
Homeland Security and Governmental Affairs : 17
Indian Affairs (Select Committee) : 18
Intelligence (Select Committee) : 19
JUDICIARY : 20
Judiciary : 21
Labor and Human Resources : 22
Library (Joint) : 23
Majority whip : 24
Minority whip : 25
Printing (Joint) : 26
Rules and Administration : 27
Small Business : 28
Small Business and Entrepreneurship : 29
Taxation (Joint) : 30
Veterans' Affairs : 31
df = pd.DataFrame({'Actual':y_test, 'Predicted':y_pred}) #actual vs predicted committee
df
As we can see, the model is not very accurate in predicting committee.
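To quantify where the committee model breaks down, one could print per-class precision and recall; a quick sketch using scikit-learn's classification_report:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, zero_division=0)) #per-committee precision/recall; rare committees will score poorly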
Lastly, we attempt to use a classification model to predict the transaction type of a trade based on the rest of the trade data. In effect, this model predicts trading behavior: whether a senator buys or sells in a given situation.
tpred = modeldf.copy() #copy dataframe for modeling
to_drop = ["filer", "owner"]
tpred.drop(to_drop, axis=1, inplace= True) #drop columns that could skew prediction
X = tpred.drop('transaction_type', axis = 1)
y = tpred['transaction_type'] #assign type to y to train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25) #train test split for modeling
knn = KNeighborsClassifier(n_neighbors = 20) #run knn classifier with 20 neighbors
knn.fit(X_train, y_train) #fit knn classifier with x and y train
knn.score(X_test, y_test) #model accuracy
Here we have some very solid results. Using KNN, we can accurately predict whether a trade was a purchase, partial sale, full sale, or exchange. Let's see if we can improve accuracy with a different model.
rf = RandomForestClassifier() #run rf classifier
rf.fit(X_train, y_train) #fit x and y train to rf classifier
rf.score(X_test, y_test) #model accuracy
Using random forest, we were able to increase the model's accuracy to over 99%! This shows that random forest is a better classification model than KNN for this data.
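As a quick follow-up sketch, the fitted forest's feature importances hint at which columns drive the transaction-type prediction:
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances) #higher values mean the feature contributes more to the trees' splits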
y_pred = rf.predict(X_test) #df with predicted types
Transaction Type
Exchange : 0
Purchase : 1
Sale : 2
Sale (Full) : 3
Sale (Partial) : 4
df = pd.DataFrame({'Actual':y_test, 'Predicted':y_pred}) #actual vs predicted transaction types
df
The model is very accurate, and predicted the trade type correctly in nearly every case.
fig = plt.figure(figsize=(15,6))
skplt.metrics.plot_confusion_matrix(y_test, y_pred,
title="Transaction Type Model Confusion Matrix",
cmap="Oranges")
Shown above is a confusion matrix for the transaction-type model. As is evident, the model is successful in predicting the correct classes, with purchase, sale (full), and sale (partial) being the most represented. This is as expected, as they are the most common in the dataset.
Ultimately, with the given data we were able to build accurate prediction models for party affiliation and for whether a senator will buy or sell a certain stock based on the trade data. We were unsuccessful in building an accurate model to predict committee affiliation. This is likely because committee affiliation has less relevance to trading behavior than we had presumed. We went in under the presumption that it would be significant, so this goes to show that one's committee assignment does not necessarily dictate how one trades.
Datasets:
https://dataverse.harvard.edu/file.xhtml?fileId=4640697&version=1.0
https://github.com/jeremiak/us-senate-financial-disclosure-data
https://dataverse.harvard.edu/file.xhtml?fileId=6708559&version=6.1
Fuzzy wuzzy matching on dataframe:
https://www.geeksforgeeks.org/how-to-do-fuzzy-matching-on-pandas-dataframe-column-using-python/
Seaborn:
https://www.geeksforgeeks.org/plot-the-size-of-each-group-in-a-groupby-object-in-pandas/
Treemap:
https://www.python-graph-gallery.com/treemap/
XGBoost:
https://xgboost.readthedocs.io/en/stable/
Random Forest:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
Label encoding:
https://www.geeksforgeeks.org/ml-label-encoding-of-datasets-in-python/
%%shell
jupyter nbconvert --to html /content/drive/MyDrive/ds_project.ipynb