US Senate Stock Trading Analysis

Robbie Case

robbiecase.github.io

Introduction

For our final tutorial, we decided to look into the public and private investments of US politicians. In 2012, Congress passed the STOCK (Stop Trading on Congressional Knowledge) Act, which requires that members of Congress publicly disclose their stock transactions. More recently, politicians have drawn increased scrutiny over potential insider trading as more people have become aware that politicians are legally allowed to trade. We look at a dataset scraped directly from the Senate stock disclosure website (https://efdsearch.senate.gov/search/) to see whether this access has allowed politicians to earn superior returns, and to what degree. The dataset includes each senator and their respective trading data; however, it lacks party affiliation, state, etc. We pulled in two other datasets providing the committee and party affiliations of the respective senators, and we built classification models to see if we can predict party and committee based on trade data.

When we set out to build our models, we had three specific goals in mind. One, can we predict the party of the trader based on the trade data? Two, can we predict the committee of the trader based on the trade data? Three, can we predict the type of trade (Sale, Purchase, Exchange, etc.) based on the trade data? We have over 600,000 entries, and we want to see whether the trade data has any significance in predicting these factors. Our plan is to use classification models, as we are attempting to predict classes (e.g., Republican vs. Democrat) rather than a value. Going in, the plan was to use a K-nearest neighbors model as a baseline classification model for our different factors and, if necessary, more powerful classifiers such as XGBoost and Random Forest. We found that we could predict party and transaction type with accuracies above 99%; committee affiliation, however, proved much harder to predict from the trade data. This suggests that while party is associated with trading behavior, committee affiliation has less relevance to it. The reason we included transaction type was to see whether there are patterns in the trade data that indicate whether a senator would buy or sell. Ultimately, our conclusion is that party and transaction type are predictable from the trade data, while committee affiliation is not.
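The baseline approach described above can be sketched as follows. This is a minimal illustration only, using randomly generated stand-in data (the real label-encoded trade features are built later in this notebook); the feature values and the synthetic party rule here are assumptions for demonstration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic stand-in for label-encoded trade features (filer, ticker, amount, ...)
X = rng.integers(0, 50, size=(1000, 4))
# Synthetic binary "party" label that depends only on the first feature
y = (X[:, 0] > 25).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# K-nearest neighbors as the baseline classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
acc = accuracy_score(y_test, knn.predict(X_test))
print(f"baseline KNN accuracy: {acc:.2f}")
```

Stronger models such as RandomForestClassifier or XGBoost follow the same fit/predict pattern, so they can be swapped in without changing the surrounding code.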

Importing Libraries

In [4]:
import pandas as pd
import seaborn as sns
import os
from google.colab import drive
import json
from pandas.io.json import json_normalize
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import matplotlib.pyplot as plt
import squarify
import circlify
from datetime import datetime
from sklearn import preprocessing
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_selection import RFE
from simple_colors import *
import xgboost as xgb
import warnings
import scikitplot as skplt
drive.mount("/content/drive")
Mounted at /content/drive

Data Imports and Tidying

Senate stock disclosures

The first dataset comes from https://efdsearch.senate.gov/search/, and was scraped using the tools found here (https://github.com/neelsomani/senator-filings). It includes various investments from US senators dating back to 2012, when the STOCK Act was passed. Investments in this data include individual public stock in large companies, as well as investments in mutual funds and exchange-traded funds. The dataset also gives a value range for each investment (e.g. $1,001 - $15,000). The data can be organized by purchases and sales, as well as by the date of each transaction. This will be interesting because we can use this information to identify when senators entered and exited investments and then compare it to the stock or fund's performance. We can also take a step back and see whether any legislation passed at the time would have affected the performance of the companies these senators were invested in.

US Senators Investments Dataset - Descriptions of columns used:

1 - filer - the individual or state that filed the transaction

2 - filed-date - day the investment was reported

3 - filed-time - time the investment was reported

4 - transaction-date - day the investment was made

5 - owner - who actually bought the asset (self or spouse)

6 - ticker - ticker of company if asset is public stock

7 - asset-name - full-name of the investment

8 - transaction-type - whether the transaction was a purchase, a full or partial sale, or an exchange

9 - amount - value ranges of the investment, 11 increments beginning with $1,001 - $15,000 and ending with over $50,000,000

10 - asset-type - different investment types, including public and private stock, stock options, corporate and municipal bonds, commodities, and cryptocurrency

11 - type - whether the transaction was a purchase, a full or partial sale, or an exchange (duplicates transaction-type; the two columns are merged later)
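Because the amount column holds ordered range strings rather than numbers, one way to work with it numerically is an explicit ordinal mapping that preserves the ordering. The sketch below is an assumption for illustration: it uses only a few of the 11 increments, with the exact strings taken from the sample rows shown later in this notebook.

```python
# Hypothetical sketch: map a few of the ordered amount ranges to ordinal codes.
# The dataset's "amount" field has 11 such increments; only a subset is shown.
AMOUNT_ORDER = [
    "$1,001 - $15,000",
    "$15,001 - $50,000",
    "$50,001 - $100,000",
    "$100,001 - $250,000",
]
amount_code = {label: i for i, label in enumerate(AMOUNT_ORDER)}

print(amount_code["$50,001 - $100,000"])  # → 2
```

An ordinal mapping like this keeps larger ranges numerically larger, which plain label encoding (alphabetical ordering) does not guarantee.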

In [5]:
# download senate financial investments

url = 'https://raw.githubusercontent.com/jeremiak/us-senate-financial-disclosure-data/master/output/transactions.csv'
df_s = pd.read_csv(url, low_memory = False)
In [6]:
df_s.head()
Out[6]:
report-id report-title filer filed-date filed-time transaction-date owner ticker asset-name transaction-type amount asset-type type comment comments transactor asset-id date amount-range
0 005C1940-0F77-446D-A7E2-4FB8DF292205 NaN Rhode Island 2012-12-27 22:10 2012-10-25 NaN NaN Lowes Companies, Inc. Sale $1,001 - $15,000 NaN NaN NaN NaN NaN NaN NaN NaN
1 00181DE2-0AC3-4373-9C2C-27A9C8D5DB59 NaN Maine 2013-07-22 14:14 2013-06-19 Spouse NaN MAINE ST HSG AUTH BOND Purchase $15,001 - $50,000 NaN NaN NaN NaN NaN NaN NaN NaN
2 0068462f-ee01-4550-98c9-b4437019d615 Periodic Transaction Report for 05/03/2017 Hoeven, John 05/03/2017 11:15 AM 04/24/2017 Self IBM International Business Machines Corporation NaN $50,001 - $100,000 Stock Purchase -- NaN NaN NaN NaN NaN
3 0068462f-ee01-4550-98c9-b4437019d615 Periodic Transaction Report for 05/03/2017 Hoeven, John 05/03/2017 11:15 AM 04/24/2017 Self MDLZ Mondelez International, Inc. NaN $50,001 - $100,000 Stock Purchase -- NaN NaN NaN NaN NaN
4 0068462f-ee01-4550-98c9-b4437019d615 Periodic Transaction Report for 05/03/2017 Hoeven, John 05/03/2017 11:15 AM 04/21/2017 Self GIS General Mills, Inc. NaN $100,001 - $250,000 Stock Purchase -- NaN NaN NaN NaN NaN
In [7]:
drop = ['report-id',
           'report-title',
           'comment',
           'comments',
           'transactor',
           'asset-id',
           'date',
           'amount-range']  #columns to drop
df_s.drop(drop, inplace=True, axis=1) #drop unnecessary columns
In [8]:
pd.options.mode.chained_assignment = None
In [9]:
df2 = df_s.dropna(subset=["ticker"]) #dropping rows without ticker
df2["filed-date"] = pd.to_datetime(df2["filed-date"], dayfirst=True) #convert filed-date to datetime64
df2["transaction-date"] = pd.to_datetime(df2["transaction-date"], dayfirst=True, errors = "coerce") #convert transaction-date into datetime64
In [10]:
df2 = df2[df2["transaction-date"] >= "2016-1-1"] #restrict to transactions between 2016-01-01 and 2022-11-18
df2 = df2[df2["transaction-date"] <= "2022-11-18"]
In [11]:
df2.reset_index(drop=True, inplace=True) #reset dataframe index
df2
Out[11]:
filer filed-date filed-time transaction-date owner ticker asset-name transaction-type amount asset-type type
0 Hoeven, John 2017-03-05 11:15 AM 2017-04-24 Self IBM International Business Machines Corporation NaN $50,001 - $100,000 Stock Purchase
1 Hoeven, John 2017-03-05 11:15 AM 2017-04-24 Self MDLZ Mondelez International, Inc. NaN $50,001 - $100,000 Stock Purchase
2 Hoeven, John 2017-03-05 11:15 AM 2017-04-21 Self GIS General Mills, Inc. NaN $100,001 - $250,000 Stock Purchase
3 Hoeven, John 2017-03-05 11:15 AM 2017-04-21 Self SBUX Starbucks Corporation NaN $50,001 - $100,000 Stock Purchase
4 Hoeven, John 2017-03-05 11:15 AM 2017-04-20 Self SPG Simon Property Group, Inc. NaN $100,001 - $250,000 Stock Sale (Full)
... ... ... ... ... ... ... ... ... ... ... ...
23528 Ernst, Joni 2022-11-05 10:49 AM 2021-03-08 Self CYPIX Consumer Services Ultra Sector Profund Investor Sale (Full) $1,001 - $15,000 NaN NaN
23529 Ernst, Joni 2022-11-05 10:49 AM 2021-03-08 Self HCPIX Health Care Ultrasector Profund Investor Sale (Full) $1,001 - $15,000 NaN NaN
23530 Ernst, Joni 2022-11-05 10:49 AM 2021-03-08 Self TEPIX Technology Ultrasector Profund Investor Class Sale (Full) $1,001 - $15,000 NaN NaN
23531 Ernst, Joni 2022-11-05 10:49 AM 2021-03-08 Self QTSSX Quantified Tactical Sectors Fund Investor Class Purchase $1,001 - $15,000 NaN NaN
23532 Blunt, Roy 2021-11-25 9:32 AM 2021-11-24 Spouse -- PM NaN $15,001 - $50,000 Stock Sale (Full)

23533 rows × 11 columns

In [12]:
df2.isnull().sum()
Out[12]:
filer                   0
filed-date            613
filed-time            613
transaction-date        0
owner                  14
ticker                  0
asset-name             33
transaction-type     8815
amount                  1
asset-type          14719
type                14718
dtype: int64

Below we combine the transaction-type and type columns, which ultimately represent the same thing.

In [13]:
df2["transaction_type"] = df2["transaction-type"].fillna("") + df2["type"].fillna("")
df2["transaction_type"].replace("", np.nan, inplace=True)
df2.drop(["type", "transaction-type"],axis=1,inplace=True)
In [14]:
df2.head()
Out[14]:
filer filed-date filed-time transaction-date owner ticker asset-name amount asset-type transaction_type
0 Hoeven, John 2017-03-05 11:15 AM 2017-04-24 Self IBM International Business Machines Corporation $50,001 - $100,000 Stock Purchase
1 Hoeven, John 2017-03-05 11:15 AM 2017-04-24 Self MDLZ Mondelez International, Inc. $50,001 - $100,000 Stock Purchase
2 Hoeven, John 2017-03-05 11:15 AM 2017-04-21 Self GIS General Mills, Inc. $100,001 - $250,000 Stock Purchase
3 Hoeven, John 2017-03-05 11:15 AM 2017-04-21 Self SBUX Starbucks Corporation $50,001 - $100,000 Stock Purchase
4 Hoeven, John 2017-03-05 11:15 AM 2017-04-20 Self SPG Simon Property Group, Inc. $100,001 - $250,000 Stock Sale (Full)
In [15]:
df2.dtypes
Out[15]:
filer                       object
filed-date          datetime64[ns]
filed-time                  object
transaction-date    datetime64[ns]
owner                       object
ticker                      object
asset-name                  object
amount                      object
asset-type                  object
transaction_type            object
dtype: object

Party Affiliation Dataset

The second dataset comes from https://dataverse.harvard.edu/file.xhtml?fileId=6708559&version=6.1, and was uploaded as a csv file to GitHub. This dataset contains senator name, party affiliation, and state. This will be helpful in looking at differences in trade volume across political parties and states. We will ultimately build a classification model to see if we can predict a senator's party affiliation based on the first dataset.

US Senators State and Political Party Dataset - Descriptions of columns used:

1 - state_po - state that the senator represents

2 - candidate - name of senator

3 - party_simplified - political party, split into Democrat, Republican, or Libertarian

In [16]:
url = "https://raw.githubusercontent.com/RobbieCase/stuff/main/1976-2020-senate.csv"

df_sen = pd.read_csv(url, encoding='latin1') #read csv
In [17]:
df_sen = df_sen[df_sen["year"] >= 2016] #keep elections from 2016 on to match the trade data's date range
In [18]:
df_sen.columns
Out[18]:
Index(['year', 'state', 'state_po', 'state_fips', 'state_cen', 'state_ic',
       'office', 'district', 'stage', 'special', 'candidate', 'party_detailed',
       'writein', 'mode', 'candidatevotes', 'totalvotes', 'unofficial',
       'version', 'party_simplified'],
      dtype='object')
In [19]:
to_drop = ["year", "state", "state_fips", 'state_fips', 'state_cen', 'state_ic',
       'office', 'district', 'stage', 'special',
       'writein', 'mode', 'candidatevotes', 'totalvotes', 'unofficial',
       'version','party_detailed']
df_sen.drop(to_drop, axis=1, inplace=True) #drop irrelevant columns
In [20]:
df_sen.dropna(axis=0, inplace=True) #drop null rows
df_sen.isnull().sum()
Out[20]:
state_po            0
candidate           0
party_simplified    0
dtype: int64
In [21]:
df_sen.head(10)
Out[21]:
state_po candidate party_simplified
3048 AL RON CRUMPTON DEMOCRAT
3049 AL RICHARD C. SHELBY REPUBLICAN
3051 AK JOE MILLER LIBERTARIAN
3052 AK TED GIANOUTSOS OTHER
3053 AK RAY METCALFE DEMOCRAT
3054 AK MARGARET STOCK OTHER
3055 AK BRECK A. CRAIG OTHER
3056 AK LISA MURKOWSKI REPUBLICAN
3058 AZ ANTHONY CAMBONI OTHER
3059 AZ ANN KIRKPATRICK DEMOCRAT
In [22]:
df3 = df2.copy()
In [23]:
df4 = df_sen.copy()
In [24]:
main = df3["filer"].unique() #get unique senator names

name = []
for i in main: 
    name.append(i) #make list of senators

names = []
for i in df4["candidate"].values: #add names from new dataset
    names.append(i)

Here, we utilize a tool with a funny name called "fuzzywuzzy." Fuzzywuzzy uses Levenshtein distance to match strings with one another. We need it because the names may be formatted differently in the two datasets (for example, "Hoeven, John" versus "JOHN HOEVEN").
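The matching idea can be illustrated with the standard library's difflib as a stand-in (fuzzywuzzy scores with Levenshtein distance, while difflib uses a different similarity measure, but the pattern of picking the closest candidate is the same). The names below are hypothetical examples in the two formats our datasets use.

```python
import difflib

# Hypothetical candidate names in the party dataset's ALL-CAPS format
candidates = ["JOHN HOEVEN", "LISA MURKOWSKI", "RICHARD C. SHELBY"]

# Normalize the filing dataset's "Last, First" format before matching
query = "Hoeven, John"
last, first = [p.strip() for p in query.split(",")]
normalized = f"{first} {last}".upper()

# Pick the closest candidate above difflib's default similarity cutoff
match = difflib.get_close_matches(normalized, candidates, n=1)
print(match)  # → ['JOHN HOEVEN']
```

fuzzywuzzy's process.extract, used below, does the same kind of closest-candidate lookup but returns the top matches with their similarity scores.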

In [25]:
key = {}

for i in name:
    key[i] = (process.extract(i, names, limit=2)) #run fuzzy wuzzy to match strings of names
In [26]:
merger = [key[name][0][0] for name in df3['filer']] #map each filer to its best fuzzy match
df3['merger'] = merger
In [27]:
df3 = df3.merge(df4, how = 'left', left_on = 'merger', right_on='candidate') #merge df3 and df4
In [28]:
df3.drop(["candidate","merger"],axis=1, inplace = True) #drop merger
In [29]:
df3 = df3[df3["party_simplified"].str.contains("OTHER")==False] #drop rows for third-party/other candidates

Committee Affiliation Dataset

The third dataset comes from https://github.com/unitedstates/congress-legislators, and was uploaded as a csv file to GitHub. This dataset contains senator name and committee affiliation. It is taken from a GitHub repository that is regularly updated with accurate government information. This will be helpful in looking at differences in trade volume across committees and states. We will once again build a classification model to see if we can predict a senator's committee affiliation based on the initial trade data.

US Senators Committee Membership Dataset - Descriptions of columns used:

1 - Name - name of the senator

2 - Committee Name - name of a committee on which the senator serves

In [30]:
sen_url = 'https://raw.githubusercontent.com/RobbieCase/stuff/main/Senate_assignments_103-117.tab' #read tab dataset
df_co = pd.read_csv(sen_url, sep='\t', engine='python')
In [31]:
df_co.columns=df_co.iloc[0] #use the first row as the column headers
df_co = df_co.iloc[1: , :]
In [32]:
df_co.columns
Out[32]:
Index(['Congress', 'Committee Code', 'ID #', 'Name', 'Maj/Min',
       'Rank Within Party', 'Party Code', 'Date of Appointment',
       'Date of Termination', 'Senior Party Member', 'Committee Seniority',
       'Committee Period of Service',
       'Committee status at end of this Congress',
       'Committee continuity of assignment in next Congress',
       'Appointment Citation', 'Committee Name', 'State Code', 'District',
       'State Name', 'Notes'],
      dtype='object', name=0)
In [33]:
dropping = ['Congress', 'Committee Code', 'ID #', 'Maj/Min',
       'Rank Within Party', 'Party Code', 'Date of Appointment',
       'Date of Termination', 'Senior Party Member', 'Committee Seniority',
       'Committee Period of Service',
       'Committee status at end of this Congress',
       'Committee continuity of assignment in next Congress',
       'Appointment Citation', 'State Code', 'District',
       'State Name', 'Notes'] 

df_co.drop(dropping,axis=1,inplace=True) #drop irrelevant data
In [34]:
main = df3["filer"].unique() #get unique senator names

name = []
for i in main: #make list of unique senators
    name.append(i)

names = []
for i in df_co["Name"].values: #make list from names in new dataset
    names.append(i)
In [35]:
key = {}

for i in name:
    key[i] = (process.extract(i, names, limit=2)) #run fuzzy wuzzy for string matching of the names
In [36]:
merger = [key[name][0][0] for name in df3['filer']] #make column from names in key that are in original dataset
df3['merger'] = merger
In [37]:
df3 = df3.merge(df_co, how = 'left', left_on = 'merger', right_on='Name') #merge df3 and df_co
In [38]:
df3.drop(["merger","Name"],axis=1, inplace = True) #drop merger data
In [39]:
df3.head()
Out[39]:
filer filed-date filed-time transaction-date owner ticker asset-name amount asset-type transaction_type state_po party_simplified Committee Name
0 Hoeven, John 2017-03-05 11:15 AM 2017-04-24 Self IBM International Business Machines Corporation $50,001 - $100,000 Stock Purchase ND REPUBLICAN Agriculture, Nutrition, and Forestry
1 Hoeven, John 2017-03-05 11:15 AM 2017-04-24 Self IBM International Business Machines Corporation $50,001 - $100,000 Stock Purchase ND REPUBLICAN Appropriations
2 Hoeven, John 2017-03-05 11:15 AM 2017-04-24 Self IBM International Business Machines Corporation $50,001 - $100,000 Stock Purchase ND REPUBLICAN Energy and Natural Resources
3 Hoeven, John 2017-03-05 11:15 AM 2017-04-24 Self IBM International Business Machines Corporation $50,001 - $100,000 Stock Purchase ND REPUBLICAN Indian Affairs (Select Committee)
4 Hoeven, John 2017-03-05 11:15 AM 2017-04-24 Self IBM International Business Machines Corporation $50,001 - $100,000 Stock Purchase ND REPUBLICAN Agriculture, Nutrition, and Forestry

EDA

Below is the Exploratory Data Analysis we did on these three datasets. We set out to see whether trading behavior shows any relationship to a senator's party and committee affiliation, using graphs of trade volume and frequency by party and committee. We will ultimately build models to predict party and committee affiliation from the trade data, along with predicting whether a particular senator will buy or sell a given stock.

In [40]:
disclosures = df3.copy()
In [41]:
disclosures.rename(columns={"filed-date":"filed_date","filed-time":"filed_time","transaction-date":"transaction_date" ,"asset-name":"asset","asset-type":"type", "state_po":"state", "party_simplified":"party", "Committee Name": "committee"}, inplace=True)
In [42]:
disclosures.head(25)
Out[42]:
filer filed_date filed_time transaction_date owner ticker asset amount type transaction_type state party committee
0 Hoeven, John 2017-03-05 11:15 AM 2017-04-24 Self IBM International Business Machines Corporation $50,001 - $100,000 Stock Purchase ND REPUBLICAN Agriculture, Nutrition, and Forestry
1 Hoeven, John 2017-03-05 11:15 AM 2017-04-24 Self IBM International Business Machines Corporation $50,001 - $100,000 Stock Purchase ND REPUBLICAN Appropriations
2 Hoeven, John 2017-03-05 11:15 AM 2017-04-24 Self IBM International Business Machines Corporation $50,001 - $100,000 Stock Purchase ND REPUBLICAN Energy and Natural Resources
3 Hoeven, John 2017-03-05 11:15 AM 2017-04-24 Self IBM International Business Machines Corporation $50,001 - $100,000 Stock Purchase ND REPUBLICAN Indian Affairs (Select Committee)
4 Hoeven, John 2017-03-05 11:15 AM 2017-04-24 Self IBM International Business Machines Corporation $50,001 - $100,000 Stock Purchase ND REPUBLICAN Agriculture, Nutrition, and Forestry
5 Hoeven, John 2017-03-05 11:15 AM 2017-04-24 Self IBM International Business Machines Corporation $50,001 - $100,000 Stock Purchase ND REPUBLICAN Appropriations
6 Hoeven, John 2017-03-05 11:15 AM 2017-04-24 Self IBM International Business Machines Corporation $50,001 - $100,000 Stock Purchase ND REPUBLICAN Energy and Natural Resources
7 Hoeven, John 2017-03-05 11:15 AM 2017-04-24 Self IBM International Business Machines Corporation $50,001 - $100,000 Stock Purchase ND REPUBLICAN Indian Affairs (Select Committee)
8 Hoeven, John 2017-03-05 11:15 AM 2017-04-24 Self IBM International Business Machines Corporation $50,001 - $100,000 Stock Purchase ND REPUBLICAN Agriculture, Nutrition, and Forestry
9 Hoeven, John 2017-03-05 11:15 AM 2017-04-24 Self IBM International Business Machines Corporation $50,001 - $100,000 Stock Purchase ND REPUBLICAN Appropriations
10 Hoeven, John 2017-03-05 11:15 AM 2017-04-24 Self IBM International Business Machines Corporation $50,001 - $100,000 Stock Purchase ND REPUBLICAN Energy and Natural Resources
11 Hoeven, John 2017-03-05 11:15 AM 2017-04-24 Self IBM International Business Machines Corporation $50,001 - $100,000 Stock Purchase ND REPUBLICAN Indian Affairs (Select Committee)
12 Hoeven, John 2017-03-05 11:15 AM 2017-04-24 Self IBM International Business Machines Corporation $50,001 - $100,000 Stock Purchase ND REPUBLICAN Agriculture, Nutrition, and Forestry
13 Hoeven, John 2017-03-05 11:15 AM 2017-04-24 Self IBM International Business Machines Corporation $50,001 - $100,000 Stock Purchase ND REPUBLICAN Appropriations
14 Hoeven, John 2017-03-05 11:15 AM 2017-04-24 Self IBM International Business Machines Corporation $50,001 - $100,000 Stock Purchase ND REPUBLICAN Energy and Natural Resources
15 Hoeven, John 2017-03-05 11:15 AM 2017-04-24 Self IBM International Business Machines Corporation $50,001 - $100,000 Stock Purchase ND REPUBLICAN Homeland Security and Governmental Affairs
16 Hoeven, John 2017-03-05 11:15 AM 2017-04-24 Self IBM International Business Machines Corporation $50,001 - $100,000 Stock Purchase ND REPUBLICAN Indian Affairs (Select Committee)
17 Hoeven, John 2017-03-05 11:15 AM 2017-04-24 Self IBM International Business Machines Corporation $50,001 - $100,000 Stock Purchase ND REPUBLICAN Agriculture, Nutrition, and Forestry
18 Hoeven, John 2017-03-05 11:15 AM 2017-04-24 Self IBM International Business Machines Corporation $50,001 - $100,000 Stock Purchase ND REPUBLICAN Appropriations
19 Hoeven, John 2017-03-05 11:15 AM 2017-04-24 Self IBM International Business Machines Corporation $50,001 - $100,000 Stock Purchase ND REPUBLICAN Energy and Natural Resources
20 Hoeven, John 2017-03-05 11:15 AM 2017-04-24 Self IBM International Business Machines Corporation $50,001 - $100,000 Stock Purchase ND REPUBLICAN Indian Affairs (Select Committee)
21 Hoeven, John 2017-03-05 11:15 AM 2017-04-24 Self IBM International Business Machines Corporation $50,001 - $100,000 Stock Purchase ND REPUBLICAN Agriculture, Nutrition, and Forestry
22 Hoeven, John 2017-03-05 11:15 AM 2017-04-24 Self IBM International Business Machines Corporation $50,001 - $100,000 Stock Purchase ND REPUBLICAN Appropriations
23 Hoeven, John 2017-03-05 11:15 AM 2017-04-24 Self IBM International Business Machines Corporation $50,001 - $100,000 Stock Purchase ND REPUBLICAN Energy and Natural Resources
24 Hoeven, John 2017-03-05 11:15 AM 2017-04-24 Self IBM International Business Machines Corporation $50,001 - $100,000 Stock Purchase ND REPUBLICAN Indian Affairs (Select Committee)
In [43]:
disclosures.to_csv('disclosures.csv')
In [44]:
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [45]:
party_date = disclosures.groupby(['transaction_date', 'party']).count()['amount'].reset_index()
sns.lineplot(data=party_date, x='transaction_date', y='amount', hue='party', alpha=0.5).set(title='Trades by Party Over Time')
Out[45]:
[Text(0.5, 1.0, 'Trades by Party Over Time')]

The graph above shows the amount traded by party over the period the data was collected. As we can see, both parties are similarly active in their trading; Libertarians are barely visible here, as they represent such a small sample size.

In [46]:
sns.histplot(data=party_date, x="amount", log_scale = True, hue='party', alpha=0.5, kde = True,  multiple = 'dodge').set(title='Trade Volume Based On Party')
Out[46]:
[Text(0.5, 1.0, 'Trade Volume Based On Party')]

The graph above shows how often each daily trade volume occurred, by party. The x-axis is the number of trades in a day; for example, if there were exactly 100 trades on each of 50 different days, the bar at 100 would have a count of 50. This shows the volume of trades by party, and we can see that the parties are relatively similar in how much they trade.

In [60]:
warnings.filterwarnings("ignore", 'This pattern has match groups') #filter out warnings
freq_committee = disclosures[disclosures["committee"].str.contains("Budget|Finance|Environment and Public Works|Banking, Housing, and Urban Affairs|Agriculture, Nutrition, and Forestry|Aging (Special Committee)|Judiciary|Energy and Natural Resources|Health, Education, Labor, and Pensions|Homeland Security and Governmental Affairs|Armed Services|Foreign Relations|Appropriations|Intelligence (Select Committee)|Commerce, Science, and Transportation|Veterans' Affairs|Rules and Administration|Economic (Joint Committee)|Small Business and Entrepreneurship|Indian Affairs (Select Committee)") == True] #take 20 most active committees
In [61]:
committee_amt = freq_committee.groupby(['transaction_date','party', 'committee']).count()['amount'].reset_index() #make df from committee,party, and count of volume
In [62]:
committee_amt.head()
Out[62]:
transaction_date party committee amount
0 2016-01-01 DEMOCRAT Banking, Housing, and Urban Affairs 7
1 2016-01-01 DEMOCRAT Budget 7
2 2016-01-01 DEMOCRAT Commerce, Science, and Transportation 3
3 2016-01-01 DEMOCRAT Finance 5
4 2016-01-01 DEMOCRAT Rules and Administration 7
In [63]:
sns.set(rc={'figure.figsize':(16.7,10.27)})
In [64]:
# create bar graph displaying the most trading volume based on committee and political party

com_date = disclosures.groupby(['party', 'committee']).count()['amount'].reset_index()
com_date = com_date.nlargest(32, 'amount')
com_date.set_index('amount', inplace=True)
com_date.sort_index(ascending=False, inplace=True)
com_date.reset_index(inplace=True)
com_table = pd.pivot_table(com_date, values='amount', index=['committee'], # pivot so each committee has one bar per party
                    columns=['party'], aggfunc=np.sum, fill_value=0, sort=False)
In [52]:
com_table.plot.bar(figsize=(20,10), title='Trade Volume Based on Committee', ylabel='count') # create bar graph
Out[52]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f29557792b0>

The graph above shows the committees with the largest trade volumes, i.e., the 20 most financially active committees; the largest and most active across both parties is the Budget Committee. One interesting note: in Energy and Natural Resources, Democrats traded far more heavily than Republicans, while the reverse held for Agriculture, Nutrition, and Forestry.

In [53]:
freq_committee["committee"].value_counts()
Out[53]:
Budget                                        75769
Finance                                       57999
Environment and Public Works                  46009
Banking, Housing, and Urban Affairs           42147
Agriculture, Nutrition, and Forestry          39340
Judiciary                                     35139
Energy and Natural Resources                  32200
Health, Education, Labor, and Pensions        28432
Homeland Security and Governmental Affairs    27129
Armed Services                                26163
Foreign Relations                             24746
Appropriations                                23797
Commerce, Science, and Transportation         20834
Veterans' Affairs                             17224
Rules and Administration                      15534
Small Business and Entrepreneurship           13169
Name: committee, dtype: int64
In [54]:
top_coms = freq_committee["committee"].value_counts().rename_axis('committee').reset_index(name='count') #build df for the treemap directly from the counts above
In [55]:
squarify.plot(sizes=top_coms['count'], label=top_coms['committee'], alpha=.8, text_kwargs={'fontsize':8}) #create treemap with squarify
plt.axis('off')
plt.show()

The plot above is a treemap showing the heavy hitters in trade volume by committee. You can see how the committees compare in investment activity; unsurprisingly, Budget and Finance take up a good portion of the chart.

In [56]:
trade_vol = disclosures.groupby(["filer", "party"]).count()['amount'].reset_index() # return the 20 senators with the most trades
trade_vol = trade_vol.nlargest(20, 'amount')
trade_vol.set_index('amount', inplace=True)
trade_vol.sort_index(ascending=False, inplace=True)
trade_vol.reset_index(inplace=True)
trade_table = pd.pivot_table(trade_vol, values='amount', index=['filer'], # pivot so each senator has one bar per party
                    columns=['party'], aggfunc=np.sum, fill_value=0, sort=False)
In [57]:
trade_table.plot.bar(figsize=(10,6), title='Individual Senator Trade Volume', ylabel='count') # create bar graph
Out[57]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f29559c6ee0>

Above are the senators who executed the most trades, colored by party. This is helpful for spotting potential outliers that could skew the data for either party. While both Democrats and Republicans are represented, Democrats have a few more outliers, while the largest single trader is Republican David Perdue.

Modeling

Model Prep

In [65]:
modeldf = df3.copy()
In [66]:
modeldf.isnull().sum()
Out[66]:
filer                    0
filed-date           13016
filed-time           13016
transaction-date         0
owner                  378
ticker                   0
asset-name             429
amount                  27
asset-type          383014
transaction_type        27
state_po                 0
party_simplified         0
Committee Name           0
dtype: int64
In [67]:
modeldf.drop(["asset-type"], axis=1, inplace = True)
In [68]:
modeldf.dropna(axis = 0 , inplace = True) #drop null values

We elect to drop the null values so they do not interfere with our model.

In [69]:
modeldf.count()
Out[69]:
filer               627435
filed-date          627435
filed-time          627435
transaction-date    627435
owner               627435
ticker              627435
asset-name          627435
amount              627435
transaction_type    627435
state_po            627435
party_simplified    627435
Committee Name      627435
dtype: int64

Label Encoding

In [70]:
modeldf.dtypes
Out[70]:
filer                       object
filed-date          datetime64[ns]
filed-time                  object
transaction-date    datetime64[ns]
owner                       object
ticker                      object
asset-name                  object
amount                      object
transaction_type            object
state_po                    object
party_simplified            object
Committee Name              object
dtype: object
In [71]:
modeldf.rename(columns={"filed-date":"filed_date","filed-time":"filed_time","transaction-date":"transaction_date" ,"asset-name":"asset","asset-type":"type", "state_po":"state", "party_simplified":"party", "Committee Name": "committee"}, inplace=True) #rename columns

Here, we use label encoding to convert each categorical column to integer codes so the data can be fed to the training models. Each unique value in a column is mapped to a unique integer. The mappings for the most substantial columns are printed below as well.

In [72]:
le = preprocessing.LabelEncoder() #label encode categorical variables
modeldf['filer']= le.fit_transform(modeldf['filer'])  #utilize label encoder on column
modeldf['owner']= le.fit_transform(modeldf['owner'])
print(blue('owner', 'bold'))
for i, j in enumerate(le.classes_): #print out the labels for specific columns
  print(j, ":", i) 
modeldf['ticker']= le.fit_transform(modeldf['ticker'])
modeldf['asset']= le.fit_transform(modeldf['asset']) 
modeldf['amount']= le.fit_transform(modeldf['amount'])
print(blue('amount', 'bold'))
for i, j in enumerate(le.classes_):
  print(j, ":", i) 
modeldf['transaction_type']= le.fit_transform(modeldf['transaction_type'])
print(blue('type', 'bold'))
for i, j in enumerate(le.classes_):
  print(j, ":", i) 
modeldf['state']= le.fit_transform(modeldf['state'])
modeldf['party']= le.fit_transform(modeldf['party'])
print(blue('party', 'bold'))
for i, j in enumerate(le.classes_):
  print(j, ":", i) 
modeldf['committee']= le.fit_transform(modeldf['committee'])
print(blue('committee', 'bold'))
for i, j in enumerate(le.classes_):
  print(j, ":", i) 
modeldf['filed_date']= le.fit_transform(modeldf['filed_date'])
modeldf['filed_time']= le.fit_transform(modeldf['filed_time'])
modeldf['transaction_date']= le.fit_transform(modeldf['transaction_date'])
owner
Child : 0
Joint : 1
Self : 2
Spouse : 3
amount
$1,000,001 - $5,000,000 : 0
$1,001 - $15,000 : 1
$100,001 - $250,000 : 2
$15,001 - $50,000 : 3
$25,000,001 - $50,000,000 : 4
$250,001 - $500,000 : 5
$5,000,001 - $25,000,000 : 6
$50,001 - $100,000 : 7
$500,001 - $1,000,000 : 8
type
Exchange : 0
Purchase : 1
Sale : 2
Sale (Full) : 3
Sale (Partial) : 4
SaleSale : 5
party
DEMOCRAT : 0
LIBERTARIAN : 1
REPUBLICAN : 2
committee
APPROPRIATIONS : 0
Aging (Special Committee) : 1
Agriculture, Nutrition, and Forestry : 2
Appropriations : 3
Armed Services : 4
Banking, Housing, and Urban Affairs : 5
Budget : 6
Commerce, Science, and Transportation : 7
ENVIRONMENT AND PUBLIC WORKS : 8
Economic (Joint Committee) : 9
Energy and Natural Resources : 10
Environment and Public Works : 11
Ethics (Select Committee) : 12
Finance : 13
Foreign Relations : 14
Governmental Affairs : 15
Health, Education, Labor, and Pensions : 16
Homeland Security and Governmental Affairs : 17
Indian Affairs (Select Committee) : 18
Intelligence (Select Committee) : 19
JUDICIARY : 20
Judiciary : 21
Labor and Human Resources : 22
Library (Joint) : 23
Majority whip : 24
Minority whip : 25
Printing (Joint) : 26
Rules and Administration : 27
Small Business : 28
Small Business and Entrepreneurship : 29
Taxation (Joint) : 30
Veterans' Affairs : 31
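The printed mappings also reveal a quirk worth keeping in mind: `LabelEncoder` assigns codes in lexicographic (string-sort) order, so the dollar ranges in `amount` are not encoded in ascending dollar order ($1,000,001 - $5,000,000 maps to 0 while $1,001 - $15,000 maps to 1). A minimal sketch of an explicit ordinal mapping that would preserve the dollar ordering instead (the dictionary is written out by hand from the ranges above; this is an alternative we did not apply to the model):

```python
import pandas as pd

# Explicit ordinal codes for the disclosure ranges, lowest dollar value first,
# so that a larger code actually means a larger trade.
amount_order = {
    '$1,001 - $15,000': 0,
    '$15,001 - $50,000': 1,
    '$50,001 - $100,000': 2,
    '$100,001 - $250,000': 3,
    '$250,001 - $500,000': 4,
    '$500,001 - $1,000,000': 5,
    '$1,000,001 - $5,000,000': 6,
    '$5,000,001 - $25,000,000': 7,
    '$25,000,001 - $50,000,000': 8,
}

amounts = pd.Series(['$1,001 - $15,000', '$1,000,001 - $5,000,000'])
print(amounts.map(amount_order).tolist())  # [0, 6]
```

For tree-based models like random forest this ordering matters little, but for distance-based models like KNN it changes which rows count as neighbors.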
In [73]:
modeldf
Out[73]:
filer filed_date filed_time transaction_date owner ticker asset amount transaction_type state party committee
0 38 108 84 333 2 945 2025 7 1 26 2 2
1 38 108 84 333 2 945 2025 7 1 26 2 3
2 38 108 84 333 2 945 2025 7 1 26 2 10
3 38 108 84 333 2 945 2025 7 1 26 2 18
4 38 108 84 333 2 945 2025 7 1 26 2 2
... ... ... ... ... ... ... ... ... ... ... ... ...
641253 5 637 617 1494 3 0 3105 3 3 22 2 7
641254 5 637 617 1494 3 0 3105 3 3 22 2 27
641255 5 637 617 1494 3 0 3105 3 3 22 2 19
641256 5 637 617 1494 3 0 3105 3 3 22 2 23
641257 5 637 617 1494 3 0 3105 3 3 22 2 26

627435 rows × 12 columns

In [74]:
modeldf.dtypes
Out[74]:
filer               int64
filed_date          int64
filed_time          int64
transaction_date    int64
owner               int64
ticker              int64
asset               int64
amount              int64
transaction_type    int64
state               int64
party               int64
committee           int64
dtype: object

Party

First, we attempt to use a classification model to predict the party of the trader based on the trade data. We use classification rather than regression because our goal is to predict a class (the party of the given trader) rather than a continuous value. We start with KNN, and if the model lacks accuracy, we may bring in a different classification model to compare.

In [129]:
modeldf2 = modeldf.copy() #copy dataframe for modeling
In [130]:
to_drop = ["filer", "owner", "state", "committee"]
modeldf2.drop(to_drop, axis=1, inplace= True) #drop columns that could skew prediction
In [131]:
X = modeldf2.drop('party', axis = 1)
y = modeldf2['party'] #assign party to y to train
In [132]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25) #train test split for modeling
In [133]:
knn = KNeighborsClassifier(n_neighbors = 20) #knn classifier with 20 neighbors
knn.fit(X_train, y_train) #fitting x and y train
knn.score(X_test, y_test) #accuracy of model
Out[133]:
0.9939946066212331

The model is extremely accurate at predicting party from trade data. Using KNN, we achieved an accuracy of over 99%, so the trade data is highly informative about the party of the trader.
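One caveat with KNN is that it is distance-based: features with larger integer ranges after label encoding (e.g. `asset`, with thousands of levels) dominate the distance computation over small-range features like `amount`. A hedged sketch of wrapping the classifier in a scaling pipeline, shown here on synthetic stand-in data rather than our actual `modeldf2` split:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the label-encoded trade features and 3 party classes.
X, y = make_classification(n_samples=2000, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=42)

# Standardize every feature before the neighbor distance computation so no
# single wide-range encoded column dominates the distances.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=20))
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```

In our case the unscaled model already scores above 99%, but scaling is cheap insurance whenever KNN is applied to mixed-range encoded features.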

In [134]:
y_pred = knn.predict(X_test) #assign df with predicted vals

Party

DEMOCRAT : 0

LIBERTARIAN : 1

REPUBLICAN : 2

In [135]:
df = pd.DataFrame({'Actual':y_test, 'Predicted':y_pred}) #actual vs predicted party
df
Out[135]:
Actual Predicted
314841 2 2
294761 2 2
15761 2 2
271834 0 0
374517 0 0
... ... ...
465373 0 0
444650 2 2
5826 0 0
454423 0 0
192015 0 0

156859 rows × 2 columns

The model is extremely accurate, correctly predicting the party for nearly all trades.

In [136]:
fig = plt.figure(figsize=(15,6))
skplt.metrics.plot_confusion_matrix(y_test, y_pred,
                                    title="Party Model Confusion Matrix",
                                    cmap="Purples")
Out[136]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f294e887d00>
<Figure size 1080x432 with 0 Axes>

As we can see, the model was successful in correctly classifying Democrats and Republicans, but less so for Libertarians, given their significantly smaller sample size.
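Overall accuracy can hide exactly this kind of minority-class weakness; per-class precision and recall make it visible. A sketch using `classification_report`, with toy imbalanced labels standing in for our actual `y_test` and `y_pred` (the counts below are illustrative, not our real results):

```python
from sklearn.metrics import classification_report

# Toy labels: class 1 (Libertarian) is rare and mostly misclassified,
# yet overall accuracy is still 96% -- which a single score would hide.
y_true = [0] * 45 + [1] * 5 + [2] * 50
y_hat  = [0] * 45 + [1] * 1 + [0] * 4 + [2] * 50

print(classification_report(y_true, y_hat,
                            target_names=['DEMOCRAT', 'LIBERTARIAN', 'REPUBLICAN']))
```

Here Libertarian recall is only 0.2 even though accuracy is 0.96, mirroring what the confusion matrix shows for our model.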

Committee

Now we again attempt to use a classification model, this time to predict the committee of the trader based on the trade data. As before, we use a classification model rather than a regression model because our goal is to predict the committee of the given trader.

In [118]:
cpred = modeldf.copy() #copy dataframe for modeling
In [119]:
to_drop = ["filer", "owner", "state", "party"]
cpred.drop(to_drop, axis=1, inplace= True)
#drop columns that could skew prediction
In [120]:
X = cpred.drop('committee', axis = 1) 
y = cpred['committee'] #assign committee to y to train
In [121]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25) #train test split for modeling
In [122]:
knn = KNeighborsClassifier(n_neighbors=100) #knn classifier with 100 neighbors
knn.fit(X_train, y_train) #fitting knn training dfs
knn.score(X_test, y_test) #accuracy of model
Out[122]:
0.16230500003187576

KNN doesn't seem to be cutting it in this case. Let's add some more firepower and use XGBoost to see if performance changes at all.

In [ ]:
xgb_model = xgb.XGBClassifier(objective="multi:softprob", random_state=42) #XGBoost classifier with multiple classifiers
xgb_model.fit(X_train, y_train) #fit xgb model with x and y train
xgb_model.score(X_test, y_test) #accuracy of model
Out[ ]:
0.16832951886726297

XGBoost only slightly improved the model's accuracy, so we can determine that this trade data is not very informative for predicting committee membership.
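A 16% accuracy is easier to interpret against a baseline: with 32 committee classes, always predicting the most common committee sets the floor any real model must beat. A sketch of that majority-class baseline with `DummyClassifier`, shown on toy labels rather than our actual split:

```python
from sklearn.dummy import DummyClassifier

# Toy labels standing in for encoded committee codes: class 3 is most frequent.
X_toy = [[0]] * 10
y_toy = [3] * 5 + [7] * 3 + [1] * 2

# Majority-class baseline: predict the most frequent label for every row.
# If KNN/XGBoost barely beat this floor, the features carry little signal.
base = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
print(base.score(X_toy, y_toy))  # 0.5
```

If the most common committee in our data covers a similar share of rows to our models' 16-17% accuracy, the models have learned very little beyond class frequencies.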

In [123]:
y_pred = knn.predict(X_test) #create df with predicted values

Committee

APPROPRIATIONS : 0

Aging (Special Committee) : 1

Agriculture, Nutrition, and Forestry : 2

Appropriations : 3

Armed Services : 4

Banking, Housing, and Urban Affairs : 5

Budget : 6

Commerce, Science, and Transportation : 7

ENVIRONMENT AND PUBLIC WORKS : 8

Economic (Joint Committee) : 9

Energy and Natural Resources : 10

Environment and Public Works : 11

Ethics (Select Committee) : 12

Finance : 13

Foreign Relations : 14

Governmental Affairs : 15

Health, Education, Labor, and Pensions : 16

Homeland Security and Governmental Affairs : 17

Indian Affairs (Select Committee) : 18

Intelligence (Select Committee) : 19

JUDICIARY : 20

Judiciary : 21

Labor and Human Resources : 22

Library (Joint) : 23

Majority whip : 24

Minority whip : 25

Printing (Joint) : 26

Rules and Administration : 27

Small Business : 28

Small Business and Entrepreneurship : 29

Taxation (Joint) : 30

Veterans' Affairs : 31
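The mapping above also contains near-duplicate labels differing only in capitalization (`APPROPRIATIONS` vs `Appropriations`, `JUDICIARY` vs `Judiciary`, `ENVIRONMENT AND PUBLIC WORKS` vs `Environment and Public Works`), which splits one committee across multiple classes and can only hurt accuracy. A hedged sketch of collapsing them before encoding (a hypothetical cleanup we did not apply; multi-word names may need a hand-built mapping rather than `str.title`):

```python
import pandas as pd

committees = pd.Series(['APPROPRIATIONS', 'Appropriations',
                        'JUDICIARY', 'Judiciary'])

# Collapse case-only duplicates to one canonical form before label encoding,
# so 'APPROPRIATIONS' and 'Appropriations' become a single class.
canonical = committees.str.title()
print(canonical.nunique())  # 2
```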

In [124]:
df = pd.DataFrame({'Actual':y_test, 'Predicted':y_pred}) #actual vs predicted committee
df
Out[124]:
Actual Predicted
493029 6 5
242354 6 21
616467 6 5
127544 11 21
379557 7 10
... ... ...
219510 16 1
30312 1 1
34140 11 29
119244 5 6
110474 6 19

156859 rows × 2 columns

As we can see, the model is not very accurate in predicting committee.

Transaction Type

Lastly, we attempt to use a classification model to predict the transaction type of the trade based on the trade data. We again use a classification model, and with it we can ultimately predict trading behavior.

In [82]:
tpred = modeldf.copy()  #copy dataframe for modeling
In [84]:
to_drop = ["filer", "owner"]
tpred.drop(to_drop, axis=1, inplace= True) #drop columns that could skew prediction
In [85]:
X = tpred.drop('transaction_type', axis = 1)
y = tpred['transaction_type'] #assign type to y to train
In [86]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25) #train test split for modeling
In [87]:
knn = KNeighborsClassifier(n_neighbors = 20) #run knn classifier with 20 neighbors
knn.fit(X_train, y_train) #fit knn classifier with x and y train
knn.score(X_test, y_test) #model accuracy
Out[87]:
0.9649175374062056

Here we have some very solid results. Using KNN, we can accurately classify whether a trade was a Purchase, an Exchange, or one of the Sale types. Let's see if we can improve accuracy with a different model.

In [88]:
rf = RandomForestClassifier() #run rf classifier
rf.fit(X_train, y_train) #fit x and y train to rf classifier
rf.score(X_test, y_test) #model accuracy
Out[88]:
0.9908134056700604

Using random forest, we increased the model's accuracy to over 99%! This shows random forest is a better classification model than KNN for this data.
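One advantage of random forest over KNN is that it exposes which features drive its splits via `feature_importances_`, which would show whether, say, `amount` or `ticker` carries most of the signal for transaction type. A sketch on synthetic stand-in data, with illustrative column names matching our dataframe (in the notebook, `rf` and the real feature columns already exist):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for five of the encoded trade features.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
cols = ['filed_date', 'transaction_date', 'ticker', 'asset', 'amount']

rf = RandomForestClassifier(random_state=0).fit(X, y)

# Importances sum to 1; higher values mean a feature was used more often
# (and more effectively) in the forest's split decisions.
imp = pd.Series(rf.feature_importances_, index=cols).sort_values(ascending=False)
print(imp)
```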

In [91]:
y_pred = rf.predict(X_test) #df with predicted types

Transaction Type

Exchange : 0

Purchase : 1

Sale : 2

Sale (Full) : 3

Sale (Partial) : 4

In [92]:
df = pd.DataFrame({'Actual':y_test, 'Predicted':y_pred}) #actual vs predicted transaction types
df
Out[92]:
Actual Predicted
166513 3 3
433089 3 3
440796 1 1
506381 4 4
523322 1 1
... ... ...
371778 1 1
273708 4 4
623702 1 1
593298 1 1
556972 4 4

156859 rows × 2 columns

The model is very accurate, correctly predicting the transaction type for nearly all trades.

In [108]:
fig = plt.figure(figsize=(15,6))
skplt.metrics.plot_confusion_matrix(y_test, y_pred,
                                    title="Transaction Type Model Confusion Matrix",
                                    cmap="Oranges")
Out[108]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2953883ac0>
<Figure size 1080x432 with 0 Axes>

Shown above is the confusion matrix for the transaction type model. As is evident, the model predicts the correct class for most entries, with Purchase, Sale (Full), and Sale (Partial) being the most represented. This is as expected, as they are the most common types in the dataset.

Conclusions

Ultimately, with the given data we were able to build accurate prediction models for party affiliation and for whether a senator will buy or sell a given stock based on the trade data. We were unsuccessful in building an accurate model to predict committee affiliation. This is likely because committee affiliation has less relevance to trading behavior than we had presumed. Going into this project, we operated under the presumption that it would be significant, so this goes to show that one's committee assignment does not necessarily mean they will trade one way or another.

Works Cited

In [141]:
%%shell
jupyter nbconvert --to html /content/drive/MyDrive/ds_project.ipynb
[NbConvertApp] Converting notebook /content/drive/MyDrive/ds_project.ipynb to html
[NbConvertApp] Writing 780260 bytes to /content/drive/MyDrive/ds_project.html
Out[141]: