Because data on private political donors in Norway are incomplete and are not collected or sorted in one place, I want to build a database of all donors going back to 2011. I want to finish this before the national election in 2017. My first step is to scrape the donor data for the Conservative Party, the party that currently has the PM. I chose them first because they have the most donors, and second because their donors tend to be wealthy people and companies in shipping, real estate and investment banking.

Inspired by the NYTimes story “Small Pool of Rich Donors Dominates Election Giving”, I wanted to find out how many families dominate election giving to the Conservative Party. I also wanted to find out who the most faithful donors are, which donors stopped donating, and who the newcomers are.

Unlike in the US, where donor data are made public on given dates, Norwegian political parties have to make donations public no more than four weeks after the donation was made. So the second part of my project was to build a newsbot, using Mandrill, that would email me if any changes were made to the Conservative Party's 2015 donor website.

To build the bot, I was inspired by Quakebot, the LA Times newsbot that we worked on during the first part of the summer. I set the bot to print out a sentence that can be published as part of a story if one or several new donors are made public. The newsbot scrapes the 2015 site every five minutes and runs from an EC2 server.
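The core of such a bot is comparing the donors seen in the previous scrape with the donors in the current one. The following is a minimal sketch of that comparison and of the sentence it produces, not the bot's actual code: the function names, the wording of the sentence and the donor names are assumptions made for illustration, and the scraping and emailing steps are left out.

```python
def find_new_donors(previous, current):
    """Return donors in the current scrape that were not in the previous one.

    Page order is kept and repeated names are reported only once.
    """
    seen = set(previous)
    new_donors = []
    for name in current:
        if name not in seen:
            new_donors.append(name)
            seen.add(name)
    return new_donors


def build_sentence(new_donors, party="the Conservative Party"):
    """Build a publishable sentence, or return None when nothing changed."""
    if not new_donors:
        return None
    if len(new_donors) == 1:
        return f"{new_donors[0]} has been registered as a new donor to {party}."
    names = ", ".join(new_donors[:-1]) + " and " + new_donors[-1]
    return f"{names} have been registered as new donors to {party}."


# Placeholder names, not real scrape results.
previous_scrape = ["Donor A", "Donor B"]
current_scrape = ["Donor A", "Donor B", "Donor C", "Donor D"]

sentence = build_sentence(find_new_donors(previous_scrape, current_scrape))
if sentence is not None:
    print(sentence)  # this is where the bot would send its email
```

Returning None when nothing has changed means the scheduled job can stay silent on most runs and only send a message when there is something to report.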

As there is an election for local municipalities coming up in two weeks, I expect a few more donations to be made public on the last Friday before the election, September 14th.

My plan was to scrape the five pages in one go, but since the data are not in the same tags on each page, I had to scrape every year from 2011 to 2015 separately. Then I joined the CSVs for each year into one CSV. For how I cleaned the data, see the comments in the IPython notebook.
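Joining the yearly files can be sketched as follows. This is only an illustration using Python's standard csv module: the column names (donor, amount) and the rows are assumptions, and in-memory files stand in for the real yearly CSVs.

```python
import csv
import io


def combine_yearly_csvs(files_by_year):
    """Merge per-year CSV files into one list of rows, tagging each row with its year."""
    combined = []
    for year in sorted(files_by_year):
        for row in csv.DictReader(files_by_year[year]):
            row["year"] = year  # remember which yearly file the row came from
            combined.append(row)
    return combined


# In-memory stand-ins for the yearly files; columns and values are made up.
yearly_files = {
    2011: io.StringIO("donor,amount\nDonor A,100000\n"),
    2012: io.StringIO("donor,amount\nDonor A,150000\nDonor B,50000\n"),
}
all_rows = combine_yearly_csvs(yearly_files)

# Write the combined rows out as a single CSV.
combined_csv = io.StringIO()
writer = csv.DictWriter(combined_csv, fieldnames=["year", "donor", "amount"])
writer.writeheader()
writer.writerows(all_rows)
print(combined_csv.getvalue())
```

Adding the year as its own column while merging keeps the information that would otherwise be lost once the separate files become one.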

To count the number of families and companies that dominate election giving in Norway, I initially wanted to use the same algorithm that we used for the donors dataset, but as there are only 259 rows in my dataset, joining the donors by hand was more efficient.

By doing some reporting, I could join several donors into one donor and count them as one family. For example, Gadus, Gadus SE and Gaudus Holding are all owned by the same family or are subsidiaries of Gadus SE, so I joined them into one donor. The same goes for all donors named Wilhelmsen, who all belong to the same shipping family.

By creating a new column called donor_group, and keeping the old column donor, I can still see which entities I joined.
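A hand-built lookup is one way to sketch this grouping. The column names donor and donor_group follow the text and the Gadus entries come from the example above; the lookup dictionary, the name-based Wilhelmsen rule and the remaining sample rows are assumptions for illustration, not the notebook's actual code.

```python
from collections import defaultdict

# Manual lookup built from reporting: original donor name -> family or owner group.
DONOR_GROUPS = {
    "Gadus": "Gadus",
    "Gadus SE": "Gadus",
    "Gaudus Holding": "Gadus",
}


def donor_group_for(name, groups=DONOR_GROUPS):
    """Return the group a donor belongs to, falling back to the donor's own name."""
    if "wilhelmsen" in name.lower():  # assumed rule: every Wilhelmsen entry is one family
        return "Wilhelmsen"
    return groups.get(name, name)


def add_donor_group(rows):
    """Return copies of the rows with a donor_group column; the donor column is kept."""
    return [dict(row, donor_group=donor_group_for(row["donor"])) for row in rows]


sample_donors = [
    {"donor": "Gadus"},
    {"donor": "Gadus SE"},
    {"donor": "Gaudus Holding"},
    {"donor": "Example Wilhelmsen AS"},  # placeholder name
    {"donor": "Donor X"},                # placeholder name
]
grouped_rows = add_donor_group(sample_donors)

# Because the original names are kept, each group can list what was joined into it.
group_members = defaultdict(set)
for row in grouped_rows:
    group_members[row["donor_group"]].add(row["donor"])

print(len(group_members))              # number of distinct donors/families
print(sorted(group_members["Gadus"]))  # entities joined into the Gadus group
```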

Over the last five years, 182 donors or families are responsible for the donations to the Norwegian Conservative Party. This is an estimate: I still need to investigate whether there are relationships between the remaining donors that are not evident from their names.

The second thing I wanted to find out was which donors are new, which stopped giving and which are the most faithful. The data frame that produces this is the equivalent of a pivot table in Excel.
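The pivot idea can be sketched with plain Python dictionaries: one row per donor group, one column per year. The definitions used here are assumptions, since the text does not spell them out: "faithful" means a donation in every year, "newcomer" means a donation in 2015 only, and "stopped" means earlier donations but none in 2015. The rows and amounts are placeholders.

```python
from collections import defaultdict

YEARS = [2011, 2012, 2013, 2014, 2015]


def pivot_by_year(rows):
    """Build donor_group -> {year: total amount}, like a pivot table of donors by year."""
    table = defaultdict(lambda: defaultdict(int))
    for row in rows:
        table[row["donor_group"]][row["year"]] += row["amount"]
    return table


def classify(table, years=YEARS):
    """Sort donor groups into faithful, stopped and newcomers (assumed definitions)."""
    last_year = years[-1]
    result = {"faithful": [], "stopped": [], "newcomers": []}
    for donor in sorted(table):
        active = {y for y in years if table[donor].get(y, 0) > 0}
        if active == set(years):
            result["faithful"].append(donor)
        elif active == {last_year}:
            result["newcomers"].append(donor)
        elif active and last_year not in active:
            result["stopped"].append(donor)
    return result


# Placeholder rows, not the real dataset.
sample_rows = (
    [{"donor_group": "Donor A", "year": y, "amount": 100000} for y in YEARS]
    + [{"donor_group": "Donor B", "year": y, "amount": 50000} for y in (2011, 2012)]
    + [{"donor_group": "Donor C", "year": 2015, "amount": 25000}]
)
table = pivot_by_year(sample_rows)
status = classify(table)
biggest = max(table, key=lambda d: sum(table[d].values()))

print(status)
print(biggest)
```

Donors who skipped a year but gave again in 2015 fall into none of the three lists under these definitions, so a real analysis may want a fourth category for them.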

** These donors are the most faithful donors: Canica, Sundt, Høeg, Bjørgvin, Parra and Mustad Industrier.

** This donor gave the most: Canica.

** No longer donors: Mustad Industrier, Furholmen Invest

** The 2015-only donors: Vidsjå, Seabulk and Soto Eiendom

Three things I know now, that I did not know before I started this project:

** Cleaning, cleaning, more cleaning and some more cleaning: This did not come as a surprise to me, but I was still surprised by how often I had to go back and clean the data some more. That being said, I used to worry that I would get a set of dirty data and never reach the phase of working with it, because of all the cleaning I couldn't do. But thanks to the very good help of Aram, I now have a little toolkit of different cleaning methods that will hopefully help me work around unicode errors and dirty data.

** I do think that it is wise to make a plan for what data you want and in what format you want them, before you start scraping and cleaning the data. I found it easier to put the data in string format, rather than in a dictionary, and then split and clean the data.

** Excel: The best tool for data this size, but cleaning in both the IPython notebook and Excel caused some trouble.

Finally, I have a few ideas on how to build further on this data:

** Join the data with data from the other political parties, and calculate how large the contribution from these donors is compared to the donations given to the other political parties.

** Find out how many donors donate to multiple parties.

** Join the data with business data and income/tax data, and see whether the donors/entities tend to make donations in years with low or high revenue.

** Find out how the donors are linked (for example, whether they sit on the same corporate boards) and do network analysis on them.