Detecting Election Result Irregularities in Kaduna (Nigeria) Using Geospatial Analysis

Gabriel Simon Tagbor
14 min read · Jul 12, 2024


The Election Result Anomaly Inspector, built on Google Earth Engine with Python and JavaScript using the results of the analysis

During my internship at HNG Tech, I performed geoanalytics on election results recorded in Kaduna, a state in Nigeria. The goal was to help the data analytics team uncover potential influences on the 2023 election results in Nigeria. I volunteered to analyse the dataset of election results from Kaduna. I thoroughly enjoyed every second of this project, and I want to share my analytics process and what I discovered.

In this blog I will discuss the steps I took to prepare the data for analysis and to build an interactive map that helps stakeholders investigate the anomalies detected by the statistical analysis.

Executive Summary

Based on the results of my analysis, election officials and relevant stakeholders should look into some severe anomalies detected in the 2023 election results from Kaduna, especially the vote entries for NNPP at the polling units with the following codes:

PU_Code: 18-15-05-032

PU_Code: 18-21-05-001

PU_Code: 18-21-02-009

It is also interesting to note that all three polling units recorded exactly the same number of votes for NNPP. To assist the team with investigations, I created the Election Anomaly Inspector application using the results of the analysis.

Prerequisites

To follow along you will need:

  1. A basic understanding of Python. See this resource
  2. Ability to set up a Python environment for data analysis. See this resource
  3. A basic understanding of Google Earth Engine. See this resource
  4. Familiarity with libraries like geopy, numpy, pandas, and Scikit-learn.

Task Description

Pinpoint polling units where the voting results significantly deviate from those of their neighbours, indicating potential irregularities or influences.

I broke the task down into these subtasks:

  • Prepare dataset for analysis
  • Identify Neighbouring Polling Units
  • Calculate Outlier Scores for each party
  • Sort and Report on the Data Using Visualisation.

Preparing Dataset

For the election outlier detection task, I focused on election results from Kaduna, a state in Nigeria. The given dataset contained information about the polling units, the votes each party received, and the election results. The dataset is in CSV format.

I checked the data for null values and spot-checked whether it contained longitude and latitude coordinate columns.

import pandas as pd

# read csv file
df = pd.read_csv('KADUNA_crosschecked.csv')
# inspect columns and check for null values
df.isnull().sum()

I got the output below, indicating no null values in any of the columns.

I also noticed there are no coordinates to work with, so we will create them in the coming section.
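You can confirm this with a quick spot-check of the column names (a minimal sketch; the candidate names below are just common conventions, not columns from the dataset):

# List the columns and look for anything resembling coordinates
print(df.columns.tolist())
coord_like = [c for c in df.columns
              if c.lower() in ('latitude', 'longitude', 'lat', 'lon', 'lng')]
print(coord_like)  # an empty list confirms there are no coordinate columns yet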

For our analysis these columns will not be relevant: Results_File, Transcription_Count, Result_Sheet_Stamped, Result_Sheet_Corrected, Result_Sheet_Unclear, Result_Sheet_Invalid, Result_Sheet_Unsigned, Results_Found, Registered_Voters and Accredited_Voters. We will need to drop them.

columns_to_drop = ['Accredited_Voters', 'Registered_Voters',
                   'Results_Found', 'Transcription_Count',
                   'Result_Sheet_Stamped', 'Result_Sheet_Corrected',
                   'Result_Sheet_Invalid', 'Result_Sheet_Unclear',
                   'Result_Sheet_Unsigned', 'Results_File']

# drop the irrelevant columns in place
df.drop(columns=columns_to_drop, inplace=True)

Extracting Longitude and Latitude Values

For the geocoding procedure, I chose the geopy package. The geopy library makes it easy for developers to locate the coordinates of addresses, cities, countries, and landmarks across the globe using third-party geocoders like Google, Bing, HERE, OpenStreetMap's Nominatim, and Esri's ArcGIS.

With the help of the Google Geocoding API, I geocoded the dataset using a Python script I wrote.

Step One: Retrieve Precise Location of Polling Units

import pandas as pd

def create_location(row: pd.Series) -> str:
    # Extract LGA, Ward and polling unit name from the row
    LGA = row['LGA']
    Ward = row['Ward']
    PU_Name = row['PU-Name']

    # Combine them into a single, more precise location string
    location = f"{LGA}, {Ward}, {PU_Name}, Kaduna, Nigeria"
    return location

# Testing the create_location function
df_copy = df.copy()
df_copy['Location'] = df_copy.apply(create_location, axis=1)
df_copy['Location'][0]

Testing the function should return the output below:

'BIRNIN GWARI, MAGAJIN GARI I, PRY. SCH. SHITU, Kaduna, Nigeria'

From the output string above, we can see that the script correctly combines the LGA, Ward, and PU-Name columns to create a more precise location string, which is passed to a create_geocode() function we will write shortly.

Step Two: Get Geocode Based on Precise Location

I wrote another function that uses the location string for each polling unit to retrieve its longitude and latitude.

from geopy.geocoders import GoogleV3
from geopy.extra.rate_limiter import RateLimiter
import os
from dotenv import load_dotenv  # for loading the .env file that holds my API key

load_dotenv()
api_key = os.getenv('GMAP')

# create geocoder instance and set rate limits
geolocator = GoogleV3(api_key=api_key)
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

# create geocode function
def create_geocode(location: str):
    """Geocodes a given location string.

    Args:
        location (str): the location to geocode
    Returns:
        The geocoded Location object, or None if geocoding fails
    """
    try:
        return geocode(location)
    except Exception:
        return None

The code creates a geocode using the Google Maps API. To access the Google Maps API you need to get an API credential.

After getting the API credential, create a .env file and add the variable GMAP=<your_api_credential>. We then use os.getenv('GMAP') to load the API key from the .env file, instantiate the GoogleV3 geocoder with it, and add a rate limiter to ensure that geocoding requests do not exceed one request per second.

Finally, we defined a create_geocode() function to geocode a given location string using the GoogleV3 geocoder. If the geocoding operation fails, the function returns None.

In the next steps, we will use the create_geocode() function to update our dataset with the geo-location data.

Putting it All Together

I used the create_location() function to create a Location column in our dataset, then used the create_geocode() function to retrieve the geocodes.

After getting the geocode, I extracted the longitude and latitude columns.

# create Location Column
df['Location'] = df.apply(create_location, axis=1)

# create Geocode
df['geocode'] = df['Location'].apply(create_geocode)

# extract longitude and latitude from the geocode
df['latitude'] = df['geocode'].apply(lambda x: x.latitude if x else None)
df['longitude'] = df['geocode'].apply(lambda x: x.longitude if x else None)

# Save output in a file
df.to_csv('KADUNA_geocoded_data.csv', index=False)

Identifying Neighbouring Polling Units

To identify neighbouring polling units, let’s use a simple heuristic:

Polling units are considered neighbours if the distance between them is at most 1 kilometre

To apply this rule, we need the actual distances between all pairs of polling units.

We will then compare those distances against the 1-kilometre threshold to identify the neighbouring polling units.
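As a quick illustration of the rule, here is a sketch with two hypothetical coordinate pairs (made-up points, not real polling units from the dataset):

from geopy.distance import geodesic

# Two hypothetical points in Kaduna roughly 700 m apart
unit_a = (10.5200, 7.4400)
unit_b = (10.5260, 7.4420)

distance_km = geodesic(unit_a, unit_b).km
print(f"{distance_km:.2f} km")   # roughly 0.7 km
print(distance_km <= 1)          # True, so the two units count as neighbours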

I followed these steps:

1. Use the Haversine formula to calculate the distance between polling units.

2. Check for neighbours based on the rule of thumb, using an identify_neighbor() function.

1. Find the Haversine Distance Between Polling Units in Kaduna

from geopy.distance import geodesic

# define function to calculate the distance between two (lat, lon) pairs
def haversine_distance(coord1, coord2):
    return geodesic(coord1, coord2).km

The code above defines a function for calculating the distance between two points using the geodesic() function provided by the geopy library. It accepts two coordinate pairs as arguments and returns the distance between them in kilometres. Note that despite the function's name, geodesic() computes distance on an ellipsoidal model of the Earth, which is slightly more accurate than the classic Haversine great-circle formula.
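For reference, here is what the classic Haversine formula the function name alludes to looks like when implemented by hand. This is a sketch for comparison only; the analysis itself relies on geopy's geodesic():

import math

def haversine_km(coord1, coord2):
    """Great-circle distance in kilometres via the Haversine formula."""
    lat1, lon1 = map(math.radians, coord1)
    lat2, lon2 = map(math.radians, coord2)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))  # 6371 km ≈ mean Earth radius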

In the next step, we will define an identify_neighbor() function to find neighbouring polling units.

2. Find Neighboring Polling Units

from scipy.spatial import cKDTree

def identify_neighbor(df, radius_km=1):
    # Extract coordinates
    coords = df[['latitude', 'longitude']].to_numpy()

    # Create KD-Tree for fast spatial lookups
    tree = cKDTree(coords)

    # Roughly convert the search radius from kilometres to degrees
    # (about 111 km per degree), with a small buffer so no candidate is missed
    radius_deg = radius_km / 111.0 * 1.1

    neighbors = []
    for idx, coord in enumerate(coords):
        # Find initial candidates within the approximate bounding radius
        indices = tree.query_ball_point(coord, radius_deg)
        neighbor_indices = [i for i in indices if i != idx]

        # Filter candidates using the precise distance function
        true_neighbors = [
            i for i in neighbor_indices
            if haversine_distance(coord, coords[i]) <= radius_km
        ]
        neighbors.append(true_neighbors)

    # Add neighbors to DataFrame
    df['neighbors'] = neighbors
    return df

The identify_neighbor() function accepts a dataframe and a radius in kilometres. It extracts the coordinate information, uses a KD-tree to find initial candidate neighbours, and then filters the candidates with the haversine_distance() function.

Putting it all together

I saved the output of the function to a CSV file for further analysis.

import pandas as pd

# read geocoded data
geocoded_df = pd.read_csv('KADUNA_geocoded_data.csv')

# identify neighbours
neighbors_df = identify_neighbor(geocoded_df, radius_km=1)

# save data to a csv file
neighbors_df.to_csv('KADUNA_geocoded_with_neighbors.csv')

At this point, I had enough data points to proceed with the next key part of the task: finding outlier scores.

Calculate Outlier Scores for each party

For each polling unit (PU_Name), I will compare the votes for each party with those of its neighbouring units.

There are four political parties to consider for each polling unit:

  1. APC
  2. LP
  3. PDP
  4. NNPP

I calculated an outlier score for each party based on the deviation of votes from neighbouring units.

I recorded the outlier scores along with the respective parties and neighbouring units.

I followed these steps:

1. Choosing The Appropriate Statistical Method

The choice of the appropriate statistical method for calculating outlier scores depends on several factors including the distribution of the data, the type of data, and the context of the analysis.

To help me decide on the right statistical approach to calculating outlier scores, I needed to understand the type of distribution the votes have per political party.

I used Q-Q plots to examine the distribution and performed the Shapiro-Wilk test to determine normality in the distribution of votes per political party.

I put all of this in a check_normality() function that performs the normality checks on each distribution.

import matplotlib.pyplot as plt
from scipy.stats import shapiro
import statsmodels.api as sm

def check_normality(dataframe, parties):
    num_parties = len(parties)
    num_cols = 2
    num_rows = (num_parties + 1) // num_cols

    fig, axs = plt.subplots(num_rows, num_cols, figsize=(12, 6 * num_rows))
    axs = axs.flatten()

    for i, party in enumerate(parties):
        votes = dataframe[party].dropna()

        print(f"Checking distribution for {party} votes")

        # Q-Q Plot against the normal distribution
        sm.qqplot(votes, line='45', ax=axs[i])
        axs[i].set_title(f'Q-Q Plot of {party} Votes')

        # Shapiro-Wilk Test
        shapiro_test = shapiro(votes)
        print(f'Shapiro-Wilk Test p-value for {party}:', shapiro_test.pvalue)
        if shapiro_test.pvalue > 0.05:
            print(f"The distribution of {party} votes is likely normal.")
        else:
            print(f"The distribution of {party} votes is likely not normal.")
        print("\n")

    # Hide any unused subplots
    for j in range(i + 1, num_rows * num_cols):
        fig.delaxes(axs[j])

    plt.tight_layout()
    plt.show()

Performing the Check

# Use the DataFrame with neighbour information
normal_dist_df = neighbors_df

# List of political parties
parties = ['APC', 'LP', 'PDP', 'NNPP']

# Check normality for each party
check_normality(normal_dist_df, parties)

Results

Checking distribution for APC votes
Shapiro-Wilk Test p-value for APC: 2.6345688074106287e-59
The distribution of APC votes is likely not normal.


Checking distribution for LP votes
Shapiro-Wilk Test p-value for LP: 3.4330261622692515e-78
The distribution of LP votes is likely not normal.


Checking distribution for PDP votes
Shapiro-Wilk Test p-value for PDP: 2.1798802282225605e-48
The distribution of PDP votes is likely not normal.


Checking distribution for NNPP votes
Shapiro-Wilk Test p-value for NNPP: 1.0787169297523082e-78
The distribution of NNPP votes is likely not normal.

Q-Q plots

As can be seen, the distribution of votes for all four parties does not follow the normal distribution, so statistical methods like the basic Z-score approach will not be particularly effective at finding outliers: that approach assumes normally distributed data, contrary to what we discovered.
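To see why, here is a small sketch on synthetic right-skewed data (the numbers are illustrative, not from the dataset):

import numpy as np

rng = np.random.default_rng(42)
# Log-normal values mimic a heavily right-skewed vote distribution
votes = rng.lognormal(mean=3, sigma=1, size=1000)

# The long tail inflates both the mean and the standard deviation,
# so the classic |z| > 3 rule flags almost nothing but the extreme tail
z_scores = (votes - votes.mean()) / votes.std()
print(f"Flagged by z-score: {(np.abs(z_scores) > 3).sum()} of {len(votes)}")

Because the z-score is anchored to the global mean and standard deviation, it also cannot flag a polling unit whose votes are unremarkable state-wide but very different from its immediate neighbours, which is exactly the pattern we care about here.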

I settled on the Local Outlier Factor (LOF) method to calculate the outlier scores of the political parties in each polling unit. LOF, as provided by scikit-learn, is an unsupervised machine-learning anomaly detection technique that identifies outliers by comparing the local density around each data point with the density around its neighbours.

In analysing voting patterns, LOF is useful because it can handle multivariate data and capture complex interactions between votes for different parties. By focusing on local variations, LOF can uncover easy-to-miss irregularities in specific polling units, providing a more nuanced understanding of outlier voting behaviour. I am confident that this capability makes LOF a suitable and robust choice for this outlier detection task.
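As a tiny illustration of the idea (toy numbers, not election data), LOF flags a point that sits far from its local cluster even when its absolute value is unremarkable:

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Two tight clusters plus one point stranded between them
X = np.array([[1.0], [1.1], [1.2], [5.0], [9.8], [9.9], [10.0]])

clf = LocalOutlierFactor(n_neighbors=2)
labels = clf.fit_predict(X)          # -1 marks outliers
print(labels)                        # the lone 5.0 is flagged
print(clf.negative_outlier_factor_)  # its score falls far below -1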

Calculating LOF Scores

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import LocalOutlierFactor
import numpy as np

# Load election data
election_data = pd.read_csv('KADUNA_with_neighbors.csv')

# Define party columns
party_columns = ['APC', 'LP', 'PDP', 'NNPP']

def calculate_lof_for_each_party(data, party_columns):
    scaler = StandardScaler()
    for party in party_columns:
        # Apply a logarithmic transformation to reduce skewness
        data[f'{party}_log'] = np.log1p(data[party])

        # Standardize the log-transformed data
        normalized_column = f'{party}_norm'
        lof_column = f'{party}_lof_score'
        data[normalized_column] = scaler.fit_transform(data[[f'{party}_log']])

        # Adjust LOF parameters based on experimentation
        clf = LocalOutlierFactor(n_neighbors=15, contamination=0.05)

        # Fit the model and predict outliers (-1 = outlier, 1 = inlier)
        data[f'{party}_outlier_flag'] = clf.fit_predict(data[[normalized_column]])

        # Assign (negated) LOF scores
        data[lof_column] = clf.negative_outlier_factor_

    return data

# Calculate LOF scores for each party
election_data = calculate_lof_for_each_party(election_data, party_columns)

Displaying results

from IPython.display import display

columns_to_display = election_data[['PU-Name', 'APC',
                                    'LP', 'PDP', 'NNPP',
                                    'APC_lof_score',
                                    'LP_lof_score', 'PDP_lof_score',
                                    'NNPP_lof_score']]
display(columns_to_display)
Jupyter notebook output showing outlier scores

Interpreting the Scores

The <party_name>_lof_score columns hold real-valued floats that indicate how much the votes for a party at a polling unit deviate from the local neighbourhood density. On the classic (positive) LOF scale:

- LOF Score ≈ 1 or below: the votes observed at a polling unit lie in a relatively dense region and are similar to their neighbours.

- LOF Score > 1: the votes lie in a less dense region and may be considered an outlier compared to their neighbours.

Note, however, that scikit-learn exposes the negated LOF through negative_outlier_factor_, so the values stored in the <party_name>_lof_score columns are negative: inliers sit close to -1, and the further a score falls below -1, the stronger the outlier.
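If you prefer to reason on the positive scale, a small sketch (assuming the election_data dataframe from the scoring step) flips the sign:

# scikit-learn stores the negated LOF; flip the sign to recover the
# classic positive scale (≈ 1 = inlier, much greater than 1 = outlier)
election_data['NNPP_lof_positive'] = -election_data['NNPP_lof_score']
print(election_data['NNPP_lof_positive'].describe())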

In the next steps, I will sort and report on the outlier scores observed in the election results to determine the polling units with the most extreme outliers.

Sorting and Reporting on Findings

The goal of this section is to sort the dataset by the outlier scores for each party to identify the most significant outliers. I will highlight the top 3 outliers and their closest polling units.

Sorting Party LOF Scores

Let’s define a sort_party_lof() function to sort the data for each party by its Local Outlier Factor (LOF) scores.

import pandas as pd

def sort_party_lof(dataframe, party_name, lof_score_column):
    # Select relevant columns
    columns_to_select = ['State',
                         'LGA',
                         party_name,
                         'latitude',
                         'longitude',
                         'neighbors',
                         'geocode',
                         lof_score_column]

    existing_columns = [col for col in columns_to_select
                        if col in dataframe.columns]
    # Select the relevant columns from the DataFrame
    selected_data = dataframe[existing_columns]
    # Sort by the LOF score column in ascending order
    # (most negative scores, i.e. strongest outliers, first)
    sorted_data = selected_data.sort_values(by=lof_score_column,
                                            ascending=True)
    return sorted_data

party_names = ['APC', 'PDP', 'NNPP', 'LP']

scores_df = election_data  # the LOF-scored DataFrame from the previous step
sorted_scores = []

for party_name in party_names:
    sorted_scores.append(sort_party_lof(scores_df,
                                        party_name,
                                        f"{party_name}_lof_score"))

The sort_party_lof() function accepts a dataframe, the name of a party, and the name of the LOF score column to retrieve. It selects only the listed columns and sorts the resulting dataframe by the LOF score column in ascending order, so the strongest outliers come first.

I called sort_party_lof() in a for loop to retrieve the sorted LOF scores for all four political parties. All the sorted dataframes are stored in the sorted_scores list for later analysis.

Top Outlier Votes Recorded for APC Party.

We can access the sorted LOF score dataframe for APC by checking the first element of the sorted_scores list.

sorted_apc = sorted_scores[0]
sorted_apc.head(3)

The LOF score of -6.578948e+06 for APC represents an extremely large deviation from the local distribution of votes. The same observation was made for the top 3 LOF scores of LP, PDP and NNPP.

You can repeat the process above to see the top 3 scores for the remaining parties, or loop over all of them at once, as shown below.
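A small sketch over the party_names and sorted_scores lists defined earlier prints the top 3 for every party in one pass:

# Print the three strongest outliers for every party
for party, sorted_df in zip(party_names, sorted_scores):
    print(f"\nTop 3 {party} outliers:")
    print(sorted_df.head(3)[[party, f'{party}_lof_score']])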

Reasons that may account for these extremely high outliers include the following:

  1. A problem with my approach to calculating the LOF score. To rule out this possibility, I cross-checked my steps, applied normalisation techniques, and experimented with different parameters for the LocalOutlierFactor(n_neighbors=n, contamination=n) class. Additionally, looking at the LOF scores in general reveals that most polling units have unremarkable scores, which suggests the pipeline behaves sensibly elsewhere.

  2. A problem with the data entry. The anomalies observed most likely point to data-specific issues that warrant further contextual investigation. For instance, in all the top 3 outlier scores, the number of votes is identical across 3 different polling units in different locations, resulting in identical LOF scores.
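A quick way to surface such repeated counts is to group by the vote column and keep values shared by more than one polling unit. The sketch below assumes the election_data dataframe from earlier; the threshold of 50 is an arbitrary choice:

# Find NNPP vote counts that appear at more than one polling unit
# (small counts like 0 or 1 repeat naturally, so filter to larger values)
dupes = (election_data.groupby('NNPP')['PU-Name']
         .apply(list)
         .loc[lambda s: s.str.len() > 1])
print(dupes[dupes.index > 50])  # repeated counts above the threshold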

Visualisation

Preparing the Dataset for Visualisation

Currently, our sorted_scores variable holds the sorted dataframes for each party. We need to export the top three outlier scores for each party as CSV files for later analysis.

Export CSV Sample Files

# export file for APC
sorted_apc = sorted_scores[0]
sorted_apc.head(3).to_csv('outliers_APC.csv')

# export file for PDP
sorted_pdp = sorted_scores[1]
sorted_pdp.head(3).to_csv('outliers_PDP.csv')

# export file for NNPP
sorted_nnpp = sorted_scores[2]
sorted_nnpp.head(3).to_csv('outliers_NNPP.csv')

# export file for LP
sorted_lp = sorted_scores[3]
sorted_lp.head(3).to_csv('outliers_LP.csv')

The neighbors column contains index values we can use to find more information on each neighboring polling unit. Let’s retrieve the LOF score for each neighbor in the list of neighbors for each sample file.

import pandas as pd
from ast import literal_eval

def retrieve_neighbor_insights(source, lof_column_name, sample=None):
    # Step 1: Load the source DataFrame
    source_df = pd.read_csv(source)

    # Step 2: If a sample file is provided, load it; otherwise reuse the source
    if sample:
        sample_df = pd.read_csv(sample)
    else:
        sample_df = source_df.copy()

    # Parse the stringified 'neighbors' column back into Python lists
    # (literal_eval is a safer alternative to eval)
    source_df['neighbors'] = source_df['neighbors'].apply(literal_eval)
    sample_df['neighbors'] = sample_df['neighbors'].apply(literal_eval)

    sample_df['neighbor_score'] = [[] for _ in range(len(sample_df))]

    # Iterate over each row in the sample DataFrame
    for index, row in sample_df.iterrows():
        for neighbor_index in row['neighbors']:
            if neighbor_index in source_df.index:
                # Retrieve the LOF score for that neighbor
                neighbor_lof_score = source_df.loc[neighbor_index,
                                                   lof_column_name]
                sample_df.at[index, 'neighbor_score'].append(
                    float(neighbor_lof_score))

    # Return the updated sample DataFrame
    return sample_df

Let’s run the retrieve_neighbor_insights() function in a for loop to find the LOF scores for neighbouring polling units.

political_parties = ['APC', 'PDP', 'NNPP', 'LP']

for party in political_parties:
    # Retrieve neighbor insights for each party
    outliers_insight = retrieve_neighbor_insights('KADUNA_outlier_scores.csv',
                                                  f'{party}_lof_score',
                                                  f'outliers_{party}.csv')

    # If the DataFrame has an unnamed index column, rename it;
    # otherwise reset the index to create a new 'index' column
    if 'Unnamed: 0' in outliers_insight.columns:
        outliers_insight.rename(columns={'Unnamed: 0': 'index'}, inplace=True)
    else:
        outliers_insight.reset_index(inplace=True)

    # Save the DataFrame to a CSV without adding an index column
    outliers_insight.to_csv(f'outliers_insight_{party}.csv', index=False)

The code above creates the following CSV files, which will be used for our visualisation:

  • outliers_insight_APC.csv
  • outliers_insight_PDP.csv
  • outliers_insight_NNPP.csv
  • outliers_insight_LP.csv

Ranking Severity of Election Vote Anomalies by Party

First, let’s observe the ranking of LOF scores for all parties using the code below.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

filenames = ['outliers_insight_PDP.csv',
             'outliers_insight_APC.csv',
             'outliers_insight_LP.csv',
             'outliers_insight_NNPP.csv']
party_names = ['PDP', 'APC', 'LP', 'NNPP']
scores = {}

# Collect the top 3 LOF scores for each party
for file_name, party_name in zip(filenames, party_names):
    df = pd.read_csv(file_name)
    score_column = f"{party_name}_lof_score"
    scores[party_name] = df[score_column].head(3).tolist()

def visualize_scores(scores):
    parties = list(scores.keys())
    data = np.array([scores[party] for party in parties])

    fig, ax = plt.subplots()
    cax = ax.matshow(data, cmap="viridis")

    # Label the axes with rank positions and party names
    ax.set_xticks(range(3))
    ax.set_xticklabels(['Top 1st', 'Top 2nd', 'Top 3rd'])
    ax.set_yticks(range(len(parties)))
    ax.set_yticklabels(parties)

    # Annotate each cell with its LOF score
    for (i, j), val in np.ndenumerate(data):
        ax.text(j, i, f"{val:.2f}",
                ha='center',
                va='center',
                color="white")

    plt.colorbar(cax)
    plt.show()

visualize_scores(scores)
Ranking of LOF scores for each party.

Visualising Election Irregularities on an Interactive Map

For this visualisation, let's build an interactive map application that can help election officials and stakeholders investigate the election anomalies in Kaduna. Stakeholders should be able to see the polling units with the highest anomalies and their neighbouring polling units.

Requirements

1. Ability to see the top 3 outliers for each party on the map

2. Ability to spot the neighboring polling units and observe the outlier scores of polling units in the vicinity.

3. Ability to see at a high level the spread of anomalies.

I wrote a JavaScript script that runs in the Google Earth Engine Code Editor to visualise the locations of the polling units with outlier scores.
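The script itself lives in the Earth Engine Code Editor, so it isn't reproduced here. As an illustration of the same idea in Python, here is a minimal sketch using folium rather than Earth Engine; the file and column names follow the CSVs created earlier, while the map centre and colours are assumptions:

import folium
import pandas as pd

# Start the map roughly centred on Kaduna State (approximate coordinates)
anomaly_map = folium.Map(location=[10.5, 7.4], zoom_start=8)

# One colour per party (arbitrary choices for illustration)
party_colours = {'APC': 'red', 'PDP': 'blue', 'NNPP': 'green', 'LP': 'purple'}

for party, colour in party_colours.items():
    outliers = pd.read_csv(f'outliers_insight_{party}.csv')
    for _, row in outliers.iterrows():
        folium.CircleMarker(
            location=[row['latitude'], row['longitude']],
            radius=8,
            color=colour,
            fill=True,
            popup=f"{party}: {row[party]} votes (LOF {row[f'{party}_lof_score']:.2f})",
        ).add_to(anomaly_map)

# Save to an HTML file that opens in any browser
anomaly_map.save('anomaly_inspector.html')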

Conclusion and Recommendations

During the analysis, I discovered these key anomalies in the votes recorded in some polling units across all four political parties.

  • NNPP had the most severe cases of deviation in voting distributions
  • In all the top three outlier scores for each party, votes across different polling units were the same.
  • Further field investigations need to be done to understand these anomalies and to confirm or rule out possible influences on the election results in Kaduna. A good place to start is the Election Anomaly Inspector



Written by Gabriel Simon Tagbor

MLOps engineer passionate about designing and deploying scalable ML systems. Sharing insights on software architecture, data pipelines, and model deployment. 🚀
