Monye Okechukwu Noel
8 min read · Jun 17, 2023

Data Cleaning Challenge: Inspecting and Wrangling the FIFA 21 Data Using PostgreSQL

Source: FIFA-21-Lag.jpeg (1920×1280) (windowsreport.com)

BACKGROUND

I was thrilled to take part in a #datacleaningchallenge in the data-tech community that Promise Chinonso and other data enthusiasts organized in an effort to foster an atmosphere where novice, advanced, and professional data analysts could hone their data cleaning skills. The competition gave participants a chance to connect with other students and expand their networks. Every participant was also urged to demonstrate their data cleaning abilities using any tool of their choice, including Excel, Power-Query, SQL, Python, R, and Google Sheets.

The tool used for this challenge was PostgreSQL.

Data cleaning, also referred to as data wrangling, is a crucial step in the data analysis process: before delving into the main analysis, the data has to be prepared and validated. The bulk of the effort goes into identifying and rectifying “dirty” data, meaning data that is inaccurate, incomplete, irrelevant, corrupted, or incorrectly formatted, and then replacing, transforming, or removing it as needed.

PROJECT GOALS

Once the data has been acquired and imported, the cleaning process must achieve three primary objectives:

  • Data Quality: This involves ensuring the accuracy and consistency of the data by addressing issues such as incorrect data types, null entries, missing values, special characters, duplicate entries, and errors in spellings and values.
  • Data Integrity: This involves ensuring that the data is reliable by identifying and correcting any incorrect calculations that may exist across rows and columns.
  • Data Relevance: This involves ensuring that the data is pertinent by removing any irrelevant data from the dataset.

About the dataset

The FIFA 21 data was used for this task. It was obtained from Kaggle and can be found here. The data was scraped from sofifa.com and came in its raw, unprocessed form, containing information about football players and their performances up to 2021. The FIFA 21 dataset contains a total of 18,979 rows and 79 columns. I first imported the data into MSSQL, since that was the easier route for me, then linked the database to PostgreSQL, where I did the cleaning. Here is the raw data in PostgreSQL. To get familiar with the data before cleaning, I relied on the accompanying data glossary and the source of the data, sofifa.com.

DATA EXPLORATION AND CLEANING

Initially, the raw data was downloaded as a compressed, zipped file, which was then extracted into CSV (Comma Separated Values) format for ease of use. After importing the data into MSSQL, I migrated the table to PostgreSQL, where the cleaning process was carried out.

Upon my initial review of the dataset, I quickly realized that it was extremely disorganized and difficult to navigate. However, I was determined to get the job done and began working diligently to clean and refine the data. It was a challenging process, but I successfully transformed the messy data into a clean, polished product that was ready for analysis.

If you are interested in learning more about how I accomplished this, I would be more than happy to share my approach with you.

• Each player is assigned a unique number in the ‘ID’ column, and there are no instances of duplicate rows in the dataset.

To check for duplicate rows in the dataset
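A minimal sketch of that check, assuming the imported table is named fifa21 (the actual table name is not shown in the screenshots):

    -- Count how many rows share each player ID; no rows returned means no duplicates.
    SELECT id, COUNT(*) AS occurrences
    FROM fifa21
    GROUP BY id
    HAVING COUNT(*) > 1;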

• There are a few occurrences of unusual characters and non-English letters in the dataset’s ‘Name’ column. The ‘PlayerURL’ column, on the other hand, contains weblinks to the players’ profiles and images, along with their entire names incorporated within the URLs but devoid of any special characters.

Noticed special characters for players with the above IDs.

To address the non-English special characters present in the ‘name’ column, I used regexp_replace(name, '([^[:ascii:]])', '[\1]', 'g') to identify and highlight them. Then, to correct these characters, I used the ‘concat’ function.
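A sketch of the detection step, using the regexp_replace pattern quoted above and the assumed table name fifa21:

    -- Wrap any non-ASCII character in brackets so the affected names are easy to spot.
    SELECT id, name,
           regexp_replace(name, '([^[:ascii:]])', '[\1]', 'g') AS flagged_name
    FROM fifa21
    WHERE name ~ '[^[:ascii:]]';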

Correct the special characters with the correct letters.

I extracted the correct longname from the playerurl.

longname extraction from the playerurl column.
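Roughly how that extraction can be done, assuming the sofifa URLs follow the pattern http://sofifa.com/player/&lt;id&gt;/&lt;first-last&gt;/&lt;version&gt;/, so that the name is the sixth ‘/’-separated segment (table name fifa21 is again assumed):

    -- Pull the name segment out of the URL, swap hyphens for spaces and capitalise each word.
    UPDATE fifa21
    SET longname = initcap(replace(split_part(playerurl, '/', 6), '-', ' '));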

Since the ‘playerurl’ and ‘photourl’ columns are no longer necessary, I got rid of them using the DROP COLUMN command.

Drop photourl and playerurl.
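For reference, the drop looks something like this (table name assumed):

    -- Remove the URL columns now that longname has been extracted.
    ALTER TABLE fifa21
        DROP COLUMN playerurl,
        DROP COLUMN photourl;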

• ‘Nationality’ and ‘age’ columns are okay

• I identified the presence of white spaces within the ‘club’ column of the dataset.

Whitespaces in the club column.

‘regexp_replace’ function was used in removing the whitespaces.

regexp_replace function was used in removing the whitespaces.
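A sketch of that clean-up, assuming the stray characters are leading or trailing whitespace (including newlines) in the club names:

    -- Strip leading and trailing whitespace from every club value.
    UPDATE fifa21
    SET club = regexp_replace(club, '^\s+|\s+$', '', 'g');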

• After a closer inspection, I found that the ‘contract’ column had inconsistent values and an incorrect data type. Its entries came in three formats: ’30 Jun, 2021 On Loan’, ‘Free’, and ‘2004 ~ 2021’.

Inconsistent values in the contract column.

To address this problem, I implemented a solution by introducing three additional columns: ‘agreement’, ‘contract_start’, and ‘contract_end’. The ‘agreement’ column was populated with the values ‘free’, ‘loan’, or ‘contract’, while the ‘contract_start’ and ‘contract_end’ columns were populated with the respective years.

New column to get the type of contract and the year of the contract.
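A simplified sketch of that step, assuming the three contract formats listed above and the table name fifa21:

    -- Add the three new columns.
    ALTER TABLE fifa21
        ADD COLUMN agreement varchar(10),
        ADD COLUMN contract_start int,
        ADD COLUMN contract_end int;

    -- Classify each row; for '2004 ~ 2021' style values, split out the start and end years.
    UPDATE fifa21
    SET agreement = CASE
            WHEN contract ILIKE '%loan%' THEN 'loan'
            WHEN contract ILIKE 'free'   THEN 'free'
            ELSE 'contract'
        END,
        contract_start = CASE WHEN contract LIKE '%~%' THEN split_part(contract, ' ~ ', 1)::int END,
        contract_end   = CASE WHEN contract LIKE '%~%' THEN split_part(contract, ' ~ ', 2)::int END;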

• I added a new column called ‘loan_start’ which extracted the year from the ‘Joined’ column to represent the start of the loan agreement.

New column to get loan start year and to populate the years.

To update the ‘Loan_Date_End’ column, I extracted only the year from the column to represent the end of the loan agreement.

New column to get loan end year and to populate the years
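Both year extractions can be sketched the same way, assuming the dates are stored as text like ‘1 Jul, 2004’, so the year is simply the last four characters:

    -- New column for the loan start year, taken from the 'joined' date;
    -- the end year is kept in loan_date_end itself.
    ALTER TABLE fifa21 ADD COLUMN loan_start int;

    UPDATE fifa21
    SET loan_start    = RIGHT(joined, 4)::int,
        loan_date_end = RIGHT(loan_date_end, 4)
    WHERE agreement = 'loan';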

I deleted the ‘contract’ and ‘joined’ columns and renamed the ‘loan_date_end’ column to ‘loan_end’.

Removed contract and joined columns, then renamed loan_date_end.
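The corresponding statements (table name assumed):

    ALTER TABLE fifa21 DROP COLUMN contract, DROP COLUMN joined;
    ALTER TABLE fifa21 RENAME COLUMN loan_date_end TO loan_end;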

• The ‘positions’ column was removed because another column, ‘best_position’, already holds each player’s best position. The ‘positions’ column repeated the value in ‘best_position’ along with any extra positions for players who had more than one.

Removed the positions column, as its content is already covered by the best_position column.

• The ‘height’ column had inconsistent units attached to its values. Entries were either in centimetres, e.g. ‘170cm’, or in feet and inches, e.g. 6’2".

Different formats in the height column: values are either in feet and inches or in centimetres.

To tackle this issue, I created a new column ‘height_cm’, then converted the values given in feet and inches to centimetres using the code below.

Created a new column to keep values in centimetres by converting feet and inches to centimetres.
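The conversion can be sketched as below, splitting feet and inches on the apostrophe and using 1 ft = 30.48 cm and 1 in = 2.54 cm (table and column names assumed):

    ALTER TABLE fifa21 ADD COLUMN height_cm numeric;

    -- Copy centimetre values directly; otherwise convert feet and inches (e.g. 6'2").
    UPDATE fifa21
    SET height_cm = CASE
            WHEN height LIKE '%cm' THEN replace(height, 'cm', '')::numeric
            ELSE ROUND(split_part(height, '''', 1)::numeric * 30.48
                     + replace(split_part(height, '''', 2), '"', '')::numeric * 2.54)
        END;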

• The ‘weight’ column had inconsistent units attached to its values. Entries were either in kilogrammes, e.g. ‘79kg’, or in pounds, e.g. ‘172lbs’.

Different formats in the weight column: values are either in kilogrammes or in pounds.

To tackle this issue, I created a new column ‘weight_kg’ and used the code below to convert the values in lbs to kg.

Created a new column to keep values in kilogrammes by converting pounds to kilogrammes.
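Same idea for weight, using 1 lb ≈ 0.4536 kg (names again assumed):

    ALTER TABLE fifa21 ADD COLUMN weight_kg numeric;

    -- Copy kilogramme values directly; otherwise convert pounds to kilogrammes.
    UPDATE fifa21
    SET weight_kg = CASE
            WHEN weight LIKE '%kg' THEN replace(weight, 'kg', '')::numeric
            ELSE ROUND(replace(weight, 'lbs', '')::numeric * 0.453592)
        END;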

• The ‘value’, ‘wage’ and ‘release_clause’ columns had data type issues and values with symbols attached: a ‘€’ prefix for the euro currency and ‘M’ or ‘K’ suffixes for millions and thousands.

value, wage and release_clause columns had prefixes ‘€’ and suffixes ‘M’ or ‘K’.

I created three new columns with ‘€’ at the end of the original column name.

Add new columns with ‘€’ at the end of the initial column.

To convert the values ending with ‘K’ to thousands and those ending with ‘M’ to millions, I applied multiplication factors of 1,000 and 1,000,000 respectively. This required stripping the ‘€’ prefix and the letter suffix from the values using the ‘substring’ function. Changing the data type was also necessary so the values could be multiplied correctly.

Add the numbers, with the prefixes and suffixes removed, to the newly created columns.
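A sketch for the ‘value’ column; ‘wage’ and ‘release_clause’ follow the same pattern. The new column is shown here as "value€", matching the naming described above, and the string positions assume a leading ‘€’ and a one-letter suffix:

    ALTER TABLE fifa21 ADD COLUMN "value€" bigint;

    -- Strip the euro sign, drop the M/K suffix and scale the number accordingly.
    UPDATE fifa21
    SET "value€" = CASE
            WHEN value LIKE '%M' THEN (substring(value, 2, length(value) - 2)::numeric * 1000000)::bigint
            WHEN value LIKE '%K' THEN (substring(value, 2, length(value) - 2)::numeric * 1000)::bigint
            ELSE substring(value, 2)::bigint
        END;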

I checked whether the outcome was the desired result.

A look at updated values in the newly added columns.

Perfect!! I dropped the value, wage and release_clause columns as they won’t be needed again.

• I converted the data types of the ‘ova’, ‘pot’, and ‘bov’ columns to decimals to represent the values as percentages, which allowed me to divide the values by 100. According to the Data Dictionary, OVA is the player’s overall rating, POT is the player’s potential, and BOV represents the best overall rating.

Changed the values in ova, pot, bov columns to percentage by dividing by 100.
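One way to do both the type change and the division in a single statement (a sketch, table name assumed):

    -- Recast the rating columns as decimals and scale them by 100.
    ALTER TABLE fifa21
        ALTER COLUMN ova TYPE numeric(5,2) USING ova::numeric / 100,
        ALTER COLUMN pot TYPE numeric(5,2) USING pot::numeric / 100,
        ALTER COLUMN bov TYPE numeric(5,2) USING bov::numeric / 100;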

• The ‘w_f’, ‘sm’ and ‘ir’ columns had a ‘★’ appended to their values.

To fix this, I added three columns: ‘skillmoves_rating’, ‘weakfoot_rating’ and ‘international_rating’. I then removed the ‘★’ from the original columns and inserted the cleaned values into the newly created columns.

Added new columns to keep the values after the ‘★’ have been removed.
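A sketch of that clean-up, assuming the original columns hold text such as ‘4 ★’:

    ALTER TABLE fifa21
        ADD COLUMN weakfoot_rating int,
        ADD COLUMN skillmoves_rating int,
        ADD COLUMN international_rating int;

    -- Drop the star, trim any leftover space and cast the remaining digit to an integer.
    UPDATE fifa21
    SET weakfoot_rating      = trim(replace(w_f, '★', ''))::int,
        skillmoves_rating    = trim(replace(sm, '★', ''))::int,
        international_rating = trim(replace(ir, '★', ''))::int;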

• The ‘hits’ column contained null values, plain numbers, and values ending with ‘K’.

Updated hits column by multiplying values with the suffix ‘K’ by 1000.

I converted the datatype in the ‘hits’ column to smallint and also multiplied the values ending with ‘K’ by 1000.
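Sketched below, assuming ‘hits’ starts out as text; note that smallint only goes up to 32,767, so a plain int would be safer if any expanded value turns out to be larger.

    -- Expand values like '1.6K' to 1600 while the column is still text.
    UPDATE fifa21
    SET hits = (replace(hits, 'K', '')::numeric * 1000)::int::text
    WHERE hits LIKE '%K';

    -- Then change the column type, turning empty strings into NULLs before the cast.
    ALTER TABLE fifa21 ALTER COLUMN hits TYPE smallint USING NULLIF(hits, '')::smallint;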

• The following columns had no issues, but I decided to rename them for better understanding: ‘ova’, ‘pot’, ‘bov’, ‘a_w’, ‘d_w’, ‘pac’, ‘sho’, ‘pas’, ‘def’ and ‘phy’.

Renamed the columns above.

• I removed the ‘w_f’, ‘sm’, ‘ir’, ‘height’ and ‘weight’ columns, along with the ‘trial786’ column generated by Postgres, as they are not required for the analysis.

I removed the columns above as they won’t be needed.

Conclusion

Although the FIFA 21 data presented some challenges during the cleaning process, the dataset was eventually transformed into a usable format for analysis despite the initial disorganization.

The cleaned dataset can be found Here.

The full data cleaning process is documented on my GitHub Repository

You can reach me on Twitter and on LinkedIn.

Thanks for your time!