top of page
Search

US Household Data Cleaning Project

  • Jan 15, 2025
  • 2 min read

Updated: Jan 23, 2025

In this project, I walk through the process of cleaning the raw US household data.



BACKGROUND: Received the raw data from a client and needed to clean the data to be used in a Web Application.


PROCESS: Used MySQL to analyze the data, note and appropriately correct data inconsistencies, and normalize the data using processes shown below. First, let's take a look at the data:



We need to check for duplicates. Let's do this by running a count on the id which should be unique identifier:


SELECT id, COUNT(id)
FROM us_project.us_household_income
GROUP BY id
HAVING COUNT(id) > 1;


With this code, we can see that we have duplicates:














We need to remove duplicates which can be done with this code:


DELETE FROM us_household_income
WHERE row_id IN (
	SELECT row_id
	FROM (
		SELECT row_id,
		id,
		ROW_NUMBER() OVER(PARTITION BY id ORDER BY id) row_num
		FROM us_project.us_household_income) duplicates
	WHERE row_num > 1)
;








We can see no results, so our code worked.


Now let's check the statistics table for duplicates.


SELECT id, COUNT(id)
FROM us_household_income_statistics
GROUP BY id
HAVING COUNT(id) > 1;











Good! No duplicates.


Let's check the quality of the state name data.


SELECT DISTINCT State_Name
FROM us_project.us_household_income
GROUP BY 1
;









To fix this, we need to use the following code:

UPDATE us_project.us_household_income
SET State_Name = 'Georgia'
WHERE State_Name = 'georia'
;

This looks better.




















While looking over the data, I noticed a blank here:







Let's fix that with the following code:

UPDATE us_household_income
SET Place = 'Autaugaville'
WHERE County = 'Autauga County'
AND City = 'Vinemont'
;

Which yields the following table results:







Finally, let's sort by Type with this code:

SELECT Type, COUNT(Type)
FROM us_project.us_household_income
GROUP BY Type
;



















I want to combine boroughs and borough because this is a data entry error. I can do this with the following code:

UPDATE us_project.us_household_income
SET Type = 'Borough'
WHERE Type = 'Boroughs'
;















Thank you for following along. Have a great day!

 
 
 

Comments


bottom of page