Grocery Store Sales
As part of my SQL certification journey, I tackled an intriguing challenge with FoodYum, a U.S.-based grocery store chain. FoodYum offers a wide range of products—from fresh produce and meat to dairy, baked goods, snacks, and other household staples. With the rising costs of food, the challenge for FoodYum is to strategically manage their inventory. They aim to stock products in every category across various price ranges to accommodate a diverse customer base.
To learn more about me and my work, expand the ABOUT ME section.
ABOUT ME
Learn more about me on my ABOUT page. Below, you can find links to explore more of my projects categorized by topics, access my resume, and contact me.| PROJECTS | RESUME | CONTACT |
Data
The data is available in the table products
.
The dataset contains records of customers for their last full year of the loyalty program.
Column Name | Criteria |
---|---|
product_id | Nominal. The unique identifier of the product. Missing values are not possible due to the database structure. |
product_type | Nominal. The product category type of the product, one of 5 values (Produce, Meat, Dairy, Bakery, Snacks).Missing values should be replaced with “Unknown”. |
brand | Nominal. The brand of the product. One of 7 possible values. Missing values should be replaced with “Unknown”. |
weight | Continuous. The weight of the product in grams. This can be any positive value, rounded to 2 decimal places. Missing values should be replaced with the overall median weight. |
price | Continuous. The price the product is sold at, in US dollars. This can be any positive value, rounded to 2 decimal places. Missing values should be replaced with the overall median price. |
average_units_sold | Discrete. The average number of units sold each month. This can be any positive integer value. Missing values should be replaced with 0. |
year_added | Nominal. The year the product was first added to FoodYum stock. Missing values should be replaced with 2022. |
stock_location | Nominal. The location that stock originates. This can be one of four warehouse locations, A, B, C or D. Missing values should be replaced with “Unknown”. |
SQL Query for Find Missing Year Added Values
Situation:
In 2022, a bug was discovered in the product system where some products added during the year did not have the year_added
value set. Since the year a product was added could potentially affect its price, it’s crucial to have this information available.
Task:
We need to write a query that determines how many products have a missing year_added
value. The query should output a single column named missing_year
containing the count of rows where the year_added
is missing (i.e., set to NULL
).
Action:
We can achieve this by using the SQL COUNT
function and filtering the rows where the year_added
column is NULL
. The COUNT
function will count the total number of such rows, and the result will be labeled as missing_year
.
1
2
3
SELECT COUNT(*) AS missing_year
FROM products
WHERE year_added IS NULL;
Result:
The output successfully provided the number of products missing the year_added value.
missing_year | |
---|---|
0 | 170 |
SQL Query for Data Cleaning of the products
Table
Situation:
I was tasked with ensuring the data in the products
table is clean and accurate before proceeding with further analysis. The data included missing values in various columns such as product_type
, brand
, weight
, price
, average_units_sold
, year_added
, and stock_location
, which needed to be replaced based on specific criteria. Additionally, I needed to calculate median values for the weight
and price
columns to use as replacements for missing data.
Task:
My objective was to write an SQL query that would clean the data by handling missing values in accordance with the provided criteria:
- Nominal data (e.g.,
product_type
,brand
,stock_location
): replace missing values with “Unknown.” - Continuous data (e.g.,
weight
,price
): replace missing values with the overall median. - Discrete data (e.g.,
average_units_sold
): replace missing values with 0. - Year data (
year_added
): replace missing values with the year 2022.
Action:
To clean the data, I wrote an SQL query that:
- Used COALESCE to provide fallback values for columns with missing values.
- Replaced missing nominal values with ‘Unknown’ and replaced missing continuous values (such as
weight
andprice
) with the median values calculated using the PERCENTILE_CONT function. - For columns like
average_units_sold
, I used COALESCE to replace missing values with 0. - Implemented REGEXP_REPLACE and CAST to clean and process the
weight
column, ensuring it only contains numeric values. - Applied the ROUND function to ensure values are rounded to two decimal places where necessary.
Here’s the query I used:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
WITH median_values AS (
SELECT
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY CAST(REGEXP_REPLACE(weight, '[^\d.]', '', 'g') AS NUMERIC)) AS median_weight,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY price) AS median_price
FROM
products
)
SELECT
product_id,
COALESCE(TRIM(product_type), 'Unknown') AS product_type,
COALESCE(NULLIF(TRIM(REPLACE(brand, '-', '')), ''), 'Unknown') AS brand,
COALESCE(
ROUND(CAST(NULLIF(REGEXP_REPLACE(weight, '[^\d.]', '', 'g'), '') AS NUMERIC), 2),
ROUND((SELECT median_weight FROM median_values)::NUMERIC, 2)
) AS weight,
COALESCE(
TO_CHAR(ROUND(price::NUMERIC, 2), '9999999999.99'),
TO_CHAR(ROUND((SELECT median_price FROM median_values)::NUMERIC, 2), '9999999999.99')
) AS price,
COALESCE(average_units_sold, 0) AS average_units_sold,
COALESCE(year_added, 2022) AS year_added,
COALESCE(UPPER(TRIM(stock_location)), 'Unknown') AS stock_location
FROM
products;
Result:
The query successfully cleaned the data in the products table. All missing values were handled appropriately:
Nominal values (e.g., product_type, brand) were replaced with ‘Unknown’ where missing. Continuous values (e.g., weight, price) were filled with the respective median values. Discrete values (e.g., average_units_sold) were replaced with 0. Missing year_added values were filled with 2022.
product_id | product_type | brand | weight | price | average_units_sold | year_added | stock_location | |
---|---|---|---|---|---|---|---|---|
0 | 1 | Bakery | TopBrand | 602.61 | 11.00 | 15 | 2022 | C |
1 | 2 | Produce | SilverLake | 478.26 | 8.08 | 22 | 2022 | C |
2 | 3 | Produce | TastyTreat | 532.38 | 6.16 | 21 | 2018 | B |
3 | 4 | Bakery | StandardYums | 453.43 | 7.26 | 21 | 2021 | D |
4 | 5 | Produce | GoldTree | 588.63 | 7.88 | 21 | 2020 | A |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1695 | 1696 | Meat | TastyTreat | 503.99 | 14.08 | 25 | 2017 | A |
1696 | 1697 | Meat | GoldTree | 526.89 | 16.13 | 25 | 2016 | D |
1697 | 1698 | Bakery | YumMie | 583.85 | 7.05 | 16 | 2021 | A |
1698 | 1699 | Produce | TopBrand | 441.64 | 8.10 | 19 | 2019 | A |
1699 | 1700 | Meat | TopBrand | 518.60 | 15.89 | 24 | 2021 | A |
1700 rows × 8 columns
SQL Query for Finding Minimum and Maximum Prices by Product Type
Situation:
The task involved determining the price range for each product type within the products
table. Specifically, I needed to calculate both the minimum and maximum prices for each product type, which would allow to see the variation in price within each category of products.
Task:
I was asked to write an SQL query that would return:
- The product_type (i.e., category of the product).
- The min_price (the minimum price for each product type).
- The max_price (the maximum price for each product type).
Action:
To solve this, I wrote an SQL query that:
- Used the MIN() function to find the lowest price for each
product_type
. - Used the MAX() function to find the highest price for each
product_type
. - Applied GROUP BY to group the data by
product_type
, ensuring that the results would be aggregated accordingly.
Here’s the query I used:
1
2
3
4
5
6
SELECT
product_type,
MIN(price) AS min_price,
MAX(price) AS max_price
FROM products
GROUP BY product_type;
Result:
The query successfully returned the minimum and maximum prices for each product_type
product_type | min_price | max_price | |
---|---|---|---|
0 | Snacks | 5.20 | 10.72 |
1 | Produce | 3.46 | 8.78 |
2 | Dairy | 8.33 | 13.97 |
3 | Bakery | 6.26 | 11.88 |
4 | Meat | 11.48 | 16.98 |
SQL Query to Identifying Meat and Dairy Products with High Sales
Situation:
The team wanted to focus on Meat and Dairy products, particularly those that had an average sales volume greater than 10 units per month. This would help them identify which products in these categories were performing well in terms of sales.
Task:
I was tasked with writing a query to extract the following details for these high-performing products:
- product_id: The unique identifier for the product.
- price: The price of the product.
- average_units_sold: The average number of units sold per month.
Action:
To address the task, I created a query that:
- Filters the
products
table to include only rows where theproduct_type
is either'Meat'
or'Dairy'
. - Adds another filter to return only rows where the
average_units_sold
is greater than 10. - Selects the relevant columns:
product_id
,price
, andaverage_units_sold
.
Here’s the query:
1
2
3
4
5
6
7
8
SELECT
product_id,
price,
average_units_sold
FROM products
WHERE 1=1
AND product_type IN('Meat', 'Dairy')
AND average_units_sold > 10;
Result:
The query successfully returned a list of Meat and Dairy products with high sales volume, giving the team valuable insights into which products were performing the best.
product_id | price | average_units_sold | |
---|---|---|---|
0 | 6 | 16.20 | 24 |
1 | 8 | 15.77 | 28 |
2 | 9 | 11.57 | 30 |
3 | 10 | 13.94 | 27 |
4 | 11 | 9.26 | 26 |
... | ... | ... | ... |
693 | 1694 | 16.00 | 25 |
694 | 1695 | 12.88 | 20 |
695 | 1696 | 14.08 | 25 |
696 | 1697 | 16.13 | 25 |
697 | 1700 | 15.89 | 24 |
698 rows × 3 columns
Summary of Tasks
Task 1: Identify Products with Missing Year Information I identified products that were missing the year_added field. This information is crucial for analyzing how the year a product was added impacts its pricing.
Task 2: Clean and Prepare the Product Data I cleaned the product data by replacing missing values with appropriate defaults (e.g., using the median for continuous variables like weight and price, or ‘Unknown’ for categorical data like brand and product_type).
Task 3: Determine Price Range by Product Type I wrote a query to calculate the minimum and maximum prices for each product type, giving the team a clear understanding of price variation across product categories.
Task 4: Identify High-Selling Meat and Dairy Products I filtered the products to find Meat and Dairy items where the average units sold exceeded 10, providing the team with valuable insights into top-performing products in these categories.