Grocery Store Sales

Posted Oct 7, 2024 Updated Oct 11, 2024

Grocery Store Sales

By Samir Alikperov

9 min read

Grocery Store Sales

As part of my SQL certification journey, I tackled an intriguing challenge with FoodYum, a U.S.-based grocery store chain. FoodYum offers a wide range of products—from fresh produce and meat to dairy, baked goods, snacks, and other household staples. With the rising costs of food, the challenge for FoodYum is to strategically manage their inventory. They aim to stock products in every category across various price ranges to accommodate a diverse customer base.

To learn more about me and my work, expand the ABOUT ME section.

ABOUT ME

Learn more about me on my ABOUT page. Below, you can find links to explore more of my projects categorized by topics, access my resume, and contact me.
| PROJECTS | RESUME | CONTACT |

Data

The data is available in the table products.

The dataset contains records of customers for their last full year of the loyalty program.

Column Name	Criteria
product_id	Nominal. The unique identifier of the product. Missing values are not possible due to the database structure.
product_type	Nominal. The product category type of the product, one of 5 values (Produce, Meat, Dairy, Bakery, Snacks).Missing values should be replaced with “Unknown”.
brand	Nominal. The brand of the product. One of 7 possible values. Missing values should be replaced with “Unknown”.
weight	Continuous. The weight of the product in grams. This can be any positive value, rounded to 2 decimal places. Missing values should be replaced with the overall median weight.
price	Continuous. The price the product is sold at, in US dollars. This can be any positive value, rounded to 2 decimal places. Missing values should be replaced with the overall median price.
average_units_sold	Discrete. The average number of units sold each month. This can be any positive integer value. Missing values should be replaced with 0.
year_added	Nominal. The year the product was first added to FoodYum stock. Missing values should be replaced with 2022.
stock_location	Nominal. The location that stock originates. This can be one of four warehouse locations, A, B, C or D. Missing values should be replaced with “Unknown”.

SQL Query for Find Missing Year Added Values

Situation:

In 2022, a bug was discovered in the product system where some products added during the year did not have the year_added value set. Since the year a product was added could potentially affect its price, it’s crucial to have this information available.

Task:

We need to write a query that determines how many products have a missing year_added value. The query should output a single column named missing_year containing the count of rows where the year_added is missing (i.e., set to NULL).

Action:

We can achieve this by using the SQL COUNT function and filtering the rows where the year_added column is NULL. The COUNT function will count the total number of such rows, and the result will be labeled as missing_year.

  
SELECT COUNT(*) AS missing_year
FROM products 
WHERE year_added IS NULL;

Result:

The output successfully provided the number of products missing the year_added value.

	missing_year
0	170

SQL Query for Data Cleaning of the `products` Table

Situation:

I was tasked with ensuring the data in the products table is clean and accurate before proceeding with further analysis. The data included missing values in various columns such as product_type, brand, weight, price, average_units_sold, year_added, and stock_location, which needed to be replaced based on specific criteria. Additionally, I needed to calculate median values for the weight and price columns to use as replacements for missing data.

Task:

My objective was to write an SQL query that would clean the data by handling missing values in accordance with the provided criteria:

Nominal data (e.g., product_type, brand, stock_location): replace missing values with “Unknown.”
Continuous data (e.g., weight, price): replace missing values with the overall median.
Discrete data (e.g., average_units_sold): replace missing values with 0.
Year data (year_added): replace missing values with the year 2022.

Action:

To clean the data, I wrote an SQL query that:

Used COALESCE to provide fallback values for columns with missing values.
Replaced missing nominal values with ‘Unknown’ and replaced missing continuous values (such as weight and price) with the median values calculated using the PERCENTILE_CONT function.
For columns like average_units_sold, I used COALESCE to replace missing values with 0.
Implemented REGEXP_REPLACE and CAST to clean and process the weight column, ensuring it only contains numeric values.
Applied the ROUND function to ensure values are rounded to two decimal places where necessary.

Here’s the query I used:

  
WITH median_values AS (
    SELECT 
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY CAST(REGEXP_REPLACE(weight, '[^\d.]', '', 'g') AS NUMERIC)) AS median_weight,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY price) AS median_price
    FROM 
        products
)
SELECT
    product_id,
    COALESCE(TRIM(product_type), 'Unknown') AS product_type,
    COALESCE(NULLIF(TRIM(REPLACE(brand, '-', '')), ''), 'Unknown') AS brand,
    COALESCE(
        ROUND(CAST(NULLIF(REGEXP_REPLACE(weight, '[^\d.]', '', 'g'), '') AS NUMERIC), 2),
        ROUND((SELECT median_weight FROM median_values)::NUMERIC, 2)
    ) AS weight,
    COALESCE(
        TO_CHAR(ROUND(price::NUMERIC, 2), '9999999999.99'),
        TO_CHAR(ROUND((SELECT median_price FROM median_values)::NUMERIC, 2), '9999999999.99')
    ) AS price,
    COALESCE(average_units_sold, 0) AS average_units_sold,
    COALESCE(year_added, 2022) AS year_added,
    COALESCE(UPPER(TRIM(stock_location)), 'Unknown') AS stock_location
FROM 
    products;

Result:

The query successfully cleaned the data in the products table. All missing values were handled appropriately:

Nominal values (e.g., product_type, brand) were replaced with ‘Unknown’ where missing. Continuous values (e.g., weight, price) were filled with the respective median values. Discrete values (e.g., average_units_sold) were replaced with 0. Missing year_added values were filled with 2022.

	product_id	product_type	brand	weight	price	average_units_sold	year_added	stock_location
0	1	Bakery	TopBrand	602.61	11.00	15	2022	C
1	2	Produce	SilverLake	478.26	8.08	22	2022	C
2	3	Produce	TastyTreat	532.38	6.16	21	2018	B
3	4	Bakery	StandardYums	453.43	7.26	21	2021	D
4	5	Produce	GoldTree	588.63	7.88	21	2020	A
...	...	...	...	...	...	...	...	...
1695	1696	Meat	TastyTreat	503.99	14.08	25	2017	A
1696	1697	Meat	GoldTree	526.89	16.13	25	2016	D
1697	1698	Bakery	YumMie	583.85	7.05	16	2021	A
1698	1699	Produce	TopBrand	441.64	8.10	19	2019	A
1699	1700	Meat	TopBrand	518.60	15.89	24	2021	A

1700 rows × 8 columns

SQL Query for Finding Minimum and Maximum Prices by Product Type

Situation:

The task involved determining the price range for each product type within the products table. Specifically, I needed to calculate both the minimum and maximum prices for each product type, which would allow to see the variation in price within each category of products.

Task:

I was asked to write an SQL query that would return:

The product_type (i.e., category of the product).
The min_price (the minimum price for each product type).
The max_price (the maximum price for each product type).

Action:

To solve this, I wrote an SQL query that:

Used the MIN() function to find the lowest price for each product_type.
Used the MAX() function to find the highest price for each product_type.
Applied GROUP BY to group the data by product_type, ensuring that the results would be aggregated accordingly.

Here’s the query I used:

  
SELECT 
    product_type,
    MIN(price) AS min_price,
    MAX(price) AS max_price
FROM products 
GROUP BY product_type;

Result:

The query successfully returned the minimum and maximum prices for each product_type

	product_type	min_price	max_price
0	Snacks	5.20	10.72
1	Produce	3.46	8.78
2	Dairy	8.33	13.97
3	Bakery	6.26	11.88
4	Meat	11.48	16.98

SQL Query to Identifying Meat and Dairy Products with High Sales

Situation:

The team wanted to focus on Meat and Dairy products, particularly those that had an average sales volume greater than 10 units per month. This would help them identify which products in these categories were performing well in terms of sales.

Task:

I was tasked with writing a query to extract the following details for these high-performing products:

product_id: The unique identifier for the product.
price: The price of the product.
average_units_sold: The average number of units sold per month.

Action:

To address the task, I created a query that:

Filters the products table to include only rows where the product_type is either 'Meat' or 'Dairy'.
Adds another filter to return only rows where the average_units_sold is greater than 10.
Selects the relevant columns: product_id, price, and average_units_sold.

Here’s the query:

  
SELECT
    product_id,
    price,
    average_units_sold
FROM products 
WHERE 1=1
  AND product_type IN('Meat', 'Dairy')
  AND average_units_sold > 10;

Result:

The query successfully returned a list of Meat and Dairy products with high sales volume, giving the team valuable insights into which products were performing the best.

	product_id	price	average_units_sold
0	6	16.20	24
1	8	15.77	28
2	9	11.57	30
3	10	13.94	27
4	11	9.26	26
...	...	...	...
693	1694	16.00	25
694	1695	12.88	20
695	1696	14.08	25
696	1697	16.13	25
697	1700	15.89	24

698 rows × 3 columns

Summary of Tasks

Task 1: Identify Products with Missing Year Information I identified products that were missing the year_added field. This information is crucial for analyzing how the year a product was added impacts its pricing.

Task 2: Clean and Prepare the Product Data I cleaned the product data by replacing missing values with appropriate defaults (e.g., using the median for continuous variables like weight and price, or ‘Unknown’ for categorical data like brand and product_type).

Task 3: Determine Price Range by Product Type I wrote a query to calculate the minimum and maximum prices for each product type, giving the team a clear understanding of price variation across product categories.

Task 4: Identify High-Selling Meat and Dairy Products I filtered the products to find Meat and Dairy items where the average units sold exceeded 10, providing the team with valuable insights into top-performing products in these categories.

SQL

DATA ANALYSIS SALES ANALYSIS

This post is licensed under CC BY 4.0 by the author.

Data

SQL Query for Find Missing Year Added Values

Situation:

Task:

Action:

Result:

SQL Query for Data Cleaning of the products Table

Situation:

Task:

Action:

Result:

SQL Query for Finding Minimum and Maximum Prices by Product Type

Situation:

Task:

Action:

Result:

SQL Query to Identifying Meat and Dairy Products with High Sales

Situation:

Task:

Action:

Result:

Summary of Tasks

Trending Tags

SQL Query for Data Cleaning of the `products` Table