Exploring the Essence: Insights from Perfume Data Visualization

Interactive

Published

December 12, 2024

Introduction

We’re stepping into the captivating world of fragrances, exploring a #TidyTuesday dataset that delves into the intricate details of perfumes. The dataset, sourced from Parfumo, a vibrant community of perfume enthusiasts, was web-scraped by Olga G. and provides a comprehensive overview of perfumes, from their ratings and olfactory notes to the perfumers behind them and their year of release.

For this project, I focused on analyzing the top notes of perfumes, the first impression fragrances leave. After cleaning and transforming the data in R, I used D3.js to craft a beautiful beeswarm visualization, showcasing the most popular top notes. Let’s dive into the details, including how the data was prepared and visualized.

Understanding the Dataset

This dataset contains detailed information about 59,325 perfumes listed on Parfumo. It includes:

Perfume ratings
Olfactory notes: Top, middle, and base notes
Perfumers and the year of release
Other relevant characteristics

The data was cleaned to focus on top notes and explore their popularity. Below is the step-by-step R code used to clean and prepare the data.

Data Preparation in R

# Load necessary libraries
library(tidyverse)  # For data manipulation and visualization
library(scales)     # For scaling values
library(lubridate)  # For handling dates
library(janitor)    # For cleaning column names

# Import the cleaned Parfumo dataset
parfumo_data_clean <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2024/2024-12-10/parfumo_data_clean.csv') |>
  clean_names()  # Clean column names for easier manipulation

Rows: 59325 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): Number, Name, Brand, Concentration, Main_Accords, Top_Notes, Middl...
dbl  (3): Release_Year, Rating_Value, Rating_Count

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Analyze brand and perfume counts
parfumo_data_clean |>
  count(brand, sort = TRUE)

# A tibble: 1,452 × 2
   brand                                                    n
   <chr>                                                <int>
 1 Avon                                                  1000
 2 Victoria's Secret                                      995
 3 Zara                                                   995
 4 Bath & Body Works                                      972
 5 Guerlain                                               586
 6 Ensar Oud / Oriscent                                   534
 7 Demeter Fragrance Library / The Library Of Fragrance   474
 8 Al Haramain / الحرمين                                  434
 9 Oriflame                                               394
10 Yves Rocher                                            381
# ℹ 1,442 more rows

The dataset contains 1,452 distinct brands.

parfumo_data_clean |>
  count(name, sort = TRUE)

# A tibble: 55,120 × 2
   name                      n
   <chr>                 <int>
 1 Chypre                   37
 2 Gardenia                 36
 3 Amber                    34
 4 Rose                     26
 5 Jasmin                   24
 6 Magnolia                 20
 7 The Fragrance Kitchen    20
 8 Eau de                   19
 9 Black                    18
10 Ambre                    17
# ℹ 55,110 more rows

There are 55,120 unique perfume names.

# Focus on top notes
# Select the "top_notes" column and separate the notes into individual entries
df <- parfumo_data_clean |>
  select(top_notes) |>
  mutate(top_notes = str_split(top_notes, ", ")) |>  # Split multiple notes into a list
  unnest_wider(top_notes, names_sep = "_")  # Expand the list into separate columns

# Transform the data to a long format
df <- df |>
  pivot_longer(cols = everything(),  # Transform all top notes columns into rows
               names_to = "name",  # New column for original column names
               values_to = "note_name")  # New column for the actual note names

# Filter out missing values and count occurrences of each note
df <- df |>
  filter(!is.na(note_name)) |>  # Remove rows with missing values
  count(note_name, sort = TRUE)  # Count each note and sort by frequency

There are a total of 2,430 unique top notes in the dataset.

Key Takeaways:

The dataset includes information on 1,452 brands and 55,120 unique perfume names.
A total of 2,430 unique top notes were identified from the dataset.

Visualization with D3.js

Using the cleaned data from R, I transitioned to D3.js to create an interactive beeswarm visualization. The visual captures the most prominent top notes, showing which notes appear most often across the perfumes in the dataset. Each bubble represents a top note, with its size corresponding to its frequency (the count computed in the R step above).
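
To make the idea of frequency-sized bubbles concrete, here is a minimal sketch in plain JavaScript. It is not the code behind the published visualization. The sample counts are made up, the function names (`makeRadiusScale`, `overlaps`, `beeswarm`) are hypothetical, and the choice to put the count on the x axis is an assumption. The sketch uses a square-root scale so that a bubble's area, rather than its radius, is proportional to the note's count, and then nudges each bubble away from the axis until it no longer collides with its neighbours.

```javascript
// Illustrative counts only (made-up values, not taken from the dataset).
const sampleNotes = [
  { note_name: "Note A", n: 400 },
  { note_name: "Note B", n: 100 },
  { note_name: "Note C", n: 100 },
  { note_name: "Note D", n: 25 },
  { note_name: "Note E", n: 25 },
];

// Square-root scale: radius grows with sqrt(count), so the circle's AREA
// is proportional to the note's frequency.
function makeRadiusScale(maxCount, maxRadius) {
  return (count) => maxRadius * Math.sqrt(count / maxCount);
}

// True when two circles are closer than their radii plus the padding.
function overlaps(a, b, padding) {
  const dx = a.x - b.x;
  const dy = a.y - b.y;
  const minDist = a.r + b.r + padding;
  return dx * dx + dy * dy < minDist * minDist;
}

// Greedy beeswarm: x encodes the count (an assumption for this sketch),
// and y is moved away from the axis until the bubble fits.
function beeswarm(items, { width = 600, maxRadius = 40, padding = 2, step = 1 } = {}) {
  const maxCount = Math.max(...items.map((d) => d.n));
  const radius = makeRadiusScale(maxCount, maxRadius);
  const placed = [];
  // Place the largest bubbles first so they stay closest to the axis.
  const sorted = [...items].sort((a, b) => b.n - a.n);
  for (const d of sorted) {
    const node = { ...d, x: (d.n / maxCount) * width, y: 0, r: radius(d.n) };
    let k = 0;
    while (placed.some((p) => overlaps(node, p, padding))) {
      k += 1;
      // Try offsets +step, -step, +2*step, -2*step, ...
      node.y = (k % 2 === 1 ? 1 : -1) * Math.ceil(k / 2) * step;
    }
    placed.push(node);
  }
  return placed;
}

const layout = beeswarm(sampleNotes);
console.log(layout.map((d) => `${d.note_name}: x=${d.x}, y=${d.y}, r=${d.r}`));
```

Each object in `layout` carries the `x`, `y`, and `r` values that could be bound to SVG circles. In a D3 project, `d3.scaleSqrt` and a force simulation with `d3.forceCollide` are common ways to get a similar result with smoother placement.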

Key Highlights:

The most popular top notes include Bergamot, Mandarin, Grapefruit, and Lemon.
These citrusy notes are widely used in perfumes for their fresh and vibrant appeal.

The visualization also integrates elegant design elements, such as soft gradients and a perfume bottle illustration, to evoke the essence of luxury and refinement.

Reflections

Exploring the world of perfumes through data has been both fascinating and rewarding. The combination of R for data cleaning and D3.js for visualization allowed me to uncover and present intriguing insights. This project highlights how data visualization can bring abstract concepts, such as fragrances, to life.

Feel free to explore the dataset and experiment with your own analyses. The olfactory journey is just beginning!