@file consumerInsight.py
@brief Synthetic brand mention analysis pipeline using sentiment analysis
and semantic clustering.
@details
This script implements an end-to-end pipeline that:
- Generates synthetic social media mentions for two retail brands
- Cleans and preprocesses textual content
- Computes sentiment using VADER
- Embeds text using Sentence-BERT
- Clusters mentions into themes using KMeans
- Labels clusters via TF-IDF keywords
- Ranks themes using volume, engagement, and sentiment intensity
All outputs are exported as CSV files for downstream analysis.
pd.DataFrame consumerInsight.synthesize_mentions(int n = 100, int seed = 42)
@brief Generate a synthetic dataset of brand mentions.
@details
This function simulates social media mentions for Amazon and MediaMarkt
across multiple platforms. Each mention is generated from predefined
templates with controlled probability distributions to ensure that
dominant themes emerge realistically.
Ground-truth theme and polarity labels are included for validation
and debugging purposes.
@section synth_config Synthesis configuration (documented local elements)
@subsection brand_weights_doc brand_weights
@brief Probability distribution for brand occurrence.
@details
Amazon is assigned a slightly higher weight (0.55) than MediaMarkt (0.45)
to reflect its larger online presence and higher expected mention volume
on social platforms. The values are intentionally close to avoid extreme
class imbalance while still allowing observable dominance.
@subsection polarity_weights_doc polarity_weights
@brief Sentiment polarity distribution per brand.
@details
For Amazon, a slight negative skew (52% negative) is introduced to reflect common
complaints related to customer service, returns, and third-party sellers.
MediaMarkt is modeled with a slightly positive balance (55% positive) due to its
in-store support and assisted purchasing experience.
Both values stay close to 50/50 to preserve realism and avoid artificial bias.
@subsection theme_weights_doc theme_weights
@brief Theme probability distribution conditioned on brand and polarity.
@details
These weights are designed so that:
- Two dominant themes per (brand, polarity) naturally emerge
- Minor themes remain present but less frequent
Example:
- Amazon positive mentions are dominated by price deals (0.42) and fast
delivery (0.40), reflecting its value and logistics strengths.
- Amazon negative mentions emphasize customer service and return issues.
All sub-distributions sum to 1.0 to maintain probabilistic consistency.
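The sum-to-1.0 property can be checked mechanically. The sketch below is illustrative only: the dictionary layout, the theme keys, and the two minor Amazon themes with their weights are assumptions; only the 0.42 and 0.40 values come from the description above.

```python
import math

# Keyed by (brand, polarity). Only 0.42 and 0.40 are documented values;
# the minor themes and their weights are hypothetical fillers.
theme_weights = {
    ("Amazon", "positive"): {
        "price_deals": 0.42,     # documented
        "fast_delivery": 0.40,   # documented
        "product_range": 0.10,   # hypothetical minor theme
        "app_experience": 0.08,  # hypothetical minor theme
    },
}


def validate_theme_weights(weights, tol=1e-9):
    """Raise ValueError if any sub-distribution does not sum to 1.0."""
    for key, dist in weights.items():
        total = sum(dist.values())
        if not math.isclose(total, 1.0, abs_tol=tol):
            raise ValueError(f"{key} sums to {total:.4f}, expected 1.0")


def dominant_themes(dist, k=2):
    """Return the k themes with the largest weights, heaviest first."""
    return sorted(dist, key=dist.get, reverse=True)[:k]


validate_theme_weights(theme_weights)
top_two = dominant_themes(theme_weights[("Amazon", "positive")])
```

Running such a check once at import time would catch a mistyped weight before it silently distorts the sampled theme mix.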
@subsection weighted_choice_doc weighted_choice
@brief Weighted sampling helper (nested function).
@details
This helper is defined inside synthesize_mentions() to keep sampling logic
scoped locally. Doxygen does not list nested Python functions as standalone
API entries, so the documentation is provided here to remain visible.
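As a rough illustration of the nested-helper pattern, the sketch below defines a weighted_choice closure inside a reduced stand-in function. The name synthesize_brands, the cumulative-sum implementation, and the use of random.Random are assumptions made for this example and may differ from the actual code.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def synthesize_brands(n=100, seed=42):
    """Hypothetical reduced stand-in for synthesize_mentions(): brands only."""
    rng = random.Random(seed)
    brand_weights = {"Amazon": 0.55, "MediaMarkt": 0.45}

    def weighted_choice(weights):
        """Return one key of `weights`, drawn in proportion to its value."""
        keys = list(weights)
        cumulative = list(accumulate(weights.values()))
        # Scale by the total so slightly unnormalised weights still work.
        point = rng.random() * cumulative[-1]
        # Clamp the index to guard against floating-point edge cases.
        index = min(bisect_right(cumulative, point), len(keys) - 1)
        return keys[index]

    return [weighted_choice(brand_weights) for _ in range(n)]


sample = synthesize_brands(n=1000, seed=42)
```

Because the helper closes over rng, every draw shares the seeded generator, and the helper stays invisible outside the enclosing function.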
@subsection rows_doc rows
@brief Container for synthesized mention records.
@details
Each entry represents one synthetic social media mention, including
brand, platform, text, engagement metrics, and ground-truth labels.
@subsection loop_doc Main generation loop
@brief Iteratively construct mention records.
@details
For each iteration:
- A brand is sampled according to brand_weights
- A platform is sampled uniformly
- Sentiment polarity is sampled conditionally on brand
- A theme is sampled conditionally on (brand, polarity)
- A text template is selected accordingly
- Engagement metrics are generated using normal distributions
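The order of the sampling steps can be sketched as follows. Platform names, templates, most weights, and the engagement parameters are hypothetical placeholders; only the brand weights, the 52%/55% polarity figures, and the 0.42/0.40 theme weights are taken from this documentation. The function name generate_rows is also an assumption.

```python
import random

BRAND_WEIGHTS = {"Amazon": 0.55, "MediaMarkt": 0.45}
PLATFORMS = ["Twitter", "Reddit", "Instagram"]  # hypothetical platform list
POLARITY_WEIGHTS = {
    "Amazon": {"positive": 0.48, "negative": 0.52},
    "MediaMarkt": {"positive": 0.55, "negative": 0.45},
}
# Theme names and all weights except 0.42 and 0.40 are placeholders.
THEME_WEIGHTS = {
    ("Amazon", "positive"): {"price_deals": 0.42, "fast_delivery": 0.40, "other": 0.18},
    ("Amazon", "negative"): {"customer_service": 0.5, "returns": 0.5},
    ("MediaMarkt", "positive"): {"in_store_support": 1.0},
    ("MediaMarkt", "negative"): {"other": 1.0},
}
# One placeholder template per theme.
TEMPLATES = {
    theme: "{brand} mention about " + theme.replace("_", " ")
    for dist in THEME_WEIGHTS.values()
    for theme in dist
}


def pick(weights, rng):
    """Draw one key of `weights` in proportion to its value."""
    return rng.choices(list(weights), weights=list(weights.values()), k=1)[0]


def generate_rows(n=100, seed=42):
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        brand = pick(BRAND_WEIGHTS, rng)
        platform = rng.choice(PLATFORMS)  # uniform over platforms
        polarity = pick(POLARITY_WEIGHTS[brand], rng)
        theme = pick(THEME_WEIGHTS[(brand, polarity)], rng)
        text = TEMPLATES[theme].format(brand=brand)
        likes = max(0, round(rng.gauss(20, 10)))  # single placeholder metric
        rows.append({"brand": brand, "platform": platform, "polarity": polarity,
                     "theme": theme, "text": text, "likes": likes})
    return rows  # pd.DataFrame(rows) would turn this into a DataFrame


rows = generate_rows()
```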
@subsection engagement_doc Engagement generation
@brief Engagement is modeled using normal distributions.
@details
The distributions are parameterized to mimic real-world variability:
- Positive mentions receive higher average likes
- Negative mentions receive more replies (discussion-driven)
- Shares are kept low and noisy across both polarities
The max(0, ·) constraint ensures non-negative engagement counts.
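A minimal sketch of this scheme is shown below. The means and standard deviations are hypothetical placeholders chosen only to reproduce the qualitative pattern (more likes for positive mentions, more replies for negative ones, low and noisy shares); the actual parameters are not documented here.

```python
import random

# (mean, standard deviation) per metric; all numbers are hypothetical.
ENGAGEMENT_PARAMS = {
    "positive": {"likes": (40, 12), "replies": (5, 3), "shares": (2, 2)},
    "negative": {"likes": (15, 8), "replies": (14, 6), "shares": (2, 2)},
}


def sample_engagement(polarity, rng):
    """Draw likes, replies, and shares for one mention of the given polarity."""
    params = ENGAGEMENT_PARAMS[polarity]
    return {
        # A normal draw can be negative, so clamp at zero after rounding.
        metric: max(0, round(rng.gauss(mean, std)))
        for metric, (mean, std) in params.items()
    }


rng = random.Random(42)
example = sample_engagement("negative", rng)
```

Clamping at zero slightly raises the effective mean of metrics whose mean is close to zero, such as shares, which may be acceptable for synthetic data.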
@param n Number of synthetic mentions to generate.
Defaults to 100, which keeps clustering stable while remaining lightweight.
@param seed Random seed for reproducibility of sampling and engagement values.
@return DataFrame containing synthetic mentions and metadata.