Commit c3ad70a

content(post): imdbayes
1 parent a32fee1 commit c3ad70a

File tree

10 files changed

+345
-0
lines changed

public/images/posts/wgsiw/gap.png

64.4 KB

public/posts/posts.json

Lines changed: 11 additions & 0 deletions
@@ -1,4 +1,15 @@
11
[
2+
{
3+
"slug": "what-genre-should-i-watch",
4+
"title": "Dying is Easy, Comedy is Statistically Impossible: An IMDbayes Analysis",
5+
"date": "2026-01-18",
6+
"updated": "2026-01-18",
7+
"description": "Deconstructing Hollywood: A Data Science Journey from Raw Data to p99 Insights. Why you need a 6.5 filter for laughs, but can fly blind with a Documentary.",
8+
"tags": ["data science", "math", "python", "imdb", "movies", "uv", "dev"],
9+
"category": "dev",
10+
"filename": "what-genre-should-i-watch.txt",
11+
"authors": ["fezcode"]
12+
},
213
{
314
"slug": "debian-upgrade-path",
415
"title": "Upgrading Debian 11 to 13: The Safe Path",
Lines changed: 181 additions & 0 deletions
@@ -0,0 +1,181 @@
1+
> This analysis was built by a Software Engineer relying on 8-year-old university memories of statistics.
2+
> If the math looks wrong, just assume it's a feature, not a bug.
3+
> You can always contact me.
4+
5+
## Deconstructing Hollywood: A Data Science Journey from Raw Data to p99 Insights
6+
7+
As software engineers, we are used to deterministic systems. If `a = b`, then a equals b.
8+
Data Science, however, deals with probability, distributions, and noise.
9+
It's less about "**what is the answer**" and more about "**how confident are we in this trend?**"
10+
11+
Recently, I wanted to bridge my engineering background with data science to answer a simple pop-culture question:
12+
**How do different movie genres actually perform?**
13+
14+
Are "**Action**" movies inherently rated lower than "**Dramas**"? Is it harder to make a masterpiece "**Horror**" movie than a masterpiece "**Biography**"?
15+
16+
To answer this, I didn't just want to run a script; I wanted to build a "production-grade" Data Science lab. (/s)
17+
This post details the entire journey, from choosing the modern Python stack and engineering the data pipeline to defining
18+
the statistical metrics that reveal the "truth" behind average ratings.
19+
20+
## Part 1: The Engineering Foundation
21+
22+
A data project is only as good as its environment. I wanted a setup that was fast, reproducible, and clean.
23+
24+
### The Stack Decision
25+
26+
I chose Python because it is the undisputed **[lingua franca](/vocab/lingua-franca)** of data science.
27+
The ecosystem (Pandas for data crunching, Seaborn for visualization) is unmatched.
28+
29+
### The Package Manager: Why `uv`?
30+
31+
Traditionally, Python data science relies on Conda because it manages complex C-library dependencies used by
32+
math libraries like NumPy. However, Conda can be slow and bloated.
33+
34+
For this project, I chose `uv`.
35+
36+
`uv` is a modern, blazing-fast Python package manager written in Rust.
37+
It can stand in for `pip`, `poetry`, and `virtualenv`. It resolves dependencies quickly, writes a lockfile, and creates reproducible environments in seconds.
38+
For a project relying on standard wheels like Pandas, `uv` provides a vastly superior developer experience.
39+
40+
```bash
41+
# Setting up the environment took seconds
42+
$ uv init movie-analysis
$ cd movie-analysis  # run the remaining commands inside the new project
43+
$ uv python install 3.10
44+
$ uv add pandas matplotlib seaborn scipy jupyter ipykernel
45+
```
46+
47+
I then connected VS Code to the `.venv` created by `uv`, giving me a robust Jupyter Notebook experience right in the IDE.
48+
49+
## Part 2: The Data Pipeline (ETL)
50+
51+
I needed data with genres, votes, and ratings, so I went straight to the source: the **IMDb Non-Commercial Datasets**.
52+
53+
Then I faced a classic data engineering challenge: these are massive TSV (Tab Separated Values) files.
54+
Loading the entirety of IMDb into RAM on a laptop is a bad idea.
55+
56+
Solution? Build a Python ETL script to handle ingestion smartly:
57+
58+
1. **Stream & Filter**: I used Pandas to read the raw files in chunks, filtering immediately for `titleType == 'movie'` and excluding older films. This kept memory usage low.
59+
2. **Merge**: I joined `title.basics` (genres/names) with `title.ratings` (scores/votes) on their unique IDs.
60+
3. **The "Explode"**: This was the crucial data transformation step. IMDb lists genres as a single string: "Action,Adventure,Sci-Fi". To analyze by category, I had to split that string and "explode" the dataset, duplicating the movie row for each genre it belongs to.
61+
62+
```python
63+
# Transforming "Action,Comedy" into two distinct analysis rows
64+
df['genres'] = df['genres'].str.split(',')
65+
df_exploded = df.explode('genres')
66+
```
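
The first two steps (stream and filter, then merge) can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the inline sample rows, the `MIN_YEAR` cutoff, and the `load_movies` helper are assumptions made for the example. The column names (`tconst`, `titleType`, `startYear`, `genres`, `averageRating`, `numVotes`) and the `\N` missing-value marker follow the IMDb dataset documentation.

```python
import io

import pandas as pd

# Tiny stand-ins for title.basics.tsv and title.ratings.tsv (hypothetical rows).
BASICS_TSV = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\tgenres\n"
    "tt0000001\tmovie\tOld Film\t1950\tDrama\n"
    "tt0000002\tmovie\tNew Film\t2005\tAction,Comedy\n"
    "tt0000003\ttvSeries\tSome Show\t2010\tComedy\n"
    "tt0000004\tmovie\tUndated Film\t\\N\tHorror\n"
    "tt0000005\tmovie\tRecent Doc\t2015\tDocumentary\n"
    "tt0000006\tshort\tTiny Short\t2012\tComedy\n"
)
RATINGS_TSV = (
    "tconst\taverageRating\tnumVotes\n"
    "tt0000001\t7.1\t1200\n"
    "tt0000002\t6.4\t5300\n"
    "tt0000005\t7.9\t800\n"
)

MIN_YEAR = 1980  # assumed cutoff; the post only says "older films" were excluded


def load_movies(basics_file, chunksize=3):
    """Read title.basics chunk by chunk, keeping only recent movies."""
    kept = []
    reader = pd.read_csv(basics_file, sep="\t", na_values="\\N", chunksize=chunksize)
    for chunk in reader:
        is_movie = chunk["titleType"] == "movie"
        is_recent = chunk["startYear"] >= MIN_YEAR  # a missing year compares as False
        kept.append(chunk[is_movie & is_recent])
    return pd.concat(kept, ignore_index=True)


movies = load_movies(io.StringIO(BASICS_TSV))
ratings = pd.read_csv(io.StringIO(RATINGS_TSV), sep="\t", na_values="\\N")

# Inner join on the shared title ID keeps only movies that also have ratings.
df = movies.merge(ratings, on="tconst", how="inner")
print(df[["primaryTitle", "genres", "averageRating", "numVotes"]])
```

With the real files, `io.StringIO(...)` would be replaced by the path to the downloaded TSV and `chunksize` raised to something like a few hundred thousand rows, so that only the filtered rows ever accumulate in memory.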
67+
68+
## Part 3: The Science (Beyond Averages)
69+
70+
With clean data in hand, we moved into a Jupyter Notebook for Exploratory Data Analysis (EDA).
71+
72+
### 1. Removing the Noise (The Long Tail)
73+
74+
If you average every movie on IMDb, your data is polluted by home videos with 5 votes from the director's family.
75+
In statistics, vote counts often follow a ["Power Law"](/vocab/power-law) or long-tail distribution.
76+
77+
To analyze global sentiment, we had to filter out the noise. We set a threshold, dropping any movie with fewer than 100 votes.
78+
This ensured our statistical analysis was based on titles with a minimum level of public engagement.
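
In Pandas, that filter is a one-liner. The sketch below uses made-up rows, and it assumes the vote count lives in a `numVotes` column (the name used in IMDb's ratings file).

```python
import pandas as pd

MIN_VOTES = 100  # threshold stated in the post

# Hypothetical sample rows.
df = pd.DataFrame(
    {
        "primaryTitle": ["Home Video", "Indie Gem", "Blockbuster"],
        "averageRating": [9.8, 7.4, 6.9],
        "numVotes": [5, 100, 250_000],
    }
)

# Titles with fewer than MIN_VOTES votes are dropped; exactly 100 is kept.
df_filtered = df[df["numVotes"] >= MIN_VOTES].reset_index(drop=True)
print(df_filtered)
```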
79+
80+
### 2. Visualizing the Truth (The Box Plot)
81+
82+
A simple average rating is misleading. If a genre has as many `1/10`s as `10/10`s, the average lands around `5.5/10`, but that number says nothing about how polarizing the genre is.
83+
84+
I used a [Box Plot](/vocab/box-plot) to visualize the distribution. It shows the median (the center line), the Interquartile Range (the colored box containing the middle 50% of data), and outliers (the dots).
85+
86+
![The Box Plot](/images/posts/wgsiw/boxplot.png)
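
The numbers a box plot draws can also be computed directly, which is handy for checking what the chart seems to show. The sketch below uses invented ratings and assumes the exploded frame has `genres` and `averageRating` columns; it illustrates the idea and is not the notebook's actual code.

```python
import pandas as pd

# Hypothetical exploded rows: one row per (movie, genre) pair.
df_exploded = pd.DataFrame(
    {
        "genres": ["Horror"] * 5 + ["Documentary"] * 5,
        "averageRating": [2.0, 4.0, 5.0, 6.0, 8.0, 6.5, 7.0, 7.2, 7.4, 7.9],
    }
)

grouped = df_exploded.groupby("genres")["averageRating"]
summary = pd.DataFrame(
    {
        "q1": grouped.quantile(0.25),      # bottom edge of the box
        "median": grouped.median(),        # center line
        "q3": grouped.quantile(0.75),      # top edge of the box
    }
)
summary["iqr"] = summary["q3"] - summary["q1"]  # height of the box
print(summary)
```

A small `iqr` corresponds to a compact box; a large one means ratings within the genre are spread out.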
87+
88+
**Initial Observations:**
89+
- **Documentary/Biography**: High medians, compact boxes. They are consistently rated highly.
90+
- **Horror**: The lowest median and a wide spread. It’s very easy to make a bad horror movie.
91+
92+
### 3. The Metrics: Weighted Ratings & p99
93+
94+
To get deeper insights, I needed better math than simple means.
95+
96+
#### Metric A: The Weighted Rating (Bayesian Average)
97+
98+
How do you compare a movie with a 9.0 rating and 105 votes against an 8.2 rating with 500,000 votes? The latter is backed by far more evidence, so it is the more trustworthy score.
99+
100+
I adopted IMDb's own **Weighted Rating** formula. This "Bayesian average" pulls a movie's rating toward the global average (C) if it has few votes (v),
101+
letting the movie's own score dominate only as its vote count grows well past a threshold (m).
102+
103+
![Weighted Rating](/images/posts/wgsiw/formula.png)
104+
105+
This provided a fair "Quality Score" for every movie.
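
For readers who cannot see the formula image, the commonly published form of IMDb's weighted rating is `WR = (v / (v + m)) * R + (m / (v + m)) * C`, where `R` is the movie's own average rating. Here is a minimal sketch applying it to the two movies from the example above. The values of `m` and `C` are assumptions picked for illustration, not the ones used in the project.

```python
def weighted_rating(R, v, m, C):
    """Bayesian-style weighted rating.

    R: the movie's own average rating
    v: number of votes the movie received
    m: vote threshold, i.e. how much weight the prior gets (assumed below)
    C: mean rating across all movies (assumed below)
    """
    return (v / (v + m)) * R + (m / (v + m)) * C


M = 1000  # assumed threshold, for illustration only
C = 6.0   # assumed global mean, for illustration only

niche = weighted_rating(R=9.0, v=105, m=M, C=C)
popular = weighted_rating(R=8.2, v=500_000, m=M, C=C)

print(f"niche:   {niche:.2f}")    # pulled strongly toward C
print(f"popular: {popular:.2f}")  # stays close to its raw 8.2
```

Under these assumed values the 9.0 movie with 105 votes ends up ranked below the 8.2 movie with 500,000 votes, which is the behavior the formula is designed to produce.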
106+
107+
#### Metric B: The p99 Ceiling
108+
109+
I wanted to know the "potential" of a genre. Even if most Action movies are mediocre, how good are the very best ones?
110+
111+
For this, I calculated the [99th Percentile (p99)](/vocab/p99) rating for each genre. This is the rating value below which 99% of the genre falls.
112+
It represents the elite tier, the "Masterpiece Ceiling."
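
Computing a per-genre p99 in Pandas might look like the sketch below. The data is synthetic (101 evenly spaced ratings) so the result is easy to check by eye, and the column names are assumptions.

```python
import pandas as pd

# Hypothetical data: 101 "Action" ratings evenly spaced from 0.0 to 10.0.
ratings = [i / 10 for i in range(0, 101)]
df = pd.DataFrame({"genres": ["Action"] * len(ratings), "averageRating": ratings})

# quantile(0.99) returns the value below which 99% of each group falls
# (Pandas interpolates linearly between neighbors by default).
p99 = df.groupby("genres")["averageRating"].quantile(0.99)
print(p99)
```

Note that with small groups a p99 is dominated by one or two titles, so it is most meaningful for genres with many movies.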
113+
114+
## Part 4: The Deductions (The Gap Analysis)
115+
116+
By combining the Average Weighted Rating (the typical experience) and the p99 Rating (the elite potential), we created a "Gap Analysis" chart.
117+
118+
The dark green bar is the average quality. The total height of the bar is the p99 ceiling. The light green area represents the "Masterpiece Gap".
119+
120+
![Masterpiece Gap](/images/posts/wgsiw/gap.png)
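
The quantities behind such a chart can be sketched as a small aggregation: average and p99 per genre, with the gap as their difference. The scores below are invented, and the `score` column is a hypothetical stand-in for each movie's weighted rating.

```python
import pandas as pd

# Hypothetical per-movie quality scores (e.g. weighted ratings) by genre.
df = pd.DataFrame(
    {
        "genres": ["Horror"] * 5 + ["Documentary"] * 5,
        "score": [3.0, 4.0, 5.0, 5.0, 8.0, 7.0, 7.0, 7.5, 7.5, 8.0],
    }
)

by_genre = df.groupby("genres")["score"]
gap = pd.DataFrame(
    {
        "average": by_genre.mean(),      # dark bar: the typical experience
        "p99": by_genre.quantile(0.99),  # total bar height: the ceiling
    }
)
gap["masterpiece_gap"] = gap["p99"] - gap["average"]  # light area
print(gap.sort_values("masterpiece_gap", ascending=False))
```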
121+
122+
#### The Data Science Deductions
123+
124+
This single chart reveals the "personality" of every genre:
125+
126+
1. **The "Safe Bets" (Documentary, History, Biography)**:
127+
They have very high averages (tall dark bars) and a small gap to the ceiling.
128+
_Deduction_: It is difficult to make a poorly rated documentary. Audience selection bias likely plays a role here
129+
(people only watch docs on topics they already like).
130+
131+
2. **The "High Risk / High Reward" (Horror, Sci-Fi)**: They have the lowest averages (short dark bars),
132+
indicating the typical output is poor. However, their p99 ceilings remain high.
133+
_Deduction_: The gap is huge. It is incredibly difficult to execute these genres well, but when it's done right
134+
(e.g., Alien, The Exorcist), they are revered just as highly as dramas.
135+
136+
3. **The Animation Anomaly**: Animation has a high average and a very high ceiling.
137+
_Deduction_: Statistically, this is perhaps the most consistently high-quality genre in modern cinema.
138+
139+
## Conclusion
140+
141+
This project demonstrated that with a solid engineering setup using modern tools like `uv`,
142+
and by applying statistical concepts beyond simple averages, we can uncover nuanced truths hidden in raw data.
143+
Averages tell you what is probable; distributions and percentiles tell you what is possible.
144+
145+
### Question A: Which genre is "easier" to make? (Action vs. Drama vs. Comedy)
146+
147+
**The Data Verdict:** It is significantly "easier" to make an acceptable **Drama** than an acceptable **Action** or **Comedy** movie.
148+
149+
- **Evidence:** Take another look at the box plot.
150+
- **Drama** has a high median and a "tight" box (smaller Interquartile Range). This means even "average" Dramas are usually rated around 6.5–7.0. The "floor" is high.
151+
- **Action** has a lower median. Action movies require budget, stunts, and effects. If those look cheap, the rating tanks immediately.
152+
A bad drama is just "boring" (5/10); a bad action movie looks "broken" (3/10).
153+
- **Comedy** is arguably the *hardest* to get a high rating for. Humor is subjective.
154+
If a joke lands for 50% of the audience but annoys the other 50%, the rating averages out to a 5.0.
155+
**Drama is universal; Comedy is divisive**.
156+
157+
### Question B: Should I use lower search bounds for Comedy compared to Drama?
158+
159+
**The Data Verdict:** **YES. Absolutely.**
160+
161+
- **The "Genre Inflation" Factor:** Users rate genres differently. A **7.0** in Horror or Comedy is effectively an **8.0** in Drama or Biography.
162+
- **The Strategy:** If you filter for `Rating > 7.5`, you will see hundreds of Biographies, but you will filter out some of the funniest Comedies ever made (which often sit at 6.8 - 7.2).
163+
- **Action/Comedy Filter:** Set your threshold to **6.5**.
164+
- **Drama/Doc Filter:** Set your threshold to **7.5**.
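
As a small illustration of the strategy, a genre-aware filter could look like this. The thresholds come from the bullets above; the fallback value, the sample titles, and the `passes_filter` helper are assumptions for the example.

```python
# Thresholds from the post; unlisted genres fall back to DEFAULT_THRESHOLD.
THRESHOLDS = {"Action": 6.5, "Comedy": 6.5, "Drama": 7.5, "Documentary": 7.5}
DEFAULT_THRESHOLD = 7.0  # assumed fallback, not from the post

# Hypothetical catalogue entries.
movies = [
    {"title": "Funny One", "genre": "Comedy", "rating": 6.9},
    {"title": "Serious One", "genre": "Drama", "rating": 6.9},
    {"title": "Real One", "genre": "Documentary", "rating": 7.8},
]


def passes_filter(movie):
    """Return True if the rating clears its genre-specific threshold."""
    threshold = THRESHOLDS.get(movie["genre"], DEFAULT_THRESHOLD)
    return movie["rating"] >= threshold


shortlist = [m["title"] for m in movies if passes_filter(m)]
print(shortlist)
```

The same 6.9 rating keeps the comedy on the shortlist and drops the drama, which is the point of using per-genre bounds.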
165+
166+
167+
### Question C: The "Blindfold Test" (Documentary vs. Sci-Fi)
168+
169+
**The Data Verdict:** You will be statistically safer picking the **Documentary**.
170+
171+
- **The "Floor" Concept:** Look at the "Whiskers" (the lines extending from the boxes) on the box plot.
172+
- **Sci-Fi:** The bottom whisker goes deep down (towards 1.0 or 2.0). There is a real chance that a random Sci-Fi movie is unwatchable _garbage_.
173+
- **Documentary:** The bottom whisker rarely dips below 5.0 or 6.0.
174+
175+
- **The Psychology:**
176+
- **Documentaries** are usually made by passionate experts about specific topics. They rarely "fail" completely.
177+
- **Sci-Fi** is high-risk. It attempts to build new worlds. When that fails, it looks ridiculous, leading to "hate-watching" and 1-star reviews.
178+
- **Conclusion:** If you are tired and just want a "guaranteed decent watch" (Low Variance), pick **Documentary**. If you want to gamble for a potentially mind-blowing experience (High Variance), pick **Sci-Fi**.
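
To make the "floor" idea concrete, here is a minimal sketch of how a lower whisker is usually computed: the lowest observation that still lies within 1.5 × IQR below Q1 (Tukey's convention, which plotting libraries commonly use by default). The ratings are invented to mimic the two shapes described above, and `lower_whisker` is a hypothetical helper.

```python
import pandas as pd


def lower_whisker(values):
    """Lowest observation within 1.5 * IQR below Q1 (Tukey's convention)."""
    s = pd.Series(values)
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    fence = q1 - 1.5 * (q3 - q1)
    return s[s >= fence].min()


# Hypothetical ratings, chosen only to illustrate the two shapes.
sci_fi = [1.5, 3.0, 4.5, 5.5, 6.0, 6.5, 7.0, 8.5]
documentary = [3.0, 6.8, 7.0, 7.1, 7.2, 7.3, 7.5, 7.8]

print("Sci-Fi floor:     ", lower_whisker(sci_fi))
print("Documentary floor:", lower_whisker(documentary))
```

In the documentary sample the lone 3.0 falls outside the fence, so it would be drawn as an outlier dot and does not drag the whisker down.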
179+
180+
181+
You can check the project here: [IMDbayes](https://github.com/fezcode/IMDbayes)

src/data/vocab/box-plot.jsx

Lines changed: 34 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,34 @@
1+
import React from 'react';
2+
3+
export default function BoxPlot() {
4+
return (
5+
<div className="space-y-4">
6+
<p>
7+
A <strong>Box Plot</strong> (or box-and-whisker plot) is a standardized way of displaying the distribution of data based on a five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum.
8+
</p>
9+
<div className="bg-gray-800 p-4 rounded-lg border border-gray-700 space-y-3">
10+
<div>
11+
<h4 className="text-sm font-bold text-blue-400">Median (Q2)</h4>
12+
<p className="text-sm text-gray-400">
13+
The middle value of the dataset. It splits the data into two equal halves. In a box plot, this is represented by the line inside the box.
14+
</p>
15+
</div>
16+
<div>
17+
<h4 className="text-sm font-bold text-green-400">Interquartile Range (IQR)</h4>
18+
<p className="text-sm text-gray-400">
19+
The distance between the first quartile (25th percentile) and the third quartile (75th percentile). It represents the "middle 50%" of the data and is shown as the height/width of the box itself.
20+
</p>
21+
</div>
22+
<div>
23+
<h4 className="text-sm font-bold text-red-400">Outliers</h4>
24+
<p className="text-sm text-gray-400">
25+
Data points that fall significantly outside the range of the rest of the data. Usually defined as points further than 1.5 × IQR from the edges of the box. They are typically plotted as individual dots beyond the whiskers.
26+
</p>
27+
</div>
28+
</div>
29+
<p className="text-sm italic text-gray-500">
30+
Box plots are exceptionally useful for comparing distributions between several groups at once, highlighting differences in spread and central tendency.
31+
</p>
32+
</div>
33+
);
34+
}

src/data/vocab/lingua-franca.jsx

Lines changed: 38 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,38 @@
1+
import React from 'react';
2+
3+
export default function LinguaFranca() {
4+
return (
5+
<div className="space-y-4">
6+
<p>
7+
A <strong>Lingua Franca</strong> (literally "Frankish language") is a
8+
language or way of communicating which is used between people who do not
9+
speak each other's native language.
10+
</p>
11+
<p>
12+
In modern contexts, it often refers to a common language that is
13+
adopted as a bridge for communication across different linguistic
14+
groups.
15+
</p>
16+
<div className="bg-gray-800 p-4 rounded-lg border border-gray-700">
17+
<h4 className="text-sm font-bold text-blue-400 mb-2">Examples:</h4>
18+
<ul className="list-disc pl-5 space-y-1 text-sm text-gray-400">
19+
<li>
20+
<strong>English:</strong> The global lingua franca of science,
21+
aviation, and the internet.
22+
</li>
23+
<li>
24+
<strong>Latin:</strong> The lingua franca of scholars and the Catholic
25+
Church in Europe for centuries.
26+
</li>
27+
<li>
28+
<strong>Swahili:</strong> A major lingua franca in East Africa.
29+
</li>
30+
<li>
31+
<strong>JavaScript:</strong> Often called the lingua franca of the
32+
web.
33+
</li>
34+
</ul>
35+
</div>
36+
</div>
37+
);
38+
}

src/data/vocab/p99.jsx

Lines changed: 28 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,28 @@
1+
import React from 'react';
2+
3+
export default function P99() {
4+
return (
5+
<div className="space-y-4">
6+
<p>
7+
<strong>P99</strong> (or the 99th Percentile) is a statistical metric indicating the value below which 99% of the observations in a group fall.
8+
</p>
9+
<p>
10+
In software engineering, it is widely used to measure <strong>system latency</strong> and performance. If a system has a P99 response time of 500ms, it means that <strong>99% of all requests</strong> are served in 500ms or less, while the remaining 1% (the outliers) take longer.
11+
</p>
12+
<div className="bg-gray-800 p-4 rounded-lg border border-gray-700">
13+
<h4 className="text-sm font-bold text-purple-400 mb-2">Why not just use the Average?</h4>
14+
<p className="text-sm text-gray-400 mb-3">
15+
Averages (Means) can be misleading because they hide extreme outliers.
16+
</p>
17+
<ul className="list-disc pl-5 space-y-1 text-sm text-gray-400">
18+
<li>
19+
<strong>Average:</strong> If 9 users load in 1s and 1 user loads in 100s, the average is ~11s. This doesn't represent the typical user (1s) OR the worst case (100s).
20+
</li>
21+
<li>
22+
<strong>P99:</strong> Focuses on the "tail" latency: the worst experience that a significant chunk of your users might see. Optimizing for P99 gives nearly everyone a consistent experience, not just the "lucky" ones.
23+
</li>
24+
</ul>
25+
</div>
26+
</div>
27+
);
28+
}

src/data/vocab/power-law.jsx

Lines changed: 37 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,37 @@
1+
import React from 'react';
2+
3+
export default function PowerLaw() {
4+
return (
5+
<div className="space-y-4">
6+
<p>
7+
A <strong>Power Law</strong> is a functional relationship between two quantities, where a relative change in one quantity results in a proportional relative change in the other quantity, independent of the initial size of those quantities.
8+
</p>
9+
<p>
10+
In statistics, it often manifests as a <strong>Long-Tail Distribution</strong>, where a small number of events (the "head") occur with high frequency, while a large number of events (the "tail") occur with low frequency.
11+
</p>
12+
<div className="bg-gray-800 p-4 rounded-lg border border-gray-700">
13+
<h4 className="text-sm font-bold text-yellow-400 mb-2">Key Characteristics:</h4>
14+
<ul className="list-disc pl-5 space-y-1 text-sm text-gray-400">
15+
<li>
16+
<strong>Scale Invariance:</strong> The distribution looks the same regardless of the scale at which you observe it.
17+
</li>
18+
<li>
19+
<strong>Pareto Principle (80/20 Rule):</strong> A common manifestation where 80% of effects come from 20% of causes.
20+
</li>
21+
<li>
22+
<strong>Lack of "Average":</strong> Unlike a Normal Distribution (Bell Curve), the "average" in a power law is often misleading as it's heavily skewed by extreme outliers.
23+
</li>
24+
</ul>
25+
</div>
26+
<div className="bg-gray-800 p-4 rounded-lg border border-gray-700">
27+
<h4 className="text-sm font-bold text-green-400 mb-2">Real-World Examples:</h4>
28+
<ul className="list-disc pl-5 space-y-1 text-sm text-gray-400">
29+
<li><strong>Wealth Distribution:</strong> A small percentage of the population holds the majority of the wealth.</li>
30+
<li><strong>City Populations:</strong> A few mega-cities vs. thousands of small towns.</li>
31+
<li><strong>Word Frequency (Zipf's Law):</strong> The most common words in a language appear significantly more often than the rest.</li>
32+
<li><strong>Internet Traffic:</strong> A few websites (Google, YouTube) receive the vast majority of all traffic.</li>
33+
</ul>
34+
</div>
35+
</div>
36+
);
37+
}

src/data/vocabulary.js

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -63,4 +63,20 @@ export const vocabulary = {
6363
title: 'Modules vs. Includes',
6464
loader: () => import('./vocab/modules-vs-includes'),
6565
},
66+
'lingua-franca': {
67+
title: 'Lingua Franca',
68+
loader: () => import('./vocab/lingua-franca'),
69+
},
70+
'power-law': {
71+
title: 'Power Law (Long-tail)',
72+
loader: () => import('./vocab/power-law'),
73+
},
74+
'box-plot': {
75+
title: 'Box Plot',
76+
loader: () => import('./vocab/box-plot'),
77+
},
78+
p99: {
79+
title: 'P99 (99th Percentile)',
80+
loader: () => import('./vocab/p99'),
81+
},
6682
};
