| 1 | +> This analysis was built by a Software Engineer relying on 8-year-old university memories of statistics. |
| 2 | +> If the math looks wrong, just assume it's a feature, not a bug. |
| 3 | +> You can always contact me. |
| 4 | + |
| 5 | +## Deconstructing Hollywood: A Data Science Journey from Raw Data to p99 Insights |
| 6 | + |
| 7 | +As software engineers, we are used to deterministic systems. If `a = b`, then a equals b. |
| 8 | +Data Science, however, deals with probability, distributions, and noise. |
| 9 | +It's less about "**what is the answer**" and more about "**how confident are we in this trend?**" |
| 10 | + |
| 11 | +Recently, I wanted to bridge my engineering background with data science to answer a simple pop-culture question: |
| 12 | +**How do different movie genres actually perform?** |
| 13 | + |
| 14 | +Are "**Action**" movies inherently rated lower than "**Dramas**"? Is it harder to make a masterpiece "**Horror**" movie than a masterpiece "**Biography**"? |
| 15 | + |
| 16 | +To answer this, I didn't just want to run a script; I wanted to build a production-grade Data Science lab. (/s)
| 17 | +This post details the entire journey, from choosing the modern Python stack and engineering the data pipeline to defining
| 18 | +the statistical metrics that reveal the "truth" behind average ratings.
| 19 | + |
| 20 | +## Part 1: The Engineering Foundation |
| 21 | + |
| 22 | +A data project is only as good as its environment. I wanted a setup that was fast, reproducible, and clean. |
| 23 | + |
| 24 | +### The Stack Decision |
| 25 | + |
| 26 | +I chose Python because it is the undisputed **[lingua franca](/vocab/lingua-franca)** of data science. |
| 27 | +The ecosystem (Pandas for data crunching, Seaborn for visualization) is unmatched. |
| 28 | + |
| 29 | +### The Package Manager: Why `uv`? |
| 30 | + |
| 31 | +Traditionally, Python data science relies on Conda because it manages complex C-library dependencies used by |
| 32 | +math libraries like NumPy. However, Conda can be slow and bloated. |
| 33 | + |
| 34 | +For this project, I chose `uv`. |
| 35 | + |
| 36 | +`uv` is a modern, blazing-fast Python package manager written in Rust. |
| 37 | +It can stand in for `pip`, `poetry`, and `virtualenv` in most workflows. It typically resolves dependencies far faster than `pip` and writes a lockfile (`uv.lock`) so the environment is reproducible.
| 38 | +For a project relying on standard wheels like Pandas, `uv` provides a vastly superior developer experience. |
| 39 | + |
| 40 | +```bash |
| 41 | +# Setting up the environment took seconds |
| 42 | +$ uv init movie-analysis && cd movie-analysis
| 43 | +$ uv python install 3.10 |
| 44 | +$ uv add pandas matplotlib seaborn scipy jupyter ipykernel |
| 45 | +``` |
| 46 | + |
| 47 | +I then connected VS Code to the `.venv` created by `uv`, giving me a robust Jupyter Notebook experience right in the IDE.
| 48 | + |
| 49 | +## Part 2: The Data Pipeline (ETL) |
| 50 | + |
| 51 | +I needed data with genres, votes, and ratings, so I went straight to the source: the **IMDb Non-Commercial Datasets**.
| 52 | + |
| 53 | +Then I faced a classic data engineering challenge: these are massive TSV (Tab Separated Values) files. |
| 54 | +Loading the entirety of IMDb into RAM on a laptop is a bad idea. |
| 55 | + |
| 56 | +Solution? Build a Python ETL script to handle ingestion smartly: |
| 57 | + |
| 58 | +1. **Stream & Filter**: I used Pandas to read the raw files in chunks, filtering immediately for `titleType == 'movie'` and excluding older films. This kept memory usage low.
| 59 | +2. **Merge**: I joined `title.basics` (genres/names) with `title.ratings` (scores/votes) on their unique IDs.
| 60 | +3. **The "Explode"**: This was the crucial data transformation step. IMDb lists genres as a single string: "Action,Adventure,Sci-Fi". To analyze by category, I had to split that string and "explode" the dataset, duplicating the movie row for each genre it belongs to. |
| 61 | + |
| 62 | +```python |
| 63 | +# Transforming "Action,Comedy" into two distinct analysis rows |
| 64 | +df['genres'] = df['genres'].str.split(',') |
| 65 | +df_exploded = df.explode('genres') |
| 66 | +``` |
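The first two steps can be sketched along the following lines. This is a minimal illustration, not the project's actual script: the inline sample rows are made up, `load_movies` is a hypothetical helper, and `MIN_YEAR` is an assumed cutoff, since the post does not say which year was used. Column names follow the public IMDb dataset schema.

```python
import io

import pandas as pd

# Tiny made-up stand-ins for title.basics.tsv and title.ratings.tsv.
BASICS_TSV = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\tgenres\n"
    "tt0000001\tmovie\tOld Film\t1950\tDrama\n"
    "tt0000002\tmovie\tNew Film\t2010\tAction,Comedy\n"
    "tt0000003\ttvSeries\tSome Show\t2015\tDrama\n"
    "tt0000004\tmovie\tUndated Film\t\\N\tHorror\n"
)
RATINGS_TSV = (
    "tconst\taverageRating\tnumVotes\n"
    "tt0000001\t7.1\t1200\n"
    "tt0000002\t6.4\t5300\n"
    "tt0000003\t8.0\t900\n"
)

MIN_YEAR = 1980  # assumption: the real cutoff is not given in the post


def load_movies(basics_file, chunksize=2):
    """Read the basics file chunk by chunk, keeping only recent movies."""
    kept = []
    reader = pd.read_csv(
        basics_file,
        sep="\t",
        na_values="\\N",  # IMDb marks missing values with a literal \N
        chunksize=chunksize,
    )
    for chunk in reader:
        # Rows with a missing startYear compare as False and are dropped.
        mask = (chunk["titleType"] == "movie") & (chunk["startYear"] >= MIN_YEAR)
        kept.append(chunk[mask])
    return pd.concat(kept, ignore_index=True)


movies = load_movies(io.StringIO(BASICS_TSV))
ratings = pd.read_csv(io.StringIO(RATINGS_TSV), sep="\t")

# Inner join on the shared title ID: only rated movies survive.
df = movies.merge(ratings, on="tconst", how="inner")
```

With real files you would pass a path instead of `io.StringIO(...)` and a much larger `chunksize`; only the filtered rows from each chunk are held in memory at once.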
| 67 | + |
| 68 | +## Part 3: The Science (Beyond Averages) |
| 69 | + |
| 70 | +With clean data in hand, we moved into a Jupyter Notebook for Exploratory Data Analysis (EDA). |
| 71 | + |
| 72 | +### 1. Removing the Noise (The Long Tail) |
| 73 | + |
| 74 | +If you average every movie on IMDb, your data is polluted by home videos with 5 votes from the director's family. |
| 75 | +In statistics, vote counts often follow a ["Power Law"](/vocab/power-law) or long-tail distribution. |
| 76 | + |
| 77 | +To analyze global sentiment, we had to filter out the noise. We set a threshold, dropping any movie with fewer than 100 votes. |
| 78 | +This ensured our statistical analysis was based on titles with a minimum level of public engagement. |
| 79 | + |
| 80 | +### 2. Visualizing the Truth (The Box Plot) |
| 81 | + |
| 82 | +A simple average rating is misleading. If a genre has many `1/10`s and just as many `10/10`s, the average is `5.5/10`, but that number says nothing about how polarizing the genre is.
| 83 | + |
| 84 | +I used a [Box Plot](/vocab/box-plot) to visualize the distribution. It shows the median (the center line), the Interquartile Range (the colored box containing the middle 50% of data), and outliers (the dots). |
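To make the box plot's ingredients concrete, here is a small sketch using only the standard library. The two rating lists are made up and `box_stats` is a hypothetical helper; the 1.5 × IQR fence is the common Tukey convention for marking outliers, which I am assuming is what the plot uses.

```python
import statistics


def box_stats(ratings):
    """Return the numbers a box plot draws for one group of ratings."""
    q1, median, q3 = statistics.quantiles(ratings, n=4, method="inclusive")
    iqr = q3 - q1
    # Points beyond 1.5 * IQR from the box are drawn as outlier dots.
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [r for r in ratings if r < low_fence or r > high_fence]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr, "outliers": outliers}


# Two made-up genres with the same mean but very different spreads.
polarizing = [1, 1, 1, 10, 10, 10]
consensus = [5, 5, 5, 6, 6, 6]

wide = box_stats(polarizing)
tight = box_stats(consensus)
```

Both lists share the same mean, yet `wide["iqr"]` is far larger than `tight["iqr"]`. That difference in box height is exactly what the average hides.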
| 85 | + |
| 86 | + |
| 87 | + |
| 88 | +**Initial Observations:** |
| 89 | +- **Documentary/Biography**: High medians, compact boxes. They are consistently rated highly. |
| 90 | +- **Horror**: The lowest median and a wide spread. It’s very easy to make a bad horror movie. |
| 91 | + |
| 92 | +### 3. The Metrics: Weighted Ratings & p99 |
| 93 | + |
| 94 | +To get deeper insights, I needed better math than simple means. |
| 95 | + |
| 96 | +#### Metric A: The Weighted Rating (Bayesian Average) |
| 97 | + |
| 98 | +How do you compare a movie with a 9.0 rating and 105 votes against an 8.2 rating with 500,000 votes? The latter score rests on far more evidence, so it is the more reliable estimate.
| 99 | + |
| 100 | +I adopted IMDb's own **Weighted Rating** formula. This "Bayesian average" pulls a movie's rating (R) toward the global average (C) when it has few votes (v),
| 101 | +and lets it move away from C only as v grows relative to a threshold (m).
| 102 | + |
| 103 | + |
| 104 | + |
| 105 | +This provided a fair "Quality Score" for every movie. |
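For reference, the commonly cited form of IMDb's weighted rating is `WR = (v / (v + m)) * R + (m / (v + m)) * C`, where R is the movie's own average rating. The sketch below applies it to the two movies from the example above. The values `m = 1000` and `C = 6.0` are assumptions chosen for illustration; the post does not state the ones used in the project.

```python
def weighted_rating(R, v, m, C):
    """IMDb-style weighted rating.

    R: the movie's own average rating
    v: number of votes the movie received
    m: vote threshold controlling how strongly ratings shrink toward C
    C: global average rating across all movies
    """
    return (v / (v + m)) * R + (m / (v + m)) * C


M_THRESHOLD = 1000  # assumed value, for illustration only
GLOBAL_MEAN = 6.0   # assumed value, for illustration only

small_sample = weighted_rating(R=9.0, v=105, m=M_THRESHOLD, C=GLOBAL_MEAN)
big_sample = weighted_rating(R=8.2, v=500_000, m=M_THRESHOLD, C=GLOBAL_MEAN)
```

Under these assumptions the 9.0 movie with 105 votes is pulled most of the way back to the global mean, while the 8.2 movie with 500,000 votes barely moves, so `big_sample` ends up higher than `small_sample`.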
| 106 | + |
| 107 | +#### Metric B: The p99 Ceiling |
| 108 | + |
| 109 | +I wanted to know the "potential" of a genre. Even if most Action movies are mediocre, how good are the very best ones? |
| 110 | + |
| 111 | +For this, I calculated the [99th Percentile (p99)](/vocab/p99) rating for each genre. This is the rating value below which 99% of the genre falls. |
| 112 | +It represents the elite tier, the "Masterpiece Ceiling." |
| 113 | + |
| 114 | +## Part 4: The Deductions (The Gap Analysis)
| 115 | + |
| 116 | +By combining the Average Weighted Rating (the typical experience) and the p99 Rating (the elite potential), we created a "Gap Analysis" chart. |
| 117 | + |
| 118 | +The dark green bar is the average quality. The total height of the bar is the p99 ceiling. The light green area represents the "Masterpiece Gap". |
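The numbers behind such a chart can be computed roughly as follows. This is a sketch on made-up data: the `weighted_rating` column name is hypothetical, and taking both the mean and the p99 from that same column is my assumption, since the post does not say which rating the percentile is computed on.

```python
import pandas as pd

# Made-up exploded data: one row per (movie, genre) pair.
df_exploded = pd.DataFrame({
    "genres": ["Action"] * 5 + ["Drama"] * 5,
    "weighted_rating": [4.0, 5.0, 5.5, 6.0, 8.5, 6.0, 6.5, 7.0, 7.0, 7.5],
})

grouped = df_exploded.groupby("genres")["weighted_rating"]

summary = pd.DataFrame({
    "mean": grouped.mean(),         # the typical experience (dark bar)
    "p99": grouped.quantile(0.99),  # the elite ceiling (total bar height)
})

# The "Masterpiece Gap" is the distance between the two.
summary["gap"] = summary["p99"] - summary["mean"]
```

In this toy data, Action has the lower mean but the larger gap, which is the "High Risk / High Reward" shape described in the deductions.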
| 119 | + |
| 120 | + |
| 121 | + |
| 122 | +#### The Data Science Deductions |
| 123 | + |
| 124 | +This single chart reveals the "personality" of every genre: |
| 125 | + |
| 126 | +1. **The "Safe Bets" (Documentary, History, Biography)**: |
| 127 | +They have very high averages (tall dark bars) and a small gap to the ceiling. |
| 128 | +_Deduction_: It is difficult to make a poorly rated documentary. Audience selection bias likely plays a role here |
| 129 | +(people only watch docs on topics they already like). |
| 130 | + |
| 131 | +2. **The "High Risk / High Reward" (Horror, Sci-Fi)**: They have the lowest averages (short dark bars), |
| 132 | +indicating the typical output is poor. However, their p99 ceilings remain high. |
| 133 | +_Deduction_: The gap is huge. It is incredibly difficult to execute these genres well, but when it's done right |
| 134 | +(e.g., Alien, The Exorcist), they are revered just as highly as dramas. |
| 135 | + |
| 136 | +3. **The Animation Anomaly**: Animation has a high average and a very high ceiling. |
| 137 | +_Deduction_: Statistically, this is perhaps the most consistently high-quality genre in modern cinema. |
| 138 | + |
| 139 | +## Conclusion |
| 140 | + |
| 141 | +This project demonstrated that with a solid engineering setup using modern tools like `uv`, |
| 142 | +and by applying statistical concepts beyond simple averages, we can uncover nuanced truths hidden in raw data. |
| 143 | +Averages tell you what is probable; distributions and percentiles tell you what is possible. |
| 144 | + |
| 145 | +### Question A: Which genre is "easier" to make? (Action vs. Drama vs. Comedy) |
| 146 | + |
| 147 | +**The Data Verdict:** It is significantly "easier" to make an acceptable **Drama** than an acceptable **Action** or **Comedy** movie. |
| 148 | + |
| 149 | +- **Evidence:** Look at the box plot.
| 150 | + - **Drama** has a high median and a "tight" box (smaller Interquartile Range). This means even "average" Dramas are usually rated around 6.5–7.0. The "floor" is high. |
| 151 | + - **Action** has a lower median. Action movies require budget, stunts, and effects. If those look cheap, the rating tanks immediately. |
| 152 | + A bad drama is just "boring" (5/10); a bad action movie looks "broken" (3/10). |
| 153 | + - **Comedy** is arguably the *hardest* to get a high rating for. Humor is subjective. |
| 154 | + If a joke lands for 50% of the audience but annoys the other 50%, the rating averages out to a 5.0. |
| 155 | + **Drama is universal; Comedy is divisive**. |
| 156 | + |
| 157 | +### Question B: Should I use lower search bounds for Comedy compared to Drama? |
| 158 | + |
| 159 | +**The Data Verdict:** **YES. Absolutely.** |
| 160 | + |
| 161 | +- **The "Genre Inflation" Factor:** Users rate genres differently. A **7.0** in Horror or Comedy is effectively an **8.0** in Drama or Biography. |
| 162 | + - **The Strategy:** If you filter for `Rating > 7.5`, you will see hundreds of Biographies, but you will filter out some of the funniest Comedies ever made (which often sit at 6.8 - 7.2). |
| 163 | + - **Action/Comedy Filter:** Set your threshold to **6.5**. |
| 164 | + - **Drama/Doc Filter:** Set your threshold to **7.5**. |
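As a rough sketch, genre-aware thresholds could be applied like this. The 6.5 and 7.5 cutoffs come from the strategy above; the fallback of 7.0 for unlisted genres, the sample movies, and the `passes_filter` helper are all hypothetical.

```python
# Per-genre minimum ratings, taken from the strategy above.
GENRE_THRESHOLDS = {
    "Action": 6.5,
    "Comedy": 6.5,
    "Drama": 7.5,
    "Documentary": 7.5,
}
DEFAULT_THRESHOLD = 7.0  # assumed fallback for genres not listed above


def passes_filter(genre, rating):
    """Return True if the rating clears the bar for its genre."""
    return rating >= GENRE_THRESHOLDS.get(genre, DEFAULT_THRESHOLD)


# Made-up movies: same rating, different genres.
movies = [
    ("Funny Movie", "Comedy", 6.9),
    ("Serious Movie", "Drama", 6.9),
]

picks = [title for title, genre, rating in movies if passes_filter(genre, rating)]
```

The same 6.9 clears the Comedy bar but not the Drama one, so only the comedy ends up in `picks`.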
| 165 | + |
| 166 | + |
| 167 | +### Question C: The "Blindfold Test" (Documentary vs. Sci-Fi) |
| 168 | + |
| 169 | +**The Data Verdict:** You will be statistically safer picking the **Documentary**. |
| 170 | + |
| 171 | +- **The "Floor" Concept:** Look at the "Whiskers" (the lines extending from the boxes) on the box plot. |
| 172 | +  - **Sci-Fi:** The bottom whisker goes deep down (towards 1.0 or 2.0). There is a real chance that a random Sci-Fi movie is unwatchable _garbage_.
| 173 | + - **Documentary:** The bottom whisker rarely dips below 5.0 or 6.0. |
| 174 | + |
| 175 | +- **The Psychology:** |
| 176 | + - **Documentaries** are usually made by passionate experts about specific topics. They rarely "fail" completely. |
| 177 | + - **Sci-Fi** is high-risk. It attempts to build new worlds. When that fails, it looks ridiculous, leading to "hate-watching" and 1-star reviews. |
| 178 | + - **Conclusion:** If you are tired and just want a "guaranteed decent watch" (Low Variance), pick **Documentary**. If you want to gamble for a potentially mind-blowing experience (High Variance), pick **Sci-Fi**. |
| 179 | + |
| 180 | + |
| 181 | +You can check the project here: [IMDbayes](https://github.com/fezcode/IMDbayes) |