AI Duplicate Content Detector for Symfony Using PHP and OpenAI Embeddings
If you've been running a Symfony-based blog or CMS for a while, chances are you already have duplicate content. You just don't know it yet. Editors rewrite old articles, documentation pages grow organically, and over time you end up with five pages that all basically say the same thing, just worded differently.
The usual approach to catching this, string matching or exact text comparison, falls apart the moment someone changes a few words. Two articles can be 90% the same in meaning and a simple diff won't flag either of them.
That's where OpenAI embeddings come in. Instead of comparing words, we compare meaning. In this tutorial, I'll show you how to build a duplicate content detector in Symfony that uses vector embeddings and cosine similarity to catch semantically similar articles, even when the wording is completely different.
What We're Building
By the end of this guide, you will have:
- AI-generated embeddings for every article
- A semantic similarity checker based on cosine similarity
- A console command that finds duplicates
- A similarity threshold (e.g., 0.85 or higher) for flagging content
- A foundation you can integrate into any Symfony CMS
This is effective for:
- Blogs
- Knowledge bases
- Documentation portals
- E-commerce content pages
Requirements
- Symfony 6 or 7
- PHP 8.1+ (Symfony 7 requires PHP 8.2+)
- Doctrine ORM
- MySQL / PostgreSQL
- An OpenAI API key
Step 1: Add an Embedding Column to Your Entity
Assume an Article entity.
src/Entity/Article.php
#[ORM\Column(type: 'json', nullable: true)]
private ?array $embedding = null;

public function getEmbedding(): ?array
{
    return $this->embedding;
}

public function setEmbedding(?array $embedding): self
{
    $this->embedding = $embedding;
    return $this;
}
Create and run the migration:
php bin/console make:migration
php bin/console doctrine:migrations:migrate
Step 2: Generate Embeddings for Articles
Create a Symfony command:
php bin/console make:command app:generate-article-embeddings
GenerateArticleEmbeddingsCommand.php
namespace App\Command;

use App\Entity\Article;
use Doctrine\ORM\EntityManagerInterface;
use Symfony\Component\Console\Attribute\AsCommand;
use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;

// #[AsCommand] replaces the static $defaultName property, which was removed in Symfony 7.
#[AsCommand(name: 'app:generate-article-embeddings')]
class GenerateArticleEmbeddingsCommand extends Command
{
    public function __construct(
        private EntityManagerInterface $em,
        private string $apiKey
    ) {
        parent::__construct();
    }

    protected function execute(InputInterface $input, OutputInterface $output): int
    {
        $articles = $this->em->getRepository(Article::class)->findAll();

        foreach ($articles as $article) {
            // Skip articles that already have an embedding.
            if ($article->getEmbedding()) {
                continue;
            }

            $embedding = $this->getEmbedding(
                strip_tags($article->getContent())
            );

            // An empty result means the API call failed; leave the article for the next run.
            if ($embedding === []) {
                $output->writeln("Embedding failed for article ID {$article->getId()}");
                continue;
            }

            $article->setEmbedding($embedding);
            $this->em->persist($article);
            $output->writeln("Embedding generated for article ID {$article->getId()}");
        }

        $this->em->flush();

        return Command::SUCCESS;
    }

    private function getEmbedding(string $text): array
    {
        $payload = [
            'model' => 'text-embedding-3-small',
            // Rough character cap to keep each request small.
            'input' => mb_substr($text, 0, 4000),
        ];

        $ch = curl_init('https://api.openai.com/v1/embeddings');
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_HTTPHEADER => [
                "Content-Type: application/json",
                "Authorization: Bearer {$this->apiKey}",
            ],
            CURLOPT_POST => true,
            CURLOPT_POSTFIELDS => json_encode($payload),
        ]);

        $response = curl_exec($ch);
        curl_close($ch);

        // curl_exec() returns false on transport errors.
        if ($response === false) {
            return [];
        }

        return json_decode($response, true)['data'][0]['embedding'] ?? [];
    }
}
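Store the API key in .env.local:
OPENAI_API_KEY=your_key_here
The command's $apiKey constructor argument is a plain string, so Symfony cannot autowire it. Bind it in config/services.yaml, for example with a bind entry that maps $apiKey to '%env(OPENAI_API_KEY)%'.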
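The getEmbedding() method in the command reads the vector straight out of the decoded response. As a minimal sketch, that step could be pulled into a small function that validates the body before anything is stored. The name extractEmbedding is hypothetical; the data[0].embedding path is the one the command already reads.

```php
<?php
declare(strict_types=1);

/**
 * Pull the embedding vector out of a raw /v1/embeddings response body.
 * Returns an empty array when the body is missing, malformed, or an error payload.
 * (extractEmbedding is an illustrative helper name, not part of any library.)
 */
function extractEmbedding(string|false $responseBody): array
{
    if ($responseBody === false || $responseBody === '') {
        return []; // curl_exec() failed or returned nothing
    }

    $data = json_decode($responseBody, true);
    if (!is_array($data)) {
        return []; // not valid JSON
    }

    $vector = $data['data'][0]['embedding'] ?? null;
    if (!is_array($vector) || $vector === []) {
        return []; // error response or unexpected shape
    }

    // Reject the whole vector if any entry is not a number.
    foreach ($vector as $value) {
        if (!is_int($value) && !is_float($value)) {
            return [];
        }
    }

    // Cast to float so later math always works on one type.
    return array_map('floatval', array_values($vector));
}
```

Inside the command, the last line of getEmbedding() could then become a call such as extractEmbedding($response), which keeps the "empty array means failure" convention the loop relies on.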
Step 3: Cosine Similarity Helper
Create a reusable helper.
src/Service/SimilarityService.php
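namespace App\Service;

class SimilarityService
{
    /**
     * Cosine similarity of two equal-length vectors, from -1.0 to 1.0.
     */
    public function cosine(array $a, array $b): float
    {
        $dot = 0.0;
        $magA = 0.0;
        $magB = 0.0;

        foreach ($a as $i => $val) {
            $dot += $val * $b[$i];
            $magA += $val ** 2;
            $magB += $b[$i] ** 2;
        }

        $denominator = sqrt($magA) * sqrt($magB);

        // Guard against zero-length vectors to avoid a division by zero.
        return $denominator > 0.0 ? $dot / $denominator : 0.0;
    }
}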
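To get a feel for the scores, here is a small worked example on toy vectors. It uses a standalone copy of the same formula so it runs on its own; the three-dimensional vectors are made up for illustration, and real embeddings have far more dimensions.

```php
<?php
declare(strict_types=1);

// Standalone copy of the cosine formula so the example runs without Symfony.
function cosine(array $a, array $b): float
{
    $dot = 0.0;
    $magA = 0.0;
    $magB = 0.0;
    foreach ($a as $i => $val) {
        $dot += $val * $b[$i];
        $magA += $val ** 2;
        $magB += $b[$i] ** 2;
    }
    $denominator = sqrt($magA) * sqrt($magB);
    return $denominator > 0.0 ? $dot / $denominator : 0.0;
}

// Toy vectors chosen to show the three extremes.
$same      = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]);  // same direction, different length
$unrelated = cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]);  // orthogonal
$opposite  = cosine([1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]); // opposite direction

printf("%.2f %.2f %.2f\n", $same, $unrelated, $opposite); // prints "1.00 0.00 -1.00"
```

Vector length does not affect the score, only direction does, which is why a long article and a short one on the same topic can still score close to 1.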
Step 4: Detect Duplicate Articles
Create another command:
php bin/console make:command app:detect-duplicates
DetectDuplicateContentCommand.php
namespace App\Command;

use App\Entity\Article;
use App\Service\SimilarityService;
use Doctrine\ORM\EntityManagerInterface;
use Symfony\Component\Console\Attribute\AsCommand;
use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;

// #[AsCommand] replaces the static $defaultName property, which was removed in Symfony 7.
#[AsCommand(name: 'app:detect-duplicates')]
class DetectDuplicateContentCommand extends Command
{
    public function __construct(
        private EntityManagerInterface $em,
        private SimilarityService $similarity
    ) {
        parent::__construct();
    }

    protected function execute(InputInterface $input, OutputInterface $output): int
    {
        $articles = $this->em->getRepository(Article::class)->findAll();
        $threshold = 0.85;

        foreach ($articles as $i => $a) {
            foreach ($articles as $j => $b) {
                // Compare each pair only once, and never an article with itself.
                if ($j <= $i) {
                    continue;
                }

                // Skip articles that have no embedding yet.
                if (!$a->getEmbedding() || !$b->getEmbedding()) {
                    continue;
                }

                $score = $this->similarity->cosine(
                    $a->getEmbedding(),
                    $b->getEmbedding()
                );

                if ($score >= $threshold) {
                    $output->writeln(
                        sprintf(
                            "⚠ Duplicate detected (%.2f): Article %d and %d",
                            $score,
                            $a->getId(),
                            $b->getId()
                        )
                    );
                }
            }
        }

        return Command::SUCCESS;
    }
}
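The nested loop recomputes both vector magnitudes for every pair. One possible variation, sketched here under the assumption that embeddings are available as a plain [article ID => vector] array, is to scale each vector to unit length once, after which cosine similarity reduces to a dot product. The function names are hypothetical, and this is a swapped-in optimization rather than the command's own logic.

```php
<?php
declare(strict_types=1);

/** Scale a vector to unit length; a zero vector is returned unchanged. */
function normalize(array $v): array
{
    $norm = sqrt(array_sum(array_map(fn($x) => $x * $x, $v)));
    return $norm > 0.0 ? array_map(fn($x) => $x / $norm, $v) : $v;
}

/** Dot product of two equal-length vectors. */
function dot(array $a, array $b): float
{
    return (float) array_sum(array_map(fn($x, $y) => $x * $y, $a, $b));
}

/**
 * Return every pair whose similarity is at or above the threshold, highest first.
 * $embeddings is an assumed [articleId => vector] map.
 */
function findDuplicatePairs(array $embeddings, float $threshold = 0.85): array
{
    // Normalize once per article instead of once per comparison.
    $unit = array_map('normalize', $embeddings);
    $ids = array_keys($unit);
    $count = count($ids);
    $pairs = [];

    for ($i = 0; $i < $count; $i++) {
        for ($j = $i + 1; $j < $count; $j++) {
            $score = dot($unit[$ids[$i]], $unit[$ids[$j]]);
            if ($score >= $threshold) {
                $pairs[] = ['a' => $ids[$i], 'b' => $ids[$j], 'score' => $score];
            }
        }
    }

    // Most similar pairs first, so reviewers see the likeliest duplicates at the top.
    usort($pairs, fn(array $x, array $y) => $y['score'] <=> $x['score']);

    return $pairs;
}
```

The comparison is still quadratic in the number of articles, so this only trims the constant factor. Returning the pairs as data rather than printing them also makes it easier to store results or send notifications later.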
Step 5: Run via Cron (Optional)
To scan regularly, add a cron job:
0 2 * * * php /path/to/project/bin/console app:detect-duplicates
You can store results in a table or send email notifications.
Example Output
⚠ Duplicate detected (0.91): Article 12 and 37
⚠ Duplicate detected (0.88): Article 18 and 44
Useful Improvements
This system can be extended with:
- An admin UI for reviewing duplicates
- Automatic canonical page suggestions
- Extra weighting for the title and excerpt
- Section-level similarity detection
- Batch processing with Messenger
- A vector database for large-scale datasets
Cost & Performance Advice
- Generate the embedding for each article only once.
- Limit the content length before embedding.
- Skip draft content.
- Cache similarity results.
- Use queues for large datasets.
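The first two points can be sketched as a pair of helpers: one that cleans and caps the text, and one that decides whether an article needs a new embedding by comparing a content hash. The function names and the idea of storing a hash alongside the embedding are assumptions for illustration, not part of the entity shown earlier.

```php
<?php
declare(strict_types=1);

/** Strip markup, decode entities, collapse whitespace, and cap the length. */
function prepareForEmbedding(string $html, int $maxChars = 4000): string
{
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    $text = trim(preg_replace('/\s+/u', ' ', $text) ?? '');

    return mb_substr($text, 0, $maxChars);
}

/**
 * True when no hash is stored yet or the prepared text has changed since.
 * $storedHash would come from a hypothetical extra column on the article.
 */
function needsEmbedding(string $preparedText, ?string $storedHash): bool
{
    return $storedHash === null
        || !hash_equals($storedHash, hash('sha256', $preparedText));
}
```

With a check like this, an edited article gets re-embedded while untouched ones are skipped, so repeated runs do not spend API calls on content that has not changed.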
AI Category Recommendation System for Drupal 11 Using PHP and OpenAI
Categorization in Drupal is one of those things that looks fine on the surface but gets messy fast. Editors are busy, categories get picked in a hurry, and before long you've got a dozen articles filed under the wrong taxonomy term or spread inconsistently across three different ones that mean almost the same thing.
The fix isn't enforcing stricter rules on editors. It's removing the guesswork entirely.
In this tutorial, I'll walk you through building a custom Drupal 11 module that reads a node's actual content and uses OpenAI to pick the most appropriate category automatically, no manual selection needed.
It hooks into the node save process, pulls your existing taxonomy terms, and asks the AI to match the content against them. The result gets assigned before the node is stored. It's a small module but it solves a real problem, especially on sites with large editorial teams or high publishing volume.
What This Module Will Do
Our AI category system will:
- Analyze node body content on save
- Compare it against existing taxonomy terms
- Recommend the most relevant category
- Automatically assign it (or display it to editors)
Use cases include:
- Blog posts
- Documentation pages
- News articles
- Knowledge bases
Prerequisites
Make sure you have:
- Drupal 11
- PHP 8.3+ (the minimum version Drupal 11 supports)
- Composer
- A taxonomy vocabulary (example: categories)
- An OpenAI API key
Step 1: Create the Custom Module
Create a new folder:
/modules/custom/ai_category/
Inside it, create the following files:
- ai_category.info.yml
- ai_category.module
ai_category.info.yml
name: AI Category Recommendation
type: module
description: Automatically recommend and assign taxonomy categories using AI.
core_version_requirement: ^11
package: Custom
version: 1.0.0
Step 2: Hook Into Node Save
We’ll use hook_entity_presave() to analyze content before it’s stored.
ai_category.module
use Drupal\Core\Entity\EntityInterface;

/**
 * Implements hook_entity_presave().
 */
function ai_category_entity_presave(EntityInterface $entity) {
  if ($entity->getEntityTypeId() !== 'node') {
    return;
  }

  // Only apply to articles (adjust as needed).
  if ($entity->bundle() !== 'article') {
    return;
  }

  $body = $entity->get('body')->value ?? '';
  if (empty($body)) {
    return;
  }

  // Returns a taxonomy term ID, or null when no category matched.
  $category = ai_category_recommend_term($body);
  if ($category) {
    $entity->set('field_category', ['target_id' => $category]);
  }
}
This ensures our logic runs only for specific content types and avoids unnecessary processing.
Step 3: Ask AI for Category Recommendation
We’ll send the node content plus a list of available categories to OpenAI and ask it to pick the best one.
function ai_category_recommend_term(string $text): ?int {
  // Read the key from the environment instead of hard-coding it.
  $apiKey = getenv('OPENAI_API_KEY');
  if (!$apiKey) {
    return null;
  }

  $endpoint = 'https://api.openai.com/v1/chat/completions';

  // loadTree() returns plain objects with ->tid and ->name properties.
  $terms = \Drupal::entityTypeManager()
    ->getStorage('taxonomy_term')
    ->loadTree('categories');
  $categoryNames = array_map(fn($t) => $t->name, $terms);

  // Cap the content length so large nodes do not inflate the request.
  $prompt = "Choose the best category from this list:\n"
    . implode(', ', $categoryNames)
    . "\n\nContent:\n"
    . mb_substr(strip_tags($text), 0, 4000)
    . "\n\nReturn only the category name.";

  $payload = [
    "model" => "gpt-4o-mini",
    "messages" => [
      ["role" => "system", "content" => "You are a content classification assistant."],
      ["role" => "user", "content" => $prompt],
    ],
    "temperature" => 0,
  ];

  $ch = curl_init($endpoint);
  curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER => [
      "Content-Type: application/json",
      "Authorization: Bearer {$apiKey}",
    ],
    CURLOPT_POST => true,
    CURLOPT_POSTFIELDS => json_encode($payload),
    CURLOPT_TIMEOUT => 15,
  ]);
  $response = curl_exec($ch);
  curl_close($ch);

  // A failed request must not block the node from being saved.
  if ($response === false) {
    return null;
  }

  $data = json_decode($response, true);
  $chosen = trim($data['choices'][0]['message']['content'] ?? '');

  // Map the returned name back to a term ID (case-insensitive match).
  foreach ($terms as $term) {
    if (strcasecmp($term->name, $chosen) === 0) {
      return (int) $term->tid;
    }
  }

  return null;
}
What’s happening here:
- Drupal loads all available categories
- AI receives both content + allowed categories
- AI returns one matching category name
- Drupal maps it back to a taxonomy term ID
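The last step, mapping the reply back to a term, is where a chat model's formatting habits can cause a silent miss: a reply wrapped in quotes or ending in a period will not match a plain case-insensitive comparison. A minimal sketch of a more forgiving matcher follows. The function name and the [tid => name] input shape are assumptions for illustration, and it runs without Drupal.

```php
<?php
declare(strict_types=1);

/**
 * Map the model's free-text reply onto one of the allowed terms.
 * $terms is a hypothetical [tid => name] map built from the loaded vocabulary.
 */
function matchTermId(array $terms, string $reply): ?int
{
    // Strip a leading "Category:" label, surrounding quotes, and trailing periods.
    $normalize = static function (string $s): string {
        $s = preg_replace('/^\s*category\s*:\s*/i', '', $s) ?? $s;
        return mb_strtolower(trim($s, " \t\n\r\0\x0B\"'.`"));
    };

    $wanted = $normalize($reply);
    if ($wanted === '') {
        return null;
    }

    foreach ($terms as $tid => $name) {
        if ($normalize((string) $name) === $wanted) {
            return (int) $tid;
        }
    }

    // The reply was not one of the allowed categories.
    return null;
}
```

Returning null for anything outside the allowed list keeps the model from inventing categories: an unrecognized reply simply leaves the field untouched.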
Step 4: Enable the Module
- Place the module in /modules/custom/ai_category
- Go to Extend → Enable module
- Enable AI Category Recommendation
That's it. No UI is needed yet.
Step 5: Test It
- Create a new Article
- Write content related to PHP, Drupal, AI, or CMS topics
- Click Save
- The Category field is auto-filled
Example:
Article content:
“This tutorial explains how to build a custom Drupal 11 module using PHP hooks…”
AI-selected category:
Drupal
Optional Enhancements
Once the basics work, you can extend this system:
- Show AI recommendation as a suggestion, not auto-assignment
- Add admin settings (API key, confidence threshold)
- Use Queue API for bulk classification
- Switch to embeddings for higher accuracy
- Log category confidence scores
- Support multi-term assignment
Security & Performance Tips
- Never hard-code API keys (use settings.php or environment variables)
- Limit text length before sending to AI
- Cache recommendations to reduce API calls
- Add fallbacks if the AI response is invalid
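As a rough sketch of the caching tip, the wrapper below keys each recommendation by a hash of the cleaned text, so saving the same content twice triggers only one API call. The $classifier callable stands in for the API-backed function and the array stands in for a persistent cache backend; both are assumptions for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Return a cached recommendation when the same text was classified before,
 * otherwise call the classifier once and remember a successful answer.
 */
function recommendWithCache(
    string $text,
    callable $classifier,
    array &$cache,
    int $maxChars = 4000
): ?int {
    // Limit the text length before it is hashed or sent anywhere.
    $input = mb_substr(trim(strip_tags($text)), 0, $maxChars);
    if ($input === '') {
        return null;
    }

    $key = hash('sha256', $input);
    if (isset($cache[$key])) {
        return $cache[$key];
    }

    $result = $classifier($input);

    // Only cache real matches, so a failed call is retried on the next save.
    if ($result !== null) {
        $cache[$key] = $result;
    }

    return $result;
}
```

In a real module the array would be replaced by Drupal's cache service, and the classifier would be the function that calls OpenAI.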
AI Auto-Tagging in Laravel Using OpenAI Embeddings + Cron Jobs
Manually tagging blog posts works fine when you have ten articles. At a hundred, it gets inconsistent. At a thousand, it's basically broken. Tags get applied differently depending on who wrote the post, and over time your taxonomy becomes a mess that's hard to search and harder to maintain.
I wanted a way to fix this without retagging everything by hand. The approach I landed on uses OpenAI embeddings to represent both post content and tag names as vectors, then assigns tags based on how closely they match in meaning.
The whole thing runs as a Laravel queue job triggered by a cron, so new posts get tagged automatically without any manual step.
In this tutorial I'll walk you through the full setup: generating tag vectors, storing post embeddings, running the cosine similarity match, and wiring it all together with Laravel's scheduler.
What We're Building
You'll build:
- A tag vector table - Each tag (such as "PHP", "Laravel", "Security", and "AI") is represented by an AI-generated embedding vector that captures its meaning.
- A post embedding generator - Whenever a new post is saved, we generate an embedding for its content.
- A matching algorithm - The system compares each post embedding with the tag embeddings to find the closest tags.
- A cron job - The system automatically assigns AI-recommended tags every hour (or on any schedule).
This is ideal for:
- Custom Laravel blogs
- Headless CMS setups
- E-commerce category tagging
- Knowledge base auto-classification
- Documentation sites
Now let's get started.
Step 1: Create Migration for Tag Embeddings
Run:
php artisan make:migration create_tag_embeddings_table
Migration:
public function up()
{
    Schema::create('tag_embeddings', function (Blueprint $table) {
        $table->id();
        $table->unsignedBigInteger('tag_id')->unique();
        $table->json('embedding'); // store vector
        $table->timestamps();
    });
}
Run:
php artisan migrate
Step 2: Generate Embeddings for Tags
Create a command:
php artisan make:command GenerateTagEmbeddings
Add the signature and logic:
// The signature must match the name used to run the command.
protected $signature = 'generate:tag-embeddings';

public function handle()
{
    $tags = Tag::all();

    foreach ($tags as $tag) {
        $vector = $this->embed($tag->name);

        TagEmbedding::updateOrCreate(
            ['tag_id' => $tag->id],
            ['embedding' => json_encode($vector)]
        );

        $this->info("Embedding created for tag: {$tag->name}");
    }
}

private function embed($text)
{
    $client = new \GuzzleHttp\Client();

    $response = $client->post("https://api.openai.com/v1/embeddings", [
        "headers" => [
            // env() returns null once the config is cached; read the key through a config value in production.
            "Authorization" => "Bearer " . env('OPENAI_API_KEY'),
            "Content-Type" => "application/json",
        ],
        "json" => [
            "model" => "text-embedding-3-large",
            "input" => $text,
        ],
    ]);

    $data = json_decode((string) $response->getBody(), true);

    return $data['data'][0]['embedding'] ?? [];
}
Run once:
php artisan generate:tag-embeddings
Now every tag has an embedding vector that represents its meaning.
Step 3: Save Embeddings for Each Post
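Add this to your Post model observer or event listener. Encode the vector as JSON so it matches the json_decode() call in the tagging job, and use saveQuietly() so the save does not fire the observer again:
$post->embedding = json_encode($this->embed($post->content));
$post->saveQuietly();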
Migration for posts:
$table->json('embedding')->nullable();
Step 4: Matching Algorithm (Post → Tags)
Create a helper class:
class EmbeddingHelper
{
    public static function cosineSimilarity($a, $b)
    {
        $dot = array_sum(array_map(fn($i, $j) => $i * $j, $a, $b));
        $magnitudeA = sqrt(array_sum(array_map(fn($i) => $i * $i, $a)));
        $magnitudeB = sqrt(array_sum(array_map(fn($i) => $i * $i, $b)));
        $denominator = $magnitudeA * $magnitudeB;

        // Guard against empty or zero vectors to avoid a division by zero.
        return $denominator > 0 ? $dot / $denominator : 0.0;
    }
}
Step 5: Assign Tags Automatically (Queue Job)
Create job:
php artisan make:job AutoTagPost
Job logic:
public function handle()
{
    $postEmbedding = json_decode($this->post->embedding ?? '[]', true);

    // Nothing to compare if the post has no embedding yet.
    if (empty($postEmbedding)) {
        return;
    }

    $tags = TagEmbedding::with('tag')->get();
    $scores = [];

    foreach ($tags as $te) {
        $sim = EmbeddingHelper::cosineSimilarity(
            $postEmbedding,
            json_decode($te->embedding, true)
        );
        $scores[$te->tag->id] = $sim;
    }

    arsort($scores); // highest similarity first
    $best = array_slice($scores, 0, 5, true); // top 5 matches

    $this->post->tags()->sync(array_keys($best));
}
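Taking the top five unconditionally means a post always receives five tags, even when none of them is a good match. A small variation, sketched below, also drops scores under a minimum value. The function name is hypothetical and the 0.30 default is an arbitrary placeholder that would need tuning against real data.

```php
<?php
declare(strict_types=1);

/**
 * Pick the tag IDs to attach from a [tagId => similarity] map.
 * Keeps at most $limit tags and drops anything below $minScore.
 */
function selectTags(array $scores, int $limit = 5, float $minScore = 0.30): array
{
    // Remove weak matches first so they can never fill an empty slot.
    $kept = array_filter($scores, fn($s) => $s >= $minScore);

    arsort($kept); // highest similarity first, keys preserved

    return array_keys(array_slice($kept, 0, $limit, true));
}

$scores = [1 => 0.62, 2 => 0.12, 3 => 0.45, 4 => 0.31, 5 => 0.29];
print_r(selectTags($scores, 2)); // tag IDs 1 and 3
```

In the job, the result could be passed straight to sync(), so a post with no strong match simply ends up with fewer tags.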
Step 6: Cron Job to Process New Posts
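Add to app/Console/Kernel.php (Laravel 11 and later no longer ship this file by default; define the schedule in routes/console.php instead):
protected function schedule(Schedule $schedule)
{
    $schedule->command('ai:autotag-posts')->hourly();
}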
Create command:
php artisan make:command AutoTagPosts
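Command logic (this assumes a nullable tags_assigned_at timestamp column on the posts table):
// The signature must match the command name used in the scheduler.
protected $signature = 'ai:autotag-posts';

public function handle()
{
    $posts = Post::whereNull('tags_assigned_at')->get();

    foreach ($posts as $post) {
        AutoTagPost::dispatch($post);
        $post->update(['tags_assigned_at' => now()]);
    }
}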
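Now, every hour, Laravel processes all new posts and assigns AI-selected tags. This relies on a system cron entry that runs php artisan schedule:run every minute and on a running queue worker to execute the dispatched jobs.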
Step 7: Test the Full Flow
- Create tags in admin
- Run: php artisan generate:tag-embeddings
- Create a new blog post
- Cron or queue runs
- Post automatically gets AI-selected tags
Useful Enhancements
- Weight tags by frequency
- Use title + excerpt, not full content
- Add confidence scores to DB
- Auto-create new tags using AI
- Add a manual override UI
- Cache embeddings for performance
- Batch process 1,000+ posts
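For the batch-processing idea, the embeddings endpoint accepts an array of inputs in a single request and tags each returned vector with an index. A minimal sketch of the bookkeeping follows: splitting posts into request payloads and mapping a decoded response back to post IDs. The function names and the batch size of 100 are assumptions for illustration, and the HTTP call itself is left out.

```php
<?php
declare(strict_types=1);

/**
 * Split [postId => text] into request payloads of at most $batchSize inputs.
 * Each batch keeps the post IDs in input order so results can be mapped back.
 */
function buildEmbeddingBatches(
    array $texts,
    int $batchSize = 100,
    string $model = 'text-embedding-3-large'
): array {
    $batches = [];

    foreach (array_chunk($texts, max(1, $batchSize), true) as $chunk) {
        $batches[] = [
            'post_ids' => array_keys($chunk),
            'payload'  => ['model' => $model, 'input' => array_values($chunk)],
        ];
    }

    return $batches;
}

/** Map one decoded API response back to [postId => vector] using each item's "index". */
function mapBatchResponse(array $postIds, array $response): array
{
    $out = [];

    foreach ($response['data'] ?? [] as $item) {
        $i = $item['index'] ?? null;
        if (is_int($i) && isset($postIds[$i], $item['embedding'])) {
            $out[$postIds[$i]] = $item['embedding'];
        }
    }

    return $out;
}
```

Each payload can be sent with the same Guzzle call used for tags, which cuts the number of HTTP requests roughly by the batch size.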