Text Preprocessing in Python Using Regex

Text data often contains noise such as special characters, unnecessary numbers or irrelevant symbols. To extract meaningful information, preprocessing is required. Python's built-in re module helps clean and transform text efficiently by matching patterns and removing unwanted elements.

Removing Special Characters

To remove unwanted characters from a text:

We can define a pattern that excludes certain characters by placing ^ at the start of a character class in square brackets [].
The + symbol indicates one or more occurrences of the preceding element.

The code below finds every run of characters that are not in the specified set of special characters and joins the remaining parts into a single string.

re.findall() returns a list of all non-overlapping substrings that match the regex pattern.

Python

import re

text = "Eshant @# is happ$y!"
cleaned = re.findall(r'[^@#$!]+', text)
print(cleaned)
print(''.join(cleaned))

Output:

['Eshant ', ' is happ', 'y']
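Eshant  is happy

Here, the pattern matched only the runs of characters that are not special characters, and join() combined those pieces into a single string. Because the spaces on both sides of "@#" are kept, the result contains two consecutive spaces; whitespace normalization is covered in the Removing Extra Spaces section.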
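The same cleanup can also be written as a substitution instead of a findall() plus join(). The following is a minimal sketch, assuming the same set of special characters as above; the variable names no_specials and cleaned are illustrative only.

```python
import re

text = "Eshant @# is happ$y!"

# Delete every character in the set; inside [] the characters @ # $ ! are literals.
no_specials = re.sub(r'[@#$!]', '', text)

# Collapse the run of spaces left behind where "@#" used to be.
cleaned = re.sub(r' +', ' ', no_specials).strip()

print(repr(no_specials))  # 'Eshant  is happy' (two spaces)
print(cleaned)            # Eshant is happy
```

Using re.sub() avoids building an intermediate list, which may read more directly when the goal is deletion rather than extraction.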

Excluding Specific Characters

Regex provides shorthand symbols to include or exclude certain character types:

\d: matches any digit
\D: matches any character that is not a digit
\w: matches any word character (letters, digits and the underscore)
\W: matches any character that is not a word character

The code below extracts every run of non-digit characters from the text, effectively removing the numbers.

Python

text = "I'm Ashish and 123"
print(re.findall(r'\D+', text))

Output:

["I'm Ashish and "]

Using \D+ excluded all digits and kept the rest of the text.
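To see how the four shorthand classes differ, it can help to run them side by side on one string. This is a small illustrative sketch; the sample string and variable names are made up for the comparison.

```python
import re

sample = "Room_42 costs $99!"

digits = re.findall(r'\d+', sample)      # runs of digits
non_digits = re.findall(r'\D+', sample)  # runs of anything except digits
words = re.findall(r'\w+', sample)       # runs of letters, digits and underscores
non_words = re.findall(r'\W+', sample)   # runs of everything else

print(digits)      # ['42', '99']
print(non_digits)  # ['Room_', ' costs $', '!']
print(words)       # ['Room_42', 'costs', '99']
print(non_words)   # [' ', ' $', '!']
```

Note that the underscore counts as a word character, so "Room_42" stays in one piece under \w+.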

Finding Patterns

Regular expressions allow you to extract specific patterns from text, such as URLs, email addresses or words with hyphens. The code below uses a regular expression to find and extract words connected by hyphens, covering patterns with one or two hyphens.

Python

text = "Geeks-for-Geeks works-it-out wklfd-dfjgk-fjkds"
patterns = re.findall(r'\w+-\w+(?:-\w+)?', text)
print(patterns)

Output:

['Geeks-for-Geeks', 'works-it-out', 'wklfd-dfjgk-fjkds']
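URLs can be extracted in a similar way. The sketch below assumes a deliberately simple pattern (http or https followed by non-whitespace characters); it is not a full URL validator and would also capture trailing punctuation attached to a link.

```python
import re

text = "Docs live at https://docs.python.org/3/library/re.html and http://example.com today"

# https? matches "http" or "https"; \S+ consumes everything up to the next whitespace.
urls = re.findall(r'https?://\S+', text)
print(urls)
# ['https://docs.python.org/3/library/re.html', 'http://example.com']
```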

Phone Number Pattern Matching

Regular expressions can be used to identify and extract phone numbers from text by matching specific digit patterns, making it easier to locate contact information during text preprocessing.

The code below extracts phone-number-like patterns using regex and then removes the hyphens to print only the digits.

Python

for no in ["64-6534-342", "543-5345-645", "4563-453-445", "53-5453-5345", "435-234-6324"]:
    print(re.findall(r'\d+-\d+-\d+', no)[0].replace('-', ''))

Output:

646534342
5435345645
4563453445
5354535345
4352346324

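Indexing the result of re.findall() with [0] raises an IndexError when a string contains no phone-like pattern. A more defensive variant could look like the sketch below; the helper name extract_digits and the sample inputs are hypothetical.

```python
import re

# Same digit-hyphen-digit-hyphen-digit shape as above, compiled once for reuse.
PHONE_RE = re.compile(r'\d+-\d+-\d+')

def extract_digits(candidate):
    """Return the digits of the first phone-like match, or None if there is none."""
    match = PHONE_RE.search(candidate)
    if match is None:
        return None
    return match.group(0).replace('-', '')

for candidate in ["435-234-6324", "call 64-6534-342 now", "no number here"]:
    print(extract_digits(candidate))
# 4352346324
# 646534342
# None
```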
Removing Extra Spaces

Text often contains multiple spaces or inconsistent spacing. Regex can normalize whitespace.

re.sub() replaces all occurrences of a regex pattern with a specified string.

\s+ matches one or more whitespace characters (spaces, tabs and newlines).
.strip() removes leading and trailing whitespace.

Python

text = "This   is  a    messy   text"
clean_text = re.sub(r'\s+', ' ', text).strip()
print(clean_text)

Output:

This is a messy text
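Because \s also covers tabs and line breaks, the same substitution handles mixed whitespace. The sketch below uses a made-up input and compares the regex with the split-and-join idiom, which gives the same result for this string.

```python
import re

text = "  This\tis \n a   messy\r\n text  "

# Regex approach: every run of whitespace becomes one space, then trim the ends.
with_regex = re.sub(r'\s+', ' ', text).strip()

# Plain string approach: split() with no arguments splits on any whitespace run.
with_split = ' '.join(text.split())

print(with_regex)  # This is a messy text
print(with_split)  # This is a messy text
```

The regex form is easier to adapt, for example to r'[ \t]+' if line breaks should be preserved.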

Lowercasing Text

Converting all text to lowercase ensures uniformity and prevents tokens that differ only in case from being treated as distinct in NLP tasks.

.lower() converts all characters to lowercase.
This standardizes text for NLP or comparison tasks.

Python

text = "Python Is Fun!"
clean_text = text.lower()
print(clean_text)

Output:

python is fun!
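For text that is not purely English, str.casefold() is a more aggressive alternative to .lower() for case-insensitive comparison. The following sketch uses an illustrative German example and is only relevant if the data may contain such characters.

```python
words = ["Straße", "STRASSE", "strasse"]

# .lower() keeps "ß", so two distinct forms remain.
lowered = sorted({w.lower() for w in words})

# .casefold() maps "ß" to "ss", so all three collapse to one form.
folded = sorted({w.casefold() for w in words})

print(lowered)  # ['strasse', 'straße']
print(folded)   # ['strasse']
```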

Removing Punctuation

Removing punctuation while preserving words and spaces is useful for tokenization or counting.

[^\w\s] matches every character that is not a word character (letter, digit or underscore) or whitespace.
re.sub() replaces each match with an empty string, keeping words and spaces intact.

Python

text = "Hello, world! How's it going?"
clean_text = re.sub(r'[^\w\s]', '', text)
print(clean_text)

Output:

Hello world Hows it going
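Two variations are often useful: keeping apostrophes so contractions stay readable, and removing underscores, which \w treats as word characters. The sketch below shows one possible way to do each; the second sample string is made up.

```python
import re

text = "Hello, world! How's it going?"

# Adding ' to the negated class keeps apostrophes in place.
keep_apostrophes = re.sub(r"[^\w\s']", '', text)
print(keep_apostrophes)  # Hello world How's it going

# Underscores survive [^\w\s]; the alternation |_ removes them as well.
no_underscores = re.sub(r'[^\w\s]|_', '', "snake_case, please!")
print(no_underscores)  # snakecase please
```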

Extracting Email Addresses

Finding all email addresses in a text is useful for data extraction or contact information retrieval.

[\w.-]+@[\w.-]+\.\w+ matches typical email patterns; the \b anchors used in the code require each match to start and end at a word boundary.
re.findall() returns a list of email addresses.

Python

text = "Contact us at support@example.com or admin@site.org"
emails = re.findall(r'\b[\w.-]+@[\w.-]+\.\w+\b', text)
print(emails)

Output:

['support@example.com', 'admin@site.org']
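Extracted addresses often contain duplicates that differ only in case. The sketch below reuses the pattern above and removes such duplicates while keeping first-seen order. Lowercasing the whole address is an assumption that suits most practical data, and the sample text is made up.

```python
import re

# Same pattern as above, compiled once so it can be reused.
EMAIL_RE = re.compile(r'\b[\w.-]+@[\w.-]+\.\w+\b')

text = "Write to Support@Example.com, support@example.com or admin@site.org."

found = EMAIL_RE.findall(text)

# dict.fromkeys() drops duplicates while preserving insertion order.
unique = list(dict.fromkeys(email.lower() for email in found))

print(found)   # ['Support@Example.com', 'support@example.com', 'admin@site.org']
print(unique)  # ['support@example.com', 'admin@site.org']
```

The trailing comma and period are not captured because each match has to end at a word boundary.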

Replacing Words or Patterns

re.sub() can replace specific words or patterns in text, with an option for case-insensitive matching.

cats? matches "cat" or "cats"; the re.IGNORECASE flag makes the match ignore case.
re.sub() replaces every match with "dogs", so the capitalization of the original word is not preserved.

Python

text = "I love cats. Cats are amazing."
clean_text = re.sub(r'cats?', 'dogs', text, flags=re.IGNORECASE)
print(clean_text)

Output:

I love dogs. dogs are amazing.
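The pattern cats? also matches inside longer words such as "catalog", and a fixed replacement string loses capitalization. One possible refinement, sketched below with a hypothetical helper named swap_preserving_case, is to add \b word boundaries and pass a function to re.sub().

```python
import re

def swap_preserving_case(match):
    """Replace cat/cats with dog/dogs, keeping a leading capital letter."""
    word = match.group(0)
    replacement = 'dogs' if word.lower().endswith('s') else 'dog'
    return replacement.capitalize() if word[0].isupper() else replacement

text = "I love cats. Cats are amazing. My cat likes the catalog."

# \b on both sides restricts the match to whole words.
result = re.sub(r'\bcats?\b', swap_preserving_case, text, flags=re.IGNORECASE)
print(result)
# I love dogs. Dogs are amazing. My dog likes the catalog.
```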

Removing HTML Tags

Text collected from web pages often contains HTML tags that are not useful for analysis. Regex can be used to remove these tags and extract only the readable content.

<.*?> matches any HTML tag enclosed within < >; the ? makes the match non-greedy, so it stops at the first closing >.
re.sub() removes all matched tags, leaving only plain text.

Python

import re

text = "<p>Hello <b>World</b>, welcome to <a href='#'>Python</a></p>"
clean_text = re.sub(r'<.*?>', '', text)
print(clean_text)

Output:

Hello World, welcome to Python
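Removing tags leaves HTML entities such as &amp; untouched and can glue together words that were separated only by a tag. The sketch below shows one way to handle both, using html.unescape() from the standard library on a made-up input. For malformed or complex markup, a real HTML parser is likely a safer choice than a regex.

```python
import html
import re

text = "<p>Tom &amp; Jerry<br/>are <b>back</b></p>"

# Replace each tag with a space so "Jerry<br/>are" does not become "Jerryare".
no_tags = re.sub(r'<.*?>', ' ', text)

# Convert entities such as &amp; back to their characters.
decoded = html.unescape(no_tags)

# Collapse the extra spaces introduced by the tag replacement.
clean_text = re.sub(r'\s+', ' ', decoded).strip()
print(clean_text)  # Tom & Jerry are back
```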

Combining Multiple Preprocessing Steps

This approach applies several cleaning operations, such as lowercasing, removing digits and punctuation, and normalizing spaces, in a single function to produce clean, ready-to-use text for NLP or machine learning tasks.

Python

import re

def preprocess_text(text):
    text = text.lower()                       # lowercase everything
    text = re.sub(r'\d+', '', text)           # remove digits
    text = re.sub(r'[^\w\s]', '', text)       # remove punctuation
    text = re.sub(r'\s+', ' ', text).strip()  # normalize whitespace
    return text

text = "Hello!!! My number is 12345.   Welcome..."
clean_text = preprocess_text(text)
print(clean_text)

Output:

hello my number is welcome
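When the same cleaning is applied to many documents, the patterns can be compiled once and individual steps made optional. The following is a sketch of that idea; the function name preprocess, its keyword arguments and the second sample string are illustrative assumptions, not part of the function above.

```python
import re

# Compile each pattern once instead of on every call.
DIGITS_RE = re.compile(r'\d+')
PUNCT_RE = re.compile(r'[^\w\s]')
SPACE_RE = re.compile(r'\s+')

def preprocess(text, lowercase=True, drop_digits=True, drop_punctuation=True):
    """Apply the selected cleaning steps, then always normalize whitespace."""
    if lowercase:
        text = text.lower()
    if drop_digits:
        text = DIGITS_RE.sub('', text)
    if drop_punctuation:
        text = PUNCT_RE.sub('', text)
    return SPACE_RE.sub(' ', text).strip()

samples = ["Hello!!! My number is 12345.   Welcome...", "Order #42 shipped"]
print([preprocess(s) for s in samples])
# ['hello my number is welcome', 'order shipped']
print(preprocess(samples[1], drop_digits=False))
# order 42 shipped
```

Keeping digits optional matters for data such as order numbers or dates, where removing them would discard information.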
