Regular Expressions (Regex): A Comprehensive Guide

Regular expressions (regex) are a powerful tool for defining search patterns in strings. They let us find, validate, and manipulate text by describing the format or structure we are looking for. Python supports regular expressions through the built-in “re” module, which makes them easy to use in our applications.

What Are Regular Expressions?

A regular expression is a declarative way to describe a set of strings by a pattern or rule. Some common use cases include:

Validating phone numbers, email addresses, and other input formats.
Implementing search and pattern-matching logic (e.g., “Find” or “Grep”).
Building interpreters and compilers.
Designing communication protocols and digital circuits.

Key Applications of Regular Expressions

Validation Frameworks: Ensure data inputs follow specific formats (e.g., email validation).
Pattern Matching: Tools such as grep on Unix-like systems or the Find (Ctrl + F) command in editors and browsers.
Language Translators: Used in compilers and interpreters.
Digital Circuit Design: Mapping and validating circuit designs.
Communication Protocols: Ensuring correct patterns in data transmission, such as TCP/IP.

Working with the ‘re’ module

Python’s “re” module provides a number of functions that make working with regular expressions convenient.

Basic Functions in re Module

re.compile(): Compiles a pattern into a regex object.

import re
pattern = re.compile("ab")

re.finditer(): Returns an iterator yielding match objects.

matcher = pattern.finditer("abaababa")
for match in matcher:
    print(match.start(), match.end(), match.group())

Output:

0 2 ab
3 5 ab
5 7 ab
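
The pattern "ab" contains no backslashes, but most real patterns do, and Python's own string escapes can silently change them before the regex engine sees them. The following is a minimal sketch (the sample text is made up for illustration) of why patterns are usually written as raw strings:

```python
import re

text = "a cat sat"

# In a normal string literal, "\b" is a single backspace character,
# so this pattern looks for backspace + "cat" + backspace and finds nothing.
plain = re.findall("\bcat\b", text)

# In a raw string, r"\b" stays as backslash + "b", which the regex
# engine reads as a word boundary.
raw = re.findall(r"\bcat\b", text)

print(len("\b"), len(r"\b"))  # 1 2
print(plain)                  # []
print(raw)                    # ['cat']
```

Using the r prefix for every pattern that contains a backslash avoids this class of bug.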

Example: Counting Occurrences of a Pattern

import re

count = 0
pattern = re.compile("ab")
matcher = pattern.finditer("abaababa")
for match in matcher:
    count += 1
    print(match.start(), "...", match.end(), "...", match.group())
print("The number of occurrences:", count)

Output:

0 ... 2 ... ab
3 ... 5 ... ab
5 ... 7 ... ab
The number of occurrences: 3
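
Note that finditer() counts non-overlapping matches only. If overlapping occurrences matter, one possible approach is a zero-width lookahead; the sketch below uses the pattern "aba" (chosen here for illustration, since "ab" cannot overlap with itself):

```python
import re

text = "abaababa"

# finditer() resumes scanning after the end of each match,
# so matches never overlap.
non_overlapping = sum(1 for _ in re.finditer("aba", text))

# A lookahead matches at a position without consuming characters,
# so every starting position is tested.
overlapping = sum(1 for _ in re.finditer("(?=aba)", text))

print(non_overlapping)  # 2
print(overlapping)      # 3
```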

Character Classes

Character classes allow for matching specific groups of characters:

[abc]: Matches any character ‘a’, ‘b’, or ‘c’.
[^abc]: Matches any character except ‘a’, ‘b’, or ‘c’.
[a-z]: Matches any lowercase letter.
[A-Z]: Matches any uppercase letter.
[a-zA-Z0-9]: Matches any alphanumeric character.
[^a-zA-Z0-9]: Matches any non-alphanumeric character (special characters).

import re

text = "abcABC123!@#"

patterns = {
    "Matches 'a', 'b', or 'c'": "[abc]",
    "Matches any character except 'a', 'b', or 'c'": "[^abc]",
    "Matches any lowercase letter": "[a-z]",
    "Matches any uppercase letter": "[A-Z]",
    "Matches any alphanumeric character": "[a-zA-Z0-9]",
    "Matches any non-alphanumeric character": "[^a-zA-Z0-9]"
}

# Find matches for each pattern
for description, pattern in patterns.items():
    matches = re.findall(pattern, text)
    print(f"{description}: {matches}")

Output:

Matches 'a', 'b', or 'c': ['a', 'b', 'c']
Matches any character except 'a', 'b', or 'c': ['A', 'B', 'C', '1', '2', '3', '!', '@', '#']
Matches any lowercase letter: ['a', 'b', 'c']
Matches any uppercase letter: ['A', 'B', 'C']
Matches any alphanumeric character: ['a', 'b', 'c', 'A', 'B', 'C', '1', '2', '3']
Matches any non-alphanumeric character: ['!', '@', '#']
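
Ranges can be combined inside one class, and a hyphen placed first is treated as a literal character. A small sketch of both ideas follows; the helper name is_hex is hypothetical and not part of the re module:

```python
import re

# Several ranges in one class: hexadecimal digits in either case.
HEX_DIGITS = re.compile(r"[0-9a-fA-F]+")

def is_hex(value):
    """Return True if value consists only of hexadecimal digits."""
    return HEX_DIGITS.fullmatch(value) is not None

print(is_hex("1A3f"))  # True
print(is_hex("12G4"))  # False

# A hyphen placed first in the class is a literal, not a range operator.
operators = re.findall(r"[-+*/]", "3+4-5*2/1")
print(operators)  # ['+', '-', '*', '/']
```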

Predefined Character Classes

\s: Matches any whitespace character.
\S: Matches any non-whitespace character.
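\d: Matches any decimal digit (0-9; for str patterns, other Unicode digits as well unless re.ASCII is set).
\D: Matches any non-digit.
\w: Matches any word character: a letter, a digit, or the underscore.
\W: Matches any non-word character.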
. : Matches any character except newline (\n).

import re

text = "Hello World 123! @#$"

patterns = {
    "Matches any whitespace character": r"\s",
    "Matches any non-whitespace character": r"\S",
    "Matches any digit (0-9)": r"\d",
    "Matches any non-digit": r"\D",
    "Matches any alphanumeric character": r"\w",
    "Matches any non-alphanumeric character": r"\W",
    "Matches any character except newline": r"."
}

for description, pattern in patterns.items():
    matches = re.findall(pattern, text)
    print(f"{description}: {matches}")

Output:

Matches any whitespace character: [' ', ' ', ' ']
Matches any non-whitespace character: ['H', 'e', 'l', 'l', 'o', 'W', 'o', 'r', 'l', 'd', '1', '2', '3', '!', '@', '#', '$']
Matches any digit (0-9): ['1', '2', '3']
Matches any non-digit: ['H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd', ' ', '!', ' ', '@', '#', '$']
Matches any alphanumeric character: ['H', 'e', 'l', 'l', 'o', 'W', 'o', 'r', 'l', 'd', '1', '2', '3']
Matches any non-alphanumeric character: [' ', ' ', '!', ' ', '@', '#', '$']
Matches any character except newline: ['H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd', ' ', '1', '2', '3', '!', ' ', '@', '#', '$']
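
The sample text above contains no underscores, accented letters, or newlines, so a few details of these classes do not show up in its output. The sketch below, using made-up inputs, illustrates how \w, \d, and . behave with and without the re.ASCII and re.DOTALL flags:

```python
import re

sample = "snake_case caf\u00e9 42"  # "\u00e9" is the letter é

# \w also matches the underscore and, for str patterns, non-ASCII letters.
unicode_words = re.findall(r"\w+", sample)
ascii_words = re.findall(r"\w+", sample, re.ASCII)
print(unicode_words)  # ['snake_case', 'café', '42']
print(ascii_words)    # ['snake_case', 'caf', '42']

# \d matches any Unicode decimal digit unless re.ASCII is set.
mixed = "4\u0663"  # ASCII 4 followed by ARABIC-INDIC DIGIT THREE
print(len(re.findall(r"\d", mixed)))            # 2
print(len(re.findall(r"\d", mixed, re.ASCII)))  # 1

# "." skips newlines unless re.DOTALL is set.
print(re.findall(r".", "a\nb"))             # ['a', 'b']
print(re.findall(r".", "a\nb", re.DOTALL))  # ['a', '\n', 'b']
```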

Quantifiers

Quantifiers specify how many times a character or group of characters must occur:

a: Matches exactly one ‘a’.
a+: Matches one or more ‘a’s.
a*: Matches zero or more ‘a’s.
a?: Matches zero or one ‘a’.
a{m}: Matches exactly ‘m’ occurrences of ‘a’.
a{m,n}: Matches between ‘m’ and ‘n’ occurrences of ‘a’ (both bounds inclusive).

import re
text = "aaabaaa  aaaaaaaa a a"

# Quantifiers
patterns = {
    "Matches exactly one 'a'": "a",
    "Matches one or more 'a's": "a+",
    "Matches zero or more 'a's": "a*",
    "Matches zero or one 'a'": "a?",
    "Matches exactly 3 occurrences of 'a'": "a{3}",
    "Matches between 2 and 4 occurrences of 'a'": "a{2,4}"
}

# Find matches for each pattern
for description, pattern in patterns.items():
    matches = re.findall(pattern, text)
    print(f"{description}: {matches}")

Output:

Matches exactly one 'a': ['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a']
Matches one or more 'a's: ['aaa', 'aaa', 'aaaaaaaa', 'a', 'a']
Matches zero or more 'a's: ['aaa', '', 'aaa', '', '', 'aaaaaaaa', '', 'a', '', 'a', '']
Matches zero or one 'a': ['a', 'a', 'a', '', 'a', 'a', 'a', '', '', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', '', 'a', '', 'a', '']
Matches exactly 3 occurrences of 'a': ['aaa', 'aaa', 'aaa', 'aaa']
Matches between 2 and 4 occurrences of 'a': ['aaa', 'aaa', 'aaaa', 'aaaa']
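
All of the quantifiers above are greedy: they take as many characters as they can. Appending ? to a quantifier makes it lazy. The sketch below, with an illustrative input string, contrasts the two and also shows the open-ended {m,} form:

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: ".+" grabs as much as it can, so the match runs to the last ">".
greedy = re.findall(r"<.+>", text)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Lazy: ".+?" stops at the first ">" that lets the pattern succeed.
lazy = re.findall(r"<.+?>", text)
print(lazy)  # ['<b>', '</b>', '<i>', '</i>']

# {m,} leaves the upper bound open: two or more 'a's.
at_least_two = re.findall(r"a{2,}", "a aa aaa")
print(at_least_two)  # ['aa', 'aaa']
```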

Matching at the Start or End of Strings

^x: Checks if the string starts with x.
x$: Checks if the string ends with x (a single trailing newline after x is still accepted).

import re

text = "Hello, World!"

# Patterns for matching at the start and end of the string
patterns = {
    "Checks if the string starts with 'H'": "^H",
    "Checks if the string ends with '!'": "!$",
    "Checks if the string starts with 'Hello'": "^Hello",
    "Checks if the string ends with 'World'": "World$"
}

# Find matches for each pattern
for description, pattern in patterns.items():
    match = re.search(pattern, text)
    if match:
        print(f"{description}: Match found")
    else:
        print(f"{description}: No match")

Output:

Checks if the string starts with 'H': Match found
Checks if the string ends with '!': Match found
Checks if the string starts with 'Hello': Match found
Checks if the string ends with 'World': No match
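
By default ^ and $ anchor to the whole string, not to individual lines. A short sketch of how the re.MULTILINE flag changes that, and of how $ treats a trailing newline, using made-up input:

```python
import re

text = "first line\nsecond line"

print(re.findall(r"^\w+", text))                # ['first']
print(re.findall(r"^\w+", text, re.MULTILINE))  # ['first', 'second']
print(re.findall(r"\w+$", text))                # ['line']
print(re.findall(r"\w+$", text, re.MULTILINE))  # ['line', 'line']

# "$" also matches just before a single trailing newline; \Z does not.
print(bool(re.search(r"World$", "Hello World\n")))   # True
print(bool(re.search(r"World\Z", "Hello World\n")))  # False
```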

Important Functions of the re Module

1. re.match(): Checks if the pattern matches at the start of the string.

m = re.match("abc", "abcde")
if m:
    print("Match found:", m.group())

Output:

Match found: abc

2. re.search(): Searches the string for the first occurrence of the pattern.

m = re.search("abc", "123abc456")
if m:
    print("Pattern found at:", m.start(), "-", m.end())

Output:

Pattern found at: 3 - 6
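
The difference between re.match() and re.search() is easiest to see on the same input. The sketch below reuses the string from the example above and adds re.fullmatch(), which requires the pattern to cover the entire string:

```python
import re

text = "123abc456"

at_start = re.match("abc", text)      # None: the string does not start with "abc"
anywhere = re.search("abc", text)     # match object spanning (3, 6)
whole = re.fullmatch("abc", text)     # None: the pattern must cover the whole string
whole_ok = re.fullmatch(r"\d+abc\d+", text)

print(at_start)         # None
print(anywhere.span())  # (3, 6)
print(whole)            # None
print(whole_ok is not None)  # True
```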

3. re.findall(): Returns a list of all matches in the string.

results = re.findall("[0-9]", "a7b9c5kz")
print(results)

Output:

['7', '9', '5']
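
The shape of the list returned by re.findall() depends on how many capturing groups the pattern has, which is a common source of surprises. A sketch with an illustrative input string:

```python
import re

text = "width=640, height=480"

# No group: each result is the whole match.
print(re.findall(r"\w+=\d+", text))      # ['width=640', 'height=480']
# One group: each result is that group's text only.
print(re.findall(r"\w+=(\d+)", text))    # ['640', '480']
# Several groups: each result is a tuple of the groups.
print(re.findall(r"(\w+)=(\d+)", text))  # [('width', '640'), ('height', '480')]
# A non-capturing group (?:...) groups without changing the result shape.
print(re.findall(r"(?:\w+)=\d+", text))  # ['width=640', 'height=480']
```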

4. re.finditer(): Returns an iterator yielding match objects.

for match in re.finditer("in", "The rain in Spain stays mainly in the plain."):
    print(f"Match found: '{match.group()}' at positions {match.start()} to {match.end()}")

Output:

Match found: 'in' at positions 6 to 8
Match found: 'in' at positions 9 to 11
Match found: 'in' at positions 15 to 17
Match found: 'in' at positions 26 to 28
Match found: 'in' at positions 31 to 33
Match found: 'in' at positions 41 to 43
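
Because finditer() yields full match objects, it pairs well with named groups. The following sketch assumes a simple key=value input invented for illustration:

```python
import re

text = "width=640, height=480"
pair = re.compile(r"(?P<key>\w+)=(?P<value>\d+)")

settings = {}
for m in pair.finditer(text):
    # group() accepts a group name; groupdict() returns all named groups.
    settings[m.group("key")] = int(m.group("value"))
    print(m.span(), m.groupdict())

print(settings)  # {'width': 640, 'height': 480}
```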

5. re.sub(): Replaces matches with a given replacement.

result = re.sub("[a-z]", "#", "a7b9c5kz")
print(result)

Output:

#7#9#5##
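
The replacement passed to re.sub() does not have to be a fixed string. As a sketch with made-up inputs, it can refer back to captured groups or be a function that computes each replacement:

```python
import re

# Backreferences (\1, \2, ...) in the replacement reuse captured groups.
iso = re.sub(r"(\d{2})-(\d{2})-(\d{4})", r"\3-\2-\1", "Due on 12-05-2024")
print(iso)  # Due on 2024-05-12

# A function receives each match object and returns the replacement text.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "a7b9c5")
print(doubled)  # a14b18c10
```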

6. re.subn(): Similar to re.sub(), but also returns the number of replacements made.

result, count = re.subn(r"[a-z]", "#", "a7b9c5kz")
print(result, count)

Output:

#7#9#5## 5
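
Both re.sub() and re.subn() accept a count argument that caps the number of replacements. A brief sketch based on the string used above:

```python
import re

# Replace only the first two lowercase letters.
limited, n_limited = re.subn(r"[a-z]", "#", "a7b9c5kz", count=2)
print(limited, n_limited)  # #7#9c5kz 2

# subn() reports zero when nothing matched, which makes it easy
# to test whether the string changed at all.
unchanged, n_unchanged = re.subn(r"[a-z]", "#", "12345")
print(unchanged, n_unchanged)  # 12345 0
```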

7. re.split(): Splits the string by occurrences of the pattern.

tokens = re.split(",", "apple,banana,cherry")
print(tokens)

Output:

['apple', 'banana', 'cherry']
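
For a single fixed delimiter, str.split() would do the same job. re.split() becomes useful when the delimiter is itself a pattern, as in this sketch with illustrative inputs:

```python
import re

# Split on runs of commas, semicolons, or whitespace.
items = re.split(r"[,;\s]+", "apple, banana;cherry  date")
print(items)  # ['apple', 'banana', 'cherry', 'date']

# maxsplit limits the number of splits; the remainder stays in the last item.
head_tail = re.split(",", "a,b,c", maxsplit=1)
print(head_tail)  # ['a', 'b,c']

# A capturing group keeps the delimiters in the result.
with_ops = re.split(r"([+-])", "1+2-3")
print(with_ops)  # ['1', '+', '2', '-', '3']
```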

Example Programs

1. Matching Phone Numbers

import re

phone_numbers = """
    123-456-7890
    (123) 456-7890
    123.456.7890
    1234567890
    123 456 7890
"""

# Raw string, so that Python does not try to interpret \( or \d as string escapes
pattern = r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'
matches = re.findall(pattern, phone_numbers)
print(matches)

Output:

['123-456-7890', '(123) 456-7890', '123.456.7890', '1234567890', '123 456 7890']
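
The pattern above treats each parenthesis as optional on its own, so it would also accept an unbalanced input such as "(123 456-7890". One possible stricter variant uses alternation; the names STRICT_PHONE and normalize_phone below are hypothetical helpers for this sketch:

```python
import re

# Either "(ddd)" or "ddd" for the area code, never a lone parenthesis.
STRICT_PHONE = re.compile(r"(?:\(\d{3}\)|\d{3})[-.\s]?\d{3}[-.\s]?\d{4}")

def normalize_phone(number):
    """Return the ten digits of a matching number, or None if it does not match."""
    if STRICT_PHONE.fullmatch(number) is None:
        return None
    return re.sub(r"\D", "", number)  # drop every non-digit character

print(normalize_phone("(123) 456-7890"))  # 1234567890
print(normalize_phone("123.456.7890"))    # 1234567890
print(normalize_phone("(123 456-7890"))   # None
```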

2. Extracting Dates

Extract dates in various common formats (e.g., dd-mm-yyyy or mm/dd/yyyy):

text = "The event will be held on 12-05-2024 and 05/12/2024."
date_pattern = r'\b\d{2}[-/]\d{2}[-/]\d{4}\b'

dates = re.findall(date_pattern, text)
print(dates)

Output:

['12-05-2024', '05/12/2024']
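
The pattern checks only the shape of a date, so a string like 99-99-2024 would also match. A minimal sketch of one way to filter such values, assuming day-first (dd-mm-yyyy) order; the function name extract_valid_dates is hypothetical:

```python
import re
from datetime import date

DATE_RE = re.compile(r"\b(\d{2})[-/](\d{2})[-/](\d{4})\b")

def extract_valid_dates(text):
    """Return the calendar-valid dates in text, read as day-month-year."""
    valid = []
    for m in DATE_RE.finditer(text):
        day, month, year = map(int, m.groups())
        try:
            valid.append(date(year, month, day))
        except ValueError:
            pass  # right shape, but not a real calendar date
    return valid

found = extract_valid_dates("Held on 12-05-2024, not on 99-99-2024.")
print([d.isoformat() for d in found])  # ['2024-05-12']
```

The regex alone cannot tell dd-mm-yyyy from mm-dd-yyyy; that choice has to come from knowledge of the data.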

3. Validating Passwords

password = "Password123"

pattern = r'^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)[A-Za-z\d]{8,20}$'

if re.fullmatch(pattern, password):
    print("Valid password")
else:
    print("Invalid password")

Output:

Valid password
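
Each (?=...) in the pattern is a lookahead that checks one requirement (a lowercase letter, an uppercase letter, a digit) without consuming characters, and [A-Za-z\d]{8,20} then restricts the allowed characters and the length. The sketch below runs the same pattern against several made-up candidates; ^ and $ are left out here on the assumption that re.fullmatch() already anchors both ends:

```python
import re

PASSWORD = re.compile(r"(?=.*[a-z])(?=.*[A-Z])(?=.*\d)[A-Za-z\d]{8,20}")

candidates = [
    "Password123",   # meets every rule
    "password123",   # no uppercase letter
    "PASSWORD123",   # no lowercase letter
    "Passwordabc",   # no digit
    "Pass1",         # too short
    "Password123!",  # "!" is outside the allowed character set
]

results = {c: PASSWORD.fullmatch(c) is not None for c in candidates}
for candidate, ok in results.items():
    print(candidate, ok)
```

Only the first candidate is accepted.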

4. Finding HTML Tags

html_text = "<h1>Title</h1><p>This is a paragraph.</p><a href='example.com'>Link</a>"

tag_pattern = '<[^>]+>'
tags = re.findall(tag_pattern, html_text)
print(tags)

Output:

['<h1>', '</h1>', '<p>', '</p>', "<a href='example.com'>", '</a>']
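
Capturing groups can pull out just the parts of a tag that are of interest. The sketch below extracts tag names and quoted attribute values from the same snippet; it is meant for small, well-formed fragments, and for real-world HTML a dedicated parser is generally more robust:

```python
import re

html_text = "<h1>Title</h1><p>This is a paragraph.</p><a href='example.com'>Link</a>"

# Capture the tag name; "/?" allows for closing tags.
tag_names = re.findall(r"</?([a-zA-Z][a-zA-Z0-9]*)", html_text)
print(tag_names)  # ['h1', 'h1', 'p', 'p', 'a', 'a']

# Capture attribute name/value pairs written with single or double quotes.
attributes = re.findall(r"""(\w+)=['"]([^'"]*)['"]""", html_text)
print(attributes)  # [('href', 'example.com')]
```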

5. Removing Extra Whitespace from a String

text = "   This   is   an   example.    "

# Collapse each run of whitespace into one space, then trim both ends
cleaned_text = re.sub(r'\s+', ' ', text).strip()
print(cleaned_text)

Output:

This is an example.

6. Finding Email Addresses

text = "Contact us at support@example.com or sales@example.org for more information."

email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
emails = re.findall(email_pattern, text)
print(emails)

Output:

['support@example.com', 'sales@example.org']
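
Extracting addresses from text and validating a single value are different tasks: re.findall() accepts any substring that fits, while re.fullmatch() requires the whole value to fit. A rough sketch of the contrast follows; the helper looks_like_email is hypothetical, and the pattern is a pragmatic approximation rather than a complete implementation of the email address standards:

```python
import re

EMAIL = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def looks_like_email(value):
    """Return True if the whole value fits the simplified email pattern."""
    return EMAIL.fullmatch(value) is not None

print(looks_like_email("support@example.com"))    # True
print(looks_like_email("sales@example"))          # False: no top-level domain
print(looks_like_email("two words@example.org"))  # False: contains a space

# findall() still extracts the part that fits the pattern.
print(EMAIL.findall("two words@example.org"))     # ['words@example.org']
```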

7. Matching Words with Specific Lengths

text = "This is a test with several four-letter words like time, play, and jump."

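# Raw string is required: in a normal string literal, '\b' is a backspace character
four_letter_words = re.findall(r'\b\w{4}\b', text)
print(four_letter_words)

Output:

['This', 'test', 'with', 'four', 'like', 'time', 'play', 'jump']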
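
The word length can be made a parameter by building the pattern with a raw f-string. In this sketch the function name words_of_length is hypothetical; note that "four" in "four-letter" counts as a word of its own, because the hyphen is not a word character:

```python
import re

def words_of_length(text, n):
    """Return the words in text that are exactly n word characters long."""
    # In an f-string, "{{" and "}}" produce literal braces,
    # so n=4 yields the pattern \b\w{4}\b.
    return re.findall(rf"\b\w{{{n}}}\b", text)

text = "This is a test with several four-letter words like time, play, and jump."
print(words_of_length(text, 4))  # ['This', 'test', 'with', 'four', 'like', 'time', 'play', 'jump']
print(words_of_length(text, 7))  # ['several']
```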

8. Checking if a String Contains Only Digits

string = "123456"

if re.fullmatch(r'\d+', string):
    print("String contains only digits.")
else:
    print("String contains non-digit characters.")

Output:

String contains only digits.
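
Two details are worth knowing when checking for digits this way: re.fullmatch() is stricter than re.match() with a trailing $, and \d is Unicode-aware for str patterns. A short sketch with made-up inputs:

```python
import re

# "$" tolerates one trailing newline, so match() with "$" accepts "123\n".
print(bool(re.match(r"\d+$", "123\n")))     # True
# fullmatch() requires the pattern to cover every character.
print(bool(re.fullmatch(r"\d+", "123\n")))  # False

# Three Arabic-Indic digits: \d accepts them unless re.ASCII is set.
arabic_indic = "\u0661\u0662\u0663"
print(bool(re.fullmatch(r"\d+", arabic_indic)))            # True
print(bool(re.fullmatch(r"\d+", arabic_indic, re.ASCII)))  # False
```

If only the characters 0-9 should be accepted, [0-9]+ or the re.ASCII flag states that intent explicitly.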

Conclusion

Regular expressions offer a flexible way to search, match, and manipulate strings in Python. They are especially useful for text-processing tasks such as validation, parsing, and string manipulation.