Learn Regular Expressions (Regex): A Comprehensive Guide

Regular expressions (RegEx) are a potent tool for defining search patterns in strings. They allow us to find, validate, and manipulate text by specifying a particular format or structure. Python provides built-in support for regular expressions via the “re” module, making it easy to integrate this functionality into our applications.

What Are Regular Expressions?

A regular expression is a declarative way to describe a set of strings that follow a specific pattern or rule. Some common use cases include:

  • Validating phone numbers, email addresses, and other input formats.
  • Implementing search and pattern-matching logic (e.g., “Find” or “Grep”).
  • Building interpreters and compilers.
  • Designing communication protocols and digital circuits.

Key Applications of Regular Expressions

  • Validation Frameworks: Ensure data inputs follow specific formats (e.g., email validation).
  • Pattern Matching: Common tools like grep on Unix-based systems or Ctrl + F on Windows.
  • Language Translators: Used in compilers and interpreters.
  • Digital Circuit Design: Mapping and validating circuit designs.
  • Communication Protocols: Ensuring correct patterns in data transmission, such as TCP/IP.

Working with the ‘re’ module

Python’s “re” module provides a number of functions for compiling patterns, searching text, and replacing or splitting on matches.

Basic Functions in re Module

re.compile(): Compiles a pattern into a regex object.

import re
pattern = re.compile("ab")

re.finditer(): Returns an iterator yielding match objects.

matcher = pattern.finditer("abaababa")
for match in matcher:
    print(match.start(), match.end(), match.group())

Example: Counting Occurrences of a Pattern

import re
count = 0
pattern = re.compile("ab")
matcher = pattern.finditer("abaababa")
for match in matcher:
    count += 1
    print(match.start(), "...", match.end(), "...", match.group())
print("The number of occurrences:", count)

#Output:

0 ... 2 ... ab
3 ... 5 ... ab
5 ... 7 ... ab
The number of occurrences: 3
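
The counting loop above can be wrapped in a small reusable function. This is a minimal sketch; the helper name count_occurrences is made up for illustration and is not part of the re module. It also shows that finditer() reports non-overlapping matches only.

```python
import re

def count_occurrences(pattern: str, text: str) -> int:
    """Count the non-overlapping matches of pattern in text."""
    return sum(1 for _ in re.finditer(pattern, text))

print(count_occurrences("ab", "abaababa"))  # 3
# Matches never overlap: "aaaa" contains "aa" at 0-2 and 2-4, not at 1-3
print(count_occurrences("aa", "aaaa"))      # 2
```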

Character Classes

Character classes allow for matching specific groups of characters:

  • [abc]: Matches any character ‘a’, ‘b’, or ‘c’.
  • [^abc]: Matches any character except ‘a’, ‘b’, or ‘c’.
  • [a-z]: Matches any lowercase letter.
  • [A-Z]: Matches any uppercase letter.
  • [a-zA-Z0-9]: Matches any alphanumeric character.
  • [^a-zA-Z0-9]: Matches any non-alphanumeric character (special characters).

import re

text = "abcABC123!@#"

patterns = {
    "Matches 'a', 'b', or 'c'": "[abc]",
    "Matches any character except 'a', 'b', or 'c'": "[^abc]",
    "Matches any lowercase letter": "[a-z]",
    "Matches any uppercase letter": "[A-Z]",
    "Matches any alphanumeric character": "[a-zA-Z0-9]",
    "Matches any non-alphanumeric character": "[^a-zA-Z0-9]"
}

# Find matches for each pattern
for description, pattern in patterns.items():
    matches = re.findall(pattern, text)
    print(f"{description}: {matches}")

Predefined Character Classes

  • \s: Matches any whitespace character.
  • \S: Matches any non-whitespace character.
  • \d: Matches any digit (0-9).
  • \D: Matches any non-digit.
  • \w: Matches any word character (letters, digits, and the underscore).
  • \W: Matches any non-word character.
  • . : Matches any character except newline (\n).
import re

text = "Hello World 123! @#$"

# Predefined Character Classes
# Raw strings (r"...") pass the backslashes to the regex engine unchanged
patterns = {
    "Matches any whitespace character": r"\s",
    "Matches any non-whitespace character": r"\S",
    "Matches any digit (0-9)": r"\d",
    "Matches any non-digit": r"\D",
    "Matches any word character": r"\w",
    "Matches any non-word character": r"\W",
    "Matches any character except newline": "."
}

# Find matches for each pattern
for description, pattern in patterns.items():
    matches = re.findall(pattern, text)
    print(f"{description}: {matches}")
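
One detail worth checking by hand is that \w also matches the underscore, so it is not quite the same as [a-zA-Z0-9]. The short sketch below compares the two on a made-up sample string.

```python
import re

sample = "user_name 42!"

# \w matches letters, digits, and "_"; the explicit class does not match "_"
word_chars = re.findall(r"\w", sample)
alnum_chars = re.findall(r"[a-zA-Z0-9]", sample)

print("_" in word_chars)   # True
print("_" in alnum_chars)  # False

# Adding + groups consecutive digits into a single match
digits = re.findall(r"\d+", sample)
print(digits)              # ['42']
```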

Quantifiers

Quantifiers specify the number of times a character or group of characters must occur:

  • a: Matches exactly one ‘a’.
  • a+: Matches one or more ‘a’s.
  • a*: Matches zero or more ‘a’s.
  • a?: Matches zero or one ‘a’.
  • a{m}: Matches exactly ‘m’ occurrences of ‘a’.
  • a{m,n}: Matches between ‘m’ and ‘n’ occurrences of ‘a’.
import re
text = "aaabaaa  aaaaaaaa a a"

# Quantifiers
patterns = {
    "Matches exactly one 'a'": "a",
    "Matches one or more 'a's": "a+",
    "Matches zero or more 'a's": "a*",
    "Matches zero or one 'a'": "a?",
    "Matches exactly 3 occurrences of 'a'": "a{3}",
    "Matches between 2 and 4 occurrences of 'a'": "a{2,4}"
}

# Find matches for each pattern
for description, pattern in patterns.items():
    matches = re.findall(pattern, text)
    print(f"{description}: {matches}")
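
To see how the quantifiers differ, it helps to run them on a shorter string where the results are easy to verify by eye. The sketch below uses a made-up input; note that a* can match zero characters, so its result also contains empty strings.

```python
import re

text = "aa b aaaaa"

one_or_more = re.findall(r"a+", text)
exactly_two = re.findall(r"a{2}", text)
two_to_four = re.findall(r"a{2,4}", text)
zero_or_more = re.findall(r"a*", text)

print(one_or_more)   # ['aa', 'aaaaa']
print(exactly_two)   # ['aa', 'aa', 'aa']  (the fifth 'a' of 'aaaaa' is left over)
print(two_to_four)   # ['aa', 'aaaa']      (greedy: takes as many as allowed)
# a* also matches the empty string wherever there is no 'a'
print([m for m in zero_or_more if m])  # ['aa', 'aaaaa']
```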

Matching at the Start or End of Strings

  • ^x: Checks if the string starts with x.
  • x$: Checks if the string ends with x.
import re

text = "Hello, World!"

# Patterns for matching at the start and end of the string
patterns = {
    "Checks if the string starts with 'H'": "^H",
    "Checks if the string ends with '!'": "!$",
    "Checks if the string starts with 'Hello'": "^Hello",
    "Checks if the string ends with 'World'": "World$"
}

# Find matches for each pattern
for description, pattern in patterns.items():
    match = re.search(pattern, text)
    if match:
        print(f"{description}: Match found")
    else:
        print(f"{description}: No match")
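
By default, ^ and $ anchor to the whole string, not to each line. When the text has several lines, the re.MULTILINE flag makes them match at every line boundary. The sketch below uses a made-up log string to show the difference.

```python
import re

log = "ERROR disk full\nINFO backup done\nERROR network down"

# Default: ^ matches only at the very start of the string
start_default = re.findall(r"^ERROR", log)
# re.MULTILINE: ^ also matches right after each newline
start_multiline = re.findall(r"^ERROR", log, re.MULTILINE)

# Default: $ matches only at the end of the string
end_default = re.findall(r"done$", log)
# re.MULTILINE: $ also matches right before each newline
end_multiline = re.findall(r"done$", log, re.MULTILINE)

print(start_default)    # ['ERROR']
print(start_multiline)  # ['ERROR', 'ERROR']
print(end_default)      # []
print(end_multiline)    # ['done']
```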

Important Functions of the re Module

1. re.match(): Checks whether the pattern matches at the start of the string.

m = re.match("abc", "abcde")
if m:
    print("Match found:", m.group())

2. re.search(): Scans the string and returns the first occurrence of the pattern.

m = re.search("abc", "123abc456")
if m:
    print("Pattern found at:", m.start(), "-", m.end())

3. re.findall(): Returns a list of all matches in the string.

results = re.findall("[0-9]", "a7b9c5kz")
print(results)  # ['7', '9', '5']

4. re.finditer(): Returns an iterator yielding match objects.

for match in re.finditer("in", "The rain in Spain stays mainly in the plain."):
    print(f"Match found: '{match.group()}' at positions {match.start()} to {match.end()}")

5. re.sub(): Replaces matches with a given replacement.

result = re.sub("[a-z]", "#", "a7b9c5kz")
print(result)  # Output: #7#9#5##

6. re.subn(): Similar to re.sub(), but also returns the number of replacements made.

result, count = re.subn(r"[a-z]", "#", "a7b9c5kz")
print(result, count)  # Output: #7#9#5## 5

7. re.split(): Splits the string by occurrences of the pattern.

tokens = re.split(",", "apple,banana,cherry")
print(tokens)  # ['apple', 'banana', 'cherry']
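
The functions above are easiest to tell apart when they are run against the same input. The following sketch compares re.match(), re.search(), and re.fullmatch() on one string, and also shows re.split() with a character class as the delimiter. The sample strings are made up for illustration.

```python
import re

text = "123abc456"

# match() only succeeds if the pattern matches at position 0
at_start = re.match(r"abc", text)
# search() scans the whole string and returns the first match
anywhere = re.search(r"abc", text)
# fullmatch() requires the entire string to match the pattern
digits_only = re.fullmatch(r"\d+", text)
whole = re.fullmatch(r"\d+[a-z]+\d+", text)

print(at_start)         # None
print(anywhere.span())  # (3, 6)
print(digits_only)      # None
print(whole.group())    # 123abc456

# split() takes any pattern, so several delimiters can be handled at once
tokens = re.split(r"[,;]\s*", "apple, banana;cherry")
print(tokens)           # ['apple', 'banana', 'cherry']
```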

Example Programs:

1. Matching Phone Numbers

import re

phone_numbers = """
    123-456-7890
    (123) 456-7890
    123.456.7890
    1234567890
    123 456 7890
"""

pattern = r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'
matches = re.findall(pattern, phone_numbers)
print(matches)
# Output: ['123-456-7890', '(123) 456-7890', '123.456.7890', '1234567890', '123 456 7890']

2. Extracting Dates

Extract dates in various common formats (e.g., dd-mm-yyyy or mm/dd/yyyy):

text = "The event will be held on 12-05-2024 and 05/12/2024."
date_pattern = r'\b\d{2}[-/]\d{2}[-/]\d{4}\b'

dates = re.findall(date_pattern, text)
print(dates)
# Output: ['12-05-2024', '05/12/2024']

3. Validating Passwords

password = "Password123"

password_pattern = r'^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)[A-Za-z\d]{8,20}$'

if re.match(password_pattern, password):
    print("Valid password")
else:
    print("Invalid password")
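
The password pattern combines three lookaheads, (?=.*[a-z]), (?=.*[A-Z]), and (?=.*\d), with a final class that limits the allowed characters and the length. A quick way to understand it is to try inputs that each break exactly one rule. The helper name is_valid_password and the candidate list below are made up for this sketch.

```python
import re

# Same pattern as in the example, written as a raw string
PASSWORD_PATTERN = r'^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)[A-Za-z\d]{8,20}$'

def is_valid_password(candidate: str) -> bool:
    """Return True if candidate satisfies every rule in PASSWORD_PATTERN."""
    return re.match(PASSWORD_PATTERN, candidate) is not None

candidates = [
    "Password123",   # satisfies all rules
    "password123",   # no uppercase letter
    "PASSWORD123",   # no lowercase letter
    "Passwordabc",   # no digit
    "Pw1",           # shorter than 8 characters
    "Password_123",  # "_" is not in [A-Za-z\d]
]

for candidate in candidates:
    status = "valid" if is_valid_password(candidate) else "invalid"
    print(f"{candidate}: {status}")
```

Only the first candidate is reported as valid; each of the others fails one part of the pattern.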

4. Finding HTML Tags

html_text = "<h1>Title</h1><p>This is a paragraph.</p><a href='example.com'>Link</a>"

tag_pattern = '<[^>]+>'
tags = re.findall(tag_pattern, html_text)
print(tags)
# Output: ["<h1>", "</h1>", "<p>", "</p>", "<a href='example.com'>", "</a>"]

5. Removing Whitespace from a String

text = "   This   is   an   example.    "

cleaned_text = re.sub(r'\s+', ' ', text).strip()
print(cleaned_text)
# Output: "This is an example."

6. Finding Email Addresses

text = "Contact us at support@example.com or sales@example.org for more information."

email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
emails = re.findall(email_pattern, text)
print(emails)
# Output: ['support@example.com', 'sales@example.org']

7. Matching Words with Specific Lengths

text = "This is a test with several four-letter words like time, play, and jump."

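# The raw string is required here: in a normal string, '\b' is a backspace character
four_letter_words = re.findall(r'\b\w{4}\b', text)
print(four_letter_words)
# Output: ['This', 'test', 'with', 'four', 'like', 'time', 'play', 'jump']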
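
Word-boundary patterns are a common place to trip over string escapes: in a normal Python string literal, "\b" is a single backspace character, while r"\b" is the two-character sequence the regex engine reads as a word boundary. The sketch below demonstrates this and generalizes the example to any word length. The helper name words_of_length is made up for illustration.

```python
import re

# "\b" is one character (backspace); r"\b" is a backslash followed by "b"
print(len("\b"), len(r"\b"))  # 1 2

def words_of_length(text: str, n: int) -> list:
    """Return all words in text that are exactly n word characters long."""
    # Doubled braces produce literal { } in the f-string, giving e.g. \b\w{4}\b
    return re.findall(rf"\b\w{{{n}}}\b", text)

text = "This is a test with several four-letter words like time, play, and jump."

print(words_of_length(text, 2))  # ['is']
print(words_of_length(text, 3))  # ['and']
print(words_of_length(text, 4))
```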

8. Checking if a String Contains Only Digits

string = "123456"

if re.fullmatch(r'\d+', string):
    print("String contains only digits.")
else:
    print("String contains non-digit characters.")
# Output: "String contains only digits."

Conclusion

Regular expressions offer a highly flexible way to search, match, and manipulate strings in Python. They are powerful tools, especially when working with text data for tasks such as validation, efficient parsing, and string manipulation.
