Regular expressions (regex) are a powerful tool for defining search patterns in strings. They let us find, validate, and manipulate text by describing the format or structure we are looking for. Python supports regular expressions through the built-in re module, which makes it easy to add this functionality to our applications.
What Are Regular Expressions?
A regular expression is a declarative way to describe a set of strings that follow a specific pattern or rule. Some common use cases include:
- Validating phone numbers, email addresses, and other input formats.
- Implementing search and pattern-matching logic (e.g., “Find” or “Grep”).
- Building interpreters and compilers.
- Designing communication protocols and digital circuits.
Key Applications of Regular Expressions
- Validation Frameworks: Ensure data inputs follow specific formats (e.g., email validation).
- Pattern Matching: Common tools like grep on Unix-based systems or Ctrl + F on Windows.
- Language Translators: Used in compilers and interpreters.
- Digital Circuit Design: Mapping and validating circuit designs.
- Communication Protocols: Ensuring correct patterns in data transmission, such as TCP/IP.
Working with the ‘re’ module
Python’s re module provides functions for compiling patterns, searching text, and replacing or splitting strings.
Basic Functions in re Module
re.compile(): Compiles a pattern into a regex object.
import re
pattern = re.compile("ab")
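finditer(): Returns an iterator yielding match objects. It is available both as re.finditer() and as a method on a compiled pattern.
matcher = pattern.finditer("abaababa")
for match in matcher:
    # start() and end() give the match position; group() gives the matched text
    print(match.start(), match.end(), match.group())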
Example: Counting Occurrences of a Pattern
import re

count = 0
pattern = re.compile("ab")
matcher = pattern.finditer("abaababa")
for match in matcher:
    count += 1
    print(match.start(), "...", match.end(), "...", match.group())
print("The number of occurrences:", count)

# Output:
# 0 ... 2 ... ab
# 3 ... 5 ... ab
# 5 ... 7 ... ab
# The number of occurrences: 3
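One detail worth knowing when counting: finditer() reports non-overlapping matches, so a pattern that can overlap with itself is counted fewer times than you might expect. The sketch below illustrates this with two hypothetical helpers, count_matches and count_overlapping (names chosen for this example only); the second wraps the pattern in a lookahead, which is one common way to count overlapping occurrences.

```python
import re

def count_matches(pattern: str, text: str) -> int:
    """Count non-overlapping matches, the way finditer() reports them."""
    return sum(1 for _ in re.finditer(pattern, text))

def count_overlapping(pattern: str, text: str) -> int:
    """Count matches that may overlap by wrapping the pattern in a lookahead."""
    # (?=...) matches at a position without consuming characters,
    # so a new match can start inside the previous one.
    return sum(1 for _ in re.finditer(f"(?={pattern})", text))

print(count_matches("ab", "abaababa"))       # 3
print(count_matches("aba", "abaababa"))      # 2
print(count_overlapping("aba", "abaababa"))  # 3
```

For a pattern like "ab", which cannot overlap with itself, both approaches give the same count.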
Character Classes
Character classes allow for matching specific groups of characters:
- [abc]: Matches any one of the characters ‘a’, ‘b’, or ‘c’.
- [^abc]: Matches any one character except ‘a’, ‘b’, or ‘c’.
- [a-z]: Matches any lowercase ASCII letter.
- [A-Z]: Matches any uppercase ASCII letter.
- [a-zA-Z0-9]: Matches any ASCII alphanumeric character.
- [^a-zA-Z0-9]: Matches any character that is not an ASCII letter or digit (special characters, whitespace, and so on).
import re

text = "abcABC123!@#"
patterns = {
    "Matches 'a', 'b', or 'c'": "[abc]",
    "Matches any character except 'a', 'b', or 'c'": "[^abc]",
    "Matches any lowercase letter": "[a-z]",
    "Matches any uppercase letter": "[A-Z]",
    "Matches any alphanumeric character": "[a-zA-Z0-9]",
    "Matches any non-alphanumeric character": "[^a-zA-Z0-9]",
}

# Find matches for each pattern
for description, pattern in patterns.items():
    matches = re.findall(pattern, text)
    print(f"{description}: {matches}")
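Characters that are special elsewhere in a pattern often behave differently inside square brackets. The short sketch below (variable names are illustrative) shows three cases that tend to trip people up: metacharacters such as the dot, a trailing hyphen, and a caret that is not in first position.

```python
import re

# Inside [...] most metacharacters are literal: '.', '+' and '*' need no escaping.
operators = re.findall(r"[.+*]", "1.5+2*3")
print(operators)        # ['.', '+', '*']

# A hyphen placed last in the class is a literal hyphen, not a range.
with_hyphen = re.findall(r"[a-c-]", "a-d")
print(with_hyphen)      # ['a', '-']

# A caret negates the class only when it comes first; elsewhere it is literal.
with_caret = re.findall(r"[a^]", "a^b")
print(with_caret)       # ['a', '^']
```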
Predefined Character Classes
- \s: Matches any whitespace character (space, tab, newline, and so on).
- \S: Matches any non-whitespace character.
- \d: Matches any digit (0-9; for str patterns in Python 3, other Unicode decimal digits as well).
- \D: Matches any non-digit.
- \w: Matches any word character: a letter, a digit, or the underscore.
- \W: Matches any non-word character.
- . : Matches any character except newline (\n).
import re

text = "Hello World 123! @#$"

# Predefined character classes.
# Raw strings (r"...") keep Python from treating the backslash as a string escape.
patterns = {
    "Matches any whitespace character": r"\s",
    "Matches any non-whitespace character": r"\S",
    "Matches any digit (0-9)": r"\d",
    "Matches any non-digit": r"\D",
    "Matches any word character": r"\w",
    "Matches any non-word character": r"\W",
    "Matches any character except newline": r".",
}

# Find matches for each pattern
for description, pattern in patterns.items():
    matches = re.findall(pattern, text)
    print(f"{description}: {matches}")
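For str patterns in Python 3, \d and \w are Unicode-aware by default, so they match more than 0-9 and a-z. The sketch below uses the re.ASCII flag to restrict them to ASCII; the sample text is made up for illustration and contains two Arabic-Indic digits and an accented letter, written as escapes.

```python
import re

# "42", the Arabic-Indic digits for 42, and the word "café_1"
text = "42 \u0664\u0662 caf\u00e9_1"

unicode_digits = re.findall(r"\d", text)
ascii_digits = re.findall(r"\d", text, flags=re.ASCII)
print(len(unicode_digits))  # 5
print(ascii_digits)         # ['4', '2', '1']

unicode_words = re.findall(r"\w+", text)
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)
print(len(unicode_words))   # 3
print(ascii_words)          # ['42', 'caf', '_1']
```

If input must be limited to ASCII digits, re.ASCII or an explicit class such as [0-9] is usually the safer choice.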
Quantifiers
Quantifiers specify the number of times a character or group of characters must occur:
- a: Matches exactly one ‘a’.
- a+: Matches one or more ‘a’s.
- a*: Matches zero or more ‘a’s.
- a?: Matches zero or one ‘a’.
- a{m}: Matches exactly ‘m’ occurrences of ‘a’.
- a{m,n}: Matches between ‘m’ and ‘n’ occurrences of ‘a’ (inclusive), taking as many as possible.
import re

text = "aaabaaa aaaaaaaa a a"

# Quantifiers
patterns = {
    "Matches exactly one 'a'": "a",
    "Matches one or more 'a's": "a+",
    "Matches zero or more 'a's": "a*",
    "Matches zero or one 'a'": "a?",
    "Matches exactly 3 occurrences of 'a'": "a{3}",
    "Matches between 2 and 4 occurrences of 'a'": "a{2,4}",
}

# Find matches for each pattern
for description, pattern in patterns.items():
    matches = re.findall(pattern, text)
    print(f"{description}: {matches}")
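Two behaviours of quantifiers are easy to miss when reading the output of a script like the one above: quantifiers are greedy, and patterns that allow zero occurrences (a* and a?) also produce empty matches. A small sketch on shorter inputs, with illustrative variable names:

```python
import re

# Greedy: each match takes as many characters as the quantifier allows.
plus_matches = re.findall(r"a+", "aaabaaa")
print(plus_matches)   # ['aaa', 'aaa']

# With five 'a's, a{2,4} takes four; the single 'a' left over is too short to match.
bounded = re.findall(r"a{2,4}", "aaaaa")
print(bounded)        # ['aaaa']

# a* can match zero characters, so findall() also returns empty strings
# at positions where there is no 'a'. Filter them out if they are not wanted.
star_matches = re.findall(r"a*", "aab")
non_empty = [m for m in star_matches if m]
print(non_empty)      # ['aa']
```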
Matching at the Start or End of Strings
- ^x: Checks if the string starts with x.
- x$: Checks if the string ends with x.
import re

text = "Hello, World!"

# Patterns for matching at the start and end of the string
patterns = {
    "Checks if the string starts with 'H'": "^H",
    "Checks if the string ends with '!'": "!$",
    "Checks if the string starts with 'Hello'": "^Hello",
    "Checks if the string ends with 'World'": "World$",
}

# Find matches for each pattern
for description, pattern in patterns.items():
    match = re.search(pattern, text)
    if match:
        print(f"{description}: Match found")
    else:
        print(f"{description}: No match")
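By default, ^ and $ anchor to the start and end of the whole string. When the text spans several lines, the re.MULTILINE flag makes them anchor at each line as well. A brief sketch with a made-up two-line string:

```python
import re

text = "first line\nsecond line"

# Default: ^ matches only at the start of the string, $ only at the end.
first_words = re.findall(r"^\w+", text)
last_words = re.findall(r"\w+$", text)
print(first_words)      # ['first']
print(last_words)       # ['line']

# With re.MULTILINE, ^ and $ also match at the start and end of every line.
first_words_ml = re.findall(r"^\w+", text, re.MULTILINE)
last_words_ml = re.findall(r"\w+$", text, re.MULTILINE)
print(first_words_ml)   # ['first', 'second']
print(last_words_ml)    # ['line', 'line']
```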
Important Functions of the re Module
1. re.match(): Checks whether the pattern matches at the start of the string.
m = re.match("abc", "abcde")
if m:
    print("Match found:", m.group())
2. re.search(): Searches for the first occurrence of the pattern anywhere in the string.
m = re.search("abc", "123abc456")
if m:
    print("Pattern found at:", m.start(), "-", m.end())
3. re.findall(): Returns a list of all matches in the string.
results = re.findall("[0-9]", "a7b9c5kz")
print(results) # ['7', '9', '5']
4. re.finditer(): Returns an iterator yielding match objects.
for match in re.finditer("in", "The rain in Spain stays mainly in the plain."):
    print(f"Match found: '{match.group()}' at positions {match.start()} to {match.end()}")
5. re.sub(): Replaces matches with a given replacement.
result = re.sub("[a-z]", "#", "a7b9c5kz")
print(result)  # Output: #7#9#5##
6. re.subn(): Similar to re.sub(), but also returns the number of replacements made.
result, count = re.subn(r"[a-z]", "#", "a7b9c5kz")
print(result, count) # Output: #7#9#5## 5
7. re.split(): Splits the string by occurrences of the pattern.
tokens = re.split(",", "apple,banana,cherry")
print(tokens) # ['apple', 'banana', 'cherry']
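For a single fixed delimiter like the comma above, str.split() would do the same job. re.split() becomes useful when the separator itself varies. The sketch below assumes a made-up input string with mixed separators and also shows the maxsplit parameter.

```python
import re

raw = "apple, banana;cherry  date"

# Split on any run of commas, semicolons, or whitespace.
tokens = re.split(r"[,;\s]+", raw)
print(tokens)  # ['apple', 'banana', 'cherry', 'date']

# maxsplit limits the number of splits; the remainder is returned untouched.
head, rest = re.split(r"[,;\s]+", raw, maxsplit=1)
print(head)    # apple
print(rest)    # banana;cherry  date
```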
Example Programs:
1. Matching Phone Numbers
import re
phone_numbers = """
123-456-7890
(123) 456-7890
123.456.7890
1234567890
123 456 7890
"""
pattern = r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'
matches = re.findall(pattern, phone_numbers)
print(matches)
# Output: ['123-456-7890', '(123) 456-7890', '123.456.7890', '1234567890', '123 456 7890']
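Once numbers in different layouts have been found, a common next step is to store them in one canonical form. A minimal sketch, assuming that stripping every non-digit is an acceptable normalization for your data; normalize_phone is a hypothetical helper name, not part of the re module.

```python
import re

def normalize_phone(raw: str) -> str:
    """Remove everything except digits from a matched phone number."""
    return re.sub(r"\D", "", raw)

matches = ['123-456-7890', '(123) 456-7890', '123.456.7890', '1234567890', '123 456 7890']
normalized = [normalize_phone(m) for m in matches]
print(set(normalized))  # {'1234567890'}
```

All five layouts collapse to the same ten-digit string, which makes comparison and de-duplication straightforward.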
2. Extracting Dates
Extract dates in various common formats (e.g., dd-mm-yyyy or mm/dd/yyyy):
text = "The event will be held on 12-05-2024 and 05/12/2024."
date_pattern = r'\b\d{2}[-/]\d{2}[-/]\d{4}\b'
dates = re.findall(date_pattern, text)
print(dates)
# Output: ['12-05-2024', '05/12/2024']
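A pattern like the one above checks only the shape of a date, so a string such as 99-99-2024 would match too. The sketch below combines the regex with datetime.strptime to keep only real calendar dates. It assumes the dd-mm-yyyy layout with hyphens, and extract_valid_dates is a hypothetical helper written for this illustration.

```python
import re
from datetime import datetime

# Shape check only: two digits, two digits, four digits, separated by hyphens.
date_pattern = re.compile(r"\b\d{2}-\d{2}-\d{4}\b")

def extract_valid_dates(text: str) -> list:
    """Return ISO-formatted dates for dd-mm-yyyy matches that are real dates."""
    valid = []
    for match in date_pattern.finditer(text):
        try:
            parsed = datetime.strptime(match.group(), "%d-%m-%Y")
        except ValueError:
            continue  # right shape, impossible date
        valid.append(parsed.date().isoformat())
    return valid

print(extract_valid_dates("Due 12-05-2024, not 31-02-2024 or 99-99-2024."))
# ['2024-05-12']
```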
3. Validating Passwords
password = "Password123"
password_pattern = r'^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)[A-Za-z\d]{8,20}$'
if re.match(password_pattern, password):
    print("Valid password")
else:
    print("Invalid password")
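The password pattern packs several rules into one line. The sketch below writes the same pattern with the re.VERBOSE flag so each part can carry a comment, then runs it against a few made-up candidates. The helper name is_valid is hypothetical.

```python
import re

password_pattern = re.compile(
    r"""
    ^
    (?=.*[a-z])        # lookahead: at least one lowercase letter
    (?=.*[A-Z])        # lookahead: at least one uppercase letter
    (?=.*\d)           # lookahead: at least one digit
    [A-Za-z\d]{8,20}   # the whole password: 8 to 20 letters or digits
    $
    """,
    re.VERBOSE,
)

def is_valid(password: str) -> bool:
    return password_pattern.match(password) is not None

for candidate in ["Password123", "password123", "Pass1", "Password123!"]:
    print(candidate, is_valid(candidate))
# Password123 True
# password123 False   (no uppercase letter)
# Pass1 False         (too short)
# Password123! False  ('!' is outside the allowed character set)
```

Each lookahead checks one requirement without consuming characters, so the final character class still sees the whole string.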
4. Finding HTML Tags
html_text = "<h1>Title</h1><p>This is a paragraph.</p><a href='example.com'>Link</a>"
tag_pattern = '<[^>]+>'
tags = re.findall(tag_pattern, html_text)
print(tags)
# Output: ['<h1>', '</h1>', '<p>', '</p>', "<a href='example.com'>", '</a>']
5. Removing Extra Whitespace from a String
text = " This is an example. "
# Collapse each run of whitespace into a single space, then trim both ends
cleaned_text = re.sub(r'\s+', ' ', text).strip()
print(cleaned_text)
# Output: This is an example.
6. Finding Email Addresses
text = "Contact us at support@example.com or sales@example.org for more information."
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
emails = re.findall(email_pattern, text)
print(emails)
# Output: ['support@example.com', 'sales@example.org']
7. Matching Words with Specific Lengths
text = "This is a test with several four-letter words like time, play, and jump."
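# The raw string is required here: in a normal string, '\b' is a backspace character, not a word boundary
four_letter_words = re.findall(r'\b\w{4}\b', text)
print(four_letter_words)
# Output: ['This', 'test', 'with', 'four', 'like', 'time', 'play', 'jump']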
8. Checking if a String Contains Only Digits
string = "123456"
if re.fullmatch(r'\d+', string):
    print("String contains only digits.")
else:
    print("String contains non-digit characters.")
# Output: String contains only digits.
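The choice of re.fullmatch() in the last example matters. The sketch below runs the same pattern through search(), match(), and fullmatch() on made-up inputs to show how the three differ when a string only partly fits the pattern.

```python
import re

pattern = r"\d+"

# "123abc" starts with digits but does not consist only of digits.
found_anywhere = bool(re.search(pattern, "123abc"))
found_at_start = bool(re.match(pattern, "123abc"))
whole_string = bool(re.fullmatch(pattern, "123abc"))
print(found_anywhere, found_at_start, whole_string)  # True True False

# "abc123" has digits, but not at the start.
print(bool(re.search(pattern, "abc123")))  # True
print(bool(re.match(pattern, "abc123")))   # False
```

For validation tasks, where the entire input must fit the pattern, fullmatch() is usually the one to reach for.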
Conclusion
Regular expressions offer a highly flexible way to search, match, and manipulate strings in Python. They are powerful tools, especially when working with text data for tasks such as validation, efficient parsing, and string manipulation.