Python Regular Expressions: Matching and Replacing Text and Whitespace

A comprehensive guide to processing text efficiently with re.sub()

Regex Replacement Basics

In Python, the re.sub() function replaces every match of a pattern in a string. Its basic syntax is:

import re

result = re.sub(pattern, replacement, string, count=0, flags=0)
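  • pattern: the regular expression pattern to match
  • replacement: the replacement string, or a function that receives a match object and returns the replacement (the re module names this parameter repl)
  • string: the original string to process
  • count: the maximum number of replacements (0 means replace every match)
  • flags: regular expression flags (such as re.IGNORECASE)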
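
To make the count parameter concrete, here is a minimal sketch. The sample string is made up for illustration. It also shows re.subn(), a sibling of re.sub() that additionally reports how many replacements were made.

```python
import re

text = "a-b-c-d"

# Pass count by keyword: only the first two hyphens are replaced.
first_two = re.sub(r"-", "+", text, count=2)
print(first_two)  # a+b+c-d

# re.subn() returns the new string together with the number of replacements.
replaced, n = re.subn(r"-", "+", text)
print(replaced, n)  # a+b+c+d 3
```

Passing count and flags by keyword keeps the call readable and avoids mixing up the two integer arguments.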

Text Replacement Examples

Example 1: Basic text replacement

import re

text = "Python is an excellent programming language. I love Python!"
result = re.sub(r"Python", "JavaScript", text)

print(result)
# Output: JavaScript is an excellent programming language. I love JavaScript!

Example 2: Case-insensitive replacement

text = "Python is great. Do you like python?"
result = re.sub(r"python", "JavaScript", text, flags=re.IGNORECASE)

print(result)
# Output: JavaScript is great. Do you like JavaScript?
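
A bare word pattern also matches inside longer words. The sketch below, using a made-up sentence, compares the plain pattern with one wrapped in \b word boundaries, which is one common way to restrict the match to whole words.

```python
import re

text = "Python is Pythonic. I like python3 and python."

# Without \b, the pattern also rewrites the start of "Pythonic" and "python3".
loose = re.sub(r"python", "JavaScript", text, flags=re.IGNORECASE)
# With \b on both sides, only the standalone word is replaced.
strict = re.sub(r"\bpython\b", "JavaScript", text, flags=re.IGNORECASE)

print(loose)   # JavaScript is JavaScriptic. I like JavaScript3 and JavaScript.
print(strict)  # JavaScript is Pythonic. I like python3 and JavaScript.
```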

Example 3: Using a function for complex replacements

def to_upper(match):
    return match.group(0).upper()

text = "make this important. and this too!"
result = re.sub(r"important|too", to_upper, text)

print(result)
# Output: make this IMPORTANT. and this TOO!
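
A replacement function can also compute a new value from a captured group. The following is a small sketch. The function name double_price and the sample text are hypothetical and chosen only for illustration.

```python
import re

def double_price(match):
    # group(1) holds the digits captured after the dollar sign.
    return "$" + str(int(match.group(1)) * 2)

text = "Coffee costs $3 and cake costs $12."
result = re.sub(r"\$(\d+)", double_price, text)

print(result)  # Coffee costs $6 and cake costs $24.
```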

Whitespace Handling Techniques

Example 4: Collapsing runs of whitespace into a single space

text = "This   text    has    too    many     spaces."
result = re.sub(r"\s+", " ", text)

print(result)
# Output: This text has too many spaces.
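
Note that \s matches tabs and newlines as well as spaces. If only literal space characters should be collapsed, a pattern such as " {2,}" is one option. The comparison below is a sketch on a made-up string.

```python
import re

text = "one   two\tthree\nfour    five"

# \s+ treats tabs and newlines as whitespace, so they become spaces too.
any_whitespace = re.sub(r"\s+", " ", text)
# " {2,}" only touches runs of two or more space characters.
spaces_only = re.sub(r" {2,}", " ", text)

print(repr(any_whitespace))  # 'one two three four five'
print(repr(spaces_only))     # 'one two\tthree\nfour five'
```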

Example 5: Removing leading and trailing whitespace

text = "   This has leading and trailing spaces.   "
result = re.sub(r"^\s+|\s+$", "", text)

print(f"'{result}'")
# Output: 'This has leading and trailing spaces.'
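
For this particular task the built-in str.strip() gives the same result on the sample input, and lstrip() or rstrip() cover the one-sided cases. A quick check, as a sketch:

```python
import re

text = "   This has leading and trailing spaces.   "

with_regex = re.sub(r"^\s+|\s+$", "", text)
with_strip = text.strip()
print(with_regex == with_strip)  # True

# One-sided trimming: only the leading whitespace is removed.
leading_only = re.sub(r"^\s+", "", text)
print(leading_only == text.lstrip())  # True
```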

Example 6: Removing all whitespace (including tabs and newlines)

text = "This text has\tspaces and\nnew lines."
result = re.sub(r"\s+", "", text)

print(result)
# Output: Thistexthasspacesandnewlines.

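Example 7: Cleaning text while preserving line breaks

text = "  This  has\n  extra  spaces  \n  between  words.  "
# Collapse runs of whitespace other than newlines into a single space
result = re.sub(r"[^\S\n]+", " ", text)
# Strip leading and trailing whitespace from every line; [^\S\n] is used
# instead of \s so that blank lines are not swallowed along the way
result = re.sub(r"^[^\S\n]+|[^\S\n]+$", "", result, flags=re.MULTILINE)

print(repr(result))
# Output: 'This has\nextra spaces\nbetween words.'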
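
An alternative that avoids re.MULTILINE is to clean each line separately and join the lines again. This is a sketch of that approach, not a replacement for the pattern-based version.

```python
import re

text = "  This  has\n  extra  spaces  \n  between  words.  "

# Collapse whitespace inside each line, trim it, then restore the line breaks.
lines = [re.sub(r"\s+", " ", line).strip() for line in text.split("\n")]
result = "\n".join(lines)

print(repr(result))  # 'This has\nextra spaces\nbetween words.'
```

Because every line is processed on its own, the newline characters never take part in a match.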

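Practical Scenarios

Scenario 1: Cleaning user input

import re

def clean_input(user_input):
    # Remove special characters first (anything that is neither a word character nor whitespace)
    cleaned = re.sub(r"[^\w\s]", "", user_input)
    # Collapse the remaining runs of whitespace; doing this after the removal
    # avoids the double space a deleted symbol would otherwise leave behind
    cleaned = re.sub(r"\s+", " ", cleaned)
    # Strip leading and trailing whitespace
    return cleaned.strip()

user_text = "  Hello,   World! This is some $ text!   "
print(clean_input(user_text))
# Output: Hello World This is some text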
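
What counts as a "special character" depends on how \w is interpreted. By default \w matches Unicode letters; with re.ASCII it is limited to [a-zA-Z0-9_]. The sketch below uses a made-up string to show the difference.

```python
import re

text = "Café, 你好!"

# Default (Unicode) matching: accented and CJK letters count as word characters.
unicode_clean = re.sub(r"[^\w\s]", "", text)
# With re.ASCII, non-ASCII letters are removed together with the punctuation.
ascii_clean = re.sub(r"[^\w\s]", "", text, flags=re.ASCII)

print(repr(unicode_clean))  # 'Café 你好'
print(repr(ascii_clean))    # 'Caf '
```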

Scenario 2: Formatting phone numbers

def format_phone_number(phone):
    # Remove every non-digit character
    cleaned = re.sub(r"\D", "", phone)
    # Drop an optional leading country code 1, then format as (123) 456-7890
    formatted = re.sub(r"^1?(\d{3})(\d{3})(\d{4})$", r"(\1) \2-\3", cleaned)
    return formatted

print(format_phone_number("555-123-4567"))      # (555) 123-4567
print(format_phone_number("1 (800) 555-1234"))  # (800) 555-1234
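
Input that does not contain a usable number is worth handling explicitly. The variant below is a sketch under the assumption that valid input is a 10-digit number with an optional leading 1. The name format_or_none is hypothetical. It uses re.fullmatch() and returns None for anything else.

```python
import re

def format_or_none(phone):
    # Keep digits only, then require the whole string to fit the expected shape.
    digits = re.sub(r"\D", "", phone)
    match = re.fullmatch(r"1?(\d{3})(\d{3})(\d{4})", digits)
    if match is None:
        return None
    area, prefix, line = match.groups()
    return f"({area}) {prefix}-{line}"

print(format_or_none("555-123-4567"))      # (555) 123-4567
print(format_or_none("1 (800) 555-1234"))  # (800) 555-1234
print(format_or_none("12345"))             # None
```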

Scenario 3: Processing HTML text

html_text = "<p>  This   is <b>bold</b> text!  </p>"
# Strip HTML tags
text_only = re.sub(r"<.*?>", "", html_text)
# Clean up the extra whitespace
cleaned_text = re.sub(r"\s+", " ", text_only).strip()

print(cleaned_text)
# Output: This is bold text!
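
A tag-stripping regex works for simple markup but can trip over a ">" inside an attribute value. For less predictable input, the standard library's html.parser module is one alternative. The sketch below defines a hypothetical TextExtractor helper and uses a made-up snippet to compare both approaches.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Hypothetical helper that collects only the text between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

html_text = '<p title="a > b">  This   is <b>bold</b> text!  </p>'

extractor = TextExtractor()
extractor.feed(html_text)
extractor.close()
parsed = re.sub(r"\s+", " ", "".join(extractor.parts)).strip()
print(parsed)  # This is bold text!

# The regex stops at the first ">", so part of the attribute is left behind.
naive = re.sub(r"<.*?>", "", html_text)
print(repr(naive))
```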

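Best Practices and Caveats

1. Compile frequently used patterns

For patterns that are used often, re.compile() builds the pattern once so it can be reused. The re module also caches recently compiled patterns, so the speed gain is usually modest, but a named pattern object is easier to reuse and to read: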

space_pattern = re.compile(r"\s+")
result = space_pattern.sub(" ", text)
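
A compiled pattern exposes the same operations as the module-level functions. This short sketch reuses one pattern object across several strings. The constant name SPACE_RUN and the sample data are made up for illustration.

```python
import re

SPACE_RUN = re.compile(r"\s+")

lines = ["a   b", "c \t d", "e\n\nf"]

# The same compiled pattern is applied to every string.
cleaned = [SPACE_RUN.sub(" ", line) for line in lines]
print(cleaned)  # ['a b', 'c d', 'e f']

# Other methods such as split() are available on the same object.
parts = SPACE_RUN.split("x   y \t z")
print(parts)  # ['x', 'y', 'z']
```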

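2. Watch out for greedy matching

Use the non-greedy quantifier .*? so that a pattern does not match more than intended:

# Greedy: .* runs to the last ">" in the string
re.sub(r"<.*>", "", "<div>content</div><p>more</p>")
# Result: ""

# Non-greedy: .*? stops at the first ">"
re.sub(r"<.*?>", "", "<div>content</div><p>more</p>")
# Result: "contentmore"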
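
One further detail: "." does not match a newline by default, so a tag that spans two lines is missed even by the non-greedy pattern. A negated character class or the re.DOTALL flag are two common ways around this, sketched here on a made-up snippet.

```python
import re

html = "<div\nclass='x'>content</div><p>more</p>"

# "." stops at the newline, so the opening tag split across lines survives.
default_dot = re.sub(r"<.*?>", "", html)
# [^>] matches any character except ">", newlines included.
negated = re.sub(r"<[^>]*>", "", html)
# re.DOTALL lets "." match newlines as well.
dotall = re.sub(r"<.*?>", "", html, flags=re.DOTALL)

print(repr(default_dot))  # "<div\nclass='x'>contentmore"
print(negated)            # contentmore
print(dotall)             # contentmore
```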

3. Escape special characters

Use re.escape() when the text to search for contains regex special characters:

search_term = "file.txt"
safe_pattern = re.escape(search_term)
result = re.sub(safe_pattern, "document.txt", "Find file.txt here")
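
The sketch below shows what goes wrong without escaping, using a made-up sentence. It also covers the replacement side: backslashes in a replacement string are interpreted as escapes, whereas the return value of a replacement function is inserted literally.

```python
import re

text = "Find file.txt and fileXtxt here"

# Unescaped, "." matches any character, so both names are replaced.
unescaped = re.sub("file.txt", "doc", text)
# Escaped, only the literal name matches.
escaped = re.sub(re.escape("file.txt"), "doc", text)

print(unescaped)  # Find doc and doc here
print(escaped)    # Find doc and fileXtxt here

# A function replacement inserts the path as-is, backslashes included.
path = r"C:\new\docs"
literal = re.sub(re.escape("file.txt"), lambda m: path, text)
print(literal)    # Find C:\new\docs and fileXtxt here
```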

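4. Performance considerations

For simple literal replacements, string methods may be faster:

# Simple replacement: the string method is usually faster
text.replace("old", "new")

# Complex patterns: use a regular expression
re.sub(r"\s+", " ", text)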
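
Whether the difference matters depends on the input, so it is worth measuring. The following sketch uses timeit on a made-up string. The repeat count of 200 is an arbitrary assumption and the timings will vary from machine to machine.

```python
import re
import timeit

text = "old value, old key, old name " * 1000

# For a literal search string both approaches produce the same result.
same_result = text.replace("old", "new") == re.sub(r"old", "new", text)
print(same_result)  # True

# Time each approach over the same number of runs.
replace_time = timeit.timeit(lambda: text.replace("old", "new"), number=200)
regex_time = timeit.timeit(lambda: re.sub(r"old", "new", text), number=200)
print(f"str.replace: {replace_time:.4f}s, re.sub: {regex_time:.4f}s")
```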

Python Regex Replacement Tutorial © 2023 | Focused on text-processing techniques
