The Complete Guide to Text Transformation in Python: From Basic to Advanced Operations | Python Data Processing Tutorial

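Why Text Transformation Matters

Text transformation is a common task in data processing. Typical uses include:

Cleaning user input
Preparing data for machine learning models
Converting text formats to meet system requirements
Handling text in different encodings
Extracting structured information from text

With strong built-in string handling and a rich ecosystem of text-processing libraries, Python is a popular choice for this work.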

Basic Text Transformations

1. Case Conversion

Python's built-in string methods make case conversion straightforward:

text = "Python Text Processing"

# Convert to uppercase
upper_text = text.upper()
print(upper_text)  # Output: PYTHON TEXT PROCESSING

# Convert to lowercase
lower_text = text.lower()
print(lower_text)  # Output: python text processing

# Title case: capitalize the first letter of every word
title_text = text.title()
print(title_text)  # Output: Python Text Processing

# Swap upper and lower case
swap_text = text.swapcase()
print(swap_text)  # Output: pYTHON tEXT pROCESSING
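
The methods above change how text looks. When the goal is to compare strings regardless of case, casefold() is usually a safer choice than lower(). The sketch below uses hypothetical sample strings to show the difference, along with capitalize(), which upper-cases only the first character of the whole string.

```python
# Hypothetical sample strings, chosen only for illustration.
left = "Straße"
right = "STRASSE"

# lower() keeps the German sharp s, so the two strings still differ.
lower_equal = left.lower() == right.lower()

# casefold() maps "ß" to "ss", which makes the comparison succeed.
casefold_equal = left.casefold() == right.casefold()

# capitalize() upper-cases only the first character and lowers the rest.
capitalized = "python text processing".capitalize()

print(lower_equal, casefold_equal, capitalized)
```

For plain ASCII text, lower() and casefold() give the same result; the difference only shows up with characters such as "ß".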

2. String Replacement

Use the replace() method to substitute specific parts of a string:

text = "I like apples. Apples are delicious."

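# replace() substitutes every occurrence, but matching is case-sensitive
new_text = text.replace("apples", "oranges")
print(new_text)
# Output: I like oranges. Apples are delicious.

# Chain calls to cover both capitalizations
new_text = text.replace("apples", "oranges").replace("Apples", "Oranges")
print(new_text)
# Output: I like oranges. Oranges are delicious.

# More advanced replacements can use regular expressions (see below)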
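
Two variations on the replacement idea are sketched here: limiting the number of replacements with the optional count argument of replace(), and a case-aware replacement built on re.sub(). The helper name match_case and the sample sentence are assumptions made for this illustration.

```python
import re

text = "I like apples. Apples are delicious. apples!"

# The optional third argument of str.replace() caps the number of replacements.
first_only = text.replace("apples", "oranges", 1)

def match_case(match):
    """Return 'oranges', capitalized when the matched word was capitalized."""
    word = match.group(0)
    return "Oranges" if word[0].isupper() else "oranges"

# re.IGNORECASE finds both spellings; the callback decides the replacement.
case_aware = re.sub(r"apples", match_case, text, flags=re.IGNORECASE)

print(first_only)
print(case_aware)
```

Passing a function to re.sub() avoids chaining one replace() call per capitalization, at the cost of a slightly more involved setup.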

3. Stripping Whitespace

Remove unwanted whitespace from the ends of a string:

text = "   Python Text Processing   \t\n"

# Strip whitespace from both ends
stripped = text.strip()
print(repr(stripped))  # Output: 'Python Text Processing'

# Strip leading whitespace only
left_stripped = text.lstrip()
print(repr(left_stripped))  # Output: 'Python Text Processing   \t\n'

# Strip trailing whitespace only
right_stripped = text.rstrip()
print(repr(right_stripped))  # Output: '   Python Text Processing'
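
strip() and its variants only touch the ends of a string. A rough sketch of two related tricks follows: passing a set of characters to strip(), and collapsing whitespace inside the string with split() and join(). The sample string is made up for the example.

```python
raw = "  --Python   Text\tProcessing--  \n"

# strip() also accepts a set of characters to remove from both ends.
trimmed = raw.strip().strip("-")

# split() with no arguments splits on any run of whitespace,
# so joining the pieces collapses internal whitespace to single spaces.
collapsed = " ".join(trimmed.split())

print(repr(trimmed))
print(repr(collapsed))
```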

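Advanced Text Transformation Techniques

1. Complex Transformations with Regular Expressions

Python's re module provides full regular expression support: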

import re

text = "Contact us at: support@example.com or sales@company.net"

# Replace every email address with a placeholder
anon_text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
                  '[EMAIL]', text)
print(anon_text)
# Output: Contact us at: [EMAIL] or [EMAIL]

# Extract dates written as YYYY-MM-DD, YYYY/MM/DD, or DD.MM.YYYY
text = "Event dates: 2023-08-15, 2023/09/20, and 15.10.2023"
dates = re.findall(r'\d{4}[-/]\d{2}[-/]\d{2}|\d{2}\.\d{2}\.\d{4}', text)
print(dates)  # Output: ['2023-08-15', '2023/09/20', '15.10.2023']
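
Extraction is often followed by normalization. As a minimal sketch, the code below rewrites the three date layouts from the example into a single YYYY-MM-DD form using named groups and a replacement function. It assumes dotted dates are day-first; the names DATE_PATTERN and to_iso are introduced only for this illustration.

```python
import re

text = "Event dates: 2023-08-15, 2023/09/20, and 15.10.2023"

# Two alternatives with named groups: year-first and day-first layouts.
DATE_PATTERN = re.compile(
    r"(?P<y1>\d{4})[-/](?P<m1>\d{2})[-/](?P<d1>\d{2})"
    r"|(?P<d2>\d{2})\.(?P<m2>\d{2})\.(?P<y2>\d{4})"
)

def to_iso(match):
    """Rebuild whichever alternative matched as YYYY-MM-DD."""
    if match.group("y1"):
        return f"{match['y1']}-{match['m1']}-{match['d1']}"
    return f"{match['y2']}-{match['m2']}-{match['d2']}"

normalized = DATE_PATTERN.sub(to_iso, text)
print(normalized)
```

The pattern checks only the shape of a date, not its validity, so a string such as 2023-99-99 would also be rewritten.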

2. Converting Between Text Encodings

Work with text stored in different encodings:

# A Python str holds Unicode text; an encoding only applies once it becomes bytes
utf8_text = "Python文本处理 - 编码转换示例"

# Encode to a UTF-8 bytes object
bytes_data = utf8_text.encode('utf-8')

# Encode to another encoding (e.g. latin-1, which cannot represent Chinese characters)
try:
    latin_bytes = utf8_text.encode('latin-1')
except UnicodeEncodeError:
    # Fall back to lossy encoding: unsupported characters become '?'
    latin_bytes = utf8_text.encode('latin-1', errors='replace')

print(latin_bytes)  # Output: b'Python???? - ??????'

# Encode to GBK (widely used for Chinese text)
gbk_bytes = utf8_text.encode('gbk')
print(gbk_bytes)  # Prints the GBK bytes; each Chinese character takes two bytes

# Convert back to UTF-8
back_to_utf8 = gbk_bytes.decode('gbk').encode('utf-8')
print(back_to_utf8.decode('utf-8'))  # Prints the original string
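
In practice, encoding conversion usually happens at file boundaries. The following sketch, which assumes a GBK-encoded source file, reads it with the right encoding and writes it back as UTF-8. The function name transcode_file and the temporary file names are hypothetical. The last lines show two other values for the errors argument of encode().

```python
import tempfile
from pathlib import Path

def transcode_file(src, dst, src_encoding, dst_encoding="utf-8"):
    """Read src using src_encoding and write the same text to dst."""
    text = Path(src).read_text(encoding=src_encoding)
    Path(dst).write_text(text, encoding=dst_encoding)
    return text

with tempfile.TemporaryDirectory() as tmp:
    gbk_path = Path(tmp) / "sample_gbk.txt"
    utf8_path = Path(tmp) / "sample_utf8.txt"
    gbk_path.write_bytes("编码转换示例".encode("gbk"))

    transcode_file(gbk_path, utf8_path, src_encoding="gbk")
    round_trip = utf8_path.read_bytes().decode("utf-8")

# errors= controls what happens to characters the target cannot represent.
ignored = "文本 text".encode("ascii", errors="ignore")
escaped = "文本 text".encode("ascii", errors="backslashreplace")

print(round_trip)
print(ignored, escaped)
```

Both "ignore" and "replace" lose information, while "backslashreplace" keeps the code points recoverable as escape sequences.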

3. Processing HTML/XML Content

Use BeautifulSoup to parse and transform HTML/XML:

from bs4 import BeautifulSoup

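html_content = """
<html>
<body>
<h1>Python Text Processing</h1>
<p>This is a sample HTML document</p>
<div class="content">The main text content goes here</div>
</body>
</html>
"""

# Parse the HTML
soup = BeautifulSoup(html_content, 'html.parser')

# Extract all text, joining the pieces with single spaces
all_text = soup.get_text(separator=' ', strip=True)
print(all_text)
# Output: Python Text Processing This is a sample HTML document The main text content goes here

# Extract the text of a specific element
content_div = soup.find('div', class_='content')
print(content_div.text)  # Output: The main text content goes here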
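
BeautifulSoup is a third-party package. For simple text extraction where installing it is not an option, the standard library's html.parser module can serve as a rough substitute. The sketch below is an assumption-laden alternative, not a replacement for BeautifulSoup: the class name TextExtractor and the sample markup are invented for the example, and there is no element lookup by tag or class.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text nodes of a document, skipping whitespace-only ones."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.parts.append(stripped)

sample_html = "<h1>Title</h1><p>First &amp; second</p><div class='content'>Body</div>"

parser = TextExtractor()
parser.feed(sample_html)
parser.close()
plain_text = " ".join(parser.parts)
print(plain_text)
```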

Hands-On: A File Text Transformation Script

The following complete Python script processes a text file:

import re
from pathlib import Path

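def process_text_file(input_file, output_file):
    """
    Process a text file:
    1. Convert to lowercase
    2. Collapse runs of whitespace
    3. Replace URLs with a placeholder
    4. Restore the capitalization of "Python"
    """
    try:
        # Read the file contents
        with open(input_file, 'r', encoding='utf-8') as f:
            content = f.read()

        # Apply the text transformations
        content = content.lower()  # Convert to lowercase
        content = re.sub(r'\s+', ' ', content)  # Collapse whitespace runs (including newlines) into one space
        content = re.sub(r'http\S+', '[URL]', content)  # Replace URLs
        content = content.replace('python', 'Python')  # Restore the capitalization of "Python"

        # Write the processed content
        with open(output_file, 'w', encoding='utf-8') as f:
            f.write(content)

        print(f"Processing complete; result saved to: {output_file}")
        return True

    except (OSError, UnicodeError) as e:
        # Covers missing or unreadable files and text that is not valid UTF-8
        print(f"Error while processing file: {e}")
        return False

# Example usage
if __name__ == "__main__":
    input_file = "source.txt"
    output_file = "processed.txt"

    # Make sure the input file exists
    if Path(input_file).exists():
        process_text_file(input_file, output_file)
    else:
        print(f"Input file {input_file} does not exist")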
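
One way to make a script like this easier to test is to separate the string transformation from the file handling. The sketch below shows what that might look like. The function name transform_text is hypothetical, and it deviates from the script in two labeled ways: it strips the ends of the result, and it uses a word boundary so that only the standalone word "python" is re-capitalized.

```python
import re

def transform_text(content):
    """Apply the script's transformation steps to an in-memory string."""
    content = content.lower()
    # Deviation from the script: also strip leading and trailing whitespace.
    content = re.sub(r"\s+", " ", content).strip()
    content = re.sub(r"http\S+", "[URL]", content)
    # Deviation: \b limits the fix to the standalone word, so "pythonic" stays lowercase.
    return re.sub(r"\bpython\b", "Python", content)

sample = "PYTHON  docs:\n https://docs.python.org  are Pythonic"
result = transform_text(sample)
print(result)
```

With the logic isolated this way, the file-handling function only needs to read, call transform_text(), and write.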