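'` | `//div` |
| Match elements with class="title" | `r'class="title"'` | `//*[@class="title"]` |
| Match links containing /subject/ | `r'href="/subject/[^"]+"'` | `//a[contains(@href,"/subject/")]` |
| Extract a tag's text content | `r'>([^<]+)'` → group(1) | `//span/text()` |
| Match any characters | `.+?` (non-greedy) | `//text()[contains(.,'keyword')]` |
| Match digits | `\d+` | `//span[number(text()) = number(text())]` |
| Match start/end | `^content`, `content$` | `//div[starts-with(@class,"title")]` |

---

# 9. Common Problems and Solutions

## 9.1 Nothing Matches

```python
import re

text = 'Hello\nWorld'

# Problem: . does not match a newline, so the text is split into two matches
print(re.findall(r'.+', text))  # ['Hello', 'World']

# Fix 1: use re.DOTALL so that . also matches newlines
print(re.findall(r'.+', text, re.DOTALL))  # ['Hello\nWorld']

# Fix 2: use [\s\S] instead of .
print(re.findall(r'[\s\S]+', text))  # ['Hello\nWorld']
```

## 9.2 Matching Chinese Text

```python
import re

text = 'Python是一门很棒的语言，C语言也是'

# Match runs of Chinese characters (CJK Unified Ideographs, U+4E00 to U+9FFF).
# The full-width comma is outside this range, so it splits the matches.
chinese = re.findall(r'[\u4e00-\u9fff]+', text)
print(chinese)  # ['是一门很棒的语言', '语言也是']

# The same pattern can be stored in a variable and reused
pattern = r'[\u4e00-\u9fff]+'
print(re.findall(pattern, text))
```

## 9.3 Case-Insensitive Matching

```python
import re

text = 'Python PYTHON pYtHoN'

# Matching is case-sensitive by default
print(re.findall(r'python', text))  # []

# Use re.IGNORECASE (short form: re.I)
print(re.findall(r'python', text, re.I))  # ['Python', 'PYTHON', 'pYtHoN']
```

## 9.4 Multiline Mode

```python
import re

text = '''第一行内容
第二行内容
第三行内容'''

# Problem: ^ only matches at the start of the whole string
print(re.findall(r'^第.+', text))  # ['第一行内容']

# Fix: use re.MULTILINE (short form: re.M) so ^ matches at the start of every line
print(re.findall(r'^第.+', text, re.M))  # ['第一行内容', '第二行内容', '第三行内容']
```

## 9.5 Compiling Regular Expressions for Performance

If a regular expression is used many times, compiling it once up front can improve performance. The `re` module caches recently used patterns, so the gain comes mainly from skipping the cache lookup on each call:

```python
import re
import time

# Text to be matched 10,000 times
text = '订单号：20240315-001，总额：999.50元'

# Method 1: pass the pattern string on every call
start = time.perf_counter()
for _ in range(10000):
    re.findall(r'\d{4}-\d+', text)
print(f'Uncompiled: {time.perf_counter() - start:.4f}s')

# Method 2: compile once, then reuse the compiled pattern
pattern = re.compile(r'\d{4}-\d+')
start = time.perf_counter()
for _ in range(10000):
    pattern.findall(text)
print(f'Compiled: {time.perf_counter() - start:.4f}s')
```

---

# 10. Hands-on Exercises

## Exercise 1: Extract a Weather Forecast

**Goal:** extract the date, weather, and temperature from the text below.

```python
import re

text = '''
2024-03-15 天气：晴 温度：15-25°C
2024-03-16 天气：多云 温度：12-20°C
2024-03-17 天气：小雨 温度：10-18°C
'''

# Hint: use groups to capture the date, weather, and temperatures
# pattern = r'your regular expression'

pattern = r'(\d{4}-\d{2}-\d{2})\s*天气：(\S+)\s*温度：(\d+)-(\d+)°C'
matches = re.findall(pattern, text)

for match in matches:
    date, weather, low, high = match
    print(f'{date}: {weather}, {low}°C-{high}°C')
```

**Expected output:**

```
2024-03-15: 晴, 15°C-25°C
2024-03-16: 多云, 12°C-20°C
2024-03-17: 小雨, 10°C-18°C
```

---

## Exercise 2: Scrape Douban Movie Information

**Goal:** write regular expressions that extract movie information from the simulated HTML.

```python
import re

html = '''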

《流浪地球》

(2024) 8.5 导演：郭帆

《你好，李焕英》

(2024) 7.9 导演：贾玲

'''

# Write regular expressions that extract all the movie information
# pattern = r'your regular expression'

# Hint: use one pattern per field, or a single more complex pattern for everything.
# The patterns below match the sample text as it appears above.
name_pattern = r'《([^》]+)》'                # title between 《 and 》
year_pattern = r'\((\d{4})\)'                # four-digit year inside parentheses
rating_pattern = r'\(\d{4}\)\s*(\d+\.\d+)'   # decimal rating right after the year
director_pattern = r'导演：([^<\n]+)'         # text after 导演： up to a tag or line end

names = re.findall(name_pattern, html)
years = re.findall(year_pattern, html)
ratings = re.findall(rating_pattern, html)
directors = re.findall(director_pattern, html)

for name, year, rating, director in zip(names, years, ratings, directors):
    print(f"{name} | {year} | Rating: {rating} | {director}")
```

---

## Exercise 3: Log Analysis

**Goal:** extract the IP address, request time, and status code from a server log.

```python
import re

log = '''
192.168.1.100 - - [15/Mar/2024:10:15:30 +0800] "GET /index.html HTTP/1.1" 200 1234
10.0.0.50 - - [15/Mar/2024:10:15:31 +0800] "POST /api/login HTTP/1.1" 200 256
192.168.1.101 - - [15/Mar/2024:10:15:32 +0800] "GET /notfound.html HTTP/1.1" 404 512
172.16.0.200 - - [15/Mar/2024:10:15:33 +0800] "GET /images/logo.png HTTP/1.1" 200 4096
'''

# Extract the IP, timestamp, and status code
pattern = r'(\d+\.\d+\.\d+\.\d+).*?\[([^\]]+)\].*?" (\d{3}) \d+'

for match in re.finditer(pattern, log):
    ip, timestamp, status = match.groups()
    print(f'IP: {ip:15} | Time: {timestamp:25} | Status: {status}')
```

---

## Exercise 4: Masking Phone Numbers

**Goal:** replace the middle four digits of each mobile number with `*` to protect privacy.

```python
import re

phone_book = '''
张三：138-1234-5678
李四：139-5678-1234
王五：138-0000-1111
'''

# Masking: output numbers in the form 138-****-5678
# Hint: use groups together with re.sub
pattern = r'(\d{3})-(\d{4})-(\d{4})'

def mask_phone(match):
    # Keep the first and last groups, hide the middle one
    return f'{match.group(1)}-****-{match.group(3)}'

masked = re.sub(pattern, mask_phone, phone_book)
print(masked)
```

---

## Exercise 5: Comprehensive Challenge (Optional)

Use regular expressions to extract all the course information from the timetable HTML below:

```python
import re

html = '''

时间	课程	教室
周一 1-2节	Python程序设计	A101
周一 3-4节	数据结构	B205
周二 1-2节	高等数学	C301
周三 5-6节	Python程序设计	A102

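'''

# Extract the time slot, course, and classroom from each data row.
# Data rows start with 周 (weekday), which skips the header row.
row_pattern = r'^(周\S+ \S+)[ \t]+(\S+)[ \t]+(\S+)$'
courses = re.findall(row_pattern, html, re.M)

print('Timetable:')
for slot, course, room in courses:
    print(f'{slot} | {course} | {room}')
```

---

# 📋 Course Summary

## Key Points

1. **Regular expressions are a powerful text-matching tool** that can precisely extract target content from any text
2. **Metacharacters are the foundation**:
   - `.` matches any character (except a newline by default)
   - `\d` matches a digit; `\w` matches a letter, digit, or underscore
   - `[]` defines a character set; `[^...]` excludes a character set
3. **Quantifiers control repetition**:
   - `+` one or more, `*` zero or more, `?` zero or one
   - `{n,m}` a specific range
4. **Greedy vs. non-greedy**:
   - Greedy (`.*`, `\d+`) matches as much as possible
   - Non-greedy (`.*?`, `\d+?`) matches as little as possible
   - `[^<]+` is a common scraping technique: it stops at the next `<`, which gives the effect of a non-greedy match
5. **Groups `()`**:
   - Capture matched content
   - Use `\1`, `\2` to reference groups in a replacement
   - Named groups `(?P<name>...)` are clearer
6. **BeautifulSoup + regex = a golden combination**:
   - BeautifulSoup handles locating elements by structure
   - Regex handles precise extraction
7. **Common `re` module functions**:
   - `findall()` returns a list of all matches
   - `search()` finds the first match
   - `match()` matches only at the start of the string
   - `finditer()` returns an iterator, which saves memory
   - `sub()` replaces matched content

## Common Errors at a Glance

| Error | Cause | Fix |
|------|------|----------|
| Match result is empty | `.` does not match newlines | Use `re.DOTALL` or `[\s\S]` |
| Matches too much | Greedy matching | Switch to non-greedy `*?`, `+?` |
| Matches the wrong text | Metacharacter not escaped | Use `\.` to match a literal dot |
| Chinese text fails to match | Wrong character set | Use `[\u4e00-\u9fff]` |

## Next Lesson Preview

- XPath selectors in detail
- Scraping dynamic pages with Selenium
- Countering anti-scraping measures in practice

---

*This handout was generated by an AI teaching assistant. If you have any questions, feel free to ask at any time.*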
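
---

## Supplementary Example: Named Groups and Backreferences

The grouping points in the summary (named groups and `\1`-style references in a replacement) can be illustrated with a minimal sketch. The sample string and the group names `year`, `month`, and `day` are made up for this example and do not come from the exercises.

```python
import re

# Sample string invented for this sketch
text = 'Order date: 2024-03-15'

# Named groups make each captured part self-describing
date_pattern = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')

m = date_pattern.search(text)
parts = m.groupdict()
print(parts)  # {'year': '2024', 'month': '03', 'day': '15'}

# Backreferences \1, \2, \3 reuse the captured groups inside the replacement
reordered = date_pattern.sub(r'\3/\2/\1', text)
print(reordered)  # Order date: 15/03/2024
```

Named groups can still be referenced by number, so the same compiled pattern serves both `groupdict()` lookups and numbered backreferences in `sub()`.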
