# 📚 Regular Expressions: A Power Tool for Scraper Content Extraction

---

# Table of Contents

1. [Background: Why Do We Need Regular Expressions?](#1-background-why-do-we-need-regular-expressions)
2. [Regular Expressions Quick Start](#2-regular-expressions-quick-start)
3. [Python's `re` Module in Detail](#3-pythons-re-module-in-detail)
4. [Greedy vs Non-Greedy Matching](#4-greedy-vs-non-greedy-matching)
5. [Groups and Capturing](#5-groups-and-capturing)
6. [In Practice: Regular Expressions in Web Scraping](#6-in-practice-regular-expressions-in-web-scraping)
7. [BeautifulSoup + Regular Expressions: A Winning Combination](#7-beautifulsoup--regular-expressions-a-winning-combination)
8. [Regular Expressions vs XPath: Two Extraction Approaches Compared](#8-regular-expressions-vs-xpath-two-extraction-approaches-compared)
9. [Common Problems and Solutions](#9-common-problems-and-solutions)
10. [Hands-On Exercises](#10-hands-on-exercises)

---
# 1. Background: Why Do We Need Regular Expressions?

## 1.1 Recap: The Limits of BeautifulSoup

Last lesson we learned BeautifulSoup, which conveniently extracts HTML content via CSS selectors. But it has one precondition: **the page content must live in a regular HTML structure**.

```python
# BeautifulSoup is great for cleanly structured HTML like this
<div class="movie">
    <span class="title">流浪地球</span>
    <span class="score">8.5</span>
</div>

soup.select('.movie .title')  # ✅ extracts perfectly
```

**However**, sometimes the content we need is not inside standard HTML tags:

```html
<!-- the rating lives in attribute values -->
<a href="/movie/1" data-title="流浪地球" data-score="8.5">流浪地球</a>

<!-- or it is mixed into script text -->
<script>
window.__INITIAL_STATE__ = {"title":"流浪地球","score":8.5}
</script>

<!-- or it is an email address, phone number, ID number, etc. -->
联系我们:010-12345678,邮箱:help@example.com
```

This is where **regular expressions** shine!

## 1.2 What Is a Regular Expression?

A **regular expression** (regex for short) is a powerful tool for **text pattern matching**.

- It is not specific to Python; it is a general text-processing technique
- It uses a special syntax to describe patterns in text
- It can precisely pull the content we need out of messy text

> 💡 **An analogy**
>
> If BeautifulSoup is like "picking food with chopsticks" (good for well-structured HTML), then a regular expression is like "sifting beans through a sieve" (good for precisely extracting targets from arbitrary text).

---
# 2. Regular Expressions Quick Start

## 2.1 Literal Characters: Exact Matching

The simplest regular expression just matches the characters we see:

```python
import re

pattern = r'python'  # matches the literal text "python"
text = 'I love python programming'

result = re.findall(pattern, text)
print(result)  # ['python']
```

**Note**: the `r` in `r'python'` marks a **raw string**; Python will not process escape sequences inside it.

```python
# Without r: \n is interpreted as a newline
print('<span class="title">\n</span>')

# With r: \n is just two characters
print(r'<span class="title">\n</span>')
```

## 2.2 Metacharacters: Characters with Special Meaning

Some characters carry special meaning in a regular expression:

| Metacharacter | Meaning | Example |
|--------|------|------|
| `.` | matches **any** single character (except newline) | `r't.m'` matches 'tom', 'tam' |
| `\d` | matches any digit | `r'\d'` matches '0'-'9' |
| `\D` | matches any non-digit | `r'\D'` matches non-digit characters |
| `\w` | matches letters, digits, underscore | `r'\w'` matches a-z, A-Z, 0-9, _ |
| `\W` | matches non-word characters | `r'\W'` matches spaces, punctuation, etc. |
| `\s` | matches whitespace | `r'\s'` matches space, \t, \n |
| `\S` | matches non-whitespace | `r'\S'` matches any non-whitespace character |
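A quick sanity check of the shorthand classes in the table (the sample string here is made up for illustration):

```python
import re

text = 'user_42 paid $3.50'

print(re.findall(r'\d', text))   # ['4', '2', '3', '5', '0']
print(re.findall(r'\s', text))   # [' ', ' '] - the two spaces
print(re.findall(r'\w+', text))  # ['user_42', 'paid', '3', '50'] - $ and . break the runs
```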
```python
import re

# Example: match phone numbers (assuming the format 138-1234-5678)
phone_text = '张三的电话是138-1234-5678,李四是139-5678-1234'
pattern = r'\d{3}-\d{4}-\d{4}'
phones = re.findall(pattern, phone_text)
print(phones)  # ['138-1234-5678', '139-5678-1234']
```

## 2.3 Quantifiers: Controlling Repetition

| Quantifier | Meaning | Example |
|------|------|------|
| `*` | 0 or more times | `r'ab*'` matches 'a', 'ab', 'abbb' |
| `+` | 1 or more times | `r'ab+'` matches 'ab', 'abbb' (not 'a') |
| `?` | 0 or 1 time | `r'ab?'` matches 'a' or 'ab' |
| `{n}` | exactly n times | `r'\d{4}'` matches 4 digits |
| `{n,}` | at least n times | `r'\d{4,}'` matches 4 or more digits |
| `{n,m}` | n to m times | `r'\d{4,6}'` matches 4 to 6 digits |

```python
import re

text = '我有 2 个苹果,她有 10 个橘子,他有 100 个香蕉'

# Match one or more digits
numbers = re.findall(r'\d+', text)
print(numbers)  # ['2', '10', '100']

# Match 4-digit numbers (years)
year_text = '北京奥运会是2008年,上海世博会2010年'
years = re.findall(r'\d{4}', year_text)
print(years)  # ['2008', '2010']
```

## 2.4 Character Classes: `[]`

Square brackets `[]` define a set of characters; the class matches **any one** character from the set:

```python
import re

# Match any of several color names (this one uses alternation `|`, a close relative of character classes)
colors = re.findall(r'red|green|blue', 'red apple green leaf blue sky')
print(colors)  # ['red', 'green', 'blue']

# Match digits or letters
alphanum = re.findall(r'[A-Za-z0-9]', 'Hello123!')
print(alphanum)  # ['H', 'e', 'l', 'l', 'o', '1', '2', '3']

# Using ranges
letters = re.findall(r'[a-z]', 'Hello123')
print(letters)  # ['e', 'l', 'l', 'o']

# Negation: [^...] means "anything but ..."
non_digit = re.findall(r'[^0-9]', 'abc123')
print(non_digit)  # ['a', 'b', 'c']
```

**Common character-class shorthands:**

| Character class | Equivalent | Meaning |
|--------|------|------|
| `[0-9]` | `\d` | digits |
| `[a-zA-Z]` | - | letters |
| `[a-zA-Z0-9_]` | `\w` | letters, digits, or underscore |
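Note that `\w` also matches the underscore, which plain `[a-zA-Z0-9]` does not. A quick check:

```python
import re

s = 'py_3'
print(re.findall(r'[a-zA-Z0-9]', s))  # ['p', 'y', '3'] - underscore excluded
print(re.findall(r'\w', s))           # ['p', 'y', '_', '3'] - underscore included
```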
## 2.5 Boundary Matching

| Boundary | Meaning | Example |
|------|------|------|
| `^` | matches the **start** of the string | `r'^Hello'` matches strings starting with Hello |
| `$` | matches the **end** of the string | `r'World$'` matches strings ending with World |
| `\b` | matches a word boundary | `r'\bword\b'` matches "word" exactly |

```python
import re

# Match phone numbers starting with 138
phones = ['138-1234-5678', '139-5678-1234', '100-1234-5678']
for phone in phones:
    if re.match(r'^138', phone):
        print(f'{phone} is a China Mobile number')

# Match domains ending in .com
domains = ['example.com', 'test.org', 'hello.com.cn', 'site.com']
for d in domains:
    if re.search(r'\.com$', d):
        print(f'{d} is a commercial site')
```
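The `\b` word boundary from the table deserves its own demo; it is what keeps a pattern from matching inside longer words (the sample string is invented):

```python
import re

text = 'cat catalog concat cat'

# Without boundaries, 'cat' also matches inside 'catalog' and 'concat'
print(re.findall(r'cat', text))      # ['cat', 'cat', 'cat', 'cat']

# \b...\b only matches 'cat' as a standalone word
print(re.findall(r'\bcat\b', text))  # ['cat', 'cat']
```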
## 2.6 The Escape Character `\`

To match a metacharacter literally (such as `.`, `*`, `?`), escape it with `\`:

```python
import re

# Match IP addresses (note the escaped dots)
ip_text = '服务器IP: 192.168.1.1,子控IP: 10.0.0.1'
pattern = r'\d+\.\d+\.\d+\.\d+'
ips = re.findall(pattern, ip_text)
print(ips)  # ['192.168.1.1', '10.0.0.1']

# Match email addresses (@ itself needs no escaping)
email_text = '联系邮箱: user@example.com,备用: admin@test.org'
pattern = r'[a-zA-Z0-9_]+@[a-zA-Z0-9]+\.[a-zA-Z]+'
emails = re.findall(pattern, email_text)
print(emails)  # ['user@example.com', 'admin@test.org']
```

---
# 3. Python's `re` Module in Detail

## 3.1 `re.findall()` - Find All Matches

**The most commonly used function**: returns a list of all matches.

```python
import re

text = '我的邮箱是 user@python.com,另一个是 admin@java.com'

# Extract all email addresses
pattern = r'[a-zA-Z0-9_]+@[a-zA-Z0-9]+\.[a-zA-Z]+'
emails = re.findall(pattern, text)
print(emails)  # ['user@python.com', 'admin@java.com']
```

## 3.2 `re.search()` - Find the First Match

Returns the first match as a Match object, or `None` if nothing matched.

```python
import re

text = '电话号码:138-1234-5678 或 139-5678-1234'

# Find the first phone number
result = re.search(r'\d{3}-\d{4}-\d{4}', text)
if result:
    print(f'Found: {result.group()}')  # 138-1234-5678
    print(f'Position: {result.start()}-{result.end()}')  # 5-18
```

## 3.3 `re.match()` - Match from the Start of the String

Matches only at the **beginning** of the string; returns a Match object on success, otherwise `None`.

```python
import re

# Check whether a URL starts with "http"
urls = ['http://example.com', 'https://test.org', 'ftp://files.net']

for url in urls:
    result = re.match(r'http', url)
    print(f'{url}: {"match" if result else "no match"}')
# http://example.com: match
# https://test.org: match (careful - 'https' begins with 'http' too!)
# ftp://files.net: no match
```

To require exactly the `http://` scheme, include the separator in the pattern: `re.match(r'http://', url)`.

## 3.4 `re.finditer()` - Return an Iterator

Suited to large-scale matching; saves memory:

```python
import re

text = '3.14是圆周率,1.414是根号2,9.8是重力加速度'

# Find all decimal numbers
for match in re.finditer(r'\d+\.\d+', text):
    print(f'Found: {match.group()}, position: {match.start()}-{match.end()}')

# Output:
# Found: 3.14, position: 0-4
# Found: 1.414, position: 9-14
# Found: 9.8, position: 19-22
```

## 3.5 Methods of the Match Object

```python
import re

text = '时间:2024-03-15 地点:北京'
result = re.search(r'(\d{4})-(\d{2})-(\d{2})', text)

print(result.group())   # 2024-03-15 (the full match)
print(result.group(1))  # 2024 (first group)
print(result.group(2))  # 03 (second group)
print(result.group(3))  # 15 (third group)
print(result.groups())  # ('2024', '03', '15')
print(result.span())    # (3, 13)
```

## 3.6 `re.sub()` - Substitution

```python
import re

text = '我的手机号是138-1234-5678,请记住!'

# Mask the middle four digits of the phone number with * (anonymization)
pattern = r'(\d{3})-(\d{4})-(\d{4})'
result = re.sub(pattern, r'\1-****-\3', text)
print(result)  # 我的手机号是138-****-5678,请记住!
```

---
# 4. Greedy vs Non-Greedy Matching

## 4.1 What Is Greedy Matching?

Regex quantifiers (`*`, `+`, `{n,}`) are **greedy** by default: they match as many characters as possible.

```python
import re

html = '<div class="title">流浪地球</div><div class="title">你好,李焕英</div>'

# Greedy matching with .*
pattern = r'<div class="title">.*</div>'
result = re.findall(pattern, html)
print('Greedy result:')
print(result)
# Output: ['<div class="title">流浪地球</div><div class="title">你好,李焕英</div>']
# ⚠️ Wrong! It merged the two divs into one match!
```

**Why**: `.*` keeps matching all the way to the **last** `</div>`, not the first.

## 4.2 What Is Non-Greedy Matching?

Appending `?` to a quantifier makes it **non-greedy** (also called **lazy** or minimal matching):

```python
import re

html = '<div class="title">流浪地球</div><div class="title">你好,李焕英</div>'

# Non-greedy matching with .*?
pattern = r'<div class="title">.*?</div>'
result = re.findall(pattern, html)
print('Non-greedy result:')
print(result)
# Output: ['<div class="title">流浪地球</div>', '<div class="title">你好,李焕英</div>']
# ✅ Correct! Each div matched separately
```

## 4.3 Greedy vs Non-Greedy Side by Side

| Quantifier | Greedy | Non-greedy |
|------|------|--------|
| `*` | `*` | `*?` |
| `+` | `+` | `+?` |
| `{n,}` | `{n,}` | `{n,}?` |

```python
import re

text = 'abc123def456'

# Greedy: each match is as long as possible
print(re.findall(r'\d+', text))   # ['123', '456'] - each run of digits in full

# Non-greedy: each match is as short as possible
print(re.findall(r'\d+?', text))  # ['1', '2', '3', '4', '5', '6'] - one digit at a time
```

## 4.4 The `[^<]+` Trick

A classic scraping trick: use `[^<]+` instead of `.+?`

```python
import re

html = '<span class="title">流浪地球</span><span class="title">你好,李焕英</span>'

# Option 1: .*? non-greedy
pattern1 = r'<span class="title">.*?</span>'
print(re.findall(pattern1, html))

# Option 2: [^<]+ - better!
pattern2 = r'<span class="title">([^<]+)</span>'
titles = re.findall(pattern2, html)
print(titles)  # ['流浪地球', '你好,李焕英']
```

**Why is `[^<]+` better?**
- `.` does not match newlines by default (unless you use `re.DOTALL`), so `.`-based patterns can silently fail on multi-line content
- `[^<]` explicitly excludes `<`; it performs better and reads more clearly
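The newline pitfall is easy to demonstrate. In the sketch below the captured text itself spans two lines; `.*?` silently fails while `[^<]+` keeps working:

```python
import re

html = '<div class="t">第一行\n第二行</div>'

print(re.findall(r'<div class="t">(.*?)</div>', html))
# [] - . does not cross the newline

print(re.findall(r'<div class="t">([^<]+)</div>', html))
# ['第一行\n第二行'] - [^<] happily matches \n

print(re.findall(r'<div class="t">(.*?)</div>', html, re.DOTALL))
# ['第一行\n第二行'] - re.DOTALL rescues .*?
```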
---
# 5. Groups and Capturing

## 5.1 Capturing Groups `()`

Parentheses `()` create a group; whatever it matches is **captured** and can be retrieved via `groups()`.

```python
import re

text = '2024-03-15'

pattern = r'(\d{4})-(\d{2})-(\d{2})'
result = re.findall(pattern, text)
print(result)  # [('2024', '03', '15')]

# When the pattern contains groups, findall returns the captured groups directly
```

## 5.2 Named Groups `(?P<name>...)`

Give groups names for clarity:

```python
import re

text = '2024-03-15'

pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
result = re.search(pattern, text)

if result:
    print(result.group('year'))   # 2024
    print(result.group('month'))  # 03
    print(result.group('day'))    # 15
```

## 5.3 Non-Capturing Groups `(?:...)`

Sometimes you need grouping without capturing; `(?:...)` avoids creating a group number:

```python
import re

# Capturing group: creates group(1)
result1 = re.search(r'(\d{4})-(\d{2})', '2024-03')
print(result1.groups())  # ('2024', '03')

# Non-capturing group: no extra group is created
result2 = re.search(r'(?:\d{4})-(\d{2})', '2024-03')
print(result2.groups())  # ('03',) - only one group
```
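Non-capturing groups also change what `re.findall()` returns: with capturing groups it returns the group contents, without them it returns whole matches. A small comparison (the sample URLs are made up):

```python
import re

text = 'https://example.com http://test.org'

# Capturing group: findall returns only the group's content
print(re.findall(r'(https?)://', text))    # ['https', 'http']

# Non-capturing group: findall returns the whole match
print(re.findall(r'(?:https?)://', text))  # ['https://', 'http://']
```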
## 5.4 Groups in Substitutions

```python
import re

text = '2024-03-15'

# Reorder year/month/day
pattern = r'(\d{4})-(\d{2})-(\d{2})'
result = re.sub(pattern, r'\2/\3/\1', text)
print(result)  # 03/15/2024
```

---
# 6. In Practice: Regular Expressions in Web Scraping

## 6.1 Extracting Movie Information

```python
import re

# Simulated HTML scraped from a movie site
html = '''
<div class="movie-item">
    <div class="title">流浪地球</div>
    <div class="score">8.5</div>
    <div class="info">导演:郭帆 | 2024 | 科幻</div>
</div>
<div class="movie-item">
    <div class="title">你好,李焕英</div>
    <div class="score">7.9</div>
    <div class="info">导演:贾玲 | 2024 | 喜剧</div>
</div>
'''

# Extract titles
title_pattern = r'<div class="title">([^<]+)</div>'
titles = re.findall(title_pattern, html)
print('Titles:', titles)

# Extract ratings
score_pattern = r'<div class="score">([^<]+)</div>'
scores = re.findall(score_pattern, html)
print('Ratings:', scores)

# Extract directors
director_pattern = r'导演:([^|]+)'
directors = re.findall(director_pattern, html)
print('Directors:', [d.strip() for d in directors])
```

**Output:**
```
Titles: ['流浪地球', '你好,李焕英']
Ratings: ['8.5', '7.9']
Directors: ['郭帆', '贾玲']
```
## 6.2 Douban Movie Top 250: A Line-by-Line Walkthrough

> 💡 **About this section**
>
> You already used these snippets last class to scrape Douban's Top 250 movies. Many of you still had questions about the regex details, so this time we go through the code line by line and unpack how each pattern works.

### 6.2.1 Extracting Movie Titles (Filtering Out English Names)

```python
import requests
import re

url = 'https://www.douban.com/doulist/3936288/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
response = requests.get(url, headers=headers)
html = response.text

# Match <span class="title">movie name</span>
title_pattern = r'<span class="title">([^<]+)</span>'
titles = re.findall(title_pattern, html)

# Filter out English names (they start with /)
chinese_titles = [t for t in titles if not t.startswith('/')]

print('Movie titles (first 10):')
for i, title in enumerate(chinese_titles[:10], 1):
    print(f'{i}. {title}')
```

**Line-by-line notes:**

| Line | Meaning |
|--------|------|
| `html = response.text` | turn the HTML page returned by the server into a string |
| `title_pattern = r'<span class="title">([^<]+)</span>'` | define the regular expression |
| `re.findall(title_pattern, html)` | find every matching fragment in the whole HTML page |

**Pattern breakdown:** `r'<span class="title">([^<]+)</span>'`

```
<span class="title"> → matches the HTML tag literally, character by character
(                    → opens a capturing group - "grabs" what it matches
  [^<]+              → the core! one or more characters that are not <
)                    → closes the capturing group
</span>              → matches the closing tag literally
```

`[^<]+` is the key:
- `[^...]` is a character class meaning "any character NOT listed inside the brackets"
- `[^<]` = any character that is not `<`
- `+` = one or more occurrences
- So `[^<]+` means: **keep matching until you hit a `<`**

Why is this needed? Look at Douban's real HTML structure:

```html
<span class="title">肖申克的救赎</span>
<span class="title">/The Shawshank Redemption</span>
<span class="title">千与千寻</span>
```

Each movie has two `<span class="title">` elements:
1. **The first**: the Chinese title, e.g. `肖申克的救赎`
2. **The second**: the English title, starting with `/`, e.g. `/The Shawshank Redemption`

That is why we filter afterwards:

```python
chinese_titles = [t for t in titles if not t.startswith('/')]
```

This is a list comprehension, equivalent to:

```python
chinese_titles = []
for t in titles:
    if not t.startswith('/'):
        chinese_titles.append(t)
```

**Think about it:** what happens if we use `.+` instead of `[^<]+`?

```python
# The greedy version
bad_pattern = r'<span class="title">.+</span>'
# .+ matches as much as it can: from the first <span class="title"> all the way to...
# the LAST </span> in the entire HTML! Every title gets merged into one huge chunk of text!
```
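We can verify this on a miniature page (titles shortened to single letters for readability):

```python
import re

html = '<span class="title">A</span><span class="title">B</span>'

print(re.findall(r'<span class="title">.+</span>', html))
# ['<span class="title">A</span><span class="title">B</span>'] - one giant match!

print(re.findall(r'<span class="title">([^<]+)</span>', html))
# ['A', 'B'] - what we actually want
```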
---
### 6.2.2 Extracting Ratings

```python
rating_pattern = r'<span class="rating_num"[^>]*>(\d+\.\d)</span>'
ratings = re.findall(rating_pattern, html)

print('Ratings (first 10):')
for i, rating in enumerate(ratings[:10], 1):
    print(f'{i}. rating: {rating}')
```

**Pattern breakdown:** `r'<span class="rating_num"[^>]*>(\d+\.\d)</span>'`

```
<span class="rating_num" → matches the literal start of the opening tag
[^>]*                    → [^>] = any character that is not ">", * = zero or more times
                           meaning: the tag may carry other attributes (id, style, ...) - skip them all
>                        → matches the closing > of the tag
(\d+\.\d)                → captures the rating
                           \d+ = one or more digits
                           \.  = a literal dot (the dot must be escaped!)
                           \d  = one decimal digit
</span>                  → matches the closing tag
```

**Why `[^>]*`?**

Douban's real rating HTML can look like either of these:

```html
<!-- the simple case -->
<span class="rating_num">9.7</span>

<!-- the case with extra attributes -->
<span class="rating_num" id="rating_1" style="color:red">9.7</span>
```

`[^>]*` handles both flexibly: no matter how many attributes the tag carries (even none), the pattern still matches.
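A quick check that a single pattern covers both variants (the attribute values here are invented):

```python
import re

pattern = r'<span class="rating_num"[^>]*>(\d+\.\d)</span>'

print(re.findall(pattern, '<span class="rating_num">9.7</span>'))
# ['9.7'] - no extra attributes

print(re.findall(pattern, '<span class="rating_num" id="r1" style="color:red">9.7</span>'))
# ['9.7'] - extra attributes skipped by [^>]*
```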
**Why escape the dot (`\.`)?**

In a regex, `.` is a metacharacter meaning "any character". We want to match **the decimal point itself**, so we write `\.`.

```python
# Wrong: the unescaped dot matches anything
r'(\d+.\d)'   # . matches any character - dangerous!

# Right: escape the decimal point with \.
r'(\d+\.\d)'  # only a literal dot matches
```

---
### 6.2.3 Extracting Quotes (Taglines)

```python
quote_pattern = r'<span class="inq">([^<]+)</span>'
quotes = re.findall(quote_pattern, html)

print('Famous quotes:')
for i, quote in enumerate(quotes, 1):
    print(f'{i}. "{quote}"')
```

**Pattern breakdown:** `r'<span class="inq">([^<]+)</span>'`

This pattern is almost identical to the title one - same structure, only the class name differs:

```
<span class="inq"> → matches the opening tag
([^<]+)            → captures the text inside the tag (no < allowed)
</span>            → matches the closing tag
```

**Douban's real quote HTML:**

```html
<span class="inq">希望让人自由。</span>
<span class="inq">暗恋一个人,是最安静的秘密。</span>
```

Each quote is wrapped in a `<span class="inq">`, so the regex matches it directly by tag.

**Note:** not every movie has a quote, so the `quotes` list may be shorter than the number of movies. That is expected.

---
### 6.2.4 The Three Snippets, Complete

```python
import requests
import re

url = 'https://www.douban.com/doulist/3936288/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
response = requests.get(url, headers=headers)
response.encoding = 'utf-8'
html = response.text

# 1. Titles
title_pattern = r'<span class="title">([^<]+)</span>'
titles = re.findall(title_pattern, html)
chinese_titles = [t for t in titles if not t.startswith('/')]

# 2. Ratings
rating_pattern = r'<span class="rating_num"[^>]*>(\d+\.\d)</span>'
ratings = re.findall(rating_pattern, html)

# 3. Quotes
quote_pattern = r'<span class="inq">([^<]+)</span>'
quotes = re.findall(quote_pattern, html)

# Print them side by side (first 10)
print(f'{"Rank":<4} {"Title":<20} {"Rating":<6} {"Quote"}')
print('-' * 70)
for i in range(min(10, len(chinese_titles))):
    title = chinese_titles[i]
    rating = ratings[i] if i < len(ratings) else 'n/a'
    quote = f'"{quotes[i]}"' if i < len(quotes) else 'n/a'
    print(f'{i+1:<4} {title:<20} {rating:<6} {quote}')
```

**Sample output:**
```
Rank Title                Rating Quote
----------------------------------------------------------------------
1    肖申克的救赎          9.7    "希望让人自由。"
2    霸王别姬              9.6    n/a
3    阿甘正传              9.5    "生活就像一盒巧克力..."
4    泰坦尼克号            9.4    n/a
5    千与千寻              9.3    "我只能送你到这里了..."
...
```

**Caveat:** because not every movie has a quote, pairing lists by index (`quotes[i]`) can attach a quote to the wrong movie. A more robust version extracts title, rating, and quote per movie block, as shown later in section 8.6.

---
### 6.2.5 Debugging Tip: Print the Raw Matches First

If a pattern matches nothing, don't rush to change it - print the raw matches first:

```python
# ❌ Giving up as soon as it fails
titles = re.findall(title_pattern, html)
print(titles)  # []

# ✅ Debugging: inspect the raw match results
raw_matches = re.findall(title_pattern, html)
print(f'Raw match count: {len(raw_matches)}')
print(f'First 3: {raw_matches[:3]}')
# First 3: ['肖申克的救赎', '/The Shawshank Redemption', '霸王别姬']
```

Debugging checklist:
1. First check whether the raw match count is plausible (each movie contributes two title spans - Chinese and English - so expect roughly twice as many matches as movies on the page)
2. Then check whether the content looks right
3. Only then filter/process

---
## 6.3 Extracting Fields from Embedded JSON

Sometimes a page embeds JSON data, and a regex can pull it out quickly:

```python
import re
import json

# Simulated JSON embedded in a page
page_html = '''
<script>
window.__INIT_DATA__ = {
    "user": {
        "name": "张三",
        "age": 20,
        "email": "zhangsan@example.com"
    },
    "token": "abc123xyz"
};
</script>
'''

# Extract the JSON string
json_pattern = r'window\.__INIT_DATA__\s*=\s*({.*?});'
match = re.search(json_pattern, page_html, re.DOTALL)

if match:
    json_str = match.group(1)
    data = json.loads(json_str)
    print(f"Name: {data['user']['name']}")
    print(f"Email: {data['user']['email']}")
```

## 6.4 Extracting Phone Numbers, Emails, URLs

```python
import re

text = '''
张三的手机号:138-1234-5678,邮箱:zhangsan@python.com
李四的手机号:139-5678-1234,邮箱:lisi@java.com
他们的网站:http://example.com 和 https://test.org
地址:北京市海淀区中关村1号
'''

# Phone numbers: 3 digits - 4 digits - 4 digits
phones = re.findall(r'\d{3}-\d{4}-\d{4}', text)
print('Phones:', phones)

# Emails
emails = re.findall(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+', text)
print('Emails:', emails)

# URLs (http or https)
urls = re.findall(r'https?://[a-zA-Z0-9./-]+', text)
print('URLs:', urls)
```

## 6.5 Extracting Video/Audio Links

```python
import re

html = '''
<video src="https://cdn.example.com/video1.mp4"></video>
<video src="https://cdn.example.com/video2.mp4"></video>
<audio src="https://cdn.example.com/music.mp3"></audio>
'''

# Extract MP4 video links
video_pattern = r'<video src="([^"]+\.mp4)">'
videos = re.findall(video_pattern, html)
print('Videos:', videos)

# Extract all media links
media_pattern = r'<(?P<tag>video|audio) src="(?P<url>[^"]+)">'
for match in re.finditer(media_pattern, html):
    print(f"{match.group('tag')}: {match.group('url')}")
```

---
# 7. BeautifulSoup + Regular Expressions: A Winning Combination

## 7.1 Why Combine Them?

BeautifulSoup is great at navigating HTML structure, but it struggles with irregular text.

**Best practice:**
1. Use BeautifulSoup to locate the **region** that contains the target content
2. Use a regex inside that region for the **precise extraction**

```python
import requests
from bs4 import BeautifulSoup
import re

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}

url = 'https://movie.douban.com/subject/35267208/'

try:
    response = requests.get(url, headers=headers, timeout=10)
    soup = BeautifulSoup(response.text, 'lxml')

    # 1. Use BeautifulSoup to find the section containing the rating
    score_section = soup.select_one('#interest_sect_level')

    if score_section:
        # 2. Use a regex inside that section to extract the number
        score_text = score_section.text
        rating = re.search(r'(\d+\.\d)', score_text)
        if rating:
            print(f'Rating: {rating.group(1)}')

except Exception as e:
    print(f'Request failed: {e}')
```

## 7.2 Combining `re` with BeautifulSoup Results

```python
import requests
from bs4 import BeautifulSoup
import re

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}

# A Douban movie page
url = 'https://movie.douban.com/subject/35267208/'

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'lxml')

# Extract the page title (usually in the title tag)
title_tag = soup.find('title')
if title_tag:
    title = re.search(r'^(.+?)\s*\(豆瓣\)', title_tag.text)
    print(f'Title: {title.group(1) if title else title_tag.text}')

# Extract the year (4 digits); guard against the #info block being absent
info = soup.select_one('#info')
info_text = info.text if info else ''
year = re.search(r'\d{4}', info_text)
print(f'Year: {year.group() if year else "unknown"}')

# Extract the rating
rating = soup.select_one('.rating_num')
if rating:
    print(f'Douban rating: {rating.text}')

# Extract the number of ratings
comments = re.search(r'(\d+)\s*人评价', info_text)
if comments:
    print(f'Number of ratings: {comments.group(1)}')
```

## 7.3 Filtering Links by Pattern

```python
import requests
from bs4 import BeautifulSoup
import re

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}

url = 'https://www.example.com'
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'lxml')

# Find all links
all_links = soup.find_all('a', href=True)

# Keep only links pointing at .pdf files
pdf_links = [link['href'] for link in all_links
             if re.search(r'\.pdf$', link['href'])]

# Keep only image links
img_links = [link['href'] for link in all_links
             if re.search(r'\.(jpg|jpeg|png|gif)$', link['href'], re.I)]

print(f'PDF files ({len(pdf_links)}):')
for link in pdf_links:
    print(f'  {link}')

print(f'\nImage files ({len(img_links)}):')
for link in img_links[:5]:  # show only the first 5
    print(f'  {link}')
```

---
# 8. Regular Expressions vs XPath: Two Extraction Approaches Compared

> 💡 **About this section**
>
> The textbook mainly presents XPath as the tool for extracting page content, but regular expressions are just as capable. In this section we use the same case study (Douban movie Top 250) to compare the two approaches and understand their respective strengths and limits.

## 8.1 The Case Study: Douban Movie Top 250

First, recall last lesson's BeautifulSoup + CSS selector version:

```python
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}

url = 'https://www.douban.com/doulist/3936288/'
response = requests.get(url, headers=headers)
response.encoding = 'utf-8'
soup = BeautifulSoup(response.text, 'lxml')

# Find all movie entries
movies = soup.select('.doulist-item')

count = 0
for movie in movies:
    title_link = movie.select_one('a[href*="/subject/"]')
    if title_link:
        title = title_link.get_text(strip=True)
        rating = movie.select_one('.rating_nums')
        rating_text = rating.text.strip() if rating else 'no rating'
        print(f'{count + 1}. {title} - rating: {rating_text}')
        count += 1
        if count >= 10:
            break
```

**Sample output:**
```
1. 肖申克的救赎 - rating: 9.7
2. 霸王别姬 - rating: 9.6
3. 阿甘正传 - rating: 9.5
4. 泰坦尼克号 - rating: 9.4
5. 千与千寻 - rating: 9.3
...
```
## 8.2 Method Two: Regular Expressions

Looking directly at the Douban page source, each movie's information is structured roughly like this:

```html
<div class="doulist-item">
    ...
    <div class="title">
        <a href="/subject/1292055/">肖申克的救赎</a>
    </div>
    <span class="rating_nums">9.7</span>
    ...
</div>
```

**Extracting it with regular expressions:**

```python
import requests
import re

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}

url = 'https://www.douban.com/doulist/3936288/'
response = requests.get(url, headers=headers)
response.encoding = 'utf-8'
html = response.text

# Titles (capture the text inside <a> tags that link to /subject/)
title_pattern = r'<a href="/subject/\d+/">([^<]+)</a>'
titles = re.findall(title_pattern, html)

# Ratings
rating_pattern = r'class="rating_nums">([\d.]+)</span>'
ratings = re.findall(rating_pattern, html)

# Combined output
for i, title in enumerate(titles[:10]):
    rating = ratings[i] if i < len(ratings) else 'no rating'
    print(f'{i+1}. {title} - rating: {rating}')
```

**Output:**
```
1. 肖申克的救赎 - rating: 9.7
2. 霸王别姬 - rating: 9.6
3. 阿甘正传 - rating: 9.5
4. 泰坦尼克号 - rating: 9.4
5. 千与千寻 - rating: 9.3
...
```
## 8.3 Method Three: XPath

XPath is another language for locating nodes in XML/HTML documents. Here is the equivalent XPath version:

> ⚠️ **Note**: Python has no built-in XPath support; you need the `lxml` library:
> `pip install lxml`

```python
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}

url = 'https://www.douban.com/doulist/3936288/'
response = requests.get(url, headers=headers)
response.encoding = 'utf-8'
html = response.text

# Parse the HTML
tree = etree.HTML(html)

# XPath for titles
# //div[@class='doulist-item'] means: any div whose class attribute is doulist-item
# .//a[contains(@href,'/subject/')] means: any a inside that div whose href contains /subject/
title_xpath = '//div[@class="doulist-item"]//a[contains(@href,"/subject/")]/text()'
titles = tree.xpath(title_xpath)

# XPath for ratings
rating_xpath = '//div[@class="doulist-item"]//span[@class="rating_nums"]/text()'
ratings = tree.xpath(rating_xpath)

# Combined output
for i, title in enumerate(titles[:10]):
    rating = ratings[i] if i < len(ratings) else 'no rating'
    print(f'{i+1}. {title.strip()} - rating: {rating}')
```

**Output:**
```
1. 肖申克的救赎 - rating: 9.7
2. 霸王别姬 - rating: 9.6
3. 阿甘正传 - rating: 9.5
4. 泰坦尼克号 - rating: 9.4
5. 千与千寻 - rating: 9.3
...
```
## 8.4 The Three Methods Compared

| Dimension | BeautifulSoup (CSS selectors) | Regular expressions | XPath |
|----------|---------------------------|------------|-------|
| **Learning curve** | ⭐ low, easy to pick up | ⭐⭐⭐ steep, dense syntax | ⭐⭐ medium, its own syntax |
| **Unit of extraction** | HTML tags, attributes | text fragments (anywhere) | XML/HTML node tree |
| **Best for** | well-formed HTML | irregular text, string processing | well-formed XML/HTML |
| **Readability** | high, CSS-like | medium, symbol-heavy | medium, long path expressions |
| **Performance** | slower | fast (once compiled) | fast |
| **Flexibility** | moderate, tied to tag structure | very high, matches any pattern | high, tied to node structure |
| **External library** | beautifulsoup4 / lxml | none (`re` is built in) | lxml |
## 8.5 Which Method When?

### Prefer **regular expressions** when:

1. **The content is scattered inside and outside tags**
   ```html
   <script>
   window.data = {name: "流浪地球", score: 8.5};
   </script>
   ```
   BeautifulSoup and XPath struggle here, but a regex like `name: "(.+?)"` extracts it easily

2. **You need precise text-format matching**
   - phone numbers: `\d{3}-\d{4}-\d{4}`
   - emails: `[a-z]+@[a-z]+\.[a-z]+`
   - IP addresses: `\d+\.\d+\.\d+\.\d+`

3. **You are extracting from logs or unstructured text**
   ```
   [2024-03-15 10:30:45] ERROR: connection failed
   [2024-03-15 10:31:00] INFO: retry successful
   ```

4. **You are cleaning or rewriting data**
   ```python
   # Mask phone numbers
   re.sub(r'(\d{3})-(\d{4})-(\d{4})', r'\1-****-\3', text)
   ```
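Scenarios 1 and 3 above can be sketched in a few runnable lines (the variable name and log line are invented samples):

```python
import re

# Scenario 1: a value buried inside inline JavaScript
js = 'window.data = {name: "流浪地球", score: 8.5};'
m = re.search(r'name: "(.+?)"', js)
print(m.group(1))  # 流浪地球

# Scenario 3: structured fields inside free-form log lines
log = '[2024-03-15 10:30:45] ERROR: connection failed'
m = re.match(r'\[([\d\- :]+)\] (\w+): (.+)', log)
print(m.group(1))  # 2024-03-15 10:30:45
print(m.group(2))  # ERROR
print(m.group(3))  # connection failed
```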
### Prefer **XPath** when:

1. **The HTML structure is clean and hierarchical**
   ```html
   <table>
       <tr><td>Python</td><td>90</td></tr>
       <tr><td>Math</td><td>85</td></tr>
   </table>
   ```
   `//table//tr/td[1]/text()` extracts the first column

2. **You need to walk a tree structure**
   XPath supports axes such as `parent::*` and `following-sibling::*`

3. **The document is structured (e.g. XML)**
   XPath was designed for XML; it handles RSS, SVG, and similar formats naturally

### Prefer **BeautifulSoup (CSS)** when:

1. **You are a beginner** - the syntax is the simplest
2. **You are prototyping quickly** and don't need fine-grained control
3. **CSS selectors are already enough** for the job
## 8.6 Going Further: Combining Regex + XPath

In fact, the most powerful approach is to combine the two:

```python
import requests
from lxml import etree
import re

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}

url = 'https://www.douban.com/doulist/3936288/'
response = requests.get(url, headers=headers)
response.encoding = 'utf-8'
html = response.text

# Step 1: use XPath to locate every movie block (structural filtering)
tree = etree.HTML(html)
movie_blocks = tree.xpath('//div[@class="doulist-item"]')

results = []
for block in movie_blocks[:10]:
    block_html = etree.tostring(block, encoding='unicode')

    # Step 2: use regexes inside each block for precise extraction
    movie_id = re.search(r'/subject/(\d+)/', block_html)
    movie_id = movie_id.group(1) if movie_id else 'unknown'

    title = block.xpath('.//a[contains(@href,"/subject/")]/text()')
    title = title[0].strip() if title else 'unknown'

    rating = block.xpath('.//span[@class="rating_nums"]/text()')
    rating = rating[0] if rating else 'no rating'

    results.append(f'{movie_id}. {title} - rating: {rating}')

for r in results:
    print(r)
```

**The idea in short:**
```
XPath handles structure  → quickly narrow down to the target region (skipping masses of irrelevant HTML)
Regex handles precision  → match the exact target text inside that region
```
## 8.7 Syntax Cheat Sheet

| Task | Regular expression | XPath |
|------|-----------|-------|
| Match all div tags | `r'<div>'` | `//div` |
| Match elements with class="title" | `r'class="title"'` | `//*[@class="title"]` |
| Match links containing /subject/ | `r'href="/subject/[^"]+"'` | `//a[contains(@href,"/subject/")]` |
| Extract a tag's text content | `r'>([^<]+)</span>'` → group(1) | `//span/text()` |
| Match text containing a keyword | the keyword as a literal pattern | `//text()[contains(.,'关键词')]` |
| Match runs of digits | `\d+` | XPath 1.0 has no regex; approximate with `number()`/`translate()` tests |
| Anchor at start/end | `^内容`, `内容$` | `//div[starts-with(@class,"title")]` |

---
|
||
|
||
|
||
# 9. 常见问题与解决方案
|
||
|
||
## 9.1 匹配不到内容
|
||
|
||
```python
|
||
import re
|
||
|
||
text = 'Hello\nWorld'
|
||
|
||
# 问题:. 不匹配换行符
|
||
print(re.findall(r'.+', text)) # ['Hello', 'World'] - 被拆成了两行?
|
||
|
||
# 解决1:用 re.DOTALL 模式(让 . 匹配换行)
|
||
print(re.findall(r'.+', text, re.DOTALL)) # ['Hello\nWorld']
|
||
|
||
# 解决2:用 [\s\S] 代替 .
|
||
print(re.findall(r'[\s\S]+', text)) # ['Hello\nWorld']
|
||
```
|
||
|
||
## 9.2 中文匹配
|
||
|
||
```python
|
||
import re
|
||
|
||
text = 'Python是一门很棒的语言,C语言也是'
|
||
|
||
# 匹配中文
|
||
chinese = re.findall(r'[\u4e00-\u9fff]+', text)
|
||
print(chinese) # ['很棒的语言', '也是']
|
||
|
||
# 或者直接用 unicode
|
||
pattern = r'[\u4e00-\u9fff]+'
|
||
print(re.findall(pattern, text))
|
||
```

## 9.3 大小写不敏感匹配

```python
import re

text = 'Python PYTHON pYtHoN'

# 默认区分大小写
print(re.findall(r'python', text))  # []

# 用 re.IGNORECASE(可简写为 re.I)
print(re.findall(r'python', text, re.I))  # ['Python', 'PYTHON', 'pYtHoN']
```

## 9.4 多行模式

```python
import re

text = '''第一行内容
第二行内容
第三行内容'''

# 问题:^ 只匹配字符串开头
print(re.findall(r'^第.+', text))  # ['第一行内容']

# 解决:用 re.MULTILINE(可简写为 re.M)
print(re.findall(r'^第.+', text, re.M))  # ['第一行内容', '第二行内容', '第三行内容']
```

## 9.5 编译正则表达式提高性能

如果一个正则表达式要匹配多次,预先编译可以提高性能。注意:re 模块内部会缓存最近使用过的模式,所以两种写法的差距通常不大,但在热点循环中显式编译依然是更稳妥的做法:

```python
import re
import time

# 需要匹配 10000 次的文本
text = '订单号:20240315-001,总额:999.50元'

# 方法1:每次调用都经过 re 模块的缓存查找
start = time.time()
for _ in range(10000):
    re.findall(r'\d{4}-\d+', text)
print(f'未编译: {time.time() - start:.4f}秒')

# 方法2:预先编译,直接调用编译后对象的方法
pattern = re.compile(r'\d{4}-\d+')
start = time.time()
for _ in range(10000):
    pattern.findall(text)
print(f'已编译: {time.time() - start:.4f}秒')
```

---

# 10. 动手练习

## 练习 1:提取天气预报

**目标**:从以下文本中提取日期、天气和温度。

```python
import re

text = '''
2024-03-15 天气:晴 温度:15-25°C
2024-03-16 天气:多云 温度:12-20°C
2024-03-17 天气:小雨 温度:10-18°C
'''

# 提示:使用分组捕获日期、天气、温度
# pattern = r'你的正则表达式'

pattern = r'(\d{4}-\d{2}-\d{2})\s*天气:([^ ]+)\s*温度:(\d+)-(\d+)°C'
matches = re.findall(pattern, text)

for match in matches:
    date, weather, low, high = match
    print(f'{date}: {weather}, {low}°C-{high}°C')
```

**预期输出:**
```
2024-03-15: 晴, 15°C-25°C
2024-03-16: 多云, 12°C-20°C
2024-03-17: 小雨, 10°C-18°C
```

---

## 练习 2:爬取豆瓣电影信息

**目标**:编写正则表达式,从模拟的 HTML 中提取电影信息。

```python
import re

html = '''
<div class="movie">
    <h2 class="name">《流浪地球》</h2>
    <span class="year">(2024)</span>
    <span class="rating">8.5</span>
    <span class="director">导演:郭帆</span>
</div>
<div class="movie">
    <h2 class="name">《你好,李焕英》</h2>
    <span class="year">(2024)</span>
    <span class="rating">7.9</span>
    <span class="director">导演:贾玲</span>
</div>
'''

# 编写正则表达式,提取所有电影信息
# pattern = r'你的正则表达式'

# 提示:可以用多个正则分别提取,或者用一个复杂的正则提取所有
name_pattern = r'<h2 class="name">《([^》]+)》</h2>'
year_pattern = r'<span class="year">\((\d{4})\)</span>'
rating_pattern = r'<span class="rating">([^<]+)</span>'
director_pattern = r'导演:([^<]+)'

names = re.findall(name_pattern, html)
years = re.findall(year_pattern, html)
ratings = re.findall(rating_pattern, html)
directors = re.findall(director_pattern, html)

for i in range(len(names)):
    print(f"{names[i]} | {years[i]} | 评分:{ratings[i]} | {directors[i]}")
```

---

## 练习 3:日志分析

**目标**:从服务器日志中提取 IP 地址、请求时间和状态码。

```python
import re

log = '''
192.168.1.100 - - [15/Mar/2024:10:15:30 +0800] "GET /index.html HTTP/1.1" 200 1234
10.0.0.50 - - [15/Mar/2024:10:15:31 +0800] "POST /api/login HTTP/1.1" 200 256
192.168.1.101 - - [15/Mar/2024:10:15:32 +0800] "GET /notfound.html HTTP/1.1" 404 512
172.16.0.200 - - [15/Mar/2024:10:15:33 +0800] "GET /images/logo.png HTTP/1.1" 200 4096
'''

# 提取 IP、时间和状态码
pattern = r'(\d+\.\d+\.\d+\.\d+).*?\[([^\]]+)\].*?" (\d{3}) \d+'

for match in re.finditer(pattern, log):
    ip, time, status = match.groups()
    print(f'IP: {ip:15} | 时间: {time:25} | 状态: {status}')
```

---

## 练习 4:电话号码脱敏

**目标**:将手机号中间四位用 `*` 替换,保护隐私。

```python
import re

phone_book = '''
张三:138-1234-5678
李四:139-5678-1234
王五:138-0000-1111
'''

# 脱敏:将 138-****-5678 格式输出
# 提示:使用分组和 re.sub

pattern = r'(\d{3})-(\d{4})-(\d{4})'

def mask_phone(match):
    return f'{match.group(1)}-****-{match.group(3)}'

masked = re.sub(pattern, mask_phone, phone_book)
print(masked)
```

---

## 练习 5:综合挑战(选做)

从以下课程表 HTML 中,用正则表达式提取所有课程信息:

```python
import re

html = '''
<table class="schedule">
    <tr><th>时间</th><th>课程</th><th>教室</th></tr>
    <tr><td>周一 1-2节</td><td>Python程序设计</td><td>A101</td></tr>
    <tr><td>周一 3-4节</td><td>数据结构</td><td>B205</td></tr>
    <tr><td>周二 1-2节</td><td>高等数学</td><td>C301</td></tr>
    <tr><td>周三 5-6节</td><td>Python程序设计</td><td>A102</td></tr>
</table>
'''

# 每行数据依次提取时间、课程、教室(表头用的是 <th>,不会被匹配到)
row_pattern = r'<td>([^<]+)</td><td>([^<]+)</td><td>([^<]+)</td>'
courses = re.findall(row_pattern, html)

print('课程表:')
for time, course, room in courses:
    print(f'{time} | {course} | {room}')
```

---

# 📋 课程小结

## 核心要点

1. **正则表达式是强大的文本匹配工具**,可以精确提取任意文本中的目标内容

2. **元字符是基础**:
   - `.` 匹配除换行符外的任意字符
   - `\d` 匹配数字,`\w` 匹配字母数字下划线
   - `[]` 定义字符集,`[^...]` 排除字符集

3. **量词控制次数**:
   - `+` 一次或多次,`*` 零次或多次,`?` 零次或一次
   - `{n,m}` 指定范围

4. **贪婪 vs 非贪婪**:
   - 贪婪(`.*`、`\d+`)尽可能多匹配
   - 非贪婪(`.*?`、`\d+?`)尽可能少匹配
   - 排除字符集 `[^<]+` 可以替代非贪婪写法,是爬虫中的常用技巧

5. **分组 `()`**:
   - 捕获匹配内容
   - 用 `\1`、`\2` 在替换中引用分组
   - `(?P<name>...)` 命名分组更清晰

6. **BeautifulSoup + 正则 = 黄金组合**:
   - BeautifulSoup 负责结构定位
   - 正则负责精确提取

7. **re 模块常用函数**:
   - `findall()` 返回所有�配配列表
   - `search()` 找第一个匹配
   - `match()` 只匹配字符串开头
   - `finditer()` 返回迭代器,节省内存
   - `sub()` 替换匹配内容

## 常见错误速查

| 错误 | 原因 | 解决方法 |
|------|------|----------|
| 匹配结果为空 | `.` 不匹配换行符 | 用 `re.DOTALL` 或 `[\s\S]` |
| 匹配太多 | 贪婪匹配 | 改为非贪婪 `*?`、`+?` |
| 匹配错误 | 元字符未转义 | 用 `\.` 匹配点号本身 |
| 中文匹配失败 | 字符集不对 | 用 `[\u4e00-\u9fff]` |
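
其中"元字符未转义"是最隐蔽的一类错误:可以手动写 `\.`,拼接用户输入时则用 `re.escape()` 自动转义(示例数据为假设):

```python
import re

text = '版本 1.2.3 发布于 1a2b3'

# 错误:. 匹配任意字符,"1a2" 也会被匹配到
print(re.findall(r'\d.\d', text))   # ['1.2', '1a2']

# 正确:转义点号,只匹配真正的小数点
print(re.findall(r'\d\.\d', text))  # ['1.2']

# 拼接用户输入时,用 re.escape() 自动转义所有元字符
keyword = '1.2'
print(re.findall(re.escape(keyword), text))  # ['1.2']
```
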

## 下节课预告

- XPath 选择器详解
- Selenium 动态页面抓取
- 反反爬策略与实战

---

*本讲义由 AI 助教生成,如有问题请随时提问。*