# 📚 The Complete Guide to File Handling in Python

---

# Table of Contents

1. [Prerequisites: Python basics review](#1-先导知识python基础回顾)
2. [The with statement: a helper for file handling](#2-with语句文件操作的好帮手)
3. [Reading and writing text files](#3-文本文件读写)
4. [CSV files: storing tabular data](#4-csv文件表格数据存储)
5. [JSON files: storing structured data](#5-json文件结构化数据存储)
6. [Binary files: reading and writing images](#6-二进制文件图片的读写)
7. [Hands-on practice](#7-动手练习)

---

<a id="1-先导知识python基础回顾"></a>
# 1. Prerequisites: Python Basics Review

> Before working with files, let's review the most commonly used data structures in Python. They matter a lot, and we will use them again and again.

## 1.1 Variables and Data Types

**What is a variable?**
A variable is like a box: we put data into the box and stick a label (the variable name) on it so we can use it later.

```python
# Integer: a whole number, as in mathematics
age = 20
print(f'Age: {age}')

# Float: a number with a decimal point
score = 88.5
print(f'Score: {score}')

# String: text wrapped in quotes
name = '张三'
print(f'Name: {name}')

# Boolean: only two values, True and False
is_passed = True
print(f'Passed: {is_passed}')

# None: represents "nothing at all"
empty_value = None
print(f'Empty value: {empty_value}')
```

**Tip:** Variables in Python need no type declaration, and the same variable can be assigned values of different types:

```python
x = 10        # x is an integer now
print(f'x = {x}, type: {type(x)}')

x = 'hello'   # now x is a string!
print(f'x = {x}, type: {type(x)}')
```

**Common data types at a glance:**

| Type | Examples | Description |
|------|----------|-------------|
| int | 20, 100, -5 | Integer |
| float | 88.5, 3.14 | Floating-point number (decimal) |
| str | '你好', "Python" | String (text) |
| bool | True, False | Boolean |
| NoneType | None | Empty value |

---

## 1.2 List

**What is a list?**
A list is like a row of storage boxes, each holding one item, accessed by its number (the index).

### 1.2.1 Creating a list

```python
# Create a list with square brackets; separate elements with commas
fruits = ['apple', 'banana', 'orange', 'grape']   # list of strings
numbers = [1, 2, 3, 4, 5]                         # list of integers
mixed = ['hello', 123, True, 3.14]                # mixed list
```

### 1.2.2 Accessing list elements

**Important: indexes start at 0!**

```python
fruits = ['apple', 'banana', 'orange', 'grape']

print(fruits[0])    # first element: apple
print(fruits[1])    # second element: banana
print(fruits[-1])   # last element: grape
print(fruits[-2])   # second to last: orange
```

**Index diagram:**

```
fruits = ['apple', 'banana', 'orange', 'grape']
            [0]      [1]       [2]       [3]
            [-4]     [-3]      [-2]      [-1]
```

### 1.2.3 Modifying list elements

```python
fruits = ['apple', 'banana', 'orange', 'grape']

fruits[0] = 'watermelon'   # modify the first element
print(fruits)              # ['watermelon', 'banana', 'orange', 'grape']
```

### 1.2.4 Adding elements

```python
fruits = ['apple', 'banana', 'orange', 'grape']

fruits.append('strawberry')   # add at the end
print(fruits)   # ['apple', 'banana', 'orange', 'grape', 'strawberry']

fruits.insert(1, 'peach')     # insert at index 1
print(fruits)   # ['apple', 'peach', 'banana', 'orange', 'grape', 'strawberry']
```

### 1.2.5 Removing elements

```python
fruits = ['apple', 'banana', 'orange', 'grape']

fruits.remove('banana')   # remove by value (only the first match)
print(fruits)             # ['apple', 'orange', 'grape']

del fruits[0]             # remove by index
print(fruits)             # ['orange', 'grape']
```

### 1.2.6 List length

```python
fruits = ['apple', 'banana', 'orange', 'grape']
print(len(fruits))   # 4
```

### 1.2.7 Iterating over a list

```python
fruits = ['apple', 'banana', 'orange', 'grape']

# Method 1: iterate directly (most common)
for fruit in fruits:
    print(fruit)

# Method 2: iterate with the index
for i, fruit in enumerate(fruits):
    print(f'{i}: {fruit}')
```

**Output:**
```
apple
banana
orange
grape
0: apple
1: banana
2: orange
3: grape
```

### 1.2.8 List slicing

Slicing "cuts" a part out of a list.

```python
numbers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

print(numbers[2:5])    # [2, 3, 4] - index 2 through 4 (5 is excluded)
print(numbers[:4])     # [0, 1, 2, 3] - from the start through index 3
print(numbers[5:])     # [5, 6, 7, 8, 9] - from index 5 to the end
print(numbers[-3:])    # [7, 8, 9] - the last 3 elements
```

**Slice syntax: `list[start:stop:step]`**

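The examples in this subsection only use `start` and `stop`. As a small illustrative sketch (the variable names here are chosen for this example and do not come from the original lesson), the optional `step` part of the slice syntax behaves as follows:

```python
numbers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

evens = numbers[::2]            # every second element, starting at index 0
odds = numbers[1::2]            # every second element, starting at index 1
reversed_copy = numbers[::-1]   # a negative step walks the list backwards
middle = numbers[2:8:3]         # indexes 2 and 5

print(evens)           # [0, 2, 4, 6, 8]
print(odds)            # [1, 3, 5, 7, 9]
print(reversed_copy)   # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
print(middle)          # [2, 5]
```

Slicing always returns a new list, so `numbers` itself is left unchanged.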
---

## 1.3 Dict

**What is a dictionary?**
A dict works like a real dictionary: you use a "key" to look up a "value". Each key maps to one value, forming a key-value pair.

### 1.3.1 Creating a dict

```python
student = {
    'name': '张三',     # key 'name', value '张三'
    'age': 20,          # key 'age', value 20
    'major': 'AI',      # key 'major', value 'AI'
    'score': 88.5       # key 'score', value 88.5
}
```

### 1.3.2 Accessing values

```python
# Access a value by its key
print(student['name'])    # 张三
print(student['age'])     # 20

# Use get() (safer: a missing key does not raise KeyError)
print(student.get('city'))              # None (key missing, returns None)
print(student.get('city', 'unknown'))   # unknown (key missing, returns the default)
```

### 1.3.3 Modifying and adding

```python
student = {'name': '张三', 'age': 20}

student['age'] = 21            # change the value of an existing key
student['city'] = 'Beijing'    # add a new key-value pair

print(student)
# {'name': '张三', 'age': 21, 'city': 'Beijing'}
```

### 1.3.4 Deleting key-value pairs

```python
student = {'name': '张三', 'age': 20, 'city': 'Beijing'}

del student['city']    # delete the given key-value pair
print(student)         # {'name': '张三', 'age': 20}
```

### 1.3.5 Iterating over a dict

```python
student = {'name': '张三', 'age': 20, 'major': 'AI'}

# Iterate over all keys
for key in student:
    print(f'{key}: {student[key]}')

# Iterate over all key-value pairs (common!)
for key, value in student.items():
    print(f'{key} = {value}')

# Iterate over values only
for value in student.values():
    print(value)
```

---

## 1.4 String Basics

### 1.4.1 Creating strings

```python
s1 = 'single-quoted string'
s2 = "double-quoted string"
s3 = '''triple-quoted string
can span lines'''   # triple quotes allow multiple lines
```

### 1.4.2 Concatenating strings

```python
family_name = '张'
given_name = '三'
full_name = family_name + given_name
print(full_name)   # 张三
```

### 1.4.3 String formatting (important!)

```python
name = '李四'
age = 21

# Method 1: f-string (recommended, most common!)
info = f'Name: {name}, age: {age}'
print(info)   # Name: 李四, age: 21

# Method 2: format()
info = 'Name: {}, age: {}'.format(name, age)
print(info)   # Name: 李四, age: 21
```

### 1.4.4 Common string methods

```python
text = ' Hello, Python! '

print(text.strip())    # 'Hello, Python!' - remove leading/trailing whitespace
print(text.lower())    # ' hello, python! ' - lowercase
print(text.upper())    # ' HELLO, PYTHON! ' - uppercase
print(text.replace('Python', 'World'))   # ' Hello, World! ' - replace
```

### 1.4.5 Splitting strings

```python
csv_line = '张三,20,人工智能,88.5'

# split() breaks a string into a list at each separator
parts = csv_line.split(',')
print(parts)   # ['张三', '20', '人工智能', '88.5']
```

### 1.4.6 Checking for a substring

```python
text = 'Hello, Python!'

if 'Python' in text:
    print('Contains Python')
```

---

<a id="2-with语句文件操作的好帮手"></a>
# 2. The with Statement: A Helper for File Handling

> This section is very important! The with statement is central to file handling in Python and you should master it completely.

## 2.1 Why do we need the with statement?

### 2.1.1 The problem with the plain approach

```python
# Opening a file the plain way
f = open('test.txt', 'w', encoding='utf-8')
f.write('Hello')
# Problem: if an exception is raised here, close() never runs!
# f.close() is easy to forget
print('The file is open, but we forgot to close it')
```

**What is wrong?**
- If an error occurs while writing, the program crashes and `close()` never runs
- An unclosed file may lose data that is still buffered
- Forgetting to close files also wastes system resources

### 2.1.2 The try-finally approach

```python
# Use try-finally to guarantee the file is closed
f = None
try:
    f = open('test.txt', 'w', encoding='utf-8')
    f.write('the try-finally approach')
finally:
    if f:
        f.close()   # runs no matter what
        print('Closed')
```

**Drawback:** the code is too verbose!

### 2.1.3 The with statement (recommended!)

```python
# With the with statement, closing is handled automatically
with open('test.txt', 'w', encoding='utf-8') as f:
    f.write('the with-statement approach')
    # with calls f.close() automatically when the block ends

print('The file has been closed automatically')
```

**Advantages:**
- Closes automatically; no need to write close() by hand
- The file is closed properly even if an error occurs
- The code is shorter

---

## 2.2 How the with statement works

Behind the scenes, the with statement relies on two methods: `__enter__()` and `__exit__()`.

```python
# with automatically calls these two methods:
# 1. On entering the with block -> __enter__() is called
# 2. On leaving the with block  -> __exit__() is called

# Think of it as:
# f = open('test.txt', 'w')   -> __enter__() returns the file object
# f.write('Hello')
# f.close()                   -> called automatically by __exit__()
```

**Why use `as`?**

```python
with open('test.txt', 'r', encoding='utf-8') as f:
    # the file object returned by open() is bound to f
    content = f.read()
    # f.close() is called automatically when the with block ends

print(content)   # content is still usable outside the block
```

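To make the `__enter__()` / `__exit__()` mechanism more concrete, here is a minimal sketch of a hand-written context manager. `ManagedResource` is a hypothetical class invented for this illustration; it is not part of Python or of the original lesson. It only records the order in which the methods are called, including the case where the body raises an exception.

```python
class ManagedResource:
    """Hypothetical resource that records what happens to it."""

    def __init__(self, name):
        self.name = name
        self.events = []

    def __enter__(self):
        # Runs when the with block is entered; the return value is bound by `as`.
        self.events.append('enter')
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs when the with block is left, whether or not an exception occurred.
        self.events.append('exit')
        return False   # False means: do not swallow the exception


resource = ManagedResource('demo')
try:
    with resource as r:
        r.events.append('body')
        raise ValueError('simulated failure inside the block')
except ValueError:
    pass   # the exception still propagates out of the with block

print(resource.events)   # ['enter', 'body', 'exit']
```

File objects returned by `open()` follow the same protocol: their `__exit__()` closes the file, which is why the file is closed even when the body fails.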
---

## 2.3 Different ways to use with

### Usage 1: a single file

```python
with open('file1.txt', 'w', encoding='utf-8') as f:
    f.write('some content')
```

### Usage 2: several files at once

```python
with open('source.txt', 'w', encoding='utf-8') as src, open('dest.txt', 'w', encoding='utf-8') as dst:
    src.write('source file content')
    dst.write('destination file content')
```

### Usage 3: nested with

```python
with open('outer.txt', 'w', encoding='utf-8') as outer:
    outer.write('outer')
    with open('inner.txt', 'w', encoding='utf-8') as inner:
        inner.write('inner')
```

### Usage 4: with combined with a loop

```python
lines = ['first line', 'second line', 'third line']
with open('lines.txt', 'w', encoding='utf-8') as f:
    for i, line in enumerate(lines):
        f.write(f'{i+1}. {line}\n')
```

---

<a id="3-文本文件读写"></a>
# 3. Reading and Writing Text Files

> Text files are the most common kind of file; .txt, .py and .md files are all text files.

## 3.1 File open modes

When opening a file you need to specify a **mode**:

| Mode | Character | Description | Notes |
|------|-----------|-------------|-------|
| Read | 'r' | Read the file | Raises an error if the file does not exist |
| Write | 'w' | Write the file | Creates the file if missing; **truncates** it if it exists |
| Append | 'a' | Add to the end | Creates the file if it does not exist |
| Create | 'x' | Create the file | Raises an error if the file already exists |

**Adding `b` selects binary mode:**
- `rb`, `wb`, `ab` - binary reading and writing

**Adding `+` allows reading and writing at the same time:**
- `r+`, `w+`, `a+` - read-write modes

### Comparing the modes

```python
# 'w' write mode - an existing file is truncated
f = open('test.txt', 'w', encoding='utf-8')
f.write('first time')
f.close()

f = open('test.txt', 'w', encoding='utf-8')   # reopening empties the file!
f.write('second time')
f.close()

with open('test.txt', 'r', encoding='utf-8') as f:
    print(f.read())   # only "second time"

# 'a' append mode - adds to the end
f = open('test.txt', 'a', encoding='utf-8')
f.write('\nappended content')
f.close()

with open('test.txt', 'r', encoding='utf-8') as f:
    print(f.read())   # "second time" followed by "appended content"

# 'r' read-only mode
try:
    with open('not_exist.txt', 'r', encoding='utf-8') as f:
        print(f.read())
except FileNotFoundError:
    print('The file does not exist!')

# 'x' exclusive-creation mode
try:
    with open('new_file.txt', 'x', encoding='utf-8') as f:
        f.write('new file')
except FileExistsError:
    print('The file already exists!')
```

---

## 3.2 Ways to read a file

Prepare a test file:

```python
with open('read_test.txt', 'w', encoding='utf-8') as f:
    f.write('第一行内容\n')
    f.write('第二行内容\n')
    f.write('第三行内容\n')
    f.write('第四行内容\n')
    f.write('第五行(无换行)')   # fifth line, no trailing newline
```

### Method 1: read() - read everything

```python
with open('read_test.txt', 'r', encoding='utf-8') as f:
    content = f.read()
    print(content)
```

### Method 2: read(n) - read n characters

```python
with open('read_test.txt', 'r', encoding='utf-8') as f:
    content = f.read(10)   # read only 10 characters
    print(content)
```

### Method 3: readline() - read one line

```python
with open('read_test.txt', 'r', encoding='utf-8') as f:
    line1 = f.readline()   # read the first line
    line2 = f.readline()   # read the second line
    print(f'Line 1: {line1}')
    print(f'Line 2: {line2}')
    # Note: readline() keeps the trailing newline \n
```

### Method 4: readlines() - read all lines into a list

```python
with open('read_test.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()
    print(f'{len(lines)} lines in total')
    for i, line in enumerate(lines):
        print(f'{i}: {repr(line)}')   # repr shows the raw content, including \n
```

### Method 5: iterate with a for loop (recommended!)

```python
with open('read_test.txt', 'r', encoding='utf-8') as f:
    for line in f:
        print(line.strip())   # strip() removes the newline
```

**Why it is recommended:** it is memory-friendly. Only one line is held in memory at a time, so large files do not bog the program down.

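As a rough illustration of the memory argument, the sketch below counts the non-empty lines of a file while holding only one line at a time. The helper `count_non_empty_lines` and the temporary file are made up for this example; they are not part of the original lesson.

```python
import os
import tempfile


def count_non_empty_lines(path):
    """Count non-empty lines without loading the whole file into memory."""
    count = 0
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:            # the file object yields one line at a time
            if line.strip():
                count += 1
    return count


# Build a throwaway sample file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'big_sample.txt')
    with open(sample_path, 'w', encoding='utf-8') as f:
        for i in range(1000):
            f.write(f'line {i}\n')
        f.write('\n')             # one blank line that should not be counted
    total = count_non_empty_lines(sample_path)

print(total)   # 1000
```

By contrast, `read()` and `readlines()` load the entire file at once, which is fine for small files but can be costly for very large ones.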
---

## 3.3 Ways to write a file

### Method 1: write() - write a string

```python
with open('write_test.txt', 'w', encoding='utf-8') as f:
    f.write('第一行')
    f.write('\n第二行')   # you must add the newline \n yourself
    f.write('\n第三行')
```

### Method 2: writelines() - write several lines

```python
lines = ['第一行\n', '第二行\n', '第三行\n']
with open('writelines_test.txt', 'w', encoding='utf-8') as f:
    f.writelines(lines)
```

**Note:** writelines() does not add newlines automatically!

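When the strings in the list do not already end with `\n`, one common way to handle it is to append the newline while writing. The following is a small sketch under that assumption; the file name and the list contents are invented for the example.

```python
import os
import tempfile

lines = ['first', 'second', 'third']   # note: no trailing '\n' in the items

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'writelines_demo.txt')

    with open(path, 'w', encoding='utf-8') as f:
        # writelines() accepts any iterable of strings,
        # so a generator expression can add the newline on the fly.
        f.writelines(line + '\n' for line in lines)

    with open(path, 'r', encoding='utf-8') as f:
        written = f.read()

print(repr(written))   # 'first\nsecond\nthird\n'
```

An equivalent alternative is `f.write('\n'.join(lines) + '\n')`.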
---

## 3.4 Line-by-line processing in practice

### Example 1: read and process CSV-style data

```python
# Prepare the data (columns: name, age, major)
data = '''姓名,年龄,专业
张三,20,人工智能
李四,21,计算机科学
王五,19,软件工程'''

with open('students.txt', 'w', encoding='utf-8') as f:
    f.write(data)

# Process line by line
with open('students.txt', 'r', encoding='utf-8') as f:
    header = f.readline().strip()   # read the first line (the header)
    print(f'Header: {header}')

    for line in f:                  # iterate over the remaining lines
        line = line.strip()
        if line:
            parts = line.split(',')
            name, age, major = parts
            print(f'{name} - {major}')
```

### Example 2: compute the average age

```python
with open('students.txt', 'r', encoding='utf-8') as f:
    next(f)   # skip the header

    total_age = 0
    count = 0

    for line in f:
        line = line.strip()
        if line:
            parts = line.split(',')
            age = int(parts[1])
            total_age += age
            count += 1

# Guard against an empty file to avoid dividing by zero
average = total_age / count if count else 0
print(f'Number of students: {count}')
print(f'Average age: {average:.1f}')
```

### Example 3: filter and save

```python
with open('students.txt', 'r', encoding='utf-8') as f:
    # strip() drops the header's newline; each row below adds its own '\n',
    # otherwise a blank line would appear after the header
    header = f.readline().strip()

    with open('filtered.txt', 'w', encoding='utf-8') as out:
        out.write(header)   # write the header

        for line in f:
            line = line.strip()
            if line:
                parts = line.split(',')
                age = int(parts[1])
                if age > 20:   # keep students older than 20
                    out.write('\n' + line)
```

---

<a id="4-csv文件表格数据存储"></a>
# 4. CSV Files: Storing Tabular Data

> CSV is the most widely used format for tabular data and can be opened in Excel.

## 4.1 What is CSV?

```
CSV = Comma Separated Values

Sample content:
姓名,年龄,专业
张三,20,人工智能
李四,21,计算机科学

Characteristics:
- One record per line
- Fields are separated by commas
- The first line is usually the header
- Can be opened and edited directly in Excel
```

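"Fields are separated by commas" has one catch: a field may itself contain a comma, in which case CSV wraps it in double quotes. The sketch below (with a made-up row, using an in-memory `io.StringIO` buffer instead of a real file) shows why a plain `split(',')` is not enough for such data, and why the `csv` module is used in the rest of this chapter.

```python
import csv
import io

row = ['张三', 'Hello, world', '88.5']   # the second field contains a comma

# Write the row to an in-memory text buffer with the csv module.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
line = buffer.getvalue()
print(repr(line))   # '张三,"Hello, world",88.5\r\n'

# Naive splitting breaks the quoted field into two pieces.
naive = line.strip().split(',')
print(naive)        # ['张三', '"Hello', ' world"', '88.5']

# csv.reader understands the quoting and restores the original fields.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed)       # ['张三', 'Hello, world', '88.5']
```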
---

## 4.2 csv module basics

Python ships with the `csv` module; import it before use.

### 4.2.1 Writing CSV

**Important: always pass `newline=''`!**

```python
import csv

# Prepare the data (columns: name, age, major, score)
header = ['姓名', '年龄', '专业', '成绩']
students = [
    ['张三', 20, '人工智能', 88.5],
    ['李四', 21, '计算机科学', 92.0],
    ['王五', 19, '软件工程', 85.5],
]

# Write the CSV (key point: newline='')
with open('students.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)       # write one row
    writer.writerows(students)    # write several rows
```

### 4.2.2 Why pass newline=''?

**Without it, extra blank lines appear on Windows!** The csv writer ends every row with `\r\n` by itself. If the file was opened without `newline=''`, Windows text mode additionally turns the `\n` into `\r\n`, so each row ends with `\r\r\n`, which shows up as a blank line between rows.

```python
import csv

# Wrong
with open('wrong.csv', 'w', encoding='utf-8') as f:   # newline='' is missing
    writer = csv.writer(f)
    writer.writerow(['A', 'B', 'C'])

# Read the raw bytes to see the real line ending
with open('wrong.csv', 'rb') as f:
    print(f.read())
    # On Windows:      b'A,B,C\r\r\n'  <- the extra \r causes the blank line
    # On Linux/macOS:  b'A,B,C\r\n'

# Correct
with open('correct.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['A', 'B', 'C'])

with open('correct.csv', 'rb') as f:
    print(f.read())
    # b'A,B,C\r\n' on every platform <- correct!
```

### 4.2.3 Reading CSV

```python
import csv

# newline='' is also recommended when reading, so that newlines
# embedded in quoted fields are handled correctly
with open('students.csv', 'r', encoding='utf-8', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)   # every row is a list of strings
```

---

## 4.3 CSV the dict way (more intuitive!)

### 4.3.1 Writing

```python
import csv

students = [
    {'姓名': '张三', '年龄': 20, '专业': '人工智能', '成绩': 88.5},
    {'姓名': '李四', '年龄': 21, '专业': '计算机科学', '成绩': 92.0},
    {'姓名': '王五', '年龄': 19, '专业': '软件工程', '成绩': 85.5},
]

fieldnames = ['姓名', '年龄', '专业', '成绩']

with open('students_dict.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()          # write the header
    writer.writerows(students)    # write several rows
```

### 4.3.2 Reading

```python
import csv

with open('students_dict.csv', 'r', encoding='utf-8', newline='') as f:
    reader = csv.DictReader(f)

    # reader.fieldnames is filled in from the header row
    print(f'Header: {reader.fieldnames}')

    for row in reader:
        # access fields by key name, which is easy to read
        print(f"Name: {row['姓名']}, major: {row['专业']}, score: {row['成绩']}")
```

### 4.3.3 Comparison: list style vs dict style

```python
import csv

# List style
with open('students.csv', 'r', encoding='utf-8', newline='') as f:
    reader = csv.reader(f)
    next(reader)   # skip the header
    for row in reader:
        print(f'{row[0]} - {row[2]}')   # what is row[0]? row[2]? Easy to mix up
        # you have to remember that 0 is the name and 2 is the major

# Dict style
with open('students_dict.csv', 'r', encoding='utf-8', newline='') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(f"{row['姓名']} - {row['专业']}")   # clear at a glance!
```

---

<a id="5-json文件结构化数据存储"></a>
# 5. JSON Files: Storing Structured Data

> JSON is one of the most popular data formats in modern programming, used for API payloads, configuration files and more.

## 5.1 What is JSON?

```
JSON = JavaScript Object Notation

Characteristics:
- A lightweight data-interchange format
- The de facto standard for exchanging data between browsers and servers
- A common format for configuration files (config.json, package.json)

Data types:
- String: "Hello"
- Number: 123, 45.67
- Boolean: true, false
- Null: null
- Array: []
- Object: {}
```

**Python data types ↔ JSON data types**

| Python | JSON |
|--------|------|
| str | string |
| int, float | number |
| bool | boolean |
| None | null |
| list, tuple | array |
| dict | object |

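The mapping is not perfectly symmetric, which is worth seeing once. The sketch below (with a made-up dictionary) round-trips a value through `json.dumps()` and `json.loads()`: tuples come back as lists, and non-string dictionary keys come back as strings.

```python
import json

original = {
    'point': (1, 2),     # tuple
    1: 'int key',        # integer key
    'ok': True,
    'nothing': None,
}

text = json.dumps(original, ensure_ascii=False)
print(text)
# {"point": [1, 2], "1": "int key", "ok": true, "nothing": null}

restored = json.loads(text)
print(restored)
# {'point': [1, 2], '1': 'int key', 'ok': True, 'nothing': None}

print(restored == original)   # False: the tuple became a list, the key 1 became '1'
```

If the exact Python types matter after loading, they have to be converted back explicitly.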
---

## 5.2 Reading and writing JSON

### 5.2.1 Writing a JSON file

```python
import json

config = {
    'app_name': '人工智能数据服务平台',
    'version': '1.0.0',
    'debug': True,
    'max_users': 100,
    'database': {
        'host': 'localhost',
        'port': 3306,
        'username': 'root',
        'password': '123456'
    },
    'allowed_modules': ['图像处理', '文本处理', '语音处理'],
    'settings': None
}

# Write the JSON file
with open('config.json', 'w', encoding='utf-8') as f:
    # ensure_ascii=False: keep Chinese characters as they are (important!)
    # indent=2: pretty-print with indentation for readability
    json.dump(config, f, ensure_ascii=False, indent=2)
```

### 5.2.2 Reading a JSON file

```python
import json

with open('config.json', 'r', encoding='utf-8') as f:
    data = json.load(f)   # read and parse into Python objects

print(f'App name: {data["app_name"]}')
print(f'Version: {data["version"]}')
print(f'Debug mode: {data["debug"]}')
print(f'Allowed modules: {data["allowed_modules"]}')
print(f'Database host: {data["database"]["host"]}')
```

### 5.2.3 The difference between dump/dumps/load/loads

```python
import json

data = {'name': '张三', 'age': 20}

# json.dump() - write to a file
with open('test.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False)

# json.dumps() - convert to a string (e.g. for sending over the network)
json_string = json.dumps(data, ensure_ascii=False)
print(f'String: {json_string}')
print(f'Type: {type(json_string)}')   # str

# json.loads() - parse from a string
parsed = json.loads(json_string)
print(f'Parsed: {parsed}')

# json.load() - parse from a file
with open('test.json', 'r', encoding='utf-8') as f:
    print(f'Loaded from file: {json.load(f)}')
```

### 5.2.4 Practice: saving student scores

```python
import json

# Keys: 姓名 = name, 成绩 = score
scores = [
    {'姓名': '张三', '成绩': 88.5},
    {'姓名': '李四', '成绩': 92.0},
    {'姓名': '王五', '成绩': 85.5},
]

# Save
with open('scores.json', 'w', encoding='utf-8') as f:
    json.dump(scores, f, ensure_ascii=False, indent=2)

# Load
with open('scores.json', 'r', encoding='utf-8') as f:
    loaded = json.load(f)

# Compute the average score
total = sum(s['成绩'] for s in loaded)
avg = total / len(loaded)
print(f'Average score: {avg:.2f}')
```

---

<a id="6-二进制文件图片的读写"></a>
# 6. Binary Files: Reading and Writing Images

> Images, audio and video are all binary files and are handled differently from text files!

## 6.1 Text files vs binary files

| Type | Reading returns | Writing takes | Characteristics |
|------|-----------------|---------------|-----------------|
| Text file (t) | str | str | Has an encoding (UTF-8 etc.) |
| Binary file (b) | bytes | bytes | No encoding, raw bytes |

**File modes:**
```python
'r' or 'rt'   # text, read-only
'w' or 'wt'   # text, write
'rb'          # binary, read-only
'wb'          # binary, write
```

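The str/bytes distinction in the table can be tried out without touching the disk. In this sketch an `io.BytesIO` buffer stands in for a file opened with `'wb'`; the sample text is invented for the example.

```python
import io

text = '你好, Python'
data = text.encode('utf-8')          # str -> bytes
round_trip = data.decode('utf-8')    # bytes -> str

print(len(text))   # 10 characters
print(len(data))   # 14 bytes: each Chinese character takes 3 bytes in UTF-8

binary_buffer = io.BytesIO()         # behaves like a file opened in binary mode
try:
    binary_buffer.write(text)        # a binary stream rejects str
    accepted_str = True
except TypeError:
    accepted_str = False

binary_buffer.write(data)            # bytes are accepted
print(accepted_str)                  # False
print(binary_buffer.getvalue() == data)   # True
```

In text mode Python performs the `encode()` / `decode()` step for you, using the `encoding` passed to `open()`; in binary mode you work with the raw bytes directly.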
---

## 6.2 Saving an image downloaded from the network

**Fetching an image with requests, as covered in the previous lesson:**
```python
import requests

# requests.get().content returns bytes (raw byte data)
response = requests.get('https://example.com/image.jpg')
image_bytes = response.content
```

**Saving the image locally:**

```python
# Suppose these are the image bytes fetched from the network
image_bytes = b'\x89PNG\r\n\x1a\n...'   # placeholder for the real image bytes

# Write a binary file using 'wb' mode
with open('downloaded_image.png', 'wb') as f:
    f.write(image_bytes)

print('Image saved!')
```

---

## 6.3 Reading an image into memory

```python
# Read the image
with open('downloaded_image.png', 'rb') as f:
    image_data = f.read()

print(f'Image size: {len(image_data)} bytes')
print(f'File header: {image_data[:8].hex()}')   # a PNG file starts with 89 50 4E 47
```

---

## 6.4 The difference between content and text

```python
import requests

# response.text    -> str (text content such as HTML or JSON)
# response.content -> bytes (binary content such as images, audio, video)

# Example (placeholder URL; the endpoint must actually return JSON):
response = requests.get('https://api.example.com/data')
html = response.text            # string
json_data = response.json()     # parse the body as JSON

# Download an image
response = requests.get('https://example.com/photo.jpg')
image_bytes = response.content  # byte data
```

---

## 6.5 Copying an image file

```python
import os

# Read and write in one go (suitable for small files)
with open('photo1.png', 'rb') as src:
    data = src.read()

with open('photo1_copy.png', 'wb') as dst:
    dst.write(data)

print('Image copied!')

# Verify
print(f'Original: {os.path.getsize("photo1.png")} bytes')
print(f'Copy: {os.path.getsize("photo1_copy.png")} bytes')
```

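Reading the whole file at once suits small files. For large files, a chunked copy keeps memory use bounded. The sketch below is one possible way to do it; the helper name `copy_in_chunks`, the 64 KiB chunk size and the random sample data are assumptions made for this illustration.

```python
import os
import tempfile


def copy_in_chunks(src_path, dst_path, chunk_size=64 * 1024):
    """Copy a file piece by piece and return the number of bytes copied."""
    copied = 0
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        while True:
            chunk = src.read(chunk_size)   # returns b'' at end of file
            if not chunk:
                break
            dst.write(chunk)
            copied += len(chunk)
    return copied


# Self-contained demo with random bytes standing in for an image.
with tempfile.TemporaryDirectory() as tmp_dir:
    src_path = os.path.join(tmp_dir, 'big.bin')
    dst_path = os.path.join(tmp_dir, 'big_copy.bin')
    payload = os.urandom(200_000)
    with open(src_path, 'wb') as f:
        f.write(payload)

    copied_bytes = copy_in_chunks(src_path, dst_path)
    with open(dst_path, 'rb') as f:
        copy_matches = f.read() == payload

print(copied_bytes)   # 200000
print(copy_matches)   # True
```

In everyday code the standard library's `shutil.copyfile()` does the same job; the manual loop is shown only to make the mechanism visible.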
---

<a id="7-动手练习"></a>
# 7. Hands-on Practice: Scraping the Douban Movie Top250

> In this section we put everything together: scrape data from the Douban Movie Top250 and save it to files.

## 7.1 Background: the Douban Movie Top250

The Douban Movie Top250 is Douban's curated list of 250 highly rated movies, available at:
```
https://movie.douban.com/top250
```

We want to scrape:
- The movie's rank
- Its Chinese title
- Its English title
- Its rating
- Its famous quote (if it has one)

---

## Exercise 1: scrape movie titles and save them to a text file

**Goal:** use requests to fetch the first page of the Douban Movie Top250, get the Chinese titles of the first 10 movies, and save them to `movies.txt`

**Steps:**
1. Send a network request to fetch the page
2. Extract the movie titles with a regular expression
3. Save them to a text file

<details>
<summary>Reference solution</summary>

```python
import requests
import re

# 1. Send the request and fetch the page
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
url = 'https://movie.douban.com/top250'

response = requests.get(url, headers=headers)
html = response.text

# 2. Extract the movie titles with a regular expression
# Titles live in <span class="title"> elements
pattern = r'<span class="title">([^<]+)</span>'
raw_titles = re.findall(pattern, html)

# 3. Drop the English titles (keep only the Chinese ones)
# Assumption about the page markup: the English-title span looks like
# "&nbsp;/&nbsp;English Title", so normalise the spaces before checking for '/'
titles = [t.replace('&nbsp;', ' ').replace('\xa0', ' ').strip() for t in raw_titles]
chinese_titles = [t for t in titles if not t.startswith('/')]

# Take the first 10
top10 = chinese_titles[:10]

# 4. Save to a text file
with open('movies.txt', 'w', encoding='utf-8') as f:
    for i, title in enumerate(top10, 1):
        f.write(f'{i}. {title}\n')

print('Saved the first 10 movies to movies.txt')

# Show the content to verify
with open('movies.txt', 'r', encoding='utf-8') as f:
    print(f.read())
```

**Sample output (the live ranking may differ):**
```
1. 肖申克的救赎
2. 霸王别姬
3. 泰坦尼克号
4. 阿甘正传
5. 千与千寻
6. 美丽人生
7. 星际穿越
8. 这个杀手不太冷
9. 盗梦空间
10. 楚门的世界
```

</details>

---

## Exercise 2: scrape and save as a CSV file

**Goal:** scrape the full information for the first 10 movies (rank, Chinese title, English title, rating) and save it to `movies.csv`

**Sample data:**
```
排名,中文名,英文名,评分
1,肖申克的救赎,The Shawshank Redemption,9.7
2,霸王别姬,,9.6
...
```

<details>
<summary>Reference solution</summary>

```python
import requests
import re
import csv

# 1. Fetch the page
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
url = 'https://movie.douban.com/top250'
response = requests.get(url, headers=headers)
html = response.text

# 2. Extract the data with regular expressions
# Movie titles (Chinese and English spans alike)
title_pattern = r'<span class="title">([^<]+)</span>'
# Ratings
rating_pattern = r'<span class="rating_num"[^>]*>(\d+\.?\d*)</span>'

raw_titles = re.findall(title_pattern, html)
ratings = re.findall(rating_pattern, html)

# 3. Organise the data (pair each Chinese title with its English title)
# Assumption about the page markup: an English-title span looks like
# "&nbsp;/&nbsp;English Title" and follows the Chinese title it belongs to
movies = []
for raw in raw_titles:
    text = raw.replace('&nbsp;', ' ').replace('\xa0', ' ').strip()
    if text.startswith('/'):
        if movies:   # attach the English title to the most recent movie
            movies[-1]['en_title'] = text.lstrip('/').strip()
    else:
        movies.append({'rank': len(movies) + 1, 'title': text, 'en_title': ''})

movies = movies[:10]
for i, movie in enumerate(movies):
    movie['rating'] = ratings[i] if i < len(ratings) else ''

# 4. Save to CSV
with open('movies.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['rank', 'title', 'en_title', 'rating'])
    writer.writeheader()
    writer.writerows(movies)

print('Saved to movies.csv')

# Verify the content
with open('movies.csv', 'r', encoding='utf-8') as f:
    for line in f:
        print(line.strip())
```

</details>

---

## Exercise 3: scrape and save as a JSON file

**Goal:** save the movie data in JSON format so it is easy to process later and to send through an API

**Sample JSON:**
```json
[
  {
    "rank": 1,
    "title": "肖申克的救赎",
    "en_title": "The Shawshank Redemption",
    "rating": "9.7",
    "quote": ""
  },
  ...
]
```

<details>
<summary>Reference solution</summary>

```python
import requests
import re
import json

# 1. Fetch the page
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
url = 'https://movie.douban.com/top250'
response = requests.get(url, headers=headers)
html = response.text

# 2. Extract the data
title_pattern = r'<span class="title">([^<]+)</span>'
rating_pattern = r'<span class="rating_num"[^>]*>(\d+\.?\d*)</span>'
quote_pattern = r'<span class="inq">([^<]+)</span>'

raw_titles = re.findall(title_pattern, html)
ratings = re.findall(rating_pattern, html)
quotes = re.findall(quote_pattern, html)

# 3. Build the movie list
# Assumption about the page markup: an English-title span looks like
# "&nbsp;/&nbsp;English Title" and follows the Chinese title it belongs to
movies = []
for raw in raw_titles:
    text = raw.replace('&nbsp;', ' ').replace('\xa0', ' ').strip()
    if text.startswith('/'):
        if movies:   # attach the English title to the most recent movie
            movies[-1]['en_title'] = text.lstrip('/').strip()
    else:
        movies.append({'rank': len(movies) + 1, 'title': text, 'en_title': ''})

movies = movies[:10]
for i, movie in enumerate(movies):
    movie['rating'] = ratings[i] if i < len(ratings) else ''
    # Matching quotes by position assumes every movie on the page has one
    movie['quote'] = quotes[i] if i < len(quotes) else ''

# 4. Save to JSON
with open('movies.json', 'w', encoding='utf-8') as f:
    json.dump(movies, f, ensure_ascii=False, indent=2)

print('Saved to movies.json')

# Verify: load and display
with open('movies.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
    print(f'Saved {len(data)} movies')
    for m in data[:3]:
        print(f"  {m['rank']}. {m['title']} ({m['en_title']}) - {m['rating']}")
```

</details>

---

## Exercise 4: read the CSV and filter the data

**Goal:** read the `movies.csv` saved earlier and pick out the movies rated above 9.5

<details>
<summary>Reference solution</summary>

```python
import csv

# Read the CSV file
with open('movies.csv', 'r', encoding='utf-8', newline='') as f:
    reader = csv.DictReader(f)

    print('Movies rated above 9.5:')
    print('-' * 40)

    count = 0
    for row in reader:
        # The rating is a string; convert it to float before comparing
        if float(row['rating']) > 9.5:
            count += 1
            print(f"{row['rank']}. {row['title']}")
            print(f"   English title: {row['en_title']}")
            print(f"   Rating: {row['rating']}")
            print()

    print(f'{count} movies rated above 9.5')
```

**Sample output:**
```
Movies rated above 9.5:
----------------------------------------
1. 肖申克的救赎
   English title: The Shawshank Redemption
   Rating: 9.7

2. 霸王别姬
   English title:
   Rating: 9.6

2 movies rated above 9.5
```

</details>

---

## Exercise 5: read the JSON and compute statistics

**Goal:** read `movies.json`, compute the average rating, and find the highest-rated movie

<details>
<summary>Reference solution</summary>

```python
import json

# Read the JSON
with open('movies.json', 'r', encoding='utf-8') as f:
    movies = json.load(f)

# Compute the average rating
total = sum(float(m['rating']) for m in movies)
average = total / len(movies)
print(f'Average rating of the Top 10: {average:.2f}')

# Find the highest rating
highest = max(movies, key=lambda m: float(m['rating']))
print('\nHighest-rated movie:')
print(f"  {highest['rank']}. {highest['title']} ({highest['en_title']})")
print(f"  Rating: {highest['rating']}")

# Count the movies that have a famous quote
with_quote = [m for m in movies if m['quote']]
print(f'\nMovies with a famous quote: {len(with_quote)}')
for m in with_quote:
    print(f"  \"{m['quote']}\" - {m['title']}")
```

</details>

---

## Exercise 6: save movie poster images (simulated)

**Goal:** simulate scraping movie posters (images) and save them locally

In a real scenario the poster URLs are extracted from the page source:
```html
<img alt="肖申克的救赎" src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p480747492.jpg">
```

<details>
<summary>Reference solution</summary>

```python
import requests
import os
import json

# Simulated: poster URLs taken from the page (in practice, extract them from the HTML)
poster_urls = [
    {'rank': 1, 'title': '肖申克的救赎', 'url': 'https://img3.doubanio.com/view/photo/s_ratio_poster/public/p480747492.jpg'},
    {'rank': 2, 'title': '霸王别姬', 'url': 'https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2911205318.jpg'},
    {'rank': 3, 'title': '泰坦尼克号', 'url': 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p457760035.jpg'},
]

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}

# Create the output directory
os.makedirs('posters', exist_ok=True)

# Save the images
saved_info = []
for info in poster_urls:
    try:
        # Send the request and fetch the image
        response = requests.get(info['url'], headers=headers, timeout=10)
        response.raise_for_status()   # do not save an error page as a .jpg
        image_data = response.content

        # Save the image
        filename = f"posters/{info['rank']}_{info['title']}.jpg"
        with open(filename, 'wb') as f:
            f.write(image_data)

        saved_info.append({
            'rank': info['rank'],
            'title': info['title'],
            'filename': filename,
            'size': len(image_data)
        })
        print(f'Saved: {filename} ({len(image_data)} bytes)')

    except Exception as e:
        print(f'Download failed {info["title"]}: {e}')

# Save the image metadata to JSON
with open('posters/info.json', 'w', encoding='utf-8') as f:
    json.dump(saved_info, f, ensure_ascii=False, indent=2)

print('\nImage info saved to posters/info.json')
```

**Note:** when scraping for real, add a delay between requests (`time.sleep(1)`). Requesting too fast may get your IP blocked.

</details>

---

## Exercise 7: capstone - scrape and save in bulk

**Goal:** write a complete scraper script that collects all the information for the Douban Top 10 movies and saves it as both CSV and JSON

<details>
<summary>Reference solution</summary>

```python
import requests
import re
import csv
import json
import os


def crawl_douban_top10():
    """Scrape the Douban Top 10 movie information."""

    print('Scraping the Douban Movie Top 10...')

    # 1. Fetch the page
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept': 'text/html,application/xhtml+xml',
    }
    url = 'https://movie.douban.com/top250'

    response = requests.get(url, headers=headers, timeout=10)
    html = response.text

    # 2. Extract the data
    # Movie titles; assumption about the page markup: English-title spans
    # look like "&nbsp;/&nbsp;English Title" and are filtered out below
    raw_titles = re.findall(r'<span class="title">([^<]+)</span>', html)
    titles = [t.replace('&nbsp;', ' ').replace('\xa0', ' ').strip() for t in raw_titles]
    title_cn = [t for t in titles if not t.startswith('/')]
    # Ratings
    ratings = re.findall(r'<span class="rating_num"[^>]*>(\d+\.?\d*)</span>', html)
    # Famous quotes
    quotes = re.findall(r'<span class="inq">([^<]+)</span>', html)

    # 3. Organise the data
    movies = []
    for i, title in enumerate(title_cn[:10]):
        movie = {
            'rank': i + 1,
            'title': title,
            'rating': ratings[i] if i < len(ratings) else '',
            'quote': quotes[i] if i < len(quotes) else ''
        }
        movies.append(movie)

    return movies


def save_to_csv(movies, filename):
    """Save as CSV."""
    with open(filename, 'w', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['rank', 'title', 'rating', 'quote'])
        writer.writeheader()
        writer.writerows(movies)
    print(f'CSV saved: {filename}')


def save_to_json(movies, filename):
    """Save as JSON."""
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(movies, f, ensure_ascii=False, indent=2)
    print(f'JSON saved: {filename}')


def main():
    # Create the output directory
    os.makedirs('douban_output', exist_ok=True)

    # Scrape the data
    movies = crawl_douban_top10()

    # Save the files
    save_to_csv(movies, 'douban_output/movies.csv')
    save_to_json(movies, 'douban_output/movies.json')

    # Show the result
    print('\nResult:')
    print('-' * 50)
    for m in movies:
        quote_text = f'「{m["quote"]}」' if m['quote'] else ''
        print(f"{m['rank']}. {m['title']} - rating: {m['rating']} {quote_text}")

    print('\nDone!')


if __name__ == '__main__':
    main()
```

</details>

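---

# Appendix: Common Errors

| Error | Cause | Fix |
|-------|-------|-----|
| `UnicodeDecodeError` | Reading a binary file in text mode, or reading text with the wrong encoding | Use `'rb'` mode for binary files; pass the correct `encoding` for text |
| `requests.exceptions.SSLError` | SSL certificate problem | Try another site; disable verification only for throwaway local tests, since it is insecure |
| Extra blank lines in CSV | `newline=''` is missing | Add `newline=''` |
| `FileNotFoundError` | The file does not exist | Create it with `'w'` mode, or check first |
| Data is not on separate lines | `write()` does not add newlines | Add `'\n'` yourself |
| Chinese text in JSON shows as `\uXXXX` escapes | `ensure_ascii=False` is missing | Add `ensure_ascii=False` |
| Scraper gets blocked | Requests are sent too fast | Add a `time.sleep(1)` delay |
| Chinese text is garbled | Wrong file encoding | Make sure to use `encoding='utf-8'` |
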
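---

# Course Review

In this lesson we covered:

1. ✅ Python basics (variables, lists, dicts, strings)
2. ✅ The with statement (closing files automatically)
3. ✅ Reading and writing text files (open/read/write)
4. ✅ Working with CSV files (the csv module)
5. ✅ Working with JSON files (the json module)
6. ✅ Binary files (reading and writing images)
7. ✅ Capstone practice: scraping the Douban Movie Top250

**Study tips:**
- First understand the principle behind each topic
- Type out the sample code yourself
- Change the parameters in the code and watch how the results change
- Try to complete all the exercises

**Stretch challenges:**
- Scrape all 250 movies of the Douban Top250
- Save the movie poster images
- Plot the rating distribution with matplotlib
- Store the data in an SQLite database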
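
For the first stretch challenge, one possible starting point is to generate the URL of every page up front. The sketch below rests on an assumption that is not stated in this lesson: that the list is split into pages of 25 movies selected by a `start` query parameter. Check the actual pagination links on the site before relying on it. The function name `build_page_urls` is invented for this example, and no request is sent here.

```python
BASE_URL = 'https://movie.douban.com/top250'
PAGE_SIZE = 25   # assumed number of movies per page


def build_page_urls(total=250, page_size=PAGE_SIZE):
    """Return one URL per page, using an assumed ?start=<offset> parameter."""
    return [f'{BASE_URL}?start={offset}' for offset in range(0, total, page_size)]


page_urls = build_page_urls()
print(len(page_urls))   # 10
print(page_urls[0])     # https://movie.douban.com/top250?start=0
print(page_urls[-1])    # https://movie.douban.com/top250?start=225
```

Each URL can then be fetched with the same `requests.get(url, headers=headers, timeout=10)` call used in the exercises, with a `time.sleep(1)` pause between pages.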