# 整数 - 就像数学里的整数
age = 20
print(f'年龄: {age}')

# 浮点数 - 带小数点的数
score = 88.5
print(f'成绩: {score}')

# 字符串 - 用引号包起来的文字
name = '张三'
print(f'姓名: {name}')

# 布尔值 - 只有两个值：True（真）和 False（假）
is_passed = True
print(f'是否及格: {is_passed}')

# 空值 - 表示"什么都没有"
empty_value = None
print(f'空值: {empty_value}')

Tip: Python variables need no type declaration, and the same variable name can be rebound to values of different types:

x = 10        # 现在x是整数
print(f'x = {x}, 类型: {type(x)}')

x = 'hello'  # 现在x变成字符串了！
print(f'x = {x}, 类型: {type(x)}')

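Common data types at a glance:

Type	Example	Description
int	20, 100, -5	Integer
float	88.5, 3.14	Floating-point number (decimal)
str	'你好', "Python"	String (text)
bool	True, False	Boolean
NoneType	None	Null value (None is the only value of this type)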

1.2 列表 List

什么是列表？ 列表就像一排连续的储物盒，每个盒子里放一个数据，通过编号（索引）来访问。

1.2.1 创建列表

# 用方括号创建列表，元素之间用逗号分隔
fruits = ['苹果', '香蕉', '橙子', '葡萄']  # 字符串列表
numbers = [1, 2, 3, 4, 5]                   # 整数列表
mixed = ['hello', 123, True, 3.14]           # 混合列表

1.2.2 访问列表元素

重要：索引从0开始！

fruits = ['苹果', '香蕉', '橙子', '葡萄']

print(fruits[0])   # 第一个元素：苹果
print(fruits[1])   # 第二个元素：香蕉
print(fruits[-1])  # 最后一个元素：葡萄
print(fruits[-2])  # 倒数第二个：橙子

索引示意图：

fruits = ['苹果', '香蕉', '橙子', '葡萄']
            [0]      [1]      [2]      [3]
           [-4]     [-3]     [-2]     [-1]

1.2.3 修改列表元素

fruits = ['苹果', '香蕉', '橙子', '葡萄']

fruits[0] = '西瓜'    # 修改第一个元素
print(fruits)        # ['西瓜', '香蕉', '橙子', '葡萄']

1.2.4 添加元素

fruits = ['苹果', '香蕉', '橙子', '葡萄']

fruits.append('草莓')     # 在末尾添加
print(fruits)             # ['苹果', '香蕉', '橙子', '葡萄', '草莓']

fruits.insert(1, '桃子')  # 在索引1的位置插入
print(fruits)             # ['苹果', '桃子', '香蕉', '橙子', '葡萄', '草莓']

1.2.5 删除元素

fruits = ['苹果', '香蕉', '橙子', '葡萄']

fruits.remove('香蕉')    # 删除指定元素（删除第一个匹配的）
print(fruits)             # ['苹果', '橙子', '葡萄']

del fruits[0]             # 删除指定索引的元素
print(fruits)             # ['橙子', '葡萄']

1.2.6 列表长度

fruits = ['苹果', '香蕉', '橙子', '葡萄']
print(len(fruits))  # 4

1.2.7 遍历列表

fruits = ['苹果', '香蕉', '橙子', '葡萄']

# 方法1：直接遍历（最常用）
for fruit in fruits:
    print(fruit)

# 方法2：带索引遍历
for i, fruit in enumerate(fruits):
    print(f'{i}: {fruit}')

Output (method 1 first, then method 2):

苹果
香蕉
橙子
葡萄
0: 苹果
1: 香蕉
2: 橙子
3: 葡萄

1.2.8 列表切片

切片就像从列表中"切"出一部分。

numbers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

print(numbers[2:5])   # [2, 3, 4]    - 从索引2到4（不包含5）
print(numbers[:4])    # [0, 1, 2, 3] - 从开头到索引3
print(numbers[5:])    # [5, 6, 7, 8, 9] - 从索引5到结尾
print(numbers[-3:])   # [7, 8, 9]   - 最后3个元素

Slice syntax: list[start:stop:step]. The stop index is excluded, and step defaults to 1 when omitted.
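The examples above never use the third slice field. Here is a minimal sketch of the step value; the variable names are illustrative only.

```python
numbers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# A step of 2 takes every second element, starting at index 0
every_other = numbers[::2]

# Start at index 1, stop before index 8, jump 3 positions each time
stepped = numbers[1:8:3]

# A negative step walks backwards, so [::-1] yields a reversed copy
reversed_copy = numbers[::-1]

print(every_other)    # [0, 2, 4, 6, 8]
print(stepped)        # [1, 4, 7]
print(reversed_copy)  # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
```

Slicing always returns a new list, so `numbers` itself is unchanged.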

1.3 字典 Dict

什么是字典？ 字典就像真实的字典，用"键"来找"值"。每个键对应一个值，形成键值对。

1.3.1 创建字典

student = {
    'name': '张三',      # 键是'name'，值是'张三'
    'age': 20,          # 键是'age'，值是20
    'major': '人工智能', # 键是'major'，值是'人工智能'
    'score': 88.5       # 键是'score'，值是88.5
}

1.3.2 访问字典的值

# 通过键来访问值
print(student['name'])   # 张三
print(student['age'])    # 20

# 使用get()方法（更安全）
print(student.get('city'))              # None（键不存在，返回None）
print(student.get('city', '未知'))      # 未知（键不存在，返回默认值）

1.3.3 修改和添加

student = {'name': '张三', 'age': 20}

student['age'] = 21      # 修改已有键的值
student['city'] = '北京' # 添加新的键值对

print(student)
# {'name': '张三', 'age': 21, 'city': '北京'}

1.3.4 删除键值对

student = {'name': '张三', 'age': 20, 'city': '北京'}

del student['city']      # 删除指定键值对
print(student)            # {'name': '张三', 'age': 20}

1.3.5 遍历字典

student = {'name': '张三', 'age': 20, 'major': '人工智能'}

# 遍历所有键
for key in student:
    print(f'{key}: {student[key]}')

# 遍历所有键值对（常用！）
for key, value in student.items():
    print(f'{key} = {value}')

# 只遍历值
for value in student.values():
    print(value)

1.4 字符串基础

1.4.1 字符串的创建

s1 = '单引号字符串'
s2 = "双引号字符串"
s3 = '''三引号字符串
可以换行'''                # 三引号可以写多行

1.4.2 字符串拼接

family_name = '张'   # 姓: the family name comes first in Chinese names
given_name = '三'    # 名: the given name
full_name = family_name + given_name
print(full_name)  # 张三

1.4.3 字符串格式化（重要！）

name = '李四'
age = 21

# 方法1：f-string（推荐，最常用！）
info = f'姓名: {name}, 年龄: {age}'
print(info)  # 姓名: 李四, 年龄: 21

# 方法2：format()
info = '姓名: {}, 年龄: {}'.format(name, age)
print(info)  # 姓名: 李四, 年龄: 21

1.4.4 常用字符串方法

text = '  Hello, Python!  '

print(text.strip())      # 'Hello, Python!'    - 去除首尾空白
print(text.lower())      # '  hello, python!  ' - 转小写
print(text.upper())      # '  HELLO, PYTHON!  ' - 转大写
print(text.replace('Python', 'World'))  # '  Hello, World!  ' - 替换

1.4.5 字符串分割

csv_line = '张三,20,人工智能,88.5'

# split() 把字符串按分隔符拆分成列表
parts = csv_line.split(',')
print(parts)  # ['张三', '20', '人工智能', '88.5']
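split() always returns strings, so numeric fields must be converted before any calculation. The sketch below assumes the same four-field layout as the line above; the names `age_text` and `score_text` are illustrative.

```python
csv_line = '张三,20,人工智能,88.5'

# Unpack the four fields; every piece is still a str at this point
name, age_text, major, score_text = csv_line.split(',')

# Convert the numeric fields explicitly
age = int(age_text)
score = float(score_text)

print(age + 1)      # 21
print(score + 0.5)  # 89.0

# join() is the inverse of split(): it glues strings back together
rebuilt = ','.join([name, str(age), major, str(score)])
print(rebuilt)      # 张三,20,人工智能,88.5
```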

1.4.6 判断包含

text = 'Hello, Python!'

if 'Python' in text:
    print('包含Python')

2. with语句：文件操作的好帮手

这一节非常重要！with语句是Python文件操作的核心，必须完全掌握。

2.1 为什么需要with语句？

2.1.1 普通方式的问题

# 普通方式打开文件
f = open('test.txt', 'w', encoding='utf-8')
f.write('Hello')
# 问题：如果这里发生异常，close()不会执行！
# f.close() 容易被遗忘
print('文件已打开，但忘记关闭')

问题在哪？

如果写入过程中出错，程序会崩溃，close()不会执行
文件没关闭可能导致数据丢失
忘记关文件还浪费系统资源

2.1.2 try-finally方式

# 用try-finally确保关闭
f = None
try:
    f = open('test.txt', 'w', encoding='utf-8')
    f.write('try-finally方式')
finally:
    if f:
        f.close()  # 无论如何都会执行
print('已关闭')

缺点： 代码太繁琐！

2.1.3 with语句（推荐！）

# 用with语句，自动管理关闭
with open('test.txt', 'w', encoding='utf-8') as f:
    f.write('with语句方式')
    # with会自动调用 f.close()

print('文件已自动关闭')

优点：

自动关闭，无需手动写close()
即使出错也会正确关闭
代码更简洁

2.2 with语句的工作原理

with语句背后依赖两个方法：__enter__() 和 __exit__()

# with 相当于自动调用这两个方法：
# 1. with进入时 → 调用 __enter__()
# 2. with退出时 → 调用 __exit__()

# 想象成：
# f = open('test.txt', 'w')      → __enter__()返回文件对象
# f.write('Hello')
# f.close()                        → __exit__()自动调用

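Why use as?

with open('test.txt', 'r', encoding='utf-8') as f:
    # The file object returned by open() is bound to f
    content = f.read()
# On leaving the with block, f.close() is called automatically
print(content)  # content is still usable outside the block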
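To make the two hooks visible, here is a minimal sketch of a hand-written context manager. `Tracker` is a hypothetical class invented for this illustration. It only records the order in which things happen, including the case where the body raises an exception.

```python
class Tracker:
    """Hypothetical context manager that logs when each hook runs."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append('enter')
        return self  # this return value is what "as" binds to

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs on normal exit AND when the body raises
        self.events.append('exit')
        return False  # False means: do not swallow the exception


tracker = Tracker()
try:
    with tracker as t:
        t.events.append('body')
        raise ValueError('boom')
except ValueError:
    tracker.events.append('handled')

print(tracker.events)  # ['enter', 'body', 'exit', 'handled']
```

File objects implement the same two methods. Their `__exit__` closes the file, which is why the file is closed even when the body fails.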

2.3 with的多种用法

用法1：单个文件

with open('file1.txt', 'w') as f:
    f.write('写入内容')

用法2：同时操作多个文件

with open('source.txt', 'w') as src, open('dest.txt', 'w') as dst:
    src.write('源文件内容')
    dst.write('目标文件内容')

用法3：嵌套with

with open('outer.txt', 'w') as outer:
    outer.write('外层')
    with open('inner.txt', 'w') as inner:
        inner.write('内层')

用法4：with结合循环

lines = ['第一行', '第二行', '第三行']
with open('lines.txt', 'w') as f:
    for i, line in enumerate(lines):
        f.write(f'{i+1}. {line}\n')

3. 文本文件读写

文本文件是最常见的文件类型，.txt、.py、.md都是文本文件。

3.1 文件打开模式

打开文件时，需要指定模式：

模式	字符	说明	注意事项
只读	'r'	读取文件	文件不存在会报错
写入	'w'	写入文件	文件存在会清空内容
追加	'a'	在末尾添加	文件不存在会创建
创建	'x'	创建文件	文件存在会报错

加上 b 表示二进制模式：

rb、wb、ab - 二进制读写

加上 + 表示同时读写：

r+、w+、a+ - 读写模式
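The three "+" modes differ in what happens to existing content and where writes land. The sketch below uses a temporary file so it leaves nothing behind in the working directory; the file name is arbitrary.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')

with open(path, 'w', encoding='utf-8') as f:
    f.write('abcdef')

# 'r+': keeps the content; writing starts at the beginning and overwrites in place
with open(path, 'r+', encoding='utf-8') as f:
    f.write('XY')
    f.seek(0)  # move back to the start before reading
    after_r_plus = f.read()

# 'a+': keeps the content; every write goes to the end of the file
with open(path, 'a+', encoding='utf-8') as f:
    f.write('!')
    f.seek(0)
    after_a_plus = f.read()

# 'w+': empties the file first, then allows reading what was just written
with open(path, 'w+', encoding='utf-8') as f:
    f.write('new')
    f.seek(0)
    after_w_plus = f.read()

print(after_r_plus)  # XYcdef
print(after_a_plus)  # XYcdef!
print(after_w_plus)  # new
```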

模式对比

# 'w' 写入模式 - 文件存在会清空
f = open('test.txt', 'w')
f.write('第一次')
f.close()

f = open('test.txt', 'w')  # 再打开，内容被清空！
f.write('第二次')
f.close()

with open('test.txt', 'r') as f:
    print(f.read())  # 只有"第二次"

# 'a' 追加模式 - 在末尾添加
f = open('test.txt', 'a')
f.write('\n追加的内容')
f.close()

with open('test.txt', 'r') as f:
    print(f.read())  # "第二次" + "追加的内容"

# 'r' 只读模式
try:
    with open('not_exist.txt', 'r') as f:
        print(f.read())
except FileNotFoundError:
    print('文件不存在！')

# 'x' 创建模式
try:
    with open('new_file.txt', 'x') as f:
        f.write('新文件')
except FileExistsError:
    print('文件已存在！')

3.2 文件读取方法

准备测试文件：

with open('read_test.txt', 'w', encoding='utf-8') as f:
    f.write('第一行内容\n')
    f.write('第二行内容\n')
    f.write('第三行内容\n')
    f.write('第四行内容\n')
    f.write('第五行（无换行）')

方法1：read() - 读取全部

with open('read_test.txt', 'r', encoding='utf-8') as f:
    content = f.read()
    print(content)

方法2：read(n) - 读取n个字符

with open('read_test.txt', 'r', encoding='utf-8') as f:
    content = f.read(10)  # 只读10个字符
    print(content)

方法3：readline() - 读取一行

with open('read_test.txt', 'r', encoding='utf-8') as f:
    line1 = f.readline()  # 读第一行
    line2 = f.readline()  # 读第二行
    print(f'第一行: {line1}')
    print(f'第二行: {line2}')
    # 注意：readline会保留换行符\n

方法4：readlines() - 读取所有行到列表

with open('read_test.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()
    print(f'共{len(lines)}行')
    for i, line in enumerate(lines):
        print(f'{i}: {repr(line)}')  # repr显示原始内容

方法5：for循环遍历（推荐！）

with open('read_test.txt', 'r', encoding='utf-8') as f:
    for line in f:
        print(line.strip())  # strip()去除换行符

推荐原因： 内存友好，大文件也不会卡
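Iterating over the file object reads one line at a time, so memory use stays flat no matter how large the file is. A minimal sketch, assuming a generated 1000-line test file in a temporary directory:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'big_file.txt')

# Generate a test file with 1000 lines
with open(path, 'w', encoding='utf-8') as f:
    for i in range(1000):
        f.write(f'line {i}\n')

line_count = 0
longest = ''

# Only the current line is held in memory at any moment
with open(path, 'r', encoding='utf-8') as f:
    for line in f:
        line_count += 1
        text = line.rstrip('\n')
        if len(text) > len(longest):
            longest = text

print(line_count)  # 1000
print(longest)     # line 100
```

By contrast, read() and readlines() load the whole file into memory before any processing starts.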

3.3 文件写入方法

方法1：write() - 写入字符串

with open('write_test.txt', 'w', encoding='utf-8') as f:
    f.write('第一行')
    f.write('\n第二行')  # 换行要自己加\n
    f.write('\n第三行')

方法2：writelines() - 写入多行

lines = ['第一行\n', '第二行\n', '第三行\n']
with open('writelines_test.txt', 'w', encoding='utf-8') as f:
    f.writelines(lines)

注意： writelines不会自动加换行符！
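When the strings in the list do not already end with a line break, one option is to append it on the fly with a generator expression. A sketch using a temporary file:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'writelines_demo.txt')
lines = ['第一行', '第二行', '第三行']  # no trailing '\n' here

with open(path, 'w', encoding='utf-8') as f:
    # writelines accepts any iterable of strings
    f.writelines(line + '\n' for line in lines)

with open(path, 'r', encoding='utf-8') as f:
    content = f.read()

print(repr(content))  # '第一行\n第二行\n第三行\n'
```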

3.4 逐行处理实战

例1：读取并处理CSV格式数据

# 准备数据
data = '''姓名,年龄,专业
张三,20,人工智能
李四,21,计算机科学
王五,19,软件工程'''

with open('students.txt', 'w', encoding='utf-8') as f:
    f.write(data)

# 逐行处理
with open('students.txt', 'r', encoding='utf-8') as f:
    header = f.readline().strip()  # 读取第一行（表头）
    print(f'表头: {header}')
    
    for line in f:  # 遍历其余行
        line = line.strip()
        if line:
            parts = line.split(',')
            name, age, major = parts
            print(f'{name} - {major}')

例2：计算平均年龄

with open('students.txt', 'r', encoding='utf-8') as f:
    next(f)  # 跳过表头
    
    total_age = 0
    count = 0
    
    for line in f:
        line = line.strip()
        if line:
            parts = line.split(',')
            age = int(parts[1])
            total_age += age
            count += 1
    
    average = total_age / count
    print(f'学生人数: {count}')
    print(f'平均年龄: {average:.1f}')

例3：筛选并保存

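with open('students.txt', 'r', encoding='utf-8') as f:
    header = f.readline().strip()  # Header without its trailing newline

    with open('filtered.txt', 'w', encoding='utf-8') as out:
        out.write(header + '\n')  # Write the header as its own line

        for line in f:
            line = line.strip()
            if line:
                parts = line.split(',')
                age = int(parts[1])
                if age > 20:  # Keep students older than 20
                    # Newline goes after each row, so no blank line follows the header
                    out.write(line + '\n')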

4. CSV文件：表格数据存储

CSV是最常用的表格数据格式，可以用Excel打开。

4.1 什么是CSV？

CSV = Comma Separated Values（逗号分隔值）

示例内容：
姓名,年龄,专业
张三,20,人工智能
李四,21,计算机科学

特点：
- 每行一条记录
- 字段之间用逗号分隔
- 第一行通常是表头
- 可以直接用Excel打开编辑

4.2 CSV模块基础

Python内置了csv模块，使用前需要导入。

4.2.1 写入CSV

重要：一定要加 newline=''！

import csv

# 准备数据
header = ['姓名', '年龄', '专业', '成绩']
students = [
    ['张三', 20, '人工智能', 88.5],
    ['李四', 21, '计算机科学', 92.0],
    ['王五', 19, '软件工程', 85.5],
]

# 写入CSV（关键：newline=''）
with open('students.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)    # 写入一行
    writer.writerows(students)  # 写入多行

4.2.2 为什么要加 newline=''？

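Without it, extra blank lines appear on Windows. csv.writer already ends every row with '\r\n'. In text mode, Windows then translates that '\n' into '\r\n' a second time, so each row ends with '\r\r\n'. Linux and macOS do no such translation, so the problem does not show up there.

import csv

# Wrong
with open('wrong.csv', 'w', encoding='utf-8') as f:  # newline='' is missing
    writer = csv.writer(f)
    writer.writerow(['A', 'B', 'C'])

# Read the raw bytes to see the real line ending
with open('wrong.csv', 'rb') as f:
    print(f.read())
# On Windows:     b'A,B,C\r\r\n'  <- the extra \r shows up as a blank line
# On Linux/macOS: b'A,B,C\r\n'

# Right
with open('correct.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['A', 'B', 'C'])

with open('correct.csv', 'rb') as f:
    print(f.read())
# On every platform: b'A,B,C\r\n'  <- correct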

4.2.3 读取CSV

import csv

with open('students.csv', 'r', encoding='utf-8', newline='') as f:  # newline='' is recommended for reading too
    reader = csv.reader(f)
    for row in reader:
        print(row)
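Two details are easy to miss when reading with csv.reader: every field comes back as a string, and a field that contains a comma is quoted on write and restored on read. The sketch below uses made-up rows and a temporary file to show both.

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'typed.csv')

with open(path, 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['姓名', '年龄', '成绩'])
    writer.writerow(['张三', 20, 88.5])
    writer.writerow(['李四, Jr.', 21, 92.0])  # the comma forces quoting

with open(path, 'r', encoding='utf-8', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)  # first row is the header
    # Convert the numeric columns back from str
    rows = [[name, int(age), float(score)] for name, age, score in reader]

print(header)  # ['姓名', '年龄', '成绩']
print(rows)    # [['张三', 20, 88.5], ['李四, Jr.', 21, 92.0]]
```

A plain `line.split(',')` would break the second data row into four pieces, which is the main reason to prefer the csv module over manual splitting.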

4.3 CSV字典方式（更直观！）

4.3.1 写入

import csv

students = [
    {'姓名': '张三', '年龄': 20, '专业': '人工智能', '成绩': 88.5},
    {'姓名': '李四', '年龄': 21, '专业': '计算机科学', '成绩': 92.0},
    {'姓名': '王五', '年龄': 19, '专业': '软件工程', '成绩': 85.5},
]

fieldnames = ['姓名', '年龄', '专业', '成绩']

with open('students_dict.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()      # 写入表头
    writer.writerows(students)  # 写入多行

4.3.2 读取

import csv

with open('students_dict.csv', 'r', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    
    # reader.fieldnames 自动包含表头
    print(f'表头: {reader.fieldnames}')
    
    for row in reader:
        # 用键名访问，直观！
        print(f"姓名: {row['姓名']}, 专业: {row['专业']}, 成绩: {row['成绩']}")

4.3.3 对比：列表方式 vs 字典方式

# 列表方式
with open('students.csv', 'r', encoding='utf-8') as f:
    reader = csv.reader(f)
    next(reader)  # 跳过表头
    for row in reader:
        print(f'{row[0]} - {row[2]}')  # row[0]是什么？row[2]是什么？容易搞混
        # 还要记住：0是姓名，2是专业

# 字典方式
with open('students_dict.csv', 'r', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(f"{row['姓名']} - {row['专业']}")  # 一目了然！

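5. JSON Files: Structured Data Storage

JSON is one of the most widely used data formats today. It serves as the payload format for web APIs, as a configuration file format, and more.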

5.1 什么是JSON？

JSON = JavaScript Object Notation

特点：
- 轻量级数据交换格式
- 浏览器和服务器之间的标准数据传输格式
- 配置文件常用格式（config.json, package.json）

数据类型：
- 字符串："Hello"
- 数字：123, 45.67
- 布尔值：true, false
- 空值：null
- 数组：[]
- 对象：{}

Python数据类型 ↔ JSON数据类型

Python	JSON
str	string
int, float	number
bool	boolean
None	null
list, tuple	array
dict	object
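The mapping is not perfectly symmetric. A round trip through JSON turns tuples into lists and non-string dictionary keys into strings. A short sketch with made-up data:

```python
import json

original = {'point': (1, 2), 1: 'one', 'ok': True, 'nothing': None}

text = json.dumps(original)
restored = json.loads(text)

print(text)
# {"point": [1, 2], "1": "one", "ok": true, "nothing": null}

print(restored['point'])  # [1, 2]  (the tuple came back as a list)
print('1' in restored)    # True    (the int key became the string '1')
print(1 in restored)      # False
```

If exact types matter after loading, convert them back explicitly, for example `tuple(restored['point'])`.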

5.2 JSON读写操作

5.2.1 写入JSON文件

import json

config = {
    'app_name': '人工智能数据服务平台',
    'version': '1.0.0',
    'debug': True,
    'max_users': 100,
    'database': {
        'host': 'localhost',
        'port': 3306,
        'username': 'root',
        'password': '123456'
    },
    'allowed_modules': ['图像处理', '文本处理', '语音处理'],
    'settings': None
}

# 写入JSON
with open('config.json', 'w', encoding='utf-8') as f:
    # ensure_ascii=False：保留中文字符（重要！）
    # indent=2：格式化缩进，易读
    json.dump(config, f, ensure_ascii=False, indent=2)

5.2.2 读取JSON文件

import json

with open('config.json', 'r', encoding='utf-8') as f:
    data = json.load(f)  # 读取并解析为Python对象

print(f'应用名称: {data["app_name"]}')
print(f'版本: {data["version"]}')
print(f'调试模式: {data["debug"]}')
print(f'允许的模块: {data["allowed_modules"]}')
print(f'数据库主机: {data["database"]["host"]}')

5.2.3 dump/dumps/load/loads 区别

import json

data = {'name': '张三', 'age': 20}

# json.dump() - 写入文件
with open('test.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False)

# json.dumps() - 转为字符串（用于网络传输）
json_string = json.dumps(data, ensure_ascii=False)
print(f'字符串: {json_string}')
print(f'类型: {type(json_string)}')  # str

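# json.loads() - parse from a string
parsed = json.loads(json_string)
print(f'解析后: {parsed}')

# json.load() - read and parse from a file
with open('test.json', 'r', encoding='utf-8') as f:
    loaded = json.load(f)
print(f'从文件读取: {loaded}')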

5.2.4 实战：保存学生成绩

import json

scores = [
    {'姓名': '张三', '成绩': 88.5},
    {'姓名': '李四', '成绩': 92.0},
    {'姓名': '王五', '成绩': 85.5},
]

# 保存
with open('scores.json', 'w', encoding='utf-8') as f:
    json.dump(scores, f, ensure_ascii=False, indent=2)

# 读取
with open('scores.json', 'r', encoding='utf-8') as f:
    loaded = json.load(f)

# 计算平均分
total = sum(s['成绩'] for s in loaded)
avg = total / len(loaded)
print(f'平均成绩: {avg:.2f}')

6. 二进制文件：图片的读写

图片、音频、视频都是二进制文件，和文本文件处理方式不同！

6.1 文本文件 vs 二进制文件

类型	读出来	写进去	特点
文本文件 (t)	str	str	有编码（UTF-8等）
二进制文件 (b)	bytes	bytes	无编码，原始字节

文件模式：

'r'  或 'rt'   # 文本只读
'w'  或 'wt'   # 文本写入
'rb'           # 二进制只读
'wb'           # 二进制写入
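The link between the two modes is encoding: text mode encodes str to bytes on write and decodes on read, while binary mode hands over the raw bytes untouched. A minimal sketch that writes text and then reads the same file in binary mode, using a temporary file:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'encoding_demo.txt')
text = '你好'

# Text mode: the str is encoded to UTF-8 bytes on the way to disk
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Binary mode: read the raw bytes exactly as stored
with open(path, 'rb') as f:
    raw = f.read()

print(len(text))            # 2 characters
print(len(raw))             # 6 bytes (3 bytes per character in UTF-8)
print(raw)                  # b'\xe4\xbd\xa0\xe5\xa5\xbd'
print(raw.decode('utf-8'))  # 你好
```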

6.2 保存从网络下载的图片

上节课学过的requests爬取图片：

import requests

# requests.get().content 返回的是bytes（字节数据）
response = requests.get('https://example.com/image.jpg')
image_bytes = response.content

保存图片到本地：

# 假设这是从网络获取的图片字节
image_bytes = b'\x89PNG\r\n\x1a\n...'  # 实际的图片字节数据

# 用'wb'模式写入二进制文件
with open('downloaded_image.png', 'wb') as f:
    f.write(image_bytes)

print('图片保存成功！')

6.3 读取图片到内存

# 读取图片
with open('downloaded_image.png', 'rb') as f:
    image_data = f.read()

print(f'图片大小: {len(image_data)} 字节')
print(f'文件头: {image_data[:8].hex()}')  # A real PNG prints 89504e470d0a1a0a (bytes 89 50 4E 47 0D 0A 1A 0A)

6.4 content vs text 的区别

import requests

# response.text  → str（文本内容，如HTML、JSON）
# response.content → bytes（二进制内容，如图片、音频、视频）

# 示例：
response = requests.get('https://api.example.com/data')
html = response.text        # 字符串
json_data = response.json() # 自动解析JSON

# 下载图片
response = requests.get('https://example.com/photo.jpg')
image_bytes = response.content  # 字节数据

6.5 复制图片文件

# 一次性读取写入（适合小文件）
with open('photo1.png', 'rb') as src:
    data = src.read()

with open('photo1_copy.png', 'wb') as dst:
    dst.write(data)

print('图片复制完成！')

# 验证
import os
print(f'原文件: {os.path.getsize("photo1.png")} 字节')
print(f'复制文件: {os.path.getsize("photo1_copy.png")} 字节')
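For large files, reading everything into memory at once is wasteful. A common alternative is to copy in fixed-size chunks. In the sketch below, `copy_in_chunks` and the 64 KB chunk size are illustrative choices, and the source file is random test data rather than a real image.

```python
import os
import tempfile


def copy_in_chunks(src_path, dst_path, chunk_size=64 * 1024):
    """Copy a file piece by piece; returns the number of bytes copied."""
    copied = 0
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            dst.write(chunk)
            copied += len(chunk)
    return copied


workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, 'source.bin')
dst_path = os.path.join(workdir, 'copy.bin')

# Create a 200000-byte test file
with open(src_path, 'wb') as f:
    f.write(os.urandom(200_000))

copied = copy_in_chunks(src_path, dst_path)
print(copied)                     # 200000
print(os.path.getsize(dst_path))  # 200000
```

The standard library's `shutil.copyfile` does the same job in one call; the loop is written out here only to show the mechanism.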

7. 动手练习：爬取豆瓣电影Top250

这一节我们综合运用所学知识，爬取豆瓣电影Top250的数据，并保存到文件中。

7.1 准备知识：豆瓣电影Top250

豆瓣电影Top250是豆瓣网精选的250部高分电影，网址是：

https://movie.douban.com/top250

我们需要爬取：

电影排名
中文名称
英文名称
评分
经典台词（如果有的话）

练习1：爬取并保存电影名称到文本文件

目标： 用requests爬取豆瓣电影Top250首页，获取前10部电影的中文名称，保存到 movies.txt

步骤：

发送网络请求获取页面
用正则表达式提取电影名称
保存到文本文件

参考答案

import requests
import re

# 1. 发送请求获取页面
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
url = 'https://movie.douban.com/top250'

response = requests.get(url, headers=headers)
html = response.text

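# 2. Extract movie titles with a regular expression
# Titles sit in <span class="title">. The English-title span is assumed to
# start with "&nbsp;/&nbsp;", so excluding "&" in the pattern already keeps
# only the Chinese titles and no extra filtering is needed.
pattern = r'<span class="title">([^<&]+)</span>'
chinese_titles = re.findall(pattern, html)

# 3. Take the first 10
top10 = chinese_titles[:10]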

# 4. 保存到文本文件
with open('movies.txt', 'w', encoding='utf-8') as f:
    for i, title in enumerate(top10, 1):
        f.write(f'{i}. {title}\n')

print('已保存前10部电影到 movies.txt')

# 显示内容验证
with open('movies.txt', 'r', encoding='utf-8') as f:
    print(f.read())

运行结果：

1. 肖申克的救赎
2. 霸王别姬
3. 泰坦尼克号
4. 阿甘正传
5. 千与千寻
6. 美丽人生
7. 星际穿越
8. 这个杀手不太冷
9. 盗梦空间
10. 楚门的世界

练习2：爬取并保存为CSV文件

目标： 爬取前10部电影的完整信息（排名、中文名、英文名、评分），保存到 movies.csv

数据示例：

排名,中文名,英文名,评分
1,肖申克的救赎,The Shawshank Redemption,9.7
2,霸王别姬,,9.6
...

参考答案

import requests
import re
import csv

# 1. 爬取页面
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
url = 'https://movie.douban.com/top250'
response = requests.get(url, headers=headers)
html = response.text

# 2. Extract data with regular expressions
# Chinese and English titles share <span class="title">. The English one is
# assumed to start with "&nbsp;/&nbsp;", so "&" must be allowed in the match.
title_pattern = r'<span class="title">([^<]+)</span>'
# Rating
rating_pattern = r'<span class="rating_num"[^>]*>(\d+\.?\d*)</span>'

titles = [t.replace('&nbsp;', ' ').strip() for t in re.findall(title_pattern, html)]
ratings = re.findall(rating_pattern, html)

# 3. Organize the data: a title starting with "/" is the English name
# of the movie that precedes it
movies = []
for t in titles:
    if t.startswith('/'):
        if movies:
            movies[-1]['en_title'] = t.lstrip('/').strip()
        continue
    if len(movies) == 10:
        break  # the first 10 movies are complete
    rank = len(movies) + 1
    movies.append({
        'rank': rank,
        'title': t,
        'en_title': '',
        'rating': ratings[rank - 1] if rank <= len(ratings) else ''
    })

# 4. 保存到CSV
with open('movies.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['rank', 'title', 'en_title', 'rating'])
    writer.writeheader()
    writer.writerows(movies)

print('已保存到 movies.csv')

# 验证内容
with open('movies.csv', 'r', encoding='utf-8') as f:
    for line in f:
        print(line.strip())

练习3：爬取并保存为JSON文件

目标： 把电影数据保存为JSON格式，便于后续处理和API传输

JSON格式示例：

[
  {
    "rank": 1,
    "title": "肖申克的救赎",
    "en_title": "The Shawshank Redemption",
    "rating": "9.7",
    "quote": ""
  },
  ...
]

参考答案

import requests
import re
import json

# 1. 爬取页面
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
url = 'https://movie.douban.com/top250'
response = requests.get(url, headers=headers)
html = response.text

# 2. Extract data
# The English-title span is assumed to start with "&nbsp;/&nbsp;",
# so "&" must be allowed in the title match.
title_pattern = r'<span class="title">([^<]+)</span>'
rating_pattern = r'<span class="rating_num"[^>]*>(\d+\.?\d*)</span>'
quote_pattern = r'<span class="inq">([^<]+)</span>'

titles = [t.replace('&nbsp;', ' ').strip() for t in re.findall(title_pattern, html)]
ratings = re.findall(rating_pattern, html)
quotes = re.findall(quote_pattern, html)

# 3. Build the movie list
# Ratings and quotes are paired by position. If a movie has no quote,
# the quotes of later movies shift up by one.
movies = []
for t in titles:
    if t.startswith('/'):
        # English name of the movie that precedes it
        if movies:
            movies[-1]['en_title'] = t.lstrip('/').strip()
        continue
    if len(movies) == 10:
        break  # the first 10 movies are complete
    i = len(movies)
    movies.append({
        'rank': i + 1,
        'title': t,
        'en_title': '',
        'rating': ratings[i] if i < len(ratings) else '',
        'quote': quotes[i] if i < len(quotes) else ''
    })

# 4. 保存到JSON
with open('movies.json', 'w', encoding='utf-8') as f:
    json.dump(movies, f, ensure_ascii=False, indent=2)

print('已保存到 movies.json')

# 验证：读取并显示
with open('movies.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
    print(f'共保存 {len(data)} 部电影')
    for m in data[:3]:
        print(f"  {m['rank']}. {m['title']} ({m['en_title']}) - {m['rating']}")

练习4：读取CSV并筛选数据

目标： 读取之前保存的 movies.csv，筛选出评分高于9.5的电影

参考答案

import csv

# 读取CSV文件
with open('movies.csv', 'r', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    
    print('评分高于9.5的电影：')
    print('-' * 40)
    
    count = 0
    for row in reader:
        # 评分是字符串，转为浮点数比较
        if float(row['rating']) > 9.5:
            count += 1
            print(f"{row['rank']}. {row['title']}")
            print(f"   英文名: {row['en_title']}")
            print(f"   评分: {row['rating']}")
            print()
    
    print(f'共 {count} 部评分超过9.5')

运行结果：

评分高于9.5的电影：
----------------------------------------
1. 肖申克的救赎
   英文名: The Shawshank Redemption
   评分: 9.7

2. 霸王别姬
   英文名: 
   评分: 9.6

共 2 部评分超过9.5

练习5：读取JSON并统计

目标： 读取 movies.json，计算平均分，找出评分最高的电影

参考答案

import json

# 读取JSON
with open('movies.json', 'r', encoding='utf-8') as f:
    movies = json.load(f)

# 计算平均分
total = sum(float(m['rating']) for m in movies)
average = total / len(movies)
print(f'Top10 电影平均分: {average:.2f}')

# 找出最高分
highest = max(movies, key=lambda m: float(m['rating']))
print(f'\n评分最高的电影:')
print(f"  {highest['rank']}. {highest['title']} ({highest['en_title']})")
print(f"  评分: {highest['rating']}")

# 统计有经典台词的电影
with_quote = [m for m in movies if m['quote']]
print(f'\n有经典台词的电影: {len(with_quote)} 部')
for m in with_quote:
    print(f"  \"{m['quote']}\" —— {m['title']}")

练习6：保存电影海报图片（模拟）

目标： 模拟爬取电影海报（图片），保存到本地

实际场景中，海报URL从网页源码中提取：

<img alt="肖申克的救赎" src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p480747492.jpg">

参考答案

import requests
import os
import json

# 模拟：从网页提取的海报URL（实际应从HTML中提取）
poster_urls = [
    {'rank': 1, 'title': '肖申克的救赎', 'url': 'https://img3.doubanio.com/view/photo/s_ratio_poster/public/p480747492.jpg'},
    {'rank': 2, 'title': '霸王别姬', 'url': 'https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2911205318.jpg'},
    {'rank': 3, 'title': '泰坦尼克号', 'url': 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p457760035.jpg'},
]

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}

# 创建目录
os.makedirs('posters', exist_ok=True)

# 保存图片
saved_info = []
for info in poster_urls:
    try:
        # Request the image
        response = requests.get(info['url'], headers=headers, timeout=10)
        response.raise_for_status()  # Fail on 4xx/5xx instead of saving an error page as .jpg
        image_data = response.content
        
        # Save the image
        filename = f"posters/{info['rank']}_{info['title']}.jpg"
        with open(filename, 'wb') as f:
            f.write(image_data)
        
        saved_info.append({
            'rank': info['rank'],
            'title': info['title'],
            'filename': filename,
            'size': len(image_data)
        })
        print(f'已保存: {filename} ({len(image_data)} bytes)')
        
    except Exception as e:
        print(f'下载失败 {info["title"]}: {e}')

# 保存图片信息到JSON
with open('posters/info.json', 'w', encoding='utf-8') as f:
    json.dump(saved_info, f, ensure_ascii=False, indent=2)

print('\n图片信息已保存到 posters/info.json')

注意： 实际爬取时请添加延时（time.sleep(1)），不要过快请求，以免被封IP。

练习7：综合练习 - 批量爬取并保存

目标： 编写一个完整的爬虫脚本，爬取豆瓣Top10电影的所有信息，保存为CSV和JSON

参考答案

import requests
import re
import csv
import json
import os
import time

def crawl_douban_top10():
    """爬取豆瓣Top10电影信息"""
    
    print('开始爬取豆瓣电影Top10...')
    
    # 1. 爬取页面
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept': 'text/html,application/xhtml+xml',
    }
    url = 'https://movie.douban.com/top250'
    
    response = requests.get(url, headers=headers, timeout=10)
    html = response.text
    
    # 2. 提取数据
    # 电影名称（中文）
    title_cn = re.findall(r'<span class="title">([^<&]+)</span>', html)
    # 评分
    ratings = re.findall(r'<span class="rating_num"[^>]*>(\d+\.?\d*)</span>', html)
    # 经典台词
    quotes = re.findall(r'<span class="inq">([^<]+)</span>', html)
    
    # 3. 整理数据
    movies = []
    cn_index = 0
    for i in range(10):
        # 跳过英文名
        while cn_index < len(title_cn) and title_cn[cn_index].startswith('/'):
            cn_index += 1
        
        movie = {
            'rank': i + 1,
            'title': title_cn[cn_index] if cn_index < len(title_cn) else '',
            'rating': ratings[i] if i < len(ratings) else '',
            'quote': quotes[i] if i < len(quotes) else ''
        }
        movies.append(movie)
        cn_index += 1
    
    return movies

def save_to_csv(movies, filename):
    """保存为CSV"""
    with open(filename, 'w', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['rank', 'title', 'rating', 'quote'])
        writer.writeheader()
        writer.writerows(movies)
    print(f'CSV已保存: {filename}')

def save_to_json(movies, filename):
    """保存为JSON"""
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(movies, f, ensure_ascii=False, indent=2)
    print(f'JSON已保存: {filename}')

def main():
    # 创建输出目录
    os.makedirs('douban_output', exist_ok=True)
    
    # 爬取数据
    movies = crawl_douban_top10()
    
    # 保存文件
    save_to_csv(movies, 'douban_output/movies.csv')
    save_to_json(movies, 'douban_output/movies.json')
    
    # 显示结果
    print('\n爬取结果：')
    print('-' * 50)
    for m in movies:
        quote_text = f'「{m["quote"]}」' if m['quote'] else ''
        print(f"{m['rank']}. {m['title']} - 评分: {m['rating']} {quote_text}")
    
    print('\n完成！')

if __name__ == '__main__':
    main()

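Appendix: Common Errors

Error	Cause	Fix
`UnicodeDecodeError`	Reading a binary file in text mode, or reading text with the wrong encoding	Use `'rb'` for binary files; pass the correct `encoding` for text files
`requests.exceptions.SSLError`	SSL certificate problem	Check the system clock and update the `certifi` package; `verify=False` skips verification and is only acceptable for local testing
Extra blank lines in CSV	`newline=''` is missing	Add `newline=''`
`FileNotFoundError`	The file does not exist	Create it with `'w'` mode, or check for it first
Output has no line breaks	`write()` does not add newlines	Add `'\n'` yourself
Chinese appears as `\uXXXX` escapes in JSON	`ensure_ascii=False` is missing	Add `ensure_ascii=False`
Crawler gets blocked	Requests are sent too fast	Add a `time.sleep(1)` delay
Chinese text is garbled	Wrong file encoding	Always pass `encoding='utf-8'`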

课程回顾

本节学习了：

✅ Python基础（变量、列表、字典、字符串）
✅ with语句（自动关闭文件）
✅ 文本文件读写（open/read/write）
✅ CSV文件操作（csv模块）
✅ JSON文件操作（json模块）
✅ 二进制文件（图片读写）
✅ 综合实战：爬取豆瓣电影Top250

学习建议：

先理解每个知识点的原理
跟着示例代码动手敲一遍
修改代码中的参数，观察结果变化
尝试完成所有练习题

拓展挑战：

爬取豆瓣Top250全部250部电影
保存电影海报图片
用matplotlib绘制评分分布图
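As a starting point for the first challenge, the sketch below only builds the list of page URLs. It assumes the list is paginated 25 movies per page through a `start` query parameter. That is an assumption about the site, not something the text above confirms, so check it in the browser first. `build_page_urls` is a hypothetical helper.

```python
BASE_URL = 'https://movie.douban.com/top250'
PAGE_SIZE = 25  # assumed number of movies per page


def build_page_urls(total=250, page_size=PAGE_SIZE):
    """Return one URL per page, using an assumed ?start=<offset> parameter."""
    return [f'{BASE_URL}?start={offset}' for offset in range(0, total, page_size)]


page_urls = build_page_urls()
print(len(page_urls))  # 10
print(page_urls[0])    # https://movie.douban.com/top250?start=0
print(page_urls[-1])   # https://movie.douban.com/top250?start=225
```

Each URL can then be fetched with the same `requests.get(url, headers=headers, timeout=10)` call used in the exercises, with a `time.sleep(1)` pause between requests.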
