From 4d52003d48254d01191810f3ba5925b3cda72546 Mon Sep 17 00:00:00 2001
From: 2509165025 <2509165025@student.edu.cn>
Date: Tue, 24 Mar 2026 11:37:14 +0800
Subject: [PATCH] Complete assignment 1
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
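Note on `爬虫.py (2).txt`: the inner loop assumes that every `div.item` block contains a `span.title`. A minimal sketch of that parsing step, pulled into a function that can be checked without network access, might look like the following. The name `extract_titles` and the sample HTML are hypothetical and are not part of this patch.

```python
from bs4 import BeautifulSoup


def extract_titles(html):
    """Return the first title found in each <div class="item"> block."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for item in soup.find_all("div", class_="item"):
        span = item.find("span", class_="title")
        if span is not None:  # skip entries without a title instead of crashing
            titles.append(span.get_text(strip=True))
    return titles


# Hypothetical sample markup that mimics the structure the script expects.
sample = """
<div class="item"><span class="title">Movie A</span><span class="title">/ Alt</span></div>
<div class="item"><span class="other">no title here</span></div>
<div class="item"><span class="title"> Movie B </span></div>
"""
print(extract_titles(sample))
```

Keeping the parsing separate from the request loop means a blocked or unexpected page yields an empty list rather than an `AttributeError`.
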
 README.md            | 502 -------------------------------------------
 爬虫/爬虫.py (2).txt |  30 +++
 2 files changed, 30 insertions(+), 502 deletions(-)
 delete mode 100644 README.md
 create mode 100644 爬虫/爬虫.py (2).txt
diff --git a/README.md b/README.md
deleted file mode 100644
index 3887288..0000000
--- a/README.md
+++ /dev/null
@@ -1,502 +0,0 @@
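-# Web Data Collection (Crawler Basics)
-
-## Course Positioning
-
-| Item | Details |
-|------|---------|
-| **Course name** | Web Data Collection (Crawler Basics) |
-| **Prerequisite** | Python Basics |
-| **Follow-up course** | Data Cleaning |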
-
----
-
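-## Learning Objectives
-
-1. Understand the basic principles of web crawlers
-2. Understand the HTTP protocol and the structure of web pages
-3. Become proficient at scraping data with requests, BeautifulSoup, and lxml/XPath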
-
----
-
-## Course Sequence
-
-```
-Data collection (crawling) ──→ Data storage ──→ Data cleaning ──→ Data augmentation
-```
-
----
-
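-## Part 1: Crawler Principles and the HTTP Protocol
-
-### 1.1 Why Learn Web Crawling
-
-- Where data comes from
-- The value of crawlers in AI data services
-- Legal compliance: robots.txt and the Robots Exclusion Protocol
-
-### 1.2 How Crawlers Work
-
-**What is a crawler?**
-
-A crawler (or spider) is an automated program that fetches data from the internet.
-
-**Workflow:**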
-
-```
-┌──────────┐  HTTP request   ┌────────────┐  HTTP response  ┌──────────┐
-│ Crawler  │ ──────────────→ │ Web server │ ──────────────→ │ Crawler  │
-└──────────┘                 └────────────┘                 └──────────┘
-      │                                                          │
-      │                    ┌────────────────┐                    │
-      └──────────────────→ │ Parse, extract │ ←──────────────────┘
-                           │ Save data      │
-                           └────────────────┘
-```
-
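-**Core steps of a crawler:**
-
-1. **Send the request**: send an HTTP request to the target server
-2. **Receive the response**: receive the HTML, JSON, or other content the server returns
-3. **Parse the data**: extract the required information from the response
-4. **Save the data**: store the data in a format such as CSV or JSON
-
-### 1.3 The HTTP Protocol
-
-**Requests and responses**
-
-- A request consists of a request line, request headers, and a request body
-- A response consists of a status line, response headers, and a response body
-
-**GET vs. POST**
-
-| Method | Purpose | Parameter location | Security |
-|--------|---------|--------------------|----------|
-| GET | Retrieve data | URL query string | Low (parameters are visible in the URL) |
-| POST | Submit data | Request body | Somewhat higher (parameters are not in the URL, but are encrypted only over HTTPS) |
-
-**Common request headers**
-
-- `User-Agent`: identifies the client (browser)
-- `Referer`: the page the request came from
-- `Cookie`: session tracking
-
-**HTTP status codes**
-
-| Status code | Meaning |
-|-------------|---------|
-| 200 | Request succeeded |
-| 301/302 | Permanent/temporary redirect |
-| 403 | Access forbidden |
-| 404 | Page not found |
-| 500 | Internal server error |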
-
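-### 1.4 Web Page Structure Basics
-
-**HTML**
-
-HTML is the markup language that defines a page's structure. Common tags include `<div>`, `<span>`, `<a>`, `<p>`, `<table>`, and `<li>`.
-
-**CSS selectors**
-
-Locate elements by class, id, or tag name:
-- `.title`: elements whose class is title
-- `#header`: the element whose id is header
-- `a`: all a tags
-
-**XPath expressions**
-
-XPath (XML Path Language) is used to locate nodes:
-
-| Expression | Meaning |
-|------------|---------|
-| `/html/body/div` | Absolute path starting from the root node |
-| `//div` | div tags at any position |
-| `//div[@class="title"]` | div whose class is title |
-| `//a/text()` | Text inside a tags |
-| `//a/@href` | href attribute of a tags |
-
-**The DOM tree**
-
-Page content is organized as a tree of parent, child, and sibling nodes.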
-
----
-
-## Part 2: The requests Library
-
-### 2.1 Introduction to requests
-
-- The most popular HTTP library for Python
-- A simple, easy-to-use API
-- Supports HTTP and HTTPS
-
-**Installation:**
-```bash
-pip install requests
-```
-
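-### 2.2 Basic Usage
-
-```python
-import requests
-
-# Send a GET request
-response = requests.get('https://example.com')
-
-# Inspect the response status code
-print(response.status_code)  # 200
-
-# Inspect the response body as text
-print(response.text)
-
-# Parse the response body as JSON (raises an error if the body is not valid JSON)
-print(response.json())
-
-# Inspect the response headers
-print(response.headers)
-```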
-
-### 2.3 Common Methods
-
-| Method | Purpose |
-|--------|---------|
-| `requests.get()` | Send a GET request |
-| `requests.post()` | Send a POST request |
-| `requests.put()` | Send a PUT request |
-| `requests.delete()` | Send a DELETE request |
-
-### 2.4 Common Attributes
-
-| Attribute | Purpose |
-|-----------|---------|
-| `response.status_code` | HTTP status code |
-| `response.text` | Response body as a string |
-| `response.content` | Response body as bytes |
-| `response.json()` | Response body parsed as JSON |
-| `response.headers` | Response headers |
-| `response.cookies` | Cookies |
-
-### 2.5 Simulating a Browser Request
-
-```python
-headers = {
-    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
-    'Referer': 'https://www.example.com',
-}
-
-response = requests.get(url, headers=headers)
-```
-
-### 2.6 Requests with Parameters
-
-```python
-# URL query parameters
-params = {'keyword': 'Python', 'page': 1}
-response = requests.get('https://example.com/search', params=params)
-
-# POST request (form data)
-data = {'username': 'test', 'password': '123456'}
-response = requests.post('https://example.com/login', data=data)
-```
-
-### 2.7 Cookies and Sessions
-
-```python
-# Send a cookie with a single request
-cookies = {'session_id': 'abc123'}
-response = requests.get(url, cookies=cookies)
-
-# Use a Session to persist cookies across requests
-session = requests.Session()
-session.cookies.set('session_id', 'abc123')
-response = session.get(url)
-```
-
-### 2.8 Timeouts
-
-```python
-response = requests.get(url, timeout=10)
-```
-
----
-
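-## Part 3: The BeautifulSoup Library
-
-### 3.1 Introduction to BeautifulSoup
-
-- A Python library for parsing HTML and XML
-- Provides ways to navigate, search, and modify the parse tree
-- Often used together with requests
-
-**Installation:**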
-```bash
-pip install beautifulsoup4 lxml
-```
-
-### 3.2 Basic Usage
-
-```python
-from bs4 import BeautifulSoup
-
-# Parse HTML (using the lxml parser)
-soup = BeautifulSoup(html_content, 'lxml')
-```
-
-**Parser comparison**
-
-| Parser | Speed | Notes |
-|--------|-------|-------|
-| lxml | Fast | Recommended |
-| html.parser | Medium | Built into Python |
-| html5lib | Slow | Parses most like a browser |
-
-### 3.3 Finding Elements
-
-```python
-# Find the first matching element
-soup.find('div')                    # by tag name
-soup.find('div', class_='title')    # filter by class
-soup.find('div', id='header')       # filter by id
-
-# Find all matching elements
-soup.find_all('a')                  # all a tags
-soup.find_all('a', limit=10)        # limit the number of results
-
-# CSS selectors (recommended)
-soup.select('.title')               # class is title
-soup.select('#header')              # id is header
-soup.select('div a')                # a tags that are descendants of a div
-soup.select('div > a')              # a tags that are direct children of a div
-```
-
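-### 3.4 Getting Element Content
-
-```python
-# Get text
-element.get_text()                   # all text, including text in descendants
-element.string                       # the single child string, or None if there are several children
-
-# Get attributes
-element['href']                      # href attribute (raises KeyError if missing)
-element.get('class')                 # class attribute (a list of class names, or None if missing)
-```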
-
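-### 3.5 Navigating the DOM Tree
-
-```python
-# Navigating down
-soup.head                            # the head tag
-soup.body                            # the body tag
-soup.div.p                           # the first p inside the first div
-
-# Navigating up and sideways
-element.parent                       # parent node
-element.next_sibling                 # next sibling node (may be a whitespace text node)
-```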
-
----
-
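-## Part 4: The lxml Library and XPath
-
-### 4.1 Introduction to lxml
-
-- A high-performance library for processing XML and HTML
-- Supports XPath and XSLT
-- Built on C libraries, so it is fast
-
-**Installation:**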
-```bash
-pip install lxml
-```
-
-### 4.2 Basic XPath Syntax
-
-| Expression | Meaning |
-|------------|---------|
-| `/` | Root node |
-| `//` | Any position |
-| `.` | Current node |
-| `..` | Parent node |
-| `@` | Attribute |
-| `*` | Any element |
-| `[n]` | The n-th element (counting from 1) |
-
-### 4.3 XPath Examples
-
-```python
-from lxml import etree
-
-tree = etree.HTML(html_content)
-
-# Basic paths
-tree.xpath('//div')                  # all div tags
-tree.xpath('//div/p')                # p tags that are children of a div
-
-# Attribute selection
-tree.xpath('//div[@class="title"]')          # class equals title
-tree.xpath('//div[contains(@class, "title")]')  # class contains title
-tree.xpath('//a[@href]')             # a tags that have an href attribute
-
-# Getting text
-tree.xpath('//div/text()')           # text directly inside div tags
-
-# Getting attributes
-tree.xpath('//a/@href')              # href attribute of every a tag
-
-# Positional selection
-tree.xpath('//li[1]')                # every li that is the first li child of its parent
-tree.xpath('//li[last()]')           # every li that is the last li child of its parent
-
-# Logical operators
-tree.xpath('//div[@id="main" and @class="content"]')
-```
-
-### 4.4 Using lxml in Python
-
-```python
-from lxml import etree
-
-# Parse an HTML string
-html = '<html><body><div class="title">Hello</div></body></html>'
-tree = etree.HTML(html)
-
-# Get elements
-titles = tree.xpath('//div[@class="title"]')
-for title in titles:
-    print(title.text)                # get the text
-    print(title.get('class'))        # get an attribute
-
-# Get the text content (safe: returns '' when nothing matches)
-text = tree.xpath('string(//div[@class="title"])')
-```
-
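-### 4.5 lxml vs. BeautifulSoup
-
-| Feature | BeautifulSoup | lxml |
-|---------|---------------|------|
-| API design | Object-oriented, friendly | XPath expressions |
-| Speed | Slower | Fast |
-| Flexibility | High | Medium |
-| Best suited for | Complex DOM structures | Cleanly structured pages |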
-
----
-
-## Part 5: Hands-On Example
-
-### Scraping Movie Rating Data
-
-```python
-import requests
-from bs4 import BeautifulSoup
-import csv
-import time
-
-# 1. Send the request
-url = 'https://movie.douban.com/top250'
-headers = {'User-Agent': 'Mozilla/5.0...'}
-response = requests.get(url, headers=headers)
-
-# 2. Parse the data
-soup = BeautifulSoup(response.text, 'lxml')
-movies = []
-
-for item in soup.select('.item'):
-    title = item.select_one('.title').get_text()
-    rating = item.select_one('.rating_num').get_text()
-    quote = item.select_one('.inq').get_text() if item.select_one('.inq') else ''
-    
-    movies.append({
-        'title': title.strip(),
-        'rating': rating,
-        'quote': quote
-    })
-
-# 3. Save as CSV
-with open('movies.csv', 'w', newline='', encoding='utf-8') as f:
-    writer = csv.DictWriter(f, fieldnames=['title', 'rating', 'quote'])
-    writer.writeheader()
-    writer.writerows(movies)
-
-print(f'Saved {len(movies)} movies to movies.csv')
-```
-
-### The Same Task with lxml/XPath
-
-```python
-import requests
-from lxml import etree
-
-url = 'https://movie.douban.com/top250'
-headers = {'User-Agent': 'Mozilla/5.0...'}
-
-response = requests.get(url, headers=headers)
-tree = etree.HTML(response.text)
-
-# Extract the data with XPath
-movies = []
-for item in tree.xpath('//div[@class="item"]'):
-    title = item.xpath('.//span[@class="title"]/text()')
-    rating = item.xpath('.//span[@class="rating_num"]/text()')
-    
-    if title and rating:
-        movies.append({
-            'title': title[0],
-            'rating': rating[0]
-        })
-
-print(f'Extracted {len(movies)} movies')
-```
-
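-### Advanced Techniques
-
-- Crawling across paginated results
-- Error handling and retries
-- Pausing between requests (to avoid being blocked)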
-
----
-
-## Part 6: Data Storage
-
-### Saving as CSV
-
-```python
-import pandas as pd
-df = pd.DataFrame(movies)
-df.to_csv('data.csv', index=False, encoding='utf-8-sig')
-```
-
-### Saving as JSON
-
-```python
-import json
-with open('data.json', 'w', encoding='utf-8') as f:
-    json.dump(movies, f, ensure_ascii=False, indent=2)
-```
-
-### Saving as Excel
-
-```python
-df.to_excel('data.xlsx', index=False)
-```
-
----
-
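-## Summary of Core Libraries
-
-| Library | Role | Typical use |
-|---------|------|-------------|
-| **requests** | HTTP request library | Sending network requests and fetching page content |
-| **BeautifulSoup** | HTML/XML parsing library | Parsing HTML and extracting data; supports CSS selectors |
-| **lxml** | XML/HTML processing library | High-performance parsing; supports XPath |
-| **XPath** | Path query language | Locating elements in XML/HTML precisely and quickly |
-
-**Recommended combinations:**
-- Beginners: requests + BeautifulSoup (CSS selectors are more intuitive)
-- More advanced work: requests + lxml (XPath is more powerful and faster)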
-
----
-
-## Python Libraries to Install
-
-```bash
-pip install requests beautifulsoup4 lxml pandas openpyxl
-```
diff --git a/爬虫/爬虫.py (2).txt b/爬虫/爬虫.py (2).txt
new file mode 100644
index 0000000..00dee05
--- /dev/null
+++ b/爬虫/爬虫.py (2).txt	
@@ -0,0 +1,30 @@
+import requests
+from bs4 import BeautifulSoup
+import time
+
+headers = {
+    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
+}
+
+all_movies = []
+
+for page in range(0, 250, 25):
+    url = f"https://movie.douban.com/top250?start={page}&filter="
+    print(f"Fetching page {page//25 + 1}: {url}")
+
+    response = requests.get(url, headers=headers, timeout=10)
+    response.encoding = "utf-8"
+    soup = BeautifulSoup(response.text, "html.parser")
+
+    # Each movie entry on the page is a <div class="item">.
+    items = soup.find_all("div", class_="item")
+    for item in items:
+        title = item.find("span", class_="title").get_text(strip=True)
+        all_movies.append(title)
+        print(title)
+
+    # Pause between pages to avoid overloading the server.
+    time.sleep(1)
+
+
+print(f"\nTotal movies scraped: {len(all_movies)}")
\ No newline at end of file