{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3-2-1 Introduction to Text Data Processing\n",
"## Classroom Demo Notebook\n",
"\n",
"---\n",
"\n",
"## Contents\n",
"\n",
"1. [What is text data?](#第一部分-什么是文本数据)\n",
"2. [How does a computer read text?](#第二部分-计算机如何读取文本)\n",
"3. [Vector basics](#第三部分-向量基础入门)\n",
"4. [Cosine similarity](#第四部分-余弦相似度)\n",
"5. [The core idea of text vectorization](#第五部分-文本向量化的核心思想)\n",
"6. [The Bag-of-Words (BoW) model](#第六部分-bow词袋模型)\n",
"7. [TF-IDF (term frequency-inverse document frequency)](#第七部分-tf-idf)\n",
"8. [Word embeddings](#第八部分-word-embedding词嵌入)\n",
"9. [The complete text-processing pipeline](#第九部分-文本处理完整流程)\n",
"10. [Hands-on: Chinese word segmentation with jieba](#第十部分-实战用jieba进行中文分词)\n",
"\n",
"---\n",
"\n",
"**Note**: running this notebook requires the following dependencies:\n",
"```bash\n",
"pip install numpy matplotlib jieba\n",
"```\n",
"- The BoW and TF-IDF code is implemented in pure Python + NumPy and does not depend on sklearn.\n",
"- If the server has no Chinese font installed, Chinese text in the charts may render as empty boxes; this is expected.\n"
]
},
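{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"第一部分-什么是文本数据\"></a>\n",
"\n",
"# Part 1: What Is Text Data?\n",
"\n",
"## 1.1 Definition of Text Data\n",
"\n",
"**Text data** is sequential information made up of characters and symbols; it is how human language is represented inside a computer.\n",
"\n",
"### Everyday examples of text data\n",
"\n",
"| Type | Example |\n",
"|------|------|\n",
"| A sentence | \"The weather is lovely today\" |\n",
"| An article | A news report |\n",
"| A review | \"The food at this restaurant is delicious!\" |\n",
"| A conversation | \"Hello, how much is this book?\" |\n",
"| A poem | \"床前明月光,疑是地上霜\" (\"Moonlight before my bed; I took it for frost on the ground\") |\n",
"| A piece of code | `print('Hello World')` |\n",
"| An email | Body, recipient, sender, and so on |\n",
"| Chat logs | WeChat conversations, SMS messages |\n",
"\n",
"**In short: any information made up of written characters is text data!**"
]
},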
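{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1.2 Characteristics of Text Data\n",
"\n",
"Text data differs markedly from image, audio, and other kinds of data:\n",
"\n",
"| Characteristic | Description | Example |\n",
"|------|------|------|\n",
"| **Discrete symbols** | Made of discrete characters or words rather than continuous numeric values | \"hello\" consists of the 5 characters h, e, l, l, o |\n",
"| **Sequential** | Symbols appear in a specific order, and changing the order changes the meaning | \"我爱你\" (\"I love you\") ≠ \"你爱我\" (\"You love me\") |\n",
"| **Semantically rich** | The same word can mean different things in different settings | \"苹果\" (\"apple\") can be a fruit or a phone brand |\n",
"| **Context-dependent** | What a word refers to depends on its context | In \"他打了猫,猫跑了\" (\"He hit the cat, and the cat ran away\"), the second \"猫\" refers to the same cat as the first |\n",
"| **Ambiguous** | The same sentence can be read in more than one way | \"The weather is really nice\" can be sincere or sarcastic |\n",
"\n",
"### Think about it: how important is order?\n",
"\n",
"```\n",
"Text 1: \"我吃了饭\"  (\"I ate the meal\")\n",
"Text 2: \"饭了我吃\"  (not a well-formed sentence)\n",
"Text 3: \"饭吃了我\"  (\"The meal ate me\")\n",
"\n",
"These three texts consist of exactly the same characters, but the order differs, and so does the meaning!\n",
"This shows that the order of a text carries important semantic information.\n",
"```"
]
},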
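{
"cell_type": "markdown",
"metadata": {},
"source": [
"The point about order can be checked in a few lines. The following is a minimal sketch added for illustration, not part of the original lesson; the variable names are assumptions. It compares the character counts of the three example texts with the texts themselves.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"texts = [\"我吃了饭\", \"饭了我吃\", \"饭吃了我\"]\n",
"\n",
"# Count how often each character occurs, ignoring order\n",
"counts = [Counter(t) for t in texts]\n",
"\n",
"# Do all three texts contain the same characters the same number of times?\n",
"print(counts[0] == counts[1] == counts[2])\n",
"\n",
"# How many distinct strings are there once order is taken into account?\n",
"print(len(set(texts)))\n",
"```\n",
"\n",
"A representation that only counts characters cannot tell these texts apart, which is why order has to be modeled separately."
]
},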
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"<a id=\"第二部分-计算机如何读取文本\"></a>\n",
"\n",
"# Part 2: How Does a Computer \"Read\" Text?\n",
"\n",
"## 2.1 Comparison: How Image Data and Text Data Are Stored\n",
"\n",
"### Reading image data\n",
"\n",
"```\n",
"Image file (.jpg/.png)\n",
"    ↓\n",
"The computer reads pixel values (for a typical 8-bit image, each channel value is a number from 0 to 255)\n",
"    ↓\n",
"Stored as a 3-D array [height, width, channels (RGB)]\n",
"    ↓\n",
"One 1920×1080 color image = 1920 × 1080 × 3 = 6,220,800 numbers\n",
"```\n",
"\n",
"**The essence of an image: a dense numeric array that the computer can process directly!**\n",
"\n",
"### Reading text data\n",
"\n",
"```\n",
"Text file (.txt/.md/.py)\n",
"    ↓\n",
"The computer decodes the bytes using a character encoding (ASCII/UTF-8/GBK)\n",
"    ↓\n",
"Stored as a sequence of characters (each character is a numeric code)\n",
"    ↓\n",
"\"Python\" → [80, 121, 116, 104, 111, 110] (ASCII codes)\n",
"```\n",
"\n",
"**The essence of text: a sequence of symbols that needs extra processing before the computer can make sense of it!**"
]
},
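{
"cell_type": "markdown",
"metadata": {},
"source": [
"The contrast between the two storage schemes can be made concrete with a short sketch. It is an illustrative addition, not part of the original lesson: the image is a blank placeholder array created with `np.zeros` rather than a file read from disk, and the names `image` and `codes` are assumptions.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Placeholder for a 1920x1080 RGB image: [height, width, channels], 8 bits per channel\n",
"image = np.zeros((1080, 1920, 3), dtype=np.uint8)\n",
"print(image.shape, image.size)\n",
"\n",
"# Text is a sequence of characters, each with a numeric code\n",
"codes = [ord(c) for c in \"Python\"]\n",
"print(codes)\n",
"```\n",
"\n",
"The image is already an array of numbers that supports arithmetic, whereas the character codes are only labels for symbols."
]
},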
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.2 Character Encoding: Representing Characters as Numbers\n",
"\n",
"### ASCII (English letters and some symbols)\n",
"\n",
"ASCII uses the numbers 0-127 to represent 128 characters:\n",
"\n",
"| Character | ASCII code | Description |\n",
"|------|---------|------|\n",
"| 'A' | 65 | Uppercase letter |\n",
"| 'B' | 66 | Uppercase letter |\n",
"| ... | ... | ... |\n",
"| 'Z' | 90 | Uppercase letter |\n",
"| 'a' | 97 | Lowercase letter |\n",
"| 'b' | 98 | Lowercase letter |\n",
"| ... | ... | ... |\n",
"| 'z' | 122 | Lowercase letter |\n",
"| '0' | 48 | Digit |\n",
"| '1' | 49 | Digit |\n",
"| ... | ... | ... |\n",
"| '9' | 57 | Digit |\n",
"\n",
"### Unicode and UTF-8 (covering the world's writing systems, including Chinese)\n",
"\n",
"Unicode assigns every character a number called a code point. UTF-8 is a variable-length encoding of those code points that uses 1 to 4 bytes per character; common Chinese characters take 3 bytes:\n",
"\n",
"| Character | Unicode code point | Bytes in UTF-8 |\n",
"|------|-------------|--------|\n",
"| '中' | 20013 | 3 bytes |\n",
"| '文' | 25991 | 3 bytes |\n",
"| 'P' | 80 | 1 byte |\n",
"| 'y' | 121 | 1 byte |\n",
"| '👍' | 128077 | 4 bytes (emoji) |"
]
},
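{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see the difference between a code point and its UTF-8 bytes, the sketch below prints both for a handful of characters. It is an illustrative addition that uses only Python built-ins; the variable names are assumptions.\n",
"\n",
"```python\n",
"for ch in [\"中\", \"文\", \"P\", \"y\", \"👍\"]:\n",
"    encoded = ch.encode(\"utf-8\")  # the bytes actually stored in a UTF-8 file\n",
"    # character, code point, number of UTF-8 bytes, bytes in hexadecimal\n",
"    print(ch, ord(ch), len(encoded), encoded.hex())\n",
"```\n",
"\n",
"`ord()` reports the code point, while `len(ch.encode(\"utf-8\"))` reports how many bytes UTF-8 needs to store it."
]
},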
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"==================================================\n",
"Character codes of English text\n",
"==================================================\n",
"Text: Hello\n",
"ASCII code of each character: [72, 101, 108, 108, 111]\n",
"\n",
"  'H' -> 72\n",
"  'e' -> 101\n",
"  'l' -> 108\n",
"  'l' -> 108\n",
"  'o' -> 111\n",
"\n",
"==================================================\n",
"Character codes of Chinese text\n",
"==================================================\n",
"Text: 你好\n",
"Unicode code point of each character: [20320, 22909]\n",
"\n",
"  '你' -> 20320\n",
"  '好' -> 22909\n"
]
}
],
"source": [
"# Live demo: inspect the numeric codes of characters\n",
"\n",
"# English example\n",
"text_en = \"Hello\"\n",
"print(\"=\" * 50)\n",
"print(\"Character codes of English text\")\n",
"print(\"=\" * 50)\n",
"print(f\"Text: {text_en}\")\n",
"print(f\"ASCII code of each character: {[ord(c) for c in text_en]}\")\n",
"print()\n",
"\n",
"# Show them one by one\n",
"for c in text_en:\n",
"    print(f\"  '{c}' -> {ord(c)}\")\n",
"\n",
"print()\n",
"print(\"=\" * 50)\n",
"print(\"Character codes of Chinese text\")\n",
"print(\"=\" * 50)\n",
"\n",
"# Chinese example (ord() returns the Unicode code point, not the UTF-8 bytes)\n",
"text_cn = \"你好\"\n",
"print(f\"Text: {text_cn}\")\n",
"print(f\"Unicode code point of each character: {[ord(c) for c in text_cn]}\")\n",
"print()\n",
"\n",
"# Show them one by one\n",
"for c in text_cn:\n",
"    print(f\"  '{c}' -> {ord(c)}\")"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Check: converting numeric codes back to characters\n",
"\n",
"chr(65) = 'A'  # should be the uppercase letter A\n",
"chr(97) = 'a'  # should be the lowercase letter a\n",
"chr(20013) = '中'  # should be the Chinese character '中'\n",
"chr(25991) = '文'  # should be the Chinese character '文'\n"
]
}
],
"source": [
"# Verify in the other direction with chr(): numeric code to character\n",
"print(\"Check: converting numeric codes back to characters\")\n",
"print()\n",
"\n",
"# 65 is the uppercase letter A\n",
"print(f\"chr(65) = '{chr(65)}'  # should be the uppercase letter A\")\n",
"\n",
"# 97 is the lowercase letter a\n",
"print(f\"chr(97) = '{chr(97)}'  # should be the lowercase letter a\")\n",
"\n",
"# 20013 is the Chinese character \"中\"\n",
"print(f\"chr(20013) = '{chr(20013)}'  # should be the Chinese character '中'\")\n",
"\n",
"# 25991 is the Chinese character \"文\"\n",
"print(f\"chr(25991) = '{chr(25991)}'  # should be the Chinese character '文'\")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"==================================================\n",
"Answers to Exercise 1\n",
"==================================================\n",
"1. ASCII codes of 'Hello':\n",
"[72, 101, 108, 108, 111]\n",
"\n",
"2. Check chr(65):\n",
"chr(65) = 'A'\n",
"\n",
"Check the ASCII range of A-Z (65-90):\n",
"['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']\n"
]
}
],
"source": [
"# Answers to Exercise 1: verifying character codes\n",
"print(\"=\" * 50)\n",
"print(\"Answers to Exercise 1\")\n",
"print(\"=\" * 50)\n",
"\n",
"# 1. Use ord() to print the ASCII code of each character in \"Hello\"\n",
"print(\"1. ASCII codes of 'Hello':\")\n",
"print([ord(c) for c in \"Hello\"])\n",
"\n",
"# 2. Check that code 65 corresponds to the uppercase letter A\n",
"print()\n",
"print(\"2. Check chr(65):\")\n",
"print(f\"chr(65) = '{chr(65)}'\")\n",
"\n",
"# Check the whole range\n",
"print()\n",
"print(\"Check the ASCII range of A-Z (65-90):\")\n",
"print([chr(i) for i in range(65, 91)])"
]
},
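{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.3 What Are Computers Good At, and What Do They Struggle With?\n",
"\n",
"### Tasks computers are good at ✅\n",
"\n",
"| Task type | Example | Notes |\n",
"|----------|------|------|\n",
"| Numeric computation | 1 + 2 = 3 | Arithmetic, solving equations |\n",
"| Logical tests | if a > b then ... | Conditional branches, Boolean operations |\n",
"| Matrix operations | Image convolution, matrix multiplication | The core of deep learning |\n",
"| Exact matching | Checking whether two strings are identical | Database queries |\n",
"| Pattern matching | Finding data that fits a rule | Regular expressions |\n",
"| Storage and retrieval | Fast access to massive amounts of data | Search engines |\n",
"\n",
"### Tasks computers struggle with ❌\n",
"\n",
"| Task type | Example | Why it is hard |\n",
"|----------|------|-------------|\n",
"| Semantic understanding | Is \"The weather is lovely today\" good or bad? | Requires common sense and context |\n",
"| Sentiment judgment | Is \"真是绝了\" (\"That's really something\") praise or an insult? | Ambiguity, sarcasm |\n",
"| Fuzzy reasoning | \"probably\", \"maybe\" | Has no precise numeric meaning |\n",
"| Creative writing | Writing poems or novels | Requires imagination |\n",
"| Common-sense understanding | \"Water flows downhill\" | Lacks physical common sense |\n",
"| Polysemy | What does \"苹果\" (\"apple\") refer to? | Requires world knowledge |"
]
},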
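{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Why do computers struggle to understand text?\n",
"\n",
"**Reason 1: text is made of \"symbols\", not \"numeric values\"**\n",
"\n",
"```\n",
"The computer's brain = a calculator (built for processing numbers)\n",
"Text = a pile of symbols (as good as gibberish to the computer)\n",
"\n",
"Numbers: 1, 2, 3, 100.5, -7 → the computer can compute with them directly\n",
"Text: \"好\", \"bad\", \"hello\" → the computer has no idea what they mean\n",
"```\n",
"\n",
"**Reason 2: meaning is not stated explicitly**\n",
"\n",
"```python\n",
"# The text as a human reads it:\n",
"text = \"他今天心情不太好,因为下雨了\"  # \"He is in a bad mood today because it rained\"\n",
"\n",
"# A human understands:\n",
"# - \"心情不太好\" = unhappy\n",
"# - \"因为下雨了\" = the cause is the rain\n",
"# - Causal relation: rain → bad mood\n",
"\n",
"# All the computer can see:\n",
"print(text)\n",
"# The computer: ??? it does not grasp the causal link between the rain and the mood\n",
"```\n",
"\n",
"**Reason 3: the same symbol means different things in different contexts**\n",
"\n",
"```\n",
"Context 1: \"苹果真好吃\" (\"The apple is delicious\") → the fruit (an apple you eat)\n",
"\n",
"Context 2: \"苹果手机真贵\" (\"Apple phones are really expensive\") → the phone brand (Apple)\n",
"\n",
"Context 3: \"牛顿被苹果砸到了\" (\"Newton was hit by an apple\") → the fruit (the inspiration for universal gravitation)\n",
"\n",
"How would the computer know? It needs the ability to understand context!\n",
"```"
]
},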
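{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reason 3 can be illustrated with a short sketch, added here for illustration and not taken from the original lesson; the variable names are assumptions. An exact substring check finds \"苹果\" in each of the three contexts, but it gives the same kind of answer every time and says nothing about which sense is meant.\n",
"\n",
"```python\n",
"contexts = [\"苹果真好吃\", \"苹果手机真贵\", \"牛顿被苹果砸到了\"]\n",
"\n",
"for sentence in contexts:\n",
"    # Exact matching only tells us whether the symbol is present\n",
"    print(sentence, \"苹果\" in sentence)\n",
"```\n",
"\n",
"Telling the fruit from the brand requires information from the surrounding words, which plain string matching does not use."
]
},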
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Key takeaway: why do we need text vectorization?\n",
"\n",
"```\n",
"┌──────────────────────────────────────────────────────────────────┐\n",
"│                         The core tension                         │\n",
"│                                                                  │\n",
"│   Text (symbol sequences)  ←→  Computers (numeric computation)   │\n",
"│                                ↓                                 │\n",
"│                        A bridge is needed                        │\n",
"│                          That bridge is                          │\n",
"│                       [Text Vectorization]                       │\n",
"│                                                                  │\n",
"│    Text → numeric vectors → computation → AI model processing    │\n",
"└──────────────────────────────────────────────────────────────────┘\n",
"```"
]
},
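{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small preview of that bridge, here is a minimal sketch that is not part of the original lesson. It assumes a tiny hand-picked vocabulary (`vocab`) and a hypothetical helper `to_vector` that counts how often each vocabulary character appears in a text; real vectorization methods are more refined than this.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"vocab = [\"我\", \"你\", \"爱\", \"吃\", \"饭\"]  # assumed toy vocabulary\n",
"\n",
"def to_vector(text):\n",
"    # One count per vocabulary character (hypothetical helper)\n",
"    return np.array([text.count(ch) for ch in vocab])\n",
"\n",
"v1 = to_vector(\"我爱你\")\n",
"v2 = to_vector(\"我吃饭\")\n",
"print(v1, v2)\n",
"\n",
"# Once text is numeric, ordinary arithmetic applies, for example a dot product\n",
"print(int(np.dot(v1, v2)))\n",
"```\n",
"\n",
"The counting scheme here is deliberately crude; the point is only that a text becomes an array the computer can do arithmetic on."
]
},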
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"<a id=\"第三部分-向量基础入门\"></a>\n",
"\n",
"# Part 3: Vector Basics\n",
"\n",
"## 3.1 What Is a Vector?\n",
"\n",
"**Vector = a quantity with a direction**; it is the basic mathematical tool for describing \"magnitude + direction\".\n",
"\n",
"### Everyday examples of vectors\n",
"\n",
"| Example | Magnitude | Direction | Notes |\n",
"|------|------|------|------|\n",
"| Velocity | 60 km/h | North | Velocity is a vector |\n",
"| Force | 10 N | Pushing to the right | Force is a vector |\n",
"| Wind | 5 m/s | From the southeast | Wind velocity is a vector |\n",
"| Displacement | About 1,000 km | Beijing → Shanghai | Displacement is a vector |\n",
"\n",
"### Vectors in mathematical notation\n",
"\n",
"**One-dimensional vector (a point on the number line)**:\n",
"\n",
"```\n",
" ←───────────|───────────→\n",
"-3  -2  -1   0   1   2   3\n",
"\n",
" Point A at position 2   →  vector A = [2]   (just 1 number)\n",
" Point B at position -3  →  vector B = [-3]  (a negative number means the opposite direction)\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"==================================================\n",
"NumPy vector creation demo\n",
"==================================================\n",
"1-D vector v1 = [3]\n",
"Number of elements in v1: 1\n",
"\n",
"2-D vector v2 = [2 3]\n",
"Number of elements in v2: 2\n",
"\n",
"3-D vector v3 = [1 2 3]\n",
"Number of elements in v3: 3\n",
"\n",
"10-D vector v10 = [ 0.1  0.5 -0.3  0.8  0.2 -0.1  0.7  0.3 -0.2  0.6]\n",
"Number of elements in v10: 10\n"
]
}
],
"source": [
"# Creating vectors in Python with NumPy\n",
"import numpy as np\n",
"\n",
"print(\"=\" * 50)\n",
"print(\"NumPy vector creation demo\")\n",
"print(\"=\" * 50)\n",
"\n",
"# 1-D vector (just 1 number)\n",
"v1 = np.array([3])\n",
"print(f\"1-D vector v1 = {v1}\")\n",
"print(f\"Number of elements in v1: {len(v1)}\")\n",
"\n",
"# 2-D vector (2 numbers, a point in the plane)\n",
"v2 = np.array([2, 3])\n",
"print(f\"\\n2-D vector v2 = {v2}\")\n",
"print(f\"Number of elements in v2: {len(v2)}\")\n",
"\n",
"# 3-D vector (3 numbers, a point in three-dimensional space)\n",
"v3 = np.array([1, 2, 3])\n",
"print(f\"\\n3-D vector v3 = {v3}\")\n",
"print(f\"Number of elements in v3: {len(v3)}\")\n",
"\n",
"# High-dimensional vector (common in machine learning, from tens to thousands of dimensions)\n",
"v10 = np.array([0.1, 0.5, -0.3, 0.8, 0.2, -0.1, 0.7, 0.3, -0.2, 0.6])\n",
"print(f\"\\n10-D vector v10 = {v10}\")\n",
"print(f\"Number of elements in v10: {len(v10)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Geometric intuition for 2-D vectors\n",
"\n",
"```\n",
"    y (vertical axis)\n",
"    ↑\n",
"    |\n",
"  3 |       * A(2,3)\n",
"    |\n",
"  2 |\n",
"    |\n",
"  1 |               * B(4,1)\n",
"    |\n",
"  0 +───────────────────────→ x (horizontal axis)\n",
"    0   1   2   3   4   5\n",
"\n",
"  Vector A = [2, 3]  (x-coordinate 2, y-coordinate 3)\n",
"  Vector B = [4, 1]\n",
"\n",
"  The arrow from the origin (0,0) to the point (2,3) is the graphical representation of vector A\n",
"```"
]
},
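{
"cell_type": "markdown",
"metadata": {},
"source": [
"The \"magnitude + direction\" reading of a 2-D vector can be computed directly. The sketch below is an illustrative addition, not part of the original lesson: it uses `np.linalg.norm` for the length of the arrow and `np.arctan2` for its angle from the x-axis, applied to the vectors A and B from the diagram.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"A = np.array([2, 3])\n",
"B = np.array([4, 1])\n",
"\n",
"for name, v in [(\"A\", A), (\"B\", B)]:\n",
"    length = np.linalg.norm(v)  # distance from the origin to the point\n",
"    angle = np.degrees(np.arctan2(v[1], v[0]))  # angle measured from the x-axis\n",
"    print(f\"{name}: length = {length:.3f}, angle = {angle:.1f} degrees\")\n",
"```\n",
"\n",
"The two coordinates and the (length, angle) pair describe the same arrow in two different ways."
]
},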
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"findfont: Generic family 'sans-serif' not found because none of the following families were found: SimHei, Noto Sans CJK SC, WenQuanYi Micro Hei\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAArwAAAIrCAYAAAAN2Uq4AAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjgsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvwVt1zgAAAAlwSFlzAAAPYQAAD2EBqD+naQAAaLhJREFUeJzt3Xd8FHX+x/H3JpACIdQk9Ca9hA4GkIAgHGCJBZHfCbHrCR6IFRvFErtwyoGoEAsIB0pAQRBB4BAsgJHiiYAIqCShJhBMgMz8/lizsKRvNtnJ5PV8PPYhM/udmc/kG/CdyWdnHKZpmgIAAABsys/XBQAAAAAlicALAAAAWyPwAgAAwNYIvAAAALA1Ai8AAABsjcALAAAAWyPwAgAAwNYIvAAAALA1Ai8AAABsjcALAOWUw+FQ3759fV1GrtauXSuHw6FJkya5re/bt68cDodvirpIfHy8HA6H4uPjfV0KgAIQeAEU2u+//66pU6dq4MCBatiwoQICAlS7dm1df/31+uabb3LdJjugZL8qVqyomjVrqmPHjrr99tu1YsUKGYZRqOPv2rVLDodDrVq1KnDs448/LofDoeeee65I51gUv/76qxwOh2655ZYSO0ZBHnvsMTkcDsXFxeU7zjAMNWzYUP7+/jp48GApVVe2WWF+AXhHBV8XAKDseP311/XCCy/okksu0cCBAxUWFqbdu3crISFBCQkJmjdvnoYPH57rtg888IBCQkJkGIZOnDih//3vf5o7d65mz56tnj176sMPP1TDhg3zPX7Lli3Vu3dvbdiwQV999ZV69eqV6zjDMPTee+/J39/f9mHltttuU1xcnObMmaMJEybkOW7VqlU6ePCg/va3v6lBgwaSpP/973+qVKlSaZXqFe+9955Onz7t6zIkSddee60uvfRS1alTx9elACgAgRdAoXXv3l1r165VdHS02/r//ve/6t+/v/7xj38oJiZGgYGBObZ98MEHVbt2bbd1R44c0T//+U99+OGHGjRokDZv3qzKlSvnW8Ptt9+uDRs2aPbs2XkG3pUrV+q3337T0KFDVbdu3SKeZdnSrFkzRUdHa926dfrvf/+ryy67LNdxs2fPluT8+mUrzJVyqynoh6LSVLVqVVWtWtXXZQAoBFoaABTaddddlyPsStJll12mfv366fjx49q+fXuh91erVi198MEHuvzyy/XTTz9p+vTpBW4zbNgwValSRf/5z3+Unp6e65jcwl1KSoruv/9+NWvWTIGBgapVq5auv/567dixI9d9pKSk6IEHHlDLli0VHBysGjVqqEePHnr55ZclOfs3mzRpIkl699133do21q5d69pPenq6Jk6cqFatWikoKEg1atTQ0KFD9dVXX+U45qRJk1zbx8fHq3PnzqpUqVKBfbbZ55l93hc7duyYlixZolq1aunqq692rc+thzc1NVVPPfWU2rRpo5CQEIWGhqpZs2aKjY3V/v37XeNuueUWORwO/frrr/meR7YzZ87o9ddf16BBg9SgQQMFBgYqPDxc1113nb7//vt8z+9CufXwXvi1z+11YY/t4sWLNWLECDVr1kyVKlVS1apVddlll+mjjz5y22dh5je/Ht6vvvpKQ4cOVY0aNRQUFKRWrVpp4sSJuV6dzp6H5ORkxcbGqlatWgoODtall17q9jUE4Dmu8ALwiooVK0qSKlQo2j8rfn5+evzxx7VmzRotWLBADz/8cL7jK1eurJtuuklvvfWW/vOf/+jWW291e//o0aNaunSpwsPDdeWVV0qS9u7dq759++q3337TwIEDFRMTo5SUFH300UdauXKlVq9erR49erj2sWvXLvXr10+HDh1S7969FRMTo/T0dO3cuVPPPfecHnzwQXXs2FFjx47VtGnT1KFDB8XExLi2b9y4sSQpIyNDl19+ub799lt17txZ48aNU3JyshYsWKCVK1fqww8/1LBhw3Kc40svvaQvv/xS11xzjQYOHC
h/f/98vyY33HCD7rvvPi1cuFCvv/66QkJC3N6fN2+eMjMzde+99yogICDP/ZimqUGDBumbb75Rr1699Le//U1+fn7av3+/li5dqpEjR6pRo0b51pKXY8eOady4cbrssss0ZMgQVa9eXb/88ouWLl2qzz77TOvXr1e3bt082vfEiRNzXT9jxgylpKS4tW1MmDBBAQEB6t27t+rUqaPDhw9r6dKluuGGG/Svf/1L9913nyQVan7zsnDhQo0YMUKBgYEaPny4wsPD9fnnn2vKlClauXKl1q5dq6CgILdtTpw4od69e6tq1aoaOXKkUlJStGDBAg0aNEhbtmxRu3btPPraAPiLCQDFtH//fjMwMNCsU6eOee7cObf3oqOjTUnmoUOH8tw+IyPDrFChgunn52eePXu2wON9/fXXpiSzd+/eOd6bNm2aKcl88MEHXet69uxp+vv7mytWrHAbu2vXLrNKlSpm+/bt3dZ37drVlGTOmjUrx/4PHjzo+vO+fftMSWZsbGyudU6ePNmUZP797383DcNwrd+6dasZEBBgVqtWzUxLS3OtnzhxoinJrFy5srlt27b8vwgXueeee0xJ5ttvv53jvU6dOpmSzB07dritl2RGR0e7lrdt22ZKMmNiYnLsIyMjwzx58qRrOTY21pRk7tu3L8fY7PP48ssv3bb/7bffcozdsWOHGRISYg4YMMBt/ZdffmlKMidOnOi2Pvv7qSDPP/+8Kcm85pprzKysLNf6vXv35hh78uRJs3379mbVqlXN9PR01/qC5nfOnDmmJHPOnDmudampqWbVqlXNwMBA84cffnCtz8rKMocPH25KMqdMmeK2H0mmJPPee+91q/Xtt982JZl33313gecLIH+0NAAolrNnz2rkyJHKzMzUCy+8UODVyNwEBgaqZs2aMgxDx44dK3B8jx491K5dO23YsEG7d+92e2/OnDmSnB/mkqTvv/9eGzduVGxsrAYNGuQ2tkWLFrrzzju1fft2V2vDt99+q82bN6tPnz668847cxy7fv36hT6vd999VxUrVtTzzz/v9mv4Tp06KTY2VidOnFBCQkKO7e666y61b9++0MeR8m5r+OGHH/T999+re/fuatu2baH2FRwcnGNdYGBgjivHRREYGKh69erlWN+2bVv169dP69ev19mzZz3e/4U+/vhjTZgwQZ07d9bcuXPl53f+f3VNmzbNMT4kJES33HKLUlNT9d133xXr2EuWLFFqaqpuu+02RUZGutb7+fnpxRdfVIUKFXJtgahcubJeeOEFt1pjY2NVoUKFYtcEgJYGAMVgGIZuueUWrV+/XnfeeadGjhxZase+/fbbdf/992v27NmuW3Jt3bpViYmJioqKUuvWrSVJX3/9tSQpOTk5xz1dJemnn35y/bddu3b69ttvJUkDBw4sVn1paWn65Zdf1Lp161xDcr9+/fTWW28pMTExx9ete/fuRT5e165d1aFDB23cuFG7du1Sy5YtJUnvvPOOJPd+5ry0bt1akZGR+vDDD/Xbb78pJiZGffv2VceOHd2CmKcSExP14osvasOGDUpKSsoRcI8cOVLsOx5s3rxZI0eOVN26dfXJJ5/k+BBkSkqKnn/+eX322Wfav3+//vzzT7f3//jjj2IdP7sfObe+64YNG6pp06b6+eefdfLkSVWpUsX1XosWLXL8QFGhQgVFREToxIkTxaoJAIEXgIcMw9Btt92mefPm6eabb9bMmTM93ldmZqaOHj0qf39/1ahRo1Db3HzzzXrkkUf03nvv6ZlnnpG/v3+uH1bLvmK8bNkyLVu2LM/9ZX8ALjU1VZJyvRpZFGlpaZKkiIiIXN/PDnbZ4y6U1zYFuf322/XPf/5Ts2fP1gsvvKAzZ85o3rx5qlSpkm666aYCt69QoYLWrFmjSZMm6aOPPtIDDzwgSQoLC9OYMWP0+OOPe3QFX5I2btyoyy+/XJLzh4nmzZsrJCREDodDCQkJ+uGHH5SZmenRvrMdPH
hQV111lRwOhz755JMcd+g4duyYunXrpgMHDqhXr14aMGCAqlWrJn9/fyUmJmrJkiXFrqEw8/7zzz8rLS3NLfCGhobmOr5ChQrKysoqVk0AuEsDAA8YhqFbb71V7777rkaMGKH4+PhiXQH86quvdO7cOXXs2LHQH3qrVauWrrnmGv3xxx/67LPPlJmZqXnz5ikkJMTtXsDZQeL111+XaZp5vmJjYyVJ1apVk+R8yEZxZB83OTk51/eTkpLcxl3I0yeJ/f3vf1dgYKDee+89nTt3TkuWLNHRo0c1bNiwPAPVxWrWrKnXX39dv//+u3788Ue98cYbqlGjhiZOnKgXX3zRNS57vs+dO5djH9k/NFzo2WefVWZmpr744gstXbpUr7zyiiZPnqxJkybluF2dJ06ePKkrr7xSKSkpmjdvnjp16pRjzDvvvKMDBw7o6aef1oYNG/T666/r6aef1qRJk3TppZcWuwapePMOoOQQeAEUSXbYfe+99zR8+HC9//77Hl/1y97fs88+K0kaMWJEkba9sG81ISFBx48f14033uj2q+Hsuy9s2rSpUPvMbif4/PPPCxybfd65XYELDQ1V06ZNtWfPnlzDc/btpjp27FiougqjRo0auvbaa5WUlKTly5fnesW7sBwOh1q3bq3Ro0dr1apVkqSlS5e63q9evbqk3H8wyO02Y3v37lWNGjXUu3dvt/WnT5/W1q1bi1zfhbKysnTTTTdp27Zteumll9xuvXZxDZJ0zTXX5Hjvv//9b451+c1vXrKDdm63Ezt48KD27t2rpk2bul3dBVDyCLwACi27jeG9997TsGHD9MEHHxQr7B45ckQ333yz1qxZozZt2ugf//hHkba/4oor1KBBA3366ad69dVXJeUMd927d1ePHj304YcfasGCBbme07p161zL3bp1U7du3bR+/Xq99dZbOcZfGPCqV68uh8OR56N6Y2NjdfbsWU2YMEGmabrWb9u2TfHx8apatarb7a68Ifv84+Li9Pnnn6tFixZ5PoziYr/++muu99XNvlp54a20sm8hdvEHsBYtWuT29czWqFEjHT9+XDt37nSty8rK0oMPPqjDhw8Xqr68jBs3TsuXL9ddd92l8ePH5zku+5ZqGzZscFs/b948LV++PMf4guY3N9dcc42qVq2qOXPmuJ2raZp65JFHdO7cOds//Q+wInp4ARTalClT9O677yokJEQtWrTQM888k2NMTExMrlctX375ZdejhdPS0vTjjz/qv//9rzIyMtSrVy99+OGHRX7MrZ+fn2699VZNmTJF3377rVq1aqWePXvmGPfhhx+qX79+uummmzR16lR17txZwcHBOnDggDZt2qTDhw8rIyPDNX7u3Lnq27ev7rrrLr3//vuKiopSRkaGdu7cqe+//15Hjx6V5Px0f3Y4HjlypJo3by4/Pz/X/WoffvhhLVu2TO+//77+97//qX///q77q547d05vvfWW16/09e/fX40bN3Z9WC/7bhWFkZiYqOuuu07du3dXmzZtVLt2bf3+++9KSEiQn5+f7r//ftfYa665Rpdcconi4+N18OBBderUSf/73/+0Zs0aDRkyJEeAvO+++/T555+rd+/euvHGGxUUFKS1a9fq999/V9++fT1+wMK3336rN954Q8HBwQoLC8v1g4nZ35MjR47UCy+8oPvuu09ffvmlGjVqpB9++EGrV6/Wddddp48//thtu4LmNzehoaF66623NGLECPXo0UPDhw9XWFiYvvjiC23ZskXdu3fXQw895NG5AigGn90QDUCZk33v1fxeF96T1DTP3zc1+1WhQgWzevXqZocOHczbbrvNXLFihdu9R4tq3759psPhMCWZL774Yp7jjh07Zj7xxBNmu3btzODgYDMkJMRs3ry5+X//93/mxx9/nGN8UlKSOXbsWLNp06ZmQECAWaNGDbNHjx7mq6++6jZu165d5pAhQ8xq1aq56rjw/rOnTp0yn3zySbNFixaue+
8OHjzY/O9//5vjmLndv9YT2ff/9ff3N//44488x+mi+/AePHjQfPTRR81LL73UDA8PNwMCAsyGDRua1113nblp06Yc2+/bt8+MiYkxq1SpYlauXNns37+/+d133+V5HosWLTI7d+5sVqpUyaxVq5Z54403mnv37s31nr6FvQ9v9rjCfk8mJiaaAwcONKtXr25WqVLFjI6ONr/44otc76lrmvnPb17bmKZprl+/3hw8eLBZrVo1MyAgwGzRooX55JNPmqdOnSpwHi7UqFEjs1GjRrm+B6DwHKZ5we/ZAAAAAJuhhxcAAAC2RuAFAACArRF4AQAAYGtlKvBmP49+3Lhx+Y5buHChWrVqpaCgILVv3z7X280AAACgfCgzgfe7777Tm2++qcjIyHzHbdy4USNGjNDtt9+u77//XjExMYqJidGOHTtKqVIAAABYSZm4S8OpU6fUuXNn/fvf/9Yzzzyjjh07aurUqbmOHT58uNLT0/Xpp5+61l166aXq2LGjZs6cWUoVAwAAwCrKxIMnRo8eraFDh2rAgAG53uj+Qps2bcrxpJ1BgwYpISEhz20yMzOVmZnpWjYMQ8eOHVPNmjU9fqY9AAAASo5pmjp58qTq1q0rP7/8mxYsH3jnz5+vrVu36rvvvivU+KSkJEVERLiti4iIUFJSUp7bxMXFafLkycWqEwAAAKXv4MGDql+/fr5jLB14Dx48qLFjx2rVqlVuz3D3tgkTJrhdFU5NTVXDhg21f/9+hYaGlthxS4thGLrhhhu0aNGiAn8CQukyDENHjhxRrVq1mBuLYW6sjfmxLubGuuw2N2lpaWrUqFGhHtFu6cC7ZcsWpaSkqHPnzq51WVlZWr9+vd544w1lZmbK39/fbZvatWsrOTnZbV1ycrJq166d53ECAwMVGBiYY321atVsE3grVqyoatWq2eIb3E4Mw9CZM2eYGwtibqyN+bEu5sa67DY32edQmPZTS59t//79tX37diUmJrpeXbt21d///nclJibmCLuSFBUVpdWrV7utW7VqlaKiokqrbAAAAFiIpa/wVqlSRe3atXNbV7lyZdWsWdO1ftSoUapXr57i4uIkSWPHjlV0dLReeeUVDR06VPPnz9fmzZs1a9asUq8fAAAAvmfpK7yFceDAAR06dMi13LNnT82bN0+zZs1Shw4dtGjRIiUkJOQIzgAAACgfLH2FNzdr167Nd1mShg0bpmHDhpVOQQAAALC0Mn+FFwAAAMgPgRcAAAC2RuAFAACArRF4AQAAYGsEXgAAANgagRcAAAC2RuAFAACArRF4AQAAYGsEXgAAANgagRcAAAC2RuAFAACArRF4AQAAYGsEXgAAANgagRcAAAC2RuAFAACArRF4AQAAYGsEXgAAANgagRcAAAC2RuAFAACArRF4AQAAYGsEXgAAANgagRcAAAC2RuAFAACArRF4AQAAYGsEXgAAANgagRcAAAC2RuAFAACArRF4AQAAYGsEXgAAANgagRcAAAC2RuAFAACArRF4AQAAYGsEXgAAANgagRcAAAC2RuAFAACArRF4AQAAYGsEXgAAANgagRcAAAC2RuAFAACArRF4AQAAYGuWDrwzZsxQZGSkQkNDFRoaqqioKH322Wd5jo+Pj5fD4XB7BQUFlWLFAAAAsJoKvi4gP/Xr19fzzz+v5s2byzRNvfvuu7rmmmv0/fffq23btrluExoaql27drmWHQ5HaZULAAAAC7J04L3qqqvclp999lnNmDFDX3/9dZ6B1+FwqHbt2qVRHgAAAMoASwfeC2VlZWnhwoVKT09XVFRUnuNOnTqlRo0ayTAMde7cWc8991ye4ThbZmamMjMzXctpaWmSJMMwZBiGd07AhwzDkGmatjgXu2FurIu5sTbmx7qYG+uy29wU5TwsH3i3b9+uqKgoZWRkKCQkRIsXL1abNm1yHduyZUvNnj1bkZGRSk1N1csvv6yePXtq586dql+/fp7HiIuL0+TJk3OsP3z4sD
IyMrx2Lr5iGIbOnTunlJQU+flZum273DEMQ6mpqTJNk7mxGObG2pgf62JurMtuc3Py5MlCj3WYpmmWYC3FdubMGR04cECpqalatGiR3n77ba1bty7P0Huhs2fPqnXr1hoxYoSefvrpPMfldoW3QYMGOn78uEJDQ71yHr5kGIaGDBmi5cuX2+Ib3E4Mw9Dhw4cVFhbG3FgMc2NtzI91MTfWZbe5SUtLU/Xq1ZWamlpgXrP8Fd6AgAA1a9ZMktSlSxd99913mjZtmt58880Ct61YsaI6deqkPXv25DsuMDBQgYGBOdb7+fnZ4htCcvY22+l87IS5sS7mxtqYH+tibqzLTnNTlHMoc2drGIbb1dj8ZGVlafv27apTp04JVwUAAACrsvQV3gkTJmjw4MFq2LChTp48qXnz5mnt2rVauXKlJGnUqFGqV6+e4uLiJElTpkzRpZdeqmbNmunEiRN66aWXtH//ft1xxx2+PA0AAAD4kKUDb0pKikaNGqVDhw6patWqioyM1MqVK3XFFVdIkg4cOOB2Ofv48eO68847lZSUpOrVq6tLly7auHFjofp9AQAAYE+WDrzvvPNOvu+vXbvWbfm1117Ta6+9VoIVAQAAoKwpcz28AAAAQFEQeAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AsKvGjSWHw/kaM8bX1ZyXkHC+LodD2rzZ1xUBsDkCLwCUhH//2xnmevTwbR2XXSa9/74UG3t+3cGD0uTJUvfuUvXqUq1aUt++0hdfFO9Yzz0nXXqpFBYmBQVJzZtL48ZJhw+7j+va1VnTXXcV73gAUEgEXgAoCXPnOq+wfvuttGeP7+po2lS6+WapW7fz65YskV54QWrWTHrmGenJJ6WTJ6UrrpDmzPH8WFu2SB07So8/Lk2fLl1zjXN/PXtK6ennx9Wv76wpKsrzYwFAEVTwdQEAYDv79kkbN0offyzdfbcz/E6c6OuqzuvXTzpwwHllN9s99zjD6lNPSbfe6tl+P/oo57qoKOmGG6RPPpFuusmz/QJAMXGFFwC8be5cZ6vA0KHOsDd3rq8rcte2rXvYlaTAQGnIEOm335xXe72lcWPnf0+c8N4+AaCIuMILAN42d6503XVSQIA0YoQ0Y4b03XfubQV5OXVKysgoeFzFilLVqsWv9UJJSVKlSs6Xp0xTOnpUOndO2r1bevRRyd/f2SMMAD5C4AUAb9qyRfrpJ+n1153LvXs7e1bnzi1c4B0zRnr33YLHRUdLa9cWq1Q3e/Y4WzCGDXMGVE8lJ0t16pxfrl9fmjdPatWq+DUCgIcIvADgTXPnShERzj5ZyXmnhuHDpQ8+kF55peAw+fDDzg90FaR69eLXmu30aWfQDQ6Wnn++ePuqUUNatcp5lfr7750h+tQp79QJAB4i8AKAt2RlSfPnO8Puvn3n1/fo4Qy7q1dLAwfmv482bZyv0pKV5fww2Y8/Sp99JtWtW7z9BQRIAwY4/3zllVL//lKvXlJ4uHMZAHyAwAsA3rJmjXTokDP0zp+f8/25cwsOvKmp0p9/FnysgADn1dTiuvNO6dNPnbVdfnnx93exnj2dLQ5z5xJ4AfgMgRcAvGXuXOeVzOnTc7738cfS4sXSzJnO1oG8jB1bej28Dz3kvE/u1KnOD9eVlIwMZ5AHAB8h8AKAN/z55/kPfd1wQ87369aVPvxQWrrU2dObl9Lq4X3pJenll6XHHnOG7OJKT3f2K198h4ePPpKOH3c+XQ0AfITACwDesHSp8/61V1+d+/vZj9ydOzf/wFsaPbyLFzuDdfPmUuvWzg/UXeiKK5wfvJOkX3+VmjRxPpo4Pj7vfe7e7ezdHT
7ceUcGPz9p82bnvhs39k6oBgAPEXgBwBvmzpWCgpxhMTd+fs4HUcyd67xPbc2apVvfhX74wfnf3bulkSNzvv/ll+cDb/YdFi681Vhu6teXrr/e2cf87rvS2bNSo0bO26w9/rhvzxdAuUfgBQBvWLq04DFz5jhfpSkzUzpyxNk3XLmyc92kSc5XYaxf79xu3Lj8x9WqJb35ZuH2eeaMlJbG7coAlBoeLQwAdjZ/vrOV4pFHPNv+yy+lf/7z/BVfb1i+3FnTffd5b58AkA+u8AKAXc2de/4WZw0aeLaPhQu9V0+2Xr2cD6fI1rKl948BABcg8AKAXfXq5esKchcWdv7hFABQCmhpAAAAgK1ZOvDOmDFDkZGRCg0NVWhoqKKiovTZZ5/lu83ChQvVqlUrBQUFqX379lq+fHkpVQsAAAArsnTgrV+/vp5//nlt2bJFmzdv1uWXX65rrrlGO3fuzHX8xo0bNWLECN1+++36/vvvFRMTo5iYGO3YsaOUKwcAAIBVWDrwXnXVVRoyZIiaN2+uFi1a6Nlnn1VISIi+/vrrXMdPmzZNf/vb3/TQQw+pdevWevrpp9W5c2e98cYbpVw5AAAArKLMfGgtKytLCxcuVHp6uqKionIds2nTJo0fP95t3aBBg5SQkJDvvjMzM5WZmelaTktLkyQZhiHDMIpXuAUYhiHTNG1xLnbD3FgXc2NtzI91MTfWZbe5Kcp5WD7wbt++XVFRUcrIyFBISIgWL16sNnk8djMpKUkRF90rMiIiQklJSfkeIy4uTpMnT86x/vDhw8rIyPC8eIswDEPnzp1TSkqK/PwsfVG/3DEMQ6mpqTJNk7mxGObG2pgf62JurMtuc3Py5MlCj7V84G3ZsqUSExOVmpqqRYsWKTY2VuvWrcsz9HpiwoQJbleG09LS1KBBA4WFhSk0NNRrx/EVwzBUoUIFhYeH2+Ib3E4Mw5DD4VBYWBhzYzHMjbUxP9bF3FiX3eYmKCio0GMtH3gDAgLUrFkzSVKXLl303Xffadq0aXozl0dY1q5dW8nJyW7rkpOTVbt27XyPERgYqMDAwBzr/fz8bPENIUkOh8NW52MnzI11MTfWxvxYF3NjXXaam6KcQ5k7W8Mw3PptLxQVFaXVq1e7rVu1alWePb8A4GKa0jvvSHFxvq4EAOBllr7CO2HCBA0ePFgNGzbUyZMnNW/ePK1du1YrV66UJI0aNUr16tVT3F//gxo7dqyio6P1yiuvaOjQoZo/f742b96sWbNm+fI0AFjdtm1SbKyUmOhc/vNPadIkX1YEAPAiSwfelJQUjRo1SocOHVLVqlUVGRmplStX6oorrpAkHThwwO1yds+ePTVv3jw98cQTeuyxx9S8eXMlJCSoXbt2vjoFAFaWkiI9+aT09tvShZ/2PXzYdzUBALzO0oH3nXfeyff9tWvX5lg3bNgwDRs2rIQqAmALWVnSa69JTz8t/XUbQjeDB5d+TQCAEmPpwAsAJeKVV6RHHsn7/SZNSq8WAECJK3MfWgOAYsv+4KvDIVWunPP9Ro1Ktx4AQIki8AIofx57TFqyRLruOik93bku+36ONWpINrj/NgDgPAIvgPLH318KDJQ++si5XLmydPas88+NG/usLABAySDwAih/UlOlO+44v/zyy1J4uPPPvXv7piYAQInhQ2sAyp8HHpB++8355wEDpLvvlvr3l9avl7jLCwDYDoEXQPmycqXziWqSFBLivAevwyE1b+58AQBsh5YGAOVHbq0M3JEBAGyPwAug/Li4leGuu3xbDwCgVBB4AZQPebUyAABsj8ALwP5oZQCAco3AC8D+aGUAgHKNwAvA3mhlAIByj8ALwL5oZQAAiMALwM5oZQAAiMALwK5oZQAA/IXAC8B+aGUAAFyAwAvAfmhlAABcgMALwF5oZQAAXITAC8A+aGUAAOSCwAvAPmhlAADkgsALwB5oZQAA5IHAC6Dso5UBAJAPAi+Aso9WBgBAPgi8AMo2WhkAAA
Ug8AIou2hlAAAUAoEXQNlFKwMAoBAIvADKJloZAACFROAFUPbQygAAKAICL4Cyh1YGAEAREHgBlC20MgAAiojAC6DsoJUBAOABAi+AsoNWBgCABwi8AMoGWhkAAB4i8AKwPloZAADFQOAFYH20MgAAioHAC8DaaGUAABQTgReAddHKAADwAgIvAOuilQEA4AUEXgDWRCsDAMBLCLwArIdWBgCAF1k68MbFxalbt26qUqWKwsPDFRMTo127duW7TXx8vBwOh9srKCiolCoG4BW0MgAAvMjSgXfdunUaPXq0vv76a61atUpnz57VwIEDlZ6enu92oaGhOnTokOu1f//+UqoYQLHRygAA8LIKvi4gPytWrHBbjo+PV3h4uLZs2aI+ffrkuZ3D4VDt2rVLujwA3kYrAwCgBFg68F4sNTVVklSjRo18x506dUqNGjWSYRjq3LmznnvuObVt2zbP8ZmZmcrMzHQtp6WlSZIMw5BhGF6o3LcMw5BpmrY4F7thbtw5xo+X469WBrN/f5l33CH56GvD3Fgb82NdzI112W1uinIeZSbwGoahcePGqVevXmrXrl2e41q2bKnZs2crMjJSqampevnll9WzZ0/t3LlT9evXz3WbuLg4TZ48Ocf6w4cPKyMjw2vn4CuGYejcuXNKSUmRn5+lu1jKHcMwlJqaKtM0y/3cBHz5pWrMni1JMipX1pG4OBmHD/usHubG2pgf62JurMtuc3Py5MlCj3WYpmmWYC1e849//EOfffaZNmzYkGdwzc3Zs2fVunVrjRgxQk8//XSuY3K7wtugQQMdP35coaGhxa7d1wzD0JAhQ7R8+XJbfIPbiWEYOnz4sMLCwsr33KSmyhEZ6bq6a/z739Ldd/u0JObG2pgf62JurMtuc5OWlqbq1asrNTW1wLxWJq7wjhkzRp9++qnWr19fpLArSRUrVlSnTp20Z8+ePMcEBgYqMDAwx3o/Pz9bfENIzr5mO52PnTA3kh56yO2uDH733GOJD6oxN9bG/FgXc2NddpqbopyDpc/WNE2NGTNGixcv1po1a9SkSZMi7yMrK0vbt29XnTp1SqBCAMXGXRkAACXM0ld4R48erXnz5mnJkiWqUqWKkpKSJElVq1ZVcHCwJGnUqFGqV6+e4uLiJElTpkzRpZdeqmbNmunEiRN66aWXtH//ft1x4Se/AVgDd2UAAJQCSwfeGTNmSJL69u3rtn7OnDm65ZZbJEkHDhxwu6R9/Phx3XnnnUpKSlL16tXVpUsXbdy4UW3atCmtsgEUFg+YAACUAksH3sJ8nm7t2rVuy6+99ppee+21EqoIgNfQygAAKCWW7uEFYFO0MgAAShGBF0Dpo5UBAFCKCLwAShetDACAUkbgBVB6aGUAAPgAgRdA6aGVAQDgAwReAKWDVgYAgI8QeAGUPFoZAAA+ROAFUPJoZQAA+BCBF0DJopUBAOBjBF4AJYdWBgCABRB4AZQcWhkAABZA4AVQMmhlAABYBIEXgPfRygAAsBACLwDvo5UBAGAhBF4A3kUrAwDAYgi8ALyHVgYAgAUReAF4D60MAAALIvAC8A5aGQAAFkXgBVB8tDIAACyMwAug+GhlAABYGIEXQPHQygAAsDgCLwDP0coAACgDCLwAPPfgg7QyAAAsj8ALwDMrVzrbFyRaGQAAllahqBucPn1aq1at0ldffaUff/xRR44ckcPhUK1atdS6dWv16tVLAwYMUOXKlUuiXgBWQCsDAKAMKfQV3u3bt+uWW25R7dq1de2112r69Onas2ePHA6HTNPUzz//rDfeeEPXXnutateurVtuuUXbt28vydoB+AqtDACAMqRQV3iHDx+ujz76SF27dtWkSZN0xRVXqE2bNvL393cbl5WVpR9//FGff/65Fi1apE6dOmnYsGH68MMPS6R4AD5AKwMAoIwpVOD18/PT5s2b1bFjx3zH+fv7q3379mrfvr0eeOABJSYm6oUXXvBGnQCsgFYGAEAZVKjA6+
kV2o4dO3J1F7ATWhkAAGUQd2kAUDi0MgAAyiiPA29aWpqef/55DRo0SJ06ddK3334rSTp27JheffVV7dmzx2tFAvAxWhkAAGVYkW9LJkm//faboqOjdfDgQTVv3lw//fSTTp06JUmqUaOG3nzzTe3fv1/Tpk3zarEAfIRWBgBAGeZR4H3ooYd08uRJJSYmKjw8XOHh4W7vx8TE6NNPP/VKgQB8jFYGAEAZ51FLw+eff65//vOfatOmjRy5/I+vadOmOnjwYLGLA+BjtDIAAGzAo8D7559/KiwsLM/3T5486XFBACyEVgYAgA14FHjbtGmj9evX5/l+QkKCOnXq5HFRACyAVgYAgE14FHjHjRun+fPn64UXXlBqaqokyTAM7dmzRyNHjtSmTZt0//33e7VQAKWIVgYAgI149KG1m2++Wfv379cTTzyhxx9/XJL0t7/9TaZpys/PT88995xiYmK8WSeA0kQrAwDARjwKvJL0+OOPa+TIkfroo4+0Z88eGYahSy65RNddd52aNm3qzRoBlCZaGQAANuNR4D1w4IDCwsLUsGHDXFsX/vzzTx0+fFgNGzYsdoEAShGtDAAAG/Koh7dJkyZavHhxnu8vXbpUTZo08biobHFxcerWrZuqVKmi8PBwxcTEaNeuXQVut3DhQrVq1UpBQUFq3769li9fXuxagHKBVgYAgA15FHhN08z3/bNnz8rPz+OnFrusW7dOo0eP1tdff61Vq1bp7NmzGjhwoNLT0/PcZuPGjRoxYoRuv/12ff/994qJiVFMTIx27NhR7HoAW6OVAQBgU4VuaUhLS9OJEydcy0ePHtWBAwdyjDtx4oTmz5+vOnXqFLu4FStWuC3Hx8crPDxcW7ZsUZ8+fXLdZtq0afrb3/6mhx56SJL09NNPa9WqVXrjjTc0c+bMYtcE2BKtDAAAGyt04H3ttdc0ZcoUSZLD4dC4ceM0bty4XMeapqlnnnnGKwVeKPsWaDVq1MhzzKZNmzR+/Hi3dYMGDVJCQkKe22RmZiozM9O1nJaWJsl5qzXDMIpRsTUYhiHTNG1xLnZjlblxPPCAHH+1Mpj9+8u84w6pnH+/WGVukDvmx7qYG+uy29wU5TwKHXgHDhyokJAQmaaphx9+WCNGjFDnzp3dxjgcDlWuXFldunRR165dC19xIRiGoXHjxqlXr15q165dnuOSkpIUERHhti4iIkJJSUl5bhMXF6fJkyfnWH/48GFlZGR4XrRFGIahc+fOKSUlxSutJvAewzCUmprquqWfLwR8+aVqvPOOs57KlXUkLk7G4cM+qcVKrDA3yBvzY13MjXXZbW6K8mTfQgfeqKgoRUVFSZLS09N1/fXX5xs8vW306NHasWOHNmzY4PV9T5gwwe2qcFpamho0aKCwsDCFhoZ6/XilzTAMVahQQeHh4bb4BrcTwzDkcDgUFhbmm7lJTZXj4YfPL7/0kmp16VL6dViQz+cG+WJ+rIu5sS67zU1QUFChx3p0W7KJEyd6spnHxowZo08//VTr169X/fr18x1bu3ZtJScnu61LTk5W7dq189wmMDBQgYGBOdb7+fnZ4htCcl59t9P52IlP5+bhh93uyuB3zz18UO0C/L2xNubHupgb67LT3BTlHDx+8IQkffXVV9q6datSU1Nz9FE4HA49+eSTxdm9TNPUfffdp8WLF2vt2rWFutVZVFSUVq9e7dZfvGrVKtfVaQB/4a4MAIBywqPAe+zYMQ0dOlTffvutTNOUw+Fw3aos+8/eCLyjR4/WvHnztGTJElWpUsXVh1u1alUFBwdLkkaNGqV69eopLi5OkjR27FhFR0frlVde0dChQzV//nxt3rxZs2bNKlYtgK1wVwYAQDni0fXshx56SNu2bdO8efP0yy+/yDRNrVy5Uj///LPuuecedezYUX/88Uexi5sxY4ZSU1PVt29f1alTx/VasGCBa8yBAwd06NAh13LPnj01b948zZo1Sx06dNCiRYuUkJBQqv3GgO
XxgAkAQDni0RXe5cuX6+6779bw4cN19OhRSc4+imbNmmn69Om67rrrNG7cOH344YfFKq6gB1xI0tq1a3OsGzZsmIYNG1asYwO2RSsDAKCc8egK74kTJ9S2bVtJUkhIiCTp1KlTrvcHDhyolStXeqE8AF5FKwMAoBzyKPDWrVvX1U8bGBio8PBw/fDDD673f//9dzm4YgRYD60MAIByyKOWhj59+mjVqlV6/PHHJUnDhw/Xiy++KH9/fxmGoalTp2rQoEFeLRRAMdHKAAAopzwKvOPHj9eqVauUmZmpwMBATZo0STt37nTdlaFPnz56/fXXvVoogGKglQEAUI55FHjbt2+v9u3bu5arV6+uL774QidOnJC/v7+qVKnitQIBeAGtDACAcqxYD564WLVq1by5OwDeQCsDAKCc8zjwZmVlaeXKlfrll190/PjxHLcQ88aDJwAUE60MAAB4Fng3b96s66+/Xr/99lue98ol8AIWQCsDAACe3Zbs3nvv1Z9//qmEhAQdO3ZMhmHkeGVlZXm7VgBFQSsDAACSPLzCu23bNj377LO66qqrvF0PAG+glQEAABePrvDWr1+/UI/9BeAjtDIAAODiUeB95JFH9NZbbyktLc3b9QAoLloZAABw41FLw8mTJxUSEqJmzZrppptuUoMGDeTv7+82xuFw6P777/dKkQAKiVYGAABy8CjwPvjgg64/v/HGG7mOIfACPkArAwAAOXgUePft2+ftOgAUF60MAADkyqPA24hfkQLWQisDAAB58uhDawAshlYGAADyVKgrvE2aNJGfn59++uknVaxYUU2aNJGjgF+VOhwO7d271ytFAsgHrQwAAOSrUIE3OjpaDodDfn5+bssAfIxWBgAAClSowBsfH5/vMgAfoZUBAIAC0cMLlFW0MgAAUCiFusK7fv16j3bep08fj7YDUABaGQAAKLRCBd6+ffu69eyaplmoHt6srCzPKwOQN1oZAAAotEIF3i+//NJtOTMzUw8//LBOnz6tu+66Sy1btpQk/fTTT3rrrbdUuXJlvfjii96vFgCtDAAAFFGh79JwofHjxysgIEBff/21goKCXOuvuuoqjR49WtHR0VqxYoWuuOIK71YLlHe0MgAAUGQefWht7ty5GjlypFvYzVapUiWNHDlSH3zwQbGLA3ARWhkAACgyjwJvenq6Dh06lOf7hw4d0unTpz0uCkAuaGUAAMAjHgXeAQMGaNq0afr4449zvPfRRx9p2rRpGjBgQLGLA/AXWhkAAPBYoXp4LzZ9+nRdfvnlGjZsmOrUqaNmzZpJkvbu3as//vhDl1xyiV5//XWvFgqUa7QyAADgMY+u8NarV08//PCDXn31VbVr107JyclKTk5W27Zt9dprr+mHH35Q/fr1vV0rUD7RygAAQLEU+QpvRkaGZs2apY4dO2rs2LEaO3ZsSdQFQKKVAQAALyjyFd6goCA98sgj2rVrV0nUA+BCtDIAAFBsHrU0tGvXTr/++quXSwHghlYGAAC8wqPA++yzz+rNN9/UF1984e16AEi0MgAA4EUe3aXhjTfeUI0aNTRo0CA1adJETZo0UXBwsNsYh8OhJUuWeKVIoNyhlQEAAK/xKPBu27ZNDodDDRs2VFZWlvbs2ZNjjINfvQKeoZUBAACv8ijw0r8LlBBaGQAA8DqPengBlBBaGQAA8DqPrvBmW7dunZYtW6b9+/dLkho1aqShQ4cqOjraK8UB5QqtDAAAlAiPAu+ZM2c0YsQIJSQkyDRNVatWTZJ04sQJvfLKK7r22mv14YcfqmLFit6sFbAvWhkAACgxHrU0TJ48WYsXL9YDDzygQ4cO6dixYzp27JiSkpL04IMP6uOPP9aUKVO8XStgW46HHqKVAQCAEuJR4J03b55iY2P14osvKiIiwrU+PDxcL7zwgkaNGqX333/fKwWuX79eV111lerWrSuHw6GEhIR8x69du1YOhyPHKykpySv1AN4W8OWXcrzzjnOBVgYAAL
zOo8B76NAh9ejRI8/3e/To4bWAmZ6erg4dOmj69OlF2m7Xrl06dOiQ6xUeHu6VegCvSk1V1QcfPL9MKwMAAF7nUQ9v/fr1tXbtWt1zzz25vr9u3TrVr1+/WIVlGzx4sAYPHlzk7cLDw129xYBVOR56SH5//OFcoJUBAIAS4VHgjY2N1cSJE1WtWjXdf//9atasmRwOh3bv3q2pU6dq4cKFmjx5srdrLZKOHTsqMzNT7dq106RJk9SrV688x2ZmZiozM9O1nJaWJkkyDEOGYZR4rSXNMAyZpmmLc7GVzz+X31+tDGZIiMxZsyTTdL7gc/y9sTbmx7qYG+uy29wU5Tw8CryPPfaY9u7dq1mzZumtt96Sn5+f68CmaSo2NlaPPfaYJ7sutjp16mjmzJnq2rWrMjMz9fbbb6tv37765ptv1Llz51y3iYuLyzWgHz58WBkZGSVdcokzDEPnzp1TSkqKa67gW460NNW6/XbXcuoTTygjOFhKSfFhVbiQYRhKTU2VaZr8vbEg5se6mBvrstvcnDx5stBjHabp+eWkbdu2afny5W734R0yZIgiIyM93WW+HA6HFi9erJiYmCJtFx0drYYNG+b5QbrcrvA2aNBAx48fV2hoaHFKtgTDMDRkyBAtX77cFt/gduC46y7XB9UyL7tM/qtXy8/f38dV4UKGYejw4cMKCwvj740FMT/WxdxYl93mJi0tTdWrV1dqamqBea1YD56IjIwssXDrTd27d9eGDRvyfD8wMFCBgYE51vv5+dniG0Jy/rBgp/Mp01aulC5oZUh95RXV8vdnbiyIvzfWxvxYF3NjXXaam6KcQ6FGnj592uNiirOttyQmJqpOnTq+LgPI8YAJ88UXZTRo4MOCAACwv0IF3gYNGmjKlCk6dOhQoXf8+++/66mnnlLDhg09Lk6STp06pcTERCUmJkqS9u3bp8TERB04cECSNGHCBI0aNco1furUqVqyZIn27NmjHTt2aNy4cVqzZo1Gjx5drDoAr3jwQR4wAQBAKStUS8OMGTM0adIkTZkyRb169dKAAQPUuXNnNWnSRNWrV5dpmjp+/Lj27dunzZs364svvtDXX3+t5s2b69///nexCty8ebP69evnWh4/frwk550i4uPjdejQIVf4lZyPPX7ggQf0+++/q1KlSoqMjNQXX3zhtg/AJ1audD5UQuIBEwAAlKJCf2jNMAwtXbpU8fHxWrFihc6cOSPHRf+zNk1TAQEBGjhwoG677TZdffXVZbJHJC0tTVWrVi1UE3RZYBiGBg8erM8++6xMzoctpKZK7dqdv7o7c6Z0990yDEMpKSkKDw9nbiyGubE25se6mBvrstvcFCWvFfpDa35+foqJiVFMTIwyMzO1ZcsW/fTTTzp69KgkqWbNmmrVqpW6dOmS6wfAgHKNVgYAAHzGo7s0BAYGqmfPnurZs6e36wHsh1YGAAB8quxfzwas7KK7Mujll6VGjXxXDwAA5RCBFyhJtDIAAOBzBF6gpNDKAACAJRB4gZJAKwMAAJZB4AVKAq0MAABYhkeB95tvvvF2HYB90MoAAICleBR4o6Ki1KJFCz399NP65ZdfvF0TUHbRygAAgOV4FHg/+OADNW/eXE8//bSaN2+uXr16aebMmTp27Ji36wPKFloZAACwHI8C7//93/9p2bJl+uOPPzRt2jSZpql7771XdevWVUxMjBYtWqQzZ854u1bA2mhlAADAkor1obVatWppzJgx2rhxo3bv3q3HH39cP/30k4YPH67atWvrrrvu0oYNG7xVK2BdtDIAAGBZXrtLQ3BwsCpVqqSgoCCZpimHw6ElS5YoOjpa3bp1048//uitQwHWQysDAACWVazAe/LkSc2ZM0cDBgxQo0aN9Nhjj6lx48ZatGiRkpKS9Mcff2jBggVKSUnRrbfe6q2aAWuhlQEAAEur4MlGS5Ys0dy5c/Xpp58qIyND3bp109SpU3XTTTepZs2abmNvuOEGHT9+XKNHj/
ZKwYCl0MoAAIDleRR4r732WjVo0ED333+/Ro0apZYtW+Y7vkOHDvr73//uUYGApdHKAACA5XkUeNesWaO+ffsWenz37t3VvXt3Tw4FWBetDAAAlAke9fAWJewCtkQrAwAAZYbX7tIAlCu0MgAAUGYQeIGiopUBAIAyhcALFAWtDAAAlDkEXqAoaGVw0ze+rxyTHXJMdujKeVf6uhyXxKREV12OyQ4t+nGRr0sCUEY0buz8pZ3DIY0Z4+tqzps69XxdDod05IivKypbCLxAYVmolWHvsb26+5O71XRaUwU9E6TQuFD1mt1L076epj/P/lmqtbSq1UrvX/u+Huz5oNv6BTsW6OaPb1bz15vLMdmhvvF9i32sb3//Vvcuu1ddZnVRxacryjE5969/o6qN9P617+ux3o8V+5gASlZ8vHuQczik8HCpXz/ps898U9Nll0nvvy/FxuY9ZsMG74TPBQukm2+Wmjd37iuv+wL87W/Omq691vNjlWce3ZYMKHcs1Mqw7OdlGrZwmAIrBGpU5Ci1C2+nM1lntOHgBj206iHtPLxTs66aVWr1RFSO0M2RN+dYP2PzDG05tEXd6nbT0dNHvXKs5buX6+2tbysyIlJNqzfVz0d/znVc9eDqujnyZq39da2e2/CcV44NoGRNmSI1aSKZppSc7AzCQ4ZIn3wiXVnKv0Bq2tQZQvNiGNJ990mVK0vp6cU71owZ0pYtUrdu0tF8/qls1cr52rNHWry4eMcsjwi8QGFYpJVh3/F9uumjm9SoWiOtGbVGdarUcb03uvto7em3R8t+XuaT2i72/rXvq15oPfk5/NTu3+28ss9/dP2HHun1iIIrBmvM8jF5Bl4AZc/gwVLXrueXb79dioiQPvyw9ANvQWbNkg4edF4HmTatePt6/32pXj3Jz09q551/KpELAi9QEAu1Mrz41Ys6deaU3rn6Hbewm61ZjWYae+lYH1SWU4OqDby+z4iQCK/vE4A1VasmBQdLFSyWVI4dk554wnlFOiWl+Ptr4P1/KpELi30bARZjoVYGSfrk50/UtHpT9WzQ0+N9nD57WqfPni5wnL/DX9WDq3t8HAAoitRUZy+saTqD5OuvS6dO5d9akO3UKSkjo+BxFStKVasWr84nn5Rq15buvlt6+uni7Qulh8AL5McirQySlJaZpt9P/q5rWl5TrP28+NWLmrxucoHjGlVtpF/H/VqsYwFAYQ0Y4L4cGCjNni1dcUXB244ZI737bsHjoqOltWs9Kk+StG2b9Oab0vLlkr+/5/tB6SPwAnmxUCuD5Ay8klQlsEqx9jOqwyj1bti7wHHBFYKLdRwAKIrp06UWLZx/Tk6WPvjA+Qu2KlWk667Lf9uHHy7cleDqxfyl1T//6ew1HjiwePtB6SPwArmxWCuDJIUGhkqSTmaeLNZ+mlZvqqbVm3qjJADwmu7d3T+0NmKE1KmT8+rtlVdKAQF5b9umjfNVkhYskDZulHbsKNnjoGQQeIHcWKiVIVtoYKjqVqmrHSnF+9f21JlTOnXmVIHj/B3+CqscVqxjAYCn/Pyc9+KdNk3avVtq2zbvsamp0p+FuAV5QIBUo4Zn9Tz0kDRsmHMfv/7qXHfihPO/Bw9KZ85Idet6tm+UPAIvcDGLtTJc6MrmV2rW1lnadHCTohpEebSPlze+TA8vgDLh3Dnnf08V8DP62LEl38N78KA0b57zdbHOnaUOHaTERM/2jZJH4AUuZMFWhgs93Othzd0+V3d8cofWjFqT4zZde4/t1ac/f5rvrcno4QVQFpw9K33+ufOKauvW+Y8tjR7e3B72MH++s9Xhvfek+vU93zdKHoEXuJAFWxkudEmNSzTv+nkavmi4Wk9vrVEdzj9pbePBjVr440Ld0uGWfPdRWj286/ev1/r96yVJh08fVvrZdD2z/hlJUp9GfdSnUR/XWMdkh6IbRWvtLWvz3ef+E/v1/rb3JUmb/9gsSa59NqraSCM7jPT2aQAoJZ99Jv30k/
PPKSnOK6m7d0uPPiqFhua/bWn08MbE5FyXfUV38GCpVq3z69eudbZjTJwoTZqU/37Xr3e+JOnwYeeT255x/rOmPn2cLxQfgRfIZuFWhgtd3fJqbbtnm17a+JKW7FqiGZtnKNA/UJERkXpl4Cu6s/Odvi5RkrRm35ocrRNPfvmkJGli9ERX4M3uJ87tQRoX23din2sfF+8zulE0gRcow5566vyfg4Kcj9GdMcN5v9uyJrsFo07B/6xpzRpp8kVdZk/+9c/cxIkEXm8h8AKS5VsZLta8ZnPNumqWr8uQJJ01zurI6SMK8A9w3UlCkib1naRJfScVuP36/evlkEOP9X6swLF9G/eVOdEscFyWkaXjGceVmpFa4FgAvnXLLc6XlWRmOh+CERwsVa6c97hJk3K/grt+vbPFoTDnldc+LpaR4QzSpwt+bhBy4efrAgBLsHgrg5VtPLhRYS+F6f8++j+Ptv9y35e6qd1Nah/R3ms1bU/ZrrCXwhSzIMZr+wRQfsyfL4WFSY884tn2X37pvEobGOi9mmbOdNb00kve22d5whVeoIy0MljRKwNf0fGM45KksEqe3cLspYHe/9e7WY1mWjVylWs5MiLS68cAYE9z556/xVmDBp7t47vvvFdPtuuvl9q1O79c3EcklzcEXpRvZayVwWq61O3i6xJyFRIQogFNBxQ8EAAu0quXryvIXYMGngdw0NKA8o5WBgAAbI/Ai/KLVgYAAMoFywfe9evX66qrrlLdunXlcDiUkJBQ4DZr165V586dFRgYqGbNmik+Pr7E60QZQysDAADlhuUDb3p6ujp06KDp06cXavy+ffs0dOhQ9evXT4mJiRo3bpzuuOMOrVy5soQrRZlCKwMAAOWG5T+0NnjwYA0ePLjQ42fOnKkmTZrolVdekSS1bt1aGzZs0GuvvaZBgwaVVJkoS2hlAIBC+/JL56t1aykqyvnLMP7JRFlj+cBbVJs2bdKAAe6fzh40aJDGjRuX5zaZmZnKzMx0LaelpUmSDMOQYRglUmdpMgxDpmna4lyKLTVVjjvuUPa/1caLLzo/9uqjrw1zY13MjbUxP6Xj9GnpyisdOn36fMKtXdvUpZdK7dub6txZuuoq9wDM3FiX3eamKOdhu8CblJSkiIgIt3URERFKS0vTn3/+qeDg4BzbxMXFafLFz/WTdPjwYWVkZJRYraXFMAydO3dOKSkp8vOzfBdLiQp94AFV+quVIbNPHx2PiXE+tN1HDMNQamqqTNMs93NjNcyNtTE/pcM0pUsuqant2yu61iUlOZSQICUkOFPuVVf9qVmzzj/VkLmxLrvNzcmTJws91naB1xMTJkzQ+PHjXctpaWlq0KCBwsLCFBoams+WZYNhGKpQoYLCw8Nt8Q3usZUr5TdvniTJDAlRxfh4hV/0w1FpMwxDDodDYWFh5XtuLIi5sTbmp+SlpDgfkRsV5VBqqqkDB3LvY/jttyCFh59/pBhzY112m5ugoKBCj7Vd4K1du7aSk5Pd1iUnJys0NDTXq7uSFBgYqMBcnv/n5+dni28ISXI4HLY6nyJLTXX7YJrj5ZflaNKkVEv4+cjPenHji+oQ0UH39bjvfC3lfW4sjLmxNubHu1JSpHXrpLVrna8ffyx4m7p1nVd6/fzcwzBzY112mpuinIPtAm9UVJSWL1/utm7VqlWKioryUUWwhAce8MldGX45/osW7lyo//z4H209tNW1vkf9Huper3up1AAAuSlKwPXzk8LDpaSk8+uuv1567z2pUqWSrhQoPssH3lOnTmnPnj2u5X379ikxMVE1atRQw4YNNWHCBP3+++967733JEn33HOP3njjDT388MO67bbbtGbNGv3nP//RsmXLfHUK8LWVK6V33nH+uRTuymCapmZ/P1szNs/QlkNbch1TJaBKiR0fAHJTlIDr7y917Sr17et8ffWV9Mwz599/+GEpLs4ZhIGywPKBd/PmzerXr59rObvXNjY2VvHx8Tp06JAOHD
jger9JkyZatmyZ7r//fk2bNk3169fX22+/zS3JyisfPGBizb41uuOTO/J8v2XNlmod1rpEawCA4gTcXr2kKhf8XP7XNSX5+0szZkh33llydQMlwfKBt2/fvjJNM8/3c3uKWt++ffX999+XYFUoM3zQylCnSh1VqlhJp8+eVtXAqkrNTHV7//ZOt5d4DQDKH28G3Iu98orUrp00cKBzO6CssXzgBTxWyq0M2dqEtdHu+3YrdnGsvtj3hSTJIYdMOX9wu6HNDSVeAwD7K8mAe7E6daTHHitevYAvEXhhTz5oZch2JuuM7vn0HlfYDfQPVGaW88EmXet2VZPqpXt3CAD2UJoBF7AbAi/syUd3ZTiTdUY3/OcGffLzJ5Kk4ArBGtNtjF7a9JIkaVibYaVSB4Cyj4ALeA+BF/bjo1aG3MLusv9bpmpB1TRzy0xVD66uWzreUuJ1ACibCLhAySHwwl581MqQV9jt18R5h5E/HvhDgf6BquhfMb/dAChHCLhA6SHwwl580MpQUNiVpJCAkBKvA4C1EXAB3yHwwj580MpQmLALoHwi4ALWQeCFPfiglYGwC+BCBFzAugi8sIdSbmUg7AIg4AJlB4EXZV8ptzIQdoHyiYALlF0EXpRtpdzKQNgFyg8CLmAfBF6UbaXYykDYBeyNgAvYF4EXZVcptjIQdgH7IeAC5QeBF2VTKbYyEHYBeyhqwO3SxT3ghoaWTp0AvI/Ai7KplFoZCLtA2UXABZCNwIuyp5RaGQi7QNlCwAWQFwIvypZSamUg7ALWd+SIn9atk9avJ+ACyB+BF2VLKbQyEHYBa3K/guvQjz+G5zmWgAvgQgRelB2l0MpA2AWsI/8WBfe/+wRcAPkh8KJsKIVWBsIu4FspKefbE9aulXbuzHusv7+pyMizGjCgovr1cxBwAeSLwIuyoYRbGQi7QOkrWsB1v4IbFWUqI+OYwsPD5edXco8SB2APBF5YXwm3MhB2gdJRnIB78RVcw5AyMkq2XgD2QeCFtZVwKwNhFyg53gy4AFAcBF5YWwm2MhB2Ae8i4AKwKgIvrKsEWxkIu0DxEXABlBUEXlhTCbYyEHYBzxBwAZRVBF5YUwm1MhB2gcIj4AKwCwIvrKeEWhkIu0D+CLgA7IrAC2spoVYGwi6QEwEXQHlB4IW1lEArA2EXcCLgAiivCLywjhJoZSDsojwj4AKAE4EX1lACrQyEXZQ3BFwAyB2BF9bg5VYGwi7KAwIuABQOgRe+5+VWBsIu7IqACwCeIfDCt7zcykDYhZ0QcAHAOwi88C0vtjIQdlHWEXABoGQQeOE7XmxlIOyiLCLgAkDpIPDCN7zYykDYRVlBwAUA3yDwwje81MpA2IWVEXABwBoIvCh9XmplIOzCagi4AGBNZSLwTp8+XS+99JKSkpLUoUMHvf766+revXuuY+Pj43Xrrbe6rQsMDFRGRkZplIqCeKmVgbALKyDgAkDZYPnAu2DBAo0fP14zZ85Ujx49NHXqVA0aNEi7du1SeHh4rtuEhoZq165drmVHMR9PCy/yQisDYRe+QsAFgLLJ8oH31Vdf1Z133um6ajtz5kwtW7ZMs2fP1qOPPprrNg6HQ7Vr1y7NMlEYK1YUu5WBsIvSdOSIn9avPx9yCbgAUDZZOvCeOXNGW7Zs0YQJE1zr/Pz8NGDAAG3atCnP7U6dOqVGjRrJMAx17txZzz33nNq2bZvn+MzMTGVmZrqW09LSJEmGYcgwDC+ciW8ZhiHTNH17Lqmpctx5p7LjrfHii1KDBlIRajqTdUbDFg7Tp7s/leQMu5+M+ETRjaLL7DxZYm7gkn0Fd906h9atc2jnztx/iyRJ/v6munSRoqOl6Ggz14DLtJYc/u5YF3NjXXabm6Kch6UD75EjR5SVlaWIiAi39REREfrpp59y3aZly5aaPXu2IiMjlZqaqpdfflk9e/bUzp07Vb9+/Vy3iYuL0+TJk3OsP3z4sC16fw3D0Llz55SSkiI/Pz+f1B
D6wAOq9FcrQ2afPjoeE+NMF4V0JuuM7lx1pz7f/7kkKahCkN7/2/tqW6mtUoqwH6sxDEOpqakyTdNnc1OeHTnip6+/rqiNGwO0cWOAdu2qmOdYf39TkZFn1bPnGUVFnVH37mdVpYrpej8jw/lC6eDvjnUxN9Zlt7k5efJkocdaOvB6IioqSlFRUa7lnj17qnXr1nrzzTf19NNP57rNhAkTNH78eNdyWlqaGjRooLCwMIXa4HeShmGoQoUKCg8P9803+IoV8ps3T5JkhoSoYny8wi/6ISY/2Vd2s8Nu9pXdfo3LfhuDYRhyOBwKCwuzxT8+Vud+BVfauTPvlho/P1MdOpxV//4VLmhRqCDnP5uVSqtk5IG/O9bF3FiX3eYmKCio0GMtHXhr1aolf39/JScnu61PTk4udI9uxYoV1alTJ+3ZsyfPMYGBgQoMDMyx3s/PzxbfEJKzr9kn55OaKt199/k6Xn5ZjiZNCr35mawzunHRjW5tDHbr2fXZ3JQDxfmQWVSUqYyMY777QREF4u+OdTE31mWnuSnKOVg68AYEBKhLly5avXq1YmJiJDl/Olm9erXGjBlTqH1kZWVp+/btGjJkSAlWijwV464MfEANReXNuygYBi0KAGAXlg68kjR+/HjFxsaqa9eu6t69u6ZOnar09HTXXRtGjRqlevXqKS4uTpI0ZcoUXXrppWrWrJlOnDihl156Sfv379cdF977FaWjGHdlIOyiMLhNGACgMCwfeIcPH67Dhw/rqaeeUlJSkjp27KgVK1a4Psh24MABt0vax48f15133qmkpCRVr15dXbp00caNG9WmTRtfnUL5lJoq3Xnn+eUiPGCCsIu8EHABAJ6wfOCVpDFjxuTZwrB27Vq35ddee02vvfZaKVSFfHnYykDYxYUIuAAAbygTgRdljIetDIRdEHABACWBwAvv8rCVgbBbPhFwAQClgcAL7/KglYGwW34QcAEAvkDghfd40MpA2LU3Ai4AwAoIvPAOD1oZCLv2Q8AFAFgRgRfeUcRWBsKuPRBwAQBlAYEXxVfEVgbCbtlFwAUAlEUEXhRPEVsZCLtlCwEXAGAHBF4UTxFaGQi71kfABQDYEYEXnitCKwNh15oIuACA8oDAC88UoZWBsGsdBFwAQHlE4IVnCtnKQNj1LQIuAAAEXniikK0MhN3SR8AFACAnAi+KppCtDITd0kHABQCgYAReFE0hWhkIuyWHgAsAQNEReFF4hWhlIOx6FwEXAIDiI/CicArRykDYLT4CLgAA3kfgReEU0MpA2PVMSor06aeB+v57h9atI+ACAFASCLwoWAGtDITdwst5BddPUvVcxxJwAQDwDgIv8ldAKwNhN3+0KAAA4HsEXuQvn1YGwm5ORQ+4prp1S9fgwZV02WV+BFwAAEoAgRd5y6eVgbDrVNwruCEhplJSTik8vJL8/EqnZgAAyhsCL3KXTytDeQ673m5RMIySqxUAADgReJG7PFoZylvYpQcXAICyj8CLnPJoZSgPYZeACwCA/RB44S6PVga7hl0CLgAA9kfghbtcWhnsFHYJuAAAlD8EXpyXSyvDGeNsmQ67BFwAAEDghVMurQxn6tcpc2GXgAsAAC5G4IXTRa0MZ26/pUyEXQIuAAAoCIEXOVoZzrz5b92wcJglwy4BFwAAFBWBt7y7qJXhzEvP64ZvHrBM2CXgAgCA4iLwlncXtDKcueJy3VB1pU/DLgEXAAB4G4G3PLugleFMaGXdcJNfqYddAi4AAChpBN7y6oJWhjP+0g0TLtEnB7+QVLJhl4ALAABKG4G3vPqrleGMv3TDvbX0SeY2Sd4PuwRcAADgawTe8uivVoYz/tINI/z1Sc0jkrwTdgm4AADAagi85c1frQxn/KUbbpQ+aZYlyfOwS8AFAABWR+Atbx54QGcO/eYMuy2dq4oSdgm4AACgrCHwlicrVuhM/DtFCrsEXAAAUNYReMuJSmfP6uw/7tKwAsIuARcAANiNn68LKI
zp06ercePGCgoKUo8ePfTtt9/mO37hwoVq1aqVgoKC1L59ey1fvryUKrWuW/fs0rBev+cIu20r99OiRdKYMVK7dlJEhDRsmDR9es6w6+8vde8uPfywtHy5dOyY9M030gsvSIMHE3YBAIA1Wf4K74IFCzR+/HjNnDlTPXr00NSpUzVo0CDt2rVL4eHhOcZv3LhRI0aMUFxcnK688krNmzdPMTEx2rp1q9q1a+eDM7CA9HQlNvnDFXYDFKyBR5bpvqv6cQUXAADYnsM0TdPXReSnR48e6tatm9544w1JkmEYatCgge677z49+uijOcYPHz5c6enp+vTTT13rLr30UnXs2FEzZ84s1DHT0tJUtWpVpaamKtQGCc84c0a3DAvV+50zpbPB0txl0q85e3YJuKXPMAylpKQoPDxcfn5l4hcu5QZzY23Mj3UxN9Zlt7kpSl6z9BXeM2fOaMuWLZowYYJrnZ+fnwYMGKBNmzblus2mTZs0fvx4t3WDBg1SQkJCnsfJzMxUZmamazktLU2S8xvDMIxinIE1GH5++i0jStW/jNXxbZdJxy+RJPn7m+rSRYqOlqKjzVwDrg1O39IMw5Bpmrb4PrMb5sbamB/rYm6sy25zU5TzsHTgPXLkiLKyshQREeG2PiIiQj/99FOu2yQlJeU6PikpKc/jxMXFafLkyTnWX3/99apQwdJfokIxTVPbtm5Xw/r7VbFic1VqvEXVq/+g6tV/VIUKp/XDD9IPP0j/+pevKy1/TNPUuXPnVKFCBTkcDl+XgwswN9bG/FgXc2Nddpubc+fOFXps2U9zXjBhwgS3q8JpaWlq0KCBPvroI3u0NBiGhgwZouXLH7/gVxg3+LQmOBmGocOHDyssLMwWv16yE+bG2pgf62JurMtuc5OWlqbq1asXaqylA2+tWrXk7++v5ORkt/XJycmqXbt2rtvUrl27SOMlKTAwUIGBgTnW+/n52eIbQpIcDoetzsdOmBvrYm6sjfmxLubGuuw0N0U5B0ufbUBAgLp06aLVq1e71hmGodWrVysqKirXbaKiotzGS9KqVavyHA8AAAB7s/QVXkkaP368YmNj1bVrV3Xv3l1Tp05Venq6br31VknSqFGjVK9ePcXFxUmSxo4dq+joaL3yyisaOnSo5s+fr82bN2vWrFm+PA0AAAD4iOUD7/Dhw3X48GE99dRTSkpKUseOHbVixQrXB9MOHDjgdkm7Z8+emjdvnp544gk99thjat68uRISEsrvPXgBAADKOcsHXkkaM2aMxowZk+t7a9euzbFu2LBhGjZsWAlXBQAAgLLA0j28AAAAQHEReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK1ZOvCapqmnnnpKderUUXBwsAYMGKDdu3fnu82kSZPkcDjcXq1atSqligEAAGA1lg68L774ov71r39p5syZ+uabb1S5cmUNGjRIGRkZ+W7Xtm1bHTp0yPXasGFDKVUMAAAAq6ng6wLyYpqmpk6dqieeeELXXHONJOm9995TRESEEhISdNNNN+W5bYUKFVS7du3SKhUAAAAWZtnAu2/fPiUlJWnAgAGudVWrVlWPHj20adOmfAPv7t27VbduXQUFBSkqKkpxcXFq2LBhnuMzMzOVmZnpWk5NTZUknThxQoZheOFsfMswDJ09e1YnTpyQn5
+lL+qXO4ZhKC0tTQEBAcyNxTA31sb8WBdzY112m5u0tDRJzoukBbFs4E1KSpIkRUREuK2PiIhwvZebHj16KD4+Xi1bttShQ4c0efJkXXbZZdqxY4eqVKmS6zZxcXGaPHlyjvWNGjUqxhlYT82aNX1dAgAAgFedPHlSVatWzXeMwyxMLC4Fc+fO1d133+1aXrZsmfr27as//vhDderUca2/8cYb5XA4tGDBgkLt98SJE2rUqJFeffVV3X777bmOufgKr2EYOnbsmGrWrCmHw+HhGVlHWlqaGjRooIMHDyo0NNTX5eACzI11MTfWxvxYF3NjXXabG9M0dfLkSdWtW7fAK9aWucJ79dVXq0ePHq7l7ACanJzsFniTk5PVsWPHQu+3WrVqatGihfbs2ZPnmMDAQAUGBubYzm5CQ0Nt8Q1uR8yNdTE31sb8WBdzY112mpuCruxms0wDR5UqVdSsWTPXq02bNqpdu7ZWr17tGpOWlqZvvvlGUVFRhd7vqVOntHfvXrfQDAAAgPLDMoH3Yg6HQ+PGjdMzzzyjpUuXavv27Ro1apTq1q2rmJgY17j+/fvrjTfecC0/+OCDWrdunX799Vdt3LhR1157rfz9/TVixAgfnAUAAAB8zTItDbl5+OGHlZ6errvuuksnTpxQ7969tWLFCgUFBbnG7N27V0eOHHEt//bbbxoxYoSOHj2qsLAw9e7dW19//bXCwsJ8cQqWEBgYqIkTJ+Zo24DvMTfWxdxYG/NjXcyNdZXnubHMh9YAAACAkmDZlgYAAADAGwi8AAAAsDUCLwAAAGyNwAsAAABbI/Da3PTp09W4cWMFBQWpR48e+vbbb31dEiStX79eV111lerWrSuHw6GEhARfl4S/xMXFqVu3bqpSpYrCw8MVExOjXbt2+bosSJoxY4YiIyNdN82PiorSZ5995uuykIvnn3/edXtR+N6kSZPkcDjcXq1atfJ1WaWKwGtjCxYs0Pjx4zVx4kRt3bpVHTp00KBBg5SSkuLr0sq99PR0dejQQdOnT/d1KbjIunXrNHr0aH399ddatWqVzp49q4EDByo9Pd3XpZV79evX1/PPP68tW7Zo8+bNuvzyy3XNNddo586dvi4NF/juu+/05ptvKjIy0tel4AJt27bVoUOHXK8NGzb4uqRSxW3JbKxHjx7q1q2b68EchmGoQYMGuu+++/Too4/6uDpkczgcWrx4sdsDVWAdhw8fVnh4uNatW6c+ffr4uhxcpEaNGnrppZd0++23+7oUyPl0086dO+vf//63nnnmGXXs2FFTp071dVnl3qRJk5SQkKDExERfl+IzXOG1qTNnzmjLli0aMGCAa52fn58GDBigTZs2+bAyoGxJTU2V5AxWsI6srCzNnz9f6enpRXrcPErW6NGjNXToULf/98Aadu/erbp166pp06b6+9//rgMHDvi6pFJl6SetwXNHjhxRVlaWIiIi3NZHRETop59+8lFVQNliGIbGjRunXr16qV27dr4uB5K2b9+uqKgoZWRkKCQkRIsXL1abNm18XRYkzZ8/X1u3btV3333n61JwkR49eig+Pl4tW7bUoUOHNHnyZF122WXasWOHqlSp4uvySgWBFwDyMHr0aO3YsaPc9bpZWcuWLZWYmKjU1FQtWrRIsbGxWrduHaHXxw4ePKixY8dq1apVCgoK8nU5uMjgwYNdf46MjFSPHj3UqFEj/ec//yk37UAEXpuqVauW/P39lZyc7LY+OTlZtWvX9lFVQNkxZswYffrpp1q/fr3q16/v63Lwl4CAADVr1kyS1KVLF3333XeaNm2a3nzzTR9XVr5t2bJFKSkp6ty5s2tdVlaW1q9frzfeeEOZmZny9/f3YYW4ULVq1dSiRQvt2bPH16WUGnp4bSogIEBdunTR6tWrXesMw9Dq1avpdwPyYZqmxowZo8WLF2vNmjVq0qSJr0tCPgzDUGZmpq/LKPf69++v7du3KzEx0fXq2rWr/v73vysxMZGwaz
GnTp3S3r17VadOHV+XUmq4wmtj48ePV2xsrLp27aru3btr6tSpSk9P16233urr0sq9U6dOuf1kvW/fPiUmJqpGjRpq2LChDyvD6NGjNW/ePC1ZskRVqlRRUlKSJKlq1aoKDg72cXXl24QJEzR48GA1bNhQJ0+e1Lx587R27VqtXLnS16WVe1WqVMnR5165cmXVrFmT/ncLePDBB3XVVVepUaNG+uOPPzRx4kT5+/trxIgRvi6t1BB4bWz48OE6fPiwnnrqKSUlJaljx45asWJFjg+yofRt3rxZ/fr1cy2PHz9ekhQbG6v4+HgfVQXJ+XADSerbt6/b+jlz5uiWW24p/YLgkpKSolGjRunQoUOqWrWqIiMjtXLlSl1xxRW+Lg2wtN9++00jRozQ0aNHFRYWpt69e+vrr79WWFiYr0srNdyHFwAAALZGDy8AAABsjcALAAAAWyPwAgAAwNYIvAAAALA1Ai8AAABsjcALAAAAWyPwAgAAwNYIvAAAALA1Ai8AlKL//Oc/qlGjhk6dOlXkbRs3bqwrr7yyBKrKXXx8vBwOh3799ddSO+aFfvzxR1WoUEE7duzwyfEB2AeBFwBKSVZWliZOnKj77rtPISEhvi7H8tq0aaOhQ4fqqaee8nUpAMo4Ai8AlJJPPvlEu3bt0l133eXrUgpl5MiR+vPPP9WoUSOf1XDPPfdo8eLF2rt3r89qAFD2EXgBoJTMmTNHvXr1Ur169XxdSqH4+/srKChIDofDZzUMGDBA1atX17vvvuuzGgCUfQReACiEP//8U61atVKrVq30559/utYfO3ZMderUUc+ePZWVlZXn9hkZGVqxYoUGDBiQ4705c+bo8ssvV3h4uAIDA9WmTRvNmDEjz319/vnn6tixo4KCgtSmTRt9/PHHbu+fPXtWkydPVvPmzRUUFKSaNWuqd+/eWrVqldu4n376STfeeKPCwsIUHBysli1b6vHHH3e9n1sP7+bNmzVo0CDVqlVLwcHBatKkiW677Ta3/c6fP19dunRRlSpVFBoaqvbt22vatGluX7MHH3xQ7du3V0hIiEJDQzV48GD98MMPOc61YsWK6tu3r5YsWZLn1wMAClLB1wUAQFkQHBysd999V7169dLjjz+uV199VZI0evRopaamKj4+Xv7+/nluv2XLFp05c0adO3fO8d6MGTPUtm1bXX311apQoYI++eQT3XvvvTIMQ6NHj3Ybu3v3bg0fPlz33HOPYmNjNWfOHA0bNkwrVqzQFVdcIUmaNGmS4uLidMcdd6h79+5KS0vT5s2btXXrVteYbdu26bLLLlPFihV11113qXHjxtq7d68++eQTPfvss7meQ0pKigYOHKiwsDA9+uijqlatmn799Ve3wL1q1SqNGDFC/fv31wsvvCBJ+t///qevvvpKY8eOlST98ssvSkhI0LBhw9SkSRMlJyfrzTffVHR0tH788UfVrVvX7bhdunTRkiVLlJaWptDQ0HznCQByZQIACm3ChAmmn5+fuX79enPhwoWmJHPq1KkFbvf222+bkszt27fneO/06dM51g0aNMhs2rSp27pGjRqZksyPPvrItS41NdWsU6eO2alTJ9e6Dh06mEOHDs23nj59+phVqlQx9+/f77beMAzXn+fMmWNKMvft22eapmkuXrzYlGR+9913ee537NixZmhoqHnu3Lk8x2RkZJhZWVlu6/bt22cGBgaaU6ZMyTF+3rx5piTzm2++yfecACAvtDQAQBFMmjRJbdu2VWxsrO69915FR0frn//8Z4HbHT16VJJUvXr1HO8FBwe7/pyamqojR44oOjpav/zyi1JTU93G1q1bV9dee61rOTQ0VKNGjdL333+vpKQkSVK1atW0c+dO7d69O9daDh8+rPXr1+u2225Tw4YN3d7Lr1+3WrVqkqRPP/1UZ8+ezXNMenp6jvaJCwUGBsrPz/m/n6ysLB09elQhISFq2bKltm7dmmN89tfsyJEjee4TAPJD4AWAIg
gICNDs2bO1b98+nTx5UnPmzCnSh7pM08yx7quvvtKAAQNUuXJlVatWTWFhYXrsscckKUfgbdasWY7jtWjRQpJcvbZTpkzRiRMn1KJFC7Vv314PPfSQtm3b5hr/yy+/SJLatWtX6LolKTo6Wtdff70mT56sWrVq6ZprrtGcOXOUmZnpGnPvvfeqRYsWGjx4sOrXr6/bbrtNK1ascNuPYRh67bXX1Lx5cwUGBqpWrVoKCwvTtm3bcpyvdP5r5ssPzwEo2wi8AFBEK1eulOT8IFpeV1EvVrNmTUnS8ePH3dbv3btX/fv315EjR/Tqq69q2bJlWrVqle6//35JznBYVH369NHevXs1e/ZstWvXTm+//bY6d+6st99+u8j7upDD4dCiRYu0adMmjRkzRr///rtuu+02denSxfUgjfDwcCUmJmrp0qW6+uqr9eWXX2rw4MGKjY117ee5557T+PHj1adPH33wwQdauXKlVq1apbZt2+Z6vtlfs1q1ahWrfgDlmK97KgCgLPnhhx/MgIAA89ZbbzU7depkNmjQwDxx4kSB223YsMGUZC5ZssRt/WuvvWZKytFL+9hjj7n1z5qms4e3bt26bn22pmmajzzyiCnJPHToUK7HPnnypNmpUyezXr16pmmaZkpKiinJHDt2bL41X9zDm5u5c+eaksy33nor1/ezsrLMu+++25Rk7t692zRNZ49xv379coytV6+eGR0dnWP9M888Y/r5+RXq6wwAueEKLwAU0tmzZ3XLLbeobt26mjZtmuLj45WcnOy6GpufLl26KCAgQJs3b3Zbn31nB/OCVofU1FTNmTMn1/388ccfWrx4sWs5LS1N7733njp27KjatWtLOt8vnC0kJETNmjVztR6EhYWpT58+mj17tg4cOOA21syl5SLb8ePHc7zfsWNHSXLt++Jj+/n5KTIy0m2Mv79/jv0sXLhQv//+e67H3bJli9q2bauqVavmWRsA5IfbkgFAIT3zzDNKTEzU6tWrVaVKFUVGRuqpp57SE088oRtuuEFDhgzJc9ugoCANHDhQX3zxhaZMmeJaP3DgQAUEBOiqq67S3XffrVOnTumtt95SeHi4Dh06lGM/LVq00O23367vvvtOERERmj17tpKTk90Ccps2bdS3b1916dJFNWrU0ObNm7Vo0SKNGTPGNeZf//qXevfurc6dO+uuu+5SkyZN9Ouvv2rZsmVKTEzM9Rzeffdd/fvf/9a1116rSy65RCdPntRbb72l0NBQ17nfcccdOnbsmC6//HLVr19f+/fv1+uvv66OHTuqdevWkqQrr7xSU6ZM0a233qqePXtq+/btmjt3rpo2bZrjmGfPntW6det077335j85AJAf315gBoCyYcuWLWaFChXM++67z239uXPnzG7dupl169Y1jx8/nu8+Pv74Y9PhcJgHDhxwW7906VIzMjLSDAoKMhs3bmy+8MIL5uzZs3NtaRg6dKi5cuVKMzIy0gwMDDRbtWplLly40G1/zzzzjNm9e3ezWrVqZnBwsNmqVSvz2WefNc+cOeM2bseOHea1115rVqtWzQwKCjJbtmxpPvnkk673L25p2Lp1qzlixAizYcOGZmBgoBkeHm5eeeWV5ubNm13bLFq0yBw4cKAZHh5uBgQEmA0bNjTvvvtut3aLjIwM84EHHjDr1KljBgcHm7169TI3bdpkRkdH52hp+Oyzz9zaIQDAEw7TzOf3VwAAr8nKylKbNm1044036umnn/Z1OWVCTEyMHA6HWxsHABQVgRcAStGCBQv0j3/8QwcOHFBISIivy7G0//3vf2rfvr0SExOLfAs1ALgQgRcAAAC2xl0aAAAAYGsEXgAAANgagRcAAAC2RuAFAACArRF4AQAAYGsEXgAAANgagRcAAAC2RuAFAACArRF4AQAAYGsEXgAAANja/wO/kav13fKFjwAAAABJRU5ErkJggg==",
|
||
"text/plain": [
|
||
"<Figure size 800x800 with 1 Axes>"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Note: Arrows represent vectors. Endpoint of arrow = vector endpoint\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# Visualize 2D vectors\n",
|
||
"import numpy as np\n",
|
||
"import matplotlib.pyplot as plt\n",
|
||
"\n",
|
||
"# Prefer a CJK-capable font if one is installed; Matplotlib falls back to\n",
|
||
"# its default font (with a warning) when none of these is available.\n",
|
||
"plt.rcParams['font.sans-serif'] = ['SimHei', 'Noto Sans CJK SC', 'WenQuanYi Micro Hei']\n",
|
||
"plt.rcParams['axes.unicode_minus'] = False\n",
|
||
"\n",
|
||
"# Create the figure and axes\n",
|
||
"fig, ax = plt.subplots(figsize=(8, 8))\n",
|
||
"\n",
|
||
"# Define the vectors (label -> coordinates)\n",
|
||
"vectors = {\n",
|
||
"    'A = [2, 3]': np.array([2, 3]),\n",
|
||
"    'B = [4, 1]': np.array([4, 1]),\n",
|
||
"    'C = [1, 1]': np.array([1, 1]),\n",
|
||
"}\n",
|
||
"\n",
|
||
"# Draw each vector as an arrow from the origin and label its tip\n",
|
||
"colors = ['red', 'blue', 'green']\n",
|
||
"for (name, vec), color in zip(vectors.items(), colors):\n",
|
||
"    ax.annotate('', xy=vec, xytext=(0, 0),\n",
|
||
"                arrowprops=dict(arrowstyle='->', color=color, lw=2))\n",
|
||
"    ax.text(vec[0]+0.1, vec[1]+0.1, name, fontsize=12, color=color)\n",
|
||
"\n",
|
||
"# Draw the coordinate axes\n",
|
||
"ax.axhline(y=0, color='black', linewidth=0.5)\n",
|
||
"ax.axvline(x=0, color='black', linewidth=0.5)\n",
|
||
"\n",
|
||
"# Set the axis limits, labels, and title\n",
|
||
"ax.set_xlim(-0.5, 5.5)\n",
|
||
"ax.set_ylim(-0.5, 4)\n",
|
||
"ax.set_xlabel('x (abscissa)', fontsize=12)\n",
|
||
"ax.set_ylabel('y (ordinate)', fontsize=12)\n",
|
||
"ax.set_title('2D Vector Visualization', fontsize=14)\n",
|
||
"ax.grid(True, alpha=0.3)\n",
|
||
"ax.set_aspect('equal')\n",
|
||
"\n",
|
||
"plt.show()\n",
|
||
"print(\"Note: Arrows represent vectors. Endpoint of arrow = vector endpoint\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
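||
"## 3.2 Basic Vector Operations\n",
|
||
"\n",
|
||
"### 3.2.1 Vector Addition\n",
|
||
"\n",
|
||
"**Rule: add the elements at matching positions.**\n",
|
||
"\n",
|
||
"```\n",
|
||
"[1, 2, 3] + [4, 5, 6] = [1+4, 2+5, 3+6] = [5, 7, 9]\n",
|
||
"```\n",
|
||
"\n",
|
||
"This is mathematical notation rather than Python: on plain Python lists, `+` concatenates, so element-wise addition needs NumPy arrays.\n",
|
||
"\n",
|
||
"**Geometric intuition**: walking along vector a and then along vector b ends at the same point as walking straight from the origin to a+b.\n",
|
||
"\n",
|
||
"```\n",
|
||
"              a+b\n",
|
||
"            ↗ ↑\n",
|
||
"          ↗   |\n",
|
||
"        ↗     | b\n",
|
||
"      ↗       |\n",
|
||
"    O ———————→\n",
|
||
"         a\n",
|
||
"```"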
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 6,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"==================================================\n",
|
||
"Vector addition demo\n",
|
||
"==================================================\n",
|
||
"Vector a = [1 2 3]\n",
|
||
"Vector b = [4 5 6]\n",
|
||
"a + b = [5 7 9]\n",
|
||
"\n",
|
||
"Step by step:\n",
|
||
"  Position 0: 1 + 4 = 5\n",
|
||
"  Position 1: 2 + 5 = 7\n",
|
||
"  Position 2: 3 + 6 = 9\n",
|
||
"\n",
|
||
"Check: True True True\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# Vector addition demo\n",
|
||
"import numpy as np\n",
|
||
"\n",
|
||
"print(\"=\" * 50)\n",
|
||
"print(\"Vector addition demo\")\n",
|
||
"print(\"=\" * 50)\n",
|
||
"\n",
|
||
"a = np.array([1, 2, 3])\n",
|
||
"b = np.array([4, 5, 6])\n",
|
||
"c = a + b  # element-wise addition on NumPy arrays\n",
|
||
"\n",
|
||
"print(f\"Vector a = {a}\")\n",
|
||
"print(f\"Vector b = {b}\")\n",
|
||
"print(f\"a + b = {c}\")\n",
|
||
"print()\n",
|
||
"print(\"Step by step:\")\n",
|
||
"print(f\"  Position 0: {a[0]} + {b[0]} = {a[0]+b[0]}\")\n",
|
||
"print(f\"  Position 1: {a[1]} + {b[1]} = {a[1]+b[1]}\")\n",
|
||
"print(f\"  Position 2: {a[2]} + {b[2]} = {a[2]+b[2]}\")\n",
|
||
"print()\n",
|
||
"print(\"Check:\", a[0]+b[0] == c[0], a[1]+b[1] == c[1], a[2]+b[2] == c[2])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
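||
"### 3.2.2 Scalar Multiplication\n",
|
||
"\n",
|
||
"**Rule: multiply every element by the scalar (a single number).**\n",
|
||
"\n",
|
||
"```\n",
|
||
"2 × [1, 2, 3] = [2×1, 2×2, 2×3] = [2, 4, 6]\n",
|
||
"3 × [1, 2, 3] = [3×1, 3×2, 3×3] = [3, 6, 9]\n",
|
||
"0.5 × [1, 2, 3] = [0.5, 1.0, 1.5]\n",
|
||
"```\n",
|
||
"\n",
|
||
"This is mathematical notation rather than Python: `2 * [1, 2, 3]` on a plain Python list repeats the list, so scalar multiplication needs NumPy arrays.\n",
|
||
"\n",
|
||
"**Geometric intuition** (for a scalar k):\n",
|
||
"- k > 0: same direction, length scaled by k\n",
|
||
"- k < 0: opposite direction, length scaled by |k|\n",
|
||
"- k = 0: the result is the zero vector"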
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 7,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"==================================================\n",
|
||
"向量数乘(标量乘法)演示\n",
|
||
"==================================================\n",
|
||
"原始向量 v = [1 2 3]\n",
|
||
"\n",
|
||
"2 × v = [2 4 6]\n",
|
||
"3 × v = [3 6 9]\n",
|
||
"0.5 × v = [0.5 1. 1.5]\n",
|
||
"-1 × v = [-1 -2 -3]\n",
|
||
"0 × v = [0 0 0]\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# 向量数乘演示\n",
|
||
"import numpy as np\n",
|
||
"\n",
|
||
"print(\"=\" * 50)\n",
|
||
"print(\"向量数乘(标量乘法)演示\")\n",
|
||
"print(\"=\" * 50)\n",
|
||
"\n",
|
||
"v = np.array([1, 2, 3])\n",
|
||
"\n",
|
||
"print(f\"原始向量 v = {v}\")\n",
|
||
"print()\n",
|
||
"\n",
|
||
"for scalar in [2, 3, 0.5, -1, 0]:\n",
|
||
" result = scalar * v\n",
|
||
" print(f\"{scalar} × v = {result}\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"### 3.2.3 Vector Length (Magnitude / Norm)\n",
"\n",
"**Definition: the distance from the origin to the tip of the vector**\n",
"\n",
"For a 2-D vector `[a, b]`:\n",
"```\n",
"length = √(a² + b²)\n",
"\n",
"This is just the Pythagorean theorem: the diagonal below has length √(a² + b²).\n",
"\n",
"  │          * (a, b)\n",
"  │        ↗ │\n",
"  │      ↗   │\n",
"  │    ↗     │ b\n",
"  │  ↗       │\n",
"  O ─────────┴──\n",
"        a\n",
"```\n",
"\n",
"For an n-dimensional vector `[a₁, a₂, ..., aₙ]`:\n",
"```\n",
"length = √(a₁² + a₂² + ... + aₙ²)\n",
"```\n",
"\n",
"This is the Euclidean (L2) norm, which is what `np.linalg.norm` computes by default for a 1-D array."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 8,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"==================================================\n",
|
||
"向量长度(模/范数)演示\n",
|
||
"==================================================\n",
|
||
"向量 v = [3 4]\n",
|
||
"长度 = √(3² + 4²) = √(9 + 16) = √25 = 5.0\n",
|
||
"\n",
|
||
"向量长度计算例子:\n",
"  [1, 1] -> length = 1.4142\n",
"  [0, 5] -> length = 5.0000\n",
"  [3, 4] -> length = 5.0000\n",
"  [1, 2, 2] -> length = 3.0000\n",
"  [1, 1, 1, 1] -> length = 2.0000\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# 向量长度计算\n",
|
||
"import numpy as np\n",
|
||
"\n",
|
||
"print(\"=\" * 50)\n",
|
||
"print(\"向量长度(模/范数)演示\")\n",
|
||
"print(\"=\" * 50)\n",
|
||
"\n",
|
||
"# 二维向量例子\n",
|
||
"v2d = np.array([3, 4])\n",
|
||
"length_2d = np.linalg.norm(v2d)\n",
|
||
"\n",
|
||
"print(f\"向量 v = {v2d}\")\n",
|
||
"print(f\"长度 = √({v2d[0]}² + {v2d[1]}²) = √({v2d[0]**2} + {v2d[1]**2}) = √{v2d[0]**2 + v2d[1]**2} = {length_2d}\")\n",
|
||
"print()\n",
|
||
"\n",
|
||
"# 更多例子\n",
|
||
"examples = [\n",
|
||
" np.array([1, 1]), # 45度角\n",
|
||
" np.array([0, 5]), # 在y轴上\n",
|
||
" np.array([3, 4]), # 经典勾股数\n",
|
||
" np.array([1, 2, 2]), # 三维向量\n",
|
||
" np.array([1, 1, 1, 1]) # 四维向量\n",
|
||
"]\n",
|
||
"\n",
|
||
"print(\"向量长度计算例子:\")\n",
|
||
"for v in examples:\n",
|
||
" length = np.linalg.norm(v)\n",
"    print(f\"  {v.tolist()} -> length = {length:.4f}\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 9,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"==================================================\n",
|
||
"练习题3答案\n",
|
||
"==================================================\n",
|
||
"A = [3 4], B = [1 2]\n",
|
||
"\n",
|
||
"1. A + B = [4 6]\n",
|
||
"2. 2 × A = [6 8]\n",
|
||
"3. A的长度 = 5.0\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# 练习题3答案\n",
|
||
"import numpy as np\n",
|
||
"print(\"=\" * 50)\n",
|
||
"print(\"练习题3答案\")\n",
|
||
"print(\"=\" * 50)\n",
|
||
"\n",
|
||
"A = np.array([3, 4])\n",
|
||
"B = np.array([1, 2])\n",
|
||
"\n",
|
||
"print(f\"A = {A}, B = {B}\")\n",
|
||
"print()\n",
|
||
"\n",
|
||
"# 1. A + B\n",
|
||
"print(\"1. A + B =\", A + B)\n",
|
||
"\n",
|
||
"# 2. 2 × A\n",
|
||
"print(\"2. 2 × A =\", 2 * A)\n",
|
||
"\n",
|
||
"# 3. A的长度\n",
|
||
"print(f\"3. A的长度 = {np.linalg.norm(A)}\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"---\n",
|
||
"\n",
|
||
"# 第四部分:余弦相似度\n",
|
||
"\n",
"## 4.1 What Is Similarity?\n",
"\n",
"**Similarity = how alike two vectors are**\n",
"\n",
"### Everyday examples\n",
"\n",
"| High similarity | Why | Low similarity | Why |\n",
"|-----------------|-----|----------------|-----|\n",
"| \"cat\" and \"dog\" | Both are four-legged animals | \"cat\" and \"rock\" | One is an animal, the other is not |\n",
"| \"red\" and \"yellow\" | Both are warm colors | \"hot\" and \"cold\" | Opposite meanings |\n",
"| \"running\" and \"swimming\" | Both are sports | \"sun\" and \"bacteria\" | Almost nothing in common |\n",
"| \"apple\" and \"pear\" | Both are fruits | \"apple\" and \"phone\" | Related only in context (the brand) |\n",
"\n",
"### How does a computer quantify similarity?\n",
"\n",
"Typical uses of text similarity:\n",
"\n",
"```\n",
"Search:\n",
"  Query:      \"How do I learn programming?\"\n",
"  Document 1: \"Python beginner tutorial\"   → high similarity ✅\n",
"  Document 2: \"100 ways to bake a cake\"    → low similarity ❌\n",
"\n",
"Recommendation:\n",
"  User likes:       \"Funny cat and dog videos\"\n",
"  Recommendation 1: \"Cute hamster moments\"         → high similarity ✅\n",
"  Recommendation 2: \"Car engine repair tutorial\"   → low similarity ❌\n",
"```"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"## 4.2 Dot Product: The Most Important Operation\n",
"\n",
"### Definition: multiply matching positions, then sum\n",
"\n",
"```\n",
"a = [1, 2, 3]\n",
"b = [4, 5, 6]\n",
"\n",
"a · b = 1×4 + 2×5 + 3×6 = 4 + 10 + 18 = 32\n",
"```\n",
"\n",
"### Geometric meaning of the dot product\n",
"\n",
"```\n",
"A · B = |A| × |B| × cos(θ)\n",
"\n",
"where:\n",
"  |A| = length of vector A\n",
"  |B| = length of vector B\n",
"  θ   = angle between the two vectors\n",
"```\n",
"\n",
"| Angle θ | cos(θ) | Dot product | Meaning |\n",
"|---------|--------|-------------|---------|\n",
"| 0° | 1 | \\|A\\|×\\|B\\| (maximum) | Same direction |\n",
"| 90° | 0 | 0 | Perpendicular (orthogonal) |\n",
"| 180° | -1 | -\\|A\\|×\\|B\\| (minimum) | Opposite direction |"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 10,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"==================================================\n",
|
||
"向量点积演示\n",
|
||
"==================================================\n",
|
||
"向量 a = [1 2 3]\n",
|
||
"向量 b = [4 5 6]\n",
|
||
"\n",
|
||
"点积 a · b = 32\n",
|
||
"验证: a @ b = 32\n",
|
||
"手动计算: 32\n",
|
||
"\n",
|
||
"计算过程:\n",
|
||
" a[0]×b[0] = 1×4 = 4\n",
|
||
" a[1]×b[1] = 2×5 = 10\n",
|
||
" a[2]×b[2] = 3×6 = 18\n",
|
||
" 求和: 4 + 10 + 18 = 32\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# 点积计算演示\n",
|
||
"import numpy as np\n",
|
||
"\n",
|
||
"print(\"=\" * 50)\n",
|
||
"print(\"向量点积演示\")\n",
|
||
"print(\"=\" * 50)\n",
|
||
"\n",
|
||
"a = np.array([1, 2, 3])\n",
|
||
"b = np.array([4, 5, 6])\n",
|
||
"\n",
|
||
"# 方法1:使用np.dot()\n",
|
||
"dot1 = np.dot(a, b)\n",
|
||
"\n",
|
||
"# 方法2:使用@运算符\n",
|
||
"dot2 = a @ b\n",
|
||
"\n",
|
||
"# 方法3:手动计算\n",
|
||
"dot3 = sum(a[i] * b[i] for i in range(len(a)))\n",
|
||
"\n",
|
||
"print(f\"向量 a = {a}\")\n",
|
||
"print(f\"向量 b = {b}\")\n",
|
||
"print()\n",
|
||
"print(f\"点积 a · b = {dot1}\")\n",
|
||
"print(f\"验证: a @ b = {dot2}\")\n",
|
||
"print(f\"手动计算: {dot3}\")\n",
|
||
"print()\n",
|
||
"print(\"计算过程:\")\n",
|
||
"print(f\" a[0]×b[0] = {a[0]}×{b[0]} = {a[0]*b[0]}\")\n",
|
||
"print(f\" a[1]×b[1] = {a[1]}×{b[1]} = {a[1]*b[1]}\")\n",
|
||
"print(f\" a[2]×b[2] = {a[2]}×{b[2]} = {a[2]*b[2]}\")\n",
|
||
"print(f\" 求和: {a[0]*b[0]} + {a[1]*b[1]} + {a[2]*b[2]} = {dot1}\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 11,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"==================================================\n",
|
||
"点积与夹角的关系\n",
|
||
"==================================================\n",
|
||
"夹角0°: a=[1 0], b=[2 0], 点积=2\n",
|
||
"夹角90°: a=[1 0], b=[0 1], 点积=0\n",
|
||
"夹角180°: a=[1 0], b=[-1 0], 点积=-1\n",
|
||
"\n",
|
||
"任意角度: a=[1 1], b=[1 0]\n",
|
||
" 点积 = 1\n",
|
||
" cos(θ) = 0.7071\n",
|
||
" 夹角 θ = 45.0°\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# 点积与夹角的关系\n",
|
||
"import numpy as np\n",
|
||
"\n",
|
||
"print(\"=\" * 50)\n",
|
||
"print(\"点积与夹角的关系\")\n",
|
||
"print(\"=\" * 50)\n",
|
||
"\n",
|
||
"# 夹角为0度:方向完全相同\n",
|
||
"a = np.array([1, 0])\n",
|
||
"b = np.array([2, 0])\n",
|
||
"dot = np.dot(a, b)\n",
|
||
"print(f\"夹角0°: a={a}, b={b}, 点积={dot}\")\n",
|
||
"\n",
|
||
"# 夹角为90度:垂直\n",
|
||
"a = np.array([1, 0])\n",
|
||
"b = np.array([0, 1])\n",
|
||
"dot = np.dot(a, b)\n",
|
||
"print(f\"夹角90°: a={a}, b={b}, 点积={dot}\")\n",
|
||
"\n",
|
||
"# 夹角为180度:方向相反\n",
|
||
"a = np.array([1, 0])\n",
|
||
"b = np.array([-1, 0])\n",
|
||
"dot = np.dot(a, b)\n",
|
||
"print(f\"夹角180°: a={a}, b={b}, 点积={dot}\")\n",
|
||
"\n",
|
||
"# 任意角度\n",
|
||
"import math\n",
|
||
"a = np.array([1, 1])\n",
|
||
"b = np.array([1, 0])\n",
|
||
"dot = np.dot(a, b)\n",
|
||
"cos_angle = dot / (np.linalg.norm(a) * np.linalg.norm(b))\n",
"angle = math.degrees(math.acos(np.clip(cos_angle, -1.0, 1.0)))  # clip guards against rounding slightly outside [-1, 1]\n",
|
||
"print(f\"\\n任意角度: a={a}, b={b}\")\n",
|
||
"print(f\" 点积 = {dot}\")\n",
|
||
"print(f\" cos(θ) = {cos_angle:.4f}\")\n",
|
||
"print(f\" 夹角 θ = {angle:.1f}°\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"## 4.3 Cosine Similarity: Using the Dot Product to Measure Likeness\n",
"\n",
"### Formula\n",
"\n",
"```\n",
"            A · B\n",
"cos(θ) = ───────────\n",
"          |A| × |B|\n",
"\n",
"where:\n",
"  A · B  = dot product of vectors A and B\n",
"  |A|    = length (norm) of vector A\n",
"  |B|    = length (norm) of vector B\n",
"  cos(θ) = the similarity, in the range [-1, 1]\n",
"```\n",
"\n",
"### Why is it called \"cosine\" similarity?\n",
"\n",
"Because the formula computes exactly the cosine of the angle between the two vectors.\n",
"\n",
"It follows from the dot product formula by dividing both sides by |A| × |B|, which requires both vectors to be non-zero:\n",
"```\n",
"A · B = |A| × |B| × cos(θ)\n",
"          ↓\n",
"cos(θ) = (A · B) / (|A| × |B|)\n",
"```"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 12,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"余弦相似度函数已定义:cosine_similarity(a, b)\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# 定义余弦相似度函数\n",
|
||
"import numpy as np\n",
|
||
"\n",
|
||
"def cosine_similarity(a, b):\n",
|
||
" \"\"\"\n",
|
||
" 计算余弦相似度\n",
|
||
" \n",
|
||
" 参数:\n",
|
||
" a, b: 两个numpy数组(向量)\n",
|
||
" \n",
|
||
" 返回:\n",
|
||
" float: 余弦相似度,范围[-1, 1]\n",
|
||
" \"\"\"\n",
|
||
" dot = np.dot(a, b) # 点积\n",
|
||
" norm_a = np.linalg.norm(a) # 向量a的长度\n",
|
||
" norm_b = np.linalg.norm(b) # 向量b的长度\n",
|
||
" \n",
|
||
" # 防止除以零\n",
|
||
" if norm_a == 0 or norm_b == 0:\n",
|
||
" return 0.0\n",
|
||
" \n",
|
||
" return dot / (norm_a * norm_b)\n",
|
||
"\n",
|
||
"print(\"余弦相似度函数已定义:cosine_similarity(a, b)\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 13,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"==================================================\n",
|
||
"余弦相似度计算示例\n",
|
||
"==================================================\n",
|
||
"1. 方向完全相同: a=[1 2 3], b=[2 4 6]\n",
|
||
" 相似度 = 1.000 (应该是1.000)\n",
|
||
"\n",
|
||
"2. 方向完全相反: a=[1 2 3], b=[-1 -2 -3]\n",
|
||
" 相似度 = -1.000 (应该是-1.000)\n",
|
||
"\n",
|
||
"3. 垂直向量: a=[1 0], b=[0 1]\n",
|
||
" 相似度 = 0.000 (应该是0.000)\n",
|
||
"\n",
|
||
"4. 45度夹角: a=[1 1], b=[1 0]\n",
|
||
" 相似度 = 0.707 (应该是0.707)\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# 余弦相似度计算示例\n",
|
||
"import numpy as np\n",
|
||
"\n",
|
||
"print(\"=\" * 50)\n",
|
||
"print(\"余弦相似度计算示例\")\n",
|
||
"print(\"=\" * 50)\n",
|
||
"\n",
|
||
"# 示例1:方向完全相同的向量\n",
|
||
"a = np.array([1, 2, 3])\n",
|
||
"b = np.array([2, 4, 6]) # b是a的两倍,方向完全相同\n",
|
||
"sim = cosine_similarity(a, b)\n",
|
||
"print(f\"1. 方向完全相同: a={a}, b={b}\")\n",
|
||
"print(f\" 相似度 = {sim:.3f} (应该是1.000)\")\n",
|
||
"print()\n",
|
||
"\n",
|
||
"# 示例2:方向完全相反的向量\n",
|
||
"a = np.array([1, 2, 3])\n",
|
||
"b = np.array([-1, -2, -3]) # b是a的相反方向\n",
|
||
"sim = cosine_similarity(a, b)\n",
|
||
"print(f\"2. 方向完全相反: a={a}, b={b}\")\n",
|
||
"print(f\" 相似度 = {sim:.3f} (应该是-1.000)\")\n",
|
||
"print()\n",
|
||
"\n",
|
||
"# 示例3:垂直的向量\n",
|
||
"a = np.array([1, 0])\n",
|
||
"b = np.array([0, 1])\n",
|
||
"sim = cosine_similarity(a, b)\n",
|
||
"print(f\"3. 垂直向量: a={a}, b={b}\")\n",
|
||
"print(f\" 相似度 = {sim:.3f} (应该是0.000)\")\n",
|
||
"print()\n",
|
||
"\n",
|
||
"# 示例4:45度夹角\n",
|
||
"a = np.array([1, 1])\n",
|
||
"b = np.array([1, 0])\n",
|
||
"sim = cosine_similarity(a, b)\n",
|
||
"print(f\"4. 45度夹角: a={a}, b={b}\")\n",
|
||
"print(f\" 相似度 = {sim:.3f} (应该是0.707)\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
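"### What do the values of cosine similarity mean?\n",
"\n",
"| cos(θ) | Angle θ | Interpretation | Illustrative example |\n",
"|--------|---------|----------------|----------------------|\n",
"| 1.0 | 0° | **Same direction** | A vector and any positive multiple of it |\n",
"| 0.8 to 1.0 | 0° to 37° | **Very similar** | \"cat\" vs \"dog\" |\n",
"| 0.5 to 0.8 | 37° to 60° | **Fairly similar** | \"running\" vs \"sports\" |\n",
"| 0.3 to 0.5 | 60° to 73° | **Somewhat similar** | \"apple\" vs \"fruit\" |\n",
"| 0 | 90° | **Unrelated** | \"cat\" vs \"rock\" |\n",
"| -0.5 to 0 | 90° to 120° | **Somewhat opposite** | \"hot\" vs \"cold\" |\n",
"| -1.0 | 180° | **Opposite direction** | \"tall\" vs \"short\" |\n",
"\n",
"The word pairs only illustrate each band; the actual value for a pair depends on how the vectors were produced."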
|
||
]
|
||
},
|
||
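{
"cell_type": "markdown",
"metadata": {},
"source": [
"The angle column in the table above can be checked numerically. The following is a minimal sketch, not part of the original lesson, that converts each boundary value of cos(θ) to degrees with `np.arccos`; the names `boundaries` and `angles` are illustrative.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Boundary values of cos(θ) taken from the table above\n",
"boundaries = [1.0, 0.8, 0.5, 0.3, 0.0, -0.5, -1.0]\n",
"\n",
"# arccos returns radians; convert to degrees for readability\n",
"angles = [float(np.degrees(np.arccos(c))) for c in boundaries]\n",
"\n",
"for c, angle in zip(boundaries, angles):\n",
"    print(f\"cos = {c:5.2f}  ->  angle = {angle:6.2f} degrees\")\n",
"```"
]
},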
{
|
||
"cell_type": "code",
|
||
"execution_count": 14,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"==================================================\n",
|
||
"语义相似度示例(用向量模拟词义)\n",
|
||
"==================================================\n",
|
||
"\n",
|
||
"词向量(简化模拟):\n",
|
||
" 猫 = [0.9 0.1 0.7 0.8 0.9]\n",
|
||
" 狗 = [0.8 0.2 0.6 0.8 0.9]\n",
|
||
" 苹果 = [0.1 0.9 0.9 0. 0. ]\n",
|
||
" 汽车 = [0. 0. 0. 0.9 0. ]\n",
|
||
" 石头 = [0. 0.1 0. 0. 0. ]\n",
|
||
"\n",
|
||
"维度说明: [动物性, 植物性, 可食用性, 移动性, 宠物性]\n",
|
||
"\n",
|
||
"相似度计算结果:\n",
|
||
" 猫 vs 狗: 0.996 (都是动物,都有宠物属性)\n",
|
||
" 猫 vs 苹果: 0.382 (动物vs植物,很不同)\n",
|
||
" 猫 vs 汽车: 0.482 (动物vs机械)\n",
|
||
" 猫 vs 石头: 0.060 (动物vs无机物)\n",
|
||
" 狗 vs 汽车: 0.507 (动物vs机械,但都能移动)\n",
"  苹果 vs 石头: 0.705 (toy-vector artifact: rock's only non-zero dimension is plant-ness, which apple shares)\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# 语义相似度示例\n",
|
||
"import numpy as np\n",
|
||
"\n",
|
||
"print(\"=\" * 50)\n",
|
||
"print(\"语义相似度示例(用向量模拟词义)\")\n",
|
||
"print(\"=\" * 50)\n",
|
||
"print()\n",
|
||
"\n",
|
||
"# 假设这些是词的\"意义向量\"(简化版)\n",
|
||
"# 维度解释: [动物性, 植物性, 可食用性, 移动性, 宠物性]\n",
|
||
"# 每个维度取值0-1,表示该属性的强弱\n",
|
||
"\n",
|
||
"cat = np.array([0.9, 0.1, 0.7, 0.8, 0.9]) # 猫\n",
|
||
"dog = np.array([0.8, 0.2, 0.6, 0.8, 0.9]) # 狗\n",
|
||
"apple = np.array([0.1, 0.9, 0.9, 0.0, 0.0]) # 苹果\n",
|
||
"car = np.array([0.0, 0.0, 0.0, 0.9, 0.0]) # 汽车\n",
|
||
"rock = np.array([0.0, 0.1, 0.0, 0.0, 0.0]) # 石头\n",
|
||
"\n",
|
||
"print(\"词向量(简化模拟):\")\n",
|
||
"print(f\" 猫 = {cat}\")\n",
|
||
"print(f\" 狗 = {dog}\")\n",
|
||
"print(f\" 苹果 = {apple}\")\n",
|
||
"print(f\" 汽车 = {car}\")\n",
|
||
"print(f\" 石头 = {rock}\")\n",
|
||
"print()\n",
|
||
"print(\"维度说明: [动物性, 植物性, 可食用性, 移动性, 宠物性]\")\n",
|
||
"print()\n",
|
||
"\n",
|
||
"# 计算相似度\n",
|
||
"print(\"相似度计算结果:\")\n",
|
||
"print(f\" 猫 vs 狗: {cosine_similarity(cat, dog):.3f} (都是动物,都有宠物属性)\")\n",
|
||
"print(f\" 猫 vs 苹果: {cosine_similarity(cat, apple):.3f} (动物vs植物,很不同)\")\n",
|
||
"print(f\" 猫 vs 汽车: {cosine_similarity(cat, car):.3f} (动物vs机械)\")\n",
|
||
"print(f\" 猫 vs 石头: {cosine_similarity(cat, rock):.3f} (动物vs无机物)\")\n",
|
||
"print(f\" 狗 vs 汽车: {cosine_similarity(dog, car):.3f} (动物vs机械,但都能移动)\")\n",
"print(f\"  苹果 vs 石头: {cosine_similarity(apple, rock):.3f} (toy-vector artifact: rock's only non-zero dimension is plant-ness, which apple shares)\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 15,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"==================================================\n",
|
||
"练习题4答案\n",
|
||
"==================================================\n",
|
||
"A = [1 2 3], B = [4 5 6]\n",
|
||
"\n",
|
||
"1. 点积 A · B = 32\n",
|
||
" 计算: 1×4 + 2×5 + 3×6 = 4 + 10 + 18 = 32\n",
|
||
"\n",
|
||
"2. 余弦相似度 = 0.9746\n",
|
||
"\n",
|
||
"3. A=[1,0], B=[0,1] 的余弦相似度 = 0.0\n",
|
||
" 原因:这两个向量垂直,夹角90°,cos(90°)=0\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# 练习题4答案\n",
|
||
"import numpy as np\n",
|
||
"\n",
|
||
"print(\"=\" * 50)\n",
|
||
"print(\"练习题4答案\")\n",
|
||
"print(\"=\" * 50)\n",
|
||
"\n",
|
||
"A = np.array([1, 2, 3])\n",
|
||
"B = np.array([4, 5, 6])\n",
|
||
"\n",
|
||
"print(f\"A = {A}, B = {B}\")\n",
|
||
"print()\n",
|
||
"\n",
|
||
"# 1. 点积\n",
|
||
"dot = np.dot(A, B)\n",
|
||
"print(f\"1. 点积 A · B = {dot}\")\n",
|
||
"print(f\" 计算: 1×4 + 2×5 + 3×6 = 4 + 10 + 18 = {dot}\")\n",
|
||
"print()\n",
|
||
"\n",
|
||
"# 2. 余弦相似度\n",
|
||
"cos_sim = cosine_similarity(A, B)\n",
|
||
"print(f\"2. 余弦相似度 = {cos_sim:.4f}\")\n",
|
||
"print()\n",
|
||
"\n",
|
||
"# 3. 垂直向量的相似度\n",
|
||
"A = np.array([1, 0])\n",
|
||
"B = np.array([0, 1])\n",
|
||
"cos_sim = cosine_similarity(A, B)\n",
|
||
"print(f\"3. A=[1,0], B=[0,1] 的余弦相似度 = {cos_sim}\")\n",
|
||
"print(\" 原因:这两个向量垂直,夹角90°,cos(90°)=0\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"---\n",
|
||
"\n",
|
||
"# 第五部分:文本向量化的核心思想\n",
|
||
"\n",
"## 5.1 Core Goal: Turn All Text into Vectors\n",
"\n",
"```\n",
"Text (symbols) ──→ numeric vector ──→ computable ──→ AI model\n",
"\n",
"  \"cat\"  ──→  [0.9, 0.1, 0.8]\n",
"  \"dog\"  ──→  [0.8, 0.2, 0.7]\n",
"```\n",
"\n",
"### Why vectors?\n",
"\n",
"| Straightforward on vectors | Hard on raw symbols |\n",
"|----------------------------|---------------------|\n",
"| Addition and subtraction: v1 + v2 | String comparison: \"Python\" == \"Java\" only says the strings differ, not how related they are |\n",
"| Dot product: v1 · v2 | Word analogy: is \"cat\" similar to \"dog\"? |\n",
"| Distance: \\|v1 - v2\\| | Semantic understanding: \"你好\" (hello) is a greeting |\n",
"| Cosine similarity: cos(θ) | Sentiment: is the slang \"绝了\" praise or an insult? |"
|
||
]
|
||
},
|
||
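{
"cell_type": "markdown",
"metadata": {},
"source": [
"The table above lists vector distance among the operations that are easy on vectors. A minimal sketch, assuming the toy vectors for \"猫\" (cat) and \"狗\" (dog) from the diagram above and Euclidean distance as the metric; the variable names are illustrative.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Toy vectors from the diagram above (illustrative values, not learned)\n",
"cat = np.array([0.9, 0.1, 0.8])\n",
"dog = np.array([0.8, 0.2, 0.7])\n",
"\n",
"# Euclidean distance |v1 - v2| is the length of the difference vector\n",
"distance = float(np.linalg.norm(cat - dog))\n",
"\n",
"print(f\"difference = {cat - dog}\")\n",
"print(f\"distance   = {distance:.4f}\")\n",
"```\n",
"\n",
"Unlike cosine similarity, a smaller distance means the vectors are closer, and the value depends on vector length as well as direction."
]
},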
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
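"## 5.2 Vectorization Example: From Words to Numbers\n",
"\n",
"### Method 1: One-hot encoding (position only, no semantics)\n",
"\n",
"```python\n",
"# Suppose we have a tiny vocabulary of only 5 words\n",
"vocab = [\"猫\", \"狗\", \"鱼\", \"苹果\", \"香蕉\"]  # cat, dog, fish, apple, banana\n",
"\n",
"# One-hot encoding: each word gets its own position\n",
"# \"猫\" (cat)     → [1, 0, 0, 0, 0]   position 1 is 1, the rest are 0\n",
"# \"狗\" (dog)     → [0, 1, 0, 0, 0]   position 2 is 1, the rest are 0\n",
"# \"苹果\" (apple) → [0, 0, 0, 1, 0]   position 4 is 1, the rest are 0\n",
"```\n",
"\n",
"**Problem**: a one-hot vector only records the word's position in the vocabulary; it carries no semantic information.\n",
"\n",
"```\n",
"\"cat\" = [1, 0, 0, 0, 0]\n",
"\"dog\" = [0, 1, 0, 0, 0]\n",
"\n",
"cosine similarity = 0   (completely dissimilar)\n",
"\n",
"Yet \"cat\" and \"dog\" are both animals and should be quite similar!\n",
"```\n",
"\n",
"The code below labels this scheme 位置编码 (\"position encoding\"); the standard name is one-hot encoding, and it is unrelated to the positional encodings used in Transformer models."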
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 16,
|
||
"metadata": {
|
||
"scrolled": true
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"==================================================\n",
|
||
"位置编码的缺陷\n",
|
||
"==================================================\n",
|
||
"位置编码向量:\n",
|
||
" 猫 = [1 0 0 0 0]\n",
|
||
" 狗 = [0 1 0 0 0]\n",
|
||
" 苹果 = [0 0 0 1 0]\n",
|
||
"\n",
|
||
"余弦相似度(用位置编码):\n",
|
||
" 猫 vs 狗: 0.000\n",
|
||
" 猫 vs 苹果: 0.000\n",
|
||
"\n",
|
||
"问题:猫和狗都是动物,相似度却是0!\n",
|
||
" 猫和苹果不是同类,相似度也是0!\n",
|
||
" 位置编码没有语义信息!\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# 位置编码的缺陷演示\n",
|
||
"import numpy as np\n",
|
||
"\n",
|
||
"print(\"=\" * 50)\n",
|
||
"print(\"位置编码的缺陷\")\n",
|
||
"print(\"=\" * 50)\n",
|
||
"\n",
|
||
"# 位置编码向量\n",
|
||
"cat_onehot = np.array([1, 0, 0, 0, 0]) # \"猫\"\n",
|
||
"dog_onehot = np.array([0, 1, 0, 0, 0]) # \"狗\"\n",
|
||
"apple_onehot = np.array([0, 0, 0, 1, 0]) # \"苹果\"\n",
|
||
"\n",
|
||
"print(\"位置编码向量:\")\n",
|
||
"print(f\" 猫 = {cat_onehot}\")\n",
|
||
"print(f\" 狗 = {dog_onehot}\")\n",
|
||
"print(f\" 苹果 = {apple_onehot}\")\n",
|
||
"print()\n",
|
||
"\n",
|
||
"# 相似度计算\n",
|
||
"print(\"余弦相似度(用位置编码):\")\n",
|
||
"print(f\" 猫 vs 狗: {cosine_similarity(cat_onehot, dog_onehot):.3f}\")\n",
|
||
"print(f\" 猫 vs 苹果: {cosine_similarity(cat_onehot, apple_onehot):.3f}\")\n",
|
||
"print()\n",
|
||
"print(\"问题:猫和狗都是动物,相似度却是0!\")\n",
|
||
"print(\" 猫和苹果不是同类,相似度也是0!\")\n",
|
||
"print(\" 位置编码没有语义信息!\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
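"### Method 2: Semantic encoding (carries meaning)\n",
"\n",
"```python\n",
"# Semantic encoding: each word is described by its meaning\n",
"# Dimensions: [animal-ness, plant-ness, edibility, pet-ness]\n",
"\n",
"cat = np.array([0.9, 0.1, 0.7, 0.9])    # 猫 (cat)\n",
"dog = np.array([0.8, 0.2, 0.6, 0.9])    # 狗 (dog)\n",
"apple = np.array([0.1, 0.9, 0.9, 0.0])  # 苹果 (apple)\n",
"```\n",
"\n",
"**This is the power of text vectorization: it turns meaning into numbers that can be computed with.** The values here are hand-picked for illustration; real word vectors are learned from data."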
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 17,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"==================================================\n",
|
||
"语义编码的优点\n",
|
||
"==================================================\n",
|
||
"语义编码向量:\n",
|
||
" 猫 = [0.9 0.1 0.7 0.9]\n",
|
||
" 狗 = [0.8 0.2 0.6 0.9]\n",
|
||
" 苹果 = [0.1 0.9 0.9 0. ]\n",
|
||
"\n",
|
||
"维度说明: [动物性, 植物性, 可食用性, 宠物性]\n",
|
||
"\n",
|
||
"余弦相似度(用语义编码):\n",
|
||
" 猫 vs 狗: 0.995 (都是动物,都有宠物属性)\n",
|
||
" 猫 vs 苹果: 0.436 (动物vs植物)\n",
|
||
"\n",
|
||
"太棒了!语义编码可以捕捉到词的语义相似性!\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# 语义编码的优点演示\n",
|
||
"import numpy as np\n",
|
||
"\n",
|
||
"print(\"=\" * 50)\n",
|
||
"print(\"语义编码的优点\")\n",
|
||
"print(\"=\" * 50)\n",
|
||
"\n",
|
||
"# 语义编码向量\n",
|
||
"# 维度: [动物性, 植物性, 可食用性, 宠物性]\n",
|
||
"cat = np.array([0.9, 0.1, 0.7, 0.9]) # 猫\n",
|
||
"dog = np.array([0.8, 0.2, 0.6, 0.9]) # 狗\n",
|
||
"apple = np.array([0.1, 0.9, 0.9, 0.0]) # 苹果\n",
|
||
"\n",
|
||
"print(\"语义编码向量:\")\n",
|
||
"print(f\" 猫 = {cat}\")\n",
|
||
"print(f\" 狗 = {dog}\")\n",
|
||
"print(f\" 苹果 = {apple}\")\n",
|
||
"print()\n",
|
||
"print(\"维度说明: [动物性, 植物性, 可食用性, 宠物性]\")\n",
|
||
"print()\n",
|
||
"\n",
|
||
"# 相似度计算\n",
|
||
"print(\"余弦相似度(用语义编码):\")\n",
|
||
"print(f\" 猫 vs 狗: {cosine_similarity(cat, dog):.3f} (都是动物,都有宠物属性)\")\n",
|
||
"print(f\" 猫 vs 苹果: {cosine_similarity(cat, apple):.3f} (动物vs植物)\")\n",
|
||
"print()\n",
|
||
"print(\"太棒了!语义编码可以捕捉到词的语义相似性!\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"## 5.3 Evolution of Vectorization Methods\n",
"\n",
"```\n",
"Three main approaches to text vectorization:\n",
"\n",
"[ BoW ]        ───→  [ TF-IDF ]          ───→  [ Word Embedding ]\n",
" bag of words        term weighting            dense word vectors\n",
"\n",
" raw counts          adds word importance      encodes meaning\n",
" no semantics        still no semantics        learned semantics\n",
"\n",
" 1950s               1970s                     2013 onward (word2vec)\n",
"```"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"---\n",
|
||
"\n",
|
||
"# 第六部分:BoW词袋模型\n",
|
||
"\n",
"## 6.1 How It Works\n",
"\n",
"Treat a text as a \"bag of words\": **ignore word order** and only count how many times each word occurs.\n",
"\n",
"```\n",
"Text 1: \"Python 是 编程 语言\"   (Python is a programming language)\n",
"Text 2: \"Java 是 编程 语言\"     (Java is a programming language)\n",
"\n",
"After word segmentation:\n",
"  Doc1: [\"Python\", \"是\", \"编程\", \"语言\"]\n",
"  Doc2: [\"Java\", \"是\", \"编程\", \"语言\"]\n",
"\n",
"Build the vocabulary (the set of all words in all documents, sorted):\n",
"  Vocabulary: [\"Java\", \"Python\", \"是\", \"编程\", \"语言\"]\n",
"\n",
"Vectorize: count the occurrences of each vocabulary word\n",
"  Doc1 → [0, 1, 1, 1, 1]   # Java: 0, Python: 1, ...\n",
"  Doc2 → [1, 0, 1, 1, 1]   # Java: 1, Python: 0, ...\n",
"```"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 18,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"==================================================\n",
|
||
"BoW词袋模型演示(手动实现)\n",
|
||
"==================================================\n",
|
||
"【示例1】文档集合:\n",
|
||
" Doc1: Python 是 编程 语言\n",
|
||
" Doc2: Java 是 编程 语言\n",
|
||
"\n",
|
||
"词表: ['Java', 'Python', '是', '编程', '语言']\n",
|
||
"\n",
|
||
"BoW矩阵(每行是一个文档,每列是一个词):\n",
|
||
" Doc1: [0, 1, 1, 1, 1]\n",
|
||
" Doc2: [1, 0, 1, 1, 1]\n",
|
||
"\n",
|
||
"详细解释:\n",
|
||
"\n",
|
||
"Doc1: Python 是 编程 语言\n",
|
||
" -> 'Python' 出现 1 次\n",
|
||
" -> '是' 出现 1 次\n",
|
||
" -> '编程' 出现 1 次\n",
|
||
" -> '语言' 出现 1 次\n",
|
||
"\n",
|
||
"Doc2: Java 是 编程 语言\n",
|
||
" -> 'Java' 出现 1 次\n",
|
||
" -> '是' 出现 1 次\n",
|
||
" -> '编程' 出现 1 次\n",
|
||
" -> '语言' 出现 1 次\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# BoW词袋模型演示(纯Python实现,不依赖sklearn)\n",
|
||
"\n",
|
||
"print(\"=\" * 50)\n",
|
||
"print(\"BoW词袋模型演示(手动实现)\")\n",
|
||
"print(\"=\" * 50)\n",
|
||
"\n",
|
||
"def simple_bow(docs):\n",
|
||
" \"\"\"\n",
|
||
" 简单的BoW实现\n",
|
||
" \n",
|
||
" 参数:\n",
|
||
" docs: 文档列表,每篇文档已经是分词后的词列表\n",
|
||
" 返回:\n",
|
||
" vocab: 词表(有序列表)\n",
|
||
" bow_matrix: BoW矩阵 (n_docs x n_vocab)\n",
|
||
" \"\"\"\n",
|
||
" # 1. 构建词表\n",
|
||
" vocab_set = set()\n",
|
||
" for doc in docs:\n",
|
||
" vocab_set.update(doc)\n",
|
||
" vocab = sorted(list(vocab_set)) # 排序保证顺序一致\n",
|
||
" \n",
|
||
" # 2. 构建BoW矩阵\n",
|
||
" bow_matrix = []\n",
|
||
" for doc in docs:\n",
|
||
" vec = [0] * len(vocab)\n",
|
||
" for word in doc:\n",
|
||
" if word in vocab:\n",
|
||
" vec[vocab.index(word)] += 1\n",
|
||
" bow_matrix.append(vec)\n",
|
||
" \n",
|
||
" return vocab, bow_matrix\n",
|
||
"\n",
|
||
"\n",
|
||
"# 示例1:中文文档(用空格分词)\n",
|
||
"docs = [\n",
|
||
" [\"Python\", \"是\", \"编程\", \"语言\"],\n",
|
||
" [\"Java\", \"是\", \"编程\", \"语言\"],\n",
|
||
"]\n",
|
||
"\n",
|
||
"vocab, bow_matrix = simple_bow(docs)\n",
|
||
"\n",
|
||
"print(\"【示例1】文档集合:\")\n",
|
||
"for i, doc in enumerate(docs):\n",
|
||
" print(f\" Doc{i+1}: {' '.join(doc)}\")\n",
|
||
"print()\n",
|
||
"\n",
|
||
"print(f\"词表: {vocab}\")\n",
|
||
"print()\n",
|
||
"\n",
|
||
"print(\"BoW矩阵(每行是一个文档,每列是一个词):\")\n",
|
||
"for i, vec in enumerate(bow_matrix):\n",
|
||
" print(f\" Doc{i+1}: {vec}\")\n",
|
||
"print()\n",
|
||
"\n",
|
||
"# 详细解释\n",
|
||
"print(\"详细解释:\")\n",
|
||
"for i, doc in enumerate(docs):\n",
|
||
" print(f\"\\nDoc{i+1}: {' '.join(doc)}\")\n",
|
||
" for j, word in enumerate(vocab):\n",
|
||
" if bow_matrix[i][j] > 0:\n",
|
||
" print(f\" -> '{word}' 出现 {bow_matrix[i][j]} 次\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 19,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"==================================================\n",
|
||
"BoW词袋模型:更多示例\n",
|
||
"==================================================\n",
|
||
"文档集合:\n",
|
||
" Doc1: 我 爱 Python 编程\n",
|
||
" Doc2: Python 很 好 学\n",
|
||
" Doc3: 我 爱 写 代码\n",
|
||
"\n",
|
||
"词表: ['Python', '代码', '写', '好', '学', '很', '我', '爱', '编程']\n",
|
||
"\n",
|
||
"BoW矩阵:\n",
|
||
" Doc1: [1, 0, 0, 0, 0, 0, 1, 1, 1]\n",
|
||
" Doc2: [1, 0, 0, 1, 1, 1, 0, 0, 0]\n",
|
||
" Doc3: [0, 1, 1, 0, 0, 0, 1, 1, 0]\n",
|
||
"\n",
|
||
"表格形式:\n",
|
||
"Doc | Python | 代码 | 写 | 好 | 学 | 很\n",
|
||
"----------------------------------\n",
|
||
"Doc1 | 1 | 0 | 0 | 0 | 0 | 0\n",
|
||
"Doc2 | 1 | 0 | 0 | 1 | 1 | 1\n",
|
||
"Doc3 | 0 | 1 | 1 | 0 | 0 | 0\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# 更多BoW示例(纯Python实现)\n",
|
||
"import numpy as np\n",
|
||
"\n",
|
||
"print(\"=\" * 50)\n",
|
||
"print(\"BoW词袋模型:更多示例\")\n",
|
||
"print(\"=\" * 50)\n",
|
||
"\n",
|
||
"def simple_bow(docs):\n",
|
||
" \"\"\"简单的BoW实现\"\"\"\n",
|
||
" vocab_set = set()\n",
|
||
" for doc in docs:\n",
|
||
" vocab_set.update(doc)\n",
|
||
" vocab = sorted(list(vocab_set))\n",
|
||
" bow_matrix = []\n",
|
||
" for doc in docs:\n",
|
||
" vec = [0] * len(vocab)\n",
|
||
" for word in doc:\n",
|
||
" if word in vocab:\n",
|
||
" vec[vocab.index(word)] += 1\n",
|
||
" bow_matrix.append(vec)\n",
|
||
" return vocab, bow_matrix\n",
|
||
"\n",
|
||
"docs = [\n",
|
||
" [\"我\", \"爱\", \"Python\", \"编程\"],\n",
|
||
" [\"Python\", \"很\", \"好\", \"学\"],\n",
|
||
" [\"我\", \"爱\", \"写\", \"代码\"]\n",
|
||
"]\n",
|
||
"\n",
|
||
"vocab, bow_matrix = simple_bow(docs)\n",
|
||
"\n",
|
||
"print(\"文档集合:\")\n",
|
||
"for i, doc in enumerate(docs):\n",
|
||
" print(f\" Doc{i+1}: {' '.join(doc)}\")\n",
|
||
"print()\n",
|
||
"\n",
|
||
"print(f\"词表: {vocab}\")\n",
|
||
"print()\n",
|
||
"\n",
|
||
"print(\"BoW矩阵:\")\n",
|
||
"for i, vec in enumerate(bow_matrix):\n",
|
||
" print(f\" Doc{i+1}: {vec}\")\n",
|
||
"\n",
|
||
"print()\n",
|
||
"\n",
|
||
"# 显示成表格\n",
|
||
"print(\"表格形式:\")\n",
|
||
"header = \"Doc | \" + \" | \".join(vocab[:6])\n",
|
||
"print(header)\n",
|
||
"print(\"-\" * len(header))\n",
|
||
"for i, row in enumerate(bow_matrix):\n",
|
||
" print(f\"Doc{i+1} | \" + \" | \".join(map(str, row[:6])))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
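"## 6.2 Strengths and Weaknesses of BoW\n",
"\n",
"| Strengths | Weaknesses |\n",
"|-----------|------------|\n",
"| **Simple and intuitive** | Ignores word order: \"我爱你\" (I love you) and \"你爱我\" (you love me) get identical vectors |\n",
"| **Easy to implement** | Treats all words as equally important |\n",
"| **Fast to compute** | Cannot capture semantics |\n",
"| **Good as a baseline model** | Cannot handle synonyms: \"电脑\" and \"计算机\" (both mean computer) are unrelated dimensions |\n",
"| | High-dimensional and sparse: one dimension per vocabulary word |"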
|
||
]
|
||
},
|
||
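{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `simple_bow` function above finds each word's column with `vocab.index(word)`, which scans the list on every lookup. The following is a minimal alternative sketch, not the lesson's implementation: it builds a word-to-column dictionary once and counts with `collections.Counter`. The names `bow_with_index` and `column` are hypothetical.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"def bow_with_index(docs):\n",
"    # Sorted vocabulary, the same ordering simple_bow produces\n",
"    vocab = sorted({word for doc in docs for word in doc})\n",
"    # Map each word to its column once, so each lookup is a dict access\n",
"    column = {word: i for i, word in enumerate(vocab)}\n",
"    matrix = []\n",
"    for doc in docs:\n",
"        vec = [0] * len(vocab)\n",
"        for word, count in Counter(doc).items():\n",
"            vec[column[word]] = count\n",
"        matrix.append(vec)\n",
"    return vocab, matrix\n",
"\n",
"docs = [\n",
"    [\"Python\", \"是\", \"编程\", \"语言\"],\n",
"    [\"Java\", \"是\", \"编程\", \"语言\"],\n",
"    [\"Python\", \"Python\", \"Python\"],\n",
"]\n",
"vocab, matrix = bow_with_index(docs)\n",
"print(vocab)\n",
"for row in matrix:\n",
"    print(row)\n",
"```\n",
"\n",
"For a vocabulary of a few words the difference is negligible; the dictionary matters once the vocabulary grows to thousands of entries."
]
},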
{
|
||
"cell_type": "code",
|
||
"execution_count": 20,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"==================================================\n",
|
||
"BoW忽略词序的演示\n",
|
||
"==================================================\n",
|
||
"文档:\n",
|
||
" Doc1: 我爱你\n",
|
||
" Doc2: 你爱我\n",
|
||
" Doc3: 爱你我\n",
|
||
"\n",
|
||
"BoW矩阵:\n",
|
||
" Doc1: [1, 1, 1, 0]\n",
|
||
" Doc2: [1, 1, 1, 0]\n",
|
||
" Doc3: [0, 0, 0, 1]\n",
|
||
"\n",
|
||
"词表: ['你', '我', '爱', '爱你我']\n",
|
||
"\n",
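"Problem: Doc1 and Doc2 mean different things, yet their BoW vectors are identical!\n",
"Doc1: 我爱你 (I love you)\n",
"Doc2: 你爱我 (you love me)\n",
"Doc3: 爱你我 (not segmented, so the whole string counts as one word and shares no dimension with the others)\n",
"\n",
"Conclusion: BoW discards word order, and it depends on correct word segmentation.\n"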
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# BoW忽略词序的演示\n",
|
||
"import numpy as np\n",
|
||
"\n",
|
||
"print(\"=\" * 50)\n",
|
||
"print(\"BoW忽略词序的演示\")\n",
|
||
"print(\"=\" * 50)\n",
|
||
"\n",
|
||
"def simple_bow(docs):\n",
|
||
" \"\"\"简单的BoW实现\"\"\"\n",
|
||
" vocab_set = set()\n",
|
||
" for doc in docs:\n",
|
||
" vocab_set.update(doc)\n",
|
||
" vocab = sorted(list(vocab_set))\n",
|
||
" bow_matrix = []\n",
|
||
" for doc in docs:\n",
|
||
" vec = [0] * len(vocab)\n",
|
||
" for word in doc:\n",
|
||
" if word in vocab:\n",
|
||
" vec[vocab.index(word)] += 1\n",
|
||
" bow_matrix.append(vec)\n",
|
||
" return vocab, bow_matrix\n",
|
||
"\n",
|
||
"# 两个完全不同的句子,但BoW向量相同\n",
|
||
"docs = [\n",
|
||
" [\"我\", \"爱\", \"你\"], # 正常语序\n",
|
||
" [\"你\", \"爱\", \"我\"], # 完全相反\n",
|
||
" [\"爱你我\"], # 没有空格(中文连续)\n",
|
||
"]\n",
|
||
"\n",
|
||
"vocab, bow_matrix = simple_bow(docs)\n",
|
||
"\n",
|
||
"print(\"文档:\")\n",
|
||
"for i, doc in enumerate(docs):\n",
|
||
" print(f\" Doc{i+1}: {''.join(doc)}\")\n",
|
||
"print()\n",
|
||
"\n",
|
||
"print(\"BoW矩阵:\")\n",
|
||
"for i, vec in enumerate(bow_matrix):\n",
|
||
" print(f\" Doc{i+1}: {vec}\")\n",
|
||
"print()\n",
|
||
"\n",
|
||
"print(f\"词表: {vocab}\")\n",
|
||
"print()\n",
|
||
"\n",
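"print(\"Problem: Doc1 and Doc2 mean different things, yet their BoW vectors are identical!\")\n",
"print(\"Doc1: 我爱你 (I love you)\")\n",
"print(\"Doc2: 你爱我 (you love me)\")\n",
"print(\"Doc3: 爱你我 (not segmented, so the whole string counts as one word and shares no dimension with the others)\")\n",
"print()\n",
"print(\"Conclusion: BoW discards word order, and it depends on correct word segmentation.\")"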
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 21,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
      "==================================================\n",
      "Exercise 5: answer\n",
      "==================================================\n",
      "Document collection:\n",
      "  Doc1: Python 是 编程 语言\n",
      "  Doc2: Java 是 编程 语言\n",
      "  Doc3: Python Python Python\n",
      "\n",
      "Vocabulary: ['Java', 'Python', '是', '编程', '语言']\n",
      "\n",
      "BoW matrix (each row is one document's vector):\n",
      "  Doc1: [0, 1, 1, 1, 1]\n",
      "  Doc2: [1, 0, 1, 1, 1]\n",
      "  Doc3: [0, 3, 0, 0, 0]\n",
      "\n",
      "Breakdown:\n",
      "  Doc1: [0, 1, 1, 1, 1]\n",
      "    - 'Python' occurs 1 time(s)\n",
      "    - '是' occurs 1 time(s)\n",
      "    - '编程' occurs 1 time(s)\n",
      "    - '语言' occurs 1 time(s)\n",
      "  Doc2: [1, 0, 1, 1, 1]\n",
      "    - 'Java' occurs 1 time(s)\n",
      "    - '是' occurs 1 time(s)\n",
      "    - '编程' occurs 1 time(s)\n",
      "    - '语言' occurs 1 time(s)\n",
      "  Doc3: [0, 3, 0, 0, 0]\n",
      "    - 'Python' occurs 3 time(s)\n"
     ]
    }
   ],
   "source": [
    "# Exercise 5: answer\n",
    "print(\"=\" * 50)\n",
    "print(\"Exercise 5: answer\")\n",
    "print(\"=\" * 50)\n",
    "\n",
    "def simple_bow(docs):\n",
    "    \"\"\"Minimal bag-of-words: returns (sorted vocabulary, count matrix).\"\"\"\n",
    "    vocab_set = set()\n",
    "    for doc in docs:\n",
    "        vocab_set.update(doc)\n",
    "    vocab = sorted(list(vocab_set))\n",
    "    bow_matrix = []\n",
    "    for doc in docs:\n",
    "        vec = [0] * len(vocab)\n",
    "        for word in doc:\n",
    "            if word in vocab:\n",
    "                vec[vocab.index(word)] += 1\n",
    "        bow_matrix.append(vec)\n",
    "    return vocab, bow_matrix\n",
    "\n",
    "docs = [\n",
    "    [\"Python\", \"是\", \"编程\", \"语言\"],\n",
    "    [\"Java\", \"是\", \"编程\", \"语言\"],\n",
    "    [\"Python\", \"Python\", \"Python\"]\n",
    "]\n",
    "\n",
    "vocab, bow_matrix = simple_bow(docs)\n",
    "\n",
    "print(\"Document collection:\")\n",
    "for i, doc in enumerate(docs):\n",
    "    print(f\"  Doc{i+1}: {' '.join(doc)}\")\n",
    "print()\n",
    "\n",
    "print(f\"Vocabulary: {vocab}\")\n",
    "print()\n",
    "\n",
    "print(\"BoW matrix (each row is one document's vector):\")\n",
    "for i, vec in enumerate(bow_matrix):\n",
    "    print(f\"  Doc{i+1}: {vec}\")\n",
    "print()\n",
    "\n",
    "print(\"Breakdown:\")\n",
    "for i, vec in enumerate(bow_matrix):\n",
    "    print(f\"  Doc{i+1}: {vec}\")\n",
    "    for j, count in enumerate(vec):\n",
    "        if count > 0:\n",
    "            print(f\"    - '{vocab[j]}' occurs {count} time(s)\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
    "---\n",
    "\n",
    "# Part 7: TF-IDF\n",
    "\n",
    "## 7.1 Why do we need TF-IDF?\n",
    "\n",
    "**The problem with BoW**: every word is treated as equally important!\n",
    "\n",
    "```\n",
    "Doc A: \"Python 是 编程 语言,Python Python Python\"\n",
    "Doc B: \"Python 是 编程 语言\"\n",
    "\n",
    "BoW result:\n",
    "  Doc A: Python=4, 是=1, 编程=1, 语言=1\n",
    "  Doc B: Python=1, 是=1, 编程=1, 语言=1\n",
    "\n",
    "Problem: raw counts only say how often a word occurs inside one document.\n",
    "         In Doc B the function word \"是\" counts exactly as much as \"编程\",\n",
    "         and nothing tells us whether \"Python\" is distinctive across documents.\n",
    "         We want a weight that reflects how informative each word is.\n",
    "```\n",
    "\n",
    "**Key insights**:\n",
    "- A frequent word is not necessarily important (\"的\" and \"了\" appear in almost every article)\n",
    "- A rare word is not necessarily unimportant (\"TensorFlow\" appears only in AI articles, so it is highly informative)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
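    "## 7.2 The TF-IDF formula\n",
    "\n",
    "**TF-IDF = term frequency (TF) × inverse document frequency (IDF)**\n",
    "\n",
    "```\n",
    "TF  = number of times the word occurs in this document\n",
    "IDF = log(total number of documents / number of documents containing the word)\n",
    "\n",
    "TF-IDF = TF × IDF\n",
    "```\n",
    "\n",
    "### What IDF means\n",
    "\n",
    "| Word | Documents containing it | IDF | Interpretation |\n",
    "|------|-------------------------|-----|----------------|\n",
    "| \"的\" | All documents | log(N/N) = log(1) = 0 | Appears everywhere, uninformative |\n",
    "| \"Python\" | A minority of documents | log(fairly large ratio) = high | Fairly distinctive, important |\n",
    "| \"TensorFlow\" | Very few documents | log(large ratio) = higher | Very distinctive, very important |\n",
    "| \"AI\" | Only 1 document | log(N/1) = highest | Most distinctive, most important |\n",
    "\n",
    "Here N is the total number of documents."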
|
||
]
|
||
},
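  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Sketch: two IDF variants side by side.** The formula in 7.2 is the plain `log(N / df)`, while the `simple_tfidf` function in this notebook computes `log(N / (df + 1)) + 1`. The cell below is a small illustration (the corpus size `n_docs = 3` is an assumed toy value) that prints both, so the numbers in the later demos are easier to read: under the plain formula a word found in every document gets weight 0, whereas the smoothed variant still gives it a positive weight."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch: plain IDF vs. the smoothed IDF used by simple_tfidf\n",
    "import math\n",
    "\n",
    "n_docs = 3  # assumed toy corpus size\n",
    "for df in (1, 2, 3):  # df = number of documents containing the word\n",
    "    plain = math.log(n_docs / df)               # formula from section 7.2\n",
    "    smoothed = math.log(n_docs / (df + 1)) + 1  # variant used in this notebook's code\n",
    "    print(f\"df={df}: plain IDF = {plain:.4f}, smoothed IDF = {smoothed:.4f}\")"
   ]
  },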
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 22,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
      "==================================================\n",
      "TF-IDF (term frequency-inverse document frequency) demo\n",
      "==================================================\n",
      "Document collection:\n",
      "  Doc1: Python 编程 语言\n",
      "  Doc2: Python Python Python\n",
      "  Doc3: Java 编程 语言\n",
      "\n",
      "Vocabulary: ['Java', 'Python', '编程', '语言']\n",
      "\n",
      "IDF values: [1.4055, 1.0, 1.0, 1.0]\n",
      "\n",
      "TF-IDF matrix:\n",
      "  Doc1: [0.0, 1.0, 1.0, 1.0]\n",
      "  Doc2: [0.0, 3.0, 0.0, 0.0]\n",
      "  Doc3: [1.4055, 0.0, 1.0, 1.0]\n",
      "\n",
      "Detailed breakdown:\n",
      "\n",
      "Doc1: Python 编程 语言\n",
      "  'Python': TF-IDF = 1.0000\n",
      "  '编程': TF-IDF = 1.0000\n",
      "  '语言': TF-IDF = 1.0000\n",
      "\n",
      "Doc2: Python Python Python\n",
      "  'Python': TF-IDF = 3.0000\n",
      "\n",
      "Doc3: Java 编程 语言\n",
      "  'Java': TF-IDF = 1.4055\n",
      "  '编程': TF-IDF = 1.0000\n",
      "  '语言': TF-IDF = 1.0000\n"
     ]
    }
   ],
   "source": [
    "# TF-IDF demo (pure Python implementation)\n",
    "import math\n",
    "\n",
    "print(\"=\" * 50)\n",
    "print(\"TF-IDF (term frequency-inverse document frequency) demo\")\n",
    "print(\"=\" * 50)\n",
    "\n",
    "def simple_tfidf(docs):\n",
    "    \"\"\"\n",
    "    A simple TF-IDF implementation\n",
    "    \n",
    "    Args:\n",
    "        docs: list of documents, each already tokenized into a list of words\n",
    "    Returns:\n",
    "        vocab: the vocabulary (sorted)\n",
    "        tfidf_matrix: the TF-IDF matrix, one row per document\n",
    "        idf: the IDF value of each vocabulary word\n",
    "    \"\"\"\n",
    "    # 1. Build the vocabulary\n",
    "    vocab_set = set()\n",
    "    for doc in docs:\n",
    "        vocab_set.update(doc)\n",
    "    vocab = sorted(list(vocab_set))\n",
    "    \n",
    "    # 2. Build the BoW (raw count) matrix\n",
    "    bow = []\n",
    "    for doc in docs:\n",
    "        vec = [0] * len(vocab)\n",
    "        for word in doc:\n",
    "            if word in vocab:\n",
    "                vec[vocab.index(word)] += 1\n",
    "        bow.append(vec)\n",
    "    \n",
    "    n_docs = len(docs)\n",
    "    \n",
    "    # 3. Compute IDF (smoothed variant log(N / (df + 1)) + 1, not the plain log(N / df))\n",
    "    idf = []\n",
    "    for j, word in enumerate(vocab):\n",
    "        df = sum(1 for vec in bow if vec[j] > 0)  # number of documents containing the word\n",
    "        idf_j = math.log(n_docs / (df + 1)) + 1\n",
    "        idf.append(idf_j)\n",
    "    \n",
    "    # 4. Compute TF-IDF = raw count × IDF\n",
    "    tfidf = []\n",
    "    for vec in bow:\n",
    "        tfidf_vec = []\n",
    "        for i, tf in enumerate(vec):\n",
    "            tfidf_vec.append(tf * idf[i])\n",
    "        tfidf.append(tfidf_vec)\n",
    "    \n",
    "    return vocab, tfidf, idf\n",
    "\n",
    "docs = [\n",
    "    [\"Python\", \"编程\", \"语言\"],\n",
    "    [\"Python\", \"Python\", \"Python\"],  # Python occurs 3 times\n",
    "    [\"Java\", \"编程\", \"语言\"],\n",
    "]\n",
    "\n",
    "vocab, tfidf_matrix, idf = simple_tfidf(docs)\n",
    "\n",
    "print(\"Document collection:\")\n",
    "for i, doc in enumerate(docs):\n",
    "    print(f\"  Doc{i+1}: {' '.join(doc)}\")\n",
    "print()\n",
    "\n",
    "print(f\"Vocabulary: {vocab}\")\n",
    "print()\n",
    "print(f\"IDF values: {[round(x, 4) for x in idf]}\")\n",
    "print()\n",
    "\n",
    "print(\"TF-IDF matrix:\")\n",
    "for i, vec in enumerate(tfidf_matrix):\n",
    "    print(f\"  Doc{i+1}: {[round(x, 4) for x in vec]}\")\n",
    "print()\n",
    "\n",
    "print(\"Detailed breakdown:\")\n",
    "for i, doc in enumerate(docs):\n",
    "    print(f\"\\nDoc{i+1}: {' '.join(doc)}\")\n",
    "    for j, score in enumerate(tfidf_matrix[i]):\n",
    "        if score > 0:\n",
    "            print(f\"  '{vocab[j]}': TF-IDF = {score:.4f}\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 23,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
      "==================================================\n",
      "TF-IDF vs BoW comparison\n",
      "==================================================\n",
      "Documents:\n",
      "  Doc1: Python 编程\n",
      "  Doc2: Java 编程\n",
      "  Doc3: Python Python Python\n",
      "\n",
      "BoW matrix:\n",
      "  Doc1: [0, 1, 1]\n",
      "  Doc2: [1, 0, 1]\n",
      "  Doc3: [0, 3, 0]\n",
      "\n",
      "TF-IDF matrix:\n",
      "  Doc1: [0.0, 1.0, 1.0]\n",
      "  Doc2: [1.4055, 0.0, 1.0]\n",
      "  Doc3: [0.0, 3.0, 0.0]\n",
      "\n",
      "Focus on Doc3:\n",
      "Doc3 'Python Python Python':\n",
      "  BoW:    Python occurs 3 times\n",
      "  TF-IDF: Python's TF-IDF = 3.0000\n",
      "\n",
      "Why does one occurrence of Python weigh less than one occurrence of Java?\n",
      "Because Python appears in 2 of the 3 documents (Doc1 and Doc3): IDF(Python) = 1.0000 < IDF(Java) = 1.4055\n"
     ]
    }
   ],
   "source": [
    "# TF-IDF vs BoW comparison\n",
    "import math\n",
    "\n",
    "print(\"=\" * 50)\n",
    "print(\"TF-IDF vs BoW comparison\")\n",
    "print(\"=\" * 50)\n",
    "\n",
    "def simple_bow(docs):\n",
    "    vocab_set = set()\n",
    "    for doc in docs:\n",
    "        vocab_set.update(doc)\n",
    "    vocab = sorted(list(vocab_set))\n",
    "    bow_matrix = []\n",
    "    for doc in docs:\n",
    "        vec = [0] * len(vocab)\n",
    "        for word in doc:\n",
    "            if word in vocab:\n",
    "                vec[vocab.index(word)] += 1\n",
    "        bow_matrix.append(vec)\n",
    "    return vocab, bow_matrix\n",
    "\n",
    "def simple_tfidf(docs):\n",
    "    vocab_set = set()\n",
    "    for doc in docs:\n",
    "        vocab_set.update(doc)\n",
    "    vocab = sorted(list(vocab_set))\n",
    "    bow = []\n",
    "    for doc in docs:\n",
    "        vec = [0] * len(vocab)\n",
    "        for word in doc:\n",
    "            if word in vocab:\n",
    "                vec[vocab.index(word)] += 1\n",
    "        bow.append(vec)\n",
    "    \n",
    "    n_docs = len(docs)\n",
    "    idf = []\n",
    "    for j, word in enumerate(vocab):\n",
    "        df = sum(1 for vec in bow if vec[j] > 0)\n",
    "        idf_j = math.log(n_docs / (df + 1)) + 1\n",
    "        idf.append(idf_j)\n",
    "    \n",
    "    tfidf = []\n",
    "    for vec in bow:\n",
    "        tfidf_vec = []\n",
    "        for i, tf in enumerate(vec):\n",
    "            tfidf_vec.append(tf * idf[i])\n",
    "        tfidf.append(tfidf_vec)\n",
    "    \n",
    "    return vocab, tfidf, idf\n",
    "\n",
    "docs = [\n",
    "    [\"Python\", \"编程\"],\n",
    "    [\"Java\", \"编程\"],\n",
    "    [\"Python\", \"Python\", \"Python\"]  # Python occurs 3 times\n",
    "]\n",
    "\n",
    "vocab_bow, bow_matrix = simple_bow(docs)\n",
    "vocab_tfidf, tfidf_matrix, idf = simple_tfidf(docs)\n",
    "\n",
    "print(\"Documents:\")\n",
    "for i, doc in enumerate(docs):\n",
    "    print(f\"  Doc{i+1}: {' '.join(doc)}\")\n",
    "print()\n",
    "\n",
    "print(\"BoW matrix:\")\n",
    "for i, vec in enumerate(bow_matrix):\n",
    "    print(f\"  Doc{i+1}: {vec}\")\n",
    "print()\n",
    "\n",
    "print(\"TF-IDF matrix:\")\n",
    "for i, vec in enumerate(tfidf_matrix):\n",
    "    print(f\"  Doc{i+1}: {[round(x, 4) for x in vec]}\")\n",
    "print()\n",
    "\n",
    "# Focus on Doc3: look the columns up by word instead of hard-coding an index\n",
    "python_idx = vocab_tfidf.index(\"Python\")\n",
    "java_idx = vocab_tfidf.index(\"Java\")\n",
    "print(\"Focus on Doc3:\")\n",
    "print(\"Doc3 'Python Python Python':\")\n",
    "print(\"  BoW:    Python occurs 3 times\")\n",
    "print(f\"  TF-IDF: Python's TF-IDF = {tfidf_matrix[2][python_idx]:.4f}\")\n",
    "print()\n",
    "print(\"Why does one occurrence of Python weigh less than one occurrence of Java?\")\n",
    "print(f\"Because Python appears in 2 of the 3 documents (Doc1 and Doc3): IDF(Python) = {idf[python_idx]:.4f} < IDF(Java) = {idf[java_idx]:.4f}\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 24,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
      "==================================================\n",
      "Bonus question: answers\n",
      "==================================================\n",
      "Documents:\n",
      "  Doc1: Python 编程\n",
      "  Doc2: Java 编程\n",
      "  Doc3: Python Python\n",
      "\n",
      "Vocabulary: ['Java', 'Python', '编程']\n",
      "\n",
      "IDF values: [1.4055, 1.0, 1.0]\n",
      "\n",
      "TF-IDF matrix:\n",
      "  Doc1: [0.0, 1.0, 1.0]\n",
      "  Doc2: [1.4055, 0.0, 1.0]\n",
      "  Doc3: [0.0, 2.0, 0.0]\n",
      "\n",
      "Question 1: Is Python's TF-IDF value in Doc3 the highest, and why?\n",
      "Answer: Yes, it is 2.0000, but only because TF = 2.\n",
      "    Python appears in Doc1 and Doc3 (df = 2), so IDF = log(3 / (2 + 1)) + 1 = 1.0000,\n",
      "    which is lower than Java's IDF.\n",
      "\n",
      "Question 2: What is Java's TF-IDF value in Doc2?\n",
      "Answer: Java's TF-IDF in Doc2 = 1.4055\n",
      "    Java appears only in Doc2 (df = 1), so its IDF is higher: log(3 / (1 + 1)) + 1\n"
     ]
    }
   ],
   "source": [
    "# Bonus question: answers\n",
    "import math\n",
    "\n",
    "print(\"=\" * 50)\n",
    "print(\"Bonus question: answers\")\n",
    "print(\"=\" * 50)\n",
    "\n",
    "def simple_tfidf(docs):\n",
    "    vocab_set = set()\n",
    "    for doc in docs:\n",
    "        vocab_set.update(doc)\n",
    "    vocab = sorted(list(vocab_set))\n",
    "    bow = []\n",
    "    for doc in docs:\n",
    "        vec = [0] * len(vocab)\n",
    "        for word in doc:\n",
    "            if word in vocab:\n",
    "                vec[vocab.index(word)] += 1\n",
    "        bow.append(vec)\n",
    "    \n",
    "    n_docs = len(docs)\n",
    "    idf = []\n",
    "    for j, word in enumerate(vocab):\n",
    "        df = sum(1 for vec in bow if vec[j] > 0)\n",
    "        idf_j = math.log(n_docs / (df + 1)) + 1\n",
    "        idf.append(idf_j)\n",
    "    \n",
    "    tfidf = []\n",
    "    for vec in bow:\n",
    "        tfidf_vec = []\n",
    "        for i, tf in enumerate(vec):\n",
    "            tfidf_vec.append(tf * idf[i])\n",
    "        tfidf.append(tfidf_vec)\n",
    "    \n",
    "    return vocab, tfidf, idf\n",
    "\n",
    "docs = [[\"Python\", \"编程\"], [\"Java\", \"编程\"], [\"Python\", \"Python\"]]\n",
    "\n",
    "vocab, tfidf_matrix, idf = simple_tfidf(docs)\n",
    "\n",
    "print(\"Documents:\")\n",
    "for i, doc in enumerate(docs):\n",
    "    print(f\"  Doc{i+1}: {' '.join(doc)}\")\n",
    "print()\n",
    "\n",
    "print(f\"Vocabulary: {vocab}\")\n",
    "print()\n",
    "print(f\"IDF values: {[round(x, 4) for x in idf]}\")\n",
    "print()\n",
    "\n",
    "print(\"TF-IDF matrix:\")\n",
    "for i, vec in enumerate(tfidf_matrix):\n",
    "    print(f\"  Doc{i+1}: {[round(x, 4) for x in vec]}\")\n",
    "print()\n",
    "\n",
    "python_idx = vocab.index(\"Python\")\n",
    "java_idx = vocab.index(\"Java\")\n",
    "print(\"Question 1: Is Python's TF-IDF value in Doc3 the highest, and why?\")\n",
    "print(f\"Answer: Yes, it is {tfidf_matrix[2][python_idx]:.4f}, but only because TF = 2.\")\n",
    "print(f\"    Python appears in Doc1 and Doc3 (df = 2), so IDF = log(3 / (2 + 1)) + 1 = {idf[python_idx]:.4f},\")\n",
    "print(\"    which is lower than Java's IDF.\")\n",
    "print()\n",
    "print(\"Question 2: What is Java's TF-IDF value in Doc2?\")\n",
    "print(f\"Answer: Java's TF-IDF in Doc2 = {tfidf_matrix[1][java_idx]:.4f}\")\n",
    "print(\"    Java appears only in Doc2 (df = 1), so its IDF is higher: log(3 / (1 + 1)) + 1\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
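    "## 7.3 Strengths and weaknesses of TF-IDF\n",
    "\n",
    "| Strengths | Weaknesses |\n",
    "|-----------|------------|\n",
    "| Accounts for how important a word is | Ignores word order |\n",
    "| Lowers the weight of common words | Cannot capture meaning |\n",
    "| Raises the weight of distinctive words | Weights say nothing about relatedness: \"猫\" (cat) and \"狗\" (dog) get independent scores |\n",
    "| Can be used to extract keywords | Cannot handle synonyms such as \"电脑\" vs \"计算机\" (both mean computer) |"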
|
||
]
|
||
},
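  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Sketch: why synonyms are a problem.** The table above says count-based vectors cannot handle synonyms. The cell below illustrates this with two hand-tokenized sentences (an assumption made for brevity) that differ only in the word used for computer. It uses raw counts; TF-IDF would rescale the columns but would still keep \"电脑\" and \"计算机\" in separate dimensions. `cosine` is a small helper written for this illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch: synonyms occupy unrelated dimensions in count-based vectors\n",
    "import math\n",
    "\n",
    "def cosine(a, b):\n",
    "    \"\"\"Cosine similarity of two equal-length lists of numbers.\"\"\"\n",
    "    dot = sum(x * y for x, y in zip(a, b))\n",
    "    norm_a = math.sqrt(sum(x * x for x in a))\n",
    "    norm_b = math.sqrt(sum(y * y for y in b))\n",
    "    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0\n",
    "\n",
    "# Same meaning, different word for \"computer\" (tokens segmented by hand)\n",
    "doc_a = [\"我\", \"买\", \"电脑\"]\n",
    "doc_b = [\"我\", \"买\", \"计算机\"]\n",
    "\n",
    "syn_vocab = sorted(set(doc_a) | set(doc_b))\n",
    "vec_a = [doc_a.count(w) for w in syn_vocab]\n",
    "vec_b = [doc_b.count(w) for w in syn_vocab]\n",
    "\n",
    "print(\"Vocabulary:\", syn_vocab)\n",
    "print(\"doc_a:\", vec_a)\n",
    "print(\"doc_b:\", vec_b)\n",
    "# Only the shared words contribute to the dot product; the synonym pair contributes nothing\n",
    "print(\"cosine(doc_a, doc_b) =\", round(cosine(vec_a, vec_b), 3))"
   ]
  },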
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
    "---\n",
    "\n",
    "# Part 8: Word Embedding\n",
    "\n",
    "## 8.1 The fundamental problem with BoW and TF-IDF\n",
    "\n",
    "```\n",
    "# The problem with one-hot style encoding\n",
    "\"猫\"   → [1, 0, 0, ...]  # just an index position in the vocabulary\n",
    "\"狗\"   → [0, 1, 0, ...]  # a different position from 猫\n",
    "\"小猫\" → [0, 0, 1, ...]  # close in meaning to 猫, yet the vectors are orthogonal!\n",
    "\n",
    "# Problem: semantic similarity cannot be expressed!\n",
    "# 猫 (cat) and 狗 (dog) are both animals and close in meaning,\n",
    "# but in BoW/TF-IDF each word occupies its own dimension, unrelated to any other\n",
    "```\n",
    "\n",
    "### The core idea of word embeddings\n",
    "\n",
    "```\n",
    "Represent a word by its coordinates in a \"semantic space\" instead of by its position in the vocabulary\n",
    "\n",
    "Semantic space example (simplified to two dimensions):\n",
    "\n",
    "  animal-ness ↑\n",
    "              |   狗    猫\n",
    "              |\n",
    "              |\n",
    "              |                    苹果   香蕉\n",
    "              +------------------------------→ plant-ness\n",
    "\n",
    "  Words with similar meanings lie close together in the space\n",
    "```"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 25,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
      "==================================================\n",
      "Word embedding concept demo\n",
      "==================================================\n",
      "\n",
      "Word vectors (simplified, 3 dimensions):\n",
      "Dimension meanings (illustrative only): [animal-ness, plant-ness, non-animal/other]\n",
      "\n",
      "  猫: [0.9 0.1 0.2]\n",
      "  狗: [0.8 0.3 0.1]\n",
      "  小猫: [0.85 0.2  0.15]\n",
      "  苹果: [0.1 0.2 0.9]\n",
      "  香蕉: [0.1  0.1  0.85]\n",
      "  Python: [0.1 0.  0.9]\n",
      "  Java: [0.1  0.   0.85]\n",
      "\n",
      "Semantic similarity:\n",
      "  猫 vs 狗: 0.965\n",
      "  猫 vs 小猫: 0.992\n",
      "  猫 vs 苹果: 0.337\n",
      "  苹果 vs 香蕉: 0.995\n",
      "  Python vs Java: 1.000\n",
      "\n",
      "Advantages of word embeddings:\n",
      "  - Words with similar meanings have similar vectors\n",
      "  - Analogy reasoning becomes possible: King - Man + Woman ≈ Queen\n"
     ]
    }
   ],
   "source": [
    "# Conceptual demo of word embeddings (Word2Vec-style)\n",
    "import numpy as np\n",
    "\n",
    "# Note: cosine_similarity() must already be defined by an earlier cell of this notebook\n",
    "\n",
    "print(\"=\" * 50)\n",
    "print(\"Word embedding concept demo\")\n",
    "print(\"=\" * 50)\n",
    "print()\n",
    "\n",
    "# Hand-made toy vectors (3 dimensions) standing in for vectors trained with Word2Vec or similar\n",
    "# Real embeddings typically have 50/100/300 dimensions\n",
    "word_vectors = {\n",
    "    \"猫\": np.array([0.9, 0.1, 0.2]),      # cat: high animal component\n",
    "    \"狗\": np.array([0.8, 0.3, 0.1]),      # dog: high animal component\n",
    "    \"小猫\": np.array([0.85, 0.2, 0.15]),  # kitten: close to cat\n",
    "    \"苹果\": np.array([0.1, 0.2, 0.9]),    # apple: low animal component\n",
    "    \"香蕉\": np.array([0.1, 0.1, 0.85]),   # banana: close to apple\n",
    "    \"Python\": np.array([0.1, 0.0, 0.9]),  # programming language\n",
    "    \"Java\": np.array([0.1, 0.0, 0.85]),   # programming language\n",
    "}\n",
    "\n",
    "print(\"Word vectors (simplified, 3 dimensions):\")\n",
    "print(\"Dimension meanings (illustrative only): [animal-ness, plant-ness, non-animal/other]\")\n",
    "print()\n",
    "for word, vec in word_vectors.items():\n",
    "    print(f\"  {word}: {vec}\")\n",
    "print()\n",
    "\n",
    "# Compute similarities\n",
    "print(\"Semantic similarity:\")\n",
    "print(f\"  猫 vs 狗: {cosine_similarity(word_vectors['猫'], word_vectors['狗']):.3f}\")\n",
    "print(f\"  猫 vs 小猫: {cosine_similarity(word_vectors['猫'], word_vectors['小猫']):.3f}\")\n",
    "print(f\"  猫 vs 苹果: {cosine_similarity(word_vectors['猫'], word_vectors['苹果']):.3f}\")\n",
    "print(f\"  苹果 vs 香蕉: {cosine_similarity(word_vectors['苹果'], word_vectors['香蕉']):.3f}\")\n",
    "print(f\"  Python vs Java: {cosine_similarity(word_vectors['Python'], word_vectors['Java']):.3f}\")\n",
    "print()\n",
    "print(\"Advantages of word embeddings:\")\n",
    "print(\"  - Words with similar meanings have similar vectors\")\n",
    "print(\"  - Analogy reasoning becomes possible: King - Man + Woman ≈ Queen\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 26,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
      "==================================================\n",
      "Analogy reasoning with word embeddings\n",
      "==================================================\n",
      "\n",
      "Word vectors (simplified):\n",
      "  King:  [0.9 0.1 0.8 0.3]\n",
      "  Man:   [0.8 0.1 0.2 0.5]\n",
      "  Woman: [0.1 0.8 0.2 0.5]\n",
      "  Queen: [0.1 0.9 0.8 0.3]\n",
      "\n",
      "Dimension meanings: [male, female, power/royalty, ordinary person]\n",
      "\n",
      "King - Man + Woman = [0.2 0.8 0.8 0.3]\n",
      "Queen              = [0.1 0.9 0.8 0.3]\n",
      "\n",
      "Similarity check:\n",
      "  (King-Man+Woman) vs Queen: 0.994\n",
      "\n",
      "Conclusion: word embeddings can capture semantic relations!\n",
      "  'King' - 'Man' + 'Woman' ≈ 'Queen'\n",
      "  This shows that word vectors encode semantic information!\n"
     ]
    }
   ],
   "source": [
    "# Analogy reasoning with word embeddings\n",
    "import numpy as np\n",
    "\n",
    "print(\"=\" * 50)\n",
    "print(\"Analogy reasoning with word embeddings\")\n",
    "print(\"=\" * 50)\n",
    "print()\n",
    "\n",
    "# Classic example: King - Man + Woman ≈ Queen\n",
    "# It illustrates that word embeddings can encode semantic relations\n",
    "\n",
    "# Hand-made toy vectors (in practice such vectors are learned by a neural network)\n",
    "# Dimensions: [male, female, power/royalty, ordinary person]\n",
    "king  = np.array([0.9, 0.1, 0.8, 0.3])  # male, powerful\n",
    "man   = np.array([0.8, 0.1, 0.2, 0.5])  # male\n",
    "woman = np.array([0.1, 0.8, 0.2, 0.5])  # female\n",
    "queen = np.array([0.1, 0.9, 0.8, 0.3])  # female, powerful\n",
    "\n",
    "print(\"Word vectors (simplified):\")\n",
    "print(f\"  King:  {king}\")\n",
    "print(f\"  Man:   {man}\")\n",
    "print(f\"  Woman: {woman}\")\n",
    "print(f\"  Queen: {queen}\")\n",
    "print()\n",
    "print(\"Dimension meanings: [male, female, power/royalty, ordinary person]\")\n",
    "print()\n",
    "\n",
    "# Compute King - Man + Woman\n",
    "result = king - man + woman\n",
    "print(f\"King - Man + Woman = {result}\")\n",
    "print(f\"Queen              = {queen}\")\n",
    "print()\n",
    "\n",
    "# Similarity (cosine_similarity() must already be defined by an earlier cell)\n",
    "print(\"Similarity check:\")\n",
    "print(f\"  (King-Man+Woman) vs Queen: {cosine_similarity(result, queen):.3f}\")\n",
    "print()\n",
    "\n",
    "print(\"Conclusion: word embeddings can capture semantic relations!\")\n",
    "print(\"  'King' - 'Man' + 'Woman' ≈ 'Queen'\")\n",
    "print(\"  This shows that word vectors encode semantic information!\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
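    "## 8.2 A brief history of word embeddings\n",
    "\n",
    "| Method | Year | Key points |\n",
    "|--------|------|------------|\n",
    "| Word2Vec | 2013 | Open-sourced by Google; popularized dense word embeddings |\n",
    "| GloVe | 2014 | Proposed at Stanford; based on a global co-occurrence matrix |\n",
    "| FastText | 2016 | Open-sourced by Facebook; supports subwords |\n",
    "| ELMo | 2018 | Context-aware, dynamic word vectors |\n",
    "| BERT | 2018 | Transformer architecture; large pre-trained model |\n",
    "| GPT series | 2018-present | Generative AI; the model family behind ChatGPT |"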
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 27,
|
||
"metadata": {
|
||
"scrolled": true
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
      "==================================================\n",
      "Pre-trained word vector demo (using built-in example vectors)\n",
      "==================================================\n",
      "\n",
      "Note: loading a real pre-trained Gensim model requires a download (about 66MB)\n",
      "This notebook uses built-in example vectors for the demo\n",
      "\n",
      "Word vector examples (each word is represented by a 5-dimensional vector):\n",
      "Dimension meanings: [animal, plant, technology, motion, abstract concept]\n",
      "\n",
      "  cat         : [0.9, 0.1, 0.2, 0.8, 0.3]\n",
      "  dog         : [0.8, 0.2, 0.1, 0.9, 0.3]\n",
      "  bird        : [0.7, 0.3, 0.1, 0.9, 0.2]\n",
      "  fish        : [0.6, 0.2, 0.1, 0.8, 0.2]\n",
      "  apple       : [0.1, 0.9, 0.3, 0.0, 0.2]\n",
      "  rose        : [0.1, 0.8, 0.1, 0.0, 0.1]\n",
      "  python      : [0.1, 0.0, 0.9, 0.0, 0.5]\n",
      "  java        : [0.1, 0.0, 0.8, 0.0, 0.4]\n",
      "  computer    : [0.1, 0.0, 0.9, 0.3, 0.4]\n",
      "  love        : [0.3, 0.2, 0.1, 0.1, 0.9]\n",
      "  hate        : [0.2, 0.1, 0.1, 0.1, 0.8]\n",
      "\n",
      "==================================================\n",
      "1. Semantic similarity\n",
      "==================================================\n",
      "  cat        vs dog       : 0.987\n",
      "  cat        vs apple     : 0.244\n",
      "  python     vs java      : 0.998\n",
      "  python     vs cat       : 0.322\n",
      "  love       vs hate      : 0.993\n",
      "\n",
      "==================================================\n",
      "2. Analogy reasoning (a well-known property of Word2Vec-style embeddings)\n",
      "==================================================\n",
      "Analogy question: man -> woman, king -> ?\n",
      "\n",
      "  King                = [0.6 0.1 0.3 0.3 0.6]\n",
      "  Man                 = [0.8 0.1 0.2 0.5 0.3]\n",
      "  Woman               = [0.2 0.8 0.2 0.5 0.5]\n",
      "  King - Man + Woman  = [-0.   0.8  0.3  0.3  0.8]\n",
      "  Queen (hand-made)   = [0.2 0.9 0.3 0.3 0.6]\n",
      "\n",
      "  Similarity: 0.969\n",
      "\n",
      "Great! In this toy example the vector arithmetic recovers the semantic relation!\n",
      "\n",
      "==================================================\n",
      "How to load a real pre-trained Gensim model\n",
      "==================================================\n",
      "To load real pre-trained word vectors, you can run:\n",
      "\n",
      "    import gensim.downloader as api\n",
      "    model = api.load('glove-wiki-gigaword-50')\n",
      "\n",
      "This downloads a pre-trained word vector model of about 66MB\n"
     ]
    }
   ],
   "source": [
    "# Hands-on: demonstrating word embeddings with built-in example vectors (no download)\n",
    "import numpy as np\n",
    "\n",
    "def cosine_similarity(a, b):\n",
    "    \"\"\"Return the cosine similarity of two vectors (0.0 if either is all zeros).\"\"\"\n",
    "    dot = np.dot(a, b)\n",
    "    norm_a = np.linalg.norm(a)\n",
    "    norm_b = np.linalg.norm(b)\n",
    "    if norm_a == 0 or norm_b == 0:\n",
    "        return 0.0\n",
    "    return dot / (norm_a * norm_b)\n",
    "\n",
    "print(\"=\" * 50)\n",
    "print(\"Pre-trained word vector demo (using built-in example vectors)\")\n",
    "print(\"=\" * 50)\n",
    "print()\n",
    "print(\"Note: loading a real pre-trained Gensim model requires a download (about 66MB)\")\n",
    "print(\"This notebook uses built-in example vectors for the demo\")\n",
    "print()\n",
    "\n",
    "# Small hand-made vectors that imitate real word vectors\n",
    "# Dimensions: [animal, plant, technology, motion, abstract concept]\n",
    "word_vectors = {\n",
    "    # Animals\n",
    "    \"cat\": np.array([0.9, 0.1, 0.2, 0.8, 0.3]),\n",
    "    \"dog\": np.array([0.8, 0.2, 0.1, 0.9, 0.3]),\n",
    "    \"bird\": np.array([0.7, 0.3, 0.1, 0.9, 0.2]),\n",
    "    \"fish\": np.array([0.6, 0.2, 0.1, 0.8, 0.2]),\n",
    "    # Plants\n",
    "    \"apple\": np.array([0.1, 0.9, 0.3, 0.0, 0.2]),\n",
    "    \"rose\": np.array([0.1, 0.8, 0.1, 0.0, 0.1]),\n",
    "    # Technology\n",
    "    \"python\": np.array([0.1, 0.0, 0.9, 0.0, 0.5]),\n",
    "    \"java\": np.array([0.1, 0.0, 0.85, 0.0, 0.4]),\n",
    "    \"computer\": np.array([0.1, 0.0, 0.9, 0.3, 0.4]),\n",
    "    # Abstract concepts\n",
    "    \"love\": np.array([0.3, 0.2, 0.1, 0.1, 0.9]),\n",
    "    \"hate\": np.array([0.2, 0.1, 0.1, 0.1, 0.8]),\n",
    "}\n",
    "\n",
    "# Show the word vectors\n",
    "print(\"Word vector examples (each word is represented by a 5-dimensional vector):\")\n",
    "print(\"Dimension meanings: [animal, plant, technology, motion, abstract concept]\")\n",
    "print()\n",
    "for word, vec in word_vectors.items():\n",
    "    print(f\"  {word:12s}: [{vec[0]:.1f}, {vec[1]:.1f}, {vec[2]:.1f}, {vec[3]:.1f}, {vec[4]:.1f}]\")\n",
    "\n",
    "print()\n",
    "print(\"=\" * 50)\n",
    "print(\"1. Semantic similarity\")\n",
    "print(\"=\" * 50)\n",
    "pairs = [\n",
    "    (\"cat\", \"dog\"),      # both animals\n",
    "    (\"cat\", \"apple\"),    # animal vs plant\n",
    "    (\"python\", \"java\"),  # both programming languages\n",
    "    (\"python\", \"cat\"),   # programming language vs animal\n",
    "    (\"love\", \"hate\"),    # emotion words (antonyms, yet close in this space)\n",
    "]\n",
    "for w1, w2 in pairs:\n",
    "    sim = cosine_similarity(word_vectors[w1], word_vectors[w2])\n",
    "    print(f\"  {w1:10s} vs {w2:10s}: {sim:.3f}\")\n",
    "\n",
    "print()\n",
    "print(\"=\" * 50)\n",
    "print(\"2. Analogy reasoning (a well-known property of Word2Vec-style embeddings)\")\n",
    "print(\"=\" * 50)\n",
    "print(\"Analogy question: man -> woman, king -> ?\")\n",
    "print()\n",
    "\n",
    "# Simplified analogy using the hand-made semantic dimensions above\n",
    "man = np.array([0.8, 0.1, 0.2, 0.5, 0.3])\n",
    "woman = np.array([0.2, 0.8, 0.2, 0.5, 0.5])\n",
    "king = np.array([0.6, 0.1, 0.3, 0.3, 0.6])\n",
    "queen = np.array([0.2, 0.9, 0.3, 0.3, 0.6])\n",
    "\n",
    "# king - man + woman ≈ queen\n",
    "result = king - man + woman\n",
    "\n",
    "print(f\"  King                = {king}\")\n",
    "print(f\"  Man                 = {man}\")\n",
    "print(f\"  Woman               = {woman}\")\n",
    "print(f\"  King - Man + Woman  = {np.round(result, 2)}\")\n",
    "print(f\"  Queen (hand-made)   = {queen}\")\n",
    "print()\n",
    "print(f\"  Similarity: {cosine_similarity(result, queen):.3f}\")\n",
    "print()\n",
    "print(\"Great! In this toy example the vector arithmetic recovers the semantic relation!\")\n",
    "print()\n",
    "\n",
    "# How to load a Gensim model in a real environment (for reference only, not executed)\n",
    "print(\"=\" * 50)\n",
    "print(\"How to load a real pre-trained Gensim model\")\n",
    "print(\"=\" * 50)\n",
    "print(\"To load real pre-trained word vectors, you can run:\")\n",
    "print()\n",
    "print(\"    import gensim.downloader as api\")\n",
    "print(\"    model = api.load('glove-wiki-gigaword-50')\")\n",
    "print()\n",
    "print(\"This downloads a pre-trained word vector model of about 66MB\")\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
    "---\n",
    "\n",
    "# Part 9: The complete text processing pipeline\n",
    "\n",
    "## 9.1 Flowchart\n",
    "\n",
    "```\n",
    "┌────────────────────────────────────────────────────────────────┐\n",
    "│                           Text data                            │\n",
    "│                       \"今天天气真不错!\"                       │\n",
    "└─────────────────────────┬──────────────────────────────────────┘\n",
    "                          │\n",
    "                          ▼\n",
    "┌────────────────────────────────────────────────────────────────┐\n",
    "│                     1. Text preprocessing                      │\n",
    "│  ┌───────────┐   ┌───────────┐   ┌───────────┐   ┌───────────┐ │\n",
    "│  │  Tokenize │ → │ Stopwords │ → │ Lowercase │ → │Punctuation│ │\n",
    "│  └───────────┘   └───────────┘   └───────────┘   └───────────┘ │\n",
    "│             \"今天/天气/真/不错\" → \"今天/天气/不错\"             │\n",
    "└─────────────────────────┬──────────────────────────────────────┘\n",
    "                          │\n",
    "                          ▼\n",
    "┌────────────────────────────────────────────────────────────────┐\n",
    "│                     2. Text vectorization                      │\n",
    "│  ┌───────────┐   ┌───────────┐   ┌───────────┐   ┌───────────┐ │\n",
    "│  │    BoW    │   │  TF-IDF   │   │ Embedding │   │ Pretrained│ │\n",
    "│  └───────────┘   └───────────┘   └───────────┘   └───────────┘ │\n",
    "│        ↓               ↓               ↓               ↓       │\n",
    "│   [1,0,2,0,1]     [0.5,0,0.8]      [0.9,0.3]      [BERT vec.]  │\n",
    "└─────────────────────────┬──────────────────────────────────────┘\n",
    "                          │\n",
    "                          ▼\n",
    "┌────────────────────────────────────────────────────────────────┐\n",
    "│                      3. Downstream tasks                       │\n",
    "│  ┌───────────┐   ┌───────────┐   ┌───────────┐   ┌───────────┐ │\n",
    "│  │  Classify │   │ Similarity│   │ Clustering│   │ Generation│ │\n",
    "│  │ Sentiment │   │  Matching │   │   Topics  │   │  Chatbots │ │\n",
    "│  └───────────┘   └───────────┘   └───────────┘   └───────────┘ │\n",
    "└────────────────────────────────────────────────────────────────┘\n",
    "```"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
    "## 9.2 The stages in detail\n",
    "\n",
    "### Stage 1: Text preprocessing\n",
    "\n",
    "| Step | Input | Output | Purpose |\n",
    "|------|-------|--------|---------|\n",
    "| Tokenization | \"今天的天气不错\" | [\"今天\", \"的\", \"天气\", \"不错\"] | Split the text into words |\n",
    "| Stopword removal | [\"今天\", \"的\", \"天气\", \"不错\"] | [\"今天\", \"天气\", \"不错\"] | Drop words that carry little meaning, such as \"的\", \"了\", \"在\" |\n",
    "| Case normalization | [\"Python\", \"python\"] | [\"python\", \"python\"] | Normalize variants to one form |\n",
    "| Punctuation removal | [\"语言!!!\"] | [\"语言\"] | Clean up noise |"
|
||
]
|
||
},
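  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Sketch: the normalization steps in code.** The table above lists stopword removal, case normalization and punctuation removal. The cell below is a minimal pure-Python illustration of those three steps on tokens that are already segmented. The names `normalize`, `DEMO_STOPWORDS` and `DEMO_PUNCT` are hypothetical, and the tiny stopword and punctuation sets are assumptions chosen for the example, not a recommended list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch: lowercase, strip punctuation, drop stopwords\n",
    "import string\n",
    "\n",
    "DEMO_STOPWORDS = {\"的\", \"了\", \"在\"}  # tiny assumed list, for illustration only\n",
    "DEMO_PUNCT = set(string.punctuation) | set(\",。!?、;:\")  # ASCII plus a few full-width marks\n",
    "\n",
    "def normalize(tokens):\n",
    "    \"\"\"Lowercase each token, remove punctuation characters, drop stopwords and empty tokens.\"\"\"\n",
    "    cleaned = []\n",
    "    for tok in tokens:\n",
    "        tok = \"\".join(ch for ch in tok.lower() if ch not in DEMO_PUNCT)\n",
    "        if tok and tok not in DEMO_STOPWORDS:\n",
    "            cleaned.append(tok)\n",
    "    return cleaned\n",
    "\n",
    "print(normalize([\"Python\", \"的\", \"语言!!!\", \"python\"]))"
   ]
  },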
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
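    "### Stage 2: Text vectorization\n",
    "\n",
    "| Method | Good for | Not suited to |\n",
    "|--------|----------|---------------|\n",
    "| BoW | Baseline models, quick prototypes | Tasks that need semantic understanding |\n",
    "| TF-IDF | Text classification, keyword extraction | Synonym recognition |\n",
    "| Embedding | Semantic similarity, recommender systems | Tasks that need exact matching |\n",
    "| Pre-trained models | General NLP tasks | Settings with limited compute |"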
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
    "### Stage 3: Downstream tasks\n",
    "\n",
    "```\n",
    "# Classification:\n",
    "\"这部电影太好看了!\" → sentiment classification → positive ✅\n",
    "\n",
    "# Similarity:\n",
    "\"如何学习Python?\" → find similar documents → \"Python入门教程\" ✅\n",
    "\n",
    "# Generation:\n",
    "\"今天天气\" → GPT continuation → \"今天天气真好,适合出去玩\" ✅\n",
    "```"
|
||
]
|
||
},
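  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Sketch: the similarity task in code.** As a rough illustration of the similarity example above, the cell below ranks two documents against a query by cosine similarity over raw word counts. The corpus, the hand-made tokenization and the helper name `cosine_counts` are all assumptions made for this example; a real system would segment the text first and would usually use TF-IDF or embeddings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch: find the document most similar to a query\n",
    "import math\n",
    "from collections import Counter\n",
    "\n",
    "def cosine_counts(a, b):\n",
    "    \"\"\"Cosine similarity between two token lists, using raw word counts.\"\"\"\n",
    "    ca, cb = Counter(a), Counter(b)\n",
    "    dot = sum(ca[w] * cb[w] for w in ca)\n",
    "    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))\n",
    "    return dot / norm if norm else 0.0\n",
    "\n",
    "# Toy corpus: title -> tokens (segmented by hand, lowercased)\n",
    "demo_corpus = {\n",
    "    \"Python入门教程\": [\"python\", \"入门\", \"教程\"],\n",
    "    \"Java并发编程\": [\"java\", \"并发\", \"编程\"],\n",
    "}\n",
    "query = [\"如何\", \"学习\", \"python\"]\n",
    "\n",
    "for title, tokens in demo_corpus.items():\n",
    "    print(f\"{title}: {cosine_counts(query, tokens):.3f}\")\n",
    "best = max(demo_corpus, key=lambda title: cosine_counts(query, demo_corpus[title]))\n",
    "print(\"Best match:\", best)"
   ]
  },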
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
    "---\n",
    "\n",
    "# Part 10: Hands-on: Chinese word segmentation with jieba\n",
    "\n",
    "## 10.1 Installing jieba\n",
    "\n",
    "```bash\n",
    "pip install jieba\n",
    "```"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
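    "# Install jieba into the environment of the running kernel\n",
    "import subprocess\n",
    "import sys\n",
    "\n",
    "result = subprocess.run([sys.executable, '-m', 'pip', 'install', 'jieba', '-q'])\n",
    "\n",
    "if result.returncode == 0:\n",
    "    print(\"jieba installed successfully!\")\n",
    "else:\n",
    "    print(\"jieba installation failed; see the pip output above.\")"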
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
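    "## 10.2 Basic segmentation\n",
    "\n",
    "jieba supports three segmentation modes:\n",
    "\n",
    "| Mode | Description | Typical use |\n",
    "|------|-------------|-------------|\n",
    "| Precise mode | Tries to cut the sentence as accurately as possible; suited to text analysis | **Default, recommended** |\n",
    "| Full mode | Scans out every possible word in the sentence; very fast, but cannot resolve ambiguity | When speed matters more than accuracy |\n",
    "| Search engine mode | Builds on precise mode and splits long words again | Search engines |"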
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
    "import jieba\n",
    "\n",
    "print(\"=\" * 50)\n",
    "print(\"jieba segmentation demo\")\n",
    "print(\"=\" * 50)\n",
    "\n",
    "text = \"我喜欢深度学习和人工智能\"\n",
    "\n",
    "print(f\"Original text: {text}\")\n",
    "print()\n",
    "\n",
    "# Precise mode (default)\n",
    "words_precise = list(jieba.cut(text, cut_all=False))\n",
    "print(f\"Precise mode: {' / '.join(words_precise)}\")\n",
    "\n",
    "# Full mode\n",
    "words_full = list(jieba.cut(text, cut_all=True))\n",
    "print(f\"Full mode:    {' / '.join(words_full)}\")\n",
    "\n",
    "# Search engine mode\n",
    "words_search = list(jieba.cut_for_search(text))\n",
    "print(f\"Search mode:  {' / '.join(words_search)}\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
    "# More segmentation examples\n",
    "import jieba\n",
    "\n",
    "print(\"=\" * 50)\n",
    "print(\"More segmentation examples\")\n",
    "print(\"=\" * 50)\n",
|
||
"\n",
|
||
"examples = [\n",
|
||
" \"今天天气真不错\",\n",
|
||
" \"人工智能是未来的发展方向\",\n",
|
||
" \"Python是一门非常流行的编程语言\",\n",
|
||
" \"小明毕业于清华大学计算机系\",\n",
|
||
" \"我今天在京东买了一部iPhone手机\"\n",
|
||
"]\n",
|
||
"\n",
|
||
"for i, text in enumerate(examples):\n",
|
||
" words = list(jieba.cut(text))\n",
|
||
" print(f\"{i+1}. {text}\")\n",
|
||
" print(f\" → {' / '.join(words)}\")\n",
|
||
" print()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
    "## 10.3 Part-of-speech tagging\n",
    "\n",
    "jieba supports part-of-speech tagging: it can label each word as a noun, verb, adjective and so on.\n",
    "\n",
    "| POS tag | Meaning | Examples |\n",
    "|---------|---------|----------|\n",
    "| n | Noun | 人、山、电脑 |\n",
    "| v | Verb | 跑、吃、学习 |\n",
    "| a | Adjective | 漂亮、好吃、优秀 |\n",
    "| d | Adverb | 很、非常、慢慢 |\n",
    "| m | Numeral | 一、百、千 |\n",
    "| q | Measure word | 个、本、件 |"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
    "import jieba.posseg as pseg\n",
    "\n",
    "print(\"=\" * 50)\n",
    "print(\"jieba part-of-speech tagging demo\")\n",
    "print(\"=\" * 50)\n",
    "\n",
    "text = \"我喜欢深度学习和人工智能\"\n",
    "\n",
    "print(f\"Original text: {text}\")\n",
    "print()\n",
    "\n",
    "words = pseg.cut(text)\n",
    "print(\"Segmentation + POS tags:\")\n",
    "for word, flag in words:\n",
    "    print(f\"  {word}: {flag}\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
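    "## 10.4 Stopword handling\n",
    "\n",
    "Stopwords are common words that are filtered out during text processing, such as \"的\", \"了\" and \"在\".\n",
    "\n",
    "They can appear in almost any document, so they do not help to tell documents apart."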
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
    "import jieba\n",
    "\n",
    "print(\"=\" * 50)\n",
    "print(\"Stopword handling demo\")\n",
    "print(\"=\" * 50)\n",
    "\n",
    "# A short list of common stopwords\n",
    "stopwords = set(['的', '了', '在', '是', '我', '有', '和', '就', '不', '人', '都', '一', '一个', '上', '也', '很', '到', '说', '要', '去', '你', '会', '着', '没有', '看', '好', '自己', '这'])\n",
    "\n",
    "text = \"人工智能是未来的发展方向,也是当前科技领域的热门话题\"\n",
    "\n",
    "print(f\"Original text: {text}\")\n",
    "print()\n",
    "\n",
    "# Before stopword removal\n",
    "words_all = list(jieba.cut(text))\n",
    "print(f\"Before stopword removal: {' / '.join(words_all)}\")\n",
    "\n",
    "# After stopword removal\n",
    "words_filtered = [w for w in words_all if w not in stopwords]\n",
    "print(f\"After stopword removal:  {' / '.join(words_filtered)}\")\n",
    "print()\n",
    "\n",
    "# More complete stopword lists can be downloaded online\n",
    "print(\"Tip: in real projects, stopword lists are available from sources such as:\")\n",
    "print(\"  - the HIT (Harbin Institute of Technology) stopword list\")\n",
    "print(\"  - the Baidu stopword list\")\n",
    "print(\"  - the Sichuan University Machine Intelligence Laboratory stopword list\")"
|
||
]
|
||
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hands-on: a complete text preprocessing pipeline\n",
"import jieba\n",
"\n",
"print(\"=\" * 50)\n",
"print(\"Complete text preprocessing pipeline\")\n",
"print(\"=\" * 50)\n",
"\n",
"# Example documents\n",
"docs = [\n",
"    \"今天天气真不错!适合出去玩。\",\n",
"    \"Python是一门很棒的编程语言。\",\n",
"    \"人工智能和机器学习是未来的发展方向。\",\n",
"    \"今天在咖啡馆喝了一杯很好喝的拿铁。\"\n",
"]\n",
"\n",
"# Stopword list (also covers the punctuation used in the examples)\n",
"stopwords = set(['的', '了', '在', '是', '我', '有', '和', '就', '不', '人', '都', '一', '一个', '上', '也', '很', '到', '说', '要', '去', '你', '会', '着', '没有', '看', '好', '自己', '这', '!', '。', ','])\n",
"\n",
"def preprocess_text(text):\n",
"    \"\"\"Segment the text, then drop stopwords and whitespace-only tokens.\"\"\"\n",
"    # 1. Segment\n",
"    words = jieba.cut(text)\n",
"\n",
"    # 2. Remove stopwords\n",
"    words = [w for w in words if w not in stopwords]\n",
"\n",
"    # 3. Remove whitespace-only tokens\n",
"    words = [w for w in words if w.strip()]\n",
"\n",
"    return words\n",
"\n",
"print(\"Preprocessing results:\")\n",
"for i, doc in enumerate(docs):\n",
"    words = preprocess_text(doc)\n",
"    print(f\"\\nDoc{i+1}: {doc}\")\n",
"    print(f\"   → {' / '.join(words)}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hands-on: jieba segmentation + TF-IDF, end to end\n",
"import jieba\n",
"import math\n",
"\n",
"print(\"=\" * 50)\n",
"print(\"Hands-on: jieba segmentation + TF-IDF, end to end\")\n",
"print(\"=\" * 50)\n",
"\n",
"def simple_tfidf_tokenized(docs, stopwords=None):\n",
"    \"\"\"\n",
"    TF-IDF combined with jieba segmentation.\n",
"    Args:\n",
"        docs: list of raw document strings\n",
"        stopwords: optional set of stopwords\n",
"    Returns:\n",
"        vocab, tfidf_matrix, tokenized\n",
"    \"\"\"\n",
"    # 1. Segment; drop stopwords and single-character tokens\n",
"    stopwords = stopwords or set()\n",
"    tokenized = []\n",
"    for doc in docs:\n",
"        words = [w for w in jieba.cut(doc) if w not in stopwords and len(w) > 1]\n",
"        tokenized.append(words)\n",
"\n",
"    # 2. Build the vocabulary and a word -> column index lookup\n",
"    vocab = sorted({w for doc in tokenized for w in doc})\n",
"    word_to_idx = {word: idx for idx, word in enumerate(vocab)}\n",
"\n",
"    # 3. Build the TF matrix (raw counts)\n",
"    n_docs = len(tokenized)\n",
"    tf_matrix = []\n",
"    for doc in tokenized:\n",
"        vec = [0] * len(vocab)\n",
"        for word in doc:\n",
"            vec[word_to_idx[word]] += 1\n",
"        tf_matrix.append(vec)\n",
"\n",
"    # DF: number of documents that contain each word\n",
"    df = [sum(1 for vec in tf_matrix if vec[j] > 0) for j in range(len(vocab))]\n",
"\n",
"    # Smoothed IDF: log(N / (df + 1)) + 1\n",
"    idf = [math.log(n_docs / (d + 1)) + 1 for d in df]\n",
"\n",
"    # TF-IDF = TF × IDF\n",
"    tfidf = [[tf * w for tf, w in zip(vec, idf)] for vec in tf_matrix]\n",
"\n",
"    return vocab, tfidf, tokenized\n",
"\n",
"# Example documents\n",
"docs = [\n",
"    \"Python是一门很棒的编程语言\",\n",
"    \"人工智能是未来的发展方向\",\n",
"    \"深度学习是机器学习的一个分支\",\n",
"    \"Python和Java都是很流行的编程语言\"\n",
"]\n",
"\n",
"# Stopwords\n",
"stopwords = set([\"的\", \"是\", \"一个\", \"很\", \"和\", \"在\", \"了\"])\n",
"\n",
"vocab, tfidf_matrix, tokenized = simple_tfidf_tokenized(docs, stopwords)\n",
"\n",
"print(\"Documents:\")\n",
"for i, doc in enumerate(docs):\n",
"    print(f\"  Doc{i+1}: {doc}\")\n",
"print()\n",
"\n",
"print(\"Segmentation results:\")\n",
"for i, words in enumerate(tokenized):\n",
"    print(f\"  Doc{i+1}: {' / '.join(words)}\")\n",
"print()\n",
"\n",
"print(f\"Vocabulary ({len(vocab)} words):\")\n",
"print(f\"  {vocab}\")\n",
"print()\n",
"\n",
"print(\"TF-IDF matrix:\")\n",
"for i, vec in enumerate(tfidf_matrix):\n",
"    # Show only the nonzero entries\n",
"    nonzero = [(vocab[j], round(vec[j], 4)) for j in range(len(vec)) if vec[j] > 0]\n",
"    print(f\"  Doc{i+1}: {nonzero}\")\n",
"\n",
"print()\n",
"\n",
"# Find the highest-scoring word in each document\n",
"print(\"Most important word in each document (highest TF-IDF):\")\n",
"for i, vec in enumerate(tfidf_matrix):\n",
"    max_idx = max(range(len(vec)), key=lambda j: vec[j])\n",
"    max_score = vec[max_idx]\n",
"    if max_score > 0:\n",
"        print(f\"  Doc{i+1}: '{vocab[max_idx]}' (TF-IDF={max_score:.4f})\")"
]
},
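{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once documents are TF-IDF vectors, they can be compared with cosine similarity. The block below is a minimal sketch, not part of the original pipeline: `tfidf_vectors` and `cosine_similarity` are hypothetical helper names, the token lists are written by hand (real jieba output may differ), and the IDF uses the same smoothed form, log(N / (df + 1)) + 1, as the cell above.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"def tfidf_vectors(tokenized_docs):\n",
"    # Sorted vocabulary plus smoothed IDF: log(N / (df + 1)) + 1.\n",
"    vocab = sorted({w for doc in tokenized_docs for w in doc})\n",
"    n_docs = len(tokenized_docs)\n",
"    idf = {\n",
"        w: math.log(n_docs / (sum(w in doc for doc in tokenized_docs) + 1)) + 1\n",
"        for w in vocab\n",
"    }\n",
"    # Each row is raw term count times IDF, in vocabulary order.\n",
"    return [[doc.count(w) * idf[w] for w in vocab] for doc in tokenized_docs]\n",
"\n",
"def cosine_similarity(a, b):\n",
"    # cos(theta) = (A . B) / (|A| * |B|); returns 0.0 if either vector is all zeros.\n",
"    dot = sum(x * y for x, y in zip(a, b))\n",
"    norm_a = math.sqrt(sum(x * x for x in a))\n",
"    norm_b = math.sqrt(sum(y * y for y in b))\n",
"    if norm_a == 0 or norm_b == 0:\n",
"        return 0.0\n",
"    return dot / (norm_a * norm_b)\n",
"\n",
"# Hand-written token lists (assumption: illustrative only, not produced by jieba here).\n",
"tokenized = [\n",
"    ['Python', '编程语言'],\n",
"    ['人工智能', '未来', '发展', '方向'],\n",
"    ['Python', 'Java', '流行', '编程语言'],\n",
"]\n",
"vectors = tfidf_vectors(tokenized)\n",
"sim_01 = cosine_similarity(vectors[0], vectors[1])\n",
"sim_02 = cosine_similarity(vectors[0], vectors[2])\n",
"print(f'Doc1 vs Doc2: {sim_01:.4f}')\n",
"print(f'Doc1 vs Doc3: {sim_02:.4f}')\n",
"```\n",
"\n",
"Doc1 and Doc2 share no tokens, so their similarity is 0; Doc1 and Doc3 share two tokens, so their similarity falls strictly between 0 and 1."
]
},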
{
"cell_type": "markdown",
"metadata": {
"jp-MarkdownHeadingCollapsed": true
},
"source": [
"---\n",
"\n",
"# 📋 Summary\n",
"\n",
"## Core Concepts of This Chapter\n",
"\n",
"```\n",
"Text data processing\n",
"  │\n",
"  ├── Core problem: text (symbols) → vectors (numbers)\n",
"  │\n",
"  ├── Vectorization methods\n",
"  │     ├── BoW (bag-of-words)\n",
"  │     │     └── Core idea: count word frequencies, ignore order\n",
"  │     │\n",
"  │     ├── TF-IDF (term frequency-inverse document frequency)\n",
"  │     │     └── Core idea: a word's importance in a document × its distinctiveness across documents\n",
"  │     │\n",
"  │     └── Word Embedding\n",
"  │           └── Core idea: represent words in a semantic space\n",
"  │\n",
"  └── Processing pipeline\n",
"        ├── Text preprocessing (segmentation, stopword removal)\n",
"        ├── Vectorization\n",
"        └── Downstream tasks (classification, similarity, generation)\n",
"```"
]
},
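{
"cell_type": "markdown",
"metadata": {},
"source": [
"The BoW entry in the tree above says that word order is ignored. A minimal sketch of what that means in practice, assuming hand-written token lists and a hypothetical helper `bow_vector` (neither comes from the earlier cells): two sentences with the same words in a different order get the same vector.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"def bow_vector(tokens, vocab):\n",
"    # Count each token, then read the counts off in vocabulary order.\n",
"    counts = Counter(tokens)\n",
"    return [counts[word] for word in vocab]\n",
"\n",
"# Hand-written token lists (assumption: illustrative only, not produced by jieba here).\n",
"tokens_a = ['我', '爱', '你']\n",
"tokens_b = ['你', '爱', '我']\n",
"\n",
"# Shared vocabulary built from both token lists.\n",
"vocab = sorted(set(tokens_a) | set(tokens_b))\n",
"vec_a = bow_vector(tokens_a, vocab)\n",
"vec_b = bow_vector(tokens_b, vocab)\n",
"\n",
"print(vocab)\n",
"print(vec_a, vec_b)\n",
"print(vec_a == vec_b)  # True: BoW cannot tell the two sentences apart\n",
"```\n",
"\n",
"This is the price of the simplicity of BoW: the two sentences mean different things, but their count vectors are identical."
]
},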
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Key Formulas at a Glance\n",
"\n",
"| Operation | Formula | Meaning |\n",
"|-----------|---------|---------|\n",
"| Vector addition | [1,2] + [3,4] = [4,6] | add element-wise |\n",
"| Scalar multiplication | 2 × [1,2] = [2,4] | multiply every element by the scalar |\n",
"| Dot product | [1,2] · [3,4] = 11 | multiply element-wise, then sum |\n",
"| Vector length | \\|[3,4]\\| = √(3²+4²) = 5 | Pythagorean theorem |\n",
"| Cosine similarity | cos(θ) = (A·B) / (\\|A\\|×\\|B\\|) | how similar two vectors are in direction |\n",
"| TF-IDF | TF × IDF | term frequency × inverse document frequency |\n",
"\n",
"---\n",
"\n",
"> **Remember: the core goal of text vectorization is to turn \"symbols\" into \"computable numeric vectors\"!**"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}