task-3-2-1-Text-Processing-…/3-2-1_文本数据处理导论_课堂演示.ipynb
2026-04-23 16:01:39 +08:00

{
"cells": [
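{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3-2-1 Introduction to Text Data Processing\n",
"## Classroom Demo Notebook\n",
"\n",
"---\n",
"\n",
"## Contents\n",
"\n",
"1. [What Is Text Data?](#第一部分-什么是文本数据)\n",
"2. [How Does a Computer Read Text?](#第二部分-计算机如何读取文本)\n",
"3. [Vector Basics](#第三部分-向量基础入门)\n",
"4. [Cosine Similarity](#第四部分-余弦相似度)\n",
"5. [The Core Idea of Text Vectorization](#第五部分-文本向量化的核心思想)\n",
"6. [BoW: the Bag-of-Words Model](#第六部分-bow词袋模型)\n",
"7. [TF-IDF: Term Frequency-Inverse Document Frequency](#第七部分-tf-idf)\n",
"8. [Word Embedding](#第八部分-word-embedding词嵌入)\n",
"9. [The Complete Text Processing Pipeline](#第九部分-文本处理完整流程)\n",
"10. [Hands-on: Chinese Word Segmentation with jieba](#第十部分-实战用jieba进行中文分词)\n",
"\n",
"---\n",
"\n",
"**Note:** Running this notebook requires the following dependencies:\n",
"```bash\n",
"pip install numpy matplotlib jieba\n",
"```\n",
"- The BoW and TF-IDF code is implemented in pure Python + NumPy and does not depend on sklearn.\n",
"- If the server has no Chinese font installed, Chinese characters in the charts may be rendered as empty boxes; this is expected.\n"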
]
},
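{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"第一部分-什么是文本数据\"></a>\n",
"\n",
"# Part 1: What Is Text Data?\n",
"\n",
"## 1.1 Definition of Text Data\n",
"\n",
"**Text data** is sequential information made up of characters and symbols; it is the form in which human language is represented inside a computer.\n",
"\n",
"### Everyday examples of text data\n",
"\n",
"| Type | Example |\n",
"|------|------|\n",
"| A sentence | \"今天天气真好\" (The weather is lovely today) |\n",
"| An article | A news report |\n",
"| A review | \"这家餐厅的菜太好吃了!\" (This restaurant's food is delicious!) |\n",
"| A dialogue | \"你好,请问这本书多少钱?\" (Hello, how much is this book?) |\n",
"| A poem | \"床前明月光,疑是地上霜\" (Moonlight before my bed; I took it for frost on the ground) |\n",
"| A piece of code | `print('Hello World')` |\n",
"| An email | Body text, recipient, sender, and so on |\n",
"| Chat logs | WeChat conversations, SMS messages |\n",
"\n",
"**In short: any information made up of written characters is text data!**"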
]
},
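{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1.2 Characteristics of Text Data\n",
"\n",
"Text data differs markedly from image and audio data:\n",
"\n",
"| Characteristic | Description | Example |\n",
"|------|------|------|\n",
"| **Discrete symbols** | Made of discrete characters or words rather than continuous values | \"hello\" consists of the 5 characters h, e, l, l, o |\n",
"| **Sequential** | Symbols appear in a specific order; changing the order changes the meaning | \"我爱你\" (I love you) ≠ \"你爱我\" (You love me) |\n",
"| **Semantically rich** | The same word can mean different things in different settings | \"苹果\" (apple) can be the fruit or the phone brand |\n",
"| **Context-dependent** | What a word refers to depends on the surrounding text | In \"他打了猫,猫跑了\" (He hit the cat; the cat ran away), context tells us that both occurrences of \"猫\" refer to the same cat |\n",
"| **Ambiguous** | The same sentence can be read in more than one way | \"天气真不错\" (The weather is really nice) can be sincere or sarcastic |\n",
"\n",
"### Think about it: how important is order?\n",
"\n",
"```\n",
"Text 1: \"我吃了饭\"  (I ate the meal)\n",
"Text 2: \"饭了我吃\"  (scrambled; not a valid sentence)\n",
"Text 3: \"饭吃了我\"  (The meal ate me)\n",
"\n",
"These three texts consist of exactly the same characters, but the order differs and so does the meaning!\n",
"This shows that the order of a text carries important semantic information.\n",
"```"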
]
},
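{
"cell_type": "markdown",
"metadata": {},
"source": [
"The point about order can be checked mechanically. The snippet below is a minimal sketch, not part of the original demo: it takes the three example strings from section 1.2 and shows that they contain the same characters yet are different strings. The variable names (`texts`, `char_bags`, `same_characters`, `all_distinct`) are illustrative choices.\n",
"\n",
"```python\n",
"# Illustrative sketch: the three example strings from section 1.2.\n",
"texts = ['我吃了饭', '饭了我吃', '饭吃了我']\n",
"\n",
"# Sorting the characters discards their order, so every string\n",
"# collapses to the same sorted list of characters.\n",
"char_bags = [sorted(t) for t in texts]\n",
"same_characters = all(bag == char_bags[0] for bag in char_bags)\n",
"\n",
"# Comparing the strings directly keeps the order, so they stay distinct.\n",
"all_distinct = len(set(texts)) == len(texts)\n",
"\n",
"print('Same characters:', same_characters)\n",
"print('All strings distinct:', all_distinct)\n",
"```\n",
"\n",
"Sorting throws the order away, which is why the three strings become indistinguishable; any representation that ignores order loses the same information."
]
},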
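{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"<a id=\"第二部分-计算机如何读取文本\"></a>\n",
"\n",
"# Part 2: How Does a Computer \"Read\" Text?\n",
"\n",
"## 2.1 Comparison: How Image Data and Text Data Are Stored\n",
"\n",
"### Reading image data\n",
"\n",
"```\n",
"Image file (.jpg/.png)\n",
"    ↓\n",
"The computer reads pixel values (each value is a number from 0 to 255)\n",
"    ↓\n",
"Stored as a 3-D array [height, width, channels (RGB)]\n",
"    ↓\n",
"One 1920×1080 colour image = 1920 × 1080 × 3 = 6,220,800 numbers\n",
"```\n",
"\n",
"**The essence of an image: a dense numeric array that a computer can process directly!**\n",
"\n",
"### Reading text data\n",
"\n",
"```\n",
"Text file (.txt/.md/.py)\n",
"    ↓\n",
"The computer decodes the characters (ASCII/UTF-8/GBK)\n",
"    ↓\n",
"Stored as a character sequence (each character is a numeric code)\n",
"    ↓\n",
"\"Python\" → [80, 121, 116, 104, 111, 110] (ASCII codes)\n",
"```\n",
"\n",
"**The essence of text: a sequence of symbols that needs extra processing before a computer can work with its meaning!**"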
]
},
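{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the comparison concrete, here is a small sketch (an illustration added alongside the diagrams, not the notebook's own demo). It builds a hypothetical 2×3 RGB image as nested lists, counts its values with the same height × width × channels rule, and then converts the string `'Python'` into integer codes. The names `tiny_image`, `value_count`, `full_hd_values` and `codes` are assumptions made for this example.\n",
"\n",
"```python\n",
"# A tiny hypothetical image: tiny_image[row][col] is one pixel [R, G, B],\n",
"# and every channel value is an integer from 0 to 255.\n",
"height, width, channels = 2, 3, 3\n",
"tiny_image = [[[0, 0, 0] for _ in range(width)] for _ in range(height)]\n",
"tiny_image[0][1] = [255, 0, 0]  # make one pixel pure red\n",
"\n",
"# Count every stored number: height * width * channels.\n",
"value_count = sum(len(pixel) for row in tiny_image for pixel in row)\n",
"print(value_count)  # 18\n",
"\n",
"# The same rule applied to the 1920x1080 colour image from the text.\n",
"full_hd_values = 1080 * 1920 * 3\n",
"print(full_hd_values)  # 6220800\n",
"\n",
"# Text, by contrast, is a sequence with one integer code per character.\n",
"codes = [ord(ch) for ch in 'Python']\n",
"print(codes)  # [80, 121, 116, 104, 111, 110]\n",
"```\n",
"\n",
"Both end up as numbers, but the pixel values are measurements that can be averaged or convolved, whereas the character codes are arbitrary labels: averaging the codes of two letters does not give a meaningful letter."
]
},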
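{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.2 Character Encoding: Representing Characters as Numbers\n",
"\n",
"### ASCII (English letters and some symbols)\n",
"\n",
"ASCII uses the numbers 0-127 to represent 128 characters:\n",
"\n",
"| Character | ASCII code | Notes |\n",
"|------|---------|------|\n",
"| 'A' | 65 | Uppercase letter |\n",
"| 'B' | 66 | Uppercase letter |\n",
"| ... | ... | ... |\n",
"| 'Z' | 90 | Uppercase letter |\n",
"| 'a' | 97 | Lowercase letter |\n",
"| 'b' | 98 | Lowercase letter |\n",
"| ... | ... | ... |\n",
"| 'z' | 122 | Lowercase letter |\n",
"| '0' | 48 | Digit |\n",
"| '1' | 49 | Digit |\n",
"| ... | ... | ... |\n",
"| '9' | 57 | Digit |\n",
"\n",
"### UTF-8 (covers virtually every writing system, including Chinese)\n",
"\n",
"UTF-8 is a variable-length encoding of Unicode: each character takes 1 to 4 bytes, and common Chinese characters take 3 bytes. The numbers below are Unicode code points (what Python's `ord()` returns); the last column is the number of bytes UTF-8 uses to store that code point.\n",
"\n",
"| Character | Unicode code point | UTF-8 bytes |\n",
"|------|-------------|--------|\n",
"| '中' | 20013 | 3 bytes |\n",
"| '文' | 25991 | 3 bytes |\n",
"| 'P' | 80 | 1 byte |\n",
"| 'y' | 121 | 1 byte |\n",
"| '👍' | 128077 | 4 bytes (emoji) |"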
]
},
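{
"cell_type": "markdown",
"metadata": {},
"source": [
"A code point and its UTF-8 bytes are easy to confuse, so the following sketch shows both side by side. It is an added illustration under the assumption that the goal is to see the stored byte count; it uses only the built-ins `ord()` and `str.encode()`, and the name `utf8_lengths` is chosen for this example.\n",
"\n",
"```python\n",
"# ord() gives the Unicode code point of a character;\n",
"# str.encode('utf-8') gives the bytes actually stored on disk.\n",
"for ch in ['P', '中', '👍']:\n",
"    code_point = ord(ch)\n",
"    utf8_bytes = ch.encode('utf-8')\n",
"    print(ch, code_point, list(utf8_bytes), len(utf8_bytes))\n",
"\n",
"# Byte length per character for a short mixed string.\n",
"utf8_lengths = {ch: len(ch.encode('utf-8')) for ch in 'P中文👍'}\n",
"print(utf8_lengths)\n",
"```\n",
"\n",
"The ASCII letter occupies a single byte whose value equals its code point, while the Chinese character and the emoji are spread over 3 and 4 bytes respectively."
]
},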
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"==================================================\n",
"Character codes of English text\n",
"==================================================\n",
"Text: Hello\n",
"ASCII code of each character: [72, 101, 108, 108, 111]\n",
"\n",
" 'H' -> 72\n",
" 'e' -> 101\n",
" 'l' -> 108\n",
" 'l' -> 108\n",
" 'o' -> 111\n",
"\n",
"==================================================\n",
"Character codes of Chinese text\n",
"==================================================\n",
"Text: 你好\n",
"Unicode code point of each character: [20320, 22909]\n",
"\n",
" '你' -> 20320\n",
" '好' -> 22909\n"
]
}
],
"source": [
"# Live demo: inspect the numeric code of each character\n",
"\n",
"# English example\n",
"text_en = \"Hello\"\n",
"print(\"=\" * 50)\n",
"print(\"Character codes of English text\")\n",
"print(\"=\" * 50)\n",
"print(f\"Text: {text_en}\")\n",
"print(f\"ASCII code of each character: {[ord(c) for c in text_en]}\")\n",
"print()\n",
"\n",
"# Show the characters one by one\n",
"for c in text_en:\n",
"    print(f\" '{c}' -> {ord(c)}\")\n",
"\n",
"print()\n",
"print(\"=\" * 50)\n",
"print(\"Character codes of Chinese text\")\n",
"print(\"=\" * 50)\n",
"\n",
"# Chinese example\n",
"text_cn = \"你好\"\n",
"print(f\"Text: {text_cn}\")\n",
"# ord() returns the Unicode code point, not the UTF-8 byte sequence\n",
"print(f\"Unicode code point of each character: {[ord(c) for c in text_cn]}\")\n",
"print()\n",
"\n",
"# Show the characters one by one\n",
"for c in text_cn:\n",
"    print(f\" '{c}' -> {ord(c)}\")"
]
},
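{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Check: converting numeric codes back to characters\n",
"\n",
"chr(65) = 'A' # expected: uppercase letter A\n",
"chr(97) = 'a' # expected: lowercase letter a\n",
"chr(20013) = '中' # expected: the Chinese character '中'\n",
"chr(25991) = '文' # expected: the Chinese character '文'\n"
]
}
],
"source": [
"# Use chr() to check the reverse direction: numeric code back to character\n",
"print(\"Check: converting numeric codes back to characters\")\n",
"print()\n",
"\n",
"# 65 is the uppercase letter A\n",
"print(f\"chr(65) = '{chr(65)}' # expected: uppercase letter A\")\n",
"\n",
"# 97 is the lowercase letter a\n",
"print(f\"chr(97) = '{chr(97)}' # expected: lowercase letter a\")\n",
"\n",
"# 20013 is the Chinese character \"中\"\n",
"print(f\"chr(20013) = '{chr(20013)}' # expected: the Chinese character '中'\")\n",
"\n",
"# 25991 is the Chinese character \"文\"\n",
"print(f\"chr(25991) = '{chr(25991)}' # expected: the Chinese character '文'\")"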
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"==================================================\n",
"Exercise 1: answers\n",
"==================================================\n",
"1. ASCII codes of 'Hello':\n",
"[72, 101, 108, 108, 111]\n",
"\n",
"2. Check chr(65):\n",
"chr(65) = 'A'\n",
"\n",
"Check the ASCII range of A-Z (65-90):\n",
"['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']\n"
]
}
],
"source": [
"# Exercise 1 answers: checking character codes\n",
"print(\"=\" * 50)\n",
"print(\"Exercise 1: answers\")\n",
"print(\"=\" * 50)\n",
"\n",
"# 1. Use ord() to print the ASCII code of each character in \"Hello\"\n",
"print(\"1. ASCII codes of 'Hello':\")\n",
"print([ord(c) for c in \"Hello\"])\n",
"\n",
"# 2. Check that code 65 corresponds to the uppercase letter A\n",
"print()\n",
"print(\"2. Check chr(65):\")\n",
"print(f\"chr(65) = '{chr(65)}'\")\n",
"\n",
"# Check the whole range\n",
"print()\n",
"print(\"Check the ASCII range of A-Z (65-90):\")\n",
"print([chr(i) for i in range(65, 91)])"
]
},
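{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.3 What Are Computers Good At, and What Are They Not?\n",
"\n",
"### Tasks computers are good at ✅\n",
"\n",
"| Task type | Example | Notes |\n",
"|----------|------|------|\n",
"| Numeric computation | 1 + 2 = 3 | Arithmetic, solving equations |\n",
"| Logical decisions | if a > b then ... | Conditional branching, Boolean operations |\n",
"| Matrix operations | Image convolution, matrix multiplication | The core of deep learning |\n",
"| Exact matching | Testing whether two strings are identical | Database queries |\n",
"| Pattern matching | Finding data that fits a rule | Regular expressions |\n",
"| Storage and retrieval | Fast access to massive amounts of data | Search engines |\n",
"\n",
"### Tasks computers are not good at ❌\n",
"\n",
"| Task type | Example | Why it is hard |\n",
"|----------|------|-------------|\n",
"| Semantic understanding | Is \"今天天气真好\" (The weather is lovely today) positive or negative? | Requires common sense and context |\n",
"| Sentiment judgement | Is \"真是绝了\" (roughly: that is really something) praise or an insult? | Ambiguity, sarcasm |\n",
"| Fuzzy reasoning | \"大概\" (probably), \"也许\" (maybe) | There is no precise value to compute with |\n",
"| Creative writing | Poetry, fiction | Requires imagination |\n",
"| Common-sense understanding | \"水往低处流\" (water flows downhill) | No built-in physical common sense |\n",
"| Resolving polysemy | What does \"苹果\" (apple) refer to? | Requires world knowledge |"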
]
},
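{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Why are computers not good at understanding text?\n",
"\n",
"**Reason 1: text is made of \"symbols\", not \"numeric values\"**\n",
"\n",
"```\n",
"A computer's brain = a calculator (built to process numbers)\n",
"Text = a pile of symbols (meaningless to the computer on their own)\n",
"\n",
"Numbers: 1, 2, 3, 100.5, -7 → the computer can calculate with them directly\n",
"Text: \"好\", \"bad\", \"hello\" → the computer has no idea what they mean\n",
"```\n",
"\n",
"**Reason 2: meaning is not stated explicitly**\n",
"\n",
"```python\n",
"# The text as a human reads it\n",
"# (\"He is in a bad mood today because it rained\"):\n",
"text = \"他今天心情不太好,因为下雨了\"\n",
"\n",
"# What a human understands:\n",
"# - \"心情不太好\" = unhappy\n",
"# - \"因为下雨了\" = the cause is the rain\n",
"# - Causal link: rain → bad mood\n",
"\n",
"# All the computer can do is:\n",
"print(text)\n",
"# Computer: ??? It has no notion of the causal link between rain and mood\n",
"```\n",
"\n",
"**Reason 3: the same symbol means different things in different contexts**\n",
"\n",
"```\n",
"Context 1: \"苹果真好吃\" (The apple is delicious) → the fruit (an apple you eat)\n",
"\n",
"Context 2: \"苹果手机真贵\" (Apple phones are expensive) → the phone brand (Apple)\n",
"\n",
"Context 3: \"牛顿被苹果砸到了\" (Newton was hit by an apple) → the fruit (the inspiration for universal gravitation)\n",
"\n",
"How can a computer tell? It needs the ability to understand context!\n",
"```"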
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Key takeaway: why do we need text vectorization?\n",
"\n",
"```\n",
"┌─────────────────────────────────────────────────────────────┐\n",
"│                        Core tension                         │\n",
"│                                                             │\n",
"│   Text (symbol sequence)  ←→  Computers excel at numbers    │\n",
"│                              ↓                              │\n",
"│                     A bridge is needed                      │\n",
"│                       That bridge is                        │\n",
"│                    [Text vectorization]                     │\n",
"│                                                             │\n",
"│  Text → numeric vector → computable → AI model processing   │\n",
"└─────────────────────────────────────────────────────────────┘\n",
"```"
]
},
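{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"<a id=\"第三部分-向量基础入门\"></a>\n",
"\n",
"# Part 3: Vector Basics\n",
"\n",
"## 3.1 What Is a Vector?\n",
"\n",
"**A vector is a quantity with a direction**: it is the basic mathematical tool for describing \"magnitude + direction\".\n",
"\n",
"### Everyday examples of vectors\n",
"\n",
"| Example | Magnitude | Direction | Notes |\n",
"|------|------|------|------|\n",
"| Velocity | 60 km/h | North | Velocity is a vector |\n",
"| Force | 10 N | Pushing to the right | Force is a vector |\n",
"| Wind | 5 m/s | From the southeast | Wind velocity is a vector |\n",
"| Displacement | About 1,000 km | Beijing → Shanghai | Displacement is a vector |\n",
"\n",
"### Representing vectors mathematically\n",
"\n",
"**1-D vectors (points on a number line)**\n",
"\n",
"```\n",
" <---+---+---+---+---+---+---+--->\n",
"    -3  -2  -1   0   1   2   3\n",
"\n",
" Point A is at position 2   → vector A = [2]   (just 1 number)\n",
" Point B is at position -3  → vector B = [-3]  (the negative sign means the opposite direction)\n",
"```"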
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"==================================================\n",
"Creating vectors with NumPy\n",
"==================================================\n",
"1-D vector v1 = [3]\n",
"Number of elements in v1: 1\n",
"\n",
"2-D vector v2 = [2 3]\n",
"Number of elements in v2: 2\n",
"\n",
"3-D vector v3 = [1 2 3]\n",
"Number of elements in v3: 3\n",
"\n",
"10-D vector v10 = [ 0.1  0.5 -0.3  0.8  0.2 -0.1  0.7  0.3 -0.2  0.6]\n",
"Number of elements in v10: 10\n"
]
}
],
"source": [
"# Creating vectors in Python with NumPy\n",
"# Note: \"N-D vector\" here means a vector with N components,\n",
"# which is different from the ndim attribute of a NumPy array.\n",
"import numpy as np\n",
"\n",
"print(\"=\" * 50)\n",
"print(\"Creating vectors with NumPy\")\n",
"print(\"=\" * 50)\n",
"\n",
"# 1-D vector: a single number\n",
"v1 = np.array([3])\n",
"print(f\"1-D vector v1 = {v1}\")\n",
"print(f\"Number of elements in v1: {len(v1)}\")\n",
"\n",
"# 2-D vector: 2 numbers, representing a point in the plane\n",
"v2 = np.array([2, 3])\n",
"print(f\"\\n2-D vector v2 = {v2}\")\n",
"print(f\"Number of elements in v2: {len(v2)}\")\n",
"\n",
"# 3-D vector: 3 numbers, representing a point in space\n",
"v3 = np.array([1, 2, 3])\n",
"print(f\"\\n3-D vector v3 = {v3}\")\n",
"print(f\"Number of elements in v3: {len(v3)}\")\n",
"\n",
"# High-dimensional vector (common in machine learning: tens to thousands of dimensions)\n",
"v10 = np.array([0.1, 0.5, -0.3, 0.8, 0.2, -0.1, 0.7, 0.3, -0.2, 0.6])\n",
"print(f\"\\n10-D vector v10 = {v10}\")\n",
"print(f\"Number of elements in v10: {len(v10)}\")"
]
},
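{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Geometric intuition for 2-D vectors\n",
"\n",
"```\n",
"  y (vertical axis)\n",
"  ↑\n",
"  |\n",
"3 |       * A(2,3)\n",
"  |\n",
"2 |\n",
"  |\n",
"1 |               * B(4,1)\n",
"  |\n",
"0 +---+---+---+---+---+---→ x (horizontal axis)\n",
"  0   1   2   3   4   5\n",
"\n",
"Vector A = [2, 3]  (x-coordinate 2, y-coordinate 3)\n",
"Vector B = [4, 1]  (x-coordinate 4, y-coordinate 1)\n",
"\n",
"The arrow from the origin (0,0) to the point (2,3) is the graphical representation of vector A\n",
"```"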
]
},
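{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since a vector is described as magnitude plus direction, both can be computed for the two vectors in the diagram. The sketch below is an added illustration, not the notebook's own method: it uses the standard-library functions `math.hypot` and `math.atan2`, measures the direction as the angle from the positive x axis, and stores the values in a dictionary named `results` (a name chosen for this example).\n",
"\n",
"```python\n",
"import math\n",
"\n",
"# The two vectors from the diagram, as (x, y) pairs.\n",
"vectors = {'A': (2, 3), 'B': (4, 1)}\n",
"\n",
"results = {}\n",
"for name, (x, y) in vectors.items():\n",
"    # Magnitude: length of the arrow from the origin, sqrt(x^2 + y^2).\n",
"    magnitude = math.hypot(x, y)\n",
"    # Direction: angle from the positive x axis, converted to degrees.\n",
"    angle_deg = math.degrees(math.atan2(y, x))\n",
"    results[name] = (magnitude, angle_deg)\n",
"    print(f'{name}: magnitude={magnitude:.3f}, angle={angle_deg:.1f} degrees')\n",
"```\n",
"\n",
"Vector B is the longer arrow but points closer to the x axis, while A is shorter and steeper, which matches the picture."
]
},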
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"findfont: Generic family 'sans-serif' not found because none of the following families were found: SimHei, Noto Sans CJK SC, WenQuanYi Micro Hei\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAArwAAAIrCAYAAAAN2Uq4AAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjgsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvwVt1zgAAAAlwSFlzAAAPYQAAD2EBqD+naQAAaLhJREFUeJzt3Xd8FHX+x/H3JpACIdQk9Ca9hA4GkIAgHGCJBZHfCbHrCR6IFRvFErtwyoGoEAsIB0pAQRBB4BAsgJHiiYAIqCShJhBMgMz8/lizsKRvNtnJ5PV8PPYhM/udmc/kG/CdyWdnHKZpmgIAAABsys/XBQAAAAAlicALAAAAWyPwAgAAwNYIvAAAALA1Ai8AAABsjcALAAAAWyPwAgAAwNYIvAAAALA1Ai8AAABsjcALAOWUw+FQ3759fV1GrtauXSuHw6FJkya5re/bt68cDodvirpIfHy8HA6H4uPjfV0KgAIQeAEU2u+//66pU6dq4MCBatiwoQICAlS7dm1df/31+uabb3LdJjugZL8qVqyomjVrqmPHjrr99tu1YsUKGYZRqOPv2rVLDodDrVq1KnDs448/LofDoeeee65I51gUv/76qxwOh2655ZYSO0ZBHnvsMTkcDsXFxeU7zjAMNWzYUP7+/jp48GApVVe2WWF+AXhHBV8XAKDseP311/XCCy/okksu0cCBAxUWFqbdu3crISFBCQkJmjdvnoYPH57rtg888IBCQkJkGIZOnDih//3vf5o7d65mz56tnj176sMPP1TDhg3zPX7Lli3Vu3dvbdiwQV999ZV69eqV6zjDMPTee+/J39/f9mHltttuU1xcnObMmaMJEybkOW7VqlU6ePCg/va3v6lBgwaSpP/973+qVKlSaZXqFe+9955Onz7t6zIkSddee60uvfRS1alTx9elACgAgRdAoXXv3l1r165VdHS02/r//ve/6t+/v/7xj38oJiZGgYGBObZ98MEHVbt2bbd1R44c0T//+U99+OGHGjRokDZv3qzKlSvnW8Ptt9+uDRs2aPbs2XkG3pUrV+q3337T0KFDVbdu3SKeZdnSrFkzRUdHa926dfrvf/+ryy67LNdxs2fPluT8+mUrzJVyqynoh6LSVLVqVVWtWtXXZQAoBFoaABTaddddlyPsStJll12mfv366fjx49q+fXuh91erVi198MEHuvzyy/XTTz9p+vTpBW4zbNgwValSRf/5z3+Unp6e65jcwl1KSoruv/9+NWvWTIGBgapVq5auv/567dixI9d9pKSk6IEHHlDLli0VHBysGjVqqEePHnr55ZclOfs3mzRpIkl699133do21q5d69pPenq6Jk6cqFatWikoKEg1atTQ0KFD9dVXX+U45qRJk1zbx8fHq3PnzqpUqVKBfbbZ55l93hc7duyYlixZolq1aunqq692rc+thzc1NVVPPfWU2rRpo5CQEIWGhqpZs2aKjY3V/v37XeNuueUWORwO/frrr/meR7YzZ87o9ddf16BBg9SgQQMFBgYqPDxc1113nb7//vt8z+9CufXwXvi1z+11YY/t4sWLNWLECDVr1kyVKlVS1apVddlll+mjjz5y22dh5je/Ht6vvvpKQ4cOVY0aNRQUFKRWrVpp4sSJuV6dzp6H5ORkxcbGqlatWgoODtall17q9jUE4Dmu8ALwiooVK0qSKlQo2j8rfn5+evzxx7VmzRotWLBADz/8cL7jK1eurJtuuklvvfWW/vOf/+jWW291e//o0aNaunSpwsPDdeWVV0qS9u7dq759++q3337TwIEDFRMTo5SUFH300UdauXKlVq9erR49erj2sWvXLvXr10+HDh1S7969FRMTo/T0dO3cuVPPPfecHnzwQXXs2FFjx47VtGnT1KFDB8XExLi2b9y4sSQpIyNDl19+ub799lt17txZ48aNU3JyshYsWKCVK1fqww8/1LBhw3Kc40svvaQvv/xS11xzjQYOHC
h/f/98vyY33HCD7rvvPi1cuFCvv/66QkJC3N6fN2+eMjMzde+99yogICDP/ZimqUGDBumbb75Rr1699Le//U1+fn7av3+/li5dqpEjR6pRo0b51pKXY8eOady4cbrssss0ZMgQVa9eXb/88ouWLl2qzz77TOvXr1e3bt082vfEiRNzXT9jxgylpKS4tW1MmDBBAQEB6t27t+rUqaPDhw9r6dKluuGGG/Svf/1L9913nyQVan7zsnDhQo0YMUKBgYEaPny4wsPD9fnnn2vKlClauXKl1q5dq6CgILdtTpw4od69e6tq1aoaOXKkUlJStGDBAg0aNEhbtmxRu3btPPraAPiLCQDFtH//fjMwMNCsU6eOee7cObf3oqOjTUnmoUOH8tw+IyPDrFChgunn52eePXu2wON9/fXXpiSzd+/eOd6bNm2aKcl88MEHXet69uxp+vv7mytWrHAbu2vXLrNKlSpm+/bt3dZ37drVlGTOmjUrx/4PHjzo+vO+fftMSWZsbGyudU6ePNmUZP797383DcNwrd+6dasZEBBgVqtWzUxLS3OtnzhxoinJrFy5srlt27b8vwgXueeee0xJ5ttvv53jvU6dOpmSzB07dritl2RGR0e7lrdt22ZKMmNiYnLsIyMjwzx58qRrOTY21pRk7tu3L8fY7PP48ssv3bb/7bffcozdsWOHGRISYg4YMMBt/ZdffmlKMidOnOi2Pvv7qSDPP/+8Kcm85pprzKysLNf6vXv35hh78uRJs3379mbVqlXN9PR01/qC5nfOnDmmJHPOnDmudampqWbVqlXNwMBA84cffnCtz8rKMocPH25KMqdMmeK2H0mmJPPee+91q/Xtt982JZl33313gecLIH+0NAAolrNnz2rkyJHKzMzUCy+8UODVyNwEBgaqZs2aMgxDx44dK3B8jx491K5dO23YsEG7d+92e2/OnDmSnB/mkqTvv/9eGzduVGxsrAYNGuQ2tkWLFrrzzju1fft2V2vDt99+q82bN6tPnz668847cxy7fv36hT6vd999VxUrVtTzzz/v9mv4Tp06KTY2VidOnFBCQkKO7e666y61b9++0MeR8m5r+OGHH/T999+re/fuatu2baH2FRwcnGNdYGBgjivHRREYGKh69erlWN+2bVv169dP69ev19mzZz3e/4U+/vhjTZgwQZ07d9bcuXPl53f+f3VNmzbNMT4kJES33HKLUlNT9d133xXr2EuWLFFqaqpuu+02RUZGutb7+fnpxRdfVIUKFXJtgahcubJeeOEFt1pjY2NVoUKFYtcEgJYGAMVgGIZuueUWrV+/XnfeeadGjhxZase+/fbbdf/992v27NmuW3Jt3bpViYmJioqKUuvWrSVJX3/9tSQpOTk5xz1dJemnn35y/bddu3b69ttvJUkDBw4sVn1paWn65Zdf1Lp161xDcr9+/fTWW28pMTExx9ete/fuRT5e165d1aFDB23cuFG7du1Sy5YtJUnvvPOOJPd+5ry0bt1akZGR+vDDD/Xbb78pJiZGffv2VceOHd2CmKcSExP14osvasOGDUpKSsoRcI8cOVLsOx5s3rxZI0eOVN26dfXJJ5/k+BBkSkqKnn/+eX322Wfav3+//vzzT7f3//jjj2IdP7sfObe+64YNG6pp06b6+eefdfLkSVWpUsX1XosWLXL8QFGhQgVFREToxIkTxaoJAIEXgIcMw9Btt92mefPm6eabb9bMmTM93ldmZqaOHj0qf39/1ahRo1Db3HzzzXrkkUf03nvv6ZlnnpG/v3+uH1bLvmK8bNkyLVu2LM/9ZX8ALjU1VZJyvRpZFGlpaZKkiIiIXN/PDnbZ4y6U1zYFuf322/XPf/5Ts2fP1gsvvKAzZ85o3rx5qlSpkm666aYCt69QoYLWrFmjSZMm6aOPPtIDDzwgSQoLC9OYMWP0+OOPe3QFX5I2btyoyy+/XJLzh4nmzZsrJCREDodDCQkJ+uGHH5SZmenRvrMdPH
hQV111lRwOhz755JMcd+g4duyYunXrpgMHDqhXr14aMGCAqlWrJn9/fyUmJmrJkiXFrqEw8/7zzz8rLS3NLfCGhobmOr5ChQrKysoqVk0AuEsDAA8YhqFbb71V7777rkaMGKH4+PhiXQH86quvdO7cOXXs2LHQH3qrVauWrrnmGv3xxx/67LPPlJmZqXnz5ikkJMTtXsDZQeL111+XaZp5vmJjYyVJ1apVk+R8yEZxZB83OTk51/eTkpLcxl3I0yeJ/f3vf1dgYKDee+89nTt3TkuWLNHRo0c1bNiwPAPVxWrWrKnXX39dv//+u3788Ue98cYbqlGjhiZOnKgXX3zRNS57vs+dO5djH9k/NFzo2WefVWZmpr744gstXbpUr7zyiiZPnqxJkybluF2dJ06ePKkrr7xSKSkpmjdvnjp16pRjzDvvvKMDBw7o6aef1oYNG/T666/r6aef1qRJk3TppZcWuwapePMOoOQQeAEUSXbYfe+99zR8+HC9//77Hl/1y97fs88+K0kaMWJEkba9sG81ISFBx48f14033uj2q+Hsuy9s2rSpUPvMbif4/PPPCxybfd65XYELDQ1V06ZNtWfPnlzDc/btpjp27FiougqjRo0auvbaa5WUlKTly5fnesW7sBwOh1q3bq3Ro0dr1apVkqSlS5e63q9evbqk3H8wyO02Y3v37lWNGjXUu3dvt/WnT5/W1q1bi1zfhbKysnTTTTdp27Zteumll9xuvXZxDZJ0zTXX5Hjvv//9b451+c1vXrKDdm63Ezt48KD27t2rpk2bul3dBVDyCLwACi27jeG9997TsGHD9MEHHxQr7B45ckQ333yz1qxZozZt2ugf//hHkba/4oor1KBBA3366ad69dVXJeUMd927d1ePHj304YcfasGCBbme07p161zL3bp1U7du3bR+/Xq99dZbOcZfGPCqV68uh8OR56N6Y2NjdfbsWU2YMEGmabrWb9u2TfHx8apatarb7a68Ifv84+Li9Pnnn6tFixZ5PoziYr/++muu99XNvlp54a20sm8hdvEHsBYtWuT29czWqFEjHT9+XDt37nSty8rK0oMPPqjDhw8Xqr68jBs3TsuXL9ddd92l8ePH5zku+5ZqGzZscFs/b948LV++PMf4guY3N9dcc42qVq2qOXPmuJ2raZp65JFHdO7cOds//Q+wInp4ARTalClT9O677yokJEQtWrTQM888k2NMTExMrlctX375ZdejhdPS0vTjjz/qv//9rzIyMtSrVy99+OGHRX7MrZ+fn2699VZNmTJF3377rVq1aqWePXvmGPfhhx+qX79+uummmzR16lR17txZwcHBOnDggDZt2qTDhw8rIyPDNX7u3Lnq27ev7rrrLr3//vuKiopSRkaGdu7cqe+//15Hjx6V5Px0f3Y4HjlypJo3by4/Pz/X/WoffvhhLVu2TO+//77+97//qX///q77q547d05vvfWW16/09e/fX40bN3Z9WC/7bhWFkZiYqOuuu07du3dXmzZtVLt2bf3+++9KSEiQn5+f7r//ftfYa665Rpdcconi4+N18OBBderUSf/73/+0Zs0aDRkyJEeAvO+++/T555+rd+/euvHGGxUUFKS1a9fq999/V9++fT1+wMK3336rN954Q8HBwQoLC8v1g4nZ35MjR47UCy+8oPvuu09ffvmlGjVqpB9++EGrV6/Wddddp48//thtu4LmNzehoaF66623NGLECPXo0UPDhw9XWFiYvvjiC23ZskXdu3fXQw895NG5AigGn90QDUCZk33v1fxeF96T1DTP3zc1+1WhQgWzevXqZocOHczbbrvNXLFihdu9R4tq3759psPhMCWZL774Yp7jjh07Zj7xxBNmu3btzODgYDMkJMRs3ry5+X//93/mxx9/nGN8UlKSOXbsWLNp06ZmQECAWaNGDbNHjx7mq6++6jZu165d5pAhQ8xq1aq56rjw/rOnTp0yn3zySbNFixaue+
8OHjzY/O9//5vjmLndv9YT2ff/9ff3N//44488x+mi+/AePHjQfPTRR81LL73UDA8PNwMCAsyGDRua1113nblp06Yc2+/bt8+MiYkxq1SpYlauXNns37+/+d133+V5HosWLTI7d+5sVqpUyaxVq5Z54403mnv37s31nr6FvQ9v9rjCfk8mJiaaAwcONKtXr25WqVLFjI6ONr/44otc76lrmvnPb17bmKZprl+/3hw8eLBZrVo1MyAgwGzRooX55JNPmqdOnSpwHi7UqFEjs1GjRrm+B6DwHKZ5we/ZAAAAAJuhhxcAAAC2RuAFAACArRF4AQAAYGtlKvBmP49+3Lhx+Y5buHChWrVqpaCgILVv3z7X280AAACgfCgzgfe7777Tm2++qcjIyHzHbdy4USNGjNDtt9+u77//XjExMYqJidGOHTtKqVIAAABYSZm4S8OpU6fUuXNn/fvf/9Yzzzyjjh07aurUqbmOHT58uNLT0/Xpp5+61l166aXq2LGjZs6cWUoVAwAAwCrKxIMnRo8eraFDh2rAgAG53uj+Qps2bcrxpJ1BgwYpISEhz20yMzOVmZnpWjYMQ8eOHVPNmjU9fqY9AAAASo5pmjp58qTq1q0rP7/8mxYsH3jnz5+vrVu36rvvvivU+KSkJEVERLiti4iIUFJSUp7bxMXFafLkycWqEwAAAKXv4MGDql+/fr5jLB14Dx48qLFjx2rVqlVuz3D3tgkTJrhdFU5NTVXDhg21f/9+hYaGlthxS4thGLrhhhu0aNGiAn8CQukyDENHjhxRrVq1mBuLYW6sjfmxLubGuuw2N2lpaWrUqFGhHtFu6cC7ZcsWpaSkqHPnzq51WVlZWr9+vd544w1lZmbK39/fbZvatWsrOTnZbV1ycrJq166d53ECAwMVGBiYY321atVsE3grVqyoatWq2eIb3E4Mw9CZM2eYGwtibqyN+bEu5sa67DY32edQmPZTS59t//79tX37diUmJrpeXbt21d///nclJibmCLuSFBUVpdWrV7utW7VqlaKiokqrbAAAAFiIpa/wVqlSRe3atXNbV7lyZdWsWdO1ftSoUapXr57i4uIkSWPHjlV0dLReeeUVDR06VPPnz9fmzZs1a9asUq8fAAAAvmfpK7yFceDAAR06dMi13LNnT82bN0+zZs1Shw4dtGjRIiUkJOQIzgAAACgfLH2FNzdr167Nd1mShg0bpmHDhpVOQQAAALC0Mn+FFwAAAMgPgRcAAAC2RuAFAACArRF4AQAAYGsEXgAAANgagRcAAAC2RuAFAACArRF4AQAAYGsEXgAAANgagRcAAAC2RuAFAACArRF4AQAAYGsEXgAAANgagRcAAAC2RuAFAACArRF4AQAAYGsEXgAAANgagRcAAAC2RuAFAACArRF4AQAAYGsEXgAAANgagRcAAAC2RuAFAACArRF4AQAAYGsEXgAAANgagRcAAAC2RuAFAACArRF4AQAAYGsEXgAAANgagRcAAAC2RuAFAACArRF4AQAAYGsEXgAAANgagRcAAAC2RuAFAACArRF4AQAAYGsEXgAAANgagRcAAAC2RuAFAACArRF4AQAAYGuWDrwzZsxQZGSkQkNDFRoaqqioKH322Wd5jo+Pj5fD4XB7BQUFlWLFAAAAsJoKvi4gP/Xr19fzzz+v5s2byzRNvfvuu7rmmmv0/fffq23btrluExoaql27drmWHQ5HaZULAAAAC7J04L3qqqvclp999lnNmDFDX3/9dZ6B1+FwqHbt2qVRHgAAAMoASwfeC2VlZWnhwoVKT09XVFRUnuNOnTqlRo0ayTAMde7cWc8991ye4ThbZmamMjMzXctpaWmSJMMwZBiGd07AhwzDkGmatjgXu2FurIu5sTbmx7qYG+uy29wU5TwsH3i3b9+uqKgoZWRkKCQkRIsXL1abNm1yHduyZUvNnj1bkZGRSk1N1csvv6yePXtq586dql+/fp7HiIuL0+TJk3OsP3z4sD
IyMrx2Lr5iGIbOnTunlJQU+flZum273DEMQ6mpqTJNk7mxGObG2pgf62JurMtuc3Py5MlCj3WYpmmWYC3FdubMGR04cECpqalatGiR3n77ba1bty7P0Huhs2fPqnXr1hoxYoSefvrpPMfldoW3QYMGOn78uEJDQ71yHr5kGIaGDBmi5cuX2+Ib3E4Mw9Dhw4cVFhbG3FgMc2NtzI91MTfWZbe5SUtLU/Xq1ZWamlpgXrP8Fd6AgAA1a9ZMktSlSxd99913mjZtmt58880Ct61YsaI6deqkPXv25DsuMDBQgYGBOdb7+fnZ4htCcvY22+l87IS5sS7mxtqYH+tibqzLTnNTlHMoc2drGIbb1dj8ZGVlafv27apTp04JVwUAAACrsvQV3gkTJmjw4MFq2LChTp48qXnz5mnt2rVauXKlJGnUqFGqV6+e4uLiJElTpkzRpZdeqmbNmunEiRN66aWXtH//ft1xxx2+PA0AAAD4kKUDb0pKikaNGqVDhw6patWqioyM1MqVK3XFFVdIkg4cOOB2Ofv48eO68847lZSUpOrVq6tLly7auHFjofp9AQAAYE+WDrzvvPNOvu+vXbvWbfm1117Ta6+9VoIVAQAAoKwpcz28AAAAQFEQeAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AsKvGjSWHw/kaM8bX1ZyXkHC+LodD2rzZ1xUBsDkCLwCUhH//2xnmevTwbR2XXSa9/74UG3t+3cGD0uTJUvfuUvXqUq1aUt++0hdfFO9Yzz0nXXqpFBYmBQVJzZtL48ZJhw+7j+va1VnTXXcV73gAUEgEXgAoCXPnOq+wfvuttGeP7+po2lS6+WapW7fz65YskV54QWrWTHrmGenJJ6WTJ6UrrpDmzPH8WFu2SB07So8/Lk2fLl1zjXN/PXtK6ennx9Wv76wpKsrzYwFAEVTwdQEAYDv79kkbN0offyzdfbcz/E6c6OuqzuvXTzpwwHllN9s99zjD6lNPSbfe6tl+P/oo57qoKOmGG6RPPpFuusmz/QJAMXGFFwC8be5cZ6vA0KHOsDd3rq8rcte2rXvYlaTAQGnIEOm335xXe72lcWPnf0+c8N4+AaCIuMILAN42d6503XVSQIA0YoQ0Y4b03XfubQV5OXVKysgoeFzFilLVqsWv9UJJSVKlSs6Xp0xTOnpUOndO2r1bevRRyd/f2SMMAD5C4AUAb9qyRfrpJ+n1153LvXs7e1bnzi1c4B0zRnr33YLHRUdLa9cWq1Q3e/Y4WzCGDXMGVE8lJ0t16pxfrl9fmjdPatWq+DUCgIcIvADgTXPnShERzj5ZyXmnhuHDpQ8+kF55peAw+fDDzg90FaR69eLXmu30aWfQDQ6Wnn++ePuqUUNatcp5lfr7750h+tQp79QJAB4i8AKAt2RlSfPnO8Puvn3n1/fo4Qy7q1dLAwfmv482bZyv0pKV5fww2Y8/Sp99JtWtW7z9BQRIAwY4/3zllVL//lKvXlJ4uHMZAHyAwAsA3rJmjXTokDP0zp+f8/25cwsOvKmp0p9/FnysgADn1dTiuvNO6dNPnbVdfnnx93exnj2dLQ5z5xJ4AfgMgRcAvGXuXOeVzOnTc7738cfS4sXSzJnO1oG8jB1bej28Dz3kvE/u1KnOD9eVlIwMZ5AHAB8h8AKAN/z55/kPfd1wQ87369aVPvxQWrrU2dObl9Lq4X3pJenll6XHHnOG7OJKT3f2K198h4ePPpKOH3c+XQ0AfITACwDesHSp8/61V1+d+/vZj9ydOzf/wFsaPbyLFzuDdfPmUuvWzg/UXeiKK5wfvJOkX3+VmjRxPpo4Pj7vfe7e7ezdHT
7ceUcGPz9p82bnvhs39k6oBgAPEXgBwBvmzpWCgpxhMTd+fs4HUcyd67xPbc2apVvfhX74wfnf3bulkSNzvv/ll+cDb/YdFi681Vhu6teXrr/e2cf87rvS2bNSo0bO26w9/rhvzxdAuUfgBQBvWLq04DFz5jhfpSkzUzpyxNk3XLmyc92kSc5XYaxf79xu3Lj8x9WqJb35ZuH2eeaMlJbG7coAlBoeLQwAdjZ/vrOV4pFHPNv+yy+lf/7z/BVfb1i+3FnTffd5b58AkA+u8AKAXc2de/4WZw0aeLaPhQu9V0+2Xr2cD6fI1rKl948BABcg8AKAXfXq5esKchcWdv7hFABQCmhpAAAAgK1ZOvDOmDFDkZGRCg0NVWhoqKKiovTZZ5/lu83ChQvVqlUrBQUFqX379lq+fHkpVQsAAAArsnTgrV+/vp5//nlt2bJFmzdv1uWXX65rrrlGO3fuzHX8xo0bNWLECN1+++36/vvvFRMTo5iYGO3YsaOUKwcAAIBVWDrwXnXVVRoyZIiaN2+uFi1a6Nlnn1VISIi+/vrrXMdPmzZNf/vb3/TQQw+pdevWevrpp9W5c2e98cYbpVw5AAAArKLMfGgtKytLCxcuVHp6uqKionIds2nTJo0fP95t3aBBg5SQkJDvvjMzM5WZmelaTktLkyQZhiHDMIpXuAUYhiHTNG1xLnbD3FgXc2NtzI91MTfWZbe5Kcp5WD7wbt++XVFRUcrIyFBISIgWL16sNnk8djMpKUkRF90rMiIiQklJSfkeIy4uTpMnT86x/vDhw8rIyPC8eIswDEPnzp1TSkqK/PwsfVG/3DEMQ6mpqTJNk7mxGObG2pgf62JurMtuc3Py5MlCj7V84G3ZsqUSExOVmpqqRYsWKTY2VuvWrcsz9HpiwoQJbleG09LS1KBBA4WFhSk0NNRrx/EVwzBUoUIFhYeH2+Ib3E4Mw5DD4VBYWBhzYzHMjbUxP9bF3FiX3eYmKCio0GMtH3gDAgLUrFkzSVKXLl303Xffadq0aXozl0dY1q5dW8nJyW7rkpOTVbt27XyPERgYqMDAwBzr/fz8bPENIUkOh8NW52MnzI11MTfWxvxYF3NjXXaam6KcQ5k7W8Mw3PptLxQVFaXVq1e7rVu1alWePb8A4GKa0jvvSHFxvq4EAOBllr7CO2HCBA0ePFgNGzbUyZMnNW/ePK1du1YrV66UJI0aNUr16tVT3F//gxo7dqyio6P1yiuvaOjQoZo/f742b96sWbNm+fI0AFjdtm1SbKyUmOhc/vNPadIkX1YEAPAiSwfelJQUjRo1SocOHVLVqlUVGRmplStX6oorrpAkHThwwO1yds+ePTVv3jw98cQTeuyxx9S8eXMlJCSoXbt2vjoFAFaWkiI9+aT09tvShZ/2PXzYdzUBALzO0oH3nXfeyff9tWvX5lg3bNgwDRs2rIQqAmALWVnSa69JTz8t/XUbQjeDB5d+TQCAEmPpwAsAJeKVV6RHHsn7/SZNSq8WAECJK3MfWgOAYsv+4KvDIVWunPP9Ro1Ktx4AQIki8AIofx57TFqyRLruOik93bku+36ONWpINrj/NgDgPAIvgPLH318KDJQ++si5XLmydPas88+NG/usLABAySDwAih/UlOlO+44v/zyy1J4uPPPvXv7piYAQInhQ2sAyp8HHpB++8355wEDpLvvlvr3l9avl7jLCwDYDoEXQPmycqXziWqSFBLivAevwyE1b+58AQBsh5YGAOVHbq0M3JEBAGyPwAug/Li4leGuu3xbDwCgVBB4AZQPebUyAABsj8ALwP5oZQCAco3AC8D+aGUAgHKNwAvA3mhlAIByj8ALwL5oZQAAiMALwM5oZQAAiMALwK5oZQAA/IXAC8B+aGUAAFyAwAvAfmhlAABcgMALwF5oZQAAXITAC8A+aGUAAOSCwAvAPmhlAADkgsALwB5oZQAA5IHAC6Dso5UBAJAPAi+Aso9WBgBAPgi8AMo2WhkAAA
Ug8AIou2hlAAAUAoEXQNlFKwMAoBAIvADKJloZAACFROAFUPbQygAAKAICL4Cyh1YGAEAREHgBlC20MgAAiojAC6DsoJUBAOABAi+AsoNWBgCABwi8AMoGWhkAAB4i8AKwPloZAADFQOAFYH20MgAAioHAC8DaaGUAABQTgReAddHKAADwAgIvAOuilQEA4AUEXgDWRCsDAMBLCLwArIdWBgCAF1k68MbFxalbt26qUqWKwsPDFRMTo127duW7TXx8vBwOh9srKCiolCoG4BW0MgAAvMjSgXfdunUaPXq0vv76a61atUpnz57VwIEDlZ6enu92oaGhOnTokOu1f//+UqoYQLHRygAA8LIKvi4gPytWrHBbjo+PV3h4uLZs2aI+ffrkuZ3D4VDt2rVLujwA3kYrAwCgBFg68F4sNTVVklSjRo18x506dUqNGjWSYRjq3LmznnvuObVt2zbP8ZmZmcrMzHQtp6WlSZIMw5BhGF6o3LcMw5BpmrY4F7thbtw5xo+X469WBrN/f5l33CH56GvD3Fgb82NdzI112W1uinIeZSbwGoahcePGqVevXmrXrl2e41q2bKnZs2crMjJSqampevnll9WzZ0/t3LlT9evXz3WbuLg4TZ48Ocf6w4cPKyMjw2vn4CuGYejcuXNKSUmRn5+lu1jKHcMwlJqaKtM0y/3cBHz5pWrMni1JMipX1pG4OBmHD/usHubG2pgf62JurMtuc3Py5MlCj3WYpmmWYC1e849//EOfffaZNmzYkGdwzc3Zs2fVunVrjRgxQk8//XSuY3K7wtugQQMdP35coaGhxa7d1wzD0JAhQ7R8+XJbfIPbiWEYOnz4sMLCwsr33KSmyhEZ6bq6a/z739Ldd/u0JObG2pgf62JurMtuc5OWlqbq1asrNTW1wLxWJq7wjhkzRp9++qnWr19fpLArSRUrVlSnTp20Z8+ePMcEBgYqMDAwx3o/Pz9bfENIzr5mO52PnTA3kh56yO2uDH733GOJD6oxN9bG/FgXc2NddpqbopyDpc/WNE2NGTNGixcv1po1a9SkSZMi7yMrK0vbt29XnTp1SqBCAMXGXRkAACXM0ld4R48erXnz5mnJkiWqUqWKkpKSJElVq1ZVcHCwJGnUqFGqV6+e4uLiJElTpkzRpZdeqmbNmunEiRN66aWXtH//ft1x4Se/AVgDd2UAAJQCSwfeGTNmSJL69u3rtn7OnDm65ZZbJEkHDhxwu6R9/Phx3XnnnUpKSlL16tXVpUsXbdy4UW3atCmtsgEUFg+YAACUAksH3sJ8nm7t2rVuy6+99ppee+21EqoIgNfQygAAKCWW7uEFYFO0MgAAShGBF0Dpo5UBAFCKCLwAShetDACAUkbgBVB6aGUAAPgAgRdA6aGVAQDgAwReAKWDVgYAgI8QeAGUPFoZAAA+ROAFUPJoZQAA+BCBF0DJopUBAOBjBF4AJYdWBgCABRB4AZQcWhkAABZA4AVQMmhlAABYBIEXgPfRygAAsBACLwDvo5UBAGAhBF4A3kUrAwDAYgi8ALyHVgYAgAUReAF4D60MAAALIvAC8A5aGQAAFkXgBVB8tDIAACyMwAug+GhlAABYGIEXQPHQygAAsDgCLwDP0coAACgDCLwAPPfgg7QyAAAsj8ALwDMrVzrbFyRaGQAAllahqBucPn1aq1at0ldffaUff/xRR44ckcPhUK1atdS6dWv16tVLAwYMUOXKlUuiXgBWQCsDAKAMKfQV3u3bt+uWW25R7dq1de2112r69Onas2ePHA6HTNPUzz//rDfeeEPXXnutateurVtuuUXbt28vydoB+AqtDACAMqRQV3iHDx+ujz76SF27dtWkSZN0xRVXqE2bNvL393cbl5WVpR9//FGff/65Fi1apE6dOmnYsGH68MMPS6R4AD5AKwMAoIwpVOD18/PT5s2b1bFjx3zH+fv7q3379mrfvr0eeOABJSYm6oUXXvBGnQCsgFYGAEAZVKjA6+
kV2o4dO3J1F7ATWhkAAGUQd2kAUDi0MgAAyiiPA29aWpqef/55DRo0SJ06ddK3334rSTp27JheffVV7dmzx2tFAvAxWhkAAGVYkW9LJkm//faboqOjdfDgQTVv3lw//fSTTp06JUmqUaOG3nzzTe3fv1/Tpk3zarEAfIRWBgBAGeZR4H3ooYd08uRJJSYmKjw8XOHh4W7vx8TE6NNPP/VKgQB8jFYGAEAZ51FLw+eff65//vOfatOmjRy5/I+vadOmOnjwYLGLA+BjtDIAAGzAo8D7559/KiwsLM/3T5486XFBACyEVgYAgA14FHjbtGmj9evX5/l+QkKCOnXq5HFRACyAVgYAgE14FHjHjRun+fPn64UXXlBqaqokyTAM7dmzRyNHjtSmTZt0//33e7VQAKWIVgYAgI149KG1m2++Wfv379cTTzyhxx9/XJL0t7/9TaZpys/PT88995xiYmK8WSeA0kQrAwDARjwKvJL0+OOPa+TIkfroo4+0Z88eGYahSy65RNddd52aNm3qzRoBlCZaGQAANuNR4D1w4IDCwsLUsGHDXFsX/vzzTx0+fFgNGzYsdoEAShGtDAAAG/Koh7dJkyZavHhxnu8vXbpUTZo08biobHFxcerWrZuqVKmi8PBwxcTEaNeuXQVut3DhQrVq1UpBQUFq3769li9fXuxagHKBVgYAgA15FHhN08z3/bNnz8rPz+OnFrusW7dOo0eP1tdff61Vq1bp7NmzGjhwoNLT0/PcZuPGjRoxYoRuv/12ff/994qJiVFMTIx27NhR7HoAW6OVAQBgU4VuaUhLS9OJEydcy0ePHtWBAwdyjDtx4oTmz5+vOnXqFLu4FStWuC3Hx8crPDxcW7ZsUZ8+fXLdZtq0afrb3/6mhx56SJL09NNPa9WqVXrjjTc0c+bMYtcE2BKtDAAAGyt04H3ttdc0ZcoUSZLD4dC4ceM0bty4XMeapqlnnnnGKwVeKPsWaDVq1MhzzKZNmzR+/Hi3dYMGDVJCQkKe22RmZiozM9O1nJaWJsl5qzXDMIpRsTUYhiHTNG1xLnZjlblxPPCAHH+1Mpj9+8u84w6pnH+/WGVukDvmx7qYG+uy29wU5TwKHXgHDhyokJAQmaaphx9+WCNGjFDnzp3dxjgcDlWuXFldunRR165dC19xIRiGoXHjxqlXr15q165dnuOSkpIUERHhti4iIkJJSUl5bhMXF6fJkyfnWH/48GFlZGR4XrRFGIahc+fOKSUlxSutJvAewzCUmprquqWfLwR8+aVqvPOOs57KlXUkLk7G4cM+qcVKrDA3yBvzY13MjXXZbW6K8mTfQgfeqKgoRUVFSZLS09N1/fXX5xs8vW306NHasWOHNmzY4PV9T5gwwe2qcFpamho0aKCwsDCFhoZ6/XilzTAMVahQQeHh4bb4BrcTwzDkcDgUFhbmm7lJTZXj4YfPL7/0kmp16VL6dViQz+cG+WJ+rIu5sS67zU1QUFChx3p0W7KJEyd6spnHxowZo08//VTr169X/fr18x1bu3ZtJScnu61LTk5W7dq189wmMDBQgYGBOdb7+fnZ4htCcl59t9P52IlP5+bhh93uyuB3zz18UO0C/L2xNubHupgb67LT3BTlHDx+8IQkffXVV9q6datSU1Nz9FE4HA49+eSTxdm9TNPUfffdp8WLF2vt2rWFutVZVFSUVq9e7dZfvGrVKtfVaQB/4a4MAIBywqPAe+zYMQ0dOlTffvutTNOUw+Fw3aos+8/eCLyjR4/WvHnztGTJElWpUsXVh1u1alUFBwdLkkaNGqV69eopLi5OkjR27FhFR0frlVde0dChQzV//nxt3rxZs2bNKlYtgK1wVwYAQDni0fXshx56SNu2bdO8efP0yy+/yDRNrVy5Uj///LPuuecedezYUX/88Uexi5sxY4ZSU1PVt29f1alTx/VasGCBa8yBAwd06NAh13LPnj01b948zZo1Sx06dNCiRYuUkJBQqv3GgO
XxgAkAQDni0RXe5cuX6+6779bw4cN19OhRSc4+imbNmmn69Om67rrrNG7cOH344YfFKq6gB1xI0tq1a3OsGzZsmIYNG1asYwO2RSsDAKCc8egK74kTJ9S2bVtJUkhIiCTp1KlTrvcHDhyolStXeqE8AF5FKwMAoBzyKPDWrVvX1U8bGBio8PBw/fDDD673f//9dzm4YgRYD60MAIByyKOWhj59+mjVqlV6/PHHJUnDhw/Xiy++KH9/fxmGoalTp2rQoEFeLRRAMdHKAAAopzwKvOPHj9eqVauUmZmpwMBATZo0STt37nTdlaFPnz56/fXXvVoogGKglQEAUI55FHjbt2+v9u3bu5arV6+uL774QidOnJC/v7+qVKnitQIBeAGtDACAcqxYD564WLVq1by5OwDeQCsDAKCc8zjwZmVlaeXKlfrll190/PjxHLcQ88aDJwAUE60MAAB4Fng3b96s66+/Xr/99lue98ol8AIWQCsDAACe3Zbs3nvv1Z9//qmEhAQdO3ZMhmHkeGVlZXm7VgBFQSsDAACSPLzCu23bNj377LO66qqrvF0PAG+glQEAABePrvDWr1+/UI/9BeAjtDIAAODiUeB95JFH9NZbbyktLc3b9QAoLloZAABw41FLw8mTJxUSEqJmzZrppptuUoMGDeTv7+82xuFw6P777/dKkQAKiVYGAABy8CjwPvjgg64/v/HGG7mOIfACPkArAwAAOXgUePft2+ftOgAUF60MAADkyqPA24hfkQLWQisDAAB58uhDawAshlYGAADyVKgrvE2aNJGfn59++uknVaxYUU2aNJGjgF+VOhwO7d271ytFAsgHrQwAAOSrUIE3OjpaDodDfn5+bssAfIxWBgAAClSowBsfH5/vMgAfoZUBAIAC0cMLlFW0MgAAUCiFusK7fv16j3bep08fj7YDUABaGQAAKLRCBd6+ffu69eyaplmoHt6srCzPKwOQN1oZAAAotEIF3i+//NJtOTMzUw8//LBOnz6tu+66Sy1btpQk/fTTT3rrrbdUuXJlvfjii96vFgCtDAAAFFGh79JwofHjxysgIEBff/21goKCXOuvuuoqjR49WtHR0VqxYoWuuOIK71YLlHe0MgAAUGQefWht7ty5GjlypFvYzVapUiWNHDlSH3zwQbGLA3ARWhkAACgyjwJvenq6Dh06lOf7hw4d0unTpz0uCkAuaGUAAMAjHgXeAQMGaNq0afr4449zvPfRRx9p2rRpGjBgQLGLA/AXWhkAAPBYoXp4LzZ9+nRdfvnlGjZsmOrUqaNmzZpJkvbu3as//vhDl1xyiV5//XWvFgqUa7QyAADgMY+u8NarV08//PCDXn31VbVr107JyclKTk5W27Zt9dprr+mHH35Q/fr1vV0rUD7RygAAQLEU+QpvRkaGZs2apY4dO2rs2LEaO3ZsSdQFQKKVAQAALyjyFd6goCA98sgj2rVrV0nUA+BCtDIAAFBsHrU0tGvXTr/++quXSwHghlYGAAC8wqPA++yzz+rNN9/UF1984e16AEi0MgAA4EUe3aXhjTfeUI0aNTRo0CA1adJETZo0UXBwsNsYh8OhJUuWeKVIoNyhlQEAAK/xKPBu27ZNDodDDRs2VFZWlvbs2ZNjjINfvQKeoZUBAACv8ijw0r8LlBBaGQAA8DqPengBlBBaGQAA8DqPrvBmW7dunZYtW6b9+/dLkho1aqShQ4cqOjraK8UB5QqtDAAAlAiPAu+ZM2c0YsQIJSQkyDRNVatWTZJ04sQJvfLKK7r22mv14YcfqmLFit6sFbAvWhkAACgxHrU0TJ48WYsXL9YDDzygQ4cO6dixYzp27JiSkpL04IMP6uOPP9aUKVO8XStgW46HHqKVAQCAEuJR4J03b55iY2P14osvKiIiwrU+PDxcL7zwgkaNGqX333/fKwWuX79eV111lerWrSuHw6GEhIR8x69du1YOhyPHKykpySv1AN4W8OWXcrzzjnOBVgYAAL
zOo8B76NAh9ejRI8/3e/To4bWAmZ6erg4dOmj69OlF2m7Xrl06dOiQ6xUeHu6VegCvSk1V1QcfPL9MKwMAAF7nUQ9v/fr1tXbtWt1zzz25vr9u3TrVr1+/WIVlGzx4sAYPHlzk7cLDw129xYBVOR56SH5//OFcoJUBAIAS4VHgjY2N1cSJE1WtWjXdf//9atasmRwOh3bv3q2pU6dq4cKFmjx5srdrLZKOHTsqMzNT7dq106RJk9SrV688x2ZmZiozM9O1nJaWJkkyDEOGYZR4rSXNMAyZpmmLc7GVzz+X31+tDGZIiMxZsyTTdL7gc/y9sTbmx7qYG+uy29wU5Tw8CryPPfaY9u7dq1mzZumtt96Sn5+f68CmaSo2NlaPPfaYJ7sutjp16mjmzJnq2rWrMjMz9fbbb6tv37765ptv1Llz51y3iYuLyzWgHz58WBkZGSVdcokzDEPnzp1TSkqKa67gW460NNW6/XbXcuoTTygjOFhKSfFhVbiQYRhKTU2VaZr8vbEg5se6mBvrstvcnDx5stBjHabp+eWkbdu2afny5W734R0yZIgiIyM93WW+HA6HFi9erJiYmCJtFx0drYYNG+b5QbrcrvA2aNBAx48fV2hoaHFKtgTDMDRkyBAtX77cFt/gduC46y7XB9UyL7tM/qtXy8/f38dV4UKGYejw4cMKCwvj740FMT/WxdxYl93mJi0tTdWrV1dqamqBea1YD56IjIwssXDrTd27d9eGDRvyfD8wMFCBgYE51vv5+dniG0Jy/rBgp/Mp01aulC5oZUh95RXV8vdnbiyIvzfWxvxYF3NjXXaam6KcQ6FGnj592uNiirOttyQmJqpOnTq+LgPI8YAJ88UXZTRo4MOCAACwv0IF3gYNGmjKlCk6dOhQoXf8+++/66mnnlLDhg09Lk6STp06pcTERCUmJkqS9u3bp8TERB04cECSNGHCBI0aNco1furUqVqyZIn27NmjHTt2aNy4cVqzZo1Gjx5drDoAr3jwQR4wAQBAKStUS8OMGTM0adIkTZkyRb169dKAAQPUuXNnNWnSRNWrV5dpmjp+/Lj27dunzZs364svvtDXX3+t5s2b69///nexCty8ebP69evnWh4/frwk550i4uPjdejQIVf4lZyPPX7ggQf0+++/q1KlSoqMjNQXX3zhtg/AJ1audD5UQuIBEwAAlKJCf2jNMAwtXbpU8fHxWrFihc6cOSPHRf+zNk1TAQEBGjhwoG677TZdffXVZbJHJC0tTVWrVi1UE3RZYBiGBg8erM8++6xMzoctpKZK7dqdv7o7c6Z0990yDEMpKSkKDw9nbiyGubE25se6mBvrstvcFCWvFfpDa35+foqJiVFMTIwyMzO1ZcsW/fTTTzp69KgkqWbNmmrVqpW6dOmS6wfAgHKNVgYAAHzGo7s0BAYGqmfPnurZs6e36wHsh1YGAAB8quxfzwas7KK7Mujll6VGjXxXDwAA5RCBFyhJtDIAAOBzBF6gpNDKAACAJRB4gZJAKwMAAJZB4AVKAq0MAABYhkeB95tvvvF2HYB90MoAAICleBR4o6Ki1KJFCz399NP65ZdfvF0TUHbRygAAgOV4FHg/+OADNW/eXE8//bSaN2+uXr16aebMmTp27Ji36wPKFloZAACwHI8C7//93/9p2bJl+uOPPzRt2jSZpql7771XdevWVUxMjBYtWqQzZ854u1bA2mhlAADAkor1obVatWppzJgx2rhxo3bv3q3HH39cP/30k4YPH67atWvrrrvu0oYNG7xVK2BdtDIAAGBZXrtLQ3BwsCpVqqSgoCCZpimHw6ElS5YoOjpa3bp1048//uitQwHWQysDAACWVazAe/LkSc2ZM0cDBgxQo0aN9Nhjj6lx48ZatGiRkpKS9Mcff2jBggVKSUnRrbfe6q2aAWuhlQEAAEur4MlGS5Ys0dy5c/Xpp58qIyND3bp109SpU3XTTTepZs2abmNvuOEGHT9+XKNHj/
ZKwYCl0MoAAIDleRR4r732WjVo0ED333+/Ro0apZYtW+Y7vkOHDvr73//uUYGApdHKAACA5XkUeNesWaO+ffsWenz37t3VvXt3Tw4FWBetDAAAlAke9fAWJewCtkQrAwAAZYbX7tIAlCu0MgAAUGYQeIGiopUBAIAyhcALFAWtDAAAlDkEXqAoaGVw0ze+rxyTHXJMdujKeVf6uhyXxKREV12OyQ4t+nGRr0sCUEY0buz8pZ3DIY0Z4+tqzps69XxdDod05IivKypbCLxAYVmolWHvsb26+5O71XRaUwU9E6TQuFD1mt1L076epj/P/lmqtbSq1UrvX/u+Huz5oNv6BTsW6OaPb1bz15vLMdmhvvF9i32sb3//Vvcuu1ddZnVRxacryjE5969/o6qN9P617+ux3o8V+5gASlZ8vHuQczik8HCpXz/ps898U9Nll0nvvy/FxuY9ZsMG74TPBQukm2+Wmjd37iuv+wL87W/Omq691vNjlWce3ZYMKHcs1Mqw7OdlGrZwmAIrBGpU5Ci1C2+nM1lntOHgBj206iHtPLxTs66aVWr1RFSO0M2RN+dYP2PzDG05tEXd6nbT0dNHvXKs5buX6+2tbysyIlJNqzfVz0d/znVc9eDqujnyZq39da2e2/CcV44NoGRNmSI1aSKZppSc7AzCQ4ZIn3wiXVnKv0Bq2tQZQvNiGNJ990mVK0vp6cU71owZ0pYtUrdu0tF8/qls1cr52rNHWry4eMcsjwi8QGFYpJVh3/F9uumjm9SoWiOtGbVGdarUcb03uvto7em3R8t+XuaT2i72/rXvq15oPfk5/NTu3+28ss9/dP2HHun1iIIrBmvM8jF5Bl4AZc/gwVLXrueXb79dioiQPvyw9ANvQWbNkg4edF4HmTatePt6/32pXj3Jz09q551/KpELAi9QEAu1Mrz41Ys6deaU3rn6Hbewm61ZjWYae+lYH1SWU4OqDby+z4iQCK/vE4A1VasmBQdLFSyWVI4dk554wnlFOiWl+Ptr4P1/KpELi30bARZjoVYGSfrk50/UtHpT9WzQ0+N9nD57WqfPni5wnL/DX9WDq3t8HAAoitRUZy+saTqD5OuvS6dO5d9akO3UKSkjo+BxFStKVasWr84nn5Rq15buvlt6+uni7Qulh8AL5McirQySlJaZpt9P/q5rWl5TrP28+NWLmrxucoHjGlVtpF/H/VqsYwFAYQ0Y4L4cGCjNni1dcUXB244ZI737bsHjoqOltWs9Kk+StG2b9Oab0vLlkr+/5/tB6SPwAnmxUCuD5Ay8klQlsEqx9jOqwyj1bti7wHHBFYKLdRwAKIrp06UWLZx/Tk6WPvjA+Qu2KlWk667Lf9uHHy7cleDqxfyl1T//6ew1HjiwePtB6SPwArmxWCuDJIUGhkqSTmaeLNZ+mlZvqqbVm3qjJADwmu7d3T+0NmKE1KmT8+rtlVdKAQF5b9umjfNVkhYskDZulHbsKNnjoGQQeIHcWKiVIVtoYKjqVqmrHSnF+9f21JlTOnXmVIHj/B3+CqscVqxjAYCn/Pyc9+KdNk3avVtq2zbvsamp0p+FuAV5QIBUo4Zn9Tz0kDRsmHMfv/7qXHfihPO/Bw9KZ85Idet6tm+UPAIvcDGLtTJc6MrmV2rW1lnadHCTohpEebSPlze+TA8vgDLh3Dnnf08V8DP62LEl38N78KA0b57zdbHOnaUOHaTERM/2jZJH4AUuZMFWhgs93Othzd0+V3d8cofWjFqT4zZde4/t1ac/f5rvrcno4QVQFpw9K33+ufOKauvW+Y8tjR7e3B72MH++s9Xhvfek+vU93zdKHoEXuJAFWxkudEmNSzTv+nkavmi4Wk9vrVEdzj9pbePBjVr440Ld0uGWfPdRWj286/ev1/r96yVJh08fVvrZdD2z/hlJUp9GfdSnUR/XWMdkh6IbRWvtLWvz3ef+E/v1/rb3JUmb/9gsSa59NqraSCM7jPT2aQAoJZ99Jv30k/
PPKSnOK6m7d0uPPiqFhua/bWn08MbE5FyXfUV38GCpVq3z69eudbZjTJwoTZqU/37Xr3e+JOnwYeeT255x/rOmPn2cLxQfgRfIZuFWhgtd3fJqbbtnm17a+JKW7FqiGZtnKNA/UJERkXpl4Cu6s/Odvi5RkrRm35ocrRNPfvmkJGli9ERX4M3uJ87tQRoX23din2sfF+8zulE0gRcow5566vyfg4Kcj9GdMcN5v9uyJrsFo07B/6xpzRpp8kVdZk/+9c/cxIkEXm8h8AKS5VsZLta8ZnPNumqWr8uQJJ01zurI6SMK8A9w3UlCkib1naRJfScVuP36/evlkEOP9X6swLF9G/eVOdEscFyWkaXjGceVmpFa4FgAvnXLLc6XlWRmOh+CERwsVa6c97hJk3K/grt+vbPFoTDnldc+LpaR4QzSpwt+bhBy4efrAgBLsHgrg5VtPLhRYS+F6f8++j+Ptv9y35e6qd1Nah/R3ms1bU/ZrrCXwhSzIMZr+wRQfsyfL4WFSY884tn2X37pvEobGOi9mmbOdNb00kve22d5whVeoIy0MljRKwNf0fGM45KksEqe3cLspYHe/9e7WY1mWjVylWs5MiLS68cAYE9z556/xVmDBp7t47vvvFdPtuuvl9q1O79c3EcklzcEXpRvZayVwWq61O3i6xJyFRIQogFNBxQ8EAAu0quXryvIXYMGngdw0NKA8o5WBgAAbI/Ai/KLVgYAAMoFywfe9evX66qrrlLdunXlcDiUkJBQ4DZr165V586dFRgYqGbNmik+Pr7E60QZQysDAADlhuUDb3p6ujp06KDp06cXavy+ffs0dOhQ9evXT4mJiRo3bpzuuOMOrVy5soQrRZlCKwMAAOWG5T+0NnjwYA0ePLjQ42fOnKkmTZrolVdekSS1bt1aGzZs0GuvvaZBgwaVVJkoS2hlAIBC+/JL56t1aykqyvnLMP7JRFlj+cBbVJs2bdKAAe6fzh40aJDGjRuX5zaZmZnKzMx0LaelpUmSDMOQYRglUmdpMgxDpmna4lyKLTVVjjvuUPa/1caLLzo/9uqjrw1zY13MjbUxP6Xj9GnpyisdOn36fMKtXdvUpZdK7dub6txZuuoq9wDM3FiX3eamKOdhu8CblJSkiIgIt3URERFKS0vTn3/+qeDg4BzbxMXFafLFz/WTdPjwYWVkZJRYraXFMAydO3dOKSkp8vOzfBdLiQp94AFV+quVIbNPHx2PiXE+tN1HDMNQamqqTNMs93NjNcyNtTE/pcM0pUsuqant2yu61iUlOZSQICUkOFPuVVf9qVmzzj/VkLmxLrvNzcmTJws91naB1xMTJkzQ+PHjXctpaWlq0KCBwsLCFBoams+WZYNhGKpQoYLCw8Nt8Q3usZUr5TdvniTJDAlRxfh4hV/0w1FpMwxDDodDYWFh5XtuLIi5sTbmp+SlpDgfkRsV5VBqqqkDB3LvY/jttyCFh59/pBhzY112m5ugoKBCj7Vd4K1du7aSk5Pd1iUnJys0NDTXq7uSFBgYqMBcnv/n5+dni28ISXI4HLY6nyJLTXX7YJrj5ZflaNKkVEv4+cjPenHji+oQ0UH39bjvfC3lfW4sjLmxNubHu1JSpHXrpLVrna8ffyx4m7p1nVd6/fzcwzBzY112mpuinIPtAm9UVJSWL1/utm7VqlWKioryUUWwhAce8MldGX45/osW7lyo//z4H209tNW1vkf9Huper3up1AAAuSlKwPXzk8LDpaSk8+uuv1567z2pUqWSrhQoPssH3lOnTmnPnj2u5X379ikxMVE1atRQw4YNNWHCBP3+++967733JEn33HOP3njjDT388MO67bbbtGbNGv3nP//RsmXLfHUK8LWVK6V33nH+uRTuymCapmZ/P1szNs/QlkNbch1TJaBKiR0fAHJTlIDr7y917Sr17et8ffWV9Mwz599/+GEpLs4ZhIGywPKBd/PmzerXr59rObvXNjY2VvHx8Tp06JAOHD
jger9JkyZatmyZ7r//fk2bNk3169fX22+/zS3JyisfPGBizb41uuOTO/J8v2XNlmod1rpEawCA4gTcXr2kKhf8XP7XNSX5+0szZkh33llydQMlwfKBt2/fvjJNM8/3c3uKWt++ffX999+XYFUoM3zQylCnSh1VqlhJp8+eVtXAqkrNTHV7//ZOt5d4DQDKH28G3Iu98orUrp00cKBzO6CssXzgBTxWyq0M2dqEtdHu+3YrdnGsvtj3hSTJIYdMOX9wu6HNDSVeAwD7K8mAe7E6daTHHitevYAvEXhhTz5oZch2JuuM7vn0HlfYDfQPVGaW88EmXet2VZPqpXt3CAD2UJoBF7AbAi/syUd3ZTiTdUY3/OcGffLzJ5Kk4ArBGtNtjF7a9JIkaVibYaVSB4Cyj4ALeA+BF/bjo1aG3MLusv9bpmpB1TRzy0xVD66uWzreUuJ1ACibCLhAySHwwl581MqQV9jt18R5h5E/HvhDgf6BquhfMb/dAChHCLhA6SHwwl580MpQUNiVpJCAkBKvA4C1EXAB3yHwwj580MpQmLALoHwi4ALWQeCFPfiglYGwC+BCBFzAugi8sIdSbmUg7AIg4AJlB4EXZV8ptzIQdoHyiYALlF0EXpRtpdzKQNgFyg8CLmAfBF6UbaXYykDYBeyNgAvYF4EXZVcptjIQdgH7IeAC5QeBF2VTKbYyEHYBeyhqwO3SxT3ghoaWTp0AvI/Ai7KplFoZCLtA2UXABZCNwIuyp5RaGQi7QNlCwAWQFwIvypZSamUg7ALWd+SIn9atk9avJ+ACyB+BF2VLKbQyEHYBa3K/guvQjz+G5zmWgAvgQgRelB2l0MpA2AWsI/8WBfe/+wRcAPkh8KJsKIVWBsIu4FspKefbE9aulXbuzHusv7+pyMizGjCgovr1cxBwAeSLwIuyoYRbGQi7QOkrWsB1v4IbFWUqI+OYwsPD5edXco8SB2APBF5YXwm3MhB2gdJRnIB78RVcw5AyMkq2XgD2QeCFtZVwKwNhFyg53gy4AFAcBF5YWwm2MhB2Ae8i4AKwKgIvrKsEWxkIu0DxEXABlBUEXlhTCbYyEHYBzxBwAZRVBF5YUwm1MhB2gcIj4AKwCwIvrKeEWhkIu0D+CLgA7IrAC2spoVYGwi6QEwEXQHlB4IW1lEArA2EXcCLgAiivCLywjhJoZSDsojwj4AKAE4EX1lACrQyEXZQ3BFwAyB2BF9bg5VYGwi7KAwIuABQOgRe+5+VWBsIu7IqACwCeIfDCt7zcykDYhZ0QcAHAOwi88C0vtjIQdlHWEXABoGQQeOE7XmxlIOyiLCLgAkDpIPDCN7zYykDYRVlBwAUA3yDwwje81MpA2IWVEXABwBoIvCh9XmplIOzCagi4AGBNZSLwTp8+XS+99JKSkpLUoUMHvf766+revXuuY+Pj43Xrrbe6rQsMDFRGRkZplIqCeKmVgbALKyDgAkDZYPnAu2DBAo0fP14zZ85Ujx49NHXqVA0aNEi7du1SeHh4rtuEhoZq165drmVHMR9PCy/yQisDYRe+QsAFgLLJ8oH31Vdf1Z133um6ajtz5kwtW7ZMs2fP1qOPPprrNg6HQ7Vr1y7NMlEYK1YUu5WBsIvSdOSIn9avPx9yCbgAUDZZOvCeOXNGW7Zs0YQJE1zr/Pz8NGDAAG3atCnP7U6dOqVGjRrJMAx17txZzz33nNq2bZvn+MzMTGVmZrqW09LSJEmGYcgwDC+ciW8ZhiHTNH17Lqmpctx5p7LjrfHii1KDBlIRajqTdUbDFg7Tp7s/leQMu5+M+ETRjaLL7DxZYm7gkn0Fd906h9atc2jnztx/iyRJ/v6munSRoqOl6Ggz14DLtJYc/u5YF3NjXXabm6Kch6UD75EjR5SVlaWIiAi39REREfrpp59y3aZly5aaPXu2IiMjlZqaqpdfflk9e/bUzp07Vb9+/Vy3iYuL0+TJk3OsP3z4sC16fw3D0Llz55SSkiI/Pz+f1B
D6wAOq9FcrQ2afPjoeE+NMF4V0JuuM7lx1pz7f/7kkKahCkN7/2/tqW6mtUoqwH6sxDEOpqakyTdNnc1OeHTnip6+/rqiNGwO0cWOAdu2qmOdYf39TkZFn1bPnGUVFnVH37mdVpYrpej8jw/lC6eDvjnUxN9Zlt7k5efJkocdaOvB6IioqSlFRUa7lnj17qnXr1nrzzTf19NNP57rNhAkTNH78eNdyWlqaGjRooLCwMIXa4HeShmGoQoUKCg8P9803+IoV8ps3T5JkhoSoYny8wi/6ISY/2Vd2s8Nu9pXdfo3LfhuDYRhyOBwKCwuzxT8+Vud+BVfauTPvlho/P1MdOpxV//4VLmhRqCDnP5uVSqtk5IG/O9bF3FiX3eYmKCio0GMtHXhr1aolf39/JScnu61PTk4udI9uxYoV1alTJ+3ZsyfPMYGBgQoMDMyx3s/PzxbfEJKzr9kn55OaKt199/k6Xn5ZjiZNCr35mawzunHRjW5tDHbr2fXZ3JQDxfmQWVSUqYyMY777QREF4u+OdTE31mWnuSnKOVg68AYEBKhLly5avXq1YmJiJDl/Olm9erXGjBlTqH1kZWVp+/btGjJkSAlWijwV464MfEANReXNuygYBi0KAGAXlg68kjR+/HjFxsaqa9eu6t69u6ZOnar09HTXXRtGjRqlevXqKS4uTpI0ZcoUXXrppWrWrJlOnDihl156Sfv379cdF977FaWjGHdlIOyiMLhNGACgMCwfeIcPH67Dhw/rqaeeUlJSkjp27KgVK1a4Psh24MABt0vax48f15133qmkpCRVr15dXbp00caNG9WmTRtfnUL5lJoq3Xnn+eUiPGCCsIu8EHABAJ6wfOCVpDFjxuTZwrB27Vq35ddee02vvfZaKVSFfHnYykDYxYUIuAAAbygTgRdljIetDIRdEHABACWBwAvv8rCVgbBbPhFwAQClgcAL7/KglYGwW34QcAEAvkDghfd40MpA2LU3Ai4AwAoIvPAOD1oZCLv2Q8AFAFgRgRfeUcRWBsKuPRBwAQBlAYEXxVfEVgbCbtlFwAUAlEUEXhRPEVsZCLtlCwEXAGAHBF4UTxFaGQi71kfABQDYEYEXnitCKwNh15oIuACA8oDAC88UoZWBsGsdBFwAQHlE4IVnCtnKQNj1LQIuAAAEXniikK0MhN3SR8AFACAnAi+KppCtDITd0kHABQCgYAReFE0hWhkIuyWHgAsAQNEReFF4hWhlIOx6FwEXAIDiI/CicArRykDYLT4CLgAA3kfgReEU0MpA2PVMSor06aeB+v57h9atI+ACAFASCLwoWAGtDITdwst5BddPUvVcxxJwAQDwDgIv8ldAKwNhN3+0KAAA4HsEXuQvn1YGwm5ORQ+4prp1S9fgwZV02WV+BFwAAEoAgRd5y6eVgbDrVNwruCEhplJSTik8vJL8/EqnZgAAyhsCL3KXTytDeQ673m5RMIySqxUAADgReJG7PFoZylvYpQcXAICyj8CLnPJoZSgPYZeACwCA/RB44S6PVga7hl0CLgAA9kfghbtcWhnsFHYJuAAAlD8EXpyXSyvDGeNsmQ67BFwAAEDghVMurQxn6tcpc2GXgAsAAC5G4IXTRa0MZ26/pUyEXQIuAAAoCIEXOVoZzrz5b92wcJglwy4BFwAAFBWBt7y7qJXhzEvP64ZvHrBM2CXgAgCA4iLwlncXtDKcueJy3VB1pU/DLgEXAAB4G4G3PLugleFMaGXdcJNfqYddAi4AAChpBN7y6oJWhjP+0g0TLtEnB7+QVLJhl4ALAABKG4G3vPqrleGMv3TDvbX0SeY2Sd4PuwRcAADgawTe8uivVoYz/tINI/z1Sc0jkrwTdgm4AADAagi85c1frQxn/KUbbpQ+aZYlyfOwS8AFAABWR+Atbx54QGcO/eYMuy2dq4oSdgm4AACgrCHwlicrVuhM/DtFCrsEXAAAUNYReMuJSmfP6uw/7tKwAsIuARcAANiNn68LKI
zp06ercePGCgoKUo8ePfTtt9/mO37hwoVq1aqVgoKC1L59ey1fvryUKrWuW/fs0rBev+cIu20r99OiRdKYMVK7dlJEhDRsmDR9es6w6+8vde8uPfywtHy5dOyY9M030gsvSIMHE3YBAIA1Wf4K74IFCzR+/HjNnDlTPXr00NSpUzVo0CDt2rVL4eHhOcZv3LhRI0aMUFxcnK688krNmzdPMTEx2rp1q9q1a+eDM7CA9HQlNvnDFXYDFKyBR5bpvqv6cQUXAADYnsM0TdPXReSnR48e6tatm9544w1JkmEYatCgge677z49+uijOcYPHz5c6enp+vTTT13rLr30UnXs2FEzZ84s1DHT0tJUtWpVpaamKtQGCc84c0a3DAvV+50zpbPB0txl0q85e3YJuKXPMAylpKQoPDxcfn5l4hcu5QZzY23Mj3UxN9Zlt7kpSl6z9BXeM2fOaMuWLZowYYJrnZ+fnwYMGKBNmzblus2mTZs0fvx4t3WDBg1SQkJCnsfJzMxUZmamazktLU2S8xvDMIxinIE1GH5++i0jStW/jNXxbZdJxy+RJPn7m+rSRYqOlqKjzVwDrg1O39IMw5Bpmrb4PrMb5sbamB/rYm6sy25zU5TzsHTgPXLkiLKyshQREeG2PiIiQj/99FOu2yQlJeU6PikpKc/jxMXFafLkyTnWX3/99apQwdJfokIxTVPbtm5Xw/r7VbFic1VqvEXVq/+g6tV/VIUKp/XDD9IPP0j/+pevKy1/TNPUuXPnVKFCBTkcDl+XgwswN9bG/FgXc2Nddpubc+fOFXps2U9zXjBhwgS3q8JpaWlq0KCBPvroI3u0NBiGhgwZouXLH7/gVxg3+LQmOBmGocOHDyssLMwWv16yE+bG2pgf62JurMtuc5OWlqbq1asXaqylA2+tWrXk7++v5ORkt/XJycmqXbt2rtvUrl27SOMlKTAwUIGBgTnW+/n52eIbQpIcDoetzsdOmBvrYm6sjfmxLubGuuw0N0U5B0ufbUBAgLp06aLVq1e71hmGodWrVysqKirXbaKiotzGS9KqVavyHA8AAAB7s/QVXkkaP368YmNj1bVrV3Xv3l1Tp05Venq6br31VknSqFGjVK9ePcXFxUmSxo4dq+joaL3yyisaOnSo5s+fr82bN2vWrFm+PA0AAAD4iOUD7/Dhw3X48GE99dRTSkpKUseOHbVixQrXB9MOHDjgdkm7Z8+emjdvnp544gk99thjat68uRISEsrvPXgBAADKOcsHXkkaM2aMxowZk+t7a9euzbFu2LBhGjZsWAlXBQAAgLLA0j28AAAAQHEReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK1ZOvCapqmnnnpKderUUXBwsAYMGKDdu3fnu82kSZPkcDjcXq1atSqligEAAGA1lg68L774ov71r39p5syZ+uabb1S5cmUNGjRIGRkZ+W7Xtm1bHTp0yPXasGFDKVUMAAAAq6ng6wLyYpqmpk6dqieeeELXXHONJOm9995TRESEEhISdNNNN+W5bYUKFVS7du3SKhUAAAAWZtnAu2/fPiUlJWnAgAGudVWrVlWPHj20adOmfAPv7t27VbduXQUFBSkqKkpxcXFq2LBhnuMzMzOVmZnpWk5NTZUknThxQoZheOFsfMswDJ09e1YnTpyQn5
+lL+qXO4ZhKC0tTQEBAcyNxTA31sb8WBdzY112m5u0tDRJzoukBbFs4E1KSpIkRUREuK2PiIhwvZebHj16KD4+Xi1bttShQ4c0efJkXXbZZdqxY4eqVKmS6zZxcXGaPHlyjvWNGjUqxhlYT82aNX1dAgAAgFedPHlSVatWzXeMwyxMLC4Fc+fO1d133+1aXrZsmfr27as//vhDderUca2/8cYb5XA4tGDBgkLt98SJE2rUqJFeffVV3X777bmOufgKr2EYOnbsmGrWrCmHw+HhGVlHWlqaGjRooIMHDyo0NNTX5eACzI11MTfWxvxYF3NjXXabG9M0dfLkSdWtW7fAK9aWucJ79dVXq0ePHq7l7ACanJzsFniTk5PVsWPHQu+3WrVqatGihfbs2ZPnmMDAQAUGBubYzm5CQ0Nt8Q1uR8yNdTE31sb8WBdzY112mpuCruxms0wDR5UqVdSsWTPXq02bNqpdu7ZWr17tGpOWlqZvvvlGUVFRhd7vqVOntHfvXrfQDAAAgPLDMoH3Yg6HQ+PGjdMzzzyjpUuXavv27Ro1apTq1q2rmJgY17j+/fvrjTfecC0/+OCDWrdunX799Vdt3LhR1157rfz9/TVixAgfnAUAAAB8zTItDbl5+OGHlZ6errvuuksnTpxQ7969tWLFCgUFBbnG7N27V0eOHHEt//bbbxoxYoSOHj2qsLAw9e7dW19//bXCwsJ8cQqWEBgYqIkTJ+Zo24DvMTfWxdxYG/NjXcyNdZXnubHMh9YAAACAkmDZlgYAAADAGwi8AAAAsDUCLwAAAGyNwAsAAABbI/Da3PTp09W4cWMFBQWpR48e+vbbb31dEiStX79eV111lerWrSuHw6GEhARfl4S/xMXFqVu3bqpSpYrCw8MVExOjXbt2+bosSJoxY4YiIyNdN82PiorSZ5995uuykIvnn3/edXtR+N6kSZPkcDjcXq1atfJ1WaWKwGtjCxYs0Pjx4zVx4kRt3bpVHTp00KBBg5SSkuLr0sq99PR0dejQQdOnT/d1KbjIunXrNHr0aH399ddatWqVzp49q4EDByo9Pd3XpZV79evX1/PPP68tW7Zo8+bNuvzyy3XNNddo586dvi4NF/juu+/05ptvKjIy0tel4AJt27bVoUOHXK8NGzb4uqRSxW3JbKxHjx7q1q2b68EchmGoQYMGuu+++/Too4/6uDpkczgcWrx4sdsDVWAdhw8fVnh4uNatW6c+ffr4uhxcpEaNGnrppZd0++23+7oUyPl0086dO+vf//63nnnmGXXs2FFTp071dVnl3qRJk5SQkKDExERfl+IzXOG1qTNnzmjLli0aMGCAa52fn58GDBigTZs2+bAyoGxJTU2V5AxWsI6srCzNnz9f6enpRXrcPErW6NGjNXToULf/98Aadu/erbp166pp06b6+9//rgMHDvi6pFJl6SetwXNHjhxRVlaWIiIi3NZHRETop59+8lFVQNliGIbGjRunXr16qV27dr4uB5K2b9+uqKgoZWRkKCQkRIsXL1abNm18XRYkzZ8/X1u3btV3333n61JwkR49eig+Pl4tW7bUoUOHNHnyZF122WXasWOHqlSp4uvySgWBFwDyMHr0aO3YsaPc9bpZWcuWLZWYmKjU1FQtWrRIsbGxWrduHaHXxw4ePKixY8dq1apVCgoK8nU5uMjgwYNdf46MjFSPHj3UqFEj/ec//yk37UAEXpuqVauW/P39lZyc7LY+OTlZtWvX9lFVQNkxZswYffrpp1q/fr3q16/v63Lwl4CAADVr1kyS1KVLF3333XeaNm2a3nzzTR9XVr5t2bJFKSkp6ty5s2tdVlaW1q9frzfeeEOZmZny9/f3YYW4ULVq1dSiRQvt2bPH16WUGnp4bSogIEBdunTR6tWrXesMw9Dq1avpdwPyYZqmxowZo8WLF2vNmjVq0qSJr0tCPgzDUGZmpq/LKPf69++v7du3KzEx0fXq2rWr/v73vysxMZGwaz
GnTp3S3r17VadOHV+XUmq4wmtj48ePV2xsrLp27aru3btr6tSpSk9P16233urr0sq9U6dOuf1kvW/fPiUmJqpGjRpq2LChDyvD6NGjNW/ePC1ZskRVqlRRUlKSJKlq1aoKDg72cXXl24QJEzR48GA1bNhQJ0+e1Lx587R27VqtXLnS16WVe1WqVMnR5165cmXVrFmT/ncLePDBB3XVVVepUaNG+uOPPzRx4kT5+/trxIgRvi6t1BB4bWz48OE6fPiwnnrqKSUlJaljx45asWJFjg+yofRt3rxZ/fr1cy2PHz9ekhQbG6v4+HgfVQXJ+XADSerbt6/b+jlz5uiWW24p/YLgkpKSolGjRunQoUOqWrWqIiMjtXLlSl1xxRW+Lg2wtN9++00jRozQ0aNHFRYWpt69e+vrr79WWFiYr0srNdyHFwAAALZGDy8AAABsjcALAAAAWyPwAgAAwNYIvAAAALA1Ai8AAABsjcALAAAAWyPwAgAAwNYIvAAAALA1Ai8AlKL//Oc/qlGjhk6dOlXkbRs3bqwrr7yyBKrKXXx8vBwOh3799ddSO+aFfvzxR1WoUEE7duzwyfEB2AeBFwBKSVZWliZOnKj77rtPISEhvi7H8tq0aaOhQ4fqqaee8nUpAMo4Ai8AlJJPPvlEu3bt0l133eXrUgpl5MiR+vPPP9WoUSOf1XDPPfdo8eLF2rt3r89qAFD2EXgBoJTMmTNHvXr1Ur169XxdSqH4+/srKChIDofDZzUMGDBA1atX17vvvuuzGgCUfQReACiEP//8U61atVKrVq30559/utYfO3ZMderUUc+ePZWVlZXn9hkZGVqxYoUGDBiQ4705c+bo8ssvV3h4uAIDA9WmTRvNmDEjz319/vnn6tixo4KCgtSmTRt9/PHHbu+fPXtWkydPVvPmzRUUFKSaNWuqd+/eWrVqldu4n376STfeeKPCwsIUHBysli1b6vHHH3e9n1sP7+bNmzVo0CDVqlVLwcHBatKkiW677Ta3/c6fP19dunRRlSpVFBoaqvbt22vatGluX7MHH3xQ7du3V0hIiEJDQzV48GD98MMPOc61YsWK6tu3r5YsWZLn1wMAClLB1wUAQFkQHBysd999V7169dLjjz+uV199VZI0evRopaamKj4+Xv7+/nluv2XLFp05c0adO3fO8d6MGTPUtm1bXX311apQoYI++eQT3XvvvTIMQ6NHj3Ybu3v3bg0fPlz33HOPYmNjNWfOHA0bNkwrVqzQFVdcIUmaNGmS4uLidMcdd6h79+5KS0vT5s2btXXrVteYbdu26bLLLlPFihV11113qXHjxtq7d68++eQTPfvss7meQ0pKigYOHKiwsDA9+uijqlatmn799Ve3wL1q1SqNGDFC/fv31wsvvCBJ+t///qevvvpKY8eOlST98ssvSkhI0LBhw9SkSRMlJyfrzTffVHR0tH788UfVrVvX7bhdunTRkiVLlJaWptDQ0HznCQByZQIACm3ChAmmn5+fuX79enPhwoWmJHPq1KkFbvf222+bkszt27fneO/06dM51g0aNMhs2rSp27pGjRqZksyPPvrItS41NdWsU6eO2alTJ9e6Dh06mEOHDs23nj59+phVqlQx9+/f77beMAzXn+fMmWNKMvft22eapmkuXrzYlGR+9913ee537NixZmhoqHnu3Lk8x2RkZJhZWVlu6/bt22cGBgaaU6ZMyTF+3rx5piTzm2++yfecACAvtDQAQBFMmjRJbdu2VWxsrO69915FR0frn//8Z4HbHT16VJJUvXr1HO8FBwe7/pyamqojR44oOjpav/zyi1JTU93G1q1bV9dee61rOTQ0VKNGjdL333+vpKQkSVK1atW0c+dO7d69O9daDh8+rPXr1+u2225Tw4YN3d7Lr1+3WrVqkqRPP/1UZ8+ezXNMenp6jvaJCwUGBsrPz/m/n6ysLB09elQhISFq2bKltm7dmmN89tfsyJEjee4TAPJD4AWAIg
gICNDs2bO1b98+nTx5UnPmzCnSh7pM08yx7quvvtKAAQNUuXJlVatWTWFhYXrsscckKUfgbdasWY7jtWjRQpJcvbZTpkzRiRMn1KJFC7Vv314PPfSQtm3b5hr/yy+/SJLatWtX6LolKTo6Wtdff70mT56sWrVq6ZprrtGcOXOUmZnpGnPvvfeqRYsWGjx4sOrXr6/bbrtNK1ascNuPYRh67bXX1Lx5cwUGBqpWrVoKCwvTtm3bcpyvdP5r5ssPzwEo2wi8AFBEK1eulOT8IFpeV1EvVrNmTUnS8ePH3dbv3btX/fv315EjR/Tqq69q2bJlWrVqle6//35JznBYVH369NHevXs1e/ZstWvXTm+//bY6d+6st99+u8j7upDD4dCiRYu0adMmjRkzRr///rtuu+02denSxfUgjfDwcCUmJmrp0qW6+uqr9eWXX2rw4MGKjY117ee5557T+PHj1adPH33wwQdauXKlVq1apbZt2+Z6vtlfs1q1ahWrfgDlmK97KgCgLPnhhx/MgIAA89ZbbzU7depkNmjQwDxx4kSB223YsMGUZC5ZssRt/WuvvWZKytFL+9hjj7n1z5qms4e3bt26bn22pmmajzzyiCnJPHToUK7HPnnypNmpUyezXr16pmmaZkpKiinJHDt2bL41X9zDm5u5c+eaksy33nor1/ezsrLMu+++25Rk7t692zRNZ49xv379coytV6+eGR0dnWP9M888Y/r5+RXq6wwAueEKLwAU0tmzZ3XLLbeobt26mjZtmuLj45WcnOy6GpufLl26KCAgQJs3b3Zbn31nB/OCVofU1FTNmTMn1/388ccfWrx4sWs5LS1N7733njp27KjatWtLOt8vnC0kJETNmjVztR6EhYWpT58+mj17tg4cOOA21syl5SLb8ePHc7zfsWNHSXLt++Jj+/n5KTIy0m2Mv79/jv0sXLhQv//+e67H3bJli9q2bauqVavmWRsA5IfbkgFAIT3zzDNKTEzU6tWrVaVKFUVGRuqpp57SE088oRtuuEFDhgzJc9ugoCANHDhQX3zxhaZMmeJaP3DgQAUEBOiqq67S3XffrVOnTumtt95SeHi4Dh06lGM/LVq00O23367vvvtOERERmj17tpKTk90Ccps2bdS3b1916dJFNWrU0ObNm7Vo0SKNGTPGNeZf//qXevfurc6dO+uuu+5SkyZN9Ouvv2rZsmVKTEzM9Rzeffdd/fvf/9a1116rSy65RCdPntRbb72l0NBQ17nfcccdOnbsmC6//HLVr19f+/fv1+uvv66OHTuqdevWkqQrr7xSU6ZM0a233qqePXtq+/btmjt3rpo2bZrjmGfPntW6det077335j85AJAf315gBoCyYcuWLWaFChXM++67z239uXPnzG7dupl169Y1jx8/nu8+Pv74Y9PhcJgHDhxwW7906VIzMjLSDAoKMhs3bmy+8MIL5uzZs3NtaRg6dKi5cuVKMzIy0gwMDDRbtWplLly40G1/zzzzjNm9e3ezWrVqZnBwsNmqVSvz2WefNc+cOeM2bseOHea1115rVqtWzQwKCjJbtmxpPvnkk673L25p2Lp1qzlixAizYcOGZmBgoBkeHm5eeeWV5ubNm13bLFq0yBw4cKAZHh5uBgQEmA0bNjTvvvtut3aLjIwM84EHHjDr1KljBgcHm7169TI3bdpkRkdH52hp+Oyzz9zaIQDAEw7TzOf3VwAAr8nKylKbNm1044036umnn/Z1OWVCTEyMHA6HWxsHABQVgRcAStGCBQv0j3/8QwcOHFBISIivy7G0//3vf2rfvr0SExOLfAs1ALgQgRcAAAC2xl0aAAAAYGsEXgAAANgagRcAAAC2RuAFAACArRF4AQAAYGsEXgAAANgagRcAAAC2RuAFAACArRF4AQAAYGsEXgAAANja/wO/kav13fKFjwAAAABJRU5ErkJggg==",
"text/plain": [
"<Figure size 800x800 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Note: Arrows represent vectors. Endpoint of arrow = vector endpoint\n"
]
}
],
"source": [
"# Visualize 2D vectors\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Prefer fonts with CJK glyphs when they are installed. Listing a missing font\n",
"# does not raise; matplotlib falls back to its default font, so no try/except is needed.\n",
"plt.rcParams['font.sans-serif'] = ['SimHei', 'Noto Sans CJK SC', 'WenQuanYi Micro Hei']\n",
"plt.rcParams['axes.unicode_minus'] = False\n",
"\n",
"# Create the canvas\n",
"fig, ax = plt.subplots(figsize=(8, 8))\n",
"\n",
"# Define the vectors (the dict key doubles as the plot label)\n",
"vectors = {\n",
"    'A = [2, 3]': np.array([2, 3]),\n",
"    'B = [4, 1]': np.array([4, 1]),\n",
"    'C = [1, 1]': np.array([1, 1]),\n",
"}\n",
"\n",
"# Draw each vector as an arrow from the origin\n",
"colors = ['red', 'blue', 'green']\n",
"for (name, vec), color in zip(vectors.items(), colors):\n",
"    ax.annotate('', xy=vec, xytext=(0, 0),\n",
"                arrowprops=dict(arrowstyle='->', color=color, lw=2))\n",
"    ax.text(vec[0]+0.1, vec[1]+0.1, name, fontsize=12, color=color)\n",
"\n",
"# Draw the coordinate axes\n",
"ax.axhline(y=0, color='black', linewidth=0.5)\n",
"ax.axvline(x=0, color='black', linewidth=0.5)\n",
"\n",
"# Set the axis ranges, labels, and title\n",
"ax.set_xlim(-0.5, 5.5)\n",
"ax.set_ylim(-0.5, 4)\n",
"ax.set_xlabel('x (abscissa)', fontsize=12)\n",
"ax.set_ylabel('y (ordinate)', fontsize=12)\n",
"ax.set_title('2D Vector Visualization', fontsize=14)\n",
"ax.grid(True, alpha=0.3)\n",
"ax.set_aspect('equal')\n",
"\n",
"plt.show()\n",
"print(\"Note: Arrows represent vectors. Endpoint of arrow = vector endpoint\")"
]
},
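{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.2 Basic vector operations\n",
"\n",
"### 3.2.1 Vector addition\n",
"\n",
"**Rule: add the elements at matching positions.**\n",
"\n",
"```\n",
"[1, 2, 3] + [4, 5, 6] = [1+4, 2+5, 3+6] = [5, 7, 9]\n",
"```\n",
"\n",
"This is mathematical notation. With plain Python lists, `+` concatenates instead, so the code below uses NumPy arrays.\n",
"\n",
"**Geometric intuition:** walking along vector a and then along vector b is equivalent to walking straight from the origin to a+b.\n",
"\n",
"```\n",
"              a+b\n",
"            ↗  ↑\n",
"          ↗    │ b\n",
"        ↗      │\n",
"      O ──────→\n",
"          a\n",
"```"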
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"==================================================\n",
"Vector addition demo\n",
"==================================================\n",
"Vector a = [1 2 3]\n",
"Vector b = [4 5 6]\n",
"a + b = [5 7 9]\n",
"\n",
"Step by step:\n",
"  index 0: 1 + 4 = 5\n",
"  index 1: 2 + 5 = 7\n",
"  index 2: 3 + 6 = 9\n",
"\n",
"Check: True True True\n"
]
}
],
"source": [
"# Vector addition demo\n",
"import numpy as np\n",
"\n",
"print(\"=\" * 50)\n",
"print(\"Vector addition demo\")\n",
"print(\"=\" * 50)\n",
"\n",
"a = np.array([1, 2, 3])\n",
"b = np.array([4, 5, 6])\n",
"c = a + b  # element-wise addition\n",
"\n",
"print(f\"Vector a = {a}\")\n",
"print(f\"Vector b = {b}\")\n",
"print(f\"a + b = {c}\")\n",
"print()\n",
"print(\"Step by step:\")\n",
"print(f\"  index 0: {a[0]} + {b[0]} = {a[0]+b[0]}\")\n",
"print(f\"  index 1: {a[1]} + {b[1]} = {a[1]+b[1]}\")\n",
"print(f\"  index 2: {a[2]} + {b[2]} = {a[2]+b[2]}\")\n",
"print()\n",
"print(\"Check:\", a[0]+b[0] == c[0], a[1]+b[1] == c[1], a[2]+b[2] == c[2])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.2.2 向量数乘(标量乘法)\n",
"\n",
"**规则:每个元素都乘以这个标量(数字)**\n",
"\n",
"```python\n",
"2 × [1, 2, 3] = [2×1, 2×2, 2×3] = [2, 4, 6]\n",
"3 × [1, 2, 3] = [3×1, 3×2, 3×3] = [3, 6, 9]\n",
"0.5 × [1, 2, 3] = [0.5, 1.0, 1.5]\n",
"```\n",
"\n",
"**几何直观**\n",
"- 正数:方向不变,长度缩放\n",
"- 负数:方向相反,长度缩放\n",
"- 0变成零向量"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"==================================================\n",
"Scalar multiplication demo\n",
"==================================================\n",
"Original vector v = [1 2 3]\n",
"\n",
"2 × v = [2 4 6]\n",
"3 × v = [3 6 9]\n",
"0.5 × v = [0.5 1.  1.5]\n",
"-1 × v = [-1 -2 -3]\n",
"0 × v = [0 0 0]\n"
]
}
],
"source": [
"# Scalar multiplication demo\n",
"import numpy as np\n",
"\n",
"print(\"=\" * 50)\n",
"print(\"Scalar multiplication demo\")\n",
"print(\"=\" * 50)\n",
"\n",
"v = np.array([1, 2, 3])\n",
"\n",
"print(f\"Original vector v = {v}\")\n",
"print()\n",
"\n",
"# Multiply v by positive, fractional, negative, and zero scalars\n",
"for scalar in [2, 3, 0.5, -1, 0]:\n",
"    result = scalar * v\n",
"    print(f\"{scalar} × v = {result}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.2.3 向量的长度(模/范数)\n",
"\n",
"**定义:从原点到向量终点的距离**\n",
"\n",
"对于二维向量 `[a, b]`\n",
"```\n",
"长度 = √(a² + b²)\n",
"\n",
"这就是\"勾股定理\"\n",
"\n",
" |\n",
" b |\n",
" | |\n",
" | √(a²+b²)\n",
" | /\n",
" | /\n",
" |/ a\n",
" O——————\n",
"```\n",
"\n",
"对于n维向量 `[a₁, a₂, ..., aₙ]`\n",
"```\n",
"长度 = √(a₁² + a₂² + ... + aₙ²)\n",
"```"
]
},
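{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a worked instance of the n-dimensional formula, here is a short sketch for the illustrative vector $v = [1, 2, 2]$. The vector and the notation $\\lvert v \\rvert$ for its length are chosen here for illustration only.\n",
"\n",
"$$\n",
"\\lvert v \\rvert = \\sqrt{1^2 + 2^2 + 2^2} = \\sqrt{1 + 4 + 4} = \\sqrt{9} = 3\n",
"$$\n",
"\n",
"The same pattern should carry over to any number of dimensions: square each component, add the squares, and take the square root of the sum."
]
},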
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"==================================================\n",
"Vector length (magnitude / norm) demo\n",
"==================================================\n",
"Vector v = [3 4]\n",
"Length = √(3² + 4²) = √(9 + 16) = √25 = 5.0\n",
"\n",
"More length examples:\n",
"  [1, 1] -> length = 1.4142\n",
"  [0, 5] -> length = 5.0000\n",
"  [3, 4] -> length = 5.0000\n",
"  [1, 2, 2] -> length = 3.0000\n",
"  [1, 1, 1, 1] -> length = 2.0000\n"
]
}
],
"source": [
"# Vector length calculation\n",
"import numpy as np\n",
"\n",
"print(\"=\" * 50)\n",
"print(\"Vector length (magnitude / norm) demo\")\n",
"print(\"=\" * 50)\n",
"\n",
"# 2D example\n",
"v2d = np.array([3, 4])\n",
"length_2d = np.linalg.norm(v2d)\n",
"\n",
"print(f\"Vector v = {v2d}\")\n",
"print(f\"Length = √({v2d[0]}² + {v2d[1]}²) = √({v2d[0]**2} + {v2d[1]**2}) = √{v2d[0]**2 + v2d[1]**2} = {length_2d}\")\n",
"print()\n",
"\n",
"# More examples\n",
"examples = [\n",
"    np.array([1, 1]),        # 45-degree angle\n",
"    np.array([0, 5]),        # on the y-axis\n",
"    np.array([3, 4]),        # classic Pythagorean triple\n",
"    np.array([1, 2, 2]),     # 3D vector\n",
"    np.array([1, 1, 1, 1])   # 4D vector\n",
"]\n",
"\n",
"print(\"More length examples:\")\n",
"for v in examples:\n",
"    length = np.linalg.norm(v)\n",
"    # tolist() yields plain Python ints, so the output does not show np.int64(...)\n",
"    print(f\"  {v.tolist()} -> length = {length:.4f}\")"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"==================================================\n",
"Exercise 3 answers\n",
"==================================================\n",
"A = [3 4], B = [1 2]\n",
"\n",
"1. A + B = [4 6]\n",
"2. 2 × A = [6 8]\n",
"3. Length of A = 5.0\n"
]
}
],
"source": [
"# Exercise 3 answers\n",
"import numpy as np\n",
"print(\"=\" * 50)\n",
"print(\"Exercise 3 answers\")\n",
"print(\"=\" * 50)\n",
"\n",
"A = np.array([3, 4])\n",
"B = np.array([1, 2])\n",
"\n",
"print(f\"A = {A}, B = {B}\")\n",
"print()\n",
"\n",
"# 1. A + B\n",
"print(\"1. A + B =\", A + B)\n",
"\n",
"# 2. 2 × A\n",
"print(\"2. 2 × A =\", 2 * A)\n",
"\n",
"# 3. Length of A\n",
"print(f\"3. Length of A = {np.linalg.norm(A)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"# 第四部分:余弦相似度\n",
"\n",
"## 4.1 什么是相似度?\n",
"\n",
"**相似度 = 两个向量有多\"像\"**\n",
"\n",
"### 日常生活中的相似例子\n",
"\n",
"| 相似度高 | 原因 | 相似度低 | 原因 |\n",
"|----------|------|----------|------|\n",
"| \"猫\" 和 \"狗\" | 都是动物,都四只脚 | \"猫\" 和 \"石头\" | 一个是动物,一个不是 |\n",
"| \"红色\" 和 \"黄色\" | 都是颜色,暖色调 | \"热\" 和 \"冷\" | 意思相反 |\n",
"| \"跑步\" 和 \"游泳\" | 都是运动 | \"太阳\" 和 \"细菌\" | 几乎没有共同点 |\n",
"| \"苹果\" 和 \"梨\" | 都是水果 | \"苹果\" 和 \"手机\" | 需要上下文才能关联 |\n",
"\n",
"### 计算机如何量化相似度?\n",
"\n",
"文本相似度在计算机中的应用:\n",
"\n",
"```\n",
"搜索场景:\n",
" 用户输入: \"如何学习编程?\"\n",
" 文档1: \"Python入门教程\" → 相似度高 ✅\n",
" 文档2: \"做蛋糕的100种方法\" → 相似度低 ❌\n",
"\n",
"推荐场景:\n",
" 用户喜欢: \"猫和狗的搞笑视频\"\n",
" 推荐1: \"仓鼠的可爱瞬间\" → 相似度高 ✅\n",
" 推荐2: \"汽车发动机维修教程\" → 相似度低 ❌\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4.2 Dot product: the most important operation\n",
"\n",
"### Definition: multiply the elements at matching positions, then sum\n",
"\n",
"```python\n",
"a = [1, 2, 3]\n",
"b = [4, 5, 6]\n",
"\n",
"dot = 1*4 + 2*5 + 3*6  # 4 + 10 + 18 = 32\n",
"```\n",
"\n",
"### Geometric meaning of the dot product\n",
"\n",
"```\n",
"dot product = |A| × |B| × cos(θ)\n",
"\n",
"where:\n",
"  |A| = length of vector A\n",
"  |B| = length of vector B\n",
"  θ   = angle between the two vectors\n",
"```\n",
"\n",
"| Angle θ | cos(θ) | Dot product | Meaning |\n",
"|--------|--------|----------|------|\n",
"| 0° | 1 | \\|A\\|×\\|B\\| (maximum) | same direction |\n",
"| 90° | 0 | 0 | perpendicular / orthogonal |\n",
"| 180° | -1 | -\\|A\\|×\\|B\\| (minimum) | opposite direction |"
]
},
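{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rearranging the geometric formula gives the cosine of the angle directly. A worked sketch, assuming the illustrative vectors $a = [1, 1]$ and $b = [1, 0]$:\n",
"\n",
"$$\n",
"\\cos\\theta = \\frac{a \\cdot b}{\\lvert a \\rvert \\, \\lvert b \\rvert} = \\frac{1 \\times 1 + 1 \\times 0}{\\sqrt{1^2 + 1^2} \\times \\sqrt{1^2 + 0^2}} = \\frac{1}{\\sqrt{2}} \\approx 0.7071\n",
"$$\n",
"\n",
"Since $\\cos 45^\\circ = 1/\\sqrt{2}$, the angle between these two vectors is $45^\\circ$. Dividing by both lengths removes the effect of vector size, so only the direction contributes to the result."
]
},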
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"==================================================\n",
"Dot product demo\n",
"==================================================\n",
"Vector a = [1 2 3]\n",
"Vector b = [4 5 6]\n",
"\n",
"Dot product a · b = 32\n",
"Check: a @ b = 32\n",
"Manual calculation: 32\n",
"\n",
"Step by step:\n",
"  a[0]×b[0] = 1×4 = 4\n",
"  a[1]×b[1] = 2×5 = 10\n",
"  a[2]×b[2] = 3×6 = 18\n",
"  Sum: 4 + 10 + 18 = 32\n"
]
}
],
"source": [
"# Dot product demo\n",
"import numpy as np\n",
"\n",
"print(\"=\" * 50)\n",
"print(\"Dot product demo\")\n",
"print(\"=\" * 50)\n",
"\n",
"a = np.array([1, 2, 3])\n",
"b = np.array([4, 5, 6])\n",
"\n",
"# Method 1: np.dot()\n",
"dot1 = np.dot(a, b)\n",
"\n",
"# Method 2: the @ operator\n",
"dot2 = a @ b\n",
"\n",
"# Method 3: manual calculation, pairing elements with zip\n",
"dot3 = sum(x * y for x, y in zip(a, b))\n",
"\n",
"print(f\"Vector a = {a}\")\n",
"print(f\"Vector b = {b}\")\n",
"print()\n",
"print(f\"Dot product a · b = {dot1}\")\n",
"print(f\"Check: a @ b = {dot2}\")\n",
"print(f\"Manual calculation: {dot3}\")\n",
"print()\n",
"print(\"Step by step:\")\n",
"print(f\"  a[0]×b[0] = {a[0]}×{b[0]} = {a[0]*b[0]}\")\n",
"print(f\"  a[1]×b[1] = {a[1]}×{b[1]} = {a[1]*b[1]}\")\n",
"print(f\"  a[2]×b[2] = {a[2]}×{b[2]} = {a[2]*b[2]}\")\n",
"print(f\"  Sum: {a[0]*b[0]} + {a[1]*b[1]} + {a[2]*b[2]} = {dot1}\")"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"==================================================\n",
"点积与夹角的关系\n",
"==================================================\n",
"夹角0°: a=[1 0], b=[2 0], 点积=2\n",
"夹角90°: a=[1 0], b=[0 1], 点积=0\n",
"夹角180°: a=[1 0], b=[-1 0], 点积=-1\n",
"\n",
"任意角度: a=[1 1], b=[1 0]\n",
" 点积 = 1\n",
" cos(θ) = 0.7071\n",
" 夹角 θ = 45.0°\n"
]
}
],
"source": [
"# 点积与夹角的关系\n",
"import numpy as np\n",
"\n",
"print(\"=\" * 50)\n",
"print(\"点积与夹角的关系\")\n",
"print(\"=\" * 50)\n",
"\n",
"# 夹角为0度方向完全相同\n",
"a = np.array([1, 0])\n",
"b = np.array([2, 0])\n",
"dot = np.dot(a, b)\n",
"print(f\"夹角0°: a={a}, b={b}, 点积={dot}\")\n",
"\n",
"# 夹角为90度垂直\n",
"a = np.array([1, 0])\n",
"b = np.array([0, 1])\n",
"dot = np.dot(a, b)\n",
"print(f\"夹角90°: a={a}, b={b}, 点积={dot}\")\n",
"\n",
"# 夹角为180度方向相反\n",
"a = np.array([1, 0])\n",
"b = np.array([-1, 0])\n",
"dot = np.dot(a, b)\n",
"print(f\"夹角180°: a={a}, b={b}, 点积={dot}\")\n",
"\n",
"# 任意角度\n",
"import math\n",
"a = np.array([1, 1])\n",
"b = np.array([1, 0])\n",
"dot = np.dot(a, b)\n",
"cos_angle = dot / (np.linalg.norm(a) * np.linalg.norm(b))\n",
"angle = math.acos(cos_angle) * 180 / math.pi\n",
"print(f\"\\n任意角度: a={a}, b={b}\")\n",
"print(f\" 点积 = {dot}\")\n",
"print(f\" cos(θ) = {cos_angle:.4f}\")\n",
"print(f\" 夹角 θ = {angle:.1f}°\")"
]
},
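{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4.3 Cosine Similarity: using the dot product to judge how alike two vectors are\n",
"\n",
"### Formula\n",
"\n",
"```\n",
"           A · B\n",
"cos(θ) = ───────────\n",
"          |A| × |B|\n",
"\n",
"where:\n",
"  A · B  = dot product of vectors A and B\n",
"  |A|    = length of vector A\n",
"  |B|    = length of vector B\n",
"  cos(θ) = the similarity, in the range [-1, 1]\n",
"```\n",
"\n",
"### Why is it called \"cosine\" similarity?\n",
"\n",
"Because the formula computes exactly the cosine of the angle between the two vectors.\n",
"\n",
"It follows from the dot product formula by dividing both sides by |A| × |B| (valid when neither vector has zero length):\n",
"```\n",
"A · B = |A| × |B| × cos(θ)\n",
"          ↓\n",
"cos(θ) = (A · B) / (|A| × |B|)\n",
"```"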
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"余弦相似度函数已定义cosine_similarity(a, b)\n"
]
}
],
"source": [
"# Define the cosine similarity function\n",
"import numpy as np\n",
"\n",
"def cosine_similarity(a, b):\n",
"    \"\"\"\n",
"    Compute the cosine similarity of two vectors.\n",
"\n",
"    Args:\n",
"        a, b: two NumPy array vectors of the same length\n",
"\n",
"    Returns:\n",
"        float: cosine similarity in [-1, 1] (0.0 if either vector has zero length)\n",
"    \"\"\"\n",
"    dot = np.dot(a, b)            # dot product\n",
"    norm_a = np.linalg.norm(a)    # length of vector a\n",
"    norm_b = np.linalg.norm(b)    # length of vector b\n",
"\n",
"    # Guard against division by zero\n",
"    if norm_a == 0 or norm_b == 0:\n",
"        return 0.0\n",
"\n",
"    return dot / (norm_a * norm_b)\n",
"\n",
"print(\"余弦相似度函数已定义cosine_similarity(a, b)\")"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"==================================================\n",
"余弦相似度计算示例\n",
"==================================================\n",
"1. 方向完全相同: a=[1 2 3], b=[2 4 6]\n",
" 相似度 = 1.000 (应该是1.000)\n",
"\n",
"2. 方向完全相反: a=[1 2 3], b=[-1 -2 -3]\n",
" 相似度 = -1.000 (应该是-1.000)\n",
"\n",
"3. 垂直向量: a=[1 0], b=[0 1]\n",
" 相似度 = 0.000 (应该是0.000)\n",
"\n",
"4. 45度夹角: a=[1 1], b=[1 0]\n",
" 相似度 = 0.707 (应该是0.707)\n"
]
}
],
"source": [
"# 余弦相似度计算示例\n",
"import numpy as np\n",
"\n",
"print(\"=\" * 50)\n",
"print(\"余弦相似度计算示例\")\n",
"print(\"=\" * 50)\n",
"\n",
"# 示例1方向完全相同的向量\n",
"a = np.array([1, 2, 3])\n",
"b = np.array([2, 4, 6]) # b是a的两倍方向完全相同\n",
"sim = cosine_similarity(a, b)\n",
"print(f\"1. 方向完全相同: a={a}, b={b}\")\n",
"print(f\" 相似度 = {sim:.3f} (应该是1.000)\")\n",
"print()\n",
"\n",
"# 示例2方向完全相反的向量\n",
"a = np.array([1, 2, 3])\n",
"b = np.array([-1, -2, -3]) # b是a的相反方向\n",
"sim = cosine_similarity(a, b)\n",
"print(f\"2. 方向完全相反: a={a}, b={b}\")\n",
"print(f\" 相似度 = {sim:.3f} (应该是-1.000)\")\n",
"print()\n",
"\n",
"# 示例3垂直的向量\n",
"a = np.array([1, 0])\n",
"b = np.array([0, 1])\n",
"sim = cosine_similarity(a, b)\n",
"print(f\"3. 垂直向量: a={a}, b={b}\")\n",
"print(f\" 相似度 = {sim:.3f} (应该是0.000)\")\n",
"print()\n",
"\n",
"# 示例445度夹角\n",
"a = np.array([1, 1])\n",
"b = np.array([1, 0])\n",
"sim = cosine_similarity(a, b)\n",
"print(f\"4. 45度夹角: a={a}, b={b}\")\n",
"print(f\" 相似度 = {sim:.3f} (应该是0.707)\")"
]
},
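{
"cell_type": "markdown",
"metadata": {},
"source": [
"### What do cosine similarity values mean?\n",
"\n",
"The word pairs below are illustrative only; actual values depend on how the vectors were built. In trained word embeddings, antonyms such as \"hot\" and \"cold\" often score fairly high, because they appear in similar contexts.\n",
"\n",
"| cos(θ) | Angle θ | Similarity | Illustrative example |\n",
"|----------|--------|---------|------|\n",
"| 1.0 | 0° | **Same direction** | A vector and any positive multiple of it |\n",
"| 0.8~0.99 | 8°~37° | **Very similar** | \"cat\" vs \"dog\" |\n",
"| 0.5~0.8 | 37°~60° | **Fairly similar** | \"running\" vs \"sports\" |\n",
"| 0.3~0.5 | 60°~73° | **Somewhat similar** | \"apple\" vs \"fruit\" |\n",
"| 0 | 90° | **Unrelated** | \"cat\" vs \"rock\" |\n",
"| -0.5~0 | 90°~120° | **Somewhat opposite** | \"hot\" vs \"cold\" |\n",
"| -1.0 | 180° | **Exactly opposite direction** | \"tall\" vs \"short\" |"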
]
},
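{
"cell_type": "markdown",
"metadata": {},
"source": [
"The angle column in the table above can be checked numerically. The following is a minimal sketch using `np.arccos`; the helper name `cosine_to_degrees` is made up for this illustration and does not appear elsewhere in the notebook.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def cosine_to_degrees(cos_value):\n",
"    \"\"\"Convert a cosine similarity into the angle between the vectors, in degrees.\"\"\"\n",
"    # np.clip guards against tiny floating-point overshoots such as 1.0000000002\n",
"    return float(np.degrees(np.arccos(np.clip(cos_value, -1.0, 1.0))))\n",
"\n",
"for c in [1.0, 0.8, 0.5, 0.3, 0.0, -0.5, -1.0]:\n",
"    print(f\"cos = {c:+.1f}  ->  angle = {cosine_to_degrees(c):6.1f} degrees\")\n",
"```\n",
"\n",
"The clipping step is a design choice: a similarity computed from real vectors can land a hair outside [-1, 1] through rounding, and `np.arccos` would then return `nan`."
]
},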
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"==================================================\n",
"语义相似度示例(用向量模拟词义)\n",
"==================================================\n",
"\n",
"词向量(简化模拟):\n",
" 猫 = [0.9 0.1 0.7 0.8 0.9]\n",
" 狗 = [0.8 0.2 0.6 0.8 0.9]\n",
" 苹果 = [0.1 0.9 0.9 0. 0. ]\n",
" 汽车 = [0. 0. 0. 0.9 0. ]\n",
" 石头 = [0. 0.1 0. 0. 0. ]\n",
"\n",
"维度说明: [动物性, 植物性, 可食用性, 移动性, 宠物性]\n",
"\n",
"相似度计算结果:\n",
" 猫 vs 狗: 0.996 (都是动物,都有宠物属性)\n",
" 猫 vs 苹果: 0.382 (动物vs植物很不同)\n",
" 猫 vs 汽车: 0.482 (动物vs机械)\n",
" 猫 vs 石头: 0.060 (动物vs无机物)\n",
" 狗 vs 汽车: 0.507 (动物vs机械但都能移动)\n",
" 苹果 vs 石头: 0.705 (都是静态的)\n"
]
}
],
"source": [
"# 语义相似度示例\n",
"import numpy as np\n",
"\n",
"print(\"=\" * 50)\n",
"print(\"语义相似度示例(用向量模拟词义)\")\n",
"print(\"=\" * 50)\n",
"print()\n",
"\n",
"# 假设这些是词的\"意义向量\"(简化版)\n",
"# 维度解释: [动物性, 植物性, 可食用性, 移动性, 宠物性]\n",
"# 每个维度取值0-1表示该属性的强弱\n",
"\n",
"cat = np.array([0.9, 0.1, 0.7, 0.8, 0.9]) # 猫\n",
"dog = np.array([0.8, 0.2, 0.6, 0.8, 0.9]) # 狗\n",
"apple = np.array([0.1, 0.9, 0.9, 0.0, 0.0]) # 苹果\n",
"car = np.array([0.0, 0.0, 0.0, 0.9, 0.0]) # 汽车\n",
"rock = np.array([0.0, 0.1, 0.0, 0.0, 0.0]) # 石头\n",
"\n",
"print(\"词向量(简化模拟):\")\n",
"print(f\" 猫 = {cat}\")\n",
"print(f\" 狗 = {dog}\")\n",
"print(f\" 苹果 = {apple}\")\n",
"print(f\" 汽车 = {car}\")\n",
"print(f\" 石头 = {rock}\")\n",
"print()\n",
"print(\"维度说明: [动物性, 植物性, 可食用性, 移动性, 宠物性]\")\n",
"print()\n",
"\n",
"# 计算相似度\n",
"print(\"相似度计算结果:\")\n",
"print(f\" 猫 vs 狗: {cosine_similarity(cat, dog):.3f} (都是动物,都有宠物属性)\")\n",
"print(f\" 猫 vs 苹果: {cosine_similarity(cat, apple):.3f} (动物vs植物很不同)\")\n",
"print(f\" 猫 vs 汽车: {cosine_similarity(cat, car):.3f} (动物vs机械)\")\n",
"print(f\" 猫 vs 石头: {cosine_similarity(cat, rock):.3f} (动物vs无机物)\")\n",
"print(f\" 狗 vs 汽车: {cosine_similarity(dog, car):.3f} (动物vs机械但都能移动)\")\n",
"print(f\" 苹果 vs 石头: {cosine_similarity(apple, rock):.3f} (都是静态的)\")"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"==================================================\n",
"练习题4答案\n",
"==================================================\n",
"A = [1 2 3], B = [4 5 6]\n",
"\n",
"1. 点积 A · B = 32\n",
" 计算: 1×4 + 2×5 + 3×6 = 4 + 10 + 18 = 32\n",
"\n",
"2. 余弦相似度 = 0.9746\n",
"\n",
"3. A=[1,0], B=[0,1] 的余弦相似度 = 0.0\n",
" 原因这两个向量垂直夹角90°cos(90°)=0\n"
]
}
],
"source": [
"# 练习题4答案\n",
"import numpy as np\n",
"\n",
"print(\"=\" * 50)\n",
"print(\"练习题4答案\")\n",
"print(\"=\" * 50)\n",
"\n",
"A = np.array([1, 2, 3])\n",
"B = np.array([4, 5, 6])\n",
"\n",
"print(f\"A = {A}, B = {B}\")\n",
"print()\n",
"\n",
"# 1. 点积\n",
"dot = np.dot(A, B)\n",
"print(f\"1. 点积 A · B = {dot}\")\n",
"print(f\" 计算: 1×4 + 2×5 + 3×6 = 4 + 10 + 18 = {dot}\")\n",
"print()\n",
"\n",
"# 2. 余弦相似度\n",
"cos_sim = cosine_similarity(A, B)\n",
"print(f\"2. 余弦相似度 = {cos_sim:.4f}\")\n",
"print()\n",
"\n",
"# 3. 垂直向量的相似度\n",
"A = np.array([1, 0])\n",
"B = np.array([0, 1])\n",
"cos_sim = cosine_similarity(A, B)\n",
"print(f\"3. A=[1,0], B=[0,1] 的余弦相似度 = {cos_sim}\")\n",
"print(\" 原因这两个向量垂直夹角90°cos(90°)=0\")"
]
},
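{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"<a id=\"第五部分-文本向量化的核心思想\"></a>\n",
"# Part 5: The Core Idea of Text Vectorization\n",
"\n",
"## 5.1 Core goal: turn every piece of text into a vector\n",
"\n",
"```\n",
"┌──────────────────────────────────────────────────────────────────────┐\n",
"│                                                                      │\n",
"│  text (symbols) ──→ numeric vectors ──→ computation ──→ AI models    │\n",
"│                                                                      │\n",
"│  \"猫\" (cat)         [0.9, 0.1, 0.8]                                  │\n",
"│  \"狗\" (dog)         [0.8, 0.2, 0.7]                                  │\n",
"│                                                                      │\n",
"└──────────────────────────────────────────────────────────────────────┘\n",
"```\n",
"\n",
"### Why vectors?\n",
"\n",
"| Easy for a computer (numbers) | Hard for a computer (raw text) |\n",
"|------------|-------------|\n",
"| Vector addition and subtraction: v1 + v2 | Relatedness: `\"Python\" == \"Java\"` only says the strings differ, not how related they are |\n",
"| Dot product: v1 · v2 | Word similarity: is \"cat\" similar to \"dog\"? |\n",
"| Vector distance: \\|v1 - v2\\| | Semantic understanding: \"你好\" (hello) is a greeting |\n",
"| Cosine similarity: cos(θ) | Sentiment: is the slang \"绝了\" praise or an insult? |"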
]
},
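{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5.2 Vectorization example: from words to numbers\n",
"\n",
"### Method 1: one-hot encoding (position only, no semantics)\n",
"\n",
"```python\n",
"# Suppose we have a tiny vocabulary of only 5 words\n",
"vocab = [\"猫\", \"狗\", \"鱼\", \"苹果\", \"香蕉\"]  # cat, dog, fish, apple, banana\n",
"\n",
"# One-hot encoding: each word gets its own position\n",
"# \"猫\"   (cat)   → [1, 0, 0, 0, 0]   position 1 is 1, the rest are 0\n",
"# \"狗\"   (dog)   → [0, 1, 0, 0, 0]   position 2 is 1, the rest are 0\n",
"# \"苹果\" (apple) → [0, 0, 0, 1, 0]   position 4 is 1, the rest are 0\n",
"```\n",
"\n",
"**Problem**: a one-hot vector only records the word's position in the vocabulary and carries no semantic information.\n",
"\n",
"```\n",
"\"猫\" (cat) = [1, 0, 0, 0, 0]\n",
"\"狗\" (dog) = [0, 1, 0, 0, 0]\n",
"\n",
"cosine similarity = 0 (no similarity at all)\n",
"\n",
"Yet cats and dogs are both animals, so they should be quite similar!\n",
"```"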
]
},
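{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to build the Method 1 vectors in code is to take rows of an identity matrix. This is only a sketch under that assumption; `one_hot` and `word_to_index` are hypothetical names introduced here, not part of the notebook's other cells.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"vocab = [\"猫\", \"狗\", \"鱼\", \"苹果\", \"香蕉\"]  # cat, dog, fish, apple, banana\n",
"word_to_index = {word: i for i, word in enumerate(vocab)}\n",
"\n",
"def one_hot(word):\n",
"    \"\"\"Return the one-hot vector for a word in `vocab` (raises KeyError if unknown).\"\"\"\n",
"    return np.eye(len(vocab), dtype=int)[word_to_index[word]]\n",
"\n",
"cat_vec = one_hot(\"猫\")\n",
"dog_vec = one_hot(\"狗\")\n",
"print(cat_vec, dog_vec)\n",
"\n",
"# Two different words never share a non-zero position, so the dot product is 0\n",
"print(int(np.dot(cat_vec, dog_vec)))\n",
"```\n",
"\n",
"Because every pair of distinct words is orthogonal, no choice of vocabulary order can make related words look related."
]
},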
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"==================================================\n",
"位置编码的缺陷\n",
"==================================================\n",
"位置编码向量:\n",
" 猫 = [1 0 0 0 0]\n",
" 狗 = [0 1 0 0 0]\n",
" 苹果 = [0 0 0 1 0]\n",
"\n",
"余弦相似度(用位置编码):\n",
" 猫 vs 狗: 0.000\n",
" 猫 vs 苹果: 0.000\n",
"\n",
"问题猫和狗都是动物相似度却是0\n",
" 猫和苹果不是同类相似度也是0\n",
" 位置编码没有语义信息!\n"
]
}
],
"source": [
"# 位置编码的缺陷演示\n",
"import numpy as np\n",
"\n",
"print(\"=\" * 50)\n",
"print(\"位置编码的缺陷\")\n",
"print(\"=\" * 50)\n",
"\n",
"# 位置编码向量\n",
"cat_onehot = np.array([1, 0, 0, 0, 0]) # \"猫\"\n",
"dog_onehot = np.array([0, 1, 0, 0, 0]) # \"狗\"\n",
"apple_onehot = np.array([0, 0, 0, 1, 0]) # \"苹果\"\n",
"\n",
"print(\"位置编码向量:\")\n",
"print(f\" 猫 = {cat_onehot}\")\n",
"print(f\" 狗 = {dog_onehot}\")\n",
"print(f\" 苹果 = {apple_onehot}\")\n",
"print()\n",
"\n",
"# 相似度计算\n",
"print(\"余弦相似度(用位置编码):\")\n",
"print(f\" 猫 vs 狗: {cosine_similarity(cat_onehot, dog_onehot):.3f}\")\n",
"print(f\" 猫 vs 苹果: {cosine_similarity(cat_onehot, apple_onehot):.3f}\")\n",
"print()\n",
"print(\"问题猫和狗都是动物相似度却是0\")\n",
"print(\" 猫和苹果不是同类相似度也是0\")\n",
"print(\" 位置编码没有语义信息!\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Method 2: semantic encoding (carries meaning)\n",
"\n",
"```python\n",
"# Semantic encoding: describe each word by its attributes\n",
"# Dimensions: [animal, plant, edible, pet]\n",
"\n",
"cat = np.array([0.9, 0.1, 0.7, 0.9])    # 猫 (cat)\n",
"dog = np.array([0.8, 0.2, 0.6, 0.9])    # 狗 (dog)\n",
"apple = np.array([0.1, 0.9, 0.9, 0.0])  # 苹果 (apple)\n",
"```\n",
"\n",
"**This is the power of text vectorization: it turns meaning into numbers that can be computed on.**"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"==================================================\n",
"语义编码的优点\n",
"==================================================\n",
"语义编码向量:\n",
" 猫 = [0.9 0.1 0.7 0.9]\n",
" 狗 = [0.8 0.2 0.6 0.9]\n",
" 苹果 = [0.1 0.9 0.9 0. ]\n",
"\n",
"维度说明: [动物性, 植物性, 可食用性, 宠物性]\n",
"\n",
"余弦相似度(用语义编码):\n",
" 猫 vs 狗: 0.995 (都是动物,都有宠物属性)\n",
" 猫 vs 苹果: 0.436 (动物vs植物)\n",
"\n",
"太棒了!语义编码可以捕捉到词的语义相似性!\n"
]
}
],
"source": [
"# 语义编码的优点演示\n",
"import numpy as np\n",
"\n",
"print(\"=\" * 50)\n",
"print(\"语义编码的优点\")\n",
"print(\"=\" * 50)\n",
"\n",
"# 语义编码向量\n",
"# 维度: [动物性, 植物性, 可食用性, 宠物性]\n",
"cat = np.array([0.9, 0.1, 0.7, 0.9]) # 猫\n",
"dog = np.array([0.8, 0.2, 0.6, 0.9]) # 狗\n",
"apple = np.array([0.1, 0.9, 0.9, 0.0]) # 苹果\n",
"\n",
"print(\"语义编码向量:\")\n",
"print(f\" 猫 = {cat}\")\n",
"print(f\" 狗 = {dog}\")\n",
"print(f\" 苹果 = {apple}\")\n",
"print()\n",
"print(\"维度说明: [动物性, 植物性, 可食用性, 宠物性]\")\n",
"print()\n",
"\n",
"# 相似度计算\n",
"print(\"余弦相似度(用语义编码):\")\n",
"print(f\" 猫 vs 狗: {cosine_similarity(cat, dog):.3f} (都是动物,都有宠物属性)\")\n",
"print(f\" 猫 vs 苹果: {cosine_similarity(cat, apple):.3f} (动物vs植物)\")\n",
"print()\n",
"print(\"太棒了!语义编码可以捕捉到词的语义相似性!\")"
]
},
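{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5.3 How vectorization methods evolved\n",
"\n",
"```\n",
"Three main approaches to text vectorization:\n",
"\n",
"[ BoW ]         ───→   [ TF-IDF ]          ───→   [ Word Embedding ]\n",
"(bag of words)         (weighted counts)          (dense word vectors)\n",
"\n",
"simple counts          adds word importance       encodes meaning\n",
"no semantics           still no semantics         rich semantics\n",
"\n",
"1950s onward           1970s onward               2013 onward (Word2Vec)\n",
"```"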
]
},
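{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"<a id=\"第六部分-bow词袋模型\"></a>\n",
"# Part 6: BoW (Bag-of-Words) Model\n",
"\n",
"## 6.1 How it works\n",
"\n",
"Treat the text as a \"bag of words\": **ignore the order** and only count how many times each word occurs.\n",
"\n",
"```\n",
"Text 1: \"Python 是 编程 语言\"   (Python is a programming language)\n",
"Text 2: \"Java 是 编程 语言\"     (Java is a programming language)\n",
"\n",
"After tokenization:\n",
"  Doc1: [\"Python\", \"是\", \"编程\", \"语言\"]\n",
"  Doc2: [\"Java\", \"是\", \"编程\", \"语言\"]\n",
"\n",
"Build the vocabulary (all words across the documents, sorted):\n",
"  vocab: [\"Java\", \"Python\", \"是\", \"编程\", \"语言\"]\n",
"\n",
"Vectorize: count how often each vocabulary word occurs\n",
"  Doc1 → [0, 1, 1, 1, 1]   # Java 0 times, Python once, ...\n",
"  Doc2 → [1, 0, 1, 1, 1]   # Java once, Python 0 times, ...\n",
"```"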
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"==================================================\n",
"BoW词袋模型演示手动实现\n",
"==================================================\n",
"【示例1】文档集合\n",
" Doc1: Python 是 编程 语言\n",
" Doc2: Java 是 编程 语言\n",
"\n",
"词表: ['Java', 'Python', '是', '编程', '语言']\n",
"\n",
"BoW矩阵每行是一个文档每列是一个词\n",
" Doc1: [0, 1, 1, 1, 1]\n",
" Doc2: [1, 0, 1, 1, 1]\n",
"\n",
"详细解释:\n",
"\n",
"Doc1: Python 是 编程 语言\n",
" -> 'Python' 出现 1 次\n",
" -> '是' 出现 1 次\n",
" -> '编程' 出现 1 次\n",
" -> '语言' 出现 1 次\n",
"\n",
"Doc2: Java 是 编程 语言\n",
" -> 'Java' 出现 1 次\n",
" -> '是' 出现 1 次\n",
" -> '编程' 出现 1 次\n",
" -> '语言' 出现 1 次\n"
]
}
],
"source": [
"# BoW (bag-of-words) demo: pure Python, no sklearn dependency\n",
"\n",
"print(\"=\" * 50)\n",
"print(\"BoW词袋模型演示手动实现\")\n",
"print(\"=\" * 50)\n",
"\n",
"def simple_bow(docs):\n",
"    \"\"\"\n",
"    Minimal BoW implementation.\n",
"\n",
"    Args:\n",
"        docs: list of documents, each already tokenized into a list of words\n",
"    Returns:\n",
"        vocab: the vocabulary (sorted list)\n",
"        bow_matrix: BoW matrix (n_docs x n_vocab)\n",
"    \"\"\"\n",
"    # 1. Build the vocabulary\n",
"    vocab_set = set()\n",
"    for doc in docs:\n",
"        vocab_set.update(doc)\n",
"    vocab = sorted(vocab_set)  # sorting keeps the column order stable\n",
"\n",
"    # 2. Build the BoW matrix\n",
"    bow_matrix = []\n",
"    for doc in docs:\n",
"        vec = [0] * len(vocab)\n",
"        for word in doc:\n",
"            if word in vocab:\n",
"                vec[vocab.index(word)] += 1\n",
"        bow_matrix.append(vec)\n",
"\n",
"    return vocab, bow_matrix\n",
"\n",
"\n",
"# 示例1中文文档用空格分词\n",
"docs = [\n",
" [\"Python\", \"是\", \"编程\", \"语言\"],\n",
" [\"Java\", \"是\", \"编程\", \"语言\"],\n",
"]\n",
"\n",
"vocab, bow_matrix = simple_bow(docs)\n",
"\n",
"print(\"【示例1】文档集合\")\n",
"for i, doc in enumerate(docs):\n",
" print(f\" Doc{i+1}: {' '.join(doc)}\")\n",
"print()\n",
"\n",
"print(f\"词表: {vocab}\")\n",
"print()\n",
"\n",
"print(\"BoW矩阵每行是一个文档每列是一个词\")\n",
"for i, vec in enumerate(bow_matrix):\n",
" print(f\" Doc{i+1}: {vec}\")\n",
"print()\n",
"\n",
"# 详细解释\n",
"print(\"详细解释:\")\n",
"for i, doc in enumerate(docs):\n",
" print(f\"\\nDoc{i+1}: {' '.join(doc)}\")\n",
" for j, word in enumerate(vocab):\n",
" if bow_matrix[i][j] > 0:\n",
" print(f\" -> '{word}' 出现 {bow_matrix[i][j]} 次\")"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"==================================================\n",
"BoW词袋模型更多示例\n",
"==================================================\n",
"文档集合:\n",
" Doc1: 我 爱 Python 编程\n",
" Doc2: Python 很 好 学\n",
" Doc3: 我 爱 写 代码\n",
"\n",
"词表: ['Python', '代码', '写', '好', '学', '很', '我', '爱', '编程']\n",
"\n",
"BoW矩阵:\n",
" Doc1: [1, 0, 0, 0, 0, 0, 1, 1, 1]\n",
" Doc2: [1, 0, 0, 1, 1, 1, 0, 0, 0]\n",
" Doc3: [0, 1, 1, 0, 0, 0, 1, 1, 0]\n",
"\n",
"表格形式:\n",
"Doc | Python | 代码 | 写 | 好 | 学 | 很\n",
"----------------------------------\n",
"Doc1 | 1 | 0 | 0 | 0 | 0 | 0\n",
"Doc2 | 1 | 0 | 0 | 1 | 1 | 1\n",
"Doc3 | 0 | 1 | 1 | 0 | 0 | 0\n"
]
}
],
"source": [
"# 更多BoW示例纯Python实现\n",
"import numpy as np\n",
"\n",
"print(\"=\" * 50)\n",
"print(\"BoW词袋模型更多示例\")\n",
"print(\"=\" * 50)\n",
"\n",
"def simple_bow(docs):\n",
" \"\"\"简单的BoW实现\"\"\"\n",
" vocab_set = set()\n",
" for doc in docs:\n",
" vocab_set.update(doc)\n",
" vocab = sorted(list(vocab_set))\n",
" bow_matrix = []\n",
" for doc in docs:\n",
" vec = [0] * len(vocab)\n",
" for word in doc:\n",
" if word in vocab:\n",
" vec[vocab.index(word)] += 1\n",
" bow_matrix.append(vec)\n",
" return vocab, bow_matrix\n",
"\n",
"docs = [\n",
" [\"我\", \"爱\", \"Python\", \"编程\"],\n",
" [\"Python\", \"很\", \"好\", \"学\"],\n",
" [\"我\", \"爱\", \"写\", \"代码\"]\n",
"]\n",
"\n",
"vocab, bow_matrix = simple_bow(docs)\n",
"\n",
"print(\"文档集合:\")\n",
"for i, doc in enumerate(docs):\n",
" print(f\" Doc{i+1}: {' '.join(doc)}\")\n",
"print()\n",
"\n",
"print(f\"词表: {vocab}\")\n",
"print()\n",
"\n",
"print(\"BoW矩阵:\")\n",
"for i, vec in enumerate(bow_matrix):\n",
" print(f\" Doc{i+1}: {vec}\")\n",
"\n",
"print()\n",
"\n",
"# 显示成表格\n",
"print(\"表格形式:\")\n",
"header = \"Doc | \" + \" | \".join(vocab[:6])\n",
"print(header)\n",
"print(\"-\" * len(header))\n",
"for i, row in enumerate(bow_matrix):\n",
" print(f\"Doc{i+1} | \" + \" | \".join(map(str, row[:6])))"
]
},
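{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6.2 Pros and cons of BoW\n",
"\n",
"| Pros | Cons |\n",
"|------|------|\n",
"| **Simple and intuitive** | Ignores word order: \"我爱你\" (I love you) and \"你爱我\" (you love me) get the same vector |\n",
"| **Easy to implement** | Treats every word as equally important |\n",
"| **Fast to compute** | Cannot capture meaning |\n",
"| **Good baseline model** | Cannot handle synonyms: \"电脑\" and \"计算机\" (both \"computer\") are unrelated dimensions |\n",
"| | High-dimensional: one dimension per vocabulary word |"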
]
},
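{
"cell_type": "markdown",
"metadata": {},
"source": [
"The synonym and dimensionality limitations in the table above can be made concrete. In this sketch, two sentences with the same meaning differ in one word, so their BoW cosine similarity falls below 1. The helper `bow_vector` and both example sentences are assumptions made for illustration, not code from this notebook.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"doc_a = [\"我\", \"买\", \"了\", \"电脑\"]    # \"I bought a computer\"\n",
"doc_b = [\"我\", \"买\", \"了\", \"计算机\"]  # same meaning, different word for \"computer\"\n",
"\n",
"vocab = sorted(set(doc_a) | set(doc_b))\n",
"\n",
"def bow_vector(doc, vocab):\n",
"    \"\"\"Count how often each vocabulary word occurs in the tokenized document.\"\"\"\n",
"    return np.array([doc.count(word) for word in vocab])\n",
"\n",
"vec_a = bow_vector(doc_a, vocab)\n",
"vec_b = bow_vector(doc_b, vocab)\n",
"cosine = vec_a @ vec_b / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))\n",
"\n",
"print(\"vocab:\", vocab)\n",
"print(\"vector length equals vocabulary size:\", len(vec_a) == len(vocab))\n",
"print(f\"cosine similarity: {cosine:.2f}\")\n",
"```\n",
"\n",
"The two synonyms occupy separate columns, so BoW counts them as a mismatch even though a human reader sees identical sentences."
]
},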
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"==================================================\n",
"BoW ignores word order\n",
"==================================================\n",
"Documents:\n",
"  Doc1: 我爱你\n",
"  Doc2: 你爱我\n",
"  Doc3: 爱你我\n",
"\n",
"BoW matrix:\n",
"  Doc1: [1, 1, 1, 0]\n",
"  Doc2: [1, 1, 1, 0]\n",
"  Doc3: [0, 0, 0, 1]\n",
"\n",
"Vocabulary: ['你', '我', '爱', '爱你我']\n",
"\n",
"Problem: Doc1 and Doc2 mean different things, yet their BoW vectors are identical.\n",
"Doc1: 我爱你 (I love you)\n",
"Doc2: 你爱我 (you love me)\n",
"Doc3: 爱你我 (not tokenized, so it counts as one separate word)\n",
"\n",
"Conclusion: BoW discards word order, and the result depends on tokenization.\n"
]
}
],
"source": [
"# Demo: BoW ignores word order\n",
"import numpy as np\n",
"\n",
"print(\"=\" * 50)\n",
"print(\"BoW ignores word order\")\n",
"print(\"=\" * 50)\n",
"\n",
"def simple_bow(docs):\n",
"    \"\"\"Minimal BoW implementation.\"\"\"\n",
"    vocab_set = set()\n",
"    for doc in docs:\n",
"        vocab_set.update(doc)\n",
"    vocab = sorted(vocab_set)\n",
"    bow_matrix = []\n",
"    for doc in docs:\n",
"        vec = [0] * len(vocab)\n",
"        for word in doc:\n",
"            if word in vocab:\n",
"                vec[vocab.index(word)] += 1\n",
"        bow_matrix.append(vec)\n",
"    return vocab, bow_matrix\n",
"\n",
"# Doc1 and Doc2 mean different things but get the same BoW vector\n",
"docs = [\n",
"    [\"我\", \"爱\", \"你\"],  # \"I love you\"\n",
"    [\"你\", \"爱\", \"我\"],  # \"you love me\" (roles reversed)\n",
"    [\"爱你我\"],          # not tokenized: treated as one single word\n",
"]\n",
"\n",
"vocab, bow_matrix = simple_bow(docs)\n",
"\n",
"print(\"Documents:\")\n",
"for i, doc in enumerate(docs):\n",
"    print(f\"  Doc{i+1}: {''.join(doc)}\")\n",
"print()\n",
"\n",
"print(\"BoW matrix:\")\n",
"for i, vec in enumerate(bow_matrix):\n",
"    print(f\"  Doc{i+1}: {vec}\")\n",
"print()\n",
"\n",
"print(f\"Vocabulary: {vocab}\")\n",
"print()\n",
"\n",
"print(\"Problem: Doc1 and Doc2 mean different things, yet their BoW vectors are identical.\")\n",
"print(\"Doc1: 我爱你 (I love you)\")\n",
"print(\"Doc2: 你爱我 (you love me)\")\n",
"print(\"Doc3: 爱你我 (not tokenized, so it counts as one separate word)\")\n",
"print()\n",
"print(\"Conclusion: BoW discards word order, and the result depends on tokenization.\")"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"==================================================\n",
"练习题5答案\n",
"==================================================\n",
"文档集合:\n",
" Doc1: Python 是 编程 语言\n",
" Doc2: Java 是 编程 语言\n",
" Doc3: Python Python Python\n",
"\n",
"词表: ['Java', 'Python', '是', '编程', '语言']\n",
"\n",
"BoW矩阵每行是一个文档的向量\n",
" Doc1: [0, 1, 1, 1, 1]\n",
" Doc2: [1, 0, 1, 1, 1]\n",
" Doc3: [0, 3, 0, 0, 0]\n",
"\n",
"解析:\n",
" Doc1: [0, 1, 1, 1, 1]\n",
" - 'Python' 出现 1 次\n",
" - '是' 出现 1 次\n",
" - '编程' 出现 1 次\n",
" - '语言' 出现 1 次\n",
" Doc2: [1, 0, 1, 1, 1]\n",
" - 'Java' 出现 1 次\n",
" - '是' 出现 1 次\n",
" - '编程' 出现 1 次\n",
" - '语言' 出现 1 次\n",
" Doc3: [0, 3, 0, 0, 0]\n",
" - 'Python' 出现 3 次\n"
]
}
],
"source": [
"# 练习题5答案\n",
"import numpy as np\n",
"\n",
"print(\"=\" * 50)\n",
"print(\"练习题5答案\")\n",
"print(\"=\" * 50)\n",
"\n",
"def simple_bow(docs):\n",
" \"\"\"简单的BoW实现\"\"\"\n",
" vocab_set = set()\n",
" for doc in docs:\n",
" vocab_set.update(doc)\n",
" vocab = sorted(list(vocab_set))\n",
" bow_matrix = []\n",
" for doc in docs:\n",
" vec = [0] * len(vocab)\n",
" for word in doc:\n",
" if word in vocab:\n",
" vec[vocab.index(word)] += 1\n",
" bow_matrix.append(vec)\n",
" return vocab, bow_matrix\n",
"\n",
"docs = [\n",
" [\"Python\", \"是\", \"编程\", \"语言\"],\n",
" [\"Java\", \"是\", \"编程\", \"语言\"],\n",
" [\"Python\", \"Python\", \"Python\"]\n",
"]\n",
"\n",
"vocab, bow_matrix = simple_bow(docs)\n",
"\n",
"print(\"文档集合:\")\n",
"for i, doc in enumerate(docs):\n",
" print(f\" Doc{i+1}: {' '.join(doc)}\")\n",
"print()\n",
"\n",
"print(f\"词表: {vocab}\")\n",
"print()\n",
"\n",
"print(\"BoW矩阵每行是一个文档的向量\")\n",
"for i, vec in enumerate(bow_matrix):\n",
" print(f\" Doc{i+1}: {vec}\")\n",
"print()\n",
"\n",
"print(\"解析:\")\n",
"for i, vec in enumerate(bow_matrix):\n",
" print(f\" Doc{i+1}: {vec}\")\n",
" for j, count in enumerate(vec):\n",
" if count > 0:\n",
" print(f\" - '{vocab[j]}' 出现 {count} 次\")"
]
},
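{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"<a id=\"第七部分-tf-idf\"></a>\n",
"# Part 7: TF-IDF\n",
"\n",
"## 7.1 Why do we need TF-IDF?\n",
"\n",
"**The problem with BoW**: every word is treated as equally important.\n",
"\n",
"```\n",
"Doc A: \"Python 是 编程 语言, Python Python Python\"\n",
"Doc B: \"Python 是 编程 语言\"\n",
"\n",
"BoW result:\n",
"  Doc A: Python=4, 是=1, 编程=1, 语言=1\n",
"  Doc B: Python=1, 是=1, 编程=1, 语言=1\n",
"\n",
"Problem: BoW only stores raw counts.\n",
"  \"是\" (is) appears in every document and says nothing about the topic,\n",
"  yet it gets the same weight as \"编程\" (programming) or \"语言\" (language).\n",
"  We want topic words such as \"Python\" to weigh more than filler words.\n",
"```\n",
"\n",
"**Key insights**\n",
"- A frequent word is not necessarily important (\"的\" and \"了\" appear in every article)\n",
"- A rare word is not necessarily unimportant (\"TensorFlow\" appears only in AI articles, so it is very informative)"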
]
},
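{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7.2 The TF-IDF formula\n",
"\n",
"**TF-IDF = term frequency (TF) × inverse document frequency (IDF)**\n",
"\n",
"```\n",
"TF  = number of times the word occurs in this document\n",
"IDF = log(N / df)    N = total number of documents, df = number of documents containing the word\n",
"\n",
"TF-IDF = TF × IDF\n",
"```\n",
"\n",
"The code cells in this part use a smoothed variant, `IDF = log(N / (df + 1)) + 1`, so their numbers differ from the plain formula above.\n",
"\n",
"### What IDF means\n",
"\n",
"| Word | Documents containing it (df) | IDF = log(N / df) | Interpretation |\n",
"|----|----------------|-------|------|\n",
"| \"的\" | All documents (df = N) | log(1) = 0 | Appears everywhere, not informative |\n",
"| \"Python\" | A few documents | Fairly high | Fairly distinctive, important |\n",
"| \"TensorFlow\" | Very few documents | Higher | Very distinctive, very important |\n",
"| \"AI\" | Only 1 document (df = 1) | log(N), the maximum | Most distinctive, most important |"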
]
},
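{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small worked example may help with the IDF table above. The sketch below computes the plain `log(N / df)` next to the smoothed `log(N / (df + 1)) + 1` that this notebook's `simple_tfidf` cells use. The function names and the three-document toy corpus are assumptions made for illustration.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"docs = [\n",
"    [\"Python\", \"是\", \"编程\", \"语言\"],\n",
"    [\"Java\", \"是\", \"编程\", \"语言\"],\n",
"    [\"Python\", \"是\", \"脚本\", \"语言\"],\n",
"]\n",
"\n",
"def document_frequency(word, docs):\n",
"    \"\"\"Number of documents that contain the word at least once.\"\"\"\n",
"    return sum(1 for doc in docs if word in doc)\n",
"\n",
"def idf_plain(word, docs):\n",
"    \"\"\"Plain IDF = log(N / df); the word must occur in at least one document.\"\"\"\n",
"    return math.log(len(docs) / document_frequency(word, docs))\n",
"\n",
"def idf_smoothed(word, docs):\n",
"    \"\"\"Smoothed variant used by the notebook's code: log(N / (df + 1)) + 1.\"\"\"\n",
"    return math.log(len(docs) / (document_frequency(word, docs) + 1)) + 1\n",
"\n",
"for word in [\"是\", \"Python\", \"Java\"]:\n",
"    print(word, document_frequency(word, docs),\n",
"          round(idf_plain(word, docs), 4), round(idf_smoothed(word, docs), 4))\n",
"```\n",
"\n",
"Both variants rank the words the same way (rarer words score higher); the smoothed form simply shifts the scale, which is why a word present in every document does not drop to exactly 0 in the code cells."
]
},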
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"==================================================\n",
"TF-IDF词频-逆文档频率演示\n",
"==================================================\n",
"文档集合:\n",
" Doc1: Python 编程 语言\n",
" Doc2: Python Python Python\n",
" Doc3: Java 编程 语言\n",
"\n",
"词表: ['Java', 'Python', '编程', '语言']\n",
"\n",
"IDF值: [1.4055, 1.0, 1.0, 1.0]\n",
"\n",
"TF-IDF矩阵\n",
" Doc1: [0.0, 1.0, 1.0, 1.0]\n",
" Doc2: [0.0, 3.0, 0.0, 0.0]\n",
" Doc3: [1.4055, 0.0, 1.0, 1.0]\n",
"\n",
"详细分析:\n",
"\n",
"Doc1: Python 编程 语言\n",
" 'Python': TF-IDF = 1.0000\n",
" '编程': TF-IDF = 1.0000\n",
" '语言': TF-IDF = 1.0000\n",
"\n",
"Doc2: Python Python Python\n",
" 'Python': TF-IDF = 3.0000\n",
"\n",
"Doc3: Java 编程 语言\n",
" 'Java': TF-IDF = 1.4055\n",
" '编程': TF-IDF = 1.0000\n",
" '语言': TF-IDF = 1.0000\n"
]
}
],
"source": [
"# TF-IDF demo: pure Python implementation\n",
"import math\n",
"\n",
"print(\"=\" * 50)\n",
"print(\"TF-IDF词频-逆文档频率演示\")\n",
"print(\"=\" * 50)\n",
"\n",
"def simple_tfidf(docs):\n",
"    \"\"\"\n",
"    Minimal TF-IDF implementation.\n",
"\n",
"    Args:\n",
"        docs: list of documents, each already tokenized into a list of words\n",
"    Returns:\n",
"        vocab: the vocabulary (sorted list)\n",
"        tfidf_matrix: TF-IDF matrix (n_docs x n_vocab)\n",
"        idf: the IDF value of each vocabulary word\n",
"    \"\"\"\n",
"    # 1. Build the vocabulary\n",
"    vocab_set = set()\n",
"    for doc in docs:\n",
"        vocab_set.update(doc)\n",
"    vocab = sorted(vocab_set)\n",
"\n",
"    # 2. Build the BoW matrix (raw term counts)\n",
"    bow = []\n",
"    for doc in docs:\n",
"        vec = [0] * len(vocab)\n",
"        for word in doc:\n",
"            if word in vocab:\n",
"                vec[vocab.index(word)] += 1\n",
"        bow.append(vec)\n",
"\n",
"    n_docs = len(docs)\n",
"\n",
"    # 3. Compute IDF with smoothing: log(N / (df + 1)) + 1\n",
"    idf = []\n",
"    for j, word in enumerate(vocab):\n",
"        df = sum(1 for vec in bow if vec[j] > 0)\n",
"        idf_j = math.log(n_docs / (df + 1)) + 1\n",
"        idf.append(idf_j)\n",
"\n",
"    # 4. Compute TF-IDF = TF × IDF\n",
"    tfidf = []\n",
"    for vec in bow:\n",
"        tfidf_vec = []\n",
"        for i, tf in enumerate(vec):\n",
"            tfidf_vec.append(tf * idf[i])\n",
"        tfidf.append(tfidf_vec)\n",
"\n",
"    return vocab, tfidf, idf\n",
"\n",
"docs = [\n",
" [\"Python\", \"编程\", \"语言\"],\n",
" [\"Python\", \"Python\", \"Python\"], # Python出现3次\n",
" [\"Java\", \"编程\", \"语言\"],\n",
"]\n",
"\n",
"vocab, tfidf_matrix, idf = simple_tfidf(docs)\n",
"\n",
"print(\"文档集合:\")\n",
"for i, doc in enumerate(docs):\n",
" print(f\" Doc{i+1}: {' '.join(doc)}\")\n",
"print()\n",
"\n",
"print(f\"词表: {vocab}\")\n",
"print()\n",
"print(f\"IDF值: {[round(x, 4) for x in idf]}\")\n",
"print()\n",
"\n",
"print(\"TF-IDF矩阵\")\n",
"for i, vec in enumerate(tfidf_matrix):\n",
" print(f\" Doc{i+1}: {[round(x, 4) for x in vec]}\")\n",
"print()\n",
"\n",
"print(\"详细分析:\")\n",
"for i, doc in enumerate(docs):\n",
" print(f\"\\nDoc{i+1}: {' '.join(doc)}\")\n",
" for j, score in enumerate(tfidf_matrix[i]):\n",
" if score > 0:\n",
" print(f\" '{vocab[j]}': TF-IDF = {score:.4f}\")"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"==================================================\n",
"TF-IDF vs BoW\n",
"==================================================\n",
"Documents:\n",
"  Doc1: Python 编程\n",
"  Doc2: Java 编程\n",
"  Doc3: Python Python Python\n",
"\n",
"BoW matrix:\n",
"  Doc1: [0, 1, 1]\n",
"  Doc2: [1, 0, 1]\n",
"  Doc3: [0, 3, 0]\n",
"\n",
"TF-IDF matrix:\n",
"  Doc1: [0.0, 1.0, 1.0]\n",
"  Doc2: [1.4055, 0.0, 1.0]\n",
"  Doc3: [0.0, 3.0, 0.0]\n",
"\n",
"Key observation:\n",
"Doc3 'Python Python Python':\n",
"  BoW: Python occurs 3 times\n",
"  TF-IDF: Python scores 3 × 1.0000 = 3.0000\n",
"\n",
"Why is Python's weight per occurrence lower than Java's?\n",
"Python appears in 2 of 3 documents (Doc1, Doc3), so IDF = 1.0000; Java appears in only 1, so IDF = 1.4055.\n"
]
}
],
"source": [
"# TF-IDF vs BoW comparison\n",
"import math\n",
"\n",
"print(\"=\" * 50)\n",
"print(\"TF-IDF vs BoW\")\n",
"print(\"=\" * 50)\n",
"\n",
"def simple_bow(docs):\n",
"    \"\"\"Minimal BoW implementation.\"\"\"\n",
"    vocab_set = set()\n",
"    for doc in docs:\n",
"        vocab_set.update(doc)\n",
"    vocab = sorted(vocab_set)\n",
"    bow_matrix = []\n",
"    for doc in docs:\n",
"        vec = [0] * len(vocab)\n",
"        for word in doc:\n",
"            if word in vocab:\n",
"                vec[vocab.index(word)] += 1\n",
"        bow_matrix.append(vec)\n",
"    return vocab, bow_matrix\n",
"\n",
"def simple_tfidf(docs):\n",
"    \"\"\"Minimal TF-IDF implementation with smoothed IDF: log(N / (df + 1)) + 1.\"\"\"\n",
"    vocab, bow = simple_bow(docs)\n",
"\n",
"    n_docs = len(docs)\n",
"    idf = []\n",
"    for j, word in enumerate(vocab):\n",
"        df = sum(1 for vec in bow if vec[j] > 0)\n",
"        idf_j = math.log(n_docs / (df + 1)) + 1\n",
"        idf.append(idf_j)\n",
"\n",
"    tfidf = []\n",
"    for vec in bow:\n",
"        tfidf_vec = []\n",
"        for i, tf in enumerate(vec):\n",
"            tfidf_vec.append(tf * idf[i])\n",
"        tfidf.append(tfidf_vec)\n",
"\n",
"    return vocab, tfidf, idf\n",
"\n",
"docs = [\n",
"    [\"Python\", \"编程\"],\n",
"    [\"Java\", \"编程\"],\n",
"    [\"Python\", \"Python\", \"Python\"]  # Python occurs 3 times\n",
"]\n",
"\n",
"vocab_bow, bow_matrix = simple_bow(docs)\n",
"vocab_tfidf, tfidf_matrix, idf = simple_tfidf(docs)\n",
"\n",
"print(\"Documents:\")\n",
"for i, doc in enumerate(docs):\n",
"    print(f\"  Doc{i+1}: {' '.join(doc)}\")\n",
"print()\n",
"\n",
"print(\"BoW matrix:\")\n",
"for i, vec in enumerate(bow_matrix):\n",
"    print(f\"  Doc{i+1}: {vec}\")\n",
"print()\n",
"\n",
"print(\"TF-IDF matrix:\")\n",
"for i, vec in enumerate(tfidf_matrix):\n",
"    print(f\"  Doc{i+1}: {[round(x, 4) for x in vec]}\")\n",
"print()\n",
"\n",
"# Focus on Doc3: look up columns by word, since column 0 is \"Java\", not \"Python\"\n",
"py_idx = vocab_tfidf.index(\"Python\")\n",
"java_idx = vocab_tfidf.index(\"Java\")\n",
"print(\"Key observation:\")\n",
"print(\"Doc3 'Python Python Python':\")\n",
"print(\"  BoW: Python occurs 3 times\")\n",
"print(f\"  TF-IDF: Python scores 3 × {idf[py_idx]:.4f} = {tfidf_matrix[2][py_idx]:.4f}\")\n",
"print()\n",
"print(\"Why is Python's weight per occurrence lower than Java's?\")\n",
"print(f\"Python appears in 2 of 3 documents (Doc1, Doc3), so IDF = {idf[py_idx]:.4f}; \"\n",
"      f\"Java appears in only 1, so IDF = {idf[java_idx]:.4f}.\")"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"==================================================\n",
"Bonus exercise: answers\n",
"==================================================\n",
"Documents:\n",
"  Doc1: Python 编程\n",
"  Doc2: Java 编程\n",
"  Doc3: Python Python\n",
"\n",
"Vocabulary: ['Java', 'Python', '编程']\n",
"\n",
"IDF values: [1.4055, 1.0, 1.0]\n",
"\n",
"TF-IDF matrix:\n",
"  Doc1: [0.0, 1.0, 1.0]\n",
"  Doc2: [1.4055, 0.0, 1.0]\n",
"  Doc3: [0.0, 2.0, 0.0]\n",
"\n",
"Question 1: Is Python's TF-IDF value in Doc3 the highest in the matrix?\n",
"Answer: Yes. Python appears in 2 of 3 documents (Doc1 and Doc3),\n",
"  so IDF = log(3/(2+1)) + 1 = 1.0000 and TF-IDF = 2 × 1.0000 = 2.0000.\n",
"\n",
"Question 2: What is Java's TF-IDF value in Doc2?\n",
"Answer: Java's TF-IDF in Doc2 = 1.4055\n",
"  Java appears only in Doc2, so its IDF is higher: log(3/(1+1)) + 1\n"
]
}
],
"source": [
"# Bonus exercise: answers\n",
"import math\n",
"\n",
"print(\"=\" * 50)\n",
"print(\"Bonus exercise: answers\")\n",
"print(\"=\" * 50)\n",
"\n",
"def simple_tfidf(docs):\n",
"    \"\"\"Minimal TF-IDF implementation with smoothed IDF: log(N / (df + 1)) + 1.\"\"\"\n",
"    vocab_set = set()\n",
"    for doc in docs:\n",
"        vocab_set.update(doc)\n",
"    vocab = sorted(vocab_set)\n",
"    bow = []\n",
"    for doc in docs:\n",
"        vec = [0] * len(vocab)\n",
"        for word in doc:\n",
"            if word in vocab:\n",
"                vec[vocab.index(word)] += 1\n",
"        bow.append(vec)\n",
"\n",
"    n_docs = len(docs)\n",
"    idf = []\n",
"    for j, word in enumerate(vocab):\n",
"        df = sum(1 for vec in bow if vec[j] > 0)\n",
"        idf_j = math.log(n_docs / (df + 1)) + 1\n",
"        idf.append(idf_j)\n",
"\n",
"    tfidf = []\n",
"    for vec in bow:\n",
"        tfidf_vec = []\n",
"        for i, tf in enumerate(vec):\n",
"            tfidf_vec.append(tf * idf[i])\n",
"        tfidf.append(tfidf_vec)\n",
"\n",
"    return vocab, tfidf, idf\n",
"\n",
"docs = [[\"Python\", \"编程\"], [\"Java\", \"编程\"], [\"Python\", \"Python\"]]\n",
"\n",
"vocab, tfidf_matrix, idf = simple_tfidf(docs)\n",
"\n",
"print(\"Documents:\")\n",
"for i, doc in enumerate(docs):\n",
"    print(f\"  Doc{i+1}: {' '.join(doc)}\")\n",
"print()\n",
"\n",
"print(f\"Vocabulary: {vocab}\")\n",
"print()\n",
"print(f\"IDF values: {[round(x, 4) for x in idf]}\")\n",
"print()\n",
"\n",
"print(\"TF-IDF matrix:\")\n",
"for i, vec in enumerate(tfidf_matrix):\n",
"    print(f\"  Doc{i+1}: {[round(x, 4) for x in vec]}\")\n",
"print()\n",
"\n",
"py_idx = vocab.index(\"Python\")\n",
"print(\"Question 1: Is Python's TF-IDF value in Doc3 the highest in the matrix?\")\n",
"print(\"Answer: Yes. Python appears in 2 of 3 documents (Doc1 and Doc3),\")\n",
"print(f\"  so IDF = log(3/(2+1)) + 1 = {idf[py_idx]:.4f} and TF-IDF = 2 × {idf[py_idx]:.4f} = {tfidf_matrix[2][py_idx]:.4f}.\")\n",
"print()\n",
"print(\"Question 2: What is Java's TF-IDF value in Doc2?\")\n",
"java_idx = vocab.index(\"Java\")\n",
"print(f\"Answer: Java's TF-IDF in Doc2 = {tfidf_matrix[1][java_idx]:.4f}\")\n",
"print(\"  Java appears only in Doc2, so its IDF is higher: log(3/(1+1)) + 1\")"
]
},
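{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7.3 Pros and cons of TF-IDF\n",
"\n",
"| Pros | Cons |\n",
"|------|------|\n",
"| Accounts for how informative a word is | Ignores word order |\n",
"| Down-weights common words | Cannot capture meaning |\n",
"| Up-weights distinctive words | \"猫\" (cat) and \"狗\" (dog) are separate dimensions, so their relatedness is invisible |\n",
"| Can be used to extract keywords | Cannot handle synonyms: \"电脑\" vs \"计算机\" (both \"computer\") |"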
]
},
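{
"cell_type": "markdown",
"metadata": {},
"source": [
"The keyword-extraction advantage listed above can be sketched in a few lines: score each word of a document by TF × IDF and keep the highest-scoring ones. The helper `top_keywords` and the toy corpus are hypothetical, and the IDF here is the plain `log(N / df)` from section 7.2 rather than the smoothed variant used in the code cells.\n",
"\n",
"```python\n",
"import math\n",
"from collections import Counter\n",
"\n",
"docs = [\n",
"    [\"Python\", \"是\", \"一门\", \"编程\", \"语言\"],\n",
"    [\"Java\", \"是\", \"一门\", \"编程\", \"语言\"],\n",
"    [\"TensorFlow\", \"是\", \"一个\", \"深度\", \"学习\", \"框架\"],\n",
"]\n",
"\n",
"def top_keywords(doc, docs, k=2):\n",
"    \"\"\"Return the k words of `doc` with the highest TF-IDF score.\"\"\"\n",
"    n_docs = len(docs)\n",
"    scores = {}\n",
"    for word, count in Counter(doc).items():\n",
"        df = sum(1 for d in docs if word in d)\n",
"        scores[word] = count * math.log(n_docs / df)\n",
"    # Highest score first; ties are broken by string order to stay deterministic\n",
"    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))\n",
"    return [word for word, _ in ranked[:k]]\n",
"\n",
"for i, doc in enumerate(docs, start=1):\n",
"    print(f\"Doc{i}:\", top_keywords(doc, docs))\n",
"```\n",
"\n",
"The word \"是\" occurs in every document, so its score is 0 and it never outranks a topic word such as \"Python\" or \"TensorFlow\"."
]
},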
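{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"<a id=\"第八部分-word-embedding词嵌入\"></a>\n",
"# Part 8: Word Embedding\n",
"\n",
"## 8.1 The fundamental problem with BoW and TF-IDF\n",
"\n",
"```python\n",
"# The problem with one-hot style vectors\n",
"# \"猫\"   (cat)    → [1, 0, 0, ...]   just a position in the vocabulary\n",
"# \"狗\"   (dog)    → [0, 1, 0, ...]   a different position\n",
"# \"小猫\" (kitten) → [0, 0, 1, ...]   close in meaning, yet the vectors are orthogonal!\n",
"\n",
"# Problem: semantic similarity cannot be expressed.\n",
"# \"猫\" and \"狗\" are both animals and semantically close,\n",
"# but in BoW/TF-IDF each word is its own dimension, so their vectors share nothing.\n",
"```\n",
"\n",
"### The core idea of word embeddings\n",
"\n",
"```\n",
"Represent a word by its place in a \"semantic space\" instead of its position in a vocabulary.\n",
"\n",
"Semantic space (simplified to two dimensions):\n",
"\n",
"  animal\n",
"    ↑\n",
"    │  猫 (cat)   狗 (dog)\n",
"    │\n",
"    │\n",
"    │                      苹果 (apple)   香蕉 (banana)\n",
"    └────────────────────────────────────────→ plant\n",
"\n",
"Words with similar meanings lie close together in the space.\n",
"```"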
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"==================================================\n",
"Word embedding concept demo\n",
"==================================================\n",
"\n",
"Word vectors (simplified to 3 dimensions)\n",
"Dimension meanings: [animal, plant, technology]\n",
"\n",
"  猫: [0.9 0.1 0.2]\n",
"  狗: [0.8 0.3 0.1]\n",
"  小猫: [0.85 0.2  0.15]\n",
"  苹果: [0.1 0.9 0.2]\n",
"  香蕉: [0.1  0.85 0.1 ]\n",
"  Python: [0.1 0.  0.9]\n",
"  Java: [0.1  0.   0.85]\n",
"\n",
"Semantic similarity:\n",
"  猫 vs 狗: 0.965\n",
"  猫 vs 小猫: 0.992\n",
"  猫 vs 苹果: 0.256\n",
"  苹果 vs 香蕉: 0.995\n",
"  Python vs Java: 1.000\n",
"\n",
"Advantages of word embeddings:\n",
"  - Semantically similar words have similar vectors\n",
"  - Analogies become vector arithmetic: king - man + woman ≈ queen\n"
]
}
],
"source": [
"# Concept demo of Word2Vec-style word embeddings\n",
"import numpy as np\n",
"\n",
"print(\"=\" * 50)\n",
"print(\"Word embedding concept demo\")\n",
"print(\"=\" * 50)\n",
"print()\n",
"\n",
"# Hand-made stand-ins for vectors that a method such as Word2Vec would learn (simplified to 3 dimensions).\n",
"# Real embeddings usually have 50, 100, or 300 dimensions.\n",
"word_vectors = {\n",
"    \"猫\": np.array([0.9, 0.1, 0.2]),       # cat: strongly animal\n",
"    \"狗\": np.array([0.8, 0.3, 0.1]),       # dog: strongly animal\n",
"    \"小猫\": np.array([0.85, 0.2, 0.15]),   # kitten: an animal, close to cat\n",
"    \"苹果\": np.array([0.1, 0.9, 0.2]),     # apple: strongly plant (fruit)\n",
"    \"香蕉\": np.array([0.1, 0.85, 0.1]),    # banana: strongly plant (fruit)\n",
"    \"Python\": np.array([0.1, 0.0, 0.9]),   # programming language\n",
"    \"Java\": np.array([0.1, 0.0, 0.85]),    # programming language\n",
"}\n",
"\n",
"print(\"Word vectors (simplified to 3 dimensions)\")\n",
"print(\"Dimension meanings: [animal, plant, technology]\")\n",
"print()\n",
"for word, vec in word_vectors.items():\n",
"    print(f\"  {word}: {vec}\")\n",
"print()\n",
"\n",
"# Compute similarities (cosine_similarity was defined earlier in this notebook)\n",
"print(\"Semantic similarity:\")\n",
"print(f\"  猫 vs 狗: {cosine_similarity(word_vectors['猫'], word_vectors['狗']):.3f}\")\n",
"print(f\"  猫 vs 小猫: {cosine_similarity(word_vectors['猫'], word_vectors['小猫']):.3f}\")\n",
"print(f\"  猫 vs 苹果: {cosine_similarity(word_vectors['猫'], word_vectors['苹果']):.3f}\")\n",
"print(f\"  苹果 vs 香蕉: {cosine_similarity(word_vectors['苹果'], word_vectors['香蕉']):.3f}\")\n",
"print(f\"  Python vs Java: {cosine_similarity(word_vectors['Python'], word_vectors['Java']):.3f}\")\n",
"print()\n",
"print(\"Advantages of word embeddings:\")\n",
"print(\"  - Semantically similar words have similar vectors\")\n",
"print(\"  - Analogies become vector arithmetic: king - man + woman ≈ queen\")"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"==================================================\n",
"Analogy reasoning with word embeddings\n",
"==================================================\n",
"\n",
"Word vectors (simplified):\n",
"  King: [0.9 0.1 0.8 0.3]\n",
"  Man: [0.8 0.1 0.2 0.5]\n",
"  Woman: [0.1 0.8 0.2 0.5]\n",
"  Queen: [0.1 0.9 0.8 0.3]\n",
"\n",
"Dimension meanings: [male, female, royalty/power, human]\n",
"\n",
"King - Man + Woman = [0.2 0.8 0.8 0.3]\n",
"Queen = [0.1 0.9 0.8 0.3]\n",
"\n",
"Similarity check:\n",
"  (King - Man + Woman) vs Queen: 0.994\n",
"\n",
"Conclusion: vector arithmetic can express semantic relations.\n",
"  'king' - 'man' + 'woman' ≈ 'queen'\n",
"  These toy vectors are hand-made; trained embeddings such as Word2Vec show a similar, noisier pattern.\n"
]
}
],
"source": [
"# Analogy reasoning with word embeddings\n",
"import numpy as np\n",
"\n",
"print(\"=\" * 50)\n",
"print(\"Analogy reasoning with word embeddings\")\n",
"print(\"=\" * 50)\n",
"print()\n",
"\n",
"# Classic example: King - Man + Woman ≈ Queen\n",
"# Hand-made toy vectors; in practice a neural network learns them.\n",
"# Dimensions: [male, female, royalty/power, human]\n",
"king = np.array([0.9, 0.1, 0.8, 0.3])   # male, royal and powerful\n",
"man = np.array([0.8, 0.1, 0.2, 0.5])    # male\n",
"woman = np.array([0.1, 0.8, 0.2, 0.5])  # female\n",
"queen = np.array([0.1, 0.9, 0.8, 0.3])  # female, royal and powerful\n",
"\n",
"print(\"Word vectors (simplified):\")\n",
"print(f\"  King: {king}\")\n",
"print(f\"  Man: {man}\")\n",
"print(f\"  Woman: {woman}\")\n",
"print(f\"  Queen: {queen}\")\n",
"print()\n",
"print(\"Dimension meanings: [male, female, royalty/power, human]\")\n",
"print()\n",
"\n",
"# Compute King - Man + Woman\n",
"result = king - man + woman\n",
"print(f\"King - Man + Woman = {result}\")\n",
"print(f\"Queen = {queen}\")\n",
"print()\n",
"\n",
"# Compare the result with Queen (cosine_similarity was defined earlier in this notebook)\n",
"print(\"Similarity check:\")\n",
"print(f\"  (King - Man + Woman) vs Queen: {cosine_similarity(result, queen):.3f}\")\n",
"print()\n",
"\n",
"print(\"Conclusion: vector arithmetic can express semantic relations.\")\n",
"print(\"  'king' - 'man' + 'woman' ≈ 'queen'\")\n",
"print(\"  These toy vectors are hand-made; trained embeddings such as Word2Vec show a similar, noisier pattern.\")"
]
},
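{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8.2 A Brief History of Word Embeddings\n",
"\n",
"| Method | Year | Key points |\n",
"|------|------|------|\n",
"| Word2Vec | 2013 | Open-sourced by Google; started the word-embedding era |\n",
"| GloVe | 2014 | Proposed at Stanford; based on a global co-occurrence matrix |\n",
"| FastText | 2016 | Open-sourced by Facebook; supports subwords |\n",
"| ELMo | 2018 | Context-aware, dynamic word vectors |\n",
"| BERT | 2018 | Transformer architecture; large pre-trained model |\n",
"| GPT series | 2018-present | Generative AI; the core of ChatGPT |"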
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"==================================================\n",
"Pre-trained word vector demo (using built-in example vectors)\n",
"==================================================\n",
"\n",
"Note: loading a real Gensim pre-trained model requires a download of about 66MB\n",
"This notebook uses built-in example vectors for the demo\n",
"\n",
"Example word vectors (each word is a 5-dimensional vector)\n",
"Dimension meanings: [animal, plant, technology, motion, abstract concept]\n",
"\n",
"  cat         : [0.90, 0.10, 0.20, 0.80, 0.30]\n",
"  dog         : [0.80, 0.20, 0.10, 0.90, 0.30]\n",
"  bird        : [0.70, 0.30, 0.10, 0.90, 0.20]\n",
"  fish        : [0.60, 0.20, 0.10, 0.80, 0.20]\n",
"  apple       : [0.10, 0.90, 0.30, 0.00, 0.20]\n",
"  rose        : [0.10, 0.80, 0.10, 0.00, 0.10]\n",
"  python      : [0.10, 0.00, 0.90, 0.00, 0.50]\n",
"  java        : [0.10, 0.00, 0.85, 0.00, 0.40]\n",
"  computer    : [0.10, 0.00, 0.90, 0.30, 0.40]\n",
"  love        : [0.30, 0.20, 0.10, 0.10, 0.90]\n",
"  hate        : [0.20, 0.10, 0.10, 0.10, 0.80]\n",
"\n",
"==================================================\n",
"1. Semantic similarity\n",
"==================================================\n",
"  cat        vs dog       : 0.987\n",
"  cat        vs apple     : 0.244\n",
"  python     vs java      : 0.998\n",
"  python     vs cat       : 0.322\n",
"  love       vs hate      : 0.993\n",
"\n",
"==================================================\n",
"2. Analogy reasoning (a hallmark of Word2Vec)\n",
"==================================================\n",
"Analogy question: man -> woman, king -> ?\n",
"\n",
"  King               = [0.6 0.1 0.3 0.3 0.6]\n",
"  Man                = [0.8 0.1 0.2 0.5 0.3]\n",
"  Woman              = [0.2 0.8 0.2 0.5 0.5]\n",
"  King - Man + Woman = [0.  0.8 0.3 0.3 0.8]\n",
"  Queen (reference)  = [0.2 0.9 0.3 0.3 0.6]\n",
"\n",
"  Similarity: 0.969\n",
"\n",
"Word embeddings can capture semantic relations!\n",
"\n",
"==================================================\n",
"How to load a real Gensim pre-trained model\n",
"==================================================\n",
"To load real pre-trained word vectors, run:\n",
"\n",
"  import gensim.downloader as api\n",
"  model = api.load('glove-wiki-gigaword-50')\n",
"\n",
"This downloads a pre-trained word-vector model of about 66MB\n"
]
}
],
"source": [
"# Hands-on: word-embedding demo with stand-in vectors (the real download is skipped)\n",
"import numpy as np\n",
"\n",
"def cosine_similarity(a, b):\n",
"    \"\"\"Cosine similarity of two vectors; 0.0 if either is all zeros.\"\"\"\n",
"    dot = np.dot(a, b)\n",
"    norm_a = np.linalg.norm(a)\n",
"    norm_b = np.linalg.norm(b)\n",
"    if norm_a == 0 or norm_b == 0:\n",
"        return 0.0\n",
"    return dot / (norm_a * norm_b)\n",
"\n",
"print(\"=\" * 50)\n",
"print(\"Pre-trained word vector demo (using built-in example vectors)\")\n",
"print(\"=\" * 50)\n",
"print()\n",
"print(\"Note: loading a real Gensim pre-trained model requires a download of about 66MB\")\n",
"print(\"This notebook uses built-in example vectors for the demo\")\n",
"print()\n",
"\n",
"# A small hand-made set of word vectors that imitates real embeddings\n",
"# Dimensions: [animal, plant, technology, motion, abstract concept]\n",
"word_vectors = {\n",
"    # Animals\n",
"    \"cat\": np.array([0.9, 0.1, 0.2, 0.8, 0.3]),\n",
"    \"dog\": np.array([0.8, 0.2, 0.1, 0.9, 0.3]),\n",
"    \"bird\": np.array([0.7, 0.3, 0.1, 0.9, 0.2]),\n",
"    \"fish\": np.array([0.6, 0.2, 0.1, 0.8, 0.2]),\n",
"    # Plants\n",
"    \"apple\": np.array([0.1, 0.9, 0.3, 0.0, 0.2]),\n",
"    \"rose\": np.array([0.1, 0.8, 0.1, 0.0, 0.1]),\n",
"    # Technology\n",
"    \"python\": np.array([0.1, 0.0, 0.9, 0.0, 0.5]),\n",
"    \"java\": np.array([0.1, 0.0, 0.85, 0.0, 0.4]),\n",
"    \"computer\": np.array([0.1, 0.0, 0.9, 0.3, 0.4]),\n",
"    # Abstract concepts\n",
"    \"love\": np.array([0.3, 0.2, 0.1, 0.1, 0.9]),\n",
"    \"hate\": np.array([0.2, 0.1, 0.1, 0.1, 0.8]),\n",
"}\n",
"\n",
"# Show the word vectors (two decimals, so 0.85 is not shown as 0.8)\n",
"print(\"Example word vectors (each word is a 5-dimensional vector)\")\n",
"print(\"Dimension meanings: [animal, plant, technology, motion, abstract concept]\")\n",
"print()\n",
"for word, vec in word_vectors.items():\n",
"    formatted = ', '.join(f'{x:.2f}' for x in vec)\n",
"    print(f\"  {word:12s}: [{formatted}]\")\n",
"\n",
"print()\n",
"print(\"=\" * 50)\n",
"print(\"1. Semantic similarity\")\n",
"print(\"=\" * 50)\n",
"pairs = [\n",
"    (\"cat\", \"dog\"),       # both animals\n",
"    (\"cat\", \"apple\"),     # animal vs plant\n",
"    (\"python\", \"java\"),   # both programming languages\n",
"    (\"python\", \"cat\"),    # programming language vs animal\n",
"    (\"love\", \"hate\"),     # emotion words (opposites, yet close in this space)\n",
"]\n",
"for w1, w2 in pairs:\n",
"    sim = cosine_similarity(word_vectors[w1], word_vectors[w2])\n",
"    print(f\"  {w1:10s} vs {w2:10s}: {sim:.3f}\")\n",
"\n",
"print()\n",
"print(\"=\" * 50)\n",
"print(\"2. Analogy reasoning (a hallmark of Word2Vec)\")\n",
"print(\"=\" * 50)\n",
"print(\"Analogy question: man -> woman, king -> ?\")\n",
"print()\n",
"\n",
"# Hand-made vectors for the analogy. Here the first two dimensions act as\n",
"# male/female indicators, so the dimension meanings listed above do not apply to them.\n",
"man = np.array([0.8, 0.1, 0.2, 0.5, 0.3])\n",
"woman = np.array([0.2, 0.8, 0.2, 0.5, 0.5])\n",
"king = np.array([0.6, 0.1, 0.3, 0.3, 0.6])\n",
"queen = np.array([0.2, 0.9, 0.3, 0.3, 0.6])\n",
"\n",
"# king - man + woman ≈ queen\n",
"result = king - man + woman\n",
"\n",
"print(f\"  King               = {king}\")\n",
"print(f\"  Man                = {man}\")\n",
"print(f\"  Woman              = {woman}\")\n",
"# Adding 0.0 turns a floating-point -0. into 0. for display\n",
"print(f\"  King - Man + Woman = {np.round(result, 2) + 0.0}\")\n",
"print(f\"  Queen (reference)  = {queen}\")\n",
"print()\n",
"print(f\"  Similarity: {cosine_similarity(result, queen):.3f}\")\n",
"print()\n",
"print(\"Word embeddings can capture semantic relations!\")\n",
"print()\n",
"\n",
"# How to load a Gensim model in a real environment (for reference only, not executed)\n",
"print(\"=\" * 50)\n",
"print(\"How to load a real Gensim pre-trained model\")\n",
"print(\"=\" * 50)\n",
"print(\"To load real pre-trained word vectors, run:\")\n",
"print()\n",
"print(\"  import gensim.downloader as api\")\n",
"print(\"  model = api.load('glove-wiki-gigaword-50')\")\n",
"print()\n",
"print(\"This downloads a pre-trained word-vector model of about 66MB\")\n"
]
},
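{
"cell_type": "markdown",
"metadata": {},
"source": [
"The demo above only compares `king - man + woman` with a known answer. In practice an analogy is usually answered by searching the vocabulary for the word nearest to that result, leaving out the three query words. The sketch below illustrates this with hand-made 3-dimensional vectors; the values and the helper name `analogy` are assumptions for illustration, not trained embeddings or a library API.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"def cosine(a, b):\n",
"    # Cosine similarity; 0.0 if either vector is all zeros\n",
"    dot = sum(x * y for x, y in zip(a, b))\n",
"    norm_a = math.sqrt(sum(x * x for x in a))\n",
"    norm_b = math.sqrt(sum(y * y for y in b))\n",
"    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0\n",
"\n",
"def analogy(vectors, a, b, c):\n",
"    # Answer 'a is to b as c is to ?' with the nearest neighbour of b - a + c\n",
"    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]\n",
"    candidates = [w for w in vectors if w not in (a, b, c)]\n",
"    return max(candidates, key=lambda w: cosine(vectors[w], target))\n",
"\n",
"# Hand-made toy vectors (hypothetical values)\n",
"vectors = {\n",
"    'man':   [0.8, 0.1, 0.2],\n",
"    'woman': [0.1, 0.8, 0.2],\n",
"    'king':  [0.8, 0.1, 0.9],\n",
"    'queen': [0.1, 0.8, 0.9],\n",
"    'apple': [0.1, 0.1, 0.1],\n",
"}\n",
"answer = analogy(vectors, 'man', 'woman', 'king')\n",
"print(answer)\n",
"```\n",
"\n",
"Excluding the query words matters: with real embeddings the nearest neighbour of the result vector is often one of the inputs themselves."
]
},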
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"# Part 9: The Complete Text Processing Pipeline\n",
"\n",
"## 9.1 Flowchart\n",
"\n",
"```\n",
"┌──────────────────────────────────────────────────────────────────┐\n",
"│ Text data                                                        │\n",
"│ \"今天天气真不错!\"                                               │\n",
"└─────────────────────────┬────────────────────────────────────────┘\n",
"                          │\n",
"                          ▼\n",
"┌──────────────────────────────────────────────────────────────────┐\n",
"│ 1. Text preprocessing                                            │\n",
"│    tokenize → remove stopwords → lowercase → strip punctuation   │\n",
"│    \"今天/天气/真/不错\" → \"今天/天气/不错\"                        │\n",
"└─────────────────────────┬────────────────────────────────────────┘\n",
"                          │\n",
"                          ▼\n",
"┌──────────────────────────────────────────────────────────────────┐\n",
"│ 2. Text vectorization                                            │\n",
"│    BoW           TF-IDF        Embedding     Pre-trained model   │\n",
"│    ↓             ↓             ↓             ↓                   │\n",
"│    [1,0,2,0,1]   [0.5,0,0.8]   [0.9,0.3]     [BERT vector]       │\n",
"└─────────────────────────┬────────────────────────────────────────┘\n",
"                          │\n",
"                          ▼\n",
"┌──────────────────────────────────────────────────────────────────┐\n",
"│ 3. Downstream tasks                                              │\n",
"│    Classification   Similarity       Clustering       Generation │\n",
"│    sentiment        text matching    topic grouping   chatbots   │\n",
"└──────────────────────────────────────────────────────────────────┘\n",
"```"
]
},
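{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 9.2 The Stages in Detail\n",
"\n",
"### Stage 1: Text preprocessing\n",
"\n",
"| Step | Input | Output | Purpose |\n",
"|------|------|------|------|\n",
"| Tokenization | \"今天天气真不错\" | [\"今天\", \"天气\", \"真\", \"不错\"] | Split text into words |\n",
"| Stopword removal | [\"今天\", \"天气\", \"真\", \"不错\"] | [\"今天\", \"天气\", \"不错\"] | Drop words that carry little meaning, such as \"的\", \"了\", \"在\" |\n",
"| Case normalization | [\"Python\", \"python\"] | [\"python\", \"python\"] | Normalize spelling variants |\n",
"| Punctuation removal | [\"语言!!!\"] | [\"语言\"] | Remove noise |"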
]
},
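{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cleaning steps in the table above can be sketched for tokens that have already been segmented. The stopword set, the punctuation set, and the function name `preprocess` are assumptions chosen for illustration; a real project would load a full stopword list.\n",
"\n",
"```python\n",
"import string\n",
"\n",
"# Hypothetical stopword list and punctuation set, for illustration only\n",
"STOPWORDS = {'的', '了', '在', '真'}\n",
"PUNCTUATION = string.punctuation + '!?。,、;:'\n",
"\n",
"def preprocess(tokens):\n",
"    cleaned = []\n",
"    for token in tokens:\n",
"        # Strip punctuation characters from both ends of the token\n",
"        token = token.strip(PUNCTUATION)\n",
"        # Lowercase so that 'Python' and 'python' count as one word\n",
"        token = token.lower()\n",
"        # Drop empty tokens and stopwords\n",
"        if token and token not in STOPWORDS:\n",
"            cleaned.append(token)\n",
"    return cleaned\n",
"\n",
"tokens = ['今天', '天气', '真', '不错', '!', 'Python', '语言!!!']\n",
"result = preprocess(tokens)\n",
"print(result)\n",
"```\n",
"\n",
"Punctuation is stripped before the stopword check here so that a token such as `'的。'` would still be recognized as a stopword; the order of the other steps does not affect the result for this input."
]
},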
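{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Stage 2: Text vectorization\n",
"\n",
"| Method | Good fit | Poor fit |\n",
"|------|---------|-----------|\n",
"| BoW | Baseline models, quick prototypes | Tasks that need semantic understanding |\n",
"| TF-IDF | Text classification, keyword extraction | Recognizing synonyms |\n",
"| Embedding | Semantic similarity, recommender systems | Tasks that need exact matching |\n",
"| Pre-trained models | General NLP tasks | Limited compute |"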
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Stage 3: Downstream tasks\n",
"\n",
"```python\n",
"# Classification:\n",
"#   \"这部电影太好看了!\" → sentiment classification → positive ✅\n",
"\n",
"# Similarity:\n",
"#   \"如何学习Python\" → find similar documents → \"Python入门教程\" ✅\n",
"\n",
"# Generation:\n",
"#   \"今天天气\" → GPT continuation → \"今天天气真好,适合出去玩\" ✅\n",
"```"
]
},
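{
"cell_type": "markdown",
"metadata": {},
"source": [
"The similarity task above can be sketched with plain word counts and cosine similarity. The tiny corpus is hypothetical and already tokenized and lowercased, and `most_similar` is an illustrative helper rather than a library function.\n",
"\n",
"```python\n",
"import math\n",
"from collections import Counter\n",
"\n",
"def cosine(counts_a, counts_b):\n",
"    # Cosine similarity between two word-count dictionaries\n",
"    dot = sum(counts_a[w] * counts_b[w] for w in counts_a if w in counts_b)\n",
"    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))\n",
"    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))\n",
"    if norm_a == 0 or norm_b == 0:\n",
"        return 0.0\n",
"    return dot / (norm_a * norm_b)\n",
"\n",
"def most_similar(query_tokens, corpus):\n",
"    # Return (index, score) of the corpus document closest to the query\n",
"    query = Counter(query_tokens)\n",
"    scores = [cosine(query, Counter(doc)) for doc in corpus]\n",
"    best = max(range(len(corpus)), key=lambda i: scores[i])\n",
"    return best, scores[best]\n",
"\n",
"# Hypothetical pre-tokenized, lowercased corpus\n",
"corpus = [\n",
"    ['今天', '天气', '不错'],\n",
"    ['python', '入门', '教程'],\n",
"    ['java', '入门', '教程'],\n",
"]\n",
"best, score = most_similar(['如何', '学习', 'python'], corpus)\n",
"print(best, round(score, 3))\n",
"```\n",
"\n",
"The match rests entirely on the shared token `python`; a query that used a synonym instead would score 0 against every document, which is the gap word embeddings are meant to close."
]
},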
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"# Part 10: Hands-on Chinese Word Segmentation with jieba\n",
"\n",
"## 10.1 Installing jieba\n",
"\n",
"```bash\n",
"pip install jieba\n",
"```"
]
},
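{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Install jieba into the interpreter this kernel is running\n",
"import subprocess\n",
"import sys\n",
"\n",
"# check=True raises an error if the installation fails\n",
"subprocess.run([sys.executable, '-m', 'pip', 'install', 'jieba', '-q'], check=True)\n",
"\n",
"print(\"jieba installed\")"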
]
},
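{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 10.2 Basic Segmentation\n",
"\n",
"jieba supports three segmentation modes:\n",
"\n",
"| Mode | Description | When to use |\n",
"|------|------|---------|\n",
"| Precise mode | Tries to cut the sentence into the most accurate segmentation | **Default, recommended**; suited to text analysis |\n",
"| Full mode | Scans out every string that could form a word; very fast, but cannot resolve ambiguity | When speed matters most |\n",
"| Search-engine mode | Starts from precise mode and splits long words again | Building search-engine indexes |"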
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import jieba\n",
"\n",
"print(\"=\" * 50)\n",
"print(\"jieba segmentation demo\")\n",
"print(\"=\" * 50)\n",
"\n",
"text = \"我喜欢深度学习和人工智能\"  # \"I like deep learning and artificial intelligence\"\n",
"\n",
"print(f\"Original text: {text}\")\n",
"print()\n",
"\n",
"# Precise mode (the default)\n",
"words_precise = list(jieba.cut(text, cut_all=False))\n",
"print(f\"Precise mode: {' / '.join(words_precise)}\")\n",
"\n",
"# Full mode\n",
"words_full = list(jieba.cut(text, cut_all=True))\n",
"print(f\"Full mode: {' / '.join(words_full)}\")\n",
"\n",
"# Search-engine mode\n",
"words_search = list(jieba.cut_for_search(text))\n",
"print(f\"Search-engine mode: {' / '.join(words_search)}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# More segmentation examples\n",
"import jieba\n",
"\n",
"print(\"=\" * 50)\n",
"print(\"More segmentation examples\")\n",
"print(\"=\" * 50)\n",
"\n",
"examples = [\n",
"    \"今天天气真不错\",\n",
"    \"人工智能是未来的发展方向\",\n",
"    \"Python是一门非常流行的编程语言\",\n",
"    \"小明毕业于清华大学计算机系\",\n",
"    \"我今天在京东买了一部iPhone手机\"\n",
"]\n",
"\n",
"for i, text in enumerate(examples, start=1):\n",
"    words = list(jieba.cut(text))\n",
"    print(f\"{i}. {text}\")\n",
"    print(f\"   → {' / '.join(words)}\")\n",
"    print()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 10.3 Part-of-Speech Tagging\n",
"\n",
"jieba can also tag each word with its part of speech (noun, verb, adjective, and so on).\n",
"\n",
"| POS tag | Meaning | Examples |\n",
"|----------|------|------|\n",
"| n | Noun | 人、山、电脑 |\n",
"| v | Verb | 跑、吃、学习 |\n",
"| a | Adjective | 漂亮、好吃、优秀 |\n",
"| d | Adverb | 很、非常、慢慢 |\n",
"| m | Numeral | 一、百、千 |\n",
"| q | Measure word | 个、本、件 |"
]
},
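{
"cell_type": "markdown",
"metadata": {},
"source": [
"One common use of these tags is to keep only content words before vectorizing. The sketch below works on plain `(word, tag)` tuples; the sample pairs are written by hand to stand in for tagger output and are not actual jieba results, and `keep_content_words` is an illustrative helper. Tags are matched by prefix because taggers often emit finer-grained tags that start with the same letter.\n",
"\n",
"```python\n",
"def keep_content_words(tagged, keep_prefixes=('n', 'v')):\n",
"    # Keep words whose POS tag starts with one of the given prefixes\n",
"    return [word for word, flag in tagged if flag.startswith(keep_prefixes)]\n",
"\n",
"# Hand-written (word, tag) pairs standing in for tagger output\n",
"tagged = [\n",
"    ('我', 'r'), ('喜欢', 'v'), ('深度', 'n'),\n",
"    ('学习', 'v'), ('和', 'c'), ('人工智能', 'n'),\n",
"]\n",
"content = keep_content_words(tagged)\n",
"print(content)\n",
"```\n",
"\n",
"Passing `keep_prefixes=('n',)` would keep nouns only, which is a simple way to approximate keyword candidates."
]
},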
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import jieba.posseg as pseg\n",
"\n",
"print(\"=\" * 50)\n",
"print(\"jieba part-of-speech tagging demo\")\n",
"print(\"=\" * 50)\n",
"\n",
"text = \"我喜欢深度学习和人工智能\"\n",
"\n",
"print(f\"Original text: {text}\")\n",
"print()\n",
"\n",
"# pseg.cut yields (word, tag) pairs\n",
"words = pseg.cut(text)\n",
"print(\"Segmentation + POS tags:\")\n",
"for word, flag in words:\n",
"    print(f\"  {word}: {flag}\")"
]
},
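{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 10.4 Stopword Handling\n",
"\n",
"Stopwords are common words that are filtered out during text processing, such as \"的\", \"了\", and \"在\".\n",
"\n",
"Because they can appear in almost any document, they do not help tell documents apart."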
]
},
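{
"cell_type": "markdown",
"metadata": {},
"source": [
"The observation that stopwords appear in almost every document suggests a simple way to find candidates automatically: count document frequency and flag the words that occur in most documents. This is a minimal sketch on a hypothetical pre-tokenized corpus; the function name `stopword_candidates` and the 0.8 threshold are assumptions, and the output should be reviewed by hand before use.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"def stopword_candidates(tokenized_docs, min_doc_ratio=0.8):\n",
"    # For each word, count the number of documents that contain it\n",
"    doc_freq = Counter()\n",
"    for doc in tokenized_docs:\n",
"        doc_freq.update(set(doc))\n",
"    n_docs = len(tokenized_docs)\n",
"    # Keep words present in at least min_doc_ratio of the documents\n",
"    return sorted(w for w, df in doc_freq.items() if df / n_docs >= min_doc_ratio)\n",
"\n",
"# Hypothetical pre-tokenized documents\n",
"docs = [\n",
"    ['我', '的', '猫', '在', '睡觉'],\n",
"    ['他', '的', '狗', '在', '跑'],\n",
"    ['她', '的', '书', '在', '桌上'],\n",
"]\n",
"candidates = stopword_candidates(docs)\n",
"print(candidates)\n",
"```\n",
"\n",
"This is the same intuition that IDF encodes: a word found in every document gets the lowest weight, and a stopword list simply removes it outright."
]
},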
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import jieba\n",
"\n",
"print(\"=\" * 50)\n",
"print(\"Stopword handling demo\")\n",
"print(\"=\" * 50)\n",
"\n",
"# A short list of common stopwords\n",
"stopwords = set(['的', '了', '在', '是', '我', '有', '和', '就', '不', '人', '都', '一', '一个', '上', '也', '很', '到', '说', '要', '去', '你', '会', '着', '没有', '看', '好', '自己', '这'])\n",
"\n",
"text = \"人工智能是未来的发展方向,也是当前科技领域的热门话题\"\n",
"\n",
"print(f\"Original text: {text}\")\n",
"print()\n",
"\n",
"# Without stopword filtering\n",
"words_all = list(jieba.cut(text))\n",
"print(f\"Without stopword filtering: {' / '.join(words_all)}\")\n",
"\n",
"# With stopword filtering\n",
"words_filtered = [w for w in words_all if w not in stopwords]\n",
"print(f\"With stopword filtering: {' / '.join(words_filtered)}\")\n",
"print()\n",
"\n",
"# More complete stopword lists can be downloaded online\n",
"print(\"Tip: in real projects, stopword lists are available from sources such as:\")\n",
"print(\"  - the HIT (哈工大) stopword list\")\n",
"print(\"  - the Baidu (百度) stopword list\")\n",
"print(\"  - the Sichuan University Machine Intelligence Lab (四川大学机器智能实验室) stopword list\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hands-on: a complete text preprocessing pipeline\n",
"import jieba\n",
"\n",
"print(\"=\" * 50)\n",
"print(\"Complete text preprocessing pipeline\")\n",
"print(\"=\" * 50)\n",
"\n",
"# Example document collection\n",
"docs = [\n",
"    \"今天天气真不错!适合出去玩。\",\n",
"    \"Python是一门很棒的编程语言。\",\n",
"    \"人工智能和机器学习是未来的发展方向。\",\n",
"    \"今天在咖啡馆喝了一杯很好喝的拿铁。\"\n",
"]\n",
"\n",
"# Stopwords, plus the punctuation marks to drop\n",
"stopwords = set(['的', '了', '在', '是', '我', '有', '和', '就', '不', '人', '都', '一', '一个', '上', '也', '很', '到', '说', '要', '去', '你', '会', '着', '没有', '看', '好', '自己', '这', '!', '。', ',', ','])\n",
"\n",
"def preprocess_text(text):\n",
"    \"\"\"Tokenize one document and filter out stopwords, punctuation, and whitespace.\"\"\"\n",
"    # 1. Tokenize\n",
"    words = jieba.cut(text)\n",
"\n",
"    # 2. Drop stopwords and the punctuation listed in the stopword set\n",
"    words = [w for w in words if w not in stopwords]\n",
"\n",
"    # 3. Drop empty and whitespace-only tokens\n",
"    words = [w for w in words if w.strip()]\n",
"\n",
"    return words\n",
"\n",
"print(\"Preprocessing results:\")\n",
"for i, doc in enumerate(docs):\n",
"    words = preprocess_text(doc)\n",
"    print(f\"\\nDoc{i+1}: {doc}\")\n",
"    print(f\"  → {' / '.join(words)}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hands-on: jieba segmentation + TF-IDF, end to end\n",
"import jieba\n",
"import math\n",
"\n",
"print(\"=\" * 50)\n",
"print(\"Hands-on: jieba segmentation + TF-IDF, end to end\")\n",
"print(\"=\" * 50)\n",
"\n",
"def simple_tfidf_tokenized(docs, stopwords=None):\n",
"    \"\"\"\n",
"    TF-IDF combined with jieba tokenization.\n",
"\n",
"    Args:\n",
"        docs: list of raw document strings\n",
"        stopwords: optional set of words to drop\n",
"    Returns:\n",
"        vocab, tfidf_matrix, tokenized_docs\n",
"    \"\"\"\n",
"    stopwords = stopwords or set()\n",
"\n",
"    # 1. Tokenize; drop stopwords and single-character tokens\n",
"    tokenized = [\n",
"        [w for w in jieba.cut(doc) if w not in stopwords and len(w) > 1]\n",
"        for doc in docs\n",
"    ]\n",
"\n",
"    # 2. Build the vocabulary and a word -> column lookup\n",
"    vocab = sorted({w for doc in tokenized for w in doc})\n",
"    index = {w: j for j, w in enumerate(vocab)}\n",
"\n",
"    # 3. Term-frequency matrix: one row of raw counts per document\n",
"    tf_matrix = []\n",
"    for doc in tokenized:\n",
"        vec = [0] * len(vocab)\n",
"        for w in doc:\n",
"            vec[index[w]] += 1\n",
"        tf_matrix.append(vec)\n",
"\n",
"    # 4. Smoothed IDF: log(N / (df + 1)) + 1, where df = documents containing the word\n",
"    n_docs = len(tokenized)\n",
"    idf = []\n",
"    for j in range(len(vocab)):\n",
"        df = sum(1 for vec in tf_matrix if vec[j] > 0)\n",
"        idf.append(math.log(n_docs / (df + 1)) + 1)\n",
"\n",
"    # 5. TF-IDF = TF * IDF\n",
"    tfidf = [[tf * idf[j] for j, tf in enumerate(vec)] for vec in tf_matrix]\n",
"\n",
"    return vocab, tfidf, tokenized\n",
"\n",
"# Example document collection\n",
"docs = [\n",
"    \"Python是一门很棒的编程语言\",\n",
"    \"人工智能是未来的发展方向\",\n",
"    \"深度学习是机器学习的一个分支\",\n",
"    \"Python和Java都是很流行的编程语言\"\n",
"]\n",
"\n",
"# Stopwords\n",
"stopwords = set([\"的\", \"是\", \"一个\", \"很\", \"和\", \"在\", \"了\"])\n",
"\n",
"vocab, tfidf_matrix, tokenized = simple_tfidf_tokenized(docs, stopwords)\n",
"\n",
"print(\"Documents:\")\n",
"for i, doc in enumerate(docs):\n",
"    print(f\"  Doc{i+1}: {doc}\")\n",
"print()\n",
"\n",
"print(\"Segmentation results:\")\n",
"for i, words in enumerate(tokenized):\n",
"    print(f\"  Doc{i+1}: {' / '.join(words)}\")\n",
"print()\n",
"\n",
"print(f\"Vocabulary ({len(vocab)} words):\")\n",
"print(f\"  {vocab}\")\n",
"print()\n",
"\n",
"print(\"TF-IDF matrix:\")\n",
"for i, vec in enumerate(tfidf_matrix):\n",
"    # Show non-zero entries only\n",
"    nonzero = [(vocab[j], round(vec[j], 4)) for j in range(len(vec)) if vec[j] > 0]\n",
"    print(f\"  Doc{i+1}: {nonzero}\")\n",
"\n",
"print()\n",
"\n",
"# Find the most important word in each document\n",
"print(\"Most important word per document (highest TF-IDF):\")\n",
"for i, vec in enumerate(tfidf_matrix):\n",
"    max_idx = max(range(len(vec)), key=lambda j: vec[j])\n",
"    max_score = vec[max_idx]\n",
"    if max_score > 0:\n",
"        print(f\"  Doc{i+1}: '{vocab[max_idx]}' (TF-IDF={max_score:.4f})\")"
]
},
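{
"cell_type": "markdown",
"metadata": {
"jp-MarkdownHeadingCollapsed": true
},
"source": [
"---\n",
"\n",
"# 📋 Summary\n",
"\n",
"## Core Concepts of This Chapter\n",
"\n",
"```\n",
"Text data processing\n",
"  │\n",
"  ├── Core problem: text (symbols) → vectors (numbers)\n",
"  │\n",
"  ├── Vectorization methods\n",
"  │     ├── BoW (bag of words)\n",
"  │     │     └── Core idea: count word frequencies, ignore order\n",
"  │     │\n",
"  │     ├── TF-IDF (term frequency-inverse document frequency)\n",
"  │     │     └── Core idea: frequency within a document × rarity across documents\n",
"  │     │\n",
"  │     └── Word Embedding\n",
"  │           └── Core idea: represent words as points in a semantic space\n",
"  │\n",
"  └── Processing pipeline\n",
"        ├── Text preprocessing (tokenization, stopword removal)\n",
"        ├── Vectorization\n",
"        └── Downstream tasks (classification, similarity, generation)\n",
"```"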
]
},
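{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Formula Quick Reference\n",
"\n",
"| Operation | Formula | Meaning |\n",
"|------|------|------|\n",
"| Vector addition | [1,2] + [3,4] = [4,6] | Add element by element |\n",
"| Scalar multiplication | 2 × [1,2] = [2,4] | Multiply every element by the scalar |\n",
"| Dot product | [1,2] · [3,4] = 1×3 + 2×4 = 11 | Multiply element by element, then sum |\n",
"| Vector length | ‖[3,4]‖ = √(3² + 4²) = 5 | Pythagorean theorem |\n",
"| Cosine similarity | cos(θ) = (A·B) / (‖A‖ × ‖B‖) | How closely two vectors point the same way |\n",
"| TF-IDF | TF × IDF, with IDF = log(N / (df + 1)) + 1 in this notebook | Term frequency × inverse document frequency |\n",
"\n",
"---\n",
"\n",
"> **Remember: the core goal of text vectorization is to turn \"symbols\" into \"numeric vectors that can be computed with\".**"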
]
}
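,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The vector rows of the quick-reference table can be checked numerically. This is a minimal sketch using only the standard library and the same example vectors as the table; the variable names are illustrative.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"a, b = [1, 2], [3, 4]\n",
"\n",
"added = [x + y for x, y in zip(a, b)]       # vector addition\n",
"scaled = [2 * x for x in a]                 # scalar multiplication\n",
"dot = sum(x * y for x, y in zip(a, b))      # dot product\n",
"length_b = math.sqrt(sum(x * x for x in b)) # vector length of b\n",
"length_a = math.sqrt(sum(x * x for x in a)) # vector length of a\n",
"cos_sim = dot / (length_a * length_b)       # cosine similarity of a and b\n",
"\n",
"print(added, scaled, dot, length_b, round(cos_sim, 4))\n",
"```\n",
"\n",
"The cosine value is close to 1 because `[1, 2]` and `[3, 4]` point in nearly the same direction even though their lengths differ."
]
}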
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}