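diff --git a/260423-2509165039.py b/260423-2509165039.py new file mode 100644 index 0000000..a241e34 --- /dev/null +++ b/260423-2509165039.py @@ -0,0 +1,15 @@ +# Question 5 + +# Vocabulary: ["Python","是","编程","语言","Java"] (是 = "is", 编程 = "programming", 语言 = "language") +# Doc1 vector: [1,1,1,1,0] +# Doc2 vector: [0,1,1,1,1] +# Doc3 vector: [3,0,0,0,0] + + + +# Question 6 + +# Drawback 1: It ignores word order and the relationships between words, so it cannot distinguish texts with opposite meanings (e.g. "cat eats fish" vs. "fish eats cat"). +# Drawback 2: High-dimensional sparsity and the semantic gap: computational cost grows, and synonyms and polysemous words cannot be handled (e.g. "programming" vs. "coding"). + +
diff --git a/3-2-1_文本数据处理导论_课堂演示.ipynb b/3-2-1_文本数据处理导论_课堂演示.ipynb new file mode 100644 index 0000000..4dcedd7 --- /dev/null +++ b/3-2-1_文本数据处理导论_课堂演示.ipynb @@ -0,0 +1,3356 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 3-2-1 Introduction to Text Data Processing\n", + "## Classroom Demo Notebook\n", + "\n", + "---\n", + "\n", + "## Table of Contents\n", + "\n", + "1. [What Is Text Data?](#part-1-what-is-text-data)\n", + "2. [How Do Computers Read Text?](#part-2-how-do-computers-read-text)\n", + "3. [Vector Basics](#part-3-vector-basics)\n", + "4. [Cosine Similarity](#第四部分-余弦相似度)\n", + "5. [The Core Idea of Text Vectorization](#第五部分-文本向量化的核心思想)\n", + "6. [The Bag-of-Words (BoW) Model](#第六部分-bow词袋模型)\n", + "7. [TF-IDF (Term Frequency-Inverse Document Frequency)](#第七部分-tf-idf)\n", + "8. [Word Embedding](#第八部分-word-embedding词嵌入)\n", + "9. [The Complete Text Processing Pipeline](#第九部分-文本处理完整流程)\n", + "10. [Hands-On: Chinese Word Segmentation with jieba](#第十部分-实战用jieba进行中文分词)\n", + "\n", + "---\n", + "\n", + "**Note**: Running this notebook requires the following dependencies:\n", + "```bash\n", + "pip install numpy matplotlib jieba\n", + "```\n", + "- The BoW and TF-IDF code is implemented in pure Python + NumPy and does not depend on sklearn.\n", + "- If the server has no Chinese fonts installed, Chinese text in the charts may render as empty boxes; this is expected.\n" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Part 1: What Is Text Data?\n", + "\n", + "## 1.1 Definition of Text Data\n", + "\n", + "**Text data** is sequential information made up of characters and symbols; it is how human language is represented inside a computer.\n", + "\n", + "### Everyday Examples of Text Data\n", + "\n", + "| Type | Example |\n", + "|------|------|\n", + "| A sentence | \"The weather is really nice today\" |\n", + "| An article | A news report |\n", + "| A review | \"The food at this restaurant is delicious!\" |\n", + "| A conversation | \"Hello, how much is this book?\" |\n", + "| A poem | \"床前明月光,疑是地上霜\" (\"Moonlight before my bed; I took it for frost on the ground\") |\n", + "| A piece of code | `print('Hello World')` |\n", + "| An email | Includes the body, recipient, sender, etc. |\n", + "| Chat logs | WeChat conversations, text messages |\n", + "\n", + "**In short: any information made up of written characters is text data!**" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1.2 Characteristics of Text Data\n", + "\n", + "Text data differs markedly from data such as images and audio:\n", + "\n", + "| Characteristic | Description | Example |\n", + "|------|------|------|\n", + "| **Discrete symbols** | Made up of discrete characters/words rather than continuous values | \"hello\" consists of the 5 characters h, e, l, l, o |\n", + "| **Sequential** | Symbols are arranged in a specific order; changing the order changes the meaning | \"I love you\" ≠ \"You love me\" |\n", + "| **Semantically rich** | The same word can mean different things in different settings | \"apple\" can be a fruit or a phone brand |\n", + "| **Context-dependent** | The meaning of a word depends on its context | In \"He hit the cat, and the cat ran away\", context tells us both occurrences of \"cat\" refer to the same animal |\n", + "| **Ambiguous** | The same words can be understood in more than one way | \"Nice weather\" can be sincere praise or sarcasm |\n", + "\n", + "### Think About It: How Important Is Order?\n", + "\n", + "```\n", + "Text 1: \"我吃了饭\" (I ate the meal)\n", + "Text 2: \"饭了我吃\" (scrambled; not a grammatical sentence)\n", + "Text 3: \"饭吃了我\" (The meal ate me)\n", + "\n", + "These three texts are made up of exactly the same characters, but the order differs, and so does the meaning!\n", + "This shows that the order of a text carries important semantic information.\n", + "```" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "# Part 2: How Do Computers \"Read\" Text?\n", + "\n", + "## 2.1 Comparison: How Image Data vs. Text Data Is Stored\n", + "\n", + "### Reading Image Data\n", + "\n", + "```\n", + "Image file (.jpg/.png)\n", + " ↓\n", + "The computer reads pixel values (each channel of each pixel is a number from 0 to 255)\n", + " ↓\n", + "Stored as a 3-D array [height, width, channels (RGB)]\n", + " ↓\n", + "One 1920×1080 color image = 1920 × 1080 × 3 = 6,220,800 numbers\n", + "```\n", + "\n", + "**The essence of an image: a dense numeric matrix that the computer can process directly!**\n", + "\n", + "### Reading Text Data\n", + "\n", + "```\n", +
"Text file (.txt/.md/.py)\n", + " ↓\n", + "The computer reads the character encoding (ASCII/UTF-8/GBK)\n", + " ↓\n", + "Stored as a sequence of characters (each character is a numeric code)\n", + " ↓\n", + "\"Python\" → [80, 121, 116, 104, 111, 110] (ASCII codes)\n", + "```\n", + "\n", + "**The essence of text: a sequence of symbols that the computer cannot understand without extra processing!**" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2.2 Character Encoding: Representing Characters with Numbers\n", + "\n", + "### ASCII (English Letters and Some Symbols)\n", + "\n", + "ASCII uses the numbers 0-127 to represent 128 characters:\n", + "\n", + "| Character | ASCII code | Description |\n", + "|------|---------|------|\n", + "| 'A' | 65 | Uppercase letter |\n", + "| 'B' | 66 | Uppercase letter |\n", + "| ... | ... | ... |\n", + "| 'Z' | 90 | Uppercase letter |\n", + "| 'a' | 97 | Lowercase letter |\n", + "| 'b' | 98 | Lowercase letter |\n", + "| ... | ... | ... |\n", + "| 'z' | 122 | Lowercase letter |\n", + "| '0' | 48 | Digit |\n", + "| '1' | 49 | Digit |\n", + "| ... | ... | ... |\n", + "| '9' | 57 | Digit |\n", + "\n", + "### UTF-8 (Supports All the World's Languages, Including Chinese)\n", + "\n", + "UTF-8 is a variable-length encoding of Unicode code points; common Chinese characters take 3 bytes, and rarer ones take 4:\n", + "\n", + "| Character | Unicode code point | Bytes in UTF-8 |\n", + "|------|-------------|--------|\n", + "| '中' | 20013 | 3 bytes |\n", + "| '文' | 25991 | 3 bytes |\n", + "| 'P' | 80 | 1 byte |\n", + "| 'y' | 121 | 1 byte |\n", + "| '👍' | 128077 | 4 bytes (emoji) |" + ] + },
+ { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "==================================================\n", + "Character codes of English text\n", + "==================================================\n", + "Text: Hello\n", + "ASCII code of each character: [72, 101, 108, 108, 111]\n", + "\n", + " 'H' -> 72\n", + " 'e' -> 101\n", + " 'l' -> 108\n", + " 'l' -> 108\n", + " 'o' -> 111\n", + "\n", + "==================================================\n", + "Character codes of Chinese text\n", + "==================================================\n", + "Text: 你好\n", + "Unicode code point of each character: [20320, 22909]\n", + "\n", + " '你' -> 20320\n", + " '好' -> 22909\n" + ] + } + ], + "source": [ + "# Demo: inspect the numeric codes of characters\n", + "\n", + "# English example\n", + "text_en = \"Hello\"\n", + "print(\"=\" * 50)\n", + "print(\"Character codes of English text\")\n", + "print(\"=\" * 50)\n", + "print(f\"Text: {text_en}\")\n",
+ "print(f\"ASCII code of each character: {[ord(c) for c in text_en]}\")\n", + "print()\n", + "\n", + "# Show each character individually\n", + "for c in text_en:\n", + " print(f\" '{c}' -> {ord(c)}\")\n", + "\n", + "print()\n", + "print(\"=\" * 50)\n", + "print(\"Character codes of Chinese text\")\n", + "print(\"=\" * 50)\n", + "\n", + "# Chinese example (ord() returns the Unicode code point, not the UTF-8 bytes)\n", + "text_cn = \"你好\"\n", + "print(f\"Text: {text_cn}\")\n", + "print(f\"Unicode code point of each character: {[ord(c) for c in text_cn]}\")\n", + "print()\n", + "\n", + "# Show each character individually\n", + "for c in text_cn:\n", + " print(f\" '{c}' -> {ord(c)}\")" + ] + },
+ { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Verification: converting numeric codes back to characters\n", + "\n", + "chr(65) = 'A' # should be the uppercase letter A\n", + "chr(97) = 'a' # should be the lowercase letter a\n", + "chr(20013) = '中' # should be the Chinese character '中'\n", + "chr(25991) = '文' # should be the Chinese character '文'\n" + ] + } + ], + "source": [ + "# Reverse check with chr(): convert numeric codes back to characters\n", + "print(\"Verification: converting numeric codes back to characters\")\n", + "print()\n", + "\n", + "# 65 is the uppercase letter A\n", + "print(f\"chr(65) = '{chr(65)}' # should be the uppercase letter A\")\n", + "\n", + "# 97 is the lowercase letter a\n", + "print(f\"chr(97) = '{chr(97)}' # should be the lowercase letter a\")\n", + "\n", + "# 20013 is the Chinese character \"中\"\n", + "print(f\"chr(20013) = '{chr(20013)}' # should be the Chinese character '中'\")\n", + "\n", + "# 25991 is the Chinese character \"文\"\n", + "print(f\"chr(25991) = '{chr(25991)}' # should be the Chinese character '文'\")" + ] + },
+ { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "==================================================\n", + "Answer to Exercise 1\n", + "==================================================\n", + "1. ASCII codes of 'Hello':\n", + "[72, 101, 108, 108, 111]\n", + "\n", + "2. Verify chr(65):\n", + "chr(65) = 'A'\n", + "\n", + "Verify the ASCII range of A-Z (65-90):\n", + "['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']\n" + ] + } + ], + "source": [ + "# Answer to Exercise 1: verifying character codes\n", + "print(\"=\" * 50)\n", + "print(\"Answer to Exercise 1\")\n", + "print(\"=\" * 50)\n", + "\n", + "# 1. Use ord() to print the ASCII code of each character in \"Hello\"\n", + "print(\"1. ASCII codes of 'Hello':\")\n", + "print([ord(c) for c in \"Hello\"])\n", + "\n", + "# 2. Verify that code 65 corresponds to the uppercase letter A\n", + "print()\n", + "print(\"2. Verify chr(65):\")\n", + "print(f\"chr(65) = '{chr(65)}'\")\n", + "\n", + "# Verify the range\n", + "print()\n", + "print(\"Verify the ASCII range of A-Z (65-90):\")\n", + "print([chr(i) for i in range(65, 91)])" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2.3 What Are Computers Good At, and What Are They Bad At?\n", + "\n", + "### Tasks Computers Are Good At ✅\n", + "\n", + "| Task type | Example | Notes |\n", + "|----------|------|------|\n", + "| Numeric computation | 1 + 2 = 3 | Arithmetic, solving equations |\n", + "| Logical decisions | if a > b then ... | Conditional branching, Boolean operations |\n", + "| Matrix operations | Image convolution, matrix multiplication | The core of deep learning |\n", + "| Exact matching | Checking whether two strings are identical | Database queries |\n", + "| Pattern matching | Finding data that fits a rule | Regular expressions |\n", + "| Storage and retrieval | Fast access to massive amounts of data | Search engines |\n", + "\n", + "### Tasks Computers Are Bad At ❌\n", + "\n", + "| Task type | Example | Why it is hard |\n", + "|----------|------|-------------|\n", + "| Semantic understanding | Is \"The weather is really nice today\" positive or negative? | Requires common sense and context |\n", + "| Sentiment judgment | Is \"真是绝了\" (roughly \"that's really something\") praise or an insult? | Ambiguity, sarcasm |\n", + "| Fuzzy reasoning | \"probably\", \"maybe\" | Cannot be handled precisely |\n", + "| Creative writing | Writing poems or novels | Requires imagination |\n", + "| Commonsense understanding | \"Water flows downhill\" | Lacks physical common sense |\n", + "| Polysemy | What does \"apple\" refer to? | Requires world knowledge |" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Why Are Computers Bad at Understanding Text?\n", + "\n", + "**Reason 1: Text consists of \"symbols\", not \"numeric values\"**\n", + "\n", + "```\n", + "A computer's brain = a calculator (built to process numbers)\n", + "Text = a pile of symbols (gibberish as far as the computer is concerned)\n", + "\n", + "Numbers: 1, 2, 3, 100.5, -7 → the computer can compute with them directly\n", + "Text: \"good\", \"bad\", \"hello\" → the computer has no idea what they mean\n", + "```\n", + "\n", + "**Reason 2: Meaning is not expressed explicitly**\n", + "\n", + "```python\n", + "# A text as a human reads it:\n", + "text = \"He is in a bad mood today because it rained\"\n", + "\n", + "# What a human understands:\n", + "# - \"in a bad mood\" = unhappy\n", + "# - \"because it rained\" = the cause is the rain\n", + "# - Causal relation: rain → bad mood\n", + "\n", + "# All the computer sees:\n", + "print(text)\n", + "# Computer: ??? It does not grasp the causal link between the rain and the mood\n", + "```\n", + "\n", + "**Reason 3: The same symbol means different things in different contexts**\n", + "\n", + "```\n", + "Context 1: \"This apple is delicious\" → refers to the fruit (an apple you eat)\n", + "\n", + "Context 2: \"Apple phones are really expensive\" → refers to the phone brand (Apple)\n", + "\n", + "Context 3: \"Newton was hit by an apple\" → refers to the fruit (which inspired the idea of universal gravitation)\n", + "\n", + "How would the computer know? It needs the ability to understand context!\n", + "```" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Key Takeaway: Why Do We Need Text Vectorization?\n", + "\n", + "```\n", + "┌─────────────────────────────────────────────────────────────┐\n", + "│ The core tension │\n", + "│ │\n", + "│ Text (symbol sequences) ←→ What computers are good at (numeric computation) │\n", + "│ ↓ │\n", + "│ A bridge is needed │\n", + "│ That bridge is │\n", + "│ [Text vectorization] │\n", + "│ │\n", + "│ Text → numeric vector → the computer can compute → processed by an AI model │\n", + "└─────────────────────────────────────────────────────────────┘\n", + "```" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "# Part 3: Vector Basics\n", + "\n", + "## 3.1 What Is a Vector?\n", + "\n", + "**Vector = a quantity with direction**: the basic mathematical tool for describing \"magnitude + direction\".\n", + "\n", + "### Everyday Examples of Vectors\n", + "\n", + "| Example | Magnitude | Direction | Notes |\n", + "|------|------|------|------|\n", + "| Velocity | 60 km/h | Northward | Velocity is a vector |\n", + "| Force | 10 N | Pushing to the right | Force is a vector |\n", + "| Wind | 5 m/s | Southeasterly | Wind velocity is a vector |\n", + "| Displacement | 100 km | Beijing → Shanghai | Displacement is a vector |\n", + "\n", + "### Representing Vectors in Mathematics\n", + "\n", + "**One-dimensional vectors (points on a number line)**:\n", + "\n", + "```\n", + " ←———————————|———————————→\n", + " -3 -2 -1 0 1 2 3\n", + "\n", + " Point A is at position 2 → vector A = [2] (just 1 number)\n", + " Point B is at position -3 → vector B = [-3] (a negative number means the opposite direction)\n", + "```" + ]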
}, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "==================================================\n", + "NumPy vector creation demo\n", + "==================================================\n", + "1-D vector v1 = [3]\n", + "v1 has 1 element(s)\n", + "\n", + "2-D vector v2 = [2 3]\n", + "v2 has 2 element(s)\n", + "\n", + "3-D vector v3 = [1 2 3]\n", + "v3 has 3 element(s)\n", + "\n", + "10-D vector v10 = [ 0.1 0.5 -0.3 0.8 0.2 -0.1 0.7 0.3 -0.2 0.6]\n", + "v10 has 10 element(s)\n" + ] + } + ], + "source": [ + "# Creating vectors with NumPy in Python\n", + "import numpy as np\n", + "\n", + "print(\"=\" * 50)\n", + "print(\"NumPy vector creation demo\")\n", + "print(\"=\" * 50)\n", + "\n", + "# 1-D vector (just 1 number)\n", + "v1 = np.array([3])\n", + "print(f\"1-D vector v1 = {v1}\")\n", + "print(f\"v1 has {len(v1)} element(s)\")\n", + "\n", + "# 2-D vector (2 numbers; a point in the plane)\n", + "v2 = np.array([2, 3])\n", + "print(f\"\\n2-D vector v2 = {v2}\")\n", + "print(f\"v2 has {len(v2)} element(s)\")\n", + "\n", + "# 3-D vector (3 numbers; a point in 3-D space)\n", + "v3 = np.array([1, 2, 3])\n", + "print(f\"\\n3-D vector v3 = {v3}\")\n", + "print(f\"v3 has {len(v3)} element(s)\")\n", + "\n", + "# High-dimensional vector (common in machine learning: tens to thousands of dimensions)\n", + "v10 = np.array([0.1, 0.5, -0.3, 0.8, 0.2, -0.1, 0.7, 0.3, -0.2, 0.6])\n", + "print(f\"\\n10-D vector v10 = {v10}\")\n", + "print(f\"v10 has {len(v10)} element(s)\")" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Geometric Intuition for 2-D Vectors\n", + "\n", + "```\n", + " y (vertical axis)\n", + " ↑\n", + " |\n", + " 3 | * A(2,3)\n", + " |\n", + " 2 |\n", + " |\n", + " 1 | * B(4,1)\n", + " |\n", + " 0---+—————————————→ x (horizontal axis)\n", + " 0 1 2 3 4 5\n", + "\n", + " Vector A = [2, 3] (x = 2, y = 3)\n", + " Vector B = [4, 1]\n", + "\n", + " The arrow from the origin (0,0) to the point (2,3) is the graphical representation of vector A\n", + "```" + ] + },
+ { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "findfont: Generic family 'sans-serif' not found because none of the following families were found: SimHei, Noto Sans CJK SC, WenQuanYi
Micro Hei\n", + "findfont: Generic family 'sans-serif' not found because none of the following families
were found: SimHei, Noto Sans CJK SC, WenQuanYi Micro Hei\n", + "findfont: Generic family 'sans-serif' not found because none of the following families were found: SimHei, Noto Sans CJK SC, WenQuanYi Micro Hei\n", + "findfont: Generic family 'sans-serif' not found because none of the following families were found: SimHei, Noto Sans CJK SC, WenQuanYi Micro Hei\n", + "findfont: Generic family 'sans-serif' not found because none of the following families were found: SimHei, Noto Sans CJK SC, WenQuanYi Micro Hei\n", + "findfont: Generic family 'sans-serif' not found because none of the following families were found: SimHei, Noto Sans CJK SC, WenQuanYi Micro Hei\n", + "findfont: Generic family 'sans-serif' not found because none of the following families were found: SimHei, Noto Sans CJK SC, WenQuanYi Micro Hei\n" + ] + }, + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAArwAAAIrCAYAAAAN2Uq4AAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjgsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvwVt1zgAAAAlwSFlzAAAPYQAAD2EBqD+naQAAaLhJREFUeJzt3Xd8FHX+x/H3JpACIdQk9Ca9hA4GkIAgHGCJBZHfCbHrCR6IFRvFErtwyoGoEAsIB0pAQRBB4BAsgJHiiYAIqCShJhBMgMz8/lizsKRvNtnJ5PV8PPYhM/udmc/kG/CdyWdnHKZpmgIAAABsys/XBQAAAAAlicALAAAAWyPwAgAAwNYIvAAAALA1Ai8AAABsjcALAAAAWyPwAgAAwNYIvAAAALA1Ai8AAABsjcALAOWUw+FQ3759fV1GrtauXSuHw6FJkya5re/bt68cDodvirpIfHy8HA6H4uPjfV0KgAIQeAEU2u+//66pU6dq4MCBatiwoQICAlS7dm1df/31+uabb3LdJjugZL8qVqyomjVrqmPHjrr99tu1YsUKGYZRqOPv2rVLDodDrVq1KnDs448/LofDoeeee65I51gUv/76qxwOh2655ZYSO0ZBHnvsMTkcDsXFxeU7zjAMNWzYUP7+/jp48GApVVe2WWF+AXhHBV8XAKDseP311/XCCy/okksu0cCBAxUWFqbdu3crISFBCQkJmjdvnoYPH57rtg888IBCQkJkGIZOnDih//3vf5o7d65mz56tnj176sMPP1TDhg3zPX7Lli3Vu3dvbdiwQV999ZV69eqV6zjDMPTee+/J39/f9mHltttuU1xcnObMmaMJEybkOW7VqlU6ePCg/va3v6lBgwaSpP/973+qVKlSaZXqFe+9955Onz7t6zIkSddee60uvfRS1alTx9elACgAgRdAoXXv3l1r165VdHS02/r//ve/6t+/v/7xj38oJiZGgYGBObZ98MEHVbt2bbd1R44c0T//+U99+OGHGjRokDZv3qzKlSvnW8Ptt9+uDRs2aPbs2XkG3pUrV+q3337T0KFDVbdu3SKeZdnSrFkzRUdHa926dfrvf/+ryy67LNdxs2fPluT8+mUrzJVyqynoh6LSVLVqVVWtWt
XXZQAoBFoaABTaddddlyPsStJll12mfv366fjx49q+fXuh91erVi198MEHuvzyy/XTTz9p+vTpBW4zbNgwValSRf/5z3+Unp6e65jcwl1KSoruv/9+NWvWTIGBgapVq5auv/567dixI9d9pKSk6IEHHlDLli0VHBysGjVqqEePHnr55ZclOfs3mzRpIkl699133do21q5d69pPenq6Jk6cqFatWikoKEg1atTQ0KFD9dVXX+U45qRJk1zbx8fHq3PnzqpUqVKBfbbZ55l93hc7duyYlixZolq1aunqq692rc+thzc1NVVPPfWU2rRpo5CQEIWGhqpZs2aKjY3V/v37XeNuueUWORwO/frrr/meR7YzZ87o9ddf16BBg9SgQQMFBgYqPDxc1113nb7//vt8z+9CufXwXvi1z+11YY/t4sWLNWLECDVr1kyVKlVS1apVddlll+mjjz5y22dh5je/Ht6vvvpKQ4cOVY0aNRQUFKRWrVpp4sSJuV6dzp6H5ORkxcbGqlatWgoODtall17q9jUE4Dmu8ALwiooVK0qSKlQo2j8rfn5+evzxx7VmzRotWLBADz/8cL7jK1eurJtuuklvvfWW/vOf/+jWW291e//o0aNaunSpwsPDdeWVV0qS9u7dq759++q3337TwIEDFRMTo5SUFH300UdauXKlVq9erR49erj2sWvXLvXr10+HDh1S7969FRMTo/T0dO3cuVPPPfecHnzwQXXs2FFjx47VtGnT1KFDB8XExLi2b9y4sSQpIyNDl19+ub799lt17txZ48aNU3JyshYsWKCVK1fqww8/1LBhw3Kc40svvaQvv/xS11xzjQYOHCh/f/98vyY33HCD7rvvPi1cuFCvv/66QkJC3N6fN2+eMjMzde+99yogICDP/ZimqUGDBumbb75Rr1699Le//U1+fn7av3+/li5dqpEjR6pRo0b51pKXY8eOady4cbrssss0ZMgQVa9eXb/88ouWLl2qzz77TOvXr1e3bt082vfEiRNzXT9jxgylpKS4tW1MmDBBAQEB6t27t+rUqaPDhw9r6dKluuGGG/Svf/1L9913nyQVan7zsnDhQo0YMUKBgYEaPny4wsPD9fnnn2vKlClauXKl1q5dq6CgILdtTpw4od69e6tq1aoaOXKkUlJStGDBAg0aNEhbtmxRu3btPPraAPiLCQDFtH//fjMwMNCsU6eOee7cObf3oqOjTUnmoUOH8tw+IyPDrFChgunn52eePXu2wON9/fXXpiSzd+/eOd6bNm2aKcl88MEHXet69uxp+vv7mytWrHAbu2vXLrNKlSpm+/bt3dZ37drVlGTOmjUrx/4PHjzo+vO+fftMSWZsbGyudU6ePNmUZP797383DcNwrd+6dasZEBBgVqtWzUxLS3OtnzhxoinJrFy5srlt27b8vwgXueeee0xJ5ttvv53jvU6dOpmSzB07dritl2RGR0e7lrdt22ZKMmNiYnLsIyMjwzx58qRrOTY21pRk7tu3L8fY7PP48ssv3bb/7bffcozdsWOHGRISYg4YMMBt/ZdffmlKMidOnOi2Pvv7qSDPP/+8Kcm85pprzKysLNf6vXv35hh78uRJs3379mbVqlXN9PR01/qC5nfOnDmmJHPOnDmudampqWbVqlXNwMBA84cffnCtz8rKMocPH25KMqdMmeK2H0mmJPPee+91q/Xtt982JZl33313gecLIH+0NAAolrNnz2rkyJHKzMzUCy+8UODVyNwEBgaqZs2aMgxDx44dK3B8jx491K5dO23YsEG7d+92e2/OnDmSnB/mkqTvv/9eGzduVGxsrAYNGuQ2tkWLFrrzzju1fft2V2vDt99+q82bN6tPnz668847cxy7fv36hT6vd999VxUrVtTzzz/v9mv4Tp06KTY2VidOnFBCQkKO7e666y61b9++0MeR8m5r+OGHH/T999+re/fuatu2baH2FRwcnGNdYGBgjivHRREYGKh69erlWN
+2bVv169dP69ev19mzZz3e/4U+/vhjTZgwQZ07d9bcuXPl53f+f3VNmzbNMT4kJES33HKLUlNT9d133xXr2EuWLFFqaqpuu+02RUZGutb7+fnpxRdfVIUKFXJtgahcubJeeOEFt1pjY2NVoUKFYtcEgJYGAMVgGIZuueUWrV+/XnfeeadGjhxZase+/fbbdf/992v27NmuW3Jt3bpViYmJioqKUuvWrSVJX3/9tSQpOTk5xz1dJemnn35y/bddu3b69ttvJUkDBw4sVn1paWn65Zdf1Lp161xDcr9+/fTWW28pMTExx9ete/fuRT5e165d1aFDB23cuFG7du1Sy5YtJUnvvPOOJPd+5ry0bt1akZGR+vDDD/Xbb78pJiZGffv2VceOHd2CmKcSExP14osvasOGDUpKSsoRcI8cOVLsOx5s3rxZI0eOVN26dfXJJ5/k+BBkSkqKnn/+eX322Wfav3+//vzzT7f3//jjj2IdP7sfObe+64YNG6pp06b6+eefdfLkSVWpUsX1XosWLXL8QFGhQgVFREToxIkTxaoJAIEXgIcMw9Btt92mefPm6eabb9bMmTM93ldmZqaOHj0qf39/1ahRo1Db3HzzzXrkkUf03nvv6ZlnnpG/v3+uH1bLvmK8bNkyLVu2LM/9ZX8ALjU1VZJyvRpZFGlpaZKkiIiIXN/PDnbZ4y6U1zYFuf322/XPf/5Ts2fP1gsvvKAzZ85o3rx5qlSpkm666aYCt69QoYLWrFmjSZMm6aOPPtIDDzwgSQoLC9OYMWP0+OOPe3QFX5I2btyoyy+/XJLzh4nmzZsrJCREDodDCQkJ+uGHH5SZmenRvrMdPHhQV111lRwOhz755JMcd+g4duyYunXrpgMHDqhXr14aMGCAqlWrJn9/fyUmJmrJkiXFrqEw8/7zzz8rLS3NLfCGhobmOr5ChQrKysoqVk0AuEsDAA8YhqFbb71V7777rkaMGKH4+PhiXQH86quvdO7cOXXs2LHQH3qrVauWrrnmGv3xxx/67LPPlJmZqXnz5ikkJMTtXsDZQeL111+XaZp5vmJjYyVJ1apVk+R8yEZxZB83OTk51/eTkpLcxl3I0yeJ/f3vf1dgYKDee+89nTt3TkuWLNHRo0c1bNiwPAPVxWrWrKnXX39dv//+u3788Ue98cYbqlGjhiZOnKgXX3zRNS57vs+dO5djH9k/NFzo2WefVWZmpr744gstXbpUr7zyiiZPnqxJkybluF2dJ06ePKkrr7xSKSkpmjdvnjp16pRjzDvvvKMDBw7o6aef1oYNG/T666/r6aef1qRJk3TppZcWuwapePMOoOQQeAEUSXbYfe+99zR8+HC9//77Hl/1y97fs88+K0kaMWJEkba9sG81ISFBx48f14033uj2q+Hsuy9s2rSpUPvMbif4/PPPCxybfd65XYELDQ1V06ZNtWfPnlzDc/btpjp27FiougqjRo0auvbaa5WUlKTly5fnesW7sBwOh1q3bq3Ro0dr1apVkqSlS5e63q9evbqk3H8wyO02Y3v37lWNGjXUu3dvt/WnT5/W1q1bi1zfhbKysnTTTTdp27Zteumll9xuvXZxDZJ0zTXX5Hjvv//9b451+c1vXrKDdm63Ezt48KD27t2rpk2bul3dBVDyCLwACi27jeG9997TsGHD9MEHHxQr7B45ckQ333yz1qxZozZt2ugf//hHkba/4oor1KBBA3366ad69dVXJeUMd927d1ePHj304YcfasGCBbme07p161zL3bp1U7du3bR+/Xq99dZbOcZfGPCqV68uh8OR56N6Y2NjdfbsWU2YMEGmabrWb9u2TfHx8apatarb7a68Ifv84+Li9Pnnn6tFixZ5PoziYr/++muu99XNvlp54a20sm8hdvEHsBYtWuT29czWqFEjHT9+XDt37nSty8rK0oMPPqjDhw8Xqr68jBs3TsuXL9ddd92l8ePH5zku+5ZqGzZscFs/b948LV++PMf4gu
Y3N9dcc42qVq2qOXPmuJ2raZp65JFHdO7cOds//Q+wInp4ARTalClT9O677yokJEQtWrTQM888k2NMTExMrlctX375ZdejhdPS0vTjjz/qv//9rzIyMtSrVy99+OGHRX7MrZ+fn2699VZNmTJF3377rVq1aqWePXvmGPfhhx+qX79+uummmzR16lR17txZwcHBOnDggDZt2qTDhw8rIyPDNX7u3Lnq27ev7rrrLr3//vuKiopSRkaGdu7cqe+//15Hjx6V5Px0f3Y4HjlypJo3by4/Pz/X/WoffvhhLVu2TO+//77+97//qX///q77q547d05vvfWW16/09e/fX40bN3Z9WC/7bhWFkZiYqOuuu07du3dXmzZtVLt2bf3+++9KSEiQn5+f7r//ftfYa665Rpdcconi4+N18OBBderUSf/73/+0Zs0aDRkyJEeAvO+++/T555+rd+/euvHGGxUUFKS1a9fq999/V9++fT1+wMK3336rN954Q8HBwQoLC8v1g4nZ35MjR47UCy+8oPvuu09ffvmlGjVqpB9++EGrV6/Wddddp48//thtu4LmNzehoaF66623NGLECPXo0UPDhw9XWFiYvvjiC23ZskXdu3fXQw895NG5AigGn90QDUCZk33v1fxeF96T1DTP3zc1+1WhQgWzevXqZocOHczbbrvNXLFihdu9R4tq3759psPhMCWZL774Yp7jjh07Zj7xxBNmu3btzODgYDMkJMRs3ry5+X//93/mxx9/nGN8UlKSOXbsWLNp06ZmQECAWaNGDbNHjx7mq6++6jZu165d5pAhQ8xq1aq56rjw/rOnTp0yn3zySbNFixaue+8OHjzY/O9//5vjmLndv9YT2ff/9ff3N//44488x+mi+/AePHjQfPTRR81LL73UDA8PNwMCAsyGDRua1113nblp06Yc2+/bt8+MiYkxq1SpYlauXNns37+/+d133+V5HosWLTI7d+5sVqpUyaxVq5Z54403mnv37s31nr6FvQ9v9rjCfk8mJiaaAwcONKtXr25WqVLFjI6ONr/44otc76lrmvnPb17bmKZprl+/3hw8eLBZrVo1MyAgwGzRooX55JNPmqdOnSpwHi7UqFEjs1GjRrm+B6DwHKZ5we/ZAAAAAJuhhxcAAAC2RuAFAACArRF4AQAAYGtlKvBmP49+3Lhx+Y5buHChWrVqpaCgILVv3z7X280AAACgfCgzgfe7777Tm2++qcjIyHzHbdy4USNGjNDtt9+u77//XjExMYqJidGOHTtKqVIAAABYSZm4S8OpU6fUuXNn/fvf/9Yzzzyjjh07aurUqbmOHT58uNLT0/Xpp5+61l166aXq2LGjZs6cWUoVAwAAwCrKxIMnRo8eraFDh2rAgAG53uj+Qps2bcrxpJ1BgwYpISEhz20yMzOVmZnpWjYMQ8eOHVPNmjU9fqY9AAAASo5pmjp58qTq1q0rP7/8mxYsH3jnz5+vrVu36rvvvivU+KSkJEVERLiti4iIUFJSUp7bxMXFafLkycWqEwAAAKXv4MGDql+/fr5jLB14Dx48qLFjx2rVqlVuz3D3tgkTJrhdFU5NTVXDhg21f/9+hYaGlthxS4thGLrhhhu0aNGiAn8CQukyDENHjhxRrVq1mBuLYW6sjfmxLubGuuw2N2lpaWrUqFGhHtFu6cC7ZcsWpaSkqHPnzq51WVlZWr9+vd544w1lZmbK39/fbZvatWsrOTnZbV1ycrJq166d53ECAwMVGBiYY321atVsE3grVqyoatWq2eIb3E4Mw9CZM2eYGwtibqyN+bEu5sa67DY32edQmPZTS59t//79tX37diUmJrpeXbt21d///nclJibmCLuSFBUVpdWrV7utW7VqlaKiokqrbAAAAFiIpa/wVqlSRe3atXNbV7lyZdWsWdO1ftSoUapXr57i4uIkSWPHjlV0dLReeeUVDR06VPPnz9fmzZs1a9asUq8fAAAAvmfpK7yFceDAAR
[base64 PNG data omitted; figure: 2D Vector Visualization of vectors A, B, and C]", + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Note: Arrows represent vectors. Endpoint of arrow = vector endpoint\n" + ] + } + ], + "source": [ + "# Visualize 2D vectors\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "\n", + "# Prefer a font with CJK glyphs, if the system has one\n", + "try:\n", + "    plt.rcParams['font.sans-serif'] = ['SimHei', 'Noto Sans CJK SC', 'WenQuanYi Micro Hei']\n", + "    plt.rcParams['axes.unicode_minus'] = False\n", + "except Exception:\n", + "    pass  # otherwise keep matplotlib's default font\n", + "\n", + "# Create the figure and axes\n", + "fig, ax = plt.subplots(figsize=(8, 8))\n", + "\n", + "# Define the vectors\n", + "vectors = {\n", + "    'A = [2, 3]': np.array([2, 3]),\n", + "    'B = [4, 1]': np.array([4, 1]),\n", + "    'C = [1, 1]': np.array([1, 1]),\n", + "}\n", + "\n", + "# Draw each vector as an arrow from the origin\n", + "colors = ['red', 'blue', 'green']\n", + "for (name, vec), color in zip(vectors.items(), colors):\n", + "    ax.annotate('', xy=vec, xytext=(0, 0),\n", + "                arrowprops=dict(arrowstyle='->', color=color, lw=2))\n", + "    ax.text(vec[0]+0.1, vec[1]+0.1, name, fontsize=12, color=color)\n", + "\n", + "# Draw the coordinate axes\n", + "ax.axhline(y=0, color='black', linewidth=0.5)\n", + "ax.axvline(x=0, color='black', linewidth=0.5)\n", + "\n", + "# Set the axis ranges, labels, and title\n", + "ax.set_xlim(-0.5, 5.5)\n", + "ax.set_ylim(-0.5, 4)\n", + "ax.set_xlabel('x (abscissa)', fontsize=12)\n", + "ax.set_ylabel('y (ordinate)', fontsize=12)\n", + "ax.set_title('2D Vector Visualization', fontsize=14)\n", + "ax.grid(True, alpha=0.3)\n", + "ax.set_aspect('equal')\n", + "\n", + "plt.show()\n", + "print(\"Note: Arrows represent vectors.
Endpoint of arrow = vector endpoint\")" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3.2 Basic Vector Operations\n", + "\n", + "### 3.2.1 Vector Addition\n", + "\n", + "**Rule: add the corresponding components**\n", + "\n", + "```\n", + "[1, 2, 3] + [4, 5, 6] = [1+4, 2+5, 3+6] = [5, 7, 9]\n", + "```\n", + "\n", + "**Geometric intuition**: walking along vector a and then along vector b ends at the same point as walking straight from the origin to a+b.\n", + "\n", + "```\n", + "            * a+b = [5,7,9]\n", + "          ↗ ↑\n", + "        ↗   | b = [4,5,6]\n", + "      ↗     |\n", + "    O ----→ *\n", + "      a = [1,2,3]\n", + "```" + ] + },
+ { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "==================================================\n", + "Vector addition demo\n", + "==================================================\n", + "Vector a = [1 2 3]\n", + "Vector b = [4 5 6]\n", + "a + b = [5 7 9]\n", + "\n", + "Step by step:\n", + "  Index 0: 1 + 4 = 5\n", + "  Index 1: 2 + 5 = 7\n", + "  Index 2: 3 + 6 = 9\n", + "\n", + "Check: True True True\n" + ] + } + ], + "source": [ + "# Vector addition demo\n", + "import numpy as np\n", + "\n", + "print(\"=\" * 50)\n", + "print(\"Vector addition demo\")\n", + "print(\"=\" * 50)\n", + "\n", + "a = np.array([1, 2, 3])\n", + "b = np.array([4, 5, 6])\n", + "c = a + b\n", + "\n", + "print(f\"Vector a = {a}\")\n", + "print(f\"Vector b = {b}\")\n", + "print(f\"a + b = {c}\")\n", + "print()\n", + "print(\"Step by step:\")\n", + "print(f\"  Index 0: {a[0]} + {b[0]} = {a[0]+b[0]}\")\n", + "print(f\"  Index 1: {a[1]} + {b[1]} = {a[1]+b[1]}\")\n", + "print(f\"  Index 2: {a[2]} + {b[2]} = {a[2]+b[2]}\")\n", + "print()\n", + "print(\"Check:\", a[0]+b[0] == c[0], a[1]+b[1] == c[1], a[2]+b[2] == c[2])" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 3.2.2 Scalar Multiplication\n", + "\n", + "**Rule: multiply every component by the scalar (a single number)**\n", + "\n", + "```\n", + "2 × [1, 2, 3] = [2×1, 2×2, 2×3] = [2, 4, 6]\n", + "3 × [1, 2, 3] = [3×1, 3×2, 3×3] = [3, 6, 9]\n", + "0.5 × [1, 2, 3] = [0.5, 1.0, 1.5]\n", + "```\n", + "\n", + "**Geometric intuition**:\n", + "- Positive scalar: the direction is unchanged and the length is scaled\n", + "- Negative scalar: the direction is reversed and the length is scaled\n", + "- Zero: the result is the zero vector" + ] + },
+ { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "==================================================\n", + "Scalar multiplication demo\n", + "==================================================\n", + "Original vector v = [1 2 3]\n", + "\n", + "2 × v = [2 4 6]\n", + "3 × v = [3 6 9]\n", + "0.5 × v = [0.5 1.  1.5]\n", + "-1 × v = [-1 -2 -3]\n", + "0 × v = [0 0 0]\n" + ] + } + ], + "source": [ + "# Scalar multiplication demo\n", + "import numpy as np\n", + "\n", + "print(\"=\" * 50)\n", + "print(\"Scalar multiplication demo\")\n", + "print(\"=\" * 50)\n", + "\n", + "v = np.array([1, 2, 3])\n", + "\n", + "print(f\"Original vector v = {v}\")\n", + "print()\n", + "\n", + "for scalar in [2, 3, 0.5, -1, 0]:\n", + "    result = scalar * v\n", + "    print(f\"{scalar} × v = {result}\")" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 3.2.3 Vector Length (Magnitude / Norm)\n", + "\n", + "**Definition: the distance from the origin to the tip of the vector**\n", + "\n", + "For a 2D vector `[a, b]`:\n", + "```\n", + "length = √(a² + b²)\n", + "\n", + "This is just the Pythagorean theorem!\n", + "\n", + "               * [a, b]\n", + "             / |\n", + "  √(a²+b²) /   | b\n", + "         /     |\n", + "       O ------+\n", + "           a\n", + "```\n", + "\n", + "For an n-dimensional vector `[a₁, a₂, ..., aₙ]`:\n", + "```\n", + "length = √(a₁² + a₂² + ...
+ aₙ²)\n", + "```" + ] + },
+ { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "==================================================\n", + "Vector length (magnitude / norm) demo\n", + "==================================================\n", + "Vector v = [3 4]\n", + "Length = √(3² + 4²) = √(9 + 16) = √25 = 5.0\n", + "\n", + "More length examples:\n", + "  [1, 1] -> length = 1.4142\n", + "  [0, 5] -> length = 5.0000\n", + "  [3, 4] -> length = 5.0000\n", + "  [1, 2, 2] -> length = 3.0000\n", + "  [1, 1, 1, 1] -> length = 2.0000\n" + ] + } + ], + "source": [ + "# Computing vector length\n", + "import numpy as np\n", + "\n", + "print(\"=\" * 50)\n", + "print(\"Vector length (magnitude / norm) demo\")\n", + "print(\"=\" * 50)\n", + "\n", + "# 2D example\n", + "v2d = np.array([3, 4])\n", + "length_2d = np.linalg.norm(v2d)\n", + "\n", + "print(f\"Vector v = {v2d}\")\n", + "print(f\"Length = √({v2d[0]}² + {v2d[1]}²) = √({v2d[0]**2} + {v2d[1]**2}) = √{v2d[0]**2 + v2d[1]**2} = {length_2d}\")\n", + "print()\n", + "\n", + "# More examples\n", + "examples = [\n", + "    np.array([1, 1]),        # at 45 degrees\n", + "    np.array([0, 5]),        # on the y-axis\n", + "    np.array([3, 4]),        # classic 3-4-5 Pythagorean triple\n", + "    np.array([1, 2, 2]),     # 3D vector\n", + "    np.array([1, 1, 1, 1])   # 4D vector\n", + "]\n", + "\n", + "print(\"More length examples:\")\n", + "for v in examples:\n", + "    length = np.linalg.norm(v)\n", + "    # tolist() yields plain Python ints, so the list prints as [1, 1] rather than [np.int64(1), ...]\n", + "    print(f\"  {v.tolist()} -> length = {length:.4f}\")" + ] + },
+ { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "==================================================\n", + "Exercise 3: Answers\n", + "==================================================\n", + "A = [3 4], B = [1 2]\n", + "\n", + "1. A + B = [4 6]\n", + "2. 2 × A = [6 8]\n", + "3. Length of A = 5.0\n" + ] + } + ], + "source": [ + "# Exercise 3: answers\n", + "import numpy as np\n", + "print(\"=\" * 50)\n", + "print(\"Exercise 3: Answers\")\n", + "print(\"=\" * 50)\n", + "\n", + "A = np.array([3, 4])\n", + "B = np.array([1, 2])\n", + "\n", + "print(f\"A = {A}, B = {B}\")\n", + "print()\n", + "\n", + "# 1. A + B\n", + "print(\"1. A + B =\", A + B)\n", + "\n", + "# 2. 2 × A\n", + "print(\"2. 2 × A =\", 2 * A)\n", + "\n", + "# 3. Length of A\n", + "print(f\"3. Length of A = {np.linalg.norm(A)}\")" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "# Part 4: Cosine Similarity\n", + "\n", + "## 4.1 What Is Similarity?\n", + "\n", + "**Similarity = how \"alike\" two vectors are**\n", + "\n", + "### Everyday examples of similarity\n", + "\n", + "| High similarity | Why | Low similarity | Why |\n", + "|----------|------|----------|------|\n", + "| \"cat\" and \"dog\" | Both are animals with four legs | \"cat\" and \"stone\" | One is an animal, the other is not |\n", + "| \"red\" and \"yellow\" | Both are colors, both warm tones | \"hot\" and \"cold\" | Opposite meanings |\n", + "| \"running\" and \"swimming\" | Both are sports | \"sun\" and \"bacteria\" | Almost nothing in common |\n", + "| \"apple\" and \"pear\" | Both are fruits | \"apple\" and \"phone\" | Related only given context (Apple the brand) |\n", + "\n", + "### How does a computer quantify similarity?\n", + "\n", + "Where text similarity is used in practice:\n", + "\n", + "```\n", + "Search:\n", + "  User query: \"How do I learn programming?\"\n", + "  Document 1: \"Python beginner tutorial\" → high similarity ✅\n", + "  Document 2: \"100 ways to bake a cake\" → low similarity ❌\n", + "\n", + "Recommendation:\n", + "  User likes: \"Funny cat and dog videos\"\n", + "  Candidate 1: \"Cute hamster moments\" → high similarity ✅\n", + "  Candidate 2: \"Car engine repair tutorial\" → low similarity ❌\n", + "```" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4.2 Dot Product: The Most Important Operation\n", + "\n", + "### Definition: multiply corresponding components, then sum\n", + "\n", + "```python\n", + "a = [1, 2, 3]\n", + "b = [4, 5, 6]\n", + "\n", + "dot = 1*4 + 2*5 + 3*6  # 4 + 10 + 18 = 32\n", + "```\n", + "\n", + "### Geometric meaning of the dot product\n", + "\n", + "```\n", + "dot product = |A| × |B| × cos(θ)\n", + "\n", + "where:\n", + "  |A| = length of vector A\n", + "  |B| = length of vector B\n", + "  θ   = angle between the two vectors\n", + "```\n", + "\n", + "| Angle θ | cos(θ) | Dot product | Meaning |\n", + "|--------|--------|----------|------|\n", + "| 0° | 1 | \\|A\\|×\\|B\\| (maximum) | Exactly the same direction |\n", + "| 90° | 0 | 0 | Perpendicular (orthogonal) |\n", + "| 180° | -1 | -\\|A\\|×\\|B\\| (minimum) | Exactly opposite directions |" + ] +
}, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "==================================================\n", + "向量点积演示\n", + "==================================================\n", + "向量 a = [1 2 3]\n", + "向量 b = [4 5 6]\n", + "\n", + "点积 a · b = 32\n", + "验证: a @ b = 32\n", + "手动计算: 32\n", + "\n", + "计算过程:\n", + " a[0]×b[0] = 1×4 = 4\n", + " a[1]×b[1] = 2×5 = 10\n", + " a[2]×b[2] = 3×6 = 18\n", + " 求和: 4 + 10 + 18 = 32\n" + ] + } + ], + "source": [ + "# 点积计算演示\n", + "import numpy as np\n", + "\n", + "print(\"=\" * 50)\n", + "print(\"向量点积演示\")\n", + "print(\"=\" * 50)\n", + "\n", + "a = np.array([1, 2, 3])\n", + "b = np.array([4, 5, 6])\n", + "\n", + "# 方法1:使用np.dot()\n", + "dot1 = np.dot(a, b)\n", + "\n", + "# 方法2:使用@运算符\n", + "dot2 = a @ b\n", + "\n", + "# 方法3:手动计算\n", + "dot3 = sum(a[i] * b[i] for i in range(len(a)))\n", + "\n", + "print(f\"向量 a = {a}\")\n", + "print(f\"向量 b = {b}\")\n", + "print()\n", + "print(f\"点积 a · b = {dot1}\")\n", + "print(f\"验证: a @ b = {dot2}\")\n", + "print(f\"手动计算: {dot3}\")\n", + "print()\n", + "print(\"计算过程:\")\n", + "print(f\" a[0]×b[0] = {a[0]}×{b[0]} = {a[0]*b[0]}\")\n", + "print(f\" a[1]×b[1] = {a[1]}×{b[1]} = {a[1]*b[1]}\")\n", + "print(f\" a[2]×b[2] = {a[2]}×{b[2]} = {a[2]*b[2]}\")\n", + "print(f\" 求和: {a[0]*b[0]} + {a[1]*b[1]} + {a[2]*b[2]} = {dot1}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "==================================================\n", + "点积与夹角的关系\n", + "==================================================\n", + "夹角0°: a=[1 0], b=[2 0], 点积=2\n", + "夹角90°: a=[1 0], b=[0 1], 点积=0\n", + "夹角180°: a=[1 0], b=[-1 0], 点积=-1\n", + "\n", + "任意角度: a=[1 1], b=[1 0]\n", + " 点积 = 1\n", + " cos(θ) = 0.7071\n", + " 夹角 θ = 45.0°\n" + ] + } + ], + "source": [ + "# 点积与夹角的关系\n", + "import numpy as np\n", + "\n", + 
"print(\"=\" * 50)\n", + "print(\"点积与夹角的关系\")\n", + "print(\"=\" * 50)\n", + "\n", + "# 夹角为0度:方向完全相同\n", + "a = np.array([1, 0])\n", + "b = np.array([2, 0])\n", + "dot = np.dot(a, b)\n", + "print(f\"夹角0°: a={a}, b={b}, 点积={dot}\")\n", + "\n", + "# 夹角为90度:垂直\n", + "a = np.array([1, 0])\n", + "b = np.array([0, 1])\n", + "dot = np.dot(a, b)\n", + "print(f\"夹角90°: a={a}, b={b}, 点积={dot}\")\n", + "\n", + "# 夹角为180度:方向相反\n", + "a = np.array([1, 0])\n", + "b = np.array([-1, 0])\n", + "dot = np.dot(a, b)\n", + "print(f\"夹角180°: a={a}, b={b}, 点积={dot}\")\n", + "\n", + "# 任意角度\n", + "import math\n", + "a = np.array([1, 1])\n", + "b = np.array([1, 0])\n", + "dot = np.dot(a, b)\n", + "cos_angle = dot / (np.linalg.norm(a) * np.linalg.norm(b))\n", + "angle = math.acos(cos_angle) * 180 / math.pi\n", + "print(f\"\\n任意角度: a={a}, b={b}\")\n", + "print(f\" 点积 = {dot}\")\n", + "print(f\" cos(θ) = {cos_angle:.4f}\")\n", + "print(f\" 夹角 θ = {angle:.1f}°\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4.3 余弦相似度 — 用点积判断\"像不像\"\n", + "\n", + "### 公式\n", + "\n", + "```\n", + " A · B\n", + "cos(θ) = ──────────\n", + " |A| × |B|\n", + "\n", + "其中:\n", + " A · B = 向量A和B的点积\n", + " |A| = 向量A的长度(模)\n", + " |B| = 向量B的长度(模)\n", + " cos(θ) = 相似度,范围是 [-1, 1]\n", + "```\n", + "\n", + "### 为什么叫\"余弦\"相似度?\n", + "\n", + "因为公式中计算的就是两个向量夹角的余弦值!\n", + "\n", + "从点积公式推导:\n", + "```\n", + "A · B = |A| × |B| × cos(θ)\n", + " ↓\n", + "cos(θ) = (A · B) / (|A| × |B|)\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "余弦相似度函数已定义:cosine_similarity(a, b)\n" + ] + } + ], + "source": [ + "# 定义余弦相似度函数\n", + "import numpy as np\n", + "\n", + "def cosine_similarity(a, b):\n", + " \"\"\"\n", + " 计算余弦相似度\n", + " \n", + " 参数:\n", + " a, b: 两个numpy数组(向量)\n", + " \n", + " 返回:\n", + " float: 余弦相似度,范围[-1, 1]\n", + " \"\"\"\n", + " dot = np.dot(a, b) # 点积\n", + " norm_a = 
np.linalg.norm(a) # 向量a的长度\n", + " norm_b = np.linalg.norm(b) # 向量b的长度\n", + " \n", + " # 防止除以零\n", + " if norm_a == 0 or norm_b == 0:\n", + " return 0.0\n", + " \n", + " return dot / (norm_a * norm_b)\n", + "\n", + "print(\"余弦相似度函数已定义:cosine_similarity(a, b)\")" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "==================================================\n", + "余弦相似度计算示例\n", + "==================================================\n", + "1. 方向完全相同: a=[1 2 3], b=[2 4 6]\n", + " 相似度 = 1.000 (应该是1.000)\n", + "\n", + "2. 方向完全相反: a=[1 2 3], b=[-1 -2 -3]\n", + " 相似度 = -1.000 (应该是-1.000)\n", + "\n", + "3. 垂直向量: a=[1 0], b=[0 1]\n", + " 相似度 = 0.000 (应该是0.000)\n", + "\n", + "4. 45度夹角: a=[1 1], b=[1 0]\n", + " 相似度 = 0.707 (应该是0.707)\n" + ] + } + ], + "source": [ + "# 余弦相似度计算示例\n", + "import numpy as np\n", + "\n", + "print(\"=\" * 50)\n", + "print(\"余弦相似度计算示例\")\n", + "print(\"=\" * 50)\n", + "\n", + "# 示例1:方向完全相同的向量\n", + "a = np.array([1, 2, 3])\n", + "b = np.array([2, 4, 6]) # b是a的两倍,方向完全相同\n", + "sim = cosine_similarity(a, b)\n", + "print(f\"1. 方向完全相同: a={a}, b={b}\")\n", + "print(f\" 相似度 = {sim:.3f} (应该是1.000)\")\n", + "print()\n", + "\n", + "# 示例2:方向完全相反的向量\n", + "a = np.array([1, 2, 3])\n", + "b = np.array([-1, -2, -3]) # b是a的相反方向\n", + "sim = cosine_similarity(a, b)\n", + "print(f\"2. 方向完全相反: a={a}, b={b}\")\n", + "print(f\" 相似度 = {sim:.3f} (应该是-1.000)\")\n", + "print()\n", + "\n", + "# 示例3:垂直的向量\n", + "a = np.array([1, 0])\n", + "b = np.array([0, 1])\n", + "sim = cosine_similarity(a, b)\n", + "print(f\"3. 垂直向量: a={a}, b={b}\")\n", + "print(f\" 相似度 = {sim:.3f} (应该是0.000)\")\n", + "print()\n", + "\n", + "# 示例4:45度夹角\n", + "a = np.array([1, 1])\n", + "b = np.array([1, 0])\n", + "sim = cosine_similarity(a, b)\n", + "print(f\"4. 
45度夹角: a={a}, b={b}\")\n", + "print(f\" 相似度 = {sim:.3f} (应该是0.707)\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 余弦相似度的值代表什么?\n", + "\n", + "| cos(θ) 值 | 夹角 θ | 相似程度 | 示例 |\n", + "|----------|--------|---------|------|\n", + "| 1.0 | 0° | **完全相同** | 同一向量 |\n", + "| 0.8~0.99 | 0~37° | **非常相似** | \"猫\" vs \"狗\" |\n", + "| 0.5~0.8 | 37~60° | **比较相似** | \"跑步\" vs \"运动\" |\n", + "| 0.3~0.5 | 60~72° | **有些相似** | \"苹果\" vs \"水果\" |\n", + "| 0 | 90° | **毫不相关** | \"猫\" vs \"石头\" |\n", + "| -0.5~0 | 90~120° | **有些相反** | \"热\" vs \"冷\" |\n", + "| -1.0 | 180° | **完全相反** | \"高\" vs \"矮\" |" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "==================================================\n", + "语义相似度示例(用向量模拟词义)\n", + "==================================================\n", + "\n", + "词向量(简化模拟):\n", + " 猫 = [0.9 0.1 0.7 0.8 0.9]\n", + " 狗 = [0.8 0.2 0.6 0.8 0.9]\n", + " 苹果 = [0.1 0.9 0.9 0. 0. ]\n", + " 汽车 = [0. 0. 0. 0.9 0. ]\n", + " 石头 = [0. 0.1 0. 0. 0. 
]\n", + "\n", + "维度说明: [动物性, 植物性, 可食用性, 移动性, 宠物性]\n", + "\n", + "相似度计算结果:\n", + " 猫 vs 狗: 0.996 (都是动物,都有宠物属性)\n", + " 猫 vs 苹果: 0.382 (动物vs植物,很不同)\n", + " 猫 vs 汽车: 0.482 (动物vs机械)\n", + " 猫 vs 石头: 0.060 (动物vs无机物)\n", + " 狗 vs 汽车: 0.507 (动物vs机械,但都能移动)\n", + " 苹果 vs 石头: 0.705 (都是静态的)\n" + ] + } + ], + "source": [ + "# 语义相似度示例\n", + "import numpy as np\n", + "\n", + "print(\"=\" * 50)\n", + "print(\"语义相似度示例(用向量模拟词义)\")\n", + "print(\"=\" * 50)\n", + "print()\n", + "\n", + "# 假设这些是词的\"意义向量\"(简化版)\n", + "# 维度解释: [动物性, 植物性, 可食用性, 移动性, 宠物性]\n", + "# 每个维度取值0-1,表示该属性的强弱\n", + "\n", + "cat = np.array([0.9, 0.1, 0.7, 0.8, 0.9]) # 猫\n", + "dog = np.array([0.8, 0.2, 0.6, 0.8, 0.9]) # 狗\n", + "apple = np.array([0.1, 0.9, 0.9, 0.0, 0.0]) # 苹果\n", + "car = np.array([0.0, 0.0, 0.0, 0.9, 0.0]) # 汽车\n", + "rock = np.array([0.0, 0.1, 0.0, 0.0, 0.0]) # 石头\n", + "\n", + "print(\"词向量(简化模拟):\")\n", + "print(f\" 猫 = {cat}\")\n", + "print(f\" 狗 = {dog}\")\n", + "print(f\" 苹果 = {apple}\")\n", + "print(f\" 汽车 = {car}\")\n", + "print(f\" 石头 = {rock}\")\n", + "print()\n", + "print(\"维度说明: [动物性, 植物性, 可食用性, 移动性, 宠物性]\")\n", + "print()\n", + "\n", + "# 计算相似度\n", + "print(\"相似度计算结果:\")\n", + "print(f\" 猫 vs 狗: {cosine_similarity(cat, dog):.3f} (都是动物,都有宠物属性)\")\n", + "print(f\" 猫 vs 苹果: {cosine_similarity(cat, apple):.3f} (动物vs植物,很不同)\")\n", + "print(f\" 猫 vs 汽车: {cosine_similarity(cat, car):.3f} (动物vs机械)\")\n", + "print(f\" 猫 vs 石头: {cosine_similarity(cat, rock):.3f} (动物vs无机物)\")\n", + "print(f\" 狗 vs 汽车: {cosine_similarity(dog, car):.3f} (动物vs机械,但都能移动)\")\n", + "print(f\" 苹果 vs 石头: {cosine_similarity(apple, rock):.3f} (都是静态的)\")" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "==================================================\n", + "练习题4答案\n", + "==================================================\n", + "A = [1 2 3], B = [4 5 6]\n", + "\n", + "1. 
点积 A · B = 32\n", + " 计算: 1×4 + 2×5 + 3×6 = 4 + 10 + 18 = 32\n", + "\n", + "2. 余弦相似度 = 0.9746\n", + "\n", + "3. A=[1,0], B=[0,1] 的余弦相似度 = 0.0\n", + " 原因:这两个向量垂直,夹角90°,cos(90°)=0\n" + ] + } + ], + "source": [ + "# 练习题4答案\n", + "import numpy as np\n", + "\n", + "print(\"=\" * 50)\n", + "print(\"练习题4答案\")\n", + "print(\"=\" * 50)\n", + "\n", + "A = np.array([1, 2, 3])\n", + "B = np.array([4, 5, 6])\n", + "\n", + "print(f\"A = {A}, B = {B}\")\n", + "print()\n", + "\n", + "# 1. 点积\n", + "dot = np.dot(A, B)\n", + "print(f\"1. 点积 A · B = {dot}\")\n", + "print(f\" 计算: 1×4 + 2×5 + 3×6 = 4 + 10 + 18 = {dot}\")\n", + "print()\n", + "\n", + "# 2. 余弦相似度\n", + "cos_sim = cosine_similarity(A, B)\n", + "print(f\"2. 余弦相似度 = {cos_sim:.4f}\")\n", + "print()\n", + "\n", + "# 3. 垂直向量的相似度\n", + "A = np.array([1, 0])\n", + "B = np.array([0, 1])\n", + "cos_sim = cosine_similarity(A, B)\n", + "print(f\"3. A=[1,0], B=[0,1] 的余弦相似度 = {cos_sim}\")\n", + "print(\" 原因:这两个向量垂直,夹角90°,cos(90°)=0\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "# 第五部分:文本向量化的核心思想\n", + "\n", + "## 5.1 核心目标:把所有文本变成\"向量\"\n", + "\n", + "```\n", + "┌──────────────────────────────────────────────────────────────────┐\n", + "│ │\n", + "│ 文本(符号) ──→ 数值向量 ──→ 计算机可以计算 ──→ AI模型处理 │\n", + "│ │\n", + "│ \"猫\" [0.9, 0.1, 0.8] │\n", + "│ \"狗\" [0.8, 0.2, 0.7] │\n", + "│ │\n", + "└──────────────────────────────────────────────────────────────────┘\n", + "```\n", + "\n", + "### 为什么必须是向量?\n", + "\n", + "| 计算机擅长 | 计算机不擅长 |\n", + "|------------|-------------|\n", + "| 向量加减:v1 + v2 = ? | 字符串比较:\"Python\" == \"Java\" ? |\n", + "| 向量点积:v1 · v2 = ? | 词语推理:\"猫\" 类似于 \"狗\" ? |\n", + "| 向量距离:|v1 - v2| = ? | 语义理解:\"你好\"是问候语 |\n", + "| 余弦相似度:cos(θ) = ? | 情感判断:\"绝了\"是夸还是骂? 
|" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5.2 向量化示例:从\"词\"到\"数\"\n", + "\n", + "### 方法1:位置编码(只有位置信息,没有语义)\n", + "\n", + "```python\n", + "# 假设我们有一个很小的词汇表(只有5个词)\n", + "vocab = [\"猫\", \"狗\", \"鱼\", \"苹果\", \"香蕉\"]\n", + "\n", + "# 位置编码:每个词对应一个位置\n", + "# \"猫\" → [1, 0, 0, 0, 0] 第1个位置是1,其他是0\n", + "# \"狗\" → [0, 1, 0, 0, 0] 第2个位置是1,其他是0\n", + "# \"苹果\" → [0, 0, 0, 1, 0] 第4个位置是1,其他是0\n", + "```\n", + "\n", + "**问题**:这只是\"位置编码\",没有语义信息!\n", + "\n", + "```\n", + "\"猫\" = [1, 0, 0, 0, 0]\n", + "\"狗\" = [0, 1, 0, 0, 0]\n", + "\n", + "余弦相似度 = 0 (完全不相似)\n", + "\n", + "但实际上\"猫\"和\"狗\"都是动物,应该很相似!\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "==================================================\n", + "位置编码的缺陷\n", + "==================================================\n", + "位置编码向量:\n", + " 猫 = [1 0 0 0 0]\n", + " 狗 = [0 1 0 0 0]\n", + " 苹果 = [0 0 0 1 0]\n", + "\n", + "余弦相似度(用位置编码):\n", + " 猫 vs 狗: 0.000\n", + " 猫 vs 苹果: 0.000\n", + "\n", + "问题:猫和狗都是动物,相似度却是0!\n", + " 猫和苹果不是同类,相似度也是0!\n", + " 位置编码没有语义信息!\n" + ] + } + ], + "source": [ + "# 位置编码的缺陷演示\n", + "import numpy as np\n", + "\n", + "print(\"=\" * 50)\n", + "print(\"位置编码的缺陷\")\n", + "print(\"=\" * 50)\n", + "\n", + "# 位置编码向量\n", + "cat_onehot = np.array([1, 0, 0, 0, 0]) # \"猫\"\n", + "dog_onehot = np.array([0, 1, 0, 0, 0]) # \"狗\"\n", + "apple_onehot = np.array([0, 0, 0, 1, 0]) # \"苹果\"\n", + "\n", + "print(\"位置编码向量:\")\n", + "print(f\" 猫 = {cat_onehot}\")\n", + "print(f\" 狗 = {dog_onehot}\")\n", + "print(f\" 苹果 = {apple_onehot}\")\n", + "print()\n", + "\n", + "# 相似度计算\n", + "print(\"余弦相似度(用位置编码):\")\n", + "print(f\" 猫 vs 狗: {cosine_similarity(cat_onehot, dog_onehot):.3f}\")\n", + "print(f\" 猫 vs 苹果: {cosine_similarity(cat_onehot, apple_onehot):.3f}\")\n", + "print()\n", + "print(\"问题:猫和狗都是动物,相似度却是0!\")\n", + "print(\" 
猫和苹果不是同类,相似度也是0!\")\n", + "print(\" 位置编码没有语义信息!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 方法2:语义编码(有语义信息)\n", + "\n", + "```python\n", + "# 语义编码:每个词用\"含义\"来表示\n", + "# 维度:[动物性, 植物性, 可食用性, 宠物性]\n", + "\n", + "cat = np.array([0.9, 0.1, 0.7, 0.9]) # 猫\n", + "dog = np.array([0.8, 0.2, 0.6, 0.9]) # 狗\n", + "apple = np.array([0.1, 0.9, 0.9, 0.0]) # 苹果\n", + "```\n", + "\n", + "**这就是文本向量化的威力:把\"语义\"变成\"可计算的数值\"!**" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "==================================================\n", + "语义编码的优点\n", + "==================================================\n", + "语义编码向量:\n", + " 猫 = [0.9 0.1 0.7 0.9]\n", + " 狗 = [0.8 0.2 0.6 0.9]\n", + " 苹果 = [0.1 0.9 0.9 0. ]\n", + "\n", + "维度说明: [动物性, 植物性, 可食用性, 宠物性]\n", + "\n", + "余弦相似度(用语义编码):\n", + " 猫 vs 狗: 0.995 (都是动物,都有宠物属性)\n", + " 猫 vs 苹果: 0.436 (动物vs植物)\n", + "\n", + "太棒了!语义编码可以捕捉到词的语义相似性!\n" + ] + } + ], + "source": [ + "# 语义编码的优点演示\n", + "import numpy as np\n", + "\n", + "print(\"=\" * 50)\n", + "print(\"语义编码的优点\")\n", + "print(\"=\" * 50)\n", + "\n", + "# 语义编码向量\n", + "# 维度: [动物性, 植物性, 可食用性, 宠物性]\n", + "cat = np.array([0.9, 0.1, 0.7, 0.9]) # 猫\n", + "dog = np.array([0.8, 0.2, 0.6, 0.9]) # 狗\n", + "apple = np.array([0.1, 0.9, 0.9, 0.0]) # 苹果\n", + "\n", + "print(\"语义编码向量:\")\n", + "print(f\" 猫 = {cat}\")\n", + "print(f\" 狗 = {dog}\")\n", + "print(f\" 苹果 = {apple}\")\n", + "print()\n", + "print(\"维度说明: [动物性, 植物性, 可食用性, 宠物性]\")\n", + "print()\n", + "\n", + "# 相似度计算\n", + "print(\"余弦相似度(用语义编码):\")\n", + "print(f\" 猫 vs 狗: {cosine_similarity(cat, dog):.3f} (都是动物,都有宠物属性)\")\n", + "print(f\" 猫 vs 苹果: {cosine_similarity(cat, apple):.3f} (动物vs植物)\")\n", + "print()\n", + "print(\"太棒了!语义编码可以捕捉到词的语义相似性!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5.3 向量化方法演进\n", + "\n", + "```\n", + "文本向量化的三种主要方法:\n", + "\n", + "[ BoW ] 
───→ [ TF-IDF ] ───→ [ Word Embedding ]\n", + " (词袋模型) (词频权重) (词向量嵌入)\n", + " \n", + " 简单粗暴 加入词重要性 蕴含语义信息\n", + " 无语义 仍无语义 深度语义\n", + " \n", + " 1950年代起 1970年代起 2013年后\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "# 第六部分:BoW词袋模型\n", + "\n", + "## 6.1 原理\n", + "\n", + "把文本看成\"一袋词\",**不考虑顺序**,只管词出现了几次。\n", + "\n", + "```\n", + "文本1: \"Python 是 编程 语言\"\n", + "文本2: \"Java 是 编程 语言\"\n", + "\n", + "分词后:\n", + " Doc1: [\"Python\", \"是\", \"编程\", \"语言\"]\n", + " Doc2: [\"Java\", \"是\", \"编程\", \"语言\"]\n", + "\n", + "构建词表(所有文档的词集合):\n", + " 词表: [\"Python\", \"Java\", \"是\", \"编程\", \"语言\"]\n", + "\n", + "向量化:统计每个词出现的次数\n", + " Doc1 → [1, 0, 1, 1, 1] # Python出现1次,Java出现0次,...\n", + " Doc2 → [0, 1, 1, 1, 1] # Python出现0次,Java出现1次,...\n", + "\n", + "注:下方代码会对词表排序(['Java', 'Python', '是', '编程', '语言']),所以向量的列顺序与这里不同\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "==================================================\n", + "BoW词袋模型演示(手动实现)\n", + "==================================================\n", + "【示例1】文档集合:\n", + " Doc1: Python 是 编程 语言\n", + " Doc2: Java 是 编程 语言\n", + "\n", + "词表: ['Java', 'Python', '是', '编程', '语言']\n", + "\n", + "BoW矩阵(每行是一个文档,每列是一个词):\n", + " Doc1: [0, 1, 1, 1, 1]\n", + " Doc2: [1, 0, 1, 1, 1]\n", + "\n", + "详细解释:\n", + "\n", + "Doc1: Python 是 编程 语言\n", + " -> 'Python' 出现 1 次\n", + " -> '是' 出现 1 次\n", + " -> '编程' 出现 1 次\n", + " -> '语言' 出现 1 次\n", + "\n", + "Doc2: Java 是 编程 语言\n", + " -> 'Java' 出现 1 次\n", + " -> '是' 出现 1 次\n", + " -> '编程' 出现 1 次\n", + " -> '语言' 出现 1 次\n" + ] + } + ], + "source": [ + "# BoW词袋模型演示(纯Python实现,不依赖sklearn)\n", + "\n", + "print(\"=\" * 50)\n", + "print(\"BoW词袋模型演示(手动实现)\")\n", + "print(\"=\" * 50)\n", + "\n", + "def simple_bow(docs):\n", + " \"\"\"\n", + " 简单的BoW实现\n", + " \n", + " 参数:\n", + " docs: 文档列表,每篇文档已经是分词后的词列表\n", + " 返回:\n", + " vocab: 词表(有序列表)\n", + " bow_matrix: BoW矩阵 (n_docs x n_vocab)\n", + " \"\"\"\n", + " # 1. 
构建词表\n", + " vocab_set = set()\n", + " for doc in docs:\n", + " vocab_set.update(doc)\n", + " vocab = sorted(list(vocab_set)) # 排序保证顺序一致\n", + " \n", + " # 2. 构建BoW矩阵\n", + " bow_matrix = []\n", + " for doc in docs:\n", + " vec = [0] * len(vocab)\n", + " for word in doc:\n", + " if word in vocab:\n", + " vec[vocab.index(word)] += 1\n", + " bow_matrix.append(vec)\n", + " \n", + " return vocab, bow_matrix\n", + "\n", + "\n", + "# 示例1:中文文档(用空格分词)\n", + "docs = [\n", + " [\"Python\", \"是\", \"编程\", \"语言\"],\n", + " [\"Java\", \"是\", \"编程\", \"语言\"],\n", + "]\n", + "\n", + "vocab, bow_matrix = simple_bow(docs)\n", + "\n", + "print(\"【示例1】文档集合:\")\n", + "for i, doc in enumerate(docs):\n", + " print(f\" Doc{i+1}: {' '.join(doc)}\")\n", + "print()\n", + "\n", + "print(f\"词表: {vocab}\")\n", + "print()\n", + "\n", + "print(\"BoW矩阵(每行是一个文档,每列是一个词):\")\n", + "for i, vec in enumerate(bow_matrix):\n", + " print(f\" Doc{i+1}: {vec}\")\n", + "print()\n", + "\n", + "# 详细解释\n", + "print(\"详细解释:\")\n", + "for i, doc in enumerate(docs):\n", + " print(f\"\\nDoc{i+1}: {' '.join(doc)}\")\n", + " for j, word in enumerate(vocab):\n", + " if bow_matrix[i][j] > 0:\n", + " print(f\" -> '{word}' 出现 {bow_matrix[i][j]} 次\")" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "==================================================\n", + "BoW词袋模型:更多示例\n", + "==================================================\n", + "文档集合:\n", + " Doc1: 我 爱 Python 编程\n", + " Doc2: Python 很 好 学\n", + " Doc3: 我 爱 写 代码\n", + "\n", + "词表: ['Python', '代码', '写', '好', '学', '很', '我', '爱', '编程']\n", + "\n", + "BoW矩阵:\n", + " Doc1: [1, 0, 0, 0, 0, 0, 1, 1, 1]\n", + " Doc2: [1, 0, 0, 1, 1, 1, 0, 0, 0]\n", + " Doc3: [0, 1, 1, 0, 0, 0, 1, 1, 0]\n", + "\n", + "表格形式:\n", + "Doc | Python | 代码 | 写 | 好 | 学 | 很\n", + "----------------------------------\n", + "Doc1 | 1 | 0 | 0 | 0 | 0 | 0\n", + "Doc2 | 1 | 0 | 0 | 1 | 1 | 1\n", + 
"Doc3 | 0 | 1 | 1 | 0 | 0 | 0\n" + ] + } + ], + "source": [ + "# 更多BoW示例(纯Python实现)\n", + "import numpy as np\n", + "\n", + "print(\"=\" * 50)\n", + "print(\"BoW词袋模型:更多示例\")\n", + "print(\"=\" * 50)\n", + "\n", + "def simple_bow(docs):\n", + " \"\"\"简单的BoW实现\"\"\"\n", + " vocab_set = set()\n", + " for doc in docs:\n", + " vocab_set.update(doc)\n", + " vocab = sorted(list(vocab_set))\n", + " bow_matrix = []\n", + " for doc in docs:\n", + " vec = [0] * len(vocab)\n", + " for word in doc:\n", + " if word in vocab:\n", + " vec[vocab.index(word)] += 1\n", + " bow_matrix.append(vec)\n", + " return vocab, bow_matrix\n", + "\n", + "docs = [\n", + " [\"我\", \"爱\", \"Python\", \"编程\"],\n", + " [\"Python\", \"很\", \"好\", \"学\"],\n", + " [\"我\", \"爱\", \"写\", \"代码\"]\n", + "]\n", + "\n", + "vocab, bow_matrix = simple_bow(docs)\n", + "\n", + "print(\"文档集合:\")\n", + "for i, doc in enumerate(docs):\n", + " print(f\" Doc{i+1}: {' '.join(doc)}\")\n", + "print()\n", + "\n", + "print(f\"词表: {vocab}\")\n", + "print()\n", + "\n", + "print(\"BoW矩阵:\")\n", + "for i, vec in enumerate(bow_matrix):\n", + " print(f\" Doc{i+1}: {vec}\")\n", + "\n", + "print()\n", + "\n", + "# 显示成表格\n", + "print(\"表格形式:\")\n", + "header = \"Doc | \" + \" | \".join(vocab[:6])\n", + "print(header)\n", + "print(\"-\" * len(header))\n", + "for i, row in enumerate(bow_matrix):\n", + " print(f\"Doc{i+1} | \" + \" | \".join(map(str, row[:6])))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6.2 BoW 的优缺点\n", + "\n", + "| 优点 | 缺点 |\n", + "|------|------|\n", + "| **简单直观** | 忽略词序 |\n", + "| **容易实现** | \"我爱你\"和\"你爱我\"向量完全相同 |\n", + "| **计算速度快** | 所有词同等重要 |\n", + "| **适合基线模型** | 无法捕捉语义 |\n", + "| | 无法处理同义词:\"电脑\"和\"计算机\"完全不同 |\n", + "| | 维度很高(词表有多大,维度就多大) |" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "==================================================\n", + "BoW忽略词序的演示\n", + 
"==================================================\n", + "文档:\n", + " Doc1: 我爱你\n", + " Doc2: 你爱我\n", + " Doc3: 爱你我\n", + "\n", + "BoW矩阵:\n", + " Doc1: [1, 1, 1, 0]\n", + " Doc2: [1, 1, 1, 0]\n", + " Doc3: [0, 0, 0, 1]\n", + "\n", + "词表: ['你', '我', '爱', '爱你我']\n", + "\n", + "问题:这三个完全不同的句子,BoW向量完全相同!\n", + "Doc1: 我爱你(表达爱意)\n", + "Doc2: 你爱我(对方爱我)\n", + "Doc3: 爱你我(意义不明)\n", + "\n", + "结论:BoW模型丢失了词序信息!\n" + ] + } + ], + "source": [ + "# BoW忽略词序的演示\n", + "import numpy as np\n", + "\n", + "print(\"=\" * 50)\n", + "print(\"BoW忽略词序的演示\")\n", + "print(\"=\" * 50)\n", + "\n", + "def simple_bow(docs):\n", + " \"\"\"简单的BoW实现\"\"\"\n", + " vocab_set = set()\n", + " for doc in docs:\n", + " vocab_set.update(doc)\n", + " vocab = sorted(list(vocab_set))\n", + " bow_matrix = []\n", + " for doc in docs:\n", + " vec = [0] * len(vocab)\n", + " for word in doc:\n", + " if word in vocab:\n", + " vec[vocab.index(word)] += 1\n", + " bow_matrix.append(vec)\n", + " return vocab, bow_matrix\n", + "\n", + "# 两个完全不同的句子,但BoW向量相同\n", + "docs = [\n", + " [\"我\", \"爱\", \"你\"], # 正常语序\n", + " [\"你\", \"爱\", \"我\"], # 完全相反\n", + " [\"爱你我\"], # 没有空格(中文连续)\n", + "]\n", + "\n", + "vocab, bow_matrix = simple_bow(docs)\n", + "\n", + "print(\"文档:\")\n", + "for i, doc in enumerate(docs):\n", + " print(f\" Doc{i+1}: {''.join(doc)}\")\n", + "print()\n", + "\n", + "print(\"BoW矩阵:\")\n", + "for i, vec in enumerate(bow_matrix):\n", + " print(f\" Doc{i+1}: {vec}\")\n", + "print()\n", + "\n", + "print(f\"词表: {vocab}\")\n", + "print()\n", + "\n", + "print(\"问题:这三个完全不同的句子,BoW向量完全相同!\")\n", + "print(\"Doc1: 我爱你(表达爱意)\")\n", + "print(\"Doc2: 你爱我(对方爱我)\")\n", + "print(\"Doc3: 爱你我(意义不明)\")\n", + "print()\n", + "print(\"结论:BoW模型丢失了词序信息!\")" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "==================================================\n", + "练习题5答案\n", + 
"==================================================\n", + "文档集合:\n", + " Doc1: Python 是 编程 语言\n", + " Doc2: Java 是 编程 语言\n", + " Doc3: Python Python Python\n", + "\n", + "词表: ['Java', 'Python', '是', '编程', '语言']\n", + "\n", + "BoW矩阵(每行是一个文档的向量):\n", + " Doc1: [0, 1, 1, 1, 1]\n", + " Doc2: [1, 0, 1, 1, 1]\n", + " Doc3: [0, 3, 0, 0, 0]\n", + "\n", + "解析:\n", + " Doc1: [0, 1, 1, 1, 1]\n", + " - 'Python' 出现 1 次\n", + " - '是' 出现 1 次\n", + " - '编程' 出现 1 次\n", + " - '语言' 出现 1 次\n", + " Doc2: [1, 0, 1, 1, 1]\n", + " - 'Java' 出现 1 次\n", + " - '是' 出现 1 次\n", + " - '编程' 出现 1 次\n", + " - '语言' 出现 1 次\n", + " Doc3: [0, 3, 0, 0, 0]\n", + " - 'Python' 出现 3 次\n" + ] + } + ], + "source": [ + "# 练习题5答案\n", + "import numpy as np\n", + "\n", + "print(\"=\" * 50)\n", + "print(\"练习题5答案\")\n", + "print(\"=\" * 50)\n", + "\n", + "def simple_bow(docs):\n", + " \"\"\"简单的BoW实现\"\"\"\n", + " vocab_set = set()\n", + " for doc in docs:\n", + " vocab_set.update(doc)\n", + " vocab = sorted(list(vocab_set))\n", + " bow_matrix = []\n", + " for doc in docs:\n", + " vec = [0] * len(vocab)\n", + " for word in doc:\n", + " if word in vocab:\n", + " vec[vocab.index(word)] += 1\n", + " bow_matrix.append(vec)\n", + " return vocab, bow_matrix\n", + "\n", + "docs = [\n", + " [\"Python\", \"是\", \"编程\", \"语言\"],\n", + " [\"Java\", \"是\", \"编程\", \"语言\"],\n", + " [\"Python\", \"Python\", \"Python\"]\n", + "]\n", + "\n", + "vocab, bow_matrix = simple_bow(docs)\n", + "\n", + "print(\"文档集合:\")\n", + "for i, doc in enumerate(docs):\n", + " print(f\" Doc{i+1}: {' '.join(doc)}\")\n", + "print()\n", + "\n", + "print(f\"词表: {vocab}\")\n", + "print()\n", + "\n", + "print(\"BoW矩阵(每行是一个文档的向量):\")\n", + "for i, vec in enumerate(bow_matrix):\n", + " print(f\" Doc{i+1}: {vec}\")\n", + "print()\n", + "\n", + "print(\"解析:\")\n", + "for i, vec in enumerate(bow_matrix):\n", + " print(f\" Doc{i+1}: {vec}\")\n", + " for j, count in enumerate(vec):\n", + " if count > 0:\n", + " print(f\" - '{vocab[j]}' 出现 {count} 次\")" + ] + }, + { 
+ "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "# 第七部分:TF-IDF\n", + "\n", + "## 7.1 为什么需要TF-IDF?\n", + "\n", + "**BoW的问题**:所有词同等重要!\n", + "\n", + "```\n", + "文档A: \"Python 是 编程 语言,Python Python Python\"\n", + "文档B: \"Python 是 编程 语言\"\n", + "\n", + "BoW结果:\n", + " 文档A: Python=4, 是=1, 编程=1, 语言=1\n", + " 文档B: Python=1, 是=1, 编程=1, 语言=1\n", + "\n", + "问题:\"Python\"在A中出现4次,在B中出现1次\n", + " 但\"是\"、\"编程\"、\"语言\"出现次数相同\n", + " 我们希望\"Python\"的权重更高(因为它更重要)\n", + "```\n", + "\n", + "**关键洞察**:\n", + "- 高频出现的词 ≠ 一定重要(\"的\"、\"了\"在所有文章都出现)\n", + "- 罕见词 ≠ 不重要(\"TensorFlow\"只在AI文章出现,很重要)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 7.2 TF-IDF公式\n", + "\n", + "**TF-IDF = 词频(TF) × 逆文档频率(IDF)**\n", + "\n", + "```\n", + "TF = 这个词在本文中出现了多少次\n", + "IDF = log(总文档数 / 包含该词的文档数)\n", + "\n", + "TF-IDF = TF × IDF\n", + "```\n", + "\n", + "### IDF的含义\n", + "\n", + "| 词 | 在多少文档出现 | IDF值 | 解释 |\n", + "|----|----------------|-------|------|\n", + "| \"的\" | 所有文档 | log(很高) ≈ 0 | 到处都是,不重要 |\n", + "| \"Python\" | 少数文档 | log(中等) = 高 | 较独特,重要 |\n", + "| \"TensorFlow\" | 极少数文档 | log(很低) = 更高 | 很独特,非常重要 |\n", + "| \"AI\" | 只有1篇 | log(总文档数/1) = 最高 | 最独特,最重要 |" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "==================================================\n", + "TF-IDF词频-逆文档频率演示\n", + "==================================================\n", + "文档集合:\n", + " Doc1: Python 编程 语言\n", + " Doc2: Python Python Python\n", + " Doc3: Java 编程 语言\n", + "\n", + "词表: ['Java', 'Python', '编程', '语言']\n", + "\n", + "IDF值: [1.4055, 1.0, 1.0, 1.0]\n", + "\n", + "TF-IDF矩阵:\n", + " Doc1: [0.0, 1.0, 1.0, 1.0]\n", + " Doc2: [0.0, 3.0, 0.0, 0.0]\n", + " Doc3: [1.4055, 0.0, 1.0, 1.0]\n", + "\n", + "详细分析:\n", + "\n", + "Doc1: Python 编程 语言\n", + " 'Python': TF-IDF = 1.0000\n", + " '编程': TF-IDF = 1.0000\n", + " '语言': TF-IDF = 1.0000\n", + "\n", + "Doc2: 
Python Python Python\n", + " 'Python': TF-IDF = 3.0000\n", + "\n", + "Doc3: Java 编程 语言\n", + " 'Java': TF-IDF = 1.4055\n", + " '编程': TF-IDF = 1.0000\n", + " '语言': TF-IDF = 1.0000\n" + ] + } + ], + "source": [ + "# TF-IDF演示(纯Python实现)\n", + "import math\n", + "\n", + "print(\"=\" * 50)\n", + "print(\"TF-IDF词频-逆文档频率演示\")\n", + "print(\"=\" * 50)\n", + "\n", + "def simple_tfidf(docs):\n", + "    \"\"\"\n", + "    简单的TF-IDF实现\n", + " \n", + "    参数:\n", + "        docs: 文档列表,每篇文档已经是分词后的词列表\n", + "    返回:\n", + "        vocab: 词表\n", + "        tfidf_matrix: TF-IDF矩阵\n", + "        idf: 每个词的IDF值\n", + "    \"\"\"\n", + "    # 1. 构建词表和BoW\n", + "    vocab_set = set()\n", + "    for doc in docs:\n", + "        vocab_set.update(doc)\n", + "    vocab = sorted(list(vocab_set))\n", + " \n", + "    # 2. 构建BoW矩阵\n", + "    bow = []\n", + "    for doc in docs:\n", + "        vec = [0] * len(vocab)\n", + "        for word in doc:\n", + "            if word in vocab:\n", + "                vec[vocab.index(word)] += 1\n", + "        bow.append(vec)\n", + " \n", + "    n_docs = len(docs)\n", + " \n", + "    # 3. 计算IDF\n", + "    # 注意:这里用的是平滑版IDF(分母加1,结果再加1),与7.2节的基本公式 log(N/df) 不同\n", + "    idf = []\n", + "    for j, word in enumerate(vocab):\n", + "        df = sum(1 for vec in bow if vec[j] > 0)\n", + "        idf_j = math.log(n_docs / (df + 1)) + 1\n", + "        idf.append(idf_j)\n", + " \n", + "    # 4. 
计算TF-IDF\n", + "    tfidf = []\n", + "    for vec in bow:\n", + "        tfidf_vec = []\n", + "        for i, tf in enumerate(vec):\n", + "            tfidf_vec.append(tf * idf[i])\n", + "        tfidf.append(tfidf_vec)\n", + " \n", + "    return vocab, tfidf, idf\n", + "\n", + "docs = [\n", + "    [\"Python\", \"编程\", \"语言\"],\n", + "    [\"Python\", \"Python\", \"Python\"], # Python出现3次\n", + "    [\"Java\", \"编程\", \"语言\"],\n", + "]\n", + "\n", + "vocab, tfidf_matrix, idf = simple_tfidf(docs)\n", + "\n", + "print(\"文档集合:\")\n", + "for i, doc in enumerate(docs):\n", + "    print(f\" Doc{i+1}: {' '.join(doc)}\")\n", + "print()\n", + "\n", + "print(f\"词表: {vocab}\")\n", + "print()\n", + "print(f\"IDF值: {[round(x, 4) for x in idf]}\")\n", + "print()\n", + "\n", + "print(\"TF-IDF矩阵:\")\n", + "for i, vec in enumerate(tfidf_matrix):\n", + "    print(f\" Doc{i+1}: {[round(x, 4) for x in vec]}\")\n", + "print()\n", + "\n", + "print(\"详细分析:\")\n", + "for i, doc in enumerate(docs):\n", + "    print(f\"\\nDoc{i+1}: {' '.join(doc)}\")\n", + "    for j, score in enumerate(tfidf_matrix[i]):\n", + "        if score > 0:\n", + "            print(f\" '{vocab[j]}': TF-IDF = {score:.4f}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "==================================================\n", + "TF-IDF vs BoW 对比\n", + "==================================================\n", + "文档:\n", + " Doc1: Python 编程\n", + " Doc2: Java 编程\n", + " Doc3: Python Python Python\n", + "\n", + "BoW矩阵:\n", + " Doc1: [0, 1, 1]\n", + " Doc2: [1, 0, 1]\n", + " Doc3: [0, 3, 0]\n", + "\n", + "TF-IDF矩阵:\n", + " Doc1: [0.0, 1.0, 1.0]\n", + " Doc2: [1.4055, 0.0, 1.0]\n", + " Doc3: [0.0, 3.0, 0.0]\n", + "\n", + "重点分析:\n", + "Doc3 'Python Python Python':\n", + " BoW: Python出现3次\n", + " TF-IDF: Python的TF-IDF = 3.0000\n", + "\n", + "为什么Python的IDF比Java低?\n", + "因为Python出现在Doc1和Doc3两篇文档中,IDF只有1.0000;Java只出现在Doc2中,IDF为1.4055\n" + ] + } + ], + "source": [ + "# TF-IDF vs BoW 对比\n", + "import math\n", + "\n", +
"print(\"=\" * 50)\n", + "print(\"TF-IDF vs BoW 对比\")\n", + "print(\"=\" * 50)\n", + "\n", + "def simple_bow(docs):\n", + " vocab_set = set()\n", + " for doc in docs:\n", + " vocab_set.update(doc)\n", + " vocab = sorted(list(vocab_set))\n", + " bow_matrix = []\n", + " for doc in docs:\n", + " vec = [0] * len(vocab)\n", + " for word in doc:\n", + " if word in vocab:\n", + " vec[vocab.index(word)] += 1\n", + " bow_matrix.append(vec)\n", + " return vocab, bow_matrix\n", + "\n", + "def simple_tfidf(docs):\n", + " vocab_set = set()\n", + " for doc in docs:\n", + " vocab_set.update(doc)\n", + " vocab = sorted(list(vocab_set))\n", + " bow = []\n", + " for doc in docs:\n", + " vec = [0] * len(vocab)\n", + " for word in doc:\n", + " if word in vocab:\n", + " vec[vocab.index(word)] += 1\n", + " bow.append(vec)\n", + " \n", + " n_docs = len(docs)\n", + " idf = []\n", + " for j, word in enumerate(vocab):\n", + " df = sum(1 for vec in bow if vec[j] > 0)\n", + " idf_j = math.log(n_docs / (df + 1)) + 1\n", + " idf.append(idf_j)\n", + " \n", + " tfidf = []\n", + " for vec in bow:\n", + " tfidf_vec = []\n", + " for i, tf in enumerate(vec):\n", + " tfidf_vec.append(tf * idf[i])\n", + " tfidf.append(tfidf_vec)\n", + " \n", + " return vocab, tfidf, idf\n", + "\n", + "docs = [\n", + " [\"Python\", \"编程\"],\n", + " [\"Java\", \"编程\"],\n", + " [\"Python\", \"Python\", \"Python\"] # Python出现3次\n", + "]\n", + "\n", + "vocab_bow, bow_matrix = simple_bow(docs)\n", + "vocab_tfidf, tfidf_matrix, idf = simple_tfidf(docs)\n", + "\n", + "print(\"文档:\")\n", + "for i, doc in enumerate(docs):\n", + " print(f\" Doc{i+1}: {' '.join(doc)}\")\n", + "print()\n", + "\n", + "print(\"BoW矩阵:\")\n", + "for i, vec in enumerate(bow_matrix):\n", + " print(f\" Doc{i+1}: {vec}\")\n", + "print()\n", + "\n", + "print(\"TF-IDF矩阵:\")\n", + "for i, vec in enumerate(tfidf_matrix):\n", + " print(f\" Doc{i+1}: {[round(x, 4) for x in vec]}\")\n", + "print()\n", + "\n", + "# 重点分析Doc3\n", + "print(\"重点分析:\")\n", + 
"print(f\"Doc3 'Python Python Python':\")\n", + "print(f\" BoW: Python出现3次\")\n", + "print(f\" TF-IDF: Python的TF-IDF = {tfidf_matrix[2][0]:.4f}\")\n", + "print()\n", + "print(\"为什么Doc3的TF-IDF不是最高的?\")\n", + "print(\"因为Python在Doc1和Doc2也出现了,IDF值被稀释\")" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "==================================================\n", + "附加题答案\n", + "==================================================\n", + "文档:\n", + " Doc1: Python 编程\n", + " Doc2: Java 编程\n", + " Doc3: Python Python\n", + "\n", + "词表: ['Java', 'Python', '编程']\n", + "\n", + "IDF值: [1.4055, 1.0, 1.0]\n", + "\n", + "TF-IDF矩阵:\n", + " Doc1: [0.0, 1.0, 1.0]\n", + " Doc2: [1.4055, 0.0, 1.0]\n", + " Doc3: [0.0, 2.0, 0.0]\n", + "\n", + "问题1:为什么Python在Doc3中的TF-IDF值不是最高?\n", + "答:因为Python在Doc1、Doc2、Doc3中都出现了,\n", + " IDF = log(3/3) = 0,所以TF-IDF = 3 * 0 = 0!\n", + "\n", + "问题2:Java在Doc2中的TF-IDF值是多少?\n", + "答:Java在Doc2的TF-IDF值 = 1.4055\n", + " 因为Java只出现在Doc2中,其他文档没有,所以IDF值高\n" + ] + } + ], + "source": [ + "# 附加题答案\n", + "import math\n", + "\n", + "print(\"=\" * 50)\n", + "print(\"附加题答案\")\n", + "print(\"=\" * 50)\n", + "\n", + "def simple_tfidf(docs):\n", + " vocab_set = set()\n", + " for doc in docs:\n", + " vocab_set.update(doc)\n", + " vocab = sorted(list(vocab_set))\n", + " bow = []\n", + " for doc in docs:\n", + " vec = [0] * len(vocab)\n", + " for word in doc:\n", + " if word in vocab:\n", + " vec[vocab.index(word)] += 1\n", + " bow.append(vec)\n", + " \n", + " n_docs = len(docs)\n", + " idf = []\n", + " for j, word in enumerate(vocab):\n", + " df = sum(1 for vec in bow if vec[j] > 0)\n", + " idf_j = math.log(n_docs / (df + 1)) + 1\n", + " idf.append(idf_j)\n", + " \n", + " tfidf = []\n", + " for vec in bow:\n", + " tfidf_vec = []\n", + " for i, tf in enumerate(vec):\n", + " tfidf_vec.append(tf * idf[i])\n", + " tfidf.append(tfidf_vec)\n", + " \n", + " return vocab, tfidf, 
idf\n", + "\n", + "docs = [[\"Python\", \"编程\"], [\"Java\", \"编程\"], [\"Python\", \"Python\"]]\n", + "\n", + "vocab, tfidf_matrix, idf = simple_tfidf(docs)\n", + "\n", + "print(\"文档:\")\n", + "for i, doc in enumerate(docs):\n", + " print(f\" Doc{i+1}: {' '.join(doc)}\")\n", + "print()\n", + "\n", + "print(f\"词表: {vocab}\")\n", + "print()\n", + "print(f\"IDF值: {[round(x, 4) for x in idf]}\")\n", + "print()\n", + "\n", + "print(\"TF-IDF矩阵:\")\n", + "for i, vec in enumerate(tfidf_matrix):\n", + " print(f\" Doc{i+1}: {[round(x, 4) for x in vec]}\")\n", + "print()\n", + "\n", + "print(\"问题1:为什么Python在Doc3中的TF-IDF值不是最高?\")\n", + "print(\"答:因为Python在Doc1、Doc2、Doc3中都出现了,\")\n", + "print(\" IDF = log(3/3) = 0,所以TF-IDF = 3 * 0 = 0!\")\n", + "print()\n", + "print(\"问题2:Java在Doc2中的TF-IDF值是多少?\")\n", + "java_idx = vocab.index(\"Java\")\n", + "print(f\"答:Java在Doc2的TF-IDF值 = {tfidf_matrix[1][java_idx]:.4f}\")\n", + "print(\" 因为Java只出现在Doc2中,其他文档没有,所以IDF值高\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 7.3 TF-IDF的优缺点\n", + "\n", + "| 优点 | 缺点 |\n", + "|------|------|\n", + "| 考虑词的重要性 | 忽略词序 |\n", + "| 降低常见词权重 | 无法捕捉语义 |\n", + "| 提高独特词权重 | \"猫\"和\"狗\"的TF-IDF可能相似也可能不相似 |\n", + "| 可以提取关键词 | 无法处理同义词 \"电脑\" vs \"计算机\" |" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "# 第八部分:Word Embedding词嵌入\n", + "\n", + "## 8.1 BoW和TF-IDF的根本问题\n", + "\n", + "```python\n", + "# 位置编码的问题\n", + "\"猫\" → [1, 0, 0, ...] # 只是\"位置编码\"\n", + "\"狗\" → [0, 1, 0, ...] # 猫和狗的位置不同\n", + "\"小猫\" → [0, 0, 1, ...] 
# 但它们语义相近,向量却正交!\n", + "\n", + "# 问题:无法表达语义相似性!\n", + "# \"猫\"和\"狗\"都是动物,语义很相似\n", + "# 但在BoW/TF-IDF中,它们的向量可能完全不同\n", + "```\n", + "\n", + "### 词嵌入的核心思想\n", + "\n", + "```\n", + "不再用\"位置\"表示词,而是用\"语义空间\"表示词\n", + "\n", + "语义空间示例(二维简化):\n", + " ↑ 动物性\n", + " 狗 | ↑ 猫\n", + " | ↗\n", + " 0 |↗ ↑ 苹果\n", + " |___________→ 植物性\n", + " ↑ 香蕉\n", + " \n", + " 语义相近的词在空间中距离近\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "==================================================\n", + "词嵌入(Word Embedding)概念演示\n", + "==================================================\n", + "\n", + "词向量(简化版3维)示意:\n", + "维度含义: [动物性, 植物性, 其他/技术性]\n", + "\n", + " 猫: [0.9 0.1 0.2]\n", + " 狗: [0.8 0.3 0.1]\n", + " 小猫: [0.85 0.2 0.15]\n", + " 苹果: [0.1 0.2 0.9]\n", + " 香蕉: [0.1 0.1 0.85]\n", + " Python: [0.1 0. 0.9]\n", + " Java: [0.1 0. 0.85]\n", + "\n", + "语义相似度:\n", + " 猫 vs 狗: 0.965\n", + " 猫 vs 小猫: 0.992\n", + " 猫 vs 苹果: 0.337\n", + " 苹果 vs 香蕉: 0.995\n", + " Python vs Java: 1.000\n", + "\n", + "词嵌入的优势:\n", + " - 语义相似的词,向量也相似\n", + " - 可以做类比推理:国王-男人+女人=女王\n" + ] + } + ], + "source": [ + "# Word2Vec词嵌入的概念演示\n", + "import numpy as np\n", + "\n", + "print(\"=\" * 50)\n", + "print(\"词嵌入(Word Embedding)概念演示\")\n", + "print(\"=\" * 50)\n", + "print()\n", + "\n", + "# 假设这些是用Word2Vec等方法训练出来的词向量(简化版,3维)\n", + "# 实际中向量通常是50/100/300维\n", + "word_vectors = {\n", + " \"猫\": np.array([0.9, 0.1, 0.2]), # 动物属性高,其他低\n", + " \"狗\": np.array([0.8, 0.3, 0.1]), # 动物属性高\n", + " \"小猫\": np.array([0.85, 0.2, 0.15]), # 小动物,也像猫\n", + " \"苹果\": np.array([0.1, 0.2, 0.9]), # 水果属性高\n", + " \"香蕉\": np.array([0.1, 0.1, 0.85]), # 水果属性高\n", + " \"Python\": np.array([0.1, 0.0, 0.9]), # 编程语言\n", + " \"Java\": np.array([0.1, 0.0, 0.85]), # 编程语言\n", + "}\n", + "\n", + "print(\"词向量(简化版3维)示意:\")\n", + "print(\"维度含义: [动物性, 植物性, 其他/技术性]\")\n", + "print()\n", + "for word, vec in word_vectors.items():\n", + " print(f\" {word}: 
{vec}\")\n", + "print()\n", + "\n", + "# 计算相似度\n", + "print(\"语义相似度:\")\n", + "print(f\" 猫 vs 狗: {cosine_similarity(word_vectors['猫'], word_vectors['狗']):.3f}\")\n", + "print(f\" 猫 vs 小猫: {cosine_similarity(word_vectors['猫'], word_vectors['小猫']):.3f}\")\n", + "print(f\" 猫 vs 苹果: {cosine_similarity(word_vectors['猫'], word_vectors['苹果']):.3f}\")\n", + "print(f\" 苹果 vs 香蕉: {cosine_similarity(word_vectors['苹果'], word_vectors['香蕉']):.3f}\")\n", + "print(f\" Python vs Java: {cosine_similarity(word_vectors['Python'], word_vectors['Java']):.3f}\")\n", + "print()\n", + "print(\"词嵌入的优势:\")\n", + "print(\" - 语义相似的词,向量也相似\")\n", + "print(\" - 可以做类比推理:国王-男人+女人=女王\")" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "==================================================\n", + "词嵌入的类比推理\n", + "==================================================\n", + "\n", + "词向量(简化版):\n", + " King: [0.9 0.1 0.8 0.3]\n", + " Man: [0.8 0.1 0.2 0.5]\n", + " Woman: [0.1 0.8 0.2 0.5]\n", + " Queen: [0.1 0.9 0.8 0.3]\n", + "\n", + "维度含义: [皇室属性, 女性属性, 权力属性, 人类属性]\n", + "\n", + "King - Man + Woman = [0.2 0.8 0.8 0.3]\n", + "Queen = [0.1 0.9 0.8 0.3]\n", + "\n", + "相似度验证:\n", + " (King-Man+Woman) vs Queen: 0.994\n", + "\n", + "结论:词嵌入可以捕捉语义关系!\n", + " '国王' - '男人' + '女人' ≈ '女王'\n", + " 这说明词向量编码了语义信息!\n" + ] + } + ], + "source": [ + "# 词嵌入的类比推理演示\n", + "import numpy as np\n", + "\n", + "print(\"=\" * 50)\n", + "print(\"词嵌入的类比推理\")\n", + "print(\"=\" * 50)\n", + "print()\n", + "\n", + "# 经典例子:King - Man + Woman ≈ Queen\n", + "# 这个例子说明了词嵌入可以捕捉语义关系\n", + "\n", + "# 简化版词向量(实际中这些向量由神经网络学习得到)\n", + "king = np.array([0.9, 0.1, 0.8, 0.3]) # 皇室、男性、有权力\n", + "man = np.array([0.8, 0.1, 0.2, 0.5]) # 男性\n", + "woman = np.array([0.1, 0.8, 0.2, 0.5]) # 女性\n", + "queen = np.array([0.1, 0.9, 0.8, 0.3]) # 皇室、女性、有权力\n", + "\n", + "print(\"词向量(简化版):\")\n", + "print(f\" King: {king}\")\n", + "print(f\" Man: {man}\")\n", + 
"print(f\" Woman: {woman}\")\n", + "print(f\" Queen: {queen}\")\n", + "print()\n", + "print(\"维度含义: [皇室属性, 女性属性, 权力属性, 人类属性]\")\n", + "print()\n", + "\n", + "# 计算 King - Man + Woman\n", + "result = king - man + woman\n", + "print(f\"King - Man + Woman = {result}\")\n", + "print(f\"Queen = {queen}\")\n", + "print()\n", + "\n", + "# 相似度\n", + "print(\"相似度验证:\")\n", + "print(f\" (King-Man+Woman) vs Queen: {cosine_similarity(result, queen):.3f}\")\n", + "print()\n", + "\n", + "print(\"结论:词嵌入可以捕捉语义关系!\")\n", + "print(\" '国王' - '男人' + '女人' ≈ '女王'\")\n", + "print(\" 这说明词向量编码了语义信息!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 8.2 词嵌入的发展历史\n", + "\n", + "| 方法 | 年份 | 特点 |\n", + "|------|------|------|\n", + "| Word2Vec | 2013 | Google开源,开启词嵌入时代 |\n", + "| GloVe | 2014 | Stanford提出,基于全局共现矩阵 |\n", + "| FastText | 2016 | Facebook开源,支持子词 |\n", + "| ELMo | 2018 | 考虑上下文,动态词向量 |\n", + "| BERT | 2018 | Transformer架构,预训练大模型 |\n", + "| GPT系列 | 2018-现在 | 生成式AI,ChatGPT核心 |" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "==================================================\n", + "预训练词向量演示(使用内置示例向量)\n", + "==================================================\n", + "\n", + "注意:真实环境中加载Gensim预训练模型需要下载(约66MB)\n", + "本notebook使用内置示例向量进行演示\n", + "\n", + "词向量示例(每个词用一个5维向量表示):\n", + "维度含义: [动物性, 植物性, 技术性, 动态性, 抽象概念]\n", + "\n", + " cat : [0.9, 0.1, 0.2, 0.8, 0.3]\n", + " dog : [0.8, 0.2, 0.1, 0.9, 0.3]\n", + " bird : [0.7, 0.3, 0.1, 0.9, 0.2]\n", + " fish : [0.6, 0.2, 0.1, 0.8, 0.2]\n", + " apple : [0.1, 0.9, 0.3, 0.0, 0.2]\n", + " rose : [0.1, 0.8, 0.1, 0.0, 0.1]\n", + " python : [0.1, 0.0, 0.9, 0.0, 0.5]\n", + " java : [0.1, 0.0, 0.8, 0.0, 0.4]\n", + " computer : [0.1, 0.0, 0.9, 0.3, 0.4]\n", + " love : [0.3, 0.2, 0.1, 0.1, 0.9]\n", + " hate : [0.2, 0.1, 0.1, 0.1, 0.8]\n", + "\n", + 
"==================================================\n", + "1. 语义相似度计算\n", + "==================================================\n", + " cat vs dog : 0.987\n", + " cat vs apple : 0.244\n", + " python vs java : 0.998\n", + " python vs cat : 0.322\n", + " love vs hate : 0.993\n", + "\n", + "==================================================\n", + "2. 类比推理(Word2Vec核心能力)\n", + "==================================================\n", + "类比问题:man -> woman, king -> ?\n", + "\n", + " King = [0.6 0.1 0.3 0.3 0.6]\n", + " Man = [0.8 0.1 0.2 0.5 0.3]\n", + " Woman = [0.2 0.8 0.2 0.5 0.5]\n", + " King - Man + Woman = [-0. 0.8 0.3 0.3 0.8]\n", + " Queen (真实) = [0.2 0.9 0.3 0.3 0.6]\n", + "\n", + " 相似度: 0.969\n", + "\n", + "太棒了!词嵌入可以捕捉语义关系!\n", + "\n", + "==================================================\n", + "真实环境中加载Gensim预训练模型的方法\n", + "==================================================\n", + "如需加载真实的预训练词向量,可以运行:\n", + "\n", + " import gensim.downloader as api\n", + " model = api.load('glove-wiki-gigaword-50')\n", + "\n", + "这会下载约66MB的预训练词向量模型\n" + ] + } + ], + "source": [ + "# 实战:用预训练词向量演示词嵌入(跳过实际下载)\n", + "import numpy as np\n", + "\n", + "def cosine_similarity(a, b):\n", + " \"\"\"计算余弦相似度\"\"\"\n", + " dot = np.dot(a, b)\n", + " norm_a = np.linalg.norm(a)\n", + " norm_b = np.linalg.norm(b)\n", + " if norm_a == 0 or norm_b == 0:\n", + " return 0.0\n", + " return dot / (norm_a * norm_b)\n", + "\n", + "print(\"=\" * 50)\n", + "print(\"预训练词向量演示(使用内置示例向量)\")\n", + "print(\"=\" * 50)\n", + "print()\n", + "print(\"注意:真实环境中加载Gensim预训练模型需要下载(约66MB)\")\n", + "print(\"本notebook使用内置示例向量进行演示\")\n", + "print()\n", + "\n", + "# 使用内置的小规模词向量示例(模拟真实词向量)\n", + "# 维度: [动物性, 植物性, 技术性, 动态性, 抽象概念]\n", + "word_vectors = {\n", + " # 动物\n", + " \"cat\": np.array([0.9, 0.1, 0.2, 0.8, 0.3]),\n", + " \"dog\": np.array([0.8, 0.2, 0.1, 0.9, 0.3]),\n", + " \"bird\": np.array([0.7, 0.3, 0.1, 0.9, 0.2]),\n", + " \"fish\": np.array([0.6, 0.2, 0.1, 0.8, 0.2]),\n", + " # 植物\n", + " \"apple\": np.array([0.1, 
0.9, 0.3, 0.0, 0.2]),\n", + " \"rose\": np.array([0.1, 0.8, 0.1, 0.0, 0.1]),\n", + " # 技术\n", + " \"python\": np.array([0.1, 0.0, 0.9, 0.0, 0.5]),\n", + " \"java\": np.array([0.1, 0.0, 0.85, 0.0, 0.4]),\n", + " \"computer\": np.array([0.1, 0.0, 0.9, 0.3, 0.4]),\n", + " # 抽象概念\n", + " \"love\": np.array([0.3, 0.2, 0.1, 0.1, 0.9]),\n", + " \"hate\": np.array([0.2, 0.1, 0.1, 0.1, 0.8]),\n", + "}\n", + "\n", + "# 显示词向量\n", + "print(\"词向量示例(每个词用一个5维向量表示):\")\n", + "print(\"维度含义: [动物性, 植物性, 技术性, 动态性, 抽象概念]\")\n", + "print()\n", + "for word, vec in word_vectors.items():\n", + " print(f\" {word:12s}: [{vec[0]:.1f}, {vec[1]:.1f}, {vec[2]:.1f}, {vec[3]:.1f}, {vec[4]:.1f}]\")\n", + "\n", + "print()\n", + "print(\"=\" * 50)\n", + "print(\"1. 语义相似度计算\")\n", + "print(\"=\" * 50)\n", + "pairs = [\n", + " (\"cat\", \"dog\"), # 都是动物\n", + " (\"cat\", \"apple\"), # 动物 vs 植物\n", + " (\"python\", \"java\"), # 都是编程语言\n", + " (\"python\", \"cat\"), # 编程语言 vs 动物\n", + " (\"love\", \"hate\"), # 情感词\n", + "]\n", + "for w1, w2 in pairs:\n", + " sim = cosine_similarity(word_vectors[w1], word_vectors[w2])\n", + " print(f\" {w1:10s} vs {w2:10s}: {sim:.3f}\")\n", + "\n", + "print()\n", + "print(\"=\" * 50)\n", + "print(\"2. 
类比推理(Word2Vec核心能力)\")\n", + "print(\"=\" * 50)\n", + "print(\"类比问题:man -> woman, king -> ?\")\n", + "print()\n", + "\n", + "# 简化版类比:使用语义维度\n", + "# man=[0.8, 0.1, 0.2, 0.5, 0.3], woman=[0.2, 0.8, 0.2, 0.5, 0.5]\n", + "# king=[0.6, 0.1, 0.3, 0.3, 0.6], queen=[0.2, 0.9, 0.3, 0.3, 0.6]\n", + "man = np.array([0.8, 0.1, 0.2, 0.5, 0.3])\n", + "woman = np.array([0.2, 0.8, 0.2, 0.5, 0.5])\n", + "king = np.array([0.6, 0.1, 0.3, 0.3, 0.6])\n", + "queen = np.array([0.2, 0.9, 0.3, 0.3, 0.6])\n", + "\n", + "# king - man + woman ≈ queen\n", + "result = king - man + woman\n", + "\n", + "print(f\" King = {king}\")\n", + "print(f\" Man = {man}\")\n", + "print(f\" Woman = {woman}\")\n", + "print(f\" King - Man + Woman = {np.round(result, 2)}\")\n", + "print(f\" Queen (真实) = {queen}\")\n", + "print()\n", + "print(f\" 相似度: {cosine_similarity(result, queen):.3f}\")\n", + "print()\n", + "print(\"太棒了!词嵌入可以捕捉语义关系!\")\n", + "print()\n", + "\n", + "# 真实环境中加载Gensim模型的方法(仅供参考,不执行)\n", + "print(\"=\" * 50)\n", + "print(\"真实环境中加载Gensim预训练模型的方法\")\n", + "print(\"=\" * 50)\n", + "print(\"如需加载真实的预训练词向量,可以运行:\")\n", + "print()\n", + "print(\" import gensim.downloader as api\")\n", + "print(\" model = api.load('glove-wiki-gigaword-50')\")\n", + "print()\n", + "print(\"这会下载约66MB的预训练词向量模型\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "# 第九部分:文本处理完整流程\n", + "\n", + "## 9.1 流程图\n", + "\n", + "```\n", + "┌──────────────────────────────────────────────────────────────────┐\n", + "│ 文本数据 │\n", + "│ \"今天天气真不错!\" │\n", + "└─────────────────────────┬────────────────────────────────────────┘\n", + " │\n", + " ▼\n", + "┌──────────────────────────────────────────────────────────────────┐\n", + "│ 1. 
文本预处理 │\n", + "│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │\n", + "│ │ 分词 │→ │ 去停用词│→ │ 统一大小│→ │ 去除标点│ │\n", + "│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │\n", + "│ \"今天/天气/真/不错\" → \"今天/天气/不错\" │\n", + "└─────────────────────────┬────────────────────────────────────────┘\n", + " │\n", + " ▼\n", + "┌──────────────────────────────────────────────────────────────────┐\n", + "│ 2. 文本向量化 │\n", + "│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │\n", + "│ │ BoW │ │ TF-IDF │ │ Embedding│ │ 预训练模型│ │\n", + "│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │\n", + "│ ↓ ↓ ↓ ↓ │\n", + "│ [1,0,2,0,1] [0.5,0,0.8] [0.9,0.3] [BERT向量] │\n", + "└─────────────────────────┬────────────────────────────────────────┘\n", + " │\n", + " ▼\n", + "┌──────────────────────────────────────────────────────────────────┐\n", + "│ 3. 下游任务 │\n", + "│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │\n", + "│ │ 分类 │ │ 相似度 │ │ 聚类 │ │ 生成 │ │\n", + "│ │ 情感分析│ │ 文本匹配│ │ 主题分组│ │ 聊天机器人│ │\n", + "│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │\n", + "└──────────────────────────────────────────────────────────────────┘\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 9.2 各环节详解\n", + "\n", + "### 环节1:文本预处理\n", + "\n", + "| 步骤 | 输入 | 输出 | 作用 |\n", + "|------|------|------|------|\n", + "| 分词 | \"今天天气不错\" | [\"今天\", \"天气\", \"不错\"] | 把文本切成词 |\n", + "| 去停用词 | [\"今天\", \"天气\", \"不错\"] | [\"天气\", \"不错\"] | 去掉\"的、了、在\"等无意义词 |\n", + "| 统一大小写 | [\"Python\", \"python\"] | [\"python\", \"python\"] | 归一化 |\n", + "| 去标点 | [\"语言!!!\"] | [\"语言\"] | 清理噪音 |" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 环节2:文本向量化\n", + "\n", + "| 方法 | 适用场景 | 不适用场景 |\n", + "|------|---------|-----------|\n", + "| BoW | 基线模型、快速原型 | 需要语义理解 |\n", + "| TF-IDF | 文本分类、关键词提取 | 同义词识别 |\n", + "| Embedding | 语义相似度、推荐系统 | 需要精确匹配 |\n", + "| 预训练模型 | 通用NLP任务 | 计算资源有限 |" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 环节3:下游任务\n", 
+ "\n", + "```python\n", + "# 分类任务:\n", + "\"这部电影太好看了!\" → 情感分类 → 正面 ✅\n", + "\n", + "# 相似度任务:\n", + "\"如何学习Python?\" → 查找相似文档 → \"Python入门教程\" ✅\n", + "\n", + "# 生成任务:\n", + "\"今天天气\" → GPT续写 → \"今天天气真好,适合出去玩\" ✅\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "# 第十部分:实战用jieba进行中文分词\n", + "\n", + "## 10.1 安装jieba\n", + "\n", + "```bash\n", + "pip install jieba\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# 安装jieba(装到当前内核所用的Python环境;安装失败时会抛出异常)\n", + "import subprocess\n", + "import sys\n", + "\n", + "subprocess.run([sys.executable, '-m', 'pip', 'install', 'jieba', '-q'], check=True)\n", + "\n", + "print(\"jieba安装完成!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 10.2 基础分词\n", + "\n", + "jieba支持三种分词模式:\n", + "\n", + "| 模式 | 说明 | 适用场景 |\n", + "|------|------|---------|\n", + "| 精确模式 | 试图将句子最精确地切开,适合文本分析 | **默认,推荐** |\n", + "| 全模式 | 把所有可能的词都扫描出来,速度快,但不能解决歧义 | 速度要求高 |\n", + "| 搜索引擎模式 | 在精确模式基础上,对长词再次切分 | 搜索引擎 |" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import jieba\n", + "\n", + "print(\"=\" * 50)\n", + "print(\"jieba分词演示\")\n", + "print(\"=\" * 50)\n", + "\n", + "text = \"我喜欢深度学习和人工智能\"\n", + "\n", + "print(f\"原文: {text}\")\n", + "print()\n", + "\n", + "# 精确模式(默认)\n", + "words精确 = list(jieba.cut(text, cut_all=False))\n", + "print(f\"精确模式: {' / '.join(words精确)}\")\n", + "\n", + "# 全模式\n", + "words全 = list(jieba.cut(text, cut_all=True))\n", + "print(f\"全模式: {' / '.join(words全)}\")\n", + "\n", + "# 搜索引擎模式\n", + "words搜索 = list(jieba.cut_for_search(text))\n", + "print(f\"搜索模式: {' / '.join(words搜索)}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# 更多分词示例\n", + "import jieba\n", + "\n", + "print(\"=\" * 50)\n", + "print(\"更多分词示例\")\n", + "print(\"=\" * 50)\n", + "\n", + "examples = [\n", + " \"今天天气真不错\",\n", + " \"人工智能是未来的发展方向\",\n", +
 \"Python是一门非常流行的编程语言\",\n", + " \"小明毕业于清华大学计算机系\",\n", + " \"我今天在京东买了一部iPhone手机\"\n", + "]\n", + "\n", + "for i, text in enumerate(examples):\n", + " words = list(jieba.cut(text))\n", + " print(f\"{i+1}. {text}\")\n", + " print(f\" → {' / '.join(words)}\")\n", + " print()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 10.3 词性标注\n", + "\n", + "jieba支持词性标注,可以标注每个词是名词、动词、形容词等。\n", + "\n", + "| 词性代码 | 含义 | 示例 |\n", + "|----------|------|------|\n", + "| n | 名词 | 人、山、电脑 |\n", + "| v | 动词 | 跑、吃、学习 |\n", + "| a | 形容词 | 漂亮、好吃、优秀 |\n", + "| d | 副词 | 很、非常、慢慢 |\n", + "| m | 数词 | 一、百、千 |\n", + "| q | 量词 | 个、本、件 |" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import jieba.posseg as pseg\n", + "\n", + "print(\"=\" * 50)\n", + "print(\"jieba词性标注演示\")\n", + "print(\"=\" * 50)\n", + "\n", + "text = \"我喜欢深度学习和人工智能\"\n", + "\n", + "print(f\"原文: {text}\")\n", + "print()\n", + "\n", + "words = pseg.cut(text)\n", + "print(\"分词 + 词性标注:\")\n", + "for word, flag in words:\n", + " print(f\" {word}: {flag}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 10.4 停用词处理\n", + "\n", + "停用词是在文本处理中需要过滤掉的常见词,如\"的\"、\"了\"、\"在\"等。\n", + "\n", + "这些词在所有文档中都可能出现,对区分文档没有帮助。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import jieba\n", + "\n", + "print(\"=\" * 50)\n", + "print(\"停用词处理演示\")\n", + "print(\"=\" * 50)\n", + "\n", + "# 常见停用词列表\n", + "stopwords = set(['的', '了', '在', '是', '我', '有', '和', '就', '不', '人', '都', '一', '一个', '上', '也', '很', '到', '说', '要', '去', '你', '会', '着', '没有', '看', '好', '自己', '这'])\n", + "\n", + "text = \"人工智能是未来的发展方向,也是当前科技领域的热门话题\"\n", + "\n", + "print(f\"原文: {text}\")\n", + "print()\n", + "\n", + "# 不使用停用词\n", + "words_all = list(jieba.cut(text))\n", + "print(f\"不使用停用词: {' / '.join(words_all)}\")\n", + "\n", + "# 使用停用词\n", + "words_filtered = [w for w in words_all if w not in
stopwords]\n", + "print(f\"使用停用词: {' / '.join(words_filtered)}\")\n", + "print()\n", + "\n", + "# 更完整的停用词表可以从网上下载\n", + "print(\"提示:实际项目中可以从以下地方获取停用词表:\")\n", + "print(\" - 哈工大停用词表\")\n", + "print(\" - 百度停用词表\")\n", + "print(\" - 四川大学机器学习实验室停用词表\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# 实战:完整的文本预处理流程\n", + "import jieba\n", + "\n", + "print(\"=\" * 50)\n", + "print(\"完整的文本预处理流程\")\n", + "print(\"=\" * 50)\n", + "\n", + "# 示例文档集合\n", + "docs = [\n", + " \"今天天气真不错!适合出去玩。\",\n", + " \"Python是一门很棒的编程语言。\",\n", + " \"人工智能和机器学习是未来的发展方向。\",\n", + " \"今天在咖啡馆喝了一杯很好喝的拿铁。\"\n", + "]\n", + "\n", + "# 停用词表\n", + "stopwords = set(['的', '了', '在', '是', '我', '有', '和', '就', '不', '人', '都', '一', '一个', '上', '也', '很', '到', '说', '要', '去', '你', '会', '着', '没有', '看', '好', '自己', '这', '!', '。', ','])\n", + "\n", + "def preprocess_text(text):\n", + " \"\"\"完整的文本预处理流程\"\"\"\n", + " # 1. 分词\n", + " words = jieba.cut(text)\n", + " \n", + " # 2. 去除停用词\n", + " words = [w for w in words if w not in stopwords and len(w) > 0]\n", + " \n", + " # 3. 去除空格\n", + " words = [w for w in words if w.strip()]\n", + " \n", + " return words\n", + "\n", + "print(\"预处理结果:\")\n", + "for i, doc in enumerate(docs):\n", + " words = preprocess_text(doc)\n", + " print(f\"\\nDoc{i+1}: {doc}\")\n", + " print(f\" → {' / '.join(words)}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# 实战:jieba分词 + TF-IDF完整流程\n", + "import jieba\n", + "import math\n", + "\n", + "print(\"=\" * 50)\n", + "print(\"实战:jieba分词 + TF-IDF完整流程\")\n", + "print(\"=\" * 50)\n", + "\n", + "def simple_tfidf_tokenized(docs, stopwords=None):\n", + " \"\"\"\n", + " 结合分词的TF-IDF实现\n", + " 参数:\n", + " docs: 原始文档列表\n", + " stopwords: 停用词集合\n", + " 返回:\n", + " vocab, tfidf_matrix\n", + " \"\"\"\n", + " # 1. 
分词\n", + " tokenized = []\n", + " for doc in docs:\n", + " words = jieba.cut(doc)\n", + " if stopwords:\n", + " words = [w for w in words if w not in stopwords and len(w) > 1]\n", + " else:\n", + " words = [w for w in words if len(w) > 1]\n", + " tokenized.append(words)\n", + " \n", + " # 2. 构建词表\n", + " vocab_set = set()\n", + " for doc in tokenized:\n", + " vocab_set.update(doc)\n", + " vocab = sorted(list(vocab_set))\n", + " \n", + " # 3. 构建TF矩阵并计算IDF\n", + " n_docs = len(tokenized)\n", + " tf_matrix = []\n", + " df_dict = {word: 0 for word in vocab}\n", + " \n", + " for doc in tokenized:\n", + " vec = [0] * len(vocab)\n", + " for word in doc:\n", + " if word in vocab:\n", + " idx = vocab.index(word)\n", + " vec[idx] += 1\n", + " tf_matrix.append(vec)\n", + " \n", + " # 计算DF\n", + " for vec in tf_matrix:\n", + " for j, count in enumerate(vec):\n", + " if count > 0:\n", + " word = vocab[j]\n", + " df_dict[word] += 1\n", + " \n", + " # 计算IDF\n", + " idf = []\n", + " for word in vocab:\n", + " df = df_dict[word]\n", + " idf_j = math.log(n_docs / (df + 1)) + 1\n", + " idf.append(idf_j)\n", + " \n", + " # 计算TF-IDF\n", + " tfidf = []\n", + " for vec in tf_matrix:\n", + " tfidf_vec = [vec[i] * idf[i] for i in range(len(vec))]\n", + " tfidf.append(tfidf_vec)\n", + " \n", + " return vocab, tfidf, tokenized\n", + "\n", + "# 示例文档集合\n", + "docs = [\n", + " \"Python是一门很棒的编程语言\",\n", + " \"人工智能是未来的发展方向\",\n", + " \"深度学习是机器学习的一个分支\",\n", + " \"Python和Java都是很流行的编程语言\"\n", + "]\n", + "\n", + "# 停用词\n", + "stopwords = set([\"的\", \"是\", \"一个\", \"很\", \"和\", \"在\", \"了\"])\n", + "\n", + "vocab, tfidf_matrix, tokenized = simple_tfidf_tokenized(docs, stopwords)\n", + "\n", + "print(\"文档集合:\")\n", + "for i, doc in enumerate(docs):\n", + " print(f\" Doc{i+1}: {doc}\")\n", + "print()\n", + "\n", + "print(f\"分词结果:\")\n", + "for i, words in enumerate(tokenized):\n", + " print(f\" Doc{i+1}: {' / '.join(words)}\")\n", + "print()\n", + "\n", + "print(f\"词表(共{len(vocab)}个词):\")\n", +
"print(f\" {vocab}\")\n", + "print()\n", + "\n", + "print(\"TF-IDF矩阵:\")\n", + "for i, vec in enumerate(tfidf_matrix):\n", + " # 只显示非零值\n", + " nonzero = [(vocab[j], round(vec[j], 4)) for j in range(len(vec)) if vec[j] > 0]\n", + " print(f\" Doc{i+1}: {nonzero}\")\n", + "\n", + "print()\n", + "\n", + "# 找每个文档最重要的词\n", + "print(\"每个文档最重要的词(TF-IDF值最高):\")\n", + "for i, vec in enumerate(tfidf_matrix):\n", + " max_idx = max(range(len(vec)), key=lambda j: vec[j])\n", + " max_score = vec[max_idx]\n", + " if max_score > 0:\n", + " print(f\" Doc{i+1}: '{vocab[max_idx]}' (TF-IDF={max_score:.4f})\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "jp-MarkdownHeadingCollapsed": true + }, + "source": [ + "---\n", + "\n", + "# 📋 总结\n", + "\n", + "## 本章核心概念\n", + "\n", + "```\n", + "文本数据处理\n", + " │\n", + " ├── 核心问题:文本(符号) → 向量(数字)\n", + " │\n", + " ├── 向量化方法\n", + " │ ├── BoW(词袋模型)\n", + " │ │ └── 核心:统计词频,忽略顺序\n", + " │ │\n", + " │ ├── TF-IDF(词频-逆文档频率)\n", + " │ │ └── 核心:词的重要性 × 词的独特性\n", + " │ │\n", + " │ └── Word Embedding(词嵌入)\n", + " │ └── 核心:用语义空间表示词\n", + " │\n", + " └── 处理流程\n", + " ├── 文本预处理(分词、去停用词)\n", + " ├── 向量化\n", + " └── 下游任务(分类、相似度、生成)\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 关键公式速查\n", + "\n", + "| 方法 | 公式 | 含义 |\n", + "|------|------|------|\n", + "| 向量加法 | [1,2] + [3,4] = [4,6] | 对应位置相加 |\n", + "| 向量数乘 | 2 × [1,2] = [2,4] | 每个元素乘以标量 |\n", + "| 向量点积 | [1,2] · [3,4] = 11 | 对应相乘再求和 |\n", + "| 向量长度 | ‖[3,4]‖ = √(3²+4²) = 5 | 勾股定理 |\n", + "| 余弦相似度 | cos(θ) = (A·B) / (‖A‖×‖B‖) | 向量相似程度 |\n", + "| TF-IDF | TF × IDF | 词频 × 逆文档频率 |\n", + "\n", + "---\n", + "\n", + "> **记住:文本向量化的核心目标是把\"符号\"变成\"可计算的数值向量\"!**" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", +
"nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.6" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}