{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 3-2-1 文本数据处理导论\n", "## 课堂演示notebook\n", "\n", "---\n", "\n", "## 目录\n", "\n", "1. [什么是文本数据?](#第一部分-什么是文本数据)\n", "2. [计算机如何读取文本?](#第二部分-计算机如何读取文本)\n", "3. [向量基础入门](#第三部分-向量基础入门)\n", "4. [余弦相似度](#第四部分-余弦相似度)\n", "5. [文本向量化的核心思想](#第五部分-文本向量化的核心思想)\n", "6. [BoW词袋模型](#第六部分-bow词袋模型)\n", "7. [TF-IDF词频-逆文档频率](#第七部分-tf-idf)\n", "8. [Word Embedding词嵌入](#第八部分-word-embedding词嵌入)\n", "9. [文本处理完整流程](#第九部分-文本处理完整流程)\n", "10. [实战:用jieba进行中文分词](#第十部分-实战用jieba进行中文分词)\n", "\n", "---\n", "\n", "**注意**:运行本notebook需要安装以下依赖:\n", "```bash\n", "pip install numpy matplotlib jieba\n", "```\n", "- BoW和TF-IDF代码使用纯Python+NumPy实现,不依赖sklearn\n", "- 如果服务器没有中文字体,图表中的中文可能显示为方块,这是正常现象。\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 第一部分:什么是文本数据?\n", "\n", "## 1.1 文本数据的定义\n", "\n", "**文本数据**是由文字、符号组成的序列信息,是人类语言在计算机中的表示形式。\n", "\n", "### 生活中的文本数据例子\n", "\n", "| 类型 | 示例 |\n", "|------|------|\n", "| 一句话 | \"今天天气真好\" |\n", "| 一篇文章 | 一篇新闻报道 |\n", "| 一条评论 | \"这家餐厅的菜太好吃了!\" |\n", "| 一段对话 | \"你好,请问这本书多少钱?\" |\n", "| 一首诗 | \"床前明月光,疑是地上霜\" |\n", "| 一段代码 | `print('Hello World')` |\n", "| 一封邮件 | 包含正文、收件人、发件人等 |\n", "| 聊天记录 | 微信对话、短信 |\n", "\n", "**简单来说:只要是文字组成的信息,都是文本数据!**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.2 文本数据的特点\n", "\n", "文本数据与图像、音频等数据有显著区别:\n", "\n", "| 特点 | 说明 | 示例 |\n", "|------|------|------|\n", "| **离散符号** | 由离散的字符/词组成,不是连续的数值 | \"hello\" 由 h,e,l,l,o 这5个字符组成 |\n", "| **序列性** | 符号按特定顺序排列,顺序改变意思就改变 | \"我爱你\" ≠ \"你爱我\" |\n", "| **语义丰富** | 同样的词在不同场景意思可能不同 | \"苹果\"可以是水果或手机品牌 |\n", "| **上下文相关** | 词的意思依赖上下文 | \"他打了猫,猫跑了\" 中两个\"猫\"意思相同 |\n", "| **歧义性** | 同样的话可能有多重理解 | \"天气真不错\"可以是正面或反讽 |\n", "\n", "### 思考:序列性有多重要?\n", "\n", "```\n", "文本1: \"我吃了饭\"\n", "文本2: \"饭了我吃\"\n", "文本3: \"饭吃了我\"\n", "\n", "这三个文本由完全相同的字符组成,但顺序不同,意思也完全不同!\n", "这说明:文本的顺序承载了重要的语义信息。\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "# 第二部分:计算机如何\"读取\"文本?\n", "\n", "## 2.1 对比:图像数据 vs 
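text data storage\n", "\n", "### Reading image data\n", "\n", "```\n", "Image file (.jpg/.png)\n", "    ↓\n", "The computer reads the pixel values (each channel of each pixel is a number from 0 to 255)\n", "    ↓\n", "Stored as a 3-D array [height, width, channels (RGB)]\n", "    ↓\n", "One 1920×1080 color image = 1920 × 1080 × 3 = 6,220,800 numbers\n", "```\n", "\n", "**The essence of an image: a dense numeric array that the computer can process directly!**\n", "\n", "### Reading text data\n", "\n", "```\n", "Text file (.txt/.md/.py)\n", "    ↓\n", "The computer decodes the bytes with a character encoding (ASCII/UTF-8/GBK)\n", "    ↓\n", "Stored as a sequence of characters (each character has a numeric code)\n", "    ↓\n", "\"Python\" → [80, 121, 116, 104, 111, 110] (ASCII codes)\n", "```\n", "\n", "**The essence of text: a sequence of symbols that needs extra processing before a computer can work with its meaning!**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.2 Character encoding: representing characters with numbers\n", "\n", "### ASCII (English letters and some symbols)\n", "\n", "ASCII uses the numbers 0-127 to represent 128 characters:\n", "\n", "| Character | ASCII code | Description |\n", "|------|---------|------|\n", "| 'A' | 65 | Uppercase letter |\n", "| 'B' | 66 | Uppercase letter |\n", "| ... | ... | ... |\n", "| 'Z' | 90 | Uppercase letter |\n", "| 'a' | 97 | Lowercase letter |\n", "| 'b' | 98 | Lowercase letter |\n", "| ... | ... | ... |\n", "| 'z' | 122 | Lowercase letter |\n", "| '0' | 48 | Digit |\n", "| '1' | 49 | Digit |\n", "| ... | ... | ... |\n", "| '9' | 57 | Digit |\n", "\n", "### UTF-8 (covers the world's writing systems, including Chinese)\n", "\n", "UTF-8 is a variable-length encoding of Unicode: ASCII characters take 1 byte, most Chinese characters take 3 bytes (rare ones take 4), and emoji take 4 bytes. The table lists each character's Unicode code point, which is the value `ord()` returns, together with its UTF-8 length:\n", "\n", "| Character | Unicode code point | UTF-8 length |\n", "|------|-------------|--------|\n", "| '中' | 20013 | 3 bytes |\n", "| '文' | 25991 | 3 bytes |\n", "| 'P' | 80 | 1 byte |\n", "| 'y' | 121 | 1 byte |\n", "| '👍' | 128077 | 4 bytes (emoji) |" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==================================================\n", "英文文本的字符编码\n", "==================================================\n", "文本: Hello\n", "每个字符的ASCII码: [72, 101, 108, 108, 111]\n", "\n", " 'H' -> 72\n", " 'e' -> 101\n", " 'l' -> 108\n", " 'l' -> 108\n", " 'o' -> 111\n", "\n", "==================================================\n", "中文文本的字符编码\n", "==================================================\n", "文本: 你好\n", "每个字符的UTF-8编码值: [20320, 22909]\n", "\n", " '你' -> 20320\n", " '好' -> 22909\n" ] } ], "source": [ "# Live demo: inspect the numeric code of each character\n", "\n", "# English example\n", 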
"text_en = \"Hello\"\n", "print(\"=\" * 50)\n", "print(\"英文文本的字符编码\")\n", "print(\"=\" * 50)\n", "print(f\"文本: {text_en}\")\n", "print(f\"每个字符的ASCII码: {[ord(c) for c in text_en]}\")\n", "print()\n", "\n", "# 逐个显示\n", "for c in text_en:\n", " print(f\" '{c}' -> {ord(c)}\")\n", "\n", "print()\n", "print(\"=\" * 50)\n", "print(\"中文文本的字符编码\")\n", "print(\"=\" * 50)\n", "\n", "# 中文例子\n", "text_cn = \"你好\"\n", "print(f\"文本: {text_cn}\")\n", "print(f\"每个字符的UTF-8编码值: {[ord(c) for c in text_cn]}\")\n", "print()\n", "\n", "# 逐个显示\n", "for c in text_cn:\n", " print(f\" '{c}' -> {ord(c)}\")" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "验证:数字编码转字符\n", "\n", "chr(65) = 'A' # 应该是大写字母 A\n", "chr(97) = 'a' # 应该是小写字母 a\n", "chr(20013) = '中' # 应该是中文'中'\n", "chr(25991) = '文' # 应该是中文'文'\n" ] } ], "source": [ "# 用chr()函数反向验证:数字编码转字符\n", "print(\"验证:数字编码转字符\")\n", "print()\n", "\n", "# 65是大写字母A\n", "print(f\"chr(65) = '{chr(65)}' # 应该是大写字母 A\")\n", "\n", "# 97是小写字母a\n", "print(f\"chr(97) = '{chr(97)}' # 应该是小写字母 a\")\n", "\n", "# 20013是中文\"中\"\n", "print(f\"chr(20013) = '{chr(20013)}' # 应该是中文'中'\")\n", "\n", "# 25991是中文\"文\"\n", "print(f\"chr(25991) = '{chr(25991)}' # 应该是中文'文'\")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==================================================\n", "练习题1答案\n", "==================================================\n", "1. 'Hello' 的ASCII码:\n", "[72, 101, 108, 108, 111]\n", "\n", "2. 验证 chr(65):\n", "chr(65) = 'A'\n", "\n", "验证 A-Z 的ASCII码范围 (65-90):\n", "['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']\n" ] } ], "source": [ "# 练习题1答案:验证字符编码\n", "print(\"=\" * 50)\n", "print(\"练习题1答案\")\n", "print(\"=\" * 50)\n", "\n", "# 1. 用 ord() 函数打印 \"Hello\" 每个字符的ASCII码\n", "print(\"1. 
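'Hello' 的ASCII码:\")\n", "print([ord(c) for c in \"Hello\"])\n", "\n", "# 2. Verify that code 65 corresponds to the uppercase letter A\n", "print()\n", "print(\"2. 验证 chr(65):\")\n", "print(f\"chr(65) = '{chr(65)}'\")\n", "\n", "# Verify the whole range\n", "print()\n", "print(\"验证 A-Z 的ASCII码范围 (65-90):\")\n", "print([chr(i) for i in range(65, 91)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.3 What are computers good at, and what are they bad at?\n", "\n", "### Tasks computers are good at ✅\n", "\n", "| Task type | Example | Notes |\n", "|----------|------|------|\n", "| Numeric computation | 1 + 2 = 3 | Arithmetic, solving equations |\n", "| Logical decisions | if a > b then ... | Conditional branches, Boolean operations |\n", "| Matrix operations | Image convolution, matrix multiplication | The core of deep learning |\n", "| Exact matching | Checking whether two strings are identical | Database queries |\n", "| Rule-based pattern matching | Finding data that fits a rule | Regular expressions |\n", "| Storage and retrieval | Fast access to massive amounts of data | Search engines |\n", "\n", "### Tasks computers are bad at ❌\n", "\n", "| Task type | Example | Why it is hard |\n", "|----------|------|-------------|\n", "| Semantic understanding | Is \"今天天气真好\" (The weather is really nice today) good or bad? | Requires common sense and context |\n", "| Sentiment judgment | Is \"真是绝了\" (roughly \"That's really something\") praise or an insult? | Ambiguity, sarcasm |\n", "| Fuzzy reasoning | \"大概\" (probably), \"也许\" (maybe) | Cannot be handled with exact rules |\n", "| Creative writing | Writing poems or novels | Requires imagination |\n", "| Common-sense understanding | \"水往低处流\" (Water flows downhill) | Lacks physical common sense |\n", "| Resolving polysemy | What does \"苹果\" (apple) refer to? | Requires world knowledge |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Why are computers bad at understanding text?\n", "\n", "**Reason 1: text is made of \"symbols\", not \"numbers\"**\n", "\n", "```\n", "A computer's brain = a calculator (built to process numbers)\n", "Text = a pile of symbols (to a computer, it looks like gibberish)\n", "\n", "Numbers: 1, 2, 3, 100.5, -7 → the computer can compute with these directly\n", "Text: \"好\", \"bad\", \"hello\" → the computer has no idea what they mean\n", "```\n", "\n", "**Reason 2: meaning is not stated explicitly**\n", "\n", "```python\n", "# A piece of text as a human reads it\n", "# (\"He is in a bad mood today because it rained\"):\n", "text = \"他今天心情不太好,因为下雨了\"\n", "\n", "# What a human understands:\n", "# - \"心情不太好\" = unhappy\n", "# - \"因为下雨了\" = the cause is the rain\n", "# - Causal relation: rain → bad mood\n", "\n", "# All the computer sees:\n", "print(text)\n", "# Computer: ??? It does not grasp the causal link between the rain and the mood\n", "```\n", "\n", "**Reason 3: the same symbol means different things in different contexts**\n", "\n", "```\n", "Context 1: \"苹果真好吃\" (Apples are delicious) → the fruit\n", "\n", "Context 2: \"苹果手机真贵\" (Apple phones are expensive) → the phone brand (Apple)\n", "\n", "Context 3: \"牛顿被苹果砸到了\" (Newton was hit by an apple) → the fruit (the spark for the idea of universal gravitation)\n", "\n", "How can a computer tell? It needs the ability to understand context!\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Key conclusion: why do we need text vectorization?\n", "\n", "```\n", "┌─────────────────────────────────────────────────────────────┐\n", "│                        Core tension                         │\n", "│                                                             │\n", "│ Text (symbol sequence)  ←→  Computers (numeric computation) │\n", 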
"│ ↓ │\n", "│ 需要一座桥梁 │\n", "│ 这座桥梁就是 │\n", "│ 【文本向量化】 │\n", "│ │\n", "│ 文本 → 数值向量 → 计算机可以计算 → AI模型处理 │\n", "└─────────────────────────────────────────────────────────────┘\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "# 第三部分:向量基础入门\n", "\n", "## 3.1 什么是向量?\n", "\n", "**向量 = 有方向的量**,是数学中描述\"大小+方向\"的基本工具。\n", "\n", "### 生活中的向量例子\n", "\n", "| 例子 | 大小 | 方向 | 说明 |\n", "|------|------|------|------|\n", "| 速度 | 60 km/h | 向北 | 速度是向量 |\n", "| 力 | 10 N | 向右推 | 力是向量 |\n", "| 风向 | 5 m/s | 东南风 | 风向是向量 |\n", "| 位移 | 100 km | 北京→上海 | 位移是向量 |\n", "\n", "### 向量在数学中的表示\n", "\n", "**一维向量(数轴上的点)**:\n", "\n", "```\n", " ←———————————|———————————→\n", " -3 -2 -1 0 1 2 3\n", "\n", " 点A在位置 2 → 向量A = [2] (只有1个数字)\n", " 点B在位置 -3 → 向量B = [-3] (负数表示方向相反)\n", "```" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==================================================\n", "NumPy向量创建演示\n", "==================================================\n", "一维向量 v1 = [3]\n", "v1 有 1 个元素\n", "\n", "二维向量 v2 = [2 3]\n", "v2 有 2 个元素\n", "\n", "三维向量 v3 = [1 2 3]\n", "v3 有 3 个元素\n", "\n", "10维向量 v10 = [ 0.1 0.5 -0.3 0.8 0.2 -0.1 0.7 0.3 -0.2 0.6]\n", "v10 有 10 个元素\n" ] } ], "source": [ "# Python中使用NumPy创建向量\n", "import numpy as np\n", "\n", "print(\"=\" * 50)\n", "print(\"NumPy向量创建演示\")\n", "print(\"=\" * 50)\n", "\n", "# 一维向量(只有1个数字)\n", "v1 = np.array([3])\n", "print(f\"一维向量 v1 = {v1}\")\n", "print(f\"v1 有 {len(v1)} 个元素\")\n", "\n", "# 二维向量(2个数字,表示平面上的一个点)\n", "v2 = np.array([2, 3])\n", "print(f\"\\n二维向量 v2 = {v2}\")\n", "print(f\"v2 有 {len(v2)} 个元素\")\n", "\n", "# 三维向量(3个数字,表示立体空间的一个点)\n", "v3 = np.array([1, 2, 3])\n", "print(f\"\\n三维向量 v3 = {v3}\")\n", "print(f\"v3 有 {len(v3)} 个元素\")\n", "\n", "# 高维向量(机器学习中常用,几十维到几千维)\n", "v10 = np.array([0.1, 0.5, -0.3, 0.8, 0.2, -0.1, 0.7, 0.3, -0.2, 0.6])\n", "print(f\"\\n10维向量 v10 = {v10}\")\n", "print(f\"v10 有 {len(v10)} 个元素\")" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "### 二维向量的几何直观\n", "\n", "```\n", " y (纵坐标)\n", " ↑\n", " |\n", " 3 | * A(2,3)\n", " |\n", " 2 |\n", " |\n", " 1 | * B(4,1)\n", " |\n", " 0---+—————————————→ x (横坐标)\n", " 0 1 2 3 4 5\n", "\n", " 向量A = [2, 3] (横坐标2,纵坐标3)\n", " 向量B = [4, 1]\n", "\n", " 从原点(0,0)出发,到点(2,3)的箭头,就是向量A的图形表示\n", "```" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "findfont: Generic family 'sans-serif' not found because none of the following families were found: SimHei, Noto Sans CJK SC, WenQuanYi Micro Hei\n", "findfont: Generic family 'sans-serif' not found because none of the following families were found: SimHei, Noto Sans CJK SC, WenQuanYi Micro Hei\n", "findfont: Generic family 'sans-serif' not found because none of the following families were found: SimHei, Noto Sans CJK SC, WenQuanYi Micro Hei\n", "findfont: Generic family 'sans-serif' not found because none of the following families were found: SimHei, Noto Sans CJK SC, WenQuanYi Micro Hei\n", "findfont: Generic family 'sans-serif' not found because none of the following families were found: SimHei, Noto Sans CJK SC, WenQuanYi Micro Hei\n", "findfont: Generic family 'sans-serif' not found because none of the following families were found: SimHei, Noto Sans CJK SC, WenQuanYi Micro Hei\n", "findfont: Generic family 'sans-serif' not found because none of the following families were found: SimHei, Noto Sans CJK SC, WenQuanYi Micro Hei\n", "findfont: Generic family 'sans-serif' not found because none of the following families were found: SimHei, Noto Sans CJK SC, WenQuanYi Micro Hei\n", "findfont: Generic family 'sans-serif' not found because none of the following families were found: SimHei, Noto Sans CJK SC, WenQuanYi Micro Hei\n", "findfont: Generic family 'sans-serif' not found because none of the following families were found: SimHei, Noto Sans CJK SC, WenQuanYi Micro Hei\n", "findfont: Generic family 
'sans-serif' not found because none of the following families were found: SimHei, Noto Sans CJK SC, WenQuanYi Micro Hei\n" ] }, { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAArwAAAIrCAYAAAAN2Uq4AAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjgsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvwVt1zgAAAAlwSFlzAAAPYQAAD2EBqD+naQAAaLhJREFUeJzt3Xd8FHX+x/H3JpACIdQk9Ca9hA4GkIAgHGCJBZHfCbHrCR6IFRvFErtwyoGoEAsIB0pAQRBB4BAsgJHiiYAIqCShJhBMgMz8/lizsKRvNtnJ5PV8PPYhM/udmc/kG/CdyWdnHKZpmgIAAABsys/XBQAAAAAlicALAAAAWyPwAgAAwNYIvAAAALA1Ai8AAABsjcALAAAAWyPwAgAAwNYIvAAAALA1Ai8AAABsjcALAOWUw+FQ3759fV1GrtauXSuHw6FJkya5re/bt68cDodvirpIfHy8HA6H4uPjfV0KgAIQeAEU2u+//66pU6dq4MCBatiwoQICAlS7dm1df/31+uabb3LdJjugZL8qVqyomjVrqmPHjrr99tu1YsUKGYZRqOPv2rVLDodDrVq1KnDs448/LofDoeeee65I51gUv/76qxwOh2655ZYSO0ZBHnvsMTkcDsXFxeU7zjAMNWzYUP7+/jp48GApVVe2WWF+AXhHBV8XAKDseP311/XCCy/okksu0cCBAxUWFqbdu3crISFBCQkJmjdvnoYPH57rtg888IBCQkJkGIZOnDih//3vf5o7d65mz56tnj176sMPP1TDhg3zPX7Lli3Vu3dvbdiwQV999ZV69eqV6zjDMPTee+/J39/f9mHltttuU1xcnObMmaMJEybkOW7VqlU6ePCg/va3v6lBgwaSpP/973+qVKlSaZXqFe+9955Onz7t6zIkSddee60uvfRS1alTx9elACgAgRdAoXXv3l1r165VdHS02/r//ve/6t+/v/7xj38oJiZGgYGBObZ98MEHVbt2bbd1R44c0T//+U99+OGHGjRokDZv3qzKlSvnW8Ptt9+uDRs2aPbs2XkG3pUrV+q3337T0KFDVbdu3SKeZdnSrFkzRUdHa926dfrvf/+ryy67LNdxs2fPluT8+mUrzJVyqynoh6LSVLVqVVWtWtXXZQAoBFoaABTaddddlyPsStJll12mfv366fjx49q+fXuh91erVi198MEHuvzyy/XTTz9p+vTpBW4zbNgwValSRf/5z3+Unp6e65jcwl1KSoruv/9+NWvWTIGBgapVq5auv/567dixI9d9pKSk6IEHHlDLli0VHBysGjVqqEePHnr55ZclOfs3mzRpIkl699133do21q5d69pPenq6Jk6cqFatWikoKEg1atTQ0KFD9dVXX+U45qRJk1zbx8fHq3PnzqpUqVKBfbbZ55l93hc7duyYlixZolq1aunqq692rc+thzc1NVVPPfWU2rRpo5CQEIWGhqpZs2aKjY3V/v37XeNuueUWORwO/frrr/meR7YzZ87o9ddf16BBg9SgQQMFBgYqPDxc1113nb7//vt8z+9CufXwXvi1z+11YY/t4sWLNWLECDVr1kyVKlVS1apVddlll+mjjz5y22dh5je/Ht6vvvpKQ4cOVY0aNRQUFKRWrVpp4sSJuV6dzp6H5ORkxcbGqlatWgoODtall17q9jUE4Dmu8ALwiooVK0qSKlQo2j8rfn5+evzxx7VmzRotWLBADz/8cL7jK1eurJtuuklvvfWW/vOf/+jWW291e//o0aNaunSpwsPDdeWVV0qS9u7dq759++q3337TwIEDFRMTo5SUFH300UdauXKlVq9erR49erj2sWvXLvXr10+HDh1S7969FRMTo/T0dO3cuVPPPfecHnzwQXXs2FFjx47VtGnT1KFDB8XExLi2b9y4sSQpIyNDl19+ub799lt17txZ48aNU3JyshYsWKCVK1fqww8/1LBhw3Kc40svvaQvv/xS11xzjQYOHCh/f/98vyY33HC
D7rvvPi1cuFCvv/66QkJC3N6fN2+eMjMzde+99yogICDP/ZimqUGDBumbb75Rr1699Le//U1+fn7av3+/li5dqpEjR6pRo0b51pKXY8eOady4cbrssss0ZMgQVa9eXb/88ouWLl2qzz77TOvXr1e3bt082vfEiRNzXT9jxgylpKS4tW1MmDBBAQEB6t27t+rUqaPDhw9r6dKluuGGG/Svf/1L9913nyQVan7zsnDhQo0YMUKBgYEaPny4wsPD9fnnn2vKlClauXKl1q5dq6CgILdtTpw4od69e6tq1aoaOXKkUlJStGDBAg0aNEhbtmxRu3btPPraAPiLCQDFtH//fjMwMNCsU6eOee7cObf3oqOjTUnmoUOH8tw+IyPDrFChgunn52eePXu2wON9/fXXpiSzd+/eOd6bNm2aKcl88MEHXet69uxp+vv7mytWrHAbu2vXLrNKlSpm+/bt3dZ37drVlGTOmjUrx/4PHjzo+vO+fftMSWZsbGyudU6ePNmUZP797383DcNwrd+6dasZEBBgVqtWzUxLS3OtnzhxoinJrFy5srlt27b8vwgXueeee0xJ5ttvv53jvU6dOpmSzB07dritl2RGR0e7lrdt22ZKMmNiYnLsIyMjwzx58qRrOTY21pRk7tu3L8fY7PP48ssv3bb/7bffcozdsWOHGRISYg4YMMBt/ZdffmlKMidOnOi2Pvv7qSDPP/+8Kcm85pprzKysLNf6vXv35hh78uRJs3379mbVqlXN9PR01/qC5nfOnDmmJHPOnDmudampqWbVqlXNwMBA84cffnCtz8rKMocPH25KMqdMmeK2H0mmJPPee+91q/Xtt982JZl33313gecLIH+0NAAolrNnz2rkyJHKzMzUCy+8UODVyNwEBgaqZs2aMgxDx44dK3B8jx491K5dO23YsEG7d+92e2/OnDmSnB/mkqTvv/9eGzduVGxsrAYNGuQ2tkWLFrrzzju1fft2V2vDt99+q82bN6tPnz668847cxy7fv36hT6vd999VxUrVtTzzz/v9mv4Tp06KTY2VidOnFBCQkKO7e666y61b9++0MeR8m5r+OGHH/T999+re/fuatu2baH2FRwcnGNdYGBgjivHRREYGKh69erlWN+2bVv169dP69ev19mzZz3e/4U+/vhjTZgwQZ07d9bcuXPl53f+f3VNmzbNMT4kJES33HKLUlNT9d133xXr2EuWLFFqaqpuu+02RUZGutb7+fnpxRdfVIUKFXJtgahcubJeeOEFt1pjY2NVoUKFYtcEgJYGAMVgGIZuueUWrV+/XnfeeadGjhxZase+/fbbdf/992v27NmuW3Jt3bpViYmJioqKUuvWrSVJX3/9tSQpOTk5xz1dJemnn35y/bddu3b69ttvJUkDBw4sVn1paWn65Zdf1Lp161xDcr9+/fTWW28pMTExx9ete/fuRT5e165d1aFDB23cuFG7du1Sy5YtJUnvvPOOJPd+5ry0bt1akZGR+vDDD/Xbb78pJiZGffv2VceOHd2CmKcSExP14osvasOGDUpKSsoRcI8cOVLsOx5s3rxZI0eOVN26dfXJJ5/k+BBkSkqKnn/+eX322Wfav3+//vzzT7f3//jjj2IdP7sfObe+64YNG6pp06b6+eefdfLkSVWpUsX1XosWLXL8QFGhQgVFREToxIkTxaoJAIEXgIcMw9Btt92mefPm6eabb9bMmTM93ldmZqaOHj0qf39/1ahRo1Db3HzzzXrkkUf03nvv6ZlnnpG/v3+uH1bLvmK8bNkyLVu2LM/9ZX8ALjU1VZJyvRpZFGlpaZKkiIiIXN/PDnbZ4y6U1zYFuf322/XPf/5Ts2fP1gsvvKAzZ85o3rx5qlSpkm666aYCt69QoYLWrFmjSZMm6aOPPtIDDzwgSQoLC9OYMWP0+OOPe3QFX5I2btyoyy+/XJLzh4nmzZsrJCREDodDCQkJ+uGHH5SZmenRvrMdPHhQV111lRwOhz7
55JMcd+g4duyYunXrpgMHDqhXr14aMGCAqlWrJn9/fyUmJmrJkiXFrqEw8/7zzz8rLS3NLfCGhobmOr5ChQrKysoqVk0AuEsDAA8YhqFbb71V7777rkaMGKH4+PhiXQH86quvdO7cOXXs2LHQH3qrVauWrrnmGv3xxx/67LPPlJmZqXnz5ikkJMTtXsDZQeL111+XaZp5vmJjYyVJ1apVk+R8yEZxZB83OTk51/eTkpLcxl3I0yeJ/f3vf1dgYKDee+89nTt3TkuWLNHRo0c1bNiwPAPVxWrWrKnXX39dv//+u3788Ue98cYbqlGjhiZOnKgXX3zRNS57vs+dO5djH9k/NFzo2WefVWZmpr744gstXbpUr7zyiiZPnqxJkybluF2dJ06ePKkrr7xSKSkpmjdvnjp16pRjzDvvvKMDBw7o6aef1oYNG/T666/r6aef1qRJk3TppZcWuwapePMOoOQQeAEUSXbYfe+99zR8+HC9//77Hl/1y97fs88+K0kaMWJEkba9sG81ISFBx48f14033uj2q+Hsuy9s2rSpUPvMbif4/PPPCxybfd65XYELDQ1V06ZNtWfPnlzDc/btpjp27FiougqjRo0auvbaa5WUlKTly5fnesW7sBwOh1q3bq3Ro0dr1apVkqSlS5e63q9evbqk3H8wyO02Y3v37lWNGjXUu3dvt/WnT5/W1q1bi1zfhbKysnTTTTdp27Zteumll9xuvXZxDZJ0zTXX5Hjvv//9b451+c1vXrKDdm63Ezt48KD27t2rpk2bul3dBVDyCLwACi27jeG9997TsGHD9MEHHxQr7B45ckQ333yz1qxZozZt2ugf//hHkba/4oor1KBBA3366ad69dVXJeUMd927d1ePHj304YcfasGCBbme07p161zL3bp1U7du3bR+/Xq99dZbOcZfGPCqV68uh8OR56N6Y2NjdfbsWU2YMEGmabrWb9u2TfHx8apatarb7a68Ifv84+Li9Pnnn6tFixZ5PoziYr/++muu99XNvlp54a20sm8hdvEHsBYtWuT29czWqFEjHT9+XDt37nSty8rK0oMPPqjDhw8Xqr68jBs3TsuXL9ddd92l8ePH5zku+5ZqGzZscFs/b948LV++PMf4guY3N9dcc42qVq2qOXPmuJ2raZp65JFHdO7cOds//Q+wInp4ARTalClT9O677yokJEQtWrTQM888k2NMTExMrlctX375ZdejhdPS0vTjjz/qv//9rzIyMtSrVy99+OGHRX7MrZ+fn2699VZNmTJF3377rVq1aqWePXvmGPfhhx+qX79+uummmzR16lR17txZwcHBOnDggDZt2qTDhw8rIyPDNX7u3Lnq27ev7rrrLr3//vuKiopSRkaGdu7cqe+//15Hjx6V5Px0f3Y4HjlypJo3by4/Pz/X/WoffvhhLVu2TO+//77+97//qX///q77q547d05vvfWW16/09e/fX40bN3Z9WC/7bhWFkZiYqOuuu07du3dXmzZtVLt2bf3+++9KSEiQn5+f7r//ftfYa665Rpdcconi4+N18OBBderUSf/73/+0Zs0aDRkyJEeAvO+++/T555+rd+/euvHGGxUUFKS1a9fq999/V9++fT1+wMK3336rN954Q8HBwQoLC8v1g4nZ35MjR47UCy+8oPvuu09ffvmlGjVqpB9++EGrV6/Wddddp48//thtu4LmNzehoaF66623NGLECPXo0UPDhw9XWFiYvvjiC23ZskXdu3fXQw895NG5AigGn90QDUCZk33v1fxeF96T1DTP3zc1+1WhQgWzevXqZocOHczbbrvNXLFihdu9R4tq3759psPhMCWZL774Yp7jjh07Zj7xxBNmu3btzODgYDMkJMRs3ry5+X//93/mxx9/nGN8UlKSOXbsWLNp06ZmQECAWaNGDbNHjx7mq6++6jZu165d5pAhQ8xq1aq56rjw/rOnTp0yn3zySbNFixaue+8OHjzY/O9//5v
jmLndv9YT2ff/9ff3N//44488x+mi+/AePHjQfPTRR81LL73UDA8PNwMCAsyGDRua1113nblp06Yc2+/bt8+MiYkxq1SpYlauXNns37+/+d133+V5HosWLTI7d+5sVqpUyaxVq5Z54403mnv37s31nr6FvQ9v9rjCfk8mJiaaAwcONKtXr25WqVLFjI6ONr/44otc76lrmvnPb17bmKZprl+/3hw8eLBZrVo1MyAgwGzRooX55JNPmqdOnSpwHi7UqFEjs1GjRrm+B6DwHKZ5we/ZAAAAAJuhhxcAAAC2RuAFAACArRF4AQAAYGtlKvBmP49+3Lhx+Y5buHChWrVqpaCgILVv3z7X280AAACgfCgzgfe7777Tm2++qcjIyHzHbdy4USNGjNDtt9+u77//XjExMYqJidGOHTtKqVIAAABYSZm4S8OpU6fUuXNn/fvf/9Yzzzyjjh07aurUqbmOHT58uNLT0/Xpp5+61l166aXq2LGjZs6cWUoVAwAAwCrKxIMnRo8eraFDh2rAgAG53uj+Qps2bcrxpJ1BgwYpISEhz20yMzOVmZnpWjYMQ8eOHVPNmjU9fqY9AAAASo5pmjp58qTq1q0rP7/8mxYsH3jnz5+vrVu36rvvvivU+KSkJEVERLiti4iIUFJSUp7bxMXFafLkycWqEwAAAKXv4MGDql+/fr5jLB14Dx48qLFjx2rVqlVuz3D3tgkTJrhdFU5NTVXDhg21f/9+hYaGlthxS4thGLrhhhu0aNGiAn8CQukyDENHjhxRrVq1mBuLYW6sjfmxLubGuuw2N2lpaWrUqFGhHtFu6cC7ZcsWpaSkqHPnzq51WVlZWr9+vd544w1lZmbK39/fbZvatWsrOTnZbV1ycrJq166d53ECAwMVGBiYY321atVsE3grVqyoatWq2eIb3E4Mw9CZM2eYGwtibqyN+bEu5sa67DY32edQmPZTS59t//79tX37diUmJrpeXbt21d///nclJibmCLuSFBUVpdWrV7utW7VqlaKiokqrbAAAAFiIpa/wVqlSRe3atXNbV7lyZdWsWdO1ftSoUapXr57i4uIkSWPHjlV0dLReeeUVDR06VPPnz9fmzZs1a9asUq8fAAAAvmfpK7yFceDAAR06dMi13LNnT82bN0+zZs1Shw4dtGjRIiUkJOQIzgAAACgfLH2FNzdr167Nd1mShg0bpmHDhpVOQQAAALC0Mn+FFwAAAMgPgRcAAAC2RuAFAACArRF4AQAAYGsEXgAAANgagRcAAAC2RuAFAACArRF4AQAAYGsEXgAAANgagRcAAAC2RuAFAACArRF4AQAAYGsEXgAAANgagRcAAAC2RuAFAACArRF4AQAAYGsEXgAAANgagRcAAAC2RuAFAACArRF4AQAAYGsEXgAAANgagRcAAAC2RuAFAACArRF4AQAAYGsEXgAAANgagRcAAAC2RuAFAACArRF4AQAAYGsEXgAAANgagRcAAAC2RuAFAACArRF4AQAAYGsEXgAAANgagRcAAAC2RuAFAACArRF4AQAAYGsEXgAAANgagRcAAAC2RuAFAACArRF4AQAAYGuWDrwzZsxQZGSkQkNDFRoaqqioKH322Wd5jo+Pj5fD4XB7BQUFlWLFAAAAsJoKvi4gP/Xr19fzzz+v5s2byzRNvfvuu7rmmmv0/fffq23btrluExoaql27drmWHQ5HaZULAAAAC7J04L3qqqvclp999lnNmDFDX3/9dZ6B1+FwqHbt2qVRHgAAAMoASwfeC2VlZWnhwoVKT09XVFRUnuNOnTqlRo0ayTAMde7cWc8991ye4ThbZmamMjMzXctpaWmSJMMwZBiGd07AhwzDkGmatjgXu2FurIu5sTbmx7qYG+uy29wU5TwsH3i3b9+uqKgoZWRkKCQkRIsXL1abNm1yHduyZUvNnj1bkZGRSk1N1csvv6yePXtq586dql+/fp7HiIuL0+TJk3OsP3z4sDIyMrx2Lr5iGIb
OnTunlJQU+flZum273DEMQ6mpqTJNk7mxGObG2pgf62JurMtuc3Py5MlCj3WYpmmWYC3FdubMGR04cECpqalatGiR3n77ba1bty7P0Huhs2fPqnXr1hoxYoSefvrpPMfldoW3QYMGOn78uEJDQ71yHr5kGIaGDBmi5cuX2+Ib3E4Mw9Dhw4cVFhbG3FgMc2NtzI91MTfWZbe5SUtLU/Xq1ZWamlpgXrP8Fd6AgAA1a9ZMktSlSxd99913mjZtmt58880Ct61YsaI6deqkPXv25DsuMDBQgYGBOdb7+fnZ4htCcvY22+l87IS5sS7mxtqYH+tibqzLTnNTlHMoc2drGIbb1dj8ZGVlafv27apTp04JVwUAAACrsvQV3gkTJmjw4MFq2LChTp48qXnz5mnt2rVauXKlJGnUqFGqV6+e4uLiJElTpkzRpZdeqmbNmunEiRN66aWXtH//ft1xxx2+PA0AAAD4kKUDb0pKikaNGqVDhw6patWqioyM1MqVK3XFFVdIkg4cOOB2Ofv48eO68847lZSUpOrVq6tLly7auHFjofp9AQAAYE+WDrzvvPNOvu+vXbvWbfm1117Ta6+9VoIVAQAAoKwpcz28AAAAQFEQeAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AsKvGjSWHw/kaM8bX1ZyXkHC+LodD2rzZ1xUBsDkCLwCUhH//2xnmevTwbR2XXSa9/74UG3t+3cGD0uTJUvfuUvXqUq1aUt++0hdfFO9Yzz0nXXqpFBYmBQVJzZtL48ZJhw+7j+va1VnTXXcV73gAUEgEXgAoCXPnOq+wfvuttGeP7+po2lS6+WapW7fz65YskV54QWrWTHrmGenJJ6WTJ6UrrpDmzPH8WFu2SB07So8/Lk2fLl1zjXN/PXtK6ennx9Wv76wpKsrzYwFAEVTwdQEAYDv79kkbN0offyzdfbcz/E6c6OuqzuvXTzpwwHllN9s99zjD6lNPSbfe6tl+P/oo57qoKOmGG6RPPpFuusmz/QJAMXGFFwC8be5cZ6vA0KHOsDd3rq8rcte2rXvYlaTAQGnIEOm335xXe72lcWPnf0+c8N4+AaCIuMILAN42d6503XVSQIA0YoQ0Y4b03XfubQV5OXVKysgoeFzFilLVqsWv9UJJSVKlSs6Xp0xTOnpUOndO2r1bevRRyd/f2SMMAD5C4AUAb9qyRfrpJ+n1153LvXs7e1bnzi1c4B0zRnr33YLHRUdLa9cWq1Q3e/Y4WzCGDXMGVE8lJ0t16pxfrl9fmjdPatWq+DUCgIcIvADgTXPnShERzj5ZyXmnhuHDpQ8+kF55peAw+fDDzg90FaR69eLXmu30aWfQDQ6Wnn++ePuqUUNatcp5lfr7750h+tQp79QJAB4i8AKAt2RlSfPnO8Puvn3n1/fo4Qy7q1dLAwfmv482bZyv0pKV5fww2Y8/Sp99JtWtW7z9BQRIAwY4/3zllVL//lKvXlJ4uHMZAHyAwAsA3rJmjXTokDP0zp+f8/25cwsOvKmp0p9/FnysgADn1dTiuvNO6dNPnbVdfnnx93exnj2dLQ5z5xJ4AfgMgRcAvGXuXOeVzOnTc7738cfS4sXSzJnO1oG8jB1bej28Dz3kvE/u1KnOD9eVlIwMZ5AHAB8h8AKAN/z55/kPfd1wQ87369aVPvxQWrrU2dObl9Lq4X3pJenll6XHHnOG7OJKT3f2K198h4ePPpKOH3c+XQ0AfITACwDesHSp8/61V1+d+/vZj9ydOzf/wFsaPbyLFzuDdfPmUuvWzg/UXeiKK5wfvJOkX3+VmjRxPpo4Pj7vfe7e7ezdHT7ceUcGPz9p82b
nvhs39k6oBgAPEXgBwBvmzpWCgpxhMTd+fs4HUcyd67xPbc2apVvfhX74wfnf3bulkSNzvv/ll+cDb/YdFi681Vhu6teXrr/e2cf87rvS2bNSo0bO26w9/rhvzxdAuUfgBQBvWLq04DFz5jhfpSkzUzpyxNk3XLmyc92kSc5XYaxf79xu3Lj8x9WqJb35ZuH2eeaMlJbG7coAlBoeLQwAdjZ/vrOV4pFHPNv+yy+lf/7z/BVfb1i+3FnTffd5b58AkA+u8AKAXc2de/4WZw0aeLaPhQu9V0+2Xr2cD6fI1rKl948BABcg8AKAXfXq5esKchcWdv7hFABQCmhpAAAAgK1ZOvDOmDFDkZGRCg0NVWhoqKKiovTZZ5/lu83ChQvVqlUrBQUFqX379lq+fHkpVQsAAAArsnTgrV+/vp5//nlt2bJFmzdv1uWXX65rrrlGO3fuzHX8xo0bNWLECN1+++36/vvvFRMTo5iYGO3YsaOUKwcAAIBVWDrwXnXVVRoyZIiaN2+uFi1a6Nlnn1VISIi+/vrrXMdPmzZNf/vb3/TQQw+pdevWevrpp9W5c2e98cYbpVw5AAAArKLMfGgtKytLCxcuVHp6uqKionIds2nTJo0fP95t3aBBg5SQkJDvvjMzM5WZmelaTktLkyQZhiHDMIpXuAUYhiHTNG1xLnbD3FgXc2NtzI91MTfWZbe5Kcp5WD7wbt++XVFRUcrIyFBISIgWL16sNnk8djMpKUkRF90rMiIiQklJSfkeIy4uTpMnT86x/vDhw8rIyPC8eIswDEPnzp1TSkqK/PwsfVG/3DEMQ6mpqTJNk7mxGObG2pgf62JurMtuc3Py5MlCj7V84G3ZsqUSExOVmpqqRYsWKTY2VuvWrcsz9HpiwoQJbleG09LS1KBBA4WFhSk0NNRrx/EVwzBUoUIFhYeH2+Ib3E4Mw5DD4VBYWBhzYzHMjbUxP9bF3FiX3eYmKCio0GMtH3gDAgLUrFkzSVKXLl303Xffadq0aXozl0dY1q5dW8nJyW7rkpOTVbt27XyPERgYqMDAwBzr/fz8bPENIUkOh8NW52MnzI11MTfWxvxYF3NjXXaam6KcQ5k7W8Mw3PptLxQVFaXVq1e7rVu1alWePb8A4GKa0jvvSHFxvq4EAOBllr7CO2HCBA0ePFgNGzbUyZMnNW/ePK1du1YrV66UJI0aNUr16tVT3F//gxo7dqyio6P1yiuvaOjQoZo/f742b96sWbNm+fI0AFjdtm1SbKyUmOhc/vNPadIkX1YEAPAiSwfelJQUjRo1SocOHVLVqlUVGRmplStX6oorrpAkHThwwO1yds+ePTVv3jw98cQTeuyxx9S8eXMlJCSoXbt2vjoFAFaWkiI9+aT09tvShZ/2PXzYdzUBALzO0oH3nXfeyff9tWvX5lg3bNgwDRs2rIQqAmALWVnSa69JTz8t/XUbQjeDB5d+TQCAEmPpwAsAJeKVV6RHHsn7/SZNSq8WAECJK3MfWgOAYsv+4KvDIVWunPP9Ro1Ktx4AQIki8AIofx57TFqyRLruOik93bku+36ONWpINrj/NgDgPAIvgPLH318KDJQ++si5XLmydPas88+NG/usLABAySDwAih/UlOlO+44v/zyy1J4uPPPvXv7piYAQInhQ2sAyp8HHpB++8355wEDpLvvlvr3l9avl7jLCwDYDoEXQPmycqXziWqSFBLivAevwyE1b+58AQBsh5YGAOVHbq0M3JEBAGyPwAug/Li4leGuu3xbDwCgVBB4AZQPebUyAABsj8ALwP5oZQCAco3AC8D+aGUAgHKNwAvA3mhlAIByj8ALwL5oZQAAiMALwM5oZQAAiMALwK5oZQAA/IXAC8B+aGUAAFyAwAvAfmhlAABcgMALwF5oZQAAXITAC8A+aGUAAOSCwAvAPmhlAADkgsALwB5oZQAA5IHAC6Dso5UBAJAPAi+Aso9WBgBAPgi8AMo2WhkAAAUg8AIou2hlAAA
UAoEXQNlFKwMAoBAIvADKJloZAACFROAFUPbQygAAKAICL4Cyh1YGAEAREHgBlC20MgAAiojAC6DsoJUBAOABAi+AsoNWBgCABwi8AMoGWhkAAB4i8AKwPloZAADFQOAFYH20MgAAioHAC8DaaGUAABQTgReAddHKAADwAgIvAOuilQEA4AUEXgDWRCsDAMBLCLwArIdWBgCAF1k68MbFxalbt26qUqWKwsPDFRMTo127duW7TXx8vBwOh9srKCiolCoG4BW0MgAAvMjSgXfdunUaPXq0vv76a61atUpnz57VwIEDlZ6enu92oaGhOnTokOu1f//+UqoYQLHRygAA8LIKvi4gPytWrHBbjo+PV3h4uLZs2aI+ffrkuZ3D4VDt2rVLujwA3kYrAwCgBFg68F4sNTVVklSjRo18x506dUqNGjWSYRjq3LmznnvuObVt2zbP8ZmZmcrMzHQtp6WlSZIMw5BhGF6o3LcMw5BpmrY4F7thbtw5xo+X469WBrN/f5l33CH56GvD3Fgb82NdzI112W1uinIeZSbwGoahcePGqVevXmrXrl2e41q2bKnZs2crMjJSqampevnll9WzZ0/t3LlT9evXz3WbuLg4TZ48Ocf6w4cPKyMjw2vn4CuGYejcuXNKSUmRn5+lu1jKHcMwlJqaKtM0y/3cBHz5pWrMni1JMipX1pG4OBmHD/usHubG2pgf62JurMtuc3Py5MlCj3WYpmmWYC1e849//EOfffaZNmzYkGdwzc3Zs2fVunVrjRgxQk8//XSuY3K7wtugQQMdP35coaGhxa7d1wzD0JAhQ7R8+XJbfIPbiWEYOnz4sMLCwsr33KSmyhEZ6bq6a/z739Ldd/u0JObG2pgf62JurMtuc5OWlqbq1asrNTW1wLxWJq7wjhkzRp9++qnWr19fpLArSRUrVlSnTp20Z8+ePMcEBgYqMDAwx3o/Pz9bfENIzr5mO52PnTA3kh56yO2uDH733GOJD6oxN9bG/FgXc2NddpqbopyDpc/WNE2NGTNGixcv1po1a9SkSZMi7yMrK0vbt29XnTp1SqBCAMXGXRkAACXM0ld4R48erXnz5mnJkiWqUqWKkpKSJElVq1ZVcHCwJGnUqFGqV6+e4uLiJElTpkzRpZdeqmbNmunEiRN66aWXtH//ft1x4Se/AVgDd2UAAJQCSwfeGTNmSJL69u3rtn7OnDm65ZZbJEkHDhxwu6R9/Phx3XnnnUpKSlL16tXVpUsXbdy4UW3atCmtsgEUFg+YAACUAksH3sJ8nm7t2rVuy6+99ppee+21EqoIgNfQygAAKCWW7uEFYFO0MgAAShGBF0Dpo5UBAFCKCLwAShetDACAUkbgBVB6aGUAAPgAgRdA6aGVAQDgAwReAKWDVgYAgI8QeAGUPFoZAAA+ROAFUPJoZQAA+BCBF0DJopUBAOBjBF4AJYdWBgCABRB4AZQcWhkAABZA4AVQMmhlAABYBIEXgPfRygAAsBACLwDvo5UBAGAhBF4A3kUrAwDAYgi8ALyHVgYAgAUReAF4D60MAAALIvAC8A5aGQAAFkXgBVB8tDIAACyMwAug+GhlAABYGIEXQPHQygAAsDgCLwDP0coAACgDCLwAPPfgg7QyAAAsj8ALwDMrVzrbFyRaGQAAllahqBucPn1aq1at0ldffaUff/xRR44ckcPhUK1atdS6dWv16tVLAwYMUOXKlUuiXgBWQCsDAKAMKfQV3u3bt+uWW25R7dq1de2112r69Onas2ePHA6HTNPUzz//rDfeeEPXXnutateurVtuuUXbt28vydoB+AqtDACAMqRQV3iHDx+ujz76SF27dtWkSZN0xRVXqE2bNvL393cbl5WVpR9//FGff/65Fi1apE6dOmnYsGH68MMPS6R4AD5AKwMAoIwpVOD18/PT5s2b1bFjx3zH+fv7q3379mrfvr0eeOABJSYm6oUXXvBGnQCsgFYGAEAZVKjA6+kV2o4dO3J1F7A
TWhkAAGUQd2kAUDi0MgAAyiiPA29aWpqef/55DRo0SJ06ddK3334rSTp27JheffVV7dmzx2tFAvAxWhkAAGVYkW9LJkm//faboqOjdfDgQTVv3lw//fSTTp06JUmqUaOG3nzzTe3fv1/Tpk3zarEAfIRWBgBAGeZR4H3ooYd08uRJJSYmKjw8XOHh4W7vx8TE6NNPP/VKgQB8jFYGAEAZ51FLw+eff65//vOfatOmjRy5/I+vadOmOnjwYLGLA+BjtDIAAGzAo8D7559/KiwsLM/3T5486XFBACyEVgYAgA14FHjbtGmj9evX5/l+QkKCOnXq5HFRACyAVgYAgE14FHjHjRun+fPn64UXXlBqaqokyTAM7dmzRyNHjtSmTZt0//33e7VQAKWIVgYAgI149KG1m2++Wfv379cTTzyhxx9/XJL0t7/9TaZpys/PT88995xiYmK8WSeA0kQrAwDARjwKvJL0+OOPa+TIkfroo4+0Z88eGYahSy65RNddd52aNm3qzRoBlCZaGQAANuNR4D1w4IDCwsLUsGHDXFsX/vzzTx0+fFgNGzYsdoEAShGtDAAAG/Koh7dJkyZavHhxnu8vXbpUTZo08biobHFxcerWrZuqVKmi8PBwxcTEaNeuXQVut3DhQrVq1UpBQUFq3769li9fXuxagHKBVgYAgA15FHhN08z3/bNnz8rPz+OnFrusW7dOo0eP1tdff61Vq1bp7NmzGjhwoNLT0/PcZuPGjRoxYoRuv/12ff/994qJiVFMTIx27NhR7HoAW6OVAQBgU4VuaUhLS9OJEydcy0ePHtWBAwdyjDtx4oTmz5+vOnXqFLu4FStWuC3Hx8crPDxcW7ZsUZ8+fXLdZtq0afrb3/6mhx56SJL09NNPa9WqVXrjjTc0c+bMYtcE2BKtDAAAGyt04H3ttdc0ZcoUSZLD4dC4ceM0bty4XMeapqlnnnnGKwVeKPsWaDVq1MhzzKZNmzR+/Hi3dYMGDVJCQkKe22RmZiozM9O1nJaWJsl5qzXDMIpRsTUYhiHTNG1xLnZjlblxPPCAHH+1Mpj9+8u84w6pnH+/WGVukDvmx7qYG+uy29wU5TwKHXgHDhyokJAQmaaphx9+WCNGjFDnzp3dxjgcDlWuXFldunRR165dC19xIRiGoXHjxqlXr15q165dnuOSkpIUERHhti4iIkJJSUl5bhMXF6fJkyfnWH/48GFlZGR4XrRFGIahc+fOKSUlxSutJvAewzCUmprquqWfLwR8+aVqvPOOs57KlXUkLk7G4cM+qcVKrDA3yBvzY13MjXXZbW6K8mTfQgfeqKgoRUVFSZLS09N1/fXX5xs8vW306NHasWOHNmzY4PV9T5gwwe2qcFpamho0aKCwsDCFhoZ6/XilzTAMVahQQeHh4bb4BrcTwzDkcDgUFhbmm7lJTZXj4YfPL7/0kmp16VL6dViQz+cG+WJ+rIu5sS67zU1QUFChx3p0W7KJEyd6spnHxowZo08//VTr169X/fr18x1bu3ZtJScnu61LTk5W7dq189wmMDBQgYGBOdb7+fnZ4htCcl59t9P52IlP5+bhh93uyuB3zz18UO0C/L2xNubHupgb67LT3BTlHDx+8IQkffXVV9q6datSU1Nz9FE4HA49+eSTxdm9TNPUfffdp8WLF2vt2rWFutVZVFSUVq9e7dZfvGrVKtfVaQB/4a4MAIBywqPAe+zYMQ0dOlTffvutTNOUw+Fw3aos+8/eCLyjR4/WvHnztGTJElWpUsXVh1u1alUFBwdLkkaNGqV69eopLi5OkjR27FhFR0frlVde0dChQzV//nxt3rxZs2bNKlYtgK1wVwYAQDni0fXshx56SNu2bdO8efP0yy+/yDRNrVy5Uj///LPuuecedezYUX/88Uexi5sxY4ZSU1PVt29f1alTx/VasGCBa8yBAwd06NAh13LPnj01b948zZo1Sx06dNCiRYuUkJBQqv3GgOXxgAkAQDni0RX
e5cuX6+6779bw4cN19OhRSc4+imbNmmn69Om67rrrNG7cOH344YfFKq6gB1xI0tq1a3OsGzZsmIYNG1asYwO2RSsDAKCc8egK74kTJ9S2bVtJUkhIiCTp1KlTrvcHDhyolStXeqE8AF5FKwMAoBzyKPDWrVvX1U8bGBio8PBw/fDDD673f//9dzm4YgRYD60MAIByyKOWhj59+mjVqlV6/PHHJUnDhw/Xiy++KH9/fxmGoalTp2rQoEFeLRRAMdHKAAAopzwKvOPHj9eqVauUmZmpwMBATZo0STt37nTdlaFPnz56/fXXvVoogGKglQEAUI55FHjbt2+v9u3bu5arV6+uL774QidOnJC/v7+qVKnitQIBeAGtDACAcqxYD564WLVq1by5OwDeQCsDAKCc8zjwZmVlaeXKlfrll190/PjxHLcQ88aDJwAUE60MAAB4Fng3b96s66+/Xr/99lue98ol8AIWQCsDAACe3Zbs3nvv1Z9//qmEhAQdO3ZMhmHkeGVlZXm7VgBFQSsDAACSPLzCu23bNj377LO66qqrvF0PAG+glQEAABePrvDWr1+/UI/9BeAjtDIAAODiUeB95JFH9NZbbyktLc3b9QAoLloZAABw41FLw8mTJxUSEqJmzZrppptuUoMGDeTv7+82xuFw6P777/dKkQAKiVYGAABy8CjwPvjgg64/v/HGG7mOIfACPkArAwAAOXgUePft2+ftOgAUF60MAADkyqPA24hfkQLWQisDAAB58uhDawAshlYGAADyVKgrvE2aNJGfn59++uknVaxYUU2aNJGjgF+VOhwO7d271ytFAsgHrQwAAOSrUIE3OjpaDodDfn5+bssAfIxWBgAAClSowBsfH5/vMgAfoZUBAIAC0cMLlFW0MgAAUCiFusK7fv16j3bep08fj7YDUABaGQAAKLRCBd6+ffu69eyaplmoHt6srCzPKwOQN1oZAAAotEIF3i+//NJtOTMzUw8//LBOnz6tu+66Sy1btpQk/fTTT3rrrbdUuXJlvfjii96vFgCtDAAAFFGh79JwofHjxysgIEBff/21goKCXOuvuuoqjR49WtHR0VqxYoWuuOIK71YLlHe0MgAAUGQefWht7ty5GjlypFvYzVapUiWNHDlSH3zwQbGLA3ARWhkAACgyjwJvenq6Dh06lOf7hw4d0unTpz0uCkAuaGUAAMAjHgXeAQMGaNq0afr4449zvPfRRx9p2rRpGjBgQLGLA/AXWhkAAPBYoXp4LzZ9+nRdfvnlGjZsmOrUqaNmzZpJkvbu3as//vhDl1xyiV5//XWvFgqUa7QyAADgMY+u8NarV08//PCDXn31VbVr107JyclKTk5W27Zt9dprr+mHH35Q/fr1vV0rUD7RygAAQLEU+QpvRkaGZs2apY4dO2rs2LEaO3ZsSdQFQKKVAQAALyjyFd6goCA98sgj2rVrV0nUA+BCtDIAAFBsHrU0tGvXTr/++quXSwHghlYGAAC8wqPA++yzz+rNN9/UF1984e16AEi0MgAA4EUe3aXhjTfeUI0aNTRo0CA1adJETZo0UXBwsNsYh8OhJUuWeKVIoNyhlQEAAK/xKPBu27ZNDodDDRs2VFZWlvbs2ZNjjINfvQKeoZUBAACv8ijw0r8LlBBaGQAA8DqPengBlBBaGQAA8DqPrvBmW7dunZYtW6b9+/dLkho1aqShQ4cqOjraK8UB5QqtDAAAlAiPAu+ZM2c0YsQIJSQkyDRNVatWTZJ04sQJvfLKK7r22mv14YcfqmLFit6sFbAvWhkAACgxHrU0TJ48WYsXL9YDDzygQ4cO6dixYzp27JiSkpL04IMP6uOPP9aUKVO8XStgW46HHqKVAQCAEuJR4J03b55iY2P14osvKiIiwrU+PDxcL7zwgkaNGqX333/fKwWuX79eV111lerWrSuHw6GEhIR8x69du1YOhyPHKykpySv1AN4W8OWXcrzzjnOBVgYAALzOo8B76NAh9ej
RI8/3e/To4bWAmZ6erg4dOmj69OlF2m7Xrl06dOiQ6xUeHu6VegCvSk1V1QcfPL9MKwMAAF7nUQ9v/fr1tXbtWt1zzz25vr9u3TrVr1+/WIVlGzx4sAYPHlzk7cLDw129xYBVOR56SH5//OFcoJUBAIAS4VHgjY2N1cSJE1WtWjXdf//9atasmRwOh3bv3q2pU6dq4cKFmjx5srdrLZKOHTsqMzNT7dq106RJk9SrV688x2ZmZiozM9O1nJaWJkkyDEOGYZR4rSXNMAyZpmmLc7GVzz+X31+tDGZIiMxZsyTTdL7gc/y9sTbmx7qYG+uy29wU5Tw8CryPPfaY9u7dq1mzZumtt96Sn5+f68CmaSo2NlaPPfaYJ7sutjp16mjmzJnq2rWrMjMz9fbbb6tv37765ptv1Llz51y3iYuLyzWgHz58WBkZGSVdcokzDEPnzp1TSkqKa67gW460NNW6/XbXcuoTTygjOFhKSfFhVbiQYRhKTU2VaZr8vbEg5se6mBvrstvcnDx5stBjHabp+eWkbdu2afny5W734R0yZIgiIyM93WW+HA6HFi9erJiYmCJtFx0drYYNG+b5QbrcrvA2aNBAx48fV2hoaHFKtgTDMDRkyBAtX77cFt/gduC46y7XB9UyL7tM/qtXy8/f38dV4UKGYejw4cMKCwvj740FMT/WxdxYl93mJi0tTdWrV1dqamqBea1YD56IjIwssXDrTd27d9eGDRvyfD8wMFCBgYE51vv5+dniG0Jy/rBgp/Mp01aulC5oZUh95RXV8vdnbiyIvzfWxvxYF3NjXXaam6KcQ6FGnj592uNiirOttyQmJqpOnTq+LgPI8YAJ88UXZTRo4MOCAACwv0IF3gYNGmjKlCk6dOhQoXf8+++/66mnnlLDhg09Lk6STp06pcTERCUmJkqS9u3bp8TERB04cECSNGHCBI0aNco1furUqVqyZIn27NmjHTt2aNy4cVqzZo1Gjx5drDoAr3jwQR4wAQBAKStUS8OMGTM0adIkTZkyRb169dKAAQPUuXNnNWnSRNWrV5dpmjp+/Lj27dunzZs364svvtDXX3+t5s2b69///nexCty8ebP69evnWh4/frwk550i4uPjdejQIVf4lZyPPX7ggQf0+++/q1KlSoqMjNQXX3zhtg/AJ1audD5UQuIBEwAAlKJCf2jNMAwtXbpU8fHxWrFihc6cOSPHRf+zNk1TAQEBGjhwoG677TZdffXVZbJHJC0tTVWrVi1UE3RZYBiGBg8erM8++6xMzoctpKZK7dqdv7o7c6Z0990yDEMpKSkKDw9nbiyGubE25se6mBvrstvcFCWvFfpDa35+foqJiVFMTIwyMzO1ZcsW/fTTTzp69KgkqWbNmmrVqpW6dOmS6wfAgHKNVgYAAHzGo7s0BAYGqmfPnurZs6e36wHsh1YGAAB8quxfzwas7KK7Mujll6VGjXxXDwAA5RCBFyhJtDIAAOBzBF6gpNDKAACAJRB4gZJAKwMAAJZB4AVKAq0MAABYhkeB95tvvvF2HYB90MoAAICleBR4o6Ki1KJFCz399NP65ZdfvF0TUHbRygAAgOV4FHg/+OADNW/eXE8//bSaN2+uXr16aebMmTp27Ji36wPKFloZAACwHI8C7//93/9p2bJl+uOPPzRt2jSZpql7771XdevWVUxMjBYtWqQzZ854u1bA2mhlAADAkor1obVatWppzJgx2rhxo3bv3q3HH39cP/30k4YPH67atWvrrrvu0oYNG7xVK2BdtDIAAGBZXrtLQ3BwsCpVqqSgoCCZpimHw6ElS5YoOjpa3bp1048//uitQwHWQysDAACWVazAe/LkSc2ZM0cDBgxQo0aN9Nhjj6lx48ZatGiRkpKS9Mcff2jBggVKSUnRrbfe6q2aAWuhlQEAAEur4MlGS5Ys0dy5c/Xpp58qIyND3bp109SpU3XTTTepZs2abmNvuOEGHT9+XKNHj/ZKwYCl0MoAAID
leRR4r732WjVo0ED333+/Ro0apZYtW+Y7vkOHDvr73//uUYGApdHKAACA5XkUeNesWaO+ffsWenz37t3VvXt3Tw4FWBetDAAAlAke9fAWJewCtkQrAwAAZYbX7tIAlCu0MgAAUGYQeIGiopUBAIAyhcALFAWtDAAAlDkEXqAoaGVw0ze+rxyTHXJMdujKeVf6uhyXxKREV12OyQ4t+nGRr0sCUEY0buz8pZ3DIY0Z4+tqzps69XxdDod05IivKypbCLxAYVmolWHvsb26+5O71XRaUwU9E6TQuFD1mt1L076epj/P/lmqtbSq1UrvX/u+Huz5oNv6BTsW6OaPb1bz15vLMdmhvvF9i32sb3//Vvcuu1ddZnVRxacryjE5969/o6qN9P617+ux3o8V+5gASlZ8vHuQczik8HCpXz/ps898U9Nll0nvvy/FxuY9ZsMG74TPBQukm2+Wmjd37iuv+wL87W/Omq691vNjlWce3ZYMKHcs1Mqw7OdlGrZwmAIrBGpU5Ci1C2+nM1lntOHgBj206iHtPLxTs66aVWr1RFSO0M2RN+dYP2PzDG05tEXd6nbT0dNHvXKs5buX6+2tbysyIlJNqzfVz0d/znVc9eDqujnyZq39da2e2/CcV44NoGRNmSI1aSKZppSc7AzCQ4ZIn3wiXVnKv0Bq2tQZQvNiGNJ990mVK0vp6cU71owZ0pYtUrdu0tF8/qls1cr52rNHWry4eMcsjwi8QGFYpJVh3/F9uumjm9SoWiOtGbVGdarUcb03uvto7em3R8t+XuaT2i72/rXvq15oPfk5/NTu3+28ss9/dP2HHun1iIIrBmvM8jF5Bl4AZc/gwVLXrueXb79dioiQPvyw9ANvQWbNkg4edF4HmTatePt6/32pXj3Jz09q551/KpELAi9QEAu1Mrz41Ys6deaU3rn6Hbewm61ZjWYae+lYH1SWU4OqDby+z4iQCK/vE4A1VasmBQdLFSyWVI4dk554wnlFOiWl+Ptr4P1/KpELi30bARZjoVYGSfrk50/UtHpT9WzQ0+N9nD57WqfPni5wnL/DX9WDq3t8HAAoitRUZy+saTqD5OuvS6dO5d9akO3UKSkjo+BxFStKVasWr84nn5Rq15buvlt6+uni7Qulh8AL5McirQySlJaZpt9P/q5rWl5TrP28+NWLmrxucoHjGlVtpF/H/VqsYwFAYQ0Y4L4cGCjNni1dcUXB244ZI737bsHjoqOltWs9Kk+StG2b9Oab0vLlkr+/5/tB6SPwAnmxUCuD5Ay8klQlsEqx9jOqwyj1bti7wHHBFYKLdRwAKIrp06UWLZx/Tk6WPvjA+Qu2KlWk667Lf9uHHy7cleDqxfyl1T//6ew1HjiwePtB6SPwArmxWCuDJIUGhkqSTmaeLNZ+mlZvqqbVm3qjJADwmu7d3T+0NmKE1KmT8+rtlVdKAQF5b9umjfNVkhYskDZulHbsKNnjoGQQeIHcWKiVIVtoYKjqVqmrHSnF+9f21JlTOnXmVIHj/B3+CqscVqxjAYCn/Pyc9+KdNk3avVtq2zbvsamp0p+FuAV5QIBUo4Zn9Tz0kDRsmHMfv/7qXHfihPO/Bw9KZ85Idet6tm+UPAIvcDGLtTJc6MrmV2rW1lnadHCTohpEebSPlze+TA8vgDLh3Dnnf08V8DP62LEl38N78KA0b57zdbHOnaUOHaTERM/2jZJH4AUuZMFWhgs93Othzd0+V3d8cofWjFqT4zZde4/t1ac/f5rvrcno4QVQFpw9K33+ufOKauvW+Y8tjR7e3B72MH++s9Xhvfek+vU93zdKHoEXuJAFWxkudEmNSzTv+nkavmi4Wk9vrVEdzj9pbePBjVr440Ld0uGWfPdRWj286/ev1/r96yVJh08fVvrZdD2z/hlJUp9GfdSnUR/XWMdkh6IbRWvtLWvz3ef+E/v1/rb3JUmb/9gsSa59NqraSCM7jPT2aQAoJZ99Jv30k/PPKSnOK6m7d0u
PPiqFhua/bWn08MbE5FyXfUV38GCpVq3z69eudbZjTJwoTZqU/37Xr3e+JOnwYeeT255x/rOmPn2cLxQfgRfIZuFWhgtd3fJqbbtnm17a+JKW7FqiGZtnKNA/UJERkXpl4Cu6s/Odvi5RkrRm35ocrRNPfvmkJGli9ERX4M3uJ87tQRoX23din2sfF+8zulE0gRcow5566vyfg4Kcj9GdMcN5v9uyJrsFo07B/6xpzRpp8kVdZk/+9c/cxIkEXm8h8AKS5VsZLta8ZnPNumqWr8uQJJ01zurI6SMK8A9w3UlCkib1naRJfScVuP36/evlkEOP9X6swLF9G/eVOdEscFyWkaXjGceVmpFa4FgAvnXLLc6XlWRmOh+CERwsVa6c97hJk3K/grt+vbPFoTDnldc+LpaR4QzSpwt+bhBy4efrAgBLsHgrg5VtPLhRYS+F6f8++j+Ptv9y35e6qd1Nah/R3ms1bU/ZrrCXwhSzIMZr+wRQfsyfL4WFSY884tn2X37pvEobGOi9mmbOdNb00kve22d5whVeoIy0MljRKwNf0fGM45KksEqe3cLspYHe/9e7WY1mWjVylWs5MiLS68cAYE9z556/xVmDBp7t47vvvFdPtuuvl9q1O79c3EcklzcEXpRvZayVwWq61O3i6xJyFRIQogFNBxQ8EAAu0quXryvIXYMGngdw0NKA8o5WBgAAbI/Ai/KLVgYAAMoFywfe9evX66qrrlLdunXlcDiUkJBQ4DZr165V586dFRgYqGbNmik+Pr7E60QZQysDAADlhuUDb3p6ujp06KDp06cXavy+ffs0dOhQ9evXT4mJiRo3bpzuuOMOrVy5soQrRZlCKwMAAOWG5T+0NnjwYA0ePLjQ42fOnKkmTZrolVdekSS1bt1aGzZs0GuvvaZBgwaVVJkoS2hlAIBC+/JL56t1aykqyvnLMP7JRFlj+cBbVJs2bdKAAe6fzh40aJDGjRuX5zaZmZnKzMx0LaelpUmSDMOQYRglUmdpMgxDpmna4lyKLTVVjjvuUPa/1caLLzo/9uqjrw1zY13MjbUxP6Xj9GnpyisdOn36fMKtXdvUpZdK7dub6txZuuoq9wDM3FiX3eamKOdhu8CblJSkiIgIt3URERFKS0vTn3/+qeDg4BzbxMXFafLFz/WTdPjwYWVkZJRYraXFMAydO3dOKSkp8vOzfBdLiQp94AFV+quVIbNPHx2PiXE+tN1HDMNQamqqTNMs93NjNcyNtTE/pcM0pUsuqant2yu61iUlOZSQICUkOFPuVVf9qVmzzj/VkLmxLrvNzcmTJws91naB1xMTJkzQ+PHjXctpaWlq0KCBwsLCFBoams+WZYNhGKpQoYLCw8Nt8Q3usZUr5TdvniTJDAlRxfh4hV/0w1FpMwxDDodDYWFh5XtuLIi5sTbmp+SlpDgfkRsV5VBqqqkDB3LvY/jttyCFh59/pBhzY112m5ugoKBCj7Vd4K1du7aSk5Pd1iUnJys0NDTXq7uSFBgYqMBcnv/n5+dni28ISXI4HLY6nyJLTXX7YJrj5ZflaNKkVEv4+cjPenHji+oQ0UH39bjvfC3lfW4sjLmxNubHu1JSpHXrpLVrna8ffyx4m7p1nVd6/fzcwzBzY112mpuinIPtAm9UVJSWL1/utm7VqlWKioryUUWwhAce8MldGX45/osW7lyo//z4H209tNW1vkf9Huper3up1AAAuSlKwPXzk8LDpaSk8+uuv1567z2pUqWSrhQoPssH3lOnTmnPnj2u5X379ikxMVE1atRQw4YNNWHCBP3+++967733JEn33HOP3njjDT388MO67bbbtGbNGv3nP//RsmXLfHUK8LWVK6V33nH+uRTuymCapmZ/P1szNs/QlkNbch1TJaBKiR0fAHJTlIDr7y917Sr17et8ffWV9Mwz599/+GEpLs4ZhIGywPKBd/PmzerXr59rObvXNjY2VvHx8Tp06JAOHDjger9JkyZatmy
Z7r//fk2bNk3169fX22+/zS3JyisfPGBizb41uuOTO/J8v2XNlmod1rpEawCA4gTcXr2kKhf8XP7XNSX5+0szZkh33llydQMlwfKBt2/fvjJNM8/3c3uKWt++ffX999+XYFUoM3zQylCnSh1VqlhJp8+eVtXAqkrNTHV7//ZOt5d4DQDKH28G3Iu98orUrp00cKBzO6CssXzgBTxWyq0M2dqEtdHu+3YrdnGsvtj3hSTJIYdMOX9wu6HNDSVeAwD7K8mAe7E6daTHHitevYAvEXhhTz5oZch2JuuM7vn0HlfYDfQPVGaW88EmXet2VZPqpXt3CAD2UJoBF7AbAi/syUd3ZTiTdUY3/OcGffLzJ5Kk4ArBGtNtjF7a9JIkaVibYaVSB4Cyj4ALeA+BF/bjo1aG3MLusv9bpmpB1TRzy0xVD66uWzreUuJ1ACibCLhAySHwwl581MqQV9jt18R5h5E/HvhDgf6BquhfMb/dAChHCLhA6SHwwl580MpQUNiVpJCAkBKvA4C1EXAB3yHwwj580MpQmLALoHwi4ALWQeCFPfiglYGwC+BCBFzAugi8sIdSbmUg7AIg4AJlB4EXZV8ptzIQdoHyiYALlF0EXpRtpdzKQNgFyg8CLmAfBF6UbaXYykDYBeyNgAvYF4EXZVcptjIQdgH7IeAC5QeBF2VTKbYyEHYBeyhqwO3SxT3ghoaWTp0AvI/Ai7KplFoZCLtA2UXABZCNwIuyp5RaGQi7QNlCwAWQFwIvypZSamUg7ALWd+SIn9atk9avJ+ACyB+BF2VLKbQyEHYBa3K/guvQjz+G5zmWgAvgQgRelB2l0MpA2AWsI/8WBfe/+wRcAPkh8KJsKIVWBsIu4FspKefbE9aulXbuzHusv7+pyMizGjCgovr1cxBwAeSLwIuyoYRbGQi7QOkrWsB1v4IbFWUqI+OYwsPD5edXco8SB2APBF5YXwm3MhB2gdJRnIB78RVcw5AyMkq2XgD2QeCFtZVwKwNhFyg53gy4AFAcBF5YWwm2MhB2Ae8i4AKwKgIvrKsEWxkIu0DxEXABlBUEXlhTCbYyEHYBzxBwAZRVBF5YUwm1MhB2gcIj4AKwCwIvrKeEWhkIu0D+CLgA7IrAC2spoVYGwi6QEwEXQHlB4IW1lEArA2EXcCLgAiivCLywjhJoZSDsojwj4AKAE4EX1lACrQyEXZQ3BFwAyB2BF9bg5VYGwi7KAwIuABQOgRe+5+VWBsIu7IqACwCeIfDCt7zcykDYhZ0QcAHAOwi88C0vtjIQdlHWEXABoGQQeOE7XmxlIOyiLCLgAkDpIPDCN7zYykDYRVlBwAUA3yDwwje81MpA2IWVEXABwBoIvCh9XmplIOzCagi4AGBNZSLwTp8+XS+99JKSkpLUoUMHvf766+revXuuY+Pj43Xrrbe6rQsMDFRGRkZplIqCeKmVgbALKyDgAkDZYPnAu2DBAo0fP14zZ85Ujx49NHXqVA0aNEi7du1SeHh4rtuEhoZq165drmVHMR9PCy/yQisDYRe+QsAFgLLJ8oH31Vdf1Z133um6ajtz5kwtW7ZMs2fP1qOPPprrNg6HQ7Vr1y7NMlEYK1YUu5WBsIvSdOSIn9avPx9yCbgAUDZZOvCeOXNGW7Zs0YQJE1zr/Pz8NGDAAG3atCnP7U6dOqVGjRrJMAx17txZzz33nNq2bZvn+MzMTGVmZrqW09LSJEmGYcgwDC+ciW8ZhiHTNH17Lqmpctx5p7LjrfHii1KDBlIRajqTdUbDFg7Tp7s/leQMu5+M+ETRjaLL7DxZYm7gkn0Fd906h9atc2jnztx/iyRJ/v6munSRoqOl6Ggz14DLtJYc/u5YF3NjXXabm6Kch6UD75EjR5SVlaWIiAi39REREfrpp59y3aZly5aaPXu2IiMjlZqaqpdfflk9e/bUzp07Vb9+/Vy3iYuL0+TJk3OsP3z4sC16fw3D0Llz55SSkiI/Pz+f1BD6wAOq9FcrQ2a
fPjoeE+NMF4V0JuuM7lx1pz7f/7kkKahCkN7/2/tqW6mtUoqwH6sxDEOpqakyTdNnc1OeHTnip6+/rqiNGwO0cWOAdu2qmOdYf39TkZFn1bPnGUVFnVH37mdVpYrpej8jw/lC6eDvjnUxN9Zlt7k5efJkocdaOvB6IioqSlFRUa7lnj17qnXr1nrzzTf19NNP57rNhAkTNH78eNdyWlqaGjRooLCwMIXa4HeShmGoQoUKCg8P9803+IoV8ps3T5JkhoSoYny8wi/6ISY/2Vd2s8Nu9pXdfo3LfhuDYRhyOBwKCwuzxT8+Vud+BVfauTPvlho/P1MdOpxV//4VLmhRqCDnP5uVSqtk5IG/O9bF3FiX3eYmKCio0GMtHXhr1aolf39/JScnu61PTk4udI9uxYoV1alTJ+3ZsyfPMYGBgQoMDMyx3s/PzxbfEJKzr9kn55OaKt199/k6Xn5ZjiZNCr35mawzunHRjW5tDHbr2fXZ3JQDxfmQWVSUqYyMY777QREF4u+OdTE31mWnuSnKOVg68AYEBKhLly5avXq1YmJiJDl/Olm9erXGjBlTqH1kZWVp+/btGjJkSAlWijwV464MfEANReXNuygYBi0KAGAXlg68kjR+/HjFxsaqa9eu6t69u6ZOnar09HTXXRtGjRqlevXqKS4uTpI0ZcoUXXrppWrWrJlOnDihl156Sfv379cdF977FaWjGHdlIOyiMLhNGACgMCwfeIcPH67Dhw/rqaeeUlJSkjp27KgVK1a4Psh24MABt0vax48f15133qmkpCRVr15dXbp00caNG9WmTRtfnUL5lJoq3Xnn+eUiPGCCsIu8EHABAJ6wfOCVpDFjxuTZwrB27Vq35ddee02vvfZaKVSFfHnYykDYxYUIuAAAbygTgRdljIetDIRdEHABACWBwAvv8rCVgbBbPhFwAQClgcAL7/KglYGwW34QcAEAvkDghfd40MpA2LU3Ai4AwAoIvPAOD1oZCLv2Q8AFAFgRgRfeUcRWBsKuPRBwAQBlAYEXxVfEVgbCbtlFwAUAlEUEXhRPEVsZCLtlCwEXAGAHBF4UTxFaGQi71kfABQDYEYEXnitCKwNh15oIuACA8oDAC88UoZWBsGsdBFwAQHlE4IVnCtnKQNj1LQIuAAAEXniikK0MhN3SR8AFACAnAi+KppCtDITd0kHABQCgYAReFE0hWhkIuyWHgAsAQNEReFF4hWhlIOx6FwEXAIDiI/CicArRykDYLT4CLgAA3kfgReEU0MpA2PVMSor06aeB+v57h9atI+ACAFASCLwoWAGtDITdwst5BddPUvVcxxJwAQDwDgIv8ldAKwNhN3+0KAAA4HsEXuQvn1YGwm5ORQ+4prp1S9fgwZV02WV+BFwAAEoAgRd5y6eVgbDrVNwruCEhplJSTik8vJL8/EqnZgAAyhsCL3KXTytDeQ673m5RMIySqxUAADgReJG7PFoZylvYpQcXAICyj8CLnPJoZSgPYZeACwCA/RB44S6PVga7hl0CLgAA9kfghbtcWhnsFHYJuAAAlD8EXpyXSyvDGeNsmQ67BFwAAEDghVMurQxn6tcpc2GXgAsAAC5G4IXTRa0MZ26/pUyEXQIuAAAoCIEXOVoZzrz5b92wcJglwy4BFwAAFBWBt7y7qJXhzEvP64ZvHrBM2CXgAgCA4iLwlncXtDKcueJy3VB1pU/DLgEXAAB4G4G3PLugleFMaGXdcJNfqYddAi4AAChpBN7y6oJWhjP+0g0TLtEnB7+QVLJhl4ALAABKG4G3vPqrleGMv3TDvbX0SeY2Sd4PuwRcAADgawTe8uivVoYz/tINI/z1Sc0jkrwTdgm4AADAagi85c1frQxn/KUbbpQ+aZYlyfOwS8AFAABWR+Atbx54QGcO/eYMuy2dq4oSdgm4AACgrCHwlicrVuhM/DtFCrsEXAAAUNYReMuJSmfP6uw/7tKwAsIuARcAANiNn68LKIzp06ercePGCgo
KUo8ePfTtt9/mO37hwoVq1aqVgoKC1L59ey1fvryUKrWuW/fs0rBev+cIu20r99OiRdKYMVK7dlJEhDRsmDR9es6w6+8vde8uPfywtHy5dOyY9M030gsvSIMHE3YBAIA1Wf4K74IFCzR+/HjNnDlTPXr00NSpUzVo0CDt2rVL4eHhOcZv3LhRI0aMUFxcnK688krNmzdPMTEx2rp1q9q1a+eDM7CA9HQlNvnDFXYDFKyBR5bpvqv6cQUXAADYnsM0TdPXReSnR48e6tatm9544w1JkmEYatCgge677z49+uijOcYPHz5c6enp+vTTT13rLr30UnXs2FEzZ84s1DHT0tJUtWpVpaamKtQGCc84c0a3DAvV+50zpbPB0txl0q85e3YJuKXPMAylpKQoPDxcfn5l4hcu5QZzY23Mj3UxN9Zlt7kpSl6z9BXeM2fOaMuWLZowYYJrnZ+fnwYMGKBNmzblus2mTZs0fvx4t3WDBg1SQkJCnsfJzMxUZmamazktLU2S8xvDMIxinIE1GH5++i0jStW/jNXxbZdJxy+RJPn7m+rSRYqOlqKjzVwDrg1O39IMw5Bpmrb4PrMb5sbamB/rYm6sy25zU5TzsHTgPXLkiLKyshQREeG2PiIiQj/99FOu2yQlJeU6PikpKc/jxMXFafLkyTnWX3/99apQwdJfokIxTVPbtm5Xw/r7VbFic1VqvEXVq/+g6tV/VIUKp/XDD9IPP0j/+pevKy1/TNPUuXPnVKFCBTkcDl+XgwswN9bG/FgXc2Nddpubc+fOFXps2U9zXjBhwgS3q8JpaWlq0KCBPvroI3u0NBiGhgwZouXLH7/gVxg3+LQmOBmGocOHDyssLMwWv16yE+bG2pgf62JurMtuc5OWlqbq1asXaqylA2+tWrXk7++v5ORkt/XJycmqXbt2rtvUrl27SOMlKTAwUIGBgTnW+/n52eIbQpIcDoetzsdOmBvrYm6sjfmxLubGuuw0N0U5B0ufbUBAgLp06aLVq1e71hmGodWrVysqKirXbaKiotzGS9KqVavyHA8AAAB7s/QVXkkaP368YmNj1bVrV3Xv3l1Tp05Venq6br31VknSqFGjVK9ePcXFxUmSxo4dq+joaL3yyisaOnSo5s+fr82bN2vWrFm+PA0AAAD4iOUD7/Dhw3X48GE99dRTSkpKUseOHbVixQrXB9MOHDjgdkm7Z8+emjdvnp544gk99thjat68uRISEsrvPXgBAADKOcsHXkkaM2aMxowZk+t7a9euzbFu2LBhGjZsWAlXBQAAgLLA0j28AAAAQHEReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK0ReAEAAGBrBF4AAADYGoEXAAAAtkbgBQAAgK1ZOvCapqmnnnpKderUUXBwsAYMGKDdu3fnu82kSZPkcDjcXq1atSqligEAAGA1lg68L774ov71r39p5syZ+uabb1S5cmUNGjRIGRkZ+W7Xtm1bHTp0yPXasGFDKVUMAAAAq6ng6wLyYpqmpk6dqieeeELXXHONJOm9995TRESEEhISdNNNN+W5bYUKFVS7du3SKhUAAAAWZtnAu2/fPiUlJWnAgAGudVWrVlWPHj20adOmfAPv7t27VbduXQUFBSkqKkpxcXFq2LBhnuMzMzOVmZnpWk5NTZUknThxQoZheOFsfMswDJ09e1YnTpyQn5+lL+qXO4ZhKC0
tTQEBAcyNxTA31sb8WBdzY112m5u0tDRJzoukBbFs4E1KSpIkRUREuK2PiIhwvZebHj16KD4+Xi1bttShQ4c0efJkXXbZZdqxY4eqVKmS6zZxcXGaPHlyjvWNGjUqxhlYT82aNX1dAgAAgFedPHlSVatWzXeMwyxMLC4Fc+fO1d133+1aXrZsmfr27as//vhDderUca2/8cYb5XA4tGDBgkLt98SJE2rUqJFeffVV3X777bmOufgKr2EYOnbsmGrWrCmHw+HhGVlHWlqaGjRooIMHDyo0NNTX5eACzI11MTfWxvxYF3NjXXabG9M0dfLkSdWtW7fAK9aWucJ79dVXq0ePHq7l7ACanJzsFniTk5PVsWPHQu+3WrVqatGihfbs2ZPnmMDAQAUGBubYzm5CQ0Nt8Q1uR8yNdTE31sb8WBdzY112mpuCruxms0wDR5UqVdSsWTPXq02bNqpdu7ZWr17tGpOWlqZvvvlGUVFRhd7vqVOntHfvXrfQDAAAgPLDMoH3Yg6HQ+PGjdMzzzyjpUuXavv27Ro1apTq1q2rmJgY17j+/fvrjTfecC0/+OCDWrdunX799Vdt3LhR1157rfz9/TVixAgfnAUAAAB8zTItDbl5+OGHlZ6errvuuksnTpxQ7969tWLFCgUFBbnG7N27V0eOHHEt//bbbxoxYoSOHj2qsLAw9e7dW19//bXCwsJ8cQqWEBgYqIkTJ+Zo24DvMTfWxdxYG/NjXcyNdZXnubHMh9YAAACAkmDZlgYAAADAGwi8AAAAsDUCLwAAAGyNwAsAAABbI/Da3PTp09W4cWMFBQWpR48e+vbbb31dEiStX79eV111lerWrSuHw6GEhARfl4S/xMXFqVu3bqpSpYrCw8MVExOjXbt2+bosSJoxY4YiIyNdN82PiorSZ5995uuykIvnn3/edXtR+N6kSZPkcDjcXq1atfJ1WaWKwGtjCxYs0Pjx4zVx4kRt3bpVHTp00KBBg5SSkuLr0sq99PR0dejQQdOnT/d1KbjIunXrNHr0aH399ddatWqVzp49q4EDByo9Pd3XpZV79evX1/PPP68tW7Zo8+bNuvzyy3XNNddo586dvi4NF/juu+/05ptvKjIy0tel4AJt27bVoUOHXK8NGzb4uqRSxW3JbKxHjx7q1q2b68EchmGoQYMGuu+++/Too4/6uDpkczgcWrx4sdsDVWAdhw8fVnh4uNatW6c+ffr4uhxcpEaNGnrppZd0++23+7oUyPl0086dO+vf//63nnnmGXXs2FFTp071dVnl3qRJk5SQkKDExERfl+IzXOG1qTNnzmjLli0aMGCAa52fn58GDBigTZs2+bAyoGxJTU2V5AxWsI6srCzNnz9f6enpRXrcPErW6NGjNXToULf/98Aadu/erbp166pp06b6+9//rgMHDvi6pFJl6SetwXNHjhxRVlaWIiIi3NZHRETop59+8lFVQNliGIbGjRunXr16qV27dr4uB5K2b9+uqKgoZWRkKCQkRIsXL1abNm18XRYkzZ8/X1u3btV3333n61JwkR49eig+Pl4tW7bUoUOHNHnyZF122WXasWOHqlSp4uvySgWBFwDyMHr0aO3YsaPc9bpZWcuWLZWYmKjU1FQtWrRIsbGxWrduHaHXxw4ePKixY8dq1apVCgoK8nU5uMjgwYNdf46MjFSPHj3UqFEj/ec//yk37UAEXpuqVauW/P39lZyc7LY+OTlZtWvX9lFVQNkxZswYffrpp1q/fr3q16/v63Lwl4CAADVr1kyS1KVLF3333XeaNm2a3nzzTR9XVr5t2bJFKSkp6ty5s2tdVlaW1q9frzfeeEOZmZny9/f3YYW4ULVq1dSiRQvt2bPH16WUGnp4bSogIEBdunTR6tWrXesMw9Dq1avpdwPyYZqmxowZo8WLF2vNmjVq0qSJr0tCPgzDUGZmpq/LKPf69++v7du3KzEx0fXq2rWr/v73vysxMZGwazGnTp3S3r17Vad
OHV+XUmq4wmtj48ePV2xsrLp27aru3btr6tSpSk9P16233urr0sq9U6dOuf1kvW/fPiUmJqpGjRpq2LChDyvD6NGjNW/ePC1ZskRVqlRRUlKSJKlq1aoKDg72cXXl24QJEzR48GA1bNhQJ0+e1Lx587R27VqtXLnS16WVe1WqVMnR5165cmXVrFmT/ncLePDBB3XVVVepUaNG+uOPPzRx4kT5+/trxIgRvi6t1BB4bWz48OE6fPiwnnrqKSUlJaljx45asWJFjg+yofRt3rxZ/fr1cy2PHz9ekhQbG6v4+HgfVQXJ+XADSerbt6/b+jlz5uiWW24p/YLgkpKSolGjRunQoUOqWrWqIiMjtXLlSl1xxRW+Lg2wtN9++00jRozQ0aNHFRYWpt69e+vrr79WWFiYr0srNdyHFwAAALZGDy8AAABsjcALAAAAWyPwAgAAwNYIvAAAALA1Ai8AAABsjcALAAAAWyPwAgAAwNYIvAAAALA1Ai8AlKL//Oc/qlGjhk6dOlXkbRs3bqwrr7yyBKrKXXx8vBwOh3799ddSO+aFfvzxR1WoUEE7duzwyfEB2AeBFwBKSVZWliZOnKj77rtPISEhvi7H8tq0aaOhQ4fqqaee8nUpAMo4Ai8AlJJPPvlEu3bt0l133eXrUgpl5MiR+vPPP9WoUSOf1XDPPfdo8eLF2rt3r89qAFD2EXgBoJTMmTNHvXr1Ur169XxdSqH4+/srKChIDofDZzUMGDBA1atX17vvvuuzGgCUfQReACiEP//8U61atVKrVq30559/utYfO3ZMderUUc+ePZWVlZXn9hkZGVqxYoUGDBiQ4705c+bo8ssvV3h4uAIDA9WmTRvNmDEjz319/vnn6tixo4KCgtSmTRt9/PHHbu+fPXtWkydPVvPmzRUUFKSaNWuqd+/eWrVqldu4n376STfeeKPCwsIUHBysli1b6vHHH3e9n1sP7+bNmzVo0CDVqlVLwcHBatKkiW677Ta3/c6fP19dunRRlSpVFBoaqvbt22vatGluX7MHH3xQ7du3V0hIiEJDQzV48GD98MMPOc61YsWK6tu3r5YsWZLn1wMAClLB1wUAQFkQHBysd999V7169dLjjz+uV199VZI0evRopaamKj4+Xv7+/nluv2XLFp05c0adO3fO8d6MGTPUtm1bXX311apQoYI++eQT3XvvvTIMQ6NHj3Ybu3v3bg0fPlz33HOPYmNjNWfOHA0bNkwrVqzQFVdcIUmaNGmS4uLidMcdd6h79+5KS0vT5s2btXXrVteYbdu26bLLLlPFihV11113qXHjxtq7d68++eQTPfvss7meQ0pKigYOHKiwsDA9+uijqlatmn799Ve3wL1q1SqNGDFC/fv31wsvvCBJ+t///qevvvpKY8eOlST98ssvSkhI0LBhw9SkSRMlJyfrzTffVHR0tH788UfVrVvX7bhdunTRkiVLlJaWptDQ0HznCQByZQIACm3ChAmmn5+fuX79enPhwoWmJHPq1KkFbvf222+bkszt27fneO/06dM51g0aNMhs2rSp27pGjRqZksyPPvrItS41NdWsU6eO2alTJ9e6Dh06mEOHDs23nj59+phVqlQx9+/f77beMAzXn+fMmWNKMvft22eapmkuXrzYlGR+9913ee537NixZmhoqHnu3Lk8x2RkZJhZWVlu6/bt22cGBgaaU6ZMyTF+3rx5piTzm2++yfecACAvtDQAQBFMmjRJbdu2VWxsrO69915FR0frn//8Z4HbHT16VJJUvXr1HO8FBwe7/pyamqojR44oOjpav/zyi1JTU93G1q1bV9dee61rOTQ0VKNGjdL333+vpKQkSVK1atW0c+dO7d69O9daDh8+rPXr1+u2225Tw4YN3d7Lr1+3WrVqkqRPP/1UZ8+ezXNMenp6jvaJCwUGBsrPz/m/n6ysLB09elQhISFq2bKltm7dmmN89tfsyJEjee4TAPJD4AWAIggICNDs2bO1b98
+nTx5UnPmzCnSh7pM08yx7quvvtKAAQNUuXJlVatWTWFhYXrsscckKUfgbdasWY7jtWjRQpJcvbZTpkzRiRMn1KJFC7Vv314PPfSQtm3b5hr/yy+/SJLatWtX6LolKTo6Wtdff70mT56sWrVq6ZprrtGcOXOUmZnpGnPvvfeqRYsWGjx4sOrXr6/bbrtNK1ascNuPYRh67bXX1Lx5cwUGBqpWrVoKCwvTtm3bcpyvdP5r5ssPzwEo2wi8AFBEK1eulOT8IFpeV1EvVrNmTUnS8ePH3dbv3btX/fv315EjR/Tqq69q2bJlWrVqle6//35JznBYVH369NHevXs1e/ZstWvXTm+//bY6d+6st99+u8j7upDD4dCiRYu0adMmjRkzRr///rtuu+02denSxfUgjfDwcCUmJmrp0qW6+uqr9eWXX2rw4MGKjY117ee5557T+PHj1adPH33wwQdauXKlVq1apbZt2+Z6vtlfs1q1ahWrfgDlmK97KgCgLPnhhx/MgIAA89ZbbzU7depkNmjQwDxx4kSB223YsMGUZC5ZssRt/WuvvWZKytFL+9hjj7n1z5qms4e3bt26bn22pmmajzzyiCnJPHToUK7HPnnypNmpUyezXr16pmmaZkpKiinJHDt2bL41X9zDm5u5c+eaksy33nor1/ezsrLMu+++25Rk7t692zRNZ49xv379coytV6+eGR0dnWP9M888Y/r5+RXq6wwAueEKLwAU0tmzZ3XLLbeobt26mjZtmuLj45WcnOy6GpufLl26KCAgQJs3b3Zbn31nB/OCVofU1FTNmTMn1/388ccfWrx4sWs5LS1N7733njp27KjatWtLOt8vnC0kJETNmjVztR6EhYWpT58+mj17tg4cOOA21syl5SLb8ePHc7zfsWNHSXLt++Jj+/n5KTIy0m2Mv79/jv0sXLhQv//+e67H3bJli9q2bauqVavmWRsA5IfbkgFAIT3zzDNKTEzU6tWrVaVKFUVGRuqpp57SE088oRtuuEFDhgzJc9ugoCANHDhQX3zxhaZMmeJaP3DgQAUEBOiqq67S3XffrVOnTumtt95SeHi4Dh06lGM/LVq00O23367vvvtOERERmj17tpKTk90Ccps2bdS3b1916dJFNWrU0ObNm7Vo0SKNGTPGNeZf//qXevfurc6dO+uuu+5SkyZN9Ouvv2rZsmVKTEzM9Rzeffdd/fvf/9a1116rSy65RCdPntRbb72l0NBQ17nfcccdOnbsmC6//HLVr19f+/fv1+uvv66OHTuqdevWkqQrr7xSU6ZM0a233qqePXtq+/btmjt3rpo2bZrjmGfPntW6det077335j85AJAf315gBoCyYcuWLWaFChXM++67z239uXPnzG7dupl169Y1jx8/nu8+Pv74Y9PhcJgHDhxwW7906VIzMjLSDAoKMhs3bmy+8MIL5uzZs3NtaRg6dKi5cuVKMzIy0gwMDDRbtWplLly40G1/zzzzjNm9e3ezWrVqZnBwsNmqVSvz2WefNc+cOeM2bseOHea1115rVqtWzQwKCjJbtmxpPvnkk673L25p2Lp1qzlixAizYcOGZmBgoBkeHm5eeeWV5ubNm13bLFq0yBw4cKAZHh5uBgQEmA0bNjTvvvtut3aLjIwM84EHHjDr1KljBgcHm7169TI3bdpkRkdH52hp+Oyzz9zaIQDAEw7TzOf3VwAAr8nKylKbNm1044036umnn/Z1OWVCTEyMHA6HWxsHABQVgRcAStGCBQv0j3/8QwcOHFBISIivy7G0//3vf2rfvr0SExOLfAs1ALgQgRcAAAC2xl0aAAAAYGsEXgAAANgagRcAAAC2RuAFAACArRF4AQAAYGsEXgAAANgagRcAAAC2RuAFAACArRF4AQAAYGsEXgAAANja/wO/kav13fKFjwAAAABJRU5ErkJggg==", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Note: Arrows represent vectors. Endpoint of arrow = vector endpoint\n" ] } ], "source": [ "# 可视化二维向量\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "# 设置中文字体(如果系统有的话)\n", "try:\n", " plt.rcParams['font.sans-serif'] = ['SimHei', 'Noto Sans CJK SC', 'WenQuanYi Micro Hei']\n", " plt.rcParams['axes.unicode_minus'] = False\n", "except:\n", " pass # 如果没有中文字体就用默认\n", "\n", "# 创建画布\n", "fig, ax = plt.subplots(figsize=(8, 8))\n", "\n", "# 定义向量\n", "vectors = {\n", " 'A = [2, 3]': np.array([2, 3]),\n", " 'B = [4, 1]': np.array([4, 1]),\n", " 'C = [1, 1]': np.array([1, 1]),\n", "}\n", "\n", "# 画每个向量\n", "colors = ['red', 'blue', 'green']\n", "for (name, vec), color in zip(vectors.items(), colors):\n", " ax.annotate('', xy=vec, xytext=(0, 0),\n", " arrowprops=dict(arrowstyle='->', color=color, lw=2))\n", " ax.text(vec[0]+0.1, vec[1]+0.1, name, fontsize=12, color=color)\n", "\n", "# 画坐标系\n", "ax.axhline(y=0, color='black', linewidth=0.5)\n", "ax.axvline(x=0, color='black', linewidth=0.5)\n", "\n", "# 设置范围\n", "ax.set_xlim(-0.5, 5.5)\n", "ax.set_ylim(-0.5, 4)\n", "ax.set_xlabel('x (abscissa)', fontsize=12)\n", "ax.set_ylabel('y (ordinate)', fontsize=12)\n", "ax.set_title('2D Vector Visualization', fontsize=14)\n", "ax.grid(True, alpha=0.3)\n", "ax.set_aspect('equal')\n", "\n", "plt.show()\n", "print(\"Note: Arrows represent vectors. 
Endpoint of arrow = vector endpoint\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.2 Basic Vector Operations\n", "\n", "### 3.2.1 Vector Addition\n", "\n", "**Rule: add the elements at matching positions**\n", "\n", "```\n", "[1, 2, 3] + [4, 5, 6] = [1+4, 2+5, 3+6] = [5, 7, 9]\n", "```\n", "\n", "This is mathematical notation, not Python: `+` on plain Python lists concatenates them, which is why the code below uses NumPy arrays.\n", "\n", "**Geometric intuition**: walking along vector a and then along vector b is the same as walking straight from the origin to a+b.\n", "\n", "```\n", "                 a+b\n", "               ↗  ↑\n", "            ↗     │\n", "         ↗        │ b = [4, 5, 6]\n", "      ↗           │\n", "   O ───────────→ •\n", "     a = [1, 2, 3]\n", "```" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==================================================\n", "Vector addition demo\n", "==================================================\n", "Vector a = [1 2 3]\n", "Vector b = [4 5 6]\n", "a + b = [5 7 9]\n", "\n", "Calculation steps:\n", "  Position 0: 1 + 4 = 5\n", "  Position 1: 2 + 5 = 7\n", "  Position 2: 3 + 6 = 9\n", "\n", "Check: True True True\n" ] } ], "source": [ "# Vector addition demo\n", "import numpy as np\n", "\n", "print(\"=\" * 50)\n", "print(\"Vector addition demo\")\n", "print(\"=\" * 50)\n", "\n", "a = np.array([1, 2, 3])\n", "b = np.array([4, 5, 6])\n", "c = a + b  # element-wise addition on NumPy arrays\n", "\n", "print(f\"Vector a = {a}\")\n", "print(f\"Vector b = {b}\")\n", "print(f\"a + b = {c}\")\n", "print()\n", "print(\"Calculation steps:\")\n", "print(f\"  Position 0: {a[0]} + {b[0]} = {a[0]+b[0]}\")\n", "print(f\"  Position 1: {a[1]} + {b[1]} = {a[1]+b[1]}\")\n", "print(f\"  Position 2: {a[2]} + {b[2]} = {a[2]+b[2]}\")\n", "print()\n", "print(\"Check:\", a[0]+b[0] == c[0], a[1]+b[1] == c[1], a[2]+b[2] == c[2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2.2 Scalar Multiplication\n", "\n", "**Rule: multiply every element by the scalar (a single number)**\n", "\n", "```\n", "2 × [1, 2, 3] = [2×1, 2×2, 2×3] = [2, 4, 6]\n", "3 × [1, 2, 3] = [3×1, 3×2, 3×3] = [3, 6, 9]\n", "0.5 × [1, 2, 3] = [0.5, 1.0, 1.5]\n", "```\n", "\n", "**Geometric intuition**:\n", "- Positive scalar: same direction, length scaled\n", "- Negative scalar: opposite direction, length scaled\n", "- 0: the result is the zero vector" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==================================================\n", "Scalar multiplication demo\n", 
"==================================================\n", "原始向量 v = [1 2 3]\n", "\n", "2 × v = [2 4 6]\n", "3 × v = [3 6 9]\n", "0.5 × v = [0.5 1. 1.5]\n", "-1 × v = [-1 -2 -3]\n", "0 × v = [0 0 0]\n" ] } ], "source": [ "# 向量数乘演示\n", "import numpy as np\n", "\n", "print(\"=\" * 50)\n", "print(\"向量数乘(标量乘法)演示\")\n", "print(\"=\" * 50)\n", "\n", "v = np.array([1, 2, 3])\n", "\n", "print(f\"原始向量 v = {v}\")\n", "print()\n", "\n", "for scalar in [2, 3, 0.5, -1, 0]:\n", " result = scalar * v\n", " print(f\"{scalar} × v = {result}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2.3 向量的长度(模/范数)\n", "\n", "**定义:从原点到向量终点的距离**\n", "\n", "对于二维向量 `[a, b]`:\n", "```\n", "长度 = √(a² + b²)\n", "\n", "这就是\"勾股定理\"!\n", "\n", " |\n", " b |\n", " | |\n", " | √(a²+b²)\n", " | /\n", " | /\n", " |/ a\n", " O——————\n", "```\n", "\n", "对于n维向量 `[a₁, a₂, ..., aₙ]`:\n", "```\n", "长度 = √(a₁² + a₂² + ... + aₙ²)\n", "```" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==================================================\n", "向量长度(模/范数)演示\n", "==================================================\n", "向量 v = [3 4]\n", "长度 = √(3² + 4²) = √(9 + 16) = √25 = 5.0\n", "\n", "向量长度计算例子:\n", " [np.int64(1), np.int64(1)] -> 长度 = 1.4142\n", " [np.int64(0), np.int64(5)] -> 长度 = 5.0000\n", " [np.int64(3), np.int64(4)] -> 长度 = 5.0000\n", " [np.int64(1), np.int64(2), np.int64(2)] -> 长度 = 3.0000\n", " [np.int64(1), np.int64(1), np.int64(1), np.int64(1)] -> 长度 = 2.0000\n" ] } ], "source": [ "# 向量长度计算\n", "import numpy as np\n", "\n", "print(\"=\" * 50)\n", "print(\"向量长度(模/范数)演示\")\n", "print(\"=\" * 50)\n", "\n", "# 二维向量例子\n", "v2d = np.array([3, 4])\n", "length_2d = np.linalg.norm(v2d)\n", "\n", "print(f\"向量 v = {v2d}\")\n", "print(f\"长度 = √({v2d[0]}² + {v2d[1]}²) = √({v2d[0]**2} + {v2d[1]**2}) = √{v2d[0]**2 + v2d[1]**2} = {length_2d}\")\n", "print()\n", "\n", "# 更多例子\n", "examples = [\n", " np.array([1, 1]), # 
45-degree angle\n", "    np.array([0, 5]),  # on the y-axis\n", "    np.array([3, 4]),  # classic Pythagorean triple (3-4-5)\n", "    np.array([1, 2, 2]),  # 3D vector\n", "    np.array([1, 1, 1, 1])  # 4D vector\n", "]\n", "\n", "print(\"Vector length examples:\")\n", "for v in examples:\n", "    length = np.linalg.norm(v)\n", "    # tolist() converts NumPy scalars to plain Python ints, so the list prints as [1, 1] rather than [np.int64(1), np.int64(1)]\n", "    print(f\"  {v.tolist()} -> length = {length:.4f}\")" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==================================================\n", "Exercise 3 answers\n", "==================================================\n", "A = [3 4], B = [1 2]\n", "\n", "1. A + B = [4 6]\n", "2. 2 × A = [6 8]\n", "3. Length of A = 5.0\n" ] } ], "source": [ "# Exercise 3 answers\n", "import numpy as np\n", "print(\"=\" * 50)\n", "print(\"Exercise 3 answers\")\n", "print(\"=\" * 50)\n", "\n", "A = np.array([3, 4])\n", "B = np.array([1, 2])\n", "\n", "print(f\"A = {A}, B = {B}\")\n", "print()\n", "\n", "# 1. A + B\n", "print(\"1. A + B =\", A + B)\n", "\n", "# 2. 2 × A\n", "print(\"2. 2 × A =\", 2 * A)\n", "\n", "# 3. Length of A\n", "print(f\"3. 
Length of A = {np.linalg.norm(A)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "# Part 4: Cosine Similarity\n", "\n", "## 4.1 What Is Similarity?\n", "\n", "**Similarity = how \"alike\" two vectors are**\n", "\n", "### Everyday examples of similarity\n", "\n", "| High similarity | Reason | Low similarity | Reason |\n", "|----------|------|----------|------|\n", "| \"cat\" and \"dog\" | Both are animals with four legs | \"cat\" and \"stone\" | One is an animal, the other is not |\n", "| \"red\" and \"yellow\" | Both are colors, both warm tones | \"hot\" and \"cold\" | Opposite meanings |\n", "| \"running\" and \"swimming\" | Both are sports | \"sun\" and \"bacteria\" | Almost nothing in common |\n", "| \"apple\" and \"pear\" | Both are fruits | \"apple\" and \"phone\" | Related only when context links them |\n", "\n", "### How does a computer quantify similarity?\n", "\n", "Where text similarity is used in practice:\n", "\n", "```\n", "Search scenario:\n", "  User query: \"How do I learn programming?\"\n", "  Document 1: \"Python tutorial for beginners\" → high similarity ✅\n", "  Document 2: \"100 ways to bake a cake\" → low similarity ❌\n", "\n", "Recommendation scenario:\n", "  User likes: \"Funny cat and dog videos\"\n", "  Recommendation 1: \"Cute hamster moments\" → high similarity ✅\n", "  Recommendation 2: \"Car engine repair tutorial\" → low similarity ❌\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4.2 Dot Product: the Most Important Operation\n", "\n", "### Definition: multiply the elements at matching positions, then sum\n", "\n", "```\n", "a = [1, 2, 3]\n", "b = [4, 5, 6]\n", "\n", "dot product = 1×4 + 2×5 + 3×6 = 4 + 10 + 18 = 32\n", "```\n", "\n", "### Geometric meaning of the dot product\n", "\n", "```\n", "dot product = |A| × |B| × cos(θ)\n", "\n", "where:\n", "  |A| = length of vector A\n", "  |B| = length of vector B\n", "  θ   = angle between the two vectors\n", "```\n", "\n", "| Angle θ | cos(θ) | Dot product | Meaning |\n", "|--------|--------|----------|------|\n", "| 0° | 1 | \\|A\\|×\\|B\\| (maximum) | Exactly the same direction |\n", "| 90° | 0 | 0 | Perpendicular / orthogonal |\n", "| 180° | -1 | -\\|A\\|×\\|B\\| (minimum) | Exactly opposite directions |" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==================================================\n", "Dot product demo\n", "==================================================\n", "Vector a = [1 2 3]\n", "Vector b = [4 5 6]\n", "\n", "Dot product a · b = 32\n", "Check: a @ b = 32\n", "Manual calculation: 32\n", "\n", "Calculation steps:\n", "  a[0]×b[0] = 1×4 = 4\n", "  a[1]×b[1] = 2×5 = 10\n", "  a[2]×b[2] = 3×6 = 18\n", "  Sum: 4 + 10 + 18 = 32\n" ] } ], "source": [ "# Dot product demo\n", "import numpy as np\n", "\n", "print(\"=\" * 50)\n", "print(\"Dot product demo\")\n", 
"print(\"=\" * 50)\n", "\n", "a = np.array([1, 2, 3])\n", "b = np.array([4, 5, 6])\n", "\n", "# 方法1:使用np.dot()\n", "dot1 = np.dot(a, b)\n", "\n", "# 方法2:使用@运算符\n", "dot2 = a @ b\n", "\n", "# 方法3:手动计算\n", "dot3 = sum(a[i] * b[i] for i in range(len(a)))\n", "\n", "print(f\"向量 a = {a}\")\n", "print(f\"向量 b = {b}\")\n", "print()\n", "print(f\"点积 a · b = {dot1}\")\n", "print(f\"验证: a @ b = {dot2}\")\n", "print(f\"手动计算: {dot3}\")\n", "print()\n", "print(\"计算过程:\")\n", "print(f\" a[0]×b[0] = {a[0]}×{b[0]} = {a[0]*b[0]}\")\n", "print(f\" a[1]×b[1] = {a[1]}×{b[1]} = {a[1]*b[1]}\")\n", "print(f\" a[2]×b[2] = {a[2]}×{b[2]} = {a[2]*b[2]}\")\n", "print(f\" 求和: {a[0]*b[0]} + {a[1]*b[1]} + {a[2]*b[2]} = {dot1}\")" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==================================================\n", "点积与夹角的关系\n", "==================================================\n", "夹角0°: a=[1 0], b=[2 0], 点积=2\n", "夹角90°: a=[1 0], b=[0 1], 点积=0\n", "夹角180°: a=[1 0], b=[-1 0], 点积=-1\n", "\n", "任意角度: a=[1 1], b=[1 0]\n", " 点积 = 1\n", " cos(θ) = 0.7071\n", " 夹角 θ = 45.0°\n" ] } ], "source": [ "# 点积与夹角的关系\n", "import numpy as np\n", "\n", "print(\"=\" * 50)\n", "print(\"点积与夹角的关系\")\n", "print(\"=\" * 50)\n", "\n", "# 夹角为0度:方向完全相同\n", "a = np.array([1, 0])\n", "b = np.array([2, 0])\n", "dot = np.dot(a, b)\n", "print(f\"夹角0°: a={a}, b={b}, 点积={dot}\")\n", "\n", "# 夹角为90度:垂直\n", "a = np.array([1, 0])\n", "b = np.array([0, 1])\n", "dot = np.dot(a, b)\n", "print(f\"夹角90°: a={a}, b={b}, 点积={dot}\")\n", "\n", "# 夹角为180度:方向相反\n", "a = np.array([1, 0])\n", "b = np.array([-1, 0])\n", "dot = np.dot(a, b)\n", "print(f\"夹角180°: a={a}, b={b}, 点积={dot}\")\n", "\n", "# 任意角度\n", "import math\n", "a = np.array([1, 1])\n", "b = np.array([1, 0])\n", "dot = np.dot(a, b)\n", "cos_angle = dot / (np.linalg.norm(a) * np.linalg.norm(b))\n", "angle = math.acos(cos_angle) * 180 / math.pi\n", "print(f\"\\n任意角度: a={a}, 
b={b}\")\n", "print(f\" 点积 = {dot}\")\n", "print(f\" cos(θ) = {cos_angle:.4f}\")\n", "print(f\" 夹角 θ = {angle:.1f}°\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4.3 余弦相似度:用点积判断\"像不像\"\n", "\n", "### 公式\n", "\n", "```\n", "            A · B\n", "cos(θ) = ──────────\n", "         |A| × |B|\n", "\n", "其中:\n", "  A · B = 向量A和B的点积\n", "  |A| = 向量A的长度(模)\n", "  |B| = 向量B的长度(模)\n", "  cos(θ) = 相似度,范围是 [-1, 1]\n", "```\n", "\n", "### 为什么叫\"余弦\"相似度?\n", "\n", "因为公式中计算的就是两个向量夹角的余弦值!\n", "\n", "从点积公式推导:\n", "```\n", "A · B = |A| × |B| × cos(θ)\n", "        ↓\n", "cos(θ) = (A · B) / (|A| × |B|)\n", "```" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "余弦相似度函数已定义:cosine_similarity(a, b)\n" ] } ], "source": [ "# 定义余弦相似度函数\n", "import numpy as np\n", "\n", "def cosine_similarity(a, b):\n", "    \"\"\"\n", "    计算余弦相似度\n", "\n", "    参数:\n", "        a, b: 两个numpy数组(向量)\n", "\n", "    返回:\n", "        float: 余弦相似度,范围[-1, 1]\n", "    \"\"\"\n", "    dot = np.dot(a, b)  # 点积\n", "    norm_a = np.linalg.norm(a)  # 向量a的长度\n", "    norm_b = np.linalg.norm(b)  # 向量b的长度\n", "\n", "    # 防止除以零\n", "    if norm_a == 0 or norm_b == 0:\n", "        return 0.0\n", "\n", "    return dot / (norm_a * norm_b)\n", "\n", "print(\"余弦相似度函数已定义:cosine_similarity(a, b)\")" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==================================================\n", "余弦相似度计算示例\n", "==================================================\n", "1. 方向完全相同: a=[1 2 3], b=[2 4 6]\n", " 相似度 = 1.000 (应该是1.000)\n", "\n", "2. 方向完全相反: a=[1 2 3], b=[-1 -2 -3]\n", " 相似度 = -1.000 (应该是-1.000)\n", "\n", "3. 垂直向量: a=[1 0], b=[0 1]\n", " 相似度 = 0.000 (应该是0.000)\n", "\n", "4. 
45度夹角: a=[1 1], b=[1 0]\n", " 相似度 = 0.707 (应该是0.707)\n" ] } ], "source": [ "# 余弦相似度计算示例\n", "import numpy as np\n", "\n", "print(\"=\" * 50)\n", "print(\"余弦相似度计算示例\")\n", "print(\"=\" * 50)\n", "\n", "# 示例1:方向完全相同的向量\n", "a = np.array([1, 2, 3])\n", "b = np.array([2, 4, 6]) # b是a的两倍,方向完全相同\n", "sim = cosine_similarity(a, b)\n", "print(f\"1. 方向完全相同: a={a}, b={b}\")\n", "print(f\" 相似度 = {sim:.3f} (应该是1.000)\")\n", "print()\n", "\n", "# 示例2:方向完全相反的向量\n", "a = np.array([1, 2, 3])\n", "b = np.array([-1, -2, -3]) # b是a的相反方向\n", "sim = cosine_similarity(a, b)\n", "print(f\"2. 方向完全相反: a={a}, b={b}\")\n", "print(f\" 相似度 = {sim:.3f} (应该是-1.000)\")\n", "print()\n", "\n", "# 示例3:垂直的向量\n", "a = np.array([1, 0])\n", "b = np.array([0, 1])\n", "sim = cosine_similarity(a, b)\n", "print(f\"3. 垂直向量: a={a}, b={b}\")\n", "print(f\" 相似度 = {sim:.3f} (应该是0.000)\")\n", "print()\n", "\n", "# 示例4:45度夹角\n", "a = np.array([1, 1])\n", "b = np.array([1, 0])\n", "sim = cosine_similarity(a, b)\n", "print(f\"4. 45度夹角: a={a}, b={b}\")\n", "print(f\" 相似度 = {sim:.3f} (应该是0.707)\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 余弦相似度的值代表什么?\n", "\n", "| cos(θ) 值 | 夹角 θ | 相似程度 | 示例 |\n", "|----------|--------|---------|------|\n", "| 1.0 | 0° | **完全相同** | 同一向量 |\n", "| 0.8~0.99 | 0~37° | **非常相似** | \"猫\" vs \"狗\" |\n", "| 0.5~0.8 | 37~60° | **比较相似** | \"跑步\" vs \"运动\" |\n", "| 0.3~0.5 | 60~72° | **有些相似** | \"苹果\" vs \"水果\" |\n", "| 0 | 90° | **毫不相关** | \"猫\" vs \"石头\" |\n", "| -0.5~0 | 90~120° | **有些相反** | \"热\" vs \"冷\" |\n", "| -1.0 | 180° | **完全相反** | \"高\" vs \"矮\" |" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==================================================\n", "语义相似度示例(用向量模拟词义)\n", "==================================================\n", "\n", "词向量(简化模拟):\n", " 猫 = [0.9 0.1 0.7 0.8 0.9]\n", " 狗 = [0.8 0.2 0.6 0.8 0.9]\n", " 苹果 = [0.1 0.9 0.9 0. 0. ]\n", " 汽车 = [0. 0. 0. 0.9 0. 
]\n", " 石头 = [0. 0.1 0. 0. 0. ]\n", "\n", "维度说明: [动物性, 植物性, 可食用性, 移动性, 宠物性]\n", "\n", "相似度计算结果:\n", " 猫 vs 狗: 0.996 (都是动物,都有宠物属性)\n", " 猫 vs 苹果: 0.382 (动物vs植物,很不同)\n", " 猫 vs 汽车: 0.482 (动物vs机械)\n", " 猫 vs 石头: 0.060 (动物vs无机物)\n", " 狗 vs 汽车: 0.507 (动物vs机械,但都能移动)\n", " 苹果 vs 石头: 0.705 (都是静态的)\n" ] } ], "source": [ "# 语义相似度示例\n", "import numpy as np\n", "\n", "print(\"=\" * 50)\n", "print(\"语义相似度示例(用向量模拟词义)\")\n", "print(\"=\" * 50)\n", "print()\n", "\n", "# 假设这些是词的\"意义向量\"(简化版)\n", "# 维度解释: [动物性, 植物性, 可食用性, 移动性, 宠物性]\n", "# 每个维度取值0-1,表示该属性的强弱\n", "\n", "cat = np.array([0.9, 0.1, 0.7, 0.8, 0.9]) # 猫\n", "dog = np.array([0.8, 0.2, 0.6, 0.8, 0.9]) # 狗\n", "apple = np.array([0.1, 0.9, 0.9, 0.0, 0.0]) # 苹果\n", "car = np.array([0.0, 0.0, 0.0, 0.9, 0.0]) # 汽车\n", "rock = np.array([0.0, 0.1, 0.0, 0.0, 0.0]) # 石头\n", "\n", "print(\"词向量(简化模拟):\")\n", "print(f\" 猫 = {cat}\")\n", "print(f\" 狗 = {dog}\")\n", "print(f\" 苹果 = {apple}\")\n", "print(f\" 汽车 = {car}\")\n", "print(f\" 石头 = {rock}\")\n", "print()\n", "print(\"维度说明: [动物性, 植物性, 可食用性, 移动性, 宠物性]\")\n", "print()\n", "\n", "# 计算相似度\n", "print(\"相似度计算结果:\")\n", "print(f\" 猫 vs 狗: {cosine_similarity(cat, dog):.3f} (都是动物,都有宠物属性)\")\n", "print(f\" 猫 vs 苹果: {cosine_similarity(cat, apple):.3f} (动物vs植物,很不同)\")\n", "print(f\" 猫 vs 汽车: {cosine_similarity(cat, car):.3f} (动物vs机械)\")\n", "print(f\" 猫 vs 石头: {cosine_similarity(cat, rock):.3f} (动物vs无机物)\")\n", "print(f\" 狗 vs 汽车: {cosine_similarity(dog, car):.3f} (动物vs机械,但都能移动)\")\n", "print(f\" 苹果 vs 石头: {cosine_similarity(apple, rock):.3f} (都是静态的)\")" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==================================================\n", "练习题4答案\n", "==================================================\n", "A = [1 2 3], B = [4 5 6]\n", "\n", "1. 点积 A · B = 32\n", " 计算: 1×4 + 2×5 + 3×6 = 4 + 10 + 18 = 32\n", "\n", "2. 余弦相似度 = 0.9746\n", "\n", "3. 
A=[1,0], B=[0,1] 的余弦相似度 = 0.0\n", " 原因:这两个向量垂直,夹角90°,cos(90°)=0\n" ] } ], "source": [ "# 练习题4答案\n", "import numpy as np\n", "\n", "print(\"=\" * 50)\n", "print(\"练习题4答案\")\n", "print(\"=\" * 50)\n", "\n", "A = np.array([1, 2, 3])\n", "B = np.array([4, 5, 6])\n", "\n", "print(f\"A = {A}, B = {B}\")\n", "print()\n", "\n", "# 1. 点积\n", "dot = np.dot(A, B)\n", "print(f\"1. 点积 A · B = {dot}\")\n", "print(f\" 计算: 1×4 + 2×5 + 3×6 = 4 + 10 + 18 = {dot}\")\n", "print()\n", "\n", "# 2. 余弦相似度\n", "cos_sim = cosine_similarity(A, B)\n", "print(f\"2. 余弦相似度 = {cos_sim:.4f}\")\n", "print()\n", "\n", "# 3. 垂直向量的相似度\n", "A = np.array([1, 0])\n", "B = np.array([0, 1])\n", "cos_sim = cosine_similarity(A, B)\n", "print(f\"3. A=[1,0], B=[0,1] 的余弦相似度 = {cos_sim}\")\n", "print(\" 原因:这两个向量垂直,夹角90°,cos(90°)=0\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "# 第五部分:文本向量化的核心思想\n", "\n", "## 5.1 核心目标:把所有文本变成\"向量\"\n", "\n", "```\n", "文本(符号) ──→ 数值向量 ──→ 计算机可以计算 ──→ AI模型处理\n", "\n", "\"猫\" ──→ [0.9, 0.1, 0.8]\n", "\"狗\" ──→ [0.8, 0.2, 0.7]\n", "```\n", "\n", "### 为什么必须是向量?\n", "\n", "| 计算机擅长 | 计算机不擅长 |\n", "|------------|-------------|\n", "| 向量加减:v1 + v2 = ? | 字符串的语义比较:\"Python\" 和 \"Java\" 有多像? |\n", "| 向量点积:v1 · v2 = ? | 词语推理:\"猫\" 类似于 \"狗\" ? |\n", "| 向量距离:\\|v1 - v2\\| = ? | 语义理解:\"你好\"是问候语 |\n", "| 余弦相似度:cos(θ) = ? | 情感判断:\"绝了\"是夸还是骂? 
|" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.2 向量化示例:从\"词\"到\"数\"\n", "\n", "### 方法1:独热编码(One-Hot,下文称\"位置编码\";只有位置信息,没有语义)\n", "\n", "```python\n", "# 假设我们有一个很小的词汇表(只有5个词)\n", "vocab = [\"猫\", \"狗\", \"鱼\", \"苹果\", \"香蕉\"]\n", "\n", "# 位置编码:每个词对应一个位置\n", "# \"猫\" → [1, 0, 0, 0, 0] 第1个位置是1,其他是0\n", "# \"狗\" → [0, 1, 0, 0, 0] 第2个位置是1,其他是0\n", "# \"苹果\" → [0, 0, 0, 1, 0] 第4个位置是1,其他是0\n", "```\n", "\n", "**问题**:这只是\"位置编码\",没有语义信息!\n", "\n", "```\n", "\"猫\" = [1, 0, 0, 0, 0]\n", "\"狗\" = [0, 1, 0, 0, 0]\n", "\n", "余弦相似度 = 0 (完全不相似)\n", "\n", "但实际上\"猫\"和\"狗\"都是动物,应该很相似!\n", "```" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==================================================\n", "位置编码的缺陷\n", "==================================================\n", "位置编码向量:\n", " 猫 = [1 0 0 0 0]\n", " 狗 = [0 1 0 0 0]\n", " 苹果 = [0 0 0 1 0]\n", "\n", "余弦相似度(用位置编码):\n", " 猫 vs 狗: 0.000\n", " 猫 vs 苹果: 0.000\n", "\n", "问题:猫和狗都是动物,相似度却是0!\n", " 猫和苹果不是同类,相似度也是0!\n", " 位置编码没有语义信息!\n" ] } ], "source": [ "# 位置编码的缺陷演示\n", "import numpy as np\n", "\n", "print(\"=\" * 50)\n", "print(\"位置编码的缺陷\")\n", "print(\"=\" * 50)\n", "\n", "# 位置编码向量\n", "cat_onehot = np.array([1, 0, 0, 0, 0]) # \"猫\"\n", "dog_onehot = np.array([0, 1, 0, 0, 0]) # \"狗\"\n", "apple_onehot = np.array([0, 0, 0, 1, 0]) # \"苹果\"\n", "\n", "print(\"位置编码向量:\")\n", "print(f\" 猫 = {cat_onehot}\")\n", "print(f\" 狗 = {dog_onehot}\")\n", "print(f\" 苹果 = {apple_onehot}\")\n", "print()\n", "\n", "# 相似度计算\n", "print(\"余弦相似度(用位置编码):\")\n", "print(f\" 猫 vs 狗: {cosine_similarity(cat_onehot, dog_onehot):.3f}\")\n", "print(f\" 猫 vs 苹果: {cosine_similarity(cat_onehot, apple_onehot):.3f}\")\n", "print()\n", "print(\"问题:猫和狗都是动物,相似度却是0!\")\n", "print(\" 猫和苹果不是同类,相似度也是0!\")\n", "print(\" 位置编码没有语义信息!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 方法2:语义编码(有语义信息)\n", "\n", "```python\n", "# 语义编码:每个词用\"含义\"来表示\n", "# 维度:[动物性, 
植物性, 可食用性, 宠物性]\n", "\n", "cat = np.array([0.9, 0.1, 0.7, 0.9]) # 猫\n", "dog = np.array([0.8, 0.2, 0.6, 0.9]) # 狗\n", "apple = np.array([0.1, 0.9, 0.9, 0.0]) # 苹果\n", "```\n", "\n", "**这就是文本向量化的威力:把\"语义\"变成\"可计算的数值\"!**" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==================================================\n", "语义编码的优点\n", "==================================================\n", "语义编码向量:\n", " 猫 = [0.9 0.1 0.7 0.9]\n", " 狗 = [0.8 0.2 0.6 0.9]\n", " 苹果 = [0.1 0.9 0.9 0. ]\n", "\n", "维度说明: [动物性, 植物性, 可食用性, 宠物性]\n", "\n", "余弦相似度(用语义编码):\n", " 猫 vs 狗: 0.995 (都是动物,都有宠物属性)\n", " 猫 vs 苹果: 0.436 (动物vs植物)\n", "\n", "太棒了!语义编码可以捕捉到词的语义相似性!\n" ] } ], "source": [ "# 语义编码的优点演示\n", "import numpy as np\n", "\n", "print(\"=\" * 50)\n", "print(\"语义编码的优点\")\n", "print(\"=\" * 50)\n", "\n", "# 语义编码向量\n", "# 维度: [动物性, 植物性, 可食用性, 宠物性]\n", "cat = np.array([0.9, 0.1, 0.7, 0.9]) # 猫\n", "dog = np.array([0.8, 0.2, 0.6, 0.9]) # 狗\n", "apple = np.array([0.1, 0.9, 0.9, 0.0]) # 苹果\n", "\n", "print(\"语义编码向量:\")\n", "print(f\" 猫 = {cat}\")\n", "print(f\" 狗 = {dog}\")\n", "print(f\" 苹果 = {apple}\")\n", "print()\n", "print(\"维度说明: [动物性, 植物性, 可食用性, 宠物性]\")\n", "print()\n", "\n", "# 相似度计算\n", "print(\"余弦相似度(用语义编码):\")\n", "print(f\" 猫 vs 狗: {cosine_similarity(cat, dog):.3f} (都是动物,都有宠物属性)\")\n", "print(f\" 猫 vs 苹果: {cosine_similarity(cat, apple):.3f} (动物vs植物)\")\n", "print()\n", "print(\"太棒了!语义编码可以捕捉到词的语义相似性!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.3 向量化方法演进\n", "\n", "```\n", "文本向量化的三种主要方法:\n", "\n", "[ BoW ]      ───→  [ TF-IDF ]    ───→  [ Word Embedding ]\n", "(词袋模型)          (词频权重)           (词向量嵌入)\n", "\n", "简单粗暴            加入词重要性         蕴含语义信息\n", "无语义              部分语义             深度语义\n", "\n", "1950年代起          1970年代起           2013年后\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "# 第六部分:BoW词袋模型\n", "\n", "## 6.1 原理\n", "\n", "把文本看成\"一袋词\",**不考虑顺序**,只管词出现了几次。\n", "\n", "```\n", "文本1: \"Python 是 编程 语言\"\n", 
"文本2: \"Java 是 编程 语言\"\n", "\n", "分词后:\n", " Doc1: [\"Python\", \"是\", \"编程\", \"语言\"]\n", " Doc2: [\"Java\", \"是\", \"编程\", \"语言\"]\n", "\n", "构建词表(所有文档的词集合):\n", " 词表: [\"Python\", \"Java\", \"是\", \"编程\", \"语言\"]\n", " (词表顺序可以任意约定;下方代码按字典序排序,所以列的顺序会不同)\n", "\n", "向量化:统计每个词出现的次数\n", " Doc1 → [1, 0, 1, 1, 1] # Python出现1次,Java出现0次,...\n", " Doc2 → [0, 1, 1, 1, 1] # Python出现0次,Java出现1次,...\n", "```" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==================================================\n", "BoW词袋模型演示(手动实现)\n", "==================================================\n", "【示例1】文档集合:\n", " Doc1: Python 是 编程 语言\n", " Doc2: Java 是 编程 语言\n", "\n", "词表: ['Java', 'Python', '是', '编程', '语言']\n", "\n", "BoW矩阵(每行是一个文档,每列是一个词):\n", " Doc1: [0, 1, 1, 1, 1]\n", " Doc2: [1, 0, 1, 1, 1]\n", "\n", "详细解释:\n", "\n", "Doc1: Python 是 编程 语言\n", " -> 'Python' 出现 1 次\n", " -> '是' 出现 1 次\n", " -> '编程' 出现 1 次\n", " -> '语言' 出现 1 次\n", "\n", "Doc2: Java 是 编程 语言\n", " -> 'Java' 出现 1 次\n", " -> '是' 出现 1 次\n", " -> '编程' 出现 1 次\n", " -> '语言' 出现 1 次\n" ] } ], "source": [ "# BoW词袋模型演示(纯Python实现,不依赖sklearn)\n", "\n", "print(\"=\" * 50)\n", "print(\"BoW词袋模型演示(手动实现)\")\n", "print(\"=\" * 50)\n", "\n", "def simple_bow(docs):\n", "    \"\"\"\n", "    简单的BoW实现\n", "\n", "    参数:\n", "        docs: 文档列表,每篇文档已经是分词后的词列表\n", "    返回:\n", "        vocab: 词表(有序列表)\n", "        bow_matrix: BoW矩阵 (n_docs x n_vocab)\n", "    \"\"\"\n", "    # 1. 构建词表\n", "    vocab_set = set()\n", "    for doc in docs:\n", "        vocab_set.update(doc)\n", "    vocab = sorted(list(vocab_set))  # 排序保证顺序一致\n", "\n", "    # 2. 
构建BoW矩阵\n", "    bow_matrix = []\n", "    for doc in docs:\n", "        vec = [0] * len(vocab)\n", "        for word in doc:\n", "            if word in vocab:\n", "                vec[vocab.index(word)] += 1\n", "        bow_matrix.append(vec)\n", "\n", "    return vocab, bow_matrix\n", "\n", "\n", "# 示例1:已分好词的文档(每篇文档是一个词列表)\n", "docs = [\n", "    [\"Python\", \"是\", \"编程\", \"语言\"],\n", "    [\"Java\", \"是\", \"编程\", \"语言\"],\n", "]\n", "\n", "vocab, bow_matrix = simple_bow(docs)\n", "\n", "print(\"【示例1】文档集合:\")\n", "for i, doc in enumerate(docs):\n", "    print(f\" Doc{i+1}: {' '.join(doc)}\")\n", "print()\n", "\n", "print(f\"词表: {vocab}\")\n", "print()\n", "\n", "print(\"BoW矩阵(每行是一个文档,每列是一个词):\")\n", "for i, vec in enumerate(bow_matrix):\n", "    print(f\" Doc{i+1}: {vec}\")\n", "print()\n", "\n", "# 详细解释\n", "print(\"详细解释:\")\n", "for i, doc in enumerate(docs):\n", "    print(f\"\\nDoc{i+1}: {' '.join(doc)}\")\n", "    for j, word in enumerate(vocab):\n", "        if bow_matrix[i][j] > 0:\n", "            print(f\" -> '{word}' 出现 {bow_matrix[i][j]} 次\")" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==================================================\n", "BoW词袋模型:更多示例\n", "==================================================\n", "文档集合:\n", " Doc1: 我 爱 Python 编程\n", " Doc2: Python 很 好 学\n", " Doc3: 我 爱 写 代码\n", "\n", "词表: ['Python', '代码', '写', '好', '学', '很', '我', '爱', '编程']\n", "\n", "BoW矩阵:\n", " Doc1: [1, 0, 0, 0, 0, 0, 1, 1, 1]\n", " Doc2: [1, 0, 0, 1, 1, 1, 0, 0, 0]\n", " Doc3: [0, 1, 1, 0, 0, 0, 1, 1, 0]\n", "\n", "表格形式:\n", "Doc | Python | 代码 | 写 | 好 | 学 | 很\n", "----------------------------------\n", "Doc1 | 1 | 0 | 0 | 0 | 0 | 0\n", "Doc2 | 1 | 0 | 0 | 1 | 1 | 1\n", "Doc3 | 0 | 1 | 1 | 0 | 0 | 0\n" ] } ], "source": [ "# 更多BoW示例(纯Python实现)\n", "import numpy as np\n", "\n", "print(\"=\" * 50)\n", "print(\"BoW词袋模型:更多示例\")\n", "print(\"=\" * 50)\n", "\n", "def simple_bow(docs):\n", "    \"\"\"简单的BoW实现\"\"\"\n", "    vocab_set = set()\n", "    for doc in docs:\n", 
"        vocab_set.update(doc)\n", "    vocab = sorted(list(vocab_set))\n", "    bow_matrix = []\n", "    for doc in docs:\n", "        vec = [0] * len(vocab)\n", "        for word in doc:\n", "            if word in vocab:\n", "                vec[vocab.index(word)] += 1\n", "        bow_matrix.append(vec)\n", "    return vocab, bow_matrix\n", "\n", "docs = [\n", "    [\"我\", \"爱\", \"Python\", \"编程\"],\n", "    [\"Python\", \"很\", \"好\", \"学\"],\n", "    [\"我\", \"爱\", \"写\", \"代码\"]\n", "]\n", "\n", "vocab, bow_matrix = simple_bow(docs)\n", "\n", "print(\"文档集合:\")\n", "for i, doc in enumerate(docs):\n", "    print(f\" Doc{i+1}: {' '.join(doc)}\")\n", "print()\n", "\n", "print(f\"词表: {vocab}\")\n", "print()\n", "\n", "print(\"BoW矩阵:\")\n", "for i, vec in enumerate(bow_matrix):\n", "    print(f\" Doc{i+1}: {vec}\")\n", "\n", "print()\n", "\n", "# 显示成表格(只显示词表的前6列)\n", "print(\"表格形式:\")\n", "header = \"Doc | \" + \" | \".join(vocab[:6])\n", "print(header)\n", "print(\"-\" * len(header))\n", "for i, row in enumerate(bow_matrix):\n", "    print(f\"Doc{i+1} | \" + \" | \".join(map(str, row[:6])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6.2 BoW 的优缺点\n", "\n", "| 优点 | 缺点 |\n", "|------|------|\n", "| **简单直观** | 忽略词序 |\n", "| **容易实现** | \"我爱你\"和\"你爱我\"向量完全相同 |\n", "| **计算速度快** | 所有词同等重要 |\n", "| **适合基线模型** | 无法捕捉语义 |\n", "| | 无法处理同义词:\"电脑\"和\"计算机\"完全不同 |\n", "| | 维度很高(词表有多大,维度就多大) |" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==================================================\n", "BoW忽略词序的演示\n", "==================================================\n", "文档:\n", " Doc1: 我爱你\n", " Doc2: 你爱我\n", " Doc3: 爱你我\n", "\n", "BoW矩阵:\n", " Doc1: [1, 1, 1]\n", " Doc2: [1, 1, 1]\n", " Doc3: [1, 1, 1]\n", "\n", "词表: ['你', '我', '爱']\n", "\n", "问题:这三个词序不同、含义不同的句子,BoW向量完全相同!\n", "Doc1: 我爱你(表达爱意)\n", "Doc2: 你爱我(对方爱我)\n", "Doc3: 爱你我(意义不明)\n", "\n", "结论:BoW模型丢失了词序信息!\n" ] } ], "source": [ "# BoW忽略词序的演示\n", "import numpy as np\n", "\n", "print(\"=\" * 50)\n", 
"print(\"BoW忽略词序的演示\")\n", "print(\"=\" * 50)\n", "\n", "def simple_bow(docs):\n", "    \"\"\"简单的BoW实现\"\"\"\n", "    vocab_set = set()\n", "    for doc in docs:\n", "        vocab_set.update(doc)\n", "    vocab = sorted(list(vocab_set))\n", "    bow_matrix = []\n", "    for doc in docs:\n", "        vec = [0] * len(vocab)\n", "        for word in doc:\n", "            if word in vocab:\n", "                vec[vocab.index(word)] += 1\n", "        bow_matrix.append(vec)\n", "    return vocab, bow_matrix\n", "\n", "# 三个词序不同的句子,分词后包含完全相同的词,BoW向量也就完全相同\n", "docs = [\n", "    [\"我\", \"爱\", \"你\"],  # 正常语序\n", "    [\"你\", \"爱\", \"我\"],  # 主宾互换\n", "    [\"爱\", \"你\", \"我\"],  # 打乱语序(必须先分词;未分词的\"爱你我\"会被当成一个新词)\n", "]\n", "\n", "vocab, bow_matrix = simple_bow(docs)\n", "\n", "print(\"文档:\")\n", "for i, doc in enumerate(docs):\n", "    print(f\" Doc{i+1}: {''.join(doc)}\")\n", "print()\n", "\n", "print(\"BoW矩阵:\")\n", "for i, vec in enumerate(bow_matrix):\n", "    print(f\" Doc{i+1}: {vec}\")\n", "print()\n", "\n", "print(f\"词表: {vocab}\")\n", "print()\n", "\n", "print(\"问题:这三个词序不同、含义不同的句子,BoW向量完全相同!\")\n", "print(\"Doc1: 我爱你(表达爱意)\")\n", "print(\"Doc2: 你爱我(对方爱我)\")\n", "print(\"Doc3: 爱你我(意义不明)\")\n", "print()\n", "print(\"结论:BoW模型丢失了词序信息!\")" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==================================================\n", "练习题5答案\n", "==================================================\n", "文档集合:\n", " Doc1: Python 是 编程 语言\n", " Doc2: Java 是 编程 语言\n", " Doc3: Python Python Python\n", "\n", "词表: ['Java', 'Python', '是', '编程', '语言']\n", "\n", "BoW矩阵(每行是一个文档的向量):\n", " Doc1: [0, 1, 1, 1, 1]\n", " Doc2: [1, 0, 1, 1, 1]\n", " Doc3: [0, 3, 0, 0, 0]\n", "\n", "解析:\n", " Doc1: [0, 1, 1, 1, 1]\n", " - 'Python' 出现 1 次\n", " - '是' 出现 1 次\n", " - '编程' 出现 1 次\n", " - '语言' 出现 1 次\n", " Doc2: [1, 0, 1, 1, 1]\n", " - 'Java' 出现 1 次\n", " - '是' 出现 1 次\n", " - '编程' 出现 1 次\n", " - '语言' 出现 1 次\n", " Doc3: [0, 3, 0, 0, 0]\n", " - 'Python' 出现 3 次\n" ] } ], "source": [ "# 练习题5答案\n", "import numpy as np\n", "\n", "print(\"=\" * 
50)\n", "print(\"练习题5答案\")\n", "print(\"=\" * 50)\n", "\n", "def simple_bow(docs):\n", "    \"\"\"简单的BoW实现\"\"\"\n", "    vocab_set = set()\n", "    for doc in docs:\n", "        vocab_set.update(doc)\n", "    vocab = sorted(list(vocab_set))\n", "    bow_matrix = []\n", "    for doc in docs:\n", "        vec = [0] * len(vocab)\n", "        for word in doc:\n", "            if word in vocab:\n", "                vec[vocab.index(word)] += 1\n", "        bow_matrix.append(vec)\n", "    return vocab, bow_matrix\n", "\n", "docs = [\n", "    [\"Python\", \"是\", \"编程\", \"语言\"],\n", "    [\"Java\", \"是\", \"编程\", \"语言\"],\n", "    [\"Python\", \"Python\", \"Python\"]\n", "]\n", "\n", "vocab, bow_matrix = simple_bow(docs)\n", "\n", "print(\"文档集合:\")\n", "for i, doc in enumerate(docs):\n", "    print(f\" Doc{i+1}: {' '.join(doc)}\")\n", "print()\n", "\n", "print(f\"词表: {vocab}\")\n", "print()\n", "\n", "print(\"BoW矩阵(每行是一个文档的向量):\")\n", "for i, vec in enumerate(bow_matrix):\n", "    print(f\" Doc{i+1}: {vec}\")\n", "print()\n", "\n", "print(\"解析:\")\n", "for i, vec in enumerate(bow_matrix):\n", "    print(f\" Doc{i+1}: {vec}\")\n", "    for j, count in enumerate(vec):\n", "        if count > 0:\n", "            print(f\" - '{vocab[j]}' 出现 {count} 次\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "# 第七部分:TF-IDF\n", "\n", "## 7.1 为什么需要TF-IDF?\n", "\n", "**BoW的问题**:所有词同等重要!\n", "\n", "```\n", "文档A: \"Python 是 编程 语言,Python Python Python\"\n", "文档B: \"Python 是 编程 语言\"\n", "\n", "BoW结果:\n", " 文档A: Python=4, 是=1, 编程=1, 语言=1\n", " 文档B: Python=1, 是=1, 编程=1, 语言=1\n", "\n", "问题:BoW只按出现次数计数,\"是\"和\"编程\"得到相同的权重\n", " 但\"是\"几乎出现在每篇文档里,对区分文档没有帮助\n", " 我们希望降低这类常见词的权重,突出\"Python\"这样有区分度的词\n", "```\n", "\n", "**关键洞察**:\n", "- 高频出现的词 ≠ 一定重要(\"的\"、\"了\"在所有文章都出现)\n", "- 罕见词 ≠ 不重要(\"TensorFlow\"只在AI文章出现,很重要)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7.2 TF-IDF公式\n", "\n", "**TF-IDF = 词频(TF) × 逆文档频率(IDF)**\n", "\n", "```\n", "TF = 这个词在本文中出现了多少次\n", "IDF = log(总文档数 / 包含该词的文档数)\n", "\n", "TF-IDF = TF × IDF\n", "\n", "注:本notebook的代码使用平滑版本 IDF = log(总文档数 / (包含该词的文档数 + 1)) + 1\n", "    因此代码输出的IDF值与上面的基本公式不完全相同\n", "```\n", "\n", "### IDF的含义\n", "\n", "| 词 | 在多少文档出现 | IDF值 | 
解释 |\n", "|----|----------------|-------|------|\n", "| \"的\" | 所有文档 | log(1) = 0(最低) | 到处都是,不重要 |\n", "| \"Python\" | 少数文档 | 较高 | 较独特,重要 |\n", "| \"TensorFlow\" | 极少数文档 | 更高 | 很独特,非常重要 |\n", "| \"AI\" | 只有1篇 | log(总文档数/1)(最高) | 最独特,最重要 |" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==================================================\n", "TF-IDF词频-逆文档频率演示\n", "==================================================\n", "文档集合:\n", " Doc1: Python 编程 语言\n", " Doc2: Python Python Python\n", " Doc3: Java 编程 语言\n", "\n", "词表: ['Java', 'Python', '编程', '语言']\n", "\n", "IDF值: [1.4055, 1.0, 1.0, 1.0]\n", "\n", "TF-IDF矩阵:\n", " Doc1: [0.0, 1.0, 1.0, 1.0]\n", " Doc2: [0.0, 3.0, 0.0, 0.0]\n", " Doc3: [1.4055, 0.0, 1.0, 1.0]\n", "\n", "详细分析:\n", "\n", "Doc1: Python 编程 语言\n", " 'Python': TF-IDF = 1.0000\n", " '编程': TF-IDF = 1.0000\n", " '语言': TF-IDF = 1.0000\n", "\n", "Doc2: Python Python Python\n", " 'Python': TF-IDF = 3.0000\n", "\n", "Doc3: Java 编程 语言\n", " 'Java': TF-IDF = 1.4055\n", " '编程': TF-IDF = 1.0000\n", " '语言': TF-IDF = 1.0000\n" ] } ], "source": [ "# TF-IDF演示(纯Python实现)\n", "import math\n", "\n", "print(\"=\" * 50)\n", "print(\"TF-IDF词频-逆文档频率演示\")\n", "print(\"=\" * 50)\n", "\n", "def simple_tfidf(docs):\n", "    \"\"\"\n", "    简单的TF-IDF实现\n", "\n", "    参数:\n", "        docs: 文档列表,每篇文档已经是分词后的词列表\n", "    返回:\n", "        vocab: 词表\n", "        tfidf_matrix: TF-IDF矩阵\n", "        idf: 每个词的IDF值\n", "    \"\"\"\n", "    # 1. 构建词表\n", "    vocab_set = set()\n", "    for doc in docs:\n", "        vocab_set.update(doc)\n", "    vocab = sorted(list(vocab_set))\n", "\n", "    # 2. 构建BoW矩阵\n", "    bow = []\n", "    for doc in docs:\n", "        vec = [0] * len(vocab)\n", "        for word in doc:\n", "            if word in vocab:\n", "                vec[vocab.index(word)] += 1\n", "        bow.append(vec)\n", "\n", "    n_docs = len(docs)\n", "\n", "    # 3. 
计算IDF(平滑版本:log(N / (df + 1)) + 1)\n", "    idf = []\n", "    for j, word in enumerate(vocab):\n", "        df = sum(1 for vec in bow if vec[j] > 0)\n", "        idf_j = math.log(n_docs / (df + 1)) + 1\n", "        idf.append(idf_j)\n", "\n", "    # 4. 计算TF-IDF\n", "    tfidf = []\n", "    for vec in bow:\n", "        tfidf_vec = []\n", "        for i, tf in enumerate(vec):\n", "            tfidf_vec.append(tf * idf[i])\n", "        tfidf.append(tfidf_vec)\n", "\n", "    return vocab, tfidf, idf\n", "\n", "docs = [\n", "    [\"Python\", \"编程\", \"语言\"],\n", "    [\"Python\", \"Python\", \"Python\"],  # Python出现3次\n", "    [\"Java\", \"编程\", \"语言\"],\n", "]\n", "\n", "vocab, tfidf_matrix, idf = simple_tfidf(docs)\n", "\n", "print(\"文档集合:\")\n", "for i, doc in enumerate(docs):\n", "    print(f\" Doc{i+1}: {' '.join(doc)}\")\n", "print()\n", "\n", "print(f\"词表: {vocab}\")\n", "print()\n", "print(f\"IDF值: {[round(x, 4) for x in idf]}\")\n", "print()\n", "\n", "print(\"TF-IDF矩阵:\")\n", "for i, vec in enumerate(tfidf_matrix):\n", "    print(f\" Doc{i+1}: {[round(x, 4) for x in vec]}\")\n", "print()\n", "\n", "print(\"详细分析:\")\n", "for i, doc in enumerate(docs):\n", "    print(f\"\\nDoc{i+1}: {' '.join(doc)}\")\n", "    for j, score in enumerate(tfidf_matrix[i]):\n", "        if score > 0:\n", "            print(f\" '{vocab[j]}': TF-IDF = {score:.4f}\")" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==================================================\n", "TF-IDF vs BoW 对比\n", "==================================================\n", "文档:\n", " Doc1: Python 编程\n", " Doc2: Java 编程\n", " Doc3: Python Python Python\n", "\n", "BoW矩阵:\n", " Doc1: [0, 1, 1]\n", " Doc2: [1, 0, 1]\n", " Doc3: [0, 3, 0]\n", "\n", "TF-IDF矩阵:\n", " Doc1: [0.0, 1.0, 1.0]\n", " Doc2: [1.4055, 0.0, 1.0]\n", " Doc3: [0.0, 3.0, 0.0]\n", "\n", "重点分析:\n", "Doc3 'Python Python Python':\n", " BoW: Python出现3次\n", " TF-IDF: Python的TF-IDF = 3.0000\n", "\n", "Python的权重为什么没有被IDF放大?\n", "因为Python出现在Doc1和Doc3两篇文档中,平滑后的IDF = log(3/3) + 1 = 1.0\n", "而Java只出现在Doc2,IDF = log(3/2) + 1 ≈ 1.4055,单次出现的权重更高\n" ] } ], "source": [ "# TF-IDF vs BoW 对比\n", 
"import math\n", "\n", "print(\"=\" * 50)\n", "print(\"TF-IDF vs BoW 对比\")\n", "print(\"=\" * 50)\n", "\n", "def simple_bow(docs):\n", "    vocab_set = set()\n", "    for doc in docs:\n", "        vocab_set.update(doc)\n", "    vocab = sorted(list(vocab_set))\n", "    bow_matrix = []\n", "    for doc in docs:\n", "        vec = [0] * len(vocab)\n", "        for word in doc:\n", "            if word in vocab:\n", "                vec[vocab.index(word)] += 1\n", "        bow_matrix.append(vec)\n", "    return vocab, bow_matrix\n", "\n", "def simple_tfidf(docs):\n", "    vocab_set = set()\n", "    for doc in docs:\n", "        vocab_set.update(doc)\n", "    vocab = sorted(list(vocab_set))\n", "    bow = []\n", "    for doc in docs:\n", "        vec = [0] * len(vocab)\n", "        for word in doc:\n", "            if word in vocab:\n", "                vec[vocab.index(word)] += 1\n", "        bow.append(vec)\n", "\n", "    n_docs = len(docs)\n", "    idf = []\n", "    for j, word in enumerate(vocab):\n", "        df = sum(1 for vec in bow if vec[j] > 0)\n", "        idf_j = math.log(n_docs / (df + 1)) + 1\n", "        idf.append(idf_j)\n", "\n", "    tfidf = []\n", "    for vec in bow:\n", "        tfidf_vec = []\n", "        for i, tf in enumerate(vec):\n", "            tfidf_vec.append(tf * idf[i])\n", "        tfidf.append(tfidf_vec)\n", "\n", "    return vocab, tfidf, idf\n", "\n", "docs = [\n", "    [\"Python\", \"编程\"],\n", "    [\"Java\", \"编程\"],\n", "    [\"Python\", \"Python\", \"Python\"]  # Python出现3次\n", "]\n", "\n", "vocab_bow, bow_matrix = simple_bow(docs)\n", "vocab_tfidf, tfidf_matrix, idf = simple_tfidf(docs)\n", "\n", "print(\"文档:\")\n", "for i, doc in enumerate(docs):\n", "    print(f\" Doc{i+1}: {' '.join(doc)}\")\n", "print()\n", "\n", "print(\"BoW矩阵:\")\n", "for i, vec in enumerate(bow_matrix):\n", "    print(f\" Doc{i+1}: {vec}\")\n", "print()\n", "\n", "print(\"TF-IDF矩阵:\")\n", "for i, vec in enumerate(tfidf_matrix):\n", "    print(f\" Doc{i+1}: {[round(x, 4) for x in vec]}\")\n", "print()\n", "\n", "# 重点分析Doc3(按词表查找Python所在的列,而不是写死下标)\n", "py_idx = vocab_tfidf.index(\"Python\")\n", "print(\"重点分析:\")\n", "print(f\"Doc3 'Python Python Python':\")\n", "print(f\" BoW: Python出现3次\")\n", "print(f\" TF-IDF: Python的TF-IDF = 
{tfidf_matrix[2][py_idx]:.4f}\")\n", "print()\n", "print(\"Python的权重为什么没有被IDF放大?\")\n", "print(\"因为Python出现在Doc1和Doc3两篇文档中,平滑后的IDF = log(3/3) + 1 = 1.0\")\n", "print(\"而Java只出现在Doc2,IDF = log(3/2) + 1 ≈ 1.4055,单次出现的权重更高\")" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==================================================\n", "附加题答案\n", "==================================================\n", "文档:\n", " Doc1: Python 编程\n", " Doc2: Java 编程\n", " Doc3: Python Python\n", "\n", "词表: ['Java', 'Python', '编程']\n", "\n", "IDF值: [1.4055, 1.0, 1.0]\n", "\n", "TF-IDF矩阵:\n", " Doc1: [0.0, 1.0, 1.0]\n", " Doc2: [1.4055, 0.0, 1.0]\n", " Doc3: [0.0, 2.0, 0.0]\n", "\n", "问题1:Python在Doc3中的TF-IDF值是多少?为什么它的IDF比Java低?\n", "答:Python在Doc3的TF-IDF值 = 2.0000\n", " 因为Python出现在Doc1和Doc3两篇文档中,平滑后的IDF = log(3/3) + 1 = 1.0,所以TF-IDF = 2 × 1.0 = 2.0\n", "\n", "问题2:Java在Doc2中的TF-IDF值是多少?\n", "答:Java在Doc2的TF-IDF值 = 1.4055\n", " 因为Java只出现在Doc2中,其他文档没有,所以IDF值高\n" ] } ], "source": [ "# 附加题答案\n", "import math\n", "\n", "print(\"=\" * 50)\n", "print(\"附加题答案\")\n", "print(\"=\" * 50)\n", "\n", "def simple_tfidf(docs):\n", "    vocab_set = set()\n", "    for doc in docs:\n", "        vocab_set.update(doc)\n", "    vocab = sorted(list(vocab_set))\n", "    bow = []\n", "    for doc in docs:\n", "        vec = [0] * len(vocab)\n", "        for word in doc:\n", "            if word in vocab:\n", "                vec[vocab.index(word)] += 1\n", "        bow.append(vec)\n", "\n", "    n_docs = len(docs)\n", "    idf = []\n", "    for j, word in enumerate(vocab):\n", "        df = sum(1 for vec in bow if vec[j] > 0)\n", "        idf_j = math.log(n_docs / (df + 1)) + 1\n", "        idf.append(idf_j)\n", "\n", "    tfidf = []\n", "    for vec in bow:\n", "        tfidf_vec = []\n", "        for i, tf in enumerate(vec):\n", "            tfidf_vec.append(tf * idf[i])\n", "        tfidf.append(tfidf_vec)\n", "\n", "    return vocab, tfidf, idf\n", "\n", "docs = [[\"Python\", \"编程\"], [\"Java\", \"编程\"], [\"Python\", \"Python\"]]\n", "\n", "vocab, tfidf_matrix, idf = simple_tfidf(docs)\n", "\n", "print(\"文档:\")\n", "for i, doc in enumerate(docs):\n", "    print(f\" Doc{i+1}: {' '.join(doc)}\")\n", "print()\n", 
"\n", "print(f\"词表: {vocab}\")\n", "print()\n", "print(f\"IDF值: {[round(x, 4) for x in idf]}\")\n", "print()\n", "\n", "print(\"TF-IDF矩阵:\")\n", "for i, vec in enumerate(tfidf_matrix):\n", "    print(f\" Doc{i+1}: {[round(x, 4) for x in vec]}\")\n", "print()\n", "\n", "print(\"问题1:Python在Doc3中的TF-IDF值是多少?为什么它的IDF比Java低?\")\n", "py_idx = vocab.index(\"Python\")\n", "print(f\"答:Python在Doc3的TF-IDF值 = {tfidf_matrix[2][py_idx]:.4f}\")\n", "print(\" 因为Python出现在Doc1和Doc3两篇文档中,平滑后的IDF = log(3/3) + 1 = 1.0,所以TF-IDF = 2 × 1.0 = 2.0\")\n", "print()\n", "print(\"问题2:Java在Doc2中的TF-IDF值是多少?\")\n", "java_idx = vocab.index(\"Java\")\n", "print(f\"答:Java在Doc2的TF-IDF值 = {tfidf_matrix[1][java_idx]:.4f}\")\n", "print(\" 因为Java只出现在Doc2中,其他文档没有,所以IDF值高\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7.3 TF-IDF的优缺点\n", "\n", "| 优点 | 缺点 |\n", "|------|------|\n", "| 考虑词的重要性 | 忽略词序 |\n", "| 降低常见词权重 | 无法捕捉语义 |\n", "| 提高独特词权重 | \"猫\"和\"狗\"仍对应不同的维度,向量之间没有语义联系 |\n", "| 可以提取关键词 | 无法处理同义词 \"电脑\" vs \"计算机\" |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "# 第八部分:Word Embedding词嵌入\n", "\n", "## 8.1 BoW和TF-IDF的根本问题\n", "\n", "```\n", "# 位置编码的问题\n", "\"猫\" → [1, 0, 0, ...] # 只是\"位置编码\"\n", "\"狗\" → [0, 1, 0, ...] # 猫和狗的位置不同\n", "\"小猫\" → [0, 0, 1, ...] # 但它们语义相近,向量却正交!\n", "\n", "# 问题:无法表达语义相似性!\n", "# \"猫\"和\"狗\"都是动物,语义很相似\n", "# 但在BoW/TF-IDF中,它们的向量可能完全不同\n", "```\n", "\n", "### 词嵌入的核心思想\n", "\n", "```\n", "不再用\"位置\"表示词,而是用\"语义空间\"表示词\n", "\n", "语义空间示例(二维简化):\n", "  动物性 ↑\n", "         |  狗   猫\n", "         |\n", "         |            苹果   香蕉\n", "         └──────────────────────→ 植物性\n", "\n", "语义相近的词在空间中距离近\n", "```" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==================================================\n", "词嵌入(Word Embedding)概念演示\n", "==================================================\n", "\n", "词向量(简化版3维)示意:\n", "维度含义: [动物性, 植物性, 其他/技术性]\n", "\n", " 猫: [0.9 0.1 0.2]\n", " 狗: [0.8 0.3 0.1]\n", " 小猫: [0.85 0.2 0.15]\n", " 苹果: [0.1 0.2 0.9]\n", " 香蕉: [0.1 0.1 0.85]\n", " Python: [0.1 0. 
0.9]\n", " Java: [0.1 0. 0.85]\n", "\n", "语义相似度:\n", " 猫 vs 狗: 0.965\n", " 猫 vs 小猫: 0.992\n", " 猫 vs 苹果: 0.337\n", " 苹果 vs 香蕉: 0.995\n", " Python vs Java: 1.000\n", "\n", "词嵌入的优势:\n", " - 语义相似的词,向量也相似\n", " - 可以做类比推理:国王-男人+女人=女王\n" ] } ], "source": [ "# Word2Vec词嵌入的概念演示\n", "import numpy as np\n", "\n", "print(\"=\" * 50)\n", "print(\"词嵌入(Word Embedding)概念演示\")\n", "print(\"=\" * 50)\n", "print()\n", "\n", "# 假设这些是用Word2Vec等方法训练出来的词向量(简化版,3维)\n", "# 实际中向量通常是50/100/300维\n", "word_vectors = {\n", " \"猫\": np.array([0.9, 0.1, 0.2]), # 动物属性高,其他低\n", " \"狗\": np.array([0.8, 0.3, 0.1]), # 动物属性高\n", " \"小猫\": np.array([0.85, 0.2, 0.15]), # 小动物,也像猫\n", " \"苹果\": np.array([0.1, 0.2, 0.9]), # 水果属性高\n", " \"香蕉\": np.array([0.1, 0.1, 0.85]), # 水果属性高\n", " \"Python\": np.array([0.1, 0.0, 0.9]), # 编程语言\n", " \"Java\": np.array([0.1, 0.0, 0.85]), # 编程语言\n", "}\n", "\n", "print(\"词向量(简化版3维)示意:\")\n", "print(\"维度含义: [动物性, 植物性, 其他/技术性]\")\n", "print()\n", "for word, vec in word_vectors.items():\n", " print(f\" {word}: {vec}\")\n", "print()\n", "\n", "# 计算相似度\n", "print(\"语义相似度:\")\n", "print(f\" 猫 vs 狗: {cosine_similarity(word_vectors['猫'], word_vectors['狗']):.3f}\")\n", "print(f\" 猫 vs 小猫: {cosine_similarity(word_vectors['猫'], word_vectors['小猫']):.3f}\")\n", "print(f\" 猫 vs 苹果: {cosine_similarity(word_vectors['猫'], word_vectors['苹果']):.3f}\")\n", "print(f\" 苹果 vs 香蕉: {cosine_similarity(word_vectors['苹果'], word_vectors['香蕉']):.3f}\")\n", "print(f\" Python vs Java: {cosine_similarity(word_vectors['Python'], word_vectors['Java']):.3f}\")\n", "print()\n", "print(\"词嵌入的优势:\")\n", "print(\" - 语义相似的词,向量也相似\")\n", "print(\" - 可以做类比推理:国王-男人+女人=女王\")" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==================================================\n", "词嵌入的类比推理\n", "==================================================\n", "\n", "词向量(简化版):\n", " King: [0.9 0.1 0.8 0.3]\n", " Man: [0.8 0.1 0.2 0.5]\n", " 
Woman: [0.1 0.8 0.2 0.5]\n", " Queen: [0.1 0.9 0.8 0.3]\n", "\n", "维度含义: [男性属性, 女性属性, 皇室/权力属性, 平民属性]\n", "\n", "King - Man + Woman = [0.2 0.8 0.8 0.3]\n", "Queen = [0.1 0.9 0.8 0.3]\n", "\n", "相似度验证:\n", " (King-Man+Woman) vs Queen: 0.994\n", "\n", "结论:词嵌入可以捕捉语义关系!\n", " '国王' - '男人' + '女人' ≈ '女王'\n", " 这说明词向量编码了语义信息!\n" ] } ], "source": [ "# 词嵌入的类比推理演示\n", "# 注意:下面的向量是手工设定的示意值,并非训练得到;cosine_similarity沿用前文的定义\n", "import numpy as np\n", "\n", "print(\"=\" * 50)\n", "print(\"词嵌入的类比推理\")\n", "print(\"=\" * 50)\n", "print()\n", "\n", "# 经典例子:King - Man + Woman ≈ Queen\n", "# 这个例子说明了词嵌入可以捕捉语义关系\n", "\n", "# 简化版词向量(实际中这些向量由神经网络学习得到)\n", "# 维度含义: [男性属性, 女性属性, 皇室/权力属性, 平民属性]\n", "king = np.array([0.9, 0.1, 0.8, 0.3]) # 皇室、男性、有权力\n", "man = np.array([0.8, 0.1, 0.2, 0.5]) # 男性\n", "woman = np.array([0.1, 0.8, 0.2, 0.5]) # 女性\n", "queen = np.array([0.1, 0.9, 0.8, 0.3]) # 皇室、女性、有权力\n", "\n", "print(\"词向量(简化版):\")\n", "print(f\" King: {king}\")\n", "print(f\" Man: {man}\")\n", "print(f\" Woman: {woman}\")\n", "print(f\" Queen: {queen}\")\n", "print()\n", "print(\"维度含义: [男性属性, 女性属性, 皇室/权力属性, 平民属性]\")\n", "print()\n", "\n", "# 计算 King - Man + Woman\n", "result = king - man + woman\n", "print(f\"King - Man + Woman = {result}\")\n", "print(f\"Queen = {queen}\")\n", "print()\n", "\n", "# 相似度\n", "print(\"相似度验证:\")\n", "print(f\" (King-Man+Woman) vs Queen: {cosine_similarity(result, queen):.3f}\")\n", "print()\n", "\n", "print(\"结论:词嵌入可以捕捉语义关系!\")\n", "print(\" '国王' - '男人' + '女人' ≈ '女王'\")\n", "print(\" 这说明词向量编码了语义信息!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 8.2 词嵌入的发展历史\n", "\n", "| 方法 | 年份 | 特点 |\n", "|------|------|------|\n", "| Word2Vec | 2013 | Google开源,开启词嵌入时代 |\n", "| GloVe | 2014 | Stanford提出,基于全局共现矩阵 |\n", "| FastText | 2016 | Facebook开源,支持子词 |\n", "| ELMo | 2018 | 考虑上下文,动态词向量 |\n", "| BERT | 2018 | Transformer架构,预训练大模型 |\n", "| GPT系列 | 2018-现在 | 生成式AI,ChatGPT核心 |" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"==================================================\n", "预训练词向量演示(使用内置示例向量)\n", "==================================================\n", "\n", "注意:真实环境中加载Gensim预训练模型需要下载(约66MB)\n", "本notebook使用内置示例向量进行演示\n", "\n", "词向量示例(每个词用一个5维向量表示):\n", "维度含义: [动物性, 植物性, 技术性, 动态性, 抽象概念]\n", "\n", " cat : [0.9, 0.1, 0.2, 0.8, 0.3]\n", " dog : [0.8, 0.2, 0.1, 0.9, 0.3]\n", " bird : [0.7, 0.3, 0.1, 0.9, 0.2]\n", " fish : [0.6, 0.2, 0.1, 0.8, 0.2]\n", " apple : [0.1, 0.9, 0.3, 0.0, 0.2]\n", " rose : [0.1, 0.8, 0.1, 0.0, 0.1]\n", " python : [0.1, 0.0, 0.9, 0.0, 0.5]\n", " java : [0.1, 0.0, 0.8, 0.0, 0.4]\n", " computer : [0.1, 0.0, 0.9, 0.3, 0.4]\n", " love : [0.3, 0.2, 0.1, 0.1, 0.9]\n", " hate : [0.2, 0.1, 0.1, 0.1, 0.8]\n", "\n", "==================================================\n", "1. 语义相似度计算\n", "==================================================\n", " cat vs dog : 0.987\n", " cat vs apple : 0.244\n", " python vs java : 0.998\n", " python vs cat : 0.322\n", " love vs hate : 0.993\n", "\n", "==================================================\n", "2. 类比推理(Word2Vec核心能力)\n", "==================================================\n", "类比问题:man -> woman, king -> ?\n", "\n", " King = [0.6 0.1 0.3 0.3 0.6]\n", " Man = [0.8 0.1 0.2 0.5 0.3]\n", " Woman = [0.2 0.8 0.2 0.5 0.5]\n", " King - Man + Woman = [-0. 
0.8 0.3 0.3 0.8]\n", " Queen (真实) = [0.2 0.9 0.3 0.3 0.6]\n", "\n", " 相似度: 0.969\n", "\n", "太棒了!词嵌入可以捕捉语义关系!\n", "\n", "==================================================\n", "真实环境中加载Gensim预训练模型的方法\n", "==================================================\n", "如需加载真实的预训练词向量,可以运行:\n", "\n", " import gensim.downloader as api\n", " model = api.load('glove-wiki-gigaword-50')\n", "\n", "这会下载约66MB的预训练词向量模型\n" ] } ], "source": [ "# 实战:词嵌入演示(使用手工构造的示例向量,不实际下载预训练模型)\n", "import numpy as np\n", "\n", "def cosine_similarity(a, b):\n", "    \"\"\"计算余弦相似度;任一向量为零向量时返回0.0\"\"\"\n", "    dot = np.dot(a, b)\n", "    norm_a = np.linalg.norm(a)\n", "    norm_b = np.linalg.norm(b)\n", "    if norm_a == 0 or norm_b == 0:\n", "        return 0.0\n", "    return dot / (norm_a * norm_b)\n", "\n", "print(\"=\" * 50)\n", "print(\"预训练词向量演示(使用内置示例向量)\")\n", "print(\"=\" * 50)\n", "print()\n", "print(\"注意:真实环境中加载Gensim预训练模型需要下载(约66MB)\")\n", "print(\"本notebook使用内置示例向量进行演示\")\n", "print()\n", "\n", "# 使用内置的小规模词向量示例(模拟真实词向量)\n", "# 维度: [动物性, 植物性, 技术性, 动态性, 抽象概念]\n", "word_vectors = {\n", "    # 动物\n", "    \"cat\": np.array([0.9, 0.1, 0.2, 0.8, 0.3]),\n", "    \"dog\": np.array([0.8, 0.2, 0.1, 0.9, 0.3]),\n", "    \"bird\": np.array([0.7, 0.3, 0.1, 0.9, 0.2]),\n", "    \"fish\": np.array([0.6, 0.2, 0.1, 0.8, 0.2]),\n", "    # 植物\n", "    \"apple\": np.array([0.1, 0.9, 0.3, 0.0, 0.2]),\n", "    \"rose\": np.array([0.1, 0.8, 0.1, 0.0, 0.1]),\n", "    # 技术\n", "    \"python\": np.array([0.1, 0.0, 0.9, 0.0, 0.5]),\n", "    \"java\": np.array([0.1, 0.0, 0.85, 0.0, 0.4]),\n", "    \"computer\": np.array([0.1, 0.0, 0.9, 0.3, 0.4]),\n", "    # 抽象概念\n", "    \"love\": np.array([0.3, 0.2, 0.1, 0.1, 0.9]),\n", "    \"hate\": np.array([0.2, 0.1, 0.1, 0.1, 0.8]),\n", "}\n", "\n", "# 显示词向量\n", "print(\"词向量示例(每个词用一个5维向量表示):\")\n", "print(\"维度含义: [动物性, 植物性, 技术性, 动态性, 抽象概念]\")\n", "print()\n", "for word, vec in word_vectors.items():\n", "    print(f\" {word:12s}: [{vec[0]:.1f}, {vec[1]:.1f}, {vec[2]:.1f}, {vec[3]:.1f}, {vec[4]:.1f}]\")\n", "\n", "print()\n", "print(\"=\" * 50)\n", "print(\"1. 
语义相似度计算\")\n", "print(\"=\" * 50)\n", "pairs = [\n", "    (\"cat\", \"dog\"),      # 都是动物\n", "    (\"cat\", \"apple\"),    # 动物 vs 植物\n", "    (\"python\", \"java\"),  # 都是编程语言\n", "    (\"python\", \"cat\"),   # 编程语言 vs 动物\n", "    (\"love\", \"hate\"),    # 情感词\n", "]\n", "for w1, w2 in pairs:\n", "    sim = cosine_similarity(word_vectors[w1], word_vectors[w2])\n", "    print(f\" {w1:10s} vs {w2:10s}: {sim:.3f}\")\n", "\n", "print()\n", "print(\"=\" * 50)\n", "print(\"2. 类比推理(Word2Vec核心能力)\")\n", "print(\"=\" * 50)\n", "print(\"类比问题:man -> woman, king -> ?\")\n", "print()\n", "\n", "# 简化版类比:使用手工设定的语义维度\n", "man = np.array([0.8, 0.1, 0.2, 0.5, 0.3])\n", "woman = np.array([0.2, 0.8, 0.2, 0.5, 0.5])\n", "king = np.array([0.6, 0.1, 0.3, 0.3, 0.6])\n", "queen = np.array([0.2, 0.9, 0.3, 0.3, 0.6])\n", "\n", "# king - man + woman ≈ queen\n", "result = king - man + woman\n", "\n", "print(f\" King = {king}\")\n", "print(f\" Man = {man}\")\n", "print(f\" Woman = {woman}\")\n", "print(f\" King - Man + Woman = {np.round(result, 2)}\")\n", "print(f\" Queen (真实) = {queen}\")\n", "print()\n", "print(f\" 相似度: {cosine_similarity(result, queen):.3f}\")\n", "print()\n", "print(\"太棒了!词嵌入可以捕捉语义关系!\")\n", "print()\n", "\n", "# 真实环境中加载Gensim模型的方法(仅供参考,不执行)\n", "print(\"=\" * 50)\n", "print(\"真实环境中加载Gensim预训练模型的方法\")\n", "print(\"=\" * 50)\n", "print(\"如需加载真实的预训练词向量,可以运行:\")\n", "print()\n", "print(\" import gensim.downloader as api\")\n", "print(\" model = api.load('glove-wiki-gigaword-50')\")\n", "print()\n", "print(\"这会下载约66MB的预训练词向量模型\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "# 第九部分:文本处理完整流程\n", "\n", "## 9.1 流程图\n", "\n", "```\n", "┌──────────────────────────────────────────────────────────────────┐\n", "│                             文本数据                             │\n", "│                        \"今天天气真不错!\"                        │\n", "└─────────────────────────┬────────────────────────────────────────┘\n", "                          │\n", "                          ▼\n", "┌──────────────────────────────────────────────────────────────────┐\n", "│                          1. 文本预处理                           │\n", "│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐              │\n", "│  │  分词   │→ │ 去停用词│→ │ 统一大小│→ │ 去除标点│              │\n", "│  └─────────┘  └─────────┘  └─────────┘  └─────────┘              │\n", "│  \"今天/天气/真/不错\" → \"今天/天气/不错\"                          │\n", "└─────────────────────────┬────────────────────────────────────────┘\n", "                          │\n", "                          ▼\n", "┌──────────────────────────────────────────────────────────────────┐\n", "│                          2. 文本向量化                           │\n", "│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌──────────┐             │\n", "│  │   BoW   │  │ TF-IDF  │  │Embedding│  │预训练模型│             │\n", "│  └─────────┘  └─────────┘  └─────────┘  └──────────┘             │\n", "│       ↓            ↓            ↓            ↓                   │\n", "│  [1,0,2,0,1]  [0.5,0,0.8]   [0.9,0.3]    [BERT向量]              │\n", "└─────────────────────────┬────────────────────────────────────────┘\n", "                          │\n", "                          ▼\n", "┌──────────────────────────────────────────────────────────────────┐\n", "│                           3. 下游任务                            │\n", "│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌──────────┐             │\n", "│  │  分类   │  │ 相似度  │  │  聚类   │  │   生成   │             │\n", "│  │ 情感分析│  │ 文本匹配│  │ 主题分组│  │聊天机器人│             │\n", "│  └─────────┘  └─────────┘  └─────────┘  └──────────┘             │\n", "└──────────────────────────────────────────────────────────────────┘\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 9.2 各环节详解\n", "\n", "### 环节1:文本预处理\n", "\n", "| 步骤 | 输入 | 输出 | 作用 |\n", "|------|------|------|------|\n", "| 分词 | \"今天的天气不错\" | [\"今天\", \"的\", \"天气\", \"不错\"] | 把文本切成词 |\n", "| 去停用词 | [\"今天\", \"的\", \"天气\", \"不错\"] | [\"今天\", \"天气\", \"不错\"] | 去掉\"的、了、在\"等无意义词 |\n", "| 统一大小写 | [\"Python\", \"python\"] | [\"python\", \"python\"] | 归一化 |\n", "| 去标点 | [\"语言!!!\"] | [\"语言\"] | 清理噪音 |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 环节2:文本向量化\n", "\n", "| 方法 | 适用场景 | 不适用场景 |\n", "|------|---------|-----------|\n", "| BoW | 基线模型、快速原型 | 需要语义理解 |\n", "| TF-IDF | 文本分类、关键词提取 | 同义词识别 |\n", "| Embedding | 语义相似度、推荐系统 | 需要精确匹配 |\n", "| 预训练模型 | 通用NLP任务 | 计算资源有限 |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 环节3:下游任务\n", "\n", "```\n", "# 分类任务:\n", 
"\"这部电影太好看了!\" → 情感分类 → 正面 ✅\n", "\n", "# 相似度任务:\n", "\"如何学习Python?\" → 查找相似文档 → \"Python入门教程\" ✅\n", "\n", "# 生成任务:\n", "\"今天天气\" → GPT续写 → \"今天天气真好,适合出去玩\" ✅\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "# 第十部分:实战用jieba进行中文分词\n", "\n", "## 10.1 安装jieba\n", "\n", "```bash\n", "pip install jieba\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 安装jieba(使用当前内核对应的Python解释器)\n", "import subprocess\n", "import sys\n", "\n", "# check=True:安装失败时抛出异常,而不是误报成功\n", "subprocess.run([sys.executable, '-m', 'pip', 'install', 'jieba', '-q'], check=True)\n", "\n", "print(\"jieba安装完成!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 10.2 基础分词\n", "\n", "jieba支持三种分词模式:\n", "\n", "| 模式 | 说明 | 适用场景 |\n", "|------|------|---------|\n", "| 精确模式 | 试图将句子最精确地切开(默认模式) | 文本分析,**推荐** |\n", "| 全模式 | 把所有可能的词都扫描出来,速度快,但不能消除歧义 | 速度要求高 |\n", "| 搜索引擎模式 | 在精确模式基础上,对长词再次切分 | 搜索引擎 |" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import jieba\n", "\n", "print(\"=\" * 50)\n", "print(\"jieba分词演示\")\n", "print(\"=\" * 50)\n", "\n", "text = \"我喜欢深度学习和人工智能\"\n", "\n", "print(f\"原文: {text}\")\n", "print()\n", "\n", "# 精确模式(默认)\n", "words_precise = list(jieba.cut(text, cut_all=False))\n", "print(f\"精确模式: {' / '.join(words_precise)}\")\n", "\n", "# 全模式\n", "words_full = list(jieba.cut(text, cut_all=True))\n", "print(f\"全模式: {' / '.join(words_full)}\")\n", "\n", "# 搜索引擎模式\n", "words_search = list(jieba.cut_for_search(text))\n", "print(f\"搜索模式: {' / '.join(words_search)}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 更多分词示例\n", "import jieba\n", "\n", "print(\"=\" * 50)\n", "print(\"更多分词示例\")\n", "print(\"=\" * 50)\n", "\n", "examples = [\n", "    \"今天天气真不错\",\n", "    \"人工智能是未来的发展方向\",\n", "    \"Python是一门非常流行的编程语言\",\n", "    \"小明毕业于清华大学计算机系\",\n", "    \"我今天在京东买了一部iPhone手机\"\n", "]\n", "\n", "for i, text in enumerate(examples):\n", "    words = list(jieba.cut(text))\n", "    print(f\"{i+1}. {text}\")\n", "    print(f\" → {' / '.join(words)}\")\n", "    print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 10.3 词性标注\n", "\n", "jieba支持词性标注,可以标注每个词是名词、动词、形容词等。\n", "\n", "| 词性代码 | 含义 | 示例 |\n", "|----------|------|------|\n", "| n | 名词 | 人、山、电脑 |\n", "| v | 动词 | 跑、吃、学习 |\n", "| a | 形容词 | 漂亮、好吃、优秀 |\n", "| d | 副词 | 很、非常、慢慢 |\n", "| m | 数词 | 一、百、千 |\n", "| q | 量词 | 个、本、件 |" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import jieba.posseg as pseg\n", "\n", "print(\"=\" * 50)\n", "print(\"jieba词性标注演示\")\n", "print(\"=\" * 50)\n", "\n", "text = \"我喜欢深度学习和人工智能\"\n", "\n", "print(f\"原文: {text}\")\n", "print()\n", "\n", "# pseg.cut 逐个返回 (词, 词性代码)\n", "words = pseg.cut(text)\n", "print(\"分词 + 词性标注:\")\n", "for word, flag in words:\n", "    print(f\" {word}: {flag}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 10.4 停用词处理\n", "\n", "停用词是在文本处理中需要过滤掉的常见词,如\"的\"、\"了\"、\"在\"等。\n", "\n", "这些词在所有文档中都可能出现,对区分文档没有帮助。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import jieba\n", "\n", "print(\"=\" * 50)\n", "print(\"停用词处理演示\")\n", "print(\"=\" * 50)\n", "\n", "# 常见停用词列表\n", "stopwords = set(['的', '了', '在', '是', '我', '有', '和', '就', '不', '人', '都', '一', '一个', '上', '也', '很', '到', '说', '要', '去', '你', '会', '着', '没有', '看', '好', '自己', '这'])\n", "\n", "text = \"人工智能是未来的发展方向,也是当前科技领域的热门话题\"\n", "\n", "print(f\"原文: {text}\")\n", "print()\n", "\n", "# 过滤停用词之前\n", "words_all = list(jieba.cut(text))\n", "print(f\"过滤停用词前: {' / '.join(words_all)}\")\n", "\n", "# 过滤停用词之后\n", "words_filtered = [w for w in words_all if w not in stopwords]\n", "print(f\"过滤停用词后: {' / '.join(words_filtered)}\")\n", "print()\n", "\n", "# 更完整的停用词表可以从网上下载\n", "print(\"提示:实际项目中可以从以下地方获取停用词表:\")\n", "print(\" - 哈工大停用词表\")\n", "print(\" - 百度停用词表\")\n", "print(\" - 四川大学机器学习实验室停用词表\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 实战:完整的文本预处理流程\n", "import 
jieba\n", "\n", "print(\"=\" * 50)\n", "print(\"完整的文本预处理流程\")\n", "print(\"=\" * 50)\n", "\n", "# 示例文档集合\n", "docs = [\n", "    \"今天天气真不错!适合出去玩。\",\n", "    \"Python是一门很棒的编程语言。\",\n", "    \"人工智能和机器学习是未来的发展方向。\",\n", "    \"今天在咖啡馆喝了一杯很好喝的拿铁。\"\n", "]\n", "\n", "# 停用词表(同时包含常见中文标点)\n", "stopwords = set(['的', '了', '在', '是', '我', '有', '和', '就', '不', '人', '都', '一', '一个', '上', '也', '很', '到', '说', '要', '去', '你', '会', '着', '没有', '看', '好', '自己', '这', '!', '。', ','])\n", "\n", "def preprocess_text(text):\n", "    \"\"\"完整的文本预处理流程:分词 → 去停用词和标点 → 去空白\"\"\"\n", "    # 1. 分词\n", "    words = jieba.cut(text)\n", "\n", "    # 2. 去除停用词和标点\n", "    words = [w for w in words if w not in stopwords]\n", "\n", "    # 3. 去除空白词\n", "    words = [w for w in words if w.strip()]\n", "\n", "    return words\n", "\n", "print(\"预处理结果:\")\n", "for i, doc in enumerate(docs):\n", "    words = preprocess_text(doc)\n", "    print(f\"\\nDoc{i+1}: {doc}\")\n", "    print(f\" → {' / '.join(words)}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 实战:jieba分词 + TF-IDF完整流程\n", "import jieba\n", "import math\n", "\n", "print(\"=\" * 50)\n", "print(\"实战:jieba分词 + TF-IDF完整流程\")\n", "print(\"=\" * 50)\n", "\n", "def simple_tfidf_tokenized(docs, stopwords=None):\n", "    \"\"\"\n", "    结合分词的TF-IDF实现\n", "    参数:\n", "        docs: 原始文档列表\n", "        stopwords: 停用词集合\n", "    返回:\n", "        vocab, tfidf_matrix, tokenized\n", "    \"\"\"\n", "    # 1. 分词(同时过滤停用词和单字词)\n", "    tokenized = []\n", "    for doc in docs:\n", "        words = jieba.cut(doc)\n", "        if stopwords:\n", "            words = [w for w in words if w not in stopwords and len(w) > 1]\n", "        else:\n", "            words = [w for w in words if len(w) > 1]\n", "        tokenized.append(words)\n", "\n", "    # 2. 构建词表\n", "    vocab_set = set()\n", "    for doc in tokenized:\n", "        vocab_set.update(doc)\n", "    vocab = sorted(list(vocab_set))\n", "\n", "    # 3. 构建TF矩阵并计算IDF\n", "    n_docs = len(tokenized)\n", "    tf_matrix = []\n", "    df_dict = {word: 0 for word in vocab}\n", "\n", "    for doc in tokenized:\n", "        vec = [0] * len(vocab)\n", "        for word in doc:\n", "            if word in vocab:\n", "                idx = vocab.index(word)\n", "                vec[idx] += 1\n", "        tf_matrix.append(vec)\n", "\n", "    # 计算DF(包含该词的文档数)\n", "    for vec in tf_matrix:\n", "        for j, count in enumerate(vec):\n", "            if count > 0:\n", "                word = vocab[j]\n", "                df_dict[word] += 1\n", "\n", "    # 计算IDF:log(N / (df + 1)) + 1\n", "    idf = []\n", "    for word in vocab:\n", "        df = df_dict[word]\n", "        idf_j = math.log(n_docs / (df + 1)) + 1\n", "        idf.append(idf_j)\n", "\n", "    # 计算TF-IDF\n", "    tfidf = []\n", "    for vec in tf_matrix:\n", "        tfidf_vec = [vec[i] * idf[i] for i in range(len(vec))]\n", "        tfidf.append(tfidf_vec)\n", "\n", "    return vocab, tfidf, tokenized\n", "\n", "# 示例文档集合\n", "docs = [\n", "    \"Python是一门很棒的编程语言\",\n", "    \"人工智能是未来的发展方向\",\n", "    \"深度学习是机器学习的一个分支\",\n", "    \"Python和Java都是很流行的编程语言\"\n", "]\n", "\n", "# 停用词\n", "stopwords = set([\"的\", \"是\", \"一个\", \"很\", \"和\", \"在\", \"了\"])\n", "\n", "vocab, tfidf_matrix, tokenized = simple_tfidf_tokenized(docs, stopwords)\n", "\n", "print(\"文档集合:\")\n", "for i, doc in enumerate(docs):\n", "    print(f\" Doc{i+1}: {doc}\")\n", "print()\n", "\n", "print(\"分词结果:\")\n", "for i, words in enumerate(tokenized):\n", "    print(f\" Doc{i+1}: {' / '.join(words)}\")\n", "print()\n", "\n", "print(f\"词表(共{len(vocab)}个词):\")\n", "print(f\" {vocab}\")\n", "print()\n", "\n", "print(\"TF-IDF矩阵:\")\n", "for i, vec in enumerate(tfidf_matrix):\n", "    # 只显示非零值\n", "    nonzero = [(vocab[j], round(vec[j], 4)) for j in range(len(vec)) if vec[j] > 0]\n", "    print(f\" Doc{i+1}: {nonzero}\")\n", "\n", "print()\n", "\n", "# 找每个文档最重要的词\n", "print(\"每个文档最重要的词(TF-IDF值最高):\")\n", "for i, vec in enumerate(tfidf_matrix):\n", "    max_idx = max(range(len(vec)), key=lambda j: vec[j])\n", "    max_score = vec[max_idx]\n", "    if max_score > 0:\n", "        print(f\" Doc{i+1}: '{vocab[max_idx]}' (TF-IDF={max_score:.4f})\")" ] }, { 
"cell_type": "markdown", "metadata": { "jp-MarkdownHeadingCollapsed": true }, "source": [ "---\n", "\n", "# 📋 总结\n", "\n", "## 本章核心概念\n", "\n", "```\n", "文本数据处理\n", "    │\n", "    ├── 核心问题:文本(符号) → 向量(数字)\n", "    │\n", "    ├── 向量化方法\n", "    │   ├── BoW(词袋模型)\n", "    │   │   └── 核心:统计词频,忽略顺序\n", "    │   │\n", "    │   ├── TF-IDF(词频-逆文档频率)\n", "    │   │   └── 核心:词在文档中的频率 × 词的独特性\n", "    │   │\n", "    │   └── Word Embedding(词嵌入)\n", "    │       └── 核心:用语义空间表示词\n", "    │\n", "    └── 处理流程\n", "        ├── 文本预处理(分词、去停用词)\n", "        ├── 向量化\n", "        └── 下游任务(分类、相似度、生成)\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 关键公式速查\n", "\n", "| 方法 | 公式 | 含义 |\n", "|------|------|------|\n", "| 向量加法 | [1,2] + [3,4] = [4,6] | 对应位置相加 |\n", "| 向量数乘 | 2 × [1,2] = [2,4] | 每个元素乘以标量 |\n", "| 向量点积 | [1,2] · [3,4] = 11 | 对应相乘再求和 |\n", "| 向量长度 | ‖[3,4]‖ = √(3²+4²) = 5 | 勾股定理 |\n", "| 余弦相似度 | cos(θ) = (A·B) / (‖A‖×‖B‖) | 向量相似程度 |\n", "| TF-IDF | TF × IDF(本notebook中 IDF = log(N/(df+1)) + 1) | 词频 × 逆文档频率 |\n", "\n", "---\n", "\n", "> **记住:文本向量化的核心目标是把\"符号\"变成\"可计算的数值向量\"!**" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.6" } }, "nbformat": 4, "nbformat_minor": 4 }