Perceptron, Support Vector Machine and Dual Optimization Problem (1)

2023-03-28 21:35 由车天健发表于 #其他

Linear Decision Boundary（线性决策边界）

Example. (classification problem)

给定一个二元的特征空间 \(\mathcal{X} = \left\{ \text{weight} \times \text{height} \right\}\)，对标签 \(\left\{ \text{male, female} \right\}\) 进行分类，即，根据隐去性别的体重与身高的二元数据，预测 / 判断该样本的性别。性别 \(\left\{ \text{male, female} \right\}\) 可以抽象为 \(\mathcal{Y} = \left\{ -1, 1 \right\}\)。

Decision Boundary and Linear Classifier

设：\(g = \text{decision boundary}\)，\(d=1\) 的情况：

\[g(x) = w_{1} x + w_{0} = 0 \]

一般情况：

\[g(\vec{x}) = \vec{w}^{T} \cdot \vec{x} + w_{0} = 0 \]

注意到一般情况中 \(\vec{w}^{T} \cdot \vec{x}\) 为点积，结果输出为一个标量。

设 \(f = \text{linear classifier}\)，where:

\[\begin{align*} f(\vec{x}) & = \begin{cases} 1 \qquad \text{ if } ~ g(\vec{x}) \geq 0 \\ -1 ~ \quad \text{ if } ~ g(\vec{x}) < 0 \end{cases} \\ & \\ & = \text{sign}\big( \vec{w}^{T} \cdot \vec{x} + w_{0} \big) \end{align*} \]

对于 decision boundary \(g\)，我们可以将 \(w_{0}\) 也写入向量形式：

\[g(\vec{x}) = \vec{w}^{T} \cdot \vec{x} + w_{0} = \begin{pmatrix} \vec{w} \\ w_{0} \end{pmatrix}^{T} \cdot \begin{pmatrix} \vec{x} \\ 1 \end{pmatrix} = {\vec{w}^{'}}^{T} \cdot \vec{x}^{'} \]

so that the homogenous form:

\[g(\vec{x}^{'}) = {\vec{w}^{'}}^{T} \cdot \vec{x}^{'} \]

所谓 linear classifier，linear（线性）体现为：

\[\sum\limits_{i} w_{i} x_{i} + w_{0} = w_{1}x_{1} + w_{2}x_{2} + \cdots + w_{d}x_{d} + w_{0} \]

其中，\(d\) 为特征空间的维度，即 \(\vec{x} \in \mathbb{R}^{d}\)。

而非线性的 classifier，例如 sigmoid:

\[\sigma(\vec{x}) = \frac{1}{1 + e^{-\vec{w} \cdot \vec{x}}} = \frac{1}{1 + e^{-(w_{1}x_{1} + w_{2}x_{2} + \cdots + w_{d}x_{d} + w_{0})}} \]

将一个特征向量 \(\vec{x} \in \mathbb{R}^{d}\) 输入，经过一定的权重运算（或者同时进行一定的非线性变换）后输出为一个标量，这便是神经网络中一个 neuron（神经元）的运算过程，由多个如此的基本单元的相互连接组成神经网络。事实上，可以用神经网络模拟任意的光滑函数 (smooth function)，意味着该函数无限可微。

How to learn weights?

给定具有标签的训练数据（包含 bias）：

\[(\vec{x_{1}}, y_{1}), ~ (\vec{x_{2}}, y_{2}), ~ \ldots, ~ (\vec{x_{n}}, y_{n}) \]

希望找到 \(\vec{w}\) 使得 training error 最小，即：

\[\mathop{\arg\min}\limits_{\vec{w}} \frac{1}{n} \sum\limits_{i=1}^{n} \mathbb{I}_{\left\{ \text{sign}\big( \vec{w} \cdot \vec{x_{i}} \big) \neq y_{i} \right\}} = \mathop{\arg\min}\limits_{\vec{w}} \Big( \sum\limits_{x_{i} \text{ s.t. } y_{i} = 1} \mathbb{I}_{\left\{ \vec{w} \cdot \vec{x_{i}} < 0 \right\}} + \sum\limits_{x_{i} \text{ s.t. } y_{i} = -1} \mathbb{I}_{\left\{ \vec{w} \cdot \vec{x_{i}} \geq 0 \right\}} \Big) \]

传统方法是对 loss function 求导并检验 stationary point，但此处使用传统方法无法得到最优解，since it’s NP-hard to solve or approximate.

尽管问题是 NP-hard，我们可以通过做出进一步的合理假设来简化问题，使得 weight 可以被估计。例如，我们可以假设 training data 是线性可分的（linear separable）。

Linear Separability

假设：存在一条线性的 linear decision boundary 完美地对 training data 进行分类。

现在给定 labeled training data:

\[\mathcal{S} = \left\{ (\vec{x_{1}}, y_{1}), ~ (\vec{x_{2}}, y_{2}), ~ \ldots, ~ (\vec{x_{n}}, y_{n}) \right\} \]

希望判断：是否存在 \(\vec{w}\) such that:

\[\forall i \in \mathbb{N}^{+}, i \leq n: ~ y_{i} (\vec{w} \cdot \vec{x_{i}}) \geq 0 \]

注意上式中 \(y_{i} (\vec{w} \cdot \vec{x_{i}}) \geq 0\) 说明 \(y_{i}\) 和 \((\vec{w} \cdot \vec{x_{i}})\) 符号相同，即代表分类正确。

这等价于：training data 是否 linearly separable？

由于存在 \(d+1\) 个变量（指 \(\vec{x_{i}} \in \mathbb{R}^{d}\)，即 \(\vec{x_{i}} = \big( x_{i, 1}, ~ x_{i, 2}, ~ \ldots, ~x_{i, d} \big)\) 所对应的 \(d\) 个权重与 \(w_{0}\)），以及 \(|S| = n\) 个 constraints，通过一个 constraint optimization program 将优化问题解出是可能的。

The Perceptron Algorithm

感知机算法如下：

\[\begin{align*} & \text{初始化}: \vec{w}^{(0)} = \vec{0} \\ & \\ & \text{for}: t = 1, 2, 3, \cdots: \\ & \\ & \qquad \text{若存在} (\vec{x}, y) \in \mathcal{S} \text{ s.t. : } \text{sign}(\vec{w}^{(t-1)} \cdot \vec{x}) \neq y: \\ & \qquad \qquad \vec{w}^{(t)} \leftarrow \begin{cases} & \vec{w}^{(t-1)} + \vec{x} \quad \text{ if } y = 1 \\ & \vec{w}^{(t-1)} - \vec{x} \quad \text{ if } y = -1 \end{cases} = \vec{w}^{(t-1)} + y \vec{x} \\ & \\ & \qquad \text{若不存在} (\vec{x}, y) \in \mathcal{S} \text{ s.t. : } \text{sign}(\vec{w}^{(t-1)} \cdot \vec{x}) \neq y: \\ & \qquad \qquad \text{终止算法}。 \end{align*} \]

Definition. (Margin of a Hyperplane)

一个数据集所属空间中的一个超平面的 margin 是距离超平面最近的数据点到超平面的距离。

Definition. (Margin of a Dataset)

一个数据集的 margin (记作 \(\gamma\)) 是对于这个数据集而言，使用任意的权重向量 (weight vector) 分别点乘每一个数据点，其新数据集在新空间内所尽可能达到的最大 margin (of the hyperplane)。

换言之，margin of hyperplane 是一个 \(\min\) 的概念，而 margin of dataset 是一个 \(\max\min\) 的概念。并且，对于 margin of dataset 的定义，假设初始 feature dataset 为 \(\mathcal {X} = \left\{\vec{x_{1}}, \vec{x_{2}}, \ldots, \vec{x_{n}} \right\}\)，对于每一个数据点 \(\vec{x}_{i}\)，点乘以一个任意的权重向量 \(\vec{w}\)，得到新的一组 feature dataset \(\mathcal{X}^{\star} = \left\{ \vec{w} \cdot \vec{x_{1}}, \vec{w} \cdot \vec{x_{2}}, \ldots, \vec{w} \cdot \vec{x_{n}} \right\}\)，再配对以各自的 label 得到二维平面上的新数据集 \(\mathcal{X}^{\star} \times \mathcal{Y} = \left\{ (\vec{w} \cdot \vec{x_{1}}, y_{1}), (\vec{w} \cdot \vec{x_{2}}, y_{2}), \ldots, (\vec{w} \cdot \vec{x_{n}}, y_{n}) \right\}\)。在这个二维空间内根据给定的 hyperplane （即一根直线），求出使得 margin of the dataset \(\mathcal{X}^{\star} \times \mathcal{Y}\) 最大的上述 \(\vec{w}\)。

Perceptron Algorithm Guarantee

Theorem. (Perceptron Mistake Bound)

假设存在一个 \(\vec{w}^{*}\) （单位长度为 \(1\)），使 \(\text{margin} = \gamma\) 的 training sample 变得 linear separable。令 \(R = \max\limits_{\vec{x} \in \mathcal{X}} ||\vec{x}||\)，那么 perceptron algorithm 最多会犯 \(T = \big(\frac{R}{\gamma}\big)^{2}\) 次错误。

Proof.

证明的关键点在于说明：经过第 \(t\) 次遍历后得到的 \(\vec{w}^{(t)}\) 和 \(\vec{w}^{*}\) 相差多少？

假设 perceptron algorithm 在第 \(t\) 次遍历时犯了一个错误，那么：

\[\begin{align*} \vec{w}^{(t)} \cdot \vec{w}^{*} & = (\vec{w}^{(t-1)} + y \vec{x}) \cdot \vec{w}^{*} \\ & = \vec{w}^{(t-1)} \cdot \vec{w}^{*} + y \vec{x} \cdot \vec{w}^{*} \\ & \geq \vec{w}^{(t-1)} \cdot \vec{w}^{*} + \gamma \end{align*} \]

上面最后一个不等式的推导，我后面会解释。简单来说，这是因为 data is separable by a margin \(\gamma\)。

同时，我们有：

\[\begin{align*} ||\vec{w}^{(t)}||^{2} & = ||\vec{w}^{(t-1)} + y \vec{x}||^{2} \\ & = ||\vec{w}^{(t-1)}||^{2} + 2 y (\vec{w}^{(t-1)} \cdot \vec{x}) + ||y \vec{x}||^{2} \\ & \leq ||\vec{w}^{(t-1)}||^{2} + R^{2} \end{align*} \]

解释：

上面最后一个不等式的推导意味着：\(2 y (\vec{w}^{(t-1)} \cdot \vec{x}) + ||y \vec{x}||^{2} \leq R^{2} = \max\limits_{x \in \mathcal{X}} || \vec{x} ||^{2}\)。这是因为，perceptron algorithm 仅在分类错误的时候对权重向量 \(\vec{w}\) 进行更新，由我们上述的假设，即第 \(t\) 次遍历的时候出现分类错误，这意味着：

\[y (\vec{w}^{(t-1)} \cdot \vec{x}) < 0 \]

代表着 label \(y\) 与 \(\vec{w}^{(t-1)} \cdot \vec{x}\) 异号。

由于label \(y\) 为标量，则 \(||y \vec{x}||^{2} = y^{2} ||\vec{x}||^{2}\)，且由于 \(y \in \left\{ -1, 1 \right\}\)，则：

\[||y \vec{x}||^{2} = ||\vec{x}||^{2} \leq \max\limits_{\vec{x} \in \mathcal{X}} ||\vec{x}||^{2} \]

自然成立。

那么，由上我们证得：

\[\vec{w}^{(t)} \leq ||\vec{w}^{(t-1)}||^{2} + R^{2} \]

对于每次遍历 \(1, 2, \ldots, t\)，由上式，有：

\[\begin{cases} \vec{w}^{(t)} \cdot \vec{w}^{*} \geq \vec{w}^{(t-1)} + \gamma \\ \\ ||\vec{w}^{(t)}||^{2} \leq ||\vec{w}^{(t-1)}||^{2} + R^{2} \end{cases} \]

因此，在 \(T\) 次遍历后：

\[\begin{align*} T \gamma & \leq \vec{w}^{T} \cdot \vec{w}^{*} \\ & \leq ||\vec{w}^{(T)}|| \cdot ||\vec{w}^{*}|| \\ & \leq R \sqrt{T} \end{align*} \]

解释：
- 第一个不等式可以 recursively 得到，即：
  
  \[\begin{align*} \vec{w}^{(T)} \cdot \vec{w}^{*} & \geq \vec{w}^{(T-1)} \cdot \vec{w}^{*} + \gamma \\ & \geq \vec{w}^{(T-2)} \cdot \vec{w}^{*} + 2\gamma \\ & \geq \vec{w}^{(T-3)} \cdot \vec{w}^{*} + 3\gamma \\ & \; \; \vdots \quad \text{recursively} \\ & \geq T \gamma \end{align*} \]
- 第二个不等式则是因为：
  
  \[\vec{w}^{(T)} \cdot \vec{w}^{*} = \cos \langle \vec{w}^{(T)}, \vec{w}^{*} \rangle \cdot ||\vec{w}^{(T)} || \cdot ||\vec{w}^{*}|| \]
  
  而 \(\cos \langle \vec{w}^{(T)}, \vec{w}^{*} \rangle \in [-1, 1]\)。
- 第三个不等式同样可以 recursively 得到，即：
  
  \[\begin{align*} & \vec{w}^{(0)} = \vec{0}, \text{ recursively: } \\ & ||\vec{w}^{(1)}||^{2} \leq R^{2} \\ & ||\vec{w}^{(2)}||^{2} \leq ||\vec{w}^{(1)}||^{2} + R^{2} \leq 2 R^{2} \\ & \vdots \\ & ||\vec{w}^{(T)}||^{2} \leq ||\vec{w}^{(T-1)}||^{2} + R^{2} \leq ||\vec{w}^{(T-2)}||^{2} + 2R^{2} \leq \cdots \leq T R^{2} \end{align*} \]

因此，我们得到：

\[T \gamma \leq R \sqrt{T} \]

即：

\[T \leq \Big( \frac{R}{\gamma} \Big)^{2} \]

Why we have: \(y \vec{x} \cdot \vec{w}^{*} \geq \gamma\)?

首先，\(\gamma\) 为 sample dataset 的 margin，它的含义如上文所述，其代表对于任意可能的权重向量，能使经权重向量变换后的数据集中离超平面最近的数据点到超平面的这段 “最小距离” 最大的距离（比较拗口，注意理解）。即：

\[\max\limits_{\vec{w}} \big( \min\limits_{\vec{x} \in \mathcal{X}} D \big) \]

现在，由于假设了存在 \(\vec{w}^{*}\) 使超平面将 sample dataset 线性区分，则超平面的解析式为：

\[y (\vec{w}^{*} \cdot \vec{x}) = 0 \]

对于任意的数据点 \(\vec{x_{0}}\)，其到超平面的距离为：

\[\begin{align*} \frac{|y(\vec{w}^{*} \cdot \vec{x_{0}})|}{\sqrt{||\vec{w}^{*}||^{2}}} & = |y(\vec{w}^{*} \cdot \vec{x_{0}})| \\ & = y (\vec{w}^{*} \cdot \vec{x_{0}}) \end{align*} \]

其中第一个等式的成立是因为有假设： \(||\vec{w}^{*}||^{2} = 1\)，第二个等式的成立同样因为假设：\(y (\vec{w}^{*} \cdot \vec{x_{0}}) \geq 0\)。

然而，即使 \(\gamma\) 代表的是 “最大的最小距离”，即 \(\max\limits_{\vec{w}} \big( \min\limits_{\vec{x} \in \mathcal{X}} D \big)\)，在给定的 \(\vec{w}\) 下，\(\gamma\) 所代表的依然是在这种选择 \(\vec{w}\) 下的、对于所有 \(\vec{x} \in \mathcal{X}\) 的最小距离，即一个 lower bound。

那么给定 \(\vec{w}^{*}\) 下，\(\gamma\) 即为给定的 \(\vec{w}^{*}\) 下的超平面的 margin，而非数据集的 margin，这意味着：

\[\forall \vec{x_{0}} \in \mathcal{X}: ~ y(\vec{w}^{*} \cdot \vec{x_{0}}) = y \vec{w}^{*} \cdot \vec{x_{0}} \geq \gamma \]

热门相关：骑士归来巡狩万界霸皇纪照见星星的她情生意动