Covariance was mentioned in the earlier post on expectation and variance; since the covariance matrix shows up so often in machine learning, it deserves a fuller treatment here.
Covariance measures how two variables vary together. Taking discrete random variables as an example,
Variance: $\sigma^{2}=\frac{\Sigma_{i=1}^{n}(x_{i}-\bar{x})^2}{n-1}$
Covariance: $Cov(x, y)=\frac{\Sigma_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{n-1}$
Here x and y are two variables; in machine learning they correspond to two feature dimensions. For example, x could be the height values across all samples, and y the length values across the same samples.
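As a quick sanity check of the formula above (a minimal sketch; the feature values here are made up), the hand-computed covariance can be compared against NumPy's built-in `np.cov`, which also divides by n-1 by default:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])   # e.g. a height feature across 3 samples
y = np.array([2.0, 3.0, 4.0])   # e.g. a length feature across the same samples
n = len(x)

# covariance by the definition above (n - 1 in the denominator)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
print(cov_xy)                   # 1.0
print(np.cov(x, y)[0, 1])       # same value from NumPy
```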
Note: the denominator is n-1 rather than n to make the estimate unbiased. The sample mean $\bar{x}$ deviates from the true mean $\mu$, which makes $\frac{\Sigma_{i=1}^{n}(x_{i}-\bar{x})^2}{n}<\frac{\Sigma_{i=1}^{n}(x_{i}-\mu)^2}{n}$. The derivation follows; skip it if you are not interested:
$\frac{1}{n}\Sigma_{i=1}^{n}(x_{i}-\bar{x})^2$
$=\frac{1}{n}\Sigma_{i=1}^{n}[(x_{i}-\mu)+(\mu-\bar{x})]^2$
$=\frac{1}{n}\Sigma_{i=1}^{n}(x_{i}-\mu)^2+\frac{2}{n}\Sigma_{i=1}^{n}(x_{i}-\mu)(\mu-\bar{x})+\frac{1}{n}\Sigma_{i=1}^{n}(\mu-\bar{x})^2$
$=\frac{1}{n}\Sigma_{i=1}^{n}(x_{i}-\mu)^2+2(\bar{x}-\mu)(\mu-\bar{x})+(\mu-\bar{x})^2$
$=\frac{1}{n}\Sigma_{i=1}^{n}(x_{i}-\mu)^2-(\mu-\bar{x})^2$
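The effect of the n-1 correction can also be seen numerically: averaging the biased (divide-by-n) sample variance over many small samples underestimates the true variance, while the divide-by-(n-1) version does not. A minimal simulation sketch (the sample size, trial count, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 20000
# many small samples from a distribution with true variance 1
samples = rng.standard_normal((trials, n))

biased = samples.var(axis=1, ddof=0).mean()    # divide by n
unbiased = samples.var(axis=1, ddof=1).mean()  # divide by n - 1

print(biased)    # close to (n-1)/n = 0.8
print(unbiased)  # close to the true variance 1.0
```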
The covariance matrix is:
$
\Sigma =
\left[
\begin{matrix}
Cov(X_{1},X_{1}) & Cov(X_{1}, X_{2}) & \cdots & Cov(X_{1}, X_{n}) \\
Cov(X_{2},X_{1}) & Cov(X_{2}, X_{2}) & \cdots & Cov(X_{2}, X_{n}) \\
\vdots & \vdots & \ddots & \vdots \\
Cov(X_{n},X_{1}) & Cov(X_{n}, X_{2}) & \cdots & Cov(X_{n}, X_{n})
\end{matrix}
\right]
$
where $X_{i}$ is the vector formed by the values of the i-th feature dimension across all samples. The covariance matrix $\Sigma$ is a symmetric square matrix, and from the definition it is easy to see that
$\Sigma_{ij}=Cov(X_{i}, X_{j})=\frac{(X_{i}-\bar{X_{i}})^{T}(X_{j}-\bar{X_{j}})}{n-1}$
Suppose there are 3 samples, each with 3 dimensions: (1, 2, 1), (2, 3, 2), (3, 4, 5)
$
A =
\left[
\begin{matrix}
1 & 2 & 1 \\
2 & 3 & 2 \\
3 & 4 & 5
\end{matrix}
\right]
$
Then the feature column vector for each dimension is X1 = [1, 2, 3], X2 = [2, 3, 4], X3 = [1, 2, 5]:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import numpy as np

x1 = np.array([1, 2, 3])  # first feature dimension across all samples
x2 = np.array([2, 3, 4])  # second feature dimension
x3 = np.array([1, 2, 5])  # third feature dimension
x1_avg = np.mean(x1)
x2_avg = np.mean(x2)
x3_avg = np.mean(x3)
print(x1_avg)  # 2.0

# covariance entries with n - 1 = 2 in the denominator (unbiased estimate)
cov_1_1 = np.dot((x1 - x1_avg).T, (x1 - x1_avg)) / 2
cov_1_2 = np.dot((x1 - x1_avg).T, (x2 - x2_avg)) / 2
print(cov_1_1)  # 1.0
print(cov_1_2)  # 1.0
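Instead of computing entries one at a time, the full matrix can be produced in one shot from the sample matrix A by centering each column and multiplying, matching the $\Sigma_{ij}$ formula above. A sketch, with `np.cov` as an independent check (`rowvar=False` tells NumPy that columns, not rows, are the variables):

```python
import numpy as np

A = np.array([[1, 2, 1],
              [2, 3, 2],
              [3, 4, 5]], dtype=float)   # rows = samples, columns = features
n = A.shape[0]
centered = A - A.mean(axis=0)            # subtract each feature's mean
Sigma = centered.T @ centered / (n - 1)  # all Cov(X_i, X_j) at once

print(Sigma)
print(np.cov(A, rowvar=False))           # should match Sigma
```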