COMP5270 Assignment 3 题解 — Euclidean LSH 证明修复

COMP5270 Assignment 3 题解 — Euclidean LSH 证明修复

这是一道关于 Euclidean 空间 Locality-Sensitive Hashing (LSH) 的完整证明题。故事背景是"猫咪打翻咖啡,损坏了笔记本",需要补全证明中的 15 处缺失单词或表达式。No justification is needed。


完整定理与证明(含 15 个答案)

Theorem 1. 考虑哈希族 \(\mathcal{H}_w = \{h_{g,u} : \mathbb{R}^d \to \mathbb{N}\}_{g\in\mathbb{R}^d, u\in(0,1)}\),其中

\[h_{g,u}(x) = \left\lfloor \frac{\langle x, g \rangle}{w} + u \right\rfloor\]

\(w > 0\) 是待选参数。从 \(\mathcal{H}_w\) 中抽取函数的方式为:独立抽取 \(G \sim \mathcal{N}(0, I_d)\)\(U \sim \text{Unif}(0, 1)\),返回 \(h_{G,U}\)

则对任意 \(C > 1\)\(r > 0\),设 \(w = \boxed{2C^2 r}\)\(\mathcal{H}_w\) 是一个 \((r, C, p, q)\)-LSH 族,其 sensitivity 参数满足 \(\rho \leq \boxed{1/C}\),其中

\[p = f(2C^2), \quad q = f(2C), \quad f(x) = x \cdot \sqrt{\frac{2}{\pi}} \int_0^1 e^{-\frac{x^2 t^2}{2}} (1 - t) \, dt\]

证明. 固定 \(C > 1\)\(r > 0\)\(w\) 如上。分析 \(\Pr_{G,U} [h_{G,U}(x) = h_{G,U}(y)]\),对 \(x, y \in \mathbb{R}^d\),当 (1) \(\|x - y\|_2 \leq \boxed{r}\) 和 (2) \(\|x - y\|_2 \geq C \cdot \boxed{r}\) 时。令 \(\delta := x - y\)

观察 1

\(|\langle \delta, g \rangle| \geq w\),则 \(\boxed{h_{g,u}(x)} \neq h_{g,u}(y)\) 对所有 \(u \in (0,1)\) 成立。

理由: \(|(\langle x,g\rangle/w + u) - (\langle y,g\rangle/w + u)| = |\langle\delta,g\rangle|/w \geq 1\),故取整后不等。

观察 2

\(|\langle \delta, g \rangle| < w\),则 \(\Pr_{U} [h_{g,U}(x) \neq h_{g,U}(y)] = \boxed{|\langle \delta, g \rangle| / w}\)

理由:\(a = \langle x,g\rangle/w \leq \langle y,g\rangle/w = b\)\(k \leq b < k+1\)。由于 \(b-a < 1\),只有两种情况:

Case 1: \(k \leq a \leq b < k+1\)

\[\Pr[h_{g,U}(x) \neq h_{g,U}(y)] = \Pr[a+U < k+1, b+U \geq k+1] = \Pr[k+1-b \leq U < k+1-a] = \boxed{b - a}\]

Case 2: \(a < k < b < k+1\)\(\lfloor a+U\rfloor < \lfloor b+U\rfloor\) 当且仅当 \(U < k-a\)

\[U \geq \max(k+1-b, k-a) = \boxed{k+1-b}\]

(因 \(k+1-b = k-a+1-(b-a) \geq k-a\))。于是

\[\Pr = (k-a) + (1-(k+1-b)) = b - a\]

结论得证,因为 \(b-a = \langle\delta,g\rangle/w\)

积分步骤

调用熟知事实:\(G \sim \mathcal{N}(0, I_d)\),则 \(\langle \delta, G \rangle \sim \mathcal{N}(0, \sigma^2)\),方差 \(\sigma^2 = \boxed{\|\delta\|^2}\)。令 \(z := \langle\delta,g\rangle\),积分得:

\[\Pr_{G,U}[h_{G,U}(x) = h_{G,U}(y)] = \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\boxed{w}}^{\boxed{w}} e^{-\frac{z^2}{2\sigma^2}} \left(1 - \boxed{\frac{|z|}{w}}\right) dz\]

\[= \frac{1}{\sigma} \sqrt{\frac{2}{\pi}} \int_0^w e^{-\frac{(z/\sigma)^2}{2}} \left(1 - \frac{z}{w}\right) dz = \frac{w}{\sigma} \sqrt{\frac{2}{\pi}} \int_0^1 e^{-\frac{(w/\sigma)^2 t^2}{2}} (1 - t) \, dt\]

\(f\) 的单调性

\(f(x) = x \sqrt{\frac{2}{\pi}} \int_0^1 e^{-\frac{x^2 t^2}{2}} (1-t) dt\)

数值验证 \(f\)\(x > 0\) 上的单调性:

\(x\) \(f(x)\) 趋势
0.1 0.040
0.5 0.195
1.0 0.369
2.0 0.610
5.0 0.840
10.0 0.920
50.0 0.984
100.0 0.992

\(f\)\(x > 0\) 上单调 \(\boxed{\text{inc}}\)reasing(递增,从 \(0\) 增到 \(1\))。

⚠️ 此空容易写成 "dec"(decreasing)。但从数值验证可见 \(f\) 递增,且必须递增才能使后续不等式方向正确(见下文)。

完成证明

\(p = f(2C^2)\)\(q = f(2C)\),回顾 \(w = 2C^2 r\)

  • \(\sigma = \|x-y\|_2 \leq r\)(近点):\(w/\sigma \geq w/r\)\(f\) 递增 \(\implies\)

    \[\Pr[h_{G,U}(x) = h_{G,U}(y)] = f(w/\sigma) \;\boxed{\geq}\; f(w/r) = p\]

  • \(\sigma = \|x-y\|_2 \geq C \cdot r\)(远点):\(w/\sigma \leq w/(Cr)\)\(f\) 递增 \(\implies\)

    \[\Pr[h_{G,U}(x) = h_{G,U}(y)] = f(w/\sigma) \;\boxed{\leq}\; f(w/(Cr)) = q\]

因此 \(\mathcal{H}_w\)\((r, C, p, q)\)-LSH 族,其 \(\boxed{\text{sensitivity}}\) 参数为

\[\rho = \frac{\log \frac{1}{f(2C^2)}}{\log \frac{1}{f(2C)}}\]

最后只需证明 \(\rho \leq 1/C\)——可通过 Taylor 展开验证:\(C \to \infty\)\(\rho = \frac{1}{C} + O\left(\frac{1}{C^2}\right)\)。证毕。


15 个答案汇总表

# 上下文 答案 所在页 解释
1 $w = $ \(2C^2 r\) P1 Gaussian LSH 的 standard bandwidth 设置
2 $$ \(1/C\) P1 sensitivity 参数的目标上界
3 $|x-y|_2 $ \(r\) P1 近点距离阈值
4 \(\|x-y\|_2 \geq C \cdot\) \(r\) P1 远点距离阈值,放大了 \(C\)
5 then \(\_\_ \neq h_{g,u}(y)\) \(h_{g,u}(x)\) P1 观察 1:投影差 \(\geq w\) 时二者恒不等
6 $= $ \(\|\langle\delta,g\rangle\|/w\) P1 观察 2:冲突概率
7 Case 1: \(=\) \(b-a\) P2 \(\Pr[k+1-b \leq U < k+1-a] = b-a\)
8 \(\max(k+1-b, k-a)=\) \(k+1-b\) P2 \(b-a<1\)\(k+1-b \geq k-a\)
9 方差 \(\|\delta\|^2\) P2 \(\langle\delta,G\rangle \sim \mathcal{N}(0,\|\delta\|^2)\)
10 \(\int_{-\_\_}^{\_\_}\) 上限 \(w\) P2 被积函数积分上限
11 \(1 - \_\_\) \(\|z\|/w\) P2 被积函数中的减项
12 monotonically \(\_\_\)reasing inc P2 \(f\) 递增(0→1),非递减
13 \(f(w/\sigma)\) \(\_\_\) \(f(w/r)\) \(\geq\) P2 近点:\(w/\sigma \geq w/r\)\(f\)
14 \(f(w/\sigma)\) \(\_\_\) \(f(w/(Cr))\) \(\leq\) P3 远点:\(w/\sigma \leq w/(Cr)\)\(f\)
15 with \(\_\_\) parameter sensitivity P3 \(\rho\) 的标准名称

关键知识点

LSH 的核心定义

\((r, C, p, q)\)-LSH 族需要满足: - 若 \(\|x-y\| \leq r\),则 \(\Pr[h(x)=h(y)] \geq p\)(近点高概率碰撞) - 若 \(\|x-y\| \geq Cr\),则 \(\Pr[h(x)=h(y)] \leq q\)(远点低概率碰撞) - 并且 \(p > q\),保证可以区分远近点

Sensitivity 参数

\[\rho = \frac{\log(1/p)}{\log(1/q)}\]

是衡量 LSH 族查询效率的核心指标。\(\rho\) 越小,查询效率越高。对 Euclidean LSH,目标 \(\rho \leq 1/C\)

Gaussian (p-stable) LSH 原理

  1. p-stable 分布性质:高斯分布是 2-stable,即 \(\langle x, G \rangle - \langle y, G \rangle = \langle x-y, G \rangle \sim \mathcal{N}(0, \|x-y\|^2)\)
  2. 哈希构造\(h_{g,u}(x) = \lfloor \langle x,g\rangle/w + u \rfloor\),其中 \(g\) 是高斯向量,\(u\) 是随机偏移
  3. \(w\) 的选择\(w = 2C^2 r\) 使 \(p = f(2C^2)\)\(q = f(2C)\) 独立于 \(r\)

为什么 \(f\) 必须递增(第 12 空)

\[\sigma \leq r \implies w/\sigma \geq w/r \xrightarrow{f \uparrow} f(w/\sigma) \geq f(w/r) = p\] \[\sigma \geq Cr \implies w/\sigma \leq w/(Cr) \xrightarrow{f \uparrow} f(w/\sigma) \leq f(w/(Cr)) = q\]

如果 \(f\) 是递减的,则两条不等号方向全部反转,与 LSH 定义矛盾。数值验证也确认 \(f(0)=0\)\(f(\infty)=1\),单调递增。

常见陷阱

  • "dec" vs "inc":第 12 空 "monotonically ___reasing" 看起来很暧昧,但从数学自洽性可确定必须是 "inc"(increasing)
  • 积分上下限:利用 \(\int_{-w}^{w} = 2\int_0^{w}\) 的对称性,积分变量替换 \(t = z/w\) 得到 \(f(w/\sigma)\)
  • 冲突 vs 不冲突:观察 2 给出的是 \(\Pr[\text{不等}]\),最终需要的是 \(\Pr[\text{相等}] = 1 - \cdots\) 的简化形式