{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Sampling High-Dimensional Vectors\n",
    "Aaron R. Voelker (January 15, 2016)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pylab inline\n",
    "import numpy as np\n",
    "import pylab\n",
    "try:\n",
    "    import seaborn as sns  # optional; prettier graphs\n",
    "except ImportError:\n",
    "    sns = None\n",
    "\n",
    "import nengo\n",
    "from nengolib.compat import get_activities\n",
    "from nengolib.stats import ScatteredHypersphere, Sobol\n",
    "\n",
    "uniform_ball = nengo.dists.UniformHypersphere(surface=False)\n",
    "uniform_sphere = nengo.dists.UniformHypersphere(surface=True)\n",
    "\n",
    "# base=Rd() is now the default\n",
    "scatter_ball = ScatteredHypersphere(surface=False, base=Sobol())\n",
    "scatter_sphere = ScatteredHypersphere(surface=True, base=Sobol())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Abstract\n",
    "\n",
    "The **Monte Carlo (MC)** method of sampling is notoriously bad at reproducing the same statistics as the distribution being sampled."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_dist(dist, title, num_samples=500):\n",
    "    pylab.figure(figsize=(4, 4))\n",
    "    pylab.title(title)\n",
    "    pylab.scatter(*dist.sample(num_samples, 2).T, s=10, alpha=0.7)\n",
    "    pylab.xlim(-1, 1)\n",
    "    pylab.ylim(-1, 1)\n",
    "    pylab.show()\n",
    "    \n",
    "plot_dist(uniform_ball, 'Uniform 2-Ball')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Intuitively, MC sampling gives lots of \"gaps\" and \"clumps\", while instead what we want is more of a \"**scattering**\" of points uniformly about the sphere."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_dist(scatter_ball, 'Scattered 2-Ball')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We currently have three reasons to sample vectors in Nengo:\n",
    " 1. Choosing the **encoders** for a population\n",
    " 2. Choosing the **evaluation points** to solve for decoders\n",
    " 3. Choosing the **semantic pointers** in a vocabulary\n",
    " \n",
    "MC is bad for problem 1, because the neurons should be uniformly representing all parts of the vector space. \n",
    "MC sampling is _also_ bad for problem 2, because the \"**empirical distribution**\" does not match the actual distribution unless there are a large number of samples, and so the decoders are biased to minimize the approximation error of certain vectors over others. A scattered distribution overcomes this problem by giving a closer match to the uniform distribution with fewer samples.\n",
    "\n",
    "In fact, problems 1 and 2 are basically equivalent. When sampling encoders, we are effectively choosing which vectors should have the least error (by principle (1) they fire the most, and then by principle (2) they contribute the most to the estimate). The only 'real' difference is that encoders are on the $D$-sphere, while evaluation points are on the $D$-ball. These two problems can be solved efficiently by the **number-theoretic method (NTM)**, also known as the **quasi Monte Carlo method**, to generate scattered points. These solutions can then be used to sample encoders and evaluation points, to improve the representation of a population and its decoders."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def do_trial(seed, encoders, eval_points, n_eval_points, test_points, n_test_points,\n",
    "             n_neurons, dims):\n",
    "    with nengo.Network(seed=seed) as model:\n",
    "        # Make a single ensemble and connection\n",
    "        ens = nengo.Ensemble(\n",
    "            n_neurons, dims, encoders=encoders, eval_points=eval_points,\n",
    "            n_eval_points=n_eval_points)\n",
    "        conn = nengo.Connection(ens, nengo.Node(size_in=dims))\n",
    "        \n",
    "        # Build the model\n",
    "        built = nengo.builder.Model(decoder_cache=nengo.cache.NoDecoderCache())\n",
    "        built.build(model)\n",
    "        with nengo.Simulator(None, model=built, progress_bar=False) as sim:\n",
    "            pass\n",
    "        \n",
    "        # Find the optimal decoders and their corresponding RMSE on the eval_points\n",
    "        decoders = sim.data[conn].weights\n",
    "        eval_rmses = np.mean(sim.data[conn].solver_info['rmses'])\n",
    "        \n",
    "        # Sample some new test_points and test them on the same decoders\n",
    "        x = test_points.sample(n_test_points, dims, rng=np.random.RandomState(seed))\n",
    "        a = get_activities(sim.model, ens, x)\n",
    "        x_hat = np.dot(a, decoders.T)\n",
    "        test_rmses = nengo.utils.numpy.rmse(x, x_hat, axis=1)\n",
    "        \n",
    "        # Return the average training and test errors\n",
    "        return np.mean(eval_rmses), np.mean(test_rmses)\n",
    "\n",
    "def do_experiment(n_neurons, dims, n_eval_points=500, test_points=uniform_ball,\n",
    "                  n_test_points=500, trials=10):\n",
    "    fig, ax = pylab.subplots(1, 2, sharey=True, figsize=(15, 6))\n",
    "    ax[0].set_title('Train Error')\n",
    "    ax[1].set_title('Test Error')\n",
    "    default_means = None\n",
    "    for i, (label, encoders, eval_points) in enumerate((\n",
    "            ('Default',     uniform_sphere, uniform_ball),\n",
    "            ('Encoders',    scatter_sphere, uniform_ball),\n",
    "            ('Eval Points', uniform_sphere, scatter_ball),\n",
    "            ('Both',        scatter_sphere, scatter_ball))):\n",
    "        errors = np.empty((trials, 2))\n",
    "        for seed in range(trials):\n",
    "            errors[seed] = do_trial(\n",
    "                seed, encoders, eval_points, n_eval_points,\n",
    "                test_points, n_test_points, n_neurons, dims)\n",
    "        means = np.mean(errors, axis=0)\n",
    "        if default_means is None:\n",
    "            default_means = means\n",
    "        for j in range(2):\n",
    "            l = '%s (%d%%)' % (label, default_means[j] / means[j] * 100)\n",
    "            if sns is None:\n",
    "                ax[j].hist(errors[:, j], label=l, lw=1, alpha=0.3)\n",
    "            else:\n",
    "                sns.kdeplot(errors[:, j], ax=ax[j], label=l, lw=4, alpha=0.6)\n",
    "    ax[0].legend()\n",
    "    ax[1].legend()\n",
    "    pylab.show()\n",
    "\n",
    "do_experiment(n_neurons=100, dims=16)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, problem 3 is _strictly_ harder (and in fact almost completely different), and so we will save that talk for a later day.\n",
    "\n",
    "## The Number-Theoretic Method (NTM)\n",
    "\n",
    "This exact same problem showed up as early as 1961 and was studied extensively in the 1980s [1] for applications in **experimental design** and **statistical finance**, in which the task boils down to evaluating a high-dimensional integral:\n",
    "\n",
    "$$\\int_{S} f({\\bf u})\\,{\\rm d}{\\bf u}$$\n",
    "\n",
    "where $S$ is some $D$-dimensional space. Due to the **curse of dimensionality**, even modestly sized $D$ requires too many points to evaluate using standard numerical integration techniques like the trapezoidal rule. Instead, the standard approach is to choose a sub-sampling of **representative points** $\\{{\\bf x_1}, \\ldots, {\\bf x_N}\\}$ over the domain of $S$, and compute:\n",
    "\n",
    "$$\\approx \\frac{1}{N}\\,\\sum_{i=1}^N f({\\bf x_i})$$\n",
    "\n",
    "This works well as long as we can sample these points uniformly. It has been theoretically proven that the approximation error from using the NTM is superior to that of MC sampling.\n",
    "\n",
    "---\n",
    "\n",
    "To make the connection back to our original problem explicit, when solving for the decoders we are minimizing the mean squared error given by:\n",
    "\n",
    "$$\\int_{S} ({\\bf x} - {\\bf \\hat{x}})^2 \\,{\\rm d}{\\bf x}$$\n",
    "\n",
    "where $S$ is the $D$-ball. Thus, $f({\\bf u}) = ({\\bf u} - {\\bf \\hat{u}})^2$ is the function we need to integrate. And the points $\\{{\\bf x_1}, \\ldots, {\\bf x_N}\\}$ are precisely the set of $N$ evaluation points that we are, in effect, choosing to approximate this integral. This explains more formally why this new approach out-performs the old approach.\n",
    "\n",
    "---\n",
    "\n",
    "Now the NTM goes by a number of roughly equivalent names:\n",
    " - uniform design\n",
    " - NT-net\n",
    " - quasi Monte Carlo\n",
    " - quasi-random\n",
    " - low-discrepancy sequence\n",
    " - representative points\n",
    " - uniformly scattered points\n",
    "\n",
    "We will refer to the collective theory as the NTM, and to a specific sampling as a uniform scattering.\n",
    "\n",
    "There are many algorithms to generate scattered points in the literature:\n",
    " - Faure\n",
    " - Good Points\n",
    " - Good Lattice Points\n",
    " - ~~Latin Square~~\n",
    " - Haber\n",
    " - Halton (van der Corput)\n",
    " - Hammersley\n",
    " - Hua and Wang (cyclotomic field)\n",
    " - Niederreiter\n",
    " - ~~Poisson Disk~~\n",
    " - Sobol\n",
    " \n",
    "Since **Sobol** had the most readily-available Python library (and the largest Wikipedia page), I used it for all my experiments. The default method has been updated to **[Rd](http://extremelearning.com.au/unreasonable-effectiveness-of-quasirandom-sequences/)** since the writing of this notebook.\n",
    "\n",
    "All of these approaches (except the Latin Square and Poisson Disk sampling) attempt to minimize the **discrepancy** of the samples. Informally, the discrepancy measure tells us how much the sample distribution differs from the underlying distribution. Formally, we define the empirical distribution as the cumulative distribution function (CDF) $F_N({\\bf x}) = P({\\bf X} \\le {\\bf x})$, where:\n",
    "\n",
    "$$P({\\bf X} = {\\bf x_i}) = \\frac{1}{N}, \\quad i = 1 \\ldots N$$\n",
    "\n",
    "Then the discrepancy of this set is defined as:\n",
    "\n",
    "$$D(N) = sup_{x \\in \\mathbb{R}^D} |F_N({\\bf x}) - F({\\bf x})|$$\n",
    "\n",
    "where $F$ is the true CDF. This gives an upper-bound on how well the empirical CDF approximates the true CDF. The lower this is, the better our sample set represents the true distribution at all points (not just the points that were sampled). And all of these approaches (again, except for Latin/Poisson) have worst-case (guaranteed) bounds that are asymptotically dominated by MC in theory (and for lower $N$ as well in practice).\n",
    "\n",
    "$$D(N) = \\begin{cases} \n",
    "      O(\\frac{1}{\\sqrt{N}}) & \\text{MC Sampling} \\\\\n",
    "      O(\\frac{(log N)^D}{N}) & \\text{NTM Sampling} \n",
    "   \\end{cases}$$\n",
    "\n",
    "The $\\sqrt{N} = o(N)$ is the important part. The numerator is a worst-case that is reported to be a small constant in practice.\n",
    "\n",
    "The nice fact that comes out of all of this, is that: the discrepancy is related to the error when computing the above integral by a constant factor (fixed for a given $f$), and thus reflects the error in approximating the decoders' true RMSE (its generalizability).\n",
    "\n",
    "$$\\left|\\underbrace{\\int_{S} ({\\bf x} - {\\bf \\hat{x}})^2 \\,{\\rm d}{\\bf x}}_{\\text{Testing Error}} - \\underbrace{\\frac{1}{N}\\,\\sum_{i=1}^N ({\\bf x_i} - {\\bf \\hat{x_i}})^2}_{\\text{Training Error}} \\right| \\le \\underbrace{C(f) \\, D(N)}_{\\text{Generalization Error}}$$\n",
    "\n",
    "where $C(f)$ is a constant that depends on the ensemble and the function being optimized, and $D(N)$ is the discrepancy of the $N$ evaluation points. When choosing evaluation points, we are fixing $C(f)$ and trying to minimize $D(N)$. Therefore, when using NTMs over MC sampling, we only need on the order of $\\sqrt{N}$ as many evaluation points to get the same level of performance; NTM squares the effective number of evaluation points!\n",
    "\n",
    "Now, everything is fine and dandy. It's a simple one-line call in Python to generate a sequence that does exactly what we need. However... all of these methods generate points on the $D$-cube, but not the $D$-sphere or $D$-ball as required. Fortunately, [1] describes an **inverse transform method** which allows us to generate scattered points on any distribution provided we can represent ${\\bf x}$ as a set of independent random variables with some known inverse CDF. Furthermore, the authors have already done this for the $D$-sphere and $D$-ball using the **spherical coordinate transformation**! "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The Inverse Transform Method\n",
    "\n",
    "It is not at all clear how to map scattered points from the $D$-cube to the $D$-sphere or $D$-ball. If we just try to normalize the vectors to the sphere, then the results are poor."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_3d(xyz, title, s=10, alpha=0.7):\n",
    "    from mpl_toolkits.mplot3d import Axes3D\n",
    "    fig = pylab.figure(figsize=(7, 7))\n",
    "    ax = fig.add_subplot(111, projection='3d')\n",
    "    ax.set_title(title)\n",
    "    ax.scatter(*xyz.T, alpha=alpha, s=s)\n",
    "    ax.set_xlim(-1, 1)\n",
    "    ax.set_ylim(-1, 1)\n",
    "    ax.set_zlim(-1, 1)\n",
    "    ax.view_init(elev=35, azim=35)\n",
    "    pylab.show()\n",
    "\n",
    "def normalize(x):\n",
    "    return x / nengo.utils.numpy.norm(x, axis=1, keepdims=True)\n",
    "\n",
    "n_samples = 500\n",
    "sample = Sobol().sample(n_samples, 3)\n",
    "plot_3d(sample, 'Sobol 3-Cube')\n",
    "plot_3d(normalize(sample - 0.5), 'Normalized Sobol 3-Cube')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And it actually gets worse as the dimensionality goes up, because the volume of a $D$-cube is concentrated outside the region of the $D$-ball (and so vectors tend to get normalized to the corners)!\n",
    "\n",
    "The following procedure (referred to as the \"inverse transform method\") is a general way to sample an arbitrary multivariate random variable ${\\bf X}$, using only the hyper-cube:\n",
    " 1. Pick a transformation $T$ that maps each ${\\bf x}$ to a vector ${\\bf y} = T({\\bf x})$ such that its components are mutually independent (might be identity).\n",
    " 2. Given the CDF of ${\\bf X}$ and the chosen transformation, determine the CDF $F$ of ${\\bf Y}$ by a substitution of variables (hard part).\n",
    " 3. Then ${\\bf x} = T^{-1}(F^{-1}({\\bf z}))$ samples ${\\bf X}$ uniformly, when ${\\bf z}$ is sampled from a hyper-cube with the same dimensionality as ${\\bf Y}$ (numerical part).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_3d(scatter_sphere.sample(n_samples, 3), 'Scattered 3-Sphere')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Spherical Coordinate Transformation\n",
    "\n",
    "There's a 10 page derivation in [1] for both the $D$-ball and $D$-sphere. The sphere case proceeds as follows:\n",
    " 1. The transformation $T$ is the spherical coordinate transformation, such that ${\\bf y}$ is a $(D-1)$-dimensional vector of angles.\n",
    " 2. The distribution of the $i^{th}$ element of ${\\bf y}$ is:\n",
    " \n",
    " $$F_i(y_i) = \\begin{cases} \n",
    "      \\frac{1}{2} B(sin(\\pi y_i)^2; \\frac{D-i}{2}, \\frac{1}{2}) & y_i < \\frac{1}{2} \\\\\n",
    "      1 - \\frac{1}{2} B(sin(\\pi y_i)^2; \\frac{D-i}{2}, \\frac{1}{2}) & otherwise \n",
    "   \\end{cases}, \\quad i=1 \\ldots D-1$$\n",
    "   \n",
    "   where $B$ is the regularized incomplete beta function.\n",
    " 3. This distribution can be inverted using scipy's `betaincinv` function. Also, $T$ is easy to invert. Then take a scattered sample from the $(D-1)$-cube and apply the inverse functions.\n",
    " \n",
    "To modify this to work for the $D$-ball, we instead sample from the $D$-cube and take the last component to be a normalization coefficient for the resulting vector (by raising it to the power of $\\frac{1}{D}$). See code for details.\n",
    " \n",
    "Note: The distribution given by $F$ is closely related to Jan's `SqrtBeta` distribution of subvector lengths. They are identical after substituting the variable $x = sin(\\pi y_i)$ with $n=1$ and $m=D-i$, scaling by $2$, and dealing with the reflection about $y_i = \\frac{1}{2}$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_3d(scatter_ball.sample(10000, 3), 'Scattered 3-Ball', 1)\n",
    "plot_3d(uniform_ball.sample(10000, 3), 'Uniform 3-Ball', 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Acknowledgements\n",
    "\n",
    "Many thanks to Michael Hopkins from the SpiNNaker group in Manchester for providing me with all of the relevant background and reading material.\n",
    "\n",
    "[1] K.-T. Fang and Y. Wang, _Number-Theoretic Methods in Statistics_. Chapman & Hall, 1994."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}