Foundations of Probability: Monte Carlo

I. Introduction:

Main Idea:

Monte Carlo sampling revolves around the idea of estimation through sampling. In statistics and machine learning, we are often trying to estimate or approximate distributions or expected values by sampling.

Largely, the expectations or distributions we are trying to estimate are intractable, and so approximation is usually the best path toward inference.

Monte Carlo was originally named after the district in Monaco by researchers working on the Manhattan Project. This method of sampling, although seen as crude and inefficient, had the benefit of being incredibly fast. Monte Carlo simulation was such an impactful method that it was noted to be essential to the development of the hydrogen bomb.

II. Theory Pt.1:

Initial Definition:

Although a clear, fine-grained definition of Monte Carlo is often debated, one of the simplest and most widely accepted definitions is that the expectation of a function f can be approximated by the sample mean of samples drawn from a distribution \( p(x) \):

\begin{align} \mathbb{E}_{X \sim p(x)} [f(x)] &\approx \frac{1}{N} \sum_{s=1}^{N} f(x_s) \\ x_s &\sim p(x) \end{align}
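
Here is a minimal sketch of this estimator in NumPy. The choices of \( p(x) \) as a standard normal and \( f(x) = x^2 \) are purely illustrative (they are not from the text above); with them, the true expectation is \( \mathbb{E}[f(x)] = \text{Var}(x) = 1 \).

```python
import numpy as np

rng = np.random.default_rng(0)

def monte_carlo_expectation(f, sampler, n_samples):
    """Approximate E_{x ~ p(x)}[f(x)] by the sample mean of f over draws from p."""
    samples = sampler(n_samples)   # x_s ~ p(x)
    return np.mean(f(samples))     # (1/N) * sum_s f(x_s)

# Illustrative choice (assumption): p(x) = N(0, 1), f(x) = x^2, so E[f(x)] = 1.
estimate = monte_carlo_expectation(
    f=lambda x: x ** 2,
    sampler=lambda n: rng.standard_normal(n),
    n_samples=100_000,
)
print(estimate)  # should be close to 1.0
```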

A Few Remarks:

  • Monte Carlo is an unbiased estimator, meaning that the expected value of the random variable \( \hat{\mu}_n \), which is drawn from the distribution of sample means, equals the expected value of the function: $$ \mathbb{E}[\hat{\mu}_n] = \mathbb{E}[f(x)] $$
  • Monte Carlo is a consistent estimator that follows the Weak Law of Large Numbers, i.e., given that the variance of f(x) is finite, \( \text{Var}(f(x)) < \infty \), we have \begin{align} \hat{\mu}_n \xrightarrow[]{\text{p}} \mathbb{E}[f(x)] \text{ as } n \rightarrow \infty \end{align} In other words, for any \( \epsilon \), the probability that the distance between the sample mean \( \hat{\mu}_n \) and the expected value \( \mathbb{E}[f(x)] \) is less than \( \epsilon \) goes to 1 as the number of samples n increases, \begin{align} \forall \epsilon > 0, \; p(\vert \hat{\mu}_n - \mathbb{E}[f(x)] \vert < \epsilon) \rightarrow 1 \text{ as } n \rightarrow \infty \end{align}
  • Once again following the same logic as above, as n goes to infinity, the error and variance should shrink, so \begin{equation} \sigma^2(\hat{\mu}_n) = \frac{1}{n}\sigma^2(f(x)) \rightarrow 0 \end{equation} Another way to interpret this is that the mean squared error, given that our bias is zero, is just the variance of \( \hat{\mu}_n \), and this value goes to zero as more samples are used. \begin{align} \sigma^2(\hat{\mu}_n) &= MSE(\hat{\mu}_n) \\ &= \text{bias}^2 + \text{var} \\ &= \frac{1}{n}\sigma^2(f(x)) \rightarrow 0 \end{align} Furthermore, since \( \frac{1}{n} \) is the constant scaling \(\sigma^2(f(x))\) as it goes to zero, it's often said that the variance decreases at a rate of \( \frac{1}{n} \), or equivalently, that the standard deviation decreases at a rate of \( \frac{1}{\sqrt{n}} \) (a quick numerical check of this rate appears after this list).

  • It's assumed that we are sampling faithfully from the distribution \( p(x) \); samples that over- or under-represent regions of the distribution will introduce approximation error.
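
Below is a rough empirical check of the \( \frac{1}{\sqrt{n}} \) rate mentioned above, assuming NumPy. The setup \( p(x) = \mathcal{N}(0, 1) \), \( f(x) = x^2 \) (so \( \mathbb{E}[f(x)] = 1 \)) is an illustrative choice, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

n_repeats = 100  # number of independent Monte Carlo estimates per sample size
for n in (100, 10_000, 100_000):
    # Build n_repeats independent estimates, each averaging n samples of f(x) = x^2.
    samples = rng.standard_normal((n_repeats, n))
    mu_hat = (samples ** 2).mean(axis=1)
    # The spread of the estimates shrinks like 1/sqrt(n),
    # so std(mu_hat) * sqrt(n) should stay roughly constant across sample sizes.
    print(f"n={n:>7}  std={mu_hat.std():.5f}  std*sqrt(n)={mu_hat.std() * np.sqrt(n):.3f}")
```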

Let's say we have a circle inscribed in a square. Here the area of the circle is \( A(\text{circle})=\pi r^2 \) and the area of the square is \( A(\text{square})= 4r^2 \). The ratio of the area of the circle to the area of the square is then \( \frac{A(\text{circle})}{A(\text{square})}= \frac{\pi r^2}{4 r^2} = \frac{\pi}{4} \).

Let's approximate this ratio by asking: if we randomly threw points into the square, what fraction of them would land inside the circle? Writing \( \mathbb{1}(\cdot) \) for the indicator function and drawing points \( (x_i, y_i) \) uniformly over the square, \begin{align} \frac{A(\text{circle})}{A(\text{square})} &= \mathbb{E} \big[ \mathbb{1}(x^2 + y^2 \leq r^2) \big] \\ &\approx \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(x_i^2 + y_i^2 \leq r^2) \\ &= \frac{\text{num_points_inside_circle}}{\text{tot_num_points}} \\ &= \text{sample_ratio} \end{align} Typically, this example is used to approximate \( \pi \): as the sample ratio improves, it approaches \( \frac{\pi}{4} \), so we can approximate \( \pi \approx 4 \cdot \text{sample_ratio} \).
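
As a concrete sketch of this dart-throwing experiment (assuming NumPy): the function name estimate_pi and the choice r = 1 are just illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_pi(n_points, r=1.0):
    # Throw points uniformly into the square [-r, r] x [-r, r].
    x = rng.uniform(-r, r, size=n_points)
    y = rng.uniform(-r, r, size=n_points)
    inside = (x**2 + y**2) <= r**2   # indicator: did the point land inside the circle?
    sample_ratio = inside.mean()     # approximates A(circle) / A(square) = pi / 4
    return 4.0 * sample_ratio

print(estimate_pi(1_000_000))  # should be close to 3.14159...
```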

Note:

Another interesting fact about the variance converging at a rate of \( \frac{1}{n} \) is that this rate is independent of the number of dimensions of x. This is because f(x) is a scalar (one-dimensional) quantity no matter the dimensionality of x, so we are always just averaging scalars.
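
A small sketch of this point, assuming NumPy: whatever the dimension of x, each draw contributes one scalar f(x), so the averaging works the same way. The choice \( x \sim \mathcal{N}(0, I_d) \) with f(x) being the mean of the squared entries of x (so \( \mathbb{E}[f(x)] = 1 \)) is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 20_000
for d in (1, 10, 100):
    x = rng.standard_normal((n, d))   # n draws of a d-dimensional x
    f_x = (x ** 2).mean(axis=1)       # one scalar f(x) per draw
    print(d, f_x.mean())              # each Monte Carlo estimate is close to 1, regardless of d
```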