Fundamentals of Probability: Introduction to Probability

I. Introduction:

Main Idea:

Probability can be described, at a very high level, as the ratio of the number of ways an element can occur to the total number of elements. We will expand on a mathematically consistent definition in another article on probability spaces, but for now, we can simply think of probabilities as these ratios. \begin{equation} p(\text{element}) = \frac{\text{element}}{\text{all elements}} \end{equation}

In most practical settings, the expectations or distributions we are trying to estimate are intractable, so approximation is usually the best route to inference.

The usual example for probabilities is a die: the probability of each side is 1/6, because there are 6 total elements, and each element appears once. What if two faces show the same value? Then that value has a probability of 2/6.
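To make this concrete, here is a minimal Python sketch (the variable names are ours, chosen for illustration) that computes these probabilities as ratios of counts for the modified die:

```python
from collections import Counter

# A six-sided die where two faces show the same value (two 1s).
faces = [1, 1, 2, 3, 4, 5]

counts = Counter(faces)
total = len(faces)

# p(value) = (number of faces showing value) / (total number of faces)
probabilities = {value: count / total for value, count in counts.items()}

print(probabilities[1])  # 2/6 = 0.333..., since the value 1 appears on two faces
print(probabilities[2])  # 1/6 = 0.166...
```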


Note:

A better way to describe an element is as an *event*. We will see a more rigorous mathematical framework for this later, but for now, we can use the two terms interchangeably.

II. Theory Pt.1:

Initial Definition:

Since we can think of probabilities as fractions of elements with respect to the total number of elements, the sum of all of these probabilities is 1. You can think of it as taking slices of a pie: no matter how many pieces you cut from it, the sum of the pieces divided by the whole pie will equal 1.

\begin{align} \sum_{i=1}^{k} P(\text{elem}_i) = \sum_{i=1}^{k} \frac{\text{elem}_i}{\text{all elements}} = 1 \end{align}

A Few Remarks:

Mathematically, the sum of probabilities can be shown to equal 1 in two equivalent ways (a quick numerical check follows the list):

  • The sum of individual ratios, factoring out the common denominator, \begin{align} \sum_{i=1}^{k} P(\text{elem}_i) &= \sum_{i=1}^{k} \frac{\text{elem}_i}{\text{all elements}} \\ &= \frac{1}{\text{all elements}} \sum_{i=1}^{k} \text{elem}_i \\ &= \frac{\text{all elements}}{\text{all elements}} \\ &= 1 \end{align}
  • The sum of all elements over the total number of elements, which equals one, \begin{align} \sum_{i=1}^{k} P(\text{elem}_i) &= \sum_{i=1}^{k} \frac{\text{elem}_i}{\text{all elements}} \\ &= \frac{\text{elem}_1}{\text{all elements}} + \dots + \frac{\text{elem}_k}{\text{all elements}} \\ &= \frac{\text{elem}_1 + \dots + \text{elem}_k}{\text{all elements}} \\ &= \frac{\text{all elements}}{\text{all elements}} \\ &= 1 \end{align}
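As promised, a quick numerical check of both derivations, using exact fractions to avoid floating-point noise (the counts match the modified die from the introduction):

```python
from fractions import Fraction

# Counts of each face value on the modified die (two faces show the value 1).
elem_counts = [2, 1, 1, 1, 1]    # counts for the values 1 through 5
all_elements = sum(elem_counts)  # 6 faces in total

# First remark: factor the common denominator out of the sum.
total_1 = Fraction(sum(elem_counts), all_elements)

# Second remark: add the individual fractions one by one.
total_2 = sum(Fraction(count, all_elements) for count in elem_counts)

assert total_1 == total_2 == 1
```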

Note:

Another interesting fact, which will matter once we discuss estimating expectations by sampling (Monte Carlo), is that the variance of such an estimate shrinks at a rate of \( \frac{1}{n} \), and this rate is independent of the number of dimensions of x. This is because the sampled quantity f(x) is a one-dimensional value regardless of the dimension of x.

III. Theory Pt.2:

Distributions:

The sum of all probabilities must equal 1, and each individual probability must be nonnegative, but beyond that there is no restriction on how the probabilities are spread across the elements. You can have probabilities that are all equal, or high probabilities for some elements and low ones for others. We call the assignment of probabilities to a set of elements simply a *distribution*. As long as the probabilities are nonnegative and sum to 1, the distribution is valid.

\begin{align} \sum_{i=1}^{k} P(\text{elem}_i) = 1, \qquad P(\text{elem}_i) \geq 0 \end{align}
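A small validity check can be written directly from these two conditions; `is_valid_distribution` is a helper of our own naming, and the tolerance accounts for floating-point rounding:

```python
def is_valid_distribution(probs, tolerance=1e-9):
    """A distribution is valid when every probability is nonnegative
    and the probabilities sum to 1 (up to floating-point tolerance)."""
    return all(p >= 0 for p in probs) and abs(sum(probs) - 1.0) < tolerance

print(is_valid_distribution([1/6] * 6))         # True: uniform die
print(is_valid_distribution([0.7, 0.2, 0.1]))   # True: non-uniform but valid
print(is_valid_distribution([0.6, 0.5, -0.1]))  # False: negative probability
```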

Random Variables:

We can think of a random variable as a variable whose value is the outcome of a random process: it takes on one of the elements, and the probability of landing on each element depends on which probability distribution we choose. Mathematically, a random variable is written as

\begin{align} X \sim \text{Distribution} \end{align}

Notation: Upper case letters denote random variables, and lower case letters denote the values they can take. So \( p(X=x_i) \) is the probability that \( X \) takes the value \( x_i \). It's often the case that probabilities are simply written as \( p(x_i) \).

Sampling:

What can we do with this random variable? We can draw a value from it, picked in accordance with the probability distribution that describes it. Each draw may NOT give the same value; the process is not deterministic. Drawing a random value from a random variable described by a distribution is known as *sampling*. Sampling is often introduced by imagining a bag of blue and red marbles: grabbing a marble at random from the bag is one sample.
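The marble example translates directly to code; a minimal sketch using Python's standard library:

```python
import random

# A bag with 3 blue marbles and 2 red marbles: p(blue) = 3/5, p(red) = 2/5.
bag = ["blue", "blue", "blue", "red", "red"]

# Each draw is one sample; the result is random, not deterministic.
draws = [random.choice(bag) for _ in range(10)]
print(draws)

# Over many samples, observed frequencies approach the probabilities.
many = [random.choice(bag) for _ in range(100_000)]
print(many.count("blue") / len(many))  # close to 0.6
```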

Expectation:

Since we now have a nondeterministic system that maps elements to their respective probabilities, we can ask what the average of those elements is. More precisely: what is the average value of those elements with respect to their probabilities? We can start from the definition of a weighted average: the sum of weighted values, normalized by the sum of the weights.

\begin{align} \overline{X} = \frac{\sum_i w_i \, x_i}{\sum_i w_i} \end{align}

This equation might seem confusing at first, but another way to read it is as each value being multiplied by a normalized weight, where each normalized weight is the ratio of that weight to the total sum of the weights.

\begin{align} \overline{X} &= \frac{\sum_i w_i \, x_i}{\sum_j w_j} \\ &= \sum_i x_i \, \bigg( \frac{w_i}{\sum_j w_j} \bigg) \end{align}

This may sound familiar. Recall that probabilities can be thought of as the ratio of an element to the total elements. If we take the weights to be the element counts, each normalized weight is exactly the probability of the corresponding element. Therefore, we can rewrite the weighted average as the sum of values weighted by their probabilities.

\begin{align} \overline{X} &= \sum_i x_i \, \bigg( \frac{w_i}{\sum_j w_j} \bigg) \\ &= \sum_i x_i \, \bigg( \frac{\text{elem}_i}{\sum_j \text{elem}_j} \bigg) \\ &= \sum_i x_i \, p(x_i) \end{align}

Being more explicit, the weighted average of the values of X with respect to their probabilities is given the name Expected Value, \( \mathbb{E}[X] \).

\begin{align} \mathbb{E}[X] &= \sum_{i=1}^n x_{i} \, p(x_i) \end{align}
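To see that the two formulations agree, here is a sketch computing the expected value of the modified die both as a raw weighted average (with counts as weights) and as a sum of values times probabilities:

```python
# Values of X and their probabilities for the modified die (two faces show 1).
values = [1, 2, 3, 4, 5]
probs = [2/6, 1/6, 1/6, 1/6, 1/6]

# Expected value: the sum of values weighted by their probabilities.
expected = sum(x * p for x, p in zip(values, probs))

# The same number via the raw weighted average, with counts as the weights.
weights = [2, 1, 1, 1, 1]
weighted_avg = sum(w * x for w, x in zip(weights, values)) / sum(weights)

print(expected, weighted_avg)  # both 2.666...
```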

Expanding on the Expected Value a little more, we can make explicit that it is taken with respect to a particular probability distribution p by writing the distribution as a subscript.

\begin{align} \mathbb{E}_{X \sim p(x)}[X] &= \sum_{i=1}^n x_{i} \, p(x_i) \end{align}

Furthermore, we can take the expectation of a function of a random variable, \( g(X) \), with respect to the distribution of X, \( X \sim p(x) \),

\begin{align} \mathbb{E}_{X \sim p(x)}[g(X)] &= \sum_{i=1}^n g(x_{i}) \, p(x_i) \end{align}
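A short sketch of this more general form; `expectation` is our own helper name:

```python
def expectation(g, values, probs):
    """E[g(X)] = sum_i g(x_i) p(x_i) for a discrete distribution."""
    return sum(g(x) * p for x, p in zip(values, probs))

values = [1, 2, 3, 4, 5]
probs = [2/6, 1/6, 1/6, 1/6, 1/6]

print(expectation(lambda x: x, values, probs))     # E[X]   = 2.666...
print(expectation(lambda x: x**2, values, probs))  # E[X^2] = 9.333...
```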

Moments:

A quick introduction to moments: the \( k \)-th moment of \( X \) is defined as the expectation of \( X^k \). The first moment (\( k = 1 \)) is just the expected value.

\begin{align} \mathbb{E}_{X \sim p(x)}[X^k] &= \sum_{i=1}^n x_{i}^k \, p(x_i) \end{align}
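Moments fall out of the same pattern by choosing \( g(x) = x^k \); a minimal sketch with a hypothetical `moment` helper:

```python
def moment(k, values, probs):
    """k-th moment: E[X^k] = sum_i x_i^k p(x_i)."""
    return sum(x**k * p for x, p in zip(values, probs))

values = [1, 2, 3, 4, 5]
probs = [2/6, 1/6, 1/6, 1/6, 1/6]

print(moment(1, values, probs))  # first moment: the mean, 2.666...
print(moment(2, values, probs))  # second moment, 9.333...
```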

Variance:

If we want to know how far values typically fall from the expected value, we can compute the average squared distance from it, known as the variance,

\begin{align} \text{Var}[X] &= \mathbb{E}[(X-\mathbb{E}[X])^2] \\ &= \mathbb{E}[X^2 - 2X\mathbb{E}[X] + \mathbb{E}[X]^2] \\ &= \mathbb{E}[X^2] - 2\mathbb{E}[X]\mathbb{E}[X] + \mathbb{E}[X]^2 \\ &= \mathbb{E}[X^2] - 2\mathbb{E}[X]^2 + \mathbb{E}[X]^2 \\ &= \mathbb{E}[X^2] - \mathbb{E}[X]^2 \end{align}
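Both forms of the variance can be checked against each other numerically; a sketch using the same die distribution:

```python
values = [1, 2, 3, 4, 5]
probs = [2/6, 1/6, 1/6, 1/6, 1/6]

mean = sum(x * p for x, p in zip(values, probs))

# Definition: the average squared distance from the mean.
var_definition = sum((x - mean) ** 2 * p for x, p in zip(values, probs))

# Shortcut from the derivation: E[X^2] - E[X]^2.
second_moment = sum(x ** 2 * p for x, p in zip(values, probs))
var_shortcut = second_moment - mean ** 2

print(var_definition, var_shortcut)  # both 2.222...
```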

Note:

Weighted averages always satisfy \( \min(X) \leq \overline{X} \leq \max(X) \), since the normalized weights are nonnegative and sum to 1.