Probability Theory: Introduction

Chelsea Zou
7 min read · Oct 8, 2023


Probability, IMO, is one of the weirdest branches of mathematics out there. While super interesting, topics like randomness, chance, and uncertainty have always been very non-intuitive for me. So I’ve decided to start a series on (intro) probability theory — to try to reason my way through it by writing. This first article will be brief, starting with the three fundamental axioms of probability and ending with some philosophical garnish (of course) on the two major perspectives in probability. My next article in this series will be more technical, on random variables and probability distributions. I actually started with that one, but got tired of trying to write equations on Medium, so here I am. I’ll get back to it soon. Ideally, I’d also like to talk about conditional probabilities and Bayes’ theorem at some point (I’m kind of obligated to — Reverend Thomas Bayes is literally on my lock screen). But we’ll see how far I get…

As with all of math, there are some profound implications tied to probability — determinism vs. indeterminism, causality, predictability, randomness, etc. — all deeply interwoven into the underpinnings of reality. In our own lives, we encounter a lot of uncertainty. The weather, poker, the stock market, sports, elections, births, deaths — to name a few. This is what makes probability theory so important: it is the mathematical framework that allows us to quantify and analyze uncertainty. To begin, the basis of probability theory is three foundational starting points called axioms. These axioms are essentially the assumptions and postulates upon which the entire theory is constructed. To understand some of the later concepts, we will begin with a brief explanation of each.

The Three Axioms of Probability

  1. Non-negativity: P(A) >= 0
  2. Normality: P(S) = 1
  3. Additivity: P(A U B) = P(A) + P(B), for mutually exclusive events A and B

(1) Non-negativity states that the probability of any event A occurring must not be negative. A probability of 0 indicates that the event is impossible, and a probability of 1 indicates that the event is completely certain to happen. So a negative probability would indicate that something is less likely than impossible, which is… impossible. Similarly, a probability greater than 1 would mean something is more than certain to happen, which also does not make sense (this follows from the second and third axioms together). In the real world, the probability of some event occurring usually lies somewhere in between 0 and 1.

(2) Normality states that the probability of the whole sample space S is 1. Imagine a sample space as a universe containing all the outcomes one wants to consider in a given context. When we sum the probabilities of all possible outcomes in that universe, the total must equal 1. What this means is that some outcome within the universe is certain to occur. For example, if you’re rolling a fair six-sided die, the whole sample space S would be {1, 2, 3, 4, 5, 6}, each outcome with a probability of P = ⅙, as those are all the possible outcomes. Since P(S) = 1, there is a 100% chance something must happen — you must roll a 1, 2, 3, 4, 5, or 6. This brings us to the final axiom.
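
Before we get there, here is a quick sanity check of the first two axioms in code, using the die example (just a toy sketch, nothing fancy):

```python
from fractions import Fraction

# Sample space of a fair six-sided die: each outcome has probability 1/6.
die = {outcome: Fraction(1, 6) for outcome in range(1, 7)}

# Axiom 1 (non-negativity): every probability is >= 0.
assert all(p >= 0 for p in die.values())

# Axiom 2 (normality): the probabilities over the whole sample space sum to 1.
assert sum(die.values()) == 1
print(sum(die.values()))  # 1
```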

(3) Additivity states that the probability of mutually exclusive events occurring is equal to the sum of their individual probabilities. Mutually exclusive events are events that cannot occur simultaneously. In other words, if one of these events happens, it excludes the possibility of the other happening at the same time. What additivity implies is that the probability of the union (which is equivalent to “or”) of mutually exclusive events is equal to the sum of the probabilities of the individual events. For instance, in a fair coin toss, getting heads (P = 0.5) or tails (P = 0.5) are mutually exclusive events (getting heads means that you cannot get tails, and vice versa). Hence, the probability of getting heads OR tails must be P = 0.5 + 0.5 = 1, which makes sense because there is a 100% chance that you will get EITHER heads OR tails in a coin toss. This might not be the best example, though, because it overlaps with Normality, since heads and tails make up the whole sample space. To be clearer, here is another example: if the probabilities that a baby cries, laughs, or yawns are P = 0.2, 0.15, and 0.1 respectively, then the probability that one of these will happen is P = 0.2 + 0.15 + 0.1 = 0.45. This is under the assumption that a baby cannot do two or more of these at the same time (which is false in the real world, because there can be laughter/cry combinations).
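
Here is the baby example as a tiny sketch, under that same (admittedly false) assumption that the three events are mutually exclusive:

```python
from fractions import Fraction

# Probabilities from the example above, assumed mutually exclusive.
p_cry, p_laugh, p_yawn = Fraction(1, 5), Fraction(3, 20), Fraction(1, 10)

# Axiom 3 (additivity): P(cry or laugh or yawn) = P(cry) + P(laugh) + P(yawn).
p_union = p_cry + p_laugh + p_yawn
print(float(p_union))  # 0.45
```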

Are you a Bayesian or a Frequentist?

This past summer, a friend and I went hiking together. To entertain ourselves on the long trail, we went back and forth giving each other math riddles and brain teasers (yes, we’re nerds, and so are you if you’re reading this article). I had a fresh quant riddle in my mind that someone had told me a few weeks prior, which went like this: “I have a coin and you have a coin (does not need to be fair). My coin is weighted, with unknown probabilities of landing heads and tails. What weight does your coin have to be such that if we both toss our coins, there is an equal chance they will both land on the same side?” What then led to a two-week debate was how he initiated his approach to the problem. Perhaps it was because he was a physics major, but he jumped straight into setting the grounds for a practical Bayesian inference. “Ok, well since I don’t know the weights of your coin, the first step is to assume your coin is 50/50, then I — ”. Woah woah woah. I had to stop him right there. “Wait, why are you assuming anything? Your own assumptions and beliefs have nothing to do with the real probabilities out there,” I told him. And so, we spiraled down a rabbit hole. In his perspective, because of the uncertainty in the weights of my coin, his plan was to first base the question on his starting assumptions, otherwise known as “priors” in a Bayesian scenario. However, my argument was that it didn’t matter what his starting assumptions were, because his own beliefs could not have influenced the true probabilities of the event. In his defense, his plan (as a good Bayesian) was to anchor the question with starting priors, which was a pragmatic way to deal with the unknowns. In my defense, the true probabilities were independent of his beliefs, and thus making such assumptions was unnecessary. So, who was right?
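
(A quick aside before the verdict: the riddle itself has a clean answer that doesn’t depend on anyone’s beliefs. The weight to pick is a fair coin: with q = 1/2 for the second coin, P(both land the same side) = p*q + (1 - p)*(1 - q) = 1/2 no matter what the unknown p is. Here’s a small simulation sketch, if you want to check it yourself.)

```python
import random

def match_probability(p_mine, q_yours, trials=200_000):
    """Estimate P(both coins land on the same side) by simulation."""
    matches = 0
    for _ in range(trials):
        mine = random.random() < p_mine      # True means heads
        yours = random.random() < q_yours
        matches += (mine == yours)
    return matches / trials

# With q = 1/2, P(match) = p*q + (1 - p)*(1 - q) = 1/2 no matter what p is.
for p in (0.1, 0.37, 0.9):
    print(p, round(match_probability(p, 0.5), 3))  # all hover around 0.5
```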

As much as I still hold true to my beliefs, I figured out later on that this was inherently a philosophical debate, which was why we made no progress. He had a Bayesian approach, whereas I had a Frequentist viewpoint (which was fairly surprising, because I had always considered myself an overall Bayesian). While these two perspectives are not mutually exclusive, there tends to be a certain philosophical dichotomy between the two. Put simply, the Bayesian perspective is more subjective, whereas the Frequentist perspective is more objective. Bayesians update their beliefs in light of new information. Frequentists rely on fixed procedures grounded in repeated past observations. Bayesians consider probabilities to be a measure of degrees of knowledge, whereas Frequentists see probabilities as a measure of observed frequencies after a large number of trials. Though of course, this is a difficult topic and I am vastly oversimplifying. There are entire texts written on the philosophical interpretation of probability, and my knowledge is limited. But this is how I see it.

Bayesian Perspective: Bayesians view probability as a measure of subjective belief or uncertainty. They believe that probability can be assigned to any event, including uncertain quantities and hypotheses, based on prior beliefs. These priors essentially represent what is known or assumed before observing any new information. Then, Bayes’ Theorem is used to update those beliefs in a principled way as new data becomes available. Incorporating these elements, they calculate posterior probabilities, which represent the updated beliefs after considering the data. For instance, suppose I am an intern, and I want to calculate the probability that I will get a return offer. I start with some prior knowledge about the usual return offer rate of the company — say, 1 in 50 interns get one. Then, let’s say I receive some new information — my manager sends me an email telling me just how amazing I am. Incorporating the prior with this new information, I would update my beliefs and calculate a better chance of getting the return offer. I’ll save the details of Bayes’ Theorem for another article.
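
To preview what that update looks like, here is a rough sketch with made-up numbers: the 1-in-50 prior from above, plus a purely hypothetical guess at how likely such a glowing email would be in each case.

```python
# A rough Bayesian update sketch; the two likelihoods below are made-up assumptions.
prior = 1 / 50                 # prior belief: 1 in 50 interns get a return offer
p_email_if_offer = 0.8         # assumed: chance of a glowing email if I'm getting the offer
p_email_if_no_offer = 0.1      # assumed: chance of the same email if I'm not

# Bayes' theorem: P(offer | email) = P(email | offer) * P(offer) / P(email)
p_email = p_email_if_offer * prior + p_email_if_no_offer * (1 - prior)
posterior = p_email_if_offer * prior / p_email

print(round(posterior, 3))  # about 0.14, up from the 0.02 prior
```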

Frequentist Perspective: On the other hand, Frequentists view probability as a long-run relative frequency, a property of the data-generating process. Probability is seen as an objective concept related to the behavior of random events in repeated experiments. Furthermore, Frequentist methods do not incorporate prior beliefs or subjective information into their analysis. They focus solely on the properties of the observed data and the sampling process, typically seeking point estimates that represent the “best” estimate of a parameter based on the observed data. In the same return offer example above, the Frequentist perspective would just take into account all the past data on interns and their return offers, without considering the manager’s email or any prior beliefs (this is kind of the gist of it, but not entirely accurate). They would analyze historical data, calculate the observed rate of return offers among interns, and use statistical techniques to estimate the probability of receiving a return offer based solely on the empirical data.
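
For contrast, the Frequentist version of the same question boils down to a point estimate from historical data. Here is a sketch with made-up counts, plus a rough confidence interval around the observed rate:

```python
import math

# Hypothetical historical data: 6 return offers among 300 past interns.
offers, interns = 6, 300
p_hat = offers / interns  # frequentist point estimate of the return-offer rate

# A rough 95% confidence interval using the normal approximation.
se = math.sqrt(p_hat * (1 - p_hat) / interns)
low, high = p_hat - 1.96 * se, p_hat + 1.96 * se

print(round(p_hat, 3), (round(low, 3), round(high, 3)))  # 0.02 and its interval
```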

