Probability Theory: Random Variables & Distributions
A continuation of my probability theory series. This second article covers the most common discrete and continuous random variables and their associated probability distributions. Each one is accompanied by a short description, a simplified example, and its solution. The goal is to show how random variables can appear in real-life applications (of course, with several simplifying assumptions in mind).
What is a Random Variable?
Formally, a random variable X is a function that maps a possible outcome from the sample space of a random experiment to a real number. This number represents an observable quantity, measurement, or characteristic of interest. Relatedly, the probability distribution of a random variable provides a systematic way to specify the likelihood of these possible outcomes. For instance, if we are interested in the number of heads observed in 5 coin flips, then the random variable X can take on the possible values {0,1,2,3,4,5}.
There are two types of random variables. In the coin flip example above, X is a discrete random variable:
a. Discrete Random Variables: A discrete random variable takes on a countable set of distinct values (e.g., number of children, dice roll outcomes, etc.) and its probability distribution is described by a probability mass function (PMF). The PMF assigns a probability to each of the possible values, and is summed over when we want to find the probability over a range of values.
b. Continuous Random Variables: If the random variable is continuous, it has an uncountably infinite set of values (e.g., time, temperature, etc.). Rather than a PMF, its probability distribution is described by a probability density function (PDF). The PDF is integrated over a range of values to tell us how likely the random variable is to fall within that range.
Discrete Random Variables
Think of the PMF as a function that produces different probabilities assigned to each possible outcome. The sum of all these probabilities must equal 1 (by the second axiom of probability, normalization, as described in my previous article). To find the probability of an event occurring within a range of values, we sum the individual probabilities associated with each specific value within that range. Why? Consider a simple example regarding a six-sided die, with possible outcomes of {1,2,3,4,5,6}, where we are interested in the probability of rolling anything below a three, P(X < 3). The only possible values for this condition are 1 and 2. Therefore, P(X < 3) must include both the possibility of rolling a 1 and rolling a 2, which is the sum from the lowest value of X all the way to 2, P(X = 1) + P(X = 2).
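To make this concrete, here is a quick sketch of the die example in Python with scipy.stats (my choice of tool here, not something from the original setup); we simply sum the PMF over the qualifying values:

```python
from scipy import stats

# A fair six-sided die: each face 1..6 has probability 1/6
die = stats.randint(1, 7)  # randint's upper bound is exclusive

# P(X < 3) = P(X = 1) + P(X = 2): sum the PMF over the values below 3
p_below_three = die.pmf(1) + die.pmf(2)
print(p_below_three)  # 0.333..., i.e., 2/6
```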
X ~ Bern(p)
The simplest case is a Bernoulli distribution. This describes a single trial with only 2 outcomes, either a “success” or a “failure”. Notice the quotation marks: we can define a success or failure to be anything we want (e.g., heads, tails, double yolks in 1000 eggs, etc.). Since the outcome is binary, the probability of a success is denoted p, and the probability of a failure is the remainder, 1 - p, typically denoted q.
Example: Bernie, a seasoned venture capitalist, is being pitched to by a biotech startup that makes carbonated beverages by extracting CO2 from the air. Bernie knows that by investing in the company, there can only be 2 outcomes: It will either be a success (10X his investment), or a failure (he will lose all his initial investment). Based on Bernie’s experience, he estimates there is a 2% chance this startup succeeds, and a 98% chance it will fail. This represents a single Bernoulli trial, with p = 0.02, the probability of success, and q = 0.98, the probability of failure.
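Here is a minimal sketch of Bernie’s single trial in scipy.stats (the 1,000 simulated investments at the end are my own hypothetical addition, just to show the long-run frequency):

```python
from scipy import stats

p = 0.02                       # Bernie's estimated probability of success
startup = stats.bernoulli(p)

print(startup.pmf(1))          # P(success) = p = 0.02
print(startup.pmf(0))          # P(failure) = q = 0.98

# Simulate 1,000 hypothetical investments; roughly 2% should succeed
outcomes = startup.rvs(size=1000, random_state=0)
print(outcomes.mean())
```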
X ~ Bin(n, p)
A binomial distribution is an extension of the Bernoulli distribution. However, instead of describing a single trial, there are now N independent trials. The binomial distribution describes the probability of observing X successes across those N trials, where each trial again has two outcomes, a success or a failure.
Example: It’s demo day, and there are 20 different startups lined up before Bernie. His renowned strategy is to invest in all of them with even amounts of money allocated for each. Assuming each successful company reaps equal profits, he does some calculations and realizes that his profit point requires 3 companies (out of 20) to succeed. (To simplify the question, Bernie will be happy with exactly 3 companies winning, no more, no less; I will discuss how to find the probability of 3 or more companies winning in a later example.) Again, suppose all startups have a 2% chance of succeeding. What is the probability that 3 out of 20 companies will actually succeed?
Solution: This is a binomial distribution with N = 20 trials (number of companies). Bernie needs X = 3 companies to succeed. First, we need to account for all the combinations in which any 3 companies will win out of 20 companies. This is the binomial coefficient (N choose X), which answers the following question: in how many ways can you select X = 3 companies from N = 20? Next, we multiply this coefficient by the probability that 3 particular companies succeed, which is (0.02)³. Finally, we also need the probability that the remaining 17 companies fail, which is (0.98)¹⁷. Putting this all together in the binomial PMF, the answer is P(X = 3) = (20C3) × (0.02)³ × (0.98)¹⁷ ≈ 0.0065. Bernie remains hopeful.
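As a quick sanity check, here is the same calculation sketched in Python, once by hand and once with the built-in binomial PMF:

```python
import math
from scipy import stats

n, p = 20, 0.02
coeff = math.comb(n, 3)             # ways to pick 3 winners out of 20
manual = coeff * 0.02**3 * 0.98**17
print(manual)                       # ≈ 0.0065
print(stats.binom.pmf(3, n, p))     # same value from the binomial PMF
```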
X ~ Pois(λ)
A Poisson distribution is used to describe unlikely events occurring within a large number of independent trials. Specifically, it tells you the probability of observing some event, given that you know the average rate at which the event occurs, called the Poisson rate (λ). For instance, it can be useful in assessing situations like accidents or child births. Interestingly, when the number of trials N becomes large and the probability of success p becomes small, we can also use a Poisson distribution to approximate a binomial distribution, in which λ = Np.
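You can see the approximation at work in this small sketch (N = 1000 and p = 0.002 are arbitrary values I picked to illustrate the large-N, small-p regime):

```python
from scipy import stats

# Large N, small p: Binomial(N, p) is well approximated by Poisson(N * p)
N, p = 1000, 0.002
lam = N * p  # 2.0

# The two PMFs agree closely for each count k
for k in range(5):
    print(k, stats.binom.pmf(k, N, p), stats.poisson.pmf(k, lam))
```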
Example: Penelope is a missionary from the Church. However, she wishes to leave the Church and become a software engineer at Google. She already has an offer from them, which she must accept or reject a week from today. She tells the pastor her plans, and he allows her to leave the Church on one condition: she must convert 10 more people into devout Christians before she can resign. Given her average rate of conversion is 3 people per week, what is the probability that she will be able to fulfill the pastor’s request in exactly a week?
Solution: In this case, λ = 3 people per week. Here, we are interested in the probability that Penelope successfully converts 10 people in a week, so we want to find P(X = 10) = e⁻³ × 3¹⁰ / 10! ≈ 0.0008, roughly a 0.08% chance.
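The same number falls out of a couple of lines of Python, computed both from the formula and from the built-in PMF:

```python
from math import exp, factorial
from scipy import stats

lam = 3  # average conversions per week

# P(X = 10) = e^-3 * 3^10 / 10!
print(exp(-lam) * lam**10 / factorial(10))  # ≈ 0.0008
print(stats.poisson.pmf(10, lam))           # same value
```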
X ~ Geom(p)
A geometric distribution models the number of trials N needed to obtain the first success; for example, the number of darts thrown before one finally hits the bull’s eye. One important thing to note is that in a geometric distribution, previous failures do not modify the probability of the next success. This is called the Memoryless Property, which will show up again in a continuous random variable we will examine later. The Memoryless Property is defined as P(X > x+a | X > a) = P(X > x), where a is the number of failures so far and x + a is the designated number of trials. For instance, the probability that the first tails appears after the 5th flip of a coin, given that the first 3 flips were all heads, is the same as the probability that the first tails appears after the 2nd flip, since P(X > 5 | X > 3) = P(X > 2).
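The property is easy to verify numerically; here is a sketch using the survival function P(X > k) of a fair coin’s geometric distribution:

```python
from scipy import stats

p = 0.5                  # fair coin: P(tails) on each flip
X = stats.geom(p)        # X = flip on which the first tails appears

# Memorylessness: P(X > 5 | X > 3) should equal P(X > 2)
lhs = X.sf(5) / X.sf(3)  # sf(k) = P(X > k), the survival function
rhs = X.sf(2)
print(lhs, rhs)          # both 0.25
```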
Example: Geoffrey is a reckless and impulsive drunkard. After having too many drinks at a local pub in Yekaterinburg, he decides to play a game of Russian Roulette, by himself. Geoffrey is superstitious and will only entertain himself with a maximum of 8 rounds, his lucky number. The gun contains 6 chambers and 1 bullet, and he spins the cylinder before every pull, so each round is an independent trial. What is the probability that his life would have been spared by exactly one round, i.e., that the gun would have fired on exactly the 9th round?
Solution: All we need for a geometric distribution is p, where the probability of the gun firing on any round is ⅙. To find the probability of it firing on the 9th round, we have P(X = 9) = (⅚)⁸ × (⅙) ≈ 0.039, roughly 3.9%. However, now is a good time to introduce a new concept. This percentage does NOT mean that the chance Geoffrey will die is the remaining 96.1%. Instead, we have only found the probability of the gun firing on EXACTLY the 9th round, but it could also fire on the 10th, 11th, 12th, etc. If we want to find the probability he will survive this game, we must calculate P(X ≥ 9), which is essentially an infinite geometric sum starting from 9. Clearly, it is hard to calculate all the way up to infinity, so what we do instead is take a reverse approach and find the complement. Since all probabilities must sum to 1, the probability that the gun will NOT fire on any of the first 8 pulls is equal to 1 minus the probability that the gun WILL fire within the first 8 pulls. This is 1 - P(X ≤ 8) = (⅚)⁸ ≈ 0.23, so there is roughly a 23% chance he will survive all 8 trigger pulls.
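Here is a quick numerical check of both numbers:

```python
from scipy import stats

p = 1 / 6            # chance the loaded chamber comes up on any pull
X = stats.geom(p)    # X = pull on which the gun fires

print(X.pmf(9))      # P(X = 9) = (5/6)^8 * (1/6) ≈ 0.039
print(X.sf(8))       # P(X >= 9) = (5/6)^8 ≈ 0.23, survival via the complement
```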
X ~ NBin(r, p)
The negative binomial distribution is an extension of the geometric distribution: instead of finding the probability of the number of trials until the first success, we want the probability of the number of trials until multiple successes, i.e., the r’th success.
Example: Naegene, an evolutionary biologist, is conducting a study on a genetic mutation in a particular lineage of Drosophila flies. She introduced a carrier gene into the first fly mother and wants to understand how many generations it will take for the mutation to be expressed in 3 flies. The mutation expresses itself randomly in flies, meaning that not every offspring will be affected. Suppose each offspring will develop the mutation with p = 0.05. What is the probability that it will take under 10 generations for the mutation to appear in three different flies within the population?
Solution: We want to find the number of trials until the r’th success. In this case, we want to observe 3 mutated flies, so r = 3. The probability of mutation in one fly is p = 0.05. We want the probability that it takes under 10 generations, which means we need to find P(X < 10), where X represents the number of generations needed. Plugging these into the negative binomial PMF, we can calculate P(X < 10) by summing the probabilities for X = 3, 4, …, 9 (X cannot be smaller than 3, since we need at least 3 trials to see 3 successes).
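SciPy can do the summing for us, with one caveat worth a comment, since its nbinom counts failures rather than total trials:

```python
from scipy import stats

r, p = 3, 0.05
# SciPy's nbinom counts FAILURES before the r-th success, so if X is the
# total number of trials, then X = failures + r and P(X <= 9) = P(failures <= 6)
print(stats.nbinom.cdf(9 - r, r, p))  # P(X < 10) ≈ 0.008
```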
X ~ HGeom(N, n, D)
The hypergeometric distribution describes the probability of finding X “defectives” in a selected set of n objects, out of N total objects containing D total defectives, without replacement. Essentially, 1) we have a batch of items, 2) there are some defective items in this batch, 3) we randomly choose some items from this batch, and 4) we want to find out the probability of finding a certain number of defective items among our random selections. TLDR: The gist of a hypergeometric distribution is sampling from a finite population without replacement and finding out how many of those samples are likely to have a specific characteristic of interest.
Example: Henry is a mycologist and needs to collect a certain kind of mushroom for his experiment. After a long day of scavenging the woods, he randomly collects a total of 10 mushrooms. The woods are filled with only two types of mushrooms indistinguishable from one another: the special mushrooms that Henry needs, and portobello mushrooms. Suppose there are only 100 mushrooms in the entire woods, and only 20 of them are special. In order to carry out his experiment, Henry needs exactly 4 of the special mushrooms. What is the probability that out of the 10 mushrooms he collected, 4 of them are special?
Solution: The total number of mushrooms available in the woods is N = 100. Out of this, only D = 20 are special (our “defectives”). Finally, Henry randomly collects n = 10 mushroom samples. We are interested in the probability that 4 of the 10 selected mushrooms are special, so we want to find P(X = 4) = [C(20, 4) × C(80, 6)] / C(100, 10): the ways to choose 4 special mushrooms, times the ways to choose 6 ordinary ones, over all the ways to choose 10 mushrooms out of 100.
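SciPy ships a hypergeometric distribution too; note that its argument order differs from the notation above, as flagged in the comments:

```python
from scipy import stats

# hypergeom(M, n, N): M = population size, n = number of special items,
# N = sample size (SciPy's naming, which differs from the article's N, D, n)
print(stats.hypergeom.pmf(4, 100, 20, 10))  # P(X = 4) ≈ 0.084
```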
Continuous Random Variables
While discrete random variables are used to describe events with distinct outcomes, continuous random variables deal with situations that fall along an uncountably infinite continuous spectrum, such as time or temperature. In essence, they provide a way to describe processes where infinitesimally small changes can make a difference. Instead of a PMF, we now have a probability density function (PDF). Why do we integrate instead of sum? Let’s think of a standard 2D bell curve, which is the PDF representing, say, the heights of individuals. The x-axis represents the values that the random variable X can take on, and the y-value of a point on the curve represents the “density”, a measure of how concentrated the values of X are around that point. Importantly, density is not itself a probability: for a continuous random variable, the probability of X landing on any single exact value is zero, which is why we only speak of probabilities over intervals. The area under the PDF curve must always equal 1, so integrating over some interval of the x-axis gives us the probability of the random variable X falling within those bounds. This is where the 68%, 95%, 99.7% rule comes from: integrating over one, two, and three standard deviations from the mean. The cumulative distribution function (CDF) is defined as the integral of the PDF from negative infinity up to some upper bound t, allowing us to find P(X ≤ t), for instance the probability that an individual is under 6 ft tall.
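Since the 68/95/99.7 rule is literally an integral of the PDF, it is a nice thing to verify numerically; the CDF does the integration for us:

```python
from scipy import stats

Z = stats.norm(0, 1)  # standard normal distribution

# Integrate the PDF over 1, 2, and 3 standard deviations via the CDF
for k in (1, 2, 3):
    print(k, Z.cdf(k) - Z.cdf(-k))  # ≈ 0.683, 0.954, 0.997
```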
X ~ N(μ, σ)
Let’s begin with a super interesting note about the Gaussian distribution, aka normal, aka the bell curve. The Gaussian is extremely powerful because of the Central Limit Theorem, which reveals the profound symmetries found in the natural world. This theorem states that the distribution of the means of random independent samples converges to a normal distribution as the sample size grows, regardless of the distribution the samples come from (so long as its variance is finite). So average together batches of samples from some arbitrary distribution, and eventually those averages will be Gaussian distributed. Does this imply that amidst randomness, there lies an underlying order and symmetry? Maybe that’s a stretch, but who knows. Also fun fact: the Gaussian can be used as a continuous approximation of the discrete binomial distribution.
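The theorem is striking to watch in action. In this sketch (the exponential source distribution and the sample size of 50 are arbitrary choices of mine), the averages of a very skewed distribution come out bell-shaped:

```python
import numpy as np

rng = np.random.default_rng(0)

# 10,000 sample means of a decidedly non-normal (exponential) distribution:
# by the Central Limit Theorem their histogram looks increasingly Gaussian
sample_means = rng.exponential(scale=1.0, size=(10_000, 50)).mean(axis=1)
print(sample_means.mean(), sample_means.std())  # ≈ 1.0 and ≈ 1/sqrt(50)
```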
Example: Gerald likes to rave at underground clubs in Berlin. His favorite genre is trance, which is characterized by a BPM ranging from 135–145. Assume that the average BPM played across all underground clubs is 150 BPM, with a standard deviation of 10. What is the probability that the genre will be trance?
Solution: The average BPM played is μ = 150, and the standard deviation is σ = 10. To calculate the probability that the genre will be trance, which falls within the range of 135–145 BPM, we want P(135 ≤ X ≤ 145), so we integrate the Gaussian PDF from 135 to 145. Standardizing, this is Φ((145 - 150)/10) - Φ((135 - 150)/10) = Φ(-0.5) - Φ(-1.5) ≈ 0.309 - 0.067 ≈ 0.24, so there is about a 24% chance Gerald hears trance.
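Or, letting SciPy evaluate the integral directly:

```python
from scipy import stats

mu, sigma = 150, 10
X = stats.norm(mu, sigma)

# P(135 <= X <= 145): integrate the PDF between the bounds via the CDF
print(X.cdf(145) - X.cdf(135))  # ≈ 0.24
```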
X ~ Unif(a,b)
Uniform distributions describe random values that have an equal probability of occurring. A property of the uniform random variable is that the probability of selecting a value within a particular interval is proportional to the length of the interval. The same holds in the 2D case, except length is replaced by area.
Example: Unice is going fishing for the first time. The pond she arrives at is murky, small, and round, with a diameter of 20 feet. This pond, however, has an interesting characteristic. A dense pack of fish exists only in the center of the pond, bounded within a circular region of approximately 6 feet in diameter. Since the pond is murky, Unice cannot see where the fishes lie, so she will not aim for anywhere in particular. Assuming that the fishing rod is capable of spanning the entire pond with the same probability, what is the likelihood that Unice’s fishing hook will land in the region with fish?
Solution: Because there is an equal probability that the hook lands on any part of the pond, the probability is simply the area of the fish region over the total area of the pond. The area of the pond, with a diameter of 20 feet (radius 10), is 100π square feet. The area of the circular region filled with fish (diameter 6, radius 3) is 9π. Thus, the probability is 9π/100π = 9%.
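A quick Monte Carlo sketch agrees: throw uniformly random points at the pond and count how many land in the fish region (the rejection sampling here is just one way to generate uniform points in a disk):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample candidate points in the bounding square, keep those inside the pond
pts = rng.uniform(-10, 10, size=(1_000_000, 2))
pond_pts = pts[(pts**2).sum(axis=1) <= 10**2]

# Fraction of casts landing in the fish region (radius 3)
in_fish = (pond_pts**2).sum(axis=1) <= 3**2
print(in_fish.mean())  # ≈ 0.09
```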
X ~ Exp(λ)
Exponential distributions model the time elapsed between independent, randomly occurring events. Similar to the discrete geometric distribution, exponential distributions are also characterized by the memoryless property, suggesting that the probability of an event happening in the next instant remains constant, regardless of how much time has already passed.
Example: Epstein is trying to repress a certain thought in his head. This thought surfaces in his mind at an exponential rate of λ = 2 occurrences per hour. Epstein will soon need to take a 45-minute exam that requires 100% of his attention. What is the probability that he will be able to repress this thought for the next 45 minutes?
Solution: Keeping the units consistent, we are interested in the non-occurrence of this thought in the next 45 minutes, or 0.75 hours. We want to find P(X > 0.75), the probability that this thought will only surface after 0.75 hours. Using the complement trick, this is 1 - P(X ≤ 0.75) = e^(-2 × 0.75) = e^(-1.5) ≈ 0.22, so Epstein has roughly a 22% chance of taking his exam in peace.
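In code, note that SciPy parameterizes the exponential by its scale, the reciprocal of the rate:

```python
from math import exp
from scipy import stats

lam = 2                        # thought occurrences per hour
X = stats.expon(scale=1/lam)   # SciPy uses scale = 1/lambda

print(X.sf(0.75))              # P(X > 0.75) = e^(-1.5) ≈ 0.22
print(exp(-lam * 0.75))        # closed form, same value
```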
X ~ Gamma(r, λ)
Gamma distributions model the time until the occurrence of the r’th success. How I like to think of it is through this analogy: Geometric is to Negative Binomial, as Exponential is to Gamma. The difference is that the first two are discrete cases modeling success in N different trials, whereas the last two are continuous cases modeling success over time.
Example: Gamar is the owner of a chicken hatchery and needs 8 chicks to hatch before the end of three weeks. Chicks hatch at a rate of λ = 4 per week. What is the probability his need is fulfilled?
Solution: Gamar needs the r = 8th chick to hatch within three weeks, so the time period we are interested in is 3 weeks. Using λ = 4 and r = 8, we can find P(X ≤ 3) by integrating the Gamma PDF from 0 to 3, which comes out to roughly 0.91. Gamar’s odds look good.
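One last sketch to close out the series, using the same rate-to-scale translation as the exponential above:

```python
from scipy import stats

r, lam = 8, 4                      # 8th hatch, at a rate of 4 per week
X = stats.gamma(a=r, scale=1/lam)  # shape a = r, scale = 1/lambda

print(X.cdf(3))  # P(X <= 3) ≈ 0.91: all 8 chicks hatch within 3 weeks
```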