PSTAT100: Data Science - Concepts and Analysis
April 7, 2026
Here we’ll introduce standard statistical terminology to describe data collection.

A statistical population is the collection of all units of interest. A sampling frame is the collection of units that could possibly be observed, and a sample is the collection of units actually observed.
We can now imagine a few common sampling scenarios by varying the relationship between population, frame, and sample.
Let’s introduce some notation. Denote an observational unit by \(U_i\), and let:
\[ \begin{alignat*}{2} \mathscr{U} &= \{U_i\}_{i \in I} &&\quad(\text{universe}) \\ P &= \{U_1, \dots, U_N\} &&\quad(\text{population}) \\ F &= \{U_j: j \in J \subset I\} &&\quad(\text{frame})\\ S &\subseteq F &&\quad(\text{sample}) \end{alignat*} \]
Perhaps the simplest scenario is a population census, where the entire population is observed. In this case:
\[S = F = P\]
Population Census.
From a census, all properties of the population are definitively known.
The statistical gold standard is the simple random sample (SRS), in which units are selected at random from the population. In this case:
\[S \subset F = P\]
Simple Random Sample.
From an SRS, sample properties are reflective of population properties.
More common in practice is an SRS drawn from a sampling frame that overlaps with, but does not fully cover, the population. In this case:
\[S \subset F, \quad F \cap P \neq \emptyset, \quad P \not\subseteq F\]
Typical Sample.
In this scenario, sample properties are reflective of the frame.
Also common is administrative data in which all units are selected from a convenient frame that partly covers the population. In this case:
\[S = F, \quad F \cap P \neq \emptyset, \quad P \not\subseteq F\]
Administrative Data.
Administrative data are singular; they do not represent any broader group.
A poor sampling design will produce samples that distort the statistical properties of the population; in that case, it is not sound to generalize from the sample to the population.
The sampling scenarios above can be differentiated along two key attributes:

- the relationship between the sample, the frame, and the population (how much of the population the frame covers);
- the sampling mechanism (how units are drawn from the frame).
If you can articulate these two points, you have fully characterized the sampling design.
In order to describe sampling mechanisms precisely, we need a little terminology.
For any way of drawing a sample from a frame, each unit has some inclusion probability.
Let’s suppose that the frame \(F\) comprises \(N\) units, and denote the inclusion probabilities by:
\[p_i = P(\text{unit } i \text{ is included in the sample})\]
The inclusion probability of each unit is usually determined by the physical procedure of collecting data, rather than fixed a priori.
Sampling mechanisms are methods of drawing samples and are categorized into four types based on inclusion probabilities.
In a census, every unit is included: \(p_i = 1\) for all \(i\).
In a random sample, every unit is equally likely to be included: \(p_i = p\) for all \(i\).
In a probability sample, units have known but potentially unequal inclusion probabilities: each \(p_i\) is known, but \(p_i \neq p_j\) in general.
In a nonrandom sample, inclusion probabilities are indeterminate: the \(p_i\) are unknown.
Let’s characterize the sampling designs of some example datasets.
Annual observations of GDP growth for 234 countries from 1961 to 2019.
| Country Name | Country Code | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | 1969 | ... | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Aruba | ABW | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | -3.685029 | 3.446055 | -1.369863 | 4.198232 | 0.300000 | 5.700001 | 2.100000 | 1.999999 | NaN | NaN |
| Afghanistan | AFG | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 14.362441 | 0.426355 | 12.752287 | 5.600745 | 2.724543 | 1.451315 | 2.260314 | 2.647003 | 1.189228 | 3.911603 |
| Angola | AGO | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 4.403933 | 3.471976 | 8.542188 | 4.954545 | 4.822628 | 0.943572 | -2.580050 | -0.147213 | -2.003630 | -0.624644 |
| Albania | ALB | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 3.706892 | 2.545322 | 1.417526 | 1.001987 | 1.774487 | 2.218752 | 3.314805 | 3.802197 | 4.071301 | 2.240070 |
| Andorra | AND | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | -1.974958 | -0.008070 | -4.974444 | -3.547597 | 2.504466 | 1.434140 | 3.709678 | 0.346072 | 1.588765 | 1.849238 |
5 rows × 60 columns
This is administrative data. Scope of inference: none.
Observations of average brain and body weights for 62 mammal species.
| | species | body_weight | brain_weight | slow_wave | paradox |
|---|---|---|---|---|---|
| 0 | African elephant | 6654.000 | 5712.0 | NaN | NaN |
| 1 | African giant pouched rat | 1.000 | 6.6 | 6.3 | 2.0 |
| 2 | Arctic fox | 3.385 | 44.5 | NaN | NaN |
| 3 | Arctic ground squirrel | 0.920 | 5.7 | NaN | NaN |
| 4 | Asian elephant | 2547.000 | 4603.0 | 2.1 | 1.8 |
Let’s call this convenience data. Scope of inference: none.
Records of given names of babies in CA from 1990 - 2018.
| | State | Sex | Year | Name | Count |
|---|---|---|---|---|---|
| 0 | CA | F | 1990 | Jessica | 6635 |
| 1 | CA | F | 1990 | Ashley | 4537 |
| 2 | CA | F | 1990 | Stephanie | 4001 |
| 3 | CA | F | 1990 | Amanda | 3856 |
| 4 | CA | F | 1990 | Jennifer | 3611 |
This is __________________ data. Scope of inference: ______.
It may help to map the scenarios we’ve considered onto an ‘informativeness’ spectrum:
Missing data arise when one or more variable measurements fail for a subset of observations.
It is standard practice to record observations with missingness but to enter a special symbol (`NA`, `NaN`, `NULL`, `None`, `..`, `-`, etc.) for the missing values.
In Python, missing values are mapped to a special float: `NaN`.
Here is some made-up data with two missing values:
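A minimal sketch of such a dataset (the column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

# made-up measurements; 'temp' and 'humidity' are hypothetical columns
df = pd.DataFrame({
    'temp':     [21.5, np.nan, 19.8, 22.1],
    'humidity': [0.42, 0.55, np.nan, 0.47]
})
```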
Pandas has the ability to map specified entries to NaN when parsing data files:
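For example, `pd.read_csv` accepts an `na_values` argument listing the symbols to treat as missing (the tiny CSV below is invented):

```python
import io
import pandas as pd

# an invented file that codes missing scores as '..' or '-'
raw = "id,score\n1,87\n2,..\n3,-\n"

# na_values maps those placeholder symbols to NaN during parsing
df = pd.read_csv(io.StringIO(raw), na_values=['..', '-'])
```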
NaNs propagate through calculations on numpy arrays: any sum or mean that touches a NaN returns NaN.
However, the default behavior in pandas is to ignore the NaNs, which allows the computation to proceed:
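A quick comparison (the array values are invented): numpy propagates the NaN, while pandas skips it by default:

```python
import numpy as np
import pandas as pd

x = np.array([1.0, 2.0, np.nan, 4.0])

np.mean(x)           # nan: the NaN propagates through the numpy mean
pd.Series(x).mean()  # 2.333...: pandas drops the NaN (skipna=True by default)
```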
But here’s the rub: those missing values could have been anything, and ignoring them changes the result from what it would have been!
In a nutshell, the missing data problem is:
How should missing values be handled in a data analysis?
Getting the software to run is one thing, but this alone does not address the challenges posed by the missing data. Unless the analyst, or the software vendor, provides some way to work around the missing values, the analysis cannot continue because calculations on missing values are not possible. There are many approaches to circumvent this problem. Each of these affects the end result in a different way. (Stef van Buuren, 2018)
If you are interested in the topic, Stef van Buuren’s Flexible Imputation of Missing Data (the source of one of your readings this week) provides an excellent introduction.
A missing data mechanism is a process causing missingness.
Suppose we have a dataset \(\mathbf{X}\) (tidy) consisting of \(n\) rows/observations and \(p\) columns/variables, and define:
\[q_{ij} = P(x_{ij} \text{ is missing})\]
Data are missing completely at random (MCAR) if the probabilities of missing entries are uniformly equal.
\[q_{ij} = q \quad\text{for all}\quad i, j\]
This implies that the cause of missingness is unrelated to the data: missing values can be ignored.
This is the easiest scenario to handle.*
Data are missing at random (MAR) if the probabilities of missing entries depend on observed data.
\[q_{ij} = f(\mathbf{x}_i)\]
This implies that information about the cause of missingness is captured within the dataset: it is possible to model the missing data.
Missing data methods typically address this scenario.
Data are missing not at random (MNAR) if the probabilities of missing entries depend on unobserved data.
\[q_{ij} = \; ?\]
In the GDP growth data, growth measurements are missing for many countries before a certain year.
| Country Name | Country Code | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | 1969 | ... | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Aruba | ABW | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | -3.685029 | 3.446055 | -1.369863 | 4.198232 | 0.300000 | 5.700001 | 2.100000 | 1.999999 | NaN | NaN |
| Afghanistan | AFG | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 14.362441 | 0.426355 | 12.752287 | 5.600745 | 2.724543 | 1.451315 | 2.260314 | 2.647003 | 1.189228 | 3.911603 |
| Angola | AGO | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 4.403933 | 3.471976 | 8.542188 | 4.954545 | 4.822628 | 0.943572 | -2.580050 | -0.147213 | -2.003630 | -0.624644 |
| Albania | ALB | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 3.706892 | 2.545322 | 1.417526 | 1.001987 | 1.774487 | 2.218752 | 3.314805 | 3.802197 | 4.071301 | 2.240070 |
| Andorra | AND | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | -1.974958 | -0.008070 | -4.974444 | -3.547597 | 2.504466 | 1.434140 | 3.709678 | 0.346072 | 1.588765 | 1.849238 |
5 rows × 60 columns
We might be able to hypothesize about why – perhaps a country didn’t exist or didn’t keep reliable records for a period of time.
However, the data as they are contain no additional information that might explain the cause of missingness. So these data are MNAR.
The easiest approach to missing data is to drop observations with missing values: `df.dropna()`.
| Country Name | Country Code | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | 1969 | ... | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Argentina | ARG | 5.427843 | -0.852022 | -5.308197 | 10.130298 | 10.569433 | -0.659726 | 3.191997 | 4.822501 | 9.679526 | ... | 10.125398 | 6.003952 | -1.026420 | 2.405324 | -2.512615 | 2.731160 | -2.080328 | 2.818503 | -2.565352 | -2.088015 |
| Australia | AUS | 2.485769 | 1.296087 | 6.214630 | 6.978522 | 5.983506 | 2.382458 | 6.302620 | 5.095814 | 7.044329 | ... | 2.067417 | 2.462756 | 3.918163 | 2.584898 | 2.533115 | 2.192647 | 2.770652 | 2.300611 | 2.949286 | 2.160956 |
| Austria | AUT | 5.537979 | 2.648675 | 4.138268 | 6.124354 | 3.480175 | 5.642861 | 3.008048 | 4.472313 | 6.275867 | ... | 1.837094 | 2.922797 | 0.680446 | 0.025505 | 0.661273 | 1.014502 | 1.989437 | 2.399588 | 2.580121 | 1.418734 |
| Burundi | BDI | -13.746135 | 9.063158 | 4.135407 | 6.273038 | 3.967226 | 4.612993 | 13.821519 | -0.297884 | -1.459541 | ... | 5.124163 | 4.032602 | 4.446708 | 4.924195 | 4.240652 | -3.900003 | -0.600020 | 0.500010 | 1.609933 | 1.842477 |
| Belgium | BEL | 4.978423 | 5.212004 | 4.351584 | 6.956685 | 3.560660 | 3.155895 | 3.868147 | 4.194130 | 6.629800 | ... | 2.864293 | 1.694514 | 0.739217 | 0.459242 | 1.578533 | 2.041459 | 1.266686 | 1.608087 | 1.812296 | 1.743820 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| St. Vincent and the Grenadines | VCT | 4.527283 | 3.694262 | -6.265339 | 3.669726 | 0.884935 | 0.000000 | -9.523813 | 6.509706 | 2.860845 | ... | -3.353437 | -0.419296 | 1.382087 | 1.832977 | 1.214039 | 1.330275 | 1.897441 | 1.000358 | 2.163129 | 0.494924 |
| World | WLD | 4.299183 | 5.554137 | 5.350678 | 6.713557 | 5.519644 | 5.768498 | 4.485951 | 6.313528 | 6.113628 | ... | 4.303017 | 3.137774 | 2.518507 | 2.665979 | 2.861098 | 2.873949 | 2.605790 | 3.298628 | 2.976776 | 2.343378 |
| South Africa | ZAF | 3.844751 | 6.177883 | 7.373613 | 7.939782 | 6.122761 | 4.438308 | 7.196576 | 4.153445 | 4.715831 | ... | 3.039731 | 3.284168 | 2.213355 | 2.485200 | 1.846992 | 1.193733 | 0.399088 | 1.414513 | 0.787056 | 0.152583 |
| Zambia | ZMB | 1.361382 | -2.490839 | 3.272393 | 12.214048 | 16.647456 | -5.570310 | 7.919697 | 1.248330 | -0.436916 | ... | 10.298223 | 5.564602 | 7.597593 | 5.057232 | 4.697992 | 2.920375 | 3.776679 | 3.504336 | 4.034378 | 1.441785 |
| Zimbabwe | ZWE | 6.316157 | 1.434471 | 6.244345 | -1.106172 | 4.910571 | 1.523130 | 8.367009 | 1.970135 | 12.428236 | ... | 19.675323 | 14.193913 | 16.665429 | 1.989493 | 2.376929 | 1.779873 | 0.755869 | 4.704035 | 4.829674 | -8.100000 |
119 rows × 60 columns
Imputation is the process of replacing missing values with estimated values, typically statistical estimates computed from the observed data.
Mean Imputation
Do:
Don’t:
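As a minimal sketch of mean imputation with pandas (toy values, assuming a numeric column):

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, np.nan, 5.0, 4.0])

# replace the missing entry with the mean of the observed values
imputed = s.fillna(s.mean())  # the observed mean is (3 + 5 + 4)/3 = 4.0
```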
President Donald J. Trump amplified the statement in a tweet, the Chairman of the Federal Elections Commission (FEC) referenced the statement as indicative of fraud, and a conservative group prominently featured it in a legal brief seeking to overturn the Pennsylvania election results. (Samuel Wolf, Williams Record, 11/25/20)
The Miller affidavit was criticized by statisticians as incorrect, irresponsible, and unethical.
On a purely mathematical level, Miller’s calculations were standard. The key issue was a single flawed assumption:
The analysis is predicated on the assumption that the responders are a representative sample of the population of registered Republicans in Pennsylvania for whom a mail-in ballot was requested but not counted, and responded accurately to the questions during the phone calls. (Miller affidavit)
Essentially, Miller made two critical mistakes in the analysis:
We will conduct a post mortem and examine these issues.
Miller is a number theorist, not a trained survey statistician, so on some level his mistakes were understandable, but they did a lot of damage.
There were 165,412 unreturned mail ballots requested by registered Republicans in PA.
Those voters were surveyed by phone by Matt Braynard’s private firm External Affairs on behalf of the Voter Integrity Fund.
We don’t really know how they obtained and selected phone numbers or exactly what the survey procedure was, but here’s what we do know:
Miller Survey Schematic
Population: Republicans registered to vote in PA who had mail ballots officially requested that hadn’t been returned or counted by November 9?
Sampling frame: unknown; source of phone numbers unspecified.
Sample: 2684 registered Republicans or family members of registered Republicans who had a mail ballot officially requested in PA and answered survey calls on Nov. 9 or 10.
Sampling mechanism: nonrandom; depends on availability during calling hours on Monday and Tuesday, language spoken, and willingness to talk.
This is not a representative sample of any meaningful population.*
Respondents hung up at every stage of the survey.
This is probably not at random – individuals who do not believe voter fraud occurred are more likely to hang up.
However, we don’t have any information about whether respondents think fraud occurred.
So data are MNAR, and likely over-represent people more likely to claim they never requested a ballot.
It’s not too tricky to envision sources of bias that would affect Miller’s results. How much bias might there be?
This is an oversimplification, but if we are willing to assume that:

- respondents answer truthfully;
- respondents who claim they did not request a ballot are 15x more likely to talk to the interviewer;
- respondents who affirm requesting a ballot are 4x more likely to hang up or deflect;
Then we can show through a simple simulation that an actual fraud rate of under 1% will be estimated at over 20% almost all the time.
First let’s generate a population of 150K voters.
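A minimal sketch of that step, assuming a true fraudulent-request rate of 0.8% (the rate and the 150K size are illustrative choices, consistent with the variables used in the code that follows):

```python
import numpy as np
import pandas as pd

np.random.seed(41021)
N = 150_000
true_prop = 0.008  # assumed true rate of fraudulent requests (under 1%)

# requested = 1 if the voter genuinely requested a mail ballot, 0 otherwise
population = pd.DataFrame(
    {'requested': np.random.binomial(n=1, p=1 - true_prop, size=N).astype(float)}
)
```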
Then let’s introduce sampling weights based on the conditional probability that an individual will talk with the interviewer given whether they requested a ballot or not.
# assume respondents tell the truth
p_request = 1 - true_prop
p_nrequest = true_prop
# assume respondents who claim no request are 15x more likely to talk
talk_factor = 15
# observed overall response (talk) rate
p_talk = 0.09
# conditional probability of talking given claimed request or not
p_talk_request = p_talk/(p_request + talk_factor*p_nrequest)
p_talk_nrequest = talk_factor*p_talk_request
# draw sample weighted by conditional probabilities
np.random.seed(41021)
population.loc[population.requested == 1, 'sample_weight'] = p_talk_request
population.loc[population.requested == 0, 'sample_weight'] = p_talk_nrequest
samp = population.sample(n = 2500, replace = False, weights = 'sample_weight')

Then let’s introduce missing values at different rates for respondents who requested a ballot and respondents who didn’t.
# assume respondents who affirm requesting are 4x more likely to hang up or deflect
missing_factor = 4
# observed missing/unsure rate
p_missing = 0.25
# conditional probabilities of missing given request status
p_missing_nrequest = p_missing/(0.8 + missing_factor*0.2)
p_missing_request = missing_factor*p_missing_nrequest
# input missing values
np.random.seed(41021)
samp.loc[samp.requested == 1, 'missing_weight'] = p_missing_request
samp.loc[samp.requested == 0, 'missing_weight'] = p_missing_nrequest
samp['missing'] = np.random.binomial(n = 1, p = samp.missing_weight.values)
samp.loc[samp.missing == 1, 'requested'] = float('nan')

If we then drop all the missing values and calculate the proportion of respondents who didn’t request a ballot, we get:
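Putting the pieces together, here is a self-contained condensation of the simulation; the 0.8% true fraudulent-request rate and the simplified weight columns are illustrative assumptions:

```python
import numpy as np
import pandas as pd

np.random.seed(41021)
true_prop = 0.008  # assumed true rate of fraudulent requests (under 1%)
N = 150_000

# 1 = voter genuinely requested a mail ballot, 0 = fraudulent request
population = pd.DataFrame(
    {'requested': np.random.binomial(n=1, p=1 - true_prop, size=N).astype(float)}
)

# sampling weights: claimed non-requesters are 15x more likely to talk
population['w'] = np.where(population.requested == 1, 1.0, 15.0)
samp = population.sample(n=2500, replace=False, weights='w')

# missingness: requesters are 4x more likely to hang up or deflect
p_miss_nreq = 0.25 / (0.8 + 4 * 0.2)
p_miss = np.where(samp.requested == 1, 4 * p_miss_nreq, p_miss_nreq)
samp.loc[np.random.binomial(n=1, p=p_miss).astype(bool), 'requested'] = np.nan

# drop missing values and estimate the apparent rate of unrequested ballots
est = 1 - samp.dropna().requested.mean()
```

Under these assumptions the estimate comes out an order of magnitude above the true rate, in line with the claim above.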
So Miller’s result is expected if the sampling and missing mechanisms introduce bias, even if the true rate of fraudulent requests is under 1% – on the order of 1,000 ballots.
After the affidavit was filed, a colleague spoke with Miller; he recanted and acknowledged his mistakes, but this received far less attention than the conclusions in the affidavit.
The American Statistical Association publishes ethical guidelines for statistical practice. The Miller case violated a large number of these, most prominently, that an ethical practitioner:
Reports the sources and assessed adequacy of the data, accounts for all data considered in a study, and explains the sample(s) actually used.
In publications and reports, conveys the findings in ways that are both honest and meaningful to the user/reader. This includes tables, models, and graphics.
In publications or testimony, identifies the ultimate financial sponsor of the study, the stated purpose, and the intended use of the study results.
When reporting analyses of volunteer data or other data that may not be representative of a defined population, includes appropriate disclaimers and, if used, appropriate weighting.