Basic Notions in Statistics

  1. Population

    In statistics, the population refers to the complete set of all items, individuals, events, or observations that are of interest in a particular study. It can be large (e.g., all people living in a country) or small (e.g., all students in a class). The population can be finite or infinite, depending on the scope of the study.

  2. Statistical Units

    A statistical unit is the individual element or entity in a population or sample that is being observed, measured, or counted in a study. These units are what the data points are collected from, and they must be clearly defined to avoid confusion.

  3. Distribution

    A distribution in statistics refers to how the values of a variable are spread or distributed across a dataset. It shows the frequency or likelihood of each value or range of values. There are different types of distributions (e.g., normal distribution, binomial distribution), but they all describe how data points are arranged within a dataset.

  4. Frequency

    Frequency refers to the number of times a particular value or group of values occurs within a dataset. Three key types of frequency are often used: absolute frequency (the raw count of occurrences of a value), relative frequency (the count divided by the total number of observations), and cumulative frequency (the running total of counts up to and including a given value).
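    As a minimal sketch, the three types of frequency can be computed directly in Python (the dataset below is invented purely for illustration):

```python
from collections import Counter

# Hypothetical sample of exam marks (invented data for illustration).
marks = [7, 8, 7, 9, 10, 8, 7, 6, 9, 8]
n = len(marks)

counts = Counter(marks)            # absolute frequencies

cumulative = 0
for value in sorted(counts):
    absolute = counts[value]       # how many times the value occurs
    relative = absolute / n        # share of the whole dataset
    cumulative += absolute         # running total up to this value
    print(f"value={value}  absolute={absolute}  relative={relative:.2f}  cumulative={cumulative}")
```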

  5. Arithmetic Average (Mean)

    The arithmetic average, also known as the mean, is one of the most commonly used measures of central tendency in statistics. It represents the sum of all values in a dataset divided by the number of values.
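    As a minimal sketch, the definition translates directly into code (sum of the values divided by their count):

```python
def mean(values):
    """Arithmetic mean: the sum of all values divided by the number of values."""
    if not values:
        raise ValueError("mean of an empty dataset is undefined")
    return sum(values) / len(values)

print(mean([2, 4, 6, 8]))  # → 5.0
```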

  6. Computational Problems with Floating-Point Representation

    When calculating the arithmetic mean (or any other statistic) on computers, floating-point representation introduces potential challenges due to the way real numbers are stored.
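    A quick illustration of the problem, assuming IEEE 754 double precision (Python's `float`): when a tiny value is added to a huge running sum, the tiny value can be rounded away entirely.

```python
big = 1e16
small = 1.0

# In double precision, 1e16 + 1.0 rounds back to 1e16,
# so each small contribution vanishes as it is added.
total = big
for _ in range(100):
    total += small
print(total - big)               # → 0.0 (the 100 we "added" has been lost)

# Grouping the small values first preserves them.
print((big + 100 * small) - big)  # → 100.0
```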

  7. Numerical Solutions (Knuth’s Algorithm)

    To address computational issues like precision loss, numerically stable summation algorithms are used; Donald Knuth analyzes the rounding error of floating-point addition in The Art of Computer Programming. One widely used technique is Kahan summation (or compensated summation), introduced by William Kahan, which improves the accuracy of summing floating-point numbers.

    Kahan Summation Algorithm:

    The idea is to track small errors introduced during the summation process and compensate for them, thereby reducing the effect of floating-point precision issues. The algorithm works as follows:

    1. Initialize a running sum and a compensation term, both set to zero.
    2. Before each value is added to the sum, subtract the compensation term from it, reintroducing the error lost in the previous addition.
    3. After each addition, recompute the compensation term as the difference between what was actually added and what should have been added; repeating this correction at every step yields a more accurate final total.

    By carefully managing the addition of small differences, this algorithm helps to minimize the errors that accumulate when adding a large number of values, especially in datasets with values of different magnitudes.
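    The steps above can be sketched in Python (a minimal illustration, not a production implementation):

```python
def kahan_sum(values):
    """Compensated (Kahan) summation of a sequence of floats."""
    total = 0.0
    compensation = 0.0                   # estimate of the lost low-order bits
    for x in values:
        y = x - compensation             # reintroduce the previously lost error
        t = total + y                    # add to the running sum (may round)
        compensation = (t - total) - y   # recover what the rounding discarded
        total = t
    return total

# Many small values after one huge value: naive summation loses them.
data = [1e16] + [1.0] * 100

naive = 0.0
for x in data:
    naive += x

print(naive - 1e16)            # → 0.0 (small terms lost)
print(kahan_sum(data) - 1e16)  # → 100.0 (small terms recovered)
```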

EXERCISE

Assignment:

We have n servers and m attackers. Each attacker has probability p of penetrating each server. Make a graphical representation of each attacker's progress (a flat line while the attacker does not penetrate, and a jump of 1 when he penetrates); try different values of n, m, and p. At time n we want the complete distribution of how many attackers reached each level. (Draw the distribution histogram vertically at the end of the chart, so that each rectangle representing the attackers' frequency is placed at the corresponding number of penetrations, or "successes", achieved.)
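One possible way to sketch the simulation (assuming NumPy and Matplotlib are available; the parameter values and the fixed random seed below are arbitrary choices for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

# Arbitrary illustrative parameters: try different values.
n, m, p = 50, 200, 0.3   # servers (time steps), attackers, penetration probability
rng = np.random.default_rng(0)

# Each row is one attacker's Bernoulli trials; the cumulative sum along a row
# is that attacker's penetration level over time.
trials = rng.random((m, n)) < p
levels = np.cumsum(trials, axis=1)

fig, (ax_paths, ax_hist) = plt.subplots(
    1, 2, sharey=True, figsize=(10, 5),
    gridspec_kw={"width_ratios": [3, 1]},
)

# Step-like trajectories: flat while no penetration, jump of 1 on each success.
time = np.arange(n + 1)
for lvl in levels:
    ax_paths.step(time, np.concatenate(([0], lvl)), where="post", alpha=0.2)
ax_paths.set_xlabel("server index (time)")
ax_paths.set_ylabel("number of penetrations")

# Final distribution drawn vertically at the end of the chart:
# one horizontal bar per level, sized by how many attackers ended there.
final = levels[:, -1]
values, counts = np.unique(final, return_counts=True)
ax_hist.barh(values, counts, height=0.8)
ax_hist.set_xlabel("frequency of attackers")

plt.tight_layout()
plt.show()
```

With many attackers, the vertical histogram at time n approaches the binomial distribution with parameters n and p.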

You can find the code for the exercise here, while the online result can be accessed here.