Basic Notions in Statistics

  1. Population

    In statistics, the population refers to the complete set of all items, individuals, events, or observations that are of interest in a particular study. It can be large (e.g., all people living in a country) or small (e.g., all students in a class). The population can be finite or infinite, depending on the scope of the study.

  2. Statistical Units

    A statistical unit is the individual element or entity in a population or sample that is being observed, measured, or counted in a study. These units are what the data points are collected from, and they must be clearly defined to avoid confusion.

  3. Distribution

    A distribution in statistics refers to how the values of a variable are spread or distributed across a dataset. It shows the frequency or likelihood of each value or range of values. There are different types of distributions (e.g., normal distribution, binomial distribution), but they all describe how data points are arranged within a dataset.

  4. Frequency

    Frequency refers to the number of times a particular value or group of values occurs within a dataset. Three key types of frequency are often used: absolute frequency (the raw count of occurrences of a value), relative frequency (the count divided by the total number of observations), and cumulative frequency (the running total of counts up to and including a given value).
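    As a minimal sketch, the three types of frequency can be computed directly in Python (the dataset below is invented purely for illustration):

```python
from collections import Counter

# Hypothetical sample of exam marks (invented data for illustration).
marks = [7, 8, 7, 9, 10, 8, 7, 6, 9, 8]
n = len(marks)

counts = Counter(marks)            # absolute frequencies

cumulative = 0
for value in sorted(counts):
    absolute = counts[value]       # how many times the value occurs
    relative = absolute / n        # share of the whole dataset
    cumulative += absolute         # running total up to this value
    print(f"value={value}  absolute={absolute}  relative={relative:.2f}  cumulative={cumulative}")
```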

  5. Arithmetic Average (Mean)

    The arithmetic average, also known as the mean, is one of the most commonly used measures of central tendency in statistics. It represents the sum of all values in a dataset divided by the number of values.
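    As a minimal sketch, the definition translates directly into code (sum of the values divided by their count):

```python
def mean(values):
    """Arithmetic mean: the sum of all values divided by the number of values."""
    if not values:
        raise ValueError("mean of an empty dataset is undefined")
    return sum(values) / len(values)

print(mean([2, 4, 6, 8]))  # → 5.0
```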

  6. Computational Problems with Floating-Point Representation

    When calculating the arithmetic mean (or any other statistic) on computers, floating-point representation introduces potential challenges due to the way real numbers are stored.
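    A quick illustration of the problem, assuming IEEE 754 double precision (Python's `float`): when a tiny value is added to a huge running sum, the tiny value can be rounded away entirely.

```python
big = 1e16
small = 1.0

# In double precision, 1e16 + 1.0 rounds back to 1e16,
# so each small contribution vanishes as it is added.
total = big
for _ in range(100):
    total += small
print(total - big)               # → 0.0 (the 100 we "added" has been lost)

# Grouping the small values first preserves them.
print((big + 100 * small) - big)  # → 100.0
```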

  7. Numerical Solutions (Knuth’s Algorithm)

    To address computational issues like precision loss, numerically stable summation algorithms are used; Donald Knuth analyzes the rounding error of floating-point addition in The Art of Computer Programming. One widely used technique is Kahan summation (or compensated summation), introduced by William Kahan, which improves the accuracy of summing floating-point numbers.

    Kahan Summation Algorithm:

    The idea is to track small errors introduced during the summation process and compensate for them, thereby reducing the effect of floating-point precision issues. The algorithm works as follows:

    1. Initialize a running sum and a compensation term, both set to zero.
    2. Before each value is added to the sum, subtract the compensation term from it, reintroducing the error lost in the previous addition.
    3. After each addition, recompute the compensation term as the difference between what was actually added and what should have been added; repeating this correction at every step yields a more accurate final total.

    By carefully managing the addition of small differences, this algorithm helps to minimize the errors that accumulate when adding a large number of values, especially in datasets with values of different magnitudes.
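    The steps above can be sketched in Python (a minimal illustration, not a production implementation):

```python
def kahan_sum(values):
    """Compensated (Kahan) summation of a sequence of floats."""
    total = 0.0
    compensation = 0.0                   # estimate of the lost low-order bits
    for x in values:
        y = x - compensation             # reintroduce the previously lost error
        t = total + y                    # add to the running sum (may round)
        compensation = (t - total) - y   # recover what the rounding discarded
        total = t
    return total

# Many small values after one huge value: naive summation loses them.
data = [1e16] + [1.0] * 100

naive = 0.0
for x in data:
    naive += x

print(naive - 1e16)            # → 0.0 (small terms lost)
print(kahan_sum(data) - 1e16)  # → 100.0 (small terms recovered)
```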

EXERCISE

Assignment:

We have n servers and m attackers. Each attacker has probability p of penetrating each server. Make a graphical representation of each attacker's progress (a flat line while the attacker does not penetrate, and a jump of 1 when he penetrates); try different values of n, m, and p. At time n we want the complete distribution of how many attackers reached each level. (Draw the distribution histogram vertically at the end of the chart, so that each rectangle representing the attackers' frequency is placed at the corresponding number of penetrations, or "successes", achieved.)
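One possible way to sketch the simulation (assuming NumPy and Matplotlib are available; the parameter values and the fixed random seed below are arbitrary choices for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

# Arbitrary illustrative parameters: try different values.
n, m, p = 50, 200, 0.3   # servers (time steps), attackers, penetration probability
rng = np.random.default_rng(0)

# Each row is one attacker's Bernoulli trials; the cumulative sum along a row
# is that attacker's penetration level over time.
trials = rng.random((m, n)) < p
levels = np.cumsum(trials, axis=1)

fig, (ax_paths, ax_hist) = plt.subplots(
    1, 2, sharey=True, figsize=(10, 5),
    gridspec_kw={"width_ratios": [3, 1]},
)

# Step-like trajectories: flat while no penetration, jump of 1 on each success.
time = np.arange(n + 1)
for lvl in levels:
    ax_paths.step(time, np.concatenate(([0], lvl)), where="post", alpha=0.2)
ax_paths.set_xlabel("server index (time)")
ax_paths.set_ylabel("number of penetrations")

# Final distribution drawn vertically at the end of the chart:
# one horizontal bar per level, sized by how many attackers ended there.
final = levels[:, -1]
values, counts = np.unique(final, return_counts=True)
ax_hist.barh(values, counts, height=0.8)
ax_hist.set_xlabel("frequency of attackers")

plt.tight_layout()
plt.show()
```

With many attackers, the vertical histogram at time n approaches the binomial distribution with parameters n and p.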

You can find the code for the exercise here, while the online result can be accessed here.