# Statistics with Pure Python - Introduction

The popularity of data science is increasing every day, and with it the popularity of Python in data science, machine learning, and artificial intelligence. In the noisy crowd of programming languages, libraries, and dev tools, people often skip learning the fundamentals, which leads to major problems down the road. To excel in programming for these fields, a knowledge of statistics is very important, and in some cases it is a must.

Many people who already know statistics come to Python and immediately start using various libraries instead of learning the language properly, which leaves their learning incomplete. Many Python experts, in turn, jump into doing statistics with Python and its libraries without learning statistics properly. Some make sure there are no holes in their knowledge and skill set. For others, the ground is never leveled, partly because they are unaware of the gaps and partly because of the way learning resources are designed and presented.

In this series of articles, we’ll attempt to fill in some of the gaps that so many programmers seem to have. In this first article I’ll introduce you to some simple aspects of statistics with the help of Python. And remember: we won’t be using any external libraries.

**Prerequisites**

Before you start this series, you should have basic Python programming knowledge and perhaps some very basic knowledge of statistics. You should also be able to use the command line on your system. Knowing how to work with Jupyter Notebook is a bonus.

**Preparing Your Environment**

You need Python installed on your system to work with the examples shown here, preferably the latest version. If you do not like running code from the command line, you can use Jupyter Notebook instead, but that is not a requirement for this series.

Create or choose a directory where you want to put your Python script files. Create a file named *stat_001.py* in that directory. Run the file with the following command to check that everything is working:

```
python stat_001.py
```

On some systems, you may need to use *python3 stat_001.py* instead of *python stat_001.py* to run the file with Python 3.

**Population, Sample, and Their Representation in Python**

*Population* denotes the entire collection of items, events, etc. of interest in an experiment or study. Simply stated, the population is the whole collection of things that we want to perform statistical operations on. You can find a broader definition in statistics books, but that is the simplified idea. It will become clearer with the examples below.

A *sample*, on the other hand, is a part of the population. Sometimes the sample can be equal to the population. So, if the population is a universal set, then a sample is a subset of it.

Populations are usually huge, so we need to run our experiment on a part of them, which is why we need a sample. For example, if we are working with prime numbers, then the set of prime numbers is the population, and the first five prime numbers (or so) are a sample. So, how do we represent a population in Python code?

Each member of a population or sample can be simple data, like a number, or a complex object, like a human (for example, when we are interested in a person's age, sex, profession, salary, and so on). If the members are simple and can be represented with a string, number, or other primitive value, we can use a plain Python list to represent a sample (or a set, if the order does not matter and duplicate values are not needed). If the members are complex, we can encapsulate each one in a dictionary, a dictionary-like object, or a custom object, and then put them in a list or a similar data structure. Note that dictionaries themselves cannot be stored in a set, because they are not hashable.

Samples are finite, so we can keep each member in a data structure like a list or set. A population, however, is not always finite (the set of prime numbers, for example), and an infinite population cannot practically be kept in a list or set. If we do not need to represent the population in code, we should not try to. If we do need it, we can use a custom class, a function, or a generator function to represent it. A finite population often ends up in a database or a similar system, from which members are retrieved on demand. In many cases we can also retrieve members of the population from external, heterogeneous sources to build a sample.
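As a rough sketch of the representations described above, the snippet below shows a sample of simple members stored in a list and in a set, and a sample of complex members stored as a list of dictionaries. The variable names and all the values are made up purely for illustration.

```python
# Simple members: a list keeps order, a set ignores order and duplicates.
prime_sample_list = [2, 3, 5, 7, 11]
prime_sample_set = {2, 3, 5, 7, 11}

# Complex members: each person is a dictionary (illustrative, made-up values).
people_sample = [
    {"age": 34, "sex": "F", "profession": "engineer", "salary": 72000},
    {"age": 28, "sex": "M", "profession": "teacher", "salary": 41000},
    {"age": 45, "sex": "F", "profession": "doctor", "salary": 98000},
]

# Pull out a single attribute from every member of the sample.
ages = [person["age"] for person in people_sample]
print(ages)
```

Keeping each member as a dictionary makes it easy to pick the one attribute a given calculation needs, as the last two lines do for age.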

Let's clarify with an example. If we are working with even numbers, then the set of even numbers is the population, and this kind of population is not finite. To represent it, I can create a generator function like the one below:

```
def even_number_population():
    current = 0
    while True:
        current += 1
        # Yield only the even values; odd values are skipped.
        if current % 2 == 0:
            yield current
        else:
            continue
```

Calling this function will return a generator.

```
population_gen = even_number_population()
```

Every time you call *next()* with this generator, you get the next even number:

```
print(next(population_gen))
print(next(population_gen))
print(next(population_gen))
print(next(population_gen))
```

Outputs:

```
2
4
6
8
```

The set of even numbers is infinite, and we do not have infinite memory on our systems to store it. So, we use a generator function to produce the numbers one by one. A generator saves its state, and each time *next()* is invoked on it, it returns the next result. I will discuss generators in detail in another article.

Now, suppose we want a sample of five even numbers from the population. Because a generator remembers where it stopped, first create a fresh one with `population_gen = even_number_population()` (otherwise the sample would continue from 10 instead of starting at 2), and then write:

`sample = [next(population_gen), next(population_gen), next(population_gen), next(population_gen), next(population_gen)]`

Print *sample* and, starting from a fresh generator, the result is:

```
[2, 4, 6, 8, 10]
```

Alternatively, we can use a for loop to iterate over the generator and break once we have the desired number of members.
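The loop-based approach could look something like the sketch below. It repeats the generator function from earlier so that it runs on its own; `sample_size` is a name introduced here for illustration. The second variant uses `itertools.islice` from the standard library, which is one possible shortcut and not something the rest of this article relies on.

```python
from itertools import islice


def even_number_population():
    current = 0
    while True:
        current += 1
        if current % 2 == 0:
            yield current


sample_size = 5  # illustrative name: how many members we want

# Variant 1: plain for loop that stops once the sample is full.
sample = []
for number in even_number_population():
    sample.append(number)
    if len(sample) == sample_size:
        break
print(sample)  # [2, 4, 6, 8, 10]

# Variant 2: islice takes the first sample_size items from a fresh generator.
islice_sample = list(islice(even_number_population(), sample_size))
print(islice_sample)  # [2, 4, 6, 8, 10]
```

Both variants start from a fresh generator, which is why they begin at 2. Without the `break` (or `islice`), the loop would never end, because the population is infinite.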

**Finding an Arithmetic Mean**

Let's look at a very simple statistics problem. We want to find the arithmetic mean of the sample we got in the previous section. The arithmetic mean is calculated by adding all the numbers and dividing the sum by the number of members in the sample. If we were doing this by hand, we would write:

```
a_mean = (2+4+6+8+10)/5
```

In a program, however, we do not hard-code the values, because our data is not static: it is dynamic and can differ under various conditions. In our case the sample is a list. To find out how many members a list contains, we can call the built-in *len()* function with the list as the argument:

```
no_of_members = len(sample)
```

To add the numbers we can run a for loop on the sample:

```
total = 0
for member in sample:
    total = total + member
```

The value of *total* will be *30* after the loop ends.

We could carry out the same operation using the built-in *sum()* function, but we will write the loop ourselves: in future lessons the data will be more complex, and an explicit loop is easier to adapt. So, it's better to practice this way from the get-go.

So, our final code will be:

```
no_of_members = len(sample)
total = 0
for member in sample:
    total = total + member
mean = total / no_of_members
print("Arithmetic mean: " + str(mean))
```

Outputs:

```
Arithmetic mean: 6.0
```
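If the same calculation is needed in more than one place, the steps above could be wrapped in a function. The following is only a sketch: the name `arithmetic_mean` and the choice to raise `ValueError` for an empty sample are assumptions made here, not something prescribed by the article.

```python
def arithmetic_mean(sample):
    """Return the arithmetic mean of an iterable of numbers."""
    total = 0
    count = 0
    for member in sample:
        total = total + member
        count = count + 1
    # Dividing by zero would fail, so report an empty sample explicitly.
    if count == 0:
        raise ValueError("cannot compute the mean of an empty sample")
    return total / count


print("Arithmetic mean: " + str(arithmetic_mean([2, 4, 6, 8, 10])))
# Arithmetic mean: 6.0
```

Counting members inside the loop, instead of calling *len()*, means the function also works when it is handed something that has no length, such as a generator.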

**Conclusion**

This was just the introductory article in the series on statistics with Python that I am writing. In future articles we will move on to more topics and gradually dive deeper into harder areas of statistics. Keep practicing, and try to find a way to represent other types of infinite populations with Python.
