Martin McBride, 2019-09-14
Tags: data types, efficiency
numpy supports five main data types - ints, unsigned ints, floats, complex numbers, and booleans.
Integers in Python can represent positive or negative numbers of any size. That is because Python integers are objects, and the implementation automatically grabs more memory if necessary to store very large values.
Integers in numpy are very different. An integer occupies a fixed number of bytes. For example, the type
np.int32 occupies exactly 4 bytes of memory (a byte contains 8 bits, so 4 bytes is 32 bits, hence
int32). These are called primitive types because they aren't objects, they are just data bytes stored directly in memory.
The reasons for using primitive types are explained in detail in the article on numpy efficiency. In summary:
- An array of primitive types takes a lot less memory than a list of Python integer objects.
- Accessing primitive values is faster.
- Primitive types don't require garbage collection.
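The memory saving is easy to see in practice. This sketch compares the storage used by a numpy array of 32-bit integers with the storage of the equivalent Python list (the exact list size will vary between Python versions, and `sys.getsizeof` only counts the list itself, not the integer objects it points to):

```python
import sys
import numpy as np

# A million 32-bit integers stored as primitives.
a = np.arange(1_000_000, dtype=np.int32)
print(a.nbytes)  # 4 bytes per element: 4000000

# The equivalent Python list - getsizeof counts only the pointers,
# each int object takes further memory on top of this.
lst = list(range(1_000_000))
print(sys.getsizeof(lst))
```

Even ignoring the integer objects themselves, the list's pointer storage alone is typically about twice the size of the whole numpy array on a 64-bit system.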
In fact, numpy provides several different integer sizes:
| Type | Size (bytes) | Range |
|---|---|---|
| np.int8 | 1 | -128 to 127 |
| np.int16 | 2 | -32768 to 32767 |
| np.int32 | 4 | -2147483648 to 2147483647 |
| np.int64 | 8 | -9223372036854775808 to 9223372036854775807 |
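You can confirm these sizes directly: every numpy array has an `itemsize` attribute giving the number of bytes per element. A quick sketch:

```python
import numpy as np

# Each fixed-size integer type occupies exactly its stated number of bytes.
for t in (np.int8, np.int16, np.int32, np.int64):
    a = np.array([1, 2, 3], dtype=t)
    print(a.dtype, a.itemsize)  # int8 1, int16 2, int32 4, int64 8
```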
There are a couple of reasons for this. The first is fairly obvious: if you are using data that has a limited range, there is no point using more memory than you need. For example, sound data is often stored using 16 bits per sample (ie the sound is represented by an array of 16 bit values). Storing this data as 64 bit integers would make no sense, as you would be using 4 times as much memory for no reason.
The second reason is slightly less obvious. Some applications use a mix of Python and C code for efficiency. With numpy, it is possible to pass a pointer to the array data into a C function, so that the C code can access the data in memory without the need to make a copy of it. This can improve efficiency when dealing with very large arrays. For this to work, the data needs to be stored in the format the C code is expecting. So if the C code is expecting an array of 16 bit integers, it is useful to be able to specify that in numpy. We won't be covering that in these tutorials, it is quite specialised.
Unsigned integers are similar to normal integers, but they can only hold non-negative values. Here are the available types:
| Type | Size (bytes) | Range |
|---|---|---|
| np.uint8 | 1 | 0 to 255 |
| np.uint16 | 2 | 0 to 65535 |
| np.uint32 | 4 | 0 to 4294967295 |
| np.uint64 | 8 | 0 to 18446744073709551615 |
Unsigned integers are useful for data that can never be negative, for example population data. The population of a town can never be less than zero.
The advantage of unsigned data is that it can represent larger positive numbers than signed data. An
int8 goes up to 127, but a
uint8 goes up to 255.
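numpy can report these ranges for you via `np.iinfo`, which returns the minimum and maximum representable values for any integer type. For example:

```python
import numpy as np

# iinfo reports the representable range of an integer type.
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)    # -128 127
print(np.iinfo(np.uint8).min, np.iinfo(np.uint8).max)  # 0 255
```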
numpy floating point numbers also have different sizes (usually called precisions). There are two types:
| Type | Size (bytes) | Range | Precision |
|---|---|---|---|
| np.float32 | 4 | ±1.18×10⁻³⁸ to ±3.4×10³⁸ | 7 to 8 decimal digits |
| np.float64 | 8 | ±2.23×10⁻³⁰⁸ to ±1.80×10³⁰⁸ | 15 to 16 decimal digits |
float64 numbers store floating point numbers in the same way as a Python
float value. They are sometimes called double precision.
float32 numbers take half as much storage as
float64, but they have a considerably smaller range and lower precision. They are sometimes called single precision.
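The difference in precision is easy to demonstrate. Storing the same value as float32 and float64 shows how many significant digits survive (the float32 value is rounded to about 7 digits):

```python
import numpy as np

# The same value stored at single and double precision.
x = np.float32(1.2345678901234567)
y = np.float64(1.2345678901234567)
print(x)  # 1.2345679 - rounded to single precision
print(y)  # 1.2345678901234567
```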
A complex number consists of two floating point numbers, one representing the real part and one representing the imaginary part. If you have not met complex numbers before, there is a wikipedia article on the topic.
| Type | Size (bytes) | Contents |
|---|---|---|
| np.complex64 | 8 | Two 32-bit floats |
| np.complex128 | 16 | Two 64-bit floats |
complex128 is equivalent to the Python built-in complex type.
numpy also supports boolean values. A
bool is one byte in size, with 0 representing false and any non-zero value representing true.
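Boolean arrays most often arise from comparisons on other arrays. A short sketch showing the type and its one-byte size:

```python
import numpy as np

# Comparing an array element-wise produces a boolean array.
a = np.array([1, 0, 3, 0], dtype=np.int32)
mask = a > 0
print(mask)                       # [ True False  True False]
print(mask.dtype, mask.itemsize)  # bool 1
```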
Setting the data type
All of the functions available for creating numpy arrays have an optional parameter
dtype that allows you to specify the data type (such as
np.float64 etc). For example:
a = np.zeros((2, 3), dtype=np.int32)
This creates an array of zeros, 2 rows by 3 columns, with data type int32:
[[0 0 0]
 [0 0 0]]
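If you already have an array, you can check its type via the `dtype` attribute, and convert it to a different type with `astype`, which creates a new converted array:

```python
import numpy as np

# Create an int32 array, then convert it to float64.
a = np.zeros((2, 3), dtype=np.int32)
b = a.astype(np.float64)
print(a.dtype)  # int32
print(b.dtype)  # float64
```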
System dependent types
numpy also provides a number of types that don't specify a particular size. These include
np.int and np.long, amongst others. There are also unsigned versions.
These types have system dependent sizes. For example,
np.int might be equivalent to
np.int32 or np.int64 depending on the system it is running on. It depends on the type of processor, the type of operating system, and perhaps the version of the operating system.
In general, don't use these types. They are provided for situations where numpy is passing data in memory to a library written in C. For historical reasons, C has always had system dependent types like
short whose exact size can vary between systems. If you were interfacing to such a library you would need to use compatible types. Unless you are using a library that specifically tells you to use these types, don't use them. Stick to the fixed-size types shown above instead.
Some functions (such as
zeros used above) allow you to select an
order for the data. The choices are C-style or Fortran-style ordering (sometimes a couple of other variants too). Again, these options are intended for use if you are passing data in memory to a library written in C (or even Fortran). Unless you have good reason to change it, just use the default option.
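For completeness, here is a sketch of how the `order` parameter is passed. The two arrays below hold identical values; only the memory layout differs, which you can inspect via the `flags` attribute:

```python
import numpy as np

# The same logical array, laid out row-major (C) or column-major (Fortran).
c = np.zeros((2, 3), order='C')
f = np.zeros((2, 3), order='F')
print(c.flags['C_CONTIGUOUS'])  # True
print(f.flags['F_CONTIGUOUS'])  # True
```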