Basic data types encountered in scientific computing in Nim
Most operations using scientific computing packages in Nim will require one of three different data types:
seq[T]
Tensor[T]
DataFrame
to store multiple scalar values (typically float
values).
T
is the typical letter used to indicate generics in Nim. This means the explicit type
will be determined by the argument / desired type to store in the container, for example
seq[int], Tensor[float]
etc.
There are of course further types used in many packages, but these three are typically used to actually store data. Other objects may wrap any of these for different purposes, e.g. numericalnim contains different helper objects for integration or interpolation.
We will now look at each of these three data types individually and discuss how to create variables of each type and what typical use cases are.
seq[T]
- homogeneous, dynamically resizable sequence
seq[T]
is the default, dynamically resizable container from Nim's standard library. As the
single generic argument T
implies it is homogeneous, which means one sequence stores
elements of a single data type.
Their implementation is essentially a pointer to a memory array, the length of the allocated memory as well as the length of elements actually stored in it. We will discuss this further down in section "Length and capacity of a sequence".
In addition to seq[T]
Nim also supports fixed size arrays. While these can be very useful
they won't be discussed here.
The standard library provides different ways to construct a sequence. Let's look at the default two constructors first:
let x = @[0.0, 1.0, 2.0, 3.0]
echo x
echo "Length: ", len(x)
@[0.0, 1.0, 2.0, 3.0] Length: 4
The first constructor explicitly converts a number of elements into a sequence with
4 elements. The length of the sequence can be accessed using len
.
var x = newSeq[float]()
echo "Length: ", x.len
Length: 0
The second way to construct a sequence uses the newSeq
procedure. It receives the
generic type that should be housed in the sequence and as an argument the number of
initial elements (the default being 0).
var x = newSeq[float](4)
echo x
echo "Length: ", x.len
@[0.0, 0.0, 0.0, 0.0] Length: 4
x
then uses the newSeq
constructor to directly construct a sequence of floats of
length 4. All elements in the sequence are initialized to zero!
From here we can modify any created sequence, remove elements or add new elements as
long as the variable is declared as a var
(instead of let
).
Access
Elements in the sequence are accessed using bracket []
access:
let x = @[0.0, 1.0, 2.0, 3.0]
echo x[2]
2.0
Mutation
Basic mutation of elements in the sequence is done using []=
(in Nim terms), which is simply
bracket access and an assignment:
var x = newSeq[float](4)
x[0] = 5.0
echo x
@[5.0, 0.0, 0.0, 0.0]
New elements are added using add
as is typical in Nim:
var x = newSeq[float]()
x.add 10.0
echo x
echo "Length: ", x.len
@[10.0] Length: 1
So y1
now contains 1 element instead of 0.
Deleting elements is also supported, via delete
or del
. Both procedures take the index
to be removed. delete
keeps the order of the sequence intact, whereas del
simply overwrites
the given index with the last element of the sequence and reduces the length by one. Compare:
let x1 = @[0.0, 1.0, 2.0, 3.0]
var x2 = x1
var x3 = x1
echo "Starting from: ", x1
x2.delete(1)
echo "Remove index 1 using `delete`: ", x2
x3.del(1)
echo "Remove index 1 using `del`: ", x3
Starting from: @[0.0, 1.0, 2.0, 3.0] Remove index 1 using `delete`: @[0.0, 2.0, 3.0] Remove index 1 using `del`: @[0.0, 3.0, 2.0]
See how the order of x3
is now different, whereas x2
has the same order just with
index 1 removed.
Length and capacity of a sequence
Consider the following code:
var x = newSeq[int]()
for i in 0 ..< 10:
x.add i
A naive implementation of a sequence would have to reallocate the memory underlying the sequence for
each call to add
. To avoid the overhead of all these copying operations, the implementation
overallocates by a certain amount. This means reallocation is only required if the actual underlying
capacity is exceeded.
This has practical use cases as well. Sometimes we may not know exactly how many elements we will
store in a sequence, but we have a good idea of the order. In those cases we cannot very well create
a sequence with an existing length using newSeq
(if we overestimate we suddenly have a number of
empty entries).
For that usecase we can use newSeqOfCap
. It creates a sequence of length 0 but whose capacity is the
given number:
var x = newSeqOfCap[int](100)
echo "Length: ", x.len
Length: 0
As we can see the sequence is currently empty. But if we add to it, the sequence won't have to reallocate several times. In this way we can often get away with at most one reallocation or zero, if we accept a bit of overallocation.
var x = newSeqOfCap[int](100)
for i in 0 ..< 100:
x.add i
So this operation won't reallocate.
Note: for even more performance critical code there is also newSeqUninitialized
, which creates a
sequence of N elements that are not zero initialized to save one more (possibly useless) loop
over the memory.
Filling a seq with a fixed value
Sometimes we wish to create a sequence that is initialized not to zero, but some other constant
value. For this we can use newSeqWith
from sequtils
:
import sequtils
echo newSeqWith(3, 5.5)
@[5.5, 5.5, 5.5]
which takes as the first argument the size of the resulting sequence and as the second argument the value to initialize all values to.
Note: this can also be used to create nested sequences:
echo newSeqWith(3, newSeqWith(3, 5))
@[@[5, 5, 5], @[5, 5, 5], @[5, 5, 5]]
which gives us a nested sequence of seq[seq[int]]
where each element is a sequence of
integers with value 5.
A few more typical ways to create sequences
To finish of this section, let's look at a few more sequence constructors that are often useful.
Nim supports slices using the syntax a .. b
, which includes all values from a
to including b
.
Together with toSeq
it can be used to generate a sequence:
echo toSeq(10 .. 14)
@[10, 11, 12, 13, 14]
This essentially takes the role of arange
from numpy. Of course this only generates sequences of integers.
For succinctness (but not performance) we can convert such a sequence using mapIt
to map
each element from an input type to some other type:
echo toSeq(10 .. 14).mapIt(it.float)
@[10.0, 11.0, 12.0, 13.0, 14.0]
Returns a sequence of floats instead.
Similarly, it is often desirable to get a linearly spaced sequence of numbers. numericalnim
also
provides a linspace
implementation. Let's create 5 evenly spaced points between 1 and 2:
import numericalnim
echo linspace(1.0, 2.0, 5)
@[1.0, 1.25, 1.5, 1.75, 2.0]
Finally, one may need a sequence of randomly sampled numbers. The random
module of the Nim
standard library provides a rand
procedure we can combine with mapIt
:
import random
randomize()
echo toSeq(0 ..< 5).mapIt(rand(10.0))
@[0.8769893891051761, 8.003552359258268, 0.1919852990291604, 9.043200846330667, 1.789818960578717]
samples 5 floating point numbers between 0 and 10.
Tensor[T]
- an ND-array type from Arraymancer
Arraymancer provides an ND-array type called Tensor[T]
that is best compared to a numpy
ndarray. Same as a sequence seq[T]
it can only contain a single type. In contrast to it
however, it cannot be resized easily (only reshaped).
Under the hood the data is stored as a pointer + length pair for types that can be copied
using copyMem
(Nim's memcpy
). Otherwise it contains a seq[T]
for the data. The major
difference between a sequence and a tensor is the ability to handle multidimensional data
efficiently.
In case of a seq[T]
we either have to manually handle the indexing of the sequence (if we
store ND data in a 1D sequence) or deal with the inefficiencies of a nested sequence seq[seq[T]]
.
In that case every access requires an additional pointer indirection.
let x = @[ @[1, 2, 3], @[4, 5, 6] ]
echo x[1][0]
4
The access [1][0]
in this example first returns a sequence, which we have to dereference
again to get to an element. This makes accessing data expensive.
An Arraymancer tensor on the other hand always stores data in a one-dimensional data storage.
Not only does it make iterating over and accessing data faster, it also allows for essentially
free reshaping of the data, because the shape is only a piece of meta data.
Another important bit of information is that tensors have reference semantics. That means assigning a tensor to a new variable and modifying that variable also modifies the initial tensor! This is for efficiency reasons to not copy all the data for each assignment.
Two most basic ways to create are shown below:
import arraymancer
let t = @[1.0, 2.0, 3.0].toTensor
First we can just create a tensor from a (possibly nested) sequence or array using toTensor
.
Secondly:
let t = newTensor[float](9)
This is the default tensor constructor. It creates a tensor of type Tensor[float]
with
10 elements that is zero initialized. If multiple elements are given to the procedure a tensor
of different shape is created.
let t = newTensor[float](3, 3)
creates a tensor 2 dimensional tensor of size 3 in both dimensions (essentiall a 3x3 matrix).
Note that due to the shape being a piece of meta data, it is cheap to convert from one shape
to another using reshape
.
let t = newTensor[float](9).reshape(3, 3)
This essentially does not have any meaningful overhead over the creation of t3
above.
Some more ways to construct a tensor:
let t1 = zeros[float](9) ## a tensor that is explicit 0, the default
let t2 = ones[float](9) ## a tensor that is initialized to 1
let t3 = newTensorWith[float]([3, 3], 5) ## a 3x3 tensor initialized to 5
let t4 = newTensorUninit[float](10) ## a tensor that is *not* initialized
let t5 = arange(0, 10) ## the range 0 to 10 as a `Tensor[int]`
let t6 = linspace(0.0, 10.0, 1000) ## 1000 linearly spaced points between 0 and 10
These are only a few common ways to create a tensor.
Access and mutation
Arraymancer tensors are very similar to the Nim standard library seq[T]
in terms of
their element access and element mutation, with the aforementioned difference of reference
semantics.
However, because tensors deal with possibly multidimensional data, there are ways to slice and select parts of a tensor using syntax comparable to numpy's fancy indexing. Furthermore, support for element-wise operations between multiple tensors are supported.
As we won't make use of that in this tutorial, we won't cover it here. See the Arraymancer tutorial section to get an idea.
More
Of course Arraymancer provides a large amount of additional functionality, starting from linear algebra, to statistics, machine learning and more. View the full documentation here: