Data wrangling using the DataFrame from Datamancer
The third major data type often encountered is a DataFrame.
Data frames can be thought of as multiple, named tensors of possibly different types in one object. A data frame library then is supposed to make working with such data as convenient and powerful as possible.
In the specific case of Datamancer, the data structure is essentially an OrderedTable[string, Column], where Column is a variant object storing one of 5 different Tensor[T] types.
In order to use Datamancer, you must first import it. In addition, in this tutorial we will also import the Arraymancer tensor library, which we will use to demonstrate that you can create data frames from Arraymancer tensors.
import datamancer
import arraymancer
Construction of a DataFrame
A DataFrame from the Datamancer library can be constructed in two different ways: either from an input CSV file or from existing sequences or tensors.
Construction from a CSV file is performed using the readCsv procedure. It provides multiple different options (different separators, skipping lines, header symbols, ...), but for a regular comma-separated value file the defaults are fine. For example:
let df1 = readCsv("foo.csv")
echo df1
Alternatively, if one already has a mix of sequences and tensors of the same length:
let s1 = [1, 2, 3]
let s2 = @["hello", "foo", "bar"]
let s3 = @[1.5, 2.5, 3.5].toTensor
let df2 = toDf(s1, s2, s3)
echo df2
echo "Column names: ", df2.getKeys() ## getKeys only returns the column names
DataFrame with 3 columns and 3 rows:
     Idx      s1      s2      s3
  dtype:     int  string   float
       0       1   hello     1.5
       1       2     foo     2.5
       2       3     bar     3.5
Column names: @["s1", "s2", "s3"]
which creates a data frame with three columns named "s1", "s2", "s3". We can see that mixing different input types is not a problem. The supported types are:
- float
- int
- string
- bool
and a mix of them in one column.
Printing a data frame by default prints the first 20 rows. This can be adjusted by calling the pretty procedure manually and handing it the number of rows (-1 for all). In addition, one can always view a data frame in the browser by calling showBrowser(df), where df is the data frame to view.
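For example, a minimal sketch using the df2 data frame from above (pretty takes the number of rows to print, as described above, and showBrowser opens the data frame in the default browser):
echo df2.pretty(5)   ## print only the first 5 rows
echo df2.pretty(-1)  ## print all rows
showBrowser(df2)     ## view the full data frame in the browser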
If one wishes to name the columns differently right from construction (they can also be renamed later), it is done like this:
let df3 = toDf({"Id" : s1, "Word" : s2, "Number" : s3})
echo df3
DataFrame with 3 columns and 3 rows:
     Idx      Id    Word  Number
  dtype:     int  string   float
       0       1   hello     1.5
       1       2     foo     2.5
       2       3     bar     3.5
Finally, one can also create a DataFrame starting from an empty object and assigning sequences, tensors or scalar values manually:
var df = newDataFrame()
df["x"] = @[1, 2, 3] ## assign a sequence. This sets the `DataFrame` length to 3
df["y"] = @[4.0, 5.0, 6.0].toTensor ## assign a tensor. Input now `must` match length 3
try:
  df["z"] = @[5, 6] ## raises, as the length does not match the data frame length of 3
except ValueError: discard ## type of exception might change in the future
df["z"] = constantColumn(1, df.len) ## assign a constant column of integers.
Accessing data underlying a column
The data stored in a column of a data frame can always be accessed easily. Because the data is stored in a variant object, the user needs to supply the data type with which to read the data. Nim does not allow return type overloading, which means we cannot use the runtime information about the types to return the "correct" tensor. All we can guarantee is that accessing the data with the correct type is a no-op.
This has the downside that an invalid type will produce a runtime error. On the upside it allows us to perform type conversions directly, for instance reading an integer column as floats or any column as strings.
The syntax is as follows:
let df = toDf({"x" : @[1, 2, 3], "y" : @[4.0, 5.0, 6.0]})
let t1: Tensor[int] = df["x", int] ## this is a no-op
let t2: Tensor[float] = df["x", float] ## converts integers to floats
let t3: Tensor[float] = df["y", float] ## also a no-op
let t4: Tensor[string] = df["x", string] ## convert to string
try:
  let t5: Tensor[bool] = df["x", bool] ## would produce a runtime error
except ValueError: discard ## type of exception might be changed in the future
where we indicate the types explicitly on the left hand side for clarity.
This means we can in principle always access individual elements of a data frame column by getting the tensor and accessing elements from it. Of course this has some overhead, but due to reference semantics it is relatively cheap (no data is copied, unless type conversions need to be performed).
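As a small illustration (reusing the df from the snippet above), accessing a single element is simply tensor indexing:
let xT = df["x", int]   ## a no-op for the stored type, no data copied
echo xT[0]              ## first element of column `x` -> 1
echo df["y", float][2]  ## or in one expression -> 6.0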
Computing single column aggregations
As we saw in the previous section, accessing a tensor of a column is cheap. We can use that to perform aggregations on full columns:
let df = toDf({"x" : @[1, 2, 3], "y" : @[4.0, 5.0, 6.0]})
echo df["x", int].sum
echo df["y", float].mean
6
5.0
and in that sense any operation acting on tensors can be used.
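For example, other aggregations provided for Arraymancer tensors should work the same way (a small sketch; max and std here are the Arraymancer tensor procedures):
echo df["x", int].max    ## largest element of column `x` -> 3
echo df["y", float].std  ## standard deviation of column `y`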
Data frame operations
In the more general case (the reason one uses a data frame in the first place) we don't want to only consider a single column.
Many different operations are supported, but can be grouped into a few general procedures.
Some of the procedures of Datamancer take so-called FormulaNodes. They are essentially a domain specific language to succinctly express operations on data frame columns without having to refer to the data frame explicitly. Their basic construction and usage should become clear in the code below. The Datamancer documentation contains a much deeper introduction into the specifics.
select - Selecting a subset of columns
If we have a data frame with multiple columns we may want to keep only a subset of these going forward. This can be achieved using select:
var df = newDataFrame()
for i in 0 ..< 100:
df["x" & $i] = @[1 + i, 2 + i, 3 + i]
echo df.select("x1", "x50", "x99")
DataFrame with 3 columns and 3 rows:
     Idx      x1     x50     x99
  dtype:     int     int     int
       0       2      51     100
       1       3      52     101
       2       4      53     102
which drops every column not selected.
The inverse is also possible using drop:
let df = toDf({"x" : @[1, 2, 3], "y" : @[4.0, 5.0, 6.0], "z" : @["a", "b", "c"]})
echo df.drop("x")
DataFrame with 2 columns and 3 rows:
     Idx       y       z
  dtype:   float  string
       0       4       a
       1       5       b
       2       6       c
rename - Renaming a column
rename, as the name implies, is used to rename columns. Usage is rather simple. We'll get our first glance at the f{} macro to generate a FormulaNode here:
let df = toDf({"x" : @[1, 2, 3], "y" : @[4.0, 5.0, 6.0]})
echo df.rename(f{"foo" <- "x"})
DataFrame with 2 columns and 3 rows:
     Idx       y     foo
  dtype:   float     int
       0       4       1
       1       5       2
       2       6       3
So we can see that we simply assign the old name "x" to the new name "foo" using the <- operator.
arrange - Sorting a data frame
Often we wish to sort a data frame by one or more columns. This is done using arrange. It can take one or more columns to sort by; for multiple columns the order of the inputs decides the sorting precedence, with later columns only used to break ties between the former.
The sort order is handled in the same way as in Nim's standard library, i.e. using an order argument that takes either SortOrder.Ascending or SortOrder.Descending. The default is ascending order.
let df = toDf({ "x" : @[4, 2, 7, 4], "y" : @[2.3, 7.1, 3.3, 1.0],
"z" : @["b", "c", "d", "a"]})
echo df.arrange("x") ## sort by `x` in ascending order (default)
echo df.arrange("x", order = SortOrder.Descending) ## sort in descending order
echo df.arrange(["x", "z"]) ## sort by two columns, first `x` then `z` to break ties
DataFrame with 3 columns and 4 rows:
     Idx       x       y       z
  dtype:     int   float  string
       0       2     7.1       c
       1       4     2.3       b
       2       4       1       a
       3       7     3.3       d
DataFrame with 3 columns and 4 rows:
     Idx       x       y       z
  dtype:     int   float  string
       0       7     3.3       d
       1       4     2.3       b
       2       4       1       a
       3       2     7.1       c
DataFrame with 3 columns and 4 rows:
     Idx       x       y       z
  dtype:     int   float  string
       0       2     7.1       c
       1       4       1       a
       2       4     2.3       b
       3       7     3.3       d
unique - Removing duplicate rows
Another useful operation is the removal of duplicate entries. unique is the procedure to use. If no argument is given, uniqueness is determined based on all existing columns. This is not always the desired behavior, which is why unique accepts a variable number of columns. Then only uniqueness among these columns is considered.
let df = toDf({ "x" : @[1, 2, 2, 2, 4], "y" : @[5.0, 6.0, 7.0, 8.0, 9.0],
"z" : @["a", "b", "b", "d", "e"]})
echo df.unique() ## consider uniqueness of all columns, nothing removed
echo df.unique("x") ## only consider `x`, only keeps keeps 1st, 2nd, last row
echo df.unique(["x", "z"]) ## considers `x` and `z`, one more unique (4th row)
DataFrame with 3 columns and 5 rows:
     Idx       x       y       z
  dtype:     int   float  string
       0       1       5       a
       1       2       6       b
       2       2       7       b
       3       2       8       d
       4       4       9       e
DataFrame with 3 columns and 3 rows:
     Idx       x       y       z
  dtype:     int   float  string
       0       1       5       a
       1       2       6       b
       2       4       9       e
DataFrame with 3 columns and 4 rows:
     Idx       x       y       z
  dtype:     int   float  string
       0       1       5       a
       1       2       6       b
       2       2       8       d
       3       4       9       e
mutate - Creating new or modifying existing columns
mutate is the procedure to use to add new columns to a data frame or to modify existing ones. For this procedure we again need to hand over formulas built with the f{} macro. Here it is advisable to name the formulas. Instead of the assignment operator <- used above, we now use the "x depends on y" operator ~.
Further, to refer to a column in the computation we perform, we use accented quotes. This is all the complexity of that macro we will discuss in this introduction.
Let's compute the sum of two columns to get a feel:
let df = toDf({ "x" : @[1, 2, 3], "y" : @[10, 11, 12] })
echo df.mutate(f{"x+y" ~ `x` + `y`})
DataFrame with 3 columns and 3 rows:
     Idx       x       y     x+y
  dtype:     int     int   float
       0       1      10      11
       1       2      11      13
       2       3      12      15
Of course we can use constants and local Nim symbols as well:
let df = toDf({ "x" : @[1, 2, 3]})
echo df.mutate(f{"x+5" ~ `x` + 5 })
let y = 2.0
echo df.mutate(f{"x + local y" ~ `x` + y})
DataFrame with 2 columns and 3 rows:
     Idx       x     x+5
  dtype:     int     int
       0       1       6
       1       2       7
       2       3       8
DataFrame with 2 columns and 3 rows:
     Idx       x  x + local y
  dtype:     int        float
       0       1            3
       1       2            4
       2       3            5
Note: There is a slight subtlety at play here. If you look closely at the output of these two mutate commands, you see that in the first case the resulting column is of type int, whereas in the second case it is float. That is because the type of the column is deduced based on the types in the rest of the formula. 5 is an int, so x is read as integers in the first case, whereas y is a float and so x is read as a float. See the Datamancer documentation for details and how to specify the types manually.
And as stated we can also overwrite columns:
let df = toDf({ "x" : @[1, 2, 3] })
echo df.mutate(f{"x" ~ `x` + `x`})
DataFrame with 1 columns and 3 rows:
     Idx       x
  dtype:   float
       0       2
       1       4
       2       6
Under the hood these formulas are converted into a closure that takes a data frame as an input. The column references are extracted and converted into a preamble that reads the corresponding tensors. Then we run over the relevant tensors and perform the described operation for each element. The result is assigned to a resulting tensor, which is assigned as the new column.
The only restriction on the body of the formula is that it's a valid Nim expression (if one mentally replaces column references by tensor elements) that returns a value of a valid data type for a data frame.
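To illustrate, a small sketch with a slightly more involved formula body (the column names here are made up purely for this example):
let dfB = toDf({ "x" : @[1, 2, 3, 4] })
echo dfB.mutate(f{"xAbsDiff" ~ abs(`x` - 2)},  ## any Nim expression on the elements works
                f{"isSmall" ~ `x` < 3})        ## a boolean result yields a bool column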
If one wishes the same behavior as mutate, but does not need the columns that are not explicitly created / modified by a formula, there is transmute for this purpose. Otherwise it is equivalent to mutate.
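A minimal sketch of transmute, reusing the column sum from the first mutate example (only the newly computed column should remain in the result):
let df = toDf({ "x" : @[1, 2, 3], "y" : @[10, 11, 12] })
echo df.transmute(f{"x+y" ~ `x` + `y`})  ## the result contains only the `x+y` column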
filter - Removing rows based on a predicate
The formulas described above can of course also return boolean values. In combination with the filter procedure this allows us to remove rows of a data frame that fail to pass a condition (or "predicate").
let df = toDf({ "x" : @[1, 2, 3, 4, 5], "y" : @["a", "b", "c", "d", "e"] })
echo df.filter(f{ `x` < 3 or `y` == "e" })
DataFrame with 2 columns and 3 rows:
     Idx       x       y
  dtype:     int  string
       0       1       a
       1       2       b
       2       5       e
summarize - Computing aggregations on a full data frame
The approach described in "Computing single column aggregations" can be useful for simple single column operations, but it does not scale well. That is what summarize is for. Here we use the last operator of the f{} macro, namely the reduction operator <<:
let df = toDf({ "x" : @[1, 2, 3, 4, 5], "y" : @[5, 10, 15, 20, 25] })
echo df.summarize(f{float: mean(`x`) }) ## compute mean, auto creates a column name
echo df.summarize(f{float: "mean(x)" << mean(`x`) }) ## same but with a custom name
echo df.summarize(f{"mean(x)+sum(y)" << mean(`x`) + sum(`y`) })
DataFrame with 1 columns and 1 rows:
     Idx  (mean x)
  dtype:     float
       0         3
DataFrame with 1 columns and 1 rows:
     Idx  mean(x)
  dtype:    float
       0        3
DataFrame with 1 columns and 1 rows:
     Idx  mean(x)+sum(y)
  dtype:           float
       0              78
Keen eyes will notice the float: at the beginning of the first two examples. This is a "type hint" for the formula, because the symbol mean is overloaded in Nim, not by a few distinct procedures, but generically. At this moment there are no heuristics involved to choose one type over another in a generic case. Therefore, we don't know what type x should be read as, so we overwrite the input type manually and give the macro a hint.
If we leave out the type hint, we will be greeted with an error message listing the type information found for mean and asking us to consider giving such a type hint.
The situation is slightly different for the last case, in which an addition is involved. Due to some heuristic rules involving the most basic operators (mathematical and boolean), we can determine here that the input is probably supposed to be float.
group_by
summarize and the other procedures can be spiced up if used in combination with group_by.
group_by by itself doesn't perform any operations. It simply returns a new data frame with the exact same data that is now "grouped" by one or more columns. These columns should contain discrete data. This grouping can be used (manually or indirectly) via the groups iterator. It yields all "sub data frames" contained in the grouped data frame. These sub data frames are those of duplicate entries in the columns we have grouped by. It essentially yields as a sub data frame everything that would be reduced to a single row if unique were used on the same columns as grouped by.
This should become clearer with an example:
let df = toDf({ "Class" : @["A", "C", "B", "B", "A", "C", "C"],
"Num" : @[1, 5, 3, 4, 8, 7, 2] })
.group_by("Class")
for t, subDf in groups(df):
echo "Sub data frame: ", t
echo subDf
Sub data frame: @[("Class", (kind: VString, str: "A"))]
DataFrame with 2 columns and 2 rows:
     Idx   Class     Num
  dtype:  string     int
       0       A       1
       1       A       8
Sub data frame: @[("Class", (kind: VString, str: "B"))]
DataFrame with 2 columns and 2 rows:
     Idx   Class     Num
  dtype:  string     int
       0       B       3
       1       B       4
Sub data frame: @[("Class", (kind: VString, str: "C"))]
DataFrame with 2 columns and 3 rows:
     Idx   Class     Num
  dtype:  string     int
       0       C       5
       1       C       7
       2       C       2
We can see we have 3 sub data frames, one for each discrete value found in column Class.
The actually interesting applications of group_by, though, are its combinations with the other procedures shown above, in particular summarize, filter and mutate. For a grouped data frame these operations are then performed group-wise. Operations that only use information of a single row are unaffected by this. But any formula that includes a reference to a full column (mean, sum, ...) will compute this value per group.
A few examples:
summarize:
let df = toDf({ "Class" : @["A", "C", "B", "B", "A", "C", "C", "A", "B"],
"Num" : @[1, 5, 3, 4, 8, 7, 2, 0, 0] })
echo df.group_by("Class").summarize(f{int: "sum(Num)" << sum(`Num`)})
DataFrame with 2 columns and 3 rows:
     Idx   Class  sum(Num)
  dtype:  string       int
       0       A         9
       1       B         7
       2       C        14
We can see this computes the sum for each class now.
filter:
let df = toDf({ "Class" : @["A", "C", "B", "B", "A", "C", "C", "A", "B"],
"Num" : @[1, 5, 3, 4, 8, 7, 2, 0, 0] })
echo df.group_by("Class").filter(f{ sum(`Num`) <= 9 })
DataFrame with 2 columns and 6 rows:
     Idx   Class     Num
  dtype:  string     int
       0       A       1
       1       A       8
       2       A       0
       3       B       3
       4       B       4
       5       B       0
and again, the filtering is done per group. A filtering operation that uses a reducing formula as input would usually not make much sense on an ungrouped data frame anyway.
mutate:
let df = toDf({ "Class" : @["A", "C", "B", "B", "A", "C", "C", "A", "B"],
"Num" : @[1, 5, 3, 4, 8, 7, 2, 0, 0] })
echo df.group_by("Class").mutate(f{"Num - mean" ~ `Num` - mean(`Num`)})
DataFrame with 3 columns and 9 rows:
     Idx   Class     Num  Num - mean
  dtype:  string     int       float
       0       A       1          -2
       1       A       8           5
       2       A       0          -3
       3       B       3      0.6667
       4       B       4       1.667
       5       B       0      -2.333
       6       C       5      0.3333
       7       C       7       2.333
       8       C       2      -2.667
where we subtract the mean (of each class!) from each observation.
If one uses multiple columns to group by, we get instead the sub data frame corresponding to each unique combination of discrete values. Feel free to play around and try out such an example!
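As a starting point, here is a minimal sketch of grouping by two discrete columns (the data is made up purely for illustration):
let df = toDf({ "Class" : @["A", "A", "B", "B", "A", "B"],
                "Kind"  : @["x", "y", "x", "y", "x", "y"],
                "Num"   : @[1, 2, 3, 4, 5, 6] })
## one sub data frame per unique (Class, Kind) combination
for t, subDf in groups(df.group_by(["Class", "Kind"])):
  echo "Group: ", t
  echo subDf
## aggregations are then computed per (Class, Kind) group as well
echo df.group_by(["Class", "Kind"]).summarize(f{int: "sum(Num)" << sum(`Num`)})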
gather - Converting a wide format data frame to long format
As one of the last things to cover, we will quickly talk about data frames in wide and long format. In a way, the example data frame above with a column "Class" and a column "Num" can be considered a data frame in "long" format, in the sense that we have one discrete column "Class" that maps to different "Num" values. Because the column "Class" contains discrete values, we can imagine "transposing" the data frame to columns "A", "B", "C" instead, with the values for each of these groups as the values in the corresponding columns. Let's look at:
- this data frame
- the output of grouping that data frame by "Class"
- the same data frame in wide format
for clarity:
let dfLong = toDf({ "Class" : @["A", "C", "B", "B", "A", "C", "C", "A", "B"],
                    "Num" : @[1, 5, 3, 4, 8, 7, 2, 0, 0] })
echo "Long format:\n", dfLong
echo "----------------------------------------"
echo "Grouping by `Class`:"
for _, subDf in groups(dfLong.group_by("Class")):
  echo subDf
echo "----------------------------------------"
let dfWide = toDf({"A" : [1, 8, 0], "B" : [3, 4, 0], "C" : [5, 7, 2]})
echo "Wide format:\n", dfWide
Long format:
DataFrame with 2 columns and 9 rows:
     Idx   Class     Num
  dtype:  string     int
       0       A       1
       1       C       5
       2       B       3
       3       B       4
       4       A       8
       5       C       7
       6       C       2
       7       A       0
       8       B       0
----------------------------------------
Grouping by `Class`:
DataFrame with 2 columns and 3 rows:
     Idx   Class     Num
  dtype:  string     int
       0       A       1
       1       A       8
       2       A       0
DataFrame with 2 columns and 3 rows:
     Idx   Class     Num
  dtype:  string     int
       0       B       3
       1       B       4
       2       B       0
DataFrame with 2 columns and 3 rows:
     Idx   Class     Num
  dtype:  string     int
       0       C       5
       1       C       7
       2       C       2
----------------------------------------
Wide format:
DataFrame with 3 columns and 3 rows:
     Idx       A       B       C
  dtype:     int     int     int
       0       1       3       5
       1       8       4       7
       2       0       0       2
As we can see, the difference between wide and long format is the way the group_by results are "assembled": as different columns for each group (wide format) or as two (key, value) columns (long format).
The conversion from wide -> long format is always possible. But the mapping of long -> wide format requires there to be the same number of entries for each class. If that condition is not satisfied, there will be missing values in the columns of the separate classes.
Depending on circumstances one might have input data in either format. However, in particular for plotting purposes the long format is often more convenient, as it allows classifying the discrete classes using different colors, shapes, etc. automatically.
Therefore, there is the gather procedure to convert a wide format data frame into a long format one. It takes the columns to be "gathered", the name of the column containing the "keys" (the column from which a value came) and a name for the column of the "values" that were "gathered". We can use it to recover the ("Class", "Num") data frame from the last one:
let df = toDf({"A" : [1, 8, 0], "B" : [3, 4, 0], "C" : [5, 7, 2]})
echo df.gather(df.getKeys(), ## get all keys to gather
               key = "Class", ## the name of the `key` column
               value = "Num")
DataFrame with 2 columns and 9 rows:
     Idx   Class     Num
  dtype:  string   float
       0       A       1
       1       A       8
       2       A       0
       3       B       3
       4       B       4
       5       B       0
       6       C       5
       7       C       7
       8       C       2
which is, up to the order of the rows (and the Num column now being read as float), the same data frame as in the examples before.
(Note: the inverse procedure to convert a long format data frame back into wide format is currently still missing. It will be added soon.)
innerJoin - Joining two data frames by a common column
As the last common example of data frame operations, we shall consider joining two data frames by a common column.
let df1 = toDf({ "Class" : @["A", "B", "C", "D", "E"],
                 "Num" : @[1, 5, 3, 4, 6] })
let df2 = toDf({ "Class" : ["E", "B", "A", "D", "C"],
                 "Ids" : @[123, 124, 125, 126, 127] })
echo innerJoin(df1, df2, by = "Class")
DataFrame with 3 columns and 5 rows:
     Idx   Class     Num     Ids
  dtype:  object     int     int
       0       A       1     125
       1       B       5     124
       2       C       3     127
       3       D       4     126
       4       E       6     123
where we joined two data frames by the "Class" column, resulting in a data frame with 3 columns. The matching rows for the classes were put together aligning corresponding "Num" and "Ids" values.
Of course joining two data frames is only a sensible operation for a column containing discrete data, so that equal elements in that column can be found in both input data frames.
This already covers the majority of the API of Datamancer. There are more procedures, but the presented ones should be all that is needed in the vast majority of use cases.
Check out the Datamancer documentation for a full picture and in particular for a better and more thorough introduction to the formula syntax.