datamancer


Datamancer overview

The datamancer package is a dataframe library, which (as a companion to ggplotnim) is also heavily inspired by a specific R library: in this case dplyr, the dataframe library of the tidyverse.

What follows is a basic introduction covering all fundamentals. We won't cover the full API in this document, but rather aim to provide enough understanding that using the full library becomes easy.

What is a dataframe?

A dataframe is a data structure that consists of multiple 1-dimensional datasets of equal length - but possibly different types - that have names associated with them.

From an abstract perspective it is a set of heterogeneous arrays (array not in the sense of compile-time sized Nim arrays, but of general flat, contiguous data structures) stored in a hash table. Each entry in the table is called a column and the keys represent the column names.

In a sense then a dataframe is a data structure similar to what a spreadsheet represents, or what is naturally stored in a CSV file.

One useful distinction about the nature of an individual column of a dataframe is whether the data it describes is continuous (up to N different values for a dataframe of length N) or discrete (a small number M of different values compared to the possibly much larger number N of elements).

What is the point of a dataframe?

A huge amount of data about the real world or most physical systems can be described by individual numbers (scalars) or sets of such. For one system or topic of interest, it is often natural (or slightly less natural, but possible) to express the state of the system by a set of numbers. Let's call that a "record". From this it follows that:

  • a set of such systems
  • the time evolution of such a system
  • ...

can be described by a (possibly ordered) list of such records. Such a list directly represents a dataframe.

What this implies is that as long as we can write down a number of operations we can perform on or with a dataframe, we can apply such operations to a large number of possible systems.

Therefore, dataframes are a very powerful data structure.

The library inspiring datamancer defines a small set of (base) operations to perform with dataframes. A small number of operations makes it easy to reason about and combine them to produce complex operations.

The five base operations ("the verbs of dplyr") are:

  • mutate: modify an existing column or add a new one
  • select: select a subset of columns
  • filter: filter out a subset of records based on one or more conditions
  • summarize: reduce one or more columns to a single scalar
  • arrange: sort the dataframe according to one or more columns

For discrete columns in a dataframe one more procedure is almost as basic, namely group_by. It allows one to iterate over all subsets of a dataframe that have the same value 'x' in a column 'A'.

The five verbs above naturally combine with group_by. This means that if one of these operations is performed on a grouped dataframe, the operation will be performed on each subgroup instead of on the full dataframe (which may or may not produce a different result), as the short sketch below illustrates.
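
As a rough sketch of how this reads in practice (the f{} formula syntax is explained in the Formula section below; the << used here builds a reducing formula as described in the README, so take the exact spelling as illustrative):

import datamancer
let df = readCsv("../data/mpg.csv")
# summarize reduces a column to a scalar: one mean for the whole dataset
echo df.summarize(f{float: "meanCty" << mean(`cty`)})
# the same operation on a grouped dataframe: one mean per car class
echo df.group_by("class").summarize(f{float: "meanCty" << mean(`cty`)})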

Creating a dataframe

With some understanding of why we might want to bother with dataframes, we can now ask ourselves how to create one.

Usage typically starts with one of the following cases:

  1. data already available in seq[T]/Tensor[T] or some Nim object from which such can be created
  2. some CSV / TSV like ASCII file or a string representing such a thing
  3. some binary file like HDF5
  4. some database

Note about 3 and 4: simple access (without manually reading into a seq[T]/Tensor[T]) is not supported for these two yet. These can be added easily (code for HDF5 exists, but not as part of this repository) if there is demand.

Supported datatypes and internal representation

The datamancer dataframes currently support the following types:

  • int
  • float
  • string
  • bool
  • Value

where Value is a variant object that can store any of the above datatypes. That type is used in case a single column stores multiple data types.
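
For instance, the %~ operator (part of the exported API listed at the end of this document) converts a native value into a Value. A minimal sketch of building a mixed column this way:

import datamancer
# `%~` wraps each native value in the `Value` variant type
let vals = @[%~ 1, %~ "two", %~ 3.5]
let df = toDf({ "mixed" : vals })
echo df  # the "mixed" column stores Values instead of one native type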

At this moment there is no direct support for DateTime or Time objects. That could be added if desired. It's a bit of work, but manageable. It's mainly missing because so far I personally haven't really needed it.

Internally, one column is stored in a Column object. This is a variant object with different possible column kinds (the ColKind enum, one kind for each of the native data types). The column stores an Arraymancer Tensor[T] for the respective type of the column.

The usage of variant objects allows for a fully dynamic, runtime mutable design. This is a trade-off between safety and convenience, placed more towards convenience, for the simple reason that for frequently recurring computations of the same kind it is highly recommended to use a custom datatype that allows for optimizations applicable to the specific domain.
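
To give a rough idea of that design, here is a simplified sketch of such a variant object. Note that the names are purely illustrative and the real Column additionally covers Value-based object columns, constant columns and more:

import arraymancer

type
  SimpleColKind = enum sckInt, sckFloat, sckString, sckBool
  SimpleColumn = object        # illustrative stand-in for datamancer's Column
    case kind: SimpleColKind   # runtime tag deciding which field is active
    of sckInt: iCol: Tensor[int]
    of sckFloat: fCol: Tensor[float]
    of sckString: sCol: Tensor[string]
    of sckBool: bCol: Tensor[bool]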

From seq[T]/Tensor[T]

For the case of having the data as seq[T], we just use the toDf template to create a DF from it. The template does not care whether the input is of type seq[T] or Tensor[T]. In the future support for pointer + length pairs can be added as well.

There are two ways to use toDf. Assuming we have three sequences of possibly different types:

let s1: seq[int] = @[22, 54, 34]
let s2: seq[float] = @[1.87, 1.75, 1.78]
let s3: seq[string] = @["Mike", "Laura", "Sue"]

we can either create a DF and let the library automatically deduce the column names from the Nim identifiers of the given variables:

let dfAutoNamed = toDf(s1, s2, s3)

which will give us a DF with column names:

"s1", "s2", "s3"

In many cases one might rather like a different name. In this case use the following syntax:

let df = toDf({ "Age" : s1,
                "Height" : s2,
                "Name" : s3 })

which will then use the given strings for the column names.

If we print this dataframe we get the following output:

Dataframe with 3 columns and 3 rows:
       Idx       Age    Height      Name
    dtype:       int     float    string
         0        22      1.87      Mike
         1        54      1.75     Laura
         2        34      1.78       Sue

We see that we get information about:

  • the number of columns in the dataframe
  • the number of rows in the dataframe
  • the names of each column
  • the data types of each column
  • their values with one record per row
  • and an additional index column

From a CSV / TSV file

The second supported case is a CSV-like file. For these the library provides a generalized readCsv proc. Strictly speaking it can also read TSV (or any delimited ASCII file) and provides a number of different options to make it suitable for a large variety of differently organized CSV files (what a euphemism):

proc readCsv*(fname: string,
              sep: char = ',',
              header: string = "",
              skipLines = 0,
              toSkip: set[char] = {},
              colNames: seq[string] = @[],
              skipInitialSpace = true,
              quote = '"',
             ): DataFrame

For a regular CSV file (comma separated) with a header line storing the column names, all we need is a filename.

In addition to that, the separator can be changed with sep, and the header can have a designation (e.g. # indicating header lines, which will be skipped automatically after the first).

Further, skipLines allows skipping a specific number of lines after the header. toSkip allows skipping any set of characters; these will be completely ignored outside of columns.

If the file does not have a header, colNames can be used to give names to the columns.

skipInitialSpace is useful to remove whitespace that might appear in addition to a separator (e.g. a comma separated file that also has a space after every comma). If it is false, such spaces will be parsed into the data fields.

Finally, quote allows ignoring all characters that would otherwise act as separators, line breaks, etc. within quoted fields. If " is used as the quote character and spaces separate the fields, there may be spaces within individual fields as long as these are within quotes.
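
For instance, a tab separated file without a header line could be read like this (the filename and column names are of course hypothetical):

import datamancer
# tab separated, no header in the file, so we name the columns ourselves
let df = readCsv("measurements.tsv", sep = '\t',
                 colNames = @["time", "voltage", "current"])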

Let's use this procedure to read the supplied mpg dataset:

import datamancer
let df = readCsv("../data/mpg.csv")

We're now the proud owners of a dataframe that's a bit more exciting than the 3 column / 3 row one from before. We'll explore it in the next section.

Manipulating a DF

Now we have a DF. What then?

First of all we can look at it. Echoing a DF calls the pretty proc. For the DF introduced above, this looks like:

echo df

gives for the mpg dataset:

Dataframe with 11 columns and 234 rows:
   Idx  manufacturer           model  displ  year  cyl  ...     drv   cty   hwy      fl     class
dtype:        string          string  float   int  int  ...  string   int   int  string    string
     0          audi              a4    1.8  1999    4  ...       f    18    29       p   compact
     1          audi              a4    1.8  1999    4  ...       f    21    29       p   compact
     2          audi              a4      2  2008    4  ...       f    20    31       p   compact
     3          audi              a4      2  2008    4  ...       f    21    30       p   compact
     4          audi              a4    2.8  1999    6  ...       f    16    26       p   compact
     5          audi              a4    2.8  1999    6  ...       f    18    26       p   compact
     6          audi              a4    3.1  2008    6  ...       f    18    27       p   compact
     7          audi      a4 quattro    1.8  1999    4  ...     "4"    18    26       p   compact
     8          audi      a4 quattro    1.8  1999    4  ...     "4"    16    25       p   compact
     9          audi      a4 quattro      2  2008    4  ...     "4"    20    28       p   compact
    10          audi      a4 quattro      2  2008    4  ...     "4"    19    27       p   compact
    11          audi      a4 quattro    2.8  1999    6  ...     "4"    15    25       p   compact
    12          audi      a4 quattro    2.8  1999    6  ...     "4"    17    25       p   compact
    13          audi      a4 quattro    3.1  2008    6  ...     "4"    17    25       p   compact
    14          audi      a4 quattro    3.1  2008    6  ...     "4"    15    25       p   compact
    15          audi      a6 quattro    2.8  1999    6  ...     "4"    15    24       p   midsize
    16          audi      a6 quattro    3.1  2008    6  ...     "4"    17    25       p   midsize
    17          audi      a6 quattro    4.2  2008    8  ...     "4"    16    23       p   midsize
    18     chevrolet c1500 suburb...    5.3  2008    8  ...       r    14    20       r       suv
    19     chevrolet c1500 suburb...    5.3  2008    8  ...       r    11    15       e       suv

(NOTE: I shortened the output for the docs here.) Notice how in the drv column the 4WD entries are echoed as "4" instead of just 4. That is to highlight that those values are actually stored as strings, so as not to confuse them with numbers.

By default only the first 20 entries will be shown. For more/less elements, call pretty directly:

echo df.pretty(100)

pretty also takes a precision argument. This is given to the string conversion for float values to set the number of digits printed after the decimal point. However, it can also be used to change the width of the columns more generally. Note, however, that the precision is added to a default width of 6, and each column is at least as wide as the longest DF key.
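
For example, to print the first 10 rows with floats at 2 digits after the decimal point (a minimal sketch, assuming the argument is passed by its name precision):

echo df.pretty(10, precision = 2)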

Let's now check which cars in the dataset have the highest and lowest city fuel economy. For that we can simply arrange the dataframe according to the cty column and take the tail or head of the result.

echo df.arrange("cty").head(5)

results in:

Dataframe with 11 columns and 5 rows:
Idx    manufacturer           model           displ  ...  cyl  ...  drv   cty   hwy      fl    class
dtype:       string          string           float       int    string   int   int  string   string
  0           dodge dakota picku...             4.7  ...    8  ...  "4"     9    12       e   pickup
  1           dodge     durango 4wd             4.7  ...    8  ...  "4"     9    12       e      suv
  2           dodge ram 1500 pic...             4.7  ...    8  ...  "4"     9    12       e   pickup
  3           dodge ram 1500 pic...             4.7  ...    8  ...  "4"     9    12       e   pickup
  4            jeep grand cherok...             4.7  ...    8  ...  "4"     9    12       e      suv

and looking at the tail instead:

echo df.arrange("cty").tail(5)

will tell us that a new beetle is the most efficient car in the dataset:

Dataframe with 11 columns and 5 rows:
Idx    manufacturer           model           displ  ...  cyl  ...  drv   cty   hwy     fl        class
dtype:       string          string           float       int    string   int   int  string      string
  0           honda           civic             1.6  ...    4  ...    f    28    33      r   subcompact
  1          toyota         corolla             1.8  ...    4  ...    f    28    37      r      compact
  2      volkswagen      new beetle             1.9  ...    4  ...    f    29    41      d   subcompact
  3      volkswagen           jetta             1.9  ...    4  ...    f    33    44      d      compact
  4      volkswagen      new beetle             1.9  ...    4  ...    f    35    44      d   subcompact

(arrange also takes an order argument, using the Nim stdlib's SortOrder enum).
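
So the most efficient cars can also be found with head instead of tail by sorting in descending order (a short sketch; SortOrder lives in the stdlib's algorithm module):

import std / algorithm
echo df.arrange("cty", order = SortOrder.Descending).head(5)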

As another example, to showcase the usage of FormulaNodes, let's find some cars with an engine displacement of more than 5 L which are also 2-seaters (I wonder what car might show up...):

echo df.filter(f{`displ` > 5.0 and `class` == "2seater"})
Dataframe with 11 columns and 5 rows:
Idx    manufacturer           model           displ  ...  cyl  ...  drv   cty   hwy     fl     class
dtype:       string          string           float       int    string   int   int string    string
  0       chevrolet        corvette             5.7  ...    8  ...    r    16    26      p   2seater
  1       chevrolet        corvette             5.7  ...    8  ...    r    15    23      p   2seater
  2       chevrolet        corvette             6.2  ...    8  ...    r    16    26      p   2seater
  3       chevrolet        corvette             6.2  ...    8  ...    r    15    25      p   2seater
  4       chevrolet        corvette               7  ...    8  ...    r    15    24      p   2seater

Surprise, surprise we found ourselves a bunch of corvettes!

(Note: for an explanation of this mythical f{} thing and those accented quotes, see the Formula section below).

Finally, let's make use of a formula that creates a new column. Let's say we want to convert the city fuel economy of the cars from MPG to L/100 km, as is the standard in Germany. We'll do this with mutate, which will add an additional column to the dataframe. (Well, if only it were clear whether the given mpg refer to US or imperial gallons...)

import datamancer
let df = readCsv("../data/mpg.csv")

let dfl100km = df.filter(f{`displ` > 5.0 and `class` == "2seater"})
  .mutate(f{"cty / L/100km" ~ 235 / `cty`})
echo dfl100km.pretty(5)

shows us:

Dataframe with 12 columns and 5 rows:
Idx     manufacturer            model            displ  ...       trans  ...  cty   ...   cty / L/100km
dtype:        string           string            float  ...      string  ...  int   ...           float
  0        chevrolet         corvette              5.7  ...  manual(m6)  ...   16   ...           14.69
  1        chevrolet         corvette              5.7  ...    auto(l4)  ...   15   ...           15.67
  2        chevrolet         corvette              6.2  ...  manual(m6)  ...   16   ...           14.69
  3        chevrolet         corvette              6.2  ...    auto(s6)  ...   15   ...           15.67
  4        chevrolet         corvette                7  ...  manual(m6)  ...   15   ...           15.67

where I removed a couple of columns for better visibility again.

I used the chaining of filter and mutate above mainly to showcase that this works reliably.

When looking at the formula above, note that, as in ggplot2, the tilde ~ is used to indicate a dependency, or in other words a mapping, e.g. from Tensor to Tensor.

Formula

Here will go parts of what's in the README.
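
Until then, a brief sketch of the three formula kinds as used throughout this document (following the README; the column names are illustrative):

import datamancer
# `~` creates a mapping formula: computes a new full column
let fMap = f{"displPerCyl" ~ `displ` / `cyl`}
# `<<` creates a reducing formula: reduces a column to a single scalar
let fRed = f{"meanCty" << mean(`cty`)}
# `<-` creates a plain assignment, here attaching a constant
let fAsn = f{"label" <- "mpg"}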

Exports

[]=, []=, /, constantToFull, drop, sortTypes, []=, lead, filter, [], <=, withNative2, $, group_by, %~, contains, isConstant, toColumn, toObject, newDataFrame, isNull, nativeColKind, hasExplicitTypeHint, map_inline, parsePreface, convenienceValueComparisons, $, items, asValue, bind_rows, combinedColKind, []=, withNativeConversion, toFloat, isNumber, [], %~, isConstant, raw, colMin, count, []=, null, Value, withNative, toTab, FormulaKind, filterToIdx, toHashSet, drop, high, reduce, toNativeColumn, withNativeDtype, valueTo, toObjectColumn, [], isInt, add, len, add, drop_null, clone, %~, high, ==, clone, buildResultColName, [], [], innerJoin, %~, get, nullColumn, lag, newColumn, withNative, groups, toTensor, arrange, pretty, asgn, %~, [], [], calcNewColumn, $, toValKind, almostEqual, transmuteInplace, transmute, toColumn, withDtypeByColKind, toNativeColumn, FormulaNode, keys, [], relocate, ColKind, head, +, isNumber, %~, []=, []=, pretty, [], row, <, evaluate, setDiff, compatibleColumns, toColKind, -, extendShortColumns, %~, DtypesAll, %~, dataFrame, formula, HeuristicType, fkVariable, contains, hash, isValidType, lag, fkVector, constantColumn, selectInplace, hash, Column, assignStack, nodeIsDf, generateClosure, toNimType, $, DataFrameKind, items, DtypeOrderMap, {}, buildName, ValueKind, *, AssignKind, fkAssign, bind_rows, toStr, [], tail, isColumn, toObject, mutateInplace, equal, select, TypeHint, toValueKind, toInt, []=, pretty, toColKind, keys, DataFrame, contains, to, nodeIsDfIdx, contains, %~, toTensor, extractCall, [], fn, evaluate, valTabToDf, toValueKind, [], withCombinedType, ncols, toDf, fkNone, largerOrFalse, newVObject, withNativeTensor, lead, Assign, fkScalar, sortTypes, compileFormulaImpl, isBool, toColumn, toObject, rename, $, toStrType, []=, map, add, ValueNull, summarize, strTabToDf, pairs, spread, smallerOrFalse, Lift, relocate, shallowCopy, max, isInt, calcNewConstColumnFromScalar, isColumnType, Preface, FormulaMismatchError, unique, FormulaCT, liftScalarToColumn, values, toColumn, mutate, add, seqsToDf, toBool, toUgly, gather, getKeys, unique, colMax, colsToDf, Dtypes, FormulaTypes, writeCsv, readCsv, parseCsvString, readCsvAlt, readCsv, showBrowser