datamancer/dataframe


Consts

ValueNull = (kind: VNull, str: "", num: 0, fnum: 0.0, bval: false,
             fields: (data: [], counter: 0, first: 0, last: 0))

Procs

proc `[]`(df: DataFrame; idx: array[1, int]): Column {....raises: [KeyError],
    tags: [].}
Returns the column at index idx.

Example:

let df = toDf({ "x" : @[1, 2, 3], "y" : @[10, 11, 12], "z": ["5","6","7"] })
doAssert df[[0]].toTensor(int) == toTensor [1, 2, 3] ## get the first column
doAssert df[[1]].toTensor(int) == toTensor [10, 11, 12] ## get the second column
doAssert df[[2]].toTensor(string) == toTensor ["5", "6", "7"] ## get the third column
proc `[]`(df: DataFrame; k: string): var Column {.inline, ...raises: [KeyError],
    tags: [].}
Returns the column k from the DataFrame as a mutable object.
proc `[]`(df: DataFrame; k: string; idx: int): Value {.inline,
    ...raises: [ValueError, KeyError], tags: [].}

Returns the element at index idx in column k directly as a Value, without converting (to Value) and returning the whole column first.

An efficient way to access a few individual elements without specifying a data type.

If idx is not within the DF's length an IndexError is raised.
proc `[]`(df: DataFrame; k: string; slice: Slice[int]): Column {.inline,
    ...raises: [KeyError], tags: [].}
Returns the elements in slice in column k directly as a new Column without returning the full column first.
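A hand-written sketch of slice access (not part of the generated examples; the inclusive slice semantics are assumed from Nim's Slice[int]):

```nim
let df = toDf({ "x" : @[1, 2, 3, 4, 5] })
let sub = df["x", 1 .. 3]   ## inclusive slice: indices 1, 2, 3
doAssert sub.toTensor(int) == toTensor [2, 3, 4]
```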
proc `[]`(df: DataFrame; k: Value): Column {.inline,
    ...raises: [KeyError, Exception, ValueError], tags: [RootEffect].}
Returns the column k from the DataFrame for a Value object storing a column reference.
proc `[]`[T, U](df: DataFrame; rowSlice: HSlice[T, U]): DataFrame
Returns a slice of the data frame given by rowSlice, which is simply a subset of the input data frame.
proc `[]`[T](df: DataFrame; k: string; idx: int; dtype: typedesc[T]): T {.inline.}

Returns the element at index idx in column k directly, without returning the whole column first.

If dtype corresponds to the data type of the underlying Tensor, no type conversion needs to be performed.

If dtype does not match the data type of the underlying Tensor the value will be read as its native type and then converted to dtype via explicit conversion.

If idx is not within the DF's length an IndexError is raised.

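A hand-written sketch of typed element access (not part of the generated examples; it mirrors the `df["y", 0, int]` usage shown further below):

```nim
let df = toDf({ "x" : @[1, 2, 3] })
doAssert df["x", 0, int] == 1       ## native type: no conversion needed
doAssert df["x", 1, float] == 2.0   ## read as int, then converted to float
```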
proc `[]`[T](df: DataFrame; k: string; slice: Slice[int]; dtype: typedesc[T]): Tensor[
    T] {.inline.}

Returns the elements in slice in column k directly as a tensor of type dtype, without returning the whole column first.

If dtype corresponds to the data type of the underlying Tensor, no type conversion needs to be performed and the slice is returned as is.

If dtype does not match the data type of the underlying Tensor the slice will be read as its native type and then converted to dtype via explicit astype conversion.

proc `[]`[T](df: DataFrame; key: string; dtype: typedesc[T]): Tensor[T]

Returns the column key as a Tensor of type dtype.

If dtype matches the actual data type of the Tensor underlying the column, this is a no copy operation. Otherwise a type conversion is performed on the Tensor using astype.

This is the easiest way to access the raw Tensor underlying a column for further processing.

Example:

import sequtils
let df = toDf({"x" : toSeq(1 .. 5)})
let t: Tensor[int] = df["x", int]
doAssert t.sum == 15
proc `[]=`(df: var DataFrame; k: string; col: Column) {.inline,
    ...raises: [ValueError], tags: [].}

Assigns the column col as a column with key k to the DataFrame.

If the length of the column does not match the existing DF length, a ValueError is raised (unless the DF is still empty).
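A hand-written sketch of column assignment (not part of the generated examples; it reuses an existing column so no additional helpers are assumed):

```nim
var df = toDf({ "x" : @[1, 2, 3] })
df["y"] = df["x"]   ## assign an existing column under a new key
doAssert df.ncols == 2
doAssert df["y", int] == [1, 2, 3].toTensor
```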
proc `[]=`[T: SomeNumber | string | bool](df: var DataFrame; k: string; t: T) {.
    inline.}
Assigns a scalar t as a constant column to the DataFrame.

Example:

var df = toDf({"x" : @[1,2,3]})
df["y"] = 5
doAssert df["y"].isConstant
doAssert df["y"].len == 3
doAssert df["y", 0, int] == 5
doAssert df["y", 1, int] == 5
doAssert df["y", 2, int] == 5
proc `[]=`[T: Tensor | seq | array](df: var DataFrame; k: string; t: T) {.inline.}

Assigns a Tensor, seq or array to the DataFrame df as column key k.

If the length of the input t does not match the existing DF's length, a ValueError is raised.

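A hand-written sketch of assigning seqs and arrays as columns (not part of the generated examples; behavior assumed from the description above):

```nim
var df = toDf({ "x" : @[1, 2, 3] })
df["y"] = @[1.5, 2.5, 3.5]   ## seq of matching length
df["z"] = ["a", "b", "c"]    ## arrays work as well
doAssert df["y", float] == [1.5, 2.5, 3.5].toTensor
doAssert df["z", string] == ["a", "b", "c"].toTensor
```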
proc `[]=`[T](df: var DataFrame; fn: FormulaNode; key: string; val: T)

Evaluates the given FormulaNode fn, which needs to be a function returning a bool, and assigns a constant value val to all rows of column key matching the condition.

This is a somewhat esoteric procedure, but can be used to mask rows matching some condition.

Example:

var df = toDf({"x" : [1,2,3], "y" : [5,6,7]})
df[f{`x` > 1}, "x"] = 5 ## assign 5 to all rows > 1
doAssert df["x", int] == [1,5,5].toTensor
doAssert df["y", int] == [5,6,7].toTensor ## remains unchanged
df[f{`y` < 7}, "x"] = -1 ## can also refer to other columns of course
doAssert df["x", int] == [-1,-1,5].toTensor
doAssert df["y", int] == [5,6,7].toTensor ## still unchanged
proc `[]=`[T](df: var DataFrame; k: string; idx: int; val: T) {.inline.}

A low-level helper to assign a value val of type T directly to column k in the DataFrame df at index idx.

If idx is not within the DF's length an IndexError is raised.

WARNING: This procedure does not check the compatibility of the column types. Only use it if you know that the type of val corresponds to the data type of the underlying column! Assign at an index on the column itself for safer behavior.

Example:

var df = toDf({"x" : [1,2,3]})
df["x", 1] = 5
doAssert df["x", int] == [1,5,3].toTensor
## df["x", 2] = 1.2 <- produces a runtime error that the specific `kind` field in Column
## is inaccessible!
proc add(df: var DataFrame; dfToAdd: DataFrame) {.
    ...raises: [ValueError, KeyError], tags: [].}

The simplest form of "adding" a data frame, resulting in both data frames stacked vertically on top of one another.

If the keys match exactly or df is empty, dfToAdd will be stacked below. This performs a key check and then calls bind_rows to do the work.

If they don't match a ValueError is thrown.

Example:

let a = [1, 2, 3]
let b = [3, 4, 5]
let c = [4, 5, 6, 7]
let d = [8, 9, 10, 11]

let df = toDf({"a" : a, "b" : b})
let df2 = toDf({"a" : c, "b" : d})
import sequtils
block:
  var dfRes = newDataFrame()
  dfRes.add df
  doAssert dfRes.len == 3
  dfRes.add df2
  doAssert dfRes.len == 7
  try:
    dfRes.add toDf({"c": [1,3]})
  except ValueError:
    discard
proc add[T: tuple](df: var DataFrame; args: T)

This procedure adds a given tuple as a new row to the DF.

If the DataFrame df does not have any column yet, columns of the names given by the tuple field names will be created. Otherwise the tuple field names are ignored and only the order is considered for placement into the different columns.

This should almost always be avoided, because it comes at a huge performance penalty. Every add causes a new allocation of every tensor of each column of length (N + 1). Only use this to add few (!!) rows to a DF. Otherwise consider storing your intermediate rows to be added in individual seqs or Tensors (if you know the length in advance) and add the new DF to the existing one using bind_rows, add or even assignStack.

Possibly use the add template, which takes a varargs[untyped] if you do not wish to construct a tuple manually.

NOTE: the input is treated in the order of the columns as they are stored in the internal OrderedTable! Make sure the order is as you think it is!

Example:

var df = newDataFrame()
df.add((x: 1, y: 2))
df.add((x: 3, y: 5))
df.add((z: 5, x: 10)) ## after columns exist, tuple names are ignored!
doAssert df.len == 3
doAssert df["x", int] == [1, 3, 5].toTensor
doAssert df["y", int] == [2, 5, 10].toTensor
proc arrange(df: DataFrame; by: varargs[string]; order = SortOrder.Ascending): DataFrame {.
    ...raises: [KeyError, ValueError, Exception], tags: [RootEffect].}

Sorts the data frame in ascending / descending order by key by.

The sort order is handled as in Nim's standard library using the SortOrder enum.

If multiple keys are given to order by, the priority is determined by the order in which they are given. We first order by by[0]. If there is a tie, we try to break it by by[1] and so on.

Do not depend on the order within a tie, if no further ordering is given!

Example:

let df = toDf({ "x" : @[5, 2, 3, 2], "y" : @[4, 3, 2, 1]})
block:
  let dfRes = df.arrange("x")
  doAssert dfRes["x", int] == [2, 2, 3, 5].toTensor
  doAssert dfRes["y", int][0] == 3
  doAssert dfRes["y", int][3] == 4
block:
  let dfRes = df.arrange("x", order = SortOrder.Descending)
  doAssert dfRes["x", int] == [5, 3, 2, 2].toTensor
  doAssert dfRes["y", int][0] == 4
  doAssert dfRes["y", int][1] == 2
block:
  let dfRes = df.arrange(["x", "y"])
  doAssert dfRes["x", int] == [2, 2, 3, 5].toTensor
  doAssert dfRes["y", int] == [1, 3, 2, 4].toTensor
proc asgn(df: var DataFrame; k: string; col: Column) {.inline, ...raises: [],
    tags: [].}

Low-level assign, which does not check the column's length. If used with a column of different length than the DataFrame, it results in a ragged DF. Only use this if you intend to extend these columns later or won't use any other procedure taking a DataFrame.

Used in toTab macro, where shorter columns are extended afterwards using extendShortColumns.

proc assignStack(dfs: seq[DataFrame]): DataFrame {.
    ...raises: [ValueError, KeyError, Exception], tags: [RootEffect].}

Returns a data frame built as a stack of the data frames in the sequence.

This is a somewhat unsafe procedure as it trades safety for performance. It's mainly intended to be used internally to speed up stacking the outputs of certain operations.

In contrast to calling add multiple times, assignStack preallocates all data required for all arguments immediately and performs slice assignments. If you need to stack many equivalent data frames, use this procedure.

All dataframes must have matching keys and column types. It should only be called from places where this is made sure as the point of the procedure is speeding up assignment for cases where we know this holds.

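A hand-written sketch of stacking (not part of the generated examples; behavior assumed from the description above, with matching keys and column types as required):

```nim
let df1 = toDf({ "x" : @[1, 2] })
let df2 = toDf({ "x" : @[3, 4] })
let stacked = assignStack(@[df1, df2])
doAssert stacked.len == 4
doAssert stacked["x", int] == [1, 2, 3, 4].toTensor
```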
proc bind_rows(dfs: varargs[(string, DataFrame)]; id: string = ""): DataFrame {.
    ...raises: [ValueError, KeyError], tags: [].}

bind_rows combines several data frames row wise (i.e. data frames are stacked on top of one another).

The origin of each row is indicated in a new id column, where the values are the first argument in each of the given tuples.

If a given column does not exist in one of the data frames, the corresponding rows of the data frame missing it will be filled with VNull.

Example:

let a = [1, 2, 3]
let b = [3, 4, 5]
let c = [4, 5, 6, 7]
let d = [8, 9, 10, 11]

let df = toDf({"a" : a, "b" : b})
let df2 = toDf({"a" : c, "b" : d})
import sequtils

block:
  let res = bind_rows([("A", df), ("B", df2)], id = "From")
  doAssert res.len == 7
  doAssert res.ncols == 3
  doAssert res["a", int] == concat(@a, @c).toTensor
  doAssert res["b", int] == concat(@b, @d).toTensor
  doAssert res["From", string] == concat(newSeqWith(3, "A"),
                                         newSeqWith(4, "B")).toTensor
block:
  let df3 = toDf({"a" : c, "d" : d})
  let res = bind_rows([("A", df), ("B", df3)], id = "From")
  doAssert res.ncols == 4
  doAssert res["a", int] == concat(@a, @c).toTensor
  doAssert res["b"].kind == colObject
  doAssert res["b", Value] == [%~ 3, %~ 4, %~ 5, ## equivalent to `b`
                               null(), null(), null(), null()].toTensor
  doAssert res["d"].kind == colObject
  doAssert res["d", Value] == [null(), null(), null(),
                               %~ 8, %~ 9, %~ 10, %~ 11].toTensor ## equivalent to `d`
  doAssert res["From", string] == concat(newSeqWith(3, "A"),
                                         newSeqWith(4, "B")).toTensor
proc calcNewColumn(df: DataFrame; fn: FormulaNode): (string, Column) {.
    ...raises: [Exception], tags: [RootEffect].}

Calculates a new column based on the fn given. Returns the name of the resulting column (derived from the formula) as well as the column.

This is not intended for the user facing API. It is used internally in ggplotnim.
proc calcNewConstColumnFromScalar(df: DataFrame; fn: FormulaNode): (string,
    Column) {....raises: [Exception], tags: [RootEffect].}

Calculates a new constant column based on the scalar (reducing) fn given. Returns the name of the resulting column (derived from the formula) as well as the column.

This is not intended for the user facing API. It is used internally in ggplotnim.
proc clone(df: DataFrame): DataFrame {....raises: [ValueError, KeyError], tags: [].}
Returns a cloned version of df, which deep copies the tensors of the DataFrame. This makes sure there is no data sharing due to reference semantics between the input and output DF.
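A hand-written sketch of the difference clone makes (not part of the generated examples; it uses the low-level `df[k, idx] = val` assignment documented above):

```nim
let df = toDf({ "x" : @[1, 2, 3] })
var dfC = df.clone()
dfC["x", 0] = 100                ## mutate the clone ...
doAssert df["x", 0, int] == 1    ## ... the original data is untouched
doAssert dfC["x", 0, int] == 100
```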
proc colMax(df: DataFrame; col: string; ignoreInf = true): float {.
    ...raises: [ValueError, KeyError], tags: [].}

Returns the maximum value along a given column, which must be readable as a float tensor.

If ignoreInf is true Inf values are ignored.

In general this is not intended as a user facing procedure. It is used in ggplotnim to determine scales. As a user a simple df["foo", float].max is preferred, unless the ignoreInf functionality seems useful.
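A hand-written sketch of the ignoreInf behavior (not part of the generated examples; `Inf` comes from the math module):

```nim
import math
let df = toDf({ "x" : @[1.0, Inf, 3.0] })
doAssert df.colMax("x") == 3.0                      ## Inf ignored by default
doAssert df.colMax("x", ignoreInf = false) == Inf
```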
proc colMin(df: DataFrame; col: string; ignoreInf = true): float {.
    ...raises: [ValueError, KeyError], tags: [].}

Returns the minimum value along a given column, which must be readable as a float tensor.

If ignoreInf is true Inf values are ignored.

In general this is not intended as a user facing procedure. It is used in ggplotnim to determine scales. As a user a simple df["foo", float].min is preferred, unless the ignoreInf functionality seems useful.
proc contains(df: DataFrame; key: string): bool {....raises: [], tags: [].}
Checks if key names a column in the DataFrame.
proc count(df: DataFrame; col: string; name = "n"): DataFrame {.
    ...raises: [KeyError, ValueError, Exception], tags: [RootEffect].}

Counts the number of occurrences of each unique value in col of the data frame.

The counts are stored in a new column given by name.

It is essentially a shorthand version of first grouping the data frame by column col and then applying a reducing summarize call that returns the length of each sub group.

Example:

let df = toDf({"Class" : @["A", "C", "B", "B", "A", "C", "C"]})
let dfRes = df.count("Class")
doAssert dfRes.len == 3 # one row per class
doAssert dfRes["n", int] == [2, 2, 3].toTensor
proc drop(df: DataFrame; keys: varargs[string]): DataFrame {....raises: [],
    tags: [].}
Returns a DataFrame with the given keys dropped.
proc drop(df: var DataFrame; key: string) {.inline, ...raises: [], tags: [].}
Drops the given key from the DataFrame.
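A hand-written sketch of both drop variants (not part of the generated examples; behavior assumed from the descriptions above):

```nim
var df = toDf({ "x" : @[1, 2], "y" : @[3, 4], "z" : @[5, 6] })
let dfRes = df.drop("y", "z")   ## non-mutating variant returns a reduced DF
doAssert dfRes.getKeys() == @["x"]
df.drop("y")                    ## mutating variant drops in place
doAssert "y" notin df
```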
proc drop_null(df: DataFrame; cols: varargs[string]; convertColumnKind = false;
               failIfConversionFails: bool = false): DataFrame {.
    ...raises: [KeyError, Exception, ValueError, FormulaMismatchError],
    tags: [RootEffect].}

Returns a DF with only those rows left, which contain no null values. Null values can only appear in object columns.

By default this includes all columns in the data frame. If one or more cols are given, only those columns will be considered.

By default no attempt is made to convert the new columns to a native data type, since it introduces another walk over the data. If convertColumnKind is true, conversion is attempted. Whether that throws an assertion error if the conversion to a single native type is not possible is controlled by failIfConversionFails.

Note: In general this is not a particularly fast proc, since each column which should drop null values causes a filter of the DF, i.e. a full run over the length of the DF.
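A hand-written sketch (not part of the generated examples) producing VNull entries via bind_rows, as in the bind_rows example above, and then dropping them:

```nim
let df1 = toDf({ "a" : @[1, 2], "b" : @[3, 4] })
let df2 = toDf({ "a" : @[5, 6] })                      ## no `b` column
let stacked = bind_rows([("x", df1), ("y", df2)], id = "id")
doAssert stacked["b"].kind == colObject                ## rows from df2 are VNull
let dfRes = stacked.drop_null("b")
doAssert dfRes.len == 2                                ## only rows with a non-null `b`
```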
func evaluate(node: FormulaNode): Value {....raises: [], tags: [].}

Tries to return a single Value from a FormulaNode.

Works either if formula is fkNone or fkVariable. Returns the stored value in these cases.

Raises for fkVector and fkScalar.

proc evaluate(node: FormulaNode; df: DataFrame): Column {.
    ...raises: [ValueError, KeyError, Exception], tags: [RootEffect].}

Tries to return a Column from a FormulaNode by evaluating it on a DataFrame df.

This is usually not extremely useful. It can be handy to understand what a formula does without having mutate and friends interfere.
proc extendShortColumns(df: var DataFrame) {....raises: [KeyError, ValueError],
    tags: [].}
Initial calls to toDf and other procs may result in a ragged DF, which has fewer entries in certain columns than the data frame length. This proc fills up those columns in the mutable data frame.
proc filter(df: DataFrame; conds: varargs[FormulaNode]): DataFrame {.
    ...raises: [Exception, KeyError, ValueError, FormulaMismatchError],
    tags: [RootEffect].}

Returns the data frame filtered by the given conditions. Multiple conditions are evaluated successively and only elements matching all conditions remain. If the input data frame is grouped, the subgroups are evaluated individually.

Both mapping and reducing formulas are supported, but each formula kind must return a boolean value. In a case of a mismatch FormulaMismatchError is thrown.

Example:

let df = toDf({ "x" : @[1, 2, 3, 4, 5], "y" : @["a", "b", "c", "d", "e"] })
let dfRes = df.filter(f{ `x` < 3 or `y` == "e" }) ## arbitrary boolean expressions supported
doAssert dfRes["x", int] == [1, 2, 5].toTensor
doAssert dfRes["y", string] == ["a", "b", "e"].toTensor
proc filterToIdx[T: seq[int] | Tensor[int]](df: DataFrame; indices: T;
    keys: seq[string] = @[]): DataFrame

Filters the input dataframe to all rows matching the given indices.

If the keys argument is empty, all columns are filtered.

WARNING: If keys is given and only represents a subset of the DF, the resulting DataFrame will be ragged and the unfiltered columns are "invisible". The dataframe then technically is invalid. Use at your own risk!

Mostly used internally, but very useful in certain contexts.

proc gather(df: DataFrame; cols: varargs[string]; key = "key"; value = "value";
            dropNulls = false): DataFrame {.
    ...raises: [KeyError, ValueError, Exception], tags: [RootEffect].}

Gathers the cols from df and merges these columns into two new columns, where the key column contains the name of the column from which the value entry is taken. I.e. transforms cols from wide to long format.

A different way to think about the operation is that all columns to be gathered belong to one class. They are simply different labels in that class. gather is used to collect all labels in the class and produces a new data frame, in which we have a column for the class labels (key) and their values as they appeared in each label's column before (value).

The inverse operation is spread.

Example:

let df = toDf({"A" : [1, 8, 0], "B" : [3, 4, 0], "C" : [5, 7, 2]})
let dfRes = df.gather(df.getKeys(), ## get all keys to gather
                      key = "Class", ## the name of the `key` column
                      value = "Num")
doAssert "Class" in dfRes
doAssert "Num" in dfRes
doAssert dfRes.ncols == 2
doAssert dfRes["Class", string] == ["A", "A", "A", "B", "B", "B", "C", "C", "C"].toTensor
doAssert dfRes["Num", int] == [1, 8, 0, 3, 4, 0, 5, 7, 2].toTensor
proc get(df: DataFrame; key: string): Column {.inline, ...raises: [KeyError],
    tags: [].}

Returns the column of key.

Includes an explicit check on whether the column exists in the DataFrame and raises a KeyError with a custom message in case the key does not exist.

This is mainly useful in an application where the exception message should make clear that a data frame is being accessed, since a regular access using [] already raises a KeyError.
proc getKeys(df: DataFrame): seq[string] {....raises: [], tags: [].}

Returns the column keys of a DataFrame as a sequence.

The order is the same as the order of the keys in the DF.

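A hand-written sketch (not part of the generated examples; order assumed from the description above):

```nim
let df = toDf({ "x" : @[1], "y" : @[2] })
doAssert df.getKeys() == @["x", "y"]   ## insertion order is preserved
```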
proc group_by(df: DataFrame; by: varargs[string]; add = false): DataFrame {.
    ...raises: [KeyError, ValueError], tags: [].}

Returns a grouped data frame grouped by all unique keys in by.

Grouping a data frame is an almost lazy affair. It only calculates the groups and their classes. Otherwise the data frame remains unchanged.

If df is already a grouped data frame and add is true, the groups given by by will be added as additional groups.

It is meant to be used with any of the normal procedures like filter, summarize, mutate, in which case the action will be performed group wise.
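A hand-written sketch (not part of the generated examples) illustrating that grouping leaves the data untouched:

```nim
let df = toDf({ "Class" : @["A", "B", "A"], "Num" : @[1, 2, 3] })
let dfG = df.group_by("Class")
doAssert dfG.len == df.len   ## grouping does not change the data itself
## subsequent calls to e.g. `filter`, `mutate` or `summarize` act group wise
```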
proc head(df: DataFrame; num: int): DataFrame {....raises: [ValueError, KeyError],
    tags: [].}
Returns the head of the DataFrame, i.e. the first num elements.
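A hand-written sketch (not part of the generated examples):

```nim
let df = toDf({ "x" : @[1, 2, 3, 4, 5] })
let dfH = df.head(3)
doAssert dfH.len == 3
doAssert dfH["x", int] == [1, 2, 3].toTensor
```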
proc high(df: DataFrame): int {....raises: [], tags: [].}
Returns the highest possible index in any column of the input DataFrame df.
proc innerJoin(df1, df2: DataFrame; by: string): DataFrame {.
    ...raises: [KeyError, ValueError, Exception], tags: [RootEffect].}

Returns a data frame joined by the given key by in such a way as to only keep rows found in both data frames.

This is useful to combine two data frames that share a single column. It "zips" them together according to the column by.

Example:

let df1 = toDf({ "Class" : @["A", "B", "C", "D", "E"],
                     "Num" : @[1, 5, 3, 4, 6] })
let df2 = toDf({ "Class" : ["E", "B", "A", "D", "C"],
                     "Ids" : @[123, 124, 125, 126, 127] })
let dfRes = innerJoin(df1, df2, by = "Class")
doAssert dfRes.ncols == 3
doAssert dfRes["Class", string] == ["A", "B", "C", "D", "E"].toTensor
doAssert dfRes["Num", int] == [1, 5, 3, 4, 6].toTensor
doAssert dfRes["Ids", int] == [125, 124, 127, 126, 123].toTensor
func isColumn(fn: FormulaNode; df: DataFrame): bool {....raises: [ValueError],
    tags: [].}
Checks if the given FormulaNode as a string representation corresponds to a column in the DataFrame.

Example:

let fn = f{`x` * `x`}
let df = toDf({"x" : @[1, 2, 3]})
  .mutate(fn) # creates a new column of squared `x`
doAssert fn.isColumn(df)
func isConstant(fn: FormulaNode; df: DataFrame): bool {.
    ...raises: [ValueError, KeyError], tags: [].}

Checks if the column referenced by the FormulaNode fn is a constant column in the DataFrame.

If the column corresponding to fn does not exist, it returns false as well. Be sure to be aware whether fn is actually a formula, if you need to distinguish between constant / non constant columns.

Example:

let fn = f{"y"} # is a reference to a constant column.
let df = toDf({"x" : @[1, 2, 3], "y" : 5})
doAssert fn.isConstant(df)
func len[T](t: Tensor[T]): int
Helper proc for a 1D Tensor[T] to return the length of the vector, which corresponds to the length of a DF column.
proc mutate(df: DataFrame; fns: varargs[FormulaNode]): DataFrame {.
    ...raises: [Exception, KeyError, ValueError, FormulaMismatchError],
    tags: [RootEffect].}

Returns the data frame with additional mutated columns, described by the functions fns.

Each formula fn given will be used to create a new column in the data frame.

Existing columns may also be overwritten by handing a formula with the name of an existing column as the resulting name.

The left hand side of the given formula will correspond to the new name of the column if present. If not, the name will be computed from a lisp representation of the formula code.

Example:

let df = toDf({ "x" : @[1, 2, 3], "y" : @[10, 11, 12], "z": ["5","6","7"] })
block:
  let dfRes = df.mutate(f{"x+y" ~ `x` + `y`})
  doAssert dfRes.ncols == 4
  doAssert "x+y" in dfRes
  doAssert dfRes["x+y", int] == [11,13,15].toTensor
block:
  # of course local variables can be referenced:
  let foo = 5
  let dfRes = df.mutate(f{"x+foo" ~ `x` + foo})
  doAssert "x+foo" in dfRes
  doAssert dfRes["x+foo", int] == [6,7,8].toTensor
import strutils
block:
  # they can change type and infer it
  let foo = 5
  let dfRes = df.mutate(f{"asInt" ~ parseInt(`z`)})
  doAssert "asInt" in dfRes
  doAssert dfRes["asInt", int] == [5,6,7].toTensor
block:
  # and if no name is given:
  let dfRes = df.mutate(f{`x` + `y`})
  doAssert "(+ x y)" in dfRes
  doAssert dfRes["(+ x y)", int] == [11,13,15].toTensor
block:
  let dfRes = df.mutate(
    f{"foo" <- 2},   # generates a constant column with value 2
    f{"bar" <- "x"}, # generates a constant column with value "x", does *not* rename "x" to "bar"
    f{"baz" ~ 2}     # generates a (non-constant!) column of only values 2
  )
  doAssert dfRes["foo"].kind == colConstant
  doAssert dfRes["bar"].kind == colConstant
  doAssert "x" in dfRes # "x" untouched
  doAssert dfRes["baz"].kind == colInt # integer column, not constant!
proc mutateInplace(df: var DataFrame; fns: varargs[FormulaNode]) {.
    ...raises: [Exception, KeyError, ValueError, FormulaMismatchError],
    tags: [RootEffect].}
Inplace variant of mutate above.
proc newDataFrame(size = 8; kind = dfNormal): DataFrame {....raises: [], tags: [].}

Initialize a DataFrame by initializing the underlying table for size number of columns. The given size will be rounded up to the next power of 2!

The kind argument can be used to create a grouped DataFrame from the start. Be very careful with this and instead use group_by to create a grouped DF!

proc pretty(df: DataFrame; numLines = 20; precision = 4; header = true): string {.
    ...raises: [KeyError, ValueError], tags: [].}

Converts the first numLines of the input DataFrame df to a string table.

If the numLines argument is negative, will print all rows of the data frame.

The precision argument is relevant for VFloat values, but can also be (mis-) used to set the column width, e.g. to show long string columns.

The header refers to the DataFrame with ... information line, which can be disabled in the returned output to more easily output a simple DF as a string table.

pretty is called by $ with the default parameters.

proc reduce(node: FormulaNode; df: DataFrame): Value {.
    ...raises: [Exception, FormulaMismatchError], tags: [RootEffect].}

Tries to return a single Value from a reducing FormulaNode by evaluating it on a DataFrame df.

The argument must be a reducing formula.

This is usually not extremely useful. It can be handy to understand what a formula does without having mutate and friends interfere.
proc relocate[T: string | FormulaNode](df: DataFrame; cols: seq[T]): DataFrame
Relocates the given columns (possibly renaming them) in the DataFrame.
proc relocate[T: string | FormulaNode](df: DataFrame; cols: varargs[T];
                                       after = ""; before = ""): DataFrame
Relocates (and possibly renames if fkAssign formula "A" <- "B" is given) the column to either before or after the given column.

Example:

let df = toDf({ "x" : @[1, 2, 3], "y" : @[10, 11, 12], "z": ["5","6","7"] })
doAssert df[[0]].toTensor(int) == toTensor [1, 2, 3] ## first column is `x`
block:
  let dfR = df.relocate("x", after = "y")
  doAssert dfR[[0]].toTensor(int) == toTensor [10, 11, 12] ## first column is now `y`
  doAssert dfR[[1]].toTensor(int) == toTensor [1, 2, 3] ## second column is now `x`
block:
  let dfR = df.relocate(f{"X" <- "x"}, after = "y") ## can also rename a column while relocating
  doAssert dfR[[0]].toTensor(int) == toTensor [10, 11, 12] ## first column is now `y`
  doAssert dfR[[1]].toTensor(int) == toTensor [1, 2, 3] ## second column is now `x`
  doAssert "X" in dfR and "x" notin dfR
block:
  let dfR = df.relocate(["y", "x"], after = "z") ## can relocate multiple & order is respected
  doAssert dfR[[0]].toTensor(string) == toTensor ["5", "6", "7"] ## first column is now `z`
  doAssert dfR[[1]].toTensor(int) == toTensor [10, 11, 12] ## second column is now `y`
  doAssert dfR[[2]].toTensor(int) == toTensor [1, 2, 3] ## last is now `x`
proc rename(df: DataFrame; cols: varargs[FormulaNode]): DataFrame {.
    ...raises: [ValueError, KeyError, Exception, FormulaMismatchError],
    tags: [RootEffect].}

Returns the data frame with the columns described by cols renamed to the names on the LHS of the given FormulaNode. All other columns will be left untouched.

The given formulas must be of assignment type, i.e. use <-.

Note: the renamed columns will be moved to the right side of the data frame, so the column order will be changed.

Example:

let df = toDf({ "x" : @[1, 2, 3], "y" : @[10, 11, 12] })
let dfRes = df.rename(f{"foo" <- "x"})
doAssert dfRes.ncols == 2
doAssert "x" notin dfRes
doAssert "foo" in dfRes
doAssert "y" in dfRes
proc row(df: DataFrame; idx: int; cols: varargs[string]): Value {.inline,
    ...raises: [ValueError, KeyError], tags: [].}

Returns the row idx of the DataFrame df as a Value of kind VObject.

If any cols are given, only those columns will appear in the resulting Value.

Example:

let df = toDf({"x" : [1,2,3], "y" : [1.1, 2.2, 3.3], "z" : ["a", "b", "c"]})
let row = df.row(0)
doAssert row["x"] == %~ 1
doAssert row["x"].kind == VInt
doAssert row["y"] == %~ 1.1
doAssert row["y"].kind == VFloat
doAssert row["z"] == %~ "a"
doAssert row["z"].kind == VString
proc select[T: string | FormulaNode](df: DataFrame; cols: varargs[T]): DataFrame

Returns the data frame cut to the names given as cols. The argument may either be the name of a column as a string, or a FormulaNode.

If the input is a formula node, the name is the left hand side (left of <-, ~, <<) if it exists; otherwise the name is computed from the formula. In the simplest case it may just be a fkVariable: f{"myColumn"} formula.

The FormulaNode approach is mainly useful to select and rename a column at the same time using an assignment formula <-.

The column order of the resulting DF is the order of the input columns to select.

Note: string and formula node arguments cannot be mixed. If a rename is desired, all other arguments need to be given as fkVariable formulas.

Example:

let df = toDf({"Foo" : [1,2,3], "Bar" : [5,6,7], "Baz" : [1.2, 2.3, 3.4]})
block:
  let dfRes = df.select(["Foo", "Bar"])
  doAssert dfRes.ncols == 2
  doAssert "Foo" in dfRes
  doAssert "Bar" in dfRes
  doAssert "Baz" notin dfRes
block:
  let dfRes = df.select([f{"Foo"}, f{"New" <- "Bar"}])
  doAssert dfRes.ncols == 2
  doAssert "Foo" in dfRes
  doAssert "New" in dfRes
  doAssert "Bar" notin dfRes
  doAssert "Baz" notin dfRes
proc selectInplace[T: string | FormulaNode](df: var DataFrame; cols: varargs[T])

Inplace variant of select above.

Note: the implementation changed. Instead of implementing select based on selectInplace by dropping columns, selectInplace is now implemented based on select. Internally this is technically still a shallow copy of the input df. This way the column order matches the order of the input keys.
proc setDiff(df1, df2: DataFrame; symmetric = false): DataFrame {.
    ...raises: [ValueError, KeyError, Exception], tags: [RootEffect].}

Returns a DataFrame with all elements in df1 that are not found in df2.

This comparison is performed row wise.

If symmetric is true, the symmetric difference of the datasets is returned instead, i.e. elements which are only in df1 or only in df2.
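A hand-written sketch (not part of the generated examples; note that the row order of the result is not guaranteed, so the symmetric case only checks the length):

```nim
let df1 = toDf({ "x" : @[1, 2, 3] })
let df2 = toDf({ "x" : @[2, 3, 4] })
doAssert setDiff(df1, df2)["x", int] == [1].toTensor
doAssert setDiff(df1, df2, symmetric = true).len == 2   ## rows {1} and {4}
```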
proc shallowCopy(df: DataFrame): DataFrame {....raises: [], tags: [].}

Creates a shallow copy of the DataFrame that does not deep copy the tensors.

Used to return a different DF that contains the same underlying data. Only the OrderedTable object is cloned, so that the two DFs do not reference the same set of column keys. This is the default for all procedures that take and return a DF.

Source   Edit  
proc spread[T](df: DataFrame; namesFrom, valuesFrom: string; valuesFill: T = 0): DataFrame

The inverse operation to gather. A conversion from long format to a wide format data frame.

The name is spread, but the API is trying to be more closely aligned to dplyr's newer pivot_wider.

Note: if the only two columns present are namesFrom and valuesFrom and one (or more) of the labels has more entries than the others, the output columns are filled from row 0 to N (where N is the number of elements in each label).

Note: currently valuesFill does not have an effect. We simply default initialize the new columns to the native default value of the data stored in the column.

Example:

block:
  let df = toDf({ "Class" : ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
                      "Num" : [1, 8, 0, 3, 4, 0, 5, 7, 2] })
  let dfRes = df.spread(namesFrom = "Class",
                        valuesFrom = "Num")
  doAssert dfRes.ncols == 3
  doAssert dfRes["A", int] == [1, 8, 0].toTensor
  doAssert dfRes["B", int] == [3, 4, 0].toTensor
  doAssert dfRes["C", int] == [5, 7, 2].toTensor
block:
  let df = toDf({ "Class" : ["A", "A", "A", "B", "B", "C", "C", "C", "C"],
                      "Num" : [1, 8, 0, 3, 4, 0, 5, 7, 2] })
  let dfRes = df.spread(namesFrom = "Class",
                        valuesFrom = "Num")
  doAssert dfRes.ncols == 3
  ## in this case all new columns are extended with 0 at the end
  doAssert dfRes["A", int] == [1, 8, 0, 0].toTensor
  doAssert dfRes["B", int] == [3, 4, 0, 0].toTensor
  doAssert dfRes["C", int] == [0, 5, 7, 2].toTensor
Source   Edit  
proc strTabToDf(t: OrderedTable[string, seq[string]]): DataFrame {.
    ...raises: [KeyError, ValueError], tags: [].}

Creates a data frame from a table of seq[string].

Note 1: This is mostly used for the old readCsv procedure, which is now called readCsvAlt. One should normally not have to deal with a table of strings as a DF input.

Note 2: This proc assumes that the given entries in the seq[string] have been cleaned of white space. The readCsv proc takes care of this.
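Example (a minimal sketch, not part of the original docs; we only assert the keys and length here, as the resulting column types depend on how the string contents are parsed):

```nim
import std/tables
import datamancer

let t = toOrderedTable({"a" : @["1", "2", "3"], "b" : @["x", "y", "z"]})
let df = t.strTabToDf()
doAssert df.len == 3
doAssert "a" in df and "b" in df
```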

Source   Edit  
proc summarize(df: DataFrame; fns: varargs[FormulaNode]): DataFrame {.
    ...raises: [FormulaMismatchError, Exception, ValueError, KeyError],
    tags: [RootEffect].}

Returns a data frame with the summaries given by fns applied. They are applied in the order in which they are given.

summarize is a reducing operation. The given formulas need to take full columns as arguments and produce scalars, using the << operator. If no left hand side and << operator are given, the new column name is computed automatically from the formula.

Example:

let df = toDf({ "x" : @[1, 2, 3, 4, 5], "y" : @[5, 10, 15, 20, 25] })
block:
  let dfRes = df.summarize(f{float:  mean(`x`) }) ## compute mean, auto creates a column name
  doAssert dfRes.len == 1 # reduced to 1 row
  doAssert "(mean x)" in dfRes
block:
  let dfRes = df.summarize(f{float: "mean(x)" << mean(`x`) }) ## same but with a custom name
  doAssert dfRes.len == 1 # reduced to 1 row
  doAssert "mean(x)" in dfRes
block:
  let dfRes = df.summarize(f{"mean(x)+sum(y)" << mean(`x`) + sum(`y`) })
  doAssert dfRes.len == 1
  doAssert "mean(x)+sum(y)" in dfRes
Source   Edit  
proc tail(df: DataFrame; num: int): DataFrame {....raises: [ValueError, KeyError],
    tags: [].}
Returns the tail of the DataFrame, i.e. the last num elements. Source   Edit  
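A minimal usage sketch for tail (not part of the original docs):

```nim
import datamancer

let df = toDf({"x" : [1, 2, 3, 4, 5]})
let t = df.tail(2)
doAssert t.len == 2
doAssert t["x", int] == [4, 5].toTensor
```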
proc toHashSet[T](t: Tensor[T]): HashSet[T]
Internal helper to convert a tensor to a HashSet Source   Edit  
proc transmute(df: DataFrame; fns: varargs[FormulaNode]): DataFrame {.
    ...raises: [Exception, KeyError, ValueError, FormulaMismatchError],
    tags: [RootEffect].}

Returns the data frame cut to the columns created by fns, which should involve a calculation. To only cut to one or more columns use the select proc.

It is equivalent to calling mutate and then select the columns created (or modified) during the mutate call.

Existing columns may also be overwritten by handing a formula with the name of an existing column as the resulting name.

The left hand side of the given formula will correspond to the new name of the column if present. If not, the name will be computed from a lisp representation of the formula code.

Example:

let df = toDf({ "x" : @[1, 2, 3], "y" : @[10, 11, 12], "z": ["5","6","7"] })
let dfRes = df.transmute(f{"x+y" ~ `x` + `y`})
doAssert "x+y" in dfRes
doAssert dfRes.ncols == 1
doAssert dfRes["x+y", int] == [11,13,15].toTensor
doAssert "y" notin dfRes
doAssert "z" notin dfRes
Source   Edit  
proc transmuteInplace(df: var DataFrame; fns: varargs[FormulaNode]) {.
    ...raises: [Exception, KeyError, ValueError, FormulaMismatchError],
    tags: [RootEffect].}
Inplace variant of transmute above. Source   Edit  
proc unique(c: Column): Column {....raises: [ValueError, KeyError, Exception],
                                 tags: [RootEffect].}
Returns a Column of all unique values in c (duplicates are removed).

Example:

let x = toColumn [1, 2, 1, 4, 5]
doAssert x.unique.toTensor(int) == [1, 2, 4, 5].toTensor
Source   Edit  
proc unique(df: DataFrame; cols: varargs[string]; keepAll = true): DataFrame {.
    ...raises: [ValueError, KeyError, Exception], tags: [RootEffect].}

Returns a DF with only distinct rows. If one or more cols are given the uniqueness of a row is only determined based on those columns. By default all columns are considered.

If not all columns are considered and keepAll is true, the resulting DF also contains all remaining columns. Of the duplicated rows, the first occurrence is kept!

Note: The corresponding dplyr function is distinct. The choice for unique was made, since distinct is a keyword in Nim!

Example:

let df = toDf({ "x" : @[1, 2, 2, 2, 4], "y" : @[5.0, 6.0, 7.0, 8.0, 9.0],
                    "z" : @["a", "b", "b", "d", "e"]})
block:
  let dfRes = df.unique() ## consider uniqueness of all columns, nothing removed
  doAssert dfRes["x", int] == df["x", int]
  doAssert dfRes["y", float] == df["y", float]
  doAssert dfRes["z", string] == df["z", string]
block:
  let dfRes = df.unique("x") ## only consider `x`, only keeps 1st, 2nd, last row
  doAssert dfRes["x", int] == [1, 2, 4].toTensor
  doAssert dfRes["y", float] == [5.0, 6.0, 9.0].toTensor
  doAssert dfRes["z", string] == ["a", "b", "e"].toTensor
block:
  let dfRes = df.unique(["x", "z"]) ## considers `x` and `z`, one more unique (4th row)
  doAssert dfRes["x", int] == [1, 2, 2, 4].toTensor
  doAssert dfRes["y", float] == [5.0, 6.0, 8.0, 9.0].toTensor
  doAssert dfRes["z", string] == ["a", "b", "d", "e"].toTensor
Source   Edit  
proc valTabToDf(t: OrderedTable[string, seq[Value]]): DataFrame {.
    ...raises: [ValueError, KeyError], tags: [].}

Creates a data frame from a table of seq[Value].

Note: This is also mainly a fallback option for old code.
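Example (a minimal sketch, not part of the original docs; %~ converts a regular value to a Value):

```nim
import std/tables
import datamancer

let t = toOrderedTable({"x" : @[%~ 1, %~ 2], "y" : @[%~ "a", %~ "b"]})
let df = t.valTabToDf()
doAssert df.len == 2
doAssert "x" in df and "y" in df
```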

Source   Edit  

Iterators

iterator groups(df: DataFrame; order = SortOrder.Ascending): (
    seq[(string, Value)], DataFrame) {....raises: [Exception, KeyError, ValueError],
                                       tags: [RootEffect].}

Yields the subgroups of a grouped DataFrame df and the (key, Value) pairs that were used to create the subgroup.

If df has more than one grouping, a subgroup is defined by the unique combination of the groupings. For example: mpg.group_by("class", "cyl") will yield a subgroup for each present combination of ("class", "cyl").

Note: only non empty data frames will be yielded!

Example:

let df = toDf({ "Class" : @["A", "C", "B", "B", "A", "C", "C"],
                    "Num" : @[1, 5, 3, 4, 8, 7, 2] })
  .group_by("Class")
let expClass = ["A", "B", "C"]
let dfA = toDf({ "Class" : ["A", "A"], "Num" : [1, 8] })
let dfB = toDf({ "Class" : ["B", "B"], "Num" : [3, 4] })
let dfC = toDf({ "Class" : ["C", "C", "C"], "Num" : [5, 7, 2] })
let expDf = [dfA, dfB, dfC]
var idx = 0
for t, subDf in groups(df):
  doAssert t[0][0] == "Class" # one grouping (first `[0]`), by `"Class"`
  doAssert t[0][1] == %~ expClass[idx] # one grouping (first `[0]`), Class label as `Value`
  doAssert subDf["Class", string] == expDf[idx]["Class", string]
  doAssert subDf["Num", int] == expDf[idx]["Num", int]
  inc idx
Source   Edit  
iterator items(df: DataFrame): Value {....raises: [ValueError, KeyError], tags: [].}

Returns each row of the DataFrame df as a Value of kind VObject.

This is an inefficient way to iterate over all rows in a data frame, as we don't have type information at compile time. Thus we need to construct a (Value internal) table to store (key, value) pairs at runtime.

It should only be used for convenience. To work with a data frame use procedures that are meant to modify / reduce / ... a data frame.
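Example (a sketch illustrating row iteration, not part of the original docs; keep in mind this is inefficient for large data frames):

```nim
import datamancer

let df = toDf({"x" : [1, 2, 3], "y" : ["a", "b", "c"]})
var count = 0
for row in df:                      ## each `row` is a `Value` of kind `VObject`
  doAssert row["x"] == %~ (count + 1)
  inc count
doAssert count == df.len
```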

Source   Edit  
iterator keys(df: DataFrame): string {....raises: [], tags: [].}
Iterates over all column keys of the input DataFrame. Source   Edit  
iterator values(df: DataFrame; cols: varargs[string]): Tensor[Value] {.inline,
    ...raises: [KeyError, ValueError], tags: [].}

Yields all columns cols of DataFrame df as Tensor[Value] rows.

Each row is yielded without column key information. The tensor is filled in the order of the existing columns. The first entry corresponds to the first column etc.

This proc is usually not very useful.

Source   Edit  

Macros

macro toTab(args: varargs[untyped]): untyped

Performs conversion of the untyped arguments to a DataFrame.

Arguments may be either a list of identifiers, symbols or calls which are convertible to a Column:

  • toTab(x, y, z)
  • toTab(foo(), bar())

or an OrderedTable[string, seq[string/Value]]

  • toTab(someOrderedTable)

or a table constructor:

  • toTab({ "foo" : x, "y" : bar() })
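Example (a minimal sketch of the identifier based call, not part of the original docs; the column names are taken from the identifier names):

```nim
import datamancer

let x = @[1, 2, 3]
let y = @["a", "b", "c"]
let df = toTab(x, y)
doAssert "x" in df and "y" in df
doAssert df.len == 3
```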
Source   Edit  

Templates

template `$`(df: DataFrame): string
Source   Edit  
template add(df: var DataFrame; args: varargs[untyped]): untyped

Helper overload for add above, which takes a varargs of values that are combined to a tuple automatically.

The tuple field names will be default Field0 etc. You cannot use this overload to define custom column names in an empty DF (but that use case should ideally be avoided anyway!).

Source   Edit  
template bind_rows(dfs: varargs[DataFrame]; id: string = ""): DataFrame

Overload of bind_rows above, for automatic creation of the id values.

With this overload, the data frames are simply numbered by their order in the dfs argument and the id column is filled with those numbers. The values will always appear as strings, even though we use integer numbering.

bind_rows combines several data frames row wise (i.e. data frames are stacked on top of one another). If a given column does not exist in one of the data frames, the corresponding rows of the data frame missing it, will be filled with VNull.

Example:

let a = [1, 2, 3]
let b = [3, 4, 5]
let c = [4, 5, 6, 7]
let d = [8, 9, 10, 11]

let df = toDf({"a" : a, "b" : b})
let df2 = toDf({"a" : c, "b" : d})
import sequtils

block:
  let res = bind_rows([df, df2])
  doAssert res.len == 7
  doAssert res.ncols == 2
  doAssert res["a", int] == concat(@a, @c).toTensor
  doAssert res["b", int] == concat(@b, @d).toTensor
block:
  let res = bind_rows([df, df2], "From")
  doAssert res.len == 7
  doAssert res.ncols == 3
  doAssert res["a", int] == concat(@a, @c).toTensor
  doAssert res["b", int] == concat(@b, @d).toTensor
  doAssert res["From", string] == concat(newSeqWith(3, "0"),
                                         newSeqWith(4, "1")).toTensor
Source   Edit  
template colsToDf(s: varargs[untyped]): untyped
Converts an arbitrary number of Columns to a DataFrame, or any number of key / value pairs of string / Column. Source   Edit  
template dataFrame(s: varargs[untyped]): untyped
alias for toTab Source   Edit  
template ncols(df: DataFrame): int
Returns the number of columns in the DataFrame df. Source   Edit  
template seqsToDf(s: varargs[untyped]): untyped
Converts an arbitrary number of sequences to a DataFrame, or any number of key / value pairs of string / seq[T]. Source   Edit  
template toDf(s: varargs[untyped]): untyped
An alias for toTab and the default way to construct a DataFrame from one or more collections (seqs, Tensors, Columns, ...). Source   Edit  
template withCombinedType(df: DataFrame; cols: seq[string]; body: untyped): untyped

A helper template to work with a dtype that encompasses all data types found in the cols of the input DataFrame df.

It injects a variable dtype into the calling scope.

Example:

let df = toDf({"x" : @[1, 2, 3], "y" : @[2, 3, 4], "z" : @[1.1, 2.2, 3.3]})
withCombinedType(df, @["x", "y"]):
  doAssert dtype is int
withCombinedType(df, @["x", "z"]):
  doAssert dtype is float # float can encompass `int` and `float` as we're lenient!
Source   Edit