"DFLib" (short for "DataFrame Library") is a lightweight, pure Java implementation of DataFrame. DataFrame is a very common structure in data science and Big Data worlds. It provides operations like search, filtering, joins, aggregations, statistical functions, etc., that are reminiscent of SQL (also of Excel), except you run them in your app over dynamic in-memory data sets.

There are DataFrame implementations in Python (pandas), R, Apache Spark, etc. DFLib project’s goal is to provide the same functionality for regular Java applications. It is a simple library that requires no special infrastructure. DFLib core is dependency-free.

The code in this documentation can be run in any Java IDE, as well as (and this is pretty cool!) in a Jupyter notebook. You will need Java 11 or newer.

Get Started with DFLib

Include DFLib in a project. Assuming you are using Maven, start by declaring a "BOM" to have a common version for multiple DFLib modules:

<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.dflib</groupId>
      <artifactId>dflib-bom</artifactId>
      <version>1.0.0-M22</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>

Next include DFLib core as a dependency:

<dependency>
    <groupId>org.dflib</groupId>
    <artifactId>dflib</artifactId>
</dependency>

Create a DataFrame, do some manipulations with it, print the result:

DataFrame df1 = DataFrame
        .foldByRow("a", "b", "c")
        .ofStream(IntStream.range(1, 10000));

DataFrame df2 = df1.rows(r -> r.getInt(0) % 2 == 0).select();

System.out.println(Printers.tabular.toString(df2));

When you run this code, console output will look like this:

   a    b    c
---- ---- ----
   4    5    6
  10   11   12
  16   17   18
...
9982 9983 9984
9988 9989 9990
9994 9995 9996
1666 rows x 3 columns
We’ll omit print statements in all the following examples, and will simply display their output. Details of printing are discussed in the "Printers" chapter.

Main Data Structures

The two essential DFLib classes are Series and DataFrame. Series is a 1-dimensional array-like object, and DataFrame is a 2-dimensional table-like object. Columns in a DataFrame are stored as Series. Additionally, we will discuss the Index object that stores DataFrame column names.

Both DataFrame and Series (and Index) are fully immutable, so all operations on them return a new instance. Behind the scenes the library shares as much data as possible between instances, so copying these objects does not cause significant performance degradation, and in turn makes DFLib fully thread-safe, with multiple concurrent operations possible on the same data structure. Also, immutability means that you can have full snapshots of your data at each step of a transformation, simplifying debugging and auditing of data pipelines.
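As a quick illustration (head(..), covered later in this document, returns the leading rows of a DataFrame):

DataFrame df1 = DataFrame.foldByRow("a").of(1, 2, 3);

DataFrame df2 = df1.head(2); // a new instance; df1 itself is left unchanged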

Series

Series is the simpler of the two data structures. You can think of it as a wrapper around an array of values. You can use Series to model sequences of data such as timestamps in a time series. The Series object is parameterized for the type of data that it holds. So there can be a Series<String> or a Series<LocalDate>, etc. There is also an important category of "primitive" Series (IntSeries, LongSeries, DoubleSeries, BooleanSeries) that are optimized for memory use and arithmetic operations. Of course, each primitive Series can also pose as a Series of the corresponding wrapper object type (e.g. IntSeries is also a Series<Integer>).

Series object is an important building block of DataFrame, but it defines a number of useful data manipulation and transformation operations on its own. Those will be covered in the following chapters. Here we’ll discuss how to create Series.

Creation of Series from Arrays

Series can be created using static ofXyz(..) methods on the Series interface:

Series<String> s = Series.of("a", "bcd", "ef", "g");
a
bcd
...
g
4 elements

Primitive Series classes have their own factory methods. E.g.:

IntSeries is = Series.ofInt(0, 1, -300, Integer.MAX_VALUE);

Creation of Series By Element

If we don’t have a data array or collection to start with, and instead somehow produce a sequence of values of unpredictable length, we can use the Series.byElement() API. E.g. the following example reads whitespace-separated String values from an InputStream one at a time:

// InputStream inputStream = ...
SeriesAppender<String, String> appender = Series
        .byElement(Extractor.<String>$col()) (1)
        .appender();

Scanner scanner = new Scanner(inputStream);
while (scanner.hasNext()) {
    appender.append(scanner.next()); (2)
}

Series<String> s = appender.toSeries();
1 Create a Series "appender" that will accumulate values. For primitive Series you would use Extractor.$int(..), Extractor.$long(..) and so on
2 Append values as they are read one-by-one

DataFrame

DataFrame is an in-memory table made of an Index header and a number of named columns. Each column is a Series, and has an associated name in the Index. DataFrame can contain columns of different kinds, so it is not parameterized for any single type.

"Rows" is a purely virtual concept as the data is organized by column, yet there is a number of APIs that appear to operate on rows for user convenience.

There are a few ways to create a DataFrame. Here we’ll show how to convert various in-memory objects to DataFrames (arrays, Streams, Collections, Series).

More often than not, a DataFrame is not created from an in-memory object, but is rather loaded from (and saved to) an external source, like a database or a CSV file. Those are discussed in separate chapters.

First example, adding data row by row:

DataFrame df = DataFrame
        .byArrayRow("name", "age") (1)
        .appender() (2)
        .append("Joe", 18)   (3)
        .append("Andrus", 49)
        .append("Joan", 32)
        .toDataFrame();
1 A special builder is created to append each row as a vararg array
2 The builder creates an "appender" object. While here we are using the builder with default settings, it has extra methods to configure capacity, data sampling, etc.
3 Passing individual rows to the appender one-by-one

The resulting DataFrame looks like this:

name   age
------ ---
Joe    18
Andrus 49
Joan   32

A more general example - creating a DataFrame from a list of objects by "extracting" column data from object properties:

record Person(String name, int age) {
}

List<Person> people = List.of(
        new Person("Joe", 18),
        new Person("Andrus", 49),
        new Person("Joan", 32));

DataFrame df = DataFrame
        .byRow( (1)
                Extractor.$col(Person::name),
                Extractor.$int(Person::age))
        .columnNames("name", "age") (2)
        .appender() (3)
        .append(people)   (4)
        .toDataFrame();
1 The builder is started with an array of Extractors. Each extractor generates its own column, filling it with a corresponding object property.
2 Specifying the names of the DataFrame columns. If omitted, column names are assigned automatically
3 Creating a row-by-row appender
4 Appending the list.

The resulting DataFrame looks like this:

name   age
------ ---
Joe    18
Andrus 49
Joan   32

Another example - a single-dimensional array can be "folded" into a DataFrame row-by-row:

DataFrame df = DataFrame
        .foldByRow("name", "age") (1)
        .of("Joe", 18, "Andrus", 49, "Joan", 32); (2)
1 Folding Builder is created
2 Passing a varargs array of values, that is folded to match the specified number of columns, row by row.

Same, but folding column-by-column:

DataFrame df = DataFrame
        .foldByColumn("name", "age")
        .of("Joe", "Andrus", "Joan", 18, 49, 32);

You can also create DataFrames from collections or Streams (folded either by row or by column). Here is an example of how to use a Stream of primitive ints, creating a DataFrame made of memory-efficient IntSeries columns:

DataFrame df = DataFrame
        .foldByColumn("col1", "col2")
        .ofStream(IntStream.range(0, 10000));
col1 col2
---- ----
   0 5000
   1 5001
   2 5002
...
4997 9997
4998 9998
4999 9999
5000 rows x 2 columns

Finally, a DataFrame can be created from an array of Series, each Series representing a column:

DataFrame df = DataFrame
        .byColumn("name", "age")
        .of(
                Series.of("Joe", "Andrus", "Joan"),
                Series.ofInt(18, 49, 32)
        );

This is the most efficient way, as an array of Series is how each DataFrame is structured internally.

Index

Index is somewhat similar to Series. It serves as a DataFrame "header" and internally allows quick resolution of column String labels to their numeric positions. You’d work with Index instances outside a DataFrame only occasionally, but it is still good to know its capabilities. E.g., here is how to get all the column labels:

String[] labels = df.getColumnsIndex().toArray();

Printers

When doing data exploration and running data pipelines, it is important to be able to visualize data at every step. More advanced forms of visualization include charts and diagrams. But the simplest thing you can do is print data to the console. Both DataFrame and Series implement the toString() method that prints their contents as a single line, truncating large data sets in the process, which is good for debugging applications running on a server.

A more human-friendly form of output is produced by a tabular printer. Here is how to use the default tabular printer:

DataFrame df = DataFrame
        .foldByColumn("col1", "col2", "col3")
        .ofStream(IntStream.range(0, 10000));

String table = Printers.tabular.toString(df);
System.out.println(table);

It prints the data as follows, displaying at most 6 rows and truncating the rest. The same goes for cell values: those longer than 30 characters are truncated as well (the values in our example are pretty short, so value truncation is not visible here).

col1 col2 col3
---- ---- ----
   0 3334 6668
   1 3335 6669
   2 3336 6670
...
3331 6665 9999
3332 6666    0
3333 6667    0
3334 rows x 3 columns

You can change the truncation parameters by creating your own printer:

Printer printer = new TabularPrinter(3, 3); (1)
String table = printer.toString(df);
System.out.println(table);
1 Create a printer that displays at most 3 rows and up to 3 characters in each cell.
c.. c.. c..
--- --- ---
  0 3.. 6..
  1 3.. 6..
...
3.. 6..   0
3334 rows x 3 columns
If you are using Jupyter Notebook, all the printers are already set up for you. So if the last statement in a Jupyter cell is a DataFrame or a Series, it will be printed as a table in the notebook.

Expressions

DFLib has a built-in expression language (implemented as a Java "DSL"). It allows you to perform column-centric operations on DataFrames and Series, such as data transformation, aggregation and filtering. Exp is the interface representing an expression that takes either a DataFrame or a Series and produces a Series of the specified type.

To use expressions, you’d often start by adding a static import of the Exp interface, so that all its factory methods are available directly in the code:

import static org.dflib.Exp.*;

Now let’s create two simple expressions that return a named and a positional column of the requested type:

StrExp lastExp = $str("last");
DecimalExp salaryExp = $decimal(2);

And now let’s evaluate the expressions:

DataFrame df = DataFrame.foldByRow("first", "last", "salary").of(
        "Jerry", "Cosin", new BigDecimal("120000"),
        "Juliana", "Walewski", new BigDecimal("80000"),
        "Joan", "O'Hara", new BigDecimal("95000"));

Series<String> last = lastExp.eval(df);
Series<BigDecimal> salary = salaryExp.eval(df);

This doesn’t look like much (other DFLib APIs would do the same), but this basic abstraction can describe a wide range of operations (filtering, sorting, aggregation, etc). Expressions are rarely evaluated standalone. More commonly they are passed as arguments to various other methods, as described elsewhere in this document.

DFLib expressions work on Series instead of individual values, so they can achieve the best possible performance for any given operation. Expressions should be the preferred way to manipulate your data instead of more "direct" APIs, such as custom lambdas.
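As a sketch of the difference (row filtering via rows(..) is described in the "Row Operations" chapter), both versions below pick the same rows, but the first one invokes a lambda per row, while the second processes the whole column as a Series:

// a hypothetical data set with an int "salary" column
DataFrame staff = DataFrame.foldByRow("name", "salary").of(
        "Jerry", 120000,
        "Joan", 95000);

// per-row lambda, invoked once per row
DataFrame high1 = staff.rows(r -> ((Integer) r.get("salary")) > 100000).select();

// column-centric expression, processing the whole column at once
DataFrame high2 = staff.rows($int("salary").gt(100000)).select();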

Now let’s look at various types of expressions…​

Column Expressions

$str(..) and $decimal(..) expressions in the example above are the column lookup expressions. They return a DataFrame column with a given name or at a given (zero-based) position without applying any transformations to the data.

So what is returned when you evaluate a column expression with a Series object? Series can be thought of as a single (unnamed) column. So a column expression simply returns the Series unchanged, ignoring implied column name or position.
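A minimal sketch of this behavior:

Series<String> names = Series.of("Cosin", "Walewski", "O'Hara");

// the implied column name is ignored; the Series is returned as is
Series<String> same = $str("anything").eval(names);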

Factory methods for column expressions are easy to spot in the Exp interface - they all start with a dollar sign:

$col("col"); (1)

(2)
$decimal("col");
$double("col");
$int("col");
$long("col");

(3)
$date("col");
$dateTime("col");
$time("col");

(4)
$bool("col");
$str("col");
1 Retrieves a column without implying a specific type
2 Retrieve columns that are expected to be numeric
3 Retrieve columns that are expected to be date/time
4 Retrieve columns of boolean or String types

Return type is implied from the method name ($str(..) produces a Series<String>, $decimal(..) - Series<BigDecimal>, $int(..) - Series<Integer>, $date(..) - Series<LocalDate> and so on).

To avoid overhead, column expressions do not perform any type conversions of the column data (with a notable exception of $bool that does). So you need to pick the right type in your code based on your knowledge of data to avoid ClassCastExceptions downstream, or use a generic $col(..) expression. If you do want to convert data from one type to another, use one of the castAs methods explicitly:

$str("salary").castAsDecimal();

castAs will do its best to provide a meaningful conversion to the target type. But sometimes it is simply not possible, and it will throw an exception (e.g. a String like "abcd" can’t be converted to a number without some contextual knowledge). In some other cases, no default conversion exists, but a custom conversion is possible and desired. You can do it via Exp.mapVal(..).
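E.g., here is a sketch of such a custom conversion, assuming a "salary" column with hypothetical String values like "120,000.50" that contain grouping separators:

Exp<BigDecimal> salary = $str("salary")
        .mapVal(s -> new BigDecimal(s.replace(",", ""))); // strip commas before parsing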

Constant Expressions

If we want to generate a Series with the same repeating value, we can use the $val(..) expression:

Series<String> hi = $val("hi!").eval(df);
hi!
hi!
hi!

A useful case for $val(..) is to create a separator for String concatenation:

Series<String> fn = concat( (1)
        $str("first"),
        $val(" "), (2)
        $str("last")).eval(df);
1 This is a static import of Exp.concat(..), an expression that takes a variable number of arguments
2 Inserting space between first and last name
Jerry Cosin
Juliana Walewski
Joan O'Hara

String Expressions

TODO

Numeric Expressions

$int(..), $long(..), $double(..) and $decimal(..) column expressions mentioned above are numeric (i.e. instances of the NumExp expression subclass), and as such provide arithmetic, comparison, and numeric aggregating operations.

We will look at comparisons later in the "Conditions" chapter. Here is an example of an arithmetic operation:

NumExp<?> exp = $int("c1")
        .add($long("c2")) (1)
        .div(2.); (2)

DataFrame df = DataFrame.foldByRow("c1", "c2").of(
                1, 2L,
                3, 4L)
        .cols("c3").select(exp);
1 Add two columns
2 Divide the result by a constant value
 c3
---
1.5
3.5
Associativity…​ Since expressions are built with Java API, there are no special associativity rules. Expressions are executed in the order they were chained together. In the example above, addition happens before division.
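E.g., a sketch of controlling the grouping: to divide "c2" by 2 before adding it to "c1", chain the division onto "c2" itself:

NumExp<?> exp = $int("c1")
        .add($long("c2").div(2.)); // the division now runs before the addition
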
Widening conversions…​ When we added values of int and long columns, the result was implicitly widened to long, and later when it was divided by a double, the result became double. These conversions happen automatically and transparently to the user. They follow regular Java rules for primitive numbers operations (with an addition of widening from primitives to BigDecimal).

Here is an example of aggregating operations:

NumExp<?> exp = $int("c1")
        .add($long("c2"))
        .sum() (1)
        .div(2.);

DataFrame df = DataFrame.foldByRow("c1", "c2").of(
                1, 2L,
                3, 4L)
        .cols("c3").agg(exp);
1 Inserting an aggregating expression in a chain of arithmetic expressions produces an aggregated result
 c3
---
5.0

Date Expressions

TODO

Complex Expressions

Expressions can be expanded by calling methods on the Exp object (and its type-specific subclasses). As we’ve already seen above, expressions can be composed of other expressions by invoking static methods of Exp. E.g.:

(1)
Condition c = and( (2)
        $str("last").startsWith("A"), (3)
        $decimal("salary").add($decimal("benefits")).gt(100000.) (4)
);
1 "Condition" is an Exp<Boolean> described in more detail below
2 and(..) is a statically-imported Exp.and(..)
3 startsWith(..) produces a condition based on another string expression
4 add(..) produces an addition operation for two numeric expressions, gt(..) produces a condition from the result of the addition

Conditions

TODO

Sorters

Sorter is a special object that sorts DFLib data structures. Internally, a Sorter uses an expression to retrieve values that comprise the sorting criteria and indexes them in the specified order. A Sorter can be created from any expression by calling its asc() or desc() methods:

// sort by last name in the ascending order
Sorter s = $str("last").asc();
DFLib allows you to create a Sorter based on any expression type. At runtime, the actual type must be either a Java primitive or an instance of Comparable, or a ClassCastException will be thrown during sorting.

Column Operations

To manipulate data in a single DataFrame, with a few exceptions you would start by picking a subset of columns or rows. Those are represented as ColumnSet and RowSet objects. Let’s look at columns first…​

Pick Columns

Columns are picked either by condition, by name, by position or implicitly. Let’s take a look at each style…​

By Condition

A condition is specified as a "predicate" lambda:

DataFrame df1 = df.cols(c -> !"middle".equals(c)).select();

This form of cols(..) does not allow reordering of the columns. The resulting columns will always follow the relative order of the original DataFrame:

first   last
------- --------
Jerry   Cosin
Joan    O'Hara

By Names

Here is a simple example of how to pick two columns from a DataFrame by name, and return a new DataFrame with just those two columns:

DataFrame df = DataFrame.foldByRow("first", "last", "middle").of(
        "Jerry", "Cosin", "M",
        "Joan", "O'Hara", null);

DataFrame df1 = df
        .cols("last", "first") (1)
        .select(); (2)
1 Define a ColumnSet of two columns matching provided names. The order of columns in the ColumnSet matches the order of column names passed to this method.
2 Return a new DataFrame matching the ColumnSet. More on this in the next chapter.
last     first
-------- -------
Cosin    Jerry
O'Hara   Joan
Passing names in a specific order to cols(..) allowed us to reorder the columns in the result.

Instead of listing included columns, you might specify which columns should be excluded from selection. This flavor doesn’t support reordering:

DataFrame df1 = df.colsExcept("middle").select();

By Positions

You can also pick columns by positions. The first position is "0".

DataFrame df1 = df.cols(1, 0).select();

Implicitly

If you specify no columns at all when building a ColumnSet, the returned DataFrame will be the same as the source one.

DataFrame df1 = df.cols().select();
In this particular case, implicit column selection is not very useful. But in fact, this style is dynamic and may result in a smaller subset of columns selected based on the semantics of an operation applied to the ColumnSet (e.g., named expressions producing specific columns). Also, it allows applying a transformation to all columns at once. These cases are described later in the columns transformation chapter.

Rename Columns

We saw how we can select a subset of columns. While doing that, we can also assign new names to all or some of the columns via selectAs(..).

Rename all column set columns:

DataFrame df1 = df
        .cols("last", "first")
        .selectAs("last_name", "first_name"); (1)
1 Passing new names for the ColumnSet columns. The order of the names must correspond to the order of columns in the ColumnSet, regardless of how it was defined.
last_name first_name
--------- ----------
Cosin     Jerry
Walewski  Juliana

Rename a subset of columns. Specifying names of all columns at once may not be practical for "wide" column sets. Instead, you can pass a map of old to new names to selectAs(..) to rename just some columns. Names not present in the map will remain unchanged:

DataFrame df1 = df
        .cols("last", "first")
        .selectAs(Map.of("last", "LAST_NAME"));
LAST_NAME first
--------- -------
Cosin     Jerry
O'Hara    Joan

Rename with a function, applied to all column names in turn. This is useful e.g. to convert all names to lower or upper case:

DataFrame df1 = df
        .cols("last", "first")
        .selectAs(String::toUpperCase);
LAST   FIRST
------ -----
Cosin  Jerry
O'Hara Joan

Transform Columns

In the above examples we performed DataFrame vertical slicing, column reordering and renaming. In this chapter we’ll show how to transform column data.

Generating columns with expressions:

Exp fmExp = concat(
        $str("first"),
        ifNull($str("middle").mapVal(s -> " " + s), ""));

DataFrame df1 = df
        .cols("first_middle", "last") (1)
        .select(fmExp, $str("last")); (2)
1 When defining the ColumnSet, we are allowed to specify columns that are not present in the original DataFrame, as we are naming the result columns, not the source columns.
2 For each column in the ColumnSet, there should be an expression that generates the column. The first expression here transforms the data, the second simply picks a column from the source without modification.
first_middle last
------------ ------
Jerry M      Cosin
Joan         O'Hara

Generating columns row-by-row with RowMapper:

RowMapper mapper = (from, to) -> {
    String middle = from.get("middle") != null
            ? " " + from.get("middle")
            : "";
    to.set(0, from.get("first") + middle).set(1, from.get("last"));
};

DataFrame df1 = df
        .cols("first_middle", "last")
        .select(mapper);

There’s also a way to generate columns row-by-row with a per-column array of RowToValueMapper. It is not very different from what we’ve seen so far, so we are leaving this as an exercise for the reader.

Split Columns

Values can be split using an expression into iterable or array elements, and new columns can be generated from those elements, thus "expanding" each row. Here is an example of handling Lists of values (List is, of course, an Iterable):

DataFrame df = DataFrame.foldByRow("name", "phones").of(
        "Cosin", List.of("111-555-5555","111-666-6666","111-777-7777"),
        "O'Hara", List.of("222-555-5555"));

DataFrame df1 = df
        .cols("primary_phone", "secondary_phone")
        .selectExpand($col("phones")); (1)
1 selectExpand(..) takes an expression that returns a column of Lists. If a List has more than 2 numbers, the rest are ignored; if it has fewer, nulls are used for the missing elements
primary_phone secondary_phone
------------- ---------------
111-555-5555  111-666-6666
222-555-5555  null

If you don’t know the exact number of phones, but would like to capture them all in separate columns, do not specify any explicit names (as mentioned in the implicit column selection chapter). As many columns as needed to fit the longest array of phone numbers will be generated on the fly, and the names will be assigned dynamically:

DataFrame df1 = df
        .cols() (1)
        .selectExpand($col("phones"));
1 Not specifying any columns will result in split columns generated dynamically.
0            1            2
------------ ------------ ------------
111-555-5555 111-666-6666 111-777-7777
222-555-5555 null         null

If the phone numbers were provided as a comma-separated String instead of a List, you can split the String into an array of numbers using the split(..) expression, and use selectExpandArray(..) to generate the columns:

DataFrame df = DataFrame.foldByRow("name", "phones").of(
        "Cosin", "111-555-5555,111-666-6666,111-777-7777",
        "O'Hara", "222-555-5555");

DataFrame df1 = df
        .cols("primary_phone", "secondary_phone")
        .selectExpandArray($str("phones").split(','));

Merge Columns

The ColumnSet operations demonstrated so far resulted in discarding the source DataFrame, and returning only the columns defined within the set. But often we would like to recombine transformed ColumnSet with the original DataFrame. Any in-place data cleanup, data set enrichment, etc. fall in this category.

Of course, we call them "in-place" only in a logical sense, as DataFrame is immutable, and all modifications result in creation of a new instance.

For this, ColumnSet provides a number of "merging" methods. All the ColumnSet methods that do not start with select… perform a merge with the source DataFrame. In fact, all the renaming and data transformation operations that we just discussed can also be executed as "merges".

Here is an example of cleaning up a data set and adding a new column:

DataFrame df = DataFrame.foldByRow("first", "last", "middle").of(
        "jerry", "cosin", "M",
        "Joan", "O'Hara", null);

Function<String, Exp<String>> cleanup = col -> $str(col).mapVal(
        s -> s != null && !s.isEmpty()
                ? Character.toUpperCase(s.charAt(0)) + s.substring(1)
                : null); (1)

DataFrame df1 = df
        .cols("last", "first", "full") (2)
        .merge( (3)
                cleanup.apply("last"),
                cleanup.apply("first"),
                concat($str("first"), $val(" "), $str("last"))
        );
1 An expression that capitalizes the first letter of the name
2 The columns we are going to generate or transform
3 Instead of select(..), use merge(..) to combine columns with the original DataFrame

Now let’s check the result. The first two columns in the ColumnSet ("last", "first") were already present in the DataFrame, so they got replaced with the cleaned up versions. The last column ("full") was new, so it was appended to the right side of the DataFrame:

first last   middle full
----- ------ ------ -----------
Jerry Cosin  M      jerry cosin
Joan  O'Hara null   Joan O'Hara

General merge rules:

  • Merging is done by name. DataFrame’s columns with matching names are replaced with new versions, columns that are not a part of the ColumnSet are left unchanged, and columns in the ColumnSet, but not in the DataFrame, are appended on the right.

  • The order of the existing columns in the ColumnSet has no effect on the order of replaced columns (i.e. they are placed in their original positions). The relative order of appended columns is respected.

  • All transformation operations (such as expressions in our example) are applied to the original DataFrame columns (and can reference those outside the ColumnSet), but do not see the transformed columns. This is why there is a lowercase name in the resulting "full" column.

What if we want to ensure that all of the ColumnSet columns are appended, and no replacement of the original columns occurs? For this you can manually specify ColumnSet labels that are not found in the source DataFrame, or you can use DataFrame.colsAppend(..), and DFLib itself will ensure that the names are unique, appending _ to each label until it doesn’t overlap with anything already in the DataFrame:

DataFrame df1 = df
        .colsAppend("last", "first") (1)
        .merge(
                cleanup.apply("last"),
                cleanup.apply("first")
        );
1 Using colsAppend(..) instead of cols() ensures that both pre- and post-cleanup columns are preserved.
first last   middle last_  first_
----- ------ ------ ------ ------
jerry cosin  M      Cosin  Jerry
Joan  O'Hara null   O'Hara Joan
Since we’ve already discussed various flavors of column renaming and transformation in the context of select(..), we leave it as an exercise for the user to explore their "merging" counterparts - as(..), map(..), expand(..), etc.

Compact Columns

This transformation converts columns to primitives. The values in the columns to be converted can be Numbers, Booleans or Strings (or objects with toString() that can be parsed to a primitive type):

DataFrame df = DataFrame.foldByRow("year", "sales").of(
        "2022", "2005365.01",
        "2023", "4355098.75");

DataFrame df1 = df
        .cols("year").compactInt(0) (1)
        .cols("sales").compactDouble(0.);
1 The argument to compact is the primitive value to use if the value being converted is null
compactInt(..), compactDouble(..) and other similar methods are not just syntactic sugar. They convert DataFrame columns to primitive Series, that take significantly less memory (hence, the name "compact") and compute most operations faster than Object-based Series. So you should consider using them where possible.

Fill Nulls

As a part of a dataset cleanup effort, you might want to replace null values with something meaningful. For this you may have to guess or maybe extrapolate the existing values based on some internal knowledge of the dataset. To help with this task, ColumnSet provides a few methods discussed here.

Filling nulls with a constant value:

DataFrame df = DataFrame.foldByRow("c1", "c2").of(
        "a1", "a2",
        null, null,
        "b1", "b2");

DataFrame clean = df.cols("c1", "c2").fillNulls("X");
c1 c2
-- --
a1 a2
X  X
b1 b2

Filling nulls with values adjacent to the cells with nulls:

DataFrame clean = df
        .cols("c1").fillNullsBackwards() (1)
        .cols("c2").fillNullsForward(); (2)
1 fillNullsBackwards(..) uses the first non-null value following the null cell.
2 fillNullsForward(..) takes a non-null value preceding the null cell and uses it to replace nulls.
c1 c2
-- --
a1 a2
b1 a2
b1 b2

There’s also a way to fill nulls from another Series object used as a "mask" :

Series<String> mask = Series.of("A", "B", "C", "D");
DataFrame clean = df.cols("c1", "c2").fillNullsFromSeries(mask);
c1 c2
-- --
a1 a2
B  B
b1 b2
The "mask" Series can be longer or shorter than the DataFrame height. Values are aligned by position starting at zero.

Finally, we can use an Exp to calculate values for nulls:

        DataFrame df = DataFrame.foldByRow("c1", "c2").of(
                "a1", "a2",
                null, null,
                "b1", "b2");

        DataFrame clean = df.cols("c1", "c2")
                .fillNullsWithExp(rowNum()); (1)
1 Exp.rowNum() generates a Series with row numbers, so the nulls are replaced by the number denoting their position in the DataFrame:
c1 c2
-- --
a1 a2
2  2
b1 b2

Drop Columns

To get rid of certain columns in a DataFrame, you can either select the columns you want to drop and call drop(), or select the columns you want to keep and call select():

DataFrame df1 = df.cols("middle").drop();
DataFrame df1 = df.colsExcept("middle").select();

In both cases, the result is the same:

first last
----- ------
Jerry Cosin
Joan  O'Hara

Row Operations

Just like with columns, row operations first define a RowSet, execute some transformation, and then are either merged back with the original DataFrame, or "selected" as a standalone one. RowSet and ColumnSet have other similarities (e.g. the ability to evaluate expressions by column, and create RowColumnSet), but RowSet is also rather different in how its rows are picked and what other operations are available.

Pick Rows

Rows are picked either by condition, by positions or as a range. Let’s take a look at each style…​

By Condition

A Condition (a boolean Exp) can be used to pick matching rows:

DataFrame df = DataFrame.foldByRow("first", "last", "middle").of(
        "Jerry", "Cosin", "M",
        "Juliana", "Walewski", null,
        "Joan", "O'Hara", "P");

DataFrame df1 = df
        .rows($str("last").startsWith("W").eval(df)) (1)
        .select(); (2)
1 Applies a Condition to the DataFrame, to include matching rows in the RowSet.
2 Returns a new DataFrame matching the RowSet
first   last     middle
------- -------- ------
Juliana Walewski null

Another form of condition is a "predicate" lambda (a RowPredicate object), evaluated row-by-row:

DataFrame df1 = df
        .rows(r -> r.get("last", String.class).startsWith("W"))
        .select();

A condition can be precalculated as a BooleanSeries. A common scenario is calling locate() on another Series or a DataFrame / RowSet to build a BooleanSeries "selector", and then using it to pick rows from another DataFrame:

IntSeries salaries = Series.ofInt(100000, 50000, 45600); (1)
BooleanSeries selector = salaries.locateInt(s -> s > 49999); (2)

DataFrame df1 = df.rows(selector).select();
1 A Series of salaries with element positions matching row positions in the persons DataFrame.
2 Precalculates a reusable "selector"
first   last     middle
------- -------- ------
Jerry   Cosin    M
Juliana Walewski null

By Positions

Here is an example of RowSet defined using an array of row positions.

DataFrame df1 = df
        .rows(2, 0, 2) (1)
        .select();
1 An int[] that defines a RowSet
first last   middle
----- ------ ------
Joan  O'Hara P
Jerry Cosin  M
Joan  O'Hara P
An array of positions is also a kind of "condition". The main difference is that it allows reordering rows by specifying positions in a desired sequence and/or duplicating rows by referencing the same position more than once. Both features are demonstrated in the example above.

Instead of an array, positions can be defined as an IntSeries "index". Just like with conditions, it is often calculated from another Series or DataFrame / RowSet:

IntSeries selector = salaries.indexInt(s -> s > 49999); (1)

DataFrame df1 = df.rows(selector).select();
1 Precalculates a reusable "selector" with positions. Uses a Series of salaries that is not a part of "our" DataFrame
first   last     middle
------- -------- ------
Jerry   Cosin    M
Juliana Walewski null

As a Range

Finally, rows can be selected as a continuous range of row positions:

DataFrame df1 = df
        .rowsRange(0, 2) (1)
        .select();
1 The range is defined by two ints: its starting index, and the index following its last index.
first   last     middle
------- -------- ------
Jerry   Cosin    M
Juliana Walewski null

Transform Rows

Just like ColumnSet, RowSet defines a number of column- and row-based transformations. And just like with ColumnSet, each transformation can be invoked as a "select" or a "merge", returning either the RowSet rows or all rows from the original DataFrame. Here we will demonstrate selecting operations, and will talk about merging next.

Transforming with column expressions:

DataFrame df = DataFrame.foldByRow("last", "age", "retires_soon").of(
        "Cosin", 61, false,
        "Walewski", 25, false,
        "O'Hara", 59, false);

DataFrame df1 = df
        .rows($int("age").mapBoolVal(a -> 67 - a < 10))
        .select(
                $col("last"),
                $col("age"),
                $val(true)); (1)
1 Select the values of the first two columns without changes, and flip the "retires_soon" flag to "true" in the last one.
last   age retires_soon
------ --- ------------
Cosin   61         true
O'Hara  59         true
We had to specify an expression for every column in the DataFrame, even though only a single column got transformed. This code can be improved by creating a RowColumnSet, described in its own chapter.

Transforming row-by-row with RowMapper:

RowMapper mapper = (from, to) -> {
    from.copy(to);
    to.set("retires_soon", true);
};

DataFrame df1 = df
        .rows($int("age").mapBoolVal(a -> 67 - a < 10))
        .select(mapper);

There’s also a way to transform rows with a per-column array of RowToValueMapper. It is not very different from what we’ve seen so far, so we are leaving this as an exercise for the reader.

Merge Rows

If we want to merge the result of a RowSet transformation to the original DataFrame, we’ll need to use methods like merge(..) (i.e. those that do not start with select):

DataFrame df = DataFrame.foldByRow("last", "age", "retires_soon").of(
        "Cosin", 61, false,
        "Walewski", 25, false,
        "O'Hara", 59, false);

DataFrame df1 = df
        .rows($int("age").mapBoolVal(a -> 67 - a < 10))
        .merge(
                $col("last"),
                $col("age"),
                $val(true));
last     age retires_soon
-------- --- ------------
Cosin     61         true
Walewski  25        false
O'Hara    59         true

Merging rows is done similar to columns, except rows have no name labels, so it is done by position. General merge rules:

  • Merging is done by row position. DataFrame rows with matching positions are replaced with transformed versions, rows that are not a part of the RowSet are left unchanged, and rows in the RowSet, but not in the DataFrame (e.g., intentionally duplicated rows or split rows) are appended at the bottom.

  • The order of the existing rows in the RowSet has no effect on the order of replaced rows (i.e. they are placed in their original positions). The relative order of added rows is respected.

Just like with columns, for most RowSet.select(..) methods there are "merging" counterparts, such as map(..), expand(..), unique(..), etc.

Split Rows

Splitting rows is somewhat simpler than splitting columns. The API takes a single column (no expression is allowed), and splits any expandable values into new rows:

DataFrame df = DataFrame.foldByRow("name", "phones").of(
        "Cosin", List.of(
                "111-555-5555",
                "111-666-6666",
                "111-777-7777"),
        "O'Hara", List.of("222-555-5555"));

DataFrame df1 = df
        .rows()
        .selectExpand("phones");
name   phones
------ ------------
Cosin  111-555-5555
Cosin  111-666-6666
Cosin  111-777-7777
O'Hara 222-555-5555
The expansion column can contain scalars, arrays or iterables.

Unique Rows

To deduplicate DataFrames, there’s an API to select fully or partially-unique rows. Uniqueness is checked either on all columns of each row, or a preselected subset:

DataFrame df = DataFrame.foldByRow("first", "last").of(
        "Jerry", "Cosin",
        "Jerry", "Jones",
        "Jerry", "Cosin",
        "Joan", "O'Hara");

DataFrame df1 = df.rows().selectUnique(); (1)
1 Keep only fully unique rows
first last
----- ------
Jerry Cosin
Jerry Jones
Joan  O'Hara
DataFrame df2 = df.rows().selectUnique("first"); (1)
1 Deduplicate first names, picking the earliest-encountered row for each unique first name.
first last
----- ------
Jerry Cosin
Joan  O'Hara

Drop Rows

To get rid of certain rows in a DataFrame, you can either select the rows you want to drop and call drop(), or select the rows you want to keep and call select():

DataFrame df = DataFrame.foldByRow("first", "last", "middle").of(
        "Jerry", "Cosin", "M",
        "Juliana", "Walewski", null,
        "Joan", "O'Hara", "P");

DataFrame df1 = df.rows($col("middle").isNull()).drop();
DataFrame df2 = df.rowsExcept($col("middle").isNull()).select();

In both cases, the result is the same:

first last   middle
----- ------ ------
Jerry Cosin  M
Joan  O'Hara P

Row and Column Operations

TODO

Pick Rows and Columns

TODO

Transform Rows and Columns

TODO

Merge Rows and Columns

TODO

Drop Rows and Columns

TODO

Head and Tail

As we’ve seen previously, most operations within a single DataFrame are performed on row and column sets. Still, some are applied to the DataFrame directly. head and tail are such examples.

When you only need the "first N" or "last M" rows of a DataFrame (or values of a Series) you can use head and tail operations. Here is how to get the top 2 rows of a DataFrame:

DataFrame df = DataFrame.foldByRow("first", "last").of(
        "Jerry", "Cosin",
        "Juliana", "Walewski",
        "Joan", "O'Hara");

DataFrame df1 = df.head(2); (1)
1 Returns a new DataFrame with two top rows of the original DataFrame.
first   last
------- ---------
Jerry   Cosin
Juliana Walewski

tail works similarly, but returns the last N rows:

DataFrame df1 = df.tail(1);
first last
----- ------
Joan  O'Hara
1 row x 2 columns

The argument to either head(..) or tail(..) can be negative, in which case the operation skips the specified number of elements either from the top or from the bottom, and returns the remaining elements. The following code returns a new DataFrame after skipping the two top rows of the original DataFrame:

DataFrame df1 = df.head(-2);
first last
----- ------
Joan  O'Hara
1 row x 2 columns
Unlike index operations on arrays and lists in Java, head(..) and tail(..) are safe with regard to bounds checking. They DO NOT throw exceptions when the length parameter is bigger than the DataFrame height (or Series size), returning an empty DataFrame (or Series) instead.
Series also defines head(..) and tail(..) methods that do what you’d expect, returning the first or last N values.
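For example:

Series<String> s = Series.of("a", "b", "c", "d");

Series<String> first2 = s.head(2); // "a", "b"
Series<String> last2 = s.tail(2);  // "c", "d"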

Sorting

You can sort values in Series, and sort rows in DataFrames. As is the case everywhere else in DFLib, sorting does not alter the original object, and instead creates a new instance of either Series or DataFrame.

First let’s look at sorting Series…​

Sort Series

Series provides a sort method to sort its data using a sorter built from an expression:

// sort series by String length
Series<String> s = Series
        .of("12", "1", "123")
        .sort($str("").mapVal(String::length).asc());

The result is a new Series with sorted data:

1
12
123
3 elements

Alternatively the same can be achieved using a Comparator:

Series<String> s = Series
        .of("12", "1", "123")
        .sort(Comparator.comparingInt(String::length));

The next example shows how to sort Series in the "natural" order (alphabetically for Strings):

Series<String> s = Series
        .of("c", "d", "a")
        .sort($str("").asc());
a
c
d
3 elements

Series of primitives have methods to sort values in the natural order without an explicit sorter or comparator:

LongSeries s = Series
        .ofLong(Long.MAX_VALUE, 15L, 0L)
        .sortLong();
                  0
                 15
9223372036854775807
3 elements

Additionally, IntSeries has an optimized method to sort ints with a custom IntComparator.
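A sketch, assuming the IntComparator lambda receives two primitive ints:

IntSeries s = Series
        .ofInt(5, -2, 9)
        .sortInt((i1, i2) -> Integer.compare(i2, i1)); // descending order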

Next we’ll check how to sort a DataFrame.

Sort DataFrame

Rows in a DataFrame can be sorted with one or more sorters.

Just like in other places where a Sorter might be used, there is an assumption that the Sorter expression produces values that are either Java primitives or are compatible with java.lang.Comparable (i.e., Strings, numbers, etc).
DataFrame df = DataFrame.foldByRow("first", "last", "middle").of(
        "Jerry", "Cosin", "M",
        "Juliana", "Walewski", null,
        "Jerry", "Albert", null,
        "Joan", "O'Hara", "J");

DataFrame df1 = df.sort(
        $str("first").asc(),
        $str("last").asc());
first   last     middle
------- -------- ------
Jerry   Albert   null
Jerry   Cosin    M
Joan    O'Hara   J
Juliana Walewski null

Alternatively, sorting can be done by a single column name or an array of columns. The assumption is the same as above - values in the columns used for sorting must be either Java primitives or implement java.lang.Comparable.

DataFrame df = DataFrame.foldByRow("first", "last", "middle").of(
        "Jerry", "Cosin", "M",
        "Juliana", "Walewski", null,
        "Joan", "O'Hara", "J");

DataFrame df1 = df.sort("first", true); (1)
1 The first argument is the column name (a column integer index can also be used), the second is a boolean indicating sort direction (true for ascending, false for descending order).
DataFrame df = DataFrame.foldByRow("first", "last", "middle").of(
        "Jerry", "Cosin", "M",
        "Juliana", "Walewski", null,
        "Jerry", "Albert", null,
        "Joan", "O'Hara", "J");

DataFrame df1 = df.sort(new String[]{"last", "first"}, new boolean[]{true, false});

Concatenation

There are various ways to combine data from multiple Series or DataFrames. The simplest is concatenation, described in this chapter. Series can be concatenated together producing a longer Series. DataFrames can be concatenated either vertically (along the rows axis) or horizontally (along the columns axis).

Concatenate Series

If you have a Series object and want to concatenate it with one or more other Series, you’d use Series.concat(..) method:

Series<String> s1 = Series.of("x", "y", "z");
Series<String> s2 = Series.of("a");
Series<String> s3 = Series.of("m", "n");

Series<String> sConcat = s1.concat(s2, s3);

The result, as expected, contains the values of all the Series put together:

x
y
z
a
m
n
6 elements

If you have a collection or an array of Series and want to "glue" them together, you can use a static SeriesConcat.concat(..):

Collection<Series<String>> ss = asList(
        Series.of("x", "y", "z"),
        Series.of("a"),
        Series.of("m", "n"));

Series<String> sConcat = SeriesConcat.concat(ss);

The result is the same as in the previous example. You can also concatenate a Series with itself:

Series<String> s = Series.of("x", "y");
Series<String> sConcat = s.concat(s);
x
y
x
y
4 elements

Concatenate DataFrames

DataFrame offers two options for concatenation - vertical (stacking DataFrames on top of each other) and horizontal (putting them next to each other). Let’s see some examples:

DataFrame df1 = DataFrame.foldByRow("a", "b").of(
        1, 2,
        3, 4);

DataFrame df2 = DataFrame.foldByRow("a", "b").of(
        5, 6,
        7, 8);

DataFrame dfv = df1.vConcat(df2); (1)
1 Concatenate this and another DataFrame vertically. The argument is a vararg, so more than one DataFrame can be passed to the method.
a b
- -
1 2
3 4
5 6
7 8
DataFrame dfh = df1.hConcat(df2); (1)
1 Concatenate this and another DataFrame horizontally. The argument is a vararg, so more than one DataFrame can be passed to the method.
a b a_ b_
- - -- --
1 2 5  6
3 4 7  8
Since both df1 and df2 had the same column names, and a DataFrame must only have unique column names, a _ suffix was automatically appended to the conflicting columns in the resulting DataFrame. We will see this auto-renaming behavior in other places, such as joins.

So far all our concatenation examples consisted of DataFrames that had matching dimensions (same number and names of columns for vConcat and same number of rows for hConcat). But what if concatenated DataFrames are shaped or named differently?

Concat methods may take an extra "how" parameter to define concatenation semantics. The type of the "how" parameter is JoinType, and can be one of inner (default), left, right, full.

We’ll see JoinType again soon when discussing joins. Concatenation and joins are related fairly closely.

Let’s look at how this works with vConcat:

DataFrame df1 = DataFrame.foldByRow("b", "a").of(
        1, 2,
        3, 4);

DataFrame df2 = DataFrame.foldByRow("a", "c").of( (1)
        5, 6,
        7, 8);

DataFrame dfv = df1.vConcat(JoinType.inner, df2); (2)
1 df1 column names are "b" and "a", while df2's are "a" and "c".
2 Explicitly passing JoinType.inner to vConcat. It is done to demonstrate the point. Since inner is the default, omitting it will not change the outcome.
a
-
2
4
5
7

As you can see in the result, the behavior of inner join is to only keep columns that are present in both df1 and df2. Columns are joined by name, regardless of their order (though the order of columns is preserved in the result, following the order in each DataFrame being joined, left to right).

Changing semantics from inner to left gives us all the columns of the leftmost DataFrame, but those columns that are missing from the concatenated DataFrame are filled with nulls in the corresponding positions:

DataFrame dfv = df1.vConcat(JoinType.left, df2);
b    a
---- -
1    2
3    4
null 5
null 7

We leave it as an exercise for the reader to try right and full joins.

hConcat works similarly, however concatenation is done by row position, as there are no row names:

DataFrame df1 = DataFrame.foldByRow("a", "b").of(
        1, 2,
        3, 4,
        5, 6);

DataFrame df2 = DataFrame.foldByRow("c", "d").of(
        7, 8,
        9, 10);

DataFrame dfv = df1.hConcat(JoinType.left, df2);
a b c    d
- - ---- ----
1 2 7    8
3 4 9    10
5 6 null null

Joins

If you worked with relational databases, you know what "joins" are. This is a way to combine related rows from multiple tables. DataFrames can be joined together, just like two DB tables. DFLib provides 4 familiar flavors of joins: inner, left (outer), right (outer) and full (outer).

Inner Joins

DataFrame left = DataFrame
        .foldByRow("id", "name")
        .of(1, "Jerry", 2, "Juliana", 3, "Joan");

DataFrame right = DataFrame
        .foldByRow("id", "age")
        .of(2, 25, 3, 59, 4, 40);

DataFrame joined = left
        .join(right) (1)
        .on("id") (2)
        .select();
1 join(..) is equivalent to innerJoin(..) (just like in SQL). With "inner" semantics, only the matching rows that are present in both DataFrames will be included in the result below.
2 Columns used in the join criteria have the same name in both DataFrames, so we can specify the name of the join column only once. In general, the names do not need to match, and multiple columns can participate in a join (a sketch follows the output below).
id name    id_ age
-- ------- --- ---
 2 Juliana   2  25
 3 Joan      3  59
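When the join column names differ between the two sides, both names can be passed to on(..). A sketch with a hypothetical "person_id" column on the right:

DataFrame joined = left
        .join(right)
        .on("id", "person_id") // left column name, right column name
        .select();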

Column Namespace

Notice how DFLib treated the id column in the join result above. Since it was present in both left and right DataFrames, it was included in the result twice, but the second instance was renamed to id_ to avoid a naming collision. The same renaming would occur for any other columns from the right DataFrame that have conflicting names with the columns from the left. A quick way to get rid of duplicate columns is to exclude them from the result using the name pattern:

DataFrame joined = left
        .join(right)
        .on("id")
        .colsExcept(c -> c.endsWith("_"))
        .select();
id name    age
-- ------- ---
 2 Juliana  25
 3 Joan     59
You may have noticed that colsExcept(..).select() style of picking join columns is very similar to the column set API. The main difference is that the Join object itself represents a special form of column set, and only "select" operations are available.

For better clarity, you can assign names to the left and/or the right DataFrames just before the join, and the column names in the result will be prefixed with the corresponding DataFrame name, thus avoiding the trailing underscore:

DataFrame joined = left.as("L") (1)
        .join(right.as("R")) (2)
        .on("id")
        .select();
1 Assign the name to the left DataFrame
2 Assign the name to the right DataFrame

This makes it easier to identify the origin of each column.

L.id L.name  R.id R.age
---- ------- ---- -----
   2 Juliana    2    25
   3 Joan       3    59
DataFrame immutability applies to name assignments as well. Calling .as() creates a separate DataFrame, so the original one remains unnamed.

Transform Columns

When building the join result, you can apply expressions to alter the existing columns or create entirely new columns:

DataFrame joined = left.as("L")
        .join(right.as("R"))
        .on("id")
        .cols("name", "retires_soon")
        .select(
                $col("name"),
                $int("R.age").gt(57)
        );
name    retires_soon
------- ------------
Juliana        false
Joan            true
In the context of a join, column expressions can reference either short column names of left and right DataFrames, or the fully qualified ones with the prefix. In the former case, the names should take into account the "_" suffix applied to the right DataFrame columns when their names overlap with column names on the left.

Outer Joins

Now let’s demonstrate outer joins. Here is a leftJoin(..) example using the same left and right DataFrames:

DataFrame joined = left
        .leftJoin(right)
        .on("id")
        .select();
id name     id_  age
-- ------- ---- ----
 1 Jerry   null null
 2 Juliana    2   25
 3 Joan       3   59

It has all the rows from the left DataFrame, and only the matching rows from the right DataFrame. For left rows with no matching right rows, the right columns are filled with null.

Indicator Column

The join API allows you to explicitly identify which rows had a match and which didn’t, by adding a special "indicator" column. Let’s show it on a fullJoin(..) example:

DataFrame joined = left
        .fullJoin(right)
        .on("id")
        .indicatorColumn("join_type") (1)
        .colsExcept(c -> c.endsWith("_"))
        .select();
1 Request an indicator column with a user-specified name.
  id name     age join_type
---- ------- ---- ----------
   1 Jerry   null left_only
   2 Juliana   25 both
   3 Joan      59 both
null null      40 right_only

The join_type column will contain a special enum of the org.dflib.join.JoinIndicator type that allows categorizing rows in the produced DataFrame.

The indicator column can be requested for all join types, but it is only useful for the three outer joins (it will always be both for inner joins).
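E.g., a sketch that uses the indicator to keep only the rows without a match on the right (assuming an import of org.dflib.join.JoinIndicator):

DataFrame leftOnly = joined
        .rows($col("join_type").eq(JoinIndicator.left_only))
        .select();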

Nested Loop Joins

The joins we’ve seen so far were all based on comparing left and right column values. They are known as "hash joins", and are usually as fast as joins can get. However, they cannot express every possible condition, and sometimes we have to resort to another kind of join - the "nested loop" join that compares every row from the left DataFrame to every row from the right. For instance, let’s join a DataFrame of salaries with itself to produce pairs of names - a person with a higher salary on the left vs. everyone with a lower salary on the right:

DataFrame df = DataFrame.foldByRow("name", "salary").of(
        "Jerry", 120000,
        "Juliana", 80000,
        "Joan", 95000);

JoinPredicate p = (r1, r2) ->
        ((Integer) r1.get("salary")) > ((Integer) r2.get("salary"));

DataFrame joined = df
        .leftJoin(df)
        .predicatedBy(p) (1)
        .select()
        .sort($int("salary").desc(), $int("salary_").desc()) (2)
        .cols("name", "name_").selectAs("makes_more", "makes_less");
1 Custom join condition
2 Sorting and renaming the result columns for a user-friendly display
makes_more makes_less
---------- ----------
Jerry      Joan
Jerry      Juliana
Joan       Juliana
Juliana    null
Nested loop joins are much slower, and should be avoided unless absolutely necessary.

Group and Aggregate

DataFrame rows can be combined into groups based on common column values. For instance, here is a payroll report that contains payments to employees over time:

name             amount date
---------------- ------ ----------
Jerry Cosin        8000 2024-01-15
Juliana Walewski   8500 2024-01-15
Joan O'Hara        9300 2024-01-15
Jerry Cosin        4000 2024-02-15
Juliana Walewski   8500 2024-02-15
Joan O'Hara        9300 2024-02-15
Jerry Cosin        8000 2024-03-15
Joan O'Hara        9300 2024-03-15

We can group it by date like this:

GroupBy groupBy = df.group("date"); (1)
1 Here we are grouping by a single column, but multiple columns can be specified as well

The GroupBy object has a number of useful operations to transform grouped data. The most common one is aggregation of values to produce one row per group (similar to the SQL GROUP BY clause). E.g., we can calculate some statistics about each payroll period by passing a list of "aggregating" expressions to the agg(..) method:

DataFrame agg = df
        .group("date")
        .agg(
                $col("date").first(), (1)
                $double("amount").sum(), (2)
                count() (3)
        );
1 The first(..) expression is applied to the column used in the grouping. As all values within a group are the same, we can simply select the first one for our aggregated row
2 This calculates total payroll amount within each group
3 count() returns the number of rows in each group
date       sum(amount) count
---------- ----------- -----
2024-01-15     25800.0     3
2024-02-15     21800.0     3
2024-03-15     17300.0     2

first(..), sum(..) and count() are special "aggregating" expressions. Unlike the "normal" expressions that produce Series of the same size as input Series, aggregating ones output Series with single values. Hence, the number of rows in the aggregated DataFrame is equal to the number of groups in GroupBy. There are more aggregating expressions that can concatenate values within a group, calculate min, max, avg values, etc.
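E.g., a sketch of a few more per-group statistics over the same payroll data:

DataFrame stats = df
        .group("date")
        .agg(
                $col("date").first(),
                $double("amount").min(),
                $double("amount").max(),
                $double("amount").avg()
        );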

Let’s make a minor improvement to the example above and provide more descriptive column names:

DataFrame agg = df
        .group("date")
        .cols("date", "total", "employees")
        .agg(
                $col("date").first(),
                $double("amount").sum(),
                count()
        );
date         total employees
---------- ------- ---------
2024-01-15 25800.0         3
2024-02-15 21800.0         3
2024-03-15 17300.0         2
There’s a similarity between GroupBy.cols(..) method and DataFrame.cols(..). And, as we’ll see in the later examples, a few more methods (select(..) and map(..)) work the same as in ColumnSet. So GroupBy is essentially a ColumnSet with operations applied per group.

Instead of aggregating groups, we can select back the original rows, but apply some group-specific transformations. E.g. we can do row ranking within groups. Here we’ll rank employees by their salary amount vs other employees in the same period:

DataFrame ranked = df.group("date")
        .sort($double("amount").desc()) (1)
        .cols("date", "name", "rank")
        .select( (2)
                $col("date"),
                $col("name"),
                rowNum() (3)
        );
1 To produce a proper ranking, we need to order rows within each group first
2 In select(..), instead of aggregating expressions, we are using "normal" ones, but they are applied to each group separately.
3 Ranking function applied within each group
date       name             rank
---------- ---------------- ----
2024-01-15 Joan O'Hara         1
2024-01-15 Juliana Walewski    2
2024-01-15 Jerry Cosin         3
2024-02-15 Joan O'Hara         1
2024-02-15 Juliana Walewski    2
2024-02-15 Jerry Cosin         3
2024-03-15 Joan O'Hara         1
2024-03-15 Jerry Cosin         2

The resulting DataFrame has the same number of rows as the original one, but the ordering is different, and the columns are the ones that we requested.

We can use GroupBy.head(..) (and tail(..)) to find the top-paid employees within each payroll period. Note that this operation doesn’t do any aggregation - just sorting within groups and picking the top results:

DataFrame topSalary = df.group("date")
        .sort($double("amount").desc()) (1)
        .head(1) (2)
        .select();
1 Do the same sorting as before
2 Pick the top row in each group
name        amount date
----------- ------ ----------
Joan O'Hara   9300 2024-01-15
Joan O'Hara   9300 2024-02-15
Joan O'Hara   9300 2024-03-15

Window Operations

Window operations are somewhat close to "group", except they usually preserve the rows of the original DataFrame and may add extra columns with values calculated from "windows" of data relative to a given row. We’ll use the following DataFrame to demonstrate how window operations work:

name             amount date
---------------- ------ ----------
Jerry Cosin        8000 2024-01-15
Juliana Walewski   8500 2024-01-15
Joan O'Hara        9300 2024-01-15
Jerry Cosin        8800 2024-02-15
Juliana Walewski   8500 2024-02-15
Joan O'Hara        9500 2024-02-15

A window is created by calling over() on the DataFrame:

Window window = df.over();

Here is a full operation that adds a "max_salary" column to each employee:

DataFrame df1 = df
        .over()
        .partitioned("name") (1)
        .cols("max_salary").merge($int("amount").max()); (2)
1 Defines DataFrame "partitioning". Calculation window for each row will be all the rows in the same partition
2 Defines a new column and an aggregating expression to produce values.
name             amount date       max_salary
---------------- ------ ---------- ----------
Jerry Cosin        8000 2024-01-15       8800
Juliana Walewski   8500 2024-01-15       8500
Joan O'Hara        9300 2024-01-15       9500
Jerry Cosin        8800 2024-02-15       8800
Juliana Walewski   8500 2024-02-15       8500
Joan O'Hara        9500 2024-02-15       9500

Partitioning in the example above is similar to "group" in GroupBy. Another way to define windows is via the range(..) method that builds a window with a fixed offset relative to each row.
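
range(..) is not covered here yet, but here is one more partitioning sketch in the style of the example above, merging each employee’s total pay across all periods into a new column:

DataFrame df2 = df
        .over()
        .partitioned("name")
        .cols("total_paid").merge($int("amount").sum()); // aggregate over each partition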

TODO

Pivot

TODO

Stack

TODO

User-Defined Functions

User-defined functions (aka "UDFs") are a way to create highly reusable expressions. Expressions are, of course, abstract transformations of columns, so they are already somewhat reusable. However, they encode the names (or positions) of all columns they operate on, so you won’t be able to apply the same expression to two different sets of columns in the same DataFrame. UDFs are intended to address this limitation and make expressions fully dynamic.

As the name implies, a UDF is a function (usually implemented as a lambda or a method ref). It takes one or more expressions as parameters, and produces a single expression:

void formatNames() {
    DataFrame df = DataFrame.foldByRow("first", "last").of(
            "JERRY", "COSIN",
            "juliana", "waLEwsKi");

    Udf1<String, String> formatName = e ->
            e.castAsStr().mapVal(this::formatName); (1)

    DataFrame clean = df.cols("first", "last").merge(
            formatName.call("first"),
            formatName.call("last")); (2)
}

String formatName(String raw) {
    return Character.toUpperCase(raw.charAt(0))
            + raw.substring(1).toLowerCase();
}
1 Define a UDF that takes a single parameter and performs proper name capitalization
2 Apply the same UDF separately to the "first" and "last" name columns
first   last
------- --------
Jerry   Cosin
Juliana Walewski

In the example above, we called the UDF multiple times, each time with a different column name. A UDF can also be invoked with a numeric position or a full expression instead of a column name.
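
For instance, the formatName UDF from above could be invoked in any of these equivalent ways:

formatName.call("first");       // by column name
formatName.call(0);             // by column position
formatName.call($str("first")); // with a full expression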

Above, we used the Udf1 function that takes only one expression (or column) as input. Generally, UDFs can take any number of expressions (columns). There are Udf1, Udf2, Udf3 and UdfN interfaces. The latter can take any number of arguments and should be used to model UDFs with more than 3 inputs or a variable number of inputs:

DataFrame df = DataFrame.foldByRow("first", "last").of(
        "Jerry", "Cosin",
        null, "Walewski",
        "Joan", null);

UdfN<Boolean> noNulls = exps -> and( (1)
        Stream.of(exps)
                .map(Exp::isNotNull)
                .toArray(Condition[]::new)
);

DataFrame clean = df
        .rows(noNulls.call("first", "last").castAsBool()) (2)
        .select();
1 Build a UDF with dynamic number of input parameters packaged in the Exp[]
2 Call the UDF to create a row filter. Since the UDF return type is Exp<Boolean>, we need to explicitly "cast" it to the Condition expected by the RowSet, so we used castAsBool()
first last
----- -----
Jerry Cosin
Performance hint: If the logic within a UDF requires converting a column to a specific data type, and the column you are applying it to is known to be of that type already, you can speed up evaluation by passing an expression of that type to the UDF. E.g. if a UDF calls castAsInt() on its argument, udf.call($int("a")) is much faster than udf.call("a").
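
To illustrate, here is a hypothetical UDF and the two invocation styles (the "a" column and the doubled function are made up for this sketch):

Udf1<Integer, Integer> doubled = e -> e.castAsInt().mul(2);

DataFrame slower = df.cols("a2").merge(doubled.call("a"));       // untyped column: castAsInt() has to convert values
DataFrame faster = df.cols("a2").merge(doubled.call($int("a"))); // already an int expression: castAsInt() is a no-op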

Using Relational Databases

Relational databases are arguably the most important type of data store out there. They also happen to map really well to DataFrames. DFLib provides advanced support for loading DataFrames from and saving them to an RDBMS. It supports transactions, auto-generation of certain SQL statements, merging of data on top of the existing data ("create-or-update"), and building custom SQL for selects and updates. It knows how to treat different DB "flavors" and does a number of other cool database-related things.

To start using all this, you need to import dflib-jdbc module:

<dependency>
    <groupId>org.dflib</groupId>
    <artifactId>dflib-jdbc</artifactId>
</dependency>

<!-- Of course you also need to import your preferred JDBC driver -->

JdbcConnector

Once the imports are set up, you’d use the Jdbc class to obtain an instance of JdbcConnector that can later be used for all DB operations. You may already have a preconfigured javax.sql.DataSource, so the simplest way to create a connector is to pass that DataSource to the Jdbc factory method:

JdbcConnector connector = Jdbc.connector(dataSource);

Alternatively you can build everything from scratch:

JdbcConnector connector = Jdbc
        .connector("jdbc:derby:target/derby/mydb;create=true")
        // .driver("com.foo.MyDriver") (1)
        // .userName("root")
        // .password("secret") (2)
        .build();
1 (Optional) Driver name. Some drivers are not properly registered with the DriverManager, and require an explicit declaration.
2 DB account username/password (not needed for our in-memory Derby example, but will definitely be required for a real database).

TableLoader / TableSaver

Once you have the connector, you can start reading and persisting DataFrame data. Many DataFrames map directly to individual tables (or views) defined in the database. For those DFLib provides a set of rather straightforward operations that do not require the user to write SQL (all needed SQL is auto-generated by the framework):

DataFrame df = connector.tableLoader("person").load();
id name              salary
-- ----------------- --------
 1 Jerry Cosin       70000.0
 2 Juliana Walewski  85000.0
 3 Joan O'Hara       101000.0

The table loader provides a way to customize the load operation. It lets you select specific columns, set the maximum number of rows to read, sample rows, and even specify a fetch condition as another DataFrame. Some examples:

DataFrame condition = DataFrame
        .byColumn("salary")
        .of(Series.ofInt(70_000, 101_000));

DataFrame df = connector.tableLoader("person")
        .cols("name", "salary")
        .eq(condition)
        .load();
name        salary
----------- --------
Jerry Cosin 70000.0
Joan O'Hara 101000.0

What it doesn’t require (unlike the CSV loader) is explicit column types, as the proper value types are inferred from the database metadata.

The table saver persists a DataFrame to a table. Column names in the DataFrame should match column names in the DB table:

DataFrame df = DataFrame.byArrayRow("id", "name", "salary")
        .appender()
        .append(1, "Jerry Cosin", 70_000)
        .append(2, "Juliana Walewski", 85_000)
        .append(3, "Joan O'Hara", 101_000)
        .toDataFrame();

connector.tableSaver("person").save(df);

In this scenario the table saver executes an insert for each DataFrame row. But what if the table already contains data? There are a few options for overwriting or merging the data:

  • Append the data (that’s what we effectively did above).

  • Delete all existing data before doing the insert.

  • Merge the data by comparing DataFrame and DB table data on PK column(s) or an arbitrary set of columns: missing rows are inserted, existing rows are updated. If the table has more columns than the DataFrame, the data in those extra columns is preserved. (See the merge examples below.)

Delete before insert example:

connector.tableSaver("person")
        .deleteTableData()
        .save(df);

Merge by PK example (PK columns are detected from DB metadata):

connector.tableSaver("person")
        .mergeByPk()
        .save(df);
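
Merging can also be done on an arbitrary set of columns instead of the PK. A sketch, assuming "name" uniquely identifies rows in our data:

connector.tableSaver("person")
        .mergeByColumns("name")
        .save(df);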

SqlLoader / SqlSaver

TODO

CSV Files

CSV (comma-separated values) is a very common and rather simple format for working with raw data. *.csv files can be manipulated programmatically or manually (via Excel/LibreOffice), loaded to / from databases, etc. DFLib supports reading DataFrames from CSV and storing them to CSV. You need to add an extra dependency to your project to take advantage of this functionality:

<dependency>
    <groupId>org.dflib</groupId>
    <artifactId>dflib-csv</artifactId>
</dependency>

Once you do that, the Csv class is the entry point to all the .csv-related operations discussed below.

Reading from CSV

The simplest API for reading a DataFrame from a CSV is this:

DataFrame df = Csv.load("src/test/resources/f1.csv"); (1)
1 The argument to the "load" method can be a filename, a file, or a Reader, so it can be loaded from a variety of sources.

The result is a DataFrame that matches CSV structure:

A B
- --
1 s1
4 s2

DFLib makes a few assumptions about the data in f1.csv, namely that the first row represents column names, that all columns are Strings, and that all of them need to be included. These assumptions are not necessarily true with many data sets. So there is a longer form of this API that lets you configure column types, skip rows and columns, and even sample the data without loading the entire CSV into memory. Some examples:

DataFrame df = Csv.loader() (1)
        .offset(1) (2)
        .header("x", "y") (3)
        .intColumn("x") (4)
        .load("src/test/resources/f1.csv");
1 Instead of the "load" method, use "loader".
2 Skip the header row.
3 Provide our own header.
4 Convert the first column to int.
x y
- --
1 s1
4 s2
In theory, you don’t have to do most of these tweaks via the CSV loader. You could load the raw data and then use standard DataFrame transformations to shape the result. However, doing it via the loader optimizes both load speed and memory use, so it is usually preferable.

Writing to CSV

Writing to a CSV is equally easy:

Csv.save(df, "target/df.csv"); (1)
1 The argument to the "save" method can be a filename, a file, or a Writer (generally any Appendable), so data can be stored to a variety of destinations.

Just like the loader, the CSV saver provides its own set of options if the defaults are not sufficient:

Csv.saver() (1)
        .createMissingDirs() (2)
        .format(CSVFormat.EXCEL) (3)
        .save(df, "target/csv/df.csv");
1 Instead of the "save" method, use "saver" to be able to customize the process.
2 If intermediate directories for the output file are missing, create them.
3 Provide an alternative format (either from a collection of known formats, or user-specific).

Excel Files

Excel format is very common in the business and finance world. DFLib supports reading DataFrames from and storing them to Excel. You need to add an extra dependency to your project to take advantage of this functionality:

<dependency>
    <groupId>org.dflib</groupId>
    <artifactId>dflib-excel</artifactId>
</dependency>

TODO…​
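
Until this chapter is complete, here is a minimal loading sketch. The Excel class is the entry point; the method names below are assumed by analogy with the Csv API, so check the dflib-excel Javadoc:

Map<String, DataFrame> sheets = Excel.load("src/test/resources/f1.xlsx"); // one DataFrame per sheet (assumed API)
DataFrame df = Excel.loadSheet("src/test/resources/f1.xlsx", "Sheet1");   // a single sheet (assumed API)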

Avro Files

Avro binary format is common in data engineering. Its main advantage is that .avro files are saved with an embedded schema. So when they are read back, the fields are converted to the correct Java types. This is different from CSV, as reading a .csv by default produces a DataFrame with all String columns. To work with the Avro format, include the following dependency:

<dependency>
    <groupId>org.dflib</groupId>
    <artifactId>dflib-avro</artifactId>
</dependency>

Once you do that, Avro class is the entry point to all the operations. With it you can save DataFrames to .avro files and load them back.
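
For example, a save-and-load round trip might look like this (a sketch assuming the Avro entry point mirrors the Csv class):

Avro.save(df, "target/df.avro");             // a schema is generated automatically
DataFrame df2 = Avro.load("target/df.avro"); // column types are restored from the embedded schema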

Avro Schema

In most cases you don’t need to know anything about Avro schemas. DFLib automatically generates a schema for a DataFrame before it is saved. Alternatively you can create a custom Schema object based on Avro specification and pass it to AvroSaver.

TODO…​

Parquet Files

Parquet is a popular columnar binary data format. Like .avro files, .parquet files are saved with an embedded schema. So when they are read back, the fields are converted to the correct Java types. To work with the Parquet format, include the following dependency:

<dependency>
    <groupId>org.dflib</groupId>
    <artifactId>dflib-parquet</artifactId>
</dependency>

Once you do that, Parquet class is the entry point to all the operations. With it you can save DataFrames to .parquet files and load them back.
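
A likely round trip, again assuming the Parquet entry point mirrors the other format classes:

Parquet.save(df, "target/df.parquet");             // schema is embedded in the file (assumed API)
DataFrame df2 = Parquet.load("target/df.parquet"); // types are restored from the embedded schema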

TODO…​

Using JShell

Modern JDK installations come with jshell, a Read-Evaluate-Print Loop tool (aka REPL). DFLib can be used in the JShell environment for interactive data exploration and code prototyping. For the best experience, you may want to configure a couple of things. Start jshell and execute the following commands:

// Set classpath.
// There's no automatic dependency management in jshell.
// You will need to include all DFLib jars and their
// dependencies explicitly

/env --class-path /path/to/dflib-1.0.0-M22.jar:/path/to/dflib-csv-1.0.0-M22.jar
// Disable data truncation. You want your data to be visible.

/set mode mine normal -command
/set truncation mine 40000
/set feedback mine
// Add needed imports

import org.dflib.*;
import org.dflib.csv.*;

import static org.dflib.Exp.*;
// Set tabular printer

Environment.setPrinter(Printers.tabular());

After this, you can start working with DataFrames and immediately check the output:

var df = Csv.load("../data/stuff.csv");

JShell may print something like this:

df ==>
id   name              salary
---- ----------------- --------
   1 Jerry Cosin       70000.0
   2 Juliana Walewski  85000.0
   3 Joan O'Hara       101000.0
...
 997 Jenny Harris      65000.0
 998 Matt Ostin        52000.0
 999 Andrew Tsui       99000.0
1000 rows x 3 columns

Using Jupyter Notebook

While DFLib runs inside regular Java applications or jshell, it also works perfectly in Jupyter, which is a very common "notebook" environment among data scientists and data engineers. In a notebook, you interact with your code via a web browser in a rather visual way. It lets you run the steps of a data transformation one by one and inspect the data between the steps. Jupyter is often associated with Python / pandas, but it can also be used with Java / DFLib.

The code developed in a notebook can later be copied and pasted into your application. Java developers should consider using Jupyter as a part of their data project workflow, and DFLib’s design (and its Jupyter integration features described here) makes it rather easy.

Java support for Jupyter is provided by jjava project that is also maintained by the DFLib community. Once you follow the installation instructions and then run Jupyter, you should end up with a browser window open with Jupyter UI. Create a new notebook, and when asked to select a kernel, pick "Java". In the first cell enter something like this:

%maven org.dflib:dflib-jupyter:1.0.0-M23 (1)
1 Adding dflib-jupyter as a dependency also includes all other DFLib modules through its transitive dependencies. Also, it will load base DFLib imports, and configure Jupyter environment for pretty display of DataFrames and Series.

Press "Shift + Return" to execute the cell. If there are no errors, you can start using the DFLib API in the following cells. E.g.:

DataFrame df = DataFrame.foldByColumn("a", "b", "c")
    .ofInts(-1, 1, 2, 3, 4, 5, 6);

// the result of the last statement in a cell is printed just below it
df

DFLib development is literally "visual" in Jupyter, as in addition to tabular data, you can display graphical charts with contents of DataFrames:

[image: a DFLib chart rendered in a Jupyter notebook]

Change Display Parameters

To control truncation behavior, you can use static methods on DFLibJupyter:

DFLibJupyter.setMaxDisplayRows(10);
DFLibJupyter.setMaxDisplayColumnWidth(50);

Charting with ECharts

DFLib provides a simple Java API to generate charts from DataFrames based on the Apache ECharts library. Charts are produced as pieces of HTML/JavaScript code and can be rendered either in Jupyter notebooks or served to the browser in a webapp.

The focus of this chapter is the DFLib API. We don’t dwell much on the underlying ECharts concepts, so you may need to read their documentation to get explanations of things like multiple axes, grid, etc. The main difference of the DFLib approach is that you don’t need to explicitly specify chart dataset or data sections. Instead, you’d define which DataFrame columns should be used for chart series, axes, etc., and DFLib generates the data-related parts from the provided DataFrame.

The ECharts JavaScript API contains a very large number of chart types and configuration options. The DFLib ECharts class supports a growing, but incomplete subset of that API. If we are missing a feature that you need, please open a task on GitHub, and we’ll do our best to integrate it.

To work with the ECharts API, you should include the following dependency (already included in the Jupyter environment):

<dependency>
    <groupId>org.dflib</groupId>
    <artifactId>dflib-echarts</artifactId>
</dependency>

And then you can create a chart. E.g., here is a simple bar chart showing employee salaries:

DataFrame df = DataFrame.foldByRow("name", "salary").of(
                "J. Cosin", 120000,
                "J. Walewski", 80000,
                "J. O'Hara", 95000)
        .sort($col("salary").desc());

EChartHtml chart = ECharts
        .chart()
        .xAxis("name")
        .series(SeriesOpts.ofBar(), "salary")
        .plot(df);

EChartHtml object contains pieces of HTML and JavaScript needed to render the following chart in the browser:

[image: bar chart of employee salaries]

Running this code in Jupyter will render the chart right in the notebook, so you don’t need to do anything special to display it. When embedding charts in web applications, you’ll need to inspect the results of the EChartHtml.getContainer(), EChartHtml.getExternalScript() and EChartHtml.getScript() methods and insert them in your HTML as you see fit.
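
For instance, a minimal page might be assembled like this (a sketch; it assumes the getters return HTML/JS fragments as strings, and the layout is entirely up to your app):

String html = "<html><head>"
        + chart.getExternalScript() // <script> tag that loads the ECharts JS library
        + "</head><body>"
        + chart.getContainer()      // <div> placeholder for the chart
        + chart.getScript()         // chart-specific rendering code
        + "</body></html>";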

Line Charts

TODO

Bar Charts

TODO

Scatter Charts

TODO

Pie Charts

TODO

Candlestick Charts

TODO

Unit Testing

TODO