ReasonML: external and internal iteration

[2018-01-27] dev, reasonml, advanced
(Ad, please don’t block)

Table of contents for this series of posts: “What is ReasonML?


This blog post explains patterns and techniques for external and internal iteration in ReasonML.

The GitHub repository reasonml-demo-iterators has the source code of the examples.

Iteration  

Many algorithms operate on series of values. We can apply such algorithms to any data structure that lets us iterate over its elements (visit them sequentially, one element at a time). Examples of such algorithms:

  • Counting the number of items in a series.
  • Computing the average of all numbers in a series.

Two ways of implementing iteration are: externally and internally.

(In this blog post, I’m using the term item for an iterated value and the term element for a value stored in a data structure.)

External iteration  

With external iteration, we ask for each element of the data structure (pull). In OOP languages such as JavaScript, that looks as follows.

> const arr = ["a", "b"];
> const iter = arr[Symbol.iterator]();

> iter.next()
{ value: 'a', done: false }
> iter.next()
{ value: 'b', done: false }
> iter.next()
{ value: undefined, done: true }

We first create an iterator iter for the data structure arr. Then we invoke iter.next() repeatedly until the result signals us that all values were visited, via done: true.

Internal iteration  

With internal iteration, the data structure feeds us its elements (push). In OOP languages such as JavaScript, that looks as follows.

> const arr = ["a", "b"];
> arr.forEach(x => console.log(x));
a
b

The parameter of arr.forEach() is a callback that successively receives each element and logs it to the console.

Abstractions for iterations  

The next step for us will be to introduce abstractions for iteration:

  • We have already seen an iterator, an abstraction for external iteration. However, for ReasonML, we want an abstraction that follows the style of functional programming (not object-oriented programming).
  • It is not yet clear what an abstraction for internal iteration should look like. But, as we will see, there is one.

Abstractions for iteration have two benefits:

  • Algorithms that operate on series of values can use such abstractions as their inputs. Then they automatically work with all data structures that support the abstractions (instead of having to be implemented for each data structure separately).
  • Combinators are operations consuming and/or producing iteration abstractions. A combinator works with any data source that supports its iteration abstraction. It also computes on demand: To process the lines in a large text file without some kind of iteration abstraction (think streams), we need to read them all into a data structure (e.g. a list or an array) and then work with that data structure. If the lines are handled via nested combinators, processing can start much earlier – as soon as the first line is available. Combinators for iteration abstractions are similar to combinators for lists (which are also linear) such as List.length, List.map() and List.filter().

It is interesting that iteration abstractions (such as iterators and streams) are based on mutable state at their core: the same entity produces different results when you access it in the same manner several times. However, they also feel quite functional, due to combinators.

Abstracting external iteration in ReasonML  

In OOP, the abstraction for external iteration is an object, the so-called iterator.

In ReasonML, we also call the abstraction an iterator, but we use a nullary function that we call repeatedly to produce the iterated values. As an example, let’s use the conversion function ofList() (which is explained soon) to create an iterator gen() for a list myList:

# let myList = ["a", "b", "c"];
# let gen = ofList(myList);

# gen();
Some("a")
# gen();
Some("b")
# gen();
Some("c")
# gen();
None

gen() returns values of type option(string), a variant with two constructors:

  • Elements x of myList are returned as Some(x).
  • The end of the iteration is indicated via None.

The following type is for iterators that produce values of type 'a:

type gen('a) = unit => option('a);

Within the ReasonML/OCaml ecosystem, iterators are also called enumerations and generators (due to them being producers of data, muddling the difference between interface and implementation).

To better understand how generators work, let’s look at code that produces and consumes them.

Converting lists to generators  

This is how ofList() can be implemented:

let ofList = (l: list('a)): gen('a) => {
  let current = ref(l);
  () => { /* A */
    switch (current^) {
    | [] => None
    | [head, ...tail] =>
      current := tail;
      Some(head)
    }
  };
};

The last thing ofList() does (in line A), is return the generator, a nullary function. That function uses the ref cell current (think mutable variable) to store the next iteration value. If the list, pointed to by current, is not yet empty, the generator returns Some(head). Otherwise, it returns None.

Converting generators to lists  

The inverse process, collecting iteration values in a list, works as follows.

let toList = (g: gen('a)): list('a) => {
  let rec aux = (acc) =>
    switch (g()) {
    | None => acc
    | Some(x) => aux([x, ...acc])
    };
  List.rev(aux([]));
};

We repeatedly call g() and collect its results in the accumulator parameter acc that we return once aux() is finished. The resulting list is in the wrong order, which is why we reverse it via List.rev().

The following code is a test for toList().

let count = ref(0);
let myGen = () =>
  switch (count^) {
  | 0 =>
    count := count^ + 1;
    Some("hello");
  | 1 =>
    count := count^ + 1;
    Some("world");
  | 2 => None
  };

expect(toList(myGen)) |> toBe(["hello", "world"]);

myGen is a hand-written generator that uses the ref cell count to produce a series of values. The last line checks if toList() properly converts myGen to a list. The assertion is based on the bs-jest API.

Generator combinator: length()  

Let’s implement our first combinator: A function length whose input is a generator and whose output is the number of items in that generator:

let length = (g: gen('a)): int => {
  let rec aux = (len) =>
    switch (g()) {
    | None => len
    | Some(_) => aux(len+1)
    };
  aux(0);
};

expect(length(ofList([1, 2, 3]))) |> toBe(3);

We use the helper function aux() to loop over all elements in the generator g. While we are doing so, we increment the counter len. Once the generator is finished, we return len.

Generator combinator: map()  

The generator combinator map has as input a function f and a generator g and returns a new generator. Each item of the result is derived from an item of g by applying f to it:

let map = (~f: 'a => 'b, in_: gen('a)): gen('b) => {
  let out = () =>
    switch (in_()) {
    | None => None /* A */
    | Some(x) => Some(f(x)) /* B */
    };
  out;
};

This is a test for map():

let genIn = ofList([1, 2, 3]);
let genOut = map(~f = x=>2*x, genIn);
expect(toList(genOut)) |> toEqual([2, 4, 6]);

The result is the function out(). It derives its results from g(): If g() is done then so are we (line A). Otherwise, we need to apply f() to the iterated value x (line B).

Generator combinator: filter()  

The combinator filter() copies only those elements of its input generator to its output generator for which its callback parameter f returns true:

let filter = (~f: 'a => bool, in_: gen('a)): gen('a) => {
  let rec out = () =>
    switch (in_()) {
    | None => None
    | Some(x) when f(x) => Some(x) /* A */
    | _ => out() /* B */
    };
  out;
};

This is a test for filter():

let genIn = ofList([1, -2, 3, -4]);
let genOut = filter(~f = x=>x >= 0, genIn);
expect(toList(genOut)) |> toEqual([1, 3]);

With filter(), you pass on whatever the input generator in_() produces (line A), with one exception: If f() returns false for an iteration value, you immediately advance to the next result of in_() (line B).

Abstracting internal iteration in ReasonML  

Let’s briefly review what internal iteration looks like. The following code is an implementation of ListLabels.iter().

let iter = (~f: 'a => unit, theList: list('a)) => {
  let rec aux = (l) =>
    switch (l) {
    | [] => ()
    | [head, ...tail] =>
      f(head);
      aux(tail);
    };
  aux(theList);
};

iter() uses the helper function aux() to loop over the elements of theList and apply f to them.

This is what using iter() looks like:

let result = ref([]);
let f = (x) => result := [x, ...result^]; /* A */
iter(~f, ["a", "b", "c"]);
expect(result^) |> toEqual(["c", "b", "a"]);

Given that f() does not return anything, it must do its work via side effects.

Each step prepends a new item x in front of the existing items in result (line A). That’s why the final result is the original list in reverse.

iter() is elegant (if you need or want to use side effects), but it has one disadvantage: It is not an abstraction that can be returned by combinators.

An abstraction for internal iteration  

So what do we want from an abstraction for internal iteration?

  • It should work much like iter().
  • It should be a type of value that data structures can be mapped to and that combinators can work with.

The solution is as follows. A function that creates an internal iterator for a data structure simply returns the function aux that performs internal iteration (a partially applied iter(), if you will). That gives us the following type for sequences (internal iterators):

type sequence('a) = ('a => unit) => unit;

A sequence is a function that takes a callback of type 'a => unit and feeds it values of type 'a.

Converting lists to sequences  

Let’s examine how the type sequence works, by implementing a function that returns a sequence for a list:

let ofList = (theList: list('a)): sequence('a) =>
  (f: 'a => unit) => {
    let rec aux = (l) =>
      switch (l) {
      | [] => ()
      | [head, ...tail] =>
        f(head);
        aux(tail);
      };
    aux(theList);
  };

As you can see, this is a curried version of iter(). The following code tests ofList().

let result = ref([]);

let seq = ofList(["a", "b", "c"]);
seq(x => result := [x, ...result^]);

expect(result^) |> toBe(["c", "b", "a"]);

Combinators for sequences  

Let’s take a brief look at how combinators are implemented for internal iterators.

On one hand, computing results is more difficult than with external iterators – you need to rely on side effects to compute and return the result:

let length = (seq: sequence('a)) => {
  let result = ref(0);
  seq((_) => { result := result^ + 1 });
  result^;
};

On the other hand, producing an output iterator is simpler, because you decide when you send data to it:

let filter = (~f: 'a => bool, seqIn: sequence('a)): sequence('a) => {
  let seqOut = (out: 'b => unit) =>
    seqIn((in_: 'a) =>
      switch (f(in_)) {
      | true => out(in_); /* send value to output */
      | false => (); /* skip value */
      }
    );
  seqOut;
};

Libraries for iteration  

These are three OCaml libraries supporting iteration:

  • Gen supports generators (external iterators).
    • Tip for reading the docs: check out included modules and sub-modules, they provide much interesting functionality.
  • BatEnum supports enumerations (external iterators).
  • Sequence supports sequences (internal iterators).

In the following section, we’ll use the Gen library to demonstrate more iteration concepts.

Folding iterators  

One important operation for processing iterators is folding. We’ll use the Gen library to explore how that works. The following function len() counts the number of items produced by a generator gen:

let len = (gen: Gen.t('a)) => {
  let foldFunc = (state: int, _x: 'a) =>
    state + 1;
  Gen.fold(foldFunc, 0, gen);
};

This is a test for a generator with zero items:

let gen = Gen.of_list([]);
expect(len(gen)) |> toBe(0);

This is a test for a generator with two items:

let gen = Gen.of_list(["a", "b"]);
expect(len(gen)) |> toBe(2);

Gen.fold() works much like List.fold():

  • The first parameter is a function that maps the current state and an iteration value to the next state. States are arbitrary values whose type depends on the algorithm. In this case, the states are ints, because the last computed state is also the result of Gen.fold().
  • The second parameter is the first state.
  • The third parameter is the generator that Gen.fold() operates on.

As you can see, we are not even interested in the current iteration value _x. We only need to count how often foldFunc is called and use the states to do so.

Implementing iterators for trees  

So far, we have only looked at iterating over linear data structures. Iterating over non-linear data structures is more challenging.

Take, for example, a tree (the same data structure we have used in the blog post on variants):

type tree('a) =
  | Empty
  | Node('a, tree('a), tree('a));

Writing an external iterator for trees is not trivial. You could, for example, manually manage a stack of nodes to visit. Here, we explore an alternative technique that we will develop in three steps. We start by implementing internal iteration over trees.

Iterating internally over trees  

The following function is similar, in spirit, to ListLabels.iter():

let rec iterTree = (~f: ('a) => unit, tree: tree('a)) => {
  switch (tree) {
  | Empty => ()
  | Node(x, left, right) =>
    f(x);
    iterTree(~f, left);
    iterTree(~f, right);
  }
};

The code is straightforward – we use recursion to visit all nodes in the tree.

Note that iterTree is also how we’d implement internal iteration. The only change would be to nest two functions:

  • The parameter of the first one is tree.
  • The parameter of the second one is f.

Iterating, continuation passing style  

The next step is to convert the previous version to continuation-passing style (CPS): Instead of returning the results, we pass them on to callbacks. These callbacks are named continuations, because they determine how computation continues.

let rec iterTreeCps = (~f: ('a, unit=>unit) => unit,
tree: tree('a), cont: unit => unit) => {
  switch (tree) {
  | Empty => cont()
  | Node(x, left, right) =>
    f(x, () => { /* A */
      iterTreeCps(~f, left, () => {
        iterTreeCps(~f, right, cont);
      });
    });
  }
};

Line A is a good example of how CPS works: f() “returns” to us via its second parameter, a continuation. In CPS, the callee can delay (or avoid) returning a value to the caller. That’s why this style is used for asynchronous computations (e.g. by Node.js).

External iteration for trees  

For external iteration, we want the following:

  • Use the algorithm of iterTreeCps.
  • But instead of calling f() in line A, we want to pause and return a value. When the iterator is called again, it should continue its execution where it previously paused.

This is how yield and generators work in, for example, JavaScript, Python and C#. We can simulate this behavior by storing the continuation in a ref cell next:

let ofTree = (tr: tree('a)): Gen.t('a) => {
  let next = ref(None);

  let rec visitTree = (t: tree('a), cont: unit => option('a)) => {
    switch (t) {
    | Empty => cont()
    | Node(x, left, right) =>
      next := Some(() => {
        visitTree(left, () => {
          visitTree(right, cont);
        });
      });
      Some(x);
    }
  };

/* Continuing... */

The all that remains to be done is to define how everything starts and to return the actual external iterator.

/* Continued */

next := Some(() => visitTree(tr, () => None));

() =>
  switch (next^) {
  | Some(func) =>
    next := None;
    func();
  /* After the last node of the tree, */
  /* we always return None */
  | None => None
  };

This is how you use ofTree():

let myStrTree =
  Node("a",
    Node("b", Empty, Empty),
    Node("c",
      Node("d", Empty, Empty),
      Empty
    )
  );

let treeGen = TreeIteration.ofTree(myStrTree);
expect(Gen.to_list(treeGen))
  |> toEqual(["a", "b", "c", "d"]);

Infinite iterators  

Iterators don’t have to be finite. For example:

let naturalNumbers = (): Gen.t(int) => {
  let n = ref(0);
  () => {
    let cur = n^;
    n := n^ + 1;
    Some(cur);
  };
};

You obviously would not want to iterate over the items of this iterator (it would take a very long time). Infinite iterator are, however, useful for a few things.

First, you can use them to produce finite iterations of arbitrary length, via Gen.take(). Gen.take(n, gen) returns a generator with the first n items of generator gen.

let take5 = Gen.take(5, naturalNumbers());
expect(Gen.to_list(take5))
  |> toEqual([0, 1, 2, 3, 4]);

Second, Gen.zip() combines two iterations into a single iteration of pairs. And an infinite iterator can fill in one component of each pair:

let abc = Gen.of_list(["a", "b", "c"]);
let pairs = Gen.zip(naturalNumbers(), abc);
expect(Gen.to_list(pairs))
  |> toEqual([
      (0, "a"),
      (1, "b"),
      (2, "c")
    ]);

This works, because zip() stops as soon as one of the two iterators is finished.

You can also add indices to iteration values via Gen.zip_index().

Detecting the first or last iterated value  

Occasionally, when working with series of values, you need to distinguish the first or the last value. Alas, that is normally not possible: all iterated values are treated the same.

Let’s examine this issue by implementing join(), a function that concatenates a series of strings into a single string. Optionally, it inserts separator strings between adjacent iteration values.

Detecting the first element by indexing iteration values  

The first solution is to index each iteration value str by mapping it to a pair (index, str). Then it is easy to determine which is the first element – it has the index 0:

let joinViaIndex = (~sep: string = "", gen: Gen.t(string)) => {
  let foldFunc = (state, (index, str)) =>
    if (index == 0) {
      /* No separator before the first element */
      state ++ str;
    } else {
      state ++ sep ++ str;
    };
  Gen.fold(foldFunc, "", Gen.zip_index(gen));
};

joinViaIndex() works like this:

let gen = Gen.of_list(["a", "b", "c"]);
expect(joinViaIndex(~sep="|", gen))
  |> toBe("a|b|c");

We are using Gen.fold() (which we have already encountered in this blog post) to process the generator gen.

You may wonder if we could have simply checked whether the intermediate result state was empty and only added a separator if it wasn’t. But that would mean that we wouldn’t properly add a separator after an initial empty string (or after and between several initial empty strings).

Detecting the last element via Gen.peek()  

The Gen library has a clever way of supporting look-ahead for generators, via its function peek(). That function maps each iteration value x to a pair:

  • The first component of the pair is x.
  • The second component is the next iteration value, still wrapped in option().

That means that the last iteration value x_last is mapped to the following pair:

(x_last, None)

Let’s apply Gen.peek() to an iterator with two strings:

let gen = Gen.of_list(["a", "b"]);
let peekGen = Gen.peek(gen);
expect(Gen.to_list(peekGen))
  |> toEqual([
    ("a", Some("b")), /* A */
    ("b", None), /* B */
  ]);

The first pair in line A looks ahead to the second value, "b". The second pair enables you to determine that "b" is the last iteration value.

With this capability, we can again implement join() via Gen.fold():

let joinViaLookAhead = (~sep: string = "", gen: Gen.t(string)) => {
  let foldFunc = (state, (str, lookAhead)) =>
    switch lookAhead {
    /* No separator after the last element */
    | None => state ++ str
    | _ => state ++ str ++ sep
    };
  Gen.fold(foldFunc, "", Gen.peek(gen));
};

A practical example  

As a practical example, consider the following task. Given this text in Markdown syntax:

## Intro
This article is
about something.
## Details
These are the details
about something.

We’d like to split this text into chunks, records with two fields. Each section in the text produces one:

  • The field title contains the heading of the section.
  • The field body is a list with the lines of the section (including the line with the heading).

The result looks as follows.

[
  { title: "Intro", body: [
    "## Intro",
    "This article is",
    "about something.",
  ]},
  { title: "Details", body: [
    "## Details",
    "These are the details",
    "about something.",
  ]},
]

To implement this functionality, we use a more powerful variant of fold(): unfold_scan(). The callback unfoldScanFunc of this function works similarly to fold’s callback:

  • Its parameters are also the current state and the current iteration value.
  • Its result, however, is a pair (nextState, output):
    • nextState is the next state (the only value returned by fold’s callback).
    • output is the iteration value added to the output iterator.

That means that we can use the states to assemble chunks. As long as a chunk isn’t ready yet, our output is None. Once the chunk in state is finished, our output is Some(state).

let unfoldScanFunc = (state: chunk, line: string) => {
  switch (extractTitle(line)) {
  | None =>
    let nextState = appendLine(line, state);
    let output = None;
    (nextState, output)
  | Some(title) =>
    let nextState = {title, body: [line]};
    let output = toOptChunk(state); /* wrap with option() */
    (nextState, output);
  };
};

You can check out the complete source code in the file ParseChunks.re in the repository. It also includes an imperative version of the chunking algorithm that is less elegant.

Conclusion  

This is a summary of iteration as we have explored it in this blog post:

External Internal
Access Pull Push
Who is in control? Client Data source
Abstraction (iterator) Generator Sequence

Iterators are incredibly useful when processing series of values. It is fascinating that using them feels functional, even though they are an abstraction based on mutable state.

It also noteworthy that once you work with combinators, there is not much of a difference between external and internal iterators. For example, the following two test files are the same (apart from the the modules to be tested).

Further reading