Why Persistent LinkedHashMaps are cool and how to implement them.
After much (Lisp)ing I finally release my first library for Clojure.
This is a functional data structure that behaves like a LinkedHasMap (or Set), only it's completely persistent. Great, right? And the performance is comparable to a highly-optimised Hash Array Mapped Tree
such the one that Clojure implements in its standard library.
Ok, wait, I probably lost you already. Let's start again.Step 1) LinkedHashMap:
If you're a Java developer you should know LinkedHashMap very well.
Not because of what it does wrt HashMap (remembering the insertion order of the keys), but because standard HashMap has a major defect that LinkedHashMap solves with no penalties. In fact every time you iterate through a HashMap the order of the elements is non-deterministic
. This means that if you have to reproduce a certain bug given the same data structure and state you can always get different output. This is not only bad, this is pure masochism for no apparent reason. Since LinkedHashMap remembers the insertion order its iteration is deterministic and you should always prefer this to the standard HashMap.
Leaving these defects aside, you probably want to use LinkedHashMap for things like code generation/manipulation, data transmission (if I send you something in a certain order I expect that in the same order), ordered subcribers/receivers and so forth. It uses so little more bytes and cpu cycles that using it is a no brainer.Step 2) Persistent Data Structures:
Think if every time you do "foo" + "bar"
instead of getting a new String "foobar"
you were actually changing "foo"
to become "foobar"
That would make all your programs much, much more complicated.
Well the same applies for all data structures (after all Strings are just a Vector of Chars).
Every time you modify a collection you should get a new data structure reflecting the change you made, leaving the previous collection unchanged.
Persistent data structures allow you to do this with an efficient implementation (who wants copy on write
these days?) so that you can freely share references of all your colections, knowing that nobody can mutate them.
If you're on Java you can try out persistent collections with this great port from Clojure: https://github.com/krukow/clj-ds
.Step 3) Implementation details:
LinkedHashMap remembers insertion order by adding two additional pointers to every node stored in the map: previous and next.
Adding and removing an element it's just a matter of changing a couple of pointers in place because hey! it's all mutable state here. Ordering works just like LinkedList works.
If we want a persistent data structure we can't have mutability and we must create a node with updated pointers. But if the pointers are referring to other nodes when we create a new node all the nodes that were referring to it also
need to be updated.
And that goes on and on until you basically implemented an inefficient copy on write
What to do then?
There exist already a couple of implementations of a persistent sorted map:
- Clojure internally uses an array-map
which is just an array of key-values
that implements the Map
interface. This is efficient for very small maps, but as soon as you try to modify it it turns into a standard has-map
- Another implementation of ordered map uses a backing vector to redundantly store the key-value
pairs. Every time you add a new element you grow the vector (which is efficient in Clojure) and save the index of the element inside the backing map. Every time you remove an element you fill the vector position with a nil
. This is a reasonably fast implementation but has the potential of eating up all the space if you never remove all those nil
references.Step 4) Solution:
In literature you find plenty of functional data structures that may suggest you a smart implementation: zippers, finger-trees, functional graphs etc. But surprisingly there is a very simple and efficient solution to this problem.
You can do exactly what a LinkedHashMap does, only instead of storing references to other nodes, you store references to the keys of those nodes
If your map entry looks like [key value left-key right-key]
every time you remove an element in the list you just need to update the left an right node to point at each other's key, skipping the node in the middle.
So instead of a simple query + remove you're actually doing 3 queries, 1 remove and 2 adds. But since it's a map all those operations work on constant time.
There are other small optimisations such as promoting the head and the tail of the list as class members so you perform faster insertion and faster deletion at the extremities.
If code with lots of parens does not discourage you have a look at the Clojure implementation here: https://github.com/frankiesardo/linked/blob/master/src/linked/map.clj
All summed up this implementation has the same Big O complexity of a persistent hash map and an effective complexity of about 2.5 times slower, which is great.
I already see a potential application for a REPL as an immutable map where you can undo
evaluations effortlessly. But that's for another post.