UPDATE: now conforms to Swift 2.1b2
Following on from the previous post, here’s some more on building collections from enums.
For our book on Swift, we wrote a simple linked list example to use in a section on conforming to CollectionType
. Prior to indirect
, it was written using a box, and I wanted to update it. But I hit a snag, of which more later. First, a bit about lists and value types.
Here is a super-simple linked list:
enum List<Element> { case End indirect case Node(Element, next: List<Element>) }
You prepend another element to the list by creating a new node, with the next value that of the current list. To make that a little easier, we can create a method for it:
extension List { func cons(x: Element) -> List { return .Node(x, next: self) } } // a 3-element list: 3, 2, 1 let l = List<Int>.End.cons(1).cons(2).cons(3)
Just like the tree from last time, this list has an interesting property – it is “persistent“. The nodes are immutable – you cannot change them once created. Consing another element onto the list doesn’t copy the list, just gives you a new node that links on to the front of the existing list (again, the first couple of chapters of the Purely Functional Data Structures book is a really good primer for this stuff).
This means two lists can share a tail. To illustrate this, for the first time on this blog (duh duh duhhn…), a diagram:
The immutability of the list is key here. If you could change the list (say, remove the last entry, or update the element held in a node), then this sharing would be a problem – a
might change the list, and the change would affect b
.
Values and Variables
This list is a stack, with consing as a push, and unwrapping the next element a pop. Suppose we added two methods to simplify these operations:
extension List { mutating func push(x: Element) { self = self.cons(x) } mutating func pop() -> Element? { switch self { case .End: return nil case let .Node(x, next: xs): self = xs return x } } }
“What?! Mutating methods? I thought you said the list had to be immutable.”
Yes, I did. And it still is:
var stack = List<Int>.End.cons(1).cons(2).cons(3) var a = stack var b = stack a.pop() // 3 a.pop() // 2 a.pop() // 1 stack.pop() // 3 stack.push(4) b.pop() // 3 b.pop() // 2 b.pop() // 1 stack.pop() // 4 stack.pop() // 2 stack.pop() // 1
What this shows us is the difference between values and variables. The nodes of the list are values. They cannot change. A node of 3
and a reference to the next node cannot become some other value. It will be that value forever, just like the number 3
cannot change. It just is. Just because these values in question are structures with pointers to each other doesn’t make them less value-y.
A variable a
, on the other hand, can change the value it holds. It can be set hold a value of an indirect reference to any of the nodes, or to the value End
. But changing a
doesn’t change these nodes. Just which node a
refers to.
This is what these mutating methods on structs do – they take an implicit inout
argument of self
, and they can change the value self holds.1 This doesn’t change the list, just which part of the list the variable currently represents.
In this sense, the variables are iterators into the list:
You can of course declare your variables with let
instead of var
, in which case the variables will be constant i.e. you can’t change the value they hold once set. But let
is about the variables, not the values. Values are constant by definition.
Now, this is all just a logical model of how things work. In reality, the nodes are actually places in memory that point to each other. And they take up space, which we want back if it’s no longer needed. So Swift uses ARC to manage this, and frees them up:
Conforming to SequenceType
OK, so if List
variables act like iterators, we should be able to use them to iterate, right? For example, we can conform List
to SequenceType
:
extension List: SequenceType { func generate() -> AnyGenerator<Element> { // declare a variable to capture that // tracks progression through the list: var current = self return anyGenerator { // next() will pop, returning nil // when the list is empty: current.pop() } } }
And, to make things easier, to ArrayLiteralConvertible
:
extension List: ArrayLiteralConvertible { init(arrayLiteral elements: Element...) { self = elements.reverse().reduce(.End) { $0.cons($1) } } }
So now you can use lists with for...in
:
let l: List = ["1","2","3"] for x in l { print(x) }
It also means, through the power of protocol extensions, we can use List
with dozens of standard library functions:
l.joinWithSeparator(",") // "1,2,3" l.contains("2") // true l.flatMap { Int($0) } // [1,2,3] l.elementsEqual(["1","2","3"]) // true
Boxes and Identity
But not so many as if we conformed to CollectionType
. CollectionType
brings us properties like .first
, which would be nice to use to peek at the first element without popping it.
More importantly, conforming to CollectionType
gives a guarantee that multiple passes over the sequence are OK. As the docs for SequenceType
say:
SequenceType makes no requirement on conforming types regarding whether they will be destructively “consumed” by iteration. To ensure non-destructive iteration, constrain your sequence to CollectionType.
Our list definitely fits this description – you can iterate it as much as you like. So we’d like it to be a collection.
Since variables holding nodes can be treated as iterators, it seems reasonable to use them them as a ForwardIndexType
.
But first, lets separate out the nodes from the list and index types. It is tempting to just conform the enum to CollectionType
and ForwardIndexType
directly, but this will lead to problems. For example, those two protocols need very different implementations of ==
. The index needs to know if two indices from the same list are at the same position. It should not need the elements to conform to Equatable
. The collection, on the other hand, should be able to compare two different lists to see if they hold the same elements. It will need the elements to conform to Equatable
.
So lets rename the enum to ListNode
, making it a private implementation detail, and add public List
and ListIndex
types that use it:
private enum ListNode<Element> { case End indirect case Node(Element, next: ListNode<Element>) func cons(x: Element) -> ListNode<Element> { return .Node(x, next: self) } } public struct ListIndex<Element> { private let node: ListNode<Element> } public struct List<Element> { public typealias Index = ListIndex<Element> public var startIndex: Index // since this is a constant, no need to store it in a variable, // so use a computed property instead to save space public var endIndex: Index { return Index(node: .End) } public init() { startIndex = Index(node: .End) } } extension List { public mutating func push(x: Element) { startIndex = Index(node: startIndex.node.cons(x)) } public mutating func pop() -> Element? { guard case let .Node(x, next: xs) = startIndex.node else { return nil } startIndex = Index(node: xs) return x } } extension List: ArrayLiteralConvertible { public init(arrayLiteral elements: Element...) { self = List() for x in elements.reverse() { push(x) } } }
OK so now to conform List
to CollectionType
. The first step is to conform ListIndex
to ForwardIndexType
. Simple enough, successor()
just needs to return the next node:
extension ListIndex: ForwardIndexType { public func successor() -> ListIndex<Element> { guard case let .Node(_, next: next) = node else { fatalError("cannot increment endIndex") } return ListIndex(node: next) } }
But this is only half the story. ForwardIndexType
conforms to Equatable
, which requires you to implement ==
. And this is where we hit the buffers… how can we do that?
public func == <T>(lhs: ListIndex<T>, rhs: ListIndex<T>) -> Bool { switch (lhs.node,rhs.node) { case (.End,.End): return true case (.Node,.End),(.End,.Node): return false case (.Node,.Node): // what to put here??? } }
The problem being, indirect enum cases do not have identity. Once upon a time, when this list code was implemented with box, the boxes had identity, which you could use to solve this problem:
// Box, being a class, will have a === operator final class Box<Element> { let unbox: Element init(_ x: Element) { unbox = x } } enum ListNode<Element> { case End // a node will be held inside a box, so have identity case Node(Box<(Element, next: ListNode<Element>)>) func cons(x: Element) -> ListNode<Element> { return .Node(Box(x,next: self)) } }
Now, no-one is going to mourn the loss of Box
, since it got all up in your pattern matching and generally dirtied up the joint. But it did mean you could write ==
like this:
func ==<T>(lhs: List<T>, rhs: List<T>) -> Bool { switch (lhs, rhs) { case (.End,.End): return true case (.End,.Node),(.Node,.End): return false case let (.Node(l),.Node(r)): return l === r } }
This all fitted perfectly well with the notion of List
being a value type. Box
is not a value type – being a class, it is a reference type. But it is immutable – it only has one property, declared with let
. So it has value semantics.
This is a subtle but important distinction. Value types are copied on assignment, while reference types are shared (or rather, the thing they refer to is shared). But value types can contain reference types as properties. When value types containing reference types are copied, it is a shallow copy. The two value types will share a reference. If the reference type in question has methods that can mutate its state, the containing type needs to account for this possibility. This can get pretty hairy.
In the case of Box
, this is fine though, because it is immutable. The sharing doesn’t matter, because once created, a box can never be changed. Yes, what is boxed could be a reference type you could then mutate, but since what we are boxing is our enum, and that is a value type, it’s fine.
(the list could contain mutable reference types. But we can ignore this – in the same way an array containing reference types doesn’t meddle with the array’s value semantics.)
So, with our nodes being references, we essentially got a free globally-unique identifier for each one, which we could then use to implement node equality. But now that we have switched away from box, we can’t use ===
any more. So what can we do?
Well, we could rely on the fact that, really, indirect
is kind of the same as the box approach, just with the compiler writing the boxing/unboxing in for you and hiding the details. So we could do something very unwise, and just look at the bits:
public func == <T>(lhs: ListIndex<T>, rhs: ListIndex<T>) -> Bool { switch (lhs.node,rhs.node) { case (.End,.End): return true case (.Node,.End),(.End,.Node): return false case (.Node,.Node): // engage ಠ_ಠ now... return unsafeBitCast(lhs, UInt.self) == unsafeBitCast(rhs, UInt.self) } }
This is evil, but it appears to work. When a ListNode.Node
is copied, with indirect
this just makes a copy of the indirect reference. Then, if the variable is updated, it changes to be a new reference and is no longer equal.
But it is almost certainly Not A Good Idea. It’s relying on an implementation detail of indirect that, if it ever changed or if the compiler did some optimization under the hood that did something different, would lead to undefined behaviour. There could well be other ways in which my behaviour assumptions are incorrect and this breaks, though I haven’t found any.
Next hacky solution: only make the ends of the index equal. Even if two nodes might be the same, return false
regardless:
public func == <T>(lhs: ListIndex<T>, rhs: ListIndex<T>) -> Bool { switch (lhs.node,rhs.node) { case (.End,.End): return true default: return false } }
Again, this mostly works, but we are still breaking the rules – in this case, the rules of Equatable
.
Equatable
requires that ==
be an equvalence relation. This means, amongst other things, that it must be reflexive – a thing must equal itself. And our indices don’t (unless they’re the end):
l.startIndex == l.startIndex // false... not OK!
Mysterious undiagnosable bugs can ensue from this. There are some places in the standard library where ==
being an equivalence relation is relied upon. For example, Array
takes a shortcut of first checking if two arrays share the same buffer. If they do, they have to be equal, right? But implement an Equatable
type whose ==
isn’t reflexive, and you get some very strange behaviour:
struct NeverEqual: Equatable { } func == (lhs: NeverEqual, rhs: NeverEqual) -> Bool { return false } var a = [NeverEqual()] let b = a a == b // true a[0] = b[0] a == b // false
…so we should probably rule this out.
Conforming to CollectionType
Finally, a third option: tag our list indices with a number, and use that to compare them. Whenever you change an index, increment/decrement the tag.
edit: I originally put this tag on the nodes, but Thorsten in the comments made a great point that it is more space-efficient to have it on the indices.
This solution is a bit more intrusive. We’ll have to reconfigure ListIndex
to acommodate the tag:
public struct ListIndex<Element> { private let node: ListNode<Element> private let tag: Int }
…and any places where generate new indices:
public struct List<Element> { public typealias Index = ListIndex<Element> public var startIndex: Index public var endIndex: Index { return Index(node: .End, tag: 0) } public init() { startIndex = Index(node: .End, tag: 0) } } extension ListIndex: ForwardIndexType { public func successor() -> ListIndex<Element> { guard case let .Node(_, next: next) = node else { fatalError("cannot increment endIndex") } return ListIndex(node: next, tag: tag.predecessor()) } } extension List { public mutating func push(x: Element) { startIndex = Index(node: startIndex.node.cons(x), tag: startIndex.tag.successor()) } public mutating func pop() -> Element? { guard case let .Node(x, next: _) = startIndex.node else { return nil } ++startIndex return x } }
Now, index equality can now be performed just by comparing the tag
:
public func == <T>(lhs: ListIndex<T>, rhs: ListIndex<T>) -> Bool { return lhs.tag == rhs.tag }
Finally, we can conform List
to CollectionType
:
extension List: CollectionType { public subscript(idx: Index) -> Element { guard case let .Node(x, _) = idx.node else { fatalError("Subscript out of range") } return x } }
and it gainst the collection-only features:
l.first // .Some("1") if let twoIdx = l.indexOf("2") { l[twoIdx] // "2" }
There’s no longer any need to implement generate
for the SequenceType
conformance. Collections automatically gain a default implementation using IndexingGenerator
.
As an added benefit of the tagging solution, since the tag is the count of nodes prepended to .End
, we can implement a constant-time count
, even though the lists are forward-index only which would normally mean linear-time counting:
extension List { public var count: Int { return startIndex.tag } }
It is true that two ListIndex
types from two totally different lists with different values can now be equated and be equal to each other. But then, this is true of array’s indices too (cos, you know, they’re just integers).
We can also implement ==
for List
the way users expect ==
to behave for value-semantic collections, by comparing the elements:
func == <T: Equatable>(lhs: List<T>, rhs: List<T>) -> Bool { return lhs.elementsEqual(rhs) }
Efficient Slicing
Collections also get a default implementation of the slicing operation, [Range]
, which is how dropFirst
works:
// equivalent of dropFirst(l) let firstDropped = l[l.startIndex.successor()..<l.endIndex] // firstDropped will be a Slice<List<String>>
But this default implementation returns a wrapper over the original collection, plus an index sub-range, so is bigger than it needs to be in List
‘s case:
// Size of a list is size of two nodes, the start and end: sizeofValue(l) // returns 16 // Size of a list slice is size of a list, plus size of a subrange // (a range between two indices, which in lists's case are also list nodes) sizeofValue(dropFirst(l)) // returns 48
Remember, since endIndex
was always the same value, it takes no space inside List
, so the sizeofValue(l)
is just the size of List
’s sole member variable, startIndex
.
But if we switch to hold an explicit endIndex
property, lists could return subsequences of themselves by holding a different start and end:
public struct List<Element> { public typealias Index = ListIndex<Element> public var startIndex: Index public var endIndex: Index public init() { startIndex = Index(node: .End, tag: 0) endIndex = startIndex } // and add the slicing initializer, which takes a range: private init (subRange: Range<Index>) { startIndex = subRange.startIndex endIndex = subRange.endIndex } public subscript (subRange: Range<Index>) -> List<Element> { return List(subRange: subRange) } } // now lists take 32 bytes (a startIndex and an endIndex): sizeofValue(l) // returns 32 // but list slices are also lists, so only 32 bytes sizeofValue(dropFirst(l)) // returns 32
This leads to another benefit of this list implementation. With many sliceable containers, including Swift’s arrays and strings, a slice shares the storage buffer of the original collection. This has an unpleasant side-effect: slices can keep the original collection’s buffer alive in its entirety, even if the original collection falls out of scope. If you suck a 1gb file into an array or string, then slice off a tiny part, the whole 1gb buffer will stay in memory until both the collection and the slice are destroyed.
With List
, it isn’t quite as bad. As we’ve seen, the nodes are refcounted. Which means when the slices are the the only remaining copy, any elements dropped from the front will be reclaimed as soon as no-one is referencing them.
Not the back though – since the slice’s endIndex
still has a reference to what comes after. You could optimize this even further by having slices only hold the tag for their end index if you wanted.
I hate writing conclusions
So, losing Box was mostly great, but it also lost us one benefit, of a free identity for our nodes. This identity is still secretly there, and I think it works similarly, but we can’t use it. I have a feeling there’s a case to be made for enums to be able to use it internally, without necessarily breaking with the idea that enums are value types. But there are probably hidden gotchas to this that I haven’t thought of. Tags serve as a reasonable subsitute (at the expense of an extra few bytes), but this approach wouldn’t easily adapt to more complex structures like trees, so conforming last week’s code to a collection type would probably prove tricky (though, a persistent stack like this list may help with creating the index type). If anyone wants to take a stab, let me know what you come up with!
[…] Airspeed Velocity (tweet): […]
Great post as always!
Just one comment: Instead of having to tag each node you could also put the tag into the ListIndex (incrementing it in the successor() method). Then the tag would only take up space when indices are needed. With this approach the tag would count upwards from the beginning of the list and would be equivalent to the element index.
[…] other feedback, a few commented on the previous post about linked lists. Why implement an outdated data structure? What’s the point when we have […]
[…] Might be a bad idea. (Airspeed Velocity had a post on making Lists O(1) indexable) […]
[…] add two indices together. Think about non-contiguous storage like a node-based container like a linked list or a tree. You can't just add together two node pointers. Or for a random-access example, what […]