Generic Collections, SubSequences and Overloading

So, you’ve figured out the whole indices versus distances situation, and now you’re feeling pretty happy and you start trying out some more generic collection algorithms. And before long you smack your head against one of the more confusing things about generic collections: a sequence’s SubSequence doesn’t necessarily contain the same element type as that sequence. This usually manifests itself in one of those compiler errors that has you yelling “what are you talking about you infernal machine” at the compiler before eventually realizing what you’ve done wrong and sheepishly fixing it.

For example, suppose you want to find the location of the first subsequence in a collection. Like indexOf, but for another sequence rather than an element. The simple brute-force approach would be to iterate over every subsequence, trying to match it. Something like this – except it won’t compile:

extension CollectionType
  where 
  Generator.Element: Equatable {

    func search<
      Other: SequenceType 
        where 
        Other.Generator.Element == Generator.Element
    >(pattern: Other) -> Index? {

        for idx in self.indices {
            // error: cannot convert value of type 'Other' 
            // to expected argument type...
            if suffixFrom(idx).startsWith(pattern) {
                return idx
            }
        }
        return nil
    }
}

Until you know why, this can be very confusing. You’ve constrained the other sequence to have the same elements as the collection, so why can’t you use startsWith on it?

Except you’re not calling startsWith on Self. You’re calling it on Self.SubSequence, because that’s what suffixFrom returns – it’s a slicing operation. And SubSequence.Generator.Element isn’t guaranteed to be the same as Self.Generator.Element.

Here’s the relevant declaration:

protocol CollectionType {
    typealias SubSequence : Indexable, SequenceType = Slice<Self>
}

There’s nothing about this definition that forces the elements of SubSequence to be anything specific. Really, this needs a where clause on the typealias – something that Swift doesn’t currently support:1

protocol CollectionType {
    typealias SubSequence : Indexable, SequenceType
       // not (yet?) valid Swift
       where SubSequence.Generator.Element == Generator.Element
}

So, in the mean-time, the way around this is to specify any requirements like this as part of the declaration. A fix that might seem to work, but is over-constraining, would be to require subsequences be the same type:

extension CollectionType
  where
  Generator.Element: Equatable,
  SubSequence == Self {
    // etc
}

This would compile, and it would work fine with types whose slices are the same type as themselves, like strings:

// a slice of a String.CharacterView is
// a String.CharacterView, so this is fine 
"hello".characters.search("ll".characters)

but not for arrays, who have an ArraySlice type, or arbitrary collections, that have the default Slice implementation.

// "type of expression is ambiguous without more context", apparently
[1,2,3,4].search(2…4)
// "ambiguous reference to member '..<'" uhhh what?
(0..<5).search(2...4)

Instead, you can constrain the elements to be the same while allowing the slice type itself to differ:

extension CollectionType
  where
  Generator.Element: Equatable,
  // guarantee subsequences contain the same element
  SubSequence.Generator.Element == Generator.Element {
    
    func search<
      Other: SequenceType
        where
        Other.Generator.Element == Generator.Element
    >(pattern: Other) -> Index? {
            
            for idx in self.indices {
                // and now this line will happily compile...
                if suffixFrom(idx).startsWith(pattern) {
                    return idx
                }
            }
            return nil
    }
}

[1,2,3,4].search(2...4) // returns 1
(0..<5).search([3,4])   // returns 3

Optimizing for Random Access

It’s often the case that you can optimize a collection algorithm if you know it has random access. For example, given the above algorithm, you can tweak it to avoid searching for a pattern that is longer than what remains of the collection, so long as both the collection and the pattern are random access and therefore able to calculate their lengths (and the lengths of their collection) in constant time. OK, maybe it’s not Boyer-Moore, but it’s better than nothing and the code to do this is pretty simple, in fact it’s more constraints than code at this point:

extension CollectionType
  where
  Generator.Element: Equatable,
  Index: RandomAccessIndexType,
  SubSequence.Generator.Element == Generator.Element {
    func search<
      Other: CollectionType
      where
      Other.Index: RandomAccessIndexType,
      Other.Index.Distance == Index.Distance,
      Other.Generator.Element == Generator.Element
    >(pat: Other) -> Index? {
            
        // if pattern is longer, this cannot match, exit early
        guard !isEmpty && pat.count <= count else { return nil }
        
        // otherwise, search from the start up to the 
        // end less the space for the pattern
        for i in startIndex...endIndex.advancedBy(-pat.count) {
            // check if a slice from this point
            // starts with the pattern
            if self.suffixFrom(i).startsWith(pat) {
                // if it does, we've found it
                return i
            }
        }
        // otherwise, not found
        return nil
    }
}

(one part of the where clause, Other.Index.Distance == Index.Distance, is maybe overconstraining, but the alternative would be fiddling around with casts – non-Int distances aren’t exactly common, even Bit’s distance is an Int).

We’ve talked about overload priority a long time ago, and it still pertains even with protocol extensions. The most specific function wins. Here, if your two collections are random access, this version will be called because it is more tightly constrained than the first version. So:

// calls the version that works for any index
"hello".characters.search("ll".characters)
// calls the version optimized for random-access
[1,2,3,4].search(2…4) // returns 1?

Now, there is a potentially nasty bug you could stumble into here, and the compiler won’t save you from it. Try removing the constraints requiring the indices be random access on the second version. It still compiles! Except, now this version will be called strings as well. But in the case of non-random indices, it’s not an optimized version, it’s potentially slower. The reason being, pat.count and endIndex.advancedBy aren’t constant time, they’re linear, because they involve incrementing the index one by one. Now it could be worse – at least it’s not accidentally quadratic, but other examples could easily be (imagine implementing a sort that assumed constant-time advance but didn’t get it), so it’s something to watch out for.

Protocols, Extensions and Dynamic Dispatch

Once upon a time, it was harder to have made this mistake. In earlier versions of the standard library there were two ways to advance an index: a free advance function, that was O(1) for random access indices and O(n) otherwise, and then there was an advancedBy method on only the RandomAccessIndexType.

As part of the great extensionizing, the free advance function went away, and advancedBy moved into ForwardIndexType. So now you can call it on any index type, but you’ll get different performance characteristics depending on the type you call it on.

If you look at the declaration of ForwardIndex.advancedBy, you’ll see that it’s declared twice. Once inside the protocol:

public protocol ForwardIndexType : _Incrementable {
    public func advancedBy(n: Self.Distance) -> Self
}

and then again just below, as an extension:

extension ForwardIndexType {
    public func advancedBy(n: Self.Distance) -> Self
}

This seems a little weird. Why declare it twice? Why not just declare it once as an extension? It has to do with static versus dynamic dispatch, and which actual implementation is called.

To show this, let’s create our own toy index, initially without random access:

struct MyIndex {
    let i: Int
}

extension MyIndex: BidirectionalIndexType {
    func successor() -> MyIndex {
        print("successing!")
        return MyIndex(i: i+1)
    }
    func predecessor() -> MyIndex {
        print("predecessing!")
        return MyIndex(i: i-1)
    }
}

// indices need to be Equatable
func == (lhs: MyIndex, rhs: MyIndex) -> Bool {
    return lhs.i == rhs.i
}

I’ve put prints in there so you can tell when the functions are called. Now, this index type will have the default advancedBy implementation, which just calls successor() multiple times. And because of the print, you can see it happening:

func castHolyHandGrenadeOfAntioch<
  I: ForwardIndexType
>(idx: I) {
    idx.advancedBy(3)
}

let idx = MyIndex(i: 1)
castHolyHandGrenadeOfAntioch(idx)
// prints “successing!” 3 times

OK, now, we conform the type to be random access by giving it advance and distance functions:

extension MyIndex: RandomAccessIndexType {
    func advancedBy(n: Int) -> MyIndex {
        return MyIndex(i: i+n)
    }
    func distanceTo(end: MyIndex) -> Int {
        return end.i - i
    }
}

and now, when we call the castHolyHandGrenadeOfAntioch, we see nothing printed. Instead, even though the function only required a forward index, it uses the better advancedBy implementation of the random-access index.

But… this only works because advancedBy had been declared in the original ForwardIndexType protocol declaration. If, on the other hand, we added another method as an extension to different index types – say, a retreatedBy function – this would not happen.

First, let’s do that for BidirectionalIndexType:

extension BidirectionalIndexType {
    // just an example, you don't have to do this!
    // since BidirectionalIndexType also supports
    // advancedBy(-distance)
    func retreatedBy(n: Distance) -> Self {
        var i = n
        var result = self
        // since Distance isn’t strideable, this
        // is a rare case where you might mourn the
        // loss of C-style for and the ++ operator…
        while i > 0 {
            result = result.predecessor()
            i -= 1
        }
        return result
    }
}

and then, an optimized version for RandomAccessIndexType:

extension RandomAccessIndexType {
    func retreatedBy(n: Distance) -> Self {
        return self.advancedBy(-n)
    }
}

But now, if we repeat our earlier experiment but calling this function, we get a different result:

// only requires bidirectional access
func runAway<I: BidirectionalIndexType>(idx: I) {
    idx.retreatedBy(10)
}

// and even though MyIndex is random access
let braveSirRobin = MyIndex(i: 10)

// the bidirectional version is called...
runAway(braveSirRobin)
// prints “predecessing!” 10 times

So we can see the difference between default implementations that are also declared as part of the protocol, versus ones just created as extensions. With the former, generic functions get to take advantage of specialized implementations, whereas with the latter, only the implementation that matches the generic constraint will be called. This is why you often see functions declared twice in the standard library. For example, on SequenceType, underestimateCount is declared like this, so that if you pass in a collection to a function that takes a sequence, it can check it’s length if it’s a collection, without risking consuming the sequence.


  1. That’s assuming that there isn’t a legitimate reason to vary elements in subsequences which I don’t… think… there is? 

6 thoughts on “Generic Collections, SubSequences and Overloading

  1. I think (but am not certain) this implementation of ‘retreatedBy’ is simpler:

    extension BidirectionalIndexType where Distance: SignedIntegerType {
    func retreatedBy (n: Distance) -> Self {
    return (0..<n).reduce(self) { (i,_) in i.predecessor() }
    }
    }

    This requires constraining Distance, but I don't think that is a problem because as far as I can tell all Distance typealiases on Index types in the Standard Library are either Int or IntMax (a.k.a. Int64). In any case it's hard to imagine any Distance that is not at least an IntegerType (it is defined as a "number of steps"). Presumably BidirectionalIndexType.Distance is defined as _SignedIntegerType in the standard library for technical reasons.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s