May 2015 – Airspeed Velocity

One of my favourite blog posts is Peter Norvig’s How to Write a Spelling Corrector, in which he writes 21 lines of python that sucks in a large corpus of text, and then uses some probability theory to propose spelling corrections. Go read it now, it’s cool.

Anyway, that code has been translated into multiple different programming languages, but not Swift, so let’s give that a go. The resulting code is maybe not as compact as the python equivalent, but to my eyes that’s not such a bad thing, as the python is a bit write-only for my liking.¹

List Comprehensions and Swift

The heart of the code is this function, which generates a set of possible alternatives for a word:

def edits1(word):
   splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
   deletes    = [a + b[1:] for a, b in splits if b]
   transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
   replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
   inserts    = [a + c + b     for a, b in splits for c in alphabet]
   return set(deletes + transposes + replaces + inserts)

This makes extensive use of python’s list comprehensions, which Swift lacks, so we’ll have to make do with map etc.

First, splits. The code takes a string, and generates an array of all the ways it can be split into two. So hello becomes (, hello), (h, ello), (he, llo), (hel, lo), (hell, o).

One way to do this in Swift would be to use map over the indices of the word, slicing it in two at each index, like so:

let splits = indices(word).map {
    (word[word.startIndex..<$0], word[$0..<word.endIndex])
}

This works – but those word.startIndex and word.endIndex are a bit cluttering. The python version uses a handy one-sided slicing operation that looks a bit neater. But we can easily extend String to have the same functionality. We just need to create a pre- and postfix version of the range operator, and then have that return a type that indicates a one-sided range:

// even though infix ..< already exists, we need to declare it
// two more times for the prefix and postfix form
postfix operator ..< { }
prefix operator ..< { }

// then, declare a couple of simple custom types that indicate one-sided ranges:
struct RangeStart<I: ForwardIndexType> { let start: I }
struct RangeEnd<I: ForwardIndexType> { let end: I }

// and define ..< to return them
postfix func ..<<I: ForwardIndexType>(lhs: I) -> RangeStart<I> 
{ return RangeStart(start: lhs) }

prefix func ..<<I: ForwardIndexType>(rhs: I) -> RangeEnd<I> 
{ return RangeEnd(end: rhs) }

// finally, extend String to have a slicing subscript for these types:
extension String {
    subscript(r: RangeStart<String.Index>) -> String { 
        return self[r.start..<self.endIndex]
    }
    subscript(r: RangeEnd<String.Index>) -> String { 
        return self[self.startIndex..<r.end] 
    }
}

This is a lot of effort to go to for just this one case, but probably useful to stash in a library somewhere for reuse. Armed with this, we can simplify the splits function:

// hello becomes (,hello), (h,ello), (he,llo) etc.
let splits = indices(word).map {
    (word[..<$0], word[$0..<])
}

OK now for each of the transformations. Deleting a single character is easy, just drop the first character from the right-hand of each split and join them back up:

// (h, ello) becomes hllo
let deletes = splits.map { $0.0 + dropFirst($0.1) }

Next, transposition. We want to generate every string in which two adjacent characters in the original string are swapped. This can be done by taking the first two characters of the right side of each split, swapping them, and joining it all back together. The tricky part is that not all right-hand sides will have two characters. The python code uses integer indexing and the fact that you can slice beyond the index range of a python string and just get back an empty string. Both of these are a non-starter in Swift, so the code is going to need to be a little more verbose:

// (h, ello) becomes hlelo
let transposes: [String] = map(splits) { left, right in
    if let fst = first(right) {
        let drop1 = dropFirst(right)
        if let snd = first(drop1) {
            let drop2 = dropFirst(drop1)
            return "\(left)\(snd)\(fst)\(drop2)"
        }
    }
    return ""
}.filter { !$0.isEmpty }

Unlike with delete, there’s not much to be done to neaten this up. There are helper functions that might make it a bit better (like dropTwo or firstTwo) but it’ll still look a bit messy. If anyone has some good suggestions, tweet me

Replacement requires a map within a map. We want to replace every character with another character from the alphabet. With each split, this can be done by dropping the first character of the right-hand side, and inserting each letter of the alphabet. If we do this with map, we’ll get an array of possible strings, but then if we map that over each of the splits, we’ll have a nested array. So instead, we can use flatMap, as introduced in Swift 1.2, which flattens out the arrays returned from each inner map into a single array of values:

let alphabet = "abcdefghijklmnopqrstuvwxyz"

let replaces = splits.flatMap { left, right in
    // (he, llo) becomes healo, heblo, heclo etc
    map(alphabet) { "\(left)\($0)\(dropFirst(right))" }
}

Finally, insertion is just the same as replace, only without dropping the first character of the right-hand side:

let inserts = splits.flatMap { left, right in
    // (he, llo) becomes heallo, hebllo, hecllo etc
    map(alphabet) { "\(left)\($0)\(right)" }
}

Reading the corpus

Next, we need to be able to load in a large text file, break it up into individual words, and then count the frequence of each word. This will be used to pick the most likely word in the case of misspellings. There are many good ways to do this, especially given useful Foundation classes like NSRegularExpression and NSCharacterSet. But since this is a Swift blog, let’s try for something almost entirely in Swift, just using the Foundation extension to String to read in the file:

struct SpellChecker {
    var knownWords: [String:Int] = [:]

    mutating func train(word: String) {
        knownWords[word] = knownWords[word]?.successor() ?? 1
    }    

    init?(contentsOfFile file: String) {
        if let text = String(contentsOfFile: file, encoding: NSUTF8StringEncoding, error: nil)?.lowercaseString {
            let words = split(text.unicodeScalars) { !("a"..."z").contains($0) }.map { String($0) }
            for word in words { self.train(word) }
        }
        else {
            return nil
        }
    }    
}

This code is rather shamefully English-language-centric. It may also cope partially with other Latin-based languages, but more through luck than anything else (for example, ("a"..."z").contains("é") is true, but ("a"..."z").contains("ß") is false).

Why split the unicodeScalars and then restringify them, rather than split the string itself directyly? Because splitting Swift strings is currently painfully slow. Swift makes heroic efforts to be correct regarding Unicode, such as presenting grapheme clusters as single characters. But this comes at a price, and since we’ve already acknowledged that we are really living in ASCII-world, it’s probably not a price worth paying.

Correcting spelling

Finally, we’re ready to write the spelling corrector. We need a function that takes a sequence of possible words, and filters out those that aren’t known words:

extension SpellChecker {
    func known<S: SequenceType where S.Generator.Element == String>(words: S) -> Set<String>? {
        let s = Set(filter(words) { self.knownWords.indexForKey($0) != nil })
        return s.isEmpty ? nil : s
    }
}

and a function that takes a word, and generates a list of known words from edits of edits (i.e. a distance of 2 from the original word):

extension SpellChecker {
    func knownEdits2(word: String) -> Set<String>? {
        var known_edits: Set<String> = []
        for edit in edits(word) {
            if let k = known(edits(edit)) {
                known_edits.unionInPlace(k)
            }
        }
        return known_edits.isEmpty ? nil : known_edits
    }
}

Why do these functions return an optional Set when it’s empty? Well, the goal is to go first for the word itself if it’s known, then the most likely word with a distance of 1 (i.e. edit(word)), then, if there’s no such word, the most likely word with a distance of 2 (i.e. knownEdits2(word)).

In the python version, this is done with or, which chooses the first non-empty set (i.e. [1,2] or [3,4] returns [1,2], but [] or [3,4] returns [3,4]). Swift doesn’t have this feature – but it’s very similar to how the nil-coalescing operator works. So if known and knownEdits2 return optionals, you can write the correcting function like this:

extension SpellChecker {
    func correct(word: String) -> String {
        let candidates = known([word]) ?? known(edits(word)) ?? knownEdits2(word)

        return reduce(candidates ?? [], word) {
            (knownWords[$0] ?? 1) < (knownWords[$1] ?? 1) ? $1 : $0
        }
    }
}

This is possibly abuse of the ?? operator, and perhaps we’d be better off overloading || to have the same behaviour as python’s or for collections. Still, it works.

The reduce is just searching for the value in candidates with the maximum frequency in the knownWords model. In the python version, the dictionary has a default value of 1 – that is, if a word isn’t known, it’s assumed to have a frequency of 1. This accounts for rare words that are correct, but not in the corpus. Swift doesn’t have dictionary defaulting, so we can use ?? to replace the not-found nil value with 1 instead.

And that’s pretty much it. There are lots of speed improvements that could be made by sacrificing the brevity of the code, and the original blog article has lots of ideas for enhancing this to make it better functionally. Here’s the full code, which you should be able to compile as a single file with swiftc to build a stand-alone executable. Let me know if you spot any bugs, I’m sure there are a couple!

/// Translation of [Peter Norvig's spell checker](http://norvig.com/spell-correct.html) into Swift.
/// Sample input corpus [here](http://norvig.com/big.txt)

import Foundation   // purely for IO, most things done with Swift.String

/// Given a word, produce a set of possible alternatives with
/// letters transposed, deleted, replaced or rogue characters inserted
func edits(word: String) -> Set<String> {
    if word.isEmpty { return [] }

    let splits = indices(word).map {
        (word[word.startIndex..<$0], word[$0..<word.endIndex])
    }

    let deletes = splits.map { $0.0 + dropFirst($0.1) }

    let transposes: [String] = map(splits) { left, right in
        if let fst = first(right) {
            let drop1 = dropFirst(right)
            if let snd = first(drop1) {
                let drop2 = dropFirst(drop1)
                return "\(left)\(snd)\(fst)\(drop2)"
            }
        }
        return ""
    }.filter { !$0.isEmpty }

    let alphabet = "abcdefghijklmnopqrstuvwxyz"

    let replaces = splits.flatMap { left, right in
        map(alphabet) { "\(left)\($0)\(dropFirst(right))" }
    }

    let inserts = splits.flatMap { left, right in
        map(alphabet) { "\(left)\($0)\(right)" }
    }

    return Set(deletes + transposes + replaces + inserts)
}


struct SpellChecker {

    var knownWords: [String:Int] = [:]

    mutating func train(word: String) {
        knownWords[word] = knownWords[word]?.successor() ?? 1
    }    

    init?(contentsOfFile file: String) {
        if let text = String(contentsOfFile: file, encoding: NSUTF8StringEncoding, error: nil)?.lowercaseString {
            let words = split(text.unicodeScalars) { !("a"..."z").contains($0) }.map { String($0) }
            for word in words { self.train(word) }
        }
        else {
            return nil
        }
    }

    func knownEdits2(word: String) -> Set<String>? {
        var known_edits: Set<String> = []
        for edit in edits(word) {
            if let k = known(edits(edit)) {
                known_edits.unionInPlace(k)
            }
        }
        return known_edits.isEmpty ? nil : known_edits
    }

    func known<S: SequenceType where S.Generator.Element == String>(words: S) -> Set<String>? {
        let s = Set(filter(words) { self.knownWords.indexForKey($0) != nil })
        return s.isEmpty ? nil : s
    }

    func correct(word: String) -> String {
        let candidates = known([word]) ?? known(edits(word)) ?? knownEdits2(word)

        return reduce(candidates ?? [], word) {
            (knownWords[$0] ?? 1) < (knownWords[$1] ?? 1) ? $1 : $0
        }
    }
}


// main()

if Process.argc == 2, let checker = SpellChecker(contentsOfFile: Process.arguments[1]) {

    println("Type word to check and hit enter")
    while let word = NSString(data: NSFileHandle.fileHandleWithStandardInput().availableData,
                              encoding: NSUTF8StringEncoding) as? String
        where last(word) == "\n"
    {
        let word = dropLast(word)
        let checked = checker.correct(word)

        if word == checked {
            println("\(word) unchanged")
        }
        else {
            println("\(word) -> \(checked)")
        }
    }

}
else {
    println("Usage: (Process.arguments[0]) <corpus_filename>")
}

It’s a continuum of course. Some would consider the code in this post unreadably terse. Personally I don’t think making code longer always makes it more clear. ↩

	Sorting Nibbles in S… on Sorting Nibbles in Swift
	Collection Data Stru… on Arrays, Linked Lists and …
	Swift化零为整：Reduce 详解… on Arrays, Linked Lists and …
	Writing A Generic St… on Generic Collections, SubSequen…
	Writing A Generic St… on Collection Indices, Slices, an…

	Sorting Nibbles in S… on Sorting Nibbles in Swift
	Collection Data Stru… on Arrays, Linked Lists and …
	Swift化零为整：Reduce 详解… on Arrays, Linked Lists and …
	Writing A Generic St… on Generic Collections, SubSequen…
	Writing A Generic St… on Collection Indices, Slices, an…

Airspeed Velocity

African or European Swift?

Month: May 2015

My Talk at Swift Summit

A Swift Spelling Corrector