A Swift String Tokenizer
Brace yourself, this is a long article (now you can at least see why I've been quiet for a week or so). You will also see what I can only describe as some unusual patterns; I have had to work around all manner of Swift compiler bugs to get this running.
The objective: to write an extensible (but simple) Swift tokeniser. There are great Objective-C frameworks like ParseKit, but this seemed like a fun thing to do in Swift, and I've never loved the way you build up your tokens and tokenisers in those frameworks.
The architecture is simple. There is a Token class (intended to be sub-classed; the base class has just a name and the characters it is made up of), a Tokenizing protocol for anything that can create Tokens from characters, a TokenizerState which can create more complex Tokens, and one final implementation of that, the Tokenizer, which manages all of the different states of tokenisation.
The Token Class
@objc class Token{
    let name : String
    var characters : String = ""

    init(name:String){
        self.name = name
    }

    init(name:String, withCharacters:String){
        self.name = name
        self.characters = withCharacters
    }

    func description() -> String {
        return "\(name) '\(characters)'"
    }

    @objc class ErrorToken : Token{
        let problem : String

        init(forCharacter:UnicodeScalar, problemDescription:String){
            problem = problemDescription
            super.init(name: "Error", withCharacters: "\(forCharacter)")
        }

        override func description() -> String {
            return super.description()+" - "+problem
        }
    }
}
These are what you'll get back out of the Tokenizer; the implementation is very simple. They have a name, and record the characters they capture. Note that the class includes a nested concrete sub-class; I've grown to quite like this pattern, as it signals that the sub-class is special. In this case it's a predefined Token type that others can use to report that there was an error, along with a description of the problem. You can follow this pattern with your own Token classes.
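As a hedged sketch of that pattern (the OperatorToken and its nested UnknownOperatorToken below are hypothetical, not part of the article's code), your own Token might look like this:

//Illustrative only: a custom Token that follows the nested sub-class pattern
@objc class OperatorToken : Token{
    init(character:UnicodeScalar){
        super.init(name: "Operator", withCharacters: "\(character)")
    }

    //The nested sub-class marks a special case, just as ErrorToken does
    @objc class UnknownOperatorToken : OperatorToken{
        override func description() -> String {
            return super.description()+" - unknown operator"
        }
    }
}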
Right, we'll need something to create Tokens!
The Tokenizing Protocol
To create a minimal class that can take a single character and return a token, just implement this protocol. It has two methods: if the implementation can tokenise a character, it should return true from canTokenize(); if it does, it will then be asked for the token with a call to tokenFor(). Simple.
@objc protocol Tokenizing{
    //Returns true if the supplied character can be tokenized by the implementing class
    func canTokenize(character:UnicodeScalar) -> Bool

    //Only called if the implementation returned true for canTokenize
    func tokenFor(character:UnicodeScalar) -> Token
}
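To make that concrete, here's a minimal hypothetical implementation; the FullStopTokenizer name is mine and isn't used elsewhere in the article:

//Illustrative only: the simplest possible Tokenizing implementation
class FullStopTokenizer : Tokenizing{
    func canTokenize(character:UnicodeScalar) -> Bool{
        return character == "."
    }

    func tokenFor(character:UnicodeScalar) -> Token{
        return Token(name: "FullStop", withCharacters: "\(character)")
    }
}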
We'll see the implementations actually used in this article later, but we also need to deal with parsing more complex tokens (we might want to pick out whole words, or deal with quoted strings, for example). For that we need something that captures some state.
TokenizerState
This could have been another protocol, but we bumped up against some Swift bugs, and in all honesty there are some things that every complex tokenizer will want to do.
class TokenizerState : Tokenizing{
    var tokenizers = Array<Tokenizing>()
    var tokens = Array<Token>()

    //Fetches the correct tokenizer for the given character
    func tokenizerFor(character:UnicodeScalar) -> Tokenizing?{
        for tokenizer in tokenizers{
            if tokenizer.canTokenize(character){
                return tokenizer
            }
        }
        return nil
    }

    //Determines if a token can be created by asking all
    //the tokenizers it has if they can create a token for the character
    func canTokenize(character:UnicodeScalar) -> Bool{
        return tokenizerFor(character) != nil
    }

    //Returns the correct token
    func tokenFor(character:UnicodeScalar) -> Token{
        if let tokenizer = tokenizerFor(character){
            //If it is a state, create a new context token for it
            if tokenizer is TokenizerState{
                return NewContextToken(newContext: tokenizer as TokenizerState)
            } else {
                return tokenizer.tokenFor(character)
            }
        }
        return Token.ErrorToken(forCharacter: character, problemDescription: "Failed to create token")
    }

    //Add a token to the stored state
    func append(token:Token) {
        tokens.append(token)
    }

    //Return all the tokens in the stored state, this
    //is the point to over-ride in a sub-class if you wish
    //to consolidate multiple tokens into one
    func flushTokens() -> Array<Token>? {
        let oldTokens = tokens
        tokens = Array<Token>()
        return oldTokens
    }
}
State is captured as an array of tokens, which are built up using the array of other Tokenizers. Those might be simple character tokenizers, or TokenizerState sub-classes themselves. Note that this class essentially just delegates the recognition of tokens to its known tokenizers, so you may still get a stream of character tokens, for example, but you could consolidate those in append() or flushTokens(). It's critical that any state is cleared after the tokens are flushed.
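As a hedged illustration of the append() route (the RunConsolidatingState class below is mine, not part of the original code), you could merge runs of Character tokens as they arrive:

//Illustrative only: merge runs of Character tokens as they are appended
class RunConsolidatingState : TokenizerState{
    override func append(token:Token) {
        //If the last stored token is also a Character token, fold the new characters into it
        if tokens.count > 0 {
            let lastToken = tokens[tokens.endIndex-1]
            if lastToken.name == "Character" && token.name == "Character" {
                lastToken.characters += token.characters
                return
            }
        }
        super.append(token)
    }
}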
At Last, The Tokenizer
class Tokenizer : TokenizerState{
    var stateStack = Stack<TokenizerState>()

    //Entry point for tokenization of a string, returning an
    //array of tokens
    func tokenize(string:String) -> Array<Token>{
        stateStack = Stack<TokenizerState>()
        stateStack.push(self)

        for character in string.unicodeScalars{
            tokenize(character)
            if stateStack.isEmpty(){
                return tokens
            }
        }

        fullyUnwind()

        return tokens
    }

    //Removes the current state from the stack, passing
    //its tokens to the next state on the stack
    func unwindState(){
        let state = stateStack.top()!
        stateStack.pop()

        if let newState = stateStack.top(){
            //If there is a new state, pass the tokens on to the state
            if let currentStateTokens = state.flushTokens(){
                for token in currentStateTokens{
                    newState.append(token)
                }
            }
        } else {
            //otherwise add them to mine
            if let currentStateTokens = state.flushTokens(){
                tokens.extend(currentStateTokens)
            }
        }
    }

    //If we hit an error, unwind all the way
    func fullyUnwind(){
        while !stateStack.isEmpty(){
            unwindState()
        }
    }

    func tokenize(character:UnicodeScalar){
        if let state = stateStack.top(){
            if state.canTokenize(character){
                let token = state.tokenFor(character)

                if let newContextToken = token as? NewContextToken{
                    stateStack.push(newContextToken.newContext)
                    tokenize(character)
                } else {
                    state.append(token)
                }
            } else {
                unwindState()
                tokenize(character)
            }
        } else {
            //If there's no state, we have fully unwound and could not tokenize
            append(Token.ErrorToken(forCharacter: character, problemDescription: "Unrecognized character"))
        }
    }
}
As you can see, it's just a stateful tokenizer itself, but it manages a more complex stack of states, enabling token scanning to enter new modes as different characters are encountered. Note that there is a special token a TokenizerState can return, NewContextToken.
@objc class NewContextToken : Token{
    let newContext : TokenizerState

    init(newContext:TokenizerState){
        self.newContext = TokenizerState() //Works around a compiler bug that causes a segmentation fault if newContext is assigned directly
        self.newContext = newContext
        super.init(name: "NewContextToken")
    }
}
This special token tells the Tokenizer that rather than adding it to the stream of tokens, it should push a new TokenizerState onto the stack.
Speaking of the stack, it's a very vanilla implementation:
class Stack<T>{
    var stack = Array<T>()

    func push(value: T){
        stack.append(value)
    }

    func pop() -> T?{
        if (stack.count > 0){
            return stack.removeLast()
        } else {
            return nil
        }
    }

    func isEmpty() -> Bool{
        return depth() == 0
    }

    func depth() -> Int{
        return stack.count
    }

    func top() -> T?{
        if stack.count == 0 {
            return nil
        }
        return stack[stack.endIndex-1]
    }

    func popAll(){
        stack = Array<T>()
    }
}
That's it; that's all we need to start building something useful.
Example 1: The Useless Tokenizer
//We'll need to easily print out the tokens we get
func printTokens(label:String, tokenizer:Tokenizer){
    println(label+" Tokens:")
    for token in tokenizer.tokens{
        println("\t"+token.description())
    }
    println()
}

//And something to test with
let parsingTest = "Short 10 string"

//If we define nothing else, we just get an invalid token straight away
let uselessTokenizer = Tokenizer()
uselessTokenizer.tokenize(parsingTest)
printTokens("Useless", uselessTokenizer)
If we create an instance of Tokenizer with nothing else, it will immediately fail with an Error token:
Useless Tokens:
    Error 'S' - Unrecognized character
Not a huge step forward, but we can see that it's trying! We will increase the complexity until we have a meaningful tokenisation of our parsing test string.
Recognising Characters
The first thing we'll want to do is recognise certain classes of characters (letters, digits and white space in the case of our test). So let's build a class that makes that easy:
class CharacterTokenizer : Tokenizing{
    let matchedCharacters : String

    init(character : Character, tokenName : String){
        matchedCharacters = "\(character)"
    }

    init(allCharacters:String, tokenName : String){
        matchedCharacters = allCharacters
    }

    func canTokenize(character: UnicodeScalar) -> Bool {
        for matches in matchedCharacters.unicodeScalars{
            if matches == character {
                return true
            }
        }
        return false
    }

    func tokenFor(character: UnicodeScalar) -> Token {
        return CharacterToken(character: character)
    }

    class CharacterToken : Token{
        init(character:UnicodeScalar){
            super.init(name: "Character", withCharacters: "\(character)")
        }
    }
}
It's pretty simple: give it a string of valid characters for the token, and it will return a character token for any of them. We can now create tokenisers for our different classes of characters:
let digitTokenizer = CharacterTokenizer(allCharacters: "0123456789", tokenName: "digit")
let whiteSpaceTokenizer = CharacterTokenizer(allCharacters: " \t", tokenName: "whitespace") //You'll probably want to add \n and \r
let englishLetterTokenizer = CharacterTokenizer(allCharacters: "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ", tokenName: "letter")
We can now feed these to the tokenizer and we should get a little further:
let characterTokenizer = Tokenizer()
characterTokenizer.tokenizers = [ digitTokenizer, whiteSpaceTokenizer, englishLetterTokenizer ]
characterTokenizer.tokenize(parsingTest)
printTokens("Character", characterTokenizer)
This yields the rather verbose output:
Character Tokens:
    Character 'S'
    Character 'h'
    Character 'o'
    Character 'r'
    Character 't'
    Character ' '
    Character '1'
    Character '0'
    Character ' '
    Character 's'
    Character 't'
    Character 'r'
    Character 'i'
    Character 'n'
    Character 'g'
OK... that's useful, but really the tokens we would probably expect are 'Short', '10' and 'string'. We'd probably also expect '10' to be an actual integer, not just a string.
Token Consolidation
What we really need to do is look for sequences of the different character classes. For that we need a sub-class of TokenizerState. Here it is with an implementation for each of the different character classes.
class CharacterSequenceTokenGenerator : TokenizerState{
    //Flushes the token stream, and creates a string with all
    //of the token stream characters in
    func consolidateCharacters() -> String {
        var string = ""
        if let allTokens = super.flushTokens(){
            for token in allTokens{
                string += token.characters
            }
            return string
        } else {
            return ""
        }
    }
}

//One for English letters
class WordTokenGenerator : CharacterSequenceTokenGenerator{
    init(){
        super.init()
        tokenizers = [englishLetterTokenizer]
    }

    override func flushTokens() -> Array<Token>? {
        var wordString = consolidateCharacters()
        return [Token(name: "Word", withCharacters: wordString)]
    }
}

//One for Integers
class IntegerTokenGenerator : CharacterSequenceTokenGenerator{
    init(){
        super.init()
        tokenizers = [digitTokenizer]
    }

    override func flushTokens() -> Array<Token>? {
        var integerString = consolidateCharacters()
        return [IntegerToken(number: integerString)]
    }

    class IntegerToken : Token{
        let intValue : Int

        init(number : String){
            intValue = number.toInt()!
            super.init(name: "Integer", withCharacters: number)
        }
    }
}

//One for white space
class WhiteSpaceTokenGenerator : CharacterSequenceTokenGenerator{
    init(){
        super.init()
        tokenizers = [whiteSpaceTokenizer]
    }

    override func flushTokens() -> Array<Token>? {
        let spaceString = consolidateCharacters()
        return [Token(name:"Whitespace", withCharacters:spaceString)]
    }
}
Note that the pattern is very simple: each character tokenizer is wrapped in some state, and then, when we're asked to output our tokens (that is, when we hit a character that can't be tokenised), all of the captured tokens are consolidated into a single one. The same shape works for other character classes, as in the sketch below.
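For instance, a hypothetical generator for runs of hexadecimal digits could follow exactly the same pattern; the hexDigitTokenizer and HexTokenGenerator names are mine, not part of the original code:

//Illustrative only: a character tokenizer and sequence generator for hexadecimal numbers
let hexDigitTokenizer = CharacterTokenizer(allCharacters: "0123456789abcdefABCDEF", tokenName: "hex-digit")

class HexTokenGenerator : CharacterSequenceTokenGenerator{
    init(){
        super.init()
        tokenizers = [hexDigitTokenizer]
    }

    override func flushTokens() -> Array<Token>? {
        let hexString = consolidateCharacters()
        return [Token(name: "Hex", withCharacters: hexString)]
    }
}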
We're not quite done; we need one final class to wrap all of these different states into one (the root state for the tokenizer, if you like).
class SentanceTokenGenerator : TokenizerState{
    init(){
        super.init()
        tokenizers = [
            WhiteSpaceTokenGenerator(),
            IntegerTokenGenerator(),
            WordTokenGenerator(),
        ]
    }
}
It just pulls all of these together. We can now use it.
let sentanceTokenizer = Tokenizer()
sentanceTokenizer.tokenizers = [SentanceTokenGenerator()]
sentanceTokenizer.tokenize(parsingTest)
printTokens("Sentance", sentanceTokenizer)
And finally we get some output close to what we might want:
Sentance Tokens:
    Word 'Short'
    Whitespace ' '
    Integer '10'
    Whitespace ' '
    Word 'string'
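As an aside, IntegerTokenGenerator produced a real IntegerToken, so a hedged sketch (mine, not from the article) of pulling the numeric value back out might look like this:

//Illustrative only: retrieve the parsed integer value from the token stream
for token in sentanceTokenizer.tokens{
    if let integerToken = token as? IntegerTokenGenerator.IntegerToken{
        println("Found the integer \(integerToken.intValue)")
    }
}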
Excellent! As you can hopefully see, if you build these up in simple increments, starting with the most basic, it's very easy to construct a much more advanced parser. Let's finish off with something much tougher!
Complex Token Parsing
We might want to create a tokenizer for syntax-highlighting code, and therefore want to capture quoted strings. Sounds easy initially... that's just a state that has to have an open and close quote. However, we'll also need to deal with escaped characters in the string (including the \"). That's quite easy:
class EscapedCharacterTokenGenerator : CharacterSequenceTokenGenerator{
    let escapedCharacters = CharacterTokenizer(allCharacters: "\"rt0n\\", tokenName: "control-code")

    @objc //Workaround Swift bug, has to be declared with @objc attribute otherwise first implementation is used
    override func canTokenize(character: UnicodeScalar) -> Bool {
        switch tokens.count {
        case 0:
            return character == "\\"
        case 1:
            return escapedCharacters.canTokenize(character)
        default:
            return false
        }
    }

    @objc //Workaround Swift bug, has to be declared with @objc attribute otherwise first implementation is used
    override func tokenFor(character: UnicodeScalar) -> Token {
        return CharacterTokenizer.CharacterToken(character: character)
    }

    @objc //Workaround Swift bug, has to be declared with @objc attribute otherwise first implementation is used
    override func flushTokens() -> Array<Token>? {
        let escapeCode = consolidateCharacters()
        return [Token(name:"EscapeCode", withCharacters:escapeCode)]
    }
}
If it encounters a \ it becomes the new state and looks for valid escaped characters. When flushed, it turns these into a single EscapeCode token.
Right, now we just need something very similar for the quoted string itself:
class QuoteTokenGenerator : CharacterSequenceTokenGenerator{
    let QUOTE_DELIMETER = "quote-delimeter"

    init(){
        super.init()
        tokenizers = [
            EscapedCharacterTokenGenerator(),
        ]
    }

    func openedAndClosed() -> Bool{
        var delimeterCount = 0
        for token in tokens{
            if token.name == QUOTE_DELIMETER{
                delimeterCount++
            }
        }
        return delimeterCount > 1
    }

    @objc //Workaround Swift bug, has to be declared with @objc attribute otherwise first implementation is used
    override func canTokenize(character: UnicodeScalar) -> Bool {
        if tokens.count == 0 {
            return character == "\""
        }

        //You may want to look for newlines etc and reject, but in this case we will keep it simple
        return !openedAndClosed()
    }

    @objc //Workaround Swift bug, has to be declared with @objc attribute otherwise first implementation is used
    override func tokenFor(character: UnicodeScalar) -> Token {
        if tokens.count == 0 {
            return Token(name: QUOTE_DELIMETER, withCharacters: "\(character)")
        }

        //Let the super class deal with any escaped characters
        if super.canTokenize(character){
            return super.tokenFor(character)
        }

        //We can accept any character now, but we want to give a quote a special name
        if character == "\"" {
            return Token(name:QUOTE_DELIMETER, withCharacters: "\(character)")
        } else {
            return CharacterTokenizer.CharacterToken(character: character)
        }
    }

    override func flushTokens() -> Array<Token>? {
        var quotedString = consolidateCharacters()
        return [Token(name: "Quote", withCharacters: quotedString)]
    }
}
Again, it looks for the opening " and then starts accepting every character until it has a closing quote. Note that it uses the EscapedCharacterTokenGenerator we defined above.
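If you want to convince yourself of that in isolation, a quick hedged sketch might look like this (the quoteOnlyTokenizer name and test string are mine, not from the article):

//Illustrative only: drive just the quote state on its own
let quoteOnlyTokenizer = Tokenizer()
quoteOnlyTokenizer.tokenizers = [QuoteTokenGenerator()]
quoteOnlyTokenizer.tokenize("\"Hello \\\"there\\\"\"")
printTokens("Quote Only", quoteOnlyTokenizer)
//Assuming the behaviour described above, this should yield a single Quote token
//containing the whole quoted string, e.g. Quote '"Hello \"there\""'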
Again, we want to wrap all our tokenisers up (we could just do this when we create the Tokenizer, but I like to bundle them up in a single higher-level state; it makes re-usability a bit clearer):
class AdvancedSentanceTokenGenerator : TokenizerState{
    init(){
        super.init()
        tokenizers = [
            QuoteTokenGenerator(),
            WhiteSpaceTokenGenerator(),
            CharacterTokenizer(allCharacters: "!,:'?`./()&%$£@€#;", tokenName: "Punctuation"),
            IntegerTokenGenerator(),
            WordTokenGenerator(),
        ]
    }
}
Note that we've added a few more tokenizers in there, so we'll need a nasty string to test with!
let tougherTest = "Nasty example with 1 \"Great \\t 🐨 \\\"Nested\\\" quote\"!"

let advancedSentanceTokenizer = Tokenizer()
advancedSentanceTokenizer.tokenizers = [AdvancedSentanceTokenGenerator()]
advancedSentanceTokenizer.tokenize(tougherTest)
printTokens("Advanced Sentance", advancedSentanceTokenizer)
Plenty of escapes and Koalas in there!
Advanced Sentance Tokens:
    Word 'Nasty'
    Whitespace ' '
    Word 'example'
    Whitespace ' '
    Word 'with'
    Whitespace ' '
    Integer '1'
    Whitespace ' '
    Quote '"Great \t 🐨 \"Nested\" quote"'
    Character '!'
And there we have it: some genuinely useful output! We'll be including this in the OysterKit framework when Yosemite and iOS 8 ship, and we'll tidy up the architecture before then. For now you can copy and paste it all into a single playground and experiment.
We'd love to hear your feedback!