How do I turn a string like ‘forty two’ into something I can manipulate as a number? String has the #to_i method but that only works on numerals like ‘3’. Numerouno is an English natural language parser for numbers.
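To make the #to_i limitation concrete, here is a quick sketch using nothing but core Ruby. It only shows what String#to_i does with digits versus words; nothing here comes from Numerouno.

```ruby
# String#to_i reads leading digits and gives up on anything else.
digits  = '3'.to_i          # => 3
words   = 'forty two'.to_i  # => 0, no leading digits to read
leading = '37 hops'.to_i    # => 37, trailing text is ignored
```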
I hit this problem a few times in the past while writing Cucumber features that contained textual descriptions of numbers. Being good little BDD elves, we had worked very hard to keep the feature language true to that used by the customer. We were already using the awesome Chronic for parsing descriptions of dates and times, which went a long way towards preserving that language.
Unfortunately, describing numbers still seemed a bit clunky. We had steps like:
When I hop 37 times
The above is not ugly by any means, more mildly irritating. The main thing is that this is not how I would write the sentence. Maybe you find ‘37’ more concise, but to me it sticks out like a sore digit (ha ha) in an otherwise natural looking sentence. I want to write something like:
When I hop thirty seven times
And indeed now I can! Hurray hurrah!
require 'numerouno'
'thirty six billion, three hundred and ninety two'.as_number # => 36000000392
How does it work?
The problem of parsing English number phrases was an interesting one and it took me a while to model it in a way that wasn’t totally confusing. Basically the current approach goes a little like this:
Identify individual numbers in the string
The first thing is to turn ‘thirty six billion, three hundred and ninety two’ into something we can manipulate a little more easily: [30, 6, 1000000000, 300, 90, 2]. Simple regex matching is used to identify individual numbers.
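A rough sketch of what that regex step could look like follows. It is not Numerouno's actual code: the WORD_VALUES table, the NUMBER_WORD pattern and the tokenize method are names I made up for illustration, and it assumes one token per number word. Under that assumption ‘three hundred’ comes out as 3, 100 and is only merged into 300 by the later combination step.

```ruby
# Hypothetical lookup table for this sketch; Numerouno's own table may differ.
WORD_VALUES = {
  'one' => 1, 'two' => 2, 'three' => 3, 'four' => 4, 'five' => 5,
  'six' => 6, 'seven' => 7, 'eight' => 8, 'nine' => 9, 'ten' => 10,
  'eleven' => 11, 'twelve' => 12, 'thirteen' => 13, 'fourteen' => 14,
  'fifteen' => 15, 'sixteen' => 16, 'seventeen' => 17, 'eighteen' => 18,
  'nineteen' => 19, 'twenty' => 20, 'thirty' => 30, 'forty' => 40,
  'fifty' => 50, 'sixty' => 60, 'seventy' => 70, 'eighty' => 80,
  'ninety' => 90, 'hundred' => 100, 'thousand' => 1_000,
  'million' => 1_000_000, 'billion' => 1_000_000_000,
  'trillion' => 1_000_000_000_000
}.freeze

# Match whole words only, so 'six' is not found inside 'sixty'
# and filler words such as 'and' are skipped.
NUMBER_WORD = /\b(?:#{Regexp.union(WORD_VALUES.keys).source})\b/i

# Turn a phrase into a flat list of the numbers it mentions, in order.
def tokenize(phrase)
  phrase.scan(NUMBER_WORD).map { |word| WORD_VALUES.fetch(word.downcase) }
end

tokens = tokenize('thirty six billion, three hundred and ninety two')
# => [30, 6, 1000000000, 3, 100, 90, 2]
```

Words the table does not know, commas and ‘and’ simply fall through the scan, which keeps the matching itself very simple.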
The English language has certain rules for interpreting numbers in a sentence. The rules most often revolve around powers of ten: one hundred, one thousand, one million and so on. Once you hit one of these numbers you can start applying rules to the numbers on either side of it to mash them into a combined figure.
The rules typically have you multiply the power of ten by the number to its left and then add the number to its right. For example ‘five thousand and one’ => [5, 1000, 1] => 5 * 1000 + 1 => 5001.
Combination is done in several passes to ensure that lower powers of ten are combined properly before attempting to combine them with higher ones. Once all combination passes have been made a final step sums up the resulting list of combined numbers for the actual figure.
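The passes described above can be sketched roughly like this. It is a simplified illustration under my own assumptions, not the gem's real implementation: POWERS, combine_pass and combine are hypothetical names, the input is a list with one integer per number word, and each pass sums every smaller neighbour on either side of the power of ten rather than applying more careful rules.

```ruby
# Powers of ten handled by this sketch, lowest first so that
# 'three hundred' is settled before 'thousand' looks at it.
POWERS = [100, 1_000, 1_000_000, 1_000_000_000, 1_000_000_000_000].freeze

# One pass for a single power of ten: multiply it by the smaller numbers
# on its left and add the smaller numbers on its right.
def combine_pass(tokens, power)
  out = []
  i = 0
  while i < tokens.length
    if tokens[i] == power
      left = 0
      left += out.pop while !out.empty? && out.last < power
      left = 1 if left.zero? # a bare 'hundred' counts as one hundred
      right = 0
      i += 1
      while i < tokens.length && tokens[i] < power
        right += tokens[i]
        i += 1
      end
      out << (left * power + right)
    else
      out << tokens[i]
      i += 1
    end
  end
  out
end

# Run every pass in order, then sum whatever is left in the list.
def combine(tokens)
  POWERS.reduce(tokens) { |list, power| combine_pass(list, power) }.sum
end

total = combine([30, 6, 1_000_000_000, 3, 100, 90, 2]) # => 36000000392
```

The ordering is what matters here. With [5, 1000, 1, 100] (‘five thousand one hundred’) the hundreds pass first turns 1, 100 into 100, so the thousands pass sees 5 * 1000 + 100 and ends up with 5100.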
At the moment only whole numbers up to those in the trillions are supported. The following things are not:
- anything bigger than nine hundred and ninety nine trillion, nine hundred and ninety nine billion, nine hundred and ninety nine million, nine hundred and ninety nine thousand, nine hundred and ninety nine
- fractions, be they decimal or otherwise
- other variations of numbers like ‘third’, ‘thirteenth’
- slang like ‘K’, ‘grand’
- any language except English. The rules for interpreting numbers are specific to the English language.
Yes, ironically Numerouno does not recognise ‘numero uno’.
If in doubt, try it out. Rhymes.