Tuesday, 19 August 2014

Bonus Workshop Questions - Week 2

Getting caught up! Sorry about the delay.

6. Write a regular expression to solve the following problem:
(e) Find long words whose letters are in alphabetical order.

7. Practice using awk, or alternatively, bash-scripting with grep and sed. For example, write a program that finds all of the email addresses in enron-headers.txt. (You might contrast your observations with enron-emails.txt.) How many words are there in Fathers and Sons by Ivan Turgenev (turgenev.txt)? How many instances are there of the same word repeated twice? How many words in the dictionary (words.txt; it’s actually an inflectional lexicon) have their letters in
alphabetical order? How many of the nine letter words (nines.txt)? [All of these are available on the CIS servers, i.e. nutmeg and dimefox, connection instructions on the LMS.]

Solutions below the fold.


6. (e) First of all, note that the problem formulation is too vague to be formally solved - i.e. what does "long" mean? For the purposes of answering this question, let's say, words of at least ten letters.
Also, I'm going to frame this answer by saying that we're searching through a file where each line contains a single word only (a "dictionary", for example, words.txt).

Okay, let me begin by saying that the problem as written can't be (practically) solved - the main problem is with "a regular expression". Why not? Well, let's take a look:
There are two things we want to do: find long words, and find words whose letters are in alphabetical order. Writing a regular expression to find long words is pretty easy:
/^.{10,}$/ (or /^.{9}.+$/ if the unbounded {} expression is forbidden)
Alphabetical order is a bit harder. (We'll assume according to the English 26-letter alphabet and ordering.) Let's think of a couple of examples of words whose letters are in alphabetical order: act, film, opt; whereas for cat, file, oct, they aren't. After thinking about this for a bit, we see that, for a word whose letter are in alphabetical order, all of the "a"s (if there are any) have to come before all of the other letters, then all of the "b"s, and so on to the "z"s:
/^a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*$/
If we can do this with two regular expressions, our problem is solved: we run the words of our dictionary through one of the regular expressions, and whatever is left over is run through the other regular expression. However, we want to (try to) combine these into a single regular expression...
Our first attempt might be to just add the {10,} to the end of the above expression:
/^a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*{10,}$/
No good - that just looks for the pattern z* 10 or more times, which does nothing. What about by grouping the alphabetical part?
/^(a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*){10,}$/
Still, no - what this means is: "look for a sequence whose letters are in alphabetical order, and do that 10 times". So, something like "parliament" matches, because the "p" is in alpha order, and the "a" is in alpha order, and so on for all ten letters. (In fact, this is even worse, because all of those *s might match 0 repetitions, so this will actually match any string composed entirely of letters.)
A common suggestion I get, and a better attempt, is to remember the first character that we've seen (by storing it in register 1), and then only look for characters that are lexically later by using some kind of set notation like [\1-z], so a regular expression like:
/^([a-z])([\1-z])([\2-z])([\3-z])([\4-z])([\5-z])([\6-z])([\7-z])([\8-z])([\9-z])$/
At first glance, it kinda might work, but in fact it has three problems.
  • Probably the most fundamental is the fact that the set operator doesn't work this way, for boring technical reasons. \1 in a set means the character "1" (the escape means really really 1), so this actually describes a set where the ASCII code of the character is between 49 ("1") and 122 ("z"), which isn't what we want at all.
  • Even if we had some kind of super regex interpreter that allowed us to interpret memos within sets, remember that we have a fixed set of registers. What I've written there is just enough (with the common 9 registers) for 10 letters, but what if we wanted 11? or 12? or 20?
  • And then, even if the memos-in-set worked (and they don't) and even if we had an unlimited number of register (and we don't), this method still doesn't work, because there's no way extend it for strings of arbitrarily long length. Observe that the pattern above will only match exactly 10 characters. What if I wanted to allow 11 character strings? Well, I could write the expression above, followed by an "or" ("|") followed by that again with a "([\A-z])" for the eleventh character. And so on, with "([\B-z])" for 12, "([\C-z])" for 13, etc. ... but there's no way to extend it to strings of arbitrary length.
So, just before we give up and go home, I will observe that there is an actual single regular expression that will solve this problem. However, it isn't a practical solution. Let me explain.
When I first used to teach this problem, I used to say that this question defines a set of strings that isn't a regular language, which is false (as a student helpfully pointed out one day). Why? Well, there are actually a finite number of patterns (e.g. aaaaaaaaaa, aaaaaaaaab, aaaaaaaaaac, ..., abcdefghij, etc.) which are exactly 10 letters long, and whose letters are in alphabetical order. I could just write down each of these patterns, put "+"s between each of the letters (e.g. /^a+b+c+d+e+f+g+h+i+j+$/ etc.) and then just "or" ("|") between them all.
Why isn't this a practical solution? Well, there are too many patterns to write by hand, as you'll quickly see, but I could write a computer program that writes the regular expression to define the strings that I want to look for - there'll be 10 nested for loops, but it should work.
How many individual 10 letter strings are there whose letters are in alphabetical order? Well, when the 10 characters are all distinct, 30626189193600. (This is C(26,10). Why? Also, accounting for repeated characters makes this number much larger.) So, in our regular expression, each of these is going to contribute roughly 21 characters (each character has a "+", and then the "|"). So, all in all, we need about 6.3*10^14B - 630TB - just to store this expression. I'm guessing the kernel isn't going to be happy about this.

So, just before we give up and go home, I'm going to point to an extension to the regular expression module that does, in fact, allow us to practically solve the problem: the lookahead operator (?=...). It's not strictly regular, but it's often available; if we're allowed to use it, the answer suddenly becomes much simpler:
/^(?=.{10,})a*b*c*...x*y*z*$/
If you're interested, I encourage you to read more about the (negative)?look(ahead|behind) operators.

7. It depends a lot on how you define "words".
wc tells me that there are 74242 "words" in Fathers and Sons.
The same "word" is repeated twice 9-11 times, depending on how stringently you define word boundaries.
In words.txt, there are 547 words whose letters are in alphabetical order.
In nines.txt, there are 94 "words" whose letters are in alphabetical order. (But none of them are really "words" in a meaningful sense, e.g. the file contains the words "cefilnsvx" and "hooooooot".)

No comments:

Post a Comment