Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

you should be able to construct basic regular expressions to match things like email addresses

Actually, writing a correct (per RFC) regex to recognize email addresses is far from simple.



You can say that again. Here's a Perl Regex to validate according to RFC 822: http://ex-parrot.com/~pdw/Mail-RFC822-Address.html. It's well over a page long.

The problem is that email validation is really better suited to being done by a push-down automata (rather than a finite state machine, which is what true regexes are). For better or for worse, perl regular expressions aren't strictly finite state machines though - so you can kind of extend them beyond what is sane.

P.S. Here's a ridiculously cool regex that can be used to determine if a number is prime: /^1?$|^(11+?)\1+$/

If you've got a few hours (a few seconds for cperciva ;-)) to waste, try and figure it out :-D


You can watch that prime-testing regex in action. Here it is matching 49, showing it is not-prime:

http://regex.powertoy.org/?pat=/%5E1%3F%24%7C%5E%2811+%3F%29...

And here failing against 47, showing it is prime:

http://regex.powertoy.org/?pat=/%5E1%3F%24%7C%5E%2811+%3F%29...


I'm not sure how this ( /^1?$|^(11+?)\1+$/ ) would match 3, 5, 7, etc... I'm no regex expert, maybe I'm missing something?


well, you're right - there are some more details. it needs a bit of wrapper code around it.

It actually tests if n is composite, when presented with a pattern of n ones. E.g., '111' doesn't match so it is prime, and '1111' does match so it's composite.


Wow, this is really amazing (to me at least because I had not seen or thought of this before).

[ Warning: Spoiler! :) ]

It's in base 1 (tally system).

The first part up to the "|" matches the case where number 1 is the whole string, covering the fact of 1 not being considered prime. It also matches no string, I guess to mean 0.

Next part matches a captured string consisting of 2 or more ones, and the captured string must be repeated at least one time. That is, 2+ repetitions of chunks of 2+ ones.

This much is obvious, but it wasn't until I wrote it out that I realized what it is doing:

multiples of 2

11+11 or 11+11+11 or 11+11+11+11 or ... = 4, 6, 8, 10, 12, ...

multiples of 3

111+111 or 111+111+111 or 111+111+111+111 or ... = 6, 9, 12, ...

multiples of 4

1111+1111 or 1111+1111+1111 or ... = 8, 12, 16, 20, ...

multiples of 5

11111+11111 or 11111+11111+11111 or ... = 10, 15, 20, 25, ...

multiples of n

which leaves only primes remaining.

Side note: In Perl, executing it this way:

$nstr = 1 x $ARGV[0]; print ($nstr =~ m/^1?$|^(11+?)\1+$/ ? "composite\n" : "prime\n");

The largest prime I can find without a seg fault is 37397

And the largest composite I can find without a seg fault is 37399


Hint: don't express the number in base 10


Indeed: technically, it's impossible. Email addresses, per RFC 822, are not a regular language -- though it's doable in Perl because Perl "regular expressions" are unapologetically more powerful than real regular expressions.

RFC 822, section 3.4.3, defines comments in email addresses to be surrounded by matching parentheses, and which nest -- a classic example of a language which is not regular.

Chalk this up to "stupid technicalities I learned one day when I decided to write a regex to match RFC 822 perfectly". It turned out to be a fun but useless exercise, because most email programs don't accept half the crap that's in RFC 822 anyway.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: