Regular expressions and Solaris 8
Chuck Yerkes
chuck+baylisa at snew.com
Wed Jul 21 17:41:49 PDT 2004
I'll point to the egrep/grep in /usr/xpg4/bin/
and perhaps /usr/sfw/bin/ (Sun freeware).
Sun's grep and diff packages are lovely and should be kept.
In a museum. On any Solaris machine I control, these are two
packages that quickly get replaced with something from the past
decade.
(Note also that Sun's egrep is about 10x faster than Sun's grep,
so I just "alias grep=egrep".)
So I'd just offer that trying a proper grep (BSD's, if it will
compile, or GNU grep, which will compile) might make you happier.
Now anyone here work at Sun wanna talk to them about "doing an
Apple" and perhaps taking a lot of the userland apps from BSD?
whoish, diff, grep, REAL z-utils (zgrep, zcat). OpenBSD has
a BSD-licensed diff and non-FSF gzip and friends (libz had a BSD
ok license and all the routines were there waiting...).
I tire of working around Solaris' tools.
But have I mentioned that NetBSD's www.pkgsrc.org stuff makes
me happier on Solaris? (and AIX and MacOS and...)
Quoting David Wolfskill (david at catwhisker.org):
> As (some of) you may recall, in my role as postmaster at baylisa.org I make
> use of a couple of different approaches to try to squelch spam at
> BayLISA's MTA.
>
> One of those approaches is a content filter that uses regular
> expressions. The bulk of the specification I use for it is intended to
> look for certain "spamvertized" domains. (The census of these is now at
> about 3975.)
>
> Thus, a typical regex deployed for this use looked like
>
> `([^-0-9a-z]|([=%]2[ef]))2LD(=2E|\.)TLD`ie
>
> where:
> * the ` are the delimiters -- I didn't use / because sometimes I specify
> more of a URL, and they often have / characters in them.
>
> * "2LD" is the second-level domain
>
> * "TLD" is the top-level domain
>
> * "ie" (after the closing delimiter) denotes case-insensitive matching
> and extended regular expression syntax.
>
>
> Well, this morning, I received a spam that mentioned a known
> spamvertized domain. On looking at the spam a bit more closely, I saw
> that the domain name in question was left-anchored on the line; thus,
> the above regex would not match (because it's looking for some sort of
> delimiter to the left of the domain name).
>
> So I poked around in Jeffrey Friedl's _Mastering Regular Expressions_
> and found that the construct "\<" may be used to serve as a "left
> word-anchor" ... in some regular expression implementations.
>
> I then tried using "egrep" on one of my FreeBSD boxen (running the same
> flavor of FreeBSD as my home firewall/MTA) and found that a regex of the
> form
>
> `\<2LD(=2E|\.)TLD`ie
>
> fed to egrep appeared to work.
>
> Then I got a little more adventurous: some spammers like to use encoding
> constructs for the URLs; I tried
>
> `(\<|([=%]2[ef]))2LD(=2E|\.)TLD`ie
>
> and that appeared to work very nicely.
>
> (The next step, assuming all works OK, is to use
>
> `(\<|([=%]2[ef]))2LD(=2E|\.)TLD\>`ie
>
> though that's not really foolproof.)
>
> However, when I tried the same egrep test on the BayLISA machine, it
> failed to find the lines in question -- so I thought that maybe Solaris
> 8 didn't have support for \< and \> in its regex library.
>
> But the regexp(5) man page seems to indicate that the construct is
> recognized.
>
> Anyone have any clue whether this ought to work or not? (Note that the
> application is a "milter," not egrep (per se).)
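
For anyone who wants to poke at the two forms of the pattern away from
either machine, here is a rough sketch in Python. It is not David's
milter and not egrep: "example" and "com" are made-up stand-ins for 2LD
and TLD, and since Python's re module has no \< or \>, \b (a word
boundary on either side) is used as an approximation of both.

```python
import re

# Hypothetical stand-ins for the "2LD" and "TLD" placeholders in the post.
SLD, TLD = "example", "com"

# Original form: needs a delimiter character, or an encoded "." or "/",
# immediately to the left of the domain name.
delimited = re.compile(
    r"([^-0-9a-z]|([=%]2[ef]))" + SLD + r"(=2E|\.)" + TLD,
    re.IGNORECASE,
)

# Revised form: \b is an assumption standing in for \< (left) and \> (right).
anchored = re.compile(
    r"(\b|([=%]2[ef]))" + SLD + r"(=2E|\.)" + TLD + r"\b",
    re.IGNORECASE,
)

samples = [
    "example.com/offer",             # domain name at the start of the line
    "visit http://example.com now",  # delimiter to the left
    "www%2Eexample=2Ecom",           # encoded dots
    "counterexample.com",            # a different 2LD
]

# Map each sample to (matched by original form, matched by revised form).
results = {
    s: (bool(delimited.search(s)), bool(anchored.search(s))) for s in samples
}

for line, (old, new) in results.items():
    print(f"{line!r}: delimited={old} anchored={new}")
```

The first sample is the case described in the quoted message: nothing
precedes the domain name, so the delimiter-based form has no character
to consume, while the word-anchor form can still match. Whether a given
regex library accepts \< and \> at all is a separate question from this
sketch.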