# Better Regex

#tech

## Rationale

I *love* regular expressions. They're so useful! I use them every day. Perhaps due to living in the shell, it seems like *most* of what I do on a computer is manipulate strings. Support for regex is omnipresent, and even a shitty implementation like the one in nano can handle 90% of my needs.

I also *hate* regular expressions. They're notoriously frustrating. Longer patterns especially are difficult to read and littered with backslash escapes. Everything supports them, sure, but they all use a slightly different flavor; there's always a little trial-and-error involved at the beginning to figure out what features are supported by whatever engine you find yourself using.

I have neither the skill nor the inclination to write a new engine for regexes, but if I did, here are some ideas I have to make them less painful. Of course, if they were ever adopted, they would just constitute one more slight variation to trip you up.

## Periods Suck

Nearly *every* string you will ever parse contains a dot: every properly formed sentence, (almost) every filename, and every URL. Most code will have it too. Do I *ever* remember to escape it? No, I do not. Usually it doesn't matter, but relying on luck is almost as bad as relying on memory. Worse, because an unescaped dot seemingly works fine, you're unlikely to notice your mistake.

If we must use a tiny, common character, the comma would be a much better option. But we have an even better ASCII character at our disposal: the asterisk. Using a star as a wildcard is intuitive to everyone, since it's used in globs and boolean search and even everyday natural language.

You might object that it makes more sense as the zero-or-more quantifier because that's closer to how it works in those other contexts: the star can stand for anything or nothing. But the actual regex equivalent to a wildcard `*` is `.*`, not `*`.
That's already sufficiently different that you can't really think of the wildcard star and the zero-or-more quantifier as expressing the same thing without tripping yourself up.

We don't need a replacement for the zero-or-more quantifier, either. We can express the same thing as `.*` with `(.?)+` (or `(.+)?`; they're subtly different but probably interchangeable?). We could simplify that further in our hypothetical regex by dropping the parentheses. Putting it all together, instead of writing `.*`, we would write `*?+`. It's one more character, sure, but the meaning is straightforward.

This syntax would unfortunately clash with existing modifiers that set the matching mode, but I tend to think that should be changed anyway. `.?+` is possessive, `.+?` is lazy. Better hope you got them in the right order! Is possessive even useful for anything? I only ever use lazy, except when I mix up the order.

## Write with a Lisp

Everything is so much easier to read when it's grouped with parentheses. You can do this already, but it automatically creates a capture group. You have to write the unwieldy `(?:string)` if you want to make a group without capturing it. Groups should be non-capturing by default and only capture when you tell them to.

Aside from possible performance issues, capturing by default discourages you from making groups when you have to capture some of them, because you have to count the groups to get to the one you want. If you add or delete a group, you have to renumber your references. Capture groups shouldn't even be numbered at all; they should be named. Something like `({name}:string)`, maybe, then write references in template style like `{name}` or `${name}`.

In fact, parentheses should be mandatory for matching a string: `foo|bar` should match `fobar` and `fooar` but not `foo` or `bar`. If you want it to behave the old way, write `(foo)|(bar)`.

## Negate All the Things

I often wish I could negate a group in the same way you can negate a character class.
`[^f]` matches any character except `f`; `(^fuck)` should match any string except `fuck`. There are some wonderful engines out there that implement something close to this as "negative lookahead" with `(?!string)`. All of the lookaround features are very cool, but the only one I ever want is negative lookahead, and I'd prefer to use the same negation character for everything.

There is a subtle issue here in that lookarounds don't consume any characters, whereas a normal group does: `fuck(?!shit)ass` matches `fuckass` but not `fuckcuntass` (or `fuckshitass`). It would be nice if we could append a question mark to get something more like lookahead behavior: `fuck(^shit)ass` would match `fuckcuntass` but not `fuckass` (or `fuckshitass`), whereas `fuck(^shit)?ass` would match both `fuckass` and `fuckcuntass` (but not `fuckshitass`). Maybe we could do something ridiculous like `fuck((^shit)?)?ass` to match the behavior of the negative lookahead. Or just perform a negative lookahead when invoked with a `?`. Or just keep the normal lookarounds.

## Literally

We really need to be able to write string literals to save us from escape hell. This is another feature generalized from character classes: I much prefer writing `[*]` to `\*`, but if I wrote `[f*$k]` I would match `f`, `*`, `$`, and `k` but not `f*$k`. I could fix the pattern by writing `f[*][$]k`, but that sucks worse than `f\*\$k`. (Exactly two characters worse.)

Just some backticks would do the trick. I think some implementations default to interpreting stuff in groups as string literals, but that's one of those surprising gotchas; it should be a different character.
Then I could just write:

```
`f*$k`
```

If you need to capture a literal for some reason, wrap it in parens:

```
({swear}:`f*$k`)
```

Same for negation:

```
(^`f*$k`)
```

Other than the interpretation of characters, it should behave the same as any other token, so you can do things like:

```
`f*$k`{2}
```

I'm sure this would end up turning into escape hell sometimes too, but it's worth it for literal strings!

## Curly Quantifiers

You should be able to write `f{1-3}` instead of `f{1,3}` to match `f`, `ff`, and `fff`. ONCE AGAIN, LIKE IN CHARACTER CLASSES: `[a-z]`. If you write `f{1,3}` it should match `f` and `fff` but not `ff`.

## Better sed

Regexes are great, but you need some way to use them. The program for our shiny new system needs some work.

- Instead of `sed s/pattern/substitute/g`, we should write each part separately, like: `bed 'pattern' 'substitute'`. Yes, you can use a different delimiter than the slash, but why do you have to use one at all? Just make them separate arguments.
- Global and multiline should be on by default, substitute should be the default operation, and there shouldn't be any operations like delete that are easily replaced with substitute.
- It should be much easier to use the contents of a file as a substitute. Something like `bed --file 'pattern' 'myfile.txt'`. Or just make it work with cat in a command substitution like you would expect it to.
- Insertions should be much simpler. Instead of writing something like `sed -E 's|(foo)(bar)|\1 inserted text \2|g'` you should be able to write something like `bed --insert 'foo{%}bar' ' inserted text '`.

## APL Expressions: Let's Get Weird

It's easy to type weird characters now if you're technically competent: just turn on the compose key and write some definitions in `~/.XCompose`. (Admittedly, XCompose is even more cursed than regex, but you can always download someone else's definitions.
(All the defaults I've encountered were horrible.))

Unicode has a shitload of characters that are well covered by fonts like Unifont. Why not use some of them instead of limiting ourselves to ASCII that we'll inevitably have to escape? I mean, sure, you'll still have to deal with Unicode sometimes, especially if you write in a non-Latin language. We can never totally escape escapes. But it would sure make life easier!

We'd need characters that are different enough to distinguish from "normal" ones, but similar enough to recognize. Thankfully, Unicode is massive, so that isn't a problem. Here are some possibilities:

- Asterisk: ✷ or ✲ (or many others; there are a TON of stars)
- Caret: ⍲ or ↥
- Dollar: Ֆ or ₴ or Ꞩ
- Plus: ✜
- Question mark: ⸮ or 🯄 or ¿
- Backtick: « » or 𐞁 or ⍘
- Curly brackets: ⦃ ⦄
- Parentheses: ⸨ ⸩
- Square brackets: 「 」 or ⟦ ⟧

Sadly, using Unicode in your terminal still kind of sucks. You get weird rendering issues, characters that are actually several characters in a trenchcoat, etc. And even with a nice XCompose, it would be kind of annoying.

## HALP

I suspect there are regex programs out there already that would make me happy, syntax I'm unaware of, sed tricks, etc. Please email me any great tips you have! Just hit the feedback button.
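
As a starting point for that hunt, a few of the wishes above can at least be approximated with syntax that already exists. Here's a minimal sketch using Python's standard `re` module. Python is an arbitrary pick for illustration, not something the ideas above depend on. It shows the unescaped-dot trap, `re.escape` standing in for backtick literals, named groups via `(?P<name>...)`, and the non-consuming negative lookahead (with tamer words than the examples above). The helpers `literal` and `insert_between` and all the sample strings are made up for this example.

```python
import re


def literal(text: str) -> str:
    """Hypothetical helper: a pattern fragment that matches `text` verbatim.

    Wrapping the escaped text in a non-capturing group makes it behave as a
    single token, so a quantifier after it applies to the whole string.
    """
    return f"(?:{re.escape(text)})"


def insert_between(text: str, left: str, right: str, inserted: str) -> str:
    """Hypothetical helper: put `inserted` between each literal left+right pair."""
    # (?P<name>...) is Python's spelling of a named capture group.
    pattern = re.compile(f"(?P<left>{literal(left)})(?P<right>{literal(right)})")
    # A function replacement sidesteps backslash escaping inside `inserted`.
    return pattern.sub(
        lambda m: m.group("left") + inserted + m.group("right"), text
    )


# The unescaped-dot trap: "." matches any character, so the loose pattern
# accepts strings it was never meant to.
loose = re.compile(r"notes.txt")
strict = re.compile(literal("notes.txt"))

# A literal treated as one token, then repeated exactly twice.
twice = re.compile(literal("f*$k") + "{2}")

# Negative lookahead consumes nothing: "baz" must come directly after "foo".
lookahead = re.compile(r"foo(?!bar)baz")

if __name__ == "__main__":
    print(insert_between("foobar", "foo", "bar", " inserted text "))
    print(bool(loose.fullmatch("notes_txt")), bool(strict.fullmatch("notes_txt")))
    print(bool(twice.fullmatch("f*$kf*$k")))
    print(bool(lookahead.fullmatch("foobaz")), bool(lookahead.fullmatch("fooquxbaz")))
```

Wrapping the escaped text in `(?:...)` is what lets `{2}` apply to the whole literal instead of just its last character. It's nowhere near as pleasant as backticks, but it does avoid writing the escapes by hand.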