THE RANT / THE SCHPLOG
Schmorp's POD Blog a.k.a. THE RANT
a.k.a. the blog that cannot decide on a name

This document was first published 2015-06-18 12:00:33, and last modified 2015-06-18 12:00:33.

Named Capture Fail

This is a short one, and if you came here for a solution, there is one, but it requires XS.

The problem is simple - I have a list of user-configurable regexes that match URIs, and would like to extract certain parts. Instead of relying on $1, and thus on a specific ordering and number of capture groups, I'd like to use named captures.

One such use involves matching the SCRIPT_NAME of a CGI-like URI - the user would specify a regex match such as this (this uses a custom regex format so that . really only matches a dot btw.):

m'^lists.schmorp.de(?<name>/mailman/[a-z]+)'

The capture named "name" would then be used to find the SCRIPT_NAME. Everything past the SCRIPT_NAME is, by definition, the PATH_INFO, so all I'd need is the position of this capture within the whole string, which can be found conveniently (or not so) in "$`$&$'".

I must admit that I really like all those regex extensions by the new p5pers - they strike me as obvious and useful - but on the other hand, each time I actually wanted to use them, I found that the old method was faster and/or more convenient, or that the new features don't actually solve my problem, because something essential is missing. The same, as it turned out, is true for named captures.

I already knew (from reading the perlvar manpage a long time ago) that I can get string offsets for numbered captures using @LAST_MATCH_START/@-, and that there is a variable called %LAST_MATCH_START/%-. The latter is documented as being similar to @LAST_MATCH_START (heck, it's the same name!), so I didn't have to search long.
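For reference, this is roughly what the numbered-capture version looks like. It is only a sketch: split_uri is a hypothetical helper name, the example URI is made up, and the regex uses ordinary Perl syntax (with escaped dots) rather than the custom format mentioned above.

```perl
use strict;
use warnings;

# Hypothetical helper: split a URI into SCRIPT_NAME and PATH_INFO using
# the start/end offsets of numbered capture group 1.
sub split_uri {
    my ($uri, $re) = @_;

    return unless $uri =~ $re;
    return unless defined $-[1];    # group 1 did not take part in the match

    # $-[1] is the offset where group 1 starts, $+[1] where it ends
    my ($start, $end) = ($-[1], $+[1]);

    my $script_name = substr $uri, $start, $end - $start;
    my $path_info   = substr $uri, $end;    # everything past SCRIPT_NAME

    return ($script_name, $path_info);
}

my ($script_name, $path_info) = split_uri(
    "lists.schmorp.de/mailman/listinfo/foo",
    qr{^lists\.schmorp\.de(/mailman/[a-z]+)},
);

print "SCRIPT_NAME=$script_name PATH_INFO=$path_info\n";
```

This works, but only because the code knows that the interesting group is number 1, which is exactly the dependency on ordering that named captures were supposed to remove.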

Unfortunately, %LAST_MATCH_START isn't similar to @LAST_MATCH_START, and I can't even imagine what the name refers to, as the former contains, for each name, an array reference with all the strings captured under that name, and no start offsets (why reuse @- for %- when they are unrelated and there is a whole namespace for these variables, for example ${^MATCH}?).
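A quick way to see the difference is to dump both hashes after a match. This is a minimal sketch; inspect_named is a hypothetical helper, and the regex is again written in plain Perl syntax.

```perl
use strict;
use warnings;

# Hypothetical helper: report what %+ and %- hold for the capture "name".
sub inspect_named {
    my ($uri) = @_;

    return unless $uri =~ m{^lists\.schmorp\.de(?<name>/mailman/[a-z]+)};

    return {
        plus       => $+{name},            # the captured string
        minus_type => ref $-{name},        # type of the value stored in %-
        minus      => [ @{ $-{name} } ],   # copy of the captured strings
        start      => $-[1],               # offset, only via the group number
    };
}

my $info = inspect_named("lists.schmorp.de/mailman/listinfo/foo");

print "\$+{name} = $info->{plus}\n";
print "\$-{name} is a reference of type $info->{minus_type}\n";
print "\$-{name}[0] = $info->{minus}[0]\n";
print "offset (via \$-[1]) = $info->{start}\n";
```

Both hashes hand back strings (directly or inside an array reference); the offset in the last line is only reachable because the code happens to know that "name" is group 1.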

Anyways, the short story is: no, you can't get this information, it's only available for numbered captures. There is no API to get offsets for named captures, nor an API to get the group numbers the named captures refer to, which would be an obvious thing to provide.

So it looks as if named captures are yet another limited design that has been added on a whim, and for the full features you still need to use numbered captures.

XS to the Rescue

Of course, the information is out there, if you can use XS and can find the struct regexp for the match (presumably via some per-interpreter variable, although I didn't check).

One way to get it is regexp->paren_names, which is a hash mapping capture names to, among other things, the capture number, which could then be mapped to offsets.

I didn't implement this (XS is out for my use case), and I don't know if a module exists to do so - if you know, you can drop me a note and I will add it.