THE RANT / THE SCHPLOG
Schmorp's POD Blog a.k.a. THE RANT
a.k.a. the blog that cannot decide on a name

This document was first published 2016-12-31 10:25:27, and last modified 2016-12-31 10:25:27.

Griefing the Perl API

Recently, I (together with my colleague) had an interesting debugging session for one of our products, which gives an interesting insight in how the Perl API is systematically being destroyed by the perl maintainers.

A confusing column title

It all started when one of our customers complained that the column "First Seen" in our product did not actually show the time and date where a device was "first seen", but instead the time got updated regularly.

The issue was quickly cleared up, as the "First Seen" column refers to the time where the specific set of measurements were first seen, so each time the measurements changed, the time would be updated. This probably would have been clear from the manual, but the column title can be easily misunderstood, so we decided to rename it to something better, case closed.

As part of the support ticket, however, we also received a "collect support", which is basically a bundle of logs and other data that is useful to track down the state of the product, and looking at the kernel log, we saw this (slightly edited to fit better):

2016-12-07T21:08:16.438401+00:00 ABC kernel: perl[21302]: segfault at c ip 7f1ac3007601 sp 7ffc39b24300 error 4 in XS.so[7f1ac3004000+6000]
2016-12-08T21:08:20.213476+00:00 ABC kernel: perl[26940]: segfault at c ip 7fcc1b75d601 sp 7ffc457974c0 error 4 in XS.so[7fcc1b75a000+6000]
2016-12-09T21:08:13.360726+00:00 ABC kernel: perl[32590]: segfault at c ip 7f00d20b7601 sp 7ffc05becfa0 error 4 in XS.so[7f00d20b4000+6000]
2016-12-10T21:08:13.961670+00:00 ABC kernel: perl[5776]: segfault at c ip 7f1d87795601 sp 7fffd65ff7e0 error 4 in XS.so[7f1d87792000+6000]
2016-12-13T21:09:29.369074+00:00 ABC kernel: perl[22919]: segfault at c ip 7f63f7211601 sp 7fffadb65db0 error 4 in XS.so[7f63f720e000+6000]
2016-12-14T21:09:29.456185+00:00 ABC kernel: perl[28871]: segfault at c ip 7f503c861601 sp 7fff30effd20 error 4 in XS.so[7f503c85e000+6000]
2016-12-15T21:09:34.731738+00:00 ABC kernel: perl[2099]: segfault at c ip 7f827d175601 sp 7ffca8677b50 error 4 in XS.so[7f827d172000+6000]
2016-12-16T21:09:30.049109+00:00 ABC kernel: perl[7766]: segfault at c ip 7efe2c0c7601 sp 7ffe5f4ee230 error 4 in XS.so[7efe2c0c4000+6000]
2016-12-17T21:08:25.647315+00:00 ABC kernel: perl[13843]: segfault at c ip 7f27ced08601 sp 7fffa874ee90 error 4 in XS.so[7f27ced05000+6000]
2016-12-18T21:08:07.977508+00:00 ABC kernel: perl[19741]: segfault at c ip 7fe6176c8601 sp 7ffd776e0110 error 4 in XS.so[7fe6176c5000+6000]
2016-12-19T21:09:25.243447+00:00 ABC kernel: perl[25565]: segfault at c ip 7fdd21e23601 sp 7ffd1a85dae0 error 4 in XS.so[7fdd21e20000+6000]
2016-12-20T21:09:28.867114+00:00 ABC kernel: perl[31602]: segfault at c ip 7f0697516601 sp 7ffd3c076700 error 4 in XS.so[7f0697513000+6000]

In case that didn't fit on your screen (it certainly doesn't fit into my blog text boxes...), basically perl segfaults shortly after 9pm every day in an object called XS.so.

And the customer hadn't even noticed that the switch discovery failed.

Identifying the XS module

Anyway, let's debug it before he finds out. But how? I must admit I didn't know the details of all the data presented by the kernel, and so I tried to concentrate on the low hanging fruit.... which in this case was the XS.so - the customer recently upgraded to a new firmware version, which was the first version to use CBOR::XS instead of Storable for its communication protocol. And since that was when the crashes started, surely it must be a bug in CBOR::XS.

Alas, looking at the other logs I couldn't see how the bug could happen at the state the software was in, and trying to replicate it with the customer data failed.

We also didn't have debug info on the box, no gdb to debug things and not even a register dump, so the prospect of debugging this further without replicating it on our development system looked doubtful. And potentially using something clunky such as teamviewer or other similar garbage was not something we looked forward to either, so we looked at how to possibly gain more insights from the sparse data we had (actually, we should feel lucky that Linux nowadays logs segfaults by default).

So I guessed that 7f1ac3007601 is the instruction address and the module it is in started at 7f1ac3004000 (from the first crash message). With this, I should be able to disassemble a local, binary-identical, copy and at least get an idea of what caused the crash:

# gdb /opt/bin/perl
GNU gdb (Debian 7.11.1-2+b1) 7.11.1
Reading symbols from /opt/bin/perl...done.
(gdb) r -MCBOR::XS
Starting program: /opt/bin/perl -MCBOR::XS
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
^C
Program received signal SIGINT, Interrupt.
0x00007ffff01cc130 in __read_nocancel () at ../sysdeps/unix/syscall-template.S:84
84      ../sysdeps/unix/syscall-template.S: No such file or directory.

Now we have a perl with CBOR::XS loaded. Let's find out where it is loaded:

(gdb) inf sh
From                To                  Syms Read Shared Object Library
0x00007ffff03d9ac0  0x00007ffff03f53c0  Yes       /lib64/ld-linux-x86-64.so.2
0x00007ffff01c1a80  0x00007ffff01ce591  Yes       /lib/x86_64-linux-gnu/libpthread.so.0
0x00007fffeffb8da0  0x00007fffeffb993e  Yes       /lib/x86_64-linux-gnu/libdl.so.2
0x00007fffefcb8570  0x00007fffefd2517a  Yes       /lib/x86_64-linux-gnu/libm.so.6
0x00007fffefa7cb40  0x00007fffefa81e37  Yes       /lib/x86_64-linux-gnu/libcrypt.so.1
0x00007fffef6fa890  0x00007fffef8263b4  Yes       /lib/x86_64-linux-gnu/libc.so.6
0x00007fffeea0ce30  0x00007fffeea14356  Yes       /opt/perl/lib/perl5/auto/CBOR/XS/XS.so

Ok, it's at address 0x00007fffeea0ce30, so take the segfault address, subtract the start address, and add the local base address (due to ASLR and other factors, modules are loaded at different addresses):

(gdb) p/x 0x7f1ac3007601-0x7f1ac3004000+0x00007fffeea0ce30
$3 = 0x7fffeea10431
(gdb) disass 0x7fffeea10431-16,0x7fffeea10431+16
Dump of assembler code from 0x7fffeea10421 to 0x7fffeea10441:
   0x00007fffeea10421: test   %r14,%r14
   0x00007fffeea10424: mov    %rsi,0x18(%r12)
   0x00007fffeea10429: jle    0x7fffeea103a6
   0x00007fffeea1042f: nop
   0x00007fffeea10430: sub    $0x1,%r14
   0x00007fffeea10434: jne    0x7fffeea10d3b
   0x00007fffeea1043a: mov    0x28(%r12),%rdi
   0x00007fffeea1043f: mov    (%rdi),%rax
End of assembler dump.
(gdb)

Tough, the instruction pointer is in the middle of an instruction (...31 vs ...30). The instruction pointer is normally correct on x86/amd64, but who knows what the kernel does. Believing the kernel and the CPU to be exact, I realised my first mistake - 0x7f1ac3004000, the kernel-reported base address for XS.so, is on a page boundary (4kB) because the file is simply mmaped into memory, while the segment reported by gdb's info shared command probably reports the start of the code segment inside the file, without the header.

So let's look at the real load address:

(gdb) inf thr
Id   Target Id         Frame
* 1    Thread 0x7ffff05b0700 (LWP 8228) "perl" 0x00007ffff01cc130 in __read_nocancel ()
                                               at ../sysdeps/unix/syscall-template.S:84
(gdb)

[in another shell]
# cat /proc/8228/maps
...
7fffeea0b000-7fffeea15000 r-xp 00000000 00:14 9889509 /opt/perl/lib/perl5/auto/CBOR/XS/XS.so
7fffeea15000-7fffeec15000 ---p 0000a000 00:14 9889509 /opt/perl/lib/perl5/auto/CBOR/XS/XS.so
7fffeec15000-7fffeec16000 r--p 0000a000 00:14 9889509 /opt/perl/lib/perl5/auto/CBOR/XS/XS.so
7fffeec16000-7fffeec17000 rw-p 0000b000 00:14 9889509 /opt/perl/lib/perl5/auto/CBOR/XS/XS.so
...

So, uhm, I guess the real base address is 0x7fffeea0b000, because thats the only executable segment. So let's try again:

(gdb) p/x 0x7f1ac3007601-0x7f1ac3004000+0x7fffeea0b000
$7 = 0x7fffeea0e601
(gdb) disass $7-16,$7+16
Dump of assembler code from 0x7fffeea0e5f1 to 0x7fffeea0e611:
   0x00007fffeea0e5f1:     mov    %ebp,%eax
   0x00007fffeea0e5f3:     mov    $0x1,%edx
   0x00007fffeea0e5f8:     shr    $0x2,%rax
   0x00007fffeea0e5fc:     test   %rax,%rax
   0x00007fffeea0e5ff:     cmove  %rdx,%rax
   0x00007fffeea0e603:     add    %rax,%rsi
   0x00007fffeea0e606:     callq  0x7fffeea0cb80
   0x00007fffeea0e60b:     mov    0x10(%r15),%rdi
   0x00007fffeea0e60f:     mov    0x10(%rdi),%rax
End of assembler dump.
(gdb)

Goddamnit, still in the middle of an instruction. And doing the whole thing with JSON::XS, the only other suspect I came up with, resulted in the same problem.

So either the kernel is full of shit, I am doing it wrong, or maybe, just maybe, my initial guess of it being CBOR::XS or JSON::XS was wrong. But we don't use any other ???::XS module. Let's make sure there really isn't any other:

# find /opt/perl -name XS.so
/opt/perl/lib/perl5/auto/CBOR/XS/XS.so
/opt/perl/lib/perl5/auto/Date/Calc/XS/XS.so
/opt/perl/lib/perl5/auto/Digest/FNV/XS/XS.so
/opt/perl/lib/perl5/auto/JSON/XS/XS.so
/opt/perl/lib/perl5/auto/Net/SNMP/XS/XS.so
/opt/perl/lib/perl5/auto/Readonly/XS/XS.so
/opt/perl/lib/perl5/auto/Package/Stash/XS/XS.so

Oops, so maybe XS is a more common name component than I thought, and, quite obviously, there is another ???::XS module we use, even written by myself, namely Net::SNMP::XS. And surely, after going through the whole process again, we get a hit, at address 0x00007ffff5e9c601:

0x00007ffff5e9c59f:  movzbl (%rax),%eax
0x00007ffff5e9c5a2:  mov    %rdx,0x202f6f(%rip)        # 0x7ffff609f518
0x00007ffff5e9c5a9:  cmp    $0x44,%al
0x00007ffff5e9c5ab:  mov    %eax,%edx
0x00007ffff5e9c5ad:  mov    %eax,(%rdi)
0x00007ffff5e9c5af:  movzbl %al,%ecx
0x00007ffff5e9c5b2:  ja     0x7ffff5e9c7ee
0x00007ffff5e9c5b8:  lea    0x19e1(%rip),%rax        # 0x7ffff5e9dfa0
0x00007ffff5e9c5bf:  movslq (%rax,%rcx,4),%rsi
0x00007ffff5e9c5c3:  add    %rsi,%rax
0x00007ffff5e9c5c6:  jmpq   *%rax
0x00007ffff5e9c5c8:  nopl   0x0(%rax,%rax,1)
0x00007ffff5e9c5d0:  lea    0x15a9(%rip),%rdi        # 0x7ffff5e9db80
0x00007ffff5e9c5d7:  xor    %ebp,%ebp
0x00007ffff5e9c5d9:  callq  0x7ffff5e9ad00
0x00007ffff5e9c5de:  movl   $0x0,(%rbx)
0x00007ffff5e9c5e4:  xor    %edx,%edx
0x00007ffff5e9c5e6:  mov    0x202f13(%rip),%rax        # 0x7ffff609f500
0x00007ffff5e9c5ed:  mov    (%rax),%rcx
0x00007ffff5e9c5f0:  cmp    %rbp,0x10(%rcx)
0x00007ffff5e9c5f4:  jl     0x7ffff5e9c60b
0x00007ffff5e9c5f6:  mov    0x10(%rax),%rax
0x00007ffff5e9c5fa:  movzbl %dl,%ebx
0x00007ffff5e9c5fd:  mov    (%rax,%rbx,8),%rax

0x00007ffff5e9c601:  cmpb   $0xd,0xc(%rax)

0x00007ffff5e9c605:  je     0x7ffff5e9c708
0x00007ffff5e9c60b:  lea    0x13c3(%rip),%rdi        # 0x7ffff5e9d9d5
0x00007ffff5e9c612:  callq  0x7ffff5e9ad00
0x00007ffff5e9c617:  mov    0x201cba(%rip),%rax        # 0x7ffff609e2d8
0x00007ffff5e9c61e:  add    $0x8,%rsp
0x00007ffff5e9c622:  pop    %rbx
0x00007ffff5e9c623:  pop    %rbp
0x00007ffff5e9c624:  pop    %r12
0x00007ffff5e9c626:  pop    %r13
0x00007ffff5e9c628:  retq

And now the segfault at c message makes instant sense, c is the faulty memory address, which is trivially explained by the instruction that segfaulted:

0x00007ffff5e9c601:  cmpb   $0xd,0xc(%rax)

Because that instruction dereferences a pointer in %rax + 0xc, and compares it to 13 (0xd). And if that is a NULL pointer, you would expect the faulty address to be 0xc, which matched the kernel message.

That's too much for being accidental, this almost certainly must be the correct module and instruction. But Net::SNMP::XS is well tested and didn't change, so why is it suddenly segfaulting?

Identifying the source code for this without debug info looked rather hard, but let's try to find it. The first observation I made is that, since this is XS code, the pointer is most likely an SV *, and the 0xc is the offset inside, most likely the lowest byte of the flags, let's look at sv.h to verify that:

/* start with 2 sv-head building blocks */
#define _SV_HEAD(ptrtype) \
    ptrtype     sv_any;         /* pointer to body */   \
    U32         sv_refcnt;      /* how many references to us */ \
    U32         sv_flags        /* what we are */

Yup, 8 byte pointer, 4 byte U32, skipping the first two members gives us sv_flags. And the lowest 8 bits of it are the type, so what's type 13?

     SVt_PVCV,       /* 13 */

Ok, it tries to see if something is executable. A good hint. Let's see what some of those memory address point to, and indeed:

(gdb) x/s 0x7ffff5e9d9d5
0x7ffff5e9d9d5: "Unknown ASN.1 type"

Looks like an error message. And suddenly finding the source code was infinitely easier than I feared, because searching for the message found only one hit:

if (type > AvFILLp (av_type) || SvTYPE (AvARRAY (av_type)[type]) != SVt_PVCV)
  {
    error ("Unknown ASN.1 type");
    return &PL_sv_undef;
  }

Lo and behold, there is also the SVt_PVCV from earlier!

Ok, SvTYPE simply reads the flags and compares it to SVt_PVCV, so quite clearly this is the expression that resulted in a NULL pointer:

AvARRAY (av_type)[type]

Well, this array, as it turns out, is initialised at module load time, with most members being empty and only those indices that correspond to valid ASN.1 types set to callbacks to parse them. So either we had memory corruption (unlikely for a purely internal array), or we really did read a zero (most likely). Since the bounds check seemed correct, it must mean that the array members which are unset (those containing undef by default) are actually stored as NULL in perl. And as an occasional reader of perl source code, I did remember, faintly, that there indeed was a related change, and as it turns out it is even, kind of, documented, in perl5200delta:

Internal Changes

    ·   Arrays now use NULL internally to represent unused slots,
        instead of &PL_sv_undef.  &PL_sv_undef is no longer treated
        as a special value, so av_store(av, 0, &PL_sv_undef) will
        cause element 0 of that array to hold a read-only undefined
        scalar. "$array[0] = anything" will croak and "\$array[0]"
        will compare equal to "\undef".

Right, one of those "internal" changes that break existing code. But then, isn't AvARRAY internal and not to be used by XS code? And the more urgent question is, how to properly fix it?

Public or not, that is the question.

Well, it turns out, if you study the perl 5.20 documentation, AvARRAY clearly isn't public, because it isn't listed in perlapi, and perlapi says:

Anything not listed here is not part of the public API, and should not
be used by extension writers at all.

So it's easy to point at the module and say "it's broken, it abuses internal functionality and gets what it deserves", which is exactly what perl5porters like to do so often these days when they, again, break CPAN modules.

But it's not so simple: The perlapi manpage is relatively young and this notice therefore was an attempt to redefine reality at a point where the perl API was well established (funnily enough, the perl headers also have a way of declaring symbols as internal or not, but the perl headers think AvARRAY is part of the public API). Worse yet, AvARRAY is used as an example right within the perlapi manpage itself, and also in other parts of the documentation, so clearly, AvARRAY was never meant to be internal.

However, this is a common pattern - more and more of the perl API is declared internal, then broken, then fingers are pointed at module authors for abusing the "internal" API.

But that's not the whole story.

The plot thickens

Ranting aside, how do we fix this? Well, obviously we just test for a NULL pointer in addition to &PL_sv_undef, problem solved:

if (type > AvFILLp (av_type)
    || AvARRAY (av_type)[type] == 0
    || AvARRAY (av_type)[type] == &PL_sv_undef)
  {
    error ("Unknown ASN.1 type");
    return &PL_sv_undef;
  }

I also did some other changes to check for SVt_PVCV when the array is initialised, not when it is used, which hopefully regains a bit of the speed we lost, called it Net::SNMP::XS version 1.32 and released the fix.

A quick grep showed that at least one other of my modules was affected by this incompatible change in perl 5.20. A few minutes later, I uploaded Array::Heap version 3.22 with another clumsy workaround like the above.

Of course, both modules are still using the "internal" API, and whether I agree with it or not, future versions of Perl will make it worse, so again, how do I really, officially, legally fix this?.

Using the "public" API, of course, namely using av_fetch, the documented variant of AvARRAY. It is documented like this (excerpt):

 av_fetch

      Returns the SV at the specified index in the array.  The "key"
      is the index.  If lval is true, you are guaranteed to get a
      real SV back (in case it wasn't real before), which you can
      then modify.  Check that the return value is non-null before
      dereferencing it to a "SV*".

So there is something about a "real SV", and even the non-null aspect is mentioned, right?

Wrong. The "real" is perl implementation speak for an SV pointer that is properly refcounted, that is, a "real" AV is an array whose elements are refcounted normally, while a "non-real" array is an array with SV pointers that are not refcounted (and might, in fact, not even be legal SVs). "Non-real" data structures are sometimes used as optimisations by perl itself, and that comment was added in perl 5.14 (or so, I didn't check), long before the breaking change occurred.

What about the much more on-topic "non-null check"? Well, av_fetch returns a pointer to a pointer to an SV. So to get the actual SV pointer, you need to dereference it first. And in cases where the element has no storage space (as is the case for tied arrays for example), av_fetch returns 0 so indicate that.

Again, that comment is much older, and doesn't refer to the NULL pointers inside an array.

Now, I looked at the av_fetch source code again - since the NULL pointer change is internal, surely, av_fetch fixes this before it returns those internal values, right?

Wrong again: av_fetch will happily return a non-null pointer to a NULL SV pointer in the array.

Which makes you wonder, What is the documented way to check for undefinedness in XS? Maybe if I use the correct API for that, everything would work out?

Nope, the way to check for undefinedness in XS is SvOK, which, when presented with undefined values in perl arrays.... segfaults.

If you look at the perl API, you do find examples where NULL is treated specially, for example, newSVsv (0) will return 0 instead of crashing,, but that code is usually older as well, and wasn't in response to the new "0 is the new undef" change. If you look at the public API, it becomes clear that those "internal" NULL pointers are not caught by most functions and macros and cause segfaults.

So, again, how do you actually avoid that case with just the legal/public/official API?

Segfaulting as the only publicly documented way of dealing with this?

Beats me, I have no clue. The fact that 0 is now a valid undef value is not, to my knowledge, documented, and most code (both in CPAN modules and in perl itself) will happily just crash.

The only way to work around that is to either crash, as prescribed by the public API, or hard code the internal placeholder value (that is, 0), and wait for the next API change.

I don't know how many modules are affected (I use av_fetch quite often, and I imagine that a lot of CPAN modules do). Obviously, this bug can go undetected for quite long - both modules that I know of being affected are in heavy use elsewhere and didn't actually cause crashes yet. For Net::SNMP::XS, this is easily explainable, as network switches return a lot of garbage via SNMP, but it's exceedingly rare for a switch to return actually corrupted ASN.1 data.

Griefing the perl API

Now, everybody makes mistakes, and that surely includes the perl5porters who maintain perl. However, the experience I had in the last years was that reporting those bugs will fall on deaf ears - either you are told that backwards compatibility doesn't matter (long gone are the times were perlpolicy was used for anything but blocking dissenting opinions), you are told you are abusing the API, you are told that the change had to happen because of, well, reasons, or you realise that the person who did this change is simply absolutely clueless about perl, and shouldn't touch the perl code at all.

My personal favourite was Zefram's rationale for why changing warn to be incompatible to its documentation (and existing code) was somehow right by stating "this change was made to make warn behave more like die". The same rationale can, of course, be used to make print behave more like system 'rm -rf /', but hey, what do I know about reasoning.

This is the main reason why I had to fork perl 5.20 into stableperl, simply as a kind of self defense against these changes. And even if perl 5.20 is already quite old, I still have work around incompatible changes in it. I clearly should have forked 5.18. I mean 5.16. No wait, 5.14... No, no, no, the last version before this started was perl 5.10.

I can't even start to imagine how one can support products with newer releases such as 5.22, if every release breaks stuff that takes me (and CPAN) years to work around.

The closest equivalent that comes to my mind when thinking about this is griefing, because that's what happens, the perl5porters are griefing the perl API, and everybody else has to suffer for that. I am sure most (but not all) of the perl5porters do not do this for the fun of causing pain, but the central concept of griefing, destroying things that worked for no good reason, causing grief for others, is certainly fulfilled.