Griefing the Perl API
Recently, I (together with my colleague) had an interesting debugging session for one of our products, which gives an interesting insight in how the Perl API is systematically being destroyed by the perl maintainers.
A confusing column title
It all started when one of our customers complained that the column "First Seen" in our product did not actually show the time and date where a device was "first seen", but instead the time got updated regularly.
The issue was quickly cleared up, as the "First Seen" column refers to the time where the specific set of measurements were first seen, so each time the measurements changed, the time would be updated. This probably would have been clear from the manual, but the column title can be easily misunderstood, so we decided to rename it to something better, case closed.
As part of the support ticket, however, we also received a "collect support", which is basically a bundle of logs and other data that is useful to track down the state of the product, and looking at the kernel log, we saw this (slightly edited to fit better):
2016-12-07T21:08:16.438401+00:00 ABC kernel: perl[21302]: segfault at c ip 7f1ac3007601 sp 7ffc39b24300 error 4 in XS.so[7f1ac3004000+6000] 2016-12-08T21:08:20.213476+00:00 ABC kernel: perl[26940]: segfault at c ip 7fcc1b75d601 sp 7ffc457974c0 error 4 in XS.so[7fcc1b75a000+6000] 2016-12-09T21:08:13.360726+00:00 ABC kernel: perl[32590]: segfault at c ip 7f00d20b7601 sp 7ffc05becfa0 error 4 in XS.so[7f00d20b4000+6000] 2016-12-10T21:08:13.961670+00:00 ABC kernel: perl[5776]: segfault at c ip 7f1d87795601 sp 7fffd65ff7e0 error 4 in XS.so[7f1d87792000+6000] 2016-12-13T21:09:29.369074+00:00 ABC kernel: perl[22919]: segfault at c ip 7f63f7211601 sp 7fffadb65db0 error 4 in XS.so[7f63f720e000+6000] 2016-12-14T21:09:29.456185+00:00 ABC kernel: perl[28871]: segfault at c ip 7f503c861601 sp 7fff30effd20 error 4 in XS.so[7f503c85e000+6000] 2016-12-15T21:09:34.731738+00:00 ABC kernel: perl[2099]: segfault at c ip 7f827d175601 sp 7ffca8677b50 error 4 in XS.so[7f827d172000+6000] 2016-12-16T21:09:30.049109+00:00 ABC kernel: perl[7766]: segfault at c ip 7efe2c0c7601 sp 7ffe5f4ee230 error 4 in XS.so[7efe2c0c4000+6000] 2016-12-17T21:08:25.647315+00:00 ABC kernel: perl[13843]: segfault at c ip 7f27ced08601 sp 7fffa874ee90 error 4 in XS.so[7f27ced05000+6000] 2016-12-18T21:08:07.977508+00:00 ABC kernel: perl[19741]: segfault at c ip 7fe6176c8601 sp 7ffd776e0110 error 4 in XS.so[7fe6176c5000+6000] 2016-12-19T21:09:25.243447+00:00 ABC kernel: perl[25565]: segfault at c ip 7fdd21e23601 sp 7ffd1a85dae0 error 4 in XS.so[7fdd21e20000+6000] 2016-12-20T21:09:28.867114+00:00 ABC kernel: perl[31602]: segfault at c ip 7f0697516601 sp 7ffd3c076700 error 4 in XS.so[7f0697513000+6000]
In case that didn't fit on your screen (it certainly doesn't fit into my blog text boxes...), basically perl segfaults shortly after 9pm every day in an object called XS.so
.
And the customer hadn't even noticed that the switch discovery failed.
Identifying the XS module
Anyway, let's debug it before he finds out. But how? I must admit I didn't know the details of all the data presented by the kernel, and so I tried to concentrate on the low hanging fruit.... which in this case was the XS.so - the customer recently upgraded to a new firmware version, which was the first version to use CBOR::XS
instead of Storable
for its communication protocol. And since that was when the crashes started, surely it must be a bug in CBOR::XS
.
Alas, looking at the other logs I couldn't see how the bug could happen at the state the software was in, and trying to replicate it with the customer data failed.
We also didn't have debug info on the box, no gdb
to debug things and not even a register dump, so the prospect of debugging this further without replicating it on our development system looked doubtful. And potentially using something clunky such as teamviewer or other similar garbage was not something we looked forward to either, so we looked at how to possibly gain more insights from the sparse data we had (actually, we should feel lucky that Linux nowadays logs segfaults by default).
So I guessed that 7f1ac3007601
is the instruction address and the module it is in started at 7f1ac3004000
(from the first crash message). With this, I should be able to disassemble a local, binary-identical, copy and at least get an idea of what caused the crash:
# gdb /opt/bin/perl GNU gdb (Debian 7.11.1-2+b1) 7.11.1 Reading symbols from /opt/bin/perl...done. (gdb) r -MCBOR::XS Starting program: /opt/bin/perl -MCBOR::XS [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". ^C Program received signal SIGINT, Interrupt. 0x00007ffff01cc130 in __read_nocancel () at ../sysdeps/unix/syscall-template.S:84 84 ../sysdeps/unix/syscall-template.S: No such file or directory.
Now we have a perl with CBOR::XS
loaded. Let's find out where it is loaded:
(gdb) inf sh From To Syms Read Shared Object Library 0x00007ffff03d9ac0 0x00007ffff03f53c0 Yes /lib64/ld-linux-x86-64.so.2 0x00007ffff01c1a80 0x00007ffff01ce591 Yes /lib/x86_64-linux-gnu/libpthread.so.0 0x00007fffeffb8da0 0x00007fffeffb993e Yes /lib/x86_64-linux-gnu/libdl.so.2 0x00007fffefcb8570 0x00007fffefd2517a Yes /lib/x86_64-linux-gnu/libm.so.6 0x00007fffefa7cb40 0x00007fffefa81e37 Yes /lib/x86_64-linux-gnu/libcrypt.so.1 0x00007fffef6fa890 0x00007fffef8263b4 Yes /lib/x86_64-linux-gnu/libc.so.6 0x00007fffeea0ce30 0x00007fffeea14356 Yes /opt/perl/lib/perl5/auto/CBOR/XS/XS.so
Ok, it's at address 0x00007fffeea0ce30
, so take the segfault address, subtract the start address, and add the local base address (due to ASLR and other factors, modules are loaded at different addresses):
(gdb) p/x 0x7f1ac3007601-0x7f1ac3004000+0x00007fffeea0ce30 $3 = 0x7fffeea10431 (gdb) disass 0x7fffeea10431-16,0x7fffeea10431+16 Dump of assembler code from 0x7fffeea10421 to 0x7fffeea10441: 0x00007fffeea10421: test %r14,%r14 0x00007fffeea10424: mov %rsi,0x18(%r12) 0x00007fffeea10429: jle 0x7fffeea103a6 0x00007fffeea1042f: nop 0x00007fffeea10430: sub $0x1,%r14 0x00007fffeea10434: jne 0x7fffeea10d3b 0x00007fffeea1043a: mov 0x28(%r12),%rdi 0x00007fffeea1043f: mov (%rdi),%rax End of assembler dump. (gdb)
Tough, the instruction pointer is in the middle of an instruction (...31
vs ...30
). The instruction pointer is normally correct on x86/amd64, but who knows what the kernel does. Believing the kernel and the CPU to be exact, I realised my first mistake - 0x7f1ac3004000
, the kernel-reported base address for XS.so
, is on a page boundary (4kB) because the file is simply mmap
ed into memory, while the segment reported by gdb's info shared
command probably reports the start of the code segment inside the file, without the header.
So let's look at the real load address:
(gdb) inf thr Id Target Id Frame * 1 Thread 0x7ffff05b0700 (LWP 8228) "perl" 0x00007ffff01cc130 in __read_nocancel () at ../sysdeps/unix/syscall-template.S:84 (gdb) [in another shell] # cat /proc/8228/maps ... 7fffeea0b000-7fffeea15000 r-xp 00000000 00:14 9889509 /opt/perl/lib/perl5/auto/CBOR/XS/XS.so 7fffeea15000-7fffeec15000 ---p 0000a000 00:14 9889509 /opt/perl/lib/perl5/auto/CBOR/XS/XS.so 7fffeec15000-7fffeec16000 r--p 0000a000 00:14 9889509 /opt/perl/lib/perl5/auto/CBOR/XS/XS.so 7fffeec16000-7fffeec17000 rw-p 0000b000 00:14 9889509 /opt/perl/lib/perl5/auto/CBOR/XS/XS.so ...
So, uhm, I guess the real base address is 0x7fffeea0b000
, because thats the only executable segment. So let's try again:
(gdb) p/x 0x7f1ac3007601-0x7f1ac3004000+0x7fffeea0b000 $7 = 0x7fffeea0e601 (gdb) disass $7-16,$7+16 Dump of assembler code from 0x7fffeea0e5f1 to 0x7fffeea0e611: 0x00007fffeea0e5f1: mov %ebp,%eax 0x00007fffeea0e5f3: mov $0x1,%edx 0x00007fffeea0e5f8: shr $0x2,%rax 0x00007fffeea0e5fc: test %rax,%rax 0x00007fffeea0e5ff: cmove %rdx,%rax 0x00007fffeea0e603: add %rax,%rsi 0x00007fffeea0e606: callq 0x7fffeea0cb80 0x00007fffeea0e60b: mov 0x10(%r15),%rdi 0x00007fffeea0e60f: mov 0x10(%rdi),%rax End of assembler dump. (gdb)
Goddamnit, still in the middle of an instruction. And doing the whole thing with JSON::XS
, the only other suspect I came up with, resulted in the same problem.
So either the kernel is full of shit, I am doing it wrong, or maybe, just maybe, my initial guess of it being CBOR::XS
or JSON::XS
was wrong. But we don't use any other ???::XS
module. Let's make sure there really isn't any other:
# find /opt/perl -name XS.so /opt/perl/lib/perl5/auto/CBOR/XS/XS.so /opt/perl/lib/perl5/auto/Date/Calc/XS/XS.so /opt/perl/lib/perl5/auto/Digest/FNV/XS/XS.so /opt/perl/lib/perl5/auto/JSON/XS/XS.so /opt/perl/lib/perl5/auto/Net/SNMP/XS/XS.so /opt/perl/lib/perl5/auto/Readonly/XS/XS.so /opt/perl/lib/perl5/auto/Package/Stash/XS/XS.so
Oops, so maybe XS
is a more common name component than I thought, and, quite obviously, there is another ???::XS
module we use, even written by myself, namely Net::SNMP::XS
. And surely, after going through the whole process again, we get a hit, at address 0x00007ffff5e9c601
:
0x00007ffff5e9c59f: movzbl (%rax),%eax 0x00007ffff5e9c5a2: mov %rdx,0x202f6f(%rip) # 0x7ffff609f518 0x00007ffff5e9c5a9: cmp $0x44,%al 0x00007ffff5e9c5ab: mov %eax,%edx 0x00007ffff5e9c5ad: mov %eax,(%rdi) 0x00007ffff5e9c5af: movzbl %al,%ecx 0x00007ffff5e9c5b2: ja 0x7ffff5e9c7ee 0x00007ffff5e9c5b8: lea 0x19e1(%rip),%rax # 0x7ffff5e9dfa0 0x00007ffff5e9c5bf: movslq (%rax,%rcx,4),%rsi 0x00007ffff5e9c5c3: add %rsi,%rax 0x00007ffff5e9c5c6: jmpq *%rax 0x00007ffff5e9c5c8: nopl 0x0(%rax,%rax,1) 0x00007ffff5e9c5d0: lea 0x15a9(%rip),%rdi # 0x7ffff5e9db80 0x00007ffff5e9c5d7: xor %ebp,%ebp 0x00007ffff5e9c5d9: callq 0x7ffff5e9ad00 0x00007ffff5e9c5de: movl $0x0,(%rbx) 0x00007ffff5e9c5e4: xor %edx,%edx 0x00007ffff5e9c5e6: mov 0x202f13(%rip),%rax # 0x7ffff609f500 0x00007ffff5e9c5ed: mov (%rax),%rcx 0x00007ffff5e9c5f0: cmp %rbp,0x10(%rcx) 0x00007ffff5e9c5f4: jl 0x7ffff5e9c60b 0x00007ffff5e9c5f6: mov 0x10(%rax),%rax 0x00007ffff5e9c5fa: movzbl %dl,%ebx 0x00007ffff5e9c5fd: mov (%rax,%rbx,8),%rax 0x00007ffff5e9c601: cmpb $0xd,0xc(%rax) 0x00007ffff5e9c605: je 0x7ffff5e9c708 0x00007ffff5e9c60b: lea 0x13c3(%rip),%rdi # 0x7ffff5e9d9d5 0x00007ffff5e9c612: callq 0x7ffff5e9ad00 0x00007ffff5e9c617: mov 0x201cba(%rip),%rax # 0x7ffff609e2d8 0x00007ffff5e9c61e: add $0x8,%rsp 0x00007ffff5e9c622: pop %rbx 0x00007ffff5e9c623: pop %rbp 0x00007ffff5e9c624: pop %r12 0x00007ffff5e9c626: pop %r13 0x00007ffff5e9c628: retq
And now the segfault at c
message makes instant sense, c
is the faulty memory address, which is trivially explained by the instruction that segfaulted:
0x00007ffff5e9c601: cmpb $0xd,0xc(%rax)
Because that instruction dereferences a pointer in %rax + 0xc
, and compares it to 13 (0xd
). And if that is a NULL
pointer, you would expect the faulty address to be 0xc
, which matched the kernel message.
That's too much for being accidental, this almost certainly must be the correct module and instruction. But Net::SNMP::XS
is well tested and didn't change, so why is it suddenly segfaulting?
Identifying the source code for this without debug info looked rather hard, but let's try to find it. The first observation I made is that, since this is XS code, the pointer is most likely an SV *
, and the 0xc
is the offset inside, most likely the lowest byte of the flags, let's look at sv.h to verify that:
/* start with 2 sv-head building blocks */ #define _SV_HEAD(ptrtype) \ ptrtype sv_any; /* pointer to body */ \ U32 sv_refcnt; /* how many references to us */ \ U32 sv_flags /* what we are */
Yup, 8 byte pointer, 4 byte U32
, skipping the first two members gives us sv_flags
. And the lowest 8 bits of it are the type, so what's type 13?
SVt_PVCV, /* 13 */
Ok, it tries to see if something is executable. A good hint. Let's see what some of those memory address point to, and indeed:
(gdb) x/s 0x7ffff5e9d9d5 0x7ffff5e9d9d5: "Unknown ASN.1 type"
Looks like an error message. And suddenly finding the source code was infinitely easier than I feared, because searching for the message found only one hit:
if (type > AvFILLp (av_type) || SvTYPE (AvARRAY (av_type)[type]) != SVt_PVCV) { error ("Unknown ASN.1 type"); return &PL_sv_undef; }
Lo and behold, there is also the SVt_PVCV
from earlier!
Ok, SvTYPE
simply reads the flags and compares it to SVt_PVCV
, so quite clearly this is the expression that resulted in a NULL
pointer:
AvARRAY (av_type)[type]
Well, this array, as it turns out, is initialised at module load time, with most members being empty and only those indices that correspond to valid ASN.1 types set to callbacks to parse them. So either we had memory corruption (unlikely for a purely internal array), or we really did read a zero (most likely). Since the bounds check seemed correct, it must mean that the array members which are unset (those containing undef
by default) are actually stored as NULL
in perl. And as an occasional reader of perl source code, I did remember, faintly, that there indeed was a related change, and as it turns out it is even, kind of, documented, in perl5200delta:
Internal Changes · Arrays now use NULL internally to represent unused slots, instead of &PL_sv_undef. &PL_sv_undef is no longer treated as a special value, so av_store(av, 0, &PL_sv_undef) will cause element 0 of that array to hold a read-only undefined scalar. "$array[0] = anything" will croak and "\$array[0]" will compare equal to "\undef".
Right, one of those "internal" changes that break existing code. But then, isn't AvARRAY
internal and not to be used by XS code? And the more urgent question is, how to properly fix it?
Public or not, that is the question.
Well, it turns out, if you study the perl 5.20 documentation, AvARRAY
clearly isn't public, because it isn't listed in perlapi
, and perlapi
says:
Anything not listed here is not part of the public API, and should not be used by extension writers at all.
So it's easy to point at the module and say "it's broken, it abuses internal functionality and gets what it deserves", which is exactly what perl5porters like to do so often these days when they, again, break CPAN modules.
But it's not so simple: The perlapi
manpage is relatively young and this notice therefore was an attempt to redefine reality at a point where the perl API was well established (funnily enough, the perl headers also have a way of declaring symbols as internal or not, but the perl headers think AvARRAY
is part of the public API). Worse yet, AvARRAY
is used as an example right within the perlapi
manpage itself, and also in other parts of the documentation, so clearly, AvARRAY
was never meant to be internal.
However, this is a common pattern - more and more of the perl API is declared internal, then broken, then fingers are pointed at module authors for abusing the "internal" API.
But that's not the whole story.
The plot thickens
Ranting aside, how do we fix this? Well, obviously we just test for a NULL
pointer in addition to &PL_sv_undef
, problem solved:
if (type > AvFILLp (av_type) || AvARRAY (av_type)[type] == 0 || AvARRAY (av_type)[type] == &PL_sv_undef) { error ("Unknown ASN.1 type"); return &PL_sv_undef; }
I also did some other changes to check for SVt_PVCV
when the array is initialised, not when it is used, which hopefully regains a bit of the speed we lost, called it Net::SNMP::XS
version 1.32 and released the fix.
A quick grep showed that at least one other of my modules was affected by this incompatible change in perl 5.22. A few minutes later, I uploaded Array::Heap
version 3.22 with another clumsy workaround like the above.
Of course, both modules are still using the "internal" API, and whether I agree with it or not, future versions of Perl will make it worse, so again, how do I really, officially, legally fix this?.
Using the "public" API, of course, namely using av_fetch
, the documented variant of AvARRAY
. It is documented like this (excerpt):
av_fetch Returns the SV at the specified index in the array. The "key" is the index. If lval is true, you are guaranteed to get a real SV back (in case it wasn't real before), which you can then modify. Check that the return value is non-null before dereferencing it to a "SV*".
So there is something about a "real SV", and even the non-null aspect is mentioned, right?
Wrong. The "real" is perl implementation speak for an SV pointer that is properly refcounted, that is, a "real" AV is an array whose elements are refcounted normally, while a "non-real" array is an array with SV pointers that are not refcounted (and might, in fact, not even be legal SVs). "Non-real" data structures are sometimes used as optimisations by perl itself, and that comment was added in perl 5.14 (or so, I didn't check), long before the breaking change occurred.
What about the much more on-topic "non-null check"? Well, av_fetch
returns a pointer to a pointer to an SV. So to get the actual SV pointer, you need to dereference it first. And in cases where the element has no storage space (as is the case for tied arrays for example), av_fetch
returns 0
so indicate that.
Again, that comment is much older, and doesn't refer to the NULL
pointers inside an array.
Now, I looked at the av_fetch
source code again - since the NULL
pointer change is internal, surely, av_fetch
fixes this before it returns those internal values, right?
Wrong again: av_fetch
will happily return a non-null pointer to a NULL
SV pointer in the array.
Which makes you wonder, What is the documented way to check for undefinedness in XS? Maybe if I use the correct API for that, everything would work out?
Nope, the way to check for undefinedness in XS is SvOK
, which, when presented with undefined values in perl arrays.... segfaults.
If you look at the perl API, you do find examples where NULL
is treated specially, for example, newSVsv (0)
will return 0
instead of crashing, but that code is usually older as well, and wasn't in response to the new "0 is the new undef" change. If you look at the public API, it becomes clear that those "internal" NULL
pointers are not caught by most functions and macros and cause segfaults.
So, again, how do you actually avoid that case with just the legal/public/official API?
Segfaulting as the only publicly documented way of dealing with this?
Beats me, I have no clue. The fact that 0
is now a valid undef
value is not, to my knowledge, documented, and most code (both in CPAN modules and in perl itself) will happily just crash.
The only way to work around that is to either crash, as prescribed by the public API, or hard code the internal placeholder value (that is, 0
), and wait for the next API change.
I don't know how many modules are affected (I use av_fetch
quite often, and I imagine that a lot of CPAN modules do). Obviously, this bug can go undetected for quite long - both modules that I know of being affected are in heavy use elsewhere and didn't actually cause crashes yet. For Net::SNMP::XS
, this is easily explainable, as network switches return a lot of garbage via SNMP, but it's exceedingly rare for a switch to return actually corrupted ASN.1 data.
Griefing the perl API
Now, everybody makes mistakes, and that surely includes the perl5porters who maintain perl. However, the experience I had in the last years was that reporting those bugs will fall on deaf ears - either you are told that backwards compatibility doesn't matter (long gone are the times were perlpolicy
was used for anything but blocking dissenting opinions), you are told you are abusing the API, you are told that the change had to happen because of, well, reasons, or you realise that the person who did this change is simply absolutely clueless about perl, and shouldn't touch the perl code at all.
My personal favourite was Zefram's rationale for why changing warn
to be incompatible to its documentation (and existing code) was somehow right by stating "this change was made to make warn behave more like die". The same rationale can, of course, be used to make print
behave more like system 'rm -rf /'
, but hey, what do I know about reasoning.
This is the main reason why I had to fork perl 5.22 into stableperl, simply as a kind of self defense against these changes. And even if perl 5.22 is already quite old, I still have work around incompatible changes in it. I clearly should have forked 5.18. I mean 5.16. No wait, 5.14... No, no, no, the last version before this started was perl 5.10.
I can't even start to imagine how one can support products with newer releases such as 5.24, if every release breaks stuff that takes me (and CPAN) years to work around.
The closest equivalent that comes to my mind when thinking about this is griefing, because that's what happens, the perl5porters are griefing the perl API, and everybody else has to suffer for that. I am sure most (but not all) of the perl5porters do not do this for the fun of causing pain, but the central concept of griefing, destroying things that worked for no good reason, causing grief for others, is certainly fulfilled.