Schmorp's POD Blog a.k.a. THE RANT
a.k.a. the blog that cannot decide on a name

This document was published 2015-11-12 06:37:44, and since then has not been materially modified.

Tidbits - Why Coro Crashes or How Perl 6 Deals with Bugs

Some people on the internets keep claiming that Coro is an unstable mess and extremely buggy (sometimes they claim this about EV, and very rarely about other modules). I normally don't care what "some people" spout somewhere, but I think this issue has a real background that I keep explaining to people when it comes up - a perfect topic for a blog posting!

So, on the one hand, this is rather surprising, as Coro has been extremely stable for many years, and is used in production environments for long-running processes. Heck, a large share of the Fortune 500 companies use tools based on Coro to administer their core company network or gather metrics about it.

On the other hand, there is a bug in perl that frequently causes segmentation faults when a perl program exits unexpectedly (or even on normal exits). Since the segmentation fault effectively hides the real error message, and since a backtrace often shows Coro or EV, "some people" conclude Coro (or EV) must be at fault and start spouting it out. More intelligent people (those seem to be in the majority, too) usually inquire about this and/or simply report it as a bug, which gives them a chance to understand this perl bug, and how to work around it.

What happens is that the perl interpreter corrupts memory before it exits, tries to hide this fact, and sometimes fails at this. It affects Perl code and XS code in general, but for reasons I will explain later, event-based code (AnyEvent, EV...) and thread-based code (Coro) have a higher chance of being affected, and you are only likely to see actual crashes with XS code - which is why EV and Coro are often seen in the backtraces, but pure-perl modules or most other XS modules are not.

Understanding what happens, and why, is usually quite difficult, and when you encounter this problem for the first time, some serious debugging is usually required to gain this understanding. I know, because I was in the same boat as you people, am just as annoyed as you, and am just as helpless as you.

In this blog post, I will try to explain this bug, what can be done, what should be done, what cannot be done, and what perl5porters have done about it. You can find other, less verbose explanations by me on stackoverflow and various mailinglists, made over the last decade. I hope this time I can nail the issue definitively.

The Symptoms

The symptoms are usually segfaults on program exit (but it's of course often not obvious that the crash happens at program exit time):

$ ./myperlprog
Segmentation Fault

Unfortunately, writing a short snippet that reliably shows this is hard, because the crash depends on the internal memory layout of the perl process - you usually need a nontrivial amount of code and runtime to make it appear.

So let's approach this from another angle, let's try to understand when and how perl DESTROYs your objects, and what it does on errors. This small program is a good start:

use Carp;

sub DESTROY {
   Carp::cluck "k/j DESTROY";
}

{
   my $k = bless { };
   my $j = bless { };
   
   $k->{j} = $j;
   $j->{k} = $k;
}

print "k and j are now out of scope, but not dead yet\n";

This program creates two objects that reference each other - a cyclic data structure of the kind that underlies many of these problems. Because it is cyclic, it is not freed at the end of the code block: k keeps j alive, and j keeps k alive.

However, at program exit, perl will forcefully free these objects, as can be witnessed from the program output:

k and j are now out of scope, but not dead yet
k/j DESTROY at - line 4.
        main::DESTROY(main=HASH(0x9ec378)) called at - line 0
        eval {...} called at - line 0
k/j DESTROY at - line 4.
        main::DESTROY(main=HASH(0x9ec360)) called at - line 0
        eval {...} called at - line 0

So main ends, the objects are still alive, but eventually during cleanup, perl calls their destructor.

This is not surprising. However, most people will be surprised when they change the cluck (a warn) into a die:

sub DESTROY {
   die "k/j DESTROY (", join ":", %{$_[0]}), ")";
}

This normally gives you this output:

k and j are now out of scope, but not dead yet

Okay, what happened to the exception? Has the destructor not been called?

The destructor has been called, and your exception was thrown away by perl. The eval context in the backtrace earlier means serious business - destructors in perl are called in eval context, and any exception causes your destructor to not finish its business.
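The only way to see such an exception at all is to catch it before perl's implicit eval swallows it. A minimal sketch (a diagnostic aid, not a fix - the downgraded warning does survive global destruction):

sub DESTROY {
   # perl wraps the DESTROY call in its own eval and discards $@,
   # so to see the error at all, catch it yourself and downgrade
   # it to a warning:
   local $@;
   eval {
      die "k/j DESTROY";
   };
   warn "exception in DESTROY: $@" if $@;
}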

Unfinished Business

In this trivial example, this is all fine, but under real-world conditions, or rather, when your program deals with real-world things such as temporary files, it can be quite annoying. Take this destructor for example:

sub DESTROY {
   my ($self) = @_;

   $self->{other_object}->flush_data;

   unlink $self->{temporary_file};
}

Whether the unlink is called or not depends on whether flush_data throws an exception. Fair enough - the problem is that if it does throw, you will never find out, because perl throws away the error message.

Sure, but if $self->{other_object}->flush_data is called under normal conditions, the conditions the code was tested under, it will not throw an exception, and this issue will not come up. Nobody believes in subtle bugs that happen when code is executed in unusual conditions :)

But, seriously, no, even under pretty normal conditions the method call can crash - for example, when $self->{other_object} is undef.
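If you wanted to harden this particular destructor, a sketch of the kind of guards you would need might look like this (the sections further down argue why this whack-a-mole approach cannot scale):

sub DESTROY {
   my ($self) = @_;

   # guard against other_object having been undefed already, and
   # make sure the unlink happens even if flush_data dies:
   eval {
      $self->{other_object}->flush_data
         if $self->{other_object};
   };
   warn "flush_data failed in DESTROY: $@" if $@;

   unlink $self->{temporary_file}
      if defined $self->{temporary_file};
}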

To see why this can happen quite unexpectedly and despite the best intentions, let's make another experiment by dumping the contents of %$self in DESTROY:

sub DESTROY {
   print "k/j DESTROY (", (join ",", %{$_[0]}), ")\n";
}

Let's see what this outputs on my system (something similar will probably happen on yours - note that the final print has meanwhile become a die, as in the full listing below, which is why the run exits with status 255):

k and j are now out of scope, but not dead yet
k/j DESTROY (k,main=HASH(0x9eb160))
k/j DESTROY (j,)
[Exit 255]

Hmm, shouldn't both k and j members be references to an object of class main? What happened to $k->{j}?

Well, it has been undefed by perl. If you called a method on $k->{j}, perl would throw an exception - which would then be ignored.

Now we are in a good position to understand why Coro crashes - the equivalent of undef in C/XS is a null pointer, and the equivalent of dereferencing it is not an exception as in Perl, but a crash, as in "Segmentation Fault".
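To see the Perl half of that analogy in isolation, here is a trivial sketch (not part of the example program):

my $obj;                  # the Perl-level analogue of a null pointer

eval { $obj->method };    # a method call on undef throws...
print "caught: $@";       # ...a perfectly catchable exception - the
                          # same situation in C/XS is a segfault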

Red Herrings, err, Cyclic References

A common argument is that this only happens to objects that are, in fact, cyclically referenced. This is true, but most people don't understand what "cyclically referenced" means in practice, expecting only objects that are part of a cycle to be affected. This is not the case, as another experiment shows (compared to the previous program, only the two lines involving Coro are new):

sub DESTROY {
   print "k/j DESTROY (", (join ",", %{$_[0]}), ")\n";
}

{
   use Coro;
   my $k = bless { sem => new Coro::Semaphore };
   my $j = bless { };
   
   $k->{j} = $j;
   $j->{k} = $k;
}

die "k and j are now out of scope, but not dead yet\n";

And the output might be:

k and j are now out of scope, but not dead yet
k/j DESTROY (j,main=HASH(0xaff828),sem,)
k/j DESTROY (k,)
[Exit 255]

Although $k->{sem} is not part of a direct cycle, it is apparently undefed by perl, even before the first destructor is called.

One might argue that this in itself is not an issue, as long as Coro::Semaphore isn't somehow cyclically referencing either k or j. But the problem is that perl does not, as many people believe, first break cycles and then clean up the remaining objects properly. Nor does it free other objects first and the cyclically referenced ones later.

What perl does is free objects in random order, at least from the perspective of the Perl program and any XS code. It frees objects in the middle of existing data structures, whether they are directly in a cycle or not - and, most importantly, whether they are responsible for the cycle or not.

In the above program run, we were lucky - it could have segfaulted, but it didn't. To see why, we have to know what a Coro::Semaphore object looks like, and fortunately it is a very simple object: the object reference points to a standard Perl array. The first member is the lock counter, and any remaining members are references to the thread objects currently waiting for the semaphore.

The thing to take away from this is that a Coro::Semaphore (like any object) consists of at least two parts: a reference, and something it points to, in this case an array. And these two parts will be freed independently by perl, in random order.

So what could have happened is that the array inside the Coro::Semaphore is freed before the reference. Later, when Coro::Semaphore::DESTROY is called, it dereferences a null pointer at the C level and causes an exception, err, segfault.

Or equally likely, one of the thread objects inside the array could have been freed - doing anything with them in the destructor would again result in a null pointer.
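To make that wait list concrete, here is a minimal usage sketch (unrelated to the crash itself): while the async thread below is blocked in down, a reference to it sits in the semaphore's internal array.

use Coro;

my $sem = new Coro::Semaphore 0;   # lock counter starts at 0

async {
   $sem->down;                     # blocks - this thread is now held
   print "got the semaphore\n";    # in the semaphore's array
};

cede;       # let the thread run until it blocks on $sem
$sem->up;   # increment the counter, waking the waiting thread
cede;       # let it finish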

So Why Only Coro?

Yeah, obviously it isn't just Coro, but why does it affect Coro and EV more often? There are actually multiple answers to that.

First of all, event-based programs, and thread-based programs, often make people create cyclic data structures without realising it. A common AnyEvent idiom is this:

my $delay; $delay = AE::timer 60, 0, sub {
   undef $delay;
   print "one minute later\n";
};

Event objects are not normally referenced by the event library - they could never go out of scope if they were. And a common trick to keep them alive is to reference them in the callback - which results in a cyclic data structure which might have an XS object inside somewhere (if AnyEvent uses EV, which is the default).

With threads, it is even less obvious:

my $w = AE::timer 60, 0, Coro::rouse_cb;

Coro::rouse_wait;

The coro thread running this references $w, and $w very indirectly references the thread because the thread put itself into the wait list for the rouse callback, again resulting in a cyclic data structure.

The other answer is that many XS authors sooner or later run into this, and then protect those accesses that they see crashing with explicit checks for null pointers, or they simply refuse to do any work when perl says it is exiting:

// case 1, explicit check
if (!SvOK (myref))
  return;

// case 2, refuse service during global destruction
if (PL_phase == PERL_PHASE_DESTRUCT)
  return;

Of course, both of these rely on undocumented behaviour, because the new perl regime gives you enough rope to hang yourself, but no way to get out of it cleanly.

So Why Doesn't Coro Check?

For the same reason the above checks don't work, and the same reason you don't protect every dereference in your Perl code either:

$self->{otherobject}->doit
   if $self && $self->{otherobject};

It should not be necessary!

And why don't the checks work? Well, they do, kind of, in those situations where the code saw crashes before and the crashing statements were protected. But to work reliably, almost every single line of code would need to be protected as in the above example. Programming would be impossible.

Protecting destructors is not enough, as destructors are free to call any method of any object they wish, so every method that could potentially be called during global destruction would need to protect every access. Every library that exposes objects would need to protect all code in all methods. Every XS function that is being called would need to validate every data structure it accesses, as parts of it could be pointing to memory that has been freed already.

That's insanity, of course, and nobody does that, nor should anybody need to do it.

Coro for example could add a null pointer check in the Coro::Semaphore::wait method, which is the most common place of crashes (and I might do so at some point, out of sheer desperation), but that wouldn't protect all the other places where it might happen with lower frequency.

And it doesn't help that it's very hard to check for global destruction from the Perl level.
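For what it's worth, newer perls at least expose the phase: since 5.14 you can test ${^GLOBAL_PHASE}, and the CPAN module Devel::GlobalDestruction wraps this (with workarounds for older perls). A sketch - which of course only lets you refuse service, not clean up properly:

use Devel::GlobalDestruction;

sub DESTROY {
   # the Perl-level counterpart of the PL_phase check shown
   # earlier; on 5.14+ this is equivalent to testing
   # ${^GLOBAL_PHASE} eq "DESTRUCT":
   return if in_global_destruction;

   # ... normal cleanup ...
}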

So Why is This Not Fixed in Perl?

Well, it's not broken. At least that is what the perl5porters have to say on this issue. If the perl interpreter corrupts your data structures so that they point to freed memory, that is your fault, and your fault alone.

No, seriously, that's the official stance. This has been reported multiple times over the years, and the only reaction was to suppress exceptions when invoking the DESTROY method. That's it.

What can I do?

It should be clear by now that there are very few realistic options to work around this problem. Honestly, the best workaround I know is to call POSIX::_exit to end your program, also from exception handlers - and I have given this issue a lot of thought over the years.

You don't lose that much - destruction at program exit is unreliable at best anyway. But you do lose destructors that are well behaved and should work.
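A sketch of this workaround (run_program is a placeholder for whatever your program actually does; flush anything you care about before exiting, as _exit skips all cleanup, stdio buffers included):

use POSIX ();
use IO::Handle ();

sub run_program {
   print "doing the real work\n";   # placeholder
}

eval {
   run_program ();
   1
} or do {
   # even the failure path reports its error itself, then leaves
   # without giving perl a chance to run global destruction:
   print STDERR $@;
   POSIX::_exit (1);
};

STDOUT->flush;   # _exit does not flush stdio buffers
POSIX::_exit (0);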

You can continue to ignore exceptions and corrupted data structures in Perl, and protect your destructors and methods which commonly fail in XS with more checks, which can avoid most but not all such problems.

The only reasonable place to fix this would be in the perl interpreter.

Why is Perl 6 in the Title?

You might or might not know that Perl 6 failed because the people working on it put their priorities into fancy fantasy projects, rather than creating a real world programming language. It's been abused to experiment on all kinds of en-vogue language things, but failed to attract actual Perl programmers, and utterly failed to deliver on every single promise made, most prominently, Perl 5 compatibility.

Now, what do you do when you do all these cool experiments, and nobody wants to suffer through trying them out? Well, if the people don't come to you, you go to the people. While more and more of the older perl 5 maintainers have stopped working on perl, more and more incompetent experimenters from Perl 6 have taken over.

The result is that, for quite some years now, perl 5 regularly gains cool new features that never work and are removed a few versions later (smartmatches are but one prominent example), bugs do not get fixed, the language gets broken, the perl policy regarding compatibility is ignored, and every major release majorly breaks CPAN.

The good work that a tiny minority still does on perl 5 is drowned among bad changes.

Anecdote Time

When you converse with perl 5 porters, it also becomes painfully clear that most of them just don't have any clue about perl, which they claim to maintain. In itself this is not a problem, but an opportunity to learn and understand - something perl5porters refuse to do. A good example is a bug in warn that I reported a few years ago:

warn is documented to append the usual "at FILE line N." text (and a newline) when its argument doesn't already end in a newline. Current versions of perl don't do this when the argument is an object. This is an issue because a lot of code (existing and newly written alike) uses warn to report, well, problems and debugging messages:

warn $object;

And this now loses vital line number information.

The response I got was "this is not a bug, warn was changed to behave more like die".

My reply was that, first of all, the documentation disagrees with the behaviour, so there is a bug somewhere, and that this argument is a non-starter - you could use it to make warn exit immediately as well, because "we changed warn to behave more like exit".

I even tried to explain why any possible similarity between warn and die is not a good idea - die is the only way to throw an exception in Perl, so making it possible to throw objects as exceptions easily overrides any other concern: without the ability to throw objects without stringifying them, you couldn't write your own exception primitives using objects.

The same is not true for warn - it is easy to warn users without using warn, the fact that many people use logging packages that write to STDERR is a witness to this. You can even emulate warn completely in Perl without calling warn itself, something which is impossible to do with die.
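As an aside, a rough pure-Perl stand-in for warn might look like this (a sketch - the real warn also honours $SIG{__WARN__} handlers, which you could emulate as well; die's stack unwinding you could not):

sub my_warn {
   my $msg = join "", @_;

   # mimic warn's "at FILE line N." suffix when the message does
   # not already end in a newline:
   $msg .= sprintf " at %s line %d.\n", (caller)[1, 2]
      unless $msg =~ /\n\z/;

   print STDERR $msg;
}

my_warn "something is wrong";   # reports file and line, like warn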

The response? Nothing. As far as I know (I don't check every year), neither has the documentation changed, nor warn behaviour, nor has the bug report been closed.

And in recent times, people who report security (and other) bugs get blocked from the mailinglist. And while the policy still makes a backwards compatibility promise that isn't being kept, it has been updated to give admins the ability to block anybody for any reason - literally, "stating facts is not enough" anymore, you now have to play their arbitrary political correctness and privilege games.

Conclusion

TL;DR: The world will end, we're all fucked, and perl5porters are a bunch of losers that give my modules a bad rep.