Catching Native Crashes under JNI with UNIX Signals

A while back I wrote about the Java Native Interface (JNI) and how to use and abuse it for running native code inside your Java program, either to run legacy code or to have performance-critical code run natively. However, if your native code crashes, which is not that hard to achieve in C/C++, it takes your Java program down with it. If your Java program is a servlet running inside a Tomcat process, for example, your whole web server comes crashing down and a world of pain awaits you.

A friend of mine had quite a clever "solution" to this problem, at least on Linux. While we would like to eliminate all sources of crashes forever, that is unrealistic. We can, however, make debugging a bit easier. We would like to:

  1. catch a crash,
  2. write a crashdump (or at least backtrace) for easier debugging,
  3. throw an exception into the Java-world so it can try and shut down gracefully.

But first we have to learn a few basics about Linux and Unix. If you've missed the first JNI article and are missing a few of the essentials, feel free to read that first.

Fundamentals about UNIX signals

Unix signals are a way for the operating system to notify a process of an event. They are used to stop or pause a process (SIGSTOP), to notify a parent about state changes in a child process (SIGCHLD) or to terminate a process (SIGTERM, SIGKILL). The operating system also interrupts a process and sends it a signal when the process tried to execute an illegal instruction (SIGILL), tried to access an invalid memory location (SIGSEGV) or caused a bus error (SIGBUS). There is even a signal for when the terminal changes size (SIGWINCH).

Whenever a process receives such a signal, there is an action associated with it. A process may block (mask) a signal, which defers its delivery until the signal is unblocked, ignore it, let the default action happen (which terminates the process for most signals and ignores the signal for a few, such as SIGCHLD and SIGWINCH), or handle the signal itself. Most signals can be caught and handled, so the program does not necessarily have to abort. In some cases aborting is the only reasonable way forward, and with SIGKILL it is the only way to go (it cannot be caught, blocked or ignored).
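To make the difference between blocking, handling and ignoring tangible, here is a minimal sketch using SIGUSR1, which is harmless to play with. The function name demo_dispositions and the handler on_usr1 are made up for this illustration, and the sketch assumes a single-threaded program.

```cpp
#include <cassert>
#include <signal.h>

// Set by the handler; sig_atomic_t is the type meant for this purpose.
static volatile sig_atomic_t got_signal = 0;

static void on_usr1(int signum) { got_signal = signum; }

// Walks through blocking, handling and ignoring SIGUSR1.
// Returns true if every step behaved as described in the comments.
bool demo_dispositions() {
    struct sigaction sa {};
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, nullptr) != 0) return false;

    // 1. Block SIGUSR1: a raised signal stays pending, it is not lost.
    sigset_t block, pending;
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    sigprocmask(SIG_BLOCK, &block, nullptr);
    raise(SIGUSR1);
    bool delivered_while_blocked = (got_signal != 0);
    sigpending(&pending);
    bool was_pending = (sigismember(&pending, SIGUSR1) == 1);

    // 2. Unblock: the pending signal is delivered and our handler runs.
    sigprocmask(SIG_UNBLOCK, &block, nullptr);
    bool delivered_after_unblock = (got_signal == SIGUSR1);

    // 3. Ignore: with SIG_IGN the signal is discarded entirely.
    got_signal = 0;
    signal(SIGUSR1, SIG_IGN);
    raise(SIGUSR1);
    bool ignored = (got_signal == 0);

    signal(SIGUSR1, SIG_DFL);  // restore the default action
    return !delivered_while_blocked && was_pending &&
           delivered_after_unblock && ignored;
}
```

Calling demo_dispositions() once from main should return true on Linux; in a multi-threaded program pthread_sigmask would be the appropriate call instead of sigprocmask.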

Let's assume a C program like this. We have a main routine, where we set a pointer into some memory, and write to it in a loop.

#include <stdint.h>

int main () {
    uint8_t* ptr = (void*)0x400000000000;
    for ( int i = 0; i<8192; i++ ) {
        *ptr = 3;
        ptr++;
    }
}

Trying it out, we will receive a SIGSEGV (Segmentation Fault), the exact output could depend on your shell:

$ gcc main.c
$ ./a.out
fish: Job 1, './a.out' terminated by signal SIGSEGV (Segmentation Fault)

Nothing unexpected here. Clearly the 0x4... pointer points into virtual memory that was not mapped into the process. Writing to it causes the processor to trap into a page fault handler, where the operating system checks the address. It looks in its internal data structures and maps memory to that place if it sees that something should be mapped there (see lazy paging or the growing stack). If nothing should be mapped there, a SIGSEGV signal is sent to the offending process. The default action for SIGSEGV is to abort program execution.

But we can override the default action and define our own with [sigaction]. If we set a signal handler for the SIGSEGV signal and map new memory into the space we just wanted to access, similar to [advent(2)], then we will be able to run our program to completion. This is just an example of what we can do. Trying this in production will break things, and you should never ever just ignore a SIGSEGV. While we are purposely writing into some random memory, most programs will trash the stack, heap or other necessary data long before they jump into random memory.

// Have a look here: https://osg.tuhh.de/Advent/06-sigaction/
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#include <signal.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096

void sa_sigsegv(int signum, siginfo_t* info, void* context) {
    // Calculate the page address from the fault address
    // 0xdeadbeef -> 0xdeadb000
    uintptr_t addr = (uintptr_t)info->si_addr;
    addr = addr & (~ (PAGE_SIZE - 1));

    // MMAP one page of anonymous private memory to that location.
    // Afterwards, the program can continue its way to perdition.
    void *ret = mmap((void*)addr, PAGE_SIZE,
                     PROT_READ|PROT_WRITE,
                     MAP_PRIVATE|MAP_ANONYMOUS,
                     -1, 0);
    if (ret == MAP_FAILED) {
        perror("mmap");
        exit(-1);
    }
}

int main () {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_flags     = SA_SIGINFO | SA_RESTART;
    sa.sa_sigaction = &sa_sigsegv;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0) {
        perror("sigaction");
        return -1;
    }

    uint8_t* ptr = (void*)0x400000000000;
    for ( int i = 0; i<8096; i++ ) {
        *ptr = 3;
        ptr++;
    }
}

A Caveat

Some of you reading this will scream at the piece of code above. A signal handler can be called by the OS at any time, so we have no information about locks that are held or the state of other threads. They could be in the kernel right now, trying to allocate memory, or they may be altering important global data structures.

Make no mistake, we are trying to do something useful in a Unix signal handler, which is not what it is there for. There are several sources warning developers about doing anything more than setting a global flag (and even that might be racy). Accessing global data is taboo, and calling functions that are not async-signal-safe, for instance because they use locks internally, is taboo. Let's just say, everything fun should be avoided in signal handlers.
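For comparison, this is roughly what a well-behaved logging step looks like: format the message by hand and emit it with write(2), which POSIX lists as async-signal-safe, instead of going through iostreams or printf. It is only a sketch; format_signal_line and log_signal are hypothetical helper names, not part of the glue library.

```cpp
#include <cassert>
#include <cstddef>
#include <unistd.h>

// Formats "Signal <n>\n" into buf without printf or iostreams.
// Returns the number of bytes written (excluding the terminating NUL),
// or 0 if buf is too small.
std::size_t format_signal_line(int signum, char* buf, std::size_t size) {
    const char prefix[] = "Signal ";
    const std::size_t prefix_len = sizeof(prefix) - 1;

    // Collect the decimal digits in reverse order.
    char digits[12];
    std::size_t n = 0;
    unsigned v = signum < 0 ? 0u : static_cast<unsigned>(signum);
    do {
        digits[n++] = static_cast<char>('0' + v % 10);
        v /= 10;
    } while (v != 0);

    const std::size_t len = prefix_len + n + 1;  // prefix + digits + '\n'
    if (size < len + 1) return 0;

    char* out = buf;
    for (std::size_t i = 0; i < prefix_len; ++i) *out++ = prefix[i];
    while (n > 0) *out++ = digits[--n];
    *out++ = '\n';
    *out = '\0';
    return len;
}

// Handler-friendly logging: a local buffer and a single write(2).
void log_signal(int signum) {
    char line[32];
    const std::size_t len = format_signal_line(signum, line, sizeof(line));
    if (len > 0) {
        ssize_t rc = write(STDERR_FILENO, line, len);
        (void)rc;  // nothing sensible to do on failure inside a handler
    }
}
```

A handler could call log_signal(signum) as its first statement; anything beyond that (symbolizing a backtrace, allocating memory) leaves the safe subset again.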

What we are about to do works for my friend, but it may well not work for you. It is based on undefined behaviour, which might lead to nothing bad happening at all, random signals or crashes, subtle behavioural changes or nasal demons.

What we want to achieve

The above example shows a use case for a signal handler (ignoring the quite obvious warnings about signal handlers). Most of the time, the SIGSEGV action should not be overridden, or if so, only to print some more information (which is still undefined behaviour). Some signals, like SIGSEGV, indicate a program error that cannot be recovered from. When you receive SIGSEGV you cannot be certain that all of the process' memory is intact.

However, we want to run a Java program that links to native code via the Java Native Interface (JNI). The scenario is that we have a library of legacy code that cannot easily be ditched or rewritten in Java, so for the time being we need to run this unsafe code and shield it, as well as we can, from potentially malicious user input.

Unsafe native code can still crash, however. In C or C++ it is easy to lose track of the lifetime of your objects, to forget to check a pointer for null, or to fail to ensure that the object you are calling is the object you meant to call. Java protects you from some of these errors with its type safety and garbage collection.

But anyway: your native code could crash, and it could bring down your whole Java process with it. Remember, if we receive a SIGSEGV signal and don't handle it (assuming the JVM doesn't gracefully deal with it), the whole process is aborted, including everything that runs inside it. If that is a whole Tomcat web server, it will come down as well. Very ungraceful.

So, let's try and catch the crash, gather some information and maybe even let our Java program catch an Exception and shut down somewhat gracefully.

Glueing it back together

In my previous article we set up a thin wrapper glue library that is actually loaded by the Java VM, converts parameter types and jumps into our native C/C++ code. Who says we can't use this library to have some fun?

Firstly, we will need a signal handler. This is where the mechanism can become quite unstable: the Java Virtual Machine uses signals itself, so overriding or chaining signal handlers the wrong way can break our Java program (now or with a future JVM version), and I wouldn't recommend this approach for production use. You've been warned.

But still, let's write ourselves a signal handler in our glue-library, that gathers a crash log, prints it to stderr and exits:

void SignalHandler( int signum, siginfo_t* si, void* uc ) {
    std::cerr << "Signal " << signum << " (SIG" << sigabbrev_np(signum) << ")" << std::endl;

    void* buffer[64];  // stack array: no cast needed, and no malloc inside a signal handler
    int num = backtrace ( buffer, 64 );
    backtrace_symbols_fd ( buffer, num, STDERR_FILENO );
    abort();
}

jint JNI_OnLoad( JavaVM* vm, void* res ) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_flags     = SA_SIGINFO | SA_RESTART;
    sa.sa_sigaction = &SignalHandler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0) {
        std::cout << "[FATAL] sigaction failed" << std::endl;
        return -1;
    }

    return JNI_VERSION_1_2;
}

We've created an odd-looking function, JNI_OnLoad. This function is called by Java when the library is loaded. It is there to initialize the native library; many C libraries need to call an initializer function for preparing internal data structures before doing anything useful. We use it to register a signal handler. A similar function exists for unloading a library, see [Java: Invocation API].

But are we actually registering a signal handler? If we read a bit more of the JVM docs, we will find [Java: Signal Handling] and [Java: Signal Chaining]. If we just use this piece of code and don't bother with anything else, our Java program will crash occasionally, for example when a NullPointerException would be thrown, or maybe when new objects are allocated. That's no fun.

The JVM uses Unix signals to detect null pointer accesses and turns those into a NullPointerException. It also uses Unix signals to "automatically" grow Java stacks. We've just replaced a signal handler that our JVM relies on, so we're hindering the JVM from doing its job.

The JVM has a facility for signals that it doesn't know what to do with: signal chaining. If we preload libjsig.so, it intercepts calls like sigaction. This way any call to sigaction is intercepted, and libjsig stores our signal handler internally. If the JVM catches a signal that it cannot do anything useful with in its own handlers, it will call (chain) our signal handler after its own.

This behaviour includes some SIGSEGV errors (others are consumed by the JVM, for example to implement an auto-growing stack). It also means we have no control over the real signal handler, but we can execute code when signals are caught.
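The general idea of chaining can be sketched without the JVM: a handler remembers whatever was registered before it and passes the signal along. Note that this is a generic illustration, not how libjsig is implemented, and the order is reversed compared to the JVM case (here the newer handler runs first). All names are invented for the example, and SIGUSR2 stands in for the real signal.

```cpp
#include <cassert>
#include <signal.h>

static volatile sig_atomic_t first_calls = 0;   // counts calls of the older handler
static volatile sig_atomic_t second_calls = 0;  // counts calls of the newer handler
static struct sigaction previous_action;        // what was registered before us

static void first_handler(int, siginfo_t*, void*) {
    first_calls = first_calls + 1;
}

static void chaining_handler(int signum, siginfo_t* info, void* context) {
    second_calls = second_calls + 1;
    // Pass the signal on to whoever was registered before us.
    if (previous_action.sa_flags & SA_SIGINFO) {
        if (previous_action.sa_sigaction != nullptr) {
            previous_action.sa_sigaction(signum, info, context);
        }
    } else if (previous_action.sa_handler != SIG_DFL &&
               previous_action.sa_handler != SIG_IGN) {
        previous_action.sa_handler(signum);
    }
}

// Registers fn for SIGUSR2 and optionally reports the previous registration.
static bool install(void (*fn)(int, siginfo_t*, void*), struct sigaction* old) {
    struct sigaction sa {};
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = fn;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR2, &sa, old) == 0;
}

// Installs two handlers one after the other and raises the signal once.
// Returns true if both handlers ran exactly once.
bool demo_chaining() {
    if (!install(first_handler, nullptr)) return false;
    if (!install(chaining_handler, &previous_action)) return false;
    raise(SIGUSR2);
    return first_calls == 1 && second_calls == 1;
}
```

The second sigaction call returns the old registration through its third argument, which is exactly the information an interposing library needs to keep in order to chain.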

libjsig.so has to be loaded before libc, which would otherwise provide sigaction itself. To preload it, we invoke our program with the LD_PRELOAD environment variable set (see [ld.so]):

LD_PRELOAD=/usr/lib/jvm/java-17-openjdk/lib/server/libjsig.so java -jar jnitest.jar

Sending a SIGSEGV to the process (or provoking one from native code), will lead to output like this:

Hello World
Signal 11 (SIGSEGV)
/home/naums/jni-test/build/libJniGlue.so(SignalHandler+0xb0)[0x7f8034f96289]
/usr/lib/libc.so.6(+0x3e710)[0x7f8035188710]
/home/naums/jni-test/build/libJniNative.so(native_hello_world+0x3f)[0x7f8034f8c158]
/home/naums/jni-test/build/libJniGlue.so(Java_de_snaums_jnitest_java_1hello_1world+0x15)[0x7f8034f96394]
[0x7f801c74453a]
fish: Job 1, 'java -jar jnitest.jar' terminated by signal SIGABRT (Abort)

So now we have a backtrace on stderr.

Longjumping back to Java

Nice, we get a backtrace, this should at least help to identify problems and fix them in the native code, if they arise. But our process is still gone.

Maybe you've found the functions setjmp and longjmp before, and maybe you've wondered about what they could be used for. These functions perform a "nonlocal goto", which is not that useful of a description if you're new to them.

In essence, setjmp stores the current processor state, and longjmp returns back to it. They will not alter any (global) data, they won't unwind the stack, they will quite literally just jump back, like a goto. setjmp in a sense returns two times: once setting the jump, once returning through longjmp. One could use it to implement exceptions and their non-local returns (if you don't need stack-unwinding, that is).

Let's first see, how we use setjmp normally, albeit a bit useless:

#include <iostream>
#include <setjmp.h>

int main() {
    jmp_buf buf;
    std::cout << "Starting program" << std::endl;

    // first return: rc will be 0
    // second return: will be 1 (set by longjmp)
    int rc = setjmp(buf);

    std::cout << "Starting calculation" << std::endl;
    // some random maths
    int c;
    int d = c + 12;
    c = 30;

    std::cout << "Value of d:" << d << std::endl;

    if ( rc == 0 ) {
        longjmp ( buf, 1 );
    }

    return 0;
}

$ g++ setjmp.cpp
$ ./a.out
Starting program
Starting calculation
Value of d:32674
Starting calculation
Value of d:42

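Ah, hang on. What happened here? We forgot to initialize c, so it is uninitialized on the first pass, and d is calculated from that garbage. Then we set c to 30. On our second pass (returning a second time from setjmp), c still holds 30, so d comes out as 42. Strictly speaking, reading the uninitialized c is undefined behaviour, and a local that is modified between setjmp and longjmp has an indeterminate value afterwards unless it is declared volatile, so this output is what my compiler happened to produce, not a guarantee. The point stands, though: setjmp and longjmp cannot be used to turn back time, but rather to jump to more or less arbitrary places in your program more or less safely (a lot safer than if you were to implement a non-local jump yourself in assembly).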
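If you want locals to survive the jump with a defined value, declare them volatile. A minimal sketch of the same round trip with well-defined behaviour follows; count_passes is a made-up name for this illustration.

```cpp
#include <cassert>
#include <csetjmp>

// Jumps back to the setjmp point until the counter reaches `limit`
// and returns how many passes were made.
int count_passes(int limit) {
    std::jmp_buf env;
    // volatile: the variable is modified between setjmp and longjmp,
    // so without it the value after the jump would be indeterminate.
    volatile int passes = 0;

    setjmp(env);  // execution resumes here after every longjmp
    passes = passes + 1;
    if (passes < limit) {
        std::longjmp(env, 1);
    }
    return passes;
}
```

count_passes(3) walks through the code after setjmp three times and returns 3, which shows that the data written before the jump is kept and only the control flow is rewound.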

Now we can use setjmp and longjmp to return control to our Java program and maybe raise an exception, so the Java side could, in theory, shut down cleanly. So we extend our glue library:

static jmp_buf longjmp_buffer;

void SignalHandler( int signum, siginfo_t* si, void* uc ) {
    std::cerr << "Signal " << signum << " (SIG" << sigabbrev_np(signum) << ")" << std::endl;

    void* buffer[64];  // stack array: malloc'ed memory would leak when we longjmp away
    int num = backtrace ( buffer, 64 );
    backtrace_symbols_fd ( buffer, num, STDERR_FILENO );
    // here we jump back to the glue-function
    longjmp( longjmp_buffer, 1 );
}

// JNI_OnLoad ...

// we create and throw a NoClassDefFoundError exception
jint throwException(JNIEnv* env) {
    jclass exClass;
    const char *className = "java/lang/NoClassDefFoundError";

    exClass = env->FindClass(className);
    if (exClass == NULL) {
        std::cerr << "Could not find NoClassDefFoundError, Aborting" << std::endl;
        abort();
    }

    return env->ThrowNew(exClass, "A JNI Problen occurred");
}

void Java_de_snaums_jnitest_java_1hello_1world 
    (JNIEnv * env, jclass cls) {
        // setting the longjmp. The second time (error case), we'll
        // raise an exception
        if ( setjmp ( longjmp_buffer ) == 0 ) {
            native_hello_world();
        } else {
            throwException( env );
        }
}

jint Java_de_snaums_jnitest_java_1sum
    (JNIEnv *env, jclass cls, jint first, jint second) {
        // setting the longjmp. The second time (error case), we'll
        // raise an exception
        if ( setjmp ( longjmp_buffer ) == 0 ) {
            return (jint) native_sum( (int)first, (int)second );
        } else {
            throwException( env );
            return 0; // value is ignored: an exception is pending on return
        }
}

If I introduce a write through a null pointer in native_hello_world, the output is:

$ java -jar jnitest.jar
Hello World
Signal 11 (SIGSEGV)
/home/naums/jni-test/build/libJniGlue.so(SignalHandler+0xb0)[0x7fd09b1942d9]
/usr/lib/libc.so.6(+0x3e710)[0x7fd09b35a710]
/home/naums/jni-test/build/libJniNative.so(native_hello_world+0x3f)[0x7fd09b18a158]
/home/naums/jni-test/build/libJniGlue.so(Java_de_snaums_jnitest_java_1hello_1world+0x2d)[0x7fd09b19448c]
[0x7fd08474453a]
Exception in thread "main" java.lang.NoClassDefFoundError: A JNI Problen occurred
    at de.snaums.jnitest.java_hello_world(Native Method)
    at de.snaums.fun.main(fun.java:10)

We've thrown NoClassDefFoundError as a placeholder; you might want to look up a list of predefined exceptions or create a new exception class yourself to throw. The only thing missing is to catch this exception, which I leave as an exercise for the reader.
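One detail the glue code above glosses over: while a handler runs, the signal that triggered it is blocked, and on Linux a plain longjmp out of the handler does not unblock it again. A second crash would then not reach the handler. The POSIX pair sigsetjmp/siglongjmp saves and restores the signal mask along with the jump. The following is a minimal standalone sketch of that idea, not a drop-in replacement for the glue library: guarded_call, install_crash_handler and the two stand-in functions are hypothetical, and raise(SIGSEGV) merely simulates a fault.

```cpp
#include <cassert>
#include <setjmp.h>
#include <signal.h>

static sigjmp_buf recovery_point;
static volatile sig_atomic_t armed = 0;        // are we inside a guarded call?
static volatile sig_atomic_t last_signal = 0;  // which signal ended the call

static void crash_handler(int signum) {
    if (armed) {
        armed = 0;
        last_signal = signum;
        siglongjmp(recovery_point, 1);  // also restores the saved signal mask
    }
    // Not inside a guarded call: fall back to the default action.
    signal(signum, SIG_DFL);
    raise(signum);
}

bool install_crash_handler() {
    struct sigaction sa {};
    sa.sa_handler = crash_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, nullptr) == 0;
}

// Runs fn and returns 0 if it completed, or the signal number if it "crashed".
int guarded_call(void (*fn)()) {
    if (sigsetjmp(recovery_point, 1) == 0) {  // 1 = save the signal mask
        armed = 1;
        fn();
        armed = 0;
        return 0;
    }
    return last_signal;
}

// Hypothetical stand-ins for native code.
void well_behaved() {}
void crashing() { raise(SIGSEGV); }  // simulates a fault
```

After install_crash_handler(), guarded_call(crashing) can be invoked repeatedly and reports SIGSEGV each time, because the mask saved by sigsetjmp is put back on every jump. All the earlier warnings still apply: the state of the crashed code is unknown after the jump.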

Oh Stack, my Stack

So we've solved the SIGSEGV problem, right? Right?

Yeah, no. We receive a SIGSEGV when we try to write to a memory location that does not belong to us, and that includes stray writes to anywhere. But a SIGSEGV is also raised when your native stack runs out of room. By default, Linux limits a process's stack to 8 MB (see ulimit -s). The stack grows up to that size automatically, but cannot extend beyond the limit.
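The limit that ulimit reports can also be read from inside the process with getrlimit. A small sketch; stack_soft_limit is a made-up helper name.

```cpp
#include <cassert>
#include <sys/resource.h>

// Returns the soft stack limit in bytes, RLIM_INFINITY if it is unlimited,
// or 0 if the query failed.
rlim_t stack_soft_limit() {
    struct rlimit lim {};
    if (getrlimit(RLIMIT_STACK, &lim) != 0) return 0;
    return lim.rlim_cur;
}
```

On a typical desktop Linux this returns 8388608 (8 MB), but the value depends on how the process was started, so treat it as configuration rather than a constant.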

So, when we introduce an infinite recursion (in native_hello_world) and fill our native stack, we should see a SIGSEGV and handle it in our handler.

$ LD_PRELOAD=/usr/lib/jvm/java-17-openjdk/lib/server/libjsig.so  java -jar jnitest.jar
...
Hello World
Hello World
Hello World
Hello World
fish: Job 1, 'java -jar jnitest.jar' terminated by signal SIGSEGV (Segmentation fault)

Aha! We get a SIGSEGV, but our native signal handler is not called. We've exhausted our stack area, so the kernel has no room to place the signal frame (the return information and the signal details) on the stack before calling our signal handler. Setting up the handler therefore faults as well, and the kernel kills our application outright.

Luckily, UNIX has a function [sigaltstack], with which we can register an alternate stack for signal handlers to use. Then, when a signal is caught, the operating system switches to this alternate stack, places the return information there, and starts our signal handler.

Let's try that in a normal C++ test program:

#include <signal.h>
#include <sys/mman.h>
#include <unistd.h>   // getpagesize
#include <cstdint>    // int8_t
#include <cstring>
#include <cstdlib>

#include <iostream>

// we could also use SIGSTKSZ
#define SIGNAL_STACK_SIZE (4096*4)

void setup_alternate_stack () {
    int8_t *mem = ( int8_t * )mmap( NULL,
                                    SIGNAL_STACK_SIZE + 2 * getpagesize(),
                                    PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON,
                                    -1, 0 );
    if ( mem == MAP_FAILED )
    {
        std::cout<< "[FATAL] Unable to mmap Signal stack " << std::endl;
    }
    else
    {
        // We set up 2 Guard pages, one before our stack, one after
        // so if we write over the end while executing the signal handler
        // we will be killed (safely)
        mprotect( mem, getpagesize(), PROT_NONE );
        mprotect( mem + getpagesize() + SIGNAL_STACK_SIZE, getpagesize(), PROT_NONE );

        stack_t ss;
        memset( &ss, 0, sizeof( stack_t ) );
        ss.ss_flags = 0;
        ss.ss_size = SIGNAL_STACK_SIZE;
        ss.ss_sp = mem + getpagesize();

        stack_t os;
        memset( &os, 0, sizeof( stack_t ) );
        int rc = 0;
        rc = sigaltstack( &ss, &os );
        if ( rc != 0 )
        {
            std::cout << "[FATAL] Setting alternate signal stack failed " << std::endl;
        }
        else
        {
            std::cout << "[INFO ] Set alternate signal stack " << std::endl;
        }
        std::cout << "Original stack: " << os.ss_sp << " size " << os.ss_size << " flags " << os.ss_flags << std::endl;
    }
}

void recurse(int b) {
    int a = 12;
    recurse(a);
}

void SignalHandler(int signum, siginfo_t* si, void* ucontext) {
    std::cout << "Hello from Stack " << &signum << std::endl;
    abort();
}

int main() {
    setup_alternate_stack();

    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    // SA_ONSTACK is needed for the handler to run on the alternate stack
    sa.sa_flags     = SA_SIGINFO | SA_RESTART | SA_ONSTACK;
    sa.sa_sigaction = &SignalHandler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0) {
        std::cout << "[FATAL] sigaction failed" << std::endl;
        return -1;
    }

    recurse(12);
}

Yes, this works.

[INFO ] Set alternate signal stack
Original stack: 0 size 0 flags 2
Hello from Stack 0x7f6f59a7e9ac
fish: Job 1, './a.out' terminated by signal SIGABRT (Abort)

Now let's try it with Java, adding the alt-stack:

LD_PRELOAD=/usr/lib/jvm/java-17-openjdk/lib/server/libjsig.so java -jar jnitest.jar
...
Hello World
Hello World
fish: Job 1, 'LD_PRELOAD=/usr/lib/jvm/java-17…' terminated by signal SIGSEGV (Address boundary error)

Hm, funny. It doesn't work.

Let's remind ourselves: Java needs to handle some signals itself, SIGSEGV among them, so it registers its own signal handlers with the system. Preloading libjsig.so prevents the native code from registering its handlers with the system; instead the JVM stores them and chains them after its own handlers.

This also means, that the Java signal handlers would run on the alternate stack, if this were successful. Unfortunately it isn't. Look at our pure C++ example from above. We've added a flag to our sigaction call: SA_ONSTACK. This is necessary, so the Signal Handler will be using our alternate stack. If it is missing, the alternate stack is not used for this signal handler. And this missing flag is our problem here.
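Whether a registered handler carries the flag can be checked by querying the current disposition: calling sigaction with a null new action only reads. The sketch below is standalone and uses SIGUSR1; handler_uses_alt_stack and register_noop are hypothetical helpers. With libjsig preloaded, sigaction is intercepted, so such a query may describe the handler libjsig has stored for us rather than what the kernel actually holds.

```cpp
#include <cassert>
#include <signal.h>

// Returns 1 if the handler registered for signum has SA_ONSTACK set,
// 0 if not, and -1 if the query failed.
int handler_uses_alt_stack(int signum) {
    struct sigaction current {};
    if (sigaction(signum, nullptr, &current) != 0) return -1;  // query only
    return (current.sa_flags & SA_ONSTACK) ? 1 : 0;
}

static void noop_handler(int) {}

// Registers noop_handler for signum with the given flags; true on success.
bool register_noop(int signum, int flags) {
    struct sigaction sa {};
    sa.sa_handler = noop_handler;
    sa.sa_flags = flags;
    sigemptyset(&sa.sa_mask);
    return sigaction(signum, &sa, nullptr) == 0;
}
```

Registering the no-op handler once without and once with SA_ONSTACK and querying in between shows the flag flipping from 0 to 1.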

Unfortunately there is no way for us to override the JVM's signal registrations.

Going Bonkers

Or is there?

Remember, libjsig intercepts sigaction, but it cannot intercept the raw system call if we manage to invoke it ourselves. Funnily enough, libc offers a "nice" wrapper for that, the syscall function. Let's use it to call rt_sigaction on the Java SIGSEGV handler and force the SA_ONSTACK flag.

#include <sys/syscall.h>
#include <unistd.h>

// this is stupid, never do this in real life
struct sigaction java_handler;
syscall( SYS_rt_sigaction, SIGSEGV, nullptr, &java_handler, sizeof( sigset_t ) );
if ( java_handler.sa_handler != SIG_DFL ) {
    std::cout << "attempting something very stupid" << std::endl;
    java_handler.sa_flags |= SA_ONSTACK;
    syscall( SYS_rt_sigaction, SIGSEGV, &java_handler, nullptr, sizeof( sigset_t ) );
}

We call it twice here: first to fetch the current handler, i.e. the one the JVM set, and then, after adding the SA_ONSTACK flag, to install the modified structure again.

Even if we place this code right before the recursion call, we will have no luck, unfortunately.

...
Hello World
attempting something very stupid
fish: Job 1, 'LD_PRELOAD=/usr/lib/jvm/java-17…' terminated by signal SIGSEGV (Address boundary error)

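Maybe the JVM resets its handlers somewhere, or maybe it expects to run on its normal C stack? I was unable to find out the reason. One candidate worth ruling out: the raw rt_sigaction call expects the kernel's own structure layout and signal-set size (8 bytes on x86-64), not glibc's struct sigaction and its 128-byte sigset_t, so the two calls above may simply fail with EINVAL, and the snippet never checks their return values. Either way, further tampering with the JVM from the outside would not increase or guarantee coverage. So let's stop here.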
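For anyone who wants to dig further, the raw rt_sigaction system call is picky about its last argument, and that is easy to probe in isolation. The following query-only sketch assumes 64-bit Linux with glibc (x86-64 or aarch64), where the kernel's signal set is 8 bytes while glibc's sigset_t is 128 bytes. raw_query_sigaction is a hypothetical helper, and the buffer is deliberately opaque because the kernel's structure layout is architecture-specific.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <signal.h>
#include <sys/syscall.h>
#include <unistd.h>

// Queries (never modifies) the SIGUSR1 disposition through the raw system
// call, passing `sigsetsize` as the size of the signal mask.
// Returns 0 on success or the errno value on failure.
int raw_query_sigaction(std::size_t sigsetsize) {
    // Opaque, generously sized buffer: the kernel's layout is not interpreted here.
    alignas(16) unsigned char kernel_act[256] = {};
    long rc = syscall(SYS_rt_sigaction, SIGUSR1, nullptr, kernel_act, sigsetsize);
    return rc == 0 ? 0 : errno;
}
```

Under these assumptions raw_query_sigaction(sizeof(sigset_t)) reports EINVAL, whereas raw_query_sigaction(8) succeeds. Getting the size right is only half of the job: a structure passed in to change the handler must also follow the kernel's field order, which differs from glibc's struct sigaction.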

Conclusion

In this article we've seen that we are (sometimes) able to use signal handlers and signal chaining in native code linked into a Java VM via the Java Native Interface (JNI) to handle some errors, report backtraces and maybe even raise exceptions and return to Java. The Java side should then do as little as possible: allocate no more memory, just shut down and maybe restart.

Keep in mind that most of what we've done in our signal handlers is undefined behaviour and may introduce security bugs, leak data or both. So while this is a nice technical exercise, it should never be used in production code. There are also native bugs that cannot be reported by this method, so a solid external mechanism for collecting core dumps should be in place, to at least find the bugs that crash the legacy code and eliminate them.

References

Last edit: 31.12.2023 17:00