The Java Virtual Machine (JVM) is a vital part of modern distributed computing. It is the platform for big data applications like Spark, HDFS, Cassandra, and Hive. While the JVM provides “write once, run anywhere” platform independence, this comes at a cost. The JVM takes time to “warm up”, that is, to load the classes, interpret the bytecode, and so on. This time may not matter much for a long-running Tomcat server, but big data jobs are typically short-lived. Thus the parallelization often used to shorten time-to-results compounds the warmup problem, since each short-lived JVM pays the startup cost again.
David Lion and his colleagues examined JVM performance for a paper presented at the 12th USENIX Symposium on Operating Systems Design and Implementation earlier this month. They found that while many studies had investigated improving the runtime performance of the JVM, little work had examined the startup time of the JVM itself. Optimizations of garbage collection, scheduling, and computation deduplication have helped improve performance, but the largest overhead has remained largely untouched.
A one gigabyte read on an HDFS system, which is an I/O-intensive operation, spends roughly one third of the time in JVM warmup. When the read size drops to 1 megabyte, the warmup time accounts for over 60% of execution. This is significant for many users; a Cloudera study showed that not only do most real-world Hadoop jobs read less than a gigabyte, but many of their customers have job read sizes smaller than 1 megabyte. Half or more of job execution time going to overhead is never a satisfactory situation, but it becomes particularly undesirable when using infrastructure on a time-based billing model.
To address this performance overhead, Lion and his colleagues developed a new JVM called HotTub. Using HotTub, the worst result for a 100 gigabyte Hive query is still 1.10 times faster than with the OpenJDK JVM. Other Hive and Spark queries of varying sizes show improvements in the 1.5x to 2x range. The most impressive performance improvement, however, is with 1 megabyte HDFS reads, which were just over thirty times faster with HotTub.
Runtime keeps decreasing with repeated reuse as well. The sharpest drop in runtime occurs with the first reuse of the JVM, but 1 megabyte HDFS reads continued to show diminishing speedups through 12 runs, at which point the JVM is fully warmed. It is clear that even with reuse, a JVM is not necessarily fully warmed when the first job completes.
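The general shape of this effect can be observed inside a single JVM by timing the same work several times in a row. The following is a minimal sketch, not the benchmark used in the paper: the `WarmupProbe` class and its `workload` method are hypothetical, the workload is arbitrary, and the timings will vary by machine and JVM. A rigorous measurement would use a harness such as JMH.

```java
// Hypothetical probe: times repeated runs of the same work in one JVM.
final class WarmupProbe {

    // Arbitrary CPU-bound stand-in for a job (an assumption, not a real HDFS read).
    static long workload() {
        long sum = 0;
        for (int i = 0; i < 2_000_000; i++) {
            sum += Integer.toString(i).hashCode();
        }
        return sum;
    }

    // Runs the workload `runs` times and records each duration in nanoseconds.
    static long[] timeRuns(int runs) {
        long[] nanos = new long[runs];
        long sink = 0;
        for (int r = 0; r < runs; r++) {
            long start = System.nanoTime();
            sink += workload();
            nanos[r] = System.nanoTime() - start;
        }
        // Use the accumulated result so the JIT cannot discard the work.
        if (sink == 42) {
            System.out.println("unlikely sentinel value");
        }
        return nanos;
    }

    public static void main(String[] args) {
        long[] nanos = timeRuns(12);
        for (int r = 0; r < nanos.length; r++) {
            System.out.printf("run %2d: %d us%n", r + 1, nanos[r] / 1_000);
        }
    }
}
```

On a typical JVM the first run is usually the slowest, and later runs tend to settle as classes are loaded and hot code is compiled, but the exact curve is not guaranteed.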
Traditional JVM processes live for the duration of a single job. HotTub instead maintains a pool of warm JVMs. When HotTub’s java executable is called, it first attempts to connect to a previously launched JVM. Only if a warm JVM is unavailable will it launch a new JVM process. At the end of an application’s life, the HotTub JVM closes any open file descriptors and resets the state, while keeping the loaded classes and compiled code. As a result, subsequent jobs are spared the overhead of launching a new JVM. The exact performance improvement depends on the specifics of the jobs in question, and running similar jobs yields better results than running widely varying jobs.
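The reuse policy described above can be modeled in a few lines. This is a minimal in-process sketch of the idea only, not HotTub’s implementation, which works at the level of JVM processes. The `WarmPoolSketch` and `Worker` names, and the use of string sets to stand in for loaded classes and open files, are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of a warm-worker pool; each Worker stands in for one JVM.
final class WarmPoolSketch {

    static final class Worker {
        final Set<String> loadedClasses = new HashSet<>(); // warm state, kept across jobs
        final Set<String> openFiles = new HashSet<>();     // per-job state, cleared on reset
        int jobsRun = 0;

        void run(String mainClass, String inputFile) {
            loadedClasses.add(mainClass);
            openFiles.add(inputFile);
            jobsRun++;
        }

        // End of a job: drop per-job state but keep the warm state.
        void reset() {
            openFiles.clear();
        }
    }

    private final Deque<Worker> idle = new ArrayDeque<>();
    int launched = 0; // number of cold starts so far

    // Prefer an idle warm worker; launch a new one only if none is available.
    Worker acquire() {
        Worker w = idle.pollFirst();
        if (w == null) {
            w = new Worker();
            launched++;
        }
        return w;
    }

    // Reset the worker and return it to the pool for the next job.
    void release(Worker w) {
        w.reset();
        idle.addLast(w);
    }

    public static void main(String[] args) {
        WarmPoolSketch pool = new WarmPoolSketch();
        for (int job = 0; job < 3; job++) {
            Worker w = pool.acquire();
            w.run("ExampleJob", "input-" + job);
            pool.release(w);
        }
        System.out.println("cold starts: " + pool.launched); // cold starts: 1
    }
}
```

Three sequential jobs cause only one cold start here, because each job finds the previous worker idle in the pool. Jobs that overlap in time would each need their own worker, which is why the model only launches a new one when the pool is empty.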
HotTub isn’t entirely warmth and bubbles, though. While the user experience is to simply point to a new java executable, it does differ from the standard JVM in a few ways. Most notably, there are edge cases with static initialization that can lead to inconsistencies in the environment. The JVM specification explicitly warns against circular dependence and initializations with timing dependencies, but there are no technical measures to prevent such setups. In addition, HotTub cannot handle a SIGKILL sent to the server process. Sending that signal to the server (the warmed JVM) will cause the JVM to be lost. HotTub also incurs a slight performance hit when spawning new JVMs. When there are no available servers in the pool, the connection overhead is 81 milliseconds. This drops to 300 microseconds when warmed JVMs are available.
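To see why circular static initialization is fragile in general, consider two classes whose static fields depend on each other. The sketch below is plain Java and does not reproduce a specific HotTub failure. The class names are hypothetical, and two identical pairs are used only so that both initialization orders can be shown in one run.

```java
// Illustrative only: the values seen depend on which class is touched first.
final class StaticInitOrder {

    // Pair 1: mutually dependent static fields.
    static final class A1 { static int x = B1.y + 1; }
    static final class B1 { static int y = A1.x + 1; }

    // Pair 2: identical code, but accessed in the opposite order below.
    static final class A2 { static int x = B2.y + 1; }
    static final class B2 { static int y = A2.x + 1; }

    // Touches A1 first, so B1 initializes while A1.x still holds the default 0.
    static String aFirst() {
        int x = A1.x;
        int y = B1.y;
        return "x=" + x + " y=" + y;
    }

    // Touches B2 first, so A2 initializes while B2.y still holds the default 0.
    static String bFirst() {
        int y = B2.y;
        int x = A2.x;
        return "x=" + x + " y=" + y;
    }

    public static void main(String[] args) {
        System.out.println(aFirst()); // x=2 y=1
        System.out.println(bFirst()); // x=1 y=2
    }
}
```

The same source code yields different values depending only on access order. Any runtime that changes when or in what order initializers run can therefore change the results a program like this observes.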
Even with the few imperfections, HotTub looks like a great tool. HotTub is open source (under the GNU General Public License, version 2) and the code is available on GitHub, so there is opportunity to provide community improvements. Since installation is as easy as using the HotSpot java executable, it’s easy to test. Large-scale users on public cloud infrastructure may see the most benefit as the reduced runtime could translate directly into cost savings.