How to (not) speed up Cobbler with Rust

A few weeks ago I ran an experiment: could Cobbler be sped up with Rust?

So what’s the problem? We noticed that generating configs for around 160 systems (systems as in Cobbler systems) is not particularly fast. Changing a distribution and regenerating the configs takes around 15 minutes, and of course Uyuni - which uses Cobbler under the hood - runs into a timeout and shows an internal server error. After some profiling and further investigation together with Enno, it was clear that there isn’t much we can do without taking this completely apart. Don’t get me wrong - we might still do that, but it’s not what this post is about. At some point I wondered whether templating, which takes up roughly 18% of the execution time, couldn’t be improved. Some googling turned up that Cheetah, the templating system Cobbler uses, isn’t actually that slow - it’s just a few milliseconds slower than Jinja in most test scenarios, although Jinja would be the saner choice. But the day came to an end, and the (temporary) solution was to increase the timeout to 20 minutes.

But at the end of the day I was still wondering whether this wouldn’t be faster implemented in Rust, a compiled and much more heavily optimized language. Now, templating is also highly optimized in Python, because it’s just string manipulation and IO, and that’s something Python is pretty good at. A simple test scenario looks like this: load a template, do something with it, and write the result to a file. Repeat n times. Easy enough. So I invested some time in the evening to check how Python with Cheetah compares to Rust with Handlebars. I should probably mention that Rust has really awesome Python interop: with PyO3, writing Python libs in Rust could hardly be easier. So without further ado, here are the results for 20,000 iterations, where Rust is also doing 20,000 full context switches:
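The benchmark harness shape (load, render, write, repeat) can be sketched like this. This is an illustration only: it uses Python’s built-in string.Template as a stand-in for Cheetah and Handlebars, and the template text and file names are made up, since the real test rendered Cobbler config templates.

```python
import tempfile
from pathlib import Path
from string import Template  # stand-in for Cheetah/Handlebars in this sketch

# Hypothetical template text; the real test rendered Cobbler config templates.
TEMPLATE = Template("Hello $name, your distro is $distro\n")

def run(iterations: int, out_dir: Path) -> None:
    """Render the template and write the result to a file, n times."""
    for i in range(iterations):
        rendered = TEMPLATE.substitute(name=f"system{i}", distro="openSUSE")
        (out_dir / f"out_{i}.cfg").write_text(rendered)

with tempfile.TemporaryDirectory() as d:
    run(100, Path(d))  # the real benchmark used 20,000 iterations
```

The same loop body is what gets replaced by a call into the Rust lib in the second variant, with each call crossing the Python/Rust boundary once.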

# Cheetah
60bccd807b3e:/src/templ/target/release # time python3

real    0m3.370s
user    0m2.453s
sys    0m0.863s
60bccd807b3e:/src/templ/target/release # time python3

real    0m3.046s
user    0m2.241s
sys    0m0.778s
60bccd807b3e:/src/templ/target/release # time python3

real    0m3.111s
user    0m2.237s
sys    0m0.839s
60bccd807b3e:/src/templ/target/release # time python3

real    0m3.052s
user    0m2.290s
sys    0m0.741s

# Rust lib
60bccd807b3e:/src/templ/target/release # time python3

real    0m1.883s
user    0m1.154s
sys    0m0.726s
60bccd807b3e:/src/templ/target/release # time python3

real    0m2.132s
user    0m1.408s
sys    0m0.708s
60bccd807b3e:/src/templ/target/release # time python3

real    0m2.194s
user    0m1.261s
sys    0m0.910s
60bccd807b3e:/src/templ/target/release # time python3

real    0m2.164s
user    0m1.248s
sys    0m0.893s

The results were pretty homogeneous: Cheetah averages around 3.1 seconds and the Rust lib around 1.9 seconds. Even though Rust has to do 20,000 context switches (meaning switching from the Python world to the Rust world and back), it’s still over a second faster. Now, this is not exactly our real-world scenario with 160 templates, but still pretty nice.

The next thing I wondered was how expensive the context switches are. Turns out they are not as bad as I thought they would be. Still having 10,000 iterations, we get an average of around 1.7 seconds, so context switching costs us roughly 200 milliseconds in this case. Could have been worse.
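The idea behind that measurement can be sketched in pure Python: time n small calls against one batched call that does all the work on the other side, and read the difference as the total per-call overhead. The two render functions below are hypothetical stand-ins, not the actual PyO3 bindings.

```python
import timeit

def render_once(name: str) -> str:
    # Stand-in for one call that crosses the Python/Rust boundary.
    return f"Hello {name}\n"

def render_batch(names: list[str]) -> list[str]:
    # Stand-in for pushing the whole loop to the other side in one call.
    return [f"Hello {n}\n" for n in names]

names = [f"system{i}" for i in range(10_000)]
per_call = timeit.timeit(lambda: [render_once(n) for n in names], number=1)
batched = timeit.timeit(lambda: render_batch(names), number=1)
# per_call - batched approximates the total overhead of 10,000 crossings
```

With the real Rust lib, render_once would be the PyO3-exported function and render_batch a variant that takes the whole list at once.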

There is something wrong with this experiment, though: all of those results include the startup time of the Python interpreter. So I did some further testing, and it turns out that for the Cheetah example the template generation is roughly 92% of the total execution time, while for the Rust lib it’s nearly 97%. What does that mean? Well, I guess it’s the imports. Since Rust does its optimizations at compile time, the result is stripped of everything that isn’t needed. That’s not the case for Cheetah, and of course the Cheetah lib is not a binary, which means loading it takes more time (template compilation was already done, since I never counted the first execution) and probably also uses more memory.
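One way to separate the templating share from interpreter startup is to start the clock inside the script, after the imports, and compare that against the wall-clock time reported by `time`. A minimal sketch, again with string.Template standing in for Cheetah:

```python
import time
from string import Template  # stand-in for Cheetah; imports happen before the timer

start = time.perf_counter()  # starting here excludes interpreter startup and imports
t = Template("Hello $name\n")
for i in range(10_000):
    t.substitute(name=f"system{i}")
elapsed = time.perf_counter() - start
print(f"templating only: {elapsed:.4f}s")
```

Dividing this inner time by the `real` value from `time` gives the percentage of the run actually spent templating.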

So what is the takeaway for me? Using Rust from Python is very easy and straightforward. What you get are performance improvements even over Python’s highly optimized C layer, and the difference would be a landslide compared to pure Python implementations. I haven’t touched the topic in this little experiment, but Rust also gives you much better concurrency features, especially for CPU-bound work. The memory safety guarantees are pretty neat too: safe Rust code should never ever segfault. But of course in our real-world scenario with 160 iterations the performance improvement is too small to matter. So no Rust in Cobbler yet.

You can find the example code here. And yes, this Rust code sucks and it’s entirely my fault for not knowing Rust any better! ;p

Happy hacking!