Sunday, 10 January 2016

Choosing between Ratpack and Dropwizard.


This post discusses our team’s approach to evaluating the Java-based web framework Ratpack as an alternative to Dropwizard for creating a new microservice. It assumes a basic prior knowledge of both Ratpack and Dropwizard.

Background


The team I'm on is full of developers, all with experience of creating and maintaining applications built with the Dropwizard framework. It’s safe to say that Dropwizard would be the team’s default preference when considering frameworks for web-based Java apps. However, we have recently been tasked with building a new microservice that is as efficient as possible. Although this non-functional requirement isn’t 100% specified, we do know the service will need to deliver high throughput while using few resources (i.e. memory and CPU). High costs from the company’s PaaS provider have no doubt influenced this requirement.

Ratpack is said to allow you to create "high performance" services that are capable of meeting these non-functional requirements assuming that your application is I/O bound.  It's because of this that the team evaluated it.

Dropwizard vs Ratpack performance


A Dropwizard vs Ratpack performance test was conducted to see if the reputed performance benefits could be observed for our use case.

Two webapps that performed the same steps (detailed below) were created, one with Dropwizard and the other with Ratpack.

Architecture Diagram




Load Profiles  


Three load profiles were given to each application as detailed below:
  1. 5 minute duration, 100 JMeter Threads
  2. 5 minute duration, 200 JMeter Threads
  3. 5 minute duration, 300 JMeter Threads

Results


The third test run resulted in the Dropwizard application eventually failing to respond.  


The graph above shows that throughput for the first two test runs was very close. The third test run shows Ratpack answering significantly more requests per second because the Dropwizard application began failing to respond. Exactly why the Dropwizard application gave up was not investigated.





Here we see Ratpack using significantly less memory in all but the last test run.






Ratpack seemed to use around half as much CPU as Dropwizard.




Here we see the true nature of a non-blocking framework in action. Ratpack used just twenty threads in each test run. Knowing the basic premise of non-blocking I/O, this shouldn't come as a surprise. However, I still find this very impressive!
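To make the idea concrete, here is a minimal sketch using only the JDK. It is not Ratpack's API and not the code from our benchmark: the class name, the pool of four threads, the 500 simulated requests and the 100 ms downstream latency are all assumptions for illustration. Each "request" registers a callback and gives its thread back straight away, so a handful of threads can have hundreds of requests in flight at once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only: plain JDK classes, not Ratpack's API.
class FewThreadsManyRequests {

    // Simulates a non-blocking downstream call: the returned future completes
    // after latencyMs, and no thread is parked waiting for it in the meantime.
    static CompletableFuture<String> callDownstream(ScheduledExecutorService loop, int id, long latencyMs) {
        CompletableFuture<String> response = new CompletableFuture<>();
        loop.schedule(() -> { response.complete("response-" + id); }, latencyMs, TimeUnit.MILLISECONDS);
        return response;
    }

    // Starts `requests` concurrent requests on a pool of `threads` threads and
    // returns how many distinct threads ended up handling the responses.
    static int distinctThreadsUsed(int requests, int threads, long latencyMs) {
        ScheduledExecutorService loop = Executors.newScheduledThreadPool(threads);
        Set<String> threadNames = ConcurrentHashMap.newKeySet();
        try {
            List<CompletableFuture<Void>> inFlight = new ArrayList<>();
            for (int i = 0; i < requests; i++) {
                // The response handler always runs on the small pool, never on a per-request thread.
                inFlight.add(callDownstream(loop, i, latencyMs)
                        .thenAcceptAsync(body -> threadNames.add(Thread.currentThread().getName()), loop));
            }
            CompletableFuture.allOf(inFlight.toArray(new CompletableFuture[0])).join();
            return threadNames.size();
        } finally {
            loop.shutdown();
        }
    }

    public static void main(String[] args) {
        int used = distinctThreadsUsed(500, 4, 100);
        System.out.println("500 concurrent requests were handled by " + used + " thread(s)");
    }
}
```

A thread-per-request server would need roughly one thread for each of those 500 in-flight requests, which is where the difference in thread count comes from.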



Performance Summary

Ratpack can clearly handle more requests with fewer resources in an application that spends most of its time waiting for I/O. This was of major significance to us since we knew from the start our application would be I/O bound. It's also worth pointing out that (at least in my experience) I/O bound webapps are very common.

Other concerns


In order for the organisation to support our new service, it needs to comply with a few standards.  

Logging

A common design pattern in a microservices architecture is to assign some kind of UUID to each request, which is then passed to downstream systems. This UUID can then be added to log events so that they can be correlated across multiple systems. This can be (and at our organisation has been) implemented using SLF4J's Mapped Diagnostic Context (MDC), which "...manages data on a per thread basis." However, it should be noted that "a server that recycles threads might lead to false information".
Ratpack recycles threads, but luckily we're covered: an MDCInterceptor has been created which addresses this very problem.
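The following sketch shows why recycled threads are a problem for per-thread context. It uses a plain ThreadLocal as a stand-in for MDC (an assumption for illustration, so that it runs with the JDK alone), and the class and method names are hypothetical. It is not how MDCInterceptor works internally; it only demonstrates the stale-data issue the interceptor exists to solve.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch only: a plain ThreadLocal stands in for SLF4J's MDC.
class StaleContextDemo {

    // Per-thread storage, analogous to the way MDC keeps its values.
    static final ThreadLocal<String> REQUEST_ID = new ThreadLocal<>();

    // Runs two "requests" one after the other on a single recycled thread.
    // The first sets a request ID; the second never sets one and reports what it sees.
    static String idSeenBySecondRequest(boolean clearAfterFirst) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Runnable firstRequest = () -> {
                REQUEST_ID.set("request-1");
                // ... handle the request, logging with the ID ...
                if (clearAfterFirst) {
                    REQUEST_ID.remove();
                }
            };
            Callable<String> secondRequest = REQUEST_ID::get;

            pool.submit(firstRequest).get();
            return pool.submit(secondRequest).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("without cleanup: " + idSeenBySecondRequest(false)); // request-1 (stale)
        System.out.println("with cleanup:    " + idSeenBySecondRequest(true));  // null
    }
}
```

Without cleanup, the second request would log the first request's ID, which is exactly the "false information" the MDC documentation warns about.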

Deployment


The organisation relies on the simplicity that comes with building and deploying Java-based services as Uber/Shaded/Runnable jars. Since Ratpack applications can be built in just the same way, this wasn't an issue.

Integration with Hystrix


Hystrix is a great library that helps you make your services fault tolerant. Not only is it great, it's become the unofficial standard within our organisation. Hystrix is flexible enough that any Java-based application can integrate with it in many different ways: it supports synchronous, asynchronous and reactive programming models. As well as supporting non-blocking calls, it supports the use of semaphores instead of thread pools to manage concurrent downstream calls. This flexibility is absolutely crucial! If Hystrix mandated the use of a thread pool or blocking calls, we'd be back to the Dropwizard performance characteristics (shown above) of having one thread per concurrent request we handle (assuming each request results in a downstream system call, which is true in our case), which would negate Ratpack's benefits almost entirely.
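To show what semaphore-based isolation means in general terms, here is a small JDK-only sketch. It is not Hystrix's API or implementation; the class name, method names and the fail-fast behaviour are my own assumptions for illustration. The point is that a semaphore caps the number of in-flight downstream calls without dedicating a thread to each one.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Illustrative sketch of semaphore-based concurrency limiting; not Hystrix code.
class SemaphoreBulkhead {

    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    // Starts the asynchronous call if a permit is free; otherwise fails fast.
    // The caller's thread is never blocked, and no extra thread is created.
    <T> CompletableFuture<T> execute(Supplier<CompletableFuture<T>> call) {
        if (!permits.tryAcquire()) {
            CompletableFuture<T> rejected = new CompletableFuture<>();
            rejected.completeExceptionally(new IllegalStateException("too many concurrent calls"));
            return rejected;
        }
        try {
            // Give the permit back once the downstream call finishes, successfully or not.
            return call.get().whenComplete((result, error) -> permits.release());
        } catch (RuntimeException e) {
            permits.release();
            throw e;
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }

    public static void main(String[] args) {
        SemaphoreBulkhead bulkhead = new SemaphoreBulkhead(1);
        CompletableFuture<String> slowResponse = new CompletableFuture<>();

        CompletableFuture<String> first = bulkhead.execute(() -> slowResponse);
        CompletableFuture<String> second = bulkhead.execute(() -> CompletableFuture.completedFuture("unused"));

        System.out.println("second call rejected: " + second.isCompletedExceptionally());
        slowResponse.complete("ok");
        System.out.println("first call result: " + first.join());
    }
}
```

A thread-pool approach would instead hand every downstream call to its own pooled thread, which brings back the one-thread-per-concurrent-request cost described above.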

Not only is it possible to use Hystrix with Ratpack, it's also easy to use features such as request caching, request collapsing and streaming metrics to the dashboard (which is awesome by the way) with the ratpack-hystrix JAR.

Complexity (Learning Curve)


The benefits of Ratpack come at the cost of complexity. Our team (myself very much included) is used to building traditional synchronous Java apps. The effort of adopting an asynchronous programming model was difficult for the team to assess, but it was certainly a known risk that could affect delivery time.

Decision Time - Ratpack or Dropwizard


With the amazing performance from our benchmark, and a sense of optimism that we’d pick up the new framework and asynchronous programming quickly, the decision to go with Ratpack was made.



My next post will detail the team's experience in general with Ratpack. 

Thanks to the team for all of the shared learning so far, and especially to @mirkonasato for leading the performance benchmark work detailed above.

9 comments:

  1. Hi Phill,

    Interesting article.

    Can you provide some more background on the economic decisions behind this?
    Your conclusion seems to be that 20-30% efficiency improvements on CPU utilisation and throughput measures are enough to justify taking on the engineering complexity. I'm curious what scale you're running at to justify that.

    How did you measure memory consumption? Did this include off-heap?

    How did you decide what conditions to model e.g. latency in the upstream services?

    Did you consider building an async backend on Dropwizard (i.e. doing your backend calls concurrently and syncing prior to responding)?

    Cheers,
    Tom

  2. Hi Tom,
    Thanks very much for taking an interest. All good points which I'll try and answer.

    * This service will be used at very high loads. The current figures we're estimating for peak load are between 0.5 million and 3 million concurrent users. This equates to 1,750 and 10,500 calls per second respectively. This is obviously a huge range, but even at the minimum it's a huge volume so there was certainly a feeling that it was worth some level of effort to realise the benefits of ratpack.

    * The memory usage was measured via jconsole; however, only the heap memory was recorded. This isn't ideal, as it would obviously have been better to measure the absolute total. If we get a chance to repeat this test, we'll make sure to do that.

    * The CPU usage was also measured via jconsole.

    * You mentioned "latency in the upstream system" - If by that you mean latency in the system we call (this is what I'd refer to as downstream so bit confused here), we simply looked at the average latency of the systems we'd be calling. If you meant something else by that, let me know.

    Regarding the decision itself, to make the most informed one I think you'd have to calculate:

    1. Extra cost of developing/maintaining ratpack app.
    2. Extra cost of running the more CPU/Memory hungry dropwizard equivalent.

    However, these figures weren't estimated to inform our decision which in retrospect was probably a mistake. Despite now having the benefit of hindsight, I'm still not sure how you could go about estimating the extra cost of developing/maintaining the ratpack app. One possible way could be:

    1. Start off using ratpack.
    2. Play a couple of sprints delivering stories.
    3. Have a team retrospective which answers "How much quicker would we have been if we had used Dropwizard?"
    4. Decide on whether or not to continue with ratpack, or revert to Dropwizard.

    I think this approach has a number of problems though:

    * The team will likely assess current velocity of using ratpack as "about to improve soon as we get more familiar" - this is something we have said to our scrum master on many retrospectives when faced with slow velocity!
    * A group of bright and tenacious individuals might view reverting to Dropwizard as "giving up"
    * The later you ask this question, the more informed you'll be. However, the later you ask, the more time you potentially waste by scrapping the work done so far, and the more likely you are to continue with ratpack for fear of appearing to have wasted time.

    I think that these problems could be overcome, but I think that the question(s) would have to be framed very carefully so as not to bias the outcome.

    When we started using ratpack, we said that we could revert the decision if we encountered major problems. So far we haven't but we're still not live! A situation that's far from ideal.

    If you (or anyone else out there) have any other views on how to make the best informed decision I'd be very keen to hear them!

    Let me know if this all makes sense or not.

    Thanks
    Phill

  3. Did you also compare performance with a dropwizard app that uses a reactive http client (example: https://jersey.java.net/documentation/latest/rx-client.html) to make requests to the ELB?

    I have no experience in using Ratpack, but a quick read of their doc suggests that the perf benefits you have seen are due to its non-blocking HTTP engine. So it might be possible to get comparable performance by simply using a reactive http lib to make requests to your backend (ELB in this case).

    Thoughts?

  4. Hi Anurag,
    Thanks for the comment. Good question... made me think!

    In the scenario you mentioned, I think you'd still have one Thread for every request that your server responds to. If your use case involved making multiple http calls and they could potentially be done in parallel then this approach would be a benefit. This is the situation described here with the calls to the weather service: https://jersey.java.net/documentation/latest/rx-client.html

    As it currently stands for us, each request that we process results in one request to a downstream system only. This means there would be no benefit for us just using the asynchronous http client as our threads would have nothing to do in the meantime whilst waiting for a response. We need to be asynchronous at every step of the way whilst handling our requests to get the benefit of the asynchronous model.

    Hopefully that makes sense and is correct!

    Thanks

  5. Hey, thanks for taking the time to answer my question.

    I think the example of the rx-client was not a great one in this context. I consulted with a few of our team members here who have had much more experience of using async libs and programming approach in the last few months than I have had. They pointed me to the following:

    https://jersey.java.net/documentation/latest/async.html

    The server side async model is what we have successfully used in several of our dropwizard apps.

    From the intro section of the documentation
    "It will however increase the throughput of the server, by releasing the initial request processing thread back to the I/O container while the request may still be waiting in a queue for processing or the processing may still be running on another dedicated thread."

    Would the above be useful and pertinent for your use-case?

    Replies
    1. Hi Anurag,
      That all makes sense now.

      With your first comment, I assumed that we were discussing only making the downstream calls async, which would still result in each request being processed by its own thread.

      However, when you use the @Suspended annotation (as explained in your 2nd comment) it seems you get the benefits of threads being able to deal with multiple requests simultaneously.

      I have to confess that I'm not that familiar with Jersey (JAX-RS's) async model and how it compares to Ratpack's model.

      Given our team's familiarity with Dropwizard it's probably something we should have looked into more.

      Cheers

  6. Hello Phil,

    Thanks for your post. Has your team been happy with the decision to use ratpack? How was productivity affected?

    Best

  7. Hi JuanPa,
    Thanks for the comment.

    I'll create a new blog post soon on this very topic.

    Thanks

  8. Is it possible to get the source code?
