Saturday, 30 June 2018

Polling vs WebSockets - Part 2 - Stress Testing

In my previous blog post, I discussed the efficiency of polling compared to WebSockets for a web application.  Having built the two implementations and their performance tests, I decided it would be interesting to do some stress testing to see which solution could handle the most load.

All code on github here.  For details of the original problem and performance tests, see here.

Let's increase the threads until it fails


Even with a low number of threads, occasionally I would encounter errors - most likely soon after the server had started in its new Docker container.  Therefore I decided to run each scenario three times and display all results.

Scenario - 40 Threads

  • Job duration 0-10 seconds
  • 40 Threads/Users - Instant ramp up
  • Each Thread creating 10 jobs
  • Polling interval of 500ms
  • Timeout: 11 seconds

Results - Some Errors from WebSocket implementation

  • Run 1
    • Polling - 0 errors
    • WebSockets - 0 errors
  • Run 2
    • Polling - 0 errors
    • WebSockets - 2 errors
  • Run 3
    • Polling - 0 errors
    • WebSockets - 0 errors

What are the errors?


The JMeter test report showed me errors like the following example:

The operation lasted too long: It took 11,771 milliseconds, but should not have lasted longer than 11,000 milliseconds.

Crucially, however, it didn't tell me which job ids were too slow.  My theory at this stage was that, given the tests are very time sensitive, one of the runs may have coincided with a GC pause which delayed a few jobs.


Scenario - 60 Threads


Results - More errors on WebSockets - Polling error free

  • Run 1
    • Polling - 0 errors
    • WebSockets - 0 errors
  • Run 2
    • Polling - 0 errors
    • WebSockets - 3 errors
  • Run 3
    • Polling - 0 errors
    • WebSockets - 4 errors

Again, the WebSocket tests failed with errors indicating the client had to wait longer than the maximum timeout for a response.  The theory of a GC pause slowing down performance doesn't explain why Polling was unaffected.  These errors seemed hard to explain, so I decided to add some logging on the server in order to establish where the delays were coming from.

override fun onWebSocketConnect(sess: Session?) {
    super.onWebSocketConnect(sess)
    var storeJob = jobService.storeJob(Job())
    storeJob.subscribe(Action1 {
        remote!!.sendString(objectMapper.writeValueAsString(it))
        logger.info("Job ${it.id} has been sent to client by socket")
        session.close(200, "Done")
    })
}

The logging revealed that the server was responding in a timely manner.  The log entries below show that the job with an id ending "f2e34477fc57" had been sent over the WebSocket just one millisecond after it was marked as complete.

INFO  [2018-06-19 21:30:16,038] com.github.phillbarber.job.JobService: Job: 6daba92d-2e09-41cc-bfcf-f2e34477fc57 Delay: 9985ms
INFO  [2018-06-19 21:30:26,023] com.github.phillbarber.job.JobService: Job 6daba92d-2e09-41cc-bfcf-f2e34477fc57 is complete
INFO  [2018-06-19 21:30:26,023] com.github.phillbarber.job.JobService: Job 6daba92d-2e09-41cc-bfcf-f2e34477fc57 is complete
INFO  [2018-06-19 21:30:26,024] com.github.phillbarber.job.JobSocket: Job 6daba92d-2e09-41cc-bfcf-f2e34477fc57 has been sent to client by socket

My next port of call was to try to add extra logging to the performance tests.  Sadly, there seemed to be no way to do this with the JMeter-WebSocketSampler.  I tried adding a "Debug Post Processor", which wrote many variables to the JMeter console; however, it didn't show any detail of the WebSocket response, so I couldn't determine which jobs were slow.

How to identify the slow responses - More logging


I realised that the logging above was before the session was closed and not after.  To log the "total" round trip time, each Job was given a create time as follows:

data class Job (val id: String = UUID.randomUUID().toString(),
                val createTime: Long = Instant.now().toEpochMilli(),
                var complete: Boolean = false)

Then the following logging code was added:

override fun onWebSocketConnect(sess: Session?) {
    super.onWebSocketConnect(sess)
    var storeJob = jobService.storeJob(Job())
    storeJob.subscribe(Action1 {
        remote!!.sendString(objectMapper.writeValueAsString(it))
        try {
            session.close()
        }
        catch (e: Exception) {
            logger.error(e.toString())
        }
        logger.info("Job ${it.id} has been sent to client by socket AND CLOSED. Start to finish: " + (Instant.now().toEpochMilli() - it.createTime))
    })
}

The eagle-eyed may also notice that session.close(200, "Done") was replaced by a plain session.close().  This was after I realised that the status code does not relate to HTTP status codes and was actually throwing an error stating org.eclipse.jetty.websocket.api.ProtocolException: Out of range close status code: 200.  I never could quite work out what a valid number was until I looked into the implementation of session.close and found that it uses a status code of 1000 to indicate a non-error scenario.  I'm still confused as to what these status codes represent, so if anyone can point me in the direction of an explanation please do!
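
My best guess is that these codes come from the WebSocket protocol itself rather than from HTTP - RFC 6455 defines 1000 as "normal closure" - and Jetty exposes them as constants.  A minimal sketch of a cleaner close, assuming Jetty 9.x:

import org.eclipse.jetty.websocket.api.Session
import org.eclipse.jetty.websocket.api.StatusCode

// Close with the protocol-level "normal closure" code (1000) rather than an HTTP status code.
fun closeNormally(session: Session) {
    session.close(StatusCode.NORMAL, "Done")
}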

The output from this extra logging was telling...

INFO  [2018-06-20 20:55:44,308] com.github.phillbarber.job.JobService: Job: 106a3332-adc6-4541-9601-60c4c5a35be4 Delay: 9148ms
INFO  [2018-06-20 20:55:54,393] com.github.phillbarber.job.JobService: Job 106a3332-adc6-4541-9601-60c4c5a35be4 is complete
INFO  [2018-06-20 20:55:54,494] com.github.phillbarber.job.JobService: Job 106a3332-adc6-4541-9601-60c4c5a35be4 is complete
INFO  [2018-06-20 20:55:54,495] com.github.phillbarber.job.JobSocket: Job 106a3332-adc6-4541-9601-60c4c5a35be4 has been sent to client by socket. Start to finish: 10187

Jobs weren't completing quite as promptly as I thought.  Above, we see a job with a delay of 9,148ms but a start-to-finish time of 10,187ms, which is a significant amount of extra time.  This example alone shouldn't be enough to make the JMeter client time out, but it does illustrate that the extra time could be coming from closing the session/connection.

Running the server within my IDE (IntelliJ IDEA) in debug mode whilst JMeter sent it requests also revealed an interesting quirk.  Once the response for a completed job was sent to the client, if the server took a long time to close the connection the test still passed; however, the next request for that thread failed.  This would make debugging quite difficult, since if the server was slow closing the connection for one job, it would actually be the next job created by the same JMeter thread that would error.

Conclusion - Part 1


The above test scenario shows that as load increases, the socket implementation starts to slow down.  It seems that the extra time causing the tests to fail comes from opening and closing connections.

I experimented by removing the CPU and Memory constraints, which resulted in the problem going away.  This led me to look at other stats available via Docker's API, including the throttling stats.

Web Sockets


  • Total Bytes sent/received 901KB
    • Bytes received: 507KB
    • Bytes sent: 393KB 
  • CPU 2.17 seconds 
  • Throttled Periods: 80

Polling

  • Total Bytes sent/received 5MB
    • Bytes received: 2.8MB
    • Bytes sent: 2.2MB 
  • CPU 6.9 seconds 
  • Throttled Periods: 121

Polling uses roughly three times the CPU but only seems to get throttled about fifty percent more.  In other words, the WebSocket implementation appears to be throttled proportionally more by Docker.

Stress Test Part 2 - Limiting the CPU by cores - not shares


Whilst reading more about Docker throttling, I found another way to try to mimic an AWS EC2 t2.micro spec on my laptop (which was the original goal): you can specify the set of cores to allocate to a Docker container.  I changed my bash script as follows:

From:

docker run --kernel-memory=1024m --cpus=0.125 --name polling-vs-sockets -d -p 8080:8080 -p 8081:8081 $FULL_IMAGE_NAME /startServerInDocker.sh

To:

docker run --memory=1024m --cpuset-cpus=0 --name polling-vs-sockets -d -p 8090:8080 -p 8081:8081 $FULL_IMAGE_NAME /startServerInDocker.sh

See my commit here.  This limits the Docker container to just one core.  Given my laptop has 8 cores, I expected to see performance similar to the previous --cpus=0.125 setup.  However, this change alone brought about an astonishing increase in throughput (and resulted in zero throttled periods).  Suddenly the scenarios I ran before with 60 threads gave 0 errors, and the thread count could be increased hugely, as seen below.

Polling Results


JMeter Threads | Number of errors | MB Received | MB Sent | MB Sent/Received | CPU Seconds
100            | 0                | 4.41        | 3.51    | 7.92             | 20.84
150            | 0                | 6.83        | 5.45    | 12.28            | 17.00
200            | 0                | 9.02        | 7.19    | 16.21            | 14.21
250            | 0                | 10.96       | 8.74    | 19.69            | 16.83
300            | 0                | 13.11       | 10.46   | 23.57            | 16.87
350            | 0                | 14.67       | 11.71   | 26.38            | 18.67
400            | 0                | 16.48       | 13.17   | 29.64            | 21.2
450            | 0                | 17.54       | 14.01   | 31.56            | 24.44
500            | 1                | 18.79       | 14.99   | 33.77            | 24.81
550            | 1                | 19.51       | 15.59   | 35.09            | 25.1
600            | 1                | 20.58       | 16.49   | 37.07            | 29.78
650            | 10               | 21.92       | 17.54   | 39.46            | 29.34
700            | 12               | 22.87       | 18.29   | 41.16            | 28.70
750            | 36               | 24.00       | 19.11   | 43.11            | 31.09
800            | 36               | 25.42       | 20.27   | 45.69            | 31.93
850            | 63               | 26.33       | 21.01   | 47.35            | 34.87
900            | 89               | 27.06       | 21.61   | 48.67            | 33.99
950            | 259              | 25.63       | 20.54   | 46.17            | 41.18
1000           | 126              | 28.68       | 22.86   | 51.54            | 38.63

Web Sockets results


JMeter Threads | Number of errors | MB Received | MB Sent | MB Sent/Received | CPU Seconds
100            | 1                | 0.64        | 0.70    | 1.34             | 1.06
150            | 0                | 0.97        | 1.04    | 2.01             | 1.67
200            | 0                | 1.29        | 1.39    | 2.68             | 2.24
250            | 0                | 1.61        | 1.74    | 3.35             | 2.66
300            | 2                | 1.93        | 2.09    | 4.02             | 6.3
350            | 1                | 2.26        | 2.44    | 4.70             | 3.45
400            | 1                | 2.58        | 2.78    | 5.37             | 3.78
450            | 8                | 2.91        | 3.13    | 6.04             | 4.01
500            | 11               | 3.24        | 3.48    | 6.72             | 4.42
550            | 26               | 3.58        | 3.83    | 7.41             | 4.74
600            | 4                | 3.93        | 4.17    | 8.11             | 5.17
650            | 41               | 4.27        | 4.52    | 8.79             | 5.57
700            | 31               | 4.78        | 4.87    | 9.65             | 6.31
750            | 69               | 4.93        | 5.22    | 10.15            | 6.30
800            | 118              | 5.33        | 5.57    | 10.90            | 6.68
850            | 32               | 5.68        | 5.92    | 11.60            | 7.53
900            | 209              | 6.13        | 6.26    | 12.39            | 8.36
950            | 443              | 6.51        | 6.61    | 13.12            | 8.35
1000           | 965              | 6.96        | 6.96    | 13.91            | 8.74

Terms:
  • Errors - As reported by the client (JMeter).
  • CPU Seconds - The increase in total CPU time for the server's docker container during the given test run.
  • MB Sent/Received - The increase in total data sent/received for the server's docker container during the given test run.


 
Apart from a few anomalies, CPU seems to increase linearly as the number of JMeter threads is increased.  The first polling run of 100 JMeter threads looks as if the JVM wasn't given enough of a warmup to optimise itself (or perhaps was so busy optimising that it needed more CPU).  This would suggest that WebSockets (at least for this scenario) are far more efficient than Polling in terms of CPU.

Like the CPU graph, the total data sent and received increases linearly until the Polling implementation suffers so many errors that the total throughput starts to decrease as a result of client threads waiting for responses.



Up until around six hundred threads, the errors for both scenarios seem negligible; as the threads are increased beyond that, both scenarios start to give errors with every further run.  At eight hundred threads a tipping point is reached for WebSockets and we see errors increase exponentially.

Removing the last few scenarios from the graph (before the WebSocket errors go completely sky high) reveals that even before the tipping point, the WebSocket implementation was consistently failing more frequently than its Polling counterpart.

What are the errors?


Most errors were due to the client not receiving a completed job within eleven seconds.  This accounts for all of the errors in the polling tests and around ninety-seven percent of the errors encountered in the WebSocket tests.  It would suggest that the server is so overloaded that the RxJava code doesn't get enough time to fire the events that mark jobs as complete at the right moment.  Strangely, this issue affects the WebSocket variant more, despite it consistently performing less CPU work than the polling variant.  I have no explanation for this at present.

A small number of errors (around 3% of all errors) were encountered by the WebSocket JMeter tests as they waited for a connection to the server.

Other interesting points


The WebSocket performance tests required more than 2GB of memory to prevent JMeter from failing with OutOfMemoryErrors!  This might suggest a memory leak in either my WebSocket test or the JMeter plugin used.

Conclusion

WebSockets can be far more efficient in terms of CPU resources and data sent/received.  The figures and graphs shown above are obviously only relevant to my one hypothetical scenario.  The following factors will contribute to how much CPU and data is saved when comparing the two implementations:

  • Cost of polling query - My example is very low cost: an in-memory check.  If another service needs to be queried (e.g. a DB) then the increased load on that will also need to be considered.
  • Polling frequency - If you don't have to be super responsive, you can keep halving your load by doubling your polling interval (e.g. moving from a 500ms to a 1000ms interval roughly halves the number of GET requests per job).

Whilst the benefits of a WebSocket implementation are clear to see in the above graphs, so too are the errors.  Both solutions suffer when the server is "stressed", but the WebSocket implementation seems to degrade earlier and faster in my example.  Doubtless these service degradation issues could be addressed with tweaks, monitoring and auto-scaling.

Thursday, 31 May 2018

Efficiency of Polling vs WebSockets

A web application I maintain uses polling from the front end to check if a long-running task is complete. A colleague suggested that WebSockets would be a far better alternative in terms of performance and user experience. Having never used WebSockets before, and keen to see just how much better they could be, I decided to compare the two approaches to a contrived but similar problem side by side.

All code on github here.

The problem


My hypothetical problem involves jobs. Each job consists of:
  • a unique id
  • a boolean named complete
A job is created by a client of the application and after a random duration the job completes (i.e. complete = true). The client needs to know as soon as possible once a job is complete.  This can be achieved by the client polling a job's status repeatedly until complete, or receiving a "job completion" event once finished.

For both solutions I decided to use Kotlin and the Dropwizard framework. Both of which I'm familiar with, enjoy using but have never used together.

Implementing Solution 1 - Polling via http rest


I'm familiar with creating RESTful web services so decided to implement that first. Starting from the acceptance tests here, I created the following endpoints:

  • POST /job - creates a job which will be incomplete at first.
  • GET /job/{id} - returns a job

Implementing this from the tests drove the design and resulted in the following service functions:

fun storeJob(job: Job) //Take a job and store it 
fun getJob(jobId: String): Job? //Return a job - null if not found
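
To make the REST side concrete, here's a rough sketch of what a Dropwizard (Jersey/JAX-RS) resource for those two endpoints could look like.  It's illustrative only - the class name and wiring are my assumptions, not code lifted from the repo:

import javax.ws.rs.*
import javax.ws.rs.core.MediaType

// Illustrative resource - names and structure are assumptions, not taken from the repo.
@Path("/job")
@Produces(MediaType.APPLICATION_JSON)
class JobResource(private val jobService: JobService) {

    @POST
    fun createJob(): Job {
        val job = Job()                 // starts with complete = false
        jobService.storeJob(job)        // hand it to the service functions shown above
        return job
    }

    @GET
    @Path("/{id}")
    fun getJob(@PathParam("id") id: String): Job =
        jobService.getJob(id) ?: throw NotFoundException("No job with id $id")
}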


Kotlin issue with fasterxml JSON


My first acceptance test hit an issue. It seemed that fasterxml couldn't work with my kotlin data class.

Caused by: com.fasterxml.jackson.databind.exc.InvalidDefinitionException: Cannot construct instance of `com.github.phillbarber.Job` (no Creators, like default construct, exist): cannot deserialize from Object value (no delegate- or property-based Creator) at [Source: (org.glassfish.jersey.message.internal.ReaderInterceptorExecutor$UnCloseableInputStream); line: 1, column: 2]

Thankfully, a quick google led me to FasterXML/jackson-module-kotlin, which solved the problem and let me serialize/deserialize kotlin data classes.
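
For anyone hitting the same error, the fix boils down to registering the Kotlin module on the ObjectMapper that Dropwizard uses.  Roughly (where exactly you register it may differ):

import com.fasterxml.jackson.module.kotlin.KotlinModule
import io.dropwizard.setup.Bootstrap

// Somewhere in the Application's initialize(): teach Jackson how to construct Kotlin data classes.
fun registerKotlinSupport(bootstrap: Bootstrap<*>) {
    bootstrap.objectMapper.registerModule(KotlinModule())
}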

Kotlin issue with main method


The next issue I hit was that my kotlin class with a main method would not work. I added the following code at the package level in my Application:

fun main(args: Array<String>) {
    if(args.size == 0){
        PollingVsSocketsApplication().run("server");
    }
    else{
        PollingVsSocketsApplication().run(*args);
    }
}

(see code on github here)
I then listed this class as the Main class in the Manifest file by configuring the maven-shade-plugin.  When running it from the command line as follows, I hit this error:

java -jar polling-vs-sockets-1.0-SNAPSHOT.jar
Error: Main method not found in class com.github.phillbarber.job.PollingVsSocketsApplication, please define the main method as: public static void main(String[] args)
or a JavaFX application class must extend javafx.application.Application

It seemed that after a maven install, the static main method was not added to the compiled class for some reason.  I never could work out what was going on, so I instead created an object as follows:

object Launcher {
    @JvmStatic fun main(args: Array<String>) {
        if(args.size == 0){
            PollingVsSocketsApplication().run("server");
        }
        else{
            PollingVsSocketsApplication().run(*args);
        }
    }
}

(see code on github here)


If anyone reading this works out what I'm doing wrong - please let me know! Or, alternatively, perhaps this is a bug with Kotlin 1.2.3?

Implementing Solution 2 - WebSockets


For this solution, I had to do a bit of googling as it was completely new ground for me.  My first goal was to just get any basic WebSocket example I could find working locally.  I eventually came across the Jetty WebSocket Client, which made sense to use given that Dropwizard already uses Jetty.  This page on the Jetty docs site shows how to get an "Echo" WebSocket client up and running; however, it didn't work for me.  The only way I could get a WebSocket to work (at least with Jetty 9.4) was to implement the WebSocketConnectionListener interface.  I did that by making my SimpleMessageSocket extend the WebSocketAdapter class, and then it started magically working.  My first WebSocket!

Whilst playing around I noticed something interesting which now seems obvious.  With HTTP, the client is very different to the server. There is a fundamental difference between sending and receiving a http request which is why we have many different http client libraries.  With WebSockets the only difference between client and server is who establishes the connection.  Both parties can equally send and receive messages.  This means (at least with the Jetty library) you can deploy the same WebSocket code to both the client and server.  This is good fun when you create a socket that just sends back the message it received (i.e. infinite loop of WebSocket traffic)!
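
To illustrate that symmetry, here is roughly what an echo socket looks like when extending WebSocketAdapter.  This is a sketch from memory, not the exact SimpleMessageSocket from the repo:

import org.eclipse.jetty.websocket.api.WebSocketAdapter

// The same class can be used on either the client or the server side of the connection.
class EchoSocket : WebSocketAdapter() {
    override fun onWebSocketText(message: String?) {
        // Send whatever we received straight back to the other end.
        remote?.sendString(message)
    }
}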

Make the code Asynchronous - RxJava


In order to create a WebSocket for the server to create jobs and send job completion events, I had to expose some asynchronous behaviour.  For this I decided to use RxJava.  The tests ended up giving me the following design in the Service:

fun storeJob(job: Job): rx.Single<Job> 
fun getJob(jobId: String): Job?

As you can see, the storeJob function now returns a Single of Job.  This Single will only ever emit a Job once it has been completed.  This made the client code in the JobSocket quite simple:

override fun onWebSocketConnect(sess: Session?) {
    super.onWebSocketConnect(sess)
    var storeJob = jobService.storeJob(Job())
    storeJob.subscribe(Action1 {
        remote!!.sendString(objectMapper.writeValueAsString(it))
        session.close(200, "Done")
    })
}
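
For context, here's a hedged sketch of how a storeJob like this could emit the completed Job after a random delay using RxJava 1.x.  This is my own reconstruction for illustration, not necessarily how the repo's JobService does it:

import rx.Observable
import rx.Single
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.ThreadLocalRandom
import java.util.concurrent.TimeUnit

// Hypothetical in-memory store; assumes the Job data class (with a mutable complete flag) described in this blog.
private val jobs = ConcurrentHashMap<String, Job>()

fun storeJob(job: Job): Single<Job> {
    jobs[job.id] = job
    val delayMs = ThreadLocalRandom.current().nextLong(0, 10_000)   // jobs take 0-10 seconds
    return Observable.timer(delayMs, TimeUnit.MILLISECONDS)
            .map { job.apply { complete = true } }                   // mark the job complete, then emit it
            .toSingle()
}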

It was great to be working with RxJava again - lots of fun!

Creating the Performance Tests


Code for both tests (plus some bash scripts I found useful), can be found here.

Creating the tests gave me some interesting thoughts. The http-polling tests were not simple to create since they had to:
  • Call POST /job to create the job
  • Call GET /job/{id} repeatedly until complete=true or the maximum timeout had been exceeded.
I'd argue that this isn't trivial to implement in any client, whatever language you choose. It shows that polling (whilst simple to expose as a service) is complicated for clients to integrate with.
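
To give a feel for what that client-side loop involves, here's a rough Kotlin sketch of the poll-until-complete logic.  The fetchJob parameter stands in for whatever makes the GET /job/{id} call - it's a placeholder, not code from the repo - and Job is the data class described earlier:

// Poll until the job is complete or the overall timeout expires.
fun waitForCompletion(jobId: String,
                      fetchJob: (String) -> Job?,
                      timeoutMs: Long = 11_000,
                      pollIntervalMs: Long = 500): Job {
    val deadline = System.currentTimeMillis() + timeoutMs
    while (System.currentTimeMillis() < deadline) {
        val job = fetchJob(jobId)                    // GET /job/{id}
        if (job?.complete == true) return job        // done - stop polling
        Thread.sleep(pollIntervalMs)                 // otherwise wait and try again
    }
    throw IllegalStateException("Job $jobId did not complete within ${timeoutMs}ms")
}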

Creating the tests for the WebSocket implementation had its own challenges, due to the fact that JMeter does not have a native WebSocket sampler. After a bit of googling I found Maciej Zaleski's WebSocket sampler.  The wiki explained how to install it, and after a bit of trial and error I had the test working. Sadly I wasn't able to set up proper error handling, as there didn't seem to be a way to validate successful responses in the sampler, or to expose the responses to BeanShell processors.

These issues (whilst significant) are only due to the fact that WebSocket solutions aren't as common as http. Assuming that more mature and feature-rich clients (in this case JMeter plugins) are created, this issue goes away. The only reason my acceptance tests for the polling solution were simple was because I cheated with some Thread.sleeps. It's highly likely that this would not be a viable option for clients integrating in a real-world scenario.

Test Setup - Docker



Docker is great. Running the server in a docker container allowed me to constrain its resources and also separate the performance test execution from the server under test.

I decided to constrain the resources to mimic something close to Amazon's EC2 t2.micro spec. This seemed a more production-realistic simulation than allowing the server free rein of my entire laptop. I limited the CPU resources to one eighth (since my laptop has eight CPU cores) and memory to 1GB as follows:

docker run --kernel-memory=1024m --cpus=0.125 --name polling-vs-sockets -d -p 8080:8080 -p 8081:8081 $FULL_IMAGE_NAME /startServerInDocker.sh

I have to confess that the first time I ran these tests I incorrectly interpreted the docker runtime arg of --cpus as "number of cpu cores" and set it to 1.  This had the effect of allowing full CPU to the docker container.  Once I corrected my mistake and set it to 1/8, throughput was hugely reduced.  This highlights the fact that my hypothetical problem is mostly CPU bound.

Running in Docker also gives you a number of very interesting stats like overall CPU, Memory and Network IO via the useful docker stats command. This was also useful in showing me that I had failed to limit the memory of my container to 1024m as I had hoped!
Sadly specifying these limits meant that I was unable to replace my bash script with a docker-compose.yml file since version 2 of docker-compose does not support limiting resources.

After spending a lot of time trying to get pretty graphs from the docker-monitoring project I decided it would be simpler for me to just use the Docker Stats API.  Once I found out how to enable the docker api on my laptop, I got my bash scripts to curl the following URL before and after each test run so as to snapshot all the data it provided for later analysis.

http://localhost:4243/containers/polling-vs-sockets/stats?stream=false

Where polling-vs-sockets was my container running the application.

Understanding docker stats - What is cpu_stats.cpu_usage.total_usage?


Within the big JSON response from docker stats are CPU stats.  The field cpu_stats.cpu_usage.total_usage looked the most relevant; however, I didn't know what the measurement actually represented.  I couldn't find a description in the docker docs anywhere, but eventually found someone with the same issue on github here.  One of the responses provided a link to some code for a project named Moby here which (within the comments) explains that it is a measure of total CPU time in nanoseconds.  I hadn't even heard of Moby before, but essentially it's a framework for assembling components and libraries into a standalone container platform, and Docker itself is built with it.  I thought Docker was a single big monolith - but it's not!  This article has a good explanation.
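
So the "CPU" figures quoted in this post are simply the difference between the total_usage value before and after a run, converted from nanoseconds.  A trivial sketch of the conversion:

// total_usage is reported in nanoseconds, so the delta between two snapshots / 1e9 gives CPU seconds.
fun cpuSecondsBetween(totalUsageBeforeNs: Long, totalUsageAfterNs: Long): Double =
        (totalUsageAfterNs - totalUsageBeforeNs) / 1_000_000_000.0

fun main() {
    // e.g. a delta of 3,287,000,000 ns is the "3.29 seconds" quoted in the results below.
    println(cpuSecondsBetween(0L, 3_287_000_000L))
}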

Performance Test Results - First impressions from the JMeter UI


Whilst creating the performance tests I was immediately struck by how much cleaner a WebSocket implementation is than polling. Take a look at the following screenshots.




The image above shows my http-polling test creating a single job on a single thread and waiting for it to complete. Ignoring the "Start Time" entry (which was necessary to implement my overall timeout), we can see five requests. Only two of these are requests we actually care about: the POST to create the job and the final GET which returned the completed job. The other GET requests are a distraction and a waste of everyone's time and effort.



This screenshot shows the WebSocket solution. As before, just one thread creating one job and waiting for it to complete. This was all achieved with just a single entry, which is incredibly clean and has no items/events that we don't care about.  Sadly it didn't seem possible to display the individual messages received, but that didn't prevent me from getting a working test up and running.

Results - What is more efficient Polling or WebSockets?

Scenario:

  • Job duration 0-10 seconds
  • 20 Threads/Users - Instant ramp up
  • Each Thread creating 10 jobs
  • Polling interval of 500ms
  • Maximum job creation time (from client perspective) - 11 seconds

Polling Results:

  • Total Bytes sent/received 1.62 KB
    • Bytes received: 951 bytes
    • Bytes sent: 703 bytes 
  • CPU 3.29 seconds (3,287,000,000 nanoseconds)

WebSocket Results:

  • Total Bytes sent/received 0.28 KB
    • Bytes received: 162 bytes
    • Bytes sent: 126 bytes 
  • CPU 0.68 seconds (681,000,000 nanoseconds)

Conclusion


In this scenario, the polling solution uses roughly five times the CPU and five times the network traffic of the WebSocket solution.

What if we double the polling frequency?


Our product owner has now decided that the UI isn't responsive enough to the job completion event. Our only solution is to decrease the polling interval and suffer the consequences!

Scenario - Same as above except:

  • Polling interval of 250ms

Polling Results:

  • Total Bytes sent/received 2.6 KB
    • Bytes received: 1,471 bytes
    • Bytes sent: 1,195 bytes 
  • CPU 13.8 seconds (13,810,000,000 nanoseconds)

Conclusion


Here we see a huge difference compared to the socket implementation.  Nine times the amount of data and twenty times the CPU is used for polling compared to the socket implementation.  Also, the socket implementation is still quicker for the user, since with polling the user still has to wait, on average, half the polling interval (125 ms) after the job completes.

Summary


WebSockets still seem a slightly niche technology in comparison to good old HTTP/REST. Depending on the language you are using, there perhaps aren't as many libraries available to you should you go the WebSocket route. However, when dealing with events, HTTP is fundamentally flawed in that it can only truly support events generated by the client and not the server. Polling is a hack which generates waste, distracting noise and complicated client code (i.e. keep trying until a condition is met OR max attempts are exceeded). The network and CPU differences shown above could be even more significant if the GET request were serviced by a more complicated query (e.g. a DB query).
To summarise:
  • Polling code is simple on the server but:
    • The client code making polling requests is annoyingly complicated.
    • Load can increase when clients decide they want to know sooner.
  • WebSockets (like anything) take a little learning and you might not quite have the library support but:
    • The client code will be simpler on a conceptual level as it will deal with true events and not loops with conditions.
    • You'll benefit from the reduced cpu and memory consumption in the long run. 

Other things to investigate

If I get the time, I'd love to look into this topic again.  Specifically:
  • Stress test - Determine which solution can handle the most traffic before breaking.
  • Examine an HTTP2 implementation.  Just before I published this, a colleague told me that HTTP2 supports full duplex communication and could end up replacing WebSockets!

Saturday, 31 March 2018

ASCII Art Mazes and Playing with Kotlin

I decided to play around with Kotlin for fun on a home project that interested me.  At many points during the development I was thankful that I was using Kotlin and not Java; here are a few notes that explain why.  All the code for this page can be found here: https://github.com/phillbarber/kotlin-maze-demo

The Problem

Given a simple 2D maze inputted in an "ASCII art" format, write some Kotlin code that will plot a route from the Start to the Finish.

In other words, given this...

############################################
#                  #                       #
#                  #                       #
#                  #                       #
#                  #                       #
#                  #                       #
#                  #                       #
#                  #                       #
#                  #         F             #
#                  #                       #
#                  #                       #
#                  #                       #
#    S             #                       #
#                  #                       #
#                  #                       #
#                                          #
#                  #                       #
#                  #                       #
#                  #                       #
############################################

Output this...

############################################
#                  #                       #
#                  #                       #
#                  #                       #
#                  #                       #
#                  #                       #
#                  #                       #
#                  #                       #
#                  #         F             #
#                  #         .             #
#                  #         .             #
#                  #         .             #
#    S             #         .             #
#    .             #         .             #
#    .             #         .             #
#    .........................             #
#                  #                       #
#                  #                       #
#                  #                       #
############################################

Approach

I tried to be disciplined and use a TDD approach from the start.

Implementation

As soon as I started, I had an idea of how I wanted to achieve this.  I wanted to use recursion.  I remember learning recursion as a student and finding it strangely fascinating.  It seemed so powerful with such a small amount of code.  Also, it's one of those techniques I never seem to use at work.

First acceptance test - The `is` issue

I like using hamcrest matchers in my tests as I think they make tests closer to plain English (more readable).  A very common method used with hamcrest is CoreMatchers.is().  Unfortunately, the word "is" is a keyword in Kotlin, but it can be escaped with backticks for compatibility with methods that happen to clash by name.  This is great, but in my opinion makes the code look a little ugly, as shown below...

assertThat(mazeAsStringWithRoute, `is`(expectedMazeWithRoute)) 

A quick google for some alternatives brought me here where it was pointed out that static methods can be aliased on import in the following way:

import org.hamcrest.CoreMatchers.is as Is

This is purely a styling issue but I chose to import as _is.  It looks better in my eyes (possibly from enjoying using the wonderful underscore.js library).

Multi-line (raw) Strings - Hooray!

At first I stored my mazes in .txt files in my test resources directory.  Then I had to write some code to load each file into a String.  Converting a file into a String always seems much harder than it should be in Java.  To do it succinctly you have to import a library (e.g. Apache Commons IO), and even then I'm forced to think about directories and classpaths etc. as my first attempt never works!  Once I had imported Apache Commons and got it working, I was left with the following code.

FileUtils.readFileToString( File(javaClass.getClassLoader().getResource("expected-super-simple-solvable-maze-with-route.txt").file), CHARSET)

A bit of a pain, but at least my maze files are easily viewable.  One alternative would be to store the mazes in a String in Java as follows:

String maze = "#########\n" +
              "# S  F  #\n" +
              "#########\n";

...an overly simple example

This is obviously very ugly and fiddly.  However, with Kotlin this becomes very easy thanks to multi-line Strings:

val maze = """
########## 
# S  F   #
##########"""


I think it's great that Kotlin supports multi-line (or rather raw) Strings. However, perhaps the more relevant point here is "Why does Java not support it?".  It looks like it might be coming to Java if JEP 326 gets its way - let's hope it does.

Extension methods

Although the multi-line Strings improved things, I did have a slight issue.  To format the maze nicely, each String had to start with a newline character so that everything would line up.  To remove this leading character from each String I used an extension function.

In Java it would have just been a static method, but an extension method did make the code seem cleaner...

val maze = """
##########
# S  F   #
##########""".removeFirstCharacter()

fun String.removeFirstCharacter(): String{
    return this.substring(1, this.length);
}

Data classes

The pain of creating Java Beans isn't too high when using a good IDE such as IntelliJ IDEA.  I tend to create my Java Beans with their attributes and just hit Alt+Enter to insert my constructor and my getters.  Builders can definitely be a pain, however, as I don't think I've ever found a nice shortcut or plugin to create/maintain them.  Regardless of how easy it is to create this code, you're still left with a lot of it about!  Data classes in Kotlin make this so much nicer.  They provide equals(), hashCode() and toString() for free, and there's no real need for builders due to the nice syntax of the language (See this... How to create a builder in kotlin? - Don't!).
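
As a quick illustration of why builders aren't really needed, named arguments with defaults plus copy() cover most builder use cases.  The class below is hypothetical, purely for demonstration:

// Hypothetical config class, just for illustration.
data class ServerConfig(val host: String = "localhost",
                        val port: Int = 8080,
                        val useTls: Boolean = false)

fun main() {
    // Named arguments with defaults do the job a Java builder usually does...
    val config = ServerConfig(port = 9090)
    // ...and copy() handles the "same object but with one field changed" case.
    val secureConfig = config.copy(useTls = true)
    println(secureConfig)   // ServerConfig(host=localhost, port=9090, useTls=true)
}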

How Immutable to be?

Whilst trying to make all of my objects nice and immutable, I hit a problem.  I wanted to create a Maze from my ASCII art input as shown above.  Each Maze consisted of a List of Lists of Cells.  The original idea was to have a Cell consist of immutable attributes (i.e. vals and not vars) as follows:

data class Cell(val type: Type,
                val xAxis: Int,
                val yAxis: Int,
                val down: Cell,
                val up: Cell,
                val left: Cell,
                val right: Cell)

Given that a Cell consisted of four other Cells (i.e. a recursive data structure), you couldn't create a cell until you had created every other cell required for the entire maze.  This didn't seem possible.  One option would be to create a "half baked" Cell which had its sibling cells missing (null) at first.  From that, a copy could be made each time a new sibling needed to be set.  I considered this, but thought that it was more complex than just having a mutable Cell class as follows:

data class Cell(val type: Type,
                val xAxis: Int,
                val yAxis: Int,
                var down: Cell? = null,
                var up: Cell? = null,
                var left: Cell? = null,
                var right: Cell? = null)

This design made it actually possible for me to write the code (which was a good sign!).

Parsing the ASCII and generating a Maze


The ASCII art was parsed a line at a time, creating the List of Lists of Cells.  The first pass would create Cells with just the type, xAxis and yAxis set.

Cell(   type = fromChar(character.toChar()), 
        xAxis = xIndex, 
        yAxis = yIndex)

Once complete, the Cells would have their sibling Cells set by the addDownUpLeftRightToCells method.

private fun addDownUpLeftRightToCells(gridOfRows: List<List<Cell>>) {
    gridOfRows.forEach { rowOnYAxis ->
        rowOnYAxis.forEach { cell ->
            cell.down = getCellByCoordinates(gridOfRows, cell.yAxis + 1, cell.xAxis)
            cell.up = getCellByCoordinates(gridOfRows, cell.yAxis - 1, cell.xAxis)
            cell.left = getCellByCoordinates(gridOfRows, cell.yAxis, cell.xAxis - 1)
            cell.right = getCellByCoordinates(gridOfRows, cell.yAxis, cell.xAxis + 1)
        }
    }
}

Solve the problem

With the data structure in place, now the code to solve the maze could actually be written.

This ended up being pretty simple!

private fun getFinish(cell: Cell?, route: MutableList<Cell>): List<Cell> {
    if (route.filter { cell -> cell.type == Type.Finish }.any()){
        return route;
    }
    if (cell != null){
        route.add(cell)
        if (cell.type == Type.Finish){
            return route;
        }
        if (shouldVisitCell(cell.down, route)){
            getFinish(cell.down, route);
        }

        if (shouldVisitCell(cell.right, route)){
            getFinish(cell.right, route);
        }
        if (shouldVisitCell(cell.left, route)){
            getFinish(cell.left, route);
        }
        if (shouldVisitCell(cell.up, route)){
            getFinish(cell.up, route);
        }
    }
    return route;
}

This produced a solution....

############################################
#                  #                       #
#                  #                       #
#                  #                       #
#                  #                       #
#                  #                       #
#                  #                       #
#                  #                       #
#                  #.........F             #
#                  #.......................#
#                  #.......................#
#                  #.......................#
#    S             #.......................#
#    .             #.......................#
#    .             #.......................#
#    .            .........................#
#    ..............#.......................#
#    ..............#.......................#
#    ..............#.......................#
############################################

...but a very bad one!

Final Thoughts

This was always going to be a very simplistic route finder, as shown by the "solved" maze above.  I think it would be great fun to extend this and use some AI concepts I was taught back at Uni.  This was a fun task, made even more fun by Kotlin.



Tuesday, 13 June 2017

From Developer to Product Owner

I have been a software developer for around ten years, but recently I was fortunate enough to be offered a different role as product owner (PO) for an Identity Platform. Keen to try something new, I accepted. However, after seven months in the role, I worked out that it wasn’t for me. This post will detail what went well, what went badly and what other developers can possibly expect if they choose to make the switch from engineering to product.


My background

I have been a developer ever since my Uni days and have worked at a number of different companies in London for around ten years. I’ve worked for Media companies, Finance and also a startup.


The Job

When I was contacted by a friend (and former colleague) about a “Technical Product Owner” role I was really intrigued. I have always felt that “pure development” roles don’t fit me perfectly as I enjoy more than just coding features. I enjoy coming up with plans, talking to a wide range of people in an organisation and taking on extra responsibility. I thought that perhaps this role could suit me perfectly but was also quite intimidated by how much of a challenge it would be. What made me accept the role was the realisation that it was such a unique opportunity that I could hugely regret declining in the future.


I took it on and was hopeful that a lot of my experience would prove relevant and that I’d pick up the other skills needed along the way.


Some of the challenges that I faced were unique to the initiatives that were in progress or planned. What I want to focus on here are the challenges that could apply to other developers making a switch to product and offer my advice.


Stakeholder management is different to working with a single PO as a developer

I was confident I’d be ok at stakeholder management. I have a lot of experience from my development days talking to non-techy POs, working out what they want and coming up with a plan of action. I assumed that as PO of a platform (with a more technical focus) engaging with the stakeholders that were non-techy and more business focused would be very similar. In some ways it is, but there are some differences that are very significant.


You need to have the confidence to dictate

As a developer, talking to your PO is all about coming to a compromise. As a PO, when it comes to dealing with stakeholders, compromise is still to be strived for but sometimes you have to dictate in the name of progress. Looking back, I wasn’t confident enough to dictate when it was necessary for a clear plan of action. I was very much aware of how important it was to be decisive, but still didn’t feel able to state “This is the plan, we’re going ahead” when I knew that they had expressed preferences that were only slightly different to my proposal.



I think this came from a subconscious feeling that I wasn’t in someway on their level. They seemed like the “real” POs to me as they could talk confidently about business terms such as revenue growth, user engagement and subscription funnels. I was more comfortable talking to the devs about techy details such as scalability, zero downtime deployments and firewall rules. Dealing with multiple stakeholders as PO in the same way that I had dealt with single POs as a developer led to too much compromise which resulted in a lot of back and forth. It took way too long for us to come up with a concrete plan of scope for our first release of a new feature.


Looking back, a key moment was when I was trying to get many stakeholders to agree on a plan of action. One day I told a more senior product owner something along the lines of “Hey, great news I just managed to get one of the stakeholders to agree with plan A”. To my disappointment he wasn’t very enthused with my update. His response was “It doesn’t matter, because we’re doing it anyway”. I thought that this was a strange thing to say at the time, but looking back I now realise its importance. In some scenarios it is impractical and inefficient to negotiate with every stakeholder because it prevents you from getting clarity on what you’re doing.


People more senior than me started to recognise the problem that I was facing. With their help a plan was put in place, communicated at a steering group and everything started to fall into place. It’s very widely known that when a team is working towards a clearly defined goal they are much more likely to get something of tangible value delivered. I have seen this first hand and realise just how important it is to have clarity from the top so that everyone understands exactly what’s needed.


Developers are used to looking at the detail - When the PO joins in, things don’t move

As a developer, I have always found estimation hard. Estimating at a high level, such as epics that have not been fully broken into stories, I find even harder. I can recall retrospectives in the past when former teams I belonged to have discussed why we think we vastly under-estimated a piece of work.


This past experience has (rightly or wrongly) made the developer in me very cautious to not overlook details that could contribute to underestimating. Other developers no doubt feel the same way.


When estimating high level epics as a team for future items on the roadmap, being super accurate is not important. Given the tendency of developers to want more detail, if the product owner supports the demands for "more analysis" or "spikes", you are left with analysis paralysis. It's issues like this that can cause uncertainty to rumble along for too long and prevent a delivery roadmap from being formed.


A development background can lead you to present at too fine-grained detail

This sounds so obvious, but it is worth pointing out. I prepared a presentation for some non-techy stakeholders that showed user journeys, and I thought I was mindful enough of the business-focused audience. To my astonishment, I received feedback that my message was far too detailed. Keep this in mind and get a second opinion (ideally before slaving away over your slides) to ensure you communicate at just the right level.


Stress makes you less effective

When things got stressful, being effective was a lot harder. Work felt exhausting and became miserable at times. Seeing that our “sprint planning” session was fast approaching in the calendar filled me with dread on the occasions I knew there weren’t enough stories prepared. At times it seemed impossible to create a backlog of work and switch to a more proactive, long term focus.


Seeing this issue, our delivery manager made the suggestion to put the work temporarily on hold and let the team work on whatever tech-debt they’d like for a sprint. This was a great idea as it bought us some much needed time and allowed us to finally get things in order. With stories prepared for a number of sprints, gradually our focus switched to more long term planning and things became much calmer.


A good support network is crucial

As a developer, you benefit from working closely with your peers. Pair programming, code reviews and informal chats typically occur multiple times per day and all help with knowledge sharing. This is made easier when you’re all at the same bank of desks working on the same thing.


As PO, it’s likely that you will sit with your team and far away from your other fellow PO peers. This means that the opportunities that do occur to knowledge share are highly valuable.


If a role doesn’t work out, it will still benefit you

As I mentioned at the start, the role wasn’t perfect for me. However I did gain a huge amount of experience in things that I’m hoping I can apply to other future roles. At the very least, I have a newfound respect for POs because I can honestly say from experience, it’s a tough role!

Tuesday, 30 August 2016

How to prove your Ratpack app is non-blocking.

Using the Ratpack framework can be tricky. It's easy to get something working but it's harder to ensure that your code is non-blocking.  This post will detail how to prove your code is non-blocking with automated tests.

Background - What's the problem?

In order to realise the performance benefits of the Ratpack framework you have to ensure that your code is non-blocking.  This means ensuring that your Java threads are always busy and never waiting for I/O (i.e. non-blocking).  As soon as your thread kicks off an I/O task (e.g. sending an http request or making a DB call), it should be able to switch to another task.  Once the I/O eventually completes, an event is fired which signals that the next step in the code can be performed, typically doing something with the result of the I/O task.  This means that a huge number of tasks which are I/O bound (i.e. spend a significant amount of time waiting for I/O) can be performed in parallel with only a small number of threads.  This approach can have significant CPU and memory benefits.  However, it comes at the price of increased code complexity.


In my opinion, the biggest problem is where you accidentally make your code blocking, i.e. the thread is blocked waiting for the I/O to finish and doesn't do anything useful in the meantime.  Just because you are using asynchronous libraries, there's nothing stopping you calling a blocking method by mistake.  In this situation you go back to the traditional synchronous model, whereby you can only hope to execute more parallel tasks by adding more threads to your application.  What makes this situation worse with a Ratpack application is that you have invariably assumed non-blocking code and have only allocated a small number of threads to your application (the default is two threads per CPU core).  This can result in a drastic decrease in throughput.


These bugs can easily sneak through into production for several reasons.  Firstly, you can inadvertently make your code block a thread for a variety of very subtle reasons.  You don't even need to call any blocking methods yourself to introduce blocking code.  As an example, see this code which calls Cassandra - it makes any thread which subscribes to the returned Observable block on I/O.  Secondly, when the offending blocking code is invoked from tests, unless under load, it will not only work fine but respond with the same latency as if it were non-blocking.  In this scenario, only performance tests have the potential to stop your bug going live.


How to prove your code is non-blocking with tests


In short: demonstrate that your code can handle more concurrent I/O bound transactions than it has threads. Let's assume a web application that uses standard blocking I/O, has two threads, and performs the following:
  1. Takes a http request from a client.
  2. Makes a call to a downstream service over http which takes three seconds to respond.
  3. Once the response has been received from the downstream service, returns a response to the client.
Assuming a blocking model, the absolute best throughput that could ever be achieved is as follows:

(Time Period / (I/O Delay)) x (Number Of Threads) = Maximum throughput

Or in the case above.

((60 seconds) / (3 seconds for I/O to complete)) x (2 Threads) = 40 Transactions per minute (TPM)

In practice, the throughput would not be as high as this since other time would be required for dealing with the client's request and response.

When we consider a non-blocking I/O model, the above formula should no longer hold true since we are no longer limited by the number of threads.  

It's up to our tests to prove that the above formula does not apply to our code.  This can be done by sending N number of requests at the code simultaneously and ensuring that the throughput exceeds the maximum throughput if it were blocking.  If it doesn't exceed the throughput - fail the test!

Writing the test  


I'm a fan of writing end to end tests that prove your code can make actual requests over the network to stubbed dependencies.  Wiremock is a great tool for stubbing and mocking http services and is ideal for this test.  

We need several things in our test as detailed below:-
  1. I/O delay - To simulate a delay in our downstream http service, we can use wiremock's fixed delay feature for simulating slow responses (a sketch of such a stub follows this list).
  2. Constrained number of threads - We will set ratpack to have one thread only.
  3. Send simultaneous requests to our web application.
  4. Timeout - We will set the timeout to the I/O delay multiplied by the number of simultaneous requests.  If this timeout is exceeded, we haven't proved our code is non-blocking.  This can be specified quite neatly in junit tests with an annotation.
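
For reference, a WireMock stub with a fixed delay looks roughly like this.  I've written it in Kotlin for consistency with the rest of this blog; the endpoint path and body are illustrative, and it assumes a WireMock server is already configured:

import com.github.tomakehurst.wiremock.client.WireMock.aResponse
import com.github.tomakehurst.wiremock.client.WireMock.get
import com.github.tomakehurst.wiremock.client.WireMock.stubFor
import com.github.tomakehurst.wiremock.client.WireMock.urlEqualTo

// Stub a downstream endpoint that takes three seconds to respond.
fun stubSlowEndpoint(delayMillis: Int = 3_000) {
    stubFor(get(urlEqualTo("/downstream"))
            .willReturn(aResponse()
                    .withStatus(200)
                    .withBody("""{"message": "hello"}""")
                    .withFixedDelay(delayMillis)))
}
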
The code:
@Test(timeout = SLOW_ENDPOINT_DELAY * NUMBER_OF_CALLS)
public void handlerIsNotBlocking() throws Exception {

    URI uri = new URI(getAddress().toString() + "happy");

    List<Response> responses = new ConcurrentExecutor(
            () -> jerseyClient().target(uri).request().get(), NUMBER_OF_CALLS).executeRequestsInParallel();

    assertThat(responses).hasSize(NUMBER_OF_CALLS);
    responses.forEach(this::verifyResponseHasCorrectContent);

}
 
See github for code.


Results


When this test is run, the following happens:
  1. Eight requests are sent simultaneously to the ratpack application via a "ConcurrentExecutor", a class I created which wraps an ExecutorService.
  2. Eight requests are sent to wiremock. 
  3. After a three second delay, all eight responses are received from wiremock. 
  4. Ratpack then returns all eight responses within around ten milliseconds of each other. 
  5. The junit test verifies that all eight responses were received ok. 
  6. The test completes quicker than the timeout, proving that the code is non-blocking. 
This same technique can be applied to other services such as Cassandra using Stubbed Cassandra. 

Wednesday, 6 April 2016

The case for keeping firewalls simple

Internal firewall rules that attempt to analyse anything higher than the network layer can cause huge problems. In this post I'll make the case for keeping your firewall rules simple.


The problem we had


Our team recently encountered an error where an internal web application received a socket timeout when trying to call one of its internally hosted dependencies.  Whilst investigating, we found that the application had made successful HTTP calls to the same service, immediately prior to the error.  



It was puzzling but I ruled out anything Network related in our investigation given:

  • The application could make some requests absolutely fine.
  • There was nothing seemingly different about the requests and responses.  They were all GET requests that returned a small amount of JSON.
  • There weren't any connection errors.

All signs pointed to the server taking too long to respond, i.e. an application issue. We then found that the system being called had no record of even receiving the failed request. It was very puzzling.  One member of the team recommended we talk to Ops about the issue.  I was convinced there was no need to bother them, as the problem must be with the application.  I was wrong!

We then found out (through talking to Ops) that the firewall was blocking the request because its contents were deemed suspect.

We wasted a lot of time investigating completely incorrect theories based on seemingly sound, but invalid assumptions.



The problem in general


How to ever know if an error is firewall related


Our problem manifested itself as a socket timeout.  How would other, non-http-based protocols report blocked traffic?  I can easily imagine going through the same long learning process for a database, an FTP or an SMTP service.

Confusion is introduced - even if it never fails again


Let's assume that these problems are addressed and the logic is updated to handle the legitimate requests.  Let's also suppose that the firewall never blocks a genuine request again.  When a socket timeout error is encountered, we could now point the finger at the firewall when we should be focusing on the application.

Recommendation


I'd advocate a simple firewall for internal traffic that whitelists IPs and ports only.  If we can be certain that the firewall completely trusts traffic based on an established TCP/IP connection, things will become a lot simpler to debug.  

The simpler to debug, the quicker you fix your site in an emergency!  Time is of the essence.

If you must...


If there is an absolute requirement that these firewall rules are in place, confusion can be mitigated by performing the following:

  • Ensure that all developers are aware of how firewall issues may present themselves.
  • Provide a console for everyone to easily see if traffic is being blocked by the firewall or not.