The case for keeping firewalls simple

Internal firewall rules that attempt to analyse anything above the transport layer can cause huge problems. In this post I'll make the case for keeping your firewall rules simple.


The problem we had


Our team recently encountered an error where an internal web application received a socket timeout when trying to call one of its internally hosted dependencies. Whilst investigating, we found that the application had made successful HTTP calls to the same service immediately prior to the error.
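To make the failure mode concrete, here's roughly what it looks like from the application's side. This is a minimal sketch in Python using the requests library; the service URL and the timeout value are made up for illustration:

    import requests

    try:
        # Hypothetical internal dependency. When the firewall silently
        # drops the request, no response ever arrives and the read
        # eventually times out.
        resp = requests.get("http://internal-service.local/api/items",
                            timeout=5)
    except requests.exceptions.ReadTimeout:
        # From the application's point of view this is indistinguishable
        # from the server simply being slow to respond.
        print("socket timeout")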



It was puzzling, but I ruled out anything network-related in our investigation, given:

  • The application could make some requests absolutely fine.
  • There was seemingly nothing different about the requests and responses: they were all GET requests that returned a small amount of JSON.
  • There weren't any connection errors.

All signs pointed to the server taking too long to respond, i.e. an application issue. We then found that the system being called had no record of even receiving the failed request. It was very puzzling. One member of the team recommended we talk to Ops about the issue. I was convinced we shouldn't bother them, as the problem had to be in the application. I was wrong!

We then found out (through talking to Ops) that the firewall was blocking the request on the basis that its contents were deemed suspect.

We wasted a lot of time investigating completely incorrect theories based on seemingly sound but invalid assumptions.



The problem in general


How do you ever know if an error is firewall-related?


Our problem manifested itself as a socket timeout. How would other, non-HTTP-based protocols report blocked traffic? I can easily imagine going through the same long learning process for a database, an FTP server, or an SMTP service.
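To illustrate how little information a silent drop gives you, here's a minimal sketch of a plain TCP client in Python; the host and port are hypothetical. A firewall that silently drops packets produces a timeout, whereas an active rejection fails fast and is much easier to spot:

    import socket

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(5)
    try:
        s.connect(("internal-db.local", 5432))  # hypothetical host and port
    except socket.timeout:
        # A silent firewall drop looks exactly like a slow or
        # unreachable host: no response at all.
        print("timed out")
    except ConnectionRefusedError:
        # An active rejection (TCP RST) fails immediately and points
        # much more clearly at the network.
        print("connection refused")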

Confusion is introduced - even if it never fails again


Let's assume that these problems are addressed and the firewall's rules are updated to allow the legitimate requests. Let's also suppose that the firewall never blocks a genuine request again. When a socket timeout error is encountered, we could now point the finger at the firewall when we should be focusing on the application.

Recommendation


I'd advocate a simple firewall for internal traffic that whitelists IPs and ports only. If we can be certain that the firewall completely trusts traffic on an established TCP/IP connection, things become a lot simpler to debug.
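As a conceptual sketch (not a real firewall implementation), this is the whole of the rule evaluation I'm advocating: decisions based purely on addresses and ports, never on payload. The network, IP and port below are hypothetical:

    import ipaddress

    # (source network, destination IP, destination port) - hypothetical values
    ALLOWED = [
        ("10.0.1.0/24", "10.0.2.5", 8080),
    ]

    def is_allowed(src_ip, dst_ip, dst_port):
        for net, ip, port in ALLOWED:
            if (ipaddress.ip_address(src_ip) in ipaddress.ip_network(net)
                    and dst_ip == ip and dst_port == port):
                return True
        return False  # default deny; note the payload is never inspected

Because the decision never depends on the request contents, an identical request can't succeed one minute and be blocked the next.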

The simpler things are to debug, the quicker you can fix your site in an emergency! Time is of the essence.

If you must...


If there is an absolute requirement that these firewall rules are in place, confusion can be mitigated by performing the following:

  • Ensure that all developers are aware of how firewall issues may present themselves.
  • Provide a console for everyone to easily see whether traffic is being blocked by the firewall or not (a rough sketch of such a check follows below).
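As a very rough sketch of the kind of check such a console could expose, suppose (hypothetically) the firewall writes drop events to a log with whitespace-separated "timestamp action source destination port" fields:

    def blocked_entries(log_lines, src=None, dst=None):
        # Yield log lines for dropped traffic, optionally filtered by
        # source and/or destination address.
        for line in log_lines:
            fields = line.split()
            if len(fields) < 5 or fields[1] != "DROP":
                continue
            _ts, _action, entry_src, entry_dst, _port = fields[:5]
            if (src is None or entry_src == src) and \
               (dst is None or entry_dst == dst):
                yield line

    # "Did the firewall block anything from my host?" in one call.
    with open("/var/log/firewall.log") as f:  # hypothetical log path
        for entry in blocked_entries(f, src="10.0.1.17"):
            print(entry.strip())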

Comments

  1. If a firewall is doing stuff above layer 4 then it is an application service. Therefore it *should* follow all the same disciplines:

    - config under scm
    - replicas in test envs
    - automated deployment
    - monitoring and logging
    - notification to consumers of breaking changes
    - explicit contract(s)
    - automated tests

    Unfortunately this is often not the case with centralized ops teams. Since a system is only as strong as its weakest link, your recommendation of simplifying the configuration is a good one. Otherwise, if the security constraints are real, then some collaboration and effort will be needed to reach a more equitable common ground.
