Big Geospatial Data on Docker

Oct 28, 2020

by Roman Bilusiak

While docking the “Ingeenee” Optimizer to the New Graph of Points & Places, we hit the limits of our hardware. The new geodata is 10x bigger and simply did not fit, because of the multiplier effects it causes further down the [previous] data pipeline. In this post I will describe how we overcame a few technical problems. The solution is universal enough that you can apply it in your own work.

Data

While constructing the new graph of points we have to query for geospatial information. To be more specific, we call the geospatial module almost 10,000,000,000 times (a one and ten zeros). There are light and heavy queries, so request times vary, ranging from 20ms to 2000ms. The estimated run time was ~2–3 months on a single server. We have a Docker Swarm cluster, so we deployed the geospatial module to 10+ nodes to process all the data in 5 days. As it turned out, the workload would not finish in a month, or in two months… the ETA became almost infinite.

Problem

About 24 hours after deploying the workload, we received a notification from Zabbix that one of the nodes had become overloaded. Analysis showed that all the other nodes were idling with no load, while that one node was struggling under 1000% load. The load was supposed to be uniform across all nodes, and the Swarm load balancer was responsible for that. The Docker Swarm load balancer uses roundrobin as its default (and actually the only supported) balancing method. As a result, heavy requests stacked up on the slowest node and overloaded it.
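To see why roundrobin piles work onto a slow node, here is a minimal simulation sketch. It is not our production code: the node count, the service times, and the arrival interval are made-up numbers, and each node is modeled as handling one request at a time.

```python
def simulate_round_robin(service_ms, n_requests, arrival_ms):
    """Return the number of unfinished requests per node at the last arrival.

    service_ms[k] is the (hypothetical) time node k needs for one request.
    Requests arrive every arrival_ms and are assigned strictly in turn.
    """
    n = len(service_ms)
    free_at = [0] * n                 # when each node finishes its queue
    finish = [[] for _ in range(n)]   # finish time of every request, per node
    for i in range(n_requests):
        now = i * arrival_ms
        node = i % n                  # roundrobin ignores how busy the node is
        start = max(now, free_at[node])
        free_at[node] = start + service_ms[node]
        finish[node].append(free_at[node])
    end = (n_requests - 1) * arrival_ms
    return [sum(1 for t in f if t > end) for f in finish]


# Three fast nodes (100 ms per request) and one slow node (1000 ms per request).
print(simulate_round_robin([100, 100, 100, 1000], 400, 40))
# prints [0, 1, 1, 85]
```

Every node receives exactly the same number of requests, but the slow one cannot keep up, so its backlog keeps growing for as long as requests keep arriving.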

If you need more details, see the official Docker documentation on using the swarm mode routing mesh.

Solution

A different balancing method had to be used to make sure the load is distributed uniformly. We installed HAProxy as an external load balancer. It supports the leastconn (least-connected) balancing method, which sends each new request to the node with the fewest open connections. The idea is to hide all servers behind the load balancer and make it the single entry point, as shown in the diagram “HAProxy load balancer on swarm cluster and overlay network”.
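The least-connected idea can be sketched in a few lines of Python. This is an illustration only, not HAProxy's implementation: the workload numbers are hypothetical, each node is modeled as handling one request at a time, and ties go to the lowest node index (HAProxy itself rotates among equally loaded servers).

```python
def simulate_least_conn(service_ms, n_requests, arrival_ms):
    """Return (assigned, pending) per node under a least-connected policy.

    service_ms[k] is the (hypothetical) time node k needs for one request.
    """
    n = len(service_ms)
    free_at = [0] * n                  # when each node finishes its queue
    in_flight = [[] for _ in range(n)] # finish times of still-open requests
    assigned = [0] * n
    for i in range(n_requests):
        now = i * arrival_ms
        for q in in_flight:            # forget requests that already finished
            q[:] = [t for t in q if t > now]
        # Pick the node with the fewest open requests (ties: lowest index).
        node = min(range(n), key=lambda k: len(in_flight[k]))
        start = max(now, free_at[node])
        free_at[node] = start + service_ms[node]
        in_flight[node].append(free_at[node])
        assigned[node] += 1
    end = (n_requests - 1) * arrival_ms
    pending = [sum(1 for t in q if t > end) for q in in_flight]
    return assigned, pending


# Three fast nodes (100 ms per request) and one slow node (1000 ms per request).
assigned, pending = simulate_least_conn([100, 100, 100, 1000], 400, 40)
print(assigned)  # prints [134, 133, 133, 0]
print(pending)   # prints [1, 1, 1, 0]
```

In this toy workload the fast nodes always have a free slot, so the slow node is never picked and no backlog builds up. With a heavier load the slow node would receive requests too, but only when it has no more open connections than the others.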

Details

Here is a sample config that makes HAProxy resolve host names through the Docker resolver and sets the balancing method to least-connected:

global
    log          fd@2 local2
    chroot       /var/lib/haproxy
    pidfile      /var/run/haproxy.pid
    maxconn      40000
    user         haproxy
    group        haproxy
    stats socket /var/lib/haproxy/stats expose-fd listeners
    master-worker

defaults
    timeout connect 100s
    timeout client 300s
    timeout server 300s
    log global
    mode http
    option httplog

resolvers docker
    # 127.0.0.11 is the DNS server embedded in Docker containers
    nameserver docker 127.0.0.11:53
    resolve_retries 3
    timeout resolve 10s
    timeout retry   10s
    hold other      10s
    hold refused    10s
    hold nx         10s
    hold timeout    10s
    hold valid      10s
    hold obsolete   10s

backend stat
    stats enable
    stats uri /stats
    stats refresh 15s
    stats show-legends
    stats show-node

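backend backend_web
    # send each new request to the server with the fewest open connections
    balance leastconn
    option forwardfor
    # 20 server slots (my-web-1 .. my-web-20), filled in from Docker DNS;
    # "init-addr libc,none" lets HAProxy start even if my-web does not resolve yet
    server-template my-web- 1-20 my-web:80 check resolvers docker init-addr libc,none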

frontend frontend_web
    bind *:80
    use_backend stat if { path -i /stats }
    default_backend backend_web
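To check that connections really are spread evenly, one option is to read the per-server session counters from the stats socket declared in the global section. The sketch below assumes the socket accepts the `show stat` command (which returns CSV) and that the script runs where the socket file is reachable. The helper names and the sample values are hypothetical, and the sample header is trimmed to a few columns.

```python
import csv
import io
import socket


def read_show_stat(socket_path="/var/lib/haproxy/stats"):
    """Send "show stat" to the HAProxy stats socket and return the CSV text."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(b"show stat\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:              # HAProxy closes the connection when done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8")


def current_sessions(csv_text, backend):
    """Map server name -> current sessions (scur) for one backend."""
    # The header line starts with "# ", which is not part of the first column.
    reader = csv.DictReader(io.StringIO(csv_text.lstrip("# ")))
    return {
        row["svname"]: int(row["scur"])
        for row in reader
        if row["pxname"] == backend
        and row["svname"] not in ("FRONTEND", "BACKEND")
    }


# Trimmed, made-up sample of what "show stat" returns.
SAMPLE = """# pxname,svname,qcur,qmax,scur,smax,
frontend_web,FRONTEND,,,9,40,
backend_web,my-web-1,0,0,3,7,
backend_web,my-web-2,0,0,3,6,
backend_web,BACKEND,0,0,6,13,
"""

print(current_sessions(SAMPLE, "backend_web"))
# prints {'my-web-1': 3, 'my-web-2': 3}
```

On a live instance, `current_sessions(read_show_stat(), "backend_web")` would give the same kind of mapping. With leastconn, the counts should stay close to each other even when some nodes are slower than others.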

Conclusion

The problem we faced with geospatial data is a general one and is caused by the roundrobin balancing method. Docker Swarm, HAProxy, and Nginx all use it as the default balancing method, which may cause similar issues on high-load systems in the long run. Roundrobin is a quick and lazy implementation of load balancing, and it is better not to rely on it by default. Load balance explicitly!