Checmate / LSM use cases
Sargun Dhillon
Hello all,
Sorry for my bumbling on the phone yesterday. I promised to share the use cases for the LSM. I'd love to get opinions.

There is a large enterprise that runs between 8-24 containers on each server. These apps don't always play by the rules -- they don't always bind to the right ports, sometimes they make so many connections that they exhaust ephemeral ports, they use up too much bandwidth, etc. In addition, this organization's infrastructure leverages a legacy DC, and they're very performance sensitive. This legacy DC is IPv4 only. Right now, they employ a mechanism by which they run multiple network namespaces tied together with the tc mirred action; you can find out more about how it works here: http://mesos.apache.org/documentation/latest/port-mapping-isolator/. This is suboptimal because it interferes with the operation of things like ICMP (packets are fanned out to all network namespaces, resulting in dups), and it has a noticeable performance overhead. Specifically, for data-intensive use cases, they've mentioned overhead numbers of 20-30%.

They have the following use cases:

Preventing Incorrect Binds
They prevent apps that bind to the incorrect port from interfering by not forwarding traffic to that app's network namespace, but this has the downside of quiet failures. They would rather give each app a set of ports it may bind to, and if the app bound outside of those, return -EPERM from the bind() syscall (see the sketch at the end of this message).

Preventing Resource Exhaustion
They prevent apps from exhausting the ephemeral port range by carving out a 1000-port ephemeral range per network namespace, and only mirroring those ports on ingress. This has interesting issues as the range gets dense, and it caps a machine at 32 containers, because that exhausts the 32k ports dedicated to the ephemeral range. If instead they simply counted the number of unique connections to a given ip:port per container, they could sensibly limit it to 1000 and return EAGAIN, or some such.

Accounting
They use their current network isolator to account for traffic sent and received by a container. Right now this is done by monitoring the veth between the container netns and the host netns. Unfortunately, this has overhead. If we did this using XDP / tc + rcv_skb instead, we could save a lot of overhead.

Filtering
They run a lot of non-production apps alongside production systems. They want to be able to limit the egress access of production apps to non-production apps. This could potentially be done with XDP, but having multiple XDP filters, or a single complex TC/XDP filter for the entire system, could prove inflexible. Since these filters are constantly churning, filtering without severing existing connections would require connection tracking, and that's yet more overhead and complexity. Doing this at the syscall level would push that complexity down.

-----

There exist other organizations that want to use helpers that alter kernel memory. This is done for security, as well as capability.

Security
The organization uses DSCP markings and packet marks for filtering on the system and on the network. This would require a helper, accessible to the LSM, to change the mark of packets generated by the sk, and therefore they'd need a helper specifically to access information beyond just the sk. This could also be done using the netlabels framework for their use case, but netlabels requires opening up quite a few more APIs.
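To make that bind() behaviour concrete, here is a minimal illustrative sketch -- not code from this thread -- written against the BPF LSM hooks that landed upstream well after this discussion; the allowed port range is hypothetical.

/* Illustrative sketch: deny bind() to ports outside an allowed range with
 * -EPERM, using the upstream BPF LSM (a later mechanism than the Checmate
 * proposal discussed here). The port range is hypothetical. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_endian.h>

#define AF_INET          2      /* not provided by vmlinux.h */
#define EPERM            1
#define ALLOWED_PORT_MIN 8000   /* hypothetical per-container range */
#define ALLOWED_PORT_MAX 8024

SEC("lsm/socket_bind")
int BPF_PROG(restrict_bind, struct socket *sock, struct sockaddr *address,
             int addrlen, int ret)
{
        struct sockaddr_in *in4;
        __u16 port;

        if (ret)                        /* an earlier program already denied */
                return ret;
        if (address->sa_family != AF_INET)
                return 0;

        in4 = (struct sockaddr_in *)address;
        port = bpf_ntohs(in4->sin_port);

        /* port 0 asks for an ephemeral port; leave that to other policy */
        if (port && (port < ALLOWED_PORT_MIN || port > ALLOWED_PORT_MAX))
                return -EPERM;          /* fail loudly, not with quiet drops */
        return 0;
}

char LICENSE[] SEC("license") = "GPL";

This fails the syscall itself, which addresses the "quiet failures" complaint above.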
Capability: Load Balancing
There are a ton of container load balancing solutions right now. Unfortunately, all of them have caveats: IPVS only works with NAT in the cloud, HAProxy has a ton of overhead, and kube-proxy + iptables is slow and requires NAT. In addition, all of these solutions lose fidelity and make the BSD socket API not-a-thing for introspection of peer addresses, which makes logs hard to use. The organizations want the helper to be able to write to the sockaddr_in while intercepting the connect syscall, to redirect that connect elsewhere (see the sketch below). This is (1) much lower overhead than doing it at the XDP / TC layer and (2) allows logging to keep working.

Capability: Port Remapping (DNAT)
A lot of folks run Docker in the bridge / port binding mode (https://docs.docker.com/engine/userguide/networking/default_network/binding/). Unfortunately, this has a lot of downsides, such as speed, and requiring a separate netns. The customer would like to rewrite the struct sockaddr during the bind syscall. They're happier doing this once the data is copied to a kernel address, rather than in a probe, to prevent non-cooperating programs from binding to addresses that shouldn't be available to them (plus the aforementioned reasons about load balancing).
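As an illustration of that connect-time rewrite -- again not code from this thread -- the cgroup sock_addr hooks that later landed upstream expose exactly this, letting a program rewrite the destination before the connect proceeds. A minimal sketch, with a hypothetical service VIP and backend:

/* Illustrative sketch: remap connect() calls aimed at a service VIP onto a
 * backend, using the cgroup/connect4 hook (upstream after this thread).
 * Addresses and ports are hypothetical. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define SERVICE_VIP  0x0a00000a  /* 10.0.0.10 */
#define SERVICE_PORT 80
#define BACKEND_ADDR 0x0a000107  /* 10.0.1.7  */
#define BACKEND_PORT 8080

SEC("cgroup/connect4")
int lb_connect4(struct bpf_sock_addr *ctx)
{
        /* user_ip4 and user_port are in network byte order and writable here */
        if (ctx->user_ip4 == bpf_htonl(SERVICE_VIP) &&
            ctx->user_port == bpf_htons(SERVICE_PORT)) {
                ctx->user_ip4  = bpf_htonl(BACKEND_ADDR);
                ctx->user_port = bpf_htons(BACKEND_PORT);
        }
        return 1;   /* 1 = let the (possibly rewritten) connect proceed */
}

char LICENSE[] SEC("license") = "GPL";

The same hook family has a bind4 counterpart, which is the shape of the port remapping use case as well.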
Alexei Starovoitov
On Thu, Aug 11, 2016 at 12:06 PM, Sargun Dhillon via iovisor-dev
<iovisor-dev@...> wrote:

Thanks a lot for describing the use cases. The problem statement is clear. I suspect there cannot be a single solution to all of the above. Sounds like all containers already use netns. By itself netns has non-trivial overhead. Hannes is working on something new that should solve the veth/netns performance issues. Netns makes the above problems solvable at the L2 level; sometimes that doesn't fit, and solutions without netns are being developed. The upcoming cgroup+bpf work will allow ingress/egress socket filtering without netns: all processes of a container will be under a cgroup, and a bpf program will enforce the operation for all sockets of all their processes. Sounds like what you're saying is that the checmate lsm can solve all of the above? I think theoretically it can, but the bpf programs will become very complex. imo the cgroup+bpf approach is easier to manage and operate. cgroup+bpf won't work for load balancing and nat, but those are solvable at the tc+bpf layer, like Cilium does.
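For a sense of what that looks like in practice, here is a sketch against the cgroup_skb hooks that Daniel Mack's patches eventually added (not code from this thread; the "non-production" subnet is hypothetical):

/* Illustrative sketch: per-cgroup egress filter that drops traffic from a
 * production container's cgroup towards a hypothetical non-production subnet. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("cgroup_skb/egress")
int block_nonprod(struct __sk_buff *skb)
{
        __u32 daddr = 0;

        if (skb->protocol != bpf_htons(ETH_P_IP))
                return 1;               /* only inspecting IPv4 here */

        /* cgroup_skb programs see the packet from the IP header onwards;
         * the IPv4 destination address sits at offset 16 */
        if (bpf_skb_load_bytes(skb, 16, &daddr, sizeof(daddr)) < 0)
                return 1;

        /* hypothetical: 10.2.0.0/16 is the non-production network */
        if ((bpf_ntohl(daddr) & 0xffff0000) == 0x0a020000)
                return 0;               /* 0 = drop */
        return 1;                       /* 1 = allow */
}

char LICENSE[] SEC("license") = "GPL";

The program is attached to the container's cgroup, so only that container's sockets go through it and no netns is involved.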
Sargun Dhillon
On Fri, Aug 12, 2016 at 6:16 PM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:

I'm not familiar with Cilium. I'm implementing some of the programs that have the aforementioned behaviour, and some of them (limiting usage of ephemeral ports) get complicated, but it's still in the realm of possibility. I plan to share:
- Rx / Tx statistics per container (sketched below)
- Limiting usage of ephemeral ports
- Rewriting the bind port
- Limiting filesystem access

Do you have any advice on the API? Currently, I've created some new bits around prctl, but I feel like those extensions are a bit awkward, since prctl is supposed to work on a process-by-process basis. On the other hand, a VFS API seems complex -- do I make it similar to kprobes, where someone can echo an fd # into some file, and we add the probe?
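For the first item, a minimal sketch of what the per-container accounting could look like on top of the cgroup+bpf hooks instead of a veth (illustrative only; the map layout and names are made up):

/* Illustrative sketch: rx/tx byte counters for one container, kept in a BPF
 * map by cgroup_skb programs attached to that container's cgroup. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, 2);            /* 0 = rx bytes, 1 = tx bytes */
        __type(key, __u32);
        __type(value, __u64);
} traffic_bytes SEC(".maps");

static __always_inline int account(struct __sk_buff *skb, __u32 dir)
{
        __u64 *bytes = bpf_map_lookup_elem(&traffic_bytes, &dir);

        if (bytes)
                *bytes += skb->len;
        return 1;                          /* accounting only; never drop */
}

SEC("cgroup_skb/ingress")
int count_rx(struct __sk_buff *skb) { return account(skb, 0); }

SEC("cgroup_skb/egress")
int count_tx(struct __sk_buff *skb) { return account(skb, 1); }

char LICENSE[] SEC("license") = "GPL";

Userspace then reads and sums the per-CPU values to get the container's totals.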
Alexei Starovoitov
On Tue, Aug 16, 2016 at 5:09 PM, Sargun Dhillon <sargun@...> wrote:
For the networking bits, Daniel Mack is working on cgroup+bpf patches that will allow the first two (per-container stats and limiting ephemeral ports). Rewriting the bind port would need write access, which is a possible extension. fs is tricky. It should probably be cgroup based as well, but that would be a different cgroup controller; the one Daniel is doing is network only, since the hook point is similar to sk_filter().
Sargun Dhillon
On Tue, Aug 16, 2016 at 07:38:45PM -0700, Alexei Starovoitov wrote:
I read the thread on netdev by the fellow writing a cgroup isolator to do similar things to me. Given this, do you think the approach of building cgroup controllers to attach hooks to BPF programs is better, and I should abandon the LSM for that instead -- or just sit tight for Daniel Mack's patches? Perhaps it makes sense to have a network cgroup, where there is a per-cgroup hook to get into sk_filter, bind, and listen, so that one can filter there instead of at the LSM level? Do you think adding such hooks would be a better approach, as opposed to doing a global LSM?
Alexei Starovoitov
On Tue, Aug 16, 2016 at 9:16 PM, Sargun Dhillon <sargun@...> wrote:
I think the cgroup approach is better, since it provides hierarchy. Any global hook is harder to manage, since current_task_under_cgroup doesn't scale beyond a few containers/cgroups. It also adds overhead for the host, whereas the goal is to restrict containers, right? For networking restrictions like ports, container stats, and container qos, cgroup+bpf is better. Also, sooner or later we'd need to start charging cpu as well for networking (packet rx/tx, qos, etc), and the only feasible way to do that is to integrate this tightly with cgroupv2. For fs/io restrictions I don't know what the best approach would be; I suspect cgroup style is likely better as well. I'm simply struggling to see how a global lsm can work well with containers.
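To illustrate the scaling point with a hypothetical sketch (not code from the thread): with a single global hook, the program has to test cgroup membership once per container it cares about, e.g. via bpf_current_task_under_cgroup() against a cgroup array, so every event pays a cost proportional to the number of containers -- including events from host processes.

/* Hypothetical sketch of the pattern being criticised: a global hook walking
 * a cgroup array to find which container, if any, the current task belongs
 * to. The walk is O(number of containers) on every invocation. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define MAX_CONTAINERS 64

struct {
        __uint(type, BPF_MAP_TYPE_CGROUP_ARRAY);
        __uint(max_entries, MAX_CONTAINERS);   /* one cgroup fd per container */
        __type(key, __u32);
        __type(value, __u32);
} container_cgroups SEC(".maps");

SEC("kprobe/inet_bind")
int global_bind_hook(struct pt_regs *ctx)
{
        __u32 i;

        for (i = 0; i < MAX_CONTAINERS; i++) {
                /* the helper returns 0 when current is inside cgroup slot i */
                if (bpf_current_task_under_cgroup(&container_cgroups, i) == 0) {
                        /* ...look up and apply container i's policy... */
                        return 0;
                }
        }
        return 0;   /* host tasks still paid for the whole walk */
}

char LICENSE[] SEC("license") = "GPL";

With a per-cgroup attachment point, the membership question disappears: the program only ever runs for the container it is attached to.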