Checmate / LSM use cases


Sargun Dhillon
 

Hello all,
Sorry for my bumbling on the phone yesterday. I promised to share the
use cases for the LSM. I'd love to get opinions.

There is a large enterprise that runs between 8 and 24 containers on each
server. These apps don't always play by the rules -- they don't always
bind to the right ports, sometimes they make so many connections that
they exhaust ephemeral ports, they use up too much bandwidth, etc. In
addition, this organization's infrastructure runs in a legacy DC, and
they're very performance sensitive. This legacy DC is IPv4 only.

Right now, they run multiple network namespaces per host, tied together
with the tc mirred action; you can find out more about how it works here:
http://mesos.apache.org/documentation/latest/port-mapping-isolator/.
This is suboptimal because it interferes with the operation of things like
ICMP (packets are fanned out to all network namespaces, resulting in
duplicates), and it has a noticeable performance overhead.

Specifically, with data-intensive use cases, they've mentioned
overhead numbers of 20-30%.

They have the following use cases:
Preventing Incorrect Binds
Today they keep apps that bind to the incorrect port from interfering by
simply not forwarding traffic to that app's network namespace, but this
has the downside of quiet failures. They would rather give each app a set
of ports it may bind to, and return -EPERM from the bind() syscall if the
app binds outside of that set.
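
As a rough illustration of what such a check could look like (this is not
the Checmate prototype; it is written against the BPF LSM socket_bind hook
that only landed in later kernels, and the map name and allow-list are
made up for the example):

// Hedged sketch: deny bind() to ports outside an allow-list by returning
// -EPERM. Not the Checmate prototype; assumes the (later) BPF LSM hook on
// security_socket_bind, and the map name/contents are illustrative.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_endian.h>

#define AF_INET 2
#define EPERM   1

struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 64);
        __type(key, __u16);     /* allowed port, host byte order */
        __type(value, __u8);    /* presence means "allowed" */
} allowed_ports SEC(".maps");

SEC("lsm/socket_bind")
int BPF_PROG(restrict_bind, struct socket *sock, struct sockaddr *address,
             int addrlen)
{
        struct sockaddr_in *in4;
        __u16 port;

        if (address->sa_family != AF_INET)
                return 0;                /* only police IPv4 here */

        in4 = (struct sockaddr_in *)address;
        port = bpf_ntohs(in4->sin_port);
        if (port == 0)
                return 0;                /* kernel picks an ephemeral port */

        if (bpf_map_lookup_elem(&allowed_ports, &port))
                return 0;                /* in the allow-list */

        return -EPERM;                   /* fail loudly at bind() time */
}

char LICENSE[] SEC("license") = "GPL";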

Preventing Resource Exhaustion
They prevent apps from exhausting the ephemeral port range by carving out
a 1000-port ephemeral range per network namespace, and only mirroring
those ports on ingress. This has interesting issues when a range gets
dense, and it caps a given machine at 32 containers, because that
exhausts the roughly 32k ports dedicated to the ephemeral range. If
instead they just counted the number of unique connections a container
makes to a given ip:port, they could sensibly limit it to 1000 and return
EAGAIN, or some such.
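
A sketch of counting connections per container per destination might look
like the following. It is assumption-heavy: it uses the later BPF LSM
socket_connect hook and bpf_get_current_cgroup_id() as a stand-in for
"per container", it only counts connect() attempts (a real version would
also decrement when connections close), and the names and the 1000 limit
are illustrative:

// Hedged sketch: cap the number of connect() calls a container makes to a
// given ip:port, returning -EAGAIN past the limit. Counts attempts only;
// a real version would also decrement when connections close.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#define AF_INET    2
#define EAGAIN     11
#define CONN_LIMIT 1000

struct dest_key {
        __u64 cgroup_id;        /* stand-in for "container" */
        __u32 daddr;            /* IPv4 destination, network byte order */
        __u16 dport;            /* destination port, network byte order */
};

struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 65536);
        __type(key, struct dest_key);
        __type(value, __u64);   /* number of connect() attempts */
} conn_counts SEC(".maps");

SEC("lsm/socket_connect")
int BPF_PROG(limit_connects, struct socket *sock, struct sockaddr *address,
             int addrlen)
{
        struct sockaddr_in *in4;
        struct dest_key key = {};
        __u64 one = 1, *cnt;

        if (address->sa_family != AF_INET)
                return 0;

        in4 = (struct sockaddr_in *)address;
        key.cgroup_id = bpf_get_current_cgroup_id();
        key.daddr = in4->sin_addr.s_addr;
        key.dport = in4->sin_port;

        cnt = bpf_map_lookup_elem(&conn_counts, &key);
        if (!cnt) {
                bpf_map_update_elem(&conn_counts, &key, &one, BPF_ANY);
                return 0;
        }
        if (*cnt >= CONN_LIMIT)
                return -EAGAIN;          /* over the per-destination budget */

        __sync_fetch_and_add(cnt, 1);
        return 0;
}

char LICENSE[] SEC("license") = "GPL";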

Accounting
They use their current network isolator to account for traffic sent and
received by a container. Right now this is done by monitoring the veth
between the container netns and the host netns. Unfortunately, this has
overhead. If we did this using XDP / tc + rcv_skb instead, we could save
a lot of overhead.
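
For what it's worth, per-container byte accounting in the cgroup+bpf style
discussed later in this thread could look roughly like this (one such
object loaded and attached per container cgroup; names and map layout are
illustrative):

// Hedged sketch: count bytes sent/received per container with cgroup_skb
// programs (one loaded object attached per container cgroup). This is the
// cgroup+bpf style discussed later in the thread, not the veth approach.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct counters {
        __u64 rx_bytes;
        __u64 tx_bytes;
};

struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, struct counters);
} acct SEC(".maps");

static __always_inline int account(struct __sk_buff *skb, int egress)
{
        __u32 key = 0;
        struct counters *c = bpf_map_lookup_elem(&acct, &key);

        if (c) {
                if (egress)
                        c->tx_bytes += skb->len;
                else
                        c->rx_bytes += skb->len;
        }
        return 1;       /* 1 == let the packet through */
}

SEC("cgroup_skb/ingress")
int count_rx(struct __sk_buff *skb)
{
        return account(skb, 0);
}

SEC("cgroup_skb/egress")
int count_tx(struct __sk_buff *skb)
{
        return account(skb, 1);
}

char LICENSE[] SEC("license") = "GPL";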

Filtering
They run a lot of non-production apps alongside production systems. They
want to be able to limit the egress access of production apps to
non-production apps. This could potentially be done with XDP, but having
multiple XDP filters, or a single complex TC/XDP filter for the entire
system, could prove inflexible. Since these filters are constantly
churning, filtering in a way that ensures existing connections are not
severed would require connection tracking, and that's yet more overhead
and complexity. Doing this at the syscall level would push that
complexity down.
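
As a rough illustration of syscall-level egress filtering (again assuming
the BPF LSM socket_connect hook that only appeared in later kernels, not
the Checmate prototype; the map name and deny-list layout are made up),
something like the following would refuse new connects to listed prefixes
while leaving already-established connections untouched:

// Hedged sketch: block egress connect() to destinations on a deny list of
// IPv4 prefixes, instead of churning per-packet TC/XDP filters.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#define AF_INET 2
#define EPERM   1

struct prefix_key {
        __u32 prefixlen;        /* LPM trie: number of significant bits */
        __u32 addr;             /* IPv4 address, network byte order */
};

struct {
        __uint(type, BPF_MAP_TYPE_LPM_TRIE);
        __uint(map_flags, BPF_F_NO_PREALLOC);
        __uint(max_entries, 1024);
        __type(key, struct prefix_key);
        __type(value, __u8);    /* presence means "deny" */
} denied_prefixes SEC(".maps");

SEC("lsm/socket_connect")
int BPF_PROG(filter_egress, struct socket *sock, struct sockaddr *address,
             int addrlen)
{
        struct sockaddr_in *in4;
        struct prefix_key key;

        if (address->sa_family != AF_INET)
                return 0;

        in4 = (struct sockaddr_in *)address;
        key.prefixlen = 32;             /* longest-prefix match on the address */
        key.addr = in4->sin_addr.s_addr;

        if (bpf_map_lookup_elem(&denied_prefixes, &key))
                return -EPERM;          /* refuse the new connect outright */

        return 0;
}

char LICENSE[] SEC("license") = "GPL";
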
-----

There exist other organizations that want to use helpers that alter
kernel memory. This is done for security, as well as for capability.

Security
The organization uses DSCP markings and packet marks for filtering on the
system and on the network. This would require a helper, accessible to the
LSM, that changes the mark of packets generated by the sk; in other
words, a helper specifically for touching state beyond just the sk
itself. Their use case could also be addressed with the netlabel
framework, but netlabel requires opening up quite a few more APIs.

Capability: Load Balancing
There are a ton of container load balancing solutions right now.
Unfortunately, all of them have caveats: IPVS only works with NAT in the
cloud, HAProxy has a ton of overhead, and kube-proxy + iptables is slow
and requires NAT. In addition, all of these solutions lose fidelity and
make the BSD socket API useless for introspecting peer addresses. This
makes logs hard to use.

The organizations want the helper to be able to write to the sockaddr_in
while intercepting the connect syscall, to redirect that connect
elsewhere. This (1) has much lower overhead than doing it at the XDP / TC
layer and (2) allows logging to keep working.
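
As a sketch of the idea (using the cgroup/connect4 hook that eventually
shipped for exactly this kind of rewrite, rather than the LSM helper under
discussion; the map of virtual addresses to backends is made up):

// Hedged sketch: rewrite the destination of connect() for IPv4 sockets,
// so the application still sees the virtual ip:port while the kernel
// connects to a real backend. Map layout and names are illustrative.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct vip_key {
        __u32 vip;      /* virtual IPv4 address, network byte order */
        __u16 vport;    /* virtual port, network byte order */
        __u16 pad;
};

struct backend {
        __u32 addr;     /* backend IPv4 address, network byte order */
        __u16 port;     /* backend port, network byte order */
        __u16 pad;
};

struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 1024);
        __type(key, struct vip_key);
        __type(value, struct backend);
} vip_to_backend SEC(".maps");

SEC("cgroup/connect4")
int redirect_connect(struct bpf_sock_addr *ctx)
{
        struct vip_key key = {
                .vip   = ctx->user_ip4,
                .vport = (__u16)ctx->user_port,  /* already network order */
        };
        struct backend *be = bpf_map_lookup_elem(&vip_to_backend, &key);

        if (be) {
                ctx->user_ip4  = be->addr;       /* rewrite destination */
                ctx->user_port = be->port;
        }
        return 1;       /* 1 == allow the connect to proceed */
}

char LICENSE[] SEC("license") = "GPL";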

Capability: Port Remapping (DNAT)
A lot of folks run Docker in the bridge / port binding mode
(https://docs.docker.com/engine/userguide/networking/default_network/binding/).
Unfortunately, this has a lot of downsides, such as speed and requiring a
separate netns. The customer would like to rewrite the struct sockaddr
during the bind syscall. They would rather do this once the data has been
copied to a kernel address than in a probe, to prevent non-cooperating
programs from binding to addresses that shouldn't be available to them
(plus the aforementioned reasons about load balancing).
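
A bind-time rewrite would be similar in spirit; the sketch below assumes
the cgroup/bind4 hook that appeared later and a fixed per-container port
offset, purely for illustration:

// Hedged sketch: remap the port an application binds to, similar in
// spirit to Docker-style port publishing but done at bind() time in the
// kernel. The fixed PORT_OFFSET is illustrative only.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define PORT_OFFSET 10000   /* e.g. app binds 80, host actually binds 10080 */

SEC("cgroup/bind4")
int remap_bind(struct bpf_sock_addr *ctx)
{
        __u16 port = bpf_ntohs((__u16)ctx->user_port);

        if (port != 0 && port < 1024)
                ctx->user_port = bpf_htons(port + PORT_OFFSET);

        return 1;       /* 1 == allow the (possibly rewritten) bind */
}

char LICENSE[] SEC("license") = "GPL";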


Alexei Starovoitov
 

On Thu, Aug 11, 2016 at 12:06 PM, Sargun Dhillon via iovisor-dev
<iovisor-dev@...> wrote:
Thanks a lot for describing the use cases.
The problem statement is clear.
I suspect there cannot be a single solution to all of the above.
Sounds like all containers already use netns.
By itself netns has non-trivial overhead. Hannes is working
on something new that should solve the veth/netns performance issues.
Netns makes the above problems solvable at the L2 level.
Sometimes it doesn't fit, and solutions without netns are being
developed.
The upcoming cgroup+bpf work will allow ingress/egress socket
filtering without netns. All processes of a container will be
under a cgroup, and a bpf program will enforce the policy for all
sockets of all those processes.
Sounds like what you're saying is that the checmate lsm can
solve all of the above? I think theoretically it can, but the bpf
programs will become very complex.
imo the cgroup+bpf approach is easier to manage and operate.
cgroup+bpf won't work for load balancing and nat, but those are
solvable at the tc+bpf layer, like cilium does.
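
To make the cgroup+bpf model concrete, the user-space side of attaching
such a program to a container's cgroup could look roughly like the libbpf
sketch below; the object name, program name ("count_tx" from the earlier
accounting sketch), and cgroup path are hypothetical, and the attach type
shown is the one that eventually shipped:

// Hedged sketch of the user-space side of cgroup+bpf: load an skb program
// and attach it to a container's cgroup so it applies to all sockets of
// all processes in that cgroup. Paths and object names are illustrative.
#include <bpf/libbpf.h>
#include <bpf/bpf.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        struct bpf_object *obj;
        struct bpf_program *prog;
        int cgroup_fd, err;

        obj = bpf_object__open_file("container_egress.bpf.o", NULL);
        if (!obj || bpf_object__load(obj))
                return 1;

        prog = bpf_object__find_program_by_name(obj, "count_tx");
        if (!prog)
                return 1;

        cgroup_fd = open("/sys/fs/cgroup/mycontainer", O_RDONLY);
        if (cgroup_fd < 0)
                return 1;

        /* Every socket of every process in the cgroup now runs this
         * program on egress. */
        err = bpf_prog_attach(bpf_program__fd(prog), cgroup_fd,
                              BPF_CGROUP_INET_EGRESS, 0);
        if (err)
                fprintf(stderr, "attach failed: %d\n", err);

        close(cgroup_fd);
        return err ? 1 : 0;
}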


Sargun Dhillon
 

On Fri, Aug 12, 2016 at 6:16 PM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:
I'm not familiar with Cilium?

I'm implementing some of the programs that have the aforementioned
behaviour, and some of them (limiting usage of ephemeral ports) get
complicated, but it's still in the realm of possibility. I plan to
share:
-Rx / Tx statistics per container
-Limiting usage of ephemeral ports
-Rewriting bind port
-Limiting filesystem access

Do you have any advice on the API?

Currently, I've created some new bits around prctl, but I feel like
those extensions are a bit awkward, since prctl is supposed to work on
a process-by-process basis. On the other hand, a VFS API seems complex
-- do I make it similar to kprobes, where someone can echo an fd #
into some file, and we add the probe?


Alexei Starovoitov
 

On Tue, Aug 16, 2016 at 5:09 PM, Sargun Dhillon <sargun@...> wrote:
For the networking bits, Daniel Mack is working on cgroup+bpf patches
that will cover the first two.
Rewriting the bind port would need write access, which is a possible
extension.
fs is tricky. It should probably be cgroup based as well,
but that will be a different cgroup controller. The one Daniel is doing
is network only, since the hook point is similar to sk_filter().



Sargun Dhillon
 

On Tue, Aug 16, 2016 at 07:38:45PM -0700, Alexei Starovoitov wrote:
I read the thread on netdev by the fellow writing a cgroup isolator to do
similar things to me. Given this, do you think the approach of building
cgroup controllers that attach hooks to BPF programs is better, and that I
should abandon the LSM for that instead -- or should I just sit tight for
Daniel Mack's patches?

Perhaps it makes sense to have a network cgroup, where there is a per-cgroup
hook into sk_filter, bind, and listen, so that one can filter there instead
of at the LSM level? Do you think adding such hooks would be a better
approach than doing a global LSM?




Alexei Starovoitov
 

On Tue, Aug 16, 2016 at 9:16 PM, Sargun Dhillon <sargun@...> wrote:
I think the cgroup approach is better, since it provides hierarchy.
Any global hook is harder to manage, since current_task_under_cgroup
doesn't scale beyond more than a few containers/cgroups.
It also adds overhead for the host, whereas the goal is
to restrict containers, right?
For networking restrictions like ports, container stats, and
container qos, cgroup+bpf is better.
Also, sooner or later we'd need to start charging cpu for networking
as well (packet rx/tx, qos, etc.), and the only feasible way to do
that is to integrate it tightly with cgroupv2.

For fs/io restrictions I don't know what the best approach would be.
I suspect cgroup style is likely better there as well.
I'm simply struggling to see how a global lsm can work well with containers.