Re: [RFC][Proposal] BPF Control MAP

Saeed Mahameed <saeedm@...>

On Tue, 2019-03-19 at 08:36 -0700, Alexei Starovoitov wrote:
On Fri, Mar 08, 2019 at 08:51:03PM +0000, Saeed Mahameed wrote:
Thoughts ?
It's certainly an interesting idea. I think we need to agree on use
and goals first before bikesheding on the solution.
Sure will discuss most of the use cases tomorrow in the iovisor

Example use cases (XDP only for now):

1) Query XDP stats of all XDP netdevs:

xdp_ctrl = bpf_create_map(BPF_MAP_TYPE_CONTROL, map_attr.ctrl_type

while (bpf_map_get_next_key(xdp_ctrl, &ifindex, &next_ifindex) ==
0) {
bpf_map_lookup_elem(xdp_ctrl, &next_ifindex, &stats);
// we don't even need to know stats format in this case
btf_pretty_print(xdp_ctrl->btf, &stats);
ifindex = next_ifindex;
this bit show cases advantage of BTF nicely.

2) Setup XDP tx redirect resources on egress netdev (netdev with no

xdp_ctrl = bpf_create_map(BPF_MAP_TYPE_CONTROL, map_attr.ctrl_type

xdp_attr->command = SETUP_REDIRECT;
xdp_attr->rings.num = 12;
xdp_attr->rings.size = 128;
bpf_map_update_elem(xdp_ctrl, &ifindex, &xdp_attr);
this one starting to become a bit odd, since input arguments are
between key an value. ifindex is part of the key whereas command,
rings.num and rings.size are part of the value.
The idea is that we need to keep a map semantics (object->vlaue).
in this case ( map_attr.ctrl_type = XDP_ATTR ) object (key) is if_index
and value is xdp attributes of that if_index, which plays nicely when
you want to iterate through all the objects (XDP netdevs). simply think
of ifindex as a key and not an input value.

different ctrl map types can have different keys and values hence the
need for map_create and passing map_attr.ctrl_type attribute which will
define what the user is trying to access (which control map) and what
will be the key (object) & value (configuration) pair layout.

3) Turn On/Off XDP meta data offloads and retrieve meta data BTF
of specific netdev/hardware:

xdp_ctrl = bpf_create_map(BPF_MAP_TYPE_CONTROL, map_attr.ctrl_type

xdp_attr->command = SETUP_MD;
xdp_attr->enable_md = 1;
err = bpf_map_update_elem(xdp_ctrl, &ifindex, &xdp_attr);
if (err) {
printf("XDP meta data is not supported on this netdev");
// Query Meta data BTF
bpf_map_lookup_elem(xdp_ctrl, &ifindex, &xdp_attr);
md_btf = xdp_attr.md_btf;
here it gets even weirder, since lookup arguments are also
split between key and value.
ifindex is inside the key while addition info is passed
inside xdp_attr which is part of value.

I wish we could do maps where every key would have different layout.
Then such api would be suitable.
The existing maps require all keys and values to be inform.
I guess one can argue that such 'control map' can have one element
and one value, but then what is the purpose of 'map_create'
and 'map_delete==close(fd)' operations?
This is exactly the purpose of map_{create/delete} to define what the
key,vlaue format will be ( it is not going to be ifindex for all
control maps), to make it clear, the key doesn't have to be an ifindex
at all, it depends on the map_attr.ctrl_type which the user request on
map creation, so different layouts are already supported by this

examples of different map_attr.ctrl_type:

fd = create_map(BPF_MAP_TYPE_CONTROL, map_attr.ctrl_type = XDP_ATTR)
// Key layout == ifindex, vlaue format if_xdp_attributes

fd = create_map(BPF_MAP_TYPE_CONTROL, map_attr.ctrl_type =
// Key layout == prog_fd, value format struct bpf_prog_stats

fd = create_map(BPF_MAP_TYPE_CONTROL, map_attr.ctrl_type =
// Key layout == socket_fd, value layout struct bpf_sock_attr

extending this can be done in any linux BPF subsystem, by introducing
new map_attr.ctrl_types and new key value layouts of that control type.

on map creation we will attach the btf format of the (key, value) pair
to the map created for that user.

I think we need something else here. Using BTF to describe
output stats is nice, but using BTF to describe input query is
since user cannot know before hand what kernel can and cannot accept.
imo input should be stable uapi with fixed constants whereas
stats-like output can be BTF based and vary from driver to driver
and from one nic version to another.
Well, we can decide to use static stable uapi, and still use this
special map to leverage the map API as described in this doc.

but also we can allow dynamic value layouts, but any extension should
be done to the end of any value structuer and on lookup we will only
copy the size that the user already recognizes .. ? or we can
assume/force the user to use the map_attributes to figure out format
layouts and sizes, but still we will guarantee backward compatibility
in kernel level by keeping old format the same, and extension is only
allowed to the end of the value structures.

Join to automatically receive all group messages.