[RFC][Proposal] BPF Control MAP


Alexei Starovoitov
 

On Fri, Mar 08, 2019 at 08:51:03PM +0000, Saeed Mahameed wrote:

Thoughts ?
It's certainly an interesting idea. I think we need to agree on use cases
and goals first before bikesheding on the solution.


Example use cases (XDP only for now):

1) Query XDP stats of all XDP netdevs:

xdp_ctrl = bpf_create_map(BPF_MAP_TYPE_CONTROL, map_attr.ctrl_type =
XDP_STATS);

while (bpf_map_get_next_key(xdp_ctrl, &ifindex, &next_ifindex) == 0) {
bpf_map_lookup_elem(xdp_ctrl, &next_ifindex, &stats);
// we don't even need to know stats format in this case
btf_pretty_print(xdp_ctrl->btf, &stats);
ifindex = next_ifindex;
}
this bit show cases advantage of BTF nicely.

2) Setup XDP tx redirect resources on egress netdev (netdev with no XDP
program).

xdp_ctrl = bpf_create_map(BPF_MAP_TYPE_CONTROL, map_attr.ctrl_type =
XDP_ATTR);

xdp_attr->command = SETUP_REDIRECT;
xdp_attr->rings.num = 12;
xdp_attr->rings.size = 128;
bpf_map_update_elem(xdp_ctrl, &ifindex, &xdp_attr);
this one starting to become a bit odd, since input arguments are split
between key an value. ifindex is part of the key whereas command,
rings.num and rings.size are part of the value.

3) Turn On/Off XDP meta data offloads and retrieve meta data BTF format
of specific netdev/hardware:

xdp_ctrl = bpf_create_map(BPF_MAP_TYPE_CONTROL, map_attr.ctrl_type =
XDP_ATTR);

xdp_attr->command = SETUP_MD;
xdp_attr->enable_md = 1;
err = bpf_map_update_elem(xdp_ctrl, &ifindex, &xdp_attr);
if (err) {
printf("XDP meta data is not supported on this netdev");
return;
}
// Query Meta data BTF
bpf_map_lookup_elem(xdp_ctrl, &ifindex, &xdp_attr);
md_btf = xdp_attr.md_btf;
here it gets even weirder, since lookup arguments are also
split between key and value.
ifindex is inside the key while addition info is passed
inside xdp_attr which is part of value.

I wish we could do maps where every key would have different layout.
Then such api would be suitable.
The existing maps require all keys and values to be inform.
I guess one can argue that such 'control map' can have one element
and one value, but then what is the purpose of 'map_create'
and 'map_delete==close(fd)' operations?

I think we need something else here. Using BTF to describe
output stats is nice, but using BTF to describe input query is problematic,
since user cannot know before hand what kernel can and cannot accept.
imo input should be stable uapi with fixed constants whereas
stats-like output can be BTF based and vary from driver to driver
and from one nic version to another.


Alexei Starovoitov
 

On Tue, Mar 19, 2019 at 06:54:08PM +0000, Saeed Mahameed wrote:

This is exactly the purpose of map_{create/delete} to define what the
key,vlaue format will be ( it is not going to be ifindex for all
control maps), to make it clear, the key doesn't have to be an ifindex
at all, it depends on the map_attr.ctrl_type which the user request on
map creation, so different layouts are already supported by this
proposal

examples of different map_attr.ctrl_type:

fd = create_map(BPF_MAP_TYPE_CONTROL, map_attr.ctrl_type = XDP_ATTR)
// Key layout == ifindex, vlaue format if_xdp_attributes

fd = create_map(BPF_MAP_TYPE_CONTROL, map_attr.ctrl_type =
BPF_PROG_STATS)
// Key layout == prog_fd, value format struct bpf_prog_stats

fd = create_map(BPF_MAP_TYPE_CONTROL, map_attr.ctrl_type =
BPF_SOCK_ATTR)
// Key layout == socket_fd, value layout struct bpf_sock_attr

extending this can be done in any linux BPF subsystem, by introducing
new map_attr.ctrl_types and new key value layouts of that control type.

on map creation we will attach the btf format of the (key, value) pair
to the map created for that user.
In your examples above does netdev with corresponding ifindex
exist before map is created?
Does prog_fd exist ? and socket?
In all cases yes. they do. Hence creation of 'map' (even pseudo map)
is an unnecessary step.

User space providing BTF for input and output makes little sense to me.
What kernel suppose to do with it?

In case of XDP stats (that could be different between drivers)
the driver would provide a BTF to describe the stats it's collecting.
So it's kernel supplied BTF instead of user.

I think we need something else here. Using BTF to describe
output stats is nice, but using BTF to describe input query is
problematic,
since user cannot know before hand what kernel can and cannot accept.
imo input should be stable uapi with fixed constants whereas
stats-like output can be BTF based and vary from driver to driver
and from one nic version to another.
Well, we can decide to use static stable uapi, and still use this
special map to leverage the map API as described in this doc.

but also we can allow dynamic value layouts, but any extension should
be done to the end of any value structuer and on lookup we will only
copy the size that the user already recognizes .. ? or we can
assume/force the user to use the map_attributes to figure out format
layouts and sizes, but still we will guarantee backward compatibility
in kernel level by keeping old format the same, and extension is only
allowed to the end of the value structures.
for input queries - yes. See how 'union bpf_attr' works.
It can accommodate any number of new commands with its own arguments
to query XDP stats.
For output the driver can supply BTF back to user space along with
a blob of data.

If I got your intent correctly you want only BPF_MAP_TYPE_CONTROL to
be processed by generic bpf layer and everything else to be done in
the driver? All commands, values, input/output to be driver specific?


Alexei Starovoitov
 

On Wed, Mar 20, 2019 at 04:47:11AM +0000, Saeed Mahameed wrote:

consider the following command line, with bpftool that compiled way
before xsk-ring-szie attribute was introduced.
perfect. let's agree on the use case first...

$ bpftool xdp set eth0 xsk-ring-size 128
would be great to have. no doubt.

what this can eventually do:

fd = create_map(BPF_CONTROL, map_attr.ctrl_type = XDP_ATTR)

//Query BTF of the control (XDP) map
query_map_attributes(fd, &map_attr);
and bpf_map_info returns btf_key_type_id and btf_value_type_id
Are you saying btf_key_type_id will return single u32 with name 'ifindex' ?
Where do you specify netns_id+dev_id ?

// Find if control map values btf layout include the new member
offset = btf_get_offset_of_memeber_name(map_attr.btf, "xsk-ring-size",
BTF_TYPE_INT);
if (offset < 0)
return offset; // the new member is not supported in kernel

*(value + offset) = (int)new_val;
map_update(fd, <eth0 ifindex>, value);
what about other fields in this value ?
If zero do not update ?
Shouldn't it be read/modify/write ?
So lookup everything first, modify only that '*(value + offset)'
and do map_update ?
and the driver received full blob of bytes for all fields.
What driver suppose to do ? Rewrite all fields ?
Or try to be smart and do read+if_not_the_same_write ?

Different ifindexes will be different drivers,
so single map providing single btf_value_type_id for all of them
isn't going to work.

I hope you see that such api is falling apart even for simplest use case.


Saeed Mahameed <saeedm@...>
 

On Tue, 2019-03-19 at 08:36 -0700, Alexei Starovoitov wrote:
On Fri, Mar 08, 2019 at 08:51:03PM +0000, Saeed Mahameed wrote:
Thoughts ?
It's certainly an interesting idea. I think we need to agree on use
cases
and goals first before bikesheding on the solution.
Sure will discuss most of the use cases tomorrow in the iovisor
meeting.


Example use cases (XDP only for now):

1) Query XDP stats of all XDP netdevs:

xdp_ctrl = bpf_create_map(BPF_MAP_TYPE_CONTROL, map_attr.ctrl_type
=
XDP_STATS);

while (bpf_map_get_next_key(xdp_ctrl, &ifindex, &next_ifindex) ==
0) {
bpf_map_lookup_elem(xdp_ctrl, &next_ifindex, &stats);
// we don't even need to know stats format in this case
btf_pretty_print(xdp_ctrl->btf, &stats);
ifindex = next_ifindex;
}
this bit show cases advantage of BTF nicely.

2) Setup XDP tx redirect resources on egress netdev (netdev with no
XDP
program).

xdp_ctrl = bpf_create_map(BPF_MAP_TYPE_CONTROL, map_attr.ctrl_type
=
XDP_ATTR);

xdp_attr->command = SETUP_REDIRECT;
xdp_attr->rings.num = 12;
xdp_attr->rings.size = 128;
bpf_map_update_elem(xdp_ctrl, &ifindex, &xdp_attr);
this one starting to become a bit odd, since input arguments are
split
between key an value. ifindex is part of the key whereas command,
rings.num and rings.size are part of the value.
The idea is that we need to keep a map semantics (object->vlaue).
in this case ( map_attr.ctrl_type = XDP_ATTR ) object (key) is if_index
and value is xdp attributes of that if_index, which plays nicely when
you want to iterate through all the objects (XDP netdevs). simply think
of ifindex as a key and not an input value.

different ctrl map types can have different keys and values hence the
need for map_create and passing map_attr.ctrl_type attribute which will
define what the user is trying to access (which control map) and what
will be the key (object) & value (configuration) pair layout.

3) Turn On/Off XDP meta data offloads and retrieve meta data BTF
format
of specific netdev/hardware:

xdp_ctrl = bpf_create_map(BPF_MAP_TYPE_CONTROL, map_attr.ctrl_type
=
XDP_ATTR);

xdp_attr->command = SETUP_MD;
xdp_attr->enable_md = 1;
err = bpf_map_update_elem(xdp_ctrl, &ifindex, &xdp_attr);
if (err) {
printf("XDP meta data is not supported on this netdev");
return;
}
// Query Meta data BTF
bpf_map_lookup_elem(xdp_ctrl, &ifindex, &xdp_attr);
md_btf = xdp_attr.md_btf;
here it gets even weirder, since lookup arguments are also
split between key and value.
ifindex is inside the key while addition info is passed
inside xdp_attr which is part of value.

I wish we could do maps where every key would have different layout.
Then such api would be suitable.
The existing maps require all keys and values to be inform.
I guess one can argue that such 'control map' can have one element
and one value, but then what is the purpose of 'map_create'
and 'map_delete==close(fd)' operations?
This is exactly the purpose of map_{create/delete} to define what the
key,vlaue format will be ( it is not going to be ifindex for all
control maps), to make it clear, the key doesn't have to be an ifindex
at all, it depends on the map_attr.ctrl_type which the user request on
map creation, so different layouts are already supported by this
proposal

examples of different map_attr.ctrl_type:

fd = create_map(BPF_MAP_TYPE_CONTROL, map_attr.ctrl_type = XDP_ATTR)
// Key layout == ifindex, vlaue format if_xdp_attributes

fd = create_map(BPF_MAP_TYPE_CONTROL, map_attr.ctrl_type =
BPF_PROG_STATS)
// Key layout == prog_fd, value format struct bpf_prog_stats

fd = create_map(BPF_MAP_TYPE_CONTROL, map_attr.ctrl_type =
BPF_SOCK_ATTR)
// Key layout == socket_fd, value layout struct bpf_sock_attr

extending this can be done in any linux BPF subsystem, by introducing
new map_attr.ctrl_types and new key value layouts of that control type.

on map creation we will attach the btf format of the (key, value) pair
to the map created for that user.


I think we need something else here. Using BTF to describe
output stats is nice, but using BTF to describe input query is
problematic,
since user cannot know before hand what kernel can and cannot accept.
imo input should be stable uapi with fixed constants whereas
stats-like output can be BTF based and vary from driver to driver
and from one nic version to another.
Well, we can decide to use static stable uapi, and still use this
special map to leverage the map API as described in this doc.

but also we can allow dynamic value layouts, but any extension should
be done to the end of any value structuer and on lookup we will only
copy the size that the user already recognizes .. ? or we can
assume/force the user to use the map_attributes to figure out format
layouts and sizes, but still we will guarantee backward compatibility
in kernel level by keeping old format the same, and extension is only
allowed to the end of the value structures.


Saeed Mahameed <saeedm@...>
 

On Tue, 2019-03-19 at 20:04 -0700, Alexei Starovoitov wrote:
On Tue, Mar 19, 2019 at 06:54:08PM +0000, Saeed Mahameed wrote:
This is exactly the purpose of map_{create/delete} to define what
the
key,vlaue format will be ( it is not going to be ifindex for all
control maps), to make it clear, the key doesn't have to be an
ifindex
at all, it depends on the map_attr.ctrl_type which the user request
on
map creation, so different layouts are already supported by this
proposal

examples of different map_attr.ctrl_type:

fd = create_map(BPF_MAP_TYPE_CONTROL, map_attr.ctrl_type =
XDP_ATTR)
// Key layout == ifindex, vlaue format if_xdp_attributes

fd = create_map(BPF_MAP_TYPE_CONTROL, map_attr.ctrl_type =
BPF_PROG_STATS)
// Key layout == prog_fd, value format struct bpf_prog_stats

fd = create_map(BPF_MAP_TYPE_CONTROL, map_attr.ctrl_type =
BPF_SOCK_ATTR)
// Key layout == socket_fd, value layout struct bpf_sock_attr

extending this can be done in any linux BPF subsystem, by
introducing
new map_attr.ctrl_types and new key value layouts of that control
type.

on map creation we will attach the btf format of the (key, value)
pair
to the map created for that user.
In your examples above does netdev with corresponding ifindex
exist before map is created?
Does prog_fd exist ? and socket?
In all cases yes. they do. Hence creation of 'map' (even pseudo map)
is an unnecessary step.

User space providing BTF for input and output makes little sense to
me.
What kernel suppose to do with it?

In case of XDP stats (that could be different between drivers)
the driver would provide a BTF to describe the stats it's collecting.
So it's kernel supplied BTF instead of user.
ok, i think i wasn't clear enough, let me retry.

So map creation has nothing to do with ifindex or the object the user
is trying to access.

on map creation the user will define what map sub type (control type)
the user want to create/access, and the kernel job will be to setup
this user map attributes, connect that map to the subsystem that is
going to handle this map operations (lookup and update). by subsystem i
mean XDP subsystem for example (net/core layer and not device driver)
There will be no direct connection to the device driver, direct map
operations that go to device drivers is something we should avoid, the
whole idea here is to provide standard sustainable uapi and not device
specific key values that will go out of control.

BTF layout for key value pair is going to be defined and assigned by
kernel middle layers and NOT the user space nor device drivers, each
subsystem that want to support new control MAP sub types should pre-
define its own BTF layou.

on map creation user can query the BTF key value layouts and any other
map attributes.

For example to enable user to access XDP attributes, we will add a
kernel BPF layer under net/core/xdp_ctrl.c which will handle all
map operations coming in from maps that are created via:

create_map(BPF_MAP_TYPE_CONTROL, map_attr.ctrl_type = XDP_ATTR)

other subsystems/layers (eg. sockets/kprobe/cgroups/tracepoints) can
define there own map_attr.ctrl_type and implement thier own mid-layer
which will handle control maps operation of that sub type.

So you don't create a map to get connected to a specific netdev, you
create a map to get a connection to a specific kernel BPF subsystem and
query modify that subsystem objects (in XDP case netdevs).

User can create multiple maps with different ap_attr.ctrl_type to
access different system attributes.

I think we need something else here. Using BTF to describe
output stats is nice, but using BTF to describe input query is
problematic,
since user cannot know before hand what kernel can and cannot
accept.
imo input should be stable uapi with fixed constants whereas
stats-like output can be BTF based and vary from driver to driver
and from one nic version to another.
Well, we can decide to use static stable uapi, and still use this
special map to leverage the map API as described in this doc.

but also we can allow dynamic value layouts, but any extension
should
be done to the end of any value structuer and on lookup we will
only
copy the size that the user already recognizes .. ? or we can
assume/force the user to use the map_attributes to figure out
format
layouts and sizes, but still we will guarantee backward
compatibility
in kernel level by keeping old format the same, and extension is
only
allowed to the end of the value structures.
for input queries - yes. See how 'union bpf_attr' works.
It can accommodate any number of new commands with its own arguments
to query XDP stats.
For output the driver can supply BTF back to user space along with
a blob of data.

If I got your intent correctly you want only BPF_MAP_TYPE_CONTROL to
be processed by generic bpf layer and everything else to be done in
the driver? All commands, values, input/output to be driver specific?
hmm, no i want the kernel core code to process everything, different
subsystems (not device drivers) can provide there own
BPF_MAP_TYPE_CONTROL sub types and provide the map operation
implementation. As a backend such mid layer subsystem can use whatever
interface to access objects (netdevs ndos in our case and could be via
union bpf_attr), this is implementation specific that is transparent to
the user.

bottom line we don't want drivers/users to create BTF layouts it is the
kernel mid layer job to connect between the two via the control MAPs,
and this mid layer will audit drivers and users.

I know we can achieve same thing with union bpf_attr and use plain
ethtool or netlink, but this means that we need to integrate
ethtool/netlink APIs into libbpf but the whole beauty if this proposal
is that a BPF developer will have to deal only with MAP operations even
if he wants to change system attributes.

One extra benefit of this is forward compatibility of a simple user
space program that can access new kernel attributes with out the need
to modify the user space program

consider the following command line, with bpftool that compiled way
before xsk-ring-szie attribute was introduced.

$ bpftool xdp set eth0 xsk-ring-size 128

what this can eventually do:

fd = create_map(BPF_CONTROL, map_attr.ctrl_type = XDP_ATTR)

//Query BTF of the control (XDP) map
query_map_attributes(fd, &map_attr);

// Find if control map values btf layout include the new member
offset = btf_get_offset_of_memeber_name(map_attr.btf, "xsk-ring-size",
BTF_TYPE_INT);
if (offset < 0)
return offset; // the new member is not supported in kernel

*(value + offset) = (int)new_val;
map_update(fd, <eth0 ifindex>, value);


Toke Høiland-Jørgensen <toke@...>
 

Saeed Mahameed <saeedm@...> writes:

In this proposal I am going to address the lack of a unified user API
for accessing and manipulating BPF system attributes, while this
proposal is generic and will work on any BPF subsystem (eBPF attach
points), I will mostly focus on XDP use cases.

So lately I started working on three different XDP open issues, namely
XDP statistic, XDP redirect and XDP meta-data, while the details of
these issues are not really relevant for the sake of this proposal, all
of them share one common problem: the lack of unified user interface to
manipulate and access their attributes.

Examples:
1. Query XDP statistics.
2. XDP resource management, Setup XDP-redirect TX resources.
3. Setup and query XDP-metadata - (BTF data structure).

Jesper Brouer, explains some of these issues in details at:
https://github.com/xdp-project/xdp-project/blob/master/xdp-project.org

Yes I considered, netlink, devlink, ethtool, sysctrl, etc .. but each
one of them has it's own drawback, they are networking specific and
will not serve the BPF general purpose.
The one concern I have with this is that it makes XDP configuration
different from regular networking configuration. One of the compelling
features of XDP is that it is less surprising than kernel offloads,
because that you can interface with it using the regular kernel tooling.
This is less the case if we're doing a BPF-specific thing...

Or to put it another way, in my mind XDP is a networking technology that
happens to use eBPF, more than it is an eBPF usage that happens to
process packets; and I think it would make more sense for the userspace
tooling to reflect this.

That being said, I do agree that there are some cool ideas in your
example, such as using BTF to express the statistics format, and the
automatic enumeration of objects.

-Toke


Saeed Mahameed <saeedm@...>
 

In this proposal I am going to address the lack of a unified user API
for accessing and manipulating BPF system attributes, while this
proposal is generic and will work on any BPF subsystem (eBPF attach
points), I will mostly focus on XDP use cases.

So lately I started working on three different XDP open issues, namely
XDP statistic, XDP redirect and XDP meta-data, while the details of
these issues are not really relevant for the sake of this proposal, all
of them share one common problem: the lack of unified user interface to
manipulate and access their attributes.

Examples:
1. Query XDP statistics.
2. XDP resource management, Setup XDP-redirect TX resources.
3. Setup and query XDP-metadata - (BTF data structure).

Jesper Brouer, explains some of these issues in details at:
https://github.com/xdp-project/xdp-project/blob/master/xdp-project.org

Yes I considered, netlink, devlink, ethtool, sysctrl, etc .. but each
one of them has it's own drawback, they are networking specific and
will not serve the BPF general purpose.

What we want is, all of the BPF related knobs to be present in BPF user
tools: bcc, bpftool and libbpf. Ideally we don't want these tools to
integrate with all different subsystem's UAPIs, especially the wide
variety of the networking UAPIs, and imagine what other subsystems are
going to be using ..

So what seems to be the right path here is a unified BPF
control/configuration user interface, which will hook the caller with
the targeted subsystem.

To be aligned with all existing BPF tools I am going to propose the use
of BPF syscall (No, not a new BPF syscall command, I am not planing to
reinvent the wheel - "again" -).
What i am going to suggest is to use an already existing API which runs
on top of the BPF syscall, BPF MAPs API with just a small tweak. Enter:



BPF control MAP:

A special type of MAP "BPF_MAP_TYPE_CONTROL", this map will not behave
like other maps in the essence of having a user defined data structure
behind it, we are going to use it just to hook the user with the
targeted underlying subsystem and delegate user commands to it through
map operations (create/update_elem/lookup_elem/etc ...)



Requirements and implementation details:

1) Hook the user with the targeted subsystem:
- On create map, user selects the BPF_MAP_TYPE_CONTROL map type and
sets map_attr.ctrl_type to be the subsystem he wants to access and
manipulate (KPROBE/CGROUP/SOCKET_FILTER/XDP/etc..).

2) Set and Get operations of a specific BPF subsystem or an object in
that subsystem (for example a netdev in XDP).
- user will use the file descriptor retrieved on map creation to access
(Set/Get) the BPF subsystem attributes via map update_elem and
lookup_elem operations, the key will be the object id (example:
ifindex, or just the type of configuration to access) keys and values
are subsystem dependent.

3) Iterate through the different attributes/objects of the subsystem,
Use case: XDP BPF subsystem, get ALL netdevs XDP attributes/statistics.
can be easily achieved with: bpf_map_get_next_key.



Advantages & Motivation:
Why BPF MAP and not just a plain new BPF syscall command or any other
existing UAPI:

0) All BPF users love maps and got used to them, and simply, everything
is a map, system objects can be keys and their attributes can be
values.

1) **BTF** integration, any map (key, value) pair can be described in
BTF in kernel level and can be attached to the map the user creates,
this will be a huge advantage for user forward compatibility, and for
development convenience to not copy kernel uapi headers on each
attribute set updates, and simplify ABI compatibility.
New values or attributes can be dumped/parsed in user space with zero
effort, no need to constantly update user space tools.

2) BPF maps already laid the groundwork for our requirements as the
infrastructure and has the semantics that we are looking for (set/get).

3) Already integrated in user-space tools and libraries such ash
bcc/libbpf and friends, what is missing is just this small tweak (in
the kernel) to hook one special map type with the underlying BPF
subsystems.

Thoughts ?


[Some EXTRAs]
Example use cases (XDP only for now):

1) Query XDP stats of all XDP netdevs:

xdp_ctrl = bpf_create_map(BPF_MAP_TYPE_CONTROL, map_attr.ctrl_type =
XDP_STATS);

while (bpf_map_get_next_key(xdp_ctrl, &ifindex, &next_ifindex) == 0) {
bpf_map_lookup_elem(xdp_ctrl, &next_ifindex, &stats);
// we don't even need to know stats format in this case
btf_pretty_print(xdp_ctrl->btf, &stats);
ifindex = next_ifindex;
}

2) Setup XDP tx redirect resources on egress netdev (netdev with no XDP
program).

xdp_ctrl = bpf_create_map(BPF_MAP_TYPE_CONTROL, map_attr.ctrl_type =
XDP_ATTR);

xdp_attr->command = SETUP_REDIRECT;
xdp_attr->rings.num = 12;
xdp_attr->rings.size = 128;
bpf_map_update_elem(xdp_ctrl, &ifindex, &xdp_attr);

3) Turn On/Off XDP meta data offloads and retrieve meta data BTF format
of specific netdev/hardware:

xdp_ctrl = bpf_create_map(BPF_MAP_TYPE_CONTROL, map_attr.ctrl_type =
XDP_ATTR);

xdp_attr->command = SETUP_MD;
xdp_attr->enable_md = 1;
err = bpf_map_update_elem(xdp_ctrl, &ifindex, &xdp_attr);
if (err) {
printf("XDP meta data is not supported on this netdev");
return;
}
// Query Meta data BTF
bpf_map_lookup_elem(xdp_ctrl, &ifindex, &xdp_attr);
md_btf = xdp_attr.md_btf;

Thanks,
Saeed.