asahi/rikako-note

asahi bb028ceafc doc: read the protobuf encoding documentation

2025-03-03 00:15:49 +08:00

protobuf
- language guide (proto3)
- Encoding

protobuf

language guide (proto3)

Defining a message type

The following example defines a search request message format:

syntax = "proto3";

message SearchRequest {
    string query = 1;
    int32 page_number = 2;
    int32 results_per_page = 3;
}

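The example above means the following:

syntax = "proto3":
- declares that this file uses the proto3 revision of the protobuf language
- if no syntax is specified, the protocol buffer compiler assumes proto2
The SearchRequest message defines three fields, each representing one piece of data to include in the message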

Assign Field Numbers

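Each field in a message is assigned a number in the range [1, 536,870,911], subject to the following constraints:

The number must be unique among all fields of the message
Field numbers 19,000 to 19,999 are reserved for the Protocol Buffers implementation; the protocol buffer compiler reports an error if you use them

Once a message type is in use, its field numbers cannot be changed, because the field number identifies the field in the message wire format.

Changing a field's number is equivalent to deleting the old field and creating a new field of the same type.

Field numbers should never be reused.

Fields that are set frequently should be given field numbers in [1, 15]. The smaller the field number, the less space it takes in the wire format.

For example, field numbers in [1, 15] take 1 byte to encode, while those in [16, 2047] take 2 bytes.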
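The byte counts above follow from how a tag is encoded. As the Encoding part of this note describes, a tag is the varint of `(field_number << 3) | wire_type`. A minimal Python sketch under that assumption (the helper names `varint_len` and `tag_len` are made up for illustration):

```python
def varint_len(value: int) -> int:
    """Number of bytes needed to encode a non-negative integer as a varint."""
    length = 1
    while value >= 0x80:  # more than 7 bits left, so another byte is needed
        value >>= 7
        length += 1
    return length


def tag_len(field_number: int, wire_type: int = 0) -> int:
    """Size in bytes of a field's tag; the wire type occupies the low 3 bits."""
    return varint_len((field_number << 3) | wire_type)


for number in (1, 15, 16, 2047, 2048):
    print(number, tag_len(number))
```

The tag grows from 1 to 2 bytes at field number 16 and from 2 to 3 bytes at field number 2048, which is why the low numbers are worth keeping for frequently set fields.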

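Consequences of reusing field numbers

Reusing a field number makes decoding wire-format messages ambiguous.

The protobuf wire format requires the field definitions to be consistent between encoding and decoding.

Field numbers are limited to 29 bits, so the maximum field number is 536,870,911.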

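Specifying field cardinality

In protobuf, a field has one of the following cardinalities:

Singular

In proto3, there are two kinds of singular field:

optional (recommended): an optional field is in one of two states
- if the field is set, it is serialized to the wire
- if the field is unset, reading it returns the default value, and it is not serialized to the wire
implicit (not recommended): an implicit field has no explicit cardinality label and behaves as follows
- if the field is a message type, it behaves the same as optional
- if the field is not a message, it is in one of two states
  - if the field is set to a non-default (non-zero) value, it is serialized to the wire
  - if the field is set to the default value, it is not serialized to the wire

Prefer optional over implicit; optional also gives better compatibility with proto2.

The difference between the two: when a scalar field is explicitly set to its default value, an optional field is still serialized to the wire, whereas an implicit field is not.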

repeated

The field can appear zero or more times in a message, and the order of the repeated values is preserved.

map

The field holds key/value pairs.

Message Type Fields Always Have Field Presence

In proto3, message-type fields always have field presence, so adding the optional modifier to a message-type field does not change its field presence.

For example, Message2 and Message3 below generate the same code in every language, and there is no difference in how their data is represented in binary, JSON, or TextFormat:

syntax="proto3";

package foo.bar;

message Message1 {}

message Message2 {
  Message1 foo = 1;
}

message Message3 {
  optional Message1 bar = 1;
}

well-formed messages

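When applied to protobuf messages, the term well-formed refers to the bytes being serialized or deserialized. The protoc parser, for its part, validates that a given proto definition file is parseable.

A singular field may appear more than once in wire-format bytes. The parser accepts such input, but only the last occurrence of the field takes effect.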
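The last-one-wins rule can be illustrated with a hand-rolled decoder. This is only a sketch, not how any protobuf runtime is implemented: it assumes every record is a VARINT field, and the names `read_varint` and `parse_singular_varints` are hypothetical.

```python
def read_varint(data: bytes, pos: int) -> tuple[int, int]:
    """Decode one varint starting at pos; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:  # continuation bit clear: this was the last byte
            return value, pos
        shift += 7


def parse_singular_varints(data: bytes) -> dict[int, int]:
    """Parse a stream of VARINT records; a later record replaces an earlier one."""
    fields: dict[int, int] = {}
    pos = 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type != 0:
            raise ValueError("this sketch only handles wire type 0 (VARINT)")
        value, pos = read_varint(data, pos)
        fields[field_number] = value  # overwrite any earlier occurrence
    return fields


# Field 1 appears twice: first with value 1, then with value 150.
payload = bytes([0x08, 0x01, 0x08, 0x96, 0x01])
print(parse_singular_varints(payload))  # {1: 150}
```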

Defining multiple message types in the same `.proto` file

Multiple message types can be defined in a single .proto file, for example:

message SearchRequest {
  string query = 1;
  int32 page_number = 2;
  int32 results_per_page = 3;
}

message SearchResponse {
 ...
}

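Keep the number of message types per .proto file as small as possible; putting too many message types in one file can lead to dependency bloat.

Deleting fields

When a field is no longer needed and all references to it have been removed from client code, its definition can be deleted from the message type. However, its field number must be reserved so that the number cannot be reused later.

The field name should also be reserved, so that messages encoded as JSON or TextFormat can still be parsed correctly.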

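reserved field numbers

If a field is only commented out or deleted, a future developer may reuse its field number. To prevent this, add the deleted field's number to the reserved list, for example:

message Foo {
  reserved 2, 15, 9 to 11;
}

In the example above, 9 to 11 means 9, 10, 11.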

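reserved field names

Reusing the name of a deleted field is generally safe, unless you use the TextProto or JSON encodings, in which the field name is part of the serialized data. To avoid that problem, add the deleted field's name to the reserved list.

Reserved names only affect the behavior of the protoc compiler and not runtime behavior, with one exception:

While parsing, TextProto implementations may discard unknown fields whose names are reserved instead of raising an error
Runtime JSON parsing is not affected by reserved names

An example using reserved names:

message Foo {
  reserved 2, 15, 9 to 11;
  reserved "foo", "bar";
}

The example above uses two separate reserved statements for field numbers and field names. Field names and field numbers cannot be mixed in the same reserved statement.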

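What is generated from a `.proto` file

When you run the protocol buffer compiler on a .proto file, it generates code in your chosen language for working with the message types described in the file, including:

getting and setting field values
serializing messages to an output stream
parsing messages from an input stream

For common programming languages, the generated output is as follows:

java:
- a class is generated for each message type
- besides the class, a Builder class is generated for creating instances of it
go:
- a .pb.go file is generated, with a type for each message type in the file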

Scalar Value Types

A scalar message field can have one of the following types:

Proto Type	Notes
double
float
int32	Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead.
int64	Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead.
uint32	Uses variable-length encoding.
uint64	Uses variable-length encoding.
sint32	Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s.
sint64	Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s.
fixed32	Always four bytes. More efficient than uint32 if values are often greater than 2²⁸.
fixed64	Always eight bytes. More efficient than uint64 if values are often greater than 2⁵⁶.
sfixed32	Always four bytes.
sfixed64	Always eight bytes.
bool
string	A string must always contain UTF-8 encoded or 7-bit ASCII text, and cannot be longer than 2³².
bytes	May contain any arbitrary sequence of bytes no longer than 2³².
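The notes in the table about negative numbers and about the 2²⁸ threshold can be checked with a little arithmetic. The sketch below assumes, as the official encoding guide describes, that a negative int32 is sign-extended to 64 bits before varint encoding and that sint32 uses ZigZag encoding; the helper names are made up for illustration.

```python
def varint_len(value: int) -> int:
    """Bytes needed to encode a non-negative integer as a varint."""
    length = 1
    while value >= 0x80:
        value >>= 7
        length += 1
    return length


def int32_wire_len(value: int) -> int:
    """int32: a negative value is sign-extended to 64 bits, then varint-encoded."""
    return varint_len(value & 0xFFFF_FFFF_FFFF_FFFF)


def zigzag32(value: int) -> int:
    """sint32: map 0, -1, 1, -2, ... to 0, 1, 2, 3, ... so small magnitudes stay small."""
    return ((value << 1) ^ (value >> 31)) & 0xFFFF_FFFF


def sint32_wire_len(value: int) -> int:
    return varint_len(zigzag32(value))


print(int32_wire_len(-1), sint32_wire_len(-1))   # 10 1
print(varint_len(2**28 - 1), varint_len(2**28))  # 4 5
```

From 2²⁸ upwards a varint needs 5 bytes, while fixed32 always needs 4, which is where the threshold in the table comes from.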

The scalar types above map to the following types in each programming language:

Proto Type	C++ Type	Java/Kotlin Type^[1]	Python Type^[3]	Go Type	Ruby Type	C# Type	PHP Type	Dart Type	Rust Type
double	double	double	float	float64	Float	double	float	double	f64
float	float	float	float	float32	Float	float	float	double	f32
int32	int32_t	int	int	int32	Fixnum or Bignum (as required)	int	integer	int	i32
int64	int64_t	long	int/long^[4]	int64	Bignum	long	integer/string^[6]	Int64	i64
uint32	uint32_t	int^[2]	int/long^[4]	uint32	Fixnum or Bignum (as required)	uint	integer	int	u32
uint64	uint64_t	long^[2]	int/long^[4]	uint64	Bignum	ulong	integer/string^[6]	Int64	u64
sint32	int32_t	int	int	int32	Fixnum or Bignum (as required)	int	integer	int	i32
sint64	int64_t	long	int/long^[4]	int64	Bignum	long	integer/string^[6]	Int64	i64
fixed32	uint32_t	int^[2]	int/long^[4]	uint32	Fixnum or Bignum (as required)	uint	integer	int	u32
fixed64	uint64_t	long^[2]	int/long^[4]	uint64	Bignum	ulong	integer/string^[6]	Int64	u64
sfixed32	int32_t	int	int	int32	Fixnum or Bignum (as required)	int	integer	int	i32
sfixed64	int64_t	long	int/long^[4]	int64	Bignum	long	integer/string^[6]	Int64	i64
bool	bool	boolean	bool	bool	TrueClass/FalseClass	bool	boolean	bool	bool
string	string	String	str/unicode^[5]	string	String (UTF-8)	string	string	String	ProtoString
bytes	string	ByteString	str (Python 2), bytes (Python 3)	[]byte	String (ASCII-8BIT)	ByteString	string	List	ProtoBytes

default field values

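When a message is parsed, if the encoded bytes do not contain a particular field, reading that field on the parsed object returns a default value.

The default depends on the type:

For string, the default is the empty string
For bytes, the default is empty bytes
For bool, the default is false
For numeric types, the default is 0
For message fields, the field is not set
For enum, the default is the first defined enum value

For repeated fields, the default is empty (an empty list).

For map fields, the default is empty (an empty map).

implicit-presence

For implicit-presence scalar fields, once a message has been parsed there is no way to tell whether the field was explicitly set to the default value or was never set at all.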

enumerations

An enum is defined in a proto file as follows:

enum Corpus {
  CORPUS_UNSPECIFIED = 0;
  CORPUS_UNIVERSAL = 1;
  CORPUS_WEB = 2;
  CORPUS_IMAGES = 3;
  CORPUS_LOCAL = 4;
  CORPUS_NEWS = 5;
  CORPUS_PRODUCTS = 6;
  CORPUS_VIDEO = 7;
}

message SearchRequest {
  string query = 1;
  int32 page_number = 2;
  int32 results_per_page = 3;
  Corpus corpus = 4;
}

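In proto3, the first value defined in an enum must be 0, and it should be named {ENUM_TYPE_NAME}_UNSPECIFIED or {ENUM_TYPE_NAME}_UNKNOWN.

enum value alias

With the allow_alias option enabled, several enum constants can be assigned the same value. When deserializing, every alias is valid, but the first one is always used when serializing.

An example of enum aliases: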

enum EnumAllowingAlias {
  option allow_alias = true;
  EAA_UNSPECIFIED = 0;
  EAA_STARTED = 1;
  EAA_RUNNING = 1;
  EAA_FINISHED = 2;
}

enum EnumNotAllowingAlias {
  ENAA_UNSPECIFIED = 0;
  ENAA_STARTED = 1;
  // ENAA_RUNNING = 1;  // Uncommenting this line will cause a warning message.
  ENAA_FINISHED = 2;
}

An enum can be defined inside one message type and used in another, with the syntax _MessageType_._EnumType_.

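Rules for updating a message type

If an existing message type no longer meets your needs, for example you want to add a field, and you still want to stay compatible with code built against the old format, follow these rules:

Do not change the field number of any existing field
After a field is added, messages serialized with the old message type can still be parsed by code using the new message type, and messages serialized with the new message type can still be parsed by code using the old one
- For example, suppose service A (the caller) and service B (the callee) exchange a message type. If service B adds a field to the message type while service A still uses the old definition, data serialized by service A with the old definition can still be parsed by service B
- Likewise, data serialized by service B with the new definition can still be parsed by service A; code using the old definition ignores the newly added field
Fields may be removed, as long as their field numbers are not reused
int32, uint32, int64, uint64, and bool are all compatible, so a field can be changed from one of these types to another without breaking forwards or backwards compatibility
sint32 and sint64 are compatible with each other, but not with the other numeric types
string and bytes are compatible as long as the bytes are valid UTF-8
Embedded messages are compatible with bytes if the bytes contain a serialized instance of the message
fixed32 is compatible with sfixed32, and fixed64 with sfixed64
For string, bytes, and message fields, singular is compatible with repeated
- If the serialized data contains a repeated field and the client expects it to be singular, then on parsing
  - for a primitive type, the client takes the last value of the repeated field
  - for a message type, the client merges all of the input elements
enum is compatible with int32, uint32, int64, and uint64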

Unknown Fields

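Fields in serialized data that the parser does not recognize are called unknown fields. For example, when an old parser parses data sent by a new sender and that data contains a new field, the new field is an unknown field to the old parser.

proto3 preserves unknown fields and includes them during parsing and in the serialized output, which matches proto2 behavior.

Losing unknown fields

Some actions can cause unknown fields to be lost, for example:

Serializing a message to JSON
Iterating over the fields of a message and copying them into a new message

To avoid losing unknown fields, follow these rules:

Use the binary format; avoid text formats for data exchange
Copy messages with message-oriented APIs such as CopyFrom or MergeFrom, not field by field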

Any

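Any works somewhat like a generic type: it lets you use a message as an embedded type without having its .proto definition. An Any contains:

An arbitrary message serialized as bytes
A URL that uniquely identifies the message type and is used to resolve it when deserializing

To use the Any type, import google/protobuf/any.proto, for example: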

import "google/protobuf/any.proto";

message ErrorStatus {
  string message = 1;
  repeated google.protobuf.Any details = 2;
}

The default type URL for a given message type is type.googleapis.com/_packagename_._messagename_.

OneOf

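If a message has several singular fields of which at most one is set at any time, you can use the oneof feature.

Oneof fields are like optional fields, except that all fields of a oneof share memory and at most one of them can be set at a time. Setting any member of the oneof automatically clears all the other members.

Depending on the language, the case() or WhichOneof() method tells you which field of the oneof is set.

A oneof is used as follows: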

message SampleMessage {
  oneof test_oneof {
    string name = 4;
    SubMessage sub_message = 9;
  }
}

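Fields of any type can be added to a oneof, except repeated and map fields.

Oneof features

When the parser encounters several members of the same oneof on the wire, only the last member seen is used
A oneof cannot be repeated
The reflection APIs work for oneof fields as well

Backwards-compatibility issues

Be careful when adding fields to a oneof. If checking the oneof returns None/NOT_SET, it can mean that the oneof has not been set, or that it has been set to a member that does not exist in the older version of the message type.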

Maps

To create a map as part of a data definition, use the following syntax:

map<key_type, value_type> map_field = N;

Here, key_type can be any integral or string type, and value_type can be any type except another map.

For example:

map<string, Project> projects = 3;

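Map features

Map fields cannot be repeated
The iteration order of map values, and their order in the wire format, is undefined
During merging or parsing, if the same key appears more than once, the last occurrence is used

Backwards compatibility

On the wire, the map syntax is equivalent to the following declaration: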

message MapFieldEntry {
  key_type key = 1;
  value_type value = 2;
}

repeated MapFieldEntry map_field = N;

Protocol buffer implementations that do not support maps can therefore still handle map data.
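To make the equivalence concrete, the sketch below hand-encodes one map entry the way a repeated MapFieldEntry would be encoded: an outer LEN record for the map field, containing the key in field 1 and the value in field 2. The declaration `map<string, int32> map_field = 3` and all helper names are assumptions chosen for illustration.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a varint."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def len_record(field_number: int, payload: bytes) -> bytes:
    """A LEN (wire type 2) record: tag, length, payload."""
    return encode_varint((field_number << 3) | 2) + encode_varint(len(payload)) + payload


def varint_record(field_number: int, value: int) -> bytes:
    """A VARINT (wire type 0) record: tag, value."""
    return encode_varint(field_number << 3) + encode_varint(value)


def encode_map_field(field_number: int, mapping: dict[str, int]) -> bytes:
    """Emit one MapFieldEntry sub message per key/value pair."""
    out = b""
    for key, value in mapping.items():
        entry = len_record(1, key.encode("utf-8")) + varint_record(2, value)
        out += len_record(field_number, entry)
    return out


print(encode_map_field(3, {"a": 1}).hex(" "))  # 1a 05 0a 01 61 10 01
```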

package

A package name can be specified in a .proto file to avoid name clashes between message types, for example:

package foo.bar;
message Open { ... }

message Foo {
  ...
  foo.bar.Open open = 1;
  ...
}

Services in an RPC System

To use message types with an RPC system, define an RPC service interface in the .proto file. The protocol buffer compiler then generates the service interface code and stubs, for example:

service SearchService {
  rpc Search(SearchRequest) returns (SearchResponse);
}

gRPC

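The most straightforward RPC system to use with protocol buffers is gRPC. It is language- and platform-neutral, and it can generate RPC code directly from .proto files.

json

The binary wire format is the preferred serialization for protobuf, but protobuf also supports a JSON encoding.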

Options

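Declarations in a .proto file can be annotated with options. Options do not change the meaning of a declaration, but they can affect how it is handled in a particular context.

Option levels

Some options are file-level: they are written at the top-level scope, not inside any message, enum, or service definition
Some options are message-level: they are written inside message definitions
Some options are field-level: they are written inside field definitions

Commonly used options:

java_package (file option)

java_package sets the package name for the generated Java classes. If it is not specified, the package of the proto file is used by default.

For example: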

option java_package = "com.example.foo";

java_outer_classname (file option)

The class name of the generated outer (wrapper) Java class. If it is not specified, the name is derived from the .proto file name. For example:

option java_outer_classname = "Ponycopter";

java_multiple_files (file option)

If this option is false, everything generated for the .proto file is nested inside the outer class.

If it is true, a separate Java file is generated for each message type.

The default is false. For example:

option java_multiple_files = true;

optimize_for (file option)

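Can be set to SPEED, CODE_SIZE, or LITE_RUNTIME. It affects the Java and C++ code generators:

SPEED: the default. The protocol buffer compiler generates code for serializing, parsing, and performing other common operations on message types. The generated code is highly optimized
CODE_SIZE: the protocol buffer compiler generates minimal classes that rely on shared, reflection-based code for serialization, parsing, and other common operations. The generated code is much smaller than with SPEED, but operations may be slower
LITE_RUNTIME

For example: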

option optimize_for = CODE_SIZE;

deprecated (field option)

If set to true, the field is deprecated and should not be used by new code. In Java, the field is annotated with @Deprecated in the generated code.

For example:

 int32 old_field = 6 [deprecated = true];

Generating code

Once the protocol buffer compiler is installed, run the protoc command to generate the classes, for example:

protoc --proto_path=IMPORT_PATH --cpp_out=DST_DIR --java_out=DST_DIR --python_out=DST_DIR --go_out=DST_DIR --ruby_out=DST_DIR --objc_out=DST_DIR --csharp_out=DST_DIR path/to/file.proto

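IMPORT_PATH is the directory in which to look for .proto files when resolving import directives; if omitted, the current directory is used. To search several directories, pass the --proto_path option multiple times.

More than one .proto file can be given as input.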

Encoding

By default, protobuf serializes messages into the wire format, which defines how a message is sent on the wire and how much space it takes up.

Simple Message

Here is a simple message definition:

message Test1 {
  optional int32 a = 1;
}

In application code, you create a Test1 message, set a to 150, and serialize it to an output stream. The serialized message, as hex, is:

08 96 01

Dumping these bytes with the protoscope tool gives 1: 150.
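The three bytes can be reproduced by hand. The sketch below is not the protobuf library; it assumes the record layout covered later in this note (a varint tag of `(field_number << 3) | wire_type` followed by a varint value), and the helper names are hypothetical.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a varint."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # 7 payload bits plus continuation bit
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_varint_field(field_number: int, value: int) -> bytes:
    """Encode one record of wire type 0 (VARINT): tag, then value."""
    tag = (field_number << 3) | 0
    return encode_varint(tag) + encode_varint(value)


# Test1 with a = 150, where a is field number 1.
encoded = encode_varint_field(1, 150)
print(encoded.hex(" "))  # 08 96 01
```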

Base 128 varints

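Variable-width integers (varints) are the foundation of the wire format. A varint encodes an unsigned 64-bit integer in 1 to 10 bytes.

A fixed-width unsigned 64-bit integer always takes 8 bytes; as a varint, the smaller the value, the fewer bytes it takes.

Each varint byte carries 7 bits of payload (values 0 to 127), so encoding a 64-bit integer can take up to 10 bytes, because 9 bytes carry only 7 * 9 = 63 < 64 bits.

A number that fits in 7 bits, such as 1, takes a single byte.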

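How varint encoding works

Each byte of a varint has a continuation bit that says whether the next byte is also part of the varint. The continuation bit is the MSB of the byte.

In an encoded varint, the MSB of the last byte is therefore 0 and the MSB of every other byte is 1, which is how varints are delimited.

The remaining 7 bits of each byte are the payload; concatenating the 7-bit payloads of all the bytes yields the integer that the varint represents.

For example, the integer 1 encodes to a 1-byte varint with content 01. The MSB is 0, meaning the varint consists of this byte only, and the 7-bit payload 0000001 gives the value 1.

Take the integer 150, which is 0x96 in hexadecimal. A single varint byte can carry at most 127, so 150 needs two bytes. Its binary form is 1001 0110; split into 7-bit groups, this gives 0000001 and 0010110. Varints always emit the least significant group first, whatever the byte order of the host, so the first byte is 1 0010110 (continuation bit set) and the second is 0 0000001, which gives the bytes 96 01.

Emitting the most significant group first would give 1 0000001 0 0010110 (81 16), which is not what protobuf produces. The bytes on the wire are always 96 01.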
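The procedure can be written out in a few lines of Python. This is a minimal sketch for non-negative integers only, with hypothetical function names, rather than the implementation used by any protobuf runtime.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer, least significant 7-bit group first."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        value >>= 7
    out.append(value)  # last byte: continuation bit is 0
    return bytes(out)


def decode_varint(data: bytes) -> int:
    """Decode a single varint that occupies all of data."""
    value = 0
    for index, byte in enumerate(data):
        value |= (byte & 0x7F) << (7 * index)  # drop the MSB, place the group
    return value


print(encode_varint(1).hex(" "))           # 01
print(encode_varint(150).hex(" "))         # 96 01
print(decode_varint(bytes([0x96, 0x01])))  # 150
print(len(encode_varint(2**64 - 1)))       # 10
```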

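Message structure

A protocol buffer message is a series of key/value pairs. The binary form of a message uses the field number as the key; the name and declared type of each field can only be determined on the decoding side by referring to the message type definition.

When a message is encoded, each key/value pair becomes a record consisting of the field number, a wire type, and a payload. The wire type tells the parser how big the payload that follows is.

Because the wire type gives the size of the payload, an old parser can skip over unknown fields.

wire type

This kind of scheme is called tag-length-value, or TLV.

There are six wire types: VARINT, I64, LEN, SGROUP, EGROUP, and I32.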

ID	Name	Used For
0	VARINT	int32, int64, uint32, uint64, sint32, sint64, bool, enum
1	I64	fixed64, sfixed64, double
2	LEN	string, bytes, embedded messages, packed repeated fields
3	SGROUP	group start (deprecated)
4	EGROUP	group end (deprecated)
5	I32	fixed32, sfixed32, float
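Skipping a field whose meaning is unknown needs only the wire type. The sketch below walks a byte stream using the wire type IDs from the table and the tag layout `(field_number << 3) | wire_type` from this section. It is a simplified illustration with hypothetical function names, and it does not handle the deprecated group wire types.

```python
def read_varint(data: bytes, pos: int) -> tuple[int, int]:
    """Decode one varint starting at pos; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, pos
        shift += 7


def skip_field(data: bytes, pos: int, wire_type: int) -> int:
    """Return the position just past a payload of the given wire type."""
    if wire_type == 0:  # VARINT: read until the continuation bit clears
        _, pos = read_varint(data, pos)
        return pos
    if wire_type == 1:  # I64: always 8 bytes
        return pos + 8
    if wire_type == 2:  # LEN: a varint length, then that many bytes
        length, pos = read_varint(data, pos)
        return pos + length
    if wire_type == 5:  # I32: always 4 bytes
        return pos + 4
    raise ValueError(f"unsupported wire type {wire_type}")


def field_numbers(data: bytes) -> list[int]:
    """Walk every record, skipping payloads without interpreting them."""
    numbers = []
    pos = 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        numbers.append(tag >> 3)
        pos = skip_field(data, pos, tag & 0x07)
    return numbers


# Field 1 = 150 (VARINT), followed by field 2 = "testing" (LEN).
stream = bytes.fromhex("089601" "1207" "74657374696e67")
print(field_numbers(stream))  # [1, 2]
```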

The tag of a record (a field key/value pair) is encoded as a varint whose value is computed as follows:

(field_number << 3) | wire_type

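After the varint tag is decoded, the lowest three bits of the result are therefore the wire type and the remaining bits are the field number.

A stream always starts with a varint that is the tag of the first field. For example, when the first byte of a stream is 08:

0000 1000

Dropping the continuation bit (the MSB) leaves the payload 0001 000. The last three bits are the wire type; wire type 0 is VARINT, so the field value is a varint. The first four bits are the field number, 0001, which is 1.

So the tag 08 denotes a field with field number 1 whose value is a varint.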
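The formula and its inverse are short enough to write out. This sketch operates on the decoded tag value (for tags below 128 that is simply the byte itself); `make_tag`, `split_tag`, and `WIRE_TYPE_NAMES` are names chosen for illustration.

```python
WIRE_TYPE_NAMES = {0: "VARINT", 1: "I64", 2: "LEN", 3: "SGROUP", 4: "EGROUP", 5: "I32"}


def make_tag(field_number: int, wire_type: int) -> int:
    """Combine a field number and a wire type into a tag value."""
    return (field_number << 3) | wire_type


def split_tag(tag: int) -> tuple[int, int]:
    """Split a decoded tag value into (field number, wire type)."""
    return tag >> 3, tag & 0x07


field_number, wire_type = split_tag(0x08)
print(field_number, WIRE_TYPE_NAMES[wire_type])  # 1 VARINT
print(hex(make_tag(2, 2)))                       # 0x12
```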

More Integer Types

Boolean & Enum

bool and enum are encoded the same way as int32. A bool value is encoded as either 00 or 01.

Length-Delimited Records

Length prefixes are another major concept in the wire format. Wire type 2, LEN, has a dynamic length, given by a varint immediately after the tag, which is followed by the payload.

For example:

message Test2 {
  optional string b = 2;
}

If you create a Test2 message and set b to testing, it is encoded as:

12 07 [74 65 73 74 69 6e 67]

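Broken down:

The first varint, 12, is the tag. Its payload 0010 010 gives field number 2 and wire type 2, which is LEN
Wire type LEN means the tag is followed by a varint giving the length of the field value. The payload of 07 is 7, so the field value is 7 bytes long, which is exactly the length of the string testing
The UTF-8 encoding of the string testing is 0x74657374696e67, which matches the remaining bytes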
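The same record can be assembled by hand from tag, length, and payload. A minimal sketch with hypothetical helper names:

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a varint."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_len_field(field_number: int, payload: bytes) -> bytes:
    """Encode one LEN record: tag, payload length as a varint, then the payload."""
    tag = (field_number << 3) | 2  # wire type 2 is LEN
    return encode_varint(tag) + encode_varint(len(payload)) + payload


# Test2 with b = "testing", where b is field number 2.
record = encode_len_field(2, "testing".encode("utf-8"))
print(record.hex(" "))  # 12 07 74 65 73 74 69 6e 67
```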

Sub messages

Sub message records also use the LEN wire type: after the tag and the length varint come the encoded bytes of the sub message.
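As an illustration, suppose a wrapper message holds a Test1 in field number 3. The wrapper and its field number are assumptions made for this example; this note does not define them. Nesting then amounts to encoding the inner message first and wrapping its bytes in a LEN record:

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a varint."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_varint_field(field_number: int, value: int) -> bytes:
    """A VARINT (wire type 0) record."""
    return encode_varint(field_number << 3) + encode_varint(value)


def encode_len_field(field_number: int, payload: bytes) -> bytes:
    """A LEN (wire type 2) record."""
    return encode_varint((field_number << 3) | 2) + encode_varint(len(payload)) + payload


inner = encode_varint_field(1, 150)  # Test1 with a = 150 -> 08 96 01
outer = encode_len_field(3, inner)   # hypothetical wrapper, sub message in field 3
print(outer.hex(" "))  # 1a 03 08 96 01
```

The last three bytes are exactly the encoding of the Test1 message from the Simple Message example, preceded by the tag 1a and the length 03.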

Optional & Repeated

For optional fields, encoding is simple: if the field is not set, it is just left out.

For repeated fields, ordinary (not packed) repeated fields emit one record per element. For example:

message Test4 {
  optional string d = 4;
  repeated int32 e = 5;
}

If you construct a Test4 message with d set to hello and e set to 1, 2, 3, it could be encoded as follows:

4: {"hello"}
5: 1
5: 2
5: 3

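The records for e do not have to be adjacent; they can be interleaved with records for other fields, as long as their order relative to each other is kept. For example: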

5: 1
5: 2
4: {"hello"}
5: 3
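Both layouts decode to the same message. The sketch below hand-rolls an encoder and a parser for Test4 to show this; it only handles the two records used here (field 4 as a LEN string, field 5 as unpacked VARINT elements), and all function names are hypothetical.

```python
def encode_varint(value: int) -> bytes:
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def read_varint(data: bytes, pos: int) -> tuple[int, int]:
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, pos
        shift += 7


def string_record(field_number: int, text: str) -> bytes:
    payload = text.encode("utf-8")
    return encode_varint((field_number << 3) | 2) + encode_varint(len(payload)) + payload


def varint_record(field_number: int, value: int) -> bytes:
    return encode_varint(field_number << 3) + encode_varint(value)


def parse_test4(data: bytes) -> dict:
    """Collect d (singular string) and e (repeated int32) from the records."""
    message = {"d": None, "e": []}
    pos = 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if field_number == 4 and wire_type == 2:
            length, pos = read_varint(data, pos)
            message["d"] = data[pos:pos + length].decode("utf-8")
            pos += length
        elif field_number == 5 and wire_type == 0:
            value, pos = read_varint(data, pos)
            message["e"].append(value)  # elements keep their wire order
        else:
            raise ValueError("unexpected record in this sketch")
    return message


e1, e2, e3 = (varint_record(5, n) for n in (1, 2, 3))
grouped = string_record(4, "hello") + e1 + e2 + e3
interleaved = e1 + e2 + string_record(4, "hello") + e3
print(parse_test4(interleaved))                          # {'d': 'hello', 'e': [1, 2, 3]}
print(parse_test4(grouped) == parse_test4(interleaved))  # True
```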
