How Base64 works

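In computing, Base64 is a group of binary-to-text encoding schemes that transform binary data into a sequence of printable characters drawn from a set of 64 unique symbols. The source data is processed 6 bits at a time, and each 6-bit group maps to one of the 64 characters.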

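Like other binary-to-text encodings, Base64 makes it possible to carry binary data over channels that reliably support only text. It is particularly common on the Web, where it is used to embed binary assets such as images inside textual assets such as HTML and CSS files.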
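As a minimal sketch of the embedding use case, the snippet below wraps raw bytes in a `data:` URI using Python's standard `base64` module. The helper name `to_data_uri` and the choice of the 8-byte PNG signature as stand-in data are assumptions made for illustration, not part of any particular toolchain.

```python
import base64

def to_data_uri(data: bytes, mime_type: str) -> str:
    """Wrap raw bytes in a data: URI that can be placed in HTML or CSS."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"

# Stand-in data: the 8-byte signature that starts every PNG file.
png_signature = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

uri = to_data_uri(png_signature, "image/png")
print(uri)  # data:image/png;base64,iVBORw0KGgo=
```

A real image would simply be read from disk as bytes and passed through the same helper.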

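Because early SMTP supported only 7-bit ASCII, Base64 is also widely used for sending email attachments: the attachment is encoded as Base64 before sending and decoded on receipt, which keeps older servers from corrupting it.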

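Compared with the raw binary data, Base64 encoding adds roughly 33% to 37% in size: about 33% from the encoding itself and up to about 4% more from optional line breaks.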
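The 33% figure follows from the 3-octets-to-4-characters ratio. The sketch below computes the padded output length for a given input size and checks it against Python's `base64` module; the helper name `encoded_length` is hypothetical, and line breaks are deliberately left out of the calculation.

```python
import base64

def encoded_length(n: int) -> int:
    """Length of padded Base64 output for n input bytes, without line breaks."""
    return (n + 2) // 3 * 4  # one 4-character group per started 3-byte group

for n in (3, 10, 1000):
    data = bytes(n)  # n zero bytes; the content does not affect the length
    assert len(base64.b64encode(data)) == encoded_length(n)
    overhead = encoded_length(n) / n - 1
    print(n, encoded_length(n), f"{overhead:.1%}")
```

For short inputs the padding dominates, so the relative overhead can be well above 33%; it approaches one third as the input grows.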

Design

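The set of characters used to represent the 64 values varies between implementations. The general strategy is to choose characters that are common to most encodings and also printable, so that the data is unlikely to be modified in transit through systems that were historically not 8-bit clean, such as email. For example, MIME's Base64 uses A–Z, a–z, and 0–9 for the first 62 values; other variants do the same but may differ in the last two symbols (UTF-7 is one example).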
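To make the "last two symbols" point concrete, the sketch below encodes the same two bytes with three alphabets using Python's `base64` module. The URL-safe variant is the one defined in RFC 4648 and is not discussed in the text above; the `._` pair is a made-up variant used only for illustration.

```python
import base64

# Two bytes chosen so that the output uses index 62 and index 63.
raw = bytes([0xFB, 0xFF])

standard = base64.b64encode(raw)                 # standard alphabet: + and /
url_safe = base64.urlsafe_b64encode(raw)         # URL-safe alphabet: - and _
custom = base64.b64encode(raw, altchars=b"._")   # hypothetical variant

print(standard)  # b'+/8='
print(url_safe)  # b'-_8='
print(custom)    # b'._8='
```

Only the characters for values 62 and 63 change; the first 62 symbols and the padding character stay the same.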

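Encodings of this type were first used for dial-up communication between systems running the same operating system, for example uuencode for UNIX and BinHex for the TRS-80 (later ported to the Macintosh). These schemes could make assumptions about which characters were safe to use; uuencode, for instance, uses uppercase letters, digits, and many punctuation characters, but no lowercase letters.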

Base64 table from RFC 4648

Index  Binary  Char    Index  Binary  Char    Index  Binary  Char    Index  Binary  Char
0      000000  A       16     010000  Q       32     100000  g       48     110000  w
1      000001  B       17     010001  R       33     100001  h       49     110001  x
2      000010  C       18     010010  S       34     100010  i       50     110010  y
3      000011  D       19     010011  T       35     100011  j       51     110011  z
4      000100  E       20     010100  U       36     100100  k       52     110100  0
5      000101  F       21     010101  V       37     100101  l       53     110101  1
6      000110  G       22     010110  W       38     100110  m       54     110110  2
7      000111  H       23     010111  X       39     100111  n       55     110111  3
8      001000  I       24     011000  Y       40     101000  o       56     111000  4
9      001001  J       25     011001  Z       41     101001  p       57     111001  5
10     001010  K       26     011010  a       42     101010  q       58     111010  6
11     001011  L       27     011011  b       43     101011  r       59     111011  7
12     001100  M       28     011100  c       44     101100  s       60     111100  8
13     001101  N       29     011101  d       45     101101  t       61     111101  9
14     001110  O       30     011110  e       46     101110  u       62     111110  +
15     001111  P       31     011111  f       47     101111  v       63     111111  /
Padding: =
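The table is regular enough to be generated rather than typed out. As a small sketch, the alphabet can be built from Python's `string` constants (the name `ALPHABET` is chosen here for illustration), and a few rows printed to compare against the table:

```python
import string

# A-Z, a-z, 0-9, then the two symbols for values 62 and 63.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

assert len(ALPHABET) == 64 and len(set(ALPHABET)) == 64

# Print the rows where the alphabet switches character class.
for index in (0, 25, 26, 51, 52, 61, 62, 63):
    print(f"{index:2d}  {index:06b}  {ALPHABET[index]}")
```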

Examples

As a worked example, the encoded value of Man is TWFu. Encoded in ASCII, the characters M, a, and n are stored as the byte values 77, 97, and 110, which are the 8-bit binary values 01001101, 01100001, and 01101110. These three values are joined together into a 24-bit string, producing 010011010110000101101110. Groups of 6 bits (6 bits can take 2^6 = 64 different values) are converted into individual numbers from start to end (in this case, there are four numbers in a 24-bit string), which are then converted into their corresponding Base64 characters. As this example illustrates, Base64 encoding converts three octets into four encoded characters.

Example: Base64 encoding of the source string Man

Source   Text (ASCII)   M            a            n
Source   Octets         77 (0x4d)    97 (0x61)    110 (0x6e)
Bits                    010011  010110  000101  101110
Base64   Sextets        19      22      5       46
Base64   Character      T       W       F       u
Base64   Octets         84 (0x54)  87 (0x57)  70 (0x46)  117 (0x75)

Note: the final block may have = appended as padding so that the last group contains four Base64 characters.
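The Man example can be reproduced with a few bit operations. The following is a minimal sketch, assuming the RFC 4648 alphabet; `encode_triplet` is a hypothetical helper that handles exactly one full group of three octets and ignores padding.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def encode_triplet(three_bytes: bytes) -> str:
    """Encode exactly three octets as four Base64 characters."""
    if len(three_bytes) != 3:
        raise ValueError("expected exactly three octets")
    bits = int.from_bytes(three_bytes, "big")  # the 24-bit string as an integer
    # Cut the 24 bits into four 6-bit values, most significant first.
    sextets = [(bits >> shift) & 0b111111 for shift in (18, 12, 6, 0)]
    return "".join(ALPHABET[s] for s in sextets)

print(encode_triplet(b"Man"))  # TWFu
```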

Hexadecimal to octal transformation is useful to convert between binary and Base64. For example, the hexadecimal representation of the 24 bits above is 4D616E, whose octal is 23260556. Split into pairs (23 26 05 56) and map each to decimal (19 22 05 46); using those four decimal numbers as indices for the Base64 alphabet yields the ASCII characters TWFu.
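The hexadecimal-to-octal shortcut works because one octal digit holds 3 bits, so two octal digits hold exactly one sextet. A short sketch of the same steps, with variable names chosen for illustration:

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

value = 0x4D616E                       # the three octets of "Man" as one number
octal_digits = format(value, "08o")    # eight octal digits, 3 bits each
pairs = [octal_digits[i:i + 2] for i in range(0, 8, 2)]
indices = [int(pair, 8) for pair in pairs]

print(octal_digits)                           # 23260556
print(pairs)                                  # ['23', '26', '05', '56']
print(indices)                                # [19, 22, 5, 46]
print("".join(ALPHABET[i] for i in indices))  # TWFu
```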

If there are only two significant input octets (e.g., "Ma"), or when the last input group contains only two octets, all 16 bits are captured in the first three Base64 digits (18 bits); the two least significant bits of the last 6‑bit block will be zero and are discarded on decoding (along with the succeeding = padding character).

Source   Text (ASCII)   M            a
Source   Octets         77 (0x4d)    97 (0x61)
Bits                    010011  010110  000100
Base64   Sextets        19      22      4       (padding)
Base64   Character      T       W       E       =
Base64   Octets         84 (0x54)  87 (0x57)  69 (0x45)  61 (0x3d)
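The short final group can be sketched in the same style as a full group: the bits are shifted left until their count is a multiple of 6, and the output is filled up to four characters with =. The helper name `encode_tail` is hypothetical, and the RFC 4648 alphabet is assumed.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def encode_tail(tail: bytes) -> str:
    """Encode a final group of one or two octets, including '=' padding."""
    if len(tail) not in (1, 2):
        raise ValueError("a tail has one or two octets")
    n_bits = 8 * len(tail)
    zero_fill = -n_bits % 6                  # 4 zero bits for 1 octet, 2 for 2 octets
    bits = int.from_bytes(tail, "big") << zero_fill
    n_sextets = (n_bits + zero_fill) // 6    # 2 or 3 characters of real data
    chars = [
        ALPHABET[(bits >> (6 * i)) & 0b111111]
        for i in reversed(range(n_sextets))
    ]
    return "".join(chars) + "=" * (4 - n_sextets)

print(encode_tail(b"Ma"))  # TWE=
print(encode_tail(b"M"))   # TQ==
```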

Output padding

Because Base64 is a six‑bit encoding, and because the decoded values are divided into 8‑bit octets, every four characters of Base64‑encoded text (4 sextets = 4 × 6 = 24 bits) represent three octets of unencoded text or data (3 octets = 3 × 8 = 24 bits). This means that when the length of the unencoded input is not a multiple of three, the encoded output must have padding added so that its length is a multiple of four. The padding character is =, which indicates that no further bits are needed to fully encode the input.

The example below illustrates how truncating the input changes the output padding:

Input text     Input length   Output text         Output length   Padding
light work.    11             bGlnaHQgd29yay4=    16              1
light work     10             bGlnaHQgd29yaw==    16              2
light wor      9              bGlnaHQgd29y        12              0
light wo       8              bGlnaHQgd28=        12              1
light w        7              bGlnaHQgdw==        12              2
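The rows of this table can be regenerated with Python's `base64` module. In the sketch below, `padding_count` is a hypothetical helper expressing the rule that the number of = characters depends only on the input length modulo 3.

```python
import base64

def padding_count(n_bytes: int) -> int:
    """Number of '=' characters in the padded encoding of n_bytes bytes."""
    return -n_bytes % 3

text = b"light work."
for end in range(len(text), len(text) - 5, -1):
    chunk = text[:end]
    encoded = base64.b64encode(chunk).decode("ascii")
    # The rule above should agree with what the encoder actually emits.
    assert encoded.count("=") == padding_count(len(chunk))
    print(f"{chunk.decode():<12} {len(chunk):>2}  {encoded:<17} {len(encoded):>2}  {encoded.count('=')}")
```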

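The padding character is not essential for decoding, since the number of missing bytes can be inferred from the length of the encoded text. Some implementations require padding, while others do not. One common case in which padding is required is the concatenation of multiple Base64-encoded files.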

Decoding Base64 with padding

When decoding Base64 text, four characters are typically converted back to three bytes. The only exceptions are when padding characters exist. A single = indicates that the four characters will decode to only two bytes, while == indicates that the four characters will decode to only a single byte. For example:

Encoded         Padding   Bytes in last group   Decoded
bGlnaHQgdw==    ==        1                     light w
bGlnaHQgd28=    =         2                     light wo
bGlnaHQgd29y    None      3                     light wor

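Another way to interpret the padding character is as an instruction to discard 2 trailing bits from the bit string each time a = is encountered. For example, when decoding bGlnaHQgdw==, each character except the trailing = characters is converted to its 6-bit value, and then 2 bits are discarded for each of the two = characters, 4 bits in total. The final group dw== therefore contributes 12 - 4 = 8 bits, which is 1 byte.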
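The "discard 2 bits per =" reading translates directly into code. The following is a deliberately simple sketch that works on a string of '0' and '1' characters for clarity rather than speed; `decode_padded` is a hypothetical helper and assumes the RFC 4648 alphabet and well-formed, padded input.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def decode_padded(encoded: str) -> bytes:
    """Decode padded Base64 by dropping 2 trailing bits per '=' character."""
    if len(encoded) % 4 != 0:
        raise ValueError("padded input must be a multiple of 4 characters")
    symbols = encoded.rstrip("=")
    n_pad = len(encoded) - len(symbols)
    if n_pad > 2:
        raise ValueError("at most two '=' characters are allowed")
    # Each remaining character contributes 6 bits.
    bits = "".join(f"{ALPHABET.index(ch):06b}" for ch in symbols)
    if n_pad:
        bits = bits[: -2 * n_pad]  # discard 2 bits for every '='
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

print(decode_padded("bGlnaHQgdw=="))  # b'light w'
print(decode_padded("bGlnaHQgd28="))  # b'light wo'
print(decode_padded("bGlnaHQgd29y"))  # b'light wor'
```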

Decoding Base64 without padding

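Without padding, decoding proceeds as usual at four characters to three bytes, and the end of the input may be a group of fewer than four characters. Only two or three characters can be left over; a single leftover character is impossible, because one Base64 character carries only 6 bits and a byte needs at least two characters: the first contributes 6 bits and the second contributes its first 2 bits. For example: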

Characters in last group   Encoded         Bytes in last group   Decoded
2                          bGlnaHQgdw      1                     light w
3                          bGlnaHQgd28     2                     light wo
4                          bGlnaHQgd29y    3                     light wor
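The relationship between leftover characters and decoded bytes can be written as a small length calculation. This is a sketch only; `decoded_length` is a hypothetical helper that looks at the character count and does not validate the characters themselves.

```python
def decoded_length(n_chars: int) -> int:
    """Number of bytes represented by n_chars unpadded Base64 characters."""
    full_groups, leftover = divmod(n_chars, 4)
    if leftover == 1:
        # One character carries only 6 bits, which is less than one byte.
        raise ValueError("a single leftover character cannot encode a byte")
    # leftover 0 -> 0 bytes, 2 -> 12 bits -> 1 byte, 3 -> 18 bits -> 2 bytes
    return 3 * full_groups + (leftover * 6) // 8

for encoded in ("bGlnaHQgdw", "bGlnaHQgd28", "bGlnaHQgd29y"):
    print(encoded, len(encoded), decoded_length(len(encoded)))
```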

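Decoders differ in how they handle missing padding. Moreover, accepting unpadded input means that several distinct strings decode to the same byte sequence, which is a potential security risk.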
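Both points can be observed with Python's standard library, used here as one example of a decoder that requires padding. The lenient wrapper `decode_lenient` is a hypothetical helper written for this sketch, not a standard function.

```python
import base64
import binascii

def decode_lenient(encoded: str) -> bytes:
    """Hypothetical helper: restore missing '=' padding, then decode."""
    padded = encoded + "=" * (-len(encoded) % 4)
    return base64.b64decode(padded)

# Python's standard decoder rejects the unpadded form.
try:
    base64.b64decode("bGlnaHQgdw")
except binascii.Error as exc:
    print("standard decoder:", exc)

# The lenient helper accepts both spellings and returns the same bytes.
print(decode_lenient("bGlnaHQgdw"))    # b'light w'
print(decode_lenient("bGlnaHQgdw=="))  # b'light w'
```

A system that compares or deduplicates values by their encoded text would treat these two strings as different even though they carry the same data, which is one way the ambiguity can become a problem.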