unicode

Unicode 是抽象的数字编码，UTF-8 是具体的字节编码。

第一个字节	第二个字节	第三个字节	第四个字节	用于实际编码的 bit 数量	能表示的最大值
-	-	-	-	-	-
0xxxxxxx				7	127
110xxxxx	10xxxxxx			11	2047
1110xxxx	10xxxxxx	10xxxxxx		16	65535
11110xxx	10xxxxxx	10xxxxxx	10xxxxxx	21	1114111

Python 中，可以用 \u 或 \U 前缀表示 Unicode 字符, 如 '\U0010ffff', '\uffff', '\U00000042', '\u0042'。

Python:

>>> s = '😋'
>>> s.encode()
b'\xf0\x9f\x98\x8b'
>>> import json
>>> json.dumps(s)
'"\\ud83d\\ude0b"'
>>> ord(s)
128523
>>> len(s)
1

一个字符，在内存中，普通 ascii 字符占用 1 字节，普通汉字占用 2 字节，emoji 字符占用 4 字节，虽然他们都是长度 1。

一个字符串包含多种字符时，占用内存为最占内存的字符 x 长度，"s" * 1024 占 1K 内存，"s" * 1023 + "😋" 占 4k 内存。

JavaScript:

> const s = '😋'
> s.length
2
> (s + 'x').length
3
> (s + '中').length
3
> new TextEncoder().encode(s)
Uint8Array [ 240, 159, 152, 139 ]
> new TextEncoder().encode('中')
Uint8Array [ 228, 184, 173 ]

> const n = 1024*1024*100
> const ss = '😋'.repeat(n)
> ss.indexOf('x')  // active this string
-1
// memory usage 400M

> const ss = '中'.repeat(n); ss.indexOf('x')
// memory usage 200M

> const ss = 's'.repeat(n); ss.indexOf('x')
// memory usage 100M

> const ss = 's'.repeat(n) + '中'.repeat(n); ss.indexOf('x')
// memory usage 400M, is 200M + 200M, not 100M + 200M

> const ss = 's'.repeat(n) + '😋'.repeat(n);; ss.indexOf('x')
// memory usage 600M, is 200M + 2 * 200M

JavaScript 的内存使用情况与 Python 类似，有对普通纯 ascii 字符串做优化。 emoji.length == 2，因为字符在内存中使用 UTF-16 编码存放，.length 返回的是 UTF-16 的单元数量。emoji 被认为是 2 单元占 4 字节内存。

Python 无法存储非法的 unicode：

>>> bytes([254,255]).decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfe in position 0: invalid start byte

Unicode 与 UTF-8

emoji 例子 😋