Compressing Chinese Fonts

2021-06-19 · coding · Chinese fonts

Related post: The Ultimate Solution for Chinese Fonts: Slicing the Font

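Introduction

Chinese fonts are awkward to use on the web. Because Chinese has so many characters, a Chinese font file is usually large, and 3 to 4 MB is common. Loading the whole file slows down page loads and wastes bandwidth on glyphs the page never shows. Compressing Chinese fonts is therefore well worth doing.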

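Some existing solutions

Web fonts

Font providers such as Google Fonts offer a solution for some Chinese fonts. First, they run large CDN networks and can apply compression such as gzip in transit, so users around the world load fonts reasonably quickly. Second, they apply several optimizations, such as splitting a font into multiple files by character usage frequency to reduce how much has to be loaded.

@Google is using #MachineLearning to optimize Chinese, Japanese & Korean web fonts - so they load faster than ever: http://goo.gl/4JsCsz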

02/05/2019 from https://design.google/news/google-fonts-launches-chinese-support/

Google Fonts launches Simplified and Traditional Chinese support

New year—new, faster fonts. In the spirit of the Lunar New Year, the Google Fonts catalog now includes five Simplified and two Traditional Chinese fonts—the Chinese written language differs according to country—for designers and developers working with Chinese text. Since Chinese fonts often contain more than 10,000 characters, single font file delivery is unacceptably slow. Building on earlier launches for Korean and Japanese, Google Fonts has analyzed character usage over millions of public web pages to build optimized font ”slicing” patterns for both Simplified and Traditional Chinese. This allows modern web browsers to only download the portions of a font—typically a very small fraction of the complete set—containing the characters that they need.

Head over to Google Fonts to check out—and try out—the Simplified Chinese and Traditional Chinese libraries.

Sample CSS that Google Fonts serves to load a font:

@font-face {
  font-family: 'Noto Sans SC';
  font-style: normal;
  font-weight: 400;
  font-display: swap;
  src: url(https://fonts.gstatic.com/s/notosanssc/v12/k3kXo84MPvpLmixcA63oeALhLOCT-xWNm8Hqd37g1OkDRZe7lR4sg1IzSy-MNbE9VH8V.4.woff2) format('woff2');
  unicode-range: U+1f1e9-1f1f5, U+1f1f7-1f1ff, U+1f21a, U+1f232, U+1f234-1f237, U+1f250-1f251, U+1f300, U+1f302-1f308, U+1f30a-1f311, U+1f315, U+1f319-1f320, U+1f324, U+1f327, U+1f32a, U+1f32c-1f32d, U+1f330-1f357, U+1f359-1f37e;
}
/* [5] */
@font-face {
  font-family: 'Noto Sans SC';
  font-style: normal;
  font-weight: 400;
  font-display: swap;
  src: url(https://fonts.gstatic.com/s/notosanssc/v12/k3kXo84MPvpLmixcA63oeALhLOCT-xWNm8Hqd37g1OkDRZe7lR4sg1IzSy-MNbE9VH8V.5.woff2) format('woff2');
  unicode-range: U+fee3, U+fef3, U+ff03-ff04, U+ff07, U+ff0a, U+ff17-ff19, U+ff1c-ff1d, U+ff20-ff3a, U+ff3c, U+ff3e-ff5b, U+ff5d, U+ff61-ff65, U+ff67-ff6a, U+ff6c, U+ff6f-ff78, U+ff7a-ff7d, U+ff80-ff84, U+ff86, U+ff89-ff8e, U+ff92, U+ff97-ff9b, U+ff9d-ff9f, U+ffe0-ffe4, U+ffe6, U+ffe9, U+ffeb, U+ffed, U+fffc, U+1f004, U+1f170-1f171, U+1f192-1f195, U+1f198-1f19a, U+1f1e6-1f1e8;
}
...

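My own understanding of this optimization is as follows. The font is split into many files at some granularity; for example, a 4 MB font could become 100 files of about 40 KB each. Using machine learning and similar methods, high-frequency characters and characters that tend to appear together (words, idioms, lines of poetry, and so on) are packed into the same file, and unicode-range in CSS tells the browser which file to load for which characters. Since a typical page uses only a fraction of all Chinese characters, it needs to load only a handful of these files to be fully covered. Even a page with many rare characters only pays the cost of loading a few more files.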
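To make the idea concrete, here is a minimal sketch of how a frequency-ordered character list could be cut into slices, each with its own unicode-range. It is only an illustration of the principle: the helper names (toUnicodeRange, sliceFont), the fixed slice size, and the file naming are assumptions, not how Google Fonts actually builds its slices.

```javascript
// Hypothetical helper: turn a set of characters into a unicode-range value,
// merging consecutive code points into ranges such as U+4e00-4e02.
const toUnicodeRange = (chars) => {
  const points = [...new Set([...chars].map((c) => c.codePointAt(0)))].sort(
    (a, b) => a - b
  );
  const ranges = [];
  for (const p of points) {
    const last = ranges[ranges.length - 1];
    if (last && p === last[1] + 1) {
      last[1] = p; // extend the current range
    } else {
      ranges.push([p, p]); // start a new range
    }
  }
  return ranges
    .map(([start, end]) =>
      start === end
        ? `U+${start.toString(16)}`
        : `U+${start.toString(16)}-${end.toString(16)}`
    )
    .join(", ");
};

// Hypothetical helper: cut a frequency-ordered string into fixed-size slices
// and emit one @font-face rule per slice. Generating the matching font files
// is out of scope here; urlPrefix is an assumed naming scheme.
const sliceFont = (charsByFrequency, sliceSize, fontFamily, urlPrefix) => {
  const chars = [...charsByFrequency];
  const rules = [];
  for (let i = 0; i * sliceSize < chars.length; i++) {
    const slice = chars.slice(i * sliceSize, (i + 1) * sliceSize);
    rules.push(`@font-face {
  font-family: '${fontFamily}';
  src: url(${urlPrefix}.${i}.woff2) format('woff2');
  unicode-range: ${toUnicodeRange(slice)};
}`);
  }
  return rules;
};

// Usage: ten characters, five per slice, gives two @font-face rules.
const rules = sliceFont("的一是不了人我在有他", 5, "Demo Font", "./demo");
console.log(rules.join("\n"));
```

The browser then downloads a slice only when the page contains a character inside that slice's unicode-range, which is why putting frequent characters together keeps the number of requests low.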

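font-spider (字蛛)

font-spider is a smart WebFont compression tool: it automatically analyzes which WebFonts a page uses and subsets them on demand. It works on HTML and CSS files, determining which characters each font needs by checking the fonts used by the page's CSS classes.
It covers simple needs, but it is inconvenient in several ways in practice.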

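fontmin

fontmin is the library font-spider actually uses to compress fonts. It extracts the specified glyphs from a font file and generates a subsetted font, and it can also convert to formats such as eot, woff2, woff, and ttf. This post uses fontmin for some simple font compression.

A web-based visual editor built on fontmin: fonteditor (live demo: https://kekee000.github.io/fonteditor/index-en.html)
A desktop app with a visual interface: fontmin-app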

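Compressing the Chinese characters used in your files

Reading Chinese characters from specified files

To select the files, you can use glob:

// Collect every ts and tsx file under src (globSync from glob v10)
const { globSync } = require("glob");
const filePaths = globSync("src/**/*.@(ts|tsx)");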

Matching Chinese characters
- Regex for Chinese characters: /[\u4e00-\u9fa5]/ or /\p{sc=Han}/u (Unicode property escape)
- Regex for Chinese punctuation: /[\u3000-\u301e\ufe10-\ufe19\ufe30-\ufe44\ufe50-\ufe6b\uff01-\uffee]/
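As a quick illustration of how these patterns differ, the sketch below runs them over a short sample string. The sample text and variable names are made up for the example; the notable point is that the fixed range U+4E00 to U+9FA5 misses characters outside it (such as 㐀, U+3400), while the script-based pattern still matches them.

```javascript
// The two character patterns and the punctuation pattern from the list above,
// with the g flag added so match() returns every occurrence.
const basicHan = /[\u4e00-\u9fa5]/g;
const scriptHan = /\p{sc=Han}/gu;
const punctuation =
  /[\u3000-\u301e\ufe10-\ufe19\ufe30-\ufe44\ufe50-\ufe6b\uff01-\uffee]/g;

// Sample text: ASCII, full-width punctuation, two common characters,
// and one character from CJK Extension A (U+3400).
const sample = "Hello，世界！㐀";

console.log(sample.match(basicHan)); // [ '世', '界' ]
console.log(sample.match(scriptHan)); // [ '世', '界', '㐀' ]
console.log(sample.match(punctuation)); // [ '，', '！' ]
```

If the font you are subsetting contains rare characters, the script-based pattern is the safer choice; for everyday text the fixed range is usually enough.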

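The overall flow: read each matched file as a string, use the regexes to collect the Chinese characters it contains (characters in comments are included too), and build the set of Chinese characters used across the files. Then add a fixed set of ASCII letters, digits, and symbols, such as abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789,./;'[]\`-=<>?:"{}|~!@#$%^&*()_+.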

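Compressing Chinese fonts with fontmin

The code used by this site, for reference:

/**
 * Scan the files in the given folders for Chinese characters and generate
 * subsetted font files in each format.
 */
// This uses glob v10; if you hit an error, check your installed version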
const { glob } = require("glob");
const fs = require("fs");
const Fontmin = require("fontmin");
const path = require("path");

const outputDir = "src/fonts"; // output directory

const matchChinese =
  /[\u4e00-\u9fa5\u3000-\u301e\ufe10-\ufe19\ufe30-\ufe44\ufe50-\ufe6b\uff01-\uffee]/gmu;

// For each font: the files to scan and how to match characters
const fontData = [
  {
    files: ["content/**/*.@(md|mdx)", "src/**/*.@(ts|tsx)"],
    fontPath: path.resolve(__dirname, "../src/assets/HYWenHei-55W.ttf"),
    fontFamily: "HYWenHei",
    defaultText: `abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789,./;'[]\\\`-=<>?:\"{}|~!@#$%^&*()_+`,
    match(str) {
      return (str.match(matchChinese) || []).join("");
    },
  },
  {
    files: [
      "content/blog/rhymes/*.@(md|mdx)",
      "content/blog/secrets/poems.mdx",
    ],
    fontPath: path.resolve(__dirname, "../src/assets/AaKaiSong.ttf"),
    fontFamily: "AaKaiSong",
    defaultText: `abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789,./;'[]\\\`-=<>?:\"{}|~!@#$%^&*()_+`,
    match(str) {
      return (str.match(matchChinese) || []).join("");
    },
  },
  {
    files: [],
    fontPath: path.resolve(__dirname, "../src/assets/SourceCodePro-Medium.ttf"),
    fontFamily: "SourceCodePro",
    defaultText: `abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789,./;'[]\\\`-=<>?:\"{}|~!@#$%^&*()_+`,
  },
];

// Remove duplicate characters while keeping first-seen order
const trimText = (text = "") => [...new Set(text)].join("");

const promiseList = fontData.map(async (item) => {
  const { files, fontPath, fontFamily, defaultText, match } = item;

  const filePaths = await glob(files);

  const matchTextArr = filePaths.map((file) => {
    const data = fs.readFileSync(file, "utf-8");

    const matchText = match(data);

    console.log(`Finished reading ${file}`);

    return matchText;
  });

  const textSubset = trimText(matchTextArr.join("") + defaultText);

  const fontmin = new Fontmin();

  const { name } = path.parse(fontPath);

  fontmin.src(fontPath);
  fontmin.use(
    Fontmin.glyph({
      text: textSubset,
      hinting: false,
    })
  );
  fontmin.use(Fontmin.ttf2woff());
  fontmin.use(Fontmin.ttf2woff2());
  fontmin.dest(outputDir);

  await new Promise((resolve, reject) => {
    fontmin.run(function (err, files) {
      if (err) {
        // Stop here so a failed run is not also reported as a success
        reject(err);
        return;
      }
      console.log(`Wrote font ${name}`);
      resolve();
    });
  });

  return `@font-face {
  font-family: ${fontFamily};
  src: url("./${name}.woff2") format('woff2'),
    url("./${name}.woff") format('woff'),
    url("./${name}.ttf") format('truetype');
  font-weight: normal;
  font-style: normal;
}
`;
});

Promise.all(promiseList).then((values) => {
  fs.writeFileSync(
    path.resolve(outputDir, "font.css"),
    `/* generated by scripts/compressFont.js */
${values.join("")}
`
  );
});

The CSS file generated in the output directory looks like this; the output directory also contains the subsetted font in each format:

/* generated by scripts/compressFont.js */
@font-face {
  font-family: HYWenHei;
  src: url("./HYWenHei-55W.woff2") format('woff2'),
    url("./HYWenHei-55W.woff") format('woff'),
    url("./HYWenHei-55W.ttf") format('truetype');
  font-weight: normal;
  font-style: normal;
}
@font-face {
  font-family: AaKaiSong;
  src: url("./AaKaiSong.woff2") format('woff2'),
    url("./AaKaiSong.woff") format('woff'),
    url("./AaKaiSong.ttf") format('truetype');
  font-weight: normal;
  font-style: normal;
}
@font-face {
  font-family: SourceCodePro;
  src: url("./SourceCodePro-Medium.woff2") format('woff2'),
    url("./SourceCodePro-Medium.woff") format('woff'),
    url("./SourceCodePro-Medium.ttf") format('truetype');
  font-weight: normal;
  font-style: normal;
}

Import this CSS file in your project and you're done. A quick demo of the result:

AaKaiSong: 会有清亮的风使草木伏地...

HYWenHei: 会有清亮的风使草木伏地…
SourceCodePro: 0123456789

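Where this applies

Subsetting by used characters mainly suits static sites, where every Chinese character that needs the font appears in the source files and none is generated dynamically. The subsetting itself is run from scripts/compressFont.js. Remember to run node scripts/compressFont.js before the yarn start and yarn build commands.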

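Limitations

By default, third-party JS libraries are not scanned, so some characters may fall back to another font and render inconsistently; this can be fixed by manually adding those characters to the subset.
As the codebase grows and uses more Chinese characters, the font files grow too. Characters that appear only in comments are also scanned and packed. (For sites that need internationalization, the Chinese strings usually live in a dedicated directory, and scanning just that directory works well.)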

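Future work

Implement this as a build-tool plugin, which could ignore Chinese characters that appear in comments.
Scan the build directory after the build finishes and inject the generated CSS file into the built HTML, so that characters used by libraries are picked up as well. Watch out for character encoding in the build output, though: with create-react-app, for example, Chinese characters in the built JS files are stored as Unicode escapes, such as "\u6587\u4ef6\u7c7b\u578b\u9519\u8bef\uff0c\u8bf7\u4e0a\u4f20xlsx\u6587\u4ef6". So you first need to match the escapes with a regex, convert them back to characters, and then test whether each one is Chinese.
Use a Babel plugin to scan the code, wrapping text that needs a specific font in a dedicated function. This works on either the source code or the bundled output. A simple example: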

// Any name works for the tag function
const customText = (strs, ...params) => {
  const result = [];
  // Interleave the literal parts with the interpolated values
  for (let i = 0, len = strs.length - 1; i < len; i++) {
    result.push(strs[i], params[i]);
  }
  result.push(strs[strs.length - 1]);
  return result.join('');
};
// custom file
const title = customText`示例字体`;

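Processing with Babel:

const { parse } = require('@babel/parser');
const traverse = require('@babel/traverse').default;

// Sample input; in practice, read this from the file being scanned
const code = 'const title = customText`示例字体`;';

const ast = parse(code, {
  sourceType: 'module'
});

const words = [];

traverse(ast, {
  /** customText`Chinese copy that should be scanned` */
  TaggedTemplateExpression(path) {
    const node = path.node;
    // The tag name must match the function used in the source files
    if (node.tag.name === 'customText') {
      const spans = node.quasi.quasis;
      const spansText = spans.map(v => v.value.cooked);
      words.push(spansText.join(''));
    }
  },
});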
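Returning to the second idea above (scanning the build output), the escape-decoding step might look like the sketch below. It is an illustration under stated assumptions: the function names decodeUnicodeEscapes and collectChinese are hypothetical, the input string imitates the create-react-app output quoted earlier, and the simple regex does not distinguish a real \uXXXX escape from one preceded by an escaped backslash.

```javascript
// Same Chinese character and punctuation ranges used earlier in this post.
const matchChinese =
  /[\u4e00-\u9fa5\u3000-\u301e\ufe10-\ufe19\ufe30-\ufe44\ufe50-\ufe6b\uff01-\uffee]/g;

// Hypothetical helper: turn literal "\uXXXX" sequences (a backslash, "u",
// and four hex digits) found in built files back into real characters.
const decodeUnicodeEscapes = (source) =>
  source.replace(/\\u([0-9a-fA-F]{4})/g, (_, hex) =>
    String.fromCharCode(parseInt(hex, 16))
  );

// Hypothetical helper: decode first, then collect the unique Chinese
// characters in first-seen order.
const collectChinese = (source) => {
  const decoded = decodeUnicodeEscapes(source);
  return [...new Set(decoded.match(matchChinese) || [])].join("");
};

// String.raw keeps the backslashes, imitating the text stored in a built file.
const built = String.raw`alert("\u6587\u4ef6\u7c7b\u578b\u9519\u8bef\uff0c\u8bf7\u4e0a\u4f20xlsx\u6587\u4ef6")`;

console.log(collectChinese(built)); // 文件类型错误，请上传
```

The resulting string can be passed to fontmin as the text to keep, in the same way as the characters collected from source files.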

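Case study

Take this site as an example. The original font file is 3199KB. After subsetting, the file that a desktop browser actually loads is in woff2 format and is currently only ~~114KB~~ 260KB, about 8% of the original size.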

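Font slicing

The approach described above only suits statically loaded content. If the custom font also has to display user-entered content, the following approach can be used:

Related post: The Ultimate Solution for Chinese Fonts: Slicing the Font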