How to generate a random Unicode string including supplementary characters?

I'm working on some code for generating random strings. The resulting string appears to contain invalid char combinations. Specifically, I find high surrogates which are not followed by a low surrogate.

Can anyone explain why this is happening? Do I have to explicitly generate a random low surrogate to follow a high surrogate? I had assumed this wasn't needed, as I was using the int variants of the Character class.

Here's the test code, which on a recent run produced the following bad pairings:

Bad pairing: d928 - d863
Bad pairing: da02 - 7bb6
Bad pairing: dbbc - d85c
Bad pairing: dbc6 - d85c

    public static void main(String[] args) {
        Random r = new Random();
        StringBuilder builder = new StringBuilder();
        int count = 500;
        while (count > 0) {
            int codePoint = r.nextInt(Character.MAX_CODE_POINT + 1);
            if (!Character.isDefined(codePoint)
                    || Character.getType(codePoint) == Character.PRIVATE_USE) {
                continue;
            }
            builder.appendCodePoint(codePoint);
            count--;
        }
        String result = builder.toString();

        // Test the result
        char lastChar = 0;
        for (int i = 0; i < result.length(); i++) {
            char c = result.charAt(i);
            if (Character.isHighSurrogate(lastChar) && !Character.isLowSurrogate(c)) {
                System.out.println(String.format("Bad pairing: %s - %s",
                        Integer.toHexString(lastChar), Integer.toHexString(c)));
            }
            lastChar = c;
        }
    }

Best answer

It's possible to randomly generate high or low surrogates. If this results in a low surrogate, or a high surrogate not followed by a low surrogate, the resulting string is invalid. The solution is to simply exclude all surrogates:

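    // Character.isSurrogate is declared for char only, so for an int code point
    // test the general category instead.
    int type = Character.getType(codePoint);
    if (!Character.isDefined(codePoint)
            || type == Character.SURROGATE
            || type == Character.PRIVATE_USE) {
        continue;
    }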
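For reference, here is a minimal sketch of the whole generator with the surrogate check in place. The class name RandomUnicode and the helpers isExcluded, generate and isWellFormed are made up for this illustration and are not part of the question's code. It tests the general category with Character.getType, since Character.isSurrogate is declared for char and not for int code points.

```java
import java.util.Random;

public class RandomUnicode {

    // True if the code point should be skipped: undefined, surrogate, or private use.
    static boolean isExcluded(int codePoint) {
        int type = Character.getType(codePoint);
        return !Character.isDefined(codePoint)
                || type == Character.SURROGATE
                || type == Character.PRIVATE_USE;
    }

    // Builds a string of exactly `count` random code points that pass isExcluded.
    static String generate(Random r, int count) {
        StringBuilder builder = new StringBuilder();
        while (count > 0) {
            int codePoint = r.nextInt(Character.MAX_CODE_POINT + 1);
            if (isExcluded(codePoint)) {
                continue;
            }
            // Supplementary code points are appended as a high/low surrogate pair.
            builder.appendCodePoint(codePoint);
            count--;
        }
        return builder.toString();
    }

    // True if every high surrogate is followed by a low surrogate
    // and no low surrogate appears on its own.
    static boolean isWellFormed(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false;
                }
                i++; // skip the low half of the pair
            } else if (Character.isLowSurrogate(c)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String result = generate(new Random(), 500);
        System.out.println("well formed: " + isWellFormed(result));
        System.out.println("code points: " + result.codePointCount(0, result.length()));
    }
}
```

Unlike the test loop in the question, isWellFormed also rejects a low surrogate that is not preceded by a high one, which is the other way a randomly chosen surrogate can make the string invalid.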

(Technically, you could also allow randomly generated high surrogates and add another random low surrogate, but this would only create other random code points >= 0x10000 which might in turn be undefined or for private use.)
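To illustrate that last point, the sketch below combines a few hand-picked surrogate pairs with Character.toCodePoint and classifies the resulting code point. The class name SurrogatePairDemo and the classify helper are hypothetical, and the three pairs were chosen for this example only.

```java
public class SurrogatePairDemo {

    // Labels a code point the way the generator's filter would treat it.
    static String classify(int codePoint) {
        if (!Character.isDefined(codePoint)) {
            return "undefined";
        }
        if (Character.getType(codePoint) == Character.PRIVATE_USE) {
            return "private use";
        }
        return "usable";
    }

    public static void main(String[] args) {
        // Each row is { high surrogate, low surrogate }.
        char[][] pairs = {
            { '\uD800', '\uDC00' }, // combines to U+10000
            { '\uD8C0', '\uDC00' }, // combines to U+40000
            { '\uDB80', '\uDC00' }, // combines to U+F0000
        };
        for (char[] pair : pairs) {
            int codePoint = Character.toCodePoint(pair[0], pair[1]);
            System.out.println(String.format("%04x %04x -> U+%X (%s)",
                    (int) pair[0], (int) pair[1], codePoint, classify(codePoint)));
        }
    }
}
```

A valid pair always maps to a code point of 0x10000 or above, so pairing random surrogates still requires the same undefined and private-use checks and gains nothing over picking the code point directly.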

Posted by: admin. Please credit the source when reposting: http://www.yc00.com/web/1690482787a355946.html
