opstr; go-icu-regex; Intl.NumberFormat
U.S. Folk: REMEMBER TO VOTE THIS WEEK!
ICU (International Components for Unicode) is a comprehensive set of C/C++ and Java libraries that provide essential Unicode and globalization support for software applications. The project recently announced the ICU 76 release candidate, which brings Unicode 16 support and CLDR 46 locale data updates.
It began as part of Taligent, a joint Apple-IBM venture in the 1990s, before becoming an IBM project and eventually being open-sourced in 1999.
ICU excels at character set conversion, offering arguably the most complete charset data available, built upon decades of IBM’s collection efforts. The library provides sophisticated text handling capabilities including collation based on the [Unicode Collation Algorithm], locale-aware formatting for numbers and dates, and extensive timezone calculations.
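To make the collation bit concrete, here is a minimal sketch using JavaScript’s Intl.Collator, which is ICU-backed in most engines. It assumes a runtime (Node.js, Deno, or a browser) that ships full locale data, and the sample letters are just made up for illustration.

```javascript
// Assumption: the runtime's Intl implementation carries full ICU locale data.
const words = ['z', 'ä', 'a'];

// German collation sorts "ä" alongside "a"...
const german = [...words].sort(new Intl.Collator('de').compare);

// ...while Swedish treats "ä" as a separate letter that comes after "z".
const swedish = [...words].sort(new Intl.Collator('sv').compare);

console.log(german.join(' '));  // a ä z
console.log(swedish.join(' ')); // a z ä
```

Same input, two different "correct" orders, which is precisely why locale-aware collation is not something to hand-roll.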
The framework offers both C++ (ICU4C) and Java (ICU4J) implementations. ICU4C fills critical gaps in C/C++ environments that lack robust Unicode support, while ICU4J extends Java’s built-in internationalization capabilities with enhanced performance and newer Unicode standard compliance. The C/C++ versions are used as bridges to ICU in many other programming languages.
Lots of familiar names rely heavily on ICU, including:
Apple integrates it across macOS, iOS, watchOS, and tvOS
Google employs it in Chrome/ChromeOS, Android, and core web services
Microsoft uses it in Visual Studio Code and Windows Bridge for iOS
Adobe implements it throughout Creative Cloud and Document Cloud
The upcoming ICU 76 introduces significant improvements, including direct formatting support for java.time types and modernized C++ and Java patterns. The release also achieves near-perfect alignment between CLDR and Unicode default sort orders.
In today’s Bonus Drop, we’re covering a few resources that make use of ICU.
TL;DR
(This is an AI-generated summary of today’s Drop using Ollama + llama 3.2 and a custom model.)
This Drop proved somewhat challenging to my custom TL;DR model on the 16GB M1 Mac Mini’s Ollama server. It did the job, but the number of input tokens — and, very likely, the giant bullet list and big blocks of source code — caused it to operate very slowly. I’ll be glad to get it re-setup on the forthcoming M4 Mini (which comes next week while, sadly, I’m AFK).
opstr is a command-line utility that provides convenient access to common string operations, including Unicode-aware manipulations.
The go-icu-regex project is a Go implementation of ICU regular expressions, providing a drop-in replacement for Go’s standard regexp package.
The Intl.NumberFormat in JavaScript provides robust support for formatting numeric output, including various notation styles and currency formats.
opstr
opstr is a command-line utility written in Rust that provides convenient access to common string operations. The tool lets us perform Unicode-aware string manipulations directly from the shell without resorting to icky Python scripts or bloated web applications.
It accepts UTF-8 strings as input and offers locale-aware operations when configured with proper Unicode data. It can be installed via Cargo or downloaded as a pre-built binary (for some platforms) from the project’s releases page.
The tool’s behavior can be customized through environment variables rather than command-line flags. These include settings for output radix, hexadecimal case formatting, color schemes, and locale preferences. For locale-specific operations, you’ll need to generate your own locale data using icu4x-datagen, as the default installation only supports en-US to help keep the binary size as small as possible. The README provides an example, but note that said tool is also deprecated. I haven’t poked at the equivalent in its replacement, since I don’t really need to do much outside of what it supports by default.
Incantations with it take a bit of getting used to, such as how you get stdin to the operations:
$ opstr --op lorem-ipsum 4 | \
    opstr --stdin-as-arg 1 --op base64-encode 1
TG9yZW0gaXBzdW0gaXVudSBwdHVsb3IgZXJlb2x1c3RlIHVtZGV0ai4K
And, while --stdin-as-arg works in many of the --op contexts, you sometimes have to do things like this:
$ opstr --op join ", " $(opstr --op lorem-ipsum 4)
Lorem, ipsum, oremau, yadi, miut, etum.
We’re covering the tool since it has a bonkers number of ops:
base64-decode: base64 decoding of provided hexadecimal string #1
base64-encode: base64 encoding of provided string #1
base64-url-safe-decode: base64 decoding of provided string #1 with URL-appropriate representation (c.f. RFC 3548)
base64-url-safe-encode: base64 encoding of provided string #1 with URL-appropriate representation (c.f. RFC 3548)
camelcase: turn #1 to lowercase and replace the ASCII character after ‘ ‘ or ‘_’ sequences with an uppercase letter
center: put string #1 in the middle of string of width #2 (default 80) repeating char #3 (default #) on both sides
codepoint-frequencies: return the frequency analysis per codepoint of string #1
codepoint-lookup: given the Unicode name as string #1 (e.g. “LATIN SMALL LETTER A”), return its UTF-8 representation (or an empty string, if unknown)
codepoints: represent string #1 with Unicode codepoints as integers, e.g. [72, 105, 10069]
codepoints-names: look up the Unicode name (or ‘unknown-name’ if unknown) of each codepoint of string #1, e.g. [“LATIN SMALL LETTER H”, “LATIN SMALL LETTER DOTLESS ”]
codepoints-unotation: represent string #1 with Unicode codepoints, e.g. [“U+0048”, “U+0069”]
concatenate: concatenate all provided strings
count-codepoints: return the number of Unicode scalars in the Unicode string #1
count-grapheme-clusters: return number of “Grapheme clusters” in string #1 according to Unicode Standard Annex 29 “Unicode Text Segmentation”
count-substring: how often does string #2 non-overlappingly occur in string #1?
count-utf16-bytes: encode string #1 in UTF-16 and return its number of bytes
count-utf8-bytes: encode string #1 in UTF-8 and return its number of bytes
dedent: identify and remove common indentation among all non-empty lines of string #1
dedent-with-substring: remove prefix string #2 at the beginning of every line of string #1
digest-md5: generate the MD5 hexadecimal digest of the given UTF-8 string #1
digest-sha1: generate the SHA1 hexadecimal digest of the given UTF-8 string #1
digest-sha256: generate the SHA256 hexadecimal digest of the given UTF-8 string #1
digest-sha3-256: generate the SHA3-256 hexadecimal digest of the given UTF-8 string #1
emoji-by-name: given a Emoji Sequence Data (UTS #51) description string #1 return the corresponding emoji (e.g. ‘smiling face with halo’ returns ‘😇’)
format: replace {placeholders} in string #1 with consecutive arguments #2, #3, …
grapheme-clusters: return “Grapheme clusters” of string #1 according to Unicode Standard Annex 29 “Unicode Text Segmentation”
guarantee-prefix: if string #1 does not start with string #2, prepend it
guarantee-suffix: if string #1 does not end with string #2, append it
human-readable-bytes: represent integer #1 (as 1024-based count of bytes) in a human-readable manner likely with two decimal points
indent-with-substring: concatenate string #2 with every non-empty line in string #1, keep other lines
is-ascii: does this string #1 only contain ASCII characters?
is-caseinsensitively-equal: do all Unicode strings have the same byte sequence after ASCII lowercasing?
is-contained: does string #1 contain string #2?
is-crlf-lineterminated: is (U+000D CARRIAGE RETURN)(U+000A LINE FEED) the only sequence causing line breaks in string #1?
is-empty: does this string #1 have length zero?
is-equal: do all Unicode strings have the same byte sequence?
is-lf-lineterminated: is U+000A LINE FEED the only character causing line breaks in string #1?
is-prefix: does string #1 start with string #2?
is-suffix: does string #1 end with string #2?
is-whitespace: does the provided string #1 only contain codepoints in the Unicode Whitespace category?
is-whitespace-agnostically-equal: are all strings equal if we ignore any whitespace characters?
join: join all following strings with string #1
length-maximum: return the first string among the longest strings
length-minimum: return the first string among the shortest strings
levensthein-distance: levensthein distance between strings #1 and #2
linebreak-before: linebreak long lines in (text #1) before they reach (integer #2) codepoints
lines-shortened: shorten lines in string #1, if necessary, not to exceed width #2
lorem-ipsum: generate (int #1) words of an Lorem Ipsum text
lowercase-for-ascii: get locale-independent/ASCII lowercase version of string #1
normalize-with-nfc: NFC-normalize Unicode string #1 which applies canonical decomposition followed by canonical composition (c.f. UAX #15)
normalize-with-nfd: NFD-normalize Unicode string #1 which applies canonical decomposition (c.f. UAX #15)
normalize-with-nfkc: NFKC-normalize Unicode string #1 which applies compatibility decomposition followed by canonical composition (c.f. UAX #15)
normalize-with-nfkd: NFKD-normalize Unicode string #1 which applies compatibility decomposition (c.f. UAX #15)
regex-search: does regex pattern #1 occur anywhere inside #2? if so, return matching substring, otherwise empty string
remove-ansi-escape-sequences: remove any ANSI X3.64 (also found in ECMA-48/ISO 6429) sequences in string #1 starting with U+001B ESCAPE
repeat: repeat string #1 several (integer #2) times
replace: replace string #2 with string #3 in string #1
sentence-clusters: return “Sentence clusters” according to Unicode Standard Annex #29 “Unicode Text Segmentation”
similarity: indicate similarity (0 = not, 100 = equal) of two strings with a number between 0 and 100
skip-prefix: remove string #2 from the beginning of string #1 if it exists
skip-suffix: remove string #2 from the end of string #1 if it exists
sort: sort the strings provided
sort-lexicographically: sort the strings provided lexicographically by their Unicode codepoints
split: split string #1 by any of the provided substrings #2, or #3, or …
split-by-whitespaces: split string #1 by any character of Unicode category Whitespace
split-by-whitespaces-limited-at-end: split at most #2 times from the end of the string #1 by any character of Unicode category Whitespace
split-by-whitespaces-limited-at-start: split at most #2 times at the start of the string #1 by any character of Unicode category Whitespace
strike-through: add U+0336 COMBINING LONG STROKE OVERLAY before each codepoint resulting in strike-through text
strip-codepoints: strip codepoints found in string #2 from start or end of string #1
strip-codepoints-at-end: strip codepoints found in string #2 from end of string #1
strip-codepoints-at-start: strip codepoints found in string #2 from start of string #1
strip-whitespaces: strip whitespaces from start and end of string #1
strip-whitespaces-at-end: strip whitespaces from end of string
strip-whitespaces-at-start: strip whitespaces from start of string
subscript: return the subscript version of the provided string #1
substring-byte-indices: return the byte indices where string #2 can be found in string #1
superscript: return the superscript version of the provided string #1
uppercase-for-ascii: get locale-independent/ASCII uppercase version of string #1
utf16-big-endian-bytes: encode string #1 in UTF-16 and return its bytes in big endian order
utf16-little-endian-bytes: encode string #1 in UTF-16 and return its bytes in little endian order
utf8-bytes: encode string #1 in UTF-8 and return its bytes
word-clusters: return “Word clusters” of string #1 according to Unicode Standard Annex 29 “Unicode Text Segmentation”
xml-decode: replace the 5 pre-defined XML entities with their unescaped characters &<>”‘ in string #1
xml-encode: replace the 5 characters &<>”‘ with their pre-defined XML entities in string #1
It also has some guidelines for folks who want to add more ICU ops.
This is the CLI equivalent to R’s spiffy {stringi} library, which I use pretty much every day at work.
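If the difference between a few of those counting ops seems fuzzy, here is a rough JavaScript sketch of what count-codepoints, count-grapheme-clusters, count-utf8-bytes, count-utf16-bytes, and normalize-with-nfc measure. These are illustrative stand-ins built on standard JS APIs, not opstr’s implementation, and the sample string is made up.

```javascript
// Illustrative analogues of a few opstr ops (not opstr's own code).
// Sample: "cafe" + U+0301 COMBINING ACUTE ACCENT, a space, then a thumbs-up
// emoji followed by a skin-tone modifier.
const sample = 'cafe\u0301 \u{1F44D}\u{1F3FD}';

// count-codepoints: spread iterates by code point, not by UTF-16 code unit
const codepoints = [...sample].length;

// count-grapheme-clusters: UAX #29 segmentation via Intl.Segmenter
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const graphemes = [...segmenter.segment(sample)].length;

// count-utf8-bytes / count-utf16-bytes (no byte order mark counted here)
const utf8Bytes = new TextEncoder().encode(sample).length;
const utf16Bytes = sample.length * 2;

// normalize-with-nfc: "e" + U+0301 composes into a single precomposed "é"
const nfcCodepoints = [...sample.normalize('NFC')].length;

console.log({ codepoints, graphemes, utf8Bytes, utf16Bytes, nfcCodepoints });
// { codepoints: 8, graphemes: 6, utf8Bytes: 15, utf16Bytes: 20, nfcCodepoints: 7 }
```

One string, five different "lengths", depending on which question we are actually asking.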
go-icu-regex
The go-icu-regex project is a Go implementation of ICU regular expressions that provides a drop-in replacement for Go’s standard regexp package. This library is handy for when we need Unicode-aware regular expressions that behave consistently across different platforms and programming languages. In other news, it’s 2024 and I still have to re-remember what types of regex are supported in each of the languages I use in a given week (if I want to avoid extra dependencies). This is why we can’t have nice things.
The package maintains API compatibility with Go’s regexp while implementing the ICU regular expression specification. This means we can use it as a direct replacement in existing Go code that uses the standard library’s regex implementation.
The implementation wraps the ICU4C library’s regular expression engine. It includes support for Unicode Technical Standard #18 (Unicode Regular Expressions) and maintains compatibility with other ICU implementations across different programming languages.
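For a feel of what "Unicode-aware" means in the UTS #18 sense, here is a small sketch using JavaScript’s built-in RegExp with the u flag. To be clear, this is not the go-icu-regex API and not ICU’s engine; it only illustrates the kind of Unicode property matching that UTS #18 describes, with made-up sample text.

```javascript
// Plain JavaScript RegExp with the `u` flag; NOT the go-icu-regex API.
const text = 'abc \u03B1\u03B2\u03B3 def'; // "abc αβγ def"

// Script property: match a run of Greek letters
const greek = text.match(/\p{Script=Greek}+/u)[0];

// General category: \p{L} covers letters in any script,
// while \w (without extra flags) only knows ASCII word characters.
const word = 'na\u00EFve'; // "naïve" with a precomposed "ï"
const unicodeWord = /^\p{L}+$/u.test(word);
const asciiWord = /^\w+$/.test(word);

console.log(greek, unicodeWord, asciiWord); // αβγ true false
```

Property names and supported features differ between engines, which is a big part of the "consistent behavior across languages" pitch.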
This library is especially useful when:
We need consistent regex behavior across multiple programming languages
Our apps require advanced Unicode support beyond Go’s native capabilities
We’re porting applications that rely on ICU regex behavior to Go
Since this is a wrapper around ICU4C, there may be some performance overhead compared to Go’s native implementation. However, this tradeoff is often worth it when Unicode compliance and cross-platform consistency are primary requirements. It also means it’s not a “pure Go” library. So much for CISA’s “Use memory-safe languages, like Go” annoying mantra (both Rust and Go can be as unsafe as C/C++).
Intl.NumberFormat
I cannot count the number of times I’ve made functions, across many modern languages, to format numeric output. The ICU-powered Intl.NumberFormat in JavaScript (and, in different forms, in some other languages) provides robust support for such operations.
I’ve included a bunch of examples in the remainder of this section, which were generated from this notebook. To get the same output locally, assuming you have Jupyter installed, you can do the following:
$ # get Deno from https://deno.land/
$ # install the kernel
$ deno jupyter --unstable --install
$ cd to-somewhere-safe-to-download-things
$ curl --silent --output numfmt.ipynb https://rud.is/dl/numfmt.ibynb
$ # prbly shld make sure I'm not pwning you by viewing the notebook
$ # never run untrusted code without looking at the source, first
$ jupyter \
    nbconvert \
    --to markdown \
    --execute \
    --ExecutePreprocessor.kernel_name=deno \
    numfmt.ipynb \
    --stdout
macOS folk may need to do:
$ ln -s /opt/homebrew/share/jupyter/nbconvert ~/Library/Jupyter
if you encounter errors.
Basic number formatting:
const nf = new Intl.NumberFormat('en');
console.log(nf.format(123456.789));
console.log(nf.formatToParts(123456.789));

123,456.789
[
  { type: "integer", value: "123" },
  { type: "group", value: "," },
  { type: "integer", value: "456" },
  { type: "decimal", value: "." },
  { type: "fraction", value: "789" }
]
Notation styles:
const std = new Intl.NumberFormat('en', {
  notation: 'standard'
});
console.log(std.format(9876543.21));
9,876,543.21
Sign display:
const signs = new Intl.NumberFormat('en', {
  style: 'unit',
  unit: 'celsius',
  signDisplay: 'always'
});
console.log(signs.format(23.5));
console.log(signs.format(-5));
console.log(signs.format(0));

+23.5°C
-5°C
+0°C
BigInt formatting:
const bigFormatter = new Intl.NumberFormat('fr');
console.log(bigFormatter.format(987654321098765432n));
987 654 321 098 765 432
Currency with accounting:
const money = new Intl.NumberFormat('en', {
  style: 'currency',
  currency: 'EUR',
  signDisplay: 'exceptZero',
  currencySign: 'accounting'
});
console.log(money.format(42.42));
console.log(money.format(-42.42));
console.log(money.format(0));

+€42.42
(€42.42)
€0.00
Units:
const bytes = new Intl.NumberFormat('en', {
  style: 'unit',
  unit: 'megabyte'
});
console.log(bytes.format(42.5));

42.5 MB

const speed = new Intl.NumberFormat('en', {
  style: 'unit',
  unit: 'kilometer-per-hour'
});
console.log(speed.format(88));

88 km/h
More notation styles:

const compact = new Intl.NumberFormat('en', {
  notation: 'compact'
});
console.log(compact.format(9876543.21));

9.9M

const scientific = new Intl.NumberFormat('en', {
  notation: 'scientific'
});
console.log(scientific.format(9876543.21));

9.877E6

const engineering = new Intl.NumberFormat('en', {
  notation: 'engineering'
});
console.log(engineering.format(9876543.21));

9.877E6
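All of the examples above stick to one locale per formatter. As a small sketch of the locale-aware side, the hypothetical helper below (formatAcrossLocales is my own name, not part of the Intl API) runs one value through several locales. It assumes the runtime has full ICU locale data available.

```javascript
// Hypothetical helper (not part of Intl): format one value under several locales.
function formatAcrossLocales(value, locales, options = {}) {
  return Object.fromEntries(
    locales.map((locale) => [
      locale,
      new Intl.NumberFormat(locale, options).format(value),
    ])
  );
}

const byLocale = formatAcrossLocales(123456.789, ['en-US', 'de-DE', 'en-IN']);
console.log(byLocale);
// { 'en-US': '123,456.789', 'de-DE': '123.456,789', 'en-IN': '1,23,456.789' }
```

The third argument passes any Intl.NumberFormat options straight through, so the same helper should work for the currency, unit, and notation styles shown earlier.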
FIN
Remember, you can follow and interact with the full text of The Daily Drop’s free posts on Mastodon via @dailydrop.hrbrmstr.dev@dailydrop.hrbrmstr.dev ☮️
https://dailydrop.hrbrmstr.dev/2024/11/03/bonus-drop-66-2024-11-03-eye-sea-ewe/