“Ignore Punctuation” Options

Contents

  1. Overview
  2. Non-Ignorable
  3. Blanked
  4. Shifted
  5. Shift-Trimmed
  6. Variable-After

Overview

By default, spaces and punctuation characters add primary (base character) differences. Such characters sort less-than digits and letters. For example, the default collation yields “De Anza” < “de-luge” < “deanza”.

UCA/CLDR/ICU provide several “ignore punctuation” collation settings, also known as Variable Weighting or Alternate Handling. These options change the sorting behavior of “variable” characters algorithmically. “Variable” characters are those with low (but non-zero) primary weights up to a threshold, the “variable top”. By default, CLDR and ICU treat spaces and punctuation as variable; this can be changed via API. The DUCET also treats most symbols as variable.
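The definition of “variable” can be illustrated with a minimal sketch. The weights, the threshold value and the names TOY_PRIMARY, VARIABLE_TOP and is_variable are assumptions made up for this illustration; they are not real DUCET or CLDR data and not an ICU API.

```python
# Toy primary weights: hypothetical values, not real DUCET/CLDR data.
TOY_PRIMARY = {
    "\u0301": 0,   # combining acute accent: primary-ignorable
    " ": 1,        # space
    "-": 2,        # hyphen-minus
    "1": 10,       # digit
    "a": 20,       # letter
}

# Hypothetical threshold: the highest primary weight still treated as variable.
VARIABLE_TOP = 2


def is_variable(ch: str) -> bool:
    """Return True if ch has a non-zero primary weight at or below VARIABLE_TOP."""
    p = TOY_PRIMARY[ch]
    return 0 < p <= VARIABLE_TOP
```

In this toy model, raising VARIABLE_TOP would make more characters variable, which is roughly what moving the variable top does.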

Non-Ignorable

The default behavior in CLDR and ICU, shown above, is not to ignore punctuation (alternate=non-ignorable): variable characters map to their normal primary collation elements.
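As a rough illustration of the non-ignorable behavior, the following sketch builds a toy two-level sort key in which spaces and hyphens keep low primary weights. The weights and the function name non_ignorable_key are assumptions for this example, it handles only ASCII letters, and it is not how ICU builds sort keys.

```python
# Toy primary weights for variable characters (hypothetical values).
VARIABLE_PRIMARY = {" ": 1, "-": 2}


def non_ignorable_key(text: str):
    """Build a toy (primary, tertiary) sort key; variables keep their primaries."""
    primary, tertiary = [], []
    for ch in text:
        if ch in VARIABLE_PRIMARY:
            primary.append(VARIABLE_PRIMARY[ch])   # low, but non-zero
        else:
            primary.append(100 + ord(ch.lower()))  # letters sort above variables
        tertiary.append(1 if ch.isupper() else 0)  # case is a tertiary difference
    return (primary, tertiary)


words = ["deanza", "de-luge", "De Anza"]
print(sorted(words, key=non_ignorable_key))
```

Because the space and the hyphen differ from a letter on the primary level, they decide the order before case is ever considered.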

All of the following options cause variable characters to be ignored on levels 1..3. Only when strings compare equal up to the tertiary level may variable characters make a difference, depending on the options.

Here is an overview of the sorting results with these options.

Non-ignorable   Blanked        Shifted    Shift-Trimmed   Variable-After
delug           delug          delug      delug           delug
de-luge         de-luge        de-luge    deluge          deluge
delu-ge         delu-ge (*)    delu-ge    de-luge         deluge-
deluge          deluge (*)     deluge     delu-ge         delu-ge
Deluge          deluge- (*)    deluge-    deluge-         de-luge
deluge-         Deluge         Deluge     Deluge          Deluge

Items with (*) compare equal to the preceding ones, and their relative order is arbitrary. These only occur in the Blanked column. This table shows the results of a stable sort algorithm with the non-ignorable column as input.

Blanked

The simplest option is to “ignore punctuation” completely, as if all variable characters (and following combining marks) had been removed from the input strings before comparing them.

For example: “De Anza” = “De-Anza” = “DeAnza”.

In ICU, this option is selected with alternate=shifted and strength=primary|secondary|tertiary. (ICU does not support Blanked combined with strength=identical.)

The implementation “blanks” out all weights of the variable characters’ collation elements.
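The effect of blanking can be sketched with a toy sort key that simply drops variable characters. The set VARIABLE and the function blanked_key are assumptions for this illustration; the sketch covers only ASCII letters and does not handle combining marks that follow a variable character.

```python
# Toy set of variable characters (an assumption for this sketch).
VARIABLE = {" ", "-"}


def blanked_key(text: str):
    """Build a toy (primary, tertiary) key as if variables had been removed."""
    kept = [ch for ch in text if ch not in VARIABLE]
    primary = [ord(ch.lower()) for ch in kept]          # letters only
    tertiary = [1 if ch.isupper() else 0 for ch in kept]  # case difference
    return (primary, tertiary)


# Python's sorted() is stable, so equal keys keep their input order.
words = ["delug", "de-luge", "delu-ge", "deluge", "Deluge", "deluge-"]
print(sorted(words, key=blanked_key))
```

Strings that differ only in variable characters get identical keys here, so their relative order depends entirely on the input order and the stability of the sort.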

With all of the following options, variable characters are ignored on levels 1..3 but add distinctions on level 4 (quaternary level).

Shifted

Among strings that compare tertiary-equal, that is, they contain the same letters, accents and casing:

  • Sorts all variable characters less-than (before) regular characters.
  • Appending a variable character makes a string sort greater-than the string without it.
  • Inserting a variable character makes a string sort less-than the string without it.
  • Inserting a variable character earlier in a string makes it sort less-than inserting the variable character later in the string.

The result is similar to Merging Sort Keys (with shorter prefixes sorting less-than longer ones), as in last-name+first-name sorting, except that it applies only among tertiary-equal strings.

For example: “de-luge” < “delu-ge” < “deluge” < “deluge-”.

In ICU, this option is selected with alternate=shifted and strength=quaternary|identical.

The implementation “shifts” the primary weight p of the collation element [p, s, t, q] of each variable character down three levels: [0, 0, 0, p]. Regular characters with primary collation elements get a high quaternary weight, higher than that of any variable character.
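The shifting step can be sketched with a toy sort key. The weights in VARIABLE_PRIMARY, the constant HIGH_Q and the function shifted_key are assumptions for this illustration; the secondary level is left out because the toy handles only unaccented ASCII letters.

```python
# Toy primary weights of variable characters (hypothetical values).
VARIABLE_PRIMARY = {" ": 1, "-": 2}
# Toy quaternary weight for regular characters, above every variable weight.
HIGH_Q = 0xFFFF


def shifted_key(text: str):
    """Build a toy (primary, tertiary, quaternary) key with shifted variables."""
    primary, tertiary, quaternary = [], [], []
    for ch in text:
        if ch in VARIABLE_PRIMARY:
            # [p, s, t] becomes [0, 0, 0, p]: only the quaternary level sees it.
            quaternary.append(VARIABLE_PRIMARY[ch])
        else:
            primary.append(ord(ch.lower()))
            tertiary.append(1 if ch.isupper() else 0)
            quaternary.append(HIGH_Q)
    return (primary, tertiary, quaternary)


words = ["deluge-", "deluge", "delu-ge", "de-luge"]
print(sorted(words, key=shifted_key))
```

Since a variable character contributes a low quaternary weight where a regular character would contribute a high one, an earlier hyphen makes the string sort earlier, and a trailing hyphen only lengthens the quaternary level.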

Note that this behavior differs from collation on the secondary and tertiary levels, because normal collation elements get low secondary and tertiary weights but high quaternary weights. Adding an accent difference anywhere makes a string sort greater-than the string without it, and adding an accent difference earlier makes it sort greater-than adding it later. For example, “deanza” < “deanzä” < “deänza” < “dëanza”. (Compare the ‘ä’/‘ë’ positions here with the ‘-’ positions above.)
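The accent ordering can be reproduced with a toy secondary level in which base letters get a low “common” weight and combining marks get a higher one. The constants COMMON_S and ACCENT_S and the function accent_key are assumptions for this sketch; real secondary weights differ per accent.

```python
import unicodedata

COMMON_S = 5   # toy "common" secondary weight for base letters
ACCENT_S = 9   # toy secondary weight for any combining mark


def accent_key(text: str):
    """Build a toy (primary, secondary) key from the NFD form of text."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            secondary.append(ACCENT_S)   # accent: no primary, high secondary
        else:
            primary.append(ord(ch.lower()))
            secondary.append(COMMON_S)   # base letter: low secondary
    return (primary, secondary)


words = ["dëanza", "deänza", "deanzä", "deanza"]
print(sorted(words, key=accent_key))
```

Here the unusual weight is the high one, so an earlier accent makes a string sort later, which is the opposite of the hyphen positions in the Shifted example.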

Shift-Trimmed

Note: This method is not currently implemented in ICU.

Among strings that compare tertiary-equal:

  • Sorts variable characters sometimes less-than, sometimes greater-than regular characters.
  • Inserting a variable character anywhere makes a string sort greater-than the string without it. (The string without variable characters gets an empty quaternary level.)
  • Inserting a variable character earlier in a string makes it sort less-than inserting the variable character later in the string.

For example: “deluge” < “de-luge” < “delu-ge” < “deluge-”.

The Shift-Trimmed method works like Shifted, except that trailing high-quaternary weights (from regular characters) are removed (trimmed). Compared with Shifted, the Shift-Trimmed method sorts strings without variable characters before ones with variable characters added, rather than producing the equivalent of Merging Sort Keys.

Shift-Trimmed is more complicated to implement than all of the other options: When comparing strings, a lookahead (or equivalent) is needed to determine whether a non-variable character gets a zero quaternary weight (if no variables follow) or a high quaternary weight (if at least one variable follows). When building sort keys, trailing high/common quaternary weights are trimmed (backed out) at the end of the quaternary level.
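The sort-key side of this can be sketched as follows: build the quaternary level as for Shifted, then trim trailing high weights. The weights, HIGH_Q and shift_trimmed_key are assumptions for this illustration; the lookahead needed for direct string comparison is not shown.

```python
# Toy primary weights of variable characters (hypothetical values).
VARIABLE_PRIMARY = {" ": 1, "-": 2}
# Toy quaternary weight for regular characters, above every variable weight.
HIGH_Q = 0xFFFF


def shift_trimmed_key(text: str):
    """Build a toy (primary, tertiary, quaternary) key with trimmed quaternaries."""
    primary, tertiary, quaternary = [], [], []
    for ch in text:
        if ch in VARIABLE_PRIMARY:
            quaternary.append(VARIABLE_PRIMARY[ch])  # shifted variable weight
        else:
            primary.append(ord(ch.lower()))
            tertiary.append(1 if ch.isupper() else 0)
            quaternary.append(HIGH_Q)
    # Trim (back out) trailing high quaternary weights.
    while quaternary and quaternary[-1] == HIGH_Q:
        quaternary.pop()
    return (primary, tertiary, quaternary)


words = ["deluge-", "delu-ge", "de-luge", "deluge"]
print(sorted(words, key=shift_trimmed_key))
```

A string without variable characters ends up with an empty quaternary level in this sketch, which is why it sorts before every variant that contains a hyphen.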

Variable-After

Note: This method is not currently implemented in ICU.

Among strings that compare tertiary-equal:

  • Sorts all variable characters greater-than (after) regular characters.
  • Inserting a variable character anywhere makes a string sort greater-than the string without it. (Like Shift-Trimmed.)
  • Inserting a variable character earlier in a string makes it sort greater-than inserting the variable character later in the string. (Like accent differences.)

For example: “deluge” < “deluge-” < “delu-ge” < “de-luge”.

The implementation “shifts” the primary weight p of the collation element [p, s, t, q] of each variable character down three levels: [0, 0, 0, p]. Regular characters with primary collation elements get a low quaternary weight, lower than that of any variable character. This is consistent with collation on secondary and tertiary levels but unlike Merging Sort Keys.

This method extends the UCA well-formedness condition 2 to apply to quaternary weights. (UCA versions before UCA 6.2 did not limit WF2 to secondary & tertiary weights, which meant that several of the Variable Weighting options technically created ill-formed quaternary weights.)
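The Variable-After weighting can be sketched with a toy sort key in which regular characters get a low quaternary weight. The constant LOW_Q, the weights in VARIABLE_PRIMARY and the function variable_after_key are assumptions for this illustration, not an existing ICU feature.

```python
# Toy quaternary weight for regular characters, below every variable weight.
LOW_Q = 1
# Toy primary weights of variable characters (hypothetical, all above LOW_Q).
VARIABLE_PRIMARY = {" ": 2, "-": 3}


def variable_after_key(text: str):
    """Build a toy (primary, tertiary, quaternary) key, variables sorting after."""
    primary, tertiary, quaternary = [], [], []
    for ch in text:
        if ch in VARIABLE_PRIMARY:
            # [p, s, t] becomes [0, 0, 0, p], as in Shifted.
            quaternary.append(VARIABLE_PRIMARY[ch])
        else:
            primary.append(ord(ch.lower()))
            tertiary.append(1 if ch.isupper() else 0)
            quaternary.append(LOW_Q)  # low instead of high
    return (primary, tertiary, quaternary)


words = ["de-luge", "delu-ge", "deluge-", "deluge"]
print(sorted(words, key=variable_after_key))
```

With the low weight on regular characters, an earlier hyphen produces the larger quaternary level, mirroring how accents behave on the secondary level.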