public class SpoofChecker extends Object
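This class, based on Unicode Technical Report #36 and Unicode Technical Standard #39, has two main functions: checking whether two strings are visually confusable with each other, and checking whether an individual string is likely to be an attempt at confusing the reader (spoof detection).
Although originally designed as a method for flagging suspicious identifier strings such as URLs, SpoofChecker has a number of other practical use cases, such as preventing attempts to evade bad-word content filters.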
The following example shows how to use SpoofChecker
to check for confusability between two strings:
SpoofChecker sc = new SpoofChecker.Builder().setChecks(SpoofChecker.CONFUSABLE).build();
int result = sc.areConfusable("desparejado", "ԁеѕрагејаԁо");
System.out.println(result != 0); // true
SpoofChecker uses a builder paradigm: options are specified within the context of a lightweight SpoofChecker.Builder object, and upon calling SpoofChecker.Builder.build(), expensive data loading operations are performed, and an immutable SpoofChecker is returned.
The first line of the example creates a SpoofChecker
object with confusable-checking enabled; the second
line performs the confusability test. For best performance, the instance should be created once (e.g., upon
application startup), and the more efficient areConfusable(java.lang.String, java.lang.String)
method can be used at runtime.
If the paragraph direction used to display the strings is known, it should be passed to areConfusable(int, java.lang.CharSequence, java.lang.CharSequence):
// These strings look identical when rendered in a left-to-right context.
// They look distinct in a right-to-left context.
String s1 = "A1א"; // A1א
String s2 = "Aא1"; // Aא1
SpoofChecker sc = new SpoofChecker.Builder().setChecks(SpoofChecker.CONFUSABLE).build();
int result = sc.areConfusable(Bidi.DIRECTION_LEFT_TO_RIGHT, s1, s2);
System.out.println(result != 0); // true
UTS 39 defines two strings to be confusable if they map to the same skeleton. A skeleton is a
sequence of families of confusable characters, where each family has a single exemplar character.
getSkeleton(java.lang.CharSequence)
computes the skeleton for a particular string, so the following snippet is
equivalent to the example above:
SpoofChecker sc = new SpoofChecker.Builder().setChecks(SpoofChecker.CONFUSABLE).build();
boolean result = sc.getSkeleton("desparejado").equals(sc.getSkeleton("ԁеѕрагејаԁо"));
System.out.println(result); // true
If you need to check if a string is confusable with any string in a dictionary of many strings, rather than calling
areConfusable(java.lang.String, java.lang.String)
many times in a loop, getSkeleton(java.lang.CharSequence)
can be used instead, as
shown below:
// Setup:
String[] DICTIONARY = new String[]{ "lorem", "ipsum" }; // example
SpoofChecker sc = new SpoofChecker.Builder().setChecks(SpoofChecker.CONFUSABLE).build();
HashSet<String> skeletons = new HashSet<String>();
for (String word : DICTIONARY) {
  skeletons.add(sc.getSkeleton(word));
}

// Live Check:
boolean result = skeletons.contains(sc.getSkeleton("1orern"));
System.out.println(result); // true
Note: Since the Unicode confusables mapping table is frequently updated, confusable skeletons are not guaranteed to be the same between ICU releases. We therefore recommend that you always compute confusable skeletons at runtime and do not rely on creating a permanent, or difficult to update, database of skeletons.
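When a match is found it is often useful to know which dictionary word was imitated. The following is a minimal sketch of that idea, not part of ICU4J: SkeletonIndex is a hypothetical helper that keeps a map from skeleton to word in memory and rebuilds it on every startup, in line with the note above. So that it compiles with the JDK alone, the skeleton computation is passed in as a function; with a real checker that function would be s -> sc.getSkeleton(s).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical in-memory index from skeleton to dictionary word. */
final class SkeletonIndex {
    private final Function<String, String> skeletonFn;
    private final Map<String, String> wordBySkeleton = new HashMap<>();

    /**
     * @param skeletonFn stand-in for the checker's skeleton method,
     *                   e.g. s -> sc.getSkeleton(s) for a SpoofChecker sc
     * @param dictionary words that candidates must not imitate
     */
    SkeletonIndex(Function<String, String> skeletonFn, List<String> dictionary) {
        this.skeletonFn = skeletonFn;
        for (String word : dictionary) {
            // If two dictionary words share a skeleton, the first one is kept.
            wordBySkeleton.putIfAbsent(skeletonFn.apply(word), word);
        }
    }

    /** Returns the dictionary word that shares a skeleton with the candidate, if any. */
    Optional<String> findConfusable(String candidate) {
        return Optional.ofNullable(wordBySkeleton.get(skeletonFn.apply(candidate)));
    }
}
```

Because identical strings share a skeleton, a candidate that is itself a dictionary word is also reported; callers that want to allow exact matches can compare the returned word with the candidate.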
The following snippet shows a minimal example of using SpoofChecker
to perform spoof detection on a
string:
SpoofChecker sc = new SpoofChecker.Builder()
    .setAllowedChars(SpoofChecker.RECOMMENDED.cloneAsThawed().addAll(SpoofChecker.INCLUSION))
    .setRestrictionLevel(SpoofChecker.RestrictionLevel.MODERATELY_RESTRICTIVE)
    .setChecks(SpoofChecker.ALL_CHECKS &~ SpoofChecker.CONFUSABLE)
    .build();
boolean result = sc.failsChecks("pаypаl"); // with Cyrillic 'а' characters
System.out.println(result); // true
As in the case for confusability checking, it is good practice to create one SpoofChecker instance at startup, and call the cheaper failsChecks(java.lang.String, com.ibm.icu.text.SpoofChecker.CheckResult) online. The setAllowedChars call specifies the set of allowed characters to be those with type RECOMMENDED or INCLUSION, according to the recommendation in UTS 39. The setChecks call disables the CONFUSABLE checks. It is good practice to disable them if you won't be using the instance to perform confusability checking.
To get more details on why a string failed the checks, use a SpoofChecker.CheckResult
:
SpoofChecker sc = new SpoofChecker.Builder()
.setAllowedChars(SpoofChecker.RECOMMENDED.cloneAsThawed().addAll(SpoofChecker.INCLUSION))
.setRestrictionLevel(SpoofChecker.RestrictionLevel.MODERATELY_RESTRICTIVE)
.setChecks(SpoofChecker.ALL_CHECKS &~ SpoofChecker.CONFUSABLE)
.build();
SpoofChecker.CheckResult checkResult = new SpoofChecker.CheckResult();
boolean result = sc.failsChecks("pаypаl", checkResult);
System.out.println(checkResult.checks); // 16
The checks field is a bitmask of the checks that failed. In this case, there was one check that failed: RESTRICTION_LEVEL, corresponding to the fifth bit (16). The possible checks are:
RESTRICTION_LEVEL: flags strings that violate the Restriction Level test as specified in UTS 39; in most cases, this means flagging strings that contain characters from multiple different scripts.
INVISIBLE: flags strings that contain invisible characters, such as zero-width spaces, or character sequences that are likely not to display, such as multiple occurrences of the same non-spacing mark.
CHAR_LIMIT: flags strings that contain characters outside of a specified set of acceptable characters. See SpoofChecker.Builder.setAllowedChars(com.ibm.icu.text.UnicodeSet) and SpoofChecker.Builder.setAllowedLocales(java.util.Set<com.ibm.icu.util.ULocale>).
MIXED_NUMBERS: flags strings that contain digits from multiple different numbering systems.

These checks can be enabled independently of each other. For example, if you were interested in checking for only the INVISIBLE and MIXED_NUMBERS conditions, you could do:
SpoofChecker sc = new SpoofChecker.Builder()
.setChecks(SpoofChecker.INVISIBLE | SpoofChecker.MIXED_NUMBERS)
.build();
boolean result = sc.failsChecks("৪8");
System.out.println(result); // true
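Since the failed checks are reported as a bitmask, logging code usually wants to turn the mask back into names. The sketch below shows one way to do that with plain bit tests. CheckMaskDecoder is a hypothetical helper, and the flag values are written out by hand only so that the block compiles without ICU4J: 16 for RESTRICTION_LEVEL matches the example above, while the other three values are assumptions. Real code should refer to SpoofChecker.RESTRICTION_LEVEL, SpoofChecker.INVISIBLE, and so on.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that names the bits set in a failed-checks mask. */
final class CheckMaskDecoder {
    // Hand-written flag values (assumptions, except RESTRICTION_LEVEL = 16,
    // which the CheckResult example prints). Prefer the SpoofChecker constants.
    private static final Map<String, Integer> FLAGS = new LinkedHashMap<>();
    static {
        FLAGS.put("RESTRICTION_LEVEL", 16);
        FLAGS.put("INVISIBLE", 32);
        FLAGS.put("CHAR_LIMIT", 64);
        FLAGS.put("MIXED_NUMBERS", 128);
    }

    /** Lists, in table order, the name of every known flag present in the mask. */
    static List<String> decode(int mask) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, Integer> flag : FLAGS.entrySet()) {
            if ((mask & flag.getValue()) != 0) {
                names.add(flag.getKey());
            }
        }
        return names;
    }
}
```

Passing the value 16 from the CheckResult example to decode yields a list containing only RESTRICTION_LEVEL.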
Note: The Restriction Level is the most powerful of the checks. The full logic is documented in
UTS 39, but the basic idea is that strings
are restricted to contain characters from only a single script, except that most scripts are allowed to have
Latin characters interspersed. Although the default restriction level is HIGHLY_RESTRICTIVE
, it is
recommended that users set their restriction level to MODERATELY_RESTRICTIVE
, which allows Latin mixed
with all other scripts except Cyrillic, Greek, and Cherokee, with which it is often confusable. For more details on
the levels, see UTS 39 or SpoofChecker.RestrictionLevel
. The Restriction Level test is aware of the set of
allowed characters set in SpoofChecker.Builder.setAllowedChars(com.ibm.icu.text.UnicodeSet)
. Note that characters which have script code
COMMON or INHERITED, such as numbers and punctuation, are ignored when computing whether a string has multiple
scripts.
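To make the "single script, ignoring COMMON and INHERITED" idea concrete, here is a deliberately simplified sketch that uses only the JDK's Character.UnicodeScript. It is not how SpoofChecker implements the Restriction Level test: it ignores script extensions, the allowed-character set, and the special allowances for Latin. ScriptMix is a hypothetical name.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Set;

/** Simplified illustration of script counting; not the UTS 39 algorithm. */
final class ScriptMix {
    /** Collects the scripts used in the text, skipping COMMON and INHERITED. */
    static Set<UnicodeScript> scriptsOf(String text) {
        Set<UnicodeScript> scripts = EnumSet.noneOf(UnicodeScript.class);
        text.codePoints().forEach(cp -> {
            UnicodeScript script = UnicodeScript.of(cp);
            // Digits and punctuation are COMMON; combining marks are often INHERITED.
            if (script != UnicodeScript.COMMON && script != UnicodeScript.INHERITED) {
                scripts.add(script);
            }
        });
        return scripts;
    }

    /** True when at most one script remains after the filtering above. */
    static boolean isSingleScript(String text) {
        return scriptsOf(text).size() <= 1;
    }
}
```

Under this simplification, "paypal-123" counts as single-script because the hyphen and digits are COMMON, while "p\u0430yp\u0430l" (with Cyrillic U+0430) contains both LATIN and CYRILLIC.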
In some circumstances, the only concern is confusion between identifiers displayed with the same paragraph direction.
An example is the case where identifiers are usernames prefixed with the @ symbol. That symbol will appear to the left in a left-to-right context, and to the right in a right-to-left context, so that an identifier displayed in a left-to-right context can never be confused with an identifier displayed in a right-to-left context.
In that case, the caller should check for both LTR-confusability and RTL-confusability:
boolean confusableInEitherDirection =
sc.areConfusable(Bidi.DIRECTION_LEFT_TO_RIGHT, id1, id2) ||
sc.areConfusable(Bidi.DIRECTION_RIGHT_TO_LEFT, id1, id2);
If the bidiSkeleton is used, the LTR and RTL skeleta should be kept separately and compared, LTR
with LTR and RTL with RTL.
In cases where confusability between the visual appearances of an identifier displayed in a left-to-right context with another identifier displayed in a right-to-left context is a concern, the LTR skeleton of one can be compared with the RTL skeleton of the other. However, this very broad definition of confusability may have unexpected results; for instance, it treats the ASCII identifiers "Mark_" and "_Mark" as confusable.
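The two comparison policies described above can be sketched as a small value class that stores both skeleta of an identifier. DirectionalSkeletons is a hypothetical helper; in real use its two fields would presumably be filled from sc.getBidiSkeleton(Bidi.DIRECTION_LEFT_TO_RIGHT, id) and sc.getBidiSkeleton(Bidi.DIRECTION_RIGHT_TO_LEFT, id). Here they are plain strings so that the block compiles with the JDK alone.

```java
/** Hypothetical holder for the LTR and RTL skeleta of one identifier. */
final class DirectionalSkeletons {
    final String ltr; // skeleton for display in a left-to-right paragraph
    final String rtl; // skeleton for display in a right-to-left paragraph

    DirectionalSkeletons(String ltr, String rtl) {
        this.ltr = ltr;
        this.rtl = rtl;
    }

    /** Narrow policy: compare LTR with LTR and RTL with RTL only. */
    boolean confusableInSameDirection(DirectionalSkeletons other) {
        return ltr.equals(other.ltr) || rtl.equals(other.rtl);
    }

    /** Broad policy: compare the LTR skeleton of one with the RTL skeleton of the other. */
    boolean confusableAcrossDirections(DirectionalSkeletons other) {
        return ltr.equals(other.rtl) || rtl.equals(other.ltr);
    }
}
```

Keeping the two policies as separate methods lets an application opt into the broad definition only where identifiers really can appear in paragraphs of different directions.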
A SpoofChecker
instance may be used repeatedly to perform checks on any number of identifiers.
Thread Safety: The methods on SpoofChecker
objects are thread safe. The test functions for
checking a single identifier, or for testing whether two identifiers are potentially confusable, may be called
concurrently from multiple threads using the same SpoofChecker
instance.
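As an illustration of sharing one instance across threads, the sketch below screens a batch of identifiers with a parallel stream. ConcurrentScreening is a hypothetical helper, and the check is passed in as a Predicate so that the block compiles with the JDK alone; with a real checker the predicate would be id -> sc.failsChecks(id), where sc is the single SpoofChecker built at startup.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/** Hypothetical helper that applies one shared check to many identifiers. */
final class ConcurrentScreening {
    /**
     * Returns the identifiers flagged by the check, in input order.
     * The predicate must be safe to call from several threads at once.
     */
    static List<String> flagged(Predicate<String> failsChecks, List<String> ids) {
        // parallelStream() may evaluate the predicate on multiple worker threads.
        return ids.parallelStream()
                  .filter(failsChecks)
                  .collect(Collectors.toList());
    }
}
```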
Modifier and Type | Class and Description |
---|---|
static class | SpoofChecker.Builder: SpoofChecker Builder. |
static class | SpoofChecker.CheckResult: A struct-like class to hold the results of a Spoof Check operation. |
static class | SpoofChecker.RestrictionLevel: Constants from UTS 39 for use in setRestrictionLevel. |
Modifier and Type | Field and Description |
---|---|
static int | ALL_CHECKS: Enable all spoof checks. |
static int | ANY_CASE: Deprecated. ICU 58 Any case confusable mappings were removed from UTS 39; the corresponding ICU API was deprecated. |
static int | CHAR_LIMIT: Check that an identifier contains only characters from a specified set of acceptable characters. |
static int | CONFUSABLE: Enable this flag in SpoofChecker.Builder.setChecks(int) to turn on all types of confusables. |
static int | HIDDEN_OVERLAY: Check that an identifier does not have a combining character following a character in which that combining character would be hidden; for example 'i' followed by a U+0307 combining dot. |
static UnicodeSet | INCLUSION: Security Profile constant from UTS 39 for use in SpoofChecker.Builder.setAllowedChars(com.ibm.icu.text.UnicodeSet). |
static int | INVISIBLE: Check an identifier for the presence of invisible characters, such as zero-width spaces, or character sequences that are likely not to display, such as multiple occurrences of the same non-spacing mark. |
static int | MIXED_NUMBERS: Check that an identifier does not mix numbers from different numbering systems. |
static int | MIXED_SCRIPT_CONFUSABLE: When performing the two-string areConfusable(java.lang.String, java.lang.String) test, this flag in the return value indicates that the two strings are visually confusable and that they are not from the same script, according to UTS 39 section 4. |
static UnicodeSet | RECOMMENDED: Security Profile constant from UTS 39 for use in SpoofChecker.Builder.setAllowedChars(com.ibm.icu.text.UnicodeSet). |
static int | RESTRICTION_LEVEL: Check that an identifier satisfies the requirements for the restriction level specified in SpoofChecker.Builder.setRestrictionLevel(com.ibm.icu.text.SpoofChecker.RestrictionLevel). |
static int | SINGLE_SCRIPT: Deprecated. ICU 51 Use RESTRICTION_LEVEL |
static int | SINGLE_SCRIPT_CONFUSABLE: When performing the two-string areConfusable(java.lang.String, java.lang.String) test, this flag in the return value indicates that the two strings are visually confusable and that they are from the same script, according to UTS 39 section 4. |
static int | WHOLE_SCRIPT_CONFUSABLE: When performing the two-string areConfusable(java.lang.String, java.lang.String) test, this flag in the return value indicates that the two strings are visually confusable and that they are not from the same script but both of them are single-script strings, according to UTS 39 section 4. |
Modifier and Type | Method and Description |
---|---|
int | areConfusable(int direction, CharSequence s1, CharSequence s2): Check whether two specified strings are visually confusable when displayed in a paragraph with the given direction. |
int | areConfusable(String s1, String s2): Check whether two specified strings are visually confusable. |
boolean | equals(Object other): Equality function. |
boolean | failsChecks(String text): Check the specified string for possible security issues. |
boolean | failsChecks(String text, SpoofChecker.CheckResult checkResult): Check the specified string for possible security issues. |
UnicodeSet | getAllowedChars(): Get a UnicodeSet for the characters permitted in an identifier. |
Set<Locale> | getAllowedJavaLocales(): Get a set of Locale instances for the scripts that are acceptable in strings to be checked. |
Set<ULocale> | getAllowedLocales(): Get a read-only set of locales for the scripts that are acceptable in strings to be checked. |
String | getBidiSkeleton(int direction, CharSequence str): Get the "bidiSkeleton" for an identifier string and a direction. |
int | getChecks(): Get the set of checks that this Spoof Checker has been configured to perform. |
SpoofChecker.RestrictionLevel | getRestrictionLevel(): Deprecated. This API is ICU internal only. |
String | getSkeleton(CharSequence str): Get the "skeleton" for an identifier string. |
String | getSkeleton(int type, String id): Deprecated. ICU 58 |
int | hashCode(): Overrides Object.hashCode(). |
public static final UnicodeSet INCLUSION
Security Profile constant from UTS 39 for use in SpoofChecker.Builder.setAllowedChars(com.ibm.icu.text.UnicodeSet).

public static final UnicodeSet RECOMMENDED
Security Profile constant from UTS 39 for use in SpoofChecker.Builder.setAllowedChars(com.ibm.icu.text.UnicodeSet).

public static final int SINGLE_SCRIPT_CONFUSABLE
When performing the two-string areConfusable(java.lang.String, java.lang.String) test, this flag in the return value indicates that the two strings are visually confusable and that they are from the same script, according to UTS 39 section 4.

public static final int MIXED_SCRIPT_CONFUSABLE
When performing the two-string areConfusable(java.lang.String, java.lang.String) test, this flag in the return value indicates that the two strings are visually confusable and that they are not from the same script, according to UTS 39 section 4.

public static final int WHOLE_SCRIPT_CONFUSABLE
When performing the two-string areConfusable(java.lang.String, java.lang.String) test, this flag in the return value indicates that the two strings are visually confusable and that they are not from the same script but both of them are single-script strings, according to UTS 39 section 4.

public static final int CONFUSABLE
Enable this flag in SpoofChecker.Builder.setChecks(int) to turn on all types of confusables. You may set the checks to some subset of SINGLE_SCRIPT_CONFUSABLE, MIXED_SCRIPT_CONFUSABLE, or WHOLE_SCRIPT_CONFUSABLE to make areConfusable(java.lang.String, java.lang.String) return only those types of confusables.

@Deprecated public static final int ANY_CASE
Deprecated. ICU 58 Any case confusable mappings were removed from UTS 39; the corresponding ICU API was deprecated.

public static final int RESTRICTION_LEVEL
Check that an identifier satisfies the requirements for the restriction level specified in SpoofChecker.Builder.setRestrictionLevel(com.ibm.icu.text.SpoofChecker.RestrictionLevel). The default restriction level is SpoofChecker.RestrictionLevel.HIGHLY_RESTRICTIVE.

@Deprecated public static final int SINGLE_SCRIPT
Deprecated. ICU 51 Use RESTRICTION_LEVEL

public static final int INVISIBLE
Check an identifier for the presence of invisible characters, such as zero-width spaces, or character sequences that are likely not to display, such as multiple occurrences of the same non-spacing mark.

public static final int CHAR_LIMIT
Check that an identifier contains only characters from a specified set of acceptable characters. See SpoofChecker.Builder.setAllowedChars(com.ibm.icu.text.UnicodeSet) and SpoofChecker.Builder.setAllowedLocales(java.util.Set<com.ibm.icu.util.ULocale>). Note that a string that fails this check will also fail the RESTRICTION_LEVEL check.

public static final int MIXED_NUMBERS
Check that an identifier does not mix numbers from different numbering systems.

public static final int HIDDEN_OVERLAY
Check that an identifier does not have a combining character following a character in which that combining character would be hidden; for example 'i' followed by a U+0307 combining dot.
More specifically, the following characters are forbidden from preceding a U+0307:
This list and the number of combining characters considered by this check may grow over time.

public static final int ALL_CHECKS
Enable all spoof checks.
@Deprecated public SpoofChecker.RestrictionLevel getRestrictionLevel()
Deprecated. This API is ICU internal only.

public int getChecks()
Get the set of checks that this Spoof Checker has been configured to perform.

public Set<ULocale> getAllowedLocales()
Get a read-only set of locales for the scripts that are acceptable in strings to be checked.

public Set<Locale> getAllowedJavaLocales()
Get a set of Locale instances for the scripts that are acceptable in strings to be checked. If no limitations on scripts have been specified, an empty set will be returned.

public UnicodeSet getAllowedChars()
Get a UnicodeSet for the characters permitted in an identifier.

public boolean failsChecks(String text, SpoofChecker.CheckResult checkResult)
Check the specified string for possible security issues.
text - A String to be checked for possible security issues.
checkResult - Output parameter, indicates which specific tests failed. May be null if the information is not wanted.

public boolean failsChecks(String text)
Check the specified string for possible security issues.
text - A String to be checked for possible security issues.

public int areConfusable(String s1, String s2)
Check whether two specified strings are visually confusable.
s1 - The first of the two strings to be compared for confusability.
s2 - The second of the two strings to be compared for confusability.

public int areConfusable(int direction, CharSequence s1, CharSequence s2)
Check whether two specified strings are visually confusable when displayed in a paragraph with the given direction.
direction - The paragraph direction with which the identifiers are displayed. Must be either Bidi.DIRECTION_LEFT_TO_RIGHT or Bidi.DIRECTION_RIGHT_TO_LEFT.
s1 - The first of the two strings to be compared for confusability.
s2 - The second of the two strings to be compared for confusability.

public String getBidiSkeleton(int direction, CharSequence str)
Get the "bidiSkeleton" for an identifier string and a direction.
direction - The paragraph direction with which the string is displayed. Must be either Bidi.DIRECTION_LEFT_TO_RIGHT or Bidi.DIRECTION_RIGHT_TO_LEFT.
str - The input string whose bidiSkeleton will be generated.

public String getSkeleton(CharSequence str)
Get the "skeleton" for an identifier string.
str - The input string whose skeleton will be generated.

@Deprecated public String getSkeleton(int type, String id)
Deprecated. ICU 58. See getSkeleton(CharSequence id). Starting with ICU 55, the "type" parameter has been ignored, and starting with ICU 58, this function has been deprecated.
type - No longer supported. Prior to ICU 55, was used to specify the mapping table SL, SA, ML, or MA.
id - The input identifier whose skeleton will be generated.

public boolean equals(Object other)
Equality function.

public int hashCode()
Overrides Object.hashCode().

Copyright © 2016 Unicode, Inc. and others.