java.lang.Object
   ↳ sun.text.normalizer.NormalizerBase
Unicode Normalization
normalize transforms Unicode text into an equivalent composed or decomposed form, allowing for easier sorting and searching of text. normalize supports the standard normalization forms described in Unicode Standard Annex #15, Unicode Normalization Forms.
Characters with accents or other adornments can be encoded in several different ways in Unicode. For example, take the character A-acute. In Unicode, this can be encoded as a single character (the "composed" form):
00C1 LATIN CAPITAL LETTER A WITH ACUTE
or as two separate characters (the "decomposed" form):
0041 LATIN CAPITAL LETTER A
0301 COMBINING ACUTE ACCENT
To a user of your program, however, both of these sequences should be treated as the same "user-level" character "A with acute accent". When you are searching or comparing text, you must ensure that these two sequences are treated equivalently. In addition, you must handle characters with more than one accent. Sometimes the order of a character's combining accents is significant, while in other cases accent sequences in different orders are really equivalent.
Similarly, the string "ffi" can be encoded as three separate letters:
0066 LATIN SMALL LETTER F
0066 LATIN SMALL LETTER F
0069 LATIN SMALL LETTER I
or as the single character
FB03 LATIN SMALL LIGATURE FFI
The ffi ligature is not a distinct semantic character, and strictly speaking it shouldn't be in Unicode at all, but it was included for compatibility with existing character sets that already provided it. The Unicode standard identifies such characters by giving them "compatibility" decompositions into the corresponding semantic characters. When sorting and searching, you will often want to use these mappings.
normalize helps solve these problems by transforming text into the canonical composed and decomposed forms as shown in the first example above. In addition, you can have it perform compatibility decompositions so that you can treat compatibility characters the same as their equivalents. Finally, normalize rearranges accents into the proper canonical order, so that you do not have to worry about accent rearrangement on your own.
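The three behaviours described above (canonical composition and decomposition, compatibility decomposition, and canonical reordering of accents) can be sketched as follows. This is a minimal illustration that assumes the public java.text.Normalizer API as a stand-in, since sun.text.normalizer.NormalizerBase is an internal class; the class name NormalizationForms and its helper methods are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical helper class; uses the public java.text.Normalizer API
// as a stand-in for the internal NormalizerBase.
class NormalizationForms {
    // Canonical decomposition followed by canonical composition.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Canonical decomposition.
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Compatibility decomposition.
    static String nfkd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    public static void main(String[] args) {
        String composed = "\u00C1";    // LATIN CAPITAL LETTER A WITH ACUTE
        String decomposed = "A\u0301"; // A + COMBINING ACUTE ACCENT

        // The raw strings differ, but their normalized forms agree.
        System.out.println(composed.equals(decomposed));      // false
        System.out.println(nfc(decomposed).equals(composed)); // true
        System.out.println(nfd(composed).equals(decomposed)); // true

        // The ffi ligature (U+FB03) maps to three letters under NFKD.
        System.out.println(nfkd("\uFB03")); // ffi

        // Dot above (U+0307) and dot below (U+0323) in either order
        // normalize to the same canonically ordered sequence.
        System.out.println(nfd("a\u0307\u0323").equals(nfd("a\u0323\u0307"))); // true
    }
}
```

Note that NFC and NFD leave the ligature untouched; only the compatibility forms (NFKC, NFKD) replace it.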
Form FCD, "Fast C or D", is also designed for collation. It allows an algorithm that works under "canonical closure" (such as collation), i.e., one that treats precomposed characters and their decomposed equivalents the same, to work on strings that are not necessarily normalized.
It is not a normalization form because it does not provide for uniqueness of representation. Multiple strings may be canonically equivalent (their NFDs are identical) and may all conform to FCD without being identical themselves.
The form is defined such that the "raw decomposition", the recursive canonical decomposition of each character, results in a string that is canonically ordered. This means that precomposed characters are allowed as long as their decompositions do not need canonical reordering.
Its advantage for a process like collation is that all NFD and most NFC texts, and many unnormalized texts, already conform to FCD and do not need to be normalized (NFD) for such a process. The FCD quick check will return YES for most strings in practice.
normalize(FCD) may be implemented with NFD.
For more details on FCD see the collation design document:
http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/collation/ICU_collation_design.htm
ICU collation performs either NFD or FCD normalization automatically if normalization is turned on for the collator object. Beyond collation and string search, normalized strings may be useful for string equivalence comparisons, transliteration/transcription, unique representations, etc.
The W3C generally recommends exchanging text in NFC. Note also that most legacy character encodings use only precomposed forms and often do not encode any combining marks by themselves. For conversion to such character encodings, the Unicode text needs to be normalized to NFC.
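To illustrate why NFC matters for legacy encodings, the sketch below converts text to ISO-8859-1 with and without composing it first. It assumes the public java.text.Normalizer API rather than NormalizerBase itself, and the class LegacyEncoding and its toLatin1 method are hypothetical names chosen for this example.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical example class; not part of NormalizerBase.
class LegacyEncoding {
    // Compose to NFC first so that accented letters map to the
    // precomposed code points that ISO-8859-1 can represent.
    static byte[] toLatin1(String text) {
        String composed = Normalizer.normalize(text, Normalizer.Form.NFC);
        return composed.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String decomposed = "A\u0301"; // A + COMBINING ACUTE ACCENT

        // Without normalization the combining mark is unmappable and is
        // replaced by the charset's substitution byte.
        byte[] raw = decomposed.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(raw.length); // 2

        // With NFC the pair becomes U+00C1, a single ISO-8859-1 byte.
        byte[] normalized = toLatin1(decomposed);
        System.out.println(normalized.length); // 1
    }
}
```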
For more usage examples, see the Unicode Standard Annex.
Nested Classes | |
---|---|
NormalizerBase.Mode | Constants for normalization modes. |
NormalizerBase.QuickCheckResult | Result values for quickCheck(). |
Constants | | |
---|---|---|
int | DONE | Constant indicating that the end of the iteration has been reached. |
int | UNICODE_3_2 | Options bit set value to select Unicode 3.2 normalization (except NormalizationCorrections). |
int | UNICODE_3_2_0_ORIGINAL | |
int | UNICODE_LATEST | |
Fields | |
---|---|
MAYBE | Indicates that it cannot be determined whether the string is in the normalized format without further thorough checks. |
NFC | Canonical decomposition followed by canonical composition. |
NFD | Canonical decomposition. |
NFKC | Compatibility decomposition followed by canonical composition. |
NFKD | Compatibility decomposition. |
NO | Indicates that the string is not in the normalized format. |
NONE | No decomposition/composition. |
YES | Indicates that the string is in the normalized format. |
Public Constructors |
---|
Creates a new Normalizer object for iterating over the normalized form of a given string. |
Creates a new Normalizer object for iterating over the normalized form of the given text. |
Creates a new Normalizer object for iterating over the normalized form of the given text. |
Creates a new Normalizer object for iterating over the normalized form of a given string. |
Public Methods |
---|
Clones this Normalizer object. |
Compose a string. |
Return the current character in the normalized text. |
Decompose a string. |
Decompose a string. |
Retrieve the index of the end of the input text. This is the end index of the CharacterIterator or the length of the String over which this Normalizer is iterating. |
This method is deprecated. ICU 2.2. Use startIndex() instead. |
This method is deprecated. ICU 2.2. Use endIndex() instead. |
Retrieve the current iteration position in the input text that is being normalized. |
Return the basic operation performed by this Normalizer. |
Internal API |
Test if a string is in a given normalization form. |
Test if a string is in a given normalization form. |
Return the next character in the normalized text and advance the iteration position by one. |
Normalizes a String using the given normalization form. |
Normalizes a String using the given normalization form. |
Normalize a string. |
Return the previous character in the normalized text and decrement the iteration position by one. |
Reset the index to the beginning of the text. |
This method is deprecated. ICU 3.2 |
Set the iteration position in the input text that is being normalized, without any immediate normalization. |
Set the normalization mode for this object. |
Set the input text over which this Normalizer will iterate. |
Set the input text over which this Normalizer will iterate. |
Constant indicating that the end of the iteration has been reached. This is guaranteed to have the same value as DONE.
Options bit set value to select Unicode 3.2 normalization (except NormalizationCorrections). At most one Unicode version can be selected at a time.
Indicates that it cannot be determined whether the string is in the normalized format without further thorough checks.
Canonical decomposition followed by canonical composition.
Compatibility decomposition followed by canonical composition.
Indicates that the string is not in the normalized format.
Indicates that the string is in the normalized format.
Creates a new Normalizer object for iterating over the normalized form of a given string.
The options parameter specifies which optional Normalizer features are to be enabled for this object.
str | The string to be normalized. The normalization will start at the beginning of the string. |
---|---|
mode | The normalization mode. |
opt | Any optional features to be enabled. Currently the only available option is UNICODE_3_2. If you want the default behavior corresponding to one of the standard Unicode Normalization Forms, use 0 for this argument. |
Creates a new Normalizer object for iterating over the normalized form of the given text.
iter | The input text to be normalized. The normalization will start at the beginning of the string. |
---|---|
mode | The normalization mode. |
Creates a new Normalizer object for iterating over the normalized form of the given text.
iter | The input text to be normalized. The normalization will start at the beginning of the string. |
---|---|
mode | The normalization mode. |
opt | Any optional features to be enabled. Currently the only available option is UNICODE_3_2. If you want the default behavior corresponding to one of the standard Unicode Normalization Forms, use 0 for this argument. |
Creates a new Normalizer object for iterating over the normalized form of a given string.
str | The string to be normalized. The normalization will start at the beginning of the string. |
---|---|
mode | The normalization mode. |
Clones this Normalizer object. All properties of this object are duplicated in the new object, including the cloning of any CharacterIterator that was passed in to the constructor or to setText. However, the text storage underlying the CharacterIterator is not duplicated unless the iterator's clone method does so.
Compose a string. The string will be composed according to the specified mode.
str | The string to compose. |
---|---|
compat | If true, the string will be composed according to NFKC rules; if false, it will be composed according to NFC rules. |
options | The only recognized option is UNICODE_3_2. |
Return the current character in the normalized text.
Decompose a string. The string will be decomposed according to the specified mode.
str | The string to decompose. |
---|---|
compat | If true, the string will be decomposed according to NFKD rules; if false, it will be decomposed according to NFD rules. |
options | The normalization options, ORed together (0 for no options). |
Decompose a string. The string will be decomposed according to the specified mode.
str | The string to decompose. |
---|---|
compat | If true, the string will be decomposed according to NFKD rules; if false, it will be decomposed according to NFD rules. |
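The effect of the compat flag in the compose and decompose methods above can be sketched with the public java.text.Normalizer API. This is only an approximation under stated assumptions: the class ComposeDecompose and its two methods are hypothetical, and the options argument (UNICODE_3_2) is omitted because the public API used here has no counterpart for it.

```java
import java.text.Normalizer;

// Hypothetical approximation of compose/decompose using the public API.
class ComposeDecompose {
    // compat == true selects NFKC, otherwise NFC.
    static String compose(String str, boolean compat) {
        Normalizer.Form form = compat ? Normalizer.Form.NFKC : Normalizer.Form.NFC;
        return Normalizer.normalize(str, form);
    }

    // compat == true selects NFKD, otherwise NFD.
    static String decompose(String str, boolean compat) {
        Normalizer.Form form = compat ? Normalizer.Form.NFKD : Normalizer.Form.NFD;
        return Normalizer.normalize(str, form);
    }

    public static void main(String[] args) {
        // Canonical composition joins A + combining acute into U+00C1.
        System.out.println(compose("A\u0301", false).length()); // 1

        // The ffi ligature is only expanded when compat is true.
        System.out.println(compose("\uFB03", true));            // ffi
        System.out.println(decompose("\uFB03", false).length()); // 1
    }
}
```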
Retrieve the index of the end of the input text. This is the end index of the CharacterIterator or the length of the String over which this Normalizer is iterating.
This method is deprecated. ICU 2.2. Use startIndex() instead.
Retrieve the index of the start of the input text. This is the begin index of the CharacterIterator or the start (i.e. 0) of the String over which this Normalizer is iterating.
This method is deprecated. ICU 2.2. Use endIndex() instead.
Retrieve the index of the end of the input text. This is the end index of the CharacterIterator or the length of the String over which this Normalizer is iterating.
Retrieve the current iteration position in the input text that is being normalized. This method is useful in applications such as searching, where you need to be able to determine the position in the input text that corresponds to a given normalized output character.
Note: This method sets the position in the input, while next() and previous() iterate through characters in the output. This means that there is not necessarily a one-to-one correspondence between characters returned by next and previous and the indices passed to and returned from setIndex and getIndex().
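The lack of a one-to-one correspondence between input indices and output characters can be seen by comparing string lengths before and after normalization. The sketch below assumes the public java.text.Normalizer API instead of the iteration methods documented here; the class IndexMismatch and its normalizedLength method are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical example class showing that normalization changes length.
class IndexMismatch {
    // Number of chars in the normalized output for the given input.
    static int normalizedLength(String input, Normalizer.Form form) {
        return Normalizer.normalize(input, form).length();
    }

    public static void main(String[] args) {
        // One input char (U+00C1) becomes two output chars under NFD.
        System.out.println(normalizedLength("\u00C1", Normalizer.Form.NFD)); // 2

        // Two input chars (A + U+0301) become one output char under NFC.
        System.out.println(normalizedLength("A\u0301", Normalizer.Form.NFC)); // 1

        // One input char (U+FB03) becomes three output chars under NFKD.
        System.out.println(normalizedLength("\uFB03", Normalizer.Form.NFKD)); // 3
    }
}
```

Because of this, an index into the normalized output cannot be used directly as an index into the input text.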
Return the basic operation performed by this Normalizer
Test if a string is in a given normalization form. This is semantically equivalent to source.equals(normalize(source, mode)). Unlike quickCheck(), this function returns a definitive result, never a "maybe". For NFD, NFKD, and FCD, both functions work exactly the same. For NFC and NFKC where quickCheck may return "maybe", this function will perform further tests to arrive at a true/false result.
str | the input string to be checked to see if it is normalized |
---|---|
form | the normalization form |
Test if a string is in a given normalization form. This is semantically equivalent to source.equals(normalize(source, mode)). Unlike quickCheck(), this function returns a definitive result, never a "maybe". For NFD, NFKD, and FCD, both functions work exactly the same. For NFC and NFKC where quickCheck may return "maybe", this function will perform further tests to arrive at a true/false result.
str | the input string to be checked to see if it is normalized |
---|---|
form | the normalization form |
options | the optional features to be enabled. |
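The stated equivalence between the normalization test and source.equals(normalize(source, mode)) can be checked with a short sketch. It assumes the public java.text.Normalizer API, whose isNormalized method takes no options argument; the class NormalizedCheck and its isNormalizedByComparison method are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical example class comparing two ways of testing normalization.
class NormalizedCheck {
    // Reference definition: a string is normalized if normalizing it
    // again leaves it unchanged.
    static boolean isNormalizedByComparison(String source, Normalizer.Form form) {
        return source.equals(Normalizer.normalize(source, form));
    }

    public static void main(String[] args) {
        String composed = "\u00C1";    // already in NFC
        String decomposed = "A\u0301"; // in NFD, not in NFC

        System.out.println(Normalizer.isNormalized(composed, Normalizer.Form.NFC));   // true
        System.out.println(Normalizer.isNormalized(decomposed, Normalizer.Form.NFC)); // false

        // Both approaches give the same definitive answer.
        System.out.println(isNormalizedByComparison(decomposed, Normalizer.Form.NFC)
                == Normalizer.isNormalized(decomposed, Normalizer.Form.NFC)); // true
    }
}
```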
Return the next character in the normalized text and advance the iteration position by one. If the end of the text has already been reached, DONE is returned.
Normalizes a String using the given normalization form.
str | the input string to be normalized. |
---|---|
form | the normalization form |
Normalizes a String using the given normalization form.
str | the input string to be normalized. |
---|---|
form | the normalization form |
options | the optional features to be enabled. |
Normalize a string. The string will be normalized according to the specified normalization mode and options.
src | The char array to normalize. |
---|---|
srcStart | Start index of the source |
srcLimit | Limit index of the source |
dest | The char buffer to fill in |
destStart | Start index of the destination buffer |
destLimit | End index of the destination buffer |
mode | The normalization mode; one of Normalizer.NONE, Normalizer.NFD, Normalizer.NFC, Normalizer.NFKC, Normalizer.NFKD, Normalizer.DEFAULT |
options | The normalization options, ORed together (0 for no options). |
IndexOutOfBoundsException | if the target capacity is less than the required length |
---|
Return the previous character in the normalized text and decrement the iteration position by one. If the beginning of the text has already been reached, DONE is returned.
Reset the index to the beginning of the text. This is equivalent to setIndexOnly(startIndex()).
This method is deprecated.
ICU 3.2
Set the iteration position in the input text that is being normalized and return the first normalized character at that position.
Note: This method sets the position in the input text, while next() and previous() iterate through characters in the normalized output. This means that there is not necessarily a one-to-one correspondence between characters returned by next and previous and the indices passed to and returned from setIndex and getIndex().
index | the desired index in the input text. |
---|
IllegalArgumentException | if the given index is less than getBeginIndex() or greater than getEndIndex(). |
---|
Set the iteration position in the input text that is being normalized, without any immediate normalization. After setIndexOnly(), getIndex() will return the same index that is specified here.
index | the desired index in the input text. |
---|
Set the normalization mode for this object.
Note: If the normalization mode is changed while iterating over a string, calls to next() and previous() may return previously buffered characters in the old normalization mode until the iteration is able to re-sync at the next base character. It is safest to call setText(), first(), last(), etc. after calling setMode.
newMode | the new mode for this Normalizer. The supported modes are: |
---|
Set the input text over which this Normalizer will iterate. The iteration position is set to the beginning of the input text.
newText | The new string to be normalized. |
---|
Set the input text over which this Normalizer will iterate. The iteration position is set to the beginning of the input text.
newText | The new string to be normalized. |
---|