| Version | 17.0 (draft 8) |
| Editors | Mark Davis, Markus Scherer |
| Date | 2026-01-15 |
| This Version | https://www.unicode.org/reports/tr58/tr58-1.html |
| Previous Version | none |
| Latest Version | https://www.unicode.org/reports/tr58/ |
| Latest Proposed Update | https://www.unicode.org/reports/tr58/proposed.html |
| Revision | 1 |
When URLs are stored and exchanged in structured data, the start and end of each URL is clear, and it can be parsed according to the relevant specifications. However, when URLs appear as unmarked strings in text content, detecting their boundaries can be challenging. For example, some characters that are often used as sentence-level punctuation in text, such as parentheses, commas, and periods, can also be valid characters within a URL. Implementations often do not behave intuitively and consistently.
When a URL is inserted into text, non-ASCII characters and “special” characters can be percent-encoded, which can make it easy for a later process to find the start and end of the URL. However, escaping more characters than necessary, especially normal letters, can make the URL illegible for a human reader.
Similar problems exist for email addresses.
This document specifies two consistent, standardized mechanisms that address these problems, consisting of:
- link detection, which determines the boundaries of URLs embedded in plain text, and
- minimal escaping, which formats URLs so that they survive that detection without unnecessary percent-encoding.

The focus is on links with the Schemes http:, https:, and mailto:, as well as links where those Schemes are missing but implied. For these cases, the two mechanisms of detecting and formatting are aligned, so that:
- a minimally escaped URL string between two spaces in flowing text is accurately detected, and
- a detected URL works when pasted into the address bars of major browsers.
This is a draft document which may be updated, replaced, or superseded by other documents at any time. Publication does not imply endorsement by the Unicode Consortium. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.
A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS.
Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in the References. For more information see About Unicode Technical Reports and the Specifications FAQ. Unicode Technical Reports are governed by the Unicode Terms of Use.
Review Note: TBD: Sync ToC entries vs. heading numbers and titles before final.
The standards for URLs and their implementations in browsers generally handle Unicode quite well, permitting people around the world to use their writing systems in those URLs. This is important: in writing their native languages, the majority of humanity uses characters that are not limited to A-Z, and they expect other characters to work equally well. But there are certain ways in which their characters fail to work seamlessly. For example, consider the common practice of providing user handles such as:
The first three of these work well in practice. Copying from the address bar and pasting into text provides a readable result. However, the last example contains non-ASCII characters. In many browsers this turns into an unreadable string:
The names also expand in size and turn into very long strings:
While many people cannot read "महात्मा_गांधी", nobody can read %E0%A4%AE%E0%A4%B9...%E0%A5%80. This unintentional obfuscation also happens with URLs using Latin-script characters:
Very few languages that use Latin-script characters are limited to the ASCII letters A-Z; English is a notable exception. This situation is doubly frustrating for people because the un-obfuscated URLs such as https://www.youtube.com/@핑크퐁 and https://en.wikipedia.org/wiki/Antonín_Dvořák work fine as plain text; you can copy and paste them back into your address bar, they go to the right page, and they display properly in the address bar.
Notes
- This specification uses the term URL broadly, as including unescaped non-ASCII characters; in other words, treating it as matching the formal definition of IRIs. Standardizing on the term “URL” and avoiding the terms “URI” and “IRI” follows the practice promoted by the WHATWG in [URL Standard: Goals].
See also the W3C’s [An Introduction to Multilingual Web Addresses].
- The focus here is on two categories: URLs with special schemes (notably http and https), and email addresses.
- In examples, links will be shown with a background color, to make the extent of the linkification clear.
- UnicodeSet notation is used in this and other Unicode specifications. It is explained in Unicode Sets. There is an effort to formalize UnicodeSet notation more precisely in "Unicode® Technical Standard #61", which currently has Proposed Draft status.
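For orientation, here are a few examples of the UnicodeSet forms that appear later in this document (in Table 3-3 and in the Soft regular expression); the `{…}` form denotes a multi-character string:

```
[?#]                 the set containing the two characters "?" and "#"
[=\&]                the set containing "=" and "&" ("&" is escaped)
[{:~:}]              the set containing the single string ":~:"
\p{Link_Term=Soft}   all characters whose Link_Term value is Soft
```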
Email addresses should also work well for all languages. With most email programs, when someone pastes in the plain text:
and sends to someone else, they receive it as:
URLs are “linkified” in many applications, such as when pasting into a word processor (triggered by typing a space afterwards, for example). However, many products (many text messaging apps, video messaging chats, etc.) completely fail to recognize any non-ASCII characters past the domain name. And even among those that do recognize such non-ASCII characters, there are gratuitous differences in where they stop linkifying.
Linkification is the process of adding links to URLs and email addresses in plain text input, such as in email body text, text messaging, or video meeting chats. The first step in this process is link detection, which is determining the boundaries of spans of text that contain URLs. That substring can then have a link applied to it in output text. The functions that perform these operations are called a link detector and linkifier, respectively.
The specifications for a URL do not address link detection, since they are only concerned with the structure of a URL in isolation, not when it is embedded within flowing text. The lack of a clear specification for link detection also causes many implementations to overuse percent escaping for non-ASCII characters when converting URLs into plain text.
Different implementations linkify URLs and email addresses differently even when they contain only ASCII characters. The differences are even greater when non-ASCII characters are used. Handling letters of all writing systems well is very important for usability. Consider the last example above of a sentence in an email when displayed with a percent-escaped URL:
For example, take the lists of links on [List of articles every Wikipedia should have] in the available languages. When those links are tested with major products, there are significant differences: any two implementations are likely to linkify them differently, such as terminating the linkification at different places, or not linkifying at all. That makes it very difficult to exchange URLs between products within plain text, which happens surprisingly often, and causes problems for implementations that need predictable behavior.
This inconsistency causes problems for users and software companies. Having consistent rules for linkification also has additional benefits, leading to solutions for the following reported problems:
If linkification behavior becomes more predictable across platforms and applications, applications will be able to do minimal escaping. For example, in the following only one character would need escaping, the %29 — representing an unmatched “)”:
Providing a consistent, predictable solution that works well across the world’s languages requires standardized algorithms to define the behavior, and the corresponding Unicode character properties covering all Unicode characters.
Internationalized domain names have strong limitations on their structure. They basically consist of a sequence of labels separated by label separators ("."), where each label consists of a sequence of one or more valid characters. (This is a basic overview: there are some edge cases.) There are some additional syntactic constraints as well. Characters outside of the valid characters and label separators definitely terminate the domain name (either at the start or end). (For more information, see UTS #46, Unicode IDNA Compatibility Processing.)
The start of a URL is also easy to determine when it has a known Scheme (such as “https://”). For domain names, there are structural limitations imposed by ICANN on TLDs (top-level domains, like .fr or .com). For example, a TLD cannot contain digits, hyphens, or CONTEXTJ or CONTEXTO characters, nor can it be shorter than a minimal length (single letters are not allowed for ASCII). Implementations also make use of the fact that there is a list of valid top-level domains; however, that list should not be used unless the implementation regularly and frequently updates its copy. There are other considerations when detecting domain names: consult Security Considerations.
The parsing up to the path, query, or fragment is as specified in [WHATWG URL: 4.4. URL parsing]. Implementations use this information and the structure of domain names to identify the Scheme and Host in link detection, and format to human-readable characters (instead of Punycode!). For example, implementations must not include in link detection a host with a forbidden host code point, or a domain with a forbidden domain code point. Implementations must not linkify if a domain is not a registrable domain. The terms forbidden host code point, forbidden domain code point, and registrable domain are defined in [WHATWG URL: Host representation]. An implementation would parse to the end of each of https://some.example.com, foo.рф, and xn--j1ay.xn--p1ai.
Similarly, quoted email local-parts, such as "Jane Doe"@example.com, are already well specified, and do not need any discussion in this specification.
However, when it comes to the Path, Query, and Fragment, many implementations don't handle them well. It is much less clear to implementers how to handle the many different types of Unicode characters correctly for these Parts of the URL. The same is true of the email local-parts.
Thus this specification currently focuses on the detection and formatting of the Path, Query, and Fragment and of email unquoted local-parts, not on the Scheme or Host, or email quoted local-parts.
UTS58-C1. For a given version of Unicode, a conformant implementation shall replicate the same link detection results as those produced by Section 3, Link Detection Algorithm.
UTS58-C2. For a given version of Unicode, a conformant implementation shall replicate the same minimal escaping results as those produced by Section 4, Minimal Escaping.
UTS58-C3. For a given version of Unicode, a conformant implementation shall replicate the same email link detection results as those produced by Section 5, Email Addresses.
The following table shows the relevant parts of a URL. For clarity, the separator characters are included in the examples. For more information see [WHATWG URL: Example URL Components].
Table 3-1. Parts of a URL
| Scheme | Host (incl. Domain) | Port | Path | Query | Fragment |
|---|---|---|---|---|---|
| https:// | docs.foobar.com | :8000 | /knowledge/area/ | ?name=article&topic=seo | #top |
Notes:

- A Host is usually a domain name, such as example.com.
- The syntax of a URL actually permits a userinfo component, such as username:password@example.com, but its use is deprecated due to security concerns.

There are two main processes involved in Unicode link detection: initiation, which determines where a link starts, and termination, which determines where it ends.
There are two special cases. Both of these introduce some complications in the algorithm, because each of the Parts has its own internal syntax and initial characters, and can be followed by different Parts.
The algorithm is a single-pass algorithm with backup, that is, remembering the latest ‘safe’ point to break, and returning that where necessary. It also has a stack, so that it can determine when a closing bracket matches.
As discussed in Focus, the determination of the start of a URL is outside of the scope of this specification; the focus is on the part of a URL extending after the domain name.
Internationalized domain names have strong limitations on their structure. They basically consist of a sequence of labels separated by ".", where each label consists of a sequence of one or more valid characters. There are some additional syntactic constraints as well. (For more information, see UTS #46, Unicode IDNA Compatibility Processing.) Characters outside of the valid characters definitely terminate the domain name (either at the start or end). The start of a URL is also easy to determine when it has a known Scheme (e.g., “https://”). Implementations can also make use of the fact that the last label must come from the list of valid top-level domains, with structural limitations imposed by ICANN. (This is a brief overview of the topic: there are some edge cases, such as the use of the Ideographic Full Stop, and the optional empty label after the top-level domain label.)
The parsing up to the path, query, or fragment is as specified in [WHATWG URL: 4.4. URL parsing]. Good implementations use this information and the structure of domain names to identify the Scheme and Host in link detection, and format to human-readable characters (instead of Punycode!).
For example, implementations must terminate link detection if a forbidden host code point is encountered, or if the host is a domain and a forbidden domain code point is encountered. Implementations must not linkify if a domain is not a registrable domain. The terms forbidden host code point, forbidden domain code point, and registrable domain are defined in [WHATWG URL: Host representation]. An implementation would parse to the end of each of microsoft.com, google.de, foo.рф, and xn--j1ay.xn--p1ai.
Termination is much more challenging, because of the presence of characters from many different writing systems. While small, hard-coded sets of characters suffice for an ASCII-only implementation, there are over 150,000 Unicode characters, many of which behave quite differently from ASCII characters. While in theory almost any Unicode character can occur in certain URL parts, in practice many characters have very restricted usage in URLs.
Initiation stops at any Path, Query, or Fragment, so the termination process takes over with a “/”, “?”, or “#” character. Each Path, Query, or Fragment can contain most Unicode characters. The key is to be able to determine, given a URL Part (such as a Query), when a sequence of characters should cause termination of the link detection, even though that character would be valid in the URL specification.
It is impossible for a link detection algorithm to match user expectations in all circumstances, given the variation in usage of various characters both within and across languages. So the goal is to cover use cases as broadly as possible, recognizing that it will sometimes not match user expectations in certain cases. Exceptional cases (URLs that need to use characters that would terminate) can still be appropriately linkified if those few characters are represented with % escapes.
At a high level, this specification defines three features:
The focus is on the most common cases.
One of the goals is also predictability; it should be relatively easy for users to understand the link detection behavior at a high level.
Review Note: The names for Link_Termination and Link_Paired_Opener have changed to Link_Term and Link_Bracket. This change is not marked in the text, for ease of reading.
This specification defines three properties. The first two are used in URL link detection and formatting, and the last is used in email link detection and formatting.
The short property names are identical to the long property names.
Link_Term is an enumerated property of characters with five values: {Include, Hard, Soft, Close, Open}. The short property value aliases are the same as the long ones.
Table 3-2. Link_Term Property Values
| Value | Description / Examples |
|---|---|
| Include | There is no stop before the character; it is included in the link. Example: letters |
| Hard | The URL terminates before this character. Example: a space |
| Soft | The URL terminates before this character, if it is followed by a sequence of zero or more characters with the Soft value followed by a Hard value or end of string. That is: /\p{Link_Term=Soft}*(\p{Link_Term=Hard}\|$)/ Example: a question mark |
| Close | If the character is paired with a previous character in the same URL Part (path, query, fragment) and within the same sequence of characters delimited by separators as described in the Termination Algorithm below, it is treated as Include. Otherwise it is treated as Hard. Example: an end parenthesis |
| Open | Used to match Close characters. Example: same as under Close |
Link_Bracket is a string property of characters, which for each character in \p{Link_Term=Close}, returns a character with \p{Link_Term=Open}.
Example: Link_Bracket(")") = "(".
Link_Email is a binary property of characters, indicating the characters that can normally occur in the local-part of an email address, such as σωκράτης@example.com.

Example: each of the characters in σωκράτης has Link_Email=True, so the whole local-part is included in the link.
The specification of the characters with each of these property values is given in Property Assignments.
The termination algorithm assumes that a domain (or other host) has been successfully parsed to the start of a Path, Query, or Fragment, as per the algorithm in [WHATWG URL: 3. Hosts (domains and IP addresses)].
This algorithm then processes each final URL Part [path, query, fragment] of the URL in turn. It stops when it encounters a code point that meets one of the terminating conditions and reports the last location in the current URL Part that is still safely considered inside the link. The common terminating conditions are based on the Link_Term and Link_Bracket properties:
- A Link_Term=Hard character, such as a space.
- A Link_Term=Soft character, such as a ? that is followed by a sequence of zero or more Soft characters, then either a Hard character or the end of the text.
- A Link_Term=Close character, such as a ] that does not have a matching Open character in the same Part of the URL. The matching process uses the Link_Bracket property to determine the correct Open character, and matches against the top element of a stack of Open characters.
More formally:
The termination algorithm begins after the Host (and optionally Port) have been parsed, so there is potentially a Path, Query, or Fragment. In the algorithm below, each Part has three sets of Action strings that affect transitions within and between Parts:
| Sequence Sets | Actions |
|---|---|
| Initiator | Starts the Part |
| Terminator Set | Terminates the Part |
| ClearStackOpen Set | Clears the stack of open brackets within the Part |
Here are the sets of zero or more strings in each Sequence Set for each Part.
Table 3-3. Link Termination by URL Part
| Part | Initiator | Terminator set | ClearStackOpen set |
|---|---|---|---|
| path | '/' | [?#] | [/] |
| query | '?' | [#] | [=\&] |
| fragment | '#' | [{:~:}] | [] |
| fragment directive | :~: | [] | [\&,{:~:}] |
Fragment directives: the only fragment directive currently defined is the text directive, as in https://example.com#:~:text=foo&text=bar. See [URL Fragment Text Directives].
Review Note: In a fragment directive, the comma and ampersand are separators, and thus cause the stack of open brackets to be cleared.
The dash '-' is an affix to the comma, rather than a separator, as the following syntax shows:
#:~:text=[prefix-,]start[,end][,-suffix]
In the following:

- The output is the end of the link (link_end), where the link_start is determined outside of this algorithm as described above (before the Scheme, if any, and otherwise before the Host).
- cp[i] refers to the ith code point in the string being parsed, cp[start] is the first code point being considered, and n is the length of the string.
- A stack of Open characters (openStack) is used for matching brackets. A limit is required for security; the value 125 is chosen deliberately to far exceed any reasonable number of paired brackets.

1. Set lastSafe = start. This marks the offset after the last code point that is included in the link detection (so far).
2. Set part = none.
3. Clear the openStack.
4. Loop from i = start to n - 1:
   1. If part ≠ none and one of the part.terminators matches at i, set previousPart = part and part = none.
   2. If part == none, then try to match one of the URL Part initiators at i:
      1. If none of the initiators match, then stop and return lastSafe.
      2. Set part according to which URL Part’s initiator matches.
      3. If part is a Fragment Directive and previousPart is neither a Fragment nor a Fragment Directive, then stop and return lastSafe.
      4. Set i to just after the matched part.initiator, set lastSafe = i, clear the openStack, and continue the loop.
   3. Otherwise, if one of the part.clearStackOpen elements matches at i, set i to just after the matched part.clearStackOpen element, set lastSafe = i, clear the openStack, and continue the loop.
   4. Otherwise, set LT = Link_Term(cp[i]):
      1. If LT == Include, set lastSafe = i + 1.
      2. If LT == Soft, do nothing (lastSafe is not advanced).
      3. If LT == Hard, stop and return lastSafe.
      4. If LT == Open:
         1. If openStack.length() == 125, then stop and return lastSafe.
         2. Push cp[i] onto openStack.
         3. Set lastSafe = i + 1.
      5. If LT == Close:
         1. If openStack.isEmpty(), then stop and return lastSafe.
         2. Set lastOpen = openStack.pop().
         3. If Link_Bracket(cp[i]) == lastOpen, set lastSafe = i + 1; otherwise stop and return lastSafe.
5. Set link_limit to lastSafe and return.

As usual, any algorithm that produces the same results is conformant. Such algorithms can be optimized in various ways, and adapted to be single-pass.
For ease of understanding, this algorithm does not include all features of URL parsing. In implementations, the algorithm can be optimized in various ways, of course, as long as the results are the same.
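To make the control flow concrete, here is a non-normative Python sketch of the algorithm above. The link_term function and LINK_BRACKET table are toy stand-ins covering only a handful of ASCII characters; a real implementation would load the actual Link_Term and Link_Bracket data from the files listed in Property Assignments.

```python
# Non-normative sketch of the Termination Algorithm.
from typing import Optional

INCLUDE, HARD, SOFT, OPEN, CLOSE = "Include", "Hard", "Soft", "Open", "Close"

def link_term(ch: str) -> str:
    """Toy Link_Term lookup (illustrative values only)."""
    if ch.isspace():
        return HARD
    if ch in ".,?!;:'\"":
        return SOFT
    if ch in "([{<":
        return OPEN
    if ch in ")]}>":
        return CLOSE
    return INCLUDE

LINK_BRACKET = {")": "(", "]": "[", "}": "{", ">": "<"}  # toy Link_Bracket

# Table 3-3: (initiator, terminator set, clearStackOpen set) per URL Part.
PARTS = {
    "path":               ("/",   ("?", "#"), ("/",)),
    "query":              ("?",   ("#",),     ("=", "&")),
    "fragment":           ("#",   (":~:",),   ()),
    "fragment directive": (":~:", (),         ("&", ",", ":~:")),
}

MAX_OPEN = 125  # security limit on the stack of open brackets

def _match_any(text: str, i: int, strings) -> Optional[str]:
    """Return the first string that occurs in text at offset i, if any."""
    for s in strings:
        if text.startswith(s, i):
            return s
    return None

def _match_initiator(text: str, i: int) -> Optional[str]:
    # Longest initiator first, so ":~:" is preferred over single characters.
    for name in ("fragment directive", "path", "query", "fragment"):
        if text.startswith(PARTS[name][0], i):
            return name
    return None

def terminate(text: str, start: int) -> int:
    """Return link_limit: the offset just past the detected link, where
    `start` is the offset of the first "/", "?", or "#" after the Host."""
    last_safe = start
    part: Optional[str] = None
    prev_part: Optional[str] = None
    open_stack: list[str] = []
    i = start
    while i < len(text):
        if part is not None and _match_any(text, i, PARTS[part][1]):
            prev_part, part = part, None      # part terminator reached
        if part is None:
            part = _match_initiator(text, i)
            if part is None:
                return last_safe              # no further URL Part
            if part == "fragment directive" and prev_part not in (
                    "fragment", "fragment directive"):
                return last_safe              # ":~:" only valid in a fragment
            i += len(PARTS[part][0])
            last_safe = i
            open_stack.clear()
            continue
        cleared = _match_any(text, i, PARTS[part][2])
        if cleared:                           # separator, e.g. "=" in a query
            i += len(cleared)
            last_safe = i
            open_stack.clear()
            continue
        lt = link_term(text[i])
        if lt == INCLUDE:
            last_safe = i + 1
        elif lt == HARD:
            return last_safe
        elif lt == OPEN:
            if len(open_stack) == MAX_OPEN:
                return last_safe
            open_stack.append(text[i])
            last_safe = i + 1
        elif lt == CLOSE:
            if not open_stack or LINK_BRACKET.get(text[i]) != open_stack.pop():
                return last_safe              # unmatched Close
            last_safe = i + 1
        # SOFT: keep scanning without advancing last_safe
        i += 1
    return last_safe
```

With these toy properties, terminate("example.com/a(b)c! next", 11) returns 17, so the link covers "example.com/a(b)c": the "!" is Soft and followed by a Hard space, so it is excluded.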
The goal is to be able to generate a serialized form of a URL that:
Note that if not isolated (not bounded by start/end of string or Hard characters), the linkification may extend beyond the bounds of the serialized form. For example, the URL would fail to linkify correctly if pasted between the two X's in "See XX for more information.", resulting in “See Xabc.com/path1./path2%2EX for more information”.
The minimal escaping algorithm is parallel to the linkification algorithm. Basically, when serializing a URL, a character in a Path, Query, or Fragment is only percent-escaped if it is Hard, is an unmatched Close, is a trailing Soft at the end of the URL, or is part of a terminator of the enclosing URL Part.
This algorithm only handles the formatting of the Path, Query, and Fragment URL Parts. Formatting of the Scheme, Host, and Port should be done as is customary for those URL Parts. For the Host (domain name), see also UTS #46: Unicode IDNA Compatibility Processing and its ToUnicode operation.
In the following:

- cp[i] refers to the ith code point in the URL part being serialized, cp[0] is the first code point in the part, and n is the number of code points.
- Literal occurrences of the strings in part.terminators and part.clearStackOpen must be percent-escaped, to prevent them from being interpreted as syntax.

1. Set output = "".
2. For each part in any non-empty Path, Query, Fragment, successively:
   1. Output: part.initiator.
   2. Set copiedAlready = 0.
   3. Clear the openStack.
   4. Loop from i = 0 to n - 1:
      1. If one of the part.terminators matches at i, set LT = Hard; otherwise set LT = Link_Term(cp[i]).
      2. If one of the part.clearStackOpen elements matches at i, clear the openStack.
      3. If LT == Include:
         1. Output: any code points between copiedAlready (inclusive) and i (exclusive).
         2. Output: cp[i].
         3. Set copiedAlready = i + 1.
      4. If LT == Hard:
         1. Output: any code points between copiedAlready (inclusive) and i (exclusive).
         2. Output: percentEscape(cp[i]).
         3. Set copiedAlready = i + 1.
      5. If LT == Soft, do nothing; a trailing Soft character is handled after the loop.
      6. If LT == Open:
         1. If openStack.length() == 125, then do the same as LT == Hard.
         2. Otherwise, push cp[i] onto openStack and do the same as LT == Include.
      7. If LT == Close:
         1. Set lastOpen = openStack.pop(), or 0 if the openStack is empty.
         2. If Link_Bracket(cp[i]) == lastOpen, do the same as LT == Include; otherwise do the same as LT == Hard.
   5. If part is not the last part, output: all code points between copiedAlready (inclusive) and n (exclusive).
   6. Otherwise, if copiedAlready < n:
      1. Output: all code points between copiedAlready (inclusive) and n - 1 (exclusive).
      2. Output: percentEscape(cp[n - 1]).

As usual, any algorithm that produces the same results is conformant. Such algorithms can be optimized in various ways, and adapted to be single-pass.
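A matching non-normative Python sketch, building directly on the toy link_term, LINK_BRACKET, PARTS, MAX_OPEN, and _match_any definitions from the detection sketch in Section 3 (and sharing their assumptions):

```python
def percent_escape(ch: str) -> str:
    """Percent-escape one code point as its UTF-8 byte sequence."""
    return "".join(f"%{b:02X}" for b in ch.encode("utf-8"))

def escape_part(part_name: str, text: str, is_last: bool) -> str:
    """Serialize one URL Part with minimal escaping."""
    initiator, terminators, clear_stack_open = PARTS[part_name]
    out = [initiator]
    copied = 0                        # copiedAlready in the spec text
    open_stack: list[str] = []
    n = len(text)
    for i in range(n):
        # A literal occurrence of the part's terminator must be escaped.
        lt = HARD if _match_any(text, i, terminators) else link_term(text[i])
        if _match_any(text, i, clear_stack_open):
            open_stack.clear()
        if lt == OPEN:
            if len(open_stack) == MAX_OPEN:
                lt = HARD             # stack limit reached: treat as Hard
            else:
                open_stack.append(text[i])
                lt = INCLUDE
        elif lt == CLOSE:
            last_open = open_stack.pop() if open_stack else None
            matched = LINK_BRACKET.get(text[i]) == last_open
            lt = INCLUDE if matched else HARD   # unmatched Close is escaped
        if lt == INCLUDE:
            out.append(text[copied:i] + text[i])
            copied = i + 1
        elif lt == HARD:
            out.append(text[copied:i] + percent_escape(text[i]))
            copied = i + 1
        # SOFT: leave pending; only a trailing Soft in the last part escapes
    if not is_last:
        out.append(text[copied:])
    elif copied < n:
        out.append(text[copied:n - 1] + percent_escape(text[n - 1]))
    return "".join(out)
```

For instance, escape_part("path", "ab)c", is_last=True) yields "/ab%29c" under the toy properties, escaping the unmatched ")", while escape_part("path", "a(b.", is_last=True) yields "/a(b%2E", escaping the trailing Soft "." so that detection does not drop it.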
Additional characters can be escaped to reduce confusability, especially when they are confusable with URL syntax characters, such as a Ɂ character in a path. See Security Considerations below.
Email address link detection applies similar principles. An email address is of the form local-part@domain-name.
The local-part can include unusual characters by quoting: enclosing it in "…", and using backslash to escape those characters.
For example, "john\ doe"@example.com contains an escaped space.
The quoted local-part format is easy to implement if desired, but it is very rarely supported in practice, so it is out of scope for this specification.
The algorithm is invoked whenever an '@' character is encountered at index n, and another process has determined that the '@' sign is followed by a valid domain name. The algorithm scans backward from the '@' sign to find the start of the local-part, terminating before the first character that cannot be part of it. If there is a "mailto:" before the local-part, then that is also included.
The only complications are introduced by the requirement in the specifications that the local-part cannot start or end with a ".", nor contain "..". For details of the format, see [RFC6530].
The algorithm uses the property Link_Email to scan backwards, as follows.
In the following:

- The output is the start of the link (link_start), where the link_end is determined outside of this algorithm as described above (after the last character of the domain name).
- cp[i] refers to the ith code point in the string, and n is the offset of the '@' character.

1. If n == 0, fail to match.
2. If cp[n - 1] == '.', fail to match.
3. Set start = 0.
4. Loop from i = n - 1 down to 0:
   1. If cp[i] == '.' and cp[i + 1] == '.', fail to match.
   2. Otherwise, if cp[i] == '.', continue the loop.
   3. Otherwise, if cp[i] is not in Link_Email, set start = i + 1 and terminate scanning.
5. If cp[start] == '.', fail to match.
6. If start == n, fail to match.
7. If the seven code points before start are "mailto:", set start = start - 7.
8. Set link_start to start and return.

As usual, any algorithm that produces the same results is conformant. Such algorithms can be optimized in various ways, and adapted to be single-pass.
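A non-normative Python sketch of this backward scan. The link_email function is a rough stand-in for the Link_Email property: RFC 5322 atext in the ASCII range, and an identifier-like approximation (via str.isidentifier) outside it.

```python
from typing import Optional

ATEXT_SYMBOLS = set("!#$%&'*+-/=?^_`{|}~")  # RFC 5322 atext symbols

def link_email(ch: str) -> bool:
    """Toy Link_Email lookup (illustrative values only)."""
    if ord(ch) < 0x80:
        return ch.isalnum() or ch in ATEXT_SYMBOLS
    return ("a" + ch).isidentifier()  # rough proxy for UAX #31 continuation

def scan_local_part(text: str, at: int) -> Optional[int]:
    """Scan backward from the '@' at offset `at` (whose domain has already
    been validated); return link_start, or None if no valid local-part."""
    n = at
    if n == 0:
        return None                   # nothing before the '@'
    if text[n - 1] == ".":
        return None                   # local-part cannot end with "."
    start = 0
    for i in range(n - 1, -1, -1):
        if text[i] == ".":
            if text[i + 1] == ".":
                return None           # ".." is not permitted
        elif not link_email(text[i]):
            start = i + 1             # first offset inside the local-part
            break
    if start == n:
        return None                   # empty local-part
    if text[start] == ".":
        return None                   # local-part cannot start with "."
    if start >= 7 and text[start - 7:start] == "mailto:":
        start -= 7                    # include a preceding "mailto:"
    return start
```

For example, scan_local_part("See abcd@example.com", 8) returns 4 (the offset of "a"), while inputs ending in "john.@", containing "..", or starting with "." all return None, matching Table 5-1 below.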
A quoted local-part may include a broad range of Unicode characters. See [RFC6530]. For linkification, the values accepted in a quoted local-part, while broader than in an unquoted local-part, are more restrictive than the specifications allow, to prevent accidentally linkifying more text than intended, especially since those code points are unlikely to be handled by mail servers in any event.
Table 5-1. Email Address Link Detection Examples

| Example | Comment |
|---|---|
| See abcd@example.com | Stop backing up when a space is hit |
| See x.abcd@example.com | Include the medial dot. |
| See アルベルト.アルベルト@example.com | Handle non-ASCII |
| See @example.😎 | No valid domain name |
| See @example.com | No local-part |
| See john.@example.com | No valid local-part |
| See john..doe@example.com | No valid local-part |
| See .john.doe@example.com | No valid local-part |
Review Note: The algorithm causes linkification to fail where the dots are illegal, as in the last three examples. For the last two cases, instead of failing, the linkification could stop just before the problematic dots, as in: "john..doe@example.com" and ".john.doe@example.com". That approach is more error-prone, but could be supported with a customized algorithm.
The Minimal quoting algorithm for email addresses is trivial, given that the quoted forms are not supported.
The assignments of Link_Term and Link_Bracket property values are defined by the following files:
The initial property assignments are based on the following descriptions. However, their values may deviate from these descriptions in future versions. See Stability. Note that most characters that cause link termination are still valid, but require % encoding.
Whitespace, non-characters, deprecated characters, controls, private-use, surrogates, unassigned,...
Termination characters and ambiguous quotation marks:
if Bidi_Paired_Bracket_Type(cp) == Open then Link_Term(cp) = Open
else if Bidi_Paired_Bracket_Type(cp) == Close then Link_Term(cp) = Close
else if cp == "<" then Link_Term(cp) = Open
else if cp == ">" then Link_Term(cp) = Close
All other code points
if Bidi_Paired_Bracket_Type(cp) == Close then Link_Bracket(cp) = Bidi_Paired_Bracket(cp)
else if cp == ">" then Link_Bracket(cp) = "<"
else Link_Bracket(cp) = <none>
Only characters with Link_Term=Close have a Link_Bracket mapping.
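The bracket portion of this derivation can be read directly off BidiBrackets.txt in the UCD, which is the source of Bidi_Paired_Bracket and Bidi_Paired_Bracket_Type. A non-normative sketch (the file path is up to the caller; the published property files already contain the derived result):

```python
def parse_bidi_brackets(path: str):
    """Yield (code_point, paired_code_point, type) from BidiBrackets.txt,
    where type is 'o' (Open) or 'c' (Close)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.split("#", 1)[0].strip()  # drop comments
            if not line:
                continue
            cp, paired, kind = (field.strip() for field in line.split(";"))
            yield int(cp, 16), int(paired, 16), kind

def derive_bracket_properties(path: str):
    link_term = {}     # code point -> "Open" | "Close"
    link_bracket = {}  # Close code point -> matching Open code point
    for cp, paired, kind in parse_bidi_brackets(path):
        if kind == "o":
            link_term[cp] = "Open"
        elif kind == "c":
            link_term[cp] = "Close"
            link_bracket[cp] = paired  # Bidi_Paired_Bracket(cp)
    # "<" and ">" are not bidi paired brackets, but are paired here.
    link_term[ord("<")] = "Open"
    link_term[ord(">")] = "Close"
    link_bracket[ord(">")] = ord("<")
    return link_term, link_bracket
```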
Review Note: For comparison to the related General_Category values, see the characters in:
In the ASCII range, the characters are as specified for ASCII, as per RFC 5322, Section 3.2.3. That is:
Outside of the ASCII range, the characters follow UAX #31 identifiers. That is:

The reasons for this are that non-ASCII characters in the local-part are less commonly supported at this point, and that mail servers which support more than ASCII in local-parts are likely to impose restrictions similar to those for programming identifiers.
Implementations could also customize the set, and it can be broadened in the future.
Review Note: We could have other exclusions to start with, such as only NFC characters; or only Identifier_Status=Allowed from UTS #39 Unicode Security Mechanisms?
The following test files supply data for testing conformance to this specification. The format of each test is explained in the header of the test.
The test files are not applicable to results that are modified by a higher-level algorithm, as discussed in Security Considerations.
Linkification in plain text is a service to users, and the end goal is to make it as useful as possible. It is a balancing act, because linkifying every substring in plain text that is a syntactically valid domain would both be a bad user experience (e.g., M.Sc.) and introduce security concerns.
The security considerations for Path, Query, and Fragment are far less important than for Domain names. See UTS #39: Unicode Security for more information about domain names.
A conformant implementation can have a fast low-level algorithm that simply finds all syntactically valid link opportunities — matching this specification — but then at a higher level apply some additional security checks. The result of such checks could be to reject particular link detection results entirely, or alter the bounds of the link resulting from the link detection.
For example, a higher level implementation could reject detection for the following:
Beyond just security considerations, usability is also a factor:
an implementation might refrain from linkifying helpers.py when the context is a discussion of Python programming.
A higher level implementation could also adjust the boundaries from link detection, as in the following example:
Note that simply forcing characters to be percent-escaped in link formatting doesn't generally solve any problems; if anything, percent-escaping obfuscates characters even more than showing their regular appearance to users.
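A minimal sketch of this layered architecture; the detector and the individual checks are hypothetical placeholders, not part of this specification:

```python
from typing import Callable, Iterable, Optional

Span = tuple[int, int]          # (start, end) offsets of a detected link
Check = Callable[[str, Span], Optional[Span]]  # may reject or adjust a span

def detect_with_policy(text: str,
                       detect: Callable[[str], Iterable[Span]],
                       checks: list[Check]) -> list[Span]:
    """Run a conformant low-level detector, then apply higher-level checks:
    each check may return the span unchanged, return an adjusted span, or
    return None to reject the link entirely."""
    result = []
    for span in detect(text):
        for check in checks:
            span = check(text, span)
            if span is None:
                break
        if span is not None:
            result.append(span)
    return result
```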
NOTE: the following seems misplaced; if anything, a longer discussion of this should be in UTS #39 Unicode Security Mechanisms. There are documented cases of how Format characters can be used to sneak malicious instructions into LLMs; see Invisible text that AI chatbots understand and humans can’t?. URLs are just a small aspect of the larger problem of feeding clean text to LLMs, both in building them and in querying them: making sure the text does not have malformed encodings, is in a consistent Unicode Normalization Form (NFC), and so on.
For security implications of URLs in general, see UTS #39: Unicode Security Mechanisms. For related issues, see UTS #55 Unicode Source Code Handling. For display of BIDI URLs, see also HL4 in UAX #9, Unicode Bidirectional Algorithm.
As with other Unicode Properties, the algorithms and property derivations may be changed in successive versions to adapt to new information and feedback from developers and end users.
The practical impact is very limited, such as when a character is not escaped on the formatting system but terminates the link on the detecting system.
An implementation may wish to just make minimal modifications to its use of existing URL link detection and formatting code. For example, it may use imported libraries for these services. The following provides some examples as to how that can be done.
The implementation may call its existing code library for link detection, but then post-process. Using such post-processing can retain the existing performance and feature characteristics of the code library, including the recognition of the Scheme and Host, and then refine the results for the Path, Query, and Fragment. The typical problem is that the code library terminates too early. For code libraries that 'mostly' handle non-ASCII characters this will be a fraction of the detected links.
initiator of a Path, Query, or Fragment URL Part.

For formatting, the implementation calls its existing code library for the Scheme and Host. It then invokes code implementing the Minimal Escaping algorithm for the Path, Query, and Fragment.
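For the detection case, a non-normative sketch of such post-processing, assuming a hypothetical legacy_detect library call and reusing the terminate sketch from Section 3:

```python
from typing import Callable, Iterable

def detect_with_legacy(text: str,
                       legacy_detect: Callable[[str], Iterable[tuple[int, int]]]
                       ) -> list[tuple[int, int]]:
    """Let the existing library find the Scheme and Host, then re-run this
    specification's termination algorithm from the first Path, Query, or
    Fragment initiator, since legacy libraries often terminate too early."""
    spans = []
    for start, end in legacy_detect(text):
        scheme = text.find("://", start, end)      # skip any "scheme://"
        search_from = scheme + 3 if scheme != -1 else start
        for j in range(search_from, end):
            if text[j] in "/?#":
                end = terminate(text, j)           # Section 3 sketch
                break
        spans.append((start, end))
    return spans
```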
Review Note: TBD, put the references into the standard format.
[RFC6530]
[URL Fragment Text Directives]
[WHATWG URL: 3. Hosts (domains and IP addresses)]
[WHATWG URL: 4.4. URL parsing]
[WHATWG URL: Example URL Components]
[WHATWG URL: Host representation]
Thanks to the following people for their contributions and/or feedback on this document: Arnt Gulbrandsen, Dennis Tan, Elika Etemad, Hayato Ito, Jules Bertholet, Markus Scherer, Mathias Bynens, Peter Constable, Robin Leroy, [TBD flesh out further]
The following summarizes modifications from the previous revision of this document.
Modifications for previous versions are listed in those respective versions.
© 2024–2026 Unicode, Inc. This publication is protected by copyright, and permission must be obtained from Unicode, Inc. prior to any reproduction, modification, or other use not permitted by the Terms of Use. Specifically, you may make copies of this publication and may annotate and translate it solely for personal or internal business purposes and not for public distribution, provided that any such permitted copies and modifications fully reproduce all copyright and other legal notices contained in the original. You may not make copies of or modifications to this publication for public distribution, or incorporate it in whole or in part into any product or publication without the express written permission of Unicode.
Use of all Unicode Products, including this publication, is governed by the Unicode Terms of Use. The authors, contributors, and publishers have taken care in the preparation of this publication, but make no express or implied representation or warranty of any kind and assume no responsibility or liability for errors or omissions or for consequential or incidental damages that may arise therefrom. This publication is provided “AS-IS” without charge as a convenience to users.
Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries.