diff options
| author | Barry Warsaw | 2017-05-25 01:04:33 +0000 |
|---|---|---|
| committer | Barry Warsaw | 2017-05-25 01:04:33 +0000 |
| commit | 6fb8bbbec1f8cc86e91002402b365bd1753d087a (patch) | |
| tree | 16ca60dc01fb6acb94e0ac981ec97e9c3b45e9ea /src | |
| parent | 816b8096243b962086f7f2ce71c610f9aa053978 (diff) | |
| parent | 674276a3d0404544bd5da74c8a2f2daa8174af40 (diff) | |
| download | mailman-6fb8bbbec1f8cc86e91002402b365bd1753d087a.tar.gz mailman-6fb8bbbec1f8cc86e91002402b365bd1753d087a.tar.zst mailman-6fb8bbbec1f8cc86e91002402b365bd1753d087a.zip | |
Diffstat (limited to 'src')
| -rw-r--r-- | src/mailman/docs/internationalization.rst | 123 | ||||
| -rw-r--r-- | src/mailman/rest/docs/basic.rst | 4 |
2 files changed, 127 insertions, 0 deletions
diff --git a/src/mailman/docs/internationalization.rst b/src/mailman/docs/internationalization.rst new file mode 100644 index 000000000..da61153fb --- /dev/null +++ b/src/mailman/docs/internationalization.rst @@ -0,0 +1,123 @@ +.. _internationalization: + +================================ + Mailman 3 Internationalization +================================ + +Mailman does not yet support IDNA (internationalized domain names, RFC +5890) or internationalized mailboxes (RFC 6531) in email addresses. +But *display names* and *descriptions* are fully internationalized in +Mailman, using Unicode. Email content is handled by the Python email +package, which provides robust handling of internationalized content +conforming to the MIME standard (RFCs 2045-2049 and others). + +The encoding of URI components addressing a REST endpoint is Unicode +UTF-8. Mailman does not currently handle normalization, and we +recommend consistently using normal form NFC. (For some languages +NFKC is risky, as some users' personal names may be corrupted by this +normalization.) Mailman does not check for confusables or check +repertoire. + + +Introduction to Unicode Concepts +================================ + +The Unicode Standard is intended to provide an universal set of +characters with a single, standard encoding providing an invertible +mapping of characters to integers (called *code points* in this +context). + + +Repertoires +----------- + +A set of characters is called a *repertoire*. Unicode itself is +intended to provide an universal repertoire sufficient to represent +all words in all written languages, but a system may handle a +restricted repertoire and still be considered conformant, as long as +it does not corrupt characters it does not handle, and does not emit +non-character code points. + + +Convertibility +-------------- + +Unicode is intended to provide a character for each character defined +in a national character set standard. This is often controversial: +Chinese characters are often *unified* with Japanese characters that +appear somewhat different when displayed, while the Cyrillic and Greek +equivalents of the Latin character "A" are treated as separate +characters despite being pronounced the same way and being displayed +as identical glyphs. These judgments are informed by the notion that +a text should *round-trip*. That is, when a text is converted from +Unicode to another encoding, and then back to Unicode, the result +should be identical to the source text. + + +Normalization +------------- + +For several reasons, Unicode provides for construction of characters +by appending *composable characters* (such as accents) to *base +characters* (typically letters). But since most languages assign a +code point to each accented letter, the "round-tripping" requirement +described above implies that Unicode should provide a code point for +that accented letter, called a precomposed character. This means that +for most accented characters, there are two or more ways to represent +them, using various combinations of base characters, precomposed +characters, and composable characters. + +There are also a number of cases where equivalent characters have +different code points (in a few extreme cases, the same character has +different code points because the original national standard had +duplicates). These cases are called *compatibility* characters. + +The Unicode Standard requires that the compose character sequence be +treated identically to the precomposed (single) character by all +text-processing algorithms. For convenience in matching, an +application may choose to *normalize* texts. There are two +normalizations. The *NFC* normal form requires that all compositions +to precomposed characters that can be done should be done. It has the +advantage that the length of a word in characters is the number of +code points in the word. The *NFD* normal form requires that all +precomposed characters be decomposed into a sequence of a base +character followed by composable characters. It useful in contexts +where fuzzy matches (*i.e.*, ignoring accents) are desired. + +Finally, in each of these two forms a compatibility character may be +replaced by its *canonical equivalent*, denoted *NFKC* and *NFKD*, +respectively. + + +Using Unicode in Mailman +------------------------ + +In most cases in Mailman it is highly recommended that input be +encoded as UTF-8 in NFC format. Although highly conformant systems +are becoming more common, there are still many systems that assume +that one code point is translated to one glyph on display. On such +systems NFC will provide a smoother user experience than NFD. Since +much of the text data that Mailman handles is user names, and users +frequently strongly prefer a particular compatibility character to its +canonical equivalent, NFKC (or NFKD) should be avoided. + +There are two other considerations in using Unicode in Mailman. The +first is the problem of confusables. *Confusables* are characters +which are considered different but whose glyphs are indistinguishable, +such as Latin capital letter A and Greek capital letter Alpha. +Similarly, many code points in Unicode are not yet assigned +characters, or even defined as non-characters, and thus are not part +of the repertoire of characters represented by Unicode. + +Mailman makes no attempt to detect inappropriate use of confusables or +non-characters (for example, to redirect users to a domain +disseminating malware). The risks at present are vanishingly small +because the necessary support in the mail system itself is not yet +widespread, but this is likely to change in the near future. + + +Localization +============ + +We have it! We just don't have proper documentation here yet. + diff --git a/src/mailman/rest/docs/basic.rst b/src/mailman/rest/docs/basic.rst index 1f8084ecd..24b919bb2 100644 --- a/src/mailman/rest/docs/basic.rst +++ b/src/mailman/rest/docs/basic.rst @@ -2,6 +2,10 @@ Basic operation ================= +The encoding of URI components addressing a REST endpoint is Unicode +UTF-8. There is :ref:`more information about internationalization in +Mailman <internationalization>`. + In order to do anything with the REST API, you need to know its `Basic AUTH`_ credentials, and the version of the API you wish to speak to. |
