Merge branch 'doc-UTF8' into 'master'

Document Unicode usage in Mailman 3 REST API and email addresses. See merge request !274
author: Barry Warsaw 2017-05-25 01:04:33 +0000
committer: Barry Warsaw 2017-05-25 01:04:33 +0000
commit: 6fb8bbbec1f8cc86e91002402b365bd1753d087a (patch)
tree: 16ca60dc01fb6acb94e0ac981ec97e9c3b45e9ea /src
parent: 816b8096243b962086f7f2ce71c610f9aa053978 (diff)
parent: 674276a3d0404544bd5da74c8a2f2daa8174af40 (diff)
download: mailman-6fb8bbbec1f8cc86e91002402b365bd1753d087a.tar.gz
mailman-6fb8bbbec1f8cc86e91002402b365bd1753d087a.tar.zst
mailman-6fb8bbbec1f8cc86e91002402b365bd1753d087a.zip
2 files changed, 127 insertions, 0 deletions
diff --git a/src/mailman/docs/internationalization.rst b/src/mailman/docs/internationalization.rst
new file mode 100644
index 000000000..da61153fb
--- /dev/null
+++ b/src/mailman/docs/internationalization.rst
@@ -0,0 +1,123 @@
+.. _internationalization:
+
+================================
+ Mailman 3 Internationalization
+================================
+
+Mailman does not yet support IDNA (internationalized domain names, RFC
+5890) or internationalized mailboxes (RFC 6531) in email addresses.
+But *display names* and *descriptions* are fully internationalized in
+Mailman, using Unicode.  Email content is handled by the Python email
+package, which provides robust handling of internationalized content
+conforming to the MIME standard (RFCs 2045-2049 and others).
+
+The encoding of URI components addressing a REST endpoint is Unicode
+UTF-8.  Mailman does not currently handle normalization, and we
+recommend consistently using normal form NFC.  (For some languages
+NFKC is risky, as some users' personal names may be corrupted by this
+normalization.)  Mailman does not check for confusables or check
+repertoire.
+
+
+Introduction to Unicode Concepts
+================================
+
+The Unicode Standard is intended to provide a universal set of
+characters with a single, standard encoding providing an invertible
+mapping of characters to integers (called *code points* in this
+context).
+
+
+Repertoires
+-----------
+
+A set of characters is called a *repertoire*.  Unicode itself is
+intended to provide a universal repertoire sufficient to represent
+all words in all written languages, but a system may handle a
+restricted repertoire and still be considered conformant, as long as
+it does not corrupt characters it does not handle, and does not emit
+non-character code points.
+
+
+Convertibility
+--------------
+
+Unicode is intended to provide a character for each character defined
+in a national character set standard.  This is often controversial:
+Chinese characters are often *unified* with Japanese characters that
+appear somewhat different when displayed, while the Cyrillic and Greek
+equivalents of the Latin character "A" are treated as separate
+characters despite being pronounced the same way and being displayed
+as identical glyphs.  These judgments are informed by the notion that
+a text should *round-trip*.  That is, when a text is converted from
+Unicode to another encoding, and then back to Unicode, the result
+should be identical to the source text.
+
+
+Normalization
+-------------
+
+For several reasons, Unicode provides for construction of characters
+by appending *composable characters* (such as accents) to *base
+characters* (typically letters).  But since most languages assign a
+code point to each accented letter, the "round-tripping" requirement
+described above implies that Unicode should provide a code point for
+that accented letter, called a precomposed character.  This means that
+for most accented characters, there are two or more ways to represent
+them, using various combinations of base characters, precomposed
+characters, and composable characters.
+
+There are also a number of cases where equivalent characters have
+different code points (in a few extreme cases, the same character has
+different code points because the original national standard had
+duplicates).  These cases are called *compatibility* characters.
+
+The Unicode Standard requires that the composed character sequence be
+treated identically to the precomposed (single) character by all
+text-processing algorithms.  For convenience in matching, an
+application may choose to *normalize* texts.  There are two basic
+normal forms.  The *NFC* normal form requires that all compositions to
+precomposed characters that can be done should be done.  It has the
+advantage that, for most text, the length of a word in characters is the
+number of code points in the word.  The *NFD* normal form requires that all
+precomposed characters be decomposed into a sequence of a base
+character followed by composable characters.  It is useful in contexts
+where fuzzy matches (*i.e.*, ignoring accents) are desired.
+
+Finally, each of these two forms has a variant in which every
+compatibility character is also replaced by its *canonical
+equivalent*; these variants are denoted *NFKC* and *NFKD*, respectively.
+
+
+Using Unicode in Mailman
+------------------------
+
+In most cases in Mailman it is highly recommended that input be
+encoded as UTF-8 in NFC format.  Although highly conformant systems
+are becoming more common, there are still many systems that assume
+that one code point is translated to one glyph on display.  On such
+systems NFC will provide a smoother user experience than NFD.  Since
+much of the text data that Mailman handles is user names, and users
+frequently strongly prefer a particular compatibility character to its
+canonical equivalent, NFKC (or NFKD) should be avoided.
+
+There are two other considerations in using Unicode in Mailman.  The
+first is the problem of confusables.  *Confusables* are characters
+which are considered different but whose glyphs are indistinguishable,
+such as Latin capital letter A and Greek capital letter Alpha.  The
+second is repertoire: many code points in Unicode are not yet assigned
+characters, or are even defined as non-characters, and thus are not
+part of the repertoire of characters represented by Unicode.
+
+Mailman makes no attempt to detect inappropriate use of confusables or
+non-characters (for example, to redirect users to a domain
+disseminating malware).  The risks at present are vanishingly small
+because the necessary support in the mail system itself is not yet
+widespread, but this is likely to change in the near future.
+
+
+Localization
+============
+
+We have it!  We just don't have proper documentation here yet.
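
The normal forms described in the "Normalization" section of the new document can be tried out with Python's standard ``unicodedata`` module. The sketch below is an illustration only and is not part of this commit. The sample strings and the ``strip_accents`` helper are hypothetical, chosen to show NFC, NFD, and NFKC side by side and to show that normalization does not merge confusables.

```python
import unicodedata

# The same accented letter written two ways.
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE (one code point)
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT (two code points)

nfc = unicodedata.normalize("NFC", decomposed)    # composes to one code point
nfd = unicodedata.normalize("NFD", precomposed)   # decomposes to base + accent

# A compatibility character: U+FB01 LATIN SMALL LIGATURE FI.
ligature = "\ufb01"
nfc_ligature = unicodedata.normalize("NFC", ligature)    # left unchanged
nfkc_ligature = unicodedata.normalize("NFKC", ligature)  # replaced by "f" + "i"

# Confusables stay distinct under every normal form.
latin_a = "\u0041"       # LATIN CAPITAL LETTER A
greek_alpha = "\u0391"   # GREEK CAPITAL LETTER ALPHA
alpha_nfkc = unicodedata.normalize("NFKC", greek_alpha)


def strip_accents(text):
    """Hypothetical fuzzy-match helper: decompose, then drop combining marks."""
    decomposed_text = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed_text if not unicodedata.combining(ch))
```

The ligature case shows why NFKC is risky for personal names: the user's chosen character is silently replaced, whereas NFC leaves it alone.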
+
diff --git a/src/mailman/rest/docs/basic.rst b/src/mailman/rest/docs/basic.rst
index 1f8084ecd..24b919bb2 100644
--- a/src/mailman/rest/docs/basic.rst
+++ b/src/mailman/rest/docs/basic.rst
@@ -2,6 +2,10 @@
  Basic operation
 =================
 
+The encoding of URI components addressing a REST endpoint is Unicode
+UTF-8.  There is :ref:`more information about internationalization in
+Mailman <internationalization>`.
+
 In order to do anything with the REST API, you need to know its `Basic AUTH`_
 credentials, and the version of the API you wish to speak to.
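
As a rough illustration of the sentence added to ``basic.rst``, a client could normalize a URI component to NFC and then percent-encode its UTF-8 bytes before placing it in a REST URL. This is a minimal sketch under stated assumptions, not code from Mailman: ``encode_component`` is a hypothetical helper and the sample name is invented. Only ``unicodedata.normalize``, ``urllib.parse.quote``, and ``urllib.parse.unquote`` are standard-library calls.

```python
import unicodedata
from urllib.parse import quote, unquote


def encode_component(text):
    """Hypothetical helper: NFC-normalize, then percent-encode as UTF-8."""
    nfc = unicodedata.normalize("NFC", text)
    # safe="" also escapes "/" so the value stays a single path component.
    return quote(nfc, safe="", encoding="utf-8")


# The same name typed in decomposed and precomposed form.
from_decomposed = encode_component("Zoe\u0308")
from_precomposed = encode_component("Zo\u00eb")

# Without normalization the two spellings produce different URIs.
raw_decomposed = quote("Zoe\u0308", safe="")

# Decoding reverses the percent-encoding and yields the NFC string.
decoded = unquote(from_precomposed, encoding="utf-8")
```

Since Mailman does not currently handle normalization itself, doing this on the client side is one way to make both spellings address the same resource.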