Quirks of Handling Japanese Text in Emails
This technical paper was written by David Clarke of Dragon Thoughts Ltd and is Copyright David Clarke © 2002
This document sets out some information on issues commonly encountered by software
that handles Japanese text in Emails.
The purpose of this document is to provide some assistance to other people who
are writing software that needs to handle Japanese text in Emails, based on the practical
experience of personnel from Dragon Thoughts Ltd.
Some of the content has wider application to MIME formats in general.
It does not seek to be a complete reference for Japanese text handling in Emails.
Target Audience
The intended audience is anyone writing Mail User Agents (MUAs) or Mail Transport
Agents (MTAs) based around SMTP, that have a requirement to recognize or decode Japanese
text.
Character Sets
There are various ways in which Japanese characters can be encoded in Emails. The
character sets most commonly used are JIS (also known as ISO-2022-JP), S-JIS, UTF-8,
UTF-7 and EUC-JP.
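To make the differences concrete, the following is a minimal sketch that encodes one sample string in each of the character sets listed above. The Python codec names and the `encode_all` helper are assumptions of this example, not something the paper prescribes.

```python
# Mapping from the names used in this paper to Python codec names.
# The codec names are an assumption of this sketch.
CODECS = {
    "JIS (ISO-2022-JP)": "iso2022_jp",
    "S-JIS": "shift_jis",
    "UTF-8": "utf_8",
    "UTF-7": "utf_7",
    "EUC-JP": "euc_jp",
}


def encode_all(text):
    """Return a dict of {character set label: encoded bytes} for text."""
    return {label: text.encode(codec) for label, codec in CODECS.items()}


if __name__ == "__main__":
    # Print the bytes produced for the same three characters in each set.
    for label, raw in encode_all("日本語").items():
        print(f"{label:18} {raw.hex(' ')}")
```

The same three characters produce a different byte sequence in every character set, which is why a recipient has to know (or work out) which one was used.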
Header Encoding
RFC2047 is normally used as the basis for encoding non-ASCII
characters in MIME headers and is often applied by Japanese Email Clients (MUAs).
There is another, older RFC, defined specifically for Japanese text,
which permits the inclusion of JIS directly in MIME headers.
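As an illustration of the RFC2047 form, the sketch below decodes an encoded header value using Python's standard `email.header` module. The helper name `decode_mime_header` and the sample header value are assumptions made for this example.

```python
from email.header import decode_header, make_header


def decode_mime_header(value):
    """Decode any RFC2047 encoded words in a header value to text."""
    # decode_header splits the value into (bytes or str, charset) pairs;
    # make_header reassembles them into a Header that str() renders as text.
    return str(make_header(decode_header(value)))


if __name__ == "__main__":
    # Hypothetical sample: three Japanese characters, ISO-2022-JP, base64.
    print(decode_mime_header("=?ISO-2022-JP?B?GyRCRnxLXDhsGyhC?="))
```

The character set is named inside the encoded word itself, so the recipient does not have to guess it. That is the key difference from the unmarked forms described in this section.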
To add to the confusion, non-MIME messages often include UTF-7 or JIS directly embedded
in the headers without identifying markers. This appears to be because both are
seven bit formats which will pass through RFC822 mail servers (MTAs) without causing
problems. It should be noted that Microsoft Outlook will do this if the options are set
to UUENCODE, rather than MIME encode, attachments.
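The seven bit property mentioned above can be checked directly. The following is a small sketch, assuming Python's codec names for each character set; `is_seven_bit` is a hypothetical helper introduced only for this illustration.

```python
def is_seven_bit(raw):
    """Return True if every byte in raw is below 0x80."""
    return all(byte < 0x80 for byte in raw)


if __name__ == "__main__":
    sample = "日本語"
    # Codec names are assumptions of this sketch.
    for codec in ("iso2022_jp", "utf_7", "shift_jis", "euc_jp", "utf_8"):
        print(codec, is_seven_bit(sample.encode(codec)))

    # A header line with unmarked JIS is still plain seven bit data,
    # so nothing in transit has any reason to reject or alter it.
    raw_header = b"Subject: " + sample.encode("iso2022_jp")
    print(is_seven_bit(raw_header))
```

Only the JIS and UTF-7 forms survive this check for Japanese text, which is consistent with them being the two encodings found embedded without markers.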
Mail Body Encoding
MIME-compliant MUAs may use any of the Japanese character sets, but some of the methods
of encoding the characters cause technical issues.
Adding further to the confusion, non-MIME messages may contain UTF-7
or JIS directly in the message body with no indication of its presence. This extends
to the file names of UUENCODEd attachments. The real problems arise with identification
of the character encoding by the recipient MUA.
Technical Issues
Microsoft Outlook
Depending on the version and service packs, some versions of Microsoft Outlook do
not correctly produce UTF-7 or UTF-8
content. Please refer to Microsoft's own technical help for details. Some intelligent
MTAs are capable of detecting this badly formed MIME content and rejecting the emails
as invalid. This is reasonable behaviour, as the mail would be unreadable and many
security breaches are performed as side effects of badly formed content.
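One simple way to detect badly formed content of this kind is to attempt a strict decode of the payload in its declared character set. The sketch below assumes Python's built-in codecs; `is_well_formed` is a hypothetical helper, and treating an unknown character set name as a failure is a design choice of this example rather than a rule taken from any MTA.

```python
def is_well_formed(payload, charset):
    """Return True if payload decodes cleanly in the declared charset."""
    try:
        payload.decode(charset, errors="strict")
    except UnicodeDecodeError:
        # The bytes are not valid in the declared character set.
        return False
    except LookupError:
        # The declared character set name is not known to Python.
        return False
    return True


if __name__ == "__main__":
    good = "日本語".encode("utf_8")
    truncated = good[:-1]  # last character cut short
    print(is_well_formed(good, "utf-8"))
    print(is_well_formed(truncated, "utf-8"))
```

A check like this only establishes that the bytes are valid for the declared character set; it says nothing about whether the declaration itself is correct.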
Converting header fields to RFC2047-compliant versions
will often invalidate the content of Outlook emails which have UTF-7
or JIS directly in the message body.
Eudora
Some versions of Eudora seem to be unable to decode correctly formed UTF-8 MIME messages.
Positive Identification of Content
Where Email may contain directly embedded UTF-7 or JIS,
it is necessary to be able to positively identify which character set is being used.
Dragon Thoughts Ltd is able to supply techniques or software tools for this and other related issues.
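As a rough illustration of the kind of check involved, the following heuristic sketch distinguishes unmarked JIS, UTF-7 and plain ASCII in seven bit data. It is not the technique referred to above; the function name, the return labels and the regular expressions are all assumptions of this example. The JIS escape sequences tested for are the ones listed in RFC 1468.

```python
import re

# Escape sequences that switch character sets in ISO-2022-JP (RFC 1468).
_JIS_ESCAPE = re.compile(rb"\x1b(\$[@B]|\([BJ])")
# A '+' followed by base64 characters: a possible UTF-7 shift sequence.
_UTF7_SHIFT = re.compile(rb"\+[A-Za-z0-9+/]+")


def guess_seven_bit_charset(raw):
    """Heuristically label raw as iso-2022-jp, utf-7, us-ascii or not-7bit."""
    if any(byte >= 0x80 for byte in raw):
        return "not-7bit"
    # Check JIS first: its data bytes can resemble UTF-7 shift sequences.
    if _JIS_ESCAPE.search(raw):
        try:
            raw.decode("iso2022_jp")
            return "iso-2022-jp"
        except UnicodeDecodeError:
            pass
    if _UTF7_SHIFT.search(raw):
        try:
            decoded = raw.decode("utf_7")
        except UnicodeDecodeError:
            return "us-ascii"
        # Only call it UTF-7 if decoding produced non-ASCII characters.
        if any(ord(ch) > 0x7F for ch in decoded):
            return "utf-7"
    return "us-ascii"


if __name__ == "__main__":
    print(guess_seven_bit_charset("日本語".encode("iso2022_jp")))
    print(guess_seven_bit_charset("日本語".encode("utf_7")))
    print(guess_seven_bit_charset(b"1+1=2"))
```

A heuristic like this can be fooled by ordinary ASCII text that happens to look like a UTF-7 shift sequence, so a production tool would need stronger evidence before committing to an answer.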
This technical paper was written by David Clarke of Dragon Thoughts Ltd and is Copyright David Clarke © 2001