Quirks of Handling Japanese Text in Emails

This technical paper was written by David Clarke of Dragon Thoughts Ltd and is Copyright David Clarke © 2002

This document describes issues that are commonly encountered by software that handles Japanese text in Emails.

The purpose of this document is to provide some assistance to other people who are writing software which needs to handle Japanese text in Emails, based on practical experience of personnel from Dragon Thoughts Ltd. Some of the content has wider application to MIME formats in general.

It does not seek to be a complete reference for Japanese text handling in Emails.

Target Audience

The intended audience is anyone writing Mail User Agents (MUAs) or Mail Transfer Agents (MTAs) based around SMTP that need to recognize or decode Japanese text.

Character Sets

There are various ways in which Japanese characters can be encoded in Emails. The character sets most commonly used are JIS (also known as ISO-2022-JP), S-JIS (Shift_JIS), UTF-8, UTF-7 and EUC-JP.
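As an illustration of how these character sets differ on the wire, the following minimal Python sketch encodes one short Japanese string in each of them and reports whether the result is pure seven-bit data. The mapping from the names used in this paper to Python codec names, and the helper names encode_all and is_seven_bit, are assumptions made for this example only.

```python
# Encode one short Japanese string in each character set named above.
TEXT = "日本語"  # the word for "Japanese language"

# Assumed mapping from the names used in this paper to Python codec names.
CODECS = {
    "JIS (ISO-2022-JP)": "iso2022_jp",
    "S-JIS (Shift_JIS)": "shift_jis",
    "UTF-8": "utf-8",
    "UTF-7": "utf-7",
    "EUC-JP": "euc_jp",
}


def encode_all(text):
    """Return a dict of label -> encoded bytes for every codec in CODECS."""
    return {label: text.encode(codec) for label, codec in CODECS.items()}


def is_seven_bit(data):
    """True when every byte has its top bit clear."""
    return all(b < 0x80 for b in data)


if __name__ == "__main__":
    for label, data in encode_all(TEXT).items():
        print(f"{label:20} seven-bit={is_seven_bit(data)!s:5} {data!r}")
```

Only the JIS and UTF-7 forms consist entirely of seven-bit bytes, which is relevant to the header and body issues discussed in the following sections.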

Header Encoding

RFC2047 is normally used as the basis for encoding non-ASCII characters in MIME headers and is often applied by Japanese Email clients (MUAs). There is another, older RFC, defined specifically for Japanese text, which permits the inclusion of JIS directly in MIME headers.
To add to the confusion, non-MIME messages often include UTF-7 or JIS embedded directly in the headers without identifying markers. This appears to be because both are seven-bit formats, which will pass through RFC822 mail servers (MTAs) without causing problems. It should be noted that Microsoft Outlook will do this if it is configured to UUENCODE attachments rather than MIME encode them.
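The two header styles described above can be handled along the following lines. This is a minimal sketch using Python's standard email.header module, not a complete header parser. The function name decode_header_value is hypothetical, and the test for raw JIS (looking for an ESC $ sequence) is a simplifying assumption; it does not attempt to recognise unmarked UTF-7.

```python
from email.header import decode_header, make_header


def decode_header_value(raw: bytes) -> str:
    """Decode a header value that is RFC2047 encoded or contains raw JIS.

    Hypothetical helper for illustration only.
    """
    # Case 1: RFC2047 encoded words, e.g. =?ISO-2022-JP?B?...?=
    if b"=?" in raw and b"?=" in raw:
        text = raw.decode("ascii", errors="replace")
        return str(make_header(decode_header(text)))
    # Case 2: JIS embedded directly, recognised here by its ESC $ sequence.
    if b"\x1b$" in raw:
        return raw.decode("iso2022_jp", errors="replace")
    # Otherwise treat the value as plain ASCII.
    return raw.decode("ascii", errors="replace")
```

A value such as b"=?ISO-2022-JP?B?GyRCRnxLXDhsGyhC?=" takes the first branch, while the same text embedded as raw escape sequences takes the second.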

Mail Body Encoding

MIME-compliant MUAs may use any of the Japanese character sets, but some of the methods of encoding the characters cause technical issues.
Adding further to the confusion, non-MIME messages may also contain UTF-7 or JIS directly in the message body, with no indication of its presence. This extends to the file names of UUENCODEd attachments. The real problems arise when the recipient MUA has to identify the character encoding.
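The difference between a labelled MIME body and an unlabelled non-MIME body can be seen in the following minimal Python sketch, which uses the standard email package. The function name body_text is hypothetical, and the sketch assumes a single-part message; multipart handling is omitted.

```python
from email import message_from_bytes


def body_text(raw_message: bytes):
    """Return (charset, body) for a single-part message.

    When the message carries no charset label, charset is None and the body
    is returned as undecoded bytes, because nothing in the message says how
    to interpret it. Hypothetical helper for illustration only.
    """
    msg = message_from_bytes(raw_message)
    charset = msg.get_content_charset()  # None when no charset is declared
    payload = msg.get_payload(decode=True)  # undoes base64 / quoted-printable
    if charset is None:
        return None, payload
    return charset, payload.decode(charset, errors="replace")
```

In the unlabelled case the caller is left holding raw bytes and must identify the character set by other means.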

Technical Issues

Microsoft Outlook

Depending on the version and the service packs installed, some versions of Microsoft Outlook do not correctly produce UTF-7 or UTF-8 content. Please refer to Microsoft's own technical help for details. Some intelligent MTAs are capable of detecting this badly formed MIME content and reject the emails as invalid. This is reasonable behaviour, as the mail would be unreadable and many security breaches are performed as side effects of badly formed content.
Converting header fields to RFC2047 compliant versions will often invalidate the content of Outlook emails which have UTF-7 or JIS directly in the message body.
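One simple way in which software might detect badly formed UTF-7 or UTF-8 content of the kind described above is to attempt a strict decode and treat any failure as malformed. The following Python sketch shows the idea; the function name is_well_formed is hypothetical, and this is not a description of how any particular MTA performs its checks.

```python
def is_well_formed(payload: bytes, charset: str) -> bool:
    """True when payload decodes cleanly in the declared charset.

    Hypothetical helper: a strict decode is used as the validity test.
    """
    try:
        payload.decode(charset, errors="strict")
    except (UnicodeDecodeError, LookupError):
        # UnicodeDecodeError: the bytes are not valid in this charset.
        # LookupError: the charset name itself is unknown.
        return False
    return True
```

A truncated UTF-8 sequence, or an eight-bit byte inside content declared as UTF-7, fails this check.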

Eudora

Some versions of Eudora seem to be unable to decode correctly formed UTF-8 MIME messages.

Positive Identification of Content

Where Email may contain directly embedded UTF-7 or JIS, it is necessary to be able to positively identify which character set is being used. Dragon Thoughts Ltd is able to supply techniques or software tools for this, and other related issues.
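By way of illustration only, a very simple heuristic for telling unmarked JIS from unmarked UTF-7 is sketched below in Python. It is not the technique offered by Dragon Thoughts Ltd, and it will misclassify some inputs. It assumes that JIS text contains one of the common ISO-2022-JP escape sequences and that UTF-7 text contains at least one "+" shift sequence that decodes to a non-ASCII character. The function name guess_seven_bit_charset and its return labels are hypothetical.

```python
import re

# Common ISO-2022-JP escape sequences that switch into Japanese characters.
JIS_ESCAPES = (b"\x1b$@", b"\x1b$B", b"\x1b(J")
# A UTF-7 shift sequence: "+" followed by modified base64 characters.
UTF7_SHIFT = re.compile(rb"\+[A-Za-z0-9+/]+")


def guess_seven_bit_charset(data: bytes) -> str:
    """Heuristic guess: 'iso-2022-jp', 'utf-7', 'us-ascii' or 'not-7bit'."""
    if any(b >= 0x80 for b in data):
        return "not-7bit"
    if any(esc in data for esc in JIS_ESCAPES):
        try:
            data.decode("iso2022_jp")
            return "iso-2022-jp"
        except UnicodeDecodeError:
            pass  # escape bytes present but not valid JIS; keep checking
    if UTF7_SHIFT.search(data):
        try:
            text = data.decode("utf-7")
        except UnicodeDecodeError:
            return "us-ascii"  # a "+" that is not a valid shift sequence
        if any(ord(c) > 0x7F for c in text):
            return "utf-7"
    return "us-ascii"
```

Ordinary ASCII text containing a plus sign, such as "1+1=2", fails the UTF-7 decode and is reported as us-ascii.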
