
Javamail heuristic on content encoding

Replies: 7 - Last Post: August 19, 2009 05:20
by: Bill Shannon
showing 1 - 8 of 8
Posted: August 10, 2009 21:52 by lgalfaso
Hi,
When JavaMail receives a mail without an encoding specified, it tries to determine the encoding type using this heuristic:
- If every character returns false from MimeUtility::nonascii (and some other conditions that are not important in this case hold), then decode as if this were 7bit encoding; if not, use the "quoted-printable" or "base64" encoding, chosen by some other heuristic. The issue is with the method nonascii, which reads

static final boolean nonascii(int b) {
return b >= 0177 || (b < 040 && b != '\r' && b != '\n' && b != '\t');
}

and the issue we are having arises when processing text emails that contain the '\f' (0x0C, form feed) character: JavaMail does not treat this character as ASCII, and it breaks the content of these mails.

Is JavaMail at fault, or the email client that is sending the emails?
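
To make the behaviour of the predicate concrete, here is a minimal standalone sketch. The nonascii body is copied from the snippet quoted above; the class name NonAsciiCheck and the helper anyFlagged are hypothetical names used only for illustration, and this is not the library class itself.

```java
import java.nio.charset.StandardCharsets;

public class NonAsciiCheck {
    // Copy of the predicate quoted in the post (not the JavaMail class itself).
    static boolean nonascii(int b) {
        return b >= 0177 || (b < 040 && b != '\r' && b != '\n' && b != '\t');
    }

    // Hypothetical helper: true if at least one octet of the text is flagged.
    static boolean anyFlagged(String text) {
        for (byte raw : text.getBytes(StandardCharsets.ISO_8859_1)) {
            if (nonascii(raw & 0xff)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Only printable characters, space, CR and LF: nothing is flagged.
        System.out.println(anyFlagged("xxx\r\n \r\nxxx\r\n"));   // false
        // A single form feed (octal 014) is below 040 and not CR, LF or TAB.
        System.out.println(anyFlagged("xxx\r\n\f\r\nxxx\r\n")); // true
    }
}
```

A single form feed is therefore enough for the whole part to be treated as not plain ASCII by this predicate.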
Posted: August 12, 2009 18:44 by Bill Shannon
I think there's some confusion here.

JavaMail does not try to guess the encoding when processing a received message with no encoding.
Instead, it assumes a default encoding (8-bit, I believe).

When creating a message, JavaMail examines the data to choose an appropriate encoding. Per RFC 2045,
the form feed character should be encoded using quoted-printable.

Tell me more about the message you've received, how you're processing it, and how the form feed character
"breaks the content" of the message.
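
For readers unfamiliar with quoted-printable, the following is a minimal sketch of how a single line would be written in that encoding, assuming the basic rules of RFC 2045 (an octet may be written as "=" plus two uppercase hex digits; printable ASCII other than "=" may stay literal; whitespace at the end of a line must be encoded). The class QpEncodeSketch and method encodeLine are hypothetical names; the sketch ignores line breaks and the 76-character line limit, and it is not JavaMail's encoder.

```java
import java.nio.charset.StandardCharsets;

public class QpEncodeSketch {
    // Encodes one line (without its CRLF) using simplified quoted-printable rules.
    static String encodeLine(String line) {
        byte[] in = line.getBytes(StandardCharsets.ISO_8859_1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < in.length; i++) {
            int b = in[i] & 0xff;
            // Printable ASCII except '=' may be written literally.
            boolean printable = b >= 33 && b <= 126 && b != '=';
            // Space and tab may be literal only when they are not last on the line.
            boolean innerWhitespace = (b == ' ' || b == '\t') && i < in.length - 1;
            if (printable || innerWhitespace) {
                out.append((char) b);
            } else {
                out.append(String.format("=%02X", b));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodeLine("\f"));   // =0C
        System.out.println(encodeLine("xxx ")); // xxx=20
        System.out.println(encodeLine("a=b"));  // a=3Db
    }
}
```

Under these rules a form feed becomes =0C, and a trailing space becomes =20 so that it survives transport.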
Posted: August 12, 2009 18:55 by lgalfaso
Here is sample code that triggers the issue. The part that has no Content-Transfer-Encoding defined contains a '\f' character, so it is "converted" to quoted-printable (breaking the email content itself). If the '\f' character is removed, then the content is treated as 7bit and everything works as expected.


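import java.io.FileInputStream;
import java.io.IOException;
import java.util.Arrays;

import javax.mail.MessagingException;
import javax.mail.Multipart;
import javax.mail.Part;
import javax.mail.internet.MimeMessage;

public class EncodingTest {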
public static void main(String[] args) {
try {
/*
* For the mail in question, the output is
* <quote>
* Without a previous recursive scan
* [8bit]
* [8bit]
* null
*
* With a previous recursive scan
* [8bit]
* [8bit]
* [quoted-printable]
* </quote>
*
* The case with a previous recursive scan is wrong, the encoding of the
* last part is null so it should not be treated as "quoted-printable"
* but as "7bit" as the standard reads:
*
* <quote>
* "7bit data" refers to data that is all represented as relatively
* short lines with 998 octets or less between CRLF line separation
* sequences [RFC-821]. No octets with decimal values greater than 127
* are allowed and neither are NULs (octets with decimal value 0). CR
* (decimal value 13) and LF (decimal value 10) octets only occur as
* part of CRLF line separation sequences.
* </quote>
*/

MimeMessage mimeMessage;

// Pass without a previous recursive scan.
System.out.println("Without a previous recursive scan");
mimeMessage = new MimeMessage(null, new FileInputStream("Path-to-raw-mail.eml"));
mimeMessage.saveChanges();
recursivePrintEncoding(mimeMessage, "");

// Pass with a previous recursive scan.
System.out.println();
System.out.println("With a previous recursive scan");
mimeMessage = new MimeMessage(null, new FileInputStream("Path-to-raw-mail.eml"));
recursiveScan(mimeMessage); // This line is the only difference in both cases
mimeMessage.saveChanges();
recursivePrintEncoding(mimeMessage, "");
} catch (MessagingException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}

private static void recursiveScan(Part part) throws MessagingException, IOException {
// This method makes no modifications to "part".
Object content = part.getContent();
if (content instanceof Multipart) {
Multipart mp = (Multipart)content;
for (int i=0;i<mp.getCount();++i) {
recursiveScan(mp.getBodyPart(i));
}
}
}

private static void recursivePrintEncoding(Part part, String prefix) throws IOException, MessagingException {
System.out.println(prefix + Arrays.toString(part.getHeader("Content-Transfer-Encoding")));
Object content = part.getContent();
if (content instanceof Multipart) {
Multipart mp = (Multipart)content;
for (int i=0;i<mp.getCount();++i) {
recursivePrintEncoding(mp.getBodyPart(i), prefix + " ");
}
}
}
}
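
The "7bit data" definition quoted in the comment above can be checked mechanically. The following is a minimal sketch that implements only the quoted paragraph (no octets above 127, no NULs, at most 998 octets per line, CR and LF only as CRLF pairs). The class SevenBitCheck and method isSevenBit are hypothetical names, and this is not the logic JavaMail uses to choose an encoding.

```java
import java.nio.charset.StandardCharsets;

public class SevenBitCheck {
    static final int MAX_LINE_OCTETS = 998;

    // True if the data satisfies the quoted "7bit data" definition.
    static boolean isSevenBit(byte[] data) {
        int lineLength = 0;
        for (int i = 0; i < data.length; i++) {
            int b = data[i] & 0xff;
            if (b == 0 || b > 127) {
                return false; // NUL or 8-bit octet
            }
            if (b == '\r') {
                // CR is only allowed as the first half of a CRLF pair.
                if (i + 1 >= data.length || data[i + 1] != '\n') {
                    return false;
                }
                i++; // consume the LF
                lineLength = 0;
            } else if (b == '\n') {
                return false; // bare LF
            } else if (++lineLength > MAX_LINE_OCTETS) {
                return false; // line too long
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] withFormFeed = "xxx\r\n\f\r\n \r\nxxx\r\n".getBytes(StandardCharsets.US_ASCII);
        byte[] bareLineFeed = "xxx\nxxx\r\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(isSevenBit(withFormFeed)); // true
        System.out.println(isSevenBit(bareLineFeed)); // false
    }
}
```

Read this way, the quoted definition does not exclude a form feed, while the nonascii predicate does flag it; that difference is what the thread is about.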
Posted: August 12, 2009 22:48 by Bill Shannon
Ok, so you're not reading a message from a mail server directly, you're creating a new MimeMessage
object based on message data in a file.

I can't tell what overall goal you're trying to accomplish so it's hard to suggest alternatives, but...

I don't know why you're calling saveChanges since you're not changing the message.
The saveChanges call causes JavaMail to reevaluate the encoding for the message
you've created, because after all you might've changed any of the content of the
message. This is why it converts the part in question to quoted-printable encoding,
since based on the content of that part it *should* be quoted-printable. It's probably
a bug that whatever program created the message to begin with didn't encode it.

Still, it's not clear to me why you think encoding the part as quoted-printable is "breaking"
the message. Is some other software you're using not properly handling quoted-printable
messages? Or is it changing the header to say quoted-printable without also changing the
content of the part to be so encoded?

Note that there are several issues with the JavaMail implementation when used to "edit"
messages constructed from a stream. If that's your ultimate goal, to read in a message
and arbitrarily edit it, you're definitely going to run into some limitations of JavaMail.
Posted: August 13, 2009 18:34 by lgalfaso
The original body is something like this (I will put <space>, <ff>, <cr>, <lf>, <t> instead of a space, form feed, carriage return, line feed and tab as it is simpler to see the error.)

Index: foo.bar<cr><lf>
===================================================================<cr><lf>
--- foo.bar<t>(revision 1234)<cr><lf>
+++ foo.bar<t>(working copy)<cr><lf>
xxx<cr><lf>
<ff>
<space><cr><lf>
xxx<cr><lf>


and this is the result after this is "transformed" to quoted-printable


Index: foo.bar<cr><lf>
==================================================================--- foo.bar<t>(revision 1234)<cr><lf>
+++ foo.bar<t>(working copy)<cr><lf>
xxx<cr><lf>
<ff><cr><lf>
<cr><lf>
xxx<cr><lf>


So, this transformation breaks the file in two places:
- The last equals is removed from the second line, and the CRLF is gone.
- The line that is <space><cr><lf> is transformed to <cr><lf> (removing the space).

The same message without the form feed char works as expected.
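
Both changes listed above match what a quoted-printable decoder does to text that was never encoded: an "=" directly before CRLF is a soft line break and is dropped together with the CRLF, and whitespace at the end of a line is discarded. Whether that is what actually happens here is an assumption, not something established in this thread. The following is a minimal sketch of such a decoder; QpDecodeSketch, decode and startsWithCrlf are hypothetical names, and this is a simplified, lenient decoder, not JavaMail's.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class QpDecodeSketch {
    static boolean startsWithCrlf(byte[] in, int pos) {
        return pos + 1 < in.length && in[pos] == '\r' && in[pos + 1] == '\n';
    }

    static boolean isHex(byte b) {
        return Character.digit(b, 16) >= 0;
    }

    // Simplified, lenient quoted-printable decoder (illustration only).
    static String decode(String input) {
        byte[] in = input.getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < in.length) {
            byte b = in[i];
            if (b == '=' && startsWithCrlf(in, i + 1)) {
                i += 3; // soft line break: "=" CR LF vanishes
            } else if (b == '=' && i + 2 < in.length && isHex(in[i + 1]) && isHex(in[i + 2])) {
                out.write(Character.digit(in[i + 1], 16) * 16 + Character.digit(in[i + 2], 16));
                i += 3; // "=XX" becomes one octet
            } else if (b == ' ' || b == '\t') {
                int end = i;
                while (end < in.length && (in[end] == ' ' || in[end] == '\t')) {
                    end++;
                }
                if (!startsWithCrlf(in, end)) {
                    out.write(in, i, end - i); // whitespace inside a line is kept
                }
                i = end; // whitespace right before CRLF is dropped
            } else {
                out.write(b); // literal octet, including a malformed "="
                i++;
            }
        }
        return new String(out.toByteArray(), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Unencoded text: a line of "=" signs and a line holding only a space.
        String raw = "===\r\n--- foo.bar\r\n \r\nxxx\r\n";
        String decoded = decode(raw);
        System.out.println(decoded.equals("==--- foo.bar\r\n\r\nxxx\r\n")); // true
    }
}
```

In this sketch the last "=" and its CRLF disappear and the line containing only a space becomes empty, which is the same shape of damage as in the example above.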

Would it be possible to change MimeUtility::nonascii(int) to

static final boolean nonascii(int b) {
return b >= 0177 || (b < 040 && b != '\r' && b != '\n' && b != '\t' && b != '\f');
}

With this change in place, the error described above does not happen.
Posted: August 15, 2009 04:22 by Bill Shannon
Can't change MimeUtility.nonascii, it's doing what the spec says it should do.

Again, I don't understand what you're trying to accomplish. Please reread my previous message.
Posted: August 18, 2009 21:33 by jackr
The context here is a mail-list manager, attempting to forward a message. We have received a message that consists of two parts. The message as a whole is

Content-Type: multipart/mixed; boundary="wzJLGUyc3ArbnUjN"
Content-Disposition: inline


The first part is:

Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline


The second part is:

Content-Type: text/x-diff; charset=us-ascii
Content-Disposition: attachment; filename="issue-3459-v1.diff"


We want to send out a new message where:

- The first part is probably changed (such as by adding footers linking back to the discussion), and therefore reasonable changes may be made in Content-Type and such things

- The second part is exactly, bit-for-bit, as we received it

The references to the ^L character represent our forensics as to why the second part is being changed: by debugging, we believe that the ^L's are what triggers the nonascii() claim, which in turn leads to non-bit-for-bit-preserving behavior. But the actual "breakage" is the trailing whitespace removal.
Posted: August 19, 2009 05:20 by Bill Shannon
I think you're getting in trouble by trying to modify a message that you created based on an InputStream
(although a reproducible test case sent to javamail@sun.com would be useful to verify that).

Instead of modifying the message you created, you need to create a new message and copy over the
content that you want to preserve. I believe you can simply use the BodyPart from the original message
when creating the new message.

As for preserving the original part bit-for-bit, what do you expect to happen if the original part is incorrectly
encoded, e.g., if it uses 7-bit encoding when it should've used quoted-printable? Do you need to preserve
the original incorrect encoding? Or do you only need to preserve the original data?

I'm not sure what's removing the trailing space.

I tried to reproduce your problem with the program posted above and the content posted above but was unable to.