What data type should I use to store raw IMAP-fetched email messages in PostgreSQL?

I need to store email messages in the database as soon as they are fetched from IMAP, for later processing. I extract each message with a FETCH request and the data is returned via BODY.PEEK[].
From my understanding, all IMAP messages are returned as US-ASCII (the mail servers accept only that), but I could be wrong.
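Here is roughly what the fetch looks like; a minimal sketch assuming Python's imaplib, with placeholder server and credentials:

```python
import imaplib

# Placeholder host and credentials, for illustration only.
conn = imaplib.IMAP4_SSL("imap.example.com")
conn.login("user@example.com", "password")
conn.select("INBOX", readonly=True)

# BODY.PEEK[] returns the full raw message without setting the \Seen flag.
status, data = conn.fetch("1", "(BODY.PEEK[])")
raw_message = data[0][1]    # bytes, not str
print(type(raw_message))    # <class 'bytes'>

conn.logout()
```

What I get back is a bytes object; nothing at that point tells me its encoding.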
My options (in order of what I think is right) are:
US-ASCII text column
Bytea
BLOB
I was thinking about using a US-ASCII text column, but I'm afraid of encoding problems; I don't know whether there are "faulty" IMAP servers that return mail that isn't US-ASCII.
The alternative is bytea, but I've read that you still have to deal with encoding, so I'm not sure what the advantages and disadvantages are compared to US-ASCII.
A BLOB is raw, and I'm not sure what problems it brings in this case. I assume I'd have to deal with the bytes-to-string conversion myself.
What's the recommended data type?

For small objects such as emails, I think you're going to be better off with bytea. Large objects and bytea are stored and handled differently, and since your objects will be small, bytea is likely the better fit. See here for a comparison of the two by Microolap. That's not a full answer to your question, but it might take one option off the list.

You're making the very much unwarranted assumption that you can avoid dealing with encodings.
You can't.
Whether you use lob, bytea, or a text column that you assume contains 7-bit mail only... the mail is just arbitrary binary data. You do not know its text encoding. In practice mail clients have used 8-bit encoding forever; either standards-compliant via MIME quoted-printable, or often simply raw 8-bit text.
Some clients have even been known to include full 8-bit MIME segments that include null (zero) bytes. PostgreSQL won't tolerate that in a text column.
But even for clients using compliant MIME, with quoted-printable-escaped text bodies and so on, the mail may contain non-ASCII characters; they're just escaped. Indexing these while ignoring the escapes will yield weird and wrong results. Also, attachments will usually be arbitrary base64 data; indexing that as text is utterly meaningless. Then there are all the HTML bodies, multipart/alternative segments, CSS, etc.
When dealing with email, assume that anything a client or server can do wrong, it will do wrong. For storage, treat the email as raw bytes of unknown encoding. That's exactly what bytea is for.
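A minimal sketch of that, assuming psycopg2 and a made-up table name; the driver adapts Python bytes straight to bytea, so no decoding is involved anywhere:

```python
import psycopg2

# Stand-in for the bytes returned by the IMAP fetch, NUL byte and all.
raw_message = b"From: someone@example.com\r\nSubject: hi\r\n\r\nbody bytes \xff\x00"

# Placeholder connection string and table name.
conn = psycopg2.connect("dbname=mail user=mail")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS raw_mail (
            id      bigserial PRIMARY KEY,
            message bytea NOT NULL
        )
    """)
    # bytes -> bytea; broken encodings and zero bytes are stored verbatim.
    cur.execute("INSERT INTO raw_mail (message) VALUES (%s)", (raw_message,))
```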
If you want to do anything with the mail you'll need a defensive MIME parser that can extract the MIME parts, cope with broken parts, etc. It'll need to check the declared encoding (if any) against the actual mime-part body, and guess encodings if none are declared or the declared encoding is obviously wrong. It'll have to deal with all sorts of bogus MIME structure and contents; quoted-printable bodies that aren't really quoted-printable, and all that.
So if you plan to index this email, it's definitely not as simple as "create a fulltext index and merrily carry on". The question with that is not if it will fail but when.
Personally, if I had to do this (and given the choice I wouldn't) I'd store the raw email as bytea. Then for search I'd decompose it into MIME parts, detect text-like parts, do encoding detection and dequoting, etc, and inject the decoded and cleaned up text bodies into a separate table for text indexing.
There are some useful Perl modules for this that you can possibly use via plperlu, but I'd likely do it in an outside script/tool instead. Then you have your choice of MIME processors, languages, etc.
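If you go the outside-script route, a rough sketch of the decomposition step with Python's email package might look like the following; real mail will need far more defensive handling than shown, and the latin-1 fallback is just an assumption:

```python
import email
from email import policy

def extract_text_parts(raw_message: bytes):
    """Walk a raw RFC 822 message and yield decoded text bodies."""
    msg = email.message_from_bytes(raw_message, policy=policy.default)
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip attachments, images, multipart containers, ...
        try:
            # Honours the declared charset and transfer encoding
            # (quoted-printable, base64, ...).
            yield part.get_content()
        except (LookupError, UnicodeDecodeError):
            # Declared charset missing or bogus: fall back to a lossy
            # decode rather than dropping the part entirely.
            payload = part.get_payload(decode=True) or b""
            yield payload.decode("latin-1", errors="replace")

# The cleaned-up text from extract_text_parts() is what you would insert
# into the separate table that carries the full-text index.
```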

Related

How to safely store a string with an apostrophe in JSONB in Postgres

I have a case where addresses and country names contain special characters. For example:
People's Republic of Korea
De'Paul & Choice Street
etc.
This data gets sent as a JSON payload to the backend, to be inserted into a JSONB column in Postgres.
The INSERT statement gets messed up because of the single quote and ends up erroring out.
The front-end developers say they are using popular libraries to get country names and so on and don't want to touch the data; they just want to pass it through as is.
Any tips on how to process such data with special characters, especially characters that conflict with the JSON format, and insert it safely into Postgres?
Your developers are using the popular libraries, whatever they may be, in the wrong fashion. The application is obviously vulnerable to SQL injection, the most popular way to attack a database application.
Use prepared statements, then the problem will go away. If you cannot do that, use the popular library's functions to escape the input string for use as an SQL string literal.
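For example, a parameterized insert with psycopg2 (the table and column names here are made up); the apostrophes never get spliced into the SQL text, so nothing needs to be escaped by hand:

```python
import json
import psycopg2

# Assumed schema: CREATE TABLE addresses (id serial PRIMARY KEY, data jsonb);
payload = {
    "country": "People's Republic of Korea",
    "street": "De'Paul & Choice Street",
}

conn = psycopg2.connect("dbname=app user=app")
with conn, conn.cursor() as cur:
    # The driver fills the %s placeholder, so quotes inside the data are
    # handled safely: no string concatenation, no SQL injection.
    cur.execute(
        "INSERT INTO addresses (data) VALUES (%s::jsonb)",
        (json.dumps(payload),),
    )
```

The same placeholder-plus-bound-values pattern exists in pretty much every mainstream driver and ORM.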

Verilog bit metadata

Is there a way to easily add metadata to a Verilog bit? My goal is to be able to identify certain bits that are well known prior to encryption, after an Ethernet frame has been encrypted: I'd like to easily identify those bits' locations in the encrypted frame. I'd like this metadata to be transparent to the actual design RTL (i.e., allow it to flow naturally through external IPs that are not mine and be recovered and analyzed at the other end).
Thanks
There is absolutely no way to do this using the original RTL path.
You were not clear about your reasoning for this, but sometimes people use a watermark: encoding something into your data that is inconsequential to the design but has meaning to your verification environment. For example, instead of sending completely random data in a packet, you send data with a specific checksum that has meaning to your verification environment.
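Purely to illustrate the watermark idea (this is plain Python, not RTL, and the checksum scheme is made up): you constrain the otherwise-random payload so that a tag known only to your verification environment can be recovered wherever the data is visible in the clear.

```python
import os

WATERMARK = 0xA5  # tag known only to the verification environment

def make_payload(length: int) -> bytes:
    """Random-looking payload whose XOR checksum equals the watermark."""
    body = bytearray(os.urandom(length - 1))
    xor = 0
    for b in body:
        xor ^= b
    body.append(xor ^ WATERMARK)  # last byte forces the overall checksum
    return bytes(body)

def has_watermark(payload: bytes) -> bool:
    xor = 0
    for b in payload:
        xor ^= b
    return xor == WATERMARK
```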

Better way to load content from web, JSON or XML?

I have an app which will load content from a website.
Around 100 articles will be loaded each time.
I would like to know which format is better for loading content from the web, looking at:
speed
compatibility (will there be any problems with encoding if we use special characters etc.)
your experience
JSON is better if your data is huge. Read more here: http://www.json.org/xml.html
Strongly recommend JSON for better performance and less bandwidth consumption.
JSON all the way. The Saad's link is an excellent resource for comparing the two (+1 to The Saad), but here is my take, from experience and based on your post:
speed
JSON is likely to be faster in many ways. Firstly the syntax is much simpler, so it'll be quicker to parse and to construct. Secondly, it is much less verbose. This means it will be quicker to transfer over the wire.
compatibility
In theory, there are no issues with either JSON or XML here. In terms of character encodings, I think JSON wins because you must use Unicode. XML allows you to use any character encoding you like, but I've seen parsers choke because the line at the top specifies one encoding and the actual data is in a different one.
experience
I find XML to be far more difficult to hand craft. You can write JSON in any text editor but XML really needs a special XML editor in order to get it right.
XML is more difficult to manipulate in a program. Parsers have to deal with more complexity: namespaces, attributes, entities, CDATA, etc. So if you are using a stream-based parser, you need to track attributes, element content, namespace maps, and so on. DOM-based parsers tend to produce complex graphs of custom objects (because they have to, in order to model the complexity). I have to admit I've never used a stream-based JSON parser, but parsers that produce object graphs can use the natural Objective-C collections.
On the iPhone, there is no built-in XML DOM parser in Cocoa (you can use the C-based parser, libxml2), but there is a simple-to-use JSON parser as of iOS 5.
In summary, if I have control of both ends of the link, I'll use JSON every time. On OS X, if I need a structured human readable document format, I'll use JSON.
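To make the parsing point concrete, here is a hedged sketch in Python rather than Objective-C (the structure and field names are invented); the JSON form maps straight onto native collections, while the XML form needs explicit navigation:

```python
import json
import xml.etree.ElementTree as ET

json_doc = '{"article": {"id": 1, "title": "Hello", "tags": ["news", "tech"]}}'
xml_doc = """<article id="1">
  <title>Hello</title>
  <tags><tag>news</tag><tag>tech</tag></tags>
</article>"""

# JSON: one call, then ordinary dict/list access.
article = json.loads(json_doc)["article"]
print(article["title"], article["tags"])

# XML: walk the tree, distinguish attributes from elements, collect children.
root = ET.fromstring(xml_doc)
print(root.findtext("title"), [t.text for t in root.findall("./tags/tag")])
```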
You say you are loading "articles". If you mean documents containing rich text (stuff like italic and bold), then it's not clear that JSON is an option - JSON doesn't really do mixed content.
If it's pure simple structured data, and if you don't have to handle complexities like the need for the software at both ends of the communication to evolve separately rather than remaining in lock sync, then JSON is simpler and cheaper: you don't need the extra power or complexity of XML.

User expectations and unicode normalization

This is a bit of a soft question, feel free to let me know if there's a better place for this.
I'm developing some code that accepts a password that requires international characters - so I'll need to compare an input unicode string with a stored unicode string. Easy enough.
My question is this: do users of international character sets generally expect normalization in such a case? My Google searches show conflicting opinions, from 'always do it' (http://unicode.org/faq/normalization.html) to 'don't bother'. Are there any pros or cons to not normalizing (e.g., a password being less likely to be guessed, etc.)?
I would recommend that if your password field accepts Unicode input (presumably UTF-8 or UTF-16), that you normalize it before hashing and comparing. If you don't normalize it, and people access it from different systems (different operating systems, or different browsers if it's a web app, or with different locales), then you may get the same password represented with different normalization. This would mean that your user would type the correct password, but have it rejected, and it would not be obvious why, nor would they have any way to fix it.
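A minimal sketch of that, assuming Python, NFC normalization, and a generic PBKDF2 hash (the parameters are illustrative); the only point is that normalization happens before hashing on both the set-password and check-password paths:

```python
import hashlib
import os
import unicodedata

def hash_password(password: str, salt: bytes) -> bytes:
    # Normalize first so "é" typed as one code point or as "e" plus a
    # combining accent hashes to the same value on every client.
    normalized = unicodedata.normalize("NFC", password)
    return hashlib.pbkdf2_hmac("sha256", normalized.encode("utf-8"), salt, 100_000)

salt = os.urandom(16)
stored = hash_password("caf\u00e9", salt)      # precomposed é
candidate = hash_password("cafe\u0301", salt)  # e + combining acute accent
print(stored == candidate)                     # True, thanks to normalization
```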
I wouldn't bother for a couple reasons:
You're going to make things less secure. If two or more characters are all represented in your DB as the same thing, then that means there are fewer possible passwords for the site. (Though this probably isn't a huge deal, since the number of possible passwords is pretty huge.)
You will be building code into your program that does complicated work that is (probably) part of a library you didn't write...and eventually somebody won't be able to log in as a result. Better in my mind to keep things simple, and to trust that people using different character sets know how to type them properly. That said, I've never implemented this in an international password form, so I couldn't tell you what the standard design pattern is.

Hashing SMTP and NNTP messages?

I want to store and index all of my historical e-mail and news as individual message files, using some computed hash code based on the message body+headers. Then I'll index on other things as well -- for searching.
For the primary index key, my thought is to use SHA-1 for the hash algorithm and assume that there will never be any collisions (although I know that there theoretically could be).
Besides the body, what headers should I index? Or more generally, what transformations should I apply to an in-memory copy of the message prior to hashing?
Should I ignore "ReSent-*:" headers? Should I join line-broken headers into single-line headers and remove extraneous whitespace?
(The reason I want to index the messages based on a hash instead of on the Message-ID header is that Message-ID headers aren't uniformly formatted.)
You should hash precisely that which constitutes uniqueness of the message. If two messages may differ by the presence of "ReSent-*:" headers but still must be considered the "same" message, then those headers must not be part of what is hashed. Similarly, if equal messages may differ in header syntax, then you should normalize header syntax before hashing. Hash functions such as SHA-1 return the same output only if the input is exactly the same, every single bit of it.
Now, if using Message-IDs is enough for you, save for the formatting issue, then there is a simple way: just hash the Message-IDs. A hashed Message-ID gives you a regular, fixed-size, randomized value on which you can index.
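A hedged sketch of both variants in Python; the header list is illustrative only, and the whitespace normalization is the simplest thing that could work:

```python
import email
import hashlib
from email import policy

# Headers that define "sameness" for this archive; ReSent-* headers are
# deliberately excluded so a resent copy hashes to the same key.
KEY_HEADERS = ("From", "To", "Date", "Subject", "Message-ID")

def message_key(raw: bytes) -> str:
    """SHA-1 over normalized key headers plus the raw body."""
    msg = email.message_from_bytes(raw, policy=policy.default)
    h = hashlib.sha1()
    for name in KEY_HEADERS:
        value = str(msg.get(name, ""))
        # Unfold folded headers and collapse whitespace so syntax
        # differences alone don't change the hash.
        h.update(" ".join(value.split()).encode("utf-8") + b"\n")
    body = raw.split(b"\r\n\r\n", 1)[1] if b"\r\n\r\n" in raw else b""
    h.update(body)
    return h.hexdigest()

def message_id_key(raw: bytes) -> str:
    """SHA-1 of just the whitespace-normalized Message-ID."""
    msg = email.message_from_bytes(raw, policy=policy.default)
    mid = " ".join(str(msg.get("Message-ID", "")).split())
    return hashlib.sha1(mid.encode("utf-8")).hexdigest()
```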