To use Unicode or not in a web development project using Flask and SQLAlchemy

I am working on a web development project using Flask and the SQLAlchemy ORM. My post is about the use of Unicode in developing this app.
What I have understood so far about Unicode:
If I want my web app to handle data in languages other than English, I need to use the unicode type for my variables, because plain byte strings can't reliably hold such data.
I need a database that either stores Unicode data or takes responsibility for converting Unicode to raw bytes while saving and back while retrieving. SQLAlchemy gives me the option to set up automatic conversion both ways, so that I don't have to worry about it.
I am using Python 2.7, so I have to be careful to process Unicode data properly; normal string operations on Unicode data may be buggy (a short example of what can go wrong follows below).
Correct me if any of the above assumptions is wrong.
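For illustration, here is a minimal Python 2.7 sketch of the pitfall in the third assumption; the variable names and sample text are made up:
# -*- coding: utf-8 -*-
# Python 2.7: mixing byte strings (str) with unicode strings forces an
# implicit ASCII decode, which fails on non-English text.
byte_text = 'café'         # str: the raw UTF-8 bytes from this source file
uni_text = u'café'         # unicode: actual characters

print len(byte_text)       # 5 -- counts bytes ('é' is two bytes in UTF-8)
print len(uni_text)        # 4 -- counts characters

try:
    byte_text + u'!'       # implicit ascii decode of byte_text
except UnicodeDecodeError as exc:
    print 'mixing str and unicode fails:', exc

# The safe pattern: decode bytes at the boundary, work in unicode inside.
print byte_text.decode('utf-8') + u'!'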
Now, my doubts or questions:
If I don't use Unicode now, will I have problems if I (or the Flask developers) decide to port to Python 3?
I don't want to wrestle with the thought of my web app catering to different languages right now; I just want to concentrate on creating the app first. Can I add that later, without using Unicode right now?
If I use Unicode now, how does it affect my code? Do I replace every string input and output with unicode, or what?
Can the conversion of Unicode when saving to the database be a source of performance problems?
Basically, I am asking whether to use Unicode or not, having explained my needs and what I require from the project.

No, but make sure you separate binary data from text data. That makes it easier to port.
It's easier to use Unicode from the start, but of course you can postpone it. It's really not very difficult.
You replace everything that should be text data with Unicode, yes.
Only if you make loads of conversions of really massive amounts of text.
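To make the "replace text data with Unicode" advice concrete, here is a minimal Flask-SQLAlchemy sketch; the model, columns, and database URI are made-up examples, not something from the question:
# -*- coding: utf-8 -*-
# Hypothetical model using Unicode column types, so text columns
# round-trip as unicode objects; SQLAlchemy does the encode/decode
# at the database boundary.
from flask import Flask
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config['SQLALCHEMY_DATABASE_URI'] = 'sqlite:///example.db'
db = SQLAlchemy(app)

class Post(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    title = db.Column(db.Unicode(200))    # instead of db.String(200)
    body = db.Column(db.UnicodeText)      # instead of db.Text

if __name__ == '__main__':
    with app.app_context():
        db.create_all()
        db.session.add(Post(title=u'Überschrift', body=u'نص عربي'))
        db.session.commit()
Literal text passed to these columns should be unicode (the u'' literals above); everything else stays the same.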

Related

How to store an image in Postgres using a Flask model

I am trying to store an image using a Flask model. I don't know how to store the image in Postgres, so I have encoded the image to base64 and am trying to store the resulting text in Postgres. It works, but is there any recommended way to store that encoded text, or the image itself, in Postgres using a Flask model?
class User_tbl(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    mobile = db.Column(db.String(13), unique=True)
    country = db.Column(db.String(30))
    image = db.Column(db.String(256))

    def __init__(self, mobile, country, image):
        self.mobile = mobile
        self.country = country
        self.image = image
I know it may be too late to answer this question, but these days I was trying to solve something similar and none of the proposed solutions seemed to shed light on the main problem.
Of course, any best practice rests on your needs. In general terms, however, you will find that embedding a file in the database is not good practice. Well, it depends.
Reading the "Storing Binary files in the Database" page on the PostgreSQL wiki, I discovered that there are some circumstances in which this practice is instead highly recommended, for instance when the files must be covered by the database's ACID guarantees.
In those cases, at least in Postgres, the bytea data type is to be preferred over text or large objects (BLOBs), sometimes at the cost of somewhat higher memory requirements for the server.
In this case:
1) You don't need special SQLAlchemy dialects. The LargeBinary data type will suffice, since it is translated to a "large and/or unlengthed binary type for the target platform" (bytea on PostgreSQL).
2) You don't need any encode/decode functions in PostgreSQL, at least in this specific case.
3) As I said before, it is not always a good strategy to save the files into the filesystem. In any case, do not use a text data type with base64 encoding: your data will be inflated by roughly 33%, resulting in a large storage impact, whereas bytea does not have that drawback.
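The ~33% figure is easy to check with a couple of lines of Python (purely illustrative, using random bytes in place of an image):
# Base64 emits 4 output bytes for every 3 input bytes, hence ~33% growth.
import base64
import os

raw = os.urandom(300 * 1024)               # 300 KiB of fake "image" bytes
encoded = base64.b64encode(raw)

print(len(raw))                            # 307200
print(len(encoded))                        # 409600
print(float(len(encoded)) / len(raw))      # ~1.33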
Thus, I propose these changes to your model:
class User_tbl(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    mobile = db.Column(db.String(13), unique=True)
    country = db.Column(db.String(30))
    image = db.Column(db.LargeBinary)
Then you can save files into Postgres simply by passing your FileStorage parameter as binary data:
image = request.files['fileimg'].read()
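Put together, a minimal upload route could look like the sketch below; the endpoint path and form field names are hypothetical, while User_tbl, app and db come from the setup above:
# Hypothetical upload endpoint: read the uploaded file as bytes and
# store it directly in the LargeBinary column.
from flask import request

@app.route('/users', methods=['POST'])
def create_user():
    image_bytes = request.files['fileimg'].read()
    user = User_tbl(mobile=request.form['mobile'],
                    country=request.form['country'],
                    image=image_bytes)
    db.session.add(user)
    db.session.commit()
    return 'saved', 201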
It would be far easier to avoid all of this encoding and decoding and simply save it as a binary blob. In which case, use a sqlalchemy.dialects.postgresql.BYTEA column.
I know of the encode and decode functions in PostgreSQL for dealing with base64 data, see:
https://www.postgresql.org/docs/current/static/functions-string.html
(encode/decode)
The recommended way to store an image in Postgres via Flask is to store the image in your static folder (where you store JavaScript & CSS files) and serve it via a web server, i.e. nginx, which will be able to do that more efficiently than Flask. You should store only the path to your image in Postgres and keep the actual image on the file system.
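A sketch of that path-in-the-database approach (the helper, the 'uploads' directory name, and the form field are made up; app and User_tbl are the ones from the question):
# Hypothetical helper: the file lands under the static folder (served by
# nginx), and only its relative path goes into the String column.
import os
from werkzeug.utils import secure_filename

def store_image_on_disk(file_storage):
    upload_dir = os.path.join(app.static_folder, 'uploads')
    if not os.path.isdir(upload_dir):
        os.makedirs(upload_dir)
    filename = secure_filename(file_storage.filename)
    file_storage.save(os.path.join(upload_dir, filename))
    return 'uploads/' + filename     # store this path in User_tbl.image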

Data migration from MongoDB to Teradata

We are working towards migrating data from MongoDB to Teradata (DW).
We feel that transformations on the data will be necessary.
Could you please help me answer the questions below, which will guide us in developing a solution for the migration:
Which would be the best and most efficient format for exporting data from MongoDB to load into Teradata (DW), considering the transformations involved? (CSV/JSON/other)
Transformations could include omission of line(s) from the exported file, omission of fields, aggregation (sum/count) across fields, etc.
If we develop a framework for ETL, would Java be a good choice?
We noticed that ‘\n’ [the newline character] is also part of some records. Hence, in the CSV we are seeing some blank lines in between.
Do we need to be concerned about the right line delimiter, or can the export format help us in this regard?
We are seeing some records get truncated because the length of the record exceeds 1024 characters.
We get the ‘Line too long’ message in the vi editor, and we don't have an alternative editor on our system. Is there a way around this line truncation?
CSV is not particularly well-specified - there are several variants of it in the wild with slightly different escaping behaviors. I almost always prefer anything-but-csv.
JSON
Yes
This is not a question, but ok.
Don't edit the data with vi; the 'Line too long' message is purely a limitation of the editor, not truncation in the export format. Do the transformations programmatically.
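For illustration, a small Python sketch of exporting MongoDB documents as JSON lines and doing the transformations programmatically; the database, collection, field names, and output file are all made up:
# Export MongoDB documents as JSON lines and apply simple transformations
# (drop fields, skip records) in code instead of editing the export by hand.
import json
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')
collection = client['salesdb']['orders']           # hypothetical db/collection

with open('orders.jsonl', 'w') as out:
    for doc in collection.find():
        doc.pop('_id', None)                        # omit a field
        if doc.get('status') == 'cancelled':        # omit whole records
            continue
        # json.dumps escapes embedded newlines as \n, so one record = one line
        out.write(json.dumps(doc, default=str) + '\n')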

How to solve this Arabic language problem in Sybase PowerBuilder 6 and 7?

How can I view Arabic characters correctly in Sybase PowerBuilder 6 or 7? I use Arial (Arabic), or any other Arabic font, in the properties of the table and the database, but it shows the characters as strange symbols that have no meaning, like ÓíÇÑÉ ÕÛíÑÉ.
I'm no expert in dealing with Arabic characters, so there may be a workaround with ANSI code pages, but I'd expect your best solution is Unicode. There was a distinct version of PB6 supporting Unicode (i.e. a separate product), but it was discontinued after PB6 and there was no Unicode support until it was integrated into the primary product in PB10. However, unless you have the PB6/Unicode product on hand, or you need Win9x support or some other old-platform support, I'd recommend moving to something more current, like the just-released PB 12.5. Not only will you get Unicode, but also a lot of features that will help your application look more up to date and integrate better with modern services. (See http://www.techno-kitten.com/Changes_to_PowerBuilder/changes_to_powerbuilder.html for a list that is a little out of date at the moment, but covers the majority of what you're after.)
Good luck,
Terry.
This problem is called Mojibake and it's due to the PowerBuilder client and the database using different character encodings. This problem is frequently encountered on the web, and also in email. As Terry suggested, you would get the best results using Unicode in the database and PowerBuilder. If that's not possible, you have to use the same code page on the PowerBuilder client as in the database. A complicating issue is that it sounds like you have existing data. If you want to switch encoding you would have to convert the existing data to the new encoding.
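As an aside, the garbled sample in the question can be reproduced in a couple of lines of Python; treating the stored bytes as Windows-1256 (Arabic) read back as Windows-1252 is my assumption about which code pages are involved, but it matches the symptoms:
# -*- coding: utf-8 -*-
# Mojibake illustration: Arabic text stored as Windows-1256 bytes,
# but rendered by a client that assumes Windows-1252.
arabic = u'سيارة صغيرة'                  # "small car"
stored_bytes = arabic.encode('cp1256')    # what the database actually holds
misread = stored_bytes.decode('cp1252')   # what a cp1252 client displays
print(misread)                            # ÓíÇÑÉ ÕÛíÑÉ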

Normalizing Unicode data for indexing (for Multi-byte languages): What products do this? Does Lucene/Hadoop/Solr?

I have several (1 million+) documents, email messages, etc., that I need to index and search through. Each document potentially has a different encoding.
What products (or configuration of those products) do I need to learn and understand to do this properly?
My first guess is something Lucene-based, but this is something I'm just learning as I go. My main desire is to start the time-consuming encoding process ASAP so that we can concurrently build the search front end. This may require some sort of normalisation of double-byte characters.
Any help is appreciated.
Convert everything to UTF-8 and run it through Normalization Form D, too. That will help for your searches.
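A minimal Python sketch of that advice; using the chardet package for encoding detection is my assumption, not something suggested in the thread:
# Detect (or guess) each document's encoding, convert to UTF-8,
# and apply Unicode Normalization Form D before indexing.
import unicodedata
import chardet

def to_normalized_utf8(path):
    raw = open(path, 'rb').read()
    guess = chardet.detect(raw)                     # {'encoding': ..., 'confidence': ...}
    text = raw.decode(guess['encoding'] or 'utf-8', 'replace')
    text = unicodedata.normalize('NFD', text)       # Normalization Form D
    return text.encode('utf-8')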
You could try Tika.
Are you implying you need to transform the documents themselves? This sounds like a bad idea, especially on a large, heterogeneous collection.
A good search engine will have robust encoding detection. Lucene does, and Solr uses it (Hadoop isn't a search engine). And I don't think it's possible to have a search engine that doesn't use a normalised encoding in its internal index format. So normalisation won't be a deciding criterion, though trying out the encoding detection would be.
I suggest you use Solr. The ExtractingRequestHandler handles encodings and document formats. It is relatively easy to get a working prototype using Solr. DataImportHandler enables importing a document repository into Solr.
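To give a rough idea of what pushing a document through the ExtractingRequestHandler looks like, here is a sketch using the Python requests library; the Solr URL, core name, document id, and file name are placeholders:
# Send one document to Solr's ExtractingRequestHandler (Solr Cell),
# which detects the format/encoding and extracts the text for indexing.
import requests

solr_extract_url = 'http://localhost:8983/solr/mycore/update/extract'

with open('message-0001.eml', 'rb') as fh:
    resp = requests.post(
        solr_extract_url,
        params={'literal.id': 'message-0001', 'commit': 'true'},
        files={'file': fh},
    )
print(resp.status_code)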

Storing parts of user data in files for preventing SQL injection

I am new to web programming and have been exploring issues related to web security.
I have a form where the user can post two types of data - lets call them "safe" and "unsafe" (from the point of view of sql).
Most places recommend storing both parts of the data in database after sanitizing the "unsafe" part (to make it "safe").
I am wondering about a different approach: storing the "safe" data in the database and the "unsafe" data in files (outside the database). Of course this approach creates its own set of problems related to maintaining the association between files and DB entries, but are there any other major issues with this approach, especially related to security?
UPDATE: Thanks for the responses! Apologies for not being clear regarding what I am considering "safe", so some clarification is in order. I am using Django, and the form data that I am considering "safe" is accessed through the form's "cleaned_data" dictionary, which does all the necessary escaping.
For the purpose of this question, let us consider a wiki page. The title of the wiki page does not need any styling attached to it, so it can be accessed through the form's "cleaned_data" dictionary, which will convert the user input to a "safe" format. But since I wish to give users the ability to arbitrarily style their content, I can't access the content part through the "cleaned_data" dictionary.
Does the file approach solve the security aspects of this problem? Or are there other security issues that I am overlooking?
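For context, a minimal sketch of the kind of form access described above; the form class and field names are made up. Note that cleaned_data gives you validated Python values, it is not by itself SQL escaping:
# Hypothetical Django form for the wiki-page example above.
from django import forms

class WikiPageForm(forms.Form):
    title = forms.CharField(max_length=200)
    content = forms.CharField(widget=forms.Textarea)

def handle_post(request):
    form = WikiPageForm(request.POST)
    if form.is_valid():
        title = form.cleaned_data['title']
        content = form.cleaned_data['content']
        # Both values should still reach the database through the ORM or
        # bound parameters, as the answers below point out.
        return title, content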
You know the "safe" data you're talking about? It isn't. It's all unsafe and you should treat it as such. Not by storing it al in files, but by properly constructing your SQL statements.
As others have mentioned, using prepared statements, or a library which which simulates them, is the way to go, e.g.
$db->Execute("insert into foo(x,y,z) values (?,?,?)", array($one, $two, $three));
What do you consider "safe" and "unsafe"? Are you considering data with the slashes escaped to be "safe"? If so, please don't.
Use bound variables with SQL placeholders. It is the only sensible way to protect against SQL injection.
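Since the question mentions Django, here is a minimal sketch of what bound variables look like there; the table and column names are made up, and in practice the Django ORM does this parameter binding for you:
# Bound parameters in Django: the driver does the quoting; the "unsafe"
# text is never interpolated into the SQL string itself.
from django.db import connection

def save_wiki_page(title, content):
    cursor = connection.cursor()
    cursor.execute(
        "INSERT INTO wiki_page (title, content) VALUES (%s, %s)",
        [title, content],     # passed separately from the SQL text
    )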
Splitting your data will not protect you from SQL injection; it will just limit the data which can be exposed through it, and exposure is not the only risk of the attack. Attackers can also delete data, add bogus data, and so on.
I see no justification for your approach, especially given that prepared statements are supported on many, if not all, development platforms and databases.
And that is without even getting into the nightmare that your approach will end up being to maintain.
In the end, why would you use a database if you don't trust it? Just use plain files if you wish; a mix is a no-no.
SQL injection can target the whole database, not only one user's data, and it is a matter of the query (a poisoned query). So for me the best way (if not the only way) to avoid a SQL injection attack is to control your query and protect it from having malicious characters injected, rather than splitting the storage.