How to import a large amount of data from a file into SQLite inside the application (in real time) - iPhone

I have a big list of words (over 2 million) in a CSV file (about 35 MB).
I wanted to import the CSV file into SQLite3 with an index (primary key).
So I imported it using the sqlite3 command-line tool. The DB was created, and the .sqlite file grew to over 120 MB (about 50% of that is the primary key index).
And here is the problem: if I add this 120 MB .sqlite file to the resources, the .ipa file is still over 60 MB even after compression, and I'd like it to be less than 30 MB (because of the E/3G over-the-air download limit).
I also cannot download it (the zipped SQLite file) from a web service because of its size (45 MB × 1000 downloads = 45 GB, which is half a year of my server's traffic limit).
So I thought I could do something like this:
compress the CSV file of words into a ZIP; the file is then only about 7 MB.
add the ZIP file to the resources.
in the application, unzip the file and import the data from the unzipped CSV file into SQLite.
But I don't know how to do that last step. I've tried:
sqlite3_exec(sqlite3_database, ".import mydata.csv mytable", callback, 0, &errMsg);
but it doesn't work. The reason for the failure is that ".import" is part of the command-line shell, not of the C API.
So I need to know how to import the unzipped CSV file into the SQLite database inside the app (not during development using the command line).

If the words you are inserting are unique, you could make the text itself the primary key.
If you only want to test whether words exist in a set (say, for a spell checker), you could use an alternative data structure such as a Bloom filter, which only requires about 9.6 bits per word with 1% false positives.
http://en.wikipedia.org/wiki/Bloom_filter
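
Purely as an illustration (not part of the original answer), a minimal Bloom filter in C could look like the sketch below. The sizing constants and names are assumptions: roughly 10 bits per word with 7 hash functions gives about a 1% false-positive rate, so 2 million words fit in roughly 2.5 MB.

/* Minimal Bloom filter sketch. All sizing constants and names are
 * illustrative assumptions, not taken from the question or answer. */
#include <stdint.h>
#include <stdlib.h>

#define NUM_WORDS     2000000UL
#define BITS_PER_WORD 10UL                 /* ~9.6 bits rounded up        */
#define NUM_BITS      (NUM_WORDS * BITS_PER_WORD)
#define NUM_HASHES    7                    /* k = 7 gives ~1% false hits  */

static uint8_t *filter;                    /* NUM_BITS / 8 bytes, zeroed  */

static int bloom_init(void) {
    filter = calloc(NUM_BITS / 8 + 1, 1);
    return filter != NULL;
}

/* 64-bit FNV-1a, with a seed so two independent hashes can be derived. */
static uint64_t fnv1a(const char *s, uint64_t seed) {
    uint64_t h = 0xcbf29ce484222325ULL ^ seed;
    while (*s) { h ^= (uint8_t)*s++; h *= 0x100000001b3ULL; }
    return h;
}

static void bloom_add(const char *word) {
    uint64_t h1 = fnv1a(word, 0), h2 = fnv1a(word, 0x9e3779b97f4a7c15ULL);
    for (int i = 0; i < NUM_HASHES; i++) {
        uint64_t bit = (h1 + (uint64_t)i * h2) % NUM_BITS;
        filter[bit / 8] |= (uint8_t)(1u << (bit % 8));
    }
}

/* Returns 0 if the word is definitely absent, 1 if it is probably present. */
static int bloom_maybe_contains(const char *word) {
    uint64_t h1 = fnv1a(word, 0), h2 = fnv1a(word, 0x9e3779b97f4a7c15ULL);
    for (int i = 0; i < NUM_HASHES; i++) {
        uint64_t bit = (h1 + (uint64_t)i * h2) % NUM_BITS;
        if (!(filter[bit / 8] & (uint8_t)(1u << (bit % 8))))
            return 0;
    }
    return 1;
}

The filter could be built once on first launch from the unzipped CSV and then written to disk as a flat ~2.5 MB file, avoiding the 120 MB database entirely.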

As FlightOfStairs mentioned, depending on the requirements a Bloom filter is one solution; if you need the full data, another option is a trie or radix tree. You would preprocess your data, build the data structure, and then either put it in SQLite or in some other external data format; a rough sketch of a trie follows.
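
For illustration only, a bare-bones trie for lowercase ASCII words might look like this (the names are hypothetical; a real radix tree would compress chains of single-child nodes, and for 2 million words you would want a compressed, serialized form rather than one heap node per character):

/* Minimal trie sketch for membership tests on lowercase ASCII words. */
#include <stdlib.h>

typedef struct TrieNode {
    struct TrieNode *child[26];   /* one slot per letter 'a'..'z' */
    int is_word;                  /* nonzero if a word ends here  */
} TrieNode;

static TrieNode *trie_new_node(void) {
    return calloc(1, sizeof(TrieNode));
}

static void trie_insert(TrieNode *root, const char *word) {
    for (; *word; word++) {
        int i = *word - 'a';
        if (i < 0 || i >= 26) return;          /* give up on unexpected characters */
        if (!root->child[i]) root->child[i] = trie_new_node();
        root = root->child[i];
    }
    root->is_word = 1;
}

static int trie_contains(const TrieNode *root, const char *word) {
    for (; *word && root; word++) {
        int i = *word - 'a';
        if (i < 0 || i >= 26) return 0;
        root = root->child[i];
    }
    return root != NULL && root->is_word;
}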

The simplest solution would be to write a CSV parser using NSScanner and insert the rows into the database one by one. That's actually a fairly easy job; you can find a complete CSV parser here, and a sketch of the insert loop follows.
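
For the insert side, here is a rough sketch in plain C with the SQLite C API (the table layout and column name are assumptions; the answer's NSScanner parser would replace the naive fgets loop for real CSV). The important parts are the single transaction and the reused prepared statement, which make millions of inserts practical on the device:

/* Sketch: import a one-word-per-line CSV into an already-open database.
 * Table assumed: CREATE TABLE mytable(word TEXT PRIMARY KEY). */
#include <sqlite3.h>
#include <stdio.h>
#include <string.h>

int import_csv(sqlite3 *db, const char *csv_path) {
    FILE *fp = fopen(csv_path, "r");
    if (!fp) return SQLITE_ERROR;

    sqlite3_stmt *stmt = NULL;
    char line[256];
    int rc = sqlite3_exec(db, "BEGIN TRANSACTION", NULL, NULL, NULL);
    if (rc == SQLITE_OK)
        rc = sqlite3_prepare_v2(db, "INSERT INTO mytable (word) VALUES (?)",
                                -1, &stmt, NULL);

    while (rc == SQLITE_OK && fgets(line, sizeof line, fp)) {
        line[strcspn(line, "\r\n")] = '\0';         /* strip the newline   */
        if (line[0] == '\0') continue;              /* skip blank lines    */
        sqlite3_bind_text(stmt, 1, line, -1, SQLITE_TRANSIENT);
        if (sqlite3_step(stmt) != SQLITE_DONE) rc = SQLITE_ERROR;
        sqlite3_reset(stmt);                        /* reuse the statement */
    }

    sqlite3_finalize(stmt);
    sqlite3_exec(db, rc == SQLITE_OK ? "COMMIT" : "ROLLBACK", NULL, NULL, NULL);
    fclose(fp);
    return rc;
}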

Related

Which is the fastest way to read a few lines out of a large HDFS dir using Spark?

My goal is to read a few lines from a large HDFS dir; I'm using Spark 2.2.
The dir was generated by a previous Spark job, and each task generated a single little file in it, so the whole dir is about 1 GB and contains thousands of little files.
When I use collect() or head() or limit(), Spark loads all the files and creates thousands of tasks (as seen in the Spark UI), which costs a lot of time, even though I just want to show the first few lines of the files in this dir.
So what is the fastest way to read this dir? Ideally the solution would load only a few lines of data, so it would save time.
Following is my code:
sparkSession.sqlContext.read.format("csv").option("header","true").option("inferschema","true").load(file).limit(20).toJSON.toString()
sparkSession.sql(s"select * from $file").head(100).toString
sparkSession.sql(s"select * from $file").limit(100).toString
If you point Spark directly at the directory, it will load all the files anyway and only then take the requested records. So before the Spark logic, list the directory with whatever technology you use (Java, Scala, Python, etc.), pick a single file name, and pass that file name to the read/textFile method; that way Spark won't load all the files.

In OpenEdge, how do you transfer parts of the data in the database in an easy way?

I have a lot of data in 2 different databases, in many different tables, that I would like to move from one computer to a few others. The others have the same DB definitions. Note that not all the data should be transferred, only some that I define: some tables fully, and some others just partly.
How would I move this data in the easiest way? Dumping each table and loading it separately from many .d files is not an easy way. Could you do something similar to the incremental .df file that contains everything that has to be changed?
Dumping (and loading) entire tables is easy. You can do it from the GUI or from the command line. Look at, for instance, this KnowledgeBase entry about command-line dump & load and this one about creating scripts for dumping the entire database.
Parts of the data are another story. This is very individual and depends on your database and your application. It's hard for a generic tool to compare data and tell whether a difference comes from changed, added or deleted data. Different databases have different layouts, keys and indices.
There are, however, several built-in statements that could help you. For instance:
IMPORT and EXPORT, for importing and exporting data to and from files, streams etc.
Basic import and export:

/* Dump every foo record to a text file. */
OUTPUT TO c:\temp\foo.data.
FOR EACH foo NO-LOCK:
    EXPORT foo.
END.
OUTPUT CLOSE.

/* Read the file back in, creating one record per line. */
INPUT FROM c:\temp\foo.data.
REPEAT:
    CREATE foo.
    IMPORT foo.
END.
INPUT CLOSE.
BUFFER-COPY and BUFFER-COMPARE for copying and comparing data between tables (and possibly even databases).
You could also use the built-in commands for doing a "dump" and then manually edit the created files.
Calling Progress built-in commands
You can call the back end that Data Administration uses to dump data. That will require you to extract the .p files from its archives and call them manually. It will also require you to change your PROPATH etc., so it's not straightforward. You could also look into modifying the extracted files to your needs. Remember that this might break when upgrading Progress, so store your changes away in separate files.
Look at this Progress KB entry:
Progress KB 15884
The best way for you depends on whether this is a one-time or recurring task, on the size and layout of the database, etc.

Performance in MongoDB and GridFS

I am developing a plugin that uses MongoDB. The plugin has to store some .dcm files (DICOM files) in the database as binary data. It also has to store the metadata of each file and be able to query on that metadata only.
Naturally, I chose GridFS for this, because I can use the same file to store the binary data in the chunks collection and the metadata in the metadata field of the files collection (and bypass MongoDB's document size limit).
But another problem came up. This solution would be great, but I am not storing the binary data and the metadata at the same time. Let me explain: first I store the binary file, and only after that do I retrieve the file, read the metadata from it and store the metadata into the same GridFS file. This is an obligation for me for some external reasons. So I lose a lot of time retrieving the file and storing it again. To update the metadata of a file that is already stored, I am using this code:
GridFSDBFile file = saveFs.findOne(uri.getFileName());
if (file == null) {
    return false;
} else {
    // Replace the stored metadata (placeholder object here) and save the file document again.
    file.setMetaData(new BasicDBObject());
    file.save();
    return true;
}
The main problem is that I have to find the file before modifying it, and then store it AGAIN!
So my first question is: is there a better way to retrieve a file from the database than findOne(String fileName)? Is findOne(ObjectId id) faster? (I don't think so, because I think filename is already indexed by default, isn't it?)
I have tried another way to do it. To bypass this problem, I decided to store 2 different files: one for the binary data and one for the metadata. In this case I don't lose time retrieving the file from the database, but I end up with twice as many files... I am almost sure there is a better way to do it!
So my second question: do you think I should use 2 different collections? One that uses GridFS to store the binary data, and another that uses classic MongoDB storage (or GridFS) to store only the metadata?
Thank you a lot for reading me and for your answer :).
For your first question: both the _id and filename fields are indexed by default. While the _id field is unique, filename is not, so if you have files with the same filename, getting a file by filename will be relatively slower than getting it by _id.
For your second question: you can always attach metadata to any GridFS file you insert, so you don't need anything beyond GridFS. Use GridFS to insert the data, but just before inserting, assign your metadata to the file you are about to insert. That way you can query files using the metadata. If the metadata fields are fixed for all documents, you can index those fields too, and they are of course queryable.

SQLite3: Batch Insert?

I've got some old code on a project I'm taking over.
One of my first tasks is to reduce the final size of the app binary.
Since the contents include a lot of text files (around 10,000 of them), my first thought was to create a database containing them all.
I'm not really familiar with SQLite and Core Data, so I've got basically two questions:
1 - Is my assumption correct? Should my SQLite file be smaller than all of the text files together?
2 - Is there any way to automate the task of getting them all into my newly created database (maybe using some kind of GUI or script), one file per record inside a single table?
I'm still experimenting with Core Data, but I've done a lot of searching already and could not find anything relevant to bringing everything together inside the database file. Doing it manually has already proven to be no easy task!
Thanks.
An alternative to using SQLite might be to use a zip file instead. This is easy to create, and will surely save space (and definitely reduce the number of files). There are several implementations for using zip files on the iPhone, e.g. ziparchive or TWZipArchive.
1 - It probably won't be any smaller, but you can compress the files before storing them in the database. Or without the database, for that matter.
2 - Sure. It shouldn't be too hard to write a script to do that.
If you're looking for a SQLite bulk insert command to write your script for 2), there isn't one AFAIK. Prepared insert statements in a loop inside a transaction is the best you can do; I imagine it would take only a few seconds (if that) to insert 10,000 records. A rough sketch follows.
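
For reference, that loop might look like the following sketch with the SQLite C API (the docs(name, body) schema and all names are placeholders, and error handling is kept minimal):

/* Sketch: insert many text files into one table inside a single transaction. */
#include <sqlite3.h>
#include <stdio.h>
#include <stdlib.h>

static char *read_whole_file(const char *path, long *len) {
    FILE *fp = fopen(path, "rb");
    if (!fp) return NULL;
    fseek(fp, 0, SEEK_END);
    *len = ftell(fp);
    rewind(fp);
    char *buf = malloc((size_t)*len + 1);
    if (buf) fread(buf, 1, (size_t)*len, fp);
    fclose(fp);
    return buf;
}

int import_files(sqlite3 *db, const char **paths, int count) {
    sqlite3_stmt *stmt = NULL;
    sqlite3_exec(db, "BEGIN", NULL, NULL, NULL);
    int rc = sqlite3_prepare_v2(db,
        "INSERT INTO docs (name, body) VALUES (?, ?)", -1, &stmt, NULL);

    for (int i = 0; rc == SQLITE_OK && i < count; i++) {
        long len = 0;
        char *body = read_whole_file(paths[i], &len);
        if (!body) { rc = SQLITE_ERROR; break; }
        sqlite3_bind_text(stmt, 1, paths[i], -1, SQLITE_STATIC);
        sqlite3_bind_text(stmt, 2, body, (int)len, SQLITE_TRANSIENT);
        if (sqlite3_step(stmt) != SQLITE_DONE) rc = SQLITE_ERROR;
        sqlite3_reset(stmt);
        free(body);
    }

    sqlite3_finalize(stmt);
    sqlite3_exec(db, rc == SQLITE_OK ? "COMMIT" : "ROLLBACK", NULL, NULL, NULL);
    return rc;
}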

Can I store a SQLite DB as a zip file in an iPhone application?

My SQLite file has a size of 7 MB. I want to reduce its size. How can I do that? When I simply compress it, it comes to only around 1.2 MB. Can I compress my mydb.sqlite to a zip file? If that is not possible, is there any other way to reduce the size of my SQLite file?
It is possible to compress it beforehand, but that is largely redundant. Your binary is compressed before distribution: Apple distributes your app through the store compressed, and compressing an already-compressed file is fruitless. Thus, any work you do to compress beforehand should not have much of an effect on the resulting size of your application.
Without details of what you are storing in the DB, it's hard to give specific advice. The usual generic rules of DB design apply: normalise your database. For example:
Reduce/remove repeating data. If you have text/data that is repeated, store it once and use a key to reference it.
If you are storing large chunks of data, you might be able to zip and unzip these in and out of the database in your app code rather than trying to zip the whole DB (a sketch follows).
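
As an illustration of that last point, the chunks can be deflated with zlib (libz ships with iOS) before being bound as BLOBs, and inflated again after reading; the function names below are placeholders:

/* Sketch: compress a buffer before storing it, decompress after reading it.
 * The original (uncompressed) length must be stored alongside the blob so
 * the read side knows how big an output buffer to allocate. */
#include <zlib.h>
#include <stdlib.h>

/* Compress src (srcLen bytes); the caller frees *dst. Returns Z_OK on success. */
int deflate_chunk(const unsigned char *src, uLong srcLen,
                  unsigned char **dst, uLongf *dstLen) {
    *dstLen = compressBound(srcLen);
    *dst = malloc(*dstLen);
    if (!*dst) return Z_MEM_ERROR;
    return compress2(*dst, dstLen, src, srcLen, Z_BEST_COMPRESSION);
}

/* Decompress into a caller-supplied buffer of the known original size. */
int inflate_chunk(const unsigned char *src, uLong srcLen,
                  unsigned char *dst, uLongf originalLen) {
    return uncompress(dst, &originalLen, src, srcLen);
}

The compressed buffer can then be stored with sqlite3_bind_blob(); whether this is worth it depends on how compressible the stored data actually is.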