SQLite3: Batch Insert? - iPhone

I've got some old code on a project I'm taking over.
One of my first tasks is to reduce the final size of the app binary.
Since the contents include a lot of text files (around 10,000 of them), my first thought was to create a database containing them all.
I'm not really used to SQLite and Core Data, so I've got basically two questions:
1 - Is my assumption correct? Should my SQLite file have a smaller size than all of the text files together?
2 - Is there any way of automating the task of getting them all into my newly created database (maybe using some kind of GUI or script), one file per record inside a single table?
I'm still experimenting with Core Data, but I've done a lot of searching already and could not find anything relevant to bringing everything together inside the database file. Doing it manually has already proven to be no easy task!
Thanks.

An alternative to using SQLite might be to use a zip file instead. This is easy to create, will surely save space, and will definitely reduce the number of files. There are several implementations for using zip files on the iPhone, e.g. ziparchive or TWZipArchive.
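If it helps, here is a minimal build-time sketch in Python for creating such an archive; the directory name "textfiles" and the archive name "bundle.zip" are made-up placeholders:

import os
import zipfile

# Bundle every .txt file under src_dir into one compressed archive.
def bundle_text_files(src_dir, archive_path):
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                if name.endswith(".txt"):
                    full = os.path.join(root, name)
                    # Store paths relative to src_dir so the archive stays portable.
                    zf.write(full, arcname=os.path.relpath(full, src_dir))

bundle_text_files("textfiles", "bundle.zip")

At runtime you would then unpack the archive (or read individual entries from it) with one of the iPhone zip libraries mentioned above.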

1 - It probably won't be any smaller, but you can compress the files before storing them in the database. Or without the database, for that matter.
2 - Sure. It shouldn't be too hard to write a script to do that.

If you're looking for a SQLite bulk insert command to write your script for 2), there isn't one AFAIK. Prepared insert statements in a loop inside a transaction are the best you can do; I imagine it would take only a few seconds (if that) to insert 10,000 records.
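For what it's worth, here is a rough sketch of such a loader script in Python (any language with a SQLite binding would do). The database name "docs.sqlite", the table "documents" and the directory "textfiles" are placeholders, and the zlib compression step is optional - it just picks up the compression suggestion from the answer above:

import os
import sqlite3
import zlib

conn = sqlite3.connect("docs.sqlite")
conn.execute("CREATE TABLE IF NOT EXISTS documents (name TEXT PRIMARY KEY, body BLOB)")

src_dir = "textfiles"
rows = []
for name in sorted(os.listdir(src_dir)):
    if name.endswith(".txt"):
        with open(os.path.join(src_dir, name), "rb") as f:
            # Compress each file before storing it; drop zlib.compress() to store plain text.
            rows.append((name, zlib.compress(f.read())))

# One transaction around all inserts; executemany reuses a single prepared statement.
with conn:
    conn.executemany("INSERT INTO documents (name, body) VALUES (?, ?)", rows)

conn.close()

On the device you would read the blob back with the SQLite C API and inflate it with zlib before use.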

Related

Compare tables to ensure non-regression in PostgreSQL

Here is my issue: I often need to compare the same PostgreSQL tables (or views that depend on them) before and after some ETL code refactoring, to check for regressions in my developments.
Let's say I have some ETL code I want to refactor, which regularly uploads data into a table. Currently, once my modifications are done, I first download the table's data from PostgreSQL as a .csv file, then empty the table, fill it again using my refactored code, and download the data again. Then I compare the .csv files, for instance with Python in a Jupyter Notebook.
That does not seem like the way to go at all. Notably, it supposes I am the only one using that table during the operation, among so many other issues I can't list them all here.
Is there a better way to go?
It sounds to me like you have the correct approach. There's no magic to the CSV export operation: whatever tool you use runs a query and formats its result set into the file. Any other before-and-after comparison would have to run the same query.
If you're doing this sort of regression test on an active database, it's probably wise to put some sort of distinctive tag on your test records, maybe prepend ETLTEST- to your customer names, so it's ETLTEST-John Bull. Then you can make your queries handle only your test records. And make sure you do something reliable for ORDER BY.
Jupyter seems a complex way to diff your CSV files. Most operating systems have lightweight, fast diff tools.
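If you want to stay in Python but skip the manual CSV round-trip, a small script can run the same ordered query before and after the refactor and compare the rows directly. A rough sketch with psycopg2; the connection strings, table name and ORDER BY column are placeholders (the "before" snapshot obviously has to be captured before you empty and refill the table):

import psycopg2

QUERY = "SELECT * FROM my_table WHERE customer_name LIKE 'ETLTEST-%' ORDER BY id"

def fetch_rows(dsn):
    # Run the comparison query and return all rows in a deterministic order.
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(QUERY)
        return cur.fetchall()

before = fetch_rows("dbname=etl_before user=me")  # snapshot from the old code
after = fetch_rows("dbname=etl_after user=me")    # same query after the refactored load

if len(before) != len(after):
    print(f"Row count changed: {len(before)} -> {len(after)}")
diffs = [(b, a) for b, a in zip(before, after) if b != a]
print(f"{len(diffs)} rows differ")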

Delete temporary files in PostgreSQL

I have a huge database of about 800 GB. When I tried to run a query which groups certain variables and aggregates the results, it stopped after running for a couple of hours. Postgres threw a message that disk space was full. After looking at the statistics I realized that the DB has about 400 GB of temporary files. I believe these temp files were created while I was running the query. My question is: how do I delete these temp files? Also, how do I avoid such problems - use cursors or for-loops to not process all the data at once? Thanks.
I'm using Postgres 9.2
The temporary files that get created in base/pgsql_tmp during query execution will get deleted when the query is done. You should not delete them by hand.
These files have nothing to do with temporary tables; they are used to store data for large hash or sort operations that would not fit in work_mem.
Make sure that the query is finished or canceled, try running CHECKPOINT twice in a row and see if the files are still there. If yes, that's a bug; did the PostgreSQL server crash when it ran out of disk space?
If you really have old files in base/pgsql_tmp that do not get deleted automatically, I think it is safe to delete them manually. But I'd file a bug with PostgreSQL in that case.
There is no way to avoid large temporary files if your execution plan needs to sort large result sets or needs to create large hashes. Cursors won't help you there. I guess that with for-loops you mean moving processing from the database to application code – doing that is usually a mistake and will only move the problem from the database to another place where processing is less efficient.
Change your query so that it doesn't have to sort or hash large result sets (check with EXPLAIN). I know that does not sound very helpful, but there's no better way. You'll probably have to do that anyway - or is a runtime of several hours acceptable for you?
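To check, you can look at the plan before running the whole thing. A small psycopg2 sketch (connection string and query are placeholders); plain EXPLAIN only plans the query, while EXPLAIN (ANALYZE, BUFFERS) actually executes it and additionally reports lines such as "Sort Method: external merge  Disk: ...":

import psycopg2

query = "SELECT category, count(*) FROM big_table GROUP BY category"

with psycopg2.connect("dbname=mydb user=me") as conn, conn.cursor() as cur:
    # Optionally give this session more memory per sort/hash node before planning.
    cur.execute("SET work_mem = '256MB'")
    cur.execute("EXPLAIN " + query)
    for (line,) in cur.fetchall():
        # Look for Sort / HashAggregate nodes sitting over large row estimates.
        print(line)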

In OpenEdge, how do you transfer parts of the data in the database in an easy way?

I have a lot of data in 2 different databases, in many different tables, that I would like to move from one computer to a few others. The others have the same DB definitions. Note: not all the data should be transferred, only some that I define - some tables fully, and some others just partly.
How would I move this data in the easiest way? Dumping each table and loading each of the many .d files separately is not an easy way. Could you do something similar to the incremental .df file, which contains everything that has to be changed?
Dumping (and loading) entire tables is easy. You can do it from the GUI or by command line. Look at, for instance, this KnowledgeBase entry about command-line dump & load, and this one about creating scripts for dumping the entire database.
Parts of the data are another story. This is very individual and depends on your database and your application. It's hard for a generic tool to compare data and tell whether a difference depends on changed data, added data or deleted data. Different databases have different layouts, keys and indices.
There are, however, several built-in commands that could help you:
For instance:
IMPORT and EXPORT for importing and exporting data to files, streams etc.
Basic import and export
/* Export table foo to a file */
OUTPUT TO c:\temp\foo.data.
FOR EACH foo NO-LOCK:
    EXPORT foo.
END.
OUTPUT CLOSE.

/* Read the file back in and create the records */
INPUT FROM c:\temp\foo.data.
REPEAT:
    CREATE foo.
    IMPORT foo.
END.
INPUT CLOSE.
BUFFER-COPY and BUFFER-COMPARE for copying and comparing data between tables (and possibly even databases).
You could also use the built-in commands for doing a "dump" and then manually edit the created files.
Calling Progress built-in commands
You can call the back end that dumps data from Data Administration. That will require you to extract those .p files from its archives and call them manually. This will also require you to change PROPATH etc., so it's not straightforward. You could also look into modifying the extracted files to your needs. Remember that this might break when upgrading Progress, so store your changes away in separate files.
Look at this Progress KB entry:
Progress KB 15884
The best way for you depends on whether this is a one-time or recurring task, the size and layout of the database, etc.

SQL Server: split mirrored DB onto multiple devices

Say I have a large mirrored 1 TB production DB that resides on a single MDF device, and I would like to split that up into, say, five 200 GB devices.
I want to do this without interruption to Production.
I thought I could break the mirror and use the RESTORE process for creating a mirror to achieve the split to multiple devices quickly and without interruption to Production. Doing this twice would allow me to get this done in a few hours.
Has anyone done this? Is it the preferred method, seeing as we are mirroring anyway?
What are my other alternatives, their pros and cons? And gotchas?
Also, I recall another, more organic process where one would create the 5 new devices and somehow, over time, get the objects to move over to them. I'm not sure of the process for this, but I seem to recall it being discussed. Sounds like this could take a long time and possibly cause some blocking at times?
Thanks
...Ray
This isn't quite as simple a process as it first looks. The reason is that just adding the files to SQL Server isn't enough: even if you were to add 4 new files, they would all be empty space. You would have one file with 1 TB of data in it and 4 empty ones, which would eventually fill up because SQL Server uses a proportional fill method for the files, but most of your queries would still be hitting the original file.
I take it you are doing this to improve performance? If so, you will need to move data around into different files in order to actually split the data up. Whether you can do this online or not depends on whether you are running Enterprise Edition or not (as this allows you to rebuild indexes online).
An easy way to move a table (or more accurately a clustered index, which is pretty much the same thing as the table for all intents and purposes) is to add a new filegroup with a new data file and then rebuild the clustered index specifying the new filegroup. You can use the following to do this:
CREATE CLUSTERED INDEX Existing_Index_Name ON schema_name.table_name (column_name)
WITH (DROP_EXISTING = ON, ONLINE = ON) ON [new_filegroup_name]
GO
This code will create the new index on the new filegroup and get rid of the old one, and if you are running Enterprise Edition, it will do it all without blocking the users.
See the following link for more methods of moving the data between filegroups:
Move data between SQL Server database filegroups
You should also look into partitioning your tables to help improve performance too:
Partitioning Tables and Indexes
With regard to your mirroring setup, you should break the mirror, then on the primary add all your files/filegroups, then move the data between the filegroups, then back up the modified database on the primary, restore it on the mirror (so all the files are set up the same on the mirror) and then re-set up your mirroring.
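If it helps to see the "add your files/filegroups" step spelled out, here is a rough sketch of the T-SQL, driven from a small pyodbc script; the database name, filegroup names, file paths, sizes and driver string are all placeholders, not a recommendation for your exact layout:

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
    "DATABASE=master;Trusted_Connection=yes",
    autocommit=True,
)
cur = conn.cursor()

# Add four extra filegroups, each with one 200 GB data file on its own device.
for i in range(1, 5):
    cur.execute(f"ALTER DATABASE MyDb ADD FILEGROUP FG{i}")
    cur.execute(
        f"ALTER DATABASE MyDb ADD FILE "
        f"(NAME = MyDb_Data{i}, FILENAME = 'E:\\Data{i}\\MyDb_Data{i}.ndf', SIZE = 200GB) "
        f"TO FILEGROUP FG{i}"
    )

After that you would rebuild the clustered indexes into the new filegroups with CREATE CLUSTERED INDEX ... WITH (DROP_EXISTING = ON) as shown above.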

Sharing a file among several processes [Perl]

I have an application that updates a CSV file (a single one). The CSV is being updated randomly from several processes, and I guess that if two processes try to update it (add a row, ...) at the same time, some data will be lost or overwritten(?).
What is the best way to avoid this?
thanks,
Use Perl's DBI with the DBD::CSV driver to access your data; that'll take care of the flocking for you. (Unless you're using Windows 95 or the old Mac OS.) If you decide to switch to an RDBMS later on, you'll be well prepared.
Simple flocking as suggested by @Fluff should also be fine, of course.
If you want a simple and manual way to take care of file locking:
1) As soon as a process opens the CSV, it creates a lock. (The lock can be in the form of creating a dummy file. The process has to delete the file/lock as soon as it is done reading/updating the CSV.)
2) Have each process check for the lock file before trying to update the CSV. (If the dummy file is present, some other process is accessing the CSV; otherwise it can update the CSV.)
A rough sketch of this idea follows.
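Here is a minimal sketch of that dummy-lock-file idea, shown in Python for brevity (the same pattern translates directly to Perl); the file names, retry count and delay are made up:

import csv
import os
import time

LOCK = "data.csv.lock"

def append_row(path, row, retries=50, delay=0.1):
    for _ in range(retries):
        try:
            # O_CREAT | O_EXCL makes creation atomic: it fails if the lock file already exists.
            fd = os.open(LOCK, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            time.sleep(delay)  # another process holds the lock; wait and retry
            continue
        try:
            with open(path, "a", newline="") as f:
                csv.writer(f).writerow(row)
            return True
        finally:
            os.close(fd)
            os.remove(LOCK)  # release the lock
    return False  # could not acquire the lock in time

append_row("data.csv", ["alice", 42])

Note that the flock-based approaches above are usually preferable, since a dummy lock file can be left behind if a process dies before removing it.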