Import a CSV into Matlab in multiple parts - matlab

I have a very large CSV file (870 MB) that I'm trying to import into Matlab. Some of the data are numeric and some are text. I have 16 GB of RAM and an SSD, but the script generated by the import wizard is using 37 GB and doesn't progress past 0% "scanning file" even after a couple of hours.
Is there a way to break up the import wizard script so that it imports the first 500,000 rows, saves them to variables and empties dataArray, then imports the next 500,000 rows and appends them to the variables, and so on until the file is complete? I'm surprised that Matlab doesn't do something like this natively.
Thank you for your help.

Have a look at the memory-mapping approach I described here. If you know the format of your file, or can deduce it from the contents, I have found this to be the fastest way to read large CSV files into Matlab. It also helps reduce memory usage.
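For what it's worth, here is a rough sketch of the chunked approach the question asks about (reading 500,000 rows at a time with textscan), rather than the memory-mapped reader from the link. The file name, format string and header handling are placeholders you would adapt to your data; newer Matlab releases also provide datastore/tabularTextDatastore, which handle block-wise reading of a CSV for you.

% Chunked import sketch: read 500,000 rows per call to textscan, collect each
% block, then stitch the columns together at the end.
fmt   = '%f%f%s';                  % placeholder: two numeric columns, one text column
chunk = 500000;                    % rows per block
fid   = fopen('big.csv', 'r');     % placeholder file name
fgetl(fid);                        % skip the header line (remove if there is none)

blocks = {};
while ~feof(fid)
    blocks{end+1} = textscan(fid, fmt, chunk, 'Delimiter', ','); %#ok<AGROW>
end
fclose(fid);

% Reassemble each column from the blocks
num1 = cellfun(@(b) b{1}, blocks, 'UniformOutput', false);
num1 = vertcat(num1{:});           % first numeric column as one vector
txt3 = cellfun(@(b) b{3}, blocks, 'UniformOutput', false);
txt3 = vertcat(txt3{:});           % text column as one cell array of strings

Reading in fixed-size blocks keeps peak memory at roughly one parsed block plus the columns already assembled, rather than the whole parsed file at once.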

Related

Import free text files

I have been asked to do NLP on a folder of free text documents in SAS. Normally I do this in Python or R, and I am not sure how to import the txt files into SAS because there is no structure.
I have thought about using proc import but don't know what I would use as a delimiter. How can one import free text files with no structure into SAS? I suppose once I got them in I could use '%like%'-style matching to pull out what they want.
I would strongly recommend against this. Use the right tool for the right job; in this case, it's not SAS.
OK, that being said, some basics you could do:
Import the text files and create n-grams, ideally 1-, 2- and 3-word grams.
Use PROC FREQ to summarize the n-grams.
Find a parts-of-speech corpus and merge that with the 1-grams to remove useless words.
Calculate word length and sentence length to create a document complexity score.
Those are all doable in Base SAS.

Memory issues with large amounts of data stored as nested cells in MATLAB

I have large amounts of data stored as nested cells in .mat files. My biggest problem right now is the load times for accessing these files, but I'm wondering if the underlying problem is that I came up with an inefficient way of storing the data and I should restructure it to be smaller.
The full file consists of a cell array:
Hemi{1,h} where there are 52 versions of h
.{n,p} where there are 85 versions of n and up to ~100 versions of p
.Variable where there are 10 variables, each with ~2500 values
This full file ate up all my memory, so I saved it in parts, aka:
Hemi1.mat=Hemi{1,1}
Hemi2.mat=Hemi{1,2}
etc.
The next step for this application is to load each file, determine which part of it is an appropriate solution (I need Hemi{1,h}.{n,p}.Var1, Hemi{1,h}.{n,p}.Var2, and Hemi{1,h}.{n,p}.Var3 for this, but I still need to keep track of the other Variables), save the solution, then close the file and move to the next one.
Is there a faster way to load these files?
Is the problem less my dataset and more how I've chosen to store it? Is there a better alternative?
That is quite a lot of data. I have a few suggestions that you could look into. The first is to see if you can change the datatypes to something like categorical arrays; they are far more memory efficient. Also, if you are storing strings as your final data, this can be quite heavy storage-wise.
Second, you could look into HDF5 file storage. I hear it is a nice way to store structured data.
Finally, you could try converting your {n,p} arrays into table structures. I am not sure if this is better for memory, but tables are nice to work with and it may help you out. (Depending on your version of Matlab you may not have tables :P)
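To make the table suggestion concrete, here is a rough sketch under a few assumptions: each part file holds one 85-by-~100 cell of structs, every struct has the same fields (Var1 ... Var10 with ~2500 values each, as in the question), and the file and variable names below are placeholders.

% Flatten one hemisphere's {n,p} cell of structs into a single table.
S = load('Hemi1.mat');                 % one part file, e.g. Hemi1.mat = Hemi{1,1}
H = S.Hemi1;                           % assumed to be the 85-by-~100 cell of structs

rows = {};
for n = 1:size(H,1)
    for p = 1:size(H,2)
        if isempty(H{n,p}), continue; end
        s = H{n,p};
        f = fieldnames(s);
        for k = 1:numel(f)             % force each value into a 1-by-K row so the
            s.(f{k}) = reshape(s.(f{k}), 1, []);   % structs concatenate into table rows
        end
        s.n = n;  s.p = p;             % keep the indices as table variables
        rows{end+1} = s;               %#ok<AGROW>
    end
end
T = struct2table([rows{:}]);           % assumes equal-length fields across structs;
                                       % wrap ragged data in cells before this step

% Saving with -v7.3 uses an HDF5-based MAT-file format; text-valued variables
% can be converted to categorical first to cut memory further.
save('Hemi1_table.mat', 'T', '-v7.3');

With one table per hemisphere, the solution step can pull out just the columns it needs, e.g. T(:, {'n','p','Var1','Var2','Var3'}), while the remaining variables stay in the same table for later.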
I hope this helps!
-Kyle

NetLogo BehaviorSpace memory size constraint

In my model I'm using BehaviorSpace to carry out a number of runs, with variables changing for each run and the output being stored in a *.csv for later analysis. The model runs fine for the first few iterations, but quickly slows as the data grows. My question is: will file-flush, when used in BehaviorSpace, help with this? Or is there a way around it?
Cheers
Simon
Make sure you are using table format output and that spreadsheet format output is disabled. At http://ccl.northwestern.edu/netlogo/docs/behaviorspace.html we read:
Note however that spreadsheet data is not written to the results file until the experiment finishes. Since spreadsheet data is stored in memory until the experiment is done, very large experiments could run out of memory. So you should disable spreadsheet output unless you really want it.
Note also:
doing runs in parallel will multiply the experiment's memory requirements accordingly. You may need to increase NetLogo's memory ceiling (see this FAQ entry).
where the linked FAQ entry is http://ccl.northwestern.edu/netlogo/docs/faq.html#howbig
Using file-flush will not help. It flushes any buffered data to disk, but only for a file you opened yourself with file-open, and anyway, the buffer associated with a file is fixed-size, not something that grows over time. file-flush is really only useful if you're reading from the same file from another process during a run.

Manipulating large csv files with Matlab

I am trying to work with a large set of numerical data stored in a csv file. It is so big that I cannot store it in a single variable, as Matlab does not have enough memory.
I was wondering if there is some way to manipulate large csv files in Matlab as if they were variables, i.e. I want to sort them, delete some rows, find the row and column of certain values, etc.
If that is not possible, what programming language do you recommend to do that, considering that the data is stored in a matrix form?
You can import the csv file into a database. E.g. sqlite - https://sqlite.org/cvstrac/wiki?p=ImportingFiles
Take one of the sqlite Toolboxes for Matlab, e.g. http://go-sqlite.osuv.de/doc/
You should be able to select single rows and columns via the SQL language and import them into Matlab, or use SQLite functions (for sort -> ORDER BY, etc.).
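Here is a rough sketch of that workflow from the Matlab side, assuming the CSV has already been imported into data.db as a table named mytable (e.g. with the sqlite3 shell's .import command) and that the Database Toolbox's sqlite interface is available. The go-sqlite toolbox linked above exposes similar calls under its own names; all column names below are placeholders.

% Open the SQLite file and let SQL do the sorting/filtering, so only the rows
% you actually ask for are pulled into Matlab memory.
conn = sqlite('data.db');

rows = fetch(conn, ['SELECT col1, col2 FROM mytable ' ...
                    'WHERE col3 > 100 ORDER BY col1 LIMIT 50000']);

% Deleting rows or locating values is plain SQL as well, e.g.
%   DELETE FROM mytable WHERE col2 IS NULL
%   SELECT rowid, col1 FROM mytable WHERE col1 = 42
% run through exec/execute, depending on the release.

close(conn);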
Another option is to access CSV files directly as if they were a SQL database, using q. See https://github.com/harelba/q

How to import big amount of data from file to sqlite inside the application (in real-time)

I have a big list of words (over 2 million) in a CSV file (about 35 MB in size).
I wanted to import the CSV file into sqlite3 with an index (primary key).
So I imported it using the sqlite command line tool. The DB has been created, and the size of the .sqlite file has grown to over 120 MB! (50% of that is because of the primary key index.)
And here we get the problem: if I add this 120 MB .sqlite file to the resources, then even after compression the .ipa file is over 60 MB, and I'd like it to be less than 30 MB (because of the download limit over E/3G).
Also, because of the size, I cannot deliver it (the zipped sqlite file) through a web service (45 MB * 1000 downloads = 45 GB! That's half a year of my server's traffic limit).
So I thought I could do something like this:
compress the CSV file with the words into a ZIP, so the file is only about 7 MB;
add the ZIP file to the resources;
in the application, unzip the file and import the data from the unzipped CSV file into sqlite.
But I don't know how to do this. I've tried:
sqlite3_exec(sqlite3_database, ".import mydata.csv mytable", callback, 0, &errMsg);
but it doesn't work. The reason for the failure is that ".import" is part of the command line interface and not of the C API.
So I need to know how to import the unzipped CSV file into the SQLite database inside the app (not during development using the command line).
If the words that you are inserting are unique, you could make the text itself the primary key.
If you only want to test whether words exist in a set (say for a spell checker), you could use an alternative data structure such as a bloom filter, which only requires 9.6 bits for each word with 1% false positives.
http://en.wikipedia.org/wiki/Bloom_filter
As FlightOfStairs mentioned, depending on the requirements a bloom filter is one solution. If you need the full data, another solution is to use a trie or radix tree data structure. You would preprocess your data, build these data structures, and then either put them in sqlite or some other external data format.
The simplest solution would be to write a CSV parser using NSScanner and insert the rows into the database one by one. That's actually a fairly easy job; you can find a complete CSV parser here.