MATLAB - How to load and handle a big TXT file (32GB)

First of all, sorry about my English...
I would like to know a better way to load and handle a big TXT file (around 32GB, an 83,000,000 x 66 matrix). I have already tried some experiments with TEXTSCAN, IMPORT (out of memory), fgets, fgetl, .... Except for the IMPORT approach, all of these methods work but take too much time (much more than a week).
I aim to use this data set to run my sampling process and, after that, train a neural network to learn the behaviour.
Does anyone know how to import this kind of data faster? I am thinking of dumping the database into another format (instead of TXT), for example SQL Server, and then working with the data through database queries.
Another question: after loading all the data, can I save it in .MAT format and work with that format in my experiments? Or is there a better idea?
Thanks in advance.

It's impossible to hold such a big matrix (5,478,000,000 values) in your workspace/memory (unless you've got tons of RAM), so the file format (.mat or .csv) doesn't matter!
You definitely have to use a database (or split the file into several smaller ones and calculate step by step, which also takes very long).
Personally, I only have experience with sqlite3; I did something similar with a 1.47-million x 23 matrix/CSV file.
http://git.osuv.de/markus/sqlite-demo (keep in mind that my csv2sqlite.m was designed to run with GNU Octave only; the import took about 19k seconds overnight ...well, it was badly scripted too :) )
After everything was imported into the sqlite3 database, I can access just the data I need within 8-12 seconds (take a look at the comment header of leistung.m).
If your CSV file is well-formed, you can simply import it with sqlite3 itself.
For example:
┌─[markus#x121e]─[/tmp]
└──╼ cat file.csv
0.9736834199195674,0.7239387515366997,0.3382008456696883
0.6963824911102146,0.8328410999877027,0.5863203843393815
0.2291736458336333,0.1427739134201017,0.8062332551565472
┌─[markus#x121e]─[/tmp]
└──╼ sqlite3 csv.db
SQLite version 3.8.4.3 2014-04-03 16:53:12
Enter ".help" for usage hints.
sqlite> CREATE TABLE csvtest (col1 TEXT NOT NULL, col2 TEXT NOT NULL, col3 TEXT NOT NULL);
sqlite> .separator ","
sqlite> .import file.csv csvtest
sqlite> select * from csvtest;
0.9736834199195674,0.7239387515366997,0.3382008456696883
0.6963824911102146,0.8328410999877027,0.5863203843393815
0.2291736458336333,0.1427739134201017,0.8062332551565472
sqlite> select col1 from csvtest;
0.9736834199195674
0.6963824911102146
0.2291736458336333
All of this is done with https://github.com/markuman/go-sqlite (MATLAB and Octave compatible! But I guess no one but me has ever used it!)
However, I recommend the version 2 beta in branch 2 (git checkout -b 2 origin/2) running in coop mode (you'll hit sqlite3's maximum string length in ego mode). There is HTML documentation for version 2 as well: http://go-sqlite.osuv.de/doc/
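As an aside, if you do stay with the plain TXT file in MATLAB, a rough sketch of the "split the file and calculate step by step" idea is a chunked textscan loop; the file name, chunk size and the all-numeric %f format below are assumptions you would have to adjust:
fid = fopen('bigfile.txt', 'r');          % placeholder file name
fmt = repmat('%f', 1, 66);                % assumes 66 numeric columns
chunkRows = 1e6;                          % rows per chunk, tune for your RAM
while ~feof(fid)
    % add 'Delimiter', ',' if the file is comma separated
    C = textscan(fid, fmt, chunkRows, 'CollectOutput', true);
    block = C{1};                         % up to chunkRows-by-66 double
    if isempty(block), break; end
    % ... do the sampling/aggregation on this block, keep only what you need ...
end
fclose(fid);
That way only one chunk is ever held in memory, at the cost of a full pass over the file whenever you need different rows, which is exactly where the database approach wins.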

Related

Import CSV file into PostgreSQL while rounding from decimal to integer

I am loading a 10 GB CSV file into an AWS Aurora PostgreSQL database. The file has a few fields where the values are decimals within +/- 0.1 of a whole number, but in reality they are supposed to be integers. When I loaded this data into Oracle using SQLLDR I was able to round those fields from decimal to integer. I would like to do the same in the PostgreSQL database using the \copy command, but I can't find any option which allows this.
Is there a way to import this data and round the values during a \copy without going through a multistep process like creating a temporary table?
There doesn't seem to be a built-in way to do this, unlike what I have seen in other database applications.
I didn't use an external program as suggested in the comments, but I did preprocess the data with an awk script that reads each line and reformats the incorrect field, using printf with the "%.0f" format to round the output.
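For reference, a minimal sketch of that kind of awk preprocessing, assuming the decimal field is column 3 of a comma-delimited file (the column number and file names are placeholders):
awk 'BEGIN { FS = OFS = "," } { $3 = sprintf("%.0f", $3) } 1' input.csv > rounded.csv
The rounded file can then be loaded with a plain \copy.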

Decimals less than 1 appear as ",x" in output file while they appear correctly in the result window

I am having difficulty with my decimal columns. I have defined a view in which I convert my decimal values like this
E.g.
SELECT CONVERT(decimal(8,2), [ps_index]) AS PriceSensitivityIndex
When I query my view, the numbers appear correctly in the results window, e.g. 0,50 and 0,35.
However, when I export my view to a file using the Tasks > Export Data ... feature of SSMS, the decimals lower than one appear as ,5 and ,35.
How can I get the same output as in the results window?
Change your query to this:
SELECT CAST( CONVERT(decimal(8,2), [ps_index]) AS VARCHAR( 20 ) ) AS PriceSensitivityIndex
Not sure why, but bcp is dropping the leading zero. My guess is that it's either because of the transition from SQL storage to a text file (similar to how empty strings and NULLs are exchanged on BCP in or out), or there is some deeper configuration (Windows, SQL Server, ?) where a SQL Server setting differs from an OS setting. Not sure yet. But since you are converting to text/character data anyway when you BCP to a text file, it's safe (and likely better in most cases) to first cast/convert your data to a character data type.

Import Flat File via SSMS to SQL Server fails

When importing a seemingly valid flat file (csv, text etc) into a SQL Server database using the SSMS Import Flat File option, the following error appears:
Microsoft SQL Server Management Studio
Error inserting data into table. (Microsoft.SqlServer.Import.Wizard)
Error inserting data into table. (Microsoft.SqlServer.Prose.Import)
Object reference not set to an instance of an object. (Microsoft.SqlServer.Prose.Import)
The target table may contain rows that imported just fine. The first row that is not imported appears to have no formatting errors.
What's going wrong?
Check the following:
that there are no blank lines at the end of the file (leaving the last line's line terminator intact) - this seems to be the most common issue
there are no unexpected blank columns
there are no badly escaped quotes
It looks like the import process loads lines in chunks. This means that the lines following the last successfully loaded chunk may appear to have no errors. You need to look at the subsequent lines that are part of the failing chunk to find the offending line(s).
This cost me hours of hair pulling while dealing with large files. Hopefully this saves someone some time.
If the file you're importing is already open, SSMS will throw this error. Close the file and try again.
When you create your flat file, if any of your columns contain text (varchar) values, do NOT make the file comma (",") delimited. Instead, use a vertical bar ("|") or some other character that you are SURE cannot appear in those values; commas are very common inside nvarchar fields.
I had this issue and none of the recommendations from the other answers helped me. I hope this saves someone some time; it took me hours to figure out!
None of the other suggestions worked for me, but this did:
When you import a flat file, SSMS gives you a brief summary of the data types it detected for each column. Whenever you see an nvarchar on a column that should be int or double, change it to int or double, and change all remaining nvarchars to nvarchar(max). This worked for me.
I've been working with CSV data for a long time. I encountered similar problems when I first started this job, but as a novice I couldn't get a precise cause out of the exceptions.
Here are a few things you should look at before importing anything.
Your CSV file must not be open in any software, such as Excel.
Your CSV file cells should not contain comma or quotation characters.
There should be no unnecessary blanks at the end of your data.
No reserved word should be used as data. In Excel, open your file and save it as a new file.
If anyone is still having issues after trying all the suggestions, check the length of the data type for your columns. It took me hours to figure this out, but increasing the nvarchar length from (50) to (100) worked for me.
One thing that worked for me: you can change the error range to 1 in "Modify Columns".
You then get an error message naming the specific line that's problematic in your file instead of "ran out of memory".
I fixed these errors by playing around with the data types. For instance, I changed tinyint to smallint, smallint to int, and increased my nvarchar() lengths to reasonable values or set them to nvarchar(MAX). Since most real-life data has missing values, I also allowed missing values in all columns. Everything then worked, with a warning message.

SAS PROC IMPORT not creating OUT dataset as commanded

Situation: I'm importing an xlsx file with PROC IMPORT and want to send the data OUT to a new Netezza database table.
My issue: SAS appears to run fine, but the log shows that a completely different table name was created, with a libref that I'm not using (and that libref has been cleared).
LIBNAME abc sasionza server=server database=db port=123 user=user pass=pass;
PROC IMPORT
OUT = abc.DesiredTableName
DATAFILE= "my/excelfile/file.xlsx"
DBMS=xlsx
REPLACE;
SHEET="Sheet1";
GETNAMES=YES;
RUN;
This "runs" just fine, or so it appears to. I check the log and I see this:
NOTE: The import data set has 11 observations and 7 variables.
NOTE: xyz.ATableCreatedDaysAgoInAnotherProgram data set was successfully created.
NOTE: PROCEDURE IMPORT used (Total process time):
real time 0.55 seconds
cpu time 0.02 seconds
I thought, hmm, that is weird. The libref xyz is actually cleared, so I couldn't possibly be using it, and ATableCreatedDaysAgoInAnotherProgram is a table name used in a completely different SAS E-Guide program I have going on.
That sounded like a memory or cache issue, so I closed all instances of SAS E-Guide and fired up a new one, then created a new program containing only the desired lines (the code listed above).
It runs, and I get the following log as a result:
NOTE: The import data set has 11 observations and 7 variables.
NOTE: WORK._PRODSAVAIL data set was successfully created.
NOTE: PROCEDURE IMPORT used (Total process time):
real time 0.55 seconds
cpu time 0.02 seconds
I will note that this is the first time I've actually tried to use PROC IMPORT to send something directly to a Netezza table. Up until now I've always imported files into WORK and worked with them for a bit before inserting them into a database table. I thought this might be a SAS limitation I wasn't aware of, but the SAS documentation for PROC IMPORT (https://v8doc.sas.com/sashtml/proc/z0308090.htm) says that you can specify a two-level name in the OUT statement, so I feel this should work. And if it can't work, SAS should error out instead of silently creating a table name that isn't even in my code.
Summary (tl;dr): Can you PROC IMPORT directly into a Netezza database table using a libref? And if you can't, why is my code executing and producing log output that isn't related to what I'm doing?
Thanks, everyone!
The solution: a column in the xlsx file being imported had a space in its name... Simply removing the space and saving the change to the xlsx file allowed the PROC IMPORT code above to execute flawlessly, with the desired results imported into the named Netezza table.
NOTE: This fixed my problem, but it does not explain why the SAS log showed notes for code that wasn't actually being executed.
It sounds like you should report the issue of not getting a proper ERROR message to SAS.
To make sure that your SAS/Netezza tables do not end up with variable names containing spaces, change the setting of the VALIDVARNAME option before running your program. That way PROC IMPORT will convert the column headings in the XLSX file into valid variable names.
options validvarname=v7;
libname out ...... ;
proc import out=out.table replace ...

Loading multiple non-CSV tables into R and performing a function on each file

First day on R. I may be expecting too much from it, but here is what I'm looking for:
I have multiple files (140 tables), and each table has two columns (V1 = values, V2 = frequencies). I use the following code to get the average from each table:
sum(V1*V2)/sum(V2)
I was wondering if it's possible to do this once instead of 140 times, i.e. to load all the files and get an exported file that shows the average of each table next to the original file name.
I use read.table to load the files, as read.csv doesn't work well for some reason.
I'll appreciate any input!
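Not a definitive answer, but a minimal sketch of the batch version of that formula, assuming all 140 files sit in a folder called tables/ and that read.table's defaults match your files (the folder and output names are placeholders):
files <- list.files("tables", full.names = TRUE)   # all 140 table files
avgs <- sapply(files, function(f) {
  d <- read.table(f)                    # columns default to V1 and V2
  sum(d$V1 * d$V2) / sum(d$V2)          # same weighted average as above
})
write.csv(data.frame(file = basename(files), avg = avgs),
          "table_averages.csv", row.names = FALSE)
This writes one row per input file, pairing each file name with its average.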