Read a file as SciPy sparse matrix directly - scipy

Is it possible to read a space separated file, with each line containing float numbers directly as SciPy sparse matrix?

Given: A space-separated file containing ~56 million rows, each with 25 space-separated floating point numbers and a lot of sparsity.
Goal: Convert the file into a SciPy CSR sparse matrix as fast as possible.
Maybe there are better solutions out there, but this one worked for me after a lot of suggestions from @CJR (some of which I couldn't take into account).
Also, there may be a better solution using HDF5, but this solution uses a Pandas DataFrame; it finishes in about 6.7 minutes and takes around 50 GB of RAM on a 32-core machine for 56,651,070 rows with 25 space-separated floating point numbers in each row and a lot of sparsity.
import numpy as np
import scipy.sparse as sps
import pandas as pd
import time
import swifter
start_time = time.time()
input_file_name = "df"
sep = " "
# Read each line as a single string column ('array_column' is just a label for that column)
df = pd.read_csv(input_file_name, header=None, names=['array_column'])
# Parse each line into a NumPy array in parallel with swifter
df['array_column'] = df['array_column'].swifter.allow_dask_on_strings().apply(lambda x: np.fromstring(x, sep=sep))
# Stack the per-row arrays into a dense 2-D array and convert it to CSR
df_np_sp_matrix = sps.csr_matrix(np.stack(df['array_column'].to_numpy()))
print("--- %s seconds ---" % (time.time() - start_time))
Output:
--- 406.22810888290405 seconds ---
Matrix Size.
df_np_sp_matrix
Output:
<56651070x25 sparse matrix of type '<class 'numpy.float64'>'
with 508880850 stored elements in Compressed Sparse Row format>

Related

K-means clustering KDDcup99 data set error

I am using idx = kmeans(kddcup,5); for k-means clustering of 145,586 records with 41 features from the KDD Cup 99 dataset (the 10% subset) into 5 clusters, but MATLAB R2017a gives this error:
Kmeans cannot accept complex data!
The database I loaded into MATLAB has 42 columns instead of 41, which means the 42nd column is the type of each row (attack, normal, ...) and is not a feature, and I don't know whether I should keep that 42nd column or delete it.
I don't know if my work is correct or if there is a mistake in that code.
idx = kmeans(X,k) (see the documentation) requires numeric input for X:
Data, specified as a numeric matrix. The rows of X correspond to
observations, and the columns correspond to variables.
If X is a numeric vector, then kmeans treats it as an n-by-1 data
matrix, regardless of its orientation.
Data Types: single | double
You will need to avoid passing the 42nd column of kddcup to kmeans.
Since you said that kddcup contains (attack, normal, ...) values, are those char? What datatype is kddcup?
Regardless, it will need to be stripped of the 42nd column and possibly converted into a numeric matrix if it isn't one already.
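For illustration, a minimal sketch, assuming kddcup really has the 42 columns described and that the 42nd column is the label; the variable names are only illustrative:
% Keep only the 41 numeric feature columns and pass a double matrix to kmeans
if istable(kddcup)
    X = table2array(kddcup(:, 1:41));  % table input: extract the numeric variables
else
    X = double(kddcup(:, 1:41));       % matrix input: drop the label column, ensure double
end
idx = kmeans(X, 5);                    % cluster the 41 features into 5 groups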

Matlab string to double (str2double)

I am trying to build a financial application that handles economic data using MATLAB. The file I want to load is a CSV file containing double numbers in this format: '1222.3'. So far, I am just working with one dimension and I am able to load the data into a vector.
The problem is that the data is loaded into the vector in string format. To convert the whole vector into double format I use str2double(vector), but the numbers in the vector end up like this:
1222.3 -> 1.222
153.4 -> 0.1534
I have tried multiplying the vector by 100 (vector.*100), but that did not work.
Any idea?
If your vector components are sufficiently large, MATLAB will print the numbers in exponential format.
>> a = 1234.56
a =
1.2346e+03
The numbers are also shown in scientific notation in the workspace browser.
You can print the numbers in decimal form using e.g. fprintf:
>> fprintf('%5.3f\n',a)
1234.560
>>
As a side note, 1.222 * 100 ≠ 1222 ...
MATLAB automatically pulls a common factor out of numerical vectors, which has confused me many times myself. The line that gives the common factor is easy to miss, especially for large vectors, because it is displayed at the top.
If I define a vector with the two numbers you gave, MATLAB pulls out a factor of 1000, as indicated by the header line 1.0e+03 *.
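As a quick check, a small sketch (the variable name v is only illustrative) showing that the stored values are unchanged and only the display differs:
v = str2double({'1222.3'; '153.4'});  % the stored values are exactly 1222.3 and 153.4
format short g                        % the g formats do not pull out a common factor
disp(v)
fprintf('%.1f\n', v)                  % prints 1222.3 and 153.4 in plain decimal form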

Matlab xlsread: force empty cells to be read as NaN

I need to import data from a square region (10 by 10 cells) on an Excel sheet into Matlab.
All data in the region are numerical, but some outer rows and columns of the region are empty.
In Matlab I still want to have a 10 by 10 matrix of doubles with NaNs in places where there are empty cells in Excel (also in outer rows and columns).
If I use xlsread then empty outer rows and columns are automatically truncated.
Needless to say, all of this should be done automatically, without knowing in advance how many empty outer rows and columns there are.
How can I do this?
Let's say your 10 by 10 spreadsheet's first row and column and last row and column are empty (like this). Using:
[num,txt,raw] = xlsread('myfile.xlsx',1,'A1:J10'); % Read input.
will return:
num 8x8 double
txt 0x0 cell
raw 10x10 cell
In num, leading and trailing rows and columns that contain no numeric data are automatically truncated, while txt omits any numerical values. However, raw contains all the information, so it can be used to extract the numerical values:
raw(cellfun(@ischar,raw)) = {NaN}; % Replace text cells with NaN.
A = cell2mat(raw); % Convert to matrix.
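As an alternative on newer MATLAB releases (R2019a and later), readmatrix can do this in one call; a sketch assuming the same file and range (empty cells in a numeric import come back as NaN by default):
A = readmatrix('myfile.xlsx', 'Sheet', 1, 'Range', 'A1:J10'); % empty cells become NaN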

Single-Column Matrix Indexing

So I've got a .tcl file with data representing a large three-dimensional matrix, and all values associated with the matrix appended in a single column, like this:
128 128 512
3.2867
4.0731
5.2104
4.114
2.6472
1.0059
0.68474
...
If I load the file into the command window and whos the variable, I have this:
whos K
  Name         Size             Bytes  Class     Attributes
  K       8388810x3        201331440   double
The two additional columns seem to be filled with NaNs, which don't appear in the original file. Is this a standard way for MATLAB to store a three-dimensional matrix? I'm more familiar with the .mat way of storing a matrix, and I'm curious if there's a quick command I can run to revert it to a more friendly format.
The file's first line (128 128 512) gives it 3 columns. I don't know why there are so many extra rows (128*128*512 = 8388608, yet K has 8388810), but your 3-D matrix can probably be constructed like this:
N = 128*128*512;
mat = reshape(K(2:N+1,1), [128 128 512]); % skip the header row and use only the first column
What's on the last hundred lines of the table that gets loaded?
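If you would rather not hard-code the sizes, a small sketch that takes the dimensions from the header row of K (assuming the layout described above):
dims = K(1, 1:3);                  % [128 128 512] from the first line of the file
N = prod(dims);
mat = reshape(K(2:N+1, 1), dims);  % rebuild the 3-D array from the first column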

Extract large Matlab dataset subsets

Referencing and assigning a subset of a matlab dataset appears to be extremely inefficient and possibly scales like rows^2
Example:
alldata is a large dataset of mixed data - say 150,000 rows by 25 columns (integer, boolean and string).
The format for the dataset is:
'format', '%s%u%u%u%u%u%s%s%s%s%s%s%s%u%u%u%u%s%u%s%s%u%s%s%s%s%u%s%u%s%s%s%u%s'
I then convert 2 integer columns to type boolean.
The following subset assignment:
somedata = alldata(1:m,:)
takes >7 sec for m = 10,000 and a ridiculous amount of time for larger values of m. Plotting time vs m shows an m^2-type relationship, which is strange given that copying alldata is nearly instantaneous, as is using functions like sortrows and find. In fact, reading in the original .csv data file is faster than the above assignment for large values of m.
Using the profiler, it appears there is a function subref that includes a very slow line that checks for string comparisons to determine unique values within the dataset. Is this related to how the dataset type is stored (i.e. a reference table)? The dataset includes large number of unique string values.
Are there any solutions to extracting a subset of a dataset in MATLAB? Such as preallocation (how?), or copying the dataset and deleting rows rather than assigning an extract/subset.
I am using a dual core machine with 1.5Gb ram, but task manager reports less than 1Gb of ram in use.
I have previously worked with MATLAB's dataset arrays for large data; unfortunately, it's true that they do suffer from performance issues. One thing I found that helps speed things up is to clear the observation names (ObsNames) property.
Try the following fix:
%# I assume you have a 'dataset' object
ds = dataset(...);
%# clear the observation names property (It simply a label for each record)
ds.Properties.ObsNames = [];
Amro suggested clearing the observation names:
ds.Properties.ObsNames = [];
This results in a massive performance benefit: the subset assignment changes from quadratic in the number of rows (linear when plotted against rows^2) to linear in the number of rows, at the minor cost of losing the ObsNames.
Copying a dataset is near instantaneous, so copying it and then deleting the unneeded rows also gives a massive performance improvement, though it is a slightly less optimal solution (but with no loss of ObsNames); see the sketch below. It is about 2x slower than dropping ObsNames, and improves by only about 2% when ObsNames are also dropped.
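For reference, the copy-and-delete variant looks like this (a sketch; m stands for the number of rows to keep):
somedata = alldata;         % copying the dataset is near instantaneous
somedata(m+1:end,:) = [];   % delete the rows that are not needed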
Supporting data
I used a small script that assigns subsets of rows of a 150,000 x 25 mixed string/integer/boolean dataset, and it produced the following time measurements (seconds).
The memory heap size made no significant difference in performance and was left at 128 MB.
subref means the standard subscripted-assignment function was used
ObsNames=[] means the ObsNames are dropped
Delete means the dataset was copied and the unneeded rows deleted
Rows,     subref,   subref & ObsNames=[],   Delete,   Delete & ObsNames=[]
8000,       4.19,     2.06,                   4.81,     4.72
32000,     57.61,     2.49,                   5.26,     5.21
72000,    390.72,     3.21,                   6.09,     6.03
128000,     ?(*),     4.21,                   7.25,     7.19
(*) I gave up on evaluating this value. Based on linear extrapolation against rows^2 I would guess 2000 sec, or half an hour.
Script
clear
load('data'); % load 'alldata' dataset
% alldata.Properties.ObsNames = []; % drop obsnames
tic;
x = ((1:4).^2.*8000);
for h = 1:length(x)
start = toc;
somedata = alldata(1:x(h),:);
% somedata = alldata;
% somedata(x(h):end,:) = []; % drop unrequired obs
t(h) = toc - start;
clear somedata
disp([x(h), t(h)]);
end