I am working with a CSV file that contains information in the following format:
col1 col2 col3
row1 id1 , text1 (year1) , a|b|c
row2 id2 , text2 (year2) , a|b|c|d|e
row3 id3 , text3 (year3) , a|b
...
The number of rows in the CSV is very large. The years are embedded in col2 in parentheses. Also, as can be seen, col3 can have a variable number of elements.
I would like to read the CSV file EFFICIENTLY and end up for each item (id) with an array as follows:
For the item with id_i:
A = [id_i,text_i,year_i,101010001]
where, if all possible features in col3 are [a,b,c,d,...,z], the binary vector indicates the presence or absence of each feature.
I am interested in an efficient implementation of this in MATLAB. Ideas are more than welcome. Thank you!
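For concreteness, here is a minimal sketch of one way this could be done with textscan. It assumes the file is named data.csv, has no header line, the years are plain digits inside the parentheses, and the full feature alphabet (allFeatures below) is known in advance; all of those names are placeholders:

fid = fopen('data.csv');
C = textscan(fid, '%s%s%s', 'Delimiter', ',');   % read the three columns as strings
fclose(fid);

ids      = strtrim(C{1});
texts    = strtrim(C{2});
features = strtrim(C{3});

allFeatures = {'a','b','c','d','e'};             % assumed feature alphabet
n = numel(ids);
years  = zeros(n, 1);
binVec = zeros(n, numel(allFeatures));

for i = 1:n
    % pull the year out of the parentheses in col2
    tok = regexp(texts{i}, '\((\d+)\)', 'tokens', 'once');
    if ~isempty(tok)
        years(i) = str2double(tok{1});
    end
    % mark which of the known features appear in col3
    present = strsplit(features{i}, '|');
    binVec(i, :) = ismember(allFeatures, present);
end

The loop is only there to keep the sketch readable; the regexp and ismember calls could also be vectorised with cellfun if the loop turns out to be the bottleneck.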
I would like to add what I have found to be one of the fastest ways of reading a CSV file:
importdata()
This will allow you to read numeric and non-numeric data, but it assumes there is some number of header lines. You can either pass the number of header lines as an input argument to importdata() or let it try to detect them on its own, which didn't work for my use case in the past.
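For example, a minimal call might look like this (the file name, delimiter, and header-line count are placeholders):

A = importdata('data.csv', ',', 1);   % delimiter and number of header lines
% A.data holds the numeric part, A.textdata the non-numeric part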
importdata() was much faster than xlsread() for me: it took a sixth of the time to read a file six times larger!
If you are reading only numeric data, you can use csvread(), which actually uses dlmread() under the hood.
The thing is, there are about ten ways to read these files, and the right choice depends not only on your goals but also on the file contents.
You can use T = readtable(filename). This has the option 'ReadVariableNames', which takes the first row as the header, and 'ReadRowNames', which takes the first column as the row names.
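A minimal sketch (the file name is a placeholder):

T = readtable('data.csv', 'ReadVariableNames', true, 'ReadRowNames', true);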
Is there any way in kdb to read a csv file that is as simple as the read_csv() function in pandas?
I usually use something like the code below to read a csv in kdb:
("I*FS";enlist ",")0:`:a.csv / where a.csv is a csv file with Integer, String, Float and Symbol columns
In many practical cases the csv file we want to read has more than 100 columns, and it then becomes difficult to provide the column types to the function.
Is there a way in kdb to read a csv where kdb can infer the type of each column by itself?
something like
("*";enlist ",")0:`:a.csv / this fails
Simon Garland wrote a "csv guess" script many years ago: https://github.com/simongarland/csvguess
It might still be relevant. Some IDEs (such as qStudio and Kx's analyst(?)) I believe also have this functionality built in.
Alternatively, you could read the first line of the csv to get the number of columns (say n) and then use n#"*" to read the entire csv as string columns:
q)(count["," vs first system"head -1 a.csv"]#"*";enlist ",")0:`:a.csv
col1 col2 col3
----------------------
,"a" ,"1" "2019-01-01"
,"b" ,"2" "2019-01-01"
,"c" ,"3" "2019-01-01"
I have a Spark (1.4) dataframe where the data in a column is like "1-2-3-4-5-6-7-8-9-10-11-12". I want to split the data into multiple columns. Please note that the number of fields can vary from 1 to 12; it's not fixed.
P.S. we are using Scala API.
Edit:
Expanding on the original question: I have a delimited string as below:
"ABC-DEF-PQR-XYZ"
From this string I need to create delimited strings in separate columns, as below. Please note that this string is in a column of the DataFrame.
Original column: ABC-DEF-PQR-XYZ
New col1 : ABC
New col2 : ABC-DEF
New col3 : ABC-DEF-PQR
New col4 : ABC-DEF-PQR-XYZ
Please note that up to 12 such new columns need to be derived from the original field. Also, the string in the original column varies, i.e. sometimes it yields 1 column, sometimes 2, but at most 12.
Hope I have articulated the problem statement clearly.
Thanks!
You can use explode and pivot (the example below uses the PySpark API). Here is some sample data:
from pyspark.sql import functions as f

df = sc.parallelize([["1-2-3-4-5-6-7-8-9-10-11-12"], ["1-2-3-4"], ["1-2-3-4-5-6-7-8-9-10"]]).toDF(schema=["col"])
Now add a unique id to rows so that we can keep track of which row the data belongs to:
df=df.withColumn("id", f.monotonically_increasing_id())
Then split the column on the delimiter "-" and explode it to get a long-form dataset:
df=df.withColumn("col_split", f.explode(f.split("col", "\-")))
Finally pivot on id to get back to wide form:
df.groupby("id")
.pivot("col_split")
.agg(f.max("col_split"))
.drop("id").show()
Matlab 2015b
I have several large (100-300 MB) csv files that I want to merge into one and filter out some of the columns. They are shaped like this:
timestamp | variable1 | ... | variable200
01.01.16 00:00:00 | 1.59 | ... | 0.5
01.01.16 00:00:01 | ...
.
.
For this task I am using a datastore class including all the csv files:
ds = datastore('file*.csv');
When I read all of the entries and try to write them back to a csv file using writetable, I get an error saying that the input has to be a cell array.
When looking at the cell array read from the datastore in debug mode, I noticed that there are several rows containing only a timestamp, which are not in the original files. These rows sit between the last row of one file and the first rows of the following one. The timestamps of these rows are the logical continuation of the last timestamp (as you would get them using Excel).
Is this a bug or intended behaviour?
Can I avoid reading these rows in the first place, or do I have to filter them out afterwards?
Thanks in advance.
As it seems nobody else had this problem, I will share how I dealt with it in the end:
toDelete = strcmp(data.(2), '');
data(toDelete, :) = [];
I took the second column of the table and checked for an empty string. Afterwards I removed all faulty rows by assigning them an empty array via logical indexing (as shown in the MATLAB documentation).
Sadly I found no way to prevent loading the faulty data, but in the end the amount of data was not too big to do this processing step in memory.
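For completeness, here is a minimal sketch of the whole pipeline as I understand it; the file pattern, variable names, and output file name are placeholders, and it assumes (as above) that the faulty rows show up as empty strings in the second variable:

ds = datastore('file*.csv');
ds.SelectedVariableNames = {'timestamp', 'variable1', 'variable2'};  % keep only the wanted columns
data = readall(ds);                       % read everything into one table
toDelete = strcmp(data.(2), '');          % spot the spurious timestamp-only rows
data(toDelete, :) = [];                   % (check for NaN instead if that column is numeric)
writetable(data, 'merged.csv');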
I have a .txt file with 8 columns, separated by \t (tabs). I want to read just 6 of the columns with the readtable instruction. Please help me. Thank you.
The instruction below reads all of the columns; please correct it for me:
Table = readtable('D:\DataIntable.txt','Delimiter','\t','ReadVariableNames',true);
The data has 5 million rows, so dropping columns after reading would be a pointless waste of time.
The overhead of converting to *.xls is silly. If you read the documentation for readtable you will see that it supports textscan-style format specifiers. This allows you to use an asterisk (e.g. %*s) to skip a field.
Using asdf.txt:
column1 column2 column3
a b c
d e f
And:
T = readtable('asdf.txt', 'ReadVariableNames', true, 'Delimiter', '\t', 'Format', '%s%s%*s');
We obtain:
T =
column1 column2
_______ _______
'a' 'b'
'd' 'e'
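Applying the same idea to the original eight-column file: assuming, say, that the last two columns are the ones to drop (which columns to skip, and that every column can be read as text, are assumptions; swap in %f for numeric columns), the call could look like:

T = readtable('D:\DataIntable.txt', 'Delimiter', '\t', ...
    'ReadVariableNames', true, 'Format', '%s%s%s%s%s%s%*s%*s');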
If you can save your data as an .xls file instead of a .txt file, you can use xlsread which allows you to specify the range of your data in the call.
[data,txt] = xlsread('filename',sheet,xlRange)
You would have to know the cell indices of your data in the spreadsheet (e.g. A1:C500 would be a 500x3 matrix of imported data), but it would allow you to import only the desired columns. The txt output will contain the column titles as strings, since it appears that you want the names associated with the data as well.
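A hypothetical call, assuming the data were saved as DataIntable.xls with the six wanted columns sitting in columns A:F of Sheet1 (file name, sheet, and range are all placeholders):

[data, txt] = xlsread('DataIntable.xls', 'Sheet1', 'A1:F500');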
I have a .csv file and I can't read it on Octave. On R I just use the command below and everything is read alright:
myData <- read.csv("myData.csv", stringsAsFactors = FALSE)
However, when I go to Octave, it doesn't read it properly with the command below:
myData = csvread('myData.csv',1,0);
When I open the file with Notepad, the data looks something like the below. Note there isn't a comma separating the last column name (i.e. Column3) from the first value (i.e. Value1), and the same thing happens with the last value of the first row (i.e. Value3) and the first value of the second row (i.e. Value4).
Column1,Column2,Column3Value1,Value2,Value3Value4,Value5,Value6
The Column1 is meant for date values (with format yyyy-mm-dd hh:mm:ss), I don't know if that has anything to do with the problem.
Alex's answer already explains why csvread does not work for your case. That function only reads numeric data and returns an array. Since your fields are all strings, you need something that reads a csv file into a cell array.
That function is named csv2cell and is part of the io package.
As a separate note, if you plan to do operations with those dates, you may want to convert those date strings into serial date numbers. This will allow you to put your dates in a numeric array, which allows for faster operations and reduced memory usage. Also, the financial package has many functions for dealing with dates.
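A minimal sketch, assuming the io package is installed and the dates really are in the yyyy-mm-dd hh:mm:ss format mentioned above (the file name is a placeholder):

pkg load io
myData = csv2cell('myData.csv');          % cell array; the first row holds the column names
% serial date numbers from Column1, skipping the header row
dates = datenum(myData(2:end, 1), 'yyyy-mm-dd HH:MM:SS');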
csvread only reads numeric data, so a date does not qualify unfortunately.
In Octave you might want to check out the dataframe package. In Matlab you would do readtable.
Otherwise there are also more primitive functions you can use like textscan.
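For example, a minimal textscan sketch, assuming three comma-separated string columns and one header line (the file name is a placeholder):

fid = fopen('myData.csv');
C = textscan(fid, '%s %s %s', 'Delimiter', ',', 'HeaderLines', 1);
fclose(fid);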