why are csvs copied from QPAD and csvs saved from a q process so different in terms of size? - kdb

I am trying to save a csv generated from a table.
If I 'Export all as CSV' from QPAD the file is 22MB.
If I do `:path.csv 0: csv 0: table the file is 496MB.
The files contain the same data.
I do have some columns which are lists of dates or lists of symbols, which cause some issues when writing to csv.
To get around that I use this: {`$$[1=count x;string first x;`$" "sv string x]}
i.e. one of the cols is called allDates and looks like this:
someOtherCol allDates              stackedSymCol
val1         ,2001.01.01           ,`sym 1
val2         2001.01.01 2001.01.02 `sym 2`sym 3
Where is this massive difference in size coming from and how can I reduce the size?
If I remove these 3 columns which are lists of lists, the file goes down significantly.
Doing an ungroup is not an option.
I think the important question here is: why is QPAD capable of handling columns which are lists of lists of type 'D', 'S', etc., and how can I achieve that without casting those columns to a space-delimited string? That cast is what is causing my saved csv to be so massive.
i.e. I can do an 'Export all to CSV' from QPAD on this and it is 21MB:
but if I want to save it programmatically, I need to change those allDates and DESK_NAME columns and it goes up to 500MB.
UPDATE: Thanks everyone. I did not know that QPAD is truncating data like that on exports. That is worrying.

These csvs will not be identical. qPad truncates nested lists (including strings). The csv exported directly from kdb will be complete.
Eg.
([]a:3#enlist til 1000;b:3#enlist til 1000)
The qPad csv export of this looks like this at the end: 30j, 31j ....
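A quick way to confirm the q-side export is complete is to flatten the nested columns and count the characters written (a sketch; plain 0: needs the nested columns converted to strings first):
q)t:([]a:3#enlist til 1000;b:3#enlist til 1000)
q)flat:update a:{" "sv string x}each a,b:{" "sv string x}each b from t
q)count raze csv 0: flat    / total character count - all 1000 items per cell are present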

Based on the update to your question it seems you are exporting the data shown in the screenshot, which would not be the same as the data you are transforming to save to csv directly from q.
Based on the screenshot it is likely the csv files are not identical for at least 3 reasons:
QPad is truncating the nested dates at a certain length
QPad adds enlist to nested lists of length 1
QPad adds/keeps backticks before symbols
Example data comparison
Here is a minimal example that should highlight this:
q)example:{n:1 2 20;([]someOtherCol:3?10;allDates:n?\:.z.d;stackedSymCol:n?\:`3)}[]
q)example
someOtherCol allDates              stackedSymCol
------------------------------------------------
1            ,2006.01.13           ,`hfg
1            2008.04.06 2008.01.11 `nha`plc
4            2009.06.12 2016.01.24 2021.02.02 2018.09.02 2011.06.19 2022.09.26 2008.10.29 2010.03.11 2022.07.30 2012.09.06 2021.11.27 2017.11.24 2007.09.10 2012.11.27 2020.03.10 2003.07.02 2007.11.29 2010.07.18 2001.10.23 2000.11.07 `ifd`jgp`eln`kkb`ahm`cal`eni`idj`mod`omb`dkc`ogf`eaj`mbf`kdd`hip`gkg`eef`edi`jak
I have used 'Export All to CSV' to save to C:/q/qpad.csv.
I couldn't get your "razing" function to work as-is, so I modified it slightly and used it to convert the nested lists to strings before saving to csv.
q)f:{`$$[1=count x;string first x;" "sv string x]}
q)`:C:/q/q.csv 0: csv 0: update f'[allDates], f'[stackedSymCol] from example
Reading both files back and comparing shows the contents do not match.
q)a:read0`:C:/q/q.csv
q)b:read0`:C:/q/qpad.csv
q)a~b
0b
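To see exactly where they disagree you can compare the lines pairwise (a quick sketch, assuming both files read back with the same number of rows):
q)count each (a;b)    / header plus data rows for each file
q)where not a~'b      / indices of the lines that differ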
Side note
Since kdb+ V4.0 2020.03.17 it is possible to save nested vectors to csv using .h.cd to prepare the text. The variable .h.d is used as the delimiter for sublist items.
q).h.d:" ";
q).h.cd example
"someOtherCol,allDates,stackedSymCol"
"8,2013.09.10,pii"
"6,2007.08.09 2012.12.30,hbg blg"
"8,2011.04.04 2020.08.21 2006.02.12 2005.01.15 2016.05.31 2015.01.03 2021.12.09 2022.03.26 2013.10.15 2001.10.29 2011.02.17 2010.03.28 2005.11.14 2003.08.16 2002.04.20 2004.08.07 2014.09.19 2000.05.24 2018.06.19 2017.08.14,cim pgm gha chp dio gfc beh mbo cfe kec jbn bjh eni obf agb dce gnk jif pci ppc"
q)`:somefile.csv 0: .h.cd example
CSV saved from q
Contents of the csv saved from q and the character count are shown in the example:
q)read0`:C:/q/q.csv
"someOtherCol,allDates,stackedSymCol"
"8,2013.09.10,pii"
"6,2007.08.09 2012.12.30,hbg blg"
"8,2011.04.04 2020.08.21 2006.02.12 2005.01.15 2016.05.31 2015.01.03 2021.12.09 2022.03.26 2013.10.15 2001.10.29 2011.02.17 2010.03.28 2005.11.14 2003.08.16 2002.04.20 2004.08.07 2014.09.19 2000.05.24 2018.06.19 2017.08.14,cim pgm gha chp dio gfc beh mbo cfe kec jbn bjh eni obf agb dce gnk jif pci ppc"
q)count raze read0`:C:/q/q.csv
383
CSV saved from QPad
Similarly the contents of the csv saved from QPad and the character count:
q)read0`:C:/q/qpad.csv
"someOtherCol,allDates,stackedSymCol"
"1,enlist 2006.01.13,enlist `hfg"
"1,2008.04.06 2008.01.11,`nha`plc"
"4,2009.06.12 2016.01.24 2021.02.02 2018.09.02 2011.06.19 2022.09.26 2008.10.29 2010.03.11 2022.07.30 2012.09.06 2021.11.27 2017.11.24 2007.09.10 2012.11.27 ...,`ifd`jgp`eln`kkb`ahm`cal`eni`idj`mod`omb`dkc`ogf`eaj`mbf`kdd`hip`gkg`eef`edi`jak"
q)count raze read0`:C:/q/qpad.csv
338
Conclusion
These examples illustrate the points outlined above: the dates are truncated at a certain length, enlist is added to nested lists of length 1, and backticks are kept before symbols.
The truncated nested data is likely why the file you exported from QPad is so much smaller; based on your comments above the files are not identical, so this is the most probable cause.
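If you want to quantify the difference directly, hcount returns a file's size in bytes (paths as used above):
q)hcount each `:C:/q/q.csv`:C:/q/qpad.csv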
TL;DR - Both files are created differently and that's why they differ.

Related

Parsing a text file or an html file to create a table

I have a simple issue with a .msg file from Outlook. I discovered that code someone helped me with was not working, since the htmlbody from the .msg file would vary between different emails even though they are from the same source. So my next option was to save the email as a .txt and .html file. Since I have no knowledge of html, I have no idea how to grab the table structured in the html, but in the text file I found something easy. For example, this is data from one table:
Summary
Date
Good mail
Rule matches
Spam
Malware
2019-10-22
4927
4519
2078
0
2019-10-23
4783
4113
1934
0
This is in the text file. Summary is the keyword; after that keyword, the next 5 lines are the columns of the table, and each group of 5 lines following is a row. This goes up to 7 rows in total, so headers and then 7 rows.
Now what I want to do is create a table from this text using the first 5 lines after Summary as my columns. Since each .msg is different, these 5 columns will change order in each file randomly, so I want to account for that. My best attempt was to use ConvertFrom-String to create a table, but I have little idea how to format the table with the conditions set above.
The problem is simple: I have a table in the txt file as shown above, with 5 columns; besides the headers, each column contains 7 rows. There is also the condition that the email has more data after this table, so I need to stop there and just grab that part, which should be easy.
How can I use ConvertFrom-String to create the table using those 5 columns? How can I set the delimiter as a new line, and how can I set the first 5 lines as the column headers?
I think trying to make this work with ConvertFrom-StringData is adding more work than necessary. But here is an alternative that works with your sample set.
$text = Get-Content -Path File.txt
$formattedText = if ($text[0] -match '^Summary') {
    for ($i = 1; $i -lt $text.count; $i += 5) {
        # join each group of five lines into one comma-separated row
        $text[$i..($i+4)] -join ','
    }
}
$formattedText | ConvertFrom-Csv | ConvertTo-Html
Explanation:
If we assume your text data is in File.txt, Get-Content is used to read the data as an array ($text). If the first line begins with Summary, the file will be parsed.
The for loop is used to skip 5 lines during each iteration until the end of the file. The loop begins with the $text values at indexes 1, 2, 3, 4, and 5 joined together by a ,. Then the index ($i) is increased by 5 and the next five values are joined together. Each iteration creates a new line of comma-separated values. The reason for the , join is just so the simple ConvertFrom-Csv can be used later.
ConvertFrom-Csv converts the CSV data into an array of objects ($formattedText) with the first row becoming those objects' properties.
Finally, the array is piped to ConvertTo-Html, which will output all of the objects in a table.
Note: If you want to resize or add extra format to the table, you may need to do that after the code is generated. If your data has commas, you will need a different delimiter when joining the strings. You will then need to add the -Delimiter parameter to the ConvertFrom-Csv with the delimiter you choose.
Adaptation:
The code is fairly flexible. If you need to work with more than five properties, the $i+=5 will need to reflect the number of properties you need to cycle through. The same change needs to apply to $text[$i..($i+4)]. You want the .. to separate two values that differ by your property number.

How do I extract the last string of a csv file and append it to the other?

I have a csv file of many rows, each having 101 columns, with the 101st column being a char while the rest of the columns are doubles. E.g.
1,-2.2,3 ... 98,99,100,N
I implemented a filter to operate on the numbers and wrote the result to a different file, but now I need to map the last column of my old csv to my new csv. How should I approach this?
I did the original loading using loadcsv but that didn't seem to load the character so how should I proceed?
In MATLAB there are many ways to do it; this answer expands on the use of tables:
Input
test.csv
1,2,5,A
2,3,5,G
5,6,8,C
8,9,7,T
test2.csv
1,2,1.2
2,3,8
5,6,56
8,9,3
Script
t1 = readtable('test.csv'); % Read the csv file
lastcol = t1{:,end}; % Extract the last column
t2 = readtable('test2.csv'); % Read the second csv file
t2.addedvar = lastcol; % Add the last column of the first file to the table from the second file
writetable(t2,'test3.csv','Delimiter',',','WriteVariableNames',false) % write the new table in a file
Note that test3.csv is a new file but you could also overwrite test2.csv
'WriteVariableNames',false allows you to write the csv file without the headers of the table.
Output
test3.csv
1,2,1.2,A
2,3,8,G
5,6,56,C
8,9,3,T

How to export every table to csv in a kdb+ database?

Assume my kdb+ database has a few tables. How can I export all tables to csv files where the name of each csv is same as the table name?
There may be a number of ways to approach this; one solution could be:
q)t1:([]a:1 2 3;b:1 2 3)
q)t2:([]a:1 2 3;b:1 2 3;c:1 2 3)
q){save `$(string x),".csv"} each tables[]
`:t1.csv`:t2.csv
ref: http://code.kx.com/q/ref/filewords/#save
If you wish to specify the directory of the file being saved down then you could enhance the function like so:
q){[dir;tSym] save ` sv dir,(`$ raze string tSym,`.csv)}[`:C:/Users/dhughes1/Documents;] each tables[]
`:C:/Users/dhughes1/Documents/t1.csv`:C:/Users/dhughes1/Documents/t2.csv
An alternative method to save is to use 0: to prepare text, specifying a delimiter of ",":
q)tab:([]a:1 2 3;b:`a`b`c)
q)show t:","0:tab
"a,b"
"1,a"
"2,b"
"3,c"
And again to save text:
q)`:tab 0: t
`:tab
The advantage of this method is that the delimiter can be specified before saving to disk.
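For completeness, the two ideas combine into a small helper that writes every table in the session as comma-delimited csv (a sketch; the output directory is an assumption and must already exist, and this only works for tables of simple columns - nested columns need flattening first, as discussed above):
q)saveCsv:{[dir;t](` sv dir,`$string[t],".csv") 0: csv 0: value t}
q)saveCsv[`:C:/q/out] each tables[]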

Reading portion of CSV with multiple data types

I have a csv file with both numbers and letters that I want to read. The file also has headers (first row) but I can read them separately so that's not a concern.
What I can't solve is the fact that the file has multiple data types and that I only want to read a portion (since the file is very large), say the first 5000 rows.
I've tried xlsread with three outputs but I get the following error: "??? Error: Object returned error code: 0x800A03EC". I've also tried textscan, but if I understood correctly you have to type the variable types as an input, and that's not very practical for me since I have a large number of columns. I hope this is not a duplicate, but I've read other solutions and could not apply them to my problem.
Is there a way to do this?
Thank you in advance
To test the problem I created a small test.csv file.
It contains the following lines:
header1;header2;header3
a;1;xx
b;2;yy
c;3;zz
d;4;xxx
e;5;yyy
I use the following code to read the data:
range = 'A2:C3'
[num, text, both] = xlsread('test.csv', 1, range)
Output of the both variable, which contains the text and numbers, is as expected:
both =
    'a'    [1]    'xx'
    'b'    [2]    'yy'

How can I copy columns from several files into the same output file using Perl

This is my problem.
I need to copy 2 columns each from 7 different files to the same output file.
All input and output files are CSV files.
And I need to add each new pair of columns beside the columns that have already been copied, so that at the end the output file has 14 columns.
I believe I cannot use
open(FILEHANDLE,">>file.csv").
Also all 7 CSV files have nearly 20,000 rows each, therefore I'm reading and writing the files line by line.
It would be a great help if you could give me an idea as to what I should do.
Thanks a lot in advance.
Provided that your lines are 1:1 (meaning you're combining data from line 1 of File_1, File_2, etc.):
open all 7 files for input
open output file
read line of data from all input files
write line of combined data to output file
Text::CSV is probably the way to access CSV files.
You could define a csv handler for each file (including output), use the getline or getline_hr (returns a hashref) methods to fetch data, combine it into arrayrefs, then use print.