How can I copy columns from several files into the same output file using Perl - perl

This is my problem.
I need to copy 2 columns from each of 7 different files to the same output file.
All input and output files are CSV files.
I need to add each new pair of columns beside the columns that have already been copied, so that in the end the output file has 14 columns.
I believe I cannot use
open(FILEHANDLE,">>file.csv").
Also, all 7 CSV files have nearly 20,000 rows each, so I'm reading and writing the files line by line.
It would be a great help if you could give me an idea of what I should do.
Thanks a lot in advance.

Provided that your lines correspond 1:1 (meaning you're combining data from line 1 of File_1, File_2, etc.):
open all 7 files for input
open the output file
read one line of data from each input file
write the line of combined data to the output file
repeat the last two steps until the input files are exhausted

Text::CSV is probably the way to access CSV files.
You could define a CSV handler for each file (including the output), use the getline or getline_hr (returns a hashref) methods to fetch data, combine it into arrayrefs, then use print.
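The same read-one-line-from-each-file loop can be sketched as follows. The answers suggest Perl and Text::CSV; this sketch uses Python's standard csv module instead, purely to illustrate the approach. The function name combine_columns, the default column indexes and any file names are hypothetical.

```python
import csv
from contextlib import ExitStack


def combine_columns(input_paths, output_path, columns=(0, 1)):
    """Write the chosen columns of every input file side by side, row by row."""
    with ExitStack() as stack:
        # Open all input files at once; ExitStack closes them all on exit.
        readers = [
            csv.reader(stack.enter_context(open(path, newline="")))
            for path in input_paths
        ]
        out = stack.enter_context(open(output_path, "w", newline=""))
        writer = csv.writer(out)
        # zip() pulls one row from each reader per iteration, so only one
        # row per file is held in memory; it stops at the shortest file.
        for rows in zip(*readers):
            combined = []
            for row in rows:
                combined.extend(row[i] for i in columns)
            writer.writerow(combined)
```

With 7 input paths and two columns per file, each output row ends up with 14 columns. The sketch assumes the selected columns exist in every row; a shorter row would raise an IndexError.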

Related

Why are CSVs exported from QPAD and CSVs saved from a q process so different in size?

I am trying to save a csv generated from a table.
If I 'Export all as CSV' from QPAD the file is 22MB.
If I do `:path.csv 0: csv 0: table the file is 496MB.
Both files contain the same data.
I have some columns that are lists of dates and lists of symbols, which cause issues when converting to CSV.
To get over that I use this {`$$[1=count x;string first x;`$" "sv string x]}
For example, one of the columns is called allDates and looks like this:
someOtherCol  allDates               stackedSymCol
val1          , 2001.01.01           , `sym 1
val2          2001.01.01 2001.01.02  `sym 2`sym 3
Where is this massive difference in size coming from, and how can I reduce the size?
If I remove these 3 columns, which are lists of lists, the file size goes down significantly.
Doing an ungroup is not an option.
I think the important question here is why QPAD is able to handle columns that are lists of lists of type 'D', 'S', etc., and how I can achieve the same without casting those columns to space-delimited strings. That casting is what makes my saved CSV so massive.
I.e. I can do an 'Export all to CSV' from QPAD on this and it is 21MB,
but if I want to save it programmatically, I need to convert the allDates and DESK_NAME columns, and it goes up to 500MB.
UPDATE: Thanks everyone. I did not know that QPAD is truncating data like that on exports. That is worrying.
These CSVs will not be identical. QPad truncates nested lists (including strings). The CSV exported directly from kdb will be complete.
Eg.
([]a:3#enlist til 1000;b:3#enlist til 1000)
The qPad csv export of this looks like this at the end: 30j, 31j ....
Based on the update to your question, it seems you are exporting the data shown in the screenshot, which would not be the same as the data you are transforming to save to CSV directly from q.
Based on the screenshot it is likely the csv files are not identical for at least 3 reasons:
QPad is truncating the nested dates at a certain length
QPad adds enlist to nested lists of length 1
QPad adds/keeps backticks before symbols
Example data comparison
Here is a minimal example that should highlight this:
q)example:{n:1 2 20;([]someOtherCol:3?10;allDates:n?\:.z.d;stackedSymCol:n?\:`3)}[]
q)example
someOtherCol allDates                stackedSymCol
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1            ,2006.01.13             ,`hfg
1            2008.04.06 2008.01.11   `nha`plc
4            2009.06.12 2016.01.24 2021.02.02 2018.09.02 2011.06.19 2022.09.26 2008.10.29 2010.03.11 2022.07.30 2012.09.06 2021.11.27 2017.11.24 2007.09.10 2012.11.27 2020.03.10 2003.07.02 2007.11.29 2010.07.18 2001.10.23 2000.11.07 `ifd`jgp`eln`kkb`ahm`cal`eni`idj`mod`omb`dkc`ogf`eaj`mbf`kdd`hip`gkg`eef`edi`jak
I have used 'Export All to CSV' to save to C:/q/qpad.csv.
I couldn't get your "razing" function to work as-is so I modified it slightly and used that to convert nested lists to strings and saved the file to csv.
q)f:{`$$[1=count x;string first x;" "sv string x]}
q)`:C:/q/q.csv 0: csv 0: update f'[allDates], f'[stackedSymCol] from example
Reading from both files and comparing the contents results in mismatched contents.
q)a:read0`:C:/q/q.csv
q)b:read0`:C:/q/qpad.csv
q)a~b
0b
Side note
Since kdb+ V4.0 2020.03.17 it is possible to save nested vectors to csv using .h.cd to prepare the text. The variable .h.d is used as the delimiter for sublist items.
q).h.d:" ";
q).h.cd example
"someOtherCol,allDates,stackedSymCol"
"8,2013.09.10,pii"
"6,2007.08.09 2012.12.30,hbg blg"
"8,2011.04.04 2020.08.21 2006.02.12 2005.01.15 2016.05.31 2015.01.03 2021.12.09 2022.03.26 2013.10.15 2001.10.29 2011.02.17 2010.03.28 2005.11.14 2003.08.16 2002.04.20 2004.08.07 2014.09.19 2000.05.24 2018.06.19 2017.08.14,cim pgm gha chp dio gfc beh mbo cfe kec jbn bjh eni obf agb dce gnk jif pci ppc"
q)`:somefile.csv 0: .h.cd example
CSV saved from q
Contents of the csv saved from q and the character count are shown in the example:
q)read0`:C:/q/q.csv
"someOtherCol,allDates,stackedSymCol"
"8,2013.09.10,pii"
"6,2007.08.09 2012.12.30,hbg blg"
"8,2011.04.04 2020.08.21 2006.02.12 2005.01.15 2016.05.31 2015.01.03 2021.12.09 2022.03.26 2013.10.15 2001.10.29 2011.02.17 2010.03.28 2005.11.14 2003.08.16 2002.04.20 2004.08.07 2014.09.19 2000.05.24 2018.06.19 2017.08.14,cim pgm gha chp dio gfc beh mbo cfe kec jbn bjh eni obf agb dce gnk jif pci ppc"
q)count raze read0`:C:/q/q.csv
383
CSV saved from QPad
Similarly the contents of the csv saved from QPad and the character count:
q)read0`:C:/q/qpad.csv
"someOtherCol,allDates,stackedSymCol"
"1,enlist 2006.01.13,enlist `hfg"
"1,2008.04.06 2008.01.11,`nha`plc"
"4,2009.06.12 2016.01.24 2021.02.02 2018.09.02 2011.06.19 2022.09.26 2008.10.29 2010.03.11 2022.07.30 2012.09.06 2021.11.27 2017.11.24 2007.09.10 2012.11.27 ...,`ifd`jgp`eln`kkb`ahm`cal`eni`idj`mod`omb`dkc`ogf`eaj`mbf`kdd`hip`gkg`eef`edi`jak"
q)count raze read0`:C:/q/qpad.csv
338
Conclusion
We can see from these examples the points outlined above. The dates are truncated at a certain length, enlist is added to nested lists of length 1, and backticks are kept before symbols.
The truncated dates could be the reason why the file you have exported from QPad is so much smaller. Based on your comments above the files are not identical, so this may be the reason.
TL;DR - Both files are created differently and that's why they differ.
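To get a feel for how much truncation alone can shrink a file, here is a rough illustration in Python (not q, and not QPad's actual export logic). The truncate helper and its 50-character limit are hypothetical; QPad's real cut-off is not stated in this thread. The data mirrors the ([]a:3#enlist til 1000;b:3#enlist til 1000) table from the first answer.

```python
def csv_field(items, sep=" "):
    """Join a nested list into one space-delimited CSV field."""
    return sep.join(str(x) for x in items)


def truncate(field, limit=50):
    """Hypothetical display-style truncation: cut the field and append ' ...'."""
    return field if len(field) <= limit else field[:limit] + " ..."


# Three rows, two columns, each cell holding the integers 0..999.
rows = [[list(range(1000)), list(range(1000))] for _ in range(3)]

full = [",".join(csv_field(col) for col in row) for row in rows]
cut = [",".join(truncate(csv_field(col)) for col in row) for row in rows]

full_size = sum(len(line) for line in full)
cut_size = sum(len(line) for line in cut)
print(full_size, cut_size)
```

The truncated size depends only on the limit, not on how long the nested lists are, so the gap grows with the amount of nested data.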

PowerShell script to compare 2 .csv files?

Is there a way to make a PowerShell script that compares 2 CSV files and creates a new .csv file with the words that are missing from one of them?
I have one CSV file with 24 million words in column 1.
And I have a second CSV file with 24 million words. I want to compare the two lists and see which words are missing; I know 1 million are missing.
So is there a way to make a PowerShell script that compares them? :)
Best Regards

Error when importing multiple CSV files from a certain folder using MATLAB

I am new to MATLAB programming. I am having trouble writing code that imports multiple CSV files from a certain folder and merges them into one:
This is my code:
%% Importing multiple CSV files
myDir = uigetdir; %gets directory
myFiles = dir(fullfile(myDir,'*.csv')); %gets all csv files in struct
for k = 1:length(myFiles)
data{k} = csvread(myFiles{k});
end
I use uigetdir so that data can be selected from any folder, because I am trying to build an automated program that is flexible for others to use. The code I run only finds the directory and lists its files; it does not merge the CSV files into one and read the result the way "Import Data" does. I want the files to be merged and read as one file.
My merged file should be semicolon-delimited and consist of 47 CSV files merged together (this picture shows one of the CSV files I have):
my merged file
I have been working on this for a whole day, but I always get an error. Please help me :(. Thank you very much in advance for your help.
As the error message states, you're attempting to reference myFiles as a cell array when it is not. The output of dir is a structure, which cannot be indexed like a cell array.
You want to do something like the following:
for k = 1:numel(myFiles)
    % dir returns a struct array, so index it with () and read its fields
    filepath = fullfile(myFiles(k).folder, myFiles(k).name);
    data{k} = csvread(filepath);
end

spark - write to separate files by key

My code:
df.write.partitionBy("col").format("json").save(output)
then my output:
col=1
col=2
col=3
I need CSV format, but when I change format("json") to format("csv") I don't get separate files, just one file with all the data:
part-00000
Why, and how can I fix it?

how to find the difference between a csv file and a file containing only one column of this csv

I have a CSV file containing some user data; it looks like this:
"10333","","an.10","Kenyata","","Aaron","","","","","","","","","",""
"12222","","an.4","Wendy","","Aaron","","","","","","","","","",""
"14343","","aaron.5","Nanci","","Aaron","","","","","","","","","",""
I also have a file which has an item on each line like this:
an.10
aaron.5
What I want is to find only the lines in the CSV file contained in the list file.
So desired output would be:
"10333","","an.10","Kenyata","","Aaron","","","","","","","","","",""
"14343","","aaron.5","Nanci","","Aaron","","","","","","","","","",""
(Note how an.4 is not contained in this new list.)
I have any environment available to me and am willing to try just about anything aside from doing this manually, as this CSV contains millions of records and there are about 100k entries in the list itself.
How unique are the identifiers an.10 and the like?
Maybe a very small *x shell script would be enough:
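for i in $(sort -u list.txt); do grep -F "\"$i\"" data.csv; done
That would, for every unique entry in the list, return all matching lines in the CSV file. Here sort -u removes duplicates even when they are not adjacent, and -F makes grep treat the entry as a literal string, so the dot in an.10 is not a wildcard. It does not match exclusively on the third column, however. (That could be done with awk, for example.)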
If the csv file is data.csv and the list file is list.txt, I would do this:
while IFS= read -r i; do grep -F -- "$i" data.csv; done < list.txt
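With millions of records and about 100k list entries, running grep once per entry rescans the CSV every time. A single pass over the CSV with a set lookup avoids that. A minimal Python sketch, assuming the identifier always sits in the third column (index 2) as in the sample data; the function name filter_rows and the output file name are hypothetical:

```python
import csv


def filter_rows(csv_path, list_path, out_path, key_column=2):
    """Copy only the CSV rows whose key column appears in the list file."""
    with open(list_path) as handle:
        # A set gives constant-time membership tests; blank lines are skipped.
        wanted = {line.strip() for line in handle if line.strip()}

    with open(csv_path, newline="") as src, open(out_path, "w", newline="") as dst:
        # QUOTE_ALL keeps every field quoted, as in the sample data.
        writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
        for row in csv.reader(src):
            if len(row) > key_column and row[key_column] in wanted:
                writer.writerow(row)


# Hypothetical call using the file names from the answers above:
# filter_rows("data.csv", "list.txt", "matches.csv")
```

Because the comparison is an exact match on one column, an entry such as an.1 would not accidentally select an.10, which a substring grep would.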