Convert symbol to a list of symbols - kdb

I read in a csv file:
csvFile:1!("SSS"; enlist ",") 0: hsym `$"\\\\location\\of\\csv";
But lists of symbols are read as single symbols. e.g. in the csv file I have `a`b`c but
csvFile`some_keyed_value
`col1`col2!``a`b`c`
What I want is this - and note how a single backtick should become an empty list:
`col1`col2!(`a`b`c;())
Is there a way to make this cast, read in the csv differently, or modify the csv so that it reads in correctly? Any modifications I make to the csv (e.g. replacing ` with ()) simply convert it to a single symbol (e.g. I get `()).
Here is a screenshot of a few lines from the csv

For your input, the following works. You can also use cut instead of vs, as in the previous example; a sketch of the cut-based variant follows the output below.
q)update `$table, `$_[1;]each vs["`";]each writeAllow, `$_[1;]each vs["`";]each writeLog from ("***";enlist",")0:`:tmp.csv
table       writeAllow          writeLog
------------------------------------------------
:/loader/P1 `pg`sec-fg-id       `symbol$()
:/loader/P2 `pg`shara`mcdonald  `pg`MD`svc
:/loader/P3 `symbol$()          `pg`MD`svc
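Here is a minimal sketch of that cut-based variant (same tmp.csv as above; the helper name f is just for illustration):
q)f:{`$1_'where["`"=x]cut x}   / cut at each backtick, drop it, cast to symbols
q)update `$table, f each writeAllow, f each writeLog from ("***";enlist",")0:`:tmp.csv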
You should probably reconsider storing the sym data with backticks - it would be straightforward to store with a different delimiter to separate the sub-records and have a dedicated function for parsing those fields.
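For example, a minimal sketch assuming a hypothetical pipe-delimited layout for the sub-records (parseSyms and the | delimiter are illustrative choices, not part of the original data):
q)parseSyms:{$[x~"";`symbol$();`$"|"vs x]}   / empty field -> empty symbol list
q)parseSyms"pg|sec-fg-id"
`pg`sec-fg-id
q)parseSyms""
`symbol$()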

Taking this csv as an example:
cat ex.csv
x,y
`aa`bb,
`cc`dd,`ee`ff
,`gg`hh
You need to load those nested symbol columns in as strings first:
q)show tab:("**";enlist",")0:`:ex.csv
x        y
-----------------
"`aa`bb" ""
"`cc`dd" "`ee`ff"
""       "`gg`hh"
From here you then need to drop the backticks and convert the strings to symbols. One possible way to do this is:
q)update {`$1_'where["`"=x]cut x}'[x] from tab
x          y
--------------------
`aa`bb     ""
`cc`dd     "`ee`ff"
`symbol$() "`gg`hh"
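If both columns need converting, the same lambda can be applied to y as well - a minimal sketch, defining the lambda once as a helper f:
q)f:{`$1_'where["`"=x]cut x}
q)update f'[x], f'[y] from tab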

You can also use the function value to convert a string with embedded backticks to a symbol list.
// read in as strings as others have said
q)("**";1#",")0:`:test.csv
x y
-------------
"`a`b" "`c`d"
"`e`f" "`g`f"
// value each string to convert to symbol lists
q)value@''("**";1#",")0:`:test.csv
x   y
-------
a b c d
e f g f
// check it's now a nested symbol type
q)meta value@''("**";1#",")0:`:test.csv
c| t f a
-| -----
x| S
y| S
You can of course use value on just the specific columns you need, if that serves your purposes, i.e. update value each col1 from ...
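For instance, a minimal sketch of that per-column form against the same test.csv (column y is left as strings):
q)update value each x from ("**";1#",")0:`:test.csv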

q)t
table      writeAllow             writeLog
----------------------------------------------------------
:loader/p1 "`pg`sec-fg-id"        ""
:loader/p2 "`pg`shara`cmacdonald" "`pg`MD`svc"
:loader/p3 ""                     "`pg`MD`svc"
q)foo:{`$1_"`"vs x}each
q)update foo[writeAllow], foo[writeLog] from t
table      writeAllow            writeLog
------------------------------------------------------
:loader/p1 `pg`sec-fg-id         `symbol$()
:loader/p2 `pg`shara`cmacdonald  `pg`MD`svc
:loader/p3 `symbol$()            `pg`MD`svc

Related

Spark scala CSV write with option escape and quote together breaks CSV file

The input is below:
Name Text
------------------------------------------------
A'   D,John
B    "AB
C    A"B"
D    This is "78-DC-DF-001"20 23:11:01 - 12323
I am using the below code to write the above data into a CSV file (output is a DataFrame containing the input data):
output.coalesce(1).write
  .format("csv")
  .option("escape", "")
  .option("quote", "")
  .save("Output")
When I use only the escape option, the output is like below, which is not correct:
Name Text
--------------------------------------------------
A'   D,John
B    "AB""
C    A""B""""
D    This is ""78-DC-DF-001""20 23:11:01 - 12323""
And when I use both escape and quote together, it shifts the comma-containing values into extra columns, like below:
Name Text
------------------------------------------------
A'   D     John
B    "AB
C    A"B"
D    This is "78-DC-DF-001"20 23:11:01 - 12323
Any suggestion on how to resolve this issue in Spark Scala? I need the output to be the same as the input.

How to strip extra spaces when writing from dataframe to csv

I read in multiple sheets (6) from an xlsx file and created individual dataframes. I want to write each one out to a pipe-delimited csv.
ind_dim.to_csv (r'/mypath/ind_dim_out.csv', index = None, header=True, sep='|')
Currently outputs like this:
1|value1 |value2 |word1 word2 word3 etc.
Want to strip trailing blanks
Suggestion
Apply the method .apply(lambda x: x.str.rstrip()) to your output DataFrame (prior to the .to_csv() call) to strip the trailing blank from each field across the DataFrame. It would look like:
Change:
ind_dim.to_csv(r'/mypath/ind_dim_out.csv', index = None, header=True, sep='|')
To:
ind_dim.apply(lambda x: x.str.rstrip()).to_csv(r'/mypath/ind_dim_out.csv', index = None, header=True, sep='|')
It can be easily inserted into the output call using '.' chaining. To handle multiple data types, we can enforce the 'object' dtype on import by including the argument dtype='str':
ind_dim = pd.read_excel('testing_xlsx_nums.xlsx', header=0, index_col=0, sheet_name=None, dtype='str')
Or on the DataFrame itself by:
df = pd.DataFrame(df, dtype='str')
Proof
I did a mock-up where the .xlsx document has 5 sheets, with each sheet having three columns: the first column with all numbers except an empty cell in row 2; the second column with both a leading blank and a trailing blank on strings, an empty cell in row 3, and a number in row 4; and the third column with all strings having a leading blank, and an empty value in row 4. Integer indexes and integer columns have been included. The text in each sheet is:
       0        1        2
0  11111  valueB1  valueC1
1         valueB2  valueC2
2  33333           valueC3
3  44444    44444
4  55555  valueB5  valueC5
This code reads our .xlsx testing_xlsx_nums.xlsx into the DataFrame dictionary ind_dim.
Next, it loops through each sheet using a for loop to place the sheet name variable as a key to reference the individual sheet DataFrame. It applies the .str.rstrip() method to the entire sheet/DataFrame by passing the lambda x: x.str.rstrip() lambda function to the .apply() method called on the sheet/DataFrame.
Finally, it outputs the sheet/DataFrame as a .csv with the pipe delimiter using .to_csv() as seen in the OP post.
import pandas as pd

# reads xlsx in
ind_dim = pd.read_excel('testing_xlsx_nums.xlsx', header=0, index_col=0, sheet_name=None, dtype='str')
# loops through sheets, applies rstrip(), output as csv '|' delimit
for sheet in ind_dim:
    ind_dim[sheet].apply(lambda x: x.str.rstrip()).to_csv(sheet + '_ind_dim_out.csv', sep='|')
Returns:
|0|1|2
0|11111| valueB1| valueC1
1|| valueB2| valueC2
2|33333|| valueC3
3|44444|44444|
4|55555| valueB5| valueC5
(Note our column 2 strings no longer have the trailing space).
We can also reference each sheet using a loop that cycles through the dictionary items; the syntax would look like for k, v in dict.items() where k and v are the key and value:
# reads xlsx in
ind_dim = pd.read_excel('testing_xlsx_nums.xlsx', header=0, index_col=0, sheet_name=None, dtype='str')
# loops through sheets, applies rstrip(), output as csv '|' delimit
for k, v in ind_dim.items():
    v.apply(lambda x: x.str.rstrip()).to_csv(k + '_ind_dim_out.csv', sep='|')
Notes:
We'll still need to apply the correct arguments for selecting/ignoring indexes and columns with the header= and names= parameters as needed. For these examples I just passed None for simplicity.
The other methods that strip leading and leading & trailing spaces are: .str.lstrip() and .str.strip() respectively. They can also be applied to an entire DataFrame using the .apply(lambda x: x.str.strip()) lambda function passed to the .apply() method called on the DataFrame.
Only 1 Column: If we only wanted to strip from one column, we can call the .str methods directly on the column itself. For example, to strip leading & trailing spaces from a column named column2 in DataFrame df we would write: df.column2.str.strip().
Data types not string: When importing our data, pandas will assume data types for columns with a similar data type. We can override this by passing dtype='str' to the pd.read_excel() call when importing.
pandas 1.0.1 documentation (04/30/2020) on pandas.read_excel:
"dtype : Type name or dict of column -> type, default None
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}. Use object to preserve data as stored in Excel and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion."
We can pass the argument dtype='str' when importing with pd.read_excel() (as seen above). If we want to enforce a single data type on a DataFrame we are working with, we can set it equal to itself and pass it to pd.DataFrame() with the argument dtype='str', like: df = pd.DataFrame(df, dtype='str')
Hope it helps!
The following trims left and right spaces fairly easily:
if (!require(dplyr)) {
  install.packages("dplyr")
}
library(dplyr)
if (!require(stringr)) {
  install.packages("stringr")
}
library(stringr)

setwd("~/wherever/you/need/to/get/data")
outputWithSpaces <- read.csv("CSVSpace.csv", header = FALSE)
print(head(outputWithSpaces), quote = TRUE)

# str_trim(string, side = c("both", "left", "right"))
outputWithoutSpaces <- outputWithSpaces %>% mutate_all(str_trim)
print(head(outputWithoutSpaces), quote = TRUE)
Starting Data:
V1 V2 V3 V4
1 "Something is interesting. " "This is also Interesting. " "Not " "Intereting "
2 " Something with leading space" " Leading" " Spaces with many words." " More."
3 " Leading and training Space. " " More " " Leading and trailing. " " Spaces. "
Resulting:
V1 V2 V3 V4
1 "Something is interesting." "This is also Interesting." "Not" "Intereting"
2 "Something with leading space" "Leading" "Spaces with many words." "More."
3 "Leading and training Space." "More" "Leading and trailing." "Spaces."

How to check for valid file name format in kdb/q?

I'd like to check that the file names in my directory are all formatted properly. First I create a variable dir and then use the keyword key to see what files are listed...
q)dir:`:/myDirectory/data/files
q)dirkey:key dir
q)dirkey
`FILEA_XYZ_20190501_b233nyc9_OrderPurchase_000123.json
`FILEB_ABC_20190430_b556nyc1_OrderSale_000456.meta
I select and parse the .json file name...
q)dirjsn:dirkey where dirkey like "*.json"
q)sepname:raze{"_" vs string x}'[dirjsn]
"FILEA"
"XYZ"
"20190501"
"b233nyc9"
"OrderPurchase"
"000123.json"
Next I'd like to confirm that each character in sepname[0] and sepname[1] is a letter, that the characters in sepname[2] are numeric/temporal, and that sepname[3] contains alphanumeric values.
What is the best way to optimize the following sequential if statements for performance, and how can I check for alphanumeric values, as in the case of sepname[3], rather than just one type or the other?
q)if[not sepname[0] like "*[A-Z]";:show "Incorrect Submitter"];
if[not sepname[1] like "*[A-Z]";:show "Incorrect Reporter"];
if[not sepname[2] like "*[0-9]";:show "Incorrect Date"];
if[not sepname[3] like " ??? ";:show "Incorrect Kind"];
show "Correct File Format"
If your valid filenames always have that same structure (specifically 5 chars, 3 chars, 8 chars, 8 chars) then you can use a single regex-style like statement, like so:
dirjsn:("FILEA_XYZ_20190501_b233nyc9_OrderPurchase_000123.json";"F2ILEA_XYZ_20190501_b233nyc9_OrderPurchase_000123.json";"FILEA_XYZ2_20190501_b233nyc9_OrderPurchase_000123.json";"FILEA_XYZ_2A190501_b233nyc9_OrderPurchase_000123.json";"FILEA_XYZ_20190501_b233%yc9_OrderPurchase_000123.json";"FILEA_XYZ_20190501_b233nyc9_OrderPurchase_000123.json");
q)dirjsn
FILEA_XYZ_20190501_b233nyc9_OrderPurchase_000123.json
F2ILEA_XYZ_20190501_b233nyc9_OrderPurchase_000123.json
FILEA_XYZ2_20190501_b233nyc9_OrderPurchase_000123.json
FILEA_XYZ_2A190501_b233nyc9_OrderPurchase_000123.json
FILEA_XYZ_20190501_b233%yc9_OrderPurchase_000123.json
FILEA_XYZ_20190501_b233nyc9_OrderPurchase_000123.json
q)AZ:"[A-Z]";n:"[0-9]";Azn:"[A-Za-z0-9]";
q)dirjsn where dirjsn like raze(AZ;"_";AZ;"_";n;"_";Azn;"*")where 5 1 3 1 8 1 8 1
"FILEA_XYZ_20190501_b233nyc9_OrderPurchase_000123.json"
"FILEA_XYZ_20190501_b233nyc9_OrderPurchase_000123.json"
like will not work in this case as we need to check each character. One way to do that is to use in and inter:
q) a: ("FILEA"; "XYZ"; "20190501"; "b233nyc9")
Create a character set
q) c: .Q.a, .Q.A
For the first 3 cases, check if each character belongs to a specific set:
q) r1: all@'(3#a) in' (c;c;.Q.n) / output 111b
For the alphanumeric case, check that it contains both numbers and letters and no other symbols:
q)r2: (sum[b]=count a[3]) & all b:sum@'a[3] in/: (c;.Q.n) / output 1b
Print output/errors:
q) errors: ("Incorrect Submitter";"Incorrect Reporter";"Incorrect Date";"Incorrect Kind")
q) show $[0=count r:where not r1,r2;"All good";errors r]
q) "All good"

Matlab: Read a certain file and make a calculation on columns

I have files named 2.txt 4.txt 8.txt 12.txt 14.txt, and each file's structure looks like this:
I want to read each designated file and do some calculations with the designated columns. For instance, after calling the 2.txt file I want to calculate
column(A)+column(I)
The questions:
How can I call a certain file by its name?
How can I do calculations with the file's columns?
Here is my code
function[t]=ad(x)
folderName='C:\Users\zeldagrey6\Desktop\AD';
fileinfo=dir([folderName filesep '**/*.txt'] );
filename={fileinfo.name};
fullFileName=[folderName filesep filename{x}];
d=readtable(fullFileName, 'ReadVariableNames', true);
t=d.A+d.I;
end
The problems with the code:
When I call ad(2), it reads 4.txt instead of 2.txt. I guess it does not care about the names of the text files, it just reads them according to their sequence in the directory.
Is there any way to refer to each column as var1, var2 and do calculations with var1+var2 instead of d.A+d.I?
Yes, you can refer to table contents with curly brackets like this:
A = (30.1:0.1:30.5)';
I = (324:328)';
Angle = (35:5:55)';
FWHM = (0.2:0.05:0.4)';
d = table(A,I,Angle,FWHM);
t1 = d.A + d.I;
t2 = d{:,1} + d{:,2};
See that t1 and t2 are equal

Extract specific column information from table in MATLAB

I have several *.txt files with 3 columns of information; here is just an example of one file:
namecolumn1 namecolumn2 namecolumn3
#----------------------------------------
name1.jpg someinfo1 name
name2.jpg someinfo2 name
name3.jpg someinfo3 name
othername1.bmp info1 othername
othername2.bmp info2 othername
othername3.bmp info3 othername
I would like to extract only the names starting with name, taking the values from column 1 ("namecolumn1").
My code look like this:
file1 = fopen('test.txt','rb');
c = textscan(file1,'%s %s %s','Headerlines',2);
tf = strcmp(c{3}, 'name');
info = c{1}{tf};
The problem is that when I do disp(info) I get only the first entry from the table, name1.jpg, and I would like to have all of them:
name1.jpg
name2.jpg
name3.jpg
You're pretty much there. What you're seeing is an example of MATLAB's Comma Separated List, so MATLAB is returning each value separately.
You can verify this by entering c{1}{tf} in the command line after running your script, which returns:
>> c{1}{tf}
ans =
name1.jpg
ans =
name2.jpg
ans =
name3.jpg
Though sometimes we'd want to concatenate them, in the case of character arrays I think the result is more difficult to work with than retaining the cell arrays:
>> info = [c{1}{tf}]
info =
name1.jpgname2.jpgname3.jpg
versus
>> info = c{1}(tf)
info =
'name1.jpg'
'name2.jpg'
'name3.jpg'
The former would require you to reshape the result (and whitespace pad, if the strings are different lengths), whereas you can index the strings in a cell array directly without having to worry about any of that (e.g. info{1}).