(q/kdb+) Create a table with string column - kdb

I can create the following table in kdb using
([]idx:0,1,2;str:"a","b","c")
idx str
0 a
1 b
2 c
but I can't do, for instance,
([]idx:0,1,2;str:"aa","bb","cc")
I would like to get
idx str
0 aa
1 bb
2 cc
What am I doing wrong when creating this string column?

Use parentheses and semicolons rather than commas to separate the list items:
q)([]idx:(0;1;2);str:("aa";"bb";"cc"))
idx str
--------
0 "aa"
1 "bb"
2 "cc"
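For what it's worth, the reason the comma version breaks: in q the comma is the join operator, so "a","b","c" happens to build the 3-character list "abc" (which lines up with idx), but "aa","bb","cc" collapses into one 6-character list whose length no longer matches idx. A quick console check (just a sketch of what you would see):
q)"aa","bb","cc"           / join: one 6-item char list, not 3 strings
"aabbcc"
q)count ("aa";"bb";"cc")    / semicolons keep three separate strings
3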


Convert the contents of columns containing numeric text to numbers

I have a csv file that consists of text or numbers. Some columns are corrupted, as seen in the image below ("<<"K.O). When I open the csv file via Matlab (without importing), it converts them to numbers and defines invalid values such as "<<"K.O as NaN, which is what I want. But when I read the file via a script I wrote:
opts = detectImportOptions(filedir);
table = readtable(filedir,opts);
It reads them as char arrays. Since I have many different csv files (the columns differ between them), I want to do this automatically rather than using textscan (since it requires the file format, and my file format is different for each csv file). Is there any way to convert the contents of columns containing numeric text to numbers automatically?
As far as I can understand from your comments, this is what you are actually looking for:
for i = 1:numel(files)
    file = fullfile(folder, files(i).name);
    opts = detectImportOptions(file);
    idx = strcmp(opts.VariableNames, 'Grade');
    if any(idx)
        opts.VariableTypes(idx) = {'double'};
    end
    tabs{i} = readtable(file, opts);   % cell array, since the files can have different columns
end
Assuming you have your data stored in a table, you can attempt to convert each column of character arrays to numeric values using str2double. Any values that don't convert to a numeric value (empty entries, words, non-numeric strings, etc.) will be converted to NaN.
Since you want to do the conversions automatically, we'll have to make one key assumption: any column that converts to all NaN values should remain unchanged. In such a case, the data was likely either all non-convertible character arrays, or already numeric. Given that assumption, this generic conversion could be applied to any table T:
for varName = T.Properties.VariableNames
    numData = str2double(T.(varName{1}));
    if ~all(isnan(numData))
        T.(varName{1}) = numData;
    end
end
As a test, the following sample data:
T = table((1:5).', {'Y'; 'N'; 'Y'; 'Y'; 'N'}, {'pi'; ''; '1.4e5'; '1'; 'A'});
T =
Var1 Var2 Var3
____ ____ _______
1 'Y' 'pi'
2 'N' ''
3 'Y' '1.4e5'
4 'Y' '1'
5 'N' 'A'
Will be converted to the following by the above code:
T =
Var1 Var2 Var3
____ ____ ______
1 'Y' NaN
2 'N' NaN
3 'Y' 140000
4 'Y' 1
5 'N' NaN
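If it helps with the many-files case, the readtable call and the conversion loop above can be folded into one small function; the name readAndConvert is just for illustration:
function T = readAndConvert(filedir)
    % Read one csv file and convert every column of numeric text to doubles,
    % leaving columns that don't convert (or are already numeric) unchanged.
    opts = detectImportOptions(filedir);
    T = readtable(filedir, opts);
    for varName = T.Properties.VariableNames
        numData = str2double(T.(varName{1}));
        if ~all(isnan(numData))
            T.(varName{1}) = numData;
        end
    end
end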

How to remove double quotes and extra delimiter(s) within double quotes of a text-qualified file in Scala

I have a lot of delimited files with a text qualifier (every column starts and ends with a double quote). The delimiter is not consistent, i.e. it can be anything such as comma (,), pipe (|), ~, or tab (\t).
I need to read such a file with spark.read.textFile (a single column) and then remove the text qualifier and replace any delimiter that appears within double quotes with a space. I want to do this without considering columns, i.e. I should not split into columns.
Below is test data with 3 columns: ID, Name and DESC. The DESC column contains extra delimiters.
val y = """4 , "XAA" , "sf,sd\nsdfsf""""
val pattern = """"[^"]*(?:""[^"]*)*"""".r
val output = pattern replaceAllIn (y, m => m.group(0).replaceAll("[,\n]", " "))
I found the code above, which works fine for a static value, but I am not able to apply it to a DataFrame.
"ID","Name","DESC"
"1" , "ABC", "A,B C"
"2" , "XYZ" , "ABC is bother"
"3" , "YYZ" , "FER" sfsf,sfd f"
4 , "XAA" , "sf,sd sdfsf"
I need output as
ID,Name,DESC
1 , ABC , A B C
2 , XYZ , ABC is bother
3 , YYZ , FER" sfsf sfd f
4 , XAA , sf sd sdfsf
Thanks in Advance.
Resolved
import org.apache.spark.sql.functions.{col, udf}

// Inside every text-qualified (quoted) field, replace commas with spaces.
def RemoveQualifier = udf((RawData: String) => {
  val pattern = """"[^"]*(?:""[^"]*)*"""".r
  pattern.replaceAllIn(RawData, m => m.group(0).replaceAll("[,]", " "))
})

val SourceFile = spark.read.textFile("/data/test.csv")
val SourceFileDF = SourceFile.withColumn("value", RemoveQualifier(col("value")))
Thanks.
You can chain two replaceAll() calls like this:
val output = pattern replaceAllIn (y, m => m.group(0).replaceAll("[,\\\\n]", " ").replaceAll("\"|\"", ""))
output: String = 4 , XAA , sf sd sdfsf
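If you want the whole thing in one pass, here is a sketch (not the accepted code above) that folds the second replaceAll into the UDF, so qualified fields lose their inner delimiters and the qualifier quotes are dropped as well; the delimiter set [,|~\t] is an assumption based on the delimiters mentioned in the question:
import org.apache.spark.sql.functions.{col, udf}

val cleanQualified = udf((raw: String) => {
  val pattern = """"[^"]*(?:""[^"]*)*"""".r
  pattern
    .replaceAllIn(raw, m => m.group(0).replaceAll("[,|~\t]", " "))  // blank out delimiters inside quoted fields
    .replaceAll("\"", "")                                           // then drop the qualifier quotes
})

val cleanedDF = spark.read.textFile("/data/test.csv")
  .withColumn("value", cleanQualified(col("value")))
Malformed rows like row 3 in the sample (an unescaped inner quote) may still need special handling.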

Strange behaviour in size(strfind(n,',')) for n = 44

For some reason in
size(strfind(n,','))
the number 44 is special and produces a "comma found" result:
value={55}
numCommas = size(strfind(value{1},','),2)
ans= 0 ...(GOOD)
value={44}
numCommas = size(strfind(value{1},','),2)
ans= 1 ...(BAD) - Why is it doing this?
value={'44,44,44'}
numCommas = size(strfind(value{1},','),2)
ans= 2 ...(GREAT)
I need to find the number of commas in a cell element, where the element can either be an integer or a string.
To elaborate on my comment: the ASCII code for a comma (,) is 44. Effectively what you are doing in your code is
size(strfind(44,','),2)
or
size(strfind(char(44),','),2)
where 44 is not a string but a numeric value, which is then converted to a character and results in a comma (,), as we can see when we use char:
>> char(44)
ans =
,
You can fix your code by changing
value={44}
to
value={'44'}
so then you will be performing strfind on a string instead of a numeric value.
>> size(strfind('44', ','), 2)
ans =
0
which provides the correct answer.
Alternatively you could use num2str
>> size(strfind(num2str(value{1}), ','), 2)
ans =
0
You can avoid this by simply doing value{1} = '44'. Or if that's not an alternative, use num2str like this:
value={44};
numCommas = size(strfind(num2str(value{1}),','),2)
numCommas =
0
This will also work for string inputs:
value={'44,44,44'};
numCommas = size(strfind(num2str(value{1}),','),2)
numCommas =
2
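If you need this in several places, the num2str trick can be wrapped in a small anonymous function (the name countCommas is just an illustration):
countCommas = @(c) numel(strfind(num2str(c), ','));

countCommas(44)          % 0, since numeric 44 is formatted as the text '44' first
countCommas('44,44,44')  % 2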
Why do you get "wrong" results?
It's because 44 is the ASCII code for comma ,.
You can check this quite simply by casting the value to char.
char(44)
ans =
,
You are checking for commas in a string. As the input to strfind is an integer, it is automatically cast to char. In the last example, you are passing a "real" string, so it finds the two commas in there.
Try this one:
value={'44'}
numCommas = size(strfind(value{1},','),2)
instead of:
value={44}
numCommas = size(strfind(value{1},','),2)
It should work, since it's a char now.

Perl: print certain rows based on certain values of column

Hey guys, I'm a beginner in Perl programming. In my list.txt I have 5 rows and 7 columns, and what I want to do is print certain rows based on the values that certain columns have. For example:
NO. RES REF ERRORS WARNING PROB_E PROB_C
1 k C 0 0 0.240 0.713
2 l C 16 2 0.365 0.568
3 n C 7 4 0.365 0.568
4 f E 0 0 0.613 0.342
I want to print all the rows whose value in columns 3 or 4 (ERRORS and WARNING) is different from 0. In this case the output should be rows 2 and 3. I hope I make myself clear :) sorry for my poor English.
Try this:
perl -ane 'print if ($F[3] or $F[4])' list.txt
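Since you're a beginner, it may help to see what the switches do: -n wraps the code in a read-every-line loop, -a autosplits each line on whitespace into @F (0-indexed, so $F[3] and $F[4] are the ERRORS and WARNING columns), and -e supplies the code. Roughly the same thing as a plain script (a sketch, not a literal expansion of the one-liner):
#!/usr/bin/perl
use strict;
use warnings;

while (my $line = <>) {
    my @F = split ' ', $line;
    # Print the row if ERRORS or WARNING is non-zero (true).
    print $line if ($F[3] or $F[4]);
}
Note that the header line is printed too, because the strings ERRORS and WARNING are true in boolean context; skip it with an extra check on $. (the input line number) if you don't want it.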

Unicode character transformation in SPSS

I have a string variable. I need to convert all non-digit characters to spaces (" "). I have a problem with Unicode characters: characters outside the basic character set get converted to invalid characters. See the code below for an example.
Is there another way to achieve the same result with a procedure that does not choke on special Unicode characters?
new file.
set unicode = yes.
show unicode.
data list free
/T (a10).
begin data
1234
5678
absd
12as
12(a
12(vi
12(vī
12āčž
end data.
string Z (a10).
comp Z = T.
loop #k = 1 to char.len(Z).
if ~range(char.sub(Z, #k, 1), "0", "9") sub(Z, #k, 1) = " ".
end loop.
comp Z = normalize(Z).
comp len = char.len(Z).
list var = all.
exe.
The result:
T Z len
1234 1234 4
5678 5678 4
absd 0
12as 12 2
12(a 12 2
12(vi 12 2
12(vī 12 � 6
>Warning # 649
>The first argument to the CHAR.SUBSTR function contains invalid characters.
>Command line: 1939 Current case: 8 Current splitfile group: 1
12āčž 12 �ž 7
Number of cases read: 8 Number of cases listed: 8
The substr function should not be used on the left hand side of an expression in Unicode mode, because the replacement character may not be the same number of bytes as the character(s) being replaced. Instead, use the replace function on the right hand side.
The corrupted characters you are seeing are due to this size mismatch.
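For example, keeping the structure of the original loop but doing the replacement on the right-hand side with REPLACE might look like this (a sketch only, reusing the question's variable Z):
* Same loop, but the replacement happens on the right-hand side with REPLACE.
LOOP #k = 1 TO CHAR.LENGTH(Z).
DO IF ~RANGE(CHAR.SUBSTR(Z, #k, 1), "0", "9").
COMPUTE Z = REPLACE(Z, CHAR.SUBSTR(Z, #k, 1), " ").
END IF.
END LOOP.
EXECUTE.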
How about, instead of replacing non-numeric characters, you cycle through the string, pull out the numeric characters and rebuild Z? (Note my version here uses the pre-CHAR. string functions.)
data list free
/T (a10).
begin data
1234
5678
absd
12as
12(a
12(vi
12(vī
12āčž
12as23
end data.
STRING Z (a10).
STRING #temp (A1).
COMPUTE #len = LENGTH(RTRIM(T)).
LOOP #i = 1 to #len.
COMPUTE #temp = SUBSTR(T,#i,1).
DO IF INDEX('0123456789',#temp) > 0.
COMPUTE Z = CONCAT(SUBSTR(Z,1,#i-1),#temp).
ELSE.
COMPUTE Z = CONCAT(SUBSTR(Z,1,#i-1)," ").
END IF.
END LOOP.
EXECUTE.