Remove string prefix from first column - sed

I need to remove the prefix chr in the first column
column1 column2
chr1 123456
chr2 125679
to look like
1 123456
2 125679
I tried sed -i 's/chr//g', but it will create an empty space.

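Try it without the g flag, and anchor the pattern so that only a leading chr is removed (file stands for your input file):
sed -i 's/^chr//' file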
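For comparison, the same transformation can be sketched in Python. The sample lines are copied from the question, and the helper name strip_chr_prefix is made up for this illustration. Anchoring the pattern with ^ limits the substitution to a prefix at the very start of each line.

```python
import re

def strip_chr_prefix(line):
    # Remove "chr" only when it is at the start of the line;
    # any later occurrence in the line is left untouched.
    return re.sub(r"^chr", "", line)

lines = ["column1 column2", "chr1 123456", "chr2 125679"]
cleaned = [strip_chr_prefix(line) for line in lines]
print("\n".join(cleaned))
```

The header line passes through unchanged because it does not start with chr.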

Truncate multiple columns in PySpark Python

I have a table with multiple columns and only some of these columns need to be truncated down.
For example a text field might go beyond 7 characters and this needs to be reduced.
Let's say I have df:
Column A    Column B  Column C  Column D
aaaaaaaaaa  12345     abcdefg   Cell1
bbbbbbbbbb  12345     abcdefg   Cell2
cccccccccc  12345     abcdefg   Cell3
dddddddddd  12345     abcdefg   Cell4
eeeeeeeeee  12345     abcdefg   Cell5
ffffffffff  12345     abcdefg   Cell6
gggggggggg  12345     abcdefg   Cell7
I can see that Columns A and C need truncating down to 5 characters.
from pyspark.sql.functions import substring
col_to_truncate = ['Column A', 'Column C']
df.withColumn('Column A', substring('Column A', 1, 5)).withColumn('Column C', substring('Column C', 1, 5))
The code will work but what if I want to process lots of columns dynamically, is my only option using a for loop?
Is it possible to use list comprehension rather than a for loop?
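You can use select with a list comprehension instead of chaining withColumn, truncating only the columns listed in col_to_truncate and passing the others through unchanged:
df.select(*[substring(c, 1, 5).alias(c) if c in col_to_truncate else c for c in df.columns])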
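The shape of that comprehension can be checked without a Spark session. The sketch below is only a plain-Python analogy, not PySpark code: rows are modelled as dicts (an assumption made for illustration), and a conditional expression truncates just the listed columns.

```python
# Plain-Python analogy: each row is a dict keyed by column name.
col_to_truncate = ["Column A", "Column C"]
rows = [
    {"Column A": "aaaaaaaaaa", "Column B": 12345, "Column C": "abcdefg", "Column D": "Cell1"},
    {"Column A": "bbbbbbbbbb", "Column B": 12345, "Column C": "abcdefg", "Column D": "Cell2"},
]

# Truncate to 5 characters only where the column is listed; keep other values as-is.
truncated = [
    {c: (v[:5] if c in col_to_truncate else v) for c, v in row.items()}
    for row in rows
]
print(truncated[0])
```

Columns B and D keep their original values and types, which is the same reason for leaving unlisted columns out of substring in the Spark version.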

How to convert a symbol to a string in kdb+?

For example, if I have a list of symbols, e.g. (`A.ABC;`B.DEF;`C.GHI) or (`A;`B;`C), how could I convert each item in the list to a string?
string will convert them. It's an atomic function, so it applies to every item of the list:
q)string (`A.ABC;`B.DEF;`C.GHI)
"A.ABC"
"B.DEF"
"C.GHI"
You can use the keyword string to do this:
q)lst:(`A;`B;`C)
// convert to list of strings
q)string lst
,"A"
,"B"
,"C"
As the others have mentioned, string is what you're after. In your example if you're interested in separating the prefix and suffix separated by the . you can do
q)a:(`A.ABC;`B.DEF;`C.GHI)
q)` vs' a
A ABC
B DEF
C GHI
and if you want to convert these to strings you can just use string again on the above.
q)string each (`A.ABC;`B.DEF;`C.GHI)
"A.ABC"
"B.DEF"
"C.GHI"
Thanks all, useful answers! While I was trying to solve this on my own in parallel, I came across ($) that appears to work as well.
q)example:(`A;`B;`C)
q)updatedExample:($)example;
q)updatedExample
enlist "A"
enlist "B"
enlist "C"
Use the string function (lowercase; q is case-sensitive), for example on a table column:
q)d
employeeID firstName lastName
-----------------------------------------------------
1001 Employee 1 First Name Employee 1 Last Name
1002 Employee 2 First Name Employee 2 Last Name
q)update firstName:string(firstName) from `d
`d
q)d
employeeID firstName lastName
-------------------------------------------------------
1001 "Employee 1 First Name" Employee 1 Last Name
1002 "Employee 2 First Name" Employee 2 Last Name

How to build a dictionary from contents of a csv file in kdb?

I have a csv file with contents like below
source,address,table,tableName,sym,symSet
source_one,addr1:port1:id1:pass1,table_one,tableName1,syms_one,SYM1 SYM2 SYM3
source_two,addr2:port2:id2:pass2,table_two,tableName2,syms_two,SYM21 SYM22 SYM23
My code to load a csv into a table is as below
table:("******";enlist ",") 0: `:sourceFileName.csv
I want to create a dictionary out of contents of 'table' in the below format
source_one|addr1:port1:id1:pass1
table_one|tableName1
syms_one|SYM1 SYM2 SYM3
source_two|addr2:port2:id2:pass2
table_two|tableName2
syms_two|SYM21 SYM22 SYM23
How do I achieve this?
Thanks!
You can also use 0: to parse key-value pairs directly; however, it would require a change to the way your text file is stored.
You would need to drop the header line and add a comma at the end of each line:
$ cat test.txt
source_one=addr1:port1:id1:pass1,table_one=tableName1,syms_one=SYM1 SYM2 SYM3,
source_two=addr2:port2:id2:pass2,table_two=tableName2,syms_two=SYM21 SYM22 SYM23,
If that is easy to change, the load then becomes one line:
q)(!). "S=,"0: raze read0 `:test.txt
source_one| "addr1:port1:id1:pass1"
table_one | "tableName1"
syms_one | "SYM1 SYM2 SYM3"
source_two| "addr2:port2:id2:pass2"
table_two | "tableName2"
syms_two | "SYM21 SYM22 SYM23"
This has the advantage over loading to a table if the data is irregular, e.g. not every line has source, table and syms.
If every line did have them, why not just use those as column names in a table?
One way would be something like this:
q)(!) . flip raze 2 cut'1_flip("******";",")0:`:test.csv
"source_one"| "addr1:port1:id1:pass1"
"table_one" | "tableName1"
"syms_one" | "SYM1 SYM2 SYM3"
"source_two"| "addr2:port2:id2:pass2"
"table_two" | "tableName2"
"syms_two" | "SYM21 SYM22 SYM23"
(If you want the keys & values as symbols, replace the * in the 0: params with S)
This works by reading the file in as lists of strings (one list per column), flipping these back into the original rows and dropping the first row (i.e. the headers). 2 cut is then performed on each row to split it into pairs, and raze removes a level of nesting to give one list of pairs. Finally, this list is flipped and dot apply is used to apply the ! function (i.e. make a dictionary) to it, so that the left arg to ! is the keys and the right arg is the values.
Given the table defined above you could make use of value to extract the data from the table without column headers:
q)value each table
"source_one" "addr1:port1:id1:pass1" "table_one" "tableName1" "syms_one" "SYM1 SYM2 SYM3"
"source_two" "addr2:port2:id2:pass2" "table_two" "tableName2" "syms_two" "SYM21 SYM22 SYM23"
From here you can raze the output to give a single list which can then be cut into pairs (2):
q)2 cut raze value each table
"source_one" "addr1:port1:id1:pass1"
"table_one" "tableName1"
"syms_one" "SYM1 SYM2 SYM3"
...
Finally using flip puts it into a format that can be used to make a dictionary using !:
(!). flip 2 cut raze value each table
"source_one"| "addr1:port1:id1:pass1"
"table_one" | "tableName1"
"syms_one" | "SYM1 SYM2 SYM3"
"source_two"| "addr2:port2:id2:pass2"
"table_two" | "tableName2"
"syms_two" | "SYM21 SYM22 SYM23"
If the keys need to be symbols then you can make use of @ (amend) to convert them before creating the dictionary:
(!). @[;0;`$]flip 2 cut raze value each table
A better approach may be to create the table without the use of enlist and dropping the column headers with 1_, before making use of the same method to create the dictionary:
(!). flip raze cut[2]each 1_flip("******";",") 0: `:source.csv
I would forego the table creation and do something like this:
q)(!). flip 2 cut raze ","vs/:1_read0`:source.csv
"source_one"| "addr1:port1:id1:pass1"
"table_one" | "tableName1"
"syms_one" | "SYM1 SYM2 SYM3"
"source_two"| "addr2:port2:id2:pass2"
"table_two" | "tableName2"
"syms_two" | "SYM21 SYM22 SYM23"
Explanation. From right to left, first, 1_read0 reads the source file as a list of lines and discards the first line. Second, ","vs/: cuts each line on "," separators. Third, 2 cut raze flattens the list of lists and cuts it in pairs. Fourth, flip transposes the list of pairs turning it into a pair of lists. Last, (!). constructs a dictionary from a pair of lists containing keys and values. Note that (!).(x;y) translates into x!y.
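For readers more familiar with Python, the same right-to-left pipeline can be mirrored step by step. This is only a sketch for comparison: the CSV text is copied from the question, and the function name pairs_to_dict is invented here.

```python
import csv
import io

# Sample content copied from the question; a real script would read the file instead.
CSV_TEXT = """source,address,table,tableName,sym,symSet
source_one,addr1:port1:id1:pass1,table_one,tableName1,syms_one,SYM1 SYM2 SYM3
source_two,addr2:port2:id2:pass2,table_two,tableName2,syms_two,SYM21 SYM22 SYM23
"""

def pairs_to_dict(text):
    rows = list(csv.reader(io.StringIO(text)))[1:]            # read lines, drop the header (1_read0)
    flat = [field for row in rows for field in row]           # flatten the rows (raze)
    pairs = [flat[i:i + 2] for i in range(0, len(flat), 2)]   # cut into pairs (2 cut)
    keys, values = zip(*pairs)                                # transpose pairs into two lists (flip)
    return dict(zip(keys, values))                            # build the dictionary ((!).)

result = pairs_to_dict(CSV_TEXT)
for key, value in result.items():
    print(key, "|", value)
```

The flip and zip steps cancel out in Python, since dict(pairs) would do; they are kept here only so that each line maps onto one step of the q expression.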

Get substring into a new column

I have a table that contains a column with data in the following format. Let's call the column "title" and the table "s".
title
ab.123
ab.321
cde.456
cde.654
fghi.789
fghi.987
I am trying to get a unique list of the characters that come before the "." so that I end up with this:
ab
cde
fghi
I have tried selecting the initial column into a table and then doing an update to create a new column that holds the position of the dot, using "ss".
Something like this:
t: select title from s
update thedot: (title ss `.)[0] from t
I was then going to add a third column containing the first N characters of "title", where N is the value stored in the "thedot" column.
All I get when I try the update is a "type" error.
Any ideas? I am very new to kdb, so no doubt I am doing something simple in a very silly way.
The reason you get the type error is that ss only works on strings, not symbols. Also, ss is not a vector-based function, so you need to combine it with each (').
q)update thedot:string[title] ss' "." from t
title thedot
---------------
ab.123 2
ab.321 2
cde.456 3
cde.654 3
fghi.789 4
There are a few ways to solve your problem:
q)select distinct(`$"." vs' string title)[;0] from t
x
----
ab
cde
fghi
q)select distinct(` vs' title)[;0] from t
x
----
ab
cde
fghi
You can read here for more info: http://code.kx.com/q/ref/casting/#vs
An alternative is to make use of the 0: operator, to parse around the "." delimiter. This operator is especially useful if you have a fixed number of 'columns' like in a csv file. In this case where there is a fixed number of columns and we only want the first, a list of distinct characters before the "." can be returned with:
exec distinct raze("S ";".")0:string title from t
`ab`cde`fghi
OR:
distinct raze("S ";".")0:string t`title
`ab`cde`fghi
Where "S " defines the types of each column and "." is the field delimiter. For records with differing numbers of columns it would be better to use the vs operator.
A variation of WooiKent's answer using each-right (/:) :
q)exec distinct (` vs/:title)[;0] from t
`ab`cde`fghi
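As a cross-check of the expected result, the same idea (split each title on the dot, keep the first part, then take the distinct values) can be sketched in Python. The list below is copied from the question's table; this is an illustration, not q code.

```python
titles = ["ab.123", "ab.321", "cde.456", "cde.654", "fghi.789", "fghi.987"]

# Split once on "." and keep the part before it; dict.fromkeys drops
# duplicates while preserving first-seen order, much like distinct.
prefixes = list(dict.fromkeys(t.split(".", 1)[0] for t in titles))
print(prefixes)  # ['ab', 'cde', 'fghi']
```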

Subtracting a constant value from a column without round-up

I have a file like this.
Column1 Column2
-3500 value1
-3480 value2
-3460 value 3
9920 value 50
9940
9960
10000
10020
40000 Last value
Look at this example:
awk 'NR>1{$1=$1-4.91}1' file
Column1 Column2
-3504.91 value1
-3484.91 value2
-3464.91 value 3
9915.09 value 50
9935.09
9955.09
9995.09
10015.1
39995.1 Last value
I would like to have the correct values, not the rounded ones, like this:
Column1 Column2
-3504.91 value1
-3484.91 value2
-3464.91 value 3
9915.09 value 50
9935.09
9955.09
9995.09
10015.09
39995.09 Last value
I am trying to subtract a constant value, 4.91, from the first column. Everything works fine until 9980, but starting at 10000 awk appears to subtract only 4.9 and gives values that are off by 0.01 from the correct ones. I think it is rounding to some number of decimal places, but I don't want rounded values. Can anybody suggest a workaround?
Any other suggestions using shell script or Perl are welcome.
This can be done in Perl:
perl -pe 's/^([-0-9]+)/$1 - 4.91/e' your_file
Details:
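-p reads the file line by line and prints each line after the code has run
-e runs the given Perl code, with the current line in $_
s/.../.../e - replaces the regexp match with the result of evaluating the replacement as an expression
^([-0-9]+) - matches a run of digits and/or - signs at the start of the line and captures it
$1 - 4.91 - does the work using the value captured by the regexp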
Here is how to do it with awk:
You simply set the number of decimals you want using the printf function. Note that this prints only the first column and skips the header line.
awk 'NR>1{printf "%.2f\n",$1-4.91}' file
-3504.91
-3484.91
-3464.91
9915.09
9935.09
9955.09
9995.09
10015.09
39995.09
$ awk 'BEGIN{CONVFMT="%.2f"} NR>1{$1=$1-4.91}1' file
Column1 Column2
-3504.91 value1
-3484.91 value2
-3464.91 value 3
9915.09 value 50
9935.09
9955.09
9995.09
10015.09
39995.09 Last value
or if you prefer:
$ awk 'NR>1{$1=sprintf("%.2f",$1-4.91)}1' file
Column1 Column2
-3504.91 value1
-3484.91 value2
-3464.91 value 3
9915.09 value 50
9935.09
9955.09
9995.09
10015.09
39995.09 Last value
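Why the original command loses a digit: when awk turns a non-integer number back into a string it uses CONVFMT, whose default is %.6g, i.e. six significant digits. Values below 10000 still fit with two decimals, while 10015.09 needs seven digits and becomes 10015.1. Python's % operator follows the same printf conventions, so the effect can be reproduced in a small sketch (the values are taken from the question's first column; the variable names are arbitrary):

```python
# A few first-column values from the question, on both sides of 10000.
values = [-3500, 9920, 10020, 40000]
diffs = [v - 4.91 for v in values]

six_sig = ["%.6g" % d for d in diffs]   # awk's default conversion format
two_dec = ["%.2f" % d for d in diffs]   # the format used in the fixes above

for g, f in zip(six_sig, two_dec):
    print(g, f)
```

The first two values print identically in both formats; the last two differ only in the %.6g column, which matches the output shown in the question.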