I have data made up of variable-length periodic strings; each is effectively a list of time values with a periodicity flag embedded in it. Unfortunately, each string can hold a different number of elements, though never more than 7.
Example below: # or #/M at the end of a string means the values are monthly, starting at 8/2020, while #/Y means the values are annual, so we divide by 12 to get a monthly value. A # at the beginning simply means continue from the prior period.
copied from CSV
ID,seg,strField
AAA,1,8/2020 2333 2456 2544 2632 2678 #/M
AAA,2,# 3333 3456 3544 3632 3678 #
AAA,3,# 4333 4456 4544 4632 4678 #/M
AAA,4,11/2021 5333 5456 #/M
AAA,5,# 6333 6456 6544 6632 6678 #/Y
t:("SSS";enlist",") 0:`:./Data/src/strField.csv; // read in csv data above
t:update result:count[t]#enlist`float$() from t; // initiate empty result column
I would normally tokenize and then pass each of the 7 columns to a function, but functions are limited to 8 arguments and I would like to pass other metadata in addition to these 7 arguments.
t:@[t;`tok1`tok2`tok3`tok4`tok5`tok6`tok7;:;flip .Q.fu[{" " vs'x}]t `strField];
t: ungroup t;
//Desired result
ID seg iDate result
AAA 1 8/31/2020 2333
AAA 1 9/30/2020 2456
AAA 1 10/31/2020 2544
AAA 1 11/30/2020 2632
AAA 1 12/31/2020 2678
AAA 2 1/31/2021 3333
AAA 2 2/28/2021 3456
AAA 2 3/31/2021 3544
AAA 2 4/30/2021 3632
AAA 2 5/31/2021 3678
AAA 3 6/30/2021 4333
AAA 3 7/31/2021 4456
AAA 3 8/31/2021 4544
AAA 3 9/30/2021 4632
AAA 3 10/31/2021 4678
AAA 4 11/30/2021 5333
AAA 4 12/31/2021 5456
AAA 5 1/31/2022 527.75 <-- 6333/12
AAA 5 2/28/2022 527.75
AAA 5 3/31/2022 527.75
AAA 5 4/30/2022 527.75
AAA 5 5/31/2022 527.75
AAA 5 6/30/2022 527.75
AAA 5 7/31/2022 527.75
AAA 5 8/31/2022 527.75
AAA 5 9/30/2022 527.75
AAA 5 10/31/2022 527.75
AAA 5 11/30/2022 527.75
AAA 5 12/31/2022 527.75
AAA 5 1/31/2023 538.00 <--6456/12
AAA 5 2/28/2023 538.00
AAA 5 3/31/2023 538.00
AAA 5 4/30/2023 538.00
AAA 5 5/31/2023 538.00
AAA 5 6/30/2023 538.00
AAA 5 7/31/2023 538.00
AAA 5 8/31/2023 538.00
AAA 5 9/30/2023 538.00
AAA 5 10/31/2023 538.00
AAA 5 11/30/2023 538.00
AAA 5 12/31/2023 538.00
AAA 5 1/31/2024 etc..
AAA 5 2/29/2024
AAA 5 3/31/2024
AAA 5 4/30/2024
AAA 5 5/31/2024
AAA 5 6/30/2024
AAA 5 7/31/2024
ddonelly is correct: a dictionary or list gets around the limit of 8 parameters for functions, but I don't think it is the right approach here. The code below achieves the desired output:
t:("SSS";enlist",") 0:`:so.csv;
// Process each distinct ID separately: the date logic here would break if, say, a BBB entry started the dates over
{[t]
t:@[{[x;y] select from x where ID = y}[t;]';exec distinct ID from t];
raze {[t]
t:@[t;`strField;{" "vs string x}'];
t:ungroup update`$date from delete strField from @[t;`date`result`year;:;({first x}each t[`strField];"J"${-1_1_x}each t[`strField];
`Y =fills @[("#/Y";"#/M";"#")!`Y`M`;last each t[`strField]])];
delete year from ungroup update date:`$'string date from update result:?[year;result%12;result],
date:{x+til count x} each {max($[z;12#(x+12-x mod 12);1#x+1];y)}\[0;"M"$/:raze each reverse each
"/" vs/: string date;year] from t
} each t
}[t]
ID seg date result
AAA 1 2020.08 2333
AAA 1 2020.09 2456
AAA 1 2020.10 2544
AAA 1 2020.11 2632
AAA 1 2020.12 2678
AAA 2 2021.01 3333
AAA 2 2021.02 3456
AAA 2 2021.03 3544
AAA 2 2021.04 3632
AAA 2 2021.05 3678
AAA 3 2021.06 4333
AAA 3 2021.07 4456
AAA 3 2021.08 4544
AAA 3 2021.09 4632
AAA 3 2021.10 4678
AAA 4 2021.11 5333
AAA 4 2021.12 5456
AAA 5 2022.01 527.75
AAA 5 2022.02 527.75
AAA 5 2022.03 527.75
...
AAA 5 2023.01 538
AAA 5 2023.02 538
AAA 5 2023.03 538
AAA 5 2023.04 538
...
AAA 5 2024.01 545.3333
AAA 5 2024.02 545.3333
...
Below is a full breakdown of what's going on inside the nested function, should you need it for understanding.
// vs (vector from scalar) is useful for string manipulation: it splits the strField column into a more manageable list of separate strings
t:@[t;`strField;{" "vs string x}'];
// split the strField out to more manageable columns
t:@[t;`date`result`year;:;
// date column from the first part of strField
({first x}each t[`strField];
// result for the actual value fields in the middle
"J"${-1_1_x}each t[`strField];
// year column: a boolean indicating that special handling is needed.
// I also forward fill to account for rows which are a continuation of
// the previous row's time period,
// e.g. if you had 2 or 3 lines in a row of continuous yearly data
`Y =fills @[("#/Y";"#/M";"#")!`Y`M`;last each t[`strField]])];
// ungroup to split each result into individual rows
t:ungroup update`$date from delete strField from t;
t:update
// divide yearly rows where necessary with a vector conditional
result:?[year;result%12;result],
// change year into a progressive month list
date:{x+til count x} each
// check if a month exists, if not take previous month + 1.
// If a year, previous month + 12 and convert to Jan
// create a list of Jans for the year which I convert to Jan->Dec above
{max($[z;12#(x+12-x mod 12);1#x+1];y)}\
// reformat date to kdb month to feed with year into the scan iterator above
[0;"M"$/:raze each reverse each "/" vs/: string date;year] from t;
// finally convert date to symbol again to ungroup year rows into individual rows
delete year from ungroup update date:`$'string date from t
Could you put the columns into a dictionary and then pass the dictionary to the function? This would circumvent the limit of 8 arguments, since the dictionary can be as long as you require.
I want to replace every match except the first
I have a text file:
AAA
BBB
CCC
AAA
BBB
CCC
AAA
BBB
CCC
I want to get this:
AAA
BBB <-- stay same
CCC
AAA
XXX <-- 2nd find replaced
CCC
AAA
XXX <-- 3rd and nth find replaced
CCC
I am looking for something similar to this, but for whole lines rather than for words within lines:
sed -i 's/AAA/XXX/2' ./test01
Use branching:
sed -e'/BBB/ {ba};b; :a {n;s/BBB/XXX/;ba}'
I.e., on the first BBB we branch to :a; otherwise, b without a label ends the script for the current line and starts processing the next one.
Under :a, we read in the next line, replace BBB with XXX, and branch to :a again.
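The same control flow can be sketched in awk, with a flag taking the place of the branch. This is only an illustration under assumptions: the function name replace_after_first and its two arguments are made up for this example, and the pattern is treated as a regular expression.

```shell
# Hypothetical helper (name and arguments are illustrative, not from the answer above).
# Usage: replace_after_first PATTERN REPLACEMENT < file
replace_after_first() {
  awk -v pat="$1" -v rep="$2" '
    seen              { sub(pat, rep) }  # past the first match: replace on every later line
    !seen && $0 ~ pat { seen = 1 }       # first matching line: keep it, remember we saw it
                      { print }          # print every line, changed or not
  '
}

# Demo on the sample data from the question.
printf 'AAA\nBBB\nCCC\nAAA\nBBB\nCCC\nAAA\nBBB\nCCC\n' | replace_after_first BBB XXX
```

The order of the two rules matters: the replacement rule runs before the flag is set, so the first matching line passes through untouched.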
The following awk may also help.
awk '{++a[$0];$0=$0=="AAA"&&(a[$0]==2||a[$0]==3)?$0 ORS "XXX":$0} 1' Input_file
$ # replace all occurrences greater than s
$ # use gsub instead of sub to replace all occurrences in line
$ # whole line: awk -v s=1 '$0=="AAA" && ++c>s{$0="XXX"} 1' ip.txt
$ awk -v s=1 '/AAA/ && ++c>s{sub(/AAA/, "XXX")} 1' ip.txt
AAA
BBB
CCC
XXX
BBB
CCC
XXX
BBB
CCC
$ # replace exactly when occurrence == s
$ awk -v s=2 '/AAA/ && ++c==s{sub(/AAA/, "XXX")} 1' ip.txt
AAA
BBB
CCC
XXX
BBB
CCC
AAA
BBB
CCC
Further reading: Printing with sed or awk a line following a matching pattern
awk '/BBB/{c++;if(c >=2)sub(/BBB/,"XXX")}1' file
AAA
BBB
CCC
AAA
XXX
CCC
AAA
XXX
CCC
As long as your file does not contain null characters (\0), you can get sed to treat the whole file as one big string by instructing it to separate records with the null character \0 instead of the default \n, using the sed option -z:
$ sed -z 's/BBB/XXX/2g' file66
AAA
BBB
CCC
AAA
XXX
CCC
AAA
XXX
CCC
The 2g at the end means: start at the second match, then replace globally from there on.
You can combine -i with -z without any problem.
I use Spark 2.2.0 and Scala 2.11.8. I have some problems with joining two DataFrames.
df1 =
product1_PK product2_PK
111 222
333 111
...
and:
df2 =
product_PK product_name
111 AAA
222 BBB
333 CCC
I want to get this result:
product1_PK product2_PK product1_name product2_name
111 222 AAA BBB
333 111 CCC AAA
...
How can I do it?
This is what I tried as a partial solution, but I don't know how to efficiently join on both product1_PK and product2_PK and rename the columns:
val result = df1.as("left")
.join(df2.as("right"), $"left.product1_PK" === $"right.product_PK")
.drop($"left.product_PK")
.withColumnRenamed("right.product_name","product1_name")
You need two joins: the first for product1_name and the second for product2_name.
df1.join(df2.withColumnRenamed("product_PK", "product1_PK").withColumnRenamed("product_name", "product1_name"), Seq("product1_PK"), "left")
.join(df2.withColumnRenamed("product_PK", "product2_PK").withColumnRenamed("product_name", "product2_name"), Seq("product2_PK"), "left")
.show(false)
You should have your desired output as
+-----------+-----------+-------------+-------------+
|product2_PK|product1_PK|product1_name|product2_name|
+-----------+-----------+-------------+-------------+
|222 |111 |AAA |BBB |
|111 |333 |CCC |AAA |
+-----------+-----------+-------------+-------------+
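Outside Spark, the idea of consulting the same name table once per key column can be sketched with awk on plain text files. This is not Spark code and not part of the answer above; the helper name lookup_names and the file names df1.txt and df2.txt are assumptions made up for the illustration, and the files are assumed to be whitespace-separated with no header row.

```shell
# Hypothetical helper: lookup_names NAMES_FILE PAIRS_FILE
# NAMES_FILE has "product_PK product_name" rows, PAIRS_FILE has "product1_PK product2_PK" rows.
lookup_names() {
  awk '
    NR == FNR { name[$1] = $2; next }              # first file: build the PK to name table
              { print $1, $2, name[$1], name[$2] } # second file: one lookup per key column
  ' "$1" "$2"
}

# Demo with the sample rows from the question, written to temporary files.
tmp=$(mktemp -d)
printf '111 AAA\n222 BBB\n333 CCC\n' > "$tmp/df2.txt"
printf '111 222\n333 111\n' > "$tmp/df1.txt"
lookup_names "$tmp/df2.txt" "$tmp/df1.txt"
rm -rf "$tmp"
```

A key that is missing from the name table yields an empty field, which roughly mirrors the null a left join would produce.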
I have a file containing lines of text (e.g. the lines below):
111
AAA
AAA
EEEE
EEEE
EEEE
EEEE
ZZZ1
How can I get the number of occurrences of each line, as follows, using PS?
EEEE 4
AAA 2
111 1
ZZZ1 1
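This is not PS, but the counting logic itself can be sketched with the awk and sort tools used elsewhere on this page. The helper name count_lines is made up for the example, and it assumes that no input line contains a tab character.

```shell
# Hypothetical helper: print "line count" pairs, most frequent first, ties in byte order.
# Usage: count_lines < file
count_lines() {
  # 1) tally each distinct line, 2) sort by count descending and then by text,
  # 3) reorder the columns to "text count".
  awk '{ n[$0]++ } END { for (k in n) print n[k] "\t" k }' |
    LC_ALL=C sort -t "$(printf '\t')" -k1,1nr -k2,2 |
    awk -F '\t' '{ print $2, $1 }'
}

# Demo on the sample lines from the question.
printf '111\nAAA\nAAA\nEEEE\nEEEE\nEEEE\nEEEE\nZZZ1\n' | count_lines
```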
I have a text file that looks like this:
AAA
BBB
CCC
AAA
DDD
EEE
It has a specific keyword, for example AAA. After encountering the keyword, I'd like to copy the following line and then write it a second time in my output file.
I want it to look like this:
AAA
BBB
BBB
CCC
AAA
DDD
DDD
EEE
Is there anybody who will help me to do this?
Sed can do it like this:
$ sed '/AAA/{n;p}' infile
AAA
BBB
BBB
CCC
AAA
DDD
DDD
EEE
This looks for the pattern (/AAA/), then reads the next line of input (n) and prints it (p). Because printing is the default action anyway, the line gets printed twice, which is what we want.
awk to the rescue!
$ awk 'd{print;d=0} /AAA/{d=1}1' file
AAA
BBB
BBB
CCC
AAA
DDD
DDD
EEE
Explanation
d{print;d=0}
if the flag d is set, print the line and reset the flag,
/AAA/{d=1}
set a flag to duplicate the line for the given pattern,
1
and print all lines.
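As a variation on the flag approach explained above, here is a minimal sketch that takes the keyword as a parameter and compares whole lines instead of matching a regex. The function name dup_after is hypothetical, and the exact-match behaviour is a design choice of this sketch, not something the answer above does.

```shell
# Hypothetical helper: duplicate the line that follows each line equal to KEYWORD.
# Usage: dup_after KEYWORD < file
dup_after() {
  awk -v kw="$1" '
    dup      { print; dup = 0 }  # previous line was the keyword: emit this line one extra time
    $0 == kw { dup = 1 }         # exact whole-line comparison instead of a regex match
             { print }           # the normal copy of every line
  '
}

# Demo on the sample data from the question.
printf 'AAA\nBBB\nCCC\nAAA\nDDD\nEEE\n' | dup_after AAA
```

Comparing with == means a line such as XAAAX does not trigger duplication, and regex metacharacters in the keyword need no escaping.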
You can use perl for this
perl -e ' $a =undef;
while(<>){
chomp;
if ($a eq "AAA"){
print "$_\n"
}
print "$_\n";
$a=$_;
}' your_file.txt
This iterates through the file and prints each line. If the previous line is "AAA", it prints the current line twice.
I don't know whether you share my hatred of one-line programs, but this is entirely possible in Perl
$ perl -ne'print; print scalar <> x 2 if /AAA/' aaa.txt
output
AAA
BBB
BBB
CCC
AAA
DDD
DDD
EEE