separating a string column, converting sub-values into floats and assigning column labels - kdb

I have a table t and vector times.
sym vals
-------------------
A "3.6, 2.1, 1.8"
B "2.1, 1.8, 1.6"
C "2.2, 1.9, 1.6"
D "1.9, 1.5, 1.3"
E "2.6, 2.1, 1.9"
times: `0`1`2
I want to separate the comma-separated string in each row and convert each value into a float. The column labels then need to come from times. I also want to then drop the column vals. The following statement does this:
t_out: delete vals from t, 'flip exec times!("FFF";",")0:vals from t
sym 0 1 2
---------------
A 3.6 2.1 1.8
B 2.1 1.8 1.6
C 2.2 1.9 1.6
D 1.9 1.5 1.3
E 2.6 2.1 1.9
Why does exec times!("FFF";",")0:vals from t transpose the table after converting the values to float? Why do we need 'flip and not just flip? I appreciate your help. Are there alternative methods to achieve this?

(This is a bit of a guess here, since you're asking about q's design choices.)
0: is often used to read in CSV files, where each line stores one row with fields separated by commas. The fields in a row won't always have the same type; for instance, this could be a CSV file (building on your example):
3.6, 2.1, 1.8, 10
2.1, 1.8, 1.6, 20
2.2, 1.9, 1.6, 30
1.9, 1.5, 1.3, 40
2.6, 2.1, 1.9, 50
So we have four columns, the first three are float columns, and the fourth is a long (int) column.
When you use 0: to read in the CSV (or in your case, just a list of strings that resemble a CSV), q gives you the data column-wise, which looks like a transpose of your table: a list containing four lists, one per column.
q)vals: ("3.6, 2.1, 1.8, 10"; "2.1, 1.8, 1.6, 20"; "2.2, 1.9, 1.6, 30"; "1.9, 1.5, 1.3, 40"; "2.6, 2.1, 1.9, 50")
q)vals
"3.6, 2.1, 1.8, 10"
"2.1, 1.8, 1.6, 20"
"2.2, 1.9, 1.6, 30"
"1.9, 1.5, 1.3, 40"
"2.6, 2.1, 1.9, 50"
q)("FFFJ"; ",") 0: vals
3.6 2.1 2.2 1.9 2.6
2.1 1.8 1.9 1.5 2.1
1.8 1.6 1.6 1.3 1.9
10 20 30 40 50
Each of the four lists in this list will be correctly typed:
q)first ("FFFJ"; ",") 0: vals
3.6 2.1 2.2 1.9 2.6
q)type first ("FFFJ"; ",") 0: vals
9h
q)last ("FFFJ"; ",") 0: vals
10 20 30 40 50
q)type last ("FFFJ"; ",") 0: vals
7h
This makes it easier to work with, as you don't have a list of mixed lists. The alternative would be:
q)flip ("FFFJ"; ",") 0: vals
3.6 2.1 1.8 10
2.1 1.8 1.6 20
2.2 1.9 1.6 30
1.9 1.5 1.3 40
2.6 2.1 1.9 50
q)first flip ("FFFJ"; ",") 0: vals
3.6
2.1
1.8
10
q)type first flip ("FFFJ"; ",") 0: vals
0h
I'm guessing the reasoning for this is performance: under the hood, tables are in fact (flipped) column dictionaries, so they actually look something like this:
q)`1`2`3`4 ! ("FFFJ"; ",") 0: vals
1| 3.6 2.1 2.2 1.9 2.6
2| 2.1 1.8 1.9 1.5 2.1
3| 1.8 1.6 1.6 1.3 1.9
4| 10 20 30 40 50
But again, you're asking about q's design choices, so I'm just guessing.
The reason you need ,' instead of just , (the ' you see in 'flip really belongs to the ,) is that you want to join each element of the two tables (which are treated as lists of dictionaries) to its counterpart, so you are using the each iterator. You can read about it here (if you scroll down just a bit to the "Advanced" part, just above the each-left header, it explains it a bit better).
Just to make it clear that the iterator ' is modifying , and not flip, I would write your query as:
... from t ,' flip exec ...
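For completeness, here is the full statement written with that spacing; the t and times definitions below are just a sketch reconstructing your example so the snippet is self-contained:
/ reconstruct the example table and times vector from the question
t:([]sym:`A`B`C`D`E;vals:("3.6, 2.1, 1.8";"2.1, 1.8, 1.6";"2.2, 1.9, 1.6";"1.9, 1.5, 1.3";"2.6, 2.1, 1.9"))
times:`0`1`2
/ parse the strings column-wise, label the columns with times, flip into rows, then join-each onto the rows of t
t_out:delete vals from t ,' flip exec times!("FFF";",")0:vals from t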

This is more of a curiosity, but you can effectively force the "flip" to occur naturally by using a by grouping (even if the by grouping is meaningless):
q)exec times!raze("FFF";",")0:vals by sym:sym from t
sym| 0 1 2
---| -----------
A | 3.6 2.1 1.8
B | 2.1 1.8 1.6
C | 2.2 1.9 1.6
D | 1.9 1.5 1.3
E | 2.6 2.1 1.9
This also does away with the need to append-each (,') to sideways-join the result.
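Note that the result above is keyed on sym; if you want a plain unkeyed table like the ,' version produces, prepending 0! should do it (a minimal sketch):
q)0!exec times!raze("FFF";",")0:vals by sym:sym from t  / 0! removes the sym key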

Related

xbar rounding up and application of last

q) t:([]time:(2021.01.31D17:50:19.986000000;2021.01.31D18:01:32.894000000;2021.01.31D18:02:08.884000000;2021.01.31D18:25:25.984000000;2021.01.31D18:25:27.134000000;2021.01.31D18:25:28.834000000;2021.01.31D18:25:29.934000000);val:(3.2;2.9;3.9;6.8;5.0;3.0;2.2);sym:(`AUD;`AUD;`AUD;`AUD;`AUD;`AUD;`AUD))
time val sym
-------------------------------------
2021.01.31D17:50:19.986000000 3.2 AUD
2021.01.31D18:01:32.894000000 2.9 AUD
2021.01.31D18:02:08.884000000 3.9 AUD
2021.01.31D18:25:25.984000000 6.8 AUD
2021.01.31D18:25:27.134000000 5 AUD
2021.01.31D18:25:28.834000000 3 AUD
2021.01.31D18:25:29.934000000 2.2 AUD
prices: 0!select last val by sym, 0D00:01+0D00:01 xbar time from t
sym x val
-------------------------------------
AUD 2021.01.31D17:51:00.000000000 3.2
AUD 2021.01.31D18:02:00.000000000 2.9
AUD 2021.01.31D18:03:00.000000000 3.9
AUD 2021.01.31D18:26:00.000000000 2.2
For the first row in prices, for example, how does q ensure that val is not the last value between 2021.01.31D17:51:00.000000000 and 2021.01.31D17:52:00.000000000, but rather the last value between 2021.01.31D17:50:00.000000000 and 2021.01.31D17:51:00.000000000? I am asking because the command involves 0D00:01+0D00:01 xbar time and not just 0D00:01 xbar time.
Appreciate your help.
Kdb still reads right-to-left within the sub-components of a select statement, so
0D00:01+0D00:01 xbar time
is read as
0D00:01 xbar time
and the additional 0D00:01 is added after the xbar operation. So the 0D00:01+ really only affects the "display" (the label) of each bucket, not the values used in the grouping.
This is what you possibly thought kdb would read it as:
0D00:01 xbar 0D00:01+time
Here the times are bumped up before the xbar/grouping rather than after it, so you might expect the last value between 17:51 and 17:52; in fact the results are the same, because shifting every time by exactly one bar width changes only the labels, not the groupings, so this is really just a labelling exercise.
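One way to convince yourself of the evaluation order is to compare the readings directly against t (a minimal sketch; the parenthesised versions spell out the two possible readings):
q)(exec 0D00:01+0D00:01 xbar time from t)~exec 0D00:01+(0D00:01 xbar time) from t
1b
q)(exec 0D00:01+0D00:01 xbar time from t)~exec (0D00:01+0D00:01) xbar time from t
0b
The first comparison matches because the unparenthesised form is already the right-to-left reading; the second does not, because the left-to-right reading would bucket into two-minute bars.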

Spark 3.0 is much slower to read json files than Spark 2.4

I have a large number of JSON files that Spark 2.4 can read in 36 seconds, but Spark 3.0 takes almost 33 minutes to read the same data. On closer analysis, it looks like Spark 3.0 is choosing a different DAG than Spark 2.4. Does anyone have any idea what is going on? Is there a configuration problem with Spark 3.0?
Spark 2.4
scala> spark.time(spark.read.json("/data/20200528"))
Time taken: 19691 ms
res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]
scala> spark.time(res61.count())
Time taken: 7113 ms
res64: Long = 2605349
Spark 3.0
scala> spark.time(spark.read.json("/data/20200528"))
20/06/29 08:06:53 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
Time taken: 849652 ms
res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]
scala> spark.time(res0.count())
Time taken: 8201 ms
res2: Long = 2605349
Here are the details:
As it turns out, the default behavior of Spark 3.0 has changed: it tries to infer timestamps unless a schema is specified, and that results in a huge amount of text scanning. I tried loading the data with inferTimestamp=false and the time did come close to that of Spark 2.4, but Spark 2.4 still beats Spark 3.0 by ~3+ seconds (this may be within an acceptable range, but the question is why?). I have no idea why this behavior was changed, but it should have been announced in BOLD letters.
Spark 2.4
spark.time(spark.read.option("inferTimestamp","false").json("/data/20200528/").count)
Time taken: 29706 ms
res0: Long = 2605349
spark.time(spark.read.option("inferTimestamp","false").option("prefersDecimal","false").json("/data/20200528/").count)
Time taken: 31431 ms
res0: Long = 2605349
Spark 3.0
spark.time(spark.read.option("inferTimestamp","false").json("/data/20200528/").count)
Time taken: 32826 ms
res0: Long = 2605349
spark.time(spark.read.option("inferTimestamp","false").option("prefersDecimal","false").json("/data/20200528/").count)
Time taken: 34011 ms
res0: Long = 2605349
Note:
Make sure you never set prefersDecimal to true, even when inferTimestamp is false; it again takes a huge amount of time.
Spark 3.0 + JDK 11 is slower than Spark 3.0 + JDK 8 by almost 6 seconds.

What does '60D' mean in scala?

What does the 60D mean in the Scala line below? Is it 60 as a decimal number? I need to convert the code into an equivalent Snowflake query for a migration activity.
((col("date_1").cast("long")-col("date_2").cast("long"))/60D)
Thanks in advance.
In Scala, if you divide an integer by an integer you will get an integer.
12 / 5 == 2
This is similar in Spark (I see you are probably rewriting some Spark job?). If you want to get a double value instead, you have to make sure at least one operand is a double. There are many options, including:
12 / 5.0 == 2.4
12 / 5D == 2.4
So 60D is simply the number 60 as a Double literal, which forces the division to be carried out in floating point. It is probably not relevant to you anyway, as you are not producing any Scala code there.

Dividing large numbers in postgresql

I am working with numbers with 18 decimal places, and I have decided to store the number as a "NUMERIC(36)" in the database.
Now I want to present it by doing the following division:
select (5032345678912345678::decimal / power(10, 18)::decimal )::decimal(36,18)
result
5.032345678912345700
expected result
5.032345678912345678
It works if I use a precision of 16 decimals
select (50323456789123456::decimal / power(10, 16)::decimal )::decimal(36,16)
result 5.0323456789123456
Any idea how to work with 18 decimals without losing information?
Use a constant typed as decimal(38,18):
select 5032345678912345678::decimal / 1000000000000000000::decimal(38,18);
?column?
----------------------
5.032345678912345678
(1 row)
A constant should be a bit faster. However, the same cast should work for power(10, 18) as well.

T-SQL stripping redundant data efficiently

I have a table that tracks price data over time for various goods. Here's a simplified example:
Table name [Product_Prices]
PRODUCT DATE PRICE
------------------
Corn 1/1/2011 1.35
Corn 1/2/2011 1.40
Corn 1/3/2011 1.40
Corn 1/4/2011 1.50
Beef 1/1/2011 1.35
Beef 1/2/2011 1.15
Beef 1/3/2011 1.15
Beef 1/4/2011 1.30
Beef 1/5/2011 1.30
Beef 1/6/2011 1.35
I want a query that pulls the earliest date that the prices changed, for each instance where the price actually did change. Based on the sample table above, this is the output I want:
PRODUCT DATE PRICE
------------------
Corn 1/1/2011 1.35
Corn 1/2/2011 1.40
Corn 1/4/2011 1.50
Beef 1/1/2011 1.35
Beef 1/2/2011 1.15
Beef 1/4/2011 1.30
Beef 1/6/2011 1.35
I am currently doing it with a cursor, but it's incredibly inefficient and I feel that there must be a simpler way to get this data. The table I'm working with has about 2.3 million records.
SQL 2000
Thanks!
SQL is, unfortunately, not a language that's well-suited to working with ordered sets (relational databases are great for it, but the SQL language is not). Additionally, some of the T-SQL features that make working with these sets easier (ROW_NUMBER(), for example) were not introduced until SQL Server 2005.
Given the restriction to SQL Server 2000, you'll have to do something like this:
select
    pp.Product,
    pp.Date,
    pp.Price
from Product_Prices pp
where pp.Price <> (select top 1
                       pp2.Price
                   from Product_Prices pp2
                   where pp2.Date < pp.Date
                     and pp2.Product = pp.Product
                   order by pp2.Date desc)
   or not exists (select 1
                  from Product_Prices pp2
                  where pp2.Date < pp.Date
                    and pp2.Product = pp.Product)
(I don't have SQL Server 2000 available to test, but I believe this should function correctly on 2000)
This will retrieve every row from Product_Prices where the price for that product is not equal to the previous record's price for that product; the NOT EXISTS branch also keeps the first record for each product, which has no previous row to compare against.