Spark: retrieving old values of rows after casting made invalid input nulls - scala

I am having trouble retrieving the old value of a column before a cast in Spark. Initially, all my inputs are strings, and I want to cast the column num1 to a double type. However, when the cast is applied to anything that is not a valid double, Spark changes it to null.
Currently, I have the dataframe
df1:
num1 | unique_id
---- | ---------
1    | id1
a    | id2
2    | id3
and a copy of df1, df1_copy, where the cast is made.
When running
df1_copy = df1_copy.select(df1_copy.col("num1").cast("double"), df1_copy.col("unique_id"))
it returns
df1_copy:
num1 | unique_id
---- | ---------
1    | id1
null | id2
2    | id3
I have tried putting it into a different dataframe using select and when, but I get an error about not being able to find the column num1. The following is what I tried:
df2 = df1_copy.select(when(df1_copy.col("unique_id").equalTo(df1.col("unique_id")), df1.col("num1")).alias("invalid"), df1_copy.col("unique_id"))
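A minimal sketch of one way around this, assuming the goal is simply to keep the pre-cast string next to the cast result so invalid inputs stay visible (the dataframe and column names mirror the question; num1_double and invalid are made-up names):
import org.apache.spark.sql.functions.{col, when}
// Cast num1 while keeping the original string, so rows where the cast
// produced null still expose the input that failed to parse.
val df2 = df1
  .withColumn("num1_double", col("num1").cast("double"))
  .withColumn("invalid", when(col("num1_double").isNull && col("num1").isNotNull, col("num1")))
  .select("unique_id", "num1_double", "invalid")
This avoids the join on unique_id entirely, since the original value is never dropped in the first place.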

Related

Pyspark fill null value of a column based on value of another column

I have a dataframe with 2 columns, col1 and col2:
col1 | col2
---- | ----
aaa  | 111
null | 222
ccc  | 333
I want to fill the null values (here the 2nd row of col1).
Here, for example, the logic I want to use is: if col2 is 222 and col1 is null, use the arbitrary string "zzz". For each possibility in col2, I have an arbitrary string to fill col1 with if it's null (if it's not, I just want to keep the value that is already in col1).
My idea was to do something like this:
mapping = {"222":"zzz", "444":"fff"}
df = df.select(F.when(F.col('col1').isNull(), mapping[F.col('col2')]).otherwise(F.col('col1')))
I know F.col() is actually a column object and I can't simply do this.
What is the simplest way to achieve the result I want with PySpark, please?
This should work:
from pyspark.sql.functions import col, create_map, lit, when
from itertools import chain
mapping = {"222":"zzz", "444":"fff"}
mapping_expr = create_map([lit(x) for x in chain(*mapping.items())])
df = df.select(when(col('col1').isNull(), mapping_expr[col('col2')]).otherwise(col('col1')).alias('col1'), col('col2'))

Type error of getting average by id in KDB

I am trying to make a function for the aggregate consumption by mid in a kdb+ table (aggregate value by mid). This table is being imported from a csv file like this:
table: ("JJP";enlist",")0:`:data.csv
where the metadata for the table columns is:
mid is type long (j), value is type long (j), and ts is type timestamp (p).
Here is my function:
agg: {select avg value by mid from table}
but I get the following error:
'type
[0] get select avg value by mid from table
But the type of value is long (j), so I am not sure why I can't get the avg. I also tried this with type int.
value can't be used as a column name because it is a keyword used in kdb+. Renaming the column should correct the issue.
value is a keyword and should not be used as a column name.
https://code.kx.com/q/ref/value/
You can remove it as a column name using .Q.id
https://code.kx.com/q/ref/dotq/#qid-sanitize
q)t:flip`value`price!(1 2;1 2)
q)t
value price
-----------
1 1
2 2
q)t:.Q.id t
q)t
value1 price
------------
1 1
2 2
Or xcol
https://code.kx.com/q/ref/cols/#xcol
q)(enlist[`value]!enlist[`val]) xcol t
val price
---------
1 1
2 2
You can rename the value column as you read it:
flip`mid`val`ts!("JJP";",")0:`:data.csv

How to retrieve column value by passing another column value with IN clause in spark

I have a scenario where I need to read a column from a DataFrame using another column of the same DataFrame through a where condition, and pass those values as an IN condition to select matching values from another DataFrame. How can I achieve this with Spark DataFrames?
In SQL it will be like:
select distinct(A.date) from table A where A.key in (select B.key from table B where cond='D');
I tried the following:
val Bkey: DataFrame = b_df.filter(col("cond")==="D").select(col("key"))
I have table A data in the a_df DataFrame and table B data in the b_df DataFrame. How can I pass the Bkey values to the outer query to achieve this in Spark?
You can do a semi join:
val result = a_df.join(b_df.filter(col("cond")==="D"), Seq("key"), "left_semi").select("date").distinct()
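For context, a small self-contained sketch of the same left-semi join; the SparkSession setup and the sample data below are made up for illustration:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("semiJoinExample").getOrCreate()
import spark.implicits._

// Made-up sample data standing in for tables A and B from the question.
val a_df = Seq(("2023-01-01", "k1"), ("2023-01-02", "k2"), ("2023-01-02", "k3")).toDF("date", "key")
val b_df = Seq(("k1", "D"), ("k2", "X"), ("k3", "D")).toDF("key", "cond")

// Keep only rows of A whose key appears in B with cond = "D", then take distinct dates.
val result = a_df.join(b_df.filter(col("cond") === "D"), Seq("key"), "left_semi")
  .select("date")
  .distinct()

result.show()  // expect the dates for k1 and k3 only
A left-semi join returns only the columns of the left side, which matches the semantics of the SQL IN subquery.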

PySpark UDF function with data frame query?

I have another solution, but I prefer to use PySpark 2.3 to do it.
I have a two dimensional PySpark data frame like this:
Date | ID
---------- | ----
08/31/2018 | 10
09/31/2018 | 10
09/01/2018 | null
09/01/2018 | null
09/01/2018 | 12
I wanted to replace the null ID values by looking for the closest value in the past, or, if that is also null, by looking forward (and if it is again null, set a default value).
I have imagined adding a new column with .withColumn and use a UDF function which will query the data frame itself.
Something like that in pseudo code (not perfect but it is the main idea):
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

def return_value(value, date):
    if value is not None:
        return value
    value1 = df.filter(df['date'] <= date).select(df['value']).collect()
    if value1[0][0] is not None:
        return value1[0][0]
    value2 = df.filter(df['date'] >= date).select(df['value']).collect()
    return value2[0][0]

value_udf = udf(return_value, StringType())
new_df = df.withColumn("new_value", value_udf(df.value, df.date))
But it does not work. Am I completely on the wrong way to do it? Is it only possible to query a Spark data frame in a UDF function? Did I miss an easier solution?
Create new dataframe that have one column - unique list of all dates:
datesDF = yourDF.select('Date').distinct()
Create another one that consists of dates and IDs, but only the rows without nulls. Let's also keep only the first (whichever comes first) occurrence of ID for each date (judging from your example, you can have multiple rows per date):
noNullsDF = yourDF.dropna().dropDuplicates(subset=['Date'])
Let's now join those two so that we have a list of all dates with whatever value we have for each (or null):
joinedDF = datesDF.join(noNullsDF, 'Date', 'left')
Now, for every date, get the value of ID from the previous date and the next date using window functions, and let's also rename our ID column so there will be fewer problems with the later join:
from pyspark.sql.window import Window
from pyspark.sql import functions as f
w = Window.orderBy('Date')
joinedDF = (joinedDF
    .withColumn('previousID', f.lag('ID').over(w))
    .withColumn('nextID', f.lead('ID').over(w))
    .withColumnRenamed('ID', 'newID'))
Now let's join it back to our original DataFrame by date:
yourDF = yourDF.join(joinedDF, 'Date', 'left')
Now our DataFrame has 4 ID columns:
original ID
newID - ID of any non-null value for the given date, if any, otherwise null
previousID - ID from the previous date (non-null if any, otherwise null)
nextID - ID from the next date (non-null if any, otherwise null)
Now we need to combine them into finalID in order:
original value if not null
value for the current date if any non-null value is present (this is in contrast with your question, but your pseudocode suggests you use <= on the date check), if that result is not null
value for previous date if its not null
value for next date if its not null
some default value
We do it simply by coalescing:
default = 0
finalDF = yourDF.select('Date',
                        'ID',
                        f.coalesce('ID',
                                   'newID',
                                   'previousID',
                                   'nextID',
                                   f.lit(default)).alias('finalID')
                        )

KDB: How to assign string datatype to all columns

When I created the table Tab, I specified the columns as string,
Tab: ([Key1:string()] Col1:string();Col2:string();Col3:string())
But the column datatype (t) is empty. I suppose specifying the column as string has no effect.
meta Tab
c t f a
--------------------
Key1
Col1
Col2
Col3
After I do a bulk upsert in Java...
c.Dict dict = new c.Dict((Object[]) columns.toArray(new String[columns.size()]), data);
c.Flip flip = new c.Flip(dict);
conn.c.ks("upsert", table, flip);
The datatypes are all symbols:
meta Tab
c t f a
--------------------
Key1 s
Col1 s
Col2 s
Col3 s
How can I specify the datatype of the columns as string and have it remain as string?
You can't define a column of an empty table as strings, because strings are merely lists of characters (so a string column is a list of lists).
You can just set the columns as empty lists, which is what your code is doing.
But the column will then take on the type of whatever data is first inserted into it.
The real question is why your Java process is sending symbols when it should be sending strings. You need to make the change there before publishing to kdb+.
Note that if you define the columns as chars you still won't be able to upsert strings:
q)Tab: ([Key1:`char$()] Col1:`char$();Col2:`char$();Col3:`char$())
q)Tab upsert ([Key1:enlist"test"] Col1:enlist"test";Col2:enlist"test";Col3:enlist "test")
'rank
[0] Tab upsert ([Key1:enlist"test"] Col1:enlist"test";Col2:enlist"test";Col3:enlist "test")
^
q)Tab: ([Key1:()] Col1:();Col2:();Col3:())
q)Tab upsert ([Key1:enlist"test"] Col1:enlist"test";Col2:enlist"test";Col3:enlist "test")
Key1 | Col1 Col2 Col3
------| --------------------
"test"| "test" "test" "test"
kdb+ does not allow you to define a column type as a list when creating a table. That means you cannot define your column type as string, because a string is also a list (of characters).
The only way is to define the column as an empty list, like:
q) t:([]id:`int$();val:())
Then, when you insert data into this table, the column will automatically take the type of that data.
q)`t insert (4;"row1")
q) meta t
c | t f a
---| -----
id | i
val| C
In your case, one option is to send string data from your Java process as mentioned by user 'emc211', or the other option is to convert your data to strings in the kdb+ process before insertion.