PySpark - Combine calculation and kwargs in 'select' in a 'for' loop

I was told that using withColumn inside a for loop is bad for performance, so I'm trying to replace it with select, as suggested. But I don't want to write out every column by hand (and yes, I'm a bit lazy^^); I already have a dict with the list of columns and I try to use it, but I get an error on the second loop iteration.
My main issue is that I need at least one column for the calculation, but **kwargs arguments must come after the positional ones, so I try to make it dynamic: remove the current column from the dict, pass that column first, then my calculation, and finally the other columns (the loop iterates over the column list).
The code runs in a Palantir Foundry code repository (I have no other PySpark environment to test with).
Here is a snippet of my code; I hope it's enough to make it reproducible:
input_schema = {
    'col_1': T.StringType(),
    'col_2': T.StringType(),
    'col_3': T.StringType(),
    'col_4': T.StringType(),
    'col_5': T.StringType(),
}
@transform_df(
    Output("output_df"),
    nsw=Input("My_Input"),
)
def my_compute_function(nsw):
    nsw = nsw \
        .withColumn("errors", F.lit(None).cast("array<string>")) \
        .withColumn("error_check", F.lit("Pouet").cast("string"))
    for col_name, test_list in validation_list.items():
        l_input_schema = dict(input_schema)
        del l_input_schema[col_name]
        for to_test in test_list:
            nsw = nsw \
                .select(
                    *col_name,
                    F.when(
                        (nsw.error_check.isNotNull()),
                        F.when(nsw.errors.isNull(), F.array(nsw.error_check))
                        .otherwise(F.array_union(nsw.errors, F.array(nsw.error_check))),
                    )
                    .otherwise(nsw.errors)
                    .alias("errors"),
                    **l_input_schema
                )
    return nsw
No syntax error here, but on the second loop iteration I get the following message:
Function select() got an unexpected argument col_2. Please review your code.
The error points to the line where the kwargs are used (**l_input_schema), and it's always the second column.

I feel a bit stupid, because the problem was simply that I was passing my dict as kwargs instead of args.
The following works fine:
nsw = nsw \
    .select(
        *input_schema,
        F.when(
            (nsw.error_check.isNotNull()),
            F.when(nsw.errors.isNull(), F.array(nsw.error_check))
            .otherwise(F.array_union(nsw.errors, F.array(nsw.error_check))),
        )
        .otherwise(nsw.errors)
        .alias("errors"),
    )
There is no longer any need to put the current column before the calculation on each loop iteration. On the other hand, all new columns need to be added before the loop, and the dict input_schema has to be updated accordingly.
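For anyone hitting the same error: DataFrame.select only takes positional arguments (column name strings or Column expressions), never keyword arguments, so a dict of columns has to be unpacked with * (its keys), not **. Below is a minimal, self-contained sketch of the working pattern, using made-up data and a plain SparkSession instead of the Foundry transform, so it is an illustration rather than the original code:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Stand-in for "My_Input"
nsw = spark.createDataFrame([("a", "b"), ("", "d")], ["col_1", "col_2"])
input_schema = {"col_1": None, "col_2": None}  # only the keys matter here

nsw = nsw \
    .withColumn("errors", F.lit(None).cast("array<string>")) \
    .withColumn("error_check", F.lit("Pouet").cast("string"))

# One select per iteration: unpack the dict KEYS positionally (*input_schema)
# and append the recomputed "errors" column; error_check is kept explicitly.
for _ in range(2):  # stand-in for the validation loop
    nsw = nsw.select(
        *input_schema,
        F.when(
            nsw.error_check.isNotNull(),
            F.when(nsw.errors.isNull(), F.array(nsw.error_check))
             .otherwise(F.array_union(nsw.errors, F.array(nsw.error_check))),
        ).otherwise(nsw.errors).alias("errors"),
        nsw.error_check,
    )

nsw.show(truncate=False)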

Related

Using 'where' instead of 'expr' when filtering for values in multiple columns in scala spark

I'm having some trouble refactoring a spark dataframe to not use expr but instead use dataframe filters and when conditionals.
My code is this:
outDF = outDF.withColumn("MAIN_TYPE", expr(
  "case when 'TYPE_A' in (GROUP_A,GROUP_B,GROUP_C,GROUP_D) then 'TYPE_A' " +
  "when 'TYPE_B' in (GROUP_A,GROUP_B,GROUP_C,GROUP_D) then 'TYPE_B' " +
  "when 'TYPE_C' in (GROUP_A,GROUP_B,GROUP_C,GROUP_D) then 'TYPE_C' " +
  "when 'TYPE_D' in (GROUP_A,GROUP_B,GROUP_C,GROUP_D) then 'TYPE_D' else '0' end")
  .cast(StringType))
The only solution that I could think of so far is a series of individual .when().otherwise() chains, but that would require m×n lines, where m is the number of Types and n is the number of Groups that I need.
Is there any better way to do this kind of operation?
Thank you very much for your time!
So, this is how I worked this out, in case anyone is interested:
I used a helper column for the groups which I later dropped.
This is how this worked:
outDF = outDF.withColumn("Helper_Column", concat(col("Group_A"), col("Group_B"),
                                                 col("Group_C"), col("Group_D")))
outDF = outDF.withColumn("MAIN_TYPE", when(col("Helper_Column").like("%Type_A%"), "Type_A").otherwise(
                                      when(col("Helper_Column").like("%Type_B%"), "Type_B").otherwise(
                                      when(col("Helper_Column").like("%Type_C%"), "Type_C").otherwise(
                                      when(col("Helper_Column").like("%Type_D%"), "Type_D").otherwise(lit("0")
                                      )))))
outDF = outDF.drop("Helper_Column")
Hope this helps someone.
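In PySpark (which most of this thread is about), the same idea can be expressed without spelling out the m×n branches by folding the list of types into one chained when/otherwise. This is only a sketch under assumed column names and literal type markers, not a rewrite of the Scala code above:

from functools import reduce
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

groups = ["GROUP_A", "GROUP_B", "GROUP_C", "GROUP_D"]
types = ["TYPE_A", "TYPE_B", "TYPE_C", "TYPE_D"]

df = spark.createDataFrame(
    [("TYPE_B", "x", "y", "z"), ("x", "y", "TYPE_D", "z")],
    groups,
)

# "'TYPE_X' in (GROUP_A,...)" becomes: any group column equals TYPE_X
def matches(t):
    return reduce(lambda a, b: a | b, [F.col(g) == t for g in groups])

# Fold the types (in reverse) into one when/otherwise chain, so TYPE_A has priority
main_type = reduce(
    lambda acc, t: F.when(matches(t), t).otherwise(acc),
    reversed(types),
    F.lit("0"),
)

df.withColumn("MAIN_TYPE", main_type).show()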

Will spark lazy execution lead to list override?

My question is: will quantile_foo and quantile_bar hold the right values in each loop iteration?
Or will the values of quantile_foo and quantile_bar end up being the ones from the last iteration because of Spark's lazy execution, so that I always get the wrong foo_quantile_{i} except for the last one?
df = spark.sql("select * from some_table")
for i in range(5):
    quantile_foo = df.approxQuantile("foo_{}".format(str(i)), [0.25, 0.5, 0.75], 0.05)
    quantile_bar = df.approxQuantile("bar_{}".format(str(i)), [0.25, 0.5, 0.75], 0.05)
    df = df.withColumn("foo_quantile_{}".format(str(i)),
                       F.when(F.col("foo_{}".format(str(i))) > quantile_foo[0], 75)
                        .when(F.col("foo_{}".format(str(i))) > quantile_foo[1], 50)
                       ... ...
                       )
    df = df.withColumn("bar_quantile_{}".format(str(i)),
                       F.when(F.col("bar_{}".format(str(i))) > quantile_bar[0], 75)
                        .when(F.col("bar_{}".format(str(i))) > quantile_bar[1], 50)
                       ... ...
                       )
I would say the answer to your question is yes, it would take the last value of i to perform the operation, but that has nothing to do with Spark's lazy evaluation.
You are reassigning a value to a Python variable inside a for loop, so whenever you come out of the loop you will get the last value.
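A small plain-Python sketch of that point (nothing Spark-specific is involved): the variable itself only keeps the value from the last iteration, while anything built with it during an iteration keeps the values it was given at that time.

results = []
for i in range(5):
    quantile_foo = [0.1 * i, 0.2 * i]   # stand-in for the list approxQuantile returns
    # the concrete values are consumed right here, in this iteration
    results.append(("foo_quantile_{}".format(i), quantile_foo[0]))

print(quantile_foo)  # only the values from the last iteration survive in the variable
print(results)       # each entry kept the values from its own iteration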

How can I convert this select statement to functional form?

I am having a couple of issues putting this into functional form.
select from tableName where i=fby[(last;i);([]column_one;column_two)]
This is what I got:
?[tableName;fby;enlist(=;`i;(enlist;last;`i);(+:;(!;enlist`column_one`column_two;(enlist;`column_one;`column_two))));0b;()]
but I get a type error.
Any suggestions?
Consider using the following function, adapted from the buildQuery function given in the whitepaper on parse trees. It is a pretty useful tool for quickly developing in q; this version improves on the one given in the linked whitepaper, having been extended to handle updates by reference (i.e., update x:3 from `tab).
\c 30 200
tidy:{ssr/[;("\"~~";"~~\"");("";"")] $[","=first x;1_x;x]};
strBrk:{y,(";" sv x),z};
//replace k representation with equivalent q keyword
kreplace:{[x] $[`=qval:.q?x;x;"~~",string[qval],"~~"]};
funcK:{$[0=t:type x;.z.s each x;t<100h;x;kreplace x]};
//replace eg ,`FD`ABC`DEF with "enlist`FD`ABC`DEF"
ereplace:{"~~enlist",(.Q.s1 first x),"~~"};
ereptest:{((0=type x) & (1=count x) & (11=type first x)) | ((11=type x)&(1=count x))};
funcEn:{$[ereptest x;ereplace x;0=type x;.z.s each x;x]};
basic:{tidy .Q.s1 funcK funcEn x};
addbraks:{"(",x,")"};
//where clause needs to be a list of where clauses, so if there is only one where clause we need to enlist it
stringify:{$[(0=type x) & 1=count x;"enlist ";""],basic x};
//if a dictionary, apply to both keys and values
ab:{$[(0=count x) | -1=type x;.Q.s1 x;99=type x;(addbraks stringify key x),"!",stringify value x;stringify x]};
inner:{[x]
 idxs:2 3 4 5 6 inter ainds:til count x;
 x:#[x;idxs;'[ab;eval]];
 if[6 in idxs;x[6]:ssr/[;("hopen";"hclose");("iasc";"idesc")] x[6]];
 //for select statements within select statements
 //This line has been adjusted
 x[1]:$[-11=type x 1;x 1;$[11h=type x 1;[idxs,:1;"`",string first x 1];[idxs,:1;.z.s x 1]]];
 x:#[x;ainds except idxs;string];
 x[0],strBrk[1_x;"[";"]"]
 };
buildSelect:{[x]
 inner parse x
 };
We can use this to create the functional query that will work:
q)n:1000
q)tab:([]sym:n?`3;col1:n?100.0;col2:n?10.0)
q)buildSelect "select from tab where i=fby[(last;i);([]col1;col2)]"
"?[tab;enlist (=;`i;(fby;(enlist;last;`i);(flip;(lsq;enlist`col1`col2;(enlist;`col1;`col2)))));0b;()]"
So we have the following as the functional form
?[tab;enlist (=;`i;(fby;(enlist;last;`i);(flip;(lsq;enlist`col1`col2;(enlist;`col1;`col2)))));0b;()]
// Applying this
q)?[tab;enlist (=;`i;(fby;(enlist;last;`i);(flip;(lsq;enlist`col1`col2;(enlist;`col1;`col2)))));0b;()]
sym col1 col2
----------------------
bah 18.70281 3.927524
jjb 35.95293 5.170911
ihm 48.09078 5.159796
...
Glad you were able to fix your problem with converting your query to functional form.
Generally it is the case that when you use parse with a fby in your statement, q will convert this function into its k definition. Usually you should just be able to replace this k code with the q function itself (i.e. change (k){stuff} to fby) and this should run properly when turning the query into functional form.
Additionally, https://code.kx.com/v2/wp/parse-trees/ goes into more detail about parse trees and functional form. It also contains a script called buildQuery which will return the functional form of the query of interest as a string, which can be quite handy and save time when a functional form is complex.
I actually got it myself ->
?[tableName;((=;`i;(fby;(enlist;last;`i);(+:;(!;enlist`column_one`column_two;(enlist;`column_one;`column_two)))));(in;`venue;enlist`venueone`venuetwo));0b;()]
The issue was a () missing from the statement. It works fine now.
If someone wants to add a more detailed explanation of how manual parse trees are built and how the generic (k){} function can be replaced with the actual function in q, feel free to add your answer and I'll accept and upvote it.

Dataframe filtering with condition applied to list of columns

I want to filter a PySpark dataframe so that rows where any of the string columns in a list are empty get removed. I tried:
df = df.where(all([col(x) != '' for x in col_list]))
but it raises:
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
You can use reduce from functools to simulate all, like this:
from functools import reduce
from pyspark.sql import functions as F

spark_df.where(reduce(lambda x, y: x & y, (F.col(x) != '' for x in col_list))).show()
Since filter (or where) is a lazily evaluated transformation, we can also merge multiple conditions by applying them one by one, e.g.
from pyspark.sql.functions import col

for c in col_list:
    spark_df = spark_df.filter(col(c) != "")
spark_df.show()
This may be a bit more readable, but in the end it is executed in exactly the same way as Sreeram's answer.
On a side note, removing rows with empty values would most often be done with
df.na.drop(how="any", subset=col_list)
but it only handles missing (null / None) values, not empty strings.
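If both nulls and empty strings need to be removed, the two approaches can be combined; a small sketch with made-up column names:

from functools import reduce
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("a", "b"), ("", "d"), (None, "e")], ["c1", "c2"])
col_list = ["c1", "c2"]

cleaned = (
    df.na.drop(how="any", subset=col_list)                    # drop rows with nulls
      .where(reduce(lambda x, y: x & y,
                    (F.col(c) != "" for c in col_list)))      # drop rows with empty strings
)
cleaned.show()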

Haskell HDBC Elegance in F#?

I'm struck by Haskell's terseness and elegance. But I work in a .Net house, so I use F# when I can get away with it--I may be the only one of hundreds across the country who uses it.
Does ADO.NET or F# offer something as terse and elegant as HDBC's executeMany? I'm making my way through Real World Haskell. In chapter 21 it offers this example:
ghci> conn <- connectSqlite3 "test1.db"
ghci> stmt <- prepare conn "INSERT INTO test VALUES (?, ?)"
ghci> executeMany stmt [[toSql 5, toSql "five's nice"], [toSql 6, SqlNull]]
ghci> commit conn
ghci> disconnect conn
I'd like to get this elegance and terseness in my F#. I've seen a lot of hype around using parameterized queries to avoid SQL injection attacks. I'm not using them in this case for three reasons:
I find parameterized queries in .Net ugly and burdensome.
My data comes from the corporate office, so it's (mostly) clean.
My table has 34 columns. I despise the idea of parameterizing a query with 34 columns.
Here's my F# code:
module Data

open System
open System.Data
open System.Data.OleDb
open System.Text.RegularExpressions

type Period = Prior | Current

let Import period records db =
    use conn = new OleDbConnection(@"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" + db + ";Persist Security Info=False;")
    let execNonQuery s =
        let comm = new OleDbCommand(s, conn) in
        comm.ExecuteNonQuery() |> ignore
    let enquote = sprintf "\"%s\""
    let escapeQuotes s = Regex.Replace(s, "\"", "\"\"")
    let join (ss:string[]) = String.Join(",", ss)
    let table =
        match period with
        | Prior -> "tblPrior"
        | Current -> "tblCurrent"
    let statements =
        [| for r in records do
            let vs = r |> Array.map (escapeQuotes >> enquote) |> join
            let vs' = vs + sprintf ",\"14\",#%s#" (DateTime.Now.ToString "yyyy-MM-dd") in
            yield sprintf "INSERT INTO %s ( [Field01], [Field02], [Field03], [Field04], [Field05], [Field06], [Field07], [Field08], [Field09], [Field10], [Field11], [Field12], [Field13], [Field14], [Field15], [Field16], [Field17], [Field18], [Field19], [Field20], [Field21], [Field22], [Field23], [Field24], [Field25], [Field26], [Field27], [Field28], [Field29], [Field30], [Field31], [Field32], [Field33], [Field34] ) VALUES (%s)" table vs' |] in
    do conn.Open()
    execNonQuery (sprintf "DELETE FROM %s" table)
    statements |> Array.iter execNonQuery
I've renamed the fields of the table(s) for security reasons.
Because all the fields on the table are text, I can easily Array.map them to escape and quote the values.
At between 9,000 and 10,000 records per day to import into each of the two tables, I want to do this as efficiently as possible, hence my interest in the executeMany of Haskell. That said, I like the idea behind parameterized queries, and I like the way Haskell has implemented them. Is there something equivalent in terseness and elegance in F#?
I agree with @JonnyBoats' comment that generally using an F# SQL type provider like SqlDataConnection (LINQ-to-SQL) or SqlEntityConnection (Entity Framework) would be far more elegant than any kind of solution involving building insert statement strings by hand.
But, there is one important qualifier to your question: "At between 9,000 and 10,000 records per day to import to each of the two tables, I want to do this as efficiently as possible." In a scenario like this, you'll want to use SqlBulkCopy for efficient bulk inserts (it leverages native database driver features for much faster inserts than you are likely getting with HDBC's executeMany).
Here's a small example that should help you get started using SqlBulkCopy with F#: https://stackoverflow.com/a/8942056/236255. Note that you'll be working with a DataTable to stage the data, which, though old and somewhat awkward to use from F#, is still superior to building insert statement strings in my opinion.
Update in response to comment
Here's a generalized approach to using SqlBulkCopy which is improved for your scenario (we pass in a column specification separately from the row data, and both are dynamic):
//you must reference System.Data and System.Xml
open System
open System.Data
open System.Data.SqlClient

let bulkLoad (conn:SqlConnection) tableName (columns:list<string * Type>) (rows: list<list<obj>>) =
    use sbc = new SqlBulkCopy(conn, SqlBulkCopyOptions.TableLock, null, BatchSize=500, BulkCopyTimeout=1200, DestinationTableName=tableName)
    sbc.WriteToServer(
        let dt = new DataTable()
        columns
        |> List.iter (dt.Columns.Add >> ignore)
        for row in rows do
            let dr = dt.NewRow()
            row |> Seq.iteri (fun i value -> dr.[i] <- value)
            dt.Rows.Add(dr)
        dt)
//example usage:
//note: since you know all your columns are of type string, you could define columns like
//let columns = ["Field1"; "Field2"; "Field3"] |> List.map (fun name -> name, typeof<String>)
let columns = [
    "Field1", typeof<String>
    "Field2", typeof<String>
    "Field3", typeof<String>
]

let rows = [
    ["a"; "b"; "c"]
    ["d"; "e"; "f"]
    ["g"; "h"; "i"]
    ["j"; "k"; "l"]
    ["m"; "n"; "o"]
]

//a little funkiness to transform our list<list<string>> to list<list<obj>>,
//probably not needed in practice because you won't be constructing your lists literally
let rows = rows |> List.map (fun row -> row |> List.map (fun value -> value :> obj))

bulkLoad conn "tblPrior" columns rows
You could get even fancier / more terse using an approach involving reflection. e.g. create a type like
type RowData = { Field1:string; Field2:string; Field3:string }
and make a bulkLoad with a signature that takes a list<'a> argument such that it reflects over the property names and types of typeof<'a> to build the DataTable Columns, and similarly uses reflection to iterate over all the properties of a row instance to create and add a new row to the DataTable. In fact, this question shows how to make a generic ToDataTable method that does it (in C#).