I am trying to translate a PySpark solution into Scala.
Here is the code:
conditions_ = [when(df1[c]!=df2[c], lit(c)).otherwise("") for c in df1.columns if c not in ['firstname','middlename','lastname']]
status = (when(df1["id"].isNull(), lit("added"))
          .when(df2["id"].isNull(), lit("deleted"))
          .when(size(array_remove(array(*conditions_), "")) > 0, lit("updated"))
          .otherwise("unchanged"))
For Scala, I am simply trying to use expr instead of * to substitute the conditions_ expression in my when clause, but that is not supported because of the for-comprehension syntax.
Can you please point me to the right syntax for adding a loop inside the when clause, so that the count of column differences is calculated dynamically?
If you want to unpack an array in Scala, you can use the following syntax:
when(size(array_remove(array(conditions_:_*), "")) > 0, lit("updated"))
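For the full picture, here is a sketch of how the whole snippet could look in Scala (assuming df1 and df2 are the two DataFrames from the PySpark version; the variable names are just illustrative):
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

// Build the list of per-column difference markers, skipping the name columns
val excluded = Set("firstname", "middlename", "lastname")
val conditions: Seq[Column] = df1.columns.toSeq
  .filterNot(excluded.contains)
  .map(c => when(df1(c) =!= df2(c), lit(c)).otherwise(""))

// Splice the Seq into array() with :_* and count the non-empty markers
val status: Column =
  when(df1("id").isNull, lit("added"))
    .when(df2("id").isNull, lit("deleted"))
    .when(size(array_remove(array(conditions: _*), "")) > 0, lit("updated"))
    .otherwise("unchanged")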
I'm having some trouble refactoring a Spark DataFrame to not use expr but instead use DataFrame filters and when conditionals.
My code is this:
outDF = outDF.withColumn("MAIN_TYPE", expr
("case when 'TYPE_A' in (GROUP_A,GROUP_B,GROUP_C,GROUP_D) then 'TYPE_A'" +
"when 'TYPE_B' in (GROUP_A,GROUP_B,GROUP_C,GROUP_D) then 'TYPE_B'" +
"when 'TYPE_C' in (GROUP_A,GROUP_B,GROUP_C,GROUP_D) then 'TYPE_C'" +
"when 'TYPE_D' in (GROUP_A,GROUP_B,GROUP_C,GROUP_D) then 'TYPE_D' else '0' end")
.cast(StringType))
The only solution that I could think of so far is a series of individual .when().otherwise() chains, but that would require m×n lines, where m is the number of Types and n is the number of Groups that I need.
Is there any better way to do this kind of operation?
Thank you very much for your time!
So, this is how I worked this out, in case anyone is interested:
I used a helper column for the groups, which I later dropped.
This is how it worked:
outDF = outDF.withColumn("Helper_Column", concat(col("Group_A"),col("Group_B"),
col("Group_C"),col("Group_D")))
outDF = outDF.withColumn("MAIN_TYPE", when(col("Helper_Column").like("%Type_A%"),"Type_A").otherwise(
when(col("Helper_Column").like("%Type_B%"),"Type_B").otherwise(
when(col("Helper_Column").like("%Type_C%"),"Type_C").otherwise(
when(col("Helper_Column").like("%Type_D%"),"Type_D").otherwise(lit("0")
)))))
outDF = outDF.drop("Helper_Column")
Hope this helps someone.
Hello everyone, I am new to Scala. After seeing the do..while syntax, which is the following:
do{
<action>
}while(condition)
I have been asked to do an exercise that consists of predicting the output of a program containing a do..while loop.
var set = Set(1)
do {
set = set + (set.max + 1)
} while (set.sum < 32)
println(set.size)
After execution I get the following error:
end of statement expected but 'do' found
do {
I know that it is possible to convert this loop to a while loop (which is even ideal). However, I would like to know whether the do..while loop still works in Scala. If yes, is this a syntax error (I searched on the net but found the same syntax and no mention of this error)? If no, from which version is the loop no longer supported?
You can still use do-while in Scala 2.13.10
https://scastie.scala-lang.org/DmytroMitin/JcGnZS3DRle3jXIUiwkb0A
In Scala 3, do-while was dropped, but you can write it in a while-do manner by relying on Scala being expression-oriented (i.e. the last expression in a block is what the block returns):
while ({
set = set + (set.max + 1)
set.sum < 32
}) {}
https://scastie.scala-lang.org/DmytroMitin/JcGnZS3DRle3jXIUiwkb0A/2
https://docs.scala-lang.org/scala3/reference/dropped-features/do-while.html
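Putting it together, a complete version of the exercise in that style (for reference; tracing the loop, the set grows to {1, 2, ..., 8} before the sum reaches 36, so it prints 8):
var set = Set(1)
// do-while rewritten as while-do: the block performs the action and then
// yields the loop condition as its last expression
while ({
  set = set + (set.max + 1)
  set.sum < 32
}) {}
println(set.size) // 8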
I have a Python class with two data members: the first is a Polars time series, the second a list of strings.
A dictionary provides a mapping from string to function: each string in the list is associated with a function that returns a Polars frame (of one column).
A class method then creates a Polars data frame whose first column is the time series and whose other columns are produced by these functions.
The columns are all independent.
Is there a way to create this data frame in parallel?
Here I try to define a minimal example:
from dataclasses import dataclass
from typing import List

import polars as pl

@dataclass
class data_frame_constr:
    function_list: List[str]
    time_series: pl.DataFrame

    def compute_indicator_matrix(self) -> pl.DataFrame:
        for element in self.function_list:
            # mapping[element] is a custom function that returns a pl column;
            # here the columns are constructed one at a time in the loop
            self.time_series = self.time_series.with_column(mapping[element])
        return self.time_series
For example, function_list = ["square", "square_root"].
The time series is a single-column frame; I need to create square and square-root columns (or columns from other custom, more complex functions, identified by name), but I only know the list of functions at runtime, as specified in the constructor.
You can use the with_columns context to provide a list of expressions, as long as the expressions are independent. (Note the plural: with_columns.) Polars will attempt to run all expressions in the list in parallel, even if the list of expressions is generated dynamically at run-time.
def mapping(func_str: str) -> pl.Expr:
'''Generate Expression from function string'''
...
def compute_indicator_matrix(self) -> pl.DataFrame:
expr_list = [mapping(next_funct_str)
for next_funct_str in self.function_list]
self.time_series = self.time_series.with_columns(expr_list)
return self.time_series
One note: it is a common misconception that Polars is a generic thread pool that will run any/all code in parallel. That is not true.
If any of your expressions call external libraries or custom Python bytecode functions (e.g., using a lambda function, map, apply, etc.), then your code will be subject to the Python GIL and will run single-threaded, no matter how you code it. Thus, try to use only Polars expressions to achieve your objectives (rather than calling external libraries or Python functions).
For example, try the following. (Choose a value of nbr_rows that will stress your computing platform.) If we run the code below, it will run in parallel because everything is expressed using Polars expressions, without calling external libraries or custom Python code. The result is embarrassingly parallel performance.
nbr_rows = 100_000_000
df = pl.DataFrame({
'col1': pl.repeat(2, nbr_rows, eager=True),
})
df.with_columns([
pl.col('col1').pow(1.1).alias('exp_1.1'),
pl.col('col1').pow(1.2).alias('exp_1.2'),
pl.col('col1').pow(1.3).alias('exp_1.3'),
pl.col('col1').pow(1.4).alias('exp_1.4'),
pl.col('col1').pow(1.5).alias('exp_1.5'),
])
However, if we instead write the code using lambda functions that call Python bytecode, then it will run very slowly.
import math
df.with_columns([
pl.col('col1').apply(lambda x: math.pow(x, 1.1)).alias('exp_1.1'),
pl.col('col1').apply(lambda x: math.pow(x, 1.2)).alias('exp_1.2'),
pl.col('col1').apply(lambda x: math.pow(x, 1.3)).alias('exp_1.3'),
pl.col('col1').apply(lambda x: math.pow(x, 1.4)).alias('exp_1.4'),
pl.col('col1').apply(lambda x: math.pow(x, 1.5)).alias('exp_1.5'),
])
I want to filter a pyspark dataframe if any of the string columns in a list are empty.
df = df.where(all([col(x)!='' for x in col_list]))
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
You can use reduce from functools to simulate all, like this:
from functools import reduce
from pyspark.sql import functions as F

spark_df.where(reduce(lambda x, y: x & y, (F.col(x) != '' for x in col_list))).show()
Since filter (or where) is a lazily evaluated transformation, we can merge multiple conditions by applying them one by one, e.g.
for c in col_list:
spark_df = spark_df.filter(col(c) != "")
spark_df.show()
This may be a bit more readable, but in the end it will be executed in exactly the same way as Sreeram's answer.
On a side note, removing rows with empty values would most often be done with
df.na.drop(how="any", subset=col_list)
but it only handles missing (null / None) values, not empty strings.
val rtnRdd = originRdd.filter( ~~~ ) // 1
// 2
var eventList: List[myType] = Nil
originRdd.foreach { elem =>
  if (some condition)
    eventList :+= myType( ~~ )
}
// convert eventList to an RDD
Which is the proper and fast way in Spark? If '1' is the proper way, why shouldn't I use the '2' code style?
The 2nd style is not favored because functional programming leans towards using expressions over statements; the second snippet is built on statements and mutable assignment, both of which cause side effects and make the code hard to parallelize. On top of that, in Spark the foreach closure runs on the executors, so appending to the driver-side eventList inside it will not update the list you see on the driver, meaning style '2' does not even produce the intended result on a cluster.
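A minimal sketch of what style '1' could look like end to end (isEvent and toMyType are hypothetical helpers standing in for the elided ~~~ logic):
// Stay inside RDD transformations: Spark can parallelize this freely,
// and there is no driver-side mutable state
val eventRdd = originRdd
  .filter(elem => isEvent(elem))   // keep only the matching records
  .map(elem => toMyType(elem))     // build myType values as a transformation

// Only if a local List is truly needed, collect explicitly on the driver:
val eventList: List[myType] = eventRdd.collect().toList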