What do lit(0) and lit(1) do in Scala/Spark aggregate functions?

I have this piece of code:
val df = resultsDf
  .withColumn("data_exploded", explode(col("data_item")))
  .groupBy("data_id", "data_count")
  .agg(
    count(lit(1)).as("aggDataCount"),
    sum(when(col("data_exploded.amount") === "A", col("data_exploded.positive_amount")).otherwise(lit(0))).as("aggAmount")
  )
Is lit(0) referring to an index at position 0, or to the literal value of the number 0? The definition I see at https://mungingdata.com/apache-spark/spark-sql-functions/#:~:text=The%20lit()%20function%20creates,spanish_hi%20column%20to%20the%20DataFrame.&text=The%20lit()%20function%20is%20especially%20useful%20when%20making%20boolean%20comparisons says that "The lit() function creates a Column object out of a literal value." That definition makes me think it is not referring to an index position but to a literal value such as a number or a String. However, the usage in count(lit(1)).as("aggDataCount") looks to me like it refers to the index position of a column. Thank you.

lit(1) means the literal value 1.
count(lit(1)).as("aggDataCount") is a way to count the number of rows: every row contributes a constant column holding the value 1, and count counts those non-null values, which comes out to the number of rows.
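For instance, a minimal sketch with made-up data (an active spark session with implicits imported is assumed):

import org.apache.spark.sql.functions.{count, lit}
import spark.implicits._

// Toy DataFrame, purely for illustration.
val toy = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("id", "value")

// count(lit(1)) counts one literal 1 per row, so the result equals the row count.
toy.agg(count(lit(1)).as("aggDataCount")).show()  // 3
println(toy.count())                              // 3, the same number of rows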

In Spark, lit represents a literal value.
lit(0) puts 0 as a value in a column; lit(1) puts 1 as a value in a column.
In the code you showed above, an aggregation is applied after grouping by two columns, and count(lit(1)) keeps a count of how many rows fall into each group.
The other lit(0) is in the otherwise clause, which is like an else branch: for rows where the when condition is not met, lit(0) supplies 0 as the literal value in the column.
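As a small sketch outside of an aggregation (the data and column names are made up to mirror the question):

import org.apache.spark.sql.functions.{col, lit, when}
import spark.implicits._

val sample = Seq(("A", 10), ("B", 20)).toDF("amount", "positive_amount")

// lit(1) produces a column holding the literal value 1 in every row.
sample.withColumn("one", lit(1)).show()

// otherwise(lit(0)) supplies the literal 0 for rows where the when condition is false.
sample.withColumn(
  "kept_amount",
  when(col("amount") === "A", col("positive_amount")).otherwise(lit(0))
).show()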

Related

How do I replace string with 0 in multiple columns in Pyspark

As in the title. I have a list of columns and need to replace a certain string with 0 in these columns. I can do that using a select statement with a nested when function, but I want to preserve my original dataframe and only change the columns in question. df.replace(string, 0, list_of_columns) doesn't work as there is a data type mismatch.
So I ended up with something like this, which worked for me:
import pyspark.sql.functions as F

for column in column_list:
    df = df.withColumn(column, F.when(F.col(column) == "string", "0").otherwise(F.col(column)))
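Since the main question in this thread is about the Scala API, the same pattern in Scala might look like the sketch below (df, the column list, and the target string are placeholders mirroring the loop above):

import org.apache.spark.sql.functions.{col, when}

// Hypothetical column list; fold over it, rewriting each column in place.
val columnList = Seq("col1", "col2")
val replaced = columnList.foldLeft(df) { (acc, c) =>
  acc.withColumn(c, when(col(c) === "string", "0").otherwise(col(c)))
}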

Randomly replace RDD values with null using Scala Spark

I have a csv file containing almost 15000 records. Each line contains 3 types of data divided by a tab (\t). I actually want to randomly replace the second column's value with null! Maybe I will keep 8000 values as they are and replace 7000 with null.
Any help with Scala (Spark), please?
Here is how it can be done:
read the data as a DataFrame
generate a new column, say rnd, which is a random number from 0 to 1
make col2 = col2 when rnd < 0.5 (if you want to make 50% of the values null), else null
import org.apache.spark.sql.functions.{lit, rand, when}
import spark.implicits._

spark.read.option("header", "true").option("sep", "\t").csv(<your_path>)
  .withColumn("rnd", rand())
  .withColumn("col2", when($"rnd" < 0.5, $"col2").otherwise(lit(null).cast(<col2_datatype_here>)))
@amelie, notice the $ in front of "rnd" in the when condition in my answer.
You were supposed to do a column comparison, not a value comparison.
PS: Not able to post a comment as I am a Stack Overflow newbie, hence a separate answer.
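A short sketch of the difference (the cast type here is only an example, since the answer above leaves col2's data type as a placeholder):

import org.apache.spark.sql.functions.{lit, when}
import spark.implicits._

// $"rnd" is a Column, so $"rnd" < 0.5 builds a Column expression that Spark evaluates per row.
val keepOrNull = when($"rnd" < 0.5, $"col2").otherwise(lit(null).cast("string"))

// By contrast, "rnd" < 0.5 would not even compile: it tries to compare a String with a Double.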

How to create a Column expression from collection of column names?

I have a list of strings, which represents the names of various columns I want to add together to make another column:
val myCols = List("col1", "col2", "col3")
I want to convert the list to columns, then add the columns together to make a final column. I've looked for a number of ways to do this, and the closest I can come to the answer is:
df.withColumn("myNewCol", myCols.foldLeft(lit(0))(col(_) + col(_)))
I get a compile error where it says it is looking for a string, when all I really want is a column. What's wrong? How to fix it?
When I tried it out in spark-shell, the error message said exactly what the problem is and where.
scala> myCols.foldLeft(lit(0))(col(_) + col(_))
<console>:26: error: type mismatch;
found : org.apache.spark.sql.Column
required: String
myCols.foldLeft(lit(0))(col(_) + col(_))
^
Just think of the first pair that is given to the function passed to foldLeft: it's going to be lit(0) of type Column and "col1" of type String. The two underscores expand to (x, y) => col(x) + col(y), where x is the Column accumulator, and there is no col function that accepts a Column, hence the type mismatch.
Try reduce instead:
myCols.map(col).reduce(_ + _)
From the official documentation of reduce:
Applies a binary operator to all elements of this collection, going right to left.
the result of inserting op between consecutive elements of this collection, going right to left:
op(x_1, op(x_2, ..., op(x_{n-1}, x_n)...))
where x1, ..., xn are the elements of this collection.
Here is how you can add columns dynamically based on the column names in a List. When all the columns are numeric, the result is a number. The first argument to foldLeft (the accumulator) has the same type as the result, so foldLeft works just as well as reduce here.
val employees = ... // a DataFrame with two numeric columns, "salary" and "exp"
val initCol = lit(0)
val cols = Seq("salary", "exp")
val sumCol = cols.foldLeft(initCol)((x, y) => x + col(y))
employees.select(sumCol).show()
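Putting it back into the shape of the original question, a sketch with made-up data (the values and the spark session are assumptions):

import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._

val df = Seq((1, 10, 100), (2, 20, 200)).toDF("col1", "col2", "col3")
val myCols = List("col1", "col2", "col3")

// reduce: map each name to a Column, then add the Columns pairwise.
df.withColumn("myNewCol", myCols.map(col).reduce(_ + _)).show()

// foldLeft: name the parameters so the accumulator stays a Column and the element a String.
df.withColumn("myNewCol", myCols.foldLeft(lit(0))((acc, name) => acc + col(name))).show()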

How to convert null rows to 0 and sum the entire column using DB2?

I'm using the following query to sum the entire column. In the TO_REMOVEALLPRIV column, I have both integer and null values.
I want to sum both the null and integer values and print the total sum value.
Here is my query, which prints the sum value as null:
select
sum(URT.PRODSYS) as URT_SUM_PRODSYS,
sum(URT.Users) as URT_SUM_USERS,
sum(URT.total_orphaned) as URT_SUM_TOTAL_ORPHANED,
sum(URT.Bp_errors) as URT_SUM_BP_ERRORS,
sum(URT.Ma_errors) as URT_SUM_MA_ERRORS,
sum(URT.Pp_errors) as URT_SUM_PP_ERRORS,
sum(URT.REQUIREURTCBN) as URT_SUM_CBNREQ,
sum(URT.REQUIREURTQEV) as URT_SUM_QEVREQ,
sum(URT.REQUIREURTPRIV) as URT_SUM_PRIVREQ,
sum(URT.cbnperf) as URT_SUM_CBNPERF,
sum(URT.qevperf) as URT_SUM_QEVPERF,
sum(URT.privperf) as URT_SUM_PRIVPERF,
sum(URT.TO_REMOVEALLPRIV) as TO_REMOVEALLPRIV_SUM
from
URTCUSTSTATUS URT
inner join CUSTOMER C on URT.customer_id=C.customer_id;
Output image: (not included; the TO_REMOVEALLPRIV_SUM column shows NULL)
Expected output: (not included)
Instead of null, I need to print the sum of whichever rows have integers.
The SUM function automatically handles that for you. You said the column had a mix of NULL and numbers; the SUM automatically ignores the NULL values and correctly returns the sum of the numbers. You can read it on IBM Knowledge Center:
The function is applied to the set of values derived from the argument values by the elimination of null values.
Note: all aggregate functions ignore NULL values; the exception is COUNT(*), which counts rows rather than values. Example: if you have two records with values 5 and NULL, the SUM and AVG functions will both return 5, COUNT(column) will return 1, but COUNT(*) will return 2.
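The same rule holds for the Spark aggregates discussed in the main question; a small Scala sketch with toy data (spark session assumed):

import org.apache.spark.sql.functions.{avg, count, lit, sum}
import spark.implicits._

// The Option models a nullable integer column.
val t = Seq(("x", Some(5)), ("y", None)).toDF("id", "v")

t.agg(
  sum("v"),      // 5   -- the NULL is ignored
  avg("v"),      // 5.0 -- the NULL is ignored
  count("v"),    // 1   -- counts non-NULL values only
  count(lit(1))  // 2   -- counts every row, like COUNT(*)
).show()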
However, it seems that you misunderstood why you're getting NULL as a result. It's not because the column contains null values; it's because no records are selected (SUM also returns NULL when every selected value is NULL, but you said your column has a mix). If you want to return zero in this case, you can use the COALESCE or IFNULL function. Both behave the same in this scenario:
COALESCE(sum(URT.TO_REMOVEALLPRIV), 0) as TO_REMOVEALLPRIV_SUM
or
IFNULL(sum(URT.TO_REMOVEALLPRIV), 0) as TO_REMOVEALLPRIV_SUM
I'm guessing that you want to do the same for all the other columns in your query, so I'm not sure why you only asked about the TO_REMOVEALLPRIV column.
What you're looking for is the COALESCE function:
select
sum(URT.PRODSYS) as URT_SUM_PRODSYS,
sum(URT.Users) as URT_SUM_USERS,
sum(URT.total_orphaned) as URT_SUM_TOTAL_ORPHANED,
sum(URT.Bp_errors) as URT_SUM_BP_ERRORS,
sum(URT.Ma_errors) as URT_SUM_MA_ERRORS,
sum(URT.Pp_errors) as URT_SUM_PP_ERRORS,
sum(URT.REQUIREURTCBN) as URT_SUM_CBNREQ,
sum(URT.REQUIREURTQEV) as URT_SUM_QEVREQ,
sum(URT.REQUIREURTPRIV) as URT_SUM_PRIVREQ,
sum(URT.cbnperf) as URT_SUM_CBNPERF,
sum(URT.qevperf) as URT_SUM_QEVPERF,
sum(URT.privperf) as URT_SUM_PRIVPERF,
sum(COALESCE(URT.TO_REMOVEALLPRIV,0)) as TO_REMOVEALLPRIV_SUM
from
URTCUSTSTATUS URT
inner join CUSTOMER C on URT.customer_id=C.customer_id;

pyspark: get unique items in each column of a dataframe

I have a spark dataframe containing 1 million rows and 560 columns. I need to find the count of unique items in each column of the dataframe.
I have written the following code to achieve this but it is getting stuck and taking too much time to execute:
count_unique_items = []
for j in range(len(cat_col)):
    var = cat_col[j]
    count_unique_items.append(data.select(var).distinct().rdd.map(lambda r: r[0]).count())
cat_col contains the column names of all the categorical variables
Is there any way to optimize this?
Try using approxCountDistinct or countDistinct:
from pyspark.sql.functions import approxCountDistinct, countDistinct
counts = df.agg(approxCountDistinct("col1"), approxCountDistinct("col2")).first()
but counting distinct elements is expensive.
You can do something like this, but as stated above, counting distinct elements is expensive. The single * unpacks the generator so that each aggregate expression is passed as a separate argument, and the return value is 1 row by N columns. I frequently follow it with a .toPandas() call to make the result easier to manipulate later down the road.
from pyspark.sql.functions import col, approxCountDistinct

distvals = df.agg(*(approxCountDistinct(col(c), rsd=0.01).alias(c) for c in df.columns))
You can get the frequent items of each column with
df.stat.freqItems([list of column names], [minimum frequency, default = 1%])
This returns a dataframe with those values, but if you want a dataframe with just the distinct count of each column, use this:
from pyspark.sql.functions import countDistinct

df.select([countDistinct(cn).alias("c_{0}".format(cn)) for cn in df.columns]).show()
The counting part is taken from here: check number of unique values in each column of a matrix in spark