Flip each bit in a Spark DataFrame calling a custom function - Scala

I have a Spark DataFrame that looks like this:
ID |col1|col2|col3|col4.....
A  |0   |1   |0   |0....
C  |1   |0   |0   |0.....
E  |1   |0   |1   |1......
ID is a unique key and the other columns hold binary values (0/1).
Now I want to iterate over each row and, whenever a column value is 0, apply some function to that single row (passed as a DataFrame).
For example, col1 == 0 for ID A in the DataFrame above,
so the DataFrame for that row should look like:
newDF.show()
ID |col1|col2|col3|col4.....
A  |1   |1   |0   |0....
myfunc(newDF)
The next 0 for ID A is encountered in col3, so the new DataFrame looks like:
newDF.show()
ID |col1|col2|col3|col4.....
A  |0   |1   |1   |0....
val max = myfunc(newDF) // the function returns a Double
and so on...
Note: each 0 bit is flipped once at the row level per function call, resetting the effect of the previously flipped bit.
P.S.: I tried using withColumn with a UDF, but ran into serialization issues from referencing a DataFrame inside a DataFrame.
Actually, the myfunc I'm calling sends the row for scoring to an ML model I have, which returns the probability for that user if a particular bit is flipped. So I have to iterate through each column that is set to 0 and set it to 1 for that particular instance.

I'm not sure you need anything particularly complex for this. Given that you have imported the SQL functions and the session implicits:
val spark: SparkSession = ??? // your session
import spark.implicits._
import org.apache.spark.sql.functions._
you should be able to "flip the bits" (although I'm assuming those are actually encoded as numbers) by applying the following function:
def flip(col: Column): Column = when(col === 1, lit(0)).otherwise(lit(1))
as in this example:
df.select($"ID", flip($"col1") as "col1", flip($"col2") as "col2")
You can easily rewrite the flip function to deal with edge cases or use a different type (if, for example, the "bits" are encoded as booleans or strings).
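If you then need to flip one bit column at a time and score each variant, as described in the question, a minimal sketch (using the same imports as above) could loop over the binary columns, flip the 0s in one column, and pass the result to the scoring function. Here myfunc (a DataFrame => Double scoring function) and the set of bit columns are assumptions taken from the question, not part of this answer:
// Minimal sketch, assuming df has an "ID" column plus 0/1 columns,
// and myfunc: DataFrame => Double is the user's own scoring function.
val bitCols = df.columns.filter(_ != "ID")
val scores = bitCols.toSeq.map { c =>
  // flip the 0s in this one column, leaving all other columns untouched
  val variant = df.withColumn(c, when(col(c) === 0, lit(1)).otherwise(col(c)))
  c -> myfunc(variant)
}
Each iteration starts from the original df, so the previous flip is implicitly reset; to score a single row, filter on its ID before calling myfunc.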

Related

Filling blank field in a DataFrame with previous field value

I am working with Scala and Spark and I am relatively new to programming in Scala, so maybe my question has a simple solution.
I have one DataFrame that keeps information about the clients that activate and deactivate in some promotion. That DataFrame shows the Client ID, the action that the client took (a client can activate or deactivate from the promotion at any time) and the Date the client took this action. Here is an example of that format:
[image: example of how the DataFrame looks]
I want to monitor daily which clients are active and see how this number varies over the days, but I have not been able to code anything that works like that.
My idea was to crossJoin two DataFrames: one with only the Client IDs and another with only the Dates, so I would have every Date paired with every Client ID and would only need to determine the client's status on each Date (active or deactivated). After that I left-joined this new DataFrame with the DataFrame that relates the Client ID to the events, but the result has a lot of dates with a null status, and I don't know how to fill them with the correct status. Here's the example:
[image: example of the final DataFrame]
I have already tried to use the lag function, but it did not solve my problem. Does anyone have any idea that could help me?
Thank You!
This is a slightly expensive operation, since Spark SQL has restrictions on correlated sub-queries with <, <=, >, >=.
Starting from your second DataFrame with the NULLs, and assuming the system is large enough and the volume of data is manageable:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._ // needed for toDF and the $-syntax; spark is your SparkSession
// My sample input
val df = Seq(
  (1, "2018-03-12", "activate"),
  (1, "2018-03-13", null),
  (1, "2018-03-14", null),
  (1, "2018-03-15", "deactivate"),
  (1, "2018-03-16", null),
  (1, "2018-03-17", null),
  (1, "2018-03-18", "activate"),
  (2, "2018-03-13", "activate"),
  (2, "2018-03-14", "deactivate"),
  (2, "2018-03-15", "activate")
).toDF("ID", "dt", "act")
//df.show(false)
val w = Window.partitionBy("ID").orderBy(col("dt").asc)
val df2 = df.withColumn("rank", dense_rank().over(w)).select("ID", "dt","act", "rank") //.where("rank == 1")
//df2.show(false)
val df3 = df2.filter($"act".isNull)
//df3.show(false)
val df4 = df2.filter(!($"act".isNull)).toDF("ID2", "dt2", "act2", "rank2")
//df4.show(false)
val df5 = df3.join(df4, (df3("ID") === df4("ID2")) && (df4("rank2") < df3("rank")),"inner")
//df5.show(false)
val w2 = Window.partitionBy("ID", "rank").orderBy(col("rank2").desc)
val df6 = df5.withColumn("rank_final", dense_rank().over(w2)).where("rank_final == 1").select("ID", "dt","act2").toDF("ID", "dt", "act")
//df6.show
val df7 = df.filter(!($"act".isNull))
val dfFinal = df6.union(df7)
dfFinal.show(false)
returns:
+---+----------+----------+
|ID |dt |act |
+---+----------+----------+
|1 |2018-03-13|activate |
|1 |2018-03-14|activate |
|1 |2018-03-16|deactivate|
|1 |2018-03-17|deactivate|
|1 |2018-03-12|activate |
|1 |2018-03-15|deactivate|
|1 |2018-03-18|activate |
|2 |2018-03-13|activate |
|2 |2018-03-14|deactivate|
|2 |2018-03-15|activate |
+---+----------+----------+
I solved this step-wise and in a rush, so the approach may not be the most obvious one.
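For what it's worth, a shorter alternative sketch (assuming Spark 2.1+ for Window.unboundedPreceding and last with ignoreNulls) is to forward-fill the most recent non-null action within each ID:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
// carry the latest non-null action down to the following dates, per ID
val w3 = Window.partitionBy("ID").orderBy("dt")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn("act", last("act", ignoreNulls = true).over(w3)).show(false)
This is only a sketch of the same idea, not a drop-in replacement for the step-wise solution above.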

How to remove the fractional part from a dataframe column?

Input dataframe:
val ds = Seq((1,34.44),
(2,76.788),
(3,54.822)).toDF("id","mark")
Expected output:
val ds = Seq((1,34),
(2,76),
(3,54)).toDF("id","mark")
I want to remove the fractional part from the column mark as above. I have searched for a built-in function but did not find anything. How should a UDF look to achieve the above result?
You can just cast to integer, as in:
import org.apache.spark.sql.functions._
ds.withColumn("mark", $"mark".cast("integer")).show(false)
which should give you
+---+----+
|id |mark|
+---+----+
|1 |34 |
|2 |76 |
|3 |54 |
+---+----+
I hope the answer is helpful
Update
You commented as
But if any string values are there in the column, it is getting null since we are casting to integer. I don't want that kind of behaviour.
So I guess your mark column must be a StringType, and you can use regexp_replace:
import org.apache.spark.sql.functions._
ds.withColumn("mark", regexp_replace($"mark", "(\\.\\d+)", "")).show(false)

Populate a "Grouper" column using .withcolumn in scala.spark dataframe

Trying to populate the grouper column like below. In the table below, X signifies the start of a new record, so each X, Y, Z block needs to be grouped together. In MySQL, I would accomplish this like:
select @x:=1;
update table set grouper=if(column_1='X',@x:=@x+1,@x);
I am trying to see if there is a way to do this without using a loop, using .withColumn or something similar.
What I have tried:
var group = 1;
val mydf4 = mydf3.withColumn("grouper", when(col("column_1").equalTo("INS"),group=group+1).otherwise(group))
[image: example DF]
A simple window function with the row_number() built-in function should get you your desired output:
val df = Seq(
  Tuple1("X"),
  Tuple1("Y"),
  Tuple1("Z"),
  Tuple1("X"),
  Tuple1("Y"),
  Tuple1("Z")
).toDF("column_1")
import org.apache.spark.sql.expressions._
def windowSpec = Window.partitionBy("column_1").orderBy("column_1")
import org.apache.spark.sql.functions._
df.withColumn("grouper", row_number().over(windowSpec)).orderBy("grouper", "column_1").show(false)
which should give you
+--------+-------+
|column_1|grouper|
+--------+-------+
|X |1 |
|Y |1 |
|Z |1 |
|X |2 |
|Y |2 |
|Z |2 |
+--------+-------+
Note: the last orderBy is just there to match the expected output, purely for visualization. In a real cluster and processing pipeline, an orderBy like that doesn't make sense.
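If the partition-by trick doesn't fit your real data, a sketch of the running-counter idea from the MySQL snippet would be a cumulative sum over a window. This assumes there is (or you can add) a column that preserves the original row order; here a seq column built from monotonically_increasing_id() stands in for that assumption:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
// running count of the 'X' markers seen so far, in the original row order
val ordered = df.withColumn("seq", monotonically_increasing_id())
val wSum = Window.orderBy("seq").rowsBetween(Window.unboundedPreceding, Window.currentRow)
ordered.withColumn("grouper", sum(when(col("column_1") === "X", 1).otherwise(0)).over(wSum))
  .drop("seq")
  .show(false)
Note that a window without partitionBy pulls all rows into a single partition, so this only makes sense for modest data sizes.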

Convert Array of String column to multiple columns in spark scala

I have a DataFrame with the following schema:
id : int,
emp_details: Array(String)
Some sample data:
1, Array(empname=xxx,city=yyy,zip=12345)
2, Array(empname=bbb,city=bbb,zip=22345)
This data is in a DataFrame, and I need to read emp_details from the array and assign it to new columns as below, or split the array into multiple columns with the column names empname, city and zip:
.withColumn("empname", xxx)
.withColumn("city", yyy)
.withColumn("zip", 12345)
Could you please guide me on how to achieve this using Spark 1.6 with Scala?
Really appreciate your help...
Thanks a lot
You can use withColumn and split to get the required data:
df1.withColumn("empname", split($"emp_details" (0), "=")(1))
.withColumn("city", split($"emp_details" (1), "=")(1))
.withColumn("zip", split($"emp_details" (2), "=")(1))
Output:
+---+----------------------------------+-------+----+-----+
|id |emp_details |empname|city|zip |
+---+----------------------------------+-------+----+-----+
|1 |[empname=xxx, city=yyy, zip=12345]|xxx |yyy |12345|
|2 |[empname=bbb, city=bbb, zip=22345]|bbb |bbb |22345|
+---+----------------------------------+-------+----+-----+
UPDATE:
If you don't have a fixed sequence of data in the array, then you can use a UDF to convert it to a map and use it as:
val getColumnsUDF = udf((details: Seq[String]) => {
  val detailsMap = details.map(_.split("=")).map(x => (x(0), x(1))).toMap
  (detailsMap("empname"), detailsMap("city"), detailsMap("zip"))
})
Now use the UDF:
df1.withColumn("emp",getColumnsUDF($"emp_details"))
.select($"id", $"emp._1".as("empname"), $"emp._2".as("city"), $"emp._3".as("zip"))
.show(false)
Output:
+---+-------+----+-----+
|id |empname|city|zip  |
+---+-------+----+-----+
|1  |xxx    |yyy |12345|
|2  |bbb    |bbb |22345|
+---+-------+----+-----+
Hope this helps!
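As a UDF-free sketch of the same idea (assuming Spark 1.6+, where pivot is available), you could also explode the array, split each entry on '=', and pivot the keys back into columns:
import org.apache.spark.sql.functions._
// explode to (id, key, value) rows, then pivot the known keys into columns
df1.select($"id", explode($"emp_details").as("kv"))
  .select($"id", split($"kv", "=")(0).as("key"), split($"kv", "=")(1).as("value"))
  .groupBy("id")
  .pivot("key", Seq("empname", "city", "zip"))
  .agg(first("value"))
  .show(false)
Passing the Seq of keys to pivot just pins the resulting column order; dropping it lets Spark discover the keys at the cost of an extra pass over the data.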

Executing multiple spark queries and storing as dataframe

I have 3 Spark queries saved in a List - sqlQueries. The first two create global temporary views, and the third one runs against those temporary views and fetches some output.
I am able to run a single query using this:
val resultDF = spark.sql(sql)
Then I add partition information to this DataFrame object and save it.
In the case of multiple queries, I tried executing
sqlQueries.foreach(query => spark.sql(query))
How do I save the output of the third query while still running the other two queries first?
I have 3 queries just as an example; it can be any number.
You can write the last query as an insert statement to save the results into a table. You are executing the queries through foreach, which will execute them sequentially.
I am taking the query from your other question as a reference; it needs some modification, as explained in the global-temporary-view section of the SQL docs.
After modification, your query file should look like:
CREATE GLOBAL TEMPORARY VIEW VIEW_1 AS select a,b from abc
CREATE GLOBAL TEMPORARY VIEW VIEW_2 AS select a,b from global_temp.VIEW_1
select * from global_temp.VIEW_2
Then, answering this question: you can use foldLeft again so that all the queries are applied in order.
Let's say you have a DataFrame
+----+---+---+
|a |b |c |
+----+---+---+
|a |b |1 |
|adfs|df |2 |
+----+---+---+
And given the multi-line query file above, you can do the following:
df.createOrReplaceTempView("abc")
val sqlFile = "path to test.sql"
val queryList = scala.io.Source.fromFile(sqlFile).getLines().filterNot(_.isEmpty).toList
val finalresult = queryList.foldLeft(df)((tempdf, query) => sqlContext.sql(query))
finalresult.show(false)
which should give you
+----+---+
|a |b |
+----+---+
|a |b |
|adfs|df |
+----+---+