Spark Scala - Bitwise operation in filter - scala

I have a source dataset with columns col1 and col2, where the col2 values are aggregated with a bitwise OR operation. I need to filter on col2 to select the records in which any of the bits 8, 4, or 2 are set.
initial source raw data
val SourceRawData = Seq(
  ("Flag1", 2), ("Flag1", 4), ("Flag1", 8),
  ("Flag2", 8), ("Flag2", 16), ("Flag2", 32),
  ("Flag3", 2), ("Flag4", 32),
  ("Flag5", 2), ("Flag5", 8)
).toDF("col1", "col2")
SourceRawData.show()
+-----+----+
| col1|col2|
+-----+----+
|Flag1| 2|
|Flag1| 4|
|Flag1| 8|
|Flag2| 8|
|Flag2| 16|
|Flag2| 32|
|Flag3| 2|
|Flag4| 32|
|Flag5| 2|
|Flag5| 8|
+-----+----+
Aggregated source data, based on SourceRawData above, after collapsing the col1 values to a single row per col1 value; the col2 values are aggregated with a bitwise OR. This dataset is provided by another team. Note that I am showing the output here, not the actual aggregation logic.
val AggregatedSourceData = Seq(
  ("Flag1", 14L), ("Flag2", 56L), ("Flag3", 2L),
  ("Flag4", 32L), ("Flag5", 10L)
).toDF("col1", "col2")
AggregatedSourceData.show()
+-----+----+
| col1|col2|
+-----+----+
|Flag1| 14|
|Flag2| 56|
|Flag3| 2|
|Flag4| 32|
|Flag5| 10|
+-----+----+
Now I need to filter the aggregated dataset above to get the rows whose col2 values have any of the 8, 4, or 2 bits set, as per the original source raw data values.
expected output
+-----+----+
|Col1 |Col2|
+-----+----+
|Flag1|14 |
|Flag2|56 |
|Flag3|2 |
|Flag5|10 |
+-----+----+
I tried something like the code below and seem to be getting the expected output, but I am unable to understand how it works. Is this the correct approach? If so, how does it work? (I am not that knowledgeable in bitwise operations, so I am looking for expert help to understand, please.)
val myfilter: Long = 2 | 4 | 8
AggregatedSourceData.filter($"col2".bitwiseAND(myfilter) =!= 0).show()
+-----+----+
| col1|col2|
+-----+----+
|Flag1| 14|
|Flag2| 56|
|Flag3| 2|
|Flag5| 10|
+-----+----+
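To see why this works, the same arithmetic can be sketched in plain Python (a minimal illustration of what `bitwiseAND` computes, independent of Spark; the values are taken from the aggregated dataset above):

```python
# The filter mask: bits for 2, 4 and 8 (binary 1110, decimal 14).
mask = 2 | 4 | 8

# col2 & mask is non-zero exactly when col2 has at least one of
# the bits 2, 4 or 8 set.
rows = {"Flag1": 14, "Flag2": 56, "Flag3": 2, "Flag4": 32, "Flag5": 10}
kept = [name for name, col2 in rows.items() if col2 & mask != 0]
# kept -> ['Flag1', 'Flag2', 'Flag3', 'Flag5']
# Flag4 is dropped: 32 is 0b100000, which shares no bits with 0b001110.
```

So `=!= 0` keeps a row whenever at least one wanted bit survives the AND, which is exactly the "any of 8, 4, 2 is on" condition.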

I think you do not need to use bitwiseAND to filter; instead, just use contains/in with a set of the decimal representations of the bit numbers you want, or == to the decimal representation of a single bit number you want.
Also, if you try your existing calculations without Scala or Spark, you will see where you misunderstood things, e.g. use a binary calculator:
https://www.rapidtables.com/calc/math/binary-calculator.html
You will find you defined your filter "wrong".
18 & 18 is 18
18 | 2 is 18
In your dataset, each row of the flag column will hold only one value, so just filter the flag column for values in the set of numbers you want.
$"flag" === 18, or
$"flag".isin(18, 2, 20), for example.
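For contrast, a quick plain-Python check of both ideas shows where set membership and the bitmask test diverge once values have been OR-aggregated, which is why the question's aggregated dataset needs bitwiseAND:

```python
wanted = {2, 4, 8}        # the bit values of interest
mask = 2 | 4 | 8          # the same values as a bitmask (14)

# On a raw single-flag value, membership and the bitmask agree:
single = 8
assert (single in wanted) == (single & mask != 0)

# On an OR-aggregated value like 14 (= 2|4|8), membership fails
# while the bitmask still detects that wanted bits are set:
aggregated = 14
assert aggregated not in wanted
assert aggregated & mask != 0
```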

Related

PySpark: Group by two columns, count the pairs, and divide the average of two different columns

I have a dataframe with several columns, some of which are labeled PULocationID, DOLocationID, total_amount, and trip_distance. I'm trying to group by both PULocationID and DOLocationID, then count each combination into a column called "count". I also need to take the averages of total_amount and trip_distance and divide them into a column called "trip_rate". The end DF should be:
+------------+------------+-----+---------+
|PULocationID|DOLocationID|count|trip_rate|
+------------+------------+-----+---------+
|         123|         422|    1|   5.2435|
|           3|          27|    4|   6.6121|
+------------+------------+-----+---------+
Where (123,422) are paired together once for a trip rate of $5.24 and (3, 27) are paired together 4 times where the trip rate is $6.61.
Through reading some other threads, I'm able to group by the locations and count them using the below:
df.groupBy("PULocationID", 'DOLocationID').agg(count(lit(1)).alias("count")).show()
OR I can group by the locations and get the averages of the two columns I need using the below:
df.groupBy("PULocationID", 'DOLocationID').agg({'total_amount':'avg', 'trip_distance':'avg'}).show()
I tried a couple of things to get the trip_rate, but neither worked:
df.withColumn("trip_rate", (pyspark.sql.functions.col("total_amount") / pyspark.sql.functions.col("trip_distance")))
df.withColumn("trip_rate", df.total_amount/sum(df.trip_distance))
I also can't figure out how to combine the two queries that work (i.e. count of locations + averages).
Using this as an example input DataFrame:
+------------+------------+------------+-------------+
|PULocationID|DOLocationID|total_amount|trip_distance|
+------------+------------+------------+-------------+
| 123| 422| 10.487| 2|
| 3| 27| 19.8363| 3|
| 3| 27| 13.2242| 2|
| 3| 27| 6.6121| 1|
| 3| 27| 26.4484| 4|
+------------+------------+------------+-------------+
You can chain together the groupBy, agg, and select (you could also use withColumn and drop if you only need the 4 columns).
import pyspark.sql.functions as F
new_df = df.groupBy(
    "PULocationID",
    "DOLocationID",
).agg(
    F.count(F.lit(1)).alias("count"),
    F.avg(F.col("total_amount")).alias("avg_amt"),
    F.avg(F.col("trip_distance")).alias("avg_distance"),
).select(
    "PULocationID",
    "DOLocationID",
    "count",
    (F.col("avg_amt") / F.col("avg_distance")).alias("trip_rate"),
)
new_df.show()
+------------+------------+-----+-----------------+
|PULocationID|DOLocationID|count| trip_rate|
+------------+------------+-----+-----------------+
| 123| 422| 1| 5.2435|
| 3| 27| 4|6.612100000000001|
+------------+------------+-----+-----------------+
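The trip_rate values can be sanity-checked with plain Python arithmetic on the (3, 27) group from the example input (average amount divided by average distance):

```python
# Four trips for the (PULocationID=3, DOLocationID=27) group above.
amounts = [19.8363, 13.2242, 6.6121, 26.4484]
distances = [3, 2, 1, 4]

avg_amount = sum(amounts) / len(amounts)        # 16.53025
avg_distance = sum(distances) / len(distances)  # 2.5
trip_rate = avg_amount / avg_distance           # ~6.6121
```

The tiny trailing digits in Spark's 6.612100000000001 are just floating-point rounding.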

From many pyspark columns (with certain condition) to one column with all the conditions combined. PYSPARK

I have a Python list of PySpark columns, each containing a condition. I want just one column that combines all the conditions I have in the list.
I've tried using the sum() operation to combine all the columns, but it didn't work (obviously). I've also been checking the documentation https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html
But nothing seemed to work for me.
I'm doing something like this:
my_condition_list = [col(c).isNotNull() for c in some_of_my_sdf_columns]
That returns a list of different PySpark columns; I want just one with all the conditions combined with the | operator, so I can use it in a .filter() or .when() clause.
THANK YOU
PySpark won't accept a list as a where/filter condition; it accepts either a string or a Column condition.
The way you tried wouldn't work; you need to tweak certain things. Below are 2 approaches for this -
data = [("ID1", 3, None), ("ID2", 4, 12), ("ID3", None, 3)]
df = spark.createDataFrame(data, ["ID", "colA", "colB"])
df.show()
from pyspark.sql import functions as F
Way 1
# change df_name below if your dataframe variable has another name
df_name = "df"
my_condition_list = ["%s['%s'].isNotNull()"%(df_name, c) for c in df.columns]
print (my_condition_list[0])
"df['ID'].isNotNull()"
print (" & ".join(my_condition_list))
"df['ID'].isNotNull() & df['colA'].isNotNull() & df['colB'].isNotNull()"
print (eval(" & ".join(my_condition_list)))
Column<b'(((ID IS NOT NULL) AND (colA IS NOT NULL)) AND (colB IS NOT NULL))'>
df.filter(eval(" & ".join(my_condition_list))).show()
+---+----+----+
| ID|colA|colB|
+---+----+----+
|ID2| 4| 12|
+---+----+----+
df.filter(eval(" | ".join(my_condition_list))).show()
+---+----+----+
| ID|colA|colB|
+---+----+----+
|ID1| 3|null|
|ID2| 4| 12|
|ID3|null| 3|
+---+----+----+
Way 2
my_condition_list = ["%s is not null"%c for c in df.columns]
print (my_condition_list[0])
'ID is not null'
print (" and ".join(my_condition_list))
'ID is not null and colA is not null and colB is not null'
df.filter(" and ".join(my_condition_list)).show()
+---+----+----+
| ID|colA|colB|
+---+----+----+
|ID2| 4| 12|
+---+----+----+
df.filter(" or ".join(my_condition_list)).show()
+---+----+----+
| ID|colA|colB|
+---+----+----+
|ID1| 3|null|
|ID2| 4| 12|
|ID3|null| 3|
+---+----+----+
The preferred way is way 2, since it avoids eval.
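A third option worth knowing (not shown above, but a common pattern) is to fold the list with functools.reduce; pyspark Column objects overload | and &, so the same fold works on them directly, without any eval. Plain booleans stand in for Column conditions here:

```python
from functools import reduce
import operator

# Stand-ins for a list of pyspark Column conditions such as
# [col(c).isNotNull() for c in df.columns].
conditions = [True, False, False]

# With pyspark you would write (df / my_condition_list as in the question):
#   df.filter(reduce(operator.or_, my_condition_list))
# because Column overloads `|` and `&` just like bools do here.
any_ok = reduce(operator.or_, conditions)   # "or" across the whole list
all_ok = reduce(operator.and_, conditions)  # "and" across the whole list
# any_ok -> True, all_ok -> False
```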

How to convert numerical values to a categorical variable using pyspark

I have a pyspark dataframe with a range of numerical variables.
For example, my dataframe has a column with values from 1 to 100:
1-10 - group1 <== column values from 1 to 10 should get group1 as the value
11-20 - group2
...
91-100 - group10
How can I achieve this using a pyspark dataframe?
# Creating an arbitrary DataFrame
df = spark.createDataFrame([(1,54),(2,7),(3,72),(4,99)], ['ID','Var'])
df.show()
+---+---+
| ID|Var|
+---+---+
| 1| 54|
| 2| 7|
| 3| 72|
| 4| 99|
+---+---+
Once the DataFrame has been created, we use the floor() function to find the integral part of a number; for example, floor(15.5) is 15. We subtract 1 from Var before dividing by 10, so that exact multiples of 10 (10, 20, ...) stay in the lower group, and then add 1 because the group numbering starts from 1, as opposed to 0. Finally, we need to prepend group to the value. Concatenation can be achieved with the concat() function, but keep in mind that since the prepended word group is not a column, we need to put it inside lit(), which creates a column from a literal value.
# Requisite packages needed
from pyspark.sql.functions import col, floor, lit, concat
df = df.withColumn('Var', concat(lit('group'), (1 + floor((col('Var') - 1) / 10))))
df.show()
+---+-------+
| ID| Var|
+---+-------+
| 1| group6|
| 2| group1|
| 3| group8|
| 4|group10|
+---+-------+
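The same bucketing can be checked in plain Python. Note the boundary detail: with a plain floor(v/10) + 1 a value of exactly 10 would land in group2, so subtracting 1 before dividing is what keeps exact multiples of 10 in the lower group, as the 1-10 -> group1 spec requires:

```python
import math

def bucket(v):
    # 1-10 -> group1, 11-20 -> group2, ..., 91-100 -> group10.
    # Subtracting 1 keeps exact multiples of 10 in the lower group.
    return "group%d" % (math.floor((v - 1) / 10) + 1)

labels = [bucket(v) for v in (54, 7, 72, 99)]
# labels -> ['group6', 'group1', 'group8', 'group10']
```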

How to add a new column with maximum value?

I have a Dataframe with 2 columns tag and value.
I want to add a new column that contains the max of value column. (It will be the same value for every row).
I tried to do something as follows, but it didn't work.
val df2 = df.withColumn("max",max($"value"))
How to add the max column to the dataset?
There are 3 ways to do it (one you already know from the other answer). I avoid collect since it's not really needed.
Here is the dataset with the maximum value 3 appearing twice.
val tags = Seq(
  ("tg1", 1), ("tg2", 2), ("tg1", 3), ("tg4", 4), ("tg3", 3)
).toDF("tag", "value")
scala> tags.show
+---+-----+
|tag|value|
+---+-----+
|tg1| 1|
|tg2| 2|
|tg1| 3| <-- maximum value
|tg4| 4|
|tg3| 3| <-- another maximum value
+---+-----+
Cartesian Join With "Max" Dataset
I'm going to use a cartesian join of the tags and a single-row dataset with the maximum value.
val maxDF = tags.select(max("value") as "max")
scala> maxDF.show
+---+
|max|
+---+
| 4|
+---+
val solution = tags.crossJoin(maxDF)
scala> solution.show
+---+-----+---+
|tag|value|max|
+---+-----+---+
|tg1| 1| 4|
|tg2| 2| 4|
|tg1| 3| 4|
|tg4| 4| 4|
|tg3| 3| 4|
+---+-----+---+
I'm not worried about the cartesian join here since it's just a single-row dataset.
Windowed Aggregation
My favorite windowed aggregation fits this problem nicely. On the other hand, I don't really think it would be the most efficient approach, due to the number of partitions in use, i.e. just 1, which gives the worst possible parallelism.
The trick is to use the aggregation function max over an empty window specification that informs Spark SQL to use all rows in any order.
val solution = tags.withColumn("max", max("value") over ())
scala> solution.show
18/05/31 21:59:40 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+---+-----+---+
|tag|value|max|
+---+-----+---+
|tg1| 1| 4|
|tg2| 2| 4|
|tg1| 3| 4|
|tg4| 4| 4|
|tg3| 3| 4|
+---+-----+---+
Please note the warning that says it all.
WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
I would not use this approach given the other solutions and am leaving it here for educational purposes.
If you want the maximum value of a column across all rows, you are going to need to compare all the rows in some form. That means doing an aggregation. withColumn only operates on a single row, so you have no way to get the DataFrame-wide max value from it alone.
The easiest way to do this is like below:
val data = Seq(("a", 1), ("b", 2), ("c", 3), ("d", 4))
val df = sc.parallelize(data).toDF("name", "value")
// first is an action, so this will execute spark stages to compute the value
val maxValue = df.groupBy().agg(max($"value")).first.getInt(0)
// Now you can add it to your original DF
val updatedDF = df.withColumn("max", lit(maxValue))
updatedDF.show
There is also one alternative to this that might be a little faster. If you don't need the max value until the end of your processing (after you have already run a Spark action), you can compute it by writing your own Spark Accumulator instead, which gathers the value while doing whatever other Spark action work you have requested.
Max column value as additional column by window function
val tags = Seq(
  ("tg1", 1), ("tg2", 2), ("tg1", 3), ("tg4", 4), ("tg3", 3)
).toDF("tag", "value")
scala> tags.show
+---+-----+
|tag|value|
+---+-----+
|tg1| 1|
|tg2| 2|
|tg1| 3|
|tg4| 4|
|tg3| 3|
+---+-----+
scala> import org.apache.spark.sql.expressions.Window
scala> tags.withColumn("max", max("value").over(Window.partitionBy(lit("1")))).show
+---+-----+---+
|tag|value|max|
+---+-----+---+
|tg1| 1| 4|
|tg2| 2| 4|
|tg1| 3| 4|
|tg4| 4| 4|
|tg3| 3| 4|
+---+-----+---+
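Stripped of Spark, the "aggregate once, then attach the constant" idea behind the cross-join and first/lit approaches is just this:

```python
rows = [("tg1", 1), ("tg2", 2), ("tg1", 3), ("tg4", 4), ("tg3", 3)]

# Step 1: one global aggregation (what max("value") / first() computes).
max_value = max(value for _, value in rows)   # 4

# Step 2: attach the constant to every row (what lit(maxValue) does).
with_max = [(tag, value, max_value) for tag, value in rows]
```

The window-function variants do both steps in one pass, at the cost of pulling all rows into a single partition.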

How to merge duplicate rows using expressions in Spark Dataframes

How can I merge 2 dataframes, removing duplicates by comparing columns?
I have two dataframes with the same column names.
a.show()
+-----+----------+--------+
| name| date|duration|
+-----+----------+--------+
| bob|2015-01-13| 4|
|alice|2015-04-23| 10|
+-----+----------+--------+
b.show()
+------+----------+--------+
| name| date|duration|
+------+----------+--------+
| bob|2015-01-12| 3|
|alice2|2015-04-13| 10|
+------+----------+--------+
What I am trying to do is merge the 2 dataframes to display only unique rows, by applying two conditions:
1. For the same name, the duration will be the sum of the durations.
2. For the same name, the final date will be the latest date.
Final output will be
final.show()
+------+----------+--------+
|  name|      date|duration|
+------+----------+--------+
|   bob|2015-01-13|       7|
| alice|2015-04-23|      10|
|alice2|2015-04-13|      10|
+------+----------+--------+
I followed the following method.
//Take union of 2 dataframe
val df =a.unionAll(b)
//group and take sum
val grouped =df.groupBy("name").agg($"name",sum("duration"))
//join
val j=df.join(grouped,"name").drop("duration").withColumnRenamed("sum(duration)", "duration")
and I got
+------+----------+--------+
| name| date|duration|
+------+----------+--------+
| bob|2015-01-13| 7|
| alice|2015-04-23| 10|
| bob|2015-01-12| 7|
|alice2|2015-04-13|      10|
+------+----------+--------+
How can I now remove duplicates by comparing dates?
Would it be possible by running SQL queries after registering it as a table?
I am a beginner in Spark SQL, and I feel like my way of approaching this problem is weird. Is there a better way to do this kind of data processing?
You can do max(date) in the groupBy(); there is no need to join grouped with df.
// In 1.3.x, in order for the grouping column "name" to show up,
val grouped = df.groupBy("name").agg($"name",sum("duration"), max("date"))
// In 1.4+, grouping column "name" is included automatically.
val grouped = df.groupBy("name").agg(sum("duration"), max("date"))
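The per-name sum/max aggregation can be sketched in plain Python (ISO-formatted date strings compare correctly with max):

```python
# The four rows of a.unionAll(b) from the question.
rows = [("bob", "2015-01-13", 4), ("alice", "2015-04-23", 10),
        ("bob", "2015-01-12", 3), ("alice2", "2015-04-13", 10)]

agg = {}
for name, date, duration in rows:
    total, latest = agg.get(name, (0, ""))
    # sum("duration") and max("date") per name, as in the groupBy/agg.
    agg[name] = (total + duration, max(latest, date))
# agg["bob"] -> (7, "2015-01-13")
```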