Filtering on a dataframe based on columns defined in a list - scala

I have a dataframe -
df
+----------+----+----+-------+-------+
| WEEK|DIM1|DIM2|T1_diff|T2_diff|
+----------+----+----+-------+-------+
|2016-04-02| 14|NULL| -5| 60|
|2016-04-30| 14| FR| 90| 4|
+----------+----+----+-------+-------+
I have defined a list as targetList
List(T1_diff, T2_diff)
I want to keep only the rows where every column in targetList is greater than 3. In this scenario the output should contain only the second row, as the first row has -5 in T1_diff. targetList can contain more columns; currently it has T1_diff and T2_diff, but if another column such as T3_diff is added, it should be handled automatically.
What is the best way to achieve this?

Suppose you have the following List of columns that you want to filter on, keeping only rows where each value is greater than 3.
val lst = List("T1_diff", "T2_diff")
Then you can build a condition String from these column names and pass that String to the where function.
val condition = lst.map(c => s"$c>3").mkString(" AND ")
df.where(condition).show(false)
For the above dataframe it will output only the second row.
+----------+----+----+-------+-------+
|WEEK      |DIM1|DIM2|T1_diff|T2_diff|
+----------+----+----+-------+-------+
|2016-04-30|14  |FR  |90     |4      |
+----------+----+----+-------+-------+
If you have another column, say T3_diff, you can add it to the List and it will be included in the filter condition.
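For the above list the generated condition string is T1_diff>3 AND T2_diff>3. If you prefer not to build a SQL string, here is a minimal sketch of the same filter using Column expressions and reduce (same assumption: every column in the list must be greater than 3):
import org.apache.spark.sql.functions.col

val lst = List("T1_diff", "T2_diff")
// build one predicate per column name and AND them together
val condition = lst.map(c => col(c) > 3).reduce(_ && _)
df.where(condition).show(false)
Note that reduce requires a non-empty list, so guard against an empty lst if that case can occur.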

Related

How to select a column based on value of another in Pyspark?

I have a dataframe, where some column special_column contains values like one, two. My dataframe also has columns one_processed and two_processed.
I would like to add a new column my_new_column whose values are taken from other columns of my dataframe, based on the value of special_column. For example, if special_column == one I would like my_new_column to be set to one_processed.
I tried .withColumn("my_new_column", F.col(F.concat(F.col("special_column"), F.lit("_processed")))), but Spark complains that I cannot parametrize F.col with a column.
How could I get the string value of the concatenation, so that I can select the desired column?
from pyspark.sql.functions import when, col, lit, concat_ws
sdf.withColumn("my_new_column", when(col("special_column") == "one", col("one_processed"))
               .otherwise(concat_ws("_", col("special_column"), lit("processed"))))
The easiest way in your case would be just a simple when/otherwise like:
>>> from pyspark.sql import functions as F
>>> df = spark.createDataFrame([(1, 2, "one"), (1, 2, "two")], ["one_processed", "two_processed", "special_column"])
>>> df.withColumn("my_new_column", F.when(F.col("special_column") == "one", F.col("one_processed")).otherwise(F.col("two_processed"))).show()
+-------------+-------------+--------------+-------------+
|one_processed|two_processed|special_column|my_new_column|
+-------------+-------------+--------------+-------------+
| 1| 2| one| 1|
| 1| 2| two| 2|
+-------------+-------------+--------------+-------------+
As far as I know there is no way to pick a column based on a value in the row, because the execution plan would then depend on the data.
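If there are many *_processed columns, the same chained when idea can be built from a list of candidate values instead of writing each branch by hand. A minimal Scala sketch (the rest of this thread is Scala; the candidate list below is an assumption, not taken from the question):
import org.apache.spark.sql.functions.{coalesce, col, when}

// assumed set of values that special_column may hold
val candidates = Seq("one", "two")
// for each candidate v: when special_column == v, take the v_processed column;
// coalesce picks the first non-null branch, i.e. the matching one
val myNewColumn = coalesce(candidates.map(v =>
  when(col("special_column") === v, col(s"${v}_processed"))): _*)
val result = df.withColumn("my_new_column", myNewColumn)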

Flatmap on Spark Dataframe in Scala

I have a Dataframe. I need to create one or more rows from each row in the dataframe. I am hoping flatMap could help me solve the problem. One or more rows would be created by applying logic to 2 columns of the row.
Example Input dataframe
+------+------+------+
|  Name|Float1|Float2|
+------+------+------+
|  Java|   2.3|   0.2|
|Python|   3.2|   0.5|
| Scala|   4.3|   0.8|
+------+------+------+
Logic:
If |Float1 + Float2| = |Float1| then one row is created.
e.g. 2.3 + 0.2 = 2.5, |2.5| = 2 and |2.3| = 2
If |Float1 + Float2| > |Float1| then two rows are created.
e.g. 4.3 + 0.8 = 5.1, |5.1| = 5 and |4.3| = 4
(Here |x| denotes the integer part of x.)
Can we solve this problem using flatMap or any other transformation in Spark?
Create a UDF that takes in the two columns and returns a list.
Once you have the list column, use the explode function on it, which will give you what you desire.
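A minimal Scala sketch of that approach, assuming the rule is to emit one element when the integer parts match and two when they differ (what the emitted values should actually contain is not specified in the question, so the sums below are placeholders):
import org.apache.spark.sql.functions.{col, explode, udf}

// return one or two elements depending on the integer parts of the two floats
// (the values emitted here are placeholders)
val expand = udf { (f1: Double, f2: Double) =>
  if ((f1 + f2).toInt == f1.toInt) Seq(f1 + f2) else Seq(f1, f1 + f2)
}
val exploded = df.withColumn("value", explode(expand(col("Float1"), col("Float2"))))
exploded.show()
Each input row becomes one output row per element of the returned list, which is the flatMap-like behaviour the question asks for.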

Spark SQL Dataframe API - build filter condition dynamically

I have two Spark dataframes, df1 and df2:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| ramesh| 1212| 29|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
+------+-----+---+-----+
| eName| eNo|age| city|
+------+-----+---+-----+
|aarush|12121| 15|malmo|
|ramesh| 1212| 29|malmo|
+------+-----+---+-----+
I need to get the non-matching records from df1, based on a number of columns which are specified in another file.
For example, the column lookup file is something like below:
df1col,df2col
name,eName
empNo, eNo
Expected output is:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
The question is how to build the where condition dynamically for the above scenario, because the lookup file is configurable and might have 1 to n fields.
You can use the except dataframe method. I'm assuming that the columns to use are in two lists for simplicity. It's necessary that the order of both lists is correct: the columns at the same position in the lists will be compared (regardless of column name). After except, use join to get the missing columns from the first dataframe.
import org.apache.spark.sql.functions.broadcast
import spark.implicits._ // for toDF

val df1 = Seq(("shankar","12121",28),("ramesh","1212",29),("suresh","1111",30),("aarush","0707",15))
.toDF("name", "empNo", "age")
val df2 = Seq(("aarush", "12121", 15, "malmo"),("ramesh", "1212", 29, "malmo"))
.toDF("eName", "eNo", "age", "city")
val df1Cols = List("name", "empNo")
val df2Cols = List("eName", "eNo")
val tempDf = df1.select(df1Cols.head, df1Cols.tail: _*)
.except(df2.select(df2Cols.head, df2Cols.tail: _*))
val df = df1.join(broadcast(tempDf), df1Cols)
The resulting dataframe will look as desired:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
| aarush| 0707| 15|
| suresh| 1111| 30|
|shankar|12121| 28|
+-------+-----+---+
If you're doing this from a SQL query I would remap the column names in the SQL query itself with something like Changing a SQL column title via query. You could do a simple text replace in the query to normalize them to the df1 or df2 column names.
Once you have that you can diff using something like
How to obtain the difference between two DataFrames?
If you need more columns that wouldn't be used in the diff (e.g. age) you can reselect the data again based on your diff results. This may not be the optimal way of doing it but it would probably work.
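If you want an explicitly built condition instead of except, you can zip the two column lists from the lookup file into a join condition and use a left_anti join. A sketch under the same assumptions as the answer above (df1Cols and df2Cols come from the lookup file):
// pair up the columns from the lookup file and AND the equalities together
val joinCondition = df1Cols.zip(df2Cols)
  .map { case (c1, c2) => df1(c1) === df2(c2) }
  .reduce(_ && _)
// left_anti keeps the df1 rows that have no match in df2, with all df1 columns
val nonMatching = df1.join(df2, joinCondition, "left_anti")
nonMatching.show()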

fetch more than 20 rows and display full value of column in spark-shell

I am using CassandraSQLContext from spark-shell to query data from Cassandra. I want to know two things: first, how to fetch more than 20 rows using CassandraSQLContext, and second, how to display the full value of a column. As you can see below, by default it appends dots to truncated string values.
Code :
val csc = new CassandraSQLContext(sc)
csc.setKeyspace("KeySpace")
val maxDF = csc.sql("SQL_QUERY")
maxDF.show
Output:
+--------------------+--------------------+-----------------+--------------------+
|                  id|                Col2|             Col3|                Col4|
+--------------------+--------------------+-----------------+--------------------+
|8wzloRMrGpf8Q3bbk...|              Value1|                X|                  K1|
|AxRfoHDjV1Fk18OqS...|              Value2|                Y|                  K2|
|FpMVRlaHsEOcHyDgy...|              Value3|                Z|                  K3|
|HERt8eFLRtKkiZndy...|              Value4|                U|                  K4|
|nWOcbbbm8ZOjUSNfY...|              Value5|                V|                  K5|
+--------------------+--------------------+-----------------+--------------------+
If you want to print the whole value of a column, in Scala you just need to set the truncate argument of the show method to false:
maxDF.show(false)
and if you wish to show more than 20 rows:
// example: show 30 rows of maxDF, untruncated
maxDF.show(30, false)
For pyspark, you'll need to specify the argument name :
maxDF.show(truncate = False)
Alternatively, you can use take, but you won't get a nice tabular form; instead the rows come back as Scala objects:
maxDF.take(50)
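For illustration, a sketch of what take returns (a local Array[Row] you can format yourself):
// take(50) brings at most 50 rows back to the driver as Array[Row]
val rows = maxDF.take(50)
// each Row prints as [value1,value2,...], not as a table
rows.foreach(println)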

Unexpected column values after the IN condition in where() method of dataframe in spark

Task: I want the value of the child_id column [which is generated using the withColumn() and monotonicallyIncreasingId() methods] corresponding to the family_id and id columns.
Let me explain steps to complete my task:
Step 1: add 2 columns to the dataframe: one with a unique id, named child_id, and another with the value 0, named parent_id.
Step 2: get all family_ids from the dataframe.
Step 3: get the dataframe of child_id and id, where id == family_id.
[Problem is here.]
def processFoHierarchical(param_df: DataFrame) {
  var dff = param_df.withColumn("child_id", monotonicallyIncreasingId() + 1)
  println("Something is not gud...")
  dff = dff.withColumn("parent_id", lit(0.toLong))
  dff.select("id", "family_id", "child_id").show() // Original dataframe.
  var family_ids = ""
  param_df.select("family_id").distinct().coalesce(1).collect()
    .map(x => family_ids = family_ids + "'" + x.getAs[String]("family_id") + "',")
  println(family_ids)
  var x: DataFrame = null
  if (family_ids.length() > 0) {
    family_ids = family_ids.substring(0, family_ids.length() - 1)
    val y = dff.where(" id IN (" + family_ids + ")").select("id", "family_id", "child_id")
    y.show() // here I am getting unexpected values.
  }
}
This is the output of my code. I am trying to get the child_id values as in the dataframe above, but I am not getting them.
Note: Using Spark with Scala.
+--------------------+--------------------+----------+
| id| family_id| child_id|
+--------------------+--------------------+----------+
|fe60c680-eb59-11e...|fe60c680-eb59-11e...| 4|
|8d9680a0-ec14-11e...|8d9680a0-ec14-11e...| 9|
|ff81457a-e9cf-11e...|ff81457a-e9cf-11e...| 5|
|4261cca0-f0e9-11e...|4261cca0-f0e9-11e...| 10|
|98c7dc00-f0e5-11e...|98c7dc00-f0e5-11e...| 8|
|dca16200-e462-11e...|dca16200-e462-11e...|8589934595|
|78be8950-ecca-11e...|ff81457a-e9cf-11e...| 1|
|4cc19690-e819-11e...|ff81457a-e9cf-11e...| 3|
|dca16200-e462-11e...|ff81457a-e9cf-11e...|8589934596|
|72dd0250-eff4-11e...|78be8950-ecca-11e...| 2|
|84ed0df0-e81a-11e...|78be8950-ecca-11e...| 6|
|78be8951-ecca-11e...|78be8950-ecca-11e...| 7|
|d1515310-e9ad-11e...|78be8951-ecca-11e...|8589934593|
|d1515310-e9ad-11e...|72dd0250-eff4-11e...|8589934594|
+--------------------+--------------------+----------+
'72dd0250-eff4-11e5-9ce9-5e5517507c66','dca16200-e462-11e5-90ec-c1cf090b354c','78be8951-ecca-11e5-a5f5-c1cf090b354c','4261cca0-f0e9-11e5-bbba-c1cf090b354c','98c7dc00-f0e5-11e5-bc76-c1cf090b354c','fe60c680-eb59-11e5-9582-c1cf090b354c','ff81457a-e9cf-11e5-9ce9-5e5517507c66','8d9680a0-ec14-11e5-a94f-c1cf090b354c','78be8950-ecca-11e5-a5f5-c1cf090b354c',
+--------------------+--------------------+-----------+
| id| family_id| child_id|
+--------------------+--------------------+-----------+
|fe60c680-eb59-11e...|fe60c680-eb59-11e...| 1|
|ff81457a-e9cf-11e...|ff81457a-e9cf-11e...| 2|
|98c7dc00-f0e5-11e...|98c7dc00-f0e5-11e...| 3|
|8d9680a0-ec14-11e...|8d9680a0-ec14-11e...| 4|
|4261cca0-f0e9-11e...|4261cca0-f0e9-11e...| 5|
|dca16200-e462-11e...|dca16200-e462-11e...| 6|
|78be8950-ecca-11e...|ff81457a-e9cf-11e...| 8589934593|
|dca16200-e462-11e...|ff81457a-e9cf-11e...| 8589934594|
|72dd0250-eff4-11e...|78be8950-ecca-11e...|17179869185|
|78be8951-ecca-11e...|78be8950-ecca-11e...|17179869186|
+--------------------+--------------------+-----------+
I know that it doesn't produce consecutive values; those values depend on the partitions. By unexpected values I mean (see the 2nd dataframe) that those child_ids are supposed to come from the previous dataframe where family_id = id, and I am using IN to match multiple ids. But the child_id column contains no values from the dataframe above; instead it looks as if a new child_id column was created with monotonicallyIncreasingId().
See the last 2 values in the 2nd dataframe: they don't belong to the dataframe above. So where are they coming from? I am not applying monotonicallyIncreasingId() again on the dataframe, so why does the child_id column look as if monotonicallyIncreasingId() had been applied again?
However, the problem is not with the Spark DataFrame. When monotonicallyIncreasingId() is used on a DataFrame, it generates new ids every time the DataFrame is evaluated, e.g. on each DataFrame.show().
If we need to generate the ids once and refer to the same ids elsewhere in the code, we need to cache the DataFrame.
In your case, you need to cache the DataFrame after Step 1 so it will not create different child_id values every time show() is called.
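A minimal sketch of that fix, reusing the names from the question (monotonicallyIncreasingId() as written there; newer Spark versions call it monotonically_increasing_id()):
import org.apache.spark.sql.functions.{lit, monotonicallyIncreasingId}

val dff = param_df
  .withColumn("child_id", monotonicallyIncreasingId() + 1)
  .withColumn("parent_id", lit(0L))
  .cache() // keep the generated ids so later actions reuse them instead of recomputing
dff.select("id", "family_id", "child_id").show() // first action materializes the cache
// subsequent filters on dff (e.g. the IN condition) now see the same child_id values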