Unexpected column values after an IN condition in the where() method of a DataFrame in Spark (Scala)

Task: I want the value of the child_id column [which is generated using the withColumn() and monotonicallyIncreasingId() methods] corresponding to the family_id and id columns.
Let me explain the steps of my task:
Step 1: Add 2 columns to the dataframe: one with a unique id, named child_id, and another with the value 0, named parent_id.
Step 2: Get all family_ids from the dataframe.
Step 3: Get a dataframe of child_id and id, where id == family_id.
[The problem is here.]
def processFoHierarchical(param_df: DataFrame) {
  var dff = param_df.withColumn("child_id", monotonicallyIncreasingId() + 1)
  println("Something is not gud...")
  dff = dff.withColumn("parent_id", lit(0.toLong))
  dff.select("id", "family_id", "child_id").show() // Original dataframe.

  var family_ids = ""
  param_df.select("family_id").distinct().coalesce(1).collect()
    .map(x => family_ids = family_ids + "'" + x.getAs[String]("family_id") + "',")
  println(family_ids)

  var x: DataFrame = null
  if (family_ids.length() > 0) {
    family_ids = family_ids.substring(0, family_ids.length() - 1)
    val y = dff.where(" id IN (" + family_ids + ")").select("id", "family_id", "child_id")
    y.show() // Here I am getting unexpected values.
  }
}
This is the output of my code. I am trying to get the child_id values as they appear in the original dataframe, but I am not getting them.
Note: Using Spark with Scala.
+--------------------+--------------------+----------+
| id| family_id| child_id|
+--------------------+--------------------+----------+
|fe60c680-eb59-11e...|fe60c680-eb59-11e...| 4|
|8d9680a0-ec14-11e...|8d9680a0-ec14-11e...| 9|
|ff81457a-e9cf-11e...|ff81457a-e9cf-11e...| 5|
|4261cca0-f0e9-11e...|4261cca0-f0e9-11e...| 10|
|98c7dc00-f0e5-11e...|98c7dc00-f0e5-11e...| 8|
|dca16200-e462-11e...|dca16200-e462-11e...|8589934595|
|78be8950-ecca-11e...|ff81457a-e9cf-11e...| 1|
|4cc19690-e819-11e...|ff81457a-e9cf-11e...| 3|
|dca16200-e462-11e...|ff81457a-e9cf-11e...|8589934596|
|72dd0250-eff4-11e...|78be8950-ecca-11e...| 2|
|84ed0df0-e81a-11e...|78be8950-ecca-11e...| 6|
|78be8951-ecca-11e...|78be8950-ecca-11e...| 7|
|d1515310-e9ad-11e...|78be8951-ecca-11e...|8589934593|
|d1515310-e9ad-11e...|72dd0250-eff4-11e...|8589934594|
+--------------------+--------------------+----------+
'72dd0250-eff4-11e5-9ce9-5e5517507c66','dca16200-e462-11e5-90ec-c1cf090b354c','78be8951-ecca-11e5-a5f5-c1cf090b354c','4261cca0-f0e9-11e5-bbba-c1cf090b354c','98c7dc00-f0e5-11e5-bc76-c1cf090b354c','fe60c680-eb59-11e5-9582-c1cf090b354c','ff81457a-e9cf-11e5-9ce9-5e5517507c66','8d9680a0-ec14-11e5-a94f-c1cf090b354c','78be8950-ecca-11e5-a5f5-c1cf090b354c',
+--------------------+--------------------+-----------+
| id| family_id| child_id|
+--------------------+--------------------+-----------+
|fe60c680-eb59-11e...|fe60c680-eb59-11e...| 1|
|ff81457a-e9cf-11e...|ff81457a-e9cf-11e...| 2|
|98c7dc00-f0e5-11e...|98c7dc00-f0e5-11e...| 3|
|8d9680a0-ec14-11e...|8d9680a0-ec14-11e...| 4|
|4261cca0-f0e9-11e...|4261cca0-f0e9-11e...| 5|
|dca16200-e462-11e...|dca16200-e462-11e...| 6|
|78be8950-ecca-11e...|ff81457a-e9cf-11e...| 8589934593|
|dca16200-e462-11e...|ff81457a-e9cf-11e...| 8589934594|
|72dd0250-eff4-11e...|78be8950-ecca-11e...|17179869185|
|78be8951-ecca-11e...|78be8950-ecca-11e...|17179869186|
+--------------------+--------------------+-----------+
I know that it doesn't produce consecutive values; those values depend on the partitions. By unexpected values (see the 2nd dataframe) I mean that those child_ids were supposed to come from the dataframe above, where family_id = id (I am using IN to match multiple ids). Instead, the child_id column has none of the values from the dataframe above; it is as if a new child_id column were being created with monotonicallyIncreasingId().
See that the last 2 values in the 2nd dataframe do not belong to the dataframe above. So where are they coming from? I am not applying monotonicallyIncreasingId() to the dataframe again, so why does that column (child_id) look as if monotonicallyIncreasingId() had been applied again?

However, the problem is not with the Spark DataFrame. When monotonicallyIncreasingId() is used on a DataFrame, it generates new ids each time the DataFrame is evaluated, e.g. on every DataFrame.show().
If we need to generate the ids once and refer to the same ids elsewhere in the code, we need to DataFrame.cache().
In your case, you need to cache the DataFrame after Step 1, so child_id is not regenerated on every show().
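A minimal sketch of that fix, assuming the same param_df as in the question (only the cache() call is new; everything else stays as in the original code):
import org.apache.spark.sql.functions.{lit, monotonicallyIncreasingId}

// Step 1: add the generated columns, then cache so the ids are computed only once.
var dff = param_df.withColumn("child_id", monotonicallyIncreasingId() + 1)
dff = dff.withColumn("parent_id", lit(0.toLong))
dff.cache() // without this, every action (e.g. each show()) re-evaluates the plan and regenerates child_id
dff.select("id", "family_id", "child_id").show() // the ids shown here are now the same ones the later IN filter sees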

Related

Pyspark Rename column based on column position

How do I rename the 3rd column of a dataframe in PySpark? I want to refer to the column by its index rather than its actual name.
Here is my attempt:
df
Col1 Col2 jfdklajfklfj
A B 2
df.withColumnRenamed([3], 'Row_Count')
Since Python indexing starts at 0, you can index the df.columns list by subtracting 1:
index_of_col = 3
df.withColumnRenamed(df.columns[index_of_col-1],'Row_Count').show()
+----+----+---------+
|Col1|Col2|Row_Count|
+----+----+---------+
| A| B| 2|
+----+----+---------+

Spark dataframe replace values of specific columns in a row with Nulls

I am facing a problem when trying to replace the values of specific columns of a Spark dataframe with nulls.
I have a dataframe with more than fifty columns of which two are key columns. I want to create a new dataframe with same schema and the new dataframe should have values from the key columns and null values in non-key columns.
I tried the following ways but facing issues:
//old_df is the existing Dataframe
val key_cols = List("id", "key_number")
val non_key_cols = old_df.columns.toList.filterNot(key_cols.contains(_))
val key_col_df = old_df.select(key_cols.head, key_cols.tail:_*)
val non_key_cols_df = old_df.select(non_key_cols.head, non_key_cols.tail:_*)
val list_cols = List.fill(non_key_cols_df.columns.size)("NULL")
val rdd_list_cols = spark.sparkContext.parallelize(Seq(list_cols)).map(l => Row(l:_*))
val list_df = spark.createDataFrame(rdd_list_cols, non_key_cols_df.schema)
val new_df = key_col_df.crossJoin(list_df)
This approach works when I only have string type columns in old_df. But I have some columns of double type and int type, which throws an error because the rdd is a list of "NULL" strings.
To avoid this I tried making list_df an empty dataframe with the same schema as non_key_cols_df, but the result of the crossJoin is an empty dataframe, which I believe is because one of the dataframes is empty.
My requirement is to have the non_key_cols as a single-row dataframe of nulls so that I can crossJoin it with key_col_df and form the required new_df.
Any other, easier way to set all columns except the key columns of a dataframe to null would also resolve my issue. Thanks in advance.
crossJoin is an expensive operation so you want to avoid it if possible.
An easier solution is to iterate over all non-key columns and set each one to null with lit(null). Using foldLeft this can be done as follows:
val keyCols = List("id", "key_number")
val nonKeyCols = df.columns.filterNot(keyCols.contains(_))
val df2 = nonKeyCols.foldLeft(df)((df, c) => df.withColumn(c, lit(null)))
Input example:
+---+----------+---+----+
| id|key_number| c| d|
+---+----------+---+----+
| 1| 2| 3| 4.0|
| 5| 6| 7| 8.0|
| 9| 10| 11|12.0|
+---+----------+---+----+
will give:
+---+----------+----+----+
| id|key_number| c| d|
+---+----------+----+----+
| 1| 2|null|null|
| 5| 6|null|null|
| 9| 10|null|null|
+---+----------+----+----+
Shaido's answer has a small drawback: the column types will be lost. This can be fixed by using the schema, like this:
val nonKeyCols = df.schema.fields.filterNot(f => keyCols.contains(f.name))
val df2 = nonKeyCols.foldLeft(df)((df, c) => df.withColumn(c.name, lit(null).cast(c.dataType)))
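For completeness, a short sketch of how the example input above might be built and the type-preserving variant checked in a spark-shell session (the column types are assumptions based on the values shown):
import org.apache.spark.sql.functions.lit
import spark.implicits._

val keyCols = List("id", "key_number")
val df = Seq(
  (1, 2, 3, 4.0),
  (5, 6, 7, 8.0),
  (9, 10, 11, 12.0)
).toDF("id", "key_number", "c", "d")

// Null out the non-key columns while keeping each column's original data type.
val nonKeyFields = df.schema.fields.filterNot(f => keyCols.contains(f.name))
val df2 = nonKeyFields.foldLeft(df)((acc, f) => acc.withColumn(f.name, lit(null).cast(f.dataType)))
df2.printSchema() // c stays IntegerType, d stays DoubleType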

dynamically pass arguments to function in scala

I have records as strings with 1000 comma-delimited fields in a dataframe, like:
"a,b,c,d,e.......upto 1000" -1st record
"p,q,r,s,t ......upto 1000" - 2nd record
I am using the solution suggested in this Stack Overflow question:
Split 1 column into 3 columns in spark scala
df.withColumn("_tmp", split($"columnToSplit", "\\.")).select($"_tmp".getItem(0).as("col1"),$"_tmp".getItem(1).as("col2"),$"_tmp".getItem(2).as("col3")).drop("_tmp")
However, in my case I have 1000 columns, which I have in a JSON schema and can retrieve like this:
val column_seq: Seq[String] = Schema_func.map(_.name)
for (i <- 0 to column_seq.length - 1) { println(i + " " + column_seq(i)) }
which prints:
0 col1
1 col2
2 col3
3 col4
Now I need to pass all these indexes and column names to the DataFrame expression below:
df.withColumn("_tmp", split($"columnToSplit", "\\."))
  .select($"_tmp".getItem(0).as("col1"),
          $"_tmp".getItem(1).as("col2"),
          $"_tmp".getItem(2).as("col3"))
  .drop("_tmp")
specifically in
$"_tmp".getItem(0).as("col1"), $"_tmp".getItem(1).as("col2"),
As I can't write out such a long statement with all 1000 columns, is there an effective way to pass all these arguments from the above-mentioned JSON schema to the select function, so that I can split the columns, add the headers, and then convert the DF to parquet?
You can build a series of org.apache.spark.sql.Column, where each one is the result of selecting the right item and has the right name, and then select these columns:
val columns: Seq[Column] = Schema_func.map(_.name)
  .zipWithIndex // attach index to names
  .map { case (name, index) => $"_tmp".getItem(index) as name }

val result = df
  .withColumn("_tmp", split($"columnToSplit", "\\."))
  .select(columns: _*)
For example, for this input:
case class A(name: String)
val Schema_func = Seq(A("c1"), A("c2"), A("c3"), A("c4"), A("c5"))
val df = Seq("a.b.c.d.e").toDF("columnToSplit")
The result would be:
// +---+---+---+---+---+
// | c1| c2| c3| c4| c5|
// +---+---+---+---+---+
// | a| b| c| d| e|
// +---+---+---+---+---+
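For the final step the question mentions (converting the DF to parquet), a one-line sketch assuming the result DataFrame from above; the output path is hypothetical:
result.write.mode("overwrite").parquet("/tmp/split_output") // hypothetical output path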

Filtering on a dataframe based on columns defined in a list

I have a dataframe -
df
+----------+----+----+-------+-------+
| WEEK|DIM1|DIM2|T1_diff|T2_diff|
+----------+----+----+-------+-------+
|2016-04-02| 14|NULL| -5| 60|
|2016-04-30| 14| FR| 90| 4|
+----------+----+----+-------+-------+
I have defined a list as targetList
List(T1_diff, T2_diff)
I want to filter the dataframe down to the rows where T1_diff and T2_diff are both greater than 3. In this scenario the output should contain only the second row, since the first row has -5 for T1_diff. targetList can contain more columns; currently it has T1_diff and T2_diff, but if another column called T3_diff is added, it should be handled automatically.
What is the best way to achieve this ?
Suppose you have the following List of columns that you want to filter on for values greater than 3.
val lst = List("T1_diff", "T2_diff")
Then you can build a condition String from these column names and pass that String to the where function.
val condition = lst.map(c => s"$c>3").mkString(" AND ")
df.where(condition).show(false)
For the above Dataframe it will output only the second row.
+----------+----+----+-------+-------+
|Week |Dim1|Dim2|T1_diff|T2_diff|
+----------+----+----+-------+-------+
|2016-04-30|14 |FR |90 |4 |
+----------+----+----+-------+-------+
If you have another column say T3_diff you can add it to the List and it will get added to the filter condition.
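As a side note, the same filter can also be expressed with typed Column expressions instead of building a SQL string; a small sketch under the same assumptions (same df and column list), combining the per-column predicates with &&:
import org.apache.spark.sql.functions.col

val lst = List("T1_diff", "T2_diff")
// Build a single predicate: T1_diff > 3 AND T2_diff > 3 AND ...
val condition = lst.map(c => col(c) > 3).reduce(_ && _)
df.where(condition).show(false)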

Spark: Applying UDF to Dataframe Generating new Columns based on Values in DF

I am having problems transposing values in a DataFrame in Scala. My initial DataFrame looks like this:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| A| X| 6|null|
| B| Z|null| 5|
| C| Y| 4|null|
+----+----+----+----+
col1 and col2 are type String and col3 and col4 are Int.
And the result should look like this:
+----+----+----+----+------+------+------+
|col1|col2|col3|col4|AXcol3|BZcol4|CYcol4|
+----+----+----+----+------+------+------+
| A| X| 6|null| 6| null| null|
| B| Z|null| 5| null| 5| null|
| C| Y| 4| 4| null| null| 4|
+----+----+----+----+------+------+------+
That means the three new columns should be named after col1, col2, and the column the value is extracted from. The extracted value comes from col3 or col4, depending on which value is not null.
So how to achieve that? I first thought of a UDF like this:
def myFunc(col1: String, col2: String, col3: Long, col4: Long): (newColumn: String, rowValue: Long) = {
  if col3 == null {
    val rowValue = col4;
    val newColumn = col1 + col2 + "col4";
  } else {
    val rowValue = col3;
    val newColumn = col1 + col2 + "col3";
  }
  return (newColumn, rowValue);
}
val udfMyFunc = udf(myFunc _ ) //needed to treat it as partially applied function
But how can I call it from the dataframe in the right way?
Of course, all the code above is rough and there could be a much better way. Since I am just juggling the first code snippets, let me know... Comparing the Int value to null already does not work.
Any help is appreciated! Thanks!
There is a simpler way:
val df3 = df2.withColumn("newCol", concat($"col1", $"col2")) //Step 1
.withColumn("value",when($"col3".isNotNull, $"col3").otherwise($"col4")) //Step 2
.groupBy($"col1",$"col2",$"col3",$"col4",$"newCol") //Step 3
.pivot("newCol") // Step 4
.agg(max($"value")) // Step 5
.orderBy($"newCol") // Step 6
.drop($"newCol") // Step 7
df3.show()
The steps work as follows:
Step 1: Add a new column which contains the contents of col1 concatenated with col2.
Step 2: Add a new column, "value", which contains the non-null contents of either col3 or col4.
Step 3: GroupBy the columns you want.
Step 4: Pivot on newCol, which contains the values that are now to become column headings.
Step 5: Aggregate by the max of value, which will be the value itself if the groupBy is single-valued per group; alternatively use .agg(first($"value")) if value happens to be a string rather than a numeric type, since the max function can only be applied to a numeric type.
Step 6: Order by newCol so the DF is in ascending order.
Step 7: Drop this column as you no longer need it, or skip this step if you want a column of values without nulls.
Credit due to #user8371915 who helped me answer my own pivot question in the first place.
Result is as follows:
+----+----+----+----+----+----+----+
|col1|col2|col3|col4| AX| BZ| CY|
+----+----+----+----+----+----+----+
| A| X| 6|null| 6|null|null|
| B| Z|null| 5|null| 5|null|
| C| Y| 4| 4|null|null| 4|
+----+----+----+----+----+----+----+
You might have to play around with the column header strings concatenation to get the right result.
Okay, I have a workaround to achieve what I want. I do the following:
(1) I generate a new column containing a tuple of [newColumnName, rowValue], following this advice: Derive multiple columns from a single column in a Spark DataFrame
case class toTuple(newColumnName: String, rowValue: String)

def createTuple(input1: String, input2: String): toTuple = {
  // do something fancy here
  var column: String = input1 + input2
  var value: String = input1
  return toTuple(column, value)
}
val UdfCreateTuple = udf(createTuple _)
(2) Apply the function to the DataFrame:
val dfNew = df.select($"*", UdfCreateTuple($"col1", $"col2").alias("tmpCol"))
(3) Create a dataframe with the distinct values of newColumnName:
val dfDistinct = dfNew.select($"tmpCol.newColumnName").distinct
(4) Collect those distinct values into an array:
var a = dfDistinct.select($"newColumnName").rdd.map(r => r(0).asInstanceOf[String])
var arrDistinct = a.map(a => a).collect()
(5) Create a key-value mapping:
var seqMapping: Seq[(String, String)] = Seq()
for (i <- arrDistinct) {
  seqMapping :+= (i, i)
}
(6) Apply the mapping to the original dataframe, cf. Mapping a value into a specific column based on another column:
val exprsDistinct = seqMapping.map { case (key, target) =>
when($"tmpCol.newColumnName" === key, $"tmpCol.rowValue").alias(target) }
val dfFinal = dfNew.select($"*" +: exprsDistinct: _*)
Well, that is a bit cumbersome, but I can derive a set of new columns without knowing in advance how many there are, and at the same time transfer each value into its new column.