Dynamically pass arguments to a function in Scala

I have records stored as comma-delimited strings with 1000 fields in a dataframe, like:
"a,b,c,d,e ... up to 1000"  - 1st record
"p,q,r,s,t ... up to 1000"  - 2nd record
I am using the solution suggested in this Stack Overflow question:
Split 1 column into 3 columns in spark scala
df.withColumn("_tmp", split($"columnToSplit", "\\."))
  .select($"_tmp".getItem(0).as("col1"), $"_tmp".getItem(1).as("col2"), $"_tmp".getItem(2).as("col3"))
  .drop("_tmp")
However, in my case I have 1000 columns, whose names are in a JSON schema and can be retrieved like this:
val column_seq: Seq[String] = Schema_func.map(_.name)
for (i <- 0 to column_seq.length - 1) { println(i + " " + column_seq(i)) }
which prints:
0 col1
1 col2
2 col3
3 col4
Now I need to pass all these indexes and column names to the DataFrame expression below:
df.withColumn("_tmp", split($"columnToSplit", "\\."))
  .select($"_tmp".getItem(0).as("col1"), $"_tmp".getItem(1).as("col2"), $"_tmp".getItem(2).as("col3"))
  .drop("_tmp")
specifically into the part
$"_tmp".getItem(0).as("col1"), $"_tmp".getItem(1).as("col2"), ...
Since I can't write out a statement with all 1000 columns by hand, is there an effective way to pass all these arguments from the JSON schema above to the select function, so that I can split the columns, add the header, and then convert the DF to parquet?

You can build a series of org.apache.spark.sql.Column, where each one is the result of selecting the right item and has the right name, and then select these columns:
val columns: Seq[Column] = Schema_func.map(_.name)
  .zipWithIndex // attach index to names
  .map { case (name, index) => $"_tmp".getItem(index) as name }

val result = df
  .withColumn("_tmp", split($"columnToSplit", "\\."))
  .select(columns: _*)
For example, for this input:
case class A(name: String)
val Schema_func = Seq(A("c1"), A("c2"), A("c3"), A("c4"), A("c5"))
val df = Seq("a.b.c.d.e").toDF("columnToSplit")
The result would be:
// +---+---+---+---+---+
// | c1| c2| c3| c4| c5|
// +---+---+---+---+---+
// |  a|  b|  c|  d|  e|
// +---+---+---+---+---+
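For the comma-delimited records from the question, the same columns sequence can be reused; split on "," instead of "\\." and write the result out as parquet. A sketch (the output path is hypothetical):
import org.apache.spark.sql.functions.split

df.withColumn("_tmp", split($"columnToSplit", ","))  // comma delimiter from the question
  .select(columns: _*)                               // columns built as in the answer above
  .write
  .parquet("/tmp/split_records.parquet")             // hypothetical output path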

Related

Converting List of List to Dataframe

I'm reading in data (as shown below) into a list of lists, and I want to convert it into a dataframe with seven columns. The error I get is: requirement failed: number of columns doesn't match. Old column names (1): value, new column names (7): <list of columns>
What am I doing incorrectly and how can I fix it?
Data:
Column1, Column2, Column3, Column4, Column5, Column6, Column7
a,b,c,d,e,f,g
a2,b2,c2,d2,e2,f2,g2
Code:
val spark = SparkSession.builder.appName("er").master("local").getOrCreate()
import spark.implicits._
val erResponse = response.body.toString.split("\\\n")
val header = erResponse(0)
val body = erResponse.drop(1).map(x => x.split(",").toList).toList
val erDf = body.toDF()
erDf.show()
You get this "number of columns doesn't match" error because your erDf dataframe contains only one column, which holds an array:
+----------------------------+
|value                       |
+----------------------------+
|[a, b, c, d, e, f, g]       |
|[a2, b2, c2, d2, e2, f2, g2]|
+----------------------------+
You can't match this single column with the seven columns contained in your header.
The solution here is, given this erDf dataframe, to iterate over your header columns list to build the columns one by one. Your complete code thus becomes:
val spark = SparkSession.builder.appName("er").master("local").getOrCreate()
import spark.implicits._
val erResponse = response.body.toString.split("\\\n")
val header = erResponse(0).split(", ") // build header columns list
val body = erResponse.drop(1).map(x => x.split(",").toList).toList
val erDf = header
  .zipWithIndex
  .foldLeft(body.toDF())((acc, elem) => acc.withColumn(elem._1, col("value")(elem._2)))
  .drop("value")
That will give you the following erDf dataframe:
+-------+-------+-------+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|Column5|Column6|Column7|
+-------+-------+-------+-------+-------+-------+-------+
|      a|      b|      c|      d|      e|      f|      g|
|     a2|     b2|     c2|     d2|     e2|     f2|     g2|
+-------+-------+-------+-------+-------+-------+-------+
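An alternative sketch with the same result: build all seven columns in one select instead of folding over withColumn:
import org.apache.spark.sql.functions.col

val erDf2 = body.toDF()
  .select(header.zipWithIndex.map { case (name, i) => col("value")(i).as(name) }: _*)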

How to convert Seq[Row] to a DataFrame in Scala

Is there any way to convert a Seq[Row] into a DataFrame in Scala?
I have a dataframe and a list of strings that holds the weight of each row in the input dataframe. I want to build a DataFrame that includes all rows with unique weights.
I was able to filter the unique rows and append them to a Seq[Row], but I want to build a dataframe.
This is my code. Thanks in advance.
def dataGenerator(input: DataFrame, weights: List[String]): Dataset[Row] = {
  val valitr = weights.iterator
  var testdata = Seq[Row]()
  var valset = HashSet[String]()
  if (valitr != null) {
    input.collect().foreach { r =>
      val valnxt = valitr.next()
      if (!valset.contains(valnxt)) {
        valset += valnxt
        testdata = testdata :+ r
      }
    }
  }
  // logic to convert testdata to a DataFrame and return it
}
You said that the weight values are calculated using fields from the input dataframe itself. If that is the case, then you should be able to make a new dataframe with an explicit weight column, like this:
+------+------+
|item  |weight|
+------+------+
|item 1|w1    |
|item 2|w2    |
|item 3|w2    |
|item 4|w3    |
|item 5|w4    |
+------+------+
This is the key thing. Then you will be able to work on the dataframe instead of doing a collect.
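As an illustration, a minimal sketch of how such a weight column might be added, assuming (hypothetically) that the weight is derived from two existing columns of the input dataframe:
import org.apache.spark.sql.functions.{col, concat_ws}

// "category" is a hypothetical column name; the real weight logic is whatever
// currently produces the List[String] of weights from the input's fields.
val withWeight = input.withColumn("weight", concat_ws("_", col("item"), col("category")))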
What is bad about doing a collect? Well, there is no point in going to the trouble and overhead of using a distributed big data processing framework just to pull all the data into the memory of one machine. See here: Spark dataframe: collect () vs select ()
When you have the input dataframe how you want it, as above, you can get the result. Here is a way that works, which groups the data by the weight column and picks the first item for each grouping.
val result = input
  .rdd                               // get the underlying RDD
  .groupBy(r => r.get(1))            // group by the "weight" field
  .map(x => x._2.head.getString(0))  // get the first "item" for each weight
  .toDF("item")                      // back to a dataframe
Then you get only the first item in case of a duplicated weight:
+------+
|item  |
+------+
|item 1|
|item 2|
|item 4|
|item 5|
+------+
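If you do end up with a Seq[Row] and want to convert it directly (the literal question in the title), a minimal sketch using createDataFrame and the schema of the input dataframe:
import org.apache.spark.sql.{DataFrame, Row}

// testdata is the Seq[Row] collected from `input`; reusing input.schema keeps the
// original column names and types.
val testdataDF: DataFrame = spark.createDataFrame(
  spark.sparkContext.parallelize(testdata),
  input.schema
)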

Spark dataframe replace values of specific columns in a row with Nulls

I am facing a problem when trying to replace the values of specific columns of a Spark dataframe with nulls.
I have a dataframe with more than fifty columns, of which two are key columns. I want to create a new dataframe with the same schema; it should keep the values from the key columns and have null values in the non-key columns.
I tried the following but am facing issues:
//old_df is the existing Dataframe
val key_cols = List("id", "key_number")
val non_key_cols = old_df.columns.toList.filterNot(key_cols.contains(_))
val key_col_df = old_df.select(key_cols.head, key_cols.tail:_*)
val non_key_cols_df = old_df.select(non_key_cols.head, non_key_cols.tail:_*)
val list_cols = List.fill(non_key_cols_df.columns.size)("NULL")
val rdd_list_cols = spark.sparkContext.parallelize(Seq(list_cols)).map(l => Row(l:_*))
val list_df = spark.createDataFrame(rdd_list_cols, non_key_cols_df.schema)
val new_df = key_col_df.crossJoin(list_df)
This approach works when I only have string-type columns in old_df. But I also have some columns of double and int type, which throw an error because the RDD is a list of "NULL" strings.
To avoid this, I tried making list_df an empty dataframe with the schema of non_key_cols_df, but the result of the crossJoin is an empty dataframe, which I believe is because one of the dataframes is empty.
My requirement is to have the non_key_cols as a single-row dataframe with nulls, so that I can do a crossJoin with key_col_df and form the required new_df.
Also, any other easier way to update all columns except the key columns of a dataframe to nulls would resolve my issue. Thanks in advance.
crossJoin is an expensive operation so you want to avoid it if possible.
An easier solution would be to iterate over all non-key columns and insert null with lit(null). Using foldLeft this can be done as follows:
val keyCols = List("id", "key_number")
val nonKeyCols = df.columns.filterNot(keyCols.contains(_))
val df2 = nonKeyCols.foldLeft(df)((df, c) => df.withColumn(c, lit(null)))
Input example:
+---+----------+---+----+
| id|key_number|  c|   d|
+---+----------+---+----+
|  1|         2|  3| 4.0|
|  5|         6|  7| 8.0|
|  9|        10| 11|12.0|
+---+----------+---+----+
will give:
+---+----------+----+----+
| id|key_number|   c|   d|
+---+----------+----+----+
|  1|         2|null|null|
|  5|         6|null|null|
|  9|        10|null|null|
+---+----------+----+----+
Shaido's answer has a small drawback: the column types will be lost.
This can be fixed by using the schema, like this:
val nonKeyCols = df.schema.fields.filterNot(f => keyCols.contains(f.name))
val df2 = nonKeyCols.foldLeft(df)((df, c) => df.withColumn(c.name, lit(null).cast(c.dataType)))
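Another option, as a sketch, is a single select that keeps the key columns and replaces everything else with a typed null, which also preserves the column types:
import org.apache.spark.sql.functions.{col, lit}

val df2 = df.select(df.schema.fields.map { f =>
  if (keyCols.contains(f.name)) col(f.name)          // keep key columns as they are
  else lit(null).cast(f.dataType).as(f.name)         // null out the rest, keeping the type
}: _*)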

Spark: Applying UDF to Dataframe Generating new Columns based on Values in DF

I am having problems transposing values in a DataFrame in Scala. My initial DataFrame looks like this:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   A|   X|   6|null|
|   B|   Z|null|   5|
|   C|   Y|   4|null|
+----+----+----+----+
col1 and col2 are type String and col3 and col4 are Int.
And the result should look like this:
+----+----+----+----+------+------+------+
|col1|col2|col3|col4|AXcol3|BZcol4|CYcol3|
+----+----+----+----+------+------+------+
|   A|   X|   6|null|     6|  null|  null|
|   B|   Z|null|   5|  null|     5|  null|
|   C|   Y|   4|null|  null|  null|     4|
+----+----+----+----+------+------+------+
That means that the three new columns should be named after col1, col2 and the column the value is extracted from. The extracted value comes from col3 or col4, depending on which one is not null.
So how to achieve that? I first thought of a UDF like this:
def myFunc(col1: String, col2: String, col3: Long, col4: Long): (String, Long) = {
  if (col3 == null) {
    (col1 + col2 + "col4", col4)
  } else {
    (col1 + col2 + "col3", col3)
  }
}
val udfMyFunc = udf(myFunc _) // needed to treat it as a partially applied function
But how can I call it from the dataframe in the right way?
Of course, all the code above is rubbish and there may be a much better way; since I am just juggling with my first code snippets, let me know... Comparing the Int value to null is already not working.
Any help is appreciated! Thanks!
There is a simpler way:
val df3 = df2.withColumn("newCol", concat($"col1", $"col2"))                  // Step 1
  .withColumn("value", when($"col3".isNotNull, $"col3").otherwise($"col4"))   // Step 2
  .groupBy($"col1", $"col2", $"col3", $"col4", $"newCol")                     // Step 3
  .pivot("newCol")                                                            // Step 4
  .agg(max($"value"))                                                         // Step 5
  .orderBy($"newCol")                                                         // Step 6
  .drop($"newCol")                                                            // Step 7
df3.show()
The steps work as follows:
Add a new column, newCol, which contains the contents of col1 concatenated with col2
Add a new column, value, which contains the non-null contents of either col3 or col4
GroupBy the columns you want
Pivot on newCol, which contains the values that are now to become column headings
Aggregate by the max of value, which will be the value itself if the groupBy is single-valued per group; or alternatively .agg(first($"value")) if value happens to be a string rather than a numeric type (the max function can only be applied to a numeric type)
Order by newCol so the DF is in ascending order
Drop this column as you no longer need it, or skip this step if you want a column of values without nulls
Credit due to @user8371915 who helped me answer my own pivot question in the first place.
Result is as follows:
+----+----+----+----+----+----+----+
|col1|col2|col3|col4|  AX|  BZ|  CY|
+----+----+----+----+----+----+----+
|   A|   X|   6|null|   6|null|null|
|   B|   Z|null|   5|null|   5|null|
|   C|   Y|   4|null|null|null|   4|
+----+----+----+----+----+----+----+
You might have to play around with the column header strings concatenation to get the right result.
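For reference, one possible way to construct the example input df2 used above (the answer does not show it, so this construction is an assumption):
import spark.implicits._

// Option[Int] yields nullable integer columns, matching the nulls in the question.
val data: Seq[(String, String, Option[Int], Option[Int])] = Seq(
  ("A", "X", Some(6), None),
  ("B", "Z", None, Some(5)),
  ("C", "Y", Some(4), None)
)
val df2 = data.toDF("col1", "col2", "col3", "col4")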
Okay, I have a workaround to achieve what I want. I do the following:
(1) I generate a new column containing a tuple of [newColumnName, rowValue], following this advice: Derive multiple columns from a single column in a Spark DataFrame
case class toTuple(newColumnName: String, rowValue: String)

def createTuple(input1: String, input2: String): toTuple = {
  // do something fancy here
  val column: String = input1 + input2
  val value: String = input1
  toTuple(column, value)
}

val UdfCreateTuple = udf(createTuple _)
(2) Apply the function to the DataFrame
val dfNew = df.select($"*", UdfCreateTuple($"col1", $"col2").alias("tmpCol"))
(3) Create a dataframe with the distinct values of newColumnName
val dfDistinct = dfNew.select($"tmpCol.newColumnName").distinct
(4) Collect those distinct values into an array
val a = dfDistinct.select($"newColumnName").rdd.map(r => r(0).asInstanceOf[String])
val arrDistinct = a.collect()
(5) Create a key-value mapping
var seqMapping: Seq[(String, String)] = Seq()
for (i <- arrDistinct) {
  seqMapping :+= (i -> i)
}
(6) Apply the mapping to the original dataframe, cf. Mapping a value into a specific column based on another column
val exprsDistinct = seqMapping.map { case (key, target) =>
  when($"tmpCol.newColumnName" === key, $"tmpCol.rowValue").alias(target)
}
val dfFinal = dfNew.select($"*" +: exprsDistinct: _*)
Well, that is a bit cumbersome but I can derive a set of new columns without knowing how many there are and at the same time transfer the value into that new column.

Unexpected column values after the IN condition in where() method of dataframe in spark

Task: I want the value of the child_id column [which is generated using the withColumn() and monotonicallyIncreasingId() methods] corresponding to the family_id and id columns.
Let me explain the steps to complete my task:
Step 1: Add 2 columns to the dataframe: one with a unique id, named child_id, and another with the value 0, named parent_id.
Step 2: Get all family_ids from the dataframe.
Step 3: Get the dataframe of child_id and id, where id == family_id.
[The problem is here.]
def processFoHierarchical(param_df: DataFrame) {
  var dff = param_df.withColumn("child_id", monotonicallyIncreasingId() + 1)
  println("Something is not gud...")
  dff = dff.withColumn("parent_id", lit(0.toLong))
  dff.select("id", "family_id", "child_id").show() // original dataframe

  var family_ids = ""
  param_df.select("family_id").distinct().coalesce(1).collect()
    .map(x => family_ids = family_ids + "'" + x.getAs[String]("family_id") + "',")
  println(family_ids)

  var x: DataFrame = null
  if (family_ids.length() > 0) {
    family_ids = family_ids.substring(0, family_ids.length() - 1)
    val y = dff.where(" id IN (" + family_ids + ")").select("id", "family_id", "child_id")
    y.show() // here I am getting unexpected values
  }
}
This is the output of my code. I am trying to get the child_id values that are already in the dataframe, but I am not getting them.
Note: using Spark with Scala.
+--------------------+--------------------+----------+
|                  id|           family_id|  child_id|
+--------------------+--------------------+----------+
|fe60c680-eb59-11e...|fe60c680-eb59-11e...|         4|
|8d9680a0-ec14-11e...|8d9680a0-ec14-11e...|         9|
|ff81457a-e9cf-11e...|ff81457a-e9cf-11e...|         5|
|4261cca0-f0e9-11e...|4261cca0-f0e9-11e...|        10|
|98c7dc00-f0e5-11e...|98c7dc00-f0e5-11e...|         8|
|dca16200-e462-11e...|dca16200-e462-11e...|8589934595|
|78be8950-ecca-11e...|ff81457a-e9cf-11e...|         1|
|4cc19690-e819-11e...|ff81457a-e9cf-11e...|         3|
|dca16200-e462-11e...|ff81457a-e9cf-11e...|8589934596|
|72dd0250-eff4-11e...|78be8950-ecca-11e...|         2|
|84ed0df0-e81a-11e...|78be8950-ecca-11e...|         6|
|78be8951-ecca-11e...|78be8950-ecca-11e...|         7|
|d1515310-e9ad-11e...|78be8951-ecca-11e...|8589934593|
|d1515310-e9ad-11e...|72dd0250-eff4-11e...|8589934594|
+--------------------+--------------------+----------+
'72dd0250-eff4-11e5-9ce9-5e5517507c66','dca16200-e462-11e5-90ec-c1cf090b354c','78be8951-ecca-11e5-a5f5-c1cf090b354c','4261cca0-f0e9-11e5-bbba-c1cf090b354c','98c7dc00-f0e5-11e5-bc76-c1cf090b354c','fe60c680-eb59-11e5-9582-c1cf090b354c','ff81457a-e9cf-11e5-9ce9-5e5517507c66','8d9680a0-ec14-11e5-a94f-c1cf090b354c','78be8950-ecca-11e5-a5f5-c1cf090b354c',
+--------------------+--------------------+-----------+
|                  id|           family_id|   child_id|
+--------------------+--------------------+-----------+
|fe60c680-eb59-11e...|fe60c680-eb59-11e...|          1|
|ff81457a-e9cf-11e...|ff81457a-e9cf-11e...|          2|
|98c7dc00-f0e5-11e...|98c7dc00-f0e5-11e...|          3|
|8d9680a0-ec14-11e...|8d9680a0-ec14-11e...|          4|
|4261cca0-f0e9-11e...|4261cca0-f0e9-11e...|          5|
|dca16200-e462-11e...|dca16200-e462-11e...|          6|
|78be8950-ecca-11e...|ff81457a-e9cf-11e...| 8589934593|
|dca16200-e462-11e...|ff81457a-e9cf-11e...| 8589934594|
|72dd0250-eff4-11e...|78be8950-ecca-11e...|17179869185|
|78be8951-ecca-11e...|78be8950-ecca-11e...|17179869186|
+--------------------+--------------------+-----------+
I know that it doesn't produce consecutive values; those values depend on the partitions. By "unexpected values" I mean (see the 2nd dataframe) that those child_ids are supposed to come from the previous dataframe, where family_id = id (to match multiple ids I am using IN). Instead, the child_id column has none of the values from the dataframe above; it looks as if a new child_id column is being created with monotonicallyIncreasingId().
See that the last 2 values in the 2nd dataframe don't belong to the dataframe above. So where are they coming from? I am not applying monotonicallyIncreasingId() again on the dataframe, so why does it look like the child_id column has values as if monotonicallyIncreasingId() had been applied again?
The problem is not with the Spark DataFrame itself. When we use monotonicallyIncreasingId() on a DataFrame, it generates new ids every time an action such as DataFrame.show() runs.
If we need to generate the ids once and refer to the same ids elsewhere in the code, we need to DataFrame.cache().
In your case, you need to cache the DataFrame after Step 1 so it will not create different child_id values every time show() is called.
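A minimal sketch of that fix, using the current snake_case function name (monotonicallyIncreasingId() is the older, deprecated spelling):
import org.apache.spark.sql.functions.{lit, monotonically_increasing_id}

// Cache right after generating the ids so that later actions reuse the same
// child_id values instead of recomputing them.
val dff = param_df
  .withColumn("child_id", monotonically_increasing_id() + 1)
  .withColumn("parent_id", lit(0L))
  .cache()
dff.count() // force evaluation so the cached child_id values are materialised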