Dynamically select column content based on other column from the same row - scala

I am using Spark 1.6.1. Lets say my data frame looks like:
+------------+-----+----+
|categoryName|catA |catB|
+------------+-----+----+
| catA |0.25 |0.75|
| catB |0.5 |0.5 |
+------------+-----+----+
Where categoryName has String type, and cat* are Double. I would like to add column that will contain value from column which name is in the categoryName column:
+------------+-----+----+-------+
|categoryName|catA |catB| score |
+------------+-----+----+-------+
| catA |0.25 |0.75| 0.25 | ('score' has value from column name 'catA')
| catB |0.5 |0.7 | 0.7 | ('score' value from column name 'catB')
+------------+-----+----+-------+
I need such extraction to some later calculations. Any ideas?
Important: I don't know names of category columns. Solution needs to be dynamic.

Spark 2.0:
You can do this (for any number of category columns) by creating a temporary column which holds a map of categroyName -> categoryValue, and then selecting from it:
// sequence of any number of category columns
val catCols = input.columns.filterNot(_ == "categoryName")
// create a map of category -> value, and then select from that map using categoryName:
input
.withColumn("asMap", map(catCols.flatMap(c => Seq(lit(c), col(c))): _*))
.withColumn("score", $"asMap".apply($"categoryName"))
.drop("asMap")
Spark 1.6: Similar idea, but using an array and a UDF to select from it:
// sequence of any number of category columns
val catCols = input.columns.filterNot(_ == "categoryName")
// UDF to select from array by index of colName in catCols
val getByColName = udf[Double, String, mutable.WrappedArray[Double]] {
case (colName, colValues) =>
val index = catCols.zipWithIndex.find(_._1 == colName).map(_._2)
index.map(colValues.apply).getOrElse(0.0)
}
// create an array of category values and select from it using UDF:
input
.withColumn("asArray", array(catCols.map(col): _*))
.withColumn("score", getByColName($"categoryName", $"asArray"))
.drop("asArray")

You have several options:
If you are using scala you can use the Dataset API in which case you would simply create a map which does the calculation.
You can move to RDD from dataframe and use a map
You can create a UDF which receives all relevant columns as input and do the calculation inside
you can use a bunch of when/otherwise clauses to do the search (e.g. when(col1 == CatA, col(CatA)).otherwise(col(CatB)))

Related

Explode or pivot spark scala dataframe horizontally to create a big flat dataframe

I have a dataframe with following schema:
UserID | StartDate | endDate | orderId | OrderCost| OrderItems| OrderLocation| Rank
Where Rank is 1 to 10.
I need to transpose this dataframe on rank and create dataframe in the below format:
UserID| StartDate_1 | endDate_1 | orderId_1 | OrderCost_1| OrderItems_1| OrderLocation_1|start_2 |endDate_2| orderId_2 | OrderCost_2| OrderItems_2| OrderLocation_2 |............| startDate_N|endDate_N | orderId_N | OrderCost_N| OrderItems_N| OrderLocation_N
If a user has only two records with rank 3 and 10 then the requirement is populate columns with suffix _3 and _10 the rest of the cell values for the user will be null.
I have tried 2 brute force approaches
Filter the DF for a rank, and rename the columns with suffix and do self join back to DF.
Grouped by UserID, collect as list and pass it to map function where I populate a array based on rank and then return the seq of string. Create the DF by passing the required schema
Both seemed to be working (Unsure if its the right approach
)but they are not generic that i can re use for different usecase i have
In this example I used https://github.com/bokeh/bokeh/blob/master/bokeh/sampledata/_data/auto-mpg.csv
Spark by default puts the rank in front, so the column names are "reversed" from what you specified, but this is done in only a few steps. The key is that exprs should be dynamically created, and that agg requires this to be split into a head and tail (which is why there is .agg(exprs(0), exprs.slice(1, exprs.length) below)
scala> df2.columns
res39: Array[String] = Array(mpg, cyl, displ, hp, weight, accel, yr, origin, name, Rank)
// note here, you would use columns.slice with the indices for
// the columns you need, i.e. (1, 7)
val exprs = for (col <- df2.columns.slice(0, 8)) yield expr(s"first(${col}) as ${col}")
exprs: Array[org.apache.spark.sql.Column] = Array(first(mpg, false) AS `mpg`, first(cyl, false) AS `cyl`, first(displ, false) AS `displ`, first(hp, false) AS `hp`, first(weight, false) AS `weight`, first(accel, false) AS `accel`, first(yr, false) AS `yr`, first(origin, false) AS `origin`)
scala> val resultDF = df2.groupBy("name").pivot("Rank").agg(exprs(0), exprs.slice(1, exprs.length):_*)
scala> resultDF.columns
res40: Array[String] = Array(name, 1_mpg, 1_cyl, 1_displ, 1_hp, 1_weight, 1_accel, 1_yr, 1_origin, 2_mpg, 2_cyl, 2_displ, 2_hp, 2_weight, 2_accel, 2_yr, 2_origin, 3_mpg, 3_cyl, 3_displ, 3_hp, 3_weight, 3_accel, 3_yr, 3_origin, 4_mpg, 4_cyl, 4_displ, 4_hp, 4_weight, 4_accel, 4_yr, 4_origin, 5_mpg, 5_cyl, 5_displ, 5_hp, 5_weight, 5_accel, 5_yr, 5_origin, 6_mpg, 6_cyl, 6_displ, 6_hp, 6_weight, 6_accel, 6_yr, 6_origin, 7_mpg, 7_cyl, 7_displ, 7_hp, 7_weight, 7_accel, 7_yr, 7_origin, 8_mpg, 8_cyl, 8_displ, 8_hp, 8_weight, 8_accel, 8_yr, 8_origin, 9_mpg, 9_cyl, 9_displ, 9_hp, 9_weight, 9_accel, 9_yr, 9_origin, 10_mpg, 10_cyl, 10_displ, 10_hp, 10_weight, 10_accel, 10_yr, 10_origin)

add new column in a dataframe depending on another dataframe's row values

I need to add a new column to dataframe DF1 but the new column's value should be calculated using other columns' value present in that DF. Which of the other columns to be used will be given in another dataframe DF2.
eg. DF1
|protocolNo|serialNum|testMethod |testProperty|
+----------+---------+------------+------------+
|Product1 | AB |testMethod1 | TP1 |
|Product2 | CD |testMethod2 | TP2 |
DF2-
|action| type| value | exploded |
+------------+---------------------------+-----------------+
|append|hash | [protocolNo] | protocolNo |
|append|text | _ | _ |
|append|hash | [serialNum,testProperty] | serialNum |
|append|hash | [serialNum,testProperty] | testProperty |
Now the value of exploded column in DF2 will be column names of DF1 if value of type column is hash.
Required -
New column should be created in DF1. the value should be calculated like below-
hash[protocolNo]_hash[serialNumTestProperty] ~~~ here on place of column their corresponding row values should come.
eg. for Row1 of DF1, col value should be
hash[Product1]_hash[ABTP1]
this will result into something like this abc-df_egh-45e after hashing.
The above procedure should be followed for each and every row of DF1.
I've tried using map and withColumn function using UDF on DF1. But in UDF, outer dataframe value is not accessible(gives Null Pointer Exception], also I'm not able to give DataFrame as input to UDF.
Input DFs would be DF1 and DF2 as mentioned above.
Desired Output DF-
|protocolNo|serialNum|testMethod |testProperty| newColumn |
+----------+---------+------------+------------+----------------+
|Product1 | AB |testMethod1 | TP1 | abc-df_egh-4je |
|Product2 | CD |testMethod2 | TP2 | dfg-df_ijk-r56 |
newColumn value is after hashing
Instead of DF2, you can translate DF2 to case class like Specifications, e.g
case class Spec(columnName:String,inputColumns:Seq[String],action:String,action:String,type:String*){}
Create instances of above class
val specifications = Seq(
Spec("new_col_name",Seq("serialNum","testProperty"),"hash","append")
)
Then you can process the below columns
val transformed = specifications
.foldLeft(dtFrm)((df: DataFrame, spec: Specification) => df.transform(transformColumn(columnSpec)))
def transformColumn(spec: Spec)(df: DataFrame): DataFrame = {
spec.type.foldLeft(df)((df: DataFrame, type : String) => {
type match {
case "append" => {have a case match of the action and do that , then append with df.withColumn}
}
}
Syntax may not be correct
Since DF2 has the column names that will be used to calculate a new column from DF1, I have made this assumption that DF2 will not be a huge Dataframe.
First step would be to filter DF2 and get the column names that we want to pick from DF1.
val hashColumns = DF2.filter('type==="hash").select('exploded).collect
Now, hashcolumns will have the columns that we want to use to calculate hash in the newColumn. The hashcolumns is an Array of Row. We need this to be a Column that will be applied while creating the newColumn in DF1.
val newColumnHash = hashColumns.map(f=>hash(col(f.getString(0)))).reduce(concat_ws("_",_,_))
The above line will convert the Row to a Column with hash function applied to it. And we reduce it while concatenating _. Now, the task becomes simple. We just need to apply this to DF1.
DF1.withColumn("newColumn",newColumnHash).show(false)
Hope this helps!

How to convert rows of DataFrame to a List/Map

I have the following DataFrame df:
id | type | count
-------------------
1 | A | 2
2 | B | 4
I want to pass each row of this df as the input of the function saveObj.
df.foreach( row => {
val list = List("id" -> row.get(0),"type" -> row.get(1))
saveObj(list)
})
Inside saveObj I want to access list values as follows: list("id"), list("type").
How can I avoid using indices of columns?: row.get(0)or row.get(1).
You can use getAs which expects a column name. By first creating a list of column names you're interested in - you can map them to the desired list of tuples:
// can also use df.columns.toList to get ALL columns
val columns = List("id", "type")
df.foreach(row => {
saveObj(columns.map(name => name -> row.getAs[Any](name)))
})
Alternatively, you can take advantage of Row.apply by using pattern matching - but in this case still require knowing the order of columns in the Row and repeating the column names:
df.foreach(_ match {
case Row(id: Any, typ: Any, _) => saveObj(List("id" -> id, "type" -> typ))
})

Get Seq index from element in Spark Sql

I have a dataframe as shown below:
---------------------+------------------------
text | featured_text
---------------------+------------------------
sun | [type, move, sun]
---------------------+------------------------
I want to search the "text" column value in "featured_text" Array and get the index of the "text" value if present. In the above example, I want to search for "sun" in Array [type, move, sun] and result will be "2" (index).
Is there any spark sql function/scala function available to get the index from the element?
As far as I know, there is no function to do this directly with the Spark SQL API. However, you can use an UDF instead as follows (I'm assuming the input dataframe is called df):
val getIndex = udf((text: String, featuredText: Seq[String]) => {
featuredText.indexOf(text)
})
val df2 = df.withColumn("index", getIndex($"text", $"featured_text"))
Which will give:
+----+-----------------+-----+
|text| featured_text|index|
+----+-----------------+-----+
| sun|[type, move, sun]| 2|
+----+-----------------+-----+
In the case where the value is not present the index column will have a -1.

Spark dataframe get column value into a string variable

I am trying extract column value into a variable so that I can use the value somewhere else in the code. I am trying like the following
val name= test.filter(test("id").equalTo("200")).select("name").col("name")
It returns
name org.apache.spark.sql.Column = name
how to get the value?
The col("name") gives you a column expression. If you want to extract data from column "name" just do the same thing without col("name"):
val names = test.filter(test("id").equalTo("200"))
.select("name")
.collectAsList() // returns a List[Row]
Then for a row you could get name in String by:
val name = row.getString(0)
val maxDate = spark.sql("select max(export_time) as export_time from tier1_spend.cost_gcp_raw").first()
val rowValue = maxDate.get(0)
By this snippet, you can extract all the values in a column into a string.
Modify the snippet with where clauses to get your desired value.
val df = Seq((5, 2), (10, 1)).toDF("A", "B")
val col_val_df = df.select($"A").collect()
val col_val_str = col_val_df.map(x => x.get(0)).mkString(",")
/*
df: org.apache.spark.sql.DataFrame = [A: int, B: int]
col_val_row: Array[org.apache.spark.sql.Row] = Array([5], [10])
col_val_str: String = 5,10
*/
The value of entire column is stored in col_val_str
col_val_str: String = 5,10
Let us assume you need to pick the name from the below table for a particular Id and store that value in a variable.
+-----+-------+
| id | name |
+-----+-------+
| 100 | Alex |
| 200 | Bidan |
| 300 | Cary |
+-----+-------+
SCALA
-----------
Irrelevant data is filtered out first and then the name column is selected and finally stored into name variable
var name = df.filter($"id" === "100").select("name").collect().map(_.getString(0)).mkString("")
PYTHON (PYSPARK)
-----------------------------
For simpler usage, I have created a function that returns the value by passing the dataframe and the desired column name to this (this is spark Dataframe and not Pandas Dataframe). Before passing the dataframe to this function, filter is applied to filter out other records.
def GetValueFromDataframe(_df,columnName):
for row in _df.rdd.collect():
return row[columnName].strip()
name = GetValueFromDataframe(df.filter(df.id == "100"),"name")
There might be more simpler approach than this using 3x version of Python. The code which I showed above was tested for 2.7 version.
Note :
It is most likely to encounter out of memory error (Driver memory) since we use the collect function. Hence it is always recommended to apply transformations (like filter,where etc) before you call the collect function. If you
still encounter with driver out of memory issue, you could pass --conf spark.driver.maxResultSize=0 as command line argument to make use of unlimited driver memory.
For anyone interested below is an way to turn a column into an Array, for the below case we are just taking the first value.
val names= test.filter(test("id").equalTo("200")).selectExpr("name").rdd.map(x=>x.mkString).collect
val name = names(0)
s is the string of column values
.collect() converts columns/rows to an array of lists, in this case, all rows will be converted to a tuple, temp is basically an array of such tuples/row.
x(n-1) retrieves the n-th column value for x-th row, which is by default of type "Any", so needs to be converted to String so as to append to the existing strig.
s =""
// say the n-th column is the target column
val temp = test.collect() // converts Rows to array of list
temp.foreach{x =>
s += (x(n-1).asInstanceOf[String])
}
println(s)