Change a dataframe row value with a dynamic number of columns in Spark Scala

I have a dataframe with 10 columns, and I want to change the value in the last column of each row. I have written the following code for this:
val newDF = spark.sqlContext.createDataFrame(WRADF.rdd.map(r => {
  Row(r.get(0), r.get(1),
    r.get(2), r.get(3),
    r.get(4), r.get(5),
    r.get(6), r.get(7),
    r.get(8), decrementCounter(r))
}), WRADF.schema)
I only want to change the value in the 10th column (for which I wrote the decrementCounter() function), but the code above only works for dataframes with exactly 10 columns. I don't know how to change it so that it works for dataframes with a different number of columns. Any help will be appreciated.

Don't do something like this. Define a UDF and apply it with withColumn:
import org.apache.spark.sql.functions.udf
import spark.implicits._ // for the $"..." column syntax
val decrementCounter = udf((x: T) => ...) // adjust types and content to your requirements
df.withColumn("someName", decrementCounter($"someColumn"))

I think a UDF is the better choice because it is applied by column name, regardless of how many columns the dataframe has.
For more on UDFs, see: https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html
For your code, just use this:
import org.apache.spark.sql.functions.udf
val decrementCounterUDF = udf(decrementCounter _)
df.withColumn("columnName", decrementCounterUDF($"columnName"))
This applies the decrementCounter function to every value of the column columnName. Note that it assumes decrementCounter takes the column's value as its argument, not the whole Row as in the question.
I hope this helps, cheers!
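
If the function really does need the whole row, the hand-written Row(r.get(0), ...) from the question can be generalised by keeping every value except the last and appending the new one. Below is a minimal sketch on plain Scala sequences so that it runs without a cluster; replaceLast and the sample decrement function are hypothetical names, not taken from the question. In Spark the same idea would presumably be written as Row.fromSeq(r.toSeq.init :+ decrementCounter(r)) inside the rdd.map, which is untested against the asker's data.

```scala
// Replace the last value of a row-like sequence, whatever its length.
// `compute` receives the whole original row, mirroring decrementCounter(r).
def replaceLast(row: Seq[Any], compute: Seq[Any] => Any): Seq[Any] =
  if (row.isEmpty) row else row.init :+ compute(row)

// Hypothetical stand-in for decrementCounter: subtract 1 from the last value.
def decrement(row: Seq[Any]): Any = row.last match {
  case n: Int => n - 1
  case other  => other // leave non-integer values untouched
}

// The same call works for rows of different widths.
val threeCols = replaceLast(Seq("a", "b", 5), decrement)
val fiveCols  = replaceLast(Seq(1, 2, 3, 4, 10), decrement)
```

The column count never appears in the code, which is the point: only the position "last" is fixed.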

Related

pyspark add int column to a fixed date

I have a fixed date "2000/01/01" and a dataframe:
data1 = [{'index':1,'offset':50}]
data_p = sc.parallelize(data1)
df = spark.createDataFrame(data_p)
I want to create a new column by adding the offset column to this fixed date.
I tried different methods, but I cannot pass the column as an argument, and with expr I get the error:
function is neither a registered temporary function nor a permanent function registered in the database 'default'
The only solution I can think of is:
df = df.withColumn("zero",lit(datetime.strptime('2000/01/01', '%Y/%m/%d')))
df.withColumn("date_offset",expr("date_add(zero,offset)")).drop("zero")
Since I cannot use lit and datetime.strptime in the expr, I have to use this approach which creates a redundant column and redundant operations.
Any better way to do it?
Since you tagged this as a pyspark question, you can do the following in Python (with pyspark.sql.functions imported as F):
df_a3.withColumn("date_offset",F.lit("2000-01-01").cast("date") + F.col("offset").cast("int")).show()
Edit: as per the comment below, assume there is an extra column named type. Based on it, the code below can be used:
df_a3.withColumn("date_offset",F.expr("case when type ='month' then add_months(cast('2000-01-01' as date),offset) else date_add(cast('2000-01-01' as date),cast(offset as int)) end ")).show()

Spark-Scala: Get Dataframe Variable by concatenating two String Variables

I have a scenario where I need to form a dataframe name from two string variables, which is easy to do by concatenation.
Example: "df_" + "part1324"
The above code returns a String. I want it to be a DataFrame variable through which I can perform further operations on the dataframe.
A Map can be used to assign names to DataFrames:
val df = List(("df_value")).toDF()
val stringVariable = "part1324"
// assign name to dataframe
val namedDataFrames = Map("df_" + stringVariable -> df)
// get dataframe by name
namedDataFrames("df_part1324").show(false)
Your question is confusing. What do you mean by a dataframe variable? Concatenating two strings always returns a String. To create a dataframe, you need to use one of the methods available for creating dataframes.
A val df: DataFrame cannot be equal to the String "df_part1234" from your example; to use it as a dataframe, you need to do something like this:
val df_part1234 = sc.range(1000).toDF("number")
where sc is your SparkSession variable.
If you need to generate this variable dynamically, put the statement that creates the dataframe inside the logic that generates the names, such as a loop.
Please rewrite your question if you are trying to achieve something else (along with a code snippet to reproduce the issue), or accept the answer if it resolves the issue.
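
The two answers can be combined: build the name-to-dataframe Map in a loop and look entries up by the concatenated name. This is only a sketch. To keep it runnable without Spark, a Seq[Long] stands in for the DataFrame, and makeFrame and the part name part1325 are made up for illustration.

```scala
// Hypothetical factory. A Seq[Long] stands in for a DataFrame here;
// with Spark this could be something like spark.range(n).toDF("number").
def makeFrame(n: Long): Seq[Long] = 0L until n

// Part names and sizes to generate (illustrative values).
val parts = Seq("part1324" -> 3L, "part1325" -> 5L)

// Build the lookup table: "df_" + part name -> frame.
val namedFrames: Map[String, Seq[Long]] =
  parts.map { case (name, size) => ("df_" + name) -> makeFrame(size) }.toMap

// Look a frame up by its concatenated name.
// get returns an Option, so an unknown name yields None instead of throwing.
val found   = namedFrames.get("df_" + "part1324")
val missing = namedFrames.get("df_unknown")
```

Using get rather than namedFrames(name) avoids a NoSuchElementException when a name was never registered.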

Array manipulation in Spark, Scala

I'm new to Scala and Spark, and I have a problem while trying to learn from some toy dataframes.
I have a dataframe having the following two columns:
Name_Description Grade
Name_Description is an array, and Grade is just a letter. Name_Description is the column I'm having a problem with: I'm trying to transform it using Scala on Spark.
Name_Description is not a fixed-size array. It could be something like
['asdf_ Brandon', 'Ca%abc%rd']
['fthhhhChris', 'Rock', 'is the %abc%man']
The only problems are the following:
1. The first element of the array ALWAYS starts with 6 garbage characters, so the real content starts at the 7th character.
2. %abc% randomly pops up in elements, and I want to erase it.
Is there any way to achieve those two things in Scala? For instance, I just want
['asdf_ Brandon', 'Ca%abc%rd'], ['fthhhhChris', 'Rock', 'is the %abc%man']
to change to
['Brandon', 'Card'], ['Chris', 'Rock', 'is the man']
What you're trying to do might be hard to achieve with the standard Spark functions, but you could define a UDF for it:
import org.apache.spark.sql.functions.udf
import scala.collection.mutable.WrappedArray

val removeGarbage = udf { arr: WrappedArray[String] =>
  // headOption covers the case where the array is empty
  arr.headOption
    // drop the first 6 characters from the first element, then remove %abc% from the rest
    .map(head => head.drop(6) +: arr.tail.map(_.replace("%abc%", "")))
    // an empty array is returned unchanged
    .getOrElse(arr)
}
Then you just need to apply this UDF to your Name_Description column:
val df = List(
  (1, Array("asdf_ Brandon", "Ca%abc%rd")),
  (2, Array("fthhhhChris", "Rock", "is the %abc%man"))
).toDF("Grade", "Name_Description")
df.withColumn("Name_Description", removeGarbage($"Name_Description")).show(false)
Show prints:
+-----+-------------------------+
|Grade|Name_Description         |
+-----+-------------------------+
|1    |[Brandon, Card]          |
|2    |[Chris, Rock, is the man]|
+-----+-------------------------+
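One caveat: the UDF above removes %abc% only from the elements after the first. If the marker can also appear in the first element, as the question suggests, the cleaning logic could look like the sketch below. It is a plain Scala function, so it can be checked without Spark. The name cleanNames and the third sample input are hypothetical, and dropping the 6 characters before removing the marker is an assumption about the intended order.

```scala
// Clean one Name_Description value:
//  - the first element loses its 6 leading garbage characters
//  - every element, including the first, has "%abc%" removed
def cleanNames(names: Seq[String]): Seq[String] =
  names.zipWithIndex.map { case (name, index) =>
    val trimmed = if (index == 0) name.drop(6) else name
    trimmed.replace("%abc%", "")
  }

// The two sample arrays from the question.
val row1 = cleanNames(Seq("asdf_ Brandon", "Ca%abc%rd"))
val row2 = cleanNames(Seq("fthhhhChris", "Rock", "is the %abc%man"))
// Hypothetical input where the marker sits in the first element.
val row3 = cleanNames(Seq("xxxxxxAl%abc%ice"))
```

To use it on the dataframe, it could be wrapped as udf((arr: Seq[String]) => cleanNames(arr)) and applied with withColumn in the same way as removeGarbage.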
We are always encouraged to use Spark SQL functions and to avoid UDFs where we can. Here is a simpler solution that uses only Spark SQL functions.
My approach is below. Hope it helps.
val d = Array((1,Array("asdf_ Brandon","Ca%abc%rd")),(2,Array("fthhhhChris", "Rock", "is the %abc%man")))
val df = spark.sparkContext.parallelize(d).toDF("Grade","Name_Description")
This is how I created the input dataframe.
df.select('Grade, posexplode('Name_Description)).createOrReplaceTempView("data")
We explode the array along with the position of each element in it. I register the dataframe as a temporary view so that I can use a query to generate the required output (registerTempTable also works but has been deprecated since Spark 2.0).
spark.sql("""select Grade, collect_list(Names) from (select Grade,case when pos=0 then substring(col,7) else replace(col,"%abc%","") end as Names from data) a group by Grade""").show
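This query produces the required output. Be aware that collect_list does not guarantee the order of the collected elements after a shuffle, so on larger data the elements may not come back in their original positions. Hope this helps.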

Create new column based on equality between existing columns

While it seems a trivial task, I haven't been able to find a tidy solution for it. I want to add a new (integer) column, nCol, to a dataframe, the value of which is determined by comparing two existing columns (both of String type) of the dataframe, eCol1 and eCol2,
something like:
df(nCol) = {
if df(eCol1) == df(eCol2) then 1
else 0
}
I believe it could be done with the help of user-defined functions (UDFs). But isn't there a tidier way for such a trivial task?
Use when/otherwise from the DataFrame DSL; to test column equality, use ===:
import org.apache.spark.sql.functions.when
df.withColumn("nCol", when(df("eCol1") === df("eCol2"), 1).otherwise(0))

Select values from a dataframe column

I would like to calculate the difference between two values within the same column. Right now I just want the difference between the last value and the first value, but last(column) returns a null result. Is there a reason last() would not return a value? Is there a way to pass the positions of the values I want as variables, e.g. the 10th and the 1st, or the 7th and the 6th?
Current code
Using Spark 1.4.0 and Scala 2.11.6
myDF = some dataframe with n rows by m columns
def difference(col: Column): Column = {
  last(col) - first(col)
}
def diffCalcs(dataFrame: DataFrame): DataFrame = {
  import hiveContext.implicits._
  dataFrame.agg(
    difference($"Column1"),
    difference($"Column2"),
    difference($"Column3"),
    difference($"Column4")
  )
}
When I run diffCalcs(myDF), it returns a null result. If I modify difference to use only first(col), it does return the first value for the four columns. However, if I change it to last(col), it returns null. If I call myDF.show(), I can see that all of the columns have Double values in every row; there are no null values in any of the columns.
After updating to Spark 1.5.0, I was able to use the code snippet provided in the question and it worked. That was what ultimately fixed it. Just for completeness, I have included the code that I used after updating the Spark version.
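import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{first, last}

// Difference between the last and the first value of a column
def difference(col: Column): Column = {
  last(col) - first(col)
}
// Aggregate each column into a single aliased difference value
def diffCalcs(dataFrame: DataFrame): DataFrame = {
  import hiveContext.implicits._
  dataFrame.agg(
    difference($"Column1").alias("newColumn1"),
    difference($"Column2").alias("newColumn2"),
    difference($"Column3").alias("newColumn3"),
    difference($"Column4").alias("newColumn4")
  )
}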