I am trying to create a UDF to replace some values in a DataFrame.
I have the following DataFrame:
df1
+-------------+
| Periodicity |
+-------------+
| Monthly |
| Daily |
| Annual |
+-------------+
So if I find "Annual" in this DataFrame, I want to change it to "EveryYear", and if I find "Daily", change it to "EveryDay". This is what I am trying:
val modifyColumn = () => if (df1.col("Periodicity").equals("Annual")) "EveryYear"
val modifyColumnUDF = udf(modifyColumn)
val result = df1.withColumn("Periodicity", modifyColumnUDF(df1.col("Periodicity")))
But it is giving me an EvaluateException. What am I doing wrong?
You can use one of these approaches:
// First approach
dataFrame
  .withColumn("Periodicity",
    when(col("Periodicity") === "Annual", "EveryYear")
      .when(col("Periodicity") === "Monthly", "EveryMonth")
      .when(col("Periodicity") === "Daily", "EveryDay"))
// Second approach
val permutations = Map("Annual" -> "EveryYear", "Monthly" -> "EveryMonth", "Daily" -> "EveryDay")
val replaceUDF = udf((origValue: String) => permutations.getOrElse(origValue, origValue))
dataFrame.withColumn("Periodicity", replaceUDF(col("Periodicity")))
The second approach can be used if you have many permutations and/or want it to be configured dynamically.
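In that case you can also build the when chain dynamically from the Map itself, without a UDF. A minimal sketch, reusing the permutations map from the second approach (values not in the map keep their original value):
import org.apache.spark.sql.functions.{col, when}
val permutations = Map("Annual" -> "EveryYear", "Monthly" -> "EveryMonth", "Daily" -> "EveryDay")
// Fold the map into one nested when(...).otherwise(...) expression
val replacement = permutations.foldLeft(col("Periodicity")) {
  case (acc, (from, to)) => when(col("Periodicity") === from, to).otherwise(acc)
}
dataFrame.withColumn("Periodicity", replacement)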
I have URL data in a column in my dataframe that I need to parse out parameters from the query string and create new columns for.
Sometimes the parameters will exist, sometimes they won't, and they aren't in a guaranteed order, so I need to be able to find them by name. I am writing this in Scala but can't get the syntax correct and would love some help.
My code:
val df = Seq(
(1, "https://www.mywebsite.com/dummyurl/single?originlatitude=35.0133612060147&originlongitude=-116.156211232302&origincountrycode=us&originstateprovincecode=ca&origincity=boston&originradiusmiles=250&datestart=2021-12-23t00%3a00%3a00"),
(2, "https://www.mywebsite.com/dummyurl/single?originlatitude=19.9141319141121&originlongitude=-56.1241881401291&origincountrycode=us&originstateprovincecode=pa&origincity=york&originradiusmiles=100&destinationlatitude=40.7811012268066&destinationlon")
).toDF("key", "URL")
val result = df
// .withColumn("param_name", $"URL")
.withColumn("parsed_url", explode(split(expr("parse_url(URL, 'QUERY')"), "&")))
.withColumn("parsed_url2", split($"parsed_url", "="))
// .withColumn("exampletest",$"URL".map(kv: String => (kv.split("=")(0), kv.split("=")(1))) )
.withColumn("Search_OriginLongitude", split($"URL","\\?"))
.withColumn("Search_OriginLongitude2", split($"Search_OriginLongitude"(1),"&"))
// .map(kv: Any => (kv.split("=")(0), kv.split("=")(1)))
// .toMap
// .get("originlongitude"))
display(result)
Desired Result:
+---+--------------------+--------------------+--------------------+
|KEY| URL| originlatitude | originlongitude |
+---+--------------------+--------------------+--------------------+
| 1|https://www.myweb...| 35.0133612060147 | -116.156211232302 |
| 2|https://www.myweb...| 19.9141319141121 | -56.1241881401291 |
+---+--------------------+--------------------+--------------------+
The parse_url function can actually take a third parameter, the key of the query parameter you want to extract, like this:
val result = df
.withColumn("Search_OriginLongitude", expr("parse_url(URL, 'QUERY', 'originlatitude')"))
.withColumn("Search_OriginLongitude2", expr("parse_url(URL, 'QUERY', 'originlongitude')"))
result.show
//+---+--------------------+----------------------+-----------------------+
//|key| URL|Search_OriginLongitude|Search_OriginLongitude2|
//+---+--------------------+----------------------+-----------------------+
//| 1|https://www.myweb...| 35.0133612060147| -116.156211232302|
//| 2|https://www.myweb...| 19.9141319141121| -56.1241881401291|
//+---+--------------------+----------------------+-----------------------+
Or you can use the str_to_map function to create a map of parameter -> value, like this:
val result = df
.withColumn("URL", expr("str_to_map(split(URL,'[?]')[1],'&','=')"))
.withColumn("Search_OriginLongitude", col("URL").getItem("originlatitude"))
.withColumn("Search_OriginLongitude2", col("URL").getItem("originlongitude"))
var columnnames= "callStart_t,callend_t" // Timestamp column names are dynamic input.
scala> df1.show()
+------+------------+--------+----------+
| name| callStart_t|personid| callend_t|
+------+------------+--------+----------+
| Bindu|1080602418 | 2|1080602419|
|Raphel|1647964576 | 5|1647964576|
| Ram|1754536698 | 9|1754536699|
+------+------------+--------+----------+
Code which I tried:
val newDf = df1.withColumn("callStart_Time", to_utc_timestamp(from_unixtime($"callStart_t"/1000,"yyyy-MM-dd hh:mm:ss"),"Europe/Berlin"))
val newDf = df1.withColumn("callend_Time", to_utc_timestamp(from_unixtime($"callend_t"/1000,"yyyy-MM-dd hh:mm:ss"),"Europe/Berlin"))
Here, I don't want to add new columns for the conversion (from_unixtime to to_utc_timestamp); I want to convert the existing columns in place.
Example Output
+------+---------------------+--------+--------------------+
| name| callStart_t |personid| callend_t |
+------+---------------------+--------+--------------------+
| Bindu|1970-01-13 04:40:02 | 2|1970-01-13 04:40:02 |
|Raphel|1970-01-20 06:16:04 | 5|1970-01-20 06:16:04 |
| Ram|1970-01-21 11:52:16 | 9|1970-01-21 11:52:16 |
+------+---------------------+--------+--------------------+
Note: The Timestamp column names are dynamic.
How can I convert each column dynamically?
Just use the same name for the column and it will replace it:
val newDf = df1.withColumn("callStart_t", to_utc_timestamp(from_unixtime($"callStart_t"/1000,"yyyy-MM-dd hh:mm:ss"),"Europe/Berlin"))
val newDf = df1.withColumn("callend_t", to_utc_timestamp(from_unixtime($"callend_t"/1000,"yyyy-MM-dd hh:mm:ss"),"Europe/Berlin"))
To make it dynamic, just use the relevant string. For example:
val colName = "callend_t"
val newDf = df.withColumn(colName , to_utc_timestamp(from_unixtime(col(colName)/1000,"yyyy-MM-dd hh:mm:ss"),"Europe/Berlin"))
For multiple columns you can do:
val columns=Seq("callend_t", "callStart_t")
val newDf = columns.foldLeft(df1){ case (curDf, colName) => curDf.withColumn(colName , to_utc_timestamp(from_unixtime(col(colName)/1000,"yyyy-MM-dd hh:mm:ss"),"Europe/Berlin"))}
Note: as stated in the comments, the division by 1000 is not needed.
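Since the column names arrive as a comma-separated string (the columnnames variable above), you can derive the Seq from that string. A small sketch, dropping the /1000 as per the note above:
import org.apache.spark.sql.functions.{col, from_unixtime, to_utc_timestamp}

// Split the dynamic input string into individual column names
val columns = columnnames.split(",").map(_.trim)

val newDf = columns.foldLeft(df1) { case (curDf, colName) =>
  curDf.withColumn(colName, to_utc_timestamp(from_unixtime(col(colName), "yyyy-MM-dd hh:mm:ss"), "Europe/Berlin"))
}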
So, I am trying to remove rows from df2 if the Value in df2 is "like" a key from df1. I'm not sure if this is possible, or if I might need to change df1 into a list first? It's a fairly small dataframe, but as you can see, we want to remove the 2nd and 3rd rows from df2 and just return back df2 without them.
df1
+--------------------+
| key|
+--------------------+
| Monthly Beginning|
| Annual Percentage|
+--------------------+
df2
+--------------------+--------------------------------+
| key| Value|
+--------------------+--------------------------------+
| Date| 1/1/2018|
| Date| Monthly Beginning on Tuesday|
| Number| Annual Percentage Rate for...|
| Number| 17.5|
+--------------------+--------------------------------+
I thought it would be something like this?
df.filter(($"Value" isin (keyDf.select("key") + "%"))).show(false)
That doesn't work, and I'm not surprised, but I think it helps show what I am trying to do in case my previous explanation was not sufficient. Thank you for your help ahead of time.
Convert the first dataframe df1 to a List[String], then create a UDF and apply the filter condition.
Spark shell:
import org.apache.spark.sql.functions._
//Converting df1 to list
val df1List=df1.select("key").map(row=>row.getString(0).toLowerCase).collect.toList
//Creating udf , spark stands for spark session
spark.udf.register("filterUDF", (str: String) => df1List.filter(str.toLowerCase.contains(_)).length)
//Applying filter
df2.filter("filterUDF(Value)=0").show
//output
+------+--------+
| key| Value|
+------+--------+
| Date|1/1/2018|
|Number| 17.5|
+------+--------+
Scala IDE:
val sparkSession=SparkSession.builder().master("local").appName("temp").getOrCreate()
val df1=sparkSession.read.format("csv").option("header","true").load("C:\\spark\\programs\\df1.csv")
val df2=sparkSession.read.format("csv").option("header","true").load("C:\\spark\\programs\\df2.csv")
import sparkSession.implicits._
val df1List=df1.select("key").map(row=>row.getString(0).toLowerCase).collect.toList
sparkSession.udf.register("filterUDF", (str: String) => df1List.filter(str.toLowerCase.contains(_)).length)
df2.filter("filterUDF(Value)=0").show
Convert df1 to List. Convert df2 to Dataset.
case class s(key: String, Value: String)
val df2Ds = df2.as[s]
Then we can use the filter method to filter out the records.
Somewhat like this.
def check(str: String): Boolean = {
  // Keep the row only if none of the (lower-cased) keys from df1List occur in the value
  !df1List.exists(key => str.toLowerCase.contains(key))
}
df2Ds.filter(s=>check(s.Value)).collect
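Both approaches above collect the keys of df1 to the driver. If you would rather stay entirely in the DataFrame API, a left anti join with a contains condition is another option; a sketch, assuming Spark 2.0+ for the left_anti join type:
import org.apache.spark.sql.functions.lower

// Keep only the rows of df2 whose Value contains none of the keys in df1
val filtered = df2.join(df1, lower(df2("Value")).contains(lower(df1("key"))), "left_anti")
filtered.show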
I have a Spark DataFrame with a column containing Vector values. The vectors are all n-dimensional, i.e., they all have the same length. I also have a list of column names Array("f1", "f2", "f3", ..., "fn"), each corresponding to one element in the vector.
some_columns... | Features
... | [0,1,0,..., 0]
to
some_columns... | f1 | f2 | f3 | ... | fn
... | 0 | 1 | 0 | ... | 0
What is the best way to achieve this? I thought of one way, which is to create a new DataFrame with createDataFrame(Row(Features), featureNameList) and then join it with the old one, but that requires a Spark context to use createDataFrame. I only want to transform the existing DataFrame. I also know about .withColumn("fi", value), but what do I do if n is large?
I'm new to Scala and Spark and couldn't find any good examples for this. I think this can be a common task. My particular case is that I used the CountVectorizer and wanted to recover each column individually for better readability instead of only having the vector result.
One way could be to convert the vector column to an array<double> and then use getItem to extract the individual elements.
import org.apache.spark.sql.functions._
import org.apache.spark.ml._
val df = Seq( (1 , linalg.Vectors.dense(1,0,1,1,0) ) ).toDF("id", "features")
//df: org.apache.spark.sql.DataFrame = [id: int, features: vector]
df.show
//+---+---------------------+
//|id |features |
//+---+---------------------+
//|1 |[1.0,0.0,1.0,1.0,0.0]|
//+---+---------------------+
// A UDF to convert VectorUDT to ArrayType
val vecToArray = udf( (xs: linalg.Vector) => xs.toArray )
// Add a ArrayType Column
val dfArr = df.withColumn("featuresArr" , vecToArray($"features") )
// Array of element names that need to be fetched
// ArrayIndexOutOfBoundsException is not checked here.
// The size of `elements` should equal the number of entries in the `features` vector
val elements = Array("f1", "f2", "f3", "f4", "f5")
// Create a SQL-like expression using the array
val sqlExpr = elements.zipWithIndex.map{ case (alias, idx) => col("featuresArr").getItem(idx).as(alias) }
// Extract Elements from dfArr
dfArr.select(sqlExpr : _*).show
//+---+---+---+---+---+
//| f1| f2| f3| f4| f5|
//+---+---+---+---+---+
//|1.0|0.0|1.0|1.0|0.0|
//+---+---+---+---+---+
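If you also want to keep the original columns next to the extracted ones (as in the desired layout from the question), you can prepend them to the select, e.g.:
// Keep the id column alongside the extracted feature columns
dfArr.select((col("id") +: sqlExpr) : _*).show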
Description
Given a dataframe df
id | date
---------------
1 | 2015-09-01
2 | 2015-09-01
1 | 2015-09-03
1 | 2015-09-04
2 | 2015-09-04
I want to create a running counter or index,
grouped by the same id and
sorted by date in that group,
thus
id | date | counter
--------------------------
1 | 2015-09-01 | 1
1 | 2015-09-03 | 2
1 | 2015-09-04 | 3
2 | 2015-09-01 | 1
2 | 2015-09-04 | 2
This is something I can achieve with window function, e.g.
val w = Window.partitionBy("id").orderBy("date")
val resultDF = df.select( df("id"), rowNumber().over(w) )
Unfortunately, Spark 1.4.1 does not support window functions for regular dataframes:
org.apache.spark.sql.AnalysisException: Could not resolve window function 'row_number'. Note that, using window functions currently requires a HiveContext;
Questions
How can I achieve the above computation on current Spark 1.4.1 without using window functions?
When will window functions for regular dataframes be supported in Spark?
Thanks!
You can use a HiveContext for local DataFrames as well and, unless you have a very good reason not to, it is probably a good idea anyway. It is the default SQLContext available in the spark-shell and pyspark shells (for now, sparkR seems to use a plain SQLContext), and its parser is recommended by the Spark SQL and DataFrame Guide.
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber
object HiveContextTest {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Hive Context")
val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
val df = sc.parallelize(
("foo", 1) :: ("foo", 2) :: ("bar", 1) :: ("bar", 2) :: Nil
).toDF("k", "v")
val w = Window.partitionBy($"k").orderBy($"v")
df.select($"k", $"v", rowNumber.over(w).alias("rn")).show
}
}
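For reference, on Spark 2.0+ no HiveContext is needed: a plain SparkSession supports window functions, and the function is called row_number. A minimal sketch:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val spark = SparkSession.builder().master("local[*]").appName("RowNumberExample").getOrCreate()
import spark.implicits._

val df = Seq((1, "2015-09-01"), (2, "2015-09-01"), (1, "2015-09-03"), (1, "2015-09-04"), (2, "2015-09-04"))
  .toDF("id", "date")

val w = Window.partitionBy($"id").orderBy($"date")
df.withColumn("counter", row_number().over(w)).show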
You can do this with RDDs. Personally I find the API for RDDs makes a lot more sense - I don't always want my data to be 'flat' like a dataframe.
val df = sqlContext.sql("select 1, '2015-09-01'"
).unionAll(sqlContext.sql("select 2, '2015-09-01'")
).unionAll(sqlContext.sql("select 1, '2015-09-03'")
).unionAll(sqlContext.sql("select 1, '2015-09-04'")
).unionAll(sqlContext.sql("select 2, '2015-09-04'"))
// dataframe as an RDD (of Row objects)
df.rdd
// grouping by the first column of the row
.groupBy(r => r(0))
// map each group - an Iterable[Row] - to a list and sort by the second column
.map(g => g._2.toList.sortBy(row => row(1).toString))
.collect()
The above gives a result like the following:
Array[List[org.apache.spark.sql.Row]] =
Array(
List([1,2015-09-01], [1,2015-09-03], [1,2015-09-04]),
List([2,2015-09-01], [2,2015-09-04]))
If you want the position within the 'group' as well, you can use zipWithIndex.
df.rdd.groupBy(r => r(0)).map(g =>
g._2.toList.sortBy(row => row(1).toString).zipWithIndex).collect()
Array[List[(org.apache.spark.sql.Row, Int)]] = Array(
List(([1,2015-09-01],0), ([1,2015-09-03],1), ([1,2015-09-04],2)),
List(([2,2015-09-01],0), ([2,2015-09-04],1)))
You could flatten this back to a simple List/Array of Row objects using flatMap, but if you need to perform anything on the 'group', that won't be a great idea.
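Following up on the flatten remark, a sketch of turning the zipped groups back into flat (id, date, counter) records, with the counter made 1-based to match the desired output:
df.rdd
  .groupBy(r => r(0))
  .flatMap { case (_, rows) =>
    // Sort each group by date, then emit a 1-based counter per row
    rows.toList.sortBy(row => row(1).toString).zipWithIndex.map {
      case (row, idx) => (row(0), row(1), idx + 1)
    }
  }
  .collect()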
The downside to using RDD like this is that it's tedious to convert from DataFrame to RDD and back again.
I totally agree that window functions for DataFrames are the way to go if you have Spark version >= 1.5. But if you are really stuck with an older version (e.g. 1.4.1), here is a hacky way to solve this:
val df = sc.parallelize((1, "2015-09-01") :: (2, "2015-09-01") :: (1, "2015-09-03") :: (1, "2015-09-04") :: (2, "2015-09-04") :: Nil)
  .toDF("id", "date")
val dfDuplicate = df.selectExpr("id as idDup", "date as dateDup")
val dfWithCounter = df.join(dfDuplicate, $"id" === $"idDup")
  .where($"dateDup" <= $"date")
  .groupBy($"id", $"date")
  .agg(count($"idDup").as("counter"))
  .select($"id", $"date", $"counter")
Now if you do dfWithCounter.show
You will get:
+---+----------+-------+
| id| date|counter|
+---+----------+-------+
| 1|2015-09-01| 1|
| 1|2015-09-04| 3|
| 1|2015-09-03| 2|
| 2|2015-09-01| 1|
| 2|2015-09-04| 2|
+---+----------+-------+
Note that date is not sorted, but the counter is correct. Also, you can reverse the direction of the counter by changing the <= to >= in the where statement.
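If you want the output sorted by id and date as in the expected table, just add an orderBy before showing it, e.g.:
dfWithCounter.orderBy($"id", $"date").show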