Scala: How can I replace value in Dataframes using scala - scala

For example I want to replace all numbers equal to 0.2 in a column to 0. How can I do that in Scala? Thanks
Edit:
|year| make|model| comment |blank|
|2012|Tesla| S | No comment | |
|1997| Ford| E350|Go get one now th...| |
|2015|Chevy| Volt| null | null|
This is my Dataframe I'm trying to change Tesla in make column to S

Spark 1.6.2, Java code (sorry), this will change every instance of Tesla to S for the entire dataframe without passing through an RDD:
dataframe.withColumn("make", when(col("make").equalTo("Tesla"), "S")
.otherwise(col("make")
);
Edited to add #marshall245 "otherwise" to ensure non-Tesla columns aren't converted to NULL.

Building off of the solution from #Azeroth2b. If you want to replace only a couple of items and leave the rest unchanged. Do the following. Without using the otherwise(...) method, the remainder of the column becomes null.
import org.apache.spark.sql.functions._
val newsdf =
sdf.withColumn(
"make",
when(col("make") === "Tesla", "S").otherwise(col("make"))
);
Old DataFrame
+-----+-----+
| make|model|
+-----+-----+
|Tesla| S|
| Ford| E350|
|Chevy| Volt|
+-----+-----+
New Datarame
+-----+-----+
| make|model|
+-----+-----+
| S| S|
| Ford| E350|
|Chevy| Volt|
+-----+-----+

This can be achieved in dataframes with user defined functions (udf).
import org.apache.spark.sql.functions._
val sqlcont = new org.apache.spark.sql.SQLContext(sc)
val df1 = sqlcont.jsonRDD(sc.parallelize(Array(
"""{"year":2012, "make": "Tesla", "model": "S", "comment": "No Comment", "blank": ""}""",
"""{"year":1997, "make": "Ford", "model": "E350", "comment": "Get one", "blank": ""}""",
"""{"year":2015, "make": "Chevy", "model": "Volt", "comment": "", "blank": ""}"""
)))
val makeSIfTesla = udf {(make: String) =>
if(make == "Tesla") "S" else make
}
df1.withColumn("make", makeSIfTesla(df1("make"))).show

Note:
As mentionned by Olivier Girardot, this answer is not optimized and the withColumn solution is the one to use (Azeroth2b answer)
Can not delete this answer as it has been accepted
Here is my take on this one:
val rdd = sc.parallelize(
List( (2012,"Tesla","S"), (1997,"Ford","E350"), (2015,"Chevy","Volt"))
)
val sqlContext = new SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
val dataframe = rdd.toDF()
dataframe.foreach(println)
dataframe.map(row => {
val row1 = row.getAs[String](1)
val make = if (row1.toLowerCase == "tesla") "S" else row1
Row(row(0),make,row(2))
}).collect().foreach(println)
//[2012,S,S]
//[1997,Ford,E350]
//[2015,Chevy,Volt]
You can actually use directly map on the DataFrame.
So you basically check the column 1 for the String tesla.
If it's tesla, use the value S for make else you the current value of column 1
Then build a tuple with all data from the row using the indexes (zero based) (Row(row(0),make,row(2))) in my example)
There is probably a better way to do it. I am not that familiar yet with the Spark umbrella

df2.na.replace("Name",Map("John" -> "Akshay","Cindy" -> "Jayita")).show()
replace in class DataFrameNaFunctions of type [T](col: String, replacement: Map[T,T])org.apache.spark.sql.DataFrame
For running this function you must have active spark object and dataframe with headers ON.

import org.apache.spark.sql.functions._
val base_optin_email = spark.read.option("header","true").option("delimiter",",").schema(schema_base_optin).csv(file_optin_email).where("CPF IS NOT NULL").
withColumn("CARD_KEY", lit(translate( translate(col("cpf"), ".", ""),"-","")))

Related

Capitalize the first letter of each word | Spark Scala

I have a table as below -
ID
City
Country
1
Frankfurt am main
Germany
The dataframe needs to be displayed by capitalizing the first letter of each word in the city i.e. output should look like this ->
ID
City
Country
1
Frankfurt Am Main
Germany
The solution I worked with is as below ->
df.map(x => x.getString(1).trim().split(' ').map(_.capitalize).mkString(" ")).show()
This only provides the City column aliased as "value".
How can I get all the columns with the above-mentioned transformation implemented?
You can use initcap function Docu
public static Column initcap(Column e)
Returns a new string column by converting the first letter of each
word to uppercase. Words are delimited by whitespace.
For example, "hello world" will become "Hello World".
Parameters:
e - (undocumented) Returns:
(undocumented) Since:
1.5.0
Sample code
import org.apache.spark.sql.functions._
val data = Seq(("1", "Frankfurt am main", "Germany"))
val df = data.toDF("Id", "City", "Country")
df.withColumn("City", initcap(col("City"))).show
And the output is:
+---+-----------------+-------+
| Id| City|Country|
+---+-----------------+-------+
| 1|Frankfurt Am Main|Germany|
+---+-----------------+-------+
Your sample code was returning only 1 column because that's exactly what you coded in your map. Take x(so your df), get from it column on index 1, do some transformations and return it.
You could do what you wanted with map as you can see in other answers but output of your map needs to include all columns.
Why in my answer i am not doing map? General rule is: when there is build in sql function use it instead of custom map/udf. Most of the time sql function will be better in terms of performance as it easier to optimize for Catalyst
You can call an udf and loop over all columns:
import spark.implicits._
val data = Seq(
(1, "Frankfurt am main", "just test", "Germany"),
(2, "should do this also", "test", "France"),
)
val df = spark.sparkContext.parallelize(data).toDF("ID", "City", "test", "Country")
val convertUDF = udf((value: String) => value.split(' ').map(_.capitalize).mkString(" "))
val dfCapitalized = df.columns.foldLeft(df) {
(df, column) => df.withColumn(column, convertUDF(col(column)))
}
dfCapitalized.show(false)
+---+-------------------+---------+-------+
|ID |City |test |Country|
+---+-------------------+---------+-------+
|1 |Frankfurt Am Main |Just Test|Germany|
|2 |Should Do This Also|Test |France |
+---+-------------------+---------+-------+
You could map over your Dataframe, and then simply use normal Scala functions to capitalize. This gives you quite some flexibility in which exact transformations you want to do, giving you the Scala language to your disposal.
Something like this:
import spark.implicits._
val df = Seq(
(1, "Frankfurt am main", "Germany")
).toDF("ID", "City", "Country")
val output = df.map{
row => (
row.getInt(0),
row.getString(1).split(' ').map(_.capitalize).mkString(" "),
row.getString(2)
)
}
output.show
+---+-----------------+-------+
| _1| _2| _3|
+---+-----------------+-------+
| 1|Frankfurt Am Main|Germany|
+---+-----------------+-------+
Inside of the map function, we're outputting a tuple with the same amount of elements as the amount of columns you want to end up with.
Hope this helps!

Spark Dataframe extracting columns based dynamically selected columns

Schema of input dataframe
- employeeKey (int)
- employeeTypeId (string)
- loginDate (string)
- employeeDetailsJson (string)
{"Grade":"100","ValidTill":"2021-12-01","Supervisor":"Alex","Vendor":"technicia","HourlyRate":29}
For Perm employees , some attributes are available and some not. Same for Contracting Employees.
So looking to find an efficient way to build dataframe based on only selected columns, as against transforming all columns and select the ones which I need.
Also please advise this is the best way to extract values from json string based on a key. As the attributes in the string are dynamic, I can not build StructSchema based on it. So using good old get_json_object.
(spark 2.45 and will use spark 3 in future)
val dfSelectColumns=List("Employee-Key", "Employee-Type","Login-Date","cont.Vendor-Name","cont.Hourly-Rate" )
//val dfSelectColumns=List("Employee-Key", "Employee-Type","Login-Date","perm.Level","perm-Validity","perm.Supervisor" )
val resultDF = inputDF.get
.withColumn("Employee-Key", col("employeeKey"))
.withColumn("Employee-Type", when(col("employeeTypeId") === 1, "Permanent")
.when(col("employeeTypeId") === 2, "Contractor")
.otherwise("unknown"))
.withColumn("Login-Date", to_utc_timestamp(to_timestamp(col("loginDate"), "yyyy-MM-dd'T'HH:mm:ss"), ""America/Chicago""))
.withColumn("perm.Level", get_json_object(col("employeeDetailsJson"), "$.Grade"))
.withColumn("perm.Validity", get_json_object(col("employeeDetailsJson"), "$.ValidTill"))
.withColumn("perm.SuperVisor", get_json_object(col("employeeDetailsJson"), "$.Supervisor"))
.withColumn("cont.Vendor-Name", get_json_object(col("employeeDetailsJson"), "$.Vendor"))
.withColumn("cont.Hourly-Rate", get_json_object(col("employeeDetailsJson"), "$.HourlyRate"))
.select(dfSelectColumns.head, dfSelectColumns.tail: _*)
I see that you have 2 schemas, one for Permanent and another for Contractor. You can have 2 schemas.
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val schemaBase = new StructType().add("Employee-Key", IntegerType).add("Employee-Type", StringType).add("Login-Date", DateType)
val schemaPerm = schemaBase.add("Level", IntegerType).add("Validity", StringType)// Permanent attributes
val schemaCont = schemaBase.add("Vendor", StringType).add("HourlyRate", DoubleType) // Contractor attributes
Then you can use the 2 schemas to load the data into dataframe.
For Permanent Employee:
val jsonPermDf = Seq( // Construct sample dataframe
(2, """{"Employee-Key":2, "Employee-Type":"Permanent", "Login-Date":"2021-11-01", "Level":3, "Validity":"ok"}""")
, (3, """{"Employee-Key":3, "Employee-Type":"Permanent", "Login-Date":"2020-10-01", "Level":2, "Validity":"ok-yes"}""")
).toDF("key", "raw_json")
val permDf = jsonPermDf.withColumn("data", from_json(col("raw_json"),schemaPerm)).select($"data.*")
permDf.show()
For Contractor:
val jsonContDf = Seq( // Construct sample dataframe
(1, """{"Employee-Key":1, "Employee-Type":"Contractor", "Login-Date":"2021-12-01", "Vendor":"technicia", "HourlyRate":29}""")
, (4, """{"Employee-Key":4, "Employee-Type":"Contractor", "Login-Date":"2019-09-01", "Vendor":"Minis", "HourlyRate":35}""")
).toDF("key", "raw_json")
val contDf = jsonContDf.withColumn("data", from_json(col("raw_json"),schemaCont)).select($"data.*")
contDf.show()
This is the result datafrme for Permanent:
+------------+-------------+----------+-----+--------+
|Employee-Key|Employee-Type|Login-Date|Level|Validity|
+------------+-------------+----------+-----+--------+
| 2| Permanent|2021-11-01| 3| ok|
| 3| Permanent|2020-10-01| 2| ok-yes|
+------------+-------------+----------+-----+--------+
This is the result dataframe for Contractor:
+------------+-------------+----------+---------+----------+
|Employee-Key|Employee-Type|Login-Date| Vendor|HourlyRate|
+------------+-------------+----------+---------+----------+
| 1| Contractor|2021-12-01|technicia| 29.0|
| 4| Contractor|2019-09-01| Minis| 35.0|
+------------+-------------+----------+---------+----------+
If the schema of the JSON in employeeDetailsJson is unstable, you can still parse it into Map(String, String) type using from_json function with schema map<string,string>. Then you can explode the map column and pivot to get keys as columns.
Example:
val df1 = df.withColumn(
"employeeDetails",
from_json(col("employeeDetailsJson"), "map<string,string>")
).select(
col("employeeKey"),
col("employeeTypeId"),
col("loginDate"),
explode("employeeDetails")
).groupBy("employeeKey", "employeeTypeId", "loginDate")
.pivot("key")
.agg(first("value"))
df1.show()
//+-----------+--------------+---------------------+-----+----------+----------+----------+---------+
//|employeeKey|employeeTypeId|loginDate |Grade|HourlyRate|Supervisor|ValidTill |Vendor |
//+-----------+--------------+---------------------+-----+----------+----------+----------+---------+
//|1 |1 |2021-02-05'T'21:28:06|100 |29 |Alex |2021-12-01|technicia|
//+-----------+--------------+---------------------+-----+----------+----------+----------+---------+

Spark How can I filter out rows that contain char sequences from another dataframe?

So, I am trying to remove rows from df2 if the Value in df2 is "like" a key from df1. I'm not sure if this is possible, or if I might need to change df1 into a list first? It's a fairly small dataframe, but as you can see, we want to remove the 2nd and 3rd rows from df2 and just return back df2 without them.
df1
+--------------------+
| key|
+--------------------+
| Monthly Beginning|
| Annual Percentage|
+--------------------+
df2
+--------------------+--------------------------------+
| key| Value|
+--------------------+--------------------------------+
| Date| 1/1/2018|
| Date| Monthly Beginning on Tuesday|
| Number| Annual Percentage Rate for...|
| Number| 17.5|
+--------------------+--------------------------------+
I thought it would be something like this?
df.filter(($"Value" isin (keyDf.select("key") + "%"))).show(false)
But that doesn't work and I'm not surprised, but I think it helps show what I am trying to do if my previous explanation was not sufficient enough. Thank you for your help ahead of time.
Convert the first dataframe df1 to List[String] and then create one udf and apply filter condition
Spark-shell-
import org.apache.spark.sql.functions._
//Converting df1 to list
val df1List=df1.select("key").map(row=>row.getString(0).toLowerCase).collect.toList
//Creating udf , spark stands for spark session
spark.udf.register("filterUDF", (str: String) => df1List.filter(str.toLowerCase.contains(_)).length)
//Applying filter
df2.filter("filterUDF(Value)=0").show
//output
+------+--------+
| key| Value|
+------+--------+
| Date|1/1/2018|
|Number| 17.5|
+------+--------+
Scala-IDE -
val sparkSession=SparkSession.builder().master("local").appName("temp").getOrCreate()
val df1=sparkSession.read.format("csv").option("header","true").load("C:\\spark\\programs\\df1.csv")
val df2=sparkSession.read.format("csv").option("header","true").load("C:\\spark\\programs\\df2.csv")
import sparkSession.implicits._
val df1List=df1.select("key").map(row=>row.getString(0).toLowerCase).collect.toList
sparkSession.udf.register("filterUDF", (str: String) => df1List.filter(str.toLowerCase.contains(_)).length)
df2.filter("filterUDF(Value)=0").show
Convert df1 to List. Convert df2 to Dataset.
case class s(key:String,Value:String)
df2Ds = df2.as[s]
Then we can use the filter method to filter out the records.
Somewhat like this.
def check(str:String):Boolean = {
var i = ""
for(i<-df1List)
{
if(str.contains(i))
return false
}
return true
}
df2Ds.filter(s=>check(s.Value)).collect

Adding StringType column to existing Spark DataFrame and then applying default values

Scala 2.10 here using Spark 1.6.2. I have a similar (but not the same) question as this one, however, the accepted answer is not an SSCCE and assumes a certain amount of "upfront knowledge" about Spark; and therefore I can't reproduce it or make sense of it. More importantly, that question is also just limited to adding a new column to an existing dataframe, whereas I need to add a column as well as a value for all existing rows in the dataframe.
So I want to add a column to an existing Spark DataFrame, and then apply an initial ('default') value for that new column to all rows.
val json : String = """{ "x": true, "y": "not true" }"""
val rdd = sparkContext.parallelize(Seq(json))
val jsonDF = sqlContext.read.json(rdd)
jsonDF.show()
When I run that I get this following as output (via .show()):
+----+--------+
| x| y|
+----+--------+
|true|not true|
+----+--------+
Now I want to add a new field to jsonDF, after it's created and without modifying the json string, such that the resultant DF would look like this:
+----+--------+----+
| x| y| z|
+----+--------+----+
|true|not true| red|
+----+--------+----+
Meaning, I want to add a new "z" column to the DF, of type StringType, and then default all rows to contain a z-value of "red".
From that other question I have pieced the following pseudo-code together:
val json : String = """{ "x": true, "y": "not true" }"""
val rdd = sparkContext.parallelize(Seq(json))
val jsonDF = sqlContext.read.json(rdd)
//jsonDF.show()
val newDF = jsonDF.withColumn("z", jsonDF("col") + 1)
newDF.show()
But when I run this, I get a compiler error on that .withColumn(...) method:
org.apache.spark.sql.AnalysisException: Cannot resolve column name "col" among (x, y);
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
at org.apache.spark.sql.DataFrame.col(DataFrame.scala:664)
at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:652)
I also don't see any API methods that would allow me to set "red" as the default value. Any ideas as to where I'm going awry?
You can use lit function. First you have to import it
import org.apache.spark.sql.functions.lit
and use it as shown below
jsonDF.withColumn("z", lit("red"))
Type of the column will be inferred automatically.

Spark and SparkSQL: How to imitate window function?

Description
Given a dataframe df
id | date
---------------
1 | 2015-09-01
2 | 2015-09-01
1 | 2015-09-03
1 | 2015-09-04
2 | 2015-09-04
I want to create a running counter or index,
grouped by the same id and
sorted by date in that group,
thus
id | date | counter
--------------------------
1 | 2015-09-01 | 1
1 | 2015-09-03 | 2
1 | 2015-09-04 | 3
2 | 2015-09-01 | 1
2 | 2015-09-04 | 2
This is something I can achieve with window function, e.g.
val w = Window.partitionBy("id").orderBy("date")
val resultDF = df.select( df("id"), rowNumber().over(w) )
Unfortunately, Spark 1.4.1 does not support window functions for regular dataframes:
org.apache.spark.sql.AnalysisException: Could not resolve window function 'row_number'. Note that, using window functions currently requires a HiveContext;
Questions
How can I achieve the above computation on current Spark 1.4.1 without using window functions?
When will window functions for regular dataframes be supported in Spark?
Thanks!
You can use HiveContext for local DataFrames as well and, unless you have a very good reason not to, it is probably a good idea anyway. It is a default SQLContext available in spark-shell and pyspark shell (as for now sparkR seems to use plain SQLContext) and its parser is recommended by Spark SQL and DataFrame Guide.
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber
object HiveContextTest {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Hive Context")
val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
val df = sc.parallelize(
("foo", 1) :: ("foo", 2) :: ("bar", 1) :: ("bar", 2) :: Nil
).toDF("k", "v")
val w = Window.partitionBy($"k").orderBy($"v")
df.select($"k", $"v", rowNumber.over(w).alias("rn")).show
}
}
You can do this with RDDs. Personally I find the API for RDDs makes a lot more sense - I don't always want my data to be 'flat' like a dataframe.
val df = sqlContext.sql("select 1, '2015-09-01'"
).unionAll(sqlContext.sql("select 2, '2015-09-01'")
).unionAll(sqlContext.sql("select 1, '2015-09-03'")
).unionAll(sqlContext.sql("select 1, '2015-09-04'")
).unionAll(sqlContext.sql("select 2, '2015-09-04'"))
// dataframe as an RDD (of Row objects)
df.rdd
// grouping by the first column of the row
.groupBy(r => r(0))
// map each group - an Iterable[Row] - to a list and sort by the second column
.map(g => g._2.toList.sortBy(row => row(1).toString))
.collect()
The above gives a result like the following:
Array[List[org.apache.spark.sql.Row]] =
Array(
List([1,2015-09-01], [1,2015-09-03], [1,2015-09-04]),
List([2,2015-09-01], [2,2015-09-04]))
If you want the position within the 'group' as well, you can use zipWithIndex.
df.rdd.groupBy(r => r(0)).map(g =>
g._2.toList.sortBy(row => row(1).toString).zipWithIndex).collect()
Array[List[(org.apache.spark.sql.Row, Int)]] = Array(
List(([1,2015-09-01],0), ([1,2015-09-03],1), ([1,2015-09-04],2)),
List(([2,2015-09-01],0), ([2,2015-09-04],1)))
You could flatten this back to a simple List/Array of Row objects using FlatMap, but if you need to perform anything on the 'group' that won't be a great idea.
The downside to using RDD like this is that it's tedious to convert from DataFrame to RDD and back again.
I totally agree that Window functions for DataFrames are the way to go if you have Spark version (>=)1.5. But if you are really stuck with an older version(e.g 1.4.1), here is a hacky way to solve this
val df = sc.parallelize((1, "2015-09-01") :: (2, "2015-09-01") :: (1, "2015-09-03") :: (1, "2015-09-04") :: (1, "2015-09-04") :: Nil)
.toDF("id", "date")
val dfDuplicate = df.selecExpr("id as idDup", "date as dateDup")
val dfWithCounter = df.join(dfDuplicate,$"id"===$"idDup")
.where($"date"<=$"dateDup")
.groupBy($"id", $"date")
.agg($"id", $"date", count($"idDup").as("counter"))
.select($"id",$"date",$"counter")
Now if you do dfWithCounter.show
You will get:
+---+----------+-------+
| id| date|counter|
+---+----------+-------+
| 1|2015-09-01| 1|
| 1|2015-09-04| 3|
| 1|2015-09-03| 2|
| 2|2015-09-01| 1|
| 2|2015-09-04| 2|
+---+----------+-------+
Note that date is not sorted, but the counter is correct. Also you can change the ordering of the counter by changing the <= to >= in the where statement.