Capitalize the first letter of each word | Spark Scala - scala

I have a table as below -
ID
City
Country
1
Frankfurt am main
Germany
The dataframe needs to be displayed by capitalizing the first letter of each word in the city i.e. output should look like this ->
ID
City
Country
1
Frankfurt Am Main
Germany
The solution I worked with is as below ->
df.map(x => x.getString(1).trim().split(' ').map(_.capitalize).mkString(" ")).show()
This only provides the City column aliased as "value".
How can I get all the columns with the above-mentioned transformation implemented?

You can use initcap function Docu
public static Column initcap(Column e)
Returns a new string column by converting the first letter of each
word to uppercase. Words are delimited by whitespace.
For example, "hello world" will become "Hello World".
Parameters:
e - (undocumented) Returns:
(undocumented) Since:
1.5.0
Sample code
import org.apache.spark.sql.functions._
val data = Seq(("1", "Frankfurt am main", "Germany"))
val df = data.toDF("Id", "City", "Country")
df.withColumn("City", initcap(col("City"))).show
And the output is:
+---+-----------------+-------+
| Id| City|Country|
+---+-----------------+-------+
| 1|Frankfurt Am Main|Germany|
+---+-----------------+-------+
Your sample code was returning only 1 column because that's exactly what you coded in your map. Take x(so your df), get from it column on index 1, do some transformations and return it.
You could do what you wanted with map as you can see in other answers but output of your map needs to include all columns.
Why in my answer i am not doing map? General rule is: when there is build in sql function use it instead of custom map/udf. Most of the time sql function will be better in terms of performance as it easier to optimize for Catalyst

You can call an udf and loop over all columns:
import spark.implicits._
val data = Seq(
(1, "Frankfurt am main", "just test", "Germany"),
(2, "should do this also", "test", "France"),
)
val df = spark.sparkContext.parallelize(data).toDF("ID", "City", "test", "Country")
val convertUDF = udf((value: String) => value.split(' ').map(_.capitalize).mkString(" "))
val dfCapitalized = df.columns.foldLeft(df) {
(df, column) => df.withColumn(column, convertUDF(col(column)))
}
dfCapitalized.show(false)
+---+-------------------+---------+-------+
|ID |City |test |Country|
+---+-------------------+---------+-------+
|1 |Frankfurt Am Main |Just Test|Germany|
|2 |Should Do This Also|Test |France |
+---+-------------------+---------+-------+

You could map over your Dataframe, and then simply use normal Scala functions to capitalize. This gives you quite some flexibility in which exact transformations you want to do, giving you the Scala language to your disposal.
Something like this:
import spark.implicits._
val df = Seq(
(1, "Frankfurt am main", "Germany")
).toDF("ID", "City", "Country")
val output = df.map{
row => (
row.getInt(0),
row.getString(1).split(' ').map(_.capitalize).mkString(" "),
row.getString(2)
)
}
output.show
+---+-----------------+-------+
| _1| _2| _3|
+---+-----------------+-------+
| 1|Frankfurt Am Main|Germany|
+---+-----------------+-------+
Inside of the map function, we're outputting a tuple with the same amount of elements as the amount of columns you want to end up with.
Hope this helps!

Related

Spark Dataframe extracting columns based dynamically selected columns

Schema of input dataframe
- employeeKey (int)
- employeeTypeId (string)
- loginDate (string)
- employeeDetailsJson (string)
{"Grade":"100","ValidTill":"2021-12-01","Supervisor":"Alex","Vendor":"technicia","HourlyRate":29}
For Perm employees , some attributes are available and some not. Same for Contracting Employees.
So looking to find an efficient way to build dataframe based on only selected columns, as against transforming all columns and select the ones which I need.
Also please advise this is the best way to extract values from json string based on a key. As the attributes in the string are dynamic, I can not build StructSchema based on it. So using good old get_json_object.
(spark 2.45 and will use spark 3 in future)
val dfSelectColumns=List("Employee-Key", "Employee-Type","Login-Date","cont.Vendor-Name","cont.Hourly-Rate" )
//val dfSelectColumns=List("Employee-Key", "Employee-Type","Login-Date","perm.Level","perm-Validity","perm.Supervisor" )
val resultDF = inputDF.get
.withColumn("Employee-Key", col("employeeKey"))
.withColumn("Employee-Type", when(col("employeeTypeId") === 1, "Permanent")
.when(col("employeeTypeId") === 2, "Contractor")
.otherwise("unknown"))
.withColumn("Login-Date", to_utc_timestamp(to_timestamp(col("loginDate"), "yyyy-MM-dd'T'HH:mm:ss"), ""America/Chicago""))
.withColumn("perm.Level", get_json_object(col("employeeDetailsJson"), "$.Grade"))
.withColumn("perm.Validity", get_json_object(col("employeeDetailsJson"), "$.ValidTill"))
.withColumn("perm.SuperVisor", get_json_object(col("employeeDetailsJson"), "$.Supervisor"))
.withColumn("cont.Vendor-Name", get_json_object(col("employeeDetailsJson"), "$.Vendor"))
.withColumn("cont.Hourly-Rate", get_json_object(col("employeeDetailsJson"), "$.HourlyRate"))
.select(dfSelectColumns.head, dfSelectColumns.tail: _*)
I see that you have 2 schemas, one for Permanent and another for Contractor. You can have 2 schemas.
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val schemaBase = new StructType().add("Employee-Key", IntegerType).add("Employee-Type", StringType).add("Login-Date", DateType)
val schemaPerm = schemaBase.add("Level", IntegerType).add("Validity", StringType)// Permanent attributes
val schemaCont = schemaBase.add("Vendor", StringType).add("HourlyRate", DoubleType) // Contractor attributes
Then you can use the 2 schemas to load the data into dataframe.
For Permanent Employee:
val jsonPermDf = Seq( // Construct sample dataframe
(2, """{"Employee-Key":2, "Employee-Type":"Permanent", "Login-Date":"2021-11-01", "Level":3, "Validity":"ok"}""")
, (3, """{"Employee-Key":3, "Employee-Type":"Permanent", "Login-Date":"2020-10-01", "Level":2, "Validity":"ok-yes"}""")
).toDF("key", "raw_json")
val permDf = jsonPermDf.withColumn("data", from_json(col("raw_json"),schemaPerm)).select($"data.*")
permDf.show()
For Contractor:
val jsonContDf = Seq( // Construct sample dataframe
(1, """{"Employee-Key":1, "Employee-Type":"Contractor", "Login-Date":"2021-12-01", "Vendor":"technicia", "HourlyRate":29}""")
, (4, """{"Employee-Key":4, "Employee-Type":"Contractor", "Login-Date":"2019-09-01", "Vendor":"Minis", "HourlyRate":35}""")
).toDF("key", "raw_json")
val contDf = jsonContDf.withColumn("data", from_json(col("raw_json"),schemaCont)).select($"data.*")
contDf.show()
This is the result datafrme for Permanent:
+------------+-------------+----------+-----+--------+
|Employee-Key|Employee-Type|Login-Date|Level|Validity|
+------------+-------------+----------+-----+--------+
| 2| Permanent|2021-11-01| 3| ok|
| 3| Permanent|2020-10-01| 2| ok-yes|
+------------+-------------+----------+-----+--------+
This is the result dataframe for Contractor:
+------------+-------------+----------+---------+----------+
|Employee-Key|Employee-Type|Login-Date| Vendor|HourlyRate|
+------------+-------------+----------+---------+----------+
| 1| Contractor|2021-12-01|technicia| 29.0|
| 4| Contractor|2019-09-01| Minis| 35.0|
+------------+-------------+----------+---------+----------+
If the schema of the JSON in employeeDetailsJson is unstable, you can still parse it into Map(String, String) type using from_json function with schema map<string,string>. Then you can explode the map column and pivot to get keys as columns.
Example:
val df1 = df.withColumn(
"employeeDetails",
from_json(col("employeeDetailsJson"), "map<string,string>")
).select(
col("employeeKey"),
col("employeeTypeId"),
col("loginDate"),
explode("employeeDetails")
).groupBy("employeeKey", "employeeTypeId", "loginDate")
.pivot("key")
.agg(first("value"))
df1.show()
//+-----------+--------------+---------------------+-----+----------+----------+----------+---------+
//|employeeKey|employeeTypeId|loginDate |Grade|HourlyRate|Supervisor|ValidTill |Vendor |
//+-----------+--------------+---------------------+-----+----------+----------+----------+---------+
//|1 |1 |2021-02-05'T'21:28:06|100 |29 |Alex |2021-12-01|technicia|
//+-----------+--------------+---------------------+-----+----------+----------+----------+---------+

Spark generate a list of column names that contains(SQL LIKE) a string

This one below is a simple syntax to search for a string in a particular column uisng SQL Like functionality.
val dfx = df.filter($"name".like(s"%${productName}%"))
The questions is How do I grab each and every column NAME that contained the particular string in its VALUES and generate a new column with a list of those "column names" for every row.
So far this is the approach I took but stuck as I cant use spark-sql "Like" function inside a UDF.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types._
import spark.implicits._
val df1 = Seq(
(0, "mango", "man", "dit"),
(1, "i-man", "man2", "mane"),
(2, "iman", "mango", "ho"),
(3, "dim", "kim", "sim")
).toDF("id", "col1", "col2", "col3")
val df2 = df1.columns.foldLeft(df1) {
(acc: DataFrame, colName: String) =>
acc.withColumn(colName, concat(lit(colName + "="), col(colName)))
}
val df3 = df2.withColumn("merged_cols", split(concat_ws("X", df2.columns.map(c=> col(c)):_*), "X"))
Here is a sample output. Note that here there are only 3 columns but in the real job I'll be reading multiple tables which can contain dynamic number of columns.
+--------------------------------------------+
|id | col1| col2| col3| merged_cols
+--------------------------------------------+
0 | mango| man | dit | col1, col2
1 | i-man| man2 | mane | col1, col2, col3
2 | iman | mango| ho | col1, col2
3 | dim | kim | sim|
+--------------------------------------------+
This can be done using a foldLeft over the columns together with when and otherwise:
val e = "%man%"
val df2 = df1.columns.foldLeft(df.withColumn("merged_cols", lit(""))){(df, c) =>
df.withColumn("merged_cols", when(col(c).like(e), concat($"merged_cols", lit(s"$c,"))).otherwise($"merged_cols"))}
.withColumn("merged_cols", expr("substring(merged_cols, 1, length(merged_cols)-1)"))
All columns that satisfies the condition e will be appended to the string in the merged_cols column. Note that the column must exist for the first append to work so it is added (containing an empty string) to the dataframe when sent into the foldLeft.
The last row in the code simply removes the extra , that is added in the end. If you want the result as an array instead, simply adding .withColumn("merged_cols", split($"merged_cols", ",")) would work.
An alternative appraoch is to instead use an UDF. This could be preferred when dealing with many columns since foldLeft will create multiple dataframe copies. Here regex is used (not the SQL like since that operates on whole columns).
val e = ".*man.*"
val concat_cols = udf((vals: Seq[String], names: Seq[String]) => {
vals.zip(names).filter{case (v, n) => v.matches(e)}.map(_._2)
})
val df2 = df.withColumn("merged_cols", concat_cols(array(df.columns.map(col(_)): _*), typedLit(df.columns.toSeq)))
Note: typedLit can be used in Spark versions 2.2+, when using older versions use array(df.columns.map(lit(_)): _*) instead.

Return combined Dataset after joinWith in Spark Scala

Given the below two Spark Datasets, flights and capitals, what would be the most efficient way to return combined (i.e. "joined") result without converting first to a DataFrame or writing out all the columns out by name in a .select() method? I know, for example, that I can access either tuple with (e.g. .map(x => x._1) or use the * operator with:
result.select("_1.*","_2.*")
But the latter may result in duplicate column names and I'm hoping for a cleaner solution.
Thank you for your help.
case class Flights(tripNumber: Int, destination: String)
case class Capitals(state: String, capital: String)
val flights = Seq(
(55, "New York"),
(3, "Georgia"),
(12, "Oregon")
).toDF("tripNumber","destination").as[Flights]
val capitals = Seq(
("New York", "Albany"),
("Georgia", "Atlanta"),
("Oregon", "Salem")
).toDF("state","capital").as[Capitals]
val result = flights.joinWith(capitals,flights.col("destination")===capitals.col("state"))
There are 2 options, but you will have to use join instead of joinWith:
That is the best part of the Dataset API, is to drop one of the join columns
, thus no need to repeat projection columns in a select: val result = flights.join(capitals,flights("destination")===capitals("state")).drop(capitals("state"))
rename join column to be the same in both datasets and use a slightly different way of specifying the join: val result = flights.join(capitals.withColumnRenamed("state", "destination"), Seq("destination"))
Output:
result.show
+-----------+----------+-------+
|destination|tripNumber|capital|
+-----------+----------+-------+
| New York| 55| Albany|
| Georgia| 3|Atlanta|
| Oregon| 12| Salem|
+-----------+----------+-------+

How to create a DataFrame from List?

I want to create DataFrame df that should look as simple as this:
+----------+----------+
| timestamp| col2|
+----------+----------+
|2018-01-11| 123|
+----------+----------+
This is what I do:
val values = List(List("timestamp", "2018-01-11"),List("col2","123")).map(x =>(x(0), x(1)))
val df = values.toDF
df.show()
And this is what I get:
+---------+----------+
| _1| _2|
+---------+----------+
|timestamp|2018-01-11|
| col2| 123|
+---------+----------+
What's wrong here?
Use
val df = List(("2018-01-11", "123")).toDF("timestamp", "col2")
toDF expects the input list to contain one entry per resulting Row
Each such entry should be a case class or a tuple
It does not expect column "headers" in the data itself (to name columns - pass names as arguments of toDF)
If you don't know the names of the columns statically you can use following syntax sugar
.toDF( columnNames: _*)
Where columnNames is the List with the names.
EDIT (sorry, I missed that you had the headers glued to each column).
Maybe something like this could work:
val values = List(
List("timestamp", "2018-01-11"),
List("col2","123")
)
val heads = values.map(_.head) // extracts headers of columns
val cols = values.map(_.tail) // extracts columns without headers
val rows = cols(0).zip(cols(1)) // zips two columns into list of rows
rows.toDF(heads: _*)
This would work if the "values" contained two longer lists, but it does not generalize to more lists.

Scala: How can I replace value in Dataframes using scala

For example I want to replace all numbers equal to 0.2 in a column to 0. How can I do that in Scala? Thanks
Edit:
|year| make|model| comment |blank|
|2012|Tesla| S | No comment | |
|1997| Ford| E350|Go get one now th...| |
|2015|Chevy| Volt| null | null|
This is my Dataframe I'm trying to change Tesla in make column to S
Spark 1.6.2, Java code (sorry), this will change every instance of Tesla to S for the entire dataframe without passing through an RDD:
dataframe.withColumn("make", when(col("make").equalTo("Tesla"), "S")
.otherwise(col("make")
);
Edited to add #marshall245 "otherwise" to ensure non-Tesla columns aren't converted to NULL.
Building off of the solution from #Azeroth2b. If you want to replace only a couple of items and leave the rest unchanged. Do the following. Without using the otherwise(...) method, the remainder of the column becomes null.
import org.apache.spark.sql.functions._
val newsdf =
sdf.withColumn(
"make",
when(col("make") === "Tesla", "S").otherwise(col("make"))
);
Old DataFrame
+-----+-----+
| make|model|
+-----+-----+
|Tesla| S|
| Ford| E350|
|Chevy| Volt|
+-----+-----+
New Datarame
+-----+-----+
| make|model|
+-----+-----+
| S| S|
| Ford| E350|
|Chevy| Volt|
+-----+-----+
This can be achieved in dataframes with user defined functions (udf).
import org.apache.spark.sql.functions._
val sqlcont = new org.apache.spark.sql.SQLContext(sc)
val df1 = sqlcont.jsonRDD(sc.parallelize(Array(
"""{"year":2012, "make": "Tesla", "model": "S", "comment": "No Comment", "blank": ""}""",
"""{"year":1997, "make": "Ford", "model": "E350", "comment": "Get one", "blank": ""}""",
"""{"year":2015, "make": "Chevy", "model": "Volt", "comment": "", "blank": ""}"""
)))
val makeSIfTesla = udf {(make: String) =>
if(make == "Tesla") "S" else make
}
df1.withColumn("make", makeSIfTesla(df1("make"))).show
Note:
As mentionned by Olivier Girardot, this answer is not optimized and the withColumn solution is the one to use (Azeroth2b answer)
Can not delete this answer as it has been accepted
Here is my take on this one:
val rdd = sc.parallelize(
List( (2012,"Tesla","S"), (1997,"Ford","E350"), (2015,"Chevy","Volt"))
)
val sqlContext = new SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
val dataframe = rdd.toDF()
dataframe.foreach(println)
dataframe.map(row => {
val row1 = row.getAs[String](1)
val make = if (row1.toLowerCase == "tesla") "S" else row1
Row(row(0),make,row(2))
}).collect().foreach(println)
//[2012,S,S]
//[1997,Ford,E350]
//[2015,Chevy,Volt]
You can actually use directly map on the DataFrame.
So you basically check the column 1 for the String tesla.
If it's tesla, use the value S for make else you the current value of column 1
Then build a tuple with all data from the row using the indexes (zero based) (Row(row(0),make,row(2))) in my example)
There is probably a better way to do it. I am not that familiar yet with the Spark umbrella
df2.na.replace("Name",Map("John" -> "Akshay","Cindy" -> "Jayita")).show()
replace in class DataFrameNaFunctions of type [T](col: String, replacement: Map[T,T])org.apache.spark.sql.DataFrame
For running this function you must have active spark object and dataframe with headers ON.
import org.apache.spark.sql.functions._
val base_optin_email = spark.read.option("header","true").option("delimiter",",").schema(schema_base_optin).csv(file_optin_email).where("CPF IS NOT NULL").
withColumn("CARD_KEY", lit(translate( translate(col("cpf"), ".", ""),"-","")))