I need to map month names to month numbers in Azure Databricks using Scala.
I have a column named PERIOD that contains month names (Jan, Feb, Mar, ..., Nov, Dec), and I want to replace each month name with the corresponding month number (01, 02, 03, ..., 11, 12):
"Jan" -> 01,
"Feb" -> 02,
"Mar" -> 03,
"Apr" -> 04,
"May" -> 05,
"Jun" -> 06,
"Jul" -> 07,
"Aug" -> 08,
"Sep" -> 09,
"Oct" -> 10,
"Nov" -> 11,
"Dec" -> 12
I'm new to Scala and Azure Databricks. I tried a mapping approach but am not getting the desired result.
I have a dataframe with a column called month_in_words that contains month names such as Jan, Feb, Mar.
To convert these month names to month numbers, first create a map where the key is the month name and the value is the month number as a string (since we need the month number as 01, not 1):
val monthNumber = Map(
  "Jan" -> "01",
  "Feb" -> "02",
  "Mar" -> "03",
  "Apr" -> "04",
  "May" -> "05",
  "Jun" -> "06",
  "Jul" -> "07",
  "Aug" -> "08",
  "Sep" -> "09",
  "Oct" -> "10",
  "Nov" -> "11",
  "Dec" -> "12"
)
Now you can use the following code to convert the month names to month numbers:
import spark.implicits._ // needed for toDF on an RDD of tuples

val final_df = df.rdd.map(f => {
  val number = monthNumber(f.getString(0)) // look up the month number by name
  (f.getString(0), number)                 // keep both the name and the number
}).toDF("month_in_words", "month_in_number")
// display(final_df)
To get only the month-number column directly:
val final_df = df.rdd.map(f => monthNumber(f.getString(0))).toDF("month_in_number")
// display(final_df)
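As an aside, the same monthNumber map can be applied without the round trip through the RDD API by using DataFrameNaFunctions.replace; a minimal sketch, assuming the df and monthNumber defined above:
// Replace the month names in place using the monthNumber map, then rename the column
val final_df = df.na.replace("month_in_words", monthNumber)
  .withColumnRenamed("month_in_words", "month_in_number")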
I am using the replace function from DataFrameNaFunctions to replace the values of a dataframe column with values from a Map.
The keys and values of the Map are available as a delimited file. These are read into an RDD, then transformed into a pair RDD and converted to a Map.
For example, a text file of month numbers and month names as shown below:
01,January
02,February
03,March
... ...
... ...
val mRDD1 = sc.textFile("file:///.../monthlist.txt")
When this data is transformed into a Map using RDD.collect().toMap, as shown below, the dataframe.na.replace function works fine; I refer to this as Method 1.
val monthMap1= mRDD1.map(_.split(",")).map(line => (line(0),line(1))).collect().toMap
monthMap1: scala.collection.immutable.Map[String,String] = Map(12 -> December, 08 -> August, 09 -> September, 11 -> November, 05 -> May, 04 -> April, 10 -> October, 03 -> March, 06 -> June, 02 -> February, 07 -> July, 01 -> January)
val df2 = df1.na.replace("monthname", monthMap1)
df2: org.apache.spark.sql.DataFrame = [col1: int, col2: string ... 13 more fields]
However, when the data is transformed into a Map using RDD.collectAsMap(), as shown below, it does not work, because the result is not an immutable Map; I call this Method 2.
Is there a simple way to convert this scala.collection.Map into a scala.collection.immutable.Map so that it does not give this error?
val monthMap2= mRDD1.map(_.split(",")).map(line => (line(0),line(1))).collectAsMap()
monthMap2: scala.collection.Map[String,String] = Map(12 -> December, 09 -> September, 03 -> March, 06 -> June, 11 -> November, 05 -> May, 08 -> August, 02 -> February, 01 -> January, 10 -> October, 04 -> April, 07 -> July)
val df3 = df1.na.replace("monthname", monthMap2)
<console>:30: error: overloaded method value replace with alternatives:
[T](cols: Seq[String], replacement: scala.collection.immutable.Map[T,T])org.apache.spark.sql.DataFrame <and>
[T](col: String, replacement: scala.collection.immutable.Map[T,T])org.apache.spark.sql.DataFrame <and>
[T](cols: Array[String], replacement: java.util.Map[T,T])org.apache.spark.sql.DataFrame <and>
[T](col: String, replacement: java.util.Map[T,T])org.apache.spark.sql.DataFrame
cannot be applied to (String, scala.collection.Map[String,String])
val cdf3 = cdf2.na.replace("monthname", monthMap2)
^
Method 1 mentioned above is working fine.
However, to use Method 2, I would like to know the simplest, most direct way to convert a scala.collection.Map into a scala.collection.immutable.Map, and which libraries I need to import.
Thanks
You can try this:
val monthMap2 = mRDD1.map(_.split(",")).map(line => (line(0),line(1))).collectAsMap()
// create an immutable map from monthMap2
val monthMap = collection.immutable.Map(monthMap2.toSeq: _*)
val df3 = df1.na.replace("monthname", monthMap)
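A shorter equivalent: any scala.collection.Map of pairs can be converted with .toMap, which returns an immutable Map:
// .toMap on a scala.collection.Map yields a scala.collection.immutable.Map
val df3 = df1.na.replace("monthname", monthMap2.toMap)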
The replace method also accepts a java.util.Map, so alternatively you can convert to that:
import scala.jdk.CollectionConverters._ // Scala 2.13+; on Scala 2.12 or earlier use scala.collection.JavaConverters._
val df3 = df1.na.replace("monthname", monthMap2.asJava)
I have a Spark dataframe with the following column structure:
UT_LVL_17_CD,UT_LVL_20_CD, 2017 1Q,2017 2Q,2017 3Q,2017 4Q, 2017 FY,2018 1Q, 2018 2Q,2018 3Q,2018 4Q,2018 FY
With this column structure, I will keep getting new columns for subsequent quarters, like 2019 1Q, 2019 2Q, etc.
I want to select UT_LVL_17_CD, UT_LVL_20_CD, and the columns that match the pattern year<space>quarter, like 2017 1Q.
Basically, I want to avoid selecting columns like 2017 FY and 2018 FY, and this has to be dynamic since I will get new FY data each year.
I am using Spark 2.4.4.
Like I stated in my comment, this can be done with plain Scala using a Regex, since the DataFrame exposes its column names as an Array[String]:
scala> val columns = df.columns
// columns: Array[String] = Array(UT_LVL_17_CD, UT_LVL_20_CD, 2017 1Q, 2017 2Q, 2017 3Q, 2017 4Q, 2017 FY, 2018 1Q, 2018 2Q, 2018 3Q, 2018 4Q, 2018 FY)
scala> val regex = """^((?!FY).)*$""".r
// regex: scala.util.matching.Regex = ^((?!FY).)*$
scala> val selection = columns.filter(s => regex.findFirstIn(s).isDefined)
// selection: Array[String] = Array(UT_LVL_17_CD, UT_LVL_20_CD, 2017 1Q, 2017 2Q, 2017 3Q, 2017 4Q, 2018 1Q, 2018 2Q, 2018 3Q, 2018 4Q)
You can check that the selection does not contain the unwanted columns:
scala> columns.diff(selection)
// res2: Array[String] = Array(2017 FY, 2018 FY)
Now you can use the selection:
scala> df.select(selection.head, selection.tail : _*)
// res3: org.apache.spark.sql.DataFrame = [UT_LVL_17_CD: int, UT_LVL_20_CD: int ... 8 more fields]
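If you would rather match the wanted quarter columns positively instead of excluding everything containing FY, a sketch (the exact pattern is my assumption from the examples in the question):
// Keep the two key columns plus anything shaped like "<year> <quarter>", e.g. "2017 1Q"
val quarterPattern = """^\d{4} \dQ$""".r
val keyCols = Seq("UT_LVL_17_CD", "UT_LVL_20_CD")
val selection = columns.filter(c => keyCols.contains(c) || quarterPattern.findFirstIn(c).isDefined)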
You could use the desc SQL command to get the list of column names:
import java.util

val fyStringList = new util.ArrayList[String]()
spark.sql("desc <table_name>")
  .select("col_name")
  .filter(str => str.getString(0).toLowerCase.contains("fy"))
  .collect
  .foreach(str => fyStringList.add(str.getString(0)))
println(fyStringList)
Use the above snippet to get the list of column names that contain "fy".
You can update the filter logic with a regex, and also change the logic in foreach to store the column names elsewhere; a sketch of that variant follows.
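For example (the table name my_table and the quarter pattern are assumptions on my part):
// Collect the column names matching "<year> <quarter>" into a plain Scala collection
val quarterPattern = """^\d{4} \dQ$""".r
val quarterCols = spark.sql("desc my_table")
  .select("col_name")
  .collect()
  .map(_.getString(0))
  .filter(name => quarterPattern.findFirstIn(name).isDefined)
println(quarterCols.mkString(", "))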
You can try this snippet, assuming DF is your dataframe containing all those columns:
val DF1 = DF.select(DF.columns.filter(x => !x.contains("FY")).map(DF(_)) : _*)
This will drop the FY-related columns. Hope this works for you.
I have data in a file like:
2005, 08, 20, 50
2005, 08, 21, 52
2005, 08, 22, 38
2005, 08, 23, 70
The data is: year, month, date, temperature.
I want to read this data and output the temperatures grouped by year and month.
Example: 2005-08: 38, 50, 52, 70.
The temperatures are sorted in ascending order.
What should the Spark Scala code for this be? An answer using RDD transformations would be much appreciated.
This is what I have so far:
val conf= new SparkConf().setAppName("demo").setMaster("local[*]")
val spark = new SparkContext(conf)
val input = spark.textFile("src/main/resources/someFile.txt")
val fields = input.flatMap(_.split(","))
What I am thinking is to have year-month as the key and the list of temperatures as the value, but I am not able to get this into code.
val myData = sc.parallelize(Array((2005, 8, 20, 50), (2005, 8, 21, 52), (2005, 8, 22, 38), (2005, 8, 23, 70)))
myData.sortBy(_._4).collect
returns:
res1: Array[(Int, Int, Int, Int)] = Array((2005,8,22,38), (2005,8,20,50), (2005,8,21,52), (2005,8,23,70))
I'll leave the concatenation function to you.
From a file:
val filesRDD = sc.textFile("/FileStore/tables/Weather2.txt",1)
val linesRDD = filesRDD.map(line => (line.trim.split(","))).map(entries=>(entries(0).toInt,entries(1).toInt,entries(2).toInt,entries(3).toInt))
linesRDD.sortBy(_._4).collect
returns:
res13: Array[(Int, Int, Int, Int)] = Array((2005,7,22,7), (2005,7,15,10), (2005,8,22,38), (2005,8,20,50), (2005,7,19,50), (2005,8,21,52), (2005,7,21,52), (2005,8,23,70))
You can work out the concatenation yourself, and consider what happens when the sort values are tied (you may need a secondary sort key), but I think this answers your original, somewhat loosely formed question; see the grouping sketch below.
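For the year-month grouping you asked for, a minimal sketch, assuming the linesRDD defined above (groupByKey is fine at this scale, though an aggregating combiner would scale better):
// Key by "yyyy-MM", collect the temperatures, sort them, and format one line per month
val byMonth = linesRDD
  .map { case (year, month, _, temp) => (f"$year%04d-$month%02d", temp) }
  .groupByKey()
  .mapValues(_.toList.sorted)
  .map { case (ym, temps) => s"$ym: ${temps.mkString(", ")}" }
byMonth.collect().foreach(println)
// e.g. 2005-08: 38, 50, 52, 70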
I have a Dataframe with rows that look like this:
[WrappedArray(1, 5DC7F285-052B-4739-8DC3-62827014A4CD, 1, 1425450997, 714909, 1425450997, 714909, {}, 2013, GAVIN, ST LAWRENCE, M, 9)]
[WrappedArray(2, 17C0D0ED-0B12-477B-8A23-1ED2C49AB8AF, 2, 1425450997, 714909, 1425450997, 714909, {}, 2013, LEVI, ST LAWRENCE, M, 9)]
[WrappedArray(3, 53E20DA8-8384-4EC1-A9C4-071EC2ADA701, 3, 1425450997, 714909, 1425450997, 714909, {}, 2013, LOGAN, NEW YORK, M, 44)]
...
Everything before the year (2013 in this example) is nonsense that should be dropped. I would like to map the data to a Name class that I have created and put it into a new dataframe.
How do I get to the data and do that mapping?
Here is my Name class:
case class Name(year: Int, first_name: String, county: String, sex: String, count: Int)
Basically, I would like to fill my dataframe with rows and columns according to the schema of the Name class. I know how to do this part, but I just don't know how to get to the data in the dataframe.
Assuming the data is an array of strings like this (imports needed for the snippets below):
import org.apache.spark.sql.types.IntegerType
import spark.implicits._ // for toDF, the $ column syntax and .as[Name]
val df = Seq(Seq("1", "5DC7F285-052B-4739-8DC3-62827014A4CD", "1", "1425450997", "714909", "1425450997", "714909", "{}", "2013", "GAVIN", "STLAWRENCE", "M", "9"),
Seq("2", "17C0D0ED-0B12-477B-8A23-1ED2C49AB8AF", "2", "1425450997", "714909", "1425450997", "714909", "{}", "2013", "LEVI", "ST LAWRENCE", "M", "9"),
Seq("3", "53E20DA8-8384-4EC1-A9C4-071EC2ADA701", "3", "1425450997", "714909", "1425450997", "714909", "{}", "2013", "LOGAN", "NEW YORK", "M", "44"))
.toDF("array")
You could either use a UDF that returns a case class, or use withColumn multiple times. The latter should be more efficient and can be done like this:
val df2 = df.withColumn("year", $"array"(8).cast(IntegerType))
.withColumn("first_name", $"array"(9))
.withColumn("county", $"array"(10))
.withColumn("sex", $"array"(11))
.withColumn("count", $"array"(12).cast(IntegerType))
.drop($"array")
.as[Name]
This will give you a Dataset[Name]:
+----+----------+-----------+---+-----+
|year|first_name|county |sex|count|
+----+----------+-----------+---+-----+
|2013|GAVIN |STLAWRENCE |M |9 |
|2013|LEVI |ST LAWRENCE|M |9 |
|2013|LOGAN |NEW YORK |M |44 |
+----+----------+-----------+---+-----+
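An equivalent single select, just as an alternative sketch under the same assumptions:
// Pick the wanted array elements by position and name them in one pass
val ds = df.select(
  $"array"(8).cast(IntegerType).as("year"),
  $"array"(9).as("first_name"),
  $"array"(10).as("county"),
  $"array"(11).as("sex"),
  $"array"(12).cast(IntegerType).as("count")
).as[Name]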
Hope it helped!
import java.text.SimpleDateFormat
import java.util.{Calendar, Date}

def run: List[Map[String, Any]] = {
  val startTime = "19-9-2014 23:00"
  val endTime = "19-9-2014 13:15"
  val df: SimpleDateFormat = new SimpleDateFormat("dd-MM-yyyy HH:mm")
  val calendar: Calendar = Calendar.getInstance
  val finalTime: Date = df.parse(endTime)
  var firstTime = startTime
  var timeValuePair = List[Map[String, Any]]()
  var last = new Date()
  do {
    val first: Date = df.parse(firstTime)
    calendar.setTime(first)
    calendar.add(Calendar.MINUTE, 5)
    last = calendar.getTime
    val (repAlert, slowQueryAlert, statusAlert) = AggregateAlert.test(first.toString, last.toString)
    timeValuePair = Map("repAlert" -> repAlert.toString, "slowQueryAlert" -> slowQueryAlert.toString, "statusAlert" -> statusAlert.toString) :: timeValuePair
    firstTime = last.toString
  } while (last.compareTo(finalTime) < 0)
  timeValuePair
}
I basically need to send the date and time in the format dd-mmm-yyyy hh:mm to the function AggregateAlert for every 5-minute interval between startTime and endTime.
However, with my code I am getting "first" as "Fri Sep 19 11:00:00 IST 2014" and "last" as "Fri Sep 19 19:01:21 IST 2014", when I want the dd-mmm-yyyy format. Is there any way to convert this to the required format?
Thanks in advance!
Try:
val (repAlert, slowQueryAlert, statusAlert) =
AggregateAlert.test(df.format(first), df.format(last))
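A Date's toString always uses the default locale format, so the conversion has to go through a formatter. If you actually want the month abbreviation (dd-MMM-yyyy HH:mm, my reading of the question), a second SimpleDateFormat would do it; a sketch:
import java.text.SimpleDateFormat

// One formatter parses the input strings, another formats the output
val parseFmt = new SimpleDateFormat("dd-MM-yyyy HH:mm")
val outFmt = new SimpleDateFormat("dd-MMM-yyyy HH:mm")

val first = parseFmt.parse("19-9-2014 23:00")
println(outFmt.format(first)) // prints 19-Sep-2014 23:00 (in an English locale)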