How to concat multiple columns in a data frame using Scala

I am trying to concat multiple columns in a data frame. My column list is present in a variable, and I am trying to pass that variable into the concat function, but I am not able to do that.
Ex: base_tbl_columns contains the list of columns, and I am using the code below to select all the columns mentioned in the variable.
scala> val base_tbl_columns = scd_table_keys_df.first().getString(5).split(",")
base_tbl_columns: Array[String] = Array(acct_nbr, account_sk_id, zip_code, primary_state, eff_start_date, eff_end_date, load_tm, hash_key, eff_flag)
val hist_sk_df_ld = hist_sk_df.select(base_tbl_columns.head,base_tbl_columns.tail: _*)
Similarly, I have one more list which I want to use for concatenation. But there the concat function does not take the .head and .tail arguments.
scala> val hash_key_cols = scd_table_keys_df.first().getString(4)
hash_key_cols: String = primary_state,zip_code
Here I am hard coding the values primary_state and zip_code:
.withColumn("hash_key_col",concat($"primary_state",$"zip_code"))
Here I am passing the variable hash_key_cols:
.withColumn("hash_key_col",concat(hash_key_cols ))
I was able to do this in Python by using the code below.
hist_sk_df = hist_tbl_df.join(broadcast(hist_tbl_lkp_df), primary_key_col, 'inner') \
    .withColumn("eff_start_date", lit(load_dt)) \
    .withColumn('hash_key_col', F.concat(*hash_key_cols)) \
    .withColumn("hash_key", hash_udf('hash_key_col')) \
    .withColumn("eff_end_date", lit(eff_close_dt)) \
    .withColumn("load_tm", lit(load_tm)) \
    .withColumn("eff_flag", lit(eff_flag_curr))

Either:
val base_tbl_columns: Array[String] = ???
df.select(concat(base_tbl_columns.map(c => col(c)): _*))
or:
df.select(expr(s"""concat(${base_tbl_columns.mkString(",")})"""))
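For the hash_key_cols case specifically, a minimal sketch (assuming hash_key_cols is the comma-separated string read above and hist_sk_df_ld is the frame being built; hist_sk_df_hashed is just an illustrative name) is to turn the string into Column objects first:
import org.apache.spark.sql.functions.{col, concat}

// split the comma-separated name list into Column objects
val hashCols = hash_key_cols.split(",").map(c => col(c.trim))
// concat then takes them as varargs, just like the hard-coded version
val hist_sk_df_hashed = hist_sk_df_ld.withColumn("hash_key_col", concat(hashCols: _*))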

Related

How to create an RDD by selecting specific data from an existing RDD, where the output should be of type RDD[String]?

I have a scenario where I need to capture some data (not all) from an existing RDD and then pass it to another Scala class for the actual operations. Let's look at example data (empnum, empname, emplocation, empsal) in a text file:
11,John,Paris,1000
12,Daniel,UK,3000
As a first step, I create an RDD[String] with the code below:
val empRDD = spark
.sparkContext
.textFile("empInfo.txt")
So, my requirement is to create another RDD with empnum, empname, emplocation (again as RDD[String]).
For that I have tried the code below, but I am getting RDD[(String, String, String)]:
val empReqRDD = empRDD
.map(a=> a.split(","))
.map(x=> (x(0), x(1), x(2)))
I have also tried slice, which gives me RDD[Array[String]].
My required RDD should be RDD[String] so I can pass it to the required Scala class to do some operations.
The expected output should be,
11,John,Paris
12,Daniel,UK
Can anyone help me achieve this?
I would try this:
val empReqRDD = empRDD
.map(a=> a.split(","))
.map(x=> (x(0), x(1), x(2)))
val rddString = empReqRDD.map({case(id,name,city) => "%s,%s,%s".format(id,name,city)})
In your initial implementation, the second map is putting the array elements into a 3-tuple, hence the RDD[(String, String, String)].
One way to accomplish your objective is to change the second map to construct a string like so:
empRDD
.map(a=> a.split(","))
.map(x => s"${x(0)},${x(1)},${x(2)}")
Alternatively, and a bit more concise, you could do it by taking the first 3 elements of the array and using the mkString method:
empRDD.map(_.split(',').take(3).mkString(","))
Probably overkill for this use-case, but you could also use a regex to extract the values:
val r = "([^,]*),([^,]*),([^,]*).*".r
empRDD.map { case r(id, name, city) => s"$id,$name,$city" }
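For illustration, running the take(3)/mkString variant on the sample lines from the question gives the expected output (a small sketch, assuming a local SparkContext named sc):
val sample = sc.parallelize(Seq("11,John,Paris,1000", "12,Daniel,UK,3000"))
sample.map(_.split(',').take(3).mkString(",")).collect()
// Array(11,John,Paris, 12,Daniel,UK)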

How to use Scala format and substitution interpolation together?

I am new to Scala and Spark and have a requirement where I want to use both format and substitution in a single println statement.
Here is the code:
val results = minTempRdd.collect()
for(result <- results.sorted){
val station = result._1
val temp = result._2
println(f" StId $station Temp $temp%.2f F")
}
where minTempRdd is an RDD with the structure (stationId, temperature).
Now I want to convert this code into a one-liner. I tried the following code:
val results = minTempRdd.collect()
results.foreach(x => println(" stId "+x._1+" temp = "+x._2))
It works fine, but I am not able to format the second value of the tuple here.
Any suggestions on how we can achieve this?
The first way is to use curly braces inside the interpolation, which allow you to pass arbitrary expressions instead of plain variables:
println(f" StId ${result._1} Temp ${result._2}%.2fF")
The second way is to unpack the tuple:
for ((station, temp) <- results.sorted)
println(f" StId $station Temp $temp%.2fF")
Or:
results.sorted.foreach { case (station, temp) =>
println(" stId "+x._1+" temp = "+x._2)
}
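If you prefer to keep the plain string concatenation from your one-liner, a hedged variant is to format only the temperature value (this assumes temp is a Double or Float):
results.sorted.foreach { case (station, temp) =>
  println(" stId " + station + " temp = " + "%.2f".format(temp))
}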

How to select several element from an RDD file line using Spark in Scala

I'm new to Spark and Scala and I would like to select several columns from a dataset.
I transformed my data from a file into an RDD using:
val dataset = sc.textFile(args(0))
Then I split each line:
val resu = dataset.map(line => line.split("\001"))
But my dataset has a lot of features and I just want to keep some of them (columns 2 and 3).
I tried this (which works in PySpark), but it doesn't work here:
val resu = dataset.map(line => line.split("\001")[2,3])
I know this is a newbie question, but is there someone who can help me? Thanks.
I just want to keep some of them (columns 2 and 3)
If you want columns 2 and 3 in tuple form, you can do:
val resu = dataset.map(line => {
val array = line.split("\001")
(array(2), array(3))
})
But if you want columns 2 and 3 in array form, then you can do:
val resu = dataset.map(line => {
val array = line.split("\001")
Array(array(2), array(3))
})
In Scala, in order to access specific list elements, you have to use parentheses.
In your case, you want a sublist, so you can try the slice(i, j) function. It extracts the elements from index i to index j-1. So in your case, you may use:
val resu = dataset.map(line => line.split("\001").slice(2,4))
Hope it helps.
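If the columns you need are not contiguous, a small sketch (the index values here are illustrative) is to keep a Seq of the wanted positions and map over it:
val wanted = Seq(2, 3)
val resu = dataset.map { line =>
  val fields = line.split("\001")
  wanted.map(i => fields(i))
}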

Spark Scala: filter multiple RDDs based on string length

I am trying to solve a quiz question, which is as follows:
Write the missing code in the given program to display the expected output, identifying animals that have names with four letters.
Output: Array((4,lion))
Program
val a = sc.parallelize(List("dog","tiger","lion","cat","spider","eagle"),2)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("ant","falcon","squid"),2)
val d = c.keyBy(_.length)
I have tried to write the code in spark-shell but got stuck on the syntax for combining the RDDs and applying the filter.
How about using the PairRDD lookup method:
b.lookup(4).toArray
// res1: Array[String] = Array(lion)
d.lookup(4).toArray
// res2: Array[String] = Array()
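Since the question also mentions combining the RDDs and applying a filter, a hedged alternative sketch is:
// keep only pairs whose key (the name length) is 4, across both RDDs
(b union d).filter { case (len, _) => len == 4 }.collect()
// Array[(Int, String)] = Array((4,lion))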

How to convert values from a file to a Map in spark-scala?

I have my values in a file, comma separated. Now I want this data to be converted into key/value pairs (a Map). I know that we can split the values and store them in an Array like below.
val prop_file = sc.textFile("/prop_file.txt")
prop_file.map(_.split(",").map(s => Array(s)))
Is there any way to store the data as a Map in spark-scala?
Assuming that each line of your file contains two values, where the first is the key and the next is the value, separated by a comma:
A,1
B,2
C,3
Something like this can be done:
val file = sc.textFile("/prop_file.txt")
val words = file.flatMap(x => createDataMap(x))
And here is your function, createDataMap:
def createDataMap(data: String): Map[String, String] = {
  val array = data.split(",")
  // Creating the Map of values
  Map(array(0) -> array(1))
}
Next, for retrieving the keys/values from the RDD, you can leverage the following operations:
// This will print all elements of the RDD
words.foreach(f => println(f))
// Or you can filter the elements too.
words.filter(f => f._1.equals("A"))
Sumit, I have used the code below to retrieve the value for a particular key and it worked.
val words = file.flatMap(x => createDataMap(x)).collectAsMap
val valueofA = words("A")
print(valueofA)
This gives me 1 as a result.
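For reference, a hedged alternative sketch without a helper function (the file path and layout are assumed from above): split each line into a key and a value on the first comma and collect straight into a Map on the driver.
val propsMap = sc.textFile("/prop_file.txt")
  .map { line =>
    // assumes every line has the form key,value
    val Array(k, v) = line.split(",", 2)
    k -> v
  }
  .collectAsMap()

propsMap.get("A")  // Some(1), i.e. the value "1", assuming the file contains a line "A,1"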