How to convert values from a file to a Map in spark-scala?

I have my values in a file as comma separated. Now, I want this data to be converted into key-value pairs (a Map). I know that we can split the values and store them in an Array like below:
val prop_file = sc.textFile("/prop_file.txt")
prop_file.map(_.split(",").map(s => Array(s)))
Is there any way to store the data as a Map in spark-scala?

Assuming that each line of your file contains a key and its value, comma separated (to match your file), e.g.:
A,1
B,2
C,3
Something like this can be done:
val file = sc.textFile("/prop_file.txt")
val words = file.flatMap(x => createDataMap(x))
And here is your function, createDataMap:
def createDataMap(data: String): Map[String, String] = {
  // Split the line on commas and build a Map from the key/value pair
  val array = data.split(",")
  Map(array(0) -> array(1))
}
Next, for retrieving the keys/values from the RDD you can use the following operations:
//This will print all elements of RDD
words.foreach(f=>println(f))
//Or You can filter the elements too.
words.filter(f=>f._1.equals("A"))
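Note that filter returns another RDD of (key, value) pairs rather than the value itself; to actually read a value on the driver you would collect it. A minimal sketch, using the sample data above:
// keep only the pairs whose key is "A", then bring their values back to the driver
val valuesForA = words.filter { case (key, _) => key == "A" }.values.collect()
valuesForA.foreach(println)   // prints 1 for the sample data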

Sumit, I have used the code below to retrieve the value for a particular key and it worked:
val words = file.flatMap(x => createDataMap(x)).collectAsMap
val valueofA = words("A")
print(valueofA)
This gives me 1 as the result.
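For reference, the whole flow fits in a few lines; a minimal end-to-end sketch (the file path and the one-pair-per-line, comma-separated format are the assumptions used above):
// assuming the spark-shell SparkContext `sc` and the createDataMap function above
val props: scala.collection.Map[String, String] =
  sc.textFile("/prop_file.txt")
    .flatMap(createDataMap)
    .collectAsMap()
println(props("A"))   // prints 1 for the sample data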

Related

How to select several elements from an RDD file line using Spark in Scala

I'm new to Spark and Scala and I would like to select several columns from a dataset.
I transformed my file into an RDD using:
val dataset = sc.textFile(args(0))
Then I split each line:
val resu = dataset.map(line => line.split("\001"))
But my dataset has a lot of features and I just want to keep some of them (columns 2 and 3).
I tried this (which works in PySpark) but it doesn't work:
val resu = dataset.map(line => line.split("\001")[2,3])
I know this is a newbie question, but can someone help me? Thanks.
I just want to keep some of them (columns 2 and 3)
If you want columns 2 and 3 in tuple form you can do
val resu = dataset.map(line => {
val array = line.split("\001")
(array(2), array(3))
})
But if you want columns 2 and 3 in array form then you can do
val resu = dataset.map(line => {
val array = line.split("\001")
Array(array(2), array(3))
})
In Scala, to access specific collection elements you use parentheses, not square brackets.
In your case, you want a sublist, so you can try the slice(i, j) function. It extracts the elements from index i up to (but not including) index j. So in your case, you can use:
val resu = dataset.map(line => line.split("\001").slice(2,4))
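For instance, on a made-up five-field record joined with the same control-character separator, slice(2, 4) keeps exactly the elements at indexes 2 and 3:
// hypothetical record, using the question's '\u0001' delimiter
val line = Seq("a", "b", "c", "d", "e").mkString("\u0001")
val kept = line.split("\u0001").slice(2, 4)
println(kept.mkString(","))   // c,d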
Hope it helps.

How to concat multiple columns in a data frame using Scala

I am trying to concat multiple columns in a DataFrame. My column list is present in a variable. I am trying to pass that variable into the concat function but am not able to do that.
Ex: base_tbl_columns contains the list of columns and I am using the code below to select all the columns mentioned in the variable.
scala> val base_tbl_columns = scd_table_keys_df.first().getString(5).split(",")
base_tbl_columns: Array[String] = Array(acct_nbr, account_sk_id, zip_code, primary_state, eff_start_date, eff_end_date, load_tm, hash_key, eff_flag)
val hist_sk_df_ld = hist_sk_df.select(base_tbl_columns.head,base_tbl_columns.tail: _*)
Similarly, I have one more list which I want to use for concatenation. But there the concat function does not take the .head and .tail arguments.
scala> val hash_key_cols = scd_table_keys_df.first().getString(4)
hash_key_cols: String = primary_state,zip_code
Here I am hard-coding the values primary_state and zip_code:
.withColumn("hash_key_col",concat($"primary_state",$"zip_code"))
Here I am passing the variable hash_key_cols:
.withColumn("hash_key_col",concat(hash_key_cols ))
I was able to do this in Python by using the code below.
hist_sk_df = hist_tbl_df.join(broadcast(hist_tbl_lkp_df), primary_key_col, 'inner') \
    .withColumn("eff_start_date", lit(load_dt)) \
    .withColumn('hash_key_col', F.concat(*hash_key_cols)) \
    .withColumn("hash_key", hash_udf('hash_key_col')) \
    .withColumn("eff_end_date", lit(eff_close_dt)) \
    .withColumn("load_tm", lit(load_tm)) \
    .withColumn("eff_flag", lit(eff_flag_curr))
Either:
val base_tbl_columns: Array[String] = ???
df.select(concat(base_tbl_columns.map(c => col(c)): _*))
or:
df.select(expr(s"""concat(${base_tbl_columns.mkString(",")})"""))
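To feed the comma-separated hash_key_cols string from the question into the first form, one option is to split it into column names first; a rough sketch (the DataFrame and variable names are taken from the question and assumed to exist):
import org.apache.spark.sql.functions.{col, concat}
// e.g. "primary_state,zip_code", read from scd_table_keys_df as in the question
val hash_key_cols: String = scd_table_keys_df.first().getString(4)
// split into individual column names and turn each one into a Column
val hashCols = hash_key_cols.split(",").map(name => col(name.trim))
// concat takes a varargs of Columns, so expand the array with : _*
val hist_sk_df_hashed = hist_sk_df.withColumn("hash_key_col", concat(hashCols: _*))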

How do I remove empty dataframes from a sequence of dataframes in Scala

How do I remove empty data frames from a sequence of data frames? In the code snippet below, there are many empty data frames in twoColDF. Also, another question about the for loop below: is there a way I can make it more efficient? I tried rewriting it as the following line, but it didn't work.
//finalDF2 = (1 until colCount).flatMap(j => groupCount(j).map( y=> finalDF.map(a=>a.filter(df(cols(j)) === y)))).toSeq.flatten
var twoColDF: Seq[Seq[DataFrame]] = null
if (colCount == 2 )
{
val i = 0
for (j <- i + 1 until colCount) {
twoColDF = groupCount(j).map(y => {
finalDF.map(x => x.filter(df(cols(j)) === y))
})
}
}
finalDF = twoColDF.flatten
Given a set of DataFrames, you can access each DataFrame's underlying RDD and use isEmpty to filter out the empty ones:
val input: Seq[DataFrame] = ???
val result = input.filter(!_.rdd.isEmpty())
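A small self-contained usage sketch (the DataFrames here are made up purely for illustration, assuming an existing SparkSession named spark):
import org.apache.spark.sql.DataFrame
import spark.implicits._
// one non-empty and one empty DataFrame
val nonEmpty: DataFrame = Seq(1, 2, 3).toDF("n")
val empty: DataFrame    = nonEmpty.filter($"n" > 10)
val dfs: Seq[DataFrame] = Seq(nonEmpty, empty)
// keep only the DataFrames whose underlying RDD has at least one row
val kept = dfs.filter(!_.rdd.isEmpty())
println(kept.size)   // 1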
As for your other question - I can't understand what your code tries to do, but I'd first try to convert it into something more functional (remove the use of vars and imperative conditionals). If I'm guessing the meaning of your inputs correctly, here's something that might be equivalent to what you're trying to do:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
val input: Seq[DataFrame] = ???
// map of column index to column values -
// for each combination we'd want a new DF where that column has that value
// I'm assuming values are Strings, can be anything else
val groupCount: Map[Int, Seq[String]] = ???
// for each combination of DF + column + value - produce the filtered DF where this column has this value
val perValue: Seq[DataFrame] = for {
df <- input
index <- groupCount.keySet
value <- groupCount(index)
} yield df.filter(col(df.columns(index)) === value)
// remove empty results:
val result: Seq[DataFrame] = perValue.filter(!_.rdd.isEmpty())

Split and choose in scala

I found some explanations of how to do this, but I still can't do it!
I want to split val data = sc.textFile("hdfs://ncdc/isd-history.csv")
The data has the form: ("949999","00338","PORTLAND (CASHMORE)","AS","","","-38.320","+141.480","+0081.0","19690724","19781113")
I want to split the data and take only the 1st field (949999) and the 3rd (PORTLAND (CASHMORE)).
I have done this:
val RDD = (data.filter(s => (s.split(',')(0) , s.split(',')(2))))
But it doesn't work.
RDD.filter filters records, not "columns" - it expects a function from the record type (String, I assume, in this case) to Boolean, and would filter out all records for which this function returned false.
You're trying to transform each record from a String into a tuple (while "filtering" out parts of that string), so you should use RDD.map instead of RDD.filter:
val RDD = data.map(s => (s.split(',')(0), s.split(',')(2)))
Or better yet:
val RDD = data.map(_.split(',')).map(arr => (arr(0), arr(2)))
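As a quick sanity check, the same extraction can be tried on the sample record from the question as a plain local string (quotes stripped for simplicity; this is just a hypothetical test outside Spark):
val sample = "949999,00338,PORTLAND (CASHMORE),AS,,,-38.320,+141.480,+0081.0,19690724,19781113"
val fields = sample.split(',')
println((fields(0), fields(2)))   // (949999,PORTLAND (CASHMORE))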
You should use split to split strings and not collections.
If this is an RDD of tuples, this should work:
val RDD = data map(row => (row._1, row._3))
If this is an RDD of Array/Seq[String], just substitute indexes 0 and 2 for _1 and _3.

How to create Map[Int,Set[String]] in scala by reading a CSV file?

I want to create Map[Int,Set[String]] in scala by reading input from a CSV file.
My file.csv is:
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
I want the output to be:
var Attributes: Map[Int,Set[String]] = Map()
Attributes += (0 -> Set("sunny","overcast","rainy"))
Attributes += (1 -> Set("hot","mild","cool"))
Attributes += (2 -> Set("high","normal"))
Attributes += (3 -> Set("false","true"))
Attributes += (4 -> Set("yes","no"))
The 0,1,2,3,4 represent the column numbers and each Set contains the distinct values in that column.
I want to add each (Int -> Set[String]) to my map Attributes, i.e. if we print Attributes.size, it displays 5 (in this case).
Use one of the existing answers to read in the CSV file. You'll have a two dimensional array or vector of strings. Then build your map.
// row vectors
val rows = io.Source.fromFile("file.csv").getLines.map(_.split(",")).toVector
// column vectors
val cols = rows.transpose
// convert each vector to a set
val sets = cols.map(_.toSet)
// convert vector of sets to map
val attr = sets.zipWithIndex.map(_.swap).toMap
The last line is a bit ugly because zipWithIndex produces (value, index) pairs, which have to be swapped before calling .toMap. You could also write
val attr = Vector.tabulate(sets.size)(i => (i, sets(i))).toMap
Or you could do the last two steps in one go:
val attr = cols.zipWithIndex.map { case (xs, i) =>
(i, xs.toSet)
} (collection.breakOut): Map[Int,Set[String]]
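Note that collection.breakOut was removed in Scala 2.13; on newer versions you can get the same Map with a plain .toMap (a small sketch, reusing the cols vector from above):
// Scala 2.13+ variant: build the Map from the transposed columns
val attr: Map[Int, Set[String]] =
  cols.zipWithIndex.map { case (xs, i) => (i, xs.toSet) }.toMap
// for the sample file this gives, e.g.:
// attr(0) == Set("sunny", "overcast", "rainy") and attr.size == 5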