Read CSV File to OrderedMap - scala

I'm reading a CSV file and adding its data to a Map in Scala.
import java.io.{BufferedReader, File, FileInputStream, InputStreamReader}
import scala.collection.JavaConverters._
import org.apache.commons.csv.{CSVFormat, CSVParser}
val br = new BufferedReader(new InputStreamReader(new FileInputStream(new File(fileName)), "UTF-8"))
val inputFormat = CSVFormat.newFormat(delimiter.charAt(0)).withHeader().withQuote('"')
val csvRecords = new CSVParser(br, inputFormat).getRecords.asScala
// Convert every record to a (header -> value) map
val buffer = for (csvRecord <- csvRecords)
  yield csvRecord.toMap.asScala
buffer.toList
But because the Map is not ordered, I can't read the columns in order. Is there any way to read the csvRecords in order?
The CSV file contains comma-separated values with a header row. The output should be a List[mutable.LinkedHashMap[String, String]], something like [["fname", "A", "lname", "B"], ["fname", "C", "lname", "D"]].
The code above works, but it does not preserve the column order. For example, if the CSV file has the columns in the order fname, lname, the output map has lname first and fname last.

If I understand your question correctly, here's one way to create a list of LinkedHashMaps with elements in order:
// Assuming your CSV File has the following content:
fname,lname,grade
John,Doe,A
Ann,Cole,B
David,Jones,C
Mike,Duke,D
Jenn,Rivers,E
import collection.mutable.LinkedHashMap
// Get indexed header from CSV
val indexedHeader = io.Source.fromFile("/path/to/csvfile").
getLines.take(1).next.
split(",").
zipWithIndex
indexedHeader: Array[(String, Int)] = Array((fname,0), (lname,1), (grade,2))
// Aggregate LinkedHashMap using foldLeft
val ListOfLHM = for ( csvRecord <- csvRecords ) yield
indexedHeader.foldLeft(LinkedHashMap[String, String]())(
(acc, x) => acc += (x._1 -> csvRecord.get(x._2))
)
ListOfLHM: scala.collection.mutable.Buffer[scala.collection.mutable.LinkedHashMap[String,String]] = ArrayBuffer(
Map(fname -> John, lname -> Doe, grade -> A),
Map(fname -> Ann, lname -> Cole, grade -> B),
Map(fname -> David, lname -> Jones, grade -> C),
Map(fname -> Mike, lname -> Duke, grade -> D),
Map(fname -> Jenn, lname -> Rivers, grade -> E)
)
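The same idea can be sketched without any CSV library, which makes the ordering behaviour easy to check in isolation. This is only a minimal sketch: toOrderedMaps is a hypothetical helper, the sample lines are made up, and the naive split assumes no quoted fields or embedded delimiters (a real parser such as the one in the question handles those).

```scala
import scala.collection.mutable.LinkedHashMap

// Hypothetical helper: the first line is the header, every other line is a record.
// Each record becomes a LinkedHashMap whose keys follow the header order.
def toOrderedMaps(lines: Seq[String], delimiter: String = ","): List[LinkedHashMap[String, String]] =
  lines.toList match {
    case Nil => Nil
    case headerLine :: dataLines =>
      val header = headerLine.split(delimiter).map(_.trim)
      dataLines.map { line =>
        // limit -1 keeps trailing empty fields
        val values = line.split(delimiter, -1).map(_.trim)
        // zip pairs header and value by position; LinkedHashMap keeps insertion order
        LinkedHashMap(header.zip(values): _*)
      }
  }

// Made-up sample input
val sampleLines = Seq("fname,lname,grade", "John,Doe,A", "Ann,Cole,B")
val ordered = toOrderedMaps(sampleLines)
```

Iterating over the keys of any map in ordered yields fname, lname, grade, in that order, because the entries are inserted in header order.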

Convert MapPartitionsRDD to DataFrame and grouping data by 2 keys

I have a dataframe which looks like this:
country | user | count
----------------------
Germany | Sarah| 2
China | Paul | 1
Germany | Alan | 3
Germany | Paul | 1
...
What I am trying to do is to convert this dataframe to another which looks like this:
dimension | value
--------------------------------------------
Country | [Germany -> 6, China -> 1]
--------------------------------------------
User | [Sarah -> 2, Paul -> 2, Alan -> 3]
...
At first I tried it like this:
var newDF = Seq.empty[(String, Map[String,Long])].toDF("dimension", "value")
df.collect()
.foreach(row => { Array(0,1)
.map(pos =>
newDF = newDF.union(Seq((df.columns.toSeq(pos).toString, Map(row.mkString(",").split(",")(pos) -> row.mkString(",").split(",")(2).toLong))).toDF())
)
})
val newDF2 = newDF.groupBy("dimension").agg(collect_list("value")).as[(String, Seq[Map[String, Long]])].map {case (id, list) => (id, list.reduce(_ |+| _))}.toDF("dimension", "value")
But the collect() was killing my driver. Therefore, I have tried to do it like this:
class DimItem[T](val dimension: String, val value: String, val metric: T)
val items: RDD[DimItem[Long]] = df.rdd.flatMap(row => {
dims.zipWithIndex.map{case (dim, i) =>
new DimItem(dim, row(i).toString, row(13).asInstanceOf[Long])
}
})
// with the format [ DimItem(Country, Germany, 2), DimItem(User, Sarah, 2)], ...
val itemsGrouped: RDD[((String, String), Iterable[DimItem[Long]])] = items.groupBy(x => (x.dimension, x.value))
val aggregatedItems: RDD[DimItem[Long]] = itemsGrouped.map{case (key, items) => new DimItem(key._1, key._2, items.reduce((a,b) => a.metric + b.metric))}
The idea is to store objects such as (Country, China, 1), (Country, Germany, 3), (Country, Germany, 1), ... in an RDD, group them by the first two fields, e.g. (Country, China), (Country, Germany), ..., and then sum the counts within each group. For example, (Country, Germany, 3) and (Country, Germany, 1) become (Country, Germany, 4).
But at this point the compiler reports a type mismatch in items.reduce(): it expects a DimItem[Long] but gets a Long.
The next step will be to group by the key "dimension", build the Map[String, Int]() format for the "value" column, and convert the result to a DataFrame.
I have 2 questions.
First: is this last code correct?
Second: How can I convert this MapPartitionsRDD into a DF?
Here is one solution based on dataframe API:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, collect_list, lit, map_from_arrays, sum}
def transform(df: DataFrame, colName: String): DataFrame =
  df.groupBy(colName)
    // total count per distinct value of colName
    .agg(sum("count").as("sum"))
    // collapse all (value, sum) pairs into a single map
    .agg(
      map_from_arrays(
        collect_list(colName),
        collect_list("sum")
      ).as("value")
    ).select(lit(colName).as("dimension"), col("value"))
val countryDf = transform(df, "country")
val userDf = transform(df, "user")
countryDf.unionByName(userDf).show(false)
// +---------+----------------------------------+
// |dimension|value |
// +---------+----------------------------------+
// |country  |[Germany -> 6, China -> 1]        |
// |user     |[Sarah -> 2, Alan -> 3, Paul -> 2]|
// +---------+----------------------------------+
Analysis: first we compute the sum per country and per user by grouping on the respective column. Next we add a second aggregation that collects the previous results into a map. The map is built with the map_from_arrays function, available since Spark 2.4.0, from keys and values gathered with collect_list. Finally we union the two dataframes to produce the final result.
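On the first question: the mismatch in items.reduce() arises because reduce over a collection of DimItem[Long] needs a function that returns a DimItem[Long], while a.metric + b.metric returns a Long. Extracting the metric before combining avoids this. The sketch below shows the idea on plain Scala collections rather than RDDs; the non-generic DimItem case class and the sample values are assumptions made for illustration. On a pair RDD keyed by (dimension, value), reduceByKey(_ + _) would play the role of the sum.

```scala
// Simplified stand-in for the question's DimItem (assumption: a non-generic case class)
case class DimItem(dimension: String, value: String, metric: Long)

// Made-up sample items
val items = Seq(
  DimItem("country", "Germany", 2L),
  DimItem("country", "China", 1L),
  DimItem("country", "Germany", 3L),
  DimItem("user", "Paul", 1L),
  DimItem("user", "Paul", 1L)
)

// Group by (dimension, value), then sum the extracted metrics instead of reducing DimItems
val aggregated: Map[(String, String), Long] =
  items
    .groupBy(i => (i.dimension, i.value))
    .map { case (key, group) => key -> group.map(_.metric).sum }

// Regroup by dimension to get one value -> total map per dimension
val perDimension: Map[String, Map[String, Long]] =
  aggregated
    .groupBy { case ((dim, _), _) => dim }
    .map { case (dim, entries) =>
      dim -> entries.map { case ((_, value), total) => value -> total }
    }
```

Here perDimension("country") is Map("Germany" -> 5, "China" -> 1), which has the same shape as the "value" column described in the question.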

How to get distinct count value in scala

I want to implement this query in Scala to count the distinct values per key:
select
key,
count(distinct suppKey)
from
file
group by
key ;
I wrote this code in Scala, but it doesn't work.
val count= file.map(line=> (line.split('|')(0),line.split('|')(1)).distinct().count())
I split each line because key is in the first column of the file and suppKey is in the second.
File:
1|52|3956|337.0
1|77|4069|357.8
1|7|14|35.2
2|3|8895|378.4
2|3|4969|915.2
2|3|8539|438.3
2|78|3025|306.3
Expected output:
1|3
2|2
Instead of a file, for simpler testing, I use a String:
scala> val s="""1|52|3956|337.0
| 1|77|4069|357.8
| 1|7|14|35.2
| 2|3|8895|378.4
| 2|3|4969|915.2
| 2|3|8539|438.3
| 2|78|3025|306.3"""
scala> s.split("\n").map (line => {val sp = line.split ('|'); (sp(0), sp(1))}).distinct.groupBy (_._1).map (e => (e._1, e._2.size))
res198: scala.collection.immutable.Map[String,Int] = Map(2 -> 2, 1 -> 3)
Imho, we need a groupBy to specify what to group over, and to count groupwise.
Done in the Spark REPL; test.txt is a file containing the text you provided.
val d = sc.textFile("test.txt")
d.map(x => (x.split("\\|")(0), x.split("\\|")(1))).distinct.countByKey
scala.collection.Map[String,Long] = Map(2 -> 2, 1 -> 3)
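To get from such a map to the key|count lines shown as the expected output, the pairs can be sorted and formatted. The following is a minimal sketch on a plain Seq[String] holding the sample rows from the question; the names lines, distinctCounts and formatted are illustrative only.

```scala
// Sample rows from the question
val lines = Seq(
  "1|52|3956|337.0",
  "1|77|4069|357.8",
  "1|7|14|35.2",
  "2|3|8895|378.4",
  "2|3|4969|915.2",
  "2|3|8539|438.3",
  "2|78|3025|306.3"
)

// (key, number of distinct suppKey values), sorted by key
val distinctCounts: Seq[(String, Int)] =
  lines
    .map(_.split('|'))
    .map(parts => (parts(0), parts(1)))
    .groupBy(_._1)
    .map { case (key, pairs) => key -> pairs.map(_._2).distinct.size }
    .toSeq
    .sortBy(_._1)

// Render in the "key|count" layout of the expected output
val formatted: Seq[String] = distinctCounts.map { case (key, n) => s"$key|$n" }
// formatted contains "1|3" and "2|2"
```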

Data frame creation from MAP of N elements with schemaDetails of N elements

How do I convert the input5 data format into DataFrame, using the schema details mentioned in schemanames?
The conversion should be dynamic without using Row(r(0),r(1)) - the number of columns can increase or decrease in input and schema, hence the code should be dynamic.
case class Entry(schemaName: String, updType: String, ts: Long, row: Map[String, String])
val input5 = List(Entry("a","b",0,Map("col1 " -> "0000555", "ref" -> "2017-08-12 12:12:12.266528")))
val schemanames= "col1,ref"
The target DataFrame should be built only from the Map in input5 (columns such as col1 and ref). There can be many other columns (col2, col3, ...). Any additional columns in the Map will also be listed in schemanames.
The schemanames variable should define the structure, and input5.row (the Map) should be the data source. The number of columns in schemanames can run into the hundreds, and the same applies to the data in input5.row.
This would work for any number of columns, as long as they're all Strings, and each Entry contains a map with values for all of these columns:
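import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
// split to column names: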
val columns = schemanames.split(",")
// create the DataFrame schema with these columns (in that order)
val schema = StructType(columns.map(StructField(_, StringType)))
// convert input5 to Seq[Row], while selecting the values from "row" Map in same order of columns
val rows = input5.map(_.row)
.map(valueMap => columns.map(valueMap.apply).toSeq)
.map(Row.fromSeq)
// finally - create dataframe
val dataframe = spark.createDataFrame(sc.parallelize(rows), schema)
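One caveat: valueMap.apply throws a NoSuchElementException when a column name is absent from the map, and the sample map in the question has the key "col1 " with a trailing space. A possible way to make the value extraction tolerant is sketched below in plain Scala; orderedValues and the "n/a" default are hypothetical choices, not part of the original answer. Each resulting sequence could then be passed to Row.fromSeq as above.

```scala
case class Entry(schemaName: String, updType: String, ts: Long, row: Map[String, String])

// Sample from the question, including the key "col1 " with a trailing space
val input5 = List(Entry("a", "b", 0L, Map("col1 " -> "0000555", "ref" -> "2017-08-12 12:12:12.266528")))
val schemanames = "col1,ref"
val columns: List[String] = schemanames.split(",").map(_.trim).toList

// Hypothetical helper: trim the map keys, then pick values in schema order,
// falling back to a default when a column is missing
def orderedValues(row: Map[String, String], columns: Seq[String], default: String = "n/a"): Seq[String] = {
  val trimmed = row.map { case (k, v) => k.trim -> v }
  columns.map(c => trimmed.getOrElse(c, default))
}

val valueRows: List[Seq[String]] = input5.map(e => orderedValues(e.row, columns))
```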
You can go through entries in schemanames (which are presumably selected keys in the Map based on your description) along with a UDF for Map manipulation to assemble the dataframe as shown below:
case class Entry(schemaName: String, updType: String, ts: Long, row: Map[String, String])
val input5 = List(
Entry("a", "b", 0, Map("col1" -> "0000555", "ref" -> "2017-08-12 12:12:12.266528")),
Entry("c", "b", 1, Map("col1" -> "0000444", "col2" -> "0000444", "ref" -> "2017-08-14 14:14:14.0")),
Entry("a", "d", 0, Map("col2" -> "0000666", "ref" -> "2017-08-16 16:16:16.0")),
Entry("e", "f", 0, Map("col1" -> "0000777", "ref" -> "2017-08-17 17:17:17.0", "others" -> "x"))
)
val schemanames= "col1, ref"
// Create dataframe from input5
val df = input5.toDF
// A UDF to get column value from Map
def getColVal(c: String) = udf(
  (m: Map[String, String]) =>
    m.getOrElse(c, "n/a")
)
// Add columns based on entries in schemanames
val cols = schemanames.split(",").map(_.trim)
val df2 = cols.foldLeft(df)(
  (acc, c) => acc.withColumn(c, getColVal(c)(df("row")))
)
// Keep only the columns named in schemanames
val df3 = df2.select(cols.map(c => col(c)): _*)
df3.show(truncate=false)
+-------+--------------------------+
|col1 |ref |
+-------+--------------------------+
|0000555|2017-08-12 12:12:12.266528|
|0000444|2017-08-14 14:14:14.0 |
|n/a |2017-08-16 16:16:16.0 |
|0000777|2017-08-17 17:17:17.0 |
+-------+--------------------------+

Adding/selecting fields to/from RDD

I have an RDD, say dataRdd, with fields such as timestamp, url, ...
I want to create a new RDD containing only a few fields from dataRdd.
The following code segment creates the new RDD, but timestamp and URL are treated as values rather than as field/column names:
var fewfieldsRDD= dataRdd.map(r=> ( "timestamp" -> r.timestamp , "URL" -> r.url))
However, with the code segment below, one, two, three, arrival, and SFO are treated as column names:
val numbers = Map("one" -> 1, "two" -> 2, "three" -> 3)
val airports = Map("arrival" -> "Otopeni", "SFO" -> "San Fran")
val numairRdd= sc.makeRDD(Seq(numbers, airports))
Can anyone tell me what I am doing wrong, and how I can create a new RDD with field names mapped to values from another RDD?
You are creating an RDD of tuples, not Map objects. Try:
var fewfieldsRDD= dataRdd.map(r=> Map( "timestamp" -> r.timestamp , "URL" -> r.url))
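The difference is visible in the element types even without Spark. In the sketch below, Record is a hypothetical stand-in for whatever type dataRdd holds, and a plain Seq takes the place of the RDD.

```scala
// Hypothetical record type standing in for the elements of dataRdd
case class Record(timestamp: Long, url: String)

val data = Seq(Record(1L, "/home"), Record(2L, "/about"))

// Parentheses around two pairs build a tuple of tuples; the names are just data
val asTuples: Seq[((String, Long), (String, String))] =
  data.map(r => ("timestamp" -> r.timestamp, "URL" -> r.url))

// Map(...) builds a key -> value lookup, so the names become keys
val asMaps: Seq[Map[String, Any]] =
  data.map(r => Map("timestamp" -> r.timestamp, "URL" -> r.url))
```

With asMaps, a field can be looked up by name, e.g. asMaps.head("URL"); with asTuples it can only be reached by position, e.g. asTuples.head._2._2.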

Invert map and reduceByKey in Spark-Scala

I have a CSV dataset that I want to process using Spark. The second column has this format:
yyyy-MM-dd hh:mm:ss
I want to group the rows by MM-dd:
val days : RDD = sc.textFile(<csv file>)
val partitioned = days.map(row => {
row.split(",")(1).substring(5,10)
}).invertTheMap.groupOrReduceByKey
The result of groupOrReduceByKey is of form:
("MM-dd" -> (row1, row2, row3, ..., row_n) )
How should I implement invertTheMap and groupOrReduceByKey?
I saw this in Python here, but I wonder how it is done in Scala?
This should do the trick
val testData = List("a, 1987-09-30",
"a, 2001-09-29",
"b, 2002-09-30")
val input = sc.parallelize(testData)
val grouped = input.map{
row =>
val columns = row.split(",")
(columns(1).substring(6, 11), row)
}.groupByKey()
grouped.foreach(println)
The output is
(09-29,CompactBuffer(a, 2001-09-29))
(09-30,CompactBuffer(a, 1987-09-30, b, 2002-09-30))
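Note that substring(6, 11) relies on the space after the comma in the test data; for rows without that space, the question's substring(5, 10) applies. Trimming the column first makes the key extraction independent of that detail. The sketch below shows this on a plain List, using groupBy from the Scala collections in place of the RDD groupByKey; monthDay is a hypothetical helper and the sample rows are made up.

```scala
// Made-up rows: second column has the form yyyy-MM-dd hh:mm:ss, with or without a leading space
val testData = List(
  "a, 1987-09-30 10:15:00",
  "a,2001-09-29 08:00:00",
  "b, 2002-09-30 23:59:59"
)

// Hypothetical helper: take the second column, trim it, and keep the MM-dd part
def monthDay(row: String): String =
  row.split(",")(1).trim.substring(5, 10)

// Group whole rows by their MM-dd key
val grouped: Map[String, List[String]] = testData.groupBy(monthDay)
```

Here grouped("09-30") holds the 1987 and 2002 rows, and grouped("09-29") holds the 2001 row.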