Adding/selecting fields to/from RDD - scala

I have an RDD, let's say dataRdd, with fields like timestamp, url, ...
I want to create a new RDD with a few fields from this dataRdd.
The following code segment creates the new RDD, but timestamp and URL are treated as values and not as field/column names:
var fewfieldsRDD = dataRdd.map(r => ("timestamp" -> r.timestamp, "URL" -> r.url))
However, with the code segment below, one, two, three, arrival, and SFO are treated as column names:
val numbers = Map("one" -> 1, "two" -> 2, "three" -> 3)
val airports = Map("arrival" -> "Otopeni", "SFO" -> "San Fran")
val numairRdd= sc.makeRDD(Seq(numbers, airports))
Can anyone tell me what I am doing wrong, and how I can create a new RDD with field names mapped to values from another RDD?

You are creating an RDD of tuples, not Map objects. Try:
var fewfieldsRDD = dataRdd.map(r => Map("timestamp" -> r.timestamp, "URL" -> r.url))
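For a quick REPL illustration of the difference (a minimal sketch, with literal values standing in for r.timestamp and r.url):

// "k" -> v is just sugar for the tuple ("k", v), so this expression is a
// pair of pairs, with no notion of field names:
val asTuples = ("timestamp" -> 1L, "URL" -> "http://example.com")

// Wrapping the same pairs in Map(...) turns them into real key/value entries:
val asMap = Map("timestamp" -> 1L, "URL" -> "http://example.com")
asMap("URL") // returns "http://example.com"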

Related

How to change the schema of existing dataframe

Problem statement: I have a CSV file with around 100+ fields. I need to perform transformations on these fields, generate 80+ new fields, and write only these new fields to S3 in Parquet format.
The predefined Parquet schema = 80+ newly populated fields + some non-populated fields.
Is there any way to pass this predefined Parquet schema while writing the data to S3, so that the extra fields are also populated with null data?
select will not be enough to pick out only the 80+ fields, as the predefined schema might have around 120 predefined fields.
Below are the sample data and the transformation requirement. CSV data:
aid, productId, ts, orderId
1000,100,1674128580179,edf9929a-f253-487
1001,100,1674128580179,cc41a026-63df-410
1002,100,1674128580179,9732755b-1207-471
1003,100,1674128580179,51125ddd-4129-48a
1001,200,1674128580179,f4917676-b08d-41e
1004,200,1674128580179,dc80559d-16e6-4fa
1005,200,1674128580179,c9b743eb-457b-455
1006,100,1674128580179,e8611141-3e0e-4d5
1002,200,1674128580179,30be34c7-394c-43a
Parquet schema
def getPartitionFieldsSchema() = {
  List(
    Map("name" -> "company", "type" -> "long",
      "nullable" -> true, "metadata" -> Map()),
    Map("name" -> "epoch_day", "type" -> "long",
      "nullable" -> true, "metadata" -> Map()),
    Map("name" -> "account", "type" -> "string",
      "nullable" -> true, "metadata" -> Map())
  )
}
val schemaMap = Map("type" -> "struct",
  "fields" -> getPartitionFieldsSchema)
Simple example:
val dataDf = spark
  .read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("./scripts/input.csv")

dataDf
  .withColumn("company", lit(col("aid") / 100))
  .withColumn("epoch_day", lit(col("ts") / 86400))
  .write // how to write only company, epoch_day, account ?
  .mode("append")
  .csv("/tmp/data2")
The output should have the following columns: company, epoch_day, account.
This is how I understand your problem:
you want to read some CSVs and transform them to Parquet in S3;
during the transformation, you need to create 3 new columns based on the existing columns in the CSV files;
but since only 2 of the 3 new columns are computed, the output shows only two of the new columns, not all 3.
In that case, you can create an external table in Redshift in which you specify all the columns. As a result, even if some columns are not fed, they will be null in your external table.
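If you would rather handle this on the Spark side instead, here is a minimal sketch of one common approach (the helper name conformToSchema is hypothetical, and it assumes the predefined schema is available as a StructType, e.g. hand-built or parsed via DataType.fromJson): add every schema column missing from the dataframe as a typed null, then select the columns in schema order before writing.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.StructType

// Hypothetical helper: pad df with any schema columns it lacks (as typed
// nulls), then select all columns in the schema's declared order.
def conformToSchema(df: DataFrame, schema: StructType): DataFrame = {
  val present = df.columns.toSet
  val padded = schema.fields.foldLeft(df) { (acc, field) =>
    if (present.contains(field.name)) acc
    else acc.withColumn(field.name, lit(null).cast(field.dataType))
  }
  padded.select(schema.fieldNames.map(col): _*)
}

The result of conformToSchema(transformedDf, predefinedSchema) can then be written with .write.mode("append").parquet(...) as usual.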

How to replace a value for a particular key in List[Row] in scala [duplicate]

This question already has answers here:
How to calculate sum and count in a single groupBy?
(3 answers)
Closed 4 years ago.
New to Scala!! I have a List[Row], where Row is org.apache.spark.sql.Row, and it holds something like
val list = List("""{"name":"abc","salary":"somenumber","id":"1"}""")
How do I replace the key salary with something else?
Please find the solution below. I hope it will help you.
If your input is like the one below:
scala> val list = List("""{"name":"abc","salary":"somenumber","id":"1"}""")
list: List[String] = List({"name":"abc","salary":"somenumber","id":"1"})
I am converting each JSON string to a scala.collection.immutable.Map[String,String] (if your elements are org.apache.spark.sql.Row, first extract the string with row(0).toString):
scala> val listOfMaps = list.map(str => str.replaceAll("[{}\"]", "").split(",").map(kv => (kv.split(":")(0), kv.split(":")(1))).toMap)
listOfMaps: List[scala.collection.immutable.Map[String,String]] = List(Map(name -> abc, salary -> somenumber, id -> 1))
Since I can't update a value in an immutable map, I convert it to a mutable map and update the value there:
import collection.mutable.Map
scala> val mutableMap = listOfMaps.map(mp => collection.mutable.Map(mp.toSeq: _*)).map(mp => mp + ("salary" -> "2000"))
mutableMap: List[scala.collection.mutable.Map[String,String]] = List(Map(name -> abc, salary -> 2000, id -> 1))
Getting the output back in the original List[Row] format:
scala> val ans = mutableMap.map(mp => Row("{" + mp.mkString(",").replaceAll("->", ":") + "}"))
ans: List[org.apache.spark.sql.Row] = List([{name : abc,salary : 2000,id : 1}])
If you want to keep the data in Spark SQL Rows, you can perform the groupBy operation in Spark itself. See below, taken from How to calculate sum and count in a single groupBy?
// In 1.3.x, in order for the grouping column "department" to show up,
// it must be included explicitly as part of the agg function call.
df.groupBy("department").agg($"department", max("age"), sum("expense"))
// In 1.4+, grouping column "department" is included automatically.
df.groupBy("department").agg(max("age"), sum("expense"))
Feel free to vote on the close duplicate; placing this here until a decision is made.

How to update multiple columns of Dataframe from given set of maps in Scala?

I have the below dataframe:
val df=Seq(("manuj","kumar","CEO","Info"),("Alice","Beb","Miniger","gogle"),("Ram","Kumar","Developer","Info Delhi")).toDF("fname","lname","designation","company")
or
+-----+-----+-----------+----------+
|fname|lname|designation| company|
+-----+-----+-----------+----------+
|manuj|kumar| CEO| Info|
|Alice| Beb| Miniger| gogle|
| Ram|Kumar| Developer|Info Delhi|
+-----+-----+-----------+----------+
Below are the given maps for the individual columns:
val fnameMap=Map("manuj"->"Manoj")
val lnameMap=Map("Beb"->"Bob")
val designationMap=Map("Miniger"->"Manager")
val companyMap=Map("Info"->"Info Ltd","gogle"->"Google","Info Delhi"->"Info Ltd")
I also have a list of columns which need to be updated, so my requirement is to update all the columns of the dataframe (df) that are in the given list of columns, using the given maps.
val colList=Iterator("fname","lname","designation","company")
Output must be like
+-----+-----+-----------+--------+
|fname|lname|designation| company|
+-----+-----+-----------+--------+
|Manoj|kumar| CEO|Info Ltd|
|Alice| Bob| Manager| Google|
| Ram|Kumar| Developer|Info Ltd|
+-----+-----+-----------+--------+
Edit: The dataframe may have around 1200 columns and colList will have fewer than 1200 column names, so I need to iterate over colList and update the value of each corresponding column from its corresponding map.
Since DataFrames are immutable, the problem can be processed progressively, column by column: for each column, create a new DataFrame containing an intermediate column with the replaced values, then rename that column to the initial name, overwriting the previous DataFrame.
To achieve all this, several steps are necessary.
First, we'll need a udf that returns a replacement value if it occurs in the provided map:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

def replaceValueIfMapped(mappedValues: Map[String, String]) = udf((cellValue: String) =>
  mappedValues.getOrElse(cellValue, cellValue)
)
Second, we'll need a generic function that expects a DataFrame, a column name and its replacements map. This function produces a dataframe with a temporary column, containing replaced values, drops the original column, renames the temporary one to the original name and finally returns the produced DataFrame:
def replaceColumnValues(toReplaceDf: DataFrame, column: String, mappedValues: Map[String, String]): DataFrame = {
  val replacedColumn = column + "_replaced"
  toReplaceDf.withColumn(replacedColumn, replaceValueIfMapped(mappedValues)(col(column)))
    .drop(column)
    .withColumnRenamed(replacedColumn, column)
}
Third, instead of having an Iterator on column names for replacements, we'll use a Map, where each column name is associated with a replacements map:
val colsToReplace = Map(
  "fname" -> fnameMap,
  "lname" -> lnameMap,
  "designation" -> designationMap,
  "company" -> companyMap)
Finally, we can call foldLeft on this map in order to execute all the replacements:
val replacedDf = colsToReplace.foldLeft(df) { case (alreadyReplaced, (column, mappings)) =>
  replaceColumnValues(alreadyReplaced, column, mappings)
}
replacedDf now contains the expected result.
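As a quick sanity check (using the df and the maps defined in the question), printing the result should reproduce the expected table shown above:

replacedDf.show()
// +-----+-----+-----------+--------+
// |fname|lname|designation| company|
// +-----+-----+-----------+--------+
// |Manoj|kumar|        CEO|Info Ltd|
// |Alice|  Bob|    Manager|  Google|
// |  Ram|Kumar|  Developer|Info Ltd|
// +-----+-----+-----------+--------+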
To make the lookup dynamic at this level, you'll probably need to change the way you map your values, to make them dynamically searchable. I would make a map of maps, with the keys being the names of the columns as they are expected to be passed in:
val fnameMap=Map("manuj"->"Manoj")
val lnameMap=Map("Beb"->"Bob")
val designationMap=Map("Miniger"->"Manager")
val companyMap=Map("Info"->"Info Ltd","gogle"->"Google","Info Delhi"->"Info Ltd")
val allMaps = Map("fname"->fnameMap,
"lname" -> lnameMap,
"designation" -> designationMap,
"company" -> companyMap)
This may make sense as the maps are relatively small, but you may need to consider using broadcast variables.
You can then dynamically look up based on field names.
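A minimal sketch of the broadcast idea in Scala (assuming a SparkSession named spark, the df from the question, and the allMaps value above; the replaceFromBroadcast name is hypothetical):

import org.apache.spark.sql.functions.{col, udf}

// Ship the lookup maps to the executors once instead of once per task:
val allMapsBc = spark.sparkContext.broadcast(allMaps)

// Build a replacement UDF for a given column name:
def replaceFromBroadcast(column: String) = udf((value: String) =>
  allMapsBc.value.get(column).map(_.getOrElse(value, value)).getOrElse(value)
)

// Apply the per-column replacements:
val result = allMaps.keys.foldLeft(df) { (acc, c) =>
  acc.withColumn(c, replaceFromBroadcast(c)(col(c)))
}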
*[If you've noticed that my Scala code is bad, it's because it is. So here's a Java version for you to translate.]*
List<String> allColumns = Arrays.asList(df.columns());
df.map(row ->
    // this rewrites the row (that's a warning)
    RowFactory.create(
        allColumns.stream()
            .map(dfColumn -> {
                if (!colList.contains(dfColumn)) {
                    // column not requested for mapping, use old value
                    return row.get(allColumns.indexOf(dfColumn));
                } else {
                    Object colValue = row.get(allColumns.indexOf(dfColumn));
                    // in case of [2], you'd have to call:
                    // row.get(colListToDFIndex.get(dfColumn))
                    // modified value
                    return allMaps.get(dfColumn)
                        // assuming strings, you may need to cast
                        .getOrDefault(colValue, colValue);
                }
            })
            .collect(Collectors.toList())
            .toArray()
    )
);

Data frame creation from MAP of N elements with schemaDetails of N elements

How do I convert the input5 data format into a DataFrame, using the schema details mentioned in schemanames?
The conversion should be dynamic, without using Row(r(0), r(1)) - the number of columns can increase or decrease in both the input and the schema, hence the code should be dynamic.
case class Entry(schemaName: String, updType: String, ts: Long, row: Map[String, String])
val input5 = List(Entry("a","b",0,Map("col1" -> "0000555", "ref" -> "2017-08-12 12:12:12.266528")))
val schemanames = "col1,ref"
The target dataframe should be built only from the Map of input5 (like col1 and ref). There can be many other columns (like col2, col3...). If there are more columns in the Map, the same columns will be mentioned in the schema name.
The schemanames variable should be used to create the structure, and input5.row (the Map) should be the data source, as the number of columns in the schema name can be in the hundreds; the same applies to the data in input5.row.
This would work for any number of columns, as long as they're all Strings, and each Entry contains a map with values for all of these columns:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// split to column names:
val columns = schemanames.split(",")
// create the DataFrame schema with these columns (in that order)
val schema = StructType(columns.map(StructField(_, StringType)))
// convert input5 to rows, selecting the values from the "row" Map in the same column order
val rows = input5.map(_.row)
  .map(valueMap => columns.map(valueMap.apply).toSeq)
  .map(Row.fromSeq)
// finally - create the dataframe
val dataframe = spark.createDataFrame(sc.parallelize(rows), schema)
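A quick check with the single Entry in input5 (the output follows directly from the map values above):

dataframe.show(truncate = false)
// +-------+--------------------------+
// |col1   |ref                       |
// +-------+--------------------------+
// |0000555|2017-08-12 12:12:12.266528|
// +-------+--------------------------+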
You can go through the entries in schemanames (which are presumably the selected keys in the Map, based on your description), along with a UDF for Map manipulation, to assemble the dataframe as shown below:
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._

case class Entry(schemaName: String, updType: String, ts: Long, row: Map[String, String])

val input5 = List(
  Entry("a", "b", 0, Map("col1" -> "0000555", "ref" -> "2017-08-12 12:12:12.266528")),
  Entry("c", "b", 1, Map("col1" -> "0000444", "col2" -> "0000444", "ref" -> "2017-08-14 14:14:14.0")),
  Entry("a", "d", 0, Map("col2" -> "0000666", "ref" -> "2017-08-16 16:16:16.0")),
  Entry("e", "f", 0, Map("col1" -> "0000777", "ref" -> "2017-08-17 17:17:17.0", "others" -> "x"))
)
val schemanames = "col1, ref"

// Create dataframe from input5
val df = input5.toDF

// A UDF to get a column value from the Map (with a default for missing keys)
def getColVal(c: String) = udf(
  (m: Map[String, String]) => m.getOrElse(c, "n/a")
)

// Add columns based on entries in schemanames
val cols = schemanames.split(",").map(_.trim)
val df2 = cols.foldLeft(df)(
  (acc, c) => acc.withColumn(c, getColVal(c)(df("row")))
)

// Keep only the schema columns
val df3 = df2.select(cols.map(c => col(c)): _*)
df3.show(truncate=false)
+-------+--------------------------+
|col1 |ref |
+-------+--------------------------+
|0000555|2017-08-12 12:12:12.266528|
|0000444|2017-08-14 14:14:14.0 |
|n/a |2017-08-16 16:16:16.0 |
|0000777|2017-08-17 17:17:17.0 |
+-------+--------------------------+

Read CSV File to OrderedMap

I'm reading a CSV file and adding the data to a Map in Scala.
import java.io.{BufferedReader, File, FileInputStream, InputStreamReader}
import scala.collection.JavaConverters._
import org.apache.commons.csv.{CSVFormat, CSVParser}

val br = new BufferedReader(new InputStreamReader(new FileInputStream(new File(fileName)), "UTF-8"))
val inputFormat = CSVFormat.newFormat(delimiter.charAt(0)).withHeader().withQuote('"')
val csvRecords = new CSVParser(br, inputFormat).getRecords.asScala
val buffer = for (csvRecord <- csvRecords; if csvRecords != null && csvRecords.nonEmpty)
  yield csvRecord.toMap.asScala
buffer.toList
But as the Map is not ordered, I'm not able to read the columns in order. Is there any way to read the csvRecords in order?
The CSV file contains comma-separated values along with a header. It should generate output in List[mutable.LinkedHashMap[String, String]] format, something like [["fname", "A", "lname", "B"], ["fname", "C", "lname", "D"]].
The above code works, but it does not preserve the order. For example, if the CSV file contains the columns in the order fname, lname, the output map has lname first and fname last.
If I understand your question correctly, here's one way to create a list of LinkedHashMaps with the elements in order:
// Assuming your CSV File has the following content:
fname,lname,grade
John,Doe,A
Ann,Cole,B
David,Jones,C
Mike,Duke,D
Jenn,Rivers,E
import collection.mutable.LinkedHashMap
// Get indexed header from CSV
val indexedHeader = io.Source.fromFile("/path/to/csvfile").
  getLines.take(1).next.
  split(",").
  zipWithIndex

indexedHeader: Array[(String, Int)] = Array((fname,0), (lname,1), (grade,2))

// Aggregate a LinkedHashMap per record using foldLeft
val ListOfLHM = for (csvRecord <- csvRecords) yield
  indexedHeader.foldLeft(LinkedHashMap[String, String]())(
    (acc, x) => acc += (x._1 -> csvRecord.get(x._2))
  )
ListOfLHM: scala.collection.mutable.Buffer[scala.collection.mutable.LinkedHashMap[String,String]] = ArrayBuffer(
Map(fname -> John, lname -> Doe, grade -> A),
Map(fname -> Ann, lname -> Cole, grade -> B),
Map(fname -> David, lname -> Jones, grade -> C),
Map(fname -> Mike, lname -> Duke, grade -> D),
Map(fname -> Jenn, lname -> Rivers, grade -> E)
)