This seems like it should be relatively straightforward but I haven't been able to find an example of how to do this efficiently after scouring many resources.
I have a Spark DataFrame where each row is a single string with alternating keys and values separated by the same separator (space). It is formatted like so:
| value |
| ----------------------------------------|
| key1 value1 key2 value2 key3 value3 ... |
My intent is to map this into a DataFrame that looks like this:
| key1 | key2 | key3 | ... |
| ------ | ------ | ------ | --- |
| value1 | value2 | value3 | ... |
The names of the keys are not known ahead of time, nor is the number of pairs. However, I could make a solution work that started with a static list of keys we care about if that makes it workable.
I had hoped str_to_map might work, but it does not when the key/value separator is the same as the pair separator. I could do df.select("value").as[String].flatMap(_.split(" ")) and then presumably massage that array into a new DataFrame somehow, but I'm having trouble getting it right. Any ideas? Thank you.
Doing something like this worked out alright, but did require collecting the keys we care about ahead of time.
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import spark.implicits._

val fields = Seq(...)
val fieldIndices = fields.zipWithIndex.toMap
val structFields = fields.map(f => StructField(f, StringType, nullable = false))
val schema = StructType(structFields)
val rowEncoder = RowEncoder.apply(schema)
val rowDS = inputDF.select($"value".cast(StringType))
.as[String]
.map(_.split(" "))
.map(tokens => {
val values = Array.fill(fields.length)("")
tokens.grouped(2).foreach {
case Array(k, v) if fieldIndices.contains(k) => values(fieldIndices(k)) = v
case _ => ()
}
Row.fromSeq(values.toSeq)
}) (rowEncoder)
Would still be interested in other efficient approaches.
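One possible alternative that stays in the DataFrame API (a sketch, untested, assuming Spark 3.x for the higher-order functions and the same known fields list as above):
import org.apache.spark.sql.functions._
import spark.implicits._

// Split into tokens, then pair even-indexed tokens (keys) with odd-indexed
// tokens (values) into a map column.
val tokens = split($"value", " ")
val kvMap = map_from_arrays(
  filter(tokens, (x, i) => (i % 2) === 0),  // keys
  filter(tokens, (x, i) => (i % 2) === 1)   // values
)

val result = fields
  .foldLeft(inputDF.withColumn("kv", kvMap))((df, f) => df.withColumn(f, $"kv".getItem(f)))
  .drop("kv", "value")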
Related
Let's say I have a List[String] and I want to merge it with an RDD so that each object in the RDD gets each value in the List added to it:
val myBands: List[String] = List("Band1", "Band2")
Table: BandMembers
|name | instrument |
| ----- | ---------- |
| slash | guitar |
| axl | vocals |
case class BandMembers(name: String, instrument: String)
val myRDD = BandMembersTable.map(a => BandMembers(a.name, a.instrument))
//join the myRDD to myBands
// how do I do this?
//var result = myRdd.join/merge/union(myBands);
Desired result:
|name | instrument | band |
| ----- | ---------- |------|
| slash | guitar | band1|
| slash | guitar | band2|
| axl | vocals | band1|
| axl | vocals | band2|
I'm not quite sure how to go about this in the best way for Spark/Scala. I know I can convert to DF and then use spark sql to do the joins, but there has to be a better way with the RDD and List, or so I think.
The style is a bit off here, but assuming you really need RDDs instead of Datasets, you can use a cartesian product.
So with RDD:
import org.apache.spark.sql.Row

case class BandMembers(name: String, instrument: String)
val myRDD = spark.sparkContext.parallelize(BandMembersTable.map(a => BandMembers(a.name, a.instrument)))
val myBands = spark.sparkContext.parallelize(Seq("Band1", "Band2"))
// cartesian pairs every band member with every band
val res = myRDD.cartesian(myBands).map { case (a, b) => Row(a.name, a.instrument, b) }
With Dataset:
import spark.implicits._

case class BandMembers(name: String, instrument: String)
val myDS = BandMembersTable.map(a => BandMembers(a.name, a.instrument)).toDS
val myBands = Seq("Band1", "Band2").toDS
val res = myDS.crossJoin(myBands)
Input data:
val BandMembersTable = Seq(BandMembers("a", "b"), BandMembers("c", "d"))
val myBands = Seq("Band1","Band2")
Output with Dataset:
+----+----------+-----+
|name|instrument|value|
+----+----------+-----+
|a |b |Band1|
|a |b |Band2|
|c |d |Band1|
|c |d |Band2|
+----+----------+-----+
Printing the RDD result (each element is a Row):
[a,b,Band1]
[c,d,Band2]
[c,d,Band1]
[a,b,Band2]
Consider using RDD zip for this. From the official docs:
RDD<scala.Tuple2<T,U>> zip(RDD other, scala.reflect.ClassTag evidence$11)
Zips this RDD with another one, returning key-value pairs with the first element in each RDD, second element in each RDD, etc. Assumes that the two RDDs have the same number of partitions and the same number of elements in each partition.
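For what it's worth, a minimal sketch of zip's pairing behavior (note that zip pairs elements one-to-one rather than producing every combination, and it requires both RDDs to have the same number of partitions and elements per partition):
val names = sc.parallelize(Seq("slash", "axl"), 2)
val bands = sc.parallelize(Seq("Band1", "Band2"), 2)
// zip pairs the i-th element of each RDD, so this yields two rows, not four
names.zip(bands).collect()
// Array((slash,Band1), (axl,Band2))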
How do I handle the case where my delimiter appears inside the data when loading a file using a Spark RDD?
My data looks like below:
NAME|AGE|DEP
Suresh|32|BSC
"Sathish|Kannan"|30|BE
How can I convert this into 3 columns like below?
NAME            AGE  DEP
Suresh          32   BSC
Sathish|Kannan  30   BE
Please see how I tried to load the data:
scala> val rdd = sc.textFile("file:///test/Sample_dep_20.txt",2)
rdd: org.apache.spark.rdd.RDD[String] = hdfs://Hive/Sample_dep_20.txt MapPartitionsRDD[1] at textFile at <console>:27
rdd.collect.foreach(println)
101|"Sathish|Kannan"|BSC
102|Suresh|DEP
scala> val rdd2=rdd.map(x=>x.split("\""))
rdd2: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] at map at <console>:29
scala> val rdd3=rdd2.map(x=>
| {
| var strarr = scala.collection.mutable.ArrayBuffer[String]()
| for(v<-x)
| {
| if(v.startsWith("\"") && v.endsWith("\""))
| strarr +=v.replace("\"","")
| else if(v.contains(","))
| strarr ++=v.split(",")
| else
| strarr +=v
| }
| strarr
| }
| )
rdd3: org.apache.spark.rdd.RDD[scala.collection.mutable.ArrayBuffer[String]] = MapPartitionsRDD[3] at map at <console>:31
scala> rdd3.collect.foreach(println)
ArrayBuffer(101|, Sathish|Kannan, |BSC)
ArrayBuffer(102|Suresh|DEP)
Maybe you need to explicitly define " as the quote character (it is the default for the CSV reader, but maybe not in your case?). Adding .option("quote", "\"") to the options when reading your .csv file should work.
scala> val inputds = Seq("Suresh|32|BSC","\"Satish|Kannan\"|30|BE").toDS()
inputds: org.apache.spark.sql.Dataset[String] = [value: string]
scala> val outputdf = spark.read.option("header",false).option("delimiter","|").option("quote","\"").csv(inputds)
outputdf: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 1 more field]
scala> outputdf.show(false)
+-------------+---+---+
|_c0 |_c1|_c2|
+-------------+---+---+
|Suresh |32 |BSC|
|Satish|Kannan|30 |BE |
+-------------+---+---+
Defining the quote character makes DataFrameReader ignore delimiters found inside quoted strings; see the Spark API doc for DataFrameReader.
EDIT
If you want to play hard and still use plain RDDs, then try modifying your split() function like this:
val rdd2=rdd.map(x=>x.split("\\|(?=([^\"]*\"[^\"]*\")*[^\"]*$)"))
It uses positive look-ahead to ignore | delimiters found inside quotes, and saves you from doing string manipulations in your second .map.
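A quick sanity check of that regex on one of the sample lines (note the quote characters themselves are kept, so you may still want a replace to strip them):
"\"Sathish|Kannan\"|30|BE".split("\\|(?=([^\"]*\"[^\"]*\")*[^\"]*$)")
// Array("Sathish|Kannan", 30, BE)  -- the | inside the quotes is not split on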
I'm trying to do something quite simple: I have two arrays that have been converted into a DataFrame, and I want to show all possible combinations. So for example my output at the moment looks something like this:
+-----------+-----------+
| A | B |
+-----------+-----------+
| First | T |
| Second | P |
+-----------+-----------+
However what I'm actually looking for is this:
+-----------+-----------+
| A | B |
+-----------+-----------+
| First | T |
| First | P |
| Second | T |
| Second | P |
+-----------+-----------+
So far I've got some fairly straightforward code to map my arrays into columns, but being quite new to both Scala and Spark I'm not sure how I'd grab all those combinations. Here is what I have so far:
val firstColumnValues = Array("First", "Second")
val secondColumnValues = Array("T", "P")
val xs = Array(firstColumnValues, secondColumnValues).transpose
val mapped = sparkContext.parallelize(xs).map(ys => Row(ys(0), ys(1)))
val df = mapped.toDF("A", "B")
df.show
...
case class Row(first: String, second: String)
Thanks in advance for any help
In Spark 2.3:
import spark.implicits._

val firstColumnValues = sc.parallelize(Array("First", "Second")).toDF("A")
val secondColumnValues = sc.parallelize(Array("T", "P")).toDF("B")
firstColumnValues.crossJoin(secondColumnValues).show()
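A sketch of the same cross join built straight from the original arrays (assuming a SparkSession named spark with import spark.implicits._ in scope):
import spark.implicits._

val dfA = Array("First", "Second").toSeq.toDF("A")
val dfB = Array("T", "P").toSeq.toDF("B")
dfA.crossJoin(dfB).show()
// Expected output (row order may vary):
// +------+---+
// |     A|  B|
// +------+---+
// | First|  T|
// | First|  P|
// |Second|  T|
// |Second|  P|
// +------+---+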
I am using spark streaming and am creating this dataframe from the kafka message:
|customer|initialLoadComplete|initialLoadRunning| messageContent| tableName|
+--------+-------------------+------------------+--------------------+-----------------+
| A| false| true|TEFault_IdReason...|Timed_Event_Fault|
| A| false| true|TEFault_IdReason...|Timed_Event_Fault|
+--------+-------------------+------------------+--------------------+-----------------+
Now I want to extract messageContent. It is basically like a CSV that contains the raw data, where the first line holds the column names.
I can extract the headers from the messageContent field in the following way:
val Array1 = ssc.sparkContext.parallelize(Seq(rowD.getString(2).split("\u0002")(0)))
So Array1 looks like this:
Array1: col1^Acol2^Acol3
Array2 is basically the raw data, with each column value separated by ^A and each record separated by ^B.
^A is the column separator; ^B is the record separator.
So this is what array2 could look like:
Array2 = value1^Avalue2^Avalue3^Bvalue4^Avalue5^Avalue6^Bvalue7^Avalue8^Avalue9
Basically I want to create a dataframe out of this so it looks like this:
col1 | col2 | col3
-------------------------
value1 | value2 | value3
value4 | value5 | value6
value7 | value8 | value9
^B is the record delimiter.
When we were reading from a hdfs file, we created a dataframe via this command:
val df = csc.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").option("delimiter", "\u0001").load(hdfsFile)
But this time I am creating a DataFrame from two in-memory arrays. Array1 holds the headers for the values in Array2, and the records in Array2 are separated by ^B.
What would be the equivalent way to create a DataFrame from these arrays, like I did when creating one from a file?
I am inferring the following from your question.
Array1 is an RDD with only one entry: col1^Acol2^Acol3
Array2 is an RDD with each entry looking something like this: value1^Avalue2^Avalue3^Bvalue4^Avalue5^Avalue6^Bvalue7^Avalue8^Avalue9
With these assumptions in place, the following should work.
import spark.implicits._  // for .toDF

val array1 = sc.parallelize(Seq("col1\u0001col2\u0001col3"))
val array2 = sc.parallelize(Seq("value1\u0001value2\u0001value3\u0002value4\u0001value5\u0001value6\u0002value7\u0001value8\u0001value9"))
val data = array2.flatMap(x => x.split("\u0002")).map(x => x.split('\u0001')).collect()  // optional: inspect the parsed records locally
val result = array2
.flatMap(x => x.split("\u0002"))
.map(x => x.split('\u0001'))
.map({ case Array(x,y,z) => (x,y,z)})
.toDF(array1.flatMap(x => x.split('\u0001')).collect(): _*)
result.show()
+------+------+------+
| col1| col2| col3|
+------+------+------+
|value1|value2|value3|
|value4|value5|value6|
|value7|value8|value9|
+------+------+------+
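If the number of columns is not fixed at three, a variation of the same idea (a sketch, assuming every record has exactly as many values as the header):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Build the schema from the header RDD, then map each record to a Row of the same width
val header = array1.first().split('\u0001')
val schema = StructType(header.map(StructField(_, StringType, nullable = true)))
val rows = array2
  .flatMap(_.split("\u0002"))
  .map(rec => Row.fromSeq(rec.split('\u0001').toSeq))
val df = spark.createDataFrame(rows, schema)
df.show()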
In Scala I have a List[String] that I want to add as a new column to an existing DataFrame.
Original DF:
Name | Date
======|===========
Rohan | 2007-12-21
... | ...
... | ...
Suppose I want to add a new Department column.
Expected DF:
Name | Date | Department
=====|============|============
Rohan| 2007-12-21 | Comp
... | ... | ...
... | ... | ...
How can I do this in Scala?
One way to do it is to create a DataFrame of the names and the list values, and then join the two DataFrames on the name column.
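A minimal sketch of that suggestion (the department values and the second name are hypothetical, and it assumes the list lines up one-to-one with the names so the two can be zipped into pairs):
import spark.implicits._

val departments = List("Comp", "Mech")                 // the List[String] to add
val deptDF = List("Rohan", "Priya").zip(departments)   // pair each name with a value
  .toDF("Name", "Department")

val result = originalDF.join(deptDF, Seq("Name"))      // originalDF is the existing Name/Date DataFrame
result.show()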
This solved my issue
val newrows = dataset.rdd.zipWithIndex.map(_.swap)  // key each row by its index
  .join(spark.sparkContext.parallelize(results).zipWithIndex.map(_.swap))  // key each list value by its index and join on it
  .values  // drop the index keys, keeping (Row, String) pairs
  .map { case (row: Row, x: String) => Row.fromSeq(row.toSeq :+ x) }  // append the list value to the end of the row
I would still appreciate an exact explanation of how it works.