Spark DataFrame for-if loop taking a long time - Scala

I have a Spark DataFrame (df) of words with their start and end times, and I have to convert it into a DataFrame of sentences.
Basically, it should detect a new sentence whenever it finds a full stop (".") and start another row.
I have written the following code for this:
val spark = SparkSession.builder.appName("elasticSpark").master("local[*]").config("spark.scheduler.mode", "FAIR").getOrCreate()
val count = df.count.toInt
var emptyDF = Seq.empty[(Int, Int, String)].toDF("start_time", "end_time", "Sentences")
var b = 0
for (a <- 1 to count) {
  if (df.select("words").head(a)(a - 1).toSeq.head == "." || a == (count - 1)) {
    val myList1 = df.select("words").head(a).toArray.map(_.getString(0))
    val myList = df.select("words").head(a).toArray.map(_.getString(0)).splitAt(b)._2
    val text = myList.mkString(" ")
    val end_time = df.select("end_time").head(a)(a - 1).toSeq.head.toString.toInt
    val start_time = df.select("start_time").head(a)(b).toSeq.head.toString.toInt
    val df1 = spark.sparkContext.parallelize(Seq(start_time)).toDF("start_time")
    val df2 = spark.sparkContext.parallelize(Seq(end_time)).toDF("end_time")
    val df3 = spark.sparkContext.parallelize(Seq(text)).toDF("Sentences")
    val df4 = df1.crossJoin(df2).crossJoin(df3)
    emptyDF = emptyDF.union(df4).toDF
    b = a
  }
}
Though it gives the correct output, it takes ages to complete the iteration, and I have 117 other DataFrames on which I need to run the same logic.
Is there any way to tune this code, or any other way to achieve the above operation? Any help will be deeply appreciated.
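One hedged workaround, separate from the window-based answers below: since the loop walks the table row by row anyway, you can collect it once to the driver, build the sentences locally, and create a single DataFrame at the end instead of launching a Spark job and a union per sentence. This is only a sketch; it assumes the rows fit comfortably on the driver and mirrors the toString.toInt conversion of the original code.

case class Sentence(start_time: Int, end_time: Int, Sentences: String)

val rows = df.orderBy("start_time", "end_time")
             .select("start_time", "end_time", "words")
             .collect()

val sentences = scala.collection.mutable.ArrayBuffer.empty[Sentence]
val buffer = scala.collection.mutable.ArrayBuffer.empty[(Int, Int, String)]

rows.foreach { r =>
  buffer += ((r.get(0).toString.toInt, r.get(1).toString.toInt, r.getString(2)))
  if (r.getString(2) == ".") {                 // a full stop closes the current sentence
    sentences += Sentence(buffer.head._1, buffer.last._2, buffer.map(_._3).mkString(" "))
    buffer.clear()
  }
}
if (buffer.nonEmpty)                           // flush a trailing sentence with no "."
  sentences += Sentence(buffer.head._1, buffer.last._2, buffer.map(_._3).mkString(" "))

val result = spark.createDataFrame(sentences.toSeq)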

scala> import org.apache.spark.sql.expressions.Window
scala> df.show(false)
+----------+--------+--------+
|start_time|end_time|words |
+----------+--------+--------+
|132 |135 |Hi |
|135 |135 |, |
|143 |152 |I |
|151 |152 |am |
|159 |169 |working |
|194 |197 |on |
|204 |211 |hadoop |
|211 |211 |. |
|218 |222 |This |
|226 |229 |is |
|234 |239 |Spark |
|245 |249 |DF |
|253 |258 |coding |
|258 |258 |. |
|276 |276 |I |
+----------+--------+--------+
scala> val w = Window.orderBy("start_time", "end_time")
scala> df.withColumn("temp", sum(when(lag(col("words"), 1).over(w) === ".", lit(1)).otherwise(lit(0))).over(w))
.groupBy("temp").agg(min("start_time").alias("start_time"), max("end_time").alias("end_time"),concat_ws(" ",collect_list(trim(col("words")))).alias("sentenses"))
.drop("temp")
.show(false)
+----------+--------+-----------------------------+
|start_time|end_time|sentenses |
+----------+--------+-----------------------------+
|132 |211 |Hi , I am working on hadoop .|
|218 |258 |This is Spark DF coding . |
|276 |276 |I |
+----------+--------+-----------------------------+

Here is my try. You can use a window to separate the sentences by counting the number of "." occurrences in the current and following rows.
import org.apache.spark.sql.expressions.Window
val w = Window.orderBy("start_time").rowsBetween(Window.currentRow, Window.unboundedFollowing)
val df = Seq((132, 135, "Hi"),
(135, 135, ","),
(143, 152, "I"),
(151, 152, "am"),
(159, 169, "working"),
(194, 197, "on"),
(204, 211, "hadoop"),
(211, 211, "."),
(218, 222, "This"),
(226, 229, "is"),
(234, 239, "Spark"),
(245, 249, "DF"),
(253, 258, "coding"),
(258, 258, "."),
(276, 276, "I")).toDF("start_time", "end_time", "words")
df.withColumn("count", count(when(col("words") === ".", true)).over(w))
.groupBy("count")
.agg(min("start_time").as("start_time"), max("end_time").as("end_time"), concat_ws(" ", collect_list("words")).as("Sentences"))
.drop("count").show(false)
This will give you the following result, though there is a space between the words and the "," or "." characters:
+----------+--------+-----------------------------+
|start_time|end_time|Sentences |
+----------+--------+-----------------------------+
|132 |211 |Hi , I am working on hadoop .|
|218 |258 |This is Spark DF coding . |
|276 |276 |I |
+----------+--------+-----------------------------+
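If that extra space before "," and "." matters, one option (a hedged sketch; it assumes the grouped result above is kept in a DataFrame named result instead of being shown directly) is to post-process the concatenated column with regexp_replace:

import org.apache.spark.sql.functions.{col, regexp_replace}

// Remove the whitespace that concat_ws inserted before "," and "." (sketch only).
val cleaned = result.withColumn(
  "Sentences",
  regexp_replace(col("Sentences"), "\\s+([.,])", "$1")
)
cleaned.show(false)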

Here is my approach using a UDF, without a window function.
val df = Seq((123,245,"Hi"),(123,245,"."),(123,245,"Hi"),(123,246,"I"),(123,245,".")).toDF("start","end","words")

var count = 0
var flag = false
val counterUdf = udf((dot: String) => {
  if (flag) {
    count += 1
    flag = false
  }
  if (dot == ".")
    flag = true
  count
})

val df1 = df.withColumn("counter", counterUdf(col("words")))
val df2 = df1.groupBy("counter").agg(min("start").alias("start"), max("end").alias("end"), concat_ws(" ", collect_list("words")).alias("sentence")).drop("counter")
df2.show()
+-----+---+--------+
|start|end|sentence|
+-----+---+--------+
| 123|246| Hi I .|
| 123|245| Hi .|
+-----+---+--------+


How to create a dataframe from Array[Strings]?

I used rdd.collect() to create an Array, and now I want to use this Array[String] to create a DataFrame. My test file is in the following format (separated by a pipe |).
TimeStamp
IdC
Name
FileName
Start-0f-fields
column01
column02
column03
column04
column05
column06
column07
column08
column010
column11
End-of-fields
Start-of-data
G0002B|0|13|IS|LS|Xys|Xyz|12|23|48|
G0002A|0|13|IS|LS|Xys|Xyz|12|23|45|
G0002x|0|13|IS|LS|Xys|Xyz|12|23|48|
G0002C|0|13|IS|LS|Xys|Xyz|12|23|48|
End-of-data
The column names are between Start-of-fields and End-of-fields.
I want to store the pipe-separated ("|") values in different columns of a DataFrame,
like the example below:
column01 column02 column03 column04 column05 column06 column07 column08 column010 column11
G0002C 0 13 IS LS Xys Xyz 12 23 48
G0002x 0 13 LS MS Xys Xyz 14 300 400
My code:
val rdd = sc.textFile("the above text file")
val columns = rdd.collect.slice(5,16).mkString(",") // it will hold columnnames
val data = rdd.collect.slice(5,16)
val rdd1 = sc.parallelize(rdd.collect())
val df = rdd1.toDf(columns)
But this is not giving me the desired DataFrame shown above.
Could you try this?
import spark.implicits._ // needed for `toDS()` and `toDF()`

val rdd = sc.textFile("the above text file")
val columns = rdd.collect.slice(5, 15)        // the column names (`.mkString(",")` is not needed)
val dataDS = rdd.collect.slice(17, 21)        // the data rows
  .map(_.trim())                              // to remove whitespace
  .map(s => s.substring(0, s.length - 1))     // to remove the trailing pipe '|'
  .toSeq
  .toDS
val df = spark.read
  .option("header", false)
  .option("delimiter", "|")
  .csv(dataDS)
  .toDF(columns: _*)

df.show(false)
+--------+--------+--------+--------+--------+--------+--------+--------+---------+--------+
|column01|column02|column03|column04|column05|column06|column07|column08|column010|column11|
+--------+--------+--------+--------+--------+--------+--------+--------+---------+--------+
|G0002B |0 |13 |IS |LS |Xys |Xyz |12 |23 |48 |
|G0002A |0 |13 |IS |LS |Xys |Xyz |12 |23 |45 |
|G0002x |0 |13 |IS |LS |Xys |Xyz |12 |23 |48 |
|G0002C |0 |13 |IS |LS |Xys |Xyz |12 |23 |48 |
+--------+--------+--------+--------+--------+--------+--------+--------+---------+--------+
Calling the spark.read...csv() method without a schema can take a long time with huge data because of schema inference (e.g. Additional reading).
In that case, you can specify the schema as below.
/*
column01 STRING,
column02 STRING,
column03 STRING,
...
*/
val schema = columns
  .map(c => s"$c STRING")
  .mkString(",\n")

val df = spark.read
  .option("header", false)
  .option("delimiter", "|")
  .schema(schema) // no schema inference occurs
  .csv(dataDS)
  // .toDF(columns: _*) => unnecessary when the schema is specified
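An equivalent sketch (just an alternative, not required by the answer above) builds the same schema as an explicit StructType instead of a DDL string:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// The same string columns, expressed as a StructType (sketch).
val structSchema = StructType(columns.map(c => StructField(c, StringType)))

val df = spark.read
  .option("header", false)
  .option("delimiter", "|")
  .schema(structSchema)
  .csv(dataDS)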
If the number of columns and the column names are fixed, then you can do it as below:
val columns = rdd.collect.slice(5,15).mkString(",") // it will hold columnnames
val data = rdd.collect.slice(17,21)
val d = data.mkString("\n").split('\n').toSeq.toDF()

import org.apache.spark.sql.functions._

val dd = d.withColumn("columnX", split($"value", "\\|"))
  .withColumn("column1", $"columnX".getItem(0))
  .withColumn("column2", $"columnX".getItem(1))
  .withColumn("column3", $"columnX".getItem(2))
  .withColumn("column4", $"columnX".getItem(3))
  .withColumn("column5", $"columnX".getItem(4))
  .withColumn("column6", $"columnX".getItem(5))
  .withColumn("column8", $"columnX".getItem(7))
  .withColumn("column10", $"columnX".getItem(8))
  .withColumn("column11", $"columnX".getItem(9))
  .drop("columnX", "value")

display(dd)
you can see the output as below:

How to replace emoticons with an empty string in a Scala dataframe?

Hello Stack Overflow folks,
Would you please help take a look at how to replace the emoticons in a Scala dataframe?
import spark.implicits._
val df = Seq(
(8, "bat★😂 😆 ⛱ ✨🚣‍♂️⛷🏂❤️"),
(64, "bb")
).toDF("number", "word")
df.show(false)
+------+-----------------------+
|number|word |
+------+-----------------------+
|8 |bat★😂 😆 ⛱ ✨🚣‍♂️⛷🏂❤️|
|64 |bb |
+------+-----------------------+
df.select($"word", regexp_replace($"word", "[^\u0000-\uFFFF]", "").alias("word_revised")).show(false)
+-----------------------+---------------+
|word |word_revised |
+-----------------------+---------------+
|bat★😂 😆 ⛱ ✨🚣‍♂️⛷🏂❤️|bat★ ⛱ ✨‍♂️⛷❤️|
|bb |bb |
+-----------------------+---------------+
The expected result is
+-----------------------+---------------+
|word |word_revised |
+-----------------------+---------------+
|bat★😂 😆 ⛱ ✨🚣‍♂️⛷🏂❤️|bat|
|bb |bb |
+-----------------------+---------------+
Thank you so much for your help, @fonkap. I am sorry to chime in on the thread so late, as I had another sprint story to onboard during the past month. The approach you posted works well for most emoticons, but there are some abnormal icons in my source data from upstream. Do you have any suggestion on how to replace them?
scala> val df = Seq(
| (8, "♥♥♥♥♥☆ Condo֎۩ᴥ★Ąrt Ħouse Ŀocation")
| ).toDF("airPlaneId", "airPlaneName")
df: org.apache.spark.sql.DataFrame = [airPlaneId: int, airPlaneName: string]
scala> df.select($"airPlaneId", $"airPlaneName", regexp_replace($"airPlaneName", "[^\u0000-\u20CF]", "").alias("airPlaneName_revised")).show(false)
+----------+-----------------------------------+----------------------------+
|airPlaneId|airPlaneName |airPlaneName_revised |
+----------+-----------------------------------+----------------------------+
|8 |♥♥♥♥♥☆ Condo֎۞۩ᴥ★Ąrt Ħouse Ŀocation| Condo֎۞۩ᴥĄrt Ħouse Ŀocation|
+----------+-----------------------------------+----------------------------+
It looks like some symbols still remain unexpectedly (the ones marked with an underscore).
Thank you for sharing, @mck. The proposed new approach is workable.
However, an unwanted replacement occurs:
scala> df.selectExpr(
| "airPlaneId",
| "airPlaneName",
| "replace(decode(encode(airPlaneName, 'ascii'), 'ascii'), '?', '?') airPlaneName_revised"
| ).show(false)
+----------+------------+--------------------+
|airPlaneId|airPlaneName|airPlaneName_revised|
+----------+------------+--------------------+
|8 |la Cité |la Cit? |
|9 |Aéroport |A?roport |
|10 |München |M?nchen |
|11 |la Tête |la T?te |
|12 |Sarrià |Sarri? |
+----------+------------+--------------------+
Just wondering, do we have any enhanced approach that leaves this kind of valid accented character alone and only processes emoji or symbols, please?
regexp_replace is doing it right. It is just that some of the "characters" you wrote are indeed in the \u0000-\uFFFF interval.
Proof:
import java.io.File
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object Emoticon {
  def main(args: Array[String]) {
    val str = "bat★😂 😆 ⛱ ✨🚣‍♂️⛷🏂❤️"
    val bw = Files.newBufferedWriter(new File("emoji.txt").toPath, StandardCharsets.UTF_8)
    bw.write(str)
    bw.newLine()
    val cps = str.codePoints().toArray
    cps.foreach(cp => {
      bw.write(String.format(" 0x%06x", cp.asInstanceOf[Object]))
      bw.write(" - ")
      bw.write(new java.lang.StringBuilder().appendCodePoint(cp).toString)
      bw.newLine()
    })
    bw.close()
  }
}
Open emoji.txt with your browser and you'll see:
(It is worth noting that some characters are combinations)
The "filtered" string looks like:
So, everything looks right!
Finally, answering your question, you may want to use a narrower character interval, for example: [^\u0000-\u20CF] , and you will get the expected result.
object Emoticon2 {
  def main(args: Array[String]) {
    val spark = SparkSession.builder.master("local[2]").appName("Simple Application").getOrCreate()
    import spark.implicits._

    val df = Seq(
      (8, "bat★😂 😆 ⛱ ✨🚣‍♂️⛷🏂❤️"),
      (64, "bb")
    ).toDF("number", "word")

    df.show(false)
    df.select($"word", regexp_replace($"word", "[^\u0000-\u20CF]", "").alias("word_revised")).show(false)
  }
}
will output:
+-----------------------+------------+
|word |word_revised|
+-----------------------+------------+
|bat★😂 😆 ⛱ ✨🚣‍♂️⛷🏂❤️|bat ‍ |
|bb |bb |
+-----------------------+------------+
Take a look at: https://jrgraphix.net/research/unicode_blocks.php
You can remove all non-ASCII characters as below:
val df = Seq(
(8, "bat★😂 😆 ⛱ ✨🚣‍♂️⛷🏂❤️"),
(64, "bb")
).toDF("number", "word")
val df2 = df.selectExpr(
"number",
"replace(decode(encode(word, 'ascii'), 'ascii'), '?', '') word_revised"
)
df2.show(false)
+------+------------+
|number|word_revised|
+------+------------+
|8 |bat |
|64 |bb |
+------+------------+
val df = Seq((8, "♥♥♥♥♥☆ Condo֎۩ᴥ★")).toDF("airPlaneId", "airPlaneName")
val df2 = df.selectExpr(
"airPlaneId",
"replace(decode(encode(airPlaneName, 'ascii'), 'ascii'), '?', '') airPlaneName_revised"
)
df2.show(false)
+----------+--------------------+
|airPlaneId|airPlaneName_revised|
+----------+--------------------+
|8 | Condo |
+----------+--------------------+
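If accented letters such as é or ü need to survive (as in the follow-up above), a hedged sketch (my own assumption, not part of either answer) is to target common emoji and symbol blocks explicitly instead of round-tripping through ASCII. It only covers the blocks listed in the pattern, so extend it for your data:

import org.apache.spark.sql.functions.{col, regexp_replace}

// Hypothetical block list: arrows, misc symbols, dingbats, the emoji planes,
// the variation selector and the zero-width joiner.
val symbolPattern = "[\\x{2190}-\\x{21FF}\\x{2600}-\\x{27BF}\\x{2B00}-\\x{2BFF}" +
  "\\x{1F300}-\\x{1FAFF}\\x{FE0F}\\x{200D}]"

df.select(
  col("airPlaneName"),
  regexp_replace(col("airPlaneName"), symbolPattern, "").alias("airPlaneName_revised")
).show(false)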

scala spark dataframe: explode a string column to multiple strings

Any pointers on the below?
Input df (here col1 is of type string):
+----------------------------------+
| col1|
+----------------------------------+
|[{a:1,g:2},{b:3,h:4},{c:5,i:6}] |
|[{d:7,j:8},{e:9,k:10},{f:11,l:12}]|
+----------------------------------+
Expected output (again, col1 is of type string):
+-------------+
| col1 |
+-------------+
| {a:1,g:2} |
| {b:3,h:4} |
| {c:5,i:6} |
| {d:7,j:8} |
| {e:9,k:10} |
| {f:11,l:12}|
+-------------+
Thanks!
You can use the Spark SQL explode function with a UDF:
import spark.implicits._
val df = spark.createDataset(Seq("[{a},{b},{c}]","[{d},{e},{f}]")).toDF("col1")
df.show()
+-------------+
| col1|
+-------------+
|[{a},{b},{c}]|
|[{d},{e},{f}]|
+-------------+
import org.apache.spark.sql.functions._
val stringToSeq = udf{s: String => s.drop(1).dropRight(1).split(",")}
df.withColumn("col1", explode(stringToSeq($"col1"))).show()
+----+
|col1|
+----+
| {a}|
| {b}|
| {c}|
| {d}|
| {e}|
| {f}|
+----+
Edit: for your new input data, the custom UDF can evolve as follows:
val stringToSeq = udf{s: String =>
val extractor = "[^{]*:[^}]*".r
extractor.findAllIn(s).map(m => s"{$m}").toSeq
}
New output:
+-----------+
| col1|
+-----------+
| {a:1,g:2}|
| {b:3,h:4}|
| {c:5,i:6}|
| {d:7,j:8}|
| {e:9,k:10}|
|{f:11,l:12}|
+-----------+
Spark provides a quite rich trim function which can be used to remove the leading and trailing characters, [] in your case. As @LeoC already mentioned, the required functionality can be implemented through the built-in functions, which will perform much better:
import org.apache.spark.sql.functions.{trim, explode, split}
val df = Seq(
("[{a},{b},{c}]"),
("[{d},{e},{f}]")
).toDF("col1")
df.select(
  explode(
    split(
      trim($"col1", "[]"), ","))).show
// +---+
// |col|
// +---+
// |{a}|
// |{b}|
// |{c}|
// |{d}|
// |{e}|
// |{f}|
// +---+
EDIT:
For the new dataset the logic remains the same, with the difference that you need to split on a different character than ",". You can achieve this by using regexp_replace to replace "}," with "}|", so that you can later split on "|" instead of ",":
import org.apache.spark.sql.functions.{trim, explode, split, regexp_replace}
val df = Seq(
("[{a:1,g:2},{b:3,h:4},{c:5,i:6}]"),
("[{d:7,j:8},{e:9,k:10},{f:11,l:12}]")
).toDF("col1")
df.select(
  explode(
    split(
      regexp_replace(trim($"col1", "[]"), "},", "}|"), // gives: {a:1,g:2}|{b:3,h:4}|{c:5,i:6}
      "\\|")
  )
).show(false)
// +-----------+
// |col |
// +-----------+
// |{a:1,g:2} |
// |{b:3,h:4} |
// |{c:5,i:6} |
// |{d:7,j:8} |
// |{e:9,k:10} |
// |{f:11,l:12}|
// +-----------+
Note: with split(..., "\\|") we escape | which is a special regex character.
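An alternative sketch (my own variation, not from the answer above): split directly on the "," that sits between "}" and "{" using a lookbehind and a lookahead, so no placeholder character is needed:

import org.apache.spark.sql.functions.{explode, split, trim}

df.select(
  explode(
    split(trim($"col1", "[]"), "(?<=\\}),(?=\\{)") // split only on "," between "}" and "{"
  )
).show(false)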
You can do:
val newDF = df.as[String].flatMap(line=>line.replaceAll("\\[", "").replaceAll("\\]", "").split(","))
newDF.show()
Output:
+-----+
|value|
+-----+
| {a}|
| {b}|
| {c}|
| {d}|
| {e}|
| {f}|
+-----+
Just as a note, this process will name the output column value; you can easily rename it (if needed) using select, withColumn, etc., as sketched below.
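For example, a minimal renaming sketch:

val renamed = newDF.toDF("col1") // or: newDF.withColumnRenamed("value", "col1")
renamed.show()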
Finally what worked:
import spark.implicits._
val df = spark.createDataset(Seq("[{a:1,g:2},{b:3,h:4},{c:5,i:6}]","[{d:7,j:8},{e:9,k:10},{f:11,l:12}]")).toDF("col1")
df.show()
val toStr = udf((value : String) => value.split("},\\{").map(_.toString))
val addParanthesis = udf((value : String) => ("{" + value + "}"))
val removeParanthesis = udf((value : String) => (value.slice(2,value.length()-2)))
import org.apache.spark.sql.functions._
df
.withColumn("col0", removeParanthesis(col("col1")))
.withColumn("col2", toStr(col("col0")))
.withColumn("col3", explode(col("col2")))
.withColumn("col4", addParanthesis(col("col3")))
.show()
output:
+--------------------+--------------------+--------------------+---------+-----------+
| col1| col0| col2| col3| col4|
+--------------------+--------------------+--------------------+---------+-----------+
|[{a:1,g:2},{b:3,h...|a:1,g:2},{b:3,h:4...|[a:1,g:2, b:3,h:4...| a:1,g:2| {a:1,g:2}|
|[{a:1,g:2},{b:3,h...|a:1,g:2},{b:3,h:4...|[a:1,g:2, b:3,h:4...| b:3,h:4| {b:3,h:4}|
|[{a:1,g:2},{b:3,h...|a:1,g:2},{b:3,h:4...|[a:1,g:2, b:3,h:4...| c:5,i:6| {c:5,i:6}|
|[{d:7,j:8},{e:9,k...|d:7,j:8},{e:9,k:1...|[d:7,j:8, e:9,k:1...| d:7,j:8| {d:7,j:8}|
|[{d:7,j:8},{e:9,k...|d:7,j:8},{e:9,k:1...|[d:7,j:8, e:9,k:1...| e:9,k:10| {e:9,k:10}|
|[{d:7,j:8},{e:9,k...|d:7,j:8},{e:9,k:1...|[d:7,j:8, e:9,k:1...|f:11,l:12|{f:11,l:12}|
+--------------------+--------------------+--------------------+---------+-----------+

In spark iterate through each column and find the max length

I am new to Spark with Scala and I have the following situation:
I have a table "TEST_TABLE" on the cluster (it can be a Hive table),
and I am converting it to a DataFrame as:
scala> val testDF = spark.sql("select * from TEST_TABLE limit 10")
Now the DF can be viewed as
scala> testDF.show()
COL1|COL2|COL3
----------------
abc|abcd|abcdef
a|BCBDFG|qddfde
MN|1234B678|sd
I want an output like below
COLUMN_NAME|MAX_LENGTH
COL1|3
COL2|8
COL3|6
Is it feasible to do this in Spark with Scala?
Plain and simple:
import org.apache.spark.sql.functions._
val df = spark.table("TEST_TABLE")
df.select(df.columns.map(c => max(length(col(c)))): _*)
You can try in the following way:
import org.apache.spark.sql.functions.{length, max}
import spark.implicits._
val df = Seq(("abc","abcd","abcdef"),
("a","BCBDFG","qddfde"),
("MN","1234B678","sd"),
(null,"","sd")).toDF("COL1","COL2","COL3")
df.cache()
val output = df.columns.map(c => (c, df.agg(max(length(df(s"$c")))).as[Int].first())).toSeq.toDF("COLUMN_NAME", "MAX_LENGTH")
output.show(false)
+-----------+----------+
|COLUMN_NAME|MAX_LENGTH|
+-----------+----------+
| COL1| 3|
| COL2| 8|
| COL3| 6|
+-----------+----------+
I think it's a good idea to cache the input DataFrame df to make the computation faster, since it is aggregated once per column.
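A related sketch (my own variation, not part of the answer above): since the df.agg call runs one job per column, the same report can come from a single aggregation whose one-row result is flipped into (COLUMN_NAME, MAX_LENGTH) rows on the driver:

import org.apache.spark.sql.functions.{col, length, max}

// One pass over df, then a tiny driver-side pivot into the vertical report.
val maxRow = df.select(df.columns.map(c => max(length(col(c))).as(c)): _*).head()
val report = df.columns.zipWithIndex
  .map { case (c, i) => (c, maxRow.getInt(i)) } // assumes no column is entirely null
  .toSeq
  .toDF("COLUMN_NAME", "MAX_LENGTH")
report.show(false)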
Here is one more way to get the report with the column names laid out vertically.
scala> val df = Seq(("abc","abcd","abcdef"),("a","BCBDFG","qddfde"),("MN","1234B678","sd")).toDF("COL1","COL2","COL3")
df: org.apache.spark.sql.DataFrame = [COL1: string, COL2: string ... 1 more field]
scala> df.show(false)
+----+--------+------+
|COL1|COL2 |COL3 |
+----+--------+------+
|abc |abcd |abcdef|
|a |BCBDFG |qddfde|
|MN |1234B678|sd |
+----+--------+------+
scala> val columns = df.columns
columns: Array[String] = Array(COL1, COL2, COL3)
scala> val df2 = columns.foldLeft(df) { (acc,x) => acc.withColumn(x,length(col(x))) }
df2: org.apache.spark.sql.DataFrame = [COL1: int, COL2: int ... 1 more field]
scala> val df3 = df2.select( columns.map(x => max(col(x))):_* )
df3: org.apache.spark.sql.DataFrame = [max(COL1): int, max(COL2): int ... 1 more field]

scala> df3.show(false)
+---------+---------+---------+
|max(COL1)|max(COL2)|max(COL3)|
+---------+---------+---------+
|3 |8 |6 |
+---------+---------+---------+
scala> df3.flatMap( r => { (0 until r.length).map( i => (columns(i),r.getInt(i)) ) } ).show(false)
+----+---+
|_1 |_2 |
+----+---+
|COL1|3 |
|COL2|8 |
|COL3|6 |
+----+---+
scala>
To get the results into a Scala collection, say a Map():
scala> val result = df3.flatMap( r => { (0 until r.length).map( i => (columns(i),r.getInt(i)) ) } ).as[(String,Int)].collect.toMap
result: scala.collection.immutable.Map[String,Int] = Map(COL1 -> 3, COL2 -> 8, COL3 -> 6)
scala> result
res47: scala.collection.immutable.Map[String,Int] = Map(COL1 -> 3, COL2 -> 8, COL3 -> 6)
scala>

Convert Map(key-value) into spark scala Data-frame

I want to convert myMap = Map([Col_1->1],[Col_2->2],[Col_3->3]) to a Spark Scala DataFrame with the keys as columns and the values as column values.
I am not getting the expected result; please check my code and provide a solution.
var finalBufferList = new ListBuffer[String]()
var finalDfColumnList = new ListBuffer[String]()
var myMap: Map[String, String] = Map.empty[String, String]

for ((k, v) <- myMap) {
  println(k + "->" + v)
  finalBufferList += v
  //finalDfColumnList += "\"" + k + "\""
  finalDfColumnList += k
}
val dff = Seq(finalBufferList.toSeq).toDF(finalDfColumnList.toList.toString())
dff.show()
My result :
+------------------------+
|List(Test, Rest, Incedo)|
+------------------------+
| [4, 5, 3]|
+------------------------+
Expected result :
+------+-------+-------+
|Col_1 | Col_2 | Col_3 |
+------+-------+-------+
| 4 | 5 | 3 |
+------+-------+-------+
Please give me a suggestion.
if you have defined your Map as
val myMap = Map("Col_1"->"1", "Col_2"->"2", "Col_3"->"3")
then you should create RDD[Row] using the values as
import org.apache.spark.sql.Row
val rdd = sc.parallelize(Seq(Row.fromSeq(myMap.values.toSeq)))
then you create a schema using the keys as
import org.apache.spark.sql.types._
val schema = StructType(myMap.keys.toSeq.map(StructField(_, StringType)))
then finally use createDataFrame function to create the dataframe as
val df = sqlContext.createDataFrame(rdd, schema)
df.show(false)
finally you should have
+-----+-----+-----+
|Col_1|Col_2|Col_3|
+-----+-----+-----+
|1 |2 |3 |
+-----+-----+-----+
I hope the answer is helpful
But remember, all of this is overkill if you are only working with a small dataset.
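As a final side note, a more concise sketch (my own alternative, not part of the answer above) builds the same single-row DataFrame with groupBy().pivot() instead of an explicit RDD[Row] and schema:

import org.apache.spark.sql.functions.first
import spark.implicits._

val myMap = Map("Col_1" -> "1", "Col_2" -> "2", "Col_3" -> "3")

// One row per key/value pair, then pivot the keys into columns.
val df = myMap.toSeq.toDF("key", "value")
  .groupBy()    // a single, global group
  .pivot("key") // Col_1, Col_2, Col_3 become columns
  .agg(first("value"))

df.show(false)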