Convert RDD[(String,List[String])] to Dataframe [closed] - scala

My RDD is in the format below, i.e. RDD[(String, List[String])]:
(abc,List(a,b))
(bcb,List(a,b))
I want to convert it to a DataFrame like below:
col1 col2 col3
abc a b
bcb a b
What is the best approach to do this in Scala?

You first need to extract the elements of your List into a tuple; then you can use toDF on your RDD (Spark implicit conversions need to be imported for this):
import org.apache.spark.rdd.RDD
import spark.implicits._ // required for the toDF conversion

val rdd: RDD[(String, List[String])] = sc.parallelize(Seq(
  ("abc", List("a", "b")),
  ("bcb", List("a", "b"))
))

val df = rdd
  .map { case (str, list) => (str, list(0), list(1)) }
  .toDF("col1", "col2", "col3")

df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| abc| a| b|
| bcb| a| b|
+----+----+----+
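If the lists are not guaranteed to contain at least two elements, indexing with list(0)/list(1) will throw at runtime. A small defensive sketch (the varying-length assumption is mine, not part of the original question) uses lift so missing elements become nulls in the DataFrame:
val safeDf = rdd
  .map { case (str, list) => (str, list.lift(0), list.lift(1)) } // Option[String] encodes as a nullable string column
  .toDF("col1", "col2", "col3")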

Related

how to implement uniqueConcatenate, uniqueCount in spark scala [closed]

I am trying to transform some data; the older code is in TIBCO and uses the uniqueConcatenate and uniqueCount functions.
I am not sure how to achieve the same output in Spark Scala.
uniqueConcatenate and uniqueCount examples: (screenshots not reproduced here)
I tried collect_set, but I need to apply it over a partition by other columns, and that does not seem to work for me.
Please help me here!
For uniqueConcatenate you can use the collect_set() function, which aggregates a column's distinct values into a set.
For example:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{collect_set, concat_ws}
import spark.implicits._

case class Record(col1: Option[Int] = None, col2: Option[Int] = None, col3: Option[Int] = None)

val df: DataFrame = Seq(Record(Some(1), Some(1), Some(1)), Record(Some(1), None, Some(3)), Record(Some(1), Some(3), Some(3)))
  .toDF("col1", "col2", "col3")
df.show()
/*
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 1| 1|
| 1|null| 3|
| 1| 3| 3|
+----+----+----+
*/
df.agg(
  concat_ws(". ", collect_set("col1")).as("col1"),
  concat_ws(". ", collect_set("col2")).as("col2"),
  concat_ws(". ", collect_set("col3")).as("col3")
).show()
/*
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1|1. 3|1. 3|
+----+----+----+
*/
For uniqueCount, you can use countDistinct in a similar way:
import org.apache.spark.sql.functions.countDistinct
df.agg(
  countDistinct("col1").as("col1"),
  countDistinct("col2").as("col2"),
  countDistinct("col3").as("col3")
).show()
/*
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 2| 2|
+----+----+----+
*/
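Since the question mentions doing this over a partition, here is a hedged sketch of a windowed variant, using col1 from the example as the partition key (swap in your real grouping columns). Note that countDistinct is not supported as a window function, so the distinct count is taken as the size of the collected set:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_set, concat_ws, size}

val w = Window.partitionBy("col1")

df.withColumn("uniqueConcat_col3", concat_ws(". ", collect_set("col3").over(w)))
  .withColumn("uniqueCount_col3", size(collect_set("col3").over(w)))
  .show()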

Pattern match string from column in spark dataframe [closed]

I have a column in a Spark DataFrame where I need to keep only the entries containing "xyz" and store them in a new column.
Input (I need only the fields from the column that contain xyz):
colA colB
A bid:76563,bid:76589,bid:76591,ms:ms15-097,xyz:3089656
B xyz:4462915,xyz:4462917,xyz:4462918
Required Output
colA colB colC
A bid:76563,bid:76589,bid:76591,ms:ms15-097,xyz:3089656 xyz:3089656
B xyz:4462915,xyz:4462917,xyz:4462918 xyz:4462915,xyz:4462917,xyz:4462918
I have 100k rows and cannot use a groupBy on colA with collect_list; can you please help me get the required output?
If you are using Spark 2.4+, you can split colB on the comma and use the built-in higher-order functions as expressions:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("A", "bid:76563,bid:76589,bid:76591,ms:ms15-097,xyz:3089656"),
  ("B", "xyz:4462915,xyz:4462917,xyz:4462918")
).toDF("colA", "colB")

val newDF = df.withColumn("split", split($"colB", ","))
  .selectExpr("*", "filter(split, x -> x LIKE 'xyz%') AS filteredB")
  .withColumn("colC", concat_ws(",", $"filteredB"))
  .drop("split", "filteredB")
newDF.show(false)
Output:
+----+-----------------------------------------------------+-----------------------------------+
|colA|colB |colC |
+----+-----------------------------------------------------+-----------------------------------+
|A |bid:76563,bid:76589,bid:76591,ms:ms15-097,xyz:3089656|xyz:3089656 |
|B |xyz:4462915,xyz:4462917,xyz:4462918 |xyz:4462915,xyz:4462917,xyz:4462918|
+----+-----------------------------------------------------+-----------------------------------+
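If you are on a Spark version before 2.4 (so the higher-order filter function is not available), a plain Scala UDF is one way to get the same result; this is just a sketch under that assumption:
import org.apache.spark.sql.functions.udf

// keep only the comma-separated entries that start with "xyz"
val keepXyz = udf { s: String => s.split(",").filter(_.startsWith("xyz")).mkString(",") }

df.withColumn("colC", keepXyz($"colB")).show(false)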

How can I nullify spark dataframe column [duplicate]

This question already has answers here: Create new Dataframe with empty/null field values (2 answers)
I am working in the Scala programming language. I want to nullify an entire column of a DataFrame.
If that is not possible, then I at least want to put an empty string in it.
What is an efficient way to do either of the above?
Note: I don't want to add a new column; I want to manipulate an existing column.
Thanks
You can directly use .withColumn with the same column name, and Spark replaces the existing column:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(("1", "a"), ("2", "b")).toDF("id", "name")
df.show()
//+---+----+
//| id|name|
//+---+----+
//|  1|   a|
//|  2|   b|
//+---+----+

val df1 = df.withColumn("id", lit(null)) // keep a null value for the id column
df1.show()
//+----+----+
//|  id|name|
//+----+----+
//|null|   a|
//|null|   b|
//+----+----+

val df2 = df.withColumn("id", lit("")) // keep an empty string "" for the id column
df2.show()
//+---+----+
//| id|name|
//+---+----+
//|   |   a|
//|   |   b|
//+---+----+
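One caveat: lit(null) gives the replaced column the NullType data type. If downstream code expects the original type, adding a cast keeps the schema stable (a small sketch, assuming id should stay a string):
val df3 = df.withColumn("id", lit(null).cast("string"))
df3.printSchema()
//root
// |-- id: string (nullable = true)
// |-- name: string (nullable = true)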

How can I make a Dataframe in Spark from a String instead of a file? [duplicate]

This question already has answers here: Can I read a CSV represented as a string into Apache Spark using spark-csv (3 answers)
At the moment, I am making a DataFrame from a tab-separated file with a header, like this:
val df = sqlContext.read.format("csv")
  .option("header", "true")
  .option("delimiter", "\t")
  .option("inferSchema", "true")
  .load(pathToFile)
I want to do exactly the same thing but with a String instead of a file. How can I do that?
To the best of my knowledge, there is no built-in way to build a DataFrame from a string. Yet, for prototyping purposes, you can create a DataFrame from a Seq of tuples, and you can use that to your advantage to create a DataFrame from a string.
scala> val s ="x,y,z\n1,2,3\n4,5,6\n7,8,9"
s: String =
x,y,z
1,2,3
4,5,6
7,8,9
scala> val data = s.split('\n')
// Then we extract the first element to use it as a header.
scala> val header = data.head.split(',')
scala> val df = data.tail.toSeq
// converting the seq of strings to a DF with only one column
.toDF("X")
// splitting the string
.select(split('X, ",") as "X")
// extracting each column from the array and renaming them
.select( header.indices.map( i => 'X.getItem(i).as(header(i))) : _*)
scala> df.show
+---+---+---+
| x| y| z|
+---+---+---+
| 1| 2| 3|
| 4| 5| 6|
| 7| 8| 9|
+---+---+---+
PS: if you are not in the Spark REPL, make sure to add import spark.implicits._ in order to use toDF().
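As an aside, on Spark 2.2+ DataFrameReader.csv also accepts a Dataset[String], which lets you keep the same header/schema-inference options you were using for the file. A sketch for the comma-separated string above (use .option("delimiter", "\t") for tab-separated data):
import spark.implicits._

val csvData = spark.createDataset(s.split('\n').toSeq)
val df2 = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(csvData)
df2.show()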

Spark: How to convert a Dataset[(String , array[int])] to Dataset[(String, int, int, int, int, ...)] using Scala [closed]

The Dataset I have has the following schema: [String, array[int]]. Now, I want to convert it to a Dataset (or DataFrame) with the following schema: [String, int, int, int, ...]. Note that array[int] is dynamic, so it can have a different length for different rows.
The problem stems from the fact that the tuple (String, Array[Int]) is a specific type and it is the same type no matter how many elements are in the Array.
On the other hand, the tuple (String, Int) is a different type from (String, Int, Int), which is different still from (String, Int, Int, Int), and so on. Being a strongly typed language, Scala doesn't easily allow for a method that takes one type as input and produces one of many possible, and unrelated, types as output.
Perhaps if you describe why you think you want to do this we can offer a better solution for your situation.
As @jwvh suggests, you probably cannot do this in a type-safe Dataset way. If you relax type safety, you can probably do this using DataFrames (assuming your arrays are not extremely long; I believe a DataFrame is currently limited to Int.MaxValue columns).
Here is the solution using (primarily) DataFrames on Spark 2.0.2:
We start with a toy example:
import org.apache.spark.sql.functions._
import spark.implicits._
val ds = spark.createDataset(("Hello", Array(1,2,3)) :: ("There", Array(1,2,10,11,100)) :: ("A", Array(5,6,7,8)) :: Nil)
// ds: org.apache.spark.sql.Dataset[(String, Array[Int])] = [_1: string, _2: array<int>]
ds.show()
+-----+-------------------+
| _1| _2|
+-----+-------------------+
|Hello| [1, 2, 3]|
|There|[1, 2, 10, 11, 100]|
| A| [5, 6, 7, 8]|
+-----+-------------------+
Next we compute the max length of the arrays we have (which we hope is not too large):
val maxLen = ds.select(max(size($"_2")).as[Long]).collect().head
Next, we want a function to select an element of the array at a particular index. We express the array selection function as a UDF:
val u = udf((a: Seq[Int], i: Int) => if (a.size <= i) null.asInstanceOf[Int] else a(i)) // note: null.asInstanceOf[Int] evaluates to 0, so missing positions come out as 0
Now we create all the columns we want to generate:
val columns = ds.col("_1") +: (for(i <- 0 until maxLen.toInt ) yield u(ds.col("_2"), lit(i)).as(s"a[$i]"))
Then hopefully we are done:
ds.select(columns:_*).show()
+-----+----+----+----+----+----+
| _1|a[0]|a[1]|a[2]|a[3]|a[4]|
+-----+----+----+----+----+----+
|Hello| 1| 2| 3| 0| 0|
|There| 1| 2| 10| 11| 100|
| A| 5| 6| 7| 8| 0|
+-----+----+----+----+----+----+
Here is the complete code for copy-paste:
import org.apache.spark.sql.functions._
import spark.implicits._
val ds = spark.createDataset(("Hello", Array(1,2,3)) :: ("There", Array(1,2,10,11,100)) :: ("A", Array(5,6,7,8)) :: Nil)
val maxLen = ds.select(max(size($"_2")).as[Long]).collect().head
val u = udf((a: Seq[Int], i: Int) => if(a.size <= i) null.asInstanceOf[Int] else a(i))
val columns = ds.col("_1") +: (for(i <- 0 until maxLen.toInt ) yield u(ds.col("_2"), lit(i)).as(s"a[$i]"))
ds.select(columns:_*).show()
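If you would rather see null than 0 for the missing positions, one option (an alternative sketch, not from the original answer) is to skip the UDF and index the array column directly; getItem returns null when the index is out of range:
val columns2 = ds.col("_1") +: (0 until maxLen.toInt).map(i => ds.col("_2").getItem(i).as(s"a[$i]"))
ds.select(columns2: _*).show()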