How to convert an RDD array string to a dataframe - scala

Please help me convert the RDD array of IP address below, into a dataframe.
(Full Disclosure: I have little experience working with RDD)
RDD CREATION:
val SCND_RDD = FIRST_RDD.map(kv => kv._2).flatMap(r => r.get("ip")).map(o => o.asInstanceOf[scala.collection.mutable.Map[String, String]]).flatMap(ip => ip.get("address"))
SCND_RDD.take(3)
RESULTS:
SCND_RDD: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[33] at flatMap at <console>:38
res87: Array[String] = Array(5.42.212.99, 51.34.21.60, 63.99.831.7)`
My rdd<->dataframe conversion attempt:
case class X(callId: String)
val userDF = SCND_RDD.map{case Array(s0)=>X(s0)}.toDF()
This is the error I get
defined class X
<console>:40: error: scrutinee is incompatible with pattern type;
found : Array[T]
required: String
val userDF = NIPR_RDD22.map{case Array(s0)=>X(s0)}.toDF()

I leave a comment that is a duplicated question that might help you.
But here I also leave my trial.
val rdd = sc.parallelize(Array("test", "test2", "test3"))
rdd.take(3)
//res53: Array[String] = Array(test, test2, test3)
val df = rdd.toDF()
df.show
+-----+
|value|
+-----+
| test|
|test2|
|test3|
+-----+

Related

Scala : how to pass variable in a UDF and use in withColumn

I have a variable of type Map[String, Set[String]
val metadata = Map(a -> Set(b ,c))
val colToUse = "existingcol" // Option[String]
I am trying to add a new column in my dataFrame using metadata and colToUse which is an existing column in my dataframe
value of metadata is Set of Strings and
key is just a string which is a value of a column in df.
eg :
val metadata = Map['mike', ['physics','chemistry']]
val colToUse = 'student_name' // student_name is a column name in df
'mike' will be a value of "student_name" column.
i am trying to add a new column in existing DF where i can add subjects of each student based on student_name and metadata
myDF.withColumn("subjects", metadata.getorelse(col(colToUse), set.empty)
The above will not work in scala as i need pass columns only in withColumn.
Tried using UDF
def logic: (Map[String, Set[String]], String) => Set[String] =
(metadata: Map[String, Set[String]], colToUse: String) => {
metadata.getOrElse(colToUse, Set("a"))
}
def myUDF = udf(logic)
def getVal: Column = { myUDF(metadata, col(colToUse.get) }
and using it in withcolumn :
myDF.withColumn("newCol", getVal(metadata, colToUse)
Getting error : Unsupported literal type class scala.Tuple2
Looking for a best simplistic way to approach this ?
Issue 2: In getVal , for passing metadata a column is expected but i am passing a map
Is something like this is what you need:
val spark = SparkSession.builder().master("local[1]").getOrCreate()
val df = spark.createDataFrame(
spark.sparkContext.parallelize(Seq(Row("mike"))),
StructType(List(StructField("student_name", StringType)))
)
df.show()
First test dataframe:
+------------+
|student_name|
+------------+
| mike|
+------------+
And now, create the udf that uses the map:
val metadata = Map("mike" -> Set("physics", "chemistry"))
val colToUse = "student_name"
def createUdf =
udf((key: String) => metadata.getOrElse(key, Set.empty))
and uset it in withColumn function:
df.withColumn("subjects", createUdf(col(colToUse))).show()
it gives:
+------------+--------------------+
|student_name| subjects|
+------------+--------------------+
| mike|[physics, chemistry]|
+------------+--------------------+
am I missing something?

spark Scala RDD to DataFrame Date format

Would you be able to help in this spark prob statement
Data -
empno|ename|designation|manager|hire_date|sal|deptno
7369|SMITH|CLERK|9902|2010-12-17|800.00|20
7499|ALLEN|SALESMAN|9698|2011-02-20|1600.00|30
Code:
val rawrdd = spark.sparkContext.textFile("C:\\Users\\cmohamma\\data\\delta scenarios\\emp_20191010.txt")
val refinedRDD = rawrdd.map( lines => {
val fields = lines.split("\\|") (fields(0).toInt,fields(1),fields(2),fields(3).toInt,fields(4).toDate,fields(5).toFloat,fields(6).toInt)
})
Problem Statement - This is not working -fields(4).toDate , whats is the alternative or what is the usage ?
What i have tried ?
tried replacing it to - to_date(col(fields(4)) , "yyy-MM-dd") - Not working
2.
Step 1.
val refinedRDD = rawrdd.map( lines => {
val fields = lines.split("\\|")
(fields(0),fields(1),fields(2),fields(3),fields(4),fields(5),fields(6))
})
Now this tuples are all strings
Step 2.
mySchema = StructType(StructField(empno,IntegerType,true), StructField(ename,StringType,true), StructField(designation,StringType,true), StructField(manager,IntegerType,true), StructField(hire_date,DateType,true), StructField(sal,DoubleType,true), StructField(deptno,IntegerType,true))
Step 3. converting the string tuples to Rows
val rowRDD = refinedRDD.map(attributes => Row(attributes._1, attributes._2, attributes._3, attributes._4, attributes._5 , attributes._6, attributes._7))
Step 4.
val empDF = spark.createDataFrame(rowRDD, mySchema)
This is also not working and gives error related to types. to solve this i changed the step 1 as
(fields(0).toInt,fields(1),fields(2),fields(3).toInt,fields(4),fields(5).toFloat,fields(6).toInt)
Now this is giving error for the date type column and i am again at the main problem.
Use Case - use textFile Api, convert this to a dataframe using custom schema (StructType) on top of it.
This can be done using the case class but in case class also i would be stuck where i would need to do a fields(4).toDate (i know i can cast string to date later in code but if the above problem solutionis possible)
You can use the following code snippet
import org.apache.spark.sql.functions.to_timestamp
scala> val df = spark.read.format("csv").option("header", "true").option("delimiter", "|").load("gs://otif-etl-input/test.csv")
df: org.apache.spark.sql.DataFrame = [empno: string, ename: string ... 5 more fields]
scala> val ts = to_timestamp($"hire_date", "yyyy-MM-dd")
ts: org.apache.spark.sql.Column = to_timestamp(`hire_date`, 'yyyy-MM-dd')
scala> val enriched_df = df.withColumn("ts", ts).show(2, false)
+-----+-----+-----------+-------+----------+-------+----------+-------------------+
|empno|ename|designation|manager|hire_date |sal |deptno |ts |
+-----+-----+-----------+-------+----------+-------+----------+-------------------+
|7369 |SMITH|CLERK |9902 |2010-12-17|800.00 |20 |2010-12-17 00:00:00|
|7499 |ALLEN|SALESMAN |9698 |2011-02-20|1600.00|30 |2011-02-20 00:00:00|
+-----+-----+-----------+-------+----------+-------+----------+-------------------+
enriched_df: Unit = ()
There are multiple ways to cast your data to proper data types.
First : use InferSchema
val df = spark.read .option("delimiter", "\\|").option("header", true) .option("inferSchema", "true").csv(path)
df.printSchema
Some time it doesn't work as expected. see details here
Second : provide your own Datatype conversion template
val rawDF = Seq(("7369", "SMITH" , "2010-12-17", "800.00"), ("7499", "ALLEN","2011-02-20", "1600.00")).toDF("empno", "ename","hire_date", "sal")
//define schema in DF , hire_date as Date
val schemaDF = Seq(("empno", "INT"), ("ename", "STRING"), (**"hire_date", "date"**) , ("sal", "double")).toDF("columnName", "columnType")
rawDF.printSchema
//fetch schema details
val dataTypes = schemaDF.select("columnName", "columnType")
val listOfElements = dataTypes.collect.map(_.toSeq.toList)
//creating a map friendly template
val validationTemplate = (c: Any, t: Any) => {
val column = c.asInstanceOf[String]
val typ = t.asInstanceOf[String]
col(column).cast(typ)
}
//Apply datatype conversion template on rawDF
val convertedDF = rawDF.select(listOfElements.map(element => validationTemplate(element(0), element(1))): _*)
println("Conversion done!")
convertedDF.show()
convertedDF.printSchema
Third : Case Class
Create schema from caseclass with ScalaReflection and provide this customized schema while loading DF.
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types._
case class MySchema(empno: int, ename: String, hire_date: Date, sal: Double)
val schema = ScalaReflection.schemaFor[MySchema].dataType.asInstanceOf[StructType]
val rawDF = spark.read.schema(schema).option("header", "true").load(path)
rawDF.printSchema
Hope this will help.

How to pass dataset column value to a function while using spark filter with scala?

I have an action array which consists of user id and action type
+-------+-------+
|user_id| type|
+-------+-------+
| 11| SEARCH|
+-------+-------+
| 11| DETAIL|
+-------+-------+
| 12| SEARCH|
+-------+-------+
I want to filter actions that belongs to the users who have at least one search action.
So I created a bloom filter with user ids who has SEARCH action.
Then I tried to filter all actions depending on bloom filter's user status
val df = spark.read...
val searchers = df.filter($"type" === "SEARCH").select("user_id").distinct.as[String].collect
val bloomFilter = BloomFilter.create(100)
searchers.foreach(bloomFilter.putString(_))
df.filter(bloomFilter.mightContainString($"user_id"))
But the code gives an exception
type mismatch;
found : org.apache.spark.sql.ColumnName
required: String
Please let me know how can I pass column value to the BloomFilter.mightContainString method?
Create filter:
val expectedNumItems: Long = ???
val fpp: Double = ???
val f = df.stat.bloomFilter("user_id", expectedNumItems, fpp)
Use udf for filtering:
import org.apache.spark.sql.functions.udf
val mightContain = udf((s: String) => f.mightContain(s))
df.filter(mightContain($"user_id"))
If your current Bloom filter implementation is serializable you should be able to use it the same way, but if data is large enough to justify Bloom filter, you should avoid collecting.
You can do something like this,
val sparkSession = ???
val sc = sparkSession.sparkContext
val bloomFilter = BloomFilter.create(100)
val df = ???
val searchers = df.filter($"type" === "SEARCH").select("user_id").distinct.as[String].collect
At this point, i'll mention the fact that collect is not a good idea. Next you can do something like.
import org.apache.spark.sql.functions.udf
val bbFilter = sc.broadcast(bloomFilter)
val filterUDF = udf((s: String) => bbFilter.value.mightContainString(s))
df.filter(filterUDF($"user_id"))
You can remove the broadcasting if the bloomFilter instance is serializable.
Hope this helps, Cheers.

scala - Spark : How to union all dataframe in loop

Is there a way to get the dataframe that union dataframe in loop?
This is a sample code:
var fruits = List(
"apple"
,"orange"
,"melon"
)
for (x <- fruits){
var df = Seq(("aaa","bbb",x)).toDF("aCol","bCol","name")
}
I would want to obtain some like this:
aCol | bCol | fruitsName
aaa,bbb,apple
aaa,bbb,orange
aaa,bbb,melon
Thanks again
You could created a sequence of DataFrames and then use reduce:
val results = fruits.
map(fruit => Seq(("aaa", "bbb", fruit)).toDF("aCol","bCol","name")).
reduce(_.union(_))
results.show()
Steffen Schmitz's answer is the most concise one I believe.
Below is a more detailed answer if you are looking for more customization (of field types, etc):
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.Row
//initialize DF
val schema = StructType(
StructField("aCol", StringType, true) ::
StructField("bCol", StringType, true) ::
StructField("name", StringType, true) :: Nil)
var initialDF = spark.createDataFrame(sc.emptyRDD[Row], schema)
//list to iterate through
var fruits = List(
"apple"
,"orange"
,"melon"
)
for (x <- fruits) {
//union returns a new dataset
initialDF = initialDF.union(Seq(("aaa", "bbb", x)).toDF)
}
//initialDF.show()
references:
How to create an empty DataFrame with a specified schema?
https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/sql/Dataset.html
https://docs.databricks.com/spark/latest/faq/append-a-row-to-rdd-or-dataframe.html
If you have different/multiple dataframes you can use below code, which is efficient.
val newDFs = Seq(DF1,DF2,DF3)
newDFs.reduce(_ union _)
In a for loop:
val fruits = List("apple", "orange", "melon")
( for(f <- fruits) yield ("aaa", "bbb", f) ).toDF("aCol", "bCol", "name")
you can first create a sequence and then use toDF to create Dataframe.
scala> var dseq : Seq[(String,String,String)] = Seq[(String,String,String)]()
dseq: Seq[(String, String, String)] = List()
scala> for ( x <- fruits){
| dseq = dseq :+ ("aaa","bbb",x)
| }
scala> dseq
res2: Seq[(String, String, String)] = List((aaa,bbb,apple), (aaa,bbb,orange), (aaa,bbb,melon))
scala> val df = dseq.toDF("aCol","bCol","name")
df: org.apache.spark.sql.DataFrame = [aCol: string, bCol: string, name: string]
scala> df.show
+----+----+------+
|aCol|bCol| name|
+----+----+------+
| aaa| bbb| apple|
| aaa| bbb|orange|
| aaa| bbb| melon|
+----+----+------+
Well... I think your question is a bit mis-guided.
As per my limited understanding of whatever you are trying to do, you should be doing following,
val fruits = List(
"apple",
"orange",
"melon"
)
val df = fruits
.map(x => ("aaa", "bbb", x))
.toDF("aCol", "bCol", "name")
And this should be sufficient.

Converting RDD[org.apache.spark.sql.Row] to RDD[org.apache.spark.mllib.linalg.Vector]

I am relatively new to Spark and Scala.
I am starting with the following dataframe (single column made out of a dense Vector of Doubles):
scala> val scaledDataOnly_pruned = scaledDataOnly.select("features")
scaledDataOnly_pruned: org.apache.spark.sql.DataFrame = [features: vector]
scala> scaledDataOnly_pruned.show(5)
+--------------------+
| features|
+--------------------+
|[-0.0948337274182...|
|[-0.0948337274182...|
|[-0.0948337274182...|
|[-0.0948337274182...|
|[-0.0948337274182...|
+--------------------+
A straight conversion to RDD yields an instance of org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] :
scala> val scaledDataOnly_rdd = scaledDataOnly_pruned.rdd
scaledDataOnly_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[32] at rdd at <console>:66
Does anyone know how to convert this DF to an instance of org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] instead? My various attempts have been unsuccessful so far.
Thank you in advance for any pointers!
Just found out:
val scaledDataOnly_rdd = scaledDataOnly_pruned.map{x:Row => x.getAs[Vector](0)}
EDIT: use more sophisticated way to interpret fields in Row.
This is worked for me
val featureVectors = features.map(row => {
Vectors.dense(row.toSeq.toArray.map({
case s: String => s.toDouble
case l: Long => l.toDouble
case _ => 0.0
}))
})
features is a DataFrame of spark SQL.
import org.apache.spark.mllib.linalg.Vectors
scaledDataOnly
.rdd
.map{
row => Vectors.dense(row.getAs[Seq[Double]]("features").toArray)
}