Replace separator in Array[long] in the Spark dataframe - scala

I'm reading a JSON file into a spark data frame in Scala. I have a JSON field like
"areaGlobalIdList":[2389,3,2,1,2147,2142,2518]
Spark automatically infers the datatype of this field as Array[long]. I tried concat_ws, but it seems to work only with array[string]. When I tried casting the column to a string instead, the output shows as
scala> val cmrdd = sc.textFile("/user/nkthn/cm.json")
scala> val cmdf = sqlContext.read.json(cmrdd)
scala> val dfResults = cmdf.select($"areaGlobalIdList".cast(StringType)).withColumn("AREAGLOBALIDLIST", regexp_replace($"areaGlobalIdList" , ",", "." ))
scala> dfResults.show(20,false)
+------------------------------------------------------------------+
|AREAGLOBALIDLIST |
+------------------------------------------------------------------+
|org.apache.spark.sql.catalyst.expressions.UnsafeArrayData#6364b584|
+------------------------------------------------------------------+
I'm expecting the output to be
[2389.3.2.1.2147.2142.2518]
Any assistance is greatly helpful.

Given the schema of the areaGlobalIdList column as
|-- areaGlobalIdList: array (nullable = true)
| |-- element: long (containsNull = false)
You can achieve this with a simple udf function:
import org.apache.spark.sql.functions._
val concatWithDot = udf((array: collection.mutable.WrappedArray[Long]) => array.mkString("."))
df.withColumn("areaGlobalIdList", concatWithDot($"areaGlobalIdList")).show(false)
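The expected output in the question is wrapped in square brackets, which the one-argument mkString does not add. Here is a minimal sketch of the joining logic in plain Scala, with no Spark needed (joinWithDots and joinBracketed are hypothetical helper names, not Spark functions):

```scala
// Plain-Scala sketch of the logic that would go inside the UDF body.
// Hypothetical helper names; Spark is not required to try this.
def joinWithDots(ids: Seq[Long]): String =
  ids.mkString(".")

// The three-argument mkString adds a prefix and a suffix.
def joinBracketed(ids: Seq[Long]): String =
  ids.mkString("[", ".", "]")

val ids = Seq(2389L, 3L, 2L, 1L, 2147L, 2142L, 2518L)
println(joinWithDots(ids))   // 2389.3.2.1.2147.2142.2518
println(joinBracketed(ids))  // [2389.3.2.1.2147.2142.2518]
```

Either helper could be wrapped with udf(...) in the same way as concatWithDot above.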

Related

Scala explode followed by UDF on a dataframe fails

I have a scala dataframe with the following schema:
root
|-- time: string (nullable = true)
|-- itemId: string (nullable = true)
|-- itemFeatures: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
I want to explode the itemFeatures column and then send my dataframe to a UDF. But as soon as I include the explode, calling the UDF results in this error:
org.apache.spark.SparkException: Task not serializable
I can't figure out why???
Environment: Scala 2.11.12, Spark 2.4.4
Full example:
val dataList = List(
("time1", "id1", "map1"),
("time2", "id2", "map2"))
val df = dataList.toDF("time", "itemId", "itemFeatures")
val dfExploded = df.select(col("time"), col("itemId"), explode("itemFeatures"))
val doNextThingUDF: UserDefinedFunction = udf(doNextThing _)
val dfNextThing = dfExploded.withColumn("nextThing", doNextThingUDF(col("time"))
where my UDF looks like this:
val doNextThing(time: String): String = {
time+"blah"
}
If I remove the explode, everything works fine, and if I don't call the UDF after the explode, everything works fine. I could imagine Spark is somehow unable to send each row to a UDF if it is dynamically executing the explode and doesn't know how many rows are going to exist, but even when I add, e.g., dfExploded.cache() and dfExploded.count(), I still get the error. Is this a known issue? What am I missing?
I think the issue comes from how you define your doNextThing function. Also, there are a couple of typos in your "full example".
In particular, the itemFeatures column is a string in your example; I understand it should be a Map.
Here is a working example:
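import org.apache.spark.sql.functions.{col, explode, udf}
import spark.implicits._ // needed for toDF and $"..."; already imported in spark-shell
val dataList = List(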
("time1", "id1", Map("map1" -> 1)),
("time2", "id2", Map("map2" -> 2)))
val df = dataList.toDF("time", "itemId", "itemFeatures")
val dfExploded = df.select(col("time"), col("itemId"), explode($"itemFeatures"))
val doNextThing = (time: String) => {time+"blah"}
val doNextThingUDF = udf(doNextThing)
val dfNextThing = dfExploded.withColumn("nextThing", doNextThingUDF(col("time")))
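A likely explanation, offered here as an assumption rather than something confirmed in the question: udf(doNextThing _) turns a method into a function that keeps a reference to the object the method belongs to, and if that object is not serializable, Spark cannot ship the closure to the executors. The plain-Scala sketch below imitates this with Java serialization; Holder, canSerialize, fromMethod and standalone are hypothetical names.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Returns true if the object survives Java serialization,
// which is roughly what Spark requires of a UDF closure.
def canSerialize(obj: AnyRef): Boolean =
  try {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    out.writeObject(obj)
    out.close()
    true
  } catch {
    case _: NotSerializableException => false
  }

// Stand-in for an enclosing class that is NOT Serializable.
class Holder {
  def doNextThing(time: String): String = time + "blah"
  // Converting the method to a function (same effect as `doNextThing _`)
  // captures `this`, i.e. the whole Holder instance.
  def fromMethod: String => String = {
    val f: String => String = doNextThing
    f
  }
}

// A function value that references nothing outside itself.
val standalone: String => String = (time: String) => time + "blah"

println(canSerialize(new Holder().fromMethod))  // false
println(canSerialize(standalone))               // true
```

This is why defining doNextThing as a function value, as in the working example, avoids dragging the enclosing object into the closure.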

Spark-Scala Convert String of Numbers to Double

I am trying to make a dense Vector out of a string. But first, I need to convert it to a double. How do I get it in double format?
[-- feature: string (nullable = false)]
https://i.stack.imgur.com/u1kWz.png
I have tried:
val new_col = df.withColumn("feature", df("feature").cast(DoubleType))
But it results in a column of nulls.
One approach would be to use a UDF:
import org.apache.spark.sql.functions._
import org.apache.spark.mllib.linalg.DenseVector
val df = Seq(
"-1,-1,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0",
"7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,",
"12.0,10.0,10.0,10.0,12.0,12.0,10.0,10.0,10.0,12.0",
"-1,-1,-1,-1,-1,-1,-1,5.0,9.0,9.0"
).toDF("feature")
def stringToVector = udf ( (s: String) =>
new DenseVector(s.split(",").map(_.toDouble))
)
df.withColumn("feature", stringToVector($"feature")).
show(false)
// +---------------------------------------------------+
// |feature |
// +---------------------------------------------------+
// |[-1.0,-1.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0]|
// |[7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0] |
// |[12.0,10.0,10.0,10.0,12.0,12.0,10.0,10.0,10.0,12.0]|
// |[-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,5.0,9.0,9.0] |
// +---------------------------------------------------+
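The cast to DoubleType returns null because the whole comma-separated string is not a single number; each token has to be parsed on its own. Here is a minimal plain-Scala sketch of that per-token parsing, which also tolerates stray spaces and empty tokens (parseDoubles is a hypothetical helper name):

```scala
// Hypothetical helper: split on commas, drop blanks, parse each token.
def parseDoubles(s: String): Array[Double] =
  s.split(",").map(_.trim).filter(_.nonEmpty).map(_.toDouble)

println(parseDoubles("-1,-1,12.0").mkString("[", ",", "]"))  // [-1.0,-1.0,12.0]
println(parseDoubles("7.0, 7.0,").mkString("[", ",", "]"))   // [7.0,7.0]
```

The resulting array could be passed to new DenseVector(...) inside the UDF in place of s.split(",").map(_.toDouble). A token that is not numeric would still throw a NumberFormatException.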
Regarding "first, I need to convert to a double. How do I get it in double format?":
You can simply use the built-in split function and cast the result to array<double>, as below
import org.apache.spark.sql.functions._
val new_col = df.withColumn("feature", split(df("feature"), ",").cast("array<double>"))
which should give you
root
.....
.....
|-- feature: array (nullable = true)
| |-- element: double (containsNull = true)
.....
.....
I hope the answer is helpful

create a spark dataframe from a nested json file in scala [duplicate]

I have a json file that looks like this
{
"group" : {},
"lang" : [
[ 1, "scala", "functional" ],
[ 2, "java","object" ],
[ 3, "py","interpreted" ]
]
}
I tried to create a dataframe using
val path = "some/path/to/jsonFile.json"
val df = sqlContext.read.json(path)
df.show()
when I run this I get
df: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
How do we create a df based on the contents of the "lang" key? I do not care about group{}; all I need is to pull the data out of "lang" and apply a case class like this
case class ProgLang (id: Int, lang: String, `type`: String )
I have read this post, Reading JSON with Apache Spark - `corrupt_record`, and understand that each record needs to be on its own line, but in my case I cannot change the file structure.
The JSON layout is the problem. By default, the json API of sqlContext expects one complete JSON object per line, so a file spread over several lines is read as a corrupt record. The correct form is
{"group":{},"lang":[[1,"scala","functional"],[2,"java","object"],[3,"py","interpreted"]]}
and supposing you have it in a file ("/home/test.json"), you can use the following method to get the dataframe you want
import org.apache.spark.sql.functions._
import sqlContext.implicits._
val df = sqlContext.read.json("/home/test.json")
val df2 = df.withColumn("lang", explode($"lang"))
.withColumn("id", $"lang"(0))
.withColumn("langs", $"lang"(1))
.withColumn("type", $"lang"(2))
.drop("lang")
.withColumnRenamed("langs", "lang")
// show returns Unit, so call it separately instead of assigning its result to df2
df2.show(false)
You should have
+---+-----+-----------+
|id |lang |type |
+---+-----+-----------+
|1 |scala|functional |
|2 |java |object |
|3 |py |interpreted|
+---+-----+-----------+
Updated
If you don't want to change your input json format as mentioned in your comment below, you can use wholeTextFiles to read the json file and parse it as below
import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType // needed for cast(IntegerType) below
val readJSON = sc.wholeTextFiles("/home/test.json")
.map(x => x._2)
.map(data => data.replaceAll("\n", ""))
val df = sqlContext.read.json(readJSON)
val df2 = df.withColumn("lang", explode($"lang"))
.withColumn("id", $"lang"(0).cast(IntegerType))
.withColumn("langs", $"lang"(1))
.withColumn("type", $"lang"(2))
.drop("lang")
.withColumnRenamed("langs", "lang")
df2.show(false)
df2.printSchema
It should give you dataframe as above and schema as
root
|-- id: integer (nullable = true)
|-- lang: string (nullable = true)
|-- type: string (nullable = true)
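The question also asks for a case class. Note that type is a reserved word in Scala, so a field with that name has to be written in backticks. Here is a minimal sketch, assuming each inner array arrives as a sequence of three strings (toProgLang is a hypothetical helper; rows that do not fit yield None):

```scala
import scala.util.Try

// `type` is a keyword, so the field name needs backticks.
case class ProgLang(id: Int, lang: String, `type`: String)

// Hypothetical helper: convert one inner array such as
// ["1", "scala", "functional"] into a ProgLang, or None if it does not fit.
def toProgLang(fields: Seq[String]): Option[ProgLang] = fields match {
  case Seq(id, lang, tpe) => Try(id.trim.toInt).toOption.map(ProgLang(_, lang, tpe))
  case _                  => None
}

println(toProgLang(Seq("1", "scala", "functional")))  // Some(ProgLang(1,scala,functional))
println(toProgLang(Seq("x", "java")))                 // None
```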
As of Spark 2.2 you can use multiLine option to deal with the case of multi-line JSONs.
scala> spark.read.option("multiLine", true).json("jsonFile.json").printSchema
root
|-- lang: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
Before Spark 2.2 see How to access sub-entities in JSON file? or Read multiline JSON in Apache Spark.

How to apply a function to a column of a Spark DataFrame?

Let's assume that we have a Spark DataFrame
df.getClass
Class[_ <: org.apache.spark.sql.DataFrame] = class org.apache.spark.sql.DataFrame
with the following schema
df.printSchema
root
|-- rawFV: string (nullable = true)
|-- tk: array (nullable = true)
| |-- element: string (containsNull = true)
Given that each row of the tk column is an array of strings, how to write a Scala function that will return the number of elements in each row?
You don't have to write a custom function because there is one:
import org.apache.spark.sql.functions.size
df.select(size($"tk"))
If you really want, you can write a udf:
import org.apache.spark.sql.functions.udf
val size_ = udf((xs: Seq[String]) => xs.size)
or even create a custom expression, but there is really no point in that.
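One caveat if you do go the udf route: tk is nullable, and a udf taking Seq[String] receives null for such rows, so xs.size would throw a NullPointerException. Here is a minimal null-safe sketch of the function body in plain Scala (safeSize is a hypothetical name, and returning 0 for null is an assumption about what you want):

```scala
// Hypothetical null-safe size: Option(null) is None, so null maps to 0.
def safeSize(xs: Seq[String]): Int =
  Option(xs).map(_.size).getOrElse(0)

println(safeSize(Seq("a", "b", "c")))  // 3
println(safeSize(null))                // 0
```

It could be wrapped as udf(safeSize _) and used like size_ above.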
One way is to access them using the sql like below.
df.registerTempTable("tab1")
val df2 = sqlContext.sql("select tk[0], tk[1] from tab1")
df2.show()
To get size of array column,
val df3 = sqlContext.sql("select size(tk) from tab1")
df3.show()
If your Spark version is older, you can use HiveContext instead of Spark's SQL Context.
I would also try something that traverses the array.

Convert String to Double in Scala / Spark?

I have JSON data set that contains a price in a string like "USD 5.00". I'd like to convert the numeric portion to a Double to use in an MLLIB LabeledPoint, and have managed to split the price string into an array of string. The below creates a data set with the correct structure:
import org.apache.spark.mllib.linalg.{Vector,Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
case class Obs(f1: Double, f2: Double, price: Array[String])
val obs1 = new Obs(1,2,Array("USD", "5.00"))
val obs2 = new Obs(2,1,Array("USD", "3.00"))
val df = sc.parallelize(Seq(obs1,obs2)).toDF()
df.printSchema
df.show()
val labeled = df.map(row => LabeledPoint(row.get(2).asInstanceOf[Array[String]].apply(1).toDouble, Vectors.dense(row.getDouble(0), row.getDouble(1))))
labeled.take(2).foreach(println)
The output looks like:
df: org.apache.spark.sql.DataFrame = [f1: double, f2: double, price: array<string>]
root
|-- f1: double (nullable = false)
|-- f2: double (nullable = false)
|-- price: array (nullable = true)
| |-- element: string (containsNull = true)
+---+---+-----------+
| f1| f2| price|
+---+---+-----------+
|1.0|2.0|[USD, 5.00]|
|2.0|1.0|[USD, 3.00]|
+---+---+-----------+
but then I wind up getting a ClassCastException:
java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String;
I think the ClassCastException is due to the println. But I didn't expect it; how can I handle this situation?
The potential duplicate solved one part of my question (thanks), but the deeper question of "promoting elements of a struct in a dataframe" remains. I'll let the mods determine if this is truly a dupe.
I think the problem is here:
.asInstanceOf[Array[String]]
Let me propose an alternative solution which I believe is much cleaner than playing with all asInstanceOf:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.Row
val assembler = new VectorAssembler()
.setInputCols(Array("f1", "f2"))
.setOutputCol("features")
val labeled = assembler.transform(df)
.select($"price".getItem(1).cast("double"), $"features")
.map{case Row(price: Double, features: Vector) =>
LabeledPoint(price, features)}
Regarding your problem: an ArrayType column is stored in Row as a WrappedArray, hence the error you see. You can either use
import scala.collection.mutable.WrappedArray
row.getAs[WrappedArray[String]](2)
or simply
row.getAs[Seq[String]](2)
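The failure can be reproduced without Spark. In the sketch below a plain Seq stands in for the WrappedArray that Row holds (an assumption made to keep the example self-contained; priceOf is a hypothetical helper): casting it to Array[String] fails, while reading it as Seq[String] works.

```scala
import scala.util.Try

// Stand-in for the value Row returns for an array column.
val stored: Any = Seq("USD", "5.00")

// Casting a Seq to Array[String] fails at runtime with ClassCastException.
val asArray = Try {
  val arr: Array[String] = stored.asInstanceOf[Array[String]]
  arr(1).toDouble
}
println(asArray.isFailure)  // true

// Reading it as Seq[String] works, because the stored value is a Seq.
def priceOf(value: Any): Double =
  value.asInstanceOf[Seq[String]].apply(1).toDouble

println(priceOf(stored))  // 5.0
```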