How to convert JSON to array - Scala

I have my input as below.
val inputJson ="""[{"color": "red","value": "#f00"},{"color": "blue","value": "#00f"}]"""
I need to convert the JSON values to arrays.
My output should be as below:
val colorval=Array("red","blue")
val value=Array("#f00","#00f")
Please kindly help.

The following solution should help you if you have large data sets.
// input data (assuming you may have a large data set)
val inputJson = """[{"color": "red","value": "#f00"},{"color": "blue","value": "#00f"}]"""
// read the JSON string into a DataFrame
val df = sqlContext.read.json(sc.parallelize(inputJson :: Nil))
// apply the built-in collect_list aggregate function
import org.apache.spark.sql.functions.collect_list
val result = df.select(collect_list("color").as("colorVal"), collect_list("value").as("value"))
result.show(false)
result.printSchema
and you should get:
+-----------+------------+
|colorVal   |value       |
+-----------+------------+
|[red, blue]|[#f00, #00f]|
+-----------+------------+
root
|-- colorVal: array (nullable = true)
| |-- element: string (containsNull = true)
|-- value: array (nullable = true)
| |-- element: string (containsNull = true)

Create a DataFrame from the JSON and explode it. Then use collect_list() or collect_set(), depending on whether you need duplicates or not; a short sketch of this approach follows below.
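For completeness, a minimal sketch of going from that DataFrame back to the plain Scala arrays the question asks for (assuming a Spark 2.2+ SparkSession named spark, where read.json accepts a Dataset[String]; collect_set would simply drop duplicates):
import org.apache.spark.sql.functions.collect_list
import spark.implicits._
// one row per JSON object in the input array
val jsonDF = spark.read.json(Seq(inputJson).toDS)
// aggregate each column into an array and pull it back to the driver
val colorval = jsonDF.select(collect_list("color")).first.getSeq[String](0).toArray
val value = jsonDF.select(collect_list("value")).first.getSeq[String](0).toArray
// colorval: Array(red, blue), value: Array(#f00, #00f)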

Related

Change Spark Dataframe Array[String] to Array[Double]

I am very new to Scala and I have the following issue.
I have a Spark DataFrame with the following schema:
df.printSchema()
root
|-- word: string (nullable = true)
|-- vector: array (nullable = true)
| |-- element: string (containsNull = true)
I need to convert this to the following schema:
root
|-- word: string (nullable = true)
|-- vector: array (nullable = true)
| |-- element: double (containsNull = true)
I do not want to specify the schema beforehand, but instead change the existing one.
I have tried the following:
df.withColumn("vector", col("vector").cast("array<element: double>"))
I have also tried converting it into an RDD, using map to change the elements, and then turning it back into a DataFrame, but I get the data type Array[WrappedArray] and I am not sure how to handle it.
Using PySpark and numpy, I could do this with df.select("vector").rdd.map(lambda x: numpy.asarray(x)).
Any help would be greatly appreciated.
You're close. Try this code:
val df2 = df.withColumn("vector", col("vector").cast("array<double>"))
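To verify the cast (note that col comes from org.apache.spark.sql.functions, so import it if it is not already in scope):
df2.printSchema
root
|-- word: string (nullable = true)
|-- vector: array (nullable = true)
| |-- element: double (containsNull = true)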

Custom row (List of CustomTypes) to spark dataframe

I am currently having some problems with creating a Spark Row object and converting it to a Spark DataFrame. What I am trying to achieve is the following:
I have two lists of custom types that look more or less like the classes below.
case class MyObject(name:String,age:Int)
case class MyObject2(xyz:String,abc:Double)
val listOne = List(MyObject("aaa",22),MyObject("sss",223))
val listTwo = List(MyObject2("bbb",23),MyObject2("abc",2332))
Using these two lists I want to create a DataFrame which has one row and two fields (fieldOne and fieldTwo):
fieldOne --> is a list of structs (similar to MyObject)
fieldTwo --> is a list of structs (similar to MyObject2)
In order to achieve this, I created custom StructTypes for MyObject, MyObject2 and my result type.
val myObjSchema = StructType(List(
  StructField("name", StringType),
  StructField("age", IntegerType)
))
val myObjSchema2 = StructType(List(
  StructField("xyz", StringType),
  StructField("abc", DoubleType)
))
val myRecType = StructType(
  List(
    StructField("myField", ArrayType(myObjSchema)),
    StructField("myField2", ArrayType(myObjSchema2))
  )
)
I populated my data within a Spark Row object and created a DataFrame:
val data = Row(
  List(MyObject("aaa",22), MyObject("sss",223)),
  List(MyObject2("bbb",23), MyObject2("abc",2332))
)
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(data)), myRecType
)
When I call printSchema on the DataFrame, the output is exactly what I would expect:
root
|-- myField: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- age: integer (nullable = true)
|-- myField2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- xyz: string (nullable = true)
| | |-- abc: double (nullable = true)
However, when I do a show, I get a runtime exception:
Caused by: java.lang.RuntimeException: spark_utilities.example.MyObject is not a valid external type for schema of struct<name:string,age:int>
It looks like something is wrong with the Row object; can you please explain what is going wrong here?
Thanks a lot for your help!
PS: I know I can create a custom case class like case class PH(ls: List[MyObject], ls2: List[MyObject2]), populate it and convert it to a Dataset. But due to some limitations I cannot use this approach and would like to solve it in the way mentioned above.
You cannot simply insert your case class objects inside a Row; you need to convert those objects to Rows themselves:
val data = Row(
  List(Row("aaa",22.toInt), Row("sss",223.toInt)),
  List(Row("bbb",23d), Row("abc",2332d))
)
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(data)), myRecType
)
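If you prefer not to write those Rows by hand, the same conversion can be done programmatically (a sketch, assuming the listOne and listTwo values from the question):
// map each case class instance to a Row that matches its StructType layout
val data2 = Row(
  listOne.map(o => Row(o.name, o.age)),
  listTwo.map(o => Row(o.xyz, o.abc))
)
val df2 = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(data2)), myRecType
)
df2.show(false)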

Casting an array of Doubles to String in spark sql

I'm trying to read data from a JSON which has an array holding lat/long values, something like [48.597315,-43.206085], and I want to parse them in Spark SQL as a single string. Is there a way I can do that?
My JSON input will look something like below.
{"id":"11700","position":{"type":"Point","coordinates":[48.597315,-43.206085]}}
I'm trying to push this to an RDBMS store, and when I try to cast position.coordinates to string it gives me
Can't get JDBC type for array<string>
as the destination datatype is nvarchar. Any kind of help is appreciated!
You can read your json file into a DataFrame, then 1) use concat_ws to stringify your lat/lon array into a single column, and 2) use struct to re-assemble the position struct-type column as follows:
// jsonfile:
// {"id":"11700","position":{"type":"Point","coordinates":[48.597315,-43.206085]}}
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for the $"colName" column syntax below
val df = spark.read.json("/path/to/jsonfile")
// printSchema:
// root
// |-- id: string (nullable = true)
// |-- position: struct (nullable = true)
// | |-- coordinates: array (nullable = true)
// | | |-- element: double (containsNull = true)
// | |-- type: string (nullable = true)
df.withColumn("coordinates", concat_ws(",", $"position.coordinates")).
select($"id", struct($"coordinates", $"position.type").as("position")).
show(false)
// +-----+----------------------------+
// |id   |position                    |
// +-----+----------------------------+
// |11700|[48.597315,-43.206085,Point]|
// +-----+----------------------------+
// printSchema:
// root
// |-- id: string (nullable = true)
// |-- position: struct (nullable = false)
// | |-- coordinates: string (nullable = false)
// | |-- type: string (nullable = true)
[UPDATE]
Using Spark SQL:
df.createOrReplaceTempView("position_table")
spark.sql("""
select id, concat_ws(',', position.coordinates) as position_coordinates
from position_table
""").
show(false)
//+-----+--------------------+
//|id   |position_coordinates|
//+-----+--------------------+
//|11700|48.597315,-43.206085|
//|11800|49.611254,-43.90223 |
//+-----+--------------------+
You have to transform the given column into a string before loading it into the target data source. For example, the following code creates a new column position.coordinates whose value is the joined string of the given array of doubles, by casting the array to its string representation and removing the brackets afterwards.
df.withColumn("position.coordinates", regexp_replace($"position.coordinates".cast("string"), "\\[|\\]", ""))
Alternatively, you can use a UDF to create a custom transformation function on Row objects. That way you can maintain the nested structure of the column. The following source (answer number 2) can give you some idea of how to use a UDF for your case: Spark UDF with nested structure as input parameter. A sketch of that idea is shown below.
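For illustration, here is a minimal sketch of the UDF idea (joinCoords is a hypothetical helper; it assumes the df read above and rebuilds the nested position struct with a string coordinates field):
import org.apache.spark.sql.functions.{col, struct, udf}
// join the lat/lon array into a single comma-separated string
val joinCoords = udf((coords: Seq[Double]) => coords.mkString(","))
val out = df.select(
  col("id"),
  struct(
    joinCoords(col("position.coordinates")).as("coordinates"),
    col("position.type").as("type")
  ).as("position")
)
// position.coordinates is now a plain string, so the JDBC write to nvarchar should work
out.printSchema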

How to get first value of WrappedArray in Spark?

I grouped by a few columns and am getting WrappedArray out of these cols, as you can see in the schema. How do I get rid of them so I can proceed to the next step and do an orderBy?
val sqlDF = spark.sql("SELECT * FROM parquet.`parquet/20171009121227/rels/*.parquet`")
Getting a DataFrame:
val final_df = groupedBy_DF.select(
  groupedBy_DF("collect_list(relev)").as("rel"),
  groupedBy_DF("collect_list(relev2)").as("rel2"))
Then printing the schema (final_df.printSchema) gives us:
|-- rel: array (nullable = true)
| |-- element: double (containsNull = true)
|-- rel2: array (nullable = true)
| |-- element: double (containsNull = true)
Sample current output is shown as a screenshot in the original post.
I am trying to convert to this:
|-- rel: double (nullable = true)
|-- rel2: double (nullable = true)
Desired example output:
-1.0,0.0
-1.0,0.0
In the case where collect_list will only ever return one value, use first instead. Then there is no need to handle the issue of having an array at all. Note that this should be done during the groupBy step.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.first

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val final_df = df.groupBy(...)
  .agg(first($"relev").as("rel"),
       first($"relev2").as("rel2"))
Try col(x).getItem:
import org.apache.spark.sql.functions.col

groupedBy_DF.select(
  groupedBy_DF("collect_list(relev)").as("rel"),
  groupedBy_DF("collect_list(relev2)").as("rel2")
).withColumn("rel_0", col("rel").getItem(0))
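If every array always contains exactly one element, you can extract both values and drop the arrays in one step, then order by the result (a sketch based on the select above):
val flat_df = groupedBy_DF.select(
  groupedBy_DF("collect_list(relev)").getItem(0).as("rel"),
  groupedBy_DF("collect_list(relev2)").getItem(0).as("rel2"))
flat_df.orderBy("rel").show(false)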
Try split (note that split expects a Column, and the column has to be a string, so the array is cast to its string form first):
import org.apache.spark.sql.functions._
val final_df = groupedBy_DF.select(
  groupedBy_DF("collect_list(relev)").as("rel"),
  groupedBy_DF("collect_list(relev2)").as("rel2"))
  .withColumn("rel", split(col("rel").cast("string"), ","))

Spark - How to parse a JSON-escaped String field as a JSON Object in DataFrames?

I have as input a set of files formatted as a single JSON object per line. The problem, however, is that one field in these JSON objects is a JSON-escaped String. Example:
{"clientAttributes":{"backfillId":null,"clientPrimaryKey":"abc"},"escapedJsonPayload":"{\"name\":\"Akash\",\"surname\":\"Patel\",\"items\":[{\"itemId\":\"abc\",\"itemName\":\"xyz\"}"}
When I create a DataFrame by reading the JSON file, it creates the DataFrame as below:
val df = spark.sqlContext.read.json("file:///home/akaspate/sample.json")
df: org.apache.spark.sql.DataFrame = [clientAttributes: struct<backfillId: string, clientPrimaryKey: string>, escapedJsonPayload: string]
As we can see, "escapedJsonPayload" is a String and I need it to be a Struct.
Note: I found a similar question on Stack Overflow and followed it (How to let Spark parse a JSON-escaped String field as a JSON Object to infer the proper structure in DataFrames?), but it is giving me "[_corrupt_record: string]".
I have tried the steps below:
val df = spark.sqlContext.read.json("file:///home/akaspate/sample.json") (works fine)
val escapedJsons: RDD[String] = sc.parallelize(Seq("""df"""))
val unescapedJsons: RDD[String] = escapedJsons.map(_.replace("\"{", "{").replace("\"}", "}").replace("\\\"", "\""))
val dfJsons: DataFrame = spark.sqlContext.read.json(unescapedJsons) (This results in [_corrupt_record: string])
Any help would be appreciated
First of all, the JSON you have provided has the wrong format (syntactically). The corrected JSON is as follows:
{"clientAttributes":{"backfillId":null,"clientPrimaryKey":"abc"},"escapedJsonPayload":{\"name\":\"Akash\",\"surname\":\"Patel\",\"items\":[{\"itemId\":\"abc\",\"itemName\":\"xyz\"}]}}
Next, to parse the above JSON correctly, you have to use the following code:
// read each line as text, strip the backslash escapes and the {"value":"..."} wrapper that toJSON adds, then parse the cleaned JSON
val rdd = spark.read.textFile("file:///home/akaspate/sample.json").toJSON.map(value => value.replace("\\", "").replace("{\"value\":\"", "").replace("}\"}", "}")).rdd
val df = spark.read.json(rdd)
The above code will give you the following output:
df.show(false)
+----------------+-------------------------------------+
|clientAttributes|escapedJsonPayload                   |
+----------------+-------------------------------------+
|[null,abc]      |[WrappedArray([abc,xyz]),Akash,Patel]|
+----------------+-------------------------------------+
With the following schema:
df.printSchema
root
|-- clientAttributes: struct (nullable = true)
| |-- backfillId: string (nullable = true)
| |-- clientPrimaryKey: string (nullable = true)
|-- escapedJsonPayload: struct (nullable = true)
| |-- items: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- itemId: string (nullable = true)
| | | |-- itemName: string (nullable = true)
| |-- name: string (nullable = true)
| |-- surname: string (nullable = true)
I hope this helps!
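As a side note, on Spark 2.1+ an alternative to the string replacements is to keep the original file and parse the escaped payload with from_json, supplying the payload schema explicitly (a sketch; it assumes the escaped string itself is complete, valid JSON, and the schema below is written by hand from the example record):
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

val payloadSchema = StructType(Seq(
  StructField("name", StringType),
  StructField("surname", StringType),
  StructField("items", ArrayType(StructType(Seq(
    StructField("itemId", StringType),
    StructField("itemName", StringType)))))
))

// df is the DataFrame from spark.sqlContext.read.json above,
// where escapedJsonPayload is still a plain string column
val parsed = df.withColumn("escapedJsonPayload", from_json(col("escapedJsonPayload"), payloadSchema))
parsed.printSchema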