Not able to get nested JSON value as column - Scala

I'm trying to create a schema for JSON and see it as columns in a DataFrame.
Input JSON:
{"place":{"place_name":"NYC","lon":0,"lat":0,"place_id":1009},"region":{"region_issues":[{"key":"health","issue_name":"Cancer"},{"key":"sports","issue_name":"swimming"}]}}
Code:
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}

val schemaRsvp = new StructType()
  .add("place", StructType(Array(
    StructField("place_name", DataTypes.StringType),
    StructField("lon", DataTypes.IntegerType),
    StructField("lat", DataTypes.IntegerType),
    StructField("place_id", DataTypes.IntegerType))))

val ip = spark.read.schema(schemaRsvp).json("D:\\Data\\rsvp\\inputrsvp.json")
ip.show()
It's showing all the fields in the single column place, but I want the values column-wise:
place_name,lon,lat,place_id
NYC,0,0,1009
Any suggestion how to fix this?

You can convert a struct type into columns using ".*":
ip.select("place.*").show()
+----------+---+---+--------+
|place_name|lon|lat|place_id|
+----------+---+---+--------+
|       NYC|  0|  0|    1009|
+----------+---+---+--------+
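You can also pick individual nested fields instead of flattening the whole struct; a minimal sketch against the same ip DataFrame:

import org.apache.spark.sql.functions.col

ip.select(col("place.place_name").as("place_name"), col("place.place_id").as("place_id")).show()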
UPDATE:
With the array column region_issues you can explode the data, then apply the same ".*" to convert the struct type into columns:
import org.apache.spark.sql.functions.{col, explode}

ip.select(col("place"), explode(col("region.region_issues")).as("region_issues"))
  .select("place.*", "region_issues.*").show(false)
+---+---+--------+----------+----------+------+
|lat|lon|place_id|place_name|issue_name|key   |
+---+---+--------+----------+----------+------+
|0  |0  |1009    |NYC       |Cancer    |health|
|0  |0  |1009    |NYC       |swimming  |sports|
+---+---+--------+----------+----------+------+
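Note: for the explode above to work, the DataFrame must actually contain the region column, which the schema at the top does not define. A sketch of the extended schema, with field types assumed from the sample JSON (alternatively, drop the .schema(...) call and let Spark infer it):

import org.apache.spark.sql.types._

val schemaRsvp = new StructType()
  .add("place", new StructType()
    .add("place_name", StringType)
    .add("lon", IntegerType)
    .add("lat", IntegerType)
    .add("place_id", IntegerType))
  .add("region", new StructType()
    .add("region_issues", ArrayType(new StructType()
      .add("key", StringType)
      .add("issue_name", StringType))))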

Related

Pass list of column values to Spark DataFrame as new column

I am trying to add a new column to a Spark DataFrame as below:
val abc = List("a", "b", "c", "d") // list of column names
I am trying to pass the above list of columns as a new column to the DataFrame, apply sha2 on that new column, and cast it to varchar(64):
source = source.withColumn("newcolumn", sha2(col(abc), 256).cast('varchar(64)'))
It compiles, but at runtime I get:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'abc' given input columns:
The expected output should be a DataFrame with newcolumn as the column name, holding the sha2 of the listed columns concatenated with ||, as a varchar(64).
Please suggest.
We can use map and concat_ws("||", ...) to create the new column, then apply sha2() on the concatenated data:
import org.apache.spark.sql.functions.{col, concat_ws, sha2}
import spark.implicits._

val abc = Seq("a", "b", "c", "d")
val df = Seq((1, 2, 3, 4)).toDF("a", "b", "c", "d")

df.withColumn("newColumn", sha2(concat_ws("||", abc.map(c => col(c)): _*), 256)).show(false)
//+---+---+---+---+----------------------------------------------------------------+
//|a  |b  |c  |d  |newColumn                                                       |
//+---+---+---+---+----------------------------------------------------------------+
//|1  |2  |3  |4  |20a5b7415fb63243c5dbacc9b30375de49636051bda91859e392d3c6785557c9|
//+---+---+---+---+----------------------------------------------------------------+
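Note that sha2(..., 256) already returns a 64-character hex string, so the varchar(64) cast adds nothing. If you reuse the pattern, a small sketch of a helper (hashColumns is a hypothetical name, not a Spark API):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, concat_ws, sha2}

// concatenate the given columns with "||" and hash the result
def hashColumns(df: DataFrame, cols: Seq[String], outCol: String): DataFrame =
  df.withColumn(outCol, sha2(concat_ws("||", cols.map(col): _*), 256))

val result = hashColumns(df, abc, "newColumn")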

Casting DataFrame columns with validation in Spark

I need to cast the columns of a DataFrame, which all contain string values, to the data types of a defined schema.
While doing the casting we need to put the corrupt records (values of the wrong data type) into a separate column.
Example DataFrame:
+---+----------+-----+
|id |name      |class|
+---+----------+-----+
|1  |abc       |21   |
|2  |bca       |32   |
|3  |abab      |4    |
|4  |baba      |5a   |
|5  |cccca     |     |
+---+----------+-----+
JSON schema of the file:
{"definitions":{},"$schema":"http://json-schema.org/draft-07/schema#","$id":"http://example.com/root.json","type":["object","null"],"required":["id","name","class"],"properties":{"id":{"$id":"#/properties/id","type":["integer","null"]},"name":{"$id":"#/properties/name","type":["string","null"]},"class":{"$id":"#/properties/class","type":["integer","null"]}}}
Here row 4 is a corrupt record, since the class column is of type integer and "5a" is not a valid integer.
So only this record should go into the corrupt-records column, not the 5th row (null is allowed by the schema).
Just check whether the value is NOT NULL before casting and NULL after casting:
import org.apache.spark.sql.functions.when
import spark.implicits._

df
  .withColumn("class_integer", $"class".cast("integer"))
  .withColumn(
    "class_corrupted",
    when($"class".isNotNull and $"class_integer".isNull, $"class"))
Repeat for each column / cast you need.
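If there are many columns, the same pattern can be folded over a name-to-type map; a sketch, where expectedTypes is an assumption written by hand from the JSON schema above:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, when}

// hypothetical mapping from column name to target Spark type
val expectedTypes = Map("id" -> "integer", "class" -> "integer")

val validated: DataFrame = expectedTypes.foldLeft(df) { case (acc, (name, tpe)) =>
  acc
    .withColumn(s"${name}_$tpe", col(name).cast(tpe))
    .withColumn(s"${name}_corrupted",
      when(col(name).isNotNull and col(s"${name}_$tpe").isNull, col(name)))
}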

Convert Array of String column to multiple columns in Spark Scala

I have a dataframe with following schema:
id : int,
emp_details: Array(String)
Some sample data:
1, Array(empname=xxx,city=yyy,zip=12345)
2, Array(empname=bbb,city=bbb,zip=22345)
This data is in a DataFrame, and I need to read emp_details from the array and assign the values to new columns as below, i.e. split the array into multiple columns named empname, city and zip:
.withColumn("empname", xxx)
.withColumn("city", yyy)
.withColumn("zip", 12345)
Could you please guide me on how we can achieve this with Spark 1.6 and Scala?
Really appreciate your help, thanks a lot!
You can use withColumn and split to get the required data:
import org.apache.spark.sql.functions.split

df1.withColumn("empname", split($"emp_details"(0), "=")(1))
  .withColumn("city", split($"emp_details"(1), "=")(1))
  .withColumn("zip", split($"emp_details"(2), "=")(1))
Output:
+---+----------------------------------+-------+----+-----+
|id |emp_details                       |empname|city|zip  |
+---+----------------------------------+-------+----+-----+
|1  |[empname=xxx, city=yyy, zip=12345]|xxx    |yyy |12345|
|2  |[empname=bbb, city=bbb, zip=22345]|bbb    |bbb |22345|
+---+----------------------------------+-------+----+-----+
UPDATE:
If the data in the array does not come in a fixed order, you can use a UDF to convert it to a map and use it as:
import org.apache.spark.sql.functions.udf

val getColumnsUDF = udf((details: Seq[String]) => {
  val detailsMap = details.map(_.split("=")).map(x => (x(0), x(1))).toMap
  (detailsMap("empname"), detailsMap("city"), detailsMap("zip"))
})
Now use the UDF:
df1.withColumn("emp", getColumnsUDF($"emp_details"))
  .select($"id", $"emp._1".as("empname"), $"emp._2".as("city"), $"emp._3".as("zip"))
  .show(false)
Output:
+---+-------+----+-----+
|id |empname|city|zip  |
+---+-------+----+-----+
|1  |xxx    |yyy |12345|
|2  |bbb    |bbb |22345|
+---+-------+----+-----+
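If some keys may be missing from emp_details, a variant (a sketch, not tested against Spark 1.6) that returns a Map instead of a tuple, so a missing key simply yields null for that column:

import org.apache.spark.sql.functions.udf

val toMapUDF = udf((details: Seq[String]) =>
  details.map(_.split("=", 2)).collect { case Array(k, v) => k -> v }.toMap)

df1.withColumn("m", toMapUDF($"emp_details"))
  .select($"id", $"m"("empname").as("empname"), $"m"("city").as("city"), $"m"("zip").as("zip"))
  .show(false)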
Hope this helps!

Remove duplicate column from DataFrame using Scala

I need to remove one column from a DataFrame that has another column with the same name. I need to remove only one of the two and keep the other for further use.
For example, given this input DF:
sno | age | psk | psk
---------------------
1 | 12 | a4 | a4
I would like to obtain this output DF:
sno | age | psk
----------------
1 | 12 | a4
An RDD is the way to go (but you need to know the column index of the duplicate column in order to drop it before going back to a DataFrame).
If you have dataframe with duplicate columns as
+---+---+---+---+
|sno|age|psk|psk|
+---+---+---+---+
|1  |12 |a4 |a4 |
+---+---+---+---+
You know that the last two columns are duplicates.
The next step is to take the column names with the duplicates removed and form the schema:
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val columns = df.columns.distinct // distinct preserves column order, unlike toSet
val schema = StructType(columns.map(name => StructField(name, StringType, true)))
The vital part is to convert the DataFrame to an RDD and drop the duplicate column index (here it is the 4th):
import org.apache.spark.sql.Row

val rdd = df.rdd.map(row => Row.fromSeq(Seq(row(0).toString, row(1).toString, row(2).toString)))
The final step is to convert the RDD back to a DataFrame using the schema:
sqlContext.createDataFrame(rdd, schema).show(false)
which should give you
+---+---+---+
|sno|age|psk|
+---+---+---+
|1  |12 |a4 |
+---+---+---+
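If you would rather not hard-code the index, a sketch that keeps the first occurrence of each column name automatically (same RDD approach as above):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// indexes of the first occurrence of each column name, in original order
val keepIdx = df.columns.zipWithIndex.groupBy(_._1).values.map(_.head._2).toSeq.sorted
val dedupSchema = StructType(keepIdx.map(i => StructField(df.columns(i), StringType, true)))
val dedupRdd = df.rdd.map(row => Row.fromSeq(keepIdx.map(i => Option(row(i)).map(_.toString).orNull)))
sqlContext.createDataFrame(dedupRdd, dedupSchema).show(false)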
I hope the answer is helpful

List to DataFrame in PySpark

Can someone tell me how to convert a list containing strings to a DataFrame in PySpark? I am using Python 3.6 with Spark 2.2.1. I have just started learning the Spark environment, and my data looks like below:
my_data =[['apple','ball','ballon'],['cat','camel','james'],['none','focus','cake']]
Now I want to create a DataFrame as follows:
+---+-------------------------+
|ID |words                    |
+---+-------------------------+
|1  |['apple','ball','ballon']|
|2  |['cat','camel','james']  |
+---+-------------------------+
I also want to add an ID column, which is not present in the data.
You can convert the list to a list of Row objects, then use spark.createDataFrame which will infer the schema from your data:
from pyspark.sql import Row
R = Row('ID', 'words')
# use enumerate to add the ID column
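# (use enumerate(my_data, 1) if the IDs should start at 1 instead of 0)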
spark.createDataFrame([R(i, x) for i, x in enumerate(my_data)]).show()
+---+--------------------+
| ID|               words|
+---+--------------------+
|  0|[apple, ball, bal...|
|  1| [cat, camel, james]|
|  2| [none, focus, cake]|
+---+--------------------+
Try this -
data_array = []
for i in range(len(my_data)):
    data_array.append((i, my_data[i]))

df = spark.createDataFrame(data=data_array, schema=["ID", "words"])
df.show()
Try this -- the simplest approach:
from pyspark.sql import Row
from datetime import datetime

utc = datetime.utcnow()  # example value; the original snippet left utc undefined
x = Row(utc_timestamp=utc, routine='routine name', message='your message')
data = [x]
df = sqlContext.createDataFrame(data)
Simple approach:
my_data = [['apple','ball','ballon'],['cat','camel','james'],['none','focus','cake']]

# zipWithIndex yields (element, index) pairs, so name the columns accordingly
spark.sparkContext.parallelize(my_data).zipWithIndex() \
    .toDF(["words", "id"]).show(truncate=False)
+---------------------+--+
|words                |id|
+---------------------+--+
|[apple, ball, ballon]|0 |
|[cat, camel, james]  |1 |
|[none, focus, cake]  |2 |
+---------------------+--+