How can I convert a nested JSON to a specific dataframe - Scala

I have a json which looks like this:
"A":{"B":1,"C":[{"D":2,"E":3},{"D":6,"E":7}]}
I would like to create a dataframe based on this with two rows.
+---+---+---+
|  B|  D|  E|
+---+---+---+
|  1|  2|  3|
|  1|  6|  7|
+---+---+---+
In an imperative language I would use a for loop; however, I've read that this is not recommended in Scala, so I am wondering how I can explode this inner JSON and turn the entries into new rows.
Thank you very much.

You can use from_json to parse the JSON string values:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  """"A":{"B":1,"C":[{"D":2,"E":3},{"D":6,"E":7}]}"""
).toDF("col")

val df1 = df.withColumn(
  "col",
  from_json(
    regexp_extract(col("col"), "\"A\":(.*)", 1), // extract the JSON object that follows "A":
    lit("struct<B:int,C:array<struct<D:int,E:int>>>") // schema as a DDL-formatted string
  )
).select(col("col.B"), expr("inline(col.C)")) // inline explodes the array of structs into columns D and E
df1.show
//+---+---+---+
//| B| D| E|
//+---+---+---+
//| 1| 2| 3|
//| 1| 6| 7|
//+---+---+---+
You can also pass the schema to from_json as a StructType, which you define like this:
import org.apache.spark.sql.types._

val schema = StructType(Array(
  StructField("B", IntegerType, true),
  StructField("C", ArrayType(StructType(Array(
    StructField("D", IntegerType, true),
    StructField("E", IntegerType, true)
  ))))
))
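For completeness, here is a minimal sketch (not part of the original answer) showing that StructType in use, assuming the df defined above is still in scope:

// Sketch: same parsing as df1, but passing the StructType instead of the DDL string.
val df2 = df.withColumn(
  "col",
  from_json(regexp_extract(col("col"), "\"A\":(.*)", 1), schema)
).select(col("col.B"), expr("inline(col.C)"))

df2.show()
// Should produce the same two rows as df1.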

Related

How to create a PySpark DataFrame inside of a loop?

How to create a PySpark DataFrame inside of a loop? In this loop, in each iteration, I am printing two values: print(a1, a2). Now I want to store all these values in a PySpark DataFrame.
Initially, before the loop, you could create an empty dataframe with your preferred schema. Then create a new df in each iteration with the same schema and union it with your original dataframe. Refer to the code below.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField('a1', StringType(), True),
    StructField('a2', StringType(), True)
])

df = spark.createDataFrame([], schema)

for i in range(1, 5):
    a1 = i
    a2 = i + 1
    # cast to str so the row matches the StringType schema
    newRow = spark.createDataFrame([(str(a1), str(a2))], schema)
    df = df.union(newRow)

df.show()
This gives the result below, where the values are appended to the df in each iteration.
+---+---+
| a1| a2|
+---+---+
| 1| 2|
| 2| 3|
| 3| 4|
| 4| 5|
+---+---+

Fill a DataFrame with values given in a Map

My objective is to create a function that, taking a Map and a data frame as parameters:
fillNa(columnsToFill, originalDF)
can fill the data frame with the values given in the Map.
I'm working with a DataFrame similar to the one you can see below:
+---------+-------------+----------------+-------------------+
|seller_id| nickname|successful_items|power_seller_status|
+---------+-------------+----------------+-------------------+
|260341211|HEBICOTE62617| 15| null|
|269984665|VACAPERVIAJES| 12| null|
|223499446|GAFAOCOSSR005| 10| gold|
|265004480|NEFCOTEOC8179| null| silver|
|265200651|RUBENTARARIRA| 11| null|
+---------+-------------+----------------+-------------------+
The desired output, therefore, is the following:
+---------+-------------+----------------+-------------------+
|seller_id| nickname|successful_items|power_seller_status|
+---------+-------------+----------------+-------------------+
|260341211|HEBICOTE62617| 15| normal|
|269984665|VACAPERVIAJES| 12| normal|
|223499446|GAFAOCOSSR005| 10| gold|
|265004480|NEFCOTEOC8179| 0| silver|
|265200651|RUBENTARARIRA| 11| normal|
+---------+-------------+----------------+-------------------+
The code that generates the DataFrame is the following:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val someData = Seq(
  Row("260341211", "HEBICOTE62617", 15, null),
  Row("269984665", "VACAPERVIAJES", 12, null),
  Row("223499446", "GAFAOCOSSR005", 10, "gold"),
  Row("265004480", "NEFCOTEOC8179", null, "silver"),
  Row("265200651", "RUBENTARARIRA", 11, null)
)

val someSchema = List(
  StructField("seller_id", StringType, true),
  StructField("nickname", StringType, true),
  StructField("successful_items", IntegerType, true),
  StructField("power_seller_status", StringType, true)
)

val originalDF = spark.createDataFrame(
  spark.sparkContext.parallelize(someData),
  StructType(someSchema)
)
However, when I tried to create a function that takes a string and fills the values, I couldn't do it for both fields. The best I could do is:
1- Replace only one column
2- Duplicate the rows
The map used as a parameter is the following:
val columnsToFill = Map(
  "power_seller_status" -> "normal",
  "successful_items" -> "0"
)
The functions I've created:
Version 1
def fillNa_version1(replacements: Map[String, String], dataFrame: DataFrame): DataFrame = {
  dataFrame.na.fill(replacements.values.head, Seq(replacements.keys.head))
}
Version 2
def fillNa_version2(replacements: Map[String, String], dataFrame: DataFrame) = {
  replacements.map { keyVal => dataFrame.na.fill(keyVal._2, Seq(keyVal._1)) }.reduce(_.union(_))
}
originalDF.na.fill(columnsToFill).show()
yields:
+---------+-------------+----------------+-------------------+
|seller_id| nickname|successful_items|power_seller_status|
+---------+-------------+----------------+-------------------+
|260341211|HEBICOTE62617| 15| normal|
|269984665|VACAPERVIAJES| 12| normal|
|223499446|GAFAOCOSSR005| 10| gold|
|265004480|NEFCOTEOC8179| 0| silver|
|265200651|RUBENTARARIRA| 11| normal|
+---------+-------------+----------------+-------------------+
which appears to be what you want, no?
If all you want to do is replace your nulls with some sort of default value, there are much easier ways to do that. You can use withColumn to derive a new column.
originalDF
  .select($"seller_id", $"nickname", $"successful_items", $"power_seller_status")
  .withColumn("derived_successful_items", when($"successful_items".isNull, "0").otherwise($"successful_items"))
  .withColumn("derived_power_seller", when($"power_seller_status".isNull, "normal").otherwise($"power_seller_status"))
  .show
You could also use coalesce (returns the first non-null argument):
withColumn("coalesced_successful_items",coalesce($"successful_items",lit("0")))

Manually create a pyspark dataframe

I am trying to manually create a pyspark dataframe given certain data:
row_in = [(1566429545575348), (40.353977), (-111.701859)]
rdd = sc.parallelize(row_in)
schema = StructType(
    [
        StructField("time_epocs", DecimalType(), True),
        StructField("lat", DecimalType(), True),
        StructField("long", DecimalType(), True),
    ]
)
df_in_test = spark.createDataFrame(rdd, schema)
This gives an error when I try to display the dataframe, so I am not sure how to do this.
However, the Spark documentation seems to be a bit convoluted to me, and I got similar errors when I tried to follow those instructions.
Does anyone know how to do this?
Simple dataframe creation:
df = spark.createDataFrame(
    [
        (1, "foo"),  # create your data here, be consistent in the types.
        (2, "bar"),
    ],
    ["id", "label"]  # add your column names here
)
df.printSchema()
root
|-- id: long (nullable = true)
|-- label: string (nullable = true)
df.show()
+---+-----+
| id|label|
+---+-----+
| 1| foo|
| 2| bar|
+---+-----+
According to the official doc:
When schema is a list of column names, the type of each column will be inferred from data. (example above ↑)
When schema is pyspark.sql.types.DataType or a datatype string, it must match the real data. (examples below ↓)
# Example with a datatype string
df = spark.createDataFrame(
    [
        (1, "foo"),  # Add your data here
        (2, "bar"),
    ],
    "id int, label string",  # add column names and types here
)

# Example with pyspark.sql.types
from pyspark.sql import types as T
df = spark.createDataFrame(
    [
        (1, "foo"),  # Add your data here
        (2, "bar"),
    ],
    T.StructType(  # Define the whole schema within a StructType
        [
            T.StructField("id", T.IntegerType(), True),
            T.StructField("label", T.StringType(), True),
        ]
    ),
)
df.printSchema()
root
|-- id: integer (nullable = true) # type is forced to Int
|-- label: string (nullable = true)
Additionally, you can create your dataframe from a Pandas dataframe; the schema will be inferred from the Pandas dataframe's types:
import pandas as pd
import numpy as np

pdf = pd.DataFrame(
    {
        "col1": [np.random.randint(10) for x in range(10)],
        "col2": [np.random.randint(100) for x in range(10)],
    }
)
df = spark.createDataFrame(pdf)
df.show()
+----+----+
|col1|col2|
+----+----+
| 6| 4|
| 1| 39|
| 7| 4|
| 7| 95|
| 6| 3|
| 7| 28|
| 2| 26|
| 0| 4|
| 4| 32|
+----+----+
To elaborate on / build off of @Steven's answer:
from pyspark.sql.types import StructType, StructField, FloatType, StringType

field = [
    StructField("MULTIPLIER", FloatType(), True),
    StructField("DESCRIPTION", StringType(), True),
]
schema = StructType(field)
multiplier_df = sqlContext.createDataFrame(sc.emptyRDD(), schema)
This will create a blank dataframe.
We can now simply add a row to it:
l = [(2.3, "this is a sample description")]
rdd = sc.parallelize(l)
multiplier_df_temp = spark.createDataFrame(rdd, schema)
multiplier_df = multiplier_df.union(multiplier_df_temp)
This answer demonstrates how to create a PySpark DataFrame with createDataFrame, create_df and toDF.
df = spark.createDataFrame([("joe", 34), ("luisa", 22)], ["first_name", "age"])
df.show()
+----------+---+
|first_name|age|
+----------+---+
| joe| 34|
| luisa| 22|
+----------+---+
You can also pass createDataFrame an RDD and a schema to construct DataFrames with more precision:
from pyspark.sql import Row
from pyspark.sql.types import *

rdd = spark.sparkContext.parallelize([
    Row(name='Allie', age=2),
    Row(name='Sara', age=33),
    Row(name='Grace', age=31)])
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), False)])
df = spark.createDataFrame(rdd, schema)
df.show()
+-----+---+
| name|age|
+-----+---+
|Allie| 2|
| Sara| 33|
|Grace| 31|
+-----+---+
create_df from my Quinn project allows for the best of both worlds - it's concise and fully descriptive:
from pyspark.sql.types import *
from quinn.extensions import *
df = spark.create_df(
    [("jose", "a"), ("li", "b"), ("sam", "c")],
    [("name", StringType(), True), ("blah", StringType(), True)]
)
df.show()
+----+----+
|name|blah|
+----+----+
|jose| a|
| li| b|
| sam| c|
+----+----+
toDF doesn't offer any advantages over the other approaches:
from pyspark.sql import Row
rdd = spark.sparkContext.parallelize([
    Row(name='Allie', age=2),
    Row(name='Sara', age=33),
    Row(name='Grace', age=31)])
df = rdd.toDF()
df.show()
+-----+---+
| name|age|
+-----+---+
|Allie| 2|
| Sara| 33|
|Grace| 31|
+-----+---+
With formatting
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        (1, "foo"),
        (2, "bar"),
    ],
    StructType(
        [
            StructField("id", IntegerType(), False),
            StructField("txt", StringType(), False),
        ]
    ),
)
print(df.dtypes)
df.show()
Extending @Steven's answer:
data = [(i, 'foo') for i in range(1000)] # random data
columns = ['id', 'txt'] # add your columns label here
df = spark.createDataFrame(data, columns)
Note: When schema is a list of column-names, the type of each column will be inferred from data.
If you want to specifically define schema then do this:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([StructField("id", IntegerType(), True), StructField("txt", StringType(), True)])
df1 = spark.createDataFrame(data, schema)
Outputs:
>>> df1
DataFrame[id: int, txt: string]
>>> df
DataFrame[id: bigint, txt: string]
For beginners, a full example importing data from a file:
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    ShortType,
    StringType,
    StructType,
    StructField,
    TimestampType,
)
import os

here = os.path.abspath(os.path.dirname(__file__))
spark = SparkSession.builder.getOrCreate()

schema = StructType(
    [
        StructField("id", ShortType(), nullable=False),
        StructField("string", StringType(), nullable=False),
        StructField("datetime", TimestampType(), nullable=False),
    ]
)

# read file or construct rows manually
df = spark.read.csv(os.path.join(here, "data.csv"), schema=schema, header=True)

How to parallelize processing of a dataframe in Apache Spark with combinations over a column

I'm looking for a solution to build an aggregation with all combinations of a column. For example, I have a data frame like the one below:
val df = Seq(("A", 1), ("B", 2), ("C", 3), ("A", 4), ("B", 5)).toDF("id", "value")
+---+-----+
| id|value|
+---+-----+
| A| 1|
| B| 2|
| C| 3|
| A| 4|
| B| 5|
+---+-----+
I am looking for an aggregation over all combinations of the column "id". Below is a solution I found, but it cannot use the parallelism of Spark; it works only on the driver node or on a single executor. Is there a better solution that gets rid of the for loop?
import spark.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val list = df.select($"id").distinct().orderBy($"id").as[String].collect()
val combinations = (1 to list.length).flatMap(x => list.combinations(x)).filter(_.length > 1)

val schema = StructType(
  StructField("indexvalue", IntegerType, true) ::
  StructField("segment", StringType, true) :: Nil)

var initialDF = spark.createDataFrame(sc.emptyRDD[Row], schema)

for (x <- combinations) {
  initialDF = initialDF.union(
    df.filter($"id".isin(x: _*))
      .agg(expr("sum(value)").as("indexvalue"))
      .withColumn("segment", lit(x.mkString("+"))))
}
initialDF.show()
+----------+-------+
|indexvalue|segment|
+----------+-------+
| 12| A+B|
| 8| A+C|
| 10| B+C|
| 15| A+B+C|
+----------+-------+
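One hedged sketch of an improvement (an illustration, not an accepted answer): the explicit for loop and the mutable initialDF can be replaced by mapping over combinations and reducing with union. The per-combination aggregations are still enumerated on the driver, but the single resulting plan is executed by Spark across the cluster:

// Sketch: same aggregation as above without the loop or the mutable var.
// Assumes df and combinations from the question are in scope, plus the same imports.
val resultDF = combinations
  .map { x =>
    df.filter($"id".isin(x: _*))
      .agg(sum("value").as("indexvalue"))
      .withColumn("segment", lit(x.mkString("+")))
  }
  .reduce(_ union _)

resultDF.show()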

Spark - How to convert map function output (Row, Row) tuple to one DataFrame

I need to write a scenario in Spark using the Scala API.
I am passing a user-defined function over a DataFrame; it processes each row of the data frame one by one and returns a tuple (Row, Row). How can I change RDD(Row, Row) to DataFrame(Row)? See the code sample below:
Calling the map function:
val df_temp = df_outPut.map { x => AddUDF.add(x,date1,date2)}
UDF definition:
def add(x: Row, dates: String*): (Row, Row) = {
  ......................
  ........................
  var result1, result2: Row = Row()
  ..........
  return (result1, result2)
}
Now df_temp is an RDD(Row1, Row2). My requirement is to make it a single RDD or DataFrame by breaking the tuple elements into individual records, i.e. RDD(Row). Appreciate your help.
You can use flatMap to flatten your Row tuples. Say we start from this example RDD:
rddExample.collect()
// res37: Array[(org.apache.spark.sql.Row, org.apache.spark.sql.Row)] = Array(([1,2],[3,4]), ([2,1],[4,2]))
val flatRdd = rddExample.flatMap{ case (x, y) => List(x, y) }
// flatRdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[45] at flatMap at <console>:35
To convert it to a data frame:
import org.apache.spark.sql.types.{StructType, StructField, IntegerType}
val schema = StructType(StructField("x", IntegerType, true) ::
  StructField("y", IntegerType, true) :: Nil)
val df = sqlContext.createDataFrame(flatRdd, schema)
df.show
+---+---+
| x| y|
+---+---+
| 1| 2|
| 3| 4|
| 2| 1|
| 4| 2|
+---+---+
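As a side note (an assumption, since the original columns aren't shown in the question): if the flattened rows keep exactly the same columns as the source DataFrame, you could reuse its schema instead of writing one by hand:

// Sketch: build the flat DataFrame by reusing the original DataFrame's schema.
// Assumes df_outPut (from the question) and flatRdd are in scope.
val flatDF = sqlContext.createDataFrame(flatRdd, df_outPut.schema)
flatDF.show()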