Spark: generate a dataframe from two json columns - scala

I have a dataframe with two columns. Each column contains json.
cola                         | colb
-----------------------------|--------------------
{"name":"Adam", "age": 23}   | {"country" : "USA"}
I wish to convert it to:
cola_name | cola_age | colb_country
----------|----------|-------------
Adam      | 23       | USA
How do I do this?
The approach I have in mind is: in the original dataframe, if I can merge both JSON columns into a single JSON object, I can then obtain the intended result with:
spark.read.json(df.select("merged_column").as[String])
But I can't find an easy way of merging two JSON objects into a single JSON object in Spark.
Update: the contents of the JSON are not known beforehand. I'm looking for a way to auto-detect the schema.

I'm more familiar with pyspark syntax. I think this works:
import pyspark.sql.functions as f
from pyspark.sql.types import *

schema_cola = StructType([
    StructField('name', StringType(), True),
    StructField('age', IntegerType(), True)
])
schema_colb = StructType([
    StructField('country', StringType(), True)
])

df = spark.createDataFrame([('{"name":"Adam", "age": 23}', '{"country" : "USA"}')], ['cola', 'colb'])

display(df
    .withColumn('cola_struct', f.from_json(f.col('cola'), schema_cola))
    .withColumn('colb_struct', f.from_json(f.col('colb'), schema_colb))
    .select(f.col('cola_struct.*'), f.col('colb_struct.*'))
)
The output looks like this:

+----+---+-------+
|name|age|country|
+----+---+-------+
|Adam| 23|    USA|
+----+---+-------+
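Since the update says the schema isn't known beforehand, here's a sketch of one way to infer each column's schema from the data itself instead of hardcoding it (assuming the JSON strings are well-formed; spark.read.json accepts an RDD of JSON strings in pyspark):

import pyspark.sql.functions as f

# Infer each column's schema by reading its JSON strings directly
schema_cola = spark.read.json(df.select('cola').rdd.map(lambda r: r[0])).schema
schema_colb = spark.read.json(df.select('colb').rdd.map(lambda r: r[0])).schema

result = (df
    .withColumn('cola_struct', f.from_json(f.col('cola'), schema_cola))
    .withColumn('colb_struct', f.from_json(f.col('colb'), schema_colb))
    .select(f.col('cola_struct.*'), f.col('colb_struct.*'))
)
result.show()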
Related

How to join rdd rows based on common value in another column RDD pyspark

The text file dataset I have:
Benz,25
BMW,27
BMW,25
Land Rover,22
Audi,25
Benz,25
The result I want is:
[((Benz,BMW),2),((Benz,Audi),1),((BMW,Audi),1)]
It basically pairs the cars that share a common value and counts how often each pair occurs.
My code so far is:
cars= sc.textFile('cars.txt')
carpair= cars.flatMap(lambda x: float(x.split(',')))
carpair.map(lambda x: (x[0], x[1])).groupByKey().collect()
As I'm a beginner, I'm not able to figure it out. Getting the occurrence count is secondary; I can't even map the values together.
In pyspark, it'd look like this:
import pyspark.sql.functions as f
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField('car_model', StringType(), True),
    StructField('value', IntegerType(), True)
])
df = (
    spark
    .read
    .schema(schema)
    .option('header', False)
    .csv('./cars.txt')
)
(
    df.alias('left')
    .join(df.alias('right'), ['value'], 'left')
    .select('value', f.col('left.car_model').alias('left_car_model'), f.col('right.car_model').alias('right_car_model'))
    # use < rather than != so each unordered pair is counted only once
    .where(f.col('left_car_model') < f.col('right_car_model'))
    .groupBy('left_car_model', 'right_car_model')
    .agg(f.count('*').alias('count'))
    .show()
)
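Since the question starts from RDDs, here is a rough RDD-only sketch of the same idea (assuming the cars.txt contents shown above; duplicate rows are deduplicated before pairing):

from itertools import combinations

cars = sc.textFile('cars.txt')
pairs = (cars
    .map(lambda line: line.split(','))                         # -> [model, value]
    .map(lambda p: (p[1].strip(), [p[0]]))                     # key by the shared value
    .reduceByKey(lambda a, b: a + b)                           # collect models per value
    .flatMap(lambda kv: combinations(sorted(set(kv[1])), 2))   # unique model pairs
    .map(lambda pair: (pair, 1))
    .reduceByKey(lambda a, b: a + b))
print(pairs.collect())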

pyspark json dataframe created with all null values

I have created a dataframe from a JSON file. However, the dataframe is created with the full schema but with all values as null. It's a valid JSON file.
df = spark.read.json(path)
When I displayed the data using df.display(), all I can see is null in the dataframe. Can anyone tell me what the issue could be?
Reading the JSON file without enabling the multiline option is probably the cause: by default Spark expects one JSON object per line, so a single JSON array spread over several lines parses to nulls.
Please go through the sample demonstration.
My sample json:
[{"id":1,"first_name":"Amara","last_name":"Taplin"},
{"id":2,"first_name":"Gothart","last_name":"McGrill"},
{"id":3,"first_name":"Georgia","last_name":"De Miranda"},
{"id":4,"first_name":"Dukie","last_name":"Arnaud"},
{"id":5,"first_name":"Mellicent","last_name":"Scathard"}]
I got null values when multiline was not used. With multiline enabled I got the proper result:
df= spark.read.option('multiline', True).json('/FileStore/tables/Sample1_json.json')
df.display()
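For reference, with the sample above the result should look roughly like this:

+----------+---+----------+
|first_name| id| last_name|
+----------+---+----------+
|     Amara|  1|    Taplin|
|   Gothart|  2|   McGrill|
|   Georgia|  3|De Miranda|
|     Dukie|  4|    Arnaud|
| Mellicent|  5|  Scathard|
+----------+---+----------+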
If you also want to supply the schema externally, you can do it like this:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField('first_name', StringType(), True),
    StructField('id', IntegerType(), True),
    StructField('last_name', StringType(), True)
])
df = spark.read.schema(schema).option('multiline', True).json('/FileStore/tables/Sample1_json.json')
df.show()

Structured streaming data stored in a dataframe

I've got a spark dataframe of the following form:
from pyspark.sql.functions import *
from pyspark.sql.types import *

schema_sdf_consistent = StructType([
    StructField("A", DoubleType(), True),
    StructField("B", DoubleType(), True),
    StructField("C", DoubleType(), True),
])

sdf_consistent_init = [(0.0, 0.0, 0.0)]
sdf_consistent = spark.createDataFrame(data=sdf_consistent_init, schema=schema_sdf_consistent)
sdf_consistent = sdf_consistent.withColumn("ts", unix_timestamp(current_timestamp()))
sdf_cons = sdf_consistent.select("ts", "A", "B", "C")
sdf_cons.show()
I am receiving structured streaming data as rows of a timestamp, a key, and a value.
My aim: I would like to append the current streaming data to my dataframe, in such a way that the timestamp "ts" (e.g. "1653577048") and the key (e.g. "A") with its value (e.g. "33.2") from the streaming data are appended to the corresponding columns of the dataframe, and the missing values for columns "B" and "C" are filled with the values from the previous row of the dataframe.
The trick is the foreachBatch function that can be applied to a data stream: it hands each micro-batch to a callback as a normal Spark dataframe, so functions like pivoting and joins can be applied.
The join can be made via an auxiliary row-index column added to both dataframes, like here:
from pyspark.sql import Window
from pyspark.sql.functions import row_number, monotonically_increasing_id

# Add row_index for join and perform join
sdf1 = sdf1.withColumn('row_index', row_number().over(Window.orderBy(monotonically_increasing_id())))
sdf2 = sdf2.withColumn('row_index', row_number().over(Window.orderBy(monotonically_increasing_id())))
sdf_join = sdf1.join(sdf2, on=["row_index"]).drop("row_index")
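A minimal sketch of the foreachBatch wiring (stream_df and process_batch are illustrative names; the pivot assumes the stream carries ts/key/value columns as described above):

from pyspark.sql import functions as f

def process_batch(batch_df, batch_id):
    # batch_df is a plain, non-streaming dataframe here, so pivots and joins work
    pivoted = batch_df.groupBy('ts').pivot('key').agg(f.first('value'))
    # ... join pivoted onto sdf_cons via the row-index trick above ...

query = (stream_df.writeStream
    .foreachBatch(process_batch)
    .start())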

Issue with pyspark df.show

I am reading a gz file in pyspark, creating an RDD and a schema, and then using that RDD to create the dataframe. But I am not able to see any output.
Here is my code; I am not sure what I am doing wrong.
lines = sc.textFile("x.gz")
parts = lines.map(lambda l: l.split("\t"))
db = parts.map(lambda p: (p[0], int(p[1]), int(p[2]), int(p[3]).strip()))

schema = StructType([
    StructField("word", StringType(), True),
    StructField("count1", IntegerType(), True),
    StructField("count2", IntegerType(), True),
    StructField("count3", IntegerType(), True)])

df = sqlContext.createDataFrame(db, schema)
df.show()

df.createOrReplaceTempView("dftable")
result = sqlContext.sql("SELECT COUNT(*) FROM dftable")
result.show()
Moreover, I also want to calculate the number of rows in my table; that's why I used the SQL query. But whenever I try to call .show(), an error is thrown. What am I doing wrong here?
The data in the gz file is something like.....
A'String' some_number some_number some_number
The some_number fields are in string format.
Please guide me on what I am doing wrong.
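As a guess at the likely culprit: int(p[3]).strip() calls .strip() on an int, which raises an AttributeError as soon as Spark actually evaluates the RDD (i.e. when .show() runs). The intended expression is probably:

# strip the string first, then convert it to an int
db = parts.map(lambda p: (p[0], int(p[1]), int(p[2]), int(p[3].strip())))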

How to create empty struct in pyspark?

I'm trying to create an empty struct column in pyspark. For an array this works:
import pyspark.sql.functions as F
df = df.withColumn('newCol', F.array([]))
but this gives me an error.
df = df.withColumn('newCol', F.struct())
I saw a similar question, but for Scala, not pyspark, so it doesn't really help me.
If you know the schema of the struct column, you can use the function from_json as follows:
from pyspark.sql.types import StructType, StructField, StringType

struct_schema = StructType([
    StructField('name', StringType(), False),
    StructField('surname', StringType(), False),
])
df = df.withColumn(
    'newCol', F.from_json(F.lit(''), struct_schema)
)
Actually the array is not really empty, because it has an empty element.
You should instead consider something like this:
import pyspark.sql.types as T

df = df.withColumn('newCol', F.lit(None).cast(T.StructType()))
PS: this is a late conversion of my comment into an answer, as was suggested; I hope it helps even though it comes long after the OP's question.
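If the struct should instead carry named fields that are all null (rather than the column being null overall), one sketch of an alternative, with illustrative field names:

import pyspark.sql.functions as F

df = df.withColumn(
    'newCol',
    F.struct(
        F.lit(None).cast('string').alias('name'),
        F.lit(None).cast('string').alias('surname'),
    )
)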