Structured streaming data stored in a dataframe - pyspark

I've got a Spark dataframe of the following form:
from pyspark.sql.functions import *
from pyspark.sql.types import *
schema_sdf_consistent = StructType([
    StructField("A", DoubleType(), True),
    StructField("B", DoubleType(), True),
    StructField("C", DoubleType(), True),
])
sdf_consistent_init = [(0.0, 0.0, 0.0)]
sdf_consistent = spark.createDataFrame(data=sdf_consistent_init, schema=schema_sdf_consistent)
sdf_consistent = sdf_consistent.withColumn("ts", unix_timestamp(current_timestamp()))
sdf_cons = sdf_consistent.select("ts", "A", "B", "C")
sdf_cons.show()
I am receiving structured streaming data in the following form: each streaming record contains a timestamp "ts" (e.g. "1653577048") and a key (e.g. "A") with its value (e.g. "33.2").
My aim: I would like to append the current streaming data to my dataframe, in such a way that the timestamp "ts" and the key's value are appended to the corresponding columns of the dataframe. The missing values for columns "B" and "C" are filled with the values from the previous row of the dataframe.

The trick is the foreachBatch function, which can be applied to a data stream: it hands each micro-batch to your function as a normal Spark dataframe, so that batch operations like pivoting and joins can be applied.
The join can be made via an auxiliary row-index column added to both dataframes, like here:
from pyspark.sql.functions import monotonically_increasing_id, row_number
from pyspark.sql.window import Window

# Add row_index to both dataframes and perform the join on it
sdf1 = sdf1.withColumn('row_index', row_number().over(Window.orderBy(monotonically_increasing_id())))
sdf2 = sdf2.withColumn('row_index', row_number().over(Window.orderBy(monotonically_increasing_id())))
sdf_join = sdf1.join(sdf2, on=["row_index"]).drop("row_index")
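A minimal sketch of the foreachBatch wiring (hedged: the stream name sdf_stream, its columns ts, key and value, and keeping the accumulated dataframe in a global variable are assumptions for illustration, not taken from the original post):
from pyspark.sql.functions import first

def process_batch(batch_df, batch_id):
    # Inside foreachBatch each micro-batch arrives as a plain DataFrame,
    # so batch operations such as pivot are available.
    global sdf_cons
    pivoted = (batch_df
        .groupBy("ts")
        .pivot("key", ["A", "B", "C"])   # assumed key column holding "A"/"B"/"C"
        .agg(first("value")))
    # Append to the accumulated dataframe; forward-filling B and C from the
    # previous row would happen here as well (allowMissingColumns needs Spark 3.1+).
    sdf_cons = sdf_cons.unionByName(pivoted, allowMissingColumns=True)

query = sdf_stream.writeStream.foreachBatch(process_batch).start()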


How to join rdd rows based on common value in another column RDD pyspark

The text file dataset I have:
Benz,25
BMW,27
BMW,25
Land Rover,22
Audi,25
Benz,25
The result I want is:
[((Benz,BMW),2),((Benz,Audi),1),((BMW,Audi),1)]
It basically pairs the cars that share a common value and counts how often each pair occurs together.
My code so far is:
cars= sc.textFile('cars.txt')
carpair= cars.flatMap(lambda x: float(x.split(',')))
carpair.map(lambda x: (x[0], x[1])).groupByKey().collect()
As I'm a beginner, I'm not able to figure it out.
Getting the occurrence count is secondary; I can't even map the values together.
In pyspark, it'd look like this:
import pyspark.sql.functions as f
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField('car_model', StringType(), True),
    StructField('value', IntegerType(), True)
])
df = (
    spark
    .read
    .schema(schema)
    .option('header', False)
    .csv('./cars.txt')
)
(
    df.alias('left')
    .join(df.alias('right'), ['value'], 'left')
    .select('value', f.col('left.car_model').alias('left_car_model'), f.col('right.car_model').alias('right_car_model'))
    .where(f.col('left_car_model') != f.col('right_car_model'))
    .groupBy('left_car_model', 'right_car_model')
    .agg(f.count('*').alias('count'))
    .show()
)
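Note that this self-join emits every pair in both directions (e.g. both (Benz, BMW) and (BMW, Benz)). If only unordered pairs are wanted, one possible variant (a sketch, not part of the original answer, reusing df and f from above) keeps a single direction by ordering the two model names:
(
    df.alias('left')
    .join(df.alias('right'), ['value'])
    .select('value',
            f.col('left.car_model').alias('left_car_model'),
            f.col('right.car_model').alias('right_car_model'))
    # keep each unordered pair once: left model must sort before right model
    .where(f.col('left_car_model') < f.col('right_car_model'))
    .groupBy('left_car_model', 'right_car_model')
    .agg(f.count('*').alias('count'))
    .show()
)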

Does spark's collect() action, after an orderBy, provide the order preserved list?

I have a dataframe in Spark Scala. When I perform a collect operation on the dataframe after an orderBy operation, will the order be preserved in the collected Scala list?
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema = StructType(Array(
  StructField("language", StringType, true),
  StructField("users", IntegerType, true)))
val data = Seq(Row("Java", 20000),
  Row("Python", 100000),
  Row("Scala", 3000))
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
// Performing an orderBy operation. This will sort the data based on
// the number of users in descending order.
val dfSorted = df.orderBy(col("users").desc)
// Now I collect the data. This is where I am not sure whether the data will be
// sorted or not, since the ordering may happen on various executors in the cluster.
val collectedDataList = dfSorted.collect()
I know that the order is preserved in the list in Scala but I am not sure if the collect operation will provide the ordered data.

Issue with pyspark df.show

I am reading a gz file in pyspark, creating an RDD and a schema, and then using that RDD to create the DataFrame. But I am not able to see any output.
Here is my code; I am not sure what I am doing wrong.
lines = sc.textFile("x.gz")
parts = lines.map(lambda l: l.split("\t"))
db = parts.map(lambda p: (p[0], int(p[1]), int(p[2]), int(p[3]).strip()))
schema = StructType([
    StructField("word", StringType(), True),
    StructField("count1", IntegerType(), True),
    StructField("count2", IntegerType(), True),
    StructField("count3", IntegerType(), True)])
df = sqlContext.createDataFrame(db, schema)
df.show()
df.createOrReplaceTempView("dftable")
result = sqlContext.sql("SELECT COUNT(*) FROM dftable")
result.show()
Moreover, I also want to calculate the number of rows in my table; that's why I used a SQL query. But whenever I try to call .show(), an error is thrown. What am I doing wrong here?
The data in the gz file is something like:
A'String' some_number some_number some_number
The some_number values are in string format.
Please guide me on what I am doing wrong.

Spark generate a dataframe from two json columns

I have a dataframe with two columns. Each column contains json.
cola                          colb
{"name":"Adam", "age": 23}    {"country" : "USA"}
I wish to convert it to:
cola_name    cola_age    colb_country
Adam         23          USA
How do I do this?
The approach I have in mind is: if, in the original dataframe, I can merge both JSON strings into a single JSON object, I can then obtain the intended result with
spark.read.json(df.select("merged_column").as[String])
But I can't find an easy way of merging two JSON objects into a single JSON object in Spark.
Update: the contents of the JSON are not known beforehand. I am looking for a way to auto-detect the schema.
I'm more familiar with pyspark syntax. I think this works:
import pyspark.sql.functions as f
from pyspark.sql.types import *
schema_cola = StructType([
    StructField('name', StringType(), True),
    StructField('age', IntegerType(), True)
])
schema_colb = StructType([
    StructField('country', StringType(), True)
])
df = spark.createDataFrame([('{"name":"Adam", "age": 23}', '{"country" : "USA"}')], ['cola', 'colb'])
display(df
    .withColumn('cola_struct', f.from_json(f.col('cola'), schema_cola))
    .withColumn('colb_struct', f.from_json(f.col('colb'), schema_colb))
    .select(f.col('cola_struct.*'), f.col('colb_struct.*'))
)
The output is a single row with the columns name, age and country (Adam, 23, USA).
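Since the update says the JSON structure is not known beforehand, one possible variant (a sketch, not part of the original answer) is to let spark.read.json infer each column's schema from the raw JSON strings and then feed that schema to from_json. The inferred_schema helper below is a hypothetical name, and it assumes all rows of a column share the same structure:
import pyspark.sql.functions as f

def inferred_schema(df, colname):
    # Infer the schema by pointing spark.read.json at the column's raw JSON strings.
    return spark.read.json(df.select(colname).rdd.map(lambda r: r[0])).schema

display(df
    .withColumn('cola_struct', f.from_json(f.col('cola'), inferred_schema(df, 'cola')))
    .withColumn('colb_struct', f.from_json(f.col('colb'), inferred_schema(df, 'colb')))
    .select(f.col('cola_struct.*'), f.col('colb_struct.*'))
)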

How to log malformed rows from Scala Spark DataFrameReader csv

The documentation for the Scala_Spark_DataFrameReader_csv suggests that spark can log the malformed rows detected while reading a .csv file.
- How can one log the malformed rows?
- Can one obtain a val or var containing the malformed rows?
The option from the linked documentation is:
maxMalformedLogPerPartition (default 10): sets the maximum number of malformed rows Spark will log for each partition. Malformed records beyond this number will be ignored
Based on this Databricks example, you need to explicitly add the "_corrupt_record" column to the schema definition when you read in the file. Something like this worked for me in pyspark 2.4.4:
from pyspark.sql.types import *
my_schema = StructType([
    StructField("field1", StringType(), True),
    ...
    StructField("_corrupt_record", StringType(), True)
])
my_data = spark.read.format("csv")\
    .option("path", "/path/to/file.csv")\
    .schema(my_schema)\
    .load()
my_data.count()  # force reading the csv
corrupt_lines = my_data.filter("_corrupt_record is not NULL")
corrupt_lines.take(5)
If you are using Spark 2.3, check the _corrupt_record special column. According to several Spark discussions "it should work", so after the read, filter the rows where that column is non-empty; those should be your errors. You could also check the input_file_name() SQL function.
If you are using a version lower than 2.3, you should implement a custom read-and-record solution, because according to my tests _corrupt_record does not work for the csv data source there.
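A rough sketch of that filtering idea (hedged: it reuses the my_data dataframe from the earlier answer, whose schema already contains _corrupt_record; input_file_name() reports which source file each row came from):
from pyspark.sql.functions import col, input_file_name

# Keep only the rows Spark flagged as corrupt and record their source file.
bad_rows = (my_data
    .filter(col("_corrupt_record").isNotNull())
    .withColumn("source_file", input_file_name()))
bad_rows.show(truncate=False)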
I've expanded on klucar's answer here by loading the csv, making a schema from the non-corrupted records, adding the corrupt-record column, using the new schema to load the csv again, and then looking for corrupted records.
from pyspark.sql.types import StructField, StringType
from pyspark.sql.functions import col

file_path = "/path/to/file"
mode = "PERMISSIVE"
# First pass: let Spark infer the schema, then append the corrupt-record column to it.
schema = spark.read.options(mode=mode).csv(file_path).schema
schema = schema.add(StructField("_corrupt_record", StringType(), True))
# Second pass: read again with the extended schema so malformed lines land in _corrupt_record.
df = spark.read.options(mode=mode).schema(schema).csv(file_path)
df.cache()
df.count()  # force the read so corrupt records are materialized
df.filter(col("_corrupt_record").isNotNull()).show()