How to compare 2 JSON schemas using pyspark?

I have 2 JSON schemas as below -
df1.printSchema()
# root
# |-- name: string (nullable = true)
# |-- age: long (nullable = true)
df2.printSchema()
# root
# |-- name: array (nullable = true)
# |-- gender: integer (nullable = true)
# |-- age: long (nullable = true)
How can I compare these 2 schemas and highlight the differences using PySpark? I am using pyspark-sql to load the data from the JSON file into a DataFrame.

While it is not clear what you mean by "compare", the following code will give you the fields (StructField objects) which are in df2 and not in df1.
set(df2.schema.fields) - set(df1.schema.fields)
Converting to a set removes duplicates, and the set difference leaves only the fields of df2 that are missing from df1.

I find the following one-line piece of code useful and neat. It also gives you the two-directional differences at the column level:
set(df1.schema).symmetric_difference(set(df2.schema))

Related

scala: read csv of documents, create cosine similarity

I'm reading in dozens of documents. They seem to be read into both RDDs and DataFrames as a set of string columns:
This is the schema:
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
|-- _c3: string (nullable = true)...
|-- _c58: string (nullable = true)
|-- _c59: string (nullable = true)
This is the head of the df:
+-------------+-------------+---
|          _c1|          _c2|...
+-------------+-------------+---
|           V1|           V2|...
|This text ...|This is an...|...
+-------------+-------------+---
I'm trying to create a cosine similarity matrix using this:
val contentRDD = spark.sparkContext.textFile("...documents_vector.csv").toDF()
val Row(coeff0: Matrix) = Correlation.corr(contentRDD, "features").head
println(s"Pearson correlation matrix:\n $coeff0")
This is another way I was doing it:
val df_4 = spark.read.csv("/document_vector.csv")
Here, features would be the name of the column created by converting the single row of 59 columns into a single column of 59 rows.
Is there a way to map each new element in the csv to a new row to complete the cosine similarity matrix? Is there another way I should be doing this?
Thank you to any who consider this.
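It is not entirely clear what the data looks like, but here is a minimal sketch of the second approach, assuming the _cN columns actually hold numeric values (raw document text would first need to be vectorized, e.g. with HashingTF/IDF): cast the columns to double, assemble them into the features vector column that Correlation.corr expects, and then compute the correlation matrix.
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col

val raw = spark.read.csv("/document_vector.csv")

// cast every _cN column to double (assumption: the cells are numeric)
val numeric = raw.select(raw.columns.map(c => col(c).cast("double").as(c)): _*)

// pack all columns into the single vector column "features" that Correlation.corr expects
val assembler = new VectorAssembler()
  .setInputCols(numeric.columns)
  .setOutputCol("features")
val withFeatures = assembler.transform(numeric)

// same call as in the question, now against a DataFrame that actually has a "features" column
val Row(coeff: Matrix) = Correlation.corr(withFeatures, "features").head
println(s"Pearson correlation matrix:\n $coeff")
Note that this yields a column-to-column Pearson correlation matrix; if cosine similarity is what is really needed, RowMatrix.columnSimilarities() from org.apache.spark.mllib.linalg.distributed is one option for pairwise cosine similarities between columns.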

Deequ - How to put validation on a subset of a dataset?

I have a use case where I want to apply certain validations to the subset of data that satisfies a specific condition.
For example, I have a dataframe which has 4 columns: colA, colB, colC, colD.
df.printSchema
root
|-- colA: string (nullable = true)
|-- colB: integer (nullable = false)
|-- colC: string (nullable = true)
|-- colD: string (nullable = true)
I want to put a validation that, wherever colA == "x" and colB > 20, the combination of colC and colD should be unique (basically, hasUniqueness(Seq("colC", "colD"), Check.IsOne)).
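One way to do this (a sketch only, not necessarily the only Deequ idiom) is to filter the dataframe down to the rows that satisfy the condition first, and then run the uniqueness check on that subset:
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import org.apache.spark.sql.functions.col

// restrict the check to the rows that satisfy the condition
val subset = df.filter(col("colA") === "x" && col("colB") > 20)

val result = VerificationSuite()
  .onData(subset)
  .addCheck(
    Check(CheckLevel.Error, "colC/colD unique when colA = x and colB > 20")
      .hasUniqueness(Seq("colC", "colD"), Check.IsOne))
  .run()

if (result.status != CheckStatus.Success) {
  // inspect result.checkResults to see which constraints failed
}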

Divide dataframe into batches in Spark

I need to run a set of transformations on the dataframe in batches of hours, and the number of hours should be parameterized so it can be changed - for example, run the transformations on 3 hours of the dataframe, then on the next 2 hours. In other words, each transformation step should take a parameterized number of hours.
The signature of the transformation looks like this:
def transform(wordsFeed: DataFrame)(filesFeed: DataFrame): Unit
So I want to do this division into batches and then call transform on each batch. But I can't use groupBy, as it would turn the dataframe into a grouped dataset, while I need to preserve all the columns in the schema. How can I do that?
val groupedDf = df.srcHours.groupBy($"event_ts")
transform(keywords)(groupedDf)
Data schema looks like this:
root
|-- date_time: integer (nullable = true)
|-- user_id: long (nullable = true)
|-- order_id: string (nullable = true)
|-- description: string (nullable = true)
|-- hashed_user_id: string (nullable = true)
|-- event_date: date (nullable = true)
|-- event_ts: timestamp (nullable = true)
|-- event_hour: long (nullable = true)
The main reason to introduce this batching is that there's too much data to process at once.
Note: I still want to use batch data processing and not streaming in this case
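One possible approach (a sketch only; the helper name runInHourBatches and batching by event_hour are assumptions, and it presumes the list of distinct hours is small enough to collect to the driver): collect the distinct event_hour values, group them into batches of the parameterized size, and filter the dataframe once per batch before calling transform.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def runInHourBatches(srcHours: DataFrame, keywords: DataFrame, hoursPerBatch: Int)
                    (transform: DataFrame => DataFrame => Unit): Unit = {
  // distinct hours present in the data, collected to the driver (assumed to be a small list)
  val hours = srcHours.select("event_hour").distinct()
    .collect().map(_.getLong(0)).sorted

  // consecutive batches of hoursPerBatch hours; each batch keeps the full schema
  hours.grouped(hoursPerBatch).foreach { batch =>
    val batchDf = srcHours.filter(col("event_hour").isin(batch: _*))
    transform(keywords)(batchDf)
  }
}

// e.g. run the transformation in 3-hour batches
runInHourBatches(srcHours, keywords, 3)(transform)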

How to assign a value to an existing empty dataframe column in Scala?

I am reading a CSV file which has a trailing | delimiter, so the load method creates a last column in the dataframe with no name and no values, in Spark 1.6.
val df = sqlContext.read.format("com.databricks.spark.csv").option("delimiter","|").option("header","true").load("filepath")
val df2 = df.withColumnRenamed(df.columns(83),"Invalid_Status")
df.withColumnRenamed(df.columns(83),"Invalid_Status").drop(df.col("Invalid_Status"))
The result I expected:
root
|-- FddCell: string (nullable = true)
|-- Trn_time: string (nullable = true)
|-- CELLNAME.FddCell: string (nullable = true)
|-- Invalid_Status: string (nullable = true)
but the actual output is:
root
|-- FddCell: string (nullable = true)
|-- Trn_time: string (nullable = true)
|-- CELLNAME.FddCell: string (nullable = true)
|-- : string (nullable = true)
with no values in the column, so I have to drop this column and create a new column again.
It is not completely clear whether you want to just rename the column to Invalid_Status or drop the column entirely. What I understand is that you are trying to operate (rename/drop) on the last column, which has no name.
I will try to help you with both solutions.
To rename the column, keeping its (blank) values as they are:
val df2 = df.withColumnRenamed(df.columns.last,"Invalid_Status")
To drop the last column without knowing its name, use:
val df3 = df.drop(df.columns.last)
And then add the "Invalid_Status" column with default values:
val requiredDf = df3.withColumn("Invalid_Status", lit("Any_Default_Value"))

How to efficiently join 2 DataFrames based on Timestamp difference?

I have two DataFrames with two different sets of time-series data. For simplicity let's call them Events and Statuses.
events:
root
|-- timestamp: timestamp (nullable = true)
|-- event_type: string (nullable = true)
|-- event_id: string (nullable = true)
statuses:
root
|-- timestamp: timestamp (nullable = true)
|-- status: string (nullable = true)
|-- field1: string (nullable = true)
|-- field2: string (nullable = true)
|-- field3: string (nullable = true)
I'd like to join them so that every Event will have a list_statuses column that contains all the status Rows from the previous X hours relative to its own timestamp.
I can do it with a cartesian product of events and statuses and then filter on the time criteria, but that is (extremely) inefficient.
Is there any better way to do it? Anything off-the-shelf?
(I thought of grouping both dataframes on a time window, then self-joining the second one so it contains both the current and the previous time windows, then joining between them and filtering, but if there is anything ready and bug-free, I'd happily use it...)
Thanks!
Almost 2 months later, but I thought it might help others if I post something I got to:
http://zachmoshe.com/2016/09/26/efficient-range-joins-with-spark.html
It is basically a more efficient implementation of a range join between two Datasets based on a Timestamp or a Numeric field (Scala, with Spark 2.0).
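For reference, here is a rough sketch of the bucketing idea described in the question (not the implementation from the linked post): bucket both sides into X-hour windows, duplicate each status into its own bucket and the next one, join on the bucket key, and only then apply the exact time filter. The column names come from the schemas above; X = 3 hours is just an example.
import org.apache.spark.sql.functions._

val xHours = 3                       // the "previous X hours" window
val bucketSeconds = xHours * 3600L

// assign every event to an X-hour bucket
val eventsB = events
  .withColumn("bucket", floor(unix_timestamp(col("timestamp")) / bucketSeconds))

// duplicate each status into its own bucket and the following one, so an event
// always finds the statuses of the previous X hours inside its own bucket
val statusesB = statuses
  .withColumn("bucket", floor(unix_timestamp(col("timestamp")) / bucketSeconds))
  .withColumn("bucket", explode(array(col("bucket"), col("bucket") + 1)))

val joined = eventsB.alias("e")
  .join(statusesB.alias("s"), Seq("bucket"))
  .where(col("s.timestamp") <= col("e.timestamp") &&
         col("s.timestamp") >= col("e.timestamp") - expr(s"INTERVAL $xHours HOURS"))
  .groupBy(col("e.timestamp"), col("e.event_type"), col("e.event_id"))
  .agg(collect_list(struct(col("s.timestamp"), col("s.status"),
       col("s.field1"), col("s.field2"), col("s.field3"))).as("list_statuses"))
Events with no statuses in the window are dropped by the inner join here; a left join plus handling of empty lists would keep them.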