Spark DF schema verification with Decimal column - pyspark

I want to verify the schema of a Spark dataframe against schema information I get from some other source (a dashboard tool). The information I get about the table is the field name and field type (nullability is not important at this point).
However, for DecimalType columns I do not get the precision and scale (the two parameters of DecimalType), so I have to ignore those values in the comparison.
I currently rewrite the schema so that the Decimal columns become Float columns. But is there a more elegant way to do this?
Basically, I want to write a function is_schema_valid() that works like this:
from pyspark.sql import types as T
df_schema = T.StructType([
    T.StructField('column_1', T.StringType(), True),
    T.StructField('column_2', T.DecimalType(20, 5), True),  # values in DecimalType are random
])
schema_info = [('column_1', 'String'), ('column_2', 'Decimal')]
is_schema_valid(schema_info, df_schema)
# Output: True

The best would probably be to compare similar objects. You can transform the schema into a JSON object (or a Python dict).
import json
_df_schema_dict = json.loads(df_schema.json())
df_schema_dict = {
    field["name"]: field["type"]
    for field in _df_schema_dict["fields"]
}
df_schema_dict
> {'column_1': 'string', 'column_2': 'decimal(20,5)'}
You can work with this object to compare against schema_info. Here is a very basic test you can do (I changed the content of schema_info a bit):
import json
def is_schema_valid(schema_info, df_schema):
    df_schema_dict = {
        field["name"]: field["type"] for field in json.loads(df_schema.json())["fields"]
    }
    schema_info_dict = {elt[0]: elt[1] for elt in schema_info}
    return schema_info_dict == df_schema_dict

df_schema = T.StructType(
    [
        T.StructField("column_1", T.StringType(), True),
        T.StructField("column_2", T.DecimalType(20, 5), True),
    ]
)
schema_info = [("column_1", "string"), ("column_2", "decimal(20,5)")]
is_schema_valid(schema_info, df_schema)
# True
If you want to ignore decimal precision, you can always twist the dataframe schema a little bit. For example, replace field["type"] with field["type"] if "decimal" not in field["type"] else "decimal".
import json
def is_schema_valid(schema_info, df_schema):
    df_schema_dict = {
        field["name"]: field["type"] if "decimal" not in field["type"] else "decimal"
        for field in json.loads(df_schema.json())["fields"]
    }
    schema_info_dict = {elt[0]: elt[1] for elt in schema_info}
    return schema_info_dict == df_schema_dict

df_schema = T.StructType(
    [
        T.StructField("column_1", T.StringType(), True),
        T.StructField("column_2", T.DecimalType(20, 5), True),
    ]
)
schema_info = [("column_1", "string"), ("column_2", "decimal")]
is_schema_valid(schema_info, df_schema)
# True
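An alternative that avoids the JSON round-trip is to read the type names straight off the StructType fields, since dataType.typeName() already drops the precision and scale. This is only a minimal sketch, assuming the comparison needs nothing beyond the base type names:
from pyspark.sql import types as T

def is_schema_valid(schema_info, df_schema):
    # typeName() returns e.g. 'string', and 'decimal' for DecimalType(20, 5)
    df_types = {field.name: field.dataType.typeName() for field in df_schema.fields}
    return df_types == dict(schema_info)

df_schema = T.StructType(
    [
        T.StructField("column_1", T.StringType(), True),
        T.StructField("column_2", T.DecimalType(20, 5), True),
    ]
)
schema_info = [("column_1", "string"), ("column_2", "decimal")]
is_schema_valid(schema_info, df_schema)
# True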


Update map value in nested struct in pyspark

I have a table with dates and comments.
dob | comment
---------------------------
1960-12-01 | this is useful
And I want a new column with this type:
value_type = T.StructType(
    [
        T.StructField("extra", T.MapType(T.StringType(), T.StringType(), True), True),
        T.StructField("date", T.StringType(), True),
        T.StructField("from_date", T.StringType(), True),
        T.StructField("to_date", T.StringType(), True),
        T.StructField("value", T.StringType(), True),
    ]
)
I need to:
put the df.dob into the date field of the struct and
put the df.comment into the extra map of the struct
Thanks to blackbishop I figured out how to do the first part here, and I tried to use .withField() to update the map, but it throws an error.
I tried:
(df
 .withColumn("new_col",
             F.struct(*[F.lit(None).cast(f.dataType).alias(f.name)
                        for f in value_type.fields]))
 .withColumn("new_col", (F.col("new_col")
                         .withField("date", F.col("dob"))
                         .withField("extra.value", F.col("comment")))))
But I get the following error:
AnalysisException: cannot resolve 'update_fields(update_fields(new_col, WithField(dob), WithField(dob)).extra, WithField(dob))' due to data type mismatch: struct argument should be struct type, got: map<string,string>;
I am confused as to why it would not work with the map inside the struct.
Thanks :)
I figured it out!
(df
 .withColumn("new_col",
             F.struct(*[F.lit(None).cast(f.dataType).alias(f.name)
                        for f in value_type.fields]))
 .withColumn("new_col", (F.col("new_col")
                         .withField("date", F.col("dob"))
                         .withField("extra",
                                    F.create_map(F.lit("my_key"), F.col("comment"))))))
The problem was that I was not actually passing a map to a map type!
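For reference, the same struct can also be built in a single F.struct call, skipping the intermediate null-cast column and the withField updates. This is just a sketch assuming the column names above; "my_key" is a placeholder key:
from pyspark.sql import functions as F

new_df = df.withColumn(
    "new_col",
    F.struct(
        # the alias of each column becomes the struct field name
        F.create_map(F.lit("my_key"), F.col("comment")).alias("extra"),
        F.col("dob").cast("string").alias("date"),
        F.lit(None).cast("string").alias("from_date"),
        F.lit(None).cast("string").alias("to_date"),
        F.lit(None).cast("string").alias("value"),
    ),
)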

How do I efficiently map keys from one dataset based on values from another dataset

Assuming data frame 1 represents a target country and its list of source countries, and data frame 2 represents the availability of each country, find all the pairs from data frame 1 where the target country's availability is TRUE and the source country's availability is FALSE:
Dataframe 1 (targetId, sourceId):
USA: China, Russia, India, Japan
China: USA, Russia, India
Russia: USA, Japan
Dataframe 2 (id, available):
USA: true
China: false
Russia: true
India: false
Japan: true
Result Dataset should look like:
(USA, China),
(USA, India)
My idea is to first explode data set 1 into a new data frame (say, tempDF), add 2 new columns to it, targetAvailable and sourceAvailable, and finally filter for targetAvailable = true and sourceAvailable = false to get the desired result data frame.
Below is the snippet of my code:
val sourceDF = sourceData.toDF("targetId", "sourceId")
val mappingDF = mappingData.toDF("id", "available")

val tempDF = sourceDF.select(col("targetId"),
  explode(col("sourceId")).as("source_id_split"))

val resultDF = tempDF.select("targetId")
  .withColumn("targetAvailable", isAvailable(tempDF.col("targetId")))
  .withColumn("sourceAvailable", isAvailable(tempDF.col("source_id_split")))

/*resultDF.select("targetId", "sourceId")
  .filter(col("targetAvailable") === "true" and col("sourceAvailable") === "false")
  .show()*/

// udf to find the availability value for the given id from the mapping table
val isAvailable = udf((searchId: String) => {
  val rows = mappingDF.select("available")
    .filter(col("id") === searchId).collect()
  if (rows(0)(0).toString.equals("true")) "true" else "false"
})
Calling the isAvailable UDF while calculating resultDF throws a weird exception. Am I doing something wrong? Is there a better / simpler way to do this?
In your UDF, you are referencing another dataframe, which is not possible, hence the "weird" exception you obtain.
You want to filter one dataframe based on values contained in another. What you need to do is a join on the id columns; two joins actually in your case, one for the targets and one for the sources.
The idea of using explode, however, is very good. Here is a way to achieve what you want:
// generating data, please provide this code next time ;-)
val sourceDF = Seq("USA" -> Seq("China", "Russia", "India", "Japan"),
                   "China" -> Seq("USA", "Russia", "India"),
                   "Russia" -> Seq("USA", "Japan"))
  .toDF("targetId", "sourceId")

val mappingDF = Seq("USA" -> true, "China" -> false,
                    "Russia" -> true, "India" -> false,
                    "Japan" -> true)
  .toDF("id", "available")

sourceDF
  // we can filter available targets before exploding.
  // let's do it to be more efficient.
  .join(mappingDF.withColumnRenamed("id", "targetId"), Seq("targetId"))
  .where('available)
  // exploding the sources
  .select('targetId, explode('sourceId) as "sourceId")
  // then we keep only non available sources
  .join(mappingDF.withColumnRenamed("id", "sourceId"), Seq("sourceId"))
  .where(! 'available)
  .select("targetId", "sourceId")
  .show(false)
which yields
+--------+--------+
|targetId|sourceId|
+--------+--------+
|USA |China |
|USA |India |
+--------+--------+

Scala how to use reduceByKey when I have two keys

Data format of one row:
id: 123456
Topiclist: ABCDE:1_8;5_10#BCDEF:1_3;7_11
One id can have many rows:
id: 123456
Topiclist:ABCDE:1_1;7_2;#BCDEF:1_2;7_11#
Target: (123456, (ABCDE,9,2),(BCDEF,5,2))
Records in topic list are split by #, so ABCDE:1_8;5_10 is one record.
A record is in the format <topicid>:<topictype>_<topicvalue>
E.g. ABCDE:1_8 has
topicid = ABCDE
topictype = 1
topicvalue = 8
Target: sum the total value of topic type 1, and count the frequency of topic type 1,
so the result should be (id, (topicid, value, frequency)), e.g. (123456, (ABCDE,9,2), (BCDEF,5,2))
Assume that your data is "123456!ABCDE:1_8;5_10#BCDEF:1_3;7_11" and "123456!ABCDE:1_1;7_2#BCDEF:1_2;7_11", so we use "!" to get the userID "123456".
rdd.map { f =>
  val userID = f.split("!")(0)
  val items = f.split("!")(1).split("#")
  var result = List[Array[String]]()
  for (item <- items) {
    val topicID = item.split(":")(0)
    for (topicTypeValue <- item.split(":")(1).split(";")) {
      println(topicTypeValue)
      if (topicTypeValue.split("_")(0) == "1") {
        result = result :+ Array(topicID, topicTypeValue.split("_")(1), "1")
      }
    }
  }
  (userID, result)
}
  .flatMapValues(x => x).filter(f => f._2.length == 3)
  .map { f => ((f._1, f._2(0)), (f._2(1).toInt, f._2(2).toInt)) }
  .reduceByKey { case (x, y) => (x._1 + y._1, x._2 + y._2) }
  .map(f => (f._1._1, (f._1._2, f._2._1, f._2._2))) // (userID, (TopicID, valueSum, frequency))
The output is ("123456",("ABCDE",9,2)) and ("123456",("BCDEF",5,2)), which is a little different from your target output; you can group this result if you really need ("123456",("ABCDE",9,2),("BCDEF",5,2)).

pg-promise update where in custom array

How can the following postgresql query be written using the npm pg-promise package?
update schedule
set student_id = 'a1ef71bc6d02124977d4'
where teacher_id = '6b33092f503a3ddcc34' and (start_day_of_week, start_time) in (VALUES ('M', (cast('17:00:00' as time))), ('T', (cast('19:00:00' as time))));
I didn't see anything in the formatter namespace that can help accomplish this. https://vitaly-t.github.io/pg-promise/formatting.html
I cannot inject the 'cast' piece into the '17:00:00' value without it being considered part of the time string itself.
The first piece of the query is easy. It's the part after VALUES that I can't figure out.
First piece:
var query = `update schedule
set student_id = $1
where teacher_id = $2 and (start_day_of_week, start_time) in (VALUES $3)`;
var inserts = [studentId, teacherId, values];
I'm using this messiness right now for $3 (not working yet), but it completely bypasses all escaping/security built into pg-promise:
const buildPreparedParams = function(arr, colNames) {
    let newArr = [];
    let rowNumber = 0;
    arr.forEach((row) => {
        const rowVal = (rowNumber > 0 ? ', ' : '') +
            `('${row.startDayOfWeek}', (cast('${row.startTime}' as time)))`;
        newArr.push(rowVal);
        rowNumber++; // prefix every row after the first with a comma
    });
    return newArr;
};
The structure I am trying to convert into this SQL query is:
[{
    "startTime": "17:00:00",
    "startDayOfWeek": "U"
},
{
    "startTime": "16:00:00",
    "startDayOfWeek": "T"
}]
Use CSV Filter for the last part: IN (VALUES $3:csv).
And to make each item in the array format itself correctly, apply Custom Type Formatting:
const data = [{
    startTime: '17:00:00',
    startDayOfWeek: 'U'
},
{
    startTime: '16:00:00',
    startDayOfWeek: 'T'
}];

const values = data.map(d => ({
    toPostgres: () => pgp.as.format('(${startDayOfWeek},(cast(${startTime} as time)))', d),
    rawType: true
}));
Now passing in values for $3:csv will format your values correctly:
('U',(cast('17:00:00' as time))),('T',(cast('16:00:00' as time)))

Can I optimize this: Programmatically prepare two DataFrames for a union

This is under the understanding that withColumn can only take one column at a time (so if I'm wrong there, I'm going to be embarrassed), but I'm worried about the memory performance of this because the DFs are likely to be very large in production. Essentially, the idea is to union the two column arrays (Array[String]), distinct the result, and foldLeft over that set, updating the accumulated DFs as I go. I'm looking for a programmatic way to match the columns on the two DFs so I can perform a union afterwards.
val (newLowerCaseDF, newMasterDF): (DataFrame, DataFrame) = lowerCaseDFColumns.union(masterDFColumns).distinct
  .foldLeft[(DataFrame, DataFrame)]((lowerCaseDF, masterDF))((acc: (DataFrame, DataFrame), value: String) =>
    if (!lowerCaseDFColumns.contains(value)) {
      (acc._1.withColumn(value, lit(None)), acc._2)
    }
    else if (!masterDFColumns.contains(value)) {
      (acc._1, acc._2.withColumn(value, lit(None)))
    }
    else {
      acc
    }
  )
Found out that it's possible to select hardcoded null columns, so my new solution is:
val masterExprs = lowerCaseDFColumns.union(lowerCaseMasterDFColumns).distinct.map(field =>
  // if the field already exists in the master schema, we add the name to our select statement
  if (lowerCaseMasterDFColumns.contains(field)) {
    col(field.toLowerCase)
  }
  // else, we hardcode a null column in for that name
  else {
    lit(null).alias(field.toLowerCase)
  }
)

val inputExprs = lowerCaseDFColumns.union(lowerCaseMasterDFColumns).distinct.map(field =>
  // if the field already exists in the input schema, we add the name to our select statement
  if (lowerCaseDFColumns.contains(field)) {
    col(field.toLowerCase)
  }
  // else, we hardcode a null column in for that name
  else {
    lit(null).alias(field.toLowerCase)
  }
)
And then you're able to do a union like so:
masterDF.select(masterExprs: _*).union(lowerCaseDF.select(inputExprs: _*))