Creating a schema data health expectation in Palantir Foundry Code Repositories - pyspark

I have a dataset that is the output of a Python transform defined in a Palantir Foundry Code Repository. It has certain columns, but given that the data may change over time, I want to validate that these columns (around 73) still hold in the future.
How can I create a data health expectation or check to ensure that all 73 columns are still present in future builds?

You can use expectations to make assertions about which columns exist in your output schema.
See the official docs for schema expectations.
There are 3 kinds of schema expectations:
# Assert some columns exist.
E.schema().contains({'col1': type1, 'col2': type2})
# Assert the schema contains only columns from the given set (but not necessarily all of them).
E.schema().is_subset_of({'col1': type1, 'col2': type2})
# Assert the schema contains exactly the given columns.
E.schema().equals({'col1': type1, 'col2': type2})
Additionally, for checking a single column, you can use E.col('col1').exists(). But for 73 columns you're better off going with E.schema().
So for a more fleshed-out example, you might have something like:
from transforms.api import transform_df, Check, Input, Output
import transforms.expectations as E
from pyspark.sql import types as T
COLUMNS_WHICH_MUST_EXIST = {
    'string_column': T.StringType(),
    'number_column': T.IntegerType(),
    # ...and 71 more.
}

@transform_df(
    Output("ri.foundry.main.dataset.abcdef", checks=[
        Check(E.schema().contains(COLUMNS_WHICH_MUST_EXIST), "contains important columns"),
    ]),
    input_data=Input("ri.foundry.main.dataset.12345678"),
)
def compute(input_data):
    # ... your logic here, e.g. simply:
    return input_data
Also see the official docs for expectation checks for more details of the options available.
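If typing out all 73 entries by hand is a chore, a rough helper sketch like the following can print the dictionary entries for you. It assumes you run it once against the current dataframe (named df here for illustration), and that the columns use simple, non-parameterized types:
from pyspark.sql import types as T  # T is referenced in the printed entries

# Print one dict entry per column of the current dataframe `df`;
# paste the output into COLUMNS_WHICH_MUST_EXIST above.
for field in df.schema.fields:
    print("'{}': T.{}(),".format(field.name, type(field.dataType).__name__))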

Related

Selecting identically named columns in jOOQ

I'm currently using jOOQ to build my SQL (with code generation via the mvn plugin).
Executing the created query is not done by jOOQ though (I'm using Vert.x SqlClient for that).
Let's say I want to select all columns of two tables which share some identical column names, e.g. UserAccount(id, name, ...) and Product(id, name, ...). When executing the following code
val userTable = USER_ACCOUNT.`as`("u")
val productTable = PRODUCT.`as`("p")
create().select().from(userTable).join(productTable).on(userTable.ID.eq(productTable.AUTHOR_ID))
calling query.getSQL(ParamType.NAMED) on the built query returns a query like
SELECT "u"."id", "u"."name", ..., "p"."id", "p"."name", ... FROM ...
The problem here is that the result set will contain the columns id and name twice, without the prefix "u." or "p.", so I can't map/parse it correctly.
Is there a way to tell jOOQ to alias these columns like the following, without any further manual effort?
SELECT "u"."id" AS "u.id", "u"."name" AS "u.name", ..., "p"."id" AS "p.id", "p"."name" AS "p.name" ...
I'm using the holy Postgres database :)
EDIT: My current approach is something like
val productFields = productTable.fields().map { it.`as`(name("p.${it.name}")) }
val userFields = userTable.fields().map { it.`as`(name("u.${it.name}")) }
create().select(productFields,userFields,...)...
This feels really hacky though
How to correctly dereference tables from records
You should always use the column references that you passed to the query to dereference values from records in your result. If you didn't pass column references explicitly, then the ones from your generated table via Table.fields() are used.
In your code, that would correspond to:
userTable.NAME
productTable.NAME
So, in a resulting record, do this:
val rec = ...
rec[userTable.NAME]
rec[productTable.NAME]
Using Record.into(Table)
Since you seem to be projecting all the columns (do you really need all of them?) into the generated POJO classes, you can still do this intermediate step if you want:
val rec = ...
val userAccount: UserAccount = rec.into(userTable).into(UserAccount::class.java)
val product: Product = rec.into(productTable).into(Product::class.java)
Because the generated table has all the necessary metadata, it can decide which columns belong to it and which ones don't. The POJO doesn't have this meta information, which is why it can't disambiguate the duplicate column names.
Using nested records
You can always use nested records directly in SQL as well in order to produce one of these 2 types:
Record2<Record[N], Record[N]> (e.g. using DSL.row(table.fields()))
Record2<UserAccountRecord, ProductRecord> (e.g using DSL.row(table.fields()).mapping(...), or starting from jOOQ 3.17 directly using a Table<R> as a SelectField<R>)
The second jOOQ 3.17 solution would look like this:
// Using an implicit join here, for convenience
create().select(productTable.userAccount(), productTable)
.from(productTable)
.fetch();
The above is using implicit joins, for additional convenience
Auto aliasing all columns
There are a ton of flavours that users might like to have when "auto-aliasing" columns in SQL. Any solution offered by jOOQ would be no better than the one you've already found, so if you still want to auto-alias all columns, just do what you did.
But usually, the desire to auto-alias is a derived feature request stemming from a misunderstanding of the best approach to do something in jOOQ (see the above options), so ideally you don't go down the auto-aliasing road.

Pyspark: dynamically generate condition for when() clause with variable number of columns

I have a function to which I am passing a dataframe and a "listofcolumns" which should not contain NULL values. If any of the columns from "listofcolumns" has a null value, I need to take an action.
Now, I have to use the when clause here, but the columns passed to it will vary based on the dataframe and the "listofcolumns" passed in. So I want to generate the when clause dynamically from the columns passed. The when clause could be checking for a NULL value in just one column or in multiple columns of the dataframe, so I cannot hard-code one condition or several.
I have tried generating the whenClause string dynamically and passing it as a variable, but I get the error "TypeError: condition should be a Column".
Can someone please advise how I can achieve this?
This can be achieved by resolving the selection logic on your columns ahead of time and then combining the per-column conditions with functools.reduce and operator, such as:
import functools
import operator
import pyspark.sql.functions as f
# conditional selection of columns - your logic on selecting
# which columns to check for null goes here
my_cols = [col for col in df.columns if "condition" in col]
# now I want to create my condition on these columns
# since it can be any of them, I use operator.or_
# but your logic may vary here - apply to my_cols created above
cond_expr = functools.reduce(operator.or_, [f.col(c).isNull() for c in my_cols])
# now you apply your action
df.withColumn(
    "output_column",
    f.when(cond_expr, TRUE_ACTION).otherwise(FALSE_ACTION)
)
Where TRUE_ACTION is applied when the condition (any of the columns being null) is satisfied. If instead you want to require that all of the columns in your condition are null, replace operator.or_ with operator.and_ and build your logic from there, as in the sketch below. Hope this helps!
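As a minimal sketch of that all-null variant, assuming a DataFrame df and a hypothetical column list cols_to_check (standing in for your "listofcolumns" argument), with a simple flag column in place of the unspecified actions:
import functools
import operator
import pyspark.sql.functions as f

# hypothetical list of columns to validate; in practice this is the
# "listofcolumns" argument passed to your function
cols_to_check = ["col_a", "col_b", "col_c"]

# fold the per-column null checks with AND: true only when every listed column is null
all_null_cond = functools.reduce(operator.and_, [f.col(c).isNull() for c in cols_to_check])

# flag the offending rows so a downstream action can pick them up
df = df.withColumn(
    "all_null_flag",
    f.when(all_null_cond, f.lit(1)).otherwise(f.lit(0))
)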

Spark : Dynamic generation of the query based on the fields in s3 file

Oversimplified Scenario:
A process generates monthly data in an s3 file. The number of fields can be different in each monthly run. Based on this data in s3, we load the data into a table and manually run a SQL query for a few metrics (manually, because the number of fields can change in each run with the addition or deletion of a few columns). There are more calculations/transforms on this data, but as a starter I'm presenting a simpler version of the use case.
Approach:
Considering the schema-less nature of the data (the number of fields in the s3 file can differ in each run with the addition/deletion of a few fields, which requires manual changes to the SQL every time), I'm planning to explore Spark/Scala so that we can read directly from s3 and dynamically generate the SQL based on the fields.
Query:
How can I achieve this in Scala/Spark SQL/the DataFrame API? The s3 file contains only the required fields from each run, so there is no issue reading the dynamic fields from s3, as that is taken care of by the dataframe. The issue is how to generate the DataFrame-API/Spark-SQL code to handle them.
I can read the s3 file via a dataframe and register the dataframe with createOrReplaceTempView to write SQL, but I don't think that avoids manually changing the Spark SQL when a new field is added to s3 in the next run. What is the best way to dynamically generate the SQL, or are there better ways to handle the issue?
Usecase-1:
First run
dataframe: customer, month_1_count (here the dataframe directly points to s3, which has only the required attributes)
--Sample SQL
SELECT customer,sum(month_1_count)
FROM dataframe
GROUP BY customer
--Dataframe API/SparkSQL
dataframe.groupBy("customer").sum("month_1_count").show()
Second run - one additional column was added
dataframe: customer, month_1_count, month_2_count (here the dataframe directly points to s3, which has only the required attributes)
--Sample SQL
SELECT customer,sum(month_1_count),sum(month_2_count)
FROM dataframe
GROUP BY customer
--Dataframe API/SparkSQL
dataframe.groupBy("customer").sum("month_1_count","month_2_count").show()
I'm new to Spark/Scala; it would be helpful if you could provide some direction so that I can explore further.
It sounds like you want to perform the same operation over and over again on new columns as they appear in the dataframe schema. This works:
from pyspark.sql import functions
#search for column names you want to sum, I put in "month"
column_search = lambda col_names: 'month' in col_names
#get column names of temp dataframe w/ only the columns you want to sum
relevant_columns = original_df.select(*filter(column_search, original_df.columns)).columns
#create dictionary with relevant column names to be passed to the agg function
columns = {col_names: "sum" for col_names in relevant_columns}
#apply agg function with your groupBy, passing in columns dictionary
grouped_df = original_df.groupBy("customer").agg(columns)
#show result
grouped_df.show()
Some important concepts that can help you learn:
DataFrames store their column names in a list: dataframe.columns
Functions can be applied to lists to create new lists, as in "column_search"
The agg function accepts a dictionary of column-to-aggregation mappings, as explained here, which is what I pass in as "columns"
Spark is lazy, so it doesn't change data state or perform operations until you perform an action like show(). This means that writing out a temporary dataframe just to use one element of it, such as the column list, as I do here is not costly, even though it may seem inefficient if you're used to SQL.
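If you prefer the summed columns to keep their original names rather than the auto-generated "sum(month_1_count)" labels, a minimal variant of the same idea (assuming the same original_df as above) is to build explicit aggregate expressions:
import pyspark.sql.functions as f

# pick the columns to sum, as before
month_cols = [c for c in original_df.columns if "month" in c]

# build one aliased sum expression per column so the output keeps readable names
agg_exprs = [f.sum(c).alias(c) for c in month_cols]

grouped_df = original_df.groupBy("customer").agg(*agg_exprs)
grouped_df.show()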

pyspark+psycopg2 is slow in writing the results into the database

I have a Spark job which processes the data pretty fast, but when it tries to write the result into the postgresql database, it is quite slow. Here is most of the relevant code:
import psycopg2
import psycopg2.extras

def save_df_to_db(records):
    # each item in records is a dictionary with 'url', 'tag', 'value' as keys
    db_conn = psycopg2.connect(connect_string)
    db_conn.autocommit = True
    cur = db_conn.cursor(cursor_factory=psycopg2.extras.DictCursor)
    upsert_query = """INSERT INTO mytable (url, tag, value)
        VALUES (%(url)s, %(tag)s, %(value)s) ON CONFLICT (url, tag) DO UPDATE SET value = %(value)s"""
    try:
        cur.executemany(upsert_query, records)
    except Exception as e:
        print "Error in executing save_df_to_db: ", e.message
data = [...] # initial data
rdd = sc.parallelize(data)
rdd = ... # Some simple RDD transforms...
rdd.foreachPartition(save_df_to_db)
The table also has a constraint about url+tag being unique. I am looking for solutions to improve the speed of this code. Any suggestion or recommendation is welcome.
Thanks for the responses. Since the version of psycopg2 I am using does not support batch execution, I had to rely on a slightly different approach using the copy command. I wrote a little function which helped reduce the save time from 20 minutes to about 30 seconds. It takes a pandas dataframe as input and writes it to a table through the given cursor:
import StringIO
import pandas as pd
def write_dataframe_to_table(cursor, table, dataframe, batch_size=100, null='None'):
    """
    Write a pandas dataframe into a postgres table.
    It only works if the table columns have the same name as the dataframe columns.
    :param cursor: the psycopg2 cursor object
    :param table: the table name
    :param dataframe: the dataframe
    :param batch_size: batch size
    :param null: textual representation of NULL in the file. The default is the string None.
    """
    for i in range(0, len(dataframe), batch_size):
        chunk_df = dataframe[i: batch_size + i]
        content = "\n".join(chunk_df.apply(lambda x: "\t".join(map(str, x)), axis=1))
        cursor.copy_from(StringIO.StringIO(content), table, columns=list(chunk_df.columns), null=null)
I believe the main bottleneck is the combination of cursor.executemany and connection.autocommit. As explained in the official documentation of executemany:
In its current implementation this method is not faster than executing execute() in a loop.
Since you combine it with connection.autocommit, you effectively commit after each insert.
Psycopg provides fast execution helpers:
psycopg2.extras.execute_values
psycopg2.extras.execute_batch
which can be used to perform batched operations. It would also make more sense to handle commits manually.
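As a minimal sketch of that approach (assuming psycopg2 >= 2.7 for execute_batch, and the same connect_string, table, and record layout as in the question; the function name save_partition_to_db is hypothetical):
import psycopg2
import psycopg2.extras

def save_partition_to_db(records):
    # records is an iterator of dicts with 'url', 'tag', 'value' keys,
    # as produced by the rdd in the question
    db_conn = psycopg2.connect(connect_string)
    try:
        with db_conn:  # one commit per partition instead of one per row
            with db_conn.cursor() as cur:
                psycopg2.extras.execute_batch(
                    cur,
                    """INSERT INTO mytable (url, tag, value)
                       VALUES (%(url)s, %(tag)s, %(value)s)
                       ON CONFLICT (url, tag) DO UPDATE SET value = %(value)s""",
                    list(records),
                    page_size=1000,
                )
    finally:
        db_conn.close()

rdd.foreachPartition(save_partition_to_db)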
It is also possible that you are additionally throttling the database server with a large number of concurrent writes and index updates. Normally I would recommend writing to disk and performing a batch import with COPY, but it is not guaranteed to help here.
Since you use mutable records without timestamps, you cannot just drop the index and recreate it after the import as another way to boost performance.

Is there a way to add extra metadata for Spark dataframes?

Is it possible to add extra metadata to DataFrames?
Reason
I have Spark DataFrames for which I need to keep extra information. Example: A DataFrame, for which I want to "remember" the highest used index in an Integer id column.
Current solution
I use a separate DataFrame to store this information. Of course, keeping this information separately is tedious and error-prone.
Is there a better solution to store such extra information on DataFrames?
To expand and Scala-fy nealmcb's answer (the question was tagged scala, not python, so I don't think this answer will be off-topic or redundant), suppose you have a DataFrame:
import org.apache.spark.sql
val df = sc.parallelize(Seq.fill(100) { scala.util.Random.nextInt() }).toDF("randInt")
And some way to get the max or whatever you want to memoize on the DataFrame:
val randIntMax = df.rdd.map { case sql.Row(randInt: Int) => randInt }.reduce(math.max)
sql.types.Metadata can only hold strings, booleans, some types of numbers, and other metadata structures. So we have to use a Long:
val metadata = new sql.types.MetadataBuilder().putLong("columnMax", randIntMax).build()
DataFrame.withColumn() actually has an overload that permits supplying a metadata argument at the end, but it's inexplicably marked [private], so we just do what it does — use Column.as(alias, metadata):
val newColumn = df.col("randInt").as("randInt_withMax", metadata)
val dfWithMax = df.withColumn("randInt_withMax", newColumn)
dfWithMax now has (a column with) the metadata you want!
dfWithMax.schema.foreach(field => println(s"${field.name}: metadata=${field.metadata}"))
> randInt: metadata={}
> randInt_withMax: metadata={"columnMax":2094414111}
Or programmatically and type-safely (sort of; Metadata.getLong() and others do not return Option and may throw a "key not found" exception):
dfWithMax.schema("randInt_withMax").metadata.getLong("columnMax")
> res29: Long = 209341992
Attaching the max to a column makes sense in your case, but in the general case of attaching metadata to a DataFrame and not a column in particular, it appears you'd have to take the wrapper route described by the other answers.
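For readers working in PySpark rather than Scala, a roughly equivalent sketch (assuming Spark 2.2+, where Column.alias accepts a metadata keyword argument, and a DataFrame df with an integer column "randInt" as above):
from pyspark.sql import functions as F

# compute the value to memoize
rand_int_max = df.agg(F.max("randInt")).first()[0]

# attach it as column-level metadata via Column.alias(..., metadata=...)
df_with_max = df.withColumn(
    "randInt_withMax",
    F.col("randInt").alias("randInt_withMax", metadata={"columnMax": rand_int_max}),
)

# read it back from the schema
print(df_with_max.schema["randInt_withMax"].metadata["columnMax"])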
As of Spark 1.2, StructType schemas have a metadata attribute on each StructField, which can hold an arbitrary mapping/dictionary of information for each column in a DataFrame. E.g. (when used with the separate spark-csv library):
customSchema = StructType([
    StructField("cat_id", IntegerType(), True,
                {'description': "Unique id, primary key"}),
    StructField("cat_title", StringType(), True,
                {'description': "Name of the category, with underscores"})])

categoryDumpDF = (sqlContext.read.format('com.databricks.spark.csv')
                  .options(header='false')
                  .load(csvFilename, schema=customSchema))
f = categoryDumpDF.schema.fields
["%s (%s): %s" % (t.name, t.dataType, t.metadata) for t in f]
["cat_id (IntegerType): {u'description': u'Unique id, primary key'}",
"cat_title (StringType): {u'description': u'Name of the category, with underscores.'}"]
This was added in [SPARK-3569] Add metadata field to StructField - ASF JIRA, and designed for use in Machine Learning pipelines to track information about the features stored in columns, such as whether they are categorical or continuous, the number of categories, and the category-to-index map. See the SPARK-3569: Add metadata field to StructField design document.
I'd like to see this used more widely, e.g. for descriptions and documentation of columns, the unit of measurement used in the column, coordinate axis information, etc.
Issues include how to appropriately preserve or manipulate the metadata information when the column is transformed, how to handle multiple sorts of metadata, how to make it all extensible, etc.
For the benefit of those thinking of expanding this functionality in Spark dataframes, I reference some analogous discussions around Pandas.
For example, see xray - bring the labeled data power of pandas to the physical sciences which supports metadata for labeled arrays.
And see the discussion of metadata for Pandas at Allow custom metadata to be attached to panel/df/series? · Issue #2485 · pydata/pandas.
See also discussion related to units: ENH: unit of measurement / physical quantities · Issue #10349 · pydata/pandas
If you want to have less tedious work, I think you can add an implicit conversion between DataFrame and your custom wrapper (haven't tested it yet though).
implicit class WrappedDataFrame(val df: DataFrame) {
  var metadata = scala.collection.mutable.Map[String, Long]()

  def addToMetaData(key: String, value: Long) {
    metadata += key -> value
  }

  ...[other methods you consider useful, getters, setters, whatever]...
}
If the implicit wrapper is in the DataFrame's scope, you can just use a normal DataFrame as if it were your wrapper, i.e.:
df.addToMetaData("size", 100)
This way also makes your metadata mutable, so you are not forced to compute it only once and carry it around.
I would store a wrapper around your dataframe. For example:
case class MyDFWrapper(dataFrame: DataFrame, metadata: Map[String, Long])
val maxIndex = df1.agg("index" -> "max").head.getLong(0)
MyDFWrapper(df1, Map("maxIndex" -> maxIndex))
A lot of people saw the word "metadata" and went straight to "column metadata". That does not seem to be what you wanted, and it was not what I wanted when I had a similar problem. Ultimately, the problem is that a DataFrame is an immutable data structure: whenever an operation is performed on it, the data passes on, but the rest of the DataFrame does not. This means that you can't simply put a wrapper on it, because as soon as you perform an operation you've got a whole new DataFrame (potentially of a completely new type, especially with Scala/Spark's tendency toward implicit conversions). Finally, if the DataFrame ever escapes its wrapper, there's no way to reconstruct the metadata from the DataFrame.
I had this problem in Spark Streaming, which focuses on RDDs (the underlying data structure of the DataFrame as well), and came to one simple conclusion: the only place to store the metadata is in the name of the RDD. An RDD name is never used by the core Spark system except for reporting, so it's safe to repurpose it. Then you can create your wrapper based on the RDD name, with an explicit conversion between any DataFrame and your wrapper, complete with metadata.
Unfortunately, this does still leave you with the problem of immutability and new RDDs being created with every operation. The RDD name (our metadata field) is lost with each new RDD. That means you need a way to re-add the name to your new RDD. This can be solved by providing a method that takes a function as an argument. It can extract the metadata before the function, call the function and get the new RDD/DataFrame, then name it with the metadata:
// inside your MetaDataFrame wrapper, where `df` is the wrapped DataFrame
def withMetadata(fn: DataFrame => DataFrame): MetaDataFrame = {
  val meta = df.rdd.name
  val result = fn(df)
  result.rdd.setName(meta)
  MetaDataFrame(result)
}
Your wrapping class (MetaDataFrame) can provide convenience methods for parsing and setting metadata values, as well as implicit conversions back and forth between Spark DataFrame and MetaDataFrame. As long as you run all your mutations through the withMetadata method, your metadata will carry through your entire transformation pipeline. Using this method for every call is a bit of a hassle, yes, but the simple reality is that there is no first-class metadata concept in Spark.