Parametrize window by values in row in PySpark

I would like to add a new column to my PySpark dataframe using a Window function, where the rowsBetween bounds are parametrized by values from columns of the dataframe.
I tried date_window.rowsBetween(-(F.lit(2) + offset), -offset), but Spark raises ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions, which I did not expect in this case.
Is there any way to parametrize rowsBetween using values from specific columns?
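For reference, a minimal sketch (the data and column names are made up) that reproduces the attempt and the error; rowsBetween only accepts plain integer bounds, not Column expressions:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# toy dataframe where each row carries its own window offset
df = spark.createDataFrame(
    [("2020-01-01", 1), ("2020-01-02", 2), ("2020-01-03", 1)],
    ["date", "offset"],
)

offset = F.col("offset")
date_window = Window.orderBy("date")

# rowsBetween expects integer bounds; passing Column expressions makes Spark
# try to evaluate a Column as a boolean, which raises the
# "Cannot convert column into bool" ValueError quoted above
w = date_window.rowsBetween(-(F.lit(2) + offset), -offset)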

Related

How to perform datediff using derived column expression in Azure DataFactory

I have a SQL query which gives the difference between ActivityStartTime and ActivityEndTime in seconds.
Below is the SQL query:
DATEDIFF_BIG(second,ActivityStartTime, ActivityEndTime) as [DiffInTime]
I have to write the same using derived column expression.
Create a new column in the Derived Column's settings and enter the expression below in the Expression field. The new column will hold the difference in seconds.
toInteger(ActiveEndTime-ActiveStartTime)/1000
Note: Make sure that ActiveEndTime and ActiveStartTime are in timestamp format.

Syncsort produces a non-readable output for decimal(9,8) or smallint data type columns in Db2

I am using Syncsort to select records from Db2. For columns that are either decimal(9,8) or smallint, the output looks weird, with junk characters in it. If I cast the column to CHAR in the select statement, the output is proper. I do not want to cast the column type to CHAR in the SQL statement; rather, I want a solution in Syncsort, if that is possible.
For example, the decimal column has a value of 2.98965467, which Syncsort displays in a non-readable format if I don't use casting in the SQL statement. Kindly help.

Appending Spark dataframe to Hive table with different column order

I'm using PySpark with the HiveWarehouseConnector in an HDP3 cluster.
There was a change in the schema, so I updated my target table using an "alter table" command, which added the new columns at the last positions by default.
Now I'm trying to use the following code to save a Spark dataframe to it, but the columns in the dataframe are in alphabetical order and I'm getting the error message below:
df = spark.read.json(df_sub_path)
hive.setDatabase('myDB')
df.write.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector").mode('append').option('table','target_table').save()
and the error message traces to:
Caused by: java.lang.IllegalArgumentException: Hive column:
column_x cannot be found at same index: 77 in
dataframe. Found column_y. Aborting as this may lead to
loading of incorrect data.
Is there any dynamic way of appending the dataframe to correct location in the hive table? It is important as I expect more columns to be added to the target table.
You can read the target table with zero rows to get its column order. Then, using select, you can reorder the dataframe's columns to match the table and append it:
# fetch only the schema of the target table (the 1=0 predicate returns no rows)
target = hive.executeQuery('select * from target_table where 1=0')
# reorder the dataframe's columns to match the Hive table before appending
df = df.select(target.columns)
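With the columns aligned, the writer from the question can then be reused as-is; a short sketch, assuming df, hive and target are defined as above:

df = df.select(target.columns)
df.write.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector") \
    .mode('append').option('table', 'target_table').save()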

Need to include the offset value as expr in LAG functions

I am migrating Redshift SQL to Snowflake SQL.
I need a suggestion on how to include the offset value as an expression in Snowflake's LAG(). Regarding the offset, Redshift supports an expression in LAG() whereas Snowflake does not.
E.g., the expected SQL in Snowflake:
LAG(exp, **exp**) over (partition by col1 order by col2)
An expression for the second input parameter of the LAG function is currently not supported. You will receive the error given below if you pass an expression:
Error: SQL compilation error: argument 2 to function LAG needs to be constant, found 'EXPR' -- Where EXPR is an expression
An improvement request for supporting expressions in the second argument of LAG() function is in the pipeline.
Workaround
You can rewrite the LAG function by adding ROW_NUMBER() to the table and doing a self-join.

Insert null values to postgresql timestamp data type using python

I am trying to insert a null value into a Postgres timestamp column using Python's psycopg2.
The problem is that other data types such as char or int accept None, whereas the timestamp column does not recognize None.
I tried to insert 'Null' and 'null' as strings, because I am using a dictionary to build the values for the insert statement.
Below is the code.
queryDictOrdered[column] = queryDictOrdered[column] if isNull(queryDictOrdered[column]) is False else NULL
and the function is
def isNull(key):
    if str(key).lower() in ('null', 'n.a', 'none'):
        return True
    else:
        return False
I get the below error messages:
DataError: invalid input syntax for type timestamp: "NULL"
DataError: invalid input syntax for type timestamp: "None"
Empty timestamps in Pandas dataframes come through as NaT (not a time), which is not compatible with Postgres NULL. A quick workaround is to send the column as a varchar and then run these two queries:
update <<schema.table_name>> set <<column_name>> = Null where
<<column_name>> = 'NULL';
or (depending on what you hard-coded the empty values as):
update <<schema.table_name>> set <<column_name>> = Null where <<column_name>> = 'NaT';
Finally run:
alter table <<schema.table_name>>
alter COLUMN <<column_name>> TYPE timestamp USING <<column_name>>::timestamp without time zone;
Surely you are adding quotes around the placeholder. Read the psycopg documentation about passing parameters to queries.
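For instance, a minimal sketch (the table and connection details are hypothetical): when None is passed as a query parameter, psycopg2 adapts it to SQL NULL.

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical connection
cur = conn.cursor()

ts = None  # Python None, not the string 'NULL' or 'None'

# the placeholder is not quoted; psycopg2 turns None into SQL NULL
cur.execute("INSERT INTO my_table (created_at) VALUES (%s)", (ts,))
conn.commit()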
Dropping this here in case it's helpful for anyone.
Using psycopg2 and the cursor object's copy_from method, you can copy missing or NaT datetime values from a pandas DataFrame to a Postgres timestamp field.
The copy_from method has a null parameter that is a "textual representation of NULL in the file. The default is the two characters string \N". See the psycopg2 documentation on copy_from for more information.
Using pandas' fillna method, you can replace any missing datetime values with \N via data["my_datetime_field"].fillna("\\N"). Notice the double backslash here, where the first backslash is necessary to escape the second backslash.
Using the select_columns method from the pyjanitor module (or .loc[] and some subsetting with the column names of your DataFrame), you can coerce multiple columns at once via something akin to this, where all of your datetime fields end with an _at suffix.
data_datetime_fields = (
    data
    .select_columns("*_at")
    .apply(lambda x: x.fillna("\\N"))
)
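A rough end-to-end sketch of this approach without pyjanitor (the connection string, table name, and the _at column convention are assumptions):

import io
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")
cur = conn.cursor()

# replace missing datetimes (NaT) with the two-character string \N,
# which copy_from's null parameter treats as SQL NULL by default
datetime_cols = [c for c in data.columns if c.endswith("_at")]
data[datetime_cols] = data[datetime_cols].apply(lambda col: col.fillna("\\N"))

# stream the frame through an in-memory buffer into Postgres
buffer = io.StringIO()
data.to_csv(buffer, sep="\t", header=False, index=False)
buffer.seek(0)
cur.copy_from(buffer, "my_table", sep="\t")
conn.commit()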