Window function in pyspark - strange behavior

I'm using PySpark window functions extensively in my code, but they do not seem to work properly:
I get the correct result only for the last record (by the order-by column) in each partition.
The documentation says the API is experimental. Can we use it in production systems?
http://spark.apache.org/docs/2.3.0/api/python/pyspark.sql.html#pyspark.sql.Window
Sample code:
invWindow = Window.partitionBy(masterDrDF["ResId"], masterDrDF["vrsn_strt_dts"]).orderBy(masterDrDF["vrsn_strt_dts"]).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
max(when(invDetDF["InvoiceItemType"].like('ABD%'), 1).otherwise(0)).over(invWindow).alias("ABD_PKG_IN")

Related

Why does this result in a "org.postgresql.util.PSQLException: The column index is out of range" exception?

I'm trying to run a function in a PostgreSQL 11 server from Ignition (8.0.16) as a named-query and getting a column index error. Everywhere that has discussed this error with regards to Postgres shows it is an issue of the parameters provided and expected not matching in number.
The index reported as out of range is always one more than the number of parameters provided, even when I change the call to use a different number of parameters.
I count 13 everywhere: Ignition parameters, test parameters, the function call in Ignition, the function definition, the table. Here is the function call from Ignition:
SELECT insert_run_data(
:speed_in,
:avg_speed_in,
:coater_num_in,
:coater_op_in,
:finisher_in,
:helper1_in,
:helper2_in,
:coater_down_in,
:current_downtime_reason_in,
:hanging_downtime_reason_in,
:tabcode_in,
:start_time_in,
:end_time_in
);
In the same named-query window, if I comment out the function call and try to write directly using the same parameters, it writes without issue:
insert into
nh_coater_tabcode_operator_data(
speed, avg_speed, coater_num, coater_op, finisher, helper1,
helper2, coater_down, current_downtime_reason,
hanging_downtime_reason, tabcode, start_time, end_time
)
values(:speed_in, :avg_speed_in, :coater_num_in, :coater_op_in,
:finisher_in, :helper1_in, :helper2_in, :coater_down_in,
:current_downtime_reason_in, :hanging_downtime_reason_in, :tabcode_in,
:start_time_in, :end_time_in
);
The function also runs fine from within PGAdmin.
Here are gists showing the SQL used to create the table and function, the stack trace from Ignition, and an image showing the named-query authoring window parameters matching:
create function gist
create table gist
stack trace of error gist
parameters in Ignition
The Ignition Designer seems to cache the function. So if the function is changed, you will need to save and reopen the project: either open another project and switch back, or close the Designer and open a new Designer window.

Window function (lag) implementation and the use of isNotIn in pyspark

The T-SQL code is below. I tried to convert it to PySpark using window functions; that attempt is also shown.
case
when eventaction = 'OUT' and lag(eventaction,1) over (PARTITION BY barcode order by barcode,eventdate,transactionid) <> 'IN'
then 'TYPE4'
else ''
end as TYPE_FLAG,
PySpark code that gives an error when using the window function lag:
Tgt_df = Tgt_df.withColumn(
    'TYPE_FLAG',
    F.when(
        (F.col('eventaction')=='OUT')
        &(F.lag('eventaction',1).over(w).isNotIn(['IN'])),
        "TYPE4"
    ).otherwise(''))
But it's not working. What should I do?
It gives you an error because Column objects have no isNotIn method.
That would have been obvious if you had posted the error message...
Instead, negate isin with the ~ (not) operator:
&( ~ F.lag('eventaction',1).over(w).isin(['IN'])),
The list of available methods is in the official documentation.

Use consistency level in Phantom-dsl and Cassandra

Currently using --
cqlsh> show version
[cqlsh 4.1.1 | Cassandra 2.0.17 | CQL spec 3.1.1 | Thrift protocol 19.39.0]
I'm using phantom-dsl 1.12.2 with Scala 2.10.
I can't figure out how to set consistency levels on queries.
There are predefined functions insert() and select() as part of CassandraTable. How can I pass the consistency level to them?
insert.value(....).consistencyLevel_=(ConsistencyLevel.QUORUM)
does not work and fails with an error (probably because this appends "USING CONSISTENCY QUORUM" to the end of the query). Here's the actual exception I get:
com.datastax.driver.core.exceptions.SyntaxError: line 1:424 no viable alternative at input 'CONSISTENCY'
at com.datastax.driver.core.Responses$Error.asException(Responses.java:122) ~[cassandra-driver-core-2.2.0-rc3.jar:na]
at com.datastax.driver.core.DefaultResultSetFuture.onSet(DefaultResultSetFuture.java:120) ~[cassandra-driver-core-2.2.0-rc3.jar:na]
at com.datastax.driver.core.RequestHandler.setFinalResult(RequestHandler.java:186) ~[cassandra-driver-core-2.2.0-rc3.jar:na]
at com.datastax.driver.core.RequestHandler.access$2300(RequestHandler.java:45) ~[cassandra-driver-core-2.2.0-rc3.jar:na]
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.setFinalResult(RequestHandler.java:754) ~[cassandra-driver-core-2.2.0-rc3.jar:na]
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onSet(RequestHandler.java:576) ~[cassandra-driver-core-2.2.0-rc3.jar:na]
I see from the documentation and the discussion on this pull request that I could call setConsistencyLevel(ConsistencyLevel.QUORUM) on a SimpleStatement, but I would prefer not to rewrite all the different insert statements.
UPDATE
Just to close the loop on this issue: I worked around it by creating a custom InsertQuery and using that instead of the one provided by final def insert in CassandraTable:
def qinsert()(implicit keySpace: KeySpace) = {
val table = this.asInstanceOf[T]
new InsertQuery[T, M, Unspecified](table, CQLQuery("INSERT into keyspace.tablename"), consistencyLevel = ConsistencyLevel.QUORUM)
}
First of all, there is no setValue method in phantom, and the API method you are using is missing an = at the end.
The correct structure is:
Table.insert
.value(_.name, "test")
.consistencyLevel_=(ConsistencyLevel.Quorum)
As you are on Stack Overflow, an error stack trace and specific details of what doesn't work are generally preferable to "does not work".
I have finally figured out how to properly set the consistency level using phantom-dsl.
Using a statement, you can do the following:
statement.setConsistencyLevel(ConsistencyLevel.QUORUM)
Also, take a look at the test project I've been working on to help people like you with examples using phantom-dsl:
https://github.com/iamthiago/cassandra-phantom

Job executed with no data in Spark Streaming

My code:
// messages is JavaPairDStream<K, V>
Fun01(messages)
Fun02(messages)
Fun03(messages)
Fun01, Fun02 and Fun03 all contain transformations and output operations (foreachRDD).
Fun01 and Fun03 both executed as expected, which proves that "messages" is not null or empty.
On the Spark application UI, I found Fun02's output stage under "Spark stages", which proves that it was executed.
The first line of Fun02 is a map function, and I added logging to it. I also added logging to every step in Fun02, and all of the logs show that no data came through.
Does anybody know the possible reasons? Thanks very much.
@maasg Fun02's logic is:
msg_02 = messages.mapToPair(...)
msg_03 = msg_02.reduceByKeyAndWindow(...)
msg_04 = msg_03.mapValues(...)
msg_05 = msg_04.reduceByKeyAndWindow(...)
msg_06 = msg_05.filter(...)
msg_07 = msg_06.filter(...)
msg_07.cache()
msg_07.foreachRDD(...)
I have tested on Spark-1.1 and Spark-1.2, which are the versions supported by my company's Spark cluster.
It seems that this is a bug in Spark-1.1 and Spark-1.2 that was fixed in Spark-1.3.
I posted my test results here: http://secfree.github.io/blog/2015/05/08/spark-streaming-reducebykeyandwindow-data-lost.html .
When two reduceByKeyAndWindow operations are used one after the other, data may be lost, depending on the window and slide values.
I cannot find the bug in Spark's issue list, so I cannot get the patch.

dataFrame keying using pandas groupby method

I'm new to pandas and trying to learn how to work with it. I'm having a problem applying an example I saw in one of Wes's videos and notebooks to my own data. I have a CSV file that looks like this:
filePath,vp,score
E:\Audio\7168965711_5601_4.wav,Cust_9709495726,-2
E:\Audio\7168965711_5601_4.wav,Cust_9708568031,-80
E:\Audio\7168965711_5601_4.wav,Cust_9702445777,-2
E:\Audio\7168965711_5601_4.wav,Cust_7023544759,-35
E:\Audio\7168965711_5601_4.wav,Cust_9702229339,-77
E:\Audio\7168965711_5601_4.wav,Cust_9513243289,25
E:\Audio\7168965711_5601_4.wav,Cust_2102513187,18
E:\Audio\7168965711_5601_4.wav,Cust_6625625104,-56
E:\Audio\7168965711_5601_4.wav,Cust_6073165338,-40
E:\Audio\7168965711_5601_4.wav,Cust_5105831247,-30
E:\Audio\7168965711_5601_4.wav,Cust_9513082770,-55
E:\Audio\7168965711_5601_4.wav,Cust_5753907026,-79
E:\Audio\7168965711_5601_4.wav,Cust_7403410322,11
E:\Audio\7168965711_5601_4.wav,Cust_4062144116,-70
I load it into a DataFrame and then group it by "filePath" and "vp". The code is:
res = df.groupby(['filePath','vp']).size()
res.index
and the output is:
[E:\Audio\7168965711_5601_4.wav Cust_2102513187,
Cust_4062144116, Cust_5105831247,
Cust_5753907026, Cust_6073165338,
Cust_6625625104, Cust_7023544759,
Cust_7403410322, Cust_9513082770,
Cust_9513243289, Cust_9702229339,
Cust_9702445777, Cust_9708568031,
Cust_9709495726]
Now I'm trying to access the index like a dict, as I saw in the examples, but when I do
res['Cust_4062144116']
I get an error:
KeyError: 'Cust_4062144116'
I do get a result when I use the file path, but as I understand it, and as I saw in previous examples, I should be able to use the vp keys as well. Isn't that so?
Sorry if this is a trivial question; I just can't understand why it works in one example but not in the other.
Rutger, you are not correct. It is possible to partially index a MultiIndex Series; I simply did it the wrong way.
The first index level is the file name (e.g. E:\Audio\7168965711_5601_4.wav above) and the second level is vp. That is, for each file name I have multiple vps.
Now, this is correct:
res[r'E:\Audio\7168965711_5601_4.wav']
and will return:
Cust_2102513187 2
Cust_4062144116 8
....
but trying to index directly by the inner level (the Cust_ values) will fail.
You group by two columns and therefore get a MultiIndex in return. This means you also have to slice using both of those levels, not a single index value.
Calling .size() on the groupby object returns a Series. If you convert it to a DataFrame, you can use the .xs method to slice a single level:
res = pd.DataFrame(df.groupby(['filePath','vp']).size())
res.xs('Cust_4062144116', level=1)
That works. If you want to keep it as a Series, boolean indexing can help, something like:
res[res.index.get_level_values(1) == 'Cust_4062144116']
The last option is a bit less readable but sometimes more flexible. For example, you could test for multiple values at once:
res[res.index.get_level_values(1).isin(['Cust_4062144116', 'Cust_6073165338'])]
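To put the options above side by side, here is a small self-contained sketch. The file names, customer ids and scores are made-up stand-ins for the CSV in the question, and the levels are addressed by name ("vp") rather than by position, which is just a readability choice. It also relies on recent pandas versions providing .xs on a Series, so the DataFrame conversion is skipped here.

```python
import pandas as pd

# Hypothetical stand-in for the CSV from the question.
df = pd.DataFrame({
    "filePath": ["a.wav", "a.wav", "a.wav", "b.wav"],
    "vp": ["Cust_1", "Cust_1", "Cust_2", "Cust_1"],
    "score": [-2, -80, 25, 18],
})

# Grouping by two columns yields a Series with a two-level MultiIndex.
res = df.groupby(["filePath", "vp"]).size()

# Partial indexing on the outer level (filePath) works directly.
by_file = res.loc["a.wav"]

# Selecting on the inner level (vp) needs the level to be named explicitly.
by_vp = res.xs("Cust_1", level="vp")

# Boolean indexing keeps the full MultiIndex and accepts several values.
mask = res.index.get_level_values("vp").isin(["Cust_1"])
filtered = res[mask]

print(by_file)
print(by_vp)
print(filtered)
```

The difference between the last two is what stays in the index: .xs drops the selected level, while the boolean mask keeps both levels.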