Window function (lag) implementation and the use of isNotIn in PySpark / T-SQL

Below is the T-SQL code, followed by my attempt to convert it to PySpark using window functions.
case
when eventaction = 'OUT' and lag(eventaction,1) over (PARTITION BY barcode order by barcode,eventdate,transactionid) <> 'IN'
then 'TYPE4'
else ''
end as TYPE_FLAG,
The PySpark code below gives an error when using the window function lag:
Tgt_df = Tgt_df.withColumn(
    'TYPE_FLAG',
    F.when(
        (F.col('eventaction') == 'OUT')
        & (F.lag('eventaction', 1).over(w).isNotIn(['IN'])),
        "TYPE4"
    ).otherwise(''))
But it's not working. What to do!?

It is giving you an error because there is no isNotIn method on the Column object.
That would have been obvious if you just posted the error message...
Instead, use the ~ (not) operator.
& (~F.lag('eventaction', 1).over(w).isin(['IN'])),
The list of available methods is in the official documentation.
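For completeness, here is a minimal sketch of the corrected expression, assuming the window spec w mirrors the T-SQL OVER clause (partition by barcode, order by barcode, eventdate, transactionid):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Window spec reconstructed from the T-SQL OVER clause in the question.
w = Window.partitionBy('barcode').orderBy('barcode', 'eventdate', 'transactionid')

Tgt_df = Tgt_df.withColumn(
    'TYPE_FLAG',
    F.when(
        (F.col('eventaction') == 'OUT')
        # ~ negates isin; like the T-SQL <> 'IN', it stays null (falsy) on the
        # first row of each partition, where lag returns null.
        & (~F.lag('eventaction', 1).over(w).isin(['IN'])),
        'TYPE4'
    ).otherwise(''))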

Related

pyspark.sql.functions abs() fails with PySpark Column input

I'm trying to convert the following HiveQL query into PySpark:
SELECT *
FROM ex_db.ex_tbl
WHERE dt >= 20180901 AND
dt < 20181001 AND
(ABS(HOUR(FROM_UNIXTIME(local_timestamp))-13)>6 OR
(DATEDIFF(FROM_UNIXTIME(local_timestamp), '2018-12-31') % 7 IN (0,6)))
I am not great at PySpark, but I have viewed the list of functions. I have gotten to the point where I am attempting the ABS() function, but struggling to do so in PySpark. Here is what I have tried:
import pyspark.sql.functions as F
df1.withColumn("abslat", F.abs("lat"))
An error occurred while calling z:org.apache.spark.sql.functions.abs
It doesn't work. I read that the input must be a PySpark Column. I checked and that condition is met.
type(df1.lat)
<class 'pyspark.sql.column.Column'>
Can someone please point me in the right direction?
You're passing a string to abs, which is valid in Scala where the $ operator turns a string into a Column.
In PySpark you need to pass the Column itself, e.g. F.abs(dataframe.column_name).
For your case, try this (using the F alias from your import):
df1.withColumn("abslat", F.abs(df1.lat))
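For the broader HiveQL query in the question, a rough PySpark sketch (assuming an active SparkSession, that the table is read with spark.table, and that dt is numeric; names follow the query) could look like this:
import pyspark.sql.functions as F

df = spark.table("ex_db.ex_tbl")

# Mirror the HiveQL expressions with built-in functions.
hour_offset = F.abs(F.hour(F.from_unixtime(F.col("local_timestamp"))) - 13)
day_diff = F.datediff(F.from_unixtime(F.col("local_timestamp")), F.lit("2018-12-31"))

result = df.filter(
    (F.col("dt") >= 20180901)
    & (F.col("dt") < 20181001)
    & ((hour_offset > 6) | ((day_diff % 7).isin(0, 6)))
)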

Pyspark Data Frame Filtering Syntax Error

I am working with a Pyspark Data Frame using Pyspark version 1.6. Before exporting this data frame to a .CSV file I need to filter the data based on certain conditions using LIKE and OR operators on one particular column. To talk you through what I have done so far, I have created the initial data frame from multiple .JSON files. This data frame has been subsetted so only the required columns are included. A sqlContext temporary table has then been created. I have tried two different approaches so far, to use sqlContext and to use a Pyspark method.
sqlContext Method:
df_filtered = sqlContext.sql("SELECT * from df WHERE text LIKE '#abc' OR 'abc' OR 'ghi' OR 'jkl' OR '#mno' OR '#1234' OR '56789'")
This is the error message I am presented with when running the sqlContext method:
pyspark.sql.utils.AnalysisException: u"cannot resolve '(text LIKE #abc || abc)' due to data type mismatch: differing types in '(text LIKE #abc || abc)' (boolean and string).;"
pyspark Method:
df_filtered.where((df["text"].like ("#abc")) || ((brexit_april_2016["text"].like ("abc")) || ((brexit_april_2016["text"].like ("#ghi")) || ((brexit_april_2016["text"].like ("jkl")) || ((brexit_april_2016["text"].like ("#mno")) || ((brexit_april_2016["text"].like ("1234")) || ((brexit_april_2016["text"].like ("56789"))
When running the pyspark method I am presented with a syntax error.
I'm sure it's something really simple that I've messed up on here but I would appreciate some help.
Thanks!
df_filtered = df.filter(
    (df.text.like("#abc")) | (df.text.like("abc")))
The "like" and or ("|") should be used like this in PySpark. You can add more conditions as per requirement.
I hope this helps.
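If you have many patterns, one way to avoid repeating yourself is to build the OR'ed condition with reduce. A sketch, assuming df is your DataFrame and using the patterns from the question (note that LIKE without % wildcards is an exact match, so add wildcards if you want a contains-style match):
from functools import reduce

patterns = ["#abc", "abc", "ghi", "jkl", "#mno", "#1234", "56789"]

# Build a single condition: text LIKE p1 OR text LIKE p2 OR ...
condition = reduce(lambda a, b: a | b, [df["text"].like(p) for p in patterns])

df_filtered = df.where(condition)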

How to insert similar value into multiple locations of a psycopg2 query statement using dict? [duplicate]

I have a Python script that runs a pgSQL file through SQLAlchemy's connection.execute function. Here's the block of code in Python:
results = pg_conn.execute(sql_cmd, beg_date = datetime.date(2015,4,1), end_date = datetime.date(2015,4,30))
And here's one of the areas where the variable gets inputted in my SQL:
WHERE
( dv.date >= %(beg_date)s AND
dv.date <= %(end_date)s)
When I run this, I get a cryptic python error:
sqlalchemy.exc.ProgrammingError: (psycopg2.ProgrammingError) argument formats can't be mixed
…followed by a huge dump of the offending SQL query. I've run this exact code with the same variable convention before. Why isn't it working this time?
I encountered a similar issue as Nikhil. I have a query with LIKE clauses which worked until I modified it to include a bind variable, at which point I received the following error:
DatabaseError: Execution failed on sql '...': argument formats can't be mixed
The solution is not to give up on the LIKE clause; it would be surprising if psycopg2 simply didn't permit LIKE clauses. Rather, we can escape the literal % as %%. For example, the following query:
SELECT *
FROM people
WHERE start_date > %(beg_date)s
AND name LIKE 'John%';
would need to be modified to:
SELECT *
FROM people
WHERE start_date > %(beg_date)s
AND name LIKE 'John%%';
More details in the psycopg2 docs: http://initd.org/psycopg/docs/usage.html#passing-parameters-to-sql-queries
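As a concrete illustration of the %% escaping alongside a bound parameter, here is a minimal psycopg2 sketch (the connection string and table are placeholders):
import datetime
import psycopg2

conn = psycopg2.connect("dbname=example")  # placeholder connection
cur = conn.cursor()

sql_cmd = """
    SELECT *
    FROM people
    WHERE start_date > %(beg_date)s
      AND name LIKE 'John%%'  -- the percent sign is doubled so it is not parsed as a placeholder
"""

cur.execute(sql_cmd, {"beg_date": datetime.date(2015, 4, 1)})
rows = cur.fetchall()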
As it turned out, I had used a SQL LIKE operator in the new SQL query, and the % operand was messing with Python's escaping capability. For instance:
dv.device LIKE 'iPhone%' or
dv.device LIKE '%Phone'
Another answer offered a way to un-escape and re-escape, which I felt would add unnecessary complexity to otherwise simple code. Instead, I used pgSQL's ability to handle regex to modify the SQL query itself. This changed the above portion of the query to:
dv.device ~ E'iPhone.*' or
dv.device ~ E'.*Phone$'
So for others: you may need to change your LIKE operators to regex '~' to get it to work. Just remember that it'll be WAY slower for large queries. (More info here.)
For me, it turned out I had a % in a SQL comment:
/* Any future change in the testing size will not require
a change here... even if we do a 100% test
*/
This works fine:
/* Any future change in the testing size will not require
a change here... even if we do a 100pct test
*/

Not able to pass complete datetime (Y-m-d H:i:s) in DB::raw in Eloquent

$shop = DB::table('shops')
    ->leftJoin('orderbookings', function ($join) {
        $join->on('shops.id', '=', 'orderbookings.shop_id');
        $join->on('orderbookings.created_at', '>=', DB::raw(date("Y-m-d", strtotime("now"))));
    })
    ->select('shops.*')
    ->selectRaw('COUNT(orderbookings.id) as totalorder, SUM(orderbookings.grand_total) as gtotal')
    ->orderBy('shops.shop_name', 'asc')
    ->groupby('shops.id')
    ->paginate(10);
The code above runs fine (though the total order count and amount are not correct) and gives a result close to what I want.
But I am not able to pass the date format (Y-m-d H:i:s); it shows a syntax error. I am using Laravel 5.2.
Note: I want to include the time along with the date to get the correct result.
Passing, for example, 2017-03-08 11:15:00 shows a syntax error.
Working query in MySQL:
SELECT COUNT(orderbookings.id), SUM(orderbookings.grand_total), shops.shop_name FROM shops LEFT JOIN orderbookings ON orderbookings.shop_id = shops.id AND orderbookings.created_at BETWEEN "2015-10-22 17:02:02" AND "2017-03-07 17:02:02" GROUP BY shops.id
But I am not able to convert it to Eloquent.
You should be able to do the following:
$join->on('orderbookings.created_at', '>=', date("Y-m-d"));
You don't need to use DB::raw, and omitting the second parameter of date() defaults it to the current time ("now").
If this doesn't solve your issue, please post the exact error you're seeing.

dataframe: how to groupBy/count then filter on count in Scala

Spark 1.4.1
I encountered a situation where grouping a DataFrame, then counting and filtering on the 'count' column, raises the exception below:
import sqlContext.implicits._
import org.apache.spark.sql._
case class Paf(x:Int)
val myData = Seq(Paf(2), Paf(1), Paf(2))
val df = sc.parallelize(myData, 2).toDF()
Then grouping and filtering:
df.groupBy("x").count()
.filter("count >= 2")
.show()
Throws an exception:
java.lang.RuntimeException: [1.7] failure: ``('' expected but `>=' found count >= 2
Solution:
Renaming the column makes the problem vanish (I suspect because there is then no conflict with the interpreted 'count' function):
df.groupBy("x").count()
.withColumnRenamed("count", "n")
.filter("n >= 2")
.show()
So, is that behavior to be expected, is it a bug, or is there a canonical way to work around it?
thanks, alex
When you pass a string to the filter function, the string is interpreted as SQL. Count is a SQL keyword and using count as a variable confuses the parser. This is a small bug (you can file a JIRA ticket if you want to).
You can easily avoid this by using a column expression instead of a String:
df.groupBy("x").count()
.filter($"count" >= 2)
.show()
So, is that behavior to be expected, is it a bug
Truth be told, I am not sure. It looks like the parser is interpreting count not as a column name but as a function, and expects parentheses to follow. It looks like a bug, or at least a serious limitation of the parser.
is there a canonical way to work around it?
Some options have already been mentioned by Herman and mattinbits, so here is a more SQL-ish approach from me:
import org.apache.spark.sql.functions.count
df.groupBy("x").agg(count("*").alias("cnt")).where($"cnt" > 2)
I think a solution is to put count in backticks:
.filter("`count` >= 2")
http://mail-archives.us.apache.org/mod_mbox/spark-user/201507.mbox/%3C8E43A71610EAA94A9171F8AFCC44E351B48EDF#fmsmsx124.amr.corp.intel.com%3E