pyspark.sql.functions abs() fails with PySpark Column input - pyspark

I'm trying to convert the following HiveQL query into PySpark:
SELECT *
FROM ex_db.ex_tbl
WHERE dt >= 20180901 AND
dt < 20181001 AND
(ABS(HOUR(FROM_UNIXTIME(local_timestamp))-13)>6 OR
(DATEDIFF(FROM_UNIXTIME(local_timestamp), '2018-12-31') % 7 IN (0,6)))
I am not great at PySpark, but I have looked through the list of functions. I have gotten to the point where I am attempting the ABS() function, but I am struggling to do so in PySpark. Here is what I have tried:
import pyspark.sql.functions as F
df1.withColumn("abslat", F.abs("lat"))
An error occurred while calling z:org.apache.spark.sql.functions.abs
It doesn't work. I read that the input must be a PySpark Column. I checked and that condition is met.
type(df1.lat)
<class 'pyspark.sql.column.Column'>
Can someone please point me in the right direction?

You're passing a string to abs, which is valid in Scala with the $ operator, which converts a string into a Column.
In PySpark you need to pass a Column to abs(), like this: abs(dataframe.column_name)
For your case, try this one:
df1.withColumn("abslat", F.abs(df1.lat))

Related

What is PySpark SQL equivalent function for pyspark.pandas.DataFrame.to_string?

Pandas API function: https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.to_string.html
There is another answer, though it doesn't work for me: pyspark : Convert DataFrame to RDD[string]
Following that post's advice, I tried
data.rdd.map(lambda row: [str(c) for c in row])
Then I get this error
TypeError: 'PipelinedRDD' object is not iterable
I would like for it to output rows of strings as if it's similar to to_string() above. Is this possible?
Would pyspark.sql.DataFrame.show satisfy your expectations about the console output? You can sort the df via pyspark.sql.DataFrame.sort before printing if required.
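For example, a minimal sketch of both options, assuming your DataFrame is called data as in the question (the sort column is just a placeholder):
# print the DataFrame as a formatted table on the console
data.sort("some_column").show(n=20, truncate=False)

# if plain Python strings are really needed, collect first and format locally
rows_as_strings = [", ".join(str(c) for c in row) for row in data.collect()]
for line in rows_as_strings:
    print(line)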

selecting a range of colums in SKlearn column transformer

I am encoding categorical data; many columns need to be selected. I have typed them in individually and it works OK, but there is obviously a more elegant way.
import numpy as np
import pandas as pd

dataset = pd.read_csv('train.csv')
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(),[2,5,6,7,8,9,10,11,12,13,14,15,16,21,22,23,24,25,27,28,29,30,31,32,33,34,35,39,40,41,42,53,54,55,56,57,58,60,63,64,65,72,73,74,78,79])], remainder='passthrough')
x = np.array(ct.fit_transform(x))
I have tried using (23:34) and I have tried using slice, but that does not work as it is not that data type.
Which method should I use for selecting a range of columns?
Also, what data type is it at the point where I am selecting the columns?
I searched but was not able to find a solution for this exact question.
Finally, is this an efficient way to encode categorical data, or should I be looking at an alternative method?
Thanks!
You can use the following workaround:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

ct = ColumnTransformer(
    transformers=[
        ("ordinal_enc", OrdinalEncoder(), data.loc[:, "col1":"col100"].columns)
    ])
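Alternatively, since x in your question is a NumPy array indexed by position, here is a minimal sketch using integer positions instead of label slices (the ranges below are illustrative, not your exact columns):
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

dataset = pd.read_csv('train.csv')
x = dataset.iloc[:, :-1].values

# ColumnTransformer accepts a plain list of integer positions,
# so ranges can be built with range() and concatenated
cols = list(range(2, 17)) + list(range(21, 36)) + [39, 40, 41]
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), cols)],
    remainder='passthrough')
x_encoded = ct.fit_transform(x)  # may be a sparse matrix depending on the encoder output
The ColumnTransformer documentation also lists slice objects (e.g. slice(2, 17)) as a valid column selector, which avoids building the list by hand.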

Adding Column In sparkdataframe

Hi, I am trying to add a column to my Spark DataFrame, calculating its value based on an existing column. I am writing the code below.
val df1=spark.sql("select id,dt1,salary from dbdt1.tabledt1")
val df2=df1.withColumn("new_date",WHEN (month(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM- yyyy')))
IN (01,02,03)) THEN
CONCAT(CONCAT(year(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM- yyyy')))-1,'-'),
substr(year(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM-yyyy'))),3,4))
.otherwise(CONCAT(CONCAT(year(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM- yyyy'))),'-')
,SUBSTR(year(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM-yyyy')))+1,3,4))))
But it keeps showing the error: unclosed character literal. Can someone please guide me on how I should add this new column or modify the existing code?
Incorrect syntax in many places. First, I suggest you look at a few Spark SQL examples online and also the org.apache.spark.sql.functions API documentation, because your uses of WHEN, CONCAT, and IN are all incorrect.
Scala strings are enclosed by double quotes, you appear to be using SQL string syntax.
'dd-MM-yyyy' should be "dd-MM-yyyy"
To reference a column dt1 on DataFrame df1 you can use one of the following:
df1("dt1")
col("dt1") // if you import org.apache.spark.sql.functions.col
$"dt1" // if you import spark.implicits._ locally
For example:
from_unixtime(unix_timestamp(col("dt1")), "dd-MM-yyyy")

window functions (lag) implementation and the use of isNotIn in pyspark

Below is the T-SQL code attached. I tried to convert it to pyspark using window functions which is also attached.
case
when eventaction = 'OUT' and lag(eventaction,1) over (PARTITION BY barcode order by barcode,eventdate,transactionid) <> 'IN'
then 'TYPE4'
else ''
end as TYPE_FLAG,
The PySpark code below, using the window function lag, gives an error:
Tgt_df = Tgt_df.withColumn(
'TYPE_FLAG',
F.when(
(F.col('eventaction')=='OUT')
&(F.lag('eventaction',1).over(w).isNotIn(['IN'])),
"TYPE4"
).otherwise(''))
But it's not working. What to do!?
It is giving you an error because there is no isNotIn method on the Column object.
That would have been obvious if you just posted the error message...
Instead, use the ~ (not) operator.
&( ~ F.lag('eventaction',1).over(w).isin(['IN'])),
The list of available methods is in the official documentation.
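Putting it together, a minimal sketch of the corrected expression, assuming the window w mirrors your T-SQL (partitioned by barcode, ordered by eventdate and transactionid):
from pyspark.sql import Window
import pyspark.sql.functions as F

# window matching the T-SQL OVER clause
w = Window.partitionBy('barcode').orderBy('eventdate', 'transactionid')

Tgt_df = Tgt_df.withColumn(
    'TYPE_FLAG',
    F.when(
        (F.col('eventaction') == 'OUT')
        & (~F.lag('eventaction', 1).over(w).isin('IN')),
        'TYPE4'
    ).otherwise('')
)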

dataframe: how to groupBy/count then filter on count in Scala

Spark 1.4.1
I encountered a situation where grouping a DataFrame, then counting and filtering on the 'count' column, raises the exception below:
import sqlContext.implicits._
import org.apache.spark.sql._
case class Paf(x:Int)
val myData = Seq(Paf(2), Paf(1), Paf(2))
val df = sc.parallelize(myData, 2).toDF()
Then grouping and filtering:
df.groupBy("x").count()
.filter("count >= 2")
.show()
Throws an exception:
java.lang.RuntimeException: [1.7] failure: ``('' expected but `>=' found
count >= 2
Solution:
Renaming the column makes the problem vanish (as I suspect there is then no conflict with the interpreted 'count' function):
df.groupBy("x").count()
.withColumnRenamed("count", "n")
.filter("n >= 2")
.show()
So, is that behavior to be expected, is it a bug, or is there a canonical way to work around it?
Thanks, Alex
When you pass a string to the filter function, the string is interpreted as SQL. Count is a SQL keyword and using count as a variable confuses the parser. This is a small bug (you can file a JIRA ticket if you want to).
You can easily avoid this by using a column expression instead of a String:
df.groupBy("x").count()
.filter($"count" >= 2)
.show()
So, is that a behavior to expect, a bug
Truth be told, I am not sure. It looks like the parser is interpreting count not as a column name but as a function, and expects the following parentheses. It looks like a bug, or at least a serious limitation of the parser.
is there a canonical way to go around?
Some options have already been mentioned by Herman and mattinbits, so here is a more SQL-ish approach from me:
import org.apache.spark.sql.functions.count
df.groupBy("x").agg(count("*").alias("cnt")).where($"cnt" > 2)
I think a solution is to put count in backticks:
.filter("`count` >= 2")
http://mail-archives.us.apache.org/mod_mbox/spark-user/201507.mbox/%3C8E43A71610EAA94A9171F8AFCC44E351B48EDF#fmsmsx124.amr.corp.intel.com%3E