PySpark join doesn't take 5 positional arguments?

I'm implementing a LEFT JOIN on 5 columns in PySpark, but it throws the error shown below:
TypeError: join() takes from 2 to 4 positional arguments but 5 were given
Code implemented:
Tgt_df_time_in_zone_detail = Tgt_df_view_time_in_zone_detail_dtaas.join(Tgt_df_individual_in_shift_tiz
,Tgt_df_view_time_in_zone_detail_dtaas.id_individual == Tgt_df_individual_in_shift_tiz.id_individual,
(Tgt_df_view_time_in_zone_detail_dtaas.timestamp_start >= Tgt_df_individual_in_shift_tiz.swipein)
& (Tgt_df_view_time_in_zone_detail_dtaas.timestamp_start <= Tgt_df_individual_in_shift_tiz.swipeout)
& (Tgt_df_view_time_in_zone_detail_dtaas.timestamp_end >= Tgt_df_individual_in_shift_tiz.swipein)
&(Tgt_df_view_time_in_zone_detail_dtaas.timestamp_end <= Tgt_df_individual_in_shift_tiz.swipeout)
, "left_outer")
Why doesn't PySpark accept a join on 5 columns? What's the better way to do it, then?

I guess you missed an & between your 1st and 2nd conditions, so they were passed to join() as two separate positional arguments instead of one combined condition. Try this, if it works:
Tgt_df_time_in_zone_detail = Tgt_df_view_time_in_zone_detail_dtaas.join(Tgt_df_individual_in_shift_tiz,
(Tgt_df_view_time_in_zone_detail_dtaas.id_individual == Tgt_df_individual_in_shift_tiz.id_individual)
& (Tgt_df_view_time_in_zone_detail_dtaas.timestamp_start >= Tgt_df_individual_in_shift_tiz.swipein)
& (Tgt_df_view_time_in_zone_detail_dtaas.timestamp_start <= Tgt_df_individual_in_shift_tiz.swipeout)
& (Tgt_df_view_time_in_zone_detail_dtaas.timestamp_end >= Tgt_df_individual_in_shift_tiz.swipein)
& (Tgt_df_view_time_in_zone_detail_dtaas.timestamp_end <= Tgt_df_individual_in_shift_tiz.swipeout)
, "left_outer")

Related

ASP EF Core between query 'greater than or equal' throwing compile error?

I have a query in ASP Razor EF Core which runs ok:
Model.Bookings.Where(x => x.DateOfVisit > DateTime.Now.AddYears(-1) && x.DateOfVisit < DateTime.Now).Count() > 10
I have realised that I need the query to be 'greater than or equal to' and 'less than or equal to' but when I change the code to this:
Model.Bookings.Where(x => x.DateOfVisit >= DateTime.Now.AddYears(-1) && x.DateOfVisit =< DateTime.Now).Count() > 10
I get 2 errors: one for invalid expression term '<' and the other for Operator '&&' cannot be applied to operands of type 'bool' and 'DateTime'.
I've done these types of queries in SQL many times, but I can't see why it won't let me use this logic/format here. I've been Googling for over an hour with no luck - please will someone put me out of my misery?
You got the second '=' in the wrong place: it goes after the '<'. The compiler reads '=<' as an assignment '=' followed by a stray '<', which is what produces both errors.
Model.Bookings.Where(x => x.DateOfVisit >= DateTime.Now.AddYears(-1) && x.DateOfVisit <= DateTime.Now).Count() > 10
Try this:
DateTime d1 = DateTime.Now.AddYears(-1);
DateTime d2 = DateTime.Now;
Model.Bookings.Where(x => x.DateOfVisit >= d1 && x.DateOfVisit <= d2 ).Count() > 10

How to get values from a dataframe column using SparkSQL?

Right now I am working with Spark/Scala and I am trying to join multiple dataframes to get the expected output.
The input data are CSV files with call record information. These are the main input fields:
a_number: String - the origin call number.
area_code_a: String - the a_number area code.
prefix_a: String - the a_number prefix.
b_number: String - the destination call number.
area_code_b: String - the b_number area code.
prefix_b: String - the b_number prefix.
cause_value: String - the call's final status.
val dfint = ((cdrs_nac.join(grupos_nac).where(col("causevalue") === col("id")))
.join(centrales_nac, col("dpc") === col("pointcode_decimal"), "left")
.join(series_nac_a).where(col("area_code_a") === col("codigo_area") &&
col("prefix_a") === col("prefijo") &&
col("series_a") >= col("serie_inicial") &&
col("series_a") <= col("serie_final"))
.join(series_nac_b, (
((col("codigo_area_b") === col("area_code_b")) && col("len_b_number") == "8") ||
((col("codigo_area_b") === col("area_code_b")) && col("len_b_number") == "10") ||
((col("codigo_area_b") === col("codigo_area_cent")) && col("len_b_number") == "7")) &&
col("prefix_b") === col("prefijo_b") &&
col("series_b") >= col("serie_inicial_b") &&
col("series_b") <= col("serie_final_b"), "left")
This generates multiple output files with the processed call data records, including the column "len_b_number", which holds the length of the b_number field.
While doing some tests, I found that for some reason the expression col("len_b_number") is returning the column name "len_b_number" instead of the length values, which are 7, 8 or 10. This means the col("len_b_number") == 7 OR col("len_b_number") == 8 OR col("len_b_number") == 10 conditions will never be true, because the code always compares against the column name.
At the moment the output is blank because col("len_b_number") doesn't match 7, 8 or 10. I would like to know if you can help me understand how to extract the value from this column.
Thanks
Try using === instead of ==. I could not reproduce your error. In the Scala API, == compares the Column object itself and yields a plain Boolean, whereas === builds an equality expression that Spark evaluates row by row. So:
&& col("len_b_number") == "8"
should be:
&& col("len_b_number") === "8"
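As an aside for anyone refactoring this to PySpark: this pitfall is specific to the Scala API, because Python lets Column overload ==. A minimal sketch on hypothetical rows (not the OP's CDR data):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical rows standing in for the CDR fields
df = spark.createDataFrame([("5551234", "8"), ("555123", "7")],
                           ["b_number", "len_b_number"])

# In PySpark, == on a Column builds an equality expression,
# equivalent to Scala's col("len_b_number") === "8"
df.filter(col("len_b_number") == "8").show()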

Pyspark compound filter, multiple conditions

Brand new to PySpark, and I'm refactoring some R code that is starting to lose its ability to scale properly. I return a dataframe that has a number of columns with numeric values, and I'm trying to filter this result set into a new, smaller result set using multiple compound conditions.
from pyspark.sql import functions as f
matches = df.filter(f.when('df.business') >=0.9 & (f.when('df.city') == 1.0) & (f.when('street') >= 0.7)) |
(f.when('df.phone') == 1) & (f.when('df.firstname') == 1) & (f.when('df.street') == 1) & (f.when('df.city' == 1)) |
(f.when('df.business') >=0.9) & (f.when('df.street') >=0.9) & (f.when('df.city')) == 1))) |
(f.when('df.phone') == 1) & (f.when('df.street') == 1) & (f.when('df.city')) == 1))) |
(f.when('df.lastname') >=0.9) & (f.when('df.phone') == 1) & (f.when('df.business')) >=0.9 & (f.when('df.city') == 1))) |
(f.when('df.phone') == 1 & (f.when('df.street') == 1 & (f.when('df.city') == 1) & (f.when('df.busname') >= 0.6)))
Essentially I'm just trying to return a new dataframe, "matches", where the columns in the previous dataframe, "df", satisfy the criteria pasted above. I've read a couple of other filtering posts, such as
multiple conditions for filter in spark data frames
PySpark: multiple conditions in when clause
however, I still can't seem to get it right. I suppose I could filter on one condition at a time and then call a unionAll, but I felt this would be the cleaner way.
Well, since @DataDog has clarified it, the code below replicates the filters put in by the OP.
Note: each and every clause/sub-clause should be inside parentheses. If I have missed any, it's an inadvertent mistake, as I did not have the data to test with. But the idea remains the same.
matches = df.filter(
    ((df.business >= 0.9) & (df.city == 1) & (df.street >= 0.7))
    | ((df.phone == 1) & (df.firstname == 1) & (df.street == 1) & (df.city == 1))
    | ((df.business >= 0.9) & (df.street >= 0.9) & (df.city == 1))
    | ((df.phone == 1) & (df.street == 1) & (df.city == 1))
    | ((df.lastname >= 0.9) & (df.phone == 1) & (df.business >= 0.9) & (df.city == 1))
    | ((df.phone == 1) & (df.street == 1) & (df.city == 1) & (df.busname >= 0.6))
)
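The parentheses are not optional, by the way: in Python, & binds more tightly than comparison operators, which is why the unparenthesized version misbehaves. A small self-contained sketch (with a hypothetical two-column frame) illustrating the difference:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical frame just to illustrate operator precedence
df = spark.createDataFrame([(0.95, 1)], ["business", "city"])

# Without parentheses, & binds tighter than >= and ==, so Python reads
#   df.business >= (0.9 & df.city) == 1
# as a chained comparison, calls bool() on a Column, and raises
# "Cannot convert column into bool".
# bad = df.filter(df.business >= 0.9 & df.city == 1)

# Parenthesizing each comparison builds the intended Column expression:
good = df.filter((df.business >= 0.9) & (df.city == 1))
good.show()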

How to filter out numbers with Structured Streaming?

I am working with Spark Structured Streaming and trying to filter out negative numbers from fields streamed from a lab. My code looks like this:
val records = labs.filter(
$" data.trays.tray1” <= 5 ||
$" data.trays.tray2" <= 10 ||
$" data.trays.tray3" <= 20)
.select("data.labs", "data.labs.tray1", “data.labs.tray2”, “data.labs.tray3”)
.writeStream.outputMode("append").format("console").start()
My output with the code above is:
Lab | Tray 1 | Tray 2 | Tray 3
----------------------------------
FGF |      0 |     -8 |     13
RFF |     -3 |      9 |    -14
WER |      2 |     -8 |    -16
However, I am missing the logic to filter out the negative numbers. I thought I had it figured out, but I can't seem to filter them out.
filter will keep all rows for which the predicate returns true, so you need the negation of your condition: by De Morgan's law, NOT (tray1 <= 5 || tray2 <= 10 || tray3 <= 20) is tray1 > 5 && tray2 > 10 && tray3 > 20. Try
val records = labs.filter(
$"data.trays.tray1" > 5 &&
$"data.trays.tray2" > 10 &&
$"data.trays.tray3" > 20)
.select("data.labs", "data.labs.tray1", "data.labs.tray2", "data.labs.tray3")
.writeStream.outputMode("append").format("console").start()
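To sanity-check the corrected predicate outside a stream, here is a minimal PySpark sketch using the sample rows from the question (flattened into plain columns rather than the nested data.trays struct, purely for brevity):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample rows from the question's console output, flattened for brevity
df = spark.createDataFrame(
    [("FGF", 0, -8, 13), ("RFF", -3, 9, -14), ("WER", 2, -8, -16)],
    ["lab", "tray1", "tray2", "tray3"],
)

# By De Morgan, NOT(t1 <= 5 || t2 <= 10 || t3 <= 20) is
# (t1 > 5 && t2 > 10 && t3 > 20); filter() keeps rows where this holds,
# so all three sample rows (each containing a negative value) are dropped.
df.filter((df.tray1 > 5) & (df.tray2 > 10) & (df.tray3 > 20)).show()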

geotools filter CQLException: Encountered "t"

I am querying a simple feature type schema:
r:Long:index=join,*g:Point:srid=4326,di:Integer:index=join,al:Float,s:Float,b:Float,an:Float,he:Float,ve:Float,t:Float,m:Boolean,i:Boolean,ts:Long;geomesa.table.sharing='true',geomesa.indices='attr:4:3,records:2:3,z2:3:3',geomesa.table.sharing.prefix='\\u0001'
with the query expression: r = 31 AND di = 5 AND BBOX(g, -38.857822, -76.111145, -74.64091, -38.61907) AND al <= 39.407307 AND s <= 1.6442835 AND b <= 83.14717 AND an <= 87.0774 AND he <= 40.89476 AND ve <= 88.761566 AND t <= 44.786507 AND m = true AND i = true,
but it throws an exception saying Encountered "t" at line 1, column 195.
Here is my exception log detail:
org.geotools.filter.text.cql2.CQLException: Encountered "t" at line 1, column 195.
Was expecting one of:
<NOT> ...
<IDENTIFIER> ...
"include" ...
"exclude" ...
"(" ...
"[" ...
Parsing : r = 31 AND di = 5 AND BBOX(g, -38.857822, -76.111145, -74.64091, -38.61907) AND al <= 39.407307 AND s <= 1.6442835 AND b <= 83.14717 AND an <= 87.0774 AND he <= 40.89476 AND ve <= 88.761566 AND t <= 44.786507 AND m = true AND i = true.
at org.geotools.filter.text.cql2.CQLCompiler.compileFilter(CQLCompiler.java:106)
at org.geotools.filter.text.commons.CompilerUtil.parseFilter(CompilerUtil.java:196)
at org.geotools.filter.text.cql2.CQL.toFilter(CQL.java:134)
at org.geotools.filter.text.cql2.CQL.toFilter(CQL.java:113)
at com.hps.GeomesaClient.query(GeomesaClient.java:134)
at com.hps.Reader.run(Reader.java:69)
at java.lang.Thread.run(Thread.java:745)
I am not able to determine why it's throwing an exception when querying with the attribute named t, whereas if I remove attribute t from the query, it works as expected. Is t a reserved keyword, or am I missing something?
Ok, this is a limitation in the ECQL query parser. The letter 't' by itself (ignoring case) is the UTC token.
https://github.com/geotools/geotools/blob/master/modules/library/cql/src/main/jjtree/ECQLGrammar.jjt#L180-L187
The options are to work with the GeoTools team to fix this corner case or pick a different attribute name. Nice find!