I'm brand new to PySpark and I'm refactoring some R code that is starting to lose its ability to scale properly. I return a dataframe that has a number of columns with numeric values, and I'm trying to filter this result set into a new, smaller result set using multiple compound conditions.
from pyspark.sql import functions as f
matches = df.filter(f.when('df.business') >=0.9 & (f.when('df.city') == 1.0) & (f.when('street') >= 0.7)) |
(f.when('df.phone') == 1) & (f.when('df.firstname') == 1) & (f.when('df.street') == 1) & (f.when('df.city' == 1)) |
(f.when('df.business') >=0.9) & (f.when('df.street') >=0.9) & (f.when('df.city')) == 1))) |
(f.when('df.phone') == 1) & (f.when('df.street') == 1) & (f.when('df.city')) == 1))) |
(f.when('df.lastname') >=0.9) & (f.when('df.phone') == 1) & (f.when('df.business')) >=0.9 & (f.when('df.city') == 1))) |
(f.when('df.phone') == 1 & (f.when('df.street') == 1 & (f.when('df.city') == 1) & (f.when('df.busname') >= 0.6)))
Essentially I'm just trying to return a new dataframe, "matches", containing the rows of the previous dataframe, "df", whose columns meet the criteria pasted above. I've read a couple of other filtering posts such as
multiple conditions for filter in spark data frames
PySpark: multiple conditions in when clause
however, I still can't seem to get it right. I suppose I could filter on one condition at a time and then call unionAll, but I felt this would be the cleaner way.
Since @DataDog has clarified the requirements, the code below replicates the filters described by the OP.
Note: Every clause and sub-clause must be wrapped in its own parentheses, because in Python & and | bind more tightly than comparison operators such as >= and ==. If I have missed any, it is an inadvertent mistake, as I did not have the data to test against. The idea remains the same.
matches = df.filter(
    ((df.business >= 0.9) & (df.city == 1) & (df.street >= 0.7))
    |
    ((df.phone == 1) & (df.firstname == 1) & (df.street == 1) & (df.city == 1))
    |
    ((df.business >= 0.9) & (df.street >= 0.9) & (df.city == 1))
    |
    ((df.phone == 1) & (df.street == 1) & (df.city == 1))
    |
    ((df.lastname >= 0.9) & (df.phone == 1) & (df.business >= 0.9) & (df.city == 1))
    |
    ((df.phone == 1) & (df.street == 1) & (df.city == 1) & (df.busname >= 0.6))
)
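To see why the parentheses matter, here is a minimal sketch that uses only the standard-library ast module, so it runs without Spark. The helper name top_level_node is made up for this illustration, and business and city are just bare names standing in for the dataframe columns. It shows how Python itself parses a condition with and without parentheses.

```python
import ast


def top_level_node(expr: str) -> str:
    """Return the name of the outermost AST node Python builds for expr."""
    return type(ast.parse(expr, mode="eval").body).__name__


# Without parentheses, & binds first: business >= (0.9 & city) == 1,
# which Python treats as one chained comparison.
unparenthesized = "business >= 0.9 & city == 1"

# With parentheses, each comparison is evaluated first and & combines them.
parenthesized = "(business >= 0.9) & (city == 1)"

print(top_level_node(unparenthesized))  # Compare
print(top_level_node(parenthesized))    # BinOp
```

Only the second form is a bitwise AND of two comparisons, which is the shape the filter above relies on.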
I am following a tutorial (from https://automatetheboringstuff.com/2e/chapter4/) for a text-based version of Conway's Game of Life. I have typed it in exactly as the tutorial shows, and it still produces an IndexError.
The error message is as follows:
print(currentCells[x][y], end='')
IndexError: string index out of range
My goal is to print a # while a cell is 'alive' (meeting certain requirements) and a blank space when it is 'dead' (meeting other requirements).
I'm confused about why the tutorial gets it wrong even when I copy directly from it. The tutorial is for Python 3.8.
The entire block of code is as follows:
while True:
    print('\n\n\n\n\n')
    currentCells = copy.deepcopy(nextCells)
    for y in range(HEIGHT):
        for x in range(WIDTH):
            print(currentCells[x][y], end='')
        print()
    for x in range(WIDTH):
        for y in range(HEIGHT):
            leftCoord = (x - 1) % WIDTH
            rightCoord = (x + 1) % WIDTH
            aboveCoord = (y - 1) % HEIGHT
            belowCoord = (y + 1) % HEIGHT
            numNeighbors = 0
            if currentCells[leftCoord][aboveCoord] == '#':
                numNeighbors += 1
            if currentCells[x][aboveCoord] == '#':
                numNeighbors += 1
            if currentCells[rightCoord][aboveCoord] == '#':
                numNeighbors += 1
            if currentCells[leftCoord][y] == '#':
                numNeighbors += 1
            if currentCells[rightCoord][y] == '#':
                numNeighbors += 1
            if currentCells[leftCoord][belowCoord] == '#':
                numNeighbors += 1
            if currentCells[x][aboveCoord] == '#':
                numNeighbors += 1
            if currentCells[rightCoord][belowCoord] == '#':
                numNeighbors += 1
            if currentCells[x][y] == '#' and (numNeighbors == 2 or numNeighbors == 3):
                nextCells[x][y] = '#'
            elif currentCells[x][y] == ' ' and numNeighbors == 3:
                nextCells[x][y] = '#'
            else:
                nextCells[x][y] = ' '
    time.sleep(1)
I'm new to coding, so I tried commenting out the lines, but of course that just renders the other parts that use those functions unusable. Also, the other questions on this topic seem to be about much more advanced versions of this game. Like I said, this is one of my first programs.
I'm new to programming in Python too! It's a great language to learn :-)
Anyway, I think I found the problem with the code. There is a small mistake...
If you look at line 25, it says 'nextCells is a list of column lists'.
So instead of appending an empty string, it should be:
line 25: nextCells.append(column)
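A minimal sketch of the difference, assuming the grid is built column by column as described above. The helper name build_grid, the small WIDTH and HEIGHT values, and the brokenCells variable are chosen for this illustration only. It builds the grid the intended way and then reproduces the error message that appending empty strings leads to.

```python
import random

WIDTH = 6   # assumed small sizes, for illustration only
HEIGHT = 4


def build_grid(width, height):
    """Build a list of column lists, each cell either '#' or ' '."""
    grid = []
    for x in range(width):
        column = []  # one list per column
        for y in range(height):
            column.append('#' if random.randint(0, 1) == 0 else ' ')
        grid.append(column)  # append the column list, not an empty string
    return grid


nextCells = build_grid(WIDTH, HEIGHT)

# The mistaken version: every "column" is an empty string, so indexing
# a cell inside it fails in the same way as in the question.
brokenCells = ['' for _ in range(WIDTH)]
try:
    brokenCells[0][0]
except IndexError as err:
    broken_message = str(err)

print(broken_message)  # string index out of range
```

With the column lists in place, nextCells[x][y] is valid for every x below WIDTH and every y below HEIGHT, which is what the printing loop expects.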
I'm implementing a LEFT JOIN on 5 columns in PySpark, but it throws the error shown below:
TypeError: join() takes from 2 to 4 positional arguments but 5 were given
Code implemented:
Tgt_df_time_in_zone_detail = Tgt_df_view_time_in_zone_detail_dtaas.join(Tgt_df_individual_in_shift_tiz
,Tgt_df_view_time_in_zone_detail_dtaas.id_individual == Tgt_df_individual_in_shift_tiz.id_individual,
(Tgt_df_view_time_in_zone_detail_dtaas.timestamp_start >= Tgt_df_individual_in_shift_tiz.swipein)
& (Tgt_df_view_time_in_zone_detail_dtaas.timestamp_start <= Tgt_df_individual_in_shift_tiz.swipeout)
& (Tgt_df_view_time_in_zone_detail_dtaas.timestamp_end >= Tgt_df_individual_in_shift_tiz.swipein)
&(Tgt_df_view_time_in_zone_detail_dtaas.timestamp_end <= Tgt_df_individual_in_shift_tiz.swipeout)
, "left_outer")
Why doesn't PySpark accept a join on 5 columns? What's a better way to do it?
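I guess you missed the & between your first and second conditions. With a comma there instead, the two conditions are passed to join() as separate arguments, which is why it reports too many positional arguments. Try this and see if it works: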
Tgt_df_time_in_zone_detail = Tgt_df_view_time_in_zone_detail_dtaas.join(Tgt_df_individual_in_shift_tiz,
(Tgt_df_view_time_in_zone_detail_dtaas.id_individual == Tgt_df_individual_in_shift_tiz.id_individual)
& (Tgt_df_view_time_in_zone_detail_dtaas.timestamp_start >= Tgt_df_individual_in_shift_tiz.swipein)
& (Tgt_df_view_time_in_zone_detail_dtaas.timestamp_start <= Tgt_df_individual_in_shift_tiz.swipeout)
& (Tgt_df_view_time_in_zone_detail_dtaas.timestamp_end >= Tgt_df_individual_in_shift_tiz.swipein)
& (Tgt_df_view_time_in_zone_detail_dtaas.timestamp_end <= Tgt_df_individual_in_shift_tiz.swipeout)
, "left_outer")
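The argument counting behind that error can be reproduced without Spark. The sketch below uses a hypothetical stand-in class, FakeFrame, which is not part of PySpark and only copies the (other, on, how) parameter shape of join(). Plain booleans stand in for the real column conditions.

```python
class FakeFrame:
    """Hypothetical stand-in, not PySpark: mirrors only join's parameter shape."""

    def join(self, other, on=None, how=None):
        return (other, on, how)


left, right = FakeFrame(), FakeFrame()
cond_a, cond_b = True, True  # placeholders for two join conditions

# A comma between the conditions makes them two separate arguments,
# so join() receives one positional argument too many.
try:
    left.join(right, cond_a, cond_b, "left_outer")
except TypeError as err:
    message = str(err)

print(message)

# Combining the conditions with & leaves a single "on" argument.
joined = left.join(right, cond_a & cond_b, "left_outer")
```

The count in the message includes the object the method is called on, which is why four written arguments are reported as five.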
Right now I am working with Spark/Scala and I am trying to join multiple dataframes to get the expected output.
The input data are CSV files with call record information. These are the main input fields:
a_number:String = is the origin call number.
area_code_a:String = is the a_number area code.
prefix_a:String = is the a_number prefix.
b_number:String = is the destination call number.
area_code_b:String = is the b_number area code.
prefix_b:String = is the b_number prefix.
cause_value:String = is the call final status.
val dfint = ((cdrs_nac.join(grupos_nac).where(col("causevalue") === col("id")))
.join(centrales_nac, col("dpc") === col("pointcode_decimal"), "left")
.join(series_nac_a).where(col("area_code_a") === col("codigo_area") &&
col("prefix_a") === col("prefijo") &&
col("series_a") >= col("serie_inicial") &&
col("series_a") <= col("serie_final"))
.join(series_nac_b, (
((col("codigo_area_b") === col("area_code_b")) && col("len_b_number") == "8") ||
((col("codigo_area_b") === col("area_code_b")) && col("len_b_number") == "10") ||
((col("codigo_area_b") === col("codigo_area_cent")) && col("len_b_number") == "7")) &&
col("prefix_b") === col("prefijo_b") &&
col("series_b") >= col("serie_inicial_b") &&
col("series_b") <= col("serie_final_b"), "left")
This generates multiple output files with the processed call data records, including the column "len_b_number", which holds the length of the b_number field.
While doing some tests I found that, for some reason, the expression col("len_b_number") returns the column name "len_b_number" instead of the length values, which are 7, 8 or 10. This means that the col("len_b_number") == 7 OR col("len_b_number") == 8 OR col("len_b_number") == 10 conditions will never work, because the code will always compare with the column name.
At the moment the output is blank because col("len_b_number") doesn't match 7, 8 or 10. I would like to know if you can help me understand how to extract the value from this column.
Thanks
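Try using === instead of ==. In Scala, == on a Column is ordinary object equality and yields a plain Boolean, whereas === builds a Column expression that Spark evaluates for each row.
I could not reproduce your error.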
&& col("len_b_number") == "8"
should be:
&& col("len_b_number") === "8"
My last elseif statement does not execute even if the conditions are met:
Currency_Exchanage != 'Select...' and all other variables (ETF_Exchanage, Index_Exchanage and Stock_Exchanage) = 'Select...'
Here is the section of code that I am concerned about:
if (strcmp(ETF_Exchanage,'Select...') == 1) && (strcmp(Stock_Exchanage,'Select...') == 1) && (strcmp(Index_Exchanage,'Select...') == 1)...
        (strcmp(Currency_Exchanage,'Select...') == 1)
    if db == 1 && uni == 1
        tickers = gnr_bloomberg; % Analyse Bloomberg natural resources
        nrm=1;
    elseif db == 1 && uni == 2
        tickers = all_bloomberg; % Analyse Bloomberg all
        nrm=1;
    elseif db == 2 && uni == 1
        tickers = gnr_yahoo; % Analyse Yahoo natural resources
        nrm=1;
    elseif db == 2 && uni == 2
        tickers = all_yahoo; % Analyse Yahoo all
        nrm=1;
    end
else
    %Yahoo inputs
    if (strcmp(ETF_Exchanage,'Select...') == 0) && (strcmp(Stock_Exchanage,'Select...') == 1) && (strcmp(Index_Exchanage,'Select...') == 1)...
            (strcmp(Currency_Exchanage,'Select...') == 1); %Choose exchanges from ETF
        tickers = ETF_Yahoo(:,1);
        Exchanges = ETF_Yahoo(:,2);
        Exchange = ETF_Exchanage;
        db=2; %Yahoo Selection
    elseif (strcmp(Index_Exchanage,'Select...') == 0) && (strcmp(Stock_Exchanage,'Select...') == 1) && (strcmp(ETF_Exchanage,'Select...') == 1)...
            (strcmp(Currency_Exchanage,'Select...') == 1); %Choose exchanges from Index
        tickers = Index_Yahoo(:,1);
        Exchanges = Index_Yahoo(:,2);
        Exchange = Index_Exchanage;
        db=2;
    elseif (strcmp(Stock_Exchanage,'Select...') == 0) && (strcmp(ETF_Exchanage,'Select...') == 1) && (strcmp(Index_Exchanage,'Select...') == 1)...
            (strcmp(Currency_Exchanage,'Select...') == 1); %Choose exchanges from Stock
        tickers = Stock_Yahoo(:,1);
        Exchanges = Stock_Yahoo(:,2);
        Exchange = Stock_Exchanage;
        db=2;
    elseif (strcmp(Currency_Exchanage,'Select...') == 0) && (strcmp(Stock_Exchanage,'Select...') == 1) && (strcmp(Index_Exchanage,'Select...') == 1)...
            (strcmp(ETF_Exchanage,'Select...') == 1); %Choose exchanges from Currency
        tickers = Currency_Yahoo(:,1);
        Exchanges = Currency_Yahoo(:,2);
        Exchange = Currency_Exchanage;
        db=2;
    else
        msg = 'Error occurred.\Only one Yahoo input menue must be used!';
        error(msg)
    end
end
Any help would be much appreciated; I can't see where I'm going wrong here. I am using MATLAB R2013a.
Put a breakpoint at the elseif statement in question and then check in the command window what your condition evaluates to.
If it does not evaluate as expected, check what the individual terms evaluate to.
It is important to actually test what the conditions evaluate to in MATLAB, rather than only visually comparing the string values.
Usually by that point you should have a rough idea of what is wrong.
However, in your case we can't do these steps for you because something is off. Your code, condensed to a more reasonable minimal example,
if 1 && 1 && 1...
1;
disp('I was here')
end
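does not even execute in R2014a, since the interpreter complains about '...' being an unexpected MATLAB expression. The '...' only continues the line; there is no && joining the last condition before it to the one on the next line.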
What is the proper way to write an if statement with two variables in Xcode for iPhone?
Currently I have:
if (minute >0) && (second == 0) {
minute = minute - 1;
second = 59;
}
The same way you would do it in any C/C++/Objective-C compiler, and in most Algol-derived languages: add an extra set of parentheses to turn the two separate boolean expressions and the operator into a single compound condition:
if ((minute > 0) && (second == 0)) {
minute = minute - 1;
second = 59;
}
You'll need another set of parentheses:
if ((minute >0) && (second == 0)) {
minute = minute - 1;
second = 59;
}
Or you could also write:
if (minute > 0 && second == 0)
Which is what you'll start doing eventually anyway, and I think (subjectively) it is easier to read. Operator precedence ensures this works: the comparison operators > and == bind more tightly than &&.