How to get values from a dataframe column using SparkSQL? - scala

I am currently working with Spark/Scala and trying to join multiple dataframes to get the expected output.
The input data are CSV files with call record information. These are the main input fields:
a_number:String = is the origin call number.
area_code_a:String = is the a_number area code.
prefix_a:String = is the a_number prefix.
b_number:String = is the destination call number.
area_code_b:String = is the b_number area code.
prefix_b:String = is the b_number prefix.
cause_value:String = is the call final status.
val dfint = ((cdrs_nac.join(grupos_nac).where(col("causevalue") === col("id")))
.join(centrales_nac, col("dpc") === col("pointcode_decimal"), "left")
.join(series_nac_a).where(col("area_code_a") === col("codigo_area") &&
col("prefix_a") === col("prefijo") &&
col("series_a") >= col("serie_inicial") &&
col("series_a") <= col("serie_final"))
.join(series_nac_b, (
((col("codigo_area_b") === col("area_code_b")) && col("len_b_number") == "8") ||
((col("codigo_area_b") === col("area_code_b")) && col("len_b_number") == "10") ||
((col("codigo_area_b") === col("codigo_area_cent")) && col("len_b_number") == "7")) &&
col("prefix_b") === col("prefijo_b") &&
col("series_b") >= col("serie_inicial_b") &&
col("series_b") <= col("serie_final_b"), "left")
This generates multiple output files with the processed call data records, including the column "len_b_number", which holds the length of the b_number field.
While testing, I found that for some reason the expression col("len_b_number") returns the column name "len_b_number" instead of the length values, which are 7, 8 or 10. This means that the conditions col("len_b_number") == 7 OR col("len_b_number") == 8 OR col("len_b_number") == 10 will never work, because the code always compares against the column name.
At the moment the output is blank because col("len_b_number") does not match 7, 8 or 10. I would like to know if you can help me understand how to extract the value from this column.
Thanks

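Try using === instead of ==. In Scala, == on a Column is ordinary object equality and produces a plain Boolean (here false), whereas === builds a Column expression that Spark evaluates against each row's value.
I could not reproduce your error.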
&& col("len_b_number") == "8"
should be:
&& col("len_b_number") === "8"
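The difference can be pictured with a toy model. The sketch below is plain Python, not Spark: Col and eq_expr are hypothetical names standing in for Spark's Column and its === method. Plain == compares the column object itself with the string once and yields a single boolean, while the expression-building method returns something that is evaluated per row. (In PySpark the == operator is itself overloaded on columns, so this particular pitfall belongs to the Scala API.)

```python
class Col:
    """Hypothetical stand-in for a column reference (not Spark's Column)."""

    def __init__(self, name):
        self.name = name

    def eq_expr(self, literal):
        # Plays the role of Scala's ===: returns a predicate applied to each row
        return lambda row: row[self.name] == literal


# Invented rows for illustration
rows = [{"len_b_number": "7"}, {"len_b_number": "8"}, {"len_b_number": "10"}]

# Plain == compares the Col object with a string; no row is ever consulted
plain = Col("len_b_number") == "8"

# The expression form is evaluated against each row's value
predicate = Col("len_b_number").eq_expr("8")
matches = [r for r in rows if predicate(r)]
```

Here plain is a single False no matter what the rows contain, which is why the original join produced no output.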

Related

Drools rule with not condition with multiple condition causing error

When I use the condition below with 'not', I get an error.
not(Obj1(value == 0) && Obj2(value <= 3))
However, if I replace it with the condition below, I do not get any casting exception.
Obj1(value != 0) or Obj2(value > 3)
The rule looks like this:
rule "test_6"
salience 10
when
not(Obj1(value == 0) && Obj2(value <= 3))
then
.....
end
And this is the error I'm getting:
throwing error Error Message: org.drools.core.rule.GroupElement cannot be cast to org.drools.core.rule.Pattern
The && and || operators can only be used inside a single pattern. For example: Obj1( value > 3 && value < 10 || value == 0). According to the documentation, to combine separate patterns you have to use the "and" and "or" operators.
So, in your case, your rule should be:
rule "test_6"
salience 10
when
not(Obj1(value == 0) and Obj2(value <= 3))
then
.....
end
Note that it did not fail when you used "or", because that is the right operator to use between patterns instead of ||.
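The two conditions in the question are related by De Morgan's law when read as plain boolean tests on the two values. A minimal Python check (not Drools; the function names are made up for illustration) confirms that the two forms agree on every combination of values in a small range. Keep in mind that Drools patterns also match against facts in working memory, so a "not" over patterns and an "or" of patterns are not guaranteed to fire identically.

```python
from itertools import product


def negated_conjunction(v1, v2):
    # Mirrors not(value1 == 0 and value2 <= 3)
    return not (v1 == 0 and v2 <= 3)


def disjunction_of_negations(v1, v2):
    # Mirrors value1 != 0 or value2 > 3
    return v1 != 0 or v2 > 3


# Compare both forms over every pair of values in a small range
values = range(-2, 7)
agree = all(
    negated_conjunction(a, b) == disjunction_of_negations(a, b)
    for a, b in product(values, repeat=2)
)
```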
Hope it helps,

groupby function on a calculated column

I am joining multiple dataframes and calculating the output by multiplying two columns from two different dataframes and dividing the result by a column from another dataframe.
I get the errors "grouping sequence expression is empty" and "no_order is not an aggregate function".
What is wrong with the code?
df = df1.join(df2,df2["Code"] == df1["Code"],how = 'left')\
.join(df3, df3["ID"] == df1["ID"],how = 'left')\
.join(df4, df4["ID"] == df1["ID"],how = 'left')\
.join(df5, df5["Scenario"] == df1["Status"],how='left')\
.withColumn("Country",when(df1.Ind == 1,"WI"))\
.withColumn("Country",when(df1.Ind == 0,"AA"))\
.withColumn("Year",when(df1.Year == "2020","2021"))\
.agg((sum(df5["amt"] * df1["cost"]))/df2["no_order"]).alias('output')
.groupby('Country','Year','output')
The error tells you that df2["no_order"] should be within some aggregation function, for example the sum you are already using for df5["amt"] * df1["cost"].
Also, move .groupby() above .agg().
If I understood correctly what you are trying to achieve, the code should look like this:
from pyspark.sql.functions import sum, when  # note: this sum shadows Python's built-in sum

df = df1\
.join(df2, on = 'Code', how = 'left')\
.join(df3, on = 'ID', how = 'left')\
.join(df4, on = 'ID', how = 'left')\
.join(df5, df5.Scenario == df1.Status, how='left')\
.withColumn('Country', when(df1.Ind == 1,"WI").when(df1.Ind == 0,"AA"))\
.withColumn('Year', when(df1.Year == "2020","2021"))\
.groupby('Country','Year')\
.agg(sum(df5["amt"] * df1["cost"] / df2["no_order"]).alias('output'))
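The rule behind the error is that after grouping, each output row stands for many input rows, so every column that is not a grouping key has to be reduced by an aggregate. The sketch below mimics that in plain Python rather than PySpark; the rows and their values are invented purely for illustration.

```python
from collections import defaultdict

# Invented sample rows standing in for the joined dataframe
rows = [
    {"Country": "WI", "Year": "2021", "amt": 10.0, "cost": 2.0, "no_order": 4.0},
    {"Country": "WI", "Year": "2021", "amt": 6.0, "cost": 1.0, "no_order": 2.0},
    {"Country": "AA", "Year": "2021", "amt": 8.0, "cost": 3.0, "no_order": 6.0},
]

# Step 1: collect rows under their grouping key (the groupby step)
groups = defaultdict(list)
for r in rows:
    groups[(r["Country"], r["Year"])].append(r)

# Step 2: reduce each group to a single value (the agg step);
# no_order sits inside the sum because a group has many no_order values
output = {
    key: sum(r["amt"] * r["cost"] / r["no_order"] for r in members)
    for key, members in groups.items()
}
```

Grouping first and aggregating second is the same order as .groupby(...).agg(...) in the answer.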

How to "Print when" with condition

I have a subreport that prints many lines because it is in the detail band, which is fine for me. However, I want to filter the rows based on parameters when I print the report. I used the "Print When" option to set the Boolean condition, but it does not work; it only works with a single parameter.
The condition is:
(!"N".equals($P{Chk_Amministratori})
|| ($F{field1} != $P{CheckDinamico1}
|| $F{field1} != $P{CheckDinamico2}
|| $F{field1} != $P{CheckDinamico3}
|| $F{field1} != $P{CheckDinamico4}
|| $F{field1} != $P{CheckDinamico5}
|| $F{field1} != $P{CheckDinamico6}
|| $F{field1} != $P{CheckDinamico7}
|| $F{field1} != $P{CheckDinamico8}
|| $F{field1} != $P{CheckDinamico10}) ? Boolean.TRUE : Boolean.FALSE
If I use a single parameter it works, for example:
$F{field1} != $P{CheckDinamico10} ? Boolean.TRUE : Boolean.FALSE
Can anyone help me?
The logic in your expressions is flawed. You have this
$F{field1} != $P{CheckDinamico1}
|| $F{field1} != $P{CheckDinamico2}
|| ...
Let's say CheckDinamico1 is 5 and CheckDinamico2 is 7. So your expression is field1 != 5 OR field1 != 7 OR ...
This expression is true whatever the value of field1. If field1 is 3, it is different from both 5 and 7, so the expression is true. If field1 is 5, it is different from 7, and because of the OR the expression is also true. And if field1 is 7, it is different from 5, so again the expression is true.
Maybe you wanted AND instead of OR in the expression? Also, as Alex K noted, != might not always work, so it is safer to use equals. You can also use primitive boolean expressions; you don't need Boolean. Therefore try something like this:
(!"N".equals($P{Chk_Amministratori})
|| (!$F{field1}.equals($P{CheckDinamico1})
&& !$F{field1}.equals($P{CheckDinamico2})
&& !$F{field1}.equals($P{CheckDinamico3})
&& !$F{field1}.equals($P{CheckDinamico4})
&& !$F{field1}.equals($P{CheckDinamico5})
&& !$F{field1}.equals($P{CheckDinamico6})
&& !$F{field1}.equals($P{CheckDinamico7})
&& !$F{field1}.equals($P{CheckDinamico8})
&& !$F{field1}.equals($P{CheckDinamico10})))
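The difference between the OR and AND forms is easy to check outside the report. This is a small Python sketch, not JasperReports expression syntax; the two parameter values 5 and 7 are the ones used in the explanation above and the function names are made up.

```python
# Stand-ins for two CheckDinamico parameters holding different values
params = [5, 7]


def differs_from_any(field):
    # OR of inequalities: true as soon as a single parameter differs
    return any(field != p for p in params)


def differs_from_all(field):
    # AND of inequalities: true only when the field matches no parameter
    return all(field != p for p in params)


or_results = [differs_from_any(f) for f in (3, 5, 7)]
and_results = [differs_from_all(f) for f in (3, 5, 7)]
```

The OR form is true for 3, 5 and 7 alike, so it filters nothing; the AND form is true only for 3.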

How do I add a condition to an existing conditional expression?

I had a programmer write a Perl script for my site.
One of the functions is to update price/stock when a certain condition is met.
# update when price/stock conditions met
if ( ($force_price_updates == 1) ||
($data->{'price'} <= $product_price && $data->{'quantity'} > 0) ||
($product_quantity == 0 && $data->{'quantity'} > 0) ) {
What the code above does not do is update the price when the new price is higher. It updates the stock value, but if the new stock comes at a higher price, I lose out: the stock gets updated but the price does not.
The script goes through a number of feeds, and if the same product is found in any of the feeds, the script should amend the price/stock according to the rule above.
I can't find the programmer and my Perl knowledge is limited. I understand what the code is doing, but don't know what it should do if the price is higher and stock is greater than zero.
You can add the extra condition you're looking for to that statement.
The condition you're looking to match is:
$data->{'price'} > $product_price && $product_quantity > 0
So the final version would look like this:
if ( ($force_price_updates == 1) ||
($data->{'price'} <= $product_price && $data->{'quantity'} > 0) ||
($product_quantity == 0 && $data->{'quantity'} > 0) ||
($data->{'price'} > $product_price && $product_quantity > 0) ) {
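To see which cases the extended condition covers, it can help to model it as a standalone predicate. The sketch below is Python rather than Perl, and the function and argument names are made up for illustration; the four clauses follow the final condition above one for one.

```python
def should_update(force, new_price, new_qty, cur_price, cur_qty):
    """Hypothetical model of the final Perl condition."""
    return (
        force == 1
        or (new_price <= cur_price and new_qty > 0)
        or (cur_qty == 0 and new_qty > 0)
        or (new_price > cur_price and cur_qty > 0)  # clause added in the answer
    )


# A higher price with stock on hand now triggers an update
higher_price_in_stock = should_update(0, 12, 5, 10, 3)

# A higher price with no stock on either side still does not
higher_price_no_stock = should_update(0, 12, 0, 10, 0)
```

Note that the added clause tests the current stock ($product_quantity), as the answer wrote it, not the incoming quantity from the feed.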

Multiple statements in where clause

I have a strange problem. I have a simple search requirement where a user can search a given entity (say, customer) based on several search criteria. The user may choose whether or not to use each criterion. The search conditions need to 'AND' all the criteria. So I wrote code like this (which works):
IQueryable _customer;
_customer = from c in DS.properties
where
(txtCustomerName.Text.Length == 0 || c.name == txtCustomerName.Text)
&& (txtpropcust1.Text.Length == 0 || c.customfield1 == txtpropcust1.Text)
&& (txtpropcust2.Text.Length == 0 || c.customfield2 == txtpropcust2.Text)
&& (txtpropcust3.Text.Length == 0 || c.customfield3 == txtpropcust3.Text)
&& (txtpropcust4.Text.Length == 0 || c.customfield4 == txtpropcust4.Text)
&& (txtpropcust13.Text.Length == 0 || c.customfield13 == txtpropcust13.Text)
select c;
GridView1.DataContext = _customer;
The problem is that if I have 14 where clauses, EF throws an error: 13 work, 14 do not.
I am using EF+WCF data service in a WPF application. Is there a setting somewhere which limits the number of where clauses?
Thanks
To simplify the resulting query, you could use:
// AsQueryable() types the variable as IQueryable<T>, so the Where() results can be assigned back to it
var customers = DS.properties.AsQueryable();
if (txtCustomerName.Text.Length > 0)
    customers = customers.Where(x => x.name == txtCustomerName.Text);
if (txtpropcust1.Text.Length > 0)
    customers = customers.Where(x => x.customfield1 == txtpropcust1.Text);
// etc
_customer = customers;
GridView1.DataContext = _customer;
Note that this will only add SQL where clauses when there's a need for it.
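The same idea, building the filter only from the criteria the user actually filled in, can be sketched independently of EF. This is plain Python over a list of dictionaries, not LINQ; the helper names and sample rows are invented for illustration.

```python
def build_filters(criteria):
    """Return one predicate per non-empty search field."""
    return [
        (lambda row, f=field, v=value: row[f] == v)
        for field, value in criteria.items()
        if value  # skip empty text boxes entirely
    ]


def search(rows, criteria):
    filters = build_filters(criteria)
    # A row is kept only if it passes every active filter (logical AND)
    return [r for r in rows if all(f(r) for f in filters)]


# Invented sample data
customers = [
    {"name": "Ann", "customfield1": "x"},
    {"name": "Bob", "customfield1": "y"},
]
result = search(customers, {"name": "", "customfield1": "y"})
```

With an empty name, only one predicate is built, which corresponds to only one where clause reaching the generated query.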