pyspark: how to select rows matching two conditions in a dataframe

I built a DataFrame in PySpark and now I want to select the rows where two columns match given values. How can I do this?
I tried this:
df.where((df['E'] ==0 ).where(df['C']=='non'))
Thanks

Use the & (and) logical operator for this:
df.where((df['E'] == 0) & (df['C'] == 'non'))

You can use either where or filter:
df.where((df.E == 0) & (df.C == 'non'))
# OR
df.filter((df.E == 0) & (df.C == 'non'))

You can also use SQL syntax directly:
df.where("E=0 and C='non'")

Related

Chaining joins in Pyspark

I am trying to join multiple dataframes in PySpark in one chained operation. The join key column name is the same in all of them. The code snippet:
columns_summed = [i for i in df_summed.columns if i != "buildingBlock_id"]
columns_concat = [i for i in df_concat.columns if i != "buildingBlock_id"]
columns_indicator = [i for i in df_indicator_fields.columns if i != "buildingBlock_id"]
columns_takeone = [i for i in df_takeone.columns if i != "buildingBlock_id"]
columns_minmax = [i for i in df_minmax.columns if i != "buildingBlock_id"]
df_all_joined = (df_summed.alias("df1").join(df_concat,df_summed.buildingBlock_id == df_concat.buildingBlock_id, "left")
.join(df_indicator_fields,df_summed.buildingBlock_id == df_indicator_fields.buildingBlock_id, "left")
.join(df_takeone,df_summed.buildingBlock_id == df_takeone.buildingBlock_id, "left")
.join(df_minmax,df_summed.buildingBlock_id == df_minmax.buildingBlock_id, "left")
.select("df1.buildingBlock_id", *columns_summed
, *columns_concat
, *columns_indicator
, *columns_takeone
, *columns_minmax
)
)
Now, when I am trying to display the joined dataframe using:
display(df_all_joined)
I'm getting the following error:
AnalysisException: Reference 'df1.buildingBlock_id' is ambiguous, could be: df1.buildingBlock_id, df1.buildingBlock_id.
Why am I getting this error even though I specified where the key column should come from?
You should specify the join columns as a list of strings:
.join(df_concat, ['buildingBlock_id'], "left")
If the columns you are joining on have the same name, this makes sure only one of them is kept. In the case of the left join, the column from df_concat is dropped.
If you don't do that, you end up with both columns in the joined DataFrame, which is what produces this "ambiguous" exception.
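Applied to the chain above, a sketch of the rewritten joins (keeping the DataFrame and list names from the question, and assuming the non-key column names are distinct across the inputs):
# Passing the join key as a list keeps a single buildingBlock_id column
# in the result, so the final select is no longer ambiguous.
df_all_joined = (
    df_summed
    .join(df_concat, ["buildingBlock_id"], "left")
    .join(df_indicator_fields, ["buildingBlock_id"], "left")
    .join(df_takeone, ["buildingBlock_id"], "left")
    .join(df_minmax, ["buildingBlock_id"], "left")
    .select(
        "buildingBlock_id",
        *columns_summed,
        *columns_concat,
        *columns_indicator,
        *columns_takeone,
        *columns_minmax,
    )
)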

Pyspark join on multiple aliased table columns

Python doesn't like the ampersand below.
I get the error: & is not a supported operation for types str and str. Please review your code.
Any idea how to get this right? I've never tried to join more than 1 column for aliased tables. Thx!!
df_initial_sample = df_crm.alias('crm').join(df_cngpt.alias('cng'), on= (("crm.id=cng.id") & ("crm.cpid = cng.cpid")), how = "inner")
Try joining on a list of the common column names instead of combining strings:
df_initial_sample = df_crm.alias('crm').join(df_cngpt.alias('cng'), on=['id', 'cpid'], how="inner")
Your join condition is overcomplicated. It can be as simple as this:
df_initial_sample = df_crm.join(df_cngpt, on=['id', 'cpid'], how = 'inner')
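If you want to keep the aliases and spell out the condition explicitly, it has to be built from Column expressions rather than Python strings; a sketch, assuming the same DataFrame and column names as in the question:
from pyspark.sql import functions as F

# Column-to-column comparisons combined with &, not string expressions.
df_initial_sample = df_crm.alias("crm").join(
    df_cngpt.alias("cng"),
    on=(F.col("crm.id") == F.col("cng.id")) & (F.col("crm.cpid") == F.col("cng.cpid")),
    how="inner",
)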

SPARK SQL: Implement AND condition inside a CASE statement

I am aware of how to implement a simple CASE-WHEN-THEN clause in Spark SQL using Scala. I am using version 1.6.2. But I need to specify an AND condition on multiple columns inside the CASE-WHEN clause. How can I achieve this in Spark using Scala?
Thanks in advance for your time and help!
Here's the SQL query that I have:
select sd.standardizationId,
case when sd.numberOfShares = 0 and
isnull(sd.derivatives,0) = 0 and
sd.holdingTypeId not in (3,10)
then
8
else
holdingTypeId
end
as holdingTypeId
from sd;
First, read the table as a DataFrame:
val table = sqlContext.table("sd")
Then select with an expression, adjusting the syntax to your database:
val result = table.selectExpr("standardizationId","case when numberOfShares = 0 and isnull(derivatives,0) = 0 and holdingTypeId not in (3,10) then 8 else holdingTypeId end as holdingTypeId")
And show the result:
result.show
An alternative, if you want to avoid the full string expression, is the following:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
val sd = sqlContext.table("sd")
val conditionedColumn: Column = when(
(sd("numberOfShares") === 0) and
(coalesce(sd("derivatives"), lit(0)) === 0) and
(!sd("holdingTypeId").isin(Seq(3,10): _*)), 8
).otherwise(sd("holdingTypeId")).as("holdingTypeId")
val result = sd.select(sd("standardizationId"), conditionedColumn)
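If you need the same logic from PySpark rather than Scala, a rough sketch of the equivalent expression (assuming a SparkSession named spark and the same column names; the question targets Spark 1.6, where you would read the table through sqlContext instead):
from pyspark.sql import functions as F

sd = spark.table("sd")

# when/otherwise mirrors the CASE WHEN ... AND ... THEN ... ELSE ... END clause.
result = sd.select(
    "standardizationId",
    F.when(
        (F.col("numberOfShares") == 0)
        & (F.coalesce(F.col("derivatives"), F.lit(0)) == 0)
        & (~F.col("holdingTypeId").isin(3, 10)),
        F.lit(8),
    ).otherwise(F.col("holdingTypeId")).alias("holdingTypeId"),
)
result.show()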

Count before order, skip and take

I'm using Entity Framework together with Unit-of-work and repository pattern.
For a function with ordering, pagination, etc. I use the following code:
stammdatenEntityModels =
_unitOfWork.StammdatenRepository.Get()
.Where(
s =>
s.Geloescht == false &&
((s.Auftraggeber != null && s.Auftraggeber.Bezeichnung.ToLower().Contains(keyword)) ||
(s.SerienNummer.Contains(keyword)) ||
(s.Bezeichnung.ToLower().Contains(keyword)) ||
(s.StammdatenKunde != null && s.StammdatenKunde.Name.ToLower().Contains(keyword)) ||
(s.BeginnVos.HasValue && s.BeginnVos == dateTime) ||
(s.VosDauer != null && s.VosDauer.Bezeichnung.ToLower().Contains(keyword)) ||
(s.Geraetewert.HasValue && s.Geraetewert.Value.ToString().Contains(keyword))
))
.OrderBy(orderBy)
.Skip(inputModel.EntriesToDisplay*(inputModel.Page - 1))
.Take(inputModel.EntriesToDisplay)
.ToList();
Now I need to know the number of records, but before the Skip and Take (for pagination) are applied.
Therefore I have the same code again:
totalCount = _unitOfWork.StammdatenRepository.Get()
.Count(
s =>
s.Geloescht == false &&
((s.Auftraggeber != null && s.Auftraggeber.Bezeichnung.ToLower().Contains(keyword)) ||
(s.SerienNummer.Contains(keyword)) ||
(s.Bezeichnung.ToLower().Contains(keyword)) ||
(s.StammdatenKunde != null && s.StammdatenKunde.Name.ToLower().Contains(keyword)) ||
(s.BeginnVos.HasValue && s.BeginnVos == dateTime) ||
(s.VosDauer != null && s.VosDauer.Bezeichnung.ToLower().Contains(keyword)) ||
(s.Geraetewert.HasValue && s.Geraetewert.Value.ToString().Contains(keyword))
));
Unfortunately this leads to a lot of redundancy and my query is executed twice. Is there any better solution?
I agree about the redundancy. What you can do is cache the count result for the specific parameters (keyword, dateTime) for a certain amount of time and reuse it on subsequent calls that deliver paginated (Skip, Take) results from the StammdatenRepository.
This way the overall count query only runs once for a given set of parameters.
Found this SO question where a respected member states:
Thinking about it from a SQL point of view, I can't think of a way in a single normal query to retrieve both the total count and a subset of the data, so I don't think you will be able to do it in LINQ either.
So I really think you have to cache some count results to improve performance. You know best how to do that for your specific situation and whether it's worth it at all...

Slick compare table column with null

Can someone please tell me how to compare a table column against NULL using Slick's lifted embedding.
What I want to achieve in MySQL:
select * from Users where email = '<some_email_addr>' and ( removed = NULL OR removed <= 1 )
It gives me an error at x.removed === null when I try:
val q = for {
x <- Users
if x.email === email && ( x.removed === null || x.removed <= 1 )
} yield x.removed
Thanks
Try:
x.removed.isNull
I think that is what you are looking for.
As daaatz said and per http://slick.typesafe.com/doc/2.1.0/upgrade.html#isnull-and-isnotnull you should use isEmpty, but it only works on Option columns. A workaround is x.removed.?.isEmpty.
Try
x.removed.isEmpty()
There are no nulls in Scala ;-)