Pyspark join on multiple aliased table columns - pyspark

Python doesn't like the ampersand below.
I get the error: & is not a supported operation for types str and str. Please review your code.
Any idea how to get this right? I've never tried to join more than 1 column for aliased tables. Thx!!
df_initial_sample = df_crm.alias('crm').join(df_cngpt.alias('cng'), on= (("crm.id=cng.id") & ("crm.cpid = cng.cpid")), how = "inner")

Try using as below -
df_initial_sample = df_crm.alias('crm').join(df_cngpt.alias('cng'), on= (["id"] and ["cpid"]), how = "inner")

Your join condition is overcomplicated. It can be as simple as this
df_initial_sample = df_crm.join(df_cngpt, on=['id', 'cpid'], how = 'inner')

Related

How to get strings separated by commas from a list to a query in PySpark?

I want to generate a query by using a list in PySpark
list = ["hi#gmail.com", "goodbye#gmail.com"]
query = "SELECT * FROM table WHERE email IN (" + list + ")"
This is my desired output:
query
SELECT * FROM table WHERE email IN ("hi#gmail.com", "goodbye#gmail.com")
Instead I'm getting: TypeError: cannot concatenate 'str' and 'list' objects
Can anyone help me achieve this? Thanks
If someone's having the same issue, I found that you can use the following code:
"'"+"','".join(map(str, emails))+"'"
and you will have the following output:
SELECT * FROM table WHERE email IN ('hi#gmail.com', 'goodbye#gmail.com')
Try this:
Dataframe based approach -
df = spark.createDataFrame([(1,"hi#gmail.com") ,(2,"goodbye#gmail.com",),(3,"abc#gmail.com",),(4,"xyz#gmail.com")], ['id','email_id'])
email_filter_list = ["hi#gmail.com", "goodbye#gmail.com"]
df.where(col('email_id').isin(email_filter_list)).show()
Spark SQL based approach -
df = spark.createDataFrame([(1,"hi#gmail.com") ,(2,"goodbye#gmail.com",),(3,"abc#gmail.com",),(4,"xyz#gmail.com")], ['id','email_id'])
df.createOrReplaceTempView('t1')
sql_filter = ','.join(["'" +i + "'" for i in email_filter_list])
spark.sql("SELECT * FROM t1 WHERE email_id IN ({})".format(sql_filter)).show()

SELECT with substring in ON clause?

I have the following select statement in ABAP:
SELECT munic~mandt VREFER BIS AB ZZELECDATE ZZCERTDATE CONSYEAR ZDIMO ZZONE_M ZZONE_T USAGE_M USAGE_T M2MC M2MT M2RET EXEMPTMCMT EXEMPRET CHARGEMCMT
INTO corresponding fields of table GT_INSTMUNIC_F
FROM ZCI00_INSTMUNIC AS MUNIC
INNER JOIN EVER AS EV on
MUNIC~POD = EV~VREFER(9).
"where EV~BSTATUS = '14' or EV~BSTATUS = '32'.
My problem with the above statement is that does not recognize the substring/offset operation on the 'ON' clause. If i remove the '(9) then
it recognizes the field, otherwise it gives error:
Field ev~refer is unknown. It is neither in one of the specified tables
nor defined by a "DATA" statement. I have also tried doing something similar in the 'Where' clause, receiving a similar error:
LOOP AT gt_instmunic.
clear wa_gt_instmunic_f.
wa_gt_instmunic_f-mandt = gt_instmunic-mandt.
wa_gt_instmunic_f-bis = gt_instmunic-bis.
wa_gt_instmunic_f-ab = gt_instmunic-ab.
wa_gt_instmunic_f-zzelecdate = gt_instmunic-zzelecdate.
wa_gt_instmunic_f-ZZCERTDATE = gt_instmunic-ZZCERTDATE.
wa_gt_instmunic_f-CONSYEAR = gt_instmunic-CONSYEAR.
wa_gt_instmunic_f-ZDIMO = gt_instmunic-ZDIMO.
wa_gt_instmunic_f-ZZONE_M = gt_instmunic-ZZONE_M.
wa_gt_instmunic_f-ZZONE_T = gt_instmunic-ZZONE_T.
wa_gt_instmunic_f-USAGE_M = gt_instmunic-USAGE_M.
wa_gt_instmunic_f-USAGE_T = gt_instmunic-USAGE_T.
temp_pod = gt_instmunic-pod.
SELECT vrefer
FROM ever
INTO wa_gt_instmunic_f-vrefer
WHERE ( vrefer(9) LIKE temp_pod ). " PROBLEM WITH SUBSTRING
"AND ( BSTATUS = '14' OR BSTATUS = '32' ).
ENDSELECT.
WRITE: / sy-dbcnt.
WRITE: / 'wa is: ', wa_gt_instmunic_f.
WRITE: / 'wa-ever is: ', wa_gt_instmunic_f-vrefer.
APPEND wa_gt_instmunic_f TO gt_instmunic_f.
WRITE: / wa_gt_instmunic_f-vrefer.
ENDLOOP.
itab_size = lines( gt_instmunic_f ).
WRITE: / 'Internal table populated with', itab_size, ' lines'.
The basic task i want to implement is to modify a specific field on one table,
pulling values from another. They have a common field ( pod = vrefer(9) ). Thanks in advance for your time.
If you are on a late enough NetWeaver version, it works on 7.51, you can use the OpenSQL function LEFT or SUBSTRING. Your query would look something like:
SELECT munic~mandt VREFER BIS AB ZZELECDATE ZZCERTDATE CONSYEAR ZDIMO ZZONE_M ZZONE_T USAGE_M USAGE_T M2MC M2MT M2RET EXEMPTMCMT EXEMPRET CHARGEMCMT
FROM ZCI00_INSTMUNIC AS MUNIC
INNER JOIN ever AS ev
ON MUNIC~POD EQ LEFT( EV~VREFER, 9 )
INTO corresponding fields of table GT_INSTMUNIC_F.
Note that the INTO clause needs to move to the end of the command as well.
field(9) is a subset operation that is processed by the ABAP environment and can not be translated into a database-level SQL statement (at least not at the moment, but I'd be surprised if it ever will be). Your best bet is either to select the datasets separately and merge them manually (if both are approximately equally large) or pre-select one and use a FAE/IN clause.
They have a common field ( pod = vrefer(9) )
This is a wrong assumption, because they both are not fields, but a field an other thing.
If you really need to do that task through SQL, I'll suggest you to check native SQL sentences like SUBSTRING and check if you can manage to use them within an EXEC_SQL or (better) the CL_SQL* classes.

Column is not updating in postgresql

I tried to update my table like below:
$query = "select *
FROM sites s, companies c, tests t
WHERE t.test_siteid = s.site_id
AND c.company_id = s.site_companyid
AND t.test_typeid = '20' AND s.site_id = '1337'";
$queryrow = $db->query($query);
$results = $queryrow->as_array();
foreach($results as $key=>$val){
$update = "update tests set test_typeid = ? , test_testtype = ? where test_siteid = ?";
$queryrow = $db->query($update,array('10','Meter Calibration Semi Annual',$val->site_id));
}
The above code is working good. But in update query , The column test_typeid is not updated with '10'. Column test_typeid is updating with empty value. Other columns are updating good. I dont know why this column test_typeid is not updating? And the column test_typeid type is integer only. I am using postgreSql
And My table definition is:
What i did wrong with the code. Kindly advice me on this.
Thanks in advance.
First, learn to use proper JOIN syntax. Never use commas in the FROM clause; always use proper explicit JOIN syntax.
You can write the query in one statement:
update tests t
set test_typeid = '10',
test_testtype = 'Meter Calibration Semi Annual'
from sites s join
companies c
on c.company_id = s.site_companyid
where t.test_siteid = s.site_id and
t.test_typeid = 20 and s.site_id = 1337;
I assume the ids are numbers, so there is no need to use single quotes for the comparisons.

Column name cannot be resolved in SparkSQL join

I'm not sure why this is happening. In PySpark, I read in two dataframes and print out their column names and they are as expected, but then when do a SQL join I get an error that cannot resolve column name given the inputs. I have simplified the merge just to get it to work, but I will need to add in more join conditions which is why I'm using SQL (will be adding in: "and b.mnvr_bgn < a.idx_trip_id and b.mnvr_end > a.idx_trip_data"). It appears that the column 'device_id' is being renamed to '_col7' in the df mnvr_temp_idx_prev_temp
mnvr_temp_idx_prev = mnvr_3.select('device_id', 'mnvr_bgn', 'mnvr_end')
print mnvr_temp_idx_prev.columns
['device_id', 'mnvr_bgn', 'mnvr_end']
raw_data_filtered = raw_data.select('device_id', 'trip_id', 'idx').groupby('device_id', 'trip_id').agg(F.max('idx').alias('idx_trip_end'))
print raw_data_filtered.columns
['device_id', 'trip_id', 'idx_trip_end']
raw_data_filtered.registerTempTable('raw_data_filtered_temp')
mnvr_temp_idx_prev.registerTempTable('mnvr_temp_idx_prev_temp')
test = sqlContext.sql('SELECT a.device_id, a.idx_trip_end, b.mnvr_bgn, b.mnvr_end \
FROM raw_data_filtered_temp as a \
INNER JOIN mnvr_temp_idx_prev_temp as b \
ON a.device_id = b.device_id')
Traceback (most recent call last): AnalysisException: u"cannot resolve 'b.device_id' given input columns: [_col7, trip_id, device_id, mnvr_end, mnvr_bgn, idx_trip_end]; line 1 pos 237"
Any help is appreciated!
I would recommend renaming the name of the field 'device_id' in at least one of the data frame. I modified your query just a bit and tested it(in scala). Below query works
test = sqlContext.sql("select * FROM raw_data_filtered_temp a INNER JOIN mnvr_temp_idx_prev_temp b ON a.device_id = b.device_id")
[device_id: string, mnvr_bgn: string, mnvr_end: string, device_id: string, trip_id: string, idx_trip_end: string]
Now if you are doing a 'select * ' in above statement, it will work. But if you try to select 'device_id', you will get an error "Reference 'device_id' is ambiguous" . As you can see in the above 'test' data frame definition, it has two fields with the same name(device_id). So to avoid this, I recommend changing field name in one of the dataframes.
mnvr_temp_idx_prev = mnvr_3.select('device_id', 'mnvr_bgn', 'mnvr_end')
.withColumnRenamned("device_id","device")
raw_data_filtered = raw_data.select('device_id', 'trip_id', 'idx').groupby('device_id', 'trip_id').agg(F.max('idx').alias('idx_trip_end'))
Now use dataframes or sqlContext
//using dataframes with multiple conditions
val test = mnvr_temp_idx_prev.join(raw_data_filtered,$"device" === $"device_id"
&& $"mnvr_bgn" < $"idx_trip_id","inner")
//in SQL Context
test = sqlContext.sql("select * FROM raw_data_filtered_temp a INNER JOIN mnvr_temp_idx_prev_temp b ON a.device_id = b.device and a. idx_trip_id < b.mnvr_bgn")
Above queries will work for your problem. And if your data set is too large, I would recommend to not use '>' or '<' operators in Join condition as it causes cross join which is a costly operation if data set is large. Instead use them in WHERE condition.

Please help transform Tsql "implicit joins" into explicit ones

Sorry, I am pretty much an SQL noob. This has to work in MSFT SQL, Oracle as well as Sybase. In the following snippet I need to change an inner join between IJ and KL on IJ.PO_id = KL.PO_id into a left join also on IJ.PO_id = KL.PO_id. So, I believe I have to re-factor this. Well, implicit joins are not the most readable, at least in my co-worker's eyes. I guess I will agree until I develop my own taste. Sorry, I mangled the table and field names just in case.
/* #IJ_id is an input stored proc patrameter. */
from AB,
CD,
EF,
GH,
IJ,
KL
where
EF.EF_id = IJ.EF_id and
IJ.EF_id = AB.EF_id and
EF.ZY_id = IJ.ZY_id and
IJ.ZY_id = AB.ZY_id and
IJ.IJ_id = AB.IJ_id and
IJ.IJ_id = #IJ_id and
EF.XW_id = GH.GH_id and
AB.VU_code = CD.VU_code and
IJ.TS > 0 and
IJ.RQ = 0 and
EF.RQ = 0 and
AB.RQ = 0 and
IJ.PO_id = KL.PO_id;
Now, my difficulty is that there is a lot going on in the where clause. Things that do not look like a.b = c.d will remain in the where clause, but not all stuff that does look like a.b = c.d look easy to convert into an explicit join. The difficult part is that ideally the conditions would be between neighbors - AB+CD, CD+EF, EF+GH, GH+IJ, IJ+KL but they are not that organized right now. I could re-order some, but ultimately I do not want to forget my goal: I want the new query to be no slower, and I want the new query to be no less readable. It seems that I might be better off hacking just the part that I need to change, and leave it mostly the same. I am not sure if I can do that.
If you understood my intent, please suggest a better query. if you did not, then please tell me how I can improve the question. Thanks.
I think it should be something like this:
FROM AB
JOIN CD ON AB.VU_code = CD.VU_code
JOIN IJ ON IJ.EF_id = AB.EF_id AND IJ.ZY_id = AB.ZY_id AND IJ.IJ_id = AB.IJ_id
JOIN EF ON EF.EF_id = IJ.EF_id AND EF.ZY_id = IJ.ZY_id
JOIN GH ON EF.XW_id = GH.GH_id
JOIN KL ON IJ.PO_id = KL.PO_id
WHERE
IJ.IJ_id = #IJ_id AND
IJ.TS > 0 AND
IJ.RQ = 0 AND
EF.RQ = 0 AND
AB.RQ = 0
I have tried to arrange the tables such that the following rules hold:
Every join condition mentions the new table that it joining on one side.
No table is mentioned in a join condition if that table has not been joined yet.
Conditions where one of the operands is a constant are left as a WHERE condition.
The last rule is a difficult one - it is not possible to tell from your mangled names whether a condition ought to be part of a join or part of the where clause. Both will give the same result for an INNER JOIN. Whether the condition should be part of the join or part of the where clause depends on the semantics of the relationship between the tables.
You need to consider each condition on a case-by-case basis:
Does it define the relationship between the two tables? Put it in the JOIN.
Is it a filter on the results? Put it in the WHERE clause.
Some guidelines:
A condition that includes a parameter from the user is unlikely to be something that should be moved to a join.
Inequalities are not usually found in join conditions.
It couldn't possibly get any less readable than the example you gave...
from AB a
join CD c on a.VU_Code = c.VU_Code
join EF e on a.EF_id = e.EF_id and e.RQ = 0
join GH g on e.XW_id = g.GH_id
join IJ i on a.IJ_id = i.IJ_id and e.EF_id = i.EF_id
and a.EF_id = i.EF_id and e.ZY_id = i.ZY_id
and a.ZY_id = i.ZY_id and i.TS > 0 and i.RQ = 0
LEFT join KL k on i.PO_id = k.PO_id
where
i.IJ_id = #IJ_id and
a.RQ = 0
Use:
FROM AB t1
JOIN CD t2 ON t2.VU_code = t1.VU_code
JOIN GH t4 ON t4.gh_id = t3.xw_id
JOIN IJ t5 ON t5.ZY_id = t1.ZY_id
AND t5.IJ_id = t1.IJ_id
AND t5.EF_id = t1.EF_id
AND t5.IJ_id = #IJ_id
AND t5.TS > 0
AND t5.RQ = 0
JOIN EF t3 ON t3.ef_id = t5.ef_id
AND t3.zy_id = t5.zy_id
AND t3.RQ = 0
JOIN KL t6 ON t6.po_id = t5.po_id -- Add LEFT before JOIN for LEFT JOIN
WHERE ab.qu = 0
They're aliased in the sequence of the original ANSI-89 syntax, but the order is adjusted due to alias reference - can't reference a table alias before it's been defined.
This is ANSI-92 JOIN syntax - there's no performance benefit, but it does mean that OUTER join syntax is consistent. Just have to add LEFT before the "JOIN KL ..." to turn that into a LEFT JOIN.