Concat columns by joining multiple DataFrames - Scala

I have multiple DataFrames and I need to concatenate the address and zip columns based on a condition. Actually, I had a SQL query that I need to convert to a DataFrame join.
I wrote a UDF which works fine for concatenating multiple columns into a single column:
val getConcatenated = udf((first: String, second: String, third: String, fourth: String, five: String, six: String) =>
  first + "," + second + "," + third + "," + fourth + "," + five + "," + six)
MySQL query:
select
CONCAT(al.Address1,',',al.Address2,',',al.Zip) AS AtAddress,
CONCAT(rl.Address1,',',rl.Address2,',',rl.Zip) AS RtAddress,
CONCAT(d.Address1,',',d.Address2,',',d.Zip) AS DAddress,
CONCAT(s.Address1,',',s.Address2,',',s.Zip) AS SAGddress,
CONCAT(vl.Address1,',',vl.Address2,',',vl.Zip) AS VAddress,
CONCAT(sg.Address1,',',sg.Address2,',',sg.Zip) AS SAGGddress
FROM
si s inner join
at a on s.cid = a.cid
inner join De d on s.cid = d.cid AND d.aid = a.aid
inner join SGrpM sgm on s.cid = sgm.cid and s.sid =sgm.sid and sgm.status=1
inner join SeG sg on sgm.cid =sg.cid and sgm.gid =sg.gid
inner join bd bu on s.cid = bu.cid and s.sid =bu.sid
inner join locas al on a.ALId = al.lid
inner join locas rl on a.RLId = rl.lid
inner join locas vl on a.VLId = vl.lid
I am facing an issue when joining the DataFrames: it gives me null values.
val DS = DS_SI.join(at, Seq("cid", "sid"), "inner")
  .join(DS_DE, Seq("cid", "aid"), "inner")
  .join(DS_SGrpM, Seq("cid", "sid"), "inner")
  .join(DS_SG, Seq("cid", "gid"), "inner")
  .join(DS_BD, Seq("cid", "sid"), "inner")
  .join(DS_LOCAS, at("ALId") <=> DS_LOCAS("lid") && at("RLId") <=> DS_LOCAS("lid") && at("VLId") <=> DS_LOCAS("lid"), "inner")
I am trying to join my DataFrames like the above, which is not giving me proper results, and then I want to concatenate by adding the columns:
.withColumn("AtAddress",getConcatenated())
.withColumn("RtAddress",getConcatenated())....
Can anyone tell me how to achieve this effectively? Am I joining the DataFrames correctly, or is there a better approach for this?
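A side note on the last join above: it ANDs all three location keys (ALId, RLId, VLId) against a single copy of the locas data, so a row only survives when all three keys point at the same lid. The SQL query instead joins locas three times under different aliases (al, rl, vl). A toy plain-Python sketch of the difference (hypothetical data, not the Spark API):

```python
locas = {1: "addr-1", 2: "addr-2", 3: "addr-3"}   # lid -> address
at_row = {"ALId": 1, "RLId": 2, "VLId": 3}

# Single join with all three keys ANDed: only matches if ALId == RLId == VLId
single = [lid for lid in locas
          if lid == at_row["ALId"] == at_row["RLId"] == at_row["VLId"]]
print(single)  # [] -- nothing survives

# Three separate joins (one lookup per alias), like al/rl/vl in the SQL
al, rl, vl = (locas[at_row[k]] for k in ("ALId", "RLId", "VLId"))
print(al, rl, vl)  # addr-1 addr-2 addr-3
```

In Spark the equivalent is three `.join(DS_LOCAS.as("al"), ...)`-style joins, one per alias, rather than one join with all conditions combined.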

You can use concat_ws(separator, columns_to_concat).
Example:
import org.apache.spark.sql.functions._
df.withColumn("title", concat_ws(", ", DS_DE("Address1"), DS_DE("Address2"), DS_DE("Zip")))
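One reason concat_ws helps with the null results: in both MySQL's CONCAT and Spark's concat, a single NULL operand makes the whole result NULL, while concat_ws simply skips NULLs. A plain-Python sketch of the two behaviors (just the semantics, not the Spark API):

```python
def concat(*parts):
    # CONCAT-style: one NULL (None) operand nulls the entire result
    if any(p is None for p in parts):
        return None
    return "".join(parts)

def concat_ws(sep, *parts):
    # concat_ws-style: NULL operands are skipped, not propagated
    return sep.join(p for p in parts if p is not None)

print(concat("a", ",", None))          # None
print(concat_ws(",", "a", None, "b"))  # a,b
```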

Related

Postgresql, how to SELECT json object without duplicated rows

I'm trying to find out how to get JSON object results of selected rows, and not to show duplicated rows.
My current query:
SELECT DISTINCT ON (vp.id) jsonb_agg(jsonb_build_object('affiliate',a.*)) as affiliates, jsonb_agg(jsonb_build_object('vendor',vp.*)) as vendors FROM
affiliates a
INNER JOIN related_affiliates ra ON a.id = ra.affiliate_id
INNER JOIN related_vendors rv ON ra.product_id = rv.product_id
INNER JOIN vendor_partners vp ON rv.vendor_partner_id = vp.id
WHERE ra.product_id = 79 AND a.is_active = true
GROUP BY vp.id
The result I receive from this is:
[
affiliates: {
affiliate: affiliate1
affiliate: affiliate2
},
vendors: {
vendor: vendor1,
vendor: vendor1,
}
As you can see in the second record, vendor is still vendor1 because there are no more results, so I'd like to also know if there's a way to remove duplicates.
Thanks.
First point: the result you show above is not valid JSON: the keys are not double-quoted, the string values are not double-quoted, and duplicate keys in the same JSON object ('{"affiliate": "affiliate1", "affiliate": "affiliate2"}'::jsonb) are not accepted by the jsonb type (though they are by the json type).
Second point: you can try adding the DISTINCT keyword directly inside the jsonb_agg function:
jsonb_agg(DISTINCT jsonb_build_object('vendor',vp.*))
and remove the DISTINCT ON (vp.id) clause.
You can also add an ORDER BY clause directly in any aggregate function. For more information, see the manual.
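The duplication itself comes from the join: each vendor row is repeated once per matching affiliate before aggregation, and jsonb_agg then aggregates the repeats. A toy Python sketch (hypothetical rows, not SQL) of why deduplicating inside the aggregate fixes it:

```python
# Rows after the join: one (vendor, affiliate) pair per match
joined = [("vendor1", "affiliate1"), ("vendor1", "affiliate2")]

vendors = [v for v, _ in joined]          # plain aggregate keeps the repeats
vendors_distinct = sorted(set(vendors))   # DISTINCT dedupes before aggregating
print(vendors)           # ['vendor1', 'vendor1']
print(vendors_distinct)  # ['vendor1']
```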
You could aggregate first, then join on the results of the aggregates:
SELECT a.affiliates, v.vendors
FROM (
select af.id, jsonb_agg(jsonb_build_object('affiliate',af.*)) as affiliates
from affiliates af
where af.is_active = true
group by af.id
) a
JOIN related_affiliates ra ON a.id = ra.affiliate_id
JOIN related_vendors rv ON ra.product_id = rv.product_id
JOIN (
select vp.id, jsonb_agg(jsonb_build_object('vendor',vp.*)) as vendors
from vendor_partners vp
group by vp.id
) v ON rv.vendor_partner_id = v.id
WHERE ra.product_id = 79
Note that the is_active filter moves into the subquery, because the outer alias a only exposes id and affiliates.

Syntax error in raw SQL on Rails when querying inner results

How do you query raw SQL on Rails?
I have this raw SQL that I need to run on Rails, but it gives me a syntax error. I also escaped the extra parentheses but still got a syntax error near the first inner join.
Here's my code:
Spree::CorporateAccount.joins(" (((((
( inner join spree_memberships on spree_corporate_accounts.id = spree_memberships.corporate_account_id)
inner join spree_users on spree_memberships.user_id = spree_users.id)
left join spree_variant_price_sets on spree_corporate_accounts.variant_price_set_id = spree_variant_price_sets.id)
left join spree_addresses on spree_corporate_accounts.bill_address_id = spree_addresses.id)
left join spree_states on spree_addresses.state_id = spree_states.id)
left join spree_countries on spree_addresses.country_id = spree_countries.id)
left join spree_partner_accounts on spree_corporate_accounts.id = spree_partner_accounts.partnerable_id
").where(" spree_memberships.deleted_at IS null
AND (spree_partner_accounts.partnerable_type = 'Spree::CorporateAccount' OR spree_partner_accounts.partnerable_type IS NULL)
AND admin = true
")
But this runs perfectly fine in SQL:
SELECT
spree_corporate_accounts.id,
spree_corporate_accounts.company_name,
spree_memberships.ADMIN,
spree_users.email,
spree_users.doctor AS name,
spree_partner_accounts.account_key,
CASE spree_corporate_accounts.billing_type WHEN 1 THEN 'Postbill' WHEN 2 THEN 'Creditcard' ELSE 'Individual' END,
spree_variant_price_sets.name AS priceset,
spree_addresses.address1,
spree_addresses.address2,
spree_addresses.city,
spree_addresses.zipcode,
spree_states.name AS state,
spree_countries.name AS country,
spree_addresses.phone,
spree_users.created_at
from (((((( spree_corporate_accounts inner join spree_memberships on spree_corporate_accounts.id = spree_memberships.corporate_account_id)
inner join spree_users on spree_memberships.user_id = spree_users.id)
left join spree_variant_price_sets on spree_corporate_accounts.variant_price_set_id = spree_variant_price_sets.id)
left join spree_addresses on spree_corporate_accounts.bill_address_id = spree_addresses.id)
left join spree_states on spree_addresses.state_id = spree_states.id)
left join spree_countries on spree_addresses.country_id = spree_countries.id)
left join spree_partner_accounts on spree_corporate_accounts.id = spree_partner_accounts.partnerable_id
where spree_memberships.deleted_at IS null
and (spree_partner_accounts.partnerable_type = 'Spree::CorporateAccount' OR spree_partner_accounts.partnerable_type IS NULL )
AND admin = true
If I do this instead, it yields a different result, so what I'm thinking is that the parentheses make the database evaluate the inner result first before moving on to the next join.
Spree::CorporateAccount.joins("inner join spree_memberships on spree_corporate_accounts.id = spree_memberships.corporate_account_id")
.joins("inner join spree_users on spree_memberships.user_id = spree_users.id")
.joins("left join spree_variant_price_sets on spree_corporate_accounts.variant_price_set_id = spree_variant_price_sets.id")
.joins("left join spree_addresses on spree_corporate_accounts.bill_address_id = spree_addresses.id")
.joins("left join spree_states on spree_addresses.state_id = spree_states.id")
.joins("left join spree_countries on spree_addresses.country_id = spree_countries.id")
.joins("left join spree_partner_accounts on spree_corporate_accounts.id = spree_partner_accounts.partnerable_id ")
.where("spree_memberships.deleted_at IS null
AND spree_partner_accounts.partnerable_type = 'Spree::CorporateAccount' OR spree_partner_accounts.partnerable_type IS NULL
AND admin = true
")
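One difference worth noting between the two versions (an observation, not stated in the question): the raw SQL wraps the OR in parentheses, while the chained .joins version's where string does not. Since AND binds tighter than OR, the two WHERE clauses can match different rows, which alone would explain the different result. A minimal Python sketch of the precedence (Python's and/or bind the same way as SQL's AND/OR):

```python
# Example row: a soft-deleted membership with NULL partnerable_type, admin = true
deleted_at_is_null = False
ptype_is_corporate = False
ptype_is_null = True
admin = True

# Without parentheses (as in the chained version): AND groups first
no_parens = (deleted_at_is_null and ptype_is_corporate) or (ptype_is_null and admin)
# With parentheses (as in the raw SQL)
with_parens = deleted_at_is_null and (ptype_is_corporate or ptype_is_null) and admin

print(no_parens, with_parens)  # True False
```

So the unparenthesized version lets this soft-deleted row through, while the raw SQL excludes it.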

Implementing Concat + RANK OVER SQL Clause in C# LINQ

I need to implement the following T-SQL query:
SELECT
CONCAT( RANK() OVER (ORDER BY [Order].codOrder, [PackedOrder].codPackedProduct ), '/2') as Item,
[Order].codOrder as [OF],
[PackedOrder].codLine as [Ligne],
[PackedOrder].codPackedProduct as [Material],
[Product].lblPProduct as [Product],
[PackedProduct].lblPackedProduct as [MaterialDescription],
[PackedOrder].codPackedBatch as [Lot],
[Product].codCustomerColor as [ReferenceClient],
[PackedOrder].nbrPackedQuantity as [Quantity],
[PackedOrder].nbrLabelToPrint as [DejaImprime]
FROM [Order] INNER JOIN PackedOrder
ON [Order].codOrder = PackedOrder.codOrder INNER JOIN Product
ON [Order].codProduct = Product.codProduct INNER JOIN PackedProduct
ON PackedOrder.codPackedProduct = PackedProduct.codPackedProduct
Where [Order].codOrder = 708243075
So far, I'm able to do:
var result =
from order1 in Orders
join packedorder1 in PackedOrders on order1.codOrder equals packedorder1.codOrder
join product1 in Products on order1.codProduct equals product1.codProduct
join packedproduct1 in PackedProducts on packedorder1.codPackedProduct equals packedproduct1.codPackedProduct
where order1.codOrder == _order.codOrder
select new FinishedProductPrintingM
{
OF = order1.codOrder,
Ligne = packedorder1.codLine,
Material = packedorder1.codPackedProduct,
Produit = product1.codProductType,
MaterialDescription = packedproduct1.lblPackedProduct,
Lot = packedorder1.codPackedBatch,
RéférenceClient = product1.codCustomerColor,
Quantité = packedorder1.nbrPackedQuantity,
Déjàimprimé = packedorder1.nbrLabelPrinted
};
Please let me know whether this is possible; I need to display the items in that way. Feel free to add your comments.
I am not aware of how to use CONCAT and RANK OVER in LINQ.
Can anyone help me convert my SQL query into LINQ?
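LINQ has no direct equivalent of RANK() OVER, but after ordering the joined rows you can number them client-side. A plain-Python sketch of the tie-aware numbering plus the CONCAT with the literal '/2' suffix from the query (rows are hypothetical):

```python
# RANK() OVER (ORDER BY codOrder, codPackedProduct), then CONCAT(rank, '/2').
rows = [("708243075", "MAT-B"), ("708243075", "MAT-A"), ("708243075", "MAT-A")]
ordered = sorted(rows)                    # ORDER BY codOrder, codPackedProduct
ranks = []
for i, row in enumerate(ordered):
    if i > 0 and row == ordered[i - 1]:
        ranks.append(ranks[-1])           # tie: same rank as the previous row
    else:
        ranks.append(i + 1)               # RANK() skips numbers after a tie
items = [f"{r}/2" for r in ranks]
print(items)  # ['1/2', '1/2', '3/2']
```

In C#, `ordered.Select((row, i) => ...)` gives ROW_NUMBER-style values (1, 2, 3, ...); true RANK only differs when the ORDER BY keys tie, as sketched above.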

Column name cannot be resolved in SparkSQL join

I'm not sure why this is happening. In PySpark, I read in two dataframes and print out their column names, and they are as expected, but when I do a SQL join I get an error that a column name cannot be resolved given the inputs. I have simplified the merge just to get it to work, but I will need to add more join conditions, which is why I'm using SQL (I will be adding: "and b.mnvr_bgn < a.idx_trip_id and b.mnvr_end > a.idx_trip_data"). It appears that the column 'device_id' is being renamed to '_col7' in the df mnvr_temp_idx_prev_temp.
mnvr_temp_idx_prev = mnvr_3.select('device_id', 'mnvr_bgn', 'mnvr_end')
print mnvr_temp_idx_prev.columns
['device_id', 'mnvr_bgn', 'mnvr_end']
raw_data_filtered = raw_data.select('device_id', 'trip_id', 'idx').groupby('device_id', 'trip_id').agg(F.max('idx').alias('idx_trip_end'))
print raw_data_filtered.columns
['device_id', 'trip_id', 'idx_trip_end']
raw_data_filtered.registerTempTable('raw_data_filtered_temp')
mnvr_temp_idx_prev.registerTempTable('mnvr_temp_idx_prev_temp')
test = sqlContext.sql('SELECT a.device_id, a.idx_trip_end, b.mnvr_bgn, b.mnvr_end \
FROM raw_data_filtered_temp as a \
INNER JOIN mnvr_temp_idx_prev_temp as b \
ON a.device_id = b.device_id')
Traceback (most recent call last): AnalysisException: u"cannot resolve 'b.device_id' given input columns: [_col7, trip_id, device_id, mnvr_end, mnvr_bgn, idx_trip_end]; line 1 pos 237"
Any help is appreciated!
I would recommend renaming the field 'device_id' in at least one of the dataframes. I modified your query a bit and tested it (in Scala). The query below works:
test = sqlContext.sql("select * FROM raw_data_filtered_temp a INNER JOIN mnvr_temp_idx_prev_temp b ON a.device_id = b.device_id")
[device_id: string, mnvr_bgn: string, mnvr_end: string, device_id: string, trip_id: string, idx_trip_end: string]
Now if you do a 'select *' in the above statement, it will work. But if you try to select 'device_id', you will get the error "Reference 'device_id' is ambiguous". As you can see in the 'test' dataframe definition above, it has two fields with the same name (device_id). To avoid this, I recommend changing the field name in one of the dataframes:
mnvr_temp_idx_prev = mnvr_3.select('device_id', 'mnvr_bgn', 'mnvr_end') \
    .withColumnRenamed("device_id", "device")
raw_data_filtered = raw_data.select('device_id', 'trip_id', 'idx').groupby('device_id', 'trip_id').agg(F.max('idx').alias('idx_trip_end'))
Now use DataFrames or the sqlContext:
//using dataframes with multiple conditions
val test = mnvr_temp_idx_prev.join(raw_data_filtered,$"device" === $"device_id"
&& $"mnvr_bgn" < $"idx_trip_id","inner")
//in SQL Context
test = sqlContext.sql("select * FROM raw_data_filtered_temp a INNER JOIN mnvr_temp_idx_prev_temp b ON a.device_id = b.device and a.idx_trip_id < b.mnvr_bgn")
The above queries should work for your problem. If your data set is large, I would recommend not putting '>' or '<' operators in the join condition: a pure non-equi join cannot use a hash or sort-merge join and may be planned as a costly cross-join-like operation. Put the range predicates in the WHERE condition instead.
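The join-then-filter advice can be sketched in plain Python over toy rows (hypothetical data; in Spark the equi part becomes the join condition and the range part a WHERE filter):

```python
a_rows = [{"device": "d1", "idx_trip_end": 10}]
b_rows = [{"device": "d1", "mnvr_bgn": 3},
          {"device": "d1", "mnvr_bgn": 20}]

# Equi-join on the device key
joined = [(a, b) for a in a_rows for b in b_rows if a["device"] == b["device"]]
# Range predicate applied afterwards, like a WHERE clause
filtered = [(a, b) for a, b in joined if b["mnvr_bgn"] < a["idx_trip_end"]]
print(len(joined), len(filtered))  # 2 1
```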

Tuning Sql Query With Multiple Inner Joins On Same Table

Because of the structure of the database tables, I have to perform the following query by performing a series of inner joins on the same table. Is there any way to optimize such a query?
The query plan is here: http://explain.depesz.com/s/vAvx
SELECT
cch.id || '|' || ff.id || '|' || fc.id || '|' || tf.id || '|' || tc.id AS id,
cch.id as compare_cache_header_id, cch.client_id, --[ADDED]
base_env_adapter_id, changed_fund_config_id, type,
true AS new_money, 'BOTH'::text AS online_transfer, 'BOTH'::text AS automatic_transfer, true AS short_term_fee, true AS rebalancing, '*'::text AS target_units, 'ALL_INTERFACES'::text AS trade_allowed,
from_fund_code,
ff.en_short_display_name as from_fund_en_short_display_name, ff.en_med_display_name as from_fund_en_med_display_name, ff.en_long_display_name as from_fund_en_long_display_name,
ff.fr_short_display_name as from_fund_fr_short_display_name, ff.fr_med_display_name as from_fund_fr_med_display_name, ff.fr_long_display_name as from_fund_fr_long_display_name,
to_fund_code, tf.en_short_display_name as to_fund_en_short_display_name, tf.en_med_display_name as to_fund_en_med_display_name, tf.en_long_display_name as to_fund_en_long_display_name,
tf.fr_short_display_name as to_fund_fr_short_display_name, tf.fr_med_display_name as to_fund_fr_med_display_name, tf.fr_long_display_name as to_fund_fr_long_display_name,
cct.from_class_code, fc.english_name as from_class_english_name, fc.french_name as from_class_french_name, fc.en_display_name as from_class_en_display_name, fc.fr_display_name as from_class_fr_display_name,
cct.to_class_code, tc.english_name as to_class_english_name, tc.french_name as to_class_french_name, tc.en_display_name as to_class_en_display_name, tc.fr_display_name as to_class_fr_display_name
FROM compare_cache_header cch
INNER JOIN compare_cache_transfer cct
ON cch.id = cct.compare_cache_header_id
INNER JOIN fund ff
ON cch.changed_fund_config_id = ff.fund_config_id and cct.from_fund_code = ff.fund_code
INNER JOIN fund tf
ON cch.changed_fund_config_id = tf.fund_config_id and cct.to_fund_code = tf.fund_code
INNER JOIN class fc
ON cch.changed_fund_config_id = fc.fund_config_id and cct.from_class_code = fc.class_code
INNER JOIN class tc
ON cch.changed_fund_config_id = tc.fund_config_id and cct.to_class_code = tc.class_code
WHERE cct.type = 'ADD';
Make sure that there are indexes defined on all of the columns used in the ON clauses and in the WHERE clause; with those in place, the query should be fairly quick.
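For example (hypothetical index names; composite indexes matching the join keys used above):

```sql
CREATE INDEX idx_cct_header_type ON compare_cache_transfer (compare_cache_header_id, type);
CREATE INDEX idx_fund_cfg_code   ON fund  (fund_config_id, fund_code);
CREATE INDEX idx_class_cfg_code  ON class (fund_config_id, class_code);
```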