Retrieve the valid party ids from the latest-record DataFrame based on a lookup against the invalid-party DataFrame - Scala

Input:
1. Latest Party Records (primary key: prty_id, latest record identified using: lst_upd_dt)
prty_id  country  role      lst_upd_dt
P1       IN       Partner   2022/03/01
P2       JP       VSI       2022/01/01
P3       CS       Vendor    2021/05/18
P4       US       Customer  2022/03/12
P5       CA       Partner   2022/10/01
P6       IN       Customer  2019/03/01
P7       CN       Vendor    2022/02/01
P8       BZ       Vendor    2020/09/15
2. Invalid Party Id Records
prty_id
P1
P7
P4
The required output is the valid party ids from the latest records:
prty_id  country  role      lst_upd_dt
P2       JP       VSI       2022/01/01
P3       CS       Vendor    2021/05/18
P5       CA       Partner   2022/10/01
P6       IN       Customer  2019/03/01
P8       BZ       Vendor    2020/09/15
I have done this with hard-coded filter conditions like below (note that show() returns Unit, so it is called separately rather than assigned):
val valid_id = new_part_data
  .filter($"prty_id" =!= "P1")
  .filter($"prty_id" =!= "P4")
  .filter($"prty_id" =!= "P7")
valid_id.show()
But the requirement is: the invalid parties should not be filtered out using hard-coded values; they should come from a parameter file. How can I use that to get the output?

You can do that using a left anti join:
val valid_id = new_part_data
  .join(
    right = invalid_id,
    usingColumns = Seq("prty_id"),
    joinType = "left_anti"
  )
valid_id.show()
// +-------+-------+--------+----------+
// |prty_id|country| role|lst_upd_dt|
// +-------+-------+--------+----------+
// | P2| JP| VSI|2022/01/01|
// | P3| CS| Vendor|2021/05/18|
// | P5| CA| Partner|2022/10/01|
// | P6| IN|Customer|2019/03/01|
// | P8| BZ| Vendor|2020/09/15|
// +-------+-------+--------+----------+
The left anti join keeps the rows of the left DataFrame (new_part_data) for which the right DataFrame (invalid_id, containing the invalid party ids) has no matching value in the prty_id column.
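To meet the no-hard-coding requirement, the invalid ids can be read from a parameter file at runtime and then removed with the anti join. A minimal plain-Python sketch of the same semantics (no Spark; the file name invalid_parties.txt and the one-id-per-line format are assumptions):

```python
# Plain-Python sketch of the left-anti-join semantics above.
# The parameter file name "invalid_parties.txt" is a made-up example;
# it is assumed to hold one party id per line.

latest = [
    {"prty_id": "P1", "country": "IN", "role": "Partner",  "lst_upd_dt": "2022/03/01"},
    {"prty_id": "P2", "country": "JP", "role": "VSI",      "lst_upd_dt": "2022/01/01"},
    {"prty_id": "P3", "country": "CS", "role": "Vendor",   "lst_upd_dt": "2021/05/18"},
    {"prty_id": "P4", "country": "US", "role": "Customer", "lst_upd_dt": "2022/03/12"},
    {"prty_id": "P5", "country": "CA", "role": "Partner",  "lst_upd_dt": "2022/10/01"},
    {"prty_id": "P6", "country": "IN", "role": "Customer", "lst_upd_dt": "2019/03/01"},
    {"prty_id": "P7", "country": "CN", "role": "Vendor",   "lst_upd_dt": "2022/02/01"},
    {"prty_id": "P8", "country": "BZ", "role": "Vendor",   "lst_upd_dt": "2020/09/15"},
]

def load_invalid_ids(path):
    # One party id per line; blank lines are ignored.
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def left_anti(rows, invalid_ids):
    # Keep rows whose prty_id has NO match in invalid_ids
    # (the semantics of Spark's "left_anti" join on prty_id).
    return [r for r in rows if r["prty_id"] not in invalid_ids]

valid = left_anti(latest, {"P1", "P7", "P4"})
```

In Spark itself, the same parameter file can be read into the invalid_id DataFrame (for example with spark.read) and joined exactly as shown above.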

Related

T-SQL Question for Getting One Customer Type When There Can be More Than One Value

We have an organization that can have more than one customer type. However, what a user wants to see is either the partner or the direct type: the customer type is Direct, Partner1, Partner2, or Partner3, and an organization can be Direct plus one partner value, but never more than one partner value. So if a customer is both (e.g. Direct and Partner1), they just want the partner type (Partner1).

I tried splitting out partners only into one temp table built from a few tables joining together different org data. I have the same query without any limit pulling into a different temp table. Then I calculate a count and put that into a temp table. Finally I gather data from all the temp tables. That is where I run into trouble and lose some of the customers whose type is Direct (I have image links below for a direct customer and a customer who is both).

I have been out of SQL for a bit so this one is throwing me. I figure the issue is that I have a CASE statement referencing a table that a direct customer will not exist in (#WLPO). However, I am not sure how to pull in these customers while also selecting only the partner type for a customer that has a partner and is also direct. FYI, I am using SSMS for querying.
IF OBJECT_ID('tempdb..#WLPO') IS NOT NULL
    DROP TABLE #WLPO
IF OBJECT_ID('tempdb..#org') IS NOT NULL
    DROP TABLE #org
IF OBJECT_ID('tempdb..#OrgCount') IS NOT NULL
    DROP TABLE #OrgCount
IF OBJECT_ID('tempdb..#cc') IS NOT NULL
    DROP TABLE #cc
Select
o.OrganizationID,
o.OrganizationName,
os.WhiteLabelPartnerID,
s.StateName
INTO #WLPO
from [Org].[Organizations] o
join [Org].[OrganizationStates] os on o.OrganizationID=os.OrganizationID --and os.WhiteLabelPartnerID = 1
join [Lookup].[States] s on os.StateID = s.StateID
join [Org].[PaymentOnFile] pof on pof.OrganizationID=o.OrganizationID
where os.WhiteLabelPartnerID in (2,3,4)
and os.StateID in (1, 2, 3)
and o.OrganizationID = 7613
select * from #WLPO
Select
o.OrganizationID,
o.OrganizationName,
os.WhiteLabelPartnerID,
s.StateName
INTO #org
from [Org].[Organizations] o
join [Org].[OrganizationStates] os on o.OrganizationID=os.OrganizationID --and os.WhiteLabelPartnerID = 1
join [Lookup].[States] s on os.StateID = s.StateID
join [Org].[PaymentOnFile] pof on pof.OrganizationID=o.OrganizationID
where 1=1--os.WhiteLabelPartnerID = 1
and os.StateID in (1, 2, 3)
and o.OrganizationID = 7613
select * from #org
Select
OrganizationID,
count(OrganizationID) AS CountOrgTypes
INTO #OrgCount
from #org
where OrganizationID = 7613
group by OrganizationID
select * from #OrgCount
Select distinct
ct.OrganizationID,
ok.OrganizationName,
ct.CountOrgTypes,
case when ct.CountOrgTypes = 2 then wlp.WhiteLabelPartnerID
when ct.CountOrgTypes = 1 then ok.WhiteLabelPartnerID
END AS CustomerTypeCode,
case when ct.CountOrgTypes = 2 then wlp.StateName
when ct.CountOrgTypes = 1 then ok.StateName END As OrgState
INTO #cc
from #org ok
left join #WLPO wlp on wlp.OrganizationID=ok.OrganizationID
join #OrgCount ct on wlp.OrganizationID=ct.OrganizationID
select * from #cc
Select
OrganizationID,
OrganizationName,
CountOrgTypes,
case when CustomerTypeCode = 1 then 'Direct'
when CustomerTypeCode = 2 then 'Partner1'
when CustomerTypeCode = 3 then 'Partner2'
when CustomerTypeCode = 4 then 'Partner3' ELSE Null END AS CustomerType,
OrgState
from #cc
order by OrganizationName asc
DirectCustomer
CustomerwithBoth
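The selection rule described in the question (a partner value wins over Direct, and at most one partner value exists per organization) can be sketched in Python. The numeric codes follow the CASE mapping at the end of the query (1 = Direct, 2/3/4 = Partner1..3); treating them as the full set of codes an organization has is an assumption:

```python
def customer_type_code(codes):
    # codes: the set of WhiteLabelPartnerID values an organization has.
    # Mapping assumed from the final CASE: 1 = Direct, 2/3/4 = Partner1..3.
    partners = [c for c in codes if c in (2, 3, 4)]
    if partners:
        # Partner wins over Direct; the question states at most one
        # partner value can occur per organization.
        return partners[0]
    return 1 if 1 in codes else None
```

In SQL the same preference can be expressed per organization as something like MAX(WhiteLabelPartnerID) when the only possible pairing is Direct (1) plus a single partner code (2-4), since the partner code is always the larger value.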

Counting how many times each distinct value occurs in a column in a PySpark SQL join

I have used PySpark SQL to join together two tables, one containing crime location data with longitude and latitude and the other containing postcodes with their corresponding longitude and latitude.
What I am trying to work out is how to tally up how many crimes have occurred within each postcode. I am new to PySpark and my SQL is rusty so I am unsure where I am going wrong.
I have tried to use COUNT(DISTINCT) but that is simply giving me the total number of distinct postcodes.
mySchema = StructType([StructField("Longitude", StringType(),True), StructField("Latitude", StringType(),True)])
bgl_df = spark.createDataFrame(burglary_rdd, mySchema)
bgl_df.registerTempTable("bgl")
pcode_rdd = spark.sparkContext.textFile("posttrans.csv")
mySchema2 = StructType([StructField("Postcode", StringType(),True), StructField("Lon", StringType(),True), StructField("Lat", StringType(),True)])
pcode_df = spark.createDataFrame(pcode_rdd, mySchema2)
pcode_df.registerTempTable("pcode")
count = spark.sql("""SELECT COUNT(DISTINCT pcode.Postcode)
                     FROM pcode RIGHT JOIN bgl
                     ON (bgl.Longitude = pcode.Lon
                     AND bgl.Latitude = pcode.Lat)""")
+------------------------+
|count(DISTINCT Postcode)|
+------------------------+
| 523371|
+------------------------+
Instead I want something like:
+--------+---+
|Postcode|Num|
+--------+---+
|LN11 9DA| 2 |
|BN10 8JX| 5 |
| EN9 3YF| 9 |
|EN10 6SS| 1 |
+--------+---+
You can do a groupby count to get the number of occurrences of each distinct value in a column:
group_df = df.groupby("Postcode").count()
This will give you the output you want.
For an SQL query:
query = """
SELECT pcode.Postcode, COUNT(pcode.Postcode) AS Num
FROM pcode
RIGHT JOIN bgl
ON (bgl.Longitude = pcode.Lon AND bgl.Latitude = pcode.Lat)
GROUP BY pcode.Postcode
"""
count = spark.sql(query)
Also, I have copied in your FROM and JOIN clauses to make the query easier to copy-paste.
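The group-by-count semantics can be sketched without Spark: after the join, each row carries the postcode of one crime, and counting the occurrences of each postcode is exactly what collections.Counter does (the sample postcodes below are made up):

```python
from collections import Counter

# Postcode of each joined crime row (made-up sample data).
joined_postcodes = ["LN11 9DA", "LN11 9DA", "BN10 8JX",
                    "EN9 3YF", "BN10 8JX", "EN10 6SS"]

# postcode -> number of crimes, i.e. GROUP BY Postcode plus COUNT(*).
crime_counts = Counter(joined_postcodes)
```

COUNT(DISTINCT Postcode), by contrast, collapses everything to a single number - the size of the set of postcodes - which is why the original query returned one row.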

How to translate SQL into Lambda for use in MS Entity Framework with Repository pattern usage

For example take these two tables:
Company
CompanyID | Name | Address | ...
Employee
EmployeeID | Name | Function | CompanyID | ...
where a Company has several Employees.
When we want to retrieve the Company and Employee data for a certain Employee, this simple SQL statement will do the job:
SELECT e.name as employeename, c.name as companyname
FROM Company c
INNER JOIN Employee e
ON c.CompanyID = e.CompanyID
where e.EmployeeID=3
Now, the question is how to translate this SQL statement into a 'lambda' construct. We have modelled the tables as objects in the MS Entity Framework where we also defined the relationship between the tables (.edmx file).
Also important to mention is that we use the 'Repository' pattern.
The closest I can get is something like this:
List<Company> tmp = _companyRepository.GetAll().Where
(
    c => c.Employee.Any
    (
        e => e.FKEngineerID == engineerId && e.DbId == jobId
    )
).ToList();
Any help is very much appreciated!
This should do it, assuming your repositories return IQueryables of the types:
var list = (from c in _companyRepository.GetAll()
            join e in _employeeRepository.GetAll() on c.CompanyId equals e.CompanyId
            where e.FKEngineerID == engineerId && e.DbId == jobId
            select new
            {
                EmployeeName = e.name,
                CompanyName = c.name
            }).ToList();
Since you are constraining the query to a single employee (e.EmployeeID = 3), why don't you start with employees?
Also, your SQL query returns a custom set of columns, one from the Employee table and the other from the Company table. To reflect that, you need a custom projection on the EF side.
var result = _employeeRepository.GetAll()
.Where( e => e.DbId == 3 ) // this corresponds to your e.EmployeeID = 3
.Select( e => new {
employeename = e.EmployeeName,
companyname = e.Company.CompanyName
} ) // this does the custom projection
.FirstOrDefault(); // since you want a single entry
if ( result != null ) {
// result is a value of anonymous type with
// two properties, employeename and companyname
}
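The inner join plus anonymous-type projection that both answers describe can be sketched in plain Python (the sample records are made up):

```python
# Plain-Python sketch of INNER JOIN on CompanyID, filtered to one
# employee, projecting one column from each side (sample data made up).
companies = [{"CompanyID": 1, "Name": "Acme Corp"}]
employees = [{"EmployeeID": 3, "Name": "Bob",
              "Function": "Engineer", "CompanyID": 1}]

result = [
    {"employeename": e["Name"], "companyname": c["Name"]}
    for e in employees
    for c in companies
    if e["CompanyID"] == c["CompanyID"] and e["EmployeeID"] == 3
]
```

Like the EF projection, the output rows contain only the two requested columns rather than full Company or Employee entities.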

find the next record that contains [some stuff]

I'm working on a report that contains inpatient ("IP") surgical visits and the service dates of the follow-up x-ray visits, which are identified by patient type and revenue code:
MRN  AdmitDate  Pattype  RevCode  ServiceDate
123  1/1/2015   IP       100      *null*
123  *null*     PT       200      2/1/2015
123  *null*     SVO      320      2/10/2015
123  *null*     PT       200      2/15/2015
I'm trying to roll up rows 1 and 3 on a single line to appear as follows:
MRN  AdmitDate  Pattype  FollowUp
123  1/1/2015   IP       2/10/2015
but I am getting either an empty result or just the next record in the dataset using #followup =
If {encounter.pattype} = "IP" then
if next ({encounter.pattype}) in [several different patient types]
if {charge_detail.revenuecode} in ["0320" to "0324"] then
{charge_detail.servicedate}
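The roll-up being asked for - for each IP row, scan forward to the first later row whose revenue code falls in the 320-324 range - can be sketched in Python. The column names follow the sample table and the 320-324 range comes from the formula; everything else is an assumption:

```python
def roll_up_follow_ups(rows):
    # For each inpatient ("IP") row, find the service date of the first
    # later row whose revenue code is in the 320..324 range.
    out = []
    for i, r in enumerate(rows):
        if r["Pattype"] == "IP":
            follow_up = next(
                (s["ServiceDate"] for s in rows[i + 1:]
                 if 320 <= int(s["RevCode"]) <= 324),
                None,
            )
            out.append({"MRN": r["MRN"], "AdmitDate": r["AdmitDate"],
                        "Pattype": "IP", "FollowUp": follow_up})
    return out

charges = [  # the sample rows from the question
    {"MRN": "123", "AdmitDate": "1/1/2015", "Pattype": "IP",
     "RevCode": "100", "ServiceDate": None},
    {"MRN": "123", "AdmitDate": None, "Pattype": "PT",
     "RevCode": "200", "ServiceDate": "2/1/2015"},
    {"MRN": "123", "AdmitDate": None, "Pattype": "SVO",
     "RevCode": "320", "ServiceDate": "2/10/2015"},
    {"MRN": "123", "AdmitDate": None, "Pattype": "PT",
     "RevCode": "200", "ServiceDate": "2/15/2015"},
]
```

The key point is the forward scan: the formula above only looks at the immediately next record, which is why it returns either nothing or the wrong row when the x-ray visit is not directly after the admission.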

How to refer to Excel columns without headers in TSQL

I dove into the documentation, but since I didn't find the information I need (well, I found similarities with some other SO questions, but not what I want), I'm asking you guys for help:
I'm executing a T-SQL query over an ADODB connection to retrieve data from one Excel file (*.xlsx) into another.
This file is composed as follows:
Header1 Header2 Header3
--------- --------- ---------
A1 B1 C1
A2 B2 C2
A3 B3 C3
.... .... ....
I want to retrieve the headers too, so here's part of the whole program, containing the connection string:
Dim con As ADODB.Connection
Dim rs As ADODB.Recordset
Set con = New ADODB.Connection
Set rs = New ADODB.Recordset
With con
.Provider = "Microsoft.ACE.OLEDB.12.0"
.ConnectionString = "Data Source=path\file.xlsx;" & _
                    "Extended Properties=""Excel 12.0 Xml;HDR=NO;IMEX=1"""
Set rs = .Execute("Select * From [Sheet1$]")
.Close
End With
Here I retrieve all the columns, but what if I want column B, then column A, then column C? E.g. something like this:
Set rs = .Execute("Select colb, cola, colc From [Sheet1$]")
The problem is that I don't know the terms that should replace colb, cola, colc, since I can't use the column headers.
Regards
PS : I don't know much about these technologies, so I may be wrong with the terminology.
Not my answer, but from a colleague who sits next to me ;)
Set rs = .Execute("SELECT F2, F1, F3 FROM [Sheet1$]")
(not tested.) With HDR=NO, the ACE OLE DB provider names the headerless columns F1, F2, F3, ... in sheet order, so you can select and reorder them by those names.
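The positional mapping those F1/F2/F3 names express can be sketched in Python: pick columns by zero-based index rather than by header (the sample rows mirror the table above):

```python
def select_columns(rows, order):
    # order: zero-based column positions; (1, 0, 2) mirrors
    # "SELECT F2, F1, F3" over a headerless sheet.
    return [[row[i] for i in order] for row in rows]

rows = [["A1", "B1", "C1"],
        ["A2", "B2", "C2"],
        ["A3", "B3", "C3"]]

reordered = select_columns(rows, (1, 0, 2))
```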
Have you tried to read the columns as they are in the file:
Set rs = .Execute("SELECT * FROM [Sheet1$]")
and later map them to the required variables? I mean - what do you do with rs later in your code?