Spark Scala Dataframe join and modification

I have an Employee table that holds employee details and a Project table that holds project details along with the id of the assigned employee.
Employee
EmployeeName|Id|Address|Assigned
Joan|101|xxxx|y
Project
ProjectCode|Number of days|Employee
XX1223|24|101
I have a CSV file that loads employee details into the Employee table. While loading, I need to check whether each employee id is present in the Project table:
if the employee id is present in the Project table, set Assigned to y in the Employee table;
if not, set Assigned to n in the Employee table.
I have a dataframe for Employee:
var employeeDF = Employee_TABLE
And,
var employeeAssignedDF = Employee_Join_Project
At the moment I insert into Employee first, then do the join, and then update Employee again. But I could instead compute
employeeDF.except(employeeAssignedDF)
which leaves a smaller set of rows.
Is it possible to change only some columns of a data frame?
I want to write to the table only once, so after the join and the except I should have all the records ready to insert into the DB. Is that feasible?
Thanks

You could try this, though I'm not sure it solves your problem:
import org.apache.spark.sql.functions.when
val newDf = df.withColumn("Column", when(CONDITION, "Y").otherwise("N"))
You can use any other column expression in place of when(CONDITION, "Y").
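For the question above, one possible shape of CONDITION is "the employee id matched a row in the project data". The following is a minimal sketch, not a definitive implementation: it assumes the column names from the sample tables (Id, Employee), renames "Number of days" to NumberOfDays for convenience, and uses a hypothetical second employee (Raj, 102) purely for illustration. The object and method names are also made up for this example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}

object AssignedFlagSketch {
  // Returns `employees` with an "Assigned" column set to "y" when the
  // employee's Id appears in projects.Employee, and "n" otherwise.
  def withAssignedFlag(employees: DataFrame, projects: DataFrame): DataFrame = {
    // Keep one row per assigned employee id so the join cannot
    // duplicate an employee who works on several projects.
    val assignedIds = projects
      .select(col("Employee").as("AssignedId"))
      .distinct()

    employees
      .drop("Assigned") // no-op if the column is absent
      .join(assignedIds, col("Id") === col("AssignedId"), "left_outer")
      .withColumn(
        "Assigned",
        when(col("AssignedId").isNotNull, lit("y")).otherwise(lit("n")))
      .drop("AssignedId")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("assigned-flag-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-ins for the CSV load and the Project table.
    val employees = Seq(("Joan", 101, "xxxx"), ("Raj", 102, "yyyy"))
      .toDF("EmployeeName", "Id", "Address")
    val projects = Seq(("XX1223", 24, 101))
      .toDF("ProjectCode", "NumberOfDays", "Employee")

    withAssignedFlag(employees, projects).show()
    spark.stop()
  }
}
```

Because the flag is computed before anything is written, the resulting data frame can be inserted into the Employee table in a single pass, with no follow-up update.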

Related

eclipselink jpa joined table inheritance with mappedsuperclass

I have 2 tables. Employee and EmployeeDetails. Employee table has the basic details like Employee Id, Department and some audit fields like Created By, Created Timestamp. EmployeeDetails table has all the personal details about the employee and same audit fields (Created By, Created Timestamp) like Employee table. Now the audit fields and Version column are part of a MappedSuperclass ModelBaseFields.
I am using JOINED Inheritance in Employee which is my base class. It extends ModelBaseFields which is a MappedSuperclass. EmployeeDetails extends Employee.
Now the problem is, whenever I try to persist the data, Employee table INSERT query is formed properly however, EmployeeDetails INSERT query is missing audit fields (Created By, Created Timestamp) and version column.
I have tried using SINGLE TABLE inheritance with Secondary table. I am getting same issue in that scenario as well.
How do I add common columns in child table?

Table of tablenames

I'm using a platform called CKAN, which stores datasets. When a dataset is added, it creates a table with a (seemingly) random name. I want to use the data from certain datasets, so I want to record in a separate mapping table which table holds which data.
I would like to use this mapped value (the table name) as the FROM clause of a select query:
SELECT * FROM (SELECT tablename FROM mappingtable WHERE id=1)
How do I do this?
Edit: What data type should I use to store the table name?

Merging postgres data

I have data in two PostgreSQL databases that needs to be merged into one. Just to be clear, both databases have "good" data in them from a certain date that needs to be combined. This isn't merely appending the data from one to the other. In other words, let's say that table foo has a serial id field. Both databases have a foo with ID=5555 and both values are valid (but different). So, the target database's foo keeps 5555 and the new record should get added with a new ID of nextval(foo_id_seq).
So, it's a big mess.
My plan is to create a tmp schema in the target db and copy the needed data from the source db into it. Then I need to essentially "upsert" the data. New records get inserted with new ids (and foreign keys updated) and records that exist in both dbs get updated.
I don't believe there is a tool that will help me with this.
My question:
How best to handle generating the new id? I know I could do it via selects and just leave out the id column, but that's a lot of typing and would be slow. My thinking is to create a temporary trigger for these tables that will override the id supplied when doing an insert.
Final notes:
Both databases are offline, and I'm the only one who can get to them.
Both databases have the exact same schema
Target database is 9.2

Try something like:
INSERT INTO A(id, f1, f2)
SELECT nextval('A_seq'), tmp_A.f1, tmp_A.f2
FROM tmp_A
WHERE tmp_A.id IN (select A.id FROM A);
INSERT INTO A(id, f1, f2)
SELECT tmp_A.id, tmp_A.f1, tmp_A.f2
FROM tmp_A
WHERE tmp_A.id NOT IN (select A.id FROM A);
The idea: use one INSERT .. SELECT .. to insert the rows whose ids conflict (giving them new ids) and another INSERT .. SELECT .. to insert the rows without a conflict.
Or simply generate a new id for every inserted record:
INSERT INTO A(id, f1, f2)
SELECT nextval('A_seq'), tmp_A.f1, tmp_A.f2
FROM tmp_A;

T-SQL, Get distinct values from column in source, check target, insert if not exist

I've seen several somewhat similar questions, but nothing exactly like mine. A T-SQL god should be able to answer this in a flash.
Source table (feeder) of employees and department codes from HRMS system feed. I have an employees table and a departments table in my SQL Server database. I need to have a stored proc that will first get a distinct list of department codes in the feeder table, then check those codes to see if they exist in the departments table. If they don't then insert into the departments table from the feeder table. Then, do the insert into the employee table.
I have just found out that one of the business analysts has been getting a separate list of departments in Excel and adding them manually. That seems crazy when the data is already coming into the feeder table from HRMS. I can do the inserts, but I don't know how to loop through the feeder table to see if the department code in each row already exists in the departments table. Thanks for any assistance.
J.

You can use the merge keyword in SQL 2008 or greater. I just tested this snippet:
merge department as d
using feeder as f on f.code = d.code
when not matched then
insert (code)
values (f.code);

Merge will work. Since we're just doing inserts though, this will too:
INSERT department (departmentCode)
SELECT departmentcode
FROM
(
    SELECT departmentcode
    FROM feeder
    EXCEPT
    SELECT departmentcode
    FROM department
) c;
With indexes on departmentcode in both tables, this pattern usually performs well.

Insert into table with Identity and foreign key columns

I am trying to copy values from tables in one database to tables in another.
My issue is that I have two related tables, and the first table also has an identity column.
e.g. table first(id, Name) - table second(id, address)
Both tables exist with values in one db, and I am trying to copy those values to another db.
When I insert values from the first db into the second db, the first table generates values for the id column by itself, so I then have to link that new id to the second table.
How can I do that?
UPDATE: I am using MS SQL Server 2000.

You can use SCOPE_IDENTITY() immediately after your insert in SQL Server 2000, which will give you the last id generated within the current scope, but I'm not sure how that would work with bulk inserting of data.
http://msdn.microsoft.com/en-us/library/ms190315.aspx

If this were SQL Server 2005 or later I would suggest using the output clause in your insert statement to retrieve the ids just inserted, but that was not available in SQL Server 2000.
If your data contains some column or series of columns which is unique other than the identity column, then you can query your first table based on that series of columns to get the ids and use that to populate your second table.

If the target tables were empty you could use SET IDENTITY_INSERT ... ON for the target table. This would let you insert the original values into the identity column, and you would not have to update the referenced ids. Of course, if any existing ids can overlap with the inserted ids, that is not a solution.
If the names in the first table are unique, you could build a mapping between new and old ids and perform an update, something like this:
UPDATE S
SET S.id = F.id
FROM second S
INNER JOIN first_original FO ON FO.id = S.id
INNER JOIN first F ON F.name = FO.name
If the names are not unique, then the original ids should be saved in "first" to provide the mapping between old and new ids. This can be a temporary new column that is dropped after the ids in "second" have been updated.
Or, as Rich Andrews said, you could use SCOPE_IDENTITY(), but in that case you will have to insert rows one by one: declare a cursor on the source table, insert each record, get its new id, and insert it into the "second" table.
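The name-based mapping behind the UPDATE above can also be illustrated outside the database. The sketch below, in plain Scala, is only a model of the logic, assuming names are unique on both sides; the case classes and the remap function are hypothetical names introduced for this example, not part of any answer above.

```scala
object IdRemapSketch {
  final case class First(id: Int, name: String)
  final case class Second(id: Int, address: String)

  // Rewrites the ids in `second` from the source database's ids to the
  // ids the target database generated, matching rows of `first` by name.
  // Rows of `second` without a match keep their original id, as they
  // would with the UPDATE ... INNER JOIN statement.
  def remap(
      firstOriginal: Seq[First],
      firstNew: Seq[First],
      second: Seq[Second]): Seq[Second] = {
    // name -> id generated in the target database
    val newIdByName: Map[String, Int] =
      firstNew.map(f => f.name -> f.id).toMap

    // old id -> new id, for every name present on both sides
    val oldToNew: Map[Int, Int] = firstOriginal.flatMap { fo =>
      newIdByName.get(fo.name).map(newId => fo.id -> newId)
    }.toMap

    second.map(s => s.copy(id = oldToNew.getOrElse(s.id, s.id)))
  }
}
```

If names are not unique, the name-keyed map silently keeps only one id per name, which is why the original ids need to be carried over explicitly in that case.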
