PySpark Dropping Columns Issues

The goal of my code is to drop a column each time it shows up. I know there is a way to drop columns without using a for loop, but that method does not work here because the columns are dynamic. The problem is that the .drop command is not dropping the indicated column. So here is some pseudocode:
for column_name in column_name_list:
    # create data_frame1 with the column name
    # join data_frame1 with the other dataframe, data_frame2
    # here I drop column_name from the joined result
    data_frame = data_frame.drop(column_name)
The problem is that after the drop, column_name reappears during the second iteration. My guess is that I am dropping the column on a copy, and the data_frame with the dropped column is not being "saved". Thank you for all the help.

You may have two columns with the same name after the join. If you are joining on a column that has the same name in both dataframes, you can pass the name directly and Spark will keep only one copy of it:
dataframe1.join(dataframe2, col_name)  # no need for dataframe1.col_name == dataframe2.col_name
If you are already doing that and the code above is still not working (I use this code and it worked), you can select the remaining columns instead of dropping:
data_frame.select(*(set(data_frame.columns) - set(column_name_list)))
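Note that a set difference does not preserve column order. A minimal sketch of the same idea that keeps the original order (data_frame and column_name_list are the names from the question):

# keep every column that is not scheduled for dropping, in its original order
keep = [c for c in data_frame.columns if c not in set(column_name_list)]
data_frame = data_frame.select(*keep)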

Related

Add columns to Postgres, source file changed

I get data files from one of our vendors. Each line is a continuous fixed-width string with most positions filled out. There are plenty of sections that are just space characters, used as filler for future columns. I have a parser that formats the file into a CSV so I can upload it into Postgres. Today the vendor informed us that they are adding a column by splitting one of their filler fields into two columns, X and filler.
For example index 0:5 is the name, 5:20 is filler and 20:X is other stuff. They are splitting 5:20 into 5:10 and 10:20 where 10:20 will still be a placeholder column.
NAME1 AUHDASFAAF!##!12312312541 -> NAME1, ,AUHDASFAAF,.....
Is now
NAME1AAAAA AUHDASFAAF!##!12312312541 -> NAME1,AAAAA, ,AUHDASFAAF,......
Modifying my parser to account for this change is the easy part. How do I edit my Postgres table to accept this new column from the CSV file? Ideally I don't want to remake and reupload all of the data in the table.
Columns are stored in the order they are defined. When you add a new column it goes at the end; there is no direct way to add a column in the middle. While INSERT ... VALUES (...) without a column list is convenient, you should not rely on the order of columns in the table.
There are various workarounds, like dropping and recreating the table, or dropping and re-adding columns. These are all pretty inconvenient, and you'll have to do it again when there's another change.
You should never make assumptions about the order of columns in the table, either in an INSERT or in SELECT *. You can either spell out all the columns, or create a view which fixes the order of the columns.
You don't have to write the columns out by hand. Get them from information_schema.columns and edit their order as necessary for your queries, or to set up your view.
select column_name
from information_schema.columns
where table_name = ?
order by ordinal_position
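Putting that together, here is a minimal sketch of the whole change, assuming psycopg2 and hypothetical names (vendor_data for the table, x for the new column): add the column at the end of the physical row, then fix the logical order with a view.

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
with conn, conn.cursor() as cur:
    # new columns always land at the end of the physical row
    cur.execute("ALTER TABLE vendor_data ADD COLUMN x text")
    # a view fixes the logical column order without rewriting any data
    cur.execute("""
        CREATE VIEW vendor_data_v AS
        SELECT name, x, filler, other_stuff
        FROM vendor_data
    """)

When loading the CSV, spell out the column list as well, e.g. COPY vendor_data (name, x, filler, other_stuff) FROM ..., so the load never depends on physical column order either.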

Drop function not working after left outer join in pyspark

My pyspark version is 2.1.1. I am trying to left-outer-join two dataframes that both have the columns id and priority. I am creating my dataframes like this:
a = "select 123 as id, 1 as priority"
a_df = spark.sql(a)
b = "select 123 as id, 1 as priority union select 112 as uid, 1 as priority"
b_df = spark.sql(b)
c_df = a_df.join(b_df, (a_df.id==b_df.id), 'left').drop(b_df.priority)
The schema of c_df comes out as DataFrame[id: int, priority: int, id: int, priority: int]
The drop function is not removing the columns.
But if I try to do:
c_df = a_df.join(b_df, (a_df.id==b_df.id), 'left').drop(a_df.priority)
then the priority column of a_df does get dropped.
Not sure if this is a version issue or something else, but it feels very weird that the drop function behaves like this.
I know the workaround is to remove the unwanted columns before the join, but I'm still not sure why drop is not working here.
Thanks in advance.
Duplicate column names after a join in pyspark lead to unpredictable behavior, and I've read that you should disambiguate the names before joining. See these Stack Overflow questions: Spark Dataframe distinguish columns with duplicated name and Pyspark Join and then column select is showing unexpected output. I'm sorry to say I can't find why pyspark doesn't work as you describe.
But the databricks documentation addresses this problem: https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html
From the databricks:
If you perform a join in Spark and don’t specify your join correctly you’ll end up with duplicate column names. This makes it harder to select those columns. This topic and notebook demonstrate how to perform a join so that you don’t have duplicated columns.
When you join, you can instead try either using an alias (that's typically what I use), or passing the join columns as a list or a str:
df = left.join(right, ["priority"])
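And a minimal sketch of the alias approach, reusing the a_df and b_df frames from the question; once both sides are aliased, every column reference is unambiguous:

from pyspark.sql import functions as F

a = a_df.alias("a")
b = b_df.alias("b")
# keep only a's columns, so b's duplicate id/priority never survive the join
c_df = a.join(b, F.col("a.id") == F.col("b.id"), "left").select("a.*")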

are INTO, FROM and JOIN the only ways to get a table?

I'm currently writing a script which takes an input file (generally .sql) and generates a list of every table that's used in that file. The process is simple: it opens the input file, checks each line for a substring, and if that substring exists, outputs the line to the screen.
The substrings being checked are T-SQL keywords that indicate a referenced table, such as INTO, FROM and JOIN. Not being a T-SQL wizard, those three keywords are the only ones I know of that are used to select a table in a query.
So my question is: in T-SQL, are INTO, FROM and JOIN the only ways to get a table, or are there others?
There are many ways to get a table; here are some of them:
DELETE
FROM
INTO
JOIN
MERGE
OBJECT_ID (N'dbo.mytable', N'U') where U is the object type for table.
TABLE, e.g. ALTER TABLE, TRUNCATE TABLE, DROP TABLE
UPDATE
However, your script will match not only real tables, but also things like views and common table expressions (CTEs). Here are two examples:
-- Example 1
SELECT *
FROM dbo.myview
-- Example 2
WITH tmptable AS
(
SELECT *
FROM mytable
)
SELECT *
FROM tmptable
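For what it's worth, here is a minimal sketch of the scanner described in the question (the file name is hypothetical, and as noted above a plain keyword scan will also pick up view and CTE names, so treat it as a rough first pass rather than a real SQL parser):

import re

# a table-referencing keyword, followed by the object name it references
PATTERN = re.compile(
    r"\b(?:DELETE\s+FROM|TRUNCATE\s+TABLE|MERGE(?:\s+INTO)?"
    r"|FROM|JOIN|INTO|UPDATE)\s+([\w.\[\]#]+)",
    re.IGNORECASE)

tables = set()
with open("query.sql") as f:
    for line in f:
        tables.update(PATTERN.findall(line))

for name in sorted(tables):
    print(name)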

How to unpack tuple in PostgreSQL SELECT?

Because of some quirks in our DB model, I am faced with a table that optionally links to itself. I want to write a query that selects each row in a way that either the original row is returned or, if present, the linked row.
SELECT
COALESCE(r2.*, r1.*)
FROM mytable r1
LEFT JOIN mytable r2
ON r1.sub_id = r2.id
While this works, all data is returned in one column named 'COALESCE', as row tuples instead of the actual table columns.
How can I unpack those tuples to get the actual table columns, or "fix" the query to avoid the problem altogether?
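For reference, PostgreSQL can expand a composite value back into separate columns by parenthesizing the row expression and applying .* to it. A minimal sketch through psycopg2 (the connection string is hypothetical; mytable is the table from the question):

import psycopg2

sql = """
    SELECT (COALESCE(r2, r1)).*   -- parenthesized row value, then .*
    FROM mytable r1
    LEFT JOIN mytable r2 ON r1.sub_id = r2.id
"""
with psycopg2.connect("dbname=mydb") as conn, conn.cursor() as cur:
    cur.execute(sql)
    rows = cur.fetchall()  # one value per table column, not one tuple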

differing column names in self-outer-joins

When writing a self-join in T-SQL I can avoid duplicate column names like this:
SELECT FirstEvent.Title AS FirstTitle, SecondEvent.Title AS SecondTitle
FROM ContiguatedEvents AS FirstEvent
LEFT OUTER JOIN ContiguatedEvents AS SecondEvent
ON FirstEvent.logID = SecondEvent.logID
Suppose I want to select all the columns from the self-join, for example into a view. How do I then differentiate the column names without writing each one out in the join statement? I.e., is there anything I can write like this (ish):
SELECT FirstEvent.* AS ???, SecondEvent.* AS ???
FROM ContiguatedEvents AS FirstEvent
LEFT OUTER JOIN ContiguatedEvents AS SecondEvent
ON FirstEvent.logID = SecondEvent.logID
There's no way to automatically introduce aliases for multiple columns; you just have to do it by hand.
One handy hint for quickly getting all of the column names into your query (in Management Studio) is to drag the Columns folder from the Object Explorer into a query window. It gives you all of the column names.
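If the table is wide, you can also generate the aliased select list instead of typing it out. A minimal sketch using pyodbc (the DSN is hypothetical; the table name comes from the question):

import pyodbc

conn = pyodbc.connect("DSN=mydb")
rows = conn.cursor().execute(
    "SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS "
    "WHERE TABLE_NAME = 'ContiguatedEvents' ORDER BY ORDINAL_POSITION")
cols = [r.COLUMN_NAME for r in rows]

# e.g. FirstEvent.Title AS FirstTitle, SecondEvent.Title AS SecondTitle, ...
print(",\n".join(
    f"FirstEvent.{c} AS First{c}, SecondEvent.{c} AS Second{c}"
    for c in cols))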