I'd like to append columns from one pyspark dataframe to another.
In pandas, the command would look like
df1 = pd.DataFrame({'x':['a','b','c']})
df2 = pd.DataFrame({'y':[1,2,3]})
pd.concat((df1, df2), axis = 1)
Is there a way to accomplish this in pyspark? All I can find is either concatenating the contents of a column or doing a join.
Related
I have a df1 based on distinct values, containing two columns, date and value. There is a df2 that has multiple columns but also contains the date and value columns. For each distinct value from df1, I want to filter df2 so that the records before the date from df1 are dropped. It would be rather easy for a single distinct value: I could filter by value and then apply gt(lit(date)). However, I have over 500 such distinct pairs in df1, and a single operation takes around 20 minutes, so looping over them is computationally not feasible. Perhaps somebody can advise me on a better methodology here.
I have tried multiple methodologies, but nothing has worked yet.
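Not an answer as such, but a minimal sketch of the single-pair filter described above, plus a join-based variant (a suggestion, not something from the question) that applies all 500 pairs at once; the column names value/date and the literals are assumptions for illustration:
from pyspark.sql import functions as F
# Single-pair version: filter one value and drop rows before its cutoff date
one_pair = df2.filter(
    (F.col("value") == "some_value") & (F.col("date") > F.lit("2020-01-01"))
)
# Join-based variant: attach each value's cutoff date from df1 and filter
# once, which avoids looping over the 500 (value, date) pairs
cutoffs = df1.withColumnRenamed("date", "cutoff_date")
filtered = (
    df2.join(F.broadcast(cutoffs), on="value")
       .filter(F.col("date") > F.col("cutoff_date"))  # gt(): keep rows after the cutoff
       .drop("cutoff_date")
)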
I am utilizing the following to find missing values in my spark df:
from pyspark.sql.functions import col,sum
df.select(*(sum(col(c).isNull().cast("int")).alias(c) for c in df.columns)).show()
from my sample spark df below:
import numpy as np
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [
    ("James", "CA", np.nan), ("Julia", "", None),
    ("Ram", None, 200.0), ("Ramya", "NULL", np.nan),
]
df = spark.createDataFrame(data, ["name", "state", "number"])
df.show()
How can I convert the result of the prior missing-count line into a pandas dataframe? My real df has 26 columns, and showing it as a spark df is messy and misaligned.
This might not be as clean as an actual pandas df rendered as a table, but hopefully this will work for you:
From your first code, remove the .show() call:
df.select(*(sum(col(c).isNull().cast("int")).alias(c) for c in df.columns))
You can assign that line to a variable, or chain the toPandas() call directly:
sdf = df.select(*(sum(col(c).isNull().cast("int")).alias(c) for c in df.columns))
new_df = sdf.toPandas().T
print(new_df)
The .T call transposes the dataframe. If you have many columns, printing without transposing will truncate them and you will not be able to see all of the columns.
Again, this does not have the actual table, but at least this is more readable than a spark df.
UPDATE:
If you prefer the table look, you can convert the last variable to a pandas df. There could be another or more efficient way to do this, but so far this one works.
pd.DataFrame(new_df)
I have a DataFrame with two columns, id1 and id2, and what I'd like is to count the number of distinct values across these two columns combined. Essentially this is count(set(id1 + id2)).
How can I do that with PySpark?
Thanks!
Please note that this isn't a duplicate, as I'd like PySpark to calculate the count(). Of course it's possible to get the two lists id1_distinct and id2_distinct and put them in a set(), but that doesn't seem to me like the proper solution when dealing with big data, and it's not really in the PySpark spirit.
You can combine the two columns into one using union, and get the countDistinct:
import pyspark.sql.functions as F
# union() stacks id2 under id1 and keeps the first column's name, 'id1',
# so countDistinct('id1') counts distinct values across both columns
cnt = df.select('id1').union(df.select('id2')).select(F.countDistinct('id1')).head()[0]
I would like to create a spark dataframe in pyspark from a text file that has a different number of rows and columns, and map it to a key/value pair, where the key is the first 4 characters of the first column of the text file. I want to do that in order to remove the redundant rows and to be able to group them later by the key value. I know how to do that in pandas, but I am still confused about where to start in pyspark.
My input is a text file that has the following:
1234567,micheal,male,usa
891011,sara,femal,germany
I want to be able to group every row by the first six characters in the first column.
Create a new column that contains only the first six characters of the first column, and then group by that:
from pyspark.sql.functions import col
# substr(1, 6) takes the first six characters of the column value
df2 = df.withColumn("key", col("first_col").substr(1, 6))
df2.groupBy("key").agg(...)
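For context, a minimal end-to-end sketch starting from the text file shown above; the file path, the column names, and count() as the aggregation are assumptions for illustration:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.getOrCreate()
# The sample rows are comma-separated with four fields and no header,
# so the column names here are made up
df = spark.read.csv("people.txt").toDF("id", "name", "gender", "country")
# Key on the first six characters of the first column, then group by it
df.withColumn("key", col("id").substr(1, 6)).groupBy("key").count().show()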
Assume the following two Dataframes in pyspark with equal number of rows:
df1:
|_ Column1a
|_ Column1b
df2:
|_ Column2a
|_ Column2b
I wish to create a new DataFrame "df" which has Column1a and Column2a only. What could be the best possible solution for this?
Denny Lee's answer is the way.
It involves creating another column on both DataFrames that holds a Unique_Row_ID for every row. We then perform a join on Unique_Row_ID, and finally drop Unique_Row_ID if required.
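A minimal sketch of that idea, assuming the column names from the question; row_number() over monotonically_increasing_id() is just one way to generate the Unique_Row_ID:
from pyspark.sql import Window
from pyspark.sql.functions import monotonically_increasing_id, row_number
# Give each row a sequential Unique_Row_ID. A window without partitionBy
# pulls all rows into one partition, which is fine for a sketch but can be
# slow on very large data.
w = Window.orderBy(monotonically_increasing_id())
df1_id = df1.withColumn("Unique_Row_ID", row_number().over(w))
df2_id = df2.withColumn("Unique_Row_ID", row_number().over(w))
# Join on the row id; selecting only the wanted columns also drops the id
df = df1_id.join(df2_id, on="Unique_Row_ID").select("Column1a", "Column2a")
This relies on both DataFrames having the same number of rows, as stated in the question.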