How can I process a dataframe in pyspark and get output as json:
Dataframe:
empid empname in out
1 A 1 1
1 A 1 1
output required in json:
{
id:empid,
name:empname,
in:[1,1],
out:[1,1]
}
The most straightforward way to convert a PySpark dataframe to JSON is to first convert it into a pandas dataframe and then write the pandas dataframe out as JSON:
pandas_df = df.toPandas()
pandas_df.to_json("jsonfilename.json")
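If you need the grouped structure shown in the question (arrays of in/out values per employee) rather than a row-per-record dump, here is a sketch using groupBy with collect_list and to_json (column names assumed from the sample data):
import pyspark.sql.functions as F

# group the per-row in/out values into arrays per employee
grouped = (df.groupBy("empid", "empname")
             .agg(F.collect_list("in").alias("in"),
                  F.collect_list("out").alias("out")))

# build the requested JSON shape per group
json_df = grouped.select(F.to_json(F.struct(
    F.col("empid").alias("id"),
    F.col("empname").alias("name"),
    F.col("in"),
    F.col("out"))).alias("json"))
json_df.show(truncate=False)
# should produce something like {"id":1,"name":"A","in":[1,1],"out":[1,1]}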
How do I create a column with a JSON structure based on other columns of a PySpark dataframe?
For example, I am able to do the following with a pandas dataframe, but how do I do the same with a PySpark dataframe?
df = {'Address': ['abc', 'dvf', 'bgh'], 'zip': [34567, 12345, 78905], 'state':['VA', 'TN', 'MA']}
df = pd.DataFrame(df, columns = ['Address', 'zip', 'state'])
lst = ['Address', 'zip']
df['new_col'] = df[lst].apply(lambda x: x.to_json(), axis = 1)
Expected output: a new_col column holding JSON strings such as {"Address":"abc","zip":34567}.
Assuming your pyspark dataframe is named df, use the struct function to construct a struct, and then use the to_json function to convert it to a json string.
import pyspark.sql.functions as F
....
lst = ['Address', 'zip']
df = df.withColumn('new_col', F.to_json(F.struct(*[F.col(c) for c in lst])))
df.show(truncate=False)
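For a fully self-contained check, a minimal sketch that builds the same sample data as a Spark dataframe and applies the transformation (SparkSession setup assumed):
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# sample data mirroring the pandas example
data = [('abc', 34567, 'VA'), ('dvf', 12345, 'TN'), ('bgh', 78905, 'MA')]
df = spark.createDataFrame(data, ['Address', 'zip', 'state'])

lst = ['Address', 'zip']
df = df.withColumn('new_col', F.to_json(F.struct(*[F.col(c) for c in lst])))
df.show(truncate=False)
# new_col should contain strings like {"Address":"abc","zip":34567}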
I have below pyspark dataframe.
column_a
name,age,pct_physics,country,class
name,age,pct_chem,class
pct_math,class
I have to extract only the part of the string that begins with pct and discard the rest.
Expected output:
column_a
pct_physics
pct_chem
pct_math
How do I achieve this in PySpark?
Use the regexp_extract function.
Example:
from pyspark.sql.functions import col, regexp_extract
df.withColumn("output", regexp_extract(col("column_a"), "(pct_.*?),", 1)).show(10, False)
#+----------------------------------+-----------+
#|column_a |output |
#+----------------------------------+-----------+
#|name,age,pct_physics,country,class|pct_physics|
#|name,age,pct_chem,class |pct_chem |
#+----------------------------------+-----------+
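A hedged variant, in case a pct_ field can also appear at the end of the string without a trailing comma (which the pattern above would miss):
df.withColumn("output", regexp_extract(col("column_a"), "(pct_[^,]+)", 1)).show(10, False)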
I am migrating a pandas dataframe to PySpark. I have two dataframes in PySpark with different row counts. I am able to do the below in pandas but not in PySpark. How do I compare the values of the two dataframes in PySpark and put the matched value as a new column in df2?
def impute_value (row,df_custom):
for index,row_custom in df_custom.iterrows():
if row_custom["Identifier"] == row["IDENTIFIER"]:
row["NEW_VALUE"] = row_custom['CUSTOM_VALUE']
return row["NEW_VALUE"]
df2['VALUE'] = df2.apply(lambda row: impute_value(row, df_custom),axis =1)
How can I convert this function to work on a PySpark dataframe? In PySpark, I cannot pass row-wise values to the function (impute_value).
I tried the following.
df3 = df2.join(df_custom, df2["IDENTIFIER"] == df_custom["Identifier"], "left")
df3.withColumnRenamed("CUSTOM_VALUE", "NEW_VALUE")
This is not giving me the result.
The left join itself should do what you need:
import pyspark.sql.functions as f
df_custom2 = df_custom.withColumnRenamed('Identifier', 'Id')
df3 = df2.join(df_custom2, df2["IDENTIFIER"] == df_custom2["Id"], "left")
df3 = df3.withColumn('NEW_VALUE', f.col('CUSTOM_VALUE')).drop('CUSTOM_VALUE', 'Id')
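A minimal self-contained check of the approach (toy data and column names assumed from the question):
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

# toy data: df2 has identifiers, df_custom has the values to pull over
df2 = spark.createDataFrame([(1,), (2,), (3,)], ["IDENTIFIER"])
df_custom = spark.createDataFrame([(1, "a"), (3, "c")], ["Identifier", "CUSTOM_VALUE"])

df_custom2 = df_custom.withColumnRenamed("Identifier", "Id")
df3 = df2.join(df_custom2, df2["IDENTIFIER"] == df_custom2["Id"], "left")
df3 = df3.withColumn("NEW_VALUE", f.col("CUSTOM_VALUE")).drop("CUSTOM_VALUE", "Id")
df3.show()
# identifiers without a match (here 2) end up with NEW_VALUE = null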
I am using pyspark and have a situation where I need to compare metadata of 2 parquet files.
Example:
Parquet 1 schema:
1, ID, string
2, address, string
3, Date, date
Parquet 2 schema:
1, ID, string
2, Date, date
3, address, string
This should show me a difference, as col 2 moved to col 3 in parquet 2.
Thanks,
VK
In Spark there isn't a native command to do the comparison of the headers. A solution to your problem could be the following:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df1 = spark.read.parquet('path/to/file1.parquet')
df2 = spark.read.parquet('path/to/file2.parquet')
df1_headers = df1.columns
df2_headers = df2.columns
# Now in Python you could compare the lists with the headers
# You don't need Spark to compare simple headers :-)
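A small sketch of the comparison itself, since column order matters here (plain Python on the collected header lists):
if df1_headers == df2_headers:
    print("Column names and order match")
else:
    # report positions where the two schemas differ
    for i, (c1, c2) in enumerate(zip(df1_headers, df2_headers), start=1):
        if c1 != c2:
            print(f"Position {i}: {c1} vs {c2}")
    # columns present in one file only
    print("Only in file 1:", set(df1_headers) - set(df2_headers))
    print("Only in file 2:", set(df2_headers) - set(df1_headers))
# df1.dtypes and df2.dtypes can be compared the same way if column types matter too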
df = spark.read.json("dbfs:/mnt/evbhaent2blobs", multiLine=True)
df2 = df.select(F.col('body').cast("Struct").getItem('CustomerType').alias('CustomerType'))
display(df)
I am taking a guess that your dataframe has a column "body" which is a JSON string, and that you want to parse the JSON and extract an element from it.
First you need to define or infer the JSON schema. Then parse the JSON string and expand its elements as columns, from which you can select the desired ones.
import pyspark.sql.functions as F

json_schema = spark.read.json(df.rdd.map(lambda row: row.body)).schema
df2 = df.withColumn('body_json', F.from_json(F.col('body'), json_schema))\
.select("body_json.*").select('CustomerType')
display(df2)
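Note that display() is a Databricks notebook helper (consistent with the dbfs:/ path above); outside Databricks, df2.show(truncate=False) gives the same kind of inspection.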