I am importing a JSON file into a PySpark DataFrame. I imported the JSON with the following code:
df = sqlContext.read.json("json_file.json").select("item", "attributes")
I want to split the attributes column into multiple columns.
Here is a sample of the JSON format:
{"item":"item-1","attributes":{"att-a":"att-a-15","att-b":"att-b-10","att-c":"att-c-7"}}
{"item":"item-2","attributes":{"att-a":"att-a-15","att-b":"att-b-10","att-c":"att-c-7"}}
If you want your output to look like this:
+------+--------+--------+-------+
| item| att-a| att-b| att-c|
+------+--------+--------+-------+
|item-1|att-a-15|att-b-10|att-c-7|
|item-2|att-a-15|att-b-10|att-c-7|
+------+--------+--------+-------+
Use
from pyspark.sql import functions as f
df.select('item','attributes.*').show()
so that all the attributes appear as separate columns.
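If you only need some of the attributes, or want control over the output column names, you can also pull the struct fields out individually (a sketch; getField sidesteps any quoting issues with the hyphens in the field names):
from pyspark.sql import functions as f
df.select(
    "item",
    f.col("attributes").getField("att-a").alias("att-a"),
    f.col("attributes").getField("att-b").alias("att-b"),
    f.col("attributes").getField("att-c").alias("att-c"),
).show()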
I have a Synapse Analytics notebook. I am reading a CSV file into a PySpark DataFrame. When I write this DataFrame to a JSON file, the column order changes to alphabetical order. Can someone help me retain the column order without hardcoding the column names in the notebook?
For example, when I do df.show() I get BCol, CCol, ACol.
When I write to the JSON file it is written as {ACol ='';BCol='';CCol=''}. I am not able to retain the order.
I am using the following code to write the JSON file:
df.coalesce(1).write.format("json").mode("overwrite").save(dest_location)
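One way to pin the order (a sketch, not from the original thread, assuming the reordering happens somewhere between the CSV read and the write) is to capture df.columns right after reading the CSV and re-select in that order just before writing:
original_order = df.columns  # e.g. ['BCol', 'CCol', 'ACol'], captured right after the read
(df.select(original_order)   # re-apply the captured order before writing
   .coalesce(1)
   .write.format("json")
   .mode("overwrite")
   .save(dest_location))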
I have column names in a separate .csv file and want to assign them as column headers to a DataFrame in Scala. Since it is a generic script, I don't want to hardcode the names in the script; rather, I want to pass the values from the csv file.
You can do it like this:
// Read only the header of the CSV to get the column names
val columns = spark.read.option("header","true").csv("path_to_csv").schema.fieldNames
val df: DataFrame = ??? // your existing DataFrame with the same number of columns
df.toDF(columns:_*).write.format("orc").save("your_orc_dir")
In PySpark:
columns = spark.read.option("header","true").csv("path_to_csv").columns
df.toDF(*columns).write.format("orc").save("your_orc_dir")
But storing the schema separately from the data is generally a bad idea.
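As a self-contained illustration of the PySpark variant (the header file name and the sample data are made up; the header CSV is assumed to contain a single row with the desired column names, one per data column):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Read only the header row of the CSV to obtain the column names.
columns = spark.read.option("header", "true").csv("headers.csv").columns
# Some DataFrame whose columns should be renamed positionally.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["_c0", "_c1"])
# toDF takes the new names as varargs, hence the * unpacking.
df.toDF(*columns).show()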
I am trying to convert a PySpark DataFrame column to a list of values, NOT objects.
My ultimate goal is to use it as a filter for filtering another DataFrame.
I have tried the following:
X = df.select("columnname").collect()
But when I use it in a filter, it does not work:
Y = dtaframe.filter(~dtaframe.columnname.isin(X))
I also tried converting to a numpy array and aggregating with collect_list():
df.groupby('columnname').agg(collect_list(df["columnname"]))
Please advise.
The collect() function returns a list of Row objects by collecting the data from the executors. If you need a list of values in native data types, you have to extract the column from each Row object explicitly.
This code creates a DataFrame with a column named number of type LongType.
df = spark.range(0,10,2).toDF("number")
Convert it into a Python list:
num_list = [row.number for row in df.collect()]
Now this list can be used in any DataFrame to filter values with the isin function.
from pyspark.sql.functions import col
df1 = spark.range(10).toDF("number")
df1.filter(~col("number").isin(num_list)).show()
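If the values you want to exclude are large or already live in another DataFrame, a left anti join avoids collecting anything to the driver; a sketch using the two DataFrames above:
# Keep only the rows of df1 whose "number" does not appear in df.
df1.join(df, on="number", how="left_anti").show()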
I want to use the values in t5 to replace some missing values in t4. I searched for code, but it doesn't work for me.
Current:
example of current
Goal:
example of target
df is a Spark DataFrame. Code:
pdf = df.toPandas()
from pyspark.sql.functions import coalesce
pdf.withColumn("t4", coalesce(pdf.t4, pdf.t5))
Error: 'DataFrame' object has no attribute 'withColumn'
I also tried the following code previously; it didn't work either.
new_pdf=pdf['t4'].fillna(method='bfill', axis="columns")
Error: No axis named columns for object type
As the error indicates, .withColumn() is not a method of pandas DataFrames but of Spark DataFrames. Note that .toPandas() turns pdf into a pandas DataFrame, so if you want to use .withColumn(), avoid that conversion.
UPDATE:
If pdf is a pandas dataframe you can do:
pdf['t4']=pdf['t4'].fillna(pdf['t5'])
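For completeness, the coalesce approach from the question does work if you stay on the Spark DataFrame instead of converting to pandas; a minimal sketch:
from pyspark.sql.functions import coalesce
# On the Spark DataFrame (before .toPandas()): fill nulls in t4 from t5.
df = df.withColumn("t4", coalesce(df["t4"], df["t5"]))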
I'm trying to export the DataFrame as a .csv file to an S3 bucket.
Unfortunately it is saving in parquet files.
Can someone please let me know how to export a PySpark DataFrame to a .csv file?
I tried the code below:
predictions.select("probability").write.format('csv').csv('s3a://bucketname/output/x1.csv')
It is throwing this error: CSV data source does not support struct<type:tinyint,size:int,indices:array<int>,values:array<double>> data type.
I appreciate anybody's help.
Note: my Spark setup is based in Zeppelin.
Thanks,
Naseer
probability is an array-like column (it contains multiple values) and needs to be converted to a string before you can save it to CSV. One way to do it is with a udf (user-defined function):
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

def string_from_array(input_list):
    # Render the values as a bracketed, comma-separated string, e.g. "[0.1,0.9]"
    return '[' + ','.join(str(item) for item in input_list) + ']'

ats_udf = udf(string_from_array, StringType())
predictions = predictions.withColumn('probability_string', ats_udf(col("probability")))
Then you can save your dataset:
predictions.select("probability_string").write.csv('s3a://bucketname/output/x1.csv')