Convert each row of pyspark DataFrame column to a Json string - pyspark

How do I create a column with a JSON structure based on other columns of a PySpark dataframe?
For example, I want to achieve the below in a PySpark dataframe. I am able to do this on a pandas dataframe as shown below, but how do I do the same on a PySpark dataframe?
import pandas as pd

df = {'Address': ['abc', 'dvf', 'bgh'], 'zip': [34567, 12345, 78905], 'state': ['VA', 'TN', 'MA']}
df = pd.DataFrame(df, columns=['Address', 'zip', 'state'])
lst = ['Address', 'zip']
# serialize the selected columns of each row into a JSON string
df['new_col'] = df[lst].apply(lambda x: x.to_json(), axis=1)
Expected output
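(The expected-output image is not reproduced here; judging from the pandas code above, new_col would hold JSON strings roughly like these.)
{"Address":"abc","zip":34567}
{"Address":"dvf","zip":12345}
{"Address":"bgh","zip":78905}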

Assuming your PySpark dataframe is named df, use the struct function to build a struct from the desired columns, and then use the to_json function to convert it to a JSON string.
import pyspark.sql.functions as F
....
lst = ['Address', 'zip']
df = df.withColumn('new_col', F.to_json(F.struct(*[F.col(c) for c in lst])))
df.show(truncate=False)
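With the sample data from the question, df.show(truncate=False) should then print something like:
+-------+-----+-----+-----------------------------+
|Address|zip  |state|new_col                      |
+-------+-----+-----+-----------------------------+
|abc    |34567|VA   |{"Address":"abc","zip":34567}|
|dvf    |12345|TN   |{"Address":"dvf","zip":12345}|
|bgh    |78905|MA   |{"Address":"bgh","zip":78905}|
+-------+-----+-----+-----------------------------+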

Related

How to make a new column by pairing elements of another column?

I have a big data dataframe and I want to make pairs from the elements of a column.
col
['summer','book','hot']
['g','o','p']
Expected output (the pairs for the rows above):
new_col
['summer','book'],['summer','hot'],['hot','book']
['g','o'],['g','p'],['p','o']
Note that tuples would work instead of lists, e.g. ('summer','book').
I know in pandas I can do this:
df['col'].apply(lambda x: list(itertools.combinations(x, 2)))
but I am not sure how to do it in PySpark.
You can use a UDF to do the same as you would do in plain Python, declaring the return type as an array of arrays of strings.
import itertools
from pyspark.sql import functions as F

combinations_udf = F.udf(
    lambda x: list(itertools.combinations(x, 2)), "array<array<string>>"
)

df = spark.createDataFrame([(['hot', 'summer', 'book'],),
                            (['g', 'o', 'p'],),
                            ], ['col1'])

df1 = df.withColumn("new_col", combinations_udf(F.col("col1")))
display(df1)
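For reference, df1.show(truncate=False) on this sample should give something along these lines (display renders the same result as a table in Databricks):
+-------------------+--------------------------------------------+
|col1               |new_col                                     |
+-------------------+--------------------------------------------+
|[hot, summer, book]|[[hot, summer], [hot, book], [summer, book]]|
|[g, o, p]          |[[g, o], [g, p], [o, p]]                    |
+-------------------+--------------------------------------------+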

Pyspark: Identify the arrayType column from the Struct and call udf to convert array to string

I am creating an accelerator that migrates data from a source to a destination. For example, it will pick up data from an API and migrate it to CSV. I faced issues handling ArrayType columns when the data is converted to CSV. I used withColumn with the concat_ws method (i.e. df1 = df.withColumn('films', F.concat_ws(':', F.col("films"))), where films is the ArrayType column) for this conversion and it worked. Now I want this to happen dynamically: without specifying the column name, is there a way to pick the ArrayType column names from the struct and then call the UDF on them?
Thank you for your time!
You can get the types of the columns using df.schema. Depending on each column's type, you either apply concat_ws or keep the column as-is:
data = [["test1", "test2", [1,2,3], ["a","b","c"]]]
schema= ["col1", "col2", "arr1", "arr2"]
df = spark.createDataFrame(data, schema)
array_cols = [F.concat_ws(":", c.name).alias(c.name) \
for c in df.schema if isinstance(c.dataType, T.ArrayType) ]
other_cols = [F.col(c.name) \
for c in df.schema if not isinstance(c.dataType, T.ArrayType) ]
df = df.select(other_cols + array_cols)
Result:
+-----+-----+-----+-----+
| col1| col2| arr1| arr2|
+-----+-----+-----+-----+
|test1|test2|1:2:3|a:b:c|
+-----+-----+-----+-----+
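If you need this in several places, the same logic can be wrapped in a small reusable helper; flatten_array_columns below is just an illustrative name, sketched under the same assumptions as the code above:
import pyspark.sql.functions as F
import pyspark.sql.types as T

def flatten_array_columns(df, sep=":"):
    # join every ArrayType column into a single string column; keep the other columns unchanged
    array_cols = [F.concat_ws(sep, f.name).alias(f.name)
                  for f in df.schema if isinstance(f.dataType, T.ArrayType)]
    other_cols = [F.col(f.name)
                  for f in df.schema if not isinstance(f.dataType, T.ArrayType)]
    return df.select(other_cols + array_cols)

# the flattened frame can then be written out as CSV, e.g.
# flatten_array_columns(df).write.csv("/tmp/out", header=True)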

Pyspark PCA Implementation

I am stuck on a problem where I want to do PCA on a PySpark DataFrame column. The column is named 'features', and each row is a SparseVector.
This is the flow:
df - name of the PySpark DataFrame
features - name of the column
Snippet of the RDD:
[Row(features=SparseVector(2,{1:50.0})),
Row(features=SparseVector(2,{0:654.0, 1:20.0}))],
from pyspark.mllib.linalg.distributed import RowMatrix
i = RowMatrix(df.select('features').rdd)
ipc = i.computePrincipalComponents(2)
Error Message
You are getting an RDD[Row] object where your Row is Row(features=SparseVector(2,{1:50.0})).
You need an RDD[SparseVector], so you should change your line:
i = RowMatrix(df.select('features').rdd)
to
i = RowMatrix(df.select('features').rdd.map(lambda x: x[0]))
which will return an RDD[SparseVector].
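A minimal end-to-end sketch, assuming the vectors are pyspark.mllib SparseVectors (if they were produced with pyspark.ml, e.g. by a VectorAssembler, they would need converting to their mllib counterparts first); the variable names here are mine:
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

# toy dataframe mirroring the question: one 'features' column of 2-dimensional SparseVectors
df = spark.createDataFrame(
    [(Vectors.sparse(2, {1: 50.0}),),
     (Vectors.sparse(2, {0: 654.0, 1: 20.0}),)],
    ['features'])

# unwrap the Row objects so the RDD holds bare SparseVectors
vec_rdd = df.select('features').rdd.map(lambda x: x[0])

mat = RowMatrix(vec_rdd)
pcs = mat.computePrincipalComponents(2)  # 2x2 DenseMatrix of principal components
projected = mat.multiply(pcs)            # original rows projected onto those components
print(pcs)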

I have a JSON string in my dataframe; I already tried to extract the JSON string columns using PySpark

df = spark.read.json("dbfs:/mnt/evbhaent2blobs", multiLine=True)
df2 = df.select(F.col('body').cast("Struct").getItem('CustomerType').alias('CustomerType'))
display(df)
my df is
my output df is
I am taking a guess that your dataframe has a column "body" which is a JSON string, and that you want to parse the JSON and extract an element from it.
First you need to define or infer the JSON schema. Then parse the JSON string and extract its elements as columns. From the extracted columns, you can select the desired ones.
json_schema = spark.read.json(df.rdd.map(lambda row: row.body)).schema
df2 = df.withColumn('body_json', F.from_json(F.col('body'), json_schema))\
.select("body_json.*").select('CustomerType')
display(df2)
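Alternatively, if you only need one or two fields and don't want to infer the whole schema, get_json_object can pull a single element out of the JSON string by path (assuming the field really is called CustomerType, as in your attempt):
import pyspark.sql.functions as F

# extract one field from the 'body' JSON string via a JSON path expression
df2 = df.withColumn('CustomerType', F.get_json_object(F.col('body'), '$.CustomerType'))
display(df2.select('CustomerType'))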

how to use createDataFrame to create a pyspark dataframe?

I know this is probably a stupid question. I have the following code:
from pyspark.sql import SparkSession
rows = [1,2,3]
df = SparkSession.createDataFrame(rows)
df.printSchema()
df.show()
But I got an error:
createDataFrame() missing 1 required positional argument: 'data'
I don't understand why this happens because I already supplied 'data', which is the variable rows.
Thanks
createDataFrame is an instance method, so calling it on the SparkSession class binds rows to self and leaves the data argument missing. You have to create a SparkSession instance using the builder pattern and use that instance to create the dataframe; see
https://spark.apache.org/docs/2.2.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession
spark= SparkSession.builder.getOrCreate()
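A corrected version of the original snippet might look like the following; note that plain integers also need to be wrapped in tuples (or given an explicit schema) before Spark can infer a column, as the other answers below point out, and the column name 'value' is just an assumption:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rows = [(1,), (2,), (3,)]                    # each one-element tuple becomes a row
df = spark.createDataFrame(rows, ['value'])  # 'value' is an assumed column name
df.printSchema()
df.show()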
Below are the steps to create a PySpark dataframe using createDataFrame.
Create a SparkSession:
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
Create data and columns
columns = ["language","users_count"]
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
First approach: create the DataFrame from an RDD
rdd = spark.sparkContext.parallelize(data)
df = spark.createDataFrame(rdd).toDF(*columns)
Second approach: create the dataframe directly from the data
df2 = spark.createDataFrame(data).toDF(*columns)
Try
row = [(1,), (2,), (3,)]
?
If I am not wrong, createDataFrame() takes two lists as input: the first list is the data and the second list is the column names. The data must be a list of tuples, where each tuple is a row of the dataframe.
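For completeness, a quick sketch of that pattern (the sample data and column names are just placeholders):
data = [("Java", 20000), ("Python", 100000), ("Scala", 3000)]
df = spark.createDataFrame(data, ["language", "users_count"])
df.show()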