I have a JSON string in my dataframe and I already tried to extract the JSON string columns using pyspark - pyspark

df = spark.read.json("dbfs:/mnt/evbhaent2blobs", multiLine=True)
df2 = df.select(F.col('body').cast("Struct").getItem('CustomerType').alias('CustomerType'))
display(df)
(screenshots of my df and my expected output df)

I am taking a guess that your dataframe has a column "body" which is a JSON string, and that you want to parse the JSON and extract an element from it.
First you need to define or infer the JSON schema. Then parse the JSON string and expand its elements into columns, from which you can select the ones you want.
import pyspark.sql.functions as F

# Infer the JSON schema from the 'body' strings, then parse and keep only CustomerType
json_schema = spark.read.json(df.rdd.map(lambda row: row.body)).schema
df2 = df.withColumn('body_json', F.from_json(F.col('body'), json_schema)) \
    .select("body_json.*") \
    .select('CustomerType')
display(df2)
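If you already know the relevant part of the schema, or only need one field, a lighter alternative is to pass a DDL-style schema string to from_json instead of inferring the schema from the data. A minimal sketch, assuming CustomerType is a top-level string field in the body JSON:

import pyspark.sql.functions as F

# Parse only the field we need; any other keys in the JSON are simply ignored
df2 = df.withColumn('body_json', F.from_json(F.col('body'), "CustomerType STRING")) \
    .select(F.col('body_json.CustomerType').alias('CustomerType'))
display(df2)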

Related

Convert each row of pyspark DataFrame column to a Json string

How do I create a column with a JSON structure based on other columns of a pyspark dataframe?
For example, I want to achieve the below in a pyspark dataframe. I am able to do this on a pandas dataframe as shown below, but how do I do the same on a pyspark dataframe?
import pandas as pd

data = {'Address': ['abc', 'dvf', 'bgh'], 'zip': [34567, 12345, 78905], 'state': ['VA', 'TN', 'MA']}
df = pd.DataFrame(data, columns=['Address', 'zip', 'state'])
lst = ['Address', 'zip']
# Build a JSON string out of the selected columns for each row
df['new_col'] = df[lst].apply(lambda x: x.to_json(), axis=1)
Expected output
Assuming your pyspark dataframe is named df, use the struct function to construct a struct, and then use the to_json function to convert it to a json string.
import pyspark.sql.functions as F
....
lst = ['Address', 'zip']
df = df.withColumn('new_col', F.to_json(F.struct(*[F.col(c) for c in lst])))
df.show(truncate=False)
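As a self-contained sketch using the sample data from the question (the expected JSON values below follow from that data, not from the original answer):

import pyspark.sql.functions as F

data = [('abc', 34567, 'VA'), ('dvf', 12345, 'TN'), ('bgh', 78905, 'MA')]
df = spark.createDataFrame(data, ['Address', 'zip', 'state'])

lst = ['Address', 'zip']
df = df.withColumn('new_col', F.to_json(F.struct(*[F.col(c) for c in lst])))
df.show(truncate=False)
# new_col for the first row should look roughly like {"Address":"abc","zip":34567}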

Pyspark: Identify the arrayType column from the Struct and call udf to convert array to string

I am creating an accelerator that migrates data from a source to a destination. For example, it will pick the data from an API and migrate it to CSV. I have faced issues handling arraytype columns while the data is converted to CSV. I used withColumn and the concat_ws method (i.e., df1 = df.withColumn('films', F.concat_ws(':', F.col("films"))), where films is the arraytype column) for this conversion and it worked. Now I want this to happen dynamically. I mean, without specifying the column name, is there a way I can pick the column names from the struct that have arraytype and then call the udf?
Thank you for your time!
You can get the types of the columns using df.schema. Depending on the type of each column, you can apply concat_ws or not:
data = [["test1", "test2", [1,2,3], ["a","b","c"]]]
schema= ["col1", "col2", "arr1", "arr2"]
df = spark.createDataFrame(data, schema)
array_cols = [F.concat_ws(":", c.name).alias(c.name) \
for c in df.schema if isinstance(c.dataType, T.ArrayType) ]
other_cols = [F.col(c.name) \
for c in df.schema if not isinstance(c.dataType, T.ArrayType) ]
df = df.select(other_cols + array_cols)
Result:
+-----+-----+-----+-----+
| col1| col2| arr1| arr2|
+-----+-----+-----+-----+
|test1|test2|1:2:3|a:b:c|
+-----+-----+-----+-----+
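If you specifically want to route the array columns through a udf (for example to apply logic that concat_ws cannot express), the same schema inspection can drive that as well. A rough sketch, where join_udf is a hypothetical helper rather than anything from the original answer:

import pyspark.sql.functions as F
import pyspark.sql.types as T

# Hypothetical udf that joins array elements into a ':'-separated string
join_udf = F.udf(lambda arr: ":".join(str(x) for x in arr) if arr is not None else None,
                 T.StringType())

# Pick the array-typed columns from the schema and apply the udf to each of them
array_col_names = [f.name for f in df.schema if isinstance(f.dataType, T.ArrayType)]
for name in array_col_names:
    df = df.withColumn(name, join_udf(F.col(name)))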

Convert header (column names) to new dataframe

I have a dataframe with headers, for example outputDF. I now want to take outputDF.columns and create a new dataframe with just one row that contains the column names.
I then want to union both these dataframes and write the result with option("header", "false") so that Spark can write it to HDFS.
How do I do that?
Below is an example:
val df = spark.read.csv("path")
val newDf = df.columns.toSeq.toDF
val unionDf = df.union(newDf)
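One possible approach (sketched here in PySpark; the Scala version would follow the same shape) is to build a one-row dataframe from the column names, cast the data columns to string so the schemas line up, and union the two before writing:

from pyspark.sql import functions as F

df = spark.read.option("header", "true").csv("path")  # columns already named, as in the question

# One-row dataframe whose single row holds the column names
header_df = spark.createDataFrame([tuple(df.columns)], df.columns)

# Put the header row on top of the data, then write without Spark adding its own header
combined = header_df.union(df.select([F.col(c).cast("string") for c in df.columns]))
combined.write.option("header", "false").csv("output_path")  # output_path is illustrative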

How to convert a row from Dataframe to String

I have a dataframe which contains only one row, with the column name source_column, in the below format:
forecast_id:bigInt|period:numeric|name:char(50)|location:char(50)
I want to retrieve this value into a String and then split it on the regex |
First I tried converting the row from the DataFrame into a String in the following way, so that I can check whether the row is converted to a String:
val sourceColDataTypes = sourceCols.select("source_columns").rdd.map(x => x.toString()).collect()
When I print it with println(sourceColDataTypes) to check the content, I see [Ljava.lang.String;@19bbb216.
I couldn't understand the mistake here. Could anyone let me know how I can properly fetch a row from a dataframe and convert it to a String?
You can also try this:
df.show()
//Input data
//+-----------+----------+--------+--------+
//|forecast_id|period |name |location|
//+-----------+----------+--------+--------+
//|1000 |period1000|name1000|loc1000 |
//+-----------+----------+--------+--------+
df.map(_.mkString(",")).show(false)
//Output:
//+--------------------------------+
//|value |
//+--------------------------------+
//|1000,period1000,name1000,loc1000|
//+--------------------------------+
df.rdd.map(_.mkString(",")).collect.foreach(println)
//1000,period1000,name1000,loc1000

How to get a list[String] from a DataFrame

I have a text file in HDFS with a list of ids that I want to read as a list of String. When I do this
spark.read.text(filePath).collect.toList
I get a List[org.apache.spark.sql.Row] instead. How do I read this file into a list of Strings?
If you use spark.read.textFile(filepath) instead, you will get a Dataset[String] instead of a DataFrame (aka Dataset[Row]). Then when you collect, you will get an Array[String] instead of an Array[Row].
You can also convert a DataFrame with a single string column into a Dataset[String] using df.as[String]. So df.as[String].collect will give you an Array[String] from a DataFrame (assuming the DataFrame contains a single string column; otherwise this will fail).
Use map(_.getString(0)) to extract the value from the Row object:
spark.read.text(filePath).map(_.getString(0)).collect.toList