How to save struct column as string to CSV/TSV in PySpark?

I've seen similar questions asked many times, but there's no clear answer to something that should be easy.
How can a struct column be saved to CSV (TSV, actually) in PySpark? I want to serialize it to JSON and save that string.
I have a dataframe, read from Parquet, with the following schema:
timestamp:long
timezoneOffset:string
dayInterval:integer
speed:double
heading:double
ignitionStatus:integer
segmentId:string
pointMM:struct
    mmResult:array
        element:struct
            primitiveId:long
            rnId:integer
            internalId:integer
            isFromTo:boolean
            offset:double
            probability:double
            distanceToArc:double
            headingDifference:double
    isSuccessful:boolean
The pointMM column is a struct that contains an array of structs and a boolean field (isSuccessful). I'm able to read this data from Parquet and preview it.
When I try to save this data to CSV/TSV with:
df.write.csv(output_path, sep='\t')
AnalysisException: CSV data source does not support struct<mmResult:array<struct<primitiveId:bigint,rnId:int,internalId:int,isFromTo:boolean,offset:double,probability:double,distanceToArc:double,headingDifference:double>>,isSuccessful:boolean> data type.
Is there a way, ideally an easy one, to convert the pointMM column to a JSON string and save it to TSV?
Is there a way to do it by explicitly stating the schema of pointMM? Or, even better, a way to do it without knowing the schema?
I don't understand why this is difficult, because, as you can see in the attached screenshot, the column is shown in JSON format.
EDIT 1: I understand that the display() function somehow serializes the struct columns. Is there a way to use the same serialization without reinventing the wheel?
EDIT 2: .printSchema() shows the schema of the DataFrame. Can this somehow be used to help serialize the pointMM column to JSON?

Use df.dtypes to get the type of each column. If the type is a struct, use to_json on that column to convert the data into a JSON string; otherwise select the column as it is:
from pyspark.sql import functions as F

# Serialize struct columns to JSON strings; keep all other columns unchanged.
cols = [F.to_json(c[0]).alias(c[0]) if c[1].startswith("struct") else F.col(c[0]) for c in df.dtypes]
df.select(cols).show(truncate=False)
Output:
+-----------+-------+--------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+-----+---------+--------------+
|dayInterval|heading|ignitionStatus|pointMM |segmentId|speed|timestamp|timezoneOffset|
+-----------+-------+--------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+-----+---------+--------------+
|1 |271.1 |4 |{"isSuccessful":true,"mmResult":[{"distanceToArc":12.211,"headingDifference":12.1,"internalId":5,"isFromTo":true,"offset":12.1,"primitiveId":12,"probability":0.12,"rnId":4},{"distanceToArc":12.211,"headingDifference":12.1,"internalId":5,"isFromTo":true,"offset":12.1,"primitiveId":13,"probability":0.12,"rnId":4}]}|abc |12.4 |12345678 |+1 |
+-----------+-------+--------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+-----+---------+--------------+
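To go from this preview to an actual TSV file, the same select can feed the CSV writer directly. A minimal sketch, reusing output_path from the question (header=True is an optional extra, not part of the original code):

from pyspark.sql import functions as F

# Serialize every struct column to a JSON string, keep the rest as-is,
# then write the flattened result as a tab-separated file.
cols = [F.to_json(c[0]).alias(c[0]) if c[1].startswith("struct") else F.col(c[0]) for c in df.dtypes]
df.select(cols).write.csv(output_path, sep='\t', header=True)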

Related

Retaining Column Ordering Pyspark

I have a Synapse Analytics notebook in which I read a CSV file into a PySpark dataframe. When I write this dataframe to a JSON file, the column order changes to alphabetical order. Can someone help me retain the column order without hardcoding the column names in the notebook?
For example, when I do df.show() I get BCol, CCol, ACol.
When I write to the JSON file it is written as {ACol ='';BCol='';CCol=''}. I am not able to retain the order of the values.
I am using the following code to write to the JSON file:
df.coalesce(1).write.format("json").mode("overwrite").save(dest_location)
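A hedged sketch of one possible workaround, assuming the alphabetical order shows up when the JSON is read back with an inferred schema (Spark sorts inferred JSON fields by name): remember the original column order and re-select it after the read.

# Remember the column order of the source dataframe before writing.
ordered_cols = df.columns

df.coalesce(1).write.format("json").mode("overwrite").save(dest_location)

# Re-apply the saved order when reading the JSON back.
df_back = spark.read.json(dest_location).select(ordered_cols)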

how to replace missing values from another column in PySpark?

I want to use the values in t5 to replace some missing values in t4. I searched for code, but it doesn't work for me.
Current:
example of current
Goal:
example of target
df is a dataframe. Code:
pdf = df.toPandas()
from pyspark.sql.functions import coalesce
pdf.withColumn("t4", coalesce(pdf.t4, pdf.t5))
Error: 'DataFrame' object has no attribute 'withColumn'
I also tried the following code previously; it didn't work either.
new_pdf=pdf['t4'].fillna(method='bfill', axis="columns")
Error: No axis named columns for object type
As the error indicates, .withColumn() is not a method of pandas dataframes but of Spark dataframes. Note that when using .toPandas() your pdf becomes a pandas dataframe, so if you want to use .withColumn(), avoid that conversion.
UPDATE:
If pdf is a pandas dataframe you can do:
pdf['t4']=pdf['t4'].fillna(pdf['t5'])
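For reference, a minimal sketch of the Spark-native version of the same fix, skipping the .toPandas() conversion entirely and staying with the Spark dataframe df:

from pyspark.sql.functions import coalesce

# Fill missing t4 values from t5 directly on the Spark dataframe.
df = df.withColumn("t4", coalesce(df.t4, df.t5))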

Scala - Writing dataframe to a file as binary

I have a Hive table stored as Parquet, with a column Content storing various documents base64-encoded.
Now I need to read that column and write it to a file in HDFS, so that the base64 column is converted back into a document for each row.
val profileDF = sqlContext.read.parquet("/hdfspath/profiles/");
profileDF.registerTempTable("profiles")
val contentsDF = sqlContext.sql(" select unbase64(contents) as contents from profiles where file_name = 'file1'")
Now contentsDF stores the binary form of a document in each row, which I need to write to a file. I have tried different options but couldn't get the dataframe content written out to a file.
Appreciate any help regarding this.
I would suggest saving as Parquet:
https://spark.apache.org/docs/1.6.3/api/java/org/apache/spark/sql/DataFrameWriter.html#parquet(java.lang.String)
Or convert to an RDD and use saveAsObjectFile:
https://spark.apache.org/docs/1.6.3/api/java/org/apache/spark/rdd/RDD.html#saveAsObjectFile(java.lang.String)

Datatype conversion of Parquet using Spark SQL - dynamically, without specifying a column name explicitly

I am looking for a way to handle the data type conversion dynamically with Spark DataFrames. I am loading the data into a DataFrame using a Hive SQL query and then writing it to a Parquet file. Hive is unable to read some of the data types, and I want to convert the decimal datatypes to Double. Instead of specifying each column name separately, is there any way to handle the datatype dynamically? Let's say my dataframe has 50 columns, 8 of which are decimals, and I need to convert all 8 of them to the Double datatype without specifying a column name. Can we do that directly?
There is no direct way to do this data type conversion, but here are some options:
Either cast those columns in the Hive query,
or
create/use a case class with the data types you require, populate it with the data, and use it to generate the Parquet file,
or
read the data types from the Hive query metadata and use dynamic code to achieve option one or option two.
There are two options:
1. Use the schema from the dataframe and dynamically generate the query statement
2. Use the create table ... select * option with Spark SQL
This is already answered and this post has details, with code.
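For what it's worth, a PySpark sketch in the spirit of option 1, building the casts from the dataframe's own dtypes so no decimal column has to be named explicitly (output_path is a placeholder):

from pyspark.sql import functions as F

# Cast every decimal(p,s) column to double; keep the other columns as they are.
cast_cols = [F.col(name).cast("double").alias(name) if dtype.startswith("decimal") else F.col(name)
             for name, dtype in df.dtypes]
df.select(cast_cols).write.parquet(output_path)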

Edit csv file in Scala

I would like to edit a CSV file (more than 500MB).
If I have data like
ID, NUMBER
A, 1
B, 3
C, 4
D, 5
I want to add an extra column, like this:
ID, NUMBER, DIFF
A, 1, 0
B, 3, 2
C, 4, 1
D, 5, 1
This data should also be available as Scala data types.
(in) Original CSV file -> (out) (new CSV file, file data (RDD type?))
Q1. Which is the best way to handle the data?
Make a new CSV file from the original CSV file, and then re-open the new CSV file as Scala data.
Or make the new Scala data first and then write it out as a CSV file.
Q2. Do I need to use a 'dataframe' for this? Which library or API should I use?
A fairly trivial way to achieve that is to use kantan.csv:
import kantan.csv.ops._
import kantan.csv.generic.codecs._
import java.io.File
case class Output(id: String, number: Int, diff: Int)
case class Input(id: String, number: Int)
val data = new File("input.csv").asUnsafeCsvReader[Input](',', true)
.map(i => Output(i.id, i.number, 1))
new File("output.csv").writeCsv[Output](data.toIterator, ',', List("ID", "NUMBER", "DIFF"))
This code will work regardless of the data size, since at no point do we load the entire dataset (or, indeed, more than one row) in memory.
Note that in my example code, data comes from and goes to File instances, but it could come from anything that can be turned into a Reader instance - a URI, a String...
RDD vs DataFrame: both are good options. The recommendation is to use DataFrames which allows some extra optimizations behind the scenes, but for simple enough tasks the performance is probably similar. Another advantage of using DataFrames is the ability to use SQL - if you're comfortable with SQL you can just load the file, register it as temp table and query it to perform any transformation. A more relevant advantage of DataFrames is the ability to use databricks' spark-csv library to easily read and write CSV files.
Let's assume you will use DataFrames (DF) for now:
Flow: it sounds like you should:
1. Load the original file to a DF, call it input
2. Transform it to the new DF, called withDiff
3. At this point, it would make sense to cache the result; let's call the cached DF result
4. Now you can save result to the new CSV file
5. Use result again for whatever else you need