Datatype conversion of Parquet using Spark SQL - dynamically, without specifying a column name explicitly - pyspark

I am looking for a way to handle data type conversion dynamically. I am loading data into a Spark DataFrame using a Hive SQL query and then writing the dataframe to a Parquet file. Hive is unable to read some of the data types, and I want to convert the decimal data types to Double. Instead of specifying each column name separately, is there any way to handle the data types dynamically? Let's say my dataframe has 50 columns, of which 8 are decimals, and I need to convert all 8 of them to the Double data type without specifying the column names. Can we do that directly?

There is no direct way to do this data type conversion; here are some options.
Either cast those columns in the Hive query,
or
create/use a case class with the data types you require, populate it with the data, and use it to generate the Parquet file,
or
read the data types from the Hive query metadata and use dynamic code to achieve option one or option two.

There are two options:
1. Use the schema from the dataframe and dynamically generate the query statement (see the sketch below).
2. Use the create table...select * option with Spark SQL.
This is already answered and this post has the details, with code.
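For option 1, a minimal PySpark sketch of the idea, assuming the Hive query result is already loaded into a DataFrame df (the output path is a placeholder):
from pyspark.sql import functions as F

# df.dtypes returns (name, type) pairs, e.g. ("price", "decimal(10,2)").
# Cast every decimal column to double and leave the rest untouched.
cols = [
    F.col(name).cast("double").alias(name) if dtype.startswith("decimal") else F.col(name)
    for name, dtype in df.dtypes
]

df.select(cols).write.parquet("/tmp/output_parquet")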

Related

How to save struct column as string to CSV/TSV in PySpark?

I've seen similar questions asked many times, but there's no clear answer to something that should be easy.
How can a struct column be saved to CSV (tsv actually) in PySpark? I want to serialize it and save as JSON.
I have a dataframe, which I read from parquet, that contains the following schema:
timestamp:long
timezoneOffset:string
dayInterval:integer
speed:double
heading:double
ignitionStatus:integer
segmentId:string
pointMM:struct
    mmResult:array
        element:struct
            primitiveId:long
            rnId:integer
            internalId:integer
            isFromTo:boolean
            offset:double
            probability:double
            distanceToArc:double
            headingDifference:double
    isSuccessful:boolean
The pointMM column is a struct that contains an array of structs, plus another bool field (isSuccessful). I'm able to read this data from parquet and preview it.
If I want to save this data to CSV/TSV, I get the following error:
df.write.csv(output_path, sep='\t')
AnalysisException: CSV data source does not support struct<mmResult:array<struct<primitiveId:bigint,rnId:int,internalId:int,isFromTo:boolean,offset:double,probability:double,distanceToArc:double,headingDifference:double>>,isSuccessful:boolean> data type.
Is there a way, ideally an easy one, to convert the pointMM column to a JSON string and save it to TSV?
Is there a way to do it by explicitly stating the schema of pointMM? Or, even better, a way to do it without knowing the schema?
I don't understand why this is difficult, because as you can see in the attached screenshot, that column is shown in JSON format.
EDIT 1: I understand that the display() function somehow serializes the struct column. Is there a way to use the same serialization without reinventing the wheel?
EDIT 2: .printSchema() shows the schema of the DataFrame. Can this somehow be used to help serialize the pointMM column to JSON?
Use df.dtypes to get the type of each column. If the type is a struct, use to_json on that column to convert the data into a JSON string; otherwise select the column as it is:
from pyspark.sql import functions as F
# df.dtypes yields (name, type) pairs: serialize struct columns with to_json, keep the rest as-is
cols = [F.to_json(c[0]).alias(c[0]) if c[1].startswith("struct") else F.col(c[0]) for c in df.dtypes]
df.select(cols).show(truncate=False)
Output:
+-----------+-------+--------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+-----+---------+--------------+
|dayInterval|heading|ignitionStatus|pointMM |segmentId|speed|timestamp|timezoneOffset|
+-----------+-------+--------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+-----+---------+--------------+
|1 |271.1 |4 |{"isSuccessful":true,"mmResult":[{"distanceToArc":12.211,"headingDifference":12.1,"internalId":5,"isFromTo":true,"offset":12.1,"primitiveId":12,"probability":0.12,"rnId":4},{"distanceToArc":12.211,"headingDifference":12.1,"internalId":5,"isFromTo":true,"offset":12.1,"primitiveId":13,"probability":0.12,"rnId":4}]}|abc |12.4 |12345678 |+1 |
+-----------+-------+--------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+-----+---------+--------------+
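To actually write the result as a TSV file instead of just showing it, the same select can feed the CSV writer (output_path as in the question; header=True is optional):
df.select(cols).write.csv(output_path, sep='\t', header=True)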

Spark : Dynamic generation of the query based on the fields in s3 file

Oversimplified Scenario:
A process generates monthly data in an s3 file. The number of fields can be different in each monthly run. Based on this data in s3, we load it into a table and manually run a SQL query for a few metrics (manually, because the number of fields can change in each run with the addition or deletion of a few columns). There are more calculations/transforms on this data, but as a starter I'm presenting a simpler version of the use case.
Approach:
Given the schema-less nature of the data, with the number of fields in the s3 file differing in each run as fields are added or deleted, which requires manual changes to the SQL every time, I'm planning to explore Spark/Scala so that we can read directly from s3 and dynamically generate the SQL based on the fields.
Query:
How can I achieve this in Scala/Spark SQL/DataFrames? The s3 file contains only the required fields from each run, so there is no issue reading the dynamic fields from s3, as that is taken care of by the dataframe. The issue is how to generate the DataFrame API/Spark SQL code to handle it.
I can read the s3 file via a dataframe and register it with createOrReplaceTempView to write SQL, but I don't think that helps avoid manually changing the Spark SQL when a new field is added to s3 in the next run. What is the best way to dynamically generate the SQL, or are there better ways to handle the issue?
Usecase-1:
First run
dataframe: customer, month_1_count (here dataframe directly points to s3, which has only the required attributes)
--sample code
SELECT customer,sum(month_1_count)
FROM dataframe
GROUP BY customer
--Dataframe API/SparkSQL
dataframe.groupBy("customer").sum("month_1_count").show()
Second run - one additional column was added
dataframe: customer, month_1_count, month_2_count (here dataframe directly points to s3, which has only the required attributes)
--Sample SQL
SELECT customer,sum(month_1_count),sum(month_2_count)
FROM dataframe
GROUP BY customer
--Dataframe API/SparkSQL
dataframe.groupBy("customer").sum("month_1_count","month_2_count").show()
I'm new to Spark/Scala; it would be helpful if you could provide some direction so that I can explore further.
It sounds like you want to perform the same operation over and over again on new columns as they appear in the dataframe schema? This works:
from pyspark.sql import functions
#search for column names you want to sum, I put in "month"
column_search = lambda col_names: 'month' in col_names
#get column names of temp dataframe w/ only the columns you want to sum
relevant_columns = original_df.select(*filter(column_search, original_df.columns)).columns
#create dictionary with relevant column names to be passed to the agg function
columns = {col_names: "sum" for col_names in relevant_columns}
#apply agg function with your groupBy, passing in columns dictionary
grouped_df = original_df.groupBy("customer").agg(columns)
#show result
grouped_df.show()
Some important concepts that can help you learn:
DataFrames keep their column names in a list: dataframe.columns
Functions can be applied to lists to create new lists, as in column_search
The agg function accepts multiple expressions in a dictionary, as explained here, which is what I pass in as columns
Spark is lazy, so it doesn't change data state or perform operations until you perform an action like show(). This means that writing out temporary dataframes just to use one element of the dataframe, such as the columns, as I do here, is not costly, even though it may seem inefficient if you're used to SQL.
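If you would rather stay with Spark SQL (using createOrReplaceTempView, as mentioned in the question) instead of the DataFrame API, the query string itself can be built from whatever columns show up in this run. A sketch, reusing original_df and the 'month' naming from above:
# register the dataframe read from s3 as a temp view
original_df.createOrReplaceTempView("monthly_data")

# build the SELECT list from whatever month_* columns exist in this run
sum_exprs = ", ".join("sum({0}) AS {0}_total".format(c) for c in original_df.columns if "month" in c)
query = "SELECT customer, {0} FROM monthly_data GROUP BY customer".format(sum_exprs)
spark.sql(query).show()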

Spark DataFrame Read And Write

I have a use case where I have to load millions of JSON-formatted records into Apache Hive tables.
My solution was simply to load them into a dataframe and write them out as Parquet files.
Then I shall create an external table on top of them.
I am using Apache Spark 2.1.0 with Scala 2.11.8.
It so happens that all the messages follow a sort of flexible schema.
For example, a column "amount" can have the value 1.0 or 1.
Since I am transforming data from a semi-structured format to a structured format, but my schema is slightly variable, I compensated by assuming that the inferSchema option for data sources like JSON would help me.
spark.read.option("inferSchema","true").json(RDD[String])
When I used inferSchema as true while reading the JSON data:
case 1: for smaller data, all the parquet files have amount as double.
case 2: for larger data, some parquet files have amount as double and others have int64.
I tried to debug and found certain concepts like schema evolution and schema merging, which went over my head, leaving me with more doubts than answers.
My doubts/questions are:
When I try to infer the schema, does it not enforce the inferred schema onto the full dataset?
Since I cannot enforce any schema due to my constraints, I thought of casting the whole column to the double data type, as it can have both integers and decimal numbers. Is there a simpler way?
My guess is that, since the data is partitioned, inferSchema works per partition and then gives me a general schema, but it does not do anything like enforcing the schema or anything of that sort. Please correct me if I am wrong.
Note: the reason I am using the inferSchema option is that the incoming data is too flexible/variable to enforce a case class of my own, though some of the columns are mandatory. If you have a simpler solution, please suggest it.
Schema inference really just processes all the rows to find the types.
Once it does that, it merges the results to find a schema common to the whole dataset.
For example, some of your fields may have values in some rows but not in others, so the inferred schema for such a field becomes nullable.
To answer your question, it's fine to infer schema for your input.
However, since you intend to use the output in Hive you should ensure all the output files have the same schema.
An easy way to do this is to use casting (as you suggest). I typically like to do a select at the final stage of my jobs and just list all the columns and types. I feel this makes the job more human-readable.
e.g.
df
  .coalesce(numOutputFiles)
  .select(
    $"col1".cast(IntegerType).as("col1"),
    $"col2".cast(StringType).as("col2"),
    $"someOtherCol".cast(IntegerType).as("col3")
  )
  .write.parquet(outPath)
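If you don't want to list every column by hand, the same final-select cast can also be generated dynamically. A sketch of the idea in PySpark (only "amount" comes from the question; the set of columns to force is an assumption you would fill in):
from pyspark.sql import functions as F

# Columns that must always be written as double, whatever inferSchema picked.
force_double = {"amount"}

cols = [F.col(c).cast("double").alias(c) if c in force_double else F.col(c) for c in df.columns]
df.select(cols).write.parquet(out_path)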

How can I resolve table names to Parquet on the fly?

I need to run Spark SQL queries with my own custom correspondence from table names to Parquet data. Reading Parquet data to DataFrames with sqlContext.read.parquet and registering the DataFrames with df.registerTempTable isn't cutting it for my use case, because those calls have to be run before the SQL query, when I might not even know what tables are needed.
Rather than using registerTempTable, I'm trying to write an Analyzer that resolves table names using my own logic. However, I need to be able to resolve an UnresolvedRelation to a LogicalPlan representing Parquet data, but sqlContext.read.parquet gives a DataFrame, not a LogicalPlan.
A DataFrame seems to have a logicalPlan attribute, but that's marked protected[sql]. There's also a ParquetRelation class, but that's private[sql]. That's all I found for ways to get a LogicalPlan.
How can I resolve table names to Parquet with my own logic? Am I even on the right track with Analyzer?
You can actually retrieve the logicalPlan of your DataFrame with
val myLogicalPlan: LogicalPlan = myDF.queryExecution.logical

Are there disadvantages on using as partition column a non-primitive column (date) in Hive?

Is there any reason why I shouldn't use a column formatted as date as the partitioning column in a table in Apache Hive?
The official documentation says:
Although currently there is no restriction on the data type of the partitioning column, allowing non-primitive columns to be partitioning column probably doesn't make sense. The dynamic partitioning column's type should be derived from the expression. The data type has to be able to be converted to a string in order to be saved as a directory name in HDFS.
https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions#DynamicPartitions-Designissues
I don't see why columns formatted as date would create any issue, since by design these could be converted into string.