How to unify schemas when writing to Parquet in Apache Spark? - scala

I have a schema, say A, which I use to read:
val DF = spark.read.schema(A.schema).json(inputPath)
Now I have a different schema, say D, which is a union of A + B + C.
When writing to Parquet I want to make sure the data frame is written out with schema D. I am trying to work out how I can achieve this; any ideas on how to approach this problem would be helpful.
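This isn't in the original question, but one common way to do it is to add every column of D that is missing from the frame as a typed null and then select the columns in D's order before writing. A minimal sketch (top-level columns only), assuming D exposes a StructType as D.schema just like A does, and outputPath is a hypothetical destination:
import org.apache.spark.sql.functions.{col, lit}

// Assumption: D.schema is the target StructType (union of A + B + C).
val target = D.schema

// Add each column of D that DF does not already have, as a typed null.
val widened = target.fields.foldLeft(DF) { (acc, f) =>
  if (acc.columns.contains(f.name)) acc
  else acc.withColumn(f.name, lit(null).cast(f.dataType))
}

// Select in D's column order so the Parquet output carries schema D.
widened
  .select(target.fields.map(f => col(f.name).cast(f.dataType)): _*)
  .write
  .parquet(outputPath) // hypothetical output path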

Related

PySpark - distribute GB data to Hive table

Hello PySpark community,
I have to load approximately 5 GB of flat files into Hive tables. The destination table depends on the row content and is computed in the PySpark code. Let's assume there are different destination tables: A, B, C.
What would be the most optimal way to do this? Currently I am considering the following approach, but it is slow because the source dataframe df is scanned three times (once each for A, B and C):
pseudocode:
df = spark.read.load("").cache()  # source path omitted here
for item in ("A", "B", "C"):
    df.filter(df.dest_table == item).write.saveAsTable(item)
Which solution would fit best in this case? Which concepts are worth considering?

Is df.schema an action or a transformation?

I have a schema, say myschema, created manually for building a dataframe.
Now my dataframe, say df, is created.
Then I perform some operations on df, and some of the columns are dropped.
Say the original myschema consists of 500 columns.
After dropping some columns, my df consists of 450 columns.
Now somewhere in my code I need the schema again, but only the schema after those operations have been applied (i.e. the one with 450 columns).
So,
Q1. How expensive is calling df.schema and using it? Is it an action or a transformation?
Q2. Should I instead create another schema, myschema2, by filtering out of myschema the columns that will be dropped, and use that?
Quick answers:
To Q1: schema is neither an action nor a transformation; it doesn't modify the data frame and it doesn't trigger any computation.
To Q2:
If I understand correctly, you have something like this:
val myschema = StructType(someSchema)
val df = spark.createDataFrame(someData, myschema)
// do some transformation (drop, add columns etc)
val df2 = df.drop("column1", "column2").withColumn("new", $"c1" + $"c2")
and you want to get the schema of df2. If so, you can just use:
val myschema2 = df2.schema
Long Answer:
Informally speaking, a DataFrame is an abstraction over a distributed dataset, and, as you already pointed out, there are transformations and actions defined on it.
When you apply a transformation to a data frame, what happens under the hood is that Spark just builds a Directed Acyclic Graph (DAG) describing that transformation.
That DAG is analyzed and used to build an execution plan to get the work done.
Actions, on the other hand, trigger the execution of the plan, which is what actually transforms the data.
The schema of a transformed data frame is derived from the schema of the initial data frame, basically by walking along the DAG. The cost of this derivation is negligible: it doesn't depend on the size of the data, only on how big the DAG is, and in all practical cases you can ignore the time required to get the schema. The schema is just metadata attached to a dataframe.
So to respond to Q2: no, you should not maintain a myschema2 that keeps track of the modifications yourself. Just call df.schema and Spark will derive it for you.
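As a small illustration of that (continuing the df2 example above), retrieving the derived schema does not run a job:
// df2 is the transformed frame from the snippet above; no action is triggered here.
val myschema2 = df2.schema // plan metadata only, no computation
println(myschema2.fieldNames.mkString(", ")) // reflects the dropped and added columns
df2.printSchema() // also just prints the derived metadata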
Hope this clears your doubts.

Spark DataFrame Read And Write

I have a use case where I have to load millions of JSON-formatted records into Apache Hive tables.
So my solution was simply to load them into a dataframe and write them out as Parquet files.
Then I create an external table on top of them.
I am using Apache Spark 2.1.0 with Scala 2.11.8.
It so happens that all the messages follow a sort of flexible schema.
For example, the column "amount" can have the value 1.0 or 1.
Since I am transforming data from a semi-structured format to a structured one, but my schema is slightly variable, I figured the inferSchema option for data sources like JSON would help me:
spark.read.option("inferSchema","true").json(RDD[String])
When I used inferSchema as true while reading the JSON data:
Case 1: for smaller data, all the Parquet files have amount as double.
Case 2: for larger data, some Parquet files have amount as double and others have int64.
I tried to debug and came across concepts like schema evolution and schema merging, which went over my head and left me with more doubts than answers.
My doubts/questions are:
When I infer the schema, does it not enforce the inferred schema onto the full dataset?
Since I cannot enforce a fixed schema due to my constraints, I thought of casting the whole column to the double datatype, as it can hold both integers and decimal numbers.
Is there a simpler way?
My guess is that since the data is partitioned, inferSchema works per partition and then gives me a general schema, but it does not do anything like enforcing that schema. Please correct me if I am wrong.
Note: the reason I am using the inferSchema option is that the incoming data is too flexible/variable to enforce a case class of my own, though some of the columns are mandatory. If you have a simpler solution, please suggest it.
Schema inference really just processes all the rows to find the types.
Once it has done that, it merges the results to find a schema common to the whole dataset.
For example, some of your fields may have values in some rows but not in others; the inferred schema for such a field then becomes nullable.
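For example (a made-up two-record sample, not from the question), a field that is missing in some records is inferred as nullable:
// Hypothetical sample: "amount" appears only in the first record.
val sample = spark.sparkContext.parallelize(Seq(
  """{"id": 1, "amount": 1.0}""",
  """{"id": 2}"""
))
spark.read.json(sample).printSchema()
// root
//  |-- amount: double (nullable = true)
//  |-- id: long (nullable = true)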
To answer your question, it's fine to infer schema for your input.
However, since you intend to use the output in Hive you should ensure all the output files have the same schema.
An easy way to do this is to use casting (as you suggest). I typically like to do a select at the final stage of my jobs and just list all the columns and types. I feel this makes the job more human-readable.
e.g.
import org.apache.spark.sql.types.{IntegerType, StringType}
import spark.implicits._ // for the $"..." column syntax

df
  .coalesce(numOutputFiles)
  .select(
    $"col1".cast(IntegerType).as("col1"),
    $"col2".cast(StringType).as("col2"),
    $"someOtherCol".cast(IntegerType).as("col3")
  )
  .write.parquet(outPath)
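Another option, not spelled out above so treat it as a sketch, is to skip inference entirely and pass an explicit schema when reading, so that "amount" always comes in as a double no matter how it is written in the JSON (the input path and the columns other than "amount" are hypothetical):
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// List only the columns you rely on; JSON fields not listed here are dropped.
val explicitSchema = StructType(Seq(
  StructField("amount", DoubleType, nullable = true),
  StructField("someOtherCol", StringType, nullable = true)
))

val parsed = spark.read.schema(explicitSchema).json(inputPath) // hypothetical input path
parsed.write.parquet(outPath)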

Recursively adding rows to a dataframe

I am new to Spark. I have some JSON data that comes as an HTTP response. I need to store this data in Hive tables. Every HTTP GET request returns a JSON document that becomes a single row in the table. Because of this, I am having to write single rows as files in the Hive table directory.
But I feel that having too many small files will reduce speed and efficiency. So is there a way I can recursively add new rows to the DataFrame and write it to the Hive table directory all at once? I feel this would also reduce the runtime of my Spark code.
Example:
for (i <- 1 to 10) {
  newDF = hiveContext.read.json("path")
  df = df.union(newDF)
}
df.write()
I understand that the dataframes are immutable. Is there a way to achieve this?
Any help would be appreciated. Thank you.
You are mostly on the right track: what you want to do is obtain the individual records as a Seq[DataFrame] and then reduce that Seq[DataFrame] to a single DataFrame by unioning them.
Going from the code you provided:
val BatchSize = 100
val HiveTableName = "table"

(0 until BatchSize)
  .map(_ => hiveContext.read.json("path"))
  .reduce(_ union _)
  .write.insertInto(HiveTableName)
Alternatively, if you want to perform the HTTP requests as you go, we can do that too. Let's assume you have a function that does the HTTP request and converts it into a DataFrame:
def obtainRecord(...): DataFrame = ???
You can do something along the lines of:
val HiveTableName = "table"
val OtherHiveTableName = "other_table"
val jsonArray = ???
val batched: DataFrame =
  jsonArray
    .map(parameter => obtainRecord(parameter))
    .reduce(_ union _)
batched.write.insertInto(HiveTableName)
batched.select($"...").write.insertInto(OtherHiveTableName)
You are clearly misusing Spark. Apache Spark is an analytical system, not a database API. There is no benefit in using Spark to modify a Hive database like this. It will only bring a severe performance penalty without benefiting from any of Spark's features, including distributed processing.
Instead, you should use a Hive client directly to perform transactional operations.
If you can batch-download all of the data first (for example with a script using curl or some other program) and store it in a file (or many files; Spark can load an entire directory at once), you can then load that file or those files into Spark in one go and do your processing there. I would also check whether the web API has any endpoint to fetch all the data you need instead of just one record at a time.
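A sketch of that idea (the directory path is made up): save the responses to files first, then let Spark read the whole directory in one pass:
// Assumes the JSON responses were written beforehand, e.g. one file per response,
// under a single directory such as /data/responses/.
val all = hiveContext.read.json("/data/responses/") // reads every file in the directory
all.write.insertInto("table") // same Hive table name as in the answer above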

How to read a parquet file with lots of columns to a Dataset without a custom case class?

I want to use datasets instead of dataframes.
I'm reading a parquet file and want to infer the types directly:
val df: Dataset[Row] = spark.read.parquet(path)
I don't want a Dataset[Row] but a Dataset of a concrete type.
I know I can do something like:
val df = spark.read.parquet(path).as[myCaseClass]
But my data has many columns! So if I could avoid writing a case class, that would be great!
Why do you want to work with a Dataset? I think it's not only because you get the schema for free (which you have with the resulting DataFrame anyway), but because you get a type-safe schema.
You need an Encoder for your dataset, and to have one you need a type that represents your dataset and hence its schema.
Either you narrow your selection down to a reasonable number of columns and use as[MyCaseClass], or you accept what DataFrame offers.
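For completeness, a minimal sketch of the first option; the case class and the column names here are made up:
import org.apache.spark.sql.Dataset

// Hypothetical case class covering only the columns you actually need downstream.
case class MyCaseClass(id: Long, name: String, amount: Double)

import spark.implicits._ // provides the Encoder for MyCaseClass

val ds: Dataset[MyCaseClass] = spark.read
  .parquet(path)
  .select("id", "name", "amount") // narrow down before converting
  .as[MyCaseClass]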