Should I persist a Spark dataframe if I keep adding columns in it? - scala

I could not find any discussion on below topic in any forum I searched in internet. It may be because I am new to Spark and Scala and I am not asking a valid question. If there are any existing threads discussing the same or similar topic, the links will be very helpful. :)
I am working on a process which uses Spark and Scala and creates a file by reading a lot of tables and deriving a lot of fields by applying logic to the data fetched from tables. So, the structure of my code is like this:
val driver_sql = "SELECT ...";
var df_res = spark.sql(driver_sql)
var df_res = df_res.withColumn("Col1", <logic>)
var df_res = df_res.withColumn("Col2", <logic>)
var df_res = df_res.withColumn("Col3", <logic>)
.
.
.
var df_res = df_res.withColumn("Col20", <logic>)
Basically, there is a driver query which creates the "driver" dataframe. After that, separate logic (functions) is executed based on a key or keys in the driver dataframe to add new columns/fields. The "logic" part is not always a one-line code, sometimes, it is a separate function which runs another query and does some kind of join on df_res and adds a new column. Record count also changes since I use “inner” join with other tables/dataframes in some cases.
So, here are my questions:
Should I persist df_res at any point in time?
Can I persist df_res again and again after columns are added? I mean, does it add value?
If I persist df_res (disk only) every time a new column is added, is the data in the disk replaced? Or does it create a new copy/version of df_res in the disk?
Is there is a better technique to persist/cache data in a scenario like this (to avoid doing a lot of stuff in memory)?

The first thing is persisting a dataframe helps when you are going to apply iterative operations on dataframe.
What you are doing here is applying transformation operation on your dataframes. There is no need to persist these dataframes here.
For eg:- Persisting would be helpful if you are doing something like this.
val df = spark.sql("select * from ...").persist
df.count
val df1 = df.select("..").withColumn("xyz",udf(..))
df1.count
val df2 = df.select("..").withColumn("abc",udf2(..))
df2.count
Now, if you persist df here then it would be beneficial in calculating df1 and df2.
One more thing to notice here is, the reason why I did df.count is because dataframe is persisted only when an action is applied on it. From Spark docs:
"The first time it is computed in an action, it will be kept in memory on the nodes". And this answers your second question as well.
Every time you persist a new copy will be created but you should unpersist the prev one first.

Related

Is df.schema action or transformation?

I have a schema created manually for creating a dataframe say myschema
Now my dataframe say df is created.
Now, I did some operations on df and some of the columns were dropped.
say original myschema consists of 500 columns
Now after dropping some columns, my df consist of 450 columns.
Now somewhere in my code I need the schema back but only the schema after dataframe has applied some operations(ie. having 450 columns).
Now ,
Q1. How optimum is calling df.schema and using it, Is it action or transformation?
Q2. Should I create another myschema2 by filtering out those columns from myschema which will be dropped and use that?
Quick answers:
to Q1: schema is neither an action neither a transformation, in the sense that it doesn't modify the data frame and doesn't trigger any computation.
to Q2:
if I understand well, I guess you have something like this
val myschema = StructType(someSchema)
val df = spark.createDataFrame(someData, myschema)
// do some transformation (drop, add columns etc)
val df2 = df.drop("column1", "column2").withColumn("new", $"c1" + $"c2"))
and you want get the schema of df2. if it is so you can just use
val myschema2 = df2.schema
Long Answer:
Informally speaking, DataFrame is an abstraction over distributed datasets, and as you already pointed out, there are transformations and actions defined on them.
When you makes some transformation on data frames, what happens under the hood is that spark just builds a Directed Acyclic Graph describing that transformations. When
That DAG is analyzed and used to build an execution plan to get the work done
Actions on the other hand trigger the execution of the plan, which is transforming the actual data.
Schema of a transformed data frame is derived from the schema of the initial data frame basically walking along the DAG. The impact of such derivations is _neglectable, it doesn't depend on the size of the data, it depends on how much big is the DAG, but in all practical cases, you can ignore the time required to get schema. Schema is just metadata attached to a dataframe.
So to respond to Q2: No you should not have schema2 taking track of the modification you. Just by calling df.schema Spark will do that for you
hope this clears your doubts

Spark: transform dataframe

I work with Spark 1.6.1 in Scala.
I have one dataframe, and I want to create different dataframe and only want to read 1 time.
For example one dataframe have two columns ID and TYPE, and I want to create two dataframe one with the value of type = A and other with type value = B.
I've checked another posts on stackoverflow, but found only the option to read the dataframe 2 times.
However, I would like another solution with the best performance possible.
Kinds regards.
Spark will read from the data source multiple times if you perform multiple actions on the data. The way to aviod this is to use cache(). In this way, the data will be saved to memory after the first action, which will make subsequent actions using the data faster.
Your two dataframes can be created in this way, requiring only one read of the data source.
val df = spark.read.csv(path).cache()
val dfA = df.filter($"TYPE" === "A").drop("TYPE")
val dfB = df.filter($"TYPE" === "B").drop("TYPE")
The "TYPE" column is dropped as it should be unnecessary after the separation.

Recursively adding rows to a dataframe

I am new to spark. I have some json data that comes as an HttpResponse. I'll need to store this data in hive tables. Every HttpGet request returns a json which will be a single row in the table. Due to this, I am having to write single rows as files in the hive table directory.
But I feel having too many small files will reduce the speed and efficiency. So is there a way I can recursively add new rows to the Dataframe and write it to the hive table directory all at once. I feel this will also reduce the runtime of my spark code.
Example:
for(i <- 1 to 10){
newDF = hiveContext.read.json("path")
df = df.union(newDF)
}
df.write()
I understand that the dataframes are immutable. Is there a way to achieve this?
Any help would be appreciated. Thank you.
You are mostly on the right track, what you want to do is to obtain multiple single records as a Seq[DataFrame], and then reduce the Seq[DataFrame] to a single DataFrame by unioning them.
Going from the code you provided:
val BatchSize = 100
val HiveTableName = "table"
(0 until BatchSize).
map(_ => hiveContext.read.json("path")).
reduce(_ union _).
write.insertInto(HiveTableName)
Alternatively, if you want to perform the HTTP requests as you go, we can do that too. Let's assume you have a function that does the HTTP request and converts it into a DataFrame:
def obtainRecord(...): DataFrame = ???
You can do something along the lines of:
val HiveTableName = "table"
val OtherHiveTableName = "other_table"
val jsonArray = ???
val batched: DataFrame =
jsonArray.
map { parameter =>
obtainRecord(parameter)
}.
reduce(_ union _)
batched.write.insertInto(HiveTableName)
batched.select($"...").write.insertInto(OtherHiveTableName)
You are clearly misusing Spark. Apache Spark is analytical system, not a database API. There is no benefit of using Spark to modify Hive database like this. It will only bring a severe performance penalty without benefiting from any of the Spark features, including distributed processing.
Instead you should use Hive client directly to perform transactional operations.
If you can batch-download all of the data (for example with a script using curl or some other program) and store it in a file first (or many files, spark can load an entire directory at once) you can then load that file(or files) all at once into spark to do your processing. I would also check to see it the webapi as any endpoints to fetch all the data you need instead of just one record at a time.

Is it inefficient to manually iterate Spark SQL data frames and create column values?

In order to run a few ML algorithms, I need to create extra columns of data. Each of these columns involves some fairly intense calculations that involves keeping moving averages and recording information as you go through each row (and updating it meanwhile). I've done a mock through with a simple Python script and it works, and I am currently looking to translate it to a Scala Spark script that could be run on a larger data set.
The issue is it seems that for these to be highly efficient, using Spark SQL, it is preferred to use the built in syntax and operations (which are SQL-like). Encoding the logic in a SQL expression seems to be a very thought-intensive process, so I'm wondering what the downsides will be if I just manually create the new column values by iterating through each row, keeping track of variables and inserting the column value at the end.
You can convert an rdd into dataframe. Then use map on the data frame and process each row as you wish. If you need to add new column, then you can use, withColumn. However this will only allow one column to be added and it happens for the entire dataframe. If you want more columns to be added, then inside map method,
a. you can gather new values based on the calculations
b. Add these new column values to main rdd as below
val newColumns: Seq[Any] = Seq(newcol1,newcol2)
Row.fromSeq(row.toSeq.init ++ newColumns)
Here row, is the reference of row in map method
c. Create new schema as below
val newColumnsStructType = StructType{Seq(new StructField("newcolName1",IntegerType),new StructField("newColName2", IntegerType))
d. Add to the old schema
val newSchema = StructType(mainDataFrame.schema.init ++ newColumnsStructType)
e. Create new dataframe with new columns
val newDataFrame = sqlContext.createDataFrame(newRDD, newSchema)

Output Sequence while writing to HDFS using Apache Spark

I am working on a project in apache Spark and the requirement is to write the processed output from spark into a specific format like Header -> Data -> Trailer. For writing to HDFS I am using the .saveAsHadoopFile method and writing the data to multiple files using the key as a file name. But the issue is the sequence of the data is not maintained files are written in Data->Header->Trailer or a different combination of three. Is there anything I am missing with RDD transformation?
Ok so after reading from StackOverflow questions, blogs and mail archives from google. I found out how exactly .union() and other transformation works and how partitioning is managed. When we use .union() the partition information is lost by the resulting RDD and also the ordering and that's why My output sequence was not getting maintained.
What I did to overcome the issue is numbering the Records like
Header = 1, Body = 2, and Footer = 3
so using sortBy on RDD which is union of all three I sorted it using this order number with 1 partition. And after that to write to multiple file using key as filename I used HashPartitioner so that same key data should go into separate file.
val header: RDD[(String,(String,Int))] = ... // this is my header RDD`
val data: RDD[(String,(String,Int))] = ... // this is my data RDD
val footer: RDD[(String,(String,Int))] = ... // this is my footer RDD
val finalRDD: [(String,String)] = header.union(data).union(footer).sortBy(x=>x._2._2,true,1).map(x => (x._1,x._2._1))
val output: RDD[(String,String)] = new PairRDDFunctions[String,String](finalRDD).partitionBy(new HashPartitioner(num))
output.saveAsHadoopFile ... // and using MultipleTextOutputFormat save to multiple file using key as filename
This might not be the final or most economical solution but it worked. I am also trying to find other ways to maintain the sequence of output as Header->Body->Footer. I also tried .coalesce(1) on all three RDD's and then do the union but that was just adding three more transformation to RDD's and .sortBy function also take partition information which I thought will be same, but coalesceing the RDDs first also worked. If Anyone has some another approach please let me know, or add more to this will be really helpful as I am new to Spark
References:
Write to multiple outputs by key Spark - one Spark job
Ordered union on spark RDDs
http://apache-spark-user-list.1001560.n3.nabble.com/Union-of-2-RDD-s-only-returns-the-first-one-td766.html -- this one helped a lot