Let's say I do something like this:
def readDataset: Dataset[Row] = ???
val ds1 = readDataset.cache();
val ds2 = ds1.withColumn("new", lit(1)).cache();
Will ds2 and ds1 share all the data in the columns except the "new" column added to ds2? If I cache both datasets, will Spark store both whole datasets ds1 and ds2 in memory, or will the shared data be stored only once?
If the data is shared, when is that sharing broken (so that the same data ends up stored in two memory locations)?
I know that Datasets and RDDs are immutable, but I couldn't find a clear answer on whether they share data or not.
In short: the cached data will not be shared.
Experimental proof to convince you, with a code snippet and the corresponding memory usage as reported in the Spark UI:
val df = spark.range(10000000).cache()
val df2 = df.withColumn("other", col("id")*3)
df2.count()
uses about 10 MB of memory,
while
val df = spark.range(10000000).cache()
val df2 = df.withColumn("other", col("id")*3).cache()
df2.count()
uses about 30 MB:
for df: 10 MB
for df2: 10 MB for the copied column and another 10 MB for the new one.
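If you prefer to check this without opening the UI, a rough sketch (assuming access to the underlying SparkContext; getRDDStorageInfo is a @DeveloperApi, so treat it as a debugging aid only) is:
// Inspect cached blocks programmatically instead of via the Spark UI Storage tab
spark.sparkContext.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.memSize / (1024 * 1024)} MB in memory, ${info.numCachedPartitions} cached partitions")
}
Each cached plan shows up as its own entry, so shared columns are counted once per cached DataFrame, consistent with the numbers above.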
val spark = SparkSession.builder().appName("Spark SQL basic example").config("spark.master", "local").getOrCreate()
import spark.implicits._
case class Something(id: Int, batchId: Option[String], div: String)
val sth1 = Something(1, Some("1000"), "10")
val sth2 = Something(2, Some("1000"), "10")
val sth3 = Something(3, Some("1000"), "10")
val sth4 = Something(4, Some("1000"), "10")
val ds = Seq(sth1, sth2, sth3, sth4).toDS()
ds.write.mode("overwrite").option("path", "local_path").bucketBy(3, "id").saveAsTable("Tmp")
I go to the local_path where it stores the data, but I only find two parquet files. I wonder why it doesn't create 3 parquet files, which is the number of buckets.
I have also tried setting the bucket number to 1 or 2, and it does affect the number of parquet files stored in the local path: when the bucket number is 1 there is only 1 parquet file, and similarly when it is 2.
You should use the Dataset.repartition operator to control the number of output files.
You can still use bucketBy in combination with repartition, but bucketBy has a different purpose: avoiding shuffles in joins whose join keys match the bucketing keys (a sketch of that use case follows the snippet below).
ds.repartition(3)
.write
.mode("overwrite")
.option("path", "loacl_path")
.bucketBy(3, "id")
.saveAsTable("Tmp")
bucketBy is probably not what you're looking for if you expect your data to be written into 3 parquet files. When you use bucketBy, you define the column names, and a hash function divides your data into the number of buckets you specified; that doesn't necessarily mean the data is saved in n files. Bucketing is used to boost query performance (somewhat similar to indexing, but not the same). I haven't tried this yet, but what you're probably looking for is the repartition method.
df.repartition(3)
.write.mode(SaveMode.Overwrite)
.option("path", "local_path")
.saveAsTable("Tmp")
I have been trying to do multiple aggregations on a base data frame, let's say df1.
When I run the following code:
df1.cache()
val df2 = df1.groupBy(col("col1"),col("col2") as "col6").agg(sum("col3"))
val df3 = df1.groupBy(col("col1"),col("col4") as "col6").agg(sum("col5"))
val df4 = df2.join(df3,Seq("col1","col6"),"outer")
df4.count()
In the generated query plan, on the SQL tab of the Spark UI, I see that df2 is an in-memory table scan of df1, while the complete DAG of df1 is executed to generate df3.
When I rename col1 as well while doing the join:
df1.cache()
val df2 = df1.groupBy(col("col1") as "col1",col("col2") as "col6").agg(sum("col3"))
val df3 = df1.groupBy(col("col1") as "col1",col("col4") as "col6").agg(sum("col5"))
val df4 = df2.join(df3,Seq("col1","col6"),"outer")
df4.count()
Both DataFrames are in-memory table scans.
I didn't think this would make a difference; can someone please explain to me why this could be happening?
PS: One more thing I noticed is that without the join, the query plans of both DataFrames are in-memory table scans.
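For anyone trying to reproduce this, a quick way to check cache reuse without opening the UI is to look for InMemoryTableScan nodes in the physical plan (a small sketch, reusing the column names from the question):
import org.apache.spark.sql.functions.{col, sum}

df1.cache()
df1.count() // materialize the cache first

val df2 = df1.groupBy(col("col1") as "col1", col("col2") as "col6").agg(sum("col3"))
df2.explain() // a reused cache appears as InMemoryTableScan instead of df1's full DAG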
I'm trying to do a count in Scala with a DataFrame. My data has 3 columns, and I've already loaded the data and split it by tab. So I want to do something like this:
val file = file.map(line=>line.split("\t"))
val x = file.map(line=>(line(0), line(2).toInt)).reduceByKey(_+_,1)
I want to put the data in a DataFrame, but I'm having some trouble with the syntax:
val file = file.map(line=>line.split("\t")).toDF
val file.groupby(line(0))
.count()
Can someone help check if this is correct?
Spark needs to know the schema of the DataFrame.
There are many ways to specify the schema; here is one option (another is sketched after the snippet):
val df = file
.map(line=>line.split("\t"))
.map(l => (l(0), l(2).toInt)) // at this point spark knows the number of columns and their types
.toDF("a", "b") //give the columns names for ease of use
df
.groupBy("a")
.count()
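Another option is to declare the schema explicitly and let the CSV reader apply it. This is an illustrative sketch, assuming a SparkSession named spark and a tab-separated input file; the column names and path are made up:
import org.apache.spark.sql.types._

// Hypothetical names for the three tab-separated columns from the question
val schema = StructType(Seq(
  StructField("key",   StringType,  nullable = true),
  StructField("other", StringType,  nullable = true),
  StructField("value", IntegerType, nullable = true)
))

val df = spark.read
  .option("sep", "\t")
  .schema(schema)
  .csv("path/to/file.tsv") // hypothetical path

df.groupBy("key").count().show()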
I am new to Spark 1.6. I'd like to read a parquet file and process it.
To simplify, suppose I have a parquet file with this structure:
id, amount, label
and I have 3 rules:
amount < 10000 => label=LOW
10000 < amount < 100000 => label=MEDIUM
amount > 100000 => label = HIGH
How can I do it in Spark and Scala?
I tried something like this:
case class SampleModels(
  id: String,
  amount: Double,
  label: String
)
val sc = SparkContext.getOrCreate()
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val df = sqlContext.read.parquet("/path/file/")
val ds = df.as[SampleModels].map( row=>
MY LOGIC
WRITE OUTPUT IN PARQUET
)
Is this the right approach? Is it efficient? "MY LOGIC" could be more complex.
Thanks
Yes, it's the right way to work with Spark. If your logic is simple, you can try to use the built-in functions to operate on the DataFrame directly (like when in your case); it will be a little faster than mapping rows to a case class and executing code in the JVM, and you will be able to save the results back to parquet easily.
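For example, a minimal sketch of that built-in-function route, using the thresholds from the question (the output path is made up):
import org.apache.spark.sql.functions.{col, lit, when}

val labelled = df.withColumn("label",
  when(col("amount") < 10000, lit("LOW"))
    .when(col("amount") < 100000, lit("MEDIUM"))
    .otherwise(lit("HIGH")))

// hypothetical output path
labelled.write.mode("overwrite").parquet("/path/output/")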
Yes, it is a correct approach.
It will do one pass over your complete data to build the extra column you need.
If you want a SQL way, this is the way to go:
val df = sqlContext.read.parquet("/path/file/")
df.registerTempTable("MY_TABLE")
val df2 = sqlContext.sql("select *, case when amount < 10000 then 'LOW' else 'HIGH' end as label from MY_TABLE")
Remember to use a HiveContext instead of a plain SQLContext though.
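To cover all three rules from the question, the CASE expression would look something like this (a sketch reusing the same temp table; the output path is made up):
val df2 = sqlContext.sql("""
  SELECT *,
         CASE WHEN amount < 10000  THEN 'LOW'
              WHEN amount < 100000 THEN 'MEDIUM'
              ELSE 'HIGH'
         END AS label
  FROM MY_TABLE
""")
df2.write.mode("overwrite").parquet("/path/output/") // hypothetical output path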
I have an RDD whose elements are of type (Long, String). For some reason, I want to save the whole RDD into the HDFS, and later also read that RDD back in a Spark program. Is it possible to do that? And if so, how?
It is possible.
On an RDD you have the saveAsObjectFile and saveAsTextFile functions. Tuples are stored as (value1, value2), so you can later parse them.
Reading can be done with the textFile function on SparkContext and then a .map to strip the parentheses and parse the values.
So:
Version 1:
rdd.saveAsTextFile("hdfs:///test1/")
// later, in other program
val newRdds = sparkContext.textFile("hdfs:///test1/part-*").map { x =>
  // remove the surrounding "(" and ")" and parse the long / string parts
  val body = x.stripPrefix("(").stripSuffix(")")
  val comma = body.indexOf(',')
  (body.take(comma).toLong, body.drop(comma + 1))
}
Version 2:
rdd.saveAsObjectFile("hdfs:///test1/")
// later, in other program - watch, you have tuples out of the box :)
val newRdds = sparkContext.objectFile[(Long, String)]("hdfs:///test1/")
I would recommend using a DataFrame if your RDD is in a tabular format. A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable and each row contains one case.
A DataFrame has additional metadata due to its tabular format, which allows Spark to run certain optimizations on the finalized query.
An RDD, in contrast, is a Resilient Distributed Dataset that is more of a black box or core abstraction over the data, which cannot be optimized in the same way.
However, you can go from a DataFrame to an RDD via the rdd method, and from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method.
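For the (Long, String) RDD from the question, that conversion is a one-liner (a sketch assuming sqlContext.implicits._ is in scope; the column names and path are made up):
import sqlContext.implicits._

// rdd is the RDD[(Long, String)] from the question
val pairsDf = rdd.toDF("key", "value")
pairsDf.write.mode("overwrite").parquet("hdfs:///test1_df/") // hypothetical path

// later: read it back and recover the RDD of tuples
val restored = sqlContext.read.parquet("hdfs:///test1_df/")
  .rdd
  .map(r => (r.getLong(0), r.getString(1)))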
The following is an example of creating/storing a DataFrame in CSV and Parquet format on HDFS:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("Spark-HDFS-Read-Write")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val hdfs = "hdfs:///"
val df = Seq((1, "Name1")).toDF("id", "name")
// Writing file in CSV format
df.write.format("com.databricks.spark.csv").mode("overwrite").save(hdfs + "user/hdfs/employee/details.csv")
// Writing file in PARQUET format
df.write.format("parquet").mode("overwrite").save(hdfs + "user/hdfs/employee/details")
// Reading CSV files from HDFS
val dfIncsv = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").load(hdfs + "user/hdfs/employee/details.csv")
// Reading PARQUET files from HDFS
val dfInParquet = sqlContext.read.parquet(hdfs + "user/hdfs/employee/details")