Save Spark dataframe to HDFS partitioned by date - scala

I need to write data from a Spark DataFrame to HDFS in Avro format. The challenge is that the data should be saved per day, so the directories would look like this: tablename/2019-08-12, tablename/2019-08-13 and so on.
I only have a timestamp field, from which I need to extract the date to build the directory names.
I have built an approach which has 2 problems:
1) Extracting the date from the timestamp is awkward.
2) On a large dataset (and it is going to get larger) performance will be very bad, because a lot of tasks are launched.
So how can I change/improve this approach?
Here is the code I used (dataDF is the input data):
val uniqueDates = dataDF.select("update_database_time").distinct
  .collect.map(elem => elem.getTimestamp(0).getDate)

uniqueDates.map(date => {
  val resultDF = dataDF.where(to_date(dataDF.col("update_database_time")) <=> date)
  val pathToSave = s"${dataDir}/${tableNameValue}/${date}"
  resultDF.write
    .format("avro")
    .option("avroSchema", SchemaRegistry.getSchema(
      schemaRegistryConfig.url,
      schemaRegistryConfig.dataSchemaSubject,
      schemaRegistryConfig.dataSchemaVersion))
    .save(s"${hdfsURL}${pathToSave}")
  resultDF
})
.reduce(_.union(_))

If you can live with a directory structure like
tablename/date=2019-08-12
tablename/date=2019-08-13
instead, then DataFrameWriter.partitionBy does the trick. For example:
val df =
  Seq((Timestamp.valueOf("2019-06-01 12:00:00"), 1),
      (Timestamp.valueOf("2019-06-01 12:00:01"), 2),
      (Timestamp.valueOf("2019-06-02 12:00:00"), 3)).toDF("time", "foo")

df.withColumn("date", to_date($"time"))
  .write
  .partitionBy("date")
  .format("avro")
  .save("/tmp/foo")
yields the following structure
find /tmp/foo
/tmp/foo
/tmp/foo/._SUCCESS.crc
/tmp/foo/date=2019-06-01
/tmp/foo/date=2019-06-01/.part-00000-2a7a63f2-7038-4aec-8f76-87077f91a415.c000.avro.crc
/tmp/foo/date=2019-06-01/part-00000-2a7a63f2-7038-4aec-8f76-87077f91a415.c000.avro
/tmp/foo/date=2019-06-01/.part-00001-2a7a63f2-7038-4aec-8f76-87077f91a415.c000.avro.crc
/tmp/foo/date=2019-06-01/part-00001-2a7a63f2-7038-4aec-8f76-87077f91a415.c000.avro
/tmp/foo/_SUCCESS
/tmp/foo/date=2019-06-02
/tmp/foo/date=2019-06-02/part-00002-2a7a63f2-7038-4aec-8f76-87077f91a415.c000.avro
/tmp/foo/date=2019-06-02/.part-00002-2a7a63f2-7038-4aec-8f76-87077f91a415.c000.avro.crc
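As a usage note (a sketch that is not part of the original answer; it assumes a SparkSession named spark and the same Avro data source): when you read the partitioned directory back, Spark's partition discovery turns the date=... directories into a date column again, so nothing is lost by the renamed directories.
// Reading the partitioned output back; "date" is recovered as a column
// from the date=... directory names.
val reloaded = spark.read.format("avro").load("/tmp/foo")
reloaded.printSchema() // time, foo, plus the recovered date partition column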


Why is the Spark bucket number not equal to the number of files in the partition?

val spark = SparkSession.builder()
  .appName("Spark SQL basic example")
  .config("spark.master", "local")
  .getOrCreate()
import spark.implicits._

case class Something(id: Int, batchId: Option[String], div: String)

val sth1 = Something(1, Some("1000"), "10")
val sth2 = Something(2, Some("1000"), "10")
val sth3 = Something(3, Some("1000"), "10")
val sth4 = Something(4, Some("1000"), "10")

val ds = Seq(sth1, sth2, sth3, sth4).toDS()

ds.write.mode("overwrite").option("path", "local_path").bucketBy(3, "id").saveAsTable("Tmp")
I go to the local_path where the data is stored, but I only find two parquet files. I wonder why it doesn't create 3 parquet files, which is the number of buckets.
I have also tried bucket numbers of 1 and 2, and the bucket number does impact the number of parquet files stored in the local path: when the bucket number is 1 there is only 1 parquet file, and similarly when it is 2.
You should use the Dataset.repartition operator to control the number of output files.
You can still combine bucketBy with repartition, but bucketBy serves a different purpose: avoiding shuffles in joins whose join keys match the bucketing keys.
ds.repartition(3)
  .write
  .mode("overwrite")
  .option("path", "local_path")
  .bucketBy(3, "id")
  .saveAsTable("Tmp")
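To illustrate the shuffle-avoidance point (a hypothetical sketch that is not part of the original answer; the table names are made up): two tables bucketed and sorted on the same key with the same number of buckets can be joined without an exchange on either side.
// Write both sides bucketed and sorted by the join key.
ds.write.mode("overwrite").bucketBy(3, "id").sortBy("id").saveAsTable("bucketed_left")
ds.write.mode("overwrite").bucketBy(3, "id").sortBy("id").saveAsTable("bucketed_right")

// Joining the bucketed tables on the bucketing key lets Spark skip the shuffle;
// joined.explain() should show no Exchange before the sort-merge join.
val joined = spark.table("bucketed_left").join(spark.table("bucketed_right"), "id")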
bucketBy is probably not what you're looking for if you expect your data to be written into 3 parquet files. When you use bucketBy, you define the column names, and a hash function divides your data into the number of buckets you specified; that doesn't necessarily mean the data is saved in n files. Bucketing is used to boost query performance (something similar to indexing, though not the same). Now, I haven't tried this yet, but what you're probably looking for is the repartition method.
df.repartition(3)
  .write.mode(SaveMode.Overwrite)
  .option("path", "local_path")
  .saveAsTable("Tmp")

spark scala reduceByKey dataframe operation

I'm trying to do a count in Scala with a DataFrame. My data has 3 columns, and I've already loaded the data and split it by tab. So I want to do something like this:
val file = file.map(line=>line.split("\t"))
val x = file1.map(line=>(line(0), line(2).toInt)).reduceByKey(_+_,1)
I want to put the data in a DataFrame, but I'm having some trouble with the syntax:
val file = file.map(line=>line.split("\t")).toDF
val file.groupby(line(0))
.count()
Can someone help check if this is correct?
Spark needs to know the schema of the DataFrame.
There are many ways to specify the schema; here is one option:
val df = file
  .map(line => line.split("\t"))
  .map(l => (l(0), l(2).toInt)) // at this point Spark knows the number of columns and their types
  .toDF("a", "b")               // give the columns names for ease of use

df.groupBy("a")
  .count()
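If the goal is to reproduce the reduceByKey(_+_) from the question, i.e. to sum column b per key rather than count rows, a minimal sketch using the same column names:
import org.apache.spark.sql.functions.sum

df.groupBy("a")
  .agg(sum("b").as("total"))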

spark: read a parquet file and process it

I am new to Spark 1.6. I'd like to read a parquet file and process it.
To simplify, suppose the parquet has this structure:
id, amount, label
and I have 3 rules:
amount < 10000 => label = LOW
10000 < amount < 100000 => label = MEDIUM
amount > 1000000 => label = HIGH
How can I do this in Spark with Scala?
I tried something like this:
case class SampleModels(
  id: String,
  amount: Double,
  label: String
)

val sc = SparkContext.getOrCreate()
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df = sqlContext.read.parquet("/path/file/")
val ds = df.as[SampleModels].map( row =>
  // MY LOGIC
  // WRITE OUTPUT IN PARQUET
)
Is this the right approach? Is it efficient? "MY LOGIC" could be more complex.
Thanks
Yes, it's the right way to work with Spark. If your logic is simple, you can try to use built-in functions to operate on the DataFrame directly (like when in your case); it will be a little faster than mapping rows to a case class and executing code in the JVM, and you will be able to save the results back to parquet easily.
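As a sketch of that built-in-function approach (the output path is hypothetical; the thresholds follow the question's rules):
import org.apache.spark.sql.functions.when

val labelled = df.withColumn("label",
  when(df("amount") < 10000, "LOW")
    .when(df("amount") < 100000, "MEDIUM")
    .otherwise("HIGH"))

labelled.write.parquet("/path/output/") // hypothetical output path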
Yes, it is the correct approach.
It will do one pass over your complete data to build the extra column you need.
If you want a SQL way, this is the way to go:
val df = sqlContext.read.parquet("/path/file/")
df.registerTempTable("MY_TABLE")
val df2 = sqlContext.sql("select *, case when amount < 10000 then 'LOW' when amount < 100000 then 'MEDIUM' else 'HIGH' end as label from MY_TABLE")
Remember to use a HiveContext instead of a plain SQLContext, though.
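A minimal sketch of that HiveContext setup, assuming an existing SparkContext named sc (the columns are listed explicitly so the original label column is not duplicated in the result):
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val df = hiveContext.read.parquet("/path/file/")
df.registerTempTable("MY_TABLE")
val df2 = hiveContext.sql(
  "select id, amount, case when amount < 10000 then 'LOW' when amount < 100000 then 'MEDIUM' else 'HIGH' end as label from MY_TABLE")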

How can I save an RDD into HDFS and later read it back?

I have an RDD whose elements are of type (Long, String). For some reason, I want to save the whole RDD to HDFS, and later read that RDD back in a Spark program. Is it possible to do that? And if so, how?
It is possible.
An RDD has the saveAsObjectFile and saveAsTextFile functions. Tuples are stored as (value1, value2), so you can parse them back later.
Reading can be done with the textFile function of SparkContext, followed by a .map that strips the parentheses and parses the Long / String values.
So:
Version 1:
rdd.saveAsTextFile("hdfs:///test1/")

// later, in another program
val newRdds = sparkContext.textFile("hdfs:///test1/part-*").map { x =>
  // strip the surrounding "(" and ")" and split on the first comma
  val body = x.stripPrefix("(").stripSuffix(")")
  val sep = body.indexOf(',')
  (body.substring(0, sep).toLong, body.substring(sep + 1))
}
Version 2:
rdd.saveAsObjectFile("hdfs:///test1/")

// later, in another program - watch, you get the tuples back out of the box :)
val newRdds = sparkContext.objectFile[(Long, String)]("hdfs:///test1/part-*")
I would recommend using a DataFrame if your RDD is in a tabular format. A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements of one variable and each row contains one case.
A DataFrame has additional metadata due to its tabular format, which allows Spark to run certain optimizations on the finalized query.
An RDD, in contrast, is a Resilient Distributed Dataset that is more of a black box or core abstraction of data and cannot be optimized in the same way.
You can go from a DataFrame to an RDD via its rdd method, and from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method.
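A minimal sketch of that round trip for the RDD[(Long, String)] from the question (the column names and path are made up, and the sqlContext implicits are assumed to be in scope):
import sqlContext.implicits._

// RDD -> DataFrame -> Parquet on HDFS
rdd.toDF("key", "value").write.parquet("hdfs:///test1_parquet/")

// later: Parquet -> DataFrame -> RDD[(Long, String)]
val restored = sqlContext.read.parquet("hdfs:///test1_parquet/")
  .rdd
  .map(r => (r.getLong(0), r.getString(1)))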
The following is an example of creating and storing a DataFrame in CSV and Parquet format on HDFS:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("Spark-HDFS-Read-Write")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val hdfs = "hdfs:///"
val df = Seq((1, "Name1")).toDF("id", "name")

// Writing file in CSV format
df.write.format("com.databricks.spark.csv").mode("overwrite").save(hdfs + "user/hdfs/employee/details.csv")

// Writing file in PARQUET format
df.write.format("parquet").mode("overwrite").save(hdfs + "user/hdfs/employee/details")

// Reading CSV files from HDFS
val dfIncsv = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").load(hdfs + "user/hdfs/employee/details.csv")

// Reading PARQUET files from HDFS
val dfInParquet = sqlContext.read.parquet(hdfs + "user/hdfs/employee/details")

Spark DataFrame Parallelism

Below is my use case; I am using Apache Spark.
1) I have around 2500 parquet files on HDFS, and the file size varies from file to file.
2) I need to process each parquet file, build a new DataFrame, and write the new DataFrame out in ORC file format.
3) My Spark driver program works like this: I iterate over the files, process a single parquet file into a new DataFrame, and write that DataFrame as ORC. Below is the code snippet.
val fs = FileSystem.get(new Configuration())
val parquetDFMap = fs.listStatus(new Path(inputFilePath)).map(folder => {
  (folder.getPath.toString, sqlContext.read.parquet(folder.getPath.toString))
})

parquetDFMap.foreach { dfMap =>
  val parquetFileName = dfMap._1
  val parqFileDataFrame = dfMap._2
  for (column <- parqFileDataFrame.columns) {
    val rows = parqFileDataFrame.select(column)
      .mapPartitions(lines => lines.filter(filterRowsWithNullValues(_))
        .map(row => buildRowRecords(row, masterStructArr.toArray, valuesArr)))
    val newDataFrame: DataFrame = parqFileDataFrame.sqlContext.createDataFrame(rows, StructType(masterStructArr))
    newDataFrame.write.mode(SaveMode.Append).format("orc").save(orcOutPutFilePath + tableName)
  }
}
The problem with this design is that only one parquet file is processed at a time; parallelism is applied only while a new DataFrame is being created and while it is being written to ORC. So if any task, such as creating a new DataFrame or writing it to ORC, takes a long time to complete, the other queued parquet files are stuck until the current parquet operation finishes.
Can you please help me with a better approach or design for this use case?
Can you create a single DataFrame for all the parquet files instead of one DataFrame for each file?
val df = sqlContext.read.parquet(inputFilePath)
df.map(row => convertToORc(row))
I was able to parallelise the parquet file processing by turning the array of per-file DataFrames into a parallel collection, i.e. parquetDFMap.par.foreach, as sketched below.
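A minimal sketch of that parallel variant (it reuses the helpers and per-file body from the question's snippet; note that the parallel foreach submits several Spark jobs from the driver at the same time):
// .par turns the Array into a parallel collection, so the per-file
// processing and ORC writes run concurrently instead of one by one.
parquetDFMap.par.foreach { case (parquetFileName, parqFileDataFrame) =>
  // ... same per-file transformation and ORC write as in the loop above ...
}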