merge multiple small files into few larger files in Spark - Scala

I am using Hive through Spark, and I have an INSERT INTO partitioned-table query in my Spark code. The input data is 200+ GB. When Spark writes to the partitioned table, it spits out very small files (files in the KB range), so the output partitioned-table folder now has 5000+ small KB-sized files. I want to merge these into a few larger files, maybe around 200 MB each. I tried using the Hive merge settings, but they don't seem to work.
val result7A = hiveContext.sql("set hive.exec.dynamic.partition=true")
val result7B = hiveContext.sql("set hive.exec.dynamic.partition.mode=nonstrict")
val result7C = hiveContext.sql("SET hive.merge.size.per.task=256000000")
val result7D = hiveContext.sql("SET hive.merge.mapfiles=true")
val result7E = hiveContext.sql("SET hive.merge.mapredfiles=true")
val result7F = hiveContext.sql("SET hive.merge.sparkfiles = true")
val result7G = hiveContext.sql("set hive.aux.jars.path=c:\\Applications\\json-serde-1.1.9.3-SNAPSHOT-jar-with-dependencies.jar")
val result8 = hiveContext.sql("INSERT INTO TABLE partition_table PARTITION (date) select a,b,c from partition_json_table")
The above Hive settings work in a MapReduce Hive execution and produce files of the specified size. Is there any option to do this in Spark or Scala?

I had the same issue. The solution was to add a DISTRIBUTE BY clause with the partition columns, which ensures that the data for one partition goes to a single reducer. Example in your case:
INSERT INTO TABLE partition_table PARTITION (date) select a,b,c from partition_json_table DISTRIBUTE BY date

You may want to try using the DataFrame.coalesce method; it returns a DataFrame with the specified number of partitions (each of which becomes a file on insertion). So using the number of records you are inserting and the typical size of each record, you can estimate how many partitions to coalesce to if you want files of ~200MB.
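For illustration, a minimal sketch of that estimate, reusing the table and column names from the question; the per-row size is an assumed value, not a measured one, and the partition column is included in the SELECT because dynamic partitioning takes its value from the last selected column:
// Sketch only: coalesce before inserting so each DataFrame partition becomes roughly
// one ~200 MB file. approxBytesPerRow is an assumption; measure your own data.
val source = hiveContext.sql("SELECT a, b, c, date FROM partition_json_table")

val approxBytesPerRow = 200L                 // assumed average serialized row size
val targetFileBytes   = 200L * 1024 * 1024   // ~200 MB per output file
val numFiles = math.max(1, ((source.count() * approxBytesPerRow) / targetFileBytes).toInt)

source.coalesce(numFiles).registerTempTable("coalesced_source")
hiveContext.sql(
  "INSERT INTO TABLE partition_table PARTITION (date) " +
  "SELECT a, b, c, date FROM coalesced_source")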

The DataFrame repartition(1) method also works in this case.

Related

Spark performance issues when iteratively adding multiple columns

I am seeing performance issues when iteratively adding columns (around 100) to a DataFrame.
I know that it is more efficient to use select to add multiple columns; however, I have to add the columns in order, because column 2 may depend on column 1, and so on.
The columns are being added following a join which resulted in skew, so I have explicitly repartitioned by a salt key to evenly distribute the data across the cluster.
When running locally I was seeing OOM errors even for fairly small (100-row, 500-column) datasets.
I was able to get the job running locally by checkpointing after the addition of every x columns, so I suspect Spark lineage issues are causing my problems; however, I am still unable to run the job at scale on the cluster.
Any advice on where to look, or on best practice in this scenario, would be greatly appreciated.
At a high level my job looks like this:
import org.apache.spark.sql.functions._

val df1 = ??? // Millions of rows, ~500 cols, from parquet
val df2 = ??? // 1000 rows, from parquet
val newExpressions = ??? // 100 rows, from Oracle
val joined = df1.join(broadcast(df2), <join expr>)
val newColumns = newExpressions.collect().map(<get columnExpr and columnName>)
val salted = joined.withColumn("salt", rand()).repartition(x, col("salt"))
newColumns.foldLeft(salted) {   // fold over the salted DataFrame, not the unsalted join
  case (df, row) => df.withColumn(row.name, expr(row.expression))
} // Checkpointing after every x columns seems to help
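For reference, a sketch of that checkpointing workaround (assumes Spark 2.1+ where Dataset.checkpoint is available; spark, the batch size, and the checkpoint directory are assumptions, as are the row.name / row.expression fields produced by the mapping step above):
// Sketch only: truncate the lineage every `batchSize` columns.
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")   // assumed location

val batchSize = 10   // the "x" described above; tune for your job
val result = newColumns.grouped(batchSize).foldLeft(salted) {
  case (df, batch) =>
    val withBatch = batch.foldLeft(df) {
      case (acc, row) => acc.withColumn(row.name, expr(row.expression))
    }
    withBatch.checkpoint()   // materializes and cuts the lineage before the next batch
}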
Cheers
Terry

spark-shell load existing hive table by partition?

In spark-shell, how do I load an existing Hive table, but only one of its partitions?
val df = spark.read.format("orc").load("mytable")
I was looking for a way to load only one particular partition of this table.
Thanks!
There is no direct way to do this with spark.read.format, but you can use a where condition:
val df = spark.read.format("orc").load("mytable").where(yourPartitionCondition)
Nothing is loaded unless and until you perform an action, since load (pointing to your ORC file location) is just a function in DataFrameReader, as shown below; it doesn't read data until an action is triggered.
See DataFrameReader:
def load(paths: String*): DataFrame = {
...
}
In the code above (i.e. spark.read...), where is just a where condition; when you specify it, the data still won't be loaded immediately :-)
When you call an action such as df.count, the partition condition is applied to the ORC data paths, so only the matching partition directories are read.
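For illustration, a concrete version of that approach; the path, the partition column name date, and the value are assumptions:
import org.apache.spark.sql.functions.col

// Sketch only: assumes the table directory contains partition subdirectories like
// .../mytable/date=2019-01-01, so Spark's partition discovery can prune them.
val df = spark.read.format("orc")
  .load("/warehouse/mytable")
  .where(col("date") === "2019-01-01")   // partition filter, pushed down at planning time

df.count()   // data is only read here, and only from the matching partition directory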
There is no function available in the Spark API to load only a partition directory, but another way around this is that a partition directory is nothing but a column in a WHERE clause. You can write a simple SQL query with the partition column in the WHERE clause, which will read data only from the partition directory. See if that works for you.
val df = spark.sql("SELECT * FROM mytable WHERE <partition_col_name> = <expected_value>")

Spark Partition Dataset By Column Value

(I am new to Spark) I need to store a large number of rows of data, and then handle updates to those data. We have unique IDs (DB PKs) for those rows, and we would like to shard the data set by uniqueID % numShards, to make equal sized, addressable partitions. Since the PKs (unique IDs) are present both in the data and in the update files, it will be easy to determine which partition will be updated. We intend to shard the data and the updates by the same criteria, and periodically rewrite "shard S + all updates accumulated for shard S => new shard S". (We know how to combine shard S + updates = new shard S.)
If this is our design, we need to (1) shard a DataFrame by one of its columns (say: column K) into |range(K)| partitions where it is guaranteed that all rows in a partition have the same value in column K and (2) be able to find the Parquet file that corresponds to column_K=k, knowing k = row.uniqueID % numShards.
Is this a good design, or does Spark offer something out of the box that makes our task much easier?
Which Spark class/method should we use for partitioning our data? We are looking at RangePartitioner, but the constructor is asking for the number of partitions. We want to specify "use column_K for partitioning, and make one partition for each distinct value k in range(K)", because we have already created column_K = uniqueID % numShards. Which partitioner is appropriate for splitting on the value of one column of a DataFrame? Do we need to create a custom partitioner, or use partitionBy, or repartitionByRange, or...?
This is what we have so far:
import org.apache.spark.sql.functions._
val df = spark.read
.option("fetchsize", 1000)
.option("driver", "oracle.jdbc.driver.OracleDriver")
.jdbc(jdbc_url, "SCHEMA.TABLE_NAME", partitions, props)
.withColumn("SHARD_ID", col("TABLE_PK") % 1024)
.write
.parquet("parquet/table_name")
Now we need to specify that this DataFrame should be partitioned by SHARD_ID before it is written out as Parquet files.
This works:
val df = spark.read
.option("fetchsize", 1000)
.option("driver", "oracle.jdbc.driver.OracleDriver")
.jdbc(jdbc.getString("url"), "SCHEMA.TABLE_NAME", partitions, props)
.withColumn("SHARD_ID", col("TABLE_PK") % 1024)
.write
.partitionBy("SHARD_ID")
.parquet("parquet/table_name")

Spark DataFrame row count is inconsistent between runs

When I run my Spark job (version 2.1.1) on EMR, each run counts a different number of rows in a DataFrame. I first read data from S3 into 4 different DataFrames; these counts are always consistent. Then, after joining the DataFrames, the result of the join has a different count on each run. Afterwards I also filter the result, and that also has a different count on each run. The variations are small, a 1-5 row difference, but it's still something I would like to understand.
This is the code for the join:
val impJoinKey = Seq("iid", "globalVisitorKey", "date")
val impressionsJoined: DataFrame = impressionDsNoDuplicates
.join(realUrlDSwithDatenoDuplicates, impJoinKey, "outer")
.join(impressionParamterDSwithDateNoDuplicates, impJoinKey, "left")
.join(chartSiteInstance, impJoinKey, "left")
.withColumn("timestamp", coalesce($"timestampImp", $"timestampReal", $"timestampParam"))
.withColumn("url", coalesce($"realUrl", $"url"))
and this is for the filter:
val impressionsJoined: Dataset[ImpressionJoined] = impressionsJoinedFullDay.where($"timestamp".geq(new Timestamp(start.getMillis))).cache()
I have also tried using the filter method instead of where, but with the same results.
Any thoughts?
Thanks
Nir
Is it possible that one of the data sources changes over time?
Since impressionsJoined is not cached, Spark will re-evaluate it from scratch on every action, and that includes reading the data again from the source.
Try caching impressionsJoined after the join.
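For example, a minimal sketch of that suggestion, reusing the join from the question:
// Cache (or persist) right after the join so every downstream action reuses the same
// materialized result instead of re-reading the S3 sources.
val impressionsJoined: DataFrame = impressionDsNoDuplicates
  .join(realUrlDSwithDatenoDuplicates, impJoinKey, "outer")
  .join(impressionParamterDSwithDateNoDuplicates, impJoinKey, "left")
  .join(chartSiteInstance, impJoinKey, "left")
  .withColumn("timestamp", coalesce($"timestampImp", $"timestampReal", $"timestampParam"))
  .withColumn("url", coalesce($"realUrl", $"url"))
  .cache()   // or .persist(StorageLevel.MEMORY_AND_DISK) if memory is tight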

Scripts for generating CSV files for Spark Cassandra data

I want to generate CSV files from a Cassandra table as per the logic below.
val df = sc.parallelize(Seq(("a",1,"abc@gmail.com"), ("b",2,"def@gmail.com"), ("a",1,"xyz@gmail.com"), ("a",2,"abc@gmail.com"))).toDF("col1","col2","emailId")
Since there are 3 distinct emailIds, I need to generate 3 distinct CSV files, one for each of the 3 queries below:
select * from table where emailId='abc@gmail.com'
select * from table where emailId='def@gmail.com'
select * from table where emailId='xyz@gmail.com'
How can I do this? Can anyone please help me with this?
Version:
Spark 1.6.2
Scala 2.10
Create a distinct list of the emails, then iterate over them. When iterating, filter for only the emails that match and save that subset of the DataFrame as the CSV file for that email.
import sqlContext.implicits._

val emailData = sc.parallelize(Seq(("a",1,"abc@gmail.com"), ("b",2,"def@gmail.com"), ("a",1,"xyz@gmail.com"), ("a",2,"abc@gmail.com"))).toDF("col1","col2","emailId")
val distinctEmails = emailData.select("emailId").distinct().as[String].collect

for (email <- distinctEmails) {
  val subsetEmailsDF = emailData.filter($"emailId" === email).coalesce(1)
  // ... save the subset DataFrame as CSV here (see the sketch below)
}
Note: coalesce(1) sends all the data to one node. This can create memory issues if the dataframe is too large.
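Since Spark 1.6 has no built-in CSV writer, one option for the save step (an assumption, not part of the original answer) is the spark-csv package; a sketch of the loop with an illustrative output path:
// Sketch only: requires com.databricks:spark-csv_2.10 on the classpath.
for (email <- distinctEmails) {
  val subsetEmailsDF = emailData.filter($"emailId" === email).coalesce(1)
  subsetEmailsDF.write
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .save(s"/output/emails/${email.replace("@", "_at_")}")   // illustrative path
}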