I am trying to join DataFrames, filtering them first to keep the test simple. After the filter, each DataFrame has only 4 rows and 190 columns.
When I run this code locally it is very fast (albeit with only 22 columns), and the partition count I see locally is just 1. The join is a simple one on 2 key columns, and I made sure that there is no Cartesian product.
When I run this on my Dev/UAT cluster, it takes forever and fails partway through. I also see that around 40,000 partitions are created per join. I print the count using resultDf.rdd.partitions.size.
I've divided the join like this and it didn't help.
var joinCols = Seq("subjectid","componenttype")
val df1 = mainDf1.filter("metricname = 'NPV'").withColumnRenamed("my_attr","df1attr").withColumnRenamed("value","df1val")
val df2 = mainDf2.filter("metricname = 'NPV'").withColumnRenamed("my_attr","df2attr").withColumnRenamed("value","df2val")
val df3 = mainDf3.filter("metricname = 'NPV'").withColumnRenamed("my_attr","df3attr").withColumnRenamed("value","df3val")
val df4 = mainDf2.filter("metricname = 'NPV'").withColumnRenamed("my_attr","df4attr").withColumnRenamed("value","df4val")
var resultDf = df1.as("dft").join(df2,joinCols,"inner").select("dft.*","df2attr","df2val")
// Check the partition count here and show the DataFrame to confirm we get only 4 rows, as expected. I do get 4 rows, but with 40,000 partitions, and this step alone takes a long time.
resultDf = resultDf.as("dfi").join(df3,joinCols,"inner").select("dfi.*","df3attr","df3val")
// By this point my program usually exits, either with a heap space error or with exception exitCode=56
resultDf = resultDf.as("dfa").join(df4,joinCols,"inner").select("dfa.*","df4attr","df4val")
The names used here are dummies so that I could post the code, so please don't mind them.
Any input or help to point me in the right direction?
Related
I am seeing performance issues when iteratively adding columns (around 100) to a Dataframe.
I know that it is more efficient to use select to add multiple columns; however, I have to add the columns in order because column 2 may depend on column 1, and so on.
The columns are being added following a join which resulted in skew so I have explicitly repartitioned by a salt key to evenly distribute data on the cluster.
When I ran locally I was seeing OOM errors even for fairly small (100 row, 500 column) datasets.
I was able to get the job running locally by checkpointing after the addition of every x columns, so I suspect Spark lineage issues are causing my problems. However, I am still unable to run the job at scale on the cluster.
Any advice on where to look, or on best practice in this scenario, would be gratefully received.
At a high level my job looks like this:
val df1 = ??? // Millions of rows, ~500 cols, from parquet
val df2 = ??? // 1000 rows, from parquet
val newExpressions = ??? // 100 rows, from Oracle
val joined = df1.join(broadcast(df2), <join expr>)
val newColumns = newExpressions.collect().map(<get columnExpr and columnName>) // collect() returns a Scala Array, so map works directly
val salted = joined.withColumn("salt", rand()).repartition(x, col("salt"))
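// Requires: import org.apache.spark.sql.functions.expr
newColumns.foldLeft(salted) { // start from the salted, repartitioned DataFrame
case (df, row) => df.withColumn(row.name, expr(row.expression)) // withColumn takes (name, column)
} // Checkpointing after every x columns seems to help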
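The "checkpoint after every x columns" idea mentioned above could be sketched as follows. This is only an illustration under stated assumptions: the `NewColumn` case class, the `addColumns` helper, the toy data and the temporary checkpoint directory are all hypothetical, and it assumes Spark 2.1 or later, where `Dataset.checkpoint()` is available.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.expr

object PeriodicCheckpoint {
  // Hypothetical holder for one derived column: output name plus a Spark SQL expression.
  final case class NewColumn(name: String, expression: String)

  // Adds the columns in order and checkpoints after every `every` columns,
  // which cuts the logical plan so it does not keep growing.
  def addColumns(base: DataFrame, columns: Seq[NewColumn], every: Int): DataFrame =
    columns.zipWithIndex.foldLeft(base) { case (df, (c, i)) =>
      val next = df.withColumn(c.name, expr(c.expression))
      if ((i + 1) % every == 0) next.checkpoint() else next
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("checkpoint-sketch").getOrCreate()
    // checkpoint() needs a checkpoint directory; a temp dir is used here for illustration only.
    spark.sparkContext.setCheckpointDir(java.nio.file.Files.createTempDirectory("ckpt").toString)
    import spark.implicits._

    val base = Seq((1, 2.0), (2, 3.0)).toDF("id", "c0")
    // Each new column depends on the previous one, so the order matters.
    val cols = (1 to 10).map(i => NewColumn(s"c$i", s"c${i - 1} + 1"))

    addColumns(base, cols, every = 3).select("id", "c10").show()
    spark.stop()
  }
}
```

Each checkpoint writes the intermediate data out, so a smaller `every` trades extra I/O for shorter plans; the right value is something to measure rather than assume.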
Cheers
Terry
I've been working on merging multiple selected columns into a new DataFrame using crossJoin. Each of those columns holds more than a million records.
Whenever I join more than 5 columns, PySpark crashes. Following a solution suggested in another post, I worked around this by cross joining at most 5 or 6 columns at a time, which produces several DataFrames that I will eventually have to cross join together. Is there a way to create a DataFrame with, say, 25 columns of the length mentioned above without crashing Spark?
result1 = table.select("column1")\
.crossJoin(table.select("column2"))\
.crossJoin(table.select("column3"))\
.crossJoin(table.select("column4"))\
.crossJoin(table.select("column5"))\
.crossJoin(table.select("column6"))
result2 = table.select("column7")\
.crossJoin(table.select("column8"))\
.crossJoin(table.select("column9"))
result3 = table.select("column10")\
.crossJoin(table.select("column11"))\
.crossJoin(table.select("column12"))\
.crossJoin(table.select("column13"))\
.crossJoin(table.select("column14"))\
.crossJoin(table.select("column15"))
I do this instead of doing all the cross joins together.
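For a sense of scale: a cross join returns one row per combination of input rows, so the row count is the product of the input sizes. The arithmetic below is a small sketch in plain Scala; the `crossJoinRows` helper is hypothetical, and the figure of one million rows per column is taken from the description above.

```scala
object CrossJoinSize {
  // Rows produced by cross joining `inputs` DataFrames of `rowsPerInput` rows each.
  def crossJoinRows(rowsPerInput: BigInt, inputs: Int): BigInt = rowsPerInput.pow(inputs)

  def main(args: Array[String]): Unit = {
    val rows = BigInt(1000000)
    // Print the result size for chains of 2, 3 and 6 single-column inputs.
    Seq(2, 3, 6).foreach { n =>
      println(s"$n inputs -> ${crossJoinRows(rows, n)} rows")
    }
  }
}
```

If the intent is to place the columns side by side rather than to enumerate every combination, a cross join may not be the operation that is wanted, although that depends on details the post does not give.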
When I run my Spark job (version 2.1.1) on EMR, each run counts a different number of rows in a DataFrame. I first read data from S3 into 4 different DataFrames, and these counts are always consistent. After joining the DataFrames, however, the result of the join has a different count on each run. I also filter the result afterwards, and that has a different count on each run as well. The variations are small, a difference of 1-5 rows, but it is still something I would like to understand.
This is the code for the join:
val impJoinKey = Seq("iid", "globalVisitorKey", "date")
val impressionsJoined: DataFrame = impressionDsNoDuplicates
.join(realUrlDSwithDatenoDuplicates, impJoinKey, "outer")
.join(impressionParamterDSwithDateNoDuplicates, impJoinKey, "left")
.join(chartSiteInstance, impJoinKey, "left")
.withColumn("timestamp", coalesce($"timestampImp", $"timestampReal", $"timestampParam"))
.withColumn("url", coalesce($"realUrl", $"url"))
and this is for the filter:
val impressionsJoined: Dataset[ImpressionJoined] = impressionsJoinedFullDay.where($"timestamp".geq(new Timestamp(start.getMillis))).cache()
I have also tried using the filter method instead of where, with the same results.
Any thoughts?
Thanks
Nir
Is it possible that one of the data sources changes over time?
Since impressionsJoined is not cached, Spark will re-evaluate it from scratch on every action, and that includes reading the data again from the source.
Try caching impressionsJoined after the join.
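A minimal sketch of that suggestion, assuming Spark 2.x and using toy stand-ins for the real inputs (the `joinAndCache` helper, the two small DataFrames and the single join key are hypothetical):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CacheAfterJoin {
  // Joins once and marks the result for caching so later actions can reuse it
  // instead of re-reading the sources and recomputing the join.
  def joinAndCache(left: DataFrame, right: DataFrame, keys: Seq[String]): DataFrame =
    left.join(right, keys, "outer").cache()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("cache-sketch").getOrCreate()
    import spark.implicits._

    // Toy inputs; in the question these are read from S3.
    val impressions = Seq((1, "a"), (2, "b")).toDF("iid", "imp")
    val realUrls = Seq((2, "x"), (3, "y")).toDF("iid", "realUrl")

    val joined = joinAndCache(impressions, realUrls, Seq("iid"))
    val total = joined.count() // the first action materializes the cache
    val withUrl = joined.where($"realUrl".isNotNull).count() // served from the cached data
    println(s"total=$total, withUrl=$withUrl")
    spark.stop()
  }
}
```

Caching is best effort: if cached partitions are evicted, Spark recomputes them from the source, so this reduces re-evaluation but does not by itself explain why the counts differ.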
I am using Spark 1.5.1 with Scala on Zeppelin notebook.
I have a DataFrame with a column called userID with Long type.
In total I have about 4 million rows and 200,000 unique userID.
I have also a list of 50,000 userID to exclude.
I can easily build the list of userID to retain.
What is the best way to delete all the rows that belong to the users to exclude?
Another way to ask the same question is: what is the best way to keep the rows that belong to the users to retain?
I saw this post and applied its solution (see the code below), but the execution is slow. I am running Spark 1.5.1 on my local machine, which has a decent 16 GB of RAM, and the initial DataFrame fits in memory.
Here is the code that I am applying:
import org.apache.spark.sql.functions.lit
val finalDataFrame = initialDataFrame.where($"userID".in(listOfUsersToKeep.map(lit(_)):_*))
In the code above:
The initialDataFrame has 3885068 rows and 5 columns, one of which is called userID and contains Long values.
The listOfUsersToKeep is an Array[Long] and it contains 150,000 Long userID.
I wonder if there is a more efficient solution than the one I am using.
Thanks
You can either use join:
val usersToKeep = sc.parallelize(
listOfUsersToKeep.map(Tuple1(_))).toDF("userID_")
val finalDataFrame = usersToKeep
.join(initialDataFrame, $"userID" === $"userID_")
.drop("userID_")
or a broadcast variable and an UDF:
import org.apache.spark.sql.functions.udf
val usersToKeepBD = sc.broadcast(listOfUsersToKeep.toSet)
val checkUser = udf((id: Long) => usersToKeepBD.value.contains(id))
val finalDataFrame = initialDataFrame.where(checkUser($"userID"))
It should be also possible to broadcast a DataFrame:
import org.apache.spark.sql.functions.broadcast
initialDataFrame.join(broadcast(usersToKeep), $"userID" === $"userID_")
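The snippets above keep the listed users. For the other phrasing of the question, dropping the users to exclude, one possible sketch is a left anti join against a broadcast DataFrame. Note the assumptions: the `left_anti` join type needs Spark 2.0 or later (the question uses 1.5.1), and the `dropExcluded` helper and the toy data are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object ExcludeUsers {
  // Keeps only the rows of `df` whose userID does not appear in `excluded`.
  def dropExcluded(df: DataFrame, excluded: DataFrame): DataFrame =
    df.join(broadcast(excluded), Seq("userID"), "left_anti")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("exclude-sketch").getOrCreate()
    import spark.implicits._

    // Toy stand-ins for the real data.
    val initialDataFrame = Seq((1L, "a"), (2L, "b"), (3L, "c")).toDF("userID", "payload")
    val usersToExclude = Seq(2L).toDF("userID")

    dropExcluded(initialDataFrame, usersToExclude).show()
    spark.stop()
  }
}
```

On versions without `left_anti`, negating the UDF predicate from the broadcast-variable variant, built over the set of users to exclude, should give the same filtering.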
I am using Hive through Spark. I have an INSERT INTO partitioned table query in my Spark code. The input data is 200+ GB. When Spark writes to the partitioned table, it produces very small files (a few KB each), so the output table folder now holds 5000+ small files. I want to merge these into a few larger files of about 200 MB each. I tried using the Hive merge settings, but they don't seem to work.
val result7A = hiveContext.sql("set hive.exec.dynamic.partition=true")
val result7B = hiveContext.sql("set hive.exec.dynamic.partition.mode=nonstrict")
val result7C = hiveContext.sql("SET hive.merge.size.per.task=256000000")
val result7D = hiveContext.sql("SET hive.merge.mapfiles=true")
val result7E = hiveContext.sql("SET hive.merge.mapredfiles=true")
val result7F = hiveContext.sql("SET hive.merge.sparkfiles = true")
val result7G = hiveContext.sql("set hive.aux.jars.path=c:\\Applications\\json-serde-1.1.9.3-SNAPSHOT-jar-with-dependencies.jar")
val result8 = hiveContext.sql("INSERT INTO TABLE partition_table PARTITION (date) select a,b,c from partition_json_table")
The above Hive settings work when Hive executes on MapReduce and produce files of the specified size. Is there any option to do this in Spark or Scala?
I had the same issue. The solution was to add a DISTRIBUTE BY clause on the partition columns. This ensures that the data for one partition goes to a single reducer. Example for your case:
INSERT INTO TABLE partition_table PARTITION (date) select a,b,c from partition_json_table DISTRIBUTE BY date
You may want to try using the DataFrame.coalesce method; it returns a DataFrame with the specified number of partitions (each of which becomes a file on insertion). So using the number of records you are inserting and the typical size of each record, you can estimate how many partitions to coalesce to if you want files of ~200MB.
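The estimate described above is simple arithmetic; a rough sketch in plain Scala follows. The `partitionsFor` helper is hypothetical, and the sizes are taken from the question (about 200 GB of input, about 200 MB per file). It also assumes the output is roughly as large as the input, which compression or a different file format can change considerably.

```scala
object TargetPartitions {
  // Number of partitions needed so each one holds roughly `targetFileBytes`,
  // rounding up and never returning fewer than 1.
  def partitionsFor(totalBytes: Long, targetFileBytes: Long): Int = {
    require(totalBytes >= 0 && targetFileBytes > 0, "sizes must be non-negative / positive")
    math.max(1L, (totalBytes + targetFileBytes - 1) / targetFileBytes).toInt
  }

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024L
    val n = partitionsFor(totalBytes = 200L * 1024L * mb, targetFileBytes = 200L * mb)
    // The result would then be passed to coalesce, e.g. df.coalesce(n), before the insert.
    println(s"coalesce to $n partitions")
  }
}
```

With dynamic partitioning, each task can still write one file per table partition it holds data for, so the final file count can exceed this number unless the data is also distributed by the partition column.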
The dataframe repartition(1) method works in this case.