I've been working on merging multiple selected columns into a new dataframe using crossJoin. Each of those columns holds more than a million records.
Whenever I join more than 5 columns, PySpark crashes. Following a solution suggested in another post, I got the issue solved by doing crossJoins on at most 5 columns at a time, creating multiple dataframes that I eventually have to crossJoin together. Is there a way of creating a dataframe with, say, 25 columns of the same length I mentioned above without crashing Spark?
result1 = table.select("column1")\
.crossJoin(table.select("column2"))\
.crossJoin(table.select("column3"))\
.crossJoin(table.select("column4"))\
.crossJoin(table.select("column5"))\
.crossJoin(table.select("column6"))
result2 = table.select("column7")\
.crossJoin(table.select("column8"))\
.crossJoin(table.select("column9"))
result3 = table.select("column10")\
.crossJoin(table.select("column11"))\
.crossJoin(table.select("column12"))\
.crossJoin(table.select("column13"))\
.crossJoin(table.select("column14"))\
.crossJoin(table.select("column15"))
This is what I'm doing now instead of running all the crossJoins together.
I am trying to join dataframes and am in fact filtering in advance for simple testing. Each dataframe after the filter has only 4 rows. There are 190 columns per dataframe.
I run this code locally and it runs super fast (albeit with only 22 columns). Also, the partition count when I check locally is just 1. The join is pretty simple on 2 key columns and I made sure that there is no Cartesian product.
When I run this in my Dev/UAT cluster, it takes forever and fails in between. Also, I see that around 40,000 partitions are created per join. I am printing it using resultDf.rdd.partitions.size.
I've divided the join like this and it didn't help.
var joinCols = Seq("subjectid","componenttype")
val df1 = mainDf1.filter("metricname = 'NPV'").withColumnRenamed("my_attr","df1attr").withColumnRenamed("value","df1val")
val df2 = mainDf2.filter("metricname = 'NPV'").withColumnRenamed("my_attr","df2attr").withColumnRenamed("value","df2val")
val df3 = mainDf3.filter("metricname = 'NPV'").withColumnRenamed("my_attr","df3attr").withColumnRenamed("value","df3val")
val df4 = mainDf2.filter("metricname = 'NPV'").withColumnRenamed("my_attr","df4attr").withColumnRenamed("value","df4val")
var resultDf = df1.as("dft").join(df2,joinCols,"inner").select("dft.*","df2attr","df2val")
//check partition size here and show the dataframe to make sure we are getting 4 rows only as expected. I am getting 4 rows but 40,000 partitions and takes lot of time here itself.
resultDf = resultDf.as("dfi").join(df3,joinCols,"inner").select("dfi.*","df3attr","df3val")
//Mostly by here my program comes out either with heap space error or with exception exitCode=56
resultDf = resultDf.as("dfa").join(df4,joinCols,"inner").select("dfa.*","df4attr","df4val")
The names used are all dummies for posting the code here, so please don't mind that.
Any input/help to point me in the right direction?
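(Not from the original post, just an illustrative sketch.) With only a handful of rows per dataframe, the ~40,000 shuffle partitions are usually the first thing to look at: either lower spark.sql.shuffle.partitions or broadcast the small side of each join. Assuming Spark 2.x with a SparkSession named spark, something like:
import org.apache.spark.sql.functions.broadcast

// Tiny data does not need many shuffle partitions (the default is 200;
// seeing 40,000 usually means the cluster config sets it that high).
spark.conf.set("spark.sql.shuffle.partitions", "8")

// Broadcasting the 4-row dataframes turns the shuffle joins into broadcast joins.
var resultDf = df1.as("dft").join(broadcast(df2), joinCols, "inner").select("dft.*", "df2attr", "df2val")
resultDf = resultDf.as("dfi").join(broadcast(df3), joinCols, "inner").select("dfi.*", "df3attr", "df3val")
resultDf = resultDf.as("dfa").join(broadcast(df4), joinCols, "inner").select("dfa.*", "df4attr", "df4val")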
I have a dataframe that was produced by a query. The data size might be more than 1 GB.
After using Spark SQL, consider I have a dataframe df. After this,
I filtered the dataset into 2 dataframes:
one with values less than 540, i.e. filteredFeaturesBefore, and another with values between 540 and 640, i.e. filteredFeaturesAfter.
After this, I combined the 2 dataframes mentioned above into combinedFeatures by using an inner join:
val combinedFeatures = sqlContext.sql("""select * from filteredFeaturesBefore inner join filteredFeaturesAfter on filteredFeaturesBefore.ID = filteredFeaturesAfter.ID """)
Later I create a dataset of LabeledPoint to pass the data into the machine learning model.
val combinedRddFeatures = combinedFeatures.rdd
val finalSamples = combinedRddFeatures.map(event => LabeledPoint(parseDouble(event(0) + ""),
  Vectors.dense(parseDouble(event(1) + ""), parseDouble(event(2) + ""),
    parseDouble(event(3) + ""), parseDouble(event(4) + ""))))
If I perform finalSamples.count(), Spark keeps executing and does not return anything for a long time. I ran the program for 6 hours and still no result was returned, and I had to stop the execution because the laptop had become slow and was not responding properly.
I don't know whether this is because of my laptop's processor speed or whether Spark is hung.
I'm using a MacBook Air 2017, which has a 1.8 GHz processor.
Can you tell me why this is happening, as I'm new to Spark?
Also, is there any workaround for this? Instead of splitting the data into 2 dataframes, can I iterate over both dataframes and extract the LabeledPoint data structure directly? If yes, can you suggest a method to do this?
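(This is an editor's sketch, not part of the original question.) Before running the whole pipeline, it may be worth caching the two filtered dataframes and comparing row counts before and after the join: if the same ID appears many times on both sides, the inner join grows multiplicatively, which would explain a count() that never comes back. Only the dataframe names come from the question; everything else is illustrative.
// Cache the inputs and compare counts before and after the join.
filteredFeaturesBefore.cache()
filteredFeaturesAfter.cache()
println(filteredFeaturesBefore.count())   // rows on the left side
println(filteredFeaturesAfter.count())    // rows on the right side

// Equivalent to the SQL inner join in the question.
val combinedCheck = filteredFeaturesBefore.join(
  filteredFeaturesAfter,
  filteredFeaturesBefore("ID") === filteredFeaturesAfter("ID"))
println(combinedCheck.count())            // much larger than either input => duplicate IDs are exploding the join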
I work with Spark 1.6.1 in Scala.
I have one dataframe, and I want to create different dataframes from it while reading the source only once.
For example, one dataframe has two columns, ID and TYPE, and I want to create two dataframes: one with TYPE = A and the other with TYPE = B.
I've checked other posts on Stack Overflow, but only found the option of reading the dataframe twice.
However, I would like another solution with the best possible performance.
Kind regards.
Spark will read from the data source multiple times if you perform multiple actions on the data. The way to avoid this is to use cache(). That way, the data is saved to memory after the first action, which makes subsequent actions on the data faster.
Your two dataframes can be created this way, requiring only one read of the data source:
val df = spark.read.csv(path).cache()
val dfA = df.filter($"TYPE" === "A").drop("TYPE")
val dfB = df.filter($"TYPE" === "B").drop("TYPE")
The "TYPE" column is dropped as it should be unnecessary after the separation.
When I run my Spark job (version 2.1.1) on EMR, each run counts a different number of rows on a dataframe. I first read data from S3 into 4 different dataframes; these counts are always consistent. Then, after joining the dataframes, the result of the join has a different count, and when I afterwards filter the result, that also has a different count on each run. The variations are small, a 1-5 row difference, but it's still something I would like to understand.
This is the code for the join:
val impJoinKey = Seq("iid", "globalVisitorKey", "date")
val impressionsJoined: DataFrame = impressionDsNoDuplicates
.join(realUrlDSwithDatenoDuplicates, impJoinKey, "outer")
.join(impressionParamterDSwithDateNoDuplicates, impJoinKey, "left")
.join(chartSiteInstance, impJoinKey, "left")
.withColumn("timestamp", coalesce($"timestampImp", $"timestampReal", $"timestampParam"))
.withColumn("url", coalesce($"realUrl", $"url"))
and this is for the filter:
val impressionsJoined: Dataset[ImpressionJoined] = impressionsJoinedFullDay.where($"timestamp".geq(new Timestamp(start.getMillis))).cache()
I have also tried using the filter method instead of where, but with the same results.
Any thoughts?
Thanks
Nir
Is it possible that one of the data sources changes over time?
Since impressionsJoined is not cached, Spark will re-evaluate it from scratch on every action, and that includes reading the data again from the source.
Try caching impressionsJoined after the join.
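For example, a sketch based on the join from the question (the Cached suffix is just for illustration):
val impressionsJoinedCached = impressionDsNoDuplicates
  .join(realUrlDSwithDatenoDuplicates, impJoinKey, "outer")
  .join(impressionParamterDSwithDateNoDuplicates, impJoinKey, "left")
  .join(chartSiteInstance, impJoinKey, "left")
  .withColumn("timestamp", coalesce($"timestampImp", $"timestampReal", $"timestampParam"))
  .withColumn("url", coalesce($"realUrl", $"url"))
  .cache()                        // keep the join result around after the first action
impressionsJoinedCached.count()   // materializes the cache; later filters and counts reuse the same rows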
I have a table of distinct users, which has 400,000 users. I would like to split it into 4 parts, with each user expected to be in one part only.
Here is my code:
val numPart = 4
val size = 1.0 / numPart
val nsizes = Array.fill(numPart)(size)
val data = userList.randomSplit(nsizes)
Then I write each data(i), for i from 0 to 3, into parquet files. When I read the directory back, group by user id and count by part, there are some users located in two or more parts.
I still have no idea why.
I have found the solution: cache the DataFrame before you split it.
It should be:
val data = userList.cache().randomSplit(nsizes)
I still have no idea why. My guess is that each time the randomSplit function "fills" the data, it reads records from userList, which is re-evaluated from the parquet file(s) and gives a different order of rows; that's why some users are lost and some users are duplicated.
That's what I thought. If someone has an answer or explanation, I will update.
References:
(Why) do we need to call cache or persist on a RDD
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-caching.html
http://159.203.217.164/using-sparks-cache-for-correctness-not-just-performance/
If your goal is to split it into different files, you can use functions.hash to calculate a hash, then take it mod 4 to get a number between 0 and 3, and when you write the parquet use partitionBy, which will create a directory for each of the 4 values.
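A rough sketch of that approach (assuming Spark 2.x; the column name userId and the output path are made up):
import org.apache.spark.sql.functions.{col, hash, pmod, lit}

// hash() can be negative, so pmod keeps the result in 0..3
val withPart = userList.withColumn("part", pmod(hash(col("userId")), lit(4)))
withPart.write.partitionBy("part").parquet("/path/to/output")   // one directory per part value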