Spark DataFrame row count is inconsistent between runs - scala

When I am running my spark job (version 2.1.1) on EMR, each run counts a different amount of rows on a dataframe. I first read data from s3 to 4 different dataframes, these counts are always consistent an then after joining the dataframes, the result of the join have different counts. afterwards I also filter the result and that also has a different count on each run. The variations are small, 1-5 rows difference but it's still something I would like to understand.
This is the code for the join:
val impJoinKey = Seq("iid", "globalVisitorKey", "date")
val impressionsJoined: DataFrame = impressionDsNoDuplicates
.join(realUrlDSwithDatenoDuplicates, impJoinKey, "outer")
.join(impressionParamterDSwithDateNoDuplicates, impJoinKey, "left")
.join(chartSiteInstance, impJoinKey, "left")
.withColumn("timestamp", coalesce($"timestampImp", $"timestampReal", $"timestampParam"))
.withColumn("url", coalesce($"realUrl", $"url"))
and this is for the filter:
val impressionsJoined: Dataset[ImpressionJoined] = impressionsJoinedFullDay.where($"timestamp".geq(new Timestamp(start.getMillis))).cache()
I have also tried using filter method instead of where, but with same results
Any thought?
Thanks
Nir

is it possible that one of the data sources changes over over time?
since impressionsJoined is not cached, spark will reevaluate it from scratch on every action, and that includes reading the data again from the source.
try caching impressionsJoined after the join.

Related

examine the result of groupBy followed by orderBy in spark

I have dataframe, including two columns, e.g., "ID" and "Time". I would like to quickly check the result of grouped by ID and based on the time information.
I can do this
df.groupBy('ID').orderBy('Time')
But I cannot add show after that because show() is not a attribute of either orderBy or groupBy. How to quickly check this kind of operation result.
For a quick check (and if you do not care too much about performance) you can use GroupedData.applyInPandas() to print out each group separately as Pandas Dataframe:
def debug(pandas_df):
print(pandas_df.sort_values("Time"))
return pandas_df
df = df.groupBy("ID").applyInPandas(lambda pandas_df: debug(pandas_df), df.schema)
df.<call action>()
In a multi node environment, the debug messages will be printed out on the executors.

Read file content per row of Spark DataFrame

We have an AWS S3 bucket with millions of documents in a complex hierarchy, and a CSV file with (among other data) links to a subset of those files, I estimate this file will be about 1000 to 10.000 rows. I need to join the data from the CSV file with the contents of the documents for further processing in Spark. In case it matters, we're using Scala and Spark 2.4.4 on an Amazon EMR 6.0.0 cluster.
I can think of two ways to do this. First is to add a transformation on the CSV DataFrame that adds the content as a new column:
val df = spark.read.format("csv").load("<csv file>")
val attempt1 = df.withColumn("raw_content", spark.sparkContext.textFile($"document_url"))
or variations thereof (for example, wrapping it in a udf) don't seem to work, I think because sparkContext.textFile returns an RDD, so I'm not sure it's even supposed to work this way? Even if I get it working, is the best way to keep it performant in Spark?
An alternative I tried to think of is to use spark.sparkContext.wholeTextFiles upfront and then join the two dataframes together:
val df = spark.read.format("csv").load("<csv file>")
val contents = spark.sparkContext.wholeTextFiles("<s3 bucket>").toDF("document_url", "raw_content");
val attempt2 = df.join(contents, df("document_url") === contents("document_url"), "left")
but wholeTextFiles doesn't go into subdirectories and the needed paths are hard to predict, and I'm also unsure of the performance impact of trying to build an RDD of the entire bucket of millions of files if I only need a small fraction of it, since the S3 API probably doesn't make it very fast to list all the objects in the bucket.
Any ideas? Thanks!
I did figure out a solution in the end:
val df = spark.read.format("csv").load("<csv file>")
val allS3Links = df.map(row => row.getAs[String]("document_url")).collect()
val joined = allS3Links.mkString(",")
val contentsDF = spark.sparkContext.wholeTextFiles(joined).toDF("document_url", "raw_content");
The downside to this solution is that it pulls all the urls to the driver, but it's workable in my case (100,000 * ~100 char length strings is not that much) and maybe even unavoidable.

Caching Large Dataframes in Spark Effectively

I am currently working on 11,000 files. Each file will generate a data frame which will be Union with the previous one. Below is the code:
var df1 = sc.parallelize(Array(("temp",100 ))).toDF("key","value").withColumn("Filename", lit("Temp") )
files.foreach( filename => {
val a = filename.getPath.toString()
val m = a.split("/")
val name = m(6)
println("FILENAME: " + name)
if (name == "_SUCCESS") {
println("Cannot Process '_SUCCSS' Filename")
} else {
val freqs=doSomething(a).toDF("key","value").withColumn("Filename", lit(name) )
df1=df1.unionAll(freqs)
}
})
First, i got an error of java.lang.StackOverFlowError on 11,000 files. Then, i add a following line after df1=df1.unionAll(freqs):
df1=df1.cache()
It resolves the problem but after each iteration, it is getting slower. Can somebody please suggest me what should be done to avoid StackOverflowError with no decrease in time.
Thanks!
The issue is that spark manages a dataframe as a set of transformations. It begins with the "toDF" of the first dataframe, then perform the transformations on it (e.g. withColumn), then unionAll with the previous dataframe etc.
The unionAll is just another such transformation and the tree becomes very long (with 11K unionAll you have an execution tree of depth 11K). The unionAll when building the information can get to a stack overflow situation.
The caching doesn't solve this, however, I imagine you are adding some action along the way (otherwise nothing would run besides building the transformations). When you perform caching, spark might skip some of the steps and therefor the stack overflow would simply arrive later.
You can go back to RDD for iterative process (your example actually is not iterative but purely parallel, you can simply save each separate dataframe along the way and then convert to RDD and use RDD union).
Since your case seems to be join unioning a bunch of dataframes without true iterations, you can also do the union in a tree manner (i.e. union pairs, then union pairs of pairs etc.) this would change the depth from O(N) to O(log N) where N is the number of unions.
Lastly, you can read and write the dataframe to/from disk. The idea is that after every X (e.g. 20) unions, you would do df1.write.parquet(filex) and then df1 = spark.read.parquet(filex). When you read the lineage of a single dataframe would be the file reading itself. The cost of course would be the writing and reading of the file.

Spark: transform dataframe

I work with Spark 1.6.1 in Scala.
I have one dataframe, and I want to create different dataframe and only want to read 1 time.
For example one dataframe have two columns ID and TYPE, and I want to create two dataframe one with the value of type = A and other with type value = B.
I've checked another posts on stackoverflow, but found only the option to read the dataframe 2 times.
However, I would like another solution with the best performance possible.
Kinds regards.
Spark will read from the data source multiple times if you perform multiple actions on the data. The way to aviod this is to use cache(). In this way, the data will be saved to memory after the first action, which will make subsequent actions using the data faster.
Your two dataframes can be created in this way, requiring only one read of the data source.
val df = spark.read.csv(path).cache()
val dfA = df.filter($"TYPE" === "A").drop("TYPE")
val dfB = df.filter($"TYPE" === "B").drop("TYPE")
The "TYPE" column is dropped as it should be unnecessary after the separation.

Recursively adding rows to a dataframe

I am new to spark. I have some json data that comes as an HttpResponse. I'll need to store this data in hive tables. Every HttpGet request returns a json which will be a single row in the table. Due to this, I am having to write single rows as files in the hive table directory.
But I feel having too many small files will reduce the speed and efficiency. So is there a way I can recursively add new rows to the Dataframe and write it to the hive table directory all at once. I feel this will also reduce the runtime of my spark code.
Example:
for(i <- 1 to 10){
newDF = hiveContext.read.json("path")
df = df.union(newDF)
}
df.write()
I understand that the dataframes are immutable. Is there a way to achieve this?
Any help would be appreciated. Thank you.
You are mostly on the right track, what you want to do is to obtain multiple single records as a Seq[DataFrame], and then reduce the Seq[DataFrame] to a single DataFrame by unioning them.
Going from the code you provided:
val BatchSize = 100
val HiveTableName = "table"
(0 until BatchSize).
map(_ => hiveContext.read.json("path")).
reduce(_ union _).
write.insertInto(HiveTableName)
Alternatively, if you want to perform the HTTP requests as you go, we can do that too. Let's assume you have a function that does the HTTP request and converts it into a DataFrame:
def obtainRecord(...): DataFrame = ???
You can do something along the lines of:
val HiveTableName = "table"
val OtherHiveTableName = "other_table"
val jsonArray = ???
val batched: DataFrame =
jsonArray.
map { parameter =>
obtainRecord(parameter)
}.
reduce(_ union _)
batched.write.insertInto(HiveTableName)
batched.select($"...").write.insertInto(OtherHiveTableName)
You are clearly misusing Spark. Apache Spark is analytical system, not a database API. There is no benefit of using Spark to modify Hive database like this. It will only bring a severe performance penalty without benefiting from any of the Spark features, including distributed processing.
Instead you should use Hive client directly to perform transactional operations.
If you can batch-download all of the data (for example with a script using curl or some other program) and store it in a file first (or many files, spark can load an entire directory at once) you can then load that file(or files) all at once into spark to do your processing. I would also check to see it the webapi as any endpoints to fetch all the data you need instead of just one record at a time.