Spark Dataframe returns an inconsistent value on count() - pyspark

I am using pyspark to perform some computations on data obtained from a PostgreSQL database. My pipeline looks something like this:
limit = 1000
query = "(SELECT * FROM table LIMIT {}) as filter_query"
df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://path/to/db") \
    .option("dbtable", query.format(limit)) \
    .option("user", "user") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .load()
df.createOrReplaceTempView("table")
df.count() # 1000
So far, so good. The problem starts when I perform some transformations on the data:
counted_data = spark.sql("SELECT column1, count(*) as count FROM table GROUP BY column1").orderBy("column1")
counted_data.count() # First value
counted_data_with_additional_column = counted_data.withColumn("column1", my_udf_function("column1"))
counted_data_with_additional_column.count() # Second value, inconsistent with the first count (should be the same)
The first transformation alters the number of rows (the value should be <= 1000). However, the second one does not; it just adds a new column. How can I be getting a different result from count()?

The explanation is actually quite simple, but a bit tricky. Spark may perform additional reads of the input source (in this case the database). Since some other process is inserting data into the database, these additional calls read slightly different data than the original read, causing the inconsistent behaviour. A simple call to df.cache() after the read prevents the further reads. I figured this out by analyzing the traffic between the database and my computer: some further SQL commands were indeed issued that matched my transformations. After adding the cache() call, no further traffic appeared.

Since you are using LIMIT 1000, you may be getting a different set of 1000 records on each execution. And since you get different records each time, the result of the aggregation will be different too. To get consistent behaviour with LIMIT, you can try the following approaches.
Either cache your dataframe with cache() or persist(), which ensures that Spark reuses the same data for as long as it remains in memory.
A better approach, though, is to sort the data on some unique column and then take the first 1000 records, which guarantees you get the same 1000 records each time.
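For example, a minimal sketch combining both suggestions (shown in Scala; the only substantive changes from the question's PySpark code are the ORDER BY in the pushdown query and the cache() call, and id is an assumed unique column):
// sketch: order on a unique column inside the pushdown query, so the database
// returns the same 1000 rows every time the source is (re-)read
val query = "(SELECT * FROM table ORDER BY id LIMIT 1000) AS filter_query"
val df = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://path/to/db")
  .option("dbtable", query)
  .option("user", "user")
  .option("password", "password")
  .option("driver", "org.postgresql.Driver")
  .load()
  .cache() // deterministic ordering plus caching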
Hope it helps.

Related

Databricks - All records from Dataframe/Tempview get removed after merge

I am observing a weird issue. I am not sure whether it is a gap in my Spark knowledge or something else.
I have a dataframe as shown in the code below. I create a tempview from it, and I am observing that after a merge operation that tempview becomes empty. Not sure why.
val myDf = getEmployeeData()
myDf.createOrReplaceTempView("myView")
// Result1: Below lines display all the records
myDf.show()
spark.table("myView").show()
// performing merge operation
val sql = s"""MERGE INTO employee AS a
USING myView AS b
ON a.Id = b.Id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *"""
spark.sql(sql)
// Result2: ISSUE is here. myDf & myView are both empty
myDf.show()
spark.table("myView").show()
Edit
The getEmployeeData method performs a join between two dataframes and returns the result.
df1.as(df1Alias).join(df2.as(df2Alias), expr(joinString), "inner").filter(finalFilterString).select(s"$df1Alias.*")
Dataframes in Spark are lazily evaluated, i.e. not executed until an action like .show or .collect is called, or until they are used in a SQL DML operation (such as MERGE). This also means that if you refer to a dataframe once more, it gets re-evaluated again.
Assuming there is no other background activity that could interfere, your function getEmployeeData apparently depends on the employee table. It gets executed both before and after the merge and may yield different results.
To prevent it you can checkpoint the dataframe:
val myDfCheckpointed = myDf.checkpoint()
or explicitly materialize it:
myDf.write.saveAsTable("myViewMaterialized")
and later refer to the materialized version.
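A hedged end-to-end sketch of the checkpoint variant (the checkpoint directory path is a placeholder; checkpoint() is eager by default, so the data is materialized before the merge runs):
// sketch: checkpointing cuts the lineage back to the employee table, so the
// MERGE below can no longer change what the dataframe/view return
spark.sparkContext.setCheckpointDir("/tmp/checkpoints") // hypothetical location
val stableDf = getEmployeeData().checkpoint()
stableDf.createOrReplaceTempView("myView")

spark.sql(sql)   // the MERGE statement from the question
stableDf.show()  // still shows the pre-merge rows
spark.table("myView").show() // same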
I agree with the points @Kombajn zbozowy made about dataframes in Spark being lazily evaluated and re-evaluated if you call an action on them again.
I would like to point out that this is normal, expected behavior and may not have anything to do with the merge operation itself.
For example, suppose the dataframe you get as the output of the join contains only inserts, and you perform those inserts on the target table using the dataframe write API. If you then run df.show() again, the output will be empty, because when the dataframe is re-evaluated the join no longer finds any differences and returns no records.
The same holds true for the merge operation: it also applies the inserts/updates to the target table, so when you rerun the show it returns no rows.

How to improve query performance in spark?

I have a query which joins 4 tables, and I used query pushdown to read it into a dataframe.
val df = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://ip/dbname")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("user", "username")
  .option("password", "password")
  .option("dbtable", s"($query) as temptable")
  .load()
The number of records in the individual tables is 430, 350, 64, and 2354 respectively; it takes 12.784 s to load the dataframe and 2.119 s to create the SparkSession.
Then I count the result:
val count=df.count()
println(s"count $count")
The total execution time is then 25.806 s, and the result contains only 430 records.
When I try the same query in SQL Workbench it takes only a few seconds to execute completely.
I also tried cache() after load(), but it takes the same time. So how can I make this run much faster than it currently does?
You are using a tool meant to handle big data on a toy example, so you are getting all of the overhead and none of the benefits.
Try Options like
partitionColumn
numPartitions
lowerBound
upperBound
These options can help improve the performance of the query, as they create multiple partitions and the read happens in parallel.
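A hedged sketch of those options applied to the same read (the partition column, bounds, and partition count are placeholder values; lowerBound and upperBound only control how the partitions are split, they do not filter rows):
// sketch: split the JDBC read into parallel partitions over a numeric column
val df = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://ip/dbname")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("user", "username")
  .option("password", "password")
  .option("dbtable", s"($query) as temptable")
  .option("partitionColumn", "id") // hypothetical numeric column in the result
  .option("numPartitions", "4")
  .option("lowerBound", "1")
  .option("upperBound", "2500")
  .load()
That said, for a result of only a few hundred rows most of the elapsed time is session startup and connection overhead, so the gain here will be limited.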

Incrementally Load Data from RDBMS and Write to Parquet

I am attempting to implement a pipeline for reading data from an RDBMS data source, partitioning the read on a datetime field, and storing that partitioned data in Parquet.
The pipeline is intended to be run weekly, with each run simply appending any new rows which have been added to the RDBMS source to the partitioned parquet data.
Currently, the way I'm handling this problem is by:
Storing the previous time of ingest.
Reading from the RDBMS and applying a filter on the datetime column for entries after the previous time of ingest.
Appending this data to the partitioned parquet file.
While this works, I am not sure if it is the most idiomatic way of handling what is likely a very common use case. In addition, unless I want to allow row duplication, some additional massaging of the already-written data is necessary.
An example of this pipeline would be:
// rdbms is an object which stores various connection information for an RDBMS.
// dateCol is the column name of the datetime column.
// path is the parquet file path.
val yearCol = "year"
val monthCol = "month"
val dayCol = "day"
val refreshDF = spark.read
  .format("jdbc")
  .option("url", rdbms.connectionString + "/" + rdbms.database)
  .option("dbtable", rdbms.table)
  .option("user", rdbms.userName)
  .option("password", rdbms.password)
  .option("driver", rdbms.driverClass)
  .option("header", true)
  .option("readOnly", true)
  .load()
val ts = unix_timestamp(col(dateCol), dateFormat).cast("timestamp")
val unixDateCol = dateCol + "_unix"
val datedDF = refreshDF.withColumn(unixDateCol, ts)
val filteredDF = datedDF.filter(col(unixDateCol).gt(lastRun))
val ymdDF = filteredDF.withColumn(yearCol, year(col(unixDateCol)))
  .withColumn(monthCol, month(col(unixDateCol)))
  .withColumn(dayCol, dayofmonth(col(unixDateCol)))
ymdDF.write.mode("append").partitionBy(yearCol, monthCol, dayCol).parquet(path)
Is there a better way to do this? I'd like to avoid reading the entire table and computing a difference for performance reasons.
(Edit: added the partitioning, although since this pass doesn't de-duplicate the latest read, it isn't actually leveraged yet.)
Instead of reading all the data from the DB each time, you can pass timestamp filters via the 'predicates' parameter so the database only returns the data in the date range of interest. This is much faster for large tables, especially if the timestamp column is indexed and/or partitioned on the DB side. Here is the relevant method:
/**
 * Construct a `DataFrame` representing the database table accessible via JDBC URL
 * url named table using connection properties. The `predicates` parameter gives a list of
 * expressions suitable for inclusion in WHERE clauses; each one defines one partition
 * of the `DataFrame`.
 *
 * Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash
 * your external database systems.
 *
 * @param url JDBC database url of the form `jdbc:subprotocol:subname`
 * @param table Name of the table in the external database.
 * @param predicates Condition in the where clause for each partition.
 * @param connectionProperties JDBC database connection arguments, a list of arbitrary string
 *                             tag/value. Normally at least a "user" and "password" property
 *                             should be included. "fetchsize" can be used to control the
 *                             number of rows per fetch.
 * @since 1.4.0
 */
def jdbc(
    url: String,
    table: String,
    predicates: Array[String],
    connectionProperties: Properties): DataFrame
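A hedged usage sketch against the question's rdbms object, assuming lastRun can be rendered as a SQL timestamp literal (lastRunTimestamp is a hypothetical pre-formatted value):
// sketch: each predicate becomes the WHERE clause of one partition, so only
// rows newer than the previous ingest are pulled from the database
import java.util.Properties

val props = new Properties()
props.setProperty("user", rdbms.userName)
props.setProperty("password", rdbms.password)
props.setProperty("driver", rdbms.driverClass)

val predicates = Array(s"$dateCol > '$lastRunTimestamp'") // hypothetical formatted value
val refreshDF = spark.read.jdbc(
  rdbms.connectionString + "/" + rdbms.database,
  rdbms.table,
  predicates,
  props)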
As for ensuring the data is not duplicated: you can query the count of records per day from the parquet output for, say, the last couple of weeks, find the oldest date for which there are 0 records, and use that as the "previous time of ingest". This eliminates the chance of that date being out of sync with the parquet data.
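For that bookkeeping, a hedged sketch that lists row counts per written day partition, most recent first, so that missing days (compared against the calendar) stand out and the oldest of them can be used as the restart point:
// sketch: per-day row counts from the parquet output written so far
import org.apache.spark.sql.functions.col

val countsPerDay = spark.read.parquet(path)
  .groupBy(col(yearCol), col(monthCol), col(dayCol))
  .count()
  .orderBy(col(yearCol).desc, col(monthCol).desc, col(dayCol).desc)
  .limit(14) // hypothetical: look back roughly two weeks of partitions
countsPerDay.show()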

Spark DataFrame row count is inconsistent between runs

When I run my Spark job (version 2.1.1) on EMR, each run counts a different number of rows in a dataframe. I first read data from S3 into 4 different dataframes; these counts are always consistent. But after joining the dataframes, the result of the join has a different count on each run. Afterwards I also filter the result, and that too has a different count on each run. The variations are small, a 1-5 row difference, but it's still something I would like to understand.
This is the code for the join:
val impJoinKey = Seq("iid", "globalVisitorKey", "date")
val impressionsJoined: DataFrame = impressionDsNoDuplicates
  .join(realUrlDSwithDatenoDuplicates, impJoinKey, "outer")
  .join(impressionParamterDSwithDateNoDuplicates, impJoinKey, "left")
  .join(chartSiteInstance, impJoinKey, "left")
  .withColumn("timestamp", coalesce($"timestampImp", $"timestampReal", $"timestampParam"))
  .withColumn("url", coalesce($"realUrl", $"url"))
and this is for the filter:
val impressionsJoined: Dataset[ImpressionJoined] = impressionsJoinedFullDay.where($"timestamp".geq(new Timestamp(start.getMillis))).cache()
I have also tried using the filter method instead of where, but with the same results.
Any thoughts?
Thanks
Nir
Is it possible that one of the data sources changes over time?
Since impressionsJoined is not cached, Spark will re-evaluate it from scratch on every action, and that includes reading the data again from the source.
Try caching impressionsJoined after the join.
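A hedged sketch of that change, reusing the names from the question:
// sketch: cache the joined frame and materialize it once, so later filters,
// counts and writes all reuse the same snapshot instead of re-reading S3
val impressionsJoinedCached = impressionsJoined.cache()
impressionsJoinedCached.count() // first action populates the cache
// build the filtered Dataset from impressionsJoinedCached from here on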

Recursively adding rows to a dataframe

I am new to Spark. I have some JSON data that comes as an HTTP response. I need to store this data in Hive tables. Every HTTP GET request returns a JSON document that will become a single row in the table. Because of this, I am currently writing single rows as files in the Hive table directory.
But I feel that having too many small files will reduce speed and efficiency. So is there a way I can repeatedly add new rows to a dataframe and write it to the Hive table directory all at once? I feel this would also reduce the runtime of my Spark code.
Example:
for (i <- 1 to 10) {
  newDF = hiveContext.read.json("path")
  df = df.union(newDF)
}
df.write()
I understand that the dataframes are immutable. Is there a way to achieve this?
Any help would be appreciated. Thank you.
You are mostly on the right track: what you want to do is obtain the individual records as a Seq[DataFrame], and then reduce that Seq[DataFrame] to a single DataFrame by unioning them.
Going from the code you provided:
val BatchSize = 100
val HiveTableName = "table"
(0 until BatchSize).
  map(_ => hiveContext.read.json("path")).
  reduce(_ union _).
  write.insertInto(HiveTableName)
Alternatively, if you want to perform the HTTP requests as you go, we can do that too. Let's assume you have a function that does the HTTP request and converts it into a DataFrame:
def obtainRecord(...): DataFrame = ???
You can do something along the lines of:
val HiveTableName = "table"
val OtherHiveTableName = "other_table"
val jsonArray = ???
val batched: DataFrame =
  jsonArray.
    map { parameter =>
      obtainRecord(parameter)
    }.
    reduce(_ union _)
batched.write.insertInto(HiveTableName)
batched.select($"...").write.insertInto(OtherHiveTableName)
You are clearly misusing Spark. Apache Spark is an analytical system, not a database API. There is no benefit to using Spark to modify a Hive database like this. It will only bring a severe performance penalty, without benefiting from any of Spark's features, including distributed processing.
Instead, you should use a Hive client directly to perform transactional operations.
If you can batch-download all of the data first (for example with a script using curl or some other program) and store it in a file (or many files; Spark can load an entire directory at once), you can then load those files into Spark all at once to do your processing. I would also check whether the web API has any endpoints for fetching all the data you need instead of just one record at a time.
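A hedged sketch of that approach, assuming the responses have already been saved as JSON files under a single directory (the path is a placeholder):
// sketch: load all downloaded JSON files in one pass, then write once
val allRows = hiveContext.read.json("path/to/downloaded/json/")
allRows.write.insertInto("table") // the Hive table name from the earlier answer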