I am trying to accomplish the following:
For iterator i from 0 to n
Create data frames using i as one of the filter criteria in the select statement of sparksql
Create Rdd from dataframe
Perform multiple operations on rdd
How do I make sure that the for loop works? I am trying to run the Scala code on a cluster.
First, I would suggest running it locally in a test suite (for example with ScalaTest). If unit or integration testing is not your thing, you could simply call df.show() on each data frame as you iterate through them; this prints a sample from each data frame.
(0 until 5).foreach(i => {
  val df = [some data frame you use i in filtering]
  df.show()
  val df_rdd = df.rdd
})
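A minimal runnable sketch of the loop above, assuming a local SparkSession and a hypothetical temp view `events` with columns `id` and `bucket` (none of these names come from the question). It uses `i` in the WHERE clause of a Spark SQL statement, converts each result to an RDD, and runs a couple of RDD operations on it.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for testing; on a cluster the master comes from spark-submit.
val spark = SparkSession.builder().master("local[*]").appName("loop-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data; `bucket` plays the role of the column filtered by i.
(0 until 20).map(x => (x, x % 5)).toDF("id", "bucket").createOrReplaceTempView("events")

val counts = (0 until 5).map { i =>
  // i is used as a filter criterion in the SELECT statement
  val df = spark.sql(s"SELECT id, bucket FROM events WHERE bucket = $i")
  df.show()          // prints a sample so each iteration can be inspected
  val dfRdd = df.rdd // RDD[Row] for further RDD operations
  // Example RDD operations: extract the id, keep even ids, count them
  dfRdd.map(_.getInt(0)).filter(_ % 2 == 0).count()
}
```

Using `map` instead of `foreach` returns one result per iteration, which makes it easy to check that every pass of the loop actually ran.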
I'll be getting data from HBase within a time range. I divide the time range into chunks and scan the HBase columns within each chunked time range.
Suppose I have a time range from June to August; I divide it into weeks, which gives a list of 8 weekly time ranges.
From that list, I scan the HBase columns via repartition and mapPartitions like this:
sparkSession.sparkContext.parallelize(chunkedTimeRange.toList).repartition(noOfCores).mapPartitions{
// Scan Cols of Hbase Logic
// This gives DF as output
}
I get a DF from the above and apply some filters to it using mapPartitions and foreachPartition like this:
df.mapPartitions{
rows => {
rows.toList.par.foreach(
cols => {
json.filter(condition).foreach(//code)
anotherJson.filter(condition).foreach(//code)
}
)
}
// returns DF
}
This DF is used by other methods. Since mapPartitions is lazy, I call an action after the above like this:
df.persist(StorageLevel.MEMORY_AND_DISK)
df.foreachPartition((x: Iterator[org.apache.spark.sql.Row]) => x: Unit)
This foreachPartition is unnecessarily executing twice. One stage takes around 2.5 min (128 tasks) and the other takes 40 s (200 tasks), which is not necessary.
200 is the value set in the Spark config:
spark.sql.shuffle.partitions=200.
How can I avoid this unnecessary foreachPartition? Is there any way I can still improve performance?
I found a similar question. Unfortunately, I didn't get much Information from that.
Screenshot of foreachPartitions happening twice for same DF
If any clarification is needed, please mention it in a comment.
You need to "reuse" the persisted Dataframe:
val df2 = df.persist(StorageLevel.MEMORY_AND_DISK)
df2.foreachPartition((x: Iterator[org.apache.spark.sql.Row]) => x: Unit)
Otherwise when running the foreachPartition, it runs on a DF which has not been persisted and it's doing every step of the DF computation again.
I want to check whether any column in a CSV file contains formulas. I have constructed a regex and want to apply it to the entire dataframe.
I have a solution, but it works column by column, and I suspect it will hurt performance on large datasets.
val columns = df.columns
import spark.implicits._
val dfColumns = columns.map{name =>
val some = df.filter($"$name".rlike("""^=.+\)$"""))
some.count()>0
}
val exist = dfColumns.exists(x=> x)
You cannot apply the same method to the whole dataframe at once.
Instead, you can optimize your code a little.
val df = spark.read.csv("your_path").cache // Cache the dataframe to avoid re-reading
import spark.implicits._
df.columns.map{
  name => !df.filter($"$name".rlike("""^=.+\)$""")).isEmpty // isEmpty avoids counting every match; negate it to get "this column has a match"
}.exists(identity)
Be aware that Catalyst usually pushes filters down toward the data source, so if you do something other than just reading, the cache might not improve performance (but isEmpty always will).
PS: isEmpty is available from Spark 2.4. On older versions, you can use df.limit(1).count > 0 on the filtered dataframe, which limits before counting and improves performance.
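As an alternative to running one job per column, the per-column tests can be combined into a single predicate. This is a different technique from the answer above, sketched under the assumption of a local SparkSession and a small hypothetical dataframe standing in for the CSV; column names that contain dots or other special characters would need backtick quoting.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("formula-check").getOrCreate()
import spark.implicits._

// Hypothetical sample standing in for the CSV; all columns are strings, as with spark.read.csv.
val df = Seq(("a", "1"), ("b", "=SUM(A1:A2)"), ("c", "3")).toDF("name", "value")

val formulaPattern = """^=.+\)$"""

// OR together one rlike test per column so that a single job checks every column.
val anyFormula = df.columns.map(c => col(c).rlike(formulaPattern)).reduce(_ || _)

// limit(1) means one matching row is enough; there is no need to count all matches.
val exist = df.filter(anyFormula).limit(1).count() > 0
```

Here `exist` answers the same question as the per-column version, but the data is filtered in one pass rather than once per column.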
I am trying to do some transformations on a dataset with Spark using Scala. I currently use Spark SQL but want to shift the code to native Scala code. I want to know whether to use filter or map for operations like matching the values in a column and getting a single column, after the transformation, into a different dataset.
SELECT * FROM TABLE WHERE COLUMN = ''
I used to write something like this in Spark SQL. Can someone tell me an alternative way to write the same thing using map or filter on the dataset, and which one is faster?
You can read documentation from Apache Spark website. This is the link to API documentation at https://spark.apache.org/docs/2.3.1/api/scala/index.html#package.
Here is a little example -
val df = sc.parallelize(Seq((1,"ABC"), (2,"DEF"), (3,"GHI"))).toDF("col1","col2")
val df1 = df.filter("col1 > 1")
df1.show()
val df2 = df1.map(x => x.getInt(0) + 3)
df2.show()
If I understand your question correctly, you need to rewrite your SQL query using the DataFrame API. Your query reads all columns from table TABLE and keeps the rows where COLUMN is an empty string. You can do this with a DataFrame in the following way:
spark.read.table("TABLE")
.where($"COLUMN".eqNullSafe(""))
.show(10)
Performance will be the same as in your SQL. Use dataFrame.explain(true) method to understand what Spark will do.
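On the filter-versus-map part of the question: the two are not alternatives, since filter selects rows and map transforms each row. A minimal sketch of both styles, assuming a local SparkSession and a hypothetical `Record` case class (the field names are illustrative only):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type; field names are illustrative only.
case class Record(id: Int, column: String)

val spark = SparkSession.builder().master("local[*]").appName("filter-map").getOrCreate()
import spark.implicits._

val ds = Seq(Record(1, ""), Record(2, "x"), Record(3, "")).toDS()

// Typed API: filter keeps the matching rows, map turns each kept row into a single value.
val typedIds = ds.filter(_.column == "").map(_.id).collect().sorted

// Untyped column expressions: the same query written as where + select.
val untypedIds = ds.where($"column" === "").select($"id").as[Int].collect().sorted
```

Both give the same result. The column-expression form is visible to the Catalyst optimizer, whereas Scala lambdas are opaque to it, so the untyped form is usually the safer default when performance matters; comparing the output of explain(true) for each shows the difference.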
I am new to Spark. I have some JSON data that comes as an HttpResponse, and I need to store this data in Hive tables. Every HttpGet request returns a JSON document that becomes a single row in the table. Because of this, I am having to write single rows as files in the Hive table directory.
But I feel that having too many small files will reduce speed and efficiency. So is there a way I can repeatedly add new rows to the DataFrame and write it to the Hive table directory all at once? I feel this will also reduce the runtime of my Spark code.
Example:
for(i <- 1 to 10){
newDF = hiveContext.read.json("path")
df = df.union(newDF)
}
df.write()
I understand that the dataframes are immutable. Is there a way to achieve this?
Any help would be appreciated. Thank you.
You are mostly on the right track. What you want to do is obtain multiple single records as a Seq[DataFrame], and then reduce the Seq[DataFrame] to a single DataFrame by unioning them.
Going from the code you provided:
val BatchSize = 100
val HiveTableName = "table"
(0 until BatchSize).
map(_ => hiveContext.read.json("path")).
reduce(_ union _).
write.insertInto(HiveTableName)
Alternatively, if you want to perform the HTTP requests as you go, we can do that too. Let's assume you have a function that does the HTTP request and converts it into a DataFrame:
def obtainRecord(...): DataFrame = ???
You can do something along the lines of:
val HiveTableName = "table"
val OtherHiveTableName = "other_table"
val jsonArray = ???
val batched: DataFrame =
jsonArray.
map { parameter =>
obtainRecord(parameter)
}.
reduce(_ union _)
batched.write.insertInto(HiveTableName)
batched.select($"...").write.insertInto(OtherHiveTableName)
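If the raw JSON strings are available on the driver, another option is to parse them all at once instead of unioning many single-row DataFrames. The sketch below assumes Spark 2.2 or later (where DataFrameReader.json accepts a Dataset[String]), a local SparkSession, and a hypothetical `fetchJson` function standing in for the HTTP request.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("json-batch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the HTTP call: returns the response body as a JSON string.
def fetchJson(i: Int): String = s"""{"id": $i, "name": "item-$i"}"""

// Collect the raw responses on the driver first ...
val responses: Seq[String] = (1 to 10).map(fetchJson)

// ... then parse them in one go; each JSON string becomes one row.
val batched = spark.read.json(responses.toDS())

// batched.write.insertInto("table") would then write the whole batch in one job.
```

This keeps the query plan flat, whereas a long chain of unions grows the plan with every added record.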
You are clearly misusing Spark. Apache Spark is an analytical system, not a database API. There is no benefit to using Spark to modify a Hive database like this. It will only bring a severe performance penalty without benefiting from any of the Spark features, including distributed processing.
Instead, you should use a Hive client directly to perform transactional operations.
If you can batch-download all of the data (for example with a script using curl or some other program) and store it in a file first (or many files; Spark can load an entire directory at once), you can then load that file (or files) all at once into Spark to do your processing. I would also check whether the web API has any endpoints that fetch all the data you need instead of just one record at a time.
Usually I load CSV files and then run different kinds of aggregations, such as "group by", with Spark. I was wondering whether it is possible to start this sort of operation during the file loading (typically a few million rows) instead of running the two steps sequentially, and whether it would be worthwhile in terms of time saved.
Example:
val csv = sc.textFile("file.csv")
val data = csv.map(line => line.split(",").map(elem => elem.trim))
val header = data.first() // first parsed line, i.e. the header fields
val rows = data.filter(line => line(0) != header(0)) // drop the header row
val trows = rows.map(row => (row(0), row))
trows.groupBy(//row(0) etc.)
From my understanding of how Spark works, the groupBy (or aggregate) will be "postponed" until the whole CSV file has been loaded into memory. If this is correct, can the loading and the grouping run at the "same" time instead of running the two steps in sequence?
the groupBy (or aggregate) will be "postponed" to the loading in memory of the whole file csv.
That is not the case. At the local (single partition) level, Spark operates on lazy sequences, so operations belonging to a single task (this includes map-side aggregation) can be squashed together.
In other words, when you have a chain of methods, operations are performed line by line, not transformation by transformation: the first line will be mapped, filtered, mapped once again, and passed to the aggregator before the next one is accessed.
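The line-by-line behaviour can be illustrated with a plain Scala Iterator, which is the kind of lazy sequence a Spark task works through within a partition. This is only an analogy in ordinary Scala, not Spark code; the `trace` buffer is there purely to record the order of evaluation.

```scala
import scala.collection.mutable.ArrayBuffer

// Records the order in which each step touches each element.
val trace = ArrayBuffer.empty[String]

val lines = Iterator("1", "2", "3")

val result = lines
  .map { s => trace += s"parse $s"; s.toInt }
  .filter { n => trace += s"filter $n"; n != 2 }
  .map { n => trace += s"scale $n"; n * 10 }
  .toList

// trace: parse 1, filter 1, scale 1, parse 2, filter 2, parse 3, filter 3, scale 3
// result: List(10, 30)
```

Each element passes through the whole chain before the next one is read, so no intermediate collection of all parsed lines is ever built.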
To start a group-by during loading, you could proceed with one of 2 options:
1. Write your own loader and do the group-by inside it, together with aggregateByKey. The downside is more code to write and more maintenance.
2. Use Parquet files as input together with DataFrames. Because Parquet is columnar, Spark will read only the columns used in your groupBy, so it should be faster (see DataFrameReader).
df = spark.read.parquet('file_path')
df = df.groupBy('column_a', 'column_b', '...').count()
df.show()
Because Spark is lazy, it won't load your file until you call an action method such as show/collect/write, so Spark knows which columns to read and which to ignore during loading.
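For readers working in Scala, here is a rough equivalent of the snippet above that also shows how to check the column pruning. It assumes a local SparkSession and writes a small hypothetical dataset to a temporary directory so that it is self-contained; the `unused` column is an illustrative name.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("parquet-groupby").getOrCreate()
import spark.implicits._

// Hypothetical data written to a temporary directory so the sketch is self-contained.
val path = Files.createTempDirectory("parquet-sketch").resolve("data").toString
Seq(("a", "x", 1), ("a", "x", 2), ("b", "y", 3))
  .toDF("column_a", "column_b", "unused")
  .write.parquet(path)

val grouped = spark.read.parquet(path).groupBy("column_a", "column_b").count()

grouped.explain() // the scan's ReadSchema should list only column_a and column_b
grouped.show()    // the row data is only scanned once an action such as show() runs
```

Note that spark.read.parquet does inspect file metadata up front to determine the schema; it is the scan of the row data that is deferred until an action.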