pyspark dataframe reference vs value

I'm learning pyspark. I'm trying to build a DataFrame from SQL, for example:
DF = spark.sql("with a as (select ....) select ...")
My SQL is a little complex, so it takes 20 minutes to execute.
I feel like DF is just a reference to my SQL: when I execute DF.head(10) it takes 20 minutes, the next step DF.count() also takes 20 minutes, and so on.
I'd like to have a DataFrame like in pandas, with the values in RAM, where DF.head(10) and DF.count() take a few seconds.
The only way I can think of is to use "create table", for example:
xx = spark.sql("create table yyy as with a as (select ....) select ...")
DF = sqlContext.sql("select * from yyy")
It works, but it looks strange to me.
What are the best practices for creating a DataFrame in pyspark from complex SQL? I would like to skip the "create table" step.

I'd like to have a DataFrame like in pandas, with the values in RAM, where DF.head(10) and DF.count() take a few seconds.
Pandas loads your data into memory the moment you read it; that's why it's lightning fast. But remember, the amount of data you can load is limited by your machine's memory.
My SQL is a little complex, so it takes 20 minutes to execute. I feel like DF is just a reference to my SQL: when I execute DF.head(10) it takes 20 minutes, the next step DF.count() also takes 20 minutes, and so on.
Spark does not load the data when you read it. It only reads the data when an action such as count, head, or collect is executed.
The only way I can think of is to use "create table"
Yes, creating a table is also an action: your query is executed in full, and the next time you read the table it doesn't have to be re-computed. The alternative to creating a table is caching. You can do something like DF.cache().count(); Spark will load the entire result into memory, and all later actions on DF will be much faster.
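For illustration, here is a minimal PySpark sketch of the caching approach, reusing the placeholder query from the question:
DF = spark.sql("with a as (select ....) select ...")   # lazy: returns almost immediately
DF.cache()     # mark the result for caching (also lazy)
DF.count()     # first action: runs the expensive query once and materializes it in memory
DF.head(10)    # served from the cache, takes seconds
DF.count()     # also served from the cache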

Related

Avoid loading an empty dataframe into a table

I am building a process in Spark Scala within an ETL that checks for certain events that occurred during the ETL run. I start with an empty dataframe, and if events occur this dataframe is filled with information (a dataframe can't be filled in place; it can only be combined with other dataframes that have the same structure). The thing is that at the end of the process the generated dataframe is loaded into a table, but it can happen that the dataframe ends up empty because no event occurred, and I don't want to load an empty dataframe because it makes no sense. So I'm wondering whether there is an elegant way to load the dataframe into the table only if it is not empty, without using an if condition. Thanks!
I recommend creating the dataframe anyway. If you don't create a table with the same schema, even an empty one, your operations/transformations on the DF could fail because they may refer to columns that are not present.
To handle this, you should always create a DataFrame with the same schema, meaning the same column names and data types, regardless of whether the data exists yet. You can populate it with data later.
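The question is in Scala, but as a rough PySpark sketch of that idea (the column names and types here are hypothetical):
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.getOrCreate()
event_schema = StructType([
    StructField("event_id", StringType(), True),      # hypothetical columns
    StructField("event_time", TimestampType(), True),
    StructField("details", StringType(), True),
])
df = spark.createDataFrame([], event_schema)  # empty, but with the full schema in place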
If you still want to do it your way, here are a few options for Spark 2.1.0 and above:
df.head(1).isEmpty
df.take(1).isEmpty
df.limit(1).collect().isEmpty
These are equivalent.
I don't recommend using df.count > 0 because count is linear in the size of the data, and you would still have to do a check like df != null first.
A much better solution would be:
df.rdd.isEmpty
Or since Spark 2.4.0 there is also Dataset.isEmpty.
As you can see, whatever you decide to do, there is a check you need to make somewhere, so you can't really get rid of the if condition; the requirement itself implies one: only write the dataframe if it is not empty.
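As a concrete sketch of that check (again in PySpark, with a placeholder table name), the conditional write could look like:
if not df.rdd.isEmpty():                                  # cheap emptiness check
    df.write.mode("append").saveAsTable("events_table")   # hypothetical target table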

Spark Optimization Techniques

Hi, I have 90 GB of data in a CSV file. I'm loading this data into a temp table and then from the temp table into an ORC table using an insert-select, but converting and loading the data into ORC format takes 4 hours in Spark SQL. Is there any optimization technique I can use to reduce this time? As of now I'm not using any optimization; I'm just using Spark SQL to load data from the CSV file into a table (text format) and then from this temp table into the ORC table (using insert-select).
I'm using spark-submit as:
spark-submit \
  --class class-name \
  --jar file
Can I add any extra parameters to spark-submit to improve performance?
Scala code (sample):
import org.apache.spark.sql.SparkSession

object sample_1 {
  def main(args: Array[String]): Unit = {
    // SparkSession with Hive support enabled
    val sparksession = SparkSession.builder().enableHiveSupport().getOrCreate()
    val a1 = sparksession.sql("load data inpath 'filepath' overwrite into table table_name")
    val b1 = sparksession.sql("insert into tablename (all_column) select 'ALL_COLUMNS' from source_table")
  }
}
First of all, you don't need to store the data in a temp table in order to write it into a Hive table later. You can read the file directly and write the output using the DataFrameWriter API, which removes one step from your code.
You can write it as follows:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.enableHiveSupport().getOrCreate()
val df = spark.read.csv(filePath) // add header or delimiter options if needed
df.write.mode("append").format(outputFormat).saveAsTable(outputDB + "." + outputTableName)
Here, outputFormat will be orc, outputDB will be your Hive database, and outputTableName will be your Hive table name.
I think the above technique will reduce your write time significantly. Also, please mention the resources your job is using and I may be able to optimize it further.
Another optimization you can use is to partition your dataframe while writing. This can make the write operation faster, but you need to decide the columns to partition on carefully so that you don't end up creating a lot of small partitions.
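For reference, a sketch of the same approach in PySpark, including partitioned output (the paths, database, table, and partition column are placeholders):
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.read.option("header", "true").csv("/path/to/input.csv")
(df.write
   .mode("append")
   .format("orc")
   .partitionBy("load_date")                 # hypothetical low-cardinality column
   .saveAsTable("output_db.output_table"))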

Spark Job simply stalls when querying full cassandra table

I have a rather peculiar problem. In a DSE Spark analytics engine I produce frequent stats that I store to Cassandra in a small table. Since I keep the table trimmed and it is supposed to serve a web interface with consolidated information, I simply want to query the whole table in Spark and send the results over an API. I have tried two methods for this:
Method 1:
val a = Try(sc.cassandraTable[Data](keyspace, table).collect()).toOption
Method 2:
val query = "SELECT * FROM keyspace.table"
val df = spark.sqlContext.sql(query)
val list = df.collect()
I am doing this in a Scala program. When I use method 1, the Spark job mysteriously gets stuck showing stage 10 of 12 forever (verified in the logs and on the Spark jobs page). When I use the second method it simply tells me that no such table exists:
Unknown exception: org.apache.spark.sql.AnalysisException: Table or view not found: keyspace1.table1; line 1 pos 15;
'Project [*]
+- 'UnresolvedRelation keyspace1.table1
Interestingly, I tested both methods in the Spark shell on the cluster and they work just fine. My program has plenty of other queries done using method 1 and they all work fine; the key difference is that in each of them the main partition key always has a condition on it, unlike in this query (which holds true for this particular table too).
Here is the table structure:
CREATE TABLE keyspace1.table1 (
    userid text,
    stat_type text,
    event_time bigint,
    stat_value double,
    PRIMARY KEY (userid, stat_type)
) WITH CLUSTERING ORDER BY (stat_type ASC);
Any solid diagnosis of the problem or a workaround would be much appreciated.
When you do a select * without a where clause in Cassandra, you're actually performing a full range query, which is not an intended use case in Cassandra (aside from peeking at the data, perhaps). Just for the fun of it, try replacing it with select * from keyspace.table limit 10 and see if it works; it might.
Anyway, my gut feeling says your problem isn't with Spark but with Cassandra. If you have visibility into Cassandra metrics, look for the range query latencies.
Now, if your code above is complete, the reason that method 1 freezes while method 2 doesn't is that method 1 contains an action (collect), while method 2 doesn't involve any Spark action, just schema inference. If you add df.collect to method 2, you will face the same issue with Cassandra.
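As an illustration of the "try a limited read first" suggestion, here is a sketch in PySpark using the spark-cassandra-connector data source (keyspace and table names are placeholders, and whether the limit is pushed down to Cassandra depends on the connector version):
df = (spark.read
        .format("org.apache.spark.sql.cassandra")
        .options(keyspace="keyspace1", table="table1")
        .load())              # no action yet, only schema resolution
df.limit(10).show()           # small action: only a handful of rows come back
# df.collect()                # the full-range read that stalls in the question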

Writing a dataframe to a CSV file takes too much time in Spark

I want to aggregate data based on intervals of a timestamp column.
I saw that the computation takes 53 seconds, but writing the result to the CSV file takes 5 minutes. It seems like the csv() write takes far too long.
How can I optimize the code, please?
Here is my code snippet:
import org.apache.spark.sql.functions._

val df = spark.read.option("header", "true").option("inferSchema", "true").csv("C:\\dataSet.csv\\inputDataSet.csv")
// cast every column except the timestamp column _c0 to int so the aggregation functions can be applied
val castedDf = df.columns.filter(_ != "_c0").foldLeft(df)((acc, c) => acc.withColumn(c, col(c).cast("int")))
// add a new column containing the timestamp bucketed into 300-second intervals
val result2 = castedDf.withColumn("new_time", ((unix_timestamp(col("_c0")) / 300).cast("long") * 300).cast("timestamp")).drop("_c0")
val finalresult = result2.groupBy("new_time").agg(result2.drop("new_time").columns.map(mean(_)).head, result2.drop("new_time").columns.map(mean(_)).tail: _*).sort("new_time")
finalresult.coalesce(1).write.option("header", "true").csv("C:/result_with_time.csv") // <= this is the slow step
Here are some thoughts on optimization, based on your code.
inferSchema: it is faster to provide a predefined schema than to use inferSchema.
Instead of writing to your local filesystem, you can try writing to HDFS and then scp the file to your local machine.
df.coalesce(1).write will take more time than just df.write. With plain df.write you will get multiple files, which can be combined later using different techniques, or you can simply leave them as multiple part files in one directory.
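As a rough PySpark sketch of the first and third suggestions (an explicit schema instead of inferSchema, and writing without coalesce(1)); the column names here are hypothetical:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()
schema = StructType([
    StructField("_c0", StringType(), True),       # the timestamp column from the question
    StructField("value1", IntegerType(), True),   # hypothetical measurement columns
    StructField("value2", IntegerType(), True),
])
df = spark.read.option("header", "true").schema(schema).csv("/path/to/inputDataSet.csv")
df.write.option("header", "true").csv("/path/to/result_with_time")   # one directory, multiple part files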

Apache Spark Multiple Aggregations

I am using Apache Spark in Scala to run aggregations on multiple columns of a dataframe, for example:
select column1, sum(1) as count from df group by column1
select column2, sum(1) as count from df group by column2
The actual aggregation is more complicated than just sum(1), but that's beside the point.
Query strings such as the examples above are compiled for each variable that I would like to aggregate, and I execute each string through a Spark SQL context to create a corresponding dataframe that represents the aggregation in question.
The nature of my problem is that I would have to do this for thousands of variables.
My understanding is that Spark will have to "read" the main dataframe each time it executes an aggregation.
Is there an alternative way to do this more efficiently?
Thanks for reading my question, and thanks in advance for any help.
Go ahead and cache the dataframe after you build it from your source data. Also, to avoid writing all the queries in the code, put them in a file and pass the file at run time; have something in your code that reads the file and runs each query. The best part of this approach is that you can change your queries by updating the file rather than the application. Just make sure you find a way to give each output a unique name.
In PySpark, it would look something like this.
dataframe = sqlContext.read.parquet("/path/to/file.parquet")
# do your manipulations/filters here
dataframe.cache()
dataframe.createOrReplaceTempView("df")  # so the queries in the file can refer to the data by name
queries = ...  # however you want to read/parse the query file
for i, query in enumerate(queries):
    output = sqlContext.sql(query)
    output.write.parquet("/path/to/output_{}.parquet".format(i))  # unique output name per query