How can I sample and show parquet files using fugue?

Say I have this file s3://some/path/some_partitioned_data.parquet.
I would like to sample a given count of rows and display them nicely, possibly in a Jupyter notebook.
some_partitioned_data.parquet could be very large, so I would like to do this without loading the data into memory, and even without downloading the parquet files to disk.

Spark doesn't let you sample an exact number of rows; you can only sample a given fraction. But with Fugue 0.8.0, this is a solution to get n rows:
import fugue.api as fa

# `spark` is an existing SparkSession; with the Spark engine the data stays
# distributed and only the small shown sample is ever brought locally
df = fa.load("parquetfile", engine=spark)
fa.show(fa.sample(df, frac=0.0001), n=10)
Just make sure that with your chosen frac there are still more than 10 rows.
You can use fa.head to get the sampled rows back as a dataframe instead of printing them.
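For instance, a minimal sketch (assuming the same df as above and that fa.head takes the dataframe and a row count, as in the 0.8 top-level API):
# keep 10 rows of the sample as a dataframe for further use, rather than printing
small = fa.head(fa.sample(df, frac=0.0001), 10)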
See the API reference at https://fugue.readthedocs.io/en/latest/top_api.html

Related

Save pyspark dataframe entries as separate html files on s3

So, if I have a list of file locations on s3, I can build a dataframe with a column containing the contents of each file in a separate row by doing the following (for example):
s3_path_list = list(df.select('path').toPandas()['path'])
df2 = spark.read.format("binaryFile").load(s3_path_list)
which returns:
df2: pyspark.sql.dataframe.DataFrame
path:string
modificationTime:timestamp
length:long
content:binary
What is the inverse of this operation?
Specifically... I have plotly generating html content stored as a string in an additional 'plot_string' column.
df3: pyspark.sql.dataframe.DataFrame
save_path:string
plot_string:string
How would I go about efficiently saving off each 'plot_string' entry as an html file at some s3 location specified in the 'save_path' column?
Clearly some form of df.write can be used to save the dataframe (bucketed or partitioned) as parquet, csv, a text table, etc., but I can't seem to find any straightforward way to perform a simple parallel write without a UDF that initializes a separate boto client for each file, which, for large datasets, is a bottleneck (as well as being inelegant). Any help is appreciated.
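One way to avoid a client per file, sketched below purely as an illustration (the foreachPartition approach, the bucket/key parsing, and the boto3 calls are assumptions, not something from this thread; df3, save_path and plot_string are the names described above):
def write_partition(rows):
    import boto3
    s3 = boto3.client("s3")  # one client per partition, not one per file
    for row in rows:
        # split "s3://bucket/key" into bucket and key (assumed path layout)
        bucket, _, key = row["save_path"].replace("s3://", "").partition("/")
        s3.put_object(Bucket=bucket, Key=key, Body=row["plot_string"].encode("utf-8"))

df3.select("save_path", "plot_string").foreachPartition(write_partition)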

SCALA: How to use collect function to get the latest modified entry from a dataframe?

I have a scala dataframe with two columns:
id: String
updated: Timestamp
From this dataframe I just want to get out the latest date, for which I use the following code at the moment:
df.agg(max("updated")).head()
// returns a row
I've just read about the collect() function, which I'm told is safer to use for such a problem: when the code runs as a job it appears not to aggregate the max over the whole dataset, even though it looks perfectly fine when running in a notebook. But I don't understand how collect() should be used.
I found an implementation like the following, but I could not figure out how it should be used:
df1.agg({"x": "max"}).collect()[0]
I tried it like the following:
df.agg(max("updated")).collect()(0)
Without (0) it returns an Array, which actually looks good. So the idea is that we should apply the aggregation on the whole dataset loaded in the driver, not just on the partitioned version, otherwise it seems not to retrieve all the timestamps. My question now is: how is collect() actually supposed to work in such a situation?
Thanks a lot in advance!
I'm assuming that you are talking about a Spark dataframe (not a plain Scala collection).
If you just want the latest date (only that column) you can do:
df.select(max("updated"))
You can see what's inside the dataframe with df.show(). Since dataframes are immutable, you need to assign the result of the select to another variable or chain the show() after the select().
This will return a dataframe with just one row holding the max value of the "updated" column.
To answer your question:
So the idea is that we should apply the aggregation on the whole dataset loaded in the driver, not just the partitioned version, otherwise it seems not to retrieve all the timestamps
When you select on a dataframe, Spark selects data from the whole dataset; there is not a partitioned version and a driver version. Spark shards your data across your cluster, and all the operations that you define are performed on the entire dataset.
My question now is, how is collect() actually supposed to work in such a situation?
The collect operation converts a Spark dataframe into an array (which is not distributed), and that array lives in the driver node. Bear in mind that if your dataframe size exceeds the memory available in the driver you will get an OutOfMemoryError.
In this case if you do:
df.select(max("Timestamp")).collect().head
Your DF (which contains only one row with one column, your date) will be converted to a Scala array. In this case it is safe because select(max()) returns just one row.
Take some time to read more about Spark dataframes/RDDs and the difference between transformations and actions.
It sounds weird. First of all, you don't need to collect the dataframe to get the last element of a sorted dataframe. There are many answers on this topic:
How to get the last row from DataFrame?

write dataframe to csv file took too much time to write spark

I want to aggregate data based on intervals on a timestamp column.
I saw that the computation takes 53 seconds, but writing the result to the CSV file takes 5 minutes. It seems like the csv write takes far too long.
How can I optimize the code, please?
Here is my code snippet :
val df = spark.read.option("header",true).option("inferSchema", "true").csv("C:\\dataSet.csv\\inputDataSet.csv")
//convert all column to numeric value in order to apply aggregation function
df.columns.map { c =>df.withColumn(c, col(c).cast("int")) }
//add a new column including the new timestamp column
val result2=df.withColumn("new_time",((unix_timestamp(col("_c0"))/300).cast("long") * 300).cast("timestamp")).drop("_c0")
val finalresult=result2.groupBy("new_time").agg(result2.drop("new_time").columns.map(mean(_)).head,result2.drop("new_time").columns.map(mean(_)).tail: _*).sort("new_time")
finalresult.coalesce(1).write.option("header", "true").csv("C:/result_with_time.csv") //<= this is what takes too much time to write
Here are some thoughts on optimization based on your code.
inferSchema: it will be faster to provide a predefined schema rather than using inferSchema.
Instead of writing to your local filesystem, you can try writing to HDFS and then copying the file to local with scp.
df.coalesce(1).write will take more time than just df.write, but you will get multiple files, which can be combined using different techniques; or you can simply leave them as multiple part files in one directory.

Incrementally adding to a Hive table w/Scala + Spark 1.3

Our cluster has Spark 1.3 and Hive.
There is a large Hive table that I need to add randomly selected rows to.
There is a smaller table that I read and check a condition against; if that condition is true, I grab the variables I need in order to query for the random rows to fill. What I did was run a query on that condition, table.where(value<number), then make it an array by using take(num rows). Then, since all of these rows contain the information I need on which random rows are needed from the large Hive table, I iterate through the array.
When I do the query I use ORDER BY RAND() in the query (using sqlContext). I created a var Hive table (to make it mutable), adding a column from the larger table. In the loop, I do a unionAll: newHiveTable = newHiveTable.unionAll(random_rows)
I have tried many different ways to do this, but am not sure what the best way is to avoid heavy CPU and temp disk use. I know that dataframes aren't intended for incremental adds.
One thing I am now thinking of trying is to create a csv file, write the random rows to that file incrementally inside the loop, then when the loop is finished load the csv file as a table and do one unionAll to get my final table.
Any feedback would be great. Thanks
I would recommend that you create an external table in Hive, defining the location, and then let Spark write the output as csv to that directory:
in Hive:
create external table test(key string, value string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
LOCATION '/SOME/HDFS/LOCATION'
And then from Spark, with the aid of https://github.com/databricks/spark-csv , write the dataframe to csv files, appending to the existing ones:
df.write.format("com.databricks.spark.csv").mode(SaveMode.Append).save("/SOME/HDFS/LOCATION/")

How to write csv file into one file by pyspark

I use this method to write a csv file. But it generates a directory with multiple part files. That is not what I want; I need it in one file. I also found another post using Scala to force everything to be computed on one partition, which then yields one file.
First question: how to achieve this in Python?
In the second post, it is also said that a Hadoop function could merge multiple files into one.
Second question: is it possible to merge two files in Spark?
You can use,
df.coalesce(1).write.csv('result.csv')
Note:
when you use the coalesce function you will lose your parallelism.
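If you also need a header and an exact output filename (coalesce(1) still writes a part-* file inside a directory), a minimal sketch for a local path, with illustrative paths, could be:
import glob, shutil

df.coalesce(1).write.option("header", "true").mode("overwrite").csv("output/result_dir")
# Spark leaves a single part-*.csv inside the directory; move it to the name you want
part_file = glob.glob("output/result_dir/part-*.csv")[0]
shutil.move(part_file, "output/result.csv")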
You can do this by using the cat command line function as below. This will concatenate all of the part files into 1 csv. There is no need to repartition down to 1 partition.
import os
test.write.csv('output/test')
os.system("cat output/test/p* > output/test.csv")
The requirement is to save an RDD in a single CSV file by bringing the RDD to one executor, meaning the RDD partitions spread across executors get shuffled to a single executor. We can use coalesce(1) or repartition(1) for this purpose. In addition, one can add a column header to the resulting csv file.
First we can keep a utility function to make the data CSV compatible.
def toCSVLine(data):
    return ','.join(str(d) for d in data)
Let's suppose MyRDD has five columns and needs 'ID', 'DT_KEY', 'Grade', 'Score', 'TRF_Age' as column headers. So I create a header RDD and union it with MyRDD as below, which most of the time keeps the header at the top of the csv file.
unionHeaderRDD = sc.parallelize([('ID', 'DT_KEY', 'Grade', 'Score', 'TRF_Age')]) \
    .union(MyRDD)
unionHeaderRDD.coalesce(1).map(toCSVLine).saveAsTextFile("MyFileLocation")
The RDD method saveAsPickleFile can be used to serialize the saved data in order to save space; use sc.pickleFile to read the pickled file back.
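A minimal sketch of that round trip, with an illustrative path:
MyRDD.saveAsPickleFile("MyPickleLocation")        # serialized, more compact than plain text
restoredRDD = sc.pickleFile("MyPickleLocation")   # read the pickled RDD back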
I needed my csv output in a single file, with headers, saved to an s3 bucket with the filename I provided. The currently accepted answer, when I run it (Spark 3.3.1 on a Databricks cluster), gives me a folder with the desired filename, and inside it there is one csv file (due to coalesce(1)) with a random name and no headers.
I found that sending it to pandas as an intermediate step provided just a single file with headers, exactly as expected.
my_spark_df.toPandas().to_csv('s3_csv_path.csv',index=False)
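A caveat, sketched with an illustrative bucket path: writing a pandas frame straight to an s3:// URL relies on s3fs being installed, and toPandas() pulls the whole dataframe to the driver, so this only suits data that fits in driver memory.
# assumes s3fs is available; storage_options is passed through to s3fs
my_spark_df.toPandas().to_csv(
    "s3://my-bucket/reports/result.csv",
    index=False,
    storage_options={"anon": False},  # or explicit credentials/profile settings
)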