Write parquet with partitionBy vs. just a loop [duplicate] - pyspark

This question already has answers here:
Overwrite only some partitions in a partitioned spark Dataset
(3 answers)
Closed 2 years ago.
Let's say I have a script which writes a parquet file every week into a FOLDER, partitioned by 2 columns: DAY and COUNTRY.
SOLUTION 1:
df.write.parquet(FOLDER, mode='overwrite',
partitionBy=['DAY', 'COUNTRY'])
The problem with this is that if you later want to rerun the script just for a specific country and date, because of corrupted data in that partition, it will delete the whole folder's contents and write data only for that specific day/country.
APPEND doesn't solve it either; it would just append the correct data on top of the wrong data.
What would be ideal is if the above command ONLY overwrote the DAY/COUNTRY combinations present in the df.
SOLUTION 2:
Make a loop:
for country in countries:
    for day in days:
        # write only this day/country slice to its own sub-folder
        (df.filter((df.DAY == day) & (df.COUNTRY == country))
           .write.parquet(f"{FOLDER}/{day}/{country}", mode='overwrite'))
This works, because if I run the script it only overwrites the files under the specific FOLDER/day/country, but it just feels so wrong. Is there a better alternative?

If you are using Spark 2.3 or above, you can create a partitioned table and set spark.sql.sources.partitionOverwriteMode to dynamic:
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
df.write.mode("overwrite").insertInto("yourtable")
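A minimal end-to-end sketch of that approach, under some assumptions: the table name weekly_data and the example columns/values are made up, df stands in for the weekly DataFrame, and fixed_df holds only the repaired DAY/COUNTRY slice.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Stand-in for the weekly DataFrame; table name and non-partition columns are hypothetical
df = spark.createDataFrame(
    [(100, "2023-01-02", "US"), (200, "2023-01-02", "DE")],
    ["amount", "DAY", "COUNTRY"])

# One-time setup: a parquet-backed table partitioned by DAY and COUNTRY
df.write.partitionBy("DAY", "COUNTRY").format("parquet").saveAsTable("weekly_data")

# Rerun with corrected data for a single DAY/COUNTRY slice: with dynamic mode,
# only that partition is overwritten and every other partition is left intact.
# insertInto matches columns by position, so keep the partition columns last.
fixed_df = spark.createDataFrame(
    [(250, "2023-01-02", "DE")],
    ["amount", "DAY", "COUNTRY"])
fixed_df.write.mode("overwrite").insertInto("weekly_data")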

Related

Write Spark Dataset to Excel File along with partitioning

I have a Dataset similar to the below structure:
col_A col_B date
1 5 2021-04-14
2 7 2021-04-14
3 5 2021-04-14
4 9 2021-04-14
I am trying to use the below code in Spark Java to write the dataset to a file in HDFS.
Dataset<Row> outputDataset; // This is a valid dataset and works flawlessly when written to csv
/*
some code which sets the outputDataset
*/
outputDataset
.repartition(1)
.write()
.partitionBy("date")
.format("com.crealytics.spark.excel")
.option("header", "true")
.save("/saveLoc/sales");
Normal Working Case:
When I use .format("csv"), the above code creates a folder named date=2021-04-14 under the path /saveLoc/sales passed to .save(), which is exactly as expected. The full path of the resulting file is /saveLoc/sales/date=2021-04-14/someFileName.csv. Also, the date column is removed from the file since it was partitioned on.
What I need to do:
However, when I use .format("com.crealytics.spark.excel"), it just creates a plain file called sales in the folder saveLoc and doesn't remove the partition column (date) from the resulting file. Does that mean it isn't partitioning on the column "date"? The full path of the file created is /saveLoc/sales. Please note that it overwrites the folder "sales" with a file named sales.
The Excel plugin used is described here: https://github.com/crealytics/spark-excel
How can I make it partition when writing to Excel? In other words, how can I make it behave exactly as it does in the csv case?
Versions used:
spark-excel: com.crealytics.spark-excel_2.11
scala: org.apache.spark.spark-core_2.11
Thanks.

process values of records [duplicate]

This question already has answers here:
Calculate average using Spark Scala
(4 answers)
Closed 2 years ago.
I am new to Spark and I couldn't find enough information to understand some things. I am trying to write pseudocode in Scala (like these examples: http://spark.apache.org/examples.html).
A file with data is given. Each line has some data: number, course name, credits, and mark.
123 Programming_1 10 75
123 History 5 80
I am trying to compute the average for each student (number). The average is the sum of credits*mark over every course a student took, divided by the sum of credits over those courses (for the sample above: (10*75 + 5*80) / (10 + 5) ≈ 76.7), ignoring any line that has mark == NULL. Suppose that I have a function parseData(line) which turns a line of strings into a record with 4 members: number, coursename, credits, mark.
What I have tried so far:
data = spark.textFile("hdfs://…")
line = data.filter(mark => mark != null)
line = line.map(line => parseData(line))
data = parallelize(List(line))
groupkey = data.groupByKey()
((a, b, c) => (a, sum(mul(b, c)) / sum(b))
But I don't know how to read the specific values and use them to produce the average for each student. Is it possible to use an array?
Once you filter and get the dataframe, you could use something like this:
df.withColumn("product",col("credits")*col("marks"))
.groupBy(col("student"))
.agg(sum("credits").as("sumCredits"),sum("product").as("sumProduct"))
.withColumn("average",col("sumProduct")/col("sumCredits"))
Hope this helps!!

How to increase display length in pg admin tool [duplicate]

This question already has answers here:
pgAdmin III Why query results are shortened?
(2 answers)
Closed 6 years ago.
I have a dumb problem. Basically I just upgraded from pgsql 8.4 to 9.1 and upgraded to pgAdmin 1.20.
I have some tables that have large text fields and in the previous query tool I could query a row and copy-paste the data out of it to modify. In this case, I had a table that stored queries that I could run.
Once I upgraded to the new pgAdmin version, when I use the tool and query a row to pull out the text from a field in that row, it truncates the result and ends with an ellipsis (...).
I tried figuring out how to increase the mem on this so it doesn't truncate after 100 characters or so but couldn't.
Anybody have any ideas??
In the pgAdmin options, you can change the displayed length of the field. Do the following:
Go to:
File > Options > Query Tool > Max. characters per column
By default it is 256, you can increase it accordingly.
Hope this helps
Marlon Abeykoon's answer is good, but if you want a one-off output and don't want to change settings, then simply output to a file (two buttons along from the usual green 'go' arrow). This saves the entire output in a csv file.

Executing query in chunks on Greenplum

I am trying to create a way to convert bulk date queries into incremental queries. For example, suppose a query has a WHERE condition specified as
WHERE date > now()::date - interval '365 days' and date < now()::date
This will fetch a year's worth of data if executed today. If the same query is executed tomorrow, 365 days of data will be fetched again. However, I already have the last 364 days of data from the previous run. I just want a single day's data to be fetched and a single day's data to be deleted from the system, so that I end up with 365 days of data and better performance. This data is to be stored in a separate temp table.
To achieve this, I create an incremental query which will be executed in the next run. However, deleting the single day's data is proving tricky when the "date" column does not appear in the SELECT clause but does appear in the WHERE condition, since the temp table schema will not have the "date" column.
So I thought of executing the bulk query in chunks and assigning an ID to each chunk. This way, I can delete a chunk and add a chunk, and the other data remains unaffected.
Is there a way to achieve this in Postgres or Greenplum, such as some built-in functionality? I went through the whole documentation but could not find any.
If not, is there any better solution to this problem?
I think this is best handled with something like an aggregates table (I assume the issue is that you have heavy aggregates to compute over a lot of data). This doesn't necessarily cause normalization problems (and data warehouses often denormalize anyway). The aggregates you need can be stored per day, so you cut down to one record per day of closed data, plus the non-closed data. Keeping the aggregates to data which cannot change is what is required to avoid the usual insert/update anomalies that normalization prevents.

Read contents from txt file using T-SQL [duplicate]

This question already has answers here:
SQL Server File Operations?
Closed 10 years ago.
Is there any chance I could use T-SQL to read the first line of a txt file?
Actually, I have a csv file and the first line contains the names of all of its hundreds of columns. I have already coded the part where I use that first line to generate a table with all those columns. So I really want to figure out how to do the reading part.
You could look at the BULK INSERT statement.