How to convert multiple CSV files to Parquet using Scala?

I am trying to convert multiple CSV files to Parquet format, and I need to save each output file under its original file name.
import org.apache.spark.sql.SaveMode

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("../Downloads/*.csv")

df.write.mode(SaveMode.Overwrite).parquet("/tmp/")
What should I change in the write step so that each Parquet output keeps the original CSV file name?
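One possible approach (a sketch, not a tested answer): list the CSV files yourself with Hadoop's FileSystem API and convert them one at a time, using each file's base name as the Parquet output name. The inputDir/outputDir values and the loop below are illustrative; only sqlContext, the CSV reader options, and the paths from the question are taken as given.

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode

// Sketch: one Parquet output per input CSV, named after the original file.
val inputDir = "../Downloads"   // illustrative paths, adjust as needed
val outputDir = "/tmp"

val fs = FileSystem.get(sqlContext.sparkContext.hadoopConfiguration)
val csvFiles = fs.listStatus(new Path(inputDir))
  .map(_.getPath)
  .filter(_.getName.endsWith(".csv"))

csvFiles.foreach { p =>
  val name = p.getName.stripSuffix(".csv")
  sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .load(p.toString)
    .write
    .mode(SaveMode.Overwrite)
    .parquet(s"$outputDir/$name") // each output is named after the original CSV
}

Note that Spark writes each Parquet output as a directory of part files, so /tmp/<name> will be a folder rather than a single file.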

Related

How to read an Excel file with polars without a header?

I want to read an Excel file in polars without treating the first row as the header. When I use df = pl.read_excel(file_path, sheet_name=0) I get a dataframe whose column names are taken from the first row, but I don't want to use that row as the header. I need to drop the first 3 rows and then use the row that is then at the top as the header of the dataframe. How can I do this?
This will make polars ignore the header row of the Excel file:
df = pl.read_excel(filepath, sheet_name=0, read_csv_options={"has_header": False})

PySpark DataFrame to CSV file with GML format

Is there a way to quickly save a DataFrame to a CSV file that I can then modify so that it ends up in GML format?
My strategy, for now, is to save the file as a standard CSV file and then modify that file.
I appreciate any help you can provide.

Convert into a pandas dataframe after finding missing values in a Spark dataframe

I am utilizing the following to find missing values in my spark df:
from pyspark.sql.functions import col, sum
df.select(*(sum(col(c).isNull().cast("int")).alias(c) for c in df.columns)).show()
Here is my sample Spark df:
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [
    ("James", "CA", np.nan), ("Julia", "", None),
    ("Ram", None, 200.0), ("Ramya", "NULL", np.nan)
]
df = spark.createDataFrame(data, ["name", "state", "number"])
df.show()
How can I convert the result of the missing-count line above into a pandas dataframe? My real df has 26 columns, and showing it as a Spark df is messy and misaligned.
This might not be as clean as an actual rendered pandas table, but hopefully it works for you:
From your first snippet, remove the .show() call:
df.select(*(sum(col(c).isNull().cast("int")).alias(c) for c in df.columns))
Assign that expression to a variable and then call toPandas() on it:
sdf = df.select(*(sum(col(c).isNull().cast("int")).alias(c) for c in df.columns))
new_df = sdf.toPandas().T
print(new_df)
The .T call transposes the dataframe. With many columns, the un-transposed output gets truncated and you cannot see them all.
Again, this does not render as an actual table, but at least it is more readable than a Spark df.
UPDATE:
If you prefer the rendered table look, you can wrap the last variable in a pandas DataFrame (for example as the final expression of a notebook cell). There may be a more efficient way to do this, but so far this one works.
import pandas as pd
pd.DataFrame(new_df)

Write Spark Dataset to Excel File along with partitioning

I have a Dataset similar to the below structure:
col_A  col_B  date
1      5      2021-04-14
2      7      2021-04-14
3      5      2021-04-14
4      9      2021-04-14
I am trying to use the below code in Spark Java to write the dataset to a file in HDFS.
Dataset<Row> outputDataset; // This is a valid dataset and works flawlessly when written to csv
/*
some code which sets the outputDataset
*/
outputDataset
.repartition(1)
.write()
.partitionBy("date")
.format("com.crealytics.spark.excel")
.option("header", "true")
.save("/saveLoc/sales");
Normal Working Case:
When I use .format("csv"), the above code creates a folder named date=2021-04-14 in the path /saveLoc/sales that is passed to .save(), which is exactly as expected. The full path of the end file is /saveLoc/sales/date=2021-04-14/someFileName.csv. Also, the date column is removed from the file since it was partitioned on.
What I need to do:
However, when I use .format("com.crealytics.spark.excel"), it just creates a plain file called sales in the folder saveLoc and doesn't remove the partitioned (date) column from the end file. Does that mean it isn't partitioning on the column "date"? The full path of the file created is /saveLoc/sales. Please note that it overwrites the folder "sales" with a file sales.
The Excel plugin used is described here: https://github.com/crealytics/spark-excel
How can I make it partition when writing to Excel? In other words, how can I make it behave exactly as it does in the CSV case?
Versions used:
spark-excel: com.crealytics.spark-excel_2.11
scala: org.apache.spark.spark-core_2.11
Thanks.
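One workaround worth trying, sketched under the assumption that the spark-excel data source simply ignores partitionBy: drive the partitioning yourself by collecting the distinct date values and writing one Excel output per value into a date=<value> sub-folder, dropping the date column to mimic what partitionBy does. The helper below is written in Scala; only outputDataset, the format and options, and the /saveLoc/sales path come from the question.

import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.functions.col

// Sketch: emulate partitionBy("date") for the spark-excel writer by
// filtering on each distinct date value and writing one file per value.
def writePartitionedExcel(outputDataset: Dataset[Row], basePath: String): Unit = {
  val dates = outputDataset.select("date").distinct().collect().map(_.get(0).toString)

  dates.foreach { d =>
    outputDataset
      .filter(col("date") === d)
      .drop("date")                 // partitionBy drops the partition column; do the same
      .repartition(1)
      .write
      .format("com.crealytics.spark.excel")
      .option("header", "true")
      .mode("overwrite")
      .save(s"$basePath/date=$d")
  }
}

// e.g. writePartitionedExcel(outputDataset, "/saveLoc/sales")

The same loop translates directly to the Java API; the trade-off is one Spark job per distinct date value, which is only practical when the number of partitions is modest.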

Appending columns to pyspark dataframe

I'd like to append columns from one pyspark dataframe to another.
In pandas, the command would look like
df1 = pd.DataFrame({'x':['a','b','c']})
df2 = pd.DataFrame({'y':[1,2,3]})
pd.concat((df1, df2), axis = 1)
Is there a way to accomplish this in pyspark? All I can find is either concatenating the contents of a column or doing a join.
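One common workaround, sketched here rather than taken from the question: Spark dataframes have no positional index, so give both dataframes one with row_number over monotonically_increasing_id and then join on it. The sketch below uses Spark's Scala API; monotonically_increasing_id, row_number, Window and join all have direct PySpark equivalents.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

// Sketch: emulate pandas' pd.concat(axis=1) by joining on a generated row index.
def concatColumns(df1: DataFrame, df2: DataFrame): DataFrame = {
  val w = Window.orderBy(monotonically_increasing_id())
  val left  = df1.withColumn("_row_id", row_number().over(w))
  val right = df2.withColumn("_row_id", row_number().over(w))
  left.join(right, Seq("_row_id")).drop("_row_id")
}

Note that a window without partitionBy pulls all rows into a single partition, so this is only sensible for dataframes of modest size, and it assumes both inputs have the same number of rows in a meaningful order.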