How to find widths of a Flat File using read_fwf() in Pandas? - pyspark

I have downloaded some data from the mainframe (.DATA format) and I need to parse it to create a PySpark DataFrame and perform some operations on it. Before doing that, I created a sample file and read it using the read_fwf() function of Pandas.
I was able to read the file and create the DataFrame, but I encountered some problems, such as:
Padding of "0" in the first column of some of the rows
Repeating headers while reading the data
These are issues I can handle; the key challenge I am facing is identifying the widths of the columns. I currently have 65 columns, and in order to create a PySpark DataFrame I need to know the widths of these columns. Can read_fwf() tell me what widths it is using for each column?
Is there a read_fwf()-like function in PySpark, or would we have to write MapReduce code for it?
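As far as I know, read_fwf() does not hand back the widths it inferred on the resulting DataFrame; with the default colspecs='infer' it guesses the column boundaries from the whitespace in the first rows of the file (the infer_nrows parameter). The plain-Python sketch below imitates that idea so the spans can be inspected. It is not pandas' actual implementation, and the sample layout (fmt), the helper names infer_colspecs and split_fixed, and the three-column data are all made up for illustration.

```python
from typing import List, Tuple


def infer_colspecs(lines: List[str]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) spans where at least one line has a non-space character."""
    width = max(len(line) for line in lines)
    used = [False] * width
    for line in lines:
        for i, ch in enumerate(line):
            if ch != " ":
                used[i] = True

    specs = []
    start = None
    for i, flag in enumerate(used):
        if flag and start is None:
            start = i                      # a run of used positions begins
        elif not flag and start is not None:
            specs.append((start, i))       # the run ended at a blank position
            start = None
    if start is not None:
        specs.append((start, width))       # last run reaches the end of the line
    return specs


def split_fixed(line: str, specs: List[Tuple[int, int]]) -> List[str]:
    """Cut one line into fields using the (start, end) spans."""
    return [line[a:b].strip() for a, b in specs]


# Hypothetical layout: fields of width 6, 10 and 6 (the last one right-aligned).
fmt = "{:<6}{:<10}{:>6}"
sample = [
    fmt.format("ID", "NAME", "AMOUNT"),
    fmt.format("00012", "ALICE", "10.50"),
    fmt.format("00345", "BOB", "200.00"),
]

specs = infer_colspecs(sample)
widths = [end - start for start, end in specs]
rows = [split_fixed(line, specs) for line in sample[1:]]  # skip the header line
```

The inferred spans cover only positions where some row has a character, so they can be narrower than the declared field widths (here 5, 5 and 6 against a 6/10/6 layout). If a copybook or record layout is available for the mainframe file, that is the more reliable source. PySpark has no built-in read_fwf() that I am aware of; one common pattern is to load the file with spark.read.text() and cut the value column with pyspark.sql.functions.substring(), which takes a 1-based start position and a length, so a (start, end) span becomes substring("value", start + 1, end - start).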

Related

Dataframe display function in pyspark on databricks platform

I am new to Databricks and was studying the DataFrame topic in PySpark.
df = spark.read.parquet(salesPath)
display(df)
Above is my code. I don't understand what the up arrows actually do.
And why is this beautiful df.display not included in the Apache PySpark documentation?
The arrows sort the displayed portion of the DataFrame. Note that the display function shows at most 1000 records and won't load the whole dataset.
The display function isn't included in the PySpark documentation because it's specific to Databricks. Similar functions also exist in Jupyter that you can use with PySpark, but they are not part of PySpark. (You can use the df.show() function to display the data as a text table; it's part of PySpark's DataFrame API.)

How to add trailer/footer in csv dataframe azure blob pyspark

I have a solution that goes like this:
df1 --> DataFrame 1, with 50 columns of data
df2 --> DataFrame 2, with a footer/trailer of 3 columns: Trailer, count of rows, date
So I added the remaining 47 columns as "","","" and so on,
so that I can union the 2 DataFrames:
df3=df1.union(df2)
Now, if I want to save:
df3.coalesce(1).write.format("com.databricks.spark.csv")\
.option("header","true").mode("overwrite")\
.save(output_blob_path);
So now I am getting the footer as well, like this: Trailer,400,20210805,"","","","","","","" and so on.
Can anyone suggest how to remove these ,"","","",... double quotes from the last row when I save this file to the blob container?
It would be very helpful.
You can define the structure of the DataFrame so that the entire row is treated as a single column for both files, and then perform the union. That way you don't need to add extra columns to DataFrame 2 and then get stuck in the tricky situation of removing them after the union.
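To make the single-column idea concrete, here is a minimal plain-Python illustration (not PySpark): every record, including the trailer, is serialized to one string before the pieces are combined, so the trailer never needs padding columns. The column names, the sample rows and the helper to_csv_line are hypothetical.

```python
import csv
import io


def to_csv_line(fields):
    """Serialize one record to a single CSV-formatted string (no line ending)."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="").writerow(fields)
    return buf.getvalue()


# Hypothetical data: a 3-column body stands in for the 50-column DataFrame.
header = ["id", "name", "amount"]
data_rows = [[1, "alice", "10.50"], [2, "bob, jr", "200.00"]]
trailer = ["Trailer", len(data_rows), "20210805"]

# Each element of `lines` is one "row as a single column".
lines = [to_csv_line(header)]
lines += [to_csv_line(row) for row in data_rows]
lines.append(to_csv_line(trailer))      # trailer keeps its own 3 fields

output = "\n".join(lines) + "\n"
```

In PySpark the equivalent would presumably be to build the single string column with something like concat_ws(",", ...) on both DataFrames, union them, and write the result with the text writer instead of the CSV writer. Treat that as an assumption to verify, since concat_ws does not quote values and skips nulls.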

How to populate a Spark DataFrame column based on another column's value?

I have a use case where I need to select certain columns from a DataFrame containing at least 30 columns and millions of rows.
I'm loading this data from a cassandra table using scala and apache-spark.
I selected the required columns using: df.select("col1","col2","col3","col4")
Now I have to perform a basic groupBy operation to group the data according to src_ip,src_port,dst_ip,dst_port and I also want to have the latest value from a received_time column of the original dataframe.
I want a dataframe with distinct src_ip values with their count and latest received_time in a new column as last_seen.
I know how to use .withColumn and also, I think that .map() can be used here.
Since I'm relatively new in this field, I really don't know how to proceed further. I could really use your help to get done with this task.
Assuming you have a dataframe df with src_ip,src_port,dst_ip,dst_port and received_time, you can try:
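import org.apache.spark.sql.functions.{col, count, max}

// Group by the four connection columns, then count rows and take the latest timestamp per group.
val mydf = df
  .groupBy(col("src_ip"), col("src_port"), col("dst_ip"), col("dst_port"))
  .agg(
    count("received_time").as("row_count"),
    max(col("received_time")).as("max_received_time")
  )
The code above calculates, for each combination of the group-by columns, the number of received_time values (row_count) and the latest received_time (max_received_time).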

how to load specific row and column from an excel sheet through pyspark to HIVE table?

I have an excel file having 4 worksheets. Each worksheet has first 3 rows as blank, i.e. the data starts from row number 4 and that continues for thousands of rows further.
Note: As per the requirement I am not supposed to delete the blank rows.
My goals are below
1) read the excel file in spark 2.1
2) ignore the first 3 rows, and read the data from 4th row to row number 50. The file has more than 2000 rows.
3) convert all the worksheets from the excel to separate CSV, and load them to existing HIVE tables.
Note: I have the flexibility of writing separate code for each worksheet.
How can I achieve this?
I can create a Df to read a single file and load it to HIVE. But I guess my requirement would need more than that.
You could for instance use the HadoopOffice library (https://github.com/ZuInnoTe/hadoopoffice/wiki).
There you have the following options:
1) use Hive directly to read the Excel files and to CTAS to a table in CSV format
You would need to deploy the HadoopOffice Excel Serde
https://github.com/ZuInnoTe/hadoopoffice/wiki/Hive-Serde
Then you need to create the table (see the documentation for all the options; the example reads from Sheet1 and skips the first 3 lines):
create external table ExcelTable(<INSERTHEREYOURCOLUMNSPECIFICATION>)
ROW FORMAT SERDE 'org.zuinnote.hadoop.excel.hive.serde.ExcelSerde'
STORED AS INPUTFORMAT 'org.zuinnote.hadoop.office.format.mapred.ExcelFileInputFormat'
OUTPUTFORMAT 'org.zuinnote.hadoop.excel.hive.outputformat.HiveExcelRowFileOutputFormat'
LOCATION '/user/office/files'
TBLPROPERTIES(
"hadoopoffice.read.simple.decimalFormat"="US",
"hadoopoffice.read.sheet.skiplines.num"="3",
"hadoopoffice.read.sheet.skiplines.allsheets"="true",
"hadoopoffice.read.sheets"="Sheet1",
"hadoopoffice.read.locale.bcp47"="US",
"hadoopoffice.write.locale.bcp47"="US");
Then do CTAS into a CSV format table:
create table CSVTable ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' AS Select * from ExcelTable;
2) use Spark
Depending on the Spark version you have different options:
for Spark 1.x you can use the HadoopOffice fileformat and for Spark 2.x the Spark2 DataSource (the latter would also include support for Python). See howtos here

Data Conversion Failed SQL

I am using the import and export wizard and imported a large csv file. I get the following error.
Error 0xc02020a1: Data Flow Task 1: Data conversion failed. The data
conversion for column "firms" returned status value 2 and status text "The
value could not be converted because of a potential loss of data.".
(SQL Server Import and Export Wizard)
When importing, I use the Advanced tab and make all of the adjustments. For the field in question, I set it as numeric (8,0). I have since gone through this process multiple times and tried 7, 8, 9, 10, and 11, to no avail. I imported the CSV into Excel and looked at the respective column, firms. It shows no entry with more than 5 characters. I thought about making it DT_String, but I will eventually need to manipulate that column by averaging it. I have also searched for spaces or strange characters and found none.
Any other ideas?
1) Try changing the numeric precision to numeric(30,20) in both the source and the destination table.
2) Change the data type to str/wstr and adjust the output column width while importing. It will run fine. The same thing happened to me while loading a large CSV file of approximately 5 GB. After the load, use the TRY_CONVERT function to convert the column back to numeric and check which values became NULL during the conversion; that will show you the root cause.
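If it is easier to check the file before loading it, the same "find the values that fail conversion" idea can be sketched outside SQL Server. The snippet below is a plain-Python stand-in, not the TRY_CONVERT approach itself; the function name find_bad_values and the sample data are hypothetical, and only the column name firms comes from the question.

```python
import csv
import io
from decimal import Decimal, InvalidOperation


def find_bad_values(csv_text, column):
    """Return (record_number, raw_value) for values that do not parse as a decimal number."""
    bad = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Record 1 is the header, so data records start at 2.
    for record_no, row in enumerate(reader, start=2):
        raw = row[column]
        try:
            Decimal(raw.strip())
        except InvalidOperation:
            bad.append((record_no, raw))
    return bad


# Hypothetical sample: an empty value, a text marker and a thousands separator.
sample = 'id,firms\n1,12345\n2,\n3,N/A\n4,"1,234"\n'
bad = find_bad_values(sample, "firms")
```

For a real file, the text would come from open(path, newline="") rather than a string. Note that Python's Decimal accepts some inputs a SQL numeric column may not (exponent notation, for example), so this is only a first pass.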