How to load specific rows and columns from an Excel sheet into a HIVE table through PySpark? - pyspark

I have an Excel file with 4 worksheets. In each worksheet the first 3 rows are blank, i.e. the data starts from row 4 and continues for thousands of rows.
Note: as per the requirement, I am not supposed to delete the blank rows.
My goals are:
1) Read the Excel file in Spark 2.1.
2) Ignore the first 3 rows and read the data from row 4 to row 50. The file has more than 2,000 rows.
3) Convert each worksheet in the Excel file to a separate CSV and load them into existing HIVE tables.
Note: I have the flexibility of writing separate code for each worksheet.
How can I achieve this?
I can create a DataFrame to read a single file and load it into HIVE, but I guess my requirement needs more than that.

You could for instance use the HadoopOffice library (https://github.com/ZuInnoTe/hadoopoffice/wiki).
There you have the following options:
1) Use Hive directly to read the Excel files and CTAS into a table in CSV format.
You would need to deploy the HadoopOffice Excel Serde:
https://github.com/ZuInnoTe/hadoopoffice/wiki/Hive-Serde
Then you need to create the table (see the documentation for all the options; the example reads from Sheet1 and skips the first 3 lines):
create external table ExcelTable(<INSERTHEREYOURCOLUMNSPECIFICATION>)
ROW FORMAT SERDE 'org.zuinnote.hadoop.excel.hive.serde.ExcelSerde'
STORED AS
  INPUTFORMAT 'org.zuinnote.hadoop.office.format.mapred.ExcelFileInputFormat'
  OUTPUTFORMAT 'org.zuinnote.hadoop.excel.hive.outputformat.HiveExcelRowFileOutputFormat'
LOCATION '/user/office/files'
TBLPROPERTIES(
  "hadoopoffice.read.simple.decimalFormat"="US",
  "hadoopoffice.read.sheet.skiplines.num"="3",
  "hadoopoffice.read.sheet.skiplines.allsheets"="true",
  "hadoopoffice.read.sheets"="Sheet1",
  "hadoopoffice.read.locale.bcp47"="US",
  "hadoopoffice.write.locale.bcp47"="US"
);
Then do CTAS into a CSV format table:
create table CSVTable
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
AS SELECT * FROM ExcelTable;
2) Use Spark.
Depending on the Spark version you have different options:
For Spark 1.x you can use the HadoopOffice FileFormat, and for Spark 2.x the Spark2 DataSource (the latter also includes support for Python). See the how-tos in the HadoopOffice wiki linked above.
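For Spark 2.x with Python, a minimal sketch could look like the one below. It assumes the spark-hadoopoffice-ds package is on the classpath (e.g. via --packages), and the option keys are assumed to mirror the hadoopoffice.* properties from the Hive example above with the prefix dropped; please verify them against the HadoopOffice wiki for your version. The file path and Hive table name are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("excel-to-hive").getOrCreate()

# Read one worksheet, skipping the 3 blank rows at the top.
# Option names are assumptions based on the TBLPROPERTIES in the Hive example.
df = (spark.read
      .format("org.zuinnote.spark.office.excel")
      .option("read.locale.bcp47", "US")
      .option("read.spark.simpleMode", "true")           # assumption: plain types instead of cell structs
      .option("read.sheets", "Sheet1")                   # one sheet per run, as the question allows
      .option("read.sheet.skiplines.num", "3")           # skip the 3 blank rows
      .option("read.sheet.skiplines.allsheets", "true")
      .load("/user/office/files/myfile.xlsx"))           # hypothetical path

# Keep only rows 4 to 50 of the sheet, i.e. the first 47 rows after the skip.
subset = df.limit(47)

# Write a CSV copy and append into an existing Hive table (hypothetical name).
subset.write.mode("overwrite").option("header", "true").csv("/tmp/sheet1_csv")
subset.write.mode("append").insertInto("my_hive_db.my_existing_table")

Since you have the flexibility of writing separate code per worksheet, the same block can be repeated with a different "read.sheets" value and target table for each of the 4 sheets.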

Related

Is it possible to generate a space-separated header row using the Data Factory copy activity?

I am using Azure SQL as the source dataset and a delimited file as the sink dataset in the copy activity.
I tried the copy activity, but 'First row as header' gives comma-separated headers.
Is there a way to change the header output style?
Please note the spacing is unequal (h3...h4).
In this repro, I tried to give
1 space between the 1st and 2nd columns,
2 spaces between the 2nd and 3rd columns,
3 spaces between the 3rd and 4th columns.
Also, I tried to give the same column name to column2 and column3. The approach is as follows.
Data is copied from the Azure SQL database to the data lake in comma-delimited format as a staging file.
This staging file is taken as the source in a Data flow activity.
In the source dataset, 'First row as header' is not checked.
Data preview of the source transformation:
A derived column transformation is added to change the header values of column2 and column3.
In this case, 'date_col' in column1 marks the header row. So, when column1 equals 'date_col', the column2 and column3 data is replaced with the same column name.
column_2 = iif(Column_1=='date_col','ECIX',Column_2);
column_3 = iif(Column_1=='date_col','ECIX',Column_3);
Again, a derived column transformation is added to concatenate all the columns with spaces. The column name is given as concat. The value for this column is
concat(Column_1,' ',Column_2,' ',Column_3,' ',Column_4)
A select transformation is added, and only the concat column is selected here.
In the sink, a new delimited file is added as the sink dataset. In the sink dataset also, 'First row as header' is not checked.
Output file screenshot:
After the pipeline is run, the target file looks like this.
Keeping the source as Azure SQL itself in the data flow, I created a single derived column 'OUTDC' and added all the columns from the source like this:
(h1)+' '+(h2)+' '+(h3)
Then I fed OUTDC to a delimited sink and set the Headers option to a single string like this:
['h1 h2 h3']

ADF Add Header to CSV Sink

Does anyone know how to add a header to a CSV sink? I have a data flow whose source is a database table. I have used a derived column and concatenated the columns to make one column, with the data in that column separated by commas (done in the source via a query). I have then selected the concatenated column to be exported to CSV.
Data example:
Matt,Smith,10
Therefore I technically only have one column; however, I want to add a header for each section of the data.
Desired output:
FirstName,LastName,Age
Matt,Smith,10
You can add headers in the CSV file.
Select the Data Flow activity.
Select the source and add a select transformation.
Add the column names as shown in the screenshot below.
Finally, add the sink and run the pipeline.

Handling delimited files in Azure Data Factory

I have a very large table with around 28 columns and 900k records.
I converted it to a CSV file (pipe-separated) and then tried to use that file to feed another table using ADF itself.
When I tried to use that file, it kept triggering an error about a column datatype mismatch.
Digging further into the data, I found a few rows with the pipe (|) symbol in their text itself. So, when converting it back, the text after the pipe is treated as belonging to the next column, hence the error.
So how do I handle the conversion to CSV efficiently when some columns contain text with the delimiter in it?
Option 1: If possible, I would suggest changing the delimiter to something other than pipe (|), as the column values also contain pipes in their text.
Option 2: In the CSV dataset, select a quote character so that columns are identified correctly even when a value contains the delimiter; see the PySpark sketch after the steps below for the same idea.
Step 1: Copy data from table1 to CSV.
Source:
Sink CSV dataset:
Output:
Step 2: Load the same CSV data into table2 with a copy activity.
CSV output file of Step 1:
Source CSV dataset:
Sink dataset:
Output:
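For reference, the same quote-character idea applies when such a file is read back with PySpark rather than a copy activity. This is only an illustration, and the file path below is hypothetical; it works only if the writer wrapped values containing the pipe in quotes, which is what selecting a quote character in the CSV dataset achieves.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pipe-csv").getOrCreate()

df = (spark.read
      .option("header", "true")
      .option("sep", "|")        # pipe as the field delimiter
      .option("quote", '"')      # values containing "|" are expected to be quoted
      .option("escape", '"')     # doubled quotes inside quoted values
      .csv("/data/table1_export.csv"))   # hypothetical path

df.show(5, truncate=False)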

How to find widths of a Flat File using read_fwf() in Pandas?

I have downloaded some data from the mainframe (.DATA format) and I need to parse it to create a PySpark DataFrame and perform some operations on it. Before doing that, I created a sample file and read it using the read_fwf() function of Pandas.
I was able to read it and create the DataFrame, but I encountered some problems such as:
padding of "0" in the first column of some of the rows,
repeating headers while reading the data.
These are issues I can handle; however, the key challenge I am facing is identifying the widths of the columns. I currently have 65 columns, but in order to create a PySpark DataFrame I need to know the widths of these columns. Can read_fwf() tell me what widths it is using for each column?
And is there a read_fwf()-like function in PySpark? Or would we have to write MapReduce code for it?
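There is no built-in read_fwf() equivalent in the PySpark DataFrame API, and no MapReduce code is needed either: a common approach is to read the file as plain text and slice each line with substring once the offsets are known, e.g. from the mainframe copybook, or by validating explicit widths on a sample with pandas.read_fwf. The sketch below is illustrative only; the file path, column names, and offsets are hypothetical.

# Optional: validate a hypothetical layout on a small sample with pandas first, e.g.
#   pd.read_fwf("sample.DATA", widths=[10, 25, 12], names=["acct_id", "name", "balance"])
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("fwf-example").getOrCreate()

# Hypothetical layout: (column name, 1-based start position, length)
layout = [("acct_id", 1, 10), ("name", 11, 25), ("balance", 36, 12)]

# Each line of the file becomes one row with a single 'value' column.
raw = spark.read.text("/data/sample.DATA")

# Slice each fixed-width field out of the line.
df = raw.select([
    F.substring(F.col("value"), start, length).alias(name)
    for name, start, length in layout
])
df.show(5, truncate=False)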

Excel 2010 - Pivot using external csv file - how to make dates work?

I have a set of pivot tables that use external csv files as their data sources. The csv files originally contained dates in the format dd/mm/yy (e.g. 31/01/13). The pivot tables did not recognise these as dates. I converted the dates in the csv files to dd/mm/yyyy (e.g. 31/01/2013) but these were still not recognised as dates by the pivot tables.
I tried setting up a calculated field =DATEVALUE(date_from_csv) but when used in the pivot table (I'm using the Max option to select the most recent date) I get #VALUE! errors.
I have tried converting the csv file to xlsx and also importing the data into the workbook that contains the pivot table - but I can't change from the external connection to use the internal data. I don't want to rebuild the pivots as there are a lot of variables and formatting that would take ages to redo.
Any ideas??
The problem was caused by the date column being blank for some rows. I found that if I moved a row that had all the fields filled in to the top (just after the header line), then Excel got the formats correct and the pivot tables now work!