How to add trailer/footer in csv dataframe azure blob pyspark - pyspark

I have a solution that goes like this:
df1 --> DataFrame 1, with 50 columns of data
df2 --> DataFrame 2, holding the footer/trailer: 3 columns of data (Trailer, count of rows, date)
So I added the remaining 47 columns as "", "", "", ... and so on,
so that I can union the 2 DataFrames:
df3=df1.union(df2)
Now, when I want to save:
df3.coalesce(1).write.format("com.databricks.spark.csv")\
.option("header","true").mode("overwrite")\
.save(output_blob_path);
With this I get the footer as well, but it looks like this:
Trailer,400,20210805,"","","","","","","".. and so on
Can anyone suggest how to remove the trailing ,"","","",.. empty quoted fields from the last row
when I save this file to the blob container?
It would be very helpful.

You can try defining the structure of the DataFrame so that the entire row is treated as a single column for both DataFrames, and then perform the union. This way you don't need to add extra columns to DataFrame 2 and then get stuck in the tricky situation of removing those extra columns after the union.
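A minimal sketch of the single-column idea in plain Python, so it runs without a cluster. The column names, sample rows and the `build_lines` helper are hypothetical. In Spark, one way to apply the same idea would presumably be to build the single string column with `concat_ws` and save it with the text writer instead of the CSV writer; that mapping is an assumption, not something the answer spells out.

```python
def build_lines(columns, rows, trailer_fields, sep=","):
    """Render header, data rows and trailer as single-string lines.

    Every line is one "column", so the trailer needs no padding fields.
    Note: no CSV quoting or escaping is applied in this sketch.
    """
    lines = [sep.join(columns)]
    lines += [sep.join(str(v) for v in row) for row in rows]
    lines.append(sep.join(str(v) for v in trailer_fields))
    return lines


# Hypothetical stand-ins for the 50-column DataFrame in the question.
columns = ["id", "name", "country"]
rows = [(1, "michael", "usa"), (2, "sara", "germany")]

lines = build_lines(columns, rows, ["Trailer", len(rows), "20210805"])
print("\n".join(lines))
# Last line printed: Trailer,2,20210805
```

Because the trailer is just another line of text, it carries only its own three fields and no empty quoted placeholders.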

Related

Is it possible to generate the space separated header row using data factory copy activity?

I am using Azure SQL as the source dataset and a delimited file as the sink dataset in the copy activity.
I tried the copy activity, but "First row as header" gives comma-separated headers.
Is there a way to change the header output style?
Please note that the spacing is unequal (h3...h4).
In this repro, I tried to give
1 space between the 1st and 2nd columns,
2 spaces between the 2nd and 3rd columns,
3 spaces between the 3rd and 4th columns.
I also tried to give the same column name to column2 and column3. The approach is as follows.
Data is copied from the Azure SQL database to the data lake in comma-delimited format as a staging file.
This staging file is taken as the source in a Dataflow activity.
In the source dataset, "First row as header" is not checked.
Data preview of the Source transformation:
A Derived column transformation is added to change the column names of column2 and column3.
In this case, the value date_col in column1 marks the header row. Thus, when column1 is 'date_col', replace the column2 and column3 data with the same column name.
column_2 = iif(Column_1=='date_col','ECIX',Column_2);
column_3 = iif(Column_1=='date_col','ECIX',Column_3);
Another Derived column transformation is added to concatenate all the columns with spaces. The column name is given as concat. The value for this column is
concat(Column_1,' ',Column_2,' ',Column_3,' ',Column_4)
A Select transformation is added, and only the concat column is selected.
In the sink, a new delimited file is added as the sink dataset. In the sink dataset as well, "First row as header" is not checked.
Output file screenshot
After the pipeline is run, the target file looks like this.
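The data flow steps above can be mimicked in plain Python to see what each one contributes. This is only an illustration of the logic, not ADF expression syntax; the `to_space_separated` helper and the sample rows are hypothetical, while 'date_col' and 'ECIX' are the values used in the repro.

```python
def to_space_separated(rows, separators, header_key="date_col", shared_name="ECIX"):
    """Emulate the two derived-column steps followed by the select.

    rows: tuples read from the headerless staging file (header is row data).
    separators: the string placed before each column after the first.
    """
    out = []
    for row in rows:
        cells = list(row)
        # Step 1: on the header row, give column 2 and column 3 the same name.
        if cells[0] == header_key:
            cells[1] = cells[2] = shared_name
        # Step 2: concatenate all columns with the (unequal) separators.
        line = cells[0]
        for sep, cell in zip(separators, cells[1:]):
            line += sep + cell
        # Step 3: keep only the concatenated value.
        out.append(line)
    return out


staging = [
    ("date_col", "h2", "h3", "h4"),   # header arrives as an ordinary row
    ("2021-08-05", "a", "b", "c"),
]
lines = to_space_separated(staging, [" ", "  ", "   "])
print(lines[0])  # date_col ECIX  ECIX   h4
print(lines[1])  # 2021-08-05 a  b   c
```

Treating the header as an ordinary data row is what makes the duplicate column name and the custom spacing possible, since neither is allowed in a real header definition.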
Alternatively, keeping Azure SQL itself as the source in the data flow, I created a single derived column 'OUTDC' and concatenated all the columns from the source like this:
(h1)+' '+(h2)+' '+(h3)
Then I fed OUTDC to a delimited sink and set the Headers option to a single string like this:
['h1 h2 h2']

ADF map source columns startswith to sink columns in SQL table

I have an ADF data flow with many csv files as the source and a SQL database as the sink. The data in the csv files is similar, at 170-plus columns wide; however, not all of the files have the same columns. Additionally, some column names differ between files, but each column name starts with the same corresponding 3 digits. Example: 203-student name, 644-student GPA.
Is it possible to map source columns using the first 3 characters?
Go back to the data flow designer and edit the data flow.
Click on the parameters tab
Create a new parameter and choose string array data type
For the default value as per your requirement, enter ['203-student name','203-student grade','203-student-marks']
Add a Select transformation. The Select transformation will be used to map incoming columns to new column names for output.
We're going to change the first 3 column names to the new names defined in the parameter
To do this, add 3 rule-based mapping entries in the bottom pane
For the first column, the matching rule will be position==1 and the name will be $parameter1[1]
Follow the same pattern for column 2 and 3
Click on the Inspect and Data Preview tabs of the Select transformation to view the new column name.
Reference - https://learn.microsoft.com/en-us/azure/data-factory/tutorial-data-flow-dynamic-columns#parameterized-column-mapping
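The question asks about matching on the first 3 characters rather than on position. As an illustration of that matching rule only, here is a plain-Python sketch (not ADF expression syntax). The sink column names, the `map_by_prefix` helper and the second file's column names are hypothetical.

```python
# Hypothetical lookup: 3-digit prefix -> sink column name in the SQL table.
SINK_BY_PREFIX = {"203": "student_name", "644": "student_gpa"}


def map_by_prefix(source_columns, sink_by_prefix):
    """Map each source column to a sink column using its first 3 characters.

    Source columns whose prefix is not in the lookup are left unmapped.
    """
    mapping = {}
    for name in source_columns:
        sink = sink_by_prefix.get(name[:3])
        if sink is not None:
            mapping[name] = sink
    return mapping


file_a = ["203-student name", "644-student GPA"]
file_b = ["203-name of student", "999-unmapped"]

print(map_by_prefix(file_a, SINK_BY_PREFIX))
print(map_by_prefix(file_b, SINK_BY_PREFIX))
```

Both files end up with their 203 column mapped to the same sink column even though the full source names differ.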

Map a text file to key/value pair in order to group them in pyspark

I would like to create a Spark DataFrame in pyspark from a text file that has a varying number of rows and columns, and map it to key/value pairs, where the key is the first 4 characters of the first column of the text file. I want to do that in order to remove redundant rows and to be able to group them later by the key value. I know how to do that in pandas, but I am still confused about where to start in pyspark.
My input is a text file that has the following:
1234567,micheal,male,usa
891011,sara,femal,germany
I want to be able to group every row by the first six characters of the first column.
Create a new column that contains only the first six characters of the first column, and then group by that:
from pyspark.sql.functions import col
# substr(startPos, length): startPos is 1-based, so this keeps characters 1 to 6
df2 = df.withColumn("key", col("first_col").substr(1, 6))
df2.groupBy("key").agg(...)
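To make the key/value idea concrete without a Spark session, here is a plain-Python sketch of the same logic: derive the key from the first six characters of the first column, drop exact duplicate rows, and group by key. The `group_by_key` helper is hypothetical, and the duplicated third input line is added here only to show the de-duplication.

```python
from collections import defaultdict


def group_by_key(lines, key_len=6):
    """Group comma-separated lines by the first key_len chars of column 1."""
    groups = defaultdict(list)
    # dict.fromkeys drops exact duplicate lines while keeping input order.
    for line in dict.fromkeys(lines):
        first_col = line.split(",")[0]
        groups[first_col[:key_len]].append(line)
    return dict(groups)


lines = [
    "1234567,micheal,male,usa",
    "891011,sara,femal,germany",
    "1234567,micheal,male,usa",   # redundant row, added for illustration
]
groups = group_by_key(lines)
for key, members in groups.items():
    print(key, members)
# 123456 ['1234567,micheal,male,usa']
# 891011 ['891011,sara,femal,germany']
```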

Splitting a column data as per delimiter

I have a Spark (1.4) DataFrame where the data in a column looks like "1-2-3-4-5-6-7-8-9-10-11-12". I want to split the data into multiple columns. Please note that the number of fields can vary from 1 to 12; it's not fixed.
P.S. we are using Scala API.
Edit:
Editing over the original question. I have the delimited string as below:
"ABC-DEF-PQR-XYZ"
From this string I need to create delimited strings in separate columns as below. Please note that this string is in a column in DF.
Original column: ABC-DEF-PQR-XYZ
New col1 : ABC
New col2 : ABC-DEF
New col3 : ABC-DEF-PQR
New col4 : ABC-DEF-PQR-XYZ
Please note that there can be up to 12 such new columns that need to be derived from the original field. Also, the string in the original column may vary, i.e. sometimes 1 field, sometimes 2, but the maximum is 12.
Hope I have articulated the problem statement clearly.
Thanks!
You can use explode and pivot. Here is some sample data:
import pyspark.sql.functions as f
df=sc.parallelize([["1-2-3-4-5-6-7-8-9-10-11-12"], ["1-2-3-4"], ["1-2-3-4-5-6-7-8-9-10"]]).toDF(schema=["col"])
Now add a unique id to rows so that we can keep track of which row the data belongs to:
df=df.withColumn("id", f.monotonically_increasing_id())
Then split the columns by delimiter - and then explode to get a long-form dataset:
df=df.withColumn("col_split", f.explode(f.split("col", "-")))
Finally pivot on id to get back to wide form:
(df.groupby("id")
.pivot("col_split")
.agg(f.max("col_split"))
.drop("id").show())
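The explode and pivot approach above yields one column per individual field. The edited question instead asks for cumulative prefixes (ABC, ABC-DEF, ...). A minimal plain-Python sketch of that derivation follows; it is not part of the answer above, and the helper names `cumulative_prefixes` and `to_columns` are hypothetical. The same function body could presumably be wrapped in a UDF, which is an assumption left untested here.

```python
def cumulative_prefixes(value, sep="-"):
    """Return the growing prefixes of a delimited string."""
    parts = value.split(sep)
    return [sep.join(parts[:i]) for i in range(1, len(parts) + 1)]


def to_columns(value, width=12, sep="-"):
    """Pad the prefixes with None so every row has exactly `width` columns."""
    prefixes = cumulative_prefixes(value, sep)[:width]
    return prefixes + [None] * (width - len(prefixes))


print(cumulative_prefixes("ABC-DEF-PQR-XYZ"))
# ['ABC', 'ABC-DEF', 'ABC-DEF-PQR', 'ABC-DEF-PQR-XYZ']
print(to_columns("ABC-DEF", width=4))
# ['ABC', 'ABC-DEF', None, None]
```

Padding to a fixed width handles the case where the original string has fewer than 12 fields.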

How to split 2 or more delimited columns in a single row to multiple rows using Talend

I am trying to move data from a CSV file to DB table. There are 2 delimited columns in the CSV file (separated by ";"). I would like to create a row for each of the delimited values at matching indexes as shown below. Assumption is that both columns will contain same number of delimited items.
Example CSV Input:
Labels   Values
A;B;C    1;2;3
D        4
F;G      5;6
Expected Output:
Labels   Values
A        1
B        2
C        3
D        4
F        5
G        6
How can I achieve this? I have tried using tNormalize, but this only works for a single column. I also tried 2 successive tNormalize nodes but, as expected, that resulted in unwanted combinations.
Thanks
Read your CSV file with a tFileInputDelimited component, and define your schema for the file.
Assuming you are using MySQL, also drop a tMysqlOutput component onto your designer to save your parsed file to the DB.
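The row-splitting logic the question describes, pairing the delimited items of both columns by index, can be sketched in plain Python as below. This illustrates the transformation only and is not Talend code; the `split_paired` helper is hypothetical.

```python
def split_paired(rows, sep=";"):
    """Turn each (labels, values) row into one row per matching index."""
    out = []
    for labels, values in rows:
        label_items = labels.split(sep)
        value_items = values.split(sep)
        # The question assumes both columns hold the same number of items.
        if len(label_items) != len(value_items):
            raise ValueError(f"item count mismatch: {labels!r} vs {values!r}")
        # zip pairs items by position, avoiding the unwanted combinations
        # that two successive normalisations would produce.
        out.extend(zip(label_items, value_items))
    return out


rows = [("A;B;C", "1;2;3"), ("D", "4"), ("F;G", "5;6")]
result = split_paired(rows)
for label, value in result:
    print(label, value)
```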