I have an ADO.NET DataSet that has three DataTables. Let's say the DataSet is named Customer and the tables are Accounts, Purchases and Profile. I would like to export them to worksheets using SoftArtisans ExcelWriter with templates.
Example
DataSet myDataSet = new DataSet();
myDataSet.Tables.Add("Customer");
myDataSet.Tables.Add("Accounts");
myDataSet.Tables.Add("Purchases");
xlt.BindData(myDataSet, null, xlt.CreateDataBindingProperties());
I would like to export these tables into separate Excel worksheets.
The BindData method of OfficeWriter's ExcelTemplate object has a number of overloads. The one that accepts a DataSet does not import every DataTable in the DataSet; it imports only the first one. To bind every DataTable in the DataSet, use the overload that takes a DataTable (see the documentation) and call it once per table. For example:
// The 2nd parameter is the data source name; it must match the first
// part of the data marker in your template workbook (e.g. %%=Customer.FirstName).
xlt.BindData(myDataSet.Tables["Customer"], "Customer", xlt.CreateDataBindingProperties());
You could also do something in a loop, like this:
foreach (DataTable table in myDataSet.Tables)
{
    xlt.BindData(table, table.TableName, xlt.CreateDataBindingProperties());
}
I have a Copy Data activity where I am adding an additional column called "Id" whose value is @guid(). The problem is that the Guid value is the same for every row it imports, so the destination/sink throws a primary key violation.
Additional column definition
If you use an additional column, the copy activity copies the same guid() value to all rows.
To get a unique guid() for each row, you can follow the demonstration below.
First, pass your source data to a Lookup activity and pass its output to a ForEach activity.
This is my sample source data in CSV format; give this to the Lookup.
source.csv:
name
"Rakesh"
"Laddu"
"Virat"
"John"
Use another dummy dataset that contains any single value, and use it as the source of the Copy activity.
Dummy.csv:
name
"Rakesh"
ForEach activity:
Inside the ForEach, use a Copy activity with the dummy dataset as its source. Create additional columns and give them our source data (@item().name) and @guid().
Copy activity:
Now give your database dataset as the sink. For this sample, I have used an Azure SQL database table.
Go to the mapping of the copy activity and click on Import schemas. Give any string value so that it can import the schemas of the source (here, the dummy schema) and the sink.
After that, you will get a mapping like this. In it, map the additional columns we created to the database columns.
Pipeline Execution:
After executing the pipeline, you get the desired output with a unique guid in each row.
Output:
I have selected to export tables at the end of model execution to an Excel file, and I would like that data to accumulate on the same Excel sheet after every stop and start of the model. As of now, every stop and start just exports that 1 run's data and overwrites what was there previously. I may be approaching the method of exporting multiple runs wrong/inefficiently but I'm not sure.
The best method is to export the raw data, as you do (if it is not too large).
However, 2 improvements:
Manage your output data yourself, i.e. do not rely on the standard export tables but write only the data that you really need. Check this help article to learn how to write your own data.
In your custom output data tables, add identification columns such as date_of_run. I often use iteration and replication columns as well, to identify which of those the data stems from.
Custom CSV approach
An alternative approach is to create your own CSV file programmatically, which is possible with Java code. You can then create a new file (with a custom filename) after each run:
First, define a “Text file” element as below:
Then, use this code below to create your own csv with a custom name and write to it:
File outputDirectory = new File("outputs");
outputDirectory.mkdir();
String outputFileNameWithExtension = outputDirectory.getPath()+File.separator+"output_file.csv";
file.setFile(outputFileNameWithExtension, Mode.WRITE_APPEND);
// create header
file.println( "col_1"+","+"col_2");
// Write data from dbase table
List<Tuple> rows = selectFrom(my_dbase_table).list();
for (Tuple row : rows) {
    file.println( row.get( my_dbase_table.col_1) + "," +
                  row.get( my_dbase_table.col_2));
}
file.close();
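If the goal is to accumulate several runs in one file instead of creating a new file per run, two details matter: the header should be written only once, and each row should carry a run identifier. The following is a minimal, language-neutral sketch of that idea in plain Python, not AnyLogic code. The function name append_run, the run_id column and the two data columns are hypothetical and only mirror the col_1/col_2 names used above.

```python
import csv
import os
from datetime import datetime


def append_run(path, rows, run_id=None):
    """Append the rows of one run to a CSV file, writing the header only once."""
    # run_id is a hypothetical identification column; default to a timestamp.
    if run_id is None:
        run_id = datetime.now().isoformat(timespec="seconds")
    # Only a new or empty file needs a header line.
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if needs_header:
            writer.writerow(["run_id", "col_1", "col_2"])
        for col_1, col_2 in rows:
            writer.writerow([run_id, col_1, col_2])
```

Calling append_run once at the end of every run leaves a single file in which the rows of each run can be told apart by their run_id value.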
Oversimplified Scenario:
A process generates monthly data in an S3 file. The number of fields can differ in each monthly run. Based on this data in S3, we load the data into a table and manually run SQL for a few metrics (manually, because the number of fields can change in each run when a few columns are added or deleted). There are more calculations/transforms on this data, but as a starting point I'm presenting a simpler version of the use case.
Approach:
Because the data is effectively schema-less (the number of fields in the S3 file can differ in each run when a few fields are added or deleted, which requires manual changes to the SQL every time), I'm planning to explore Spark/Scala so that we can read directly from S3 and dynamically generate SQL based on the fields.
Query:
How can I achieve this in Scala/Spark SQL/DataFrame? The S3 file contains only the required fields from each run, so reading the dynamic fields from S3 is not an issue; the DataFrame takes care of it. The issue is how to generate the DataFrame API/Spark SQL code that handles them.
I can read the S3 file via a DataFrame and register it with createOrReplaceTempView to write SQL, but I don't think that avoids manually changing the Spark SQL when a new field is added in S3 in the next run. What is the best way to dynamically generate the SQL, or is there a better way to handle the issue?
Usecase-1:
First-run
dataframe: customer,month_1_count (here dataframe directly points to s3, which has only required attributes)
--sample code
SELECT customer,sum(month_1_count)
FROM dataframe
GROUP BY customer
--Dataframe API/SparkSQL
dataframe.groupBy("customer").sum("month_1_count").show()
Second-Run - One additional column was added
dataframe: customer,month_1_count,month_2_count (here dataframe directly points to s3, which has only required attributes)
--Sample SQL
SELECT customer,sum(month_1_count),sum(month_2_count)
FROM dataframe
GROUP BY customer
--Dataframe API/SparkSQL
dataframe.groupBy("customer").sum("month_1_count","month_2_count").show()
I'm new to Spark/Scala, so it would be helpful if you could point me in a direction that I can explore further.
It sounds like you want to perform the same operation over and over again on new columns as they appear in the dataframe schema? This works:
from pyspark.sql import functions

# search for the column names you want to sum; I put in "month"
column_search = lambda col_names: 'month' in col_names

# get the column names of a temp dataframe with only the columns you want to sum
relevant_columns = original_df.select(*filter(column_search, original_df.columns)).columns

# create a dictionary of the relevant column names to pass to the agg function
columns = {col_names: "sum" for col_names in relevant_columns}

# apply the agg function with your groupBy, passing in the columns dictionary
grouped_df = original_df.groupBy("customer").agg(columns)

# show the result
grouped_df.show()
Some important concepts that may help you learn:
DataFrames store their column names in a list: dataframe.columns
Functions can be applied to lists to create new lists, as with "column_search"
The agg function accepts multiple expressions in a dictionary, as explained here, which is what I pass in as "columns"
Spark is lazy, so it doesn't change data state or perform operations until you perform an action like show(). This means that writing out a temporary dataframe just to use one element of it, like its columns as I do, is not costly, even though it may seem inefficient if you're used to SQL.
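If you would rather generate the SQL text itself, the same column filtering can drive a small string builder. This is a minimal sketch in plain Python with no Spark dependency. The function name build_sum_query, the "month" marker and the sum_ alias prefix are assumptions made for illustration, and column names are inserted unquoted, so it only suits simple, trusted names.

```python
def build_sum_query(columns, group_col="customer", table="dataframe", marker="month"):
    """Build a GROUP BY query that sums every column whose name contains `marker`."""
    # Keep only the columns to aggregate, in their original order.
    sum_cols = [c for c in columns if marker in c and c != group_col]
    select_parts = [group_col] + [f"SUM({c}) AS sum_{c}" for c in sum_cols]
    return f"SELECT {', '.join(select_parts)} FROM {table} GROUP BY {group_col}"


# Second-run schema from the question: one extra month column has appeared.
query = build_sum_query(["customer", "month_1_count", "month_2_count"])
print(query)
```

With a real DataFrame you would pass its columns list and run the resulting string with spark.sql after registering the view with createOrReplaceTempView, so a new month column is picked up without editing the query by hand.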
I have an Employee table in a database that has a gender column. I want to filter the Employee data by gender and write it to Excel in three columns, like this:
I'm getting this output using the Talend schema below.
Structure 1:
I want to optimize the structure above and tried it this way, but I am stuck on another scenario. Here I get the Employee data by gender, but in three different files. Is there any way to get the same Excel result from one SQL input and, after mapping, write it to a single output Excel file?
Structure 2:
NOTE: I don't want to use the same input table multiple times. I want to get the same output using a single table and a single output Excel file, so please suggest a component that would be useful for this.
Thanks in advance!!!
I'm creating a BIRT report and I need to split a comma delimited string from a dataset into multiple columns in a table.
The data looks like:
256,1400.031,-70.014,1,4.544,0.36,10,31,30.89999962,0
256,1400,-69.984,2,4.574,1.36,10,0,0,0
...
The data is stored this way in the database and I can't change it but I need to be able to display it as a table. I'm new to BIRT, any ideas?
I think the easiest way is to create a computed column in the dataset for each field.
For example if the merged field from database is named "mergedData" you can split it with this kind of expression:
First field (computed column) expression:
var tempArray=row["mergedData"].split(",");
tempArray[0];
Second field:
var tempArray=row["mergedData"].split(",");
tempArray[1];
etc..
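One thing to watch with index-based splitting is a row that has fewer fields than expected. The following sketch shows the split logic with padding. It is written in plain Python rather than as a BIRT expression, and the function name split_fields and the expected count of 10 (taken from the sample rows) are assumptions.

```python
def split_fields(merged, expected=10, sep=","):
    """Split a delimited string into exactly `expected` fields."""
    parts = [p.strip() for p in merged.split(sep)]
    # Pad short rows with None so every row has the same number of columns.
    parts += [None] * (expected - len(parts))
    # Drop any surplus fields beyond the expected count.
    return parts[:expected]


fields = split_fields("256,1400,-69.984,2,4.574,1.36,10,0,0,0")
print(fields[0], fields[2])
```

In the computed-column expressions above, the equivalent guard would be to check the array length before indexing into it.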
This depends on some variables that you did not mention.
If the dataset is static (rarely or never updated), open the data set in Excel, convert it from .csv to .xls, and save it.
Then use the Excel file as a data source. Assuming you are using BIRT 4.1 or newer, this should work fine.
I don't think there is any SQL code that easily converts .csv