Open source tools for reconciliation

Are there any open source tools available for data reconciliation? The key use case is receiving data from different parties in custom formats, reconciling them, and identifying any missing or mismatched rows. Another good-to-have use case is the ability to build this recon pipeline directly through a UI, where users can load sample files, mark key fields for matching, and define the output format for the reconciliation report.
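For the row-matching part specifically, the usual approach is a keyed outer join; below is a minimal sketch in Python/pandas (the file names, key fields, and the compared "amount" column are placeholders, not taken from any particular tool):

    import pandas as pd

    # Each party's extract, already parsed from its custom format into a table.
    ours = pd.read_csv("our_side.csv")        # placeholder file name
    theirs = pd.read_csv("their_side.csv")    # placeholder file name

    keys = ["trade_id", "account"]            # placeholder key fields for matching

    # Full outer join on the key fields; indicator=True adds a _merge column
    # flagging rows that exist on only one side (i.e. missing rows).
    merged = ours.merge(theirs, on=keys, how="outer",
                        suffixes=("_ours", "_theirs"), indicator=True)

    missing = merged[merged["_merge"] != "both"]

    # For rows present on both sides, compare the fields that should agree
    # ("amount" is a placeholder column present in both files).
    both = merged[merged["_merge"] == "both"]
    mismatched = both[both["amount_ours"] != both["amount_theirs"]]

    missing.to_csv("recon_missing.csv", index=False)
    mismatched.to_csv("recon_mismatched.csv", index=False)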


Add metadata to a large number of files in SharePoint Online

We're migrating a large number of files from a document management system into SharePoint Online. We have important metadata associated with a good number of the files. The export process renames the files as nnnnn_yyyyy_oldfilename, with nnnnn being a cabinet number and yyyyy being a folder number. It also creates a file that associates all existing metadata with these two pieces of information. Is it practical to script renaming the files back to their original names while storing the two pieces of information in new custom metadata fields (cabinet, ofolder) for each file? If we can save those pieces of information, we'll then be able to use a similar script to push the saved information into other custom metadata fields later on.
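The filename-parsing half of that script is simple to prototype; here is a minimal local sketch in Python, assuming the nnnnn_yyyyy_oldfilename pattern from the question. The folder and output file names are placeholders, and actually writing the cabinet/ofolder values into SharePoint Online metadata would still be done with your migration tooling or the SharePoint APIs.

    import csv
    import re
    from pathlib import Path

    EXPORT_DIR = Path("./export")                  # placeholder folder of exported files
    PATTERN = re.compile(r"^(\d+)_(\d+)_(.+)$")    # nnnnn_yyyyy_oldfilename

    rows = []
    for f in list(EXPORT_DIR.iterdir()):
        m = PATTERN.match(f.name)
        if not f.is_file() or not m:
            continue
        cabinet, ofolder, original = m.groups()
        # Rename the file back to its original name locally...
        f.rename(f.with_name(original))
        # ...and keep cabinet/ofolder so they can be pushed into the
        # custom metadata fields after upload.
        rows.append({"file": original, "cabinet": cabinet, "ofolder": ofolder})

    with open("metadata_map.csv", "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=["file", "cabinet", "ofolder"])
        writer.writeheader()
        writer.writerows(rows)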

Is it possible to merge Azure Data Factory data flows?

I have two separate Data flows in Azure Data Factory, and I want to combine them into a single Data flow.
There is a technique for copying elements from one Data flow to another, as described in this video: https://www.youtube.com/watch?v=3_1I4XdoBKQ
This does not work for Source or Sink stages, though. The Script elements do not contain the Dataset that the Source or Sink is connected to, and if you try to copy them, the designer window closes and the Data flow is corrupted. The details are in the JSON, but I have tried copying and pasting into the JSON and that doesn't work either - the source appears on the canvas, but is not usable.
Does anyone know if there is a technique for doing this, other than just manually recreating the objects on the canvas?
Thanks Leon for confirming that this isn't supported; here is my workaround process:
1. Open the Data Flow that will receive the merged code.
2. Open the Data Flow that contains the code to merge in.
3. Go through the to-be-merged flow and change the names of any transformations that clash with the names of transformations in the target flow.
4. Manually create, in the target flow, any Sources that did not already exist.
5. Copy the entire script out of the to-be-merged flow into a text editor.
6. Remove the Sources and Sinks.
7. Copy the remaining transformations to the clipboard and paste them into the target flow's script editor.
8. Manually create the Sinks, remembering to set all properties such as "Allow Update".
Be prepared that, if you make a mistake and paste in something that is not correct, then the flow editor window will close and the flow will be unusable. The only way to recover it is to refresh and discard all changes since you last published, so don't do this if you have other unpublished changes that you don't want to lose!
I have already established a practice in our team that no mappings are done in Sinks. All mappings are done in Derived Column transformations, and any column name ambiguity is resolved in a Select transformation, so the Sink is always just auto-map. That makes operations like this simpler.
It should be possible to keep the Source definitions in Step 6, remove the Source elements from the target script, and paste the new Sources in to replace them, but that's a little more complex and error-prone.
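For steps 5 to 7, the filtering of the copied script can also be done with a small script instead of hand-editing. Below is a rough sketch in Python, assuming each data flow script statement ends with "~> <name>" (the usual shape of the script, but verify against your own export); the file name is a placeholder, and the Sources and Sinks still need to be recreated by hand as described above.

    import re

    # Script text copied out of the to-be-merged flow (step 5), saved locally.
    with open("flow_to_merge_script.txt") as f:
        lines = f.read().splitlines()

    # Group lines into statements; each statement ends with "~> <name>".
    statements, current = [], []
    for line in lines:
        current.append(line)
        if re.search(r"~>\s*\S+\s*$", line):
            statements.append("\n".join(current))
            current = []

    # Drop Source and Sink statements (step 6), keep the transformations (step 7).
    kept = [s for s in statements
            if not s.lstrip().startswith("source(") and " sink(" not in s]

    # Paste this output into the target flow's script editor.
    print("\n".join(kept))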

How do I use Cloud Dataprep to convert my Excel file to the target format regularly?

I'd like to convert my Excel file to the proper format using Google Cloud Dataprep. How do I save my conversion flow and use it as a template? For example, if there are two Excel files named A and B and I create a flow to merge them, and next time there are two other files named C.xlsx and D.xlsx, how can I reuse that flow to merge C and D?
You can copy and reuse recipes (using the right-click or ... context menus and selecting Make a copy > Without inputs), or you can swap the input dataset for the original recipe and select your new file without having to recreate the recipe.
If your goal is automation, this is a bit more difficult when your source is an Excel file, as Excel files are only an accepted format when using the uploader.
If you're able to have the data output in a CSV and uploaded to Cloud Storage, it opens up additional opportunities to schedule and parameterize your process.
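If you can own that conversion step yourself, here is a minimal sketch of it in Python (the file and bucket names are placeholders; this assumes the pandas, openpyxl, and google-cloud-storage packages rather than anything Dataprep provides):

    import pandas as pd
    from google.cloud import storage

    SOURCE_XLSX = "C.xlsx"                  # placeholder input workbook
    CSV_NAME = "C.csv"
    BUCKET = "my-dataprep-input-bucket"     # placeholder bucket name

    # Convert the Excel sheet to CSV locally.
    df = pd.read_excel(SOURCE_XLSX)
    df.to_csv(CSV_NAME, index=False)

    # Upload the CSV to Cloud Storage, where a scheduled/parameterized
    # Dataprep flow can pick it up.
    client = storage.Client()
    client.bucket(BUCKET).blob(CSV_NAME).upload_from_filename(CSV_NAME)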

Best practices for parameterizing load of multiple CSV files in Data Factory

I am experimenting with Azure Data Factory to replace some other data-load solutions we currently have, and I'm struggling with finding the best way to organize and parameterize the pipelines to provide the scalability we need.
Our typical pattern is that we build an integration for a particular Platform. This "integration" is essentially the mapping and transformation of fields from their data files (CSVs) into our Stage1 SQL database, and by the time the data lands there, the data types should be set properly and the indexes set.
Within each Platform, we have Customers. Each Customer has their own set of data files that get processed in that Customer context -- within the scope of a Platform, all Customer files follow the same schema (or close to it), but they all get sent to us separately. If you looked at our incoming file store, it might look like (simplified, there are 20-30 source datasets per customer depending on platform):
Platform
    Customer A
        Employees.csv
        PayPeriods.csv
        etc.
    Customer B
        Employees.csv
        PayPeriods.csv
        etc.
Each customer lands in their own SQL schema. So after processing the above, I should have CustomerA.Employees and CustomerB.Employees tables. (This allows a little bit of schema drift between customers, which does happen on some platforms. We handle it later in our stage 2 ETL process.)
What I'm trying to figure out is:
What is the best way to set up ADF so I can effectively manage one set of mappings per platform and automatically accommodate any new customers we add to that platform, without having to change the pipeline/flow?
My current thinking is to have one pipeline per platform, and one dataflow per file per platform. The pipeline has a variable, "schemaname", which is set using the path of the file that triggered it (e.g. "CustomerA"). Then, depending on the file name, a branching conditional fires the right dataflow: if it's "employees.csv" it runs one dataflow; if it's "payperiods.csv" it runs a different dataflow. They would all use the same generic target sink datasource, with the table name parameterized and those parameters set in the pipeline using the schema variable and the file name from the conditional branch.
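A rough sketch of that path-to-parameter logic in plain Python (not ADF expression syntax; the folder layout follows the example above and the parameter/dataflow names are illustrative only):

    from pathlib import PurePosixPath

    def parameters_from_trigger(file_path: str) -> dict:
        """Derive pipeline parameters from the triggering blob path,
        e.g. 'Platform/CustomerA/Employees.csv'."""
        path = PurePosixPath(file_path)
        schema_name = path.parts[-2]        # customer folder -> SQL schema
        table_name = path.stem              # file name -> target table / dataflow choice
        return {
            "schemaname": schema_name,
            "targettable": f"{schema_name}.{table_name}",   # e.g. CustomerA.Employees
            "dataflow": f"{table_name.lower()}_dataflow",   # illustrative naming only
        }

    print(parameters_from_trigger("Platform/CustomerA/Employees.csv"))
    # {'schemaname': 'CustomerA', 'targettable': 'CustomerA.Employees', 'dataflow': 'employees_dataflow'}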
Are there any pitfalls to setting it up this way? Am I thinking about this correctly?
This sounds solid. Just be aware that if you define column-specific mappings with expressions that expect those columns to be present, you may get data flow execution failures when those columns are missing from a customer's source files.
The way to protect against that in ADF Data Flow is to use column patterns, which let you define mappings that are generic and more flexible.
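Outside of ADF, the behaviour column patterns give you looks roughly like the pandas sketch below: apply only the mappings whose source column is actually present, so mild schema drift in a customer file doesn't fail the load (this is an analogy, not ADF syntax; the file and column names are placeholders).

    import pandas as pd

    df = pd.read_csv("Platform/CustomerB/Employees.csv")    # placeholder customer file

    # Expected source-to-target column mappings.
    expected = {"emp_id": "EmployeeId", "hire_dt": "HireDate", "dept": "Department"}

    # Keep only the mappings whose source column exists in this file.
    present = {src: dst for src, dst in expected.items() if src in df.columns}
    df = df.rename(columns=present)[list(present.values())]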

Transfer Scrumwise to GreenHopper

Scrumwise can export its whole data as XML. This contains lots of data including Projects, Backlog, Sprints, Tasks, Team Members, etc. It can also export Tasks as CSV.
GreenHopper can import Projects in various formats (but not XML).
I'd like to transfer as much as possible between Scrumwise and GreenHopper. I'm thinking of extracting the Projects node from the XML, converting to JSON, and importing that. Right now GreenHopper rejects the data right from the start.
Is there a reference to the data schema used in GreenHopper? I'd like to transfer more than just the Project, but all its associated data.
Select all the backlog tasks and export. Scrumwise uses the semicolon as the delimiter, so convert it by search and replace and JIRA will at least try to import. Import into a new, empty project and expect it to take several attempts to get right. You'll need to add columns for time remaining in seconds. Getting the statuses and resolution right was the hardest part: JIRA offers status options that will fail during import, so check the JIRA workflow for valid states. Also, an empty resolution means "unresolved", while an unrecognized resolution defaults to "resolved". For tasks, you will need to export as XML, parse the XML for all tasks, and add a parent ID column.
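The delimiter conversion is safer to script with a CSV parser than with a blind search-and-replace (commas or semicolons inside quoted fields would otherwise break rows). A minimal sketch in Python using the csv module, where the file names and the added column header are placeholders to be matched to what the JIRA importer expects:

    import csv

    # Scrumwise exports with ';' as the delimiter; JIRA's CSV importer expects ','.
    with open("scrumwise_export.csv", newline="", encoding="utf-8") as src, \
         open("jira_import.csv", "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter=";")
        writer = csv.writer(dst)

        header = next(reader)
        # Extra column for time remaining in seconds (values still to be filled in).
        writer.writerow(header + ["Time Remaining (seconds)"])

        for row in reader:
            writer.writerow(row + [""])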