Scala object to unpivot the data fields

Read RAW_us_deaths.csv and RAW_us_confirmed.csv and develop Scala objects to unpivot the date fields from the raw files, so that the metrics for deaths and confirmed cases are populated by date in rows, following the given schema.
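A minimal sketch of the unpivot step, assuming Spark's stack() SQL function, local file paths, and hypothetical identifier column names (adjust these to the actual raw schema):

import org.apache.spark.sql.{DataFrame, SparkSession}

object CovidUnpivot {

  // Unpivot every non-identifier column of a wide raw file into (report_date, <metricName>) rows.
  def unpivot(raw: DataFrame, idCols: Seq[String], metricName: String): DataFrame = {
    val dateCols = raw.columns.filterNot(idCols.contains)
    val stackExpr = s"stack(${dateCols.length}, " +
      dateCols.map(c => s"'$c', `$c`").mkString(", ") +
      s") as (report_date, $metricName)"
    raw.selectExpr((idCols :+ stackExpr): _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("CovidUnpivot").master("local[*]").getOrCreate()

    // Hypothetical identifier columns; everything else in the raw files is treated as a date column.
    val idCols = Seq("UID", "Province_State", "Country_Region")

    def readRaw(path: String): DataFrame =
      spark.read.option("header", "true").option("inferSchema", "true").csv(path)

    val deathsLong    = unpivot(readRaw("RAW_us_deaths.csv"), idCols, "death_count")
    val confirmedLong = unpivot(readRaw("RAW_us_confirmed.csv"), idCols, "case_count")

    deathsLong.show(5, truncate = false)
  }
}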
Merge case count & death count: join the deaths and confirmed COVID cases data sets produced in the previous step, prepare one data set that holds both case_count and death_count for both US & global, and write it to a local path.
Remove the records with both case_count & death_count = 0 and write the final output
to a parquet file.
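Continuing that sketch, the merge, filter and write steps might look like the following inside the same main(); the join keys, output paths and the CSV format for the local copy are assumptions:

import org.apache.spark.sql.functions.col

// Join the two unpivoted data sets on the identifier columns plus the date.
val joinKeys = idCols :+ "report_date"
val merged = confirmedLong
  .join(deathsLong, joinKeys, "full_outer")
  .na.fill(0, Seq("case_count", "death_count"))

// Merged case & death counts written to a local path (CSV chosen purely as an example).
merged.write.mode("overwrite").option("header", "true").csv("output/covid_merged_csv")

// Final output: drop records where both counts are zero and write to Parquet.
merged
  .filter(!(col("case_count") === 0 && col("death_count") === 0))
  .write.mode("overwrite").parquet("output/covid_final.parquet")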

Related

Azure Data Factory - Insert Sql Row for Each File Found

I need a data factory that will:
check an Azure blob container for csv files
for each csv file
insert a row into an Azure Sql table, giving filename as a column value
There's just a single csv file in the blob container and this file contains five rows.
So far I have the following actions:
Within the for-each action I have a copy action. I gave this a source of a dynamic dataset which had the filename set as a parameter from @item().name. However, as a result 5 rows were inserted into the target table, whereas I was expecting just one.
The for-each loop executes just once, but I don't know how to use a data source that is a variable (or variables) holding the filename and timestamp.
You are headed in the right direction, but within the For Each you just need a Stored Procedure activity that will insert the FileName (and whatever other metadata you have available) into the Azure DB table.
Here is an example of the stored procedure in the DB:
CREATE PROCEDURE Log.PopulateFileLog (@FileName varchar(100))
AS
INSERT INTO Log.CvsRxFileLog
SELECT
    @FileName AS FileName,
    GETDATE() AS ETL_Timestamp;
EDIT:
You could also execute the insert directly with a Lookup Activity within the For Each like so:
EDIT 2:
This shows how to do it without a For Each.
NOTE: This is the most cost-effective method, especially when dealing with hundreds or thousands of files on a recurring basis!
1st, copy the output JSON array from your Lookup/Get Metadata activity using a Copy Data activity with a Source of Azure SQL DB and a Sink of a Blob Storage CSV file.
2nd, create another Copy Data activity with a Source of the Blob Storage JSON file and a Sink of Azure SQL DB.
In essence, you save the entire JSON output to a file in Blob Storage, then copy that file to Azure SQL DB using a JSON file type. This way you only have 3 activities to run, even if you are trying to insert from a dataset that has 500 items in it.
Of course there is always more than one way to do things, but I don't think you need a For Each activity for this task. Activities like Lookup, Get Metadata and Filter output their results as JSON which can be passed around. This JSON can contain one or many items and can be passed to a Stored Procedure. An example pattern:
This is the sort of ELT pattern that was common with early ADF gen 2 (prior to Mapping Data Flows) and which makes use of resources already in use in your architecture. You should remember that you are charged per activity execution in ADF (e.g. multiple iterations in an unnecessary For Each loop) and that, generally, compute in Azure is expensive and storage is cheap, so bear this in mind when implementing patterns in ADF. If you build the pattern above you have two types of compute: the compute behind your Azure SQL DB and the Azure Integration Runtime. If you add a Data Flow to that, you will have a third type of compute operating concurrently with the other two, so personally I only add these under certain conditions.
An example implementation of the above pattern:
Note the expression I am passing into my example logging proc:
@string(activity('Filter1').output.Value)
Data Flows is perfectly fine if you want a low-code approach and do not have compute resources already available to do this processing. In your case you already have an Azure SQL DB, which is quite capable of JSON processing, e.g. via the OPENJSON, JSON_VALUE and JSON_QUERY functions.
You mention not wanting to deploy additional code, which I understand, but then where did your original SQL table come from? If you are absolutely against deploying additional code, you could simply call the sp_executesql stored proc via the Stored Procedure activity, using a dynamic SQL statement which inserts your record, something like this:
@concat( 'INSERT INTO dbo.myLog ( logRecord ) SELECT ''', activity('Filter1').output, ''' ')
Shred the JSON either in your stored proc or later, e.g.:
SELECT y.[key] AS name, y.[value] AS [fileName]
FROM dbo.myLog
CROSS APPLY OPENJSON( logRecord ) x
CROSS APPLY OPENJSON( x.[value] ) y
WHERE logId = 16
AND y.[key] = 'name';

Spark Delta Table Updates

I am working in the Microsoft Azure Databricks environment using sparksql and pyspark.
So I have a delta table on a lake where data is partitioned by, say, file_date. Every partition contains files storing millions of records per day, with no primary/unique key. All these records have a "status" column which contains either NULL (if everything looks good on that specific record) or a non-null value (say, if a particular lookup mapping for a particular column is not found). Additionally, my process contains another folder called "mapping" which gets refreshed on a periodic basis, let's say nightly to keep it simple, and which is where the mappings come from.
On a daily basis, there is a good chance that about 100-200 rows get errored out (the status column containing non-null values). From these files, on a daily basis (hence the partitioning by file_date), a downstream job pulls all the valid records and sends them for further processing, ignoring those 100-200 errored records while waiting for the correct mapping file to be received. The downstream job, in addition to the valid-status records, should also check whether a mapping is now found for the errored records and, if so, take them further down as well (after, of course, updating the data lake with the appropriate mapping and status).
What is the best way to go? The best way I can see is to first update the delta table/lake directly with the correct mapping and set the status column to, say, "available_for_reprocessing", and have my downstream job pull the valid data for the day plus the "available_for_reprocessing" data and, after processing, update the status back to "processed". But this seems to be super difficult using delta.
I was looking at "https://docs.databricks.com/delta/delta-update.html" and the update example there only shows a simple update with constants, not updates from multiple tables.
The other, but most inefficient, option is to pull ALL the data (both processed and errored) for, say, the last 30 days, get the mapping for the errored records and write the dataframe back into the delta lake using the replaceWhere option. This is super inefficient, as we are reading everything (hundreds of millions of records) and writing everything back just to process, say, 1000 records at most. If you search for deltaTable = DeltaTable.forPath(spark, "/data/events/") at "https://docs.databricks.com/delta/delta-update.html", the example provided is for very simple updates. Without a unique key, it is impossible to update specific records as well. Can someone please help?
I can use pyspark or sparksql, but I am lost.
If you want to update one column ('status') on the condition that all lookups are now correct for rows where they weren't correct before (where 'status' is currently incorrect), I think the UPDATE command along with EXISTS can help you solve this. It isn't mentioned in the update documentation, but it works for both delete and update operations, effectively allowing you to update/delete records on joins.
For your scenario I believe the SQL command would look something like this:
UPDATE your_db.table_name AS a
SET status = 'correct'
WHERE EXISTS
(
  SELECT *
  FROM your_db.table_name AS b
  JOIN lookup_table_1 AS t1 ON t1.lookup_column_a = b.lookup_column_a
  JOIN lookup_table_2 AS t2 ON t2.lookup_column_b = b.lookup_column_b
  -- ... add further lookups if needed
  WHERE
    b.status = 'incorrect' AND
    a.lookup_column_a = b.lookup_column_a AND
    a.lookup_column_b = b.lookup_column_b
)
Merge did the trick...
MERGE INTO deptdelta AS maindept
USING updated_dept_location AS upddept
ON upddept.dno = maindept.dno
WHEN MATCHED THEN UPDATE SET maindept.dname = upddept.updated_name, maindept.location = upddept.updated_location
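For reference, the same merge can also be written with the Delta Lake Scala API instead of SQL. A minimal sketch, assuming the delta-spark library is available, a spark session is in scope, and updated_dept_location is the DataFrame of refreshed values (names kept from the answer above):

import io.delta.tables.DeltaTable

// Target Delta table by name (DeltaTable.forPath(spark, "<path>") works the same way for a path).
val maindept = DeltaTable.forName(spark, "deptdelta")

// updated_dept_location is assumed to be a DataFrame holding the refreshed department locations.
maindept.as("maindept")
  .merge(updated_dept_location.as("upddept"), "upddept.dno = maindept.dno")
  .whenMatched()
  .updateExpr(Map(
    "dname"    -> "upddept.updated_name",
    "location" -> "upddept.updated_location"))
  .execute()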

How to read partitioned parquets with same structure but different column names?

I have parquet files that are partitioned by the date created (BusinessDate) and the data source (SourceSystem). Some source systems generate their data with different column names (small stuff like capitalization, i.e. orderdate vs OrderDate), but the same overall data structure (column order and data type are always the same between files).
My data looks like this in my filesystem:
dataroot
|-BusinessDate=20170809
|-SourceSystem=StoreA
|-data.parquet (has column "orderdate")
|-SourceSystem=StoreB
|-data.parquet (has column "OrderDate")
Is there a way to read the data in from either dataroot or dataroot/BusinessDate=######/, and somehow normalize the data into a uniform schema?
My first attempt was to try:
val inputDF = spark.read.parquet(samplePqt)
val standardNames = Seq(...) // list of uniform column names, in order
val uniformDF = inputDF.toDF(standardNames: _*)
But this does not work (it renames the columns that have the same names across source systems, but populates the differently-named columns with null for records from source system B).
I never did find a way to process all of the data in one pass; my solution iterates through the distinct source systems, creates file paths pointing to each source system, and processes them individually. As each one is processed, it gets transformed into the standard schema and unioned with the other results.
val inputDF = spark.read.parquet(dataroot) //dataroot contains business date
val sourceList = inputDF.select(inputDF("source_system")).distinct.collect.map(_(0)).toList //list of source systems for businessdate
sourceList.foreach(println(_))
for (ss <- sourceList) { /* process data for this source system */ }
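A fleshed-out sketch of that per-source loop, assuming a spark session is in scope, the directory layout from the question, that the source-system partition column surfaces as SourceSystem, and purely hypothetical standard column names:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// Uniform column names, in the positional order shared by every source system (hypothetical).
val standardNames = Seq("orderdate", "orderid", "amount")

val businessDateRoot = "dataroot/BusinessDate=20170809"
val inputDF = spark.read.parquet(businessDateRoot)

// Distinct source systems present for this business date (discovered from the partition column).
val sourceList = inputDF.select("SourceSystem").distinct.collect.map(_.getString(0)).toList
sourceList.foreach(println(_))

// Read each source system separately, rename its columns positionally to the standard schema,
// tag the rows with their source, and union everything into one uniform DataFrame.
val uniformDF: DataFrame = sourceList
  .map { ss =>
    spark.read
      .parquet(s"$businessDateRoot/SourceSystem=$ss")
      .toDF(standardNames: _*)
      .withColumn("SourceSystem", lit(ss))
  }
  .reduce(_ union _)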

Save result for multiple requests

I made a Python batch that compares a filtered dataset of vertices with a specific vertex. My issue is that I need to execute this batch more than 35,000 times (with the same filtered dataset).
At first, I was querying this filtered dataset every time (30sec/request):
SELECT #rid FROM Expression WHERE OUT("identifie").asList().size() > 3
Then, I decided to save all the rids from this dataset in a Python variable and select from those rids in every request (5 sec/request):
LET $filtered_expressions = (SELECT FROM [92:0, 92:1, ... , 98:2])
So, I'm wondering: can I save this filtered dataset in the database's memory, or something like that, to improve my batch? It's this query that costs the most time in every request.

TorQ: How to use dataloader partitioning with separate tables that have different date ranges?

I'm trying to populate a prices and quotes database using AquaQ's TorQ. For this purpose I use the .loader.loadallfiles function. The difference is that prices is daily data and quotes is more intraday, e.g. FX rates.
I do the loading as follows:
/- check the location of database directory
hdbdir:hsym `$getenv[`KDBHDB]
/hdbdir:#[value;`hdbdir;`:hdb]
rawdatadir:hsym `$getenv[`KDBRAWDATA]
target:hdbdir;
rawdatadir:hsym `$("" sv (getenv[`KDBRAWDATA]; "prices"));
.loader.loadallfiles[`headers`types`separator`tablename`dbdir`partitioncol`partitiontype`dataprocessfunc!(`date`sym`open`close`low`high`volume;"DSFFFFF";enlist ",";`prices;target;`date;`year;{[p;t] `date`sym`open`close`low`high`volume xcols update volume:"i"$volume from t}); rawdatadir];
rawdatadir:hsym `$("" sv (getenv[`KDBRAWDATA]; "quotes"));
.loader.loadallfiles[`headers`types`separator`tablename`dbdir`partitioncol`partitiontype`dataprocessfunc!(`date`sym`bid`ask;"ZSFF";enlist ",";`quotes;target;`date;`year;{[p;t] `date`sym`bid`ask`mid xcols update mid:(bid+ask)%2.0 from t}); rawdatadir];
and this works fine. However, when loading the database I get errors attempting to select from either table. The reason is that for some partitions there is no prices data or no quotes data, e.g. attempting to:
quotes::`date`sym xkey select from quotes;
fails with an error saying that the partition for a year, e.g. hdb/2000/, doesn't exist, which is true: there are only prices for year 2000 and no quotes.
As I see it there are two possible solutions, but I don't know how to implement either:
Tell .loader.loadallfiles to create empty schema for quotes and prices in partitions for which there isn't any data.
While loading the database, gracefully handle the case where there is no data for a given partition i.e. select from ... where ignore empty partitions
Try using .Q.chk[`:hdb]
Where `:hdb is the filepath of your HDB
This fills in missing tables, which will then allow you to perform queries.
Alternatively you can use .Q.bv, where the wiki states:
If your table exists in the latest partition (so there is a prototype
for the schema), then you could use .Q.bv[] to create empty tables
on the fly at run-time without having to create those empties on disk.