REST V2 ingest of JSON events to a S3 bucket (avoid duplicates) - rest

I would like to ask you for help.
I am trying to ingest events in JSON from a source using REST API (REST V2 connector) in a raw format.
The source allows me to pass parameters "take" and "days" in the headers. The parameter "take" allows us to specify how many records to take, parameter "days", specifies how old events to request.
The job I have created works fine for Data Ingestion to a database, where I map filed to columns in the database.
I tried million things, and the two recent problems I am facing when I try to ingest files into a bucket or database in raw format:
For mass ingestion: there are no incremental jobs options available (for REST V2 source), so I am getting duplicate records, and ingestion never stops.
Is there a way to stop mass ingestion and avoid duplicates when all records are ingested?
For Data integration to a DB: Each record/event I attempt to ingest has multiple fields. Since I DON'T want to separate the records (I WANT entire documents in JSON), I pack all files into an array. The problem is that when I request ten records (or N records), all records get ingested into a single row in a table.
Here is what I mean:
TABLE DB:
ROW1: "array packed" JSON1, JSON2 .... JSNO_N...JSON10 "/array packed"
ROW2: empty
This is what I need (each record in a separate row in raw format)
TABLE DB:
ROW1: JSON1
ROW2: JSON2
ROWN: JSON_N
ROW10: JSON_10
I was also trying to accomplish this using a lambda function. The problem with lambda is that I will have to make sure there is no duplicates (Informatica has this cool option "upsert" that allows me to avoid duplicates).
At the end of the day, I don't care if this will be accomplished using data integration, mass ingestion, or lambda and if ingest will be directly into DB or S3. For now, I am trying to find a working solution.
If somebody can come up with some ideas, I will appreciate the help.

Related

How to combine data from postgreSQL and dynamic json in grafana

I have a grafana dashboard where I want to use an orcestra cities map dashboard to show status of some stations. The status is available as json from a http server (using nagios for this part) but the status has no idea of the location of the stations. This I have in a postGIS database.
I know I can set up a script that reads the status json and inserts the data into a table in the postgis database. This can run each five minutes or something. This feels a bit kludgy, so I wonder if there are some other ways of doing this.
Could it be possible to use a foreign data wrapper to fetch the json into postgis? The only json fdw I have found is to read a set of files, I would need to read from a http server.
If not, is it possible to combine data from json and postgres in one data set in grafana? I can read in data from both sources and present them e.g. as time series in one panel, but here I need to be able to join the two so that I use some of the attributes from json to categorize the points from postgis (or the other way around if that should be easier)
In theory you can do that in the Grafana. You need to have 2 queries with results from both sources (how to write query, configure datasources for that is not in the scope of this question) + you need a key, which can be used for a join in both results (e.g. city_id).
Then you may use join transformation to "join" both query results into single dataset.

Azure Data Factory data flow - drops null columns

When using a data flow in azure data factory to move data, I've noticed that the data (at the sink) is missing columns that contains NULL values. When using the copy activity to copy the same data, the columns are present in the sink with their NULL values.
Record after a copy activity:
Record after a data flow:
Source is parquet, sink is azure cosmos db. My goal is to avoid defining any schemas, as I simply want to copy all of the data "as is". I've used the "allow schema drift" option on the source and sink.
I would just use the copy activity, but it doesn't appear to have the ability to define a maximum speed (RU consumption) like the data flow does, so the copy activity ends up consuming all of the cosmos db's RUs very quickly (as described here)
EDIT:
sink data preview shows all columns
sink inspect tab shows all columns
Dataflows always skip writing JSON tags with NULLs. There is no workaround currently other than copy activity.
This is really not a good design or behavior on Microsoft's part because you can't Standardize in Cosmos weather to "Keep" or "Remove" null fields in your JSON.
Querying Cosmos
Where field1 = NULL is completely different than where NOT IS_DEFINED (field1) and will yield an entirely different result set.
And if your users don't know if the ADF developer used a Dataflow with a Sink vs a Copy Activity in a Pipeline then the may get erroneous results in a query. The only way to ensure you get all the data is to always use:
Where field1 = NULL or where NOT IS_DEFINED (field1)
Users should not have to depend on knowing what kind of ADF functionality was chosen for a specific JSON document in a Cosmos NoSQL collection to do a query. Plus you can't standardize that you will "Keep" null across all Cosmos documents or you will "Remove" nulls across all Cosmos documents. Unless you force everyone to use Pipelines only or Dataflows only. Depending on the complexity using Pipeline only is not always possible. But using Dataflow only is also not always needed.

ELT pipeline for Mongo

I am trying to get my data into Amazon Redshift using Fivetran, but have some questions in general about the ELT/ETL process. My source database is Mongo but I want to perform deep analysis on the data using a 3rd party BI tool like Looker, but they integrate with SQL. I am new to the ELT/ETL process and was wondering would it look like this.
Extract data from Mongo (handled by Fivetran)
Load into Amazon Redshift (handled by Fivetran)
Perform Transformation - This is where my biggest knowledge gap is. I obviously have to convert objects and arrays into compatible SQL types. I can perform a transformation on all objects to extract those to columns and transform all arrays to a table. Is this the right idea? Should I design a MYSQL schema and write all the transformations according to that schema design?
as you state, Fivetran will load your data into Redshift putting individual fields in columns where it can and putting everything else into varchar columns as JSON. So at that point you basically have a Data Lake - all your data in an analytical platform but basically still in source format and available for you to do whatever you want with it.
Initially, if you don't know much about your data and just want to investigate it, you can probably leave it as it is. Redshift has SQL functions that allow you to query the elements of a JSON structure so there is no need to build additional tables and more ETL just to allow you to investigate your data - especially as these tables may get thrown away once you understand your data and decide what you want to do with it.
If you have proper reporting requirements then that is the point where you can start to design a schema that will support these requirements (I'm not sure why you suggested a MYSQL schema as MYSQL is a database vendor?). Traditionally an analytical schema would be designed as a Kimball Dimensional model (facts and dimensions) but the type of schema you decide to design will depend on:
The database platform you are using (in your case, Redshift) and the type of structures it works best with e.g. star schema or "flat" tables
The BI tool you are using and how it expects to have data presented to it
For example (and I'm not saying this is a real world example), if Redshift works ok with star schemas but better with flat tables and Looker has to have a star schema then it probably makes more sense to build star schemas in Redshift as this is a single modelling exercise - rather than model flat tables in Redshift and then have to model star schemas in Looker.
Hope this helps?
It depends on how you need the final stage of your data analysis presented, and what the purpose of your data analysis is. As stated by NickW, assuming you need to integrate your data into a BI tool the schema should be adapted according to the tool's data format requirements.
a mongodb ETL/ELT process might looks like this:
Select Connection: Select the set connection
Collection Name:Choose the collection by using the [database].[collection] format.
If you pulling data from your authentication database, only the [collection] name can be determined. Examples: ea sample.products east .
Extract Method:
All: pull the entire data in the table.
Incremental: pull data by incremental value.
Incremental Attributes: Set the name of the incremental attribute to run by. I.e: UpdateTime .
Incremental Type: Timestamp | Epoch. Choose the type of incremental attribute.
Choose Range:
In Timestamp, choose your date increment range to run by.
In Epoch, choose the value increment range to run by.
If no End Date/Value entered, the default is the last date/value in the table.
The increment will be managed automatically
Include End Value: Should the increment process take the end value or not
Interval Chunks: On what chunks the data will be pulled by. Split the data by minutes, hours, days, months or years.
Filter: Filter the data to pull. The filter format will be a MongoDB Extended JSON.
Limit: Limit the rows to pull.
Auto Mapping: You can choose the set of columns you want to bring, add a new column or leave it as it is.
Converting Entire Key Data As a STRING
In cases the data is not as expected by a target, like key names started with numbers, or flexible and inconsistent object data, You can convert attributes to a STRING format by setting their data types in the mapping section as STRING
Conversion exists for any value under that key.
Arrays and objects will be converted to JSON strings.
Use cases:
Here are few filtering examples:
{"account":{"$oid":"1234567890abcde"}, "datasource": "google", "is_deleted": {"$ne": true}}
date(MODIFY_DATE_START_COLUMN) >=date("2020-08-01")

AWS Glue: How to handle nested JSON with varying schemas

Objective:
We're hoping to use the AWS Glue Data Catalog to create a single table for JSON data residing in an S3 bucket, which we would then query and parse via Redshift Spectrum.
Background:
The JSON data is from DynamoDB Streams and is deeply nested. The first level of JSON has a consistent set of elements: Keys, NewImage, OldImage, SequenceNumber, ApproximateCreationDateTime, SizeBytes, and EventName. The only variation is that some records do not have a NewImage and some don't have an OldImage. Below this first level, though, the schema varies widely.
Ideally, we would like to use Glue to only parse this first level of JSON, and basically treat the lower levels as large STRING objects (which we would then parse as needed with Redshift Spectrum). Currently, we're loading the entire record into a single VARCHAR column in Redshift, but the records are nearing the maximum size for a data type in Redshift (maximum VARCHAR length is 65535). As a result, we'd like to perform this first level of parsing before the records hit Redshift.
What we've tried/referenced so far:
Pointing the AWS Glue Crawler to the S3 bucket results in hundreds of tables with a consistent top level schema (the attributes listed above), but varying schemas at deeper levels in the STRUCT elements. We have not found a way to create a Glue ETL Job that would read from all of these tables and load it into a single table.
Creating a table manually has not been fruitful. We tried setting each column to a STRING data type, but the job did not succeed in loading data (presumably since this would involve some conversion from STRUCTs to STRINGs). When setting columns to STRUCT, it requires a defined schema - but this is precisely what varies from one record to another, so we are not able to provide a generic STRUCT schema that works for all the records in question.
The AWS Glue Relationalize transform is intriguing, but not what we're looking for in this scenario (since we want to keep some of the JSON intact, rather than flattening it entirely). Redshift Spectrum supports scalar JSON data as of a couple weeks ago, but this does not work with the nested JSON we're dealing with. Neither of these appear to help with handling the hundreds of tables created by the Glue Crawler.
Question:
How would we use Glue (or some other method) to allow us to parse just the first level of these records - while ignoring the varying schemas below the elements at the top level - so that we can access it from Spectrum or load it physically into Redshift?
I'm new to Glue. I've spent quite a bit of time in the Glue documentation and looking through (the somewhat sparse) info on forums. I could be missing something obvious - or perhaps this is a limitation of Glue in its current form. Any recommendations are welcome.
Thanks!
I'm not sure you can do this with a table definition, but you can accomplish this with an ETL job by using a mapping function to cast the top level values as JSON strings. Documentation: [link]
import json
# Your mapping function
def flatten(rec):
for key in rec:
rec[key] = json.dumps(rec[key])
return rec
old_df = glueContext.create_dynamic_frame.from_options(
's3',
{"paths": ['s3://...']},
"json")
# Apply mapping function f to all DynamicRecords in DynamicFrame
new_df = Map.apply(frame=old_df, f=flatten)
From here you have the option of exporting to S3 (perhaps in Parquet or some other columnar format to optimize for querying) or directly into Redshift from my understanding, although I haven't tried it.
This is a limitation of Glue as of now. Have you taken a look at Glue Classifiers? It's the only piece I haven't used yet, but might suit your needs. You can define a JSON path for a field or something like that.
Other than that - Glue Jobs are the way to go. It's Spark in the background, so you can do pretty much everything. Set up a development endpoint and play around with it. I've run against various roadblocks for the last three weeks and decided to completely forgo any and all Glue functionality and only Spark, that way it's both portable and actually works.
One thing you might need to keep in mind when setting up the dev endpoint is that the IAM role must have a path of "/", so you will most probably need to create a separate role manually that has this path. The one automatically created has a path of "/service-role/".
you should add a glue classifier preferably $[*]
When you crawl the json file in s3, it will read the first line of the file.
You can create a glue job in order to load the data catalog table of this json file into the redshift.
My only problem with here is that Redshift Spectrum has problems reading json tables in the data catalog..
let me know if you have found a solution
The procedure I found useful to shallow nested json:
ApplyMapping for the first level as datasource0;
Explode struct or array objects to get rid of element level
df1 = datasource0.toDF().select(id,col1,col2,...,explode(coln).alias(coln), where explode requires from pyspark.sql.functions import explode;
Select the JSON objects that you would like to keep intact by intact_json = df1.select(id, itct1, itct2,..., itctm);
Transform df1 back to dynamicFrame and Relationalize the
dynamicFrame as well as drop the intact columns by dataframe.drop_fields(itct1, itct2,..., itctm);
Join relationalized table with the intact table based on 'id'
column.
As of 12/20/2018, I was able to manually define a table with first level json fields as columns with type STRING. Then in the glue script the dynamicframe has the column as a string. From there, you can do an Unbox operation of type json on the fields. This will json parse the fields and derive the real schema. Combining Unbox with Filter allows you to loop through and process heterogeneous json schemas from the same input if you can loop through a list of schemas.
However, one word of caution, this is incredibly slow. I think that glue is downloading the source files from s3 during each iteration of the loop. I've been trying to find a way to persist the initial source data but it looks like .toDF derives the schema of the string json fields even if you specify them as glue StringType. I'll add a comment here if I can figure out a solution with better performance.

Spark Streaming dropDuplicates

Spark 2.1.1 (scala api) streaming json files from an s3 location.
I want to deduplicate any incoming records based on an ID column (“event_id”) found in the json for every record. I do not care which record is kept, even if duplication of the record is only partial. I am using append mode as the data is merely being enriched/filtered, with no group by/window aggregations, via the spark.sql() method. I then use the append mode to write parquet files to s3.
According to the documentation, I should be able to use dropDuplicates without watermarking in order to deduplicate (obviously this is not effective in long-running production). However, this fails with the error:
User class threw exception: org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets
That error seems odd as I am doing no aggregation (unless dropDuplicates or sparkSQL counts as an aggregation?).
I know that duplicates won’t occur outside 3 days of each other, so I then tried it again by adding a watermark (by using .withWatermark() immediately before the drop duplicates). However, it seems to want to wait until 3 days are up before writing the data. (ie since today is July 24, only data up to the same time on July 21 is written to the output).
As there is no aggregation, I want to write every row immediately after the batch is processed, and simply throw away any rows with an event id that has occurred in the previous 3 days. Is there a simple way to accomplish this?
Thanks
In my case, I used to achieve that in two ways through DStream :
One way:
load tmp_data(contain 3 days unique data, see below)
receive batch_data and do leftOuterJoin with tmp_data
do filter on step2 and output new unique data
update tmp_data with new unique data through step2's result and drop old data(more than 3 days)
save tmp_data on HDFS or whatever
repeat above again and again
Another way:
create a table on mysql and set UNIQUE INDEX on event_id
receive batch_data and just save event_id + event_time + whatever to mysql
mysql will ignore duplicate automatically
Solution we used was a custom implementation of org.apache.spark.sql.execution.streaming.Sink that inserts into a hive table after dropping duplicates within batch and performing a left anti join against the previous few days worth of data in the target hive table.