In Snowplow, is it compulsory to use DynamoDB in the stream enrich process?

I am trying to develop a working example of Snowplow click tracking. I have to set up the enrichment process to enrich raw data on a Kinesis stream. But when I run the JAR file, I get this error:
ERROR com.amazonaws.services.kinesis.leases.impl.LeaseManager - Failed to get table status for SnowplowEnrich-${enrich.streams.in.raw}
Is DynamoDB a necessity for the enrichment process?

It depends: in batch mode, DynamoDB is not necessary for the enrichment process; DynamoDB is used by the RDB Shredder.
Which release are (or were) you trying to install? For a PoC you can use Snowplow Mini.
The Snowplow community is active at discourse.snowplowanalytics.com.
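For the stream (Kinesis) setup specifically, the error in the question comes from the Kinesis Client Library, which keeps its lease/checkpoint state in a DynamoDB table named after the enrich application; the unresolved ${enrich.streams.in.raw} placeholder in the table name suggests the config placeholders were not substituted. A minimal sketch (not part of Snowplow; the table name, region, and credential setup are assumptions) to check whether that lease table exists, using the AWS SDK for Java v1 that the error originates from:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.ResourceNotFoundException;

public class LeaseTableCheck {
    public static void main(String[] args) {
        // Hypothetical resolved table name; the KCL derives it from the enrich app name.
        String leaseTable = "SnowplowEnrich-my-raw-stream";

        // Uses the default credential/region chain; adjust for your environment.
        AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();
        try {
            String status = dynamo.describeTable(leaseTable).getTable().getTableStatus();
            System.out.println(leaseTable + " exists with status " + status);
        } catch (ResourceNotFoundException e) {
            System.out.println(leaseTable + " does not exist yet; the KCL will try to create it,"
                    + " which requires DynamoDB permissions for the enrich role.");
        }
    }
}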

Related

How to copy Druid data source data from prod server to QA server (like hive distcp action)

I wanted to check if there is a way to copy Druid datasource data (segments) from one server to another. Our requirement is to load new data into prod Druid (using SQL queries) and copy the same data to the QA Druid server. We are using the Hive Druid storage handler to load the data, and HDFS as deep storage.
I read the Druid documentation but did not find any useful information.
There is currently no way to do this cleanly in Druid.
If you really need this feature, please request it by creating a GitHub ticket at https://github.com/apache/druid/issues.
The workaround is documented here: https://docs.imply.io/latest/migrate/#the-new-cluster-has-no-data-and-can-access-the-old-clusters-deep-storage
Full disclosure: I work for imply.

Using cosmos db for Spring batch job repository

Is it possible to use CosmosDB as a job repository for Spring Batch?
If that is not possible, can we go with an in-memory DB to handle our Spring batch jobs?
The job itself is triggered on message arrival in a remote queue. We use a variation of the process indicator in our current Spring Batch job to keep track of "chunks" that are being processed. Our readers also have saveState disabled. The reader always uses a DB query to avoid picking up the same chunks and prevent duplicate processing.
We don't commit the message on the queue until all records for that job are processed. So if the node dies and comes back up in the middle of processing, the same message is redelivered, which takes care of job restarts. Given all this, we have a choice of either coming up with a way to implement a Cosmos job repository, or simply using an in-memory one and plugging in an "afterJob" listener to clean up the in-memory job data so that Java memory is not used up in prod. Any recommendations?
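As a rough illustration of the "afterJob" cleanup idea above (a sketch only, assuming Spring Batch 4.x where the Map-based job repository is still available; the class name is hypothetical):

import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobExecutionListener;
import org.springframework.batch.core.repository.support.MapJobRepositoryFactoryBean;

// Clears the in-memory job metadata once a run finishes, so heap usage does not
// grow across jobs. Restarts are covered by queue redelivery, not the repository.
public class InMemoryMetadataCleanupListener implements JobExecutionListener {

    private final MapJobRepositoryFactoryBean repositoryFactory;

    public InMemoryMetadataCleanupListener(MapJobRepositoryFactoryBean repositoryFactory) {
        this.repositoryFactory = repositoryFactory;
    }

    @Override
    public void beforeJob(JobExecution jobExecution) {
        // nothing to do before the job starts
    }

    @Override
    public void afterJob(JobExecution jobExecution) {
        repositoryFactory.clear(); // drop all job/step execution metadata held in memory
    }
}

Registering it with .listener(...) on the job definition would make it run after every execution.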
Wanted to provide information that Azure Cosmos DB just released v3 of the Spring Data connector for the SQL API:
The Spring on Azure team, in partnership with the Azure Cosmos DB team, are proud to have just made the Spring Data Azure Cosmos DB v3 generally available. This is the latest version of Azure Cosmos DB’s SQL API Spring Data connector.
Also, Spring.io has an example microservices solution (Spring Cloud Data Flow) based on batch that could be used as an example for your solution.
Additional Information:
Spring Data Azure Cosmos DB v3 for Core (SQL) API: Release notes and resources (link)
A well-written third-party blog that is super helpful:
Introduction to Spring Data Azure Cosmos DB (link)
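Note that the v3 connector gives you Spring Data repositories over Cosmos DB rather than a drop-in Spring Batch JobRepository, so you would still persist your own chunk/process-indicator state with it. A minimal sketch of that, assuming the azure-spring-data-cosmos v3 API; the entity, container, and credential values are hypothetical:

import com.azure.cosmos.CosmosClientBuilder;
import com.azure.spring.data.cosmos.config.AbstractCosmosConfiguration;
import com.azure.spring.data.cosmos.core.mapping.Container;
import com.azure.spring.data.cosmos.core.mapping.PartitionKey;
import com.azure.spring.data.cosmos.repository.CosmosRepository;
import com.azure.spring.data.cosmos.repository.config.EnableCosmosRepositories;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.annotation.Id;

import java.util.List;

@Configuration
@EnableCosmosRepositories
class CosmosConfig extends AbstractCosmosConfiguration {

    @Bean
    public CosmosClientBuilder cosmosClientBuilder() {
        // Replace with your account endpoint and key (or a Key Vault reference).
        return new CosmosClientBuilder()
                .endpoint("https://<your-account>.documents.azure.com:443/")
                .key("<your-key>");
    }

    @Override
    protected String getDatabaseName() {
        return "batch-tracking"; // hypothetical database name
    }
}

// One document per processed chunk, i.e. the "process indicator" state.
@Container(containerName = "chunk-state")
class ChunkState {
    @Id
    private String id;
    @PartitionKey
    private String jobName;
    private String status;

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getJobName() { return jobName; }
    public void setJobName(String jobName) { this.jobName = jobName; }
    public String getStatus() { return status; }
    public void setStatus(String status) { this.status = status; }
}

interface ChunkStateRepository extends CosmosRepository<ChunkState, String> {
    List<ChunkState> findByJobNameAndStatus(String jobName, String status);
}

The repository can then back the same process-indicator query your reader already performs, while job restartability stays with queue redelivery as described above.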

How to get data transfer completion status in NiFi for SFTP transfer

I have created a flow in NiFi to transfer data from one Linux machine to another.
Flow is like this:
GetSFTP-->UpdateAttribute-->PutSFTP
I am managing everything through the NiFi REST API, i.e. creating the flow, updating attributes, and starting the flow.
How can I get the completion status of the data transfer, so that I can stop the flow?
Thanks.
The concept of being "complete" is something that NiFi can't really know here. How would NiFi know that another file isn't going to be added to the directory that GetSFTP is watching?
From NiFi's perspective the dataflow is running until someone says to stop it. It is not really meant to be a job system where you submit a job that starts and completes; it is a running dataflow that handles an infinite stream of data.
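If you do build your own notion of "done" outside of NiFi (for example, your orchestration code decides that all expected files have arrived), you can stop the whole group through the same REST API you are already using. A rough sketch, assuming an unsecured NiFi instance, the /nifi-api/flow/process-groups/{id} scheduling endpoint, and a hypothetical process group id:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class StopProcessGroup {
    public static void main(String[] args) throws Exception {
        // Hypothetical values: NiFi host and the id of the process group that
        // contains the GetSFTP -> UpdateAttribute -> PutSFTP flow.
        String nifiUrl = "http://localhost:8080/nifi-api";
        String processGroupId = "abc-123";

        // This endpoint schedules (starts/stops) all components inside the group.
        String body = "{\"id\":\"" + processGroupId + "\",\"state\":\"STOPPED\"}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(nifiUrl + "/flow/process-groups/" + processGroupId))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Stop request returned HTTP " + response.statusCode());
    }
}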

Way to check Spring Cloud stream source and sink data content

Is there any way I can check what data is in a Spring Cloud Data Flow stream source (say some named destination ":mySource") and sink (say "log" as the sink)?
e.g. dataflow:>stream create --name demo --definition ":mySource>log"
What is in mySource and log here, and how do I check it?
Do I have to check the Spring Cloud Data Flow logs somewhere to get a clue, if it has logs at all? If so, where are the logs located in a Windows environment?
If you're interested in the payload content, you can deploy the stream with DEBUG logging for the Spring Integration package, which will print the header and payload information among many other interesting lifecycle details. The logs will show either the payload consumed or produced, depending on the application type (i.e., source, processor, or sink).
In your case, you can view the payload consumed by the log-sink via:
dataflow:>stream create --name demo --definition ":mySource > log --logging.level.org.springframework.integration=DEBUG"
We have plans to add native provenance/lineage support with the help of Zipkin and Sleuth in the future releases.

How to continuously write MongoDB data into a running HDInsight cluster

I want to keep a Windows Azure HDInsight cluster always running so that I can periodically write updates from my master data store (which is MongoDB) and have it process MapReduce jobs on demand.
How can I periodically sync data from MongoDB with the HDInsight service? I'm trying to avoid uploading all the data whenever a new query is submitted (which can happen at any time), and instead have it somehow pre-warmed.
Is that possible on HDInsight? Is it even possible with Hadoop?
Thanks,
It is certainly possible to have that data pushed from Mongo into Hadoop.
Unfortunately, HDInsight does not support HBase (yet); otherwise you could use something like ZeroWing, a solution from Stripe that reads the MongoDB oplog used by Mongo for replication and then writes it out to HBase.
Another solution might be to write out documents from your Mongo to Azure Blob storage. This means you wouldn't have to keep the cluster up all the time, but you would still be able to run periodic MapReduce analytics against the files in the storage vault.
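A rough sketch of that blob-storage approach, using the current MongoDB Java driver and Azure Storage Blob SDK purely as an illustration; the connection strings, database/collection names, the updatedAt field, and the container name are all assumptions:

import com.azure.storage.blob.BlobContainerClient;
import com.azure.storage.blob.BlobServiceClientBuilder;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.util.Date;

public class MongoToBlobSync {
    public static void main(String[] args) {
        // Hypothetical connection details; adjust for your environment.
        String mongoUri = "mongodb://localhost:27017";
        String blobConnectionString = "<azure-storage-connection-string>";
        Date since = Date.from(Instant.now().minusSeconds(3600)); // e.g. the last hour

        // Assumes the "hdinsight-input" container already exists in the storage account.
        BlobContainerClient container = new BlobServiceClientBuilder()
                .connectionString(blobConnectionString)
                .buildClient()
                .getBlobContainerClient("hdinsight-input");

        try (MongoClient mongo = MongoClients.create(mongoUri)) {
            MongoCollection<Document> events = mongo.getDatabase("master").getCollection("events");

            // Export each recently updated document as a JSON blob that the
            // HDInsight cluster can later read from the storage account.
            for (Document doc : events.find(Filters.gte("updatedAt", since))) {
                byte[] json = doc.toJson().getBytes(StandardCharsets.UTF_8);
                container.getBlobClient("events/" + doc.get("_id") + ".json")
                        .upload(new ByteArrayInputStream(json), json.length, true);
            }
        }
    }
}

Run on a schedule, this keeps the blob container roughly in step with Mongo without keeping the cluster busy between jobs.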
Your best method is undoubtedly to use the Mongo Hadoop connector. This can be installed in HDInsight, but it's a bit fiddly. I've blogged a method here.