How to segregate large real-time data in MongoDB

Let me explain the problem.
We receive real-time data of roughly 200,000 records per day.
Some of these records are of special significance. The attributes that mark them as significant are pushed into a reference collection. Let us say each row in the Master collection has the following attributes:
a. ID b. Type c. Event 1 d. Event 2 e. Event 3 f. Event 4
For the special markers, we identify them as
Marker1 -- Event 1 -- Value1
Marker2 -- Event 3 -- Value1
Marker3 -- Event 1 -- Value2
and so on. We can add 10000 such markers.
Further, the attribute Type can be Image, Video, Text, or Others. Hence the idea is to segregate the data based on Type, which means we create 4 collections out of the Master collection. This is because we have to run searches on the collections based on Type and also run some processing. The marker data should show in a different tab on the search screen.
We shall also be running a wildcard search on the Master collection.
We are running crons to do these processes:
I. Dumping data into the Master collection - Cron 1
II. Assigning markers - Cron 2
III. Segregating data based on Type - Cron 3
These run in sequence: Cron 1 - Cron 2 - Cron 3.
But assigning markers and segregating the data take a very long time. We are using Python as the scripting language.
In fact, the crons don't seem to work at all. The scripts run fine from the command prompt, but scheduling them in crontab does not work. We are giving absolute paths to the files. The crons are scheduled 3 minutes apart.
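Our crontab entries look roughly like this (the interpreter path and script names here are just placeholders, not our real paths):
# run the three scripts 3 minutes apart, using absolute paths
0 * * * * /usr/bin/python /opt/app/cron1_dump_master.py
3 * * * * /usr/bin/python /opt/app/cron2_assign_markers.py
6 * * * * /usr/bin/python /opt/app/cron3_segregate_by_type.py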
Can someone help?

Yes, I also faced this problem, and I got around it by moving the data in small chunks. In my experience, sharding is not a better way to solve this kind of problem, and the same goes for a replica set.
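For example, a minimal sketch of moving the data in small chunks with pymongo could look like this (the connection string, collection names, and the "segregated" flag are assumptions about your setup, not things you described):

import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")  # hypothetical connection string
db = client["master_db"]                                   # hypothetical database name
BATCH = 1000                                               # move the data in small batches

def segregate_by_type(doc_type):
    # Copy not-yet-segregated documents of one Type into a per-Type collection,
    # then flag them in the master collection so they are not copied again.
    target = db["master_" + doc_type.lower()]
    while True:
        docs = list(db.master.find({"Type": doc_type, "segregated": {"$ne": True}}).limit(BATCH))
        if not docs:
            break
        target.insert_many(docs)
        db.master.update_many(
            {"_id": {"$in": [d["_id"] for d in docs]}},
            {"$set": {"segregated": True}},
        )

for t in ["Image", "Video", "Text", "Others"]:
    segregate_by_type(t)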

Related

AnyLogic: split an agent into multiple different agent types

I have a problem with a simulation in AnyLogic.
I have an item (agent) that must be processed by a resource. The result of this Service block is the original object plus two different documents, which are processed in two separate offices and which, at the end of the flow, have to be linked back to the article in question.
I can't find a way to do this split into 3 different agents, or in general to model this flow.
Thanks in advance.
You can use 2 Split blocks to generate 2 independent documents and connect them through a variable or a link to agents... maybe each original agent has an id, and the copies created in the Split block get something like agent.id = original.id;
Then, after the documents are processed, you can check which ones have the same id to merge them back into an article...
But if you want something more elaborate, there's also the following option:
Create 2 Enter blocks (enter1 and enter2), one for each document. I will assume your documents correspond to 2 different agent types called Document1 and Document2.
On each of the agent types, add a link to agents so that the documents can be connected to each other. Read more about link to agents in the help documentation if you don't know what that is.
At the end of the Service block, in the "on exit" action, you can do the following:
// create one instance of each document type
Document1 doc1 = add_Document1();
Document2 doc2 = add_Document2();
// connect the two documents through the link to agents
doc1.linkToDoc2.connectTo(doc2);
// inject the new documents into their flows through the Enter blocks
enter1.take(doc1);
enter2.take(doc2);
I don't know if your original agent has to be connected, but you would follow the same principle to do that.
Later, you can just check whether the connected documents are completed in order to join them into an article again.

Streaming data Complex Event Processing over files and a rather long period

My challenge:
We receive files every day with about 200,000 records. We keep the files for approximately 1 year to support re-processing, etc.
For the sake of the discussion, assume it is some sort of long-lasting fulfilment process, with a provisioning ID that correlates records.
We need to identify flexible patterns in these files and trigger events.
Typical questions are:
If record A is followed by record B, which is followed by record C, and all records occurred within 60 days, then trigger an event (see the sketch after this list).
If record D or record E was found, but record F did NOT follow within 30 days, then trigger an event.
If both record D and record E were found (irrespective of the order), followed by ... within 24 hours, then trigger an event.
Some patterns require lookups in a DB/NoSQL store, or joins for additional information, either to select the record or to put into the event.
"Selecting a record" can be a simple "field-A equals", but can also be "field-A in []" or "field-A match" or "func identify(field-A, field-B)".
"Days" might also be "hours" or "in the previous month", hence more flexible than just "days". Usually we have some date/timestamp in the record. The maximum is currently "within 6 months" (cancellation within the setup phase).
The created events (preferably JSON) need to contain data from all records that were part of the selection process.
We need an approach that allows us to flexibly change (add, modify, delete) the patterns, optionally re-processing the input files.
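To make the first typical question concrete, here is a minimal plain-Python sketch of the kind of check we mean, over one correlated group of records (the record layout and field names are made up):

from datetime import timedelta

def matches_a_b_c(records, window=timedelta(days=60)):
    # records: dicts for one provisioning ID, each with "record_type" and "timestamp"
    # Pattern: A followed by B followed by C, all within the given window.
    records = sorted(records, key=lambda r: r["timestamp"])
    sequence = ["A", "B", "C"]
    matched = []
    for rec in records:
        if rec["record_type"] == sequence[len(matched)]:
            matched.append(rec)
            if len(matched) == len(sequence):
                if matched[-1]["timestamp"] - matched[0]["timestamp"] <= window:
                    # the event carries data from all records in the match
                    return {"pattern": "A-B-C", "records": matched}
                return None
    return None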
Any thoughts on how to tackle the problem elegantly? Maybe some Python or Java framework, or do any of the public cloud solutions (AWS, GCP, Azure) address this problem space especially well?
Thanks a lot for your help.
After some discussions and reading, we will first try Apache Flink with the FlinkCEP library. From the docs and blog entries, it seems to be able to do the job. It also seems to be AWS's choice, running on their EMR cluster. We didn't find any managed service on GCP or Azure providing this functionality. Of course, we can always deploy and manage it ourselves. Unfortunately, we didn't find a Python framework.

Can you calculate active users using time series

My Atomist client exposes metrics on commands that are run. Each command is a metric with a username element as well as a status element.
I've been scraping this data for months without resetting the counts.
My requirement is to show the number of active users over a time period, i.e. 1h, 1d, 7d and 30d, in Grafana.
The original query was:
count(count({Username=~".+"}) by (Username))
This is an issue because I don't clear the metrics, so it's always a count since inception.
I then tried this:
count(
  max_over_time(help_command{job="Application Name",Username=~".+"}[1w])
  - max_over_time(help_command{job="Application Name",Username=~".+"}[1w] offset 1w)
  > 0
)
which works, but only for one command. I have about 50 other commands that need to be added to that count.
I also tried:
{__name__=~".+_command",job="app name"}[1w] offset 1w
but this is obviously very expensive (it times out in the browser) and has issues with max_over_time, which doesn't support it.
Any help? Am I using the metric in the wrong way? Is there a better way to query? My only option at the moment is the count (in the working format above) for each command.
Thanks in advance.
To start, I will point out a number of issues with your approach.
First, the Prometheus documentation recommends against using arbitrarily large sets of values for labels (as your usernames are). As you can see (based on your experience with the query timing out) they're not entirely wrong to advise against it.
Second, Prometheus may not be the right tool for analytics (such as active users). Partly due to the above, partly because it is inherently limited by the fact that it samples the metrics (which does not appear to be an issue in your case, but may turn out to be).
Third, you collect separate metrics per command (i.e. help_command, foo_command) instead of a single metric with the command name as a label (i.e. command_usage{command="help"}, command_usage{command="foo"}).
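For illustration, with the Python prometheus_client that could look something like the following (the metric and label names here are just examples, not what your exporter uses):

from prometheus_client import Counter

# One metric for all commands; the command name and status are labels,
# not part of the metric name.
COMMAND_USAGE = Counter(
    "command_usage",
    "Number of times a command was run",
    ["command", "status"],
)

# e.g. when the help command succeeds:
COMMAND_USAGE.labels(command="help", status="success").inc()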
To get back to your question though, you don't need the max_over_time, you can simply write your query as:
count by(__name__) (
  (
    {__name__=~".+_command",job="Application Name"}
    -
    {__name__=~".+_command",job="Application Name"} offset 1w
  ) > 0
)
This only works, though, because you say that whatever exports the counts never resets them. If that is simply because the exporter has never restarted, and the counts will drop to zero when it does, then you'd need to use increase instead of the subtraction, and you'd run into the exact same performance issues as with max_over_time.
count by(__name__) (
  increase({__name__=~".+_command",job="Application Name"}[1w]) > 0
)

How to take data from 2 databases (with the same schema) and copy it into 1 database using Data Factory

I want to take data from 2 databases and copy (coalesce) it into 1 using Data Factory.
The issue is that multiple inputs do not seem to be allowed for copy activities.
So I resorted to having 2 different datasets, which are exact copies but with different names, and then putting 2 different activities into the 1 pipeline, each using its specific output dataset.
It just seems odd and wrong to do it this way.
Can I have some help?
Is there no way of just copying data from 2 separate databases (which have the same structure but different data) to the 1 database?
The short answer is yes. But you need to work within the constraints of how ADF handles this.
A couple of things to help...
You'll always need at least 2 activities to do this when using the copy type activity. Microsoft of course charges per activity execution in ADF, so they aren't going to let you take shortcuts by having many inputs and outputs per single copy activity (a single charge).
The approach you show above is OK, and to pass the ADF validation, as you've found, you simply need to have the output datasets created separately and called different things, even if they still refer to the same underlying target table etc. This is really only a problem for the copy activity. What you could do is first land the data into separate staging tables in the Azure target database, just for the copy (1:1), and then have a third downstream activity that executes a stored procedure doing the union of the tables. In that case you could have 2 inputs and 1 output in the activity, if you want that level of control in ADF.
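For example, the downstream stored procedure could be as simple as the following (the table, column and procedure names are just placeholders):

CREATE PROCEDURE dbo.usp_MergeStagedData
AS
BEGIN
    -- Union the rows landed by the two copy activities into the final table.
    INSERT INTO dbo.TargetTable (Col1, Col2)
    SELECT Col1, Col2 FROM dbo.Staging_SourceDb1
    UNION ALL
    SELECT Col1, Col2 FROM dbo.Staging_SourceDb2;
END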
Final point: if you don't want the activities to execute in parallel, you could chain the datasets to enforce a fake dependency, or add a simple 'delay' clause to one of the copy operations. A delay on an activity would be simpler than provisioning a time slice offset.
Hope this helps

How can I use a real-time workflow in CRM 2015?

I have a real-time workflow for creating unique numbers. The workflow gets a numeric field from my custom entity, increases it by 1, and updates it for the next use.
I want to run this workflow on multiple records.
Running in on-demand mode, it works fine and I get correct, unique numbers, but in "Record is Created" mode it does not work correctly and I get repeated numbers.
What do I have to do?
This approach won't work. When the workflow runs automatically on create, it runs multi-threaded: e.g. two users create two records, and two instances of the workflow start. As there is no locking mechanism, you end up with duplicated numbers.
I'm guessing this isn't happening when running on demand because you are running as a single user.
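As a toy illustration (plain Python, nothing CRM-specific), this is the read-increment-write race the workflow instances run into:

import threading
import time

counter = {"value": 0}   # stands in for the auto-number field on the entity
assigned = []

def workflow_instance():
    # Each instance reads the current number, then writes back current + 1,
    # with no locking between the read and the update.
    current = counter["value"]
    time.sleep(0.01)     # the gap in which the other instance reads the same value
    counter["value"] = current + 1
    assigned.append(current + 1)

threads = [threading.Thread(target=workflow_instance) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(assigned)          # typically prints [1, 1]: both records got the same number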
You will need to implement a custom auto number approach, such as Auto Number for DynamicsCRM.
Disclaimer: I work for Gap Consulting who produce the tool linked above.