Map Reduce for analyzing time series - mongodb

I am new to map reduce concept and wonder if the following problem can be solved using it.
We have a log of data in the form like this:
TransID Date Operation DocumentID User
1 01/01/2010 Open aaa Anne
2 01/11/2010 Close aaa Anne
3 01/12/2010 Open bbb Mary
4 01/12/2010 Close bbb Mary
We want to be able to calculate different time metrics, such as:
How much time passes between Open and Close operations average globally? or
How much time passes between Open and Close average per each user?
Is there a simple way to achieve this with map-reduce? We are considering MongoDB or Hadoop.
The amount of data can be large - billions of records. Thanks!

The trick here is you need to "flatten" your data during the map phase and send that to the reducer for your calculation. So your key would be DocumentID (and maybe User depending on your use case) and then the value is the time and operation (put time first if it sorts better that way). In your reducer the rows above would only results in rows being able to loop through within a key. Here is an example of something very similar https://allthingshadoop.com/2011/12/16/simple-hadoop-streaming-tutorial-using-joins-and-keys-with-python/

Related

Tableau - Calculated field of different columns based on different partition of the same table

Sorry for the stupid question.
Situation: I have a partitioned table (the partition is the week of the year) with some metrics (e.g. frequency of some keywords); I need to run an analysis of metrics belonging to different partitions (e.g. the trend between the frequency of a keyword in week 32 compared to week 3). The ultimate purpose is to create a dashboard where the user can choose the week of the year and is presented with the calculated analysis on the go.
So far I have used a live query that uses two parameters (week_1 and week_2) that joins data from the same table based on the two different parameters. You can imagine that the dashboard recomputes everything once one of the parameter is changed by the user. To avoid long waiting times, I have set the two parameters to a non-existent default value (0, zero), so that the dashboard can open very quickly. Then I prompt the user to stop the dashboard, insert the new parameters of choice, and then restart the dashboard to load the new computations.
My question is: is it possible to achieve the same by using an extract of the table? The table itself should not be excessively big (it should be 15 million records spanning 3 years) and as far as I know the extracts are performant with those numbers.
I am quite new to Tableau, so I would like to know from more expert people if there is a more optimal way to do such a thing without using live queries.
Please, feel free to ask more information if I was not clear! However, I cannot share my workbook, as it contains sensitive information.
Edit:
+-----------+ -----------+ ------------+
partition keyword frequency
+-----------+ -----------+ ------------+
202032 hello 5000
202032 ciao 567
...
202031 hello 2323
202031 ciao 34567
...
20203 hello 2
20203 ciao 1000
With the live query, I can join the table where partition = 202032 with the same table where partition - 20203 and make a new table with a column where I compute e.g. a trend between the two frequencies:
+----------+ -----------------------+ ---------------+
keyword partitions_compared trend
+----------+ -----------------------+ ---------------+
hello 202032 - 20203 +1billion %
ciao 202032 - 20203 +1K %
With the live query I join on the keywords.
Thanks a lot in advance and have a great day!
Cheers

Is is possible limit the number of rows in the output of a Dataprep flow?

I'm using Dataprep on GCP to wrangle a large file with a billion rows. I would like to limit the number of rows in the output of the flow, as I am prototyping a Machine Learning model.
Let's say I would like to keep one million rows out of the original billion. Is this possible to do this with Dataprep? I have reviewed the documentation of sampling, but that only applies to the input of the Transformer tool and not the outcome of the process.
You can do this, but it does take a bit of extra work in your Recipe--set up a formula in a new column using something like RANDBETWEEN to give you a random integer output between 1 and 1,000 (in this million-to-billion case). From there, you can filter rows based on whatever random integer between 1 and 1,000 as what you'll keep, and then your output will only have your randomized subset. Just have your last part of the recipe remove this temporary column.
So indeed there are 2 approaches to this.
As Courtney Grimes said, you can use one of the 2 functions that create random-number out of a range.
randbetween :
rand :
These methods can be used to slice an "even" portion of your data. As suggested, a randbetween(1,1000) , then pick 1<x<1000 to filter, because it's 1\1000 of data (million out of a billion).
Alternatively, if you just want to have million records in your output, but either
Don't want to rely on the knowledge of the size of the entire table
just want the first million rows, agnostic to how many rows there are -
You can just use 2 of these 3 row filtering methods: (top rows\ range)
P.S
By understanding the $sourcerownumber metadata parameter (can read in-product documentation), you can filter\keep a portion of the data (as per the first scenario) in 1 step (AKA without creating an additional column.
BTW, an easy way of "discovery" of how-to's in Trifacta would be to just type what you're looking for in the "search-transtormation" pane (accessed via ctrl-k). By searching "filter", you'll get most of the relevant options for your problem.
Cheers!

How to get all missing days between two dates

I will try to explain the problem on an abstract level first:
I have X amount of data as input, which is always going to have a field DATE. Before, the dates that came as input (after some process) where put in a table as output. Now, I am asked to put both the input dates and any date between the minimun date received and one year from that moment. If there was originally no input for some day between this two dates, all fields must come with 0, or equivalent.
Example. I have two inputs. One with '18/03/2017' and other with '18/03/2018'. I now need to create output data for all the missing dates between '18/03/2017' and '18/04/2017'. So, output '19/03/2017' with every field to 0, and the same for the 20th and 21st and so on.
I know to do this programmatically, but on powercenter I do not. I've been told to do the following (which I have done, but I would like to know of a better method):
Get the minimun date, day0. Then, with an aggregator, create 365 fields, each has that "day0"+1, day0+2, and so on, to create an artificial year.
After that we do several transformations like sorting the dates, union between them, to get the data ready for a joiner. The idea of the joiner is to do an Full Outer Join between the original data, and the data that is going to have all fields to 0 and that we got from the previous aggregator.
Then a router picks with one of its groups the data that had actual dates (and fields without nulls) and other group where all fields are null, and then said fields are given a 0 to finally be written to a table.
I am wondering how can this be achieved by, for starters, removing the need to add 365 days to a date. If I were to do this same process for 10 years intead of one, the task gets ridicolous really quick.
I was wondering about an XOR type of operation, or some other function that would cut the number of steps that need to be done for what I (maybe wrongly) feel is a simple task. Currently I now need 5 steps just to know which dates are missing between two dates, a minimun and one year from that point.
I have tried to be as clear as posible but if I failed at any point please let me know!
Im not sure what the aggregator is supposed to do?
The same with the 'full outer' join? A normal join on a constant port is fine :) c
Can you calculate the needed number of 'dublicates' before the 'joiner'? In that case a lookup configured to return 'all rows' and a less-than-or-equal predicate can help make the mapping much more readable.
In any case You will need a helper table (or file) with a sequence of numbers between 1 and the number of potential dublicates (or more)
I use our time-dimension in the warehouse, which have one row per day from 1753-01-01 and 200000 next days, and a primary integer column with values from 1 and up ...
You've identified you know how to do this programmatically and to be fair this problem is more suited to that sort of solution... but that doesn't exclude powercenter by any means, just feed the 2 dates into a java transformation, apply some code to produce all dates between them and for a record to be output for each. Java transformation is ideal for record generation
You've identified you know how to do this programmatically and to be fair this problem is more suited to that sort of solution... but that doesn't exclude powercenter by any means, just feed the 2 dates into a java transformation, apply some code to produce all dates between them and for a record to be output for each. Java transformation is ideal for record generation
Ok... so you could override your source qualifier to achieve this in the selection query itself (am giving Oracle based example as its what I'm used to and I'm assuming your data in is from a table). I looked up the connect syntax here
SQL to generate a list of numbers from 1 to 100
SELECT (MIN(tablea.DATEFIELD) + levquery.n - 1) AS Port1 FROM tablea, (SELECT LEVEL n FROM DUAL CONNECT BY LEVEL <= 365) as levquery
(Check if the query works for you - haven't access to pc to test it at the minute)

KDB+/Q query too heavy to handle

I want to grab data from a KDB data base for a list of roughly 200 days within the last two years. The 200 days are in no particular pattern.
I only need the data from 09:29:00.000 to 09:31:00.000 everyday.
My first approach was to query all of the last two years data that have time stamp between 09:29:00.000 and 09:31:00.000, because I didn't see a way to just query the particular 200 days that I need.
However this proved to be too much for my server to handle.
Then I tried to summarize the 2 minute data for each date into an average and just print out the average, so now I will only have 200 rows of data as output. But somehow this still turns out to be too much. I'm not sure if this is because I'm not selecting the data correctly.
My other suspicion is that the query is garbing all the data first then averaging each date, which means averaging is not making it easier to handle.
Here's the code that I have:
select maxPriceB:max(price), minPriceB:min(price), avgPriceB:avg(price), avgSizeB:avg(qty) by date from dms where date within(2015.01.01, 2016.06,10), time within(09:29:00.000, 09:31:00.000), sym = `ZF6
poms is the table that the data is in
ZFU6 is the symbol that im looking for
I tried adding the key word distinct after select.
I want to know if there's anyway to break up the query, or make the query lighter for the server to handle.
Thank you!
If you use 32-bit kdb+ and get infamous 'wsfull error then you may try processing one day at a time like this:
raze{select maxPriceB:max(price), minPriceB:min(price), avgPriceB:avg(price), avgSizeB:avg(qty)
from dms where date=x,sym=`ZF6,time within 09:29:00.000 09:31:00.000}each 2015.01.01+1+til 2016.06.10-2015.01.01

Best way to query 4 B+ records in Tableau

I am looking a best way to analyse 4B records (1TB data) stored in Vertica using Tableau. I tried using extract of 1M records which works perfectly. but dont know how to manage 4B records, because its taking too long to query on 4B records.
I have following dataset :
timestamp id url domain keyword nor_word cat_1 cat_2 cat_3
So here I need to create descending list of Top 10 ID's, Top 10 url, Top 10 domain, Top 10 keyword, Top 10 nor_word, Top 10 cat_1, Top 10 cat_2, Top 10 cat_3 depending count of each field value in separate worksheet and combine all worksheet in one dashboard.
There is no primary key. This dataset of 1 month so I want to make global filter start date and end date to reduce the query size. But don't know how to create global date filter and display on dashboard ?
You have two questions, one about Vertica and one about Tableau. You should split these up.
Regarding Vertica, you need to know that Vertica stores data in ascending sort order in physical storage. This means that an additional step will always be required anytime you want to get a descending sort order.
I would suggest creating a partition on the date, and subsequently running Database Designer (DBD) in incremental mode and using your queries as samples. By partitioning the data, Vertica can eliminate the partitions during optimization.
Running the DBD will generate some better optimized projections. You should consider the trade-off between how often you will need this data and whether it's worth creating these additional projections as it will impact your load performance.