Hbase schema design suggestion - nosql

Coming from RDBMS background, I need little help/suggestion to design Hbase schema for below usecase.
It is a report generating application using hadoop. Now, we need to track all previous report generation history for a particular user based on his email id. So, data need to be persisted are, email id, report name, start date, end date, status. I am planning to keep the email id as row key and other entities as columns,
emailId(row key) - (columns) appName:reportName, appName:startDate, appName:endDate, appName:status
But the problem is, the same user can run same report for different date ranges. So it will overwrite the appName:reportName and appName:status columns. Since I am new to NoSQL world, I am not sure how to tackle this problem.
Can someone please suggest me an ideal way of designing schema for this requirement?
Any help would be greatly appreciated.
Thanks

Based on your expected query pattern, here is what I'd suggest:
RowKey | Column Family (appName) |
userid#domain.com-YYYY-MM-DD HH:MM:SSS | reportName | status | startDate | endDate |
This design gives you a few advantages. First of all, you can quickly query (using a scan) all rows for a particular user over a particular date range. Secondly, you are avoiding write hotspots by preceding the timestamp in the rowkey with the user's ID.
You can write one row to this schema each time a user triggers the generation of a report, and you won't need to worry about overwriting the columns (unless a user generates two reports in the same 1/10th of a second).

Related

filemaker pro 16 creating records that share a date (work rota)

I am currently trying to create a rota within filemaker 16 and I can't figure out how to create records that share a date.
I want to be able to have people assigned to jobs and jobs assigned to dates but currently when I create jobs with the same date it creates a new record instead of assigning it to the one already existing.
I have 3 tables currently jobs, date and people. I have a 4th layout with a portal where I wanted to view records related to jobs that are set for a certain day.
Any help would be much appreciated.
Many thanks.
I am not 100% convinced you need a Dates table. Do you have anything specific to record about a date, other than its existence?
However, you certainly need a join table of Assignments, with fields for:
PersonID
JobID
Date
(this is assuming your rota is daily, otherwise you will need to indicate the shift or hours too).
I think your structure should change on this.
So instead have:
Parent:
ProjectId
Date
PersonId
JobId
Date
Then make the project Id your 'Primary Key' so your parent record.
Then you are just assigning the person, job & date as the child.
That way you can always add to the previous record without relying on date field.
You could then filter via dates etc.

Identifying next closest record by date in tableau

I have a table of users and another table of transactions.
The transactions all have a date against them. What I am trying to ascertain for each user is the average time between transactions.
User | Transaction Date
-----+-----------------
A | 2001-01-01
A | 2001-01-10
A | 2001-01-12
Consider the above transactions for user A. I am basically looking for the distance from one transaction to the next chronologically to determine the distances.
There are 9 days between transactions one and two; and there are 2 days between transactions three and four. The average of these is obviously 4.5, so I would want to identify the average time between user A's transactions to be 4.5 days.
Any idea of how to achieve this in Tableau?
I am trying to create a calculated field for each transaction to identify the date of the "next" transaction but I am struggling.
{ FIXED [user id] : MIN(IF [Transaction Date] > **this transaction date** THEN [Transaction Date]) }
I am not sure what to replace this transaction date with or whether this is the right approach at all.
Any advice would be greatly appreciated.
LODs dont have access to previous values directly, so you need to create a self join in your data connection. Follow below steps to achieve what you want.
Create a self join with your data with following criteria
Create an LOD calculation as below
{FIXED [User],[Transaction Date]:
MIN(DATEDIFF('day',[Transaction Date],[Transaction Date (Data1)]))
}
Build the View
PS: If you want to improve the performance, Custom SQL might be the way.
The only type of calculation that can take order sequence into account (e.g., when the value for a calculated field depends on the value of the immediately preceding row) is a table calc. You can't use an LOD calc for this kind of problem.
You'll need to understand how partitioning and addressing works with table calcs, along with specifying your sort order criteria. See the online help. You can then do something like, for example, define days_since_last_transaction as:
if first() > 0 then min([Transaction Date]) -
lookup(min([Transaction Date]), -1) end
If you have very large data or for other reasons want to do your calculations at the database instead of in Tableau by a table calc, then you use SQL windowing (aka analytical) queries instead via Tableau's custom SQL.
Please attach an example workbook and anything you tried along with the error you have.
This might not be useful if you cannot set User ID Field as a filter.
So, you can set
User ID
as a filter. Then following the steps mentioned in here will lead you to calculating difference between any two dates. Ideally if you select any one value in the filter, the calculated field from the link should give you the difference in the dates that you have in the transaction dates column.

Staff availability form in FileMaker

I'm have trouble coming up with a solution to a staff availability form, for a live event facility.
The goal is to have someone in the office select start and end dates, which generates a list of upcoming events. This list will be seen by employees on WebDirect, and they will be able to mark whether they are available or not via checkbox. The people in the office will then be able to see who is available for the all upcoming events while scheduling.
The idea behind choosing the start and end date is so the office can selectively "publish" which dates the employees see, as well as having a log of all responses tied to that form.
I also want to limit the employee to only be able to see their responses to the form.
So far I have tables as follows:
Employee Event Availability Form Response
-------- ----- ------------ ---- --------
ID ID ID ID ID
Name Date StartDate fk_AvailabilityID fk_EmployeeID
Title EndDate fk_ResponseID Checkbox
All of the relationships are primary key = foreign key except Event Date has a relationship to Availability:
Date ≥ StartDate AND
Date ≤ EndDate
Not really sure where to go past this or if this even is correct to begin with. I've experimented with a FormResponse table but not really sure what connections to make.
I'm fairly new to FileMaker and Databases in general, so laymens terms would be appreciated.
For starters, instead of convoluted relationships, you could use a field in the event table as a publish flag and then perform a search for that in web direct upon login.
When the staff sets a start/end date, they run a script that sets the publish flag for these records only.
To match the logged in user to an employee, you would need to store the account name in the employee table and match these upon login. Then set the privileges such that records can only be viewed when there is a match.
Hope this helps.

Why TIMESTAMP instead of DATETIME in this instance?

In many examples I've read of SQL timestamp useage, a typical case would be that a timestamp column is added to prevent a kind of race condition whereby a user is changing data that has lost it's integrity since another user 'got in there first'.
More specifically, prior to issuing an update on a row, business logic would cross check the timestamp they believe to be changing so that there isn't a mix up with row versioning.
Question
Why wouldn't DATETIME suffice for this task? In fact, by that logic - why wouldn't any unique data type be appropriate instead? NEWID() every time an update is issued, for example?
In mySQL, timestamp is a physically smaller datatype to store than datetime. In addition, timestamp is universal, ignoring all timezones. For international products, this is important.
ID's are not recommended as they often generate at the point of insert/update.
It appears I missed the fundamental feature of timestamp, it auto updates.
So calling UPDATE on a row will automatically increment it's TIMESTAMP column without me manually setting it.
I'll leave this answer here just in case anybody has comments about what else I may have missed.

Volume of an Incident Queue at a Point in Time

I have an incident queue, consisting of a record number-string, the open time - datetime, and a close time-datetime. The records go back a year or so. What I am trying to get is a line graph displaying the queue volume as it was at 8PM each day. So if a ticket was opened before 8PM on that day or anytime on a previous day, but not closed as of 8, it should be contained in the population.
I tried the below, but this won't work because it doesn't really take into account multiple days.
If DATEPART('hour',[CloseTimeActual])>18 AND DATEPART('minute',[CloseTimeActual])>=0 AND DATEPART('hour',[OpenTimeActual])<=18 THEN 1
ELSE 0
END
Has anyone dealt with this problem before? I am using Tableau 8.2, cannot use 9 yet due to company license so please only propose 8.2 solutions. Thanks in advance.
For tracking history of state changes, the easiest approach is to reshape your data so each row represents a change in an incident state. So there would be a row representing the creation of each incident, and a row representing each other state change, say assignment, resolution, cancellation etc. You probably want columns to represent an incident number, date of the state change and type of state change.
Then you can write a calculated field that returns +1, -1 or 0 to to express how the state change effects the number of currently open incidents. Then you use a running total to see the total number open at a given time.
You may need to show missing date values or add padding if state changes are rare. For other analytical questions, structuring your data with one record per incident may be more convenient. To avoid duplication, you might want to use database views or custom SQL with UNION ALL clauses to allow both views of the same underlying database tables.
It's always a good idea to be able to fill in the blank for "Each record in my dataset represents exactly one _________"
Tableau 9 has some reshaping capability in the data connection pane, or you can preprocess the data or create a view in the database to reshape it. Alternatively, you can specify a Union in Tableau with some calculated fields (or similarly custom SQL with a UNION ALL clause). Here is a brief illustration:
select open_date as Date,
"OPEN" as Action,
1 as Queue_Change,
<other columns if desired>
from incidents
UNION ALL
select close_date as Date,
"CLOSE" as Action,
-1 as Queue_Change,
<other columns if desired>
from incidents
where close_date is not null
Now you can use a running sum for SUM(Queue_Change) to see the number of open incidents over time. If you have other columns like priority, department, type etc, you can filter and group as usual in Tableau. This data source can be in addition to your previous one. You don't have ta have a single view of the data for every worksheet in your workbook. Sometimes you want a few different connections to the same data at different levels of detail or for perspectives.