Google Analytics Reporting API different results for one period - google-analytics-api

I tried to perform two requests to Reporting API:
startDate=2016-01-01, endDate=2016-08-26, ga:users, ga:yearMonth
startDate=2016-01-01, endDate=2016-08-26, ga:users, ga:yearMonth, ga:year
The metric results do not match. Why?
Result for request one:
ga:yearMonth ga:users
201601 1372
201602 1701
201603 1980
201604 1779
201605 1465
201606 1336
201607 1402
201608 1595
Result for request two:
ga:year ga:yearMonth ga:users
2016 201601 1372
2016 201602 1525
2016 201603 1761
2016 201604 1531
2016 201605 1239
2016 201606 1084
2016 201607 1157
2016 201608 1365

This answer may be useful to someone having the same problem. Whenever there is a mismatch between the data from the API and the data on the dashboard, do the following:
Make sure you are using the same parameters for both of them (matching metrics and dimensions).
If the mismatch persists after step one, it is probably because Google has applied sampling internally, which happens because even a small query can require heavy computation. To confirm that sampling took place, check for the samplingSpaceSizes field in the response.
To avoid sampling, loop over the dates and query each day independently.
In your case the issue is most probably sampling caused by the large date range (and by your GA account holding a lot of data), so instead of querying the whole range at once, loop over the date range.
Also remember that it may take up to 48 hours for fresh data to be processed. To check whether your data has been processed, look for the isDataGolden field in the response: if it is present, the data is processed and the results will match; if it is absent, some of your data has not been processed yet.
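A rough sketch of the "loop over the date range" idea, in Python. It only builds the per-day startDate/endDate pairs and checks a parsed response for the sampling fields. The actual API call is left out, and the shape of `report` (one entry of the response's "reports" list, parsed from JSON into a dict) is an assumption here, as are the helper names.

```python
from datetime import date, timedelta


def daily_ranges(start, end):
    """Yield (startDate, endDate) ISO strings, one pair per day, end inclusive."""
    day = start
    while day <= end:
        yield day.isoformat(), day.isoformat()
        day += timedelta(days=1)


def looks_sampled(report):
    """Return True if the parsed report carries sampling fields.

    Assumption: `report` is a dict with a "data" object, as in one entry
    of the "reports" list of a Reporting API v4 response.
    """
    data = report.get("data", {})
    return "samplingSpaceSizes" in data or "samplesReadCounts" in data


# One request per day for the range used in the question.
ranges = list(daily_ranges(date(2016, 1, 1), date(2016, 8, 26)))
print(len(ranges), ranges[0], ranges[-1])
```

Keep in mind that ga:users counts unique users, so per-day results cannot simply be added up into a monthly figure. For the requests in the question, splitting into one request per month may be the more useful variant of the same loop.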

Have you checked for sampling? The date range you're working with is on the large side, so you might consider testing with a smaller range to see if the totals become more consistent.
Another thing to consider is that the Users metric can be pre-calculated or calculated on the fly. More information on the Users metric here


Table Calculation In Blended Data Sources

I have 2 data sources:
1) Eligible Devices to be changed to a newer model
2) Devices actually changed to a new model
These could be for example computer devices which are after a certain time required to be changed to a newer version i.e. end of life of a product.
Because this is a blended data source, I cannot apply an LOD expression.
The calculation I am trying to achieve is:
Jan 2017: There are 100 eligible devices but only 85 actually got refreshed, hence 15 devices are carried forward to the next month.
Feb 2017: There are 200 eligible devices but only 160 actually got refreshed. Adding the 15 carried over from Jan, the total opening balance for Feb = 200 + 15 = 215, and after deducting 160 the balance is 55.
The same process continues for all the other months and years.
The challenge:
Let's say actual - eligible is named Diff. For the first month this number should just be actual - eligible. From the second month onward it should be actual - eligible plus the balance from the previous month, i.e. a lookup.
How do I write a calc that shows the calculation described above for the first month and then, from the next month onward, the previous month's balance (the lookup) + actual - eligible?
There will be month and year columns in the filters; if I filter out any year or month, the carry-forward from that period will not be accounted for.
Sounds like Running_Sum([Diff]) should do the trick, as long as you set the partitioning and addressing correctly.
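To see why a running sum gives the carry-forward, here is the arithmetic from the question as a small Python sketch (not Tableau syntax). It uses eligible - actual so the balances come out as the positive 15 and 55 from the example. With Diff defined as actual - eligible, the same running sum would just have the opposite sign.

```python
from itertools import accumulate

# Figures from the question: Jan 2017 and Feb 2017.
eligible = [100, 200]
actual = [85, 160]

# Per-month shortfall, then the running total carried from month to month.
diff = [e - a for e, a in zip(eligible, actual)]
carry_forward = list(accumulate(diff))

print(diff)           # per-month shortfall
print(carry_forward)  # balance at the end of each month
```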

What is the best way to perform row wise calculations in Spark ? Details below

Okay guys, I have a situation where I have the following schema for my data frame:
Customer_Id Role_Code Start_TimeStamp End_Timestamp
Ray123 1 2015 2017
Kate123 -- 2016 2017
I wish to decide the Role_Code of a given customer(say "Ray123") based on a few conditions. Let's say his Role_Code comes out to be 1. I then process the next row and the next customer(say "Kate123") has overlapping time with Ray123, then she can challenge Ray123 and might win against him to have Role_Code 1 (based on some other conditions). And so if she wins, for the overlapping time period, I need to set the Role_Code of Ray123 as 2 so the data looks like:
Customer_Id Role_Code Start_TimeStamp End_Timestamp
Ray123 1 2015 2016
Ray123 2 2016 2017
Kate123 1 2016 2017
There are similar things happening where I need to go back and forth, pick rows, compare the timestamps and some other fields, then take unions and excepts etc. to get a final data frame with the correct set of customers and the correct set of role codes. The problem is that the solution works fine if I have 5-6 rows, but if I test against e.g. 70 rows, the YARN container kills the job because it always runs out of memory. I don't know how else to solve this problem without multiple actions such as head(), first() etc. getting in the way to process each row and then split the rows effectively.
It seems like some other framework would be better suited for this. I'm thankful for any suggestion!

Adding cases by group in SAS

I have a SAS dataset in a long format. The intervention program started in the 2014 Spring semester and has been running until the 2017 Spring semester. So there have been 7 semesters (2014 Spring and Fall, 2015 Spring and Fall, 2016 Spring and Fall, 2017 Spring).
Not everyone participated in all 7 semesters though. Some participated once and never came back, some participated more than twice but not necessarily two semesters in a row.
So each individual has a different number of cases. For example, someone who participated twice has 2 rows, and someone with 5 participations has 5 rows.
I want everyone to have 7 rows in the dataset for some reason.
What could be the best way of programming to do this in SAS?
I would really appreciate any suggestions!
PROC EXPAND is probably the most direct way to do this, although it has the limitation that it won't extrapolate beyond the observed start/end of the range, and it expects regular intervals (or a lot more work to define intervals).
proc expand data=your_data out=expanded_data from=semiyear extrapolate method=none;
by student;
id semester_date;
run;
That relies on semester_date being a date variable corresponding generally to the start of each half-year.
Perhaps more easily in this case, you could use the printmiss option in proc tabulate, which would generate a table pretty easily.
ods output table=out_table;
proc tabulate data=your_data;
class student semester;
tables student,semester/printmiss misstext=' ';
run;
ods output close;
Then merge that back to the main dataset; the result will have a row for every student*semester combination.
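The "row for every student*semester combination" idea, sketched in Python rather than SAS just to show the shape of the result. The student IDs and values are made up for illustration. Semesters a student missed come out as None, which corresponds to missing values after the merge.

```python
from itertools import product

semesters = [
    "2014 Spring", "2014 Fall", "2015 Spring", "2015 Fall",
    "2016 Spring", "2016 Fall", "2017 Spring",
]

# Hypothetical long-format data: (student, semester) -> some measured value.
observed = {
    ("A", "2014 Spring"): 3.1,
    ("A", "2015 Fall"): 3.4,
    ("B", "2016 Spring"): 2.8,
}

students = sorted({student for student, _ in observed})

# Every student x semester pair; unobserved pairs get None.
full = [
    (student, semester, observed.get((student, semester)))
    for student, semester in product(students, semesters)
]

for row in full:
    print(row)
```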

Joining time series events with daily 'shift' data?

What is the best practice for joining 'shift' data and other time series data in Tableau? I am working with data from multiple geographies (from LA to India, UK, NY, Malaysia, Australia, China etc.), and a lot of employees work past midnight.
For example, an employee has a shift from 9 PM to 6 AM on 2016-07-31. The 'report date' is 2016-07-31 but no time zone information is provided.
This employee does work and there are events (time stamps in UTC) between 2016-07-31 21:00 to 2016-08-01 06:00. When I look at the events though, 7/31 will only have the events between 21:00 and 23:59. If I filter for just July, my calculations will be skewed (the event data will be cut off at midnight even though the shift extended to 6 AM).
I need to make calculations based upon the total time an employee was actually engaged with work (productive) and the total time they were paid. The request is for this to be daily/weekly/monthly.
If anyone can help me out here or give me some talking points to explain this to my superiors, it would be appreciated. This seems like it must be a common scenario. Do I need to request a new raw data format, or is there something I can do on my end?
The shift data looks only like this:
id date regular_hours overtime_hours total_hours
abc 2016-06-17 8 0.52 8.52
abc 2016-06-18 7.64 0.83 8.47
abc 2016-06-19 7.87 0.23 8.1
The event data is more detailed (30-minute interval data on events handled and the time it took to complete those events, in seconds):
id date interval events event_duration
abc 2016-06-17 01:30:00 4 688
abc 2016-06-17 02:00:00 6 924
abc 2016-06-17 02:30:00 10 1320
So, you sum up the event_duration for an entire day and you get a number of seconds which was actually spent doing work. You can then compare this to amount of time that the employee was paid to see how efficient the staffing is.
My concern is that the event data has the date and the time (UTC). The payroll data only has a date without any time zone information. This causes inaccuracies when blending data in Tableau because some shifts cross midnight. Is there a way around this or do I need to propose new data requirements?
(FYI - people have been calculating it just based on the date for years most likely without considering time zones before. My assumption is that they just did not realize that this could cause inaccurate results)
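For reference, the productive-versus-paid comparison described above looks like this in a small Python sketch. It only uses the three sample intervals and the 2016-06-17 shift row shown earlier, so the resulting ratio is illustrative and not a real daily figure. It also does nothing about the midnight/time-zone problem, which is the open question.

```python
# Seconds of work from the three sample event rows for id "abc".
event_durations = [688, 924, 1320]

# Paid time for id "abc" on 2016-06-17, from the shift data.
total_hours = 8.52

productive_seconds = sum(event_durations)
paid_seconds = total_hours * 3600

# Share of paid time that was spent handling events.
utilization = productive_seconds / paid_seconds
print(productive_seconds, round(paid_seconds), round(utilization, 4))
```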

How to handle dates in neo4j

I'm an historian of medieval history and I'm trying to code networks between kings, dukes, popes etc. over a period of time of about 50 years (from 1220 to 1270) in medieval Germany. As I'm not a specialist for graph-databases I'm looking for a possibility to handle dates and date-ranges.
Are there any possibilities to handle over a date-range to an edge so that the edges, which represents a relationship, disappears after e.g. 3 years?
Are there any possibility to ask for relationships who have their date-tag in a date-range?
The common way to deal with dates in Neo4j is to store them either as a string representation or as millis since epoch (i.e. milliseconds elapsed since Jan 01 1970).
The first approach makes the graph more easily readable; the latter allows you to do math, e.g. calculate deltas.
In your case I'd store two properties called validFrom and validTo on the relationships. Your queries need to make sure you're looking for the correct time interval.
E.g. to find the king(s) in charge of France from Jan 01 1220 to Dec 31 1221, you do:
MATCH (c:Country{name:'France'})-[r:HAS_KING]->(king)
WHERE r.validFrom >= -23667123600000 AND r.validTo <= -23604051600000
RETURN king, r.validFrom, r.validTo
Since Neo4j 3.0 there's the APOC library, which provides a couple of functions for converting timestamps to/from human-readable date strings.
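If you want to compute such millisecond values outside the database, here is a minimal Python sketch. The function name is made up for illustration. It assumes midnight UTC and Python's proleptic Gregorian calendar. The resulting numbers depend on the calendar and time zone you assume, so they need not match the literals in the query above exactly.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(year, month, day):
    """Milliseconds since 1970-01-01 UTC for midnight UTC of the given date.

    Negative for dates before 1970. Assumes the proleptic Gregorian
    calendar used by Python's datetime module.
    """
    moment = datetime(year, month, day, tzinfo=timezone.utc)
    return (moment - EPOCH) // timedelta(milliseconds=1)


valid_from = to_epoch_millis(1220, 1, 1)
valid_to = to_epoch_millis(1221, 12, 31)
print(valid_from, valid_to)
```

Subtracting from a fixed epoch is used here instead of `datetime.timestamp()` because the latter can fail for dates this far in the past on some platforms.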
You can also store the dates in their number representation in the following format: YYYYMMDD
In your case 12200101 would be Jan 1st 1220 and 12701231 would be Dec 31st 1270.
It's a useful and readable format and you can perform range searches like:
MATCH (h:HistoricEvent)
WHERE >= 12200101 AND < 12701231
It would also let you order by dates, if you need to.
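A small Python sketch of why this encoding works: comparing or sorting the integers is the same as comparing or sorting the dates. The event names and dates are placeholders, not real data.

```python
def to_yyyymmdd(year, month, day):
    """Encode a date as the integer YYYYMMDD."""
    return year * 10000 + month * 100 + day


# Placeholder events, only to illustrate the comparison.
events = {
    "event_a": to_yyyymmdd(1245, 7, 17),
    "event_b": to_yyyymmdd(1219, 12, 31),
    "event_c": to_yyyymmdd(1220, 1, 1),
    "event_d": to_yyyymmdd(1271, 3, 2),
}

start, end = to_yyyymmdd(1220, 1, 1), to_yyyymmdd(1270, 12, 31)

# Same range test as in the WHERE clause, then chronological ordering.
in_range = sorted(
    (name for name, d in events.items() if start <= d < end),
    key=events.get,
)
print(in_range)
```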
As of Neo4J 3.4, the system handles duration and dates, see the official documentation. See more examples here.
An example related to the original question: retrieve the historical events that happened in the last 30 days from now:
WITH duration({days: 30}) AS duration
MATCH (h:HistoricEvent)
WHERE date() - duration < date(
Another option for dates that keeps the number of nodes/properties you create fairly low is to use three linked lists: one of years (from the earliest year of interest to the latest), one of months (1-12), and one of days in a month (1-31). Then every "event" in your graph can be connected to a year, a month, and a day. This way you don't have to create a new node for every new combination of year, month, and day; you just have a single set of years, one of months, and one of days. I scale the numbers to make manipulating them easier, like so:
Years are yyyy*10000
Months are mm*100
Days are dd
so if you run a query such as
match (event)-[:happened]->(t:time)
with event, sum(t.num) as date
return event, date
order by date
You will get a list of all events in chronological order, with dates like January 17th, 1904 appearing as 19040117 (yyyymmdd format).
Further, since these are linked lists where, for example,
...-(t0:time {num:19040000})-[:precedes]->(t1:time {num:19050000})-...
ordering is built into the nodes too.
This is, so far, how I have liked to do my event dating
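To check the scaling, here is the same sum done in Python for a single event connected to one year node, one month node and one day node. The variable names are only illustrative.

```python
# `num` values of the three time nodes an event on 17 January 1904 links to.
year_num = 1904 * 10000
month_num = 1 * 100
day_num = 17

# Equivalent of sum(t.num) over the event's three time nodes.
event_date = year_num + month_num + day_num
print(event_date)
```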