After performing dropDuplicates() I am getting different counts when taking the count - scala

I did a dropDuplicates on a DataFrame with the subset Region, store, and id.
The DataFrame contains some other columns like latitude, longitude, address, Zip, Year, Month...
When I take the count of the derived DataFrame I get a constant value,
but when I take the count for a selected year, say 2018, I get different counts each time I run df.count().
Could anyone please explain why this is happening?
Df.dropDuplicates("region","store","id")
Df.createOrReplaceTempView("Df")
spark.sql("select * from Df").count()
The count above is constant whenever I run it.
But if I put a where clause with Year or Month inside, I get varying counts.
Eg:
spark.sql("select * from Df where Year = 2018").count()
This statement gives a different value on each execution.
Intermediate output
Region store objectnr latitude longitude newid month year uid
Abc 20 4572 46.6383 8.7383 1 4 2018 0
Sgs 21 1425 47.783 6.7282 2 5 2019 1
Efg 26 1277 48.8293 8.2727 3 7 2019 2
Output
Region store objectnr latitude longitude newid month year uid
Abc 20 4572 46.6383 8.7383 1277 4 2018 0
Sgs 21 1425 47.783 6.7282 1425 5 2019 1
Efg 26 1277 48.8293 8.2727 1277 7 2019 2
So here newid gets the value of objectnr.
When newid comes out the same, I need to assign the latest objectnr to newid, considering the year and month.

The line
Df.dropDuplicates("region","store","id")
creates a new DataFrame; it does not modify the existing one. DataFrames are immutable.
To solve your issue you need to save the output of the dropDuplicates statement into a new DataFrame, as shown below:
val Df2 = Df.dropDuplicates("region","store","id")
Df2.createOrReplaceTempView("Df2")
spark.sql("select * from Df2").count()
In addition, you may get different counts when applying the filter Year = 2018 because the Year column is not part of the three columns you used to drop the duplicates. Apparently you have rows in your DataFrame that share the same values in those three columns but differ in Year. Dropping duplicates is not a deterministic process, as it depends on the ordering of your data, which can vary on every run of your code.
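If you additionally need the surviving row to be deterministic, for example always keeping the most recent objectnr per key as asked in the question, a window function is one way to do it. Below is a minimal sketch, assuming the key columns are named region, store and id and the ordering columns are year and month as in the sample output; adjust the names to your real schema.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Keep only the newest row (by year, then month) within each (region, store, id) group.
val w = Window.partitionBy("region", "store", "id")
  .orderBy(col("year").desc, col("month").desc)

val Df2 = Df.withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")

Df2.createOrReplaceTempView("Df2")
spark.sql("select * from Df2 where year = 2018").count()   // stable across runs

Because the ranking is explicit, the row that survives no longer depends on the partition ordering, so the filtered counts stay the same on every execution.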

Related

Date difference between pairs of dates per ID

I have data with 2 columns, in the following format:
ID Date
1 1/1/2020
1 27/7/2020
1 15/3/2021
2 18/1/2020
3 1/1/2020
3 3/8/2020
3 18/9/2021
2 23/8/2020
2 30/2/2021
Now I would like to create a calculated field in Tableau to find, per ID, the difference between consecutive dates, in some unit such as days.
For example, for ID 1 the difference between the first two dates according to the calendar is 208 days. Next, the difference between the second and third date for the same ID is 231 days.
A table calc like the following should do if you get the partitioning, addressing and ordering right — such as setting “compute using” to Date.
If first() < 0 then min([Date]) - lookup(min([Date]), -1) end
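For reference, the same per-ID gap can be computed outside Tableau with a lag over the dates. Here is a minimal sketch in Spark (Scala), assuming a DataFrame df with the two string columns ID and Date in day/month/year format as shown above:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, datediff, lag, to_date}

// Parse the day/month/year strings into real dates, then compare each date
// with the previous one for the same ID in chronological order.
val parsed = df.withColumn("Date", to_date(col("Date"), "d/M/yyyy"))

val w = Window.partitionBy("ID").orderBy("Date")
val withDiff = parsed.withColumn(
  "days_since_prev",
  datediff(col("Date"), lag(col("Date"), 1).over(w))
)
// For ID 1 this yields null, 208 and 231, matching the numbers in the question.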

How to union in a way that will group "date" into different columns instead of combining everything in the same column

I'm very new to pyspark. Here is the situation: I created a dataframe for each file (9 in total, each file represents the counts for one month), then I need to union them all into one big df. The thing is, I need it to come out with each month as its own separate column, like this:
name_id | 2020_01 | 2020_02 | 2020_03
1 | 23 | 43534 | 3455
2 | 12 | 34534 | 34534
3 | 2352 | 32525 | 23
However, with my current code, it puts all the months under the same column. I've been searching the internet for a long time but could not find anything to solve this (maybe I need groupby, but I'm not sure how to do that). Below is my code. Thanks!
df1=spark.read.format("parquet").load("dbfs:")
df2=spark.read.format("parquet").load("dbfs:")
df3=spark.read.format("parquet").load("dbfs:")
df4=spark.read.format("parquet").load("dbfs:")
df5=spark.read.format("parquet").load("dbfs:")
df6=spark.read.format("parquet").load("dbfs:")
df7=spark.read.format("parquet").load("dbfs:")
df8=spark.read.format("parquet").load("dbfs:")
df9=spark.read.format("parquet").load("dbfs:")
#union all 9 files
union_all=df1.unionAll(df2).unionAll(df3).unionAll(df4).unionAll(df5).unionAll(df6).unionAll(df7).unionAll(df8).unionAll(df9)
Here is the current output
name_id | count | date
1 | 23 | 2020_01
2 | 12 | 2020_01
1 | 43534 | 2020_02
2 | 34534 | 2020_02
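One way to get from that long output to the wide layout is groupBy plus pivot on the date column, which turns every distinct date value into its own column. A minimal sketch in Scala (the PySpark DataFrame API exposes the same groupBy/pivot/agg calls), assuming the unioned frame has the columns name_id, count and date shown above:

import org.apache.spark.sql.functions.first

// Each distinct value of `date` (2020_01, 2020_02, ...) becomes a column,
// filled with that month's count for the given name_id.
val wide = union_all
  .groupBy("name_id")
  .pivot("date")
  .agg(first("count"))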

Use one date filter on multiple columns in tableau

I have a data set with multiple date columns containing different date values across all months and years. I want to create a report where, when I select a year, I get the count of dates in each month of that year. Based on a single year selection, how can I apply the filter across the different date fields to display the counts for that particular year?
Let's say we have a data set like this
Date 1 Date 2
1/3/2017 NA
1/23/2017 1/23/2017
1/14/2017 1/16/2017
2/2/2017 2/3/2017
NA 2/21/2017
3/1/2017 NA
3/3/2017 3/21/2017
.
.
.
12/1/2017 12/12/2017
My result should look like this when I pick the year 2017
Date 1 Date 2
Jan 3 2
Feb 1 2
Mar 2 1
.
.
Dec 1 1
I was able to apply the filter on one column, but when I try to apply it to the other columns I am not getting the desired result.
Assuming you want to interact with your dashboard using a parameter, you can create one string parameter in order to input the year you want to analyze.
After that you just need to create 2 calculated fields (one per date column) to count whether that year is "contained" in each date:
if contains(str([Date 1]),[Parameter]) then 1 else 0 end
and likewise for [Date 2]. Keep in mind that there's no guarantee you'll get all the available months in the calendar unless you have data for all of them.
In order to also consider blank dates, I created a Date Global calculated field as follows:
ifnull([Date 1],[Date 2])
Once you've created these fields and the parameter (show parameter control), you can simply add them to your worksheet as I did in the image:
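As a cross-check of the expected numbers rather than a Tableau recipe, the aggregation itself is just two per-month counts joined on the month. A sketch in Spark (Scala), where the frame name dates, the columns date1 and date2 (date-typed) and the variable chosenYear are all hypothetical stand-ins for the data above:

import org.apache.spark.sql.functions.{col, count, month, year}

val chosenYear = 2017

// Count the non-null values of each date column per month of the chosen year,
// then line the two counts up side by side.
val d1 = dates.filter(year(col("date1")) === chosenYear)
  .groupBy(month(col("date1")).alias("month"))
  .agg(count("*").alias("date1_count"))

val d2 = dates.filter(year(col("date2")) === chosenYear)
  .groupBy(month(col("date2")).alias("month"))
  .agg(count("*").alias("date2_count"))

val result = d1.join(d2, Seq("month"), "full_outer").orderBy("month")
// For the sample data this gives Jan 3/2, Feb 1/2, Mar 2/1, as in the expected output.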

TABLEAU Calculating a Running DISTINCT COUNT on usernames for last 3 months

Issue:
I need to show RUNNING DISTINCT users per 3-month interval^^ (see the goal table as reference). However, COUNTD does not help, even combined with a table calculation or the WINDOW_COUNT or WINDOW_SUM functions.
^^RUNNING DISTINCT users means the DISTINCT users in a period of time (Jan-Mar, Feb-Apr, etc.). COUNTD only counts DISTINCT users within a single window; the calculation needs to slide over a 3-month window to find the DISTINCT users.
Original Table
Date Username
1/1/2016 A
1/1/2016 B
1/2/2016 C
2/1/2016 A
2/1/2016 B
2/2/2016 B
3/1/2016 B
3/1/2016 C
3/2/2016 D
4/1/2016 A
4/1/2016 C
4/2/2016 D
4/3/2016 F
5/1/2016 D
5/2/2016 F
6/1/2016 D
6/2/2016 F
6/3/2016 G
6/4/2016 H
Goal Table
Tried Methods:
Step-by-step:
Tried to break the problem into steps, but due to the columnar nature of Tableau, I cannot successfully run COUNT or SUM (or any aggregate command) on the LAST STEP of the solution.
STEP 0 Raw Data
This table shows the structure of the data as it is in the original table.
STEP 1 COUNT usernames by MONTH
The table shows the count of users by month. You will notice that because user B had 2 entries, he is counted twice. In the next step we use DISTINCT COUNT to fix this issue.
STEP 2 DISTINCT COUNT by MONTH
Now we can see who was present in each month; the next step is to see the running DISTINCT COUNT by MONTH over 3 months.
STEP 3 RUNNING DISTINCT COUNT for 3 months
Now we can see the SUM of the DISTINCT COUNT of usernames over a running 3 months. If you change the MONTH INTERVAL from 3 to 1, you get the STEP 2 table.
LAST STEP Issue Step
GOAL: I need the GRAND TOTAL to be the SUM of the MONTH column.
Request:
I want to calculate the SUM of '1' by MONTH. However, I am using a WINDOW function, and aggregating that data gave me an error.
WHAT I NEED
Jan Feb March April May Jun
3 3 4 5 5 6
WHAT I GOT
Jan Feb March April May Jun
1 1 1 1 1 1
My Output after tried methods: Attached twbx file. DISTINCT_count_running_v1
HELP taken:
https://community.tableau.com/thread/119179 ; Tried this method but stuck at last step
https://community.tableau.com/thread/122852 ; Used some parts of this solution
The way I approached the problem was to identify the minimum login date for each user and then use that date to count the distinct number of users. For example, with data in this format, I created a calculated field called Min User Login Date as { FIXED [User] : MIN([Date]) } and then did a CNTD(USER) on Min User Login Date to get the unique user count by date. If you want a running total, you can then apply the Running Total quick table calculation to the CNTD(USER) field.
You need to put MONTH(Date) and COUNT(Username) in the columns; then you will get the result you expect.
See the screenshot below
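The goal row (3, 3, 4, 5, 5, 6) can also be reproduced outside Tableau to sanity-check the definition of the trailing 3-month window. A sketch in Spark (Scala), assuming a frame logins with the Date and Username columns from the original table:

import org.apache.spark.sql.functions.{add_months, col, countDistinct, to_date, trunc}

// Reduce every login to the first day of its month.
val byMonth = logins.withColumn("month", trunc(to_date(col("Date"), "M/d/yyyy"), "month"))

// For every month in the data, gather the logins whose month falls inside the
// trailing 3-month window [month - 2, month] and count the distinct usernames.
val windows = byMonth.select("month").distinct().withColumnRenamed("month", "window_end")

val rolling = windows
  .join(byMonth, byMonth("month").between(add_months(windows("window_end"), -2), windows("window_end")))
  .groupBy("window_end")
  .agg(countDistinct("Username").alias("distinct_users_3m"))
  .orderBy("window_end")
// Produces 3, 3, 4, 5, 5, 6 for Jan through Jun 2016, matching WHAT I NEED above.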

Parse variable in the Query SSIS

In the SQL Task Editor I have the following query:
DELETE FROM [TICKETS_DATA]
WHERE BILLING_TICKETS_DATA_Time_ID <
    (SELECT TIME_ID
     FROM [TIME]
     WHERE (TIME_Year = ?) AND (TIME_Month = ?)) - 1
I have a TIME_ID with the relevant Month and Year present in each row.
I have 2 variables, Time_Month (int32) and Time_Year (int32), e.g. 08 and 2012 respectively.
I want to pick up the current TIME_ID and pass it into the above query in the SQL Task Editor.
Currently the table stores 1 month of data, and now I want to store 3 months of data.
Kindly assist me with the parameter mapping and how to pass the variables into the SQL command query.
As long as the TIME_ID in the table is a numeric value that is incremented by one for each record, there is (as stated) one record per year/month combination, and the numbers increase sequentially by one in date order (i.e. 2000-01 has TIME_ID 1, 2000-02 has TIME_ID 2, and 2001-01 has TIME_ID 13), then you can just change the -1 to -3 to delete records from your table that are older than three months. Bear in mind that since this was probably run last month, you will have two months in the table on the first run after this change and it will delete 0 records. On the next run you will have 3 months and it will again delete 0 records. On the third run (assuming it is only run once a month) you will have three months of data and it will delete records from 4 months prior to that date.