Why Nulls are automatically skipped in Tableau - tableau-api

I created a Tableau view that shows the number of students in each school.
My input dataset is below. I intentionally left student_name null in some rows.
As you can see, the 3rd and 4th rows have null names.
student_name | school
Stev | Boston Academy
Mike | Florida school
(null) | Boston Academy
(null) | Boston Academy
Sue | Florida school
Jim | Florida school
But the nulls are automatically skipped.
Even if I apply a quick filter to include nulls, they are still skipped.
There are 2 null names for Boston Academy, so I am expecting a count of 3 for Boston Academy.
Below is the view
I would like to know how Tableau behaves when student_name is null.
Does it skip nulls? Does it skip them even if we apply a filter to include nulls?

COUNT() by definition ignores nulls, like the other aggregation functions. More precisely, COUNT([Student Name]) (displayed as CNT on a pill) returns the number of records with a non-null value for the field [Student Name].
That is standard database behavior.
If you want to count the number of data rows per school, regardless of whether [Student Name] has a value, then you can use COUNT(1) (the 1 could be any non-null constant value), or, possibly slightly less efficiently, SUM(1) or equivalently SUM([Number of Records]).
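Since the answer describes this as standard database behavior, the same difference can be reproduced outside Tableau. The following is a minimal sketch using Python's built-in sqlite3 module with the question's data; the table name students is an assumption made for the example, and SQL's COUNT stands in for Tableau's COUNT.

```python
import sqlite3

# In-memory table mirroring the question's data; None is stored as SQL NULL.
rows = [
    ("Stev", "Boston Academy"),
    ("Mike", "Florida school"),
    (None, "Boston Academy"),
    (None, "Boston Academy"),
    ("Sue", "Florida school"),
    ("Jim", "Florida school"),
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (student_name TEXT, school TEXT)")  # table name is assumed
conn.executemany("INSERT INTO students VALUES (?, ?)", rows)

counts = conn.execute(
    """
    SELECT school,
           COUNT(student_name) AS named_students,  -- skips rows where the name is NULL
           COUNT(1)            AS all_rows         -- counts every row
    FROM students
    GROUP BY school
    ORDER BY school
    """
).fetchall()

for school, named_students, all_rows in counts:
    print(school, named_students, all_rows)
```

For Boston Academy the first count is 1 and the second is 3, which is the count the question expected.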

Related

Tableau measure count items if between dates

What I am trying to achieve is to get a count of people employed in a particular period.
I have 3 variables:
Employee ID (integer)
Hire date (date)
Termination date (date or null)
Example
the formula I am looking for is something like
if termination_date is null
then
count employee_ID in
dates between Hire_date and max of either hire_date or termination_date
else
count employee ID in
dates between hire_date and termination_date
The aim is to show how the staff level changes over time.
I am new to Tableau, not sure how to even start with it. Any suggestions welcome.
This problem will be simpler if you reshape your data to have the following three columns
Employee ID
Date
Action (where Action takes the value 'Hire' or 'Terminate')
Each data row represents one change in status for an employee. If an employee has a termination date, they will have two records in this new format; otherwise they will have just one record showing the hiring date.
You can reshape your data by hand, or leave the original and use Tableau Prep or the Tableau data source page to reshape using a self Union and a few simple calculated fields.
Define a calculated field called Staffing_Change as
IF [Action] = 'Hire' THEN 1 ELSE -1 END
Now you can plot the change in staff level over time by putting the exact date on Columns and SUM([Staffing_Change]) on Rows. You can use the Running Sum quick table calculation to see the net staffing level. For line mark types, I'd use a step style by pressing the Path button on the Marks card. Otherwise, the chart can give the impression of a fractional number of employees.
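The reshape and running sum described above can be sketched outside Tableau as well. This is only an illustration in plain Python: the three employees and their dates are made up, and the tuple layout is an assumption, not the asker's actual data.

```python
from datetime import date
from itertools import accumulate

# Hypothetical employees: (employee_id, hire_date, termination_date or None).
employees = [
    (1, date(2020, 1, 6), date(2021, 3, 31)),
    (2, date(2020, 5, 4), None),
    (3, date(2021, 2, 1), None),
]

# Reshape: one record per status change, stored as (date, staffing_change).
events = []
for employee_id, hired, terminated in employees:
    events.append((hired, 1))             # Action = 'Hire'
    if terminated is not None:
        events.append((terminated, -1))   # Action = 'Terminate'
events.sort()

# The running sum of the changes is the net staffing level after each event.
levels = list(accumulate(change for _, change in events))
staffing = [(day, level) for (day, _), level in zip(events, levels)]

for day, level in staffing:
    print(day.isoformat(), level)
```

The level only changes on event dates, which is why a step-style line represents it more faithfully than a sloped one.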

Atomic values / divisibility to reach 1NF

After reading about normalization, I am unsure how to interpret the 1NF requirements.
According to Wikipedia, a table is in first normal form if the "domain of each attribute contains only atomic indivisible values".
My question is: Who decides what is indivisible or not?
You may divide a date datatype into year, month, day, second, nanoseconds. You could just as well divide an address down to its exact latitude coordinates. When can you really be sure that you have reached 1NF?
Would this table be considered 1NF?
fullName | fullAddress
Joe Zowesson | 87th Victoria Street London EC96 1MB, 14584
Mason Hamburg | 47th Jeremy Street London EC26 1MB, 13584
Dedrik Terry | 27th Burger Street London EC16 1MB, 17584
My interpretation is that the value Joe Zowesson is indivisible with regard to the column fullName, and that the zip code, street number and street name are each atomic in relation to the column fullAddress.
I am almost certain that I am in the wrong, but I cannot yet understand why.
The question relates to an upcoming exam, where I will need to "prove" which normal form something is currently in. I find that very hard, because it depends on how you interpret the word atomic.
You have misunderstood the concept of 1NF basically. By atomic value, it is meant that when you have a column for Name, you should not store any other values alongside it. In other words, the column intended for the Name should not store ID, Address or anything else together with Name, so that when you query the column Name you get only Name, and not name with Id or Address. And Name can be in any form you want whether it be First name + Last name or First name + Last name + Middle name + Previous name.
The decision of whether you need separate columns for the related data should be made during design. Let's suppose you have table Student:
StudentId | FullName | Address | AverageGrade
1 | John Done | New York, US | 3.4
2 | Robert Bored | New York, US | 0
3 | Student LName | Dallas, US | 1
4 | Another LName | Munich, Germany | 2
This design implies that you never query the first name and last name separately, but always need the whole name at once, for example:
SELECT FullName
FROM Student
WHERE StudentId = 1;
John Done
And when you need First name, Last name separately, you decompose them into several columns, for example:
StudentId | FirstName | LastName | Address | AverageGrade
1 | John | Done | New York, US | 3.4
2 | Robert | Bored | New York, US | 0
3 | Student | LName | Dallas, US | 1
4 | Another | LName | Munich, Germany | 2
And your queries might look like this:
SELECT LastName, AverageGrade
FROM Student
WHERE AverageGrade >= 1 AND FirstName != 'John';
The result will be:
| LastName | AverageGrade |
---------------------------
| LName | 1 |
| LName | 2 |
Or something like this maybe:
UPDATE Student
SET AverageGrade = 4
WHERE LastName = 'LName' AND FirstName != 'Student'
Basically, the decision depends on how you manipulate the data and in which form you need it.
To sum up: whether the relation is in 1NF depends on what values you're trying to store in the table. As I mentioned above, one column should store only one type of value, e.g. ID, Address, Name, etc. How your columns' values look depends on the design and on how you NEED TO STORE the data. If you do not need to query firstname, middlename, lastname and secondname separately, then you can just save all of them in one column FullName and it will still be in 1NF. If you do need them separately, you can store them in separate columns, and again it will still be in 1NF, but it might violate other rules.
Here are some tutorials you might find useful: https://www.studytonight.com/dbms/first-normal-form.php
Let the application, and how it will be used, guide you as to what data should be split further into additional fields (or not).
For example:
If, in your application, you are constantly splitting first name from last name so that you can say "Hi Joe" on correspondence, you should split fullName into two fields. Conversely, if you had two fields firstName and lastName, and were always concatenating them so that you could correctly address an envelope, it would make more sense to store the name in a single column in your table.
In practice, it is not uncommon for a database to show some de-normalization with the above example given how common both scenarios are but the risk is that they get out of sync if someone updates first name (for example) but doesn't update fullName.
Consider things like how you will force your users to follow a certain pattern if you decide to go with a single column fullName. How would you prevent "Smith, Joe" if your application needed "Joe Smith"?
Dates are another good example and again, whether you split the parts into separate columns depends on how they will be used.
A datetime field which indicates when a row was inserted probably doesn't need to be split out, but if you had many queries which were only interested in the year (for example), it might make sense to split it out.
This only scratches the surface which is why this answer is more about how to think about the underlying problem. Yes normalizing your database is important for all kinds of reasons, but how far you go with it depends on how your data will be used at the end of the day.

Is this table in first normal form?

I am currently studying SQL normal forms.
Let's say I have the following table, where the primary key is userid:
userid FirstName LastName Phone
1 John Smith 555-555
1 Tim Jack 432-213
2 Sarah Mit 454-541
3 Tom jones 987-125
The book I'm reading states the following conditions must be true in order for a table to be in 1st normal form.
1. Rows contain data about an entity.
2. Columns contain data about attributes of the entities.
3. All entries in a column are of the same kind.
4. Each column has a unique name.
5. Cells of the table hold a single value.
6. The order of the columns is unimportant.
7. The order of the rows is unimportant.
8. No two rows may be identical.
9. A primary key must be assigned.
I'm not sure whether my table violates the 8th rule, "No two rows may be identical".
The first two records in my table
1 John Smith 555-555
1 Tim Jack 432-213
share the same userid. Does that mean that they are considered duplicate rows?
Or does "duplicate" mean that every piece of data in the row has to be the same for the record to be considered a duplicate row, as in the example below?
1 John Smith 555-555
1 John Smith 555-555
EDIT1: Sorry for the confusion
The question I was trying to ask is simple
Is this table below in 1st normal form?
userid FirstName LastName Phone
1 John Smith 555-555
1 Tim Jack 432-213
2 Sarah Mit 454-541
3 Tom jones 987-125
Based on the 9 rules given in the textbook I think it is, but I wasn't sure whether rule 8, "No two rows may be identical", was being violated because two records use the same primary key.
The class textbook and professor aren't really clear on this subject, which is why I am asking this question.
Or does duplicate records mean that every piece of data in the row has to be the same for the record to be considered a duplicate row see example below?
They mean the latter of your choices. Entire rows are what must be "identical". It's ok if two rows share the same values for one or more columns as long as one or more columns differ.
That's because a relation holds a set of values that are tuples/rows/records, and a set is a collection of values that are all different.
But SQL & some relational algebras have different notions of "identical" in the case of NULLs compared to the relational model without NULLs. You should read what your textbook says about it if you want to know exactly what they mean by it. Two rows that have NULL in the same column are considered different. (Point 9 might be summarizing something involving NULLs. Depends on the explanation in the book.)
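The whole-row comparison described above can be sketched with Python tuples and sets, using the question's four rows. This is only an illustration of the set idea; it ignores the NULL subtleties just mentioned, and the variable names are made up for the example.

```python
# The question's table as a list of rows: (userid, FirstName, LastName, Phone).
rows = [
    (1, "John", "Smith", "555-555"),
    (1, "Tim", "Jack", "432-213"),
    (2, "Sarah", "Mit", "454-541"),
    (3, "Tom", "jones", "987-125"),
]

# Rule 8 compares entire rows: a set drops exact duplicates, so the sizes
# differ only when two rows agree in every column.
has_identical_rows = len(set(rows)) != len(rows)

# A separate check: do the userid values alone distinguish the rows?
user_ids = [row[0] for row in rows]
userid_is_unique = len(set(user_ids)) == len(user_ids)

print(has_identical_rows)  # no two rows are identical here
print(userid_is_unique)    # userid 1 appears twice
```

So the sample rows do not break rule 8. The repeated userid is a different issue: a column whose values repeat cannot act as a primary key for these rows.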
PS
There's no single notion of what a relation is. There is no single notion of "identical". There is no single notion of 1NF.
Points 3-8 are better described as (poor) ways of restricting how to interpret a picture of a table to get a relation. Your textbook seems to be (strangely) making "1NF" a property of such an interpretation of a picture of a table. Normally we simply define a relation to be a certain thing so if you have one then it has to have the defined properties. Then "in 1NF" applies to a relation & either means "is a relation" & isn't further used or it means certain further restrictions hold. A relation is a set of tuples/rows/records, and in the kind of relation your 3-8 describes they are sets of attribute/column/field name-value pairs & the values paired with a name have to be of the type paired with that name in some schema/heading that is a set of name-type pairs that is defined either as part of the relation or external to it.
Your textbook doesn't seem to present things clearly. Its definition of "1NF" is also idiosyncratic in that although 3-8 are mathematical, 1 & 2 are informal/heuristic (& 9 could be either or both).

SQL Sum and Group By for a running Tally?

I'm completely rewriting my question to simplify it. Sorry if you read the prior version. (The previous version of this question included a very complex query example that created a distraction from what I really need.) I'm using SQL Express.
I have a table of lessons.
LessonID StudentID StudentName LengthInMinutes
1 1 Chuck 120
2 2 George 60
3 2 George 30
4 1 Chuck 60
5 1 Chuck 10
These would be ordered by date. (Of course the actual table is thousands of records with dates and other lesson-related data but this is a simplification.)
I need to query this table such that I get all rows (or a subset of rows by a date range or by student), but I need my query to add a new column we might call PriorLessonMinutes. That is, the sum of all minutes of all lessons for the same student in lessons of PRIOR dates only.
So the query would return:
LessonID StudentID StudentName LengthInMinutes PriorLessonMinutes
1 1 Chuck 120 0
2 2 George 60 0
3 2 George 30 60 (The sum Length from row 2 only)
4 1 Chuck 60 120 (The sum Length from row 1 only)
5 1 Chuck 10 180 (The sum of Length from rows 1 and 4)
In essence, I need a running tally of the sum of prior lesson minutes for each student. Ideally the tally shouldn't include the current row, but if it does, no big deal as I can do subtraction in the code that receives the query.
Further, (and this is important) if I retrieve only a subset of records, (for example by a date range) PriorLessonMinutes must be a sum that considers rows that are NOT returned.
My first idea was to use SUM() and to GROUP BY Student, but that isn't right because unless I'm mistaken it would include a sum of minutes for all rows for each student, including rows that come after the row which aren't relevant to the sum I need.
OPTIONS I'M REJECTING: I could scan through all rows in my code that receives it, (although this would force me to retrieve all rows unnecessarily) but that's obviously inefficient. I could also put a real data field in there and populate it, but this too presents problems when other records are deleted or altered.
I have no idea how to write such a query together. Any guidance?
This is a great opportunity to use windowed aggregates. The catch is that the ROWS BETWEEN frame requires SQL Server 2012 (Express included) or later. If you can get it, then this is the query you are looking for:
select *,
sum(LengthInMinutes)
over (partition by StudentId order by LessonId
rows between unbounded preceding and 1 preceding)
as PriorLessonMinutes
from Lessons
Note that it returns NULLs instead of 0s (zeroes). If you insist on zeroes, use COALESCE function to turn NULLs into zeroes.
I suggest using a nested query to limit the number of rows returned:
select * from
(
select *,
sum(LengthInMinutes)
over (partition by StudentId order by LessonId
rows between unbounded preceding and 1 preceding)
as PriorLessonMinutes
from Lessons
) as NestedLessons
where LessonId > 3 -- this is an example of a filter
This way the filter is applied after the aggregation is complete.
Now, if you want to apply a filter that doesn't affect the aggregation (like only querying data for a certain student), you should apply the filter to the inner query, as pruning the rows that don't affect the computation early (like data for other students) will improve the performance.
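If a version with window frames is not available, a correlated subquery is a different technique that yields the same column. The sketch below is an assumption-laden illustration, not the answer's method: it runs against SQLite through Python's sqlite3 module rather than SQL Server, uses the question's five sample rows, and orders by LessonID as a stand-in for the lesson date, as the queries above do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Lessons (LessonID INTEGER, StudentID INTEGER, "
    "StudentName TEXT, LengthInMinutes INTEGER)"
)
conn.executemany(
    "INSERT INTO Lessons VALUES (?, ?, ?, ?)",
    [
        (1, 1, "Chuck", 120),
        (2, 2, "George", 60),
        (3, 2, "George", 30),
        (4, 1, "Chuck", 60),
        (5, 1, "Chuck", 10),
    ],
)

result = conn.execute(
    """
    SELECT l.LessonID, l.StudentID, l.StudentName, l.LengthInMinutes,
           -- Sum the same student's earlier lessons; COALESCE turns the
           -- NULL of an empty sum into 0.
           COALESCE((SELECT SUM(p.LengthInMinutes)
                     FROM Lessons AS p
                     WHERE p.StudentID = l.StudentID
                       AND p.LessonID < l.LessonID), 0) AS PriorLessonMinutes
    FROM Lessons AS l
    WHERE l.LessonID > 3   -- example filter; the subquery still sees every row
    ORDER BY l.LessonID
    """
).fetchall()

for row in result:
    print(row)
```

Because the subquery reads the whole table, the outer filter does not change the tally: lessons 4 and 5 still report 120 and 180 prior minutes even though lesson 1 is not returned.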
I think the following code will serve your purpose. Check it:
select Students.StudentID ,Students.First, Students.Last,sum(Lessons.LengthInMinutes)
as TotalPriorMinutes from lessons,students
where Lessons.StartDateTime < getdate()
and Lessons.StudentID = Students.StudentID
and StartDateTime >= '20090130 00:00:00' and StartDateTime < '20790101 00:00:00'
group by Students.StudentID ,Students.First, Students.Last

SQL Server 2008: Pivot column with no aggregate function workaround

Yes I know, this question has been asked MANY times, but after reading all the posts I found that there wasn't an answer that fits my need. So, here's my question. I would like to take a column of values and pivot them into rows of 6 columns.
I want to take this single column:
G
081278
12
00123535
John Doe
123456
And turn it into this:
Letter | Date | Code | Ammount | Name | Account
G | 081278 | 12 | 00123535 | John Doe | 123456
I have 110000 values in this one column in one table called TempTable. I need all the values displayed because each row is an entity in itself. For instance, there is one unique entry for each of the Letter, Date, Code, Ammount, Name, and Account columns. I understand that an aggregate function is required, but is there a workaround that will allow me to get this desired result?
Just use a MAX aggregate
If one row = one column (per group of 6 rows) then MAX of a single value = that row value.
However, the data you've posted is insufficient. I don't see anything to:
associate the 6 rows per group
distinguish whether a row is "Letter" or "Name"
There is no implicit row order or row number to rely upon to generate the groups.
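To make the MAX idea concrete, here is a minimal sketch that assumes the missing piece exists: a row-number column, called rn here, which is hypothetical and not part of the question's TempTable. It runs on SQLite through Python's sqlite3 module rather than SQL Server 2008, uses conditional aggregation instead of the PIVOT operator, and the second group of six values is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# rn is a hypothetical ordering column; the question's table has no such column.
conn.execute("CREATE TABLE TempTable (rn INTEGER, val TEXT)")
values = [
    "G", "081278", "12", "00123535", "John Doe", "123456",   # from the question
    "H", "091278", "13", "00456789", "Jane Roe", "654321",   # made-up second group
]
conn.executemany(
    "INSERT INTO TempTable VALUES (?, ?)", list(enumerate(values, start=1))
)

pivoted = conn.execute(
    """
    SELECT MAX(CASE WHEN (rn - 1) % 6 = 0 THEN val END) AS Letter,
           MAX(CASE WHEN (rn - 1) % 6 = 1 THEN val END) AS Date,
           MAX(CASE WHEN (rn - 1) % 6 = 2 THEN val END) AS Code,
           MAX(CASE WHEN (rn - 1) % 6 = 3 THEN val END) AS Ammount,
           MAX(CASE WHEN (rn - 1) % 6 = 4 THEN val END) AS Name,
           MAX(CASE WHEN (rn - 1) % 6 = 5 THEN val END) AS Account
    FROM TempTable
    GROUP BY (rn - 1) / 6      -- integer division: six consecutive rows per group
    ORDER BY (rn - 1) / 6
    """
).fetchall()

for row in pivoted:
    print(row)
```

Each CASE expression matches exactly one row per group, so MAX simply returns that single value and nothing is actually aggregated away.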
Unfortunately, the max columns in a SQL 2008 select statement is 4,096 as per MSDN Max Capacity.
Instead of using a pivot, you might consider dynamic SQL to get what you want to do.
-- Build a comma-separated list of quoted values, then select them as columns.
DECLARE @SQLColumns nvarchar(max), @SQL nvarchar(max)
SELECT @SQLColumns = (SELECT '''' + ColName + ''',' FROM TableName FOR XML PATH(''))
-- Drop the trailing comma.
SET @SQLColumns = LEFT(@SQLColumns, LEN(@SQLColumns) - 1)
SET @SQL = 'SELECT ' + @SQLColumns
EXEC sp_executesql @SQL