There are several transposing questions on Stackoverflow, but looking at few non of them is really similar to my problem. The main difference being: the have a predefined set of columns.
Let's say my table looks like this:
ID Name Value
---------------
1 Set Mitch
2 Get Jane
3 Push Dave
4 Pull Mike
5 Dummy John
...
I'd like to transpose it to become:
Set Get Push Pull Dummy ...
----------------------------------
Mitch Jane Dave Mike John ...
It looks like you're looking for a "dynamic pivot table". See the example here, or Google that term for more information:
http://www.kodyaz.com/articles/t-sql-pivot-tables-in-sql-server-tutorial-with-examples.aspx
Do you need to do this in the SQL? It seems pretty trivial if you can just do it after you get a SELECT * query into an array you can manipulate at will.
Related
I would like to update a column in a table that contains names with several characters added as suffix. Is it possible to remove these characters from this specific column in some way as there is not any other condition?
For example I have got names such as
Joe Doe Nm and I want to update it to Joe Doe
John Crow Nkk and update it to John Crow
George Stavrou_Ngf and same as above (George Stavrou)
and similar style regarding names.
Regards
I have a very big -super big- database of names.
The task is to find all the similar names (of the same person per se) despite some diffrences like :
first name, second name inversed --> John Doe & Doe John
two names or more (same ones) with light changes, maybe some
letters misplaced or something else --> Jonh Doe & John Deo
two names with some letters added --> Johhn Doe & Johnn Doee &
John Doe
names where another middle name inserted --> John Blair Campbell Doe & John Blair Doe
And so on..
I tried using the classical methods like soundex and leveshtein but the results were not very good, had results like : Amine depi and Amina dope are in the same group while they're diffrent
and It would take very long to perform the task on just a fraction on the data, as for my database, it would directly crash after a long time
I also thought of using another approach like cosine which uses numerical values and I though of finding a way of representing the names in a numerical way, or convert them (something like word2vec), I actually though of using directly word2vec with the whole database of namems as the text, but as expected it didn't work. Tried to codify the names in a low level way, like code ASCII for exemple, but the results weren't good neither.
So I thought of Clustering.
So I tried using DBSCAN. I found a way to use DBSCAN clustering with a custom distance metric and used leveshtein distance. (If you ask me why DBSCAN? It is because I don't know the numbers of similar groups of names which are in the database in the beginning)
I did have some results, but very poor performance overall. It would either give the same exact ones, John Doe and John Doe int he same cluster, or nothing at all, and would even skip some exact ones.
Do you have a suggestion for performing this task ? preferbly using clsutering or another smart way since the database is very big (more than 500 000 line and up to millions ) so I cannot iterate alot.
I am open to suggestions or propositions !
Especially if you worked on something like this previously or similar to this, Thank you in advance.
Try AgglomerativeClustering.
Sample code:
clustering = AgglomerativeClustering(
n_clusters=None,
distance_threshold=0.3 # smaller threshold meaning more strict similarities, and more clusters
).fit(your_vectorized_name_list)
print(f'total clusters: {clustering.n_clusters_}')
I am currently studying SQL normal forms.
Lets say I have the following table the primary key is userid
userid FirstName LastName Phone
1 John Smith 555-555
1 Tim Jack 432-213
2 Sarah Mit 454-541
3 Tom jones 987-125
The book I'm reading states the following conditions must be true in order for a table to be in 1st normal form.
Rows contain data about an entity.
Columns contain data about attributes of the entities.
All entries in a column are of the same kind.
Each column has a unique name.
Cells of the table hold a single value.
The order of the columns is unimportant.
The order of the rows is unimportant.
No two rows may be identical.
A primary key Must be assigned
I'm not sure if my table violates the
8th rule No two rows may be identical.
Because the first two records in my table
1 John Smith 555-555
1 Tim Jack 432-213
share the same userid does that mean that they are considered
duplicate rows?
Or does duplicate records mean that every peace of data in the row
has to be the same for the record to be considered a duplicate row
see example below?
1 John Smith 555-555
1 John Smith 555-555
EDIT1: Sorry for the confusion
The question I was trying to ask is simple
Is this table below in 1st normal form?
userid FirstName LastName Phone
1 John Smith 555-555
1 Tim Jack 432-213
2 Sarah Mit 454-541
3 Tom jones 987-125
Based on the 9 rules given in the textbook I think it is but I wasn't sure that
if rule 8 No two rows may be identical
was being violated because of two records that use the same primary key.
The class text book and prof isn't really that clear on this subject which is why I am asking this question.
Or does duplicate records mean that every peace of data in the row has to be the same for the record to be considered a duplicate row see example below?
They mean that--the latter of your choices. Entire rows are what must be "identical". It's ok if two rows share the same values for one or more columns as long as one or more columns differ.
That's because a relation holds a set of values that are tuples/rows/records, and set is a collection of values that are all different.
But SQL & some relational algebras have different notions of "identical" in the case of NULLs compared to the relational model without NULLs. You should read what your textbook says about it if you want to know exactly what they mean by it. Two rows that have NULL in the same column are considered different. (Point 9 might be summarizing something involving NULLs. Depends on the explanation in the book.)
PS
There's no single notion of what a relation is. There is no single notion of "identical". There is no single notion of 1NF.
Points 3-8 are better described as (poor) ways of restricting how to interpret a picture of a table to get a relation. Your textbook seems to be (strangely) making "1NF" a property of such an interpretation of a picture of a table. Normally we simply define a relation to be a certain thing so if you have one then it has to have the defined properties. Then "in 1NF" applies to a relation & either means "is a relation" & isn't further used or it means certain further restrictions hold. A relation is a set of tuples/rows/records, and in the kind of relation your 3-8 describes they are sets of attribute/column/field name-value pairs & the values paired with a name have to be of the type paired with that name in some schema/heading that is a set of name-type pairs that is defined either as part of the relation or external to it.
Your textbook doesn't seem to present things clearly. It's definition of "1NF" is also idiosyncratic in that although 3-8 are mathematical, 1 & 2 are informal/heuristic (& 9 could be either or both).
I have a field from the data I am reading in that can contain multiple values. They are essentially tags.
For example, there could be a column called "persons responsible". This could read "Joe; Bob; Sue" or "Sue" for a given row.
Is it possible from within Tableau to read these in as separate categories? So that for this sample data:
Project | Persons
---------------------------
Zeta | Bob; Sue; Joe
Enne | Sue
Doble Ve | Bob
There could be a count of Bob (2), Sue (2), Joe (1)?
I am working on getting better data inputs, but I was wondering if there was a temporary solution at this level.
I would definitely work towards normalizing your schema.
In the meantime, there is a workaround that is almost reasonable if there is a small set of possible values for the tags (persons in your example).
If Bob, Sue and Joe are the only people in the system, you can use the contains() function to define a boolean calculated field for each person -- e.g. Bob_Is_Responsible = contains(Persons, 'Bob"), and similar fields for Sue and Joe. Then you could use those as building blocks, possibly with sets, to break the data up in different ways.
Of course, this approach gets cumbersome fast if the number of tags grows, or if it is unconstrained. But you asked for a temporary solution ...
If the number of elements is small, you write and union several queries with each one having the project and nth element.
Ideally, you'd reshape your data to look like this either in the database or with the above mentioned union technique. Then you could count() or countd() the elements by project.
Project | Persons
---------------------------
Zeta | Bob
Zeta | Sue
Zeta | Joe
Enne | Sue
Doble Ve | Bob
I've got a load of materialized views, some of them take just a few seconds to create and refresh, whereas others can take me up to 40 minutes to compile, if SQLDeveloper doesn't crash before that.
I need to aggregate some strings in my query, and I have the following function
create or replace
function stragg
( input varchar2 )
return varchar2
deterministic
parallel_enable
aggregate using stragg_type
;
Then, in my MV I use a select statement such as
SELECT
hse.refno,
STRAGG (DISTINCT per.person_name) as PERSONS
FROM
HOUSES hse,
PERSONS per
This is great, because it gives me the following :
refno persons
1 Dave, John, Mary
2 Jack, Jill
Instead of :
refno persons
1 Dave
1 John
1 Mary
2 Jack
2 Jill
It seems that when I use this STRAGG function, the time it takes to create/refresh an MV increases dramatically. Is there an alternative method to achieve a comma separate list of values? I use this throughout my MVs so it is quite a required feature for me
Thanks
There are a number of techniques for string aggregation at the link below. They might provide better performance for you.
http://www.oracle-base.com/articles/misc/StringAggregationTechniques.php