Pig: merge lists without key

In Apache Pig 0.15, I have two simple lists (WITHOUT id/primary key, etc.) that I want to merge together to create one list of tuples with two columns. Example:
Names
-----
Peter
John
Anne
Ages
-----
45
23
44
I want to end up with:
Names Age
---------------
Peter 45
John 23
Anne 44
I know I can use RANK on both lists and then JOIN, but that looks way too costly as I have millions of entries in these lists. I kind of want to do a JOIN with "merge" without having a join parameter...
Any idea about how to do this efficiently in Apache Pig?

If you do not care about the mapping between Age and Name, then you can try a cross join between the two relations. After the cross join, group by name and keep any one row from each group. However, IMO this may be even more costly (or rather, resource-intensive) than the RANK approach you mentioned above.
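For comparison, the RANK-and-JOIN approach mentioned in the question would look roughly like this (a sketch; the relation and field names names, ages, name, and age are illustrative):
names_r = RANK names;   -- prepends a rank_names column
ages_r  = RANK ages;    -- prepends a rank_ages column
joined  = JOIN names_r BY rank_names, ages_r BY rank_ages;
result  = FOREACH joined GENERATE name, age;
As far as I know Pig has no built-in positional "zip" of two relations, so some generated key (RANK, or a counter added while loading) is hard to avoid.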


How to select a column containing dot in column name in kdb

I have a table which consists of column named "a.b"
q)t:([]a.b:3?10.0; c:3?10; d:3?`3)
How can we select column a.b and c from table t?
How can we rename column a.b to b?
Is it possible to achieve above two cases without functional select?
Failed attempts:
q)select a.b, c from t
'type
q)?[`t;();0b;enlist (`b`c!`a.b`c)]
'type
q)select b:a.b from t
'type
As others have mentioned, .Q.id t will sanitise table column names if they aren't suitable for qSQL statements or performance in general.
`a.b`c#t
will only work for multiple column selects and
`a.b#t
will return a type error. However, you can get around this by enlisting the single item into the take operator, like so:
q)enlist[`a.b]#t
a.b
---------
4.931835
5.785203
0.8388858
q)(enlist`a.b)#t
a.b
---------
4.931835
5.785203
0.8388858
If you only need the values from a single column, another option is indexing: in this case, t[`a.b] returns all values from the a.b column.
You could also mix these selection styles like so, but ultimately lose the column name from a.b:
q)select c,t[`a.b] from t
c x
----------
8 4.707883
5 6.346716
4 9.672398
In a qSQL query the dot itself is used for foreign-key navigation, and it throws a type error because it cannot find any table corresponding to the foreign key it believes you have passed it.
As much as I hate answering any online forum question by refuting the premise, I really must here: do not use periods in column names, as it will cause trouble. .Q.id exists to sanitise column names for a reason.
The primary reason that errors are encountered is that the use of dot notation in qSQL is reserved for the resolution of linked columns. We can see how this is actually working by parsing the query itself
q)parse "select a.b from tab"
?
`tab
()
0b
(,`b)!,`a.b // Here the referencing of a linked column b via a is occurring
// Compared to a normal select
q)parse "select b from tab"
?
`tab
()
0b
(,`b)!,`b
Other issues could crop up depending on future processing, such as q attempting to treat the column names as namespaces, or operating on each part of the name with the dot operator.
Using dot notation in your column names will hamstring any further development and force all other kdb users to resort to roundabout methods; development will be slow and bug-prone.
I would advise that if periods must be included in the column, you create an API for external users to use to translate queries into the sanitised forms.
You can easily sanitise the whole table with .Q.id
q)tab:enlist `a.b`c`d!(1 2 3)
q)tab:.Q.id tab
q)sel:{[tab;cl] ?[tab;();0b;((),.Q.id each cl)!((),.Q.id each cl)]}
q)sel[tab;`a.b]
ab
--
1
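The same helper then works unchanged for multi-column selects (continuing with the tab defined above):
q)sel[tab;`a.b`c]
ab c
----
1  2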
How about the following, using take (#):
q) `a.b`c#t
a.b c
-----------
4.931835 1
5.785203 9
0.8388858 5
To rename:
q) `b xcol t
b c d
---------------
4.931835 1 mil
5.785203 9 igf
0.8388858 5 kao
You can use .Q.id to rename any unselectable columns:
q).Q.id t
ab c d
---------------
4.931835 1 mil
5.785203 9 igf
0.8388858 5 kao
Best to avoid dots in column names, and in symbols in general; use underscores if you must.

KDB: How to serialize a table for a union join within kdb-tick architecture?

I'm trying to modify the kdb-tick architecture to support a union join on incoming data and the local RDB table.
I have modified the upd function in the tick.q file to the following:
ups:{[t;x]
  ts"d"$a:.z.P;
  if[not -16=type first first x;
    a:"n"$a;
    x:$[0>type first x;a,x;(enlist(count first x)#a),x]];
  f:key flip value t;
  pub[t;$[0>type first x;enlist f!x;flip f!x]];
  if[l;l enlist (`ups;t;x);i+:1];};
With ups:uj subsequently set in the subscriber files.
My question relates to how one might serialize a table row before publishing it within the .u.ups[] function.
I.e. given a table:
second   | amount price
---------|--------------
02:46:01 | 54     9953.5
02:46:02 | 54     9953.5
02:46:03 | 54     9953.5
02:46:04 | 150    9953.5
02:46:05 | 150    9954.5
How should one serialize the first row 02:46:01 | 54 9953.5 such that it can be sent via the .u.ups function to subscribers whereby uj will be run between the row and the local table on the subscribers.
Thanks in advance for your advice.
Some of this might help:
You can't set ups:uj in the subscribers because the table name is being passed as a symbol so the subscriber will effectively try to do
uj[`tab1;tab2]
which won't work because uj doesn't accept table names (symbols) as input. You would have to instead set ups to
ups:{x set value[x] uj y}
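As a rough sanity check of that definition (throwaway table names, run in a plain q session):
q)ups:{x set value[x] uj y}
q)tab1:([]a:1 2)
q)ups[`tab1;([]a:3 4;b:`x`y)]
`tab1
q)tab1
a b
---
1
2
3 x
4 y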
A standard tickerplant is not designed to handle variable/changing schema - for good reason, it's generally not a good idea to have a schema that changes intraday. However your situation might warrant it so in that case you'd need to modify your .u.ups function to something like
\d .u
ups:{[t;x]
  ts"d"$a:.z.P;
  x:`time xcols update time:"n"$a from x;
  pub[t;$[98h=type x;x;1=count last x;enlist x;flip x]];
  if[l;l enlist (`ups;t;x);i+:1];};
\d .
and your feeder process would have to send kdb tables or kdb dictionaries to the .u.ups function. Since a feedhandler process is usually not a kdb process, it may or may not be possible to send tables/dictionaries to the tickerplant as normally the feedhandler would send lists (without column metadata). In your case you need to somehow supply the column metadata to the tickerplant on each update (or maybe you're doing that already?), as otherwise it won't know which columns are which.
In other words your feeder process could send either of the following:
(`.u.upd;`tab;([]col1:`a`b`c;col2:1 2 3))
(`.u.upd;`tab;`col1`col2!(`a;1))
(`.u.upd;`tab;`col1`col2!(`a`b;1 2))
I'm going to assume this is related to your previous few questions about disparate schemas. I'd like to suggest an alternative solution, which is only truly viable if you are using kdb version 3.6, which uses anymap. If you can narrow your schemas down to a minimal list of common columns, all other columns can be placed as dictionaries into a general column.
q)tab:([]sym:`$();col1:`float$();colGeneral:(::))
q)`tab upsert (`AAPL;3.454;(`colX`colY`colZ!(1;2.3;"abc")))
`tab
q)`tab upsert (`MSFT;3.0;(`colX`colY!(2;100.0)))
`tab
q)`tab upsert (`AMZN;100.0;((enlist `colX)!(enlist 10)))
`tab
q)tab
sym col1 colGeneral
----------------------------------------
AAPL 3.454 `colX`colY`colZ!(1;2.3;"abc")
MSFT 3 `colX`colY!(2;100f)
AMZN 100 (,`colX)!,10
q)select colGeneral from tab
colGeneral
-----------------------------
`colX`colY`colZ!(1;2.3;"abc")
`colX`colY!(2;100f)
(,`colX)!,10
q)select sym, colGeneral #\: `colX from tab
sym x
-------
AAPL 1
MSFT 2
AMZN 10
q)select sym, colGeneral #\: `colY from tab
sym x
---------
AAPL 2.3
MSFT 100f
AMZN 0N
With 3.6 you can save this to disk in any splayed format (splayed, partitioned, segmented) and still easily query the data. The storage of such a table will likely be sub-optimal due to the poor compression characteristics of the general column (assuming you wish to compress data), but it will be perfectly functional.
Integrating uj into the standard ingestion procedure on each update will be computationally expensive. Using the general-column-and-dictionary method will massively improve your ingestion speed. Below is a demonstration using the example given in a previous answer to a related question of yours:
q)table:()
q)row1:enlist `x`y`colX!(`AMZN;100.0;10)
q)table:table uj row1
q)\ts:100000 table:table uj row1
13828 6292352
q)\ts:100000 `tab upsert (`AMZN;100.0;((enlist `colX)!(enlist 10)))
117 12746880

Is this table in first normal form?

I am currently studying SQL normal forms.
Let's say I have the following table; the primary key is userid:
userid FirstName LastName Phone
1      John      Smith    555-555
1      Tim       Jack     432-213
2      Sarah     Mit      454-541
3      Tom       jones    987-125
The book I'm reading states that the following conditions must be true in order for a table to be in first normal form:
Rows contain data about an entity.
Columns contain data about attributes of the entities.
All entries in a column are of the same kind.
Each column has a unique name.
Cells of the table hold a single value.
The order of the columns is unimportant.
The order of the rows is unimportant.
No two rows may be identical.
A primary key must be assigned.
I'm not sure if my table violates the 8th rule, "No two rows may be identical."
Because the first two records in my table,
1 John Smith 555-555
1 Tim Jack 432-213
share the same userid, does that mean they are considered duplicate rows?
Or do duplicate records mean that every piece of data in the row has to be the same for the record to be considered a duplicate? See the example below:
1 John Smith 555-555
1 John Smith 555-555
EDIT1: Sorry for the confusion
The question I was trying to ask is simple
Is this table below in 1st normal form?
userid FirstName LastName Phone
1      John      Smith    555-555
1      Tim       Jack     432-213
2      Sarah     Mit      454-541
3      Tom       jones    987-125
Based on the nine rules given in the textbook I think it is, but I wasn't sure whether rule 8 ("No two rows may be identical") was being violated by two records that use the same primary key.
The class textbook and prof aren't really clear on this subject, which is why I am asking this question.
Or do duplicate records mean that every piece of data in the row has to be the same for the record to be considered a duplicate row? See the example below.
They mean the latter of your choices: entire rows are what must be "identical". It's OK if two rows share the same values for one or more columns, as long as one or more columns differ.
That's because a relation holds a set of values that are tuples/rows/records, and a set is a collection of values that are all different.
But SQL & some relational algebras have different notions of "identical" in the case of NULLs compared to the relational model without NULLs. You should read what your textbook says about it if you want to know exactly what they mean by it. Two rows that have NULL in the same column are considered different. (Point 9 might be summarizing something involving NULLs. Depends on the explanation in the book.)
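To make the distinction concrete, here is a minimal SQL sketch (the users table name is illustrative): with userid declared as the primary key, the DBMS would reject the second sample row outright, while two rows that differ in any column remain perfectly legal.
CREATE TABLE users (
    userid    INT PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName  VARCHAR(50),
    Phone     VARCHAR(20)
);
INSERT INTO users VALUES (1, 'John', 'Smith', '555-555'); -- accepted
INSERT INTO users VALUES (1, 'Tim',  'Jack',  '432-213'); -- rejected: duplicate key value for userid
INSERT INTO users VALUES (2, 'John', 'Smith', '555-555'); -- accepted: differs from row 1 in userid only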
PS
There's no single notion of what a relation is. There is no single notion of "identical". There is no single notion of 1NF.
Points 3-8 are better described as (poor) ways of restricting how to interpret a picture of a table to get a relation. Your textbook seems (strangely) to make "1NF" a property of such an interpretation of a picture of a table. Normally we simply define a relation to be a certain thing, so if you have one then it has the defined properties. Then "in 1NF" applies to a relation and either means "is a relation" and isn't further used, or it means that certain further restrictions hold. A relation is a set of tuples/rows/records; in the kind of relation your points 3-8 describe, these are sets of attribute/column/field name-value pairs, and the values paired with a name have to be of the type paired with that name in some schema/heading, i.e. a set of name-type pairs defined either as part of the relation or external to it.
Your textbook doesn't seem to present things clearly. Its definition of "1NF" is also idiosyncratic in that although points 3-8 are mathematical, 1 and 2 are informal/heuristic (and 9 could be either or both).

Need help building complex multi-table queries

This question is something that a lot of people learning bioinformatics and new to DNA data analysis are struggling with:
Let's say I have 20 tables with the same column headings. Each table represents a patient sample, and each row represents a locus (site) which has mutated in that sample. Each site is uniquely identified by two columns together, chromosome number and base number (e.g. 1 and 43535, 1 and 33456, 1 and 3454353). Several columns give different characteristics of each mutation, including a column called Gene which gives the gene at that site. Multiple sites can be mutated in a gene, meaning the Gene column can have the same value multiple times in one table.
I want to query all these tables at the same time by, let's say, Gene: I input a value from the Gene column, and I want as output the names of all the tables (samples) in which that gene name is present in the Gene column, and also (preferably) the entire line(s) for each sample, so that I can compare the characteristics of the mutation in that gene across multiple samples on one output page.
I also want to input a number, say 4, and get as output a list of genes which have mutated in at least 4 of 20 patients (a list of genes whose names appear in the Gene column in at least 4 of the 20 tables).
What is the "easiest way" to do this? What is the "best way" assuming I want to make more flexible queries, besides these two?
I am an MD and do not have any particular software expertise, but I am willing to put in the necessary time to build this query system. A few lines of code won't put me off.
Example data:
Func   Gene   ExonicFunc        Chr Start    End      Ref Obs
exonic ACTRT2 nonsynonymous SNV 1   2939346  2939346  G   A
exonic EIF4G3 nonsynonymous SNV 1   21226201 21226201 G   A
exonic CSMD2  nonsynonymous SNV 1   34123714 34123714 C   T
This is just a third of the columns. Multiple columns were removed to fit the page size here...
Thank you.
Create a view that unions all the tables together. You should probably add information about which table each row comes from:
create view allpatients as
select 'a' as whichtable, t.*
from tableA t
union all
select 'b' as whichtable, t.*
from tableB t
...
You might find that it is easier to "instantiate" the view by creating a table with all patients. Just have a stored procedure that recreates the table by combining the 20 tables.
Alternatively, you could find that you have large individual tables (millions of rows). In this case, you would want to treat each of the original tables as a partition.
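With the view (or instantiated table) in place, both of the original requests reduce to short queries (a sketch; Gene is a column name from the example data, whichtable comes from the view above, and the gene value is illustrative):
-- All samples (tables) in which a given gene is mutated, with the full rows:
select *
from allpatients
where Gene = 'CSMD2';

-- Genes mutated in at least 4 of the 20 patients:
select Gene
from allpatients
group by Gene
having count(distinct whichtable) >= 4;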
If what you have is a bunch of Excel files, you can import them all into the same table, with a distinct column for patient id. There is no need to create 20 different tables for this; in fact, it would be a bad idea.
Once you do, go to Access' query design, SQL view and use these queries:
To create a query that returns all fields for the input gene name:
select *
from gene_data
where gene = [GeneName]
To create a query that returns gene names that are mutated in at least 4 samples:
select gene
from
(select gene, sample_id
from gene_data
group by gene, sample_id) g
group by gene
having count(sample_id) >= 4
After this, change to design view -- you'll see how to create similar queries using the GUI.

Incredibly slow Materialized View creation when using string aggregation, any performance suggestions?

I've got a load of materialized views; some of them take just a few seconds to create and refresh, whereas others can take up to 40 minutes to build, if SQL Developer doesn't crash before that.
I need to aggregate some strings in my query, and I have the following function
create or replace
function stragg
( input varchar2 )
return varchar2
deterministic
parallel_enable
aggregate using stragg_type
;
Then, in my MV I use a select statement such as
SELECT
hse.refno,
STRAGG (DISTINCT per.person_name) AS persons
FROM
HOUSES hse,
PERSONS per -- join predicate omitted here
GROUP BY
hse.refno
This is great, because it gives me the following:
refno persons
1 Dave, John, Mary
2 Jack, Jill
Instead of :
refno persons
1 Dave
1 John
1 Mary
2 Jack
2 Jill
It seems that when I use this STRAGG function, the time it takes to create or refresh an MV increases dramatically. Is there an alternative method to achieve a comma-separated list of values? I use this throughout my MVs, so it's quite a required feature for me.
Thanks
There are a number of techniques for string aggregation at the link below. They might provide better performance for you.
http://www.oracle-base.com/articles/misc/StringAggregationTechniques.php
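One technique from that list worth singling out: if you are on Oracle 11g Release 2 or later, the built-in LISTAGG aggregate usually outperforms a user-defined aggregate like STRAGG by a wide margin. A sketch against the HOUSES/PERSONS example (the refno join column is an assumption, and note that LISTAGG only supports DISTINCT from Oracle 19c onwards):
SELECT
    hse.refno,
    LISTAGG(per.person_name, ', ') WITHIN GROUP (ORDER BY per.person_name) AS persons
FROM houses hse
JOIN persons per
    ON per.refno = hse.refno  -- assumed join column
GROUP BY hse.refno;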