Group data column-wise into a single dataset in Pig - group-by

I have data like
(1,a,null)
(4,b,null)
(7,c,null)
(1,a,3)
(4,b,6)
(7,c,9)
I want each distinct value and its count for each column, like
(1,2)
(4,2)
(7,2)
(a,2)
(b,2)
(c,2)
(null,3)
(3,1)
(6,1)
(9,1)
Here the first 3 records are the distinct values and their counts from the first column, the next 3 records are from the second column, and the last 4 records are from the third column.
For this I am trying to group the values by each column individually:
x = LOAD 'path/to/file' USING PigStorage(',') AS (col1:int, col2:chararray, col3:chararray);
B1 = GROUP x BY col1;
B2 = GROUP x BY col2;
B3 = GROUP x BY col3;
C1 = FOREACH B1 GENERATE (chararray)group AS col1, COUNT(x);
C2 = FOREACH B2 GENERATE (chararray)group AS col2, COUNT(x);
C3 = FOREACH B3 GENERATE (chararray)group AS col3, COUNT(x);
Dumping the grouped relations gives the data separately:
B1 result is:
(1,{(1,a,null),(1,a,3)})
(4,{(4,b,null),(4,b,6)})
(7,{(7,c,null),(7,c,9)})
B2 result is:
(a,{(1,a,null),(1,a,3)})
(b,{(4,b,null),(4,b,6)})
(c,{(7,c,null),(7,c,9)})
B3 result is:
(null,{(1,a,null),(4,b,null),(7,c,null)})
(3,{(1,a,3)})
(6,{(4,b,6)})
(9,{(7,c,9)})
But I want all the datasets' data placed in a single dataset as below, without using the UNION keyword:
(1,{(1,a,null),(1,a,3)})
(4,{(4,b,null),(4,b,6)})
(7,{(7,c,null),(7,c,9)})
(a,{(1,a,null),(1,a,3)})
(b,{(4,b,null),(4,b,6)})
(c,{(7,c,null),(7,c,9)})
(null,{(1,a,null),(4,b,null),(7,c,null)})
(3,{(1,a,3)})
(6,{(4,b,6)})
(9,{(7,c,9)})
Thanks in advance

Read the data as lines and do a word count. TOKENIZE splits each line on commas (among other delimiters), so every column value becomes its own token, and the word count then yields the distinct values and their counts across all columns in a single dataset.
Script
raw_data = LOAD 'test3.txt' AS (line:chararray);
words = FOREACH raw_data GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;
Output (ordering may vary):
(1,2)
(4,2)
(7,2)
(a,2)
(b,2)
(c,2)
(null,3)
(3,1)
(6,1)
(9,1)
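A variant that keeps the comma-separated LOAD from the question packs each row's values into a bag and flattens it. This is a sketch using Pig's built-in TOBAG, with COUNT_STAR rather than COUNT so the null group is counted too:
x = LOAD 'path/to/file' USING PigStorage(',') AS (col1:chararray, col2:chararray, col3:chararray);
-- Turn each row into one single-field tuple per column value
vals = FOREACH x GENERATE FLATTEN(TOBAG(col1, col2, col3)) AS val;
grouped = GROUP vals BY val;
-- COUNT_STAR counts every tuple in the bag, including the null values
counts = FOREACH grouped GENERATE group, COUNT_STAR(vals);
DUMP counts;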
Related

Drupal 8 and PostgreSQL: How can I use the result of an expression inside a WHERE clause?

Hi there!
I construct a database query using Drupal's query interface. With that, I also successfully add an expression that generates a conditional value. The content of that conditional field is an ISO-formatted date string.
I now would like to use that computed field inside my WHERE clause. In another project using MySQL I was able to just use the name carrying the expression result inside the WHERE clause, but that does not seem to work in PostgreSQL. I always get a "column does not exist" error, whether or not I prefix the expression name with the table alias.
The (raw and shortened) query looks like the following (still contains the value placeholders):
SELECT
  base.nid AS nid,
  base.type AS type,
  date.field_date_value AS start,
  date.field_date_end_value AS "end",
  CASE WHEN COALESCE(LENGTH(date.field_date_end_value), 0) > 0
       THEN date.field_date_end_value
       ELSE date.field_date_value
  END AS datefilter
FROM {node} base
LEFT JOIN {node__field_date} date
  ON base.type = date.bundle
  AND base.nid = date.entity_id
  AND base.vid = date.revision_id
  AND attr.langcode = date.langcode
  AND date.deleted = 0
WHERE (base.type = :db_condition_placeholder_0)
  AND (datefilter > :db_condition_placeholder_2)
ORDER BY field_date_value ASC NULLS FIRST
As already stated, it makes no difference whether I apply the filter as "date.datefilter" or just "datefilter". The field/expression turns out just fine in the result list if I do not attempt to use it in the WHERE part. So, how can I use the result of the expression to filter the query result?
Any hint is appreciated! Thanks in advance!
You can try to create a derived table that defines the new column "datefilter".
For example, instead of coding:
select case when c1 > 0 then c1 else -c1 end as c, c2, c3
from t
where c = 0;
ERROR: column "c" does not exist
LINE 1: ...c1 > 0 then c1 else -c1 end as c, c2, c3 from t where c = 0;
define a derived table that computes the new column and use it in an outer SELECT:
select c, c2, c3
from (select case when c1 > 0 then c1 else -c1 end as c, c2, c3 from t) as t2
where c = 0;
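Applied to the query from the question, the same pattern looks roughly like this (a sketch, untested against Drupal's query builder; the placeholders are kept as-is, and the stray attr alias is assumed to mean base):
SELECT nid, type, start, "end", datefilter
FROM (
  SELECT
    base.nid AS nid,
    base.type AS type,
    date.field_date_value AS start,
    date.field_date_end_value AS "end",
    CASE WHEN COALESCE(LENGTH(date.field_date_end_value), 0) > 0
         THEN date.field_date_end_value
         ELSE date.field_date_value
    END AS datefilter
  FROM {node} base
  LEFT JOIN {node__field_date} date
    ON base.type = date.bundle
    AND base.nid = date.entity_id
    AND base.vid = date.revision_id
    AND base.langcode = date.langcode  -- assumed alias; the question has attr.langcode
    AND date.deleted = 0
  WHERE base.type = :db_condition_placeholder_0
) AS sub
WHERE datefilter > :db_condition_placeholder_2
ORDER BY start ASC NULLS FIRST;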

Filter columns having count equal to the input file RDD count in Spark

I'm filtering integer columns from the input Parquet file with the logic below, and I've been trying to extend it with an additional validation: if any input column has a count equal to the input Parquet file's RDD count, I want to filter out that column as well.
Update
The number of columns and their names in the input file will not be static; they will change every time we get the file.
The objective is to also filter out any column whose count is equal to the input file RDD count. Filtering integer columns is already achieved with the logic below.
e.g. input parquet file count = 100
count of values in column A in the input file = 100
Filter out any such column.
Current Logic
// Get the array of StructFields for integer columns
val columns = df.schema.fields.filter(x =>
  x.dataType.typeName.contains("integer"))
// Select just those columns as a DataFrame
val z = df.select(columns.map(x => col(x.name)): _*)
// Get the column names as an array of strings
val m = z.columns
The new logic would be like (pseudocode):
val cnt = spark.read.parquet("inputfile").count()
val d = z.columns where column count is not equal to cnt
I do not want to pass the column names explicitly to the new condition, since the columns whose count equals the input file count will change (val d = ... above).
How do we write the logic for this?
According to my understanding of your question, you are trying to keep columns with integer as dataType whose distinct count is not equal to the row count of another input parquet file. If my understanding is correct, you can add the column count check to your existing filter as
val cnt = spark.read.parquet("inputfile").count()
val columns = df.schema.fields.filter(x =>
x.dataType.typeName.contains("string") && df.select(x.name).distinct().count() != cnt)
The rest of the code should follow as it is.
I hope the answer is helpful.
Jeanr and Ramesh suggested the right approach; here is what I did to get the desired output, and it worked :)
val cnt = inputfiledf.count()
val r = df.select(df.col("*")).where(df.col("MY_COLUMN_NAME").<(cnt))
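For reference, here is a self-contained sketch combining the integer filter and the count check, assuming the same Parquet file supplies both the columns and the row count (the path and app name are placeholders):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("FilterColumns").getOrCreate()
val df = spark.read.parquet("inputfile")  // placeholder path from the question
val cnt = df.count()

// Keep only integer columns whose distinct count differs from the row count
val keep = df.schema.fields
  .filter(f => f.dataType.typeName == "integer")
  .map(_.name)
  .filter(name => df.select(name).distinct().count() != cnt)

val z = df.select(keep.map(col): _*)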

Join tables with non-equal rows in Matlab

I'm trying to use the relatively new data type in Matlab, table. I have a number of variables that each contain a value for a set of parameters (rows). The rows are not (necessarily) the same for each variable, however. I want to join the variables together so the results are all displayed in a single table. E.g., I want to join these together (drawn side by side to save space):
   Var_A              Var_B
   ________           _______
a  0.36744         b  0.88517
b  0.98798         c  0.91329
c  0.037739        d  0.79618
Is it possible to join these two tables?
Here's an example of what I'm trying to do:
A = table(rand(3,1),'VariableNames',{'Var_A'},'RowNames',{'a','b','c'})
B = table(rand(3,1),'VariableNames',{'Var_B'},'RowNames',{'b','c','d'})
try
    C = join(A,B)
catch e
    disp(e.identifier)
    disp(e.message)
end
This results in:
MATLAB:table:join:CantInferKey
Cannot find a common table variable to use as a key variable.
Okay, so maybe join isn't intended for this -- what about outerjoin? Its documentation sounds promising:
The outer join includes the rows that match between A and B, and also unmatched rows from either A or B, all with respect to the key variables. C contains all variables from both A and B, including the key variables.
Well, outerjoin apparently can't be used with tables with row names! This is the closest I've found that does what I want, but seems to be against the idea of the table data structure to some degree:
AA = table({'a';'b';'c'},rand(3,1));
AA.Properties.VariableNames = {'param','Var_A'}
BB = table({'b';'c';'d'},rand(3,1));
BB.Properties.VariableNames = {'param','Var_B'}
CC = outerjoin(AA,BB,'Keys',1,'MergeKeys',true)
This results in
param Var_A Var_B
_____ _______ _______
'a' 0.10676 NaN
'b' 0.65376 0.77905
'c' 0.49417 0.71504
'd' NaN 0.90372
I.e., the row keys are just stored as a separate variable. This means the rows can't be indexed by name, such as CC{'a',:}.
So this can be fixed with:
CCC = CC(:,2:end);
CCC.Properties.RowNames = CC{:,1}
Which finally results in:
CCC =
Var_A Var_B
_______ ________
a 0.4168 NaN
b 0.65686 0.29198
c 0.62797 0.43165
d NaN 0.015487
But is this really the best way to go about things? Matlab is weird.
There must be a better way to do this, but here is another option:
clear;
%// Create two tables to play with.
tableA = table([.5; .6; .7 ], 'VariableNames', {'varA'}, 'RowNames', {'a','b','c'});
tableB = table([.55; .62; .68], 'VariableNames', {'varB'}, 'RowNames', {'b','c','d'});
%// Let's add rows to tableA so that it has the same rows as tableB
%// First, get the set difference of tableB rows and tableA rows
%// Then, make a new table with those rows and NaN for data.
%// Finally, concatenate tableA with the new table
tableAnewRows = setdiff(tableB.Properties.RowNames, tableA.Properties.RowNames);
tableAadd = table(nan(length(tableAnewRows),1), 'VariableNames', {'varA'}, 'RowNames', tableAnewRows);
tableA = [tableA; tableAadd];
%// Let's add rows to tableB so that it has the same rows as tableA
tableBnewRows = setdiff(tableA.Properties.RowNames, tableB.Properties.RowNames);
tableBadd = table(nan(length(tableBnewRows),1), 'VariableNames', {'varB'}, 'RowNames', tableBnewRows);
tableB = [tableB; tableBadd];
%// Form tableC from tableA and tableB. Could also use join().
tableC=[tableA tableB];
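With the padded tables concatenated, rows can finally be indexed by name, which was the original goal. A quick check, assuming the tableC built above:
tableC{'a','varB'}     % returns NaN, since row 'a' only existed in tableA
tableC({'b','c'},:)    % subtable with rows 'b' and 'c' across both variables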

In Apache Pig, select DISTINCT rows based on a single column

Let's say I have a table such as the one below, that may or may not contain duplicates for a given field:
ID URL
--- ------------------
001 http://example.com/adam
002 http://example.com/beth
002 http://example.com/beth?extra=blah
003 http://example.com/charlie
I would like to write a Pig script to find only DISTINCT rows, based on the value of a single field. For instance, filtering the table above by ID should return something like the following:
ID URL
--- ------------------
001 http://example.com/adam
002 http://example.com/beth
003 http://example.com/charlie
The Pig GROUP BY operator returns a bag of tuples grouped by ID, which would work if I knew how to get just the first tuple per bag (perhaps a separate question).
The Pig DISTINCT operator works on the entire row, so in this case all four rows would be considered unique, which is not what I want.
For my purposes, I do not care which of the rows with ID 002 are returned.
I found one way to do this, using the GROUP BY and the TOP operators:
my_table = LOAD 'my_table_file' AS (A, B);
my_table_grouped = GROUP my_table BY A;
my_table_distinct = FOREACH my_table_grouped {
    -- For each group, $0 refers to the group key (A)
    -- and $1 refers to a bag of entire rows: {(A, B), (A, B), ...}.
    -- Here, we take only the first (top 1) row in the bag:
    result = TOP(1, 0, $1);
    GENERATE FLATTEN(result);
}
DUMP my_table_distinct;
This results in one distinct row per ID column:
(001,http://example.com/adam)
(002,http://example.com/beth?extra=blah)
(003,http://example.com/charlie)
I don't know if there is a better approach, but this works for me. I hope this helps others starting out with Pig.
(Reference: http://pig.apache.org/docs/r0.12.1/func.html#topx)
I have found that you can do this with a nested grouping and using LIMIT. Using Arel's example:
my_table = LOAD 'my_table_file' AS (A, B);
-- Nested foreach grouping generates bags with same A,
-- limit bags to 1
my_table_distinct = FOREACH (GROUP my_table BY A) {
    result = LIMIT my_table 1;
    GENERATE FLATTEN(result);
}
DUMP my_table_distinct;
You can use FirstTupleFromBag from Apache DataFu™ (incubating):
register datafu-pig-incubating-1.3.1.jar
define FirstTupleFromBag datafu.pig.bags.FirstTupleFromBag();
my_table_grouped = GROUP my_table BY A;
my_table_grouped_first_tuple = FOREACH my_table_grouped GENERATE FLATTEN(FirstTupleFromBag(my_table, null));
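Dumping the result on the sample table should give one row per ID, matching the TOP and LIMIT approaches above (which of the duplicate 002 rows survives is not guaranteed):
DUMP my_table_grouped_first_tuple;
-- e.g.:
-- (001,http://example.com/adam)
-- (002,http://example.com/beth)
-- (003,http://example.com/charlie)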

Perl + PostgreSQL: Selective Column-to-Row Transpose

I'm trying to find a way to use Perl to further process PostgreSQL output. If there's a better way to do this directly in PostgreSQL, please let me know. I basically need to take certain columns (Realtime, Value) and concatenate their values into one row per group, while keeping ID and CAT.
First time posting, so please let me know if I missed anything.
Input:
ID  CAT  Realtime  Value
A   1    time1     55
A   1    time2     57
B   1    time3     75
C   2    time4     60
C   3    time5     66
C   3    time6     67
Output:
ID  CAT  Time         Values
A   1    time1,time2  55,57
B   1    time3        75
C   2    time4        60
C   3    time5,time6  66,67
You could do this most simply in Postgres like so (using array columns):
CREATE TEMP TABLE output AS
SELECT id, cat, ARRAY_AGG(realtime) AS time, ARRAY_AGG(value) AS values
FROM input
GROUP BY id, cat;
Then select whatever you want out of the output table.
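For instance, to render the arrays as comma-separated text matching the desired output (a sketch; note that "values" must be double-quoted when referenced, since VALUES is a reserved word, and "time" is quoted for safety as well):
SELECT id, cat,
       array_to_string("time", ',')   AS time,
       array_to_string("values", ',') AS values
FROM output
ORDER BY id, cat;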
SELECT id
     , cat
     , string_agg(realtime, ',')    AS realtimes
     , string_agg(value::text, ',') AS values  -- cast needed if value is a numeric column
FROM   input
GROUP  BY 1, 2
ORDER  BY 1, 2;
string_agg() requires PostgreSQL 9.0 or later and concatenates all values into a delimiter-separated string, while array_agg() (v8.4+) creates an array out of the input values.
About 1, 2 - I quote the manual on the SELECT command:
GROUP BY clause
expression can be an input column name, or the name or ordinal number
of an output column (SELECT list item), or ...
ORDER BY clause
Each expression can be the name or ordinal number of an output column
(SELECT list item), or
Emphasis mine. So that's just notational convenience. Especially handy with complex expressions in the SELECT list.
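So GROUP BY 1, 2 and ORDER BY 1, 2 above are just shorthand for the first two SELECT list items; spelled out, the query reads:
SELECT id, cat, string_agg(realtime, ',') AS realtimes, string_agg(value::text, ',') AS values
FROM input
GROUP BY id, cat
ORDER BY id, cat;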