I have a scenario where I will receive data in CSV files and need to derive some new columns from the existing ones.
Example:
Col_1 Col_2 Col_3 Col_4
abc 1 No 123
xyz 2 Yes 123
def 1 Yes 345
Expected:
Col_1 Col_2 Col_3 Col_4 Col_5 Col_6
abc 1 No 123 1 1
xyz 2 Yes 123 0 0
def 1 Yes 345 0 0
Col_5 Condition : if Col_1 = 'abc' then 1 else 0 end
Col_6 Condition : max(Col_5) over (Col_2)
I know we can perform transformations in Druid while loading a file. I tried a simpler condition and it works fine, but I doubt whether aggregates and other transformations like Col_6 can be done this way.
We also need to aggregate across the data of different files. Assume we get 2 files today and load them into a Druid table, and tomorrow we get 3 more files with data for the same ID (Col_2 here). The aggregation then has to be based on all the records we have, as in the Col_6 example above.
Is this possible in Druid?
Take a look at https://druid.apache.org/docs/latest/misc/math-expr.html
which contains many transform expressions you can use.
In particular, I tested your use case with the wikipedia demo data by creating the following expressions:
{
  "type": "expression",
  "name": "isNB",
  "expression": "case_simple(\"namespace\", 'Main', 1, 0)"
},
{
  "type": "expression",
  "expression": "greatest(case_simple(\"IsNew\", True, 1, 0), case_simple(\"namespace\", 'Main', 1, 0))",
  "name": "combined_calc"
}
One thing to note is that transform expressions cannot refer to other transform expressions, so calculations need to all be done from the raw input fields.
Col_5 Condition : if Col_1 = 'abc' then 1 else 0
You can use the following
df = df.withColumn('Col_5', f.when((f.col('Col_1') == 'abc'), 1).otherwise(0))
Col_6 Condition : max(Col_5) over (Col_2)
You can apply window operation
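from pyspark.sql import Window, functions as f
# Rank rows within each Col_2 partition, highest Col_5 first.
windowSpec = Window.partitionBy("Col_2").orderBy(f.col("Col_5").desc())
df_max = df.withColumn("row_number", f.row_number().over(windowSpec)).filter(
    f.col("row_number") == 1
)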
Now remove duplicates for each Col_2 and then join the df_max with your main df.
The snippet above is in Python, but the Spark API is the same across languages, so you can use it with minimal changes.
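To make the two conditions concrete, here is a minimal plain-Python sketch (no Spark or Druid involved) of what Col_5 and Col_6 compute on the sample rows from the question. The function name `add_derived_columns` and the dictionary layout are illustrative assumptions, not part of any library.

```python
from collections import defaultdict

def add_derived_columns(rows):
    """Return copies of the rows with Col_5 and Col_6 added."""
    # Col_5: 1 when Col_1 equals 'abc', otherwise 0.
    with_col5 = [dict(r, Col_5=1 if r["Col_1"] == "abc" else 0) for r in rows]
    # Col_6: the maximum Col_5 among all rows sharing the same Col_2.
    max_by_id = defaultdict(int)
    for r in with_col5:
        max_by_id[r["Col_2"]] = max(max_by_id[r["Col_2"]], r["Col_5"])
    return [dict(r, Col_6=max_by_id[r["Col_2"]]) for r in with_col5]

rows = [
    {"Col_1": "abc", "Col_2": 1, "Col_3": "No", "Col_4": 123},
    {"Col_1": "xyz", "Col_2": 2, "Col_3": "Yes", "Col_4": 123},
    {"Col_1": "def", "Col_2": 1, "Col_3": "Yes", "Col_4": 345},
]
result = add_derived_columns(rows)
```

Under the rule as stated, the 'def' row gets Col_6 = 1 because it shares Col_2 = 1 with the 'abc' row, which differs from the 0 shown for that row in the question's expected table. Also note that Col_6 depends on every row with the same Col_2, so when new files arrive the maximum has to be recomputed over all records, not just the new ones.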
The first type, if Col_1 = 'abc' then 1 else 0, would not be hard. E.g., see this article with similar examples.
The second, aggregating over one of the columns, doesn't sound possible. We can aggregate over all the dimensions taken together (like a primary key), but not over one single dimension, afaik.
I'm trying to aggregate a spark dataframe up to a unique ID, selecting the first non-null value from that column for that ID given a sort column. Basically replicating MySQL's group_concat function.
The SO post here Spark SQL replacement for MySQL's GROUP_CONCAT aggregate function was very helpful in replicating the group_concat for a single column. I need to do this for a dynamic list of columns.
I would rather not have to copy this code for each column (dozen +, could be dynamic in the future), so am trying to implement in a loop (frowned on in spark I know!) given a list of column names. Loop runs successfully but, the previous iterations don't persist even when the intermediate df is cached/persisted (re: Cacheing and Loops in (Py)Spark).
Any help, pointers or a more elegant non-looping solution would be appreciated (not afraid to try a bit of scala if there is a functional programming approach more suitable)!
Given the following df:
unique_id  row_id   first_name  last_name  middle_name  score
1000000    1000002  Simmons     Bonnie     Darnell      88
1000000    1000006  Dowell      Crawford   Anne         87
1000000    1000007  NULL        Eric       Victor       89
1000000    1000000  Zachary     Fields     Narik        86
1000000    1000003  NULL        NULL       Warren       92
1000000    1000008  Paulette    Ronald     Irvin        85
group_column = "unique_id"
concat_list = ['first_name','last_name','middle_name']
sort_column = "score"
sort_order = False
df_final=df.select(group_column).distinct()
for i in concat_list:\
df_helper=df
df_helper=df_helper.groupBy(group_column)\
.agg(sort_array(collect_list(struct(sort_column,i)),sort_order).alias('collect_list'))\
.withColumn("sorted_list",col("collect_list."+str(i)))\
.withColumn("first_item",slice(col("sorted_list"),1,1))\
.withColumn(i,concat_ws(",",col("first_item")))\
.drop("collect_list")\
.drop("sorted_list")\
.drop("first_item")
print(i)
df_final=df_final.join(df_helper,group_column,"inner")
df_final.cache()
df_final.display() #I'm using databricks
My result looks like:
unique_id  middle_name
1000000    Warren
My desired result is:
unique_id  first_name  last_name  middle_name
1000000    Simmons     Eric       Warren
I found a solution to my own question: Add a .collect() call on my dataframe as I join to it, not a persist() or cache(); this will produce the expected dataframe.
group_column = "unique_id"
concat_list = ['first_name','last_name','middle_name']
sort_column = "score"
sort_order = False
df_final=df.select(group_column).distinct()
for i in concat_list:\
df_helper=df
df_helper=df_helper.groupBy(group_column)\
.agg(sort_array(collect_list(struct(sort_column,i)),sort_order).alias('collect_list'))\
.withColumn("sorted_list",col("collect_list."+str(i)))\
.withColumn("first_item",slice(col("sorted_list"),1,1))\
.withColumn(i,concat_ws(",",col("first_item")))\
.drop("collect_list")\
.drop("sorted_list")\
.drop("first_item")
print(i)
df_final=df_final.join(df_helper,group_column,"inner")
df_final.collect()
df_final.display() #I'm using databricks
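One possible explanation for the original symptom, assuming the code was run exactly as posted (unindented, with a backslash after the for line): the backslash joins the next line onto the for statement, so only that single statement forms the loop body. Everything after it runs once, with the loop variable left at the last column name, which would match a result containing only middle_name. A plain-Python sketch of the effect, with illustrative variable names:

```python
columns = ["first_name", "last_name", "middle_name"]

# As posted: the trailing backslash turns the next line into a one-line
# loop body, so the append below runs only once, after the loop ends.
processed = []
for name in columns:\
current = name
processed.append(current)

# Without the backslash, and with the body indented, every statement
# runs once per column.
fixed = []
for name in columns:
    current = name
    fixed.append(current)
```

Here `processed` ends up holding only the last column name, while `fixed` holds all three. If this is what happened, removing the backslash and indenting the loop body may be the actual fix, independent of cache(), persist() or collect().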
I have a requirement to run some queries against tables in a PostgreSQL database to populate a dataframe. The tables are as follows.
table 1 has the below data.
QueryID, WhereClauseID, Enabled
1 1 true
2 2 true
3 3 true
...
table 2 has the below data.
WhereClauseID, WhereClauseString
1 a>b
2 a>c
3 a>b && a<c
...
table 3 has the below data.
a, b, c, value
30, 20, 30, 100
20, 10, 40, 200
...
I want to query in the following way. From table 1, I want to pick the rows where Enabled is true. Based on the WhereClauseID in each of those rows, I want to pick the matching rows in table 2. Using the WhereClauseString picked from table 2 as the WHERE clause, I want to query table 3 to get the value. In the end, I want all records in table 3 that satisfy the where clauses enabled in table 1.
I know I can go through table 1 row by row and use a parameterized string to build the SQL query against table 3. But I think querying row by row is very inefficient, especially if table 1 is big. Is there a better way to organize the query to improve efficiency? Thanks a lot!
Depending on your use case, if the tables are available in PySpark, you might be able to solve it using the .when function.
Here is a suggestion.
import pyspark.sql.functions as F

tbl1 = spark.table("table1")
tbl3 = spark.table("table3")
tbl3 = tbl3.withColumn(
    "WhereClauseID",
    # You can do some fancy parsing of your table2 here if you want
    # this to be evaluated programmatically.
    # Note: when() keeps only the first matching clause for each row.
    F.when(F.col("a") > F.col("b"), 1)   # clause 1: a>b
    .when(F.col("a") > F.col("c"), 2)    # clause 2: a>c
    .otherwise(-1),
)
tbl1_with_tbl_3 = tbl1.join(tbl3, "WhereClauseID", "left")
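As a rough illustration of the intended result, here is a plain-Python sketch that evaluates every enabled where clause against every row of table 3, so one row can match several clauses. The in-memory tables and the `predicates` mapping are hypothetical: the clause strings from table 2 are translated into lambdas by hand, not parsed.

```python
# Hypothetical in-memory copies of the tables from the question.
table1 = [  # (QueryID, WhereClauseID, Enabled)
    (1, 1, True),
    (2, 2, True),
    (3, 3, True),
]
# table 2's WhereClauseString values, hand-translated into predicates.
predicates = {
    1: lambda r: r["a"] > r["b"],                      # a>b
    2: lambda r: r["a"] > r["c"],                      # a>c
    3: lambda r: r["a"] > r["b"] and r["a"] < r["c"],  # a>b && a<c
}
table3 = [
    {"a": 30, "b": 20, "c": 30, "value": 100},
    {"a": 20, "b": 10, "c": 40, "value": 200},
]

enabled_ids = [clause_id for _, clause_id, enabled in table1 if enabled]

# One (WhereClauseID, value) pair per enabled clause that a row satisfies.
matches = [
    (clause_id, row["value"])
    for row in table3
    for clause_id in enabled_ids
    if predicates[clause_id](row)
]
```

In this sketch the second row matches clauses 1 and 3 at once, which a single chained when() expression cannot express because it stops at the first match.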
I am trying to join two dataframes on a single field. In order to do this, I must first make sure the field is unique. So my order of events goes:
Read in the first dataframe
Select the field I want to join on (say, field1), and another field I want to bring in on the join (field2)
Do .distinct
Then, for the second table..
Read in the second dataframe
Do a leftouter join on field1 with the first table
Do .distinct
I have tried to run my script and it is taking way longer than it should.
To debug this, I put a println for the record count on the first table before and after the join, and here were the results:
Before the join, the record count was 904,326. After, 2,658,632.
So I think it is blowing up, but am not sure why. I think it has to do with trying to use only one "distinct" after selecting two fields..?
Please help!
Here is the code:
val ticketProduct = Source.fromArg(args, "f1").read
  .select($"INSTRUMENT_SK", $"TICKET_CODES_SK")
  .distinct
val instrumentD = Source.fromArg(args, "f2").read
  // println("instrumentD count before join is " + instrumentD.count)
  .join(ticketProduct, Seq("INSTRUMENT_SK"), "leftouter")
  // .select($"SERIAL_NBR", $"TICKET_CODES_SK")
  .distinct
println("instrumentD count after join is " + instrumentD.count)
The problem is that distinct only removes rows in which the values of both field1 and field2 are the same. It does not make field1 unique on its own.
Since you join on field1, you probably want the values of field1 to be unique.
You could try something like the following instead of calling distinct.
dataframe1.groupBy($"field1").agg(org.apache.spark.sql.functions.collect_list($"field2"))
This will result in a dataframe where the column field1 is unique, and multiple values of field2 are aggregated into an array.
The same applies to the second dataframe.
To give an example: Say you have dataframes with the following content.
field1, field2
1, 1
1, 2
field1, field3
1, 1
1, 3
Then distinct does nothing to them, since the rows are distinct.
Now if you would do a join on field1 you would get the following.
1, 1, 1
1, 2, 1
1, 1, 3
1, 2, 3
The aggregation in contrast would give
field1, array_of_field2
1, [1,2]
field1, array_of_field3
1, [1,3]
Then a join would then result in the following dataframe.
1, [1,2], [1,3]
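The same example can be traced in plain Python to see where the extra rows come from. This is only a sketch of the join semantics, not Spark code; the helper name `collect_by_key` is made up for the illustration.

```python
from collections import defaultdict

left = [(1, 1), (1, 2)]    # (field1, field2)
right = [(1, 1), (1, 3)]   # (field1, field3)

# Plain join on field1: every left row pairs with every matching right row.
joined = [(k1, f2, f3) for k1, f2 in left for k2, f3 in right if k1 == k2]

def collect_by_key(pairs):
    """Group the second element of each pair into a list per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

# Aggregate first, so each key appears once on both sides, then join.
left_grouped = collect_by_key(left)
right_grouped = collect_by_key(right)
joined_grouped = [
    (key, left_grouped[key], right_grouped[key])
    for key in left_grouped
    if key in right_grouped
]
```

With two rows per key on each side the plain join yields 2 x 2 = 4 rows, while the aggregated join yields a single row per key.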
I have the following flags declared:
0 - None
1 - Read
2 - Write
4 - View
I want to write a query that will group on this bitmask and get the count of each flag used.
person mask
a 0
b 3
c 7
d 6
The result should be:
flag count
none 1
read 2
write 3
view 2
Any tips would be appreciated.
For Craig
SELECT lea.mask as trackerStatusMask,
count(*) as count
FROM Live le
INNER JOIN (
... --some guff
) lea on le.xId = lea.xId
WHERE le.xId = p_xId
GROUP BY lea.mask;
SQL Fiddle
select
count(mask = 0 or null) as "None",
count(mask & 1 > 0 or null) as "Read",
count(mask & 2 > 0 or null) as "Write",
count(mask & 4 > 0 or null) as "View"
from t
Simplest - pivoted result
Here's how I'd approach it:
-- (after fixing the idiotic mistakes in the first version)
SELECT
count(nullif(mask <> 0, True)) AS "none",
count(nullif(mask & 2,0)) AS "write",
count(nullif(mask & 1,0)) AS "read",
count(nullif(mask & 4,0)) AS "view"
FROM my_table;
-- ... though @ClodAldo's version of it below is considerably clearer, per comments.
This doesn't do a GROUP BY as such; instead it scans the table and collects the data in a single pass, producing column-oriented results.
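The single-pass idea can be sketched in plain Python on the sample data from the question. The `FLAGS` mapping and variable names are illustrative assumptions.

```python
FLAGS = {"read": 1, "write": 2, "view": 4}
masks = {"a": 0, "b": 3, "c": 7, "d": 6}  # person -> mask

counts = {"none": 0, "read": 0, "write": 0, "view": 0}
for mask in masks.values():
    # A mask of 0 means no flag is set at all.
    if mask == 0:
        counts["none"] += 1
    # A row can contribute to several counters in the same pass.
    for name, bit in FLAGS.items():
        if mask & bit:
            counts[name] += 1
```

Each row is visited once and may increment more than one counter, which is why this is a set of conditional counts and not a grouping.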
If you need it in row form you can pivot the result, either using the crosstab function from the tablefunc module or by hand.
If you really must GROUP BY, explode the bitmask
You cannot use GROUP BY for this in a simple way, because it expects rows to fall into exactly one group. Your rows appear in multiple groups. If you must use GROUP BY you will have to do so by generating an "exploded" bitmask where one input row gets copied to produce multiple output rows. This can be done with a LATERAL function invocation in 9.3, or with a SRF-in-SELECT in 9.2, or by simply doing a join on a VALUES clause:
SELECT
CASE
WHEN mask_bit = 1 THEN 'read'
WHEN mask_bit = 2 THEN 'write'
WHEN mask_bit = 4 THEN 'view'
WHEN mask_bit IS NULL THEN 'none'
END AS "flag",
count(person) AS "count"
FROM t
LEFT OUTER JOIN (
VALUES (4),(2),(1)
) mask_bits(mask_bit)
ON (mask & mask_bit = mask_bit)
GROUP BY mask_bit;
I don't think you'll have much luck making this as efficient as a single table scan, though.
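For comparison, the "explode then group" approach looks roughly like this in plain Python. The `explode` helper is a made-up name for the illustration: it copies each input row once per flag it carries, after which an ordinary group-and-count works.

```python
from collections import Counter

FLAG_NAMES = {1: "read", 2: "write", 4: "view"}
rows = [("a", 0), ("b", 3), ("c", 7), ("d", 6)]  # (person, mask)

def explode(person, mask):
    """Return one (person, flag) pair per bit set in mask, or 'none'."""
    bits = [bit for bit in FLAG_NAMES if (mask & bit) == bit]
    if not bits:
        return [(person, "none")]
    return [(person, FLAG_NAMES[bit]) for bit in bits]

# Each exploded row now belongs to exactly one group.
exploded = [pair for person, mask in rows for pair in explode(person, mask)]
counts = Counter(flag for _, flag in exploded)
```

The four input rows become eight exploded rows before grouping, which illustrates why this tends to cost more than the single scan.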
I'm trying to find a way to use Perl to further process PostgreSQL output. If there's a better way to do this in PostgreSQL itself, please let me know. I basically need to concatenate the values of certain columns (Realtime, Value) into a single row per group while keeping ID and CAT.
First time posting, so please let me know if I missed anything.
Input:
ID CAT Realtime Value
A 1 time1 55
A 1 time2 57
B 1 time3 75
C 2 time4 60
C 3 time5 66
C 3 time6 67
Output:
ID CAT Time Values
A 1 time1,time2 55,57
B 1 time3 75
C 2 time4 60
C 3 time5,time6 66,67
You could do this most simply in Postgres like so (using array columns)
CREATE TEMP TABLE output AS SELECT
id, cat, ARRAY_AGG(realtime) as time, ARRAY_AGG(value) as values
FROM input GROUP BY id, cat;
Then select whatever you want out of the output table.
SELECT id
, cat
, string_agg(realtime, ',') AS realtimes
, string_agg(value, ',') AS values
FROM input
GROUP BY 1, 2
ORDER BY 1, 2;
string_agg() requires PostgreSQL 9.0 or later and concatenates all values to a delimiter-separated string - while array_agg() (v8.4+) creates an array out of the input values.
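If the rows do have to be post-processed outside the database, the same grouping can be sketched in a few lines. This example uses Python and not Perl, with the sample input hard-coded as tuples; it only mirrors what array_agg() and string_agg() produce and is not a replacement for the queries above.

```python
rows = [  # (id, cat, realtime, value)
    ("A", 1, "time1", 55),
    ("A", 1, "time2", 57),
    ("B", 1, "time3", 75),
    ("C", 2, "time4", 60),
    ("C", 3, "time5", 66),
    ("C", 3, "time6", 67),
]

# array_agg-like: collect the values per (id, cat) into lists.
grouped = {}
for id_, cat, realtime, value in rows:
    times, values = grouped.setdefault((id_, cat), ([], []))
    times.append(realtime)
    values.append(value)

# string_agg-like: join each list into a comma-separated string.
as_strings = {
    key: (",".join(times), ",".join(str(v) for v in values))
    for key, (times, values) in sorted(grouped.items())
}
```

The lists in `grouped` correspond to the array columns, and `as_strings` corresponds to the delimiter-separated output the question asks for.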
About 1, 2 - I quote the manual on the SELECT command:
GROUP BY clause
expression can be an input column name, or the name or ordinal number
of an output column (SELECT list item), or ...
ORDER BY clause
Each expression can be the name or ordinal number of an output column
(SELECT list item), or
Emphasis mine. So that's just notational convenience. Especially handy with complex expressions in the SELECT list.