I'm using Python and I want to implement groupBy over multiple columns in Apache Beam. For example, I have the dataset below with 3 columns:
GM TV 7500.2 abc
ONLINE 2000.1 def
CONSOLE 1000.2 ghi
CONSOLE 6500.6 ghi
GM TV 4500.5 abc
CONSOLE 9500.4 ghi
How can I group the data based on the first and third columns?
You can use a tuple (column 1, column 3) as the key in your GroupByKey (GBK) transform.
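For example, here is a minimal sketch with the Beam Python SDK, assuming each input row is a 3-tuple (col1, col2, col3) like the sample data above (the step labels are just placeholders):

import apache_beam as beam

with beam.Pipeline() as pipeline:
    rows = pipeline | 'Create' >> beam.Create([
        ('GM TV', 7500.2, 'abc'),
        ('ONLINE', 2000.1, 'def'),
        ('CONSOLE', 1000.2, 'ghi'),
        ('CONSOLE', 6500.6, 'ghi'),
        ('GM TV', 4500.5, 'abc'),
        ('CONSOLE', 9500.4, 'ghi'),
    ])
    grouped = (
        rows
        # key each row by (column 1, column 3) and keep column 2 as the value
        | 'KeyByCols' >> beam.Map(lambda row: ((row[0], row[2]), row[1]))
        # GroupByKey collects every value that shares the same composite key
        | 'Group' >> beam.GroupByKey()
    )
    grouped | 'Print' >> beam.Map(print)

Each output element is ((col1, col3), iterable of col2 values), e.g. the key ('CONSOLE', 'ghi') groups 1000.2, 6500.6 and 9500.4 together.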
I am trying to drop rows of a Spark DataFrame which contain a specific value in a specific column.
For example, if I have the following DataFrame, I'd like to drop all rows which have "two" in column "A". So I'd like to drop the rows with index 1 and 2.
I want to do this using Scala 2.11 and Spark 2.4.0.
A B C
0 one 0 0
1 two 2 4
2 two 4 8
3 one 6 12
4 three 7 14
I tried something like this:
df = df.filer(_.A != "two")
or
df = df.filter(df("A") != "two")
Anyway, neither worked. Any suggestions on how that can be done?
Try:
df.filter(not($"A".contains("two")))
Or if you look for exact match:
df.filter(not($"A".equalTo("two")))
I finally found the solution in a very old post:
Is there a way to filter a field not containing something in a spark dataframe using scala?
The trick that does it is the following:
df = df.where(!$"A".contains("two"))
I am running into an issue where I am trying to look up a cost in a file with multiple rows per project, but it's not working out: the lookup repeats the cost across all the rows and so does not give the correct cost associated with a project. Here is the file in which I am trying to look up the value:
Date Project
1/08/2017 XYZ
2/08/2017 XYZ
3/08/2017 XYZ
4/08/2017 XYZ
5/08/2017 XYZ
6/08/2017 XYZ
1/09/2017 ABC
2/09/2017 ABC
3/09/2017 ABC
4/09/2017 ABC
5/09/2017 ABC
6/09/2017 ABC
12/10/2017 DEF
13/10/2017 DEF
11/11/2017 IJK
And here is the file from which I am trying to look up the value:
Project Budget
XYZ 200000
ABC 300000
DEF 1000000
IJK 50000
Any help is highly appreciated. Also, how can I count how many times a project is repeated in the field? I am looking for something like this:
Date Project Count_Projects
1/08/2017 XYZ 6
2/08/2017 XYZ 6
3/08/2017 XYZ 6
4/08/2017 XYZ 6
5/08/2017 XYZ 6
6/08/2017 XYZ 6
1/09/2017 ABC 6
2/09/2017 ABC 6
3/09/2017 ABC 6
4/09/2017 ABC 6
5/09/2017 ABC 6
6/09/2017 ABC 6
12/10/2017 DEF 2
13/10/2017 DEF 2
11/11/2017 IJK 1
I really need to figure this out.
For your second question, you can create the Count_Projects calculated column as follows:
Count_Projects =
CALCULATE(DISTINCTCOUNT(Dates[Date]),
FILTER(Dates, Dates[Project] = EARLIER(Dates[Project])))
Or you can use a variable:
Count_Projects =
VAR Project = Dates[Project]
RETURN CALCULATE(DISTINCTCOUNT(Dates[Date]),
ALL(Dates), Dates[Project] = Project)
Like @Alexis Olson, I'm not clear as to exactly what output you expect; but, assuming that you want to see the same Budget number listed for each respective Project entry (e.g., 200000 for each instance of XYZ, 300000 for each instance of ABC, etc.), here's an answer.
If you've got both tables loaded into Power BI, you can see them on the right side of the screen in the Data view (I named them Table and TableLookup).
If you click Home -> Manage Relationships, you'll see there is a relationship between the two tables.
If you then click Edit..., you'll see it's a many-to-one relationship between the overall table (I called it Table) and the lookup table (I called it TableLookup).
Anyhow, the point is...there is a relationship between the two tables, and you're going to use it.
Click Cancel.
Click Close.
Click Modeling -> New Column; then, in the formula bar, type:
Budget = RELATED(TableLookup[Budget])
and press Enter. You'll get the Budget value from the lookup table repeated on each row of its project.
Then you can do what Alexis said for counting:
Click Modeling -> New Column; then, in the formula bar, type:
Count_Projects =
CALCULATE(DISTINCTCOUNT('Table'[Date]),
FILTER('Table', 'Table'[Project] = EARLIER('Table'[Project])))
I replaced Alexis's "Dates" with "Table" because my table is named Table.
We are using IBM Datastage to match data coming via extracts from various mainframes.
One of the steps is to remove data in Oracle tables that is coming down in the extracts that night, and it uses the Join stage to match an extract from an Oracle table against a dataset created from one of the extracts.
The issue is that the Join doesn't seem to do an exact match; it's as if it's doing a LIKE on the keys.
Has anyone seen this before?
Example of Data:
Oracle Extract
POLICY_NUMBER SQL Type(VarChar) Extended(Unicode) Length(13)
POLICY_NUMBER CLIENT_NUMBER
A 123456 12345
A 123456W A 123456W01
A 234567 23456
A 234567J A 234567J01
Nightly Extract
POLICY_NUMBER SQL Type(Char) Extended(Unicode) Length(8)
POLICY_NUMBER PRODUCT
A 123456 LIFE
A 234567 PENSION
Dataset after join
POLICY_NUMBER CLIENT_NUMBER
A 123456 12345
A 123456 A 123456W01
A 234567 23456
A 234567 A 234567J01
Adjust the data types - it seems the comparison is done on the char(8). I have not seen it before, and a test with two different char length columns did not match in my test.
There is no LIKE comparison in a Join.
I think if you adjust both data types to varchar(13) it should work.
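To see why, here is a quick Python sketch (illustration only, not DataStage) of what happens when the varchar(13) keys are effectively compared as char(8):

# illustration only: force each varchar(13) key into a char(8) shape
oracle_keys = ['A 123456', 'A 123456W', 'A 234567', 'A 234567J']
nightly_keys = ['A 123456', 'A 234567']

for key in oracle_keys:
    key8 = key[:8].ljust(8)          # truncate/pad to 8 characters
    for nightly in nightly_keys:
        if key8 == nightly.ljust(8):
            print(key, 'joins with', nightly)

'A 123456W' collapses to 'A 123456', so it joins as well - exactly the pattern in the dataset after the join above. It looks like a LIKE, but it is really an equality on truncated keys.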
Otherwise open a PMR.
I am using Talend 6.1.1 and I have two components, tMysqlInput and tFixedFlowInput.
The schema is the same for both components, and I am trying to combine the data generated by them.
For example, the schema is like col1 and col2.
The output of the tMysqlInput component is:
1,2
2,3
The output of the tFixedFlowInput component is:
3,4
4,5
Now the output which I am expecting is a combination of both outputs.
It should be like:
1,2
2,3
3,4
4,5
Please help me to combine the outputs of those two components.
An alternative to using tUnite is tHashOutput.
For example:
tMySqlInput--main-->tHashOutput
|
onSubjobOK
|
tFixedFlowInput--main-->tHashOutput
|
onSubjobOK
|
tHashInput--main-->tFileOutputDelimited
In the second tHashOutput, make sure to associate it with the first tHashOutput.
In the tHashInput, make sure to associate it with the first tHashOutput.
tUnite would generally be preferred, but depending upon the case tHashOutput can be appropriate.
If the schema is exactly the same, you can send the row output of both components into a tUnite component:
https://help.talend.com/display/TalendComponentsReferenceGuide54EN/tUnite
I am using iReport for designing the JRXML and JasperReports Server for report scheduling.
I have developed a simple single JRXML in iReport with a SQL query that executes, fetches the records, and generates a single PDF.
Now I have different types of records like:
DEPT NAME Salary
---------------------------
HR MR XYZ 500000
MFG MR PQR 300000
HR MR ABC 400000
EDU MR DEF 350000
EDU MR SSS 400000
Now my requirement is to generate an individual PDF for each department. That means one PDF must contain only HR-related data, another must contain only EDU-related data, etc.
How can I do it with iReport and JasperReports Server?
You can parameterize your query. If I understand correctly, choosing a department is a matter of a simple WHERE clause you can add to the query:
WHERE dept like $P{DEPT_PARAM}
And pass that parameter from JasperReports Server as needed ("%" when you need to print all the departments).