Iterate on a tMssqlInput in Talend

I use the latest version of Talend, 5.3.1.
I have a tMssqlInput that queries my database like this:
SELECT IdInvoice, DateInvoice, IdStuff, Name FROM Invoice
INNER JOIN Stuff ON Invoice.IdInvoice = Stuff.IdInvoice
which results in something like this:
IdInvoice | DateInvoice | IdStuff | Name
1 | 2013-01-01 | 10 | test
1 | 2013-01-01 | 11 | test2
2 | 2013-02-01 | 12 | test3
2 | 2013-02-01 | 13 | test4
I'd like to export one file per invoice. Here are the specifications:
one header line with IdInvoice;DateInvoice
then one line per stuff item, like IdStuff;Name
example file 1:
1;2013-01-01
10;test
11;test2
example file 2:
2;2013-02-01
12;test3
13;test4
How can I solve this case with Talend?
Probably with tFileOutputDelimited, but how can I have one file with multiple pieces of information and iterate over each IdInvoice?

Please go through the following link; it will give you a clear idea of how to split data into multiple files:
http://www.talendfreelancer.com/2013/09/talend-tflowtoiterate.html
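In outline, the tFlowToIterate pattern from that link, applied to this case, would look something like the following (component layout and variable names are illustrative, not from the original post):

tMssqlInput_1 (SELECT DISTINCT IdInvoice, DateInvoice FROM Invoice) --> tFlowToIterate_1 --iterate--> tMssqlInput_2 --> tFileOutputDelimited_1

tFlowToIterate stores each incoming row in globalMap, so tMssqlInput_2 can build a per-invoice query as a Java string:

"SELECT IdStuff, Name FROM Stuff WHERE IdInvoice = " + ((Integer)globalMap.get("row1.IdInvoice"))

and the file name of tFileOutputDelimited_1 can be a per-invoice expression such as:

"/output/invoice_" + ((Integer)globalMap.get("row1.IdInvoice")) + ".csv"

with ";" as the field separator. For the IdInvoice;DateInvoice header line, one option is a tFixedFlowInput writing that single line to the same file at the start of each iteration, with Append checked on the stuff-level tFileOutputDelimited so the detail rows land underneath it.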

Related

PostgreSQL - Setting null values to missing rows in a join statement

SQL newbie here. I'm trying to write a query that generates a scoring table, setting a student's grade to null for modules in which they haven't yet taken their exams (on PostgreSQL).
So I start with tables that look something like this:
student_evaluation:
|student_id| module_id | course_id |grade |
|----------|-----------|-----------|-------|
| 1 | 1 | 1 |3 |
| 1 | 1 | 1 |7 |
| 1 | 2 | 1 |8 |
| 2 | 4 | 2 |9 |
course_module:
| module_id | course_id |
| ---------- | --------- |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 2 |
In our use case, a course is made up of several modules. Each module has a single exam, but a student who failed their exam may have a couple of retries. The same module may also be present in different courses, but an exam attempt only counts for one instance of the module (i.e. if student A passed module 1's exam in course 1, and course 2 also has module 1, student A has to retake the same exam for course 2 if he also has access to that course).
So the output should look like this:
| student_id | module_id | course_id | grade |
| ---------- | --------- | --------- | ----- |
| 1 | 1 | 1 | 3 |
| 1 | 1 | 1 | 7 |
| 1 | 2 | 1 | 8 |
| 1 | 3 | 1 | null |
| 2 | 4 | 2 | 9 |
I feel like this should have been a simple task, but I think I have a very flawed understanding of how outer and cross joins work. I have tried stuff like:
SELECT se.student_id, se.module_id, se.course_id, se.grade FROM student_evaluation se
RIGHT OUTER JOIN course_module ON course_module.course_id = se.course_id
AND course_module.module_id = se.module_id
or
SELECT se.student_id, se.module_id, se.course_id, se.grade FROM student_evaluation se
CROSS JOIN course_module WHERE course_module.course_id = se.course_id
Neither worked. These all feel wrong, but I'm lost as to what would be the proper way to go about this.
Thank you in advance.
I think you need both join types: first use a cross join to build a list of all combinations of students and courses, then use an outer join to add the grades.
SELECT sc.student_id,
       sc.module_id,
       sc.course_id,
       se.grade
FROM student_evaluation se
RIGHT JOIN (SELECT s.student_id,
                   c.module_id,
                   c.course_id
            FROM (SELECT DISTINCT student_id
                  FROM student_evaluation) AS s
            CROSS JOIN course_module AS c) AS sc
  ON se.student_id = sc.student_id
 AND se.module_id = sc.module_id
 AND se.course_id = sc.course_id;
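An equivalent way to write this, as a sketch, reads left to right from the combination list; here the derived table keeps only the (student, course) pairs that actually occur in student_evaluation, which matches the sample output above (student 2 gets no null rows for course 1's modules):

SELECT sc.student_id,
       sc.module_id,
       sc.course_id,
       se.grade
FROM (SELECT s.student_id, c.module_id, c.course_id
      FROM (SELECT DISTINCT student_id, course_id
            FROM student_evaluation) AS s
      JOIN course_module AS c ON c.course_id = s.course_id) AS sc
LEFT JOIN student_evaluation se
       ON se.student_id = sc.student_id
      AND se.module_id = sc.module_id
      AND se.course_id = sc.course_id;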

How to show only the latest value in InfluxDB & Grafana?

I am trying to run a dashboard with InfluxDB and Grafana, and I would like to show only the latest value for each id. E.g. if I have a table something like this:
id | utime
2 | 9:10AM
3 | 9:20AM
2 | 9:30AM
2 | 9:35AM
4 | 9:40AM
4 | 9:50AM
Now I would like to create a query that shows me only the latest result for each id, so the result would look something like this:
| 2 | 9:35AM |
| 3 | 9:20AM |
| 4 | 9:50AM |
I am new to this and don't know exactly how it will work. I have tried the query below, but it didn't work.
SELECT "id", "utime" FROM "http_listener_v2" WHERE $timeFilter GROUP BY "id"
Can somebody please help? How can I do this?
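One way this is commonly done in InfluxQL, sketched here under the assumption that id is a tag on the http_listener_v2 measurement: the last() selector returns the most recent value per group, so grouping by the tag yields one row per id.

SELECT last("utime") AS "utime" FROM "http_listener_v2" WHERE $timeFilter GROUP BY "id"

If utime is actually the point's timestamp rather than a field, last() would need to be applied to one of the measurement's real fields instead.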

How to preserve order of a DataFrame when writing it as CSV with partitioning by columns?

I sort the rows of a DataFrame and write it out to disk like so:
df.
orderBy("foo").
write.
partitionBy("bar", "moo").
option("compression", "gzip").
csv(outDir)
When I look into the generated .csv.gz files, their order is not preserved. Is this just the way Spark does it? Is there a way to preserve order when writing a DataFrame to disk with partitioning?
Edit: to be more precise: it is not the order of the CSV files that is off, but the order inside them. Let's say I have the following after df.orderBy (for simplicity, I now only partition by one column):
foo | bar | baz
===============
1 | 1 | 1
1 | 2 | 2
1 | 1 | 3
2 | 3 | 4
2 | 1 | 5
3 | 2 | 6
3 | 3 | 7
3 | 1 | 8
4 | 2 | 9
4 | 1 | 10
I expect it to be like this, e.g. for files in folder bar=1:
part-00000-NNN.csv.gz:
1,1
1,3
2,5
part-00001-NNN.csv.gz:
3,8
4,10
But what it is like:
part-00000-NNN.csv.gz:
1,1
2,5
1,3
part-00001-NNN.csv.gz:
4,10
3,8
It's been a while, but I ran into this again. I finally came across a workaround.
Suppose your schema is like this:
time: bigint
channel: string
value: double
If you do:
df.sortBy("time").write.partitionBy("channel").csv("hdfs:///foo")
the timestamps in the individual part-* files get tossed around.
If you do:
df.sortBy("channel", "time").write.partitionBy("channel").csv("hdfs:///foo")
the order is correct.
I think it has to do with shuffling. So, as a workaround, I now sort first by the columns I want the data partitioned by, and then by the column I want the rows ordered by within the individual files.
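The same idea can be made explicit, as a sketch (assuming Spark 2.x and the schema above): repartition by the partition column, then sort within each partition, so the per-file order does not depend on a global sort surviving the shuffle.

// illustrative alternative, not from the original answer
df.
  repartition(df("channel")).
  sortWithinPartitions("channel", "time").
  write.
  partitionBy("channel").
  csv("hdfs:///foo")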

String splitting and operations on only some results

I have strings that look like this:
schedulestart | event_labels
2018-04-04 | 9=TTR&11=DNV&14=SWW&26=DNV&2=QQQ&43=FTW
That's what it looks like in the database. I have code that relies on this string being in this format to display a schedule with events with those labels on those days.
Now I find myself needing to break down the string in Postgres for reporting/analysis, and I can't really pull the string out and parse it in another language, so I have to stick to Postgres.
I've figured out a way to unpack the string so my results look like this:
User ID | Schedule Start | Unpacked String
2 | 2018-04-04 | TTR
2 | 2018-04-04 | 9
2 | 2018-04-04 | DNV
2 | 2018-04-04 | 11
2 | 2018-04-04 | SWW
2 | 2018-04-04 | 14
2 | 2018-04-04 | DNV
2 | 2018-04-04 | 26
using this query:
select schedulestart, unnest(string_to_array(unnest(string_to_array(event_labels, '&')), '=')) from table;
Now what I need is a way to actually perform an interval calculation (so 2018-04-04 + '11 days'::interval), which I could do if I only had a list of numbers, but I also need to bind each result to its label. So the goal is an output like this:
eventdate | event_label
2018-04-12 | TTR
2018-04-20 | DNV
Where eventdate is the schedule start + which day of the schedule the event is on. I'm not sure how to take the unpacked string I created and use it to perform date calculations, and tie it to the string.
I've considered doing only one unnest, so that it's 9=TTR and 11=DNV, but I'm not sure how to get from that to my desired result either. Is there a way to read a string until you reach a certain character and use that in calculations, and then read every character past that character into a new column?
I'm aware completely rewriting how this is handled would be ideal, but I did not initially write it, and I don't have the time or means to rewrite the ~20 locations this is used.
Here is your table (I added a userid column):
CREATE TABLE test(userid INTEGER, schedulestart DATE, event_labels VARCHAR);
And input data:
INSERT INTO test(userid,schedulestart , event_labels) VALUES
(2,DATE '2018-04-04', '9=TTR&11=DNV&14=SWW&26=DNV&2=QQQ&43=FTW');
And finally the solution:
SELECT
  userid,
  (schedulestart + (SPLIT_PART(kv, '=', 1) || ' days')::INTERVAL)::DATE AS eventdate,
  SPLIT_PART(kv, '=', 2) AS event_label
FROM (
  SELECT
    userid, schedulestart,
    REGEXP_SPLIT_TO_TABLE(event_labels, '&') AS kv
  FROM test
  WHERE userid = 2
) a
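Run against the sample row, this should yield one output row per key=value pair, with the numeric prefix applied as a day offset:

userid | eventdate | event_label
2 | 2018-04-13 | TTR
2 | 2018-04-15 | DNV
2 | 2018-04-18 | SWW
2 | 2018-04-30 | DNV
2 | 2018-04-06 | QQQ
2 | 2018-05-17 | FTW

If, as the question's expected output (2018-04-12 for TTR) suggests, the schedule counts schedulestart itself as day 1, subtract one from the offset: (schedulestart + ((SPLIT_PART(kv, '=', 1)::INT - 1) || ' days')::INTERVAL)::DATE.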

Cognos force 0 on group by

I've got a requirement to build a list report showing volume by 3 grouped columns. The issue I'm having is that if nothing happened on specific days for the specific grouped columns, I can't force it to show 0.
What I'm currently getting is something like:
ABC | AA | 01/11/2017 | 1
ABC | AA | 03/11/2017 | 2
ABC | AA | 05/11/2017 | 1
What I need is:
ABC | AA | 01/11/2017 | 1
ABC | AA | 02/11/2017 | 0
ABC | AA | 03/11/2017 | 2
ABC | AA | 04/11/2017 | 0
ABC | AA | 05/11/2017 | 1
I've tried going down the route of unioning a "dummy" query with no query filters; however, there are days where nothing has happened at all for those first 2 columns, so it doesn't always populate.
Hope that makes sense; any help would be greatly appreciated!
To anyone who wanted an answer: I figured it out. Query 1 is for just the dates; there will always be some form of event happening daily, so it will always give a unique date range.
Query 2 is for the other 2 "grouped by" columns.
Create a data item in each with "1" as the result (it would work with anything, as long as they are the same).
Left join Query 1 to Query 2 on this new data item.
This gives the full combination of all 3 columns needed. The resulting "Query 3" can then be left joined again to get the measures. The final query (depending on aggregation) may need the measure data item wrapped in COALESCE/ISNULL to produce a 0 on days when nothing happened.
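As a minimal sketch of that last step (query and data item names are illustrative, not from the original answer), the final measure expression would be wrapped like

COALESCE([Query3].[Volume], 0)

so that date/group combinations with no matching rows render as 0 instead of null. The constant-key join in the middle step behaves like a cross join, which is what produces the full combination of dates and groups in the first place.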