I am new to Talend and am trying to parse an XML document and generate an ETL sequence that maintains the parent-child relationship. I have XML like this:
<RDF>
  <footPrint>
    <custid>123</custid>
    <item>
      <itemCd>apple</itemCd>
    </item>
    <item>
      <itemCd>orange</itemCd>
    </item>
  </footPrint>
  <footPrint>
    <custid>456</custid>
    <item>
      <itemCd>grapes</itemCd>
    </item>
    <item>
      <itemCd>kiwi</itemCd>
    </item>
  </footPrint>
</RDF>
The output I am trying to achieve is:
id | Custid | item_seq | item
-------------------------------
1 | 123 | 1 | apple
1 | 123 | 2 | orange
2 | 456 | 1 | grapes
2 | 456 | 2 | kiwi
Any help will be appreciated.
Use tFileInputXML and set the XPath loop query to "/RDF/footPrint/item".
Add two columns to the schema, cust_id and item. These columns will automatically appear in the mapping table.
Then set the XPath query for cust_id to "../custid"
and the XPath query for item to "itemCd".
This returns one row per item together with its parent custid.
Hope this helps.
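The target table also needs a running id per footPrint and an item_seq per item, which the XPath mapping on its own does not produce. As a rough illustration of that numbering logic outside Talend, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The function name flatten_footprints is made up for this example, and in a Talend job the two counters would have to be generated in a separate step rather than by this code.

```python
import xml.etree.ElementTree as ET

XML = """<RDF>
  <footPrint>
    <custid>123</custid>
    <item><itemCd>apple</itemCd></item>
    <item><itemCd>orange</itemCd></item>
  </footPrint>
  <footPrint>
    <custid>456</custid>
    <item><itemCd>grapes</itemCd></item>
    <item><itemCd>kiwi</itemCd></item>
  </footPrint>
</RDF>"""


def flatten_footprints(xml_text):
    """Return (id, custid, item_seq, item) tuples, one per <item> element."""
    root = ET.fromstring(xml_text)
    rows = []
    # id counts the footPrint elements; item_seq restarts inside each one.
    for fp_id, foot_print in enumerate(root.findall("footPrint"), start=1):
        custid = foot_print.findtext("custid")
        for item_seq, item in enumerate(foot_print.findall("item"), start=1):
            rows.append((fp_id, custid, item_seq, item.findtext("itemCd")))
    return rows


for row in flatten_footprints(XML):
    print(" | ".join(str(value) for value in row))
```

The nested loops mirror the parent-child structure: the outer counter becomes id and the inner counter becomes item_seq.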
SQL newbie here. I'm trying to write a query that generates a scoring table, setting null to a student's grades in a module for which they haven't yet taken their exams (on PostgreSQL).
So I start with tables that look something like this:
student_evaluation:
|student_id| module_id | course_id |grade |
|----------|-----------|-----------|-------|
| 1 | 1 | 1 |3 |
| 1 | 1 | 1 |7 |
| 1 | 2 | 1 |8 |
| 2 | 4 | 2 |9 |
course_module:
| module_id | course_id |
| ---------- | --------- |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 2 |
In our use case, a course is made up of several modules. Each module has a single exam, but a student who failed his exam may have a couple of retries. The same module may also be present in different courses, but an exam attempt only counts for one instance of the module (ie. student A passed module 1's exam on course 1. If course 2 also has module 1, student A has to retake the same exam for course 2 if he also has access to that course).
So the output should look like this:
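| student_id | module_id | course_id | grade |
| ---------- | --------- | --------- | ----- |
| 1 | 1 | 1 | 3 |
| 1 | 1 | 1 | 7 |
| 1 | 2 | 1 | 8 |
| 1 | 3 | 1 | null |
| 2 | 4 | 2 | 9 |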
I feel like this should have been a simple task, but I think I have a very flawed understanding of how outer and cross joins work. I have tried stuff like:
SELECT se.student_id, se.module_id, se.course_id, se.grade FROM student_evaluation se
RIGHT OUTER JOIN course_module ON course_module.course_id = se.course_id
AND course_module.module_id = se.module_id
or
SELECT se.student_id, se.module_id, se.course_id, se.grade FROM student_evaluation se
CROSS JOIN course_module WHERE course_module.course_id = se.course_id
Neither worked. These all feel wrong, but I'm lost as to what would be the proper way to go about this.
Thank you in advance.
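I think you need both join types: first use a cross join to build a list of all combinations of students and course modules, then use an outer join to add the grades. The join has to match on student, module and course so that each grade lands on the right row.
SELECT sc.student_id,
       sc.module_id,
       sc.course_id,
       se.grade
FROM student_evaluation se
RIGHT JOIN (SELECT s.student_id,
                   c.module_id,
                   c.course_id
            FROM (SELECT DISTINCT student_id
                  FROM student_evaluation) AS s
            CROSS JOIN course_module AS c) AS sc
USING (student_id, module_id, course_id);
Note that the cross join pairs every student with every course module. To list only the courses a student actually takes, build the inner list from the distinct (student_id, course_id) pairs joined to course_module on course_id instead.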
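To check the idea against the sample data, here is a small sketch that runs a variant of the query in an in-memory SQLite database through Python's sqlite3 module. Several things here are assumptions rather than part of the original answer: SQLite stands in for PostgreSQL, the join is written as a LEFT JOIN from the combination list (equivalent to the RIGHT JOIN form, and accepted by older SQLite versions), and the courses a student takes are inferred from student_evaluation, since the question shows no enrolment table.

```python
import sqlite3


def scoring_table():
    """Build the sample tables in memory and return the scoring rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE student_evaluation
            (student_id INT, module_id INT, course_id INT, grade INT);
        CREATE TABLE course_module (module_id INT, course_id INT);
        INSERT INTO student_evaluation VALUES
            (1, 1, 1, 3), (1, 1, 1, 7), (1, 2, 1, 8), (2, 4, 2, 9);
        INSERT INTO course_module VALUES (1, 1), (2, 1), (3, 1), (4, 2);
    """)
    rows = conn.execute("""
        SELECT sc.student_id, sc.module_id, sc.course_id, se.grade
        FROM (-- every module of every course the student has a grade in
              SELECT DISTINCT e.student_id, cm.module_id, cm.course_id
              FROM student_evaluation AS e
              JOIN course_module AS cm ON cm.course_id = e.course_id) AS sc
        LEFT JOIN student_evaluation AS se
               ON se.student_id = sc.student_id
              AND se.module_id = sc.module_id
              AND se.course_id = sc.course_id
        ORDER BY sc.student_id, sc.course_id, sc.module_id, se.grade
    """).fetchall()
    conn.close()
    return rows


for row in scoring_table():
    print(row)
```

Modules without any attempt come back with grade None, which is how sqlite3 represents SQL NULL.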
I am trying to filter a DataFrame so that it keeps only the rows where one column matches the regex pattern given in another column.
df = sqlContext.createDataFrame([('what is the movie that features Tom Cruise','actor_movies','(movie|film).*(feature)|(in|on).*(movie|film)'),
('what is the movie that features Tom Cruise','artist_song','(who|what).*(sing|sang|perform)'),
('who is the singer for hotel califonia?','artist_song','(who|what).*(sing|sang|perform)')],
['query','question_type','regex_patt'])
+-------------------------------------+-------------+---------------------------------------------+
|query                                |question_type|regex_patt                                   |
+-------------------------------------+-------------+---------------------------------------------+
|what movie features Tom Cruise       |actor_movies |(movie|film).*(feature)|(in|on).*(movie|film)|
|what movie features Tom Cruise       |artist_song  |(who|what).*(sing|sang|perform)              |
|who is the singer for hotel califonia|artist_song  |(who|what).*(sing|sang|perform)              |
+-------------------------------------+-------------+---------------------------------------------+
I want to prune the data frame so that it keeps only the rows whose query matches the value in the regex_patt column.
The final result should look like this:
+-------------------------------------+-------------+---------------------------------------------+
|query                                |question_type|regex_patt                                   |
+-------------------------------------+-------------+---------------------------------------------+
|what movie features Tom Cruise       |actor_movies |(movie|film).*(feature)|(in|on).*(movie|film)|
|who is the singer for hotel califonia|artist_song  |(who|what).*(sing|sang|perform)              |
+-------------------------------------+-------------+---------------------------------------------+
I was thinking of
df.filter(column('query').rlike('regex_patt'))
But rlike only accepts a regex string, not a column.
So the question is: how can I filter on the "query" column using the regex stored in the "regex_patt" column?
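You could try this. Wrapping the call in expr lets you pass columns as both the string and the pattern. The explicit group index 0 returns the whole match, so a row is kept whenever its pattern matches at all, whichever alternative matched.
from pyspark.sql import functions as F
df.withColumn("query1", F.expr("regexp_extract(query, regex_patt, 0)")).filter(F.col("query1") != "").drop("query1").show(truncate=False)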
+------------------------------------------+-------------+---------------------------------------------+
|query |question_type|regex_patt |
+------------------------------------------+-------------+---------------------------------------------+
|what is the movie that features Tom Cruise|actor_movies |(movie|film).*(feature)|(in|on).*(movie|film)|
|who is the singer for hotel califonia? |artist_song |(who|what).*(sing|sang|perform) |
+------------------------------------------+-------------+---------------------------------------------+
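To sanity-check which rows should survive, the per-row test can be mimicked in plain Python with re.search, which looks for the pattern anywhere in the string. This is only a sketch of the logic and not Spark code: the helper name keep_matching is made up here, and Python's re module and the Java regex engine used by Spark agree on these simple patterns but are not identical in general.

```python
import re

rows = [
    ("what is the movie that features Tom Cruise", "actor_movies",
     r"(movie|film).*(feature)|(in|on).*(movie|film)"),
    ("what is the movie that features Tom Cruise", "artist_song",
     r"(who|what).*(sing|sang|perform)"),
    ("who is the singer for hotel califonia?", "artist_song",
     r"(who|what).*(sing|sang|perform)"),
]


def keep_matching(rows):
    """Keep the rows whose query contains a match for that row's own pattern."""
    return [row for row in rows if re.search(row[2], row[0])]


for query, question_type, _ in keep_matching(rows):
    print(question_type, "|", query)
```

The second row is dropped because its query contains "what" but none of "sing", "sang" or "perform" after it.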
I've got a requirement to build a list report that shows volume grouped by 3 columns. The issue I'm having is that if nothing happened on a specific day for a specific combination of the grouped columns, I can't force the report to show 0.
What I'm currently getting is something like:
ABC | AA | 01/11/2017 | 1
ABC | AA | 03/11/2017 | 2
ABC | AA | 05/11/2017 | 1
What I need is:
ABC | AA | 01/11/2017 | 1
ABC | AA | 02/11/2017 | 0
ABC | AA | 03/11/2017 | 2
ABC | AA | 04/11/2017 | 0
ABC | AA | 05/11/2017 | 1
I've tried unioning in a "dummy" query with no query filters, but there are days where nothing has happened at all for those first 2 columns, so it doesn't always populate.
Hope that makes sense. Any help would be greatly appreciated!
For anyone who wanted an answer, I figured it out.
Query 1 returns just the dates. Some kind of event happens every day, so this always gives the complete date range.
Query 2 returns the other 2 "grouped by" columns.
Create a data item in each query with "1" as the value (anything works as long as it is the same in both).
Left join Query 1 to Query 2 on this new data item.
This gives every combination of all 3 columns. The resulting "Query 3" can then be left joined again to the query holding the measures. Depending on the aggregation, the measure data item in the final query may need to be wrapped in COALESCE/ISNULL to produce a 0 on the days when nothing happened.
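The same idea can be sketched in a few lines of Python, purely as an illustration of the logic and not of the reporting tool itself. The names fill_missing, dates, groups and events are made up for this example; itertools.product plays the role of the join on the constant data item, and dict.get with a default of 0 plays the role of COALESCE/ISNULL.

```python
from itertools import product

# "Query 1": every date in the range.
dates = ["01/11/2017", "02/11/2017", "03/11/2017", "04/11/2017", "05/11/2017"]
# "Query 2": the distinct values of the other two grouped columns.
groups = [("ABC", "AA")]
# The measures that actually exist, keyed by all three columns.
events = {
    ("ABC", "AA", "01/11/2017"): 1,
    ("ABC", "AA", "03/11/2017"): 2,
    ("ABC", "AA", "05/11/2017"): 1,
}


def fill_missing(dates, groups, events):
    """Return one row per group/date combination, with 0 where nothing happened."""
    return [
        (col1, col2, day, events.get((col1, col2, day), 0))
        for (col1, col2), day in product(groups, dates)
    ]


for row in fill_missing(dates, groups, events):
    print(" | ".join(str(value) for value in row))
```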
I'm trying to figure out how to show distinct records within groups in Crystal Reports. The view I wrote returns something like this:
Field 1 | Field 2 | Field 3
----------------------------------
10 | 111 | Record Info 1
10 | 111 | Record Info 1
10 | 222 | Record Info 2
20 | 111 | Record Info 1
20 | 222 | Record Info 2
The report groups are based off field one, and I want distinct fields 2 and 3 for each group:
Field 1 | Field 2 | Field 3
----------------------------------
10 | 111 | Record Info 1
10 | 222 | Record Info 2
20 | 111 | Record Info 1
20 | 222 | Record Info 2
Field 2 and 3 are always the same, Field 1 acts as an FK reference to any entries in the view. Selecting distinct xxx in the view isn't really viable due to the huge amount of columns being brought in.
Can this be done in CR?
Cheers
Create groups on field1 and field2.
Hide the Details section, the field1 group header and the field1 group footer.
Drop all the columns you want to show into the field2 group header or footer.
Good luck!
You might also consider using Database | Select Distinct Records.
I'm using the latest version of Talend, 5.3.1.
I have a tMSSqlInput that queries my database like this:
SELECT Invoice.IdInvoice, DateInvoice, IdStuff, Name FROM Invoice
INNER JOIN Stuff ON Invoice.IdInvoice = Stuff.IdInvoice
which results in something like this:
IdInvoice | DateInvoice | IdStuff | Name
1 | 2013-01-01 | 10 | test
1 | 2013-01-01 | 11 | test2
2 | 2013-02-01 | 12 | test3
2 | 2013-02-01 | 13 | test4
I'd like to export one file per invoice. Here are the specifications:
one header line with IdInvoice;DateInvoice
then one line per stuff like IdStuff;Name
example file 1:
1;2013-01-01
10;test
11;test2
example file 2 :
2;2013-02-01
12;test3
13;test4
How can I solve this case with Talend?
Probably with tFileOutputDelimited, but how can I write several kinds of information to one file and iterate over each IdInvoice?
Please go through the following link; it gives a clear idea of how to split data into multiple files:
http://www.talendfreelancer.com/2013/09/talend-tflowtoiterate.html
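The grouping logic behind "one file per invoice" can be sketched in plain Python as follows. This is not Talend code, only an illustration of what the job has to do for each IdInvoice; the function name split_by_invoice and the file names invoice_<id>.csv are hypothetical, and the function returns the file contents in a dictionary instead of writing to disk so that it is easy to inspect.

```python
from itertools import groupby
from operator import itemgetter

# Rows as returned by the query: IdInvoice, DateInvoice, IdStuff, Name.
rows = [
    (1, "2013-01-01", 10, "test"),
    (1, "2013-01-01", 11, "test2"),
    (2, "2013-02-01", 12, "test3"),
    (2, "2013-02-01", 13, "test4"),
]


def split_by_invoice(rows):
    """Map a (hypothetical) file name to the text of that invoice's file."""
    files = {}
    invoice_key = itemgetter(0, 1)
    # groupby only groups adjacent rows, so sort on the same key first.
    for (invoice_id, invoice_date), group in groupby(
            sorted(rows, key=invoice_key), key=invoice_key):
        lines = [f"{invoice_id};{invoice_date}"]  # header line
        lines += [f"{stuff_id};{name}" for _, _, stuff_id, name in group]
        files[f"invoice_{invoice_id}.csv"] = "\n".join(lines) + "\n"
    return files


for file_name, content in split_by_invoice(rows).items():
    print(file_name)
    print(content)
```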