Removing empty columns in Pentaho Kettle before inserting into MongoDB

I am using Pentaho Kettle as a tool to process several CSV files before inserting them into MongoDB for the first time.
Since MongoDB is schemaless, I don't see the point in keeping the null column values of the CSV rows. I want to receive something like this from the CSV:
+-----+---------+---------+
| _id | VALUE_1 | VALUE_2 |
+-----+---------+---------+
| 1   | 1       | 1       |
| 2   | 2       | null    |
| 3   | null    | 2       |
+-----+---------+---------+
and insert it into MongoDB so that I get this in there:
{ "_id" : 1, "VALUE_1" : 1, "VALUE_2" : 1 }
{ "_id" : 2, "VALUE_1" : 2 }
{ "_id" : 3, "VALUE_2" : 2}
How would I do such a thing in Kettle? I just can't seem to find the right option there; there is a Filter Rows step, but it doesn't seem to be what I want.

I'm having the same problem. One workaround I found from Matt Casters and Diethard Steiner is to unpivot the data and then remove the null rows. Then you could pivot back and write out the JSON with a JavaScript step or perhaps a JSON Output step. Similar to this:
http://diethardsteiner.blogspot.com/2010/11/pentaho-kettle-data-input-pivoted-data.html
This worked fine for small files, but I have large CSVs with 30-100 columns and hundreds of thousands of rows, millions in some cases, so pivoting is very slow... but maybe you can come up with another idea, I'd be glad to hear it! =)
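If preprocessing outside Kettle is an option, here is a minimal sketch of the same idea in Python with pymongo; the connection string, database, collection and file names are assumptions, and the null check depends on how your CSV renders empty values:
import csv
from pymongo import MongoClient

# Minimal sketch: drop empty/null CSV values per row before inserting into MongoDB.
# Connection string, database, collection and file names are assumptions.
client = MongoClient("mongodb://localhost:27017")
collection = client["mydb"]["mycollection"]

with open("input.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Keep only the columns that actually carry a value for this row.
        doc = {k: v for k, v in row.items() if v not in (None, "", "null")}
        collection.insert_one(doc)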

Related

Postgresql Query - select json

I have a PostgreSQL query whose result I want to save as .json, but only a specific part of the query result:
SELECT info FROM d.tests where tag like 'HMIZP'
The result of this query is:
{"blabla":{a lot of blabla}, "Body":[{....
I just want everything after "Body" (including "Body").
How can I do it?
You can combine the extraction with building a JSON object:
SELECT json_build_object('Body',json_extract_path('{"blabla": { "a": "a lot of blabla"},"Body": [{"a": [1,2]}, {"b":2}]}','Body'))
| json_build_object                  |
| :--------------------------------- |
| {"Body" : [{"a": [1,2]}, {"b":2}]} |
db<>fiddle here
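If the goal is to end up with an actual .json file, a minimal sketch in Python with psycopg2 follows; the connection parameters and output file name are assumptions, and the ::json cast assumes the info column is stored as text (drop it if the column is already json/jsonb):
import json
import psycopg2

# Minimal sketch: run the extraction query and dump the "Body" objects to a .json file.
# Connection parameters and the output file name are assumptions.
conn = psycopg2.connect("dbname=mydb user=me password=secret host=localhost")
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT json_build_object('Body', json_extract_path(info::json, 'Body')) "
        "FROM d.tests WHERE tag LIKE 'HMIZP'"
    )
    rows = cur.fetchall()

with open("body.json", "w") as f:
    # psycopg2 returns json columns already parsed into Python objects.
    json.dump([row[0] for row in rows], f)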

Parse record (PCF) from Kafka using Kafka Kusto Sink

I've set up my environment using Docker based on this guide.
In kafka-console-producer I will send this line:
Hazriq|27|Undegrad|UNITEN
I want this data to be ingested into Kusto like this:
+--------+-----+----------------+------------+
| Name   | Age | EducationLevel | University |
+--------+-----+----------------+------------+
| Hazriq | 27  | Undegrad       | UNITEN     |
+--------+-----+----------------+------------+
Can this be handled by Kusto using the mapping (which I'm still trying to understand), or should this be handled on the Kafka side?
I tried #daniel's suggestion:
.create table ParsedTable (name: string, age: int, educationLevel: string, univ:string)
.create table ParsedTable ingestion csv mapping 'ParsedTableMapping' '[{ "Name" : "name", "Ordinal" : 0},{ "Name" : "age", "Ordinal" : 1 },{ "Name" : "educationLevel", "Ordinal" : 2},{ "Name" : "univ", "Ordinal" : 3}]'
kusto.tables.topics_mapping=[{'topic': 'kafkatopiclugiaparser','db': 'kusto-test', 'table': 'ParsedTable','format': 'psv', 'mapping':'ParsedTableMapping'}]
value.converter=org.apache.kafka.connect.storage.StringConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
but getting this instead:
+----------------------------+-----+----------------+------+
| Name                       | Age | EducationLevel | Univ |
+----------------------------+-----+----------------+------+
| Hazriq|27|Undergrad|UNITEN |     |                |      |
+----------------------------+-----+----------------+------+
Currently, the connector passes the data along as it comes (no manipulation on it on the client side), and any parsing is left to Kusto.
The psv format is supported by Kusto, so this should be possible by setting the format to psv and providing a mapping reference.
When adding the plugin as described, you should be able to set it up like:
kusto.tables.topics_mapping=[{'topic': 'testing1','db': 'testDB', 'table': 'KafkaTest','format': 'psv', 'mapping':'KafkaMapping'}]
The mapping itself can be defined in Kusto as described in the Kusto docs, e.g. with a .create table ... ingestion csv mapping command like the one you already ran.
Ingestion of data as you've shown using the psv format is supported (see below); it's probably just a matter of debugging why your client-side invocation of the underlying commands isn't yielding the expected result. If you could share the full flow and code, including parameters, that may be helpful.
.create table ParsedTable (name: string, age: int, educationLevel: string, univ:string)
.ingest inline into table ParsedTable with(format=psv) <| Hazriq|27|Undegrad|UNITEN
ParsedTable:
| name   | age | educationLevel | univ   |
|--------|-----|----------------|--------|
| Hazriq | 27  | Undegrad       | UNITEN |
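For reference, a minimal sketch of sending the same raw psv record to the topic from Python with kafka-python; the bootstrap server address is an assumption, and kafka-console-producer works just as well:
from kafka import KafkaProducer

# Minimal sketch: send the raw psv line to the topic as-is; the connector forwards it
# untouched and Kusto parses it using the psv format plus the ingestion mapping.
# The bootstrap server address is an assumption.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("kafkatopiclugiaparser", b"Hazriq|27|Undegrad|UNITEN")
producer.flush()
producer.close()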

Cognos force 0 on group by

I've got a requirement to build a list report to show volume by 3 grouped-by columns. The issue I'm having is that if nothing happened on specific days for the specific grouped columns, I can't force it to show 0.
What I'm currently getting is something like:
ABC | AA | 01/11/2017 | 1
ABC | AA | 03/11/2017 | 2
ABC | AA | 05/11/2017 | 1
What I need is:
ABC | AA | 01/11/2017 | 1
ABC | AA | 02/11/2017 | 0
ABC | AA | 03/11/2017 | 2
ABC | AA | 04/11/2017 | 0
ABC | AA | 05/11/2017 | 1
I've tried going down the route of unioning a "dummy" query with no query filters; however, there are days where nothing has happened at all for those first 2 columns, so it doesn't always populate.
Hope that makes sense, any help would be greatly appreciated!
To anyone who wanted an answer: I figured it out. Query 1 is for just the dates, as there will always be some form of event happening daily, so it will always give a complete date range.
Query 2 is for the other 2 "grouped by" columns.
Create a data item in each with "1" as the result (it would work with anything, as long as they are the same).
Left join Query 1 to Query 2 on this new data item.
This then gives the full combination of all 3 columns needed. The resulting "Query 3" can then be left joined again to get the measures. The final query (depending on aggregation) may need to have the measure data item wrapped in a COALESCE/ISNULL to produce a 0 on those days where nothing happened; a sketch of the same idea follows below.
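Outside Cognos, the same scaffold-and-join idea looks like this in pandas; the column names and the sample frames are assumptions standing in for Query 1, Query 2 and the measure query:
import pandas as pd

# Minimal sketch: cross-join the date query and the group query via a constant key
# (the dummy "1" data item trick), left-join the measures, and fill the gaps with 0.
# Column names and sample data are assumptions.
dates = pd.DataFrame({"date": pd.date_range("2017-11-01", "2017-11-05")})   # Query 1
groups = pd.DataFrame({"col1": ["ABC"], "col2": ["AA"]})                     # Query 2
facts = pd.DataFrame({                                                       # measures
    "col1": ["ABC", "ABC", "ABC"],
    "col2": ["AA", "AA", "AA"],
    "date": pd.to_datetime(["2017-11-01", "2017-11-03", "2017-11-05"]),
    "volume": [1, 2, 1],
})

dates["key"] = 1
groups["key"] = 1
scaffold = dates.merge(groups, on="key").drop(columns="key")                 # "Query 3"

result = scaffold.merge(facts, on=["col1", "col2", "date"], how="left")
result["volume"] = result["volume"].fillna(0).astype(int)                    # COALESCE(..., 0)
print(result.sort_values("date"))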

Forum like data structure: NoSQL appropriate?

I'm trying to save data which has a "forum like" structure:
This is the simplified data model:
+---------------+
| Forum         |
|               |
| Name          |
| Category      |
| URL           |
|               |
+---------------+
        |1
        |n
+---------------+
|               |
| Thread        |
|               |
| ID            |
| Name          |
| Author        |
| Creation Date |
| URL           |
|               |
+---------------+
        |1
        |n
+---------------+
|               |
| Post          |
|               |
| Creation Date |
| Links         |
| Images        |
|               |
+---------------+
I have multiple forums/boards. They can have some threads. A thread can contain n posts (for data analysis purposes, I'm just interested in the links, images and creation date a post contains).
I'm looking for the right technology for saving and reading data in a structure like this.
While I was using SQL databases heavily in the past, I also had some NoSQL projects (primarily document based with MongoDB).
I'm sure MongoDB is excellent for STORING data in such a structure (Forum is a document, while the Threads are subdocuments. Posts are subdocuments in Threads). But what about reading them? I have the following use cases:
List all posts from a forum with a specific Category
Find a specific link in a Post in all datasets/documents
Which technology is best for those use cases?
Please find my draft solution below. I have considered MongoDB for the design.
Post Collection:
"image" should be stored separately in GridFS, as MongoDB documents have a maximum size of 16 MB. You can store the ObjectId of the image in the Post collection.
{
    "_id" : ObjectId("57b6f7d78f19ac1e1fcec7b5"),
    "createdate" : ISODate("2013-03-16T02:50:27.877Z"),
    "links" : "google.com",
    "image" : ObjectId("5143ddf3bcf1bf4ab37d9c6e"),
    "thread" : [
        {
            "id" : ObjectId("5143ddf3bcf1bf4ab37d9c6e"),
            "name" : "Sam",
            "author" : "Sam",
            "createdate" : ISODate("2013-03-16T02:50:27.877Z"),
            "url" : "https://www.wikipedia.org/"
        }
    ],
    "forum" : [
        {
            "name" : "Andy",
            "category" : "technology",
            "url" : "https://www.infoq.com/"
        }
    ]
}
In order to access the data by category, you can create an index on the "forum.category" field.
db.post.createIndex( { "forum.category": 1 } )
In order to access the data by links, you can create an index on the "links" field.
db.post.createIndex( { "links": 1 } )
Please note that the indexes are not mandatory; you can access/query the data without an index as well. Create the indexes if you need better read performance.
I have seen applications using MongoDB for a similar use case to yours. You can go ahead with MongoDB for the above-mentioned use cases (or access patterns); both queries are sketched below.
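A minimal sketch of the two read patterns from Python with pymongo; the connection string and database name are assumptions:
from pymongo import MongoClient

# Minimal sketch of the two access patterns against the Post collection above.
# Connection string and database name are assumptions.
db = MongoClient("mongodb://localhost:27017")["forumdb"]

# Use case 1: list all posts from forums with a specific category
# (served by the index on "forum.category").
tech_posts = list(db.post.find({"forum.category": "technology"}))

# Use case 2: find a specific link across all post documents
# (served by the index on "links").
posts_with_link = list(db.post.find({"links": "google.com"}))

print(len(tech_posts), len(posts_with_link))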

Iterate on a tMssqlInput in Talend

I use the latest version of Talend, 5.3.1.
I have a tMssqlInput which queries my database like:
SELECT IdInvoice, DateInvoice, IdStuff, Name FROM Invoice
INNER JOIN Stuff ON Invoice.IdInvoice = Stuff.IdInvoice
which results in something like this:
IdInvoice | DateInvoice | IdStuff | Name
1         | 2013-01-01  | 10      | test
1         | 2013-01-01  | 11      | test2
2         | 2013-02-01  | 12      | test3
2         | 2013-02-01  | 13      | test4
I'd like to export one file per invoice; here are the specifications:
one header line with IdInvoice;DateInvoice
then one line per stuff like IdStuff;Name
example file 1:
1;2013-01-01
10;test
11;test2
example file 2:
2;2013-02-01
12;test3
13;test4
How can I resolve that case with Talend?
Probably with tFileOutputDelimited, but how can I have one file with multiple pieces of information and iterate over each IdInvoice?
Please go through the following link; you will get a clear idea of how to split data into multiple files:
http://www.talendfreelancer.com/2013/09/talend-tflowtoiterate.html
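For comparison, outside Talend the split can be sketched in a few lines of Python; the inline rows list stands in for the tMssqlInput result (which the query already returns ordered by IdInvoice), and the file naming is an assumption:
from itertools import groupby

# Minimal sketch: group the joined rows by invoice and write one semicolon-delimited
# file per invoice: a header line IdInvoice;DateInvoice, then one line per stuff.
# The rows list stands in for the query result and must be ordered by IdInvoice.
rows = [
    (1, "2013-01-01", 10, "test"),
    (1, "2013-01-01", 11, "test2"),
    (2, "2013-02-01", 12, "test3"),
    (2, "2013-02-01", 13, "test4"),
]

for (id_invoice, date_invoice), stuff in groupby(rows, key=lambda r: (r[0], r[1])):
    with open(f"invoice_{id_invoice}.csv", "w") as f:      # file name is an assumption
        f.write(f"{id_invoice};{date_invoice}\n")           # header: IdInvoice;DateInvoice
        for _, _, id_stuff, name in stuff:
            f.write(f"{id_stuff};{name}\n")                 # one line per stuff: IdStuff;Name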