Splitting a column into multiple columns in MongoDB

I have a dictionary field in a MongoDB document which contains values that are separated by a semicolon. Is there any query that I could use to split the column into multiple columns?
The scenario is that I load in contents from a CSV file which sometimes has columns that are delimited by characters like a semicolon. Since I will have to support any kind of input CSV file, I cannot fix anything in the schema. Thus I have a dictionary field called "content" that stores the document contents as a dictionary. Now I need to be able to perform splits on columns that have multiple values.
Eg: Author Names column has entries like Author1;Author2;Author3. The user should be able to split this into 3 columns - one for each author.
Edit: For now, I have implemented this by means of a process on the server side. Ideally it would be great if I can do this in MongoDB itself (speed constraints).
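As a rough sketch of the split described above (not a tested MongoDB answer), the server-side version can be written in a few lines of Python. The names `split_field` and `split_stage` are hypothetical. `split_stage` only builds an aggregation stage around the `$split` operator (available in MongoDB 3.4 and later), which turns a string into an array rather than into separate columns; it assumes the field always holds a string and has not been run against a server here.

```python
def split_field(content, field, delimiter=";"):
    """Return a copy of `content` with `field` replaced by numbered columns."""
    if field not in content:
        return dict(content)
    # Keep every other key untouched.
    result = {k: v for k, v in content.items() if k != field}
    parts = [p.strip() for p in str(content[field]).split(delimiter)]
    for i, part in enumerate(parts, start=1):
        result[f"{field} {i}"] = part
    return result


def split_stage(field, delimiter=";"):
    """Build an $addFields stage that replaces content.<field> with an array.

    Assumption: the value stored under content.<field> is a string.
    """
    path = f"content.{field}"
    return {"$addFields": {path: {"$split": [f"${path}", delimiter]}}}


doc = {"Title": "Some paper", "Author Names": "Author1;Author2;Author3"}
print(split_field(doc, "Author Names"))
print(split_stage("Author Names"))
```

The pure-Python function produces one key per author; the aggregation stage would leave an array in place, which the client would still need to spread into columns.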

Related

Comma within a field of csv file. Exporting the data on regular basis

We will be exporting the data through command line data loader
I have a CSV file in which many values contain a comma as part of the value. The commas within the fields are misleading, making it seem like a row has more columns than the header declares.
Name,Amount,Address
Me,20,000,My Home,India
you,23,300,Your Home,Where
What are my options here, given that this will be an automated process?
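One common option, if the export step can be changed, is to quote any field that contains the delimiter. The sketch below uses Python's standard `csv` module and assumes the intended values are "20,000" for Amount and "My Home,India" for Address; that reading of the sample rows is a guess, and `write_quoted` and `read_back` are hypothetical helper names.

```python
import csv
import io

# Assumed intended values for the sample rows in the question.
rows = [
    ["Name", "Amount", "Address"],
    ["Me", "20,000", "My Home,India"],
    ["you", "23,300", "Your Home,Where"],
]


def write_quoted(rows):
    """Serialize rows, quoting only the fields that contain special characters."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


def read_back(text):
    """Parse CSV text; quoted commas stay inside their field."""
    return list(csv.reader(io.StringIO(text)))


text = write_quoted(rows)
parsed = read_back(text)
print(text)
print(parsed)
```

With quoting in place, each data row parses back into exactly three fields. Whether the consuming loader honors quoted fields would need to be checked against its documentation.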

Azure ADF Copy Activity with Trailing Column Delimiter

I have a strange source CSV file where it contains a trailing column delimiter at the end of each record just before the carriage return/new line.
When ADF is previewing this data, it displays only 2 columns without issue and all the data rows. However, when using the copy activity, it fails with the following exception.
ErrorCode=DelimitedTextColumnNameNotAllowNull,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=The name of column index 3 is empty. Make sure column name is properly specified in the header
Now I understand why it's complaining about this due to trailing delimiter, but my question is whether or not there is a way to deal with this condition? I've tried including the trailing comma in the record delimiter (,\r\n), but then it just pivots the data where all the columns become rows.
Is there a way to address this condition in copy activity?
When previewing the data in the dataset, it seems correct:
But in the copy activity, the data is actually split into 3 columns by the column delimiter ",", and the third column is empty or NULL. This causes the error.
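The effect of the trailing delimiter can be reproduced outside ADF. The sketch below is plain Python, not an ADF feature: it parses a made-up sample with a trailing comma to show the empty third column, and then shows a hypothetical pre-processing step (`strip_trailing_delimiter`) that removes the trailing delimiter before the file is handed to the copy step. The sample data and function names are assumptions for illustration.

```python
import csv
import io

# Made-up sample: every record ends with a trailing comma before \r\n.
raw = "id,name,\r\n1,alpha,\r\n2,beta,\r\n"


def parse(text):
    """Parse CSV text without newline translation."""
    return list(csv.reader(io.StringIO(text, newline="")))


def strip_trailing_delimiter(text, delimiter=","):
    """Drop one trailing delimiter from each line, if present."""
    cleaned = [
        line[: -len(delimiter)] if line.endswith(delimiter) else line
        for line in text.splitlines()
    ]
    return "\r\n".join(cleaned) + "\r\n"


print(parse(raw))                            # third field is an empty string
print(parse(strip_trailing_delimiter(raw)))  # two fields per row
```

This kind of stripping is only safe when the last real column can never be legitimately empty; otherwise it would silently drop a field.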
If you use Data Flow import projection from source, you can see the third column:
For now, the copy activity doesn't support modifying the data schema. You must use a Data Flow Derived Column to create a new schema for the source. For example:
Then mapping the new column/schema to sink will solve the problem.
HTH.
Use a different encoding for your CSV. CSV utf-8 will do the trick.

How to compute a variable or column of comma separated values from multiple rows of the same column

Scenario: an Azure data flow processes bulk records from a CSV dataset. To run dependent jobs at the destination SQL database, I need a comma-separated list of IDs built from multiple rows of that CSV. Can someone help with how to do this?
I tried using a derived column step with the coalesce and concat functions, but didn't get the result I was looking for.
Use the collect() aggregate function. This will act like a string agg. It was just released last week.
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-expression-functions#collect
https://techcommunity.microsoft.com/t5/azure-data-factory/adf-adds-new-hierarchical-data-handling-and-new-flexibility-for/ba-p/1353956
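For reference, the result the question is after (one comma-separated string built from a single column across many rows) can be sketched in plain Python. This is only an illustration of the expected output, not ADF expression syntax; the column name `id` and the helper `collect_ids` are hypothetical.

```python
import csv
import io


def collect_ids(csv_text, column):
    """Join the values of one column across all rows into a comma-separated string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return ",".join(row[column] for row in reader)


sample = "id,name\n1,a\n2,b\n3,c\n"
print(collect_ids(sample, "id"))
```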

PostgreSql Dynamic JSON Indexing

I am new to the PostgreSQL world. We chose this DB so that we could run filter queries on our JSON results, such as contains, less than, greater than, etc. The JSON results are dynamic, and we cannot know in advance which keys will be generated as output. The table (result_id (int64), jsondata (jsonb)) looks like this:
id1,{k1:vab,k2:abc,k3:def}
id1,{k1:abv,k2:v7,k3:ghu}
id1,{k1:v5,k2:vdd,k3:vew}
id1,{k1:v6,k2:v9s,k3:ved}
id2,{k4:vw,k5:vds,k6:vdss}
id2,{k4:v1,k5:fgg,k6:dd}
id2,{k4:qw,k5:gfd,k6:ess}
id2,{k4:er,k5:dfs,k6:fss}
My queries would be something like
Select * from table where result_id = 'id1' and jsondata->'k1' contains 'ab'
My script outputs a json content that I store in this table.
Each JSON key is represented as a grid column, and the key's values are the column data. The grid offers filtering capabilities, which means filtering on JSON data.
My problem is that the filtering can happen on any JSON key, but key names are not static. Keys (the JSON output) might change when the script content is changed, so previously indexed keys would become irrelevant. If the script is not changed, the keys remain constant.
How do I apply indexing so that my JSON filter operations become faster? The same table contains many keys within the same JSON row and across rows. Wouldn't it be inefficient to index all keys so that filtering can be made efficient?

Transpose data using Talend

I have this kind of data:
I need to transpose this data into something like this using Talend:
Help would be much appreciated.
dbh's suggestion should work indeed, but I did not try it.
However, I have another solution which doesn't require changing the input format and is not too complicated to implement. The job has only 2 transformation components (tDenormalize and tMap).
The job looks like the following:
Explanation :
Your input is read from a CSV file (could be a database or any other kind of input)
The tDenormalize component denormalizes your value column (column 2) based on the id column (column 1), separating fields with a specific delimiter (";" in my case), resulting, as shown, in 2 rows.
tMap: split the aggregated column into multiple columns by using Java's String.split() method and spreading the resulting array across multiple columns. The tMap should look like this:
Since Talend doesn't allow storing Array objects, make sure to store the split String as an Object. Then cast that object to an Array on the right side of the tMap.
That approach should give you the expected result.
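The two steps above (denormalize by id, then split the joined string into columns) can be sketched in Python to show the data flow. This is not Talend code; `denormalize` and `split_columns` are hypothetical names, and the sample rows are made up.

```python
def denormalize(rows, delimiter=";"):
    """Group values by id and join each group with the delimiter."""
    grouped = {}
    for key, value in rows:
        grouped.setdefault(key, []).append(value)
    return {key: delimiter.join(values) for key, values in grouped.items()}


def split_columns(denormalized, delimiter=";"):
    """Spread each joined string into one column per value."""
    return [[key, *joined.split(delimiter)] for key, joined in denormalized.items()]


rows = [(1, "a"), (1, "b"), (1, "c"), (2, "d"), (2, "e"), (2, "f")]
joined = denormalize(rows)
print(joined)
print(split_columns(joined))
```

The sketch assumes the values themselves never contain the delimiter; if they can, a different delimiter would be needed in the denormalize step.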
IMPORTANT:
tDenormalize might shuffle the rows, meaning that for bigger inputs you might encounter unsorted output. Make sure to sort it if needed, or use tDenormalizeSortedRow instead.
tDenormalize is similar to an aggregation component, meaning it scans the whole input before processing, which can cause performance issues with particularly big inputs (tens of millions of records).
Your input is probably wrong (you have 5 entries with 1 as id, and 6 entries with 2 as id). 6 columns are expected meaning you should always have 6 lines per id. If not, then you should implement dbh's solution, and you probably HAVE TO add a column with a key.
You can use Talend's tPivotToColumnsDelimited component to achieve this. You will most likely need an additional column in your data to represent the field name.
Like "Identifier, field name, value "
Then you can use this component to pivot the data and write a file as output. If you need to process the data further, read the resulting file with tFileInputDelimited.
See docs and an example at
https://help.talend.com/display/TalendOpenStudioComponentsReferenceGuide521EN/13.43+tPivotToColumnsDelimited
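To illustrate the "Identifier, field name, value" layout mentioned above, here is a small Python sketch of the same kind of pivot. It is not the Talend component itself; `pivot` is a hypothetical helper and the records are made-up sample data.

```python
def pivot(records):
    """Turn (identifier, field name, value) triples into one row per identifier."""
    field_names = []
    by_id = {}
    for identifier, field, value in records:
        if field not in field_names:
            field_names.append(field)  # keep first-seen column order
        by_id.setdefault(identifier, {})[field] = value
    header = ["Identifier", *field_names]
    rows = [
        [identifier, *(fields.get(name, "") for name in field_names)]
        for identifier, fields in by_id.items()
    ]
    return [header, *rows]


records = [
    (1, "name", "alpha"),
    (1, "city", "Paris"),
    (2, "name", "beta"),
]
print(pivot(records))
```

Missing combinations are filled with an empty string here; that choice is an assumption, not documented behavior of the component.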