When trying to use Databricks' Autoloader to write data, the nested columns contain invalid characters:
Found invalid character(s) among " ,;{}()\n\t=" in the column names of your schema.
How to deal with this issue?
Note again that it is the nested columns, not the outermost columns themselves. The latter would be easy to fix with something like:
import re
from pyspark.sql.functions import col
df = df.select([col(c).alias(re.sub(r"[^0-9a-zA-Z_]+", "", c)) for c in df.columns])
How do I reach the nested columns, as they're not yet exploded?
If you're writing to Delta Lake you can use column mapping to get around this.
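For reference, a minimal sketch of enabling column mapping on an existing Delta table (the table name is a placeholder; the property names and protocol versions are the ones documented for Delta Lake):

# Hypothetical table name; 'name' column mapping lets Delta keep logical column
# names containing characters that Parquet would otherwise reject.
spark.sql("""
    ALTER TABLE my_autoloader_target SET TBLPROPERTIES (
        'delta.minReaderVersion' = '2',
        'delta.minWriterVersion' = '5',
        'delta.columnMapping.mode' = 'name'
    )
""")

With 'name' mode, Delta stores safe physical column names in the underlying Parquet files and keeps the original logical names (including nested field names) in the table metadata.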
Related
I am working in a Databricks notebook (Scala) and I have a Spark query that goes kinda like this:
df = spark.sql("SELECT columnName AS `Column Name` FROM table")
I want to store this as a databricks table. I tried below code for the same:
df.write.mode("overwrite").saveAsTable("df")
But it is giving an error because of the space in the column name. Here's the error:
Attribute name contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
I don't want to remove the space so is there any alternative for this?
No, that's a limitation of the underlying technologies Databricks uses under the hood (for example, PARQUET-677). The only solution here is to rename the column, and if you need to have a space in the name, do the renaming when reading it back.
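A minimal sketch of that workaround (assuming the table name "df" from the question): write with a Parquet-safe column name and re-introduce the space only when reading the table back.

from pyspark.sql import functions as F

# Write with a safe column name instead of the spaced one
df = spark.sql("SELECT columnName AS column_name FROM table")
df.write.mode("overwrite").saveAsTable("df")

# Later, alias it back to the spaced name when reading
df_read = spark.table("df").select(F.col("column_name").alias("Column Name"))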
I have a strange source CSV file where it contains a trailing column delimiter at the end of each record just before the carriage return/new line.
When ADF is previewing this data, it displays only 2 columns without issue and all the data rows. However, when using the copy activity, it fails with the following exception.
ErrorCode=DelimitedTextColumnNameNotAllowNull,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=The
name of column index 3 is empty. Make sure column name is properly
specified in the header
Now I understand why it's complaining about this due to trailing delimiter, but my question is whether or not there is a way to deal with this condition? I've tried including the trailing comma in the record delimiter (,\r\n), but then it just pivots the data where all the columns become rows.
Is there a way to address this condition in copy activity?
When previewing the data in the dataset, it seems correct:
But in the Copy activity, the data actually gets split into 3 columns by the column delimiter ",", and the third column is empty or NULL. This is what causes the error.
If you use Data Flow and import the projection from the source, you can see the third column:
For now, the Copy activity doesn't support modifying the data schema. You must use a Data Flow Derived Column to create a new schema for the source. For example:
Then mapping the new columns/schema to the sink will solve the problem.
HTH.
Use a different encoding for your CSV. CSV utf-8 will do the trick.
I have a mapping data flow with a derived column, where I want to use a column pattern for matching against an array of columns using in()
The data flow is executed in a pipeline, where I set the parameter $owColList_md5 based on a variable that I've populated from a single-line CSV file containing a comma-separated string
If I have a single column name in the CSV file/variable, encapsulated in single quotes, and have the "Expression" checkbox ticked, it works.
The problem is to get it to work with multiple columns. There seem to be parsing problems when the variable contains multiple items, each encapsulated in single quotes, or possibly with the comma separating them. This often causes errors when executing the data flow, with messages like "store is not defined", etc.
I've tried having ''col1'',''col2'' and "col1","col2" (2x single quotes and double quotes) in the CSV file. I've also tried having the file without quotes and tried replacing the comma with escaped quotes (using ) in the derived column pattern expression, with no luck.
How do you populate this array in the derived column based on the data flow parameter which is based on the comma-separated string in the CSV file / variable from the pipeline with column names in a working way?
While array types are not supported as data flow parameters, passing in a comma-separated string can work if you use the instr() function to match.
Say you have two columns, col1 and col2. Pass in a parameter with value '"col1","col2"'.
Then use instr($<yourparamname>, '"' + name + '"') > 0 to see if the column name exists within the string you pass in. Note: you do not need the double quotes, but they can be useful if you have column names that are substrings of other column names, such as id1 and id11.
Hope this helps!
I have this kind of data:
I need to transpose this data into something like this using Talend:
Help would be much appreciated.
dbh's suggestion should indeed work, but I did not try it.
However, I have another solution which doesn't require changing the input format and is not too complicated to implement. Indeed, the job has only 2 transformation components (tDenormalize and tMap).
The job looks like the following:
Explanation :
Your input is read from a CSV file (could be a database or any other kind of input)
The tDenormalize component will denormalize your value column (column 2) based on the value in the id column (column 1), separating fields with a specific delimiter (";" in my case), resulting in 2 rows as shown.
tMap: split the aggregated column into multiple columns by using Java's String.split() method and spreading the resulting array across multiple columns. The tMap should look like this:
Since Talend doesn't allow storing Array objects, make sure to store the split String in Object format. Then cast that object to an Array on the right side of the map.
That approach should give you the expected result.
IMPORTANT:
tDenormalize might shuffle the rows, meaning that for bigger inputs you might encounter unsorted output. Make sure to sort it if needed, or use tDenormalizeSortedRow instead.
tDenormalize behaves like an aggregation component, meaning it scans the whole input before processing, which can lead to performance issues with particularly big inputs (tens of millions of records).
Your input is probably wrong (you have 5 entries with id 1 and 6 entries with id 2). 6 columns are expected, meaning you should always have 6 lines per id. If not, then you should implement dbh's solution, and you probably HAVE TO add a column with a key.
You can use Talend's tPivotToColumnsDelimited component to achieve this. You will most likely need an additional column in your data to represent the field name.
Like "Identifier, field name, value "
Then you can use this component to pivot the data and write a file as output. If you need to process the data further, read the resulting file with tFileInoutDelimited .
See docs and an example at
https://help.talend.com/display/TalendOpenStudioComponentsReferenceGuide521EN/13.43+tPivotToColumnsDelimited
I have a Dictionary field in a MongoDB document which contains values that are separated by a semicolon. Is there any query that I could use to split the column into multiple columns?
The scenario is that I load in contents from a CSV file which sometimes has columns that are delimited by characters like a semicolon. Since I will have to support any kind of input CSV file, I cannot fix anything in the schema. Thus I have a dictionary field called "content" that stores the document contents as a dictionary. Now I need to be able to perform splits on columns that have multiple values.
Eg: Author Names column has entries like Author1;Author2;Author3. The user should be able to split this into 3 columns - one for each author.
Edit: For now, I have implemented this by means of a process on the server side. Ideally it would be great if I could do this in MongoDB itself (due to speed constraints).
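In case the split can be pushed into MongoDB itself (version 3.4 or later), here is a hedged sketch using the aggregation $split operator via PyMongo; the database, collection, and the "AuthorNames" key are placeholders for illustration:

from pymongo import MongoClient

client = MongoClient()                # assumes a reachable MongoDB instance
coll = client["mydb"]["documents"]    # hypothetical database/collection names

pipeline = [
    # Split the semicolon-delimited string into an array (MongoDB 3.4+)
    {"$addFields": {"_authors": {"$split": ["$content.AuthorNames", ";"]}}},
    # Spread the array into separate fields; out-of-range indexes are simply omitted
    {"$addFields": {
        "content.Author1": {"$arrayElemAt": ["$_authors", 0]},
        "content.Author2": {"$arrayElemAt": ["$_authors", 1]},
        "content.Author3": {"$arrayElemAt": ["$_authors", 2]},
    }},
    # Drop the temporary array field
    {"$project": {"_authors": 0}},
]

for doc in coll.aggregate(pipeline):
    print(doc)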