PySpark: read in multiple files, CSV or TSV

I'm trying to load all the files in a folder. They have the same schema, but sometimes a different delimiter (i.e. usually CSV, but occasionally tab-separated).
Is there a way to pass in two delimiters?
To be specific, I don't want a two-character delimiter like "||"; I want to be able to treat multiple single-character delimiters the same way.
I'm letting it infer the schema. Commas work, but the tab-separated rows just end up in the first column.
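One approach, sketched below, assumes the tab-separated files can be picked out by a separate path glob (the /data/input/*.csv and *.tsv paths and the header option are placeholders/assumptions). Since the sep option takes a single delimiter per read, you can read each group with its own delimiter and combine the results with unionByName:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the comma-delimited files...
csv_df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .option("sep", ",")
          .csv("/data/input/*.csv"))

# ...and the tab-delimited files with their own separator.
tsv_df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .option("sep", "\t")
          .csv("/data/input/*.tsv"))

# Same schema, so the two DataFrames can simply be combined.
df = csv_df.unionByName(tsv_df)

If the delimiter can't be told from the filename, another option is to read the files as plain text and split each line yourself, but that forfeits the built-in schema inference.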

Related

Copy Data recursively copying files to a sink struggles with different column order

I have 20+ delimited files in DataLake2 that are pulled in recursively via a single Copy Data activity that runs 16 sub-processes. About 5 of them have a slightly different column order -- one column is moved in those 5 files. ADF seems to struggle occasionally with these files, apparently assuming that their headers line up with the other files.
Does this sound possible/correct? There are 109 columns in total, and the column that is transposed sits in column 104 in most of the files but in column 98 in these 5 files.
The error I get when importing from these files is:
Column 'xxx_date_time' contains an invalid value 'F'
But looking both in Excel and in a text editor, 'xxx_date_time' is blank (as it should be), though only according to the column order of those specific files. If you apply the standard column order from the other 15+ files, there is an 'F' in that position.
I have done some command-line work to ensure there are an even number of quotes (') and the same number of column delimiters (;) in each line, so I don't think the formatting is off. The line endings are all \r.
In summary, any ideas why this would be happening and why the specific header order of the individual files is being ignored? Is this a bug/feature of the Copy Data activity?

Multiple tables with different columns on a single BIRT report

I have a BIRT report containing multiple tables with different datasets and different numbers of columns. I generate the output as .xls and convert it to .csv using the ssconvert utility on Unix. But in the .csv file I see extra delimiters for the tables that have fewer columns. For example, here is the .csv output with the extra "," characters:
table1 -- this has only 10 columns
5912,,,0,,,0,,0,,,0,,,0,,,0,,,
table2 -- this has 20 columns
'12619493',28/03/2018 17:27:40,sdfsdfasd,'61901492478'1.08,,,1.08,sdfs,,dsf,,sdfadfs,'738331',,434,,,,,,,333,
I tried putting the tables in a grid, but I still see the extra ",". I opened the .xls file and it has the same issue: the cells in Excel are merged.

COPY ignore blank columns

Unfortunately I've got a huge number of CSV files with missing separators, as shown below. Notice the second record has only one separator and two values. Currently I'm getting a "Delimiter not found" error.
If only I could insert NULL into the 3rd column when there are only two values.
1,avc,99
2,xyz
3,timmy,6
Is there any way I can COPY these files into Redshift without modifying the CSV files?
Use the FILLRECORD parameter to load NULLs for the blank columns.
You can check the docs for more details.
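As a rough sketch of what that looks like (the connection string, table name, S3 path, and IAM role below are placeholders, and psycopg2 is just one way to run the statement):

import psycopg2  # any client that can run SQL against Redshift works

conn = psycopg2.connect("host=my-cluster.example.com port=5439 dbname=dev user=admin password=...")
with conn, conn.cursor() as cur:
    # FILLRECORD pads records that end early (e.g. "2,xyz") with NULLs,
    # so the load no longer fails with "Delimiter not found".
    cur.execute("""
        COPY my_table
        FROM 's3://my-bucket/path/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
        DELIMITER ','
        FILLRECORD;
    """)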

Read csv file excluding first column and first line

I have a csv file containing 8 lines and 1777 columns.
I need to read all the contents in MATLAB, excluding the first line and the first column. The first line and first column contain strings and MATLAB can't parse them.
Do you have any ideas?
data = csvread(filepath);
The code above reads all the contents
As suggested, csvread with a range will read in just the numeric data; for example, csvread(filepath, 1, 1) starts at the second row and second column, skipping the first line and first column. If you would like to read in the strings as well (which are presumably column headers), you can use readtable:
t = readtable(filepath);
This will create a table whose variable names are the column headers in your file. This way you can keep the strings associated with the data, if need be.

PostgreSQL: Execute query, write results to CSV file - money datatype gets broken into two columns because of the comma

After running "Execute query, write results to file", the columns in my output file with the money datatype get broken into two columns. E.g. if my revenue is $500 it is displayed correctly, but if my revenue is $1,500.00 there is an issue: it gets broken into two columns, $1 and $500.00.
Can you please help me get my results into a single column in the csv file for the money datatype?
What is this command, "execute query write results to file"? Do you mean COPY? If so, have a look at the FORCE QUOTE option: http://www.postgresql.org/docs/current/static/sql-copy.html
Eg.
COPY yourtable TO '/some/path/and/file.csv' CSV HEADER FORCE QUOTE *;
Note: if the application that is consuming the csv files still fails because of the comma, you can change the delimiter from "," to whatever works for you (e.g. "|").
Additionally, if you do not want CSV but do want TSV, you can omit the CSV HEADER keywords and the results will be output in tab-separated format.
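As a small sketch of the same FORCE QUOTE idea run from Python (psycopg2, the table name, and the output filename are assumptions; the plain COPY statement above does the same job in psql), COPY ... TO STDOUT lets you write the quoted CSV on the client side:

import psycopg2  # assumed client library

conn = psycopg2.connect("dbname=mydb user=postgres")
with conn, conn.cursor() as cur, open("revenue.csv", "w") as out:
    # FORCE_QUOTE * quotes every value, so the comma inside a money
    # value such as $1,500.00 stays within a single CSV field.
    cur.copy_expert(
        "COPY yourtable TO STDOUT WITH (FORMAT csv, HEADER, FORCE_QUOTE *)",
        out,
    )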
The comma is the list separator on the computer in some regions, while in other regions the semicolon is the list separator, so I think you need to replace the comma when you write it to csv.