How to process CSV files with different columns in CDAP (Data Fusion)?

I receive multiple CSV files from third parties (it is hard to make them change the format). These files should all have the same columns, but sometimes one or more columns are missing. I read them with the CDAP File source (as text) and process them in a Wrangler with the following directives:
parse-as-csv :body '\\t' true
cleanse-column-names
The Wrangler then assumes that every file has the same columns as the first one, and it garbles the data of files that have fewer or more columns.
So far I have tried reading the files as blobs with the File source, with the output as bytes, and configuring the Wrangler with these directives:
set-type :body string
parse-as-csv :body '\t' true
cleanse-column-names
But now I get no output at all (and no error), so I have no idea how to parse these non-uniform files. Can CDAP handle this case? If so, how?

You can use the set-column directive to add the missing columns to the files that lack them. More generally, I recommend reading through the documentation for all the directives to see what is available for preprocessing your files.
I hope that helps.
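To illustrate the idea of padding missing columns, here is a minimal sketch in plain Python, outside CDAP. It is not a Wrangler recipe: the schema in EXPECTED_COLUMNS and the function normalize_csv are hypothetical names made up for the example, and it assumes each file still carries its own header row.

```python
import csv
import io

# Hypothetical target schema; replace with the columns the pipeline expects.
EXPECTED_COLUMNS = ["id", "name", "amount"]

def normalize_csv(text, expected=EXPECTED_COLUMNS, delimiter="\t"):
    """Parse one delimited file and return rows that all share the expected columns."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    rows = []
    for record in reader:
        # Columns absent from this file become empty strings; extra columns are dropped.
        rows.append({col: (record.get(col) or "") for col in expected})
    return rows

# Two made-up files: the second one is missing the "name" column.
full = "id\tname\tamount\n1\tfoo\t10\n"
partial = "id\tamount\n2\t20\n"

combined = normalize_csv(full) + normalize_csv(partial)
print(combined)
```

Each file is parsed against its own header, so a missing column in one file cannot shift the values of another.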

Related

Keeping whitespace in csv headers (Matlab)

So I'm reading in a .csv file, and it all works as I want bar one thing. The headers of the data have spaces, which I want later for displaying data to the user. However, these spaces get stripped when the csv file is read in via readtable (as they get used as the variable names). Again, no problem with this per se, but I still need the unmodified strings as well.
Two additional notes:
I'm happy for the strings to be stored separately from the main table if that makes things easier.
The actual .csv file I'm reading in is reasonably large (about 2 million data points), so from a computational cost point of view, the fewer times the file is read the better.
Example read in code:
File = 'example.csv';
Import_Options = detectImportOptions( File, 'NumHeaderLines', 0 );
Data = readtable( File )
Example csv file (example.csv):
"this","is","an","example test"
"1","1","2","3"
"3","1","4","1"
"hot","hot","cold","hot"
You can simply read the first line with fgetl, thus grabbing the headers, before reading the entire file with readtable.
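For comparison, the same idea as a rough Python analogue (not MATLAB code): take the first line as the unmodified header strings, then read the rest as data. The name read_with_original_headers is made up for this sketch, and the sample string stands in for the first lines of example.csv.

```python
import csv
import io

# Stand-in for the first lines of example.csv from the question.
sample = '"this","is","an","example test"\n"1","1","2","3"\n"3","1","4","1"\n'

def read_with_original_headers(stream):
    """Return (original_headers, data_rows) from a single pass over the stream."""
    reader = csv.reader(stream)
    headers = next(reader)   # first line only: the unmodified header strings
    rows = list(reader)      # remaining lines: the data
    return headers, rows

headers, rows = read_with_original_headers(io.StringIO(sample))
print(headers)
```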

converting text file to gps track file

Note: the question has been edited as suggested.
I want to geotag my images
im1.jpg
im2.jpg
Content of Images
I tried the CSV solution, but I am getting the error below.
I have a CSV file, adata.csv:
SourceFile,DateTimeOriginal,GPSLatitude,GPSLongitude,GPSLatitudeRef,GPSLongitudeRef
im1.jpg,1635.387709,52.23829321,10.54680910,52.23829321,10.54680910
im2.jpg,1645.892446,52.23828047,10.54680857,52.23828047,10.54680857
C:\EXIF>exiftool -csv=adata.csv Images
Error:
C:\EXIF>exiftool -csv=adata.csv Images
No SourceFile 'Images/im1.jpg' in imported CSV database
(full path: 'c:/exif/images/im1.jpg')
No SourceFile 'Images/im2.jpg' in imported CSV database
(full path: 'c:/exif/images/im2.jpg')
1 directories scanned
0 image files read
I don't know much about the gpx format but your example doesn't include timestamps, which are required for exiftool to be able to sync between images and the track. Another thing to watch for is the fact that the gpx timestamps are supposed to be in UTC, which may require some work to sync properly, especially if the timestamps in your text file are local time.
Instead, I'd suggest converting your TXT file to a CSV file and using the -csv option. Some simple changes would be required. The first column would need to be changed to filenames, which it looks like would only require adding .jpg to each number in the first column. The column header for the first column would need to be changed to SourceFile. The Time column could be removed, unless you need to add the timestamps to the image files, in which case I'd suggest changing the column header to DateTimeOriginal. The Latitude and Longitude column headers need to be changed to GPSLatitude and GPSLongitude. Finally, because GPS metadata is unsigned, you will need to set the reference tags. Duplicate the GPSLatitude and GPSLongitude columns and change the headers to GPSLatitudeRef and GPSLongitudeRef. This all should be relatively easy in a spreadsheet program such as Excel or LibreOffice.
At that point your new CSV file should look like this:
SourceFile,DateTimeOriginal,GPSLatitude,GPSLongitude,GPSLatitudeRef,GPSLongitudeRef
1.jpg,13:22:05,45.9874167,-76.875233,45.9874167,-76.875233
You could then run this command to write the GPS data:
exiftool -csv=data.csv c:\Images
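If you would rather script the conversion than use a spreadsheet, the steps above can be sketched in Python. The layout of the original TXT file is not shown here, so this assumes a comma-separated table with the headers Number, Time, Latitude and Longitude; the function name to_exiftool_csv is also just an illustration.

```python
import csv
import io

def to_exiftool_csv(txt, out):
    """Rewrite an assumed 'Number,Time,Latitude,Longitude' table for exiftool -csv."""
    reader = csv.DictReader(io.StringIO(txt))
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["SourceFile", "DateTimeOriginal", "GPSLatitude",
                     "GPSLongitude", "GPSLatitudeRef", "GPSLongitudeRef"])
    for row in reader:
        lat, lon = row["Latitude"], row["Longitude"]
        # The Ref columns duplicate the signed values, as described above.
        writer.writerow([row["Number"] + ".jpg", row["Time"], lat, lon, lat, lon])

# Made-up input row matching the example CSV line above.
source = "Number,Time,Latitude,Longitude\n1,13:22:05,45.9874167,-76.875233\n"
result = io.StringIO()
to_exiftool_csv(source, result)
print(result.getvalue())
```

To produce a real file, pass an open file object instead of the StringIO buffer.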

Using the second row of a delimited text file as the header row when importing into Access 2010.

Is it possible to use the values of the second row of a delimited text file (e.g. a csv file) as the header row when importing into Access 2010?
No - the headers have to be in the first line of the imported file. You need to delete the empty first line of data.
If there are too many files for this to be practical, as you imply, you have a couple of options.
Presuming the headers are the same on all of your files to be imported, you could combine all of the text files into one file and import that.
If the headers are different, you could write some code to batch delete the first line from all your files, as is suggested here.
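As a rough idea of what such batch code could look like, here is a small Python sketch. The function names and the "*.csv" pattern are assumptions for illustration, and the demo runs on a temporary folder so no real files are touched. It loads each file fully into memory, which is fine for modest file sizes.

```python
import tempfile
from pathlib import Path

def drop_first_line(path):
    """Rewrite the file at `path` without its first line."""
    lines = path.read_text().splitlines(keepends=True)
    path.write_text("".join(lines[1:]))

def drop_first_line_in_folder(folder, pattern="*.csv"):
    """Apply drop_first_line to every file in `folder` matching `pattern`."""
    for path in sorted(Path(folder).glob(pattern)):
        drop_first_line(path)

# Demonstration on a throwaway folder with one file whose first line is empty.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.csv"
    demo.write_text("\nid,name\n1,foo\n")
    drop_first_line_in_folder(tmp)
    cleaned = demo.read_text()
print(cleaned)
```

Run it on a copy of the files first, since the rewrite is done in place.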

Talend tFileOutputdelimited component - problems with the split .csv files

I tried my luck on the Talend forum and no luck there, so I will try here as well.
I have a job that is reading a large table and then writing the data to .csv files in increments of 25000 rows. What I have noticed is that all .csv files created after the first .csv file have the data loaded all in one row versus the first .csv file that has the data loaded in 25000 rows (as I want it).
Is there a setting on the tFileOutputDelimited component that will make the rows in all subsequent .csv files load as they do in the first (and 'good') .csv file? I am thinking it may be due to what is being used for the 'Escape char' value on the 'Advanced settings' tab but am not sure.
On the tFileOutputDelimited component's 'Basic settings' tab, the CSV Row Separator value is CRLF("\r\n") and the field separator is ",". On the component's 'Advanced settings' tab, the Escape char value is """ and the Text enclosure value also is """.
Also, this is being run in a Windows 7 environment.
Unfortunately, the documentation I found for the tFileOutputDelimited component's 'Advanced settings' tab is lacking with regard to the CSV options.
Below is an example of what is being encountered. As listed below, the first file looks great but all files that follow do not break on the line break and end up placing all of the data on one row versus individual rows.
File #1
header row
row 1
row 2
row 3
...
row 25000
File #2...
header rowrow1row2...row25000
File #3...
header rowrow1row2...row25000
If you need more details, let me know and I'll send them right off. Thank you in advance.
Figured it out. As mentioned in my initial post, the CSV Row Separator had been set to the CRLF("\r\n") option. I changed this to LF("\n") and that addressed the problem. I had looked at the generated Java code and noticed that it was not treating CRLF("\r\n") as one of the default options; only \n and \r were. This pointed me in the direction of trying the \n option.

export mysql database to excel by php

I want to export my database to an Excel file with PHP. I need PHP source code to do this.
I'm not going to write your whole program for you (that's not what this site is about) but if you have a specific problem, feel free to post another question.
It looks like PHP has a built-in function to export an array to a line in a CSV file: fputcsv. So run your query and for each row returned, call fputcsv.
Or just use mysqldump, which claims to natively support dumping a database to CSV.
PLEASE NOTE!
Exporting records to .csv is not the same as exporting records to MS Excel .csv.
First and foremost, the source code is out there. No problem finding it.
The difference with Excel is that, when you separate fields with commas and enclose them in ", Excel escapes quotes (") with an additional quote (so it looks like "").
This means you can't simply use addslashes when trying to export.
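To make the quote-doubling concrete, here is a small Python sketch rather than PHP. An in-memory SQLite database stands in for MySQL, and the table, the data and the function name export_query are all made up for the example. Python's csv module doubles embedded quotes by default, which is the convention described above.

```python
import csv
import io
import sqlite3

# In-memory SQLite stands in for the MySQL database; table and data are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER, label TEXT)")
conn.execute("INSERT INTO products VALUES (1, 'say \"hi\"')")

def export_query(connection, query, out):
    """Write the header and every row returned by `query` as CSV."""
    cursor = connection.execute(query)
    # QUOTE_ALL encloses every field in quotes; embedded quotes are doubled.
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\r\n")
    writer.writerow([col[0] for col in cursor.description])
    writer.writerows(cursor)

buffer = io.StringIO()
export_query(conn, "SELECT id, label FROM products", buffer)
print(buffer.getvalue())
```

The label say "hi" is written as "say ""hi""", with no backslashes involved.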
No harm meant. If you need source code for a CSV export (a lot of code is available at php.net), phpBlocks may be the right tool for you: it exports to CSV without coding, point-and-click like Google's App Inventor.
see: http://www.freegroup.de/software/phpBlocks/demo.html