Talend tFileOutputDelimited component - problems with the split .csv files

I tried my luck on the Talend forum and no luck there, so I will try here as well.
I have a job that is reading a large table and then writing the data to .csv files in increments of 25000 rows. What I have noticed is that all .csv files created after the first .csv file have the data loaded all in one row versus the first .csv file that has the data loaded in 25000 rows (as I want it).
Is there a setting on the tFileOutputDelimited component that will make the rows in all subsequent .csv files get written as they are in the first (and 'good') .csv file? I am thinking it may be due to what is being used for the 'Escape char' value on the 'Advanced settings' tab, but I am not sure.
On the tFileOutputDelimited component's 'Basic settings' tab, the CSV Row Separator value is CRLF("\r\n") and the field separator is ",". On the component's 'Advanced settings' tab, the Escape char value is """ and the Text enclosure value also is """.
Also, this is being run in a Windows 7 environment.
Unfortunately, the documentation I found for the tFileOutputDelimited component's 'Advanced settings' tab is lacking with regard to the CSV options.
Below is an example of what is being encountered. As listed below, the first file looks great but all files that follow do not break on the line break and end up placing all of the data on one row versus individual rows.
File #1
header row
row 1
row 2
row 3
...
row 25000
File #2...
header rowrow1row2...row25000
File #3...
header rowrow1row2...row25000
If you need more details, let me know and I'll send them right off. Thank you in advance.

Figured it out. As mentioned in my initial post, the CSV Row Separator had been set to the CRLF("\r\n") option. I changed this to the LF("\n") option and that fixed the problem. I had looked at the generated Java code and noticed that it was not treating CRLF("\r\n") as one of the default options; only \n and \r were. This pointed me toward trying the \n option.
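To check the result outside Talend, it can help to count the line breaks in each split file. The following is a minimal Python sketch, not Talend's generated code: write_chunks and count_lines are hypothetical helper names, and the file naming (part_1.csv, part_2.csv, ...) is an assumption made for illustration. It writes rows in fixed-size chunks with an explicit LF row separator and then counts the LF bytes per file.

```python
import csv
import os


def write_chunks(rows, header, out_dir, chunk_size=25000, line_ending="\n"):
    """Write rows to numbered CSV files, chunk_size data rows per file.

    Every file starts with the header row. Returns the list of paths written.
    """
    paths = []
    handle = None
    writer = None
    for index, row in enumerate(rows):
        if index % chunk_size == 0:
            if handle is not None:
                handle.close()
            # Hypothetical naming scheme, chosen only for this sketch.
            path = os.path.join(out_dir, f"part_{index // chunk_size + 1}.csv")
            # newline="" stops Python from translating the row separator.
            handle = open(path, "w", newline="", encoding="utf-8")
            writer = csv.writer(handle, lineterminator=line_ending)
            writer.writerow(header)
            paths.append(path)
        writer.writerow(row)
    if handle is not None:
        handle.close()
    return paths


def count_lines(path):
    """Count LF bytes in a file; a 'one long row' file would give 0 or 1."""
    with open(path, "rb") as handle:
        return handle.read().count(b"\n")
```

A file that was split correctly should report chunk_size + 1 line breaks (header plus data rows), while a file with everything on one row will report far fewer.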

Related

Preventing MATLAB's readtable function from ignoring first row of delimited text data file

I have a very similar issue as the following question that was previously asked:
readtable on text file ignores first row which contains the column names
However, in my case, the file is consistently formatted correctly. All values are separated by a single space, including the first row, which consists of the column headers. I've tried switching the spaces to tabs, but this did not fix anything.
I am simply using the following code:
% Get list of file names from current directory and make file name variable
filelist = ls();
filename=filelist(3,1:97);
% create table object using file name
DE_genelst_raw_CntrlMvF = readtable(filename);
And where I should have a table with 6 rows and 5 columns with headers, I get a 6x5 table with the column headers missing. I used the readtable function with a more complex delimited dataset and it correctly included the headers, so I know it should be able to work. I am just not sure what is wrong. If need be, I can provide a copy of the file. Thank you for the help.

Dataprep import dataset does not detect headers in first row automatically

I am importing a dataset from Google Cloud Storage (parameterized) into Dataprep. So far, this has worked perfectly fine, and one of the features that I like is that it auto-detects that the first row in my (application/octet-stream) .csv file contains my headers.
However, today I tried to import a new dataset and it did not detect the headers; instead it auto-assigned column1, column2...
What has changed, and why is this the case? I have checked the auto-detect box and selected UTF-8.
While the auto-detect option is usually pretty good, there are times that it fails for numerous reasons. I've specifically noticed this when the field names contain certain characters (e.g. comma, invisible characters like zero-width-non-joiners, null bytes), or when multiple different styles of newline delimiters are used within the same file.
Another case I saw this is when there were more columns of data than there were headers.
As you already hit on, you can use the following snippet to do mostly the same thing:
rename type: header method: filter sanitize: true
. . . or make separate recipe steps to convert the first row to header and then bulk-rename to your own liking.
More often than not, however, I've found that when auto-detect fails on a previously working file, it tends to be a sign of some sort of issue with the source file. I would look for mismatched data, as well as misplaced commas within the output, as well as comparing the header and some data rows to the original source using a plaintext editor.
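Two of the causes mentioned above (mixed newline styles, and data rows with more columns than the header) can be checked before importing. This is a rough Python sketch, not anything Dataprep itself runs; newline_styles and wide_rows are hypothetical helper names, and the file is assumed to be comma-delimited and UTF-8 encoded. Note that newline_styles also counts line breaks inside quoted fields.

```python
import csv
import io
import re

_STYLE_NAMES = {b"\r\n": "CRLF", b"\r": "CR", b"\n": "LF"}


def newline_styles(raw: bytes):
    """Return the set of line-ending styles present in the raw file bytes."""
    # The alternation tries \r\n first, so a CRLF is never counted as CR + LF.
    return {_STYLE_NAMES[m.group()] for m in re.finditer(rb"\r\n|\r|\n", raw)}


def wide_rows(raw: bytes, encoding="utf-8"):
    """Return record numbers (header = 1) of rows with more fields than the header."""
    reader = csv.reader(io.StringIO(raw.decode(encoding), newline=""))
    header = next(reader, None)
    if header is None:
        return []
    return [n for n, row in enumerate(reader, start=2) if len(row) > len(header)]
```

More than one entry in the result of newline_styles, or a non-empty result from wide_rows, would be a reason to inspect the source file in a plaintext editor first.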
When all else fails, you can try a CSV validator . . . but in my experience they tend to be incredibly opinionated when it comes to the formatting options of the file, so depending on the system generating the CSV, it could either miss errors or give false positives. I have had two experiences where auto-detect failed for no apparent reason on perfectly clean files, so it is possible that the process was just skipped for some reason.
It should also be noted that if you have a structured file that was correctly detected but want to revert it, you can go to the dataset details, select the "..." (More) button, and choose "Remove structure..." (I'm hoping that one day they'll let you do the opposite when you want to add structure to a raw dataset or work around bugs like this!)
Best of luck!
Can be resolved as a transformation within a Flow:
rename type: header method: filter sanitize: true

Using the second row of a delimited text file as the header row when importing into Access 2010.

Is it possible to use the values of the second row of a delimited text file (e.g. a csv file) as the header row when importing into Access 2010?
No - the headers have to be in the first line of the imported file. You need to delete the empty first line of data.
If there are too many files for this to be practical, as you imply, you have a couple of options.
Presuming the headers are the same on all of your files to be imported, you could combine all of the text files into one file and import that.
If the headers are different, you could write some code to batch delete the first line from all your files, as is suggested here.
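As an illustration of the batch approach, here is a minimal Python sketch (not the code from the linked suggestion) that removes the first line from every matching file in a folder. drop_first_line and drop_first_line_in_folder are hypothetical names, and the "*.csv" pattern is an assumption. The files are rewritten in place, so work on copies.

```python
from pathlib import Path


def drop_first_line(path: Path) -> None:
    """Rewrite the file without its first line, leaving other bytes untouched."""
    data = path.read_bytes()
    # Splitting on the first LF also removes a CRLF, since the CR precedes it.
    _, separator, rest = data.partition(b"\n")
    # A file with no line break has only one line, so it ends up empty.
    path.write_bytes(rest if separator else b"")


def drop_first_line_in_folder(folder, pattern="*.csv"):
    """Apply drop_first_line to every matching file; return the names changed."""
    changed = []
    for path in sorted(Path(folder).glob(pattern)):
        drop_first_line(path)
        changed.append(path.name)
    return changed
```

Working on bytes rather than decoded text keeps the original encoding and row separators intact for the Access import.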

Paginate a big text file

I have a big text file. Each line in the file is a record. And I need to parse the text file and show only 20 records in a HTML table at a time. I will have to support sorting as well.
What I am currently doing is read the file line by line based on the parameters start, stop, and page_size which is provided in querystring. It seems to work fine until I have to sort the records, because in order to sort I need to process every line in the text file.
So is there a Unix command with which I can extract a range of lines and sort them? I tried grep, but I do not know it well enough to solve this problem.
Take a look at the pr command. This is what we used to use all the time to paginate big files. You can set the page length, headers, footers, turn on line numbers, etc.
There's probably even a way to munge the output into HTML.
How big is the file?
man sort
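The underlying point is that sorting has to happen before the page is cut, not after. A minimal Python sketch of that order of operations follows; sorted_page is a hypothetical name, and it assumes the file fits in memory. For files too large for that, sorting once into a separate file with the Unix sort command and paging over the sorted copy is the more likely route.

```python
def sorted_page(path, page, page_size=20, key=None, reverse=False):
    """Return one page of records after sorting the whole file.

    page is 1-based; key and reverse are passed through to list.sort.
    """
    with open(path, encoding="utf-8") as handle:
        records = [line.rstrip("\n") for line in handle]
    # Sort the full record list first, then slice out the requested page.
    records.sort(key=key, reverse=reverse)
    start = (page - 1) * page_size
    return records[start:start + page_size]
```

A page past the end of the file simply comes back as an empty list, which the HTML table code can treat as "no more records".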

Is it possible to show the contents of a text file in Crystal Reports

I have a crystal report which contains a list of absolutely referenced text files. There is one text file referenced in each body line.
e.g.
line1 c:\file1.txt
line2 c:\file2.txt
Is there any way to display the contents of these files in Crystal?
i.e. I would like each crystal body line to show the text from the referenced text file.
I'm using Crystal reports 11 with a non-standard database connector (dataflex).
You would need to set up a file dsn (in XP it's under Control Panel/Administrative Tools/Datasources (ODBC)) and then use the file dsn (Microsoft Text Driver) for the datasource as an ODBC(RDO) connection.
I set this test scenario up on mine like the following:
**File 1**
column1
1row1
1row2
1row3
**File 2**
column1
2row1
2row2
2row3
I set up the file dsn to point to the c drive and in the datasource screen I added file1.txt and file2.txt to the selected tables. Then the easiest thing to do is clear the links of the tables so that it pulls every row. It will warn you that there are multiple starting points. I don't generally recommend this, but it will work in this case, and since it's not reporting off a database it probably isn't the end of the world. If you disregard the starting point message and then add the fields to the report, when you run it you should get the following output:
1row1 2row1
1row1 2row2
1row1 2row3
1row2 2row1
1row2 2row2
1row2 2row3
1row3 2row1
1row3 2row2
1row3 2row3
From this you can change your grouping to get the output that you need.
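The nine rows appear because two unlinked tables are combined as a cross join: every row of the first file is paired with every row of the second. A small Python sketch of that pairing, using the test data above (cross_join is a hypothetical name, not a Crystal function):

```python
from itertools import product

# Data rows of the two test files, without the "column1" header row.
file1 = ["1row1", "1row2", "1row3"]
file2 = ["2row1", "2row2", "2row3"]


def cross_join(left, right):
    """Pair every left row with every right row, as unlinked tables do."""
    return [f"{a} {b}" for a, b in product(left, right)]


for line in cross_join(file1, file2):
    print(line)
```

With 3 rows in each file this yields 3 x 3 = 9 lines, which is why the grouping needs to be adjusted afterwards.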
You can also use this same connection with subreports instead of doing this linking: have the main report pull the info from file1.txt and then put a subreport in the report footer that pulls from file2.txt. This option won't have the text collated, but you'd still have it in the same report.
Hope this helps some.
It's easier than you think. I just set up one myself before I wrote this to make sure I was giving you the right steps. Using CR version XI and a .txt file, I followed these steps:
For each text file you want to import, make a subsection in your report (i.e. DetailsA, DetailsB, etc.). If your list of text files is constantly changing (and I don't think it is, based on your description), you'll need another method.
Make sure your text file is comma delimited and the first row contains field names. If these text files are actually text (i.e. not tables), then just put a dummy variable name in the first row so Crystal will see the text as a table of data with just 1 row.
For each text file you want to display, create a new Subreport (Insert->Subreport)
In the database selection menu, go to "Create New Connection"->"Access/Excel (DAO)"
Under 'database type', you'll see a 'text' option at the bottom of the screen.
Choose your file.
Relax! (I'm in a good mood this morning, don't know why)
I guess if you have a function that takes a file name as an argument and returns the contents of that file - you could use that function in a Crystal Report formula.
I am not familiar with the current CR, it has been years since I last used it (I last used version 8). In the versions I did use, such a function was not built in. What you would have to do back then, was to create a UFL (user function library) containing the functions you needed. If I remember correctly, you had to do this using COM.
In this day and age, I guess you can extend CR using some other mechanism, perhaps writing .NET code?
I suggest you search the CR documentation for the term UFL.
Another suggestion, then:
Create a new table FILECONTENTS (filename varchar primary key, contents blob)
Create a script that on a schedule populates this table with the filenames and contents of all the files (assuming that there is a finite number of files, and that you have a way of knowing about them)
Modify the report datasource query to join it with the FILECONTENTS table, and add the contents field to the report.
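The scheduled script in the second step might look roughly like the following. This is a sketch only: it uses SQLite as a stand-in because the actual database here is DataFlex, refresh_file_contents is a hypothetical name, and the list of file paths is assumed to come from wherever the report's file references are stored.

```python
import sqlite3
from pathlib import Path


def refresh_file_contents(conn, paths):
    """Create FILECONTENTS if needed and load each file's bytes into it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS FILECONTENTS ("
        "filename VARCHAR PRIMARY KEY, contents BLOB)"
    )
    for path in paths:
        # INSERT OR REPLACE is SQLite syntax; it refreshes rows on re-runs.
        conn.execute(
            "INSERT OR REPLACE INTO FILECONTENTS (filename, contents) "
            "VALUES (?, ?)",
            (str(path), Path(path).read_bytes()),
        )
    conn.commit()
```

Running it on a schedule keeps the table current, and the report query can then join on the filename column.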
You could set up a file dsn, but this is geared toward tabular file data, not text.
How big are these text files? You want to display the entire contents of each file?
There is probably no easy way to dynamically read in a file from within crystal. You will most likely have to push a dataset to the report which contains the file contents.