Dataprep import dataset does not detect headers in first row automatically - google-cloud-dataprep

I am importing a dataset from Google Cloud Storage (parameterized) into Dataprep. So far this has worked perfectly fine, and one of the features I liked is that it auto-detects that the first row of my (application/octet-stream) .csv file contains my headers.
However, today I tried to import a new dataset and it did not detect the headers; instead it auto-assigned column1, column2...
What has changed, and why is this the case? I have checked the auto-detect box and selected UTF-8:

While the auto-detect option is usually pretty good, there are times when it fails for numerous reasons. I've specifically noticed this when the field names contain certain characters (e.g. commas, invisible characters like zero-width non-joiners, null bytes), or when multiple different styles of newline delimiters are used within the same file.
Another case where I've seen this is when there were more columns of data than there were headers.
As you already hit on, you can use the following snippet to do mostly the same thing:
rename type: header method: filter sanitize: true
. . . or make separate recipe steps to convert the first row to header and then bulk-rename to your own liking.
More often than not, however, I've found that when auto-detect fails on a previously working file, it tends to be a sign of some sort of issue with the source file. I would look for mismatched data and misplaced commas in the output, and compare the header and some data rows against the original source in a plaintext editor.
When all else fails, you can try a CSV validator . . . but in my experience they tend to be incredibly opinionated about the formatting options of the file, so depending on the system generating the CSV, they could either miss errors or give false positives. I have had two experiences where auto-detect failed for no apparent reason on perfectly clean files, so it is possible that the detection was simply skipped for some reason.
It should also be noted that if you have a structured file that was correctly detected but you want to revert it, you can go to the dataset details, select the "..." (More) button, and choose "Remove structure..." (I'm hoping that one day they'll let you do the opposite, for when you want to add structure to a raw dataset or work around bugs like this!)
Best of luck!

This can be resolved with a transformation within a Flow:
rename type: header method: filter sanitize: true


Postgres full text search ignore url

I am trying to use PostgreSQL to implement a full-text search system.
I have encountered this strange, or maybe intended, behavior with it.
While trying to index or search a column which contains file names with extensions (e.g. myimage.jpg), the system treats them as URLs and does not tokenize them properly.
I referred to the documentation and can see via ts_debug that the file name is taken as the host part of a URL.
Could someone tell me how to treat all inputs as normal words in PostgreSQL's full-text search?
Also, as a second request: how can one do contains, startswith, and endswith searches with it?
Update
I have now tried the statement create text search configuration..., copied from pg_catalog.english, removed host, url, and url_path, and then specified that configuration for the ts_debug call. But still no go: myimage.jpg is still identified as a host.
Version
I use version 9.4
tl;dr Look at pre-parsing your input and removing punctuation if you really only want words (and not emails, URLs, hosts, etc).
After trying to figure this out myself, the issue is that you don't seem to be able to easily customise the parser. From my understanding, the parser runs first and generates tokens; those tokens are then matched against dictionaries.
By removing host, url, and url_path from the configuration, all you are doing is making it so that these tokens don't get looked up in a dictionary, resulting in no lexemes from these tokens. Which essentially means that they don't exist in terms of search. Which is not what you want...
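For reference, a rough sketch of what that configuration change looks like, and why it doesn't help (the configuration name english_nourl is just an example):
CREATE TEXT SEARCH CONFIGURATION english_nourl (COPY = pg_catalog.english);
ALTER TEXT SEARCH CONFIGURATION english_nourl
    DROP MAPPING FOR host, url, url_path;

-- The parser still emits a single "host" token for myimage.jpg; it just no
-- longer maps to any dictionary, so no lexeme is produced at all.
SELECT * FROM ts_debug('english_nourl', 'myimage.jpg');
SELECT to_tsvector('english_nourl', 'myimage.jpg');  -- returns an empty tsvector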
Ideally what you need to do is customise the parser so that it doesn't generate those tokens in the first place, or so that it also generates overlapping tokens (similar to how hyphenated words generate a token for the entire word as well as for the individual components). This doesn't seem to be possible at the moment without writing a custom parser.
The only solution to this would be to pre-parse the text to remove the full stop. Note that if you rely on other token types like version (e.g. 8.3.0) or email (e.g. name@domain.com), this will break those. So you may need to be a bit clever about how you remove characters.
select ts_debug('english', replace('this-is-a-file.jpg', '.', ' '));
"(asciihword,"Hyphenated word, all ASCII",this-is-a-file,{english_stem},english_stem,{this-is-a-fil})"
"(hword_asciipart,"Hyphenated word part, all ASCII",this,{english_stem},english_stem,{})"
"(blank,"Space symbols",-,{},,)"
"(hword_asciipart,"Hyphenated word part, all ASCII",is,{english_stem},english_stem,{})"
"(blank,"Space symbols",-,{},,)"
"(hword_asciipart,"Hyphenated word part, all ASCII",a,{english_stem},english_stem,{})"
"(blank,"Space symbols",-,{},,)"
"(hword_asciipart,"Hyphenated word part, all ASCII",file,{english_stem},english_stem,{file})"
"(blank,"Space symbols"," ",{},,)"
"(asciiword,"Word, all ASCII",jpg,{english_stem},english_stem,{jpg})"
In terms of your second question: are you talking about partial word matches? You get this a little bit with stemming when using a config like english, so running becomes run, which will match whether you search for run or running. If you're talking about fuzzy matching, it gets a little more complicated. I suggest reading this article: http://rachbelaid.com/postgres-full-text-search-is-good-enough/
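If you mean matching on the start of a word, tsquery also supports prefix matching; for arbitrary contains/endswith matches on raw text you would normally fall back to ILIKE (optionally backed by the pg_trgm extension). A rough sketch, again with a hypothetical documents table and body column:
-- startswith on a word: prefix matching with :*
SELECT * FROM documents
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'run:*');

-- contains / endswith on the raw text: not really a full-text operation
SELECT * FROM documents WHERE body ILIKE '%image%';
SELECT * FROM documents WHERE body ILIKE '%.jpg';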

Store row numbers which are causing "error"

I have to retrieve certain information from URLs. For this I have to enter text into fields of the URL, using a GET operation, and I have to modify the text to replace spaces with "%20". Sometimes the text (which is taken from the database) is badly formed. I would like to know the row numbers of such rows so that I can manually fix the text in the database and run the job again. I have tried to use the logs and errors section, but with little luck. Does anybody have an idea of how to do this?
First shot: Output bad URLs on the console
So far, I came up with the following job design for your problem:
The trick is to catch the exceptions of the tHttpRequest component and print the necessary details on the console. For this example, I included the line number, the exception message and the URL that produced the exception.
Output (I couldn't reproduce your "Illegal character error", so I took a different one):
Second shot: Output to a file
If you really need to output the line numbers to a file, things get a little more complicated.
Instead of printing the info straight onto the console, we collect all line numbers into a context variable of type (Java) List inside the tJavaFlex. After the usual URL processing (which I have left out of the job design to keep the example small), we iterate over that Java List and save its entries into a tHashOutput, so that we can finally write them to a file.
We cannot write to the file directly in the tLoop section, since the Iterate flow would lead to the tFileOutputDelimited being opened several times. If "Append" were disabled, only the last bad URL line number would end up in the output file. If "Append" were enabled, you would get the full list of line numbers after the very first job run - but you would keep appending on every subsequent run, making the list longer and longer. Workarounds would be to use a runtime-dependent file name (e.g. a timestamp) or to delete the file at the beginning of the job run. I chose a third option, which simply overwrites the file every time the job runs. Feel free to choose whichever of those options suits your use case best.
Details
The tHashOutput/tHashInput components are not visible by default and must be enabled first to show up: https://www.talendforge.org/forum/viewtopic.php?pid=107249#p107249
Context variable:
INIT:
tJavaFlex "catch errors", end code:
tLoop:
tFixedFlowInput "badURL":
tHashOutput:
Needs to have "Append" enabled.

Is it possible to view data as it is being imported in Teradata?

I'm trying to import data from a txt file and keep getting a 'Wrong number of data values in row xxx' error. Looking at the text file, everything looks fine but I can't tell what/how Teradata is interpreting it.
So is there a way to view or preview the data from Teradata's perspective? I tried running a SELECT statement, but since the import doesn't finish, nothing is imported. Which brings me to my next question: is there a way to limit an external-file import to a certain number of rows? Like importing just the first 50 rows from the text file?
May I suggest you obtain a copy of Notepad++ or Sublime Text, both of which are free to download, to view the text file. This will allow you to open the text file and identify what in the records is causing you trouble loading the file. You will be able to display non-printable characters and use advanced search techniques to traverse the files looking for problems with the data.
It is possible there is an embedded carriage return, line feed, or other non-printable character that is being interpreted during the import and generating this error.

Saving a CSV with the following format in iPhone application

In my application I have a simple logbook where the user can save simple posts about an event. The format is like this:
Date, duration (seconds), distance (km), a comment, AND a variable number (0 - 4) of categories, AND a variable number (0 - 4) of circumstances/conditions.
An example would be:
Header of CSV file
Date,Duration,Distance,Comment
Then multiple rows like this
07.02.11,7800,300,"A comment"
07.02.11,7800,300,"A comment"
07.02.11,7800,300,"A comment"
But how can I add the categories and conditions to this format, and how would I know where the categories/conditions end in the CSV if, at a later point, I want to import the file back into the application?
(I do not need help with how to save this to a file, I have already done that, but I could use guidance on how to format it, thank you.)
(This seems pretty odd)
Header of CSV file
Date,Duration,Distance,Comment, Category, Category, Category, Category, Condition, Condition, Condition, Condition
Then multiple rows like this
07.02.11,7800,300,"A comment", "Categoryname", "Categoryname", "Categoryname","Categoryname", "Condtion", "Condition", "Condtion", "Condition"
(Would this be better?)
Header of CSV file
Date,Duration,Distance,Comment, Category, Condition
Then multiple rows like this
07.02.11,7800,300,"A comment", "Multiple category names separted by -", "Multiple condition names separted by -"
I think your last proposal, separating conditions or categories with a special separator symbol (the hyphen in your example), is the right one.
By the way I would suggest two extra things:
use a less common separator, that is, something you can forbid the user from typing without limiting their choices; the hyphen is probably a character you don't want to forbid, so use a different sequence, such as three pipes (|||), which is not common.
if possible (but be careful in this case about the final destination of the CSV file) you can avoid the standard comma separator. The reason is that if a comma is used inside a field's content, that content must be wrapped in double quotes, which is sometimes problematic if you need to do custom parsing in other software. Normally, when I know that my CSV will not be used as an import source for other software (e.g. Numbers or Excel), I prefer a different separator, e.g. a sequence of two hashes (##) or something more "strange". Note that in this case you are no longer strictly CSV compliant! But some software, like OpenOffice, is more flexible with these special formats.
The second solution you are proposing will work, but it practically defeats the purpose of using a standard format, since you will need to parse the categories and conditions yourself instead of relying on a standard CSV parser. Writing your own parser is never good.
I would personally do this differently: instead of trying to put everything in a single file, have two files, one for events (each event has a unique id) and the other for categories/conditions (each category or condition is associated with an event through the event's id, so multiple categories/conditions for a given event appear on multiple lines sharing the same event id). Both files would be standard CSV files.
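For example, the two files could look something like this (all column names and ids are purely illustrative):
events.csv:
EventId,Date,Duration,Distance,Comment
1,07.02.11,7800,300,"A comment"
2,08.02.11,3600,150,"Another comment"

event_attributes.csv:
EventId,Type,Name
1,Category,"Categoryname"
1,Condition,"Condition name"
2,Category,"Another category"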
As an alternative, if you are not tied to CSV for any reason, you might think of using JSON, which allows for a richer set of data types, including arrays, and offers plenty of code that you can reuse. This will not require much change to your code.
Another option, more "canonical" (IMO) but also more expensive in terms of code rewrite, would be using sqlite3.
If I had to choose, I would go for JSON, but I don't know if this is ok for you.

Disabling the PostgreSQL 8.4 tsvector parser's `file` token type

I have some documents that contain sequences such as radio/tested that I would like to have return hits in queries like
select * from doc
where to_tsvector('english',body) @@ to_tsquery('english','radio')
Unfortunately, the default parser takes radio/tested as a file token (despite being in a Windows environment), so it doesn't match the above query. When I run ts_debug on it, that's when I see that it's being recognized as a file, and the lexeme ends up being radio/tested rather than the two lexemes radio and test.
Is there any way to configure the parser not to look for file tokens? I tried
ALTER TEXT SEARCH CONFIGURATION public.english
DROP MAPPING FOR file;
...but it didn't change the output of ts_debug. If there's some way of disabling file, or at least having it recognize both the file and all the words that it thinks make up the directory names along the way, or if there's a way to get it to treat slashes as hyphens or spaces (without the performance hit of regexp_replace-ing them myself), that would be really helpful.
I think the only way to do what you want is to create your own parser :-( Copy wparser_def.c to a new file, remove from the parse tables (actionTPS_Base and the ones following it) the entries that relate to files (TPS_InFileFirst, TPS_InFileNext etc), and you should be set. I think the main difficulty is making the module conform to PostgreSQL's C idiom (PG_FUNCTION_INFO_V1 and so on). Have a look at contrib/test_parser/ for an example.
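If patching the parser is not feasible, the slash-replacement idea you already mention can at least be made cheap at query time by indexing the rewritten expression. A rough sketch along those lines, reusing the doc table and body column from your query:
-- pay the cost of the replace() once, at index/write time
CREATE INDEX doc_body_fts_idx ON doc
    USING gin (to_tsvector('english', replace(body, '/', ' ')));

-- queries must use the exact same expression to benefit from the index
SELECT * FROM doc
WHERE to_tsvector('english', replace(body, '/', ' '))
      @@ to_tsquery('english', 'radio');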