Preparing data for (stanford) Deepdive (ValueError) - postgresql

I started using Stanford-Deepdive a while ago.
I am currently facing the problem, that deepdive will interpret some of the rows he gets as incomplete.
Value Error: Expected 6 attributes, but found 5 in input row:
<Row()>
I already had this problem with another data-set. At this set there were some rows, that contained "\n" within the text. So i removed that and everything went flawlessly.
For my new set of data I am removing "\n", "\t", and any occurence of multiple spaces. Also I replace any empty text value by "EMPTY" - still the error refuses to go away.
Are there any other formatting errors or characters that I need to take care of?
Is my way of approaching this reasonable?

I found the problem. It was caused by a singular TAB (\t) entry. I replaced that by a singe SPACE and in the end it would not be a valid antry anymore
so if you use some text for deepdive you will want to treat etrys consisting of a single SPACE as if they were empty.

Related

Removing spaces from a string using Powershell

I have an issue where extracting data from database it sometimes (quite often) adds spaces in between strings of texts that should not be there.
What I'm trying to do is create a small script that will look at these strings and remove the spaces.
The problem is that the spaces can be in any position in the string, and the string is a variable that changes.
Example:
"StaffID": "0000 25" <- The space in the number should not be there.
Is there a way to have the script look at this particular line, and if it finds spaces, to remove them.
Or:"DateOfBirth": "23-10-199 0" <-It would also need to look at these spaces and remove them.
The problem is that the same data also has lines such as:
"Address": " 91 Broad street" <- The spaces should be here obviously.
I've tried using TRIM, but that only removes spaces from start/end.
Worth mentioning that the data extracted is in json format and is then imported using API into the new system.
You should think about the logic of what you want to do, and whether or not it's programmatically possible to determine if you can teach your script where it is or is not appropriate to put spaces. As it is, this is one of the biggest problems facing AI research right now, so unfortunately you're probably going to have to do this by hand.
If it were me, I'd specify the kind of data format that I expect from each column, and try my best to attempt to parse those strings. For example, if you know that StaffID doesn't contain spaces, you can have a rule that just deletes them:
$staffid = $staffid.replace("\s+",'')
There are some more complicated things that you can do with forced formatting (.replace) that have already been covered in this answer, but again, that requires some expectation of exactly what data is going to come out of what column.
You might want to look more closely at where those spaces are coming from, rather than process the output like this. Is the retrieval script doing it? Maybe you can optimize the database that you're drawing from?

Formatting dates as dd_mm_yyyy in Go gives strange values

So as in the title, I'm trying to format a date in dd_mm_yy format using time.Now().Format("02_01_2006") as shown in this playground session:
http://play.golang.org/p/alAj-OcRZt
First problem, dd_mm_yyyy isn't an acceptable format, only dd_mm_yy is, which is fine I can manipulate the returned string myself.
The problem I have for you is to help me figure out what Go is even trying to do with this input.
You should notice the result you get is:
10_1110009
A good few thousand years off and it's lost the underscore which it only does it for _2. Does this character sequence represent something special here?
Replacing the last underscore with a hyphen or space returns a valid result. dd_mm_yy works fine. Just this particular case seems to completely fly off the handle.
On my local machine (Go playground is on a specific date) the result for today (the 5th) is:
05_01 5016
Which is equally strange, if not moreso as it's substituted in a space which seems to be an ANSIC thing.
This is very likely due to the following bug: https://github.com/golang/go/issues/11334
This has been fixed in Go 1.6beta1
Found an issue from their github:
https://github.com/golang/go/issues/11334
Basically _2 is taking the 2 as the day value from the reference time and then trying to parse the rest (006) which it doesn't recognise so it all goes wrong from there.

Identifying hidden characters in text

I have an ETL process that regularly extracts code from an ODBC data source, manipulates it, and inserts it into my postgres database. One of the columns from this data source regularly has odd characters in it.
For the most part I can catch and convert all of the characters appropriately, but I have one character that exists in the ODBC data source, cannot be brought into postgres (all of the text after that character gets truncated), and I'm having a hard time identifying what the character is.
I can't even insert an example of the character directly into this post because it gets stripped out :/ The closest I can get is a screen shot of the character in textmate (the only application I can actually see the character in):
There character is the diamond between the 1 and 0. When my data comes in, everything after the 0 is truncated.
Is there a good way of identifying what this character is so I can figure out a way of stripping it out?
Per tripleee's comment on the original question post:
To identify the character I grabbed the hex value of the text to identify the hex value of the offending character in question.
There are a number of ways to do this, but the quickest way for me was to use a utility application I have called HexFiend so dump the text into. Once the text was in and I highlighted the character it returned the hex value "00".
A bit more investigation pointed towards the hex null value being used as a line terminator in C applications (which makes sense given the context of my project).
I've fit this null value into my ETL process so that it gets switched out with a new line and now everything is sunshine and daises.
Thanks again for the help!

How to fill a field with spaces until a length in Notepad++

I've prepared a macro in Notepad++ to transform a ldif file in a csv file with a few fields. Everything is OK but I have a final problem: I have to have 2 fields with a specific length and in this moment I cannot ensure that length because in the source file they are not coming so
For instance, I generate this line:
12345,namenamename,123456
And I have to ensure that the 2nd and 3rd fields have 30 (filling with spaces at right side) and 9 (filling with zeros at left) characters, so in this case I should generate:
12345,namenamename ,000123456
I haven't found how Notepad++ could match a pattern in order to add spaces/zeros, so I have though in to add 1 space/zero to the proper field and repeat this step so many times as needed to ensure the lengths (this is, 29 and 8, because they cannot come empty) and search with the length in the regex (for instance: \d{1,8} for the third field)
My question is: can I repeat only one step of the macro several times (and the rest of the macro only 1 repetition)?
I've read the wiki related to this point (http://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=Editing_Configuration_Files#.3CMacros.3E) and I don't found anything neither
If not possible, how could be a good solution? Create another 2 different macros and after execute the main one, execute this new 2 macros several times?
Thanks in advance!
A two pass solution with Notepad++ is possible. Find a pair of characters or two short sequence of characters that never occurs in your data file. I will use =#<= and =>#= here.
First pass, generate or convert the input text into the form 12345,=#<=namenamename______________________________,000000000123456=>#=. Ie add 30 spaces after the name and nine zeroes before the number (underscores used here just to make things clearer).
Second pass, do a regular expression search for =#<=(.{30})_*,0*(\d{9})=>#= and replace with \1,\2.
I have just suggested a similar solution in special timestamp format of csv

Strip excess padding from a string

I asked a question earlier today and got a really quick answer from llbrink. I really should have asked that question before I spent several hours trying to find an answer.
So - here's another question that I have never found an answer for (although I have created a work-around which seems very cludgy).
My AHK program asks the user for a login name. The program then compares the login name with an existing list of names in a file.
The login name in the file may contain spaces, but there are never spaces at the beginning of the name. When the user enters the name, he may include spaces at the beginning. This means that when my program compares the name with those in the file, it can not find a match (because of the extra spaces).
I want to find a way of stripping the spaces from the beginning of the input.
My work-round has been to split the input string into an array (which does ignore leading spaces) and then use the first element of the array. This is my code :
name := DoStrip(name)
DoStrip(xyz) ; strip leading and trailing spaces from string
{
StringSplit, out, xyz, `,, %A_Space%
Return out1
}
This seems to be a very laboured way to do it - is there a better way ?
I don't see a problem with your example if it works on all cases.
There is a much simpler way; just use Autotrim which works like this.
AutoTrim, On ; not required it is on by default
my_variable = %my_variable%
There are also many other different ways to trim string in autohotkey,
which you can combine into something useful.
You can also use #LTrim and #RTrim to remove white spaces at the beginning and at the end of the string.