What is the proper format for uploading a multi-label, multi-class classification dataset with text and label columns in Doccano?

I'd like to upload datasets to my Doccano annotation project, in which the labels have already been set up beforehand as 8 classes with tags.
What is the correct CSV or JSON upload format for a multi-label classification dataset with text and label columns?
For example, I have 8 classes (a, b, c, ..., h).
When I upload the file in this kind of format:
| text | label |
| ------ | --------- |
| text_1 | [a, b] |
| text_2 | [a, b, c] |
| text_3 | [a, c] |
For text_1 I expect it to show only the labels a and b, yet it turns out to show a single label literally named [a, b].
Another example, with a screenshot:
0-7 are my project-defined classes; in this case only the labels with tags 5 and 6 should be marked. However, the upload produces a long list of mixed-up labels.
How do I modify my dataset upload format to fix this?

I found a solution.
There were a lot of mistaken labels in this project: at the beginning I uploaded the label column in the wrong format, as the literal string "[a, b]" (while Doccano requires an actual array), and those labels were stored inside the project. Wrong labels like this mess up subsequent uploads.
My debugging steps:
delete all labels in label management
re-create the labels with tags
re-upload the file in JSON format, and it works
Now the annotation is fine.
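For reference, a minimal sketch of the JSONL format that worked for me; the label names a, b, c here are placeholders and must exactly match the labels defined in the project:

{"text": "text_1", "label": ["a", "b"]}
{"text": "text_2", "label": ["a", "b", "c"]}
{"text": "text_3", "label": ["a", "c"]}

The key point is that label is a real JSON array of strings, not the single string "[a, b]".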


Need an explanation of the kdb/q script to save a partitioned table

I'm trying to understand this code snippet from:
https://code.kx.com/q/kb/loading-from-large-files/
to customize it myself (e.g. partition by hours, minutes, number of ticks, ...):
$ cat fs.q
\d .Q
/ extension of .Q.dpft to separate table name & data
/ and allow append or overwrite
/ pass table data in t, table name in n, : or , in g
k)dpfgnt:{[d;p;f;g;n;t]if[~&/qm'r:+en[d]t;'`unmappable];
{[d;g;t;i;x]#[d;x;g;t[x]i]}[d:par[d;p;n];g;r;<r f]'!r;
#[;f;`p#]#[d;`.d;:;f,r#&~f=r:!r];n}
/ generalization of .Q.dpfnt to auto-partition and save a multi-partition table
/ pass table data in t, table name in n, name of column to partition on in c
k)dcfgnt:{[d;c;f;g;n;t]*p dpfgnt[d;;f;g;n]'?[t;;0b;()]',:'(=;c;)'p:?[;();();c]?[t;();1b;(,c)!,c]}
\d .
r:flip`date`open`high`low`close`volume`sym!("DFFFFIS";",")0:
w:.Q.dcfgnt[`:db;`date;`sym;,;`stats]
.Q.fs[w r#]`:file.csv
But I couldn't find any resources that explain it in detail. For example:
if[~&/qm'r:+en[d]t;'`unmappable];
What does this do with the parameter d?
(Promoting this to an answer as I believe it helps answer the question).
Following on from the comment chain: in order to translate the k code into q code (or simply to understand the k code) you have a few options, none of which are particularly well documented, since documenting them would defeat the purpose of the q language, which is to be the wrapper that obscures the k underneath.
Option 1 is to inspect the built-in functions in the .q namespace
q).q
          | ::
neg       | -:
not       | ~:
null      | ^:
string    | $:
reciprocal| %:
floor     | _:
...
Option 2 is to inspect the q.k script which creates the above namespace (be careful not to edit/change this):
vi $QHOME/q.k
Option 3 is to lookup some of the nuggets of documentation on the code.kx website, for example https://code.kx.com/q/wp/parse-trees/#k4-q-and-qk and https://code.kx.com/q/basics/exposed-infrastructure/#unary-forms
Option 4 is to search the web for reference material on other/similar versions of k, for example k2/k3. They tend to be similar-ish.
A final point to note is that in most of these examples you'll see a colon (:) after the primitives. This colon is required in q/kdb+ to use the monadic form of the primitive (most are heavily overloaded), while in k it is not required to explicitly force the monadic form. This is why where shows as &: in the q reference but will usually just be & in actual k code. A worked example follows.
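To make that concrete, here is a small sketch of decoding a fragment of the snippet above, say ~&/qm'r, in a q session (the outputs are what I'd expect from a recent kdb+; qm itself is an undocumented internal that appears to test whether a column is mappable):

q)k)&1 0 1 1b          / in k, plain & is the monadic "where"
0 2 3
q)where 1 0 1 1b       / the q keyword for the same primitive
0 2 3
q)k)&/1 0 1b           / &/ is "and over", i.e. the q keyword min
0b
q)min 1 0 1b
0b
q)not 0b               / q's not is ~: in k
1b

So ~&/qm'r reads right to left as: apply qm to each column of r, and the results together, then negate; the if then signals `unmappable unless every column passed the check.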

Talend: Equivalent of logstash "key value" filter

I'm discovering Talend Open Source Data Integrator, and I would like to transform my data file into a CSV file.
My data is a set of key-value records, like this example:
A=0 B=3 C=4
A=2 C=4
A=2 B=4
A= B=3 C=1
I want to transform it into a CSV like this one:
A,B,C
0,3,4
2,,4
2,4,
With Logstash, I was using the "key value" (kv) filter, which does this job in a few lines of configuration. But with Talend I can't find a similar transformation. I tried a "delimited file" job and some other jobs, without success.
This is quite tricky and interesting, because Talend is schema-based, so if you don't have the input/output schema predefined, it can be quite hard to achieve what you want.
Here is something you can try. It takes a bunch of components, and I didn't manage to find a solution with fewer. My solution uses unusual components like tNormalize and tPivotToColumnsDelimited. There is one flaw: you'll end up with an extra column.
1 - tFileInputRaw: if you don't know your input schema, just read the whole file with this one.
2 - tConvertType: here you can convert the Object type to String.
3 - tNormalize: you'll have to separate your lines manually (use \n as the separator).
4 - tMap: add a sequence "I"+Numeric.sequence("s1",1,1); this will be used later to identify and regroup lines.
5 - tNormalize: here I normalize on the TAB separator, to get one row for each key=value pair.
6 - tMap: you'll have to split on the "=" sign (see the expression sketch after these steps).
At this step, you'll have output like:
+---+---+-----+
|seq|key|value|
+---+---+-----+
|I1 |A  |1    |
|I1 |B  |2    |
|I1 |C  |3    |
|I2 |A  |2    |
|I2 |C  |4    |
|I3 |A  |2    |
|I3 |B  |4    |
+---+---+-----+
where seq is the line number.
7 - Finally, with tPivotToColumnsDelimited, you'll have the result. Unfortunately, you'll also get an extra "ID" column, as the output schema provided by the pivot component is not editable. (The component actually creates the schema itself, which is very unusual amongst Talend components.)
Use the ID column as the regroup column.
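As promised in step 6, here is a hedged sketch of the kind of Java expressions you might type in that tMap to do the split (the row and column names are hypothetical; adjust them to your flow):

key:   row2.pair.split("=")[0]
value: row2.pair.split("=").length > 1 ? row2.pair.split("=")[1] : ""

The ternary guards against inputs like "A=" where the value is empty, since Java's split drops trailing empty strings.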
Hope this helps. Again, Talend is not a very easy tool if you have dynamic input/output schemas.
Corentin's answer is excellent, but here's an enhanced version of it, which cuts down on some components:
Instead of using tFileInputRaw and tConvertType, I used tFileInputFullRow, which reads the file line by line into a string.
Instead of splitting the string manually (where you need to check for nulls), I used tExtractDelimitedFields with "=" as a separator in order to extract a key and a value from the "key=value" column.
The end result is the same, with an extra column at the beginning.
If you want to delete that column, a dirty hack would be to read the output file using tFileInputFullRow, use a regex like ^[^;]+; in a tReplace to replace everything up to (and including) the first ";" in each line with an empty string, and write the result to another file.
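If you want to sanity-check that regex outside Talend, here is a quick sketch in plain Java (tReplace uses standard Java regexes; the sample line assumes a ";" output delimiter):

String line = "I1;0;3;4";
String cleaned = line.replaceFirst("^[^;]+;", "");  // -> "0;3;4"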

Stata mmerge update replace gives wrong output

I wanted to test what happens if I replace a variable with one of a different data type:
clear
input id x0
1 1
2 13
3 .
end
list
save tabA, replace
clear
input id str5 x0
1 "1"
2 "23"
3 "33"
end
list
save tabB, replace
use tabA, clear
mmerge id using tabB, type(1:1) update replace
list
The result is:
     +--------------------------------------------------+
     | id   x0                                   _merge  |
     |--------------------------------------------------|
  1. |  1    1   in both, master agrees with using data |
  2. |  2   13   in both, master agrees with using data |
  3. |  3    .   in both, master agrees with using data |
     +--------------------------------------------------+
This seems very strange to me: I expected a breakdown or a disagreement. Is this a bug, or am I missing something?
mmerge is user-written (Jeroen Weesie, SSC, 2002).
If you use the official merge in an up-to-date Stata, you will get what you expect.
. merge 1:1 id using tabB, update replace
x0 is str5 in using data
r(106);
I have not looked inside mmerge. My own guess is that what you see is a feature from the author's point of view, namely that it's not a problem if one variable is numeric and the other is string, so long as their contents agree. But why are you not using merge directly? There was a brief period several years ago when mmerge had some advantages over merge, but that's long past. BTW, I agree in wanting my merges to be very conservative and not indulgent about variable types. If you really need the merge to go through, one route is sketched below.
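A hedged sketch of that route: make the types agree up front by converting the string variable with the standard destring command (safe here because every x0 value in tabB is numeric text), then run the official merge:

use tabB, clear
destring x0, replace
save tabB, replace
use tabA, clear
merge 1:1 id using tabB, update replace
list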

Saving next line to .mat

I need to save some data to an existing table. I have column names and one row of data. Now I get a second set of information, and I need to put it in the second row, and so on.
Can you just point me to where I can find this?
This is what I have done so far. I played around with save(..., '-struct', ...) but it doesn't seem to work:
if exist('table.mat','file')
...
...
else
dataCell = [name,trez,score];
colNames = {'Name','R','G','B','Shape'};
uisave({'colNames','dataCell'},'table');
end
So I check whether table.mat exists; if it doesn't, the file is created with some passed values. Now that table.mat exists, I need to add the second set of values without deleting the existing ones.
UPDATE
OK, I changed the code to this:
if exist('table.mat','file')
dataCell = [name,num2cell(trez),num2cell(score)];
save('table.mat', '-append','dataCell');
else
dataCell=[name,num2cell(trez),num2cell(score)];
colNames={'Name','R','G','B','Shape'};
uisave({'colNames','dataCell'},'table');
end
But when I save the data using:
dataCell = [name,num2cell(trez),num2cell(score)];
save('table.mat', '-append','dataCell');
it deletes the old entry. Let's say my table holds the following information:
Name   | R   | G   | B  | Shape
Orange | 239 | 135 | 2  | 0.87
Then if I try to save another entry, like:
Apple  | 100 | 31  | 56 | 0.79
it deletes the Orange. So do I need to add something, or use some other method for this kind of information saving?
The save command can take an -append flag, which allows you to add data to an existing file without overwriting the old data. However, for .mat files, -append only lets you add new variables: if you specify a variable name that already exists in the .mat file, that variable is overwritten.
However, if you are saving to an ASCII file then data is simply appended to the end of the file.
This presents you with two options.
Save using a .mat file, but for each variable you want to save, first load the existing variable of the same name from the .mat file, combine the old data with the new, and then resave it to the file (see the sketch below).
Save the matrices in an ASCII format, and convert them back from ASCII when you load the file.
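A minimal sketch of option 1 using the question's variable names (it assumes dataCell is a cell array with one row per entry):

newRow = [name, num2cell(trez), num2cell(score)];
if exist('table.mat', 'file')
    S = load('table.mat', 'dataCell');         % load the existing rows
    dataCell = [S.dataCell; newRow];           % append the new row
    save('table.mat', '-append', 'dataCell');  % rewrites dataCell, keeps colNames
else
    dataCell = newRow;
    colNames = {'Name','R','G','B','Shape'};
    save('table.mat', 'colNames', 'dataCell');
end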
Update: after re-reading your original question, I have to ask why you don't save everything in a single operation, rather than saving line by line?
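If that is feasible, here is a sketch of the single-save idea (numEntries and the per-entry variables are placeholders for however your data actually arrives):

allRows = {};
for k = 1:numEntries
    % ... obtain name, trez, score for entry k ...
    allRows = [allRows; [name, num2cell(trez), num2cell(score)]]; %#ok<AGROW>
end
colNames = {'Name','R','G','B','Shape'};
save('table.mat', 'colNames', 'allRows');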

Programmatically merge cells in Openoffice

I want to program a script which generates an OpenOffice Calc table.
I have downloaded the package "libooolib-perl" for Debian, and it works well, but I have a problem:
I can't merge cells. I want the headline to look like this:
This is the Head-Line of the Document |
This is subheadline 1 | This is subheadline 2 | This is subheadline 3 |
This is content 1 | This is content 2 | This is content 3 |
This is content 4 | This is content 5 | This is content 6 |
As you can see, the first line is a single cell spanning 3 columns. As far as I know, I cannot achieve this using CSV or another non-binary format, so I need a proper library which can merge cells.
cellSpan does the job!
use OpenOffice::OODoc;
# create a new document containing a table (note: OpenOffice::OODoc, not libooolib-perl)
my $document = odfDocument(file => 'filename.odt', create => 'text');
my $table = $document->appendTable("Table", 4, 3);
# merge: make cell A1 span 3 columns
$document->cellSpan($table, "A1", 3);
$document->cellValue($table, "A1", "This is the Head-Line of the Document");
# (...)
$document->save;
It would appear that the linked Perl module does not support merging cells.
Perhaps the documentation of the OpenOffice document format helps:
http://books.evc-cit.info/oobook/book_onepart.html#merged-spreadsheet-cells-section
It contains code samples, albeit in Python; perhaps you can use that knowledge to implement the missing function in libooolib-perl.
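For orientation, this is roughly what a merged cell looks like inside an ODF file's content.xml, per that documentation: the spanning cell carries a table:number-columns-spanned attribute, and the cells it covers become covered-table-cell elements:

<table:table-row>
  <table:table-cell table:number-columns-spanned="3" office:value-type="string">
    <text:p>This is the Head-Line of the Document</text:p>
  </table:table-cell>
  <table:covered-table-cell/>
  <table:covered-table-cell/>
</table:table-row>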