word2vec toolkit distance script

I'm using the "distance" script to find similar words over a word2vec that I have built. It contains around 1.6M words and was trained by this command:
./word2vec -train processed-text-2016.txt -output vec-cbow-neg.txt -debug 2 -threads 5 -size 300 -window 10 -sample 1e-3 -negative 10 -hs 0 -binary 0 -cbow 1 > w2v-neg.log &
My problem is that when I type any word, I get the following:
Enter word or sentence (EXIT to break): rt
Word: rt Position in vocabulary: 658253
Word Cosine distance
-0.000451 0.494857
356414 0.477918
9 0.441466
83 0.432876
63 0.431347
-0.020525 0.429472
.047345 0.425791
36 0.423420
242 0.418320
... ...
Enter word or sentence (EXIT to break): nd
Word: nd Position in vocabulary: 336527
Word Cosine distance
3 0.494377
489 0.492153
632 0.483827
0.002335 0.462591
0693 0.458801
036869 0.452456
036819 0.447690
31 0.443887
... ...
Enter word or sentence (EXIT to break): and
Word: and Position in vocabulary: 1600843
Word Cosine distance
080852 0.451752
57 0.438413
16577 0.437900
4 0.433538
.005464 0.429279
003131 0.422587
17380 0.420614
9 0.419624
5082 0.419569
0.019322 0.417945
.000435 0.417265
115991 0.414139
... ...
Enter word or sentence (EXIT to break): happy
Word: happy Position in vocabulary: -1
Out of dictionary word!
Enter word or sentence (EXIT to break): man
Word: man Position in vocabulary: 470143
Word Cosine distance
0.055039 0.488181
4793 0.455608
90743 0.454786
060493 0.453180
36 0.451387
6 0.450261
4 0.445118
830 0.442580
490 0.439919
0.025327 0.437766
0.005571 0.436606
0.001964 0.436544
-0.012627 0.434358
... ...
Enter word or sentence (EXIT to break): women
Word: women Position in vocabulary: -1
Out of dictionary word!
Enter word or sentence (EXIT to break): queen
Word: queen Position in vocabulary: -1
If I grep for these words in the model file (a text file), I find them, so I'm not sure why this is happening or how to overcome it. Is it because of noise in the data (I'm debugging this) or in the parameters I used?

The answer is simply that I was using the text format of the model, not the binary format: the distance tool always reads the vectors back as binary floats, so when pointed at a text file it falls out of sync and fragments of the numbers end up as vocabulary entries (which is why the "words" above look like -0.000451 or 356414). Retraining with -binary 1, or converting the file, fixes it.
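If you would rather keep the text vectors than retrain with -binary 1, they can also be queried from Python instead of the C tool; a minimal sketch, assuming the gensim package (not part of the word2vec toolkit) is installed:

from gensim.models import KeyedVectors

# load_word2vec_format understands the word2vec text format directly
vectors = KeyedVectors.load_word2vec_format("vec-cbow-neg.txt", binary=False)

# nearest neighbours by cosine similarity, like the distance tool prints
for word, score in vectors.most_similar("man", topn=10):
    print(word, score)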


Given the dendrogram, is this an anomaly?

I used Scipy to create the following dendrogram:
I used the Levenshtein distance to create a distance matrix with scipy.spatial.distance.pdist, which I then used for clustering using scipy.cluster.hierarchy.ward. This was the output I got after I used scipy.cluster.hierarchy.dendrogram.
The sample of words I used is:
'bistum osnabrück intranet', 'fernbusse kaiserslautern',
'abfalleimer gelber sack', 'crazy factory app', 'angel schwerin',
'mietspiegel oberstaufen', 'sata jet nr 95',
'haare schneiden schere', 'magix deluxe 2013', 'coach bus',
'zwergobst', '+ischia +sorriso', '+sägeblatt +schärfdienst',
'+av +receiver +onkyo +tx +nr646', 'treppenbau aachen',
'ivb nummer', 'elektro hoen saarlouis', 'disponent ausbildung',
'+schokolade +werkzeug', 'bildungsurlaub englisch',
'deutsche lernen b1', 'mietewohnung', 'anwendung von roundup',
'rente nachzahlung', 'klinik am zauberwald',
'beton schutting prijzen', '+vergewaltigung +afrikaner',
'sandstein bremen', 'straubing landshuter hof',
'brandenburgviewer', 'gebetskleidung frauen', 'keepass 2 deutsch',
'emp versand', 'einrichtungshaus münchen',
'+bmw +dachgepäckträger +e91', 'blokker gartenmöbel',
'konto sparkasse kosten', 'navis fürs fahrrad',
'+buffalo +steakhaus', 'autogalerie köhler siegen',
'rennie nebenwirkungen', 'geräte schutzbrief',
'sozialberatung leipzig', 'bomann gspe 649 anleitung',
'klimaschutz bilder', 'maggi zwiebelsuppe',
'zitat für hochzeitskarte', 'kreul schablonen'
Why are 4 (abfalleimer gelber sack), 37 (blokker gartenmöbel), 41 (autogalerie köhler siegen), 44 (sozialberatung leipzig) omitted?
What if the color of the cluster is just black?
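For reference, a minimal sketch of the pipeline described above; the levenshtein() implementation is a stand-in (the post does not show that code), and the condensed distance matrix is built by hand because pdist expects numeric observations:

from itertools import combinations
from scipy.cluster.hierarchy import ward, dendrogram
import matplotlib.pyplot as plt

def levenshtein(a, b):
    # plain dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

words = ['bistum osnabrück intranet', 'fernbusse kaiserslautern',
         'abfalleimer gelber sack', 'crazy factory app']  # extend with the full list above

# condensed distance matrix, in the same ordering pdist would produce
dists = [levenshtein(a, b) for a, b in combinations(words, 2)]
dendrogram(ward(dists), labels=[str(i) for i in range(len(words))])
plt.show()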

How to convert output of Emboss:Palindrome into gff/bed file (perl)

I am sorry to ask this kind of stupid question, but I could not figure it out by myself... I learned Perl a while ago and I am a little lost.
I want to convert this kind of output:
Palindromes of: seq1
Sequence length is: 24
Start at position: 1
End at position: 24
Minimum length of Palindromes is: 6
Maximum length of Palindromes is: 12
Maximum gap between elements is: 6
Number of mismatches allowed in Palindrome: 0
Palindromes:
1 aaaaaaaaaaa 11
|||||||||||
24 ttttttttttt 14
Palindromes of: seq2
Sequence length is: 15
Start at position: 1
End at position: 15
Minimum length of Palindromes is: 6
Maximum length of Palindromes is: 12
Maximum gap between elements is: 6
Number of mismatches allowed in Palindrome: 0
Palindromes:
1 aaaaaac 7
|||||||
15 ttttttg 9
into a GFF or BED file:
seq1 1 24
seq2 1 15
I found a Perl module to do it: https://metacpan.org/pod/Bio::Tools::GFF
This is my little script:
#!/usr/bin/perl
use strict;
use warnings 'all';

use Bio::Tools::EMBOSS::Palindrome;
use Bio::Tools::GFF;

my $filename = "truc.pal";

# a simple script to turn palindrome output into GFF3
my $parser = Bio::Tools::EMBOSS::Palindrome->new(-file => $filename);
my $out    = Bio::Tools::GFF->new(-gff_version => 3,
                                  -file        => ">$filename.gff");

while ( my $seq = $parser->next_seq ) {
    for my $feat ( $seq->get_SeqFeatures ) {
        $out->write_feature($feat);
    }
}
This is the result:
##gff-version 3
seq1 palindrome similarity 14 24 . - 1 allowed_mismatches=0;end=24;maximum gap=6;maximum_length=12;minimum_length=6;seqlength=24;start=1
seq2 palindrome similarity 9 15 . - 1 allowed_mismatches=0;end=15;maximum gap=6;maximum_length=12;minimum_length=6;seqlength=15;start=1
The issue is that the reported start and end cover only the second arm; I want the feature to span the start and end of the whole palindrome, with the gap position and size given in the attributes column.
Example of what I want:
##gff-version 3
seq1 palindrome similarity 1 24 . - 1 mismatches=0;gap_positions=11-14;gap_size=3
seq2 palindrome similarity 1 15 . - 1 mismatches=0;gap_positions=7-9;gap_size=2
Thank you in advance.
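In case a stand-alone parser is easier to adapt than Bio::Tools, here is a rough sketch (in Python rather than Perl) that reads the palindrome blocks directly and prints lines in the desired layout; the gap_positions/gap_size convention simply copies the example above:

import re
import sys

def convert(handle, out=sys.stdout):
    out.write("##gff-version 3\n")
    seq, mismatches, arm1 = None, 0, None
    for line in handle:
        m = re.match(r'Palindromes of:\s+(\S+)', line)
        if m:
            seq, arm1 = m.group(1), None
            continue
        m = re.match(r'Number of mismatches.*:\s*(\d+)', line)
        if m:
            mismatches = int(m.group(1))
            continue
        # an alignment arm looks like "1 aaaaaaaaaaa 11"
        m = re.match(r'\s*(\d+)\s+[acgtn]+\s+(\d+)\s*$', line, re.I)
        if m and seq:
            if arm1 is None:
                arm1 = (int(m.group(1)), int(m.group(2)))        # forward arm: start, end
            else:
                left, right = int(m.group(2)), int(m.group(1))   # reverse arm is printed right to left
                attrs = "mismatches=%d;gap_positions=%d-%d;gap_size=%d" % (
                    mismatches, arm1[1], left, left - arm1[1])
                out.write("\t".join([seq, "palindrome", "similarity",
                                     str(arm1[0]), str(right), ".", "-", "1", attrs]) + "\n")
                arm1 = None

with open("truc.pal") as fh:
    convert(fh)

On the two sequences above this reproduces the desired lines exactly.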

How to extract data set from a text file?

I am quite new to the Unix field and I am currently trying to extract a data set from a text file. I tried sed, grep, and awk, but they seem to work only for extracting single lines, whereas I want to extract an entire data set... Here is an example of a file from which I'd like to extract the two data sets (the figures after the lines "R.Time Intensity"):
[Header]
Application Name LabSolutions
Version 5.87
Data File Name C:\LabSolutions\Data\Antoine\170921_AC_FluoSpectra\069_WT3a derivatized lignin LiCl 430_GPC_FOREVER_430_049.lcd
Output Date 2017-10-12
Output Time 12:07:32
[Configuration]
Instrument Name BOTAN127-Instrument1
Instrument # 1
Line # 1
# of Detectors 3
Detector ID Detector A Detector B PDA
Detector Name Detector A Detector B PDA
# of Channels 1 1 2
[LC Chromatogram(Detector A-Ch1)]
Interval(msec) 500
# of Points 9603
Start Time(min) 0,000
End Time(min) 80,017
Intensity Units mV
Intensity Multiplier 0,001
Ex. Wavelength(nm) 405
Em. Wavelength(nm) 430
R.Time (min) Intensity
0,00000 -709779
0,00833 -709779
0,01667 17
0,02500 3
0,03333 7
0,04167 19
0,05000 9
0,05833 5
0,06667 2
0,07500 24
0,08333 48
[LC Chromatogram(Detector B-Ch1)]
Interval(msec) 500
# of Points 9603
Start Time(min) 0,000
End Time(min) 80,017
Intensity Units mV
Intensity Multiplier 0,001
R.Time (min) Intensity
0,00000 149
0,00833 149
0,01667 -1
I would greatly appreciate any idea. Thanks in advance.
Antoine
awk '/^[^0-9]/&&d{d=0} /R.Time/{d=1}d' file
Brief explanation:
Set d as a flag that determines whether to print the line.
/^[^0-9]/&&d{d=0}: if the regex ^[^0-9] matches and d==1, disable d.
/R.Time/{d=1}: if the string "R.Time" is found, enable d.
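Run against the sample above (saved as file), this prints each R.Time header line together with the numeric lines that follow it, stopping at the next non-numeric line (partial transcript):

$ awk '/^[^0-9]/&&d{d=0} /R.Time/{d=1}d' file
R.Time (min) Intensity
0,00000 -709779
0,00833 -709779
...
0,08333 48
R.Time (min) Intensity
0,00000 149
0,00833 149
0,01667 -1

Note that the header lines themselves are included; pipe through grep -v "R.Time" if only the figures are wanted.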
awk '/R.Time/,/LC/' file|grep -v -E "R.Time|LC"
The grep part removes the R.Time and LC lines that come as part of the output from awk.
I think it's a job for sed.
sed '/R.Time/!d;:A;N;/\n$/!bA' infile
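If a scripting language is an option, the same block extraction is compact in Python as well; a sketch, assuming the sample is saved as file:

datasets = []
with open("file") as fh:
    current = None
    for line in fh:
        if line.startswith("R.Time"):
            current = []              # a new data set starts after each header
            datasets.append(current)
        elif current is not None:
            parts = line.split()
            if len(parts) == 2 and parts[0][:1].isdigit():
                # values use a decimal comma, e.g. "0,00833"
                current.append((float(parts[0].replace(",", ".")), int(parts[1])))
            else:
                current = None        # the block ended at a non-numeric line

for i, ds in enumerate(datasets, 1):
    print("data set", i, "->", len(ds), "points")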

Text file processing in Matlab

I have text output from a program in a fixed format. I need to parse ~200 such files to extract some information. I tried MATLAB's textscan but it did not work. Following is the input:
MOTIFS SUMMARY:
1) TTATAGCCGC (GCGGCTATAA) 1.986
2) AAACCGCCTC (GAGGCGGTTT) 1.865
DETAILED RESULTS:
1) TTATAGCCGC (GCGGCTATAA) 1.986
Matrix: MAT1 TTATAGCCGC
A 0.1249 0.177 0.7364 0.1189 0.7072 0.1149 0.09858 0.1096
C 0.0899 0.07379 0.1136 0.1298 0.08662 0.1293 0.7528 0.721
G 0.06828 0.1284 0.07195 0.1031 0.1352 0.6708 0.05556 0.0713
T 0.7169 0.6209 0.07802 0.6482 0.07096 0.08492 0.09305 0.09804
OCCURRENCES:
>GENE_1 1 TTATAGCCGC 1 561 +
>GENE_2 24 TAATAGCCGC 0.928699 762 -
>GENE_3 10 ATATAGCCGC 0.904905 185 -
>GENE_1 7 TTATAGCAGC 0.901785 726 +
**********
2) AAACCGCCTC (GAGGCGGTTT) 1.865
Matrix: MAT2 AAACCGCCTC
A 0.653 0.7401 0.7763 0.1323 0.09619 0.09134 0.07033 0.1383
C 0.1163 0.07075 0.09441 0.749 0.6347 0.1132 0.6559 0.6982
G 0.09136 0.09402 0.07385 0.04209 0.1799 0.7332 0.1241 0.07568
T 0.1393 0.09518 0.05541 0.07659 0.08921 0.06234 0.1497 0.08786
OCCURRENCES:
>GENE_1 21 AAACCGCCTC 1 963 +
>GENE_2 14 AAACGGCCTC 0.928198 212 +
>GENE_2 8 AAACCGTCTC 0.92009 170 +
>GENE_4 3 TAACCGCCTC 0.918883 370 +
**********
I am trying to count the unique occurrences (in the unique() sense) under each motif, add the count to the MOTIFS SUMMARY, and compute their final average. My expected output is:
MOTIFS SUMMARY:
1) TTATAGCCGC (GCGGCTATAA) 1.986 3
2) AAACCGCCTC (GAGGCGGTTT) 1.865 3
AVERAGE OCCURRENCE: 3
For motif 1, the unique occurrence count is 3 (GENE_1, GENE_2, GENE_3); similarly for motif 2, it is again 3 (GENE_1, GENE_2, GENE_4).
How can I treat OCCURRENCES and ********** as block delimiters, so that I can regexp GENE_x to store and count it?
Kindly help.
Thanks,
AP
You had better try to change the original text file so that it becomes legal MATLAB M-file code, then just use the eval function to run it.
Most of the job will be finding where to insert '=', '[' and ']', and '%' for the parts to ignore.
If all the files are identical in format, it will be easy.
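An alternative to rewriting the file as M-code is to treat the ********** separators as block delimiters and count the unique genes per block. A minimal sketch (in Python rather than MATLAB; the filename motifs.txt is an assumption):

import re

with open("motifs.txt") as fh:
    text = fh.read()

# everything after DETAILED RESULTS:, split at the ********** separators
detailed = text.split("DETAILED RESULTS:", 1)[1]
counts = []
for block in detailed.split("**********"):
    genes = set(re.findall(r">(GENE_\d+)", block))   # unique gene names
    if genes:
        counts.append(len(genes))

for i, n in enumerate(counts, 1):
    print("motif %d unique occurrences: %d" % (i, n))
print("AVERAGE OCCURRENCE:", sum(counts) / len(counts))

On the sample above this prints 3 for each motif and an average of 3.0, matching the expected output.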

AWK - filter file with not equal fields

I've been trying to pull a field from a row in a file, but each row may have two or three fields more or fewer than the next; the rows aren't always equal in the number of fields.
Here is a snippet:
A orarpp 45286124 1 1 0 20 60 Nov 25 9-16:42:32 01:04:58 11176 117056 0 - oracleXXX (LOCAL=NO)
A orarpp 45351560 1 1 3 20 61 Nov 30 5-03:54:42 02:24:48 4804 110684 0 - ora_w002_XXX
A orarpp 45548236 1 1 22 20 71 Nov 26 8-19:36:28 00:56:18 10628 116508 0 - oracleXXX (LOCAL=NO)
A orarpp 45679190 1 1 0 20 60 Nov 28 6-23:42:20 00:37:59 10232 116112 0 - oracleXXX (LOCAL=NO)
A orarpp 45744808 1 1 0 20 60 10:52:19 23:08:12 00:04:58 11740 117620 0 - oracleXXX (LOCAL=NO)
A root 45810380 1 1 0 -- 39 Nov 25 9-19:54:34 00:00:00 448 448 0 - garbage
In the case of the first line, I'm interested in 9-16:42:32 and the similar fields for each row.
I've tried to pull it by using ':' as the field separator and then filtering from there; however, what I'm trying to accomplish is to do something when the number before the dash (9 in the example) is greater than one.
cat file.txt | grep oracle | awk -F: '{print substr($1, length($1)-5)}'
This is because the number of fields on either side of the actual field I need can differ from line to line.
It's definitely not the most efficient approach, but I've been trying to do this with an awk one-liner.
Hints or a direction to get me moving again would be appreciated. I am not opposed to doing it in a better way than awk.
Thanks.
Maybe cut is the right tool for this job? For example, with your snippet:
$ cut -c 62-71 file.txt
9-16:42:32
5-03:54:42
8-19:36:28
6-23:42:20
23:08:12
9-19:54:34
The arguments tell cut to snip columns (-c) 62 through 71.
For additional processing, you can pipe it to awk.
You can also accomplish the whole thing in awk by accepting entire lines and then using substr to extract the columns you want. For example, this awk command produces the same output as the cut command above:
awk '{ print substr($0, 62, 10) }' file.txt
Whether you create a pipeline or do the processing entirely in awk is at least in part a matter of personal taste / style.
Would this do?
awk -F: '/oracle/ {print substr($0,62,10)}' file.txt
9-16:42:32
8-19:36:28
6-23:42:20
23:08:12
This searches for oracle and then prints 10 characters starting from position 62.
You can grab those identifiers with one of
grep -o '[[:digit:]]\+-[[:digit:]]\{2\}:[[:digit:]]\{2\}:[[:digit:]]\{2\}'
grep -oP '\d+-\d\d:\d\d:\d\d' # GNU grep
It sounds like you want to do something with the lines, not just find the ids. Please elaborate.
Using GNU awk:
gawk --re-interval '
    /oracle/ && \
    match($0, /([[:digit:]]+)-([[:digit:]]{2}:){2}[[:digit:]]{2}/, a) && \
    a[1] > 1 {
        # do something with the matching line
        print
    }
' file
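The same filter can also be expressed in a few lines of Python if awk ever feels too cramped; a sketch under the same assumptions (lines mentioning oracle whose day counter before the dash exceeds 1, input saved as file.txt):

import re

# same pattern as the gawk version: days-HH:MM:SS
pattern = re.compile(r'(\d+)-(\d{2}):(\d{2}):(\d{2})')

with open("file.txt") as fh:
    for line in fh:
        m = pattern.search(line)
        if "oracle" in line and m and int(m.group(1)) > 1:
            print(line, end="")   # do something with the matching line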