Tokenizer in moses-SMT system stuck even with 10 sentences - perl

I was trying to build a baseline MT system. Just to check how it works, I made source (S) and target (T) language corpora of only 2000 sentences. The very first step is to prepare the data for the machine translation (MT) system. In this step we first have to perform tokenization, as described in the Baseline SMT tutorial. I used this code:
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
< ~/corpus/training/news-commentary-v8.fr-en.en \
> ~/corpus/news-commentary-v8.fr-en.tok.en
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr \
< ~/corpus/training/news-commentary-v8.fr-en.fr \
> ~/corpus/news-commentary-v8.fr-en.tok.fr
( say S = French & T = English)
I checked after 2 hours and it was still running, which I did not expect. Then I tried with just ten sentences. To my surprise, it has been 30 minutes and it is still running.
Did I do anything wrong?
PS: OS = Ubuntu 14.04.5 LTS
Sony ultrabook
No dual boot.

Please follow the steps below:
git clone https://github.com/moses-smt/mosesdecoder.git
cd mosesdecoder
git clone https://github.com/moses-smt/giza-pp.git
cd giza-pp
make
cd ..
mkdir tools
cp giza-pp/GIZA++-v2/GIZA++ giza-pp/GIZA++-v2/snt2cooc.out giza-pp/mkcls-v2/mkcls tools
scripts/tokenizer/tokenizer.perl -l fr < ~/corpus/training/news-commentary-v8.fr-en.fr > ~/corpus/news-commentary-v8.fr-en.tok.fr
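If the tokenizer still seems to hang, a quick sanity check (a minimal sketch; the sentence is arbitrary) is to pipe a single line through it and confirm output appears immediately:
echo "This is a test sentence." | ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en
If that returns instantly, the tokenizer script itself is fine and the problem is more likely with the input file or how the job was started.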

Related

How to fix the sdk_addon data copy build error?

I tried to build the Android SDK Addon System Image of halogenOS 13 (AOSP 13, nothing was changed in the AOSP source code yet).
The steps to build are as usual:
source build/envsetup.sh
lunch aosp_sdk_phone_x86_64-eng
m sdk_addon
At 100%, the build fails with following error:
[100% 12483/12483] Packaging SDK Addon System-Image: out/host/linux-x86/sdk_addon/custom-eng.simao--
FAILED: out/host/linux-x86/sdk_addon/custom-eng.simao--img.zip
/bin/bash -c "(cp -R out/target/product/emulator_x86_64/data out/host/linux-x86/obj/SDK_ADDON/custom_intermediates/custom-eng.simao--img/images/x86_64/data ) && (out/host/linux-x86/bin/soong_zip -o out/host/linux-x86/sdk_addon/custom-eng.simao--img.zip -C out/host/linux-x86/obj/SDK_ADDON/custom_intermediates/custom-eng.simao--img/images/ -D out/host/linux-x86/obj/SDK_ADDON/custom_intermediates/custom-eng.simao--img/images/x86_64 )"
cp: bad 'out/target/product/emulator_x86_64/data': No such file or directory
13:17:13 ninja failed with: exit status 1
If I just mkdir out/target/product/emulator_x86_64/data, that of course fixes the build error, but the SDK addon does not actually boot in the emulator due to encryption issues with the userdata partition, so I think this is related. This makes me guess that there should be some files in the data directory, but for some reason they are not created.
EDIT:
What's really odd here is that the file device/generic/goldfish/vendor.mk explicitly adds some data files to PRODUCT_COPY_FILES, notably:
PRODUCT_COPY_FILES += \
device/generic/goldfish/data/etc/dtb.img:dtb.img \
device/generic/goldfish/emulator-info.txt:data/misc/emulator/version.txt \
device/generic/goldfish/data/etc/apns-conf.xml:data/misc/apns/apns-conf.xml \
device/generic/goldfish/radio/RadioConfig/radioconfig.xml:data/misc/emulator/config/radioconfig.xml \
device/generic/goldfish/data/etc/iccprofile_for_sim0.xml:data/misc/modem_simulator/iccprofile_for_sim0.xml \
If I manually build them using, for example, m out/target/product/emulator_x86_64/data/misc/emulator/version.txt, the file is created at the correct location, as expected. This leads me to wonder when entries in PRODUCT_COPY_FILES are considered targets to be built and when they are not.
EDIT2:
I got the emulator to boot, but the data directory is still not being created. (Creating it manually, or building one target in the data dir, is a workaround.)
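For reference, the workaround described above (just a sketch of the manual steps, assuming the emulator_x86_64 product) is to create the directory and build one of the copy targets before packaging:
mkdir -p out/target/product/emulator_x86_64/data
m out/target/product/emulator_x86_64/data/misc/emulator/version.txt
m sdk_addon
This only papers over the missing-directory error; it does not explain why the data files are not generated during the normal build.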

TYPO3 9 LTS Version 9.5.5 - Could not access remote resource https://repositories.typo3.org/mirrors.xml.gz

I am working with the current version TYPO3 9.5.5, with PHP version 7.3.2 and XAMPP 3.2.3 on Windows 7. In the backend interface under ADMIN TOOLS -> Extensions -> Get preconfigured distributions I always get the error "Could not access remote resource https://repositories.typo3.org/mirrors.xml.gz.". I have tried the instructions at https://www.pagemachine.de/blog/wie-ihr-typo3-8-0-als-locales-testsystem-unter-windows-installer-unser-tutorial/#div-comment-4718 but unfortunately they did not work. Does anyone have a solution?
[Edit] I would like to upload the whole php.ini file, but unfortunately I have not found a way to do this. It is not possible to put the entire content of php.ini here because of the 30000 character limit. I could show just the uncommented lines, but that would not be nice either.
A part of the C:\xampp\php\php.ini:
extension_dir="C:\xampp\php\ext"
;...
;...
;...
; When the extension library to load is not located in the default extension
; directory, You may specify an absolute path to the library file:
;
; extension=/path/to/extension/mysqli.so
;
; Note : The syntax used in previous PHP versions ('extension=<ext>.so' and
; 'extension='php_<ext>.dll') is supported for legacy reasons and may be
; deprecated in a future PHP major version. So, when it is possible, please
; move to the new ('extension=<ext>) syntax.
;
; Notes for Windows environments :
;
; - Many DLL files are located in the extensions/ (PHP 4) or ext/ (PHP 5+)
; extension folders as well as the separate PECL DLL download (PHP 5+).
; Be sure to appropriately set the extension_dir directive.
;
extension=bz2
extension=curl
You need to activate curl in your XAMPP installation. Maybe this Stack Overflow post can help you with that.
If you have Windows 10 Professional I can highly recommend DDEV-Local which gives you a solid base for local development. It should also work with Windows 10 Home.
In case you need help with DDEV, the #DDEV channel in TYPO3 Slack is very helpful.
I solved the problem by
(1) adding the line extension=php_curl.dll in the module settings part of php.ini, and
(2) updating the CA bundle: the file curl-ca-bundle.crt referenced in the line ; curl.cainfo = "C:\xampp\apache\bin\curl-ca-bundle.crt" does not exist in the XAMPP directory (including subdirectories). Therefore, download the file cacert.pem explicitly (just google it), put it in the directory C:\xampp, and replace the line ; curl.cainfo = "C:\xampp\apache\bin\curl-ca-bundle.crt" with curl.cainfo = "C:\xampp\cacert.pem".
(3) Also replace the line openssl.cafile = "C:\xampp\apache\bin\curl-ca-bundle.crt" with openssl.cafile = "C:\xampp\cacert.pem". It should work then.
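To confirm that the curl extension is actually loaded after editing php.ini (a quick check, assuming the XAMPP PHP binary is on your PATH), you can run:
php -m | findstr /i curl
php -r "var_dump(extension_loaded('curl'));"
If the second command prints bool(true), any remaining mirror errors usually come down to the CA bundle paths described above.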

Buildroot Config Option for applying custom patch

I am new to Buildroot and am trying to build Linaro with Buildroot. I have multiple fragment kernel config files and have specified them in the Buildroot defconfig.
I have specified a custom kernel patches directory with BR2_LINUX_PATCH_DIR.
Some of the config flags that are supposed to be set in the resulting .config files are missing, so I suspect the patches are not being applied. I also tried giving a non-existing location as the Linux patch dir and it does not give any error.
Is anything needed other than giving a value to BR2_LINUX_PATCH_DIR, and what should the format of the directory structure be? The Buildroot manual says it should be package_name/patch_name. For Linux, what should the package name be? Should it be the same name as the directory the kernel is built in, which for me is linux-custom?
Please suggest and guide me on this.
Thanks in advance.
The option is named BR2_LINUX_KERNEL_PATCH; there is nothing named BR2_LINUX_PATCH_DIR. It applies all patches listed in this option (if those are files), or all files named *.patch if what is given in this option is a directory. See the code in linux/linux.mk:
define LINUX_APPLY_LOCAL_PATCHES
	for p in $(filter-out ftp://% http://% https://%,$(LINUX_PATCHES)) ; do \
		if test -d $$p ; then \
			$(APPLY_PATCHES) $(@D) $$p \*.patch || exit 1 ; \
		else \
			$(APPLY_PATCHES) $(@D) `dirname $$p` `basename $$p` || exit 1; \
		fi \
	done
endef
Also, I would recommend that you watch the output of Buildroot: it shows everything it is doing, and in particular it lists the patches it applies. Look for the line >>> linux ... Patching, which marks the beginning of the patching step of the linux package.
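As a minimal sketch (the board path is hypothetical), the corresponding lines in a Buildroot defconfig would point the option at a directory containing the patches; only files ending in .patch in that directory are applied:
BR2_LINUX_KERNEL_PATCH="board/mycompany/myboard/patches/linux"
# e.g. board/mycompany/myboard/patches/linux/0001-my-fix.patch
# e.g. board/mycompany/myboard/patches/linux/0002-another-fix.patch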

Can you get the number of lines of code from a GitHub repository?

In a GitHub repository you can see “language statistics”, which displays the percentage of the project that’s written in a language. It doesn’t, however, display how many lines of code the project consists of. Often, I want to quickly get an impression of the scale and complexity of a project, and the count of lines of code can give a good first impression. 500 lines of code implies a relatively simple project, 100,000 lines of code implies a very large/complicated project.
So, is it possible to get the lines of code written in the various languages from a GitHub repository, preferably without cloning it?
The question “Count number of lines in a git repository” asks how to count the lines of code in a local Git repository, but:
You have to clone the project, which could be massive. Cloning a project like Wine, for example, takes ages.
You would count lines in files that wouldn’t necessarily be code, like i13n files.
If you count just (for example) Ruby files, you'd potentially miss massive amounts of code in other languages, like JavaScript. You'd have to know beforehand which languages the project uses, and you'd also have to repeat the count for every language the project uses.
All in all, this is potentially far too time-intensive for “quickly checking the scale of a project”.
You can run something like
git ls-files | xargs wc -l
which will give you the total count.
You can also add more filtering, for example looking only at the JavaScript files:
git ls-files | grep '\.js$' | xargs wc -l
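If, as the question notes, you want to skip files that are not really code (a sketch; the vendor/ directory is just an example), a git pathspec can exclude paths before counting:
git ls-files -- . ':(exclude)vendor/*' | xargs wc -l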
Or use this handy little tool: https://line-count.herokuapp.com/
A shell script, cloc-git
You can use this shell script to count the number of lines in a remote Git repository with one command:
#!/usr/bin/env bash
git clone --depth 1 "$1" temp-linecount-repo &&
printf "('temp-linecount-repo' will be deleted automatically)\n\n\n" &&
cloc temp-linecount-repo &&
rm -rf temp-linecount-repo
Installation
This script requires CLOC (“Count Lines of Code”) to be installed. cloc can probably be installed with your package manager – for example, brew install cloc with Homebrew. There is also a docker image published under mribeiro/cloc.
You can install the script by saving its code to a file cloc-git, running chmod +x cloc-git, and then moving the file to a folder in your $PATH such as /usr/local/bin.
Usage
The script takes one argument, which is any URL that git clone will accept. Examples are https://github.com/evalEmpire/perl5i.git (HTTPS) or git@github.com:evalEmpire/perl5i.git (SSH). You can get this URL from any GitHub project page by clicking “Clone or download”.
Example output:
$ cloc-git https://github.com/evalEmpire/perl5i.git
Cloning into 'temp-linecount-repo'...
remote: Counting objects: 200, done.
remote: Compressing objects: 100% (182/182), done.
remote: Total 200 (delta 13), reused 158 (delta 9), pack-reused 0
Receiving objects: 100% (200/200), 296.52 KiB | 110.00 KiB/s, done.
Resolving deltas: 100% (13/13), done.
Checking connectivity... done.
('temp-linecount-repo' will be deleted automatically)
171 text files.
166 unique files.
17 files ignored.
http://cloc.sourceforge.net v 1.62 T=1.13 s (134.1 files/s, 9764.6 lines/s)
-------------------------------------------------------------------------------
Language files blank comment code
-------------------------------------------------------------------------------
Perl 149 2795 1425 6382
JSON 1 0 0 270
YAML 2 0 0 198
-------------------------------------------------------------------------------
SUM: 152 2795 1425 6850
-------------------------------------------------------------------------------
Alternatives
Run the commands manually
If you don’t want to bother saving and installing the shell script, you can run the commands manually. An example:
$ git clone --depth 1 https://github.com/evalEmpire/perl5i.git
$ cloc perl5i
$ rm -rf perl5i
Linguist
If you want the results to match GitHub’s language percentages exactly, you can try installing Linguist instead of CLOC. According to its README, you need to gem install linguist and then run linguist. I couldn’t get it to work (issue #2223).
I created an extension for the Google Chrome browser, GLOC, which works for public and private repos.
Counts the number of lines of code of a project from:
project detail page
user's repositories
organization page
search results page
trending page
explore page
If you go to the graphs/contributors page, you can see a list of all the contributors to the repo and how many lines they've added and removed.
Unless I'm missing something, subtracting the aggregate number of lines deleted from the aggregate number of lines added among all contributors should yield the total number of lines of code in the repo. (EDIT: it turns out I was missing something after all. Take a look at orbitbot's comment for details.)
UPDATE:
This data is also available in GitHub's API. So I wrote a quick script to fetch the data and do the calculation:
'use strict';
async function countGithub(repo) {
const response = await fetch(`https://api.github.com/repos/${repo}/stats/contributors`)
const contributors = await response.json();
const lineCounts = contributors.map(contributor => (
contributor.weeks.reduce((lineCount, week) => lineCount + week.a - week.d, 0)
));
const lines = lineCounts.reduce((lineTotal, lineCount) => lineTotal + lineCount);
window.alert(lines);
}
countGithub('jquery/jquery'); // or count anything you like
Just paste it in a Chrome DevTools snippet, change the repo and click run.
Disclaimer (thanks to lovasoa):
Take the results of this method with a grain of salt, because for some repos (sorich87/bootstrap-tour) it results in negative values, which might indicate there's something wrong with the data returned from GitHub's API.
UPDATE:
Looks like this method to calculate total line numbers isn't entirely reliable. Take a look at orbitbot's comment for details.
You can clone just the latest commit using git clone --depth 1 <url> and then perform your own analysis using Linguist, the same software Github uses. That's the only way I know you're going to get lines of code.
Another option is to use the API to list the languages the project uses. It doesn't give them in lines but in bytes. For example...
$ curl https://api.github.com/repos/evalEmpire/perl5i/languages
{
"Perl": 274835
}
Though take that with a grain of salt: that project includes YAML and JSON, which the web site acknowledges but the API does not.
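If you just want a single rough number from that endpoint (a sketch assuming jq is installed; byte counts are only a proxy for lines), you can sum the per-language byte counts:
curl -s https://api.github.com/repos/evalEmpire/perl5i/languages | jq 'add'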
Finally, you can use code search to ask which files match a given language. This example asks which files in perl5i are Perl. https://api.github.com/search/code?q=language:perl+repo:evalEmpire/perl5i. It will not give you lines, and you have to ask for the file size separately using the returned url for each file.
Not currently possible on GitHub.com or their APIs
I have talked to customer support and confirmed that this cannot be done on github.com. They have passed the suggestion along to the GitHub team, though, so hopefully it will be possible in the future. If so, I'll be sure to edit this answer.
Meanwhile, Rory O'Kane's answer is a brilliant alternative based on cloc and a shallow repo clone.
From @Tgr's comment, there is an online tool:
https://codetabs.com/count-loc/count-loc-online.html
You can use tokei:
cargo install tokei
git clone --depth 1 https://github.com/XAMPPRocky/tokei
tokei tokei/
Output:
===============================================================================
Language Files Lines Code Comments Blanks
===============================================================================
BASH 4 48 30 10 8
JSON 1 1430 1430 0 0
Shell 1 49 38 1 10
TOML 2 78 65 4 9
-------------------------------------------------------------------------------
Markdown 4 1410 0 1121 289
|- JSON 1 41 41 0 0
|- Rust 1 47 38 5 4
|- Shell 1 19 16 0 3
(Total) 1517 95 1126 296
-------------------------------------------------------------------------------
Rust 19 3750 3123 119 508
|- Markdown 12 358 5 302 51
(Total) 4108 3128 421 559
===============================================================================
Total 31 6765 4686 1255 824
===============================================================================
Tokei has support for badges:
Count Lines
[![](https://tokei.rs/b1/github/XAMPPRocky/tokei)](https://github.com/XAMPPRocky/tokei)
By default the badge will show the repo's LoC (lines of code). You can also tell it to show a different category by using the ?category= query string. It can be either code, blanks, files, lines, or comments.
Count Files
[![](https://tokei.rs/b1/github/XAMPPRocky/tokei?category=files)](https://github.com/XAMPPRocky/tokei)
You can use the GitHub API to get the SLOC, like in the following function:
function getSloc(repo, tries) {
//repo is the repo's path
if (!repo) {
return Promise.reject(new Error("No repo provided"));
}
//GitHub's API may return an empty object the first time it is accessed
//We can try several times then stop
if (tries === 0) {
return Promise.reject(new Error("Too many tries"));
}
let url = "https://api.github.com/repos" + repo + "/stats/code_frequency";
return fetch(url)
.then(x => x.json())
.then(x => x.reduce((total, changes) => total + changes[1] + changes[2], 0))
.catch(err => getSloc(repo, tries - 1));
}
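The same stats/code_frequency endpoint can also be queried directly from a shell (a sketch assuming jq is installed; as the comments above note, the API may return an empty result on the first request while GitHub computes the stats):
curl -s "https://api.github.com/repos/jquery/jquery/stats/code_frequency" | jq 'map(.[1] + .[2]) | add'
Each entry is [timestamp, additions, deletions] with deletions negative, so summing the second and third fields gives the net line count.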
Personally, I made a Chrome extension which shows the number of SLOC on both the GitHub project list and project detail pages. You can also set your personal access token to access private repositories and bypass the API rate limit.
You can download it from here: https://chrome.google.com/webstore/detail/github-sloc/fkjjjamhihnjmihibcmdnianbcbccpnn
The source code is available here: https://github.com/martianyi/github-sloc
Hey all this is ridiculously easy...
Create a new branch from your first commit
When you want to find out your stats, create a new PR from main
The PR will show you the number of changed lines - as you're doing a PR from the first commit all your code will be counted as new lines
And the added benefit is that if you don't approve the PR and just leave it in place, the stats (number of commits, files changed, and total lines of code) will simply keep up to date as you merge changes into main. :) Enjoy.
Firefox add-on Github SLOC
I wrote a small firefox addon that prints the number of lines of code on github project pages: Github SLOC
npm install sloc -g
git clone --depth 1 https://github.com/vuejs/vue/
sloc ".\vue\src" --format cli-table
rm -rf ".\vue\"
Instructions and Explanation
Install sloc from npm, a command line tool (Node.js needs to be installed).
npm install sloc -g
Clone shallow repository (faster download than full clone).
git clone --depth 1 https://github.com/facebook/react/
Run sloc and specify the path that should be analyzed.
sloc ".\react\src" --format cli-table
sloc supports formatting the output as a cli-table, as json or csv. Regular expressions can be used to exclude files and folders (Further information on npm).
Delete repository folder (optional)
Powershell: rm -r -force ".\react\" or on Mac/Unix: rm -rf ".\react\"
It is also possible to get details for every file with the --details option:
sloc ".\react\src" --format cli-table --details
Open a terminal and run the following:
curl -L "https://api.codetabs.com/v1/loc?github=username/reponame"
If the question is "can you quickly get the NUMBER OF LINES of a GitHub repo", the answer is no, as stated by the other answers.
However, if the question is "can you quickly check the SCALE of a project", I usually gauge a project by looking at its size. Of course the size will include deltas from all active commits, but it is a good metric, as the order of magnitude is usually quite close.
E.g.
How big is the "docker" project?
In your browser, enter api.github.com/repos/ORG_NAME/PROJECT_NAME
e.g. api.github.com/repos/docker/docker
In the response hash, you can find the size attribute:
{
...
size: 161432,
...
}
This should give you an idea of the relative scale of the project. The number seems to be in KB, but when I checked it on my computer it's actually smaller, even though the order of magnitude is consistent. (161432KB = 161MB, du -s -h docker = 65MB)
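From the command line the same field can be extracted directly (a sketch assuming jq is available):
curl -s https://api.github.com/repos/docker/docker | jq '.size'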
Pipe the per-file line counts to sort to organize the files by line count:
git ls-files | xargs wc -l | sort -n
This is easy if you are using VS Code and you clone the project first. Just install the Lines of Code (LOC) VS Code extension and then run LineCount: Count Workspace Files from the Command Palette.
The extension shows summary statistics by file type, and it also outputs result files with detailed information for each folder.
There is another online tool that counts lines of code for public and private repos without having to clone/download them - https://klock.herokuapp.com/
None of the answers here satisfied my requirements. I only wanted to use existing utilities. The following scripts use basic utilities:
Git
GNU or BSD awk
GNU or BSD sed
Bash
Get total lines added to a repository (subtracts lines deleted from lines added).
#!/bin/bash
git diff --shortstat 4b825dc642cb6eb9a060e54bf8d69288fbee4904 HEAD | \
sed 's/[^0-9,]*//g' | \
awk -F, '!($2 > 0) {$2="0"};!($3 > 0) {$3="0"}; {print $2-$3}'
Get lines of code filtered by specific file types of known source code (e.g. *.py files; add more extensions as needed).
#!/bin/bash
git diff --shortstat 4b825dc642cb6eb9a060e54bf8d69288fbee4904 HEAD -- *.{py,java,js} | \
sed 's/[^0-9,]*//g' | \
awk -F, '!($2 > 0) {$2="0"};!($3 > 0) {$3="0"}; {print $2-$3}'
4b825dc642cb6eb9a060e54bf8d69288fbee4904 is the id of the "empty tree" in Git and it's always available in every repository.
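If you prefer not to hard-code that hash (a small aside; this should print the same id in any repository), Git can derive it for you:
git hash-object -t tree /dev/null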
Sources:
My own scripting
How to get Git diff of the first commit?
Is there a way of having git show lines added, lines changed and lines removed?
shields.io has a badge that can count up all the lines for you here; for example, it can be pointed at the Raycast extensions repo.
You can use Sourcegraph, an open source search engine for code. It can connect to your GitHub account and index the content, and then in the admin section you can see the number of lines of code indexed.
I made an NPM package specifically for this use case. It provides a CLI tool that takes the directory path and the folders/files to ignore.
It goes like this:
npm i -g @quasimodo147/countlines
to get the $ countlines command in your terminal.
Then you can do
countlines . node_modules build dist

Magento: upgrade pre 1.6 version to most recent one

I've seen a lot of questions about upgrading pre-1.6 Magento installations to the most recent version (at the current moment 1.7.0.2), but there are a lot of answers that don't work for everybody.
So below is the answer to the question:
How to upgrade Magento from a pre-1.6 installation to the most recent one.
There are a lot of suggested procedures and not all of them work. This one has worked for me across many versions, from as far back as 1.3 up to 1.7.
Please add comments with solutions to problems you're experiencing; I can update the answer so other people get help from this topic too!
What you need:
- SUDO rights/root account on your server.
- The Linux package 'nohup'
- Make sure NOBODY can trigger index.php. If your version supports maintenance.flag, put an empty maintenance.flag file in your Magento root.
Walkthrough
1) Download the latest Magento. Overwrite: ./download/* ./lib/* ./mage
2) Run these steps from your Magento root as a SUDOer (if you're not root, prefix all the commands with 'sudo'):
find . -type f -exec chmod 644 {} \;
find . -type d -exec chmod 755 {} \;
chmod -R 777 ./var
chmod 550 mage
3) Go to your Magento root folder and type:
./mage list-upgrades
./mage config-set preferred_state stable
./mage upgrade-all --force
./mage install http://connect20.magentocommerce.com/community Mage_All_Latest --force
4) Now for the last step. Note: in some situations this process can take 8+ hours!
nohup php -f ./index.php
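Since nohup detaches the process from your terminal (a sketch; the log file name is arbitrary), it can be convenient to redirect the output and watch progress from another session:
nohup php -f ./index.php > upgrade.log 2>&1 &
tail -f upgrade.log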
Known issues
1) It's possible that your update gets stuck in a loop. To find this loop, enable debugging. Edit /lib/Varien/Db/Adapter/Pdo/Mysql.php (around lines 112-113):
protected $_debug = true;
protected $_debuglogeverything = true;
This will write a debug log to [my_magento_base_folder]/var/debug/[debug_file].
2) Read the file by going into that directory and tailing it:
cd [my_magento_base_folder]/var/debug/
tail -f [debug_file] <-- replace with the actual filename
3) If you use debug, the file will get HUGE! Make sure you delete it once in a while.
Tip: as a root user, type:
crontab -e
*/5 * * * * rm /[my_magento_base_folder]/var/debug/[debug_file] <-- add this line
If you want to pause the cleanup and read the file, comment this line out with a # and use tail to read it.
These steps help you find common errors and loops (if the tail shows a repeating error message)