How to feed several LMDB files to the data layer in Caffe - neural-network

I have a very big dataset and it is not a good idea to convert it to a single LMDB file for Caffe. Thus, I am trying to split it into small parts and specify a TXT file containing the paths to the corresponding LMDB files.
Here's an example of my data layer:
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  data_param {
    source: "path/to/lmdb.txt"
    batch_size: 256
    backend: LMDB
  }
}
And this is my lmdb.txt file:
/path/to/train1lmdb
/path/to/train2lmdb
/path/to/train3lmdb
However, I got the following error:
I0828 10:30:40.639502 26950 layer_factory.hpp:77] Creating layer data
F0828 10:30:40.639549 26950 db_lmdb.hpp:15] Check failed: mdb_status == 0
(20 vs. 0) Not a directory
*** Check failure stack trace: ***
# 0x7f678e4a3daa (unknown)
# 0x7f678e4a3ce4 (unknown)
# 0x7f678e4a36e6 (unknown)
# 0x7f678e4a6687 (unknown)
# 0x7f678ebee5e1 caffe::db::LMDB::Open()
# 0x7f678eb2b7d4 caffe::DataLayer<>::DataLayer()
# 0x7f678eb2b982 caffe::Creator_DataLayer<>()
# 0x7f678ec1a1a9 caffe::Net<>::Init()
# 0x7f678ec1c382 caffe::Net<>::Net()
# 0x7f678ec2e200 caffe::Solver<>::InitTrainNet()
# 0x7f678ec2f153 caffe::Solver<>::Init()
# 0x7f678ec2f42f caffe::Solver<>::Solver()
# 0x7f678eabcc71 caffe::Creator_SGDSolver<>()
# 0x40f18e caffe::SolverRegistry<>::CreateSolver()
# 0x40827d train()
# 0x405bec main
# 0x7f678ccfaf45 (unknown)
# 0x4064f3 (unknown)
# (nil) (unknown)
Aborted (core dumped)
So, how can I make this work? Is this approach feasible? Thanks in advance.

The problem:
You are confusing the "Data" layer with the "HDF5Data" layer:
With a "Data" layer you can only specify a single lmdb/leveldb dataset, so your source: entry should point directly to the one database you are using.
With an "HDF5Data" layer, on the other hand, you can have multiple binary hdf5 files, and the source: parameter points to a text file listing all the binary files you are about to use.
Solutions
0. (Following PrzemekD's comment) Add a separate "Data" layer for each lmdb you have (with a proportionally smaller batch_size), then use a "Concat" layer to merge the different inputs into a single minibatch.
1. As you can already guess, one solution is to convert your data to hdf5 binary format and use an "HDF5Data" layer instead.
2. Alternatively, you can write your own "Python" input layer. This layer should read from all the lmdb files (using the Python lmdb interface) and feed the data, batch by batch, to your net; a minimal sketch is given below.
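For option 2, a rough sketch of such a layer might look like the following (untested; the param_str format, the use of the python lmdb package, and the round-robin sampling over the databases are assumptions for illustration). It would be wired in with a layer of type "Python" whose python_param points at this module and class.
import caffe
import lmdb
from caffe.proto import caffe_pb2

class MultiLMDBDataLayer(caffe.Layer):
    def setup(self, bottom, top):
        # param_str is assumed to look like: "path/to/lmdb.txt 256"
        list_file, batch_size = self.param_str.split()
        self.batch_size = int(batch_size)
        with open(list_file) as f:
            paths = [line.strip() for line in f if line.strip()]
        # Open every lmdb read-only and keep one positioned cursor per database.
        self.envs, self.cursors = [], []
        for p in paths:
            env = lmdb.open(p, readonly=True, lock=False)
            cur = env.begin().cursor()
            cur.first()
            self.envs.append(env)
            self.cursors.append(cur)
        self.idx = 0  # which database the next sample is drawn from

    def reshape(self, bottom, top):
        # Peek at the current record to get the sample shape
        # (assumes all samples share the same shape).
        datum = caffe_pb2.Datum()
        datum.ParseFromString(self.cursors[0].value())
        top[0].reshape(self.batch_size, datum.channels, datum.height, datum.width)
        top[1].reshape(self.batch_size)

    def forward(self, bottom, top):
        datum = caffe_pb2.Datum()
        for i in range(self.batch_size):
            cur = self.cursors[self.idx]
            datum.ParseFromString(cur.value())
            top[0].data[i, ...] = caffe.io.datum_to_array(datum)
            top[1].data[i] = datum.label
            if not cur.next():   # wrap around at the end of this lmdb
                cur.first()
            self.idx = (self.idx + 1) % len(self.cursors)

    def backward(self, top, propagate_down, bottom):
        pass  # a data layer has no gradient to propagate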

Related

Steps on how to set up SonarQube for Xcode/Swift Project

I have been trying to get this done for two days now and have had different errors to deal with. I am using an M1 Apple chip Mac Pro and Xcode 13.4, and it has been difficult to get SonarQube running. I finally found a Docker image which is M1-specific, and I have been able to get SonarQube running locally.
My current challenge is getting the test results sent to the SonarQube project.
I have tried several methods, including:
xcrun xccov view YourPathToThisFile/*.xccovreport --json
This command is not working, and in any case I wanted an XML format.
Is there a better way to have the code coverage report sent to SonarQube? I have SonarQube running, but the test results and coverage are not showing. The SonarQube page currently says "The main branch has no lines of code."
NB: I am running SonarQube with Docker.
Below is my SonarQube properties file.
#
# Swift SonarQube Plugin - Enables analysis of Swift and Objective-C projects into SonarQube.
# Copyright © 2015 Backelite (${email})
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Lesser General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
#
# Sonar Server details
sonar.host.url=http://localhost:9000/
sonar.login=782a04fee8bfc7ae181f04bbd13734eb89e5580c
# sonar.password=admin
# Project Details
sonar.projectKey=tinggios
sonar.projectName=TinggIOSApp
sonar.projectDescription=This is TinggiOS
# Comment if you have a project with mixed ObjC / Swift
sonar.language=swift
sonar.projectKey=tinggios
sonar.qualitygate.wait=true
# Path to source directories
# sonar.sources=SonarDemo,SonarDemoTests,SonarDemoUITests
sonar.sources=.
# Exclude directories
sonar.test.inclusions=**/*Test*/**
sonar.test.inclusions=*.swift
sonar.exclusions=**/*.xml,Pods/**/*,Reports/**/*
# sonar.inclusions=*.swift
# Path to test directories (comment if no test)
sonar.tests=TinggIOS/Core/Tests/CoreTests,TinggIOS/Home/Tests/HomeTests,TinggIOS/OnboardingUITest
# Destination Simulator to run surefire
# As string expected in destination argument of xcodebuild command
# Example = sonar.swift.simulator=platform=iOS Simulator,name=iPhone 6,OS=9.2
# sonar.swift.simulator=platform=iOS Simulator,name=iPhone 7,OS=12.0
sonar.swift.simulator=platform=iOS Simulator,name=iPhone 11,OS=15
# Xcode project configuration (.xcodeproj)
# and use the later to specify which project(s) to include in the analysis (comma separated list)
# Specify either xcodeproj or xcodeproj + xcworkspace
sonar.swift.project=TinggIOS/TinggIOS.xcodeproj
sonar.swift.workspace=TinggIOS/TinggIOS.xcworkspace
sonar.language=swift
sonar.c.file.suffixes=-
sonar.cpp.file.suffixes=-
sonar.objc.file.suffixes=-
# Specify your appname.
# This will be something like "myApp"
# Use when basename is different from targeted scheme.
# Or when slather fails with 'No product binary found'
sonar.swift.appName=TinggIOS
# Scheme to build your application
sonar.swift.appScheme=TinggIOS
# Configuration to use for your scheme. if you do not specify that the default will be Debug
sonar.swift.appConfiguration=Debug
##########################
# Optional configuration #
##########################
# Encoding of the source code
sonar.sourceEncoding=UTF-8
# SCM
# sonar.scm.enabled=true
# sonar.scm.url=scm:git:http://xxx
# JUnit report generated by run-sonar.sh is stored in sonar-reports/TEST-report.xml
# Change it only if you generate the file on your own
# The XML files have to be prefixed by TEST- otherwise they are not processed
sonar.junit.reportsPath=sonar-reports/TEST-report.xml
# Cobertura report generated by run-sonar.sh is stored in sonar-reports/coverage-swift.xml
# Change it only if you generate the file on your own
sonar.swift.coverage.reportPattern=sonar-reports/coverage-swift*.xml
#sonar.coverageReportPaths=sonarqube-generic-coverage.xml
#sonar.swift.coverage.reportPattern=sonar-reports/cobertura.xml
# OCLint report generated by run-sonar.sh is stored in sonar-reports/oclint.xml
# Change it only if you generate the file on your own
sonar.swift.swiftlint.report=sonar-reports/*swiftlint.txt
# Change it only if you generate the file on your own
sonar.swift.tailor.report=sonar-reports/*tailor.txt
# Paths to exclude from coverage report (surefire, 3rd party libraries etc.)
# sonar.swift.excludedPathsFromCoverage=pattern1,pattern2
# sonar.swift.excludedPathsFromCoverage=.*Tests.*,
##########################
# Tailor configuration #
##########################
# Tailor configuration
# -l,--max-line-length=<0-999> maximum Line length (in characters)
# --list-files display Swift source files to be analyzed
# --max-class-length=<0-999> maximum Class length (in lines)
# --max-closure-length=<0-999> maximum Closure length (in lines)
# --max-file-length=<0-999> maximum File length (in lines)
# --max-function-length=<0-999> maximum Function length (in lines)
# --max-name-length=<0-999> maximum Identifier name length (in characters)
# --max-severity=<error|warning (default)> maximum severity
# --max-struct-length=<0-999> maximum Struct length (in lines)
# --min-name-length=<1-999> minimum Identifier name length (in characters)
sonar.swift.tailor.config=--no-color --max-line-length=100 --max-file-length=500 --max-name-length=40 --max-name-length=40 --min-name-length=4
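A note on the coverage part of the question: SonarQube can also ingest coverage through its documented "generic test coverage" XML format via the sonar.coverageReportPaths property that is commented out above, which sidesteps the JSON output of xccov. The sketch below only illustrates the shape of that XML file; the file path and line data in it are placeholders, and the real per-line data would still have to come from your Xcode coverage results.
import xml.etree.ElementTree as ET

# Placeholder data: file path -> {line number: covered?}
coverage_data = {
    "TinggIOS/Core/Sources/Example.swift": {3: True, 4: True, 9: False},
}

root = ET.Element("coverage", version="1")
for path, lines in coverage_data.items():
    file_el = ET.SubElement(root, "file", path=path)
    for line, covered in sorted(lines.items()):
        ET.SubElement(file_el, "lineToCover",
                      lineNumber=str(line), covered=str(covered).lower())

ET.ElementTree(root).write("sonarqube-generic-coverage.xml",
                           encoding="utf-8", xml_declaration=True)
The resulting sonarqube-generic-coverage.xml is the file that sonar.coverageReportPaths would point at.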

Issue with incorporating BERT in RASA Pipeline

RASA provides an option to include pre-trained language models from Hugging Face in the pipeline. As per the docs:
- name: HFTransformersNLP
  # Name of the language model to use
  model_name: "bert"
  # Pre-Trained weights to be loaded
  model_weights: "bert-base-uncased"
  # An optional path to a specific directory to download and cache the pre-trained model weights.
  # The `default` cache_dir is the same as https://huggingface.co/transformers/serialization.html#cache-directory .
  cache_dir: null
Following this I configured my pipeline as:
- name: "HFTransformersNLP"
# Name of the language model to use
model_name: "bert"
# Pre-Trained weights to be loaded
model_weights: "bert-base-uncased"
cache_dir: "C:/Project ABC/cache/"
But the problem is that on running the training steps, the model keeps failing with:
OSError: Model name 'bert-base-uncased' was not found in tokenizers
model name list (bert-base-uncased, bert-large-uncased,
bert-base-cased, bert-large-cased, bert-base-multilingual-uncased,
bert-base-multilingual-cased, bert-base-chinese,
bert-base-german-cased, bert-large-uncased-whole-word-masking,
bert-large-cased-whole-word-masking,
bert-large-uncased-whole-word-masking-finetuned-squad,
bert-large-cased-whole-word-masking-finetuned-squad,
bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased,
bert-base-german-dbmdz-uncased, bert-base-finnish-cased-v1,
bert-base-finnish-uncased-v1, bert-base-dutch-cased). We assumed
'bert-base-uncased' was a path, a model identifier, or url to a
directory containing vocabulary files named ['vocab.txt'] but couldn't
find such vocabulary files at this path or url.
I did some research and it looks like there might be an issue downloading the files from the internet, so I manually downloaded config.json and pytorch_model.bin and placed them in C:/Project ABC/cache/, but I am still getting the same error message. Any idea how to resolve this? Not giving a cache directory fails too, with the same error.
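For what it's worth, the error is complaining about the tokenizer's vocabulary file (vocab.txt), which is not among the files downloaded manually above (config.json, pytorch_model.bin). One possible workaround, assuming the transformers package is available on a machine that can reach the internet, is to pre-download the model and tokenizer there, save them to a local directory, and then try pointing model_weights at that directory (the error text itself says a path to a directory containing vocab.txt is accepted). The directory name below is just an example.
from transformers import BertModel, BertTokenizer

local_dir = "C:/Project ABC/cache/bert-base-uncased"  # example path

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

tokenizer.save_pretrained(local_dir)  # writes vocab.txt and the tokenizer config files
model.save_pretrained(local_dir)      # writes config.json and pytorch_model.bin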

Automate uploading of Glue script

We are currently using CloudFormation to create a Glue job (via CodeBuild and CodePipeline). The one thing we are stuck on is how to automate deploying the code that goes into the Glue job.
Our current relevant piece of the cloudformation template looks like this:
MyJob:
  Type: AWS::Glue::Job
  Properties:
    Command:
      Name: glueetl
      ScriptLocation: "s3://aws-glue-scripts//your-script-file.py"
    DefaultArguments:
      "--job-bookmark-option": "job-bookmark-enable"
    ExecutionProperty:
      MaxConcurrentRuns: 2
    MaxRetries: 0
    Name: cf-job1
    Role: !Ref MyJobRole
The problem is the "ScriptLocation". It looks like it is required to be an S3 location. How can we automate the upload of this? The code is in a .py file in our Git repository, and I assume it is uploaded to the artifact repository as part of the CodeBuild process, but how do we access it?
Would like to hear how others are doing this. Thanks!
EDIT: I was able to find a similar Stack Overflow post: AWS Glue automatic job creation, but the answers really don't give a solution or address the question posed.
I've written a tool to handle the upload of stack dependencies, including CloudFormation nested templates and non-inline Lambda functions.
AWS Glue is currently not handled, since I haven't tried it in any project yet, but it should be easy to expand the tool to support Glue.
The dependencies are defined in a separate config file, and a piece of code within the tool is responsible for handling it. Here's a sample config:
Nested CloudFormation templates:
# DEPENDS=( <ParameterName>=<NestedTemplate> )
#
# Required: Yes if has nested template, otherwise No
# Default: None
# Syntax:
# <ParameterName>: The name of template parameter that is referred at the
# value of nested template property `TemplateURL`.
# <NestedTemplate>: A local path or a S3 URL starting with `s3://` or
# `https://` pointing to the nested template.
# The nested templates at local is going to be uploaded
# to S3 Bucket automatically during the deployment.
# Description:
# Double quote the pairs which contain whitespaces or special characters.
# Use `#` to comment out.
# ---
# Example:
# DEPENDS=(
# NestedTemplateFooURL=/path/to/nested/foo/stack.json
# NestedTemplateBarURL=/path/to/nested/bar/stack.json
# )
Lambda functions:
# LAMBDA=( <S3BucketParameterName>:<S3KeyParameterName>=<LambdaFunction> )
#
# Required: Yes if has None-inline Lambda Function, otherwise No
# Default: None
# Syntax:
# <S3BucketParameterName>: The name of template parameter that is referred
# at the value of Lambda property `Code.S3Bucket`.
# <S3KeyParameterName>: The name of template parameter that is referred
# at the value of Lambda property `Code.S3Key`.
# <LambdaFunction>: A local path or a S3 URL starting with `s3://` pointing
# to the Lambda Function.
# The Lambda Functions at local is going to be zipped and
# uploaded to S3 Bucket automatically during the deployment.
# Description:
# Double quote the pairs which contain whitespaces or special characters.
# Use `#` to comment out.
# ---
# Example:
# LAMBDA=(
# S3BucketForLambdaFoo:S3KeyForLambdaFoo=/path/to/LambdaFoo.py
# S3BucketForLambdaBar:S3KeyForLambdaBar=s3://mybucket/LambdaBar.py
# )
The tool is written in bash and comes in two parts:
xsh: It works as a bash library framework.
xsh-lib/aws: It's a library for xsh.
The code you may need to expand is located in xsh-lib/aws/functions/cfn/deploy.sh.
The example deploy command looks like:
$ xsh aws/cfn/deploy -C /path/to/your/template-and-config-dir -t stack.json -c sample.conf
I'm considering abstracting the dependencies, such as CloudFormation templates, Lambda functions and Glue scripts, into a single interface for both configs and handlers.
This will make it easier to add new dependency handlers to the deployer.
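Independent of the tool above, the upload step itself is small enough to script directly, for example as a step that CodeBuild runs before deploying the template. Below is a minimal boto3 sketch; the bucket, key and local file names are placeholders, and the key has to match whatever ScriptLocation (or a parameter feeding it) expects.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="glue/your-script-file.py",   # path of the script in the checked-out repo
    Bucket="aws-glue-scripts-example",     # bucket referenced by ScriptLocation
    Key="your-script-file.py",
)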

Cannot access FTP directory with CP1250/CP852/UTF-8 encoding

I am trying to read in some files from the following directory structure:
/jc/06 Önéletrajzok/Profession/Előszűrés sablonok név szerint
But for some strange reason I cannot enter even the upper-level directories.
I have already tried with PHP, Python 3.6, and Ruby, but without much luck. With PHP and Python I can at least CWD() as far as the /jc/06 Önéletrajzok/Profession part.
Here is my python code for reference:
from ftplib import FTP
ftp = FTP('hostname')
ftp.login('username','pwd')
ftp.cwd('jc') # Just for demonstration purposes as step by step
ftp.cwd('06 Önéletrajzok')
ftp.cwd('Profession')
print(ftp.nlst()[2]) # Which gives: 'ElÅ\x91szűrés sablonok név szerint'
# But when I am trying:
ftp.cwd('ElÅ\x91szűrés sablonok név szerint')
# Or either:
ftp.cwd('Előszűrés sablonok név szerint')
# It gives:
# UnicodeEncodeError: 'latin-1' codec can't encode character '\u0151' in position 6: ordinal not in range(256)
# So I am trying encoding CP1250 or CP852 (for Hungarian)
dir = 'Előszűrés sablonok név szerint'.encode('cp852') # which gives: b'El\x8bsz\xfbr\x82s sablonok n\x82v szerint'
ftp.cwd(dir.decode('utf-8'))
# and it gives the following error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 2: invalid start byte
So I am starting to give up on this one; I don't know how to access those files. The directory structure was created with Windows laptops accessing a Synology file server.
I have already tried with ftp.encoding = "utf-8" too.
Any ideas?
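One more thing that may be worth double-checking: ftplib only applies ftp.encoding to commands sent after it is set, and the listing shown above ('ElÅ\x91szűrés …') looks like UTF-8 bytes decoded as Latin-1, which suggests the connection was still using the default encoding at that point. Below is a minimal sketch that sets the encoding before logging in (hostname and credentials are the placeholders from the question); if the server actually uses CP1250 rather than UTF-8, the same idea applies with 'cp1250' instead.
from ftplib import FTP

ftp = FTP()
ftp.encoding = 'utf-8'        # must be in effect before login/cwd/nlst
ftp.connect('hostname')       # placeholder host
ftp.login('username', 'pwd')  # placeholder credentials

ftp.cwd('jc/06 Önéletrajzok/Profession')
print(ftp.nlst())             # names should now decode cleanly

# With a matching encoding the accented name can be passed as-is:
ftp.cwd('Előszűrés sablonok név szerint')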

What is the use of Hash in Nginx Upload module?

I have got the Nginx upload module working normally with Python (Tornado). I save the paths of the uploaded files in the database.
However, I wonder why the upload module has to split my uploads and put them into 10 different folders, /var/www/.../uploads/0,1,2,3,4,5...9? The comment below says the files are hashed; what does the module do here, and why?
# Store files to this directory
# The directory is hashed, subdirectories 0 1 2 3 4 5 6 7 8 9 should exist
upload_store /var/www/...uploads 1;
# filesystem location where we store uploads
#
# The second argument is the level of "hashing" that nginx will perform
# on the filenames before storing them to the filesystem. I can't find
# any documentation online, so as an example, say we were using this
# configuration:
#
# upload_store /tmp/uploads 2 1;
#
# A file named '43829042' would be written to this path:
#
# /tmp/uploads/42/0/43829042
#
# I hope that's clear enough. The argument is required and must be
# greater than 0. You can see the implementation here:
#
# http://lxr.evanmiller.org/http/source/core/ngx_file.c#L118
Source: http://bclennox.com/extremely-large-file-uploads-with-nginx-passenger-rails-and-jquery
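In short, the "hashing" just spreads uploads across subdirectories so that no single directory ends up holding a huge number of files, and the subdirectory names are taken from the trailing characters of the generated file name, one group per level. Below is a small sketch of the mapping described in the quoted example (the helper function is made up for illustration; it is not part of the module, and the store paths are placeholders).
def hashed_upload_path(store_root, filename, levels=(2, 1)):
    # Each entry in `levels` consumes that many characters from the end of
    # the file name, working backwards, and becomes one subdirectory.
    parts = []
    pos = len(filename)
    for level in levels:
        parts.append(filename[pos - level:pos])
        pos -= level
    return "/".join([store_root] + parts + [filename])

print(hashed_upload_path("/tmp/uploads", "43829042", (2, 1)))
# -> /tmp/uploads/42/0/43829042 (matches the quoted example)

print(hashed_upload_path("/var/www/uploads", "43829042", (1,)))
# -> /var/www/uploads/2/43829042 (a single one-character level, hence the 0-9 folders)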