sphinx CJK support - sphinx

We are using Sphinx with MySQL. Our MySQL data is UTF-8 and contains Chinese characters, so we need Sphinx to support CJK. Here's what we have in sphinx.conf:
charset_type = utf-8
charset_table = 0..9, U+27, U+41..U+5a->U+61..U+7a, U+61..U+7a, \
U+aa, U+b5, U+ba, \
U+c0..U+d6->U+e0..U+f6, U+d8..U+de->U+f8..U+fe, U+df..U+f6, \
U+f8..U+ff, U+100..U+12f/2, U+130->U+69, \
U+131, U+132..U+137/2, U+138, \
...
...
...
ngram_chars = U+3400..U+4DB5, U+4E00..U+9FA5, U+20000..U+2A6D6,U+4E00..U+9FBB, U+3400..U+4DB5, U+20000..U+2A6D6, U+FA0E, U+FA0F, U+FA11, U+FA13, U+FA14, U+FA1F, U+FA21, U+FA23, U+FA24, U+FA27, U+FA28, U+FA29, U+3105..U+312C, U+31A0..U+31B7, U+3041, \
U+3043, U+3045, U+3047, U+3049, U+304B, U+304D, U+304F, U+3051, U+3053, U+3055, U+3057, U+3059, U+305B, U+305D, U+305F, U+3061, U+3063, U+3066, U+3068, U+306A..U+306F, U+3072, U+3075, U+3078, U+307B, U+307E..U+3083, U+3085, U+3087, U+3089..U+308E, U+3090..U+3093, \
U+30A1, U+30A3, U+30A5, U+30A7, U+30A9, U+30AD, U+30AF, U+30B3, U+30B5, U+30BB, U+30BD, U+30BF, U+30C1, U+30C3, U+30C4, U+30C6, U+30CA, U+30CB, U+30CD, U+30CE, U+30DE, U+30DF, U+30E1, U+30E2, U+30E3, U+30E5, U+30E7, U+30EE, U+30F0..U+30F3, U+30F5, U+30F6, U+31F0, \
U+31F1, U+31F2, U+31F3, U+31F4, U+31F5, U+31F6, U+31F7, U+31F8, U+31F9, U+31FA, U+31FB, U+31FC, U+31FD, U+31FE, U+31FF, U+AC00..U+D7A3, U+1100..U+1159, U+1161..U+11A2, U+11A8..U+11F9, U+A000..U+A48C, U+A492..U+A4C6
ngram_len = 1
And mysql conf:
character_set_client:utf8
character_set_connection:utf8
character_set_database:utf8
character_set_results:utf8
character_set_server:utf8
character_set_system:utf8
collation_connection:utf8_general_ci
collation_database:utf8_general_ci
collation_server:utf8_general_ci
init_connect:SET NAMES utf8
It manages to index garbled text such as this as Chinese: 今宵离别åŽä½•æ—¥å›å†æ¥
Real Chinese like this shows up as ??? in Sphinx: 后来
My belief is that there is an encoding problem somewhere, but I don't know where.
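The garbled sample above looks like what appears when UTF-8 bytes are decoded one byte at a time with a single-byte charset. The following is a minimal Java sketch of that effect, not a diagnosis of this particular setup. The class and method names are hypothetical, and ISO-8859-1 is assumed as the wrong charset only because it is lossless to reverse (the sample above looks closer to Windows-1252).

```java
import java.nio.charset.StandardCharsets;

final class MojibakeSketch {
    /** Encodes text as UTF-8, then (wrongly) decodes every byte as one ISO-8859-1 character. */
    static String misdecode(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    /** Reverses misdecode: recovers the raw bytes and decodes them as UTF-8. */
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u540E\u6765"; // the two characters from the question
        String garbled = misdecode(original);
        // Each CJK character here is 3 bytes in UTF-8, so 2 characters become 6.
        System.out.println(garbled.length());
        System.out.println(repair(garbled).equals(original));
    }
}
```

If the indexed text grows in length like this, some layer between the database and the indexer is probably treating UTF-8 bytes as a single-byte charset.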


protoc-gen-grpc-swift: Plugin killed by signal 9

I run the following command
protoc ./Sources/Protos/echo.proto \
--proto_path=./Sources/Protos/ \
--plugin=/usr/local/bin/protoc-gen-swift \
--swift_opt=Visibility=Public \
--swift_out=Sources/Protos/ \
--plugin=/usr/local/bin/protoc-gen-grpc-swift \
--grpc-swift_opt=Visibility=Public,AsyncClient=True,AsyncServer=True \
--grpc-swift_out=./Sources/Protos/
The protocol is:
syntax = "proto3";
package echo;
service EchoService {
rpc echo(EchoRequest) returns (EchoResponse);
}
message EchoRequest {
string contents = 1;
}
message EchoResponse {
string contents = 1;
}
And receive the error:
--grpc-swift_out: protoc-gen-grpc-swift: Plugin killed by signal 9.
I am running tag 1.7.1-async-await.2.
This was generating the output files under 1.8.1 but I would like the async-await version. This fails even if the async flags are False. I have verified that the plugins in /usr/local/bin are the ones generated by make plugins in 1.7.1-async-await.2.

Programmatic access to password from Elytron credential-store

I am using Elytron on WildFly 12 to keep a datasource password in a credential store instead of in clear text.
I use the following CLI commands to store the password:
/subsystem=elytron/credential-store=ds_credentials:add( \
location="credentials/csstore.jceks", \
relative-to=jboss.server.data.dir, \
credential-reference={clear-text="changeit"}, \
create=true)
/subsystem=datasources/data-source=mydatasource/:undefine-attribute(name=password)
/subsystem=elytron/credential-store=ds_credentials:add-alias(alias=db_password, \
secret-value="datasource_password_clear_text")
/subsystem=datasources/data-source=mydatasource/:write-attribute( \
name=credential-reference, \
value={store=ds_credentials, alias=db_password})
This works very well so far. Now I need a way to read this password programmatically, so I can create a PostgreSQL database dump.
I found one way to do it, but it feels like an improper solution.
// Requires: java.io.File, java.io.FileInputStream, java.io.InputStream,
// java.security.Key, java.security.KeyStore
// KEYSTORE_PASS is the credential store password ("changeit" in the CLI commands above)
static final String DB_PASS_ALIAS = "db_password/passwordcredential/clear/";
File keystoreFile = new File(System.getProperty("jboss.server.data.dir"),
"credentials/csstore.jceks");
// Open keystore; try-with-resources closes the stream even if load() fails
KeyStore keystore = KeyStore.getInstance("JCEKS");
try (InputStream keystoreStream = new FileInputStream(keystoreFile)) {
    keystore.load(keystoreStream, KEYSTORE_PASS.toCharArray());
}
// Check if password for alias is available
if (!keystore.containsAlias(DB_PASS_ALIAS)) {
    throw new RuntimeException("Alias for key not found");
}
// Get password
Key key = keystore.getKey(DB_PASS_ALIAS, KEYSTORE_PASS.toCharArray());
// Decode password - skip the 2-byte prefix of the encoded value
final String password = new String(key.getEncoded(), 2, key.getEncoded().length - 2);
I am open to any better solutions.
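For reference, here is a self-contained sketch of the plain JDK mechanism the snippet above relies on: a secret stored as a key entry in a JCEKS keystore and read back with `KeyStore.getKey`. It works entirely in memory. The class name, the alias, and the "RAW" algorithm label are assumptions made for illustration, and it does not reproduce Elytron's own entry encoding (the 2-byte prefix above). Elytron also ships its own `org.wildfly.security.credential.store.CredentialStore` API, which may be the more appropriate route but is not shown here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import javax.crypto.spec.SecretKeySpec;

final class JceksRoundTrip {
    /** Stores a secret in an in-memory JCEKS keystore, reloads the keystore, and reads the secret back. */
    static String roundTrip(String alias, String secret, char[] storePass) {
        try {
            // Create an empty keystore and add the secret as a secret-key entry.
            KeyStore ks = KeyStore.getInstance("JCEKS");
            ks.load(null, storePass);
            SecretKeySpec key = new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "RAW");
            ks.setEntry(alias, new KeyStore.SecretKeyEntry(key),
                    new KeyStore.PasswordProtection(storePass));

            // Serialize to bytes (stands in for the .jceks file on disk).
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            ks.store(out, storePass);

            // Reload and read the entry, as the code in the question does.
            KeyStore reloaded = KeyStore.getInstance("JCEKS");
            try (ByteArrayInputStream in = new ByteArrayInputStream(out.toByteArray())) {
                reloaded.load(in, storePass);
            }
            if (!reloaded.containsAlias(alias)) {
                throw new IllegalStateException("Alias not found: " + alias);
            }
            byte[] raw = reloaded.getKey(alias, storePass).getEncoded();
            return new String(raw, StandardCharsets.UTF_8);
        } catch (IOException | GeneralSecurityException e) {
            throw new IllegalStateException("Keystore round trip failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("db_password", "datasource_password_clear_text",
                "changeit".toCharArray()));
    }
}
```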

Using scala sys.process with single quotes, white-space, pipes etc

I am trying to use scala.sys.process._ to submit a POST request to my chronos server with curl. Because there is white space in the command's arguments, I am using the Seq[String] variant of cmd.!!
I am building the command like so:
val cmd = Seq("curl", "-L", "-X POST", "-H 'Content-Type: application/json'", "-d " + jsonHash, args.chronosHost + "/scheduler/" + jobType)
which produces, as expected,
cmd: Seq[String] = List(curl, -L, -X POST, -H 'Content-Type: application/json', -d '{"schedule":"R/2014-02-02T00:00:00Z/PT24H", "name":"Scala-Post-Test", "command":"which scalac", "epsilon":"PT15M", "owner":"myemail@thecompany.com", "async":false}', localhost:4040/scheduler/iso8601)
however, running this appears to mangle the 'Content-Type: application/json' argument:
scala> cmd.!!
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 264 0 100 100 164 2157 3538 --:--:-- --:--:-- --:--:-- 54666
res21: String =
"The HTTP header field "Accept" with value "*/* 'Content-Type:application/json'" could not be parsed.
"
which I don't understand. By contrast, calling cmd.mkString(" ") and copy+pasting into the terminal works as expected.
curl -L -X POST -H 'Content-Type:application/json' -d '{"schedule":"R/2014-02-02T00:00:00Z/PT24H", "name":"Scala-Post-Test", "command":"which scalac", "epsilon":"PT15M", "owner":"austin@quantifind.com", "async":false}' mapr-01.dev.quantifind.com:4040/scheduler/iso8601
I have tried numerous variations on the -H argument to no avail, any insight into using single quotes in sys.process._'s !! would be greatly appreciated.
I have tried variations on this as well, which generates a slew of errors, including
<h2>HTTP ERROR: 415</h2>
<p>Problem accessing /scheduler/iso8601. Reason:
<pre> Unsupported Media Type</pre></p>
<hr /><i><small>Powered by Jetty://</small></i>
in addition to butchering the jsonHash, i.e.:
[1/6]: '"schedule":"R/2014-02-02T00:00:00Z/PT24H"' --> <stdout>
curl: (6) Couldn't resolve host ''"schedule"'
This makes me think it is not interpreting the -H argument correctly.
You need to split each argument into a separate element of a sequence.
Instead of this:
val cmd = Seq("curl", "-L", "-X POST", "-H 'Content-Type: application/json'", "-d " + jsonHash, args.chronosHost + "/scheduler/" + jobType)
you need to write this:
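val cmd = Seq("curl", "-L", "-X", "POST", "-H", "Content-Type: application/json", "-d", jsonHash, args.chronosHost + "/scheduler/" + jobType)
Each element of the sequence is passed to the command as exactly one argument. So "-H 'Content-Type... reaches curl as a single argument when it should be two, and the same goes for "-d " + jsonHash. No shell is involved, so shell-style single quotes are not stripped either; leave them out of the header value and out of jsonHash.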
Here is a simple way to test:
import scala.sys.process._
val cmd = Seq("find", "/dev/null", "-name", "null") // works
// does not work: val cmd = Seq("find", "/dev/null", "-name null")
val res = cmd.!!
println(res)
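scala.sys.process is built on java.lang.ProcessBuilder, so the same behaviour can be observed from plain Java. The sketch below is illustrative only: the class and method names are hypothetical, and it assumes a POSIX `sh` on the PATH. It runs a one-line shell script that reports how many arguments it received.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ArgvDemo {
    /** Starts `sh -c 'echo $#' sh <args...>` and returns the argument count the script saw. */
    static int countArgs(String... args) {
        // The 4th element becomes $0 of the script; everything after it is counted by $#.
        List<String> cmd = new ArrayList<>(List.of("sh", "-c", "echo $#", "sh"));
        cmd.addAll(Arrays.asList(args));
        try {
            Process p = new ProcessBuilder(cmd).redirectErrorStream(true).start();
            String out = new String(p.getInputStream().readAllBytes(), StandardCharsets.UTF_8).trim();
            p.waitFor();
            return Integer.parseInt(out);
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException("could not run sh", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countArgs("-X POST"));   // one list element, one argument
        System.out.println(countArgs("-X", "POST")); // two list elements, two arguments
    }
}
```

A list element containing a space is never split, which is why "-X POST" and "-H '...'" each reach curl as one malformed argument.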
Most of the other answers are a bit faffy. Use the following trick to get Bash to do all the quote handling for you:
import scala.sys.process._
Seq("bash", "-c", bashLine).!
For easier Googling myself: bash execute process sys.process pipe string scala
As a String pimp method (easy copy pasting)
import scala.sys.process._
implicit class PimpedString(bashLine: String) {
  // The space before ':' matters: without it, Scala reads `!!!:` as the method name
  def !!! : Int = Seq("bash", "-c", bashLine).!
}
I also had a hard time making this work. Enlightened by Aleksey's answer, the following worked for me. Note that, surprisingly, there are no extra quotes (single or double) around Content-Type:
val deploy = Seq("curl", "-X", "POST", s"$ip/v2/apps", "-H", "Content-Type: application/json", "-d", s"@$containerJson")
println(deploy.!!)

two separate rt indexes sphinx

I'm able to do Unicode search in Sphinx now. The issue I'm seeing is that English no longer works when I search. Do I need separate indexes for each language, or should one index be enough for both?
path = /var/data/sphinx/forums
rt_field = subject
rt_attr_uint = pid
charset_type = utf-8
charset_table = charset_table = U+0622->U+0627, U+0623->U+0627, U+0624->U+0648, U+0625->U+0627, U+0626->U+064A, U+06C0->U+06D5, U+06C2->U+06C1, U+06D3->U+06D2, U+FB50->U+0671, U+FB51->U+0671, U+FB52->U+067B, U+FB53->U+067B, U+FB54->U+067B, U+FB56->U+067E, U+FB57->U+067E, U+FB58->U+067E, U+FB5A->U+0680, U+FB5B->U+0680, U+FB5C->U+0680, U+FB5E->U+067A, U+FB5F->U+067A, U+FB60->U+067A, U+FB62->U+067F, U+FB63->U+067F, U+FB64->U+067F, U+FB66->U+0679, U+FB67->U+0679, U+FB68->U+0679, U+FB6A->U+06A4, U+FB6B->U+06A4, U+FB6C->U+06A4, U+FB6E->U+06A6, U+FB6F->U+06A6, U+FB70->U+06A6, U+FB72->U+0684, U+FB73->U+0684, U+FB74->U+0684, U+FB76->U+0683, U+FB77->U+0683, U+FB78->U+0683, U+FB7A->U+0686, U+FB7B->U+0686, U+FB7C->U+0686, U+FB7E->U+0687, U+FB7F->U+0687, U+FB80->U+0687, U+FB82->U+068D, U+FB83->U+068D, U+FB84->U+068C, U+FB85->U+068C, U+FB86->U+068E, U+FB87->U+068E, U+FB88->U+0688, U+FB89->U+0688, U+FB8A->U+0698, U+FB8B->U+0698, U+FB8C->U+0691, U+FB8D->U+0691, U+FB8E->U+06A9, U+FB8F->U+06A9, U+FB90->U+06A9, U+FB92->U+06AF, U+FB93->U+06AF, U+FB94->U+06AF, U+FB96->U+06B3, U+FB97->U+06B3, U+FB98->U+06B3, U+FB9A->U+06B1, U+FB9B->U+06B1, U+FB9C->U+06B1, U+FB9E->U+06BA, U+FB9F->U+06BA, U+FBA0->U+06BB, U+FBA1->U+06BB, U+FBA2->U+06BB, U+FBA4->U+06C0, U+FBA5->U+06C0, U+FBA6->U+06C1, U+FBA7->U+06C1, U+FBA8->U+06C1, U+FBAA->U+06BE, U+FBAB->U+06BE, U+FBAC->U+06BE, U+FBAE->U+06D2, U+FBAF->U+06D2, U+FBB0->U+06D3, U+FBB1->U+06D3, U+FBD3->U+06AD, U+FBD4->U+06AD, U+FBD5->U+06AD, U+FBD7->U+06C7, U+FBD8->U+06C7, U+FBD9->U+06C6, U+FBDA->U+06C6, U+FBDB->U+06C8, U+FBDC->U+06C8, U+FBDD->U+0677, U+FBDE->U+06CB, U+FBDF->U+06CB, U+FBE0->U+06C5, U+FBE1->U+06C5, U+FBE2->U+06C9, U+FBE3->U+06C9, U+FBE4->U+06D0, U+FBE5->U+06D0, U+FBE6->U+06D0, U+FBE8->U+0649, U+FBFC->U+06CC, U+FBFD->U+06CC, U+FBFE->U+06CC, U+0621, U+0627..U+063A, U+0641..U+064A, U+0660..U+0669, U+066E, U+066F, U+0671..U+06BF, U+06C1, U+06C3..U+06D2, U+06D5, U+06EE..U+06FC, U+06FF, U+0750..U+076D, U+FB55, U+FB59, U+FB5D, U+FB61, U+FB65, U+FB69, 
U+FB6D, U+FB71, U+FB75, U+FB79, U+FB7D, U+FB81, U+FB91, U+FB95, U+FB99, U+FB9D, U+FBA3, U+FBA9, U+FBAD, U+FBD6, U+FBE7, U+FBE9, U+FBFF
The reason I'm asking is that when I remove charset_table and charset_type, English starts working again.
Your charset_table doesn't include the basic 'English' characters, e.g.
0..9, A..Z->a..z, _, a..z,
Removing the charset_table works because Sphinx then falls back to the default, which is the default SBCS charset table:
charset_table = 0..9, A..Z->a..z, _, a..z, \
U+A8->U+B8, U+B8, U+C0..U+DF->U+E0..U+FF, U+E0..U+FF
So to get multiple languages in one index, the charset_table needs to include the characters from ALL the languages involved.
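To make the effect concrete, here is a deliberately simplified Java model of a charset table. It is not Sphinx's tokenizer: the class name and the three predicates are hypothetical, the Arabic range is reduced to U+0621..U+064A, and case folding is approximated with `Character.toLowerCase`. The one behaviour it keeps is that characters outside the table act as separators.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

final class CharsetTableSketch {
    // Toy tables: a reduced Arabic range, and the basic "English" set 0..9, A..Z, a..z, _
    static final IntPredicate ARABIC_ONLY = cp -> cp >= 0x0621 && cp <= 0x064A;
    static final IntPredicate LATIN = cp ->
            (cp >= '0' && cp <= '9') || (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z') || cp == '_';
    static final IntPredicate COMBINED = ARABIC_ONLY.or(LATIN);

    /** Splits text into tokens; any code point not in the table ends the current token. */
    static List<String> tokenize(String text, IntPredicate inTable) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (inTable.test(cp)) {
                current.appendCodePoint(Character.toLowerCase(cp));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        });
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String subject = "Hello \u0645\u0631\u062D\u0628\u0627"; // an English word and an Arabic word
        System.out.println(tokenize(subject, ARABIC_ONLY).size()); // English word is dropped
        System.out.println(tokenize(subject, COMBINED).size());    // both words are kept
    }
}
```

With the Arabic-only table the English word never becomes a token, so no query can match it; with the combined table a single index serves both languages.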

There is no sound with iFrameExtractor player

I have built ffmpeg and iFrameExtractor for iOS 5.1 successfully, but when I play the video there is no sound.
// Register all formats and codecs
avcodec_register_all();
av_register_all();
avformat_network_init();
if(avformat_open_input(&pFormatCtx, [@"http://somesite.com/test.mp4" cStringUsingEncoding:NSASCIIStringEncoding], NULL, NULL) != 0) {
av_log(NULL, AV_LOG_ERROR, "Couldn't open file\n");
goto initError;
}
The log is
[swscaler @ 0xdd3000] No accelerated colorspace conversion found from yuv420p to rgb24.
2012-10-22 20:42:47.344 iFrameExtractor[356:707] video duration: 5102.840000
2012-10-22 20:42:47.412 iFrameExtractor[356:707] video size: 720 x 576
2012-10-22 20:42:47.454 iFrameExtractor[356:707] Application windows are expected to have a root view
This is my configure file for ffmpeg 0.11.1:
#!/bin/tcsh -f
rm -rf compiled/*
./configure \
--cc=/Applications/Xcode.app/Contents/Developer/Platforms/iPhoneOS.platform/Developer/usr/bin/gcc \
--as='/usr/local/bin/gas-preprocessor.pl /Applications/Xcode.app/Contents/Developer/Platforms/iPhoneOS.platform/Developer/usr/bin/gcc' \
--sysroot=/Applications/Xcode.app/Contents/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS5.1.sdk \
--target-os=darwin \
--arch=arm \
--cpu=cortex-a8 \
--extra-cflags='-arch armv7' \
--extra-ldflags='-arch armv7 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS5.1.sdk' \
--prefix=compiled/armv7 \
--enable-cross-compile \
--enable-nonfree \
--disable-armv5te \
--disable-swscale-alpha \
--disable-doc \
--disable-ffmpeg \
--disable-ffplay \
--disable-ffprobe \
--disable-ffserver \
--enable-decoder=h264 \
--enable-decoder=svq3 \
--disable-asm \
--disable-bzlib \
--disable-gpl \
--disable-shared \
--enable-static \
--disable-mmx \
--disable-neon \
--disable-decoders \
--disable-muxers \
--disable-demuxers \
--disable-devices \
--disable-parsers \
--disable-encoders \
--enable-protocols \
--disable-filters \
--disable-bsfs \
--disable-postproc \
--disable-debug
There is not enough information here.
What URL are you trying to open, for instance?
Were there messages in the log? I know that with version 0.11 you get a few warnings if you don't call avformat_network_init, but that wouldn't stop it from working.
Some things that worked in previous versions have changed, e.g. you used to be able to append ?tcp to tell ffmpeg to use TCP, but now it must be done through the options dictionary.
Please provide both the syslog and the build logs if possible.
Here's an example from one of our apps
avcodec_register_all();
avdevice_register_all();
av_register_all();
avformat_network_init();
const char *filename = [url UTF8String];
NSLog(@"filename = %@", url);
// err = av_open_input_file(&avfContext, filename, NULL, 0, NULL);
AVDictionary *opts = 0;
if (usesTcp) {
av_dict_set(&opts, "rtsp_transport", "tcp", 0);
}
err = avformat_open_input(&avfContext, filename, NULL, &opts);
av_dict_free(&opts);
if (err) {
NSLog(@"Error: Could not open stream: %d", err);
return nil;
}
else {
NSLog(@"Opened stream");
}
So assuming you do have a block of code like the following, what do you do with the audio? You have to use one of the audio APIs to process it; Audio Queues would probably be the easiest if you're dealing mostly with known types.
First, in your initialization, get the audio info from the stream:
// Retrieve stream information
if(av_find_stream_info(pFormatCtx)<0)
return ; // Couldn't find stream information
// Find the first video stream
videoStream=-1;
for(int i=0; i<pFormatCtx->nb_streams; i++) {
if(pFormatCtx->streams[i]->codec->codec_type==AVMEDIA_TYPE_VIDEO)
{
videoStream=i;
}
if(pFormatCtx->streams[i]->codec->codec_type==AVMEDIA_TYPE_AUDIO)
{
audioStream=i;
NSLog(@"found audio stream");
}
}
Then later in your processing loop do something like this.
while(!frameFinished && av_read_frame(pFormatCtx, &packet)>=0) {
// Is this a packet from the video stream?
if(packet.stream_index==videoStream) {
// Decode video frame
//do something with the video.
}
if(packet.stream_index==audioStream) {
// NSLog(@"audio stream");
// Do something with the audio packet; here we simply add it to a processing
// queue to be handled by another thread.
[audioPacketQueueLock lock];
audioPacketQueueSize += packet.size;
[audioPacketQueue addObject:[NSMutableData dataWithBytes:&packet length:sizeof(packet)]];
[audioPacketQueueLock unlock];
}
}
To play the audio, you can look at this for some examples:
https://github.com/mooncatventures-group/FFPlayer-beta1/blob/master/FFAVFrames-test/AudioController.m