Google Cloud Text-to-Speech WaveNet API character conversion rate

What is the character or word conversion rate for Google Cloud's Text-to-Speech WaveNet API? I want to estimate rough timestamps for long WaveNet-generated audio files. Even a rough estimation would do.

It takes about 0.5 s to convert about 150 characters.
That's the size of the blocks I convert, so I can't say how much of that is invocation overhead; i.e. it may be more efficient to generate audio from larger amounts of text.
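If a formula helps, here's a trivial Python sketch of that estimate. The rate constant is just the measurement above, not a documented quota, so treat the output as a ballpark:

    # Back-of-the-envelope estimate based on the rate observed above
    # (~150 characters per 0.5 s, i.e. ~300 characters per second).
    CHARS_PER_SECOND = 150 / 0.5

    def estimated_synthesis_seconds(text):
        """Rough time for the API to synthesize `text`."""
        return len(text) / CHARS_PER_SECOND

    print(estimated_synthesis_seconds('x' * 4500))  # -> 15.0 seconds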

Related

How to extract digits (number) using Matlab

At work I have to record a lot of data from PNG images. Every time I have to manually type the digits (e.g. mean\SD 101.1\11) into an Excel sheet and then read it with Matlab. Would it be possible for Matlab to read the digits directly from the PNG image, so that a lot of work could be saved?
I know it might involve pattern recognition, but I still hope someone has done this before.
You can make use of Optical Character Recognition (OCR). The code for it is available here.
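If stepping outside Matlab is an option, here's a minimal sketch using Python's pytesseract wrapper around the Tesseract engine (readout.png is a hypothetical filename, and whitelist support varies by Tesseract version):

    from PIL import Image
    import pytesseract  # requires the Tesseract engine to be installed

    # Treat the crop as a single text line and restrict output to digits
    # and dots, which suits readouts like "101.1".
    config = '--psm 7 -c tessedit_char_whitelist=0123456789.'

    text = pytesseract.image_to_string(Image.open('readout.png'), config=config)
    print(text)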

Which 2d barcode has the highest data capacity/density

If you wanted to encode 2 MB of data onto a 2D barcode, which barcode would be a good starting point or recommendation?
There are lots of different types of 2D barcodes out today: Aztec, MaxiCode, PDF417, Microsoft HCCB, VeriCode, etc., all unique in their own way.
I guess, in a nutshell, my question is: which barcode would make a good starting point for encoding 2 MB of data?
I tried reading through the QR code international standard; it turns out that even at version 40-L, the most data you can encode onto a QR code is:
1) numeric data: 7,089 characters
2) alphanumeric data: 4,296 characters
3) 8-bit byte data: 2,953 characters
4) Kanji data: 1,817 characters
which are all a far cry from the roughly 16.8 million bits that make up 2 MB.
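To put numbers on that, a quick Python sanity check of how many maxed-out QR codes 2 MB would need:

    payload_bits = 2 * 1024 * 1024 * 8   # 2 MB = 16,777,216 bits (~16.8 million)
    per_code_bits = 2953 * 8             # 8-bit byte capacity of one 40-L code
    codes_needed = -(-payload_bits // per_code_bits)  # ceiling division
    print(codes_needed)                  # 711 separate QR codes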
My goal was to create something like this:
http://realestatemobilemarketingsolutions.com/wp-content/uploads/2012/07/real-estate-mobile-marketing.png
After you scan the barcode you can view photos of the house/property on your phone, without having to walk in or wait for an open home; 20 photos @ 100 KB each is about 2 MB.
Even if you could create a single 2D barcode which will encode the whole thing, the user won't be able to scan the whole thing in one go. No one has a cellphone imager which will support that kind of resolution. Your best bet is to do a QR-code with a URL in it.
Things like DataMatrix and QR-codes are extensible. You have a limit to how much data can be encoded into one block, but you CAN create a code which has multiple blocks. Indeed, if you look at this page, you'll see a discussion of using pages full of 2D barcodes as a form of data backup. They were able to fit up to 1/2 MByte of raw data into a single page. That's at 600 dpi, which will require a scanner (not a smartphone) to decode.
From what I've been reading, DataMatrix tends to have less overhead and, therefore, will stuff more (payload) data into a square inch for a given DPI. You would need a mobile app capable of shooting multiple images (tiles) of a very large image and either:
compositing the individual images into one large one for decoding OR
decoding each of the smaller blocks and reconstructing the original data from the pieces
I know of no app which will do that.
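For what it's worth, the multi-block idea could look something like this Python sketch using the common qrcode package. The 4-byte index header is my own made-up framing, not part of any standard, and you'd still need a reader app that reassembles the tiles:

    import qrcode  # pip install qrcode

    CHUNK = 2900  # keep each tile under the ~2953-byte capacity of a 40-L code

    def make_tiles(data):
        """Split `data` (bytes) into indexed chunks, one QR image per chunk."""
        total = -(-len(data) // CHUNK)  # ceiling division
        for i in range(total):
            # 2-byte index + 2-byte total lets a reader reassemble in order.
            header = i.to_bytes(2, 'big') + total.to_bytes(2, 'big')
            qr = qrcode.QRCode(error_correction=qrcode.constants.ERROR_CORRECT_L)
            qr.add_data(header + data[i * CHUNK:(i + 1) * CHUNK])
            qr.make(fit=True)
            qr.make_image().save('tile_%04d.png' % i)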
I've pondered providing bulk data via 2D barcodes. I considered publishing a mobile app in a magazine, providing a way for people to "download" the app from the magazine without needing to provide a website / FTP site where they could download it. I'd first need to provide an app which could decode such a monster. Then the end user would have to be patient enough to scan the whole thing. Good luck with that.
I MIGHT be able to provide a large 2D barcode containing a .torrent file, and then use existing BitTorrent apps to download the resulting app; I have a .torrent for a recent Linux live DVD where the .torrent is < 32 KB.
A chunk of data (an app or images) in the MB-or-larger range, like the megabytes you're wanting to provide, is really not feasible through this channel.
Voiceye Code is the highest-density 2D code I have been able to find. It works well too, but the code-making software is prohibitively priced to screw around with: around $500.
How about using some variant of DataGlyphs, which has a lot in common with steganography? In other words, you use a greyscale image to also store your data...
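As a toy illustration of that idea (plain least-significant-bit steganography in Python, not actual DataGlyphs), one bit per greyscale pixel:

    from PIL import Image

    def embed(img, data):
        """Hide `data` (bytes) in the least significant bit of each grey pixel."""
        grey = img.convert('L')
        pixels = list(grey.getdata())
        bits = [(byte >> i) & 1 for byte in data for i in range(8)]
        assert len(bits) <= len(pixels), 'image too small for payload'
        for i, bit in enumerate(bits):
            pixels[i] = (pixels[i] & ~1) | bit
        out = Image.new('L', grey.size)
        out.putdata(pixels)
        return out

At one bit per pixel you'd need a ~16.8-megapixel image for 2 MB, so real systems pack more bits per mark than this toy does.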
I have developed a reader for JAB Codes that can read a whole audio file from a single barcode. JAB Codes have very high capacity due to their polychrome (multi-color) nature.
More on this here

Tesseract OCR: How to find the read-error-magnitude of each returned character?

I'm using the Tesseract OCR engine in an iPhone application to read specific numeric fields from bill invoice photos.
Using a lot of photo pre-processing (adaptive thresholding, artifact cleaning, etc.) the results are finally fairly accurate, but there are still some cases I want to improve.
If the user takes a photo in low-light conditions and there is some noise or artifacts in the picture, the OCR engine interprets these artifacts as additional digits. In some rare cases it can read e.g. a numeric amount of "32,15" EUR as "5432,15" EUR, and this is not at all good for the final user's confidence in the product.
I assume that, if there is an internal OCR-engine read error associated with each character read, it will be higher on the "54" digits of my previous example, as they are recognized over small noise pixels; and if I had access to these reading-error values I would be able to easily discard the erroneous digits.
Do you know of any method to get a reading error magnitude (or any "accuracy factor" value) for each individual character returned from tesseract OCR engine?
It is called "confidence" value in Tesseract terminology. Search for that term in tesseract-ocr Group turned up many answers that mention about a TesserractExtractResult method.
The hOCR output also contains this value.
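For illustration, here's how the confidence surfaces through pytesseract's data output in Python (invoice.png is a hypothetical filename):

    from PIL import Image
    import pytesseract

    # image_to_data returns per-word boxes plus a 'conf' field (0-100,
    # -1 for non-text rows); low-confidence "digits" can then be dropped.
    data = pytesseract.image_to_data(Image.open('invoice.png'),
                                     output_type=pytesseract.Output.DICT)
    for word, conf in zip(data['text'], data['conf']):
        if word.strip():
            print(word, conf)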

Best way to send 10,000 doubles over HTTP

I have a client application (iPhone) that collects 10,000 doubles. I'd like to send this double array over HTTP to an appengine server (java). I'm looking for the best way to do this.
Best can be defined as some combination of ease of programming and compactness of representation, as the amount of data can be quite high.
My current idea is that I will convert the entire array of doubles to a string representation and send that as a POST parameter, on the server parse the string and convert back to a double array. Seems inefficient though...
Thanks!
I think you kind of answered your own question :) The big thing to beware of is differences between the floating-point representation on the device and the server. These days they're both going to be little-endian and (mostly) IEEE-754 compliant. However, there can still be some subtle differences in implementation that might bite, e.g. handling of denormals and infinities, but you can likely get away with ignoring them. I seem to recall a few of the edge cases in NEON (used in the iPhone's Cortex-A8) aren't handled the same as on x86.
If you do send as a string, you'll end up with a decimal/binary conversion in between, and potentially lose accuracy. This isn't that inefficient, though - it's only 10,000 numbers - unless you're expecting thousands of devices pumping this data at your server non-stop.
If you'd like some efficiency in the wire transfer and on the device side, then one approach is to just send the doubles in their raw binary form. On the server, reparse them into doubles (Double.longBitsToDouble). Make sure you get the endianness right when you grab the data as longs (it'll be fairly obvious when it's wrong).
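As a sketch of the raw-binary round trip (in Python for brevity; big-endian "network order" matches what Java's DataInputStream.readDouble expects):

    import struct

    values = [0.1 * i for i in range(10000)]

    # Pack all 10,000 doubles as big-endian IEEE-754: exactly 80,000 bytes.
    payload = struct.pack('>10000d', *values)

    # The server reverses this; in Java, DataInputStream.readDouble
    # (or Double.longBitsToDouble on the raw longs) reads the same bytes.
    decoded = struct.unpack('>10000d', payload)
    assert list(decoded) == values  # binary round trip is exact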
I guess that there are lots and lots of different ways to do this. If it were me, I would probably just serialize to an array of bytes and then base64-encode it; most other mechanisms will significantly increase the volume of data being passed.
10k doubles is 80k binary bytes, which is about 107k characters base64-encoded (3 doubles is 24 binary bytes is 32 base64 characters). There's tons of base64 conversion example source code available.
This is far preferable to any decimal-representation conversion, since the decimal conversion is slower and, worse, potentially lossy.
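Continuing the Python sketch above, the base64 step is a one-liner, and the sizes match the estimate:

    import base64
    import struct

    payload = struct.pack('>10000d', *[0.1 * i for i in range(10000)])  # 80,000 bytes

    encoded = base64.b64encode(payload)
    print(len(encoded))  # 106668 characters, ~107k as estimated

    # Decode and unpack on the server side.
    decoded = struct.unpack('>10000d', base64.b64decode(encoded))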
JSON: for the iPhone, encode with yajl-obj-c; for the Java side, read with JSONArray.
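For illustration only, here's what that payload looks like end-to-end in Python (the actual encode/decode would be yajl-obj-c on the device and JSONArray on the server, as above):

    import json

    values = [0.1 * i for i in range(10000)]

    # The POST body is just a JSON array of numbers: "[0.0, 0.1, 0.2, ...]"
    body = json.dumps(values)
    restored = json.loads(body)

Note that JSON is still a decimal representation, so the precision caveat from the previous answer applies.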
If you have a working method, and you haven't identified a performance problem, then the method you have now is just fine.
Don't go trying to find a better way to do it unless you know it doesn't meet your needs.
On inspection it seems that on the Java side, a double (64 bits) will be about 4 characters (4 x 16-bit chars). Now, when I think of your average double (let's say 10 digits and a decimal point, plus some delimiter like a space or semicolon), you're looking at about 12 characters per double. That's only 3x as much.
So you originally had 80k of data, and now you have 240k of data. Is that really that much of a difference?

How to export sound from timeline of sounds on iOS with OpenAL

I'm not sure if it's possible to achieve what I want, but basically I have an NSDictionary which represents a recording. It's a timeline of which sound ID was played at which point in time.
I have it so that you can play back this timeline/recording, and it works perfectly.
I'm wondering if there is any way to take this timeline and export it as a single sound that could be saved to a computer when the device is synced with iTunes.
So basically I'm asking if I can take a timeline of sounds, play it back and have these sounds stitched together as a single sound, that can then be exported.
I'm using OpenAL as my sound framework and the sound files are all CAFs.
Any help or guidance is appreciated.
Thanks!
You will need:
A good understanding of linear PCM audio format (See Wikipedia's Linear PCM page).
A good understanding of audio sample-rates and some basic maths to convert your timings into sample-offsets.
An awareness of how two's-complement binary numbers (signed/unsigned, 16-bit, 32-bit, etc.) are stored in computers, and how the endian-ness of a processor affects this.
Patience, interest in learning, and a strong desire to get this working.
Here's what to do:
Enable file sharing in your app (UIFileSharingEnabled=YES in Info.plist) and write files to the /Documents directory.
Render the sounds you use into memory buffers containing linear PCM audio data (if they are not already, i.e. if they are compressed). You can do this using the offline rendering functionality of Audio Queues (see Apple's audio queue docs). It will make things a lot easier if you render them all to the same PCM format and sample rate (for example 16-bit signed samples @ 44,100 Hz; I'll use this format for all examples), and use the same format for your output. I recommend starting off with a mono format, then adding stereo once you get it working.
Choose an uncompressed output format and mix your sounds into a single stream:
3.1. Allocate a buffer large enough, or open a file stream to write to.
3.2. Write out any headers (for example if using WAV output instead of raw PCM), and write zeros (or the mid-point of your sample range if not using a signed sample format) for any initial silence before your first sound starts. For example, if you want 0.1 seconds of silence before your first sound, write 4,410 (0.1 x 44,100) zero samples, i.e. 4,410 16-bit shorts all set to zero.
3.3. Now keep track of all 'currently playing' sounds and mix them together. Start with an empty list of 'currently playing' sounds, and keep track of the 'current time' of the sample you are mixing; for each sample you write out, increment the 'current time' by 1.0/sample_rate. When it is time for another sound to start, add it to the 'currently playing' list with a sample offset of 0. To do the mixing, iterate through all of the 'currently playing' sounds, add together their current samples, then increment the sample offset for each of them. Write the summed value into the output buffer. For example, if soundA starts at 0.1 seconds (after the silence) and soundB starts at 0.2 seconds, you will be doing the equivalent of output[8820] = soundA[4410] + soundB[0]; for sample 8820, then output[8821] = soundA[4411] + soundB[1]; for sample 8821, etc. As a sound ends (you get to the end of its samples), simply remove it from the 'currently playing' list, and keep going until the end of your audio data.
3.4. The simple mixing (sum of samples) described above does have some problems. For example, if two samples have values that add up to a number larger than 32767, that cannot be stored in a signed 16-bit number; this is called clipping. For now, just clamp the value to 32767 and get it working; later on, come back and implement a simple limiter (see the description at the end).
Now that you have a mixed version of your track in an uncompressed linear PCM format, that might be enough, so write it to /Documents. If you want to write it in a compressed format, you will need to get the source for an audio encoder and run your linear PCM output through that.
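Here's a minimal Python sketch of the mixing loop from step 3.3 (lists of 16-bit integer samples; for simplicity it sums each sound into the output buffer up front, which is equivalent offline to the per-sample 'currently playing' bookkeeping):

    SAMPLE_RATE = 44100

    def mix(events, total_seconds):
        """events: list of (start_seconds, samples) pairs, samples being ints.
        total_seconds must be long enough to cover the last sound."""
        out = [0] * int(total_seconds * SAMPLE_RATE)
        for start, samples in events:
            offset = int(start * SAMPLE_RATE)
            for i, s in enumerate(samples):
                out[offset + i] += s  # sum of everything playing at this sample
        # Crude clamp to the signed 16-bit range; see the limiter below.
        return [max(-32768, min(32767, s)) for s in out]

    # soundA at 0.1 s and soundB at 0.2 s, as in the example above:
    # mixed = mix([(0.1, soundA), (0.2, soundB)], total_seconds=3.0)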
Simple limiter:
Let's choose to limit the top 10% of the sample range, so if the absolute value is greater than 29490 (int limitBegin = (int)(32767 * 0.9f);) we will scale down the value. The maximum possible peak would be int maxSampleValue = 32767 * numPlayingSounds; and we want to scale values above limitBegin to peak at 32767. So do the summation into sampleValue as per the very simple mixer described above, then:
if(sampleValue > limitBegin)
{
    // Fraction (0..1) of the overshoot between limitBegin and the maximum possible peak.
    float overLimit = (sampleValue - limitBegin) / (float)(maxSampleValue - limitBegin);
    // Rescale the overshoot so that maxSampleValue maps to exactly 32767.
    sampleValue = limitBegin + (int)(overLimit * (32767 - limitBegin));
}
If you're paying attention, you will have noticed that when numPlayingSounds changes (for example when a new sound starts), the limiter becomes more (or less) harsh and this may result in abrupt volume changes (within the limited range) to accommodate the extra sound. You can use the maximum number of playing sounds instead, or devise some clever way to ramp up the limiter over a few milliseconds.
Remember that this is operating on the absolute value of sampleValue (which may be negative in signed formats), so the code here is just to demonstrate the idea. You'll need to write it properly to handle limiting at both ends (peak and trough) of your sample range. Also, there are some tricks you can do to optimize all of the above during the mixing - you will probably spot these while you're writing the mixer, be careful and get it working first, then go back and refactor/optimize if needed.
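Here's one way to write the limiter so it handles both ends of the range, per the note above (a Python sketch; maxSampleValue follows the same definition as in the answer):

    LIMIT_BEGIN = int(32767 * 0.9)  # 29490: start limiting in the top 10%

    def limit(sample_value, num_playing_sounds):
        max_sample = 32767 * num_playing_sounds
        magnitude = abs(sample_value)
        if magnitude > LIMIT_BEGIN:
            # Map [LIMIT_BEGIN, max_sample] onto [LIMIT_BEGIN, 32767].
            over = (magnitude - LIMIT_BEGIN) / float(max_sample - LIMIT_BEGIN)
            magnitude = LIMIT_BEGIN + int(over * (32767 - LIMIT_BEGIN))
        # Restore the sign, so peaks and troughs are limited symmetrically.
        return magnitude if sample_value >= 0 else -magnitude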
Also remember to consider the endian-ness of the platform you are using and the file-format you are writing to, as you may need to do some byte-swapping.
One approach which isn't too hard, if your files are stored in a simple format, is just to combine them together manually. That is, create a new file in the CAF format and manually put together the pieces you want.
This will be really easy if the sounds are uncompressed (linear PCM). But first read the documentation on the CAF file format here:
http://developer.apple.com/library/mac/#documentation/MusicAudio/Reference/CAFSpec/CAF_spec/CAF_spec.html#//apple_ref/doc/uid/TP40001862-CH210-SW1