Please help me find information on Watson Speech to Text and Text to Speech regarding:
- Sampling rate
- Buffer size
- Number of channels, etc.
I am currently using a 10 KB buffer size for Text to Speech. Do you think that is OK, or do I need to increase it?
You can find information about the sampling rate for the audio that is returned by Text to Speech here:
http://www.ibm.com/watson/developercloud/doc/text-to-speech/http.shtml#sampling
This page of the documentation may also address some of your other questions. I will follow up on some of the other specific issues you mention.
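In the meantime, here is a minimal sketch of requesting synthesized audio over the HTTP interface and writing it out in fixed-size chunks, which is where your buffer size comes into play. It assumes the older Bluemix username/password credentials and the `audio/wav` output format; the placeholder credentials and the 10 KB chunk size below are assumptions, not requirements, and the sampling-rate options for each format are in the page linked above.

```python
import requests

# Placeholder service credentials (assumption: Bluemix username/password auth).
USERNAME = "your-tts-username"
PASSWORD = "your-tts-password"

TTS_URL = "https://stream.watsonplatform.net/text-to-speech/api/v1/synthesize"

params = {
    "voice": "en-US_AllisonVoice",              # any available voice
    "text": "Hello from Watson Text to Speech.",
}

# Request WAV output; see the sampling-rate documentation linked above for
# which formats let you override the rate (e.g. "audio/l16;rate=16000").
headers = {"Accept": "audio/wav"}

response = requests.get(TTS_URL, params=params, headers=headers,
                        auth=(USERNAME, PASSWORD), stream=True)
response.raise_for_status()

# A 10 KB buffer is fine for writing to disk; a larger buffer just means
# fewer writes, it does not change the audio itself.
with open("output.wav", "wb") as f:
    for chunk in response.iter_content(chunk_size=10 * 1024):
        f.write(chunk)
```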
How might I add a track of accurately aligned, real-time "additional" data to live-streamed audio? I am primarily interested in the browser here, but ideally the solution would be possible on any platform.
The idea is: if I have a live recording from my computer being sent into Icecast via something like DarkIce, I want a listener (who could join the stream at any time) to be able to place some kind of annotation over a few of the samples and send only the annotation back (for example, using a regular HTTP request). However, this needs a mechanism to align the annotation with the dumped stream audio on the server side, and in a live stream the user, AFAIK, can't actually get the timestamp within the "whole" stream, only from the point when they joined. But if there were some kind of simultaneously aligned metadata, then perhaps this would be possible.
The problem is that most systems seem to assume you "pre-caption" or multiplex your data streams beforehand. However, this wouldn't make sense for something being recorded and live-streamed in real time. Google's examples seem to be mostly around their ability to do "live captioning", which is more about processing audio in real time and then adding slightly delayed captions using speech recognition. This isn't what I'm after. I've looked into the various ways data is put into OGG containers, as well as current captioning formats like WebVTT, and I am struggling to find examples of this.
I found maybe a hint here: https://github.com/w3c/webvtt/issues/320 and I've been recommended to look for examples by Apple and Google using WebVTT for something along these lines, but I cannot find those demos. There's older tech as well (Kate, CMML, Annodex, etc.), but none of these are in use anymore, having been completely replaced by WebVTT. Perhaps I can achieve something like this with WebRTC, but I'm not sure it gives any guarantees on alignment, and it's a slightly different technology stack from the one I am looking at in this scenario.
I am planning to buy and use an AXIS Camera Station S2208 Appliance, and I am seeking a way to retrieve images stored on this recorder from a remote location via an API (not from the camera, but from the recorder).
I guess the VAPIX (or ONVIF) API is responsible for this task, but I am not sure where the exact description is (I looked over the VAPIX Library page but found no clue).
My questions are as follows:
To begin with, is it possible to retrieve images from the recorder via VAPIX (or ONVIF)?
If it is possible, where is the description on the VAPIX Library page (under Network video, Applications, Audio systems, Physical access control, or Radar)?
If not, are there any ways to do it?
I also searched the AXIS Camera Station User Manual and found the Developer API, but the details were not clear to me.
I posted here because I couldn't get an answer from the official page.
Any help would be great. Thanks!
For example, if someone wanted to make a journal skill, it might ask, "What would you like to add to your journal today?"
Some users may have a response that would be several sentences long or maybe even a few minutes. Is there any hard limit to how long a user's response/query to an action can be?
Although there is no specific limit on how long the user can speak, the Assistant does have some heuristics to determine when they are "done" talking. These heuristics seem to be better tuned for short replies than for long dictation, so it may treat even a slight pause as the "break".
There is currently no way to indicate that the user can talk for a longer duration, or to specify when they have finished their response. There are a few tricks you can work with (for example, responding quickly so they can continue talking, as sketched below), but the system is not currently well suited for long input.
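As a rough illustration of the "respond quickly so they can keep talking" trick, here is a minimal sketch of a Dialogflow webhook (Python/Flask) that replies with a very short prompt so the microphone reopens and the user can keep dictating across turns. The endpoint path, the in-memory session store, and the accumulation logic are assumptions for illustration, not part of any official long-dictation API.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)
journal_entries = {}  # hypothetical in-memory store keyed by session

@app.route("/webhook", methods=["POST"])
def webhook():
    body = request.get_json()
    session = body.get("session", "")
    user_text = body.get("queryResult", {}).get("queryText", "")

    # Append whatever the user said this turn to their running journal entry.
    journal_entries.setdefault(session, []).append(user_text)

    # Reply with a very short prompt; keeping the response brief reopens
    # the mic quickly so the user can continue dictating in the next turn.
    return jsonify({"fulfillmentText": "Go on..."})

if __name__ == "__main__":
    app.run(port=8080)
```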
For anyone still looking for this answer: after some digging, I found it nested deep in the Google Cloud docs while looking to build something similar.
Maximum detect intent text input length is 256 characters.
There are also some other handy limits listed there, so check it out: https://cloud.google.com/dialogflow/quotas
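For reference, here is a minimal sketch of guarding against that 256-character limit before calling detect intent with the google-cloud-dialogflow Python client. The project and session IDs are placeholders, and truncating at the limit is just one way to handle it; you could also split long text across several requests.

```python
from google.cloud import dialogflow

MAX_DETECT_INTENT_TEXT = 256  # per the Dialogflow quotas page

def detect_intent(project_id: str, session_id: str, text: str,
                  language_code: str = "en-US") -> str:
    # Trim input that would exceed the documented text-input limit.
    if len(text) > MAX_DETECT_INTENT_TEXT:
        text = text[:MAX_DETECT_INTENT_TEXT]

    client = dialogflow.SessionsClient()
    session = client.session_path(project_id, session_id)

    query_input = dialogflow.QueryInput(
        text=dialogflow.TextInput(text=text, language_code=language_code)
    )
    response = client.detect_intent(
        request={"session": session, "query_input": query_input}
    )
    return response.query_result.fulfillment_text

# Example with placeholder IDs:
# print(detect_intent("my-gcp-project", "demo-session", "Add this to my journal."))
```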
I'm not sure I understand what you are trying to build. Do you mean a "text to speech" response or an audio response?
A text-to-speech response has the following limit (from the AoG site):
640 character limit per chat bubble. Strings longer than the limit are truncated at the first word break (or whitespace) before 640 characters.
A media response, on the other hand, has no defined limit:
Media responses let your Actions play audio content with a playback duration longer than the 120-second limit of SSML. The primary component of a media response is the single-track card. The card allows the user to perform these operations [...]
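To make the difference concrete, here is a sketch of a Dialogflow webhook payload that sends a short chat bubble (subject to the 640-character limit) followed by a media response that plays a longer audio file. The media object name and the content URL are placeholders; the overall structure follows the Actions on Google rich-response format.

```python
# Sketch of a Dialogflow webhook response for Actions on Google:
# a short simpleResponse (chat bubble, 640-character limit) plus a
# mediaResponse for audio longer than SSML's 120-second cap.
webhook_response = {
    "payload": {
        "google": {
            "expectUserResponse": True,
            "richResponse": {
                "items": [
                    {
                        "simpleResponse": {
                            "textToSpeech": "Here is your recording."
                        }
                    },
                    {
                        "mediaResponse": {
                            "mediaType": "AUDIO",
                            "mediaObjects": [
                                {
                                    "name": "Journal entry",  # placeholder title
                                    "contentUrl": "https://example.com/audio/entry.mp3"  # placeholder URL
                                }
                            ]
                        }
                    }
                ],
                "suggestions": [{"title": "Stop"}]
            }
        }
    }
}
```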
Hope this helps.
I would like to know if Bluemix, and potentially Watson, could do the following: if multiple people are having a conversation via one or many microphones as a streamed audio source, could it also identify a person's tone / spectrum, i.e. which of them is producing the sound? Thanks, Markku
There's no Watson API to work on an audio stream at the moment.
As you pointed out, you could use Speech to Text to get a transcript of the conversation and potentially Tone Analyzer to get a sentiment analysis, but this won't be enough to determine who the speaker is.
If you want to know more about how those two services work, please check their pages on the Watson Developer Cloud.
Here is an example application and the code for combining STT and Tone Analyzer:
Application
https://realtime-tone.mybluemix.net/
Get the code here to use for your own applications:
https://github.com/IBM-Bluemix/real-time-tone-analysis
Julia
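If you want to experiment with that combination yourself, here is a minimal sketch using the ibm-watson Python SDK: it transcribes a file with Speech to Text and passes the transcript to Tone Analyzer. The API keys and file name are placeholders, you may also need to set the service URL for your region, and, as noted above, this still won't tell you who is speaking.

```python
# Sketch: transcribe a WAV file with Speech to Text, then run the
# transcript through Tone Analyzer (ibm-watson SDK; keys are placeholders).
from ibm_watson import SpeechToTextV1, ToneAnalyzerV3
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

stt = SpeechToTextV1(authenticator=IAMAuthenticator("YOUR_STT_APIKEY"))
tone = ToneAnalyzerV3(version="2017-09-21",
                      authenticator=IAMAuthenticator("YOUR_TONE_APIKEY"))
# Depending on your instance you may also need stt.set_service_url(...) etc.

with open("conversation.wav", "rb") as audio:
    stt_result = stt.recognize(audio=audio,
                               content_type="audio/wav").get_result()

# Join the best alternative from each recognized segment into one transcript.
transcript = " ".join(
    segment["alternatives"][0]["transcript"]
    for segment in stt_result["results"]
)

tone_result = tone.tone({"text": transcript},
                        content_type="application/json").get_result()

for t in tone_result["document_tone"]["tones"]:
    print(t["tone_name"], t["score"])
```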
I am converting text to speech using the Nuance SDK, and it works fine.
I want to mail the output to the user as a file, "voice.wav" for example.
Being new to this field, I'm not sure whether this text-to-speech process creates an output file.
I don't see an output file; does it exist?
Can I make it generate one?
Thanks in advance.
At this time, the SDKs/libraries don't expose access to the raw audio data. This is done in an effort to guarantee an optimal audio subsystem as well as to simplify the process of speech-enabling apps.
Depending on the plan you're enrolled in, you may be able to use the HTTP service, which means you will have to construct your own audio layer. That said, this is your best bet for getting access to the audio data if you need it.
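As a rough sketch of what "constructing your own audio layer" could look like: call the HTTP TTS service directly and write the returned bytes to a file you can then attach to an email. The endpoint URL, credentials, voice name, and Accept header below are hypothetical placeholders; check the Nuance HTTP service documentation for your plan for the real values.

```python
import requests

# Hypothetical endpoint and credentials -- replace with the values from
# the Nuance HTTP TTS documentation for your plan.
TTS_ENDPOINT = "https://tts.example-nuance-host.com/tts"
APP_ID = "YOUR_APP_ID"
APP_KEY = "YOUR_APP_KEY"

params = {
    "appId": APP_ID,
    "appKey": APP_KEY,
    "voice": "Samantha",  # placeholder voice name
}
headers = {
    "Content-Type": "text/plain",
    "Accept": "audio/x-wav;codec=pcm;bit=16;rate=16000",  # placeholder format
}

resp = requests.post(TTS_ENDPOINT, params=params, headers=headers,
                     data="Hello, this is your message.")
resp.raise_for_status()

# The response body is the raw audio, so writing it out gives you the
# "voice.wav" file you can then attach to an email.
with open("voice.wav", "wb") as f:
    f.write(resp.content)
```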