Real-time metadata/captioning for live-streamed audio

How might I add a track of accurately aligned, real-time "additional" data to live-streamed audio? I'm primarily interested in the browser here, but ideally the solution would be possible on any platform.
The idea is: if I have a live recording from my computer being sent into Icecast via something like DarkIce, I want a listener (who could join the stream at any time) to be able to place some kind of annotation over a few of the samples and send only that annotation back (for example, using a regular HTTP request). However, this needs a mechanism to align the annotation with the dumped stream audio on the server side, and in a live stream the user AFAIK can't actually get a timestamp relative to the "whole" stream, only relative to when they joined. But if there were some kind of simultaneously aligned metadata, then perhaps this would be possible.
The problem is that most systems seem to assume you "pre-caption" or multiplex your data streams beforehand. That wouldn't make sense for something being recorded and live-streamed in real time. Google's examples seem to be mostly about their "live captioning" ability, which is more about processing audio in real time and then adding slightly delayed captions using speech recognition. That isn't what I'm after. I've looked into the various ways data is put into OGG containers, as well as current captioning formats like WebVTT, and I'm struggling to find examples of this.
I found maybe a hint here: https://github.com/w3c/webvtt/issues/320 and I've been recommended to look for examples by Apple and Google using WebVTT for something along these lines, but I cannot find those demos. There's older tech as well (Kate, CMML, Annodex, etc.), but none of it is in use anymore, having been completely replaced by WebVTT. Perhaps I could achieve something like this with WebRTC, but I'm not sure that gives any guarantees on alignment, and it's a slightly different technology stack than the one I'm looking at in this scenario.
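To make the idea concrete, the crude fallback I can picture without any in-band metadata is pure bookkeeping on the relay side: record when the dump started and when each listener connected, and let the client report offsets relative to its own join time. Something like this hypothetical endpoint (all names are made up), although it ignores client-side buffering latency, which is exactly the inaccuracy properly aligned metadata would remove:
<?php
// Hypothetical /annotate endpoint: maps a listener-relative offset back to a
// position in the server-side dump. Assumes the relay stores, per session,
// when the dump started and when that listener connected.
$pdo = new PDO('sqlite:annotations.db');

// $_POST['session_id']     - id handed to the listener when they connected
// $_POST['offset_seconds'] - seconds since the listener joined (measured client-side)
// $_POST['text']           - the annotation itself

$stmt = $pdo->prepare('SELECT joined_at - dump_started_at FROM sessions WHERE id = ?');
$stmt->execute([$_POST['session_id']]);
$joinOffset = (float) $stmt->fetchColumn();        // seconds into the dump at join time

$absoluteSeconds = $joinOffset + (float) $_POST['offset_seconds'];

$ins = $pdo->prepare('INSERT INTO annotations (dump_seconds, text) VALUES (?, ?)');
$ins->execute([$absoluteSeconds, $_POST['text']]);

header('Content-Type: application/json');
echo json_encode(['aligned_at_seconds' => $absoluteSeconds]);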

Related

How to send custom dimensions, medium, source or referer with an event via Measurement Protocol V2?

With v1 of the measurement protocol, you could use these parameters to add custom dimensions or change the medium, source or referrer for a page view:
https://ssl.google-analytics.com/collect?v=1&tid=UA-xxxxxxxx&cid=[custom-id]&t=pageview&dp=[Url of pageview]&dh=[hostname of pageview]&cm=[new-medium]&cs=[new-source]&dr=[new-referer]&cd1=[custom-dimension-1]&cd2=[custom-dimension-2]
How is it done in measurement protocol v2?
I couldn't find any documentation about the page-view event in V2 (for example, it's just not mentioned here:
https://developers.google.com/analytics/devguides/collection/protocol/ga4/reference/events), and even the event builder (https://ga-dev-tools.web.app/ga4/event-builder/) doesn't support a simple page view.
So, all I got so far is this:
$data = '{
  "client_id": "' . [custom-id] . '",
  "events": [
    {
      "name": "page_view",
      "params": {
        "page_location": "' . [Url of pageview] . '"
      }
    }
  ]
}';
So, what are possible parameters for a page-view-event?
Ok, a few things here right away that you should know if you're playing with MP:
Measurement Protocol is a poor name. It implies there's more than one protocol for data gathering. There isn't: there is just one protocol for tracking.
MP2 is still largely MP1. Google tries to position GA4 as a new product, but it's just our good old GA UA with a simplified backend and an over-engineered front-end that tries to deliver the level of quality Site Catalyst/Omniture/Adobe Analytics has been delivering for a decade. MP is largely the same. dr, cm, cs and a lot of other fields are still there. CDs aren't there anymore because they've been replaced with eps and ups, but more about that a bit later.
GA4 uses this big marketing claim that the new analytics is so wonderfully event-based, unlike the old one. When I dug into why they keep claiming it everywhere, I realized that the only difference is that pageviews are now events. Not much difference really. But yes, a pageview is just an event named page_view. We'll talk about it a bit more later.
Custom dimensions are no more. Now they're called event properties and user properties. The same thing really; Google just tries to make it less obvious that there are no more session-level custom dimensions, or product-level CDs, though the product level is seemingly on their roadmap. (There's a payload sketch after these points showing where they end up.)
Make sure you're using the correct measurement id. They made it a lot harder to find it in GA4. It's no longer just the property id visible in the property list, unfortunately.
GA's real-time reports don't include all dimensions, especially if those dimensions are involved in advanced metrics/dimensions calculations. Do not use real time reports for inspecting the content of your events. It's not meant for debugging. It's a vanity report. Still helpful to check the volume of events when you're sending a bunch and expect to see them in GA. Google even has a warning here:
Like the DebugView report, the Realtime report performs limited attribution analysis to ensure responsive reporting. We recommend that you refer to the Acquisition reports for the most accurate attribution information.
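To make the custom dimensions point above concrete: in the documented /mp/collect JSON body, a former custom dimension simply becomes an event parameter or a user property. A minimal sketch in PHP (the client_id and the property names here are made up):
<?php
// Former custom dimensions expressed as an event param and a user property.
// "plan_type" and "article_author" are invented example names.
$body = json_encode([
    'client_id' => '123.456',                           // placeholder
    'user_properties' => [
        'plan_type' => ['value' => 'premium'],          // former user-level CD
    ],
    'events' => [[
        'name'   => 'page_view',
        'params' => [
            'page_location'  => 'https://example.com/page',
            'article_author' => 'Jane Doe',             // former hit-level CD
        ],
    ]],
]);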
Finally, instead of reading the still-unfinished and not-really-helpful documentation on MP2, what I often do is either use a library like this,
or, since there is only one protocol (see the first point above), set up the equivalent tracking in my test GTM container, watch in the Network debugger what it sends, how, and where, and then simply reimplement it on my side exactly the way GTM does it. No magic involved. Here is how my GTM tag would look:
With a trigger on any click or any page load. Once that's done, I publish the container. Then I inject this GTM snippet into a local site, or my test site, or wherever else you want to test it, and trigger the tag you need to mimic with MP.
I use this wonderful extension to show all events that fire and their details right in my console.
Now this is how the above tag looks on my test site through the extension:
It's pretty useful.
How do I know that page_referrer is used as dr instead of ep in GTM? Here is the list of the fields that will never be seen as ep. But Google doesn't care enough to map them properly to what these fields are called in MP, so you either have to test, or know, or google it elsewhere.
Finally, here is what the network request looks like:
I published the tag to prod (I keep a test site in prod), so you can go and look at it. Or just find a site that uses GA4 and look at its network requests. How does Google know that this is a pageview? By the event name: en=page_view
Of course, you do the same with medium and source. Judging from the documentation I've linked to above, medium and source look like campaign_source and campaign_medium in GTM, and GTM maps them accordingly to the cs and cm fields. That's how you know these are the correct MP fields. Give GA time to process these and check on them in a few days.
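To illustrate that mapping (this is the undocumented gtag/GTM beacon, not the documented /mp/collect API, so treat the field names as observed behaviour rather than a contract; all values below are placeholders), the interesting query parameters of such a request look roughly like this:
<?php
// Rough reconstruction of the gtag/GTM /g/collect beacon fields discussed above.
$query = http_build_query([
    'v'   => '2',                         // protocol version
    'tid' => 'G-XXXXXXX',                 // measurement id
    'cid' => '123.456',                   // client id
    'en'  => 'page_view',                 // event name - this is what marks a pageview
    'dl'  => 'https://example.com/page',  // document location
    'dr'  => 'https://referrer.example/', // document referrer
    'cs'  => 'newsletter',                // campaign source
    'cm'  => 'email',                     // campaign medium
    'ep.article_author' => 'Jane Doe',    // custom event parameter
]);
echo 'https://www.google-analytics.com/g/collect?' . $query;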
Now, this is applicable to enhanced ecommerce hits too; they just typically have more variables and data structures in them.
Finally, if you want to simulate batched events, you can just make a few tags fire in rapid succession and GTM will neatly pack them into one network request if they fit. You can then study how the packing is done using the same methods described here and simulate it.
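If you want to batch through the Measurement Protocol directly instead, the documented /mp/collect endpoint also takes several events in one request via the events array (Google documents a limit of 25 events per request). A sketch with placeholder ids and example event names:
<?php
// Sending several events in one /mp/collect request.
$url = 'https://www.google-analytics.com/mp/collect'
     . '?measurement_id=G-XXXXXXX&api_secret=YOUR_SECRET';   // placeholders

$payload = json_encode([
    'client_id' => '123.456',
    'events' => [
        ['name' => 'page_view', 'params' => ['page_location' => 'https://example.com/a']],
        ['name' => 'scroll',    'params' => ['percent_scrolled' => 90]],
        ['name' => 'sign_up',   'params' => ['method' => 'email']],
    ],
]);

$ch = curl_init($url);
curl_setopt_array($ch, [
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => $payload,
    CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
    CURLOPT_RETURNTRANSFER => true,
]);
curl_exec($ch);   // MP returns a 2xx with an empty body whether or not GA liked the payload
curl_close($ch);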

How can I create a multi-part response with DialogFlow?

So far, I have a conversational app that works with webhooks to my backend PHP server, which sends JSON responses back to the Dialogflow API. So far, it's working rather well.
The next step in the development would be to have the Google Assistant respond to the user with multi-part responses. I've seen the "Lucky Trivia" game do something similar (screenshot attached).
It is not clear to me how I can have the Assistant App generate multiple bubbles.
Some solutions I've tried:
Using rich responses with multiple parts
Generating SSML responses and using several <speak> or <p> tags
Using message objects
Using a followupEvent object
None of these have gotten me to the point I'd like.
Rich responses will work for a maximum of two separate bubbles and no more.
SSML seems promising and is a great way to add prosody and sound bites, but everything I've tried will not deliver multi-part speech bubbles.
I can't find a syntax for message objects that works with "platform":"google". Indeed, specific support for platform=google isn't listed on that page, but I have seen it in some request/response JSON objects.
The followupEvent response seemed most promising, but as far as I can tell, the intent that triggers from the named event completely replaces the current response, it doesn't just add onto it.
So, my question is: What's the best strategy for getting similar multi-part messages on Google Assistant using DialogFlow?
Optimally, I'd like to fire new requests to my webhook sequentially, but building one large response containing all parts is a viable option if necessary.
How does Lucky Trivia do this?
I suspect that Lucky Trivia is able to get around the rules because it was made by Google and doesn't use the same library that we do. But let's look at each of your attempts and then some possible other approaches.
What doesn't work
As you note, RichResponses are limited to only two SimpleResponses which translate to two text bubbles. You could make larger responses, but there is still a suggested limit of 300 characters per bubble, and a hard limit of 640 characters.
The SSML responses, as the name suggests, are about what you hear - not so much what you see.
Message objects are turned into native platform objects anyway, so unless there was some way to support it in Google (and there isn't), then you can't do it.
Follow-up events are specifically documented to ignore the text that is returned from the original event. Their entire point is to delegate processing to the other intent.
What might work: Cards
This doesn't look exactly the same as what you want, but one way to get additional text included that is separate from the two bubbles is through a Basic card as one of the rich response items. You can even do some basic formatting in the card and include graphics.
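Since your webhook is already returning JSON from PHP, here is a sketch of what such a response might look like, assuming the Dialogflow v2 "payload.google" webhook format (adjust if you're still on v1):
<?php
// Sketch: two text bubbles plus a basic card with the overflow text.
$response = [
    'payload' => [
        'google' => [
            'expectUserResponse' => true,
            'richResponse' => [
                'items' => [
                    ['simpleResponse' => ['textToSpeech' => 'First bubble.']],
                    ['simpleResponse' => ['textToSpeech' => 'Second bubble.']],
                    ['basicCard' => [
                        'title'         => 'Extra details',
                        'formattedText' => 'Additional text that would not fit in the two bubbles.',
                    ]],
                ],
            ],
        ],
    ],
];
header('Content-Type: application/json');
echo json_encode($response);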
More complicated: Media Response
Including a Media response object with the rich response items is a way you can send multiple responses to the user without having to wait for them to say something. In this way, you can get multiple text bubbles in a row without the user having to reply.
The trick is that you'll send the two simple responses in the rich response, and then include a Media response with a very short, and possibly silent, audio file.
After the audio file finishes playing, you'll get an intent that indicates the media has finished playing. You can then send another reply with one or two more simple responses. If necessary, you can repeat this.
There are some downsides - the media player will show while it is playing, which will interrupt the bubbles, but once done it should clear. There will also be a pause in between some of the bubbles. But playing audio might also enhance your reply.
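A rough sketch of the media-response variant in the same style (the audio URL is a placeholder; as far as I know, suggestion chips are required whenever a media response leaves the mic open):
<?php
// Sketch: two bubbles now, then a short (ideally near-silent) clip. When it
// finishes, Google triggers the media-status intent and the webhook can reply
// with the next one or two bubbles.
$response = [
    'payload' => [
        'google' => [
            'expectUserResponse' => true,
            'richResponse' => [
                'items' => [
                    ['simpleResponse' => ['textToSpeech' => 'First bubble.']],
                    ['simpleResponse' => ['textToSpeech' => 'Second bubble.']],
                    ['mediaResponse' => [
                        'mediaType'    => 'AUDIO',
                        'mediaObjects' => [[
                            'name'       => 'pause',
                            'contentUrl' => 'https://example.com/half-second-of-silence.mp3',
                        ]],
                    ]],
                ],
                'suggestions' => [['title' => 'Continue']],
            ],
        ],
    ],
];
header('Content-Type: application/json');
echo json_encode($response);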

Terminology: "live-dvr" in mpeg-dash streaming

I'm working with live MPEG-DASH streaming, and I would like to know whether there is a standard term for a given piece of functionality.
It's the "live-dvr" functionality. That is, a mix between a live stream and VOD features: a live stream with the seeking bar in the player allowing to watch past stream time. This involves a series of infrastructure tweaks.
The term "live-dvr" for this setup is kind of informal, and different parties call it in its own way: "live catch-up", "live-vod", "cached live", some vendors set the name for this based on their product lines, and so on. I would like to know if there's a standard term for this kind of setup. Specially because interpreting the standard in order to understand setup parameters for the manifests may be confusing or even misleading without proper terminology.
The MPEG-DASH standard only mentions a timeShiftBufferDepth, which specifies how long after the availability of a segment it is still available on the server.
From the spec:
@timeShiftBufferDepth specifies the duration of the time shifting buffer for this Representation that is guaranteed to be available for a Media Presentation with type 'dynamic'.
There is no mention of DVR in the spec at all, so "time shift" seems to be the term used by MPEG-DASH. HLS, for example, does not mention DVR or time shift at all.
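A player or a monitoring script can read that attribute straight from the manifest and turn it into the seekable window; a rough PHP sketch (the URL is a placeholder, and this ignores fractional-second durations):
<?php
// Read @timeShiftBufferDepth from a live MPD and report the "DVR window".
$mpd = new SimpleXMLElement(file_get_contents('https://example.com/live/manifest.mpd'));

$depth  = (string) $mpd['timeShiftBufferDepth'];   // e.g. "PT1H" on a type="dynamic" MPD
$window = 0;
if ($depth !== '') {
    $di = new DateInterval($depth);                // DateInterval parses ISO 8601 durations
    $window = $di->d * 86400 + $di->h * 3600 + $di->i * 60 + $di->s;
}
echo "Seekable (time shift) window: {$window} seconds\n";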
DVR (Digital Video Recording, also known as nDVR - network DVR) is functionality that allows recording a live stream and playing it back from any moment of the recorded period. The live stream can keep running while the end user rewinds it to any particular moment in the past.
Typically media servers (like our Nimble Streamer) also provide time-shift and time range selection - see our links for details.

WoW Addon to REST API

I'm going to create a web service for learning purposes and wanted to combine it with my WoW hobby. My goal would be to create a "simple" addon which tracks my battleground activity in real time.
So when queuing for AB it would enter my data in a DB, and when I'm out of the BG it should delete the DB entry. The information should be stored in a JSON/XML file, and whenever the BG status changes it should execute the POST/update on the DB via the RESTful service.
The real-time communication is very important here, and I would like to know which ways of communicating with a web service are available, so I could dive right in and create a solution. I'd like to have resources rather than finished solutions.
Currently I'm not used to Lua, but I would like to learn it to gain the knowledge needed to create such a service. Which sites would you suggest for learning Lua, especially the WoW API?
Addons only write to disk when you log out of a character (and read that saved data when you log in) so what you intend would not be possible.*
More involved ways of communicating with the rest of the computer or even the internet are prohibited to prevent the gain of certain advantages, an example would be looking up details about your Arena opponents.
* Well, there are certainly some ways, but rather complicated ones: a program monitoring sound output to check when the BG queue pop sound is played, or a screen grabber that registers when the BG score screen comes up (though that can also be viewed during the match).

How do I seamlessly concatenate MP3 streams?

I'm working on a streaming server that will be capable of broadcasting targeted ads. Basically, listeners hear the same music, but every, say, 30 minutes comes a block of ads, and every listener gets his/her own block. Implementing such a streaming server poses various problems, and this question is about one of them.
The server will work in a manner similar to Icecast, i.e. it will read the stream over the network from some stream generator and relay it to every listener. When it's time to broadcast ads, the server stops fetching the stream from the generator, reads the ads from files, inserts them into each listener's buffer, transmits them, and then resumes relaying the stream from the generator.
When the server switches from relaying stream to broadcasting ads, it has to concatenate two MP3 streams (we broadcast in MP3). My concern is that simply appending one piece of data after another may produce some audible artifacts. Can it be done seamlessly?
I've already figured out this:
- I can make the server be aware of MP3 frames to avoid sync errors.
- I'm thinking about appending MP3 frames from the ad file after MP3 frames from the stream.
- Since the ad is loaded from a properly encoded MP3 file, I circumvent the problem of the bit reservoir, because the first frame of the file can't use it.
But my concern is the way the MDCT works. Listeners have no idea what my server will do, so their MP3 decoders may produce artifacts because unrelated MDCT data will be placed back to back in the stream they download. Will zero-padding at the beginning of the ad file compensate for this?
Do you know any libraries/tools (open source if possible) that can seamlessly join two MP3 files without decompressing them?
Can you point me to any good resources describing the MP3 format? I've searched the Internet a lot and found lots of information, but I still don't have the overall picture.
Or do you know whether this would be easier if I used another codec like Ogg Vorbis or AAC?
PS. This question is not a duplicate of What is the best way to merge mp3 files?. mp3wrap and tools alike are not an option for me.
I believe MP3s can be merged by simply concatenating the files. In some quick testing (cat file1.mp3 file2.mp3 > merged.mp3; mplayer merged.mp3) it seems to work as expected. Streaming from a web server probably will work just as well.
How are you going to handle switching the current input file? You can simply treat the advertisements as short tracks to play.
You should be able to concatenate mp3 files of both CBR and VBR formats.
MP3 files do not have a main header (disregarding ID3 and Xing). The audio data is stored as chunks where every chunk includes its own header. The header contains the necessary information (bitrate, sample frequency, stereo, etc) for the decoding of the audio data in that chunk.
This is one of the reasons why it is difficult to determine the duration of a mp3 file.
Another way of looking at it: if you concatenate a CBR MP3 file with a VBR file, the end result is the same as one long VBR file with the first section of audio at a constant bitrate.
The issue is that some MP3 players may be strict and expect a Xing header for a VBR MP3 file. That was never part of the MP3 specification, but it is now commonly assumed.
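To make the "every chunk carries its own header" point concrete, here's a rough sketch of walking MPEG-1 Layer III frames by reading each 4-byte header (tables trimmed to MPEG-1 Layer III; real code would also skip ID3 tags and handle MPEG-2/2.5):
<?php
// Sketch: walk MPEG-1 Layer III frames in a buffer using each frame's own header.
const BITRATES_KBPS = [0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320];
const SAMPLE_RATES  = [44100, 48000, 32000];

function next_frame(string $data, int $pos): ?array {
    if ($pos + 4 > strlen($data)) return null;
    $hdr = unpack('N', substr($data, $pos, 4))[1];
    if (($hdr & 0xFFE00000) !== 0xFFE00000) return null;        // 11-bit sync word
    if ((($hdr >> 19) & 0x3) !== 0x3) return null;              // MPEG-1 only
    if ((($hdr >> 17) & 0x3) !== 0x1) return null;              // Layer III only
    $bitrateIdx = ($hdr >> 12) & 0xF;
    $srIdx      = ($hdr >> 10) & 0x3;
    $padding    = ($hdr >> 9) & 0x1;
    if ($bitrateIdx === 0 || $bitrateIdx === 0xF || $srIdx === 0x3) return null;
    // Layer III frame length: 144 * bitrate / sample_rate (+1 byte of padding).
    $len = intdiv(144 * BITRATES_KBPS[$bitrateIdx] * 1000, SAMPLE_RATES[$srIdx]) + $padding;
    return ['offset' => $pos, 'length' => $len,
            'bitrate' => BITRATES_KBPS[$bitrateIdx], 'sample_rate' => SAMPLE_RATES[$srIdx]];
}

// Usage sketch: hop from frame to frame so the relay only ever emits whole frames.
// while ($f = next_frame($buffer, $pos)) { $pos = $f['offset'] + $f['length']; }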
If you're on Windows, the Microsoft DirectShow API may be the way to go. You should find that it is capable of handling audio and video both statically and streaming, in a variety of formats (you only need the necessary codecs, and the interface is virtually the same for all).
That said, DirectShow is unfortunately designed in a horribly intricate way and has a steep learning curve, but the power it offers is unparalleled if you're going to be doing audio/video manipulation on Windows. There are, however, a great number of samples and tutorials on how to use it, so it may not be so painful in the end. Also, if you're using the .NET Framework, there is a managed wrapper by the name of DirectShow.NET. It's not going to be an easy task whatever you do, unless there's something out there that I'm not aware of. Good luck with it anyway!
I approached a very similar problem, and after asking the right questions at various sources came up with the following...
Any worthy decoder will skip "bad" data until it hits a valid frame header. This is what ID3v2 relies upon to inject additional information into mp3 data. At the server, I'd go with analysis of source MP3 files to only serve valid MP3 frames. If you serve a few silent frames (about 7 should do it), the decoder should have time to settle before ramping up for the next load of (unassociated) MP3 data, avoiding the artefacts you (correctly) assume when concatenating frames from different encoding sessions.
More problematic is the possible switching of MP3 attributes (1/2 channels, output sample rate etc) between one frame to the next. Some decoders get quite upset when confronted with such a stream, resulting in 1/2 speed playback and the like. So, you need to ensure that all your source material is encoded to the same output attributes otherwise you may come unstuck.
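To make that check concrete, here's a sketch of the kind of guard a relay might run before splicing (assuming it has already located the raw 4-byte frame headers on both sides of the splice point):
<?php
// Sketch: refuse to splice if the ad's first frame differs from the live
// stream's frames in the attributes decoders care about. $streamHdr and
// $adHdr are assumed to be raw 4-byte MP3 frame headers already located.
function same_decode_attributes(string $streamHdr, string $adHdr): bool {
    $a = unpack('N', $streamHdr)[1];
    $b = unpack('N', $adHdr)[1];
    // Compare MPEG version (bits 20-19), layer (18-17),
    // sample rate index (11-10) and channel mode (7-6).
    $mask = (0x3 << 19) | (0x3 << 17) | (0x3 << 10) | (0x3 << 6);
    return ($a & $mask) === ($b & $mask);
}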
You may have seen this already, but if not:
http://www.devhood.com/tutorials/tutorial_details.aspx?tutorial_id=79&printer=t
I don't see why you would want to concatenate the files. Why don't you use some sort of playlist system and just change which file you're sending? I would think this would allow more flexibility in the long run, and you wouldn't end up with large MP3 files.