Is there any test link for MPEG-DASH that is both type="dynamic" and has multiple representations (bitrates)? - streaming

As the title says:
I have tried to find some, but in most cases,
if the test URL is of type = "dynamic" there is ONLY ONE representation (a single bitrate, so bitrate switching CANNOT be applied).
Does anyone know of such a test link?
Thanks

There are several DASH data sets and test vectors out there; lots of them are listed in this blog post. Many don't have live streams, but some do (at least simulated live streams).
The DASH-IF Test Vectors might be a good starting point: there are several live streams (look at the mpd_type column and search for the value dynamic), and at least some should have multiple representations.
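If it helps to filter those test vectors programmatically, here is a minimal sketch (Python with requests and ElementTree; the URL list is a placeholder you would fill from the test-vector pages above) that keeps only manifests with type="dynamic" and more than one Representation:

```python
# Sketch: filter candidate MPD URLs for type="dynamic" manifests
# that expose more than one video Representation.
import requests
import xml.etree.ElementTree as ET

MPD_URLS = [
    "https://example.com/live/manifest.mpd",  # placeholder URL
]

NS = {"dash": "urn:mpeg:dash:schema:mpd:2011"}

for url in MPD_URLS:
    mpd = ET.fromstring(requests.get(url, timeout=10).content)
    if mpd.get("type") != "dynamic":
        continue  # not a live / simulated-live manifest
    for aset in mpd.iter("{urn:mpeg:dash:schema:mpd:2011}AdaptationSet"):
        reps = aset.findall("dash:Representation", NS)
        if len(reps) > 1:
            bandwidths = sorted(int(r.get("bandwidth", 0)) for r in reps)
            print(url, "->", len(reps), "representations, bandwidths:", bandwidths)
```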

Related

Classification of gender for given names

After some research I have not yet been able to find a suitable open-source library or software that I can use to classify a long table of first names I have by most likely gender.
For my application I have a set of first names from many different countries, and many of them are also pretty exotic.
For example, when I tried to use Genderize I could only get 1/8 of the names classified, while the remaining ones are labeled as Unknown (I made sure that the format is correct, no lower/upper-case ambiguity, etc.).
Any advice would be appreciated. Thank you in advance!
For the record, the best I could do was really just to go through the names manually, looking them up on Google or on dedicated websites such as https://namepedia.org. I am afraid there is no automated solution for my use case, mostly for the following reasons:
Many names are somewhat archaic (I could not even recognise several names from my own nationality)
Many names were truncated to form nicknames or had two nearby letters swapped: here a lookup-table (LUT) approach would fail, and one would instead need a score from a model
There were several names not based on the Roman alphabet, where the mapping into Roman characters introduced some ambiguities, I guess
For those curious of the original dataset, this is part of a Kaggle challenge (Spaceship Titanic, https://www.kaggle.com/competitions/spaceship-titanic).
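For completeness, this is roughly what a dictionary/lookup-table approach looks like (a sketch using the open-source gender-guesser package, which is not mentioned above; it is exactly the kind of LUT approach that breaks down on archaic, truncated, or transliterated names, so expect many "unknown" results):

```python
# Sketch of a lookup-table approach (pip install gender-guesser).
# Rare or transliterated names will often come back as "unknown",
# which is the limitation described above.
import gender_guesser.detector as gender

detector = gender.Detector()

names = ["Maria", "John", "Altark", "Juanna"]  # hypothetical sample names
for name in names:
    # Returns one of: male, female, mostly_male, mostly_female, andy, unknown
    print(name, "->", detector.get_gender(name))
```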

Is it possible to store multiple video segments, each with its own encoding parameters, in one track of an MP4 file?

I want to encode a sequence of video frames (FHD) into an H.264 stream in a way like this:
From time t1 to time t2: encode with "main" profile, FHD and at 30fps.
From time t3 to time t4: encode with "high" profile, HD(scaled) and at 15fps.
From time t5 to time t6: encode with "main" profile, FHD and at 30fps.
Note: t1 < t2 < t3 < t4 < t5 < t6.
My question is: while complying with the MP4 standard, is it possible to put video streams encoded with different parameters into the same video track of an MP4 file? If it is impossible, what is the best alternative?
Yes, at least according to the specification. If you look at ISO/IEC 14496-15 (3rd edition), it contains a definition of Parameter set track:
A sync sample in a parameter set track indicates that all parameter sets needed
from that time forward in the video elementary stream are in that or succeeding parameter stream
samples. Also there shall be a parameter set sample at each point a parameter set is updated. Each
parameter set sample shall contain exactly the sequence and picture parameter sets needed to
decode the relevant section of the video elementary stream.
As I understand it, in this case instead of writing the initial SPS/PPS data into the avcC box in stbl, you write a separate track containing the changing SPS/PPS data as sync samples. So at least according to the spec, you would have samples in that stream with presentation times t1, t2, t3, t4, t5, and the samples themselves would contain the updated SPS/PPS data. This quote from the same standard seems to agree:
Parameter sets: If a parameter set elementary stream is used, then the sample in the parameter
stream shall have a decoding time equal or prior to when the parameter set(s) comes into effect
instantaneously. This means that for a parameter set to be used in a picture it must be sent prior to the
sample containing that picture or in the sample for that picture.
NOTE Parameter sets are stored either in the sample descriptions of the video stream or in the parameter set
stream, but never in both. This ensures that it is not necessary to examine every part of the video elementary
stream to find relevant parameter sets. It also avoids dependencies of indefinite duration between the sample that
contains the parameter set definition and the samples that use it. Storing parameter sets in the sample
descriptions of a video stream provides a simple and static way to supply parameter sets. Parameter set
elementary streams on the other hand are more complex but allow for more dynamism in the case of updates.
Parameter sets may be inserted into the video elementary stream when the file is streamed over a transport that
permits such parameter set updates.
ISO/IEC 14496-15 (3rd edition) also defines the additional avc3 / avc4 sample entry types, which, when used, should allow you to actually write the parameter sets in-band with the video NAL units:
When the sample entry name is 'avc3' or 'avc4', the following applies:
If the sample is an IDR access unit, all parameter sets needed for decoding that sample shall be included either in the sample entry or in the sample itself.
Otherwise (the sample is not an IDR access unit), all parameter sets needed for decoding the sample shall be included either in the sample entry or in any of the samples since the previous random access point to the sample itself, inclusive.
A different question is: even though the standard allows at least two ways to achieve this (in-band with avc3, out-of-band with a parameter set track), how many players actually honor them? I'd assume that looking into the sources of ffmpeg to find out whether this is supported there is a good start.
The answers to this question also lean towards many demuxers only honoring the avcC box and not a separate parameter set track, but a couple of quick Google searches show that at least the VLC/ffmpeg forums and mailing lists mention these terms, so I'd say it's best to try to mux such a file and simply check what happens.
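If you try that, a quick way to see which sample entry the muxer actually wrote (avc1 with an out-of-band avcC, or avc3 with in-band parameter sets) is to ask ffprobe for the codec tag. A minimal sketch, assuming ffprobe is on the PATH and with a hypothetical file name:

```python
# Sketch: report the video sample-entry type of a muxed MP4 with ffprobe.
# "avc1" implies the parameter sets live in the avcC box; "avc3" implies
# they may appear in-band in the samples.
import subprocess

def video_sample_entry(path: str) -> str:
    out = subprocess.run(
        [
            "ffprobe", "-v", "error",
            "-select_streams", "v:0",
            "-show_entries", "stream=codec_tag_string",
            "-of", "default=noprint_wrappers=1:nokey=1",
            path,
        ],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

print(video_sample_entry("multi_params.mp4"))  # hypothetical file name
```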

How to break up large document into smaller answer units on Retrieve and Rank?

I am still very new to Retrieve and Rank, and Document Conversion services, so I have been playing around with that lately.
I've run into a situation where, when I upload a large document (100+ pages), Retrieve and Rank helps me automatically break it up into answer units, which is great and helpful.
However, some questions only need ONE small line within those big chunks of answer units. Is there a way that I can manually break down the answer units that the Retrieve and Rank service has provided even further?
I heard that you can do it through JavaScript, but is there a way to do it through the UI?
I am contemplating manually breaking up the huge doc into multiple smaller documents, but that could potentially lead to hundreds of them, which is probably the last option I'd resort to.
Any help or suggestions is greatly appreciated!
Thank you all!
First off, one clarification:
Retrieve and Rank does not break up your documents into answer units. That is something that the Document Conversion Service does when your conversion target is ANSWER_UNITS.
Regarding your question:
I don't fully understand exactly what you're trying to do, but if the answer units that are produced by default don't meet your requirements, you can customize different steps of the conversion process to adjust the produced answer units. Take a look at the documentation here.
Specifically, you want to make sure that the heading levels (for Word, PDF or HTML, depending on your document type) are defined in a way that marks the start of each answer unit. Then, make sure that the heading levels you defined (h1, h2, h3, etc.) are included in the selector_tags list within the answer_units section.
Once your custom Document Conversion Service configuration produces the answer units you are looking for, you will be ready to send them to Retrieve and Rank to be indexed.
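As a rough illustration, a custom configuration along these lines (a sketch based on the description above; verify the exact field names against the Document Conversion documentation) tells the service which heading tags should start a new answer unit:

```python
# Sketch: build a custom Document Conversion configuration that splits
# documents into answer units on h1/h2/h3 headings. Field names follow
# the description above; check them against the service docs before use.
import json

config = {
    "conversion_target": "ANSWER_UNITS",
    "answer_units": {
        # Headings that mark the start of a new answer unit.
        "selector_tags": ["h1", "h2", "h3"],
    },
}

with open("convert_config.json", "w") as f:
    json.dump(config, f, indent=2)
```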

Consistent hashing on multiple machines

I've read this article about the idea of consistent hashing: http://n00tc0d3r.blogspot.com/, but I'm confused about how the method works on multiple machines.
The basic process is:
Insert
Hash an input long URL into a single integer;
Locate a server on the ring and store the key--longUrl pair on that server;
Compute the shortened URL using base conversion (from base 10 to base 62) and return it to the user. (How does this step work? On a single machine there is an auto-incremented ID from which the shortened URL is computed, but what value is used to compute the shortened URL on multiple machines? There is no auto-incremented ID.)
Retrieve
Convert the shortened URL back to the key using base conversion (from base 62 to base 10);
Locate the server containing that key and return the longUrl. (And how can we locate the server containing the key?)
I don't see any clear answer on that page as to how the author intended it. I think this is basically left as an exercise for the reader. Here are some ideas:
Implement it as described, with hash-table-style collision resolution. That is, when creating the URL, if it already matches something, deal with that in some way. Rehashing or an arithmetic transformation (e.g. add 1) are both possibilities. This means, naively, a theoretical worst case of having to hit a server n times trying to find an available key.
There are a lot of ways to take that basic idea and refine it, e.g. just search for another available key on the same server, say by rehashing iteratively until you find one that lands on the same server.
Allow servers to talk to each other and coordinate on the auto-increment ID.
This is probably not a great solution, but it might work well in some situations: give each server (or set of servers) a separate namespace, e.g. the first 16 bits select a server. On creation, randomly choose one. Then you just need to figure out how you want that namespace to map. The namespaces only really matter for who is allowed to create which IDs, so if you want to add nodes or rebalance later, it is no big deal.
Let me know if you want more elaboration. I think there's a lot of ways that this one could go. It is annoying that the author didn't elaborate on this point; my experience with these sorts of algorithms is that collision resolution and similar problems tend to be at the very heart of a practical implementation of a distributed system.
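To make the ring lookup and the base conversion concrete, here is a minimal sketch (assumptions: SHA-1 as the hash, hypothetical server names, and a 7-character key space; collision handling is left out, which is exactly the part discussed above):

```python
# Sketch of the two lookups the question asks about:
#  - locating a server on a consistent-hash ring for a given key
#  - converting a numeric key to/from a base-62 short code
import bisect
import hashlib

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def hash_to_int(value: str) -> int:
    return int(hashlib.sha1(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, servers, replicas=100):
        # Each server gets several virtual nodes for a smoother key distribution.
        self._ring = sorted(
            (hash_to_int(f"{server}#{i}"), server)
            for server in servers
            for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    def server_for(self, key) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash.
        point = hash_to_int(str(key))
        idx = bisect.bisect(self._points, point) % len(self._points)
        return self._ring[idx][1]

def encode_base62(n: int) -> str:
    digits = []
    while True:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
        if n == 0:
            return "".join(reversed(digits))

def decode_base62(code: str) -> int:
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n

ring = ConsistentHashRing(["server-a", "server-b", "server-c"])  # hypothetical servers

long_url = "http://example.com/some/very/long/url"
key = hash_to_int(long_url) % 62**7   # hashing can collide; see the discussion above
short_code = encode_base62(key)

print("short code:", short_code)
print("stored on :", ring.server_for(key))

# Retrieval: short code -> key -> server holding the key -> longUrl
assert decode_base62(short_code) == key
print("lookup on :", ring.server_for(decode_base62(short_code)))
```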

How to auto-tag content, algorithms and suggestions needed

I am working with some really large databases of newspaper articles, I have them in a MySQL database, and I can query them all.
I am now searching for ways to help me tag these articles with somewhat descriptive tags.
All these articles are accessible from a URL that looks like this:
http://web.site/CATEGORY/this-is-the-title-slug
So at least I can use the category to figure out what type of content we are working with. However, I also want to tag based on the article text.
My initial approach was doing this:
Get all articles
Get all words, remove all punctuation, split by space, and count them by occurrence
Analyze them, and filter out common non-descriptive words like "them", "I", "this", "these", "their", etc.
Once all the common words were filtered out, the only thing left was words that are tag-worthy.
But this turned out to be a rather manual task, and not a very pretty or helpful approach.
This also suffered from the problem of words or names that are split by a space; for example, if 1,000 articles contain the name "John Doe" and 1,000 articles contain the name "John Hanson", I would only get the word "John" out of it, not the full first and last name.
Automatically tagging articles is really a research problem and you can spend a lot of time re-inventing the wheel when others have already done much of the work. I'd advise using one of the existing natural language processing toolkits like NLTK.
To get started, I would suggest implementing a proper Tokeniser (much better than splitting by whitespace), and then taking a look at Chunking and Stemming algorithms.
You might also want to count frequencies for n-grams, i.e. sequences of words, instead of individual words. This would take care of "words split by a space". Toolkits like NLTK have built-in functions for this.
Finally, as you iteratively improve your algorithm, you might want to train on a random subset of the database and then see how the algorithm tags the remaining set of articles, to check how well it works.
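A minimal sketch of those pieces with NLTK (tokeniser, stopword filtering, stemming, and n-gram counts; the article text is a hypothetical placeholder, and the tokeniser/stopword data packages need to be downloaded once):

```python
# Sketch: tokenise, drop stopwords, stem, and count word and bigram
# frequencies with NLTK, instead of splitting on whitespace.
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.util import ngrams

nltk.download("punkt", quiet=True)       # tokeniser data
nltk.download("punkt_tab", quiet=True)   # needed on newer NLTK versions
nltk.download("stopwords", quiet=True)

article = "John Doe met John Hanson to discuss the championship game."  # hypothetical text

tokens = [t.lower() for t in nltk.word_tokenize(article) if t.isalpha()]
stop = set(stopwords.words("english"))
content_words = [t for t in tokens if t not in stop]

stemmer = PorterStemmer()
word_counts = Counter(stemmer.stem(w) for w in content_words)
bigram_counts = Counter(ngrams(content_words, 2))  # keeps "john doe" vs "john hanson" apart

print(word_counts.most_common(5))
print(bigram_counts.most_common(5))
```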
You should use a metric such as tf-idf to get the tags out:
Count the frequency of each term per document. This is the term frequency, tf(t, D). The more often a term occurs in the document D, the more important it is for D.
Count, per term, the number of documents the term appears in. This is the document frequency, df(t). The higher df, the less the term discriminates among your documents and the less interesting it is.
Divide tf by the log of df: tfidf(t, D) = tf(t, D) / log(df(t) + 1).
For each document, declare the top k terms by their tf-idf score to be the tags for that document.
Various implementations of tf-idf are available; for Java and .NET there's Lucene, and for Python there's scikits.learn (now scikit-learn).
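For example, here is a minimal sketch with scikit-learn's TfidfVectorizer that picks the top-k terms per article as candidate tags (the two articles are hypothetical placeholders for rows pulled from the MySQL database):

```python
# Sketch: pick the top-k tf-idf terms (unigrams and bigrams) per article
# as candidate tags, using scikit-learn.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

articles = [
    "The election campaign focused on the economy and healthcare.",
    "The team won the championship after a dramatic overtime game.",
]

vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(articles)
terms = np.array(vectorizer.get_feature_names_out())

k = 3
for row in tfidf.toarray():
    top = terms[np.argsort(row)[::-1][:k]]  # highest-scoring terms first
    print(list(top))
```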
If you want to do better than this, use language models. That requires some knowledge of probability theory.
Take a look at Kea. It's an open source tool for extracting keyphrases from text documents.
Your problem has also been discussed many times at http://metaoptimize.com/qa:
http://metaoptimize.com/qa/questions/1527/what-are-some-good-toolkits-to-get-lda-like-tagging-of-my-documents
http://metaoptimize.com/qa/questions/1060/tag-analysis-for-document-recommendation
If I understand your question correctly, you'd like to group the articles into similarity classes. For example, you might assign article 1 to 'Sports', article 2 to 'Politics', and so on. Or if your classes are much finer-grained, the same articles might be assigned to 'Dallas Mavericks' and 'GOP Presidential Race'.
This falls under the general category of 'clustering' algorithms. There are many possible choices of such algorithms, but this is an active area of research (meaning it is not a solved problem, and thus none of the algorithms are likely to perform quite as well as you'd like).
I'd recommend you look at Latent Dirichlet Allocation (http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation), or 'LDA'. I don't have personal experience with any of the LDA implementations available, so I can't recommend a specific system (perhaps others more knowledgeable than I am might be able to recommend a user-friendly implementation).
You might also consider the agglomerative clustering implementations available in LingPipe (see http://alias-i.com/lingpipe/demos/tutorial/cluster/read-me.html), although I suspect an LDA implementation might prove somewhat more reliable.
A couple questions to consider while you're looking at clustering systems:
Do you want to allow fractional class membership - e.g. consider an article discussing the economic outlook and its potential effect on the presidential race; can that document belong partly to the 'economy' cluster and partly to the 'election' cluster? Some clustering algorithms allow partial class assignment and some do not
Do you want to create a set of classes manually (i.e., list out 'economy', 'sports', ...), or do you prefer to learn the set of classes from the data? Manual class labels may require more supervision (manual intervention), but if you choose to learn from the data, the 'labels' will likely not be meaningful to a human (e.g., class 1, class 2, etc.), and even the contents of the classes may not be terribly informative. That is, the learning algorithm will find similarities and cluster documents it considers similar, but the resulting clusters may not match your idea of what a 'good' class should contain.
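For what it's worth, here is a minimal sketch of how an LDA pass could look with scikit-learn's implementation (one option among several, not an endorsement of a specific system; the corpus is a hypothetical placeholder, and note the fractional topic memberships mentioned above):

```python
# Sketch: fit a small LDA model and inspect the learned topics and the
# soft (fractional) topic assignments per article.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

articles = [
    "The senate debated the new budget and tax policy.",
    "The striker scored twice in the cup final.",
    "Markets fell as the central bank raised interest rates.",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(articles)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # fractional membership per article
terms = vectorizer.get_feature_names_out()

for topic_idx, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"topic {topic_idx}: {top_terms}")

print(doc_topics.round(2))  # note: the topic labels themselves are not human-meaningful
```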
Your approach seems sensible and there are two ways you can improve the tagging.
Use a known list of keywords/phrases for your tagging, and if the count of instances of a word/phrase is greater than a threshold (probably based on the length of the article), then include the tag (a sketch of this is shown after these two points).
Use a part-of-speech tagging algorithm to help reduce the article to a sensible set of phrases, and use a sensible method to extract tags from those. Once you have the articles reduced using such an algorithm, you would be able to identify some good candidate words/phrases to use in your keyword/phrase list for method 1.
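A minimal sketch of the first method (the keyword list and the per-1000-words threshold factor are hypothetical; tune both against your data):

```python
# Sketch of method 1: count occurrences of known phrases and tag the
# article when a phrase appears often enough relative to its length.
import re

KNOWN_PHRASES = ["climate change", "interest rates", "champions league"]  # hypothetical list

def suggest_tags(article, per_1000_words=1.0):
    words = len(article.split())
    threshold = max(1, round(per_1000_words * words / 1000))
    text = article.lower()
    tags = []
    for phrase in KNOWN_PHRASES:
        count = len(re.findall(re.escape(phrase), text))
        if count >= threshold:
            tags.append(phrase)
    return tags

print(suggest_tags("Interest rates rose again as interest rates worry the markets."))
```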
If the content is an image or video, please check out the following blog article:
http://scottge.net/2015/06/30/automatic-image-and-video-tagging/
There are basically two approaches to automatically extract keywords from images and videos.
Multiple Instance Learning (MIL)
Deep Neural Networks (DNN), Recurrent Neural Networks (RNN), and the variants
In the above blog article, I list the latest research papers to illustrate the solutions. Some of them even include a demo site and source code.
If the content is a large text document, please check out this blog article:
Best Key Phrase Extraction APIs in the Market
http://scottge.net/2015/06/13/best-key-phrase-extraction-apis-in-the-market/
Thanks, Scott
Assuming you have a pre-defined set of tags, you can use the Elasticsearch Percolator API as this answer suggests:
Elasticsearch - use a "tags" index to discover all tags in a given string
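Roughly, the percolator idea looks like this (a sketch using plain HTTP requests against a local Elasticsearch 7.x node; the index, field, and tag names are hypothetical, and the exact API depends on your Elasticsearch version): store one query per tag, then percolate each article to see which tag queries match it.

```python
# Sketch: percolator-based tagging. Each stored document is a query; the
# percolate search returns the tags whose queries match the article.
import requests

ES = "http://localhost:9200"

# 1) Index with a percolator field plus the text field the tag queries run against.
requests.put(f"{ES}/tags", json={
    "mappings": {"properties": {
        "query": {"type": "percolator"},
        "body": {"type": "text"},
    }}
})

# 2) One stored query per tag (hypothetical tags).
requests.put(f"{ES}/tags/_doc/sports", json={"query": {"match": {"body": "football championship"}}})
requests.put(f"{ES}/tags/_doc/economy", json={"query": {"match": {"body": "inflation markets"}}})
requests.post(f"{ES}/tags/_refresh")

# 3) Percolate an article: the hits are the tags whose queries matched.
article = "Markets rallied despite inflation fears."
hits = requests.post(f"{ES}/tags/_search", json={
    "query": {"percolate": {"field": "query", "document": {"body": article}}}
}).json()["hits"]["hits"]
print([h["_id"] for h in hits])
```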
Are you talking about named-entity recognition? If so, Anupam Jain is right: it's a research problem, typically approached with deep learning and CRFs. As of 2017, work on named-entity recognition has focused on semi-supervised learning techniques.
The link below is a related NER paper:
http://ai2-website.s3.amazonaws.com/publications/semi-supervised-sequence.pdf
Also, the link below covers key-phrase extraction on Twitter:
http://jkx.fudan.edu.cn/~qzhang/paper/keyphrase.emnlp2016.pdf
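As a small illustration of what NER buys you for tagging, here is a sketch using spaCy (not mentioned above; note that entities come out as whole spans, which also addresses the "John Doe" vs. "John Hanson" problem from the question):

```python
# Sketch: pull named entities (people, organisations, places) out of an
# article as candidate tags. Assumes: pip install spacy and
# python -m spacy download en_core_web_sm.
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")

article = "John Doe met John Hanson in Dallas to discuss the GOP presidential race."
doc = nlp(article)

# Entities are whole spans, so "John Doe" and "John Hanson" stay distinct.
entities = Counter((ent.text, ent.label_) for ent in doc.ents)
print(entities.most_common())
```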