I read that one can support query by humming by using MIDI files. Can someone please give me an idea on how this can be done?
If you have access to the IEEE library, see: Query by Humming of MIDI and Audio Using Locality Sensitive Hashing
Quoting from Query by Humming of MIDI and Audio Using Locality Sensitive Hashing (audio retrieval examples):
We propose a query by humming method based on locality sensitive
hashing (LSH). The method constructs an index of melodic fragments by
extracting pitch vectors from a database of melodies. In retrieval,
the method automatically transcribes a sung query into notes and then
extracts pitch vectors similarly to the index construction. For each
query pitch vector, the method searches for similar melodic fragments
in the database to obtain a list of candidate melodies. This is
performed efficiently by using LSH. The candidate melodies are ranked
by their distance to the entire query and returned to the user. To
retrieve audio signals, we apply an automatic melody transcription
method to construct the melody database directly from music
recordings.
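To make the quoted pipeline more concrete, here is a minimal sketch in Python. It is not the paper's implementation. It assumes the melodies are already available as lists of MIDI note numbers (for example, parsed from MIDI files), uses fixed windows of consecutive notes as "pitch vectors", and uses random-hyperplane hashing as one possible LSH scheme. The names pitch_vectors, RandomProjectionLSH, build_index and query are made up for illustration, and candidates are ranked by a simple vote count rather than by distance to the entire query as the paper describes.

```python
import random
from collections import defaultdict


def pitch_vectors(pitches, width=4):
    """Return mean-normalised windows of `width` consecutive pitches."""
    vectors = []
    for start in range(len(pitches) - width + 1):
        window = pitches[start:start + width]
        mean = sum(window) / width
        # Subtracting the mean makes the vector independent of the key.
        vectors.append(tuple(p - mean for p in window))
    return vectors


class RandomProjectionLSH:
    """Hash vectors by the side of random hyperplanes they fall on."""

    def __init__(self, dim, n_bits=8, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)]
                       for _ in range(n_bits)]
        self.buckets = defaultdict(set)

    def _key(self, vector):
        return tuple(sum(a * b for a, b in zip(plane, vector)) >= 0
                     for plane in self.planes)

    def add(self, vector, melody_id):
        self.buckets[self._key(vector)].add(melody_id)

    def candidates(self, vector):
        return self.buckets.get(self._key(vector), set())


def build_index(melodies, width=4):
    """Index every pitch vector of every melody in the database."""
    index = RandomProjectionLSH(dim=width)
    for melody_id, pitches in melodies.items():
        for vector in pitch_vectors(pitches, width):
            index.add(vector, melody_id)
    return index


def query(index, hummed_pitches, width=4):
    """Return (melody_id, votes) pairs, most-voted first."""
    votes = defaultdict(int)
    for vector in pitch_vectors(hummed_pitches, width):
        for melody_id in index.candidates(vector):
            votes[melody_id] += 1
    return sorted(votes.items(), key=lambda item: item[1], reverse=True)


# Toy database of MIDI note numbers (made-up excerpts).
melodies = {
    "ode_to_joy": [64, 64, 65, 67, 67, 65, 64, 62, 60, 60, 62, 64, 64, 62, 62],
    "twinkle": [60, 60, 67, 67, 69, 69, 67, 65, 65, 64, 64, 62, 62, 60],
}
index = build_index(melodies)
# The query is the start of "ode_to_joy" hummed five semitones higher.
results = query(index, [69, 69, 70, 72, 72, 70, 69, 67])
print(results)
```

A real system would also have to transcribe the hummed audio into notes and cope with tempo changes and wrong notes, which this sketch ignores.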
Here is an open-source query-by-humming system that supports MIDI files for building the song database: https://github.com/EmilioMolina/QueryBySingingHumming
See also these references:
[1] Lei Wang, Shen Huang, Sheng Hu, Jiaen Liang, Bo Xu, An Effective and Efficient Method for Query by Humming System Based on Multi-Similarity Measurement Fusion, ICALIP, 2008
[2] Lei Wang, Shen Huang, Sheng Hu, Jiaen Liang, Bo Xu, Improving Searching Speed and Accuracy of Query by Humming System Based on Three Methods: Feature Fusion, Candidates Set Reduction and Multiple Similarity Measurement Rescoring, INTERSPEECH, 2008
[3] http://mirlab.org/dataSet/public/MIR-QBSH-corpus.rar
[4] http://www.esac-data.org/
I have crawled the MTurk website and have 260 HITs as a dataset. A number of users have selected HITs from this dataset and assigned a rating to each selected HIT. Now I want to give recommendations to these users based on their selections. How is this possible? Can anyone recommend a recommendation algorithm?
It sounds like you should go for one of the collaborative filtering (CF) algorithms, as your users give explicit feedback in the form of ratings. First, I would suggest implementing a simple item-based or user-based k-nearest-neighbours algorithm. If the results do not satisfy you, for example because your data is very sparse, matrix factorization techniques should do the trick. A good recent survey I read is [1]; it compares the different methods under different data settings.
If you feel comfortable with this and realize that what you actually need is a ranked list of top-N predictions rather than ratings, I would suggest reading about e.g. Bayesian Personalized Ranking [2].
And the best part is that these algorithms are well known, and most of them are available for almost every programming language, e.g. for Python: https://github.com/Mendeley/mrec/
[1] J. Lee, M. Sun, and G. Lebanon, “A Comparative Study of Collaborative Filtering Algorithms,” ArXiv, pp. 1–27, 2012.
[2] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme, “BPR: Bayesian Personalized Ranking from Implicit Feedback,” in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, 2009, pp. 452–461.
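As a rough illustration of the item-based k-nearest-neighbours idea suggested above, here is a small pure-Python sketch. The ratings are invented, the HIT and user names are placeholders, and the function names (item_vectors, cosine, predict, recommend) are hypothetical. It uses plain cosine similarity between item rating vectors; a real implementation would usually add mean-centering and handle sparsity more carefully.

```python
from math import sqrt


def item_vectors(ratings):
    """Turn {user: {item: rating}} into {item: {user: rating}}."""
    items = {}
    for user, user_ratings in ratings.items():
        for item, rating in user_ratings.items():
            items.setdefault(item, {})[user] = rating
    return items


def cosine(a, b):
    """Cosine similarity of two sparse rating vectors keyed by user."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def predict(ratings, user, item, k=2):
    """Predict a rating as the similarity-weighted mean of the user's
    ratings on the k items most similar to `item`."""
    items = item_vectors(ratings)
    target = items.get(item, {})
    neighbours = sorted(
        ((cosine(target, items[other]), rating)
         for other, rating in ratings[user].items() if other != item),
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in neighbours)
    if total == 0:
        return None  # no rated item overlaps with the target item
    return sum(sim * rating for sim, rating in neighbours) / total


def recommend(ratings, user, n=3, k=2):
    """Return up to n unseen items, highest predicted rating first."""
    items = item_vectors(ratings)
    scored = []
    for item in items:
        if item in ratings[user]:
            continue
        score = predict(ratings, user, item, k)
        if score is not None:
            scored.append((score, item))
    return [item for score, item in sorted(scored, reverse=True)[:n]]


# Invented ratings on a 1-5 scale.
ratings = {
    "u1": {"hitA": 5, "hitB": 4},
    "u2": {"hitA": 5, "hitB": 5, "hitC": 5},
    "u3": {"hitA": 1, "hitC": 2, "hitD": 5},
}
print(predict(ratings, "u1", "hitC"))
print(recommend(ratings, "u1"))
```

With only 260 HITs the full item-item similarity matrix is tiny, so recomputing it on every call (as this sketch does) is wasteful but harmless; you would normally compute it once and cache it.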
I am looking for a convenient way to store and query a huge amount of meteorological data (a few TB). More information about the type of data is in the middle of the question.
Previously I was looking in the direction of MongoDB (I have used it for many of my own projects and feel comfortable dealing with it), but recently I found out about the HDF5 data format. Reading about it, I found some similarities with Mongo:
HDF5 simplifies the file structure to include only two major types of object:
Datasets, which are multidimensional arrays of a homogenous type
Groups, which are container structures which can hold datasets and other groups
This results in a truly hierarchical, filesystem-like data format. Metadata is stored in the form of user-defined, named attributes attached to groups and datasets.
This looks like arrays and embedded objects in Mongo, and it also supports indices for querying the data.
Because it uses B-trees to index table objects, HDF5 works well for
time series data such as stock price series, network monitoring data,
and 3D meteorological data.
The data:
A specific region is divided into smaller squares. A sensor (a dot) is located at each intersection of the grid.
This sensor collects the following information every X minutes:
solar luminosity
wind location and speed
humidity
and so on (this information is mostly the same, sometimes a sensor does not collect all the information)
It also collects this for different heights (0 m, 10 m, 25 m). The heights will not always be the same. Each sensor also has some meta-information:
name
lat, lng
is it in water, and many others
Given this, I do not expect the size of one element to be bigger than 1 MB.
Also, I have enough storage in one place to save all the data (so, as far as I understand, no sharding is required).
Operations with the data.
There are several ways I am going to interact with the data:
Convert and store a large amount of it: a few TB of data will be given to me at some point in NetCDF format and I will need to store them (it is relatively easy to convert it to HDF5). Then, periodically, smaller parts of data (1 GB per week) will be provided and I have to add them to the storage. Just to highlight: I have enough storage to save all this data on one machine.
Query the data. There is often a need to query the data in real time. The most frequent queries are: tell me the temperature of the sensors in a specific region at a specific time; show me the data from a specific sensor for a specific time; show me the wind for some region over a given time range. Aggregated queries (what is the average temperature over the last two months) are highly unlikely. Here I think that Mongo is a good fit, but HDF5 + PyTables is an alternative.
Perform some statistical analysis. Currently I do not know exactly what it will be, but I know that it does not need to run in real time. So I was thinking that using Hadoop with Mongo might be a nice idea, but HDF5 with R is a reasonable alternative.
I know that questions about the best approach are not encouraged, but I am looking for advice from experienced users. If you have any questions, I would be glad to answer them and will appreciate your help.
P.S I reviewed some interesting discussions, similar to mine: hdf-forum, searching in hdf5, storing meteorological data
It's a difficult question and I am not sure if I can give a definite answer but I have experience with both HDF5/pyTables and some NoSQL databases.
Here are some thoughts.
HDF5 per se has no notion of an index. It's only a hierarchical storage format that is well suited to multidimensional numeric data. It's possible to build on top of HDF5 to implement an index for the data (e.g. PyTables, HDF5 FastQuery).
HDF5 (unless you are using the MPI version) does not support concurrent write access (read access is possible).
HDF5 supports compression filters which, contrary to popular belief, can actually make data access faster (however, you have to think about a proper chunk size, which depends on the way you access the data).
HDF5 is not a database. MongoDB gives you database guarantees (atomic single-document writes, and multi-document ACID transactions in newer versions); HDF5 doesn't, which might be important.
There is a package (SciHadoop) that combines Hadoop and HDF5.
HDF5 makes it relatively easy to do out-of-core computation (i.e. when the data is too big to fit into memory).
PyTables supports some fast "in-kernel" computations directly on HDF5 data using numexpr.
I think your data generally is a good fit for storing in HDF5. You can also do statistical analysis either in R or via Numpy/Scipy.
But you can also think about a hybrid approach: store the raw bulk data in HDF5 and use MongoDB for the metadata or for caching specific values that are often used.
You can try SciDB if loading NetCDF/HDF5 into this array database is not a problem for you. Note that if your dataset is extremely large, the data loading phase will be very time consuming. I'm afraid this is a problem for all the databases. Anyway, SciDB also provides an R package, which should be able to support the analysis you need.
Alternatively, if you want to perform queries without transforming HDF5 into something else, you can use the product here: http://www.cse.ohio-state.edu/~wayi/papers/HDF5_SQL.pdf
Moreover, if you want to perform selection queries efficiently, you should use an index; if you want to perform aggregation queries in real time (in seconds), you can consider approximate aggregation. Our group has developed some products to support those functions.
In terms of statistical analysis, I think the answer depends on the complexity of your analysis. If all you need is to compute something like entropy or correlation coefficient, we have products to do it in real time. If the analysis is very complex and ad-hoc, you may consider SciHadoop or SciMATE, which can process scientific data in the MapReduce framework. However, I am not sure if SciHadoop currently can support HDF5 directly.
I have 20,000 text files loaded into a PostgreSQL database, one file per row, all stored in a table named docs with columns doc_id and doc_content.
I know that there are approximately 8 types of documents. Here are my questions:
How can I find these groups?
Are there some similarity, dissimilarity measures I can use?
Is there some implementation of longest common substring in PostgreSQL?
Are there some extensions for text mining in PostgreSQL? (I've found only Tsearch, but this seems to be last updated in 2007)
I can probably use something like '%%' or SIMILAR TO, but there might be a better approach.
You should use full text search, which is part of PostgreSQL 9.x core (aka Tsearch2).
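As a rough stand-in for longest common substring (or a similarity measure, if you will), you might be able to use the levenshtein() function, which is part of the fuzzystrmatch extension. Note that it computes edit distance, not the longest common substring itself.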
You can use a clustering technique such as K-Means or Hierarchical Clustering.
Yes, you can use the cosine similarity between documents, computed on binary term occurrence, term counts, term frequencies, or TF-IDF weights.
I don't know of a longest common substring implementation in PostgreSQL.
Not sure, but you could use R or RapidMiner to do the data mining against your database.
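To illustrate the cosine similarity on TF-IDF weights mentioned above, here is a minimal pure-Python sketch. It assumes the rows have already been fetched from the docs table into a dict of doc_id to doc_content; the three sample documents are invented, and the helper names (tokenize, tfidf_vectors, cosine) are hypothetical. The IDF formula used here, log(N/df) + 1, is just one common variant.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Map {doc_id: text} to {doc_id: {term: tf-idf weight}}."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())
    # The +1 keeps terms that occur in every document from vanishing.
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    return {
        doc_id: {term: tf * idf[term] for term, tf in term_counts.items()}
        for doc_id, term_counts in counts.items()
    }


def cosine(a, b):
    """Cosine similarity of two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented stand-ins for rows of the docs table (doc_id -> doc_content).
docs = {
    1: "Invoice: total amount due, payment within 30 days",
    2: "Invoice payment reminder: amount overdue",
    3: "Meeting agenda, minutes and list of attendees",
}
vectors = tfidf_vectors(docs)
print(cosine(vectors[1], vectors[2]))  # the two invoices share terms
print(cosine(vectors[1], vectors[3]))  # no terms in common
```

The resulting vectors (or the pairwise similarities) are what you would feed into K-Means or hierarchical clustering to look for the roughly 8 groups.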
I would like to know if there are some libraries/algorithms/techniques that help to extract the user context (walking/standing) from accelerometer data (extracted from any smartphone)?
For example, I would collect accelerometer data every 5 seconds for a definite period of time and then identify the user context (ex. for the first 5 minutes, the user was walking, then the user was standing for a minute, and then he continued walking for another 3 minutes).
Thank you very much in advance :)
Check the new activity recognition APIs:
http://developer.android.com/google/play-services/location.html
It's still a research topic; please look at this paper, which discusses the algorithm:
http://www.enggjournals.com/ijcse/doc/IJCSE12-04-05-266.pdf
I don't know of any such library.
It is a very time consuming task to write such a library. Basically, you would build a database of "user context" that you wish to recognize.
Then you collect data and compare it to the entries in the database. As for how to compare, see "Store orientation to an array - and compare"; the same holds for accelerometer data.
Walking/running data is analogous to heart-rate data in a lot of ways. To filter out the noise and get smooth peaks, look into noise filtering and peak detection algorithms. The following algorithm is used to obtain heart-rate information for heart patients and should be a good starting point: http://www.docstoc.com/docs/22491202/Pan-Tompkins-algorithm-algorithm-to-detect-QRS-complex-in-ECG
Think about how you want to filter out the noise and detect peaks; the filters will obviously depend on the raw data you gather, but it's good to have a general idea of what kind of filtering you'd want to do on your data. Then think about what needs to be done once you have filtered data. In your case, think about how you would design an algorithm to find out when the data indicates activity (walking, running, etc.) and when it shows the user being stationary. This is a fairly challenging problem once you consider the dynamics of the device itself (how it's positioned when the user is walking/running) and the fact that there are very few, if any, benchmarked algorithms that do this with raw smartphone data.
Start with determining the appropriate algorithms, and then tackle the complexities (mentioned above) one by one.
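As one very simple starting point for the "activity vs. stationary" decision, here is a sketch that labels fixed-size windows by how much the acceleration magnitude varies. It is only an illustration under several assumptions: the data is sampled far more densely than once every 5 seconds (50 Hz is assumed here), the window size and the threshold of 1.0 m/s^2 are made-up values that would need tuning on real recordings, and the signal below is synthetic. The function names are hypothetical.

```python
import math
from statistics import pstdev


def magnitudes(samples):
    """Orientation-independent magnitude of each (x, y, z) sample."""
    return [math.sqrt(x * x + y * y + z * z) for x, y, z in samples]


def classify_windows(samples, window=50, threshold=1.0):
    """Label each non-overlapping window as 'walking' or 'standing'
    depending on the spread of the acceleration magnitude."""
    mags = magnitudes(samples)
    labels = []
    for start in range(0, len(mags) - window + 1, window):
        spread = pstdev(mags[start:start + window])
        labels.append("walking" if spread > threshold else "standing")
    return labels


def segments(labels):
    """Merge consecutive identical labels into (label, n_windows) runs."""
    merged = []
    for label in labels:
        if merged and merged[-1][0] == label:
            merged[-1][1] += 1
        else:
            merged.append([label, 1])
    return [(label, count) for label, count in merged]


def standing(n):
    """Synthetic samples: only gravity on the z axis."""
    return [(0.0, 0.0, 9.81) for _ in range(n)]


def walking(n, rate=50, step_hz=2.0):
    """Synthetic samples: gravity plus a 2 Hz bounce on the z axis."""
    return [(0.0, 0.0, 9.81 + 3.0 * math.sin(2 * math.pi * step_hz * i / rate))
            for i in range(n)]


# 2 s standing, 3 s walking, 1 s standing at an assumed 50 Hz.
signal = standing(100) + walking(150) + standing(50)
timeline = segments(classify_windows(signal))
print(timeline)
```

Using the magnitude rather than the individual axes sidesteps part of the device-orientation problem, but real data will still need the noise filtering discussed above before a fixed threshold becomes usable.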
Yes, I'm aware that speech recognition is fairly complicated (to put it mildly). What I'm looking for is a method for distinguishing between maybe 20-30 phrases. An ability to split words (discrete speech is fine) would be nice, but isn't required. The software will be user-dependent (i.e. for use by me). I'm not looking for existing software, but for a good way of going about doing this myself. I've looked into various existing methods and it seems like splitting the sound into phonemes, while common, is somewhat excessive for my needs.
For some context, I'm just looking for a way to control some aspects of my computer with a few simple voice commands. I'm aware that Windows already has speech recognition software, but I'd like to go about this one myself as a learning exercise. Commands would be simple, like "Open Google" or "Mute". What I had in mind (not sure if this is a good idea) is that some commands would be compound. So "Mute" would just be "Mute", whereas the "Open" command could be recognized individually and then have its suffixes (Google, Photoshop, etc.) recognized with another network/model/whatever. But I'm not sure if looking for prefixes/word breaks in this way would produce better results than having to deal with an increased number of individual commands.
I've been looking into perceptrons, Hopfield networks (though they're somewhat obsolete from what I understand) and HMMs, and while I understand the ideas behind these (I've implemented the ANNs before) I don't really know which is best suited to this task. I'm assuming that learning vector quantization models would also be appropriate, but I can't really find much literature to this end. Any guidance/resources would be greatly appreciated.
There are some open-source projects in speech recognition:
HTK (Hidden Markov Model Toolkit)
Sphinx
Both have decoder, training, and language model toolkits: everything you need to build a complete and robust speech recognizer.
Voxforge has acoustic and language models for both open source speech recognition toolkits.
Some time ago, I read a whitepaper about a limited-vocabulary system that used a simple recognition process. The system divided each utterance into a small number of bins (6 in time and 4 in magnitude, if I remember correctly, for 24 total), and all it did was count the number of sampled audio measurements in each bin. A fuzzy logic rule base then interpreted each utterance's 24 bin counts and generated an interpretation.
I imagine that (for some applications) a simple matching process might work just as well, in which the 24 bin counts of the current utterance are simply matched against those of each of your stored prototypes, and the one with the least overall difference is the winner.
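Here is how I would sketch that matching idea in Python. This is my own reconstruction from memory of the description above, not code from the whitepaper. It assumes the utterance has already been cut out of the recording and its samples scaled to the range -1.0 to 1.0; the function names and the toy "utterances" are made up, and the overall difference is taken as the sum of absolute differences between bin counts.

```python
def bin_counts(samples, time_bins=6, mag_bins=4, max_amp=1.0):
    """Count samples per (time bin, magnitude bin) cell; 6 x 4 = 24 cells."""
    counts = [0] * (time_bins * mag_bins)
    n = len(samples)
    for i, sample in enumerate(samples):
        t = min(i * time_bins // n, time_bins - 1)
        m = min(int(abs(sample) / max_amp * mag_bins), mag_bins - 1)
        counts[t * mag_bins + m] += 1
    return counts


def recognise(samples, prototypes):
    """Return the prototype name whose bin counts differ least
    (sum of absolute differences) from those of `samples`."""
    features = bin_counts(samples)

    def difference(name):
        return sum(abs(a - b) for a, b in zip(features, prototypes[name]))

    return min(prototypes, key=difference)


# Toy "utterances": loud-then-quiet versus quiet-then-loud.
prototypes = {
    "mute": bin_counts([0.9] * 60 + [0.1] * 60),
    "open": bin_counts([0.1] * 60 + [0.9] * 60),
}
# A slightly different rendition of the loud-then-quiet pattern.
spoken = [0.8] * 60 + [0.15] * 60
print(recognise(spoken, prototypes))
```

Raw counts depend on the utterance length, so in practice you would probably divide each count by the number of samples (or resample utterances to a fixed length) before comparing.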