Split HDMI Image to 3D Projector System [closed]

Split HDMI Image to 3D Projector System [closed] - raspberry-pi

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed 1 year ago.
Improve this question
I am quite stuck on a board, or something that could fit my needs.
I made a dual Projector 3D System at home, like this : http://www.cinema3dglass.com/Dual_projector_3D_polarization_system.php
And the HDMI can have different (8) formats(Left Right, Above Below, ect..) all here: https://www.tridef.com/user-guide/3d-file-formats
So the needed images on the incoming HDMI port can ble placed like above, and I need to split them on two separate HDMI outputs according to the format, so I can plug them into the projectors.
Basically I need a device that is in the first image in the first link above title: "HDMI Distribution Amplifier & EDID Emulator"
I know an Arduino can't handle this amount of processing, because I overloaded it with simlier tasks.
Can anyone Help me where to start? I foud Panda development board but that's too expensive.
Or if there is a not owerly expensive device existing for this task, I could buy that.
I manadged to use the system from Tridef 3D but that's hard to get working.
I'd like my device to get the input from a Chromecast 2.0, but if it's not possible a normal player 'll do it.
I found some devices called HDMI Demultiplexer they simply cut one half of the input, but that's quite expensieve, for 260$ and two would be needed.
Help me please.
Thanks in advance.

From the HDMI specification at page 56 the transfer/interconnection looks like this:
I would start with interleaved left/right format where even pixels are left and odd pixels are right because there is high chance that it does not need any FIFO. If you want standard left/right then you need single line FIFO for each channel and for up/down full image FIFO. In case variable clock is supported by your HW then this simplified example should work:
You need to add H/V sync decoder from Channel0 to reset the binary counter. Binary counter counts which data address is being processed. The single data-line to the AND gates should be D1 half of the input clock but not entirely you need to toggle between D0 and D1 depends in the timing of data processed (for pixels it would be D1 and for other data it would be D0) that is the variable clock I mentioned before. The comparator just compares the address against predefined constants (like half of line for non interleaved left/right format or detect even odd for interleaved format but both must take +/- other data offsets) beware the transfer is on bits not Bytes so the address will be multiplied by number of bits per data chunk ... The gates just toggle clock between left and right part. LATCHES make sure output signal will be not mixed and also boost the signal.
I would start with oscilloscope measurements of the channels so you can see how the data is transfered and then experiment-ate. If you use FPGA then you do not need to make any changes to the board while ecxperimentating with configurations as the circuit will be solely inside FPGA.
If variable clock is not supported then you need to use FIFO and or RAM to store the full line/image and then send the appropriate parts to their connectors. For that you most likely need full decoding capability so use the SIL9134 + SIL9135. Halving resolution will introduce timing problems because you will need more time to send half speed half frame then the full speed full frame (the auxiliary and sync data is copied not halved). If the sending has big enough gaps you could fit the missing time there but again not all HW can support it losing sync/flickering/etc. In such case you could change the resolution to a bit smaller (after halving) to fit in the send time ... or enlarge thi full resolution input (in x axis).
Good luck with your quest.

Related

Image based steganography that survives resizing?

I am using a startech capture card for capturing video from the source machine..I have encoded that video using matlab so every frame of that video will contain that marker...I run that video on the source computer(HDMI out) connected via HDMI to my computer(HDMI IN) once i capture the frame as bitmap(1920*1080) i re-size it to 1280*720 i send it for processing , the processing code checks every pixel for that marker.
The issue is my capture card is able to capture only at 1920*1080 where as the video is of 1280*720. Hence in order to retain the marker I am down scaling the frame captured to 1280*720 which in turn alters the entire pixel array I believe and hence I am not able to retain marker I fed in to the video.
In that capturing process the image is going through up-scaling which in turn changes the pixel values.
I am going through few research papers on Steganography but it hasn't helped so far. Is there any technique that could survive image resizing and I could retain pixel values.
Any suggestions or pointers will be really appreciated.

My advice is to start with searching for an alternative software that doesn't rescale, compress or otherwise modify any extracted frames before handing them to your control. It may save you many headaches and days worth of time. If you insist on implementing, or are forced to implement a steganography algorithm that survives resizing, keep on reading.
I can't provide a specific solution because there are many ways this can be (possibly) achieved and they are complex. However, I'll describe the ingredients a solution will most likely involve and your limitations with such an approach.
Resizing a cover image is considered an attack as an attempt to destroy the secret. Other such examples include lossy compression, noise, cropping, rotation and smoothing. Robust steganography is the medicine for that, but it isn't all powerful; it may be able to provide resistance to only specific types attacks and/or only small scale attacks at that. You need to find or design an algorithm that suits your needs.
For example, let's take a simple pixel lsb substitution algorithm. It modifies the lsb of a pixel to be the same as the bit you want to embed. Now consider an attack where someone randomly applies a pixel change of -1 25% of the time, 0 50% of the time and +1 25% of the time. Effectively, half of the time it will flip your embedded bit, but you don't know which ones are affected. This makes extraction impossible. However, you can alter your embedding algorithm to be resistant against this type of attack. You know the absolute value of the maximum change is 1. If you embed your secret bit, s, in the 3rd lsb, along with setting the last 2 lsbs to 01, you guarantee to survive the attack. More specifically, you get xxxxxs01 in binary for 8 bits.
Let's examine what we have sacrificed in order to survive such an attack. Assuming our embedding bit and the lsbs that can be modified all have uniform probabilities, the probability of changing the original pixel value with the simple algorithm is
change | probability
-------+------------
0 | 1/2
1 | 1/2
and with the more robust algorithm
change | probability
-------+------------
0 | 1/8
1 | 1/4
2 | 3/16
3 | 1/8
4 | 1/8
5 | 1/8
6 | 1/16
That's going to affect our PSNR quite a bit if we embed a lot of information. But we can do a bit better than that if we employ the optimal pixel adjustment method. This algorithm minimises the Euclidean distance between the original value and the modified one. In simpler terms, it minimises the absolute difference. For example, assume you have a pixel with binary value xxxx0111 and you want to embed a 0. This means you have to make the last 3 lsbs 001. With a naive substitution, you get xxxx0001, which has a distance of 6 from the original value. But xxx1001 has only 2.
Now, let's assume that the attack can induce a change of 0 33.3% of the time, 1 33.3% of the time and 2 33.3%. Of that last 33.3%, half the time it will be -2 and the other half it will be +2. The algorithm we described above can actually survive a +2 modification, but not a -2. So 16.6% of the time our embedded bit will be flipped. But now we introduce error correcting codes. If we apply such a code that has the potential to correct on average 1 error every 6 bits, we are capable of successfully extracting our secret despite the attack altering it.
Error correction generally works by adding some sort of redundancy. So even if part of our bit stream is destroyed, we can refer to that redundancy to retrieve the original information. Naturally, the more redundancy you add, the better the error correction rate, but you may have to double the redundancy just to improve the correction rate by a few percent (just arbitrary numbers here).
Let's appreciate here how much information you can hide in a 1280x720 (grayscale) image. 1 bit per pixel, for 8 bits per letter, for ~5 letters per word and you can hide 20k words. That's a respectable portion of an average novel. It's enough to hide your stellar Masters dissertation, which you even published, in your graduation photo. But with a 4 bit redundancy per 1 bit of actual information, you're only looking at hiding that boring essay you wrote once, which didn't even get the best mark in the class.
There are other ways you can embed your information. For example, specific methods in the frequency domain can be more resistant to pixel modifications. The downside of such methods are an increased complexity in coding the algorithm and reduced hiding capacity. That's because some frequency coefficients are resistant to changes but make embedding modifications easily detectable, then there are those that are fragile to changes but they are hard to detect and some lie in the middle of all of this. So you compromise and use only a fraction of the available coefficients. Popular frequency transforms used in steganography are the Discrete Cosine Transform (DCT) and Discrete Wavelet Transform (DWT).
In summary, if you want a robust algorithm, the consistent themes that emerge are sacrificing capacity and applying stronger distortions to your cover medium. There have been quite a few studies done on robust steganography for watermarks. That's because you want your watermark to survive any attacks so you can prove ownership of the content and watermarks tend to be very small, e.g. a 64x64 binary image icon (that's only 4096 bits). Even then, some algorithms are robust enough to recover the watermark almost intact, say 70-90%, so that it's still comparable to the original watermark. In some case, this is considered good enough. You'd require an even more robust algorithm (bigger sacrifices) if you want a lossless retrieval of your secret data 100% of the time.
If you want such an algorithm, you want to comb the literature for one and test any possible candidates to see if they meet your needs. But don't expect anything that takes only 15 lines to code and 10 minutes of reading to understand. Here is a paper that looks like a good start: Mali et al. (2012). Robust and secured image-adaptive data hiding. Digital Signal Processing, 22(2), 314-323. Unfortunately, the paper is not open domain and you will either need a subscription, or academic access in order to read it. But then again, that's true for most of the papers out there. You said you've read some papers already and in previous questions you've stated you're working on a college project, so access for you may be likely.
For this specific paper, table 4 shows the results of resisting a resizing attack and section 4.4 discusses the results. They don't explicitly state 100% recovery, but only a faithful reproduction. Also notice that the attacks have been of the scale 5-20% resizing and that only allows for a few thousand embedding bits. Finally, the resizing method (nearest neighbour, cubic, etc) matters a lot in surviving the attack.

I have designed and implemented ChromaShift: https://www.facebook.com/ChromaShift/
If done right, steganography can resiliently (i.e. robustly) encode identifying information (e.g. downloader user id) in the image medium while keeping it essentially perceptually unmodified. Compared to watermarks, steganography is a subtler yet more powerful way of encoding information in images.
The information is dynamically multiplexed into the Cb Cr fabric of the JPEG by chroma-shifting pixels to a configurable small bump value. As the human eye is more sensitive to luminance changes than to chrominance changes, chroma-shifting is virtually imperceptible while providing a way to encode arbitrary information in the image. The ChromaShift engine does both watermarking and pure steganography. Both DRM subsystems are configurable via a rich set of of options.
The solution is developed in C, for the Linux platform, and uses SWIG to compile into a PHP loadable module. It can therefore be accessed by PHP scripts while providing the speed of a natively compiled program.

Why is bandwidth measured in bits per second? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed 8 years ago.
Improve this question
There is another question with this same title, but the question is asked differently than what's troubling me, and the answer is not sufficient.
The most prominent analogies I hear to explain bandwidth are the highway example, and the pipe example. In the highway example, bandwidth is the amount of cars that can drive on the highway in a given amount of time, and in the pipe its an amount of water that can flow through.
My question is - by measuring by cars per second, or liters per second, does that mean that a longer highway, pipe or copper wire has a higher bandwidth than a shorter one? That seems strange to me.
Wouldn't it make more sense to give the highway bandwidth as the amount of lanes it has - irrespective of a unit of time? It just makes more sense to me and is simpler to say that the pipe is "1 foot in diameter" rather than "it carries 100 litres per second".
Why do we measure bandwidth in bits per second and not just in bits?

"My question is - by measuring by cars per second, or liters per second, does that mean that a longer highway, pipe or copper wire has a higher bandwidth than a shorter one?"
No!
Bandwidth is not about how many cars can fit on the road. It's about how many cars can pass a point on the road during a certain time. How many cars per second can pass under a bridge, for example.

No, it wouldn't. You quote a highway in terms of lanes, because it's more understandable, and a reasonable approximation to assume 4 lanes = 4x as much traffic. But even then, you might have a traffic jam, and then 4 lanes is 'transmitting' fewer cars per minute than it would otherwise.
With a hose pipe, the width of the pipe is the speed of transmission if you assume the same water pressure.
These assumptions don't apply to communications - when I'm transmitting 'a bit' nothing physical is moving *. A 'bit' is the smallest piece that 'information' can be broken down into, and in order to transmit it, something needs to change.
If I turn on my torch and shine it at you, I've sent one 'message' (my torch is on). To send you anything more detailed, I would need to turn it off and on again - morse code is an example of doing this. The pattern of switching it off and on gives you some letters. How fast I can switch it off and on again, is how fast I can send a message.
So it is with bandwidth. I need to change things to communicate. If I can change things faster, I can communicate faster.
"bits" would be a measure of the number of torches I own. Bits per second is how fast I can flick them on and off to send a message.
* Electrons and photons do move, as does air to carry sound. But the signal isn't the thing moving - I don't have to move an atom of air from my mouth to your ear to 'talk' to you, the wave propagates through the medium.

Performing Intra-frame Prediction in Matlab

I am trying to implement a hybrid video coding framework which is used in the H.264/MPEG-4 video standard for which I need to perform 'Intra-frame Prediction' and 'Inter Prediction' (which in other words is motion estimation) of a set of 30 frames for video processing in Matlab. I am working with Mother-daughter frames.
Please note that this post is very similar to my previously asked question but this one is solely based on Matlab computation.
Edit:
I am trying to implement the framework shown below:
My question is how to perform horizontal coding method which is one of the nine methods of Intra Coding framework? How are the pixels sampled?
What I find confusing is that Intra Prediction needs two inputs which are the 8x8 blocks of input frame and the 8x8 blocks of reconstructed frame. But what happens when coding the very first block of the input frame since there will be no reconstructed pixels to perform horizontal coding?
In the image above the whole system is a closed loop where do you start?
END:
Question 1: Is intra-predicted image only for the first image (I-frame) of the sequence or does it need to be computed for all 30 frames?
I know that there are five intra coding modes which are horizontal, vertical, DC, Left-up to right-down and right-up to left-down.
Question 2: How do I actually get around comparing the reconstructed frame and the anchor frame (original current frame)?
Question 3: Why do I need a search area? Can the individual 8x8 blocks be used as a search area done one pixel at a time?
I know that pixels from reconstructed block are used for comparing, but is it done one pixel at a time within the search area? If so wouldn't that be too time consuming if 30 frames are to be processed?

Continuing on from our previous post, let's answer one question at a time.
Question #1
Usually, you use one I-frame and denote this as the reference frame. Once you use this, for each 8 x 8 block that's in your reference frame, you take a look at the next frame and figure out where this 8 x 8 block best moved in this next frame. You describe this displacement as a motion vector and you construct a P-frame that consists of this information. This tells you where the 8 x 8 block from the reference frame best moved in this frame.
Now, the next question you may be asking is how many frames is it going to take before we decide to use another reference frame? This is entirely up to you, and you set this up in your decoder settings. For digital broadcast and DVD storage, it is recommended that you generate an I-frame every 0.5 seconds or so. Assuming 24 frames per second, this means that you would need to generate an I-frame every 12 frames. This Wikipedia article was where I got this reference.
As for the intra-coding modes, these tell the encoder in what direction you should look for when trying to find the best matching block. Actually, take a look at this paper that talks about the different prediction modes. Take a look at Figure 1, and it provides a very nice summary of the various prediction modes. In fact, there are nine all together. Also take a look at this Wikipedia article to get better pictorial representations of the different mechanisms of prediction as well. In order to get the best accuracy, they also do subpixel estimation at a 1/4 pixel accuracy by doing bilinear interpolation in between the pixels.
I'm not sure whether or not you need to implement just motion compensation with P-frames, or if you need B frames as well. I'm going to assume you'll be needing both. As such, take a look at this diagram I pulled off of Wikipedia:
Source: Wikipedia
This is a very common sequence for encoding frames in your video. It follows the format of:
IBBPBBPBBI...
There is a time axis at the bottom that tells you the sequence of frames that get sent to the decoder once you encode the frames. I-frames need to be encoded first, followed by P-frames, and then B-frames. A typical sequence of frames that are encoded in between the I-frames follow this format that you see in the figure. The chunk of frames in between I-frames is what is known as a Group of Pictures (GOP). If you remember from our previous post, B-frames use information from ahead and from behind its current position. As such, to summarize the timeline, this is what is usually done on the encoder side:
The I-frame is encoded, and then is used to predict the first P-frame
The first I-frame and the first P-frame are then used to predict the first and second B-frame that are in between these frames
The second P-frame is predicted using the first P-frame, and the third and fourth B-frames are created using information between the first P-frame and the second P-frame
Finally, the last frame in the GOP is an I-frame. This is encoded, then information between the second P-frame and the second I-frame (last frame) are used to generate the fifth and sixth B-frames
Therefore, what needs to happen is that you send I-frames first, then the P-frames, and then the B-frames after. The decoder has to wait for the P-frames before the B-frames can be reconstructed. However, this method of decoding is more robust because:
It minimizes the problem of possible uncovered areas.
P-frames and B-frames need less data than I-frames, so less data is transmitted.
However, B-frames will require more motion vectors, and so there will be some higher bit rates here.
Question #2
Honestly, what I have seen people do is do a simple Sum-of-Squared Differences between one frame and another to compare similarity. You take your colour components (whether it be RGB, YUV, etc.) for each pixel from one frame in one position, subtract these with the colour components in the same spatial location in the other frame, square each component and add them all together. You accumulate all of these differences for every location in your frame. The higher the value, the more dissimilar this is between the one frame and the next.
Another measure that is well known is called Structural Similarity where some statistical measures such as mean and variance are used to assess how similar two frames are.
There are a whole bunch of other video quality metrics that are used, and there are advantages and disadvantages when using any of them. Rather than telling you which one to use, I defer you to this Wikipedia article so you can decide which one to use for yourself depending on your application. This Wikipedia article describes a whole bunch of similarity and video quality metrics, and the buck doesn't stop there. There is still on-going research on what numerical measures best capture the similarity and quality between two frames.
Question #3
When searching for the best block from an I-frame that has moved in a P-frame, you need to restrict the searching to a finite sized windowed area from the location of this I-frame block because you don't want the encoder to search all of the locations in the frame. This would simply be too computationally intensive and would thus make your decoder slow. I actually mentioned this in our previous post.
Using one pixel to search for another pixel in the next frame is a very bad idea because of the minuscule amount of information that this single pixel contains. The reason why you compare blocks at a time when doing motion estimation is because usually, blocks of pixels have a lot of variation inside the blocks which are unique to the block itself. If we can find this same variation in another area in your next frame, then this is a very good candidate that this group of pixels moved together to this new block. Remember, we're assuming that the frame rate for video is adequately high enough so that most of the pixels in your frame either don't move at all, or move very slowly. Using blocks allows the matching to be somewhat more accurate.
Blocks are compared at a time, and the way blocks are compared is using one of those video similarity measures that I talked about in the Wikipedia article I referenced. You are certainly correct in that doing this for 30 frames would indeed be slow, but there are implementations that exist that are highly optimized to do the encoding very fast. One good example is FFMPEG. In fact, I use FFMPEG at work all the time. FFMPEG is highly customizable, and you can create an encoder / decoder that takes advantage of the architecture of your system. I have it set up so that encoding / decoding uses all of the cores on my machine (8 in total).
This doesn't really answer the actual block comparison itself. Actually, the H.264 standard has a bunch of prediction mechanisms in place so that you're not looking at all of the blocks in an I-frame to predict the next P-frame (or one P-frame to the next P-frame, etc.). This alludes to the different prediction modes in the Wikipedia article and in the paper that I referred you to. The encoder is intelligent enough to detect a pattern, and then generalize an area of your image where it believes that this will exhibit the same amount of motion. It skips this area and moves onto the next.
This assignment (in my opinion) is way too broad. There are so many intricacies in doing motion prediction / compensation that there is a reason why most video engineers already use available tools to do the work for us. Why invent the wheel when it has already been perfected, right?
I hope this has adequately answered your questions. I believe that I have given you more questions than answers really, but I hope that this is enough for you to delve into this topic further to achieve your overall goal.
Good luck!

Question 1: Is intra-predicted image only for the first image (I-frame) of the sequence or does it need to be computed for all 30 frames?
I know that there are five intra coding modes which are horizontal, vertical, DC, Left-up to right-down and right-up to left-down.
Answer: intra prediction need not be used for all the frames.
Question 2: How do I actually get around comparing the reconstructed frame and the anchor frame (original current frame)?
Question 3: Why do I need a search area? Can the individual 8x8 blocks be used as a search area done one pixel at a time?
Answer: we need to use the block matching algo for finding the motion vector. so search area is reqd. Normally size of the search area should be larger than the block size. larger the search area, more the computation and higher the accuracy.

Separation of singing voice from music [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
I want to know how to perform "spectral change detection" for the classification of vocal & non-vocal segments of a song. We need to find the spectral changes from a spectrogram. Any elaborate information about this, particularly involving MATLAB?

Separating out distinct signals from audio is a very active area of research, and it is a very hard problem. This is often called Blind Signal Separation in the literature. (There is some MATLAB demo code in the previous link.
Of course, if you know that there is vocal in the music, you can use one of the many vocal separation algorithms.

As others have noted, solving this problem using only raw spectrum analysis is a dauntingly hard problem, and you're unlikely to find a good solution to it. At best, you might be able to extract some of the vocals and a few extra crossover frequencies from the mix.
However, if you can be more specific about the nature of the audio material you are working with here, you might be able to get a little bit further.
In the worst case, your material would be normal mp3's of regular songs -- ie, a full band + vocalist. I have a feeling that this is the case you are probably looking at given the nature of your question.
In the best case, you have access to the multitrack studio recordings and have at least a full mixdown and an instrumental track, in which case you could extract the vocal frequencies from the mix. You would do this by generating an impulse response from one of the tracks and applying it to the other.
In the middle case, you are dealing with simple music which you could apply some sort of algorithm tuned to the parameters of the music to. For instance, if you are dealing with electronic music, you can use to your advantage the stereo width of the track to eliminate all mono elements (ie, basslines + kicks) to extract the vocals + other panned instruments, and then apply some type of filtering and spectrum analysis from there.
In short, if you are planning on making an all-purpose algorithm to generate clean acapella cuts from arbitrary source material, you're probably biting off more than you can chew here. If you can specifically limit your source material, then you have a number of algorithms at your disposal depending on the nature of those sources.

This is hard. If you can do this reliably you will be an accomplished computer scientist. The most promising method I read about used the lyrics to generate a voice only track for comparison. Again, if you can do this and write a paper about it you will be famous (amongst computer scientists). Plus you could make a lot of money by automatically generating timings for karaoke.

If you just want to decide wether a block of music is clean a-capella or with instrumental background, you could probably do that by comparing the bandwidth of the signal to a normal human singer bandwidth. Also, you could check for the base frequency, which can only be in a pretty limited frequency range for human voices.
Still, it probably won't be easy. However, hearing aids do this all the time, so it is clearly doable. (Though they typically search for speech, not singing)

first sync the instrumental with the original, make sure they are the same length and bitrate and start and end at the exact time and convert them to .wav
then do something like
I = wavread(instrumental.wav);
N = wavread(normal.wav);
i = inv(I);
A = (N - i); // it could be A = (N * i) or A = (N + i) you'll have to play around
wavwrite(A, acapella.wav)
that should do it.. a little linear algebra goes a long way.

Creating music visualizer [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
So how does someone create a music visualizer? I've looked on Google but I haven't really found anything that talks about the actual programming; mostly just links to plug-ins or visualizing applications.
I use iTunes but I realize that I need Xcode to program for that (I'm currently deployed in Iraq and can't download that large of a file). So right now I'm just interested in learning "the theory" behind it, like processing the frequencies and whatever else is required.

As a visualizer plays a song file, it reads the audio data in very short time slices (usually less than 20 milliseconds). The visualizer does a Fourier transform on each slice, extracting the frequency components, and updates the visual display using the frequency information.
How the visual display is updated in response to the frequency info is up to the programmer. Generally, the graphics methods have to be extremely fast and lightweight in order to update the visuals in time with the music (and not bog down the PC). In the early days (and still), visualizers often modified the color palette in Windows directly to achieve some pretty cool effects.
One characteristic of frequency-component-based visualizers is that they don't often seem to respond to the "beats" of music (like percussion hits, for example) very well. More interesting and responsive visualizers can be written that combine the frequency-domain information with an awareness of "spikes" in the audio that often correspond to percussion hits.

For creating BeatHarness ( http://www.beatharness.com ) I've 'simply' used an FFT to get the audiospectrum, then use some filtering and edge / onset-detectors.
About the Fast Fourier Transform :
http://en.wikipedia.org/wiki/Fast_Fourier_transform
If you're accustomed to math you might want to read Paul Bourke's page :
http://local.wasp.uwa.edu.au/~pbourke/miscellaneous/dft/
(Paul Bourke is a name you want to google anyway, he has a lot of information about topics you either want to know right now or probably in the next 2 years ;))
If you want to read about beat/tempo-detection google for Masataka Goto, he's written some interesting papers about it.
Edit:
His homepage : http://staff.aist.go.jp/m.goto/
Interesting read : http://staff.aist.go.jp/m.goto/PROJ/bts.html
Once you have some values for for example bass, midtones, treble and volume(left and right),
it's all up to your imagination what to do with them.
Display a picture, multiply the size by the bass for example - you'll get a picture that'll zoom in on the beat, etc.

Typically, you take a certain amount of the audio data, run a frequency analysis over it, and use that data to modify some graphic that's being displayed over and over. The obvious way to do the frequency analysis is with an FFT, but simple tone detection can work just as well, with a lower lower computational overhead.
So, for example, you write a routine that continually draws a series of shapes arranged in a circle. You then use the dominant frequencies to determine the color of the circles, and use the volume to set the size.

There are a variety of ways of processing the audio data, the simplest of which is just to display it as a rapidly changing waveform, and then apply some graphical effect to that. Similarly, things like the volume can be calculated (and passed as a parameter to some graphics routine) without doing a Fast Fourier Transform to get frequencies: just calculate the average amplitude of the signal.
Converting the data to the frequency domain using an FFT or otherwise allows more sophisticated effects, including things like spectrograms. It's deceptively tricky though to detect even quite 'obvious' things like the timing of drum beats or the pitch of notes directly from the FFT output
Reliable beat-detection and tone-detection are hard problems, especially in real time. I'm no expert, but this page runs through some simple example algorithms and their results.

Devise an algorithm to draw something interesting on the screen given a set of variables
Devise a way to convert an audio stream into a set of variables analysing things such as beats/minute frequency different frequency ranges, tone etc.
Plug the variables into your algorithm and watch it draw.
A simple visualization would be one that changed the colour of the screen every time the music went over a certain freq threshhold. or to just write the bpm onto the screen. or just displaying an ociliscope.
check out this wikipedia article

Like suggested by #Pragmaticyankee processing is indeed an interesting way to visualize your music. You could load your music in Ableton Live, and use an EQ to filter out the high, middle and low frequencies from your music. You could then use a VST follwoing plugin to convert audio enveloppes into MIDI CC messages, such as Gatefish by Mokafix Audio (works on windows) or PizMidi’s midiAudioToCC plugin (works on mac). You can then send these MIDI CC messages to a light-emitting hardware tool that supports MIDI, for instance percussa audiocubes. You could use a cube for every frequency you want to display, and assign a color to the cube. Have a look at this post:
http://www.percussa.com/2012/08/18/how-do-i-generate-rgb-light-effects-using-audio-signals-featured-question/

We have lately added DirectSound-based audio data input routines in LightningChart data visualization library. LightningChart SDK is set of components for Visual Studio .NET (WPF and WinForms), you may find it useful.
With AudioInput component, you can get real-time waveform data samples from sound device. You can play the sound from any source, like Spotify, WinAmp, CD/DVD player, or use mic-in connector.
With SpectrumCalculator component, you can get power spectrum (FFT conversion) that is handy in many visualizations.
With LightningChartUltimate component you can visualize data in many different forms, like waveform graphs, bar graphs, heatmaps, spectrograms, 3D spectrograms, 3D lines etc. and they can be combined. All rendering takes place through Direct3D acceleration.
Our own examples in the SDK have a scientific approach, not really having much entertainment aspect, but it definitely can be used for awesome entertainment visualizations too.
We have also configurable SignalGenerator (sweeps, multi-channel configurations, sines, squares, triangles, and noise waveforms, WAV real-time streaming, and DirectX audio output components for sending wave data out from speakers or line-output.
[I'm CTO of LightningChart components, doing this stuff just because I like it :-) ]

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse