How do I plot Precision-Recall graphs for Content-Based Image Retrieval in MATLAB? - matlab

I am accessing 10 images from a folder "c1" and I have a query image. I have implemented code for loading the images into a cell array, and then I'm calculating the histogram intersection between the query image and each image from folder "c1" one by one. Now I want to draw a precision-recall curve, but I am not sure how to write code for getting a precision-recall curve using the data obtained from the histogram intersection.
My code:
Inp1=rgb2gray(imread('D:\visionImages\c1\1.ppm'));
figure, imshow(Inp1), title('Input image 1');
srcFiles = dir('D:\visionImages\c1\*.ppm'); % the folder in which images exists
for i = 1 : length(srcFiles)
    filename = strcat('D:\visionImages\c1\',srcFiles(i).name);
    I = imread(filename);
    I = rgb2gray(I);
    Seq{i} = I;
end
for i = 1 : length(srcFiles) % loop for calculating histogram intersections
    A = Seq{i};
    B = Inp1;
    a = size(A,2); b = size(B,2);
    K = zeros(a, b);
    for j = 1:a
        Va = repmat(A(:,j),1,b);
        K(j,:) = 0.5*sum(Va + B - abs(Va - B));
    end
end

Precision-Recall graphs measure the accuracy of your image retrieval system. They're also used to evaluate the performance of any search engine, whether for text or documents, and in machine learning evaluation more generally, though ROC curves are more commonly used there.
Precision-Recall graphs are more suitable for document and data retrieval. For the case of images here, given your query image, you are measuring how similar this image is with the rest of the images in your database. You then have similarity measures for each of the database images in relation to your query image and then you sort these similarities in descending order. For a good retrieval system, you would want the images that are the most relevant (i.e. what you are searching for) to all appear at the beginning, while the irrelevant images would appear after.
Precision
The definition of Precision is the ratio of the number of relevant images retrieved to the total number of relevant and irrelevant images retrieved. In other words, suppose A is the number of relevant images retrieved and B is the number of irrelevant images retrieved. When calculating precision, you look at the first several images, and that amount is A + B, as the total number of relevant and irrelevant images is how many images you are considering at this point. Precision can therefore be described as the ratio of how many relevant images you have retrieved so far out of the bunch that you have grabbed:
Precision = A / (A + B)
Recall
The definition of Recall is slightly different. It evaluates how many of the relevant images you have retrieved so far out of a known total, which is the total number of relevant images that exist. As such, say you again look at the first several images. You then determine how many of those retrieved images are relevant, and compare that against the total number of relevant images in the database. In other words, Recall is the ratio of how many relevant images you have retrieved overall. Supposing that A is again the number of relevant images you have retrieved out of the bunch you have grabbed from the database, and C is the total number of relevant images in your database, Recall is defined as:
Recall = A / C
How you calculate this in MATLAB is actually quite easy. You first need to know how many relevant images are in your database. After that, you need the similarity measures assigned to each database image with respect to the query image. Once you compute these, you need to know which similarity measures map to which relevant images in your database. I don't see that in your code, so I will leave that to you. Once you have that, you sort the similarity values, then determine where in the sorted order the relevant images occur. You then use these positions to calculate your precision and recall.
I'll provide a toy example so I can show you what the graph looks like, as it isn't quite clear how you're calculating your similarities here. Let's say I have 5 relevant images in a database of 20, and I have a bunch of similarity values between each database image and the query image:
rng(123); %// Set seed for reproducibility
num_images = 20;
sims = rand(1,num_images);
sims =
Columns 1 through 13
0.6965 0.2861 0.2269 0.5513 0.7195 0.4231 0.9808 0.6848 0.4809 0.3921 0.3432 0.7290 0.4386
Columns 14 through 20
0.0597 0.3980 0.7380 0.1825 0.1755 0.5316 0.5318
Also, I know that images [1 5 7 9 12] are my relevant images.
relevant_IDs = [1 5 7 9 12];
num_relevant_images = numel(relevant_IDs);
Now let's sort the similarity values in descending order, as higher values mean higher similarity. You'd reverse this if you were calculating a dissimilarity measure:
[sorted_sims, locs] = sort(sims, 'descend');
locs now contains the ranking of the images; that is, locs(k) tells you which image is the kth most similar to the query. sorted_sims has the similarities sorted in descending order:
sorted_sims =
Columns 1 through 13
0.9808 0.7380 0.7290 0.7195 0.6965 0.6848 0.5513 0.5318 0.5316 0.4809 0.4386 0.4231 0.3980
Columns 14 through 20
0.3921 0.3432 0.2861 0.2269 0.1825 0.1755 0.0597
locs =
7 16 12 5 1 8 4 20 19 9 13 6 15 10 11 2 3 17 18 14
Therefore, the 7th image is the highest ranked image, followed by the 16th image as the second highest, and so on. What you need to do now is, for each image that you know is relevant, figure out where it is located after sorting. We go through each of the image IDs that we know are relevant and find where each one occurs in the locations array above:
locations_final = arrayfun(@(x) find(locs == x, 1), relevant_IDs)
locations_final =
5 4 1 10 3
Let's sort these to get a better understanding of what this is saying:
locations_sorted = sort(locations_final)
locations_sorted =
1 3 4 5 10
These locations above now tell you the order in which the relevant images will appear. As such, the first relevant image will appear first, the second relevant image will appear in the third position, the third relevant image appears in the fourth position and so on. These precisely correspond to part of the definition of Precision. For example, in the last position of locations_sorted, it would take ten images to retrieve all of the relevant images (5) in our database. Similarly, it would take five images to retrieve four relevant images in the database. As such, you would compute precision like this:
precision = (1:num_relevant_images) ./ locations_sorted;
Similarly for recall, it's simply the ratio of how many images were retrieved so far from the total, and so it would just be:
recall = (1:num_relevant_images) / num_relevant_images;
Your Precision-Recall graph would now look like the following, with Recall on the x-axis and Precision on the y-axis:
plot(recall, precision, 'b.-');
xlabel('Recall');
ylabel('Precision');
title('Precision-Recall Graph - Toy Example');
axis([0 1 0 1.05]); %// Adjust axes for better viewing
grid;
This is the graph I get:
You'll notice that between a recall ratio of 0.4 to 0.8 the precision is increasing a bit. This is because you have managed to retrieve a successive chain of images without touching any of the irrelevant ones, and so your precision will naturally increase. It goes way down after the last image, as you've had to retrieve so many irrelevant images before finally hitting a relevant image.
You'll also notice that precision and recall trade off against each other: when precision is high, recall tends to be low, and as recall increases, precision tends to decrease.
The first part makes sense because if you don't retrieve that many images in the beginning, you have a greater chance of not including irrelevant images in your results, but at the same time the number of relevant images retrieved is rather small. This is why recall is low while precision is high.
The second part also makes sense because as you keep trying to retrieve more images in your database, you'll inevitably be able to retrieve all of the relevant ones, but you'll most likely start to include more irrelevant images, which would thus drive your precision down.
In an ideal world, if you had N relevant images in your database, you would want to see all of these images in the top N most similar spots. As such, this would make your precision-recall graph a flat horizontal line hovering at y = 1, which means that you've managed to retrieve all of your relevant images in the top spots without retrieving any irrelevant images. Unfortunately, that's never going to happen (or at least not for now...), as trying to figure out the best features for CBIR is still an ongoing investigation, and no image search engine that I have seen has managed to get this perfect. This is still one of the broadest unsolved computer vision problems that exist today!
Edit
You retrieved this code to compute histogram intersection from this post. They have a neat way of computing histogram intersection as:
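For reference, the histogram intersection of two histograms h_A and h_B with n bins is sum over i = 1..n of min(h_A(i), h_B(i)); since min(a,b) = 0.5*(a + b - |a - b|), the code from that post (which also appears in your question) computes exactly this:
a = size(A,2); b = size(B,2);
K = zeros(a, b);
for j = 1:a
    Va = repmat(A(:,j),1,b);
    K(j,:) = 0.5*sum(Va + B - abs(Va - B));
end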
n is the total number of bins in your histogram. You'll have to play around with this to get good results, but we can leave it as a parameter in your code. The code above assumes that you have two matrices A and B where each column is a histogram. It generates a matrix of size a x b, where a is the number of columns in A and b is the number of columns in B. Entry (i,j) of this matrix tells you the similarity between the ith column (histogram) in A and the jth column in B. In your case, A would be a single column that denotes the histogram of your query image, and B would be a 10-column matrix that holds the histograms of the database images. Therefore, we will get a 1 x 10 array of similarity measures through histogram intersection.
As such, we need to modify your code so that you're using imhist for each of the images. We can also specify an additional parameter that gives you how many bins each histogram will have. Therefore, your code will look like this. Each new line that I have placed will have a %// NEW comment beside each line.
Inp1=rgb2gray(imread('D:\visionImages\c1\1.ppm'));
figure, imshow(Inp1), title('Input image 1');
num_bins = 32; %// NEW - I'm specifying 32 bins here. Play around with this yourself
A = imhist(Inp1, num_bins); %// NEW - Calculate histogram
srcFiles = dir('D:\visionImages\c1\*.ppm'); % the folder in which images exists
B = zeros(num_bins, length(srcFiles)); %// NEW - Store histograms of database images
for i = 1 : length(srcFiles)
    filename = strcat('D:\visionImages\c1\',srcFiles(i).name);
    I = imread(filename);
    I = rgb2gray(I);
    B(:,i) = imhist(I, num_bins); %// NEW - Put each histogram in a separate column
end
%// NEW - Taken directly from the website
%// but modified for only one histogram in `A`
b = size(B,2);
Va = repmat(A, 1, b);
K = 0.5*sum(Va + B - abs(Va - B));
Take note that I have copied the code from the website, but I have modified it because there is only one image in A and so there is some code that isn't necessary.
K should now be a 1 x 10 array of histogram intersection similarities. You would then use K and assign sims to this variable (i.e. sims = K;) in the code I have written above, then run through your images. You also need to know which images are relevant images, and you'd have to change the code I've written to reflect that.
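For example, a minimal sketch that ties the two parts together; the relevant image IDs below are purely hypothetical, so substitute your own:
sims = K;                          %// similarity of each database image to the query image
relevant_IDs = [1 3 7];            %// hypothetical relevant image IDs - replace with your own
num_relevant_images = numel(relevant_IDs);
[~, locs] = sort(sims, 'descend');                                  %// rank images by similarity
locations_sorted = sort(arrayfun(@(x) find(locs == x, 1), relevant_IDs));
precision = (1:num_relevant_images) ./ locations_sorted;
recall = (1:num_relevant_images) / num_relevant_images;
plot(recall, precision, 'b.-'); xlabel('Recall'); ylabel('Precision');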
Hope this helps!

Related

How to identify an optimal subsample from a data set with missing values in MATLAB

I would like to identify the largest possible contiguous subsample of a large data set. My data set consists of roughly 15,000 financial time series of up to 360 periods in length. I have imported the data into MATLAB as a 360 by 15,000 numerical matrix.
This matrix contains a lot of NaNs due to some of the financial data not being available for the entire period. In the illustration, NaN entries are shown in dark blue, and non-NaN entries appear in light blue. It is these light blue non-NaN entries which I would like to ideally combine into an optimal subsample.
I would like to find the largest possible contiguous block of data that is contained in my matrix, while ensuring that my matrix contains a sufficient number of periods.
In a first step I would like to sort my matrix from left to right in descending order by the number of non-NaN entries in each column, that is, I would like to sort by the vector obtained by entering sum(~isnan(data),1).
In a second step I would like to find the sub-array of my data matrix that is at least 72 entries along the first dimension and is otherwise as large as possible, measured by the total number of entries.
What is the best way to implement this?
A big warning (may or may not apply depending on context)
As Oleg mentioned, when an observation is missing from a financial time series, it's often missing for a reason: e.g. the entity went bankrupt, the entity was delisted, or the instrument did not trade (i.e. it was illiquid). Constructing a sample without NaNs is likely equivalent to constructing a sample where none of these events occur!
For example, if this were hedge fund return data, selecting a sample without NaNs would exclude funds that blew up and ceased trading. Excluding imploded funds would bias estimates of expected returns upwards and estimates of variance or covariance downwards.
Picking a sample period with the fewest time series with NaNs would also exclude periods like the 2008 financial crisis, which may or may not make sense. Excluding 2008 could lead to an underestimate of how haywire things could get (though including it could lead to overestimate the probability of certain rare events).
Some things to do:
Pick a sample period as long as possible but be aware of the limitations.
Do your best to handle survivorship bias: eg. if NaNs represent delisting events, try to get some kind of delisting return.
You almost certainly will have an unbalanced panel with missing observations, and your algorithm will have to deal with that.
Another general finance / panel data point: selecting a sample at some time point t and then following it into the future is perfectly OK, but selecting a sample based upon what happens during or after the sample period can be incredibly misleading.
Code that does what you asked:
This should do what you asked and be quite fast. Be aware of the problems discussed above, though, if whether an observation is missing is not random and not orthogonal to what you care about.
Inputs are a T by n sized matrix X:
T = 360; % number of time periods (i.e. rows) in X
n = 15000; % number of time series (i.e. columns) in X
T_subsample = 72; % desired length of sample (i.e. rows of newX)
% number of possible starting points for series of length T_subsample
nancount_periods = T - T_subsample + 1;
nancount = zeros(n, nancount_periods, 'int32'); % will hold a count of NaNs
X_isnan = int32(isnan(X));
nancount(:,1) = sum(X_isnan(1:T_subsample, :))'; % initialize
% We need to obtain a count of nans in T_subsample sized window for each
% possible time period
j = 1;
for i=T_subsample + 1:T
    % One pass: add new period in the window and subtract period no longer in the window
    nancount(:,j+1) = nancount(:,j) + X_isnan(i,:)' - X_isnan(j,:)';
    j = j + 1;
end
indicator = nancount==0; % indicator of whether starting_period, series
% has no NaNs
% number of nonan series of length T_subsample by starting period
max_subsample_size_by_starting_period = sum(indicator);
max_subsample_size = max(max_subsample_size_by_starting_period);
% find the best starting period
starting_period = find(max_subsample_size_by_starting_period==max_subsample_size, 1);
ending_period = starting_period + T_subsample - 1;
columns_mask = indicator(:,starting_period);
columns = find(columns_mask); %holds the column ids we are using
newX = X(starting_period:ending_period, columns_mask);
Here's an idea:
Assuming you can rearrange the series, calculate the pairwise distance between them (you decide the metric, but if you're only looking at NaN vs. non-NaN, Hamming distance is fine).
Now hierarchically cluster the series and rearrange them using either a dendrogram
or http://www.mathworks.com/help/bioinfo/examples/working-with-the-clustergram-function.html
You should probably prune any series that doesn't have a minimum number of non-NaN values before you start.
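A minimal sketch of this idea, assuming the 360-by-15000 data matrix is called X (as in the answer above), that series with fewer than 72 non-NaN values are pruned first, and that the Statistics Toolbox is available (for 15,000 series the pairwise distance matrix is large, so you may want to start with a subset):
mask = isnan(X);                        % NaN pattern, T x n
keep = sum(~mask, 1) >= 72;             % prune series with too few non-NaN values
mask = mask(:, keep);
D = pdist(double(mask'), 'hamming');    % pairwise Hamming distance between the series' NaN patterns
Z = linkage(D, 'average');              % hierarchical clustering
[~, ~, perm] = dendrogram(Z, 0);        % leaf order of the dendrogram
Xr = X(:, keep);
Xr = Xr(:, perm);                       % similar NaN patterns now sit next to each other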
First, I have only a little insight into financial mathematics. As I understand it, you want to find the longest continuous chain of non-NaN values for each time series, sort the time series by the length of this chain, and discard every time series whose chain is below a threshold. This can be done using
data = rand(360,15e3);
data(abs(data) <= 0.02) = NaN;
%% sort and chop data based on amount of consecutive non-NaN values
binary_data = ~isnan(data);
% find edges, denote their type and calculate the biggest chunk in each
% column
edges = [2*binary_data(1,:)-1; diff(binary_data, 1)];
chunk_size = diff(find(edges));
chunk_size(end+1) = numel(edges)-sum(chunk_size);
[row, ~, id] = find(edges);
num_row_elements = diff(find(row == 1));
num_row_elements(end+1) = numel(chunk_size) - sum(num_row_elements);
%a chunk of NaN has a -1 in id, a chunk of non-NaN a 1
chunks_per_row = mat2cell(chunk_size .* id,num_row_elements,1);
% sort by largest consecutive block of non-NaNs
max_size = cellfun(@max, chunks_per_row);
[max_size_sorted, idx] = sort(max_size, 'descend');
data_sorted = data(:,idx);
% remove all elements that only have block sizes smaller then some number
some_number = 20;
data_sort_chop = data_sorted(:,max_size_sorted >= some_number);
Note that this can be done a lot more simply if the order of periods within a time series doesn't matter, i.e. if data([1 2 3],id) and data([3 1 2],id) are considered identical.
What I do not know is whether you want to discard all periods within a time series that don't belong to the longest chain, extract all those chains as individual time series, ...
Feel free to drop a comment if it has to be more specific.

Using SURF algorithm to match objects on MATLAB

The objective is to see if two images, each of which has one object captured in it, match.
The object/image that I have stored will be used as a baseline:
item1 (this is what is being matched against in the code)
The object/image that needs to be matched is:
input (I need to see if this matches what is stored)
My method:
Convert images to gray-scale.
Extract SURF interest points.
Obtain features.
Match features.
Get the 50 strongest features.
Match the number of strongest features with each image.
Take the ratio of the number of features matched to the number of strongest features (which is 50).
If I have two images of the same object (two images taken separately on a camera), ideally the ratio should be near 1 or near 100%.
However this is not the case, the best ratio I am getting is near 0.5 or even worse, 0.3.
I am aware the SURF detectors and features can be used in neural networks, or using a statistics based approach. I believe I have approached the statistics based approach to some extent by using 50 of the strongest features.
Is there something I am missing? What do I add onto this or how do I improve it? Please provide me a point to start from.
%Clearing the workspace and all variables
clc;
clear;
%ITEM 1
item1 = imread('Loreal.jpg');%Retrieve order 1 and digitize it.
item1Grey = rgb2gray(item1);%convert to grayscale, 2 dimensional matrix
item1KP = detectSURFFeatures(item1Grey,'MetricThreshold',600);%get SURF detectors or interest points
strong1 = item1KP.selectStrongest(50);
[item1Features, item1Points] = extractFeatures(item1Grey, strong1,'SURFSize',128); % using SURFSize of 128
%INPUT : Acquire Image
input= imread('MakeUp1.jpg');%Retrieve input and digitize it.
inputGrey = rgb2gray(input);%convert to grayscale, 2 dimensional matrix
inputKP = detectSURFFeatures(inputGrey,'MetricThreshold',600);%get SURF detectors or interest points
strongInput = inputKP.selectStrongest(50);
[inputFeatures, inputPoints] = extractFeatures(inputGrey, strongInput,'SURFSize',128); % using SURFSize of 128
pairs = matchFeatures(item1Features, inputFeatures, 'MaxRatio',1); %matching SURF Features
totalFeatures = length(item1Features); %baseline number of features
numPairs = length(pairs); %the number of pairs
percentage = numPairs/50;
if percentage >= 0.49
    disp('We have this');
else
    disp('We do not have this');
    disp(percentage);
end
The baseline image
The input image
I would try not doing selectStrongest and not setting MaxRatio. Just call matchFeatures with the default options and compare the number of resulting matches.
The default behavior of matchFeatures is to use the ratio test to exclude ambiguous matches. So the number of matches it returns may be a good indicator of the presence or absence of the object in the scene.
If you want to try something more sophisticated, take a look at this example.
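For example, a minimal sketch of that suggestion, reusing the variable names from the question; the match-count threshold of 10 is purely illustrative and would need tuning on your own images:
item1Features = extractFeatures(item1Grey, item1KP);   %use all detected points, default options
inputFeatures = extractFeatures(inputGrey, inputKP);
pairs = matchFeatures(item1Features, inputFeatures);   %default ratio test excludes ambiguous matches
numMatches = size(pairs, 1);
if numMatches >= 10                                    %illustrative threshold, not a recommendation
    disp('We have this');
else
    disp('We do not have this');
    disp(numMatches);
end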

MATLAB - histograms of equal size and histogram overlap

An issue I've come across multiple times is wanting to take two similar data sets and create histograms from them where the bins are identical, so as to easily calculate things like histogram overlap.
You can define the number of bins (obviously) using
[counts, bins] = hist(data,number_of_bins)
But there's not an obvious way (as far as I can see) to make the bin sizes equal across several different data sets. If I remember correctly, when I initially looked I found various people who seemed to have the same issue, but no good solutions.
The right, easy way
As pointed out by horchler, this can easily be achieved using either histc (which lets you define your bins vector), or vectorizing your histogram input into hist.
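For example, a minimal sketch of the histc approach, reusing the A and B vectors from the example further down and 500 bins shared by both data sets:
edges = linspace(min([A B]), max([A B]), 501);  % 501 edges define 500 bins common to both data sets
counts_A = histc(A, edges);
counts_B = histc(B, edges);
% histc puts values exactly equal to the last edge into an extra trailing bin; fold it back in
counts_A(end-1) = counts_A(end-1) + counts_A(end); counts_A(end) = [];
counts_B(end-1) = counts_B(end-1) + counts_B(end); counts_B(end) = [];
bins = edges(1:end-1) + diff(edges)/2;          % common bin centres, handy for bar plots
overlap = sum(min(counts_A, counts_B));         % histogram overlap computed on identical bins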
The wrong, stupid way
I'm leaving the below as a reminder to others that even stupid questions can yield worthwhile answers.
I've been using the following approach for a while, so figured it might be useful for others (or, someone can very quickly point out the correct way to do this!).
The general approach relies on the fact that MATLAB's hist function defines an equally spaced number of bins between the largest and smallest value in your sample. So, if you append to each of your samples a start (smallest) and end (largest) value that are the min and max over all samples of interest, this forces the histogram range to be equal for all your data sets. You can then truncate the first and last values to recreate your original data.
For example, create the following data set
A = randn(1,2000)+7
B = randn(1,2000)+9
[counts_A, bins_A] = hist(A, 500);
[counts_B, bins_B] = hist(B, 500);
Here for my specific data sets I get the following results
bins_A(1) % 3.8127 (this is also min(A) )
bins_A(500) % 10.3081 (this is also max(A) )
bins_B(1) % 5.6310 (this is also min(B) )
bins_B(500) % 13.0254 (this is also max(B) )
To create equal bins you can simply first define a min and max value which lie slightly outside the ranges of both data sets:
topval = max([max(A) max(B)])+0.05;
bottomval = min([min(A) min(B)])-0.05;
The addition/subtraction of 0.05 is based on knowledge of the range of values - you don't want your extra bin to be too far or too close to the actual range. That being said, for this example by using the joint min/max values this code will work irrespective of the A and B values generated.
Now we re-create histogram counts and bins using (note the extra 2 bins are for our new largest and smallest value)
[counts_Ae, bins_Ae] = hist([bottomval, A, topval], 502);
[counts_Be, bins_Be] = hist([bottomval, B, topval], 502);
Finally, you truncate the first and last bin and value entries to recreate your original sample exactly
bins_A = bins_Ae(2:501)
bins_B = bins_Be(2:501)
counts_A = counts_Ae(2:501)
counts_B = counts_Be(2:501)
Now
bins_A(1) % 3.7655
bins_A(500) % 13.0735
bins_B(1) % 3.7655
bins_B(500) % 13.0735
From this you can easily plot both histograms again
bar([bins_A;bins_B]', [counts_A;counts_B]')
And also plot the histogram overlap with ease
bar(bins_A, 0.5*((counts_A+counts_B)-abs(counts_A-counts_B))) % equals min(counts_A, counts_B), the per-bin overlap

Creating image profiles in some parts of the image

I've been struggling with a problem in MATLAB for a while. :)
I have an image (A.tif) in which I would like to find maxima (with a defined threshold), or more specifically the coordinates of these maxima. My goal is to create short profiles on the image crossing these maxima (let's say +/- 20 pixels on both sides of each maximum).
I tried this:
[r c]=find(A==max(max(A)));
I suppose that r and c are the coordinates of the maximum (only the first one, or every maximum?).
How can I feed these coordinates into, for example, the improfile function?
I think it should be done using nested loops?
Thanks for every suggestion
Your code is working, but it finds only the global maximum coordinates. I would like to find multiple maxima (with a defined threshold) and properly address their coordinates to create multiple profiles crossing every maximum found. I have a little problem with the improfile function:
improfile(IMAGE,[starting point],[ending point])
Let's say that I get a [rows, columns] matrix with the coordinates of each maximum, and I'm trying to create a one-direction profile which starts in the same row as the maximum (about 20 pixels before the max) and of course ends in the same row (also about 20 pixels after the max).
Is this the correct expression: improfile(IMAGE,[rows columns-20],[rows columns+20])? It plots something, but it seems to only join the maxima rather than making intensity profiles.
You're not giving enough information so I had to guess a few things. You should apply the max() to the vectorized image and store the index:
[~,idx] = max(I(:))
Then transform this into row/column (y and x) coordinates:
[iy,ix] = ind2sub(size(I),idx)
These are the y (row) and x (column) coordinates of the maximum of the image. It really depends on what profile section you want. Something like this works:
I = imread('peppers.png');
Ir = I(:,:,1);
[~,idx]=max(Ir(:))
[iy,ix]=ind2sub(size(Ir),idx) % iy = row (y), ix = column (x)
improfile(Ir,[0 ix],[iy iy])
EDIT:
If you want to instead find the k largest values and not just the maximum you can do an easy sort:
[~,idx] = sort(I(:),'descend');
idxk = idx(1:k);
[iy,ix] = ind2sub(size(I),idxk) % iy = rows (y), ix = columns (x) of the k largest values
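And to produce the +/- 20 pixel profiles asked about, here's a minimal sketch building on the above, using Ir from the earlier example; k and the assumption that each maximum lies at least 20 pixels from the left and right borders are mine:
k = 5;                                        % number of maxima to use (assumption)
[~,idx] = sort(Ir(:),'descend');
[rows, cols] = ind2sub(size(Ir), idx(1:k));   % row (y) and column (x) of the k largest values
figure; hold on;
for m = 1:k
    % horizontal profile through the m-th maximum: same row, columns -20..+20
    c = improfile(Ir, [cols(m)-20 cols(m)+20], [rows(m) rows(m)]);
    plot(c);
end
hold off;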
Please delete your "reply" and instead edit your original post where you define your problem better

Pruning data for better viewing on loglog graph - Matlab

Just wondering if anyone has any ideas about an issue I'm having.
I have a fair amount of data that needs to be displayed on one graph. Two theoretical lines that are bold and solid are displayed on top, then 10 experimental data sets that converge to these lines are graphed, each using a different identifier (eg the + or o or a square etc). These graphs are on a log scale that goes up to 1e6. The first few decades of the graph (< 1e3) look fine, but as all the datasets converge (> 1e3) it's really difficult to see what data is what.
There's over 1000 data points per decade, which I can prune linearly to an extent, but if I do this too much the lower end of the graph will suffer in resolution.
What I'd like to do is prune logarithmically, strongest at the high end, working back to 0. My question is: how can I get a logarithmically scaled index vector rather than a linear one?
My initial assumption was that as my data is linear, I could just use a linear index to prune, which led to something like this (but for all decades):
% grab indices per decade
ind12 = find(y >= 1e1 & y <= 1e2);
indlow = find(y < 1e2);
indhigh = find(y > 1e4);
ind23 = find(y > 1e2 & y <= 1e3);
ind34 = find(y > 1e3 & y <= 1e4);
% we want ind12 indexes in this decade, find spacing
tot23 = round(length(ind23)/length(ind12));
tot34 = round(length(ind34)/length(ind12));
% grab ones to keep
ind23keep = ind23(1):tot23:ind23(end);
ind34keep = ind34(1):tot34:ind34(end);
indnew = [indlow' ind23keep ind34keep indhigh'];
loglog(x(indnew), y(indnew));
But this causes the prune to behave in a jumpy fashion obviously. Each decade has the number of points that I'd like, but as it's a linear distribution, the points tend to be clumped at the high end of the decade on the log scale.
Any ideas on how I can do this?
I think the easiest way to do this would be to use the LOGSPACE function to generate a set of indices into your data. For example, to create a set of 100 points logarithmically spaced from 1 to N (the number of points in your data), you can try the following:
indnew = round(logspace(0,log10(N),100)); %# Create the log-spaced index
indnew = unique(indnew); %# Remove duplicate indices
loglog(x(indnew),y(indnew)); %# Plot the indexed data
Creating a logarithmically-spaced index like this will result in fewer values being chosen from the end of the vector relative to the start, thus pruning values more severely towards the end of the vector and improving the appearance of the log plot. It would therefore be most effective with vectors that are sorted in ascending order.
The way I understand the problem is that your x-values are linearly spaced, so when you plot them logarithmically there are far more data points in the 'higher' decades, and the markers lie extremely close to one another. For example, if x goes from 1 to 1000, there are 10 points in the first decade, 90 in the second, and 900 in the third. You want to have, say, 3 points per decade instead.
I see two ways to solve the problem. The easier one is to use differently colored lines instead of different markers. Thus, you don't sacrifice any data points, and you can still distinguish everything.
The second solution is to create an unevenly spaced index. Here's how you can do that.
%# create some data
x = 1:1000;
y = 2.^x;
%# plot the graph and see the dots 'coalesce' very quickly
figure,loglog(x,y,'.')
%# for the example, I use a step size of 0.7, which is approximately log(2)
xx = 0.7:0.7:log(x(end)); %# this is where I want the data to be plotted
%# find the indices where we want to plot by finding the closest log(x) values
%# run unique to avoid multiples of the same index
indnew = unique(interp1(log(x),1:length(x),xx,'nearest'));
%# plot with fewer points
figure,loglog(x(indnew),y(indnew),'.')