Generate libsvm-formatted data from a text file - matlab

First, I'm not very experienced with data pre-processing. I was looking for the WebKB data in libsvm format. After searching a lot online, I came across this data, obtained after stemming and stop-word removal. The format is as follows:
Each line represents a vector. The first word on each line is the class name, followed by a space-delimited list of words that form the features.
How do I convert such a text file to libsvm format? Is there any Weka or MATLAB tool to construct it?

LibShortText 1.1 is a Python module with utilities for this purpose, plus many extra features. Try it; I think scikit-learn also has this functionality.
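If you would rather not pull in a library, the conversion itself is small. Here is a minimal Python sketch, assuming the input format described in the question (class name first, then space-separated words). The function name to_libsvm is hypothetical, and two choices are my own assumptions rather than anything the dataset prescribes: feature values are raw term counts, and integer ids for classes and words are assigned in order of first appearance.

```python
from collections import Counter


def to_libsvm(lines):
    """Convert 'label word word ...' lines into libsvm-format lines.

    Assumptions (not from the dataset): values are term counts, and
    class/word ids are 1-based, assigned in order of first appearance.
    """
    label_ids = {}  # class name -> integer label
    vocab = {}      # word -> feature index
    out = []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        label, words = tokens[0], tokens[1:]
        label_id = label_ids.setdefault(label, len(label_ids) + 1)
        # Map each word to its index, then count occurrences per index.
        counts = Counter(vocab.setdefault(w, len(vocab) + 1) for w in words)
        # libsvm expects feature indices in ascending order.
        feats = " ".join(f"{i}:{c}" for i, c in sorted(counts.items()))
        out.append(f"{label_id} {feats}".rstrip())
    return out, label_ids, vocab


converted, label_ids, vocab = to_libsvm([
    "student hello world hello",
    "faculty world research",
])
```

Keep the returned label_ids and vocab around: a test file has to be converted with the same mappings, otherwise the indices will not line up with the training data.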

Related

Reading and Writing Data to File

I am a rather new user and I am running a simulation experiment. I would like to learn how to write output data to a file. I am considering buying the Big Book of Simulation Modeling, which is based on AnyLogic 6. Are there major differences between AnyLogic 6 and 8 in reading/writing data to file? Unfortunately, they haven't released that chapter for the current version of the book that is online. Are there other resources about writing output data to files? Thanks!
Assuming your question is about writing a csv file and not to excel as per your comments, if you want to make use of the standard AnyLogic objects you can easily follow the instructions from the help here
https://anylogic.help/anylogic/connectivity/text-file.html
If you prefer to be in full control of the writing to the CSV file you can also easily use some standard Java functionalities and create a function like this.
The String input can then be any string, with whatever separators you want (comma ",", pipe "|", tab "\t", etc.); simply add line breaks "\n" to your string to write new lines in your output.

How can I compose several xml-vtk files (vtu, vti) into one to get an animation?

I have a simulation which produces a bunch of vtu (also pvtu) and vti (also pvti) files which, as I understand, represent the configuration of points in one timestamp. But is there a way to group them into one close-to-vtk file to be able to visualize a simulation, which consists of many timestamps, in an app like paraview (but not only)?
ParaView can natively open many files as a time series, see the doc.
If your file names contain a number, the ParaView "open file" dialog will collapse them under a dummy file name containing dots instead of the number. Open that entry to load the whole series.
edit: conversion
To stay close to the VTK format, you can use .pvd, a ParaView format described here, or the .series format from VTK (doc here).
To read it with another software, well, you will need to check the supported file formats by the application you want to use. VTK can write several other formats, including Exodus, XDMF or CGNS for instance.
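As an illustration of the .pvd route, here is a minimal Python sketch that builds the text of a collection file pointing at existing .vtu files. The function name build_pvd is hypothetical, and using the last number in each file name as the time value is my assumption; substitute your real simulation times if you have them.

```python
import re
from xml.sax.saxutils import quoteattr


def build_pvd(filenames):
    """Return the text of a .pvd collection file, one DataSet per file.

    Assumption: the last run of digits in each file name is the time value.
    """
    def step_number(name):
        digits = re.findall(r"\d+", name)
        return int(digits[-1]) if digits else 0

    lines = [
        '<?xml version="1.0"?>',
        '<VTKFile type="Collection" version="0.1">',
        "  <Collection>",
    ]
    # Sort numerically so that step 2 comes before step 10.
    for name in sorted(filenames, key=step_number):
        lines.append(
            f'    <DataSet timestep="{step_number(name)}" part="0" '
            f"file={quoteattr(name)}/>"
        )
    lines += ["  </Collection>", "</VTKFile>"]
    return "\n".join(lines) + "\n"


pvd_text = build_pvd(["sim_10.vtu", "sim_2.vtu"])
```

Write the returned text to a file such as series.pvd next to the data files and open that one file in ParaView; the file paths inside it are relative to the .pvd file.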

Xlsread returning zero values....?

I am getting zero values when using the xlsread command in MATLAB. I am using a real-world dataset taken from the UCI repository, which has both integer and float values.
[Train,textData,rawData] = xlsread('C:\Users\pooja\Documents\project\breastcancer.csv');
I have tried with xls format too..
[Train,textData,rawData] = xlsread('C:\Users\pooja\Documents\project\breastcancer.xls');
Thanx in Advance..!
In the wide world of computers, there are a lot of data formats, and they differ from one another. Software like MATLAB generally lets you open many of them, each with its own function.
You can guess from the name that xlsread is meant for reading Excel spreadsheet files. If you want to read CSV files or any other type of file, please (I think this is obvious) do not use xlsread!
Specifically for CSV files, MATLAB has csvread, which handles numeric data only. Likewise, please do not use csvread to open files that are not CSV.

Writing to complex PDFs in MATLAB

I'm trying to write a MATLAB function that processes a file and writes a report on that file. The report will contain numbers, strings, tables, and images.
After looking at MATLAB's documentation, I can only find functions that save individual items to a file. For example, print saves a plot, write saves a table, etc. How do I create a single file that contains many of these items (e.g. a PDF with images, tables, and text)?
You can use print with the -append option to write multiple pages to a PostScript file in sequence, and then convert the ps to pdf. Using Matlab's handle graphics system, it is possible (if tedious) to design each print page in detail, arrange elements, etc.
However, if your document is going to be really complex, I think it would be better to generate the pdf in another way. One approach would be to write LaTeX code using lots of fprintfs and compile the file using pdflatex.
Btw., I'm not aware of a Matlab function write that generates a pdf.
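To make the LaTeX route concrete, here is a minimal sketch of generating the source text programmatically. It is written in Python rather than MATLAB purely for illustration; the same structure maps onto a series of fprintf calls. The function name build_report_tex and the report layout are hypothetical, and the sketch assumes the inputs contain no LaTeX special characters (such as %, & or _) that would need escaping.

```python
def build_report_tex(title, rows, image_path):
    """Return LaTeX source with a heading, a two-column table, and one image.

    Assumption: title, row values and image_path need no LaTeX escaping.
    """
    # One table row per (name, value) pair, ended by the LaTeX row break.
    table_lines = [f"{name} & {value} \\\\" for name, value in rows]
    parts = [
        r"\documentclass{article}",
        r"\usepackage{graphicx}",
        r"\begin{document}",
        rf"\section*{{{title}}}",
        r"\begin{tabular}{lr}",
        *table_lines,
        r"\end{tabular}",
        "",
        rf"\includegraphics[width=0.8\linewidth]{{{image_path}}}",
        r"\end{document}",
    ]
    return "\n".join(parts) + "\n"


tex = build_report_tex("Report", [("mean", 1.5), ("max", 9)], "plot.png")
```

Save the result as, say, report.tex and run pdflatex on it to get the PDF. The image file has to exist at the given path when pdflatex runs.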

Reading large csv files with strings containing commas as one field

I have a large .csv file (~26000 rows). I want to be able to read it into matlab. Another problem is that it contains a collection of strings delimited by commas in one of the fields.
I'm having trouble reading it. I tried things like tdfread, which won't work here. Are there any tricks with textscan I should be aware of?
Is there any other way?
I'm not sure what is generating your CSV file but that is your problem.
The point of a CSV file is that the file itself designates the separation of fields. If the text of the CSV contains unquoted commas, then nothing you can do will help you. How would ANY program know whether a comma is part of the text in a single field or a field delimiter?
Proper CSV would have a text qualifier. Some generators/readers give you the option to use one. The standard text qualifier is a " (double quote). It's changeable, though, because your text may contain those, too.
Again, it's all about generating proper CSV content.
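To show what the text qualifier buys you, here is a small sketch using Python's standard csv module (Python rather than MATLAB, for illustration only). The sample data and the function name parse_csv are made up; the point is only that a quoted field keeps its embedded commas, while the same text without quotes splits into extra fields.

```python
import csv
import io


def parse_csv(text):
    """Parse CSV text, treating double quotes as the text qualifier."""
    return list(csv.reader(io.StringIO(text), delimiter=",", quotechar='"'))


# Hypothetical sample: the second column holds a comma-separated list.
quoted = 'id,tags,score\n1,"red,green,blue",0.5\n'
unquoted = "id,tags,score\n1,red,green,blue,0.5\n"

quoted_rows = parse_csv(quoted)      # data row has 3 fields
unquoted_rows = parse_csv(unquoted)  # data row has 5 fields
```

If the file in question has no qualifier around the list field, no reader can recover the intended columns without extra knowledge, such as a fixed number of columns before and after that field.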
There's a chance that xlsread won't give you the answer you expect -- do the strings always appear in the same columns, for example? I think (as everyone else seems to :-) that it would be more robust to just use
fid = fopen('yourfile.csv');
and then either textscan
t = textscan(fid, '%s', 'delimiter', sprintf('\n'));
t = t{1};
or just fgetl (the example in the help is perfect).
After that you can do some line-by-line processing -- using textscan again on the text content of each line, for example, is a nice, quick way to get a cell-array that will allow fast analysis of each line.
You have a problem because you're reading it in as a .csv, and you have commas within your data. You can open it in Excel and manipulate the data, possibly stripping the unwanted commas with Excel formulas. I work with .csv files for DB imports quite a bit. I imagine MATLAB has similar rules, namely: no commas in your data.
Can you tell us more about your data? Are there commas throughout, or just in one column? Maybe you can read it in as tab-delimited?
Are you using a Unix system? The reason I am asking is that you could use a command-line function such as sed and regular expressions to clean those data files before you pass them into Matlab. Here is a link that explains how to do exactly what you are looking for.
Since, as others have observed, your file is CSV with commas inside what you think of as a single field, it's going to be hard to persuade Matlab that that really is only one field. I think your best strategy is going to be to read one line at a time, into a string acting as a buffer, and to translate it, field-by-field, into the variables or other data structures that you want. Since Matlab has in-built regular expression capabilities this shouldn't be too hard.
And, as others have already suggested, posting a sample of your data would help us to help you.
One easy solution is:
folder = 'C:\folder1\folder2\';
file = 'data.csv';
data = dataset('xlsfile', fullfile(folder, file));
Of course you could also do the following:
[file, folder] = uigetfile('C:\folder1\folder2\*.csv');
data = dataset('xlsfile', fullfile(folder, file));
Now you will have loaded the data as a dataset. An easy way to get column 1, for example, is
double(data(:,1))