PySpark read text file into single column dataframe

I have a text file I'd like to read into a dataframe, preferably as a single column. Reading it as a CSV with a delimiter that doesn't appear in the data was working until I came across a file with ^ in it:
raw = spark.read.option("delimiter", "^").csv(data_dir + pair[0])
But alas, alack-a-day, the very next file broke the pattern. I don't see an option for delimiter None. Is there an efficient way to do this?

Have you looked at using spark.read.text instead? (In PySpark the DataFrameReader method is text; textFile is the Scala/Java variant.) It reads each line into a single column, which may be exactly what you want.
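For example, a minimal sketch, assuming spark is an active SparkSession and the same path variables as in the question:
raw = spark.read.text(data_dir + pair[0])
# Each line becomes one row in a single string column named "value";
# no delimiter is involved, so characters like ^ in the data are harmless.
raw.printSchema()
# root
#  |-- value: string (nullable = true)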

Related

Reading csv from the second line and creating output

I want to read only one column from a csv file. The problem is that I want to read this column from the second line onward, but using these commands:
[d1, tex] = xlsread(filename1);
name = tex(:,4)
it reads from the first line.
Also, I would like to create a matrix that will include two columns that come from commands (equations etc.) in my Matlab code.
xlsread is deprecated by MathWorks. Try using readtable in the future.
To your original question, I'm assuming that you want to read everything in the 4th column from the second row onward. If so, your second line is incorrect:
name = tex(2:end,4)
Without further example code, I can't answer the rest of your question. Add some details and I'll see what I can do.

Write RDD to txt file

I have the following type of data:
`org.apache.spark.rdd.RDD[org.apache.spark.rdd.RDD[((String, String),Int)]] = MapPartitionsRDD[29] at map at <console>:38`
I'd like to write this data to a txt file, so that I get something like
((like,chicken),2) ((like,dog),3) etc.
I store the data in a variable called res. For the moment I have tried this:
res.coalesce(1).saveAsTextFile("newfile.txt")
But it doesn't seem to work...
If my assumption is correct, you feel that the output should be a single .txt file once it has been coalesced down to one worker. This is not how Spark is built: it is meant for distributed work and shouldn't be shoe-horned into producing non-distributed output. If you need one literal file, merge the part files afterwards with a more generic command-line tool.
All that said, you should see a folder named newfile.txt which contains part files (part-00000 and so on) with your expected output.
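For the record, a minimal PySpark sketch of the same idea (the question is in Scala, but the mechanics are identical; the sample data here is made up to match the expected output):
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
res = sc.parallelize([(("like", "chicken"), 2), (("like", "dog"), 3)])

# Format each record as text first, then write. saveAsTextFile always
# creates a directory; coalesce(1) only means it holds a single part file.
res.map(lambda kv: "(({0},{1}),{2})".format(kv[0][0], kv[0][1], kv[1])) \
   .coalesce(1) \
   .saveAsTextFile("newfile.txt")
# newfile.txt/part-00000 now contains:
# ((like,chicken),2)
# ((like,dog),3)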

Write ArrayBuffer to file in scala

I want to write multiple ArrayBuffers to a file, one after the other, in append mode in Scala. Then I should be able to read the last ArrayBuffer from the file, delete it from the file, and save the file with the remaining ArrayBuffers.
I can't think of a good solution for this. How should I do it?

Progress 4GL creating a .xlsx file without Excel

Version: 10.2b
I want to create a .xlsx file with Progress, but the machine this will run on doesn't have Excel.
Can someone point me in the right direction on how to do this?
Is there a library already written that can do something like this?
Thanks for any help!
The project was moved to the Free DocxFactory Project.
It was rewritten in C++ with Progress 4GL/ABL wrappers and a tutorial.
It is 300x faster, and a lot of new features were added, including barcodes, paging features, etc.
It's completely free for private and commercial use, without any time or feature limits.
HTH
You might find this to be useful: http://www.oehive.org/project/libooxml although it appears that there is nothing there right now. There might also be an older version of that code here: http://www.oehive.org/project/lib
Also -- in many cases the need to provide data to Excel can be satisfied with a Tab or Comma delimited file.
Another trick is to create an HTML table fragment. Excel imports those quite nicely.
A super simple example of how to export a semicolon-delimited file from a temp-table. In 90% of cases this is enough Excel support - at least it has been for me.
DEFINE STREAM strCsv.

DEFINE TEMP-TABLE ttExample NO-UNDO
    FIELD col1 AS CHARACTER
    FIELD col2 AS INTEGER.

CREATE ttExample.
ASSIGN ttExample.col1 = "ABC"
       ttExample.col2 = 123.

CREATE ttExample.
ASSIGN ttExample.col1 = "DEF"
       ttExample.col2 = 456.

OUTPUT STREAM strCsv TO VALUE("c:\test\test.csv").
FOR EACH ttExample NO-LOCK:
    /* EXPORT must name the stream, otherwise it writes to the default output. */
    EXPORT STREAM strCsv DELIMITER ";" ttExample.
END.
OUTPUT STREAM strCsv CLOSE.

Reading large csv files with strings containing commas as one field

I have a large .csv file (~26000 rows) that I want to read into Matlab. One problem is that one of its fields contains a collection of strings delimited by commas.
I'm having trouble reading it. I tried things like tdfread, which won't work here. Any tricks with textscan I should be aware of?
Is there any other way?
I'm not sure what is generating your CSV file, but that is your problem.
The point of a CSV file is that the file itself designates separation of fields. If the text of the CSV contains bare commas, then nothing you can do will help you: how would any program know when a comma is part of a field's text and when it is a field delimiter?
Proper CSV has a text qualifier, and some generators/readers give you the option to use one. The standard text qualifier is a " (quote). It's changeable, though, because your text may contain quotes too.
Again, it's all about generating proper CSV content.
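To illustrate the point (in Python rather than Matlab, purely because its csv module is a convenient reference implementation), a quoted field keeps its commas:
import csv, io

# The second field is wrapped in the standard text qualifier ("), so a
# conforming CSV reader does not treat its inner commas as delimiters.
raw = 'id,words\n1,"like, dog, chicken"\n'
rows = list(csv.reader(io.StringIO(raw)))
print(rows)  # [['id', 'words'], ['1', 'like, dog, chicken']]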
There's a chance that xlsread won't give you the answer you expect -- do the strings always appear in the same columns, for example? I think (as everyone else seems to :-) that it would be more robust to just use
fid = fopen('yourfile.csv');
and then either textscan
t = textscan(fid, '%s', 'delimiter', sprintf('\n'));
t = t{1};
or just fgetl (the example in the help is perfect).
After that you can do some line-by-line processing -- using textscan again on the text content of each line, for example, is a nice, quick way to get a cell-array that will allow fast analysis of each line.
You have a problem because you're reading it in as a .csv and you have commas within your data. You could open it in Excel and manipulate the data, possibly stripping the unwanted commas with Excel formulas. I work with .csv files for DB imports quite a bit, and the rule there is: no bare commas in your data. I imagine Matlab has similar rules.
Can you tell us more about your data? Are there commas throughout, or just in one column? Maybe you can read it in as tab-delimited?
Are you using a Unix system? The reason I ask is that you could use a command-line tool such as sed with regular expressions to clean those data files before you pass them into Matlab. Here is a link that explains how to do exactly what you are looking for.
Since, as others have observed, your file is CSV with commas inside what you think of as a single field, it's going to be hard to persuade Matlab that that really is only one field. I think your best strategy is going to be to read one line at a time, into a string acting as a buffer, and to translate it, field-by-field, into the variables or other data structures that you want. Since Matlab has in-built regular expression capabilities this shouldn't be too hard.
And, as others have already suggested, posting a sample of your data would help us to help you.
One easy solution is:
path = 'C:\folder1\folder2\';
file = 'data.csv';
% fullfile joins the directory and file name with the right separator
data = dataset('xlsfile', fullfile(path, file));
Of course you could also do the following:
[file, path] = uigetfile('C:\folder1\folder2\*.csv');
data = dataset('xlsfile', fullfile(path, file));
Now you will have loaded the data as a dataset. An easy way to get column 1, for example, is
double(data(1))