Column Split - Spark DataFrame - Scala

I'm working on a small project using Spark DataFrames with Scala. I've managed to clean some data from a .csv file, but the end result (output) includes a single column where the "age" and "job" data are combined. Please see the attached screenshot.
I'm looking to split the "age;job" column into two separate columns called "age" and "job", drop the "age;job" column, and keep the rest of the data intact.
I've been working on this one for quite a while, but I'm presently stuck. Any and all feedback is welcome and appreciated.
Note: I'm using Scala in the Spark shell, not an IDE like IntelliJ. Just a heads up, as I'll need to accomplish this using the Spark shell CLI.

You can use split:
import org.apache.spark.sql.functions.{col, split}
val column = split(col("field"), ";")
df.withColumn("left", column(0)).withColumn("right", column(1)).show()
and as a result:
+-----------+-----+-----+
| field| left|right|
+-----------+-----+-----+
|data1;data2|data1|data2|
+-----------+-----+-----+
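Applied to your case, a minimal sketch (assuming the combined column is literally named "age;job" and your DataFrame is df; adjust the names to match your data):
import org.apache.spark.sql.functions.{col, split}
val parts = split(col("age;job"), ";")
val result = df
  .withColumn("age", parts(0))
  .withColumn("job", parts(1))
  .drop("age;job") // drop the combined column, keep the rest intact
result.show()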

I figured out the issue. I simply fixed the raw .csv file.
Below is a snippet of the .csv file. Notice that the "age" header in the top left-hand corner was missing double quotation marks.
All I did was add those double quotes, and that fixed the problem. Below is my new output.

Related

PySpark read text file into single column dataframe

I have a text file I'd like to read into a dataframe. I prefer to read it into a single column. This was working until I came across a file with ^ in it.
raw = spark.read.option("delimiter", "^").csv(data_dir + pair[0])
But alas, alack-a-day, the very next file broke the pattern. I don't see an option for delimiter None. Is there an efficient way to do this?
Have you looked at using spark.read.textFile instead? It may do what you want it to.
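A minimal sketch of that approach in Scala (the PySpark analogue is spark.read.text; the path below is hypothetical):
// textFile reads each line as-is into a single-column Dataset[String];
// no delimiter is applied, so characters like ^ pass through untouched
val lines = spark.read.textFile("/path/to/data.txt") // hypothetical path
lines.show()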

Finding individual filenames when loading multiple files in Apache Spark

I have an Apache Spark job that loads multiple files for processing using
val inputFile = sc.textFile(inputPath)
This is working fine. However for auditing purposes it would be useful to track which line came from which file when the inputPath is a wildcard. Something like an RDD[(String, String)] where the first string is the line of input text and the second is the filename.
Specifically, I'm using Google's Dataproc and the file(s) are located in Google Cloud Storage, so the paths are similar to 'gs://my_data/*.txt'.
Check out SparkContext#wholeTextFiles.
Sometimes, if you use many input paths, you may find yourself wanting to use worker resources for this file listing. To do that, you can use the Scala parts of this answer on improving performance of wholeTextFiles in PySpark (ignore all the Python bits; the Scala parts are what matter).
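A minimal sketch of that approach (assuming each file is small enough to be read whole, since wholeTextFiles loads a file's entire content as one record):
// wholeTextFiles yields one (path, fileContent) pair per file
val linesWithFiles = sc.wholeTextFiles("gs://my_data/*.txt")
  .flatMap { case (path, content) =>
    // emit (line, filename) pairs, matching the RDD[(String, String)] shape described above
    content.split("\n").map(line => (line, path))
  }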

read a formula stored in a text file in scalding

The problem is that I have 2 files:
The 1st file has 4 columns, as in
1,Sanchit,60,80
The 2nd file has 2 columns, as in
1,(1-(x/y))>1
Now I want to apply the formula in the 2nd file to the values 60 and 80, which I will read from the 1st file.
I have tried reading the formula column and wish to compute the formula using the mentioned values, but have been unable to do so.
Any kind of help will be appreciated. Thanks.
EDIT: There is a Java API that helps. I have included it in my project and it now works great:
Evaluating a math expression given in string form
Head over to this link for the solutions.
Strictly speaking, this is not exactly a Scalding question, but you can use something like Apache Commons JEXL to execute formulas dynamically. You would read the formula from the 2nd file, give it the first file's record as a context, and execute it.
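A minimal sketch with JEXL 3 in Scala (assuming x and y are bound to the 3rd and 4th columns of the 1st file; that binding is an assumption, since the question doesn't say which value maps to which variable):
import org.apache.commons.jexl3.{JexlBuilder, MapContext}
val jexl = new JexlBuilder().create()
// the formula string as read from the 2nd file
val expr = jexl.createExpression("(1-(x/y))>1")
val ctx = new MapContext()
ctx.set("x", 60.0) // values read from the 1st file
ctx.set("y", 80.0)
val result = expr.evaluate(ctx) // false here: 1 - 60.0/80.0 = 0.25, which is not > 1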

Matlab's Import Tool recognizes a column as numbers but generates %s in formatSpec

I use Matlab's Import Tool to generate a script that will take care of importing several CSV files with the same columns.
The Import Tool successfully manages to recognize the type of each column of my CSV file:
However, in the generated script, the same columns are cast as strings (%s = string):
Any idea why?
Surprisingly, it works fine with CSV files with fewer columns (it works with 70-column CSV files, but the issue arises with 120-column CSV files). Here is one example of a CSV file that triggers the issue.
I use R2014b x64 with Windows 7 SP1 x64 Ultimate.
This is happening because one of the columns in your file contains data that mixes numbers and text. The Import Tool is predicting that you're going to want to extract the numbers from this field, so it labels the column as 'NUMBER'. However, standard textscan doesn't allow for this, so the Import Tool must generate code that reads in all of the data as text and does the numeric conversion afterwards. The Import Tool is trying to help you avoid errors when using textscan.
The result of running the generated code is still numeric, as shown in the Import Tool.
The specific column is labeled SEGMENT_ID in your example file. It contains data like:
l8K3ziItQ2pRFQ8
L79Y0zA8xF7fVTA
JYYqCCwRrigaUmD

Progress 4gl Creating a .xlsx file without excel

Version: 10.2b
I want to create a .xlsx file with Progress, but the machine this will run on doesn't have Excel.
Can someone point me in the right direction on how to do this?
Is there a library already written that can do something like this?
Thanks for any help!
The project was moved to the Free DocxFactory Project.
It was rewritten in C++ with Progress 4GL/ABL wrappers and a tutorial.
It is 300x faster, and a lot of new features were added, including barcodes, paging features, etc.,
and it's completely free for private and commercial use without any time or feature limits.
HTH
You might find this to be useful: http://www.oehive.org/project/libooxml although it appears that there is nothing there right now. There might also be an older version of that code here: http://www.oehive.org/project/lib
Also -- in many cases the need to provide data to Excel can be satisfied with a Tab or Comma delimited file.
Another trick is to create an HTML table fragment. Excel imports those quite nicely.
A super simple example of how to export a semicolon-delimited file from a temp-table. In 90% of cases this is enough Excel support - at least it has been for me.
DEFINE STREAM strCsv.

DEFINE TEMP-TABLE ttExample NO-UNDO
    FIELD col1 AS CHARACTER
    FIELD col2 AS INTEGER.

CREATE ttExample.
ASSIGN ttExample.col1 = "ABC"
       ttExample.col2 = 123.

CREATE ttExample.
ASSIGN ttExample.col1 = "DEF"
       ttExample.col2 = 456.

OUTPUT STREAM strCsv TO VALUE("c:\test\test.csv").

FOR EACH ttExample NO-LOCK:
    /* EXPORT must name the stream; without STREAM strCsv it writes to the unnamed default stream */
    EXPORT STREAM strCsv DELIMITER ";" ttExample.
END.

OUTPUT STREAM strCsv CLOSE.