Load a CSV file from line 17 of the file in scala spark - scala

I have a problem with a Spark DataFrame in Scala. I'm using var df = spark.read.format("csv").load("csvfile.csv") to read a CSV file and store it in a DataFrame. My CSV file starts with 16 lines of comments that I don't want to read. I know how to skip a header, but that only covers a single line. Any idea how to start reading from line 17?
Thank you.

Solution 1 below works only when all comments start with the same character. Solution 2 works for any set of starting characters you add to the List in the code.
Solution 1:
If all the comments start with a common letter/symbol/number, pass that character as the value of the comment option, as in this answer:
Apache Spark Dataframe - Load data from nth line of a CSV file
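A minimal sketch of that option, assuming every comment line starts with '*' (the file name is taken from the question; the comment option accepts a single character only):
val df = spark.read
  .format("csv")
  .option("comment", "*") // skip lines beginning with '*'
  .load("csvfile.csv")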
But if some comments start with different characters than the rest, this won't work.
Solution 2:
In this solution, I am removing lines starting with the symbols *, / and the number 7. Replace the List values based on the starting characters of your actual comments.
import ss.implicits._
// Read the file as an RDD and drop records whose first character marks a comment.
val rd = ss.sparkContext.textFile(path)
rd.filter(x => x.nonEmpty && !List('*', '7', '/').contains(x.charAt(0)))
  .map(x => x.split(","))
  .map(x => (x(0), x(1), x(2), x(3)))
  .toDF("id", "name", "department", "amount")
  .show()
Input :
*ghfghfgh
*mgffhfg
/fgfgdfgf
7gdfgh
1,Praveen,d1,30000
2,naveen,d1,40000
3,pavan,d1,50000
Output :
+---+-------+----------+------+
| id| name|department|amount|
+---+-------+----------+------+
| 1|Praveen| d1| 30000|
| 2| naveen| d1| 40000|
| 3| pavan| d1| 50000|
+---+-------+----------+------+
In the above example, the first four lines of the input are comments.
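If the goal is strictly to skip the first 16 lines regardless of what they contain, another option (not part of the two solutions above, just a sketch under that assumption) is to drop them by position with zipWithIndex:
val fromLine17 = ss.sparkContext.textFile(path)
  .zipWithIndex() // pair each line with its position in the file
  .filter { case (_, idx) => idx >= 16 } // keep lines 17 onwards
  .map { case (line, _) => line }
fromLine17
  .map(_.split(","))
  .map(x => (x(0), x(1), x(2), x(3)))
  .toDF("id", "name", "department", "amount")
  .show()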

Related

Adjusting columns from txt to parquet

This is my first time trying to convert a txt file to Parquet format, so please bear with me.
I have a txt file which originally looks like this:
id|roads|weights
a01|1026|1172|1
a02|DT:SR:0|2|NE|DT:SR:1|2|NE|1
b01|DT:SR:1|7|SW|1
And I'd like to convert it to Parquet format like this:
+---+-------------------------+-------+
|id |roads |weights|
+---+-------------------------+-------+
|a01|1026|1172 |1 |
|a02|DT:SR:0|2|NE|DT:SR:1|2|NE|1 |
|b01|DT:SR:1|7|SW |1 |
So far, I have uploaded my txt file to the HDFS, and tried to use spark to convert it to parquet format with:
val textfile = spark.read.text("hdfs:some/path/file.txt")
textfile.write.parquet("some.parquet")
val parquetfile = spark.read.parquet("hdfs:some/path/some.parquet")
But my column names are now treated as a row, and everything has been put into a single column called "value".
Any help would be appreciated!
read.text loads the text file and returns a single column named "value". You can use read.csv instead to read the delimited file. The following piece of code should work for you:
val textFile = spark.read.option("delimiter", "|").option("header", true).csv("hdfs:some/path/file.txt")
textFile.write.parquet(parquet_file_path)
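As a quick check (a sketch, reusing the parquet_file_path placeholder from above), you can read the Parquet back and confirm that the column names survived:
val parquetDF = spark.read.parquet(parquet_file_path)
parquetDF.printSchema() // should list id, roads and weights as columns
parquetDF.show(truncate = false)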

Apache Spark scala lowercase first letter using built-in function

I'm trying to lowercase the first letter of the column values.
I can't find a way to lowercase only the first letter using built-in functions. I know there's initcap for capitalizing the data, but I'm trying to do the opposite.
I tried using substring, but it looks a bit overkill and didn't work:
val data = spark.sparkContext.parallelize(Seq(("Spark"),("SparkHello"),("Spark Hello"))).toDF("name")
data.withColumn("name",lower(substring($"name",1,1)) + substring($"name",2,?))
I know I can create a custom UDF but I thought there's may be a built-in solution for this.
You can use the Spark SQL substring function, which allows omitting the length argument (and will take the string until the end):
data.withColumn("name", concat(lower(substring($"name",1,1)), expr("substring(name,2)"))).show
+-----------+
| name|
+-----------+
| spark|
| sparkHello|
|spark Hello|
+-----------+
Note that you cannot concatenate columns with +; you need to use concat.
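If you prefer a single SQL expression, the same logic can also be written entirely inside expr (a sketch of an equivalent alternative, same result as above):
import org.apache.spark.sql.functions.expr
// Lowercase the first character and append the rest of the string unchanged.
data.withColumn("name", expr("concat(lower(substring(name, 1, 1)), substring(name, 2))")).show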

Spark CSV Read Ignore characters

I'm using Spark 2.2.1 through Zeppelin.
Right now my spark read code is as follows:
val data = spark.read.option("header", "true").option("delimiter", ",").option("treatEmptyValuesAsNulls","true").csv("listings.csv")
I've noticed that when I use the .show() function, the cells are shifted to the right. On the CSV all the cells are in the correct places, but after going through Spark, the cells get shifted to the right. I was able to identify the culprit: the quotation marks are misplacing cells. There are some cells in the CSV file that are written like so:
{TV,Internet,Wifi,"Air conditioning",Kitchen,"Indoor fireplace",Heating,"Family/kid friendly",Washer,Dryer}
Actual output (please note that I used .select() and picked some columns to show the issue I am having):
| description| amenities| square_feet| price|
+--------------------+--------------------+-----------------+--------------------+
|This large, famil...|"{TV,Internet,Wif...| Kitchen|""Indoor fireplace""|
|Guest room in a l...| "{TV,""Cable TV""| Internet| Wifi|
Expected output:
| description| amenities| square_feet| price|
+--------------------+--------------------+-----------------+--------------------+
|This large, famil...|"{TV,Internet,Wif...| 1400 | $400.00 ||
|Guest room in a l...| "{TV,""Cable TV""| 1100 | $250.00 ||
Is there a way to get rid of the quotations or replace them with apostrophes? Apostrophes appear to not affect the data.
What you are looking for is the regexp_replace function, with the syntax regexp_replace(str, pattern, replacement).
Unfortunately, I could not reproduce your issue, as I don't know exactly how the listings.csv file is written.
However, the example below should give you an idea of how to replace certain regex patterns when dealing with a DataFrame in Spark.
This reflects your original data:
data.show()
+-----------+----------+-----------+--------+
|description| amenities|square_feet| price|
+-----------+----------+-----------+--------+
|'This large| famil...'| '{TV|Internet|
+-----------+----------+-----------+--------+
With regexp_replace you can replace suspicious string patterns like this
import org.apache.spark.sql.functions.regexp_replace
data.withColumn("amenitiesNew", regexp_replace(data("amenities"), "famil", "replaced")).show()
+-----------+----------+-----------+--------+-------------+
|description| amenities|square_feet| price| amenitiesNew|
+-----------+----------+-----------+--------+-------------+
|'This large| famil...'| '{TV|Internet| replaced...'|
+-----------+----------+-----------+--------+-------------+
Using this function to replace the problematic characters should solve your problem. Feel free to use regular expressions in that function.
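Applied to your case, a sketch of the same idea would be to swap the double quotes in the amenities column for apostrophes (the column name is taken from the question; adjust as needed):
import org.apache.spark.sql.functions.regexp_replace
// Replace every double quote in the amenities column with an apostrophe.
val cleaned = data.withColumn("amenities", regexp_replace(data("amenities"), "\"", "'"))
cleaned.show()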

Talend: Equivalent of logstash "key value" filter

I'm discovering Talend Open Source Data Integrator and I would like to transform my data file into a csv file.
My data consists of sets of key-value pairs, like this example:
A=0 B=3 C=4
A=2 C=4
A=2 B=4
A= B=3 C=1
I want to transform it into a CSV like this one:
A,B,C
0,3,4
2,,4
2,4,
With Logstash, I was using the "key value" filter, which is able to do this job with a few lines of code. But with Talend, I can't find a similar transformation. I tried a "delimited file" job and some other jobs without success.
This is quite tricky and interesting, because Talend is schema-based, so if you don't have the input/output schema predefined, it could be quite hard to achieve what you want.
Here is something you can try. There are a bunch of components to use, and I didn't manage to get to a solution with fewer of them. My solution uses unusual components like tNormalize and tPivotToColumnsDelimited. There is one flaw: you'll get an extra column in the end.
1 - tFileInputRaw, because if you don't know your input schema, just read the file with this one.
2 - tConvertType : here you can convert Object to String type
3 - tNormalize : you'll have to separate your lines manually (use \n as the separator)
4 - tMap : add a sequence "I" + Numeric.sequence("s1",1,1); this will be used later to identify and regroup lines.
5 - tNormalize : here I normalize on 'TAB' separator, to get one line for each key=value pair
6 - tMap : you'll have to split on the "=" sign.
At this step, you'll have an output like:
|seq|key|value|
|---+---+-----|
|I1 |A  |0    |
|I1 |B  |3    |
|I1 |C  |4    |
|I2 |A  |2    |
|I2 |C  |4    |
|I3 |A  |2    |
|I3 |B  |4    |
+---+---+-----+
where seq is the line number.
7 - Finally, with tPivotToColumnsDelimited, you'll have the result. Unfortunately, you'll have the extra "ID" column, as the output schema provided by that component is not editable (the component actually creates the schema itself, which is very unusual amongst Talend components).
Use the ID column as the regroup column.
Hope this helps. Again, Talend is not a very easy tool if you have dynamic input/output schemas.
Corentin's answer is excellent, but here's an enhanced version of it, which cuts down on some components:
Instead of using tFileInputRaw and tConvertType, I used tFileInputFullRow, which reads the file line by line into a string.
Instead of splitting the string manually (where you need to check for nulls), I used tExtractDelimitedFields with "=" as a separator in order to extract a key and a value from the "key=value" column.
The end result is the same, with an extra column at the beginning.
If you want to delete the column, a dirty hack would be to read the output file using a tFileInputFullRow, and use a regex like ^[^;]+; in a tReplace to replace anything up to (and including) the first ";" in the line with an empty string, and write the result to another file.

How to handle multi line rows in spark?

I have a dataframe which has some multi-line observations:
+--------------------+----------------+
| col1| col2|
+--------------------+----------------+
|something1 |somethingelse1 |
|something2 |somethingelse2 |
|something3 |somethingelse3 |
|something4 |somethingelse4 |
|multiline
row | somethings|
|something |somethingall |
What I want is to save this dataframe in CSV (or txt) format, using the following:
df
.write
.format("csv")
.save("s3://../adf/")
But when I check the file, it separates the observations into multiple lines. What I want is for the rows that have multi-line observations to be on the same line in the txt/csv file. I tried to save it as a txt file:
df
.as[(String,String)]
.rdd
.saveAsTextFile("s3://../adf")
but the same output was observed.
I can imagine that one way is to replace \n with something else and then apply the reverse transformation when loading the data back. But is there a way to save it in the desired form without doing any kind of transformation to the data?
Assuming the multi-line data is properly quoted, you can parse multi-line CSV data using the univocity parser and the multiLine setting:
sparkSession.read
.option("parserLib", "univocity")
.option("multiLine", "true")
.csv(file)
Note that this requires reading the entire file onto a single executor, and may not work if your data is too large. The standard text file reading will split the file by lines before doing any other parsing, which will prevent you from working with data records containing newlines unless there is a different record delimiter you can use. If not, you may need to implement a custom TextInputFormat to handle multiline records.
By default, Spark's saveAsTextFile starts a new row whenever it encounters \n, and the same applies to CSV. When reading CSV you can specify the field delimiter with option("delimiter", "\t").
In my opinion, the best way to read multi-line input is through the Hadoop API. You can specify your own record delimiter and process the data.
Something like this :
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.rdd.RDD

// Use a custom record delimiter so that a single record may span several lines.
val conf = new Configuration
conf.set("textinputformat.record.delimiter", "<your delimiter>")
val data: RDD[(LongWritable, Text)] = spark.sparkContext.newAPIHadoopFile(
  <"filepath">, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
Here, the Text in data is your delimiter-separated record as a string.
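As a rough follow-up sketch (not part of the original answer, and assuming each record holds the two comma-separated columns from the question), the resulting RDD could be turned back into a DataFrame like this:
import spark.implicits._
// Convert each (offset, Text) record into a (col1, col2) pair and build a DataFrame.
val df = data
  .map { case (_, text) => text.toString }
  .map(_.split(",", 2)) // split on the first comma only
  .filter(_.length == 2)
  .map(arr => (arr(0), arr(1)))
  .toDF("col1", "col2")
df.show(truncate = false)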