Apache Spark Scala: lowercase first letter using built-in function

I'm trying to lowercase the first letter of column values.
I can't find a way to lower only the first letter using built-in functions; I know there's initcap for capitalizing the data, but I'm trying to decapitalize.
I tried using substring, but it looks a bit overkill and didn't work.
val data = spark.sparkContext.parallelize(Seq(("Spark"),("SparkHello"),("Spark Hello"))).toDF("name")
data.withColumn("name",lower(substring($"name",1,1)) + substring($"name",2,?))
I know I can create a custom UDF, but I thought there might be a built-in solution for this.

You can use the Spark SQL substring method, which allows omitting the length argument (it then takes the string to the end):
data.withColumn("name", concat(lower(substring($"name",1,1)), expr("substring(name,2)"))).show
+-----------+
| name|
+-----------+
| spark|
| sparkHello|
|spark Hello|
+-----------+
Note that you cannot concatenate strings with +; you need to use concat.
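If you prefer, the whole thing can also be written as a single Spark SQL expression via expr (a minimal sketch, reusing the same data DataFrame as above):
import org.apache.spark.sql.functions.expr

// Lowercase the first character, then append the rest of the string;
// substring without a length argument runs to the end of the string.
data.withColumn("name", expr("concat(lower(substring(name, 1, 1)), substring(name, 2))")).show()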

Spark CSV Read Ignore characters

I'm using Spark 2.2.1 through Zeppelin.
Right now my spark read code is as follows:
val data = spark.read.option("header", "true").option("delimiter", ",").option("treatEmptyValuesAsNulls","true").csv("listings.csv")
I've noticed that when I use the .show() function, the cells are shifted to the right. In the CSV all the cells are in the correct places, but after going through Spark, the cells are shifted to the right. I was able to identify the culprit: the quotation marks are misplacing cells. Some cells in the CSV file are written like so:
{TV,Internet,Wifi,"Air conditioning",Kitchen,"Indoor fireplace",Heating,"Family/kid friendly",Washer,Dryer}
Actual output (note that I used .select() and picked some columns to show the issue I am having):
| description| amenities| square_feet| price|
+--------------------+--------------------+-----------------+--------------------+
|This large, famil...|"{TV,Internet,Wif...| Kitchen|""Indoor fireplace""|
|Guest room in a l...| "{TV,""Cable TV""| Internet| Wifi|
Expected output:
| description| amenities| square_feet| price|
+--------------------+--------------------+-----------------+--------------------+
|This large, famil...|"{TV,Internet,Wif...| 1400 | $400.00 ||
|Guest room in a l...| "{TV,""Cable TV""| 1100 | $250.00 ||
Is there a way to get rid of the quotation marks or replace them with apostrophes? Apostrophes appear not to affect the data.
What you are looking for is the regexp_replace function, with the syntax regexp_replace(str, pattern, replacement).
Unfortunately, I could not reproduce your issue, as I didn't know how to write the listings.csv file.
However, the example below should give you an idea of how to replace certain regex patterns when dealing with a DataFrame in Spark.
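For the example, I set up a small DataFrame by hand (the sample values below are an assumption, chosen only to mimic the shifted output from the question):
import spark.implicits._

// Hypothetical sample row mimicking the shifted cells shown below
val data = Seq(
  ("'This large", "famil...'", "'{TV", "Internet")
).toDF("description", "amenities", "square_feet", "price")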
This reflects your original data:
data.show()
+-----------+----------+-----------+--------+
|description| amenities|square_feet| price|
+-----------+----------+-----------+--------+
|'This large| famil...'| '{TV|Internet|
+-----------+----------+-----------+--------+
With regexp_replace you can replace suspicious string patterns like this:
import org.apache.spark.sql.functions.regexp_replace
data.withColumn("amenitiesNew", regexp_replace(data("amenities"), "famil", "replaced")).show()
+-----------+----------+-----------+--------+-------------+
|description| amenities|square_feet| price| amenitiesNew|
+-----------+----------+-----------+--------+-------------+
|'This large| famil...'| '{TV|Internet| replaced...'|
+-----------+----------+-----------+--------+-------------+
Using this function to replace the problematic characters should solve your problem. Feel free to use regular expressions in that function.
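As a side note, if the shifted cells really come from doubled double quotes inside quoted CSV fields, it may also be worth experimenting with the CSV reader's quote/escape options (a sketch, assuming the file escapes quotes by doubling them):
val data = spark.read
  .option("header", "true")
  .option("delimiter", ",")
  .option("quote", "\"")
  .option("escape", "\"")   // treat "" inside a quoted field as a literal quote
  .option("treatEmptyValuesAsNulls", "true")
  .csv("listings.csv")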

Is there a way to see raw string values using SQL / presto SQL / athena?

Edit, after being asked to better specify my need:
TL;DR: How can I show escaped whitespace characters (such as \r) in the Athena console when performing a query? That is, "abcdef\r" instead of "abcdef ".
I have a dataset with a column that contains some strings of variable length, all of them with a trailing whitespace.
Now, since I had analyzed this data before using Python, I know that this whitespace is a \r; however, if I SELECT my_column in Athena, it obviously doesn't show the escaped whitespace.
Essentially, what I'm trying to achieve:
my_column | ..
----------+--------
abcdef\r | ..
ghijkl\r | ..
What I'm getting instead:
my_column | ..
----------+--------
abcdef | ..
ghijkl | ..
If you're asking why I would want that: it's just to avoid having to parse this data with Python if I ever run into this situation again, so that I can immediately know if there are any weird escaped characters in my strings.
Any help is much appreciated.

Talend: Equivalent of logstash "key value" filter

I'm discovering Talend Open Source Data Integrator and I would like to transform my data file into a CSV file.
My data is a set of key-value pairs, like this example:
A=0 B=3 C=4
A=2 C=4
A=2 B=4
A= B=3 C=1
I want to transform it into a CSV like this one:
A,B,C
0,3,4
2,,4
2,4,
With Logstash, I was using the "key value" filter, which can do this job with a few lines of code. But with Talend, I can't find a similar transformation. I tried a "delimited file" job and some other jobs without success.
This is quite tricky and interesting, because Talend is schema-based, so if you don't have the input/output schema predefined, it could be quite hard to achieve what you want.
Here is something you can try; it uses a bunch of components, and I didn't manage to get to a solution with fewer. My solution uses unusual components like tNormalize and tPivotToColumnsDelimited. There is one flaw: you'll get an extra column at the end.
1 - tFileInputRaw: if you don't know your input schema, just read the whole file with this one.
2 - tConvertType: here you can convert Object to String type.
3 - tNormalize: you'll have to separate your lines manually (use \n as the separator).
4 - tMap: add a sequence "I" + Numeric.sequence("s1",1,1); this will be used later to identify and regroup lines.
5 - tNormalize: here I normalize on the TAB separator, to get one line for each key=value pair.
6 - tMap: you'll have to split on the "=" sign.
At this step, you'll have an output like:
+---+---+-----+
|seq|key|value|
+---+---+-----+
|I1 |A  |1    |
|I1 |B  |2    |
|I1 |C  |3    |
|I2 |A  |2    |
|I2 |C  |4    |
|I3 |A  |2    |
|I3 |B  |4    |
+---+---+-----+
where seq is the line number.
7 - Finally, with tPivotToColumnsDelimited, you'll get the result. Unfortunately, you'll have the extra "ID" column, as the output schema provided by the component is not editable (the component actually creates the schema itself, which is very unusual among Talend components).
Use the ID column as the regroup column.
Hope this helps. Again, Talend is not a very easy tool to use if you have dynamic input/output schemas.
Corentin's answer is excellent, but here's an enhanced version of it, which cuts down on some components:
Instead of using tFileInputRaw and tConvertType, I used tFileInputFullRow, which reads the file line by line into a string.
Instead of splitting the string manually (where you need to check for nulls), I used tExtractDelimitedFields with "=" as a separator in order to extract a key and a value from the "key=value" column.
The end result is the same, with an extra column at the beginning.
If you want to delete the column, a dirty hack would be to read the output file using a tFileInputFullRow, and use a regex like ^[^;]+; in a tReplace to replace anything up to (and including) the first ";" in the line with an empty string, and write the result to another file.
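For comparison, and to make the intent of the normalize/split/pivot steps in both answers concrete, here is a rough plain-Scala sketch of the same key=value-to-CSV transformation (the file names and the fixed A,B,C column set are assumptions, not part of either Talend job):
import scala.io.Source
import java.io.PrintWriter

val columns = Seq("A", "B", "C")                  // assumed fixed output columns
val lines = Source.fromFile("input.txt").getLines().toList

// Turn each line ("A=0 B=3 C=4") into a Map(A -> 0, B -> 3, C -> 4)
val rows = lines.map { line =>
  line.split("\\s+").filter(_.contains("=")).map { pair =>
    val Array(key, value) = pair.split("=", 2)
    key -> value
  }.toMap
}

// Write the CSV: header first, then one row per input line,
// leaving the cell empty when a key is missing on that line
val out = new PrintWriter("output.csv")
out.println(columns.mkString(","))
rows.foreach(r => out.println(columns.map(c => r.getOrElse(c, "")).mkString(",")))
out.close()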

Load a CSV file from line 17 of the file in Scala Spark

I have a problem with a Spark DataFrame in Scala. I'm using var df = spark.read.format("csv").load("csvfile.csv") to read a CSV file and store it in a DataFrame. My CSV file has 16 lines of comments at the top that I don't want to read. The only thing I have found is the header option, but that skips just one line. Any idea?
Thank you.
Solution 1 below works only if all comments start with a single common symbol or character. Solution 2 works for any set of starting characters you add to the List in the code.
Solution 1:
If all the comments start with a common letter/symbol/number, pass that character as the value of the comment option, as in this answer:
Apache Spark Dataframe - Load data from nth line of a CSV file
But if some comments start with different symbols than the rest, this won't work.
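For reference, a minimal sketch of Solution 1 (assuming every comment line starts with *; the file path and column names are assumptions):
// Lines beginning with the configured comment character are skipped by the CSV reader
val df = spark.read
  .option("comment", "*")
  .csv("csvfile.csv")
  .toDF("id", "name", "department", "amount")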
Solution 2:
In this solution, I remove the lines starting with the symbols * and / and the number 7. Replace the List values based on the starting characters of your actual comments.
import ss.implicits._   // ss is the SparkSession

// Read the file as an RDD of lines and drop records whose first
// character is one of the comment markers listed below
val rd = ss.sparkContext.textFile(path)
rd.filter(x => !List('*', '7', '/').contains(x.charAt(0)))
  .map(x => x.split(","))
  .map(x => (x(0), x(1), x(2), x(3)))
  .toDF("id", "name", "department", "amount")
  .show()
Input :
*ghfghfgh
*mgffhfg
/fgfgdfgf
7gdfgh
1,Praveen,d1,30000
2,naveen,d1,40000
3,pavan,d1,50000
Output :
+---+-------+----------+------+
| id| name|department|amount|
+---+-------+----------+------+
| 1|Praveen| d1| 30000|
| 2| naveen| d1| 40000|
| 3| pavan| d1| 50000|
+---+-------+----------+------+
In the above example, the first four lines of the input are comments.
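If the goal is literally to skip the first 16 lines regardless of what the comments start with, an alternative sketch (not part of the solutions above) based on zipWithIndex:
import ss.implicits._

// Drop the first 16 lines by index, then parse the rest as CSV fields
ss.sparkContext.textFile(path)
  .zipWithIndex()
  .filter { case (_, idx) => idx >= 16 }
  .map { case (line, _) => line.split(",") }
  .map(x => (x(0), x(1), x(2), x(3)))
  .toDF("id", "name", "department", "amount")
  .show()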

PostgreSQL tuple format

Is there any document describing the tuple format that PostgreSQL server adheres to? The official documentation appears arcane about this.
A single tuple seems simple enough to figure out, but when it comes to arrays of tuples, arrays of composite tuples, and finally nested arrays of composite tuples, it is impossible to be certain about the format simply by looking at the output.
I am asking this following my initial attempt at implementing pg-tuple, a parser that's still missing today, to be able to parse PostgreSQL tuples within Node.js.
Examples
create type type_A as (
a int,
b text
);
with a simple text: (1,hello)
with a complex text: (1,"hello world!")
create type type_B as (
c type_A,
d type_A[]
);
simple-value array: {"(2,two)","(3,three)"}
for type_B[] we can get:
{"(\"(7,inner)\",\"{\"\"(88,eight-1)\"\",\"\"(99,nine-2)\"\"}\")","(\"(77,inner)\",\"{\"\"(888,eight-3)\"\",\"\"(999,nine-4)\"\"}\")"}
It gets even more complex for multi-dimensional arrays of composite types.
UPDATE
Since it feels like there is no specification at all, I have started working on reverse-engineering it. I'm not sure it can be done fully, though, because from some initial examples it is often unclear what formatting rules are applied.
As Nick posted, according to the docs:
the whitespace will be ignored if the field type is integer, but not
if it is text.
and
The composite output routine will put double quotes around field
values if they are empty strings or contain parentheses, commas,
double quotes, backslashes, or white space.
and
Double quotes and backslashes embedded in field values will be
doubled.
and now quoting Nick himself:
nested elements are converted to strings, and then quoted / escaped
like any other string
I give a shortened example below, which can be conveniently compared against its nested value:
a=# create table playground (t text, ta text[],f float,fa float[]);
CREATE TABLE
a=# insert into playground select 'space here',array['','bs\'],8.0,array[null,8.1];
INSERT 0 1
a=# insert into playground select 'no_space',array[null,'nospace'],9.0,array[9.1,8.0];
INSERT 0 1
a=# select playground,* from playground;
playground | t | ta | f | fa
---------------------------------------------------+------------+----------------+---+------------
("space here","{"""",""bs\\\\""}",8,"{NULL,8.1}") | space here | {"","bs\\"} | 8 | {NULL,8.1}
(no_space,"{NULL,nospace}",9,"{9.1,8}") | no_space | {NULL,nospace} | 9 | {9.1,8}
(2 rows)
If you go for deeper nested quoting, look at:
a=# select nested,* from (select playground,* from playground) nested;
nested | playground | t | ta | f | fa
-------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------+------------+----------------+---+------------
("(""space here"",""{"""""""",""""bs\\\\\\\\""""}"",8,""{NULL,8.1}"")","space here","{"""",""bs\\\\""}",8,"{NULL,8.1}") | ("space here","{"""",""bs\\\\""}",8,"{NULL,8.1}") | space here | {"","bs\\"} | 8 | {NULL,8.1}
("(no_space,""{NULL,nospace}"",9,""{9.1,8}"")",no_space,"{NULL,nospace}",9,"{9.1,8}") | (no_space,"{NULL,nospace}",9,"{9.1,8}") | no_space | {NULL,nospace} | 9 | {9.1,8}
(2 rows)
As you can see, the output again follows the rules above.
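As a rough illustration of those rules, a per-field quoting function could look like this in Scala (purely a sketch of the documented behaviour, not PostgreSQL's actual implementation):
// Quote a field if it is empty or contains parentheses, commas, double
// quotes, backslashes or whitespace; inside the quotes, double any
// embedded double quotes and backslashes.
def quoteCompositeField(value: String): String = {
  val needsQuoting = value.isEmpty ||
    value.exists(c => "(),\"\\".indexOf(c) >= 0 || c.isWhitespace)
  if (needsQuoting)
    "\"" + value.flatMap {
      case '"'  => "\"\""
      case '\\' => "\\\\"
      case c    => c.toString
    } + "\""
  else value
}

// quoteCompositeField("space here") == "\"space here\""
// quoteCompositeField("no_space")   == "no_space"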
With that, short answers to your questions would be:
Why is an array normally presented inside double quotes, while an empty array is suddenly an unquoted value? (Because the text representation of an empty array does not contain a comma, space, etc.)
Why is a single " suddenly presented as \""? (The text representation of 'one\ two', according to the rules above, is "one\\ two", and the text representation of that in turn is ""one\\\\two"", which is just what you get.)
Why does unicode-formatted text change the escaping for \? How can we tell the difference then? (According to the docs:
PostgreSQL also accepts "escape" string constants, which are an
extension to the SQL standard. An escape string constant is specified
by writing the letter E (upper or lower case) just before the opening
single quote
), so it is not unicode text, but rather the way you tell Postgres that it should interpret escapes in the text not as plain symbols, but as escapes. E.g. E'\'' will be interpreted as ', while '\'' will make it wait for a closing ' before being interpreted. In your example E'\\ text', its text representation will be "\\ text": we add a backslash for the backslash and put the value in double quotes, all as described in the online docs.
The way that { and } are escaped is not always clear. (I could not answer this question, because the question itself was not clear.)