Talend: Equivalent of Logstash "key value" filter

I'm discovering Talend Open Source Data Integrator and I would like to transform my data file into a CSV file.
My data is made up of sets of key-value pairs, like in this example:
A=0 B=3 C=4
A=2 C=4
A=2 B=4
A= B=3 C=1
I want to transform it into a CSV like this one:
A,B,C
0,3,4
2,,4
2,4,
,3,1
With Logstash, I was using the "key value" filter, which is able to do this job with a few lines of code. But with Talend, I can't find a similar transformation. I tried a "delimited file" job and some other jobs without success.

This is quite tricky and interesting, because Talend is schema-based, so if you don't have the input/output schema predefined, it could be quite hard to achieve what you want.
Here is something you can try. It uses quite a few components; I didn't manage to find a solution with fewer. My solution uses some unusual components like tNormalize and tPivotToColumnsDelimited. There is one flaw: you'll end up with an extra column at the end.
1 - tFileInputRaw : if you don't know your input schema, just read the whole file with this one.
2 - tConvertType : here you can convert the Object type to String.
3 - tNormalize : you'll have to split your lines manually (use \n as the separator).
4 - tMap : add a sequence "I"+Numeric.sequence("s1",1,1) ; this will be used later to identify and regroup lines.
5 - tNormalize : here I normalize on the 'TAB' separator, to get one line for each key=value pair.
6 - tMap : you'll have to split on the "=" sign.
At this step, you'll have an output like :
+---+---+-----+
|seq|key|value|
+---+---+-----+
|I1 |A  |0    |
|I1 |B  |3    |
|I1 |C  |4    |
|I2 |A  |2    |
|I2 |C  |4    |
|I3 |A  |2    |
|I3 |B  |4    |
+---+---+-----+
where seq is the line number.
7 - Finally, with tPivotToColumnsDelimited, you'll get the result. Unfortunately, you'll have the extra "ID" column, as the output schema provided by the component is not editable (the component actually creates the schema itself, which is very unusual among Talend components).
Use the ID column as the regroup column.
Hope this helps. Again, Talend is not a very easy tool if you have dynamic input/output schemas.
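To make the intended transformation concrete, here is a rough sketch of the same logic in plain Scala (illustration only, not a Talend job; the input values are taken from the question): split each line into key=value pairs, collect the union of keys for the header, then pivot, filling missing keys with empty fields.
// Plain-Scala sketch of the pivot described above (not a Talend job).
val lines = Seq("A=0 B=3 C=4", "A=2 C=4", "A=2 B=4", "A= B=3 C=1")
// One Map per input line, e.g. "A=2 C=4" -> Map("A" -> "2", "C" -> "4").
val rows: Seq[Map[String, String]] = lines.map { line =>
  line.split("\\s+").toSeq.flatMap { token =>
    token.split("=", 2) match {
      case Array(k, v) => Some(k -> v)   // "A=" yields ("A", "")
      case _           => None           // ignore tokens without "="
    }
  }.toMap
}
// The union of all keys becomes the CSV header.
val keys = rows.flatMap(_.keys).distinct.sorted
// Missing keys become empty fields, as in the expected output.
val csv = (keys.mkString(",") +:
  rows.map(r => keys.map(k => r.getOrElse(k, "")).mkString(","))).mkString("\n")
println(csv)
// A,B,C
// 0,3,4
// 2,,4
// 2,4,
// ,3,1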

Corentin's answer is excellent, but here's an enhanced version of it, which cuts down on some components:
Instead of using tFileInputRaw and tConvertType, I used tFileInputFullRow, which reads the file line by line into a string.
Instead of splitting the string manually (where you need to check for nulls), I used tExtractDelimitedFields with "=" as a separator in order to extract a key and a value from the "key=value" column.
The end result is the same, with an extra column at the beginning.
If you want to delete the column, a dirty hack would be to read the output file using a tFileInputFullRow, and use a regex like ^[^;]+; in a tReplace to replace anything up to (and including) the first ";" in the line with an empty string, and write the result to another file.
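Just to illustrate what that regex does (the line content and the ";" separator below are hypothetical; they depend on your tPivotToColumnsDelimited settings):
// ^[^;]+; matches everything up to and including the first ";",
// so replacing it with "" drops the leading ID column from each line.
val line = "I1;0;3;4"                          // hypothetical pivoted output line
val cleaned = line.replaceFirst("^[^;]+;", "")
println(cleaned)                               // prints: 0;3;4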

Related

Apache Spark scala lowercase first letter using built-in function

I'm trying to lowercase the first letter of the column values.
I can't find a way to lowercase only the first letter using built-in functions. I know there's initCap for capitalizing the data, but I'm trying to do the opposite.
I tried using substring, but it looks a bit overkill and it didn't work:
val data = spark.sparkContext.parallelize(Seq(("Spark"),("SparkHello"),("Spark Hello"))).toDF("name")
data.withColumn("name",lower(substring($"name",1,1)) + substring($"name",2,?))
I know I can create a custom UDF, but I thought there might be a built-in solution for this.
You can use the Spark SQL substring method, which allows omitting the length argument (in which case it takes the string to the end):
data.withColumn("name", concat(lower(substring($"name",1,1)), expr("substring(name,2)"))).show
+-----------+
| name|
+-----------+
| spark|
| sparkHello|
|spark Hello|
+-----------+
Note that you cannot concatenate strings with +. You need to use concat.
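Equivalently, if you prefer to keep the whole expression in SQL form, the same logic can go inside a single expr (a sketch of the same approach, not a different method):
import org.apache.spark.sql.functions.expr
// Lowercase the first character and append the rest of the string,
// written entirely as a Spark SQL expression.
data.withColumn("name", expr("concat(lower(substring(name, 1, 1)), substring(name, 2))")).show()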

Spark CSV Read Ignore characters

I'm using Spark 2.2.1 through Zeppelin.
Right now my spark read code is as follows:
val data = spark.read.option("header", "true").option("delimiter", ",").option("treatEmptyValuesAsNulls","true").csv("listings.csv")
I've noticed that when I use the .show() function, the cells are shifted to the right. In the CSV all the cells are in the correct places, but after going through Spark the cells are shifted to the right. I was able to identify the culprit: the quotation marks are misplacing cells. There are some cells in the CSV file that are written like so:
{TV,Internet,Wifi,"Air conditioning",Kitchen,"Indoor fireplace",Heating,"Family/kid friendly",Washer,Dryer}
Actual output (note that I used .select() and picked some columns to show the issue I am having):
| description| amenities| square_feet| price|
+--------------------+--------------------+-----------------+--------------------+
|This large, famil...|"{TV,Internet,Wif...| Kitchen|""Indoor fireplace""|
|Guest room in a l...| "{TV,""Cable TV""| Internet| Wifi|
Expected output:
| description| amenities| square_feet| price|
+--------------------+--------------------+-----------------+--------------------+
|This large, famil...|"{TV,Internet,Wif...|             1400|            $400.00|
|Guest room in a l...| "{TV,""Cable TV""|             1100|            $250.00|
Is there a way to get rid of the quotations or replace them with apostrophes? Apostrophes appear to not affect the data.
What you are looking for is the regexp_replace function, with the syntax regexp_replace(str, pattern, replacement).
Unfortunately, I could not reproduce your issue, as I don't know the exact contents of the listings.csv file.
However, the example below should give you an idea on how to replace certain regex patterns when dealing with a data frame in Spark.
This reflects your original data:
data.show()
+-----------+----------+-----------+--------+
|description| amenities|square_feet| price|
+-----------+----------+-----------+--------+
|'This large| famil...'| '{TV|Internet|
+-----------+----------+-----------+--------+
With regexp_replace you can replace suspicious string patterns like this:
import org.apache.spark.sql.functions.regexp_replace
data.withColumn("amenitiesNew", regexp_replace(data("amenities"), "famil", "replaced")).show()
+-----------+----------+-----------+--------+-------------+
|description| amenities|square_feet| price| amenitiesNew|
+-----------+----------+-----------+--------+-------------+
|'This large| famil...'| '{TV|Internet| replaced...'|
+-----------+----------+-----------+--------+-------------+
Using this function to replace the problematic characters should solve your problem. Feel free to use regular expressions in that function.
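For the specific goal in the question (replacing the double quotes with apostrophes), a sketch along the same lines might look like this (assuming the column of interest is amenities, as in the question):
import org.apache.spark.sql.functions.regexp_replace
// Replace every double quote in the amenities column with an apostrophe.
// Column name taken from the question; adjust it to your actual schema.
data.withColumn("amenities", regexp_replace(data("amenities"), "\"", "'")).show()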

Load a CSV file from line 17 of the file in scala spark

I have a problem with Spark DataFrames in Scala. I'm using var df = spark.read.format("csv").load("csvfile.csv") to read a CSV file and store it in a DataFrame. My CSV file starts with 16 lines of comments that I don't want to read. The only option I have found is for skipping a header, but that is just one line. Any idea?
Thank you.
Solution 1 below works only if all the comments start with a single common symbol/character. Solution 2 works for any set of symbols you add to the List in the code.
Solution 1:
If all the comments start with a common letter/symbol/number, pass that character as the value of the comment option, as in this answer:
Apache Spark Dataframe - Load data from nth line of a CSV file
But if some comments start with symbols different from the rest, this won't work.
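When it does apply, a minimal sketch of Solution 1 looks like this (assuming every comment line starts with '*' and reusing the file name from the question):
// The CSV reader's "comment" option skips any line that begins with the
// given character, so the comment lines are dropped before parsing.
val df = spark.read
  .option("comment", "*")
  .csv("csvfile.csv")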
Solution 2:
In this solution, I am removing the lines starting with the symbols *, / and the number 7. Replace the List values based on the starting characters of your actual comments.
import ss.implicits._                                       // ss is the SparkSession
val rd = ss.sparkContext.textFile(path)                     // read the file as an RDD of lines
rd.filter(x => !List('*','7','/').contains(x.charAt(0)))    // drop records starting with the comment symbols/characters
  .map(x => x.split(","))                                   // split the remaining CSV lines
  .map(x => (x(0),x(1),x(2),x(3)))
  .toDF("id","name","department","amount")
  .show()
Input :
*ghfghfgh
*mgffhfg
/fgfgdfgf
7gdfgh
1,Praveen,d1,30000
2,naveen,d1,40000
3,pavan,d1,50000
Output :
+---+-------+----------+------+
| id| name|department|amount|
+---+-------+----------+------+
| 1|Praveen| d1| 30000|
| 2| naveen| d1| 40000|
| 3| pavan| d1| 50000|
+---+-------+----------+------+
In the above example, the first four lines of the input are comments.

SAS import word by word

I have one (maybe silly) question. I have a text file (it comes from an XML which is longer) that looks like:
<Hi>
|<interview><rul| |ebook><name>EQU| |OTE_RULES</name| |><version>aaaa| |ON
QUOTE TR2 v2| |.14</version></| |rulebook><creat| |edDate>2017-10-|
|23`16:00:16.581| | UTC</createdDa| |te><status>IN_P| |ROGRESS</$10tus|
|>`<lives>`<life n| |umber="1" clien| |tId="1" status=| |"IN_PROGRESS"><|
|pages>`<page cod| | e="QD_APP" numb| |er="1" name="Pl| |an type" create|
|dDate="2017-10-| </Hi>
I would like to know if there is any way to import it word by word, so that I could clean the text and remove characters such as $, or keep the spacing in pieces such as
|umber="1" clien|
| e="QD_APP" numb|
Thank you for your help
Julen
SAS can certainly input by word, as opposed to whatever else you're inputting by. In fact, the simplest way to import would be:
data want;
  infile "yourfile.xml";
  length word $1024;  * or whatever the longest feasible "word" would be;
  input word $ @@;    * the trailing @@ holds the line so several words are read per record;
run;
If you don't tell SAS how to split words, it splits on spaces by default.

Splunk csv to match country code

I'm using Splunk but having trouble trying to match the first 2 or 3 digits in this sample:
messageId=9492947, to=61410428007
My csv looks like this:
to, Country
93, Afghanistan
355, Albania
213, Algeria
61, Australia
I'm trying to match the fields against a CSV and have it tell me what Country they matched.
I think I need to be doing a regex or something, but I already have the interesting fields extracted in Splunk, including "to".
This is one of those messy ones, but it does not require regular expressions. Let's say you have a CSV file with a header - something like:
code,country
93,Afghanistan
355,Albania
213,Algeria
61,Australia
44,United Kingdom
1,United States
You need the header for this. I created an example file, but the source can come from anywhere, as long as you have the to field extracted properly.
source="/opt/testfiles/test-country-code.log"
| eval lOne=substr(to,1,1)
| eval lTwo=substr(to,1,2)
| eval lThree=substr(to,1,3)
| lookup countries.csv code as lOne OUTPUT country as cOne
| lookup countries.csv code as lTwo OUTPUT country as cTwo
| lookup countries.csv code as lThree OUTPUT country as cThree
| eval country=coalesce(cOne,cTwo,cThree)
| table to,country
The substr calls extract one, two or three characters from the start of the string. The lookups convert each of those prefixes to the country name using the lookup table. The coalesce takes the first of those that has a value.