In Report Builder 3.0 I have a string of comma separated values e.g. "Value1, Value2, Value3, Value4". I have used Split() to get the first and the last position but how do I get everything that is in between "Value1" and "Value4". Can I remove the first and last position in the array that the split function creates? The result i am looking for is "Value2, Value3".
I can think of a couple of methods:
=Trim(Split(Fields!valueString.Value, ",")(1))
& ", "
& Trim(Split(Fields!valueString.Value, ",")(2))
Similar to what you're already doing, just concatenates the two values back together. Trim just helps avoid any issues with whitespace.
=Trim(Mid(Fields!valueString.Value
, InStr(Fields!valueString.Value, ",") + 1
, InStrRev(Fields!valueString.Value, ",") - InStr(Fields!valueString.Value, ",") - 1))
This just uses string manipulation based on comma positions. Again Trim is used to clean up whitespace.
Neither of these are necessarily perfect when your input string is an unexpected format, so depending on your data extra checks might be required, but for Value1, Value2, Value3, Value4 both expressions return Value2, Value3 as required.
Related
I have a dataset that is like the following:
val df = Seq("samb id 12", "car id 13", "lxu id 88").toDF("list")
I want to create a column that will be a string containing only the values after Id. The result would be something like:
val df_result = Seq(("samb id 12",12), ("car id 13",13), ("lxu id 88",88)).toDF("list", "id_value")
For that, I am trying to use substring. For the the parameter of the starting position to extract the substring, I am trying to use locate. But it gives me an error saying that it should be an Int and not a column type.
What I am trying is like:
df
.withColumn("id_value", substring($"list", locate("id", $"list") + 2, 2))
The error I get is:
error: type mismatch;
found : org.apache.spark.sql.Column
required: Int
.withColumn("id_value", substring($"list", locate("id", $"list") + 2, 2))
^
How can I fix this and continue using locate() as a parameter?
UPDATE
Updating to give an example in which #wBob answer doesn't work for my real world data: my data is indeed a bit more complicated than the examples above.
It is something like this:
val df = Seq(":option car, lorem :ipsum: :ison, ID R21234, llor ip", "lst ID X49329xas ipsum :ion: ip_s-")
The values are very long strings that don't have a specific pattern.
Somewhere in the string that is always a part written ID XXXXX. The XXXXX varies, but it is always the same size (5 characters) and always after a ID .
I am not being able to use neither split nor regexp_extract to get something in this pattern.
It is not clear if you want the third item or the first number from the list, but here are a couple of examples which should help:
// Assign sample data to dataframe
val df = Seq("samb id 12", "car id 13", "lxu id 88").toDF("list")
df
.withColumn("t1", split($"list", "\\ ")(2))
.withColumn("t2", regexp_extract($"list", "\\d+", 0))
.withColumn("t3", regexp_extract($"list", "(id )(\\d+)", 2))
.withColumn("t4", regexp_extract($"list", "ID [A-Z](\\d{5})", 1))
.show()
You can use functions like split and regexp_extract with withColumn to create new columns based on existing values. split splits out the list into an array based on the delimiter you pass in. I have used space here, escaped with two slashes to split the array. The array is zero-based hence specifying 2 gets the third item in the array. regexp_extract uses regular expressions to extract from strings. here I've used \\d which represents digits and + which matches the digit 1 or many times. The third column, t3, again uses regexp_extract with a similar RegEx expression, but using brackets to group up sections and 2 to get the second group from the regex, ie the (\\d+). NB I'm using additional slashes in the regex to escape the slashes used in the \d.
My results:
If your real data is more complicated please post a few simple examples where this code does not work and explain why.
I have few json messages like
{"column1":"abc","column2":"123","column3":qwe"r"ty,"column4":"abc123"}
{"column1":"defhj","column2":"45","column3":asd"f"gh,"column4":"def12d"}
I need to add double quotes both sides for column3 value and replace double quotes in the column3 value with single quotes using scala.
You have mentioned in the comment above
I have huge dataset in kafka.I am trying to read from kafka and write to hdfs through spark using scala.I am using json parser but unable to parse because of column3 issue.so need to manipulate the message to change into json
So you must be having collecting of malformed jsons as in the question. I have created a list as
val kafkaMsg = List("""{"column1":"abc","column2":"123","column3":qwe"r"ty,"column4":"abc123"}""", """{"column1":"defhj","column2":"45","column3":asd"f"gh,"column4":"def12d"}""")
and you are reading it through Spark so you must be having rdds as
val rdd = sc.parallelize(kafkaMsg)
All you need is some parsing in the malformed text json to make it valid json string as
val validJson = rdd.map(msg => msg.replaceAll("[}\"{]", "").split(",").map(_.split(":").mkString("\"", "\":\"", "\"")).mkString("{", ",", "}"))
validJson should be
{"column1":"abc","column2":"123","column3":"qwerty","column4":"abc123"}
{"column1":"defhj","column2":"45","column3":"asdfgh","column4":"def12d"}
You can create a dataframe from the validJson rdd as
sqlContext.read.json(validJson).show(false)
which should give you
+-------+-------+-------+-------+
|column1|column2|column3|column4|
+-------+-------+-------+-------+
|abc |123 |qwerty |abc123 |
|defhj |45 |asdfgh |def12d |
+-------+-------+-------+-------+
Or you can do as per your requirement.
Goal
add double quotes both sides for column3 value and replace double quotes in the column3 value with single quotes using scala.
I would recommend to use RegEx because you have more flexibility with it.
Here is the solution:
val kafkaMsg = List("""{"column1":"abc","column2":"123","column3":qwe"r"ty,"column4":"abc123"}""", """{"column1":"defhj","column2":"45","column3":asd"f"gh,"column4":"def12d"}""", """{"column1":"defhj","column2":"45","column3":without-quotes,"column4":"def12d"}""")
val rdd = sc.parallelize(kafkaMsg)
val rePattern = """(^\{.*)("column3":)(.*)(,"column4":.*)""".r
val newRdd = rdd.map(r =>
r match {
case rePattern(start, col3, col3Value, end) => (start + col3 + '"' + col3Value.replaceAll("\"", "'") + '"' + end)
case _ => r }
)
newRdd.foreach(println)
Explanation:
First and second statements are rdd initialization.
Third line defines the regex pattern. You may need to adjust it to your situation.
Regex produce 4 groups of values (whatever is in a () is a group):
string starting with "{" and whatever after until we meet "column3":
"column3": itself
whatever comes after "column3": but before ,"column4":
whatever comes starting ,"column4":
I use these 4 groups in next statement.
Iterate over your rdd, run it against regex, and change it: replace double quotes with single, and add open/close quotes. In case there is no match the original string will be returned.
Because regex was defined with 4 groups, I use 4 variables to map matches:
case rePattern(start, col3, col3Value, end) =>
Note: Code doesn't check if you have double quote in the value or not, it just runs update. You can add validation on your own if you need.
Show the results.
Important notes:
Regex that I used is strictly linked to your source string format. Keep in mind that you have JSON, so order of your keys is not guaranteed. As a result you can end up with "column4" (which is used as a column3 value ending) may come before "column3".
If you use comma as a key/value ending, make sure you don't have it as part of column3 value.
Bottom line: you need to adjust my regex to properly identify the end of column3 value.
Hope it helps.
I need to remove non-numeric characters in a column (character varying) and keep numeric values in postgresql 9.3.5.
Examples:
1) "ggg" => ""
2) "3,0 kg" => "3,0"
3) "15 kg." => "15"
4) ...
There are a few problems, some values are like:
1) "2x3,25"
2) "96+109"
3) ...
These need to remain as is (i.e when containing non-numeric characters between numeric characters - do nothing).
Using regexp_replace is more simple:
# select regexp_replace('test1234test45abc', '[^0-9]+', '', 'g');
regexp_replace
----------------
123445
(1 row)
The ^ means not, so any character that is not in the range 0-9 will be replaced with an empty string, ''.
The 'g' is a flag that means all matches will be replaced, not just the first match.
For modifying strings in PostgreSQL take a look at The String functions and operators section of the documentation. Function substring(string from pattern) uses POSIX regular expressions for pattern matching and works well for removing different characters from your string.
(Note that the VALUES clause inside the parentheses is just to provide the example material and you can replace it any SELECT statement or table that provides the data):
SELECT substring(column1 from '(([0-9]+.*)*[0-9]+)'), column1 FROM
(VALUES
('ggg'),
('3,0 kg'),
('15 kg.'),
('2x3,25'),
('96+109')
) strings
The regular expression explained in parts:
[0-9]+ - string has at least one number, example: '789'
[0-9]+.* - string has at least one number followed by something, example: '12smth'
([0-9]+.\*)* - the string similar to the previous line zero or more times, example: '12smth22smth'
(([0-9]+.\*)*[0-9]+) - the string from the previous line zero or more times and at least one number at the end, example: '12smth22smth345'
Suppose I have a string like:
abc.efg.hijk.lmnop.leaf
I want the substring: abc.efg.hijk.lmnop.
Means: Find out the first comma . from right, then get the substring from left to this comma
How to use t-sql string function return the substring with one expresssion?
First your'll need to reverse the string and find the character index of the first period, then subtract this number from the length of the entire string. This value needs to be used at the length parameter of the sub-string function.
Try this:
DECLARE #S VARCHAR(55) = 'abc.efg.hijk.lmnop.leaf'
SELECT SUBSTRING(#S, 1, LEN(#S) - CHARINDEX('.', REVERSE(#S)))
Is there some better way to trim {""} in result of regexp_matches than:
trim(trailing '"}' from trim(leading '{"' from regexp_matches(note, '[0-9a-z \r\n]+', 'i')::text))
regexp_matches() returns an array of all matches. The string representation of an array contains the curly braces that's why you get them.
If you just want a list of all matched items, you can use array_to_string() to convert the result into a "simple" text data type:
array_to_string(regexp_matches(note, '[0-9a-z \r\n]+', 'i'), ';')
If you are only interested in the first match, you can select the first element of the array:
(regexp_matches(note, '[0-9a-z \r\n]+', 'i'))[1]