I have two DataFrames in Scala:
DF1 = all proxy URLs (url, ip)
DF2 = regex list (pattern, type)
I want to extract a UUID or credit card ID from the URL using the regex list in DF2, and store the matched pattern record along with its IP address and the URL. Please give me a solution.
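A minimal sketch of one way to do this, assuming Spark SQL DataFrames df1(url, ip) and df2(pattern, type); the sample rows and regexes below are placeholders, not the asker's data. Since regexp_extract needs a literal pattern, a small UDF applies the pattern column row by row:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("url-pattern-match").getOrCreate()
import spark.implicits._

// Placeholder inputs standing in for the asker's DataFrames
val df1 = Seq(
  ("http://proxy/user/123e4567-e89b-12d3-a456-426614174000", "10.0.0.1")
).toDF("url", "ip")

val df2 = Seq(
  ("[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}", "uuid"),
  ("\\b\\d{16}\\b", "credit_card")
).toDF("pattern", "type")

// Apply each pattern to each URL and keep the first match (null if none)
val extractFirst = udf { (url: String, pattern: String) =>
  pattern.r.findFirstIn(url).orNull
}

val matched = df1.crossJoin(df2)
  .withColumn("matched_value", extractFirst($"url", $"pattern"))
  .filter($"matched_value".isNotNull)
  .select("ip", "url", "type", "matched_value")

matched.show(false)
If the regex list is small, an alternative is to collect df2 to the driver and build the condition from literal patterns, which avoids the cross join.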
I'm trying to extract a specific value inside a column in a DataFrame, as you can see in the image below, without any success; referring back to a similar question still didn't work for my code.
Is there any way to extract the values as [Culture, Climate change, technology, ...]?
Data
First Try
I have tried the split() function, but I reached a dead end since I still need the exact value after the word "name", and this new DataFrame contains 75 columns. If I could only get a loop to extract the value after the word "name", that is my latest idea for solving the problem.
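One hedged sketch, assuming the column (here called "topics", a placeholder name) holds JSON-like text such as [{"name":"Culture"},{"name":"Climate change"}] and that Spark 3.1+ is available for regexp_extract_all; if your data looks different, only the regex needs to change:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("extract-names").getOrCreate()
import spark.implicits._

val df = Seq("""[{"name":"Culture"},{"name":"Climate change"},{"name":"technology"}]""")
  .toDF("topics")

// Pull every quoted value that follows "name"
val extracted = df.withColumn(
  "names",
  expr("""regexp_extract_all(topics, '"name"\\s*:\\s*"([^"]+)"', 1)""")
)

extracted.show(false)   // [Culture, Climate change, technology]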
How can I mask part of the email addresses given in a column of a DataFrame?
I tried using a regular expression but I'm not too sure how to apply it.
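A small sketch, assuming the column is called "email" (a placeholder) and that "masking" means keeping the first character of the local part plus the full domain; adjust the regex to whatever masking rule you actually need:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("mask-emails").getOrCreate()
import spark.implicits._

val df = Seq("john.doe@example.com", "alice@test.org").toDF("email")

// Keep the first character and the domain, replace the rest of the local part
val masked = df.withColumn(
  "email_masked",
  regexp_replace($"email", "(^.)[^@]*(@.*$)", "$1***$2")
)

masked.show(false)   // john.doe@example.com -> j***@example.com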
I am trying to retrieve the value of a DataFrame column and store it in a variable. I tried this:
val name=df.select("name")
val name1=name.collect()
But neither of the above returns the value of the column "name".
Spark version: 2.2.0
Scala version: 2.11.11
There are a couple of things here. If you want to see all the data, collect is the way to go. However, if your data is too large it will cause the driver to fail.
So the alternative is to check a few items from the DataFrame. What I generally do is
df.limit(10).select("name").as[String].collect()
This will return 10 elements. But now the output doesn't look good,
so a second alternative is
df.select("name").show(10)
This will print the first 10 elements. Sometimes, if the column values are long, it puts "..." instead of the actual value, which is annoying.
Hence there is a third option:
df.select("name").take(10).foreach(println)
This takes 10 elements and prints them.
Now, in all these cases you won't get a fair sample of the data, as only the first 10 rows are picked. So to truly pick random rows from the DataFrame you can use
df.select("name").sample(.2, true).show(10)
or
df.select("name").sample(.2, true).take(10).foreach(println)
You can check the "sample" function on DataFrame for details.
The first will do :)
val name = df.select("name") will return another DataFrame. You can, for example, do name.show() to show the content of the DataFrame. You can also do collect or collectAsMap to materialize the results on the driver, but be aware that the amount of data should not be too big for the driver.
You can also do:
val names = df.select("name").as[String].collect()
This will return an array of the names in this DataFrame.
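If the goal is a single value in a plain Scala variable rather than a whole array, a small sketch (assuming you want the first "name"; head throws on an empty DataFrame, so headOption is the safer variant):
val firstName: String = df.select("name").as[String].head()
// or, guarding against an empty DataFrame:
val maybeName: Option[String] = df.select("name").as[String].take(1).headOption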
I have the following rows in a HBase table called test
ROW     COLUMN+CELL
row1    column=cf:a, timestamp=1429204170712, value=value1
row2    column=cf:b, timestamp=1429204196225, value=value2
row3    column=cf:c, timestamp=1429204213427, value=value3
I am trying to retrieve all the rows whose row key matches the prefix row using suffix globbing, as mentioned here.
But why do I get a Bad Request when I try http://localhost:8080/test/row*, where localhost:8080 is where the HBase REST server (Stargate) is listening, test is the table and row is a partial row key? I executed it in a browser and in the REST client Poster (a Firefox plugin). Executing the URL http://localhost:8080/test/row*/cf gives the response value1, but I would like to retrieve the values of all the rows whose row key matches the prefix row.
I am running HBase 0.94.26, Stargate (bundled with HBase) and Hadoop 1.2.1 on an Ubuntu 12.04 virtual machine.
Is it possible to retrieve all the rows programmatically at least?
As per the docs, REST works fine for retrieving all the rows; you just need to modify the URL accordingly.
In my opinion, try the combinations below; one of them should work. Please note that I have not tested them.
http://localhost:8080/test/row*
http://localhost:8080/test/row
Suffix Globbing
Multiple value queries of a row can optionally append a suffix glob on
the row key. This is a restricted form of scanner which will return
all values in all rows that have keys which contain the supplied key
on their left hand side, for example:
org.someorg.*
-> org.someorg.blog
-> org.someorg.home
-> org.someorg.www
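On the "programmatically" part of the question, a hedged sketch using the HBase client API (0.94-era classes) with a prefix scan; the table name test and the prefix row come from the question, everything else is an assumption:
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Scan}
import org.apache.hadoop.hbase.filter.PrefixFilter
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

val conf = HBaseConfiguration.create()
val table = new HTable(conf, "test")

// Scan every row whose key starts with "row"
val scan = new Scan()
scan.setFilter(new PrefixFilter(Bytes.toBytes("row")))

val scanner = table.getScanner(scan)
try {
  for (result <- scanner.asScala) {
    // Print the row key and the cells found in that row
    println(Bytes.toString(result.getRow) + " -> " + result)
  }
} finally {
  scanner.close()
  table.close()
}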
I have a hierarchical XML file received from a client, and I need to store it in an HBase database. As I am new to HBase, I am not able to figure out how to approach this. Can you please guide me on how I should proceed to store this hierarchical data in HBase?
Thanks in advance.
HBase stores data in a column-wise format. Each record must have a unique key. Sub-columns (column qualifiers) can be created on the fly, but the main columns (column families) cannot.
For example, consider this XML:
<X1>
  <X2 name="uniqueid">1</X2>
  <X3>
    <X4>value1</X4>
    <X5>value2</X5>
    <X6>
      <X7>value3</X7>
      <X8>value4</X8>
    </X6>
  </X3>
  <X7>value5</X7>
</X1>
In this case, the main column families would be X3 and X7. The row ID can be taken from X2.
You can construct an HBase entry equivalent to this using the Java API, like:
Put p = new Put("1".getBytes()); // the unique row id, taken from X2
p.add("X3".getBytes(), "X4".getBytes(), "value1".getBytes());
where the first argument is the column family and the second one is called the column qualifier (sub-column).
You can also use the two-argument form of add, like
p.add("X3:X6:X7".getBytes(), "value3".getBytes());
Then call table.put(p). That's it!
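To tie the pieces together, a hedged end-to-end sketch in Scala against the same 0.94-era client API, assuming a table named "xmltable" (a placeholder) with column families X3 and X7 already created; the values come from the example XML above, and keeping the nested path in the qualifier (e.g. X6:X7) is just one possible convention:
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes

val conf = HBaseConfiguration.create()
val table = new HTable(conf, "xmltable")

// Row id comes from <X2 name="uniqueid">1</X2>
val p = new Put(Bytes.toBytes("1"))

// Leaves under X3 become qualifiers in the X3 family
p.add(Bytes.toBytes("X3"), Bytes.toBytes("X4"), Bytes.toBytes("value1"))
p.add(Bytes.toBytes("X3"), Bytes.toBytes("X5"), Bytes.toBytes("value2"))
p.add(Bytes.toBytes("X3"), Bytes.toBytes("X6:X7"), Bytes.toBytes("value3"))
p.add(Bytes.toBytes("X3"), Bytes.toBytes("X6:X8"), Bytes.toBytes("value4"))

// The top-level <X7> goes into its own family
p.add(Bytes.toBytes("X7"), Bytes.toBytes("X7"), Bytes.toBytes("value5"))

table.put(p)
table.close()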