I am processing a Kafka stream with Flink SQL: every message is pulled from Kafka, processed with Flink SQL, and pushed back into Kafka. I want a nested output, i.e. the input is flat and the output is nested. For example, my input is
{"StudentName": "ABC", "StudentAge": 33}
and I want the output to be
{"Student": {"Name": "ABC", "Age": 33}}
I searched here and a few similar links but could not find an answer. Is it possible to do this using the Apache Flink SQL API? I can use user-defined functions if necessary but would prefer to avoid them.
You could try something like this:
SELECT
  MAP['Student', MAP['Name', StudentName, 'Age', StudentAge]]
FROM
  students;
I found the MAP function here, but I had to experiment in the SQL Client to figure out the syntax.
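For the Kafka-in/Kafka-out wiring described in the question, here is a rough, untested Java sketch of how that query could be submitted through the Table API. The topic names, broker address, connector options, and the CAST of the age are my assumptions; the sink column name supplies the outer Student key, and the age is cast because MAP needs a single value type:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class NestedStudentJob {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // Flat JSON input from Kafka.
        tableEnv.executeSql(
            "CREATE TABLE students (StudentName STRING, StudentAge INT) WITH ("
                + " 'connector' = 'kafka', 'topic' = 'students-in',"
                + " 'properties.bootstrap.servers' = 'localhost:9092',"
                + " 'scan.startup.mode' = 'earliest-offset', 'format' = 'json')");

        // Nested JSON output: the column name becomes the outer 'Student' key.
        tableEnv.executeSql(
            "CREATE TABLE students_out (`Student` MAP<STRING, STRING>) WITH ("
                + " 'connector' = 'kafka', 'topic' = 'students-out',"
                + " 'properties.bootstrap.servers' = 'localhost:9092', 'format' = 'json')");

        // MAP[...] needs a single value type, so the age is cast to STRING here.
        tableEnv.executeSql(
            "INSERT INTO students_out"
                + " SELECT MAP['Name', StudentName, 'Age', CAST(StudentAge AS STRING)]"
                + " FROM students");
    }
}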
I was able to achieve the same by returning a map from a Flink UDF. The eval() function in the UDF returns a Map, and the Flink SQL query calls the UDF with Student as an alias.
The UDF should look like this:
import java.util.HashMap;
import java.util.Map;

import org.apache.flink.table.functions.ScalarFunction;

public class getStudent extends ScalarFunction {
    public Map<String, String> eval(String name, Integer age) {
        Map<String, String> student = new HashMap<>();
        student.put("Name", name);
        student.put("Age", age.toString());
        return student;
    }
}
and the Flink SQL query looks like this:
SELECT getStudent(StudentName, StudentAge) AS `Student` FROM MyKafkaTopic
The same can be done for Lists as well, when trying to get a List out of Flink SQL.
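For completeness, this is roughly how the UDF could be registered before running that query, assuming a StreamTableEnvironment named tableEnv; depending on the Flink version, the Map return type may also need a @DataTypeHint("MAP<STRING, STRING>") annotation on eval():

import org.apache.flink.table.api.Table;

// Register the scalar function so the SQL query can reference it by name.
tableEnv.createTemporarySystemFunction("getStudent", new getStudent());
// On older Flink versions the equivalent call is:
// tableEnv.registerFunction("getStudent", new getStudent());

Table result = tableEnv.sqlQuery(
    "SELECT getStudent(StudentName, StudentAge) AS `Student` FROM MyKafkaTopic");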
I have created a Hive UDF like the one below:
import org.apache.hadoop.hive.ql.exec.UDF

class customUdf extends UDF {
  def evaluate(col: String): String = {
    col + "abc"
  }
}
I then registered the UDF in the SparkSession with:
sparksession.sql("""CREATE TEMPORARY FUNCTION testUDF AS 'testpkg.customUdf'""");
When I query the Hive table from Scala code with the query below, it does not progress and does not throw an error either:
SELECT testUDF(value) FROM t;
However, when I pass a string literal like below from the Scala code, it works:
SELECT testUDF('str1') FROM t;
I am running the queries via the SparkSession. I also tried with a GenericUDF, but I am still facing the same issue. It happens only when I pass a Hive column. What could be the reason?
Try referencing your jar from hdfs:
create function testUDF as 'testpkg.customUdf' using jar 'hdfs:///jars/customUdf.jar';
I am not sure about the implementation of UDFs in Scala, but when I faced a similar issue in Java, I noticed a difference: if you plug in a literal
select udf("some literal value")
then it is received by the UDF as a String.
But when you select from a Hive table
select udf(some_column) from some_table
you may get what's called a LazyString, for which you need to use getObject to retrieve the actual value. I am not sure whether Scala handles these lazy values automatically.
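To illustrate that last point, here is a minimal Java sketch of the GenericUDF variant, in which the ObjectInspector unwraps lazy column values into plain strings; the class name is made up and the + "abc" logic just mirrors the UDF from the question:

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

public class CustomGenericUdf extends GenericUDF {

    private StringObjectInspector inputOI;

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        // Keep the inspector so evaluate() can unwrap lazy values later.
        inputOI = (StringObjectInspector) arguments[0];
        return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        // getPrimitiveJavaObject handles LazyString/Text wrappers and returns a plain java.lang.String.
        String col = inputOI.getPrimitiveJavaObject(arguments[0].get());
        return col == null ? null : col + "abc";
    }

    @Override
    public String getDisplayString(String[] children) {
        return "customGenericUdf(" + String.join(", ", children) + ")";
    }
}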
I want to generate an unbounded collection of rows and run an SQL query on it using the Apache Beam Calcite SQL dialect and the Apache Flink runner. Based on the source code and documentation of Apache Beam, one can do something like this using a table provider: GenerateSequenceTableProvider. But I don't understand how to use it outside of the Beam SQL CLI. I'd like to use it in my regular Java code.
I was trying to do something like this:
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
Pipeline pipeline = Pipeline.create(options);

GenerateSequenceTableProvider tableProvider = new GenerateSequenceTableProvider();
tableProvider.createTable(Table.builder()
    .name("sequence")
    .schema(Schema.of(
        Schema.Field.of("sequence", Schema.FieldType.INT64),
        Schema.Field.of("event_time", Schema.FieldType.DATETIME)))
    .type(tableProvider.getTableType())
    .build());

PCollection<Row> res = PCollectionTuple.empty(pipeline)
    .apply(SqlTransform.query("select * from sequenceSchema.sequence limit 5")
        .withTableProvider("sequenceSchema", tableProvider));

pipeline.run().waitUntilFinish();
But I'm getting Object 'sequence' not found within 'sequenceSchema' errors, so I guess I'm not actually creating the table. So how do I create the table? If I understand correctly, the values should be provided automatically by the table provider.
Basically, how do I use Beam SQL table providers if I want to execute queries on tables that these providers are supposed (I think?) to generate?
The TableProvider interface is a bit difficult to work with directly. The problem you're running into is that the GenerateSequenceTableProvider, like many other TableProviders, doesn't have any way to store table metadata on its own. So calling its createTable method is actually a no-op! What you'll want to do is wrap it in an InMemoryMetaStore, something like this:
GenerateSequenceTableProvider tableProvider = new GenerateSequenceTableProvider();
InMemoryMetaStore metaStore = new InMemoryMetaStore();
metaStore.registerProvider(tableProvider);

metaStore.createTable(Table.builder()
    .name("sequence")
    .schema(Schema.of(
        Schema.Field.of("sequence", Schema.FieldType.INT64),
        Schema.Field.of("event_time", Schema.FieldType.DATETIME)))
    .type(tableProvider.getTableType())
    .build());

PCollection<Row> res = PCollectionTuple.empty(pipeline)
    .apply(SqlTransform.query("select * from sequenceSchema.sequence limit 5")
        .withTableProvider("sequenceSchema", metaStore));
(Note I haven't tested this, but I think something like it should work)
As robertwb pointed out, another option would be to just avoid the TableProvider interface and use GenerateSequence directly. You'd just need to make sure that your PCollection has a schema. Then you could process it with SqlTransform, like this:
pc.apply(SqlTransform.query("select * from PCOLLECTION limit 5"))
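Spelling that out, a rough, untested sketch of the GenerateSequence route; the generation rate and the way each Row is built (event_time set to processing time) are my assumptions:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class GenerateSequenceSqlExample {
    public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create();

        // Same shape as the table in the question.
        Schema schema = Schema.of(
            Schema.Field.of("sequence", Schema.FieldType.INT64),
            Schema.Field.of("event_time", Schema.FieldType.DATETIME));

        PCollection<Row> rows = pipeline
            // Unbounded source: one element per second (rate chosen arbitrarily).
            .apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1)))
            // Turn each long into a schema'd Row; event_time is just processing time here.
            .apply(MapElements.into(TypeDescriptor.of(Row.class))
                .via((Long n) -> Row.withSchema(schema).addValues(n, Instant.now()).build()))
            .setRowSchema(schema);

        // The schema'd PCollection is queryable as PCOLLECTION.
        rows.apply(SqlTransform.query("select * from PCOLLECTION"));

        pipeline.run().waitUntilFinish();
    }
}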
If you can't get TableProviders to work, you could read this as an ordinary PCollection and then apply a SqlTransform to the result.
I have two datasets that I need to join and perform operations on, and I can't figure out how to do it.
A stipulation is that I do not have the org.apache.spark.sql.functions methods available to me, so I must use the Dataset API.
The input given is two Datasets
The first dataset is of type Customer with fields:
customerId, forename, surname (all String)
and the second dataset is of type Transaction:
customerId (String), accountId (String), amount (Long)
customerId is the link.
The output Dataset needs to have these fields:
customerId (String), forename (String), surname (String), transactions (a List of Transaction), transactionCount (int), totalTransactionAmount (Double), averageTransactionAmount (Double)
I understand that i need to use groupBy, agg, and some kind of join at the end.
Can anyone help/point me in the right direction? Thanks
It is very hard to work with only the information you have given, but from what I understand you don't want to use the DataFrame functions and want to implement everything with the Dataset API. You could do it in the following way:
Join both datasets using joinWith; you can find an example here: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-joins.html#joinWith
Aggregating: I would use groupByKey followed by mapGroups, something like
ds.groupByKey(x => x.id).mapGroups { case (key, iter) =>
  val list = iter.toList
  val totalTransactionAmount = ???
  val averageTransactionAmount = ???
  (key, totalTransactionAmount, averageTransactionAmount)
}
Hopefully the example gives you an idea how you could solve your problem with the dataset API and you could adapt it to your problem.
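For a fuller picture, here is the same joinWith + groupByKey + mapGroups shape written out in Java (the question does not state a language); the Customer, Transaction, and CustomerSummary bean classes and their getters/setters are assumptions:

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.api.java.function.MapGroupsFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import scala.Tuple2;

public class CustomerTransactionSummary {

    public static Dataset<CustomerSummary> summarize(Dataset<Customer> customers,
                                                     Dataset<Transaction> transactions) {
        // joinWith keeps both sides as typed objects instead of flattening them into columns.
        Dataset<Tuple2<Customer, Transaction>> joined = customers.joinWith(
            transactions,
            customers.col("customerId").equalTo(transactions.col("customerId")),
            "inner");

        // Group by customer id and fold each group into one summary row.
        return joined
            .groupByKey(
                (MapFunction<Tuple2<Customer, Transaction>, String>) t -> t._1().getCustomerId(),
                Encoders.STRING())
            .mapGroups(
                (MapGroupsFunction<String, Tuple2<Customer, Transaction>, CustomerSummary>) (key, iter) -> {
                    List<Transaction> txs = new ArrayList<>();
                    Customer customer = null;
                    double total = 0.0;
                    while (iter.hasNext()) {
                        Tuple2<Customer, Transaction> pair = iter.next();
                        customer = pair._1();
                        txs.add(pair._2());
                        total += pair._2().getAmount();
                    }
                    CustomerSummary summary = new CustomerSummary();
                    summary.setCustomerId(key);
                    summary.setForename(customer.getForename());
                    summary.setSurname(customer.getSurname());
                    summary.setTransactions(txs);
                    summary.setTransactionCount(txs.size());
                    summary.setTotalTransactionAmount(total);
                    summary.setAverageTransactionAmount(total / txs.size());
                    return summary;
                },
                Encoders.bean(CustomerSummary.class));
    }
}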
I have grouped all my customers in a JavaPairRDD<Long, Iterable<ProductBean>> by their customerId (of Long type). This means every customerId has a List of ProductBean.
Now I want to save every ProductBean to the DB irrespective of customerId. I got all the values by using:
JavaRDD<Iterable<ProductBean>> values = custGroupRDD.values();
Now I want to convert the JavaRDD<Iterable<ProductBean>> into a JavaPairRDD<Object, BSONObject> so that I can save it to Mongo. Remember, every BSONObject is made from a single ProductBean.
I cannot figure out how to do this in Spark, i.e. which Spark transformation is used for this job. I think this task boils down to separating all the values out of the Iterable. Please let me know how this is possible.
Any hint in Scala or Python is also fine.
You can use the flatMapValues function:
JavaPairRDD<Long, ProductBean> result = custGroupRDD.flatMapValues(v -> v);
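If the target really is the JavaPairRDD<Object, BSONObject> shape from the question, the flattened pairs could then be mapped to BSON documents roughly like this; the way the document is filled is an assumption, the actual write via the Mongo Hadoop connector is left out, and on newer Spark versions the flatMapValues lambda may need to be v -> v.iterator():

import org.apache.spark.api.java.JavaPairRDD;
import org.bson.BSONObject;
import org.bson.BasicBSONObject;
import scala.Tuple2;

// custGroupRDD is the JavaPairRDD<Long, Iterable<ProductBean>> from the question.
JavaPairRDD<Object, BSONObject> docs = custGroupRDD
    .flatMapValues(v -> v)             // one (customerId, ProductBean) pair per bean
    .mapToPair(pair -> {
        BSONObject doc = new BasicBSONObject();
        doc.put("customerId", pair._1());
        doc.put("product", pair._2()); // or copy the individual ProductBean fields instead
        return new Tuple2<Object, BSONObject>(null, doc); // the key slot is unused here
    });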
I'm developing a mechanism for Cassandra using Hector.
What I need at this moment is to know the hash values of the keys, so that I can see which node each key is stored on (by looking at each node's tokens) and ask that node directly for the value. As I understand it, where the values are stored depends on the partitioner Cassandra uses. So, are the hash values of all keys stored in any table? If not, how could I implement a generic class that, once I have read from the system keyspace which partitioner Cassandra is using, can be an instance of that partitioner without me having to modify the code for each partitioner? I would need it to call the getToken method to compute the hash value for a given key.
Hector's CqlQuery is poorly supported and buggy. You should use the native Java CQL driver instead: https://github.com/datastax/java-driver
You could just reuse the partitioners defined in Cassandra: https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/dht and then do the routing using the token ranges.
The CQL driver offers token-aware routing out of the box. I would use that instead of trying to reinvent the wheel in Hector, especially since Hector uses the legacy Thrift API instead of CQL.
Finally, after testing different implementations, I found a way to get the partitioner using the following code:
CqlQuery<String, String, String> cqlQuery = new CqlQuery<String, String, String>(
    ksp, StringSerializer.get(), StringSerializer.get(), StringSerializer.get());
cqlQuery.setQuery("select partitioner from local");

QueryResult<CqlRows<String, String, String>> result = cqlQuery.execute();
CqlRows<String, String, String> rows = result.get();
for (int i = 0; i < rows.getCount(); i++) {
    RowImpl<String, String, String> row =
        (RowImpl<String, String, String>) rows.getList().get(i);
    List<HColumn<String, String>> columns = row.getColumnSlice().getColumns();
    for (HColumn<String, String> c : columns) {
        System.out.println(c.getValue());
    }
}
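Building on that, here is an untested sketch of how the partitioner class name returned by that query could be turned into an IPartitioner instance generically (using Cassandra's dht classes as suggested above), for a simple UTF-8 string key:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import org.apache.cassandra.dht.IPartitioner;
import org.apache.cassandra.dht.Token;
import org.apache.cassandra.utils.FBUtilities;

public class PartitionerTokens {

    // partitionerClass is the value read from "select partitioner from local",
    // e.g. "org.apache.cassandra.dht.Murmur3Partitioner".
    public static Token tokenFor(String partitionerClass, String rowKey) throws Exception {
        // FBUtilities.newPartitioner instantiates the partitioner by class name,
        // so nothing here is hard-coded to Murmur3 vs. Random partitioner.
        IPartitioner partitioner = FBUtilities.newPartitioner(partitionerClass);
        ByteBuffer key = ByteBuffer.wrap(rowKey.getBytes(StandardCharsets.UTF_8));
        return partitioner.getToken(key);
    }
}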