Selecting fields from a Spark RDD - Scala

I've got a fairly big RDD with 400 fields coming from a Kafka Spark stream. I need to create another RDD or Map by selecting some fields from the initial RDD when I transform the stream, and eventually write the result to Elasticsearch.
I know my fields by field name but don't know the field index.
How do I project the specific fields by field name to a new Map?

Assuming each field is delimited by '#', you can determine the index of each field from the first row or a header file and store those indexes in some data structure. Subsequently, you can use that structure to look up the fields you need and build the new maps.
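For example, a minimal Scala sketch of that approach; the header line, field names and sample record are all hypothetical:
// Build a field-name -> index lookup once, from the header line.
val header = "name#age#city"
val fieldIndex: Map[String, Int] = header.split("#").zipWithIndex.toMap

val wantedFields = Seq("name", "city")

// Project one '#'-delimited record into a Map of only the wanted fields.
def project(record: String): Map[String, String] = {
  val values = record.split("#", -1)
  wantedFields.flatMap { f =>
    fieldIndex.get(f).collect { case i if i < values.length => f -> values(i) }
  }.toMap
}

println(project("alice#31#amsterdam"))  // Map(name -> alice, city -> amsterdam)
// With a DStream[String] from Kafka you would apply the same function,
// e.g. stream.map(project), before writing the maps to Elasticsearch.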
You can use the Apache Avro format to pre-process the data. That would allow you to access the data by field name and would not require knowledge of the field indexes within the string. The following link provides a great starting point for integrating Avro with Kafka and Spark.
http://aseigneurin.github.io/2016/03/04/kafka-spark-avro-producing-and-consuming-avro-messages.html
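For illustration, here is a minimal sketch of by-name access on an Avro GenericRecord, assuming the Avro library is on the classpath; the schema and field names are hypothetical:
import org.apache.avro.SchemaBuilder
import org.apache.avro.generic.GenericRecordBuilder

// Hypothetical two-field schema; the real one would describe all 400 fields.
val schema = SchemaBuilder.record("Event").fields()
  .requiredString("userId")
  .requiredString("country")
  .endRecord()

val record = new GenericRecordBuilder(schema)
  .set("userId", "u-42")
  .set("country", "NL")
  .build()

// Fields are addressed by name; no index bookkeeping is needed.
val userId = record.get("userId").toString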

Related

BigQuery: Import of Cloud Firestore export treats map fields as bytes

I have a Firestore collection that I've been importing into BigQuery tables via the managed import/export service. Recently, schema auto-detection has begun failing for these imports, resulting in Firestore map fields being treated as unqueryable byte fields in the BigQuery table.
The docs mention that this may happen if the number of unique field names in your Firestore collection exceeds BigQuery's 10,000-columns-per-table limit. This collection definitely exceeds that limit; however, I was under the impression that the --projection_fields allowlist param would limit the number of columns BigQuery tried to ingest. Is this not the case? Will an import operation fail schema detection regardless of --projection_fields whenever the collection exceeds 10,000 unique names, or am I missing something?
For reference, here's an example of the CLI command I'm using to load the import:
bq load --source_format=DATASTORE_BACKUP --replace \
--projection_fields=id,user, <...etc> \
dataset.table \
gs://backups/<backup_file>
Hello, I'm facing the same problem here: the map comes through as bytes, but I couldn't find the right codec to decode the map from those bytes.
The solution seems to be to use sub-collections instead of map fields in your Firestore documents.
If you can't change the structure of your Firestore documents, you can still try my workaround: an ETL job that builds the raw tables before loading them into BigQuery, instead of using the managed import/export service.
You can try decoding those bytes from Base64.
Sometimes it produces gibberish, but there are cases where it works.
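For what it's worth, a minimal JVM sketch of that Base64 attempt; the encoded value below is hypothetical:
import java.nio.charset.StandardCharsets
import java.util.Base64

// Hypothetical value copied out of the BigQuery bytes column.
val encoded = "eyJuYW1lIjoiQWxpY2UifQ=="

val decoded = new String(Base64.getDecoder.decode(encoded), StandardCharsets.UTF_8)
println(decoded)  // readable here; a real Firestore export may still come out as gibberish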

Implement SQL update with Kafka

How can I implement an update of an object that I store in a Kafka topic / KTable?
I mean, I don't need a replacement of the whole value (which a compacted KTable would do), but an update of a single field. Should I read from the topic/KTable, deserialize, update the object, and then store the new value back to the same topic/KTable?
Or should I join/merge 2 topics: one with the original value and the second with the update of the field?
What would you do?
Kafka (and RocksDB) stores bytes; it cannot compare nested fields as though they were database columns. Doing so would require deserialization anyway.
To update a field, you need to construct and post the whole new value; a join will effectively do the same thing.
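As a rough illustration of that read-join-rewrite pattern, here is a minimal Kafka Streams sketch in Scala; the topic names, the JSON payload and the string-based field replacement are all hypothetical, and it assumes a recent kafka-streams-scala (the Serdes import path differs in older versions):
import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.serialization.Serdes._

object FieldUpdateSketch extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "field-update-sketch")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()

  // Compacted topic holding the current JSON value per key.
  val users = builder.table[String, String]("users")
  // Topic carrying just the new value for one field, keyed the same way.
  val emailUpdates = builder.stream[String, String]("user-email-updates")

  // Join each update against the current record, rebuild the whole value,
  // and write it back: compaction keeps only the latest full record per key.
  emailUpdates
    .join(users)((newEmail, currentJson) =>
      // crude field replacement for illustration only; use a real JSON library
      currentJson.replaceAll("\"email\":\"[^\"]*\"", "\"email\":\"" + newEmail + "\"")
    )
    .to("users")

  new KafkaStreams(builder.build(), props).start()
}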
Related - Is there a KSQL statement to update values in table?

How to append data to an existing document in MongoDB using Spark?

I'm trying to append data to an existing document in a collection while migrating data from other sources using Spark. I searched the documentation but didn't find anything.
Any kind of help will be appreciated.
Thanks.
I was researching this problem and found that you can append fields from a Spark DataFrame to a MongoDB document with an existing _id. You can do it using:
MongoSpark.save(df.write.mode("append"))
In mode "append" the connector will append all the fields to the document with the _id that exists, notice that it will remove fields that will not exist in the dataframe you're writing to the database.
Source: https://groups.google.com/forum/#!topic/mongodb-user/eF-qdpYbFS0
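Expanding that one-liner into a minimal, self-contained sketch, assuming the MongoDB Spark connector 2.x; the URI, collection, _id values and field names are hypothetical:
import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.SparkSession

// Hypothetical output URI: database "mydb", collection "users".
val spark = SparkSession.builder()
  .appName("append-to-existing-docs")
  .config("spark.mongodb.output.uri", "mongodb://localhost:27017/mydb.users")
  .getOrCreate()

import spark.implicits._

// The _id values must match existing documents; rows with unknown _id values
// are inserted as new documents. As noted above, fields missing from the
// DataFrame are dropped from the matched document.
val updates = Seq(
  ("user-1", "alice@example.com"),
  ("user-2", "bob@example.com")
).toDF("_id", "email")

MongoSpark.save(updates.write.mode("append"))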

SQL (MSSQL/MariaDB) or NoSQL (MongoDB): XML search and processing

Current project situation:
We get a lot of XML from an outside system; each document is under 50 KB, or around 5 MB max if it contains an attachment. The XML structure is moderately complex because of nested elements: there are ~70 first-level elements, each with children and grandchildren. We store that XML in a string column of MS SQL Server.
While storing the XML, we read the search-criteria fields from it and maintain them in separate columns to speed up search queries.
The search functionality displays these messages as a list. The search-criteria fields (~10 optional fields) come from XML elements, and we parse the XML to show around 10-15 elements in the list.
Reporting functionality may also be added in the future.
The challenge with this design: if new functionality is introduced to search on a new criterion, we need to add another column to the DB table and store that field's value from the XML, which is the weakest part of this design.
Suggested improvement: instead of storing the XML as a string, the plan is to store it in an XML column, get rid of the extra columns that hold the search-field values, and query the XML column directly for search and fetch.
Question:
Which DB will give me optimal search performance? I have to fetch only the XML documents that match the search criteria.
SQL, or NoSQL like MongoDB?
Are there any performance metrics available, or any case studies on this?
Which DB would best handle the reporting load?
What client language are you using? Java / PHP / C# / ...? Most of them have XML libraries that do what you need. A database is a data repository, not a data manipulator.
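For example, with Scala and the scala-xml library (an assumption, since the client language isn't stated), pulling the search-criteria fields out of a document takes a few lines; the message structure below is hypothetical:
import scala.xml.XML

// Hypothetical message; the real documents have ~70 first-level elements.
val doc = XML.loadString(
  """<message>
    |  <header><sender>ACME</sender><msgType>ORDER</msgType></header>
    |  <body><orderId>42</orderId></body>
    |</message>""".stripMargin)

// Extract the search-criteria fields on the client, then store or index
// them however the database you pick prefers.
val sender  = (doc \ "header" \ "sender").text
val msgType = (doc \ "header" \ "msgType").text
val orderId = (doc \ "body" \ "orderId").text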

Is there an efficient way to insert a copy field in Titan?

Is there an efficient way to store a copy of a field with a different data type in Titan?
I am using Titan 1.0.0 with Solr as a backend data store. Given my queries and my backend strategy, I want to store a field with two different data types in Solr (text and string). I already know that I can store them separately, but I want to know whether it is possible to insert the data once while storing it in both fields.