I am trying to read datasets stored in Databricks DBFS, and I want to load them into Qlik Sense.
I am using the Databricks REST API and getting the data back as JSON, with the retrieved file contents base64 encoded.
How can I get this data in tabular format directly from the REST API?
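For what it is worth, a minimal Scala sketch of the decode step, assuming the base64 payload is the data field of the DBFS read response (the sample content below is made up; in reality the string comes from your HTTP call):

import java.util.Base64
import java.nio.charset.StandardCharsets

// Stand-in for the "data" field returned by /api/2.0/dbfs/read.
// It is encoded here only so the example is self-contained and runnable;
// in practice the string comes from the HTTP response.
val sampleFileContents = "id,name\n1,alpha\n2,beta\n"
val dataField: String =
  Base64.getEncoder.encodeToString(sampleFileContents.getBytes(StandardCharsets.UTF_8))

// Decode the base64 payload back into the original file bytes, then
// interpret them as text (CSV/JSON -- whatever the stored file actually is).
val decodedText = new String(Base64.getDecoder.decode(dataField), StandardCharsets.UTF_8)
println(decodedText)

Once the payload is decoded, the text is just the original file contents and can be split into rows and columns before handing it over to Qlik Sense.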
Related
I need to pass an Id to another activity in Data Factory. The Id is stored in blob storage in JSON format.
I am using a Lookup activity to fetch the data, but my pipeline fails when there are more than 5000 rows. I need a solution for this; I didn't understand the existing solution on Stack Overflow.
Ah OK. Well, you cannot use OFFSET/LIMIT pagination sensibly in Cosmos, and ADF cannot use continuation tokens. Also, you cannot Lookup more than 5000 results from blob storage or paginate the blob output.
If I had this problem I would try the following, based on the idea in "Azure Data Factory DYNAMICALLY partition a csv/txt file based on rowcount":
Use a Data Flow to get the data from Cosmos and write it to several JSON files using partitioning, each with fewer than 5000 rows (using the method described in the comment on the link above: a surrogate key and the MOD operator; the sketch after this list illustrates the arithmetic).
ForEach over those blobs.
Have a nested pipeline that does the lookup and calls the API, as you have now; the lookup will then see at most 5000 items per file.
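Purely to illustrate the bucketing arithmetic (this is not ADF Data Flow syntax), here is a rough Spark/Scala sketch of the surrogate-key + MOD idea, with made-up data and an output path chosen for the example; each output partition ends up with at most 5000 rows:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("bucket-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Made-up stand-in for the Cosmos data.
val rows = (1 to 12345).map(i => (i, s"record-$i")).toDF("id", "payload")

val maxRowsPerFile = 5000
val numBuckets = math.ceil(rows.count().toDouble / maxRowsPerFile).toInt

// Surrogate row number, then MOD to spread the rows over numBuckets buckets,
// so no bucket holds more than 5000 rows.
val withBucket = rows
  .withColumn("surrogate", row_number().over(Window.orderBy("id")))
  .withColumn("bucket", col("surrogate") % numBuckets)

// One folder of JSON per bucket; the ForEach + nested pipeline then loops
// over these and each lookup stays under the 5000-row limit.
withBucket.drop("surrogate")
  .write.partitionBy("bucket").mode("overwrite").json("/tmp/lookup_chunks")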
I have a scenario where I need to load JSON data from an S3 bucket into a Spark DataFrame, but the problem is that my data in the S3 bucket is encrypted with the javax.crypto library using the AES/ECB/PKCS5Padding algorithm. When I try to read the data from S3, Spark throws an error that it is not JSON data, since it is in encrypted form. Is there any way I can write custom Spark code that reads the data from the S3 bucket as a file input stream, applies the decryption with javax.crypto, and assigns the result to a DataFrame? (I want my custom Spark code to run on a distributed cluster.) Appreciate your help.
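A rough sketch of the kind of custom code that could work, assuming each S3 object is one encrypted JSON document; the key, bucket path, and charset below are placeholders, and the decryption runs inside the executors so it is distributed across the cluster:

import javax.crypto.Cipher
import javax.crypto.spec.SecretKeySpec
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("decrypt-s3-json").getOrCreate()
import spark.implicits._

// Placeholder 128-bit AES key; in practice load it from a secret store.
val secretKeyBytes: Array[Byte] = "0123456789abcdef".getBytes("UTF-8")

// binaryFiles gives (path, stream) pairs; the map below runs in parallel
// across the cluster, one decryption per file.
val decryptedJson = spark.sparkContext
  .binaryFiles("s3a://my-bucket/encrypted-json/*")            // hypothetical path
  .map { case (_, stream) =>
    // Cipher instances are not serializable, so build one inside the task.
    val cipher = Cipher.getInstance("AES/ECB/PKCS5Padding")
    cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(secretKeyBytes, "AES"))
    new String(cipher.doFinal(stream.toArray()), "UTF-8")
  }

// Parse the decrypted plaintext strings as JSON into a DataFrame.
val df = spark.read.json(spark.createDataset(decryptedJson))
df.printSchema()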
I'm building a data-collection RESTful API. External devices will POST JSON data to the database server, so my idea is to follow a store-and-forward approach.
At the moment of the POST, it will store the raw JSON data in a table with timestamp and processed (true/false) fields.
Later, when the DB server is not loaded, some function, trigger, or stored procedure will run. The idea is to process all the JSON data into suitable tables and fields for charting graphs/bars and mapping it over Google Maps later.
So how, and with what, should I run this second step when the DB server is idle and free to process the posted JSON data?
Can someone suggest how to parse EDIFACT-format data using Apache Spark?
I have a requirement where EDIFACT data is written to an AWS S3 bucket every day, and I am trying to find the best way to convert this data to a structured format using Apache Spark.
If you have your invoices in EDIFACT format, you can read each one of them as one string per invoice using RDDs. You will then have an RDD[String] which represents the distributed invoice collection. Take a look at https://github.com/CenPC434/java-tools; with this you can convert the EDIFACT strings to XML. The repo https://github.com/databricks/spark-xml shows how to use XML as an input source to create DataFrames and perform multiple queries, aggregations, etc.
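A rough sketch of that flow, with the actual EDIFACT-to-XML call left as a placeholder because the exact java-tools API should be checked against that repo; the paths and the rowTag are assumptions:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("edifact-to-df").getOrCreate()

// One EDIFACT invoice per file -> RDD[String] of raw invoices.
val invoices = spark.sparkContext
  .wholeTextFiles("s3a://my-bucket/edifact/*")             // hypothetical path
  .values

// Placeholder: wire up the CenPC434/java-tools converter here.
def convertToXml(edifact: String): String = ???

val xmlStrings = invoices.map(convertToXml)

// Stage the XML, then read it back with spark-xml (com.databricks:spark-xml).
xmlStrings.saveAsTextFile("s3a://my-bucket/edifact-xml/")  // hypothetical staging path
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "Invoice")                             // assumed row element name
  .load("s3a://my-bucket/edifact-xml/")

df.show()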
My task is basically:
Read data from Google Cloud BigQuery using Spark/Scala.
Perform some operation (like an update) on the data.
Write the data back to BigQuery.
So far, I am able to read data from BigQuery using newAPIHadoopRDD(), which returns an RDD[(LongWritable, JsonObject)].
tableData.map(entry => (entry._1.toString(),entry._2.toString()))
.take(10)
.foreach(println)
And below is a sample of the data:
(341,{"id":"4","name":"Shahu","score":"100"})
I am not able to figure out what functions I should use on this RDD to meet the requirement.
Do I need to convert this RDD to a DataFrame/Dataset/JSON format? And how?
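One way to answer that last question, sketched in Scala and assuming tableData is the RDD[(LongWritable, JsonObject)] returned by newAPIHadoopRDD() above; writing back to BigQuery depends on which connector is in use, so only the read/transform side is shown, and the "update" is a hypothetical example:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bq-json-to-df").getOrCreate()
import spark.implicits._

// Keep only the JSON side of each (LongWritable, JsonObject) pair as text.
val jsonStrings = tableData.map { case (_, json) => json.toString }

// Let Spark infer the schema from the JSON strings: columns id, name, score.
val df = spark.read.json(spark.createDataset(jsonStrings))

// Example update before writing back (hypothetical transformation).
val updated = df.withColumn("score", $"score".cast("int") + 10)
updated.show()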