Creating a pyspark dataframe from exploded (nested) json values - pyspark

I'm trying to get nested json values in a pyspark dataframe. I have easily solved this using pandas, but now I'm trying to get it working with just pyspark functions.
print(response)
{'ResponseMetadata': {'RequestId': 'PGMCTZNAPV677CWE', 'HostId': '/8qweqweEfpdegFSNU/hfqweqweqweSHtM=', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amz-id-2': '/8yacqweqwe/hfjuSwKXDv3qweqweqweHtM=', 'x-amz-request-id': 'PqweqweqweE', 'date': 'Fri, 09 Sep 2022 09:25:04 GMT', 'x-amz-bucket-region': 'eu-central-1', 'content-type': 'application/xml', 'transfer-encoding': 'chunked', 'server': 'AmazonS3'}, 'RetryAttempts': 0}, 'IsTruncated': False, 'Contents': [{'Key': 'qweqweIntraday.csv', 'LastModified': datetime.datetime(2022, 7, 12, 8, 32, 10, tzinfo=tzutc()), 'ETag': '"qweqweqwe4"', 'Size': 1165, 'StorageClass': 'STANDARD'}], 'Name': 'test-bucket', 'Prefix': '', 'MaxKeys': 1000, 'EncodingType': 'url', 'KeyCount': 1}
With pandas I can parse this input into a dataframe with the following code:
object_df = pd.DataFrame()
for elem in response:
    if 'Contents' in elem:
        object_df = pd.json_normalize(response['Contents'])
print(object_df)
Key LastModified \
0 202207110000_qweIntraday.csv 2022-07-12 08:32:10+00:00
ETag Size StorageClass
0 "fqweqweqwee0cb4" 1165 STANDARD
(there are sometimes multiple "Contents", so I have to use recursion).
This was my attempt to replicate it with a Spark DataFrame and sc.parallelize:
object_df = spark.sparkContext.emptyRDD()
for elem in response:
    if 'Contents' in elem:
        rddjson = spark.read.json(sc.parallelize([response['Contents']]))
Also tried:
sqlc = SQLContext(sc)
rddjson = spark.read.json(sc.parallelize([response['Contents']]))
df = sqlc.read.json("multiline", "true").json(rddjson)
df.show()
+--------------------+
| _corrupt_record|
+--------------------+
|[{'Key': '2/3c6a6...|
+--------------------+
This is not working. I have already seen some related posts saying that I can use explode, like in this example (stackoverflow answer), instead of json_normalize, but I'm having trouble replicating the example.
Any suggestion on how to solve this with pyspark or pyspark.sql (without adding additional libraries) is very welcome.

It looks like the issue is with the data containing a python datetime object (in the LastModified field).
One way around this might be (assuming you're OK with the Python standard library):
import json

sc = spark.sparkContext
for elem in response:
    if 'Contents' in elem:
        # default=str serializes the datetime objects as strings so spark.read.json can parse them
        json_str = json.dumps(response['Contents'], default=str)
        object_df = spark.read.json(sc.parallelize([json_str]))
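Once the JSON string parses cleanly, the usual DataFrame operations apply. For example, a quick check on the resulting object_df (assuming it was built as above; the column names come from the Contents entries shown earlier):
object_df.printSchema()
object_df.select("Key", "LastModified", "Size", "StorageClass").show(truncate=False)
Note that LastModified arrives as a string, because default=str stringified the datetime; cast it with to_timestamp if you need a real timestamp column.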

Related

How to find occurrences of words in a log file with a pyspark RDD

I have a log file. I have read the file and converted it into an RDD. I want to count the number of times 'server_name' is present in the file.
Original log file I have -
from pyspark.context import SparkContext
from pyspark.sql.context import SQLContext
from pyspark.sql.session import SparkSession

sc = SparkContext()
sqlContext = SQLContext(sc)
spark = SparkSession(sc)
base_df = spark.read.text("/content/fsm-20210817.logs")
base_df_rdd = base_df.rdd
server_list = ['nginx-ingress-controller-5b6697898-zqxl4', 'cert-manager-5695c78d49-q9s9j']
for i in server_list:
    res = textFile.rdd.map(lambda x: x[0].split(' ').count(i)).sum()
    print(i, res)
I'm getting output as -
nginx-ingress-controller-5b6697898-zqxl4 0
cert-manager-5695c78d49-q9s9j 0
I have base_df_rdd as-
[Row(value='{"log":{"offset":5367960,"file":{"path":"/var/log/containers/cert-manager-5695c78d49-q9s9j_cert-manager_cert-manager-cf5af9cbaccdccd8f637d0ba7313996b1cfc0bab7b25fe9f157953918016ac84.log"}},"stream":"stderr","message":"E0817 00:00:00.144379 1 sync.go:183] cert-manager/controller/challenges \\"msg\\"=\\"propagation check failed\\" \\"error\\"=\\"wrong status code \'404\', expected \'200\'\\" \\"dnsName\\"=\\"prodapi.fsmbuddy.com\\" \\"resource_kind\\"=\\"Challenge\\" \\"resource_name\\"=\\"prodapi-fsmbuddy-tls-cert-vtvdq-1471208926-2451592135\\" \\"resource_namespace\\"=\\"default\\" \\"resource_version\\"=\\"v1\\" \\"type\\"=\\"HTTP-01\\" ","#timestamp":"2021-08-17T00:00:00.144Z","ecs":{"version":"1.0.0"},"cloud":{"instance":{"id":"i-06c596f469bcf9b4a"},"region":"ap-south-1","provider":"aws","availability_zone":"ap-south-1b","machine":{"type":"t3a.large"}},"input":{"type":"container"},"#version":"1","host":{"architecture":"x86_64","os":{"codename":"Core","version":"7 (Core)","name":"CentOS Linux","kernel":"4.14.186-146.268.amzn2.x86_64","platform":"centos","family":"redhat"},"hostname":"ip-192-168-18-105.ap-south-1.compute.internal","containerized":false,"name":"ip-192-168-18-105.ap-south-1.compute.internal"},"tags":["beats_input_codec_plain_applied","_grokparsefailure"],"agent":{"version":"7.2.0","type":"filebeat","ephemeral_id":"af246a38-d99d-43ab-849b-cf25288dd6c1","hostname":"ip-192-168-18-105.ap-south-1.compute.internal","id":"9631683c-b8fc-40d8-9e28-f80f5fa3cc2c"},"kubernetes":{"node":{"name":"ip-192-168-18-105.ap-south-1.compute.internal"},"container":{"name":"cert-manager"},"labels":{"helm_sh/chart":"cert-manager-v1.0.1","app_kubernetes_io/component":"controller","app_kubernetes_io/managed-by":"Tiller","pod-template-hash":"5695c78d49","app_kubernetes_io/instance":"cert-manager","app_kubernetes_io/name":"cert-manager","app":"cert-manager"},"pod":{"uid":"0be2ef9e-f2ee-4d40-b74b-17734527d78c","name":"cert-manager-5695c78d49-q9s9j"},"replicaset":{"name":"cert-manager-5695c78d49"},"namespace":"cert-manager"}}'),
Row(value='{"log":{"offset":1946553,"file":{"path":"/var/log/containers/fsm-backend-cron-prod-6bd6459455-p9p49_default_fsm-backend-cron-prod-51838c21c82b0a19b713bf028da7418f9885fd29b606e5b9912c1c66f6c3046a.log"}},"stream":"stdout","message":"Inside user information updation cron: 2021-08-17T00:00:00.001Z","#timestamp":"2021-08-17T00:00:00.001Z","ecs":{"version":"1.0.0"},"cloud":{"instance":{"id":"i-06c596f469bcf9b4a"},"region":"ap-south-1","availability_zone":"ap-south-1b","provider":"aws","machine":{"type":"t3a.large"}},"input":{"type":"container"},"#version":"1","host":{"architecture":"x86_64","os":{"codename":"Core","version":"7 (Core)","name":"CentOS Linux","kernel":"4.14.186-146.268.amzn2.x86_64","platform":"centos","family":"redhat"},"hostname":"ip-192-168-18-105.ap-south-1.compute.internal","containerized":false,"name":"ip-192-168-18-105.ap-south-1.compute.internal"},"tags":["beats_input_codec_plain_applied","_grokparsefailure"],"agent":{"version":"7.2.0","type":"filebeat","ephemeral_id":"af246a38-d99d-43ab-849b-cf25288dd6c1","hostname":"ip-192-168-18-105.ap-south-1.compute.internal","id":"9631683c-b8fc-40d8-9e28-f80f5fa3cc2c"},"kubernetes":{"node":{"name":"ip-192-168-18-105.ap-south-1.compute.internal"},"container":{"name":"fsm-backend-cron-prod"},"labels":{"pod-template-hash":"6bd6459455","app":"fsm-backend-cron-prod"},"pod":{"uid":"c356f207-bc60-4974-9f8b-02ecfe87eaa0","name":"fsm-backend-cron-prod-6bd6459455-p9p49"},"replicaset":{"name":"fsm-backend-cron-prod-6bd6459455"},"namespace":"default"}}'),
Row(value='{"log":{"offset":1946687,"file":{"path":"/var/log/containers/fsm-backend-cron-prod-6bd6459455-p9p49_default_fsm-backend-cron-prod-51838c21c82b0a19b713bf028da7418f9885fd29b606e5b9912c1c66f6c3046a.log"}},"stream":"stdout","message":"\\u001B[0;36mMongoose:\\u001B[0m orders.find({ orderStatus: \\u001B[32m\'success\'\\u001B[39m, orderDate: { \\u001B[32m\'$gte\'\\u001B[39m: new Date(\\"Mon, 16 Aug 2021 00:00:00 GMT\\") }}, { projection: {} })","#timestamp":"2021-08-17T00:00:00.002Z","ecs":{"version":"1.0.0"},"cloud":{"instance":{"id":"i-06c596f469bcf9b4a"},"region":"ap-south-1","availability_zone":"ap-south-1b","provider":"aws","machine":{"type":"t3a.large"}},"input":{"type":"container"},"#version":"1","host":{"os":{"codename":"Core","version":"7 (Core)","name":"CentOS Linux","kernel":"4.14.186-146.268.amzn2.x86_64","platform":"centos","family":"redhat"},"name":"ip-192-168-18-105.ap-south-1.compute.internal","hostname":"ip-192-168-18-105.ap-south-1.compute.internal","containerized":false,"architecture":"x86_64"},"tags":["beats_input_codec_plain_applied","_grokparsefailure"],"agent":{"version":"7.2.0","type":"filebeat","ephemeral_id":"af246a38-d99d-43ab-849b-cf25288dd6c1","hostname":"ip-192-168-18-105.ap-south-1.compute.internal","id":"9631683c-b8fc-40d8-9e28-f80f5fa3cc2c"},"kubernetes":{"node":{"name":"ip-192-168-18-105.ap-south-1.compute.internal"},"container":{"name":"fsm-backend-cron-prod"},"labels":{"pod-template-hash":"6bd6459455","app":"fsm-backend-cron-prod"},"pod":{"uid":"c356f207-bc60-4974-9f8b-02ecfe87eaa0","name":"fsm-backend-cron-prod-6bd6459455-p9p49"},"replicaset":{"name":"fsm-backend-cron-prod-6bd6459455"},"namespace":"default"}}'),
Row(value='{"log":{"offset":1946955,"file":{"path":"/var/log/containers/fsm-backend-cron-prod-6bd6459455-p9p49_default_fsm-backend-cron-prod-51838c21c82b0a19b713bf028da7418f9885fd29b606e5b9912c1c66f6c3046a.log"}},"stream":"stdout","message":"******","#timestamp":"2021-08-17T00:00:00.003Z","ecs":{"version":"1.0.0"},"cloud":{"instance":{"id":"i-06c596f469bcf9b4a"},"region":"ap-south-1","availability_zone":"ap-south-1b","provider":"aws","machine":{"type":"t3a.large"}},"input":{"type":"container"},"#version":"1","host":{"architecture":"x86_64","name":"ip-192-168-18-105.ap-south-1.compute.internal","hostname":"ip-192-168-18-105.ap-south-1.compute.internal","os":{"codename":"Core","version":"7 (Core)","name":"CentOS Linux","kernel":"4.14.186-146.268.amzn2.x86_64","platform":"centos","family":"redhat"},"containerized":false},"tags":["beats_input_codec_plain_applied","_grokparsefailure"],"agent":{"version":"7.2.0","type":"filebeat","ephemeral_id":"af246a38-d99d-43ab-849b-cf25288dd6c1","hostname":"ip-192-168-18-105.ap-south-1.compute.internal","id":"9631683c-b8fc-40d8-9e28-f80f5fa3cc2c"},"kubernetes":{"node":{"name":"ip-192-168-18-105.ap-south-1.compute.internal"},"container":{"name":"fsm-backend-cron-prod"},"labels":{"pod-template-hash":"6bd6459455","app":"fsm-backend-cron-prod"},"pod":{"uid":"c356f207-bc60-4974-9f8b-02ecfe87eaa0","name":"fsm-backend-cron-prod-6bd6459455-p9p49"},"replicaset":{"name":"fsm-backend-cron-prod-6bd6459455"},"namespace":"default"}}'),
Row(value='{"log":{"offset":1947032,"file":{"path":"/var/log/containers/fsm-backend-cron-prod-6bd6459455-p9p49_default_fsm-backend-cron-prod-51838c21c82b0a19b713bf028da7418f9885fd29b606e5b9912c1c66f6c3046a.log"}},"stream":"stdout","message":"\\u001B[0;36mMongoose:\\u001B[0m enrollments.find({ createdAt: { \\u001B[32m\'$gte\'\\u001B[39m: new Date(\\"Mon, 16 Aug 2021 23:00:00 GMT\\") }, isActive: { \\u001B[32m\'$ne\'\\u001B[39m: \\u001B[33mfalse\\u001B[39m }}, { projection: {} })","#timestamp":"2021-08-17T00:00:00.004Z","ecs":{"version":"1.0.0"},"cloud":{"instance":{"id":"i-06c596f469bcf9b4a"},"region":"ap-south-1","availability_zone":"ap-south-1b","provider":"aws","machine":{"type":"t3a.large"}},"input":{"type":"container"},"#version":"1","host":{"architecture":"x86_64","name":"ip-192-168-18-105.ap-south-1.compute.internal","hostname":"ip-192-168-18-105.ap-south-1.compute.internal","os":{"codename":"Core","version":"7 (Core)","name":"CentOS Linux","kernel":"4.14.186-146.268.amzn2.x86_64","platform":"centos","family":"redhat"},"containerized":false},"tags":["beats_input_codec_plain_applied","_grokparsefailure"],"agent":{"version":"7.2.0","type":"filebeat","ephemeral_id":"af246a38-d99d-43ab-849b-cf25288dd6c1","hostname":"ip-192-168-18-105.ap-south-1.compute.internal","id":"9631683c-b8fc-40d8-9e28-f80f5fa3cc2c"},"kubernetes":{"node":{"name":"ip-192-168-18-105.ap-south-1.compute.internal"},"container":{"name":"fsm-backend-cron-prod"},"labels":{"pod-template-hash":"6bd6459455","app":"fsm-backend-cron-prod"},"pod":{"uid":"c356f207-bc60-4974-9f8b-02ecfe87eaa0","name":"fsm-backend-cron-prod-6bd6459455-p9p49"},"replicaset":{"name":"fsm-backend-cron-prod-6bd6459455"},"namespace":"default"}}'),
Row(value='{"log":{"offset":1947329,"file":{"path":"/var/log/containers/fsm-backend-cron-prod-6bd6459455-p9p49_default_fsm-backend-cron-prod-51838c21c82b0a19b713bf028da7418f9885fd29b606e5b9912c1c66f6c3046a.log"}},"stream":"stdout","message":"Currency data:{\\"result\\":\\"success\\",\\"documentation\\":\\"https://www.exchangerate-api.com/docs\\",\\"terms_of_use\\":\\"https://www.exchangerate-api.com/terms\\",\\"time_zone\\":\\"UTC\\",\\"time_last_update\\":1629072001,\\"time_next_update\\":1629158521,\\"base\\":\\"INR\\",\\"conversion_rates\\":{\\"INR\\":1,\\"AED\\":0.04946,\\"AFN\\":1.075,\\"ALL\\":1.3897,\\"AMD\\":6.63,\\"ANG\\":0.0241,\\"AOA\\":8.6508,\\"ARS\\":1.3046,\\"AUD\\":0.01829,\\"AWG\\":0.0241,\\"AZN\\":0.02285,\\"BAM\\":0.02239,\\"BBD\\":0.02693,\\"BDT\\":1.1405,\\"BGN\\":0.02239,\\"BHD\\":0.005063,\\"BIF\\":26.6861,\\"BMD\\":0.01347,\\"BND\\":0.0183,\\"BOB\\":0.09274,\\"BRL\\":0.07057,\\"BSD\\":0.01347,\\"BTN\\":1,\\"BWP\\":0.1497,\\"BYN\\":0.03361,\\"BZD\\":0.02693,\\"CAD\\":0.01688,\\"CDF\\":26.7409,\\"CHF\\":0.0124,\\"CLP\\":10.4188,\\"CNY\\":0.08741,\\"COP\\":51.9025,\\"CRC\\":8.3535,\\"CUC\\":0.01347,\\"CUP\\":0.3468,\\"CVE\\":1.2625,\\"CZK\\":0.2931,\\"DJF\\":2.3933,\\"DKK\\":0.08542,\\"DOP\\":0.7673,\\"DZD\\":1.8193,\\"EGP\\":0.2112,\\"ERN\\":0.202,\\"ETB\\":0.6076,\\"EUR\\":0.01145,\\"FJD\\":0.02801,\\"FKP\\":0.009749,\\"FOK\\":0.08542,\\"GBP\\":0.009749,\\"GEL\\":0.04202,\\"GGP\\":0.009749,\\"GHS\\":0.08076,\\"GIP\\":0.009749,\\"GMD\\":0.6996,\\"GNF\\":131.3706,\\"GTQ\\":0.1042,\\"GYD\\":2.8153,\\"HKD\\":0.1049,\\"HNL\\":0.3193,\\"HRK\\":0.08627,\\"HTG\\":1.2853,\\"HUF\\":4.086,\\"IDR\\":193.9741,\\"ILS\\":0.04382,\\"IMP\\":0.009749,\\"IQD\\":19.6346,\\"IRR\\":564.6901,\\"ISK\\":1.6955,\\"JMD\\":2.08,\\"JOD\\":0.009548,\\"JPY\\":1.4838,\\"KES\\":1.4694,\\"KGS\\":1.1411,\\"KHR\\":54.8746,\\"KID\\":0.01829,\\"KMF\\":5.6327,\\"KRW\\":15.6725,\\"KWD\\":0.004035,\\"KYD\\":0.01122,\\"KZT\\":5.7203,\\"LAK\\":129.023,\\"LBP\\":20.3005,\\"LKR\\":2.6874,\\"LRD\\":2.3085,\\"LSL\\":0.1992,\\"LYD\\":0.06092,\\"MAD\\":0.1208,\\"MDL\\":0.2391,\\"MGA\\":52.5738,\\"MKD\\":0.7039,\\"MMK\\":22.1546,\\"MNT\\":38.2846,\\"MOP\\":0.1081,\\"MRU\\":0.486,\\"MUR\\":0.5712,\\"MVR\\":0.2062,\\"MWK\\":10.9455,\\"MXN\\":0.2685,\\"MYR\\":0.05709,\\"MZN\\":0.8614,\\"NAD\\":0.1992,\\"NGN\\":5.5967,\\"NIO\\":0.4725,\\"NOK\\":0.1188,\\"NPR\\":1.6,\\"NZD\\":0.01915,\\"OMR\\":0.005178,\\"PAB\\":0.01347,\\"PEN\\":0.05494,\\"PGK\\":0.04721,\\"PHP\\":0.6817,\\"PKR\\":2.2107,\\"PLN\\":0.05259,\\"PYG\\":93.2322,\\"QAR\\":0.04902,\\"RON\\":0.05625,\\"RSD\\":1.3457,\\"RUB\\":0.988,\\"RWF\\":13.5649,\\"SAR\\":0.0505,\\"SBD\\":0.1072,\\"SCR\\":0.1964,\\"SDG\\":5.9863,\\"SEK\\":0.1167,\\"SGD\\":0.0183,\\"SHP\\":0.009749,\\"SLL\\":138.6306,\\"SOS\\":7.7848,\\"SRD\\":0.2883,\\"SSP\\":2.3945,\\"STN\\":0.2805,\\"SYP\\":16.9261,\\"SZL\\":0.1992,\\"THB\\":0.4511,\\"TJS\\":0.1521,\\"TMT\\":0.04714,\\"TND\\":0.03747,\\"TOP\\":0.03017,\\"TRY\\":0.1152,\\"TTD\\":0.09132,\\"TVD\\":0.01829,\\"TWD\\":0.3739,\\"TZS\\":31.209,\\"UAH\\":0.3591,\\"UGX\\":47.6167,\\"USD\\":0.01347,\\"UYU\\":0.5869,\\"UZS\\":144.1304,\\"VES\\":55690.0545,\\"VND\\":308.1789,\\"VUV\\":1.5055,\\"WST\\":0.03438,\\"XAF\\":7.5103,\\"XCD\\":0.03636,\\"XDR\\":0.009481,\\"XOF\\":7.5103,\\"XPF\\":1.3663,\\"YER\\":3.3625,\\"ZAR\\":0.1992,\\"ZMW\\":0.2599}}","#timestamp":"2021-08-17T00:00:00.813Z","ecs":{"version":"1.0.0"},"cloud":{"instance":{"id":"i-06c596f469bcf9b4a"},"region":"ap-south-1","provider":"aws","availability_zone":"ap-south-1b","machine":{"type":
"t3a.large"}},"input":{"type":"container"},"#version":"1","host":{"architecture":"x86_64","name":"ip-192-168-18-105.ap-south-1.compute.internal","hostname":"ip-192-168-18-105.ap-south-1.compute.internal","os":{"codename":"Core","version":"7 (Core)","name":"CentOS Linux","kernel":"4.14.186-146.268.amzn2.x86_64","platform":"centos","family":"redhat"},"containerized":false},"tags":["beats_input_codec_plain_applied","_grokparsefailure"],"agent":{"version":"7.2.0","type":"filebeat","ephemeral_id":"af246a38-d99d-43ab-849b-cf25288dd6c1","hostname":"ip-192-168-18-105.ap-south-1.compute.internal","id":"9631683c-b8fc-40d8-9e28-f80f5fa3cc2c"},"kubernetes":{"node":{"name":"ip-192-168-18-105.ap-south-1.compute.internal"},"container":{"name":"fsm-backend-cron-prod"},"labels":{"pod-template-hash":"6bd6459455","app":"fsm-backend-cron-prod"},"pod":{"uid":"c356f207-bc60-4974-9f8b-02ecfe87eaa0","name":"fsm-backend-cron-prod-6bd6459455-p9p49"},"replicaset":{"name":"fsm-backend-cron-prod-6bd6459455"},"namespace":"default"}}'),
Row(value='{"log":{"offset":1950159,"file":{"path":"/var/log/containers/fsm-backend-cron-prod-6bd6459455-p9p49_default_fsm-backend-cron-prod-51838c21c82b0a19b713bf028da7418f9885fd29b606e5b9912c1c66f6c3046a.log"}},"stream":"stdout","message":"\\u001B[0;36mMongoose:\\u001B[0m countries.find({ currency_code: \\u001B[32m\'INR\'\\u001B[39m }, { projection: {} })","#timestamp":"2021-08-17T00:00:01.026Z","ecs":{"version":"1.0.0"},"cloud":{"instance":{"id":"i-06c596f469bcf9b4a"},"region":"ap-south-1","provider":"aws","availability_zone":"ap-south-1b","machine":{"type":"t3a.large"}},"input":{"type":"container"},"#version":"1","host":{"architecture":"x86_64","name":"ip-192-168-18-105.ap-south-1.compute.internal","hostname":"ip-192-168-18-105.ap-south-1.compute.internal","containerized":false,"os":{"codename":"Core","version":"7 (Core)","name":"CentOS Linux","kernel":"4.14.186-146.268.amzn2.x86_64","platform":"centos","family":"redhat"}},"tags":["beats_input_codec_plain_applied","_grokparsefailure"],"agent":{"version":"7.2.0","type":"filebeat","ephemeral_id":"af246a38-d99d-43ab-849b-cf25288dd6c1","hostname":"ip-192-168-18-105.ap-south-1.compute.internal","id":"9631683c-b8fc-40d8-9e28-f80f5fa3cc2c"},"kubernetes":{"node":{"name":"ip-192-168-18-105.ap-south-1.compute.internal"},"container":{"name":"fsm-backend-cron-prod"},"labels":{"pod-template-hash":"6bd6459455","app":"fsm-backend-cron-prod"},"pod":{"uid":"c356f207-bc60-4974-9f8b-02ecfe87eaa0","name":"fsm-backend-cron-prod-6bd6459455-p9p49"},"replicaset":{"name":"fsm-backend-cron-prod-6bd6459455"},"namespace":"default"}}'),
Row(value='{"log":{"offset":1950341,"file":{"path":"/var/log/containers/fsm-backend-cron-prod-6bd6459455-p9p49_default_fsm-backend-cron-prod-51838c21c82b0a19b713bf028da7418f9885fd29b606e5b9912c1c66f6c3046a.log"}},"stream":"stdout","message":"\\u001B[0;36mMongoose:\\u001B[0m countries.find({ currency_code: \\u001B[32m\'AED\'\\u001B[39m }, { projection: {} })","#timestamp":"2021-08-17T00:00:01.027Z","ecs":{"version":"1.0.0"},"cloud":{"instance":{"id":"i-06c596f469bcf9b4a"},"region":"ap-south-1","availability_zone":"ap-south-1b","provider":"aws","machine":{"type":"t3a.large"}},"input":{"type":"container"},"#version":"1","host":{"architecture":"x86_64","os":{"codename":"Core","version":"7 (Core)","name":"CentOS Linux","kernel":"4.14.186-146.268.amzn2.x86_64","platform":"centos","family":"redhat"},"hostname":"ip-192-168-18-105.ap-south-1.compute.internal","containerized":false,"name":"ip-192-168-18-105.ap-south-1.compute.internal"},"tags":["beats_input_codec_plain_applied","_grokparsefailure"],"agent":{"version":"7.2.0","type":"filebeat","ephemeral_id":"af246a38-d99d-43ab-849b-cf25288dd6c1","hostname":"ip-192-168-18-105.ap-south-1.compute.internal","id":"9631683c-b8fc-40d8-9e28-f80f5fa3cc2c"},"kubernetes":{"node":{"name":"ip-192-168-18-105.ap-south-1.compute.internal"},"container":{"name":"fsm-backend-cron-prod"},"labels":{"pod-template-hash":"6bd6459455","app":"fsm-backend-cron-prod"},"pod":{"uid":"c356f207-bc60-4974-9f8b-02ecfe87eaa0","name":"fsm-backend-cron-prod-6bd6459455-p9p49"},"replicaset":{"name":"fsm-backend-cron-prod-6bd6459455"},"namespace":"default"}}')],
The server names are present in the log file, but I'm getting the count as 0. Please help.
Here is a solution using pyspark sql functions:
Split each string on the substring you are trying to count (the server name, in your case); the length of the resulting array minus 1 is the number of occurrences of that substring in the row. To get the total count across the whole file, use the sum function.
from pyspark.context import SparkContext
from pyspark.sql.context import SQLContext
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import sum, col, size, split

sc = SparkContext()
sqlContext = SQLContext(sc)
spark = SparkSession(sc)
base_df = spark.read.text("/content/fsm-20210817.logs")
server_list = ['nginx-ingress-controller-5b6697898-zqxl4', 'cert-manager-5695c78d49-q9s9j']
for i in server_list:
    # spark.read.text puts each line into a single string column named "value"
    base_df_with_count = base_df.withColumn(i + "count", size(split(col("value"), i)) - 1)
    res = base_df_with_count.select(sum(i + "count")).collect()[0][0]
    print(i, res)
PS: I have not tested the code, but it should work.
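As a quick sanity check of the split/size trick, here is a minimal sketch on a toy DataFrame (assuming an active SparkSession named spark; the sample lines are made up):
from pyspark.sql.functions import col, size, split

toy_df = spark.createDataFrame(
    [("cert-manager-5695c78d49-q9s9j restarted cert-manager-5695c78d49-q9s9j",),
     ("no server name on this line",)],
    ["value"])
# size(split(...)) - 1 counts how many times the name appears in each line: 2, then 0
toy_df.select((size(split(col("value"), "cert-manager-5695c78d49-q9s9j")) - 1).alias("hits")).show()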

Write CSV to IBM Bluemix Object Storage from DSX Python 2.7 notebook

I am trying to write a pandas dataframe as CSV to Bluemix Object Storage from a DSX Python notebook. I first save the dataframe to a 'local' CSV file. I then have a routine that attempts to write the file to Object Storage. I get a 413 response - object too large. The file is only about 3MB. Here's my code, based on a JSON example I found here: http://datascience.ibm.com/blog/working-with-object-storage-in-data-science-experience-python-edition/
import requests

def put_file(credentials, local_file_name):
    """This function writes file content to Object Storage V3."""
    url1 = ''.join(['https://identity.open.softlayer.com', '/v3/auth/tokens'])
    data = {'auth': {'identity': {'methods': ['password'],
            'password': {'user': {'name': credentials['name'], 'domain': {'id': credentials['domain']},
            'password': credentials['password']}}}}}
    headers = {'Content-Type': 'text/csv'}
    with open(local_file_name, 'rb') as f:
        resp1 = requests.post(url=url1, data=f, headers=headers)
    return resp1
Any help or pointers are much appreciated.
This code snippet from the tutorial worked fine for me (for a 12 MB file).
from io import BytesIO
import requests
import json
import pandas as pd

def put_file(credentials, local_file_name):
    """Write the content of a local file to Bluemix Object Storage V3."""
    f = open(local_file_name, 'r')
    my_data = f.read()
    url1 = ''.join(['https://identity.open.softlayer.com', '/v3/auth/tokens'])
    data = {'auth': {'identity': {'methods': ['password'],
            'password': {'user': {'name': credentials['username'], 'domain': {'id': credentials['domain_id']},
            'password': credentials['password']}}}}}
    headers1 = {'Content-Type': 'application/csv'}
    resp1 = requests.post(url=url1, data=json.dumps(data), headers=headers1)
    resp1_body = resp1.json()
    for e1 in resp1_body['token']['catalog']:
        if e1['type'] == 'object-store':
            for e2 in e1['endpoints']:
                if e2['interface'] == 'public' and e2['region'] == 'dallas':
                    url2 = ''.join([e2['url'], '/', credentials['container'], '/', local_file_name])
    s_subject_token = resp1.headers['x-subject-token']
    headers2 = {'X-Auth-Token': s_subject_token, 'accept': 'application/json'}
    resp2 = requests.put(url=url2, headers=headers2, data=my_data)
    print resp2
I created a random pandas dataframe using:
import numpy as np
df = pd.DataFrame(np.random.randint(0, 100, size=(1000000, 4)), columns=list('ABCD'))
saved it to csv
df.to_csv('myPandasData_1000000.csv',index=False)
and then put it to object store
put_file(credentials_1,'myPandasData_1000000.csv')
You can get the credentials_1 object by clicking insert to code -> Insert credentials for any object in your object store.
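For reference, the keys that put_file reads from that credentials object are exactly the ones used in the function above; a placeholder sketch (the values here are dummies, not your generated credentials):
credentials_1 = {
    'username': '<your username>',
    'domain_id': '<your domain id>',
    'password': '<your password>',
    'container': '<target container name>'
}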

How to create data frames from an RDD of a list of words

I have gone through all the answers on Stack Overflow and on the internet, but nothing works. I have this RDD of a list of words:
tweet_words=['tweet_text',
'RT',
'#ochocinco:',
'I',
'beat',
'them',
'all',
'for',
'10',
'straight',
'hours']
**What I have done till now:**
Df =sqlContext.createDataFrame(tweet_words,["tweet_text"])
and
tweet_words.toDF(['tweet_words'])
**ERROR**:
TypeError: Can not infer schema for type: <class 'str'>
Looking at the above code, you are trying to convert a list to a DataFrame. A good StackOverflow link on this is: https://stackoverflow.com/a/35009289/1100699.
That said, here's a working version of your code:
from pyspark.sql import Row
# Create RDD
tweet_wordsList = ['tweet_text', 'RT', '#ochocinco:', 'I', 'beat', 'them', 'all', 'for', '10', 'straight', 'hours']
tweet_wordsRDD = sc.parallelize(tweet_wordsList)
# Load each word and create row object
wordRDD = tweet_wordsRDD.map(lambda l: l.split(","))
tweetsRDD = wordRDD.map(lambda t: Row(tweets=t[0]))
# Infer schema (using reflection)
tweetsDF = tweetsRDD.toDF()
# show data
tweetsDF.show()
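An equivalent without the extra map through Row is to build the DataFrame directly from single-element tuples (a small alternative sketch, assuming the same sqlContext from the question is available):
tweetsDF2 = sqlContext.createDataFrame([(w,) for w in tweet_wordsList], ["tweet_text"])
tweetsDF2.show()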
HTH!

No rows present in the request when stream importing into Big Query

I was trying to stream-insert data into BigQuery using tabledata.insert_all:
job_data = {
  kind: 'bigquery#tableDataInsertAllRequest',
  rows: [
    { json: { column_name: value } }
  ]
}
response = execute(
  api_method: bigquery.tabledata.insert_all,
  parameters: {
    projectId: config['project_id'],
    datasetId: DATASET_ID,
    tableId: table_id
  },
  body_object: job_data
)
But I always get the following error message
Google::APIClient::Request Sending API request post https://www.googleapis.com/bigquery/v2/projects/propane-tribute-90023/datasets/development/tables/api_requests_20150414/insertAll {"User-Agent"=>"My Test App/1.0 google-api-ruby-client/0.8.5 Mac OS X/10.9.5\n (gzip)", "Content-Type"=>"application/json", "Accept-Encoding"=>"gzip", "Authorization"=>"Bearer ya29.VgFYvU2nxGDhWiCdS47XRw0J-7GLenRry0Cd3AA2D1RDzMh5gnf-m85I5GeSr9oNW51OuUb9mdwObg", "Cache-Control"=>"no-store"}
Decompressing gzip encoded response (155 bytes)
Decompressed (261 bytes)
Google::APIClient::Request Result: 400 {"Vary"=>"X-Origin", "Content-Type"=>"application/json; charset=UTF-8", "Date"=>"Wed, 15 Apr 2015 03:14:17 GMT", "Expires"=>"Wed, 15 Apr 2015 03:14:17 GMT", "Cache-Control"=>"private, max-age=0", "X-Content-Type-Options"=>"nosniff", "X-Frame-Options"=>"SAMEORIGIN", "X-XSS-Protection"=>"1; mode=block", "Server"=>"GSE", "Alternate-Protocol"=>"443:quic,p=0.5", "Transfer-Encoding"=>"chunked"} => {"error"=>{"errors"=>[{"domain"=>"global", "reason"=>"invalid", "message"=>"No rows present in the request.", "locationType"=>"other", "location"=>"rows"}], "code"=>400, "message"=>"No rows present in the request."}}
Does anyone have the same the issue and know how to fix it?
Thanks.
Make sure you are providing appropriate values for all of the column headers present in your table’s schema. Providing separate “json” entries will populate individual rows with the column data you provide. Unless you have already assigned values to the variables named column_name and value, you need to provide those values in the statement following the json declaration.
A sample Ruby syntax for a tabledata.insert_all “rows” operation would look as follows:
body = {
  "rows" => [
    { "json" => { "person_id" => 10, "person_name" => "test" } },
    { "json" => { "person_id" => 11, "person_name" => "test2" } }
  ]
}

Importing Archived TweetStream Twitter Output into MongoDB?

I have around 1000 lines of twitter data captured using python tweetstream. The data was collected using the simple tweetstream example of:
>>> stream = tweetstream.SampleStream("username", "password")
>>> for tweet in stream:
... print tweet
which outputs something like:
{u'user': {u'follow_request_sent': None,
u'profile_use_background_image': True,
u'profile_background_image_url_https': u'https://si0.twimg.com/
profile_background_images/
181013334/25957_1367646636642_1395984493_31038644_61586_n.jpg',
u'verified': False, u'profile_image_url_https': u'https://
si0.twimg.com/profile_images/1820265868/rosajennifer_normal.jpg',
u'profile_sidebar_fill_color': u'DDEEF6', u'id': 46478005,
u'profile_text_color': u'333333', u'followers_count': 505,
u'protected': False, u'location': u'', u'default_profile_image':
False, u'listed_count': 4, u'utc_offset': -21600, u'statuses_count':
35923, u'description': u'Take me as I am or watch me as I go. . .\n
\n', u'friends_count': 315, u'profile_link_color': u'0084B4',
u'profile_image_url': u'http://a1.twimg.com/profile_images/1820265868/
rosajennifer_normal.jpg', u'notifications': None,
u'show_all_inline_media': True, u'geo_enabled': False,
u'profile_background_color': u'C0DEED', u'id_str': u'46478005',
u'profile_background_image_url': u'http://a2.twimg.com/
profile_background_images/
181013334/25957_1367646636642_1395984493_31038644_61586_n.jpg',
u'name': u'rosa jennifer', u'lang': u'en', u'following': None,
u'profile_background_tile': True, u'favourites_count': 82,
u'screen_name': u'rosajennifer', u'url': u'http://www.facebook.com/
profile.php?ref=profile&id=1329240058', u'created_at': u'Thu Jun 11
20:11:28 +0000 2009', u'contributors_enabled': False, u'time_zone':
u'Central Time (US & Canada)', u'profile_sidebar_border_color':
u'C0DEED', u'default_profile': False, u'is_translator': False},
u'favorited': False, u'contributors': None, u'entities':
{u'user_mentions': [{u'indices': [1, 14], u'id': 90939650, u'id_str':
u'90939650', u'name': u'Dajuan(Dae-John)', u'screen_name':
u'Juan_Ton5oup'}], u'hashtags': [], u'urls': []}, u'text':
u'\u201c#Juan_Ton5oup: Spanish girls love jeans with animals outlined
on the back pockets.\u201dfoh lmao', u'created_at': u'Tue Feb 14
00:27:32 +0000 2012', u'truncated': False, u'retweeted': False,
u'in_reply_to_status_id': None, u'coordinates': None, u'id':
169216166617817088, u'source': u'<a href="http://twitter.com/#!/
download/ipad" rel="nofollow">Twitter for iPad</a>',
u'in_reply_to_status_id_str': None, u'in_reply_to_screen_name': None,
u'id_str': u'169216166617817088', u'place': None, u'retweet_count': 0,
u'geo': None, u'in_reply_to_user_id_str': None,
u'in_reply_to_user_id': None}
I have a single file of ~1000 of these, each on a separate line. I've tried mongoimport as well as a dozen other methods, but I can't seem to get this data imported. Mongoimport passes back this error:
Sat Mar 10 12:51:00 Assertion: 10340:Failure parsing JSON string near:
u'user': {
0x581762 0x528994 0xaa29f3 0xaa4ca3 0xa9b7dd 0xa9f772 0x34df82169d
0x4fe679
mongoimport(_ZN5mongo11msgassertedEiPKc+0x112) [0x581762]
mongoimport(_ZN5mongo8fromjsonEPKcPi+0x444) [0x528994]
mongoimport(_ZN6Import8parseRowEPSiRN5mongo7BSONObjERi+0x8b3)
[0xaa29f3]
mongoimport(_ZN6Import3runEv+0x16e3) [0xaa4ca3]
mongoimport(_ZN5mongo4Tool4mainEiPPc+0x169d) [0xa9b7dd]
mongoimport(main+0x32) [0xa9f772]
/lib64/libc.so.6(__libc_start_main+0xed) [0x34df82169d]
mongoimport(__gxx_personality_v0+0x3d1) [0x4fe679]
exception:Failure parsing JSON string near: u'user': {
I assume this is because the string is not actual JSON; it's some sort of JSON-like format.
Can anyone help?
The first problem, as you noted: the following is not valid JSON, it's a Python dictionary: {u'indices':.
Second problem, why are you trying to use mongoimport? In python you can just save the dictionary to the database. This is basically the first example of how to use MongoDB.
>>> from pymongo import Connection
>>> connection = Connection('localhost', 27017)
>>> db = connection.test_database
>>> posts = db.posts
>>> stream = tweetstream.SampleStream("username", "password")
>>> for tweet in stream:
...     posts.insert(tweet)
The following code works; I had some issues with ast but got past them once I updated Python on my system. This script reads a file line by line in Python dictionary format and outputs valid JSON for import into MongoDB.
import json
import ast

mydict = open('data.txt', 'r')
for line in mydict:
    line = ast.literal_eval(line)
    line = json.dumps(line)
    print line
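Putting the two pieces together, you can also skip mongoimport entirely and insert the parsed lines straight into MongoDB from Python (a sketch, assuming a reasonably recent pymongo with MongoClient/insert_one and a dump file named data.txt):
import ast
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
tweets = client.test_database.tweets

with open('data.txt', 'r') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        # each line is a Python dict literal (not JSON), so parse it with ast
        tweets.insert_one(ast.literal_eval(line))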