I have a log file. I read it into a DataFrame and converted it to an RDD. I want to count the number of times each server name appears in the file.
This is the code I have:
from pyspark.context import SparkContext
from pyspark.sql.context import SQLContext
from pyspark.sql.session import SparkSession
sc = SparkContext()
sqlContext = SQLContext(sc)
spark = SparkSession(sc)
base_df = spark.read.text("/content/fsm-20210817.logs")
base_df_rdd = base_df.rdd
server_list = ['nginx-ingress-controller-5b6697898-zqxl4','cert-manager-5695c78d49-q9s9j']
for i in server_list:
    res = base_df_rdd.map(lambda x: x[0].split(' ').count(i)).sum()
    print(i,res)
The output I get is:
nginx-ingress-controller-5b6697898-zqxl4 0
cert-manager-5695c78d49-q9s9j 0
The contents of base_df_rdd are:
[Row(value='{"log":{"offset":5367960,"file":{"path":"/var/log/containers/cert-manager-5695c78d49-q9s9j_cert-manager_cert-manager-cf5af9cbaccdccd8f637d0ba7313996b1cfc0bab7b25fe9f157953918016ac84.log"}},"stream":"stderr","message":"E0817 00:00:00.144379 1 sync.go:183] cert-manager/controller/challenges \\"msg\\"=\\"propagation check failed\\" \\"error\\"=\\"wrong status code \'404\', expected \'200\'\\" \\"dnsName\\"=\\"prodapi.fsmbuddy.com\\" \\"resource_kind\\"=\\"Challenge\\" \\"resource_name\\"=\\"prodapi-fsmbuddy-tls-cert-vtvdq-1471208926-2451592135\\" \\"resource_namespace\\"=\\"default\\" \\"resource_version\\"=\\"v1\\" \\"type\\"=\\"HTTP-01\\" ","#timestamp":"2021-08-17T00:00:00.144Z","ecs":{"version":"1.0.0"},"cloud":{"instance":{"id":"i-06c596f469bcf9b4a"},"region":"ap-south-1","provider":"aws","availability_zone":"ap-south-1b","machine":{"type":"t3a.large"}},"input":{"type":"container"},"#version":"1","host":{"architecture":"x86_64","os":{"codename":"Core","version":"7 (Core)","name":"CentOS 
Linux","kernel":"4.14.186-146.268.amzn2.x86_64","platform":"centos","family":"redhat"},"hostname":"ip-192-168-18-105.ap-south-1.compute.internal","containerized":false,"name":"ip-192-168-18-105.ap-south-1.compute.internal"},"tags":["beats_input_codec_plain_applied","_grokparsefailure"],"agent":{"version":"7.2.0","type":"filebeat","ephemeral_id":"af246a38-d99d-43ab-849b-cf25288dd6c1","hostname":"ip-192-168-18-105.ap-south-1.compute.internal","id":"9631683c-b8fc-40d8-9e28-f80f5fa3cc2c"},"kubernetes":{"node":{"name":"ip-192-168-18-105.ap-south-1.compute.internal"},"container":{"name":"cert-manager"},"labels":{"helm_sh/chart":"cert-manager-v1.0.1","app_kubernetes_io/component":"controller","app_kubernetes_io/managed-by":"Tiller","pod-template-hash":"5695c78d49","app_kubernetes_io/instance":"cert-manager","app_kubernetes_io/name":"cert-manager","app":"cert-manager"},"pod":{"uid":"0be2ef9e-f2ee-4d40-b74b-17734527d78c","name":"cert-manager-5695c78d49-q9s9j"},"replicaset":{"name":"cert-manager-5695c78d49"},"namespace":"cert-manager"}}'),
Row(value='{"log":{"offset":1946553,"file":{"path":"/var/log/containers/fsm-backend-cron-prod-6bd6459455-p9p49_default_fsm-backend-cron-prod-51838c21c82b0a19b713bf028da7418f9885fd29b606e5b9912c1c66f6c3046a.log"}},"stream":"stdout","message":"Inside user information updation cron: 2021-08-17T00:00:00.001Z","#timestamp":"2021-08-17T00:00:00.001Z","ecs":{"version":"1.0.0"},"cloud":{"instance":{"id":"i-06c596f469bcf9b4a"},"region":"ap-south-1","availability_zone":"ap-south-1b","provider":"aws","machine":{"type":"t3a.large"}},"input":{"type":"container"},"#version":"1","host":{"architecture":"x86_64","os":{"codename":"Core","version":"7 (Core)","name":"CentOS Linux","kernel":"4.14.186-146.268.amzn2.x86_64","platform":"centos","family":"redhat"},"hostname":"ip-192-168-18-105.ap-south-1.compute.internal","containerized":false,"name":"ip-192-168-18-105.ap-south-1.compute.internal"},"tags":["beats_input_codec_plain_applied","_grokparsefailure"],"agent":{"version":"7.2.0","type":"filebeat","ephemeral_id":"af246a38-d99d-43ab-849b-cf25288dd6c1","hostname":"ip-192-168-18-105.ap-south-1.compute.internal","id":"9631683c-b8fc-40d8-9e28-f80f5fa3cc2c"},"kubernetes":{"node":{"name":"ip-192-168-18-105.ap-south-1.compute.internal"},"container":{"name":"fsm-backend-cron-prod"},"labels":{"pod-template-hash":"6bd6459455","app":"fsm-backend-cron-prod"},"pod":{"uid":"c356f207-bc60-4974-9f8b-02ecfe87eaa0","name":"fsm-backend-cron-prod-6bd6459455-p9p49"},"replicaset":{"name":"fsm-backend-cron-prod-6bd6459455"},"namespace":"default"}}'),
Row(value='{"log":{"offset":1946687,"file":{"path":"/var/log/containers/fsm-backend-cron-prod-6bd6459455-p9p49_default_fsm-backend-cron-prod-51838c21c82b0a19b713bf028da7418f9885fd29b606e5b9912c1c66f6c3046a.log"}},"stream":"stdout","message":"\\u001B[0;36mMongoose:\\u001B[0m orders.find({ orderStatus: \\u001B[32m\'success\'\\u001B[39m, orderDate: { \\u001B[32m\'$gte\'\\u001B[39m: new Date(\\"Mon, 16 Aug 2021 00:00:00 GMT\\") }}, { projection: {} })","#timestamp":"2021-08-17T00:00:00.002Z","ecs":{"version":"1.0.0"},"cloud":{"instance":{"id":"i-06c596f469bcf9b4a"},"region":"ap-south-1","availability_zone":"ap-south-1b","provider":"aws","machine":{"type":"t3a.large"}},"input":{"type":"container"},"#version":"1","host":{"os":{"codename":"Core","version":"7 (Core)","name":"CentOS Linux","kernel":"4.14.186-146.268.amzn2.x86_64","platform":"centos","family":"redhat"},"name":"ip-192-168-18-105.ap-south-1.compute.internal","hostname":"ip-192-168-18-105.ap-south-1.compute.internal","containerized":false,"architecture":"x86_64"},"tags":["beats_input_codec_plain_applied","_grokparsefailure"],"agent":{"version":"7.2.0","type":"filebeat","ephemeral_id":"af246a38-d99d-43ab-849b-cf25288dd6c1","hostname":"ip-192-168-18-105.ap-south-1.compute.internal","id":"9631683c-b8fc-40d8-9e28-f80f5fa3cc2c"},"kubernetes":{"node":{"name":"ip-192-168-18-105.ap-south-1.compute.internal"},"container":{"name":"fsm-backend-cron-prod"},"labels":{"pod-template-hash":"6bd6459455","app":"fsm-backend-cron-prod"},"pod":{"uid":"c356f207-bc60-4974-9f8b-02ecfe87eaa0","name":"fsm-backend-cron-prod-6bd6459455-p9p49"},"replicaset":{"name":"fsm-backend-cron-prod-6bd6459455"},"namespace":"default"}}'),
Row(value='{"log":{"offset":1946955,"file":{"path":"/var/log/containers/fsm-backend-cron-prod-6bd6459455-p9p49_default_fsm-backend-cron-prod-51838c21c82b0a19b713bf028da7418f9885fd29b606e5b9912c1c66f6c3046a.log"}},"stream":"stdout","message":"******","#timestamp":"2021-08-17T00:00:00.003Z","ecs":{"version":"1.0.0"},"cloud":{"instance":{"id":"i-06c596f469bcf9b4a"},"region":"ap-south-1","availability_zone":"ap-south-1b","provider":"aws","machine":{"type":"t3a.large"}},"input":{"type":"container"},"#version":"1","host":{"architecture":"x86_64","name":"ip-192-168-18-105.ap-south-1.compute.internal","hostname":"ip-192-168-18-105.ap-south-1.compute.internal","os":{"codename":"Core","version":"7 (Core)","name":"CentOS Linux","kernel":"4.14.186-146.268.amzn2.x86_64","platform":"centos","family":"redhat"},"containerized":false},"tags":["beats_input_codec_plain_applied","_grokparsefailure"],"agent":{"version":"7.2.0","type":"filebeat","ephemeral_id":"af246a38-d99d-43ab-849b-cf25288dd6c1","hostname":"ip-192-168-18-105.ap-south-1.compute.internal","id":"9631683c-b8fc-40d8-9e28-f80f5fa3cc2c"},"kubernetes":{"node":{"name":"ip-192-168-18-105.ap-south-1.compute.internal"},"container":{"name":"fsm-backend-cron-prod"},"labels":{"pod-template-hash":"6bd6459455","app":"fsm-backend-cron-prod"},"pod":{"uid":"c356f207-bc60-4974-9f8b-02ecfe87eaa0","name":"fsm-backend-cron-prod-6bd6459455-p9p49"},"replicaset":{"name":"fsm-backend-cron-prod-6bd6459455"},"namespace":"default"}}'),
Row(value='{"log":{"offset":1947032,"file":{"path":"/var/log/containers/fsm-backend-cron-prod-6bd6459455-p9p49_default_fsm-backend-cron-prod-51838c21c82b0a19b713bf028da7418f9885fd29b606e5b9912c1c66f6c3046a.log"}},"stream":"stdout","message":"\\u001B[0;36mMongoose:\\u001B[0m enrollments.find({ createdAt: { \\u001B[32m\'$gte\'\\u001B[39m: new Date(\\"Mon, 16 Aug 2021 23:00:00 GMT\\") }, isActive: { \\u001B[32m\'$ne\'\\u001B[39m: \\u001B[33mfalse\\u001B[39m }}, { projection: {} })","#timestamp":"2021-08-17T00:00:00.004Z","ecs":{"version":"1.0.0"},"cloud":{"instance":{"id":"i-06c596f469bcf9b4a"},"region":"ap-south-1","availability_zone":"ap-south-1b","provider":"aws","machine":{"type":"t3a.large"}},"input":{"type":"container"},"#version":"1","host":{"architecture":"x86_64","name":"ip-192-168-18-105.ap-south-1.compute.internal","hostname":"ip-192-168-18-105.ap-south-1.compute.internal","os":{"codename":"Core","version":"7 (Core)","name":"CentOS Linux","kernel":"4.14.186-146.268.amzn2.x86_64","platform":"centos","family":"redhat"},"containerized":false},"tags":["beats_input_codec_plain_applied","_grokparsefailure"],"agent":{"version":"7.2.0","type":"filebeat","ephemeral_id":"af246a38-d99d-43ab-849b-cf25288dd6c1","hostname":"ip-192-168-18-105.ap-south-1.compute.internal","id":"9631683c-b8fc-40d8-9e28-f80f5fa3cc2c"},"kubernetes":{"node":{"name":"ip-192-168-18-105.ap-south-1.compute.internal"},"container":{"name":"fsm-backend-cron-prod"},"labels":{"pod-template-hash":"6bd6459455","app":"fsm-backend-cron-prod"},"pod":{"uid":"c356f207-bc60-4974-9f8b-02ecfe87eaa0","name":"fsm-backend-cron-prod-6bd6459455-p9p49"},"replicaset":{"name":"fsm-backend-cron-prod-6bd6459455"},"namespace":"default"}}'),
Row(value='{"log":{"offset":1947329,"file":{"path":"/var/log/containers/fsm-backend-cron-prod-6bd6459455-p9p49_default_fsm-backend-cron-prod-51838c21c82b0a19b713bf028da7418f9885fd29b606e5b9912c1c66f6c3046a.log"}},"stream":"stdout","message":"Currency data:{\\"result\\":\\"success\\",\\"documentation\\":\\"https://www.exchangerate-api.com/docs\\",\\"terms_of_use\\":\\"https://www.exchangerate-api.com/terms\\",\\"time_zone\\":\\"UTC\\",\\"time_last_update\\":1629072001,\\"time_next_update\\":1629158521,\\"base\\":\\"INR\\",\\"conversion_rates\\":{\\"INR\\":1,\\"AED\\":0.04946,\\"AFN\\":1.075,\\"ALL\\":1.3897,\\"AMD\\":6.63,\\"ANG\\":0.0241,\\"AOA\\":8.6508,\\"ARS\\":1.3046,\\"AUD\\":0.01829,\\"AWG\\":0.0241,\\"AZN\\":0.02285,\\"BAM\\":0.02239,\\"BBD\\":0.02693,\\"BDT\\":1.1405,\\"BGN\\":0.02239,\\"BHD\\":0.005063,\\"BIF\\":26.6861,\\"BMD\\":0.01347,\\"BND\\":0.0183,\\"BOB\\":0.09274,\\"BRL\\":0.07057,\\"BSD\\":0.01347,\\"BTN\\":1,\\"BWP\\":0.1497,\\"BYN\\":0.03361,\\"BZD\\":0.02693,\\"CAD\\":0.01688,\\"CDF\\":26.7409,\\"CHF\\":0.0124,\\"CLP\\":10.4188,\\"CNY\\":0.08741,\\"COP\\":51.9025,\\"CRC\\":8.3535,\\"CUC\\":0.01347,\\"CUP\\":0.3468,\\"CVE\\":1.2625,\\"CZK\\":0.2931,\\"DJF\\":2.3933,\\"DKK\\":0.08542,\\"DOP\\":0.7673,\\"DZD\\":1.8193,\\"EGP\\":0.2112,\\"ERN\\":0.202,\\"ETB\\":0.6076,\\"EUR\\":0.01145,\\"FJD\\":0.02801,\\"FKP\\":0.009749,\\"FOK\\":0.08542,\\"GBP\\":0.009749,\\"GEL\\":0.04202,\\"GGP\\":0.009749,\\"GHS\\":0.08076,\\"GIP\\":0.009749,\\"GMD\\":0.6996,\\"GNF\\":131.3706,\\"GTQ\\":0.1042,\\"GYD\\":2.8153,\\"HKD\\":0.1049,\\"HNL\\":0.3193,\\"HRK\\":0.08627,\\"HTG\\":1.2853,\\"HUF\\":4.086,\\"IDR\\":193.9741,\\"ILS\\":0.04382,\\"IMP\\":0.009749,\\"IQD\\":19.6346,\\"IRR\\":564.6901,\\"ISK\\":1.6955,\\"JMD\\":2.08,\\"JOD\\":0.009548,\\"JPY\\":1.4838,\\"KES\\":1.4694,\\"KGS\\":1.1411,\\"KHR\\":54.8746,\\"KID\\":0.01829,\\"KMF\\":5.6327,\\"KRW\\":15.6725,\\"KWD\\":0.004035,\\"KYD\\":0.01122,\\"KZT\\":5.7203,\\"LAK\\":129.023,\\"LBP\\":20.3005,\\"LKR\\":2.6874
,\\"LRD\\":2.3085,\\"LSL\\":0.1992,\\"LYD\\":0.06092,\\"MAD\\":0.1208,\\"MDL\\":0.2391,\\"MGA\\":52.5738,\\"MKD\\":0.7039,\\"MMK\\":22.1546,\\"MNT\\":38.2846,\\"MOP\\":0.1081,\\"MRU\\":0.486,\\"MUR\\":0.5712,\\"MVR\\":0.2062,\\"MWK\\":10.9455,\\"MXN\\":0.2685,\\"MYR\\":0.05709,\\"MZN\\":0.8614,\\"NAD\\":0.1992,\\"NGN\\":5.5967,\\"NIO\\":0.4725,\\"NOK\\":0.1188,\\"NPR\\":1.6,\\"NZD\\":0.01915,\\"OMR\\":0.005178,\\"PAB\\":0.01347,\\"PEN\\":0.05494,\\"PGK\\":0.04721,\\"PHP\\":0.6817,\\"PKR\\":2.2107,\\"PLN\\":0.05259,\\"PYG\\":93.2322,\\"QAR\\":0.04902,\\"RON\\":0.05625,\\"RSD\\":1.3457,\\"RUB\\":0.988,\\"RWF\\":13.5649,\\"SAR\\":0.0505,\\"SBD\\":0.1072,\\"SCR\\":0.1964,\\"SDG\\":5.9863,\\"SEK\\":0.1167,\\"SGD\\":0.0183,\\"SHP\\":0.009749,\\"SLL\\":138.6306,\\"SOS\\":7.7848,\\"SRD\\":0.2883,\\"SSP\\":2.3945,\\"STN\\":0.2805,\\"SYP\\":16.9261,\\"SZL\\":0.1992,\\"THB\\":0.4511,\\"TJS\\":0.1521,\\"TMT\\":0.04714,\\"TND\\":0.03747,\\"TOP\\":0.03017,\\"TRY\\":0.1152,\\"TTD\\":0.09132,\\"TVD\\":0.01829,\\"TWD\\":0.3739,\\"TZS\\":31.209,\\"UAH\\":0.3591,\\"UGX\\":47.6167,\\"USD\\":0.01347,\\"UYU\\":0.5869,\\"UZS\\":144.1304,\\"VES\\":55690.0545,\\"VND\\":308.1789,\\"VUV\\":1.5055,\\"WST\\":0.03438,\\"XAF\\":7.5103,\\"XCD\\":0.03636,\\"XDR\\":0.009481,\\"XOF\\":7.5103,\\"XPF\\":1.3663,\\"YER\\":3.3625,\\"ZAR\\":0.1992,\\"ZMW\\":0.2599}}","#timestamp":"2021-08-17T00:00:00.813Z","ecs":{"version":"1.0.0"},"cloud":{"instance":{"id":"i-06c596f469bcf9b4a"},"region":"ap-south-1","provider":"aws","availability_zone":"ap-south-1b","machine":{"type":"t3a.large"}},"input":{"type":"container"},"#version":"1","host":{"architecture":"x86_64","name":"ip-192-168-18-105.ap-south-1.compute.internal","hostname":"ip-192-168-18-105.ap-south-1.compute.internal","os":{"codename":"Core","version":"7 (Core)","name":"CentOS 
Linux","kernel":"4.14.186-146.268.amzn2.x86_64","platform":"centos","family":"redhat"},"containerized":false},"tags":["beats_input_codec_plain_applied","_grokparsefailure"],"agent":{"version":"7.2.0","type":"filebeat","ephemeral_id":"af246a38-d99d-43ab-849b-cf25288dd6c1","hostname":"ip-192-168-18-105.ap-south-1.compute.internal","id":"9631683c-b8fc-40d8-9e28-f80f5fa3cc2c"},"kubernetes":{"node":{"name":"ip-192-168-18-105.ap-south-1.compute.internal"},"container":{"name":"fsm-backend-cron-prod"},"labels":{"pod-template-hash":"6bd6459455","app":"fsm-backend-cron-prod"},"pod":{"uid":"c356f207-bc60-4974-9f8b-02ecfe87eaa0","name":"fsm-backend-cron-prod-6bd6459455-p9p49"},"replicaset":{"name":"fsm-backend-cron-prod-6bd6459455"},"namespace":"default"}}'),
Row(value='{"log":{"offset":1950159,"file":{"path":"/var/log/containers/fsm-backend-cron-prod-6bd6459455-p9p49_default_fsm-backend-cron-prod-51838c21c82b0a19b713bf028da7418f9885fd29b606e5b9912c1c66f6c3046a.log"}},"stream":"stdout","message":"\\u001B[0;36mMongoose:\\u001B[0m countries.find({ currency_code: \\u001B[32m\'INR\'\\u001B[39m }, { projection: {} })","#timestamp":"2021-08-17T00:00:01.026Z","ecs":{"version":"1.0.0"},"cloud":{"instance":{"id":"i-06c596f469bcf9b4a"},"region":"ap-south-1","provider":"aws","availability_zone":"ap-south-1b","machine":{"type":"t3a.large"}},"input":{"type":"container"},"#version":"1","host":{"architecture":"x86_64","name":"ip-192-168-18-105.ap-south-1.compute.internal","hostname":"ip-192-168-18-105.ap-south-1.compute.internal","containerized":false,"os":{"codename":"Core","version":"7 (Core)","name":"CentOS Linux","kernel":"4.14.186-146.268.amzn2.x86_64","platform":"centos","family":"redhat"}},"tags":["beats_input_codec_plain_applied","_grokparsefailure"],"agent":{"version":"7.2.0","type":"filebeat","ephemeral_id":"af246a38-d99d-43ab-849b-cf25288dd6c1","hostname":"ip-192-168-18-105.ap-south-1.compute.internal","id":"9631683c-b8fc-40d8-9e28-f80f5fa3cc2c"},"kubernetes":{"node":{"name":"ip-192-168-18-105.ap-south-1.compute.internal"},"container":{"name":"fsm-backend-cron-prod"},"labels":{"pod-template-hash":"6bd6459455","app":"fsm-backend-cron-prod"},"pod":{"uid":"c356f207-bc60-4974-9f8b-02ecfe87eaa0","name":"fsm-backend-cron-prod-6bd6459455-p9p49"},"replicaset":{"name":"fsm-backend-cron-prod-6bd6459455"},"namespace":"default"}}'),
Row(value='{"log":{"offset":1950341,"file":{"path":"/var/log/containers/fsm-backend-cron-prod-6bd6459455-p9p49_default_fsm-backend-cron-prod-51838c21c82b0a19b713bf028da7418f9885fd29b606e5b9912c1c66f6c3046a.log"}},"stream":"stdout","message":"\\u001B[0;36mMongoose:\\u001B[0m countries.find({ currency_code: \\u001B[32m\'AED\'\\u001B[39m }, { projection: {} })","#timestamp":"2021-08-17T00:00:01.027Z","ecs":{"version":"1.0.0"},"cloud":{"instance":{"id":"i-06c596f469bcf9b4a"},"region":"ap-south-1","availability_zone":"ap-south-1b","provider":"aws","machine":{"type":"t3a.large"}},"input":{"type":"container"},"#version":"1","host":{"architecture":"x86_64","os":{"codename":"Core","version":"7 (Core)","name":"CentOS Linux","kernel":"4.14.186-146.268.amzn2.x86_64","platform":"centos","family":"redhat"},"hostname":"ip-192-168-18-105.ap-south-1.compute.internal","containerized":false,"name":"ip-192-168-18-105.ap-south-1.compute.internal"},"tags":["beats_input_codec_plain_applied","_grokparsefailure"],"agent":{"version":"7.2.0","type":"filebeat","ephemeral_id":"af246a38-d99d-43ab-849b-cf25288dd6c1","hostname":"ip-192-168-18-105.ap-south-1.compute.internal","id":"9631683c-b8fc-40d8-9e28-f80f5fa3cc2c"},"kubernetes":{"node":{"name":"ip-192-168-18-105.ap-south-1.compute.internal"},"container":{"name":"fsm-backend-cron-prod"},"labels":{"pod-template-hash":"6bd6459455","app":"fsm-backend-cron-prod"},"pod":{"uid":"c356f207-bc60-4974-9f8b-02ecfe87eaa0","name":"fsm-backend-cron-prod-6bd6459455-p9p49"},"replicaset":{"name":"fsm-backend-cron-prod-6bd6459455"},"namespace":"default"}}')],
The server names are present in the log file, but I'm getting a count of 0. Please help.
Here is a solution using PySpark SQL functions.
Split each line on the substring you want to count (the server name, in your case). The resulting array has one more element than the number of occurrences, so its size minus 1 is the count for that row. To get the total for the whole file, sum that value over all rows.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# getOrCreate() reuses an existing session instead of building the contexts by hand
spark = SparkSession.builder.getOrCreate()
# spark.read.text yields one row per line in a single string column named "value"
base_df = spark.read.text("/content/fsm-20210817.logs")
server_list = ['nginx-ingress-controller-5b6697898-zqxl4','cert-manager-5695c78d49-q9s9j']
for i in server_list:
    # split() treats i as a regular expression; these names contain no regex metacharacters
    base_df_with_count = base_df.withColumn("count", F.size(F.split(F.col("value"), i)) - 1)
    res = base_df_with_count.select(F.sum("count")).collect()[0][0]
    print(i, res)
PS: I have not tested the code, but it should work.
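As for why the RDD version in the question prints 0: splitting on ' ' and calling list.count(i) only counts tokens that are exactly equal to the server name, and in these JSON lines the name is always embedded inside a longer token. The plain-Python sketch below illustrates the difference on two made-up, shortened lines (the sample data and the function names are assumptions for illustration, not part of the original log). On the RDD itself, something like base_df_rdd.map(lambda row: row[0].count(i)).sum() would likely give the expected totals.

```python
def count_exact_tokens(lines, name):
    """Mimic the question's approach: count space-separated tokens equal to name."""
    return sum(line.split(' ').count(name) for line in lines)


def count_substring(lines, name):
    """Count non-overlapping occurrences of name anywhere in each line."""
    return sum(line.count(name) for line in lines)


def count_by_split(lines, name):
    """The answer's idea in plain Python: splitting on name yields count + 1 pieces."""
    return sum(len(line.split(name)) - 1 for line in lines)


# Hypothetical, shortened log lines modelled on the rows shown in the question.
sample_lines = [
    '{"file":{"path":"/var/log/containers/cert-manager-5695c78d49-q9s9j_cert-manager.log"},'
    '"message":"propagation check failed","pod":{"name":"cert-manager-5695c78d49-q9s9j"}}',
    '{"message":"Inside user information updation cron",'
    '"pod":{"name":"fsm-backend-cron-prod-6bd6459455-p9p49"}}',
]
name = 'cert-manager-5695c78d49-q9s9j'

print(count_exact_tokens(sample_lines, name))  # 0
print(count_substring(sample_lines, name))     # 2
print(count_by_split(sample_lines, name))      # 2
```

Note that str.count and the split-based count both count occurrences, not matching lines, so a line that mentions the name twice contributes 2.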
I have around 1000 lines of Twitter data captured using the Python
tweetstream library. The data was collected using the simple tweetstream
example:
>>> stream = tweetstream.SampleStream("username", "password")
>>> for tweet in stream:
...     print tweet
which outputs like:
{u'user': {u'follow_request_sent': None,
u'profile_use_background_image': True,
u'profile_background_image_url_https': u'https://si0.twimg.com/
profile_background_images/
181013334/25957_1367646636642_1395984493_31038644_61586_n.jpg',
u'verified': False, u'profile_image_url_https': u'https://
si0.twimg.com/profile_images/1820265868/rosajennifer_normal.jpg',
u'profile_sidebar_fill_color': u'DDEEF6', u'id': 46478005,
u'profile_text_color': u'333333', u'followers_count': 505,
u'protected': False, u'location': u'', u'default_profile_image':
False, u'listed_count': 4, u'utc_offset': -21600, u'statuses_count':
35923, u'description': u'Take me as I am or watch me as I go. . .\n
\n', u'friends_count': 315, u'profile_link_color': u'0084B4',
u'profile_image_url': u'http://a1.twimg.com/profile_images/1820265868/
rosajennifer_normal.jpg', u'notifications': None,
u'show_all_inline_media': True, u'geo_enabled': False,
u'profile_background_color': u'C0DEED', u'id_str': u'46478005',
u'profile_background_image_url': u'http://a2.twimg.com/
profile_background_images/
181013334/25957_1367646636642_1395984493_31038644_61586_n.jpg',
u'name': u'rosa jennifer', u'lang': u'en', u'following': None,
u'profile_background_tile': True, u'favourites_count': 82,
u'screen_name': u'rosajennifer', u'url': u'http://www.facebook.com/
profile.php?ref=profile&id=1329240058', u'created_at': u'Thu Jun 11
20:11:28 +0000 2009', u'contributors_enabled': False, u'time_zone':
u'Central Time (US & Canada)', u'profile_sidebar_border_color':
u'C0DEED', u'default_profile': False, u'is_translator': False},
u'favorited': False, u'contributors': None, u'entities':
{u'user_mentions': [{u'indices': [1, 14], u'id': 90939650, u'id_str':
u'90939650', u'name': u'Dajuan(Dae-John)', u'screen_name':
u'Juan_Ton5oup'}], u'hashtags': [], u'urls': []}, u'text':
u'\u201c#Juan_Ton5oup: Spanish girls love jeans with animals outlined
on the back pockets.\u201dfoh lmao', u'created_at': u'Tue Feb 14
00:27:32 +0000 2012', u'truncated': False, u'retweeted': False,
u'in_reply_to_status_id': None, u'coordinates': None, u'id':
169216166617817088, u'source': u'<a href="http://twitter.com/#!/
download/ipad" rel="nofollow">Twitter for iPad</a>',
u'in_reply_to_status_id_str': None, u'in_reply_to_screen_name': None,
u'id_str': u'169216166617817088', u'place': None, u'retweet_count': 0,
u'geo': None, u'in_reply_to_user_id_str': None,
u'in_reply_to_user_id': None}
I have a single file of ~1000 of these, each on a separate line. I've
tried mongoimport as well as a dozen other methods, but I can't seem to
get this data imported. mongoimport returns this error:
Sat Mar 10 12:51:00 Assertion: 10340:Failure parsing JSON string near:
u'user': {
0x581762 0x528994 0xaa29f3 0xaa4ca3 0xa9b7dd 0xa9f772 0x34df82169d
0x4fe679
mongoimport(_ZN5mongo11msgassertedEiPKc+0x112) [0x581762]
mongoimport(_ZN5mongo8fromjsonEPKcPi+0x444) [0x528994]
mongoimport(_ZN6Import8parseRowEPSiRN5mongo7BSONObjERi+0x8b3)
[0xaa29f3]
mongoimport(_ZN6Import3runEv+0x16e3) [0xaa4ca3]
mongoimport(_ZN5mongo4Tool4mainEiPPc+0x169d) [0xa9b7dd]
mongoimport(main+0x32) [0xa9f772]
/lib64/libc.so.6(__libc_start_main+0xed) [0x34df82169d]
mongoimport(__gxx_personality_v0+0x3d1) [0x4fe679]
exception:Failure parsing JSON string near: u'user': {
I assume this is because the string is not actual JSON but some sort
of JSON-like format.
Can anyone help?
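The first problem, as you noted, is that the data is not valid JSON; it is the printed form of a Python dictionary, for example: {u'indices':. JSON has no u'' prefix or single-quoted strings, and it uses null, true, and false rather than None, True, and False.
The second problem: why use mongoimport at all? In Python you can save the dictionary straight to the database. This is basically the first example of how to use MongoDB from Python.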
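>>> import tweetstream
>>> from pymongo import MongoClient  # newer PyMongo releases use MongoClient instead of Connection
>>> client = MongoClient('localhost', 27017)
>>> db = client.test_database
>>> posts = db.posts
>>> stream = tweetstream.SampleStream("username", "password")
>>> for tweet in stream:
...     posts.insert_one(tweet)  # store each tweet dictionary as it arrives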
The following code works. I had some issues with ast at first, but they went away once I updated Python on my system. This script reads a file line by line, where each line is a Python dictionary literal, and prints valid JSON for import into MongoDB.
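import ast
import json

# Each line of data.txt holds the printed form of one Python dictionary
with open('data.txt', 'r') as mydict:
    for line in mydict:
        record = ast.literal_eval(line)  # safely parse the dict literal
        print(json.dumps(record))  # emit one JSON document per line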
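If some lines in the file turn out to be blank or truncated, a slightly more defensive variant may be useful. The sketch below is one possible shape, not the original author's script: the function name dict_lines_to_json and the sample lines are assumptions made up for illustration. It skips anything that ast.literal_eval cannot parse instead of stopping at the first bad line.

```python
import ast
import json


def dict_lines_to_json(lines):
    """Yield one JSON string per line that holds a Python dict literal.

    Blank lines and lines that cannot be parsed are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            # literal_eval understands u'...' strings, None, True and False
            record = ast.literal_eval(line)
        except (ValueError, SyntaxError):
            continue  # not a valid Python literal, so skip it
        yield json.dumps(record)


# Hypothetical, shortened sample in the same printed-dict style as the tweets.
sample = [
    "{u'user': {u'id': 46478005, u'verified': False}, u'place': None}",
    "",
    "this line is not a dict",
]
for json_line in dict_lines_to_json(sample):
    print(json_line)
```

Writing the yielded strings to a file, one per line, should give input that mongoimport can read, since by default it expects one JSON document per line.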