
I have a problem decoding JSON data.

Here's my data.

20110312010116730|{"place":{"country_code":"US","url":"http:\/\/api.twitter.com\/1\/geo\/id\/9fbe124c83c364fe.json","bounding_box":{"type":"Polygon","coordinates":[[[-78.894441,35.03811699],[-78.85501596,35.03811699],[-78.85501596,35.08142904],[-78.894441,35.08142904]]]},"place_type":"neighborhood","name":"Downtown Fayetteville","country":"United States","attributes":{},"id":"9fbe124c83c364fe","full_name":"Downtown Fayetteville, Fayetteville"},"user":{"is_translator":false,"listed_count":9,"statuses_count":3695,"profile_link_color":"9ede14","url":"http:\/\/www.facebook.com\/nicholasd.whitehead","following":null,"verified":false,"profile_sidebar_border_color":"a7ed11","contributors_enabled":false,"profile_use_background_image":true,"friends_count":354,"profile_background_color":"131516","description":" #TEAMDROID #TAURUS #TEAMRATCHET #TEAMFITTEDS !!!! \u2752Single \u2752Taken \u2714SLiCK","profile_background_image_url":"http:\/\/a2.twimg.com\/profile_background_images\/213719493\/lime_green_logo.jpg","created_at":"Thu Jun 18 21:07:16 +0000 2009","protected":false,"profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/1263451862\/my_shirt_off_normal.jpg","follow_request_sent":null,"time_zone":"Eastern Time (US & Canada)","favourites_count":3,"profile_text_color":"b6e82c","location":"from the 252 to the 910","name":"\u015bl\u00ef\u00e7k \u0148\u00ef\u00e7k","show_all_inline_media":false,"geo_enabled":true,"notifications":null,"profile_sidebar_fill_color":"080808","screen_name":"infamous_SLiCK","id":48490066,"id_str":"48490066","lang":"en","profile_background_tile":true,"utc_offset":-18000,"followers_count":224},"coordinates":{"type":"Point","coordinates":[-78.883968,35.052185]},"text":"i dont even know who Sam & Ronnie is !!","in_reply_to_status_id":null,"truncated":false,"source":"\u003Ca href=\"http:\/\/twidroyd.com\" rel=\"nofollow\"\u003Etwidroyd\u003C\/a\u003E","favorited":false,"in_reply_to_screen_name":null,"in_reply_to_user_id":null,"created_at":"Sat Mar 12 06:01:16 +0000 2011","in_reply_to_status_id_str":null,"geo":{"type":"Point","coordinates":[35.052185,-78.883968]},"contributors":null,"retweeted":false,"id":46450665555378176,"in_reply_to_user_id_str":null,"id_str":"46450665555378176","entities":{"urls":[],"user_mentions":[],"hashtags":[]},"retweet_count":0}

I have more than 200 GB of text data like this.

Here's my code.

import json

tweets_data = []
tweets_file = open(tweets_data_path, "r").readlines()
for i, line in enumerate(tweets_file):
    if i % 2 == 0:
        temp = line.split('|')
        tweet = json.loads(temp[1])
        #tweets_data.append(tweet)

Here's my question. I tried to decode the lines, but it fails. At first I thought the number at the very beginning of each line was causing the error, so I tried to separate the number from the JSON data. But it still doesn't work, because something unexpected keeps appearing in my list. Like this:

['20110312015935803', '{"place":{"country_code":"US","url":"http:\\/\\/api.twitter.com\\/1\\/geo\\/id\\/a1f2dacd80a51287.json","bounding_box":{"type":"Polygon","coordinates":[[[-73.002796,42.990631],[-72.866051,42.990631],[-72.866051,43.119106],[-73.002796,43.119106]]]},"place_type":"city","name":"Stratton","country":"United States","attributes":{},"id":"a1f2dacd80a51287","full_name":"Stratton, VT"},"user":{"follow_request_sent":null,"show_all_inline_media":false,"geo_enabled":true,"profile_link_color":"546080","url":"http:\\/\\/www.facebook.com\\/br.vivizanatta","following":null,"verified":false,"profile_sidebar_border_color":"bcc7e3","is_translator":false,"listed_count":0,"statuses_count":330,"profile_use_background_image":true,"profile_background_color":"2d313f","description":"Stay up to date with news, photos, videos, blog, bio and more from the brazilian journalist and photographer Vivian Zanatta.","contributors_enabled":false,"profile_background_image_url":"http:\\/\\/a2.twimg.com\\/profile_background_images\\/211883639\\/aspen1_829_49514.jpg","created_at":"Sat Jul 10 15:05:48 +0000 2010","friends_count":79,"protected":false,"profile_image_url":"http:\\/\\/a2.twimg.com\\/profile_images\\/1259071695\\/VIVI_DDH7912_normal.jpg","time_zone":"Eastern Time (US & Canada)","favourites_count":0,"profile_text_color":"537de6","location":"Washington, DC, USA","name":"Vivi Zanatta \\u2714","notifications":null,"profile_sidebar_fill_color":"191e2a","screen_name":"vivizanatta_","id":165082798,"id_str":"165082798","lang":"en","profile_background_tile":false,"utc_offset":-18000,"followers_count":83},"coordinates":{"type":"Point","coordinates":[-72.9053683,43.1134486]},"text":"I\'m at Stratton Mountain Ski Resort (5 Village Lodge Rd, Stratton Mountain) http:\\/\\/4sq.com\\/i3ULvp","in_reply_to_status_id":null,"truncated":false,"source":"\\u003Ca href=\\"http:\\/\\/foursquare.com\\" rel=\\"nofollow\\"\\u003Efoursquare\\u003C\\/a\\u003E","favorited":false,"in_reply_to_screen_name":null,"in_reply_to_user_id":null,"created_at":"Sat Mar 12 06:59:35 +0000 2011","in_reply_to_status_id_str":null,"geo":{"type":"Point","coordinates":[43.1134486,-72.9053683]},"contributors":null,"retweeted":false,"id":46465342800797698,"in_reply_to_user_id_str":null,"id_str":"46465342800797698","entities":{"hashtags":[],"urls":[{"indices":[76,97],"url":"http:\\/\\/4sq.com\\/i3ULvp","expanded_url":null}],"user_mentions":[]},"retweet_count":0}\n']
['\n']

Suddenly ['\n'] appears. I guess that's because the records are separated by two '\n' characters. Anyway, when I use partition,

('20110312015935977', '|', '{"place":{"country_code":"US","url":"http:\\/\\/api.twitter.com\\/1\\/geo\\/id\\/b8b87894eb3d7849.json","bounding_box":{"type":"Polygon","coordinates":[[[-95.542521,29.670631],[-95.492419,29.670631],[-95.492419,29.694855],[-95.542521,29.694855]]]},"place_type":"neighborhood","name":"Braeburn","country":"United States","attributes":{},"id":"b8b87894eb3d7849","full_name":"Braeburn, Houston"},"user":{"profile_link_color":"ed0909","url":null,"following":null,"verified":false,"profile_sidebar_border_color":"f00505","follow_request_sent":null,"show_all_inline_media":true,"geo_enabled":true,"profile_use_background_image":true,"profile_background_color":"61b8c2","description":"#TeamPlaystation #TeamLRG #TeamAquarius and #PvNation .It bring me great pleasure to welcome the real and banish the Fake...","is_translator":false,"profile_background_image_url":"http:\\/\\/a2.twimg.com\\/profile_background_images\\/179334599\\/screwston7jsredc.jpg","listed_count":0,"statuses_count":163,"created_at":"Wed Dec 08 04:04:16 +0000 2010","protected":false,"profile_image_url":"http:\\/\\/a0.twimg.com\\/profile_images\\/1256895503\\/image_normal.jpg","time_zone":"Central America","favourites_count":2,"profile_text_color":"fa0505","location":"Houston, Tx","name":"Craig Irving","contributors_enabled":false,"notifications":null,"profile_sidebar_fill_color":"020303","screen_name":"xxMinion","id":224098461,"id_str":"224098461","lang":"en","profile_background_tile":true,"utc_offset":-21600,"friends_count":36,"followers_count":35},"coordinates":null,"text":"If your White or Mexican #WhoSaidItWasOk to say \\"whats up my nigga\\" and then call your homeboys the word Nigga lol","in_reply_to_status_id":null,"truncated":false,"source":"web","favorited":false,"in_reply_to_screen_name":null,"in_reply_to_user_id":null,"created_at":"Sat Mar 12 06:59:35 +0000 2011","in_reply_to_status_id_str":null,"geo":null,"contributors":null,"retweeted":false,"id":46465343463505920,"in_reply_to_user_id_str":null,"id_str":"46465343463505920","entities":{"urls":[],"user_mentions":[],"hashtags":[{"indices":[25,40],"text":"WhoSaidItWasOk"}]},"retweet_count":0}\n')
('\n', '', '')

that trailing ('\n', '', '') tuple appears as well.
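(Apparently that's just what partition returns when there is no '|' in the line at all, e.g. for a blank line:)

'\n'.partition('|')   # ('\n', '', '') -- no separator found, so the last two parts are empty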

Oh, and my data is stored in gz format. How can I read it in Python without decompressing it first?

  • Ohhh, the error says ValueError: Unterminated string starting at: line 1 column 664 (char 663). Sorry, it was 11 pm in Korea and I was so tired that I forgot to include it. Commented Mar 2, 2017 at 13:59

1 Answer


If there are | characters in your data, split splits too many times and the JSON string gets truncated.

You can use the maxsplit parameter:

temp = line.split('|',1)

or partition:

temp = line.partition('|')

(use temp[2] in that case, because the separator is also returned)
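For example, on a shortened, made-up line in the same id|json shape (the | inside the JSON is what breaks the plain split):

line = '20110312010116730|{"text":"a|b","id":1}\n'   # hypothetical sample line
line.split('|')      # ['20110312010116730', '{"text":"a', 'b","id":1}\n']  -> JSON truncated
line.split('|', 1)   # ['20110312010116730', '{"text":"a|b","id":1}\n']     -> JSON intact
line.partition('|')  # ('20110312010116730', '|', '{"text":"a|b","id":1}\n')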

If you have further problems, consider adding a try/except block for each line so you can narrow down the problem.

EDIT: also added protection against blank lines as a follow-up to your edit.

tweets_file = open(tweets_data_path, "r")
for i, line in enumerate(tweets_file):
    if i % 2 == 0:
        try:
            data = line.partition('|')[2]
            if data:
                tweet = json.loads(data)
        except ValueError as e:
            print("Cannot parse '{}'".format(data))
            print("Error line {}: {}".format(i + 1, str(e)))

18 Comments

The problem is, when I separate them, something else pops up every time. I split on '|', and then two lists keep appearing: one with the number and the JSON, and one with just "\n". These two lists appear again and again. When I strip the '\n', suddenly " '' " appears instead. It's driving me crazy.
@YooInhyeok Don't panic, isolate the issue first.
@YooInhyeok Check my edit; it should help make your code more robust and isolate the problematic line(s).
Hi. It's morning in Korea and I'm calm now. The problem is clear; I've shown the result above.
Just skip the lines that partition to "", that's it. And a quick reply to your gz question (since it's a separate issue): look up gzip.open, which does exactly what you want. There are lots of examples here on SO.
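A minimal sketch of that suggestion, assuming the same id|json layout inside the .gz file (gzip.open decompresses on the fly, so the 200 GB file never has to be unpacked to disk):

import gzip
import json

# 'rt' yields decoded text lines on Python 3; on Python 2, plain gzip.open(path) also gives str lines
with gzip.open(tweets_data_path, 'rt') as tweets_file:
    for i, line in enumerate(tweets_file):
        data = line.partition('|')[2]
        if not data:                       # blank line or no '|' separator: skip it
            continue
        try:
            tweet = json.loads(data)
        except ValueError as e:
            print("Error line {}: {}".format(i + 1, e))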
