The other day I had to fix a file containing JSON that had been encoded as JSON (JSON nested in JSON). It looked something like this:
"'{\"animal_list\": [{\"type\": \"mammal\", \"description\": \"Tall, with brown spots, lives in Savanna\", \"name\": \"Giraffa camelopardalis\"},{\"type\": \"mammal\", \"description\": \"Big, grey, with big ears, smart\", \"name\": \"Loxodonta africana\"},{\"type\": \"reptile\", \"description\": \"Green, changes color, lives in \"East Africa\"\", \"name\": \"Trioceros jacksonii\"}]}'"
It was all on one line and considerably big: around 1.5 million entries once properly formatted.
Quick’n’Dirty
The quick and dirty way to decode this would be to:
- Put a newline between entries (for readability)
- Remove the quotation marks (" and ') from the beginning and the end
- Un-escape the quotation marks
    with open('animals.json', 'r') as jsonf, open('animals_editted.json', 'w') as newfile:
        for line in jsonf:  # there will be only one line
            # put each entry on its own line
            newline = line.replace('},{', '},\n{')
            # drop a trailing newline first, then the outer " and ' quotes
            newline = newline.strip().strip('"').strip("'")
            # un-escape the quotation marks
            newline = newline.replace('\\"', '"')
            print(newline)
            newfile.write(newline)
Now the JSON looks better but there is a problem:
{"animal_list": [{"type": "mammal", "description": "Tall, with brown spots, lives in Savanna", "name": "Giraffa camelopardalis"}, {"type": "mammal", "description": "Big, grey, with big ears, smart", "name": "Loxodonta africana"}, {"type": "reptile", "description": "Green, changes color, lives in "East Africa"", "name": "Trioceros jacksonii"}]}
By un-escaping all the quotation marks, those that need to stay escaped inside the description field were un-escaped as well. Now the challenge is to fix these description fields.
One way is to take each line, find "description" in it and split the line into three parts: the beginning of the line, the description value, and the end of the line. Once we have the description value, we can escape the quotation marks in it and glue the parts back together:
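To make the problem concrete, here is a minimal sketch that feeds one entry to `json.loads`, once with the inner quotes un-escaped and once with them still escaped. The `is_valid_json` helper is hypothetical and only there for illustration; it is not part of the original scripts.

```python
import json


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


# Inner quotes un-escaped, as in the file produced by the first script
broken = '{"description": "lives in "East Africa"", "name": "Trioceros jacksonii"}'
# Inner quotes still escaped, which is what the JSON parser needs
fixed = '{"description": "lives in \\"East Africa\\"", "name": "Trioceros jacksonii"}'

print(is_valid_json(broken))  # False
print(is_valid_json(fixed))   # True
```

The parser treats the quote in front of East as the end of the description string, so everything after it is a syntax error.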
    with open('animals_editted.json', 'r') as jsonf, open('animals_cleaned.json', 'w') as newfile:
        for line in jsonf:
            # split off everything before the description value
            start, end = line.split('"description": "', 1)
            # the value ends at the separator before the next field
            descr_val, end = end.split('", "', 1)
            # escape the quotes
            descr_val = descr_val.replace('"', '\\"')
            descr = '"description": "' + descr_val + '", "'
            newline = start + descr + end
            newfile.write(newline)
Now the JSON loads successfully:
    import json

    with open('animals_cleaned.json', 'r') as nf:
        json_object = json.load(nf)

    print(len(json_object['animal_list']))
    #>> 3
To sum up, this was a brute-force way of decoding JSON encoded as JSON. The solution took me about 10 minutes and was good enough to fix the problem. Alternatively, one could use a json.loads() hook, as explained in the “encode nested JSON in JSON” answer.
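For comparison, when the nesting is well-formed (the inner document was serialized with `json.dumps` and nothing else was wrapped around it), calling `json.loads` twice should be enough. The sketch below uses made-up sample data and assumes Python 3. It would not work on the file from this post as-is, because of the extra single quotes and the inner quotes that were escaped only once.

```python
import json

# Hypothetical sample: a document serialized twice with json.dumps
inner = {"animal_list": [{"type": "reptile",
                          "description": 'lives in "East Africa"',
                          "name": "Trioceros jacksonii"}]}
double_encoded = json.dumps(json.dumps(inner))

# First pass yields the inner JSON text, second pass yields the object
as_text = json.loads(double_encoded)
decoded = json.loads(as_text)

print(type(as_text).__name__)       # str
print(len(decoded['animal_list']))  # 1
```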
See the full code below or on GitHub: