The situation
I get a file with multiple JSON objects in a row, like the following:
{"a": 42, "b": "something"}
{"a": 43, "b": "something else"}
{"a": 44, "b": "a third thing"}
Just like in this example, but there are 100 objects per file. Also, these objects are a bit larger - we have files of 200 MB and more.
The target
I want to extract those JSON objects one after another with Python. At the same time, I do not want to load the full file into memory, because it is really large.
My approach
From a high-level perspective, my algorithm does the following: it keeps only a part of the document in memory and tries to parse a JSON object from the beginning of this in-memory part. As long as this is not successful, it continues reading the file and grows the in-memory part. Whenever it is able to parse a JSON object, it drops the beginning of the in-memory document up to the end of the just-parsed object.
A bit more algorithmic:
- Define the in-memory document part and initialize it as the empty string.
- Open the document and start at the beginning.
- While I have not read the full document:
  - Read the next part of the document and append it to the in-memory document.
  - While a JSON object can be parsed from the beginning of the in-memory document:
    - Yield this object.
    - Shorten the in-memory document by dropping exactly this JSON object.
My Python algorithm
import json
from json import JSONDecoder


def decode_stacked_json(
    fin,                     # file object opened in text mode
    read_batch_size=10_000,  # batch size (characters) to read at a time
):
    decoder = JSONDecoder()
    document = ""
    complete_file_was_read = False
    while not complete_file_was_read:
        # Read a new part of the file
        new_part = fin.read(read_batch_size)
        # Check if this was the end of the file
        if len(new_part) < read_batch_size:
            complete_file_was_read = True
        document += new_part
        while True:
            # raw_decode() does not accept leading whitespace, so drop
            # the newline that separates two stacked objects first.
            document = document.lstrip()
            try:
                # Parse a JSON object from the start of the document
                obj, end_position = decoder.raw_decode(document)
            except json.JSONDecodeError:
                if complete_file_was_read and not document:
                    # The full document was read and the last object
                    # was parsed. This is the stop condition.
                    return
                elif complete_file_was_read:
                    # Error handling - there is an unparsable JSON
                    # object.
                    raise
                else:
                    # I am not yet finished reading the document.
                    # Just read the next part and try again.
                    # Break the inner "while True" loop, continue
                    # in the outer loop.
                    break
            else:
                # I was able to parse a JSON object. Shorten the
                # document and yield the object.
                document = document[end_position:]
                yield obj
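Used like this, the generator yields one object at a time without ever holding the full file in memory. A small usage sketch - the file name data.jsonl is just a placeholder:

# Usage sketch - "data.jsonl" is just a placeholder file name.
with open("data.jsonl") as fin:
    for obj in decode_stacked_json(fin):
        print(obj["a"], obj["b"])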
Challenges on the way
Where is the end of the file?
The fin object can be thought of as a cursor that moves through the file. With this approach, however, I am not really able to tell when I have reached the end of the file. Moreover, as soon as I have reached the end of the file, fin.read(batch_size) just returns an empty string - not even an exception or anything like that.
This behavior is useful if you want to follow updates in a log file, for example, but it is not really helpful here, since I know the file is static and will not change.
My final approach to find the end of the file was quite simple: I call fin.read(batch_size). If the returned string is shorter than requested (that is, if len(returned_string) < batch_size), I have reached the end of the file.
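For completeness, the alternative would have been to rely on exactly that empty string. A minimal sketch of this variant, again with data.jsonl as a placeholder file name:

# Alternative EOF check: read() returns an empty string once the end
# of the file is reached ("data.jsonl" is just a placeholder).
with open("data.jsonl") as fin:
    while True:
        chunk = fin.read(10_000)
        if not chunk:
            break  # end of the file reached
        print(f"read {len(chunk)} characters")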
Optimization: How large should a reading batch be?
Obviously, the best approach in terms of memory usage would be to read one character at a time. As a consequence, however, the JSONDecoder.raw_decode method gets called very often. How does that impact performance?
I did a quick test with batch sizes of 1_000, 10_000 and 100_000. The results were more or less the same. I am not sure whether that is down to my quite powerful developer machine or whether it also holds on other machines, but I will leave the batch size at a rather low value because it does not seem to impact performance much.
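For reference, such a comparison can be timed with a rough sketch along the following lines; the file name and the exact batch sizes are just examples:

import time

# Rough timing sketch ("data.jsonl" is a placeholder file name).
for batch_size in (1_000, 10_000, 100_000):
    with open("data.jsonl") as fin:
        start = time.perf_counter()
        count = sum(1 for _ in decode_stacked_json(fin, read_batch_size=batch_size))
        duration = time.perf_counter() - start
    print(f"batch size {batch_size}: {count} objects in {duration:.3f} s")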
How to handle corrupt files?
Assume the following problem: I have such a file, and some object in the middle is corrupt because the generating program failed in some way. How can I react to that?
My algorithm currently just keeps reading the file up to the end, which brings back the same problem as before: it fills up the memory and then either fails with an unreadable file (which I can handle) or with a memory error again (which is exactly what this algorithm was supposed to avoid)…
There may be a more specific JSON error to handle, but I do not have a solution to this problem yet. Also, there is the principle of “You aren’t gonna need it”, which says that you should implement something once you actually need it and not earlier.
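For completeness, this is how a corrupt file currently surfaces on the caller's side: the generator keeps reading until the end and then raises json.JSONDecodeError because the remainder cannot be parsed. A small sketch, with data.jsonl again being a placeholder:

import json

# Current behavior with a corrupt file: the generator reads to the end
# and then raises json.JSONDecodeError ("data.jsonl" is a placeholder).
with open("data.jsonl") as fin:
    try:
        for obj in decode_stacked_json(fin):
            print(obj["a"])
    except json.JSONDecodeError as err:
        print(f"Corrupt file: {err}")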