Burned Marshmallows

The Kindling πŸ‚

As part of enforcing a strong data contract between our interval data API and the library that consumes and creates data (via GET/POST to that API), we use schemas implemented in Marshmallow.

Marshmallow allows declarative schemas in Python, so your schema looks something like this:

from marshmallow import Schema, fields, post_load
from marshmallow.validate import OneOf
import simplejson


class Reading(object):
    def __init__(self, interval_start=None, interval_end=None, measurement=None, unit=None):
        self.interval_start = interval_start
        self.interval_end = interval_end
        self.measurement = measurement
        self.unit = unit

class ReadingSchema(Schema):
    class Meta:
        json_module = simplejson
        dateformat = 'iso'
    measurement = fields.Number(required=True)
    interval_start = fields.LocalDateTime(required=True)
    interval_end = fields.LocalDateTime(required=True)
    unit = fields.Str(default='kw', validate=OneOf(['kw', 'kwh']), missing='kw')

    @post_load
    def make_object(self, data):
        return Reading(**data)

This tells Marshmallow that a β€œReading” object should have:

* A numerical measurement
* A timestamp representing the start of the Reading
* A timestamp representing the end of the Reading
* A unit that must be one of β€œkw” or β€œkwh”, with a default of β€œkw”

Then, Marshmallow gives us some hooks for turning JSON or Python objects into instances of a class via the Schema, thus:

blob = """{"interval_start": "2018-03-01T03:15:00+00:00", "interval_end": "2018-03-01T03:30:00+00:00", "unit": "kw", "measurement": 34.0}"""

reading, errors = ReadingSchema().loads(blob)
print reading.measurement
json, errors = ReadingSchema().dumps(reading)
assert json == blob
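
That (data, errors) tuple is also how the contract gets enforced: feed the schema a value it doesn't allow and the errors dict fills in. The exact message below is Marshmallow 2's default for OneOf, so treat it as illustrative:

bad_blob = """{"interval_start": "2018-03-01T03:15:00+00:00", "interval_end": "2018-03-01T03:30:00+00:00", "unit": "mw", "measurement": 34.0}"""

reading, errors = ReadingSchema().loads(bad_blob)
print errors  # e.g. {'unit': ['Not a valid choice.']}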

This is all well and good.

Now, we use a container full of Readings called a ReadingList; the gist is that a ReadingList contains a β€œnested” list of Readings, as follows:

class ReadingListSchema(Schema):
    meter_id = fields.Str(required=True)
    resolution = fields.Str(default='15min',
                            validate=OneOf(['15min', '1s', '1min', '5min', '1hr']))
    unit = fields.Str(default='kw', validate=OneOf(['kw', 'kwh']), missing='kw')
    interval_start = fields.LocalDateTime()
    interval_end = fields.LocalDateTime()
    readings = fields.Nested(ReadingSchema, many=True)

At a 15-minute interval, there are 96 datapoints in a 24-hour period, so 90 days of data comes out to 8640 Readings.

We assumed 90 days would be a reasonable length of time to work with.
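
For anyone following along, the 90-day test file used below can be approximated with a quick generator. This is a hypothetical stand-in, not the original fixture; the meter_id and the flat measurement are made up:

import datetime
import simplejson

start = datetime.datetime(2018, 3, 1)
readings = []
for i in range(96 * 90):  # 96 fifteen-minute intervals per day, 90 days = 8640
    interval_start = start + datetime.timedelta(minutes=15 * i)
    readings.append({
        "interval_start": interval_start.isoformat(),
        "interval_end": (interval_start + datetime.timedelta(minutes=15)).isoformat(),
        "unit": "kw",
        "measurement": 34.0,
    })

with open("/tmp/data.json", "w") as f:
    f.write(simplejson.dumps({
        "meter_id": "meter-1",
        "resolution": "15min",
        "unit": "kw",
        "readings": readings,
    }))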

The Campfire πŸ•οΈ

Here's where the story gets strange.. when using this client library to access that API, we were getting really lengthy request times. And, stranger, a bunch of that time was spent locally (in User/CPU) πŸ€”

With the schemas as written above, here's loading a static (on-disk) JSON file that contains 90 days of data.

blob = open("/tmp/data.json", "r").read()
%time data = ReadingListSchema().loads(blob)

CPU times: user 4.28 s, sys: 68 ms, total: 4.35 s
Wall time: 4.34 s

Thaaaaat doesn't seem right.

The first thing I tried was changing ReadingListSchema's readings field to a plain list of dicts instead of a nested schema…

class ReadingListSchema(Schema):
    meter_id = fields.Str(required=True)
    resolution = fields.Str(default='15min',
                            validate=OneOf(['15min', '1s', '1min', '5min', '1hr']))
    unit = fields.Str(default='kw', validate=OneOf(['kw', 'kwh']), missing='kw')
    interval_start = fields.LocalDateTime()
    interval_end = fields.LocalDateTime()
    readings = fields.List(fields.Dict)

Let's try that again..

blob = open("/tmp/data.json", "r").read()
%time data = ReadingListSchema().loads(blob)

CPU times: user 52 ms, sys: 24 ms, total: 76 ms
Wall time: 47.9 ms

Ok. So.. it's not in the creation of Python objects, or reading the JSON. It's something about validation, specifically.

We couldn't really leave it this way β€” not only because we weren't validating 99% of the data's values against the schema, but because the contents of the ReadingList.readings weren't nice friendly Python objects. (There's additional usability / livability features in these classes not duplicated here for simplicity)

What happens when ReadingSchema declares just one field instead of all four?

class ReadingSchema(Schema):
    measurement = fields.Number(required=True, allow_none=True)

blob = open("/tmp/data.json", "r").read()
%time data = ReadingListSchema().loads(blob)

CPU times: user 108 ms, sys: 8 ms, total: 116 ms
Wall time: 110 ms

That seems reasonable… very reasonable. What if we pick something else as that one field?

class ReadingSchema(Schema):
    interval_start = fields.LocalDateTime(required=True)

CPU times: user 2.05 s, sys: 40 ms, total: 2.09 s
Wall time: 2.08 s

😱😱😱


The Burn πŸ€•

Well, we know where the culprit is now, at least. If you crack open Marshmallow's source code, you'll find this little nugget hidden inside:

def from_iso_datetime(datestring, use_dateutil=True):
    """Parse an ISO8601-formatted datetime string and return a datetime object.

    Use dateutil's parser if possible and return a timezone-aware datetime.
    """
    if not _iso8601_re.match(datestring):
        raise ValueError('Not a valid ISO8601-formatted datetime string')
    # Use dateutil's parser if possible
    if dateutil_available and use_dateutil:
        return parser.parse(datestring)
    else:
        # Strip off timezone info.
        return datetime.datetime.strptime(datestring[:19], '%Y-%m-%dT%H:%M:%S')

And, if you were wondering, the _iso8601_re in question is this:

# From Django
_iso8601_re = re.compile(
    r'(?P<year>\d{4})-(?P<month>\d{1,2})-(?P<day>\d{1,2})'
    r'[T ](?P<hour>\d{1,2}):(?P<minute>\d{1,2})'
    r'(?::(?P<second>\d{1,2})(?:\.(?P<microsecond>\d{1,6})\d{0,6})?)?'
    r'(?P<tzinfo>Z|[+-]\d{2}(?::?\d{2})?)?$',
)
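
The named groups matter in a minute, so for reference, here's what the regex captures for one of our timestamps (dict ordering will vary):

>>> _iso8601_re.match("2018-03-01T03:15:00+00:00").groupdict()
{'year': '2018', 'month': '03', 'day': '01', 'hour': '03',
 'minute': '15', 'second': '00', 'microsecond': None, 'tzinfo': '+00:00'}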

It really is (surprisingly?) dateutil.parser.parse that gums up the works, though.
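
A rough microbenchmark bears that out; the numbers will vary by machine, but dateutil's parser loses to a plain strptime by a wide margin. Here's one way to compare the two branches of from_iso_datetime:

import timeit

setup = "import datetime; from dateutil import parser; s = '2018-03-01T03:15:00+00:00'"
print timeit.timeit("parser.parse(s)", setup=setup, number=10000)
print timeit.timeit("datetime.datetime.strptime(s[:19], '%Y-%m-%dT%H:%M:%S')", setup=setup, number=10000)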

We can pretty easily (thanks for this design feature!) cook up a custom field type that swaps in the behavior we'd prefer under the hood, and it looks like this:

import datetime

from marshmallow import Schema, fields, ValidationError

# _iso8601_re is the regex quoted above, borrowed from Marshmallow's source


class SpeedyDateField(fields.Field):
    def _serialize(self, value, attr, obj):
        try:
            return value.isoformat()
        except Exception as exc:
            raise ValidationError(exc.message, field_names=attr)

    def _deserialize(self, value, attr, data):
        """Takes an ISO8601 string and deserializes it into a naive
        datetime instance.
        """
        try:
            matches = _iso8601_re.search(value)
            if not matches:
                raise ValueError("Unable to match string - is it ISO8601?")
            groupdict = matches.groupdict()
            del groupdict['tzinfo']
            # I recognize that this is tossing out the timezone string.
            # This is an irritating problem to solve in Python 2.7.x
            for key in groupdict.keys():
                if groupdict[key]:
                    groupdict[key] = int(groupdict[key])
                else:
                    del groupdict[key]
            return datetime.datetime(**groupdict)
        except ValueError as exc:
            raise ValidationError(exc.message, field_names=attr)

class ReadingSchema(Schema):
    measurement = fields.Number(required=True)
    interval_start = SpeedyDateField(required=True)
    interval_end = SpeedyDateField(required=True)

CPU times: user 1.12 s, sys: 8 ms, total: 1.13 s
Wall time: 1.13 s
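
A quick sanity check that the trimmed-down schema still behaves (with no post_load here, loads hands back a dict; note the dropped offset):

blob = """{"interval_start": "2018-03-01T03:15:00+00:00", "interval_end": "2018-03-01T03:30:00+00:00", "measurement": 34.0}"""

data, errors = ReadingSchema().loads(blob)
print data['interval_start']  # 2018-03-01 03:15:00 -- offset gone, as noted above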

This isn't perfect, but it runs in about a quarter of the time we started with. My actual end solution (including timezone parsing) comes in at a similar duration, using a custom Marshmallow Field class that parses with Pendulum.
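
That final field isn't reproduced here, but a sketch of what a Pendulum-backed version might look like (assuming pendulum is installed; this is illustrative, not the production code):

import pendulum
from marshmallow import fields, ValidationError

class PendulumDateField(fields.Field):
    def _serialize(self, value, attr, obj):
        return value.isoformat()

    def _deserialize(self, value, attr, data):
        try:
            # pendulum.parse handles ISO8601 and, unlike SpeedyDateField,
            # keeps the timezone offset
            return pendulum.parse(value)
        except Exception as exc:
            raise ValidationError(str(exc), field_names=attr)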