Kogan.com Dev Blog

Alec McGavin

Django Admin and Deleting Related Objects

What's wrong with the UX in this image?

What's wrong with this image?

There are 2 X buttons which do completely different things. Reading left to right, the first one will set the warehouse to null, and the second one will delete “Australia” from the database without a warning!

We set out to fix this as we decided that we don't want to either:

  • Delete a record without a warning
  • Delete a related object from another admin screen

Removing this element on individual widgets can be achieved with setting can_delete_related=False, but how can it be done across the board?

Django templates are very extensible and customisable. We can force can_delete_related=False at render by extending the related_widget_wrapper template:

{% extends "admin/widgets/related_widget_wrapper.html" %}
{% block links %}
   {% with can_delete_related=False %}
       {{ block.super }}
   {% endwith %}
{% endblock %}

Now the 2nd X is gone and our users won't be confused!

How We Took A Table Offline While Maintaining Uptime

Recently a bug found its way into production which caused excessive updates to one of our critical Postgres tables. While we identified and squashed the bug quickly, the damage was already done: about 160% of the live tuples were dead. At a critical threshold of dead tuples Postgres' query planner will start to ignore indices as it thinks there are much more rows than there are. In this degraded state, queries that would normally complete in sub-second time were now taking multiple seconds, even minutes to complete.

Usually dead tuples aren’t much of an issue as they’re vacuumed away regularly by the auto vacuum process. Auto vacuum works well because it only acquires a SHARE UPDATE EXCLUSIVE lock, meaning selects, updates, and inserts can still occur. We calculated from the currently running vacuum’s progress that the auto vacuum was going to take more than 30 days to complete. With Black Friday sales around the corner we needed another solution. A full vacuum was calculated to take roughly an hour and would reclaim a lot more space, however this would acquire an ACCESS EXCLUSIVE lock, blocking all access to the table for the entire duration: not possible without taking the entire site down!

Or is it? If we can get a full vacuum to work, the database would recover in less than an hour!

We analysed the usage of this table. We determined that, so long as there were no writes, removing all relations (and replacing with null / no rows) would be acceptable in the short term, as the values were already denormalised into our elasticsearch database. Could we do this with minimal code changes?

Our Django application uses select_related/prefetch_related heavily so we would need to create an empty table with the same signature to allow for these queries to work without exceptions. Here’s what we came up with:

BEGIN;
-- we want to be able to rename fast later, so lock the table now
LOCK TABLE affected_table IN ACCESS EXCLUSIVE MODE;

-- create a table with the same columns as affected_table.
-- Foreign keys and indices are not required as
-- there’ll (theoretically) be no data inserted
CREATE TABLE "affected_table_empty"
(
    "id"              serial                   NOT NULL PRIMARY KEY,
    "date"            timestamp with time zone NOT NULL,
    "enabled"         boolean                  NOT NULL,
    "data"            varchar(128)             NOT NULL,
    "product_id"      integer                  NOT NULL,
    "mode_id"         integer                  NOT NULL
);
-- swap the tables around. Table renaming is a fast operation
ALTER TABLE "affected_table" RENAME TO "affected_table_broken";
ALTER TABLE "affected_table_empty" RENAME TO "affected_table";
COMMIT;

We had some foreign keys linking to this table so each of those constraints must be dropped:

-- Another fast operation
ALTER TABLE "table_1" DROP CONSTRAINT "tbl1_fk_affected_table_id";

At this point we’ll have:

  • No references to the affected table (now renamed to broken) in the database
  • An empty table for our code to look at

Testing this out revealed a problem. If a table with a foreign key had a value set, Django would crash with DoesNotExist as it assumes there would be a value.

>>> obj = Table1.objects.get(pk=1)
>>> obj.affected_table_id
<<< 156
>>> obj.affected_table
<<<  DoesNotExist: AffectedTable matching query does not exist

As mentioned earlier, we’re heavy users of select_related so we couldn’t just mock the columns to be None. Here’s what we came up with to work around it:

class DisablingForwardOneToOneDescriptor(ForwardOneToOneDescriptor):
    def __get__(self, instance, cls=None):
        if waffle.switch_is_active(f"disable_relation_{self.field}"):
            return None
        return super().__get__(instance, cls=cls)


class DisablingOneToOneField(models.OneToOneField):
    forward_related_accessor_class = DisablingForwardOneToOneDescriptor

We use django-waffle to manage feature flags which makes it easier to turn off relations through this new class.

The above class could be used in place of a OneToOne relation (but could easily be extended to a normal ForeignKey) as follows:

class Table(models.Model):
    affected_table = DisablingOneToOneField("AffectedTable", null=True, blank=True)

Now when we try again it will just return None:

>>> obj = Table1.objects.get(pk=1)
>>> obj.affected_table_id
<<< 156
>>> obj.affected_table
<<< None

Calling obj.save() is also safe, it will preserve the existing ID.

After some vigorous testing we were confident and ready to roll in production.

The steps that needed to occur were:

  1. Scale down Celery workers to prevent any writes to the empty table
  2. Disable the relation via the feature switch
  3. Create an empty version of the table and swap it with the affected table
  4. Drop all foreign key constraints to the affected table
  5. Vacuum the affected table
  6. Swap the affected table back with the empty table
  7. Insert any data that managed to find its way into the empty table into the affected table
  8. Add all the foreign key constraints back in
  9. Re-enable the relation via the feature switch
  10. Scale workers back up

We executed the plan and observed that the system showed no signs of distress, so we were ready to perform the vacuum.

VACUUM FULL affected_table_broken;

The vacuum completed successfully with no queries blocked!

Now to reverse the procedure to get the data back:

BEGIN;
LOCK TABLE affected_table_broken, affected_table IN ACCESS EXCLUSIVE MODE;
ALTER TABLE "affected_table"
    RENAME TO "affected_table_empty";
ALTER TABLE "affected_table_broken"
    RENAME TO "affected_table";

-- insert any rows that managed to sneak through
INSERT INTO affected_table (date, enabled, data, product_id,
                                         mode_id)
    (SELECT date,
            enabled,
            data,
            product_id,
            mode_id
     FROM "affected_table_empty");
COMMIT;

We’ll also need to add the foreign key constraints back in but in a non blocking way. The trick to doing this is to create it as NOT VALID, meaning existing rows won’t be checked when it’s added. The constraint can later be checked to clear out any invalid values, and it’s done with a less intrusive lock (SHARE UPDATE EXCLUSIVE) so regular activity can continue.

-- put the constraint back on
-- note that it is initially deferred and not valid (old rows won't be checked, new inserts and updates will be). This means it will run fast!
ALTER TABLE "table_1"
    ADD CONSTRAINT "tbl1_fk_affected_table_id" FOREIGN KEY (table_id) REFERENCES affected_table (id) DEFERRABLE INITIALLY DEFERRED NOT VALID;
ALTER TABLE "table_1" VALIDATE CONSTRAINT "tbl1_fk_affected_table_id";

After disabling the feature switch and scaling our Celery workers back up our site was back to normal.

We were fortunate enough to be able to substitute null values for the affected table and rely on other databases. If this was not possible we’d have no choice but to bring the entire site down while the vacuum was running. Given we’re in the Christmas period and with Black Friday coming up, everyone was relieved that we could keep our uptime high!

Debugging Celery Issues in Django

Lockdown has ended in Melbourne and we’re able to resume mingling, gossiping, and chattering. Just before we could get off our seats and out the door though, Celery (our distributed task queue) jumped the gun and had us all scratching our heads on a recent issue we uncovered at Kogan.

Most Django developers will use Celery at least to manage long running tasks (as in longer than what’s reasonable in a web request/response cycle). It’s easy to set up and easy to forget that it’s running, until of course something goes wrong.

We observed that at around midnight, hundreds of gigabytes of data was recorded as ingress to our RabbitMQ node hosted on AWS, wreaking havoc on available memory and CPU utilisation.

We continued to investigate. CloudWatch metrics unveiled that the data mass was originating from our worker autoscale group, narrowing the search down.

Introducing Celery mingling.

Celery keeps a list of all revoked tasks, so when a revoked task comes in from the message queue it can be quickly discarded. Since Celery can be distributed, there’s a feature enabled by default called mingling which enables Celery nodes to share their revoked lists when a new node comes online. On paper, this sounds like a good thing: If a task is revoked then it shouldn’t be executed! Our use case doesn’t involve revoking tasks so it seemed harmless - if by chance we wanted to revoke a task manually we’d be able to, knowing that it will propagate to all nodes.

Unfortunately there’s more to the revoked list than meets the eye. Here’s a snippet from Celery:

def maybe_expire(self):
        """If expired, mark the task as revoked."""
        if self._expires:
            now = datetime.now(self._expires.tzinfo)
            if now > self._expires:
                revoked_tasks.add(self.id)
                return True

Adding an expiration to tasks is something we do a lot as there are a lot of time sensitive actions to do. If we’ve missed the window, we shouldn’t run the task.

The above snippet shows that when an expired task comes in it gets added to the revoked list! As a result, when a new Celery node came online our existing workers were eager to share their 250MB lists with the new node. Keep in mind that this list is just a list of UUIDs: we have a lot of tasks executing! We quickly turned off this feature after we observed this behaviour. We also noticed that a lot of workers were restarting at midnight - 250MB multiplied by 30 workers restarting is a lot of handshakes and a lot of data!

Looking through the supervisor logs to find the cause of the restarts initially gave a red herring; processes were exiting with exitcode 0. Surprisingly, Celery will also exit 0 on an unrecoverable exception so we started looking through Sentry for anything suspicious. We uncovered this exception:

TypeError: __init__() missing 1 required positional argument: 'url'

The rest of the trace was unhelpful due to the coroutine nature of its source. At a glance the exception appears to be a bug on our end, but looking at the source reveals it to be a pickle deserialization error.

Ultimately, we found the issue was not an unpickleable class, but an unpickable exception being passed to the retry mechanism. We filed an issue and removed all custom exceptions from the retry method.

If you’re running Celery with a lot of fine tuning with task expiration, we recommend turning off mingling. We’d also recommend not passing custom exceptions into Celery’s retry mechanism and instead log exceptions where they’re initially raised.