Technology Talks

I recently delivered a short talk to the Kogan.com engineering team on GraphQL - a technology that has not yet seen adoption at the company, but for which there arguably exists a need. Customers shop on Kogan.com across devices with different data requirements, and our reliance on traditional REST APIs leads to cases of under-fetching and over-fetching. GraphQL is a potential solution to this issue that can also make calling our APIs more simple, flexible and intuitive.

Presentations like these serve two main purposes: to educate and to inspire. Hosting regular discussions around technology practices helps share knowledge between engineers and identify where improvements can be made. At Kogan.com, we hold regular presentations and discussions about technology, typically hosted by members of the engineering team. These come in two flavours – hour-long “tech talks” and 15-minute “lightning talks”. Each is held roughly once a month, and the responsibility to host them cycles between each member of the engineering team. My presentation was a lightning talk, but another engineer in my team recently delivered a tech talk on OpenAPI practices. An effective talk appeals to all members of the audience. Despite having experience with OpenAPI, there were still valuable things that I learned.

Technology talks are important to organisations, so that they can share the best ideas and keep up with continual change. There will certainly be more talks this year at Kogan.com; I look forward to learning more about what my colleagues have to share.

DECEMBER 2021 HACKDAY (PART 3) v2

Tick tick, boom gate opens!

And by boom gate I am referring to the Kogan.com office car park boom gate. Kogan.com office parking spots are a scarce resource, but the staff members that are lucky enough to have an allocated spot aren’t always using them. Most days there is enough space for other staff members to utilise the parking spots, but there aren’t many remotes to open the car park boom gate. Is there perhaps a way to remotely open the boom gate by using the Kogan app only, without needing an individual remote?

During the recent hack day, team “Kogan Boom” took on solving this problem. We also managed to surprise and perhaps thoroughly confuse some passers-by when we were testing our solution and randomly opening the boom gate from the comfort of the office. This project had everything; electronics, hardware, software, app integration. It was a fun action-packed day for the team and everyone got to contribute.

The Boom Gate project comprised of three software systems:

  1. The Boom Box: An ESP32 wired into the physical RF remote control for the Boom Gate.  These remotes are in short supply and why we wanted to build the project in the first place!

  2. IoT Cloud: We used AWS IoT Core to register the ESP32 as a ‘Thing’.  AWS IoT made it simple to generate the MQTT authentication certificates that were copied to the Arduino code running on the ESP32.  We created pub and sub topics via AWS MQTT for which to issue commands to the ESP32 controlling the gate and for the ESP32 to communicate back.

  3. Web App: A mobile app and web service that enabled authorized users to toggle the car park Boom Gate. Instead of registering this service like another ‘Thing’ - we used AWS boto to publish the message to the MQTT topic and generated a service account for authentication with permissions to publish.

 Overall, when a logged in user hits the ‘Open’ button in the APP, a POST request is made to the web service which then pushes a message to the MQTT topic.  The ESP32 subscribes to the same topic, receives the message and toggles the PIN which connects the circuit on the RF Remote Controller and opens the gate.  The RF signal from the remote conveniently reaches the Boom Gate from within the office!

 

The Electronics

 The electronics part of the project included connecting a physical car park remote to a wifi enabled Arduino chip, requiring some delicate soldering work. Well it just so happens that Hanna used to do soldering professionally! She had a summer job at an electronics factory a long time ago and got to do a lot of soldering. Hack day impromptu workstation pictured below. It is good to note that neither staff nor equipment was harmed during this process. 

 
 
 
 

The ESP 32 Arduino chip was programmed using the Arduino IDE. We connected the ESP32 chip using USB Micro cable to our laptop. Testing whether the code is working or not is one of the most challenging parts. Using the IDE, the code is then uploaded to the chip. Then using a multimeter, we checked whether we received the expected HIGH/LOW signals on the right port.

Using the pin diagram we picked pin 5 to write high/low. Thee ESP32 chip was then soldered to the remote which requires a HIGH single to open up the boom gate. Since ESP 32 is WIFI enabled, therefore the code running inside the chip subscribes to relevant AWS Iot pub/sub channels and listens for the relevant incoming messages from AWS ref. https://iot-esp32.workshop.aws/en/module3/configure_esp32.html. Based on the message from AWS, which in our case was plain “open”, a HIGH signal is sent to the remote which then opens the Boom gate. We didn't need to program anything for closing the boom gate as this was handled by the boom gate itself.

The Mobile App and Web Service

Mobile app section of our team worked on the design and implementation of a new “Kogan Boom” app feature, which is only displayed for Kogan.com staff. There is a button to open the boom gate. Since the boom gate closes automatically, a specific button to close it was not needed.

 
 

When a user taps the ‘OPEN’ button, a web request is sent to the web server which then publishes a message to the AWS IoT MQTT topic that the ESP32 is listening on.  We achieved this without needing to register the web service as another IoT ‘Thing’ and instead used the AWS Python boto library to publish to the topic via a service account.  This was fairly trivial once we figured out the command to run!

The service account we created was given the AWS IAM access policy: arn:aws:iam::aws:policy/AWSIoTDataAccess so that it could publish to the IoT MQTT queue.

 
 

Putting it all together

 The finished hardware prototype turned out to be a rather sketchy-looking box full of electronics and wires. Some of the time the box was even ticking! Perhaps the name “Kogan Boom” was not ideal after all? 

 
 

Our hack day project came together nicely in the end and we were able to open the boom gate using the button in the Kogan.com mobile app. All in all, it was a successful hack day full of collaboration, new learnings, laughter and some very tasty donuts! We are very much looking forward to the next one. 

DECEMBER 2021 HACKDAY (PART 1)

DECEMBER 2021 HACKDAY (PART 1)

**‘Twas the week before Christmas…**

…and like a ghost from the Charles Dickens classic, the Kogan Hackday was back to give the talented members of the Engineering team an opportunity to think of something other than what to add to last year’s Christmas shopping list.

What do **“image recognition”**, **“web audio”** and **“boom gates”** have in common? Absolutely nothing! But having a nice variety of topics this time around allowed us to push our limits as innovators and problem solvers through issues that we found interesting. After a quick introduction on the morning of the hack day, four teams were formed to work on the following problems:

My experience onboarding with the Engineering team at Kogan.com

Can you describe your experience leading up to your start at Kogan.com?

The recruitment process for a software engineering position at Kogan.com was fast, smooth and enjoyable. The whole process - from the initial recruiter conversation to the final interview included a focus on technical excellence and soft skills. It began with a phone conversation, followed by a technical interview, and then a meeting with our Head of Engineering, CTO and Director of People & Culture. The process was efficient and purposeful.

Now that you have joined the team, can you describe what your remote onboarding experience has been like?

I joined Kogan.com at a steep inflection point in the growth of the engineering team. The size of the team roughly doubled in one intake of new engineers. Given this, one would expect onboarding to be chaotic and met with challenges. The reality was contrary to this - onboarding was a pleasant experience met with ample support.

My first day was highly educational and filled with meetings covering technical architecture, business context and culture. Given the pandemic, everything was remote. However, the team did a wonderful job in organising an onboarding that was not impaired by work-from-home conditions. The company even delivered us all food for lunch.

During onboarding and day to day where appropriate, the team practices pair programming. Tell us more about your experience:

Pair programming has been an encouraged practice throughout onboarding. It has provided me with a short feedback loop for rapid upskilling and given me the opportunity to socialise with other engineers, despite the remote conditions of work. In some ways, remote pair programming has advantages over its in-person counterpart. It ensures that both engineers get to use their ideal computer setup, allows for concurrency in certain operations (such as each engineer simultaneously exploring different parts of the codebase on their own screens) and does not require sharing desk space.

Now that we have returned to the office and are working in a hybrid environment, what are you enjoying most about meeting your colleagues in person?

I am enjoying meeting my colleagues in-person. Although Kogan.com has done an excellent job in maintaining company culture whilst we’ve worked remotely, there are some aspects that cannot be replicated. It has been nice to experience the full culture, have lunch and grab coffee with others. I tend to ask a lot of questions, and it’s easier to do that in-person.

How We Took A Table Offline While Maintaining Uptime

Recently a bug found its way into production which caused excessive updates to one of our critical Postgres tables. While we identified and squashed the bug quickly, the damage was already done: about 160% of the live tuples were dead. At a critical threshold of dead tuples Postgres' query planner will start to ignore indices as it thinks there are much more rows than there are. In this degraded state, queries that would normally complete in sub-second time were now taking multiple seconds, even minutes to complete.

Usually dead tuples aren’t much of an issue as they’re vacuumed away regularly by the auto vacuum process. Auto vacuum works well because it only acquires a SHARE UPDATE EXCLUSIVE lock, meaning selects, updates, and inserts can still occur. We calculated from the currently running vacuum’s progress that the auto vacuum was going to take more than 30 days to complete. With Black Friday sales around the corner we needed another solution. A full vacuum was calculated to take roughly an hour and would reclaim a lot more space, however this would acquire an ACCESS EXCLUSIVE lock, blocking all access to the table for the entire duration: not possible without taking the entire site down!

Or is it? If we can get a full vacuum to work, the database would recover in less than an hour!

We analysed the usage of this table. We determined that, so long as there were no writes, removing all relations (and replacing with null / no rows) would be acceptable in the short term, as the values were already denormalised into our elasticsearch database. Could we do this with minimal code changes?

Our Django application uses select_related/prefetch_related heavily so we would need to create an empty table with the same signature to allow for these queries to work without exceptions. Here’s what we came up with:

BEGIN;
-- we want to be able to rename fast later, so lock the table now
LOCK TABLE affected_table IN ACCESS EXCLUSIVE MODE;

-- create a table with the same columns as affected_table.
-- Foreign keys and indices are not required as
-- there’ll (theoretically) be no data inserted
CREATE TABLE "affected_table_empty"
(
    "id"              serial                   NOT NULL PRIMARY KEY,
    "date"            timestamp with time zone NOT NULL,
    "enabled"         boolean                  NOT NULL,
    "data"            varchar(128)             NOT NULL,
    "product_id"      integer                  NOT NULL,
    "mode_id"         integer                  NOT NULL
);
-- swap the tables around. Table renaming is a fast operation
ALTER TABLE "affected_table" RENAME TO "affected_table_broken";
ALTER TABLE "affected_table_empty" RENAME TO "affected_table";
COMMIT;

We had some foreign keys linking to this table so each of those constraints must be dropped:

-- Another fast operation
ALTER TABLE "table_1" DROP CONSTRAINT "tbl1_fk_affected_table_id";

At this point we’ll have:

  • No references to the affected table (now renamed to broken) in the database
  • An empty table for our code to look at

Testing this out revealed a problem. If a table with a foreign key had a value set, Django would crash with DoesNotExist as it assumes there would be a value.

>>> obj = Table1.objects.get(pk=1)
>>> obj.affected_table_id
<<< 156
>>> obj.affected_table
<<<  DoesNotExist: AffectedTable matching query does not exist

As mentioned earlier, we’re heavy users of select_related so we couldn’t just mock the columns to be None. Here’s what we came up with to work around it:

class DisablingForwardOneToOneDescriptor(ForwardOneToOneDescriptor):
    def __get__(self, instance, cls=None):
        if waffle.switch_is_active(f"disable_relation_{self.field}"):
            return None
        return super().__get__(instance, cls=cls)


class DisablingOneToOneField(models.OneToOneField):
    forward_related_accessor_class = DisablingForwardOneToOneDescriptor

We use django-waffle to manage feature flags which makes it easier to turn off relations through this new class.

The above class could be used in place of a OneToOne relation (but could easily be extended to a normal ForeignKey) as follows:

class Table(models.Model):
    affected_table = DisablingOneToOneField("AffectedTable", null=True, blank=True)

Now when we try again it will just return None:

>>> obj = Table1.objects.get(pk=1)
>>> obj.affected_table_id
<<< 156
>>> obj.affected_table
<<< None

Calling obj.save() is also safe, it will preserve the existing ID.

After some vigorous testing we were confident and ready to roll in production.

The steps that needed to occur were:

  1. Scale down Celery workers to prevent any writes to the empty table
  2. Disable the relation via the feature switch
  3. Create an empty version of the table and swap it with the affected table
  4. Drop all foreign key constraints to the affected table
  5. Vacuum the affected table
  6. Swap the affected table back with the empty table
  7. Insert any data that managed to find its way into the empty table into the affected table
  8. Add all the foreign key constraints back in
  9. Re-enable the relation via the feature switch
  10. Scale workers back up

We executed the plan and observed that the system showed no signs of distress, so we were ready to perform the vacuum.

VACUUM FULL affected_table_broken;

The vacuum completed successfully with no queries blocked!

Now to reverse the procedure to get the data back:

BEGIN;
LOCK TABLE affected_table_broken, affected_table IN ACCESS EXCLUSIVE MODE;
ALTER TABLE "affected_table"
    RENAME TO "affected_table_empty";
ALTER TABLE "affected_table_broken"
    RENAME TO "affected_table";

-- insert any rows that managed to sneak through
INSERT INTO affected_table (date, enabled, data, product_id,
                                         mode_id)
    (SELECT date,
            enabled,
            data,
            product_id,
            mode_id
     FROM "affected_table_empty");
COMMIT;

We’ll also need to add the foreign key constraints back in but in a non blocking way. The trick to doing this is to create it as NOT VALID, meaning existing rows won’t be checked when it’s added. The constraint can later be checked to clear out any invalid values, and it’s done with a less intrusive lock (SHARE UPDATE EXCLUSIVE) so regular activity can continue.

-- put the constraint back on
-- note that it is initially deferred and not valid (old rows won't be checked, new inserts and updates will be). This means it will run fast!
ALTER TABLE "table_1"
    ADD CONSTRAINT "tbl1_fk_affected_table_id" FOREIGN KEY (table_id) REFERENCES affected_table (id) DEFERRABLE INITIALLY DEFERRED NOT VALID;
ALTER TABLE "table_1" VALIDATE CONSTRAINT "tbl1_fk_affected_table_id";

After disabling the feature switch and scaling our Celery workers back up our site was back to normal.

We were fortunate enough to be able to substitute null values for the affected table and rely on other databases. If this was not possible we’d have no choice but to bring the entire site down while the vacuum was running. Given we’re in the Christmas period and with Black Friday coming up, everyone was relieved that we could keep our uptime high!

Why I joined the Kogan.com engineering team

Welcome to our monthly “Why I joined the Kogan.com engineering team” series! Every month, we talk with members of the Kogan.com engineering team to learn who they are and why they chose to pursue careers at Kogan.com.

Mengfei is our Designer extraordinaire who has been with us for over 4 years now. Her role sees her playing a big part in how our website interacts with our customers and she is involved in turning features developed by our engineers into user friendly applications for our customers. In her past time Mengfei loves to bake, travel and play the viola. Her superpower is spotting our pet cats lurking around our team’s backgrounds during video calls! Mengfei is also an avid photographer who regularly shares her brilliant snaps of her travels with the wider Kogan.com team.

Mengfei at Lake Tyrrell which is in between Melbourne CBD & Mildura

Describe what it’s like being a part of the Kogan.com Engineering team.

Busy, collaborative and an environment where creativity is embraced.

Every day we receive many requests from other teams, as well as feedback from customers. We also have new features that need to be analysed, designed, implemented, and released. As a UX designer at Kogan.com, I am always busy with proposing concepts, receiving feedback, and refining designs. During this process, there is a lot of room to be creative, especially when collaborating with other team members.

We’re always brainstorming various ways to solve usability issues, enhance the User experience, and deliver the best e-commerce site for our customers!

What is something unique about the Kogan.com Engineering team?

Our team is highly collaborative and we love supporting each other.

I joined Kogan.com 4 years ago as my first job after migrating to Australia. We were a team of 14 members at that time. In my time here, I have learnt a lot not only about design, but also about stakeholder management, negotiating requirements with developers, and even about the Australian culture! 

During this time, I have gained so much knowledge about how customers interact with our site which helps with designing features for our customers so they have a great shopping experience.I have also been fortunate enough to be part of the team working on building new vertical sites aligning to our overall company growth strategy. 

4 years ago, when I joined Kogan, I received a lot of support from team members. I had very little knowledge about front-end development at that time. The developers took time to walk me through the process, explored the different tools, and we worked together to find an efficient process to turn designs to implementation. Goran, our CTO, was always kind with his time and supported me to review my designs. He always empowered me and raised suggestions when I got stuck. 

I am so proud to say that I have now become one of the key members in our team who supports others by providing these design suggestions. I also play a big part in the user acceptance testing process. This helps us to get familiar with products and features much faster.

Tell us about a work challenge you’ve had to recently resolve

Our team has doubled in size recently due to the growth of our business and we’re seeing more and more new faces join us! To ensure our new developers aren’t waiting for my designs, I’ve had to learn to manage my time more effectively so I can deliver on these designs for my team and stakeholders.

Tell us about your proudest moment at work

I am really proud to see the new features that I’ve designed, released and used by customers,(though this is the whole team’s effort not my own ^_^).

Debugging Celery Issues in Django

Lockdown has ended in Melbourne and we’re able to resume mingling, gossiping, and chattering. Just before we could get off our seats and out the door though, Celery (our distributed task queue) jumped the gun and had us all scratching our heads on a recent issue we uncovered at Kogan.

Most Django developers will use Celery at least to manage long running tasks (as in longer than what’s reasonable in a web request/response cycle). It’s easy to set up and easy to forget that it’s running, until of course something goes wrong.

We observed that at around midnight, hundreds of gigabytes of data was recorded as ingress to our RabbitMQ node hosted on AWS, wreaking havoc on available memory and CPU utilisation.

We continued to investigate. CloudWatch metrics unveiled that the data mass was originating from our worker autoscale group, narrowing the search down.

Introducing Celery mingling.

Celery keeps a list of all revoked tasks, so when a revoked task comes in from the message queue it can be quickly discarded. Since Celery can be distributed, there’s a feature enabled by default called mingling which enables Celery nodes to share their revoked lists when a new node comes online. On paper, this sounds like a good thing: If a task is revoked then it shouldn’t be executed! Our use case doesn’t involve revoking tasks so it seemed harmless - if by chance we wanted to revoke a task manually we’d be able to, knowing that it will propagate to all nodes.

Unfortunately there’s more to the revoked list than meets the eye. Here’s a snippet from Celery:

def maybe_expire(self):
        """If expired, mark the task as revoked."""
        if self._expires:
            now = datetime.now(self._expires.tzinfo)
            if now > self._expires:
                revoked_tasks.add(self.id)
                return True

Adding an expiration to tasks is something we do a lot as there are a lot of time sensitive actions to do. If we’ve missed the window, we shouldn’t run the task.

The above snippet shows that when an expired task comes in it gets added to the revoked list! As a result, when a new Celery node came online our existing workers were eager to share their 250MB lists with the new node. Keep in mind that this list is just a list of UUIDs: we have a lot of tasks executing! We quickly turned off this feature after we observed this behaviour. We also noticed that a lot of workers were restarting at midnight - 250MB multiplied by 30 workers restarting is a lot of handshakes and a lot of data!

Looking through the supervisor logs to find the cause of the restarts initially gave a red herring; processes were exiting with exitcode 0. Surprisingly, Celery will also exit 0 on an unrecoverable exception so we started looking through Sentry for anything suspicious. We uncovered this exception:

TypeError: __init__() missing 1 required positional argument: 'url'

The rest of the trace was unhelpful due to the coroutine nature of its source. At a glance the exception appears to be a bug on our end, but looking at the source reveals it to be a pickle deserialization error.

Ultimately, we found the issue was not an unpickleable class, but an unpickable exception being passed to the retry mechanism. We filed an issue and removed all custom exceptions from the retry method.

If you’re running Celery with a lot of fine tuning with task expiration, we recommend turning off mingling. We’d also recommend not passing custom exceptions into Celery’s retry mechanism and instead log exceptions where they’re initially raised.