If I had to pick the one thing engineers fail at most with large scale software, it wouldn’t be communication, leadership, or some other social skill. It’s planning and executing a migration.
How often have you (or someone in your org) designed a shiny new system to replace a legacy one, only to fail at closing the migration and end up maintaining two systems? The root cause is often twofold:
- When designing the new system, not enough thought was put into edge cases, undefined behavior, or legacy journeys. The system is clean at first, but as more traffic is migrated over, more and more functionality it wasn’t designed for gets bolted on – and eventually, you may end up with something you can’t support in the new system at all
- When migrating to the new system, not enough investment was made in migration tooling, which often results in a lot of manual testing, a lot of rollbacks, and generally a high operational cost
One thing that works well is to build a shim in front of your legacy and new services so that you can tee traffic and run diffs (e.g. duplicate 1% of traffic to the new system and compare the responses of both systems). This isn’t as straightforward as it sounds, but it is definitely doable:
- For write operations (and downstream dependencies in general), record the requests/responses to these dependencies and mock them in the new system
- Make sure you compare apples to apples: if your new system exposes a new API, build a library to convert its responses back to the old format
From there, you can tee traffic and fix things until you see no differences. This is significantly more efficient than running a live experiment and finding bugs by hand, or waiting for users to report issues. Because you tee traffic, you can detect (and fix) edge cases even if they are very rare (e.g. <0.1% of requests).
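To make this concrete, here is a minimal sketch of what such a shim could look like. This is an illustration, not a drop-in implementation: the request shape, the to_legacy_response converter, and the 1% sampling are all assumptions, and a real shim would record downstream calls and report diffs asynchronously.

```python
import logging
import random
from typing import Any, Callable

log = logging.getLogger("migration_shim")


def to_legacy_response(new_response: dict[str, Any]) -> dict[str, Any]:
    # Hypothetical converter: maps the new system's response shape back to the
    # legacy shape so the diff compares apples to apples.
    return {"status": new_response.get("status"), "body": new_response.get("payload")}


def handle_request(
    request: dict[str, Any],
    call_legacy: Callable[[dict[str, Any]], dict[str, Any]],
    call_new: Callable[[dict[str, Any]], dict[str, Any]],
    tee_fraction: float = 0.01,  # duplicate ~1% of traffic to the new system
) -> dict[str, Any]:
    # The legacy system stays the source of truth: its response is what we return.
    legacy_response = call_legacy(request)

    if random.random() < tee_fraction:
        try:
            new_response = to_legacy_response(call_new(request))
            if new_response != legacy_response:
                # In practice you would emit this to a metrics/diff pipeline
                # rather than a log line.
                log.warning(
                    "diff for request %s: legacy=%s new=%s",
                    request.get("id"), legacy_response, new_response,
                )
        except Exception:
            # A failure in the new system must never affect live traffic.
            log.exception("new system failed for request %s", request.get("id"))

    return legacy_response
```

The key design choice is that the legacy system remains the source of truth: diffs and failures in the new system only ever produce signals, never user-facing impact.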
Thoughts? What do you think is the hardest part of distributed software engineering?
LinkedIn post
A (sadly) too common assumption is that a long design doc means that the problem being solved is more important, more complex and/or more impactful. This is wrong.
The only thing a longer doc guarantees is that it takes longer for people to read. As a senior IC, my time is limited (and precious). I don’t have time to read 50 pages for a simple problem unless the solution is complex or the trade-offs aren’t obvious.
If you can rewrite your doc to be shorter while carrying the same information, you should. Practically speaking, a shorter doc means:
- People will read and review it faster – I can spend 5 minutes on a doc multiple times during the day. I rarely have a 2-hour block to read just one doc.
- People will review your doc better – if the critical information is drowned in pages and pages of noise, people may miss it, and you may end up reverting work because the problem surfaced too late.
On a similar note, simple language beats long, convoluted terms – basically, anything that helps the reader should be considered. A few practical tips that may help:
- Write “use” instead of “utilize” and prefer simple terms as much as possible – keep in mind that you may work with non-native English speakers
- Avoid filler phrases (e.g. “it goes without saying”, “needless to say”) – they just take time to read and add nothing
- Remove the security/privacy (or any other) sections that are left empty
- Write multiple versions of your doc for different audiences and goals (e.g. a short one-pager with context, problem statement, and high-level solution to get rough alignment; a more technical doc with details for the people who care about them; etc.)
- Don’t put code in your design doc – code should be reviewed in PRs, not in design docs
What other tips would you give to others (or to your younger self)?
LinkedIn post
One question I get every now and then as a senior staff software engineer is “how many hours a week do I work?”. Behind the question, there’s the implicit assumption that the main cost of being a staff+ software engineer is having to work long hours.
My personal experience is a bit different – the true cost of being a very senior engineer is responsibilities (and stress). The more senior you are, the larger your responsibilities and the more stress rests on your shoulders: you are responsible for more revenue, for avoiding larger fines, or for more components on critical journeys.
It never goes away with time, as far as I can tell. You will likely learn to manage your responsibilities better, but you will still have them – and you likely want them. If you can offload all your responsibilities as a senior staff engineer, it means either:
- You aren’t operating at that level
- Your company is so full of checks and balances (and paperwork) that it isn’t a very interesting place to work
You have to find the balance between stress and thrill – it’s a delicate one, but fun (and important) to maintain.
This is also why maintaining a good work-life balance at that level is even harder, and why not everyone truly wants to reach that level.
LinkedIn post
One of the most common skills I use as a software engineer isn’t some syntax detail about threads, promises, or coroutines, but the ability to quickly manipulate data: extracting the files responsible for test failures from a large log, parsing a JSON blob to pull out a single field, removing duplicated entries in a file, etc.
This is something you should be able to do with multiple tools: from vim, from your terminal, from a Google spreadsheet, etc. Which tools (e.g. sed vs awk) or formulas (COUNTIF vs EXISTS) you use doesn’t matter much, as long as you can use them quickly (or at least know they exist so you can properly prompt an AI tool to give you the answer).
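As an illustration, two of the examples above take only a few lines in Python – the file names and the field name below are made up for the sake of the example:

```python
import json

# Hypothetical files and field name, purely for illustration.

# Equivalent of `jq '.items[].user_id' events.json`: extract one field from a JSON blob.
with open("events.json") as f:
    user_ids = [item["user_id"] for item in json.load(f)["items"]]

# Equivalent of `sort -u entries.txt`: remove duplicated entries from a file.
with open("entries.txt") as f:
    unique_entries = sorted({line.rstrip("\n") for line in f})

print(user_ids)
print(unique_entries)
```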
What’s your favorite tool? Which one do you still struggle with? For me it’s definitely jq, as I always screw up the syntax :)
LinkedIn post