<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Ring around the repo]]></title><description><![CDATA[Musings on technology, programming, and engineering culture]]></description><link>https://staticfinal.org/</link><image><url>https://staticfinal.org/favicon.png</url><title>Ring around the repo</title><link>https://staticfinal.org/</link></image><generator>Ghost 3.35</generator><lastBuildDate>Wed, 08 Apr 2026 14:06:50 GMT</lastBuildDate><atom:link href="https://staticfinal.org/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[DynamoDB Stream Processing: Scaling it up]]></title><description><![CDATA[<p>This is Part II of the Data Streaming from DynamoDB series. You can read <a href="https://staticfinal.org/stream-processing-ddb/">Part I</a>, where the primary focus is on a use case where stream processing is helpful (indexing data in ElasticSearch). We evaluated the available options to process streams and discussed in detail how stream processing can</p>]]></description><link>https://staticfinal.org/dynamodb-stream-processing-scaling-it-up/</link><guid isPermaLink="false">5f7a78657393b43baa8d57bb</guid><category><![CDATA[engineering]]></category><category><![CDATA[aws]]></category><category><![CDATA[stream processing]]></category><dc:creator><![CDATA[Merrin Kurian]]></dc:creator><pubDate>Tue, 03 Dec 2019 00:36:17 GMT</pubDate><content:encoded><![CDATA[<p>This is Part II of the Data Streaming from DynamoDB series. You can read <a href="https://staticfinal.org/stream-processing-ddb/">Part I</a>, where the primary focus is on a use case where stream processing is helpful (indexing data in ElasticSearch).
We evaluated the available options to process streams and discussed in detail how stream processing can be done using the Kinesis Client Library (KCL). In this post, we will see how to scale up stream processing for very large throughput. As a use case, we will look at an online migration of a Cassandra database to DynamoDB while processing streams to index the same data in ElasticSearch.</p><h1 id="why-scale-up-stream-processing"><strong>Why scale up stream processing?</strong></h1><p>Consider a service that has about 250000 active users who do about 2 million writes a day to this database. Over the years, the database, currently in Cassandra, has acquired a lot of data. Now we want to migrate this database to DynamoDB. (Note: to serve 2 million writes a day, one worker seems to be enough to index the data in ElasticSearch in near real time, as there is no complex processing involved.) Since we cannot take downtime and would like to complete this migration in a reasonable amount of time, we decide to migrate data on a per-user basis from Cassandra to DynamoDB. The rate of migration then really depends on the rate of writes to DynamoDB, and likewise on the writes to ElasticSearch.</p><p>There is a difference between how DynamoDB and ElasticSearch scale up. We can provision a range of values for WCU autoscaling on a DynamoDB table, but for ElasticSearch we set up a cluster with predefined capacity. Assuming ElasticSearch is provisioned to support current traffic and data with some headroom for the near future, the data migration use case really boils down to DynamoDB. DynamoDB, in turn, can scale up to a certain limit (40000 WCUs shared by the table, LSIs, and Global Tables, if any), as can the corresponding streams. These are soft limits which can be raised by support.
The rate of migration then ultimately comes down to how fast the stream is processed so that the corresponding data appears in ElasticSearch.</p><p>Imagine that there is only one worker, but DynamoDB writes are at 40000 WCUs (with a few indexes, let us say the effective rate to the table, and therefore the stream, is 10000 WCUs). In that case, it will be several hours before the corresponding data appears in ElasticSearch. Moreover, the table continues to get writes, and the streams continue to grow. After 24 hours, stream records become unavailable for reads, and if the worker is badly lagging at that point, there will be data loss. Even in a less dramatic scenario, a service that does 2 million writes a day, provided the partition key is somewhat uniformly distributed, can benefit from more than a single worker and make use of the parallelism built into the system. Let’s learn more.</p><h1 id="dynamodb-streams-and-shards">DynamoDB Streams and Shards</h1><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://staticfinal.org/content/images/2019/12/DDB-Table1-1.jpg" class="kg-image" alt><figcaption><em> Courtesy: AWS docs</em></figcaption></figure><p>As shown in the picture above, one DynamoDB partition corresponds to one shard in the DynamoDB stream, which can be processed by one KCL worker. So if the table has multiple partitions, stream processing can benefit from multiple workers.</p><h2 id="how-many-workers-do-we-need">How many workers do we need?</h2><p>How do we calculate the number of partitions in a table? There are several equations floating around claiming to help calculate this number. Even older AWS documents point to a link that no longer shows this calculation. Based on their documentation, however, we can guess that a partition cannot handle more than roughly 1000 WCUs. Again, I’m not sure how accurate this number is.
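If we take that rule of thumb at face value, a rough back-of-the-envelope estimate can be written down. This is my own sketch, not an official AWS formula, and the 1000-WCU-per-partition figure is the guess discussed above:

```java
// Rough estimate of the number of partitions (and therefore stream shards,
// and therefore useful KCL workers) from provisioned WCU, assuming roughly
// 1000 WCUs per partition. This is a guess based on older AWS docs, not an
// official formula; verifying against the checkpoint table is more reliable.
public class PartitionEstimate {
    static final int WCU_PER_PARTITION = 1000; // assumed soft ceiling per partition

    static int estimatePartitions(int provisionedWcu) {
        // Ceiling division; even a small table has at least one partition.
        return Math.max(1, (provisionedWcu + WCU_PER_PARTITION - 1) / WCU_PER_PARTITION);
    }

    public static void main(String[] args) {
        // An effective write rate of 10000 WCUs suggests about 10 shards/workers.
        System.out.println(estimatePartitions(10000));
    }
}
```

Treat the result only as a starting point for how many workers to provision; the observed shard count is the number that matters.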
What has been useful for me is to actually see it in action.</p><p>Here is one foolproof way to <strong>find the number of shards in the stream, and therefore the number of workers</strong> and the number of partitions in the base table. As discussed in <a href="https://staticfinal.org/stream-processing-ddb/">Part I of this series</a>, configure a KCL worker to process the table stream. Then log in to the AWS console and look for a new table with the name of the KCL application you configured. KCL workers checkpoint using a DynamoDB table of that name, so after the worker has started processing the stream, we can see a table that we didn’t create.</p><p>For a table with min WCU configured as 250 (with 3 LSIs), the corresponding worker table has 8 items. There are 6 open shards and 2 that are checkpointed with SHARD_END. The LeaseOwner column shows the workerId of the worker that holds the current lease on the shard; the LeaseKey is the shardId. So now we know how many shards we have in the stream and therefore how many workers can process the stream in parallel.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://staticfinal.org/content/images/2019/12/Screen-Shot-2019-12-02-at-2.01.44-PM.png" class="kg-image" alt><figcaption>Worker table</figcaption></figure><h1 id="tips-on-scaling-up-writes-and-workers">Tips on scaling up writes and workers</h1><p>Initially, I had this configuration for the table:</p><p>RCU: Autoscale. Min: 250 Max: 3000</p><p>WCU: Autoscale. Min: 250 Max: 40000</p><p>What I observed is that every time the table autoscaled, production performance suffered for about 2 minutes. After all, this is a live system doing online migration. Also, the number of workers, which do not autoscale, was very small.
So we started seeing data loss, most likely because the workers did not scale along with the write throughput on the DynamoDB table and stream.</p><p>The other issue was that the DynamoDB table automatically created for a worker doesn’t autoscale. It has an RCU and WCU of 5 when it is created. Soon, reads to this table were getting throttled.</p><p>It is best to do a few trial runs to learn about the nature of throttling, the throughput values of the main table and the worker table, and the efficiency of the workers in processing the stream. A few metrics that help are throttled write requests and throttled write events.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://staticfinal.org/content/images/2019/12/Screen-Shot-2019-12-02-at-3.38.59-PM.png" class="kg-image" alt><figcaption>Throttled writes</figcaption></figure><p>Most often these throttling events don’t appear in the application logs, since throttling errors are retriable. Looking at these metrics in the AWS console helps us understand whether the WCU for the table needs to be increased. In this case it clearly will help, as requests are throttled by the hundreds every once in a while.</p><p>We can also see metrics from Streams, with the number of records returned per batch and the latency:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://staticfinal.org/content/images/2019/12/Screen-Shot-2019-12-02-at-3.42.13-PM.png" class="kg-image" alt><figcaption>Stream metrics</figcaption></figure><p>The other metric to watch out for is the capacity on the worker table. As we can see, at the rate this table is read and written, the default capacity of 5 without autoscaling is no good.
The required WCU is 50–70.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://staticfinal.org/content/images/2019/12/Screen-Shot-2019-12-02-at-3.41.57-PM-1.png" class="kg-image" alt><figcaption>Worker table metrics</figcaption></figure><h1 id="how-to-prevent-data-loss-and-achieve-near-real-time-processing">How to prevent data loss and achieve near real time processing</h1><p>Our high-throughput online migration use case yields a few useful insights. Here, in addition to being written to DynamoDB, all data also needs to be indexed in near real time in ElasticSearch via KCL workers configured to process DynamoDB streams.</p><p><strong>Create a large number of workers ahead of time: </strong>DynamoDB can autoscale. Lambda functions can autoscale. But KCL workers that process streams will not autoscale; they will continue to process one shard per worker. When DynamoDB autoscales and increases capacity, shards split in two. After a split there are twice as many shards as before, and covering them all requires twice as many workers as before the split, if the write throughput sustains at high levels.</p><p>PS1: A worker in KCL is a thread that takes an id. Each worker is uniquely identified by its workerId. KCL configuration allows you to name your workerIds within an application. <a href="https://gist.github.com/mkurian/762d7a0691bd1f8a5f3e055d30cab4cb">Here is the sample code to create ‘multiple workers’.</a></p><p>PS2: If you cannot create enough threads to process the stream in one EC2 instance/JVM, it is beneficial to scale out EC2 instances/JVMs with the same KCL worker configuration; only the workerIds need to differ, so that they are unique across all threads processing the same stream and the stream shards are distributed correctly (similar to Kafka consumer groups and consumers).
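One simple way to satisfy that uniqueness requirement is to derive each workerId from the application name, the host name, a thread index, and a random component. This is a sketch of my own; the naming scheme is an assumption, not something KCL mandates:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.UUID;

// Sketch: generate workerIds that stay unique across threads in one JVM and
// across JVMs/pods, so that shard leases are distributed correctly.
public class WorkerIds {
    static String workerId(String application, int threadIndex) {
        String host;
        try {
            host = InetAddress.getLocalHost().getHostName();
        } catch (UnknownHostException e) {
            host = "unknown-host";
        }
        // The random UUID guards against two pods reporting the same hostname.
        return application + "-" + host + "-" + threadIndex + "-" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            System.out.println(workerId("fooWorker", i));
        }
    }
}
```

With ids built this way, increasing the ReplicaSet's replica count adds workers without any risk of two threads claiming the same identity.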
In our case, we deploy a Kubernetes pod for processing the DynamoDB stream as a ReplicaSet; we can spin up more workers across multiple EC2 instances by increasing the number of replicas.</p><p><a href="https://gist.github.com/mkurian/762d7a0691bd1f8a5f3e055d30cab4cb">In the sample code</a>, 9 <em>worker</em> threads are created for the application <em>fooWorker</em>. Each worker is uniquely identified by its workerId. One <em>streamWorker</em> can process a single shard. Maximum throughput, and therefore near-real-time processing, is achieved by configuring the same number of workers as there are shards; any extra workers will sit idle. Having more workers is not a bad idea, though, if there is a chance the table/stream will scale out further.</p><p><strong>Configure the DynamoDB table to use high WCU from the beginning: </strong>We did see data loss in ElasticSearch during our test runs, possibly due to workers not catching up fast enough and a few shards that were never processed. To avoid this uncertainty, if we anticipate sustained high throughput over a long period of time, it is better to provision the table with the high throughput up front than to wait for it to be throttled and then autoscale. During autoscaling, shards split and increase in number. Since the migration is only going to take a finite amount of time, we can afford to have the high WCUs provisioned on the table, instead of having to worry about proving that all data migrated correctly, the latter being much harder.</p><p><strong>Configure auto-scaling for the worker table: </strong>The <em>fooWorker</em> KCL application in this case will create a <em>fooWorker</em> DynamoDB table for checkpointing. This table needs higher capacity than the defaults with which it gets automatically created.
Ensure that you either provision the required capacity or enable autoscaling/on-demand so stream workers are not throttled while processing streams.</p><p>With these settings enabled, ElasticSearch indexes are updated mostly in near real time. We are able to verify that all data eventually reaches ElasticSearch without significant lag and that there is no data loss.</p><p>That brings us to the end of this two-part series on processing DynamoDB Streams for near-real-time indexing in ElasticSearch at high throughput.</p>]]></content:encoded></item><item><title><![CDATA[DynamoDB Stream Processing]]></title><description><![CDATA[<p>DynamoDB Streams makes change data capture from the database available on an event stream. One of the use cases for processing DynamoDB streams is to index the data in ElasticSearch for full-text search or analytics. In this post, we will evaluate technology options to process streams for this use</p>]]></description><link>https://staticfinal.org/stream-processing-ddb/</link><guid isPermaLink="false">5f7a78657393b43baa8d57ba</guid><category><![CDATA[engineering]]></category><category><![CDATA[aws]]></category><category><![CDATA[stream processing]]></category><dc:creator><![CDATA[Merrin Kurian]]></dc:creator><pubDate>Tue, 03 Dec 2019 00:16:19 GMT</pubDate><content:encoded><![CDATA[<p>DynamoDB Streams makes change data capture from the database available on an event stream. One of the use cases for processing DynamoDB streams is to index the data in ElasticSearch for full-text search or analytics. In this post, we will evaluate technology options to process streams for this use case.
In <a href="https://staticfinal.org/dynamodb-stream-processing-scaling-it-up/">a subsequent post</a>, we will dive into the details of scaling up the stream processing, if this approach is followed.</p><h2 id="dynamodb-streams"><strong>DynamoDB Streams</strong></h2><p>Enable DynamoDB Streams in the table specification:</p><p><strong>"StreamSpecification"</strong>: {<br>    <strong>"StreamEnabled"</strong>: <strong>true</strong>,<br>    <strong>"StreamViewType"</strong>: <strong>"NEW_AND_OLD_IMAGES"</strong><br>}</p><p>Note: If you are planning to use Global Tables for DynamoDB, where a copy of your table is maintained in a different AWS region, <strong>“NEW_AND_OLD_IMAGES” </strong>needs to be enabled.</p><p>After streams are enabled on a table, the streamArn is required to configure a client application to process streams. It will look like this:</p><p><strong>"arn:aws:dynamodb:{aws-region}:{aws-account-number}:table/{table-name}/stream/2019-11-07T20:49:20.459"</strong></p><p><a href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html" rel="noopener nofollow">More on how table activity is captured on DynamoDB Streams</a></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://staticfinal.org/content/images/2019/12/streams-terminology.png" class="kg-image" alt><figcaption><em> Courtesy: AWS Docs</em></figcaption></figure><h1 id="2-approaches-to-process-streams">2 approaches to process streams</h1><h2 id="serverless-approach-">Serverless approach:</h2><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://staticfinal.org/content/images/2019/12/DDB_ES-Lambda--1-.jpeg" class="kg-image" alt><figcaption>Lambda function Approach to process streams and index data</figcaption></figure><p>The easiest approach to index data from DynamoDB into ElasticSearch, for example, is to enable a Lambda
function, as documented here: <a href="https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-aws-integrations.html#es-aws-integrations-dynamodb-es" rel="noopener nofollow">https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-aws-integrations.html#es-aws-integrations-dynamodb-es</a></p><p>There are several reasons why <strong>I do not prefer a Lambda function</strong> for our use case. Some of them are:</p><ol><li>Deployment complexity: We run our services in Kubernetes pods, one for each type of application. Adding a Lambda function/serverless component would change the deployment topology and add more complexity to our deployment automation.</li><li>Observability: The only way to observe what happens inside a Lambda function is the CloudWatch service. We already have a different observability stack to analyze information from application logs and would like to continue to leverage it. If we used a Lambda function, we would need to capture logs from CloudWatch and publish them to S3 buckets to push into that stack.</li><li>Skill set of the team: We are primarily application engineers who switch to DevOps mode when needed. For production systems that we must maintain as a team of 3 engineers, we prefer to work with client libraries in Java/Kotlin over other languages/tools/frameworks.</li></ol><p>Here are the reasons why AWS advocates the use of Lambda functions:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://staticfinal.org/content/images/2019/12/DDB-Table1.jpg" class="kg-image" alt><figcaption><em> Courtesy: AWS Docs</em></figcaption></figure><ol><li>Ability to autoscale stream processing. Unless you have a really large workload and really complicated processing, Lambda functions will work.
There is no need for additional effort to scale up stream processing.</li><li>CloudWatch metrics: All metrics go to CloudWatch, which should help with observability if you already have that in place.</li><li>Limitation on throughput: There is a limit of 100 records per shard on how many records are processed at a time; from what I have heard, KCL workers allow more throughput per batch.</li></ol><h2 id="hosted-service-approach-">Hosted Service approach:</h2><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://staticfinal.org/content/images/2019/12/DDB_ES-KCL-worker-high-level.jpeg" class="kg-image" alt><figcaption>KCL worker with DynamoDB Adapter</figcaption></figure><p>Since we ruled out a Lambda function, the other approach is to use a KCL (Kinesis Client Library) worker with the DynamoDB Adapter for processing DynamoDB streams. Since we are building Java/Kotlin services and are primarily application developers, this option is better aligned with the skill set of the team for long-term maintainability of the stack.</p><p>In this case, an application is built around KCL with the DynamoDB Adapter, which creates a worker configured to listen for changes on the stream and process them.</p><p>The <strong>disadvantage</strong> of using KCL workers is that we need to scale up workers on our own based on the performance requirements of processing the stream; more about that in the upcoming post. The <strong>advantage</strong> is that it is really just another application deployed alongside your main service, and you can leverage your existing deployment infrastructure (a separate pod on a Kubernetes cluster), code infrastructure (a Spring Boot application), and the telemetry/observability stack you are already familiar with for logging and troubleshooting.</p><p>Stream processing requires KCL to instantiate a worker.
We must provide the worker with configuration information for the application, such as the stream ARN and AWS credentials, and with the record processor factory implementation.</p><p>As mentioned in the documentation, the worker performs the following tasks. In most cases, we don’t have to tweak any of these settings, but it is good to know what activities happen behind the scenes. The worker:</p><ul><li>Connects to the stream.</li><li>Enumerates the shards within the stream.</li><li>Coordinates shard associations with other workers (if any).</li><li>Instantiates a record processor for every shard it manages.</li><li>Pulls records from the stream.</li><li>Pushes the records to the corresponding record processor.</li><li>Checkpoints processed records.</li><li>Balances shard-worker associations when the worker instance count changes.</li><li>Balances shard-worker associations when shards are split.</li></ul><p>DynamoDB writes data into shards (based on the partition key). Each shard is open for writes for 4 hours and open for reads for 24 hours. Essentially, the KCL worker subscribes to this stream, pulls records from it, and pushes them to the record processor implementation that we provide. KCL allows one worker per shard, and the data lives in the stream for 24 hours. These are important limits to remember. We will discuss the throughput and latency of stream processing in a bit.</p><p><strong>Worker configuration</strong></p><p>So far we know that we need a KCL worker with the right configuration and a record processor implementation that processes the stream and does the checkpointing. How do we actually go about it?</p><p>Let’s say we have 4 DynamoDB tables whose data needs to be indexed in ElasticSearch. Each table produces a stream, identified by its streamArn. Now we need 4 KCL workers, one for each stream.
<a href="https://gist.github.com/mkurian/78c9d9fd53c56cbf732c91a16bfadc9d">Here is a sample.</a> Most values can be left as defaults, except the AWS credentials and the identifiers of the stream and worker.</p><p><strong>StreamRecordProcessor implementation</strong></p><p>KCL requires us to provide a StreamRecordProcessorFactory implementation to actually process the stream. Details are in the docs: <a href="https://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-implementation-app-java.html" rel="noopener nofollow">https://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-implementation-app-java.html</a></p><p>Provide implementations for IRecordProcessor and IRecordProcessorFactory. Refer to <a href="https://github.com/aws/aws-sdk-java/blob/master/src/samples/AmazonKinesis/AmazonKinesisApplicationSampleRecordProcessor.java" rel="noopener nofollow">https://github.com/aws/aws-sdk-java/blob/master/src/samples/AmazonKinesis/AmazonKinesisApplicationSampleRecordProcessor.java</a></p><p><strong>override fun </strong>processRecords(processRecordsInput: ProcessRecordsInput) {<br>    processRecordsWithRetries(processRecordsInput.<em>records</em>)<br>    checkpoint(processRecordsInput)<br>}</p><p><strong><em>processRecordsWithRetries</em></strong><em>: </em>This is where the stream processing logic lives. In our specific case, we generate an id for the document based on the keys in the DynamoDB table and create an index/delete request in ElasticSearch. Note that it is advantageous to use bulk indexing in ElasticSearch to reduce round-trip time, thereby increasing throughput and reducing the latency for data to appear in ElasticSearch. At the rate of indexing a few hundred records every second, I have seen them appear in ElasticSearch within 200 ms.
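As one illustration of what the retry side of a processRecordsWithRetries helper might look like, here is a minimal sketch. RetryableException, the indexer, and the error handler are hypothetical names of my own, not types from KCL or the ElasticSearch client:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a retry strategy for a record-processing step: retry retryable
// failures a few times with backoff, and hand permanently failing records to
// an error handler rather than letting the exception escape (KCL would
// swallow it and move on to the next batch, losing the records).
public class RetryingProcessor {
    static class RetryableException extends RuntimeException {}

    interface Indexer { void indexBatch(List<String> records); }
    interface ErrorHandler { void onPermanentFailure(List<String> records, Exception cause); }

    static final int MAX_ATTEMPTS = 3;

    static void processWithRetries(List<String> records, Indexer indexer, ErrorHandler errors) {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                indexer.indexBatch(records);
                return; // success
            } catch (RetryableException e) {
                if (attempt == MAX_ATTEMPTS) {
                    errors.onPermanentFailure(records, e); // e.g. log or dead-letter
                    return;
                }
                try {
                    Thread.sleep(100L * attempt); // simple linear backoff
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    }

    public static void main(String[] args) {
        List<String> batch = new ArrayList<>(List.of("record-1", "record-2"));
        processWithRetries(batch,
            records -> System.out.println("indexed " + records.size() + " records"),
            (records, cause) -> System.err.println("dead-lettering " + records.size() + " records"));
    }
}
```

A non-retryable error type (such as a document that can never match the mapping) would skip the loop entirely and go straight to the error handler.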
Note that KCL absorbs any exception thrown from processRecords and moves on to process the next batch of events. So it is critical to have an effective exception handling strategy: one that retries retryable errors (intermittent technical glitches) and another that handles non-retryable errors (e.g., a document that is invalid with respect to the ElasticSearch mapping).</p><p><strong><em>checkPoint</em></strong><em>: </em>This is the mechanism the KCL worker uses to keep track of how much data it has read from the stream. In case the worker terminates or the application restarts, it catches up from the point where it last checkpointed in the stream. This is similar to committing offsets in Kafka.</p><h2 id="code-samples-references">Code samples &amp; References</h2><p>AWS documentation on using KCL to process DynamoDB Streams is here: <a href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.KCLAdapter.html" rel="noopener nofollow">https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.KCLAdapter.html</a></p><p>Here is some sample code from the docs that gets one started on record processing:</p><p><a href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.KCLAdapter.Walkthrough.html" rel="noopener nofollow">https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.KCLAdapter.Walkthrough.html</a></p><h2 id="throughput-and-latency"><strong>Throughput and Latency</strong></h2><p>What we have done so far will create a single worker to process the stream. What if that is not enough? We can determine whether we need more worker threads based on the amount of writes to both DynamoDB and ElasticSearch. There are 2 ways to compare:</p><ul><li>Analyze the number of DynamoDB writes per minute and compare that to ElasticSearch writes.</li><li>Instrument logging to trace a single record through the entire pipeline, both DynamoDB and ElasticSearch.
Monitoring a single item thus also shows how much lag there is for a record to move from DynamoDB to ElasticSearch.</li></ul><p>If the application writes a few hundred records at a time to DynamoDB, one worker is usually enough. It also depends on how well distributed the partition key is. Let’s say we find that it takes several minutes for the data to appear in ElasticSearch after it is written to DynamoDB. In such a case, the first parameter to examine is <em>streamConfig.batchSize</em> in the configuration above.</p><p><strong>KinesisClientLibrary::maxRecords</strong></p><p>If your application writes thousands of items to DynamoDB, there is no point in keeping <em>maxRecords</em> low, e.g., 100. A high number (default: 1000) will definitely improve the throughput, and therefore the latency of your data appearing in ElasticSearch. There is no reason to lower this value in most cases.</p><p>There will also be cases when you have high-throughput writes (i.e., several thousand writes per second) on your DynamoDB tables. In such cases a single worker is not going to be enough. We will discuss scaling up stream processing using KCL workers in <a href="https://staticfinal.org/dynamodb-stream-processing-scaling-it-up/">the next post in this series</a>.</p>]]></content:encoded></item><item><title><![CDATA[Platform Migration is hard – Data Migration is even harder]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>In my last <a href="https://staticfinal.org/application-rewrite/">blog</a>, I talked about application rewrites. In this post, I would like to focus on the Platform rewrite, which is a flavor of application rewrite. The motivations for a 'Platform rewrite' can be many. They might include:</p>
<ul>
<li>A decision to switch from monolithic architecture to micro-services. For this post,</li></ul>]]></description><link>https://staticfinal.org/platform-migration-is-hard-data-migration-is-even-harder/</link><guid isPermaLink="false">5f7a78657393b43baa8d57b9</guid><category><![CDATA[engineering]]></category><category><![CDATA[software]]></category><category><![CDATA[migration]]></category><dc:creator><![CDATA[George Chiramattel]]></dc:creator><pubDate>Sun, 31 Mar 2019 01:06:02 GMT</pubDate><media:content url="https://staticfinal.org/content/images/2019/03/dataMigration.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://staticfinal.org/content/images/2019/03/dataMigration.jpg" alt="Platform Migration is hard – Data Migration is even harder"><p>In my last <a href="https://staticfinal.org/application-rewrite/">blog</a>, I talked about application rewrites. In this post, I would like to focus on the Platform rewrite, which is a flavor of application rewrite. The motivations for a 'Platform rewrite' can be many. They might include:</p>
<ul>
<li>A decision to switch from a monolithic architecture to micro-services. For this post, I consider this a Platform Migration.</li>
<li>A decision to change the storage architecture, such as migrating from an RDBMS store to a NoSQL store. That, in my opinion, is an example of a Platform rewrite that includes Data Migration.</li>
</ul>
<h3 id="whatisinvolved">What is involved?</h3>
<p>Let's look at what is involved in Platform and Data migration.</p>
<h4 id="platformmigration">Platform Migration</h4>
<p>If we are able to reproduce the current snapshot of the API surface and its behavior in the target system, then we can say that the bulk of platform migration is done. We can rerun integration tests after wiring in the new system to ensure correctness.</p>
<h4 id="datamigration">Data Migration</h4>
<p>As a system matures, its API and business logic evolve. For business logic migration to be considered complete, we have to faithfully capture the latest snapshot of this system. The storage system is where all these historical changes still exist. Think of this like layers of sediment at the bottom of a lake. The data migration task has to deal with all these layered sediments.</p>
<p>This means that data migration is where the team will spend the most time. Let me explain this in detail.</p>
<p>When we attempt a data migration, a large portion of it will succeed, but a significant subset will fail. The team will have to analyze the failures, fix the current code to accommodate them, and run the migration again. This is repeated many times till <strong>all</strong> data is migrated.</p>
<p><img src="https://staticfinal.org/content/images/2019/03/bouncingBall.png" alt="Platform Migration is hard – Data Migration is even harder"></p>
<p>The best way to think of it is to imagine dropping a bouncing ball. The distance it travels is not just the height from which it was dropped; it also includes the sum of all the bounces before the ball comes to rest.</p>
<h3 id="whyisthissignificant">Why is this significant?</h3>
<p>The hidden cost of data migration can cause us to significantly underestimate the time required to switch to the new system. During the transition period, new features should be released to both the classic system and the new system simultaneously.</p>
<h3 id="biggestarchitecturalconcern">Biggest Architectural concern</h3>
<p>To me, architecture is about modeling a system such that it remains malleable to change. In this view, an 'architectural concern' is that <em>thing</em> which is most difficult to change. It is common to see architectural concerns shift through the lifetime of an application. 'Tight coupling' and hard-to-change 'external dependencies' are early-stage architectural concerns.</p>
<p>Most <em>successful</em> enterprise applications collect a lot of data over time. It can grow to a point where it becomes the most difficult thing to change. So, over time, the size of the data becomes the biggest <em>architectural concern</em>. This is reflected in the fact that any significant change to it (migration, for example) is really hard to do.</p>
<h3 id="hierarchyofconcerns">Hierarchy of concerns</h3>
<p>In our team, we keep the above point in mind when we go about making 'architectural' decisions. Most applications can easily go through a UI refresh - and it should take a relatively small amount of time to finish. Look at the pace of change in front-end development. Frameworks and libraries change very often.<br>
The business logic layer is tougher to change. It is nothing to be trivialized - but it is doable in a reasonable amount of time.<br>
For a system that has grown relatively big, data migration can take the longest amount of time. Keep this in mind when making decisions. Think about your data very carefully. When we start a new project, changing the data structure is relatively easy. Not so once the data grows. This realization should influence our hierarchy of concerns.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Application Rewrite]]></title><description><![CDATA[<p>In my experience, every successful application will eventually reach a stage, when it becomes worthwhile contemplating, if it is a better investment to continue to improve on the existing codebase or to rewrite the whole application. The gravity of this decision is huge, as the consequence of getting this wrong</p>]]></description><link>https://staticfinal.org/application-rewrite/</link><guid isPermaLink="false">5f7a78657393b43baa8d57b8</guid><category><![CDATA[engineering]]></category><category><![CDATA[software]]></category><dc:creator><![CDATA[George Chiramattel]]></dc:creator><pubDate>Thu, 28 Mar 2019 18:35:25 GMT</pubDate><media:content url="https://staticfinal.org/content/images/2019/03/background-bit-business-2004161--1-.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://staticfinal.org/content/images/2019/03/background-bit-business-2004161--1-.jpg" alt="Application Rewrite"><p>In my experience, every successful application will eventually reach a stage, when it becomes worthwhile contemplating, if it is a better investment to continue to improve on the existing codebase or to rewrite the whole application. The gravity of this decision is huge, as the consequence of getting this wrong can be disastrous. Many experts have <a href="https://www.joelonsoftware.com/2000/04/06/things-you-should-never-do-part-i/">pointed out</a> that this exercise is not even worth considering. It is good to learn from projects that succeeded or failed. Every project is unique and patterns that work in one organization might not translate well into other companies. But it is worthwhile looking at the available literature. 
Here are some:</p><ul><li><a href="https://medium.com/@herbcaudill/lessons-from-6-software-rewrite-stories-635e4c8f7c22">Lessons from 6 software rewrite stories</a></li><li><a href="https://labs.spotify.com/2019/03/25/building-spotifys-new-web-player/">Building Spotify’s New Web Player</a></li><li><a href="https://signalvnoise.com/posts/3856-the-big-rewrite-revisited">The Big Rewrite, revisited</a></li></ul><h2 id="why-should-we-consider-the-option-of-rewrite">Why should we consider the option of rewrite?</h2><p>Software is supposed to be infinitely malleable - so we should be able to change any piece of software - right? And only successful software becomes legacy - so that is valuable as well. And, if the current system is not broken, why fix it? All these are important considerations while contemplating an application rewrite. </p><p>Also, I want to draw a distinction between refactoring and rewriting. All well-managed products should invest in constant refactoring of their codebases. Failure to do so shortens the shelf life of the application and makes it riper for a rewrite. In spite of constant refactoring, there could be many reasons to consider an application rewrite. Some of the most common reasons are:</p><ul><li>Existing code has become too difficult to handle and extend. Even simple changes have cascading unintended consequences. This makes it difficult to add new features and increases the cost of maintaining the current codebase. This could result in a system that is changing slower than its competition. Even for a market-leading application, this could mean slow death as <a href="https://web.archive.org/web/20151212054843/https://www.inc.com/magazine/20091101/does-slow-growth-equal-slow-death.html">slow growth equal slow death</a>.</li><li>It becomes increasingly difficult to motivate newer engineers. General enthusiasm decreases. 
If a happy team delivers good products, the opposite is also true.</li><li>As the product evolves, design decisions that were appropriate at the time can become liabilities now (historical design mistakes). Invariants built into the system make it nearly impossible to repurpose it when we find newer approaches.</li></ul><h2 id="let-s-look-at-some-reasons-why-software-rewrites-fail">Let’s look at some reasons why software rewrites fail</h2><p>These are some high-level patterns that we have to keep in mind:</p><ul><li><strong>Underestimated effort</strong>: The crufty-looking parts of the application’s codebase often reflect corner cases and weird bugs. An immature approach to estimation can overlook the many man-years of development that went into creating the current software. Underestimating the project can lead to the initiative failing because of cost overruns. </li><li><strong>Setup for failure</strong>: For a project of this magnitude, the team should start off by setting appropriate expectations with the business stakeholders. The team should <strong>not</strong> fall into the trap of overpromising to justify the cost of the rewrite. </li><li><strong>Morale effects</strong>: A project rewrite can split the current team. Given that domain knowledge is not evenly distributed among team members, splitting up the team can cause resource issues that are not anticipated.</li><li><strong>Elitist mentality</strong>: This can also have the unintended consequence of splitting the team into cool guys and legacy members.</li></ul><!--kg-card-begin: markdown--><ul>
<li><strong>Unrealistic Constraints</strong>: Usually, software rewrites are sold to business stakeholders with the expectation that, at the end, the current system can be switched off and all users will be on the newer system. This way of positioning the rewrite will set up the initiative for failure because of the following constraints:
<ul>
<li><strong>Feature parity</strong>: The project is never done until the target system achieves feature parity with the current system. This applies even to features that have not aged well but that customers have grown used to. Customers are used to a certain way of working in the current system and don't want their cheese moved. This hinders innovation on the target architecture.
<ul>
<li>If you are not in a saturated market, and if the current product does not appeal to a new-generation user, then sticking to existing features and workflows to satisfy the current user base will guarantee that the product dies in obsolescence - a “Golden Handshake”.</li>
</ul>
</li>
<li><strong>Moving target</strong>: Usually massive rewrites are multi-year initiatives and the current product continues to evolve. This makes the goal of achieving feature parity that much more difficult.</li>
<li><strong>Fewer resources for the existing product</strong>: Splitting the team leaves fewer resources for both the existing product and the new initiative.</li>
<li><strong>High burden of correctness</strong>: This is an easily overlooked aspect. The current product has benefited from incremental releases that were tested and validated in the field, with corner cases fixed along the way. A completely new system, with all the existing features implemented anew, has to short-circuit this otherwise evolutionary release process. The entire release has to be correct on day one, which in reality will never happen. This means that at launch time, the system will be perceived as defective and less stable.</li>
<li><strong>Migrate Data</strong>: In my opinion, this is the biggest cost of rewrites. This will be the topic of a new post. Watch this space.</li>
</ul>
</li>
</ul>
<!--kg-card-end: markdown--><h2 id="possible-solutions">Possible Solutions</h2><p>In my opinion, the best way to fight product obsolescence is to disrupt the current product from within. Instead of rewriting the current product, create something innovative in a related category. This is similar to what BaseCamp did. Instead of sunsetting BaseCamp, the team committed to perpetually supporting the current product as BaseCamp Classic, and they launched BaseCamp 2. There were no guarantees on upgrades. Existing users could stay and would be fully supported. While doing a rewrite, it's best not to burden yourself with existing users. This gives the team immense freedom:</p><ul><li>Freedom to drop features without backlash from current users, unconstrained by influential existing customers.</li><li>Freedom to launch a Minimum Viable Product (no need for feature parity). Launch a new product optimized for current non-consumption and learn from it.</li><li>And above all, freedom to <strong>not </strong>migrate data.</li></ul><p>For BaseCamp, this model was so successful that they repeated the pattern and created BaseCamp 3 as well. Another industry example is the case of Visual Studio and Visual Studio Code. </p><p>The Gmail team did something slightly different. They came up with ‘Inbox’, a different take on Gmail, but decided to keep the same backend. This meant that users could potentially switch between the two experiences, and it eliminated the need for data migration (remember, that is certainly a big burden). But it also put significant constraints on the amount of change they could introduce. </p><h2 id="conclusion">Conclusion</h2><p>An enterprise application rewrite is non-trivial. You have better odds launching a new product that is optimized for new customers. It might make sense to split features into micro-services and let the newer product use them. 
This also gives an opportunity to add these capabilities to the classic product and avoid data migration.</p>]]></content:encoded></item><item><title><![CDATA[Project Lombok & Spring's @Qualifer annotation]]></title><description><![CDATA[<p>When <a href="https://twitter.com/hoserdude">@hoserdude</a> &amp; I stood up QuickBooks Self-Employed's backend almost 4 years ago, Spring Boot was the new shiny thing offering RoR type productivity in the Java world. We fell in love with Spring's Dependency Injection via the <code>@Autowired</code> annotation - we used the annotation liberally, primarily on fields. I</p>]]></description><link>https://staticfinal.org/project-lombok-spring-qualifier/</link><guid isPermaLink="false">5f7a78657393b43baa8d57b7</guid><category><![CDATA[engineering]]></category><category><![CDATA[software]]></category><category><![CDATA[spring]]></category><category><![CDATA[dependencyinjection]]></category><category><![CDATA[random]]></category><dc:creator><![CDATA[Shrisha Radhakrishna]]></dc:creator><pubDate>Mon, 18 Feb 2019 21:04:15 GMT</pubDate><content:encoded><![CDATA[<p>When <a href="https://twitter.com/hoserdude">@hoserdude</a> &amp; I stood up QuickBooks Self-Employed's backend almost 4 years ago, Spring Boot was the new shiny thing offering RoR type productivity in the Java world. We fell in love with Spring's Dependency Injection via the <code>@Autowired</code> annotation - we used the annotation liberally, primarily on fields. I sold many developers on this technique until we started seeing more &amp; more instances of class bloat and developers inadvertently creating instances in invalid state<a href="https://twitter.com/odrotbohm">s.</a></p><p><a href="https://twitter.com/odrotbohm">@odrotbohm</a>'s "Why field injection is evil" <a href="http://olivergierke.de/2013/11/why-field-injection-is-evil/">post</a> explains the issues really well.</p><p>Anyway, I'm a convert now and avoid field injection. 
Even though our codebase is 4+ years old, I like to think we've been diligent about paying off tech debt and keeping dependencies, patterns, and designs up to date - thanks in no small part to <a href="https://twitter.com/lerocha">@lerocha</a>. However, we had a few nasty @Components that Autowired <code>@Qualifier</code> fields in; moving these classes to constructor injection meant hand-writing constructors, as Lombok's <a href="https://projectlombok.org/features/constructor">made-to-order-constructors feature </a>didn't handle qualified beans. The thought of losing the boilerplate-reducing abilities of Lombok meant developers stuck with field injection, and the pattern in fact perpetuated itself.</p><p>I rejoiced when I saw Lombok's <code>copyableAnnotations</code> feature in <code>1.18.4</code>. It was as simple as adding a <code>lombok.config</code> in the project root with:</p><!--kg-card-begin: markdown--><p><code>lombok.copyableAnnotations += org.springframework.beans.factory.annotation.Qualifier</code></p>
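<p>To make this concrete, here is a minimal sketch of what the setting enables, assuming Spring and Lombok 1.18.4+ are on the classpath; the class and bean name below are illustrative, not from our codebase:</p>

```java
import javax.sql.DataSource;

import lombok.RequiredArgsConstructor;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.stereotype.Component;

// With lombok.copyableAnnotations configured, the @Qualifier below is
// copied onto the parameter of the Lombok-generated constructor, so
// Spring injects the matching bean - no hand-written constructor needed.
// "reportsDataSource" is a hypothetical bean name.
@Component
@RequiredArgsConstructor
public class ReportService {
    @Qualifier("reportsDataSource")
    private final DataSource dataSource;
}
```

<p>Lombok generates the equivalent of <code>public ReportService(@Qualifier("reportsDataSource") DataSource dataSource)</code>, which is exactly what we previously had to write by hand.</p>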
<!--kg-card-end: markdown--><p>Boom! We are now able to move all these legacy classes to constructor injection. No excuses!</p><p>This <a href="https://ath3nd.wordpress.com/2018/12/13/spring-lombok-or-injection-just-became-a-bit-easier-part-2-of-2/">post</a> does a nice job of explaining via an example.</p>]]></content:encoded></item><item><title><![CDATA[The role of a Software Engineering Manager]]></title><description><![CDATA[<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1531482615713-2afd69097998?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=2000&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" class="kg-image" alt srcset="https://images.unsplash.com/photo-1531482615713-2afd69097998?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=600&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ 600w, https://images.unsplash.com/photo-1531482615713-2afd69097998?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1000&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ 1000w, https://images.unsplash.com/photo-1531482615713-2afd69097998?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1600&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ 1600w, https://images.unsplash.com/photo-1531482615713-2afd69097998?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=2400&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ 2400w" sizes="(min-width: 720px) 720px"><figcaption>Photo by <a href="https://unsplash.com/@nesabymakers?utm_source=ghost&utm_medium=referral&utm_campaign=api-credit">NESA by Makers</a> / <a href="https://unsplash.com/?utm_source=ghost&utm_medium=referral&utm_campaign=api-credit">Unsplash</a></figcaption></figure><p>This is an opinionated view based on my observations as an Engineer and a people leader in sizable organizations over the past couple of decades. 
I have had the privilege of learning from great managers &amp; colleagues. I have also run away from toxic</p>]]></description><link>https://staticfinal.org/the-role-of-a-software-engineering-manager/</link><guid isPermaLink="false">5f7a78657393b43baa8d57b6</guid><category><![CDATA[engineering]]></category><category><![CDATA[software]]></category><category><![CDATA[manager]]></category><dc:creator><![CDATA[Shrisha Radhakrishna]]></dc:creator><pubDate>Mon, 18 Feb 2019 06:25:51 GMT</pubDate><content:encoded><![CDATA[<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1531482615713-2afd69097998?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=2000&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" class="kg-image" alt srcset="https://images.unsplash.com/photo-1531482615713-2afd69097998?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=600&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ 600w, https://images.unsplash.com/photo-1531482615713-2afd69097998?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1000&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ 1000w, https://images.unsplash.com/photo-1531482615713-2afd69097998?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1600&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ 1600w, https://images.unsplash.com/photo-1531482615713-2afd69097998?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=2400&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ 2400w" sizes="(min-width: 720px) 720px"><figcaption>Photo by <a href="https://unsplash.com/@nesabymakers?utm_source=ghost&utm_medium=referral&utm_campaign=api-credit">NESA by Makers</a> / <a href="https://unsplash.com/?utm_source=ghost&utm_medium=referral&utm_campaign=api-credit">Unsplash</a></figcaption></figure><p>This is an opinionated view based on my observations as an Engineer and a people leader in sizable organizations over the 
past couple of decades. I have had the privilege of learning from great managers &amp; colleagues. I have also run away from toxic environments that sucked the happiness out of everyone, including the people producing the toxins.</p><p>If someone tells you that they have complete clarity on the role of an Engineering Manager, they would be lying. Simply put, this is one of the toughest roles out there - one that requires you to operate at the intersection of people, process, product, and technology.</p><h2 id="team">Team </h2><h3 id="create-an-environment-of-fun-learning-collaboration">Create an environment of fun, learning, &amp; collaboration</h3><p>This is hard work. Great teams love <strong>teaching</strong>, are <strong>diverse</strong>, and every team member <strong>has each other’s back</strong>. If you haven’t already read Google’s research on what makes teams great, I encourage you to <a href="https://rework.withgoogle.com/blog/five-keys-to-a-successful-google-team/">check it out</a>. As managers, we have to keep a pulse on the overall strengths and weaknesses of the team so we can intervene and course correct appropriately - be it a tactical thing like offering training, or the more nuanced forming of groups based on complementary skills. Building trust across your team is foundational to creating a fun, learning, and collaborative environment. Creating a place where learning and failing are both recognized (positively) is essential.  </p><h3 id="hire-and-develop-diverse-talent">Hire and develop diverse talent </h3><p>Hiring is a critical component of building culture and great teams. You immerse yourself in the hiring process and look for people who can augment the culture, not just be “culture fits”. Looking for <strong>eager learners</strong> that bring passion and energy to the interview process is something you strive to do. 
</p><h3 id="teaching-to-fish-instead-of-fishing">Teaching to fish instead of fishing</h3><p>Too often we get ourselves into situations where any work related to “X” always goes to one person who’s heavily knowledgeable in an area. While this ends up delivering the fastest outcome for the task, we end up doing ourselves a disservice in at least three ways:</p><ol><li>We burden that person with always “owning” that particular area, which can put a strain on the person and can lead to somewhat reduced code quality over time (b/c only 1 person owns that area, hence they don’t have any ‘qualified’ reviewers)</li><li>We lose the opportunity to teach other engineers (who are often hungry for learning) about that area/technology.</li><li>That person burns out and/or loses all hope of gaining new knowledge.</li></ol><h2 id="individuals">Individuals</h2><p>I do not doubt for one second that routine processes to check in with our Engineers help. Monthly check-ins, 1:1s, etc are all valuable. But what’s most important is if you know what drives Jane and what’s stopping Joe from excelling. To get to this, I believe a few important things need to be talked about:</p><ul><li>Ensure every employee is taking on <strong>meaningful</strong>, and <strong>impactful</strong> work. If you find yourself just allocating work to someone, stop and look for underlying issues. 
Note that the definition of <em>meaningful</em> varies by level: A Junior Engineer on his first job may be super excited to take on updating content copy in a workflow whereas a more Senior Engineer may be looking for something that’s end-to-end (full-stack, perhaps) to broaden her horizon.</li><li>In your 1:1s, you ask specific questions.</li></ul><blockquote>What’s something that everyone is afraid of talking about?</blockquote><blockquote>What could we have done differently in project X?</blockquote><blockquote>What advice do you have for me before we start this new thing?</blockquote><ul><li>You focus on personal development goals. Public speaking, creating tech presentations, opportunity to be mentors are all laudable goals and you are in a unique position to broker connections.</li></ul><h3 id="servant-leadership">Servant leadership</h3><p>When you are passionate on behalf of your Engineers, whether it be for the project, for developer productivity, or for their own time to “do what’s right”, this creates so much trust and appreciation.</p><h3 id="talkers-vs-doers-what-you-recognize-reward-and-praise-matters">Talkers vs. Doers: what you recognize, reward, and praise matters </h3><p>A big component of a healthy culture relies on you keeping the folks who “talk more and do less” under control. In order to do this, you have to get a pulse of `real` work that is happening. 🗣</p><p>As teams grow and charters expand, it’s easy to encourage (and even rely on) story-telling, to a fault. And if that behavior gets rewarded (because you are unable to distinguish between what is being said vs. done), this can be a big drain on the entire team.</p><h2 id="your-peers">Your peers</h2><p>For us and the initiatives we undertake to succeed, we need to help our peers. We lean on each other and continually learn from each other. 
Give each other candid feedback, call each other out, bring contentious topics to your group to discuss - but, avoid talking behind people’s backs.</p><p>Remember that your <em><strong>Team #1</strong></em> is comprised of your peers.</p><figure class="kg-card kg-embed-card"><iframe src="https://player.vimeo.com/video/219540743?app_id=122963" width="640" height="360" frameborder="0" title="Team #1" allow="autoplay; fullscreen" allowfullscreen></iframe></figure><h2 id="process">Process</h2><ul><li>You strive to create processes that add value and don’t become overhead. </li><li>You acknowledge that what works and is needed today may become obsolete tomorrow.</li><li>You try to create processes in a way that allows Engineers to focus on the most important part of their job: Building. Creating. Designing. 15 minute blocks of coding amounts to nothing. Four 1-hour blocks in a day are not bad, a solid 4-hour block is great! ⏱</li><li>Processes and the environment allow junior engineers to have singular focus while expecting more senior devs to multi-task and context-switch.</li><li>You do not create meetings in order for you to remain informed. You remain informed by sitting with your engineers and the working team. </li><li>You empower teams to make decisions - but, you're not afraid to break ties.</li><li>You create a culture where people disagree, but don’t get stuck debating. You help make decisions.</li><li>You embrace the <em><strong>“It’s not my fault, but it’s my problem”</strong></em> mindset. (e.g., you take on a bug fix in a code repo that you don’t “own”).</li></ul><h2 id="product">Product</h2><ul><li>Use your products. As an employee previously in a telecom company and now a Small Business software company, this is easier said than done. Not just the product that you and your engineers work on.</li><li>Report bugs. 
Follow up.</li><li>Play with competitors' products.</li></ul><h2 id="technology">Technology</h2><p>These days, frameworks are a dime a dozen and the technology landscape is changing rapidly. You simply cannot build expertise in all the areas where your Engineers are involved. My recommendation is to pick an area in a given month (if a month is too short, make it a quarter) and go as deep as you can. If you only stay broad, you’ll find it harder to stay in touch with the code and the decisions, imho. </p><p>Other components of building a great team and a technology-driven culture include:</p><ul><li>Always pay attention to performance. Digging deep on gnarly performance issues often leads you and the team to learn many hidden corners of your system, subsystems, and dependent services.</li><li>Build the culture of “Ship &amp; Re-factor continually”. As a manager, if you feel like you don’t have the bandwidth to take on an item in a critical path, dive into a re-factoring project/task that doesn’t have the same time pressures.</li><li>Write unit tests. Celebrate great unit tests! Unit tests are a great way to stay in touch with not only the code, but also the core workflow!</li><li>Foster the open source culture (both “internal open source” and true open source). This is easier said than done, as time pressures lead us to take shortcuts that keep libraries, utilities, tools, and SDKs in the same git repo as our product.</li><li>Institute a healthy Pull Request review process that works for your team.</li><li>Be the champion of Developer Productivity: Happy Developers build great products. A developer will join a team where a build takes 30 seconds over a team where a build takes ½ hr over a team where someone from SCM has to kick off a build. I subscribe to the philosophy that you’ve gotta be able to run a scaled-down production system on your laptop.</li><li>Push for Continuous Delivery/Continuous Operations. 
Immerse yourself in CD/CO and help your team get to NO-DRAMA releases.</li><li>Argue &amp; fight about Database design, be more forgiving on higher layers. Mistakes at the foundational layers can cost dearly, but you should encourage your Engineers to take more risks as they go up the stack.</li></ul><p></p>]]></content:encoded></item></channel></rss>