Milen's Writings

AppKit State Restoration on macOS 14 Sonoma

Wed, 20 Dec 2023 00:00:00 +0000


AppKit state restoration behaviour changed in a subtle way on macOS 14 Sonoma, which can lead to apps not restoring their state correctly. The resulting breakages can be silent and hard to debug.

Behavioural Changes

The AppKit release notes for macOS 14 state:

Secure coding is automatically enabled for restorable state for applications linked on the macOS 14.0 SDK. Applications that target prior versions of macOS should implement NSApplicationDelegate.applicationSupportsSecureRestorableState() to return true so it’s enabled on all supported OS versions.

As usual, to preserve backwards compatibility for existing apps, the behavioural changes only apply to apps linked against the latest SDK.

Consequences

The release notes don’t make it immediately clear what the consequences of automatically enabling secure coding might be. In practice, it means that secure coding violations can now occur. The response to such violations depends on the value of NSCoder’s decodingFailurePolicy property, which can be one of:

NSDecodingFailurePolicyRaiseException: A failure policy that directs the coder to raise an exception.
NSDecodingFailurePolicySetErrorAndReturn: A failure policy that directs the coder to capture the failure as an error object.

The decodingFailurePolicy property is read-only and, since the NSCoder object is created by the framework, it’s outside of our control. AppKit uses NSDecodingFailurePolicySetErrorAndReturn as the default policy for state restoration.

The documentation states:

On decode failure, the NSCoder will capture the failure as an NSError, and prevent further decodes (by returning 0 / nil equivalent as appropriate).

Thus, after a secure coding violation, subsequent decoding operations would silently fail.

Secure Coding Violations

The NSSecureCoding docs demonstrate the canonical secure coding violation:

Historically, many classes decoded instances of themselves like this:

id obj = [decoder decodeObjectForKey:@"myKey"];
if (![obj isKindOfClass:[MyClass class]]) { /* ...fail... */ }

This technique is potentially unsafe because by the time you can verify the class type, the object has already been constructed, and if this is part of a collection class, potentially inserted into an object graph.

So any usages of -[NSCoder decodeObjectForKey:] would trigger a secure coding violation. The docs for NSCoder.decodingFailurePolicy provide further examples:

A decode call can fail for the following reasons:

…snip…

A secure coding violation occurs. This happens when you attempt to decode an object that doesn’t conform to NSSecureCoding. This also happens when the encoded type doesn’t match any of the types passed to decodeObject(of:forKey:).

How to Debug

Violations can now arise in any -restoreStateWithCoder: implementation, so every such implementation needs to be audited:

Check for any use of -[NSCoder decodeObjectForKey:].
- Replace it with the appropriate secure variant (e.g., -decodeObjectOfClass:forKey:).
At the end of -restoreStateWithCoder:, check the value of the NSCoder.error property.
- If it’s non-nil, a decoding failure must have occurred earlier.


Premature Optimization: Universally Misunderstood

Mon, 28 Aug 2023 00:00:00 +0000

“Premature Optimization”

You might have come across the famous software optimisation quote popularised by Donald Knuth:

Premature optimization is the root of all evil.

– Sir Tony Hoare

It has been commonly interpreted as “don’t think about performance in the beginning, you can fix any performance problem later”. This interpretation is completely and categorically wrong.

Original Quote

It’s very common for statements to lose their original meaning when context has been stripped, and that’s exactly what happened here.

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.

– Sir Tony Hoare

The short version is missing a crucial part: “small efficiencies”. At the time the quote was made, “small efficiencies” referred to techniques like minimising the number of instructions used.

With the additional context, the quote takes on a significantly different meaning: it’s making a statement only about micro-optimisations (“small efficiencies”), not about performance in general.

Hoare was simply advising against micro-optimisations without finding the hotspots first: the “premature” part refers to the lack of measurements.

Hotspot Optimisation: “Make it Fast Later” Fallacy

It’s quite tempting to adopt a “ship now, make it fast later” approach. While optimisations will improve performance, they won’t change the fundamental performance envelope. As tech stack fundamentals, access patterns, data dependencies and data structures are baked into a design, it’s not possible to hotspot your way out of a slow architecture.

Daniel Lemire perfectly explains this in Hotspot performance engineering fails:

Developers often believe that software performance follows a Pareto distribution: 80% of the running time is spent in 20% of the code. Using this model, you can write most of your code without any care for performance and focus on the narrow pieces of code that are performance sensitive.

…

Sadly, it does not work.

Charles Cook encounters this with a configuration client which performs a lot of DCOM calls:

Again, optimizing this code would require a lot of re-working: optimization after a design has been implemented nearly always involves much more work than incorporating it into the original design.

In Performance Excuses Debunked, Casey Muratori astutely notes that if optimising hotspots was the solution to serious performance problems, software wouldn’t have to get rewritten to make it faster:

If Facebook’s performance problems were concentrated into “hotspots”, why did they have to completely rewrite entire codebases? Why would they have to do a “ground up” rewrite of something if only a few hotspots were causing the problem? Why didn’t they just rewrite the hotspots? Why did they have to rewrite an entire compiler in a new language, instead of just rewriting the hotspots in that language? Why did they have to make their own compiler to speed up PHP and Hack, instead of just identifying the hotspots in those codebases and rewriting them in C for performance?

Designing for Performance

If you want to have fast software, you must think about performance from day one. This includes thinking about the tech stack, the architecture, the data access patterns, the data dependencies, the networking and how it all fits together.

Charles Cook summarises it quite well:

Its usually not worth spending a lot of time micro-optimizing code before its obvious where the performance bottlenecks are. But, conversely, when designing software at a system level, performance issues should always be considered from the beginning.

A good software developer will do this automatically, having developed a feel for where performance issues will cause problems. An inexperienced developer will not bother, misguidedly believing that a bit of fine tuning at a later stage will fix any problems.

A Revised Quote

When having conversations about performance in the future, I’ll be using a revised version:

Premature micro-optimization is the root of all evil.

And that one, I completely agree with.

Quote Origin (Update)

Alexander Keller reached out about the origin of the quote. It first appeared in Donald Knuth’s Structured Programming with go to Statements and Knuth later attributed the quote to Hoare:

Just FYI … after giving Knuth credit for this for years, I recently saw a place where Knuth attributes it to C.A.R. Hoare. From “The Errors of TeX” on page 276 of “Literate Programming” to Hoare:

“But I also knew, and forgot, Hoare’s dictum that premature optimization is the root of all evil in programming.”

Does anybody know where this statement first appeared?

Now, the interesting part is that Hoare did not actually claim to be the author of the quote. Both Bruce Eckel and Tony Hoare credit Edsger Dijkstra.

Dear Hans,

I’m sorry I have no recollection how this quotation came about. I might have attributed it to Edsger Dijkstra.

I think it would be fair for you assume it is common culture or folklore.

Tony.


Exploring Windows XP on macOS ARM64

Wed, 23 Aug 2023 00:00:00 +0000

Windows XP

I have a lot of fond memories of Windows XP and, feeling nostalgic, I was happy to find out that it’s relatively easy to take it for a spin on an ARM64 Mac using QEMU.

I decided to install XP and an assortment of classic software one final time, taking screenshots of the whole process. This included Firefox, Winamp, WinZip, mIRC, Borland Delphi and Visual C++ 6.0.

Sidenote: SerenityOS

If you’re feeling similarly nostalgic, definitely check out SerenityOS:

SerenityOS is a love letter to ’90s user interfaces with a custom Unix-like core. It flatters with sincerity by stealing beautiful ideas from various other systems.

The same people also started the Ladybird browser:

Ladybird is an ongoing project to build an independent web browser from scratch.

Andreas posts regular updates on YouTube which are a great way to marvel at the project’s impressive progress (e.g., SerenityOS update (July 2023)).

Installation

The instructions are based on Emulating Windows XP x86 under M1 Mac via UTM & QEMU.

Download, install and launch UTM
Download the Windows XP UTM template
1. Extract windows-xp-x64-utm.zip
2. Open the Windows XP.utm file
3. Click on the “CD/DVD” dropdown button and mount your installation CD .iso
Start the VM

Installing Additional Software

I couldn’t get directory sharing to work using SPICE tools, so instead I ended up creating an .iso with Firefox and all additional software I needed.

Download Firefox 52.9.0 ESR which is the latest version to run on Windows XP.
Create a folder named Software containing all the files you want to share.
Open Disk Utility.
From the menu bar, go to File → New Image → Image from Folder....
Select the Software folder.
For the Image Format: dropdown, select DVD/CD master.
Press Save. You will now have a Software.cdr file.
Create Software.iso from Software.cdr in the Terminal.
1. hdiutil makehybrid -iso -joliet -o Software.iso Software.cdr
Mount Software.iso in UTM.

Gallery

Enjoy the screenshots below and check out the extensive gallery for more.

AppKit vs SwiftUI: Stable vs Shiny

Thu, 10 Aug 2023 00:00:00 +0000

AppKit vs SwiftUI: The Question

When writing a native macOS app, developers need to decide which UI framework to write new code in. AppKit, whose origins date back 30yrs+, feels like a dinosaur soon to be retired, with the shiny SwiftUI waiting around the corner to take over.

Ghostty vs SwiftUI

Mitchell Hashimoto has been working on a new cross-platform terminal written in Zig and posted an update on the project’s progress. Notably, he talked about implementing non-native full-screen on macOS:

There is a feature amongst other macOS terminal emulators that is commonly called “non-native fullscreen.”

It turns out implementing this was a doozy. On the surface, it really is very simple: you programmatically modify some window attributes. If you Google around, the code samples to make this behavior happen are a dozen lines or so. But the Ghostty PR was +802/-239. What?

If you’ve been following Ghostty, I very proudly talk about how Ghostty is written in Zig but uses SwiftUI for macOS GUI. Ghostty was 100% SwiftUI (for the GUI): the main entrypoint for Ghostty.app was a SwiftUI App object. The issue is that non-native fullscreen requires subclassing NSWindow and with SwiftUI you just can’t (or, nobody has publicly figured out how).

So, in order to make non-native fullscreen work, we had to rip out the SwiftUI app and window lifecycle management and rewrite it ourselves using plain old AppKit. Note: we still use SwiftUI for the views, just not the window/app lifecycle management.

And it turns out that non-native fullscreen was not the only issue:

If we just wanted non-native fullscreen, this probably wouldn’t be worth it. But, we already had some other bugs or missing features looming because of SwiftUI and this gives us a path to fix all of them, so we decided it was worth it.

So, using SwiftUI left the product with bugs and missing features. But why are developers choosing SwiftUI if it’s the source of such problems?

To answer that, let’s rewind back to June 2022.

WWDC 2022: The SwiftUI Vision

At the Platforms State of the Union session during WWDC 2022, Josh Shaffer set out the vision for the future of the platforms:

The Objective-C language, AppKit and UIKit frameworks, and Interface Builder have empowered generations of developers. These technologies were built for each other and will continue to serve us well for a long time to come.

But over time, new abstractions become necessary.

A bit later in the session, he clearly communicated the direction, so that developers know where to invest:

The best way to build an app is with Swift and SwiftUI.

Unfortunately, that might not be quite true on macOS.

Vision vs Reality

It’s important to interpret WWDC statements within the context of the event. Long-term vision statements must be ambitious & unambiguous and, as a consequence, they might not match present-day reality. The evidence certainly supports the hypothesis that SwiftUI is simply not a complete replacement for AppKit today.

SwiftUI: A Thousand Papercuts

The problems with SwiftUI have been thoroughly documented by Michael Tsai, so I’ll include just a few examples.

Daniel Jilg:

The app is also hard to develop because we seem to be fighting against a host of bugs and unclear behaviours in SwiftUI on macOS, even in the newest versions.

Phillip Caudell:

Another day, another serious SwiftUI bug: List on macOS won’t select its initial value if the binding has an optional value type. Works on iOS, fails on macOS.

I am deeply regretting my decision to use SwiftUI for Big Mail 2.

Oskar Groth:

No macOS SwiftUI component has let me down as much as List. Just when you think you’ve got it working well, there is always some tiny issue relating to reorder, DisclosureGroup expansion, highlight or layout.

Steve Troughton-Smith:

Every WWDC from here on in I’ll be looking at from the perspective of ‘can you make better apps with SwiftUI vs not-SwiftUI?’. The answer right now is ‘no’, not for the apps I want to build and the platforms I want to build them for, but boy would it sure be nice to say yes and start using more of AppKit in my cross-platform apps.

macOS Major Versions

It seems that SwiftUI is breaking behaviour across major version upgrades as well. Steve Troughton-Smith on macOS 13 Ventura breaking SwiftUI layouts:

Another version of macOS, another set of SwiftUI layout changes that break my UI in some way.

And sometimes, you get important bug fixes and features which are only available on the latest version:

I couldn’t get my SwiftUI view to expand to fill up the entire superview that the NSHostingView was being added to.

But, there’s just one problem… this property was introduced in macOS 13 and I’m still targeting macOS 12.

SwiftUI: Case Studies

Another set of data points are case studies of app rewrites and greenfield projects. The experiences paint a consistent picture: SwiftUI accelerates parts of the workflow but it introduces non-trivial friction in others.

Case Study: Rewriting Remotion in SwiftUI

Rewriting Remotion in SwiftUI:

Development is faster, the app is more stable, and new teammates are ramping up faster thanks to the simpler code base.

We now think: If you are a macOS or iOS developer who hasn’t yet taken the plunge yet, now is a great time to start writing a new app using almost exclusively SwiftUI, and use its friends Combine and concurrency for the data flow.

While SwiftUI is all you’ll need for a basic Mac or iOS application, there are still quite a few gaps that will require you to partially make use of classic Cocoa views. In our code base, for example, we need some access to NSEvents, text input, and tweaking the first responder that just aren’t possible with pure SwiftUI.

Case Study: Wallaroo and SwiftUI on macOS

Wallaroo and SwiftUI on macOS

Counterintuitively, SwiftUI made hard things easy and easy things hard:

I was expecting stuff like that to take a lot of effort to get working on the Mac. Instead, it ran 100% out of the box with no modifications at all.

And the stuff that I was expecting to be easy, like a settings view, buttons, and menu commands, turned out to be hard.

And a very common experience for developers targeting both iOS and macOS:

You’re going to have platform-specific code. More than you realize: certainly more than I expected!

Case Study: timing.is

SwiftUI in Timing.is App

It took a few hours to fall in love with SwiftUI. So much so that we instantly decided to abandon a cross-platform codebase and go fully native on iOS.

Despite the regular friction, we still loved it. Because like any commitment, you must let the majority rule. It was fun at least 51% of the time. But let’s talk about the <= 49% that wasn’t.

AppKit: Maturity & Stability

AppKit is over 30yrs old, dating back to the early beginnings of NeXTSTEP. That’s a very long time to accumulate a set of APIs which have been refined over a large variety of use cases. It’s almost certain that any particular feature can be implemented. On the other hand, it’s simply not possible to replace 30yrs+ of accumulated APIs in just 4yrs (SwiftUI was released in 2019).

Because of its maturity, AppKit does not change often or significantly: it provides a stable foundation to build upon. Desktop OS innovation is quite slow as resources are focused on mobile and spatial computing. In turn, this means a lower likelihood of breaking changes in each major release and more time to focus on your product.

SwiftUI: Solving a Harder Problem

SwiftUI is tackling a much harder problem along multiple dimensions:

Declarative: SwiftUI adopts a new declarative paradigm vs AppKit’s imperative one. This requires changes to how solutions are expressed and how the APIs are designed.
Cross-platform: SwiftUI is designed to be cross-platform across Apple’s OSes. Cross-vendor cross-platform frameworks suffer from a variety of problems but Apple has an advantage here: it controls the full stack across all environments. Nevertheless, designing a UI framework that can scale all the way from a watch to a Mac is a non-trivial undertaking.

SwiftUI: A Rewrite

SwiftUI can be thought of as a unifying rewrite of AppKit and UIKit, so the usual rewriting caveats, risks and benefits apply.

In this particular case, a major problem is that the developers doing the rewrite are not the same developers who created and evolved AppKit/UIKit. This means that a lot of institutional knowledge and context has been lost and would have to be re-invented.

UIKit (Mac Catalyst)

While using UIKit on macOS is another option, Apple sees UIKit the same way¹ as AppKit: an API destined to be replaced by SwiftUI. This is not surprising at all, given that UIKit is a descendant of AppKit, so it shares AppKit’s imperative nature and was designed with Objective-C in mind.

Conclusion: SwiftUI on macOS?

To make an informed decision about the UI framework, we need to understand what’s driving the decision, what the use cases are and what the priorities are.

For example, if a product is being built, the question needs to be answered from the perspective of what’s best for the customers. End-users do not care which framework was used to solve their problems. Not fighting a brand new framework will save a lot of time that can be spent focusing on the product itself.

If you want to have some fun, play with a shiny new API or write an app in a new paradigm, then SwiftUI is the clear winner here. You will gain important skills that you will be able to leverage in future.

“macOS Apprentice” by Sarah Reichelt

In macOS Apprentice, Sarah tries to answer the same question:

To support old versions of macOS, use AppKit.

For long-form text editing or for thousands of records, use AppKit.

For existing AppKit apps, add SwiftUI gradually.

For everything else, start with SwiftUI and include AppKit as needed.

Final Words

In summary, think carefully about your use-cases and pick the framework that allows you to deliver the user experience you desire.

If you want to provide the best user experience, you might have to leverage the old and rusty AppKit, at least for a while longer. It’s always important to prioritise what’s best for your customers because in the long term, that’s in your best interest as well.

¹ As evidenced by the slides from WWDC 2022.

milen.me v2: Simpler, Faster

Thu, 04 May 2023 00:00:00 +0000

Why?

A lot has changed since I last updated my website. The current design dates back to July 2012, more than 10yrs ago. Not only has technology moved on since then, but what I value and how I approach engineering have changed as well.

While Jekyll has served me well, it was time to move to a more robust and actively maintained ecosystem that will hopefully last for another 10yrs.

Simplicity

An overarching theme in my life for the past decade has been the pursuit of simplicity over complexity, especially when it comes to software. I optimised the new design for the maximum signal-to-noise ratio:

Simpler Design: ~~header images~~, ~~custom web fonts~~.
~~Unimportant Content~~: e.g., not reflective of my values or personality.
~~Google Analytics~~: not needed, nor do I support the data collection.
~~JavaScript~~: removed bigfoot.js and its jQuery dependency.

Most pages are now between 5 and 15KiB (excluding images) and they load instantly.

History

For archival purposes, I saved screenshots of the old design: About, Writings, Software, Resume.

Performance: Faster or Fast?

Sat, 08 Apr 2023 00:00:00 +0000

TL;DR

When describing the performance of systems, always prefer unambiguous terms.

Use “as fast as” speed ratios and “as long as” time ratios.
Always use “faster” or “slower” when referring to speed, never time.
When referring to time in percentage delta terms, explicitly state “time increased/decreased by X%”.
Time ratio is time_new / time_baseline and speed ratio is time_baseline / time_new.

Performance Language & Ambiguity

As part of my work on the Buck2 build system, I deal with performance analyses on a daily basis. It’s extremely important to use precise and consistent language when describing the performance of systems, so that there’s no ambiguity in its interpretation.

For example, a statement like “X is 50% faster” is ambiguous, as it can be interpreted in two ways: the speed was increased by 50% or the time was decreased by 50% (those are not equivalent). As another example, a statement like “X is 150% faster” is also ambiguous because it can be interpreted as either “150% as fast as” (i.e., 50% faster) or a 150% increase in speed (i.e., “250% as fast as”).

“Faster” vs “As Fast As”

The terms faster/slower refer to the delta of the measurement against a baseline. Faster indicates a positive delta (e.g., 20% faster means +20% or +0.2x), while slower indicates a negative delta (e.g., 20% slower means -20% or -0.2x). For example, 20% faster means the new value is equal to 120% of the baseline.

“As fast as” refers to the ratio of the measurement against a baseline. For example, “1.2x as fast as” means that it’s “0.2x faster” (i.e., 20% faster). 1.2x can also be expressed as 120% in percentage terms. “As fast as” is also used when a measurement is lower, e.g., “0.8x as fast as” (i.e., “80% as fast as” or “20% slower”).

Speed vs Time

Time and speed are inversely proportional, so when time decreases, the speed increases. Crucially, a percentage time delta does not translate to the same percentage speed delta. For example, it’s incorrect to say that a 50% time reduction is the same as being 50% faster.

For instance, if the time decreases by 50% (i.e., is halved), the speed doubles: “2x/200% as fast as” (or equivalently “100% faster”).

Calculating Time

Assume you have measured the baseline time (time_baseline) and the new time (time_new).

The time_ratio is defined as time_new / time_baseline
The time_ratio_delta is defined as time_ratio - 1.0.

For example, with a time_baseline of 100 seconds and time_new of 80 seconds, the time ratio is 0.8x (80/100), thus the time ratio delta is -0.2x (0.8 - 1.0). We could present this as “0.8x as long as”, “80% as long as” or “time decreased by 20%”.
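The time calculation can be sketched in C as follows. This is a minimal illustration that reuses the post's names (time_baseline, time_new, time_ratio, time_ratio_delta); wrapping them in functions and the positivity check are assumptions made for illustration only.

```c
#include <assert.h>

/* time_ratio = time_new / time_baseline.
 * 0.8 reads as "0.8x as long as" (or "80% as long as"). */
double time_ratio(double time_baseline, double time_new) {
    assert(time_baseline > 0.0 && time_new > 0.0); /* durations must be positive */
    return time_new / time_baseline;
}

/* time_ratio_delta = time_ratio - 1.0.
 * -0.2 reads as "time decreased by 20%", +0.2 as "time increased by 20%". */
double time_ratio_delta(double time_baseline, double time_new) {
    return time_ratio(time_baseline, time_new) - 1.0;
}
```

With the numbers above, time_ratio(100.0, 80.0) returns 0.8 and time_ratio_delta(100.0, 80.0) returns approximately -0.2 (floating-point rounding makes it not exactly -0.2), i.e., “time decreased by 20%”.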

Calculating Speed

Assume you have measured the baseline time (time_baseline) and the new time (time_new).

The speed_ratio is defined as time_baseline / time_new.
The speed_ratio_delta is defined as speed_ratio - 1.0.

For example, with a time_baseline of 100 seconds and time_new of 80 seconds, the speed ratio is 1.25x (100/80), thus speed ratio delta is 0.25x (1.25 - 1.0). We could present this as “1.25x as fast as”, “125% as fast as” or “25% faster”.

Note that a 20% decrease in time (i.e., 20s out of 100s) led to a 25% increase in speed. This becomes clear when we show the derivation of speed_ratio, which is the inverse of time_ratio:

speed_baseline = distance / time_baseline
speed_new = distance / time_new

speed_ratio = speed_new / speed_baseline

speed_new / speed_baseline = (distance / time_new) / (distance / time_baseline)
                           = (distance / time_new) * (time_baseline / distance)
                           = time_baseline / time_new
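The speed calculation can be sketched the same way. Again, this is a hedged C illustration with hypothetical functions mirroring speed_ratio and speed_ratio_delta from the definitions above, not code from any particular tool.

```c
#include <assert.h>

/* speed_ratio = time_baseline / time_new, i.e., the inverse of the time ratio.
 * 1.25 reads as "1.25x as fast as" (or "125% as fast as"). */
double speed_ratio(double time_baseline, double time_new) {
    assert(time_baseline > 0.0 && time_new > 0.0); /* durations must be positive */
    return time_baseline / time_new;
}

/* speed_ratio_delta = speed_ratio - 1.0.
 * Positive values read as "X% faster", negative values as "X% slower". */
double speed_ratio_delta(double time_baseline, double time_new) {
    return speed_ratio(time_baseline, time_new) - 1.0;
}
```

For the example above, speed_ratio(100.0, 80.0) returns 1.25 and speed_ratio_delta(100.0, 80.0) returns 0.25 (“25% faster”), even though the time only decreased by 20%. Halving the time, speed_ratio(100.0, 50.0) returns 2.0, i.e., “2x as fast as” or “100% faster”, while speed_ratio(100.0, 125.0) returns 0.8, i.e., “0.8x as fast as” or “20% slower”.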

macOS Network Metrics Using sysctl()

Sun, 05 Mar 2023 00:00:00 +0000

TL;DR

Getting accurate network metrics on macOS is possible using sysctl() with NET_RT_IFLIST2.
Traffic metrics are exposed in units of 1KiB to prevent fingerprinting (only for 3rd party programs).
As of macOS Ventura 13.2.1, there’s a kernel bug which truncates traffic values at the 4GiB mark.
Using getifaddrs() only exposes 32bit fields, so it’s not a viable API on modern systems.

Updates

Mojo_66 pointed out that IFMIB_IFDATA does not suffer from truncation (code).

Network Metrics on macOS

As part of my work on the Buck2 build system, I needed a way to observe the network throughput of the system. After some research, the conclusion was to use sysctl() with NET_RT_IFLIST2: this provided access to 64bit metrics¹ which do not suffer from the overflow that affects the 32bit fields of the older APIs.

I wrote up a short sample program to quickly test the API. While the metrics for packets sent/received exactly matched the ones from Activity Monitor, the traffic metrics did not. I noticed two interesting behaviours:

The traffic metrics were always increasing in multiples of 1KiB.
The traffic metrics did not match the ones from Activity Monitor.

I decided to file a TSI with Apple before digging any further. Thankfully, Quinn resolved the mystery of the inaccurate numbers.

1KiB Units

If you looked at the traffic metrics, they would only ever increase in multiples of 1KiB. The reason for the behaviour is that the kernel applies batching to prevent malicious code from fingerprinting the system. This restriction applies only to 3rd party programs (i.e., not codesigned by Apple).

For example, if you were to copy the netstat binary and re-sign it with an ad-hoc code signature, you would observe the batching, while the Apple-signed binary works without any issues.

Inaccurate Numbers

Testing revealed that sometimes the API returned inaccurate traffic metrics. Upon further investigation, it became clear that the API truncates and wraps around the traffic metrics at the 4GiB mark. Again, this only affects 3rd party programs.

The behaviour is confirmed to be a bug in the kernel as of macOS Ventura 13.2.1 and it’s tracked as rdar://106029568.

Activity Monitor

Activity Monitor and nettop(1) use the private NetworkStatistics.framework to get network metrics. In the *OS Internals, Volume I book, there’s a bonus chapter which covers the details of how the private API works.

Netbottom is a clone of nettop(1) that shows how to use the private APIs.

Alternative APIs

Unfortunately, there’s no alternative public API that can return 64bit metrics on macOS. Using getifaddrs() would only expose a struct if_data which contains 32bit fields that overflow quickly on modern systems: it does not expose struct if_data64.
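To illustrate why 32bit counters are a problem, here is a small C sketch. The helper names are hypothetical and not part of any Apple API. The first helper shows what survives when only the low 32 bits of a 64bit byte count are kept (everything above 4GiB is lost). The second shows the usual workaround of sampling frequently and taking deltas with unsigned arithmetic, which tolerates a single wrap-around between samples but silently loses any multiple of 4GiB transferred in between.

```c
#include <stdint.h>

/* Keeps only the low 32 bits of a 64-bit byte count, which is all a 32-bit
 * counter field can hold: the value wraps around at 4GiB (2^32 bytes). */
uint32_t truncate_to_32bit(uint64_t total_bytes) {
    return (uint32_t)total_bytes;
}

/* Bytes transferred between two samples of a wrapping 32-bit counter.
 * Unsigned subtraction is performed modulo 2^32, so one wrap-around between
 * the samples is handled correctly. If 4GiB or more is transferred between
 * samples, the result is only correct modulo 4GiB. */
uint32_t wrapping_counter_delta(uint32_t previous_sample, uint32_t current_sample) {
    return current_sample - previous_sample;
}
```

For example, truncate_to_32bit(5ULL << 30) (5GiB) returns 1GiB (1u << 30), and wrapping_counter_delta(0xFFFFFF00u, 0x100u) returns 0x200 (512 bytes), even though the counter wrapped between the two samples.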

Rust Crates

If you’re using Rust, the following crates use the NET_RT_IFLIST2 API to get metrics on macOS:

Update (31 Mar, 2023)

Many thanks to Mojo_66 who reached out to note that we can use an additional sysctl() call to get 64bit network metrics which do not suffer from truncation. Accordingly, I’ve updated the sample code.

Furthermore, using IFMIB_IFDATA does not result in 1KiB batching of the reported metrics. My assumption is that this is a bug and would be fixed in a future version of macOS to prevent fingerprinting.

Thanks

Many thanks to Quinn “The Eskimo!” for investigating and digging into the macOS kernel. Many thanks to Mojo_66 for pointing out a 64bit network metrics API which does not suffer from truncation.

¹ struct if_data64 from if_var.h.

Auto Linking on iOS & macOS

Fri, 04 Sep 2020 00:00:00 +0000

Auto Linking Explained

When object files get linked at the final build stage, the linker needs to know which libraries to link against. For example, if you add #import <AppKit/AppKit.h> to an implementation file, you need to also add -framework AppKit to the linker flags.

Auto Linking aims to remove the latter step, i.e., it aims to derive the library linker flags from the import statements in your code. Developers do not need to add any framework/library linker flags anymore; they can just start using any framework by importing it¹.

Under the Hood

Auto Linking works by inserting linker flags in object files. When the linker creates the final executable, it’s as if those linker flags were passed as arguments.

The linker flags are stored as LC_LINKER_OPTION load commands in object files. They can be printed using otool -l file.o:

Load command 5
  cmd LC_LINKER_OPTION
  cmdsize 32
  count 2
  string #1 -framework
  string #2 AppKit
Load command 6
  cmd LC_LINKER_OPTION
  cmdsize 40
  count 2
  string #1 -framework
  string #2 QuartzCore

I recently added support for LC_LINKER_OPTION to MachO-Explorer, so you can use it to inspect the linker flags visually as well.

Frameworks vs Dynamic Libraries

It’s important to note that Auto Linking can work with any Clang module, not just frameworks. For example, /usr/include/module.modulemap in the macOS SDK defines the zlib module as follows:

module zlib [system] [extern_c] {
  header "zlib.h"
  export *
  link "z"
}

The above means that if we were to use #import <zlib.h> with modules turned on, -lz will be automatically recorded in the object file and passed to the linker. We can verify this using otool -l:

Load command 4
  cmd LC_LINKER_OPTION
  cmdsize 16
  count 1
  string #1 -lz

Controlling Auto Linking

Auto Linking is active only when Clang modules are turned on. If you’re invoking Clang on the command line, this means passing -fmodules. The Xcode setting is CLANG_ENABLE_MODULES (“Enable Modules (C and Objective-C)”).

Auto Linking itself can be disabled even if modules are enabled. The Clang option is -fno-autolink and the corresponding Xcode setting is CLANG_MODULES_AUTOLINK (“Link Frameworks Automatically”).

Swift

Swift extensively uses the Auto Linking mechanism to link against its runtime and overlay frameworks. For example, if you were to compile a simple Swift file using swiftc -c file.swift and inspect it:

Load command 4
  cmd LC_LINKER_OPTION
  cmdsize 32
  count 1
  string #1 -lswiftAppKit
Load command 5
  cmd LC_LINKER_OPTION
  cmdsize 24
  count 1
  string #1 -lswiftCore
Load command 8
  cmd LC_LINKER_OPTION
  cmdsize 32
  count 1
  string #1 -lswiftDarwin
Load command 39
  cmd LC_LINKER_OPTION
  cmdsize 40
  count 1
  string #1 -lswiftSwiftOnoneSupport
Load command 41
  cmd LC_LINKER_OPTION
  cmdsize 40
  count 1
  string #1 -lswiftCompatibility50
Load command 42
  cmd LC_LINKER_OPTION
  cmdsize 56
  count 1
  string #1 -lswiftCompatibilityDynamicReplacements

References

Either using #import or @import in Objective-C or import in Swift. ↩︎

Distributed Caching & Compilation

Sun, 07 Jun 2020 00:00:00 +0000

TL;DR

Determinism plays an important role in the ability to cache output of compilation.
Distributed caching and compilation enable quick iteration on large scale software.
Computation of distributed cache keys is non-trivial as not all inputs are explicit.
A suitable distributed cache fill strategy is required for consistently fast local builds.

Deterministic Builds

Deterministic builds can be defined as:

A build is called deterministic or reproducible if running it twice produces exactly the same build outputs.

The LLVM project outlines several levels of determinism:

Basic: Doing a full build of the same source code in the same directory on the same machine produces exactly the same output every time.
Incremental: Like basic determinism, but the output binaries also do not change in partial rebuilds.
Local: Like incremental basic determinism, but builds are also independent of the name of the build directory. Builds of the same source code on the same machine produce exactly the same output every time, independent of the location of the source checkout directory or the build directory.
Universal: Like local but builds are also independent of the machine the build runs on.
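A simple way to check basic determinism is to build twice and compare the outputs byte for byte. The Python sketch below hashes every file under two output directories and reports files that differ or exist in only one of them; the function names are hypothetical. Comparing builds from two different checkout directories, or two different machines, would instead check local or universal determinism.

```python
import hashlib
from pathlib import Path


def digest_tree(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def nondeterministic_outputs(build_a: Path, build_b: Path) -> list[str]:
    """Return relative paths whose contents differ, or which exist in only one build."""
    a, b = digest_tree(build_a), digest_tree(build_b)
    return sorted(path for path in a.keys() | b.keys() if a.get(path) != b.get(path))
```

An empty result means the two builds produced bit-for-bit identical artifacts.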

Large Scale Software

Deterministic builds provide multiple benefits, mostly relating to security and caching. In this post, I cover how determinism enables techniques that speed up the development of large scale software.

As software gets increasingly complex, building all of the source code of a single program locally becomes impractical, especially as we want to support a fast development cycle¹. It’s too inefficient to expect engineers to wait hours to build the latest commit every morning.

Large scale software can arise in different ways:

Self Contained Software: The software itself is a single, self-contained product but it’s very complex and large in scope. For example, Google Chrome and Photoshop would qualify as such software.
Vertical Systems: Even though an individual program might be small, the whole vertical system, including frameworks and system libraries, can be very large. Compiling whole vertical systems from source has multiple benefits² (e.g., quicker vertical signal, lower integration cost, faster system iteration cycle) but compile times can escalate due to the sheer size of code being compiled.

Distributed Caching

One of the major benefits of having universal deterministic builds is that our CI and local development machines will be producing exactly the same artifacts, which allows us to leverage a distributed cache. This changes compile times from being a function of the total code size to being a function of the local code changes (i.e., O(total code) to O(local code changes)).

If you check out a commit which has been built and cached by CI, almost all local computation would be replaced by fetching the remote artifacts. So, rather than waiting hours to compile everything, you can have a build ready in a few minutes.

Note that distributed caching is also known as remote caching.
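Conceptually, each cacheable build step becomes a lookup in a shared key-value store before any local work happens. The sketch below models the remote cache as an in-memory dict and the build step as a callable; RemoteCache, run_cached and the key strings are hypothetical, and a real cache would be a networked service.

```python
from typing import Callable, Optional


class RemoteCache:
    """Stand-in for a shared artifact store (a real one would be a network service)."""

    def __init__(self) -> None:
        self._store: dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._store.get(key)

    def put(self, key: str, artifact: bytes) -> None:
        self._store[key] = artifact


def run_cached(cache: RemoteCache, key: str, build: Callable[[], bytes]) -> tuple[bytes, bool]:
    """Fetch the artifact for `key` if present, else build it locally and upload it.

    Returns (artifact, hit). Reusing a hit is only safe if `build` is
    deterministic: the cached bytes must equal what a local build would produce.
    """
    cached = cache.get(key)
    if cached is not None:
        return cached, True
    artifact = build()
    cache.put(key, artifact)
    return artifact, False
```

In many setups only CI would be allowed to call `put`, while developer machines only read from the cache; the sketch does not model that policy.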

Distributed Compilation

A problem that arises when working on large codebases is making local changes that require a lot of computational power. For example, imagine there’s a header file included almost everywhere (e.g., Logging.h) and you make a change to that file locally. A distributed cache will not help because CI machines have not built that revision: the file contains unique local changes. It would be extremely inefficient to have engineers stuck waiting for hours.

There’s another technique that can speed up development in such cases: distributed compilation. The technique is also known as remote execution, as it’s not limited to compilation but applies more generally to the execution of tools as part of the build process.

Distributed compilation is about leveraging the power of multiple machines to remotely execute compile commands. It’s like having 1000s of CPU cores instead of just a few dozen. Usually, build systems would schedule a certain number of compilation jobs locally and distribute the rest over remote machines, which compile the input files and return the object files as a result.

Both Buck and Bazel support remote execution.
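The local/remote split described above can be sketched with two executors: a small local pool and a larger pool standing in for remote workers. In the Python sketch below the "remote" pool is just another thread pool and compile_fn is a placeholder for dispatching a command to a remote executor. The policy (fill the local slots first, send the rest remotely) is an assumption for illustration; real build systems use more sophisticated heuristics.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable


def schedule(
    sources: list[str],
    compile_fn: Callable[[str, str], bytes],
    local_slots: int,
    remote_slots: int,
) -> dict[str, bytes]:
    """Compile `sources`: the first `local_slots` jobs run locally, the rest remotely.

    `compile_fn(source, where)` returns object file bytes; `where` is "local"
    or "remote" so a real implementation could route the job accordingly.
    """
    with ThreadPoolExecutor(local_slots) as local, ThreadPoolExecutor(remote_slots) as remote:
        futures: dict[str, Future] = {}
        for i, src in enumerate(sources):
            if i < local_slots:
                futures[src] = local.submit(compile_fn, src, "local")
            else:
                futures[src] = remote.submit(compile_fn, src, "remote")
        # Wait for every job and collect the resulting object files.
        return {src: fut.result() for src, fut in futures.items()}
```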

Distributed Caching vs Distributed Compilation

While both distributed caching and distributed compilation help speed up the development cycle of large scale software³, they are fundamentally different techniques. Distributed caching is about storing the results of a computation and retrieving those same results later from other machines. Distributed compilation is about leveraging the power of multiple machines to perform a computation that’s too expensive to perform locally.

The techniques are orthogonal and complementary. You could have a setup with just distributed caching or just distributed compilation (ideally both).

In a pure distributed compilation setting, the total amount of work stays roughly the same⁴; it’s just spread across multiple machines. The main benefit is that we have the ability to perform expensive computations quickly, but we are still bound by the total computational capacity across the fleet.

On the other hand, distributed caching is a classic example of the space-time tradeoff applied across a network of machines. In the end, as long as cache hit rates are sufficiently high, we save a lot of CPU time in exchange for paying a storage cost.

Cache Keys

Determinism

Before we cover key computation, it’s important to note the requirements imposed by having a cache: cached command outputs must be deterministic. This property allows us to safely compute results on one machine and reuse them on others.

If cached commands are not deterministic, then this would lead to a very strange property: the output of a build would depend on the cache hits. Such a system would be very fragile and hard to reason about.

Key Computation

Caches are indexed by a key, so let’s explore how such keys could be computed. Assume we want to cache the results of a command:

compiler --some-option -L /some/directory --input-file /path/to/file @/path/to/argfile

Hashing the command itself would not be enough to uniquely identify the output, because the contents of the referenced directories and files can also affect it. At a minimum, we must include the following as part of the key:

Referenced Files: Need to include the contents of input files. For response files (argfiles), this needs to happen recursively.
Directory Contents: Need to include the state of referenced directories. For example, such directories could be used for header resolution.
Command Parameters: All parameters must be included, in the order they appear.
Compiler Identity: The exact compiler being used (e.g., hash of the binary).
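A minimal sketch of such a key, assuming SHA-256 as the hash and a simple rule for spotting paths in the command: any argument naming an existing file or directory, plus `@argfile` response files. Real build systems usually know which arguments are paths from the rule definition rather than guessing; cache_key and its helpers are hypothetical. Response files are expanded recursively, a directory contributes its sorted listing (the contents of headers actually used are better captured via dependency files), and the compiler is identified by hashing its binary, which, as discussed next, is not always sufficient.

```python
import hashlib
import os
from pathlib import Path


def _hash_path(h, path: Path) -> None:
    """Feed the state of a referenced file or directory into the hash."""
    if path.is_file():
        h.update(b"file\0" + path.read_bytes())
    elif path.is_dir():
        # Directory state: the sorted list of entry names (e.g. header search paths).
        for entry in sorted(os.listdir(path)):
            h.update(b"entry\0" + entry.encode() + b"\0")


def _hash_args(h, args: list[str], depth: int = 0) -> None:
    if depth > 16:
        raise ValueError("response files nested too deeply")
    for arg in args:  # parameters are hashed in the order they appear
        h.update(b"arg\0" + arg.encode() + b"\0")
        if arg.startswith("@"):
            # Response file: hash its (whitespace-separated) arguments recursively.
            _hash_args(h, Path(arg[1:]).read_text().split(), depth + 1)
        elif os.path.exists(arg):
            _hash_path(h, Path(arg))


def cache_key(compiler: str, args: list[str]) -> str:
    h = hashlib.sha256()
    h.update(b"compiler\0" + Path(compiler).read_bytes())  # compiler identity
    _hash_args(h, args)
    return h.hexdigest()
```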

Compiler Identity

The compiler identity needs to be included in the computation of the cache key because the output can vary between versions. For example, Clang 5 would most likely behave differently compared to Clang 11.

Unfortunately, determining the identity of the compiler is not as easy as just hashing the binary. That’s because the binary might just be a shim that redirects to another binary or even a set of binaries.

Furthermore, the shim might operate in a dynamic manner: e.g., redirect to different compiler versions depending on parameters, environment variables or the state of the filesystem. It’s practically impossible for the build system to guess how the compiler works internally.

Implicit Inputs

In addition to the visible inputs as part of the command, there are two other major sources of implicit inputs:

Environment Variables: the compiler’s behaviour could be influenced by environment variables. Those variables might have been set at the shell level, build system level or somewhere else.
Filesystem: the compiler will inevitably end up using many more files than those explicitly specified. There might be implicit search paths, SDK paths and so on.

Non-Existent Files

One tricky aspect that is easily missed is files which do not exist on the filesystem but which still affect the output of the compiler. Imagine if the compiler had code like:

const char* lib_path = "/some/predefined/path.a";
if (static_lib_exists_as(lib_path)) {
 append_to(linked_libraries, lib_path);
}

This means the compiler will generate different outputs depending on the existence of /some/predefined/path.a. Consequently, the file must always be part of the cache key, even when it does not exist.
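One way to account for this is to record a state for every path the tool probes, with "absent" being a distinct state rather than a path that is skipped. The sketch below assumes the set of probed paths is already known (for example, reported by the tool itself or by a filesystem tracer, which is an assumption here) and folds each path's existence and contents into a digest; probe_digest is a hypothetical name.

```python
import hashlib
from pathlib import Path


def probe_digest(probed_paths: list[str]) -> str:
    """Hash the state of every probed path, treating absence as a state of its own."""
    h = hashlib.sha256()
    for p in sorted(probed_paths):
        path = Path(p)
        h.update(p.encode() + b"\0")
        if path.is_file():
            h.update(b"present\0" + hashlib.sha256(path.read_bytes()).digest())
        elif path.exists():
            h.update(b"other\0")  # e.g. a directory at that path
        else:
            # A missing file can still change the output, so it contributes too.
            h.update(b"absent\0")
    return h.hexdigest()
```

With this scheme, a missing /some/predefined/path.a, an empty one and a non-empty one all produce different keys.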

Explicit Inputs

Ultimately, the compiler itself truly knows all the inputs that were used to determine the output. Ideally, compilers would provide a way to output that information in a structured way, so that build systems can utilise it.

For example, Clang and GCC support dependency files which list all the user and system header files that were used. MSVC supports a similar /showIncludes option.
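Dependency files use Makefile syntax: a target, a colon, then the prerequisites, with backslash-newline continuations and spaces inside paths escaped as `\ `. The Python sketch below parses the common single-rule form emitted by Clang's -MD/-MMD options; it is a simplification that handles only `\ ` escaping (not `$$` or other Make escapes), and parse_depfile is a hypothetical name. A build system could then hash exactly these headers as part of the cache key.

```python
import re


def parse_depfile(text: str) -> tuple[str, list[str]]:
    """Return (target, prerequisites) from a single-rule Make-style .d file."""
    # Join backslash-newline continuations into one logical line.
    joined = re.sub(r"\\\r?\n", " ", text)
    target, _, rest = joined.partition(": ")
    # Split on whitespace not preceded by a backslash, then unescape "\ ".
    tokens = re.split(r"(?<!\\)\s+", rest.strip())
    deps = [t.replace("\\ ", " ") for t in tokens if t]
    return target.strip(), deps
```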

Parameters Not Affecting Determinism

Sometimes, tools might support options which do not affect the output but change the internal operation of the tool (e.g., algorithms used, concurrency, etc).

While it’s possible to add custom handling to exclude such parameters from cache key computation, it’s safer to assume that all parameters affect the output and have CI machines (or a subset) always build the exact same local development configuration.

Cache Availability

An important aspect of a distributed cache is its hit rate: it should be very high. Assuming no local changes, the hit rate would usually depend on the particular commit being checked out, the cache retention policy and cache fill strategy.

For example, if the cache gets filled once an hour by a CI job, checking out the latest commit on master might result in a very slow build if a previous commit invalidated most artifacts⁵.

In terms of fill and retention strategies, several aspects require careful consideration. For example:

Does the cache get filled at every commit or only certain ones?
Does the cache get filled before a commit gets pushed or afterwards?
In a monorepo, for a specific commit, are caches for all targets filled or only a subset?
Is the cache per-target or a global one?
What eviction policy does the cache use? What’s the size of the cache?

There are no right or wrong answers: all of the above aspects represent different tradeoffs and will have associated benefits and costs. It will be down to the constraints and requirements within a company/project to make the appropriate choices and deliver the desired experience.

Metrics

Ultimately, the distributed cache should have a high hit rate for as many builds as possible. For example, that might mean aiming for a p95 miss rate of 1%: i.e., 95% of all builds experience a miss rate of 1% or lower (i.e., a hit rate of 99% or higher).

Adopting a data-driven approach in optimising the cache hit rate is an appropriate strategy.
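Computing such a metric is straightforward once each build reports its cache hits and misses. The sketch below uses the nearest-rank definition of a percentile (an assumption; other definitions give slightly different values on small samples), and the per-build counts in the usage are hypothetical.

```python
def miss_rate(hits: int, misses: int) -> float:
    total = hits + misses
    return misses / total if total else 0.0


def p95_miss_rate(builds: list[tuple[int, int]]) -> float:
    """Nearest-rank 95th percentile of per-build miss rates; `builds` holds (hits, misses)."""
    if not builds:
        raise ValueError("no builds recorded")
    rates = sorted(miss_rate(h, m) for h, m in builds)
    rank = (95 * len(rates) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return rates[rank - 1]
```

With that, the 1% target from above is met when `p95_miss_rate(builds) <= 0.01`.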

References

Incremental build time is incredibly important for developer productivity. The difference between a 2s and 30s incremental build is very significant due to a simple fact: the wait time is not long enough to be utilised productively but it’s long enough to add up to a large total. Doing 100 builds per day leads to over 46 minutes of wasted time (100 × 28s). ↩︎
Building everything from source is a separate topic which will not be covered in detail here. ↩︎
Another reason for the importance of fast iteration time is that it directly affects the ability to write good software. When tackling a new problem space, the freedom to easily explore and quickly iterate on solutions becomes key to success. ↩︎
There will be some additional overhead associated with the cost of distributing the compilation across multiple machines. ↩︎
E.g., changing a widely included header file. ↩︎

Apple's Linker & Deterministic Builds

Sun, 03 May 2020 00:00:00 +0000

TL;DR

Universal deterministic builds require that no path in the artifacts depends on the repository checkout location. On Apple platforms, the linker inserts absolute paths to object files into executables. In Xcode 11, Apple added a new linker option, -oso_prefix, that can relativise these OSO absolute paths. Another source of non-determinism is the object file timestamps recorded in the OSO entries.

Deterministic Builds

One of the requirements for universal deterministic builds is that they are independent of the source checkout path, on any machine. Consequently, there can be no absolute paths in the output artifacts.

Debug Info & Paths

When it comes to compiling C/Obj-C/C++ code for Apple platforms, absolute paths in executables can be found in debug info inserted by:

Compiler: paths to source files.
Linker: paths to object files.

An obvious question arises: why does the linker insert paths to object files?

DWARF on macOS

On Apple platforms, DWARF debugging data can reside in one of two places:

Object Files: The object file for each translation unit will contain DWARF debugging data. Executables will contain absolute paths to the object files, so that the debugger can find the debugging info there.
dSYM Bundles: An Apple bundle which contains the combined DWARF data from all the object files. The executable and its corresponding debug information are linked using a UUID.

Rationale

There’s a very specific reason why DWARF data is left in the object files rather than copied into the linked executables: much faster incremental builds. Such builds do not have to incur the cost of copying the full debug info on every build, even if just a single object file changes.

The obvious downside is that it makes debugging of executables depend on having access to the original object files. That’s why dSYM bundles exist: they combine all the debugging info from the object files. dsymutil can be thought of as a DWARF linker.

Avoiding Absolute Paths

Compiler

Clang supports the -fdebug-prefix-map flag which provides the ability to relativise absolute source paths in the debug info. For example, you can use it like so: -fdebug-prefix-map=/Users/milen/repo=..

Linker

As explained earlier, Mach-O executables store the paths to the object files which contain debug data in OSO entries. You can run the nm command on a binary with debug information to inspect such entries:

nm -a out/main | grep OSO
000000005eaede8d - 03 0001 OSO /Users/milen/repo/out/hello.o
000000005eaede8f - 03 0001 OSO /Users/milen/repo/out/main.o

In 2019, as part of Xcode 11, Apple added a new linker option, -oso_prefix, which can be used to relativise the OSO paths. For example, when linking using the Clang driver, we can pass:

clang ... -Wl,-oso_prefix,/Users/milen/repo/

If we then print the OSO entries, we will see that they have been relativised:

nm -a out/main | grep OSO
000000005eaede8d - 03 0001 OSO out/hello.o
000000005eaede8f - 03 0001 OSO out/main.o

I’d also recommend using MachO-Explorer to visually inspect the symbol tables.

OSO Entries & Timestamps

Another source of non-determinism in Mach-O executables is the timestamps associated with the OSO symtab entries. Those are used to determine if the object files are out of sync with the executables.

For example, if the debugger determines that an object file is newer than the executable which points to it, that means the executable was not relinked after the object file was recompiled.

The strategy used in Buck to guarantee deterministic executables is to always set the modification dates of all object files / static libraries to a predefined date. The tradeoff there is that it breaks the ability of the debugger to detect object file / executable synchronisation issues.
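A sketch of that normalisation step in Python: before linking, every object file and static library under a build directory gets the same fixed modification time, so the timestamps recorded in OSO entries no longer depend on when the build ran. The timestamp value, the file suffixes and normalise_mtimes are assumptions for illustration, not Buck's actual implementation.

```python
import os
from pathlib import Path

# Hypothetical fixed timestamp; any constant works as long as every machine uses the same one.
FIXED_MTIME = 1_000_000_000


def normalise_mtimes(build_dir: Path, suffixes: tuple[str, ...] = (".o", ".a")) -> int:
    """Set atime/mtime of all object files and static libraries to FIXED_MTIME.

    Returns the number of files touched.
    """
    count = 0
    for path in build_dir.rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            os.utime(path, (FIXED_MTIME, FIXED_MTIME))
            count += 1
    return count
```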

libtool & ld64

If you do not want to apply postprocessing yourself, you have a few options.

libtool supports the -D option which can be used to guarantee deterministic values. The documentation says:

When building a static library, set archive contents’ user ids, group ids, dates, and file modes to reasonable defaults. This allows libraries created with identical input to be identical to each other, regardless of time of day, user, group, umask, and other aspects of the environment.

In addition, both libtool and ld64 support the ZERO_AR_DATE environment variable to control the timestamps for the OSO entries (ld64 code, libtool code).

Debugging

lldb tries to resolve relative paths against the current working directory, so to make sure debugging works, we need to adjust the cwd and add a source mapping. This can be done either at the lldb prompt or in a lldbinit script.

script import os
script os.chdir("/Users/milen/repo")
settings append target.source-map ./ /Users/milen/repo

Buck

Buck, which supports distributed caching, is used as a core build system at Facebook. As the addition of -oso_prefix is relatively recent, how did Buck produce deterministic Mach-O executables until now?

The answer is that Buck performs an optional post-processing step where all OSO entries are relativised, making the executables independent of the checkout path.

While that’s a working solution, there are several downsides.

Performance

Relativisation requires rewriting both the symbol table and the string table. For large binaries (e.g., 500MiB-1,500MiB), the combined size of the tables can be around 50% of the binary size, and processing that much data can be slow.

Note that while we could patch the symbol and string tables in-place, different machines would produce string tables of different lengths depending on the checkout path length. That’s why we need to rewrite the full string and symbol tables, to ensure that executables are bit-for-bit equal.
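The reason becomes clearer when looking at how symbols reference strings: each symtab entry stores a byte offset (n_strx) into a single blob of NUL-terminated strings. Shortening one path shifts the offset of every string after it, so the table has to be rebuilt and the offsets remapped. The Python sketch below models this on plain bytes and lists rather than real Mach-O structures; it assumes every offset points at the start of a string and that the table ends with a NUL, and relativise_strtab is a hypothetical name.

```python
def relativise_strtab(strtab: bytes, strx: list[int], prefix: str) -> tuple[bytes, list[int]]:
    """Strip `prefix` from strings in a NUL-separated string table.

    `strx` holds the byte offsets that symbols use to refer to strings.
    Returns the rebuilt table and the remapped offsets.
    """
    old_to_new: dict[int, int] = {}
    out = bytearray()
    offset = 0
    for raw in strtab.split(b"\0")[:-1]:  # final element is empty: table ends with NUL
        s = raw.decode("utf-8")
        if s.startswith(prefix):
            s = s[len(prefix):]
        old_to_new[offset] = len(out)  # where this string lives in the new table
        out += s.encode("utf-8") + b"\0"
        offset += len(raw) + 1
    return bytes(out), [old_to_new[i] for i in strx]
```

Because the rebuilt table depends only on the relativised strings, two machines with different checkout paths end up with identical tables and offsets.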

As relativisation is very performance sensitive, special attention needs to be paid to the code implementing it. For example, an additional 1 microsecond to process a symtab entry results in a ~5s slowdown if we need to process 5 million entries (not unusual for binaries of this size).

For example, you can see several optimisations ([1] [2] [3] [4]) I made to improve the performance of the process in Buck.

Maintenance

More code means higher maintenance cost. Furthermore, as we have to keep compatibility with Apple’s tooling, the code’s behaviour has to be verified against every major Xcode release.

Code Signing

Relativisation, as a post-processing step, only works if the executables are not already code signed: mutating the binaries would invalidate their signatures. While that’s not a problem if we are building from source, the approach would not work in all situations.

Acknowledgement

Many thanks to Michael Eisel for finding the new -oso_prefix option. Thanks to Mark Rowe for surfacing libtool’s option.