15 September, 2022

Request forgery in node by example

Let me start by saying I've been writing JavaScript on and off since the early 2000s, but I definitely don't consider myself an expert in it, particularly given the rapid iteration of JS best practices and uses.

Somewhere between the end of 2020 and mid-2021 I started building an intentionally awful sketch of a toy app in which I reproduced a handful of CVEs I was familiar with in Node.js, and I have kept adding to it periodically as I've learned things or occasionally thought "how would that actually look in code?" while exploring something in a black box environment.
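To give a flavour of the kind of bug the app collects, here's a minimal sketch of a server-side request forgery (SSRF) sink in an Express handler. It's illustrative only: the route and parameter names are made up and not taken from vulnparty-js.

    // Minimal illustrative SSRF sketch (hypothetical, not lifted from vulnparty-js).
    // The handler fetches whatever URL the caller supplies, so a request like
    // /fetch?url=http://169.254.169.254/latest/meta-data/ makes the *server*
    // reach internal-only addresses on the caller's behalf.
    const express = require('express');
    const http = require('http');

    const app = express();

    app.get('/fetch', (req, res) => {
      const target = req.query.url; // attacker-controlled, never validated
      http
        .get(target, (upstream) => {
          let body = '';
          upstream.on('data', (chunk) => (body += chunk));
          upstream.on('end', () => res.type('text/plain').send(body)); // reflect it back
        })
        .on('error', (err) => res.status(502).send(err.message));
    });

    app.listen(3000);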

My occasional dips into other people's JavaScript codebases have also made me realize there are a lot of elements of a "typical" modern JS codebase which I've seen, used, or contributed to, but never set up from scratch in a modern JS project. Test frameworks. Linting. And so on. So, with the intention of teaching myself how a modern JavaScript app goes together properly, I started (and have yet to finish!) adding all the bells and whistles a proper server should have.

I am not opening up this code because it is finished (it is very much not finished or polished, and additions are welcome on basically all fronts), but rather in the hopes that other folks might like to play with it, learn from it, or add to it. 

As I was writing this code (both last year, and then this year after I picked up development again with the intent of releasing this project while I'm funemployed before my next job starts), I asked a couple of individuals for feedback or their thoughts. Thanks very much John, Nathan, Sharon, and anyone else who's spent any time looking at this :)

The code is here: https://github.com/kaoudis/vulnparty-js
How to get started: examples
And the rest of the documentation on what's there and a little about how it works: doc

03 October, 2021

Starting with threat modeling for the devops engineer or developer

Not every company is lucky enough to have enough appsec or even infosec folks that a security staffer can threat model everything. You're on an engineering team who own some services and you've been told by security that you should threat model your system; now what?

Since obviously the folks who own the systems have the deepest knowledge of them, this can sometimes turn out awesome. But, for non-security-folk, threat modeling can also seem arcane or hard to approach.

tl;dr

You don't have to be a security person or a hacker to start to understand what could go wrong in a system. 

I'm gonna walk through how to apply a couple of different methods to begin to understand your system from the attacker's perspective through threat modeling. I want to demo how threat modeling can help engineers and developers spot gaps in a system which could lead to failures in production. 

I'll start in a mildly unusual place, since it might not be that intuitive for folks without a security background to just do the thing.

Take Five

I'm gonna try (ab|mis)using some ideas from a framework operational postmortems sometimes follow called the Five Whys to lead into other methods more typical for threat modeling.

We'll iteratively build from a series of "why" questions (four, seven, actually five...), which we'll answer by describing the system and process until we reach objective and (this is an imaginary system, so we're making it all up as we go along, but ideally) factual statements about the people we think might attack our system and how we might defend it, so that we can defend intentionally.

This isn't a perfect application of the Five Whys, however. It's just a hack.

Three perspectives on threat

I view threat modeling as a structured way of looking at systems to determine what parts are most important to protect or refine from a security perspective to improve the system's operational viability.

Data-centric

The most typical style I know of threat modeling (STRIDE, etc here) could be thought of as data-centric. These methods seem to mainly involve drawing data flows and thinking through what data is most important to keep safe, then thinking about how it could be accessed in ways we don't want.

System-centric

Another way folks might somewhat less-intentionally threat model: simply understanding the system deeply and what it should / what it should not be able to do, then applying that knowledge to systems-level protections. 

Attacker-centric

A third is based on trying to thoroughly understand the attacker and what they are most likely to want, then focusing first on protecting the parts of the system most interesting to the attacker.

Combined!

Not novel, but I'd like to apply techniques from all three of the categories I've defined so that we look at the system from multiple angles as well as from the attacker's perspective to best understand what could go wrong.

If you think other techniques should be included here I'm open to it; drop me a line.

Architecture

Threat modeling (in any form) should be an iterative exercise revisited whenever the system undergoes behaviour or structure-affecting changes. As the system evolves, the team's threat model should be updated with it. We'll play with two versions of the same system, with an eye toward trying to understand how our threat model changes as well when we add new features or refactor.

One

Here's a fairly vanilla client/server setup. The client directly calls some APIs backed by a variety of datastores.

Two

The second version of our system includes an intermediate API combining results from the other APIs behind it to provide the client a smaller and more tailored result.

Personae (non gratae)

We can build attacker personae sort of like we'd construct Agile-style customer personae during product development. We'll cluster gathered facts by similarity, much the same way a product manager might gather info through customer interviews and other means and then group it by various customer attributes.

In the real world, we could gather data to start from via:

  • examining logs and scans for repeated anomalies ("x happens and so does y at the same time, and we don't know why" might be something we're interested in using as a starting point for a threat model)
  • aggregating reports to our bug bounty or VDP by type of issue and attack
  • getting to know the folks who frequently report to us through our BBP or VDP
  • keeping up with attack writeups and related reporting
  • our infosec teams' general knowledge
  • history of prior incidents and how they were resolved
  • pentest reports
  • red team exercises
  • ...

Here are our attackers, who I've named completely at random:

  • Akash wants our user account data (maybe we got this pattern from a prior incident and a pentest report which exposed some of the same issues)
  • Jenna is a bug bounty hunter looking for vulns on subdomains and APIs in scope for our BBP
  • Frankie is automating account creation on our platform for a freelance job (maybe we got this pattern from our service logs and scans)

I also think it's important to call out that none of these folks are actively malicious; they're just applying their skills in ways the market and our systems allow.

Attacker detail: System 1

Akash wants active VIP accounts.

Why - there is a market (brokers, buyers) for high-profile user accounts

Why - $company has very quickly / recently become very popular and has a large user graph, but the maturity of $company's systems can't yet support the userbase entirely

Why - the systems crash often, the APIs are pretty open, and there isn't much concern for security or opswork

Why - quick-growth survival mode is different from long-term-stability-mode, and the engineering culture is currently very focused around the former rather than the latter

Jenna wants to find vulns on subdomains and APIs included in our bug bounty.

Why - we pay well and try to treat the folks who report to us kindly, even if they report findings we consider invalid

Why - it costs money, time, and sweat to build skills needed to provide findings we consider useful

Why - the market will provide Jenna other avenues for her skills if we don't

Why - the market values thoughtful and clear findings reports and bug bounty participants who treat the admins with respect

Frankie wants to make fake accounts.

Why - fake engagement if well leveraged can sometimes result in real engagement over time

Why - the public can't always tell fake engagement from real engagement if it's done well 

Why - the clients will pay for a small network of unrelated-enough accounts plus some automation to glue them together

Why - the clients are burgeoning influencers who would like to become more well known and get better sponsorships

More system detail

Now that we have some semi-reasonably motivated attackers to think about, let's talk about what they might focus on in System #1. We'll add a little more detail to the APIs first.

We don't need to always apply all our attacker personae to sketch what might be interesting about the system every time we change the system, but we will this time. The more detail we have to start with, the more detail we can work with over time as needed.
 

Jenna only cares about the /social_graph api, which is in scope for bug bounty.

Why - the other APIs have been scoped out of the bug bounty

Why - the other APIs are currently very fragile compared to /social_graph

Why - run-the-business work is currently prioritized over tech debt and operational hardening work

Why - the engineering culture is primarily focused around supporting fast growth versus long term stability (sound familiar?)

Akash is mainly interested in /account_lookup, /privileges, and /user/billing.

Why - through recon, a list of all the APIs and subdomains the company owns was identified, and then narrowed down to just what would be useful to build an appealing dataset

Why - the integrations and notifications APIs serve "additional info" useful for the client in presenting a nice UI, not account data directly

Why - the notion of account is still centralised around a few key APIs

Why - these additional APIs were added as the notion of "account" was slowly expanded out from a monolith into a series of microservices 

Why - the refactoring was done ad hoc 

Why - the process was not well defined beforehand in a fashion including incremental milestones with definite security and ops goals

Frankie is interested in all the APIs.

Why - one can bypass the client to build accounts how the system expects them to look, then use the account lookup API to check everything got set up correctly

(we'll split this thought into two directions...)

Part A

Why - the non-social_graph APIs don't rate limit client IDs other than the one hardcoded into the client

Why - folks want to be able to run load tests in production without the rate limiting on so they can observe actual capacity

Why - load testing to ensure a certain SLA can be met has been written into one of the enterprise client user contracts

Why - the backend had difficulty handling a denial-of-service attack six months ago because no rate limiting existed then

Why - system hardening tasks like adding rate limiting were put off when the system was built to meet an initial delivery deadline, and then backlogged

Part B

Why - the account lookup API returns every single account attribute as part of a nice clean JSON

Why - the lookup API specification doesn't match its current task well

Why - the lookup API was first built as part of a different feature and got reused in what is currently in production without retailoring to meet current customer needs

Why - (see last why of Part A as well) corners got cut to meet an initial delivery deadline

What have we learned?

Through trying to match up what our attackers want with the current state of the system, we've already uncovered some interesting things!
  • the account lookup API could be (with security and performance justifications) refactored to match its current use case
  • we might want to consider rate limiting other client IDs than just the one for the official client
  • we may not want to spend quite as much time on the integrations and notifications APIs in terms of security protections compared to the more closely account-related APIs, depending on what other information we discover as we go

Adding in the intermediate API 

The second version of our system has an additional API layer which the client calls instead of calling the backing APIs directly.

This extra layer could be GraphQL, or maybe a WAF or other application doing some security checks, or... a thin REST API designed to combine the data from less tailored APIs so the client only has to make a single call and gets back just what it needs, instead of making a bunch of slower calls and having to discard data that isn't important to the user experience. 

This API could be doing additional input sanitisation or other data preprocessing to clean input up (which the legacy APIs it fronts may not be able to include without heavy refactoring).

For now, it's just a thin, rate limited REST API designed to wrangle those internal APIs without any special security sauce. There's one main endpoint combining/reducing the backing APIs' data.
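As a rough sketch of what that thin combining layer might look like (hostnames, paths, and field names here are all made up for illustration; this assumes Node 18+ so the built-in fetch is available):

    // Hypothetical sketch of the intermediate API: one endpoint fans out to the
    // backing APIs, then returns only what the client actually needs.
    // Assumes Node 18+ (global fetch); hostnames and fields are invented.
    const express = require('express');

    const app = express();

    app.get('/profile_summary/:accountId', async (req, res) => {
      const { accountId } = req.params;
      try {
        const [account, privileges] = await Promise.all([
          fetch(`http://account-lookup.internal/account_lookup/${accountId}`).then((r) => r.json()),
          fetch(`http://privileges.internal/privileges/${accountId}`).then((r) => r.json()),
        ]);

        // Combine and trim: the client gets a small, tailored payload instead of
        // every attribute the backing APIs happen to return.
        res.json({
          id: accountId,
          displayName: account.displayName,
          tier: privileges.tier,
        });
      } catch (err) {
        res.status(502).json({ error: 'backing API unavailable' });
      }
    });

    app.listen(8080);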

Let's forget about the previous Whys for a second and redo against the second iteration of our system.

Jenna is interested most in the intermediate API.

Why - the intermediate API has more low-hanging bugs compared to /social_graph, which has been in scope for the BBP since iteration #1 of the system was first released

Why - the intermediate API exposes some of the vulnerabilities of the more fragile APIs (not directly in BBP scope) backing it

Why - the intermediate API combines data from the fragile other APIs, and like the rest of the system it doesn't rate limit client IDs other than the official web app's

Why - the implications of rate limiting only the official client (so that randomly-generated client IDs used for load testing would not be rate limited) were not thought through

Why - security and other operational hardening concerns are second priority to moving fast and getting things in prod

Frankie remains interested in the individual backing APIs.

Why - hitting the backing APIs directly using a non-rate-limited client ID is still most efficient compared to going through the client or intermediate API or both

Why - the intermediate API tailors its requests to what the client needs

Why - the main priority in designing and building the intermediate API was client performance

Why - users have been complaining about slow client response times

Why - the fake-account-making automation is quietly taking up a significant chunk of the backing API resources

Why - the backing APIs are overly permissive in terms of authentication and rate limiting

(though the following extra why might not come to light unless e.g. folks from engineering cultures with more investment in logging/monitoring/observability get hired on and try to implement those things, I think it's also worth calling out...)

Why - there is insufficient per-client-ID logging/monitoring/observability, so it is unclear what is actually causing the slow client response times other than high load on some of the backing APIs at certain hours 

What else did we learn?

We've also discovered that being better at understanding data flow through the system on a per-client-ID basis would provide value from a security as well as an operational perspective.
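As a hedged sketch of one low-effort starting point for that per-client-ID visibility (the X-Client-Id header is an assumption; use whatever identifier your clients actually send):

    // Tiny sketch: tag every request with the caller's client ID so logs and
    // metrics can be sliced per client. The header name is hypothetical.
    const express = require('express');

    const app = express();

    app.use((req, res, next) => {
      const clientId = req.get('X-Client-Id') || 'unknown';
      const start = Date.now();
      res.on('finish', () => {
        console.log(JSON.stringify({
          clientId,
          method: req.method,
          path: req.path,
          status: res.statusCode,
          ms: Date.now() - start,
        }));
      });
      next();
    });

    app.get('/healthz', (req, res) => res.send('ok'));

    app.listen(8080);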

Datastores and more

Some parts of our system are Amazon-hosted, and some of them run in a colo. Let's imagine $company is pretty big, but our system just belongs to one small division.
 
Here's version 1 again with some databases added in:


And here's version 2: 

You might've noticed this architecture, either version, has flaws which we could start to remedy by enforcing security boundaries, if only we had any security boundaries.

Lack of boundaries invites lack of respect

It's possible attackers can damage or extract data from unexpected parts of the system if they can compromise just one part, since we have (as currently drawn) minimal systems-level security protections aside from a halfhearted attempt at client rate limiting.

Attackers (in arch version #2) can still hit any of our backing APIs directly, despite us having the intermediate API. 

We also don't have any serverside data sanitization in place for either input from client(s) or output from our datastores. 

Trust boundaries should go anyplace data crosses from one "system" within our system (from the user to us, from the client to the API, from the API to AWS...) to another. 
 
Boundaries are about compartmentalization. If an attacker breaches one compartment, or even a few, water shouldn't compromise the whole ship. Since we're most focused on API/backend security in this post, we're going to place our boundaries with respect to the point of view of the API services generally.

Exploring trust

When I start drawing trust boundaries on an architectural diagram, I find it's useful to try a first placement, then think about the implications and perhaps iterate, instead of just going with the first version. There may be something which makes that "hey this is going to be a pain later, don't leave that like this" part of one's brain itch.

We could draw one great big boundary like in the following picture, and just say all our APIs talking to Aurora dbs trust them. But: Amazon is an external service. 
 
We can likely assume Aurora works correctly the majority of the time and the Amazon security teams know their business but... What if we've misconfigured our connections to Amazon in an exploitable way? What if we set up our data stores in Aurora accidentally such that an attacker could exploit a race condition between our API calls?
 
We'll add all those items to our list of "stuff which we should think about taking a hard look at from a security perspective" and continue experimenting.

We could draw boundaries inclusive of the client (which more reflects our current situation as the system has previously been drawn/examined) and say the APIs implicitly trust the client, but none of the data stores.

 
One problem here is we've forgotten about all those other potential client IDs which can hit our APIs. We should likely also consider making it explicit that there are not trust relationships between the majority of our APIs which don't call each other. There's no reason to have implicit trust between these services!
 
As before, we'll add the called-out items to our list of stuff to circle back to later and keep playing around.
 
Refining things a little bit from our previous tries:
Three of our APIs reference the same Aurora data store. This may result in interesting data races and other shenanigans, as previously discussed. It could be the result of poorly factoring out a monolith into microservices without splitting up the data they reference, or it could be the result of adding more microservices into a system ad hoc without contemplating the consequences. 
 
Regardless, it's good to call out that the shared datastore must trust all three APIs talking to it so we can plan around that.

We've settled on trusting neither the client, nor the intermediate API, at this point. Effectively, the intermediate API is the client from system version #1, and (not to overindex on rate limiting here) should be rate limited and otherwise treated just like any other client. 

We could also argue that the intermediate API should be able to trust the backing APIs. What that could look like:
The client remains outside all our trust boundaries. We don't want to trust any client IDs in an ideal world after everything's cleaned up; we want to rate limit them all (not just the official web client ID)!

What's the point of all the lines?

We can use these boundaries we've drawn to try and think about what should not happen now that we've thought through some of what can go wrong and what our attackers might want to do. The important thing though is we likely haven't thought of everything that could break, everything an attacker could want, or everything we can fix. 
 
But! We can do a bit now, make things better, and then iterate as new ideas come up. 
 
Nothing has to be perfect to start with. 
 
So, starting with:
  • rate limit all possible client IDs talking to the intermediate API (see the sketch after this list)
  • we are likely interested in adding service-to-service authentication of some type to our setup, along with client allowlisting, so that, for example, only the intermediate API can talk to the backing APIs, and our friend Frankie can't hit them directly anymore because they aren't allowlisted
  • we probably want data sanitization in our intermediate API (before we pass user input on to our backing APIs, at the very least), unless we choose to add it to our client
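Here's a hedged sketch of the first two items, assuming Express and the express-rate-limit package; header names, ports, and the allowlist contents are invented for illustration:

    // Intermediate API: rate limit *every* caller by client ID, not just the
    // official web client. Assumes the express-rate-limit package.
    const express = require('express');
    const rateLimit = require('express-rate-limit');

    const intermediate = express();
    intermediate.use(rateLimit({
      windowMs: 60 * 1000,                               // one-minute window
      max: 100,                                          // per client ID per window
      keyGenerator: (req) => req.get('X-Client-Id') || req.ip,
    }));

    // Backing API: only allowlisted callers (e.g. the intermediate API) get in.
    // In a real deployment this would be mTLS or signed service tokens rather
    // than a plain header, but the shape of the check is the same.
    const ALLOWED_CALLERS = new Set(['intermediate-api']); // hypothetical identity
    const backing = express();
    backing.use((req, res, next) => {
      if (!ALLOWED_CALLERS.has(req.get('X-Service-Id'))) {
        return res.status(403).json({ error: 'caller not allowlisted' });
      }
      next();
    });

    intermediate.listen(8080);
    backing.listen(8081);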

STRIDE

Now that we have some ideas about our attackers, our architecture, how our architecture could be better, and where trust boundaries can go, let's walk through some mildly more traditional attack-based threat modeling. 
 
STRIDE (Spoofing, Tampering, Repudiation, Information disclosure, Denial of service, Elevation of privilege) is a mnemonic or acronym to help us think about broad categories of things that could go wrong in the system, and it's one of the oldest techniques we can use to threat model. 
 
We don't have to necessarily think of something for each category, but the categories are intended to help us think about what could cause catastrophic system failure and loss of user trust in us.

Spoofing

  • It's possible to impersonate the intermediate API to both the client and the backing APIs, in the current setup. 
  • Nothing as documented prevents attackers from CSRFing our users

Tampering

  • Any of our attackers could potentially figure out how to affect just parts of existing accounts, since none of our APIs have proper authentication
  • There are no documented mechanisms preventing stored XSS in our current setup (leaving aside the other kinds for now since we're not really paying attention to the client so much at the moment)

Repudiation 

  • Frankie can just directly hit the backing APIs and the teams who own them haven't got a clue why their users are complaining. They don't have a good way to tell what client did what.

Information Disclosure

  • Akash can build a curated collection of VIP account access / data using our APIs
  • If someone could hit our databases directly (which... nothing in the above architecture as described says they can't), they wouldn't need to go through our APIs to create or collect account info

Denial of Service

  • When Frankie makes a bunch of accounts, our legit userbase experiences degraded service because the backing APIs are occupied

Elevation of Privilege

  • If a user doesn't want to be rate limited and is technologically capable, they can stop going through the regular client, which is rate limited, and do just what Frankie is doing in hitting the backing APIs
  • Jenna can (particularly given how underspecified the system currently is) potentially force a race condition in the three account APIs which could lead to her account gaining additional privileges it doesn't or shouldn't have currently as created through the web client, via manipulating requests to / through the intermediate API

Some other frameworks which might be interesting instead of or in addition to STRIDE: ATT&CK and NIST 800-154.

We could go further here and walk through these plus possibly others, but I think I'll leave it here for now because I'm pretty much writing a novel and hopefully have left enough points of interest for someone who didn't know much about threat modeling before to sink in some hooks and google some stuff and follow some links (I hope it's a fun discovery process, too!!!). 

But there's one more step I'd like to do - relating what we think could happen in the system, back to the system - checking our work. I never see anyone doing this with threat modeling and I think it's as useful here as it is in math or CS theory proofs.

Checking our work

Now that we've gathered a pretty good idea of some things we can address, let's walk through what we plan to do about them, and try to tie our plans back to the problems we've identified.
 
We'll go step by step back from the bottom to the top and make sure our logic is somewhat sound, and mark those items where we need more info to draw a firm conclusion, or where we made unstated assumptions.

Let's consider some of the statements from STRIDE first. We're going to include / reference 'data' here based on some of the colour we've already added to this system and engineering organisation previously.

Jenna can (particularly given how underspecified the system currently is) potentially force a race condition in the three account APIs which could lead to her account gaining additional privileges it doesn't or shouldn't have currently as created through the web client, via manipulating requests to / through the intermediate API.

Why - the interactions between the three APIs hitting the same data store are not clearly specified.

Why - in the architectural diagrams we drew / spent a bunch of time drawing red dotted lines on, it's become clear that we don't have a good model for which parts of the account data structure should be accessed in what order

Why - the APIs organically grew out of one API which did all the actions, as folks exploded a monolithic service into microservices perhaps, but the data interactions were not also split out so that we could have a one to one relationship between API action and data store

Why - no technical design was followed in making this refactor; if one had been, it could have been more clear to everyone involved that there are direct dependencies between the types of data the APIs access in the data store

So, now we have justification for changes we might make to the system, and also some potential process improvements we can make as well!

Let's do this for one more, and then maybe call it the end of this post since you likely get the idea...

Any of our attackers could potentially figure out how to affect just parts of existing accounts, since none of our APIs have proper authentication

Why - we didn't understand the risks involved in only authenticating a single client ID

Why - we already have user-level / account-level authentication, and we thought that would be sufficient to protect user information we store

Why - we aren't subject to things like SOX (let's say $company isn't public), and we don't store credit card data (we use a third party vendor who meets PCI DSS level 1 already), so we don't have to undergo regular third party pentests. We could do GDPR-focused pentesting, but we don't currently

Why - we assume if we aren't seeing much go wrong in our observability tooling (we don't have IDS or SIEM tooling), everything is fine, and if anything goes wrong it's a problem caused by legitimate users

Why - $company do not have enough folks with security experience to provide the partnership necessary at every stage of the SDLC to prevent building a system like this

Conclusion

Applying somewhat overlapping threat modeling techniques can help us catch information we might not otherwise catch; threat modeling as the system evolves helps us understand it better; and we can even use hacks built on techniques we're already familiar with to help gather enough data to approach less familiar techniques.

We haven't gotten into asking, part by part versus as a whole, whether this system does or does not meet certain security properties, but perhaps that's better left for a follow-on.

04 June, 2021

Signal safety number privacy issues

Credit

Affected application versions (known as of 6/4/21)

  • 5.13.0 and below (iOS)
  • 5.3.0 and below (macOS)
  • 5.3.0 and below (Windows)
  • 5.10.8 and below (Android)
  • 5.3.0 and below (Linux)

Initial test versions (5/13/21)

  • 5.11.0 and below (iOS)
  • 5.1.0 and below (macOS)
  • 5.1.0 and below (Windows)
  • 5.9.7 and below (Android)
  • 5.1.0 and below (Linux)

Intro

Signal provides a free, cross-platform private messenger app. Folks in all kinds of unsafe situations rely on Signal, as a highly visible and popular app which the security and privacy professional communities endorse. Journalists rely on Signal to ensure confidential communication with their sources.

What privacy guarantees do you really have, though, if you can't prove the identity of the person you're communicating with?

The problem

Mid-May, I got a new phone. At the time I understood that with *any change* to the device or installation of either party in a chat with message history, the Signal chat "safety number" changes. This used to be but (following an involved email back-and-forth with the Signal team over the course of a month) is no longer reflected in the Signal support documentation.

When a safety number changes, Signal shows a message to both parties in the conversation. The most recent alert I recall seeing prior to this adventure (which I believe was initially received April 14, about a month before I changed phones) looks like this:


Expecting similar alerts to be sent out to my existing chat threads upon phone changeover, I messaged a few of my more recent chats.


Signal has a pretty convenient iOS device transfer method to help migrate everything over (I later discovered it not only transfers your chat threads and settings, it also transfers your key material) by simply scanning a QR code on your new device using the old device. It worked beautifully. But then - nada más. I went to check docs to see if I had missed something obvious.

In the Signal user support documentation:


Also from some old Signal blog posts announcing the most recent to date safety number feature updates (2016, 2017) it seemed like my contacts should have gotten alerted when I changed phones.

What *are* safety numbers anyway?

As far as I know, the idea of safety numbers as implemented in Signal doesn't have a publicly available product-level nor technical specification, unlike some of the other algorithm and protocol components.

Backing up a few steps for folks who aren't familiar (anyone who knows Signal can obviously skip this), here's a bit more on how safety numbers work.

Let's say there are two participants in a Signal chat, Alice and Bob. This chat has a single unique safety number which both parties can check in the app.

This number is a human-readable representation of Alice and Bob's shared public key material. Slightly more technically put, it's a combination of Alice and Bob's individual cryptographic fingerprints in decimal format. There's also a QR code version in the mobile app so it's easy for folks to compare.
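To make the idea concrete, here's a conceptual sketch in Node of combining two key fingerprints into one comparable decimal string. This is an illustration of the concept only, not Signal's actual safety number derivation:

    // Conceptual illustration only -- NOT Signal's actual safety number algorithm.
    // Idea: hash each party's public identity key into a decimal "fingerprint",
    // then combine both fingerprints in a fixed order so Alice and Bob both
    // compute the same human-comparable number.
    const crypto = require('crypto');

    function fingerprint(publicKeyBytes) {
      const digest = crypto.createHash('sha512').update(publicKeyBytes).digest();
      const chunks = [];
      for (let i = 0; i < 12; i += 2) {
        chunks.push(String(digest.readUInt16BE(i) % 100000).padStart(5, '0'));
      }
      return chunks;
    }

    function safetyNumber(aliceKey, bobKey) {
      // Sort so both sides get the same ordering regardless of who computes it.
      const [first, second] = [fingerprint(aliceKey), fingerprint(bobKey)]
        .sort((a, b) => a.join('').localeCompare(b.join('')));
      return [...first, ...second].join(' ');
    }

    const aliceKey = crypto.randomBytes(32); // stand-ins for identity public keys
    const bobKey = crypto.randomBytes(32);
    console.log(safetyNumber(aliceKey, bobKey));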

Here's a safety number in the Android app:



The safety number for Alice and Bob *should eventually* change whenever Alice *or* Bob change their Signal installation (example: uninstalling and reinstalling the app to prevent a third party seeing the chat history on Alice's phone). This allows Alice and Bob to verify the privacy of their communication over Signal as desired.

Alice and Bob can also mark their safety number verified in a way that is supposed to make sending messages to their chat always require manual approval right in the UI.

Again, however, there had been no safety number alerts in the UI for any of my existing chat threads. We later found out that it's an intentional product decision by Signal staff to not have device "quick start" transfers cause safety number changes, even though this is inconsistent with the app behavior on every other kind of device or installation change, and inconsistent with the documentation.

Sesame documentation

Signal enables users to have multiple devices on the same account through the Sesame device and session management algorithm. The need for users to verify each other to get any real privacy guarantee gets mentioned multiple times in the Sesame doc.

X3DH documentation

Here's another callout about the requirement for identity checking.

Initial Investigation

To reproduce the issue, I transferred Signal to my old device again. John Jackson observed that the safety number for our chat stayed the same before and after transfer, where he should have been alerted and the safety number should have changed:

Before

After


I additionally checked the chat safety number on another device associated with my Signal account and saw it had also not changed there:

Desktop device

Further Investigation: App Uninstallation

Later with Rob, Sick Codes, and others we observed that safety numbers also did not change for multiple user pairs across all device types Signal currently supports (Linux, macOS, Android, iOS, Windows) when at least one party deleted the app and reinstalled it on one of their linked devices. This meant the issue was not something isolated to communicating with my user account.

Requesting a CVE ID and notifying vendor

At this point we felt we had two fairly well-tested cases where the documentation did not reflect the observed behavior:

  • chat safety numbers not changing on device transfer
  • chat safety numbers not changing on Signal uninstall/reinstall on same device

We requested a CVE and emailed security@signal.org a first draft we proposed for the CVE follow-on writeup with this summary:

Missing cryptographic step in Signal Private Messenger ("Signal") across multiple platforms allows an attacker to masquerade as a victim Signal user to the victim's contacts. Signal at time of writing does not rotate the safety key ("safety number") between a user pair upon re-installation of the application, nor on transfer of application data from one device to another using a method such as iOS Quick Start, despite clear indication in the Signal documentation this must occur in order to let the user's contacts know the user's device or installation has changed. Failure of key rotation results in lack of non-repudiation of communications and indeterminate potential for impersonation and man-in-the-middle attacks.

Safety number verification

Then, I started to wonder if safety numbers actually changed under any circumstances.

Rob and I determined that the "you have not marked person as verified" functionality also did not force a safety number change when Signal was uninstalled and reinstalled on Rob's device, most likely due to the same underlying issue that was causing safety numbers to not change on uninstallation and reinstallation:

Prior to verification

Following verification

Following Rob deleting and reinstalling the app

Clearing data then uninstalling does actually cause safety number change on reinstall

Sick Codes and I proved the chat safety number did actually change upon clearing data in some flavours of the desktop app. Clearing my data and then uninstalling Signal on macOS caused me to lose all chat threads and the contents of group chats on the macOS device. Chat threads were all still present across my other devices until I resynced my phone to the cleared desktop app, at which point all messages from before clearing my data were gone from my phone as well.

Before clearing data on desktop app

After clearing data

I accidentally went through a CAPTCHA loop twice trying to get everything put back together properly, but it was worth it to prove safety number changes can happen sometimes if you really try.

Vendor communications

Some time passed before Signal requested further information. We provided many of the screenshots included here and detailed some ways we thought the issues could be abused to cause Signal users harm. We felt it was clear the app behaviour and documentation did not match when we reported, but Signal staff surprisingly said they were unable to reproduce.

After several more emails spaced out over multiple days, Signal staff requested a call. As everyone mentioned in this blog works full time elsewhere and this is our side gig, it was very kind of Signal staff to agree to take a call after Pacific Time working hours, but the call was ultimately not very useful from our perspective. We arrived at a deadlock, and pretty much nothing else productive came from our long email chain. Eventually, Signal staff stopped replying after telling us they planned to update the customer docs and not change anything else. We later found that, after dismissing our report, Signal had not only updated the customer docs but had also started rolling out patches for the issues.

Though as far as we understand it wasn't taken into consideration, we provided Signal via email after the call with what we considered to be our ideal timeline and outcome. We would have liked to see disclosure at each step by Signal in order to show accountability to users and the security and privacy communities. We felt this request was appropriate considering the Signal app is a privacy-critical product, people in unsafe situations may rely on it for communications, and it seemed to us that some of Signal's privacy guarantees weren't being met.

Mobile apps' version bump

We tried to obtain independent verification from Ax Sharma this past week, but he unfortunately wasn't able to reproduce the reinstallation lack of safety number change on iOS and Android on Friday (4 June). This tipped us off to something maybe having quietly been changed in the Signal codebase since we were consistently able to reproduce the issue across multiple user/device matchups for several weeks beforehand.

Turns out, there is a freshly baked new version out in the Android and iOS app stores as of Friday. Even if unintentionally included, it seems we have our fix for at least the uninstall/reinstall safety number change issue on mobile:

Android

iOS

Safety number now changes on mobile for uninstall / reinstall!

Where before you could uninstall the app, wait awhile, reinstall it, and just get dropped back into all your chats with all message history available, now you have to follow a registration workflow, the chat safety number changes, and the party on the other end gets a UI alert. Further, on iOS any previous messages are no longer available on app reinstallation.

From uninstalling and reinstalling on iOS the evening of 4 June:

Before (seen on John's phone)

New prompt on iOS

After (seen on Kelly's phone)

After (seen on John's phone)

Desktop Electron app

Upon learning from Ax that uninstall/reinstall now showed a safety number change and proving that for ourselves, we wanted to see if the same was true for the desktop app, which is more or less the same Electron/NodeJS codebase across macOS, Linux, and Windows. We noticed that despite a similar version bump, the desktop app on macOS still did not produce a safety number update or show alerts after uninstall/reinstall for a chat:

Before uninstall/reinstall

Version info

I was unable to start the reinstalled 5.3.0 dmg I had saved in my downloads and had to grab 5.4.0:

after uninstall/reinstall

Device transfer

I also wanted to see if the device transfer issue was fixed. I got an interesting interstitial which wouldn't let me complete the transfer until I updated the Signal version on my old device, but even after the update my contacts didn't see safety number changes.

This time, I verified the safety number for my chat thread with John was the same from both of my own devices as well as before and after doing the transfer. John, or anyone else communicating with me, is still unable to tell anything has changed at all after transfer from my old to my new phone, or vice versa.

Before (new phone)

After (back on old phone again)

Why are we doing this?

We don't want anyone to get hurt by way of trusting privacy guarantees which may be more conditional than they appear from the docs!

If Bob notices the chat safety number with Alice has changed, and then Alice sends a bunch of suspect-sounding messages or asks to meet in person when Bob has never met Alice in person before, Bob should be wary. Alice could, for example, be forced to provide a device passcode or to unlock their device with their fingerprint or face, after which Alice's device could be cloned over to a new device by way of the quick transfer functionality without Alice's consent, and the messages could be coming from the cloned device rather than Alice's actual device.

Timeline

  • 12 May 2021: vulnerability discovered
  • 13 May 2021: CVE requested from Mitre
  • 13 May 2021: vendor notified via security@ email address
  • 14 May 2021: vendor requested additional information
  • 14 May 2021: researchers responded
  • 15 May 2021: vendor requested additional information
  • 15 May 2021: researchers responded
  • 18 May 2021: researchers requested response
  • 18 May 2021: vendor denied vulnerability
  • 19 May 2021: researchers responded
  • 22 May 2021: vendor requested video call
  • 24 May 2021: video call with vendor engineering manager, Kelly Kaoudis, and John Jackson to discuss
  • 25 May 2021: researchers provided sketch of ideal timeline for disclosure to vendor
  • 27 May 2021: vendor notified researchers of planned support page update and lack of plans to mitigate vulnerability or lack of clarity in technical documentation
  • 29 May 2021: researchers discover additional issue
  • 2 June 2021: vendor notified researchers of support page update 
  • 2 June 2021: researchers requested vendor preferred timeline for issue remediation for both initial and second vulnerabilities
  • 2 June 2021: researchers approached Ax Sharma for independent verification and possible writeup
  • 4 June 2021: Ax Sharma unable to reproduce by uninstalling and reinstalling Signal on Android and iOS
  • 4 June 2021: researchers determine safety numbers now change in Android and iOS when uninstalling and reinstalling Signal, but not on macOS nor when performing device transfer between two iOS devices (original device transfer issue remains unpatched)

References

- Signal on GitHub
- Wayback 2021-05-22: Signal Support Center Safety Number Documentation
- Wayback 2021-06-04: Signal Support Center Safety Number Documentation
- Wayback 2021-05-04: Signal blog "Safety Number Updates"
- Wayback 2021-05-04: Signal blog "Verified Safety Number Updates"
- Wayback 2021-06-04: Signal blog "iOS Device Transfer"
- Wayback 2021-06-03: Signal Sesame Specification
- Wayback 2021-06-03: Signal X3DH Specification

Additional Thanks To

Ax Sharma, Amber Harrington, and exp1r3d for their independent assistance testing!

04 April, 2021

Adventures in Systemd-Linux-Dockerland

I am just recently starting to use Docker again after a roughly six-year hiatus (I used it regularly ~2014-2015). I do not recommend following the actions I document here in any way, shape, or form, but I wanted to chronicle them in case it's educational for others, or in case I don't use Docker again until 2030 and want to know what worked in 2021.

Constructive suggestions or corrections (so I know for the future) are welcome! There are likely better or more efficient ways to do all of the following.

Intro

The proper, real, Docker setup documentation for Arch may be found here. If you actually want to use Docker on Arch, I would recommend trying that. You might also want to get a peep at the systemctl and systemd manpages.

Disclaimer: some of the following may be universal across Linux distributions, but as I run Arch, some of it is likely specific to Arch and perhaps also my own setup.

How I got Docker running as described here, from zero, in Arch, differs rather a bit from how people ought to do it. I was going too fast and assumed I still knew stuff. I also barely tolerate systemd, though I've gotten more accustomed to it in recent years.

I hope this illustrates some potential pitfalls of not properly following official documentation for the thing you want to run or the documentation for running said thing on your own distro when you actually want to get something done fast.

What I did, with explanations:

Took a wild guess here and:

$ sudo pacman -S docker

According to this, you need a loop module and then you'll install Docker from the Snap store. Needing loop tracked with my previous experience with Docker so I didn't question it. I don't use Snap (as recommended in the link) given I have Pacman already in Arch, so I proceeded to skip that part of the tutorial and went on to the next bits. 

We need Docker running in order to create and interact with containers. One can run Docker a couple of different ways: with systemd (systemctl), or just by calling dockerd directly. I went the systemd route this time.

$ sudo tee /etc/modules-load.d/loop.conf <<< "loop"
$ sudo modprobe loop # load the loop module now; the conf file above loads it on future boots
$ sudo systemctl start docker.service

The above call to the docker.service unit with sudo is how this tutorial recommended starting Docker, but I felt it didn't make sense for my objective after trying it. With the sudo-prepended call to systemctl, we're (as far as I understand) affecting the root user environment rather than the current user environment, even though the start command will not, to my current knowledge, cause docker.service to automatically run during future sessions. Running `sudo systemctl enable docker.service` instead would do that.

According to the systemctl manpage, `systemctl start docker` and `systemctl start docker.service` are equivalent. If not specified, systemctl will just add the appropriate unit suffix for us (.service by default), but sudo adds a dimension of weirdness here: it seems to override systemctl's usual requirement to ask for your password. `systemctl start docker` without sudo would have done what I actually wanted: simply use systemctl to manage dockerd in a way that is localised to the booted-in user session on my machine. When sudo executes a command in this form, we use it to act as the superuser (root). This, I believe, implies I'd have to also use sudo to initialise, run, etc. any future Docker containers, which wasn't the outcome I wanted.

Back to Docker. At this point, Docker Was Not Up.

`systemctl status docker.service` showed the service had failed to start. I got cranky, so went looking for docker.service. The following is what I got by default when I installed the docker package on Arch. As shown here under the Requires section, docker.socket is a dependency of docker.service:

    
$ cat /usr/lib/systemd/system/docker.service
    [Unit]
    Description=Docker Application Container Engine
    Documentation=https://docs.docker.com
    After=network-online.target docker.socket firewalld.service
    Wants=network-online.target
    Requires=docker.socket

    [Service]
    Type=notify
    # the default is not to use systemd for cgroups because the delegate issues still
    # exists and systemd currently does not support the cgroup feature set required
    # for containers run by docker
    ExecStart=/usr/bin/dockerd -H fd://
    ExecReload=/bin/kill -s HUP $MAINPID
    LimitNOFILE=1048576
    # Having non-zero Limit*s causes performance problems due to accounting overhead
    # in the kernel. We recommend using cgroups to do container-local accounting.
    LimitNPROC=infinity
    LimitCORE=infinity
    # Uncomment TasksMax if your systemd version supports it.
    # Only systemd 226 and above support this version.
    #TasksMax=infinity
    TimeoutStartSec=0
    # set delegate yes so that systemd does not reset the cgroups of docker containers
    Delegate=yes
    # kill only the docker process, not all processes in the cgroup
    KillMode=process
    # restart the docker process if it exits prematurely
    Restart=on-failure
    StartLimitBurst=3
    StartLimitInterval=60s

    [Install]
    WantedBy=multi-user.target

So I tried:

$ systemctl start docker.socket

This worked in that docker.socket came up successfully, but this system state change did not help me run the command from the tutorial I was still, at this point, trying to follow: `sudo systemctl start docker.service`. Starting Docker either with `systemctl start docker` or with `dockerd` should, I believe, start the socket automatically for you, though you may notice that if you `systemctl stop docker` but still have docker.socket active, the docker.socket unit can start docker again as well.

Since docker is a socket-activated daemon as installed by default on Arch, I could have enabled the docker.socket unit and then just used that as a base for my containers in the future. In this mode, systemd would listen on the socket in question and start up docker when I start containers. This style of usage is meant to be less resource-intensive, since the docker daemon would then only run as needed. We could also go full socketception and make socket-activated on-demand container units, if we have containers we want to reuse, and then use systemctl to control them.

But still, all I really wanted to do was `systemctl start docker` and just use docker by itself (no sudo, no extra systemctl units) after that so I tried to fix my environment up again with:

$ systemctl disable --now docker
$ systemctl disable --now docker.socket


So that we can have networking in our containers, it appears Docker will automatically create a docker0 bridge in the DOWN state for us when we start it as our own user account using systemctl:

$ sudo ip link
[sudo] password for user:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback ...
2: enp0s25: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether ...
3: wlp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DORMANT group default qlen 1000
    link/ether ...

$ systemctl status docker
○ docker.service - Docker Application Container Engine
     Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled)
     Active: inactive (dead)
TriggeredBy: ○ docker.socket
       Docs: https://docs.docker.com

$ systemctl start docker
==== AUTHENTICATING FOR org.freedesktop.systemd1.manage-units ====
Authentication is required to start 'docker.service'.
Authenticating as: user
Password:
==== AUTHENTICATION COMPLETE ====

$ systemctl status docker            
● docker.service - Docker Application Container Engine
     Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled)
     Active: active (running) since ...; 1h 36min ago
TriggeredBy: ● docker.socket
       Docs: https://docs.docker.com
   Main PID: 2881 (dockerd)
      Tasks: 20 (limit: 19044)
     Memory: 155.5M
        CPU: 3.348s
     CGroup: /system.slice/docker.service
             ├─2881 /usr/bin/dockerd -H fd://
             └─2890 containerd --config /var/run/docker/containerd/containerd.toml --log-level info

$ sudo ip link           
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback ...
2: enp0s25: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether ...
3: wlp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DORMANT group default qlen 1000
    link/ether ...
6: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
    link/ether ...

At this point, I had Docker itself working as my own user account. I thought I was ready to pull an image down and try to make a container. 

$ docker pull sickcodes/docker-osx:auto

I was wrong! 

Trying to build a headless container from the docker-osx:auto image as specified in the doc I was following did not fully work:

$ docker run -it \
    --device /dev/kvm \
    -p 50922:10022 \
    sickcodes/docker-osx:auto

I kept getting LibGTK errors which, it turned out, were not due to anything being wrong with the container, but rather to an assortment of packages I still needed to install and a few missing group memberships for my user. I got stuck here for a while trying to figure out all the different things I didn't have yet, from a combination of the Arch documentation, the docker-osx documentation, and the rest of the internet. It's plausible you might encounter a similar error trying to run docker-osx if you don't have xhost available; this stumped me for rather a while, since at first I figured the issue was just xhost as described in the docker-osx troubleshooting docs.

Mini disclaimer: I use yay typically, not pacman, but wanted to provide pacman in the previous example since it's more commonly known. I don't recall which of the following packages are in AUR versus standard repositories, but here is what I think I ended up needing:

$ yay -S qemu libvirt dnsmasq virt-manager bridge-utils flex bison iptables-nft edk2-ovmf

I then set up the additional daemons I needed (actually enabling them this time so they'd be available on future boots) and added my user to the libvirt, kvm, and docker groups.

$ sudo systemctl enable --now libvirtd
$ sudo systemctl enable --now virtlogd
$ echo 1 | sudo tee /sys/module/kvm/parameters/ignore_msrs
$ sudo modprobe kvm
$ lsmod | grep kvm
$ sudo usermod -aG libvirt user
$ sudo usermod -aG kvm user
$ sudo usermod -aG docker user


I figured that, just in case the LibGTK error *was* really an xhost problem after all, I'd follow the troubleshooting documentation I was using as well.

$ yay -S xorg-xhost
$ xhost +


Finally, I was able to create a container and boot into it for the first time using:

$ docker run -it \
    --device /dev/kvm \
    -p 50922:10022 \
    sickcodes/docker-osx:auto

Do note this run command makes you a fresh, brand new container every time you run it. You'll be able to see all of yours with `docker ps --all`.

So that I can install things in my container and reuse it, I'll (after this) use `docker start -ai <CONTAINER ID>` or `docker start -ai <NAME>` instead of another round of `docker run -it`, but you may wish to stick to the run command if you want a new container each time you start up.

I also ran into a small snag when I decided to update my system and then suddenly couldn't start containers anymore with either `docker run -it` or `docker start` following running kernel upgrade ("docker: Error response from daemon: failed to create endpoint dazzling_ptolemy on network bridge: failed to add the host (veth1e8eb9b) <=> sandbox (veth73c911f) pair interfaces: operation not supported"), which was exciting, but fixable with just a reboot. 

The cause of this issue is a mismatch between the running kernel - the kernel still running from before the upgrade - and most recent kernel modules which match the new, upgraded kernel instead of the running kernel. On reboot, we boot into our freshly updated Linux version, which matches our most recent modules, so Docker can load kernel modules which match the running kernel once again.

Thus: Docker is either exactly as terrible as I recall, or (worse) I had forgotten just about everything useful I used to know about it, but I think from here on out I'll be mostly okay.