Question about SError exception routing


Julius Werner

19 Nov 2025

5:03 a.m.

Hi,

I have a sort of generic Arm architecture question that's not directly related to TF-A (other than that TF-A controls some of the registers involved in this decision), but I'm hoping that one of the experts here can still help me out or at least refer me to someone who can.

I'm trying to figure out how exception routing for SError aborts works in EL2. Specifically, I have a bootloader (BL33) running in NS-EL2 and I want the "simple" setup that it manages all its own exceptions, the same way that an OS kernel normally manages all exceptions at EL1. I assumed that I could achieve that simply by installing exception handlers, unmasking all exceptions in PSTATE, and leaving all the special trap feature bits in the MSRs at 0 (disabled).

This seems to work for synchronous exceptions and external aborts, but not for SErrors. Looking at the architecture reference manual (revision L.b), Table D1-14 in section D1.3.6.3 (page D1-6114), I can see that my case is represented by the first line (all special trap bits 0), which shows that SErrors caused by EL0 and EL1 would be routed to EL1 as expected (though even when PSTATE.A is 1, which seems odd?), but SErrors caused by EL2 will get ignored and remain pending (with no regard to PSTATE.A). Instead, the "default" behavior I expect (aborts get routed to the EL that caused them if PSTATE.A is 0) seems to require me to enable SCTLR2_EL2.NMEA. But if you look at the description of SCTLR2_EL2.NMEA, it says that it controls whether PSTATE.A masks SError exceptions at EL2 (and that if it is 0, SError exceptions are not taken at EL2 if PSTATE.A == 1). Doesn't that imply that SError exceptions *are* taken at EL2 if PSTATE.A == 0? What does a control that seems to be about trapping masked aborts from a lower EL have to do with unmasked aborts from my current EL?

Basically, I think what I'm asking is: is that table really correct as printed (some behavior we've observed seems to indicate it is), and if so, why? Why do SError exceptions seem to behave differently by default in EL1 and EL2 (with regard to unmasked exceptions taken from the same exception level)? Why does the PSTATE.A bit only seem to apply to EL0 and EL1, not EL2 and EL3, even for exceptions taken from the same level, when this peculiarity seems not to be mentioned anywhere else in the manual? Why do SError exceptions get treated so differently from external aborts in EL2/EL3, when in EL1 they seem to mostly count as the same? Is the current description of the NMEA bit in the SCTLR2_EL2 register documentation really accurate, if it also seems to make fundamental changes to cases not really mentioned in that description? Is there any way for EL2 to only handle its own SError exceptions without interfering with EL1's exception handling when FEAT_DoubleFault2 is not implemented (other than flipping HCR_EL2.AMO on every EL2 entry/exit)? And am I the only one who finds this all incredibly inconsistent and confusing?

I feel like I'm missing some critical insight in how you were meant to think about this to make it make sense, would appreciate any help in that regard!

Thanks, Julius


Ash Wilding

19 Nov

2:51 p.m.

Hi Julius,

In the Arm Architecture, all asynchronous exceptions (IRQ, FIQ, SError) target EL1 by default, unless specifically overridden to be routed to EL2 by HCR_EL2.{IMO, FMO, AMO} or overridden to be routed to EL3 by SCR_EL3.{IRQ, FIQ, EA}. If both EL2 and EL3 try to route a particular type of asynchronous exception to their respective EL (for example, HCR_EL2.AMO == SCR_EL3.EA == 1), then EL3 “wins” as it’s more privileged and the exception will be routed to EL3.
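As a rough illustration of the routing rule just described, here is a minimal C sketch of the SError case. The function name `serror_target_el` is made up for this example, and the model deliberately looks only at the two routing bits named above; it ignores HCR_EL2.TGE, the TMEA/NMEA controls, and everything else the full routing table takes into account.

```c
#include <stdbool.h>

/*
 * Simplified model (an assumption-laden sketch, not the architecture's
 * full rule set): which EL a physical SError targets, given only
 * HCR_EL2.AMO and SCR_EL3.EA.
 */
int serror_target_el(bool hcr_el2_amo, bool scr_el3_ea)
{
    if (scr_el3_ea)
        return 3;   /* EL3 "wins" whenever it asks for SErrors */
    if (hcr_el2_amo)
        return 2;   /* otherwise EL2 gets them if it asked */
    return 1;       /* default target for asynchronous exceptions */
}
```

With both bits clear this returns 1, which is the case the original question was about.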

Separately, the PSTATE.{A, I, F} fields control whether the corresponding type of asynchronous exception is taken in the current EL, but do not prevent an exception from being routed to a higher EL.

So for example, you may have SCR_EL3.FIQ == 1, meaning FIQs are routed to EL3. If you’re currently executing at EL1 and an FIQ is asserted to the core, then the value of PSTATE.F is ignored and the exception is taken to EL3. This is by design, otherwise less privileged software could for example DoS the more privileged software if that FIQ corresponded to a timer tick / maintenance interrupt.

Another nuance here is that an asynchronous exception which targets a lower exception level than the one we’re currently executing in will be pended until we return to that level (or an even lower level).

For example, if we have HCR_EL2.AMO == SCR_EL3.EA == 0, meaning SErrors are targeting the default EL1, and an SError is asserted to the core while executing in EL3, then we will not take the SError exception and it will instead become pending until we return back to EL1/EL0, at which point we’ll take the SError exception – if not masked by PSTATE.A.
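The three rules so far (routing to a higher EL ignores PSTATE.A, PSTATE.A masks at the current EL, and a lower target EL means the exception is pended) can be put together in a small C sketch. The names `classify_serror` and `enum serror_outcome` are invented for this illustration. The model covers execution at EL1 to EL3 only, and it leaves out the NMEA and TMEA controls discussed next, so treat it as a reading aid rather than a restatement of the manual.

```c
#include <assert.h>
#include <stdbool.h>

enum serror_outcome {
    SERROR_TAKEN,    /* exception is taken to target_el now */
    SERROR_MASKED,   /* held back by PSTATE.A at the current EL */
    SERROR_PENDING   /* target EL is lower; waits for a return to it */
};

/*
 * Simplified model of the rules described above. Assumes current_el and
 * target_el are both in 1..3 and that NMEA/TMEA are not in play.
 */
enum serror_outcome classify_serror(int current_el, int target_el,
                                    bool pstate_a)
{
    assert(current_el >= 1 && current_el <= 3);
    assert(target_el >= 1 && target_el <= 3);

    if (target_el < current_el)
        return SERROR_PENDING;  /* e.g. target EL1 while running at EL3 */
    if (target_el > current_el)
        return SERROR_TAKEN;    /* PSTATE.A cannot block a higher EL */
    return pstate_a ? SERROR_MASKED : SERROR_TAKEN;
}
```

For the example in the text, `classify_serror(3, 1, false)` yields `SERROR_PENDING`.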

The final bit of nuance here is around the NMEA and TMEA bits.

The NMEA bit basically says “If we’re executing at the current EL and an SError is asserted that would have been masked by PSTATE.A, take the SError anyway” – with some rules around when this is possible so as to prevent us from e.g. taking an SError in the middle of stacking off context at the entrypoint of some other exception handler.

And the TMEA bit basically says “Route an SError / Synchronous External Abort to me, but only if it’s not been handled by a lower EL”. As an example, if we have something like SCR_EL3.TMEA == HCR_EL2.AMO == PSTATE.A == 1, then we could imagine two scenarios:

1) We’re executing in EL1 and an SError is asserted → PSTATE.A is ignored → the SError is routed to EL2.

2) We’re executing in EL2 and an SError is asserted → PSTATE.A masks the SError → the SError is trapped to EL3 since it wasn’t handled by a lower EL (or more accurately, it was masked by all lower ELs).

Hope that helps to clarify :-)

Cheers, Ash.

Julius Werner

20 Nov

12:57 a.m.

Hi Ash,

Thanks a lot for the detailed explanation, I think this makes a lot more sense now! Starting with "all asynchronous exceptions go to EL1 by default" makes it easier to understand the logic behind these choices (the manual should really mention that somewhere early on before going into all the edge case details). I think I mostly got confused because I grouped SErrors with external aborts in my head (and those work differently), but after some thought I can see why things may need to work differently for asynchronous exceptions. (I thought there was also such a thing as an "asynchronous external abort" in the architecture, but I can't find it anymore? I guess I must have gotten things mixed up with old Armv7 terminology there.)

I still don't quite understand the way PSTATE.A is treated in EL2. Like you said, when SCR_EL3.TMEA == HCR_EL2.AMO == PSTATE.A == 1, the exception is taken to EL3 (whereas with PSTATE.A == 0 it is taken to EL2). But when SCR_EL3.TMEA == 0 in the same situation, it looks like the exception is taken to EL2 regardless of the state of PSTATE.A? Why would setting SCR_EL3.TMEA affect whether PSTATE.A "works" in EL2?

Anyway, it seems like there's no better solution to my scenario than setting either AMO or TGE whenever I'm executing in EL2 and clearing it again when transitioning back. But I guess that works.
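For illustration, the AMO variant of that workaround could look roughly like the C sketch below. All function names here are hypothetical. To keep the sketch runnable on a host, `read_hcr_el2` and `write_hcr_el2` operate on an ordinary variable; on real hardware they would be `mrs`/`msr` accesses to HCR_EL2 followed by an `isb`. The only architectural fact relied on is that HCR_EL2.AMO is bit 5.

```c
#include <stdint.h>

#define HCR_EL2_AMO (UINT64_C(1) << 5)  /* HCR_EL2.AMO is bit 5 */

/* Host-side stand-in for the real system register (assumption). */
static uint64_t fake_hcr_el2;

uint64_t read_hcr_el2(void)    { return fake_hcr_el2; }
void write_hcr_el2(uint64_t v) { fake_hcr_el2 = v; }

/* Call on every entry to EL2: route physical SErrors to EL2. */
void el2_entry_claim_serrors(void)
{
    write_hcr_el2(read_hcr_el2() | HCR_EL2_AMO);
}

/* Call just before returning to EL1/EL0: restore the default routing. */
void el2_exit_release_serrors(void)
{
    write_hcr_el2(read_hcr_el2() & ~HCR_EL2_AMO);
}
```

Both helpers use read-modify-write so that the other HCR_EL2 bits are left untouched.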

Thanks, Julius


Ash Wilding

10:17 a.m.

Hi Julius,

Happy to help :-)

Asynchronous External Aborts were renamed to SErrors (and expanded in scope/definition) in Armv8. You can still see holdovers of this in certain places, such as with the SError routing bit in SCR[_EL3] still being named “EA”.

I think some of the confusion here may be stemming from the fact that Table D1-14 is not actually telling you whether the exception is taken/masked; rather, it’s strictly telling you which EL is the target EL of the exception.

That’s a subtle but important difference, and in fact whether an exception is taken/masked is described separately in Table D1-22.

In the example you give, the row in table D1-14 with SCR_EL3.TMEA == 0 and HCR_EL2.AMO == 1 shows that regardless of the value of PSTATE.A, the target exception level is EL2 if we’re currently executing in EL0/EL1/EL2 whereas the exception is pended if we’re currently executing in EL3.

Finding the corresponding rows in Table D1-22, you can see that SCTLR2_EL2.NMEA being toggled on upgrades the masking rules when currently executing at EL2 from “B” (the SError may be subject to masking by PSTATE.A) to “A” (the SError is taken regardless of the value of PSTATE.A).

The key here is that you need to combine Tables D1-14 and D1-22 together.

For example, you’ll notice that in Table D1-22, toggling SCR_EL3.TMEA on also upgrades the masking rules when currently executing at any of EL0/EL1/EL2 to “A”, meaning the exception is always taken. This may sound odd at first, but it’s correct as it’s saying that we’ll either be taking the exception at a lower EL, or we’ll be trapping the exception up to EL3 if it was masked by all lower levels; so the exception will definitely be taken *somewhere*, and Table D1-14 tells you where that will be depending on the values of SCR_EL3.EA / HCR_EL2.AMO / HCRX_EL2.TMEA / etc.

Cheers, Ash.

Ash Wilding

24 Nov

5:18 p.m.

Hi Julius,

I’ve been discussing this internally and realise my description of NMEA / TMEA earlier was a bit off. Apologies, I’d wrongly conflated some of the separate behaviour around the way FEAT_NMI added support for non-maskable IRQs/FIQs and using PSTATE.SP as a mask.

In my first reply to you I said:

<quote> The NMEA bit basically says “If we’re executing at the current EL and an SError is asserted that would have been masked by PSTATE.A, take the SError anyway” – with some rules around when this is possible so as to prevent us from e.g. taking an SError in the middle of stacking off context at the entrypoint of some other exception handler.

And the TMEA bit basically says “Route an SError / Synchronous External Abort to me, but only if it’s not been handled by a lower EL”. </quote>

In practice, an SError taken when NMEA=1 *can* corrupt state if you happened to be in the middle of stacking off registers / etc. in an exception handler at the time of taking the SError.

With that in mind, setting NMEA=1 at ELx effectively says “I'd rather corrupt volatile state in SPSR_ELx / ELR_ELx / ESR_ELx / etc. than propagate an error by not taking an exception right now”, and then TMEA being set at a higher EL effectively says “Route the SError to me if it’s masked at the target EL, or would corrupt volatile state.”

We can see this in rows 2 and 4 of Table D1-14.

Row 2 says:
- HCRX_EL2.TMEA == PSTATE.A == 1
- SCTLR2_EL1.NMEA == 0
- SErrors at EL0 go to EL2 because they’re masked by PSTATE.A
- SErrors at EL1 also go to EL2 because they’re masked by PSTATE.A

Row 4 says:
- HCRX_EL2.TMEA == PSTATE.A == 1
- SCTLR2_EL1.NMEA == 1
- SErrors at EL0 go to EL1 because NMEA overrides PSTATE.A
- SErrors at EL1 instead go to EL2 because NMEA overriding PSTATE.A may lead to corrupt state at EL1, and this is trapped by TMEA
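The two rows as described above can be written down as a tiny C lookup. This is only a restatement of those four outcomes, under the precondition HCRX_EL2.TMEA == PSTATE.A == 1 that both rows share; the function name is invented for the example and no other row of Table D1-14 is modelled.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Target EL for an SError taken from EL0 or EL1, assuming
 * HCRX_EL2.TMEA == 1 and PSTATE.A == 1 (rows 2 and 4 as described above).
 */
int masked_serror_target_with_tmea(int current_el, bool sctlr2_el1_nmea)
{
    assert(current_el == 0 || current_el == 1);

    if (current_el == 0 && sctlr2_el1_nmea)
        return 1;   /* row 4, EL0: NMEA overrides PSTATE.A */
    return 2;       /* every other case here: TMEA traps it to EL2 */
}
```

Setting NMEA changes the outcome only for SErrors taken from EL0; from EL1 the target stays EL2 in both rows.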

As an aside, a positive outcome of this internal discussion is that we’ve captured a JIRA to improve the descriptions of the NMEA and TMEA bits in a future release of the Arm ARM, which should make the above behaviour more readily understandable :-)

Cheers, Ash.

Julius Werner

25 Nov

8:36 p.m.

Hi Ash,

Thanks for the correction! I appreciate knowing that this stuff can sometimes be difficult to understand perfectly for you guys as well. ;)

If you're taking pointers on how to make the manual more readable, I'd suggest putting that first thing you said (asynchronous exceptions are taken to EL1 by default, unless specifically redirected by a control) somewhere early in the chapter for exceptions, because I really felt like that helps set the framing of what all those controls are trying to achieve in a way that is currently not clear from the text. Maybe mention it in D1.3.1.3 (Synchronous and asynchronous exceptions)? (Then you could explain the equivalent default for synchronous exceptions as well to illustrate the difference — I think that would be something like "synchronous exceptions are taken to the current exception level by default, except EL0->EL1, HVC->EL2, SMC->EL3"?)

I would also consider maybe combining the SError exception routing and masking table into a single table, because it is quite confusing that you have to look through two tables to fully understand what happens. The routing table already has a column for PSTATE.A anyway, so might as well expand that to cover every masked and unmasked case?

