Hi Bin Wu,
Here's an early observation. On receiving the RAS FIQ, the following call chain occurs:
ehf_el3_interrupt_handler => sgi_ras_intr_handler => spm_sp_call (enters/exit the SP to handle the injected RAS error) => sdei_dispatch_event
    se = get_event_entry(map);
    if (!can_sdei_state_trans(se, DO_DISPATCH))
        return -1;

    p *map
    $6 = {ev_num = 804, intr = 0, map_flags = 112, reg_count = 0, lock = {lock = 0}}
    p *se
    $4 = {ep = 0, arg = 0, affinity = 0, reg_flags = 0, state = 0 '\0'}
sdei_dispatch_event exits with an error at this stage, which does not look like correct behaviour. The SDEI handler is not called in the NS world and the context remains unchanged. The interrupt handler then blindly returns to the S-EL1 SP context at the same location where it last exited:
sgi_ras_intr_handler => ehf_el3_interrupt_handler => vector_entry fiq_aarch64 => el3_exit => re-enters the SP with X0=0xC4000061
The SP then exits, but the EL3 context has not been set up for SP entry, which leads to the crash.
IMO there is an issue around mapping the SDEI event number to the RAS interrupt number, which leads to sdei_dispatch_event exiting early.
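To make the failure path concrete, here is a minimal sketch of the state check described above. It is a simplified model written for this mail, not the TF-A source: the state bits, `struct model_event`, `model_dispatch` and `model_ras_handler` are all hypothetical names, and the rule "dispatch only when registered and enabled" is an assumption about how such a check typically works. The only facts taken from the dump are that the event entry has ep = 0 and state = 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state bits for this sketch; not copied from the TF-A headers. */
#define EV_REGISTERED (1u << 0)
#define EV_ENABLED    (1u << 1)
#define EV_RUNNING    (1u << 2)

/* Cut-down stand-in for the per-event entry shown in the gdb dump. */
struct model_event {
    uint64_t ep;    /* handler entry point supplied by the NS client */
    uint8_t  state; /* 0 means nothing has been registered */
};

/* What the caller learns after attempting a dispatch. */
enum dispatch_result {
    HANDLER_SCHEDULED,  /* context was prepared for the NS handler */
    NOTHING_SCHEDULED   /* dispatch refused; no context was prepared */
};

/* Assumed rule: dispatch is legal only for a registered, enabled, idle event. */
bool model_can_dispatch(const struct model_event *ev)
{
    return ev->state == (EV_REGISTERED | EV_ENABLED);
}

/* Mirrors the early "return -1" path: state is left untouched on failure. */
int model_dispatch(struct model_event *ev)
{
    assert(ev != NULL);
    if (!model_can_dispatch(ev))
        return -1;
    ev->state = (uint8_t)(ev->state | EV_RUNNING);
    return 0;
}

/* A caller that inspects the result instead of assuming success. */
enum dispatch_result model_ras_handler(struct model_event *ev)
{
    return (model_dispatch(ev) == 0) ? HANDLER_SCHEDULED : NOTHING_SCHEDULED;
}
```

With an entry like the one in the dump (state = 0), `model_dispatch` returns -1 and `model_ras_handler` reports `NOTHING_SCHEDULED`, so the caller knows no handler context was prepared and can decide how to unwind rather than exiting as if the dispatch had succeeded.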
Regards, Olivier.
________________________________________
From: TF-A <tf-a-bounces@lists.trustedfirmware.org> on behalf of Matteo Carlini via TF-A <tf-a@lists.trustedfirmware.org>
Sent: 14 April 2020 10:41
To: 吴斌(郅隆); tf-a@lists.trustedfirmware.org; Thomas Abraham; Deepak Pandey
Cc: nd
Subject: Re: [TF-A] Reply: Re: [RAS] BL32 UnRecognized Event - 0xC4000061 and BL31 Crashed
Looping-in Thomas & Deepak, responsible for the RD-N1 landing team platforms releases. They might be able to help.
Thanks Matteo
From: TF-A <tf-a-bounces@lists.trustedfirmware.org> On Behalf Of 吴斌(郅隆) via TF-A
Sent: 14 April 2020 06:47
To: TF-A <tf-a-bounces@lists.trustedfirmware.org>; Raghu Krishnamurthy via TF-A <tf-a@lists.trustedfirmware.org>
Subject: [TF-A] Reply: Re: [RAS] BL32 UnRecognized Event - 0xC4000061 and BL31 Crashed
Hi Raghu,
Really appreciate your help.
I downloaded this software stack from git.linaro.org. It includes ATF, the kernel, edk2 and so on. The user guide I used from Linaro is: https://git.linaro.org/landing-teams/working/arm/arm-reference-platforms.git...
1) What platform are you running on? Can this issue be reproduced outside your testing environment, perhaps on FVP or QEMU?
A: I am running on the ARM N1-Edge FVP platform. It can be reproduced on this FVP platform.

2) What version of TF-A and StandaloneMM is being used? Preferably the commit-id, so that we can be sure we are looking at the same code.
A: TF-A: https://git.linaro.org/landing-teams/working/arm/arm-tf.git tag: RD-INFRA-20191024-RC0
StandaloneMM seems to be built from edk2 and edk2-platforms, so I have only given the version information for those two. If I missed anything, please let me know.
edk2: https://git.linaro.org/landing-teams/working/arm/edk2.git tag: RD-INFRA-20191024-RC0
edk2-platforms: https://git.linaro.org/landing-teams/working/arm/edk2-platforms.git tag: RD-INFRA-20191024-RC0

3) What version of the kernel and sdei driver is being used?
A: kernel-release: https://git.linaro.org/landing-teams/working/arm/kernel-release.git tag: RD-INFRA-20191024-RC0
The sdei driver is included in the kernel. Do I need to provide the sdei driver version separately? If so, please let me know.

4) I can't tell from looking at the log but do you know if writing 0x123 to sde_ras_poison causes a DMC620 interrupt or an SError or external abort through memory access?
A: Sorry, the Linaro guide only says that it injects a DMC-620 single-bit RAS error, so I am not sure which exception type it triggers.
BRs, Bin Wu
------------------ Original Message ------------------
From: TF-A <tf-a-bounces@lists.trustedfirmware.org>
Sent: Tue Apr 14 01:25:47 2020
To: Raghu Krishnamurthy via TF-A <tf-a@lists.trustedfirmware.org>
Subject: Re: [TF-A] [RAS] BL32 UnRecognized Event - 0xC4000061 and BL31 Crashed

Hello,
Does BL31 need to send 0xC4000061 event to BL32 again?
I don't think it will. It is really odd that 0xC4000061 (SP_EVENT_COMPLETE_AARCH64) ever reaches the BL32/MM handler. This is from a quick look at the upstream code, but it definitely depends on the platform you are running, the version of TF-A you are using and the build options used. Is it possible that the DMC620 error is handled successfully and a separate issue occurs right afterwards, causing the unhandled exception and the crash? From the register dump it looks like there was an Instruction Abort exception at address 0 while running in EL3. Something seems to have gone seriously wrong for 0xC4000061 to go back to BL32 and for an instruction abort to occur at address 0.
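For reference, the value itself can be broken down with the SMC Calling Convention function ID layout. The sketch below is only an illustration: `struct smc_fid_fields` and `decode_smc_fid` are hypothetical names made up for this mail, not TF-A functions, and the field positions used are the standard SMCCC ones (bit 31 fast call, bit 30 SMC64, bits 29:24 owning entity, bits 15:0 function number).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Fields of a 32-bit SMC function ID, per the SMCCC layout. */
struct smc_fid_fields {
    bool     fast_call; /* bit 31: 1 = fast call, 0 = yielding call */
    bool     smc64;     /* bit 30: 1 = SMC64, 0 = SMC32 */
    uint8_t  owner;     /* bits 29:24: owning entity number */
    uint16_t func_num;  /* bits 15:0: function number within the owner */
};

/* Hypothetical helper: split a function ID into its fields. */
struct smc_fid_fields decode_smc_fid(uint32_t fid)
{
    struct smc_fid_fields f;

    f.fast_call = ((fid >> 31) & 0x1u) != 0;
    f.smc64     = ((fid >> 30) & 0x1u) != 0;
    f.owner     = (uint8_t)((fid >> 24) & 0x3Fu);
    f.func_num  = (uint16_t)(fid & 0xFFFFu);
    return f;
}
```

Decoding 0xC4000061 this way gives a fast SMC64 call with owner 4 (the Standard Secure Service range) and function number 0x61, which matches it being a service call addressed to EL3 rather than an event meant for the partition.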
Does the current TF-A support running the RAS test? It seems BL31 will crash.
See above. The answer really depends on the factors mentioned above.
The following would be helpful to know:
1) What platform are you running on? Can this issue be reproduced outside your testing environment, perhaps on FVP or QEMU?
2) What version of TF-A and StandaloneMM is being used? Preferably the commit-id, so that we can be sure we are looking at the same code.
3) What version of the kernel and sdei driver is being used?
4) I can't tell from looking at the log but do you know if writing 0x123 to sde_ras_poison causes a DMC620 interrupt or an SError or external abort through memory access?
Thanks Raghu
On 4/13/20 12:16 AM, 吴斌(郅隆) via TF-A wrote:
Dear Friends,
I am using TF-A to test the RAS feature. When I trigger a DMC620 RAS error in Linux (echo 0x123 > /sys/kernel/debug/sdei_ras_poison), BL32 receives an unrecognized event, 0xC4000061 (SP_EVENT_COMPLETE_AARCH64), and BL31 finally crashes.
In my understanding, 0xC4000061 should be consumed by BL31 and not sent to BL32 again.
Part of the error log is below:
CperWrite - CperAddress@0xFF610064
CperWrite - 1 Section@FFBE91A8, Length 80, SectionType@FFBE9138
CperWrite - Got Error Section: Platform Memory.
MmEntryPoint Done
Received delegated event
X0 : 0xC4000061 X1 : 0x0 X2 : 0x0 X3 : 0x0
Received event - 0xC4000061 on cpu 0
UnRecognized Event - 0xC4000061
Failed delegated event 0xC4000061, Status 0x2
Unhandled Exception in EL3.
x30 = 0x0000000000000000
x0 = 0x00000000ff007e00
x1 = 0xfffffffffffffffe
x2 = 0x00000000600003c0
x3 = 0x0000000000000000
x4 = 0x0000000000000000
x5 = 0x0000000000000000
x6 = 0x00000000ff015080
x7 = 0x0000000000000000
x8 = 0x00000000c4000061
x9 = 0x0000000000000021
x10 = 0x0000000000000040
x11 = 0x00000000ff00f2b0
x12 = 0x00000000ff0118c0
x13 = 0x0000000000000002
x14 = 0x00000000ff016b70
x15 = 0x00000000ff003f20
x16 = 0x0000000000000044
x17 = 0x00000000ff010430
x18 = 0x0000000000000e3c
x19 = 0x0000000000000000

For the full error log, please refer to the attachment.
My questions are:
- Does BL31 need to send 0xC4000061 event to BL32 again?
- Does the current TF-A support running the RAS test? It seems BL31 will crash.
Appreciate your help.
BRs, Bin Wu
--
TF-A mailing list
TF-A@lists.trustedfirmware.org
https://lists.trustedfirmware.org/mailman/listinfo/tf-a