Discussion:
BUG at drivers/iommu/amd_iommu.c:1436!
Mark Hounschell
2016-11-17 19:13:59 UTC
Permalink
Kernel version is 4.8.0. No failure when the IOMMU is disabled. This is an out of tree GPL driver using pci_alloc_consistent/pci_free_consistent. The free causes this.

The commit is:

2d4c515bf06c9bce87b546279413621f847ef6a3 is the first bad commit
commit 2d4c515bf06c9bce87b546279413621f847ef6a3
Author: Joerg Roedel <***@suse.de>
Date: Tue Jul 5 16:21:32 2016 +0200

iommu/amd: Remove other remains of old address allocator

There are other remains in the code from the old allocatore.
Remove them all.

Signed-off-by: Joerg Roedel <***@suse.de>

:040000 040000 87b020717cdd7dcab45e3574dfb6649d05dcc044 817e613acede15fafb70b046d831f558e6bffd93 M drivers



Nov 16 08:39:15 harley kernel: kernel BUG at drivers/iommu/amd_iommu.c:1436!
Nov 16 08:39:15 harley kernel: invalid opcode: 0000 [#1] PREEMPT SMP
Nov 16 08:39:15 harley kernel: Modules linked in: gpiohsd(O) bnep bluetooth rfkill nvidia(PO) drm af_packet iscsi_ibft iscsi_boot_sysfs snd_hda_codec_realtek snd_hda_codec_generic kvm snd_hda_intel snd_hda_codec snd_hda_core xhci_pci snd_hwdep xhci_hcd snd_pcm synclink_gt osst 3c59x r8169 irqbypass crc32_pclmul dgap(C) snd_timer n_hdlc hdlc crc32c_intel snd tpm_infineon st joydev mii aesni_intel shpchp aes_x86_64 glue_helper tpm_tis tpm_tis_core input_leds lrw tpm k10temp fam15h_power soundcore gf128mul ablk_helper processor i2c_piix4 fjes serio_raw pcspkr cryptd dm_mod sr_mod cdrom ata_generic aic79xx pata_atiixp ohci_pci aic7xxx scsi_transport_spi mxm_wmi wmi button sg autofs4
Nov 16 08:39:15 harley kernel: CPU: 6 PID: 4750 Comm: trusim1a Tainted: P C O 4.8.0 #1
Nov 16 08:39:15 harley kernel: Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./990FXA-UD5, BIOS FB 01/23/2013
Nov 16 08:39:15 harley kernel: task: ffff88042d849c00 task.stack: ffff880439910000
Nov 16 08:39:15 harley kernel: RIP: 0010:[<ffffffff8061ebf3>] [<ffffffff8061ebf3>] iommu_unmap_page+0xd3/0xe0
Nov 16 08:39:15 harley kernel: RSP: 0018:ffff880439913b60 EFLAGS: 00010202
Nov 16 08:39:15 harley kernel: RAX: 000000000000122f RBX: 0000000000001000 RCX: 0000000000000027
Nov 16 08:39:15 harley kernel: RDX: fffffffffffffdba RSI: ffff88043b13a0a0 RDI: 0000000000000000
Nov 16 08:39:15 harley kernel: RBP: ffff880439913b90 R08: 0000000000000000 R09: 0000000000000000
Nov 16 08:39:15 harley kernel: R10: 0000000000000005 R11: 0000000000000002 R12: ffff88043b13a0a0
Nov 16 08:39:15 harley kernel: R13: ffff88043b13a000 R14: 0000000000001230 R15: ffffffffffec6260
Nov 16 08:39:15 harley kernel: FS: 00007fb360e6e700(0000) GS:ffff88044fd80000(0000) knlGS:0000000000000000
Nov 16 08:39:15 harley kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 16 08:39:15 harley kernel: CR2: 00007fe345af58f0 CR3: 000000042d48a000 CR4: 00000000000406e0
Nov 16 08:39:15 harley kernel: Stack:
Nov 16 08:39:15 harley kernel: 0000000000000246 0000000000000001 ffffffffffec7000 0000000000000001
Nov 16 08:39:15 harley kernel: ffff88043b13a000 ffffffffffec6000 ffff880439913bd0 ffffffff80620298
Nov 16 08:39:15 harley kernel: ffff88043d2480a0 0000000000001000 ffff88043d2480a0 ffffffffffec6000
Nov 16 08:39:15 harley kernel: Call Trace:
Nov 16 08:39:15 harley kernel: [<ffffffff80620298>] __unmap_single.isra.23+0x58/0x1b0
Nov 16 08:39:15 harley kernel: [<ffffffff80620a66>] free_coherent+0x76/0xc0
Nov 16 08:39:15 harley kernel: [<ffffffffa0f9fda5>] ehsd_set_signal+0x395/0x620 [gpiohsd]
Nov 16 08:39:15 harley kernel: [<ffffffff8027c4f3>] ? finish_task_switch+0x73/0x1e0
Nov 16 08:39:15 harley kernel: [<ffffffffa0fa4629>] ehsd_ioctl+0xdb9/0x17cb [gpiohsd]
Nov 16 08:39:15 harley kernel: [<ffffffff8024ef40>] ? pat_enabled+0x20/0x20
Nov 16 08:39:15 harley kernel: [<ffffffff80260750>] ? walk_system_ram_range+0x70/0xc0
Nov 16 08:39:15 harley kernel: [<ffffffff80373f98>] ? unmap_page_range+0x698/0x8a0
Nov 16 08:39:15 harley kernel: [<ffffffff80524da9>] ? find_next_bit+0x19/0x20
Nov 16 08:39:15 harley kernel: [<ffffffff8050fcfc>] ? cpumask_any_but+0x2c/0x40
Nov 16 08:39:15 harley kernel: [<ffffffff80251bac>] ? flush_tlb_mm_range+0x4c/0x1a0
Nov 16 08:39:15 harley kernel: [<ffffffff80372947>] ? tlb_finish_mmu+0x17/0x50
Nov 16 08:39:15 harley kernel: [<ffffffff8037a3f0>] ? unmap_region+0xe0/0x110
Nov 16 08:39:15 harley kernel: [<ffffffff80516328>] ? __rb_erase_color+0x138/0x280
Nov 16 08:39:15 harley kernel: [<ffffffff803d076f>] do_vfs_ioctl+0x8f/0x5a0
Nov 16 08:39:15 harley kernel: [<ffffffff8037c76d>] ? do_munmap+0x27d/0x370
Nov 16 08:39:15 harley kernel: [<ffffffff803d0cf4>] SyS_ioctl+0x74/0x80
Nov 16 08:39:15 harley kernel: [<ffffffff80867f5f>] entry_SYSCALL_64_fastpath+0x17/0x93
Nov 16 08:39:15 harley kernel: Code: d7 49 01 c7 4c 39 f3 77 89 4d 85 f6 75 14 48 83 c4 08 4c 89 f0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 0b 49 8d 46 ff 4c 85 f0 74 e3 <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 8b 57 7c 48 89 f0 85 d2
Nov 16 08:39:15 harley kernel: RIP [<ffffffff8061ebf3>] iommu_unmap_page+0xd3/0xe0
Nov 16 08:39:15 harley kernel: RSP <ffff880439913b60>
Nov 16 08:39:15 harley kernel: ---[ end trace 45a510d4d695b2d3 ]---


Regards
Mark
Joerg Roedel
2016-11-17 21:41:41 UTC
Permalink
Hi Mark,
Post by Mark Hounschell
Kernel version is 4.8.0. No failure when the IOMMU is disabled. This
is an out of tree GPL driver using
pci_alloc_consistent/pci_free_consistent. The free causes this.
Can you please run the driver with dma-api debugging enabled? Does it
trigger any warnings?



Joerg
Joerg Roedel
2016-11-17 22:00:45 UTC
Permalink
Post by Joerg Roedel
Hi Mark,
Post by Mark Hounschell
Kernel version is 4.8.0. No failure when the IOMMU is disabled. This
is an out of tree GPL driver using
pci_alloc_consistent/pci_free_consistent. The free causes this.
Can you please run the driver with dma-api debugging enabled? Does it
trigger any warnings?
It's been a while since I've done that for ya. Kernel config
options? Kernel cmd line you want?
Just enable CONFIG_DMA_API_DEBUG and boot the resulting kernel. Please
send me full dmesg when the issue happens again.



Joerg
Mark Hounschell
2016-11-18 14:04:05 UTC
Permalink
Post by Joerg Roedel
Post by Joerg Roedel
Hi Mark,
Post by Mark Hounschell
Kernel version is 4.8.0. No failure when the IOMMU is disabled. This
is an out of tree GPL driver using
pci_alloc_consistent/pci_free_consistent. The free causes this.
Can you please run the driver with dma-api debugging enabled? Does it
trigger any warnings?
It's been a while since I've done that for ya. Kernel config
options? Kernel cmd line you want?
Just enable CONFIG_DMA_API_DEBUG and boot the resulting kernel. Please
send me full dmesg when the issue happens again.
OK. It's attached.

Mark
Joerg Roedel
2016-11-21 15:32:50 UTC
Permalink
Post by Mark Hounschell
OK. It's attached.
[ 45.033715] WARNING: CPU: 0 PID: 0 at lib/dma-debug.c:1158 check_unmap+0x4ef/0x990
[ 45.037651] 3c59x 0000:04:04.0: DMA-API: device driver failed to check map error[device address=0x00000000ffedf840] [size=1536 bytes] [mapped as single]
Okay, there is a dma-debug warning from the 3c59x driver. Can you please
blacklist this driver (or compile it out) and re-try? The dma-debug code
will only show the first warning and stop then. So any possible warnings
from your out-of-tree driver will not be shown as long as the 3c59x
driver is loaded.

Alternativly you can write 1 to dma-api/all_errors in debug-fs so that
it will show all errors. But I suggest to unload the 3com driver first
and then load your driver.


Joerg
Mark Hounschell
2016-11-21 21:47:59 UTC
Permalink
Post by Joerg Roedel
Post by Mark Hounschell
OK. It's attached.
[ 45.033715] WARNING: CPU: 0 PID: 0 at lib/dma-debug.c:1158 check_unmap+0x4ef/0x990
[ 45.037651] 3c59x 0000:04:04.0: DMA-API: device driver failed to check map error[device address=0x00000000ffedf840] [size=1536 bytes] [mapped as single]
Okay, there is a dma-debug warning from the 3c59x driver. Can you please
blacklist this driver (or compile it out) and re-try? The dma-debug code
will only show the first warning and stop then. So any possible warnings
from your out-of-tree driver will not be shown as long as the 3c59x
driver is loaded.
Alternativly you can write 1 to dma-api/all_errors in debug-fs so that
it will show all errors. But I suggest to unload the 3com driver first
and then load your driver.
OK, I did get this message before the reported BUG message.

gpiohsd gpiohsd: DMA-API: device driver tries to free DMA memory it has not allocated [device address=0xffffffffffee8000] [size=8192 bytes]

But I've verified that the dma_addr_t that I get for the alloc, and also use for the free is 0x00000000ffee8000 in this case? Is device "address=0xffffffffffee8000" in that message a bug in the message or do we have a sign extended address problem? It seems strange to me, I've never seen a dma_addr_t given, when using the iommu, that high. In the past I've seen them as usually 0x00xxxxxx?

I have also verified that simply changing from pci_alloc/free_consistent to the newer DMA API fixes my issue and I get no such messages.

Mark
Joerg Roedel
2016-11-22 08:45:22 UTC
Permalink
Post by Mark Hounschell
OK, I did get this message before the reported BUG message.
gpiohsd gpiohsd: DMA-API: device driver tries to free DMA memory it has not allocated [device address=0xffffffffffee8000] [size=8192 bytes]
But I've verified that the dma_addr_t that I get for the alloc, and
also use for the free is 0x00000000ffee8000 in this case? Is device
"address=0xffffffffffee8000" in that message a bug in the message or
do we have a sign extended address problem? It seems strange to me,
I've never seen a dma_addr_t given, when using the iommu, that
high. In the past I've seen them as usually 0x00xxxxxx?
I have also verified that simply changing from
pci_alloc/free_consistent to the newer DMA API fixes my issue and I
get no such messages.
Yes, this looks like a sign-extension bug somewhere. But its not in the
amd-iommu driver, because dma-debug also sees it. And from what I can
tell the dma-api interface seems to be fine. It consistently uses
dma_addr_t to pass these values around.

Where can I find the source of the failing code? I need exactly the code
version that triggers the problem.



Joerg
Mark Hounschell
2016-11-22 19:53:25 UTC
Permalink
Post by Joerg Roedel
Post by Mark Hounschell
OK, I did get this message before the reported BUG message.
gpiohsd gpiohsd: DMA-API: device driver tries to free DMA memory it has not allocated [device address=0xffffffffffee8000] [size=8192 bytes]
But I've verified that the dma_addr_t that I get for the alloc, and
also use for the free is 0x00000000ffee8000 in this case? Is device
"address=0xffffffffffee8000" in that message a bug in the message or
do we have a sign extended address problem? It seems strange to me,
I've never seen a dma_addr_t given, when using the iommu, that
high. In the past I've seen them as usually 0x00xxxxxx?
I have also verified that simply changing from
pci_alloc/free_consistent to the newer DMA API fixes my issue and I
get no such messages.
Yes, this looks like a sign-extension bug somewhere. But its not in the
amd-iommu driver, because dma-debug also sees it. And from what I can
tell the dma-api interface seems to be fine. It consistently uses
dma_addr_t to pass these values around.
Where can I find the source of the failing code? I need exactly the code
version that triggers the problem.
I certainly don't have a problem sending you the code but I'm sure you
have better things to do than scour over some out of kernel GPL driver
code. I see many many users of pci_alloc/free in kernel so it can't be
broken as badly as it appears to me here. I'm going to just go ahead and
convert this section of code to use the newer DMA API and be done with it.

It appears pci_alloc/free_consistent is going to be removed completely
soon anyway.

Thanks
Mark

Loading...