Xunlei Pang
2016-11-16 09:02:30 UTC
We met the DMAR fault both on hpsa P420i and P421 SmartArray controllers
under kdump, it can be steadily reproduced on several different machines,
the dmesg log is like:
HP HPSA Driver (v 3.4.16-0)
hpsa 0000:02:00.0: using doorbell to reset controller
hpsa 0000:02:00.0: board ready after hard reset.
hpsa 0000:02:00.0: Waiting for controller to respond to no-op
DMAR: Setting identity map for device 0000:02:00.0 [0xe8000 - 0xe8fff]
DMAR: Setting identity map for device 0000:02:00.0 [0xf4000 - 0xf4fff]
DMAR: Setting identity map for device 0000:02:00.0 [0xbdf6e000 - 0xbdf6efff]
DMAR: Setting identity map for device 0000:02:00.0 [0xbdf6f000 - 0xbdf7efff]
DMAR: Setting identity map for device 0000:02:00.0 [0xbdf7f000 - 0xbdf82fff]
DMAR: Setting identity map for device 0000:02:00.0 [0xbdf83000 - 0xbdf84fff]
DMAR: DRHD: handling fault status reg 2
DMAR: [DMA Read] Request device [02:00.0] fault addr fffff000 [fault reason 06] PTE Read access is not set
hpsa 0000:02:00.0: controller message 03:00 timed out
hpsa 0000:02:00.0: no-op failed; re-trying
After some debugging, we found that the corresponding pte entry value
is correct, and the value of the iommu caching mode is 0, the fault is
probably due to the old iotlb cache of the in-flight DMA.
Thus need to flush the old iotlb after context mapping is setup for the
device, where the device is supposed to finish reset at its driver probe
stage and no in-flight DMA exists hereafter.
With this patch, all our problematic machines can survive the kdump tests.
CC: Myron Stowe <***@redhat.com>
CC: Don Brace <***@microsemi.com>
CC: Baoquan He <***@redhat.com>
CC: Dave Young <***@redhat.com>
Tested-by: Joseph Szczypek <***@redhat.com>
Signed-off-by: Xunlei Pang <***@redhat.com>
---
drivers/iommu/intel-iommu.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 3965e73..eb79288 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -2067,9 +2067,16 @@ static int domain_context_mapping_one(struct dmar_domain *domain,
* It's a non-present to present mapping. If hardware doesn't cache
* non-present entry we only need to flush the write-buffer. If the
* _does_ cache non-present entries, then it does so in the special
- * domain #0, which we have to flush:
+ * domain #0, which we have to flush.
+ *
+ * For kdump cases, present entries may be cached due to the in-flight
+ * DMA and copied old pgtable, but there is no unmapping behaviour for
+ * them, so we need an explicit iotlb flush for the newly-mapped device.
+ * For kdump, at this point, the device is supposed to finish reset at
+ * the driver probe stage, no in-flight DMA will exist, thus we do not
+ * need to worry about that anymore hereafter.
*/
- if (cap_caching_mode(iommu->cap)) {
+ if (is_kdump_kernel() || cap_caching_mode(iommu->cap)) {
iommu->flush.flush_context(iommu, 0,
(((u16)bus) << 8) | devfn,
DMA_CCMD_MASK_NOBIT,
under kdump, it can be steadily reproduced on several different machines,
the dmesg log is like:
HP HPSA Driver (v 3.4.16-0)
hpsa 0000:02:00.0: using doorbell to reset controller
hpsa 0000:02:00.0: board ready after hard reset.
hpsa 0000:02:00.0: Waiting for controller to respond to no-op
DMAR: Setting identity map for device 0000:02:00.0 [0xe8000 - 0xe8fff]
DMAR: Setting identity map for device 0000:02:00.0 [0xf4000 - 0xf4fff]
DMAR: Setting identity map for device 0000:02:00.0 [0xbdf6e000 - 0xbdf6efff]
DMAR: Setting identity map for device 0000:02:00.0 [0xbdf6f000 - 0xbdf7efff]
DMAR: Setting identity map for device 0000:02:00.0 [0xbdf7f000 - 0xbdf82fff]
DMAR: Setting identity map for device 0000:02:00.0 [0xbdf83000 - 0xbdf84fff]
DMAR: DRHD: handling fault status reg 2
DMAR: [DMA Read] Request device [02:00.0] fault addr fffff000 [fault reason 06] PTE Read access is not set
hpsa 0000:02:00.0: controller message 03:00 timed out
hpsa 0000:02:00.0: no-op failed; re-trying
After some debugging, we found that the corresponding pte entry value
is correct, and the value of the iommu caching mode is 0, the fault is
probably due to the old iotlb cache of the in-flight DMA.
Thus need to flush the old iotlb after context mapping is setup for the
device, where the device is supposed to finish reset at its driver probe
stage and no in-flight DMA exists hereafter.
With this patch, all our problematic machines can survive the kdump tests.
CC: Myron Stowe <***@redhat.com>
CC: Don Brace <***@microsemi.com>
CC: Baoquan He <***@redhat.com>
CC: Dave Young <***@redhat.com>
Tested-by: Joseph Szczypek <***@redhat.com>
Signed-off-by: Xunlei Pang <***@redhat.com>
---
drivers/iommu/intel-iommu.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 3965e73..eb79288 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -2067,9 +2067,16 @@ static int domain_context_mapping_one(struct dmar_domain *domain,
* It's a non-present to present mapping. If hardware doesn't cache
* non-present entry we only need to flush the write-buffer. If the
* _does_ cache non-present entries, then it does so in the special
- * domain #0, which we have to flush:
+ * domain #0, which we have to flush.
+ *
+ * For kdump cases, present entries may be cached due to the in-flight
+ * DMA and copied old pgtable, but there is no unmapping behaviour for
+ * them, so we need an explicit iotlb flush for the newly-mapped device.
+ * For kdump, at this point, the device is supposed to finish reset at
+ * the driver probe stage, no in-flight DMA will exist, thus we do not
+ * need to worry about that anymore hereafter.
*/
- if (cap_caching_mode(iommu->cap)) {
+ if (is_kdump_kernel() || cap_caching_mode(iommu->cap)) {
iommu->flush.flush_context(iommu, 0,
(((u16)bus) << 8) | devfn,
DMA_CCMD_MASK_NOBIT,
--
1.8.3.1
1.8.3.1