Discussion:
[PATCH v14 00/16] KVM PCIe/MSI passthrough on ARM/ARM64
Eric Auger
2016-10-12 13:22:08 UTC
Permalink
This is the second respin on top of Robin's series [1], addressing Alex' comments.

Major changes are:
- MSI-doorbell API now is moved to DMA IOMMU API following Alex suggestion
to put all API pieces at the same place (so eventually in the IOMMU
subsystem)
- new iommu_domain_msi_resv struct and accessor through DOMAIN_ATTR_MSI_RESV
domain with mirror VFIO capability
- more robustness I think in the VFIO layer
- added "iommu/iova: fix __alloc_and_insert_iova_range" since with the current
code I failed allocating an IOVA page in a single page domain with upper part
reserved

IOVA range exclusion will be handled in a separate series

The priority really is to discuss and freeze the API and especially the MSI
doorbell's handling. Do we agree to put that in DMA IOMMU?

Note: the size computation does not take into account possible page overlaps
between doorbells but it would add quite a lot of complexity i think.

Tested on AMD Overdrive (single GICv2m frame) with I350 VF assignment.

dependency:
the series depends on Robin's generic-v7 branch:
[1] [PATCH v7 00/22] Generic DT bindings for PCI IOMMUs and ARM SMMU
http://www.spinics.net/lists/arm-kernel/msg531110.html

Best Regards

Eric

Git: complete series available at
https://github.com/eauger/linux/tree/generic-v7-pcie-passthru-v14

the above branch includes a temporary patch to work around a ThunderX pci
bus reset crash (which I think unrelated to this series):
"vfio: pci: HACK! workaround thunderx pci_try_reset_bus crash"
Do not take this one for other platforms.


Eric Auger (15):
iommu/iova: fix __alloc_and_insert_iova_range
iommu: Introduce DOMAIN_ATTR_MSI_RESV
iommu/dma: MSI doorbell alloc/free
iommu/dma: Introduce iommu_calc_msi_resv
iommu/arm-smmu: Implement domain_get_attr for DOMAIN_ATTR_MSI_RESV
irqchip/gic-v2m: Register the MSI doorbell
irqchip/gicv3-its: Register the MSI doorbell
vfio: Introduce a vfio_dma type field
vfio/type1: vfio_find_dma accepting a type argument
vfio/type1: Implement recursive vfio_find_dma_from_node
vfio/type1: Handle unmap/unpin and replay for VFIO_IOVA_RESERVED slots
vfio: Allow reserved msi iova registration
vfio/type1: Check doorbell safety
iommu/arm-smmu: Do not advertise IOMMU_CAP_INTR_REMAP
vfio/type1: Introduce MSI_RESV capability

Robin Murphy (1):
iommu/dma: Allow MSI-only cookies

drivers/iommu/Kconfig | 4 +-
drivers/iommu/arm-smmu-v3.c | 10 +-
drivers/iommu/arm-smmu.c | 10 +-
drivers/iommu/dma-iommu.c | 184 ++++++++++++++++++++++++++
drivers/iommu/iova.c | 2 +-
drivers/irqchip/irq-gic-v2m.c | 10 +-
drivers/irqchip/irq-gic-v3-its.c | 13 ++
drivers/vfio/vfio_iommu_type1.c | 279 +++++++++++++++++++++++++++++++++++++--
include/linux/dma-iommu.h | 59 +++++++++
include/linux/iommu.h | 8 ++
include/uapi/linux/vfio.h | 30 ++++-
11 files changed, 587 insertions(+), 22 deletions(-)
--
1.9.1
Eric Auger
2016-10-12 13:22:09 UTC
Permalink
Fix the size check within start_pfn and limit_pfn.

Signed-off-by: Eric Auger <***@redhat.com>

---

the issue was observed when playing with 1 page iova domain with
higher iova reserved.
---
drivers/iommu/iova.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index e23001b..ee29dbf 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -147,7 +147,7 @@ move_left:
if (!curr) {
if (size_aligned)
pad_size = iova_get_pad_size(size, limit_pfn);
- if ((iovad->start_pfn + size + pad_size) > limit_pfn) {
+ if ((iovad->start_pfn + size + pad_size - 1) > limit_pfn) {
spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
return -ENOMEM;
}
--
1.9.1
Eric Auger
2016-10-12 13:22:10 UTC
Permalink
Introduce a new DOMAIN_ATTR_MSI_RESV domain attribute and associated
iommu_domain msi_resv field. It comprises the size and alignment
of the IOVA reserved window dedicated to MSI mapping. This attribute
only is supported when MSI must be IOMMU mapped.

This is the case on ARM.

Signed-off-by: Eric Auger <***@redhat.com>
Suggested-by: Alex Williamson <***@redhat.com>

---
v13 -> v14:
- new msi_resv type and name

v12 -> v13:
- reword the commit message

v8 -> v9:
- rename programmable into iommu_msi_supported
- add iommu_domain_msi_aperture_valid

v8: creation
- deprecates DOMAIN_ATTR_MSI_MAPPING flag
---
include/linux/iommu.h | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 436dc21..aaeb598 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -52,6 +52,12 @@ struct iommu_domain_geometry {
bool force_aperture; /* DMA only allowed in mappable range? */
};

+/* MSI reserved IOVA window requirements */
+struct iommu_domain_msi_resv {
+ size_t size; /* size in bytes */
+ size_t alignment; /* byte alignment */
+};
+
/* Domain feature flags */
#define __IOMMU_DOMAIN_PAGING (1U << 0) /* Support for iommu_map/unmap */
#define __IOMMU_DOMAIN_DMA_API (1U << 1) /* Domain for use in DMA-API
@@ -83,6 +89,7 @@ struct iommu_domain {
iommu_fault_handler_t handler;
void *handler_token;
struct iommu_domain_geometry geometry;
+ struct iommu_domain_msi_resv msi_resv;
void *iova_cookie;
};

@@ -108,6 +115,7 @@ enum iommu_cap {

enum iommu_attr {
DOMAIN_ATTR_GEOMETRY,
+ DOMAIN_ATTR_MSI_RESV,
DOMAIN_ATTR_PAGING,
DOMAIN_ATTR_WINDOWS,
DOMAIN_ATTR_FSL_PAMU_STASH,
--
1.9.1
Eric Auger
2016-10-12 13:22:11 UTC
Permalink
From: Robin Murphy <***@arm.com>

IOMMU domain users such as VFIO face a similar problem to DMA API ops
with regard to mapping MSI messages in systems where the MSI write is
subject to IOMMU translation. With the relevant infrastructure now in
place for managed DMA domains, it's actually really simple for other
users to piggyback off that and reap the benefits without giving up
their own IOVA management, and without having to reinvent their own
wheel in the MSI layer.

Allow such users to opt into automatic MSI remapping by dedicating a
region of their IOVA space to a managed cookie.

Signed-off-by: Robin Murphy <***@arm.com>
Signed-off-by: Eric Auger <***@redhat.com>

---

v13 -> v14:
- restore reserve_iova for iova >= base + size

v1 -> v13 incorpration:
- compared to Robin's version
- add NULL last param to iommu_dma_init_domain
- set the msi_geometry aperture
- I removed
if (base < U64_MAX - size)
reserve_iova(iovad, iova_pfn(iovad, base + size), ULONG_MAX);
don't get why we would reserve something out of the scope of the iova domain?
what do I miss?
---
drivers/iommu/dma-iommu.c | 39 +++++++++++++++++++++++++++++++++++++++
include/linux/dma-iommu.h | 9 +++++++++
2 files changed, 48 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index c5ab866..d45f9a0 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -716,3 +716,42 @@ void iommu_dma_map_msi_msg(int irq, struct msi_msg *msg)
msg->address_lo += lower_32_bits(msi_page->iova);
}
}
+
+/**
+ * iommu_get_dma_msi_region_cookie - Configure a domain for MSI remapping only
+ * @domain: IOMMU domain to prepare
+ * @base: Base address of IOVA region to use as the MSI remapping aperture
+ * @size: Size of the desired MSI aperture
+ *
+ * Users who manage their own IOVA allocation and do not want DMA API support,
+ * but would still like to take advantage of automatic MSI remapping, can use
+ * this to initialise their own domain appropriately.
+ */
+int iommu_get_dma_msi_region_cookie(struct iommu_domain *domain,
+ dma_addr_t base, u64 size)
+{
+ struct iommu_dma_cookie *cookie;
+ struct iova_domain *iovad;
+ int ret;
+
+ if (domain->type == IOMMU_DOMAIN_DMA)
+ return -EINVAL;
+
+ ret = iommu_get_dma_cookie(domain);
+ if (ret)
+ return ret;
+
+ ret = iommu_dma_init_domain(domain, base, size, NULL);
+ if (ret) {
+ iommu_put_dma_cookie(domain);
+ return ret;
+ }
+
+ cookie = domain->iova_cookie;
+ iovad = &cookie->iovad;
+ if (base < U64_MAX - size)
+ reserve_iova(iovad, iova_pfn(iovad, base + size), ULONG_MAX);
+
+ return 0;
+}
+EXPORT_SYMBOL(iommu_get_dma_msi_region_cookie);
diff --git a/include/linux/dma-iommu.h b/include/linux/dma-iommu.h
index 32c5890..05ab5b4 100644
--- a/include/linux/dma-iommu.h
+++ b/include/linux/dma-iommu.h
@@ -67,6 +67,9 @@ int iommu_dma_mapping_error(struct device *dev, dma_addr_t dma_addr);
/* The DMA API isn't _quite_ the whole story, though... */
void iommu_dma_map_msi_msg(int irq, struct msi_msg *msg);

+int iommu_get_dma_msi_region_cookie(struct iommu_domain *domain,
+ dma_addr_t base, u64 size);
+
#else

struct iommu_domain;
@@ -90,6 +93,12 @@ static inline void iommu_dma_map_msi_msg(int irq, struct msi_msg *msg)
{
}

+static inline int iommu_get_dma_msi_region_cookie(struct iommu_domain *domain,
+ dma_addr_t base, u64 size)
+{
+ return -ENODEV;
+}
+
#endif /* CONFIG_IOMMU_DMA */
#endif /* __KERNEL__ */
#endif /* __DMA_IOMMU_H */
--
1.9.1
Eric Auger
2016-10-12 13:22:12 UTC
Permalink
We introduce the capability to (un)register MSI doorbells.

A doorbell region is characterized by its physical address base, size,
and whether it its safe (ie. it implements IRQ remapping). A doorbell
can be per-cpu or global. We currently only care about global doorbells.

A function returns whether all registered doorbells are safe.

MSI controllers likely to work along with IOMMU that translate MSI
transaction must register their doorbells to allow device assignment
with MSI support. Otherwise the MSI transactions will cause IOMMU faults.

Signed-off-by: Eric Auger <***@redhat.com>

---

v13 -> v14:
- previously in msi-doorbell.h/c
---
drivers/iommu/dma-iommu.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++
include/linux/dma-iommu.h | 41 ++++++++++++++++++++++++++
2 files changed, 116 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index d45f9a0..d8a7d86 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -43,6 +43,38 @@ struct iommu_dma_cookie {
spinlock_t msi_lock;
};

+/**
+ * struct iommu_msi_doorbell_info - MSI doorbell region descriptor
+ * @percpu_doorbells: per cpu doorbell base address
+ * @global_doorbell: base address of the doorbell
+ * @doorbell_is_percpu: is the doorbell per cpu or global?
+ * @safe: true if irq remapping is implemented
+ * @size: size of the doorbell
+ */
+struct iommu_msi_doorbell_info {
+ union {
+ phys_addr_t __percpu *percpu_doorbells;
+ phys_addr_t global_doorbell;
+ };
+ bool doorbell_is_percpu;
+ bool safe;
+ size_t size;
+};
+
+struct iommu_msi_doorbell {
+ struct iommu_msi_doorbell_info info;
+ struct list_head next;
+};
+
+/* list of registered MSI doorbells */
+static LIST_HEAD(iommu_msi_doorbell_list);
+
+/* counts the number of unsafe registered doorbells */
+static uint nb_unsafe_doorbells;
+
+/* protects the list and nb_unsafe_doorbells */
+static DEFINE_MUTEX(iommu_msi_doorbell_mutex);
+
static inline struct iova_domain *cookie_iovad(struct iommu_domain *domain)
{
return &((struct iommu_dma_cookie *)domain->iova_cookie)->iovad;
@@ -755,3 +787,46 @@ int iommu_get_dma_msi_region_cookie(struct iommu_domain *domain,
return 0;
}
EXPORT_SYMBOL(iommu_get_dma_msi_region_cookie);
+
+struct iommu_msi_doorbell_info *
+iommu_msi_doorbell_alloc(phys_addr_t base, size_t size, bool safe)
+{
+ struct iommu_msi_doorbell *db;
+
+ db = kzalloc(sizeof(*db), GFP_KERNEL);
+ if (!db)
+ return ERR_PTR(-ENOMEM);
+
+ db->info.global_doorbell = base;
+ db->info.size = size;
+ db->info.safe = safe;
+
+ mutex_lock(&iommu_msi_doorbell_mutex);
+ list_add(&db->next, &iommu_msi_doorbell_list);
+ if (!db->info.safe)
+ nb_unsafe_doorbells++;
+ mutex_unlock(&iommu_msi_doorbell_mutex);
+ return &db->info;
+}
+EXPORT_SYMBOL_GPL(iommu_msi_doorbell_alloc);
+
+void iommu_msi_doorbell_free(struct iommu_msi_doorbell_info *dbinfo)
+{
+ struct iommu_msi_doorbell *db;
+
+ db = container_of(dbinfo, struct iommu_msi_doorbell, info);
+
+ mutex_lock(&iommu_msi_doorbell_mutex);
+ list_del(&db->next);
+ if (!db->info.safe)
+ nb_unsafe_doorbells--;
+ mutex_unlock(&iommu_msi_doorbell_mutex);
+ kfree(db);
+}
+EXPORT_SYMBOL_GPL(iommu_msi_doorbell_free);
+
+bool iommu_msi_doorbell_safe(void)
+{
+ return !nb_unsafe_doorbells;
+}
+EXPORT_SYMBOL_GPL(iommu_msi_doorbell_safe);
diff --git a/include/linux/dma-iommu.h b/include/linux/dma-iommu.h
index 05ab5b4..9640a27 100644
--- a/include/linux/dma-iommu.h
+++ b/include/linux/dma-iommu.h
@@ -19,6 +19,8 @@
#ifdef __KERNEL__
#include <asm/errno.h>

+struct iommu_msi_doorbell_info;
+
#ifdef CONFIG_IOMMU_DMA
#include <linux/iommu.h>
#include <linux/msi.h>
@@ -70,6 +72,31 @@ void iommu_dma_map_msi_msg(int irq, struct msi_msg *msg);
int iommu_get_dma_msi_region_cookie(struct iommu_domain *domain,
dma_addr_t base, u64 size);

+/**
+ * iommu_msi_doorbell_alloc - allocate a global doorbell
+ * @base: physical base address of the doorbell
+ * @size: size of the doorbell
+ * @safe: true is irq_remapping implemented for this doorbell
+ *
+ * Return: the newly allocated doorbell info or a pointer converted error
+ */
+struct iommu_msi_doorbell_info *
+iommu_msi_doorbell_alloc(phys_addr_t base, size_t size, bool safe);
+
+/**
+ * iommu_msi_doorbell_free - free a global doorbell
+ * @db: doorbell info to free
+ */
+void iommu_msi_doorbell_free(struct iommu_msi_doorbell_info *db);
+
+/**
+ * iommu_msi_doorbell_safe - return whether all registered doorbells are safe
+ *
+ * Safe doorbells are those which implement irq remapping
+ * Return: true if all doorbells are safe, false otherwise
+ */
+bool iommu_msi_doorbell_safe(void);
+
#else

struct iommu_domain;
@@ -99,6 +126,20 @@ static inline int iommu_get_dma_msi_region_cookie(struct iommu_domain *domain,
return -ENODEV;
}

+static inline struct iommu_msi_doorbell_info *
+iommu_msi_doorbell_alloc(phys_addr_t base, size_t size, bool safe)
+{
+ return NULL;
+}
+
+static inline void
+iommu_msi_doorbell_free(struct msi_doorbell_info *db) {}
+
+static inline bool iommu_msi_doorbell_safe(void)
+{
+ return false;
+}
+
#endif /* CONFIG_IOMMU_DMA */
#endif /* __KERNEL__ */
#endif /* __DMA_IOMMU_H */
--
1.9.1
Punit Agrawal
2016-10-14 11:25:11 UTC
Permalink
Hi Eric,

One query and a comment below.
Post by Eric Auger
We introduce the capability to (un)register MSI doorbells.
A doorbell region is characterized by its physical address base, size,
and whether it its safe (ie. it implements IRQ remapping). A doorbell
can be per-cpu or global. We currently only care about global doorbells.
A function returns whether all registered doorbells are safe.
MSI controllers likely to work along with IOMMU that translate MSI
transaction must register their doorbells to allow device assignment
with MSI support. Otherwise the MSI transactions will cause IOMMU faults.
---
- previously in msi-doorbell.h/c
---
drivers/iommu/dma-iommu.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++
include/linux/dma-iommu.h | 41 ++++++++++++++++++++++++++
2 files changed, 116 insertions(+)
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index d45f9a0..d8a7d86 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -43,6 +43,38 @@ struct iommu_dma_cookie {
spinlock_t msi_lock;
};
+/**
+ * struct iommu_msi_doorbell_info - MSI doorbell region descriptor
+ */
+struct iommu_msi_doorbell_info {
+ union {
+ phys_addr_t __percpu *percpu_doorbells;
Out of curiosity, have you come across systems that have per-cpu
doorbells? I couldn't find a system that'd help solidify my
understanding on it's usage.
Post by Eric Auger
+ phys_addr_t global_doorbell;
+ };
+ bool doorbell_is_percpu;
+ bool safe;
Although you've got the comment above, 'safe' doesn't quite convey it's
purpose. Can this be renamed to something more descriptive -
'intr_remapping' or 'intr_isolation' perhaps?

Thanks,
Punit


[...]
Auger Eric
2016-10-17 14:25:36 UTC
Permalink
Hi Punit,
Post by Punit Agrawal
Hi Eric,
One query and a comment below.
Post by Eric Auger
We introduce the capability to (un)register MSI doorbells.
A doorbell region is characterized by its physical address base, size,
and whether it its safe (ie. it implements IRQ remapping). A doorbell
can be per-cpu or global. We currently only care about global doorbells.
A function returns whether all registered doorbells are safe.
MSI controllers likely to work along with IOMMU that translate MSI
transaction must register their doorbells to allow device assignment
with MSI support. Otherwise the MSI transactions will cause IOMMU faults.
---
- previously in msi-doorbell.h/c
---
drivers/iommu/dma-iommu.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++
include/linux/dma-iommu.h | 41 ++++++++++++++++++++++++++
2 files changed, 116 insertions(+)
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index d45f9a0..d8a7d86 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -43,6 +43,38 @@ struct iommu_dma_cookie {
spinlock_t msi_lock;
};
+/**
+ * struct iommu_msi_doorbell_info - MSI doorbell region descriptor
+ */
+struct iommu_msi_doorbell_info {
+ union {
+ phys_addr_t __percpu *percpu_doorbells;
Out of curiosity, have you come across systems that have per-cpu
doorbells? I couldn't find a system that'd help solidify my
understanding on it's usage.
This came out after a discussion With Marc. However at the moment I am
not aware of any MSI controller featuring per-cpu doorbell. Not sure
whether it stays relevant to keep this notion at that stage.
Post by Punit Agrawal
Post by Eric Auger
+ phys_addr_t global_doorbell;
+ };
+ bool doorbell_is_percpu;
+ bool safe;
Although you've got the comment above, 'safe' doesn't quite convey it's
purpose. Can this be renamed to something more descriptive -
'intr_remapping' or 'intr_isolation' perhaps?
Yes definitively

Thanks

Eric
Post by Punit Agrawal
Thanks,
Punit
[...]
_______________________________________________
linux-arm-kernel mailing list
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
Eric Auger
2016-10-12 13:22:13 UTC
Permalink
iommu_calc_msi_resv() sum up the number of iommu pages of the lowest
order supported by the iommu domain requested to map all the registered
doorbells. This function will allow to dimension the intermediate
physical address (IPA) aperture requested to map the MSI doorbells.

Signed-off-by: Eric Auger <***@redhat.com>

---

v13 -> v14
- name and proto changed, moved to dma-iommu
---
drivers/iommu/dma-iommu.c | 70 +++++++++++++++++++++++++++++++++++++++++++++++
include/linux/dma-iommu.h | 11 +++++++-
2 files changed, 80 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index d8a7d86..3a4b73b 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -830,3 +830,73 @@ bool iommu_msi_doorbell_safe(void)
return !nb_unsafe_doorbells;
}
EXPORT_SYMBOL_GPL(iommu_msi_doorbell_safe);
+
+/**
+ * calc_region_reqs - compute the number of pages requested to map a region
+ *
+ * @addr: physical base address of the region
+ * @size: size of the region
+ * @order: the page order
+ *
+ * Return: the number of requested pages to map this region
+ */
+static int calc_region_reqs(phys_addr_t addr, size_t size, unsigned int order)
+{
+ phys_addr_t offset, granule;
+ unsigned int nb_pages;
+
+ granule = (uint64_t)(1 << order);
+ offset = addr & (granule - 1);
+ size = ALIGN(size + offset, granule);
+ nb_pages = size >> order;
+
+ return nb_pages;
+}
+
+/**
+ * calc_dbinfo_reqs - compute the number of pages requested to map a given
+ * MSI doorbell
+ *
+ * @dbi: doorbell info descriptor
+ * @order: page order
+ *
+ * Return: the number of requested pages to map this doorbell
+ */
+static int calc_dbinfo_reqs(struct iommu_msi_doorbell_info *dbi,
+ unsigned int order)
+{
+ int ret = 0;
+
+ if (!dbi->doorbell_is_percpu) {
+ ret = calc_region_reqs(dbi->global_doorbell, dbi->size, order);
+ } else {
+ phys_addr_t __percpu *pbase;
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ pbase = per_cpu_ptr(dbi->percpu_doorbells, cpu);
+ ret += calc_region_reqs(*pbase, dbi->size, order);
+ }
+ }
+ return ret;
+}
+
+int iommu_calc_msi_resv(struct iommu_domain *domain)
+{
+ unsigned long order = __ffs(domain->pgsize_bitmap);
+ struct iommu_msi_doorbell *db;
+ phys_addr_t granule;
+ int size = 0;
+
+ mutex_lock(&iommu_msi_doorbell_mutex);
+ list_for_each_entry(db, &iommu_msi_doorbell_list, next)
+ size += calc_dbinfo_reqs(&db->info, order);
+
+ mutex_unlock(&iommu_msi_doorbell_mutex);
+
+ granule = (uint64_t)(1 << order);
+ domain->msi_resv.size = size * granule;
+ domain->msi_resv.alignment = granule;
+
+ return 0;
+}
diff --git a/include/linux/dma-iommu.h b/include/linux/dma-iommu.h
index 9640a27..95875c8 100644
--- a/include/linux/dma-iommu.h
+++ b/include/linux/dma-iommu.h
@@ -97,6 +97,16 @@ void iommu_msi_doorbell_free(struct iommu_msi_doorbell_info *db);
*/
bool iommu_msi_doorbell_safe(void);

+/**
+ * iommu_calc_msi_resv - compute the number of pages of the lowest order
+ * supported by @domain, requested to map all the registered doorbells.
+ *
+ * @domain: iommu_domain
+ * @msi_resv: MSI reserved window requirements
+ *
+ */
+int iommu_calc_msi_resv(struct iommu_domain *domain);
+
#else

struct iommu_domain;
@@ -139,7 +149,6 @@ static inline bool iommu_msi_doorbell_safe(void)
{
return false;
}
-
#endif /* CONFIG_IOMMU_DMA */
#endif /* __KERNEL__ */
#endif /* __DMA_IOMMU_H */
--
1.9.1
Eric Auger
2016-10-12 13:22:14 UTC
Permalink
ARM smmu and smmu-v3 translate MSI transactions so their driver
are must implement domain_get_attr for DOMAIN_ATTR_MSI_RESV.

This allows to retrieve the size and alignment requirements of
the MSI reserved IOVA window.

Also IOMMU_DMA gets selected since it exposes the API to map the
MSIs.

Signed-off-by: Eric Auger <***@redhat.com>
---
drivers/iommu/Kconfig | 4 ++--
drivers/iommu/arm-smmu-v3.c | 7 +++++++
drivers/iommu/arm-smmu.c | 7 +++++++
3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 8ee54d7..f5e5e4b 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -297,7 +297,7 @@ config SPAPR_TCE_IOMMU
config ARM_SMMU
bool "ARM Ltd. System MMU (SMMU) Support"
depends on (ARM64 || ARM) && MMU
- select IOMMU_API
+ select IOMMU_DMA
select IOMMU_IO_PGTABLE_LPAE
select ARM_DMA_USE_IOMMU if ARM
help
@@ -310,7 +310,7 @@ config ARM_SMMU
config ARM_SMMU_V3
bool "ARM Ltd. System MMU Version 3 (SMMUv3) Support"
depends on ARM64
- select IOMMU_API
+ select IOMMU_DMA
select IOMMU_IO_PGTABLE_LPAE
select GENERIC_MSI_IRQ_DOMAIN
help
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 15c01c3..572cad8 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -1559,6 +1559,9 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain)
if (ret < 0)
free_io_pgtable_ops(pgtbl_ops);

+ if (domain->type == IOMMU_DOMAIN_UNMANAGED)
+ iommu_calc_msi_resv(domain);
+
return ret;
}

@@ -1840,6 +1843,10 @@ static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
case DOMAIN_ATTR_NESTING:
*(int *)data = (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED);
return 0;
+ case DOMAIN_ATTR_MSI_RESV:
+ *(struct iommu_domain_msi_resv *)data =
+ smmu_domain->domain.msi_resv;
+ return 0;
default:
return -ENODEV;
}
diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index ac4aab9..ae20b9c 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -943,6 +943,9 @@ static int arm_smmu_init_domain_context(struct iommu_domain *domain,
domain->geometry.aperture_end = (1UL << ias) - 1;
domain->geometry.force_aperture = true;

+ if (domain->type == IOMMU_DOMAIN_UNMANAGED)
+ iommu_calc_msi_resv(domain);
+
/* Initialise the context bank with our page table cfg */
arm_smmu_init_context_bank(smmu_domain, &pgtbl_cfg);

@@ -1486,6 +1489,10 @@ static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
case DOMAIN_ATTR_NESTING:
*(int *)data = (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED);
return 0;
+ case DOMAIN_ATTR_MSI_RESV:
+ *(struct iommu_domain_msi_resv *)data =
+ smmu_domain->domain.msi_resv;
+ return 0;
default:
return -ENODEV;
}
--
1.9.1
Eric Auger
2016-10-12 13:22:15 UTC
Permalink
Register the GIC V2M global doorbell. The registered information
are used to set up the KVM passthrough use case.

Signed-off-by: Eric Auger <***@redhat.com>

---
v13 -> v14:
- use iommu_msi_doorbell_alloc/free

v12 -> v13:
- use new msi doorbell registration prototype
- remove iommu protection attributes
- add unregistration in teardown

v11 -> v12:
- use irq_get_msi_doorbell_info new name
- simplify error handling

v10 -> v11:
- use the new registration API and re-implement the msi_doorbell_info
ops

v9 -> v10:
- introduce the registration concept in place of msi_doorbell_info
callback

v8 -> v9:
- use global_doorbell instead of percpu_doorbells

v7 -> v8:
- gicv2m_msi_doorbell_info does not return a pointer to const
- remove spurious !v2m check
- add IOMMU_MMIO flag

v7: creation
---
drivers/irqchip/irq-gic-v2m.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/irqchip/irq-gic-v2m.c b/drivers/irqchip/irq-gic-v2m.c
index 863e073..33acfe0 100644
--- a/drivers/irqchip/irq-gic-v2m.c
+++ b/drivers/irqchip/irq-gic-v2m.c
@@ -70,6 +70,7 @@ struct v2m_data {
u32 spi_offset; /* offset to be subtracted from SPI number */
unsigned long *bm; /* MSI vector bitmap */
u32 flags; /* v2m flags for specific implementation */
+ struct iommu_msi_doorbell_info *doorbell_info; /* MSI doorbell */
};

static void gicv2m_mask_msi_irq(struct irq_data *d)
@@ -254,6 +255,7 @@ static void gicv2m_teardown(void)
struct v2m_data *v2m, *tmp;

list_for_each_entry_safe(v2m, tmp, &v2m_nodes, entry) {
+ iommu_msi_doorbell_free(v2m->doorbell_info);
list_del(&v2m->entry);
kfree(v2m->bm);
iounmap(v2m->base);
@@ -370,12 +372,18 @@ static int __init gicv2m_init_one(struct fwnode_handle *fwnode,
goto err_iounmap;
}

+ v2m->doorbell_info = iommu_msi_doorbell_alloc(v2m->res.start,
+ sizeof(u32), false);
+ if (IS_ERR(v2m->doorbell_info))
+ goto err_free_bm;
+
list_add_tail(&v2m->entry, &v2m_nodes);

pr_info("range%pR, SPI[%d:%d]\n", res,
v2m->spi_start, (v2m->spi_start + v2m->nr_spis - 1));
return 0;
-
+err_free_bm:
+ kfree(v2m->bm);
err_iounmap:
iounmap(v2m->base);
err_free_v2m:
--
1.9.1
Eric Auger
2016-10-12 13:22:16 UTC
Permalink
This patch registers the ITS global doorbell. Registered information
are needed to set up the KVM passthrough use case.

Signed-off-by: Eric Auger <***@redhat.com>

---
v13 -> v14:
- use iommu_msi_doorbell_alloc/free

v12 -> v13:
- use new doorbell registration prototype

v11 -> v12:
- use new irq_get_msi_doorbell_info name
- simplify error handling

v10 -> v11:
- adapt to new doorbell registration API and implement msi_doorbell_info
---
drivers/irqchip/irq-gic-v3-its.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index 98ff669..b42e006 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -86,6 +86,7 @@ struct its_node {
u32 ite_size;
u32 device_ids;
int numa_node;
+ struct iommu_msi_doorbell_info *doorbell_info;
};

#define ITS_ITT_ALIGN SZ_256
@@ -1717,6 +1718,7 @@ static int __init its_probe(struct device_node *node,

if (of_property_read_bool(node, "msi-controller")) {
struct msi_domain_info *info;
+ phys_addr_t translater;

info = kzalloc(sizeof(*info), GFP_KERNEL);
if (!info) {
@@ -1724,10 +1726,21 @@ static int __init its_probe(struct device_node *node,
goto out_free_tables;
}

+ translater = its->phys_base + GITS_TRANSLATER;
+ its->doorbell_info =
+ iommu_msi_doorbell_alloc(translater, sizeof(u32), true);
+
+ if (IS_ERR(its->doorbell_info)) {
+ kfree(info);
+ goto out_free_tables;
+ }
+
+
inner_domain = irq_domain_add_tree(node, &its_domain_ops, its);
if (!inner_domain) {
err = -ENOMEM;
kfree(info);
+ iommu_msi_doorbell_free(its->doorbell_info);
goto out_free_tables;
}
--
1.9.1
Eric Auger
2016-10-12 13:22:17 UTC
Permalink
We introduce a vfio_dma type since we will need to discriminate
different types of dma slots:
- VFIO_IOVA_USER: IOVA region used to map user vaddr
- VFIO_IOVA_RESERVED_MSI: IOVA region reserved to map MSI doorbells

Signed-off-by: Eric Auger <***@redhat.com>
Acked-by: Alex Williamson <***@redhat.com>

---
v9 -> v10:
- renamed VFIO_IOVA_RESERVED into VFIO_IOVA_RESERVED_MSI
- explicitly set type to VFIO_IOVA_USER on dma_map

v6 -> v7:
- add VFIO_IOVA_ANY
- do not introduce yet any VFIO_IOVA_RESERVED handling
---
drivers/vfio/vfio_iommu_type1.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ba1942..a9f8b93 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -53,6 +53,12 @@ module_param_named(disable_hugepages,
MODULE_PARM_DESC(disable_hugepages,
"Disable VFIO IOMMU support for IOMMU hugepages.");

+enum vfio_iova_type {
+ VFIO_IOVA_USER = 0, /* standard IOVA used to map user vaddr */
+ VFIO_IOVA_RESERVED_MSI, /* reserved to map MSI doorbells */
+ VFIO_IOVA_ANY, /* matches any IOVA type */
+};
+
struct vfio_iommu {
struct list_head domain_list;
struct mutex lock;
@@ -75,6 +81,7 @@ struct vfio_dma {
unsigned long vaddr; /* Process virtual addr */
size_t size; /* Map size (bytes) */
int prot; /* IOMMU_READ/WRITE */
+ enum vfio_iova_type type; /* type of IOVA */
};

struct vfio_group {
@@ -607,6 +614,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
dma->iova = iova;
dma->vaddr = vaddr;
dma->prot = prot;
+ dma->type = VFIO_IOVA_USER;

/* Insert zero-sized and grow as we map chunks of it */
vfio_link_dma(iommu, dma);
--
1.9.1
Eric Auger
2016-10-12 13:22:18 UTC
Permalink
In our RB-tree we get prepared to insert slots of different types
(USER and RESERVED). It becomes useful to be able to search for dma
slots of a specific type or any type.

This patch introduces vfio_find_dma_from_node which starts the
search from a given node and stops on the first node that matches
the @start and @size parameters. If this node also matches the
@type parameter, the node is returned else NULL is returned.

At the moment we only have USER SLOTS so the type will always match.

In a separate patch, this function will be enhanced to pursue the
search recursively in case a node with a different type is
encountered.

Signed-off-by: Eric Auger <***@redhat.com>

---

v13 -> v14:
- remove top_node variable
---
drivers/vfio/vfio_iommu_type1.c | 52 +++++++++++++++++++++++++++++++++--------
1 file changed, 42 insertions(+), 10 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index a9f8b93..1bd16ff 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -94,25 +94,55 @@ struct vfio_group {
* into DMA'ble space using the IOMMU
*/

-static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
- dma_addr_t start, size_t size)
+/**
+ * vfio_find_dma_from_node: looks for a dma slot intersecting a window
+ * from a given rb tree node
+ * @top: top rb tree node where the search starts (including this node)
+ * @start: window start
+ * @size: window size
+ * @type: window type
+ */
+static struct vfio_dma *vfio_find_dma_from_node(struct rb_node *top,
+ dma_addr_t start, size_t size,
+ enum vfio_iova_type type)
{
- struct rb_node *node = iommu->dma_list.rb_node;
+ struct rb_node *node = top;
+ struct vfio_dma *dma;

while (node) {
- struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
-
+ dma = rb_entry(node, struct vfio_dma, node);
if (start + size <= dma->iova)
node = node->rb_left;
else if (start >= dma->iova + dma->size)
node = node->rb_right;
else
- return dma;
+ break;
}
+ if (!node)
+ return NULL;
+
+ /* a dma slot intersects our window, check the type also matches */
+ if (type == VFIO_IOVA_ANY || dma->type == type)
+ return dma;

return NULL;
}

+/**
+ * vfio_find_dma: find a dma slot intersecting a given window
+ * @iommu: vfio iommu handle
+ * @start: window base iova
+ * @size: window size
+ * @type: window type
+ */
+static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
+ dma_addr_t start, size_t size,
+ enum vfio_iova_type type)
+{
+ return vfio_find_dma_from_node(iommu->dma_list.rb_node,
+ start, size, type);
+}
+
static void vfio_link_dma(struct vfio_iommu *iommu, struct vfio_dma *new)
{
struct rb_node **link = &iommu->dma_list.rb_node, *parent = NULL;
@@ -484,19 +514,21 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
* mappings within the range.
*/
if (iommu->v2) {
- dma = vfio_find_dma(iommu, unmap->iova, 0);
+ dma = vfio_find_dma(iommu, unmap->iova, 0, VFIO_IOVA_USER);
if (dma && dma->iova != unmap->iova) {
ret = -EINVAL;
goto unlock;
}
- dma = vfio_find_dma(iommu, unmap->iova + unmap->size - 1, 0);
+ dma = vfio_find_dma(iommu, unmap->iova + unmap->size - 1, 0,
+ VFIO_IOVA_USER);
if (dma && dma->iova + dma->size != unmap->iova + unmap->size) {
ret = -EINVAL;
goto unlock;
}
}

- while ((dma = vfio_find_dma(iommu, unmap->iova, unmap->size))) {
+ while ((dma = vfio_find_dma(iommu, unmap->iova, unmap->size,
+ VFIO_IOVA_USER))) {
if (!iommu->v2 && unmap->iova > dma->iova)
break;
unmapped += dma->size;
@@ -600,7 +632,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,

mutex_lock(&iommu->lock);

- if (vfio_find_dma(iommu, iova, size)) {
+ if (vfio_find_dma(iommu, iova, size, VFIO_IOVA_ANY)) {
mutex_unlock(&iommu->lock);
return -EEXIST;
}
--
1.9.1
Eric Auger
2016-10-12 13:22:19 UTC
Permalink
This patch handles the case where a node is encountered, matching
@start and @size arguments but not matching the @type argument.
In that case, we need to skip that node and pursue the search in the
node's leaves. In case @start is inferior to the node's base, we
resume the search on the left leaf. If the recursive search on the left
leaves did not produce any match, we search the right leaves recursively.

Signed-off-by: Eric Auger <***@redhat.com>
Acked-by: Alex Williamson <***@redhat.com>

---

v10: creation
---
drivers/vfio/vfio_iommu_type1.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 1bd16ff..1f120f9 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -125,7 +125,17 @@ static struct vfio_dma *vfio_find_dma_from_node(struct rb_node *top,
if (type == VFIO_IOVA_ANY || dma->type == type)
return dma;

- return NULL;
+ /* restart 2 searches skipping the current node */
+ if (start < dma->iova) {
+ dma = vfio_find_dma_from_node(node->rb_left, start,
+ size, type);
+ if (dma)
+ return dma;
+ }
+ if (start + size > dma->iova + dma->size)
+ dma = vfio_find_dma_from_node(node->rb_right, start,
+ size, type);
+ return dma;
}

/**
--
1.9.1
Eric Auger
2016-10-12 13:22:20 UTC
Permalink
Before allowing the end-user to create VFIO_IOVA_RESERVED dma slots,
let's implement the expected behavior for removal and replay.

As opposed to user dma slots, reserved IOVAs are not systematically bound
to PAs and PAs are not pinned. VFIO just initializes the IOVA "aperture".
IOVAs are allocated outside of the VFIO framework, by the MSI layer which
is responsible to free and unmap them. The MSI mapping resources are freed
by the IOMMU driver on domain destruction.

On the creation of a new domain, the "replay" of a reserved slot simply
needs to set the MSI aperture on the new domain.

Signed-off-by: Eric Auger <***@redhat.com>

---
v13 -> v14:
- make iommu_get_dma_msi_region_cookie's failure not passable
- remove useless "select IOMMU_DMA" causing cyclic dependency
- set the MSI region only if needed

v12 -> v13:
- use dma-iommu iommu_get_dma_msi_region_cookie

v9 -> v10:
- replay of a reserved slot sets the MSI aperture on the new domain
- use VFIO_IOVA_RESERVED_MSI enum value instead of VFIO_IOVA_RESERVED

v7 -> v8:
- do no destroy anything anymore, just bypass unmap/unpin and iommu_map
on replay
---
drivers/vfio/vfio_iommu_type1.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 1f120f9..2108e2e 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -36,6 +36,7 @@
#include <linux/uaccess.h>
#include <linux/vfio.h>
#include <linux/workqueue.h>
+#include <linux/dma-iommu.h>

#define DRIVER_VERSION "0.2"
#define DRIVER_AUTHOR "Alex Williamson <***@redhat.com>"
@@ -386,7 +387,7 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
struct vfio_domain *domain, *d;
long unlocked = 0;

- if (!dma->size)
+ if (!dma->size || dma->type != VFIO_IOVA_USER)
return;
/*
* We use the IOMMU to track the physical addresses, otherwise we'd
@@ -717,12 +718,24 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
return -EINVAL;

for (; n; n = rb_next(n)) {
+ struct iommu_domain_msi_resv msi_resv;
struct vfio_dma *dma;
dma_addr_t iova;

dma = rb_entry(n, struct vfio_dma, node);
iova = dma->iova;

+ if ((dma->type == VFIO_IOVA_RESERVED_MSI) &&
+ (!iommu_domain_get_attr(domain->domain,
+ DOMAIN_ATTR_MSI_RESV,
+ &msi_resv))) {
+ ret = iommu_get_dma_msi_region_cookie(domain->domain,
+ dma->iova,
+ dma->size);
+ if (ret)
+ return ret;
+ }
+
while (iova < dma->iova + dma->size) {
phys_addr_t phys = iommu_iova_to_phys(d->domain, iova);
size_t size;
--
1.9.1
Eric Auger
2016-10-12 13:22:21 UTC
Permalink
The user is allowed to register a reserved MSI IOVA range by using the
DMA MAP API and setting the new flag: VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA.
This region is stored in the vfio_dma rb tree. At that point the iova
range is not mapped to any target address yet. The host kernel will use
those iova when needed, typically when MSIs are allocated.

Signed-off-by: Eric Auger <***@redhat.com>
Signed-off-by: Bharat Bhushan <***@freescale.com>

---
v13 -> v14:
- in vfio_set_msi_aperture, unfold in case of failure
- Get DOMAIN_ATTR_MSI_RESV attribute to decide whether to set the MSI
aperture

v12 -> v13:
- use iommu_get_dma_msi_region_cookie

v9 -> v10
- use VFIO_IOVA_RESERVED_MSI enum value

v7 -> v8:
- use iommu_msi_set_aperture function. There is no notion of
unregistration anymore since the reserved msi slot remains
until the container gets closed.

v6 -> v7:
- use iommu_free_reserved_iova_domain
- convey prot attributes downto dma-reserved-iommu iova domain creation
- reserved bindings teardown now performed on iommu domain destruction
- rename VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA into
VFIO_DMA_MAP_FLAG_RESERVED_MSI_IOVA
- change title
- pass the protection attribute to dma-reserved-iommu API

v3 -> v4:
- use iommu_alloc/free_reserved_iova_domain exported by dma-reserved-iommu
- protect vfio_register_reserved_iova_range implementation with
CONFIG_IOMMU_DMA_RESERVED
- handle unregistration by user-space and on vfio_iommu_type1 release

v1 -> v2:
- set returned value according to alloc_reserved_iova_domain result
- free the iova domains in case any error occurs

RFC v1 -> v1:
- takes into account Alex comments, based on
[RFC PATCH 1/6] vfio: Add interface for add/del reserved iova region:
- use the existing dma map/unmap ioctl interface with a flag to register
a reserved IOVA range. A single reserved iova region is allowed.
---
drivers/vfio/vfio_iommu_type1.c | 97 ++++++++++++++++++++++++++++++++++++++++-
include/uapi/linux/vfio.h | 10 ++++-
2 files changed, 105 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2108e2e..e0c97ef 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -441,6 +441,40 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
vfio_lock_acct(-unlocked);
}

+/**
+ * vfio_set_msi_aperture - Sets the msi aperture on all domains
+ * requesting MSI mapping
+ *
+ * @iommu: vfio iommu handle
+ * @iova: MSI window iova base address
+ * @size: size of the MSI reserved iova window
+ *
+ * Return: 0 if the MSI reserved region was set on at least one domain,
+ * negative value on failure
+ */
+static int vfio_set_msi_aperture(struct vfio_iommu *iommu,
+ dma_addr_t iova, size_t size)
+{
+ struct iommu_domain_msi_resv msi_resv;
+ struct vfio_domain *d;
+ int ret = -EINVAL;
+
+ list_for_each_entry(d, &iommu->domain_list, next) {
+ if (iommu_domain_get_attr(d->domain, DOMAIN_ATTR_MSI_RESV,
+ &msi_resv))
+ continue;
+ ret = iommu_get_dma_msi_region_cookie(d->domain, iova, size);
+ if (ret)
+ goto unfold;
+ }
+ return 0;
+unfold:
+ list_for_each_entry(d, &iommu->domain_list, next)
+ iommu_put_dma_cookie(d->domain);
+
+ return ret;
+}
+
static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
{
vfio_unmap_unpin(iommu, dma);
@@ -690,6 +724,63 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
return ret;
}

+static int vfio_register_msi_range(struct vfio_iommu *iommu,
+ struct vfio_iommu_type1_dma_map *map)
+{
+ dma_addr_t iova = map->iova;
+ size_t size = map->size;
+ int ret = 0;
+ struct vfio_dma *dma;
+ unsigned long order;
+ uint64_t mask;
+
+ /* Verify that none of our __u64 fields overflow */
+ if (map->size != size || map->iova != iova)
+ return -EINVAL;
+
+ order = __ffs(vfio_pgsize_bitmap(iommu));
+ mask = ((uint64_t)1 << order) - 1;
+
+ WARN_ON(mask & PAGE_MASK);
+
+ if (!size || (size | iova) & mask)
+ return -EINVAL;
+
+ /* Don't allow IOVA address wrap */
+ if (iova + size - 1 < iova)
+ return -EINVAL;
+
+ mutex_lock(&iommu->lock);
+
+ if (vfio_find_dma(iommu, iova, size, VFIO_IOVA_ANY)) {
+ ret = -EEXIST;
+ goto unlock;
+ }
+
+ dma = kzalloc(sizeof(*dma), GFP_KERNEL);
+ if (!dma) {
+ ret = -ENOMEM;
+ goto unlock;
+ }
+
+ dma->iova = iova;
+ dma->size = size;
+ dma->type = VFIO_IOVA_RESERVED_MSI;
+
+ ret = vfio_set_msi_aperture(iommu, iova, size);
+ if (ret)
+ goto free_unlock;
+
+ vfio_link_dma(iommu, dma);
+ goto unlock;
+
+free_unlock:
+ kfree(dma);
+unlock:
+ mutex_unlock(&iommu->lock);
+ return ret;
+}
+
static int vfio_bus_type(struct device *dev, void *data)
{
struct bus_type **bus = data;
@@ -1068,7 +1159,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
} else if (cmd == VFIO_IOMMU_MAP_DMA) {
struct vfio_iommu_type1_dma_map map;
uint32_t mask = VFIO_DMA_MAP_FLAG_READ |
- VFIO_DMA_MAP_FLAG_WRITE;
+ VFIO_DMA_MAP_FLAG_WRITE |
+ VFIO_DMA_MAP_FLAG_RESERVED_MSI_IOVA;

minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);

@@ -1078,6 +1170,9 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
if (map.argsz < minsz || map.flags & ~mask)
return -EINVAL;

+ if (map.flags & VFIO_DMA_MAP_FLAG_RESERVED_MSI_IOVA)
+ return vfio_register_msi_range(iommu, &map);
+
return vfio_dma_do_map(iommu, &map);

} else if (cmd == VFIO_IOMMU_UNMAP_DMA) {
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 255a211..4a9dbc2 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -498,12 +498,19 @@ struct vfio_iommu_type1_info {
*
* Map process virtual addresses to IO virtual addresses using the
* provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required.
+ *
+ * In case RESERVED_MSI_IOVA flag is set, the API only aims at registering an
+ * IOVA region that will be used on some platforms to map the host MSI frames.
+ * In that specific case, vaddr is ignored. Once registered, an MSI reserved
+ * IOVA region stays until the container is closed.
*/
struct vfio_iommu_type1_dma_map {
__u32 argsz;
__u32 flags;
#define VFIO_DMA_MAP_FLAG_READ (1 << 0) /* readable from device */
#define VFIO_DMA_MAP_FLAG_WRITE (1 << 1) /* writable from device */
+/* reserved iova for MSI vectors*/
+#define VFIO_DMA_MAP_FLAG_RESERVED_MSI_IOVA (1 << 2)
__u64 vaddr; /* Process virtual address */
__u64 iova; /* IO virtual address */
__u64 size; /* Size of mapping (bytes) */
@@ -519,7 +526,8 @@ struct vfio_iommu_type1_dma_map {
* Caller sets argsz. The actual unmapped size is returned in the size
* field. No guarantee is made to the user that arbitrary unmaps of iova
* or size different from those used in the original mapping call will
- * succeed.
+ * succeed. Once registered, an MSI region cannot be unmapped and stays
+ * until the container is closed.
*/
struct vfio_iommu_type1_dma_unmap {
__u32 argsz;
--
1.9.1
Eric Auger
2016-10-12 13:22:22 UTC
Permalink
On x86 IRQ remapping is abstracted by the IOMMU. On ARM this is abstracted
by the msi controller.

Since we currently have no way to detect whether the MSI controller is
upstream or downstream to the IOMMU we rely on the MSI doorbell information
registered by the interrupt controllers. In case at least one doorbell
does not implement proper isolation, we state the assignment is unsafe
with regard to interrupts. This is a coarse assessment but should allow to
wait for a better system description.

At this point ARM sMMU still advertises IOMMU_CAP_INTR_REMAP. This is
removed in next patch.

Signed-off-by: Eric Auger <***@redhat.com>

---
v13 -> v15:
- check vfio_msi_resv before checking whether msi doorbell is safe

v9 -> v10:
- coarse safety assessment based on MSI doorbell info

v3 -> v4:
- rename vfio_msi_parent_irq_remapping_capable into vfio_safe_irq_domain
and irq_remapping into safe_irq_domains

v2 -> v3:
- protect vfio_msi_parent_irq_remapping_capable with
CONFIG_GENERIC_MSI_IRQ_DOMAIN
---
drivers/vfio/vfio_iommu_type1.c | 30 +++++++++++++++++++++++++++++-
1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index e0c97ef..c18ba9d 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -442,6 +442,29 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
}

/**
+ * vfio_msi_resv - Return whether any VFIO iommu domain requires
+ * MSI mapping
+ *
+ * @iommu: vfio iommu handle
+ *
+ * Return: true of MSI mapping is needed, false otherwise
+ */
+static bool vfio_msi_resv(struct vfio_iommu *iommu)
+{
+ struct iommu_domain_msi_resv msi_resv;
+ struct vfio_domain *d;
+ int ret;
+
+ list_for_each_entry(d, &iommu->domain_list, next) {
+ ret = iommu_domain_get_attr(d->domain, DOMAIN_ATTR_MSI_RESV,
+ &msi_resv);
+ if (!ret)
+ return true;
+ }
+ return false;
+}
+
+/**
* vfio_set_msi_aperture - Sets the msi aperture on all domains
* requesting MSI mapping
*
@@ -945,8 +968,13 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
INIT_LIST_HEAD(&domain->group_list);
list_add(&group->next, &domain->group_list);

+ /*
+ * to advertise safe interrupts either the IOMMU or the MSI controllers
+ * must support IRQ remapping (aka. interrupt translation)
+ */
if (!allow_unsafe_interrupts &&
- !iommu_capable(bus, IOMMU_CAP_INTR_REMAP)) {
+ (!iommu_capable(bus, IOMMU_CAP_INTR_REMAP) &&
+ !(vfio_msi_resv(iommu) && iommu_msi_doorbell_safe()))) {
pr_warn("%s: No interrupt remapping support. Use the module param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on this platform\n",
__func__);
ret = -EPERM;
--
1.9.1
Diana Madalina Craciun
2016-11-03 13:45:11 UTC
Permalink
Hi Eric,
Post by Eric Auger
On x86 IRQ remapping is abstracted by the IOMMU. On ARM this is abstracted
by the msi controller.
Since we currently have no way to detect whether the MSI controller is
upstream or downstream to the IOMMU we rely on the MSI doorbell information
registered by the interrupt controllers. In case at least one doorbell
does not implement proper isolation, we state the assignment is unsafe
with regard to interrupts. This is a coarse assessment but should allow to
wait for a better system description.
At this point ARM sMMU still advertises IOMMU_CAP_INTR_REMAP. This is
removed in next patch.
---
- check vfio_msi_resv before checking whether msi doorbell is safe
- coarse safety assessment based on MSI doorbell info
- rename vfio_msi_parent_irq_remapping_capable into vfio_safe_irq_domain
and irq_remapping into safe_irq_domains
- protect vfio_msi_parent_irq_remapping_capable with
CONFIG_GENERIC_MSI_IRQ_DOMAIN
---
drivers/vfio/vfio_iommu_type1.c | 30 +++++++++++++++++++++++++++++-
1 file changed, 29 insertions(+), 1 deletion(-)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index e0c97ef..c18ba9d 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -442,6 +442,29 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
}
/**
+ * vfio_msi_resv - Return whether any VFIO iommu domain requires
+ * MSI mapping
+ *
+ *
+ * Return: true of MSI mapping is needed, false otherwise
+ */
+static bool vfio_msi_resv(struct vfio_iommu *iommu)
+{
+ struct iommu_domain_msi_resv msi_resv;
+ struct vfio_domain *d;
+ int ret;
+
+ list_for_each_entry(d, &iommu->domain_list, next) {
+ ret = iommu_domain_get_attr(d->domain, DOMAIN_ATTR_MSI_RESV,
+ &msi_resv);
+ if (!ret)
+ return true;
+ }
+ return false;
+}
+
+/**
* vfio_set_msi_aperture - Sets the msi aperture on all domains
* requesting MSI mapping
*
@@ -945,8 +968,13 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
INIT_LIST_HEAD(&domain->group_list);
list_add(&group->next, &domain->group_list);
+ /*
+ * to advertise safe interrupts either the IOMMU or the MSI controllers
+ * must support IRQ remapping (aka. interrupt translation)
+ */
if (!allow_unsafe_interrupts &&
- !iommu_capable(bus, IOMMU_CAP_INTR_REMAP)) {
+ (!iommu_capable(bus, IOMMU_CAP_INTR_REMAP) &&
+ !(vfio_msi_resv(iommu) && iommu_msi_doorbell_safe()))) {
pr_warn("%s: No interrupt remapping support. Use the module param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on this platform\n",
__func__);
ret = -EPERM;
I understand from the other discussions that you will respin these
series, but anyway I have tested this version with GICV3 + ITS and it
stops here. As I have a GICv3 I am not supposed to enable allow unsafe
interrupts. What I see is that vfio_msi_resv returns false just because
the iommu->domain_list list is empty. The newly created domain is
actually added to the domain_list at the end of this function, so it
seems normal for the list to be empty at this point.

Thanks,

Diana
Auger Eric
2016-11-03 14:14:56 UTC
Permalink
Hi Diana,
Post by Punit Agrawal
Hi Eric,
Post by Eric Auger
On x86 IRQ remapping is abstracted by the IOMMU. On ARM this is abstracted
by the msi controller.
Since we currently have no way to detect whether the MSI controller is
upstream or downstream to the IOMMU we rely on the MSI doorbell information
registered by the interrupt controllers. In case at least one doorbell
does not implement proper isolation, we state the assignment is unsafe
with regard to interrupts. This is a coarse assessment but should allow to
wait for a better system description.
At this point ARM sMMU still advertises IOMMU_CAP_INTR_REMAP. This is
removed in next patch.
---
- check vfio_msi_resv before checking whether msi doorbell is safe
- coarse safety assessment based on MSI doorbell info
- rename vfio_msi_parent_irq_remapping_capable into vfio_safe_irq_domain
and irq_remapping into safe_irq_domains
- protect vfio_msi_parent_irq_remapping_capable with
CONFIG_GENERIC_MSI_IRQ_DOMAIN
---
drivers/vfio/vfio_iommu_type1.c | 30 +++++++++++++++++++++++++++++-
1 file changed, 29 insertions(+), 1 deletion(-)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index e0c97ef..c18ba9d 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -442,6 +442,29 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
}
/**
+ * vfio_msi_resv - Return whether any VFIO iommu domain requires
+ * MSI mapping
+ *
+ *
+ * Return: true of MSI mapping is needed, false otherwise
+ */
+static bool vfio_msi_resv(struct vfio_iommu *iommu)
+{
+ struct iommu_domain_msi_resv msi_resv;
+ struct vfio_domain *d;
+ int ret;
+
+ list_for_each_entry(d, &iommu->domain_list, next) {
+ ret = iommu_domain_get_attr(d->domain, DOMAIN_ATTR_MSI_RESV,
+ &msi_resv);
+ if (!ret)
+ return true;
+ }
+ return false;
+}
+
+/**
* vfio_set_msi_aperture - Sets the msi aperture on all domains
* requesting MSI mapping
*
@@ -945,8 +968,13 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
INIT_LIST_HEAD(&domain->group_list);
list_add(&group->next, &domain->group_list);
+ /*
+ * to advertise safe interrupts either the IOMMU or the MSI controllers
+ * must support IRQ remapping (aka. interrupt translation)
+ */
if (!allow_unsafe_interrupts &&
- !iommu_capable(bus, IOMMU_CAP_INTR_REMAP)) {
+ (!iommu_capable(bus, IOMMU_CAP_INTR_REMAP) &&
+ !(vfio_msi_resv(iommu) && iommu_msi_doorbell_safe()))) {
pr_warn("%s: No interrupt remapping support. Use the module param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on this platform\n",
__func__);
ret = -EPERM;
I understand from the other discussions that you will respin these
series, but anyway I have tested this version with GICV3 + ITS and it
stops here. As I have a GICv3 I am not supposed to enable allow unsafe
interrupts. What I see is that vfio_msi_resv returns false just because
the iommu->domain_list list is empty. The newly created domain is
actually added to the domain_list at the end of this function, so it
seems normal for the list to be empty at this point.
Thanks for reporting the issue. You are fully right. I must have missed
that test. I should just check the current iommu_domain attribute I think.

waiting for a fix, please probe the vfio_iommu_type1 module with
allow_unsafe_interrupts=1

Thanks

Eric
Post by Punit Agrawal
Thanks,
Diana
Eric Auger
2016-10-12 13:22:23 UTC
Permalink
Do not advertise IOMMU_CAP_INTR_REMAP for arm-smmu(-v3). Indeed the
irq_remapping capability is abstracted on irqchip side for ARM as
opposed to Intel IOMMU featuring IRQ remapping HW.

So to check IRQ remapping capability, the msi domain needs to be
checked instead.

This commit affects platform and PCIe device assignment use cases
on any platform featuring an unsafe MSI controller (currently the
ARM GICv2m). For those platforms the VFIO module must be loaded with
allow_unsafe_interrupts set to 1.

Signed-off-by: Eric Auger <***@redhat.com>

---

v9 -> v10:
- reword the commit message (allow_unsafe_interrupts)
---
drivers/iommu/arm-smmu-v3.c | 3 ++-
drivers/iommu/arm-smmu.c | 3 ++-
2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 572cad8..d71a955 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -1371,7 +1371,8 @@ static bool arm_smmu_capable(enum iommu_cap cap)
case IOMMU_CAP_CACHE_COHERENCY:
return true;
case IOMMU_CAP_INTR_REMAP:
- return true; /* MSIs are just memory writes */
+ /* interrupt translation handled at MSI controller level */
+ return false;
case IOMMU_CAP_NOEXEC:
return true;
default:
diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index ae20b9c..becad89 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -1361,7 +1361,8 @@ static bool arm_smmu_capable(enum iommu_cap cap)
*/
return true;
case IOMMU_CAP_INTR_REMAP:
- return true; /* MSIs are just memory writes */
+ /* interrupt translation handled at MSI controller level */
+ return false;
case IOMMU_CAP_NOEXEC:
return true;
default:
--
1.9.1
Eric Auger
2016-10-12 13:22:24 UTC
Permalink
This patch allows the user-space to retrieve the MSI reserved region
requirements, if any. The implementation is based on capability chains,
now also added to VFIO_IOMMU_GET_INFO.

The returned info comprises the size and the alignment requirements

In case the userspace must provide the IOVA aperture, we currently report
a size/alignment based on all the doorbells registered by the host kernel.
This may exceed the actual needs.

Signed-off-by: Eric Auger <***@redhat.com>

---
v13 -> v14:
- new capability struct
- change the padding in vfio_iommu_type1_info

v11 -> v12:
- msi_doorbell_pages was renamed msi_doorbell_calc_pages

v9 -> v10:
- move cap_offset after iova_pgsizes
- replace __u64 alignment by __u32 order
- introduce __u32 flags in vfio_iommu_type1_info_cap_msi_geometry and
fix alignment
- call msi-doorbell API to compute the size/alignment

v8 -> v9:
- use iommu_msi_supported flag instead of programmable
- replace IOMMU_INFO_REQUIRE_MSI_MAP flag by a more sophisticated
capability chain, reporting the MSI geometry

v7 -> v8:
- use iommu_domain_msi_geometry

v6 -> v7:
- remove the computation of the number of IOVA pages to be provisionned.
This number depends on the domain/group/device topology which can
dynamically change. Let's rely instead rely on an arbitrary max depending
on the system

v4 -> v5:
- move msi_info and ret declaration within the conditional code

v3 -> v4:
- replace former vfio_domains_require_msi_mapping by
more complex computation of MSI mapping requirements, especially the
number of pages to be provided by the user-space.
- reword patch title

RFC v1 -> v1:
- derived from
[RFC PATCH 3/6] vfio: Extend iommu-info to return MSIs automap state
- renamed allow_msi_reconfig into require_msi_mapping
- fixed VFIO_IOMMU_GET_INFO
---
drivers/vfio/vfio_iommu_type1.c | 67 ++++++++++++++++++++++++++++++++++++++++-
include/uapi/linux/vfio.h | 20 +++++++++++-
2 files changed, 85 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index c18ba9d..6775da3 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1147,6 +1147,46 @@ static int vfio_domains_have_iommu_cache(struct vfio_iommu *iommu)
return ret;
}

+static int msi_resv_caps(struct vfio_iommu *iommu, struct vfio_info_cap *caps)
+{
+ struct iommu_domain_msi_resv msi_resv = {.size = 0, .alignment = 0};
+ struct vfio_iommu_type1_info_cap_msi_resv *cap;
+ struct vfio_info_cap_header *header;
+ struct iommu_domain_msi_resv iter;
+ struct vfio_domain *d;
+
+ mutex_lock(&iommu->lock);
+
+ list_for_each_entry(d, &iommu->domain_list, next) {
+ if (iommu_domain_get_attr(d->domain,
+ DOMAIN_ATTR_MSI_RESV, &iter))
+ continue;
+ if (iter.size > msi_resv.size) {
+ msi_resv.size = iter.size;
+ msi_resv.alignment = iter.alignment;
+ }
+ }
+
+ if (!msi_resv.size)
+ return 0;
+
+ mutex_unlock(&iommu->lock);
+
+ header = vfio_info_cap_add(caps, sizeof(*cap),
+ VFIO_IOMMU_TYPE1_INFO_CAP_MSI_RESV, 1);
+
+ if (IS_ERR(header))
+ return PTR_ERR(header);
+
+ cap = container_of(header, struct vfio_iommu_type1_info_cap_msi_resv,
+ header);
+
+ cap->alignment = msi_resv.alignment;
+ cap->size = msi_resv.size;
+
+ return 0;
+}
+
static long vfio_iommu_type1_ioctl(void *iommu_data,
unsigned int cmd, unsigned long arg)
{
@@ -1168,8 +1208,10 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
}
} else if (cmd == VFIO_IOMMU_GET_INFO) {
struct vfio_iommu_type1_info info;
+ struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
+ int ret;

- minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes);
+ minsz = offsetofend(struct vfio_iommu_type1_info, cap_offset);

if (copy_from_user(&info, (void __user *)arg, minsz))
return -EFAULT;
@@ -1181,6 +1223,29 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,

info.iova_pgsizes = vfio_pgsize_bitmap(iommu);

+ ret = msi_resv_caps(iommu, &caps);
+ if (ret)
+ return ret;
+
+ if (caps.size) {
+ info.flags |= VFIO_IOMMU_INFO_CAPS;
+ if (info.argsz < sizeof(info) + caps.size) {
+ info.argsz = sizeof(info) + caps.size;
+ info.cap_offset = 0;
+ } else {
+ vfio_info_cap_shift(&caps, sizeof(info));
+ if (copy_to_user((void __user *)arg +
+ sizeof(info), caps.buf,
+ caps.size)) {
+ kfree(caps.buf);
+ return -EFAULT;
+ }
+ info.cap_offset = sizeof(info);
+ }
+
+ kfree(caps.buf);
+ }
+
return copy_to_user((void __user *)arg, &info, minsz) ?
-EFAULT : 0;

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 4a9dbc2..e34a9a6 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -488,7 +488,23 @@ struct vfio_iommu_type1_info {
__u32 argsz;
__u32 flags;
#define VFIO_IOMMU_INFO_PGSIZES (1 << 0) /* supported page sizes info */
- __u64 iova_pgsizes; /* Bitmap of supported page sizes */
+#define VFIO_IOMMU_INFO_CAPS (1 << 1) /* Info supports caps */
+ __u64 iova_pgsizes; /* Bitmap of supported page sizes */
+ __u32 cap_offset; /* Offset within info struct of first cap */
+ __u32 __resv;
+};
+
+/*
+ * The MSI_RESV capability allows to report the MSI reserved IOVA requirements:
+ * In case this capability is supported, the userspace must provide an IOVA
+ * window characterized by @size and @alignment using VFIO_IOMMU_MAP_DMA with
+ * RESERVED_MSI_IOVA flag.
+ */
+#define VFIO_IOMMU_TYPE1_INFO_CAP_MSI_RESV 1
+struct vfio_iommu_type1_info_cap_msi_resv {
+ struct vfio_info_cap_header header;
+ __u64 size; /* requested IOVA aperture size in bytes */
+ __u64 alignment; /* requested byte alignment of the window */
};

#define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
@@ -503,6 +519,8 @@ struct vfio_iommu_type1_info {
* IOVA region that will be used on some platforms to map the host MSI frames.
* In that specific case, vaddr is ignored. Once registered, an MSI reserved
* IOVA region stays until the container is closed.
+ * The requirement for provisioning such reserved IOVA range can be checked by
+ * checking the VFIO_IOMMU_TYPE1_INFO_CAP_MSI_RESV capability.
*/
struct vfio_iommu_type1_dma_map {
__u32 argsz;
--
1.9.1
Punit Agrawal
2016-10-14 11:24:40 UTC
Permalink
Hi Eric,

I am a bit late in joining, but I've tried to familiarise
myself with earlier discussions on the series.
Post by Eric Auger
This is the second respin on top of Robin's series [1], addressing Alex' comments.
- MSI-doorbell API now is moved to DMA IOMMU API following Alex suggestion
to put all API pieces at the same place (so eventually in the IOMMU
subsystem)
IMHO, this is headed in the opposite direction, i.e., away from the
owner of the information - the doorbells are the property of the MSI
controller. The MSI controllers know the location, size and interrupt
remapping capability as well. On the consumer side, VFIO needs access to
the doorbells to allow userspace to carve out a region in the IOVA.

I quite liked what you had in v13, though I think you can go further
though. Instead of adding new doorbell API [un]registration calls, how
about adding a callback to the irq_domain_ops? The callback will be
populated for irqdomains registered by MSI controllers.

From VFIO, we can calculate the required aperture reservation by
iterating over the irqdomains (something like irq_domain_for_each). The
same callback can also provide information about support for interrupt
remapping.

For systems where there are no separate MSI controllers, i.e., the IOMMU
has a fixed reservation, no MSI callbacks will be populated - which
tells userspace that no separate MSI reservation is required. IIUC, this
was one of Alex' concerns on the prior version.

Thoughts, opinions?

Punit
Post by Eric Auger
- new iommu_domain_msi_resv struct and accessor through DOMAIN_ATTR_MSI_RESV
domain with mirror VFIO capability
- more robustness I think in the VFIO layer
- added "iommu/iova: fix __alloc_and_insert_iova_range" since with the current
code I failed allocating an IOVA page in a single page domain with upper part
reserved
IOVA range exclusion will be handled in a separate series
The priority really is to discuss and freeze the API and especially the MSI
doorbell's handling. Do we agree to put that in DMA IOMMU?
Note: the size computation does not take into account possible page overlaps
between doorbells but it would add quite a lot of complexity i think.
Tested on AMD Overdrive (single GICv2m frame) with I350 VF assignment.
[1] [PATCH v7 00/22] Generic DT bindings for PCI IOMMUs and ARM SMMU
http://www.spinics.net/lists/arm-kernel/msg531110.html
Best Regards
Eric
Git: complete series available at
https://github.com/eauger/linux/tree/generic-v7-pcie-passthru-v14
the above branch includes a temporary patch to work around a ThunderX pci
"vfio: pci: HACK! workaround thunderx pci_try_reset_bus crash"
Do not take this one for other platforms.
iommu/iova: fix __alloc_and_insert_iova_range
iommu: Introduce DOMAIN_ATTR_MSI_RESV
iommu/dma: MSI doorbell alloc/free
iommu/dma: Introduce iommu_calc_msi_resv
iommu/arm-smmu: Implement domain_get_attr for DOMAIN_ATTR_MSI_RESV
irqchip/gic-v2m: Register the MSI doorbell
irqchip/gicv3-its: Register the MSI doorbell
vfio: Introduce a vfio_dma type field
vfio/type1: vfio_find_dma accepting a type argument
vfio/type1: Implement recursive vfio_find_dma_from_node
vfio/type1: Handle unmap/unpin and replay for VFIO_IOVA_RESERVED slots
vfio: Allow reserved msi iova registration
vfio/type1: Check doorbell safety
iommu/arm-smmu: Do not advertise IOMMU_CAP_INTR_REMAP
vfio/type1: Introduce MSI_RESV capability
iommu/dma: Allow MSI-only cookies
drivers/iommu/Kconfig | 4 +-
drivers/iommu/arm-smmu-v3.c | 10 +-
drivers/iommu/arm-smmu.c | 10 +-
drivers/iommu/dma-iommu.c | 184 ++++++++++++++++++++++++++
drivers/iommu/iova.c | 2 +-
drivers/irqchip/irq-gic-v2m.c | 10 +-
drivers/irqchip/irq-gic-v3-its.c | 13 ++
drivers/vfio/vfio_iommu_type1.c | 279 +++++++++++++++++++++++++++++++++++++--
include/linux/dma-iommu.h | 59 +++++++++
include/linux/iommu.h | 8 ++
include/uapi/linux/vfio.h | 30 ++++-
11 files changed, 587 insertions(+), 22 deletions(-)
Auger Eric
2016-10-17 14:19:33 UTC
Permalink
Hi Punit,
Post by Punit Agrawal
Hi Eric,
I am a bit late in joining, but I've tried to familiarise
myself with earlier discussions on the series.
Post by Eric Auger
This is the second respin on top of Robin's series [1], addressing Alex' comments.
- MSI-doorbell API now is moved to DMA IOMMU API following Alex suggestion
to put all API pieces at the same place (so eventually in the IOMMU
subsystem)
IMHO, this is headed in the opposite direction, i.e., away from the
owner of the information - the doorbells are the property of the MSI
controller. The MSI controllers know the location, size and interrupt
remapping capability as well. On the consumer side, VFIO needs access to
the doorbells to allow userspace to carve out a region in the IOVA.
I quite liked what you had in v13, though I think you can go further
though. Instead of adding new doorbell API [un]registration calls, how
about adding a callback to the irq_domain_ops? The callback will be
populated for irqdomains registered by MSI controllers.
Thank you for jumping into that thread. Any help/feedback is greatly
appreciated.

Regarding your suggestion, the irq_domain looks dedicated to the
translation between linux irq and HW irq. I tend to think adding an ops
to retrieve the MSI doorbell info at that level looks far from the
original goal of the infrastructure. Obviously the fact there is a list
of such domain is tempting but I preferred to add a separate struct and
separate list.

In the v14 release I moved the "doorbell API" in the dma-iommu API since
Alex recommended to offer a unified API where all pieces would be at the
same place.

Anyway I will follow the guidance of maintainers.
Post by Punit Agrawal
From VFIO, we can calculate the required aperture reservation by
iterating over the irqdomains (something like irq_domain_for_each). The
same callback can also provide information about support for interrupt
remapping.
For systems where there are no separate MSI controllers, i.e., the IOMMU
has a fixed reservation, no MSI callbacks will be populated - which
tells userspace that no separate MSI reservation is required. IIUC, this
was one of Alex' concerns on the prior version.
I'am working on a separate series to report to the user-space the usable
IOVA range(s).

Thanks

Eric
Post by Punit Agrawal
Thoughts, opinions?
Punit
Post by Eric Auger
- new iommu_domain_msi_resv struct and accessor through DOMAIN_ATTR_MSI_RESV
domain with mirror VFIO capability
- more robustness I think in the VFIO layer
- added "iommu/iova: fix __alloc_and_insert_iova_range" since with the current
code I failed allocating an IOVA page in a single page domain with upper part
reserved
IOVA range exclusion will be handled in a separate series
The priority really is to discuss and freeze the API and especially the MSI
doorbell's handling. Do we agree to put that in DMA IOMMU?
Note: the size computation does not take into account possible page overlaps
between doorbells but it would add quite a lot of complexity i think.
Tested on AMD Overdrive (single GICv2m frame) with I350 VF assignment.
[1] [PATCH v7 00/22] Generic DT bindings for PCI IOMMUs and ARM SMMU
http://www.spinics.net/lists/arm-kernel/msg531110.html
Best Regards
Eric
Git: complete series available at
https://github.com/eauger/linux/tree/generic-v7-pcie-passthru-v14
the above branch includes a temporary patch to work around a ThunderX pci
"vfio: pci: HACK! workaround thunderx pci_try_reset_bus crash"
Do not take this one for other platforms.
iommu/iova: fix __alloc_and_insert_iova_range
iommu: Introduce DOMAIN_ATTR_MSI_RESV
iommu/dma: MSI doorbell alloc/free
iommu/dma: Introduce iommu_calc_msi_resv
iommu/arm-smmu: Implement domain_get_attr for DOMAIN_ATTR_MSI_RESV
irqchip/gic-v2m: Register the MSI doorbell
irqchip/gicv3-its: Register the MSI doorbell
vfio: Introduce a vfio_dma type field
vfio/type1: vfio_find_dma accepting a type argument
vfio/type1: Implement recursive vfio_find_dma_from_node
vfio/type1: Handle unmap/unpin and replay for VFIO_IOVA_RESERVED slots
vfio: Allow reserved msi iova registration
vfio/type1: Check doorbell safety
iommu/arm-smmu: Do not advertise IOMMU_CAP_INTR_REMAP
vfio/type1: Introduce MSI_RESV capability
iommu/dma: Allow MSI-only cookies
drivers/iommu/Kconfig | 4 +-
drivers/iommu/arm-smmu-v3.c | 10 +-
drivers/iommu/arm-smmu.c | 10 +-
drivers/iommu/dma-iommu.c | 184 ++++++++++++++++++++++++++
drivers/iommu/iova.c | 2 +-
drivers/irqchip/irq-gic-v2m.c | 10 +-
drivers/irqchip/irq-gic-v3-its.c | 13 ++
drivers/vfio/vfio_iommu_type1.c | 279 +++++++++++++++++++++++++++++++++++++--
include/linux/dma-iommu.h | 59 +++++++++
include/linux/iommu.h | 8 ++
include/uapi/linux/vfio.h | 30 ++++-
11 files changed, 587 insertions(+), 22 deletions(-)
_______________________________________________
linux-arm-kernel mailing list
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
Punit Agrawal
2016-10-17 15:08:57 UTC
Permalink
Post by Auger Eric
Hi Punit,
Post by Punit Agrawal
Hi Eric,
I am a bit late in joining, but I've tried to familiarise
myself with earlier discussions on the series.
Post by Eric Auger
This is the second respin on top of Robin's series [1], addressing Alex' comments.
- MSI-doorbell API now is moved to DMA IOMMU API following Alex suggestion
to put all API pieces at the same place (so eventually in the IOMMU
subsystem)
IMHO, this is headed in the opposite direction, i.e., away from the
owner of the information - the doorbells are the property of the MSI
controller. The MSI controllers know the location, size and interrupt
remapping capability as well. On the consumer side, VFIO needs access to
the doorbells to allow userspace to carve out a region in the IOVA.
I quite liked what you had in v13, though I think you can go further
though. Instead of adding new doorbell API [un]registration calls, how
about adding a callback to the irq_domain_ops? The callback will be
populated for irqdomains registered by MSI controllers.
Thank you for jumping into that thread. Any help/feedback is greatly
appreciated.
Regarding your suggestion, the irq_domain looks dedicated to the
translation between linux irq and HW irq. I tend to think adding an ops
to retrieve the MSI doorbell info at that level looks far from the
original goal of the infrastructure. Obviously the fact there is a list
of such domain is tempting but I preferred to add a separate struct and
separate list.
Guilty as charged - the suggestion was definitely borne out of
convenience. I originally looked at adding this functionality to the MSI
domains but couldn't find an obvious way to enumerate all of them.

Also, the way the domain hierarchy is setup for MSI controllers, the
actual doorbell is not available to the MSI domains.
Post by Auger Eric
In the v14 release I moved the "doorbell API" in the dma-iommu API since
Alex recommended to offer a unified API where all pieces would be at the
same place.
Anyway I will follow the guidance of maintainers.
Ok, makes sense.
Post by Auger Eric
Post by Punit Agrawal
From VFIO, we can calculate the required aperture reservation by
iterating over the irqdomains (something like irq_domain_for_each). The
same callback can also provide information about support for interrupt
remapping.
For systems where there are no separate MSI controllers, i.e., the IOMMU
has a fixed reservation, no MSI callbacks will be populated - which
tells userspace that no separate MSI reservation is required. IIUC, this
was one of Alex' concerns on the prior version.
I'am working on a separate series to report to the user-space the usable
IOVA range(s).
Is that the alternative to reporting the aperture size that needs to be
reserved?
Post by Auger Eric
Thanks
Eric
Post by Punit Agrawal
Thoughts, opinions?
Punit
Post by Eric Auger
- new iommu_domain_msi_resv struct and accessor through DOMAIN_ATTR_MSI_RESV
domain with mirror VFIO capability
- more robustness I think in the VFIO layer
- added "iommu/iova: fix __alloc_and_insert_iova_range" since with the current
code I failed allocating an IOVA page in a single page domain with upper part
reserved
IOVA range exclusion will be handled in a separate series
The priority really is to discuss and freeze the API and especially the MSI
doorbell's handling. Do we agree to put that in DMA IOMMU?
Note: the size computation does not take into account possible page overlaps
between doorbells but it would add quite a lot of complexity i think.
Tested on AMD Overdrive (single GICv2m frame) with I350 VF assignment.
[1] [PATCH v7 00/22] Generic DT bindings for PCI IOMMUs and ARM SMMU
http://www.spinics.net/lists/arm-kernel/msg531110.html
Best Regards
Eric
Git: complete series available at
https://github.com/eauger/linux/tree/generic-v7-pcie-passthru-v14
the above branch includes a temporary patch to work around a ThunderX pci
"vfio: pci: HACK! workaround thunderx pci_try_reset_bus crash"
Do not take this one for other platforms.
iommu/iova: fix __alloc_and_insert_iova_range
iommu: Introduce DOMAIN_ATTR_MSI_RESV
iommu/dma: MSI doorbell alloc/free
iommu/dma: Introduce iommu_calc_msi_resv
iommu/arm-smmu: Implement domain_get_attr for DOMAIN_ATTR_MSI_RESV
irqchip/gic-v2m: Register the MSI doorbell
irqchip/gicv3-its: Register the MSI doorbell
vfio: Introduce a vfio_dma type field
vfio/type1: vfio_find_dma accepting a type argument
vfio/type1: Implement recursive vfio_find_dma_from_node
vfio/type1: Handle unmap/unpin and replay for VFIO_IOVA_RESERVED slots
vfio: Allow reserved msi iova registration
vfio/type1: Check doorbell safety
iommu/arm-smmu: Do not advertise IOMMU_CAP_INTR_REMAP
vfio/type1: Introduce MSI_RESV capability
iommu/dma: Allow MSI-only cookies
drivers/iommu/Kconfig | 4 +-
drivers/iommu/arm-smmu-v3.c | 10 +-
drivers/iommu/arm-smmu.c | 10 +-
drivers/iommu/dma-iommu.c | 184 ++++++++++++++++++++++++++
drivers/iommu/iova.c | 2 +-
drivers/irqchip/irq-gic-v2m.c | 10 +-
drivers/irqchip/irq-gic-v3-its.c | 13 ++
drivers/vfio/vfio_iommu_type1.c | 279 +++++++++++++++++++++++++++++++++++++--
include/linux/dma-iommu.h | 59 +++++++++
include/linux/iommu.h | 8 ++
include/uapi/linux/vfio.h | 30 ++++-
11 files changed, 587 insertions(+), 22 deletions(-)
_______________________________________________
linux-arm-kernel mailing list
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
_______________________________________________
iommu mailing list
https://lists.linuxfoundation.org/mailman/listinfo/iommu
Will Deacon
2016-10-20 17:32:24 UTC
Permalink
Hi Eric,

Thanks for posting this.
Post by Eric Auger
This is the second respin on top of Robin's series [1], addressing Alex' comments.
- MSI-doorbell API now is moved to DMA IOMMU API following Alex suggestion
to put all API pieces at the same place (so eventually in the IOMMU
subsystem)
- new iommu_domain_msi_resv struct and accessor through DOMAIN_ATTR_MSI_RESV
domain with mirror VFIO capability
- more robustness I think in the VFIO layer
- added "iommu/iova: fix __alloc_and_insert_iova_range" since with the current
code I failed allocating an IOVA page in a single page domain with upper part
reserved
IOVA range exclusion will be handled in a separate series
The priority really is to discuss and freeze the API and especially the MSI
doorbell's handling. Do we agree to put that in DMA IOMMU?
Note: the size computation does not take into account possible page overlaps
between doorbells but it would add quite a lot of complexity i think.
Tested on AMD Overdrive (single GICv2m frame) with I350 VF assignment.
Marc, Robin and I sat down and had a look at the series and, whilst it's
certainly addressing a problem that we desperately want to see fixed, we
think that it's slightly over-engineering in places and could probably
be simplified in the interest of getting something upstream that can be
used as a base, on which the ABI can be extended as concrete use-cases
become clear.

Stepping back a minute, we're trying to reserve some of the VFIO virtual
address space so that it can be used by devices to map their MSI doorbells
using the SMMU. With your patches, this requires that (a) the kernel
tells userspace about the size and alignment of the doorbell region
(MSI_RESV) and (b) userspace tells the kernel the VA-range that can be
used (RESERVED_MSI_IOVA).

However, this is all special-cased for MSI doorbells and there are
potentially other regions of the VFIO address space that are reserved
and need to be communicated to userspace as well. We already know of
hardware where the PCI RC intercepts p2p accesses before they make it
to the SMMU, and other hardware where the MSI doorbell is at a fixed
address. This means that we need a mechanism to communicate *fixed*
regions of virtual address space that are reserved by VFIO. I don't
even particularly care if VFIO_MAP_DMA enforces that, but we do need
a way to tell userspace "hey, you don't want to put memory here because
it won't work well with devices".

In that case, we end up with something like your MSI_RESV capability,
but actually specifying a virtual address range that is simply not to
be used by MAP_DMA -- we don't say anything about MSIs. Now, taking this
to its logical conclusion, we no longer need to distinguish between
remappable reserved regions and fixed reserved regions in the ABI.
Instead, we can have the kernel allocate the virtual address space for
the remappable reserved regions (probably somewhere in the bottom 4GB)
and expose them via the capability. This simplifies things in the
following ways:

* You don't need to keep track of MSI vs DMA addresses in the VFIO rbtree
* You don't need to try collapsing doorbells into a single region
* You don't need a special MAP flavour to map MSI doorbells
* The ABI is reusable for PCI p2p and fixed doorbells

I really think it would make your patch series both generally useful and
an awful lot smaller, whilst leaving the door open to ABI extension on
a case-by-case basis when we determine that it's really needed.

Thoughts?

Will
Auger Eric
2016-10-21 09:26:48 UTC
Permalink
Hi Will,
Post by Punit Agrawal
Hi Eric,
Thanks for posting this.
Post by Eric Auger
This is the second respin on top of Robin's series [1], addressing Alex' comments.
- MSI-doorbell API now is moved to DMA IOMMU API following Alex suggestion
to put all API pieces at the same place (so eventually in the IOMMU
subsystem)
- new iommu_domain_msi_resv struct and accessor through DOMAIN_ATTR_MSI_RESV
domain with mirror VFIO capability
- more robustness I think in the VFIO layer
- added "iommu/iova: fix __alloc_and_insert_iova_range" since with the current
code I failed allocating an IOVA page in a single page domain with upper part
reserved
IOVA range exclusion will be handled in a separate series
The priority really is to discuss and freeze the API and especially the MSI
doorbell's handling. Do we agree to put that in DMA IOMMU?
Note: the size computation does not take into account possible page overlaps
between doorbells but it would add quite a lot of complexity i think.
Tested on AMD Overdrive (single GICv2m frame) with I350 VF assignment.
Marc, Robin and I sat down and had a look at the series and, whilst it's
certainly addressing a problem that we desperately want to see fixed, we
think that it's slightly over-engineering in places and could probably
be simplified in the interest of getting something upstream that can be
used as a base, on which the ABI can be extended as concrete use-cases
become clear.
Stepping back a minute, we're trying to reserve some of the VFIO virtual
address space so that it can be used by devices to map their MSI doorbells
using the SMMU. With your patches, this requires that (a) the kernel
tells userspace about the size and alignment of the doorbell region
(MSI_RESV) and (b) userspace tells the kernel the VA-range that can be
used (RESERVED_MSI_IOVA).
However, this is all special-cased for MSI doorbells and there are
potentially other regions of the VFIO address space that are reserved
and need to be communicated to userspace as well. We already know of
hardware where the PCI RC intercepts p2p accesses before they make it
to the SMMU, and other hardware where the MSI doorbell is at a fixed
address. This means that we need a mechanism to communicate *fixed*
regions of virtual address space that are reserved by VFIO. I don't
even particularly care if VFIO_MAP_DMA enforces that, but we do need
a way to tell userspace "hey, you don't want to put memory here because
it won't work well with devices".
I think we all agree on this. Exposing an API to the user space
reporting *fixed* reserved IOVA ranges is a requirement anyway. The
problem was quite clearly stated by Alex in
http://lkml.iu.edu/hypermail/linux/kernel/1610.0/03308.html
(VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE)

I started working on this VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE
capability but to me and I think according to Alex, it was a different
API from MSI_RESV.
Post by Punit Agrawal
In that case, we end up with something like your MSI_RESV capability,
but actually specifying a virtual address range that is simply not to
be used by MAP_DMA -- we don't say anything about MSIs. Now, taking this
to its logical conclusion, we no longer need to distinguish between
remappable reserved regions and fixed reserved regions in the ABI.
Instead, we can have the kernel allocate the virtual address space for
the remappable reserved regions (probably somewhere in the bottom 4GB)
and expose them via the capability.
If I understand correctly you want the host to arbitrarily choose where
it puts the iovas reserved for MSI and not ask the userspace.

Well so we are back to the discussions we had in Dec 2015 (see Marc's
answer in http://thread.gmane.org/gmane.comp.emulators.kvm.arm.devel/3858).

- So I guess you will init an iova_domain seomewhere below the 4GB to
allocate the MSIs. what size are you going to choose. Don't you have the
same need to dimension the iova range.
- we still need to assess the MSI assignment safety. How will we compute
safety for VFIO?

This simplifies things in the
Post by Punit Agrawal
* You don't need to keep track of MSI vs DMA addresses in the VFIO rbtree
right: I guess you rely on iommu_map to return an error in case the iova
is already mapped somewhere else.
Post by Punit Agrawal
* You don't need to try collapsing doorbells into a single region
why? at host level I guess you will init a single iova domain?
Post by Punit Agrawal
* You don't need a special MAP flavour to map MSI doorbells
right
Post by Punit Agrawal
* The ABI is reusable for PCI p2p and fixed doorbells
right

Aren't we moving the issue at user-space? Currently QEMU mach-virt
address space is fully static. Adapting mach-virt to adjust to host
constraints is not straightforward. It is simple to reject the
assignment in case of collision but more difficult to react positively.
Post by Punit Agrawal
I really think it would make your patch series both generally useful and
an awful lot smaller, whilst leaving the door open to ABI extension on
a case-by-case basis when we determine that it's really needed.
I would like to have a better understanding of how you assess the
security and dimension the iova domain. This is the purpose of msi
doorbell registration, which is not neat at all I acknowledge but well I
did not find any other solution and did not get any other suggestion.
Besides I think the per-cpu thing is over-engineered and this can
definitively be simplified.

VFIO part was reviewed by Alex and I don't have the impression that this
is the blocking part. besides there is on iova.c fix,
IOMMU_CAP_INTR_REMAP removal; so is it really over-complicated?

Thanks

Eric
Post by Punit Agrawal
Thoughts?
Will
_______________________________________________
linux-arm-kernel mailing list
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
Robin Murphy
2016-10-24 19:39:40 UTC
Permalink
Post by Auger Eric
Hi Will,
Post by Punit Agrawal
Hi Eric,
Thanks for posting this.
Post by Eric Auger
This is the second respin on top of Robin's series [1], addressing Alex' comments.
- MSI-doorbell API now is moved to DMA IOMMU API following Alex suggestion
to put all API pieces at the same place (so eventually in the IOMMU
subsystem)
- new iommu_domain_msi_resv struct and accessor through DOMAIN_ATTR_MSI_RESV
domain with mirror VFIO capability
- more robustness I think in the VFIO layer
- added "iommu/iova: fix __alloc_and_insert_iova_range" since with the current
code I failed allocating an IOVA page in a single page domain with upper part
reserved
IOVA range exclusion will be handled in a separate series
The priority really is to discuss and freeze the API and especially the MSI
doorbell's handling. Do we agree to put that in DMA IOMMU?
Note: the size computation does not take into account possible page overlaps
between doorbells but it would add quite a lot of complexity i think.
Tested on AMD Overdrive (single GICv2m frame) with I350 VF assignment.
Marc, Robin and I sat down and had a look at the series and, whilst it's
certainly addressing a problem that we desperately want to see fixed, we
think that it's slightly over-engineering in places and could probably
be simplified in the interest of getting something upstream that can be
used as a base, on which the ABI can be extended as concrete use-cases
become clear.
Stepping back a minute, we're trying to reserve some of the VFIO virtual
address space so that it can be used by devices to map their MSI doorbells
using the SMMU. With your patches, this requires that (a) the kernel
tells userspace about the size and alignment of the doorbell region
(MSI_RESV) and (b) userspace tells the kernel the VA-range that can be
used (RESERVED_MSI_IOVA).
However, this is all special-cased for MSI doorbells and there are
potentially other regions of the VFIO address space that are reserved
and need to be communicated to userspace as well. We already know of
hardware where the PCI RC intercepts p2p accesses before they make it
to the SMMU, and other hardware where the MSI doorbell is at a fixed
address. This means that we need a mechanism to communicate *fixed*
regions of virtual address space that are reserved by VFIO. I don't
even particularly care if VFIO_MAP_DMA enforces that, but we do need
a way to tell userspace "hey, you don't want to put memory here because
it won't work well with devices".
I think we all agree on this. Exposing an API to the user space
reporting *fixed* reserved IOVA ranges is a requirement anyway. The
problem was quite clearly stated by Alex in
http://lkml.iu.edu/hypermail/linux/kernel/1610.0/03308.html
(VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE)
I started working on this VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE
capability but to me and I think according to Alex, it was a different
API from MSI_RESV.
Post by Punit Agrawal
In that case, we end up with something like your MSI_RESV capability,
but actually specifying a virtual address range that is simply not to
be used by MAP_DMA -- we don't say anything about MSIs. Now, taking this
to its logical conclusion, we no longer need to distinguish between
remappable reserved regions and fixed reserved regions in the ABI.
Instead, we can have the kernel allocate the virtual address space for
the remappable reserved regions (probably somewhere in the bottom 4GB)
and expose them via the capability.
If I understand correctly you want the host to arbitrarily choose where
it puts the iovas reserved for MSI and not ask the userspace.
Well so we are back to the discussions we had in Dec 2015 (see Marc's
answer in http://thread.gmane.org/gmane.comp.emulators.kvm.arm.devel/3858).
To an extent, yes, however the difference is that we now know we
definitely have to deal with situations in which userspace *cannot* be
in total control of the memory map, and that changes the game:

_________
/ \
/ Fixed \
/ things (A) \
( _________ )
\ / MSI \ /
X doorbells X
/ \___(B)___/ \
( )
\ Remappable /
\ things (C)/
\_________/

In the absence of A, then B == C so it was very hard to not want to
implement C. As soon as A *has* to be implemented for other reasons,
then that is also sufficient to encompass B. C can still be implemented
later as a nice-to-have, but is not necessary to get B off the ground.
Post by Auger Eric
- So I guess you will init an iova_domain seomewhere below the 4GB to
allocate the MSIs. what size are you going to choose. Don't you have the
same need to dimension the iova range.
- we still need to assess the MSI assignment safety. How will we compute
safety for VFIO?
Absolutely. We're talking in general terms of the userspace ABI here,
although that can't help but colour the underlying implementation
decisions. Of course the VFIO internals still have to handle the
specific case of MSIs, but that's basically no more than this:

static bool msi_isolation = true; /* until proven otherwise */
static unsigned long msi_remap_virt_base = 0x08000000; /* fits QEMU */
static size_t msi_remap_size;

vfio_msi_thing_callback(thing) {
msi_remap_size += thing->info.size;
msi_isolation &= thing->info.flags & PROVIDES_ISOLATION;
}

vfio_msi_init(...) {
...
#ifdef CONFIG_X86
msi_remap_virt_base = 0xfee00000;
msi_remap_size = 0x100000;
msi_isolation = irq_remapping_enabled;
#else
irq_for_each_msi_thing(vfio_msi_thing_callback);
#endif
...
}

vfio_attach_group(...) {
...
if (!msi_isolation && !allow_unsafe_interrupts)
return -ENOWAY;
...
get_msi_region_cookie(domain, msi_remap_base, msi_remap_size);
...
}

And when a well-behaved userspace queries the reserved regions, that
base address and size is just one of potentially several that it should
get back. It's that "querying the reserved regions" bit that needs to be
gotten right first time.

Note that at this point I'm no longer even overly bothered about the
details of irq_for_each_msi_thing(), as it's an internal kernel
interface and thus malleable, although obviously the simpler the better.
I have to say Punit's idea of iterating irq_domains does actually look
really neat and tidy as a proof-of-concept, and also makes me think off
on a tangent that it would be sweet to be able to retrieve base+size
from dev->msi_domain to pre-allocate MSI pages in default domains, and
obviate the compose 'failure' case.
Post by Auger Eric
This simplifies things in the
Post by Punit Agrawal
* You don't need to keep track of MSI vs DMA addresses in the VFIO rbtree
right: I guess you rely on iommu_map to return an error in case the iova
is already mapped somewhere else.
Post by Punit Agrawal
* You don't need to try collapsing doorbells into a single region
why? at host level I guess you will init a single iova domain?
Yeah, right now this one goes either way - as things stand, it does make
life easier on the host side to make a single region to hang off the
back of the current iova_cookie magic, and as illustrated above it's
possibly the most trivial part of the whole thing, but the point is we
still don't *need* to. Since a userspace ABI for generic reservations
has to be able handle more than one region for the sake of non-MSI
things, we'd be free to change the kernel-side implementation in future
to just report multiple doorbells as individual regions - for starters,
if and when we add dynamic reservations and userspace gets to pick its
own IOVAs for those, it'll be a damn sight easier *not* to coalesce
everything.
Post by Auger Eric
Post by Punit Agrawal
* You don't need a special MAP flavour to map MSI doorbells
right
Post by Punit Agrawal
* The ABI is reusable for PCI p2p and fixed doorbells
right
Aren't we moving the issue at user-space? Currently QEMU mach-virt
address space is fully static. Adapting mach-virt to adjust to host
constraints is not straightforward. It is simple to reject the
assignment in case of collision but more difficult to react positively.
The point is that we *have* to move at least some of the issue to
userspace, and by then I'm struggling to see any real difference between
these situations:

a) QEMU asks VFIO to map some pages for DMA, gets an error back because
VFIO detects it conflicts with a reserved region, and gives up.
b) QEMU starts by asking VFIO what regions are reserved, realises they
will overlap with its hard-coded RAM address, and gives up.

where (a) requires a bunch of kernel machinery to second-guess
userspace, while (b) simply relies on userspace not being broken. And if
userspace fails at not being broken, then we simply retain the behaviour
which actually happens right now:

c) QEMU maps some pages for DMA at the same address as PCI config space
on the underlying hardware. Hilarity ensues.

Of course, userspace could be anything other than QEMU as well, so it's
not necessarily second-guessable at all; maybe we make the arbitrary
msi_remap_virt_base a VFIO module parameter to be more accommodating.
Who knows, maybe it turns out that's enough to keep users happy and we
never need to implement fully dynamic reservations.

Robin.
Post by Auger Eric
Post by Punit Agrawal
I really think it would make your patch series both generally useful and
an awful lot smaller, whilst leaving the door open to ABI extension on
a case-by-case basis when we determine that it's really needed.
I would like to have a better understanding of how you assess the
security and dimension the iova domain. This is the purpose of msi
doorbell registration, which is not neat at all I acknowledge but well I
did not find any other solution and did not get any other suggestion.
Besides I think the per-cpu thing is over-engineered and this can
definitively be simplified.
VFIO part was reviewed by Alex and I don't have the impression that this
is the blocking part. besides there is on iova.c fix,
IOMMU_CAP_INTR_REMAP removal; so is it really over-complicated?
Thanks
Eric
Post by Punit Agrawal
Thoughts?
Will
_______________________________________________
linux-arm-kernel mailing list
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
Auger Eric
2016-11-02 16:15:23 UTC
Permalink
Hi Robin,
Post by Robin Murphy
Post by Auger Eric
Hi Will,
Post by Punit Agrawal
Hi Eric,
Thanks for posting this.
Post by Eric Auger
This is the second respin on top of Robin's series [1], addressing Alex' comments.
- MSI-doorbell API now is moved to DMA IOMMU API following Alex suggestion
to put all API pieces at the same place (so eventually in the IOMMU
subsystem)
- new iommu_domain_msi_resv struct and accessor through DOMAIN_ATTR_MSI_RESV
domain with mirror VFIO capability
- more robustness I think in the VFIO layer
- added "iommu/iova: fix __alloc_and_insert_iova_range" since with the current
code I failed allocating an IOVA page in a single page domain with upper part
reserved
IOVA range exclusion will be handled in a separate series
The priority really is to discuss and freeze the API and especially the MSI
doorbell's handling. Do we agree to put that in DMA IOMMU?
Note: the size computation does not take into account possible page overlaps
between doorbells but it would add quite a lot of complexity i think.
Tested on AMD Overdrive (single GICv2m frame) with I350 VF assignment.
Marc, Robin and I sat down and had a look at the series and, whilst it's
certainly addressing a problem that we desperately want to see fixed, we
think that it's slightly over-engineering in places and could probably
be simplified in the interest of getting something upstream that can be
used as a base, on which the ABI can be extended as concrete use-cases
become clear.
Stepping back a minute, we're trying to reserve some of the VFIO virtual
address space so that it can be used by devices to map their MSI doorbells
using the SMMU. With your patches, this requires that (a) the kernel
tells userspace about the size and alignment of the doorbell region
(MSI_RESV) and (b) userspace tells the kernel the VA-range that can be
used (RESERVED_MSI_IOVA).
However, this is all special-cased for MSI doorbells and there are
potentially other regions of the VFIO address space that are reserved
and need to be communicated to userspace as well. We already know of
hardware where the PCI RC intercepts p2p accesses before they make it
to the SMMU, and other hardware where the MSI doorbell is at a fixed
address. This means that we need a mechanism to communicate *fixed*
regions of virtual address space that are reserved by VFIO. I don't
even particularly care if VFIO_MAP_DMA enforces that, but we do need
a way to tell userspace "hey, you don't want to put memory here because
it won't work well with devices".
I think we all agree on this. Exposing an API to the user space
reporting *fixed* reserved IOVA ranges is a requirement anyway. The
problem was quite clearly stated by Alex in
http://lkml.iu.edu/hypermail/linux/kernel/1610.0/03308.html
(VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE)
I started working on this VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE
capability but to me and I think according to Alex, it was a different
API from MSI_RESV.
Post by Punit Agrawal
In that case, we end up with something like your MSI_RESV capability,
but actually specifying a virtual address range that is simply not to
be used by MAP_DMA -- we don't say anything about MSIs. Now, taking this
to its logical conclusion, we no longer need to distinguish between
remappable reserved regions and fixed reserved regions in the ABI.
Instead, we can have the kernel allocate the virtual address space for
the remappable reserved regions (probably somewhere in the bottom 4GB)
and expose them via the capability.
If I understand correctly you want the host to arbitrarily choose where
it puts the iovas reserved for MSI and not ask the userspace.
Well so we are back to the discussions we had in Dec 2015 (see Marc's
answer in http://thread.gmane.org/gmane.comp.emulators.kvm.arm.devel/3858).
To an extent, yes, however the difference is that we now know we
definitely have to deal with situations in which userspace *cannot* be
_________
/ \
/ Fixed \
/ things (A) \
( _________ )
\ / MSI \ /
X doorbells X
/ \___(B)___/ \
( )
\ Remappable /
\ things (C)/
\_________/
In the absence of A, then B == C so it was very hard to not want to
implement C. As soon as A *has* to be implemented for other reasons,
then that is also sufficient to encompass B. C can still be implemented
later as a nice-to-have, but is not necessary to get B off the ground.
Post by Auger Eric
- So I guess you will init an iova_domain seomewhere below the 4GB to
allocate the MSIs. what size are you going to choose. Don't you have the
same need to dimension the iova range.
- we still need to assess the MSI assignment safety. How will we compute
safety for VFIO?
Absolutely. We're talking in general terms of the userspace ABI here,
although that can't help but colour the underlying implementation
decisions.
Sorry for the delay I was out of the office last week.

The userspace ABI to retrieve reserved regions is the *easy* part. It is
based on VFIO capability chain and I have an RFC ready.

Of course the VFIO internals still have to handle the
Post by Robin Murphy
static bool msi_isolation = true; /* until proven otherwise */
static unsigned long msi_remap_virt_base = 0x08000000; /* fits QEMU */
static size_t msi_remap_size;
vfio_msi_thing_callback(thing) {
msi_remap_size += thing->info.size;
msi_isolation &= thing->info.flags & PROVIDES_ISOLATION;
}
vfio_msi_init(...) {
...
#ifdef CONFIG_X86
msi_remap_virt_base = 0xfee00000;
msi_remap_size = 0x100000;
msi_isolation = irq_remapping_enabled;
#else
irq_for_each_msi_thing(vfio_msi_thing_callback);
#endif
...
}
vfio_attach_group(...) {
...
if (!msi_isolation && !allow_unsafe_interrupts)
return -ENOWAY;
...
get_msi_region_cookie(domain, msi_remap_base, msi_remap_size);
...
}
I doubt Alex will accept to put that code in VFIO. He suggested in the
past to use the IOMMU API to retrieve the reserved region(s).

what about adding a reserved_regions list in iommu_domain and add a new
iommu_ops, something like
void add_reserved_regions(struct iommu_domain *, struct device *dev)
whose role would be to populate the list. This add_reserved_regions()
would be called on __iommu_attach_device. The list would be emptied on
iommu_domain_free().

arm-smmu cb implementation would be in charge of
- computing non ACS PCI host bridge windows from @dev,
- computing msi_rebase_map/size computation

on x86, cb would simply populate the MSI window.

vfio would lookup the iommu domain reserved_regions list on
VFIO_IOMMU_GET_INFO

Drawback of this approach is the security aspect is not handled by the
IOMMU API.

Note that combining v14 series and this one would implement everything I
think + giving the flexibility for the userspace to choose where it put
things. But well, LPC discussions will bring the last word obviously.
Post by Robin Murphy
And when a well-behaved userspace queries the reserved regions, that
base address and size is just one of potentially several that it should
get back. It's that "querying the reserved regions" bit that needs to be
gotten right first time.
Note that at this point I'm no longer even overly bothered about the
details of irq_for_each_msi_thing(), as it's an internal kernel
interface and thus malleable, although obviously the simpler the better.
I have to say Punit's idea of iterating irq_domains does actually look
really neat and tidy as a proof-of-concept, and also makes me think off
on a tangent that it would be sweet to be able to retrieve base+size
from dev->msi_domain to pre-allocate MSI pages in default domains, and
obviate the compose 'failure' case.
As Punit mentionned, the natural place where the msi doorbell base,
size and irq_remapping can be retrieved looks to be the irqchip itself.
It works perfectly fine for v2m and its. Hence my first attempt to use a
cb at this level (irqchip msi_doorbell_info up to v11).

Adding a cb at irq_domain level looks quite impractical to me to
retrieve the info. Actually I don't see how to manage that without
adding new fields in irq_domain struct. If you have any suggestion,
please let me know.

Thanks

Eric
Post by Robin Murphy
Post by Auger Eric
This simplifies things in the
Post by Punit Agrawal
* You don't need to keep track of MSI vs DMA addresses in the VFIO rbtree
right: I guess you rely on iommu_map to return an error in case the iova
is already mapped somewhere else.
Post by Punit Agrawal
* You don't need to try collapsing doorbells into a single region
why? at host level I guess you will init a single iova domain?
Yeah, right now this one goes either way - as things stand, it does make
life easier on the host side to make a single region to hang off the
back of the current iova_cookie magic, and as illustrated above it's
possibly the most trivial part of the whole thing, but the point is we
still don't *need* to. Since a userspace ABI for generic reservations
has to be able handle more than one region for the sake of non-MSI
things, we'd be free to change the kernel-side implementation in future
to just report multiple doorbells as individual regions - for starters,
if and when we add dynamic reservations and userspace gets to pick its
own IOVAs for those, it'll be a damn sight easier *not* to coalesce
everything.
Post by Auger Eric
Post by Punit Agrawal
* You don't need a special MAP flavour to map MSI doorbells
right
Post by Punit Agrawal
* The ABI is reusable for PCI p2p and fixed doorbells
right
Aren't we moving the issue at user-space? Currently QEMU mach-virt
address space is fully static. Adapting mach-virt to adjust to host
constraints is not straightforward. It is simple to reject the
assignment in case of collision but more difficult to react positively.
The point is that we *have* to move at least some of the issue to
userspace, and by then I'm struggling to see any real difference between
a) QEMU asks VFIO to map some pages for DMA, gets an error back because
VFIO detects it conflicts with a reserved region, and gives up.
b) QEMU starts by asking VFIO what regions are reserved, realises they
will overlap with its hard-coded RAM address, and gives up.
where (a) requires a bunch of kernel machinery to second-guess
userspace, while (b) simply relies on userspace not being broken. And if
userspace fails at not being broken, then we simply retain the behaviour
c) QEMU maps some pages for DMA at the same address as PCI config space
on the underlying hardware. Hilarity ensues.
Of course, userspace could be anything other than QEMU as well, so it's
not necessarily second-guessable at all; maybe we make the arbitrary
msi_remap_virt_base a VFIO module parameter to be more accommodating.
Who knows, maybe it turns out that's enough to keep users happy and we
never need to implement fully dynamic reservations.
Robin.
Post by Auger Eric
Post by Punit Agrawal
I really think it would make your patch series both generally useful and
an awful lot smaller, whilst leaving the door open to ABI extension on
a case-by-case basis when we determine that it's really needed.
I would like to have a better understanding of how you assess the
security and dimension the iova domain. This is the purpose of msi
doorbell registration, which is not neat at all I acknowledge but well I
did not find any other solution and did not get any other suggestion.
Besides I think the per-cpu thing is over-engineered and this can
definitively be simplified.
VFIO part was reviewed by Alex and I don't have the impression that this
is the blocking part. besides there is on iova.c fix,
IOMMU_CAP_INTR_REMAP removal; so is it really over-complicated?
Thanks
Eric
Post by Punit Agrawal
Thoughts?
Will
_______________________________________________
linux-arm-kernel mailing list
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
_______________________________________________
linux-arm-kernel mailing list
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
Loading...